Revealing consensus and dissensus between network partitions

Community detection methods attempt to divide a network into groups of nodes that share similar properties, thus revealing its large-scale structure. A major challenge when employing such methods is that they are often degenerate, typically yielding a complex landscape of competing answers. As an attempt to extract understanding from a population of alternative solutions, many methods exist to establish a consensus among them in the form of a single partition, a "point estimate" that summarizes the whole distribution. Here we show that it is in general not possible to obtain a consistent answer from such point estimates when the underlying distribution is too heterogeneous. As an alternative, we provide a comprehensive set of methods designed to characterize and summarize complex populations of partitions in a manner that captures not only the existing consensus, but also the dissensus between elements of the population. Our approach is able to model mixed populations of partitions where multiple consensuses can coexist, representing different competing hypotheses for the network structure. We also show how our methods can be used to compare pairs of partitions, how they can be generalized to hierarchical divisions, and be used to perform statistical model selection between competing hypotheses.


I. INTRODUCTION
One of the most important tools in network analysis is the algorithmic division of an unannotated network into groups of similar nodes - a task broadly known as network clustering or community detection [1]. Such divisions allow researchers to provide a summary of the large-scale structure of a network, and in this way obtain fundamental insight about its function and underlying mechanism of formation. Within this broad umbrella, many community detection methods have been developed, based on different mathematical definitions of the overall task [2]. What most methods share in common is that they are based on some objective function defined over all possible partitions of the network, which if optimized yields the most adequate partition for that particular network. Another universal property of community detection methods is that when they are applied to empirical networks, they exhibit at least some degree of degeneracy, in that even if there exists a single partition with the largest score among all others, there is usually an abundance of other solutions that possess a very similar score, making a strict optimization among them somewhat arbitrary [3]. This issue is compounded with the fact that instances of the community detection problem are generically computationally intractable, such that no known algorithm that guarantees the correct solution can perform substantially better than an exhaustive search over all answers [4,5], which is not feasible for networks with more than very few nodes. As a consequence, most available methods rely on stochastic heuristics that give only approximations of the optimum, and end up being especially susceptible to the degenerate landscape, yielding different answers whenever they are employed.
In response to this inherent degeneracy, many authors have emphasized the need to collectively analyse many outputs of any given community detection method, not only the best scoring result [6-9]. In this direction, one particularly interesting proposition is to recover the task of detecting a single partition, but doing so in a manner that incorporates the consensus over many different alternatives [7, 9-14]. If most results are aligned with the same general solution, the consensus among them allows us in fact to profit from the degeneracy, since small distortions due to irrelevant details or statistical fluctuations are averaged out, leading to a more robust answer than any of the individual solutions. However, consensus clustering cannot provide a full answer to the community detection problem. This is because any kind of approach based on point estimates possesses an Achilles' heel in situations where the competing answers do not all point in a cohesive direction, and instead amount to incompatible results. A consensus between diverging answers is inconsistent in the same manner as the mean of a bimodal distribution is not a meaningful representation of the corresponding population. Therefore, extracting understanding from community detection methods requires more than simply finding a consensus, as we need also to characterize the dissensus among the competing partitions. In fact, we need robust methods that give us a complete picture of the entire population of partitions, which is something we currently lack.
In this work we develop a comprehensive set of methods to characterize a population of network partitions, in a manner that reveals both the consensus and dissensus between them. We approach this task by first providing a solution to the community label identification problem, which allows us to unambiguously identify groups of nodes between partitions even when their node compositions are not identical. This allows us to perform the basic (but until now unsolved) task of computing marginal distributions of group memberships for each node in the network, and also leads naturally to a way of comparing partitions based on the maximum overlap distance, which has a series of useful properties that we demonstrate. The maximum overlap between partitions yields a simple way to characterize the consensus between them, acting in a way analogous to a maximum a posteriori estimation of a categorical distribution. We highlight also the pitfalls of consensus estimation in community detection, which fails when the ensemble of solutions is heterogeneous. Finally, we provide a more powerful alternative, consisting of the generalization of our method to the situation where multiple consensuses are possible, such that groups of partitions can align in different directions. The identification of these partition "modes" yields a compact and understandable description of the heterogeneous landscape of community detection results, allowing us to assess their consistency and weigh the alternative explanations they offer to the network data.
This work is divided as follows. In Sec. II we describe the label identification problem, and in Sec. III we present our solution based on the inference of what we call the random label model. In Sec. IV we show how the random label model leads naturally to the maximum overlap distance between partitions, which serves as a simple and efficient way to compare them, with many other useful properties. In Sec. V we show how to use the maximum overlap distance and other dissimilarity functions to extract consensus from network partitions, and we investigate their consistency in situations when the different partitions disagree. We show then in Sec. VI how the random label model can be generalized to provide a comprehensive description of multimodal populations of partitions, including how partitions may agree and disagree with each other. In Sec. VII we show how our ideas can be easily generalized to ensembles of hierarchical partitions, and finally in Sec. VIII we show how our methods allow us to perform more accurate Bayesian model selection, which requires a detailed depiction of the space of solutions that our approach is able to provide. We end in Sec. IX with a conclusion.

II. THE GROUP IDENTIFICATION PROBLEM IN COMMUNITY DETECTION
In this work we will focus on the approach to community detection that is based on the statistical inference of generative models [15]. Although our techniques can be used with arbitrary community detection methods (or in fact for any data clustering algorithm), those based on inference lend themselves more naturally to our analysis, since they formally define a probability distribution over partitions. More specifically, if we consider a generative model for a network conditioned on a node partition b = {b_i}, where b_i is the group label of node i, such that each network A occurs with a probability P(A|b), we obtain the posterior distribution of network partitions by employing Bayes' rule,

P(b|A) = P(A|b) P(b) / P(A),    (1)

where P(b) is the prior probability of partitions, and P(A) = Σ_b P(A|b) P(b) is the model evidence. There are many ways to compute this probability, typically according to one of the many possible parametrizations of the stochastic block model (SBM) [16] and corresponding choice of prior probabilities for their parameters. Since our analysis will not depend on any particular choice, we omit their derivations, and instead point the reader to Ref. [15] for a summary of the most typical alternatives. For our present goal, it is sufficient to establish that such a posterior distribution can be defined, and that we have mechanisms either to approximately maximize it or to sample partitions from it. The first central issue we seek to address is that for this class of problems the actual numeric values of the group labels have no particular significance, as we are interested simply in the division of the nodes into groups, not in their particular placement in named categories. This means that the posterior probability above is invariant to label permutations. More specifically, if we consider a bijective mapping of the labels μ(r) = s, such that its inverse μ^{-1}(s) = r recovers the original labels, then a label permutation c = {c_i} with c_i = μ(b_i) has the same posterior probability,

P(b|A) = P(c|A),

for any choice of μ. Very often this is considered an unimportant detail, since many inference methods break this label permutation symmetry intrinsically. For example, if we try to find a partition that maximizes the posterior distribution with a stochastic algorithm, we will invariably find one of the many possible label permutations, in an arbitrary manner that usually depends on the initial conditions, and we can usually move on with the analysis from there. Methods like belief propagation [5], which can be employed in the special case where the model parameters other than the partition b are known, yield marginal distributions over partitions that, due to random initialization, also break the overall label permutation symmetry, and yield a distribution centered around one particular group labelling. The same occurs also for some Markov chain Monte Carlo (MCMC) algorithms, for example those based on the movement of a single node at a time [17,18], which will often get trapped inside one particular choice of labels, since the swap of two labels can only occur if the respective groups exchange all their nodes one by one, a procedure that invariably moves the Markov chain through low-probability states, and thus is never observed in practice. Although this spontaneous label symmetry breaking can be seen as a helpful property in these cases, strictly speaking it is a failure of the inference procedure in faithfully representing the overall label symmetry that does exist in the posterior distribution.
In fact, this symmetry guarantees that the marginal posterior group membership probability of any node must be the same for all N nodes, i.e.

π_i(r) = Σ_b δ_{b_i, r} P(b|A) = Σ_{B ≥ r} P(B)/B,    (3)

where P(B) is the marginal distribution of the number of labels (nonempty groups), and we assume that the labels always lie in a contiguous range from 1 to B. Because of this, the true answer to the question "what is the probability of a node belonging to a given group?" is always an unhelpful one, since it is the same one for every node, and carries no information about the network structure. Far from being a pedantic observation, this is a problem we encounter directly when employing more robust inference methods such as the merge-split MCMC of Ref. [19].
In that algorithm, the merge and split of groups are employed as direct move proposals, which significantly improve the mixing time and reduce the tendency of the Markov chain to get trapped in metastable states, when compared to single-node moves. However, as a consequence, the merge and split of groups result in the frequent sampling of the same partition where two group labels have been swapped. In fact, the algorithm of Ref. [19] also includes a joint merge-split move, where the memberships of the nodes belonging to two groups are redistributed in a single move, which often results in the same exact partition, but with the labels swapped.
Such an algorithm will rapidly cycle through all possible label permutations leading to the correct albeit trivial uniform marginal probabilities given by Eq. 3.
In Fig. 1 we show how the label permutation invariance can affect community detection for a network of co-purchases of political books [20], for which we used the Poisson degree-corrected SBM (DC-SBM) [21], using the parametrization of Ref. [22], and the merge-split MCMC of Ref. [19]. Although the individual partitions yield seemingly meaningful divisions, they are observed with a random permutation of the labels, preventing aggregate statistics at the level of single nodes from yielding useful information.
At first we might think of a few simple strategies that can alleviate the problem. For example, instead of marginal distributions, we can consider the pairwise co-occurrence probabilities c_ij = Σ_b δ_{b_i, b_j} P(b|A) ∈ [0, 1], which quantify how often two nodes belong to the same group, and thus are invariant with respect to label permutations. However, this gives us a large dense matrix of size N^2, which is harder to interpret and manipulate than marginal distributions - indeed the usual approach is to try to cluster this matrix [10], by finding groups of nodes that have similar co-occurrences with other nodes, but this just brings us back to the same kind of problem. Another potential option is to choose a canonical naming scheme for the group labels, for example by indexing groups according to their size, such that r < s if n_r < n_s, where n_r is the number of nodes with group label r. However, this idea quickly breaks down if we have groups of the same size, or if the group sizes vary significantly in the posterior distribution. An alternative canonical naming is one based on an arbitrary ordering of the nodes and forcing the labels to be confined to a contiguous range, so that b_j > b_i for j > i whenever b_j corresponds to a group label previously unseen for nodes k ≤ i. In this way every partition corresponds to a single canonical labeling, which we can generate before collecting statistics on the posterior distribution. Unfortunately, this approach is not straightforward to implement, since the marginal distributions will depend strongly on the chosen ordering of the nodes. For example, if the first node happens to be one that can belong to two groups with equal probability, whenever this node changes membership, it will incur the relabeling of every other group, spuriously causing the marginal distribution of every other node to be broader, even if they always belong to the "same" group. It seems intuitive therefore to order the nodes according to the broadness of their marginal distribution, with the most stable nodes first, but since determining the marginal distribution depends on the ordering itself, it leads to a circular problem.
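As a concrete reference for the co-occurrence alternative just discussed, the matrix c_ij can be estimated directly from a set of sampled partitions. The following is a minimal sketch (the function name and the (M, N) array layout are our own illustrative choices):

```python
import numpy as np

def cooccurrence_matrix(samples):
    """Pairwise co-occurrence probabilities c_ij from M sampled partitions.

    samples : (M, N) array of group labels; returns an (N, N) matrix where
    c[i, j] is the fraction of samples in which nodes i and j share a label.
    """
    samples = np.asarray(samples)
    M, N = samples.shape
    c = np.zeros((N, N))
    for b in samples:
        c += (b[:, None] == b[None, :])   # 1 where i and j have the same label
    return c / M
```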
In the following we will provide a different solution to this problem, based on a generative model of labelled partitions, that is both satisfying and easy to implement, and allows us in the end to obtain marginal distributions in an unambiguous manner.

III. THE RANDOM LABEL MODEL
If we have as an objective to obtain the marginal distributions π_i(r) from a posterior distribution, this is equivalent to fitting a factorized "mean-field" model on samples of this posterior distribution, given by

P(b|p) = Π_i p_i(b_i),

where p_i(r) is the probability of node i belonging to group r. Given M sampled partitions {b} = {b^{(1)}, ..., b^{(M)}}, the maximum likelihood estimate of the above model is

p̂_i(r) = (1/M) Σ_m δ_{b_i^{(m)}, r},

which is exactly how we estimate marginal distributions, since p̂_i(r) → π_i(r) as M → ∞. Note, however, that this model is inconsistent with our posterior distribution of Eq. 1, since it is in general not invariant to label permutations, i.e. if we swap two labels r and s we have the same distribution only if p_i(r) = p_i(s) for every node i. Therefore, in order to tackle the label symmetry problem, we may modify this inference procedure, by making it also label symmetric. We do so by assuming that our partitions are initially sampled from the above model, but then the labels are randomly permuted. In other words, we have

P(b|p) = Σ_c P(b|c) Π_i p_i(c_i),

where the intermediary partition c is relabelled into b with a uniform probability

P(b|c) = [b ∼ c] / B(c)!,

where we make use of the symmetric indicator function [b ∼ c], which equals 1 if b and c are identical up to a label permutation and 0 otherwise, and where B(c) is the number of labels in partition c, and B(c)! is the total number of label permutations of c. Now, inferring the probabilities p from the model above involves finding a single underlying canonical labelling that is erased at each sample, but after it is identified allows us to obtain marginal distributions. This canonical labeling itself is not unique, since every permutation of its labels is equivalent, but we do not care about the identity of the labels, just an overall alignment, which is what the inference will achieve. We proceed with the inference of the above model in the following way. Suppose we observe M partitions {b} = {b^{(1)}, ..., b^{(M)}} sampled from the posterior distribution as before. According to our model, their joint probability, together with their hidden labellings {c} = {c^{(1)}, ..., c^{(M)}}, is given by

P({b}, {c}|p) = Π_m ([b^{(m)} ∼ c^{(m)}] / B(c^{(m)})!) × Π_i Π_r p_i(r)^{n_i(r)},

where

n_i(r) = Σ_m δ_{c_i^{(m)}, r}

is the number of relabelled partitions where node i has hidden label r. Our first step is to infer the hidden labels {c} from the posterior

P({c}|{b}) = P({b}, {c}) / P({b}),

with the marginal likelihood integrated over all possible probabilities p, and given by

P({b}, {c}) = ∫ P({b}, {c}|p) P(p) dp = Π_m ([b^{(m)} ∼ c^{(m)}] / B(c^{(m)})!) × Π_i ((B - 1)! Π_r n_i(r)!) / (M + B - 1)!,

where we have used an uninformative prior P(p_i) = (B - 1)!, and B is the total number of labels in p. Therefore, up to an unimportant multiplicative constant, we have that the posterior distribution of hidden relabellings is given by

P({c}|{b}) ∝ Π_m [b^{(m)} ∼ c^{(m)}] × Π_i Π_r n_i(r)!.

We proceed by considering the conditional posterior distribution of a single partition c^{(m)},

P(c^{(m)} | {b}, {c} \ c^{(m)}) ∝ Π_i (n_i(c_i^{(m)}) + 1),

where n_i(r) = Σ_{m' ≠ m} δ_{c_i^{(m')}, r} is the label count excluding c^{(m)}, and we have dropped the indicator function for conciseness, but without forgetting that [c^{(m)} ∼ b^{(m)}] = 1 must always hold. If we seek to find the most likely hidden labelling c^{(m)} we need to maximize the above probability, or equivalently its logarithm, which is given by

Σ_i ln(n_i(c_i^{(m)}) + 1),

up to an unimportant additive constant. We can rewrite the right-hand side of the above equation in an equivalent way as follows,

Σ_r w_{r, μ(r)},    (18)

where

w_{rs} = Σ_i δ_{b_i^{(m)}, r} ln(n_i(s) + 1),

and μ(r) is a particular relabeling of b such that μ(b_i) = c_i. Maximizing the quantity of Eq. 18 boils down to finding the optimal bijection μ among all B(b)! possibilities. Unfortunately, this number grows too fast for an exhaustive search to be feasible, unless the number of labels is very small. However, the maximization of Eq.
18 can be cast as an instance of a very well known combinatorial optimization problem called maximum bipartite weighted matching, also known as the assignment problem, which corresponds to finding a "matching" on a bipartite graph, defined as a subset of the edges that share no common nodes, such that the sum of the weights of the edges belonging to the matching is maximized. This corresponds precisely to the sum given in Eq. 18, where the group labels of b and c form the nodes in the bipartite graph, the matrix w_rs gives the weight of the edge (r, s), and each choice of μ corresponds to a particular matching (see Fig. 2). In particular we are interested in the unbalanced and imperfect version of the matching problem, where the number of groups on both sides might be different, and groups on either side might be left unmatched [23], in which case each unmatched group is given the label of a new group. Luckily, fast polynomial algorithms for this problem have long been known. For example, using the "Hungarian" or Kuhn-Munkres algorithm [24,25] this problem can be solved with a worst-case running time of O(B(b)^3), which is substantially better than an exhaustive search, rendering our approach not only feasible but efficient. Equipped with this solution, we can summarize our whole inference algorithm as follows: 1. We sample M partitions b^{(1)}, ..., b^{(M)} from the posterior distribution P(b|A).

2. We initialize c^{(m)} = b^{(m)} for every sample m.
3. For each sample m, in random order, we obtain a new relabelling c^{(m)} such that Eq. 18 is maximized.
4. If any value of c^{(m)} is changed during the last step, we repeat it, otherwise we stop and return {c}.

Figure 2. Relabeling a partition corresponds to finding the solution of a maximum bipartite weighted matching problem, where the partition labels are the nodes of a bipartite graph with weights w_rs on the edges. The matching is a bijection μ(r) that needs to be chosen so that the total sum Σ_r w_{r, μ(r)} is maximized. In this illustration, the edge thickness corresponds to the weight w_rs, and the edges in green correspond to the maximum matching.
By the end of this algorithm, we are guaranteed to find a local maximum of Eq. 15, but not a global one, hence we need to run it multiple times and keep the result with the largest posterior probability. However, we found that repeated runs of the algorithm give the same result in the vast majority of cases we tried. Computationally, step 3 is the heart of the above algorithm, as it corresponds to the alignment of each partition with the rest.
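To make step 3 concrete, the relabelling of a single sample against the accumulated label counts of the others can be cast as an assignment problem and handed to a generic solver. The sketch below assumes 0-indexed group labels and uses scipy's linear_sum_assignment; the function name and array layout are illustrative, not part of any specific implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def relabel_sample(b, n_counts):
    """Align one sampled partition b with the hidden labels of the other samples.

    b        : (N,) array of 0-indexed group labels of the sample to relabel
    n_counts : (N, Bc) array, n_counts[i, s] = number of other samples in which
               node i currently carries hidden label s
    Returns the relabelled partition c maximizing sum_i log(n_counts[i, c_i] + 1),
    found by solving the assignment problem over the weights w_rs of Eq. 18.
    """
    N = len(b)
    B = max(b.max() + 1, n_counts.shape[1])   # pad to a common number of labels
    w = np.zeros((B, B))                      # w[r, s] = sum_i delta(b_i, r) * log(n_i(s) + 1)
    for i in range(N):
        w[b[i], :n_counts.shape[1]] += np.log(n_counts[i] + 1)
    rows, cols = linear_sum_assignment(-w)    # negate weights to maximize the total
    mu = np.empty(B, dtype=int)
    mu[rows] = cols                           # bijection mu(r) = s
    return mu[b]
```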
Note that the above procedure is not much more computationally intensive than obtaining the marginals in the naive way, i.e. directly from the originally labelled partitions b, which requires a time O(MN) to record the label counts. It does, however, require more memory, with a total O(MN) storage requirement, as we need to keep all M partitions for the whole duration of the algorithm. In practice, however, we do not need to perform the whole procedure above for all M partitions, as it is often sufficient to choose a relatively small subset of them, provided they give a good representation of the ensemble, and then run steps 1 to 4 only on this subset. Based on that, we can process each remaining partition by simply finding its relabelling c^{(m)}, updating the global label counts n_i(r), and then discarding the partition. Although this gives only an approximation of the optimization procedure, we find it works very well in practice, yielding results that are often indistinguishable from what is obtained with the full algorithm, while requiring less memory. In Fig. 3 we show the partitions of the political books network considered in Fig. 1, but now relabelled according to the algorithm above. Despite groups changing size and composition, and the appearance and disappearance of groups, the unique labelling allows us to identify them clearly across partitions. In Fig. 4 these relabellings are used to obtain marginal distributions on the nodes, where we can say unambiguously with which frequency a node belongs to a given group.

IV. THE MAXIMUM OVERLAP DISTANCE
The method described in the previous section serves as a principled way to disambiguate group labels in an ensemble of partitions, but the ideas articulated in its derivation also lead us to a way of comparing two partitions with each other in a general and meaningful way. Consider the situation where we employ the model above, but we have only M = 2 partitions. In this case, without loss of generality, we can set one of them arbitrarily to correspond to the canonical labelling, and we seek to relabel the second one, by maximizing Eq. 18, which in this case simplifies to

Σ_r m_{r, μ(r)} ln 2,    (21)

where

m_{rs} = Σ_i δ_{b_i^{(1)}, r} δ_{b_i^{(2)}, s}

is the so-called contingency table between partitions b^{(1)} and b^{(2)}, which quantifies how many nodes in group r of b^{(1)} belong to group s of b^{(2)}. Therefore, maximizing Eq. 21 is equivalent to finding the bijection μ so that the partitions x = b^{(1)} and y = b^{(2)} maximize the partition overlap

ω(x, y) = max_μ Σ_i δ_{x_i, μ(y_i)},

which counts how many nodes share the same label in both partitions. Therefore, incorporating our inference procedure leads to the maximum overlap distance

d(x, y) = N - ω(x, y).

This quantity has a simple interpretation as the minimal classification error, i.e. the smallest possible number of nodes with an incorrect group placement in one partition if the other is assumed to be the correct one. This measure has been considered before in Refs. [27,28], but here we see its derivation based on a probabilistic generative model. Here we point out some useful properties of this distance function: 1. Simple interpretation. Since it quantifies the classification error, it is easy to intuitively understand what the distance is conveying. In particular its normalized version d(x, y)/N yields values in the range [0, 1] which can be interpreted as fractions of differing nodes, and hence allows the direct comparison between results obtained for partitions of different sizes and numbers of groups.

2. Behaves well for unbalanced partitions. The distance d(x, y) behaves as one would expect even when the partitions have very different numbers of groups, or the number of groups approaches N for either x or y, unlike alternatives such as mutual information [29]. More specifically, if we simply increase the number of groups of either partition being compared, this does not spuriously introduce small values of d(x, y). We see this by noticing that if B(x) = B and B(y) = N, the maximum overlap is always ω(x, y) = B, since each group in x can be trivially matched with any of the single-node groups in y, yielding

d(x, y) = N - B,

which leads to the maximum normalized distance

d(x, y)/N = 1 - B/N.

3. Simple asymptotic behavior for uncorrelated partitions. Suppose partitions x and y are sampled independently and uniformly from the set of all possible partitions into B(x) and B(y) labelled groups, respectively. In this case, as N ≫ 1, the contingency table will tend to the uniform one with m_rs = N/[B(x)B(y)], which results in the asymptotic normalized distance given by

d(x, y)/N → 1 - 1/max(B(x), B(y)).

Although it is not a substitute for a proper hypothesis test (which would need to account for finite values of N), this asymptotic value gives a rule of thumb of how to interpret the distance between two partitions as a strength of statistical correlation.

4. Defines a metric space. The distance d(x, y) is a proper metric, since it fulfills the properties of non-negativity, symmetry, identity of indiscernibles (d(x, y) = 0 if and only if x and y are identical up to a label permutation), and most notably, the triangle inequality d(x, z) ≤ d(x, y) + d(y, z) (we offer a simple proof of this in Appendix A). This makes this notion of distance well-defined, unambiguous, and conforming to intuition.

5. Information-theoretic interpretation. The maximum overlap has a direct information-theoretic interpretation, due to its connection to the random label generative model exposed earlier. According to the model of Eq. 13, the joint probability P(x, y) of observing two partitions {c} = {x, y}, up to an arbitrary relabelling of the groups, is proportional to 2^{ω(x, y)}, up to factors that do not depend on the overlap. This means that any two partitions have a joint description length

Σ(x, y) = -log_2 P(x, y),

which measures the amount of information (in bits) necessary to describe both partitions. The above quantity is proportional to the negative value of the maximum overlap ω(x, y), and hence is proportional to d(x, y). (Note that this is not the most efficient encoding scheme based on the maximum overlap; we consider an alternative in Appendix B.) 6. Efficient computation. As discussed previously, computing the maximum overlap involves solving an instance of the maximum bipartite weighted matching problem, with weights given by the contingency table, w_rs = m_rs (see Fig. 2), which can be done using the Kuhn-Munkres algorithm [24,25]. In its sparse version, the running time is bounded by a polynomial in the number of labels and in E_m, the number of nonzero entries in the contingency matrix m_rs [23]. Combining this with the O(N) work required to build the contingency table itself, the running time of computing d(x, y) will depend on whether we expect the number of labels and the density of the contingency table to be much smaller than or comparable to N. In the former case, the maximum matching step takes a time that is negligible compared to building the contingency table; in the latter case we have E_m = O(N), and hence the running time will be quadratic, O(N^2). However, the latter scenario is atypical when N is very large, so therefore we most often encounter the linear regime allowing for very fast computations (see Fig. 5).
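To illustrate property 6, the maximum overlap distance can be computed from the contingency table with the same generic assignment solver. This is a minimal sketch, assuming 0-indexed labels (the function name is ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def max_overlap_distance(x, y):
    """Maximum overlap distance d(x, y) = N - max_mu sum_i delta(x_i, mu(y_i)).

    x, y : (N,) integer arrays of 0-indexed group labels.
    """
    N = len(x)
    m = np.zeros((x.max() + 1, y.max() + 1))
    np.add.at(m, (x, y), 1)                   # contingency table m_rs
    rows, cols = linear_sum_assignment(-m)    # maximum weighted bipartite matching
    omega = m[rows, cols].sum()               # maximum overlap
    return int(N - omega)

# Example: same three-group division up to a label swap, with one node moved
x = np.array([0, 0, 1, 1, 2, 2])
y = np.array([1, 1, 0, 0, 0, 2])
print(max_overlap_distance(x, y))  # 1
```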
The maximum overlap distance has been used before in situations where the labeling is unambiguous or the number of labels is so small that exhaustive iteration over label permutations is feasible (e.g. [5,30]), but, to the best of our knowledge, rarely in combination with the maximum bipartite weighted matching algorithm as outlined above (with an exception being Ref. [31] that employed it when comparing with other metrics) which makes it usable in general settings. Instead, more focus has been given to measures such as mutual information (and its several variants) [32] or variation of information (VI) [33], which are based on the contingency table without requiring us to obtain a label matching. As pointed out by Meilă [33], it is not meaningful to talk about the "best" way of comparing partitions without any context, since such a task must be unavoidably tied with our ultimate objective. Therefore, a different set of axiomatic conditions might prefer another dissimilarity function [34]. In particular, since the maximum overlap distance is based only on the number of nodes correctly classified, it ignores the nodes that do not match, and hence does not exploit any potential regularity with which the labels are mismatched. Other functions such as variation of information, might provide alternatives which can be used to highlight different properties of partition ensembles. Nevertheless, few other dissimilarity functions share the same ease of interpretation with the maximum overlap distance, while possessing its other useful formal properties, such as natural normalization, information-theoretical interpretation, and the fact it defines a metric space.
Among the alternative partition similarities and dis-similarities, the recently introduced reduced mutual information (RMI) [35] deserves particular mention. This is because, like the maximum overlap distance, it is related to a joint description length of two partitions, which in the case of RMI involves encoding the full contingency table. This means that both similarities can be compared to each other in their own terms, and the most appropriate measure must yield the shortest description length. We perform a succinct comparison between RMI and an overlap-based encoding in Appendix B. We will also consider both RMI and VI more closely in the following section.

V. EXTRACTING CONSENSUS BETWEEN PARTITIONS
The explicit objective of community detection, like any data clustering method, is to find a partition of the nodes of a network, in a manner that captures its structure in a meaningful way. However, instead of a single partition, the inference approach gives us a distribution of partitions, which ascribes to every possible division of the network a plausibility, reflecting both our modelling assumptions as well as the actual structure of the network. In order to convert this information into a single partition "point estimate" we have to be more specific about what we would consider a successful outcome, or more precisely how we define the error of our estimate. A consistent scenario is to assume that our observed network is indeed generated from our model P(A|b*), where b* is the true partition we are trying to find. In order to quantify the quality of our inference we need to specify an error function ε(x, y) that satisfies ε(x, x) = 0 and ε(x, y) > 0 whenever x and y differ beyond a label permutation. Based on a choice for this function, and since we do not really have access to the true partition b*, our best possible estimate b̂ from the posterior distribution is the one which minimizes the average error over all possible answers, weighted according to their plausibility, i.e.

b̂ = argmin_{b'} Σ_b ε(b', b) P(b|A).    (31)
Therefore, it is clear that our final estimate will depend on our choice of error function ε(x, y), and hence is not a property of the posterior distribution alone. In the statistics and optimization literature the function ε(x, y) is called a "loss function," and it determines the ultimate objective of the inference procedure.
In addition to producing a point estimate b̂, it is also useful for our inference procedure to yield an uncertainty value σ_b̂, which quantifies how sure we are about the result, with σ_b̂ = 0 indicating perfect certainty. As we will see, this value will also be determined by the choice of the error function.
In the following we consider simple choices of the error function, and investigate how they compare to each other in the inference results they produce.

A. Maximum a posteriori (MAP) estimation
Arguably the simplest error function we can use is the indicator function

ε(x, y) = 1 - δ_{x, y}

(also called the "zero-one" or "all-or-nothing" loss), where δ_{x, y} = 1 if the two partitions are identical and 0 otherwise, which would separate the true partition completely from any other, without differentiating among wrong ones. Inserting this in Eq. 31, we obtain the maximum a posteriori (MAP) estimator

b̂ = argmax_b P(b|A),

which is simply the most plausible partition according to the posterior distribution. The corresponding uncertainty for this estimate is simply σ_b̂ = 1 - P(b̂|A), such that if σ_b̂ = 0 we are maximally certain about the result. Despite its simplicity, there are several problems with this kind of estimation. Namely, the drastic nature of the error function completely ignores partitions which may be almost correct, with virtually all nodes correctly classified, except very few or in fact even one node placed in the incorrect group. We therefore rely on a very strong signal in the data, where the true partition is given a plausibility that is larger than any small perturbation around it, in order for us to be able to make an accurate estimation. This puts us in a precarious position in realistic situations where our data are noisy and complex, and do not perfectly match our modelling assumptions. Furthermore, the uncertainty σ_b̂ is in most cases difficult to compute, as it involves determining the intractable sum P(A) = Σ_b P(A, b), which serves as a normalization constant for P(b|A) (although we will consider approximations for this in Sec. VIII). Even if computed exactly, typically we will have σ_b̂ approaching the maximum value of one, since very few networks have a single partition with a dominating posterior probability.

B. Maximum overlap consensus (MOC) estimation
As an alternative to the MAP estimation, we may consider a more relaxed error function given by the overlap distance

ε(x, y) = N - Σ_i δ_{x_i, y_i},

based on the number of nodes correctly classified when compared to the true partition. With this function, from Eq. 31 we obtain the maximum marginal estimator

b̂_i = argmax_r π_i(r),

with

π_i(r) = Σ_b δ_{b_i, r} P(b|A)

being the marginal posterior distribution for node i. The uncertainty in this case is then simply the average of the uncertainty for each node, σ_b̂ = 1 - Σ_i π_i(b̂_i)/N. Since this estimator considers the average over all partitions instead of simply its maximum, it incorporates more information from the posterior distribution. Nevertheless, we encounter again the same problem we have described before, namely that due to label permutation invariance the marginal distribution will be identical for every node, and this estimator will yield in fact useless results. We can fix this problem by employing instead the maximum overlap distance as an error function, ε(x, y) = d(x, y), leading to the estimator

b̂ = argmin_{b'} Σ_b d(b', b) P(b|A).

Performing the maximization yields now a set of self-consistent equations, with the marginal distributions obtained over the relabeled partitions,

π̂_i(r) = Σ_b δ_{μ_b(b_i), r} P(b|A),    (39)

where the relabeling μ_b is done in order to maximize the overlap with b̂, and b̂_i = argmax_r π̂_i(r). Like before, the uncertainty is given by σ_b̂ = 1 - Σ_i π̂_i(b̂_i)/N. In practice we implement this estimator by sampling a set of M partitions {b} from the posterior distribution and then performing the double maximization

b̂_i = argmax_r Σ_m δ_{μ_m(b_i^{(m)}), r},    (41)

μ_m = argmax_μ Σ_i δ_{b̂_i, μ(b_i^{(m)})},    (42)

where in the last equation the maximization is done for each sample m, given the current estimate b̂. The solution of Eq. 41 is obtained by simply counting how often each label appears for each node and then extracting the label with the largest count, and Eq. 42 is once more an instance of the maximum bipartite weighted matching problem. The overall solution can be obtained by simple iteration, starting from an arbitrary choice of b̂, and then alternating between the solution of Eq. 42 and using its result to solve Eq. 41, until b̂ no longer changes. This guarantees a local optimum of the optimization problem, but not necessarily a global one, therefore this algorithm needs to be repeated multiple times with different initial conditions, and the best result kept. Since it involves the relabelling of all M partitions, the overall algorithmic complexity of a single iteration is O(M(N + B^3)). Note that the marginal distributions obtained via Eq. 39 with the MOC estimator are not necessarily the same as those obtained by inferring the random label model considered previously. This is because while the MOC calculation attempts to find a single partition with a maximum overlap to all samples, inferring the random label model amounts to finding the most likely marginal distribution compatible with all samples, irrespective of its maximum. Although in many cases these two calculations will give similar answers, they are not equivalent.
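The alternating solution of Eqs. 41 and 42 can be sketched as follows. This is a minimal illustration, assuming 0-indexed labels padded to a common range, with our own function names; it is not a reference implementation of the method:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align(b, bhat, B):
    """Relabel partition b to maximize its overlap with the reference bhat (Eq. 42)."""
    m = np.zeros((B, B))
    np.add.at(m, (bhat, b), 1)                # contingency table between bhat and b
    rows, cols = linear_sum_assignment(-m)    # maximum weighted matching
    mu = np.empty(B, dtype=int)
    mu[cols] = rows                           # map labels of b into labels of bhat
    return mu[b]

def moc_estimate(samples, n_iter=100, seed=0):
    """Iterative maximum overlap consensus over M sampled partitions.

    samples : (M, N) array of 0-indexed group labels.
    Returns the consensus bhat and the per-node marginals of the aligned samples.
    """
    samples = np.asarray(samples)
    M, N = samples.shape
    B = samples.max() + 1
    rng = np.random.default_rng(seed)
    bhat = samples[rng.integers(M)].copy()    # arbitrary initial consensus
    counts = np.zeros((N, B))
    for _ in range(n_iter):
        aligned = [align(b, bhat, B) for b in samples]
        counts = np.zeros((N, B))
        for c in aligned:
            counts[np.arange(N), c] += 1      # label counts per node
        new_bhat = counts.argmax(axis=1)      # per-node majority vote (Eq. 41)
        if np.array_equal(new_bhat, bhat):
            break
        bhat = new_bhat
    return bhat, counts / M
```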

C. Error functions based on the contingency table
In principle, we can make a variety of other choices for error functions. A particular class of them are those based on the contingency table between partitions, using concepts from information theory. These error functions are not based on an explicit labeling or alignment of partitions, but instead focus on the joint probability of labels in both partitions being compared. A popular function of this kind is the variation of information (VI) [33], which is defined as

VI(x, y) = -(1/N) Σ_rs m_rs [ln(m_rs / n_r) + ln(m_rs / n'_s)],

with m_rs = Σ_i δ_{x_i, r} δ_{y_i, s} being the contingency table between x and y, and n_r = Σ_s m_rs and n'_s = Σ_r m_rs the group sizes in both partitions. We can use VI as an error function by setting ε(x, y) = VI(x, y). As detailed in Ref. [33], VI is a dissimilarity function that fulfills many desirable formal properties, including the triangle inequality, making it a proper metric distance (like the maximum overlap distance). Another possible alternative consists of using the reduced mutual information (RMI) [35], as done by Riolo and Newman [36], with ε(x, y) = -RMI(x, y), where

RMI(x, y) = (1/N) [ ln (N! Π_rs m_rs!) / (Π_r n_r! Π_s n'_s!) - ln Ω(n, n') ],

with Ω(n, n') being the total number of contingency tables with fixed row and column sums, which we omit here for brevity (see Ref. [35] for asymptotic approximations). The negative sign used in the definition of ε(x, y) is because RMI is a similarity function, which takes its maximum value when x and y are identical, unlike VI, which is a dissimilarity that takes its minimum value of zero in the same case. RMI can be seen as a correction to mutual information, which fails as an appropriate similarity function in key cases. It is based on a nonparametric minimum description length encoding of both partitions, which quantifies the amount of information required to describe them if the contingency table is known, together with the necessary information required to describe the contingency table itself.
In either of the above cases, our point estimateb consists of minimizing the sum of the error function over M samples from the posterior distribution, according to Eq. 31. Unlike the indicator and the maximum overlap distance, the above loss functions are more cumbersome to optimize, with the overall optimization itself amounting to a nonconvex clustering problem of its own. Therefore we can use some of the same algorithms we use to perform community detection in the first place, with a good choice being the merge-split MCMC of Ref. [19], which we have used in our analysis.
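For concreteness, VI is straightforward to evaluate from the contingency table alone; the sketch below (function name is ours) returns the value in nats:

```python
import numpy as np

def variation_of_information(x, y):
    """Variation of information VI(x, y) between two labelled partitions.

    x, y : (N,) integer arrays of 0-indexed group labels.
    VI(x, y) = -(1/N) sum_rs m_rs [ ln(m_rs/n_r) + ln(m_rs/n'_s) ]
    """
    N = len(x)
    m = np.zeros((x.max() + 1, y.max() + 1))
    np.add.at(m, (x, y), 1)          # contingency table m_rs
    n_r = m.sum(axis=1)              # group sizes in x
    n_s = m.sum(axis=0)              # group sizes in y
    r, s = np.nonzero(m)             # only nonzero cells contribute
    p = m[r, s]
    return float(-np.sum(p * (np.log(p / n_r[r]) + np.log(p / n_s[s]))) / N)
```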

D. Consensus point estimates are inconsistent for heterogeneous distributions
Our aim is not to list or perform an exhaustive comparison between all possible error functions, but instead to focus on the fact that they do not always yield the same answer. Although there is only one way with which all partitions in an ensemble can be identical, there are many ways in which they can be different. While the various error functions allow us to extract a form of consensus between differing partitions, they each achieve this based on different features of the population. Therefore, for partition ensembles with sufficiently strong heterogeneity, the different estimators may give conflicting answers. Such a disagreement can signal an arbitrariness in the inference procedure, and our inability of summarizing the population in a simple manner. We illustrate this problem with a few simple examples.
We consider first a simple artificial scenario with strong heterogeneity, composed of M independently sampled partitions of N nodes, where each node receives a group label sampled uniformly at random from the interval [1, B]. Indeed, in this example there is no real consensus at all between partitions. Intuitively, we might expect the estimated consensus between such fully random partitions to be a sort of "neutral" partition, in the same way the average of a fully isotropic set of points in Cartesian space will tend towards the origin. However, all consensus estimators considered previously behave very differently from each other in this example. In Fig. 6 we compare the different estimates in terms of the effective number of groups B_e(b̂) = e^{S(b̂)}, with S(b̂) = -Σ_r (n_r/N) ln(n_r/N) being the group label entropy. Arguably, the estimator that behaves the closest to the intuitive expectation just mentioned is VI, which for M > 2 yields a consensus partition composed of a single group, B_e(b̂) = 1. The MOC estimator yields instead a partition into B_e(b̂) = 4 groups, which itself is hard to distinguish from a random partition sampled from the original ensemble. This is because the marginal distributions obtained by Eq. 39 will be close to uniform, even after the label alignments of Eq. 42 are achieved, such that the maximum chosen by Eq. 41 will be determined by small quenched fluctuations in the partition ensemble. Finally, the RMI estimate yields consensus partitions with a number of groups that increases with the number of samples M. This is because the RMI estimate tends to find the overlaps between partitions, i.e. sets of nodes that tend to occur together in the same group across many partitions [9]. In our random case, two nodes belong to the same group due to pure coincidence, therefore the probability of this happening for a large set of nodes decreases for larger M, thus making the overlapping sets progressively smaller, and leading to a larger number of groups in the consensus.

We further illustrate the discrepancy issue with a more realistic example where we can see both agreements and disagreements between the different estimates. In Fig. 7 we show the estimates obtained for the same political books network considered previously, again using the DC-SBM to obtain a posterior distribution of partitions. We observe, rather curiously, that the MAP estimate coincides perfectly with the VI estimate, but gives a different result from the MOC and RMI estimates. The MAP/VI estimates separate the network into three groups, which in this context can be understood as types of books describing "liberal" and "conservative" politics, and "neutral" books not taking any side. The MOC estimate further divides the "liberal" category into a new subgroup, and somewhat strangely at first, singles out two "neutral" books into their own category. As can be seen in Fig. 4, the reason for this is that the posterior distribution exhibits a possible subdivision of the neutral group into two, however it is only these two nodes that happen to belong to this subdivision with the highest probability. The RMI estimate also yields a division into five groups, but the two extra groups have a larger size when compared to the MOC result. In view of the behavior seen for the fully random example considered earlier, the discrepancies raise some doubts about what is the most faithful division. Are the MOC and RMI arbitrary divisions due to the randomness of the posterior distribution or do they point to a meaningful summary? Are the MAP/VI estimates being too conservative about the structure of the posterior distribution?
With some other networks, the discrepancy between estimators can be even stronger, making such questions even harder to answer. In Fig. 8 we show the results for Zachary's karate club network [37], again using the DC-SBM. In this case, the MAP, MOC and VI estimators yield the same division of the network into a single group, whereas the RMI estimate yields a partition into five groups, following no clear pattern. None of the estimates resemble the putative division for this network in two assortative communities.
Despite the partial agreement between some of the estimates in the examples above, the disagreements still raise obvious interpretation questions. Here we argue that this discrepancy cannot be resolved simply by trying alternative ways to form a consensus, since trying to summarize a whole distribution with a point estimate is in general an impossible task, and therefore we need instead a way to also characterize the dissensus between partitions, by exposing the existing heterogeneity of the posterior distribution.
To some extent, the characterization of dissensus is already achieved by the random label model of Sec. III, since it attempts to describe the posterior distribution via marginal probabilities, rather than just a point estimate, and therefore can convey how concentrated it is. However, because this model assumes the group membership of each node to be independent, it still hides a significant fraction of the potential heterogeneity in the ensemble, which can come from the correlation between these memberships. In the next section we will generalize this approach to the situation where the posterior distribution is multimodal, so that multiple consensuses are simultaneously possible. We will see how this allows us to extract a more complete and coherent picture of distributions of partitions.

VI. EXTRACTING DISSENSUS BETWEEN PARTITIONS
We aim to characterize the discrepancy between partitions by considering the possibility of several consensuses that only exist between subsets of the partitions. This corresponds to the situation where the inference procedure can yield substantially different explanations for the same network. We do so by modelling the posterior distribution of partitions with a mixture model, where each partition can belong to one of K clusters, which we call "modes" to differentiate them from the groups of nodes in the network. Inside each mode the partitions are generated according to the same random label model considered before, but with different parameters. More specifically, a partition b is sampled according to

P(b|p, w) = Σ_k w_k P(b|p, k),

where w_k is the relative size of mode k, with Σ_k w_k = 1, and inside a mode k the partitions are sampled according to the random label model, with the hidden labels generated according to

P(c|p, k) = Π_i p_i^{(k)}(c_i),

where p_i^{(k)}(r) is the probability that a node i has group label r in mode k, and finally a random label permutation is applied, chosen uniformly at random,

P(b|c) = [b ∼ c] / B(c)!.

Naturally, we recover the original random label model for K = 1.
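The generative process just described can be written down directly, which may help to clarify the role of each parameter. The sketch below (our own naming and array layout) draws one partition by picking a mode, sampling hidden labels independently for each node, and applying a uniformly random permutation of the labels that were used:

```python
import numpy as np

def sample_partition(w, p, rng=None):
    """Draw one partition from the multimodal random label model.

    w : (K,) mode weights summing to 1
    p : (K, N, B) array; p[k, i, r] = probability that node i takes hidden label r in mode k
    """
    rng = np.random.default_rng() if rng is None else rng
    K, N, B = p.shape
    k = rng.choice(K, p=w)                                       # pick a mode
    c = np.array([rng.choice(B, p=p[k, i]) for i in range(N)])   # hidden labels
    used = np.unique(c)
    mu = dict(zip(used, rng.permutation(used)))                  # uniform permutation of used labels
    return np.array([mu[r] for r in c])                          # observed partition b
```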
We perform the inference of the above model by considering the mode label k as a latent variable, which yields a joint probability together with the original and relabelled partitions, P(b, c, k|p, w) = P(b|c) P(c|p, k) P(k|w).
If we now observe M partitions {b} = {b^{(1)}, ..., b^{(M)}} sampled from the SBM posterior distribution, we assume that each one has been sampled from one of the K modes, so that for each observed partition b^{(m)} we want to infer its relabelled counterpart together with its originating mode, i.e. (c^{(m)}, k_m). The joint posterior distribution for these pairs, together with the total number of modes K, is given by

P({c}, k, K|{b}) ∝ P({b}|{c}) P({c}|k) P(k|K) P(K),

where the relabelling probability is given by

P({b}|{c}) = Π_m [b^{(m)} ∼ c^{(m)}] / B(c^{(m)})!,

and with the marginal likelihood obtained by integrating over all possible probabilities p for each mode,

P({c}|k) = Π_k Π_i ((B_k - 1)! Π_r n_i^{(k)}(r)!) / (M_k + B_k - 1)!,

with M_k = Σ_m δ_{k_m, k} being the number of samples that belong to mode k, B_k the total number of group labels in mode k, and n_i^{(k)}(r) = Σ_m δ_{c_i^{(m)}, r} δ_{k_m, k} the marginal label counts in mode k, and finally the prior mode distribution is obtained by integrating over all possible mode mixtures w,

P(k|K) = ∫ Π_m w_{k_m} P(w) dw = ((K - 1)! Π_k M_k!) / (M + K - 1)!,

where we used once more an uninformative prior P(w) = (K - 1)!. For the total number of modes K we use a uniform prior P(K) ∝ 1, which has no effect on the resulting inference. With this posterior in place, we can find the most likely mode distribution with a clustering algorithm that attempts to maximize it. We do so by starting with an arbitrary initial placement of the M partitions into modes, and implementing a greedy version of the merge-split algorithm of Ref. [19] that chooses at random between the following steps, accepting each move only if it increases the posterior probability:

1. A random partition b^{(m)} is moved from its current mode to a randomly chosen one, including a new mode.
2. Two randomly chosen modes are merged into one, reducing the total number of modes.
3. A randomly chosen mode is split into two, increasing the total number of modes. The division itself is chosen by a surrogate greedy algorithm, which tries one of several predefined strategies at random.
4. Two randomly chosen modes are merged into one, and then split like in option 3, preserving the total number of modes.
The algorithm stops whenever further improvements to the posterior cannot be made. In the above, whenever a sample m is placed into a mode k, its hidden labelling c^{(m)} is obtained by maximizing the conditional posterior probability

P(c^{(m)} | {b}, {c} \ c^{(m)}, k) ∝ Π_i (n_i(c_i^{(m)} | k) + 1),

where n_i(r|k) = Σ_{m' ≠ m} δ_{k_{m'}, k} δ_{c_i^{(m')}, r} is the label count of node i considering all samples belonging to mode k, excluding c^{(m)}. Like in the original random label model, this maximization is performed by solving the corresponding maximum bipartite weighted matching problem with the Kuhn-Munkres algorithm in time O(N + B^3), where B is the number of partition labels involved. Overall, a single "sweep" of the above algorithm, where each sample has been moved once, is achieved in time O(M(N + B^3)). After we have found the mode memberships k, the mode fractions can be estimated as

ŵ_k = M_k / M.

This is interpreted as the relative posterior plausibility of each mode serving as an alternative explanation for the data. In the following, we consider a simple example that illustrates how the method above can characterize the structure of a distribution of partitions, and we proceed to investigate how the multimodal nature of the posterior distribution can be used to assess the quality of fit of the network model being used.

A. Simple example
In Fig. 9 we show the result of the above algorithm for the posterior distribution obtained for the same political books network considered previously, where in total K = 11 modes are identified. For each mode we show the corresponding marginal distribution of the relabeled partitions, and the uncertainty σ_b̂ = 1 - Σ_i p_i(b̂_i)/N of its maximum b̂, which serves as a quantification of how broadly distributed the individual modes are. As a means of illustration, in Fig. 9 we show also a two-dimensional projection of the distribution of partitions, obtained using the UMAP dimensionality reduction algorithm [38] with the maximum overlap distance as the dissimilarity metric. This algorithm attempts to project the distribution of partitions in two dimensions, while preserving the relative distances between partitions in the projection. As a result we see that each mode is clearly discernible as a local concentration of partitions, much like we would expect of a heterogeneous mixture of continuous variables. We note here that we have not informed the UMAP algorithm of the modes we have found with the algorithm above, and therefore this serves as additional evidence for the existence of the uncovered heterogeneity in the posterior distribution. The most important result of this analysis is that no single mode has a dominating fraction of the distribution, with the largest mode corresponding only to around 23% of the posterior distribution, and with the second largest mode being very close to it. This means that there is no single cohesive picture that emerges from the distribution, and therefore our attempt at summarizing it with a single partition seems particularly ill-suited.
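A projection like the one described above can be obtained from any embedding method that accepts a precomputed dissimilarity matrix. The sketch below assumes the umap-learn package and a max_overlap_distance() function such as the one sketched in Sec. IV; both choices are illustrative rather than prescribed by our method:

```python
import numpy as np
import umap  # umap-learn

def embed_partitions(samples, dist):
    """2D layout of M sampled partitions from their pairwise distances.

    samples : (M, N) array of group labels
    dist    : callable returning the distance between two partitions
    """
    M = len(samples)
    D = np.zeros((M, M))
    for a in range(M):
        for b in range(a + 1, M):
            D[a, b] = D[b, a] = dist(samples[a], samples[b])
    return umap.UMAP(metric="precomputed").fit_transform(D)
```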
In view of this more detailed picture of the ensemble of partitions, it is worth revisiting the consensus results obtained previously with the various error functions. As shown in Fig. 9, the MAP/VI estimates correspond to the most likely partition of mode (c), which is overall only the third most plausible mode with w 3 = 0.134. From the point of view of the MAP estimator, this serves to illustrate how choosing the most likely partition may in fact run counter to intuition: Although the single most likely partition belongs to mode (c), collectively, the partitions in mode (a) and (b) have a larger plausibility. This means that, if we are forced to choose a single explanation for the data, it would make more sense instead to choose mode (a), despite the fact that it does not contain the single most likely partition. More concretely, when comparing modes (a), (b), and (c), we see that the network does in fact contain more evidence for a division of either the "neutral" or the "liberal" groups into subgroups than the MAP estimate implies, however not both, as mode (d), corresponding to the simultaneous sub-divisions, has a smaller plausibility than the other options. The VI estimate also points to mode (c), but it is unclear why. This is indeed a problem with using VI, since despite its strong formal properties, it lacks a clear interpretability.
Differently from MAP and VI, the MOC estimation combines the properties of all modes into a "Frankenstein's monster," where local portions of the final inferred partition correspond to different modes. As a result, the resulting point estimate has a very low posterior probability, and hence is a misleading representation of the population -a classic estimation failure of multimodal distributions.
The RMI estimate behaves differently, and corresponds to a typical partition of mode (d), which has an overall plausibility of w_4 = 0.132. We can understand this choice by inspecting its composition, and noticing that the more plausible modes (a) to (c) correspond to partitions where groups of (d) are merged together. Because of this, the RMI similarity sees this partition as the "center" composed of the building blocks required to obtain the other ones via simple operations. But it is by no means the most likely explanation of the data according to the model, and given that it is a division into a larger number of groups, it is more likely to be an overfit, in view of the existence of simpler modes (a) to (c).

B. Evaluating model consistency
The full characterization of the posterior distribution with our approach gives us the opportunity to assess the quality of fit between model and data. Indeed, if the model were an excellent fit, e.g. if the data were in fact generated by the SBM, we should expect a single mode in the posterior distribution that is centered on the true partition. Therefore, the fact alone that we observe multiple modes is an indication of some degree of mismatch, with the model offering multiple explanations for the data. Since our analysis allows us to inspect each individual explanation, and ascribe to it a plausibility, this can be used to make a more precise evaluation of the fit.
Inspecting the modes observed for the political books network in Fig. 9, we notice that the four largest modes amount approximately to different combinations of the same five groups that appear in the fourth mode (Fig. 9d), although the remaining modes deviate from this pattern. This is reminiscent of a situation considered by Riolo and Newman [9], who have applied RMI estimation to artificial networks where none of the posterior samples matches the true division, which is only uncovered by the RMI consensus. In particular, in their scenario, the consensus exposed "building blocks," i.e. groups of nodes that tend to be clustered together, although the building blocks themselves always appear merged together into bigger groups. The situation where the partitions exhibit clear shared building blocks that always appear merged together, but in different combinations, begs the question as to why the posterior distribution fails to concentrate on the isolated building blocks in the first place. One possibility is that the building blocks do not correspond to the same kind of communities that the inference approach is trying to uncover, e.g. in the case of the SBM these should be nodes that have the same probability of connection to the rest of the network. This would be a case of model mismatch, and hence it would be difficult to interpret what the building blocks actually mean. Another option, that we can address more directly, is that the model being used underfits the data, i.e. the model formulation fails to recognize the available statistical evidence, resulting in the choice of simpler SBMs with fewer groups, such that some "true" groups are merged together. A common cause of underfitting is the use of noninformative priors which overly penalize larger numbers of groups, as was shown in Ref. [39]. The use of hierarchical priors solves this particular underfitting problem, as discussed in Refs. [22,40]. Another potential cause for underfitting is the use of Poisson formulations for the SBM for networks with heterogeneous density, which assumes that the observed simple graph is a possible realization of a multigraph model that generates simple graphs with a very small probability. Ref. [41] introduced an alternative SBM variation based on a simple but consequential modification of the Poisson SBMs, where multigraphs are generated at a first stage, and the multiedges are converted into simple edges, resulting in a Bernoulli distribution obtained from the cumulative Poisson distribution. These "latent Poisson" SBMs also prevent underfitting, and in fact make the posterior distribution concentrate on the correct answer for the examples considered by Riolo and Newman [9], as shown in Ref. [41]. In Fig. 10 we show our method employed on the posterior distribution of the political books network using the latent Poisson DC-SBM with nested priors, which should be able to correct the kinds of underfitting mentioned above. Indeed, the most likely mode shows a more elaborate division of the network into B = 8 groups, corresponding to particular subdivisions of the same liberal-neutral-conservative groups seen previously. However, these subdivisions are not quite the same as those seen in Fig. 9 for the Poisson SBM. Therefore, in this example it would be futile to search for these uncovered groups in the posterior distribution of the Poisson DC-SBM, even if we search for overlaps between partitions.
However, despite the more detailed division of the network, the latent Poisson SBM is far from being a perfect fit for this network, as we still observe K = 11 modes, corresponding mostly to different divisions of the "conservative" books. When comparing the structure of the different modes, we see that these are not simple combinations of the same subdivisions, but rather different rearrangements. This seems to point to a kind of structure in the network that is not fully captured by a strict division of the nodes into discrete categories, at least not in the manner assumed by the SBM. In Fig. 11 we also compare the inferences obtained with both SBM models for the karate club network considered previously. The posterior distribution obtained with the Poisson DC-SBM is very heterogeneous, with K = 30 modes. Its most plausible mode is composed of a single partition into a single group (implying that the degree sequence alone is enough to explain the network, and no community structure is needed). The second most likely mode corresponds to leader-follower partitions, largely dividing the nodes according to degree (despite the degree correction). The putative division of this network into two assortative communities comes only as the ninth most likely mode. With such an extreme heterogeneity between partitions, finding a consensus between them seems particularly futile, thus explaining the point estimates obtained in Fig. 8, in particular the behavior of the RMI estimate, which tries to assemble all diverging modes into a single partition. On the other hand, with the latent Poisson SBM the posterior distribution changes drastically, as shown in the right panel of Fig. 11. In this case the dominating mode corresponds to partitions that, while not fully identical to the accepted division, are more compatible with it, as they only further divide one of the communities into two extra groups. The commonly accepted division itself comes as a typical partition of the second most likely mode. Overall, the posterior distribution becomes more homogeneous, with only K = 9 modes identified, and with most of the posterior probability assigned to the first few.
It is important to observe that the heterogeneity of the posterior distribution cannot by itself be used as a criterion to decide which model is a better fit. Indeed, a typical behavior encountered in statistical inference is the "bias-variance trade-off" [42], where a more accurate representation of the data comes at the cost of increased variance in the set of answers. We illustrate this with a network of American football games [43] shown in Fig. 12. The Poisson DC-SBM yields a very simple posterior distribution, strongly concentrated on a typical partition into B = 10 groups. On the other hand, as seen in Fig. 13, the latent Poisson DC-SBM yields a more heterogeneous posterior distribution with K = 7 modes, typically uncovering a larger number of groups. It would be wrong to conclude that the Poisson SBM provides a better fit only because it concentrates on a single answer, if that single answer happens to be underfitting. But from this analysis alone it is also not possible to say whether the latent Poisson SBM is overfitting. To make the final decision, we need to compute the total evidence for each model, as we will consider in Sec. VIII. This computation takes the heterogeneity of the posterior distribution into consideration, but combined with the model plausibility.
Before we proceed with model selection, we first show how the methods constructed so far can be generalized to hierarchical partitions, which form the basis of generically better-fitting models of community structure in networks [22].

VII. HIERARCHICAL PARTITIONS

In the hierarchical formulation of the SBM, the groups of the network are themselves clustered in their own meta-groups, associated with a coarse-grained version of the network described via its own smaller SBM, and so on recursively, resulting in a nested version of the model [22,40]. This hierarchical formulation recovers the usual SBMs when the hierarchy has only a single level, and also introduces many useful properties, including a dramatically reduced tendency to underfit large networks [22,40], as well as a simultaneous description of the network structure at several scales of resolution. This model variant takes as parameters a hierarchical partition $\bar b = \{b^1,\dots,b^L\}$, where $b^l_i$ is the group membership of node $i$ in level $l$, and each group label in level $l$ is a node in the level $l+1$ above, which results in the number of nodes in level $l$ being the number of groups in the level below, $N_l = B_{l-1}$, except for the first level, where $N_1 = N$. For this model, we have a posterior distribution over hierarchical partitions given by

$$P(\bar b \mid A) = \frac{P(A\mid \bar b)\,P(\bar b)}{P(A)}.$$

Like in the non-hierarchical case, this posterior distribution is invariant to label permutations, i.e. $P(\bar b\mid A) = P(\bar c\mid A)$
if $\bar b$ and $\bar c$ are identical up to a relabelling of the groups. However, in the hierarchical scenario the group relabellings that keep the posterior distribution invariant must preserve the same partitions when projected onto the lower levels. In other words, an invariant permutation of the labels in level $l$ affects the nodes in level $l+1$. More specifically, if we consider a bijection $\mu(r)$ for the labels at level $l$, such that $b^l_i = \mu(c^l_i)$, then we must change the memberships in level $l+1$ to $b^{l+1}_{\mu(i)} = c^{l+1}_i$. If two hierarchical partitions $\bar b$ and $\bar c$ are identical up to this kind of transformation, we denote this with the indicator function $[\bar b\sim\bar c] = 1$, or $[\bar b\sim\bar c] = 0$ otherwise. Based on this, we can generalize the random label model considered before to model hierarchical partitions sampled from the posterior distribution. We first assume that the labels at all levels are sampled independently as

$$P(\bar c\mid\bar p) = \prod_{l=1}^{L}\prod_{i=1}^{N_l} p^l_i(c^l_i),$$

where $p^l_i(r)$ is the probability that node $i$ in level $l$ belongs to group $r$. After sampling a partition $\bar c$, we then obtain a final partition $\bar b$ by choosing uniformly among all label permutations that respect the hierarchical structure described above. If we now consider $M$ sampled hierarchical partitions $\{\bar b\} = \{\bar b^{(1)},\dots,\bar b^{(M)}\}$, the posterior distribution of the hidden relabelled hierarchical partitions $\{\bar c\}$ takes the same form as in the non-hierarchical case (Eq. 72), with the label counts now accumulated independently at each level. Therefore, we can find the maximum of Eq. 72 once more by solving a maximum weight bipartite matching problem at each level, with weights given by $w_{rs}$. This leads to an overall algorithm entirely analogous to the non-hierarchical case where, starting from some configuration, we remove a sample $m$ from the ensemble and add it again, choosing its labels according to the maximum weight matching with the remaining partitions. The mixed random label model of Sec. VI can also be generalized in a straightforward manner for hierarchical partitions, i.e.
where inside a mode k the partitions are sampled according to the hierarchical random label model given by Eq. 68. The inference algorithm from this point onward is exactly the same as in the non-hierarchical case, where we need only to relabel the hierarchical partitions according to Eq. 72 when we move them between modes.
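As a concrete illustration of this relabelling step, the sketch below aligns one hierarchical partition against a reference, level by level, using maximum weight bipartite matching (via scipy's linear_sum_assignment). The data layout (a list of label arrays, one per level) and the simplifying assumption that both hierarchies use the same number of groups at every level are ours, for illustration only; this is not the authors' reference implementation.

import numpy as np
from scipy.optimize import linear_sum_assignment

def align_hierarchy(b_ref, b_new):
    """Relabel the hierarchical partition b_new (a list of label arrays,
    one per level, with level-l labels acting as node indices of level l+1)
    so that it matches b_ref level by level, via maximum weight bipartite
    matching on the label co-occurrence counts.  Sketch only: assumes both
    hierarchies use the same number of groups at every level."""
    aligned = []
    perm = None                                  # node re-indexing inherited from the level below
    for x, y in zip(b_ref, b_new):
        x, y = np.asarray(x), np.asarray(y)
        if perm is not None:
            y = y[perm]                          # groups of level l-1 are the nodes of level l
        B = int(max(x.max(), y.max())) + 1
        w = np.zeros((B, B))
        np.add.at(w, (x, y), 1)                  # contingency table between x^l and y^l
        rows, cols = linear_sum_assignment(-w)   # maximum weight matching
        mu = np.empty(B, dtype=int)
        mu[cols] = rows                          # bijection: label r of y -> label mu[r] of x
        aligned.append(mu[y])
        inv = np.empty(B, dtype=int)
        inv[mu] = np.arange(B)                   # new node j at level l+1 is old node inv[j]
        perm = inv
    return aligned

The key design point is that the bijection found at a given level also re-indexes the nodes of the level above, since the groups of one level are the nodes of the next.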
In Fig. 14 we show the inferred modes for hierarchical partitions sampled from the posterior distribution using the nested latent Poisson DC-SBM for a co-occurrence network of characters of the Les Misérables novel [44]. As this example shows, this algorithm allows us to summarize a multimodal distribution of hierarchical partitions in a rather compact manner. In this particular example we see that the distribution is fairly dominated by one of the modes (shown in Fig. 14a), followed by less probable alternatives.

Comparing and finding consensus between hierarchical partitions
If we infer the hierarchical random label model above for two hierarchical partitions $\bar x$ and $\bar y$, it amounts to solving a recursive maximum weight bipartite matching problem at every level, from $l=1$ to $l=L$, using as weights the contingency table at each level $l$,

$$m^l_{rs} = \sum_{i\in\mathcal{N}^l_x\cap\,\mathcal{N}^l_y}\delta_{x^l_i,r}\,\delta_{\hat y^l_i,s},$$

where $\mathcal{N}^l_x$ is the set of nodes in partition $x^l$ (as upper-level partitions might have disjoint sets of nodes), and propagating the matched labels to the upper levels. This is equivalent to maximizing the recursive overlap across all levels,

$$\omega(\bar x,\bar y) = \max_{\{\mu_l\}}\sum_{l=1}^{L}\sum_{i}\delta_{x^l_i,\,\mu_l(\hat y^l_i)},$$

where at each level we need to incorporate the relabelling at the lower levels via $\hat y^l_{\mu_{l-1}(i)} = y^l_i$, with $\mu_l$ a label bijection at level $l$ and the boundary condition $\mu_0(i) = i$. This leads us to the hierarchical maximum overlap distance, defined as

$$d(\bar x,\bar y) = \sum_{l=1}^{L}\left[N_l - \omega_l(\bar x,\bar y)\right],$$

where $N_l = \max(|\mathcal{N}^l_x|,|\mathcal{N}^l_y|)$ and $\omega_l$ is the maximum overlap at level $l$. A version of this distance that is normalized in the range $[0,1]$ can be obtained by dividing it by the largest possible value, $\sum_l(N_l-1)$. It is important to note here that hierarchy levels with a single node, $N_l = 1$, always have a contribution of zero to the distance, therefore this measure can be applied to infinite hierarchies with $L\to\infty$, as long as the nodes are eventually grouped into a single group at some level. For hierarchies with a single level, $L = 1$, we recover the maximum overlap distance considered previously, except for the normalized version, which is slightly different, being $d(x,y)/(N-1)$. This is also a valid normalization for the non-hierarchical distance, since we must always have $d(x,y)\le N-1$. Following the same steps as before, we can use the hierarchical maximum overlap distance as an error function $\epsilon(\hat{\bar b},\bar b) = d(\hat{\bar b},\bar b)$ to define a MOC estimator over hierarchical partitions based on the minimization of the mean posterior loss. Substituting its definition leads us to a set of self-consistent equations at each level $l$, with the marginal distributions obtained over the relabelled partitions, where the relabelling is done in order to maximize the overlap with $\hat{\bar b}$, and where once again we need to recursively incorporate the relabellings at the lower levels. We can define an uncertainty $\sigma_{\hat{\bar b}}\in[0,1]$ for this estimator by inspecting the marginal distributions computed along the way, omitting levels with $N_l = 1$, since those always have a trivial marginal distribution concentrated on a single group. In practice we implement this estimator by sampling a set of $M$ hierarchical partitions $\{\bar b\}$ from the posterior distribution and then performing the sequential maximizations from $l=1$ to $l=L$, using at each level the contingency table $m^{(l,m)}_{rs}$ between level $l$ of sample $m$ and the current estimate $\hat b^l$. The final solution is obtained when repeating the above maximization no longer changes the result. Like in the non-hierarchical case, this algorithm yields a local optimum of the optimization problem, but not necessarily a global one, therefore it needs to be repeated multiple times with different initial conditions, and the best result kept. Since it involves the relabelling of all $M$ hierarchical partitions, the overall algorithmic complexity of a single iteration is $O(MNB + MB^3)$, assuming once more the typical case with $N_l = O(N/\sigma^{l-1})$ and $L = O(\log N)$.
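Assuming the level-by-level alignment sketched earlier has already been applied, the hierarchical maximum overlap distance and its normalized version reduce to counting per-level mismatches, as in this illustrative sketch (function and argument names are ours):

import numpy as np

def hierarchical_overlap_distance(bx, by_aligned, normalized=False):
    """Hierarchical maximum overlap distance, assuming `by_aligned` has
    already been relabelled against `bx` level by level (e.g. with the
    matching sketch shown earlier), so that the per-level overlap is just
    the number of agreeing labels.  Levels with a single node are skipped."""
    dist = norm = 0
    for x, y in zip(bx, by_aligned):
        x, y = np.asarray(x), np.asarray(y)
        if len(x) == 1:
            continue                           # single-node levels contribute nothing
        dist += len(x) - int((x == y).sum())   # N_l - omega_l
        norm += len(x) - 1                     # largest possible value at this level
    return dist / norm if normalized else dist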

VIII. MODEL SELECTION AND EVIDENCE APPROXIMATION
If we are interested in comparing two models $\mathcal{M}_1$ and $\mathcal{M}_2$ with respect to their plausibility for generating some network $A$, we can do so by computing the ratio of their posterior probabilities given the data,

$$\frac{P(\mathcal{M}_1\mid A)}{P(\mathcal{M}_2\mid A)} = \frac{P(A\mid\mathcal{M}_1)\,P(\mathcal{M}_1)}{P(A\mid\mathcal{M}_2)\,P(\mathcal{M}_2)}.$$

Therefore, if we are a priori agnostic about either model, with $P(\mathcal{M}_1) = P(\mathcal{M}_2)$, this ratio is determined by the total probability of the data $P(A\mid\mathcal{M})$ according to each model. This quantity is called the evidence, and appears as a normalization constant in the posterior distribution of Eq. 1. It is obtained by summing the joint probability of data and partitions over all possible partitions,

$$P(A\mid\mathcal{M}) = \sum_b P(A,b\mid\mathcal{M}).$$

Unfortunately, the exact computation of this sum is intractable, since the number of partitions is too large in most cases of interest. It also cannot be obtained directly from samples of the posterior distribution, which makes its estimation via MCMC very challenging.
To illustrate this, it is useful to write the logarithm of the evidence in the following manner,

$$\ln P(A) = \langle\ln P(A,b)\rangle + H(b), \qquad (90)$$

where

$$\pi(b) = \frac{P(A,b)}{P(A)}$$

is the posterior distribution of Eq. 1,

$$\langle\ln P(A,b)\rangle = \sum_b \pi(b)\ln P(A,b)$$

is the mean joint log-probability computed over the posterior distribution, and finally

$$H(b) = -\sum_b \pi(b)\ln\pi(b)$$

is the entropy of the posterior distribution. Eq. 90 has the shape of a negative Gibbs free energy of a physical ensemble, if we interpret $\langle\ln P(A,b)\rangle$ as the mean negative "energy" over the ensemble of partitions. It tells us that what contributes to the evidence is not only the mean joint probability, but also the multiplicity of solutions with similar probabilities, which is captured by the posterior entropy. In this formulation, we see that while it is possible to estimate $\langle\ln P(A,b)\rangle$ from MCMC simply by averaging $\ln P(A,b)$ over sufficiently many samples, the same approach does not work for the entropy term $H(b)$, since it would require the computation of the log-posterior $\ln\pi(b)$ for every sample, something that cannot be done without knowing the normalization constant $P(A)$, which is what we want to find in the first place. However, the mixed random label model of Sec. VI can be used to fit the posterior distribution, allowing us to compute the entropy term via the inferred model, and to use the rich information gained on its structure to perform model selection. Let us recall that the mixed random label model, when inferred from partitions sampled from $\pi(b)$, amounts to an approximation given by

$$\pi(b) \approx \sum_k w_k\sum_c P(b\mid c)\,P(c\mid k),$$

where $P(k) = w_k$ determines the mode mixture,

$$P(c\mid k) = \prod_i p^k_i(c_i)$$

are the independent marginal distributions of mode $k$, and $P(b\mid c)$ is the uniform random relabelling of the groups. In most cases we have investigated, the inferred modes tend to be very well separated (otherwise they would get merged together into a larger mode), such that we can assume that each sampled partition is generated by a single mode. This means we can write the entropy as

$$H(b) \approx H(\boldsymbol w) + \sum_k w_k\left[H_k + \langle\ln B(b)!\rangle_k\right],$$

where

$$H(\boldsymbol w) = -\sum_k w_k\ln w_k$$

is the entropy of the mode mixture distribution,

$$H_k = -\sum_i\sum_r p^k_i(r)\ln p^k_i(r)$$

is the entropy of mode $k$, and $\langle\ln B(b)!\rangle_k$ is the relabelling entropy. Putting it all together, we have the following approximation for the evidence according to the mixed random label model,

$$\ln P(A) \approx \langle\ln P(A,b)\rangle + \langle\ln B(b)!\rangle + H(\boldsymbol w) + \sum_k w_k H_k.$$

We can extend this for hierarchical partitions in an entirely analogous way, with the corresponding sums carried over all hierarchy levels. The above quantities are then computed by sampling $M$ partitions from the posterior distribution, using them (or a superset thereof) to compute the first two means $\langle\ln P(A,b)\rangle$ and $\langle\ln B(b)!\rangle$, then fitting the mixed random label model, from which the parameters $w$ and $p$ are obtained, and finally computing the remaining terms.

Table I. Description length (negative log-evidence) $\Sigma = -\ln P(A)$ for several networks and SBM variations, with DC and NDC indicating degree-corrected and non-degree-corrected, respectively. The shaded cells indicate the smallest value for each model class, with dark grey indicating the best-fitting model overall. The "single partition" columns correspond to the two-part description length $\Sigma = -\ln P(A,b)$ obtained with the best-fitting partition of the Poisson model.
In Table I we show the evidence obtained for several SBM variants and datasets, including latent Poisson versions (which require special considerations, see Appendix C). Overall, we find that when considering the Poisson SBMs, degree correction is only favored for the larger networks, corroborating a similar previous analysis based on a less accurate calculation [22]. This changes for the latent Poisson models, where for some networks the balance tips in favor of degree correction. Overall, we find more evidence for the latent Poisson models for all networks considered, which is unsurprising given that they are all simple graphs. Likewise, we always find more evidence for the hierarchical SBMs, which further demonstrates their more flexible nature.
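As a rough illustration (not the authors' implementation), the evidence approximation described above can be assembled from its ingredients as follows, assuming the mixed random label model has already been fitted, yielding the mode weights $w_k$ and node marginals $p^k_i(r)$; all function and argument names are ours:

import numpy as np

def log_evidence_approx(log_joint, log_Bfact, w, marginals):
    """Approximate ln P(A) via the posterior-entropy decomposition of Eq. 90,
    using the fitted mixed random label model: mean joint log-probability plus
    mode-mixture, within-mode and relabelling entropy terms.

    log_joint : array of ln P(A, b) values for the sampled partitions
    log_Bfact : array of ln B(b)! values for the sampled partitions
    w         : mode weights w_k of the fitted mixture
    marginals : marginals[k][i] is the distribution p_i^k(r) of node i in mode k
    """
    w = np.asarray(w, dtype=float)
    mean_log_joint = np.mean(log_joint)          # <ln P(A, b)>
    mean_log_Bfact = np.mean(log_Bfact)          # relabelling entropy term <ln B(b)!>
    H_w = -np.sum(w * np.log(w))                 # entropy of the mode mixture
    H_modes = 0.0                                # sum_k w_k H_k (within-mode entropies)
    for wk, pk in zip(w, marginals):
        Hk = 0.0
        for p in pk:
            p = np.asarray(p, dtype=float)
            p = p[p > 0]
            Hk -= np.sum(p * np.log(p))
        H_modes += wk * Hk
    return mean_log_joint + mean_log_Bfact + H_w + H_modes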

A. Bayesian evidence and the minimum description length (MDL) criterion
In this section we briefly explore some direct connections between Bayesian model selection and the minimum description length (MDL) criterion from information theory [51]. We begin by pointing out the simple fact that the MAP point estimate given by the single most likely partition yields a lower bound for the evidence, i.e.

$$P(A) = \sum_b P(A,b) \ge \max_b P(A,b). \qquad (107)$$
This means that taking into account the full posterior distribution, rather than only its maximum, can almost always be used to compress the data, as we now show. We can see this by inspecting first the usual "two-part" description length,

$$\Sigma_1(A,b) = -\ln P(A,b) = -\ln P(A\mid b) - \ln P(b),$$

which corresponds to the amount of information necessary to describe the data if one first describes the partition $b$ and then, conditioned on it, the network $A$. Therefore, finding the most likely partition $b$ means finding the one that most compresses the network, according to this particular two-part encoding. However, the full posterior distribution gives us a more efficient "one-part" encoding where no explicit description of the partition is necessary. Simply defining the joint distribution $P(A,b)$ means we can compute the marginal probability $P(A) = \sum_b P(A,b)$, which yields directly a description length

$$\Sigma_2(A) = -\ln P(A).$$

According to Eq. 107 we have

$$\Sigma_2(A) \le \min_b \Sigma_1(A,b),$$

which means that considering all possible partitions can only increase the compression. In Table I we can verify that this holds for all results obtained. In a slightly more concrete setting, let us consider a transmitter who wants to convey the network $A$ to a receiver, where both know the joint distribution $P(A,b)$. According to the two-part code, the transmitter first sends the partition $b$, using $-\log_2 P(b)$ bits, and then sends the network using $-\log_2 P(A\mid b)$ bits, for a total of $\Sigma_1(A,b)/\ln 2$ bits. In practice, this is achieved, for example, by both sender and receiver sharing the same two tables of optimal prefix codes derived from $P(b)$ and $P(A\mid b)$. On the other hand, using the one-part code, both transmitter and receiver share only a single table of optimal prefix codes derived directly from the marginal distribution $P(A)$, which means that only $\Sigma_2(A)/\ln 2 = -\log_2 P(A)$ bits need to be transmitted. In practice, it is more difficult to construct the one-part code, since it involves marginalizing over a high-dimensional distribution, which is intractable via brute force, although our mixed random label model can be used as the basis of an analytical approximation. However, what is important in our model selection context is only that such a code exists, not its computational tractability.
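As a toy numerical illustration of this inequality (with made-up probabilities, purely for the sake of the example), we can compare the two code lengths directly:

import numpy as np

# Hypothetical joint distribution P(A, b) over one observed network A and a
# handful of candidate partitions b.  The two-part code pays for the best
# single partition, while the one-part code pays only for the marginal
# P(A) = sum_b P(A, b), and is therefore never longer.
P_Ab = np.array([0.020, 0.015, 0.012, 0.003])     # P(A, b) for four partitions b

sigma1 = -np.log2(P_Ab.max())      # two-part description length (bits), best partition
sigma2 = -np.log2(P_Ab.sum())      # one-part description length (bits)

print(f"two-part: {sigma1:.3f} bits, one-part: {sigma2:.3f} bits")
assert sigma2 <= sigma1            # Eq. 107: marginalizing can only help compression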

IX. CONCLUSION
We have shown how the random label model can be used to solve the group identification problem in community detection, allowing us to compute marginal distributions of group membership on the nodes in an unambiguous way. This led us to the notion of the maximum overlap distance as a general way of comparing two network partitions, which we then used as a loss function to obtain the consensus of a population of network partitions. By investigating the behavior of different loss functions on artificial and empirical ensembles of heterogeneous partitions, we have demonstrated that they can yield inconsistent results, due precisely to a lack of uniformity between divisions. We then developed a more comprehensive characterization of the posterior distribution, based on a mixed version of the random label model that is capable of describing multimodal populations of partitions, where multiple consensuses exist at the same time. This kind of structure corresponds to a "multiple truth" phenomenon, where a model can yield diverging hypotheses for the same data. We showed how our method provides a compact representation for structured populations of network partitions, and allows us to assess the quality of fit and perform model selection. The latter was achieved by using the multimodal fit of the posterior distribution as a proxy for the computation of its entropy, which is a key but often elusive ingredient in Bayesian model selection.
Although we have focused on community detection, the methods developed here are applicable to any kind of clustering problem for which a population of answers can be produced. They allow us to be more detailed in our assessment of the consistency of results when applied to real or artificial data. In particular, we no longer need to rely on "point estimates" that can give a very misleading picture of high-dimensional and structured populations of partitions, even if they attempt to assemble a consensus among them. We achieve this without losing interpretability, as our method yields groupings of partitions that share a local consensus, each telling a different version of how the data might have been generated, and weighted according to the statistical evidence available.
As described in the main text, the random label model yields a description length for a pair of partitions given by

$$\Sigma(x,y) = -\ln P(x,y) \qquad (B1)$$
$$= N\ln[B(B+1)] - \omega(x,y)\ln 2. \qquad (B2)$$

Likewise, if we observe $y$ and use it to describe partition $x$, the additional amount of information we need to convey is

$$\Sigma(x\mid y) = -\ln P(x\mid y) \qquad (B3)$$
$$= -\ln P(x,y)/P(y) \qquad (B4)$$
$$= N\ln(B+1) - \omega(x,y)\ln 2, \qquad (B5)$$

where we have used $P(y) = 1/B^N$ from Eq. 13. From this we can note that this encoding is sub-optimal, in the sense that even when the overlap is maximal, with $\omega(x,y) = N$, the additional information needed to encode $x$ is $\Sigma(x\mid y) = N\ln[(B+1)/2]$, which scales as $O(N)$ when $B > 1$. Nevertheless, we can develop a different encoding that makes more efficient use of the overlap information. We do so by incorporating it as an explicit parameter as follows:

1. We sample an overlap value $\omega$ uniformly in the range $[1, N]$, such that $P(\omega) = 1/N$.

2. We choose a subset $V_\omega$ of the $N$ nodes of size $\omega$, uniformly with probability $P(V_\omega\mid\omega) = \binom{N}{\omega}^{-1}$.

3. For the nodes in $V_\omega$ we sample a partition $z$ with probability $P(z\mid V_\omega,\gamma) = \prod_{i\in V_\omega}\gamma_{z_i}$, which leads to a marginal distribution

$$P(z\mid V_\omega) = \int P(z\mid V_\omega,\gamma)P(\gamma)\,\mathrm{d}\gamma \qquad (B9)$$
$$= \frac{(B-1)!\prod_r n_z(r)!}{(\omega+B-1)!}, \qquad (B10)$$

where $n_z(r) = \sum_{i\in V_\omega}\delta_{z_i,r}$, assuming a uniform prior $P(\gamma) = (B-1)!$.

4. For the nodes $i\notin V_\omega$ we sample the remaining labels of $x$ and $y$ independently and in the same manner, leading to analogous marginal distributions $P(x\mid V_\omega)$ and $P(y\mid V_\omega)$, with $n_x(r) = \sum_{i\notin V_\omega}\delta_{x_i,r}$ and $n_y(r) = \sum_{i\notin V_\omega}\delta_{y_i,r}$.

5. For the nodes $i\in V_\omega$ we set $x_i = y_i = z_i$, and we choose a label bijection $\mu$ uniformly at random from the set of size $B!$ and use it to relabel either $x$ or $y$ arbitrarily.
In the end, this model generates partitions $x$ and $y$ that have an overlap of at least $\omega$, although the actual overlap can be larger by chance. The scheme above allows groups to be unpopulated in the final partition, which is sub-optimal, but this can be neglected for our current purpose. The final joint probability of this scheme is

$$P(x,y,z,V_\omega,\omega,\mu) = P(x\mid V_\omega)P(y\mid V_\omega)P(z\mid V_\omega)P(V_\omega\mid\omega)P(\omega)P(\mu), \qquad (B11)$$

which leads to a description length $\Sigma(x,y,z,V_\omega,\omega,\mu) = -\ln P(x,y,z,V_\omega,\omega,\mu)$. The minimum description length for $x$ and $y$ is then given by

$$\Sigma(x,y) = \min_{z,V_\omega,\omega,\mu}\Sigma(x,y,z,V_\omega,\omega,\mu),$$
which corresponds simply to finding the maximum overlap $\omega(x,y)$ and the corresponding label matching between $x$ and $y$, from which $V_\omega$, $z$, and $\mu$ can be derived. It is easy to see now that if the overlap is maximal, with $\omega = N$, the additional information needed to describe $x$ given $y$ is vanishingly small with respect to $N$, and hence the code is efficient in this case. It is instructive to compare the above scheme with the reduced mutual information (RMI) encoding recently proposed in Ref. [35]. It corresponds to a three-part scheme where one first encodes partition $y$, then the full contingency table $m_{rs}$ between both partitions, and finally the remaining partition $x$, leading to a description length

$$\Sigma_{\text{RMI}}(x,y) = \ln\binom{N-1}{B_y-1} + \ln\binom{N-1}{B_x-1} + \ln\frac{N!}{\prod_r n_y(r)!} + \sum_r\ln\frac{n_x(r)!}{\prod_s m_{rs}!} + \ln\Omega(n_x,n_y), \qquad (B17)$$

where $B_x$ and $B_y$ are the number of labels in partitions $x$ and $y$, and $\Omega(n_x,n_y)$ is the number of possible contingency tables with row and column sums given by $n_x$ and $n_y$, which cannot be computed in closed form, but for which approximations are available (see Ref. [35]). Note that the encoding above is not symmetric, i.e. in general $\Sigma_{\text{RMI}}(x,y) \neq \Sigma_{\text{RMI}}(y,x)$, as the overall description length will depend on which partition is encoded first (although the relative description length $\Sigma_{\text{RMI}}(x) - \Sigma_{\text{RMI}}(x,y)$ is always symmetric). Therefore the minimum description length amounts to choosing the optimal partition to encode first,

$$\Sigma_{\text{RMI}}(x,y) = \min\left[\Sigma_{\text{RMI}}(x,y),\,\Sigma_{\text{RMI}}(y,x)\right]. \qquad (B18)$$

In Fig. 15 we compare the compression of two partitions sampled independently from the DC-SBM posterior distribution of 571 empirical networks selected from the Konect [52] and CommunityFitNet [53] repositories. Overall, we observe somewhat mixed results, with the overlap encoding providing a better compression for around 61% of the networks. As we might expect, the overlap encoding tends to provide a better description if the overlap between partitions is very high, such that a full description of the non-matching nodes becomes superfluous. Otherwise, for highly differing partitions, the RMI encoding is able to capture similarities more efficiently.
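For concreteness, Eqs. (B2) and (B5) can be evaluated directly once the maximum overlap is known; the sketch below (our own illustration, not part of any library) obtains $\omega(x,y)$ by maximum weight bipartite matching on the contingency table:

import numpy as np
from scipy.optimize import linear_sum_assignment

def pair_description_length(x, y, B):
    """Description lengths of Eqs. (B2) and (B5) for a pair of partitions
    x, y with labels in {0, ..., B-1}; the maximum overlap omega(x, y) is
    obtained via maximum weight bipartite matching on the contingency table."""
    x, y = np.asarray(x), np.asarray(y)
    N = len(x)
    w = np.zeros((B, B))
    np.add.at(w, (x, y), 1)                               # contingency table
    rows, cols = linear_sum_assignment(-w)
    omega = w[rows, cols].sum()                           # maximum overlap omega(x, y)
    Sigma_xy = N * np.log(B * (B + 1)) - omega * np.log(2)       # Eq. (B2)
    Sigma_x_given_y = N * np.log(B + 1) - omega * np.log(2)      # Eq. (B5)
    return Sigma_xy, Sigma_x_given_y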

Appendix C: Evidence for latent Poisson SBMs
The latent Poisson SBMs of Ref. [41] are generative models for simple graphs, where at first a multigraph $G$ is generated with probability $P(G\mid b)$, and the observed simple graph $A$ is then obtained by collapsing the multiedges. Because of the latent multiedges, we need to approximate the evidence in a similar, but slightly different, manner. We write the log evidence as

$$\ln P(A) = \langle\ln P(A,G,b)\rangle + H(G,b),$$

where the averages are taken over

$$\pi(G,b) = \frac{P(A,G,b)}{P(A)},$$

the joint posterior distribution of the latent multigraphs and partitions. For our approximation we assume the factorization

$$\pi(G,b) \approx \pi(G)\,\pi(b),$$

together with the "mean-field" approximation over the latent multiedges,

$$\pi(G) \approx \prod_{i<j}\pi_{ij}(G_{ij}),$$

with the marginals $\pi_{ij}$ estimated via MCMC, so that the latent edge entropy can be computed as

$$H(G) = -\sum_{i<j}\sum_{m}\pi_{ij}(m)\ln\pi_{ij}(m).$$

From this we obtain the final approximation

$$\ln P(A) \approx \langle\ln P(A,G,b)\rangle + H(G) + H(b),$$

where $H(b)$ is computed using the mixed random label model as done in the main text. The approximation for the hierarchical model follows analogously.
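A minimal sketch of the mean-field latent edge entropy above, assuming the MCMC samples are stored as sparse dictionaries mapping node pairs to multiedge multiplicities (a data layout we assume here purely for illustration):

import numpy as np
from collections import Counter

def latent_edge_entropy(multiedge_samples):
    """Mean-field entropy of the latent multiedges: for every node pair,
    estimate the marginal distribution pi_ij(m) of its multiplicity m from
    the MCMC samples, and sum the corresponding entropies.  Pairs that never
    carry a multiedge have pi_ij(0) = 1 and contribute nothing, so they can
    be left out of the sparse dictionaries."""
    M = len(multiedge_samples)
    counts = {}
    for sample in multiedge_samples:
        for pair, m in sample.items():
            counts.setdefault(pair, Counter())[m] += 1
    H = 0.0
    for c in counts.values():
        c[0] += M - sum(c.values())      # samples in which the pair had no multiedge
        for n in c.values():
            if n > 0:
                p = n / M
                H -= p * np.log(p)
    return H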