Statistical inference of assortative community structures

We develop a principled methodology to infer assortative communities in networks based on a nonparametric Bayesian formulation of the planted partition model. We show that this approach succeeds in finding statistically significant assortative modules in networks, unlike alternatives such as modularity maximization, which systematically overfits in both artificial and empirical examples. In addition, we show that our method is not subject to a resolution limit, and can uncover an arbitrarily large number of communities, as long as there is statistical evidence for them. Our formulation is amenable to model selection procedures, which allow us to compare it to more general approaches based on the stochastic block model, and in this way reveal whether assortativity is in fact the dominating large-scale mixing pattern. We perform this comparison with several empirical networks, and identify numerous cases where the network's assortativity is exaggerated by traditional community detection methods, and we show how a more faithful degree of assortativity can be identified.


I. INTRODUCTION
Community detection is one of the most central methods in network science [1,2], and it consists in the algorithmic partition of the nodes of a network into cohesive groups, according to a mathematical definition of this concept (for which there are many). Historically, most community detection methods proposed have focused on the detection of assortative communities, i.e. groups of nodes that tend to be more connected to themselves than to other nodes in the network. However, there are also community detection methods that are more general, and attempt to cluster together nodes that have similar patterns of connection, regardless of whether they are assortative or not [3][4][5]. The widespread use of assortative community detection methods has led to the belief that the presence of communities is a pervasive feature of many different kinds of real networks [6]. Although the concept of assortativity is a central one in the study of social networks (known as "homophily" in that context) [7], and is also an appealing construct in biology [8][9][10], it is to some extent unclear if the perceived assortativity of many networks is a byproduct of using algorithms that can only find this kind of structure. This is particularly problematic since many popular methods do not take into account the statistical significance of the patterns they uncover, and find seemingly strong community structure in completely random graphs [11,12], as well as in trees [13] and other manifestly non-modular networks [14]. More recently, these shortcomings have been addressed by employing Bayesian inference of generative network models [15], which accounts for statistical significance with a built-in Occam's razor that decides to partition the network into groups only if this is necessary to explain its structure, beyond what can be done by a uniformly random placement of the edges. These approaches, however, are based on general mixing patterns, which include assortativity only as a special case.
In many ways this is useful, and in fact arguably superior, since if assortativity happens to be the dominating pattern, then the general approach will capture it, otherwise it will reveal a different structure. However, having only a more general method at our disposal also has its shortcomings. First, if it is true that assortativity is the main pattern for a class of networks, then the more general representation is needlessly wasteful for them, since it not only gives us more than we need, but in doing so it prevents us from focusing on the more central features, at the cost of algorithmic precision. Second, with a more general method it can be difficult to quantify precisely how much has been wasted in the representation, and what is indeed the simpler pattern hiding inside it.
In this work we develop a Bayesian inference approach designed to uncover assortative communities in networks, based on the planted partition (PP) model [16][17][18], which is itself a special case of the more general stochastic block model (SBM) [3,5]. Our approach is nonparametric, and can uncover communities even when their number is unknown, without overfitting. Furthermore, we show that it does not suffer from the resolution limit present in other approaches, such as modularity maximization [19], and it can find an arbitrarily large number of communities, provided they are statistically significant. We also revisit an existing equivalence between the inference of the PP model and modularity maximization [20,21], and dispel the notion that both methods are interchangeable in practice, by showing that the equivalence is in general inconsistent with maximum likelihood estimation. We also discuss the fact that, even if this were not the case, the parametric nature of that approach would not address the overfitting problem. Our approach is not more complicated to employ than modularity maximization, and can be used as a drop-in replacement for it and other quality functions in popular community detection heuristics [22], and we describe how it can be used with an unbiased merge-split MCMC algorithm [23] that can explore the entire posterior distribution of partitions.
Furthermore, we perform a comparison of the PP model with the more general SBM for a variety of empirical networks, allowing us to determine whether, and to what extent, assortativity is the most salient characteristic of the large-scale network structure. We find a variety of outcomes, ranging from very similar to very different results obtained with both models, demonstrating that there are indeed many cases where searching exclusively for assortative structures can give a very misleading representation of the network.
This work is organized as follows. In Sec. II we present the planted partition model, and we revisit its equivalence with modularity. In Sec. III we describe our Bayesian approach to inferring the PP model, and introduce a more realistic non-uniform version of the model. In Sec. IV we analyse the method when applied to artificial networks with known community structures, and compare it to variations of modularity-based approaches, demonstrating that our method prevents overfitting. We also show that the Bayesian method does not suffer from the resolution limit of modularity maximization, and hence it does not generically underfit either. In Sec. V we employ our method in a variety of empirical networks, and we compare the results with those obtained by the more general SBM, as well as modularity maximization. We demonstrate once more that modularity tends to either massively overfit or underfit, and when comparing to the SBM we can determine the true nature of the assortativity of empirical networks. We conclude in Sec. VI.

II. THE PLANTED PARTITION (PP) MODEL
The statistical inference approach to community detection is based on the definition of generative models that contain communities as part of their parameters. Before we consider the particular case of inferring assortative communities, it is useful first to review the more general case of arbitrary mixing patterns between groups, as characterized by the Poisson degree-corrected stochastic block model (DC-SBM) [5]. This formulation describes a network of $N$ nodes that are divided into $B$ groups, amounting to a partition $b = \{b_i\}$, where $b_i \in [1, B]$ is the group membership of node $i$, and a specific multigraph $A$ is generated with probability
$$P(A|\lambda,\theta,b) = \prod_{i<j}\frac{e^{-\lambda_{b_ib_j}\theta_i\theta_j}(\lambda_{b_ib_j}\theta_i\theta_j)^{A_{ij}}}{A_{ij}!}\times\prod_i\frac{e^{-\lambda_{b_ib_i}\theta_i^2/2}(\lambda_{b_ib_i}\theta_i^2/2)^{A_{ii}/2}}{(A_{ii}/2)!}, \tag{1}$$
where $A_{ij}$ determines the number of edges between nodes $i$ and $j$, and by convention $A_{ii}$ corresponds to twice the number of self-loops incident on node $i$. Note that the parameters $\lambda$ and $\theta$ always appear multiplying each other in the likelihood, so they are not uniquely identifiable, i.e. there are many choices that yield the same model. In order to uniquely specify the model, it is useful to introduce the quantity
$$\hat\theta_r = \sum_i\theta_i\delta_{b_i,r}, \tag{2}$$
which in the model above can be set to arbitrary values, without sacrificing its generality. For example, if we set $\hat\theta_r = 1$, then we can interpret $\lambda_{rs}$ as determining the expected number of edges between groups $r$ and $s$ (or twice this value for $r = s$), and $\theta_i$ is the relative probability with which a node $i$ is selected to form an edge among those that belong to the same group. However, any other choice for $\hat\theta_r$ would be equally valid, with the only immaterial consequence being a different interpretation of the parameters. The maximum likelihood estimate of the above parameters is given by
$$\lambda^*_{rs} = \frac{e_{rs}}{\hat\theta_r\hat\theta_s},\qquad \theta^*_i = \frac{k_i\hat\theta_{b_i}}{e_{b_i}}, \tag{3}$$
where $e_{rs}$ is the number of edges that go between groups $r$ and $s$ (or twice that for $r = s$), and $e_r = \sum_s e_{rs} = \sum_i k_i\delta_{b_i,r}$ is the sum of degrees in group $r$. Indeed, the values of $\hat\theta_r$ cannot be uniquely obtained from these equations, since any value $\hat\theta_r > 0$ offers a valid solution, and more importantly, any choice disappears when we compute the probabilities $\lambda^*_{b_ib_j}\theta^*_i\theta^*_j$. This means we can choose these values independently of the inference procedure, with any particular choice functioning as a mere technical convention.
The degree-corrected planted partition (PP) model corresponds to the special case of the DC-SBM given by
$$\lambda_{rs} = \lambda_{\text{in}}\delta_{rs} + \lambda_{\text{out}}(1-\delta_{rs}). \tag{4}$$
In this situation there are only two parameters that determine the placement of edges between groups, $\lambda_{\text{in}}$ and $\lambda_{\text{out}}$, that set the expected number of edges inside and outside groups. A choice $\lambda_{\text{in}}\sum_r\hat\theta_r^2/2 > \lambda_{\text{out}}\sum_{r<s}\hat\theta_r\hat\theta_s$ corresponds to the assortative case, where edges connect mostly nodes of the same group. Therefore, this model captures what is more typically known as community structure in the proper sense, at least if the condition just mentioned is met. With this parametrization, the model likelihood of Eq. 1 becomes
$$P(A|\lambda_{\text{in}},\lambda_{\text{out}},\theta,b) = \frac{e^{-\lambda_{\text{in}}\sum_r\hat\theta_r^2/2-\lambda_{\text{out}}\sum_{r<s}\hat\theta_r\hat\theta_s}\,\lambda_{\text{in}}^{e_{\text{in}}}\lambda_{\text{out}}^{e_{\text{out}}}\prod_i\theta_i^{k_i}}{\prod_{i<j}A_{ij}!\prod_iA_{ii}!!}, \tag{5}$$
where
$$e_{\text{in}} = \frac{1}{2}\sum_{ij}A_{ij}\delta_{b_i,b_j},\qquad e_{\text{out}} = \frac{1}{2}\sum_{ij}A_{ij}(1-\delta_{b_i,b_j}) \tag{6}$$
are the number of edges inside and outside groups, respectively. The maximum likelihood estimates of the parameters of the PP model are then given by
$$\lambda^*_{\text{in}} = \frac{2e_{\text{in}}}{\sum_r\hat\theta_r^2}, \tag{7}$$
$$\lambda^*_{\text{out}} = \frac{e_{\text{out}}}{\sum_{r<s}\hat\theta_r\hat\theta_s}, \tag{8}$$
$$\theta^*_i = \frac{k_i}{\lambda^*_{\text{in}}\hat\theta_{b_i}+\lambda^*_{\text{out}}\sum_{s\ne b_i}\hat\theta_s}. \tag{9}$$
Looking at this result, we see that, unlike in the general DC-SBM, we no longer have full freedom to choose $\hat\theta_r$, since its maximum likelihood value must be a solution of the following system of nonlinear equations
$$\hat\theta_r = \frac{e_r}{\lambda^*_{\text{in}}\hat\theta_r+\lambda^*_{\text{out}}\sum_{s\ne r}\hat\theta_s}. \tag{10}$$
We recover partial freedom to choose $\hat\theta_r$ in the special situation where all groups are uniform, with $e_r = 2E/B$, in which case $\hat\theta_r$ can take any value, as long as it is the same for every group, i.e. $\hat\theta_r = \hat\theta$. Note that we are, in fact, allowed to make an arbitrary prior assumption for the values of $\hat\theta_r$ before doing inference, making them imposed constraints that are part of our model specification. In this case Eq. 9 becomes simply
$$\theta^*_i = \frac{k_i\hat\theta_{b_i}}{e_{b_i}}. \tag{11}$$
We emphasize, however, that imposing this sort of constraint does jeopardize the degree correction of the model, since the expected degree of node $i$ is given by
$$\langle k_i\rangle = \theta_i\Big(\lambda_{\text{in}}\hat\theta_{b_i}+\lambda_{\text{out}}\sum_{s\ne b_i}\hat\theta_s\Big). \tag{12}$$
If we now substitute the maximum likelihood estimates of Eqs. 11, 7 and 8 in the above, we obtain
$$\langle k_i\rangle = k_i\,\frac{\hat\theta_{b_i}}{e_{b_i}}\left(\frac{2e_{\text{in}}\hat\theta_{b_i}}{\sum_r\hat\theta_r^2}+\frac{e_{\text{out}}\sum_{s\ne b_i}\hat\theta_s}{\sum_{r<s}\hat\theta_r\hat\theta_s}\right). \tag{13}$$
Therefore, the inferred model generates the observed degrees in expectation, i.e. $\langle k_i\rangle = k_i$, only if all groups have the same sum of degrees $e_r = 2E/B$ and all imposed $\hat\theta_r$ are the same, or if we do not impose any constraints on $\hat\theta_r$, and use Eq. 9 instead. This means that we face a trade-off between consistent degree correction and ease of inference with the planted partition model, which is important to keep in mind as we consider the connection between statistical inference and modularity maximization, which we address in the following.
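The loss of degree correction under imposed constraints can be checked numerically. The following Python sketch evaluates the ratio $\langle k_i\rangle/k_i$ for each group when the PP model is fitted by maximum likelihood with imposed group sums $\hat\theta_r$, using the estimates $\lambda^*_{\text{in}} = 2e_{\text{in}}/\sum_r\hat\theta_r^2$, $\lambda^*_{\text{out}} = e_{\text{out}}/\sum_{r<s}\hat\theta_r\hat\theta_s$ and $\theta^*_i = k_i\hat\theta_{b_i}/e_{b_i}$. The function name, the edge-list input format and the two toy graphs are assumptions made for this illustration only, and every group is assumed to contain at least one edge endpoint.

```python
def degree_ratio_under_constraint(edges, b, theta_hat):
    """Return <k_i>/k_i for each group r (the ratio is the same for all
    nodes of a group) under maximum likelihood with imposed theta_hat[r].

    edges     : list of (i, j) node pairs (hypothetical input format)
    b         : list with the group label (0..B-1) of every node
    theta_hat : imposed group sums of the theta parameters
    """
    B = len(theta_hat)
    e_r = [0] * B            # sum of degrees in each group
    e_in = e_out = 0         # edges inside / between groups
    for i, j in edges:
        e_r[b[i]] += 1
        e_r[b[j]] += 1
        if b[i] == b[j]:
            e_in += 1
        else:
            e_out += 1
    s1 = sum(theta_hat)
    s2 = sum(t * t for t in theta_hat)      # sum_r theta_hat_r^2
    pairs = (s1 * s1 - s2) / 2              # sum_{r<s} theta_hat_r theta_hat_s
    lam_in = 2 * e_in / s2
    lam_out = e_out / pairs if pairs > 0 else 0.0
    ratios = []
    for r in range(B):
        rate = lam_in * theta_hat[r] + lam_out * (s1 - theta_hat[r])
        ratios.append(theta_hat[r] / e_r[r] * rate)
    return ratios


# Two triangles joined by one edge: both groups have e_r = 2E/B = 7.
balanced_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
balanced = degree_ratio_under_constraint(balanced_edges, [0, 0, 0, 1, 1, 1], [1, 1])

# A triangle attached to a two-node path: e_r = (7, 3), so e_r != 2E/B.
skewed_edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
skewed = degree_ratio_under_constraint(skewed_edges, [0, 0, 0, 1, 1], [1, 1])
```

With $\hat\theta_r = 1$ the ratio reduces to $2E/(Be_r)$, so it equals one only for the balanced example, in line with the trade-off described above.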
A. On the consistency between statistical inference and modularity maximization
As was shown in Refs. [20,21], it is possible to manipulate the likelihood of the PP model to expose a connection with modularity maximization [25]. We can rewrite the likelihood of Eq. 5, up to unimportant additive constants, in a form (Eq. 14) that involves the quantities
$$\mu = \ln\frac{\lambda_{\text{in}}}{\lambda_{\text{out}}}, \tag{15}$$
$$\gamma = \frac{\lambda_{\text{in}}-\lambda_{\text{out}}}{\ln\lambda_{\text{in}}-\ln\lambda_{\text{out}}}. \tag{16}$$
If we now enforce the following constraint as part of our model specification,
$$\hat\theta_r = \frac{e_r}{\sqrt{2E}}, \tag{17}$$
and substitute the maximum likelihood estimate $\theta^*_i = k_i/\sqrt{2E}$ obtained from Eq. 11, we obtain Eq. 18, again up to unimportant additive constants. Therefore, maximizing the above likelihood with respect to the partition $b$ alone, while keeping $\lambda_{\text{in}}$ and $\lambda_{\text{out}}$ constant, is equivalent to maximizing the generalized modularity [26]
$$Q = \frac{1}{2E}\sum_{ij}\left(A_{ij}-\gamma\frac{k_ik_j}{2E}\right)\delta_{b_i,b_j}, \tag{19}$$
with $\gamma$ playing the role of the resolution parameter. However, before concluding that modularity maximization and the inference of the PP model amount to the same task, we need to make the following crucial observations:
1. The imposed constraint of Eq. 17 involves the knowledge of the sum of all observed degrees in each group $e_r = \sum_i k_i\delta_{b_i,r}$, which cannot be known before doing inference, and thus cannot be part of our model specification. However, any other choice of $\hat\theta_r$ will not yield $\theta^*_i = k_i/\sqrt{2E}$ via maximum likelihood, which is required to recover modularity. Not imposing any prior constraint on $\hat\theta_r$ also does not yield the appropriate value via Eq. 9 in general, and will result in the necessary value of $\theta^*_i$ only for a particular uniform partition of the network where $e_r = 2E/B$ (in which case Eq. 17 holds without being imposed). Therefore, the modularity of Eq. 19 is consistent with maximum likelihood of the PP model only in the very narrow case where all groups have the same sum of degrees.
2. In addition, we must keep in mind that the values of $\lambda_{\text{in}}$ and $\lambda_{\text{out}}$ are never known a priori in empirically relevant settings. Therefore, we are required to infer them as well, together with the network partition. When employing maximum likelihood, the resulting values of $\mu$ and $\gamma$, as well as the second term of Eq. 18, all depend on the network partition, and are no longer just constants. In this situation, the partial equivalence with modularity maximization breaks down (even if $e_r = 2E/B$ as per point 1 above), as the functional form resulting from substituting Eqs. 7-9 and Eq. 16 into Eq. 18 makes the latter very different from Eq. 19. We emphasize that the scheme suggested in Ref. [21] of updating the value of $\gamma$ according to Eq. 16 is insufficient to restore consistency, since the contributions of the non-negligible terms $\mu$ and $E(\ln\lambda_{\text{out}}-\lambda_{\text{out}})$ remain unaccounted for.
Based on the above, we see that the overall connection between the inference of the PP model and modularity maximization is in fact rather tenuous, and we should not expect in general to obtain the same results with both approaches. As explained in Ref. [20], the only statement that can be made is that there exists a particular choice of parameters $\lambda_{\text{in}}$, $\lambda_{\text{out}}$ and $\theta$ such that maximizing modularity with the appropriate choice of $\gamma$ and the PP likelihood conditioned on these parameters will yield the same partition. But since these parameters are unknown in practice, and are in general inconsistent with maximum likelihood estimation, the relevance of this equivalence is arguably limited.
Furthermore, as we discuss in Appendix B, it is easy to establish a formal equivalence between any community detection method and the statistical inference of a suitably chosen generative model. Therefore, the central issue is not whether this mapping exists, but whether the procedure itself is consistent and behaves well. In fact, neither approach considered above, i.e. maximum likelihood inference of the PP model and modularity maximization, actually offers a robust method to uncover community structure in networks. As is well known, modularity maximization suffers from severe shortcomings, such as a strong tendency to identify spurious communities in fully random [11] and non-modular [12][13][14][27] networks, a systematic failure to identify relatively small communities in large networks [19], extreme degeneracy in key empirically relevant cases [28], and, as has been recently shown, systematic overfitting on a broad range of empirical networks [29]. Any equivalence with the statistical inference of a parametric model would just mean that the latter also inherits these same limitations. However, the full maximum likelihood inference approach of the PP model outlined above (which is not equivalent to modularity optimization) is not substantially superior. Even though it has a better justification, it does not really address any of the core problems of modularity. Most prominently, the inference approach is still prone to overfitting, with the uncontrolled detection of an ever increasing number of meaningless communities in fully random networks, as long as those increase the likelihood of the model. This happens in the same manner as fitting a polynomial to a set of points will also overfit, even if we use maximum likelihood, as long as we are allowed to increase the polynomial order without any constraint. We will demonstrate this problem with some simple examples, but before we do so we turn instead to a Bayesian approach, which includes the correct penalization of model complexity, and hence addresses the overfitting problem at its root, in a manner analogous to what has been done for the general SBM [15], as we describe in the next section.
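For concreteness, the generalized modularity discussed in this section can be computed directly from group-level counts. The sketch below assumes the standard form $Q = \frac{1}{2E}\sum_{ij}(A_{ij}-\gamma k_ik_j/2E)\delta_{b_i,b_j}$, which is equivalent to $Q = \frac{1}{2E}\left[2e_{\text{in}}-\gamma\sum_r e_r^2/(2E)\right]$; the function name and the edge-list input are illustrative choices rather than part of any particular library.

```python
def generalized_modularity(edges, b, gamma=1.0):
    """Generalized modularity of partition b with resolution parameter gamma.

    edges : list of (i, j) node pairs (each undirected edge listed once)
    b     : sequence giving the group label of every node
    """
    E = len(edges)
    e_in = 0      # number of edges inside groups
    e_r = {}      # sum of degrees per group
    for i, j in edges:
        e_r[b[i]] = e_r.get(b[i], 0) + 1
        e_r[b[j]] = e_r.get(b[j], 0) + 1
        if b[i] == b[j]:
            e_in += 1
    null_term = gamma * sum(x * x for x in e_r.values()) / (2 * E)
    return (2 * e_in - null_term) / (2 * E)


# Two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_two = generalized_modularity(edges, [0, 0, 0, 1, 1, 1])   # equals 5/14
q_one = generalized_modularity(edges, [0, 0, 0, 0, 0, 0])   # equals 0
```

Note that this function takes $\gamma$ as a fixed input. As argued above, treating it as a constant is precisely what separates modularity maximization from a consistent maximum likelihood fit of the PP model.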

III. BAYESIAN INFERENCE OF THE PLANTED PARTITION MODEL
Instead of maximum likelihood, a more formally correct approach to statistical inference is to sample from, or maximize, the posterior distribution of partitions [15]
$$P(b|A) = \frac{P(A|b)P(b)}{P(A)}, \tag{20}$$
where
$$P(A|b) = \int P(A|\lambda_{\text{in}},\lambda_{\text{out}},\theta,b)\,P(\lambda_{\text{in}})P(\lambda_{\text{out}})P(\theta|b)\;d\lambda_{\text{in}}\,d\lambda_{\text{out}}\,d\theta \tag{21}$$
is the marginal likelihood integrated over all model parameters, weighted according to their prior probabilities, and
$$P(b) = \frac{\prod_r n_r!}{N!}\binom{N-1}{B-1}^{-1}\frac{1}{N} \tag{22}$$
is the prior probability for partition $b$, with $B \equiv B(b)$ denoting the number of groups of $b$ and $n_r$ the number of nodes in group $r$ (see Ref. [24] for a derivation). The remaining term $P(A) = \sum_b P(A|b)P(b)$ is called the evidence, and it has the role of a normalization constant, and therefore will not play an important role in our calculations. In order to compute the integral of Eq. 21 we must specify our priors, which also requires us to deal with the model specification problem exposed earlier, with respect to the parameters $\theta$. Here we will make the simple choice
$$\hat\theta_r = 1, \tag{23}$$
which allows the model parameters to have a straightforward interpretation, namely the $\theta_i$ are the relative probabilities of selecting a node randomly from the group to which it belongs, $B\lambda_{\text{in}}$ will determine twice the expected total number of edges inside communities, and $\binom{B}{2}\lambda_{\text{out}}$ the expected number of edges outside communities. (Remember that we are allowed to make any choice of $\hat\theta_r$ as part of our model specification, as long as the choice is made a priori, and does not depend on the data being modelled. As discussed previously, this choice does limit the accuracy of degree correction of the PP model when performing maximum likelihood estimation. However, as we will see in a moment, this will not be a problem in the Bayesian formulation.) Now we can proceed in a manner that reflects our a priori indifference to any kind of model pattern, namely we select a uniform prior for $\theta$, and we choose maximum-entropy priors for the remaining parameters, where $\bar\lambda$ is a hyperparameter that determines the expected total number of edges, with $\bar\lambda = 2\langle E\rangle/B^2$. Performing the integral of Eq. 21 we obtain the marginal likelihood of Eq. 27. This marginal likelihood still depends on the global hyperparameter $\bar\lambda$, which we can infer together with the other model parameters. However, there is an alternative that allows us to remove it altogether. We can re-interpret this marginal likelihood as an entirely equivalent model formulation given by
$$P(A|\bar\lambda,b) = P(A|k,e,b)\,P(k|e,b)\,P(e|\bar\lambda,b), \tag{28}$$
where
$$P(A|k,e,b) = \frac{\prod_{r<s}e_{rs}!\prod_re_{rr}!!\prod_ik_i!}{\prod_re_r!\prod_{i<j}A_{ij}!\prod_iA_{ii}!!}$$
is the likelihood of the microcanonical DC-SBM [24], where $e_{rs}$ specifies the exact number of edges between groups $r$ and $s$ (or twice that for $r = s$) and $k_i$ is the exact degree of node $i$. We can recover Eq. 27 by making the following choice of priors,
$$P(e|e_{\text{in}},e_{\text{out}},b) = \frac{e_{\text{in}}!}{\prod_r(e_{rr}/2)!}B^{-e_{\text{in}}}\times\frac{e_{\text{out}}!}{\prod_{r<s}e_{rs}!}\binom{B}{2}^{-e_{\text{out}}},$$
which is a product of uniform multinomial distributions for the diagonal and off-diagonal entries of the matrix $e_{rs}$, conditioned on the total sums $e_{\text{in}}$ and $e_{\text{out}}$, respectively. For $e_{\text{in}}$ and $e_{\text{out}}$ themselves we use geometric distributions, and finally for the degrees we choose uniform distributions inside each group [24],
$$P(k|e,b) = \prod_r\left(\!\!\binom{n_r}{e_r}\!\!\right)^{-1},$$
where $\left(\!\!\binom{n}{m}\!\!\right) = \binom{n+m-1}{m}$ counts the number of ways of distributing $m$ indistinguishable items among $n$ bins. Inserting these priors in Eq. 28 and re-arranging leads to Eq. 27. Interestingly, and somewhat surprisingly, since this microcanonical model generates the exact degrees $k$ that are observed, we no longer have the same inconsistency as in the "canonical" model under maximum likelihood that we discussed earlier, where the inferred degrees were different from the observed ones, even though we have made use of the constraint $\hat\theta_r = 1$ in its derivation. We can therefore rest assured that this model can accommodate arbitrary degree sequences. This equivalence also allows us to replace some of the priors of the microcanonical formulation by more convenient choices that make the approach fully nonparametric. In particular, for $e_{\text{in}}$ and $e_{\text{out}}$ we can use instead $P(e_{\text{in}},e_{\text{out}}|b) = P(e_{\text{in}},e_{\text{out}}|E,b)P(E)$, where
$$P(e_{\text{in}},e_{\text{out}}|E,b) = \left(\frac{1}{E+1}\right)^{1-\delta_{B,1}}$$
is a uniform distribution of the $E$ edges into two values (unless $B = 1$, where we must have $e_{\text{in}} = E$). The prior for the total number of edges $P(E)$ can now be chosen arbitrarily, as it will only amount to an unimportant constant in the marginal distribution, and hence vanish from the posterior. With this, we have a fully nonparametric marginal distribution
$$P(A|b) = \frac{e_{\text{in}}!\,e_{\text{out}}!}{\left(\frac{B}{2}\right)^{e_{\text{in}}}\binom{B}{2}^{e_{\text{out}}}(E+1)^{1-\delta_{B,1}}}\prod_r\frac{(n_r-1)!}{(e_r+n_r-1)!}\times\frac{\prod_ik_i!}{\prod_{i<j}A_{ij}!\prod_iA_{ii}!!}.$$
This expression, together with the partition prior $P(b)$ of Eq. 22, is not much more difficult to compute than the modularity of Eq. 19. In fact, it is easy to see that if we consider the change in the posterior probability that is incurred if we move a node $i$ from group $r$ to group $s$, we need to compute only a few terms that depend on $e_{\text{in}}$, $e_{\text{out}}$, $e_r$, $e_s$, $n_r$, $n_s$ and $B$. In order to compute the change, we need only to inspect the neighborhood of the node, which takes time $O(k_i)$, independently of any other quantity, such as the number of groups. This is the same algorithmic complexity as computing changes in modularity, so the quantity $\ln P(A, b)$ can be used as a drop-in replacement of the quality function in any modularity maximization algorithm, thereby addressing many existing fundamental limitations.
In fact, we can understand in more detail why this approach prevents overfitting by exploiting a direct connection between Bayesian inference and information theory. Namely, we can write the negative joint log-likelihood as follows,
$$\Sigma = -\ln P(A,b) = -\ln P(A|e,k,b) - \ln P(e,k,b).$$
The quantity $\Sigma$ is called the description length of the data [30], as it measures the amount of information required to describe the network $A$ when the parameters $e$, $k$, and $b$ are known, together with the information necessary to describe the parameters themselves. Therefore, the most likely partition of the network is the one that allows us to compress it the most. This means that this approach amounts to a formal implementation of Occam's razor, which favors the most parsimonious explanation for the data: as we increase the complexity of the model, by considering a larger number of communities, the first term $-\ln P(A|e,k,b)$ will tend to decrease, as the model becomes more constrained. However, the second term $-\ln P(e,k,b)$ will tend to increase, functioning as a penalty for more complex models. Since it is not possible to compress fully random data using any method, this approach cannot find communities in fully random networks. (This also explains why maximum likelihood overfits, since it omits the contribution of the second term of the description length, and hence there is no penalization for model complexity.) We will also show in Sec. IV A that this approach does not suffer from the "resolution limit" underfitting problem present with modularity maximization. However, before we do so, we will first consider a small variation of the PP model that is slightly more realistic, and allows for a larger amount of heterogeneity in the community structure.
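To make the description length concrete, the following Python sketch evaluates $\Sigma = -\ln P(A|b) - \ln P(b)$ for the uniform PP model. It is an illustration under stated assumptions rather than a reference implementation: it uses the marginal likelihood obtained by multiplying the microcanonical DC-SBM likelihood with the priors described above (uniform multinomials for $e_{rs}$, a uniform split of $E$ into $e_{\text{in}}$ and $e_{\text{out}}$, uniform degrees inside groups), drops the constant $P(E)$, and assumes a simple graph, so that the terms in $A_{ij}!$ and $A_{ii}!!$ vanish. All function names are hypothetical.

```python
from collections import Counter
from math import lgamma, log


def lfact(n):
    """ln n!"""
    return lgamma(n + 1)


def lbinom(n, k):
    """ln of the binomial coefficient C(n, k)."""
    return lfact(n) - lfact(k) - lfact(n - k)


def description_length(edges, b):
    """Description length (in nats) of a simple graph under the uniform PP model.

    edges : list of (i, j) pairs with i != j and no repeated pairs (assumed)
    b     : list giving the group label of each of the N nodes
    """
    N, E = len(b), len(edges)
    n_r = Counter(b)                 # group sizes
    B = len(n_r)                     # number of occupied groups
    k = [0] * N                      # node degrees
    e_r = Counter()                  # sum of degrees per group
    e_in = 0
    for i, j in edges:
        k[i] += 1
        k[j] += 1
        e_r[b[i]] += 1
        e_r[b[j]] += 1
        e_in += b[i] == b[j]
    e_out = E - e_in

    # ln P(A|b): edge-count terms, then degree priors, then node degrees.
    log_like = lfact(e_in) + lfact(e_out) - e_in * log(B / 2)
    if B > 1:
        log_like -= e_out * log(B * (B - 1) / 2) + log(E + 1)
    for r in n_r:
        log_like += lfact(n_r[r] - 1) - lfact(e_r[r] + n_r[r] - 1)
    log_like += sum(lfact(ki) for ki in k)

    # ln P(b): the partition prior of Eq. 22.
    log_prior = (sum(lfact(n) for n in n_r.values()) - lfact(N)
                 - lbinom(N - 1, B - 1) - log(N))
    return -(log_like + log_prior)


def clique(nodes):
    nodes = list(nodes)
    return [(u, v) for a, u in enumerate(nodes) for v in nodes[a + 1:]]


# Two 10-cliques joined by one edge: splitting compresses the network.
two_cliques = clique(range(10)) + clique(range(10, 20)) + [(9, 10)]
dl_split = description_length(two_cliques, [0] * 10 + [1] * 10)
dl_merged = description_length(two_cliques, [0] * 20)

# A single 12-clique: an arbitrary bisection only adds model complexity.
k12 = clique(range(12))
dl_k12_one = description_length(k12, [0] * 12)
dl_k12_two = description_length(k12, [0] * 6 + [1] * 6)
```

In the first example the two-group partition has the smaller description length, whereas for the homogeneous clique the single-group partition is preferred, which illustrates the penalty for model complexity discussed above.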

A. The non-uniform PP model
Even if we commit ourselves to search exclusively for assortative community structures, the particular formulation of the PP model considered previously seems needlessly restrictive. This is because it assumes that, if all $\theta_i$ are the same, then the expected number of edges inside communities is the same for every community, which is likely to be an inadequate assumption in a variety of empirical scenarios. We can relax this constraint by formulating instead a non-uniform version of the PP model, with
$$\lambda_{rs} = \lambda_r\delta_{rs} + \lambda_{\text{out}}(1-\delta_{rs}).$$
This parametrization allows for the expected number of edges inside communities to vary arbitrarily, via the parameters $\lambda = \{\lambda_r\}$ that can be different for every community. Given this formulation, we can essentially repeat the same calculations as before, as we show in Appendix A. In the end, we obtain the marginal likelihood of Eq. 41. This likelihood is very similar to that of the uniform planted partition model, and is just as easy to compute, but it should work better when the communities are sufficiently heterogeneous.

B. Inference algorithm
The posterior distribution of Eq. 20 is not simple enough to allow us to sample directly from it, so we have to perform this indirectly using Markov chain Monte Carlo (MCMC). This is done by defining move proposals $P(b'|b)$ that are conditioned on the current partition $b$, and accepting a new partition $b'$ sampled from this distribution according to the Metropolis-Hastings probability [31,32]
$$\min\left(1, \frac{P(b'|A)\,P(b|b')}{P(b|A)\,P(b'|b)}\right),$$
otherwise we reject the move, and remain at the previous partition $b$. Note that the computation of the above ratio does not depend on the intractable normalization constant $P(A)$ of Eq. 20, since it cancels out in the computation. By iterating the above procedure sufficiently often, we are guaranteed to sample from the target distribution $P(b|A)$ asymptotically, provided our proposals $P(b'|b)$ are ergodic and aperiodic. However, the time required to reach the target distribution will depend on the quality of our proposals, which will determine the practical feasibility of the algorithm. In this work we use the merge-split proposals described in Ref. [23], which have been shown to work well in many cases, in particular when the number of groups tends to vary. The only modification we make to that algorithm is that when proposing the move of a single node $i$ from its current group to group $r$, we choose $r$ according to a proposal probability that depends on a parameter $\epsilon$ and on the number of occupied groups $B$. The parameter $\epsilon$ determines the probability with which we look at a random neighbor of node $i$ to copy its group membership, otherwise we select a group at random. We require a value $\epsilon > 0$ to guarantee ergodicity, but otherwise any other value yields a valid algorithm (we have used $\epsilon = 1/2$ in our analysis, which provided good acceptance rates). As we mentioned before, when using our model formulation, the likelihood ratio when changing the membership of node $i$, as well as the move proposal probability, can be computed in time $O(k_i)$, where $k_i$ is its degree. This means that a single MCMC "sweep", where every node has a chance to be moved once, takes time $O(N + E)$, where $E$ is the number of edges, which is the best we can hope for with this kind of problem. Therefore we can use this algorithm to approach very large networks. A reference implementation of this algorithm is freely available as part of the graph-tool library [33].
In some cases, we may seek to maximize the posterior distribution instead of sampling from it. This is achieved via a simple modification of the above algorithm, where we replace the target distribution with $P(b|A) \to P(b|A)^\beta$, where $\beta$ is an inverse temperature parameter. If we increase $\beta \to \infty$ (preferably slowly, to avoid getting trapped in local optima) we obtain a maximization algorithm. The merge-split MCMC often behaves well when employed in this manner, as it can more easily escape local maxima that would trap alternative schemes, such as those based on the change of a single node at a time.
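The acceptance rule and its tempered variant can be sketched generically. The Python code below is a minimal illustration of Metropolis-Hastings with an inverse temperature $\beta$, applied to a toy three-state target; it is not the merge-split algorithm of Ref. [23], and the function names, the toy target and the uniform proposal are all assumptions introduced for this example. In an actual application, `log_target` would return $\ln P(A,b)$ for a partition and `propose` would implement the move proposals.

```python
import math
import random


def metropolis_hastings(log_target, propose, x0, n_steps, beta=1.0, seed=0):
    """Sample from target(x)**beta using Metropolis-Hastings.

    log_target : function x -> unnormalized log-probability
    propose    : function (x, rng) -> (y, log q(y|x), log q(x|y))
    beta       : inverse temperature; beta = 1 samples the target itself,
                 large beta concentrates the chain on the maximum
    """
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        y, log_q_fwd, log_q_rev = propose(x, rng)
        # Log of the acceptance ratio; normalization constants cancel out.
        log_a = beta * (log_target(y) - log_target(x)) + log_q_rev - log_q_fwd
        if log_a >= 0 or rng.random() < math.exp(log_a):
            x = y
        samples.append(x)
    return samples


# Toy target over the states {0, 1, 2} with weights proportional to 1, 2, 3.
def toy_log_target(x):
    return math.log(x + 1)


# Uniform proposal over all three states (ergodic and aperiodic).
def toy_propose(x, rng):
    return rng.randrange(3), -math.log(3), -math.log(3)


samples = metropolis_hastings(toy_log_target, toy_propose, 0, 60000)
cold = metropolis_hastings(toy_log_target, toy_propose, 0, 2000, beta=50.0)
```

At $\beta = 1$ the visit frequencies approach the target weights $1/6$, $1/3$ and $1/2$, while at large $\beta$ the chain settles on the most probable state, which is the maximization regime described above.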

IV. ARTIFICIAL NETWORKS
Here we show how our approach behaves for artificial networks that have imposed community structure. We compare the inference of the PP model with the DC-SBM [24], as well as with variations of modularity. We focus on the overfitting problem, and the potential identification of non-existing communities. We do so by sampling networks with $N = 10^5$ nodes and average degree $\langle k\rangle = 5$, and a specific number of equal-sized groups $B$, from the PP model defined above, with a choice of parameters given by $\theta_i = 1/B$, $\lambda_{\text{in}} = c\langle k\rangle N/B$ and $\lambda_{\text{out}} = (1-c)\langle k\rangle N/[B(B-1)]$, with $c = 1/B + (1-1/B)\epsilon$, such that if $\epsilon = 0$ we have fully random networks, and if $\epsilon = 1$ we have perfectly assortative communities. For the inference of the PP model and the DC-SBM, we sample from the posterior distribution of Eq. 20, using the algorithm above. When using the modularity function, we sample from the target distribution
$$P(b|A) = \frac{e^{\beta Q(b,A)}}{Z(A)},$$
where $Z(A) = \sum_b e^{\beta Q(b,A)}$, and $Q(b, A)$ is given by Eq. 19. We choose $\beta = 2E\mu = 2E\ln(\lambda_{\text{in}}/\lambda_{\text{out}})$, such that if
$$\gamma = \gamma_{\text{true}} \equiv \frac{\lambda_{\text{in}}-\lambda_{\text{out}}}{\ln\lambda_{\text{in}}-\ln\lambda_{\text{out}}}$$
then the posterior will be proportional to the likelihood of the true underlying model, i.e. $P(b|A) \propto P(A|\lambda_{\text{in}}, \lambda_{\text{out}}, \theta^*, b)$. We also compare with the results obtained with the maximum likelihood choice for $\gamma$,
$$\gamma = \gamma_{\text{fit}} \equiv \frac{\lambda^*_{\text{in}}-\lambda^*_{\text{out}}}{\ln\lambda^*_{\text{in}}-\ln\lambda^*_{\text{out}}},$$
where $\lambda^*_{\text{in}} = Be_{\text{in}}/E$ and $\lambda^*_{\text{out}} = Be_{\text{out}}/[E(B-1)]$ (assuming Eq. 17 holds). Finally, we also compare with $\gamma = 1$, which corresponds to the original definition of modularity, still widely used in practice.
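The parametrization of the artificial networks can be written out explicitly. The short Python sketch below computes $c$, $\lambda_{\text{in}}$ and $\lambda_{\text{out}}$ from the number of nodes, the average degree, the number of groups and the mixing parameter (called `eps` here), and checks the expected edge counts under the normalization $\hat\theta_r = 1$ of Sec. III, for which $B\lambda_{\text{in}}/2$ and $\binom{B}{2}\lambda_{\text{out}}$ are the expected numbers of edges inside and outside groups. The function name is hypothetical, and the sketch assumes $B \ge 2$.

```python
def pp_parameters(N, k_avg, B, eps):
    """Return (c, lambda_in, lambda_out) for the planted partition benchmark.

    N     : number of nodes
    k_avg : average degree
    B     : number of equal-sized groups (B >= 2 assumed)
    eps   : mixing parameter; 0 gives a fully random network,
            1 gives perfectly assortative communities
    """
    if B < 2:
        raise ValueError("this sketch assumes at least two groups")
    c = 1 / B + (1 - 1 / B) * eps      # fraction of edges inside groups
    lam_in = c * k_avg * N / B
    lam_out = (1 - c) * k_avg * N / (B * (B - 1))
    return c, lam_in, lam_out


N, k_avg, B = 10**5, 5, 10

# eps = 0: edges are placed irrespective of the groups.
c0, lam_in0, lam_out0 = pp_parameters(N, k_avg, B, 0.0)

# eps = 1: no edges between different groups.
c1, lam_in1, lam_out1 = pp_parameters(N, k_avg, B, 1.0)

# Intermediate case: expected total number of edges is k_avg * N / 2.
c, lam_in, lam_out = pp_parameters(N, k_avg, B, 0.3)
expected_edges = B * lam_in / 2 + B * (B - 1) / 2 * lam_out
```

For `eps = 0` the two rates coincide, so the group labels carry no information about the placement of edges, which is the fully random limit used to probe overfitting.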
The results for the inferred number of groups can be seen in Fig. 1. The Bayesian inference of both versions of the PP model (uniform and non-uniform) as well as the DC-SBM yield identical results, always returning the true number of groups. All versions of the modularity-based approach overfit systematically, often finding a number of groups that is wrong by orders of magnitude. The bad behavior of the case $\gamma = \gamma_{\text{true}}$ may seem surprising, since it corresponds to the true likelihood of the model, which one could expect to be "Bayes optimal," in the sense that since it already includes the correct model parameters other than the partition itself, any other approach would need to yield a strictly worse performance. However, this would only be true if the number of groups were also set to its true value (rendering its inference moot), otherwise this choice of parameter is no longer optimal. The behavior with $\gamma = \gamma_{\text{fit}}$ is considerably worse than all others, showing how maximum likelihood is inadequate for models with unconstrained degrees of freedom, as it trivially overfits. Interestingly, the choice $\gamma = 1$ seems to yield a better regularization than the alternatives, although the approach still systematically overfits, especially for a small number of planted communities. Results like this should give us pause when employing modularity to uncover communities in networks. Our Bayesian approach, on the other hand, behaves robustly, without requiring us to tune any parameter.
A. Bayesian inference of the PP model has no resolution limit
As was shown by Fortunato and Barthélemy [19], the method of modularity maximization possesses an intrinsic preferred scale for the size of the communities, which results in the so-called "resolution limit" that prevents relatively small modules from being uncovered, even if they have a very clear structure. Here we show that our Bayesian method does not suffer from the same problem.
We begin by briefly revisiting the result of Ref. [19], and we consider the structure of a maximally modular network, i.e. one that is constructed in order to maximize modularity. Following Ref. [19] we consider, without loss of generality, a network of $N$ nodes and $E$ edges that are divided into $B$ equal-sized groups, each with $(E - B)/B$ internal edges, connecting nodes of the same group, and in total $B$ edges connecting nodes of different communities, forming a circular ring between communities (the ring construction simply enforces that the network can in principle be connected, but plays no other role in the results). With this parametrization we have $e_r = 2E/B$, $e_{\text{in}} = E - B$, $e_{\text{out}} = B$. The number of groups itself is a free parameter, and it determines the overall modularity, which from Eq. 19 is given by
$$Q = 1 - \frac{B}{E} - \frac{\gamma}{B}.$$
We now seek to find the value $B = B^*$ that maximizes the above equation. Treating $B$ as a continuous value for this purpose, and taking the derivative and setting it to zero, $dQ/dB = 0$, we obtain
$$B^* = \sqrt{\gamma E}.$$
This result tells us that if we construct a network in the above way but with $B > B^*$, even if the groups themselves happen to be obvious assortative communities, e.g. large cliques connected by single edges, then these communities will be unintuitively merged together to achieve a larger modularity. The above result also reveals the role of the resolution parameter $\gamma$, which serves as the basis of methods that attempt to determine its most appropriate value to counteract the limit in resolution [26,34-37].
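The preferred scale of modularity can be verified numerically. For the ring construction above, with $e_r = 2E/B$ and $e_{\text{in}} = E - B$, the generalized modularity of Eq. 19 evaluates to $Q = 1 - B/E - \gamma/B$. The Python sketch below scans integer values of $B$ and locates the maximum by brute force; the function names are illustrative.

```python
import math


def ring_modularity(B, E, gamma=1.0):
    """Generalized modularity of the ring of B equal groups with E edges."""
    return 1 - B / E - gamma / B


def best_number_of_groups(E, gamma=1.0):
    """Integer B in [1, E] that maximizes the ring modularity."""
    return max(range(1, E + 1), key=lambda B: ring_modularity(B, E, gamma))


E = 10_000
b_star = best_number_of_groups(E)              # sqrt(E) = 100
b_star_g4 = best_number_of_groups(E, gamma=4)  # sqrt(4 E) = 200
```

Increasing $\gamma$ moves the maximum to a larger number of groups, but for any fixed $\gamma$ the optimum still grows only as $\sqrt{E}$, so a ring with more groups than this has its communities merged regardless of how well separated they are.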
We now turn to the PP model, to determine whether the same natural scale emerges. We need to consider the value of ln P(A, b) for the same construction above, and determine the value of B = B* that maximizes it. We will use the non-uniform PP model with Eqs. 41 and 22, although the final result is the same with the uniform version. We can obtain a simpler expression for the joint log-likelihood by assuming a large network with N ≫ 1 and B ≫ 1 (although we make no assumption on the value of B relative to N or E), so that we can use Stirling's formula, ln m! ≈ m ln m − m, to obtain Eq. 50, up to unimportant additive constants. Taking the derivative of that expression with respect to B and setting it to zero, we obtain an equation for B*. If we now assume a sparse graph with E = ⟨k⟩N/2, with ⟨k⟩ > 2 being the average degree independent of N, then for N ≫ ⟨k⟩ the solution of this equation scales as B* = O(N/ ln N). This means that the Bayesian approach has a natural scale which prefers group sizes N/B* = O(ln N), which is significantly smaller than the modularity scale N/B* = O(√N). The scale of the Bayesian approach arises mostly due to the requirements of statistical evidence: we should partition a network only if its structure cannot be explained by a uniformly random placement of the edges. This also explains why for ⟨k⟩ < 2 we obtain a value B* = 1, since sparser networks inherently contain less information about the existing community structure. As the size of the network increases, so does the number of possible ways of partitioning it, and as a consequence the statistical evidence required to support a partition also increases, and hence it becomes impossible to uncover groups smaller than O(ln N). However, this threshold grows so slowly that it can barely be compared with what exists for modularity maximization. We emphasize that this approach virtually eliminates the resolution limit without the introduction of a single parameter that needs to be tuned.
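To give a sense of how differently these two scales grow, the following sketch tabulates ln N against the group size N/B* implied for modularity by B* = √(γE) with E = ⟨k⟩N/2. The prefactor of the O(ln N) scale is not given here, so setting it to one is an assumption made only for illustration, as are the function names.

```python
import math

def modularity_group_size(N, avg_degree, gamma=1.0):
    """Preferred group size N / B* for modularity, using B* = sqrt(gamma * E)
    with E = avg_degree * N / 2."""
    E = avg_degree * N / 2
    return N / math.sqrt(gamma * E)

def bayesian_group_size(N):
    """O(ln N) scale with the prefactor set to 1 (illustrative assumption)."""
    return math.log(N)

if __name__ == "__main__":
    for N in (10**3, 10**4, 10**5, 10**6):
        print(N,
              round(bayesian_group_size(N), 1),
              round(modularity_group_size(N, avg_degree=5), 1))
```

Only the growth rates are meaningful in this comparison, not the absolute values, since the logarithmic scale is known here only up to a constant.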
It is interesting to compare the value of B* for the PP model with the same value obtained for the general SBM. As shown in Refs. [24,39], when using noninformative priors, the SBM has a resolution limit B* = O(√N), which is similar to modularity, although it occurs for a completely different reason, namely the model depends on a matrix of parameters of size O(B²), which results in a penalty in the joint log-likelihood of the order of O(B² ln E), which becomes comparable to the likelihood when B ∼ √N for sparse networks. This limitation is lifted when the noninformative priors are replaced by a sequence of nested priors and hyperpriors, resulting in the nested SBM [24,40], which exhibits the natural scale B* = O(N/ ln N), similar to the PP model. However, the PP model achieves this high resolution already with simple noninformative priors, since it depends on a set of parameters which has total size O(N + B) in the case of the non-uniform model, and O(N) in the case of the uniform variant. This illustrates the usefulness of simpler models, which can achieve a higher performance than more general ones, if they happen to be a good description of the data.

Figure 2. Difference in the description length between the best fitting and the remaining models, as specified in the legend, for a selection of networks obtained from the KONECT repository [41]. The best fitting model always appears at the bottom. For reference, the values ln 10 and ln 100 are shown as dashed lines.

V. EMPIRICAL NETWORKS
The existence of assortative community structure is often assumed to be a ubiquitous property of many kinds of networks across different domains. However, this kind of latent structure is not something that can be readily obtained from network data, and most methods that are used to detect it search exclusively for assortative structure, ignoring other patterns. Therefore they cannot be used to rule out the existence of more fundamental nonassortative mixing patterns that are qualitatively different. A comparison between the assortative PP models that we consider here and more general SBM formulations allows us to address this question in a principled way, in order to understand how pervasive assortativity really is.
Here we compare the results obtained with the inference of the PP model (both uniform and non-uniform versions) for a variety of empirical networks, together with those obtained using a Bayesian version of the DC-SBM [24], using both noninformative priors as well as nested hierarchical priors [40]. A powerful feature of the Bayesian inference approach is that it permits principled model selection, in the following way. Suppose we want to compare the community structure b_1 found with model M_1 with the structure b_2 found with model M_2, both for the same network A. We can do this by comparing their posterior probability ratio

$$\Lambda = \frac{P(b_1, M_1|A)}{P(b_2, M_2|A)} = \frac{P(A, b_1|M_1)\,P(M_1)}{P(A, b_2|M_2)\,P(M_2)}.$$

Therefore, if we have no prior preference towards any model, i.e. P(M_1) = P(M_2), then this ratio is given by the difference in the description length obtained with both models,

$$\Lambda = e^{-(\Sigma_1 - \Sigma_2)},$$

where Σ_1 = − ln P(A, b_1|M_1) and Σ_2 = − ln P(A, b_2|M_2). Hence, the most likely model is the one that offers the best compression for the data, and the difference in the compression itself yields the statistical significance of the preference towards the best model.
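In practice, this selection rule amounts to comparing a handful of numbers. The sketch below assumes the description lengths Σ = − ln P(A, b|M) have already been computed (in nats) by some inference procedure, and that the models have equal prior probability. The function names, the example values, and the use of ln 100 as a cutoff for a "decisive" preference are illustrative assumptions, the latter borrowed from the reference lines drawn in Fig. 2.

```python
import math

def log_posterior_odds(sigma_1, sigma_2):
    """ln of the posterior odds of model 1 over model 2, assuming equal model
    priors, from description lengths sigma = -ln P(A, b | M) in nats."""
    return -(sigma_1 - sigma_2)

def select_model(description_lengths, threshold=math.log(100)):
    """Return (best, decisive): the model with the smallest description length
    and whether it beats every alternative by more than `threshold` nats."""
    ranked = sorted(description_lengths, key=description_lengths.get)
    best = ranked[0]
    gaps = [description_lengths[m] - description_lengths[best] for m in ranked[1:]]
    return best, all(gap > threshold for gap in gaps)

if __name__ == "__main__":
    # Hypothetical description lengths, not taken from any real dataset
    sigmas = {"PP": 1502.3, "DC-SBM": 1500.1}
    print(select_model(sigmas), log_posterior_odds(sigmas["DC-SBM"], sigmas["PP"]))
```

Working with the difference of log-probabilities, rather than the ratio itself, avoids numerical overflow, since description lengths of real networks are typically far too large to exponentiate.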
We performed the inference of the four models on a selection of 29 networks, representing different scientific domains, obtained from the KONECT repository [41]. In Fig. 2 we show the difference in description length obtained from the best fitting to the other models. Perhaps unsurprisingly, we find that the general DC-SBM provides a better fit for most networks, indicating that the strictly assortative structure of the PP model is insufficient to account for the observed networks. However, the PP model is selected as the best fitting model in a minority of the cases, and it is instructive to inspect those more closely. In Fig. 3 we show the communities uncovered using the PP model and the nested DC-SBM, for a network of games between American college football teams [42] and a network of co-purchases of books about American politics [43]. In both cases, PP and the nested DC-SBM find very similar partitions, but the PP model finds a slightly larger number of communities. The model selection criterion outlined above selects the PP model as the more plausible alternative due to the strong assortativity observed. The result found for the football network is particularly interesting, since it is a rare case where the uniform PP model is the one that gets selected. This is because the number of edges inside each community is indeed very similar for all of them, and the connections between the communities seem fairly random, exactly how the PP model prescribes. This highlights the robust character of our approach, which will not favor a more complicated model when it is unnecessary, and gives us confidence that when the PP model is not selected, it is indeed because it does not fully account for the actual structure observed in the network.
For other networks, such as the associations between terrorists [46] and the social network of dolphins [47], even though the DC-SBM is strictly preferred, the difference with respect to the non-uniform PP model is negligible, and therefore there is not sufficient evidence in the data to reliably distinguish between both models. For all other data, however, we find substantial evidence in favor of the more general DC-SBM. What is particularly interesting is that the DC-SBM is often preferred even when the uncovered structures are in fact very assortative. We give an example of this in Fig. 4, which shows the communities found with the non-uniform PP model and the nested DC-SBM for a social network of high school students [44]. Even though all communities found have a larger probability of forming internal than external connections, the ones found by the DC-SBM yield a larger plausibility. If we inspect them more closely, we see that the divisions found by the DC-SBM amount largely to subdivisions of the ones found by the PP model. This can be explained by the DC-SBM using the preference of connections between the different communities as additional evidence for their existence, instead of merely their assortativity strength. This illustrates how a more general model like the DC-SBM can be more useful even when assortativity is a dominant but not unique pattern. (We note that the result found by the PP model has a slightly larger modularity, but this is not a very significant fact, given that modularity is in general largely decoupled from statistical significance.) In Fig. 5 we show more details about the inferences obtained for all the networks, including the number of communities found, the normalized maximum overlap distance [45] between the best fitting and the remaining partitions, and the modularity of the partitions found. Overall, we observe a fair amount of variability in the comparisons between models for the different networks.
Very often, the PP models yield a more conservative view of the networks, uncovering a smaller number of groups when compared to the DC-SBM, but there are also cases where the opposite is true. We also observe that although there are many cases where both the DC-SBM and PP models yield partitions with similar modularity, the overlap distance between partitions is very high, indicating that these networks admit a variety of divisions that have a similar overall level of assortativity (a good example of this is the high school network we considered in Fig. 4). Therefore, despite similar values of modularity found with the DC-SBM, the more general model rarely yields partitions that are very similar to the ones returned by any of the PP variants.

Figure 4. Inferred community structure of a social network of high school students [44], using both the non-uniform PP model and the nested DC-SBM. The bottom panels show the community-wise modularity values q_r = (e_rr − e_r²/2E)/2E, such that Q = Σ_r q_r. A value of q_r > 0 indicates that group r has a predominantly assortative contribution. The bottom legend shows the description length obtained with both models, as well as the value of modularity of the partitions. Both divisions have a normalized maximum overlap distance [45] of d = 0.299. The group colors are chosen to maximize the matching between both partitions, as described in Ref. [45], and the same colors are used in the bottom panels.
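The community-wise modularity values q_r used in Fig. 4 are straightforward to compute from an edge list and a partition. The following is a minimal sketch for undirected networks with γ = 1; the function name and the toy network (two triangles joined by a single edge) are illustrative assumptions and not part of the analysis above.

```python
from collections import defaultdict

def groupwise_modularity(edges, b):
    """Return {r: q_r} with q_r = (e_rr - e_r**2 / (2E)) / (2E).

    `edges` is a list of undirected (i, j) pairs and `b` maps nodes to groups.
    e_rr counts twice the edges inside group r; e_r is the total degree of r.
    """
    E = len(edges)
    e_rr = defaultdict(int)
    e_r = defaultdict(int)
    for i, j in edges:
        e_r[b[i]] += 1
        e_r[b[j]] += 1
        if b[i] == b[j]:
            e_rr[b[i]] += 2
    return {r: (e_rr[r] - e_r[r] ** 2 / (2 * E)) / (2 * E) for r in e_r}

# Two triangles joined by a single edge, split into their natural groups
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
b = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = groupwise_modularity(edges, b)
Q = sum(q.values())  # total modularity is the sum of the group contributions
```

Both groups in this toy example have q_r > 0, i.e. a predominantly assortative contribution, and their sum recovers the usual modularity of the partition.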
The values of modularity obtained with the best fitting model (which is most often the nested DC-SBM) are in some situations similar to what is found with the PP models, like for the high school social networks. However, for networks like the douban.com online social recommendation network [48], political blogs [49], Internet at the autonomous system level [40], and others, the modularity obtained with the best-fitting DC-SBM is significantly smaller than with the PP models, indicating that assortativity is not the most fundamental pattern in these networks, and using a community detection method that searches exclusively for these patterns gives us a significantly biased view.
In Fig. 5 we also include results obtained with the method of modularity maximization. As must be the case, this approach yields the division of the network with the highest value of modularity among the alternatives considered. When we compare these results with those obtained with more statistically grounded approaches, we observe a rather erratic behavior. For some networks, such as word associations [51], modularity maximization yields a seemingly more conservative result with fewer groups, which could be an underfit potentially due to the resolution limit [19]. In other instances, like the e-mail network of an undisclosed European institution [52], protein-protein interactions [50] and bipartite person-crime associations [53], modularity maximization finds a number of communities that is multiple orders of magnitude larger than what is obtained with the inference methods, strongly indicating a massive overfit of these datasets. We illustrate this point in more detail by focusing on the protein-protein interaction network in Fig. 6. There we see that while modularity maximization finds over a hundred communities, the inference of the PP model finds only two, with one of them being relatively small. If we now consider a fully randomized version of the network, shown in Fig. 6 as well, we see that modularity maximization still finds a very similar number of communities in it, with a high value of modularity, while the inference of the PP model finds, correctly, only a single group. This example clearly shows that while the structure of the original network is probably not completely random, most of it (including its disconnectedness) can be explained by its degree sequence alone, with no convincing evidence of community structure, and that the results obtained with modularity maximization are mostly spurious.
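The fact that a high modularity can arise without any planted structure is easy to reproduce in a simplified setting. The sketch below does not repeat the experiment of Fig. 6; instead, it assumes a fully random sparse graph (edges placed uniformly at random, with parameters chosen arbitrarily) and uses its connected components as the partition. Since modularity maximization can only match or exceed the modularity of this partition, the value obtained is a lower bound on what it would report. All function names are hypothetical.

```python
import random

def random_sparse_graph(N, E, seed=0):
    """E distinct undirected edges placed uniformly at random among N nodes."""
    rng = random.Random(seed)
    edges = set()
    while len(edges) < E:
        i, j = rng.sample(range(N), 2)
        edges.add((min(i, j), max(i, j)))
    return sorted(edges)

def connected_components(N, edges):
    """Label each node with its connected component using union-find."""
    parent = list(range(N))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    return [find(x) for x in range(N)]

def modularity(edges, b):
    """Q = sum_r (e_rr - e_r**2 / (2E)) / (2E), with e_rr twice the internal edges."""
    E = len(edges)
    e_rr, e_r = {}, {}
    for i, j in edges:
        for node in (i, j):
            e_r[b[node]] = e_r.get(b[node], 0) + 1
        if b[i] == b[j]:
            e_rr[b[i]] = e_rr.get(b[i], 0) + 2
    return sum((e_rr.get(r, 0) - e_r[r] ** 2 / (2 * E)) / (2 * E) for r in e_r)

# A random graph with no planted communities whatsoever
edges = random_sparse_graph(N=1000, E=300)
Q = modularity(edges, connected_components(1000, edges))
```

Because every edge is internal to its component, Q here reduces to one minus the sum of squared edge fractions of the components, which is close to one whenever the graph fragments into many small pieces. A high modularity therefore says little, by itself, about whether the division reflects anything beyond randomness.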
Our findings corroborate a recent analysis based on link prediction, carried out on a large corpus of empirical networks, which showed that modularity maximization tends to systematically overfit [29]. Together with our results, this serves to illustrate that the tenuous connection with maximum likelihood of the PP model should not encourage practitioners to employ modularity maximization in the analysis of real networks, if they expect to be guided by statistical significance, or to have any inherent guarantee against overfitting or underfitting.

VI. CONCLUSION
We have described how to perform a nonparametric Bayesian inference of the planted partition generative model, resulting in a principled community detection algorithm tailored for assortative structures. Our method separates structure from randomness, and does not find spurious communities in fully random networks. We also showed that it does not suffer from the resolution limit present in modularity-based methods, and is capable of uncovering arbitrarily small communities, provided those are statistically significant, and without the tuning of any parameter. Our approach is based on the sampling or maximization of a posterior probability function that is not much more complicated to implement than popular heuristics like modularity, and hence can be used as a drop-in replacement for it in a variety of algorithms, intrinsically providing better regularization for them.
We showed how our inference approach is amenable to statistical model selection, and we have compared our model variations, together with more general stochastic block models, on a variety of empirical networks. We discussed how this comparison allows us to determine the true assortativity of community structures, by removing a systemic bias that exists when only constrained methods are employed. We have shown that in many cases the assortativity of real networks is exaggerated when viewed through the lenses of community detection methods that search exclusively for assortative patterns, and how model selection can reveal more fundamental mixing patterns.
A recent investigation of a variety of community detection methods on a large corpus of empirical networks found that most of them tend to yield a number of communities that seems compatible with an overall O(√N) scaling [29], indicating that this might be a limitation that is present in a larger set of community detection algorithms. That analysis did not include statistical inference methods that are known not to have this particular limitation, like the nested SBM [24,40] and our Bayesian PP model, as we have demonstrated. Incorporating these methods into such large-scale comparisons would allow us to better understand what are the true fundamental limitations of the community detection task. The inference approach is infinitely extensible, as it admits any conceivable generative model, and it provides a general platform for a meaningful comparison between them. It is easy to envision a more general comparison across network models that are tailored towards other kinds of specific mixing patterns, such as bipartiteness [54] and core-peripheries [55,56], as well as different classes of models such as those based on latent spaces [57,58]. A systematic comparison under such a framework would shed important light on the inherent trade-offs between more general and specific models, and how they relate to the various empirical domains.

The model likelihood of the non-uniform PP model described in the text can be written as

$$P(A|\lambda, \omega, \theta, b) = e^{-\omega \sum_{r<s} \hat\theta_r \hat\theta_s}\, \omega^{e_{\text{out}}} \prod_r e^{-\lambda_r \hat\theta_r^2/2}\, \lambda_r^{e_{rr}/2}.$$

Enforcing the constraint θ̂_r = 1, and using the noninformative priors

$$P(\lambda_r|\bar\lambda) = \frac{e^{-\lambda_r/(2\bar\lambda)}}{2\bar\lambda}, \qquad P(\omega|\bar\lambda) = \frac{e^{-\omega/\bar\lambda}}{\bar\lambda}, \qquad P(\theta|b) = \prod_r (n_r - 1)!\, \delta\Big(\sum_i \theta_i \delta_{b_i,r} - 1\Big),$$

we obtain the marginal likelihood after integrating over all parameters. This likelihood is once more identical to that of a microcanonical model.

Here we show that it is not difficult to establish a formal connection between any community detection method and statistical inference.
Let us consider an arbitrary quality function W(A, b) that is used to perform community detection via the optimization

$$b^* = \underset{b}{\operatorname{argmax}}\; W(A, b).$$

We can retrofit any such method, and transform it into a statistical inference procedure, by using W(A, b) as the Hamiltonian of an ad hoc generative model given by

$$P(A|b) = \frac{e^{W(A,b)}}{Z(b)},$$

with normalization given by

$$Z(b) = \sum_{A} e^{W(A,b)}.$$

In general, performing a maximum likelihood estimation of this model will not be equivalent to the original optimization problem, due to the role of the normalization constant Z(b). However, we can cast it as a Bayesian procedure in order to achieve a trivial equivalence, via the posterior distribution

$$P(b|A) = \frac{P(A|b)\,P(b)}{P(A)},$$

and by choosing the prior

$$P(b) = \frac{Z(b)}{\Xi}, \quad \text{with} \quad \Xi = \sum_b Z(b),$$

so that P(b|A) ∝ e^{W(A,b)}, and the most likely partition coincides with the optimum of W(A, b). Therefore, finding a mere equivalence between any given community detection method and statistical inference, by itself, is not a very insightful exercise, as it can amount to little more than tautology. This also shows that not every inference procedure is any more meaningful or principled than using an arbitrary quality function. Instead, these features are contingent on the actual generative models used, which need to be properly justified, together with the choice of priors, and care should be taken to verify the consistency of the whole approach, which is not granted automatically in every case.
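This tautological equivalence can be verified by brute force on a network small enough to enumerate. The sketch below assumes the ad hoc model takes the form P(A|b) = e^{W(A,b)}/Z(b) with a prior proportional to Z(b). The quality function W used here is an arbitrary toy choice (internal edges minus a same-group penalty), invented purely for illustration, as are the other names.

```python
import itertools
import math

N = 3
PAIRS = list(itertools.combinations(range(N), 2))
# All simple graphs on N nodes (as edge sets) and all labelings into two groups
GRAPHS = [frozenset(s) for m in range(len(PAIRS) + 1)
          for s in itertools.combinations(PAIRS, m)]
PARTITIONS = list(itertools.product(range(2), repeat=N))

def W(A, b):
    """Hypothetical quality function: internal edges minus a same-group penalty."""
    internal = sum(1 for i, j in A if b[i] == b[j])
    same_group = sum(1 for i, j in PAIRS if b[i] == b[j])
    return internal - 0.5 * same_group

# Normalization Z(b) = sum_A exp(W(A, b)) and the constant Xi = sum_b Z(b)
Z = {b: sum(math.exp(W(A, b)) for A in GRAPHS) for b in PARTITIONS}
XI = sum(Z.values())

def joint(A, b):
    """P(A|b) P(b) for the ad hoc model; Z(b) cancels, leaving exp(W) / Xi."""
    likelihood = math.exp(W(A, b)) / Z[b]
    prior = Z[b] / XI
    return likelihood * prior

def maximizers(score):
    """Set of partitions attaining the maximum of `score` (up to rounding)."""
    best = max(score(b) for b in PARTITIONS)
    return {b for b in PARTITIONS if abs(score(b) - best) < 1e-12}

A = frozenset({(0, 1)})  # a single edge between nodes 0 and 1
same_optimum = maximizers(lambda b: W(A, b)) == maximizers(lambda b: joint(A, b))
```

The prior is constructed precisely so that Z(b) cancels, which is why the agreement holds for any W whatsoever. This is what makes the equivalence uninformative: nothing about the toy W above was justified as a model of how networks are formed.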
Despite the above, it should be mentioned that constructing a posterior distribution in the ad hoc way described above does have its uses. In particular, it allows us to formally define a distribution over all possible divisions of the network according to any given community detection method. As shown in Ref. [20], by characterizing this entire distribution, we have, to some extent, a mechanism to detect degeneracy and evaluate the statistical significance of the results, by seeking the consensus of a large fraction of the solutions. Nonetheless, this does not address the arbitrariness of the Hamiltonian chosen, and the ultimate interpretation of the results.