Model selection and hypothesis testing for large-scale network models with overlapping groups

The effort to understand network systems in increasing detail has resulted in a diversity of methods designed to extract their large-scale structure from data. Unfortunately, many of these methods yield diverging descriptions of the same network, making both the comparison and understanding of their results a difficult challenge. A possible solution to this outstanding issue is to shift the focus away from ad hoc methods and move towards more principled approaches based on statistical inference of generative models. As a result, we face instead the more well-defined task of selecting between competing generative processes, which can be done under a unified probabilistic framework. Here, we consider the comparison between a variety of generative models including features such as degree correction, where nodes with arbitrary degrees can belong to the same group, and community overlap, where nodes are allowed to belong to more than one group. Because such model variants possess an increasing number of parameters, they become prone to overfitting. In this work, we present a method of model selection based on the minimum description length criterion and posterior odds ratios that is capable of fully accounting for the increased degrees of freedom of the larger models, and selects the best one according to the statistical evidence available in the data. In applying this method to many empirical unweighted networks from different fields, we observe that community overlap is very often not supported by statistical evidence and is selected as a better model only for a minority of them. On the other hand, we find that degree correction tends to be almost universally favored by the available data, implying that intrinsic node proprieties (as opposed to group properties) are often an essential ingredient of network formation.


I. INTRODUCTION
Many networks possess non-trivial large-scale structures such as communities [1,2], core-peripheries [3,4], bi-partitions [5] and hierarchies [6,7]. These structures presumedly reflect the organizational principles behind their formation. Furthermore, their detection can be used to predict missing links [6,8], detect spurious ones [8], determine the robustness of the system to failure or intentional damage [9], the outcome of the spread of epidemics [10] and functional classification [11], among many other applications. The detail with which such modular features are both represented in generative models, and detected with inference algorithms, reflects directly on the quality of these tasks. Hence, a natural undertaking has been the development of more elaborate models, which include degree correction [12], community overlap [13,14], hierarchical structure [6,7], selfsimilarity [15,16], bipartiteness [5], edge and node correlates [17,18], social tiers [19], multilayer structure [20], temporal information [21], to name only a few. Although such developments are essential, they should be made with care, since increasing the complexity of generative models may lead to artificial results caused by overfitting. While this is a well understood phenomenon when dealing with independent data or time series, open problems remain when the empirical data is a network, for which many common assumptions no longer hold and usual methods perform very poorly [22]. This problem is significantly exacerbated when methods are used which make no attempt to assess the statistical significance of * tiago@itp.uni-bremen.de the results. Unfortunately, the most widely used methods fall into this class, such as modularity maximization [23], link similarity [24], clique percolation [25], and many others [2]. Although for certain specially constructed examples some direct connections between statistical inference and ad hoc methods can be made [26,27], and in the case of some spectral methods a much deeper connection seems to exist [28,29], they still inherently lack the capacity to reliably distinguish signal from noise. Furthermore, what is perhaps even more important, these different methods cannot easily be compared to each other. For example, if a non-overlapping partition is found with some spectral method, another overlapping partition is obtained with clique percolation, and yet another with a local method based on link similarity, most of the time they will be very different, and yet there is no obvious way to decide which one is a more faithful representation of the network. We show in this work that this issue can be solved in a consistent and principled manner by restricting oneself to generative models, and by performing model selection based on statistical evidence. In particular, we employ the minimum description length principle (MDL), which seeks to minimize the total information necessary to describe the observed data as well as the model parameters. This can be equivalently formulated as the maximization of a Bayesian posterior likelihood which includes non-informative priors on the parameters, from which a posterior odds ratio between different hypothesis can be computed, yielding a degree of confidence for a model to be rejected in favor of another. We focus on the stochastic block model as the underlying generative model, as well as variants which include degree correction and mixed memberships. We show that with these models MDL can be used to produce a very efficient algorithm which scales well for very large networks and with arbitrarily large number of groups. Furthermore we employ the method to a wide variety of network datasets, and we show that degree correction tends to be selected for a significant majority of them, whereas community overlaps are seldom selected. This casts doubt on the claimed pervasiveness of group overlaps [24,25], obtained predominantly with nonstatistical methods, which -as long as there is a lack of corroborating evidence other than the network structure supporting the overlap -should perhaps be interpreted as an artifact of using methods with more degrees of freedom, instead of an underlying property of many systems. This paper is divided as follows. In Sec. II we present the generative models considered, and in Sec. III we describe the model selection procedure based on MDL. In Sec. IV we present the results for a variety of empirical networks. In Sec. V we analyze the general identifiability limits of the overlapping models, and in Sec. VI we describe in detail the inference algorithm used. In Sec. VII we finalize with a discussion.

II. GENERATIVE MODELS FOR NETWORK STRUCTURE
A generative model is one which attributes to each possible graph G a probability P (G|{θ}) for it to be observed, conditioned on some set of parameters {θ}. Here we will be restricted to discrete models, where specific choices of {θ} prohibit some graphs from occurring, but those which are allowed to occur have the same probability. For these models we can write P (G|{θ}) = 1 {θ} (G)/Ω({θ}) = e −S(G|{θ}) , with Ω({θ}) being the total number of possible graphs compatible with a given choice of parameters, 1 {θ} (G) is the indicator function with value one if the graph G belongs to the ensemble constrained by {θ} or zero otherwise, and S(G|{θ}) = ln Ω({θ}) − ln 1 {θ} (G) is the entropy of this constrained ensemble (where it should be understood simply that if 1 {θ} (G) = 0, then S(G|{θ}) is undefined, since the graph has zero probability) [30,31]. In order to infer the parameters {θ} via maximum likelihood, we need to maximize P (G|{θ}), or equivalently, minimize S(G|{θ}). This approach, however, cannot be used if the order of the model is unknown, i.e. the number of degrees of freedom in the parameter set {θ}, since choices with higher order will almost always increase P (G|{θ}), resulting in overfitting. For the same reason, maximum likelihood cannot be used to distinguish between models belonging to different classes, since models with larger degrees of freedom will inherently lead to larger likelihoods. In order to avoid overfitting, one needs to maximize instead the Bayesian posterior probability P ({θ}|G) = P (G|{θ})P ({θ})/P (G), with P (G) being a normalizing constant. The prior probability P ({θ}), which encodes our a priori knowledge of the parameters (if any) should inherently become smaller if the number of degrees of freedom increases. We will also be restricted to discrete parameters with constant prior probabilities, so that P ({θ}) = e −L({θ}) , with L({θ}) being the entropy of the ensemble of possible parameter choices. We can thus write the total posterior likelihood as P ({θ}|G) = e −Σ , with Σ = L({θ}) + S(G|{θ}). The value Σ is the description length of the data [32,33], i.e. the total amount of information required to describe the observed data conditioned on a set of parameters as well as the parameter set itself [34]. Hence, if we maximize P ({θ}|G) we are automatically finding the parameter choice which most compresses the data, since it will also minimize its description length Σ. Because of this, there is no difference between specifying probabilistic models for both G and {θ}, or encoding schemes that quantify the amount of information necessary to describe both. In the following, we will use both terminologies interchangeably.

A. Model without degree correction
Here we consider a simple variation of the stochastic block model [35][36][37][38] with N nodes and E edges, where the nodes can belong to different groups. Hence, to each node we attribute a binary vector b i with B entries, where a given entry b r i ∈ {0, 1} specifies whether or not the node belongs to block r ∈ [1, B]. In addition to this overlapping partition, we simply define the edge-count matrix {e rs }, which specifies how many edges are placed between nodes belonging to blocks r and s (or twice that number for r = s, for convenience of notation), where we have rs e rs = 2E. This simple definition allows one to generate a broad variety of overlapping patterns, which are not confined to purely assortative structures, and the non-overlapping model can be recovered as a special case, simply by putting each node in a single group.
The posterior likelihood of observing a given graph with the above constraints is simply P is the number of possible graphs. In this construction, the existence of multiple edges is allowed. However, the placement of multiple edges between nodes of blocks r and s should occur with a probability proportional to O(e rs /n r n s ), where n r is the number of nodes which belong to block r, i.e. n r = i b r i (note that r n r ≥ N ). Since here we are predominantly interested in the sparse situation where e rs ∼ O(N/B 2 ) and n r ∼ O(N/B), the probability of observing parallel edges will decay as O(1/N ), and hence can be neglected in the large network limit. Making use of this simplification, we may approximately count all possible graphs generated by the parameters { b i }, {e rs } as the number of graphs where each distinct membership of a single node is considered to be a different node with a single membership. This corresponds to an augmented graph generated via a non-overlapping block model with N = r n r nodes, where N ≥ N , but with the same matrix {e rs }, for which the entropy is [31] S t E − 1 2 rs e rs ln e rs n r n s , where S t = ln Ω({ b i }, {e rs }), and n r n s e rs was assumed. Under this formulation, we recover trivially the single-membership case simply by assigning each node to a single group, since Eq. 1 is identical in that special case. It is possible to remove the approximation that no parallel edges occur, by defining the model somewhat differently, as in shown in appendix B 1, in which case the Eq. 1 holds exactly as long as no parallel edges are observed.
Like its non-overlapping counterpart, the block model without degree correction assumes that nodes belonging to the same group will receive approximately the same number of edges of that type. Hence, when applied to empirical data, the modules discovered will also tend to have this property. This means that if the graph possesses large degree variability, the groups inferred will tend to correspond to different degree classes. On a similar vein, if a node belongs to more then one group, it will also tend to have a total degree which is larger than nodes that belong to either group alone, since it will receive edges of each type in an independent fashion. In other words, the group intersections are expected to be strictly denser than the non-overlapping portions of each group. Note that in this respect this model differs from other popular ones, such as the mixed membership stochastic block model (MMSBM) [13], where the density at the intersections is the weighted average of the groups (see appendix B 1).

B. Model with degree correction
In a manner analogous to the non-overlapping model [12], a multiple membership version of the stochastic block model with degree correction can be defined. This can be achieved simply by specifying, in addition to the overlapping partition { b i }, the number of half-edges incident on a given node i which belong to group r, i.e. k r i . Given this labelled degree sequence, one can simply use the same edge count matrix {e rs } as before to generate the graph. If we again make the assumption that the occurrence of parallel edges can be neglected, the total number of graphs fulfilling these constrains is approximately equal to the non-overlapping ensemble where each set of half-edges incident on any given node i that belongs to the same group r is considered as an individual node with degree k r i , for which the entropy is [31] S d −E − 1 2 rs e rs ln e rs e r e s − ir ln k r i !, (2) where e rs ( k 2 r − k r )( k 2 s − k s )/ k 2 r k 2 s n r n s has been assumed. Similarly to the non-degree-corrected case, it is possible to remove the approximation that no parallel edges occur, by using a "Poisson" version of the model, as is shown in appendix B 2. Under this formulation, it can be shown that this model is equivalent to the one proposed by Ball et al [14], although here we keep track of the individual labels on the half-edges as latent variables, instead of their probabilities.
Since we incorporate the labelled degree sequence as parameters to the model, nodes which belong to the same group can have arbitrary degrees. Furthermore, since the same applies to nodes which belong simultaneously to more than one group, the overlap between groups are neither preferably dense nor preferably sparse; it all depends on the parameters {k r i }.

III. MODEL SELECTION
As discussed previously, in order to perform model selection, it is necessary to include the information necessary to describe the model parameters, in addition to the data. The parameters which need to be described are the overlapping partition { b i }, the edge counts {e rs }, and in the case of the degree-corrected model we also need to the describe the labeled degree sequence {k r i }. When choosing an encoding for the parameters we need to avoid redundancy, and describe them as parsimoniously as possible. This means that we need to make few prior assumptions, in order to be able to use observed patterns in the data to compress the parameter description as much as possible. In Bayesian language, we need non-informative priors which express maximal uncertainty about the parameters. On the other hand, we need to fully exploit known or intrinsic constraints, since they should not be learned from the data.
In order to specify the partition { b i }, we assume that all different 2 B − 1 mixtures are not necessarily equally likely, and furthermore the sizes d i = r b r i of the mixtures are also not a priori assumed to follow any specific distribution. More specifically, we consider the mixtures to be the outcome of a generative process with two steps. We first generate the local mixture sizes {d i }, which are sampled uniformly from the distribution with fixed counts {n d }, such that its description length is where D is the maximum value of d, and D being the total number of m-combinations with repetitions from a set of size n. Then, for all n d nodes with the same value of d, we sample a sequence of { b i } from a distribution with support | b i | 1 ≡ r b r i = d and with fixed counts n b , which has a description length where, similarly, ( B d ) n d enumerates the total number of Although it is possible to encode the partition in different ways (e.g. by sampling the membership to each group independently [39]), this choice makes no assumptions regarding the types of overlaps which are more likely to occur, either according to the number of groups to which a node may belong, or the actual combination of groupsit is all left to be learned from data. In particular, it is not a priori assumed that if many nodes belong to groups r and s then the overlap between these two groups will also contain many nodes. As desired, if the observed partition deviates from this pattern, this will be used to compress it further. Only if the observed partition falls squarely into this pattern will further compression not be possible, and we would have an overhead describing it using Eq. 5, when compared to an encoding which expects it a priori. However, one can also see that in the limit n b 1, as the first two terms in Eq. 5 grow asymptotically only with ln N and ln n d , respectively, the whole description length becomes L p is the entropy of the distribution {p x }, which is the optimal limit. Hence if we have a prior which better matches the observed overlap, the difference in description length compared to Eq. 5 will disappear asymptotically for large systems. Another advantage of this encoding is that it incurs no overhead when there are no overlaps at all (i.e. D = 1), and in this case the description length is identical to the non-overlapping case, as defined in Ref. [7].

A. Degree correction
For the degree-corrected model, we need to describe the labeled degree sequence { k i }. We need to do so in a way which is compatible with the partition { b i } so far described, and with edge counts {e rs }, which will restrict the average degrees of each type.
In order to fully utilize the partition { b i }, we describe for each value of b its individual degree sequence { k i | b i = b}, via the counts n b k , i.e. the number of nodes in partition b which possess labeled degrees k. We do so in order to preserve the lack of preference for patterns involving the degrees in the overlaps between groups. Since the model itself is agnostic with respect to the density of the overlaps, not only does this choice remain consistent with this, but also any existing pattern in the degree sequence in the overlaps will be used to construct a shorter description.
In addition, we must also consider the total number of half-edges of a given type r incident on a partition e r b = k k r n b k , which must be compatible with the edge counts {e rs } via e r = s e rs = b e r b . We begin by first distributing all the e r half-edges of type r in all the m r bins corresponding to each nonempty b partition which contains the label r, i.e.
The total number of such partitions is simply mr er . Now we need to distribute the labelled half-edges inside each partition to obtain each degree sequence. The logarithm of the total number of degree sequences fulfilling all necessary constraints is However, most degree sequences uniformly sampled from this set will result in nodes with very similar degrees.
Since we want to profit from degree variability, it is better to condition the description on the counts n b k , i.e. how many nodes in partition b possess labelled degree k. The description in this case becomes where Ξ r b is the enumeration of all possible n b k counts which fulfill the constraints k n b k = n b and k k r n b k = e r b . Unfortunately, this enumeration cannot be done easily in closed form. However, the maximum entropy ensemble where these constraints are enforced on average is analytically tractable, and as we show in appendix C, can be very well approximated by where ζ(x) is the Riemann zeta function. This approximation with "soft" constraints should become asymptotically exact as the number of nodes become large, but otherwise will deviate from the actual entropy. On the other hand, if the number of nodes is very small, describing the degree sequence via Eq. 8 may not provide a shorter description, even if computed exactly. In this situation, Eq. 7 may actually provide a shorter description of the degree sequence. We therefore compute both Eq. 7 and Eq. 8 and choose whatever is shorter. Putting it all together, the description length for the whole degree sequence becomes In the limit n b , and hence the degree sequences in each partition are described close to the optimum.
For the non-overlapping case with D = 1, the description length simplifies to with and ln Ξ r 2 ζ(2)e r . For n r 1 we obtain L κ r H({n r k /n r }). This approximation was used a priori in Ref. [7], but Eq. 11 is a more complete description length of the non-overlapping degree sequence, and its use should be preferred. Hence, like the description length of the overlapping partition, the encoding above offers no overhead when the partition is non-overlapping.

B. Matrix of edge counts
The final piece that needs to be described is the matrix of edge counts {e rs }. We may view this set as an adjacency matrix of a multigraph with B nodes and E = rs e rs /2 edges. The total number of such matri- , and the logarithm of this number can be used as the description length [40]. However, this implicitly assumes that all matrices are equally likely a priori. Not only this is unlikely to be the case, since most networks still possess structure at the block level, but this assumption also leads to a limit in the detection of small groups, with a maximum detectable number of groups scaling as B max ∼ √ N [40]. Similarly to what we did for the node partition and the degree sequence, this can be solved by considering a generative model for the edge counts themselves, with its own set of hyperparameters. Since they correspond to a multigraph, a natural choice is the stochastic block model itself, which has its own set of edge counts, and so on recursively until one has only one group at the top. This nested stochastic block model was proposed in Ref. [7], where it has been shown to reduce the resolution limit to B max ∼ N/ log N , making it often virtually non-existent in practice. Furthermore, since the number of levels and the topology at each level is obtained by minimizing the overall description length, it corresponds to a fully non-parametric way of inferring the multilevel structure of networks. If we denote the observed network to be at the level l = 0 of the hierarchy, then the total description length is with {e l rs }, {n l r } describing the block model at level l, and is the entropy of the corresponding multigraph ensemble and is the description length of the node partition at level l > 0. For the level l = 0 we have L 0 t = L p given by Eq. 5, or L 0 t = L p + L κ for the degree-corrected model. Note that here we use the single-membership nondegree-corrected model at the upper layers. This could be modified to include arbitrary mixtures of degree correction and multiple membership, but we stick with this formulation for simplicity.

C. Significance levels
Minimizing the description length will select the model which is most favored given the evidence in the data. But there will be situations where more than one model describes the data almost equally well, and one would like to be able to rule out alternative models with some degree of confidence. This can be done by computing the posterior probability of observing a given partition according to some version H of the model (e.g. degreecorrected or non-degree-corrected), where P (H) is the prior probability associated with model type H, and P (G) = is obtained by summing over the remaining model parameters. In the case of the degree corrected model (H = DC) they are the {e rs } matrix and the labelled degree sequence { k i } (which is omitted for the non-degree-corrected model, H = NDC), with ∆Σ = Σ a − Σ b being the difference in the description length, and in Eq. 21 it was assumed that P (H a ) = P (H b ) = 1/2, corresponding to a lack of a priori preference for either model variant (which in fact makes Λ identical to the Bayes factor [41] Using the posterior odds ratio Λ is more practical than some alternative model selection approaches, such as likelihood ratios. As has been recently shown [22], the likelihood distribution for the stochastic block model does not follow a χ 2 -distribution asymptotically for sparse networks, and hence the calculation of a p-value must be done via an empirical computation of the likelihood distribution which is computationally costly, and prohibitively so for very large networks. In contrast, computing Λ can be done easily, and it properly accounts for the increased complexity of models with larger parameters, and protects against overfitting.  [42]. The left panel shows the best model with an overlapping partition, and the right the best non-overlapping one. Nodes with a blue halo belong to the Republican faction, as determined in Ref. [42]. For the visualization, the hierarchical edge bundles algorithm [43] was used.

IV. EMPIRICAL NETWORKS
The method outlined in the previous section allows one to determine the best model from the various available choices. Here we analyze some empirical examples, and determine the most appropriate model, and examine the consequences of the balance struck between model complexity and quality of fit. We start with two small networks, the co-appearance of characters in the Victor Hugo novel "Les Misérables" [45], and a network of college American football games [46,47]. For both networks, we obtain the best partition according all model variations and for different number of groups B, and compute the value of Λ relative to the best model, as shown in Fig. 1. For the "Les Misérables" network, the best fit is a non-degree-corrected overlapping model which puts the most central characters in more than one group. All other partitions for different values of B and model types result in values significantly below the plausibility line of Λ = 10 −2 , indicating that the overlapping model offers a better explanation for the data with a large degree of confidence. In particular, it offers a better description than the non-overlapping model with degree correction. For the Football network, on the other hand, the proffered model is non-overlapping and without degree correction with B = 10, which matches very well the assumed correct partition into 10 conferences. The groups are relatively homogeneous, with most nodes having similar degrees, such that degree correction becomes an extra burden, with very little explanatory power. For this net-work, however, there are alternative fits with values of Λ within the plausibility region, which means that the communities are not very strongly defined, and they admit alternative partitions with B = 9 and B = 8 groups which cannot be fully discarded given the evidence in the data.
Degree correction tends to become a better choice for larger data sets, which display stronger degree variability. One example of this is the network of political blogs obtained by Adamic et al [42]. For this network, the best model is a degree-corrected, overlapping partition into B = 7 groups, shown in Fig. 2. Compared to this partition, the best alternative model without overlap divides the network into B = 12 groups 1 , but has a posterior odds ratio significantly below the plausibility region. It should be observed that the non-overlapping version captures well the segregation into two groups (Republicans and Democrats) at the topmost level of the hierarchy. The overlapping version, on the other hand, tends to classify half-edges belonging to different camps into different groups, which is compatible with the accepted division, but the upper layers of the hierarchy do not reflect this, and prefers to merge together groups that belong to different factions, but that have similar roles.  Overlapping partitions, however, do not always provide better descriptions, even in situations where it might be considered more intuitive. One of the contexts where overlapping communities are often considered to be better explanations are in social networks, where different social circles could be represented as different groups (e.g. family, co-workers, friends, etc.), and one could belong to more than one of these groups. This is well illustrated by so-called "ego networks", where one examines only the immediate neighbours of a node, and their mutual connections. One such network, extracted from the Facebook online social network [44], is shown in Fig. 3. The common interpretation of networks such as these is shown on the right in Fig. 3, and corresponds to a partition of the central "ego" node so that it belongs to all of the different circles. Under this description, the ego node is only special in the sense that it belongs to all groups, but inside each group it is just a common member. However, among all model variants, the best fit turns out to be the one where the ego node is put separately in its own group, as shown in the left in Fig. 3. In this example it is easy to see why this is the case. If we observe the degree distribution inside each group for the network on the left, we see that there is no strong degree variation. On the right, as the ego is included in each group, it becomes systematically the most connected node. This is simply by construction, since the ego must connect to every other node. The only situation where the ego would not stand out inside each group, would be if the communities were cliques. Hence, since the ego is not a typical member of any group, it is simpler to classify it separately in its own group, which is selected by the method as a being a more plausible hypothesis. Note that degree correction is not selected as the most plausible solution, since it is burdened with the individual description of every degree in the network, which are fairly uniform with the exception of the ego. One can imagine a different situation where there would be other very well connected nodes inside each group, so that the ego could be described as a common member of each group, but this not observed in any other network obtained in Ref. [44]. Naturally, if one considers the complete network, of which the ego neighbourhood is only a small part, the situation may change, since there may me members of each group to which the ego does not have a direct connection.
When performing model selection for larger networks, it is often the case that the overlapping models are not chosen. In table I are shown the results for many empirical networks belonging to different domains. For the majority of cases, the nonoverlapping degree-corrected models are selected. The are, however, many exceptions which include two social networks (Gowalla and Brightkite [53]), the global airport network of openflights.com, the neuronal network of C. elegans [54], the political blog network already mentioned, the arXiv co-authorship networks [55] [in the fields of general relativity and quantum cosmology (gr-qc), high-energy physics (hep-th), condensed matter (cond-mat), and astronomy (astroph)], co-authorship in network science [56], and the network of genes implicated in diseases [58], for which some version of the overlapping model is chosen. Interestingly, for the arXiv co-authorship network in high-energy physics/phenomenology (hep-ph) a nonoverlapping model is selected instead. For only one of the remaining four arXiv networks (astro-ph), the degreecorrected version of the overlapping model is selected, whereas for the other three the non-degree-corrected version is preferred. Hence, for co-authorship networks the model selection procedure seems to correspond to the intuition that they are composed predominantly of overlapping groups [25].
We take the arXiv cond-mat network as a representative example of the differences between the inferred models. As can seen in Fig. 4, although the degree distribution is very broad, the inferred labelled degree distribution is narrower, meaning that many large-degree nodes can be well explained as having a smaller degree of any single type, but belonging simultaneously to many groups (in the specific context of this network, prolific authors tend to be the ones which belong to many different types of collaborations). The distribution of mixture sizes n d has almost always a maximum at d = 1, meaning that most nodes belong to one group, but with a tail which is comparatively broad (this seems to be a general feature which is observed in the majority of networks analyzed). The distribution of group sizes can be very different, depending on which model is used. Non-overlapping models No. N k log 10 ΛDCO log 10 ΛDC log 10 ΛNDCO log 10 ΛNDC B d Σ/E    name given at the bottom  table), the number of nodes N , the average degree k = 2E/N , the posterior odds ratios relative to the best model for the degree-corrected overlapping (ΛDCO), the degreecorrected non-overlapping (ΛDC), non-degree-corrected overlapping (ΛNDCO) and non-degree-corrected non-overlapping (ΛNDC) models. Missing entries correspond to situations where the best overlapping partition turns out to be nonoverlapping. The last three columns show some parameters of the best model: The number of groups B, the average mixture size d , and the description length per edge (in bits per edge). without degree correction tend to find groups which are strongly correlated with degrees [12], and hence lead to a broad distribution of group sizes when the degree distribution is also broad. On the other hand, either degree correction or group overlap tend to change the distribution considerably. In the literature there are often claims of community sizes following power-law distributions [70][71][72][73] with figures similar to the lower left panel of Fig. 4. Regardless to the validity of this hypothesis for the various methods used in the literature, this is certainly not the case for the overlapping model as shown in the lower right panel of the same figure. Indeed, for most networks analyzed, the model which best fits the data (which tends to be degree-corrected and non-overlapping) shows no vestige of group sizes following a scale-free distribution. Some further examples of this are shown in Fig. 5, where characteristic size scales can be clearly identified.

V. MODEL IDENTIFIABILITY: OVERLAPPING VS. NON-OVERLAPPING
A central issue when selecting between nonoverlapping and overlapping models is to decide when a group of nodes should belong simultaneously to two or more groups, of if these nodes should be better represented by a single membership to a different unique group. The choice is not always immediately obvious, since we can always generate very similar networks with either model. If we generate a network with the overlapping model, but treat it as if it were generated by the non-overlapping model, with each distinct mixture b corresponding to a separate non-overlapping group, the associated entropy will be where is the expected number of edges between mixtures b 1 and b 2 . By exchanging the sums and using Jensen's inequality we observe directly that S t ≤ E − 1 2 rs e rs ln e rs n r n s , with the right-hand side being the entropy of original overlapping model S t , and with the equality holding only if the original model happens to be non-overlapping to begin with. Thus, the non-overlapping model will invariably possess a lower entropy. Nevertheless, the overlapping hypothesis may still be preferred if the number of groups B is sufficiently smaller than the number of individual b mixtures, so that the total description length is shorter. It should be observed, however, that since one model is contained inside the other, the difference in the description length can be interpreted simply as the difference in the prior probabilities for the model parameters. As the amount of available data increases, the effect of the priors should "wash out", and the description length should be increasingly dominated by the model entropy alone. In these cases one should expect the nonoverlapping model to be preferred, regardless of the specific model which was used to generate the data. However, differently than models which generate independent data points, the "amount of available data" for network models is a finer issue. In the case of the stochastic block model it involves the simultaneous scaling of the number of edges E, the number of nodes N and the number of groups B. As a case example, here we consider a simple overlapping assortative model, with e rs = 2E[δ rs c/B + (1 − δ rs )(1 − c)/B(B − 1)], with c ∈ [0, 1] controlling the degree of assortativity. The mixtures are parameterized as n b = C r µ br , with C being a normalization constant, and µ ∈]0, 1] controlling the degree of overlap. For µ → 0 we obtain asymptotically a non-overlapping partition with n r = N/B, and for µ = 1 all mixtures b have the same size. We compare the difference in description length between this model and its equivalent parametrization with each mixture as a separate group. As can be seen in Fig. 6, for any given value of c, there is a value of µ above which the non-overlapping model is preferred. In this parameter region, the group intersections are sufficiently well populated with nodes, so that their representation as individual groups is chosen. For values of µ below this value, the intersections are significantly smaller than the non-overlapping portion. In this case, the data is better explained as larger groups of almost non-overlapping nodes, with few nodes at the intersections. The boundary separating the two regions recedes upwards as the number of groups B is increased, meaning that a larger number of distinct intersections can compensate for a smaller number of non-overlapping nodes. It should also be pointed out that the boundaries move downwards as the number of nodes and edges is increased, such that the average degree in the network remains the same (not shown), so that it is not only the relative sizes of the intersections which are the relevant properties, but also their absolute sizes. The same occurs if the average degree increases and everything else remains constant. Hence in the limit of sufficient data, either with the number of nodes inside each group and intersection becoming sufficiently large, or each part becoming sufficiently dense, the non-overlapping model is the one which will be selected. For empirical networks, this may not be the scaling condition which is more representative, since the most appropriate number of groups and degree of overlap may in fact follow any arbitrary scaling, and hence the overlapping model may still be selected, even for very large or very dense networks. Nevertheless, this example seems to suggest that the non-overlapping model is general enough to accommodate structures generated by the overlapping model in these limiting cases, and may serve as a partial explanation to why the overlapping model is seldom selected in the empirical systems analyzed in Sec. IV.

VI. INFERENCE ALGORITHM
The inference procedure consists in finding the labeling of the half-edges of the graph such that the description length is minimized. Such global optimization problems are often NP-hard, and require heuristics to be solvable in practical time. One possibility is to use Markov chain Monte Carlo (MCMC), which consists in modifying the block membership of each half-edge in a random fashion, and accept or reject each move with a probability given as a function of the description length difference ∆Σ. By choosing the acceptance probabilities in the appropriate manner, i.e. by enforcing ergodicity and detailed balance, one can guarantee that the labellings will be sampled with the correct probability after a sufficiently long equilibration time is reached. However, naive formulations of the Markov chain will lead to very long equilibration times, which become unpractical for large networks. Here we adapt the algorithm developed in Ref. [74] for the nonoverlapping case which implements a fast Markov chain. It consists in the move proposal of each half-edge incident on node i of type r to type s with a probability given by where t is the block label the half-edge opposing a randomly chosen half-edge incident to the same node as the half-edge being moved, and ≥ 0 is a free parameter. Eq. 25 means that we attempt to guess the label of a given half-edge by inspecting the group membership the neighbors of the node to which it belongs, and using the currently inferred model parameters to choose the most likely group to which it should be moved. It should be emphasized that this move proposal does not result in a preference for either assortative or dissortative networks, since it depends only on the matrix {e rs } currently inferred. For any choice of > 0, this move proposal preserves ergodicity, but not detailed balance. This last characteristic can be enforced via the Metropolis-Hastings criterion [75,76] by accepting each move with a probability a given by where p i t is the fraction of opposing half-edges of node i which belong to block t, and p(s → r|t) is computed after the proposed r → s move (i.e. with the new values of e tr ), whereas p(r → s|t) is computed before. The parameter β in Eq. 26 is an inverse temperature, which can be used to sample partitions according to their description length (β = 1) or to find the ground state (β → ∞). As explained in Ref. [74], this move proposal as well as the computation of a can be done efficiently, with minimal book-keeping, so that a sweep of the network (where each half-edge move is attempted once) is done in time O(E), independent of the number of groups B. This is true even in the overlapping case, since updating Eqs. 1, 2, 5 and 10 after each half-edge move can be done in time O(1).
As discussed in Ref. [74], although the MCMC method above succeeds in equilibrating faster than a naive Markov chain, it still suffers from a strong dependence on how close one starts from the global minimum. Usually, starting from a random partition of the half-edges leads to metastable states where the Markov chain seems to have equilibrated, but in fact the network structure has only been partially discovered, and will move from such configurations only after a very long time. This is a problem common to many inference procedures based on local moves such as expectation maximization [14] and belief propagation [77]. In Ref. [74] a multilevel agglomerative heuristic was proposed, which significantly alleviates this problem. It consists in equilibrating the chain for a larger number of groups, and then merging the groups using the same algorithm used for the block membership moves. This method, however, cannot be used unmodified in the overlapping case, since the strict merging of groups will not properly explore the landscape of possible overlap- ping partitions. We therefore modify the approach as follows. Before groups are merged, the half-edges belonging to each one of them are split into subgroups corresponding to the different group memberships at the opposing sides. These subgroups are then treated as separate groups, and are merged together until the desired number of groups is achieved. All the details of the algorithm beyond this modification are performed exactly as described in Ref. [74]. Since this algorithm usually does a good job in finding the a partition very close to the final one, it also tends to perform very well when the algorithm is turned into a greedy heuristic, by starting with B = 2E and each half-edge in its own group, and by making β → ∞. An example of a typical outcome of the greedy algorithm is show in Fig. 7. The greedy version is very fast, with an overall complexity of O(E ln 2 E), which makes it usable for very large networks. Note that this complexity is independent on the number of groups, B. This is a strong contrast to other methods proposed for the same problem, such as the stochastic optimization algorithm of Gopalan et al [78], and the expectation maximization algorithm of Ball et al [14], both of which have a complexity of O(EB) per sweep, although they only consider strictly assortative models, and applying the same techniques to the more general models considered here would lead to a O(EB 2 ) complexity, similar to belief propagation algorithms for non-overlapping models [22,77]. Although these approaches can be very efficient if the number of groups is very small, they quickly become prohibitive if the most appropriate number of groups scales as some function of the system size (which seems to be generally the case when model selection is applied, see table I and Ref. [7]), which is not an issue with the algorithm described above. It should also be noted that none of the other algorithms mentioned [14,22,77,78] are designed to overcome metastable solutions, like the multilevel approach presented here. For most networks analyzed in this work, the fast heuristic version of the algorithm was used, together with the algorithm described in Ref. [7] to infer the upper layers of the hierarchy (which includes the determination of the number of groups B at the lowest level, in addition to the entire hierarchy, in a non-parametric fashion) 2 .

VII. CONCLUSION
We presented a method of inferring overlapping and degree-corrected versions of the stochastic block model based on the minimum description length principle (MDL) that avoids overfitting and allows for the comparison between model classes. Based on a Bayesian interpretation of MDL, we derived a posterior odds ratio test that yields a degree of confidence with which models can be selected or discarded. In applying this method to a variety of empirical networks, we obtained that for the majority of them the non-overlapping degree-corrected model variant is the one which best fits the data.
Although overlapping structures are often considered to be more reasonable explanations for some networks, we showed that in many representative cases the nonoverlapping model can accommodate the same structure while providing a more parsimonious description of the data. We expect this fact to bear on tasks which require high quality fits, such as the prediction of missing or spurious links [6,8], or other generalizations of the data.
The models considered in this work generate unlabeled networks, without any other properties associated with the nodes or edges. However, it is often the case that either the nodes or edges have weights [18] or are of different types [17,20], or have temporal information [21]. This sort of additional data may corroborate the evidence supporting the generation via a specific type of model (e.g. with overlaps) and tip the scale towards it. The approach presented here is generalizable to these cases as well, by augmenting the model to generate covariates associated with the edges and nodes. Furthermore, one should be able to perform a similar comparison with models which belong to very different classes, such as latent space models [80], or others.
The same approach of the main text can be carried over to directed graphs with no difficulties. In this case the edge counts are in general asymmetric, e rs = e sr , which leads to the entropy for the non-degree-corrected model [31] S t E − rs e rs ln e rs n r n s .
For the degree-corrected case, there are two degree sequences for the labelled out-and in-degrees, {k + r i } and {k − r i }, respectively. Applying the same argument as for the undirected case, the entropy becomes [31] where e + r = s e rs and e − r = s e sr . The description length for the overlapping partition is identical to the undirected case, with L p given by Eq. 5. For the labeled degree sequence, we have instead and k + , k − k + r n b k + , k − and e r b − = k + , k − k − r n b k + , k − , respectively, which give the total number of out-and in-edges incident on the mixture b. In the previous equations the counts n b k + , k − refer to the joint distribution of labelled in-and out-degrees, so that each vector k +/− describes the in-and out-degrees labelled according to degree membership, i.e. This approximation of the formulation with "hard" constraints of the multiple membership model discussed in the main text is closely related to a Poisson variant of the model with "soft" constraints, where each half-edge of the graph is labeled with a latent variable specifying which group memberships were responsible for its existence, and the number of edges of type (r, s) between nodes i and j, A rs ij , is independently sampled according to a Poisson distribution, so that the likelihood becomes where p rs is the average number of edges of type (r, s) between nodes which belong to each group. The loglikelihood can be written as ln P = 1 2 rs e rs ln p rs − n r n s p rs − i>j r≥s Maximizing ln P w.r.t. p rs , we obtainp rs = e rs /n r n s , and hence lnP = −E+ 1 2 rs e rs ln e rs n r n s − i>j r≥s For simple graphs with A rs ij ∈ {0, 1}, the last term in the above equation is equal zero, and we have that the approximation of the likelihood of the model with "hard" constraints in the sparse case is identical to the exact maximum likelihood of the Poisson model with "soft" constraints.
This model is similar to the popular mixed membership stochastic block model (MMSBM) [13], however it differs in the important aspect that it generates strictly denser overlaps. In the MMSBM, the existence of an edge A ij is sampled from a Bernoulli distribution with parameter λ ij = rs θ r i θ s j p rs , where θ r i is the probability that node i belongs to group r, such that r θ r i = 1, and p rs ∈ [0, 1] is the probability that two nodes belonging to groups r and s are connected. Although for sparse graphs the difference between Poisson and Bernoulli models tend to disappear, with this parametrization the density of the overlaps are mixed with normalized weights. More specifically, for a node i which belongs simultaneously to groups r and s, its expected degree is equal to the weighted average of the unmixed degrees, k i = θ r i k r + θ s i k s , where k r = s p rs i θ s i is the expected degree of a node that belongs only to group r. Thus, in the MMSBM the nodes in the mixture have an intermediate density between the sparser and the denser groups. In contrast, in the model considered in the main text, as well as the Poisson model above, we have simply k i = k r + k s , and therefore the overlaps are always strictly denser than the pure groups. In this respect, it is equivalent to other formulations of the MMSBM, e.g. Refs. [81,82].

Degree-corrected
A connection to a version of the model with "soft" constraints can also be made. We may consider each labelled entry A rs ij in the adjacency matrix to be Poisson distributed with an average given by θ r i θ s j λ rs , P (G|{ b i }, {λ rs }, {θ r }) = i>j r≥s (θ r i θ s j λ rs ) A rs ij e −θ r i θ s j λrs /A rs ij !, (B4) where A rs ij is the number of edges of type (r, s) between nodes i and j, and θ r i is the propensity with which a node receives an edge of type r. The log-likelihood can be written as Again, for simple graphs with A rs ij ∈ {0, 1}, the last term in the above equation is equal zero, however even in that case the likelihood is not identical to the version with "hard" constraints considered above, as is the case for the single membership version as well [31]. Both likelihoods only become the same in the limit k r i 1 such that ln k r i ! k r i ln k r i − k r i . Nevertheless, for the purpose of this paper, which is classification of empirical networks, the differences between these models can be overlooked.
There is a direct connection between this model and the one proposed by Ball et al [14]. In the not strictly assortative version of their model, the number of edges A ij is distributed according to a Poisson with average λ ij = rs η r i η s j ω rs , where η r i is the propensity with which node i receives edges of type r and ω rs regulates the number of edges across groups. The total likelihood of that model is Since the sum of independent Poisson random variables is also distributed according to a Poisson, if we generate a graph with the model of Eq. B4 and observe only the total unlabelled edge counts A ij = rs A rs ij , they are distributed exactly like Eq. B7, for the same choice of parameters θ r i = η r i and λ rs = ω rs . Hence, the model of the main text is an equivalent formulation of the one in Ref. [14] where one keeps track of the latent variables specifying the exact type of each half-edge, instead of their marginal probability. This has the advantage that the maximum likelihood estimates for the model parameters λ rs and θ r i can be obtained directly by differentiation, and do not require iterations of an EM algorithm as in Ref. [14]. On the other hand we are left with the determination of labels in the half-edges, which is done with the method already described in Sec. VI.

Appendix C: Maximum entropy ensemble of counts with constrained average
Suppose we want to compute the number of all possible non-negative integer counts {n k }, subject to a normalization constraint ∞ k=0 n k = N and a fixed average ∞ k=0 kn k = E. This can be obtained approximately, by relaxing the constraints so that they hold only on average. The maximum entropy ensemble given these constraints is the one with the probabilities P ({n k }) = e −H({n k }) /Z, with H({n k }) = λ k n k + µ k kn k , where λ and µ are the Lagrange multipliers which keep the constraints in place. This is mathematically analogous to a simple Bose gas with energy levels given by k. The partition function is given by with The average counts are given by n k = −∂Z k /∂λ = [exp(λ + µk) − 1] −1 , and the parameters λ and µ are determined via the imposed constraints, Eq. C5 can be inverted as e −λ = 1 − exp(−N/µ), but Eq. C6 cannot be solved for λ in closed form. However, by assuming a sufficiently "high temperature" regime where µ ∼ O(1), we have that the fugacity simplifies in the thermodynamic limit, e −λ → 1 for N 1, and hence we obtain µ Li 2 (1)/E. Using Eqs. C5 and C6, we can write the entropy of the ensemble ln Ξ = − k [∂ ln Z k /∂λ + ∂ ln Z k /∂µ + ln Z k ], as and for the regime e −λ → 1, we have where the identity Li 2 (1) = ζ(2) was used. Although Eq. C8 becomes asymptotically exact in the thermodynamic limit with E ∼ N and N 1, the exact solution can also be obtained with arbitrary precision simply by iterating Eqs. C5 and C6 asλ(t+1) = 1−exp(−N/µ(t)), µ(t+1) = E/ Li 2 (λ(t)), whereλ ≡ e −λ , with the starting pointsλ(0) = 1, µ(0) = Li 2 (1)/E, until sufficient convergence is reached, and the results are substituted in Eq. C7. (We actually use this more precise procedure when computing Eq. 8 in the main text, throughout the analysis.)