Network structure, metadata and the prediction of missing nodes

The empirical validation of community detection methods is often based on available annotations on the nodes that serve as putative indicators of the large-scale network structure. Most often, the suitability of the annotations as topological descriptors is itself not assessed, and without this it is not possible to ultimately distinguish between actual shortcomings of the community detection algorithms on the one hand, and the incompleteness, inaccuracy or structured nature of the data annotations themselves on the other. In this work we present a principled method to assess both aspects simultaneously. We construct a joint generative model for the data and metadata, and a non-parametric Bayesian framework to infer its parameters from annotated datasets. We assess the quality of the metadata not according to its direct alignment with the network communities, but rather in its capacity to predict the placement of edges in the network. We also show how this feature can be used to predict the connections to missing nodes when only the metadata is available. By investigating a wide range of datasets, we show that while there are seldom exact agreements between metadata tokens and the inferred data groups, the metadata is nevertheless often informative of the network structure, and can improve the prediction of missing nodes. This shows that the method uncovers meaningful patterns in both the data and metadata, without requiring or expecting a perfect agreement between the two.


I. INTRODUCTION
The network structure of complex systems determines their function and serves as evidence for the evolutionary mechanisms that lie behind them. However, very often their large-scale properties are not directly accessible from the network data, and need to be indirectly derived via nontrivial methods. The most prominent example of this is the task of identifying modules or "communities" in networks, which has driven a substantial volume of research in recent years [1][2][3]. Despite these efforts, it is still an open problem both how to characterize such large-scale structures and how to effectively detect them in real systems. In order to assist in bridging this gap, many researchers have compared the features extracted from such methods with known information (metadata, or "ground truth") that putatively corresponds to the main indicators of large-scale structure [4][5][6]. However, this assumption is often accepted at face value, even when such metadata may contain a considerable amount of noise, be incomplete, or simply be irrelevant to the network structure. Because of this, it is not yet understood whether the discrepancy observed between the metadata and the results obtained with community detection methods [4,7] is mainly due to the ineffectiveness of such methods, or to the lack of correlation between the metadata and actual structure.
* darko.hric@aalto.fi † tiago@itp.uni-bremen.de ‡ santo.fortunato@aalto.fi
In this work, we present a principled approach to address this issue. The central stance we take is to make no fundamental distinction between data and metadata, and to construct generative processes that account for both simultaneously. By inferring this joint model from the data and metadata, we are able to precisely quantify the extent to which the data annotations are related to the network structure, and vice versa. This is different from approaches that explicitly assume that the metadata (or a portion thereof) are either exactly or approximately correlated with the best network division [8][9][10][11][12][13][14]. With our method, if the metadata happens to be informative on the network structure, we are able to determine how; but if no correlation exists between the two, this gets uncovered as well. Our approach is more in line with a recent method by Newman and Clauset [15], who proposed using available metadata to guide prior probabilities on the network partition, but here we introduce a framework that is more general in three important ways. Firstly, we do not assume that the metadata is present in such a way that it corresponds simply to a partition of the nodes. While the latter can be directly compared to the outcome of conventional community detection methods, or used as priors in the inference of typical generative models, the majority of datasets contain much richer metadata, where nodes are annotated multiple times, with heterogeneous annotation frequencies, such that often few nodes possess the exact same annotations. Secondly, we develop a nonparametric Bayesian inference method that requires no prior information or ad hoc parameters to be specified, such as the number of communities.
And thirdly, we are able not only to obtain the correlations between structure and annotations based on statistical evidence, but also to assess the metadata in its power to predict the network structure, instead of simply its correlation with latent partitions. This is done by leveraging the information available in the metadata to predict missing nodes in the network. This contrasts with the more common approach of predicting missing edges [16,17], which cannot be used when entire nodes have not been observed and need to be predicted. Furthermore, our method is also capable of clustering the metadata themselves, separating them into equivalence classes according to their occurrence in the network. As we show, both features allow us to distinguish informative metadata from less informative ones, with respect to the network structure.
In the following we describe our method and illustrate its use with some examples based on real data. We then follow with a systematic analysis of many empirical datasets, focusing on the prediction of nodes from metadata alone. We show that the predictiveness of network structure from metadata is widely distributed, both across and within datasets, indicating that typical network annotations vary greatly in their connection to network structure.

II. JOINT MODEL FOR DATA AND METADATA
Our approach is based on a unified representation of the network data and metadata. We assume here the general case where the metadata is discrete, and may be arbitrarily associated with the nodes of the network. We do so by describing the data and metadata as a single graph with two node and edge types (or layers [18,19]), as shown in Fig. 1. The first layer corresponds to the network itself (the "data"), where an edge connects two "data" nodes, with an adjacency matrix $A$, where $A_{ij} = 1$ if an edge exists between two data nodes $i$ and $j$, or $A_{ij} = 0$ otherwise. This layer would correspond to the entire data if the metadata were to be ignored. In the second layer both the data and the metadata nodes are present, and the connection between them is represented by a bipartite adjacency matrix $T$, where $T_{ij} = 1$ if node $i$ is annotated with a metadata token $j$ (henceforth called a tag node), or $T_{ij} = 0$ otherwise. Therefore, a single data node can be associated with zero, one or multiple tags, and likewise a single tag node may be associated with zero, one or multiple data nodes. Within this general representation we can account for a wide spectrum of discrete node annotations. In particular, as will become clearer below, we make no assumption that individual metadata tags actually correspond to specific disjoint groups of nodes.

Figure 1. Schematic representation of the joint data-metadata model (layers: data, $A$; metadata, $T$). The data layer is composed of data nodes and is described by an adjacency matrix $A$, and the metadata layer is composed of the same data nodes, as well as tag nodes, and is described by a bipartite adjacency matrix $T$. Both layers are generated by two coupled degree-corrected SBMs, where the partition of the data nodes into groups is the same in both layers.
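As a minimal sketch of this two-layer representation, the following Python snippet builds the adjacency matrix A and the bipartite matrix T for a small toy dataset. The node count, edge list, annotation dictionary and the helper name `build_layers` are hypothetical and chosen only for illustration; they do not come from the datasets analyzed in this work.

```python
import numpy as np

# Hypothetical toy dataset: 4 data nodes, 3 tags.
edges = [(0, 1), (1, 2), (2, 3)]                    # undirected data-layer edges
annotations = {0: [0], 1: [0, 1], 2: [], 3: [1, 2]}  # data node -> list of tags


def build_layers(n_nodes, n_tags, edges, annotations):
    """Return the data adjacency matrix A and the bipartite tag matrix T."""
    A = np.zeros((n_nodes, n_nodes), dtype=int)
    for i, j in edges:
        A[i, j] = A[j, i] = 1          # undirected edge between data nodes
    T = np.zeros((n_nodes, n_tags), dtype=int)
    for i, tags in annotations.items():
        for t in tags:
            T[i, t] = 1                # data node i is annotated with tag t
    return A, T


A, T = build_layers(4, 3, edges, annotations)
k = A.sum(axis=1)        # degrees of the data nodes in the data layer
d_nodes = T.sum(axis=1)  # number of tags on each data node
d_tags = T.sum(axis=0)   # number of data nodes annotated by each tag
```

Note that node 2 carries no tag and node 1 carries two, so the annotations do not define a partition of the nodes, which is exactly the general situation the representation is meant to cover.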
We construct a generative model for the matrices $A$ and $T$ by generalizing the hierarchical stochastic block model (SBM) [20] with degree-correction [21] for the case with edge layers [22]. In this model, the nodes and tags are divided into $B_d$ and $B_t$ groups, respectively. The number of edges between data groups $r$ and $s$ is given by the parameters $e_{rs}$ (or twice that for $r = s$), and between data group $r$ and tag group $u$ by $m_{ru}$. Both data and tag nodes possess fixed degree sequences, $\{k_i\}$ and $\{d_i\}$, for the data and metadata layers, respectively, corresponding to an additional set of parameters. Given these constraints, a graph is generated by placing the edges randomly in both layers independently, with a joint likelihood

$$P(A, T | b, \theta, c, \gamma) = P(A | b, \theta)\, P(T | b, c, \gamma), \tag{1}$$

where $b = \{b_i\}$ and $c = \{c_i\}$ are the group memberships of the data and tag nodes, respectively, and both $\theta = (\{e_{rs}\}, \{k_i\})$ and $\gamma = (\{m_{ru}\}, \{d_i\})$ are shorthands for the remaining model parameters in both layers. Inside each layer, the log-likelihood is [21,23]

$$\ln P(A | b, \theta) \approx -E - \frac{1}{2} \sum_{rs} e_{rs} \ln \frac{e_{rs}}{e_r e_s} - \sum_i \ln k_i!, \tag{2}$$

where $E$ is the number of edges in the layer and $e_r = \sum_s e_{rs}$, and analogously for $P(T | b, c, \gamma)$. Since the data nodes have the same group memberships in both layers, this provides a coupling between them, and we have thus a joint model for data and metadata. This model is general, since it is able to account simultaneously for the situation where there is a perfect correspondence between data and metadata (for example, when $B_d = B_t$ and the matrix $m_{ru}$ connects one data group to only one metadata group), when the correspondence is non-existent (the matrix $T$ is completely random, with $B_t = 1$), as well as any elaborate relationship between data and metadata in between. In principle, we could fit the above model by finding the model parameters that maximize the likelihood in Eq. 1. Doing so would uncover the precise relationship between data and metadata under the very general assumptions taken here.
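The quantities that enter Eq. 2 (the block edge counts, their row sums, the degree sequence and the number of edges) can be read off directly from an adjacency matrix and a partition. The sketch below illustrates this for a single layer; the function name `block_statistics` and the toy graph are hypothetical, and the snippet assumes a simple undirected graph stored as a dense matrix with group labels 0, ..., B-1.

```python
import numpy as np


def block_statistics(A, b):
    """Collect the sufficient statistics of one layer for a given partition.

    Returns (e, e_r, k, E): e[r, s] counts edges between groups r and s
    (twice the number of internal edges on the diagonal), e_r are its row
    sums, k is the degree sequence and E the total number of edges.
    """
    A = np.asarray(A)
    b = np.asarray(b)
    n_groups = int(b.max()) + 1
    membership = np.eye(n_groups, dtype=int)[b]   # N x B indicator matrix
    e = membership.T @ A @ membership             # block edge counts
    e_r = e.sum(axis=1)
    k = A.sum(axis=1)
    E = int(A.sum()) // 2
    return e, e_r, k, E


# Hypothetical toy graph: a triangle (0, 1, 2) with a pendant node 3.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
b = [0, 0, 1, 1]
e, e_r, k, E = block_statistics(A, b)
```

Because every edge contributes one unit to the degree of each endpoint, the entries of `e_r` always sum to twice the number of edges, which is a convenient consistency check.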
However, for this approach to work, we need to know a priori the number of groups $B_d$ and $B_t$. This is because the likelihood of Eq. 1 is parametric (i.e. it depends on the particular choices of $b$, $c$, $\theta$ and $\gamma$), and the degrees of freedom in the model will increase with $B_d$ and $B_t$. As the degrees of freedom increase, so will the likelihood, and the perceived quality of fit of the model. If we follow this criterion blindly, we will put each node and metadata tag in its own group, and our matrices $e_{rs}$ and $m_{rs}$ will correspond exactly to the adjacency matrices $A$ and $T$, respectively. This is an extreme case of overfitting, where we are not able to differentiate random fluctuations in the data from actual structure that should be described by the model. The proper way to proceed in this situation is to make the model nonparametric, by including Bayesian priors on the model parameters $P(b)$, $P(c)$, $P(\theta)$ and $P(\gamma)$, as described in Refs. [20,24]. By maximizing the joint nonparametric likelihood $P(A, T, b, \theta, c, \gamma) = P(A, T | b, \theta, c, \gamma) P(b) P(\theta) P(c) P(\gamma)$ we can find the best partition of the nodes and tags into groups, together with the number of groups themselves, without overfitting. This happens because, in this setting, the degrees of freedom of the model are themselves sampled from a distribution, which will intrinsically ascribe higher probabilities to simpler models, effectively working as a penalty on more complex ones. An equivalent way of justifying this is to observe that the joint likelihood can be expressed as $P(A, T, b, \theta, c, \gamma) = 2^{-\Sigma}$, where $\Sigma$ is the description length of the data, corresponding to the number of bits necessary to encode both the data according to the model parameters as well as the model parameters themselves.
Hence, maximizing the joint Bayesian likelihood is identical to the minimum description length (MDL) criterion [25,26], which is a formalization of Occam's razor, where the simplest hypothesis is selected according to the statistical evidence available. We note that there are some caveats when selecting the prior probabilities above. In the absence of a priori knowledge, the most straightforward approach is to select flat priors that encode this, and ascribe the same probability to all possible model parameters [27]. This choice, however, incurs some limitations. In particular, it can be shown that with flat priors it is not possible to infer with the SBM a number of groups that exceeds an upper threshold that scales as $B_{\max} \sim \sqrt{N}$, where $N$ is the number of nodes in the network [28]. Additionally, flat priors are unlikely to be good models for real data, which tend to be structured, albeit in an unknown way. An alternative, therefore, is to postpone the decision on the prior until we observe the data, by sampling the prior distribution itself from a hyperprior. Of course, in doing so, we face the same problem again when selecting the hyperprior. For the model at hand, we proceed in the following manner: since the matrices $\{e_{rs}\}$ and $\{m_{rs}\}$ are themselves adjacency matrices of multigraphs (with $B_d$ and $B_d + B_t$ nodes, respectively), we sample them from another set of SBMs, and so on, following a nested hierarchy, until the trivial model with $B_d = B_t = 1$ is reached, as described in Ref. [20]. For the remaining model parameters we select only two-level Bayesian hierarchies, since it can be shown that higher-level ones yield only negligible improvements asymptotically [24]. We review and summarize the prior probabilities in Appendix A. With this Bayesian hierarchical model, we not only significantly increase the resolution limit to $B_{\max} \sim N / \ln N$ [20], but are also able to provide a description of the data at multiple scales.
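To give a rough sense of how different the two resolution limits quoted above are, the following sketch evaluates both scalings for a few network sizes. The function names are hypothetical, and since the relations only hold up to unspecified constants, the numbers should be read as orders of magnitude rather than exact thresholds.

```python
import math


def max_groups_flat_prior(n_nodes):
    """Scaling of the maximum detectable number of groups with flat priors."""
    return math.sqrt(n_nodes)


def max_groups_nested_prior(n_nodes):
    """Scaling of the maximum detectable number of groups with the nested hierarchy."""
    return n_nodes / math.log(n_nodes)


for n in (10**3, 10**4, 10**6):
    flat = max_groups_flat_prior(n)
    nested = max_groups_nested_prior(n)
    print(f"N = {n:>7}: sqrt(N) ~ {flat:.0f}, N / ln N ~ {nested:.0f}")
```

The gap between the two scalings widens with the network size, which is why the nested hierarchy matters most for large datasets with many small groups.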
It is important to emphasize that we are not restricting ourselves to purely assortative structures, as is the case in most of the community detection literature, but rather we are open to a much wider range of connectivity patterns that can be captured by the SBM. As mentioned in the introduction, our approach differs from the parametric model recently introduced by Newman and Clauset [15], where it is assumed that a node can connect to only one metadata tag, and each tag is parametrized individually. In our model, a data node can possess zero, one or more annotations, and the tags are clustered into groups. Therefore our approach is suitable for a wider range of data annotations, where entire classes of metadata tags can be identified. Furthermore, since their approach is parametric, the appropriate number of groups must be known beforehand, instead of being obtained from data, which is seldom possible in practice. Additionally, when employing the fast MCMC algorithm developed in Ref. [30], the inference procedure scales linearly as $O(N)$ (or log-linearly as $O(N \ln^2 N)$ when obtaining the full hierarchy [20]), where $N$ is the number of nodes in the network, independently of the number of groups, in contrast to the expectation-maximization with belief propagation of Ref. [15], which scales as $O(B^2 N)$, where $B$ is the number of groups being inferred. Hence, our method scales well not only for large networks, but also for arbitrarily large numbers of communities. An implementation of our method is freely available as part of the graph-tool library [31] at http://graph-tool.skewed.de.
This joint approach of modelling the data and metadata allows us to understand in detail the extent to which network structure and annotations are correlated, in a manner that favors neither over the other. Importantly, we do not interpret the individual tags as "ground truth" labels on the communities, and instead infer their relationships with the data communities from the entire data. Because the metadata tags themselves can be clustered into groups, we are able to assess both their individual and collective roles. For instance, if two tag nodes are assigned to the same group, this means that they are both similarly informative on the network structure, even if their target nodes are different. By following the inferred probabilities between tag and node groups, one obtains a detailed picture of their correspondence, which can deviate in principle (and often does in practice) from the commonly assumed one-to-one mapping [4,7], but includes it as a special case.
Before going into the systematic analysis of empirical datasets, we illustrate the application of this approach with a simple example, corresponding to the network of American college football teams [32], where the edges indicate that a game occurred between two teams in a given season. For this dataset, the "conference" to which each team belongs is also available. Since it is expected that teams in the same conference play each other more frequently, this is assumed to be an indicator of the network structure. If we fit the above model to this dataset, both the nodes (teams) and tags (conferences) are divided into $B_d = 10$ and $B_t = 10$ groups, respectively (Fig. 2). Some of the conferences correspond exactly to the inferred groups of teams, as one would expect. However, other conferences are clustered together, in particular the independents, meaning that although they are collectively informative on the network structure, individually they do not serve as indicators of the network topology in a manner that can be conclusively distinguished from random fluctuations.
In Fig. 2 we used the conference assignments presented in Ref. [33], which are different from the original assignments in Ref. [32], due to a mistake in the original publication, where the information from the wrong season was used instead [34]. We use this as an opportunity to show how errors and noise in the metadata can be assessed with our method, while at the same time we emphasize an important application, namely the prediction of missing nodes. We describe it in general terms, and then return to our illustration afterwards.

Figure 2. Joint data-metadata model inferred for the network of American football teams [32]. (a) Hierarchical partition of the data nodes (teams), corresponding to the "data" layer. (b) Partition of the data (teams) and tag (conference) nodes, corresponding to the second layer. (c) Average predictive likelihood of missing nodes relative to using only the data (discarding the conferences), using the original conference assignment of Ref. [32] (GN) and the corrected assignment of Ref. [33] (TE).

[...] maximum likelihood, but cannot be used to select the model order (via the number of groups) as we do here, for the reasons explained in the text (see also Ref. [29]).

A. Prediction of missing nodes
To predict missing nodes, we must compute the likelihood of all edges incident on a missing node simultaneously, i.e. for an unobserved node $i$ they correspond to the $i$th row of the augmented adjacency matrix, $a_i = \{A'_{ij}\}$, with $A'_{kj} = A_{kj}$ for $k \neq i$. If we know the group membership $b_i$ of the unobserved node, in addition to those of the observed nodes, the likelihood of the missing incident edges, $P(a_i | A, b_i, b)$, is computed using $\hat\theta$ and $\hat\theta'$, which are the only choices of parameters compatible with the node partition. However, we do not know a priori to which group the missing node belongs. If we have only the network data available (not the metadata), the only choice we have is to make the probability conditioned on the observed partition,

$$P(a_i | A, b) = \sum_{b_i} P(a_i | A, b_i, b)\, P(b_i | b), \tag{5}$$

where $P(b_i | b) = P(b_i, b) / P(b)$. This means that we can use only the distribution of group sizes to guide the placement of the missing node, and nothing more. However, in practical scenarios we may have access to the metadata associated with the missing node. For example, in a social network we might know the social and geographical indicators (age, sex, country, etc.) of a person for whom we would like to predict unknown acquaintances. In our model, this translates to knowing the corresponding edges in the tag-node graph $T$. In this case, we can compute the likelihood of the missing edges in the data graph as

$$P(a_i | A, T, b, c) = \sum_{b_i} P(a_i | A, b_i, b)\, P(b_i | T, b, c), \tag{7}$$

where the node membership distribution $P(b_i | T, b, c)$ is weighted by the information available in the full tag-node graph, and where again $\hat\gamma$ and $\hat\gamma'$ are the only choices of parameters compatible with the partitions $c$ and $b$. If the metadata correlates well with the network structure, the above distribution should place the missing node with a larger likelihood in its correct group. In order to quantify the relative predictive improvement of the metadata information for node $i$, we compute the predictive likelihood ratio $\lambda_i \in [0, 1]$,

$$\lambda_i = \frac{P(a_i | A, T, b, c)}{P(a_i | A, T, b, c) + P(a_i | A, b)},$$

which should take on values $\lambda_i > 1/2$ if the metadata improves the prediction task, or $\lambda_i < 1/2$ if it deteriorates it.
The latter can occur if the metadata misleads the placement of the node (we discuss below the circumstances where this can occur).
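The comparison described above can be sketched numerically as follows. The snippet assumes that the log-likelihoods of the missing edges for each candidate group, together with two membership distributions for the missing node (one based on group sizes only, one informed by the metadata), have already been obtained from a fitted model; all numbers and function names here are hypothetical. The ratio is taken to be the metadata-informed likelihood divided by the sum of both likelihoods, which is the form consistent with values in [0, 1] and a neutral value of 1/2, and it is evaluated in log space to avoid underflow.

```python
import math


def log_sum_exp(values):
    """Numerically stable logarithm of a sum of exponentials."""
    peak = max(values)
    if peak == -math.inf:
        return -math.inf
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def marginal_log_likelihood(log_edge_likelihoods, group_probabilities):
    """Log of the sum over groups of P(edges | group) * P(group)."""
    terms = [ll + math.log(p)
             for ll, p in zip(log_edge_likelihoods, group_probabilities)
             if p > 0]
    return log_sum_exp(terms)


def likelihood_ratio(log_p_with_metadata, log_p_data_only):
    """Ratio P_meta / (P_meta + P_data), computed from log-likelihoods."""
    diff = log_p_data_only - log_p_with_metadata
    if diff > 700:               # exp(diff) would overflow; the ratio is ~0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


# Hypothetical inputs for one missing node and three candidate groups.
log_edge_likelihoods = [-10.0, -14.0, -15.0]   # ln P(a_i | A, b_i, b)
size_based = [0.5, 0.3, 0.2]                   # membership from group sizes only
metadata_based = [0.9, 0.05, 0.05]             # membership informed by the tags

lam = likelihood_ratio(
    marginal_log_likelihood(log_edge_likelihoods, metadata_based),
    marginal_log_likelihood(log_edge_likelihoods, size_based),
)
```

In this toy case the metadata concentrates the membership on the group under which the missing edges are most likely, so `lam` comes out above 1/2; metadata that pointed to the wrong group would push it below 1/2.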
In order to illustrate this approach we return to the American football data, and compare the original and corrected conference assignments in their capacity to predict missing nodes. We do so by removing a node from the network, inferring the model on the modified data, and computing its likelihood according to Eq. 5 and Eq. 7, which we use to compute the average predictive likelihood ratio for all nodes in the network, $\langle\lambda\rangle = \sum_i \lambda_i / N$. As can be seen in Fig. 2c, including the metadata improves the prediction significantly, and indeed we observe that the corrected metadata noticeably improves the prediction when compared to the original inaccurate metadata. In short, knowing to which conference a football team belongs does indeed increase our chances of predicting against which other teams it will play, and we may do so with a higher success rate using the current conference assignments, rather than using those of a previous year. These are hardly surprising facts in this illustrative context, but the situation quickly becomes less intuitive for datasets with hundreds of thousands of nodes and a comparable number of metadata tags, for which only automated methods such as ours can be relied upon.

III. EMPIRICAL DATASETS
We performed a survey of several network datasets with metadata (described in detail in Appendix B), where we removed a small random fraction of annotated nodes (1% or 100 nodes, whichever is smaller) many times, and computed the likelihood ratio $\lambda_i$ above for every removed node. The average value for each dataset is shown in Fig. 3. We observe that for the majority of datasets the metadata is capable of improving the prediction of missing nodes, with the quality of the improvement being relatively broadly distributed. While this means that there is a positive and statistically significant correlation between the metadata and the network structure, for some datasets this leads only to moderate predictive improvements. On the other hand, there is a minority of cases where the inclusion of metadata worsens the prediction task, leading to $\langle\lambda\rangle < 1/2$. In such situations, the metadata seems to divide the network in a manner that is largely orthogonal to how the network itself is connected. In order to illustrate this, we consider some artificially generated datasets as follows, before returning to the empirical datasets.
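The node removal step of this survey can be sketched as below. The helper names are hypothetical, and two details are assumptions of this sketch rather than statements about the original procedure: the 1% fraction is rounded down, and at least one node is always removed.

```python
import random


def holdout_size(n_annotated):
    """Number of nodes to remove: 1% of the annotated nodes or 100, whichever is smaller."""
    # Rounding down and the floor of one node are assumptions of this sketch.
    return max(1, min(n_annotated // 100, 100))


def sample_holdout(annotated_nodes, rng):
    """Draw one random set of annotated nodes to remove from the network."""
    nodes = list(annotated_nodes)
    return rng.sample(nodes, holdout_size(len(nodes)))


rng = random.Random(42)                       # fixed seed for reproducibility
removed = sample_holdout(range(5000), rng)    # hypothetical dataset with 5000 annotated nodes
```

Repeating `sample_holdout` with fresh random draws, refitting the model on the remaining nodes each time, and averaging the resulting ratios over all removed nodes gives one value per dataset.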

A. Alignment between data and metadata
We construct a network with $N$ nodes divided into $B_d$ equal-sized groups, with $E$ edges randomly placed, with the constraint that both endpoints of each edge lie in the same group (i.e. the network is perfectly assortative). The nodes of this network are also connected via $E_m = E$ tag-node edges to $M = N$ metadata tags, which are likewise divided into $B_t = B_d = B$ equal-sized groups. The placement of the tag-node edges is done according to an additional equal-sized partition $\{b_i\}$ of the data nodes into $B$ groups, such that a tag in one metadata group can only connect to one particular data group, and vice versa. We consider three constructions; in the first two, the partition $\{b_i\}$ is chosen in different ways:
1. Aligned with the data partition, i.e. the partition $\{b_i\}$ is identical to the partition used to place the node-node edges.
2. Misaligned with the data partition, i.e. the partition $\{b_i\}$ is chosen completely at random.
3. Random, i.e. the tag-node edges are placed entirely at random.
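The three constructions above can be sketched with the following generator. It is only an illustration under stated assumptions: the function name and return format are hypothetical, parallel edges are disallowed (edges are stored in sets), and the requested number of edges must not exceed the number of admissible pairs, otherwise the sampling loops would not terminate.

```python
import random


def planted_annotated_network(n, n_groups, n_edges, mode, seed=0):
    """Generate node-node and tag-node edges for one of the three constructions."""
    if n % n_groups != 0:
        raise ValueError("n must be divisible by n_groups")
    if mode not in ("aligned", "misaligned", "random"):
        raise ValueError("unknown mode")
    rng = random.Random(seed)
    data_group = [i % n_groups for i in range(n)]   # planted data partition
    tag_group = [t % n_groups for t in range(n)]    # M = N tags, equal-sized groups

    def members(labels):
        groups = [[] for _ in range(n_groups)]
        for idx, g in enumerate(labels):
            groups[g].append(idx)
        return groups

    # Perfectly assortative data layer: both endpoints lie in the same group.
    data_members = members(data_group)
    edges = set()
    while len(edges) < n_edges:
        i, j = rng.sample(data_members[rng.randrange(n_groups)], 2)
        edges.add((min(i, j), max(i, j)))

    # Partition used to place tag-node edges (unused in the "random" mode).
    meta_group = list(data_group)
    if mode == "misaligned":
        rng.shuffle(meta_group)                     # random equal-sized partition
    meta_members = members(meta_group)
    tag_members = members(tag_group)

    tag_edges = set()
    while len(tag_edges) < n_edges:                 # E_m = E tag-node edges
        if mode == "random":
            pair = (rng.randrange(n), rng.randrange(n))
        else:
            r = rng.randrange(n_groups)
            pair = (rng.choice(meta_members[r]), rng.choice(tag_members[r]))
        tag_edges.add(pair)
    return edges, tag_edges, data_group, meta_group, tag_group


edges, tag_edges, data_group, meta_group, tag_group = planted_annotated_network(
    n=40, n_groups=2, n_edges=60, mode="aligned")
```

In the misaligned mode the tag-node layer is exactly as structured as in the aligned mode; only its relation to the data partition is destroyed, which is the distinction between constructions 2 and 3.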
We emphasize that 2 and 3 are different: the former corresponds to structured metadata that do not correspond to the network structure, and the latter corresponds to unstructured metadata. An example of each type of construction for $B = 2$ is shown in Fig. 4. When performing node prediction for artificial networks constructed in this manner, one systematically observes improved prediction with aligned metadata; with misaligned metadata, however, a measurable degradation can be seen, while for random metadata neutral values close to $\langle\lambda\rangle = 1/2$ are observed (see Fig. 4). The degradation observed for misaligned metadata is due to the subdivision of the data groups into $B$ smaller subgroups, according to how they are connected to the metadata tags. This subdivision, however, is not a meaningful way of capturing the pattern of the node-node connections, since all nodes that belong to the same planted group are statistically indistinguishable. If the number of subgroups is sufficiently large, this will invariably induce the incorporation of noise into the model via the different number of edges incident on each subgroup. Since these differences result only from statistical fluctuations, they are bad predictors of unobserved data, and hence cause the degradation in predictive quality. We note, however, that in the limiting case where the number of nodes inside each subdivision becomes sufficiently large, the degradation vanishes, since these statistical fluctuations become increasingly less relevant (see Fig. 4, curve $N/B = 10^3$).
Nevertheless, for sufficiently misaligned metadata the total number of inferred data groups can increase significantly as [...], where $B_d$ is the number of data groups used to generate the network. Therefore, in practical scenarios, the presence of structured (i.e. non-random) metadata that is strongly uncorrelated with the network structure can indeed deteriorate node prediction, as observed in a few of the empirical examples shown in Fig. 3.

Figure 3. Node prediction performance, measured by the average predictive likelihood ratio $\langle\lambda\rangle$ for a variety of annotated datasets (see Appendix B for descriptions). Values above 1/2 indicate that the metadata improves the node prediction task. On the right axis a histogram of the likelihood ratios is shown, with a red line marking the average.

B. How informative are individual tags?
The average likelihood ratio $\langle\lambda\rangle$ used above is measured by removing nodes from the network, and includes the simultaneous contribution of all metadata tags that annotate them. However, our model also divides the metadata tags into classes, which allows us to identify the predictiveness of each tag individually according to this classification. With this, one can separate informative from noninformative tags within a single dataset.
We again quantify the predictiveness of a metadata tag by its capacity to predict which other nodes will connect to the one it annotates. According to our model, the probability of some data node $i$ being annotated by tag $t$ is given by

$$P_m(i|t) = d_i \frac{m_{b_i, c_t}}{m_{b_i} m_{c_t}},$$

which is conditioned on the group memberships of both data and metadata nodes. Analogously, the probability of some data node $i$ being a neighbor of a chosen data node $j$ is given by

$$P_e(i|j) = k_i \frac{e_{b_i, b_j}}{e_{b_i} e_{b_j}}.$$

Hence, the probability of $i$ being a neighbor of any node $j$ that is annotated with tag $t$ is given by

$$P_t(i) = \sum_j P_e(i|j)\, P_m(j|t).$$

In order to assess the predictive quality of this distribution, we need to compare it to a null distribution where the tags connect randomly to the nodes,

$$Q(i) = \sum_j P_e(i|j)\, \Pi(j),$$

where $\Pi(i) = d_i / M$, with $M = \sum_{r<s} m_{rs}$, is the probability that node $i$ is annotated with any tag at random. The information gain obtained with the annotation is then quantified by the Kullback-Leibler divergence between both distributions,

$$D_{\mathrm{KL}}(P_t \| Q) = \sum_i P_t(i) \ln \frac{P_t(i)}{Q(i)}. \tag{15}$$

This quantity measures the amount of information lost when we use the random distribution $Q$ instead of the metadata-informed $P_t$ to characterize possible neighbors, and hence the amount we gain when we do the opposite. It is a non-negative quantity that can take any value between zero and $-\ln Q^*$, where $Q^*$ is the smallest non-zero value of $Q(i)$. If we substitute the above expressions for $P_t(i)$ and $Q(i)$ in Eq. 15, we notice that it only depends on the group membership of $t$, and can be written as

$$D_{\mathrm{KL}}(p_r \| q) = \sum_u p_r(u) \ln \frac{p_r(u)}{q(u)},$$

with

$$p_r(u) = \sum_s p_e(u|s)\, p_m(s|r), \qquad q(u) = \sum_s p_e(u|s)\, \pi(s),$$

being the probabilities of a node that belongs to group $u$ being a neighbor of a node annotated by a tag belonging to group $r$, for both the structured and random cases, where $p_e(u|s) = e_{us}/e_s$, $p_m(s|r) = m_{sr}/m_r$, and $\pi(s) = m_s/M$. Since this can take any value between zero and $-\ln q^*$, where $q^*$ is the smallest non-zero value of $q(u)$, this will in general depend on how many edges there are in the network, given that $q^* \geq 1/(2E)$.
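As a minimal numerical sketch of the node-level quantities above, the following snippet computes, for every tag, the distribution of neighbors of the nodes it annotates, together with the null distribution, and evaluates the Kullback-Leibler divergence between them. The function names and the toy network are hypothetical. The snippet takes M to be the total number of tag-node edges, and assumes that every data group has at least one edge and one annotation (otherwise the normalizations would divide by zero).

```python
import numpy as np


def tag_neighbor_distributions(A, T, b, c):
    """Return P_t (one column per tag) and the null distribution Q."""
    A = np.asarray(A, dtype=float)
    T = np.asarray(T, dtype=float)
    b = np.asarray(b)
    c = np.asarray(c)
    X = np.eye(b.max() + 1)[b]          # data-node group indicators
    Y = np.eye(c.max() + 1)[c]          # tag group indicators
    k = A.sum(axis=1)                   # data-layer degrees
    d = T.sum(axis=1)                   # number of tags per data node
    e = X.T @ A @ X                     # edge counts between data groups
    m = X.T @ T @ Y                     # edge counts between data and tag groups
    e_r = e.sum(axis=1)
    m_data = m.sum(axis=1)              # tag-node edges per data group
    m_tag = m.sum(axis=0)               # tag-node edges per tag group
    # P_e[i, j] = k_i e_{b_i b_j} / (e_{b_i} e_{b_j})
    P_e = k[:, None] * e[b][:, b] / np.outer(e_r[b], e_r[b])
    # P_m[j, t] = d_j m_{b_j c_t} / (m_{b_j} m_{c_t})
    P_m = d[:, None] * m[b][:, c] / np.outer(m_data[b], m_tag[c])
    P_t = P_e @ P_m                     # neighbors of nodes annotated by each tag
    Q = P_e @ (d / m.sum())             # neighbors under random annotation
    return P_t, Q


def kl_divergence(p, q):
    """Kullback-Leibler divergence in nats, skipping zero-probability terms of p."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


# Hypothetical toy network: two triangles joined by one edge, one tag per triangle.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
T = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])
P_t, Q = tag_neighbor_distributions(A, T, b=[0, 0, 0, 1, 1, 1], c=[0, 1])
gain_tag0 = kl_divergence(P_t[:, 0], Q)
```

Here each tag annotates exactly one triangle, so knowing the tag narrows the likely neighbors to that triangle and the divergence is positive; if both tags annotated all six nodes, `P_t` would coincide with `Q` and the gain would vanish.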
For a concise comparison between datasets of different sizes, it is useful to consider a relative version of this measure that does not depend on the size. Although one option is to normalize by the maximum possible value, here we use instead the entropy of $q$, $H(q) = -\sum_u q(u) \ln q(u)$, and denote the predictiveness $\mu_r$ of tag group $r$ as

$$\mu_r = \frac{D_{\mathrm{KL}}(p_r \| q)}{H(q)}.$$

This gives us the relative improvement of the annotated prediction with respect to the uninformed one. Although it is possible to have $\mu_r > 1$, this is not typical even for highly informative tags, and would mean that a particularly unlikely set of neighbors becomes particularly likely once we consider the annotation. Instead, a more typical highly informative metadata annotation simply narrows down the predicted neighborhood to a typical group sampled from $q$.
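The group-level predictiveness can be sketched directly from the two block matrices, taking it to be the Kullback-Leibler divergence between the annotated and null neighbor distributions normalized by the entropy of the null distribution, as described above. The function name and the example matrices are hypothetical; the snippet assumes a symmetric matrix of edge counts between data groups, a matrix of tag-node edge counts with data groups as rows and tag groups as columns, and a null distribution spread over more than one data group (so that its entropy is non-zero).

```python
import numpy as np


def tag_group_predictiveness(e, m):
    """Return the predictiveness mu_r of every tag group r."""
    e = np.asarray(e, dtype=float)
    m = np.asarray(m, dtype=float)
    p_e = e / e.sum(axis=0, keepdims=True)   # p_e[u, s] = e_us / e_s
    p_m = m / m.sum(axis=0, keepdims=True)   # p_m[s, r] = m_sr / m_r
    pi = m.sum(axis=1) / m.sum()             # pi[s] = m_s / M
    p = p_e @ p_m                            # p[u, r]: annotated neighbor distribution
    q = p_e @ pi                             # q[u]: null neighbor distribution
    entropy = -np.sum(q[q > 0] * np.log(q[q > 0]))
    mu = []
    for r in range(m.shape[1]):
        mask = p[:, r] > 0
        kl = np.sum(p[mask, r] * np.log(p[mask, r] / q[mask]))
        mu.append(float(kl / entropy))
    return mu


# Hypothetical block matrices for two data groups and two tag groups.
e = [[10, 1], [1, 10]]                 # assortative data layer
m_aligned = [[5, 0], [0, 5]]           # each tag group annotates one data group
m_uninformative = [[5, 5], [5, 5]]     # tag groups annotate both data groups equally

mu_aligned = tag_group_predictiveness(e, m_aligned)
mu_uninformative = tag_group_predictiveness(e, m_uninformative)
```

With the aligned matrix each tag group singles out one data group and obtains a clearly positive predictiveness, whereas with the uninformative matrix the annotated and null distributions coincide and the predictiveness is zero.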
Using the above criterion we investigated in detail the datasets of Fig. 3, and quantified the predictiveness of the node annotations, as is shown in Fig. 5 for a selected subset. Overall, we observe that the datasets differ greatly not only in the overall predictiveness of their annotations, but also in their internal structures. Typically, we find that within a single dataset the metadata predictiveness is widely distributed. A good example of this is the IMDB data, which describes the connection between actors and films, and includes annotations on the films corresponding to the year and country of production, the producers, the production company, the genres, user ratings as well as user-contributed keywords. In Fig. 5a we see that the larger fraction of annotations possess very low predictiveness (which includes the vast majority of user-contributed keywords and ratings); however, there is still a significant number of annotations that can be quite predictive. The most predictive types of metadata are combinations of producers and directors (e.g. cartoon productions), followed by specific countries (e.g. New Zealand, Norway) and years of production. Besides keywords and ratings, film genres are among those with the lowest predictiveness. A somewhat narrower variability is observed for the APS citation data in Fig. 5b, where the three types of annotations are clearly distinct. The PACS numbers are the most informative on average, followed by the date of publication (with older dates being more predictive than newer ones, presumably due to the increasing publication volume and diversification over the years), and lastly the journal. One prominent exception is the most predictive metadata group, which corresponds to the now-extinct "Physical Review (Series I)" journal and its publication dates ranging from 1893 to 1913. For the Amazon dataset of Fig. 5c, the metadata also exhibits significant predictive variance, but there are no groups of tags that possess very low values, indicating that most product categories are indeed strong indicators of copurchases. This is similar to what is observed for the Internet AS, with most countries being good predictors of the network structure. The least predictive annotations happen to be a group of ten countries that include the US as the most frequent one. A much wider variance is observed in the DBLP collaboration network, where the publication venues seem to be divided into two branches: very frequent and popular ones with low to moderate predictiveness, and many very infrequent ones with high to very high predictiveness. For other datasets a wide variance in predictiveness is not observed. In particular for most Facebook networks as well as protein-protein interaction networks, the available metadata seems to be only tenuously correlated with the network structure, with narrowly distributed values of low predictiveness, in accordance with their relatively low placement in Fig. 3.

IV. CONCLUSION
We presented a general model for the large-scale structure of annotated networks that does not intrinsically assume that there is a direct correspondence between metadata tags and the division of the network into groups, or communities. We presented a Bayesian framework to infer the model parameters from data, which is capable of uncovering the connection between network structure and annotations, if there is one to be found. We showed how this information can be used to predict missing nodes in the network when only the annotations are known.
When applying the method to a variety of annotated datasets, we found that their annotations lie in a broad range with respect to their correlation with network structure. For most datasets considered, there is evidence for statistically significant correlations between the annotations and the network structure, in a manner that can be detected by our method, and exploited for the task of node prediction. For a few datasets, however, we found evidence of metadata which is not trivially structured, but seems to be largely uncorrelated with the actual network structure.
The variance in metadata predictiveness observed across different datasets is also often found inside individual datasets. Typically, single datasets possess a wealth of annotations, most of which are not very informative on the network structure, but a smaller fraction clearly is. Our method is capable of separating groups of annotations with respect to their predictiveness, and hence can be used to prune "metadata noise" from such datasets, by excluding low-performing tags from further analysis.
Our results provide an important but overlooked perspective in the context of community detection validation. In a recent study [7] a systematic comparison between various community detection methods and node annotations was performed, where for most of them strong discrepancies were observed. If we temporarily (and unjustifiably) assume a direct agreement with available annotations as the "gold standard", this discrepancy can be interpreted in a few ways. Firstly, the methods might be designed to find structures that fit the data poorly, and hence cannot capture their most essential features. Secondly, even if the general ansatz is sound, a given algorithm might still fail for more technical and subtle reasons. For example, most methods considered in Ref. [7] do not attempt to gauge the statistical significance of their results, and hence are subject to overfitting [35,36]. This incorporation of statistical noise will result in largely meaningless divisions of the networks, which would be poorly correlated with the "true" division. Additionally, Newman and Clauset [15] recently suggested that while the best-fitting division of the network can be poorly correlated with the metadata, the network may still admit alternative divisions that are also statistically significant, but happen to be well correlated with the annotations.
On the other hand, the metadata heterogeneity we found with our method gives a strong indication that node annotations should not be used in direct comparisons to community detection methods in the first place, at least not indiscriminately. In most networks we analyzed, even when the metadata is strongly predictive of the network structure, the agreement between the annotations and the network division tends to be complex, and very different from the one-to-one mapping that is more commonly assumed. Furthermore, almost all datasets contain considerable noise in their annotations, corresponding to metadata tags that are essentially random. From this, we argue that data annotations should not be used as a panacea in the validation of community detection methods. Instead, one should focus on validation methods that are grounded in statistical principles, and use the metadata as a source of additional evidence (itself possessing its own internal structures and also subject to noise, errors and omissions) rather than as a form of absolute truth.

As mentioned in the text, the microcanonical degree-corrected SBM log-likelihood is given by [23]

$$\ln P(A|b,\theta) \approx -E - \frac{1}{2}\sum_{rs} e_{rs} \ln\frac{e_{rs}}{e_r e_s} - \sum_i \ln k_i!, \qquad \text{(A1)}$$

and likewise for $\ln P(T|c,\gamma)$. This assumes that the graph is sufficiently sparse, otherwise corrections need to be introduced, as described in Refs. [23,24]. In order to compute the full joint likelihood, we need priors for the parameters $\{b_i\}$, $\{c_i\}$, $\{k_i\}$, $\{d_i\}$, $\{e_{rs}\}$ and $\{m_{rs}\}$.
For the node partitions, we use a two-level Bayesian hierarchy as done in Ref. [20], where one first samples the group sizes from a random histogram, and then the node partition randomly conditioned on the group sizes. The nonparametric likelihood is given by $P(\{b_i\}) = e^{-\mathcal{L}_p}$, with $\mathcal{L}_p$ the corresponding description length, expressed in terms of $\left(\!\!\binom{n}{m}\!\!\right) = \binom{n+m-1}{m}$, the total number of $m$-combinations with repetitions from a set of size $n$. The prior $P(\{c_i\})$ is analogous.
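The two-step sampling described above suggests a simple form for the partition description length. The following is a sketch under stated assumptions, not necessarily the exact expression of Ref. [20]: it assumes $N$ nodes, $B$ groups (empty groups allowed) and group sizes $n_r$, symbols that are not defined in this part of the text.

```latex
% Step 1: choose the group sizes {n_r} uniformly among all histograms
%         that distribute N nodes over B groups: ((B N)) possibilities.
% Step 2: choose the partition uniformly among those with these sizes:
%         N! / \prod_r n_r! possibilities.
\begin{align}
P(\{b_i\}) &= \left(\!\!\binom{B}{N}\!\!\right)^{-1} \times \frac{\prod_r n_r!}{N!}, \\
\mathcal{L}_p = -\ln P(\{b_i\}) &= \ln\left(\!\!\binom{B}{N}\!\!\right) + \ln N! - \sum_r \ln n_r!.
\end{align}
```

Under these assumptions the first term is the cost of describing the group sizes and the remaining terms are the cost of describing which nodes fall into which group.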
For the degree sequences, we proceed in the same fashion [24]: we sample the degrees conditioned on the total number of edges incident on each group, by first sampling a random degree histogram with a fixed average, and then the degree sequence conditioned on this distribution. This leads to a likelihood $P(\{k_i\}|\{e_{rs}\},\{b_i\}) = e^{-\mathcal{L}_\kappa}$, with $\mathcal{L}_\kappa$ the corresponding description length, which involves $\ln \Xi_r \simeq 2\sqrt{\zeta(2) e_r}$. Again, the likelihood $P(\{d_i\}|\{m_{rs}\},\{c_i\})$ is entirely analogous.
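The asymptotic expression $2\sqrt{\zeta(2) e_r}$ coincides with the leading-order Hardy-Ramanujan estimate for the logarithm of the number of integer partitions, since $2\sqrt{\zeta(2) n} = \pi\sqrt{2n/3}$. The sketch below compares it with exact partition counts. It assumes that $\Xi_r$ counts the unrestricted integer partitions of $e_r$, which is not stated explicitly here; the function names are illustrative only.

```python
import math


def partition_counts(n_max):
    """Return a list p where p[n] is the number of integer partitions of n."""
    p = [1] + [0] * n_max
    # Standard dynamic programme: allow parts of size 1, 2, ..., n_max in turn.
    for part in range(1, n_max + 1):
        for n in range(part, n_max + 1):
            p[n] += p[n - part]
    return p


ZETA_2 = math.pi ** 2 / 6  # Riemann zeta function at 2


def ln_xi_asymptotic(e_r):
    """Leading-order estimate 2 * sqrt(zeta(2) * e_r)."""
    return 2.0 * math.sqrt(ZETA_2 * e_r)


p = partition_counts(1000)
for e_r in (10, 100, 1000):
    exact = math.log(p[e_r])
    approx = ln_xi_asymptotic(e_r)
    # The ratio approaches 1 slowly, because sub-leading corrections
    # of order ln(e_r) are ignored by the estimate.
    print(e_r, round(exact, 2), round(approx, 2), round(exact / approx, 3))
```

The estimate is an upper-side approximation at these sizes and should be read as a large-$e_r$ limit rather than an exact count.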
For the matrix of edge counts $\{e_{rs}\}$ we use the hierarchical prior proposed in Ref. [20]. Here we view this matrix as the adjacency matrix of a multigraph with $B_d$ nodes and $E_d = \sum_{rs} e_{rs}/2$ edges. We sample this multigraph from another SBM with a number of groups $B_d^1$, which itself is sampled from another SBM with $B_d^2$ groups and so on, until $B_d^L = 1$ for some depth $L$. The whole nonparametric likelihood is then $P(\{e_{rs}\}) = e^{-\Sigma}$, where $\Sigma$ is the total description length of the hierarchy, which includes the description length of the node partition at each level $l > 0$. The procedure is exactly the same for the prior $P(\{m_{rs}\})$.
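The nested structure of this prior can be illustrated by going in the opposite direction, from the bottom of the hierarchy upwards: each level's edge-count matrix is obtained by merging the groups of the level below. The sketch below is not the inference procedure of Ref. [20]; the matrix `e`, the per-level group assignments in `partitions`, and the function names are hypothetical, and the matrix is assumed symmetric with entries summing to $2E_d$.

```python
def coarsen(e, groups):
    """Merge rows and columns of the edge-count matrix e.

    groups[r] is the (hypothetical) upper-level group of lower-level group r.
    """
    n_groups = max(groups) + 1
    out = [[0] * n_groups for _ in range(n_groups)]
    for r, row in enumerate(e):
        for s, count in enumerate(row):
            out[groups[r]][groups[s]] += count
    return out


def hierarchy(e, partitions):
    """Return the edge-count matrices at every level, bottom level first."""
    levels = [e]
    for groups in partitions:
        levels.append(coarsen(levels[-1], groups))
    return levels


# Toy example: 4 groups at the bottom, merged into 2, then into 1.
e = [
    [4, 1, 0, 0],
    [1, 2, 1, 0],
    [0, 1, 6, 1],
    [0, 0, 1, 2],
]
partitions = [[0, 0, 1, 1], [0, 0]]

levels = hierarchy(e, partitions)
for depth, matrix in enumerate(levels):
    total = sum(sum(row) for row in matrix)
    print(depth, matrix, "E_d =", total // 2)
```

The total number of edges is conserved at every level, and the hierarchy terminates once a single group remains, mirroring the condition $B_d^L = 1$.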
different sources: Krogan and Yu correspond to yeast (Saccharomyces cerevisiae), from two different publications: Krogan [38] and Yu [39]; isobase-hs corresponds to human proteins, as collected by the Isobase project [40]; Predicted includes predicted and experimentally determined protein-protein interactions for humans, from the PrePPI project [41] (human interactions that are in the HC reference set predicted by structural modeling but not non-structural clues); Gastric, pancreas, lung are obtained by splitting the PrePPI network [41] by the tissue where each protein is expressed.
c. Facebook networks (FB). Networks of social connections on the facebook.com online social network, obtained in 2005, corresponding to students of different universities [42]. All friendships are present as undirected links, as well as six types of annotation: dorm (residence hall), major, second major, graduation year, former high school, and gender.
d. Internet AS. Network of the Internet at the level of Autonomous Systems (AS). Nodes represent autonomous systems, i.e. systems of connected routers under the control of one or more network operators with a common routing policy. Links represent observed paths of Internet Protocol traffic directly from one AS to another. The node annotations are countries of registration of each AS. The data were obtained from the CAIDA project (footnote 5).
e. DBLP. Network of collaboration of computer scientists. Two scientists are connected if they have coauthored at least one paper [43]. Node annotations are publication venues (scientific conferences). Data were downloaded from SNAP (footnote 6) [4].
f. aNobii. This is an online social network for sharing book recommendations, popular in Italy. Nodes are user profiles, and there can be two types of directed relationships between them ("friends" and "neighbors"), which we used as undirected links. Data were provided by Luca Aiello [44,45]. We used all available node metadata, of which there are four kinds: age, location, country, and membership.
g. PGP. The "Web of trust" of PGP (Pretty Good Privacy) key signings, representing an indication of trust of the identity of one person (signee) by another (signer). A node represents one key, usually but not always corresponding to a real person or organization. Links are signatures, which by convention are intended to only be made if the two parties are physically present, have verified each other's identities, and have verified the key fingerprints. Data were taken from a 2009 snapshot of public SKS keyservers [46].
h. Flickr. Picture-sharing website and social network, as crawled by Mislove et al. [47]. Nodes are users and edges exist if one user "follows" another. The node