Abstract
The empirical validation of community detection methods is often based on available annotations on the nodes that serve as putative indicators of the large-scale network structure. Most often, the suitability of the annotations as topological descriptors itself is not assessed, and without this it is not possible to ultimately distinguish between actual shortcomings of the community detection algorithms, on one hand, and the incompleteness, inaccuracy, or structured nature of the data annotations themselves, on the other. In this work, we present a principled method to access both aspects simultaneously. We construct a joint generative model for the data and metadata, and a nonparametric Bayesian framework to infer its parameters from annotated data sets. We assess the quality of the metadata not according to their direct alignment with the network communities, but rather in their capacity to predict the placement of edges in the network. We also show how this feature can be used to predict the connections to missing nodes when only the metadata are available, as well as predicting missing metadata. By investigating a wide range of data sets, we show that while there are seldom exact agreements between metadata tokens and the inferred data groups, the metadata are often informative of the network structure nevertheless, and can improve the prediction of missing nodes. This shows that the method uncovers meaningful patterns in both the data and metadata, without requiring or expecting a perfect agreement between the two.
- Received 7 April 2016
DOI:https://doi.org/10.1103/PhysRevX.6.031038
This article is available under the terms of the Creative Commons Attribution 3.0 License. Further distribution of this work must maintain attribution to the author(s) and the published article’s title, journal citation, and DOI.
Published by the American Physical Society
Physics Subject Headings (PhySH)
Popular Summary
A long-standing goal related to network systems is the characterization of their large-scale structure. In this context, the most popular approach is the division of the network into modules or “communities” that group nodes into discrete classes according to their connection patterns with the rest of the network. However, this idea has generated an enormous variety of competing methods to extract communities from network data. With the goal of interpreting and validating the output of various methods of dividing networks, researchers have used comparisons with metadata (i.e., data annotations that are assumed to correspond to the “true” division of the network into meaningful categories). However, recent systematic comparisons of various community detection methods with metadata have revealed that their relative correspondence is very poor in the majority of cases. Here, we argue that this problem is the outcome of a naive and unrealistic interpretation of the role of the network annotations, and we propose a more refined approach by formally erasing the distinction between data and metadata, and by formulating a probabilistic generative model that describes both of their structures simultaneously.
By inferring the parameters of our model from data, we can determine the precise relationship between data and metadata, instead of specifying it beforehand. Furthermore, our nonparametric Bayesian inference framework is capable of distinguishing structure from noise, and it reveals patterns according to their statistical significance. We apply our method to an assortment of annotated network data sets such as APS citations, Amazon co-purchases, and IMDB movie data, and we find that their annotations vary considerably in their predictive quality, showing that their indiscriminate use in the validation of community-finding algorithms is often misguided.
Our findings accordingly reveal that metadata should be treated with caution and used as one tool of many to group nodes within networks.