• Open Access

Network Structure, Metadata, and the Prediction of Missing Nodes and Annotations

Darko Hric, Tiago P. Peixoto, and Santo Fortunato
Phys. Rev. X 6, 031038 – Published 12 September 2016

Abstract

The empirical validation of community detection methods is often based on available annotations on the nodes that serve as putative indicators of the large-scale network structure. Most often, the suitability of the annotations as topological descriptors itself is not assessed, and without this it is not possible to ultimately distinguish between actual shortcomings of the community detection algorithms, on one hand, and the incompleteness, inaccuracy, or structured nature of the data annotations themselves, on the other. In this work, we present a principled method to access both aspects simultaneously. We construct a joint generative model for the data and metadata, and a nonparametric Bayesian framework to infer its parameters from annotated data sets. We assess the quality of the metadata not according to their direct alignment with the network communities, but rather in their capacity to predict the placement of edges in the network. We also show how this feature can be used to predict the connections to missing nodes when only the metadata are available, as well as predicting missing metadata. By investigating a wide range of data sets, we show that while there are seldom exact agreements between metadata tokens and the inferred data groups, the metadata are often informative of the network structure nevertheless, and can improve the prediction of missing nodes. This shows that the method uncovers meaningful patterns in both the data and metadata, without requiring or expecting a perfect agreement between the two.

  • Figure
  • Figure
  • Figure
  • Figure
  • Figure
  • Figure
  • Received 7 April 2016

DOI:https://doi.org/10.1103/PhysRevX.6.031038

This article is available under the terms of the Creative Commons Attribution 3.0 License. Further distribution of this work must maintain attribution to the author(s) and the published article’s title, journal citation, and DOI.

Published by the American Physical Society

Physics Subject Headings (PhySH)

Statistical Physics & ThermodynamicsNetworksInterdisciplinary Physics

Authors & Affiliations

Darko Hric*

  • Department of Computer Science, Aalto University School of Science, P.O. Box 12200, FI-00076 Aalto, Finland

Tiago P. Peixoto

  • Department of Mathematical Sciences and Centre for Networks and Collective Behaviour, University of Bath, Claverton Down, Bath BA2 7AY, United Kingdom; ISI Foundation, Via Alassio 11/c, 10126 Turin, Italy; and Institut für Theoretische Physik, Universität Bremen, Hochschulring 18, D-28359 Bremen, Germany

Santo Fortunato

  • Department of Computer Science, Aalto University School of Science, P.O. Box 12200, FI-00076 Aalto, Finland and Center for Complex Networks and Systems Research, School of Informatics and Computing, Indiana University, 47408 Bloomington, USA

  • *darko.hric@aalto.fi
  • t.peixoto@bath.ac.uk
  • santo@indiana.edu

Popular Summary

A long-standing goal related to network systems is the characterization of their large-scale structure. In this context, the most popular approach is the division of the network into modules or “communities” that group nodes into discrete classes according to their connection patterns with the rest of the network. However, this idea has generated an enormous variety of competing methods to extract communities from network data. With the goal of interpreting and validating the output of various methods of dividing networks, researchers have used comparisons with metadata (i.e., data annotations that are assumed to correspond to the “true” division of the network into meaningful categories). However, recent systematic comparisons of various community detection methods with metadata have revealed that their relative correspondence is very poor in the majority of cases. Here, we argue that this problem is the outcome of a naive and unrealistic interpretation of the role of the network annotations, and we propose a more refined approach by formally erasing the distinction between data and metadata, and by formulating a probabilistic generative model that describes both of their structures simultaneously.

By inferring the parameters of our model from data, we can determine the precise relationship between data and metadata, instead of specifying it beforehand. Furthermore, our nonparametric Bayesian inference framework is capable of distinguishing structure from noise, and it reveals patterns according to their statistical significance. We apply our method to an assortment of annotated network data sets such as APS citations, Amazon co-purchases, and IMDB movie data, and we find that their annotations vary considerably in their predictive quality, showing that their indiscriminate use in the validation of community-finding algorithms is often misguided.

Our findings accordingly reveal that metadata should be treated with caution and used as one tool of many to group nodes within networks.

Key Image

Article Text

Click to Expand

References

Click to Expand
Issue

Vol. 6, Iss. 3 — July - September 2016

Subject Areas
Reuse & Permissions
Author publication services for translation and copyediting assistance advertisement

Authorization Required


×
×

Images

×

Sign up to receive regular email alerts from Physical Review X

Reuse & Permissions

It is not necessary to obtain permission to reuse this article or its components as it is available under the terms of the Creative Commons Attribution 3.0 License. This license permits unrestricted use, distribution, and reproduction in any medium, provided attribution to the author(s) and the published article's title, journal citation, and DOI are maintained. Please note that some figures may have been included with permission from other third parties. It is your responsibility to obtain the proper permission from the rights holder directly for these figures.

×

Log In

Cancel
×

Search


Article Lookup

Paste a citation or DOI

Enter a citation
×