Network Structure, Metadata, and the Prediction of Missing Nodes and Annotations

Open Access

Darko Hric, Tiago P. Peixoto, and Santo Fortunato

Phys. Rev. X 6, 031038 – Published 12 September 2016

Abstract

The empirical validation of community detection methods is often based on available annotations on the nodes that serve as putative indicators of the large-scale network structure. Most often, the suitability of the annotations as topological descriptors itself is not assessed, and without this it is not possible to ultimately distinguish between actual shortcomings of the community detection algorithms, on one hand, and the incompleteness, inaccuracy, or structured nature of the data annotations themselves, on the other. In this work, we present a principled method to assess both aspects simultaneously. We construct a joint generative model for the data and metadata, and a nonparametric Bayesian framework to infer its parameters from annotated data sets. We assess the quality of the metadata not according to their direct alignment with the network communities, but rather in their capacity to predict the placement of edges in the network. We also show how this feature can be used to predict the connections to missing nodes when only the metadata are available, as well as to predict missing metadata. By investigating a wide range of data sets, we show that while there are seldom exact agreements between metadata tokens and the inferred data groups, the metadata are often informative of the network structure nevertheless, and can improve the prediction of missing nodes. This shows that the method uncovers meaningful patterns in both the data and metadata, without requiring or expecting a perfect agreement between the two.

Received 7 April 2016

DOI:https://doi.org/10.1103/PhysRevX.6.031038

This article is available under the terms of the Creative Commons Attribution 3.0 License. Further distribution of this work must maintain attribution to the author(s) and the published article’s title, journal citation, and DOI.

Published by the American Physical Society

Physics Subject Headings (PhySH)

Community structure, Network structure

Bayesian methods, Data analysis

Statistical Physics & Thermodynamics, Networks, Interdisciplinary Physics

Authors & Affiliations

Darko Hric^*

Department of Computer Science, Aalto University School of Science, P.O. Box 12200, FI-00076 Aalto, Finland

Tiago P. Peixoto^†

Department of Mathematical Sciences and Centre for Networks and Collective Behaviour, University of Bath, Claverton Down, Bath BA2 7AY, United Kingdom; ISI Foundation, Via Alassio 11/c, 10126 Turin, Italy; and Institut für Theoretische Physik, Universität Bremen, Hochschulring 18, D-28359 Bremen, Germany

Santo Fortunato^‡

Department of Computer Science, Aalto University School of Science, P.O. Box 12200, FI-00076 Aalto, Finland and Center for Complex Networks and Systems Research, School of Informatics and Computing, Indiana University, 47408 Bloomington, USA

^*darko.hric@aalto.fi
^†t.peixoto@bath.ac.uk
^‡santo@indiana.edu

Popular Summary

A long-standing goal related to network systems is the characterization of their large-scale structure. In this context, the most popular approach is the division of the network into modules or “communities” that group nodes into discrete classes according to their connection patterns with the rest of the network. However, this idea has generated an enormous variety of competing methods to extract communities from network data. With the goal of interpreting and validating the output of various methods of dividing networks, researchers have used comparisons with metadata (i.e., data annotations that are assumed to correspond to the “true” division of the network into meaningful categories). However, recent systematic comparisons of various community detection methods with metadata have revealed that their relative correspondence is very poor in the majority of cases. Here, we argue that this problem is the outcome of a naive and unrealistic interpretation of the role of the network annotations, and we propose a more refined approach by formally erasing the distinction between data and metadata, and by formulating a probabilistic generative model that describes both of their structures simultaneously.

By inferring the parameters of our model from data, we can determine the precise relationship between data and metadata, instead of specifying it beforehand. Furthermore, our nonparametric Bayesian inference framework is capable of distinguishing structure from noise, and it reveals patterns according to their statistical significance. We apply our method to an assortment of annotated network data sets such as APS citations, Amazon co-purchases, and IMDB movie data, and we find that their annotations vary considerably in their predictive quality, showing that their indiscriminate use in the validation of community-finding algorithms is often misguided.

Our findings accordingly reveal that metadata should be treated with caution and used as one tool of many to group nodes within networks.

Issue

Vol. 6, Iss. 3 — July - September 2016

Images

Figure 1
Schematic representation of the joint data-metadata model. The data layer is composed of data nodes and is described by an adjacency matrix $A$, and the metadata layer is composed of the same data nodes, as well as tag nodes, and is described by a bipartite adjacency matrix $T$. Both layers are generated by two coupled degree-corrected SBMs, where the partition of the data nodes into groups is the same in both layers.
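The two-layer structure in this caption can be sketched as a small container. This is only an illustration of the data layout, not the authors' implementation: the class name `AnnotatedNetwork`, its field names, and the choice of undirected data edges are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedNetwork:
    """Hypothetical two-layer container for an annotated network."""
    n_nodes: int      # number of data nodes
    n_tags: int       # number of tag (metadata) nodes
    edges: list       # (i, j) pairs between data nodes, defines A
    tag_edges: list   # (i, t) pairs: data node i carries tag t, defines T
    node_group: list  # group of each data node, shared by both layers
    tag_group: list   # group of each tag node (metadata layer only)

    def adjacency(self):
        """Dense adjacency matrix A of the data layer (assumed undirected)."""
        A = [[0] * self.n_nodes for _ in range(self.n_nodes)]
        for i, j in self.edges:
            A[i][j] += 1
            A[j][i] += 1
        return A

    def bipartite(self):
        """Dense bipartite matrix T: rows are data nodes, columns are tags."""
        T = [[0] * self.n_tags for _ in range(self.n_nodes)]
        for i, t in self.tag_edges:
            T[i][t] += 1
        return T
```

The single `node_group` list is the point of the sketch: the same partition of the data nodes is used when describing both $A$ and $T$, which is what couples the two layers.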
Figure 2
Joint data-metadata model inferred for the network of American football teams [45]. (a) Hierarchical partition of the data nodes (teams), corresponding to the “data” layer. (b) Partition of the data (teams) and tag (conference) nodes, corresponding to the second layer. (c) Average predictive likelihood of missing nodes relative to using only the data (discarding the conferences), using the original conference assignment of Ref. [45] (GN) and the corrected assignment of Ref. [46] (TE).
Figure 3
Node prediction performance, measured by the average predictive likelihood ratio $\langle \lambda \rangle$ for a variety of annotated data sets (see Appendix pp2 for descriptions). Values above $1/2$ indicate that the metadata improve the node prediction task. On the right-hand axis, a histogram of the likelihood ratios is shown, with a red line marking the average.
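The $1/2$ threshold is what one would expect from a normalized ratio of the form $\lambda = P_{\text{with}} / (P_{\text{with}} + P_{\text{without}})$, which equals $1/2$ when the metadata neither help nor hurt. The exact definition is in the article text, which is not reproduced here, so the form below is an assumption, and the function names are hypothetical. The sketch works from log-likelihoods to avoid numerical underflow.

```python
import math


def likelihood_ratio(log_p_with_metadata, log_p_data_only):
    """Assumed form: lambda = P_with / (P_with + P_without), from log-likelihoods."""
    diff = log_p_data_only - log_p_with_metadata
    # lambda = 1 / (1 + exp(diff)); branch so exp() never overflows.
    if diff >= 0:
        e = math.exp(-diff)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(diff))


def average_likelihood_ratio(log_pairs):
    """Mean of lambda over (log_p_with_metadata, log_p_data_only) pairs."""
    ratios = [likelihood_ratio(a, b) for a, b in log_pairs]
    return sum(ratios) / len(ratios)
```

Under this form, a value above $1/2$ means the model that uses the metadata assigned the higher likelihood to the held-out node.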
Figure 4
Top: Examples of artificial annotated networks, showing aligned, misaligned, and random metadata, as described in the text. Bottom: Node prediction performance, measured by the likelihood ratio $\langle \lambda \rangle$, averaged over all possible single-node removals, for annotated networks generated with $B_{d} = B_{t} = B$ groups, $N = M = 30 B$ nodes and tags, $E = E_{m} = 5 N$ node-node and tag-node edges, with specific network construction given by the legend. One of the curves corresponds to networks with misaligned metadata with a larger number of nodes, $N = M = 10^{3} \times B$.
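The sizes quoted in this caption can be made concrete with a toy generator. The actual construction is described in the article text, which is not reproduced here, so everything beyond the sizes $N = M = 30B$ and $E = E_{m} = 5N$ is an assumption: the function name, the purely within-group data edges, the tolerance of repeated edges, and the handling of only the aligned and random cases.

```python
import random


def make_annotated_network(B, aligned=True, seed=0):
    """Toy annotated network with B groups (hypothetical construction)."""
    rng = random.Random(seed)
    N = M = 30 * B    # data nodes and tag nodes
    E = E_m = 5 * N   # node-node and tag-node edges
    node_group = [i % B for i in range(N)]
    tag_group = [t % B for t in range(M)]
    members = [[i for i in range(N) if node_group[i] == r] for r in range(B)]
    tags_in = [[t for t in range(M) if tag_group[t] == r] for r in range(B)]

    # Assumption: data edges fall only inside groups; repeats are allowed.
    edges = []
    for _ in range(E):
        r = rng.randrange(B)
        i, j = rng.sample(members[r], 2)
        edges.append((i, j))

    # Aligned: a node only receives tags from its own group.
    # Random: tags are drawn uniformly, ignoring the groups.
    tag_edges = []
    for _ in range(E_m):
        i = rng.randrange(N)
        if aligned:
            t = rng.choice(tags_in[node_group[i]])
        else:
            t = rng.randrange(M)
        tag_edges.append((i, t))
    return node_group, tag_group, edges, tag_edges
```

In the aligned case the tags carry the full group information, while in the random case they carry none, which are the two extremes the bottom panel compares.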
Figure 5
Metadata predictiveness for several empirical data sets. The panels show the predictiveness of metadata groups $\mu_{r}$ [Eq. (18)] versus metadata group sizes $n_{r}$. The sizes of the symbols indicate the metadata frequency. The symbols correspond to the most frequent types of tags in each group (which may contain tags of different types). Marginal histograms, weighted according to the tag frequencies, are shown on the axes of each panel. A red horizontal line marks the average predictiveness.
Figure 6
Average predictive likelihood ratio $\langle \lambda \rangle$ of missing metadata tags (conferences) for the American football data, using the annotations given in Ref. [47]. Tags 11–18 are “independents,” i.e., teams that do not belong to any conference. The dashed line marks the value $1/19$, corresponding to a uniform likelihood between all tags.