Phase transitions in semisupervised clustering of sparse networks

Pan Zhang, Cristopher Moore, and Lenka Zdeborová

Phys. Rev. E 90, 052802 – Published 5 November 2014

Abstract

Predicting labels of nodes in a network, such as community memberships or demographic variables, is an important problem with applications in social and biological networks. A recently discovered phase transition puts fundamental limits on the accuracy of these predictions if we have access only to the network topology. However, if we know the correct labels of some fraction $α$ of the nodes, we can do better. We study the phase diagram of this semisupervised learning problem for networks generated by the stochastic block model. We use the cavity method and the associated belief propagation algorithm to study what accuracy can be achieved as a function of $α$. For $k = 2$ groups, we find that the detectability transition disappears for any $α > 0$, in agreement with previous work. For larger $k$ where a hard but detectable regime exists, we find that the easy/hard transition (the point at which efficient algorithms can do better than chance) becomes a line of transitions where the accuracy jumps discontinuously at a critical value of $α$. This line ends in a critical point with a second-order transition, beyond which the accuracy is a continuous function of $α$. We demonstrate qualitatively similar transitions in two real-world networks.
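To make the setup in the abstract concrete, the following is a minimal sketch of the semisupervised problem: sample a sparse stochastic block model with $k$ equal-probability groups, reveal the true labels of a fraction $α$ of the nodes, and score a guess against the planted labels. The function names (`sample_sbm`, `reveal_labels`, `overlap`) are hypothetical, and the chance-normalized overlap used here is an assumed convention, not a definition quoted from the paper. The sketch does not implement belief propagation.

```python
import random

def sample_sbm(n, k, c_in, c_out, seed=0):
    """Sample a sparse SBM: nodes i, j are linked with probability
    c_in/n if they share a group and c_out/n otherwise.
    Uses an O(n^2) loop, so it is only meant for small n."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in range(n)]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            c = c_in if labels[i] == labels[j] else c_out
            if rng.random() < c / n:
                edges.append((i, j))
    return labels, edges

def reveal_labels(labels, alpha, seed=0):
    """Reveal the true label of round(alpha * n) nodes chosen uniformly
    at random (assumption: a fixed-size sample rather than independent
    coin flips per node)."""
    rng = random.Random(seed)
    n = len(labels)
    chosen = rng.sample(range(n), round(alpha * n))
    return {i: labels[i] for i in chosen}

def overlap(true_labels, guessed, k, skip=()):
    """Chance-normalized agreement (assumed convention): 0 for a guess
    that agrees with the truth on a fraction 1/k of nodes, 1 for a
    perfect guess. Nodes in `skip` (e.g. the revealed ones) are ignored."""
    idx = [i for i in range(len(true_labels)) if i not in skip]
    agree = sum(true_labels[i] == guessed[i] for i in idx) / len(idx)
    return (agree - 1 / k) / (1 - 1 / k)

# Small illustrative run (parameters chosen arbitrarily).
labels, edges = sample_sbm(n=300, k=2, c_in=5.0, c_out=1.0, seed=1)
known = reveal_labels(labels, alpha=0.1, seed=1)
# A trivial baseline: keep revealed labels, guess group 0 elsewhere.
guess = [known.get(i, 0) for i in range(len(labels))]
score = overlap(labels, guess, k=2, skip=known)
```

Scoring only the unrevealed nodes (via `skip`) keeps the revealed labels from inflating the result as $α$ grows; an inference algorithm such as BP would replace the trivial baseline guess.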

Received 1 May 2014

DOI: https://doi.org/10.1103/PhysRevE.90.052802

Authors & Affiliations

Pan Zhang¹, Cristopher Moore¹, and Lenka Zdeborová²

¹Santa Fe Institute, Santa Fe, New Mexico 87501, USA
²Institut de Physique Théorique, CEA Saclay and URA 2306, CNRS, Gif-sur-Yvette, France

Issue

Vol. 90, Iss. 5 — November 2014

Images

Figure 1
Overlap and convergence time of BP as a function of $ε = c_{out} / c_{in}$ for different $α$, on networks generated by the stochastic block model. On the left, $k = 2$, $c = 3$, and $n = 10^{5}$. For just two groups, the transition disappears for any $α > 0$. On the right, $k = 10$, $c = 10$, and $n = 5 \times 10^{5}$. Here the easy/hard transition persists for small values of $α$, with a discontinuity in the overlap and a diverging convergence time; this transition disappears at a critical value of $α$.
Figure 2
Left: overlap as a function of $ε = c_{out} / c_{in}$ and $α$ for networks with the same parameters as in the right panel of Fig. 1. The heat map shows a line of discontinuities, ending at a second-order phase transition beyond which the overlap is a smooth function. Right: the logarithm (base 10) of the convergence time in the same plane, showing divergence along the critical line.
Figure 3
Left: overlap and BP convergence time in the planted five-coloring problem as a function of the average degree $c$ for various values of $α$. Right: the three transitions described in the text: the easy/hard or Kesten-Stigum transition where the factorized fixed point becomes stable (blue), the lower spinodal transition where the accurate fixed point disappears (green), and the detectability transition where the Bethe free energies of these fixed points cross (red). The hard but detectable regime, where BP with random initial messages does no better than chance but exhaustive search would succeed, is between the red and blue lines. All three transitions meet at a critical point, beyond which the overlap is a smooth function of $c$ and $α$. Here $n = 10^{5}$.
Figure 4
Overlap (top and left) and convergence time (right) as a function of the average degree $c$ and the fraction of known labels $α$ for the planted five-coloring problem on networks with $n = 10^{5}$. The height of the discontinuity decreases until we reach the critical point. The convergence time diverges along the discontinuity. Compare Fig. 2 for the assortative case.
Figure 5
Semisupervised learning in a network of political blogs [34]. Different points correspond to independent runs with different initial labels. On the top left, the best possible parameters $q_{a}$ and $c_{a b}$ are given to the algorithm in advance. On the top right, the algorithm learns these parameters using an EM algorithm, seeded by the known labels. The bottom panels show how the learned $q_{1}$ and $c_{a b}$ change as $α$ increases, moving from a core-periphery structure where nodes are divided according to high or low degree, to the correct assortative structure where they are divided along political lines.
Figure 6
Semisupervised learning in Zachary's karate club [36], with experiments analogous to Fig. 5: in the upper left, the optimal parameters are given to the algorithm in advance, while in the upper right it learns them with an EM algorithm, giving the parameters shown in the bottom panels. As in the political blog network, the algorithm makes a transition from a core-periphery structure to the correct assortative structure.
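The captions above parametrize the block model by $ε = c_{out} / c_{in}$, the average degree $c$, and the number of groups $k$. As a rough guide to where the easy/hard (Kesten-Stigum) transition sits with no known labels ($α = 0$), the sketch below uses the standard condition $|c_{in} - c_{out}| = k \sqrt{c}$ for $k$ equal-size groups with $c = (c_{in} + (k-1) c_{out}) / k$. That condition comes from the earlier stochastic block model literature and is an assumption here, not something stated on this page; the function names are hypothetical. It covers the assortative case $c_{in} > c_{out}$ with $c > 1$.

```python
import math

def ks_epsilon(k, c):
    """Value of eps = c_out / c_in at which c_in - c_out = k * sqrt(c),
    for k equal-size groups with average degree c (assortative case,
    alpha = 0). Solving the condition for eps gives
    eps* = (sqrt(c) - 1) / (sqrt(c) + k - 1)."""
    s = math.sqrt(c)
    return (s - 1) / (s + k - 1)

def degrees_from_epsilon(k, c, eps):
    """Recover (c_in, c_out) from the average degree c and the ratio eps,
    using c = (c_in + (k - 1) * c_out) / k."""
    c_in = k * c / (1 + (k - 1) * eps)
    return c_in, eps * c_in

# Parameters matching the two panels of Fig. 1.
eps_two_groups = ks_epsilon(k=2, c=3)    # roughly 0.27
eps_ten_groups = ks_epsilon(k=10, c=10)  # roughly 0.18
c_in, c_out = degrees_from_epsilon(10, 10, eps_ten_groups)
```

For the planted coloring problem ($c_{in} = 0$) the same condition reduces to $c = (k-1)^{2}$, which gives $c = 16$ for five colors. How the transition moves once $α > 0$ is the subject of the paper and is not captured by this formula.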

Physical Review E

covering statistical, nonlinear, biological, and soft matter physics