Phase transitions and optimal algorithms for semisupervised classifications on graphs: From belief propagation to graph convolution network

We perform theoretical and algorithmic studies of the problem of clustering and semisupervised classification on graphs with both pairwise relational information and single-point attribute information, based on a joint stochastic block model for synthetic graphs with both item-item edges and item-attribute edges. An asymptotically exact analysis based on Bayesian inference of the model is conducted, using the cavity method of statistical physics. Analytically, we identify a phase transition of the generative model, which poses fundamental limits on the detectability of the underlying model in the clustering task for all possible algorithms. Algorithmically, we propose a belief propagation algorithm that is asymptotically optimal on the generative model, and which can be further extended to a belief propagation graph convolution neural network (BPGCN) for semisupervised classification on graphs. Well-controlled benchmark data sets of factor graphs, accompanied by asymptotically optimal classification solutions, can thus be produced for the evaluation of graph convolution neural networks and for the theoretical understanding of their strengths and weaknesses. In particular, on these synthetic benchmark networks we observe that existing graph convolution neural networks suffer from a sparsity issue and an overfitting issue in practice, both of which are successfully overcome by our BPGCN. Moreover, when combined with classic neural network methods, BPGCN yields classification performances on real-world data sets that are at least comparable to those of state-of-the-art graph convolution networks.

DOI: 10.1103/PhysRevResearch.2.033325


I. INTRODUCTION
Learning on graphs is an important task in machine learning and the broader data sciences, which has triggered many successful applications in various fields, including social sciences (e.g., social network analysis), biology (e.g., protein structure prediction and molecular fingerprint learning), and computer science (e.g., knowledge graph analysis). The key difference between learning with graph data and traditional machine learning on images and natural languages is that, in addition to content features on each item, there are also relational features between items, encoded by edges in the graph, which adds an extra layer of complexity to the analysis.
* panzhang@itp.ac.cn

Published by the American Physical Society under the terms of the Creative Commons Attribution 4.0 International license. Further distribution of this work must maintain attribution to the author(s) and the published article's title, journal citation, and DOI.
One classical problem of learning on graphs is the classification of nodes into groups. Consider a citation network, where each article is represented by a graph node and the groups of nodes are scientific research fields. In addition to the edges between nodes, which represent citations between articles, each node is also associated with some attributes (e.g., the keywords of an article), which encode its categorical information about research fields. If the group information is known for a small subset of nodes, then these nodes can serve as a training set, and the learning task is to determine the group memberships of the remaining nodes by exploring both the direct group information in their attributes and the indirect information in their relationships with the training nodes (the edge connectivities of the graph). Essentially, this learning task is semisupervised classification on graphs, a problem that has recently drawn much attention in both the network science and machine learning communities; for this problem we have witnessed the rise of graph convolution neural networks (GCNs), a powerful neural network architecture that yields ground-breaking performance [1].
Deep convolution neural networks have achieved tremendous success in machine learning and artificial intelligence [2]. Since there are many situations in which data are represented as graphs, rather than as voices or images that can be recorded on one- or two-dimensional grids, much effort has been made to extend convolution networks from grid data to graph data, with a heavy focus on constructing linear convolution kernels to extract local node attributes and on learning effective representations of graph objects. In the past several years, many GCNs have been proposed, utilizing different types of convolution kernels and different network architectures [1,3-5]. Recent studies show that GCNs have quickly come to dominate among neural network techniques in performance on various learning tasks, including text- or graph-object classification, link prediction, forecasting, and importance sampling, and they are also believed to have great potential for relational reasoning [6]. Nevertheless, although GCNs have achieved state-of-the-art performance on semisupervised classification, so far there is little theoretical understanding of the mathematical principles behind graph convolutions or of the extent to which they may work in a particular problem setting. The main difficulty is that, in previous studies, GCNs are typically tested only on real-world data sets, which do not have clear theoretical structure, so the success or failure of GCNs is hard to pin down by theoretical analysis. A set of network data sets with established mathematical properties, together with an analysis of GCNs on such data sets, is missing and would be greatly welcome. Note that some works [7,8] have also given theoretical analyses showing the limited expressiveness of GCNs in whole-graph learning; here we offer another perspective on GCNs' strengths and weaknesses, based on semisupervised classification on well-controlled graphs.
The basis for our study is a generative model for both graphs and attributes. In the field of community detection, where the task is to detect clusters purely based on edges, many analyses are based on the celebrated stochastic block model (SBM) [9]. However, the SBM is not sufficient for our purpose, because in addition to generating edges we also need to generate attributes. In this work we propose to use a variant of the SBM, a joint model consisting of two graph components characterizing both the relational information and the attribute information of item nodes, captured respectively by a standard SBM and a bipartite SBM [10-12]. We call this model the joint stochastic block model (JSBM). The model was originally proposed in Ref. [10] for the problem of link and node prediction through Markov chain Monte Carlo. In this study, we analyze theoretical properties of the JSBM and design techniques for semisupervised learning on graphs based on its desirable properties. Filling the gap discussed above, the JSBM produces well-controlled benchmark graphs with continuously tunable parameters for the evaluation of GCNs' classification performance and for the theoretical understanding of their strengths and weaknesses under certain conditions.
On graphs generated by the JSBM, the clustering and classification problems can be translated into a Bayesian inference problem, which can be solved theoretically with the statistical physics approach in an asymptotically exact manner. This approach leads to a message-passing algorithm, known in computer science as the belief propagation (BP) algorithm [13], which we claim to be asymptotically exact on large random graphs generated by the JSBM. Through analyzing the stability of fixed points of the constructed BP equations on the JSBM, a phase transition, the detectability transition, is identified; beyond the phase transition point, no algorithm is able to conduct successful clustering on JSBM graphs in an unsupervised manner. This is an extension of the detectability phase transition [14] in the standard SBM [9], which puts fundamental limits on the ability of algorithms in clustering tasks on graphs that can be sufficiently modeled by the JSBM.
In the semisupervised classification setting, where a small fraction of nodes have ground-truth group labels and can be used as training data, the BP algorithm for the JSBM can be embedded into a graph convolution network architecture. The unknown generative parameters of the JSBM graph can be learned in a standard classification approach, through the forward pass of (truncated) BP equations together with the backward pass of the gradients of the loss function. This GCN algorithm, which we term BPGCN, is guaranteed to yield Bayes-optimal classification results [15] on synthetic graphs generated by the JSBM and performs comparably with state-of-the-art GCNs on real-world networks.
This paper is organized as follows. In Sec. II, we introduce the joint stochastic block model. In Sec. III, we formulate the Bayesian inference problem for clustering and classification on the joint stochastic block model and derive the belief propagation equations for the JSBM. In Sec. IV, we study the detectability phase transition of the JSBM using a stability analysis of the BP algorithm. In Sec. V, we convert the BP equations on the JSBM into a graph convolution neural network and propose a GCN algorithm, the BPGCN. In Sec. VI, the performance of the BPGCN is evaluated and compared with that of several state-of-the-art GCNs, on both synthetic and real-world networks. Section VII concludes the study.

II. JOINT STOCHASTIC BLOCK MODEL
The idea of the JSBM, literally the joint of two stochastic block models, is to simultaneously model item nodes and attribute nodes in the network setting, representing both a connectivity graph over item nodes and an attribute graph between item nodes and attribute nodes. The connectivity graph corresponds to the relation network in the traditional sense, and the attribute graph is a bipartite graph established on top of the relation network; both graphs associate each node with a unique group membership. These two graphs, constructed from two SBM processes, constitute the JSBM graph G; a similar framework was proposed in Ref. [10] to study link prediction.
Assume n item nodes in the (undirected) connectivity graph, belonging to κ groups. Each node i has an unknown label t*_i denoting its group membership (i.e., t*_i ∈ {1, 2, ..., κ}); each group label is chosen independently at random, according to a κ-dimensional probability vector α whose entries sum to 1. Edge connectivities are exclusively determined by group memberships: For each pair of nodes i and j, there exists an edge (i, j) ∈ E with probability p_{t*_i t*_j}, where E is the edge set of the graph. This stochastic generative process can be understood as a measuring process in which the ground truth t*_i is being measured, with its information encoded implicitly in the edge connectivities as measurement results. The n × n adjacency matrix of the connectivity graph, A ∈ {0, 1}^{n×n}, follows this generative process and is therefore stochastic, controlled by the deterministic κ × κ generation probability matrix P = {p_{t*_i t*_j}}. If the diagonal elements of P are larger than the off-diagonal elements, this corresponds to the situation (known as the assortative SBM) where there are more edges within node groups than between groups; the reverse case is known as the disassortative SBM. An intuitive example of an assortative SBM is a citation network with nodes denoting research articles (items), which belong to certain research areas (groups) and are linked through citations (edges). Conceptually, articles from the same research area are more likely to cite each other (e.g., Ref. [1]).

[FIG. 1 caption: Circles denote item nodes, whose interconnections constitute the connectivity graph, with adjacency matrix A defined over edges; boxes denote feature nodes. The attribute graph is the bipartite graph between items and attributes, whose adjacency matrix is F. The JSBM graph is generated by parameters θ = {P, Q, α, β}, with κ groups of nodes built in for both items and features.]
Besides having membership in a certain research area, each research article may also be associated with some keywords which further denote its categorical information; expectedly, the research area an article belongs to could be inferred from these keywords. This notion underlies the idea of making inference on the joint SBM, i.e., inferring the hidden group membership of an item node from its relationships to known attribute nodes in the attribute graph. Assume such a graph with m attribute nodes over the n item nodes. As with item nodes, each attribute node μ carries an independent label t*_μ ∈ {1, 2, ..., κ}, chosen randomly from the κ groups according to the probabilities in a κ-dimensional vector β. A relationship between an item node i and an attribute node μ exists with probability q_{t*_i t*_μ}, corresponding to an edge (i, μ) ∈ F in the attribute graph, whose edge set is F. This generative process yields a bipartite graph between item nodes and attribute nodes, analogous to the bipartite SBM [11]; the resulting n × m adjacency matrix of the attribute graph, F ∈ {0, 1}^{n×m}, is stochastic and is governed by a κ × κ matrix Q = {q_{t*_i t*_μ}}, similar to the case of A and P. A specific JSBM is thus represented by the two adjacency matrices A and F for the connectivity graph and the attribute graph, respectively, and is controlled by the generation parameters θ = {P, Q, α, β} (Fig. 1). It consists of a unipartite graph and a bipartite graph, similar to the structure of semirestricted Boltzmann machines [16]. For convenience, we call edges in the unipartite graph item-item edges and edges in the bipartite graph item-attribute edges. It is also worth noting that in the current model we make no assumption on the relationships between different feature nodes (i.e., there is no adjacency matrix among attribute nodes); such connectivities might provide information about the group membership of feature nodes (which we do not consider in the current study), but not directly about the membership of item nodes, which is the ground truth under concern.
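The two-stage generative process described above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the paper; all function and variable names are our own:

```python
import numpy as np

def sample_jsbm(n, m, kappa, P, Q, alpha, beta, rng=None):
    """Sample a graph from the joint stochastic block model (JSBM).

    n, m        : numbers of item and attribute nodes
    kappa       : number of groups
    P, Q        : kappa x kappa edge-probability matrices for item-item
                  and item-attribute edges
    alpha, beta : group-prior probability vectors for items / attributes

    Returns (A, F, t_items, t_attrs): the two adjacency matrices and the
    planted ground-truth labels.
    """
    rng = np.random.default_rng(rng)
    t_items = rng.choice(kappa, size=n, p=alpha)   # hidden item labels t*_i
    t_attrs = rng.choice(kappa, size=m, p=beta)    # hidden attribute labels t*_mu

    # Item-item edges: Bernoulli(p_{t_i t_j}) independently for each pair i < j.
    probs = P[np.ix_(t_items, t_items)]
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    A = (upper | upper.T).astype(np.int8)          # symmetric, zero diagonal

    # Item-attribute edges: Bernoulli(q_{t_i t_mu}) for each (item, attribute) pair.
    F = (rng.random((n, m)) < Q[np.ix_(t_items, t_attrs)]).astype(np.int8)
    return A, F, t_items, t_attrs
```

The sparse regime studied below corresponds to entries of P and Q scaling as 1/n, so that average degrees stay finite as n grows.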

III. BAYESIAN INFERENCE ON JSBM
On the graph G generated by the JSBM, the problem of clustering is defined as recovering the ground-truth labels {t*_i} exclusively using the edge information in the connectivity graph and the attribute graph (i.e., the adjacency matrices A and F); the problem of semisupervised classification, in a slightly different manner, asks for the same recovery but utilizes as extra information a small number of training labels {t*_i} with i belonging to a training set. If the parameters θ used in generating the JSBM are known, this classification task can essentially be translated into an inference problem on the group labels {t_i}, which can be viewed as hidden parameters of the model, given measurements of a specific type of outcome of these parameters (the locations of edges in the two graphs). Bayesian inference, which amounts to computing the posterior distribution, is typically used for such an inference task. For our problem, under Bayes' rule, the posterior is written as

P({t_i}, {t_μ} | G, θ) = (1/Z) P(G | {t_i}, {t_μ}, θ) P_0({t_i}, {t_μ}),   (1)

where P_0({t_i}, {t_μ}) represents prior information on the labels (e.g., information regarding the generative parameters α and β), and P(G | {t_i}, {t_μ}, θ) is the likelihood of observing graph G given labels {t_i}, {t_μ} and parameters θ, which is the product-form probability of generating the item-item edges and item-attribute edges of G:

P(G | {t_i}, {t_μ}, θ) = ∏_{(i,j)∈E} p_{t_i t_j} ∏_{(i,j)∉E} (1 − p_{t_i t_j}) ∏_{(i,μ)∈F} q_{t_i t_μ} ∏_{(i,μ)∉F} (1 − q_{t_i t_μ}).   (2)

The clustering problem corresponds to adopting a flat prior [i.e., P_0({t_i}, {t_μ}) = const., or α and β being uniform probability vectors], and semisupervised classification corresponds to adopting a strong prior on the item nodes that belong to the training set, such that the probability marginals of those nodes are pinned in the direction of the training labels [15]. It is well known that computing the normalization of the posterior distribution [i.e., the denominator Z of Eq. (1)] is a #P problem, and thus efficient and accurate approximations are needed for the inference. In the language of statistical physics, the Bayesian posterior of Eq. (1) corresponds to a Boltzmann distribution at unit temperature: The negative log-likelihood −log P(G | {t_i}, {t_μ}, θ) represents the energy; the normalization constant Z = Σ_{{t_i},{t_μ}} P(G | {t_i}, {t_μ}, θ) P_0({t_i}, {t_μ}) is the partition function; and the prior information P_0 plays the role of external fields acting on item nodes and feature nodes. For the clustering problem with a flat prior, the external field is zero for all nodes; for semisupervised classification, the external field is infinite for nodes in the training set and zero for unlabeled nodes [15].
For random sparse graphs, the inference can be studied in the thermodynamic limit using the cavity method from statistical physics [13,17]. If the parameters used in generating the SBM are known a priori, the system is on the Nishimori line [18,19] and no spin glass phase can appear. Moreover, the replica symmetric cavity method naturally translates into the belief propagation (BP) algorithm, a well-known algorithm for computation on the SBM. In BP, cavity messages are passed along directed edges of the factor graph; when the propagation converges, they are used to compute the posterior marginals. Inheriting the message-passing idea, for the JSBM, where the factor graph G is threefold (i.e., there are three types of directed edges: item-item, item-attribute, and attribute-item), we formulate the iterative BP equations for the three types of messages (see Appendix A for derivations):

ψ^{i→j}_{t_i} = (1/Z^{i→j}) α_{t_i} e^{−h_{t_i}} ∏_{k∈∂i\j, k item} Σ_{t_k} p_{t_i t_k} ψ^{k→i}_{t_k} ∏_{ν∈∂i, ν attribute} Σ_{t_ν} q_{t_i t_ν} ψ^{ν→i}_{t_ν},

ψ^{i→μ}_{t_i} = (1/Z^{i→μ}) α_{t_i} e^{−h_{t_i}} ∏_{k∈∂i, k item} Σ_{t_k} p_{t_i t_k} ψ^{k→i}_{t_k} ∏_{ν∈∂i\μ, ν attribute} Σ_{t_ν} q_{t_i t_ν} ψ^{ν→i}_{t_ν},

ψ^{μ→i}_{t_μ} = (1/Z^{μ→i}) β_{t_μ} e^{−h_{t_μ}} ∏_{k∈∂μ\i} Σ_{t_k} q_{t_k t_μ} ψ^{k→μ}_{t_k}.   (3)

Here ψ^{i→j}_{t_i} are the cavity marginals (messages) passed from item node i to item node j, representing the probability that item node i takes label t_i when item node j is removed from the graph. Similarly, ψ^{i→μ}_{t_i} represents the probability that item node i takes label t_i when attribute node μ is removed from the graph, and ψ^{μ→i}_{t_μ} represents the probability that attribute node μ takes label t_μ when item node i is removed. Z^{i→μ}, Z^{μ→i}, and Z^{i→j} are normalizing factors; ∂i denotes the set of neighbors of node i in the graph. In the three equations, the variables h_{t_i} and h_{t_μ} are adaptive fields contributed by the nonexistent edges of the graph, which are formulated as (see Appendix A)

h_{t_i} = Σ_k Σ_{t_k} p_{t_i t_k} ψ^k_{t_k} + Σ_ν Σ_{t_ν} q_{t_i t_ν} ψ^ν_{t_ν},   h_{t_μ} = Σ_k Σ_{t_k} q_{t_k t_μ} ψ^k_{t_k}.   (4)

Once the above iterative equations converge (i.e., the messages no longer change significantly), the converged cavity messages can be used to compute the posterior marginals on the two graphs (connectivity, attribute):

ψ^i_{t_i} = (1/Z^i) α_{t_i} e^{−h_{t_i}} ∏_{k∈∂i, k item} Σ_{t_k} p_{t_i t_k} ψ^{k→i}_{t_k} ∏_{ν∈∂i, ν attribute} Σ_{t_ν} q_{t_i t_ν} ψ^{ν→i}_{t_ν},

ψ^μ_{t_μ} = (1/Z^μ) β_{t_μ} e^{−h_{t_μ}} ∏_{k∈∂μ} Σ_{t_k} q_{t_k t_μ} ψ^{k→μ}_{t_k},   (5)

where ψ^i_{t_i} is the marginal probability of item node i taking label t_i and ψ^μ_{t_μ} is the marginal probability of attribute node μ taking label t_μ. In the end, based on these computed marginals, one estimates the label of each item node as the label that maximizes the node's marginal:

t̂_i = argmax_{t_i} ψ^i_{t_i}.   (6)

In Bayesian inference, t̂_i is the maximum posterior-marginal estimate, the optimal estimate in the sense of minimizing the expected number of mislabeled nodes [19]. In terms of algorithm design, the BP equations essentially adopt the Bethe approximation [13,20], a variational distribution formulated as

P_B({t_i}, {t_μ}) = [∏_{(i,j)∈E} ψ^{ij}_{t_i t_j} ∏_{(i,μ)∈F} ψ^{iμ}_{t_i t_μ}] / [∏_i (ψ^i_{t_i})^{|∂i|−1} ∏_μ (ψ^μ_{t_μ})^{|∂μ|−1}],   (7)

where i and μ index item nodes and attribute nodes, respectively; |∂i| and |∂μ| denote the degrees of nodes i and μ, respectively; ψ^i and ψ^μ represent single-point marginals, and ψ^{ij} and ψ^{iμ} represent two-point marginals. By optimizing the Kullback-Leibler (KL) divergence [Eq. (8)] [13,21] between the variational distribution of Eq. (7) and the Boltzmann posterior of Eq. (1), under the constraints induced by the normalization of the marginals, one arrives at the belief propagation equations (3).
We note that the two-point marginals ψ^{iμ}_{t_i t_μ} and ψ^{ij}_{t_i t_j} can be computed from the cavity messages of the BP equations after convergence; we refer to Ref. [17] for more details of belief propagation on factor graphs. The core assumption is the conditional independence assumption, which is exact on trees. Thus, the Bethe approximation (7) is always correct on a tree graph in describing joint probabilities, in which case the BP algorithm yields exact posterior marginals. Empirically, BP results are shown to be good approximations to the true posterior marginals if the graph is sparse and locally treelike; hence the BP algorithm is widely applied to inference problems in sparse systems [17].
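As an illustration of the message-passing scheme, the following Python sketch implements the item-to-item component of the BP update on a sparse connectivity graph. It is a simplified stand-in, not the paper's implementation: the adaptive field h is dropped for brevity, and the item-attribute and attribute-item messages, which follow the same product form, are omitted; all names are our own.

```python
import numpy as np

def sbm_bp_item_messages(edges, n, kappa, c_mat, alpha, n_sweeps=20, rng=None):
    """Run sweeps of the item-to-item cavity update psi^{i->j} of Eq. (3),
    with the adaptive field omitted.

    edges : list of undirected item-item edges (i, j)
    c_mat : kappa x kappa affinity matrix (rescaled P: c_ab = n * p_ab)
    alpha : group-prior probability vector
    """
    rng = np.random.default_rng(rng)
    nbrs = [[] for _ in range(n)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    # one message per directed edge, initialized at random
    psi = {}
    for i, j in edges:
        psi[(i, j)] = rng.dirichlet(np.ones(kappa))
        psi[(j, i)] = rng.dirichlet(np.ones(kappa))
    for _ in range(n_sweeps):
        for (i, j) in list(psi):
            m = np.asarray(alpha, dtype=float).copy()
            for k in nbrs[i]:
                if k != j:
                    # sum_{t_k} c_{t_i t_k} psi^{k->i}_{t_k}
                    m = m * (c_mat @ psi[(k, i)])
            psi[(i, j)] = m / m.sum()          # normalization Z^{i->j}
    # posterior marginals use all incoming messages
    marg = np.tile(np.asarray(alpha, dtype=float), (n, 1))
    for i in range(n):
        for k in nbrs[i]:
            marg[i] *= c_mat @ psi[(k, i)]
    return psi, marg / marg.sum(axis=1, keepdims=True)
```

On locally treelike graphs, the returned marginals approximate the posterior marginals ψ^i of Eq. (5) restricted to the connectivity graph.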

IV. DETECTABILITY TRANSITIONS OF JSBM
For graphs generated by the JSBM with parameters θ, the cavity method provides an asymptotically exact analysis, and the belief propagation algorithm (almost) always converges, according to the Nishimori line property [18,19]. Thus, asymptotically exact properties of the JSBM, such as the phase diagram, can be studied directly in the thermodynamic limit by analyzing the messages of belief propagation.
From Eqs. (4) and (6), observe that the BP equations (3) admit a trivial fixed point,

ψ^{i→j}_{t_i} = ψ^{i→μ}_{t_i} = ψ^{μ→i}_{t_μ} = 1/κ.   (9)

This fixed point corresponds to the situation where every node in the graph has equal probability of belonging to every group; it is therefore known as the paramagnetic, or liquid, fixed point. In this case, the marginals do not provide any information about the ground-truth group labels and only reflect the permutational symmetry of the system. When this paramagnetic fixed point is stable, the system is in the paramagnetic state, where it is believed that no algorithm can do better than a random guess at revealing the planted group structure. This scenario is known as the nondetectable phase for the SBM, whose existence has been mathematically proved in Ref. [22]; in this study, we extend the analysis to the JSBM with its two SBM components. Conceptually, the nondetectable phase in the JSBM is analogous to the paramagnetic phase of the ferromagnetic Ising model, where the underlying ground-truth labels correspond to the all-one configuration, and of the Hopfield model, where the underlying ground-truth labels correspond to the stored patterns. From the viewpoint of statistical inference, the item-item edges {(ij)} and item-attribute edges {(iμ)} are observations of the signal (i.e., the ground-truth labels), so the paramagnetic phase corresponds to the situation where the number of observations is too small to reveal any valid information about the signal, and the system settles into the paramagnetic fixed point where the label assignment is equiprobable for every node.
When the number of observations increases, the paramagnetic fixed point eventually becomes unstable, leading to a nontrivial fixed point of the BP equations (3) whose values are correlated with the ground truth. The point where the paramagnetic fixed point of BP becomes unstable marks the detectability transition of the JSBM, which poses fundamental limits on the ability of algorithms to reveal information about the ground truth, independent of the specific algorithm being used.
This phase transition point can be determined by a stability analysis of the paramagnetic fixed point (9) of BP. Assume that the JSBM graph has n → ∞ item nodes and m → ∞ attribute nodes. Each item node is connected to on average c_1 item nodes and c_2 attribute nodes, and each attribute node is connected to on average c_3 item nodes. The degree distribution of item nodes in the connectivity graph (counting only item-node neighbors) and the degree distribution of item nodes in the attribute graph (counting only attribute-node neighbors) are then Poisson distributions, as in the SBM, and c_1 and c_2 also equal the corresponding average excess degrees (see Appendix B for the definition of the average excess degree). Consider putting random noise with zero mean and unit variance on every node (both items and attributes) of the graph. Assuming a locally treelike topology, after one iteration of the BP equations the noise is propagated to on average c_1 item nodes through item-item edges and to c_2 c_3 item nodes through attribute nodes (Fig. 2). If every leaf node of the tree is associated with random noise of zero mean and unit variance, then after l → ∞ iterations of the BP equations (3), the aggregated variance of the noise at the root node i can be computed as (see Appendix B for the derivation)

V = (c_1 λ_A^2 + c_2 c_3 λ_F^2)^l,   (10)

where λ_A and λ_F are the largest eigenvalues of the Jacobian matrices T^{i→j} and T^{μ→i}, related to the connectivity graph and the attribute graph (i.e., the adjacency matrices A and F), respectively, evaluated at the paramagnetic fixed point (together with a third matrix T^{i→μ}):

T^{i→j}_{t_i t_k} = ∂ψ^{i→j}_{t_i} / ∂ψ^{k→i}_{t_k},   T^{μ→i}_{t_μ t_k} = ∂ψ^{μ→i}_{t_μ} / ∂ψ^{k→μ}_{t_k},   T^{i→μ}_{t_i t_ν} = ∂ψ^{i→μ}_{t_i} / ∂ψ^{ν→i}_{t_ν}.   (11)

Since l → ∞, the paramagnetic fixed point is unstable under random perturbations whenever V > 1. As a result, the detectability phase transition is located at

c_1 λ_A^2 + c_2 c_3 λ_F^2 = 1.   (12)

Stability conditions of this kind are known in the spin glass literature as the de Almeida-Thouless local stability condition [23], and in computer science as the Kesten-Stigum bound for reconstruction on trees [24,25] and the robust reconstruction threshold [26].
We demonstrate the phase transition of Eq. (12) in a simple case of the JSBM. Consider a P matrix with p_in on the diagonal and p_out off the diagonal, and similarly a Q matrix with q_in on the diagonal and q_out off the diagonal. Notice that p_in and p_out, as well as q_in and q_out, are related through the fixed average degrees, and hence there is only one free parameter for each matrix. We introduce the ratios ε_1 = p_out/p_in and ε_2 = q_out/q_in. For this simple JSBM, the leading eigenvalues of the two Jacobian matrices are

λ_A = (1 − ε_1)/[1 + (κ − 1)ε_1],   λ_F = (1 − ε_2)/[1 + (κ − 1)ε_2].   (14)

Numerical experiments are conducted to verify our theoretical results. The performance of BP is evaluated using the overlap O between the BP results {t̂_i} obtained from Eq. (6) and the ground truth {t*_i}:

O = max_π (1/n) Σ_i δ_{t̂_i, π(t*_i)},   (15)

where δ_{a,b} is the Kronecker delta, and the overlap is maximized over all κ! permutations π of the groups. In Eq. (15), if the inferred labels {t̂_i} are chosen randomly, having nothing to do with the ground truth {t*_i}, then O = 1/κ (lower bound); if there is an exact match between {t̂_i} and {t*_i}, then O = 1 (upper bound). For a small number of groups, the overlap is a commonly used metric for estimating the similarity between two group assignments. When the number of groups is large, or the group sizes differ, more refined measures are preferable, such as the normalized mutual information (NMI) and its variants [27,28].
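The permutation-maximized overlap of Eq. (15) can be computed directly by enumerating the κ! group permutations; the following is an illustrative Python sketch (practical only for small κ), not code from the paper:

```python
import numpy as np
from itertools import permutations

def overlap(pred, truth, kappa):
    """Overlap of Eq. (15): fraction of agreeing labels, maximized over
    all kappa! permutations of the group names.  Returns 1 for a perfect
    match and approaches 1/kappa for unrelated label assignments."""
    pred = np.asarray(pred)
    truth = np.asarray(truth)
    best = 0.0
    for pi in permutations(range(kappa)):
        relabeled = np.array(pi)[pred]      # apply permutation to predictions
        best = max(best, float(np.mean(relabeled == truth)))
    return best
```

For example, a prediction that swaps the two group names of the ground truth still has overlap 1, since labels are only defined up to permutation.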
Results are shown in Fig. 3. In Fig. 3(a), ε_1 is fixed at 0.3 and ε_2 varies; in Fig. 3(b) it is the other way around. In both Figs. 3(a) and 3(b), the overlap between the BP results and the (synthetic) ground truth is plotted against the varied parameter (ε_1 or ε_2); these BP results are optimal in the thermodynamic limit among all possible algorithms. The overlap is close to 1.0 for small ε, indicating near-perfect reconstruction of the ground truth; as ε increases, the accuracy of the BP inference decreases and eventually drops to 0.5. This is consistent with the theoretical detectability phase transition of Eq. (12) (indicated by dashed lines). For large ε, the system goes beyond the phase transition point and lies in the paramagnetic phase, where the overlap is always 0.5 (here κ = 2), i.e., the BP results are indistinguishable from a 50:50 random guess. In Fig. 3(c), the accuracy of BP for the JSBM is shown on the ε_1-ε_2 plane, where the overlap decays from large values (yellow) to the uninformative 0.5 (blue). The dashed line represents the theoretical phase transition point, consistent with the numerical results.
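For the symmetric parameterization, the detectability condition of Eq. (12) reduces to a one-line check. The sketch below assumes the symmetric eigenvalue forms λ = (1 − ε)/[1 + (κ − 1)ε] with ε_1 = p_out/p_in and ε_2 = q_out/q_in; it is an illustration under those assumptions, not code from the paper:

```python
def detectable(c1, c2, c3, eps1, eps2, kappa):
    """Kesten-Stigum-type condition of Eq. (12) for the symmetric JSBM:
    the paramagnetic fixed point is unstable (the planted structure is
    detectable) when c1 * lam_A**2 + c2 * c3 * lam_F**2 > 1.

    c1, c2 : mean item-item / item-attribute degrees of an item node
    c3     : mean degree of an attribute node
    eps1   : p_out / p_in, eps2 : q_out / q_in
    """
    lam_A = (1.0 - eps1) / (1.0 + (kappa - 1) * eps1)
    lam_F = (1.0 - eps2) / (1.0 + (kappa - 1) * eps2)
    return c1 * lam_A**2 + c2 * c3 * lam_F**2 > 1.0
```

At ε_1 = ε_2 = 1 both eigenvalues vanish and the condition can never hold: edges then carry no group information, regardless of the degrees.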

V. FROM BELIEF PROPAGATION TO GRAPH CONVOLUTION NETWORK
When applying the BP algorithm [Eq. (3)] to real-world graphs, or to synthetic graphs without a priori knowledge of the generative process, a critical problem is to determine the parameters θ of the underlying JSBM. Iterative schemes are needed to locate the optimal parameter set; for a pure clustering problem, classical approaches such as the expectation-maximization (EM) algorithm [29] are extensively used, in which parameters are updated by maximizing the total log-likelihood of the data (i.e., minimizing the total free energy). In practice, however, these established approaches are often prone to overfitting [14] and may become trapped in local minima, which is clearly undesirable in parameter estimation. A different approach to determining the parameters can be taken, however, if a small number of ground-truth labels are available, as in the setting of the current study, since the parameters can then be determined in a semisupervised fashion. Such a semisupervised update of the parameters can be achieved through back-propagation on a neural network structure, which also naturally accommodates the iterative message passing of BP. With this idea in mind, we propose a solution framework for the JSBM that contains two steps in each epoch: In the forward step, the BP equations are propagated for a finite number of time steps (layers) and perform message passing; in the backward step, the parameters θ of the underlying JSBM are updated in a supervised fashion, with gradients back-propagated through the BP layers.
Essentially, the above framework adopts the graph convolution network (GCN) [1] structure, an outstanding neural network model that has triggered a large number of variants (see the next section). In a GCN, information on the item nodes X propagates forward through the layers; in a high-level description, the propagation takes the form

X^{l+1} = σ(U X^l W^l),   (16)

where σ(·) represents an activation function and X^l ∈ R^{n×κ_l} is the neural network state at layer l (κ_l denotes the dimension of the network state at layer l, which is not necessarily equal to κ, the number of groups in the network). The matrix W^l ∈ R^{κ_l×κ_{l+1}} is the trainable weight matrix at the lth layer, and the propagator U is responsible for convolving the neighborhoods of nodes, a kernel shared over the entire graph.
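A single layer of the form of Eq. (16) is a one-liner. In this minimal sketch (not code from the paper) we assume ReLU as the activation σ and leave the choice of propagator U open (e.g., a normalized adjacency matrix):

```python
import numpy as np

def gcn_layer(U, X, W):
    """One graph-convolution layer of Eq. (16): X^{l+1} = sigma(U X^l W^l),
    here with ReLU assumed as the activation sigma.

    U : n x n propagator (e.g., a normalized adjacency matrix)
    X : n x k_l node-state matrix at layer l
    W : trainable k_l x k_{l+1} weight matrix
    """
    return np.maximum(U @ X @ W, 0.0)
```

Stacking such layers, with U fixed and the W^l learned, is the common template that the BPGCN below specializes: its states are BP messages and its weights are the JSBM parameters P and Q.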
Based on the GCN structure, we propose the belief propagation graph convolution network (BPGCN) as our solution framework for the JSBM. The marginals and cavity messages ψ are the network states; kernels take over the products and summations of Eqs. (3) and (5); and the JSBM parameters P and Q become the weight matrices of the neural network, to be gradually learned through back-propagation. Under these conventions, the propagation from the lth layer to the (l+1)th layer of a BPGCN is formulated through Eq. (17), in which

Softmax(z)_t = e^{z_t} / Σ_{s=1}^κ e^{z_s}

is used as the activation function, inherited naturally from BP, which requires normalizing marginal probabilities with κ components. If κ = 2, the Softmax activation reduces to the logistic function. Also note the logarithm inside the Softmax, which may be considered part of the activation function in a strict sense. The kernel matrices U_I to U_III and B_I to B_V are nonbacktracking matrices [30] that encode the adjacency information of cavity messages and marginals (see Appendix D for details). In practice, random values are used to initialize the input layer of the network, for the marginals ψ^i_0 ∈ R^{n×κ} and ψ^μ_0 ∈ R^{m×κ}, as well as the messages ψ^{i→j}_0 ∈ R^{2M_A×κ}, ψ^{i→μ}_0 ∈ R^{2M_F×κ}, and ψ^{μ→i}_0 ∈ R^{2M_F×κ}, where M_A and M_F are the numbers of item-item edges (in the connectivity graph) and item-attribute edges (in the attribute graph), respectively.
The marginals in the last layer of the BPGCN, {ψ^{i,L}} ∈ [0, 1]^{n×κ}, are the output of an L-layer BPGCN. A loss function comparing the output marginals of the training nodes with their labels is adopted for the supervised learning. A common choice of loss function for classification is the cross entropy L; in our model it is defined as

L = − Σ_{i∈training set} Σ_{t=1}^κ y_{i,t} log ψ^{i,L}_t,   (18)

where the sum runs over the training set of item nodes (a fraction of the ground-truth nodes), and y_i = (0, 0, ..., 1, ..., 0), with the 1 at position t*_i, is a one-hot vector encoding the ground-truth label of node i in the training set.
Training the BPGCN is the same as training other GCNs: In each epoch, we first do a forward pass to obtain the computed marginals in the last layer, based on which we calculate the loss function on the training set; after that, we use the back-propagation [2] algorithm to compute the gradients of the loss function with respect to the elements of P and Q, and then apply (stochastic) gradient descent or one of its variants (e.g., ADAM [31]) to update the parameters. The iteration stops when the results converge, or after a finite number of epochs. In the end, the performance of the BPGCN algorithm is evaluated by the accuracy [i.e., the overlap, Eq. (15)] of label assignments on the ground-truth item nodes in the test set.
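The epoch structure just described can be sketched as follows. This is a schematic stand-in, not the paper's implementation: the loss follows the cross entropy of Eq. (18), but the gradient step uses finite differences on a generic forward function in place of true back-propagation through the BP layers, and all names are our own:

```python
import numpy as np

def cross_entropy(marginals, labels, train_idx):
    """Cross-entropy loss of Eq. (18) on the training set.

    marginals : n x kappa output of the last layer
    labels    : ground-truth group index per node
    train_idx : indices of the training nodes
    """
    p = marginals[train_idx, labels[train_idx]]    # probability of true label
    return -float(np.mean(np.log(p + 1e-12)))

def train_epoch(params, forward, labels, train_idx, lr=0.1, h=1e-4):
    """One gradient-descent epoch.  The finite-difference gradient is a
    stand-in for back-propagation; a real BPGCN would use autodiff and an
    optimizer such as ADAM."""
    base = cross_entropy(forward(params), labels, train_idx)
    grad = np.zeros_like(params)
    for idx in np.ndindex(params.shape):
        bumped = params.copy()
        bumped[idx] += h
        grad[idx] = (cross_entropy(forward(bumped), labels, train_idx) - base) / h
    return params - lr * grad, base
```

Here `forward(params)` abstracts the whole stack of BP layers, mapping the current weight matrices to output marginals; any differentiable surrogate with that signature can be plugged in.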
A critical difference between BPGCN and traditional BP (including the semisupervised version [15]) is that BP minimizes the (Bethe) free energy, whereas BPGCN minimizes a loss function evaluated on the training data. On JSBM synthetic graphs with matched parameters, the free energy is theoretically the best loss function to minimize; however, on graphs not generated by the JSBM, minimizing the (free) energy may be prone to overfitting [14,32]. More importantly, minimizing a loss function is more flexible, since the function can be formulated in different ways; for example, new (informative, or regularization) terms can be added. Notably, one implicit feature of inference on the JSBM is that the classification of feature nodes (i.e., ψ^μ_L) is left unconstrained, in both BP [Eq. (3)] and BPGCN [Eq. (17)], since the ground truth of attribute labels is often not available. Nevertheless, in specific situations, such constraints can readily be incorporated into the loss function. For example, it is sometimes reasonable to believe that the distribution of classified attribute-node labels should be close to a uniform distribution (or Gaussian in other cases); in such cases, we can append a term to the loss function that constrains the classification of the features. In particular, using the all-ones vector w_μ = (1, 1, ..., 1) to characterize the uniform distribution, we can formulate a new loss function, where η is a damping factor and the sum runs over the set of all successfully classified attribute nodes. This new loss function may help enhance the classification performance of BPGCN on real-world networks, although it is certainly not optimal for synthetic JSBM graphs when η is nonzero. It is also worth noting that the adaptive fields [Eq. (4)], which in traditional BP are contributed by nonedges and are indispensable, are not necessary in BPGCN. This is because the adaptive fields help the BP equations avoid convergence to trivial fixed points where all nodes are assigned to the same group; for BPGCN in the semisupervised classification task, the available ground-truth group information already rules out such unrealistic fixed points, so the adaptive fields are not needed.
Comparing BPGCN with canonical GCNs, two major differences emerge. First, the activation function in BPGCN (Softmax) is not chosen arbitrarily but rather determined by the BP message passing equations; this is in contrast with common GCNs, where the activation function may take various forms, such as ReLU, PReLU, or Tanh, but without sufficient physical or mathematical warrant. Second, there are only a few parameters to be trained in BPGCN, namely the elements of P and Q, and they are shared across all layers of the neural network; consequently, the dimension of every layer is fixed to the number of groups κ. Although this may narrow the overall representational power of BPGCN, the problem of overfitting is thereby obviated to a great extent. Indeed, in semisupervised classification, the amount of training data is often much smaller than in (completely) supervised classification, whereas the number of observations, in the case of networks, may be proportional to the number of edges; therefore, the sharing of parameters across layers is believed to be extremely helpful in preventing overfitting of the training data.

VI. COMPARING BPGCN WITH OTHER GRAPH CONVOLUTION NETWORKS
In this section, we compare the classification performance of BP and BPGCN, constructed on the joint stochastic block model, with several state-of-the-art graph convolution networks, including the (standard) GCN, the approximate personalized propagation of neural predictions (APPNP), the graph attention network (GAT), and the simplified graph convolution network (SGCN).

A. (Standard) graph convolution network (GCN)
The standard GCN [1] is probably the most famous graph convolution network; it drastically outperformed all non-neural-network algorithms when first proposed in 2017. The forward propagation rule of the standard GCN is formulated as

H^{(l+1)} = σ(Â H^{(l)} W^{(l)}),

where H^{(l)} and W^{(l)} are the hidden states and the weight matrix at the lth layer. Â is the graph convolution kernel, defined as

Â = D^{−1/2} (A + I) D^{−1/2},

where A is the connectivity matrix and D is the diagonal degree matrix with D_ii = 1 + Σ_k A_ik. The propagation rule of the standard GCN was motivated by a first-order approximation of localized spectral filters (see Appendix C). A two-layer standard GCN is written as

Z = Softmax(Â ReLU(Â F W^{(0)}) W^{(1)}),

where F is just the adjacency matrix representing the attribute graph and Z ∈ R^{n×κ} is the classification result.
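The two-layer forward pass above can be sketched directly in numpy; a minimal illustration with our own toy graph and weight matrices (the function names are ours, not from the paper):

```python
import numpy as np

def renormalized_adj(A):
    """Â = D^{-1/2} (A + I) D^{-1/2}, with D_ii = 1 + sum_k A_ik."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def softmax(Z):
    e = np.exp(Z - Z.max(axis=1, keepdims=True))   # numerically stable
    return e / e.sum(axis=1, keepdims=True)

def gcn_two_layer(A, F, W0, W1):
    """Z = Softmax(Â · ReLU(Â F W0) · W1)."""
    A_hat = renormalized_adj(A)
    H = np.maximum(A_hat @ F @ W0, 0.0)            # ReLU hidden layer
    return softmax(A_hat @ H @ W1)

# toy 3-node path graph, identity features, all-ones weights
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
F = np.eye(3)
W0 = np.ones((3, 2))
W1 = np.ones((2, 2))
Z = gcn_two_layer(A, F, W0, W1)
```

With identical weight columns the logits are symmetric, so each row of Z is uniform; in practice W0 and W1 are learned by back-propagation.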

B. Approximate personalized propagation of neural predictions (APPNP)
APPNP [3] extracts attribute information (encoded in F) into hidden neuron states H using a multilayer perceptron (MLP), H = f_θ(F). The hidden states H are then propagated via the personalized PageRank scheme to produce predictions of node labels,

Z^{(k+1)} = (1 − ω) Â Z^{(k)} + ω H, with Z^{(0)} = H,

where Z is the output probability of item nodes (Z ← Z^{(K)}, the last layer) used to construct the training loss and predict labels, K is the number of propagation layers of APPNP, and ω ∈ [0, 1] is a weight factor. In a recent benchmarking study on the performance of various GCNs, APPNP produced the best results on multiple data sets [33].
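The personalized-PageRank propagation step can be sketched in a few lines of numpy; the function name and the toy row-stochastic kernel below are ours, for illustration only:

```python
import numpy as np

def appnp_propagate(A_hat, H, K, omega):
    """Personalized-PageRank propagation:
    Z_{k+1} = (1 - omega) * Â Z_k + omega * H, with Z_0 = H."""
    Z = H.copy()
    for _ in range(K):
        Z = (1.0 - omega) * (A_hat @ Z) + omega * H
    return Z

# toy 3-node kernel (row-stochastic averaging) and MLP outputs H
A_hat = np.array([[0.5, 0.5, 0.0],
                  [1/3, 1/3, 1/3],
                  [0.0, 0.5, 0.5]])
H = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
Z_tele = appnp_propagate(A_hat, H, K=10, omega=1.0)  # pure teleport: returns H
Z_walk = appnp_propagate(A_hat, H, K=1, omega=0.0)   # pure diffusion: Â H
```

The two limits make the role of ω explicit: ω = 1 ignores the graph entirely, while ω = 0 is plain K-step diffusion of the MLP predictions.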

C. Graph attention network (GAT)
GAT [4] adopts the attention mechanism [34], in which attention coefficients between pairs of connected nodes serve as weights. This scheme can be viewed as operating on the adjacency matrix of the graph, but with adjustable weights on the edges.

D. Simplified graph convolution networks (SGCN)
SGCN [35] tries to remove redundant and unnecessary computations from the GCN by adopting a low-pass filter followed by a linear classifier. As a result, SGCN takes a simplified propagation rule using the Kth power of the convolution kernel Â, as in the GCN:

Z = Softmax(Â^K F W),

where W and Z are the weights and the classification results, respectively. Empirical results show that this simplification may have a positive impact on the accuracy of the GCN and can dramatically speed up the computation.
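Because all nonlinearities between layers are removed, the filtered features Â^K F can be computed once and reused, which is the source of SGCN's speedup. A minimal sketch (function name and toy inputs are ours):

```python
import numpy as np

def sgcn_forward(A_hat, F, W, K):
    """SGCN: Z = Softmax(Â^K F W) — no nonlinearities between layers."""
    S = np.linalg.matrix_power(A_hat, K) @ F   # K-step low-pass filter, precomputable
    logits = S @ W
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# trivial sanity check: identity kernel, identity features, zero weights
A_hat = np.eye(3)
F = np.eye(3)
W = np.zeros((3, 2))
Z = sgcn_forward(A_hat, F, W, K=2)
```

With zero weights the logits vanish and the output is uniform over the κ = 2 classes, confirming that only W carries trainable parameters.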

E. Results on synthetic networks
First, we compare these GCN algorithms on synthetic networks generated by the JSBM. Each JSBM graph consists of n = 10 000 item nodes and m = 10 000 attribute nodes; each item node has on average c1 neighbors in the connectivity graph and c2 neighbors in the attribute graph; each attribute node is connected to on average c3 item nodes. In the semisupervised experiments, 5% of randomly chosen item nodes with ground-truth labels are used as the training set, and another 5% of item nodes are used as the validation set. The rest belong to the test set, on which the classification performance is evaluated by the overlap measure [Eq. (15)]. For BP and BPGCN, diagonal matrices are used as the initial values for P and Q, with p_in (and q_in) on the diagonal and p_out (and q_out) on the off-diagonal entries. Define the ratios ε1 = p_out/p_in and ε2 = q_out/q_in, both of which can be viewed as (inverse) signal-to-noise ratios; for all synthetic networks, we use ε1 = ε2 = 0.5 as the initial values for BPGCN, while for BP we use the values that were used to generate the underlying graphs, which guarantees asymptotically optimal results. The initial values of ε1 and ε2 for BPGCN could instead be determined in a soft manner by the validation set; however, tests show that this treatment does not make much difference. Also, all experiments, including those on real networks below, assume known κ (i.e., the ground-truth number of groups), since we have semisupervised information. When κ is unknown (i.e., in a pure clustering task), it can be determined by optimizing the free energy with respect to κ, as discussed in Refs. [14,15].
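The diagonal initialization of the affinity matrices described above amounts to a two-parameter construction; a minimal sketch (the helper name is ours):

```python
import numpy as np

def init_affinity(kappa, eps, p_in=1.0):
    """Diagonal-dominant affinity matrix: p_in on the diagonal and
    p_out = eps * p_in off the diagonal, where eps = p_out / p_in
    is the (inverse) signal-to-noise ratio."""
    P = np.full((kappa, kappa), eps * p_in)
    np.fill_diagonal(P, p_in)
    return P

# the eps = 0.5 initialization used for BPGCN on synthetic graphs, kappa = 5 groups
P0 = init_affinity(kappa=5, eps=0.5)
```

Smaller eps means stronger community structure encoded in the initial guess; eps = 1 would make P homogeneous and carry no group signal at all.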
Extensive experiments have been carried out on large synthetic graphs with varying average degrees c1,2,3 and varying ε1,2 (Fig. 4; see also Appendix E). First, it is confirmed that the BPGCN results are perfectly aligned with the (asymptotically) optimal BP results, better than all other GCNs. In Fig. 4(a), c1 is fixed to 4 and c2 increases from 3 to 20. We can see that all the reference algorithms (GCN, APPNP, GAT, and SGCN) work much worse than BPGCN, even when c2 is quite large. In Fig. 4(b), c2 is fixed to 4 and c1 is varied. Similar to Fig. 4(a), it shows that when c1 is small, GCN, APPNP, GAT, and SGCN perform far worse than BP and BPGCN. These results on synthetic graphs clearly show the superiority of BPGCN over existing popular GCN algorithms. The poor performance of the reference GCN algorithms is due to the sparsity of the graphs (see below). In Fig. 4(b), we mark the detectability threshold [Eq. (12)] by the vertical dashed line, as in Fig. 3, located at c1 ≈ 2.9. Therefore, in the c1 < 2.9 regime, we almost always cannot detect the graph's community structure without extra information. Notably, there is a very narrow regime, where c1 is slightly smaller than 2.9, in which searching over different initial values fed to the algorithms will locate the nontrivial fixed point and hence the communities are detectable; yet this search requires brute force and essentially exponential time. With additional information provided, such as a few more node labels, the threshold (dashed line) gradually moves to the left, toward the y axis, and the narrow unstable regime gradually disappears, i.e., the whole regime becomes detectable. For these discussions, we refer to Ref. [15] for more details. Nevertheless, even with additional information, no GCN algorithm will work better than BP with correct model parameters, since BP yields the optimal solutions on these JSBM graphs.
In Fig. 4(c), ε2 is fixed to 0.1, the average degrees of the synthetic graphs are fixed at c1 = 10 and c2 = 10, and ε1 is varied. Quite surprisingly, the results show that BPGCN works perfectly over the whole range of ε1, even when ε1 is close to 1, i.e., when the P matrix is homogeneous, whereas conventional GCNs quickly fail as ε1 increases. This phenomenon suggests that the reference GCNs have difficulties extracting label information from the attributes when the group structure of the graph is noisy, as one would imagine; in contrast, to an extraordinary extent, BPGCN is not subject to this failure. Further checks on the outputs of the conventional GCNs reveal that they all have good overlap on the training set (see Appendix E) but poor overlap on the test set, even worse than an MLP, which indicates clear overfitting to the training labels.
Per the above discussion, in what follows we analyze the two issues that emerged in the results (the sparsity issue and the overfitting issue) and explain how they are successfully overcome by BPGCN.
(1) The sparsity issue. Properties of the forward propagation of GCNs are closely related to their linear convolution kernels, and hence the reason behind the sparsity issue can be uncovered by studying the spectra of the linear kernels used in different graph convolutions. In GCN, SGCN, and APPNP, the convolution kernel is a variant of the normalized adjacency matrix Â [Eq. (21)]. It has been established in, e.g., Refs. [30,36] that this type of linear operator has localization problems on large sparse graphs, with leading eigenvectors encoding only local rather than global information on group structures (see Appendix C). By contrast, our BPGCN is immune to the sparsity issue, because the convolution kernels of BPGCN are the nonbacktracking matrices, which naturally overcome the localization problem on large sparse graphs [30]. Therefore, inspired by BPGCN, a straightforward approach to overcoming the sparsity issue in classic GCNs might be to use a linear kernel that does not trigger the localization problem on sparse graphs, such as the nonbacktracking matrix or the X-Laplacian [36].
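To make the nonbacktracking kernel concrete: for a plain undirected graph, the standard (Hashimoto) nonbacktracking matrix is indexed by directed edges and forbids a message from flowing straight back along the edge it arrived on. The BPGCN kernels U and B in Appendix D are variants of this construction; the sketch below (function name ours) builds the plain version for a small graph:

```python
import numpy as np

def nonbacktracking(edges):
    """Nonbacktracking matrix of an undirected graph.
    Rows/columns are directed edges; B[(i->j),(j->k)] = 1 iff k != i,
    so messages never backtrack to the node they came from."""
    directed = [(i, j) for i, j in edges] + [(j, i) for i, j in edges]
    index = {e: t for t, e in enumerate(directed)}
    B = np.zeros((len(directed), len(directed)))
    for (i, j) in directed:
        for (k, l) in directed:
            if k == j and l != i:
                B[index[(i, j)], index[(k, l)]] = 1.0
    return B, index

# triangle graph: each directed edge has exactly one nonbacktracking continuation
B, idx = nonbacktracking([(0, 1), (1, 2), (2, 0)])
```

On the triangle, every row of B sums to 1, reflecting the single allowed continuation of each directed walk; on sparse random graphs, the spectrum of B separates informative from uninformative eigenvalues where the adjacency spectrum does not [30].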
(2) The overfitting issue. From Eqs. (20), (24), and (25), one can observe that in these models the linear filters always operate directly on the weight matrices or on the hidden states. This implies an underlying assumption of conventional GCNs: The relational data (i.e., edges in the connectivity graph) must always contain information on the item nodes' group structure (labels). This is a natural assumption, yet it may not always be true. BPGCN relaxes this assumption by learning affinity matrices P (and Q) that store the learned signal-to-noise ratios ε1 (and ε2), which indicate the relative distribution of edges within versus outside the planted communities. Hence, it is not necessary that the group information be directly encoded in the edge connectivities.

F. Results on real-world data sets
Next, we compare BPGCN and the reference algorithms on several well-known real-world networks with and without node attributes. The Karate Club network [37] and Political Blogs network [38] are classical network data sets with planted community structures. Since these two networks have no node attributes, canonical GCNs commonly use the identity matrix as the feature matrix F [1]. Given their small sizes, for the Karate Club network we use two item nodes per group as training nodes and two item nodes per group as validation nodes, and for the Political Blogs network we follow the setup of Ref. [1]. For all these networks, there are ground-truth labels for the item nodes' group memberships. As in the synthetic case, the performance of the tested algorithms is evaluated by the overlap between the classification results and the ground-truth labels.
For BPGCN, a tunable external field is adopted and a corresponding hyperparameter γ is introduced, so as to adjust the relative strength of the training labels on nodes (see Appendix D for details). As in the case of synthetic graphs, diagonal matrices are used as the initial values for P and Q, controlled by the two signal-to-noise ratios ε1 = p_out/p_in and ε2 = q_out/q_in. Unlike on synthetic graphs, where the initial ε1 and ε2 are fixed at 0.5, on real-world graphs we did a coarse search using the validation set to determine proper values for ε1 and ε2 during the preprocessing step, together with the search for proper hyperparameters, including the external field strength γ and the number of layers L of BPGCN (see Appendix D for details).
Classification results are shown in Table I. On the two networks without node attributes, all tested GCN algorithms perform quite well, and label information is successfully extracted from the edge connectivities. BPGCN outperforms the other GCNs on the Political Blogs network; this good performance may come from its nonbacktracking convolution kernel, which has good spectral properties on large sparse graphs such as the Political Blogs network [30]. We notice that in Ref. [15], the parameters P are updated using the expectation-maximization (EM) algorithm, and the resulting detection accuracy on the Political Blogs network is worse than our result shown in Table I. This is because, with few known labels, the EM algorithm often converges to a core-periphery structure where nodes are divided according to high or low degrees. On networks with node attributes, the evaluation of classification performance is less straightforward. On Citeseer and Cora, the performance of BPGCN is certainly comparable to the other GCNs; in both cases it is superior to at least one reference algorithm. Nevertheless, BPGCN works poorly on the Pubmed network, yielding the worst performance among all tested GCNs. The reason is that the Pubmed network contains 19 717 item nodes but only 500 attributes; when attributes are densely connected to item nodes, the attribute graph deviates significantly from a sparse random graph, a critical assumption on which our BPGCN algorithm and the underlying JSBM rely. Indeed, we verified that when attributes are completely ignored and only edge connectivities are used for classification, BPGCN yields a classification accuracy of 71.0%, better than the 70.0% obtained with attributes included. Increasing the number of group labels of the attribute nodes, unfortunately, does not help. Since the multilayer perceptron (MLP) method, which uses only the attribute information, already achieves a high accuracy, a possible remedy for BPGCN in situations where the attribute graph is far from having a locally
treelike topology is to use the classification results yielded by the MLP, in place of the original attribute graph, as the external field acting on BPGCN (i.e., adopting the MLP as a preclassification step). Tests confirm that this preprocessed version of BPGCN greatly enhances the classification accuracy on the Pubmed network, up to 81.7%, which is significantly better than the state-of-the-art result yielded by APPNP. This preprocessing step does not improve the results on Citeseer and Cora, as expected, since the attribute graphs of these two networks are sparse.

VII. CONCLUSIONS
In this study, we constructed the joint stochastic block model (JSBM), which generates a twofold graph that simultaneously models item nodes and attribute nodes in an aggregated network setting, based on the celebrated SBM. Utilizing the cavity method in statistical physics and the corresponding belief propagation (BP) algorithm, which is asymptotically exact in the thermodynamic limit, we uncovered theoretical results on the detectability phase transition point and the phase diagram of the JSBM. As expected, the JSBM can be used to generate benchmark networks with continuously tunable parameters, which may be particularly useful in evaluating the classification performance of graph convolution neural networks.
Based on the BP equations established on the JSBM, we proposed an algorithm for semisupervised classification adopting the graph convolution network structure, which we termed BPGCN. In contrast to most existing graph convolution networks, the convolution kernel and activation function of BPGCN are determined mathematically from the Bayesian inference on the JSBM. We showed that on synthetic networks generated by the JSBM, BPGCN clearly outperforms several well-known existing graph convolution networks and obtains extraordinary classification accuracy in the parameter regimes where conventional GCNs fail to work; on real-world networks, BPGCN also displays performance comparable to state-of-the-art GCNs. Compared with conventional GCNs, BPGCN is quite powerful in extracting label information from edge connectivities; this advantage is rooted in its nonbacktracking convolution kernel, inherited from the BP algorithm. The weakness of BPGCN is demonstrated by its performance on the Pubmed data set, in which case there are too few attributes for the attribute graph to be approximated by an SBM; a remedy for applying BPGCN to such graphs was proposed, which uses an MLP as a preprocessing step, and the corresponding advanced version of BPGCN was implemented and tested. Based on the fact that BPGCN is immune to the sparsity issue and the overfitting issue exposed by conventional GCNs, we discussed possible ways, inspired by BPGCN, to improve current GCN techniques. It would be interesting to combine the successful features of BPGCN and state-of-the-art GCNs in greater depth and to design new architectures for graph convolution neural networks; we leave this idea for future work.
PYTORCH and C++ implementations of our BPGCN and other GCNs on JSBM together with the real-world data sets used in our experiments are available at Ref.
Since we have nonzero interactions between every pair of nodes, in Eq. (A1) we have in total n(n − 1) + 2mn messages. This results in an algorithm where even a single update takes O(n²) time, making it suitable only for networks of up to a few thousand nodes. Fortunately, for large sparse networks, i.e., when n and m are large, we can neglect terms of subleading order in the equations.
In this case, it is assumed that i or μ sends the same message to all its non-neighbors j, and these messages are viewed as representing an external field. This way, we only need to keep track of 2M messages, where M is the total number of edges, and each update step takes O(n + m) time.
Suppose that j ∉ ∂i; then the message ψ^{i→j} simplifies, and similarly for ψ^{μ→i}. To leading order, the messages on nonedges do not depend on the target node. For j ∈ ∂i, the terms with O(1/n) and O(1/m) contributions to ψ^{i→j}_{t_i} combine into the definition of the auxiliary external field. Applying the same approximations to the external fields acting on the attribute nodes, one finally arrives at the BP equations (3).

APPENDIX B: DETECTABILITY TRANSITION ANALYSIS USING NOISE PERTURBATIONS
On a graph generated by the JSBM with n item nodes and m attribute nodes, the probability p1(k1) that an item node has k1 neighboring item nodes follows a Poisson distribution with average degree c1, and the probability p2(k2) of it being associated with k2 attribute nodes follows a Poisson distribution with average degree c2. Similarly, the probability p3(k3) that an attribute node has k3 neighboring item nodes follows a Poisson distribution with average degree c3 = nc2/m. Considering a branching process on a graph generated by the JSBM of infinite size, the average branching ratio of the process is related to the excess degree, defined via the average number of neighbors. Since the degree distributions are Poisson, the average excess degree of an item node in the connectivity graph equals the average degree, c̃1 = c1; similarly c̃2 = c2, and the excess degree of the attribute nodes is c̃3 = c3. Consider the noise propagation process on a tree graph as depicted in Fig. 2, with depth l → ∞. In the tree, odd layers contain exclusively item nodes, and even layers contain attribute nodes (blue boxes) and item-item edges (green boxes), which connect item nodes in odd layers. Assume that on the leaves of the tree (nodes in the lth layer) the paramagnetic fixed point is perturbed by a small perturbation η^{v_l→v_{l−1}}, where v_l represents an item node in the lth layer and v_0 corresponds to the root node i in Fig. 2.
Now let us investigate the influence of the perturbation, from the message on any leaf to the message on the root node. For simplicity, first choose a path containing only item nodes (i.e., not connected via any feature node); we later generalize to paths containing both item nodes and attribute nodes. We define the Jacobian matrix T^{i→j} with respect to the message passing from an item node i to another item node j along an edge (ij), the Jacobian matrix T^{μ→i} corresponding to the message passing from an attribute node μ to an item node i, and similarly the matrix T^{i→μ} for the backward passage. These matrices represent the propagation strength of Eq. (3) between two messages in the vicinity of the paramagnetic fixed point.
We can see that all three matrices are independent of the node indices, depending only on the types of the nodes. If the path contains only item nodes, the perturbation η^{v_0→x} on the root node induced by the perturbation η^{v_l→v_{l−1}} on the leaf node can be written in vector form as η^{v_0→x} = (T^{i→j})^l η^{v_l→v_{l−1}}. Now consider a path containing both item nodes and attribute nodes: Every time an attribute node is passed, the product T^{i→μ} T^{μ→i} replaces T^{i→j}; therefore, the total weight acting on the path is (T^{i→j})^s (T^{i→μ} T^{μ→i})^{l−s}, where s is the number of item-item edges and l − s is the number of attribute nodes on this path. For l → ∞, (T^{i→j})^s (T^{i→μ} T^{μ→i})^{l−s} is dominated by the product of the largest eigenvalues λ_A, λ_F, and λ'_F of the three matrices; notice that λ'_F = λ_F. Then consider the collection of all perturbations on the root node v_0 from all leaves. Obviously the mean is 0, and the variance of the tth component of the perturbation vector can be computed using the property that the perturbations on different leaves are independent. The paramagnetic fixed point is locally unstable under random perturbations when (c1 λ_A² + c2 c3 λ_F⁴)^l > 1 as l → ∞, so the detectability phase transition is located at c1 λ_A² + c2 c3 λ_F⁴ = 1.

APPENDIX C: GRAPH CONVOLUTION NETWORKS AND THE SPECTRAL LOCALIZATION PROBLEM OF LINEAR CONVOLUTION KERNELS
Convolution networks [40] have proven to be among the most successful models for image classification [41] and many other machine learning problems [2]. The success of CNNs is credited to the convolution kernel, a linear operator defined on gridlike structures. In recent years, a significant amount of attention has been paid to generalizing convolution operations to graphs, which do not have grid structures.
In Ref. [42], the authors propose to define convolutional layers on graphs that operate on the spectrum of the graph Laplacian,

X_{l+1} = σ(V ĝ V^⊤ X_l),

where X_l is the state at the lth layer, V is the eigenvector matrix of the graph Laplacian L = I − D^{−1/2} A D^{−1/2} (A is the adjacency matrix and D is the diagonal matrix of node degrees), σ(·) is an elementwise nonlinear activation function, and ĝ is a diagonal matrix representing a kernel in the frequency (graph Fourier) domain. Usually ĝ contains only a few nonzero elements, working as cutoffs on the frequencies in the Fourier domain, because a few eigenvectors of the graph Laplacian are believed to be sufficient for describing the smooth structures of the graph. Computing only the leading eigenvectors is also computationally efficient.
Even computing only a few eigenvectors of the Laplacian of a large graph can be quite cumbersome, so in Ref. [43] the authors proposed to parametrize the kernel in the frequency domain through a truncated series expansion, using Chebyshev polynomials up to order K:

ĝ(Λ) ≈ Σ_{k=0}^{K} c_k T_k(Λ̃), Λ̃ = (2/λ_max) Λ − I,

where the c_k are coefficients, the T_k are the recursively defined Chebyshev polynomials [T_k(x) = 2x T_{k−1}(x) − T_{k−2}(x), with T_0(x) = 1 and T_1(x) = x], and λ_max is the largest eigenvalue. In Ref. [1], the authors limited the convolution operation to K = 1, approximated λ_max ≈ 2, and obtained a filter with two free parameters θ_1 and θ_2. Further, Ref. [1] restricted the two parameters by specifying θ = θ_1 = −θ_2 and introduced the renormalization trick

Â = D̃^{−1/2} Ã D̃^{−1/2}, Ã = A + I, D̃_ii = 1 + Σ_j A_ij.

Finally, generalizing to node features of κ components, that is, to the κ-channel signal X ∈ R^{n×κ}, one arrives at the GCN [1], with the forward model taking the form

X_{l+1} = σ(Â X_l W_l),

where σ(·) is an activation function such as ReLU and X_l ∈ R^{n×κ_l} is the network state at layer l (n denotes the number of nodes in the graph and κ_l the dimension of the state at layer l).
The matrix W_l ∈ R^{κ_l×κ_{l+1}} is the trainable weight matrix at the lth layer, and the propagator Â [Eq. (21)], which convolves the neighborhood of each node, is shared over the whole network.
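The truncated Chebyshev filtering that underlies this derivation can be sketched numerically; a minimal illustration using the three-term recursion (the function name and test matrices are ours):

```python
import numpy as np

def chebyshev_filter(L, x, coeffs, lam_max=2.0):
    """g(L) x ≈ Σ_k c_k T_k(L̃) x with L̃ = (2/λ_max) L − I, using the
    recursion T_k(L̃) x = 2 L̃ T_{k-1} x − T_{k-2} x."""
    n = L.shape[0]
    L_tilde = (2.0 / lam_max) * L - np.eye(n)
    Tx_prev, Tx = x, L_tilde @ x              # T_0 x and T_1 x
    out = coeffs[0] * Tx_prev
    if len(coeffs) > 1:
        out = out + coeffs[1] * Tx
    for c in coeffs[2:]:
        Tx_prev, Tx = Tx, 2.0 * (L_tilde @ Tx) - Tx_prev
        out = out + c * Tx
    return out

# sanity check: L = 2I and λ_max = 2 give L̃ = I, so every T_k(L̃) x = x
L = 2.0 * np.eye(2)
x = np.array([[1.0], [2.0]])
y1 = chebyshev_filter(L, x, [0.0, 1.0])
```

Each additional Chebyshev order costs only one extra sparse matrix-vector product, which is why the K = 1 truncation of Ref. [1] is so cheap.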
Of the many developments of GCNs in recent years, almost all can be understood as forming an effective object representation by starting from the original feature vectors and projecting them forward through a finite number of multiplications by a linear convolution kernel. Thus, the main principle of graph convolution kernels is inspired by the spectral properties of linear operators, such as Laplacians, normalized Laplacians [1,35], and random walk matrices [3]. The underlying assumption is that the eigenvectors of the graph convolution kernel contain global information about the group labels, which can be revealed during the forward propagation of GCNs.
However, it is known that this assumption may not hold on large sparse networks, because the eigenvectors of conventional linear operators such as graph Laplacians are subject to the localization problem, induced by degree fluctuations or the local structures of the graph [30,36]. Even on random graphs, where the Poisson degree distribution is rather concentrated, the localization problem remains significant on large graphs. For the adjacency matrix, it is well known that the largest eigenvalue is bounded below by the square root of the largest node degree [which grows as log(n)/log log(n)] and thus diverges as the number of nodes n → ∞. The corresponding eigenvectors therefore only report information about the largest-degree nodes. For normalized matrices such as the normalized Laplacian, there are many eigenvalues very close to 0, whose eigenvectors report information about local dangling subgraphs rather than global structures related to the group labels. We refer to Refs. [30,36] for detailed analyses of the spectrum localization of graph Laplacians and for a comparison between spectral algorithms using different operators on large sparse graphs. Unfortunately, real-world networks are usually sparse: As we can see from Table I (although these networks might not be large enough), the citation networks we used in our experiments all have an average degree of around 4, which is extremely low.
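The lower bound λ_max ≥ √d_max (which follows from the Rayleigh quotient of A² at the hub node) can be checked numerically; a minimal illustration on a sparse Erdős–Rényi graph with average degree ≈ 4 (the graph and seed are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 400, 4.0 / 400                  # sparse Erdős–Rényi graph, mean degree ~4
upper = rng.random((n, n)) < p
A = np.triu(upper, 1)
A = (A + A.T).astype(float)            # symmetric adjacency matrix

lam_max = np.max(np.linalg.eigvalsh(A))
d_max = A.sum(axis=1).max()
# the leading eigenvalue is pulled up by the largest hub: λ_max ≥ √d_max
assert lam_max >= np.sqrt(d_max) - 1e-9
```

As n grows, d_max (and hence λ_max) keeps increasing while the informative part of the spectrum stays bounded, so the leading eigenvectors end up describing hubs rather than communities.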
On large synthetic networks, the sparsity issue of conventional GCNs is revealed even more clearly (see the main text). We carried out extensive experiments to verify this analysis (Fig. 5).

APPENDIX D: BELIEF PROPAGATION GRAPH CONVOLUTION NETWORK(BPGCN)
On a graph generated by the JSBM with n item nodes, m attribute nodes, nc1/2 item-item edges, and nc2 item-attribute edges, the average degree of item nodes in the connectivity graph is c1 and the average degree in the attribute graph is c2. We define matrices storing the cavity messages, Ψ^{i→j}_l ∈ R^{nc1×κ}, Ψ^{i→μ}_l ∈ R^{nc2×κ}, and Ψ^{μ→i}_l ∈ R^{nc2×κ}, and marginal matrices Ψ^i_l ∈ R^{n×κ} and Ψ^μ_l ∈ R^{m×κ}, where l indexes the cavity messages after the lth iteration step. The BP equations (3) can then be written in matrix-multiplication form as Eq. (D1), where P and Q are affinity matrices of size κ × κ.
FIG. 5. Overlap [Eq. (15)] obtained by BP, BPGCN, and reference GCNs on synthetic networks generated by the JSBM with different parameters. All networks have n = m = 10 000 and κ = 5 groups. The fraction of training labels is fixed at ρ = 0.05. Each data point is averaged over 10 random instances. In the top row, the average degree of the connectivity graph is fixed to c1 = 4, 6, and 10 from panels (a) to (c). In the second row, the average degree of the attribute graph is fixed to c2 = 4, 6, and 10 from panels (d) to (f). In panels (a) to (f), ε1 is fixed to 0.1 and ε2 is fixed to 0.2.

Kernel matrices U_I to U_III and B_I to B_V are nonbacktracking matrices [30] that encode the adjacency information of cavity messages and marginals, defined in Eq. (D2). Therefore, the matrix-multiplication form of the parallel-updated BP equations can be viewed as the forward model of a neural network. When the above equations are propagated and truncated after L iteration steps, the resulting algorithmic scheme is termed BPGCN, with L layers. States in the last layer are extracted as the output of BPGCN for computing the cross-entropy loss on training labels [Eq. (18)]. The matrices Q and P are the trainable parameters of BPGCN and are updated using the back-propagation algorithm. The input of the neural network consists of probability-normalized random initial cavity messages and marginals. In order to accelerate the training of BPGCN, we use an adjustable external field strength γ as a hyperparameter. For example, suppose node i is in the training set, i.e., we have label t_i for node i; then a term γ log(0.1, 0.1, ..., 0.9, ..., 0.1), with 0.9 at position t_i, is added to Eq. (D1) inside the Softmax function. When γ approaches infinity, node i is pinned to label t_i [15]; γ is tuned on the validation set (i.e., when training BPGCN with the training set, we tune the hyperparameters to obtain better accuracy on the validation set). For nodes in the training set,
the cavity probabilities and marginals are initialized as one-hot vectors (0, ..., 1, ..., 0). Thus, the parameters of BPGCN are P and Q, and the hyperparameters are ε1, ε2, and γ. The hyperparameters used in the experiments on real-world data sets (Table I) are as follows. For Citeseer and Cora, L = 5 and external field strength γ = 2.0; on Cora we set ε1 = 0.1 and ε2 = 0.6; on Citeseer we set ε1 = 0.1 and ε2 = 0.5. For Pubmed, L = 2, γ = 1.5, ε1 = 0.3, and ε2 = 0.8; when the attribute graph is preprocessed with an MLP to provide BPGCN with a prior (external field), we set L = 12, γ = 0.5, and ε1 = 0.1. For both the Karate Club and Political Blogs networks, L = 5, γ = 0.5, and ε1 = 0.1 (ε2 is not applicable).

APPENDIX E: MORE COMPARISONS ON SYNTHETIC NETWORKS
We carried out extensive numerical experiments on synthetic networks generated by the JSBM with various parameters. The overlaps of BP, BPGCN, and the reference algorithms GCN, APPNP, GAT, and SGCN are compared in Fig. 5. In the first row of the figure, the average degree of the connectivity graph c1 is fixed and c2 varies. Owing to the spectrum localization problem of graph Laplacians discussed earlier, when c1 is fixed to 4, GCN, APPNP, GAT, and SGCN work much worse than BPGCN, even when c2 is large. We also see that when c1 is large, most of the tested GCNs work well, even for a very low c2. In the second row, the average degree of the attribute graph c2 is fixed and c1 varies. The performances of the reference GCNs approach that of BPGCN as c1 gradually increases.
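The overlap used throughout these comparisons, Eq. (15), can be computed as below. We assume here the standard community-detection convention (accuracy maximized over permutations of the group labels, rescaled so that random guessing scores 0 and perfect recovery scores 1); the exact normalization in Eq. (15) may differ.

```python
import itertools
import numpy as np

def overlap(true_labels, pred_labels, kappa):
    # Accuracy maximized over the kappa! permutations of group labels
    # (inferred groups carry no intrinsic names), then rescaled so that
    # random guessing gives 0 and perfect recovery gives 1.
    best = 0.0
    for perm in itertools.permutations(range(kappa)):
        mapped = np.array([perm[p] for p in pred_labels])
        best = max(best, np.mean(mapped == true_labels))
    return (best - 1.0 / kappa) / (1.0 - 1.0 / kappa)
```

The brute-force permutation search is fine for the small κ used here (κ = 5); for large κ one would use a maximum-weight matching instead.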
In the third row of the figure, c1, c2, and ε1 are fixed and ε2 varies. On the left, c1 = c2 = 4 and ε1 = 0.1. The graphs are in the sparse regime, so conventional GCNs do not work well due to the sparsity issue. In the middle, c1 = c2 = 10 and ε1 = 0.4. The graphs are not sparse, but the edges contain relatively little information about the group labels. The results show that SGCN has trouble in this regime even when ε2 is very small. On the right, c1 = c2 = 10 and ε1 = 0.1. The graphs are not in the sparse regime, and the edges contain enough information about the labels. We see that all GCNs except APPNP perform reasonably well; APPNP has trouble at low ε2, probably because a nonoptimal ω parameter was learned.
In the last row of the figure, c1, c2, and ε2 are fixed, and the inverse signal-to-noise ratio ε1 varies. On the left, c1 = c2 = 4 and ε2 = 0.1. We see that BPGCN starts to show a large variance, but its overlap is on average much higher than that of the other GCNs, owing to the sparsity of the graphs. In the middle, c1 = c2 = 10 and ε2 = 0.4, meaning that the item-attribute edges of the attribute graph contain little information about the labels, and the major information comes from the connectivity graph, which is not sparse. In this case, we see that the performances of the reference GCNs are comparable to those of BP and BPGCN, since there is no sparsity issue. On the right, c1 = c2 = 10 and ε2 = 0.1. There is no sparsity issue for conventional GCNs, and the attribute graph contains enough information about the labels. However, the result is quite surprising: while BP and BPGCN give almost 100% classification accuracy over the full range of ε1, conventional GCNs yield very low accuracy for ε1 larger than 0.5. This phenomenon implies that when the item-item edges of the connectivity graph are quite noisy, the information from the attributes is not well extracted by conventional GCNs.
To understand the reason for this observation, in Fig. 6 we plot the training process of GCN, MLP, and BPGCN on a graph generated with c1 = c2 = 10, ε1 = 1, and ε2 = 0.1. In this case, the connectivity graph is a purely random graph, with edges containing no information about the labels at all (ε1 = 1), while the attribute graph contains adequate information about the ground truth (ε2 = 0.1). We can see that BPGCN yields very high accuracy on both the training set and the test set, even from the beginning, because ε2 is so small that the initial values for BPGCN are already good enough. At the beginning of the training process, MLP has low training accuracy as well as low test accuracy. But after several epochs of training, its training accuracy goes to 1, and it also generalizes well to the test set, on which the accuracy gradually increases to around 0.6. This generalization does not happen for GCN: although GCN fits the training set very well, its overlap on the test set is only slightly above 0.2, i.e., the result of a random guess. Obviously, GCN overfits the training set; the reason is that GCN assumes that edges always carry information about the labels, even when the underlying graph is purely random. Moreover, we can deduce directly from the APPNP formula [Eq. (24)] that, since APPNP uses the ratios ω and 1 − ω to weight the contributions of the item-item edges and the item-attribute edges, the best possible result of APPNP on graphs of the above type, where the item-item edges contain no information and the item-attribute edges contain all the information, is almost equal to the result of MLP. Indeed, in a test with ε1 = 1, ε2 = 0.1, and c1 = c2 = 10, APPNP achieves an overlap of around 0.6, while MLP reaches around 0.66.
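The APPNP argument can be made concrete with the personalized-PageRank-style propagation sketched below. We follow the ω / (1 − ω) weighting described above rather than reproducing Eq. (24) exactly; A_hat (a normalized adjacency matrix) and H (the attribute-based prediction, e.g., an MLP output) are illustrative names.

```python
import numpy as np

def appnp_propagate(A_hat, H, omega=0.1, K=10):
    # Sketch of APPNP-style propagation: omega weights the item-item
    # (graph) contribution and 1 - omega weights the attribute-based
    # prediction H. With omega = 0 the output is exactly H, i.e., the
    # MLP result -- which is why APPNP cannot beat the MLP when the
    # item-item edges carry no label information.
    Z = H.copy()
    for _ in range(K):
        Z = omega * (A_hat @ Z) + (1.0 - omega) * H
    return Z
```

When the connectivity graph is pure noise, any ω > 0 only mixes in uninformative neighbors, so the optimum over ω degenerates to the MLP prediction H, consistent with the measured overlaps (≈0.6 for APPNP versus ≈0.66 for MLP).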

FIG. 1. Illustration of the joint stochastic block model (JSBM). Circles denote item nodes, whose interconnections constitute the connectivity graph, with adjacency matrix A defined over the edges; boxes denote feature nodes. The attribute graph is the bipartite graph between items and attributes, whose adjacency matrix is F. The JSBM graph is generated by parameters θ = {P, Q, α, β}, with κ groups of nodes built in for both items and features.

FIG. 2. Illustration of noise propagation on the JSBM. Noise is transmitted from the leaves to the root node i, with each transmission weighted by the eigenvalues of the Jacobian matrices T determined by the BP message-passing equations [Eq. (10)]. Red circles denote item nodes, green boxes represent item-item edges connecting item nodes, and blue boxes denote attribute nodes.

FIG. 3. Detectability phase transition of the JSBM. Synthetic JSBM graphs are generated with n = 2 × 10^5, m = 2 × 10^5, group number κ = 2, and average degrees c1 = c2 = 3. [(a), (b)] Overlap as a function of ε1 or ε2, with the other parameter fixed to 0.3 [(a) ε1 fixed; (b) ε2 fixed]. The numerical detectability phase transition point agrees with the theoretical calculation [Eq. (11)], up to small fluctuations due to finite-size effects. (c) Overlap in the ε1-ε2 plane. In all three plots, dashed lines indicate the theoretical detectability phase transition, separating the detectable phase [white in panels (a) and (b)] from the paramagnetic phase [gray in panels (a) and (b)]. Numerical and analytical results are consistent.

TABLE I. Classification performance of BPGCN and reference GCNs on real-world networks. c denotes the average degree of the connectivity graph. For the reference GCNs, results on Karate Club and Political Blogs were obtained using publicly available implementations of these algorithms; results on Cora, Pubmed, and Citeseer are adapted from Ref. [33]. For BPGCN, the reported accuracy values are based on the vanilla version. The preprocessed version of BPGCN greatly enhances the classification accuracy on the Pubmed network, up to 81.7% (see text).
We use 10 item nodes per group for training and 10 item nodes per group for validation. The Citeseer, Cora, and Pubmed networks are standard data sets for semisupervised classification, widely used in graph convolution network studies; we follow the splitting rule for training, test, and validation sets introduced in Ref. [33].