Equivalence between algorithmic instability and transition to replica symmetry breaking in perceptron learning systems

The binary perceptron is a fundamental model of supervised learning with non-convex optimization, which lies at the root of popular deep learning. The binary perceptron can classify random high-dimensional data by computing the marginal probabilities of its binary synapses. However, the relationship between the algorithmic instability and the equilibrium analysis of the model remains elusive. Here, we establish this relationship by showing that the instability condition around the algorithmic fixed point is identical to the instability condition for breaking the replica symmetric saddle-point solution of the free energy function. Our analysis thus hopefully provides insights into other learning systems, bridging the gap between non-convex learning dynamics and the statistical mechanics properties of more complex neural networks.


I. INTRODUCTION
Theoretical studies of neural networks have become increasingly important in recent years [1][2][3], as deep neural networks are widely used across scientific and industrial domains. One of the most powerful theoretical tools is the replica method, which can derive equilibrium properties of neural networks (systems of interacting neurons or synapses), such as phase diagrams [4][5][6][7], storage capacities [8][9][10][11], and even the large-deviation behavior of learning algorithms [12][13][14]. Intuitively, the replica method introduces n (an integer) copies of the original system. Within each copy, there exist strong interactions among constituent elements (e.g., synaptic or neural states), and these interactions make the model intractable without approximation in most cases. However, the elements become decoupled from each other once an overlap (of states) matrix is introduced, which allows a hierarchical level of approximation depending on the stability analysis of the saddle points of the free energy action. A seminal approximation scheme, namely replica symmetry breaking, was introduced by Giorgio Parisi in the 1980s [15,16].
Overall, the replica method, despite its non-intuitive physics, can lead to exact results in some models. One drawback of this method is that it cannot be used to design efficient algorithms for neural networks. Instead, the cavity method is constructed in a physically intuitive way: a statistical mechanics model of learning can be mapped onto a graphical model, where interactions are represented by factor nodes and synapses by variable nodes (such as the graphical model representation in unsupervised learning [17]). By virtually deleting these two kinds of nodes, a cavity probability can be defined. Using the tree-like structure of the factor graph, or the weakly-interacting-element assumption, an iterative equation for these cavity probabilities can be derived, which leads to self-consistent evaluations of thermodynamic quantities, such as the ground-state energy, free energy, and entropy [18][19][20]. Most interestingly, this iterative equation is exactly the same as the belief propagation developed independently in computer science [21]. Belief propagation can also be derived for learning problems with discrete synapses [22]. An open question is whether the replica symmetry breaking transition corresponds to the algorithmic instability in learning of neural networks.
Here, we provide a proof of this fundamental equivalence in the seminal model of binary perceptron learning, in which learning is achieved by adjusting discrete synapses (the synaptic state takes ±1). This model was first studied by Gardner and Derrida [10,23]. A follow-up calculation showed that the storage capacity of this model is given by P_c ≃ 0.833N [11], where N is the number of neurons, and P_c is the critical number of random patterns being correctly classified. The binary perceptron belongs to the NP-hard class in worst-case complexity. The typical weight configuration is quite hard to find by any algorithm based on local flips (e.g., Monte Carlo dynamics) [24][25][26][27]. A first efficient algorithm was inspired by the cavity method, reaching an algorithmic threshold P_alg ≃ 0.72N.
It was then proved, by defining a distance-dependent potential, that the entire solution space is composed of single valleys of vanishing entropy [28]. This picture was further shown to be mathematically rigorous in some perceptron learning problems [29]. However, the region of the solution space accessed by practical algorithms does not belong to the equilibrium hard-to-reach isolated parts, but to subdominant dense parts [12,13]. These dense parts were further shown to have good generalization properties [14,30], providing a new paradigm for understanding deep learning.
Therefore, studying the mathematical foundation of the binary perceptron problem is fundamentally important to our understanding of neural networks. To the best of our knowledge, there are few studies on the relationship between the replica symmetry breaking transition and the belief propagation instability along this line. In this work, we show how the belief propagation instability is connected to the instability of the replica symmetric (RS) solution (or replicon mode) of the model.

II. BINARY PERCEPTRON
The binary perceptron is a single-layer neural network that learns a random input-output mapping with discrete synapses (see Fig. 1). We assume that there are P uncorrelated input-output associations, where the µ-th one consists of an N-dimensional pattern ξ^µ and a corresponding label σ^µ_0, and where ξ^µ_i and σ^µ_0 take ±1 with equal probabilities. Given a configuration of synaptic weights {J_i}_{i=1}^N (each entry takes +1 or −1), the binary perceptron gives the output σ^µ = sgn(Σ_{i=1}^N J_i ξ^µ_i) for the input pattern ξ^µ. If σ^µ = σ^µ_0, we say that the synaptic weight vector J has recognized the µ-th pattern. The binary perceptron is able to store an extensive number of random patterns. Therefore, we define a loading rate α = P/N.
When the loading rate is below some threshold, there exists at least one set of synaptic weights that correctly classifies all the patterns. However, when α exceeds the threshold, it is impossible to find a compatible configuration of weights for all patterns [11]. This threshold is also defined as the storage capacity. Naturally, we define the energy of this model as the number of misclassified patterns,

E(J) = Σ_{µ=1}^{P} Θ(−(σ^µ_0/√N) Σ_{i=1}^{N} J_i ξ^µ_i),    (1)

where Θ(x) is a step function with the convention that Θ(x) = 0 if x ≤ 0 and Θ(x) = 1 otherwise. The prefactor 1/√N ensures that the statistical mechanics analysis leads to an extensive free energy.
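As a concrete check of this definition, the energy can be evaluated numerically; the sketch below simply counts misclassified patterns (the function name and array layout are ours, not from the text):

```python
import numpy as np

def perceptron_energy(J, xi, sigma0):
    """Energy E(J): the number of misclassified patterns.

    J      : (N,) array of binary weights, entries +1/-1
    xi     : (P, N) array of binary patterns
    sigma0 : (P,) array of binary labels
    Pattern mu is misclassified when sigma0^mu * w_mu < 0, with
    pre-activation w_mu = (1/sqrt(N)) sum_i J_i xi^mu_i; this matches
    Theta(-sigma0^mu * w_mu) = 1 under the convention Theta(x) = 1 iff x > 0.
    """
    N = J.shape[0]
    w = xi @ J / np.sqrt(N)  # pre-activations of all P patterns
    return int(np.sum(sigma0 * w < 0))
```

The 1/√N prefactor does not affect the count; it only sets the scale of the pre-activations used in the statistical mechanics analysis.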
In the zero-temperature limit, the flat measure over the weights realizing the pattern-label associations can be computed as

P(J) = (1/Z) Π_{µ=1}^{P} Θ((σ^µ_0/√N) Σ_{i=1}^{N} J_i ξ^µ_i),    (2)

where Z is not only the partition function but also the number of solutions of the learning problem. Equation (2) can be derived from the finite-temperature Boltzmann measure P(J) ∝ e^{−βE(J)}. Notice that the gauge transformation ξ^µ_i → ξ^µ_i σ^µ_0 applied to each pixel of the input patterns does not affect the Boltzmann measure. We thus assume σ^µ_0 = +1 for all patterns in the following analysis.

III. MEAN-FIELD MESSAGE PASSING EQUATIONS FOR LEARNING
The belief propagation (BP) algorithm is an iterative mean-field equation that calculates the marginal probabilities of the synaptic states by passing beliefs between two types of nodes (function nodes and variable nodes) [3]. In other words, the beliefs or cavity probabilities can be regarded as messages, and thus the BP algorithm is actually a mean-field message-passing equation. Taking the pattern-classification constraints as function nodes and the synaptic weights as variable nodes, we obtain the iterative equations for learning as follows [22,31], where H(x) ≡ ∫_x^∞ Dz with the Gaussian measure Dz, m_{i→ν} is a cavity magnetization parameterizing the cavity probability P(J_i|{ξ^µ, µ ≠ ν}) = (1 + m_{i→ν} J_i)/2, and w_{µ→i} and σ_{µ→i} represent the mean and variance of the Gaussian distribution of U_{µ→i} ≡ (1/√N) Σ_{j≠i} J_j ξ^µ_j, respectively. We have applied the central limit theorem to the sum U_{µ→i} of weakly-correlated terms. This mean-field approximation must be cross-checked by numerical experiments. In the following analysis, we use µ, ν to indicate function nodes (pattern constraints), and i, j to indicate variable nodes.
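A single sweep of these message-passing equations can be sketched numerically. The sketch below assumes the standard large-N Gaussian form of the messages in the σ^µ_0 = +1 gauge, with the cavity bias written as a log-likelihood ratio for J_i = ±1; the function names and this explicit form are our reconstruction, not quoted from the text:

```python
import math
import numpy as np

def H(x):
    """Gaussian tail probability H(x) = ∫_x^∞ Dz, floored to keep logs finite."""
    return max(0.5 * math.erfc(x / math.sqrt(2.0)), 1e-300)

def bp_sweep(xi, m):
    """One sweep of the message-passing equations in the sigma0 = +1 gauge.

    xi : (P, N) pattern matrix with +1/-1 entries
    m  : (N, P) cavity magnetizations m_{i->mu}
    Returns updated (m, u) with u[mu, i] = u_{mu->i}.
    """
    P, N = xi.shape
    sqrtN = math.sqrt(N)
    u = np.zeros((P, N))
    # full (non-cavity) means and variances of the pre-activation sums
    W = np.sum(xi * m.T, axis=1)             # shape (P,)
    S = np.sum(1.0 - m.T ** 2, axis=1) / N   # shape (P,)
    for mu in range(P):
        for i in range(N):
            w_cav = (W[mu] - xi[mu, i] * m[i, mu]) / sqrtN          # cavity mean
            s_cav = max(S[mu] - (1.0 - m[i, mu] ** 2) / N, 1e-12)   # cavity variance
            sq = math.sqrt(s_cav)
            # log-likelihood ratio for J_i = +1 versus J_i = -1
            u[mu, i] = 0.5 * (math.log(H(-(w_cav + xi[mu, i] / sqrtN) / sq))
                              - math.log(H(-(w_cav - xi[mu, i] / sqrtN) / sq)))
    h = np.sum(u, axis=0)              # full local fields, shape (N,)
    m_new = np.tanh(h[:, None] - u.T)  # m_{i->mu} = tanh(h_i - u_{mu->i})
    return m_new, u
```

Starting from unbiased messages (m = 0), each cavity bias u_{µ→i} acquires the sign of ξ^µ_i after the first sweep, as expected for patterns with label +1.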
We remark that Eq. (3) can be combined with an iterative reinforcement to develop an efficient solver. The reinforcement is a kind of soft decimation, which progressively enhances or weakens the current local fields (a summation of the cavity biases u_{µ→i}) with a probability that increases with the iterations. The algorithm terminates once a solution is found. This procedure yields the algorithmic threshold α_alg ≃ 0.72 [22]. During the stochastic reinforcement, the BP iteration does not require convergence, despite the convergence guarantee below the storage capacity. The algorithmic threshold was later found to be below a large-deviation threshold α_LD ≃ 0.77, beyond which the subdominant dense clusters fragment into separate regions [12,32]. However, our current analysis is restricted to the original BP iteration [Eq. (3)], rather than the dynamics of reinforced BP and the geometric landscape. It remains challenging to use our framework to derive the landscape geometry, which currently relies heavily on lengthy replica computations. The following analysis may shed light on this important research line.

IV. TIME EVOLUTION OF MESSAGE DISTRIBUTIONS
In this section, we study the iteration dynamics of belief propagation. In the large-N limit, u_{µ→i} can be approximated by a first-order Taylor expansion, where G(x) = exp(−x²/2)/√(2π) and H(x) ≡ ∫_x^∞ Dz with the Gaussian measure Dz ≡ G(z)dz. At iteration step t, the macroscopic distributions of the messages m^t_{l→µ} and u^t_{µ→l} are given by Eq. (5). According to Kabashima's method [33], the time evolution of π^t_1(x) and π^t_2(x) can be written in an iterative form [Eq. (6)], where ⟨···⟩ represents the disorder average over ξ. Note that ξ^µ_i is independent of the pattern entries in the sum of X_µ.
We then introduce an auxiliary field h^t_{l→µ} = Σ_{ν≠µ} u^t_{ν→l} = tanh⁻¹(m^t_{l→µ}) and its macroscopic distribution ρ(h). When P becomes infinite (e.g., P ∝ N), due to the central limit theorem, the distribution of the auxiliary field can be regarded as a Gaussian distribution, where E_t and F_t are the mean and variance of the Gaussian distribution ρ(h), respectively.
In fact, E_t = 0 because ξ^µ takes ±1 with equal probabilities. With the expression of ρ^t(h), we get π^{t+1}_1(x) = ∫ dh ρ^t(h) δ(x − tanh(h)) for Eq. (6a). Plugging this expression into Eq. (6b) and using Eq. (8), we obtain a compact expression for the update of F_t [Eq. (9)]. We leave the technical details of this derivation to Appendix A. Note that this result is exactly identical to the saddle-point equation under the replica symmetric assumption, which we shall briefly introduce in Sec. VI.
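Numerically, the update of F_t (equivalently, the RS saddle-point equation of Sec. VI) can be iterated to a fixed point. The sketch below assumes the standard Krauth-Mezard form of the equations, q̂ = α/(1−q) ∫Dz [G(Z)/H(Z)]² with Z = √(q/(1−q)) z and q = ∫Dz tanh²(√q̂ z), which we believe matches Eq. (9); all function names are ours:

```python
import math
import numpy as np

# Gauss-Hermite quadrature: ∫ Dz f(z) ≈ Σ_k wgt_k f(z_k)
_nodes, _weights = np.polynomial.hermite.hermgauss(40)
_z = math.sqrt(2.0) * _nodes
_wgt = _weights / math.sqrt(math.pi)

def _ratio(Z):
    """G(Z)/H(Z), using the large-Z asymptotic G/H ~ Z + 1/Z to avoid 0/0."""
    out = np.empty_like(Z)
    small = np.abs(Z) < 8.0
    g = np.exp(-0.5 * Z[small] ** 2) / math.sqrt(2.0 * math.pi)
    h = 0.5 * np.array([math.erfc(x / math.sqrt(2.0)) for x in Z[small]])
    out[small] = g / h
    big = ~small
    out[big] = np.where(Z[big] > 0, Z[big] + 1.0 / Z[big], 0.0)
    return out

def rs_fixed_point(alpha, iters=2000, damping=0.5):
    """Damped fixed-point iteration of the RS saddle-point equations."""
    q, qhat = 0.1, 0.1
    for _ in range(iters):
        Z = math.sqrt(q / (1.0 - q)) * _z
        qhat_new = alpha / (1.0 - q) * float(np.sum(_wgt * _ratio(Z) ** 2))
        q_new = float(np.sum(_wgt * np.tanh(math.sqrt(qhat_new) * _z) ** 2))
        q = damping * q + (1.0 - damping) * q_new
        qhat = damping * qhat + (1.0 - damping) * qhat_new
    return q, qhat
```

As expected, both the overlap q and its conjugate q̂ grow monotonically with the loading rate α, since more constraints shrink the solution space.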

V. MICROSCOPIC INSTABILITY OF THE ALGORITHMIC ITERATION
In this section, we turn to the analysis of the microscopic stability of the BP equations at a fixed point. When a field fluctuation δh^t_{l→ν} is introduced around the fixed point m^t_{l→ν} = m_{l→ν}, the time evolution of δh^t_{l→ν} can be computed as in Eqs. (10) and (11). Note that on the right-hand side of Eq. (11), all messages and perturbations refer to their values at the previous step (t − 1). In the following analysis (including Appendix A), we omit this time index. We then define the macroscopic distribution of δh^t_{l→ν} as f^t(y) [33]. Due to the central limit theorem, f^t(y) can be assumed to take a Gaussian form, where a_t and b_t are the mean and variance of the distribution, respectively. The time evolution of f^t(y) is provided by a functional equation [Eq. (13)]. Following the same spirit as before, a_t is zero. We thus only need to focus on the update of b_t; the details of this derivation are given in Appendix A. We finally obtain the linear recursion b^{t+1} = γ b^t [Eq. (14)]. When γ < 1, the initially introduced fluctuation of the auxiliary field will eventually vanish. On the contrary, when γ > 1, b_t grows with the iteration, which implies that the fluctuation is amplified, leading to the instability of the fixed point. Therefore, Eq. (14) provides the critical condition for the instability with respect to the growth of b_t, i.e., γ = 1 [Eq. (15)].

VI. EQUILIBRIUM PROPERTIES VIA REPLICA TRICK
In this section, we apply the replica trick to analyze the equilibrium properties of the binary perceptron. In the thermodynamic limit, the free energy has the self-averaging property, i.e., the distribution of the free energy over different realizations of the learning task is peaked at its typical value. Thus, we can compute the disorder average −βf = ⟨ln Z⟩, where the average is carried out with respect to the i.i.d. random patterns. In fact, this disorder average is very hard to compute directly. However, by introducing n replicas of the original learning system and then taking n → 0, we can obtain the free energy of the system in a mathematically concise way [3]. Here, we are interested in the zero-temperature limit (focusing on ground states). Therefore, Equation (16) is actually the entropy counting the number of solutions of the perceptron learning problem. By introducing replicas (copies of the system), we transfer the direct, intractable treatment of the complex interactions in learning to the handling of an overlap matrix of states, which can be tackled by physics approximations, e.g., the RS ansatz, in which the overlap does not depend on the specific replica indices (permutation symmetry). An intuitive picture is that the RS ansatz is consistent with a delta-like distribution of messages on each link of the factor graph, and the broadening of this distribution (under the message perturbation) leads to the mathematical instability of the saddle point. We will come back to this point at the end of Sec. VII.
To compute ⟨Z^n⟩, we introduce n replicated synaptic weight vectors J^a (a = 1, ..., n), where we have introduced the state overlap q_{ab} = (1/N) Σ_i J^a_i J^b_i and its associated conjugate counterpart q̂_{ab}. The expressions of G_0({q_{ab}}) (energy term) and G_1({q̂_{ab}}) (entropy term) are given in [11,31]. Plugging Eq. (17) into Eq. (16), we get the entropy [Eq. (19)]. Under the RS ansatz q_{ab} = q and q̂_{ab} = q̂ for a ≠ b, the extremization of Eq. (19) gives rise to the saddle-point equations [Eq. (20)]. These saddle-point equations are again identical to Eq. (9) derived from the BP equations.

VII. INSTABILITY OF THE REPLICA SYMMETRIC SOLUTION
The stability of the RS solution requires that the eigenvalues of the Hessian matrix (the matrix of second derivatives) of F_max be negative. The signs of the eigenvalues of this matrix evaluated at the RS solution carry all the information about the stability [36]. We first introduce η_{ab} and ǫ_{ab} as the fluctuations around the RS solution [Eq. (21)]. By taking the Taylor expansion, we obtain (1/2)∆ as the second-order term of F_max [Eq. (22)], where the prefactor α in the first term is the loading rate, the superscripts of the order parameters indicate the replica indices, G_0 = G_0({q_{αβ}}), and G_1 = G_1({q̂_{αβ}}). In other words, the Hessian matrix takes a block form combining the second derivatives of αG_0 and G_1, with the identity matrix I entering through the coupling between q_{ab} and q̂_{ab}.
Following Gardner's analysis [23], we first consider the problem of diagonalizing the matrices H_0 and H_1 separately. We then use the symmetry structure with respect to permutations of the replica indices. The associated eigenvectors can be divided into three types (see details in Appendices C and D). The first type is symmetric under all indices. The second type is symmetric under all but one specific index, and the third type is symmetric under all but two specific indices. In the limit n → 0, the second type of eigenvectors coincides with the first type. The first type of eigenvectors defines the longitudinal fluctuations within the RS subspace [23,34]. This stability is already guaranteed by optimizing the action F_max. In other words, the sufficient condition for λ_{1,2} < 0 is equivalent to the saddle-point equation [34].
Therefore, only the third type of eigenvectors leads to the instability of the RS solution.
This type of eigenvectors corresponds to the instability that can take the stationary point outside the RS subspace, capturing the transverse fluctuations. Suppose that the eigenvalues of these eigenvectors for H_0 and H_1 are γ_1 and γ_2, respectively. The related eigenvalues that cause the instability of the RS solution are given by the two eigenvalues of the matrix in Eq. (23). The sign of the determinant determines the stability of the RS solution, i.e., the RS solution is stable only when αγ_1γ_2 < 1. When α → 0, the determinant of this matrix is given by αγ_1γ_2 − 1 = −1, which means that the product of the two eigenvalues is negative. Therefore, in this limit, the RS solution is correct, as expected. When α increases above a critical value, the sign of the determinant changes, which means that one of the eigenvalues changes its sign, thereby breaking the stability of the RS solution.
According to the calculation details in Appendix C, we obtain γ_1 and γ_2 [Eqs. (24) and (25)], where Z = √(q/(1 − q)) z. Therefore, the critical condition for the transition to replica symmetry breaking is specified by Eq. (26). We thus conclude that Eq. (26) is identical to Eq. (15), which establishes the equivalence between the algorithmic instability and the transition to replica symmetry breaking in perceptron learning systems. Replica symmetry breaking captures a hierarchical organization of the replicas. In physics, this corresponds to the decomposition of the Gibbs measure into (exponentially or sub-exponentially) many pure states [35].
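Putting the pieces together, the critical condition can be checked numerically. The self-contained sketch below combines the RS fixed point with a stability parameter γ (stability iff γ < 1); the explicit Gaussian-integral form of γ is our reconstruction of the standard replicon/BP-instability result, and all names are ours:

```python
import math
import numpy as np

# Gauss-Hermite quadrature: ∫ Dz f(z) ≈ Σ_k wgt_k f(z_k)
_nodes, _weights = np.polynomial.hermite.hermgauss(40)
_z = math.sqrt(2.0) * _nodes
_wgt = _weights / math.sqrt(math.pi)

def _ratio(Z):
    """G(Z)/H(Z), using the large-Z asymptotic G/H ~ Z + 1/Z to avoid 0/0."""
    out = np.empty_like(Z)
    small = np.abs(Z) < 8.0
    g = np.exp(-0.5 * Z[small] ** 2) / math.sqrt(2.0 * math.pi)
    h = 0.5 * np.array([math.erfc(x / math.sqrt(2.0)) for x in Z[small]])
    out[small] = g / h
    big = ~small
    out[big] = np.where(Z[big] > 0, Z[big] + 1.0 / Z[big], 0.0)
    return out

def rs_fixed_point(alpha, iters=2000, damping=0.5):
    """Damped iteration of the RS saddle-point equations for (q, qhat)."""
    q, qhat = 0.1, 0.1
    for _ in range(iters):
        Z = math.sqrt(q / (1.0 - q)) * _z
        qhat_new = alpha / (1.0 - q) * float(np.sum(_wgt * _ratio(Z) ** 2))
        q_new = float(np.sum(_wgt * np.tanh(math.sqrt(qhat_new) * _z) ** 2))
        q = damping * q + (1.0 - damping) * q_new
        qhat = damping * qhat + (1.0 - damping) * qhat_new
    return q, qhat

def replicon_gamma(alpha):
    """Stability parameter gamma (our reconstruction):
    gamma = alpha/(1-q)^2 * ∫Dz [r(Z)(Z + r(Z))]^2 * ∫Dz (1 - tanh^2(sqrt(qhat) z))^2,
    with r = G/H and Z = sqrt(q/(1-q)) z; the RS fixed point is stable iff gamma < 1."""
    q, qhat = rs_fixed_point(alpha)
    Z = math.sqrt(q / (1.0 - q)) * _z
    r = _ratio(Z)
    i1 = float(np.sum(_wgt * (r * (Z + r)) ** 2)) / (1.0 - q) ** 2
    i2 = float(np.sum(_wgt * (1.0 - np.tanh(math.sqrt(qhat) * _z) ** 2) ** 2))
    return alpha * i1 * i2
```

Scanning γ(α) for increasing α and locating γ = 1 gives a numerical estimate of the instability threshold; at small α, γ is far below one, consistent with the determinant argument above.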
We finally carry out a numerical simulation to check whether the theoretical instability coincides with that obtained by running BP on specific instances. As shown in Fig. 2, the theoretical prediction, namely the Almeida-Thouless (AT) [36] loading rate α_AT, matches well the numerical estimation. The theoretical prediction is computed by solving Eq. (9) and Eq. (15). In the simulations, we estimate the convergence proportion as the fraction of instances for which the BP iteration converges within a prescribed criterion (e.g., all updated messages lie within a small deviation from their values at the previous iteration).
It is expected from the plot that, in the thermodynamic limit, the BP iteration does not converge beyond α_AT with probability tending to one.
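The convergence experiment can be sketched as follows. This is a minimal re-implementation under the same Gaussian message approximation as in Sec. III (the vectorized update form and all names are our assumptions); averaging the returned flag over many seeds estimates the convergence proportion at a given α:

```python
import math
import numpy as np

def Hfun(x):
    """Gaussian tail H(x), floored so the logarithms stay finite."""
    e = 0.5 * np.vectorize(math.erfc)(x / math.sqrt(2.0))
    return np.clip(e, 1e-300, 1.0)

def bp_converged(N, alpha, seed, max_iter=500, tol=1e-5):
    """Run BP on one random instance (gauge sigma0 = +1); True if it converges."""
    rng = np.random.default_rng(seed)
    P = int(alpha * N)
    xi = rng.choice([-1, 1], size=(P, N)).astype(float)
    sqrtN = math.sqrt(N)
    m = np.zeros((P, N))  # m[mu, i] stores the cavity magnetization m_{i->mu}
    for _ in range(max_iter):
        W = np.sum(xi * m, axis=1, keepdims=True)           # full means, (P, 1)
        S = np.sum(1.0 - m**2, axis=1, keepdims=True) / N   # full variances, (P, 1)
        w_cav = (W - xi * m) / sqrtN
        s_cav = np.clip(S - (1.0 - m**2) / N, 1e-12, None)
        sq = np.sqrt(s_cav)
        # cavity biases as log-likelihood ratios for J_i = +1 vs -1
        u = 0.5 * (np.log(Hfun(-(w_cav + xi / sqrtN) / sq))
                   - np.log(Hfun(-(w_cav - xi / sqrtN) / sq)))
        h = np.sum(u, axis=0, keepdims=True)                # local fields, (1, N)
        m_new = np.tanh(h - u)                              # m_{i->mu}
        if np.max(np.abs(m_new - m)) < tol:
            return True
        m = m_new
    return False
```

For loading rates well below the instability threshold, the iteration converges rapidly; near and beyond the threshold, the returned flag is increasingly often False, reproducing the qualitative behavior of Fig. 2.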

VIII. CONCLUSION
In this work, from a physics perspective, we prove that the stability of the learning algorithm, derived using the physically intuitive cavity method, is connected to the stability of the replica symmetric saddle-point solution of the model. The equivalence between the physically intuitive cavity method and the mathematically concise replica method has also been explored in spin interaction systems [33], information transmission systems [37], linear estimation problems such as compressed sensing [38][39][40], and the spectral estimation of sparse random matrices [41]. Our proof adds further evidence of this equivalence in perceptron learning systems, by establishing rigorously (in the thermodynamic limit) the one-to-one correspondence between the BP instability and the AT instability of the equilibrium saddle point.
Our framework shows that the cumbersome replica analysis can be avoided in studying learning systems, e.g., for the stability analysis considered in this work. Therefore, this work will hopefully inspire further studies on landscape analysis [12,28], unsupervised learning [7,42], and even deep learning, e.g., the currently hot topic of learning in overparameterized neural networks [14].
Appendix A: Derivation of the updates of F_t and b_t
Carrying out the Gaussian averages term by term, and finally applying the statistically invariant change of variable z → −z, yields the updates of F_t [Eq. (9)] and b_t [Eq. (14)] quoted in the main text.

Appendix B: Derivation of saddle point equations
In this appendix, we show explicitly how the replica computation is carried out. Applying the RS ansatz q_{ab} = q and q̂_{ab} = q̂ for a ≠ b to the energy term, we obtain a single-Gaussian-integral expression, where we have rescaled t → t√(1 − q) and λ → λ/√(1 − q). We then compute the entropy term, from which the entropy of the model follows. Finally, we arrive at the saddle-point equations [Eq. (20)].

Appendix C: Instability analysis of the RS solution
Considering the perturbation of the order parameters, we write the energy term G_0 in a perturbed form, where we have shifted the integral variable t_a → t_a − √q z. In addition, we define an auxiliary quantity G'_0. When n → 0 and η_{ab} → 0, G_0 and G'_0 coincide, and we can therefore replace G_0 by G'_0 in Eq. (22) in these two limits. We then obtain the second-order expansion, whose coefficients involve the moments of Eq. (C4). At the RS saddle point, Eq. (C4) takes three possible values, depending on how many of the replica indices α, β, γ, δ coincide (with all four distinct in the last case). We then compute the relevant moment terms as follows.
Finally, we obtain γ_1 [Eq. (C11)]. To compute γ_2, we first define an auxiliary average involving a function f(J), from which γ_2 follows.

FIG. 1: Sketch of a binary perceptron. The binary perceptron is composed of N input units (light gray circles) connected to one output unit (dark gray circle). Each input unit receives one pixel of the input pattern. The output unit computes a weighted sum (indicated by Σ), which is passed through a non-linear sign function (indicated by sgn) to carry out a binary classification. The task of the binary perceptron is to find a set of weights that match the actual classifications ({σ^µ}) of the given patterns with their labels ({σ^µ_0}).

FIG. 2: The convergence proportion of the BP algorithm versus the loading rate. The red dashed line marks α_AT, computed from the stability condition of the RS solution. The three curves for different network sizes intersect at a point that coincides with α_AT. For each data point on the curves, we simulate M instances of the binary perceptron: M = 2000 for N = 500, M = 1000 for N = 1000, and M = 500 for N = 2000.