Spatially heterogeneous learning by a deep student machine

Deep neural networks (DNN) with a huge number of adjustable parameters remain largely black boxes. To shed light on the hidden layers of DNNs, we study supervised learning by a DNN of width $N$ and depth $L$ consisting of $NL$ perceptrons with $c$ inputs by a statistical mechanics approach called the teacher-student setting. We consider an ensemble of student machines that exactly reproduce $M$ sets of $N$-dimensional input/output relations provided by a teacher machine. We show that the problem becomes exactly solvable in what we call the 'dense limit': $N \gg c \gg 1$ and $M \gg 1$ with fixed $\alpha=M/c$, using the replica method developed in (H. Yoshino, (2020)). We also study the model numerically by performing simple greedy MC simulations. The simulations reveal that learning by the DNN is quite heterogeneous in the network space: the configurations of the teacher and student machines are more correlated within the layers closer to the input/output boundaries, while the central region remains much less correlated due to the over-parametrization, in qualitative agreement with the theoretical prediction. We evaluate the generalization error of the DNN with various depths $L$ both theoretically and numerically. Remarkably, both the theory and the simulations suggest that the generalization ability of the student machines, which are only weakly correlated with the teacher in the center, does not vanish even in the deep limit $L \gg 1$ where the system becomes heavily over-parametrized. We also consider the impact of the effective dimension $D(\leq N)$ of the data by incorporating the hidden manifold model (S. Goldt et al., (2020)) into our model. The theory implies that the loop corrections to the dense limit become enhanced by either decreasing the width $N$ or decreasing the effective dimension $D$ of the data. Simulations suggest that both lead to significant improvements in the generalization ability.


I. INTRODUCTION
The mechanism of machine learning by deep neural networks (DNN) [3] remains largely unknown. One of the most puzzling points is the issue of over-parametrization: supervised learning by DNNs can work even in the regime where the number of adjustable parameters is larger than the data size by orders of magnitude. This goes sharply against the traditional wisdom of data modeling: for example, one should avoid fitting 10 data points by a fitting function with 100 adjustable parameters, which is just nonsense. However, it has been found empirically, time and again, that such over-parametrized DNNs can somehow avoid over-fitting and generalize well, i.e. they can successfully describe new data not used during training. Uncovering the reason for this peculiar phenomenon is a very interesting and challenging scientific problem [4,5]. An important point to be noted is that the effective dimension D of the data can be much smaller than the apparent dimension N of the data. It has been shown in studies of shallow networks that the generalization ability improves with increasing N/D due to a kind of self-averaging mechanism [6][7][8]. However, the generalization ability of deeper systems remains unexplained.
Statistical mechanics of neural networks has a long history that dates back to the 1980s [9][10][11]. Studies of single perceptrons [10,11] and shallow networks [6,12] have provided many useful insights, and some progress has also been made on deeper networks [13][14][15][16]. However, what is going on in the hidden layers remains largely unknown. The first attempt to uncover the black box was made in [1] by the present author based on the replica method, and it predicted an unexpected phenomenology of DNNs: spatially heterogeneous learning. Unfortunately, the theory suffered from a serious problem due to an uncontrolled approximation, and the validity of the prediction remained elusive.
To understand the mechanism of the generalization ability of deep networks, we study supervised learning by a DNN considering the so-called teacher-student setting, which is a canonical setting to study statistical inference problems [17,18] by methods of statistical mechanics. We consider a prototypical DNN of rectangular shape with width N and depth L consisting of NL perceptrons with c inputs, which defines a mapping from an N-dimensional input vector to an N-dimensional output vector (see Fig. 1). For the data, we consider M pairs of input/output vectors provided by a teacher machine, and we consider an ensemble of student machines that exactly satisfy the same input/output relations as the teacher. The phase space volume of such an ensemble is called Gardner's volume [10,11], which should be very large for over-parametrized DNNs. In fact, it is known that gradient descent dynamics find such a machine without going over barriers in the loss landscape [19][20][21]. In Fig. 2 we show a schematic picture of the phase space of the machines. If M is small, the students will typically not find the teacher. This situation would be regarded as a liquid phase. If M is increased, a crystalline phase may emerge in which the students find the (hidden) crystal, i.e. the teacher.

FIG. 1 (caption fragment): the M patterns on each neuron are collected into a vector (S^1_i, S^2_i, . . ., S^M_i), with each component S^µ_i = ±1 representing the state of the 'neuron' in the µ-th pattern.

FIG. 2. Schematic picture of the phase space of machines: the gray box represents the set of all machines which can be generated by varying the parameters (e.g. synaptic weights) for a given network structure. The yellow region represents the subspace in which the machines agree with the teacher's machine on a given set of M training data. Liquid phase: if the number of training data M is small, the subspace is so large that the machines are typically widely separated and their mutual overlap Q is typically zero. Crystalline phase: if M is large enough, the machines have finite overlap Q with respect to each other.

We also wish to consider the impact of the effective dimension D of the data by incorporating the hidden manifold model [2,12,22] into our model. Using methods of statistical mechanics, we investigate how different machines which satisfy the same set of input/output boundary conditions become correlated with each other in the hidden layers, and we evaluate their generalization ability: the ability of the students to reproduce the teacher's output for new input data not used in training.
The first attempt to tackle the statistical mechanics problem of DNNs was made recently in [1] based on the replica method in a high-dimensional limit c = N = D ≫ 1 and M ≫ 1 with fixed α = M/c. Unfortunately, it suffered from an uncontrolled 'tree' approximation which is invalid in the globally coupled case c = N. In the present paper we show that the problem can be overcome in the limit N ≫ c ≫ 1, which we call the dense limit. As a result, we establish an exactly solvable statistical mechanics model of DNNs, which has been long awaited.
Using the exact solution of the model, which can be obtained by the replica method developed in [1], we analyze the key question: the generalization ability of DNNs including the over-parametrized regime. We also show that the finiteness of the width (apparent dimension of the data) N and of the effective dimension D similarly enhance loop corrections, which induce correlations between distant layers. In parallel to the theoretical study, we perform extensive numerical simulations to examine the theoretical predictions. We use the Monte Carlo method, which allows more efficient exploration of the solution space than the usual gradient descent algorithms.
The following sections are organized as follows. In sec. II we summarize the main results of this paper. In sec. III we introduce our model. We discuss the replica approach in sec. IV and numerical simulations in sec. V. In sec. VI we conclude this paper with perspectives. In appendix A we discuss a connection between some layered spin-glass models and DNNs, and in appendix B we present some details of the replica theory.

II. SUMMARY OF RESULTS
Let us summarize below the main results of this work. On the theoretical side, we find the following.
• We establish an exactly solvable statistical mechanics model of DNN in the dense limit N ≫ c ≫ 1. The exact solution of the model is obtained using the replica approach. It is shown that the correction to the dense limit due to the finiteness of the width N can be expressed by loop corrections. Fortunately, it turns out that the theoretical results presented in [1] are essentially valid in the dense limit N ≫ c ≫ 1, although they are unjustified for the global coupling c = N assumed there.
• We show that the smallness of the effective dimension D(< N) of the hidden manifold model [2] enhances the loop corrections. Thus the effect of finite dimension D is predicted to be similar to that of finite width N.
• The learning curve ε = ε_L(α) of the DNN with various depths L is analyzed by evaluating the generalization error ε_L(α) in the Bayes-optimal teacher-student setting where the replica symmetry holds. It becomes independent of the depth L, i.e. ε_L(α) = ε_∞(α), as long as the network is deep enough that a liquid phase, where the students are de-correlated from the teacher, remains in the center, reflecting the strong over-parametrization.
On the numerical side we find the following.
• We simulated the model with finite connectivity c and width N in the Bayes-optimal teacher-student setting, and found that a simple greedy Monte Carlo algorithm allows the student machines to equilibrate after sufficiently long times. Thus typical equilibrium states are accessible starting from typical random initial configurations, without going over barriers in the loss landscape.
• Observation of the overlap between the machines reveals spatially inhomogeneous learning, in qualitative agreement with the theory. While the theory in the dense limit N ≫ c ≫ 1 predicts, in the case of strong over-parametrization, crystalline regions with finite overlap close to the input/output layers separated by a liquid region with zero overlap in the center, the distinction between the crystalline and liquid phases becomes blurred in systems with finite width N and finite connectivity c. Nonetheless, the presence of the liquid-like region in the center becomes clearer upon making the width N and the connectivity c large, or the depth L large. We consider that the remnant overlap left in the center by the finite width N and finite connectivity c effects plays the role of a symmetry-breaking field which connects the two crystalline regions attached to the boundaries.
• Observation of the learning curve ε = ε_{L,c,N,D}(α) reveals that it becomes independent of the depth L in deep enough systems, in agreement with the theoretical prediction.
• The observations reveal that the finite effective dimension D effect and the finite width N effect are indeed very similar, as suggested by the consideration of the loop effects in the theory. The generalization error ε_{L,c,N,D}(α) decreases significantly upon decreasing either the width N or the effective dimension D.

III. MODEL

A. Multi-layer perceptron network
We consider a simple multi-layer neural network of rectangular shape with width N and depth L (see Fig. 1). The input and output layers are located at the boundaries l = 0 and l = L respectively, while l = 1, 2, . . ., L − 1 are hidden layers. On each layer l = 0, 1, 2, . . ., L there are N neurons labeled as (l, i) with i = 1, 2, . . ., N. The state of the neuron (l, i) is represented by an Ising spin S_{l,i}: it is active if S_{l,i} = 1 and inactive if S_{l,i} = −1.
The network is constructed as follows. There are NL perceptrons in total. Consider a perceptron ■ = (l, i), which is the i-th neuron in the l-th layer. It receives c inputs from the outputs of the perceptrons ■(k) (k = 1, 2, . . ., c) in the previous (l − 1)-th layer, weighted by J_■ = (J_1, J_2, . . ., J_c). (For the special case l = 1, ■(k) should be understood as one of the spins in the input layer.) The c perceptrons are selected randomly out of the N possible perceptrons in the (l − 1)-th layer.
The output of the perceptron ■, which we denote as S_■, is given by

S_■ = sgn( (1/√c) Σ_{k=1}^{c} J_k S_{■(k)} ),    (1)

where sgn(y) = y/|y| is our choice for the activation function. We assume that the synaptic weights J_k take real values normalized such that

Σ_{k=1}^{c} (J_k)^2 = c.    (2)

For convenience, we call the states of the neurons S_{l,i} 'spins' and the synaptic weights J_k 'bonds' in the present paper. We denote the set of perceptrons in the l-th layer as ■ ∈ l, and denote the set of perceptrons whose outputs become inputs for ■ as ∂■, i.e. ∂■ = {■(1), ■(2), . . ., ■(c)}. For convenience we also introduce ■ ∈ 0 so that we can write the set of spins in the input layer as S_{■∈0}.
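To make the construction concrete, the following is a minimal sketch of the rectangular network and its feed-forward sgn propagation. All names are illustrative; Gaussian weights rescaled to satisfy Eq. (2) and with-replacement sampling of the c connections are simplifying assumptions not fixed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_network(N, L, c, rng):
    # each perceptron stores the indices of its c inputs in the previous layer
    conn = [rng.integers(0, N, size=(N, c)) for _ in range(L)]
    J = []
    for _ in range(L):
        w = rng.standard_normal((N, c))
        # normalize each perceptron's weights so that sum_k (J_k)^2 = c (Eq. (2))
        w *= np.sqrt(c) / np.linalg.norm(w, axis=1, keepdims=True)
        J.append(w)
    return conn, J

def forward(S0, conn, J):
    """Propagate Ising inputs (+/-1) through all layers with sgn activation."""
    S = S0
    for cl, Jl in zip(conn, J):
        S = np.where((Jl * S[cl]).sum(axis=1) >= 0, 1, -1)
    return S

N, L, c = 16, 4, 8
conn, J = make_network(N, L, c, rng)
S0 = rng.choice([-1, 1], size=N)
SL = forward(S0, conn, J)  # N-dimensional Ising output
```

Note that the 1/√c factor of Eq. (1) is omitted inside the sign, since it does not affect the output.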

B. Dense coupling
As stated above, the c legs of a perceptron in the l-th layer are connected to c neurons S_{■(k)} (k = 1, 2, . . ., c) in the previous (l − 1)-th layer. The c neurons are selected randomly out of the N possible neurons. Thus our graph becomes a sort of sparse (layered) random graph when c is finite. We will find in sec. IV that this construction enables us to obtain an exactly solvable statistical mechanics model of DNNs, for the following reasons.
• The graph becomes locally tree-like, as in the case of Bethe lattices, so that the contributions of 'loops' can be neglected in the wide limit N → ∞ with fixed c. This can be seen as follows. For instance, consider a loop ■_0 → ■_1 → ■_2 → ■_3 → ■_0 as shown in Fig. 3. Starting from ■_0, choose any ■_1 connected to ■_0. Then choose any ■_2 connected to ■_1. Then choose any ■_3 (different from ■_1) connected to ■_0. In the case of global coupling c = N, ■_2 is certainly connected to ■_3, completing the loop. However, in the case of dense coupling, in a given realization of the random graph, ■_2 is connected to ■_3 only with probability ∼ c/N. Thus in the limit N → ∞ with fixed c, the probability to complete the loop vanishes. This argument generalizes to 2-loops, 3-loops, . . ., which occur with probabilities O((c/N)^2), O((c/N)^3), . . .. Note that the loops cannot be neglected in the case of global coupling c = N (assumed in [1]).

• In the case of global coupling c = N, the system is symmetric under permutations of the perceptrons within each layer, so that one has to consider whether this symmetry becomes broken spontaneously [23]. In the case of sparse coupling c < N, we can eliminate this symmetry by choosing the connections in a stochastic way, i.e. via a random graph.
• In the setup of our theory, we finally take c → ∞ (and M = αc → ∞, see Eq. (5)) after N → ∞. This greatly simplifies the theory, as it allows us to use the saddle point method in the theoretical analysis. We call such an intermediately dense coupling, with

N ≫ c ≫ 1,    (3)

dense coupling.
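The probability estimate in the loop argument above is easy to check numerically. The following sketch (illustrative parameters) repeatedly samples the c random predecessors of a node and confirms that the chance that a particular extra edge needed to close a loop is present concentrates around c/N.

```python
import numpy as np

rng = np.random.default_rng(1)
N, c, trials = 200, 10, 20000

# For a fixed candidate node (say index 0, playing the role of ■3), check how
# often it appears among the c randomly chosen predecessors of ■2.
hits = 0
for _ in range(trials):
    predecessors = rng.choice(N, size=c, replace=False)
    hits += int(0 in predecessors)

p_close = hits / trials  # concentrates around c/N = 0.05
```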

C. Connection to spinglasses
The feed-forward network made of perceptrons is equivalent to the zero-temperature limit of the transfer matrix of a spin-glass with Hamiltonian

H = − Σ_■ S_■ (1/√c) Σ_{k=1}^{c} J_k S_{■(k)},

as shown in appendix A. This is a spin-glass model with a layered structure. Specifying the spin configuration on the boundary l = 0, the spin configurations at layers l = 1, 2, . . ., L become specified deterministically in the T → 0 limit of the transfer matrix. The perceptrons Eq. (1) perform precisely this operation. An important point is that there are no direct interactions within each layer, much as in restricted Boltzmann machines (RBMs) [24], which makes the operations of the T = 0 transfer matrices equivalent to the simple feed-forward non-linear mappings Eq. (1). In appendix A we also show that such representations are possible for generic activation functions, including the function sgn(y) which we employ in this paper as a special case.
For a given set of interactions J_k, the ground state of the system is unique if the boundaries are allowed to relax. But here we consider ground states with different realizations of frozen boundaries. Specifying the boundary condition on one side, the configuration on the other side becomes fixed deterministically.
From this viewpoint, the exponential expressibility of DNNs [25] can be traced back to the chaotic sensitivity of spin-glass ground states [26,27]. Even if a change of the configuration on the boundary, S^µ_{■∈0} → S^{ν(≠µ)}_{■∈0}, is small, the resultant changes of the spin configurations become larger going deeper into the system l = 1, 2, . . .. This can be viewed as an avalanche process. In deeper layers l = 1, 2, . . ., a larger number of nodes i = 1, 2, . . . will be involved in a single avalanche event. In sec. V B 4 we discuss a quantity which reflects the avalanche sizes, i.e. the number of nodes involved in the same avalanche caused by S^µ_{■∈0} → S^{ν(≠µ)}_{■∈0}.
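The avalanche picture can be illustrated by a simple damage-spreading experiment, sketched below with illustrative parameters and random (untrained) weights: flip a single input spin and track, layer by layer, the fraction of spins that differ between the two runs.

```python
import numpy as np

rng = np.random.default_rng(2)
N, L, c = 64, 12, 16
conn = [rng.integers(0, N, size=(N, c)) for _ in range(L)]
J = [rng.standard_normal((N, c)) for _ in range(L)]

def layer_states(S, conn, J):
    """Return the spin configuration of every layer l = 1..L."""
    states = []
    for cl, Jl in zip(conn, J):
        S = np.where((Jl * S[cl]).sum(axis=1) >= 0, 1, -1)
        states.append(S)
    return states

S0 = rng.choice([-1, 1], size=N)
S0_flipped = S0.copy()
S0_flipped[0] *= -1  # perturb a single input spin

# Hamming distance (fraction of differing spins) per layer: the 'damage'
damage = [np.mean(a != b) for a, b in
          zip(layer_states(S0, conn, J), layer_states(S0_flipped, conn, J))]
```

In the chaotic regime the damage typically grows with depth, which is the avalanche effect discussed above.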

D. Teacher-Student setting
As shown in Fig. 4, we consider a learning scenario with a teacher machine and a student machine. For simplicity we assume that the teacher is a 'quenched-random teacher': its synaptic weights {(J_k)^{teacher}} are i.i.d. random variables which take continuous values subject to the normalization condition Eq. (2).
Training: we generate M sets of training data labeled µ = 1, 2, . . ., M as follows. The values of the spins in the input layer, (S^µ_{0,i})^{teacher}, are set as i.i.d. random Ising variables ±1 (i = 1, 2, . . ., N, µ = 1, 2, . . ., M), and the corresponding outputs of the teacher, (S^µ_{L,i})^{teacher}, are obtained. The student does training by adjusting its own synaptic weights {(J_k)^{student}} such that it perfectly reproduces the M sets of input-output relations of the teacher. More precisely, we consider an idealized setting in which 1) the student has exactly the same architecture as the teacher, including the specific realization of the random network between adjacent layers, and 2) the student knows exactly the M sets of input/output relations of the teacher. In short, the student knows everything about the teacher except the actual values of {(J_k)^{teacher}}. Within the framework of Bayesian inference, this is a so-called Bayes-optimal setting [18,28].
The configurations of the spins associated with the M patterns of the training data may be represented by M-component vectors S_{l,i} = (S^1_{l,i}, S^2_{l,i}, . . ., S^M_{l,i}) (see Fig. 1). In the theory, we will consider the M → ∞ limit with

α = M/c    (5)

fixed. Note that our network is parametrized by NcL variational bonds and the NM constrained spin components on the input and output boundaries. The ratio of the two scales as

NM/(NcL) = α/L.    (6)

Test (validation): the generalization ability of the student can be examined empirically using a set of test data. Preparing M sets of test data as new i.i.d. random inputs (S^µ_{0,i})^{teacher} (i = 1, 2, . . ., N, µ = 1, 2, . . ., M) (not used for the training), we compare the outputs of the teacher and student machines. The probability that the student makes an error can be measured as the fraction of output spin components on which the student disagrees with the teacher, which defines the generalization error ε (Eq. (7)). If the student is just making random guesses, ε = 1/2, while ε = 0 if it perfectly reproduces the teacher's output.
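The empirical test protocol can be sketched as follows (Python, illustrative names and parameters). Teacher and student share the same random connectivity, as in the Bayes-optimal setting; the student here is left untrained, so the measured error should sit near the random-guess value ε = 1/2.

```python
import numpy as np

rng = np.random.default_rng(3)
N, L, c, n_test = 32, 3, 8, 400
# teacher and student share the same realization of the random network
conn = [rng.integers(0, N, size=(N, c)) for _ in range(L)]

def random_weights():
    w = rng.standard_normal((L, N, c))
    # per-perceptron normalization, sum_k (J_k)^2 = c (Eq. (2))
    return w * np.sqrt(c) / np.linalg.norm(w, axis=2, keepdims=True)

def forward(S, J):
    for l in range(L):
        S = np.where((J[l] * S[conn[l]]).sum(axis=1) >= 0, 1, -1)
    return S

J_teacher = random_weights()
J_student = random_weights()  # untrained student: independent weights

errors = []
for _ in range(n_test):
    S0 = rng.choice([-1, 1], size=N)
    errors.append(np.mean(forward(S0, J_teacher) != forward(S0, J_student)))
eps = float(np.mean(errors))  # near 1/2 before any training
```

A trained student (e.g. after the greedy Monte Carlo dynamics of sec. V) would be plugged in for J_student, and eps would drop below 1/2.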

E. Gardner's volume
Following the pioneering work by Gardner [10,11], we investigate the ensemble of all possible machines (choices of the synaptic weights J_k) of the student which are perfectly compatible with the M sets of input S_0 and output data S_L provided by the teacher machine (see Fig. 2). As we noted in sec. III C, each machine with its feed-forward propagation of signals can be viewed as the zero-temperature limit of the transfer matrix of a spin-glass with a given set of J_k's. So the ensemble of machines is an ensemble of such transfer matrices, which are typically chaotic.
The phase space volume, which is called Gardner's volume, can be expressed for the present DNN as [1]

V_M = Tr Π_{µ=1}^{M} Π_■ e^{−v(r^µ_■)},    (8)

where v(r) is a hardcore potential,

e^{−v(r)} = θ(r),    (9)

with θ(r) being the Heaviside step function, and we introduced the 'gap' variable

r^µ_■ = (S^µ_■/√c) Σ_{k=1}^{c} J_k S^µ_{■(k)}.    (10)

The trace over the spin and bond configurations can be written explicitly as in Eq. (11). The key idea behind the expression Eq. (8) is the internal representation [29]: we are considering the spins (neurons) in the hidden layers (l = 1, 2, . . ., L−1) as dynamical variables in addition to the synaptic weights. This is allowed because the input-output relation of the perceptrons Eq. (1) is enforced by requiring the gap to be positive, r^µ_■ > 0, for all NL perceptrons ■ in the network and for all training data µ = 1, 2, . . ., M in Eq. (8). As shown in appendix A, the expression Eq. (8) can also be obtained from the transfer matrix representation.
The main quantity of our interest in the present paper is the generalization error Eq. (7). Gardner's volume V_M provides a way to estimate the generalization ability of the network on test data [30,31]. The probability that a network which perfectly satisfies the constraints put by the M sets of training data happens to be compatible with one more unseen datum is given by the ratio V_{M+1}/V_M. Then the generalization error, namely the error probability ε that the configuration of one spin in the output layer l = L of the student machine is wrong (different from the teacher) for a test datum, can be expressed as

ε = 1 − V_{M+1}/V_M.    (12)

F. Symmetries

Let us note here that there are some symmetries (besides the replica symmetry which we discuss later) in the present problem. The following becomes important, especially in numerical simulations. The system is invariant under the gauge transformation specified by gauge variables σ_■ = ±1,

S^µ_■ → σ_■ S^µ_■ (∀µ),    J_k → σ_■ σ_{■(k)} J_k (∀k),    (13)

for any perceptron ■ in the hidden layers. Note that we do not have a gauge transformation in the output layer l = L since the output layer is constrained. It can easily be seen that the gap variables r^µ_■ (see Eq. (10)) are invariant under the gauge transformation. Thus for a given realization of a machine with a set of synaptic weights, there are 2^{(L−1)N} completely equivalent machines specified by the 2^{(L−1)N} possible realizations of the gauge variables: all of them operate in exactly the same way, yielding the same output for any input. If the synaptic weights only take Ising values J_k = ±1, the number of possible configurations of the machines modulo the gauge symmetry is 2^{NLc}/2^{N(L−1)}.
The presence of the gauge invariance is natural given the connection to spin-glasses mentioned in sec. III C. While the gauge variables are frozen in spin-glass problems with quenched bonds [32], here the bonds are dynamical variables, so that the gauge variables also evolve in time during learning.
Importantly, this is a local symmetry in the sense that a change of any σ_■ induces changes only in the neighborhood of ■ (see Fig. 5): S^µ_■ → −S^µ_■ for ∀µ, J_k → −J_k for ∀k, and J_{■'l} → −J_{■'l} for all (■', l) such that ■'(l) = ■. This means that in sparse systems with finite connectivity c and M (= αc), the evolution of the machine from one configuration to another connected by a local gauge transformation takes only a finite time in the dynamics. Only in the limit c → ∞ do such gauge transformations become frozen in time.
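The local gauge invariance is easy to verify numerically. This sketch (illustrative setup with continuous random weights) applies a single gauge flip σ_■ = −1 at one hidden perceptron, i.e. it flips the perceptron's incoming weights (which flips its output spin) together with every downstream weight that reads it, and checks that the network's output is unchanged.

```python
import numpy as np

rng = np.random.default_rng(4)
N, L, c = 24, 3, 6
conn = [rng.integers(0, N, size=(N, c)) for _ in range(L)]
J = [rng.standard_normal((N, c)) for _ in range(L)]

def forward(S, J):
    for l in range(L):
        S = np.where((J[l] * S[conn[l]]).sum(axis=1) >= 0, 1, -1)
    return S

S0 = rng.choice([-1, 1], size=N)
out = forward(S0, J)

# Gauge flip sigma = -1 at hidden perceptron (l=1, i=0):
Jg = [w.copy() for w in J]
Jg[0][0, :] *= -1              # incoming weights of neuron 0 in layer 1
Jg[1][conn[1] == 0] *= -1      # every downstream leg reading that neuron

out_gauged = forward(S0, Jg)   # identical to out
```

The flipped spin and the flipped downstream weights compensate in every gap variable, so the two machines operate identically.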

Permutation symmetry in globally coupled systems
As we already noted in sec. III B, in globally coupled systems with c = N the system is invariant under permutations of the perceptrons ■ ∈ l within each layer l = 1, 2, . . ., L. This symmetry can be removed if the coupling is not global, c < N, since we can then construct random networks (see sec. III B).

G. Hidden manifold
We incorporate the hidden manifold model for the data (S. Goldt et al. (2020) [2]) into our model as follows. We replace the original teacher machine of width N with a narrower teacher machine of width D(≤ N) (see Fig. 6). The teacher works entirely in the D-dimensional space, being subjected to D-dimensional input data and producing D-dimensional output. The student machines are provided with N-dimensional input/output data which are obtained from the D-dimensional input/output of the teacher via folding matrices F_{i,k} of size N × D, for i = 1, 2, . . ., N and µ = 1, 2, . . ., M. For the folding matrix, we consider a simple model in which the N elements of the data for the students are created simply by making N/D copies of the D elements of the teacher's data.
FIG. 6. Schematic picture of the hidden manifold model.
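As a concrete illustration of this simple folding, the sketch below (illustrative names) builds the N-dimensional student data by tiling N/D copies of the teacher's D-dimensional vector; this is one way to realize the copying action of F described above.

```python
import numpy as np

def fold(x_teacher, N):
    """Replicate a D-dimensional teacher vector into N dimensions (N/D copies)."""
    D = len(x_teacher)
    assert N % D == 0, "this simple folding assumes D divides N"
    return np.tile(x_teacher, N // D)

rng = np.random.default_rng(5)
D, N = 4, 16
x = rng.choice([-1, 1], size=D)   # teacher-side data, dimension D
X = fold(x, N)                    # student-side data: N/D = 4 copies of x
```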

IV. REPLICA THEORY
Now we develop and analyze a replica theory for the statistical mechanics problem of DNNs in the dense limit N ≫ c ≫ 1 introduced in sec. III B. We first show that it can be solved exactly, overcoming the issue of the uncontrolled approximation made in [1]. This is the first main result of this paper. For clarity we repeat the steps made in [1] and indicate how the problem is resolved. Then we revisit the replica symmetric solution in the teacher-student setting presented in [1] and analyze it in more detail. Using the exact solution we analyze the generalization ability of DNNs, evaluating the generalization error via Eq. (12), which is the second main result of this paper. The technical details of the theory are presented in appendix B.
A. Formalism

Order parameters
We are considering the dense coupling Eq. (3), in which 1) the perceptrons have large connectivity c ≫ 1 and 2) the permutation symmetry of the perceptrons which exists in globally coupled systems is removed. We are also considering a large number of training patterns M = αc ≫ 1 (see Eq. (5)). Then we can naturally introduce 'local' order parameters associated with each perceptron ■,

Q_{ab,■} = (1/c) Σ_{k=1}^{c} J^a_k J^b_k,    q_{ab,■} = (1/M) Σ_{µ=1}^{M} S^{µ,a}_■ S^{µ,b}_■,

measuring the overlaps of the bonds and spins of different machines (replicas) a, b. The overlaps between the teacher and the student machines are represented by Q_{0b,■} = Q_{b0,■} and q_{0b,■} = q_{b0,■} (b = 1, 2, . . ., s), while those between the student machines are represented by Q_{ab,■} = Q_{ba,■} and q_{ab,■} = q_{ba,■} (a, b = 1, 2, . . ., s).
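A minimal sketch (illustrative setup) of how the bond overlap would be measured in a simulation for two machines a, b; the spin overlap q_{ab}(l) is computed analogously from the M-component spin vectors.

```python
import numpy as np

rng = np.random.default_rng(6)
N, L, c = 16, 3, 8

def random_weights():
    w = rng.standard_normal((L, N, c))
    # per-perceptron normalization, sum_k (J_k)^2 = c (Eq. (2))
    return w * np.sqrt(c) / np.linalg.norm(w, axis=2, keepdims=True)

def Q_layer(Ja, Jb, l):
    """Bond overlap (1/c) sum_k J^a_k J^b_k, averaged over the perceptrons of layer l."""
    return float(np.mean((Ja[l] * Jb[l]).sum(axis=1) / c))

Ja, Jb = random_weights(), random_weights()

Q_self = Q_layer(Ja, Ja, 0)  # the normalization forces Q_aa = 1
Q_ab = Q_layer(Ja, Jb, 0)    # independent machines: fluctuates around 0
```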
It is important to note that the order parameters Q_{ab,■} and q_{ab,■} defined above change sign under a change of the gauge variable σ^a_■, which can be defined independently for each replica (a = 1, 2, . . ., n) (see Fig. 5). Thus they trivially vanish in thermal equilibrium in sparse systems with finite connectivity c. Only in the dense limit c → ∞ can the gauge variables σ^a_■ be considered as slow variables.
It is natural to expect that the order parameters are homogeneous within each layer, since we will take the average over realizations of the random connections between adjacent layers (see Eq. (23)). Thus we assume they only depend on the index l of the layers,

Q_{ab,■} = Q_{ab}(l),    q_{ab,■} = q_{ab}(l)    for ■ ∈ l.    (19)

Here we have included, for our convenience, the spin overlaps at the boundaries l = 0, L, where the spins of all student replicas a = 1, 2, . . ., s are forced to take the same values as the teacher a = 0,

q_{ab}(0) = q_{ab}(L) = 1.

Note also that the normalization condition for the bonds Eq. (2) and for the spins (which take Ising values ±1) implies Q_{aa}(l) = q_{aa}(l) = 1 for ∀a and ∀l.
The order parameters also vanish in thermal equilibrium in globally coupled systems with c = N due to the permutation symmetry, the second symmetry mentioned in sec. III F. This issue is removed by using the dense coupling, selecting the connections between adjacent layers randomly.

Replicated Gardner volume, Free-energy
Let us introduce the replicated Gardner's volume V^{1+s}_M, where the teacher machine is included as the 0-th replica. Here the output S_L is the output of the teacher, so that S_L = S_L(S_0, {(J_k)^{teacher}}). The main object we are interested in is the free-energy functional (the Franz-Parisi potential [33]) −F = \overline{ln V^{1+s}_M}, where the over-line denotes the average over 1) the random inputs S_0 imposed commonly on all machines, 2) the realization of the random synaptic weights {(J_k)^{teacher}} of the teacher, and 3) the realization of the random connections between adjacent layers. It turns out that the dense coupling Eq. (3), N ≫ c ≫ 1, allows us to obtain the exact expression for the replicated Gardner volume for n = 1 + s replicas in terms of the order parameters Eq. (19) (Eqs. (23) and (24)). Here s_{ent,bond}[Q(l)] and s_{ent,spin}[q(l)] are the entropic parts of the free-energy associated with the bonds and the spins respectively, and −F_{int}[q(l − 1), Q(l), q(l)] is the interaction part of the free-energy. Fortunately, the free-energy functional obtained in [1] turns out to be valid in the dense limit N ≫ c ≫ 1, although it is unjustified for the global coupling c = N assumed there. The details of the expressions and the derivation are presented in appendix B 5.
The main reasons for the success, which allow us to overcome the problems in [1], are the three points discussed in sec. III B. First, the sparseness of the network allows us to safely neglect the contribution of loops, as we explain in detail in sec. B 4. Second, the random connections between adjacent layers eliminate the permutation symmetry which exists in the globally coupled system c = N. Third, the limit c → ∞ allows us to use the standard saddle point method to evaluate thermodynamic quantities exactly.

Replica symmetric ansatz
Since our current problem is a Bayes-optimal inference problem, we can safely assume a replica symmetric (RS) solution,

Q_{ab}(l) = Q(l),    q_{ab}(l) = q(l)    (a ≠ b),

for l = 1, 2, . . ., L, and the Nishimori condition (see sec. V B 3)

Q_{0a}(l) = Q_{ab}(l),    q_{0a}(l) = q_{ab}(l),    (27)

which must hold in Bayes-optimal cases. The saddle point equations which extremize the replicated free-energy are obtained in [1]. It can be checked that the saddle point equations are consistent with the relation Eq. (27).

Generalization error
Based on the above results we can analyze the error probability Eq. (12), which is the main object of our interest in the present paper. Using the free-energy Eq. (23) and Eq. (24), we readily find it as ε = 1 − exp(−ΔF), where ΔF is the free-energy cost of accommodating one additional training pattern (cf. Eq. (12)). The explicit expressions of the free-energy needed to evaluate this quantity are given in appendix B 6.
FIG. 8. Order parameters and generalization error obtained by solving the replica symmetric saddle point equations. In the panels in the 1st and 2nd rows, the overlaps of the spins q(l) (filled symbols) and of the bonds Q(l) (open symbols) are shown. In the bottom row the generalization error is shown.

Order parameters
We numerically solved the saddle point equations Eq. (B41) to obtain the order parameters, repeating the analysis in [1] but in a wider parameter space. In Fig. 7 we show the spatial profile of the order parameters. As already shown in [1], the theory predicts spatially heterogeneous learning. It can be seen that the 'crystalline' phase with finite order parameter (where inference of the teacher's configuration is successful) grows with increasing α, starting from the input/output boundaries. This is reminiscent of wetting transitions [34][35][36]. Details of the behavior of the order parameters are displayed in Fig. 8.
The central region remains in the liquid phase with zero order parameter (where inference of the teacher's configuration is impossible) until the two crystalline phases meet in the center at a critical point α_{c1}(L). Naturally, α_{c1}(L) increases with L. For α > α_{c1}(L) the central liquid phase is absent. As long as α < α_{c1}(L), we find that the crystalline parts attached to the two opposite boundaries grow with α, but the profile of the order parameters remains independent of L. For α > α_{c1}(L), where the liquid phase is absent, the order parameters depend explicitly on the depth L. One may regard this as a "finite depth L effect". At some larger α it becomes difficult to follow the saddle point solution numerically. Presumably this implies a spinodal instability associated with a first order transition to another solution q = Q = 1 (which is a saddle point solution) at some α_{c2}(L)(> α_{c1}(L)). Such a discontinuous 'perfect recovery' behavior has been found in the case of a single perceptron with binary couplings [37]. We skip the detailed analysis of the 1st order transition and leave it for future work. Up to the discontinuous change, the evolution of the order parameters with increasing α is continuous.

Generalization errors
Now we turn to the generalization error, which is of our main interest in the present paper. It is obtained as shown in the bottom panels of Fig. 8 and Fig. 9. The relation ε vs α is often called the learning curve. Without learning, α = 0, we have ε = 1/2 because the student just makes random guesses. The learning curves ε = ε_L(α) consist of two parts, as follows.
• Let us recall that for sufficiently small α, the two crystalline phases at the boundaries remain disconnected from each other, separated by the liquid phase in the center, and that the profile of the order parameters is independent of L since the two crystalline regions do not meet, as we discussed in sec. IV B 1. In this regime the learning curve does not depend on the depth L, i.e. ε = ε_∞(α). The reason is that the contribution to Eq. (28) from the liquid region, where q(l) = Q(l) = 0, is just zero: it contributes neither positively nor negatively to ε. On the other hand, the crystalline region where q(l), Q(l) > 0 contributes negatively to Eq. (28), and this contribution is independent of L as long as the two crystalline regions do not meet. It is remarkable that ε_∞(α) < 1/2 and decreases with increasing α: the system generalizes even though the central part is in the liquid phase due to over-parametrization.
• Increasing α, the crystalline phases meet at the critical point α_c1(L) and the central liquid phase disappears.
Note again that α_c1(L) is larger for larger L. For sufficiently large α > α_c1(L), where the central liquid gap is filled up by the crystalline phase, the learning curve depends on the depth L since the order parameters now depend on L. For even larger α > α_c2(L), we speculate that ε jumps to 0 due to the 1st order transition mentioned in sec. IV B 1. The L independent behavior of the learning curve can be seen in Fig. 9 as follows. For instance, one can see that ε_L(α) of L = 10, 20 are indistinguishable for 1/α > 0.01 and those of L = 5, 10, 20 are indistinguishable for 1/α > 0.1.
In reality, DNNs have some finite width N and finite connectivity c, while in the theory we assumed an idealized situation: the dense limit N ≫ c ≫ 1 and M ≫ 1 with fixed α = M/c. It is very important to consider the effects of finite width N and finite connectivity c (and M).

Finite width N effect
The effects of finite width N can be attributed to the corrections due to geometrically closed loops in the network, which become non-negligible when the width N is finite, as we discussed in sec. III B. The simplest is the one shown in Fig. 10. Most importantly, the loops connect different layers and different nodes within the same layer, inducing correlations inside the network. Indeed, as discussed in detail in sec. B 4 in the appendix, the loops yield finite width N corrections to the interaction part of the free-energy. We also note that the symmetry concerning the exchange of input/output sides present in the saddle point solutions (see Fig. 7) is lost in the presence of such loop correction terms.

Finite hidden dimension D effect
It is interesting to discuss here the hidden manifold model [2] introduced in sec. III G. Let us recall that our original model contains no correlations within the boundaries. We can consider the effect of the correlations introduced at the input/output boundaries by the hidden manifold model in a perturbative manner around the replica symmetric saddle point solution, as follows.
Within the simplest model Eq. (17) for the folding matrix F, the same values are repeated in the input (output) data on different nodes i (= 1, 2, . . ., N). This amounts to inducing additional closed loops. For example, the unclosed loop in panel b) of Fig. 10 becomes closed if the input data at k_1 and k_2 are forced to take the same value by the simplest hidden manifold model. This means that finite width N effects become enhanced as the effective dimension D becomes smaller. This consideration implies that the finite width N effect and the finite hidden dimension D effect will be similar. Both will lead to an increase of correlations inside the network.
Let us note that the teacher and students have different architectures in the hidden manifold model, so that the inference by the students is not Bayes optimal. In such a circumstance the replica symmetry is not guaranteed. We leave the analysis beyond the perturbative one for future work.

Finite connectivity c effect
Finally, in the N → ∞ limit, we are still left with finite connectivity c effects. In our theoretical analysis we assumed c → ∞, which allowed us to perform the saddle point computations. One can naturally consider 1/c corrections by taking into account contributions from fluctuations around the saddle point, as sketched in sec. B 7 in the appendix.
Naturally, the fluctuating field around the saddle point induces correlations inside the network. Let us also note that the cubic term in the expansion breaks the symmetry concerning the exchange of input/output sides (see sec. B 7 b in the appendix).

Discussions
The corrections due to the loops (sec. IV C 1) and those due to the fluctuations around the saddle points (sec. IV C 2) can be easily separated by considering the dense limit N ≫ c ≫ 1, which is difficult in the case of global coupling c = N. Nonetheless we found that the two corrections bring qualitatively similar effects: 1) correlations inside the network and 2) asymmetry with respect to the exchange of input/output sides.
We consider that the two effects, which disappear in the dense limit, are important in practice in the following respects.
• Remnant symmetry breaking field One would wonder: "how can a student machine recognize the existence of the two crystalline regions (teacher's configuration) if the two are separated by the liquid region as in Fig. 7?" For an algorithm to work in this situation, some remnant symmetry breaking field should help the student. We consider that the corrections due to the loops and the fluctuations around the saddle point play this role.
• Input-output asymmetry One would also wonder: "how can a DNN with feed-forward propagation of information have a spatial profile which is completely symmetric concerning the exchange of input/output sides as in Fig. 7?" We consider that the corrections due to the loops and the fluctuations around the saddle point are responsible for the breaking of this symmetry.

V. SIMULATION
Now let us turn to discuss Monte Carlo simulations on the same model we analyzed theoretically.We first explain the simulation method in sec.V A, introduce the observables in sec.V B and then present the results in sec.V C.
A. Method

Learning scenarios
We simulate the teacher-student scenario (see sec.III D) in the Bayes-optimal setting and the setting with the hidden manifold model (see sec.III G).
• Bayes-optimal scenario -Network: teacher and student machines have the same rectangular network of width N and depth L (see Fig. 1). The rectangular network is created as a random graph as follows. Every perceptron in a layer l is given c arms. Each of the arms is connected to a perceptron in layer l − 1 chosen randomly out of the N possible ones.
-Synaptic weights of teacher machine: the teacher's synaptic weights {(J_k)_teacher} for l = 1, 2, . . ., L and k = 1, 2, . . ., c are prepared as iid random numbers drawn from the Gaussian distribution with zero mean and unit variance. -Data: M sets of training data are prepared as follows. First the input data for the teacher are prepared as iid random numbers (S_teacher)^µ_{0,i} = ±1 for i = 1, 2, . . ., N and µ = 1, 2, . . ., M. Then the outputs (S_teacher)^µ_{L,i} for i = 1, 2, . . ., N and µ = 1, 2, . . ., M are obtained by the feed-forward propagation of the signal using Eq. (1). These outputs are used as the target outputs (S*)^µ_{L,i} to train the student machines (see below), i.e. (S*)^µ_{L,i} = (S_teacher)^µ_{L,i}. Another M′ sets of data for the test (validation) are created in the same way.
• Hidden manifold scenario [2] -Network: the networks of the teacher and student machines are rectangular, random regular networks as in the Bayes optimal scenario, but the teacher machine is narrower than the student machine, i.e. D < N (see Fig. 6).
-Synaptic weights of teacher machine: the teacher's synaptic weights are prepared in the same manner as in the case of the Bayes optimal scenario.
-Data: M sets of data for training and another M′ sets of data for the test (validation) are created as follows. Pairs of input/output data of the teacher machine are created just as in the Bayes optimal scenario but with D replacing N. Then the N dimensional inputs (S_student)_{0,i} for i = 1, 2, . . ., N to be given to the student machines are created using the simple folding matrix F given by Eq. (17). Similarly, the N dimensional target outputs (S*)_{L,i} for i = 1, 2, . . ., N for the student machines are created using the same folding matrix F.
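The data-generation recipe above can be sketched in code. This is a minimal sketch under stated assumptions: we assume Eq. (1) is a sign activation acting on the local field normalized by 1/√c, and we read the simplest folding matrix F of Eq. (17) as repeating the D hidden components until they fill the N visible ones; the function names (`make_network`, `propagate`, `fold`) are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_network(N, L, c, rng):
    """Random wiring: each perceptron in layer l draws its c arms from
    the N nodes of layer l-1 (here sampled without replacement)."""
    return np.stack([np.stack([rng.choice(N, size=c, replace=False)
                               for _ in range(N)]) for _ in range(L)])

def propagate(S, arms, J, c):
    """Feed-forward pass; Eq. (1) is assumed to be a sign activation
    on the local field normalized by 1/sqrt(c)."""
    for arms_l, J_l in zip(arms, J):
        S = np.sign(np.sum(J_l * S[arms_l], axis=1) / np.sqrt(c))
    return S

def fold(S_D, N):
    """Simplest folding matrix F (our reading of Eq. (17)): the D
    components of the hidden pattern are repeated to fill N slots."""
    return np.tile(S_D, N // len(S_D) + 1)[:N]

# Bayes-optimal scenario: teacher and student share the width N
N, L, c, M = 16, 4, 4, 10
arms = make_network(N, L, c, rng)
J_teacher = rng.standard_normal((L, N, c))       # iid N(0,1) weights
inputs = rng.choice([-1.0, 1.0], size=(M, N))    # random +/-1 inputs
targets = np.stack([propagate(x, arms, J_teacher, c) for x in inputs])

# Hidden manifold scenario: a D-dimensional pattern folded up to N
D = 4
hidden = rng.choice([-1.0, 1.0], size=D)
student_input = fold(hidden, N)                  # repeated pattern, len N
```

Note that `targets` plays the role of (S*)^µ_{L,i}: it is obtained purely by propagating the teacher's inputs, with no labels from elsewhere.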
To train the student machines we use a simple zero-temperature or greedy Monte Carlo algorithm. We introduce the loss function E defined in Eq. (29), where (S*)^µ_{L,i} is the target output data defined above. Note that the loss function takes discrete values. In particular, we are interested in the ensemble of student machines in the E = 0 space, whose phase space volume is nothing but Gardner's volume.
Starting from a set of initial synaptic weights, the student machines are updated as the following.
1. Select a perceptron randomly out of the N possible ones and select a link k randomly out of the c possible ones, k = 1, 2, . . ., c. Then propose a new synaptic weight (J_k)^new_student via Eq. (30), where δ is a parameter and x is an iid random number drawn from the Gaussian distribution with zero mean and unit variance. Note that (J_k)^new_student is normalized such that its variance remains 1.
2. Accept the proposal if the resultant loss function does not increase. Otherwise reject it and go back to 1. Importantly, we accept updates by which the loss function remains unchanged. This is crucial to allow exploration of the E = 0 (SAT) space.
Within one Monte Carlo step (MCS) we repeat the above procedure N c times. We simulate learning by two student machines '1' and '2' which are subjected to the same training data but evolve independently from each other, using statistically independent random numbers for steps 1. and 2. explained above.
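A minimal sketch of one greedy move follows, again assuming the sign-activation network of Eq. (1) and a discrete loss that counts mismatched output spins; the precise forms of Eqs. (29) and (30) are not reproduced here, so the proposal (perturb one weight, then renormalize the perceptron's weight vector to unit variance) is our reading of the text.

```python
import numpy as np

rng = np.random.default_rng(1)

def propagate(S, arms, J, c):
    # assumed sign-activation feed-forward pass (Eq. (1))
    for arms_l, J_l in zip(arms, J):
        S = np.sign(np.sum(J_l * S[arms_l], axis=1) / np.sqrt(c))
    return S

def loss(J, arms, X, Y, c):
    """Discrete loss: number of output spins disagreeing with targets."""
    return sum(int(np.sum(propagate(x, arms, J, c) != y))
               for x, y in zip(X, Y))

def greedy_move(J, arms, X, Y, c, delta, rng):
    """Pick one perceptron and one link, propose a perturbed weight,
    renormalize the weight vector (variance kept at 1), and accept iff
    the loss does not increase.  Equal-loss moves are accepted, which
    is what lets the dynamics explore the E = 0 (SAT) space."""
    L, N = J.shape[0], J.shape[1]
    l, i, k = rng.integers(L), rng.integers(N), rng.integers(c)
    J_new = J.copy()
    J_new[l, i, k] += delta * rng.standard_normal()
    J_new[l, i] /= np.sqrt(np.mean(J_new[l, i] ** 2))
    if loss(J_new, arms, X, Y, c) <= loss(J, arms, X, Y, c):
        return J_new
    return J

# tiny demo: the loss is non-increasing along the greedy trajectory
N, L, c, M = 10, 3, 3, 4
arms = rng.integers(0, N, size=(L, N, c))
J_t = rng.standard_normal((L, N, c))
X = rng.choice([-1.0, 1.0], size=(M, N))
Y = np.stack([propagate(x, arms, J_t, c) for x in X])
J_s = rng.standard_normal((L, N, c))
E = [loss(J_s, arms, X, Y, c)]
for _ in range(200):
    J_s = greedy_move(J_s, arms, X, Y, c, delta=0.1, rng=rng)
    E.append(loss(J_s, arms, X, Y, c))
```

One MCS in the paper's sense corresponds to N c such elementary moves.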

Learning and unlearning
For the training, we consider the following two protocols. • Learning: the initial synaptic weights of the student machines {(J_k)_student} are prepared as iid Gaussian random numbers, totally uncorrelated with the teacher's weights {(J_k)_teacher}.
To facilitate the training, we perform a sort of 'annealing'. At a given time t (MCS), we perform the greedy Monte Carlo update using a subset of the training data of size M_batch(t) (< M). Starting from M_batch(0) = 1, we increase M_batch(t) logarithmically in time t, progressively adding more data to the training data-set such that M_batch(t_max) = M at the end of the simulation at t_max (MCS).
• Unlearning (or planting): the initial synaptic weights of the student machines {(J_k)_student} are set to be exactly the same as the teacher's weights {(J_k)_teacher}. The student machine then explores the E = 0 (SAT) space.
If the greedy Monte Carlo method equilibrates the system, the two protocols should yield the same results for macroscopic observables, which we explain below, after averaging over time and/or initial configurations in the stationary state.
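The logarithmic batch-growth schedule used in the learning protocol can be sketched as follows; the boundary values M_batch(0) = 1 and M_batch(t_max) = M come from the text, while the exact interpolation in between is our assumption.

```python
import numpy as np

def batch_size(t, t_max, M):
    """M_batch(t): grows logarithmically in t from 1 at t = 0 up to the
    full data-set size M at t = t_max (assumed interpolation)."""
    if t <= 0:
        return 1
    frac = np.log1p(t) / np.log1p(t_max)   # 0 at t = 0, 1 at t = t_max
    return int(min(M, max(1, round(M ** frac))))

# the schedule adds data progressively over the run
schedule = [batch_size(t, 1000, 50) for t in range(0, 1001, 100)]
```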

Simple overlaps
We are interested in the similarity between different machines in the hidden layers l = 1, 2, . . ., L − 1. To quantify this we first introduce, between the two student machines '1', '2' and the teacher machine '0', the 'simple' overlaps defined in Eq. (32). Here µ = 1, 2, . . ., M for the training data and µ = 1, 2, . . ., M′ for the test data (and the factor 1/M is replaced by 1/M′ in the latter case). These are the same as the order parameters for the spins used in the replica theory (see Eq. (18)). However, as mentioned in sec. IV A 1, the expectation values of the simple overlaps defined above vanish in thermal equilibrium because of the local gauge symmetry (and the permutation symmetry in the case c = N) as discussed in sec. III F.
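As a concrete reading of these definitions (the exact normalization of Eq. (32) is assumed, since the equation itself is not reproduced here), the simple overlaps can be computed as:

```python
import numpy as np

def simple_overlaps(Sa, Sb):
    """Simple overlap between machines a and b, layer by layer.
    Sa, Sb: arrays of shape (L+1, M, N) holding the spins S^mu_{l,i}
    over the M training (or M' test) patterns.  Assumed form:
    q_ab(l) = (1/(N M)) sum_{i,mu} S^{a,mu}_{l,i} S^{b,mu}_{l,i}."""
    return np.mean(Sa * Sb, axis=(1, 2))

# demo: identical machines give overlap 1 on every layer
S = np.random.default_rng(4).choice([-1.0, 1.0], size=(5, 20, 8))
```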

Squared Overlaps
To overcome the above problem we define order parameters which we call squared overlaps, which are invariant under the symmetry operations. We first introduce the cross-correlations of Eq. (34), where a and b are indices for machines: 0 for the teacher machine, 1 and 2 for the student machines. We then introduce the squared overlaps as in Eq. (35). We note that this is analogous to the order parameter used in numerical simulations of vectorial spin-glass models, which have rotational symmetry in spin space [38].
Interestingly, these are very similar to the measure proposed in [39] called 'centered kernel alignment'.
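A sketch of the squared overlaps follows. Since Eqs. (34)-(36) are not reproduced in this chunk, the formulas below are our reading, constrained by the properties stated later in the text: a subtraction term −N/M, a value ≈ 0 for uncorrelated machines, and a value of at least ≈ 1 for the same machine modulo gauge/permutation.

```python
import numpy as np

def squared_overlap(Sa, Sb, l):
    """Gauge- and permutation-invariant squared overlap of layer l.
    Sa, Sb: shape (L+1, M, N).  Assumed reading of Eqs. (34)-(35):
    C_ij = (1/M) sum_mu S^{a,mu}_{l,i} S^{b,mu}_{l,j}, then
    q2 = (1/N) sum_{ij} C_ij**2 - N/M, where -N/M subtracts the
    baseline so that uncorrelated machines give q2 ~ 0."""
    M, N = Sa.shape[1], Sa.shape[2]
    C = Sa[l].T @ Sb[l] / M          # N x N cross-correlation matrix
    return float(np.sum(C ** 2) / N - N / M)

def normalized_squared_overlap(Sa, Sb, l):
    """Assumed form of the normalized version Eq. (36)."""
    return squared_overlap(Sa, Sb, l) / np.sqrt(
        squared_overlap(Sa, Sa, l) * squared_overlap(Sb, Sb, l))

rng = np.random.default_rng(5)
Sa = rng.choice([-1.0, 1.0], size=(3, 4000, 10))
Sb = rng.choice([-1.0, 1.0], size=(3, 4000, 10))
q2_same = squared_overlap(Sa, Sa, 1)   # close to 1 for the same machine
q2_diff = squared_overlap(Sa, Sb, 1)   # close to 0 for uncorrelated ones
```

Because both indices i, j are summed over and the spins enter squared, the quantity is insensitive to sign flips (gauge) and node relabelings (permutations) within the layer.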

Nishimori condition
Since our teacher-student scenario is a Bayes optimal inference, we have q_2(l) = r_2(l), Eq. (37). This is a Nishimori condition, which must hold in Bayes optimal inferences [18,28,32]. These relations are useful to check the equilibration of the system.

Physical meaning of the squared overlaps
Let us discuss more closely the significance of the squared overlap defined in Eq. (35) and its normalized version Eq. (36). In the following we denote the average over different realizations of the inputs as ⟨• • •⟩_input. From Eq. (34) we can rewrite the squared overlap in terms of the quantities r^{µ→ν}_{a,i}, which can be regarded as the change of the sign of the spin (neuron) at the node (l, i) of student a when the input pattern is changed from µ to ν. Using this expression we find that Eq. (35), with the subtraction term −N/M, can be viewed as a kind of correlation volume within layer l in the following sense.
• Totally uncorrelated random machines Suppose that student a and student b are totally uncorrelated (far beyond the trivial difference by the gauge transformation and the permutation), randomly generated machines. Then we naturally expect the cross-correlations to vanish on average. This means that the minimum value of the squared overlap q_{2,ab}(l) is 0.
• Same random machine modulo gauge transformation and permutation On the other hand, if the two machines are the same machine modulo the gauge transformation and permutation, the squared overlap q_{2,ab}(l) is at least 1 and can be larger.
In the case of the perceptrons with random synaptic weights and the highly non-linear activation function (see Eq. (1)), we expect the correlation ⟨r^{µ→ν}_{a,i} r^{µ→ν}_{b,j}⟩_input to become significant also between different nodes i ≠ j. This is because of the chaos effect which we discussed in sec. III C: it is known that in such a non-linear random feed-forward network a slight change of the input induces chaotic changes in the states of the spins (neurons) as the signal propagates deeper into the network [25]. This is an avalanche-like process, so that the correlation ⟨r^{µ→ν(≠µ)}_{a,i} r^{µ→ν(≠µ)}_{b,j}⟩_input for i ≠ j becomes more significant with increasing l. In this case the squared overlap q_{2,ab}(l) can be viewed as a measure of the avalanche size within layer l.

• General case
Based on the above consideration, we naturally expect that in general q_{2,ab}(l) quantifies the avalanche size and the similarity of the avalanche patterns taking place in machines a and b through changes of inputs µ → ν (≠ µ).
It then becomes clear that the normalized version Eq. (36) quantifies the similarity of the avalanche patterns in machines a and b.
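The avalanche picture can be probed directly in a random network. The sketch below assumes the sign-activation reading of Eq. (1): flipping a few input spins and measuring the fraction of changed spins per layer exhibits the chaotic, avalanche-like spreading of the perturbation with depth.

```python
import numpy as np

rng = np.random.default_rng(2)

def propagate_layers(S, arms, J, c):
    """Feed-forward pass recording every layer's spins
    (sign activation assumed for Eq. (1))."""
    out = [S]
    for arms_l, J_l in zip(arms, J):
        S = np.sign(np.sum(J_l * S[arms_l], axis=1) / np.sqrt(c))
        out.append(S)
    return out

N, L, c = 200, 10, 20
arms = rng.integers(0, N, size=(L, N, c))   # random arms (with replacement)
J = rng.standard_normal((L, N, c))

x = rng.choice([-1.0, 1.0], size=N)
y = x.copy()
y[:5] *= -1.0                               # perturb 5 of the 200 inputs

Sx = propagate_layers(x, arms, J, c)
Sy = propagate_layers(y, arms, J, c)
damage = [float(np.mean(a != b)) for a, b in zip(Sx, Sy)]
# damage[l]: fraction of spins in layer l that differ between the two
# runs; in a chaotic random network it grows as the signal goes deeper
```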
[Figure caption fragment: panels b) and c) show the simple student-student overlap q(l) and the simple teacher-student overlap r(l), respectively.]

Generalization error
To measure the generalization ability of the student machines (see sec. III D) we measure the teacher-student overlap on the output layer, using the M′ sets of outputs of the teacher and student machines for the test data (not used for training).
The generalization error (see Eq. (12)) can be evaluated from the simple overlap defined on the output layer l = L. Note that there are no gauge transformations or permutations on the output layer.
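Since the outputs are ±1 spins, the mismatch fraction and the output-layer simple overlap carry the same information. The sketch below encodes our reading of how the estimate follows from r(L); the relation ε = (1 − r(L))/2 is an assumption consistent with ε = 1/2 for random guessing.

```python
import numpy as np

def generalization_error(S_teacher_L, S_student_L):
    """Estimate eps from the M' test outputs (shape (M', N), entries
    +/-1).  With Ising outputs, eps = (1 - r(L)) / 2, where r(L) is the
    simple teacher-student overlap on the output layer; this equals the
    fraction of mismatched output spins."""
    r_L = float(np.mean(S_teacher_L * S_student_L))
    return 0.5 * (1.0 - r_L)

rng = np.random.default_rng(6)
T = rng.choice([-1.0, 1.0], size=(100, 8))   # teacher test outputs
S = T.copy()
S[:50] *= -1.0           # student wrong on half of the test patterns
```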

C. Results
Now let us discuss the results of the simulations.First, we discuss the equilibration process through the learning and unlearning protocols (see sec.V A 3). Next, we discuss the equilibrium properties of macroscopic observables.
In the simulations we used δ = 0.1 to generate new weights by Eq. (30). For the test we used M′ = M data uncorrelated with the training data. In the following, observables are averaged over 240 statistically independent samples (different realizations of the teacher machine, initial configurations of the student machines for learning, and realizations of the random numbers used in the Monte Carlo updates).

Learning
In Fig. 11 we present the relaxation of the loss function Eq. (29) in the learning protocol (see sec. V A 3). It can be seen in panel a) that relaxation of the loss function slows down with increasing number of training data M = cα. On the other hand, it can be observed in panel b) that relaxation becomes faster with increasing depth L of the network.
As shown in panel c), the relaxation depends also on the width N but converges for large enough N with fixed c and α, suggesting that the relaxation time is finite in systems with finite connectivity c even in the N → ∞ limit. For larger c, the relaxation curves converge to a slower curve as shown in panel d), suggesting that the relaxation time becomes larger for larger connectivity c.

Unlearning
In Fig. 12 we show the simple overlaps defined in Eq. (32) observed in the unlearning protocol, which explores the E = 0 landscape (SAT phase) (see sec. V A 3). Note that q(l) = r(l) = 1 at the beginning. The student machines become de-correlated from the teacher machine and also from each other as time t elapses. It is interesting to note that the relaxation is in-homogeneous in space: relaxation is faster in the central part of the network and slower closer to the input/output boundaries.
It is important to note that the complete vanishing of the simple overlaps does not necessarily mean that the solution space is completely in a liquid state, as the overlaps are not gauge invariant. Because of the gauge symmetry (see sec. III F 1), even machines that are completely the same as the teacher machine modulo the gauge transformation can have vanishing simple overlap with the teacher machine. Indeed, we will find below that the normalized squared overlaps Eq. (36) (which are gauge invariant) instead indicate correlations between different machines.
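The gauge symmetry can be made concrete: flipping all incoming weights of one hidden perceptron flips its spin, and negating the weights through which the next layer reads that node leaves the input/output map unchanged. A small sketch, again assuming the sign-activation form of Eq. (1):

```python
import numpy as np

rng = np.random.default_rng(3)

def propagate_layers(S, arms, J, c):
    """Feed-forward pass recording every layer's spins."""
    out = [S]
    for arms_l, J_l in zip(arms, J):
        S = np.sign(np.sum(J_l * S[arms_l], axis=1) / np.sqrt(c))
        out.append(S)
    return out

N, L, c = 50, 4, 10
arms = rng.integers(0, N, size=(L, N, c))
J = rng.standard_normal((L, N, c))

# gauge transformation at hidden node (l, i): negate its incoming
# weights (this flips its spin), then negate the weights through which
# layer l+1 reads that node (this cancels the flip downstream)
l, i = 2, 7
Jg = J.copy()
Jg[l - 1, i, :] *= -1          # incoming weights of node (l, i)
Jg[l][arms[l] == i] *= -1      # compensating flips in layer l + 1

x = rng.choice([-1.0, 1.0], size=N)
S = propagate_layers(x, arms, J, c)
Sg = propagate_layers(x, arms, Jg, c)
# the input/output map is unchanged while the hidden spin at (l, i)
# is flipped, so the simple overlap is not gauge invariant
```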
The in-homogeneity of the relaxation observed here suggests that the system is more constrained closer to the boundaries while the center is freer. We have also observed that deeper systems relax faster, as shown in Fig. 11 b). These may be interpreted as an echo of the 'crystal-liquid-crystal' sandwich structure predicted by the theory (Fig. 7).

Equilibration
In equilibrium, the learning and unlearning protocols should give the same results for macroscopic observables after sufficiently long times. This is indeed verified as shown in the top panels a), c), e) of Fig. 13. In panels a) and c) we show the normalized squared overlaps defined in Eq. (36), which are invariant under the gauge transformations. The normalized squared overlaps of the unlearning and learning protocols agree, suggesting the establishment of equilibrium. Furthermore, it can be seen that the Nishimori condition q_2(l) = r_2(l) (see Eq. (37)) expected for Bayes optimal inferences becomes satisfied after sufficiently long times. This is another piece of evidence of thermal equilibration. Equilibration can also be seen in panel e) where we show the simple overlap between the teacher and student machines in the output layer l = L for the test data.
As can be seen in Fig. 13, the spatial profiles of the normalized squared overlaps q_2(l) and r_2(l) are strongly in-homogeneous in space. As discussed in sec. V B 4, we consider that the normalized squared overlaps quantify the similarity of the avalanche patterns taking place in different machines through changes of inputs µ → ν (≠ µ). At the beginning of unlearning, which starts from the teacher's configuration, the normalized squared overlaps take high values. On the other hand, they are small at the beginning of learning, which is not surprising because the teacher and student machines are totally uncorrelated at the beginning. In equilibrium, they converge to a non-trivial, spatially non-monotonic function. This implies that the equilibrium phase is not just a liquid, as we might have thought based on the observation of the vanishing simple overlaps (Fig. 12). On the contrary, the gauge invariant quantities show that the student machines are strongly correlated with each other and with the teacher machine in equilibrium. The spatial non-monotonicity means that they become less correlated with each other in the center (beyond the trivial difference by the gauge transformations) while they are similar to each other (modulo the gauge transformation) closer to the input and output boundaries. This observation can be regarded as another echo of the spatial in-homogeneity predicted by the theory (Fig. 7). From the theoretical point of view, the strong asymmetry concerning the exchange of input/output sides, which is absent in the saddle point solution, may be attributed to the finiteness of the width N and the connectivity c, as we discussed in sec. IV C 4.
As shown in the bottom panels b), d), f) of Fig. 13, the overlaps increase as α increases, as expected. In panel f) it can be seen that the dynamics of both learning and unlearning slow down as α increases.

Typical student machines
Now let us examine further the equilibrium properties, i.e. properties of typical student machines sampled in the solution space. We show in Fig. 14 some data of the normalized squared overlaps q_2(l). It can be seen again that the data obtained by both learning (filled symbols) and unlearning (open symbols) agree, confirming that the system is equilibrated. Quite remarkably, the equilibrium normalized squared overlap q_2(l) evolves non-monotonically in space for large enough N and c. It first decreases with l but finally increases with l. This means that avalanches taking place in different machines become de-correlated in the middle of the network but strongly correlated closer to the input and output boundaries. It appears that the situation becomes closer to the 'crystal-liquid-crystal' sandwich structure predicted by the theory (Fig. 7) with increasing N and c.
In sec. IV C 4 we discussed that the corrections due to the loops and fluctuations around the saddle point can be separated by considering the dense limit N ≫ c ≫ 1 but bring similar effects: 1) a remnant symmetry breaking field and 2) input-output asymmetry. Indeed, it can be seen in Fig. 14 that for fixed connectivity c, the normalized squared overlaps decrease around the center of the network and the asymmetry with respect to the exchange of input/output sides becomes weaker as N increases. Furthermore, the data suggest convergence in the N → ∞ limit with fixed c. Comparing panels a) and b), it can be seen that the remnant asymmetry becomes weaker as the connectivity c increases. Remarkably, the de-correlation in the center also becomes clearer with increasing N and c, suggesting the emergence of a liquid like region in the center due to over-parametrization, as suggested theoretically in the dense limit N ≫ c ≫ 1.

In Fig. 15 we show the normalized squared overlaps q_2(l), r_2(l) and the generalization error in systems with α = 4, c = 5 observed after t = (MCS) in systems with different depth L = 5, 10, 20. As shown in the top panels a), c) and e), the data obtained by both learning (filled symbols) and unlearning (open symbols) agree, proving again that the system is equilibrated. As shown in panels a) and c), we find again that the normalized squared overlaps are strongly in-homogeneous in space. The machines de-correlate more with respect to each other in the central region in deeper systems, but the correlations recover approaching the output layer. We also find again that the normalized squared overlaps increase significantly and that the asymmetry concerning the exchange of the input and output sides becomes stronger with decreasing width N. Now let us turn to the effect of the finite dimension D introduced by the hidden manifold model (see sec. V A 1). The results of simulations on the hidden manifold model are displayed in panels b) and d) of Fig. 15. Here we used the simplest folding matrix F of the form Eq. (17), but we obtained qualitatively the same results (not shown) also in the case of random matrices. Comparing panels b) to a) and d) to c), we immediately notice that the effect of the hidden dimension D is quite similar to the effect of finite width N: decreasing D with fixed N is much like decreasing N (= D). We conjecture that this is due to the enhancement of the loop corrections induced by the closing of the loops by the correlated inputs, as discussed in sec. IV C 2.
Finally, let us discuss the generalization error shown in panels e) and f) of Fig. 15. In panel e) we also show the generalization error obtained by the theory in the dense limit c → ∞ (see Eq. (12) and Fig. 9). Remarkably, the effects of finite width N and hidden dimension D are very similar again. The generalization error improves significantly either by decreasing N (= D) or by decreasing D with fixed N. Presumably this is due to the increase of correlations inside the network induced by the loop corrections. Moreover, the generalization error becomes independent of the depth L in sufficiently deep systems, much like the theoretical prediction. The result implies that the generalization ability first decreases upon making the system deeper but does not vanish even in the L → ∞ limit. This is consistent with the L independent learning curve ε = ε_∞(α) predicted theoretically (see sec. IV B 2).

VI. CONCLUSIONS
In the present paper we obtained an exactly solvable statistical mechanics model of machine learning by a deep neural network (DNN) in the dense limit N ≫ c ≫ 1. Exact solutions are obtained using the replica method developed in [1]. We used the replica theory to analyze the generalization ability of the DNN in the Bayes-optimal teacher-student setting. The learning curve ε = ε_L(α) becomes independent of the depth L as long as the two crystalline phases attached to the input/output boundaries are separated by the liquid phase in the center. Thus the system is predicted to generalize even in the limit L → ∞ where the system becomes extremely over-parametrized. We discussed the loop corrections to the dense limit and argued that finite width N and finite hidden dimension D effects appear similarly. Both should lead to an increase of correlations inside the network.

In simulations, the simple greedy Monte Carlo method turned out to work efficiently, enabling sampling of typical machines in equilibrium and suggesting the simplicity of the loss landscape. The main obstacle in simulations is the gauge invariance of the system, by which order parameters in the original simple form vanish. To overcome the difficulty we measured the normalized squared overlap, which quantifies the correlation of avalanches, triggered by changes in the input data, between different machines. It is a gauge (and permutation) invariant quantity that reflects the similarity between machines modulo the gauge (and permutation) symmetries. The result is qualitatively consistent with the theoretical prediction that over-parametrization in the DNN leads to spatially in-homogeneous learning: students become close to the teacher around the input/output boundaries while remaining only weakly correlated in the center. We note that a liquid like central region was also noticed in [40]. Furthermore, somewhat counter-intuitively but in agreement with the theoretical prediction, the generalization error first increases with increasing depth L but then becomes independent of the depth L, suggesting that the generalization ability survives in the L → ∞ limit. Simulations confirm that finite width N and finite hidden dimension D effects are quite similar and lead similarly to significant improvements in the generalization ability. Presumably this reflects the increase of correlations inside the network due to the loop corrections. As we noted in sec. IV C 4, we consider that the corrections due to the loops and fluctuations around the saddle point play the role of a symmetry breaking field which allows the student to recognize the teacher in spite of the liquid like center.
After all, what is the advantage of making the system deeper? One important advantage is that the learning dynamics become faster with increasing depth, as we found numerically. This should be due to the presence of the central region where the system is less constrained. We believe that this point will become more important as we move away from the idealized, Bayes optimal teacher-student setting considered in the present work. From the theoretical point of view, there is no guarantee that replica symmetry continues to hold as we move away from the Bayes optimal situation toward situations in the real world. For example, one can consider a noisy teacher-student scenario by adding noise to the training data provided by the teacher. Then the situation becomes closer to the random scenario considered in [1], where complex replica symmetry breaking (RSB) was found in the DNN. In the latter case, RSB evolves in space such that the hierarchy of RSB becomes simplified layer-by-layer approaching the center, so that the central region can remain in the replica symmetric liquid phase if the network is made deep enough. This implies that deeper systems will relax faster even in the presence of RSB around the boundaries.
There are numerous directions in which to generalize and extend the present work. Let us mention a few of them below. It is straightforward to study the model exactly with the parameter α depending on space, α = α(l). This amounts to letting the width N of the network vary in space, N = N(l). It will be interesting to study how one can control the spatial heterogeneity of learning by changing α(l). The dense limit N ≫ c ≫ 1 will be useful not only for the replica theory but also for other theoretical approaches. For instance, it should be possible to develop cavity approaches in the dense limit. It will also be very interesting to generalize our theory to more general activation functions, as we noted in sec. III C.
where ( *0) ab and (ε *0) ab are defined such that, (B16) Then at O(λ) we find,

• Note that the loop corrections break the symmetry with respect to the exchange of the input/output sides.
• In general, by associating more replicas with the same diagram we find contributions which vanish more rapidly with increasing c.

2. Summary
Now we can collect the above results to obtain the free-energy functional −βF n [{Q, q}] defined in sec. The last term in Eq. (B36) is the interaction part of the free-energy −βF ex [{Q, q}] (see Eq. (B23)).
In the dense limit lim c→∞ lim N→∞ , we have found that only the 1st-order term −βF n,1 (see Eq. (B26)) in the Plefka expansion contributes to −βF ex [Q, q, {∂/∂ μ,■,a }]. Thus we find, on the boundaries, q ab,■ = 1 for ■ in the input layer (l = 0) and in the output layer (l = L). Finally, assuming that the order parameters are homogeneous within the layers, Eq. (19), we find the expression Eq. (24). Here we display the expressions for the Franz-Parisi potential within the replica-symmetric ansatz needed to evaluate the generalization error. Next we consider the expansion of the free-energy functional given by Eq. (B36), supplemented by Eq. (B38), Eq. (B39) and Eq. (B40), around the saddle point given by Eq. (B41). We can write Q ab,■ = Q* ab (l) + ∆Q ab,■ , q ab,■ = q* ab (l) + ∆q ab,■ (B46), where Q* ab (l) and q* ab (l) are the saddle-point values of the order parameters, l is the label of the layer to which ■ belongs, and ∆Q ab,■ and ∆q ab,■ are fluctuations around the saddle point.

a. Quadratic expansion
The quadratic expansion of the replicated free-energy functional is specified by the Hessian matrix. It is obtained as, where I A (x) is the indicator function, i.e. I A (x) = 1 if x ∈ A and 0 otherwise. Let us note that in the liquid phase, where Q ab = q ab = 0 for a ≠ b, the Hessian matrix becomes simplified as H Qq ab,cd,■1,■2 = 0. Including the correction due to the fluctuations around the saddle point, the replicated Gardner volume Eq. (21) can be written as ⟨V 1+s (S 0 , S L (S 0 , J teacher ))⟩ S0,J teacher = e NM s1+s[{Q*,q*}] Z fluctuation (B54), where H QQ ab,cd,■,■' , ... are the Hessian matrices given in sec. B 7. For the following discussion, we do not need to perform a complete analysis of the correction. We restrict ourselves to the liquid phase Q = q = 0. Then, as shown in sec. B 7, the Hessian matrices become completely local, i.e. H QQ
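The role of Z fluctuation can be summarized by the standard Gaussian (one-loop) formula. The following is a schematic sketch, not a formula taken from the paper: Δ collects the fluctuation fields (ΔQ, Δq), 𝓗 denotes the Hessian discussed above, and the normalization of the quadratic form is assumed by analogy with the exponent of Eq. (B54).

```latex
Z_{\mathrm{fluctuation}}
\simeq \int \prod_{a<b,\,\blacksquare} \mathrm{d}\Delta Q_{ab,\blacksquare}\,
\mathrm{d}\Delta q_{ab,\blacksquare}\;
\exp\!\left( -\frac{1}{2}\, \Delta^{\mathsf{T}} \mathcal{H}\, \Delta \right)
\propto \left( \det \mathcal{H} \right)^{-1/2},
\qquad
-\beta \Delta F \simeq -\frac{1}{2} \ln \det \mathcal{H}.
```

In the liquid phase Q = q = 0, where the Hessian is completely local, ln det 𝓗 decomposes into a sum of single-perceptron terms, which is why no complete analysis of the correction is needed there.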

FIG. 1. Schematic picture of the multi-layer perceptron network of depth L and width N. In this example, the depth is L = 4. Each arrow represents an M-component vector spin Si = (S 1 i , S 2 i , . . ., S M i ), with component S μ i = ±1 representing the state of a 'neuron' in the μ-th pattern.

FIG. 3. A loop of interactions in a DNN extended over 3 layers, through 3 perceptrons and 4 bonds.

FIG. 5. Variables associated with a perceptron which change sign under the flip of the gauge variable σ → −σ.

FIG. 7. Spatial profile of the order parameter obtained by solving the replica symmetric saddle-point equations: (top) the overlap of spins (neurons) and (bottom) the overlap of bonds (synaptic weights). Here L = 20. Different lines correspond to α = 16 − 10 3 with equal spacing ∆ ln α = 0.23...

FIG. 9. Learning curves of DNNs with various depths L obtained by solving the replica symmetric saddle-point equations.

FIG. 10. Schematic picture of the closed and unclosed loops at the boundary.

FIG. 11. Relaxation of the loss function during learning (annealing with tmax = 10 4 (MCS)) observed in the MC simulations. In all cases N = 10. (a) Various α = M/c with L = 10 and c = 5. (b) Various L with α = M/c = 4 and c = 5. (c) Various N with L = 10, α = 4 and c = 5. (d) The same as (c) but with c = 10. The unit of time t is 1 (MCS).
FIG. 13. Spatial profile of the normalized squared teacher-student overlaps r2(l) and student-student overlaps q2(l) (see Eq. (36)) for training (a), b)) and test (c), d)), and time evolution of the simple teacher-student overlap r(L) in the output layer (l = L) for the test (e), f)). All data are obtained by the MC simulations. In panels a), c), and e), data for q2(l) (open symbols) and r2(l) (filled symbols) of learning/unlearning are represented by red/blue points (α = 4). Panels a) and c) show the normalized squared overlaps at various times t = 1, 10, 100, 1000, 10000 (increasing along the arrows) at each layer. Panel e) shows the time evolution of the simple teacher-student overlap r(L) at the output layer (l = L) for the test data. Panels b) and d) show the normalized squared teacher-student overlap for unlearning (open symbols)/learning (filled symbols) at α = 2, 4, 8 at t = 10 4 (MCS). Panel f) shows the time evolution of the simple teacher-student overlap r(L) at the output layer (l = L) for the test data, obtained by the unlearning (open symbols)/learning (filled symbols) protocols with α = 2, 4, 8, 16, 32. Here N = 10, L = 10, and c = for all data.
FIG. 15. Spatial profile of the normalized squared overlaps and the generalization error in systems with various widths N (top panels) and hidden dimensions D (≤ N) (bottom panels), obtained by MC simulations. In the top panels, data obtained by both learning (filled symbols) and unlearning (open symbols) are shown. Panels a) and b) show the normalized squared overlaps for training; c) and d) show those for the test. Panels e) and f) show the generalization error (see Eq. (43)). In panel e) we also show the generalization error obtained by the theory in the dense limit, N → ∞ followed by c → ∞ (see Eq. (12) and Fig. 9). In panels a), b), c), and d), data with L = 5, 10, 20 are shown. In panels a), c), and e), data with N = 8, 10, 20 are shown. In panels b), d), and f), data with D = 4, 6, 8, 10 are shown. In all cases α = 4, c = 5 and t = 10 4 .

FIG. 16. Some examples of the activation function f (h) and the associated confining potential U (S). Here u = 1 for the piece-wise linear function.
Z fluctuation = ∫ ∏ a&lt;b,■ d∆Q ab,■ d∆q ab,■ exp(⋯)