Statistical Mechanics of Deep Linear Neural Networks: The Back-Propagating Kernel Renormalization

The success of deep learning in many real-world tasks has triggered an intense effort to understand the power and limitations of deep learning in the training and generalization of complex tasks, so far with limited progress. In this work, we study the statistical mechanics of learning in Deep Linear Neural Networks (DLNNs), in which the input-output function of an individual unit is linear. Despite the linearity of the units, learning in DLNNs is highly nonlinear; hence, studying its properties reveals some of the features of nonlinear Deep Neural Networks (DNNs). Importantly, we solve exactly for the properties of the network following supervised learning, using an equilibrium Gibbs distribution in the weight space. To do this, we introduce the Back-Propagating Kernel Renormalization (BPKR), which allows for the incremental integration of the network weights layer by layer, starting from the network output layer and progressing backward until the first layer's weights are integrated out. This procedure allows us to evaluate important network properties, such as its generalization error, the role of network width and depth, the impact of the size of the training set, and the effects of weight regularization and learning stochasticity. BPKR does not assume specific statistics of the input or the task's output. Furthermore, by performing partial integration of the layers, the BPKR allows us to compute the properties of the neural representations across the different hidden layers. Finally, we propose an extension of the BPKR to nonlinear DNNs with rectified linear units (ReLU). Surprisingly, our numerical simulations reveal that, despite the nonlinearity, the predictions of our theory are largely shared by ReLU networks in a wide regime of parameters. Our work is the first exact statistical mechanical study of learning in a family of DNNs, and the first successful theory of learning through the successive integration of DoFs in the learned weight space.


I. INTRODUCTION
Gradient-based learning in multilayered neural networks has achieved surprising success in many real-world problems, including machine vision, speech recognition, natural language processing, and multi-agent games [1-4]. Deep learning (DL) has been applied successfully to basic and applied problems in the physical, social, and biomedical sciences and has inspired new neural circuit models of information processing and cognitive functions in animals and humans [5,6]. These exciting developments have generated widespread interest in advancing the theoretical understanding of the success and limitations of DL, and more generally, of computation with Deep Neural Networks (DNNs). Nevertheless, many fundamental questions remain unresolved, including the puzzling ability of gradient-based optimization to avoid being trapped in poor local minima, and the surprising ability of complex networks to generalize well despite the fact that they are usually heavily over-parameterized; namely, the number of learned weights far exceeds the minimal number required for perfectly fitting the training data [7,8]. These problems have fascinating ramifications for statistical mechanics, such as energy landscapes in high dimensions, glassy dynamics, and the role of degeneracy, symmetry, and invariances [9-12]. Indeed, statistical mechanics has been one of the most fruitful theoretical approaches to learning in neural networks [13-16]. However, its classical phenomenology of capacity, learning curves, and phase transitions was formulated largely in the context of single-layer or shallow architectures. In this work we develop a new statistical mechanical theory, appropriate for learning in deep architectures. We focus on the statistical mechanics of weight space in deep linear neural networks (DLNNs), in which single neurons have a linear input-output transfer function. DLNNs do not possess computational power superior to that of a single-layer linear perceptron [17].
However, because the input-output function of the network depends on products of weights, learning is a highly nonlinear process and exhibits some of the salient features of nonlinear networks. Indeed, in very interesting recent work [18,19], the authors investigated the nonlinear gradient descent dynamics of DLNNs. These studies focused on the properties of the dynamic trajectories of gradient-based learning. To tackle the problem analytically, they had to rely on restrictive assumptions about the initial weights and on simplifying assumptions about the data statistics. In contrast, we focus on the equilibrium properties of the distribution in weight space induced by learning, allowing us to address some of the fundamental problems in DL, such as the features that determine a DNN's ability to generalize despite over-parameterization, the roles of depth, width, and the size of the training set, and the effects of regularization and learning stochasticity (akin to temperature).
To analyze the properties of the weight distribution, we consider the posterior probability distribution in weight space after learning with a Gaussian prior under a Bayesian framework [20-22]. As introduced in Section II, the posterior distribution of the weights can also be formulated as a Gibbs distribution with a cost function consisting of the training error and an L_2 weight regularization term. The Bayesian formulation and the Gibbs distribution of the weights have become a standard framework for analyzing statistical properties of neural network models and have been applied in various studies on the statistical mechanics of learning [14,23-27]. In most of our analysis we restrict ourselves to the zero-temperature limit, in which case the network attains zero training error when operating below capacity, and the Gaussian prior biases the weight distribution toward weights with smaller L_2 norms within the part of weight space that yields zero training error.
We evaluate statistical properties in weight space induced by DL by successive backward integration of the weights layer-by-layer starting from the output layer. As shown in Fig.1, each stage of the successive integration of a layer of weights yields an effective Hamiltonian of the remaining upstream weights. As Section II shows, this effective Hamiltonian is expressed in terms of a renormalized kernel matrix K l which is the P × P matrix of the overlaps of all pairs of vectors of activations of the l-th layer induced by the P inputs of the training set. This matrix is a function of all upstream weights, and in the successive integration process is renormalized by a scalar renormalization variable which 'summarizes' the effect of the integrated downstream weights on the effective Hamiltonian of the remaining weights. Therefore, we refer to the successive backward integration process as Back-Propagating Kernel Renormalization (BPKR). Using mean field techniques, this scalar renormalization variable can be evaluated by a self-consistent equation, exact in the thermodynamic limit. Thus, our theory is the first exact statistical mechanical study of the weight space properties of DNNs and the first discovery of kernel renormalization of the learned degrees of freedom (DoFs). Our BPKR is schematically explained in Fig.1, and described in detail in Section II.
Our work is closely related to interesting recent research on infinitely wide DNNs [28,29]. It is well known that the ensemble of input-output functions implemented by infinitely wide networks is equivalent to a Gaussian Process (GP) in function space, with covariance matrix defined by a Gaussian kernel: the kernel matrix averaged over weights sampled from the Gaussian distribution. This GP limit holds when the network width (the number of neurons in each layer, N) approaches infinity while the size of the training data, P, is held constant, severely limiting its applicability to most realistic conditions. In contrast, our theory holds in the thermodynamic limit assumed in most statistical mechanical studies of neural computation [24,25,30-32], namely letting both N and P approach infinity while keeping the load α = P/N fixed. As we show here, the behavior of the system at finite α is often qualitatively different from the infinite-width limit, and our theory correctly predicts network behavior in a much wider range of parameters that are more relevant to real learning tasks, providing much richer insight into the complex properties of DL in large networks, and into the role of the training data in shaping the network's performance and emergent representations. Our theory applies to the entire range 0 ≤ α < ∞. We introduce the new notions of wide and narrow networks, depending on whether α is smaller or larger than 1. Narrow networks, in particular, deviate qualitatively from the α → 0 limit of the GP theory. This is because when α > 1, a zero-training-error solution cannot be achieved just by optimizing the readout weights but necessarily requires appropriate changes in the hidden-layer weights.
In Section III we apply our theory to derive the network's generalization performance and its dependence on the architectural parameters of width and depth, as well as on the size and statistics of the training dataset. We calculate the system's phase diagram and show the important role of the L_2 weight regularization parameter, σ. In Section IV we present two important extensions. First, we extend our network architecture to include multiple outputs (denoted as m > 1) and show that in this case the BPKR is characterized by an m × m kernel renormalization matrix. Interestingly, its m eigenvalues are independent and obey self-consistent equations similar to the single-output case. While most of our study focuses on the zero-temperature limit of the weight space Gibbs distribution, taking into account only the portion of weight space that yields zero training error, we show in Section IV that the BPKR is readily applicable to the finite-temperature case, and we discuss how kernel renormalization modifies the effect of temperature for networks of different depths.
The power of the BPKR is that it allows the computation not only of the system's performance as a whole but also of the representation of the data at each layer, readily captured by the statistics of the mean layerwise kernel matrices. We show in Section V how both the input and the task statistics affect these representations, for instance, revealing underlying block structures of the task. In Section VI we present a heuristic extension of the BPKR to nonlinear deep networks and test its predictions numerically for ReLU networks. Surprisingly, we find that this approximation nicely predicts the behavior of ReLU networks with modest depth and not too small width N. Our results are discussed in the last Section.

Figure 1. Schematics of the Back-Propagating Kernel Renormalization. (a) Integrating out the readout weights a of the network yields a partial partition function Z_L in the weight space with an effective Hamiltonian H_L, which is a function of all the hidden layers' weights through its dependence on the L-th layer kernel matrix (see (e)) (with an additional L_2 regularization on the remaining weights, neglected here). (b) Integrating out layer L yields a partial partition function Z_{L-1} in the remaining weight space with an effective Hamiltonian H_{L-1}, which has the same structure as H_L except that the (L-1)-th layer kernel is multiplied by an order parameter u_{L-1}, a scalar renormalization variable which 'summarizes' the effect of the L-th layer weights on the effective Hamiltonian. (c) Similarly, integrating out all weights downstream of layer L-l yields a partial partition function Z_{L-l} with an effective Hamiltonian with an order parameter u_{L-l}, 'summarizing' the effect of all l integrated downstream layers. (d) Integrating out all the weights in the network yields the total partition function Z and the total free energy of the system H_0, with an order parameter u_0, which depends only on the training inputs x^μ, μ = 1, ..., P. (e) The l-th layer kernel matrix is the similarity matrix of this layer's responses to the P training inputs, up to a normalization factor, and is a function of all upstream weights.

A. Statistical mechanics of learning in deep networks
We consider a multilayer network with L hidden layers whose input-output mapping is given by

f(x, Θ) = (1/√N_L) Σ_{i=1}^{N_L} a_i φ_i(x, W) ,    (1)

where x ∈ R^{N_0} is an input vector of dimension N_0 and φ_i is the response of the i-th neuron in the top hidden layer (of size N_L) to that input. In general, φ is a nonlinear function of x and of the network's hidden weights, denoted by W. The output of the network is a scalar which sums linearly the top-layer activations, weighted by the readout weights a_i. We denote all network weights by Θ = (a, W).
We assume supervised learning with the following cost function:

E(Θ) = (1/2) Σ_{μ=1}^{P} (f(x^μ, Θ) − y^μ)² + (T/(2σ²)) ||Θ||² .    (2)

The first term is the mean squared deviation of the network outputs on a set of P training input vectors x^μ, μ = 1, ..., P, from their target labels y^μ. The second term, with amplitude Tσ^{−2}, is a regularization term which favors weights with small L_2 norm. The temperature parameter T in this term means that the L_2 regularization acts as an entropic term. In particular, in the regime in which we are mostly interested, T → 0, the first term enforces minimization of the training error, while the L_2 term shapes the statistical measure over the weight vectors that minimize the training error, biasing it in favor of weights with small norms. Without this term, all weights that minimize the error would have the same probability. On the other hand, the L_2 term is irrelevant at low T if the minimum of the error is unique. We call the parameter σ the weight noise parameter, as it controls the amount of fluctuation in the weights at zero T. We investigate the properties of the equilibrium distribution of the weights, defined by the Gibbs distribution P(Θ) = Z^{−1} exp(−E/T), where Z is the partition function, Z = ∫ dΘ exp(−E/T). The Gibbs distribution is equivalent to the posterior distribution of the weights with a Gaussian prior. The fundamental statistical mechanical properties of the system can be derived from the partition function and its extensions, as shown below. Calculating Z exactly is intractable; however, integrating out the readout weights a is straightforward and yields Z = ∫ dW exp(−H_L(W)), where H_L(W) is the effective Hamiltonian of the hidden-layer weights W after integrating out the readout weights a (Fig.1(a); see details in Appendix A),

H_L(W) = (1/2) Y⊤ (K_L + T I_P)^{−1} Y + (1/2) log det(K_L + T I_P) + (1/(2σ²)) Σ_{k=1}^{L} ||W_k||² ,    (3)

where Y is the P × 1 column vector of the training target labels. The matrix K_L is a P × P kernel matrix of the top layer. We assume for simplicity that all the hidden layers have equal width N.
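As a concrete illustration of this setup (our own sketch, not code from the paper), the cost function can be written down directly for a deep linear network; the per-layer 1/√N scaling and the readout normalization match the conventions used in this section, while the sizes below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

N0, N, L, P = 20, 30, 3, 10      # input dim, width, depth, training set size
sigma, T = 1.0, 0.01             # weight-noise parameter and temperature

X = rng.standard_normal((N0, P))             # training inputs (columns)
Y = rng.standard_normal(P)                   # target labels

def forward(X, Ws, a):
    """Activations of a deep linear net: x_l = W_l x_{l-1} / sqrt(fan-in)."""
    Xl = X
    for W in Ws:
        Xl = W @ Xl / np.sqrt(W.shape[1])
    return a @ Xl / np.sqrt(len(a))          # scalar output per input

def energy(X, Y, Ws, a, T=T, sigma=sigma):
    """E = 1/2 sum_mu (f(x^mu) - y^mu)^2 + (T / 2 sigma^2) ||Theta||^2."""
    f = forward(X, Ws, a)
    reg = sum(np.sum(W**2) for W in Ws) + np.sum(a**2)
    return 0.5 * np.sum((f - Y) ** 2) + 0.5 * T / sigma**2 * reg

Ws = [rng.standard_normal((N, N0))] + [rng.standard_normal((N, N)) for _ in range(L - 1)]
a = rng.standard_normal(N)
print(energy(X, Y, Ws, a))
```

At T = 0 the regularization term drops out of E and only shapes the measure over the zero-error manifold, as described above.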
For each layer, we define its kernel matrix by

K_l = (σ²/N) X_l⊤ X_l ,    (4)

where X_l is the N × P matrix of activations of the l-th layer in response to the training inputs. [Importantly, unlike other uses of kernels (e.g., in SVMs and in DNNs [28,29]), here we define kernels simply as the un-averaged dot products of the representations at the corresponding layers; hence the l-th kernel matrix is a function of all the weights upstream of X_l, and in particular K_L depends on all the hidden-layer weights W.]

Thermodynamic limit: The results we derive throughout are exact in the thermodynamic limit, which is defined as N, N_0, P → ∞ while α = P/N and α_0 = P/N_0 remain finite. Aside from these limits, we do not make any assumptions about the training inputs x^μ or the target outputs Y.
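The layerwise kernels of Eq.4 are cheap to compute directly; a minimal sketch for a random linear network (our own illustration, with hypothetical sizes):

```python
import numpy as np

rng = np.random.default_rng(1)
N0, N, L, P, sigma = 40, 50, 4, 20, 0.8

X0 = rng.standard_normal((N0, P))                     # N0 x P input matrix
Ws = [rng.standard_normal((N, N0))] + [rng.standard_normal((N, N)) for _ in range(L - 1)]

def layer_kernels(X0, Ws, sigma):
    """K_l = (sigma^2 / width) X_l^T X_l for every layer, with K_0 included."""
    Ks = [sigma**2 / X0.shape[0] * X0.T @ X0]         # K_0 uses N_0
    Xl = X0
    for W in Ws:
        Xl = W @ Xl / np.sqrt(W.shape[1])             # linear units
        Ks.append(sigma**2 / Xl.shape[0] * Xl.T @ Xl)
    return Ks

Ks = layer_kernels(X0, Ws, sigma)
print([K.shape for K in Ks])   # every kernel is P x P regardless of layer width
```

Note that each K_l here is a random, weight-dependent P × P matrix, not an averaged quantity.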
Zero temperature: Although our theory is developed for all temperatures (Section IV B and Appendix A), our primary focus is on the limit of zero temperature, where we explore the statistical properties of the solution weight space, namely the space of all Θ that yield zero training error. The zero-T theory is particularly simple for α < 1, in which case the kernel matrices, Eq.4, are full rank. Taking the zero-temperature limit, H_L(W) reduces to

H_L(W) = (1/2) Y⊤ K_L^{−1} Y + (1/2) log det K_L + (1/(2σ²)) Σ_{k=1}^{L} ||W_k||² .    (5)

We will call networks with N > P wide networks. Narrow networks (α > 1) will be discussed at the end of this Section.
Linear neurons: Integrating over the weight matrices {W_k}_{k≤L} is an intractable problem in general. Here we focus on the simple case where all the input-output functions φ are linear, so that x_{i,l} = (1/√N) w_i^l · x_{l−1} (with N replaced by N_0 for the first layer).

B. The Back-Propagating Kernel Renormalization
Even in the linear case, the Hamiltonian, Eq.5, is not quadratic in the weights; thus, integrating out the weights is highly non-trivial. Instead, we compute the full partition function Z by successive integrations, in each of which only a single layer's weight matrix is integrated out, yielding a partial partition function of the remaining DoFs in weight space (see schematics in Fig.1). Starting from the top layer W_L, we can move backward until all weights are integrated out. Integrating the top hidden-layer weight matrix W_L yields a partial partition function Z_{L−1} = ∫ dW exp(−H_{L−1}(W)), with

H_{L−1}(W) = (1/2) Y⊤ (u_{L−1} K_{L−1})^{−1} Y + (1/2) log det(u_{L−1} K_{L−1}) + (1/(2σ²)) Σ_{k<L} ||W_k||² + (N/2)(u_{L−1}/σ² − log u_{L−1}) ,    (6)

where W = {W_k}_{k<L} denotes all the weights upstream of W_L (schematically shown in Fig.1(b)). The first three terms are similar in form to H_L, Eq.5, with K_{L−1}(W) denoting the P × P kernel matrix of the (L−1)-th layer (Eq.4 with l = L−1), which is now a function of W. The kernel terms in H_{L−1} are renormalized by a scalar u_{L−1}, representing the effect of the integrated W_L on the effective Hamiltonian of W. While u_{L−1} originally appears as an auxiliary integration variable (Appendix A), in the thermodynamic limit it is an order parameter determined self-consistently by minimizing H_{L−1}, yielding

u_{L−1} = σ²(1 − α) + α σ² r_{L−1}(W) u_{L−1}^{−1} ,    (7)

where we have denoted, for general l,

r_l(W) = P^{−1} Y⊤ K_l^{−1}(W) Y .    (8)

We call this quantity the l-th layer mean squared readout, since it equals the squared mean of the vector of output weights that read out the target labels directly from layer l (after training the full network); see Appendix A. As will be shown, the mean squared readouts are key parameters that capture the effect of the task on the properties of the trained network.
To proceed, we note that apart from the scaling by the order parameter u_{L−1}, Eq.6 has exactly the same form as Eq.5; hence the steps of integration of the remaining weights can be repeated layer-by-layer. After the l-th iteration, we obtain (Fig.1(c))

H_{L−l}(W) = (1/2) Y⊤ (u_{L−l}^l K_{L−l})^{−1} Y + (1/2) log det(u_{L−l}^l K_{L−l}) + (1/(2σ²)) Σ_{k≤L−l} ||W_k||² + (lN/2)(u_{L−l}/σ² − log u_{L−l}) ,    (9)

which is a function of W = {W_k}_{k<L−l+1} (note that the superscript in u_{L−l}^k denotes a power, u_{L−l}^k ≡ (u_{L−l})^k). Thus, at each stage a scalar kernel renormalization appears, u_{L−l}, summarizing the effect of the downstream layers that have been integrated out. This order parameter obeys the mean field equation

u_{L−l} = σ²(1 − α) + α σ² r_{L−l}(W) u_{L−l}^{−l} ,    (10)

the solution of which depends on the remaining weights W and on the task, through the layerwise mean squared readout, Eq.8. Finally, integrating out all the weights yields an equation for the network scalar renormalization factor u_0,

u_0 = σ²(1 − α) + α σ² r_0 u_0^{−L} ,    (11)

with the input layer's mean squared readout

r_0 = P^{−1} Y⊤ K_0^{−1} Y .    (12)

Here K_0 = (σ²/N_0) X⊤X is the input kernel, where X is the N_0 × P input data matrix. Standard techniques using the partition function as a generating functional allow for the derivation of important statistics of the system, in particular its generalization performance. While the statistics of the performance of the system are evaluated by completing the integration over all weights, i.e., l = L as in Eqs.11,12 above, the results of partial weight integration are important in evaluating the properties of the representations in individual layers (Section V).

Comparison with the Gaussian Process (GP) theory: We will compare our results to those of the GP theory [28] for infinitely wide networks. In the GP theory, the kernels K_l, Eq.4, are self-averaging, and furthermore the weight distribution is Gaussian, so that the weight-dependent kernels can be replaced by their average over Gaussian weights (with variances σ²/N, or σ²/N_0 for the first layer). For a linear network this amounts to the kernel of layer l being simply σ² times the kernel of layer l−1; hence K_l = σ^{2l} K_0.
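The GP relation ⟨K_l⟩ = σ² K_{l−1} is easy to verify numerically by averaging the weight-dependent kernel over the Gaussian prior; a small Monte Carlo sketch (our own check, with hypothetical sizes; weight variance σ² per element, with the explicit 1/√N_0 scaling kept in the activations):

```python
import numpy as np

rng = np.random.default_rng(2)
N0, N, P, sigma = 30, 200, 5, 0.7
X0 = rng.standard_normal((N0, P))
K0 = sigma**2 / N0 * X0.T @ X0

# Average the (weight-dependent) first-layer kernel over Gaussian weights,
# as in the GP limit where kernels are replaced by their weight averages.
trials = 2000
K1_avg = np.zeros((P, P))
for _ in range(trials):
    W1 = sigma * rng.standard_normal((N, N0))
    X1 = W1 @ X0 / np.sqrt(N0)
    K1_avg += sigma**2 / N * X1.T @ X1
K1_avg /= trials

print(np.max(np.abs(K1_avg - sigma**2 * K0)))   # small: <K_1> = sigma^2 K_0
```

In contrast, a single draw of K_1 fluctuates strongly at finite α, which is exactly what the BPKR keeps track of and the GP theory discards.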
Interpretation of order parameters: Importantly, unlike the predictions of the GP theory, we will show below that the statistics of the kernel matrices induced by learning are complex, and the relation between the kernel statistics at one stage of integration and the next cannot be fully captured by a simple renormalization of the entire matrix by a scalar factor. Instead, different statistics change differently upon integrating out the degrees of freedom. Nevertheless, several quantities that depend on the kernel do undergo a simple renormalization. In particular, the layerwise mean squared readout, r_l, undergoes a simple scaling under weight averaging (Eq.13) for all 1 ≤ l ≤ L. The subscript in ⟨·⟩_l denotes averaging over the weights from l to L, so that l−1 is the top unintegrated layer. Likewise, upon successive integration of all weights, we have r_0 = u_0^l ⟨r_l⟩. This relation is important as it provides an operational definition of the order parameters u_l, which can be used for their direct evaluation in numerical simulations. Also, we will show below that the mean and variance of the network predictor transform simply by renormalizing the associated kernels with u_0 (see Eqs.20,21).

Dependence on the size of the data: As can be seen from Eq.11, our results hold when the input kernel is full rank, which implies α_0 = P/N_0 < 1. This condition is understandable, since for α_0 > 1 there is no W that achieves zero training error (in linear networks). We denote α_0 = 1 as the interpolation threshold of our network (below which the training data can be exactly matched). This threshold holds for generic input (i.e., such that the rank of K_0 is min(P, N_0)) and for a target function that is not perfectly realized by a linear input-output mapping (otherwise zero error can be achieved for all α_0). In most of our work we will focus on the properties of the zero-error solution space, i.e., we will assume α_0 < 1.

C. BPKR for narrow architectures
As stated above, the BPKR at finite temperatures is well defined for all α. However, the zero-temperature limit is subtle when α > 1, since the P × P kernel matrices, Eq.4, of the hidden layers are of rank N < P, while we have assumed above that the kernel matrices are invertible. Indeed, the above results, Eqs.6-10,13, hold only for α < 1. This difference between wide and narrow architectures reflects the difference in the impact of learning on W in the two regimes. While in the wide regime, even for generic untrained W, the training data can be perfectly fit by an appropriate choice of readout weights a, in the narrow regime a perfect learning of the task cannot be achieved without an appropriate modification of W. Specifically, at every stage of the integration, after averaging out the weights downstream of the l-th layer, the remaining weights must ensure that Y is in the N-dimensional subspace of R^P spanned by the N (P-dimensional) vectors X_{i,l}, i = 1, ..., N, induced by the P training inputs. As shown in Appendix A, these constraints lead to a modified self-consistent equation replacing Eq.10, in which the layer mean squared readout at zero temperature is given by

r_{L−l} = P^{−1} Y⊤ K_{L−l}^{+} Y

for 1 ≤ l < L. Here K_{L−l} is the kernel of the (L−l)-th layer for a set of weights that yields zero training error, and K_{L−l}^{+} denotes the pseudo-inverse of K_{L−l}. In addition, the scaling relationship, Eq.13, which provides an operational definition of the kernel renormalization order parameters, still holds for 1 ≤ l < L. However, the relation between r_0 and the average of r_l over all hidden weights takes a modified form, with r_0 given by Eq.12 (see Appendix A and SM IA for details). Thus, r_l has a cusp as a function of α at α = 1 (see SM Fig.1). Importantly, the relations Eqs.11,12 between u_0 and r_0 hold for all 0 ≤ α < ∞, as K_0 is full rank as long as α_0 < 1. Hence, many important system properties, such as the generalization error and the predictor statistics, which depend on α through u_0, are smooth functions of α for all α (Section III).
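The qualitative difference between the two regimes is easy to see numerically: with a random (untrained) linear hidden layer, the best readout fits the P labels exactly when the width exceeds P, but necessarily leaves a nonzero training error when it does not. A small sketch (our own illustration, with hypothetical sizes):

```python
import numpy as np

rng = np.random.default_rng(3)
N0, P = 500, 120
X0 = rng.standard_normal((N0, P))
Y = rng.standard_normal(P)

def min_training_error(width):
    """Best readout a (least squares) from a RANDOM linear hidden layer."""
    W = rng.standard_normal((width, N0))
    XL = W @ X0 / np.sqrt(N0)                 # width x P top-layer activations
    a, *_ = np.linalg.lstsq(XL.T, Y, rcond=None)
    return np.sum((XL.T @ a - Y) ** 2)

print(min_training_error(width=200))   # wide, alpha < 1: essentially zero
print(min_training_error(width=60))    # narrow, alpha > 1: strictly positive
```

In the narrow case, zero training error therefore requires adapting W so that Y falls inside the N-dimensional span of the hidden representations.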

D. Predictor statistics
The generalization performance is closely related to the learning-induced statistics of the predictor, Eq.1, for a new input vector x. First, we note that when the hidden-layer weights W are fixed, the predictor f(x) obeys Gaussian statistics (from the fluctuations in the readout weights a) with, at zero temperature,

⟨f(x)⟩_a = k_L(x_L)⊤ K_L^{−1} Y ,    (17)

⟨δf²(x)⟩_a = K(x_L, x_L) − k_L(x_L)⊤ K_L^{−1} k_L(x_L) ,    (18)

where x_L is the vector of top-layer activations in response to the new input x, and k_L(x_L) is the P × 1 vector given by k_L^μ(x_L) = K(x_L, x_L^μ), where for any two vectors x, y, K(x, y) = N^{−1} σ² x⊤y. The subscript a in Eqs.17,18 denotes averaging w.r.t. a only. Thus, the moments of the predictor depend on W through the P × P kernel matrix K_L and through x_L and x_L^μ. Evaluating the first two moments w.r.t. the full averaging (over Θ), we find (Appendix B)

⟨f(x)⟩ = k_0⊤ K_0^{−1} Y ,    (20)

where k_0 is the P × 1 vector given by k_0^μ = K_0(x, x^μ), where for any two input vectors x, y, K_0(x, y) = N_0^{−1} σ² x⊤y. Thus, at zero temperature, the mean predictor is independent of the network architecture and of the noise level σ, and retains its value predicted by the GP limit. This makes sense, since at zero temperature multiplying the kernels in the numerator and denominator by a scalar cancels out. The variance of the predictor takes into account the W-average of Eq.18, which is the mean contribution from the fluctuations in a, as well as the variance of the conditional mean, Eq.17. These contributions produce the following simple result (Appendix B):

⟨δf²(x)⟩ = u_0^L [ K_0(x, x) − k_0⊤ K_0^{−1} k_0 ] .    (21)

Thus, the predictor variance equals the variance of the L = 0 network (Eq.18 for L = 0) scaled by the kernel renormalization factor u_0^L, which makes sense since the variance scales linearly with the kernel. This variance renormalization due to the presence of hidden layers has an important impact on the generalization error, which depends on both moments of f and can be written as

ε_g = ⟨ (⟨f(x)⟩ − y(x))² + ⟨δf²(x)⟩ ⟩_x ,

where ⟨·⟩_x denotes an average over the distribution of test inputs. The network properties after integrating the weights of all hidden layers depend on the kernel renormalization factor u_0 but not on the intermediate renormalization factors u_l (1 ≤ l < L).
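Eqs.20,21 are straightforward to evaluate numerically once the input kernel is in hand; a minimal sketch (our own illustration; u_0 is taken here as a given scalar rather than solved for self-consistently):

```python
import numpy as np

rng = np.random.default_rng(6)
N0, P, sigma = 200, 50, 1.0
X = rng.standard_normal((N0, P))
Y = rng.standard_normal(P)
K0 = sigma**2 / N0 * X.T @ X        # input kernel, P x P

def predictor_stats(x, u0, L):
    """Mean (Eq. 20) and variance (Eq. 21) of the predictor for input x."""
    k0 = sigma**2 / N0 * X.T @ x
    Kxx = sigma**2 / N0 * x @ x
    mean = k0 @ np.linalg.solve(K0, Y)
    var = u0**L * (Kxx - k0 @ np.linalg.solve(K0, k0))
    return mean, var

x = rng.standard_normal(N0)
m, v = predictor_stats(x, u0=0.9, L=4)
print(m, v)
```

At a training input the mean predictor interpolates the label exactly and the variance vanishes, as expected from the zero-training-error condition.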
We summarize below several important expressions for the renormalization factor u_0, and the equations for the predictor statistics that depend on u_0, which hold for 0 ≤ α < ∞. These results are the main conclusions of this section, and they will be useful for analyzing the generalization performance and its dependence on various network parameters in Section III.
Kernel renormalization factor u_0: u_0 relates the average of the top-layer mean squared readout r_L (over all L weight matrices) to the input-layer mean squared readout r_0 through

r_0 = u_0^L ⟨r_L⟩ ,  ⟨r_L⟩ = P^{−1} ⟨Y⊤ K_L^{+} Y⟩ .

The matrix K_L^{+} is the (pseudo-)inverse of the top-layer kernel, K_L = (σ²/N) X_L⊤ X_L, and K_0 is the input-layer kernel. The predictor statistics for an input x are:

⟨f(x)⟩ = k_0(x)⊤ K_0^{−1} Y ,    (25)

⟨δf²(x)⟩ = u_0^L [ K_0(x, x) − k_0(x)⊤ K_0^{−1} k_0(x) ] ,    (26)

where k_0(x) = K_0(x, X_0) and X_0 stands for the P training vectors. The self-consistent equation for u_0 is:

u_0 = σ²(1 − α) + α σ² r_0 u_0^{−L} .    (27)

E. Qualitative differences between wide and narrow architectures

To highlight the qualitative differences between wide and narrow architectures, we use the equation for u_0, Eq.27, and show in Fig.2(a,b) u_0 vs. σ for different α (see SM IIA). Note that because K_0 scales with σ², when we vary σ we hold σ²r_0 constant. The limit of infinite width corresponds to α → 0; in this limit, Eq.27 yields u_0 = σ², which is the prediction of the GP theory for a linear network. First, in contrast to the GP theory, for finite α, u_0 attains a nonzero value as σ → 0. Furthermore, the dependence of u_0 on σ is qualitatively different in the wide and narrow regimes. For α < 1, u_0 increases monotonically with σ, diverging for large σ as u_0 → σ²(1 − α). Importantly, the behavior is reversed for α > 1.
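Eq.27 is a scalar polynomial fixed-point condition, u_0^{L+1} − σ²(1−α)u_0^L − ασ²r_0 = 0, whose positive root is easily found by bisection. A small sketch (our own code; the parameter values are illustrative) that also reproduces the limits discussed here:

```python
def u0_solve(alpha, sigma2, r0, L, iters=200):
    """Positive root of u^(L+1) - sigma^2 (1 - alpha) u^L - alpha sigma^2 r0 = 0 (Eq. 27)."""
    g = lambda u: u ** (L + 1) - sigma2 * (1 - alpha) * u ** L - alpha * sigma2 * r0
    lo, hi = 1e-12, 1.0
    while g(hi) < 0:          # bracket the root: g < 0 near 0+, g > 0 for large u
        hi *= 2.0
    for _ in range(iters):    # plain bisection
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

print(u0_solve(alpha=1e-8, sigma2=0.81, r0=1.0, L=3))     # GP limit: u0 -> sigma^2
print(u0_solve(alpha=0.5, sigma2=100.0, r0=0.01, L=2))    # wide, large sigma: u0 -> sigma^2 (1 - alpha)
print(u0_solve(alpha=2.0, sigma2=1e4, r0=1e-4, L=2))      # narrow, large sigma: u0 -> 0
```

The three calls hold σ²r_0 fixed while varying σ and α, mirroring the convention used for Fig.2.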
Here u_0 decreases monotonically with σ and vanishes for large σ, with u_0^L → σ^{−2} · ασ²r_0/(α − 1) = αr_0/(α − 1). This difference in behavior, particularly for large σ, reflects the differences in the effect of learning on the weight space discussed at the beginning of Section II C. Learning imposes more constraints on W in the narrow regime, and only a small fraction of W has nonzero Gibbs probability, deviating strongly from the predictions of the GP limit.

Figure 2. (a,b) Dependence of u_0 on σ in the wide and narrow regimes; in contrast to the GP theory, u_0 is finite for low σ. Additionally, in (a) (α < 1), u_0 diverges as σ²(1 − α) for σ → ∞, slower than in the GP theory. In (b), for α > 1, u_0 vanishes as σ → ∞, drastically different from the GP. (c,d) Dependence of the generalization error on σ in the wide (α = 0.8) and narrow (α = 1.1) regimes. The change with σ is due to the change in the variance, which scales as σ²u_0^L (Eq.26). The bias contribution is independent of σ (Eq.25; black dashed lines). (c) For α < 1, the generalization error diverges as σ → ∞, slower than in the GP theory. (d) For α > 1, the generalization error increases and approaches a finite limit as σ → ∞, in stark contrast to the divergence predicted by the GP theory.

III. GENERALIZATION
In a linear network, the mapping between input and output is given by an N_0-dimensional effective weight vector ∝ a⊤ W_L W_{L−1} ··· W_1. As mentioned above, we assume here that the system is below the interpolation threshold, i.e., α_0 = P/N_0 < 1; hence our network learns the training input-output relations perfectly as T → 0, even without hidden layers (i.e., L = 0). Thus, our deep network (with L ≥ 1 and α = P/N of O(1)) is always in the heavily over-parameterized regime, where the number of modifiable parameters is much larger than the number of parameters needed to satisfy the training data. Naively, this would imply that the system is extremely poor at generalization. However, as we will show, this is not necessarily so, due to the presence of an 'inductive bias' in the form of the L_2 regularization. In this section, we discuss how the generalization error depends on various network parameters, including the noise parameter σ, the network width N, and the network depth L, which may provide helpful insights for selecting network parameters during training.
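The collapse of a deep linear network to a single effective weight vector can be checked directly; a minimal sketch (our own illustration, using the layerwise 1/√(fan-in) scaling assumed earlier):

```python
import numpy as np

rng = np.random.default_rng(4)
N0, N, L = 15, 25, 3
Ws = [rng.standard_normal((N, N0))] + [rng.standard_normal((N, N)) for _ in range(L - 1)]
a = rng.standard_normal(N)

# Collapse the deep linear map to one N0-dimensional weight vector.
w_eff = a.copy()
for W in reversed(Ws):
    w_eff = (w_eff @ W) / np.sqrt(W.shape[1])
w_eff /= np.sqrt(N)    # readout normalization f = a . x_L / sqrt(N_L)

# Compare against the layer-by-layer forward pass on a fresh input.
x = rng.standard_normal(N0)
xl = x
for W in Ws:
    xl = W @ xl / np.sqrt(W.shape[1])
f = a @ xl / np.sqrt(N)

print(abs(f - w_eff @ x))   # identical up to floating point
```

The network expresses only N_0 effective parameters, yet carries L·N² + N·N_0 + N learned weights: the over-parameterization described above.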

A. Dependence of generalization on noise
From the predictor statistics, Eqs.25,26, we conclude that the contribution of the squared bias, ⟨(⟨f(x)⟩ − y(x))²⟩, to ε_g is constant, independent of the network parameters N, L, and σ. As for the contribution from the variance, it tracks the behavior of σ²u_0^L (the factor σ² stems from the noise dependence of the kernels). Thus, from our previous analysis of u_0 in Section II C, we can predict the dependence of the generalization error on the noise, as shown schematically in Fig.2(c) for wide and in Fig.2(d) for narrow networks. In both regimes, the variance grows monotonically with the noise; however, for large noise, in the narrow regime the generalization error does not diverge but saturates to a finite value, αr_0σ²/(α − 1) (r_0 scales as σ^{−2}), while in the wide regime the generalization error diverges as σ⁴(1 − α).

B. Dependence of generalization on width

We now consider in detail the dependence of ε_g on α for different levels of noise. A detailed analysis (SM IIB) shows that, when the other parameters are fixed, the generalization error varies monotonically with the width, increasing with width when the condition of Eq.28 on the noise and task parameters holds, and decreasing otherwise. The latter case is an example where, despite increasing model complexity by increasing N, the generalization performance improves, as shown in Fig.3. In the example shown, we use normally distributed training input vectors and training labels Y generated by a noisy linear teacher. The generalization error is measured here on the network outputs generated by inputs which are corrupted versions of the training vectors (detailed in Appendix E). Thus, this example corresponds to the case where the training data play the role of P templates and the test inputs are sampled from Gaussian noise around these templates [33].
The model introduces a relation between the target outputs and the data statistics, which is a more realistic situation than 'vanilla' normally distributed test data, where there is no inherent relation between inputs and outputs (aside from the weak correlations induced by the random teacher weights). We emphasize that although we chose a specific example to present our numerics, our theory is not limited to any specific type of input-output distribution.
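The template setup is easy to reproduce; the sketch below (our own illustration; the sizes and noise amplitudes are placeholders, not the values used in the paper's figures) generates the data and evaluates the architecture-independent squared-bias contribution via the mean predictor, Eq.25:

```python
import numpy as np

rng = np.random.default_rng(5)
N0, P, sigma = 400, 100, 1.0                 # alpha_0 = P / N0 = 0.25 < 1
eps_teacher, eps_test = 0.3, 0.2             # illustrative noise amplitudes

X = rng.standard_normal((N0, P))             # columns are the P templates
w = rng.standard_normal(N0)                  # teacher weights
Y = X.T @ w / np.sqrt(N0) + eps_teacher * rng.standard_normal(P)

K0 = sigma**2 / N0 * X.T @ X
alpha_vec = np.linalg.solve(K0, Y)           # K0^{-1} Y, reused for every test point

def mean_predictor(x):
    k0 = sigma**2 / N0 * X.T @ x             # k0(x), as in Eq. 25
    return k0 @ alpha_vec

# Test inputs: corrupted copies of the templates, labels from the clean teacher.
n_test = 500
idx = rng.integers(P, size=n_test)
bias2 = 0.0
for i in idx:
    x = X[:, i] + eps_test * rng.standard_normal(N0)
    y = w @ x / np.sqrt(N0)
    bias2 += (mean_predictor(x) - y) ** 2
print(bias2 / n_test)   # squared-bias contribution to the generalization error
```

The variance contribution would then be added via Eq.26 once u_0 is known; the bias term computed here is the σ-, N-, and L-independent floor discussed above.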
C. Dependence of generalization on depth

First, we discuss the limit of large $L$, analyzing the fixed point of Eq. 11, i.e., the solution for $u_0$ as $L \to \infty$.
We recall that in the GP limit $u_0 = \sigma^2$, so that $u_0^L = \sigma^{2L}$; hence if $\sigma < 1$, $u_0^L \to 0$ and the entire deep network collapses to a single-layer network, while if $\sigma > 1$, $u_0^L$ diverges. Thus, in order to obtain non-trivial behavior, the noise needs to be fine-tuned to $\sigma = 1$. Similar fine-tuning is required, in the GP theory, in nonlinear networks in the limit of large $L$. As we show below, the behavior of our networks is strikingly different. In fact, we will show that in the low-noise regime the system is self-tuned, in that $u_0$ approaches a finite fixed-point value.
Specifically, Eq. 11 predicts that in the low-noise regime, defined by $\sigma^2(1-\alpha) < 1$, $u_0 \to u_\infty = 1$, independent of $\sigma$. Furthermore, the approach to this limit is inversely proportional to $L$, $u_0(L) \approx 1 - v_0/L$. Thus, there are two sub-regimes. If $\alpha\sigma^2 r_0 < 1 - \sigma^2(1-\alpha)$, then $v_0 > 0$, implying that $u_0(L) < 1$ and increases with $L$ toward its fixed-point value of 1. If this inequality does not hold, $u_0$ decreases with $L$ towards its fixed point. Note that narrow networks, in which $\alpha > 1$, are always in the low-noise regime given by Eq. 29, for all values of $\sigma$.
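The fixed-point structure described here is easy to probe numerically. The self-consistency condition used in the sketch below, $u = \sigma^2(1-\alpha) + \alpha\sigma^2 r_0\, u^{-L}$, is an assumed form chosen only because it reproduces the limits quoted in this section ($u_0 \to 1$ at low noise, $u_0 \to \sigma^2(1-\alpha)$ at high noise, with the boundary $\alpha\sigma^2 r_0 = 1-\sigma^2(1-\alpha)$); the exact Eq. 11 should be taken from the text.

```python
def solve_u0(sigma2, alpha, sigma2_r0, L, lo=1e-3, hi=50.0, iters=200):
    """Bisection for the zero-T renormalization factor u0.

    Assumed self-consistency condition (NOT the verbatim Eq. 11):
        u = sigma^2 (1 - alpha) + alpha * (sigma^2 r0) * u^(-L).
    f(u) is monotonically increasing, so the root is unique.
    """
    f = lambda u: u - sigma2 * (1.0 - alpha) - alpha * sigma2_r0 * u ** (-L)
    assert f(lo) < 0.0 < f(hi), "root not bracketed"
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Low-noise regime, sigma^2 (1 - alpha) = 0.25 < 1: u0 sits just below 1.
u_low = solve_u0(sigma2=0.5, alpha=0.5, sigma2_r0=0.5, L=50)
# High-noise regime, sigma^2 (1 - alpha) = 2.4 > 1: u0 approaches 2.4.
u_high = solve_u0(sigma2=3.0, alpha=0.2, sigma2_r0=0.5, L=50)
```

With these illustrative parameters, the low-noise solution lies slightly below 1 (the $1 - v_0/L$ approach), while the high-noise solution sits at $\sigma^2(1-\alpha)$, in line with the two regimes described above.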
In contrast, in the high-noise regime $\sigma^2(1-\alpha) > 1$, the fixed-point value is $u_0 \to u_\infty = \sigma^2(1-\alpha) > 1$. These results have important implications for the predictor variance and the generalization error. Inspecting Eq. 21, we conclude that in the low-noise regime the predictor variance and the generalization error reach a finite value, since $u_0^L \to \exp(-v_0)$, which depends on $\alpha$, $\sigma$, and $r_0$. On the other hand, in the high-noise limit $u_0^L$ diverges exponentially with $L$, yielding a divergent generalization error. Further analysis shows that when all other parameters are held fixed, the generalization error is monotonic in $L$ (SM IIC). In the low-noise regime, it saturates at a finite value, decreasing with $L$ when $v_0 > 0$ (since $u_0^L \to e^{-v_0}$) and increasing otherwise. These three behaviors are illustrated in Fig. 4 for the same 'template' noisy linear teacher model as in Fig. 3. The phase diagram of the generalization error depicting its different behaviors is shown in Fig. 5. In Fig. 5(a) we show the trend w.r.t. the width (equivalently $\alpha$) in the plane of the noise ($\sigma^2$) and the input mean squared readout ($\sigma^2 r_0$, scaled so that it is independent of the noise), where the boundaries are given by $\sigma^{2(L+1)} = \sigma^2 r_0$. In Fig. 5(b) we show the behaviors w.r.t. the depth in the plane of $\sigma^2$ and $\alpha$ (for $\sigma^2 r_0 = 0.8$). Note that in the GP limit ($\alpha \to 0$) the generalization error, as a function of $L$, either decreases to 0 or diverges.

D. Varying the size of the training set

Until now we have considered the dependence of $\varepsilon_g$ on the network parameters for a fixed training set, in particular a fixed training set size $P$. Here we consider the effect of varying $P$. In addition to changing $\alpha$ and $\alpha_0$, changes in the training set affect $r_0$, as well as the kernels appearing in Eqs. 20, 21. The exact effect of changing $P$ depends, of course, on the details of the input and output data.
Here we address what happens near and above the interpolation threshold, $\alpha_0 = 1$. For $\alpha_0 > 1$, the minimal training error is nonzero. However, there is a huge degeneracy of weights that minimize this error, defined by all values of $\Theta$ that yield the same input-output linear mapping, given by the effective input-output weights (see SM IID), which obey the least-squares condition $w_{\rm eff} = (XX^\top)^{-1}XY$. This implies that the predictor is uniquely given by $f(x) = w_{\rm eff}^\top x$ for all $x$ (whether belonging to the training set or not). Hence the training and generalization errors of the deep network are identical to those of a single-layer network. The generalization error is given by the bias component of the error, as the predictor variance is zero.
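The degeneracy of minimal-error weights alongside a unique predictor can be illustrated in a few lines: many readout/hidden-weight factorizations $(a, W)$ realize the same least-squares effective weights and hence the same input-output map. The sizes and the construction of the factorizations below are illustrative, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N0, P, N = 5, 20, 8                  # alpha_0 = P/N0 = 4 > 1 (hypothetical sizes)
X = rng.standard_normal((N0, P))     # inputs as columns
y = rng.standard_normal(P)           # scalar labels

# Unique least-squares effective weights; X X^T is full rank for alpha_0 > 1.
w_eff = np.linalg.solve(X @ X.T, X @ y)

def factorization(seed):
    """One of infinitely many pairs (a, W) with a^T W = w_eff^T (needs N >= N0)."""
    W = np.random.default_rng(seed).standard_normal((N, N0))
    a = np.linalg.lstsq(W.T, w_eff, rcond=None)[0]   # solve W^T a = w_eff exactly
    return a, W

a1, W1 = factorization(1)
a2, W2 = factorization(2)
x_test = rng.standard_normal(N0)
f1, f2 = a1 @ W1 @ x_test, a2 @ W2 @ x_test
# Different weights, identical predictor: f1 == f2 == w_eff @ x_test (up to float error).
```

The two factorizations differ entry by entry, yet both realize the unique effective linear map, mirroring the statement that the predictor is unique while the weight-space solution is massively degenerate.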
The singularity of the input kernel at $\alpha_0 = 1$ gives rise to a simple example of a 'double descent' in the generalization error. The divergence at the interpolation threshold is known to be suppressed by the addition of $L_2$ regularizers [16]. The reason for the persistence of this divergence in our theory is that our $L_2$ regularization term is scaled by the temperature $T$ (see Eq. 2), which means that at zero $T$ it does not lift the degeneracy of the solutions (nor the concentration of the solution space on large-norm weights at $\alpha_0 = 1$). Indeed, this degeneracy is lifted, in our theory, only at finite temperature, as shown in Section IV B.
The singularity may drastically affect the behavior on both sides of the interpolation threshold. For instance, if the input is sampled from a standard i.i.d. Gaussian distribution, then $r_0$ diverges as $\alpha_0 \to 1$ as $\propto |1-\alpha_0|^{-1}$, leading to a vanishing of the predictor variance (averaged over the test examples) as $(1-\alpha_0)^{1/L}$. The sample average of the squared mean, $\langle f(x)\rangle^2$, diverges as $\propto |1-\alpha_0|^{-1}$ (see Appendix B for the assumptions made here and additional analysis in SM IID). Hence the generalization error on both sides of $\alpha_0 = 1$ is dominated by the bias and diverges as $|1-\alpha_0|^{-1}$. This non-monotonicity of $\varepsilon_g$ is reminiscent of double descent [34, 35]. However, genuine double descent, namely non-monotonicity of $\varepsilon_g$ when increasing the number of network parameters for a fixed training set, does not occur in our system, since the error is always monotonic in $\alpha$ (as shown in Fig. 3). In Fig. 6 we show these results for the 'template' model where the inputs are clustered, with two rules for the labels: a noisy linear teacher as in Fig. 3, and random labels, where the label for each cluster is binary and drawn randomly (both detailed in Appendix E).
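The divergence of $r_0$ as $\alpha_0 \to 1$ can be checked directly for standard Gaussian inputs. The overall normalization of $r_0$ in the sketch below is illustrative; only the growth as the interpolation threshold is approached matters.

```python
import numpy as np

def mean_sq_readout(N0, P, seed=0):
    """y^T K0^{-1} y / P for i.i.d. Gaussian inputs and random labels,
    with K0 = (1/N0) X^T X (sigma^2 = 1; normalization illustrative)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((N0, P))
    y = rng.standard_normal(P)
    K0 = X.T @ X / N0                 # P x P input kernel, full rank for P < N0
    return y @ np.linalg.solve(K0, y) / P

r_far = mean_sq_readout(N0=400, P=200)    # alpha_0 = 0.5
r_near = mean_sq_readout(N0=400, P=380)   # alpha_0 = 0.95
# r_near >> r_far: r0 grows roughly as 1/(1 - alpha_0) toward the threshold.
```

For Gaussian inputs the expected value of this quantity scales as $(1-\alpha_0)^{-1}$, so moving $\alpha_0$ from 0.5 to 0.95 inflates it by roughly an order of magnitude.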
For the noisy linear teacher task, the minimum generalization error is achieved on the RHS of the interpolation threshold; due to the linearity of the task, a large number of parameters (i.e., small $\alpha_0$) is not needed to generalize well. However, for the random labeling task, the minimum generalization error is achieved on the LHS of the interpolation threshold: because of the nonlinearity of the task itself, having $N_0 > P$ is required for good performance.

A. Multiple outputs
We now consider the case of $m > 1$ linear outputs, with an $N \times m$ readout weight matrix $A$. The rest of the architecture is the same as above, with $L$ hidden layers of width $N$. Extending our theory, we obtain, instead of the scalar renormalization factors $u_l$ ($m = 1$), $m \times m$ renormalization matrices $U_l$ per layer (see details in Appendix C). Here we focus on the zero-temperature limit. We first consider $\alpha < 1$. We define the hidden-layer $m \times m$ readout covariance matrix $R_l = Y^\top K_l^{-1} Y = V_l\,{\rm diag}(r_{1l},\cdots,r_{kl},\cdots,r_{ml})\,V_l^\top$. Here ${\rm diag}(r_{1l},\cdots,r_{ml})$ denotes a diagonal matrix with elements $\{r_{kl}\}_{k=1,\dots,m}$, and $V_l$ is the unitary matrix diagonalizing $R_l$, which depends on the specific realization of $\{W_k\}_{k\le l}$. We find that the matrix $U_l$ is diagonalized by the same unitary matrix $V_l$, i.e., $U_l = V_l\,{\rm diag}(u_{1l},\cdots,u_{ml})\,V_l^\top$. Each layerwise renormalization factor $u_{kl}$ obeys the same equation as the single-output scalar renormalization factor. Similar to the single-output case, $R_l = Y^\top K_l^{-1} Y$ undergoes a matrix renormalization under weight averaging (Appendix C and SM IIIA). The average of the $m$-dimensional predictor $\langle f(x)\rangle$ has the same form as in the single-output case. Thus, the fluctuations in the predictor of each mode (the eigenvectors of $V_0$) are independent and are given as in the scalar-output case. However, the overall behavior may differ from the $m = 1$ case, since individual outputs typically consist of contributions from multiple modes. For instance, the generalization error may not be monotonic in either $\alpha$ or $L$. Narrow architecture with multiple outputs: Similar to the BPKR for the single-output case, for $\alpha > 1$ the zero-temperature limit is affected by the singularity of the kernel matrices $K_l$. Eqs. 33, 34, 37 hold only for $\alpha < 1$ and need to be modified for $\alpha > 1$, as we now show.
We find (Appendix C and SM IIIB) that the renormalization matrices $U_l$ are still diagonalized by the unitary matrix $V_l$ which diagonalizes the $m \times m$ readout covariance matrices $R_l$, which for $\alpha > 1$ are given by $R_l = Y^\top K_l^{+} Y = V_l\,{\rm diag}(r_{1l},\cdots,r_{kl},\cdots,r_{ml})\,V_l^\top$ (39), where the $r_{kl}$ are the eigenvalues of $R_l$ and $K_l^{+}$ is the pseudo-inverse of $K_l$. The renormalization eigenvalues $u_{kl}$ of the matrix $U_l$ are related to the eigenvalues of $R_l$ as detailed in Appendix C. The relation given in Eq. 37, which provides an operational definition of the matrix kernel renormalization OPs, still holds for $1 \le l < L$. However, for the full average over the weights it is replaced by a modified relation (see Appendix C and SM IIIC for details).
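The mode structure of Eq. 39 can be sketched numerically: build $R$ from the pseudo-inverse of a rank-deficient kernel, eigendecompose it, and observe that any renormalization matrix sharing its eigenbasis commutes with it, so the modes decouple. The per-mode map from $r_k$ to $u_k$ used below is a placeholder, not the actual relation from Appendix C.

```python
import numpy as np

rng = np.random.default_rng(0)
P, N, m = 30, 20, 3                  # alpha = P/N > 1: K is rank deficient
H = rng.standard_normal((N, P))      # stand-in for hidden-layer activations
K = H.T @ H / N                      # P x P kernel of rank N < P
Y = rng.standard_normal((P, m))      # m target outputs

# Readout covariance built with the pseudo-inverse, as in Eq. 39.
R = Y.T @ np.linalg.pinv(K) @ Y
r_eig, V = np.linalg.eigh(R)         # R = V diag(r_k) V^T

# PLACEHOLDER per-mode renormalization r_k -> u_k (the actual relation is in
# Appendix C); the point is only that any U sharing the eigenbasis V
# commutes with R, i.e., the modes evolve independently.
u_eig = 1.0 / (1.0 + r_eig)
U = V @ np.diag(u_eig) @ V.T
commutator = U @ R - R @ U           # vanishes up to floating-point error
```

Because $U$ and $R$ are simultaneously diagonalizable, each eigenmode carries its own scalar renormalization, which is the sense in which the multi-output theory reduces to $m$ copies of the single-output one.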
As in the single-output case, Eqs. 35, 36 hold for $0 \le \alpha < \infty$, since $K_0$ is full rank as long as $\alpha_0 < 1$. Hence $U_0$, and derived quantities such as the generalization error, are smooth functions of $\alpha$ for all $\alpha$.
As in the single-output case of Section II, the network properties after integrating out the weights of all hidden layers depend on the kernel renormalization matrix $U_0$ but not on the intermediate renormalization matrices. We summarize below several important expressions for the renormalization matrix $U_0$, and the equations for the predictor statistics that depend on $U_0$. These expressions hold for $0 \le \alpha < \infty$.
Kernel renormalization $m \times m$ matrix $U_0$ for a network with $m$ outputs: The renormalization matrix $U_0$ relates the average mean squared top-layer readout to the input-layer readout covariance matrix (see Appendix C for the explicit expressions). The predictor statistics follow in analogy with the single-output case.

B. The effect of finite temperature

Until now we have focused on the zero-temperature limit. We now consider briefly the effect of finite temperature, i.e., when the training error is not strictly minimized (see details in Appendix A). Our BPKR framework holds at general temperature as well, where the sole effect of temperature is to add a regularizing diagonal term, $TI$, to the renormalized $W$-dependent kernel matrix, see Eq. 3. In particular, after $l$ successive integrations of layer weights, the kernel in the effective Hamiltonian still undergoes kernel renormalization, with the scalar renormalization factor $u_l^{L-l}$. After integrating out all the weights, the equation for the kernel renormalization factor $u_0$ acquires an explicit $T$ dependence, and the predictor mean and variance follow accordingly. Thus, at finite temperature both the mean and the variance of the predictor differ from their GP counterparts by a renormalization of all kernels by the factor $u_0^L$. Although the predictor value for $x^\mu$ in the training set is not $y^\mu$ (as is evident from Eq. 47), the regularization provided by finite temperature may improve the generalization error, in particular in the neighborhood of $\alpha_0 = 1$, where it would otherwise diverge. Indeed, for any specific task there is an optimal temperature that minimizes the generalization error, as shown in the example of Fig. 7(a). Depending on the specific training task and parameters, there exist cases where the optimal temperature is 0, and cases where it is above 0, in which case the training error at optimal generalization is nonzero. The existence of an optimal $T$ with minimal generalization error may provide guidance for choosing an appropriate regularization strength during training.
The effect of temperature is analogous to the effect of early stopping in gradient descent dynamics (see [16] and SM VIII for more details). Finally, we discuss how the effect of temperature changes with the depth of the network. From Eq. 21 we observe that the effect of temperature on generalization performance is controlled by the relative strength of the temperature term $TI$ and the renormalized kernel $u_0^L K_0$. At finite temperature, both $TI$ and $K_0$ are of $O(1)$, so the relative strength is controlled by $\lambda \equiv u_0^{-L}$: if $\lambda$ is small, the effect of temperature is also small. The finite-temperature order parameter $u_0$ behaves in the limit of large $L$ similarly to its zero-temperature counterpart. In the low-noise regime $\sigma^2(1-\alpha) < 1$, $u_0$ approaches unity for large $L$ as $u_0 \approx 1 - v_0/L$, hence $\lambda$ approaches the finite value $e^{v_0}$ and the temperature term remains relevant at any depth, as shown in Fig. 7(b) (see Eq. A34 in Appendix A). On the other hand, in the high-noise regime $\sigma^2(1-\alpha) > 1$, $u_0$ approaches a limit larger than 1, hence $\lambda$ goes to zero as $L \to \infty$ for all finite $T$ (Fig. 7(c)), implying that the temperature term $TI$ can be neglected compared to the renormalized kernel, and the behavior of the network approaches the zero-temperature behavior. In Fig. 7(d), we examine the effect of temperature on the generalization performance of networks of different depths $L$ in the high-noise regime. Since the temperature term $TI$ becomes increasingly negligible compared to the renormalized kernel as $L$ increases, the curves become shallower with $L$, indicating that the finite-$T$ behavior becomes more similar to that at $T = 0$; moreover, for larger $L$, a larger $T$ is required to compensate for the small $\lambda$ and achieve optimal performance.
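The two depth behaviors of $\lambda = u_0^{-L}$ follow from the elementary limit $(1 - v_0/L)^{-L} \to e^{v_0}$; a quick numerical check ($v_0$ and the high-noise $u_0$ below are illustrative values, not computed from the theory):

```python
import math

v0 = 0.7                 # illustrative; fixed by alpha, sigma and r0 in the theory
# Low noise: u0 = 1 - v0/L, so lambda = u0^(-L) -> exp(v0) as depth grows;
# the temperature term T*I stays relevant at any depth.
lam = {L: (1.0 - v0 / L) ** (-L) for L in (10, 100, 1000)}

# High noise: u0 -> u_inf > 1 (here 1.2, illustrative), so lambda -> 0 and
# T*I becomes negligible for deep networks.
high_noise_lam = 1.2 ** (-1000)
```

The dictionary values decrease toward $e^{0.7} \approx 2.01$, while the high-noise $\lambda$ is astronomically small, matching the contrast between Figs. 7(b) and 7(c).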

V. CHANGES IN REPRESENTATIONS ACROSS LAYERS
Until now we have discussed the behavior of the network output in different parameter regimes. We now turn to ask how the representation of the data changes across the different layers.

A. Layerwise mean kernels
The kernel matrices of the hidden layers are an important indicator of the stimulus features represented by these layers, similar to the role of similarity matrices [36, 37]. Hence, it is interesting to consider the statistics of the layerwise kernels in our system. We find that the weight-averaged kernel matrices are, to leading order in $N$, identical to those resulting from Gaussian weights (i.e., as in the GP limit). Non-Gaussianity appears in the mean kernels only in the $O(1/N)$ corrections. Specifically, for the single-output case, the partially averaged kernel is given by Eq. 49, where the subscript $l$ emphasizes that the average is over $W_l$ and the upstream weights, i.e., $\{W_k\}_{k \ge l}$. Proceeding to successively integrate all upstream weights, the fully averaged kernel of the $l$-th layer ($1 \le l \le L$) is given by Eq. 50, where the amplitudes $m_l$ consist of the sum of a geometric series with terms as in Eq. 49, yielding $m_l = \sigma^{2l} u_0^{-L}\,\frac{u_0^{l}\sigma^{-2l} - 1}{u_0\sigma^{-2} - 1}$ (see Appendix D and SM VA). Thus, the correction term in Eq. 50 encodes the output task via the output similarity matrix $YY^\top$ (and $u_0$). In the following analysis and examples, we consider the regime $\sigma \ll 1$, where the second term in Eq. 50 is prominent compared to the first. The shape of this correction is the same in all layers; however, its relative strength compared to the GP term increases with the depth of the layer (see Appendix D and SM VB).
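The expression for $m_l$ (as parsed here from the text) implies that the relative strength of the task correction, $m_l/\sigma^{2l}$, grows with layer depth in the $\sigma \ll 1$ regime; a few lines verify this with illustrative parameter values:

```python
def m_l(l, L, sigma2, u0):
    """Amplitude of the O(1/N) task correction to the layer-l mean kernel:
    m_l = sigma^(2l) u0^(-L) (u0^l sigma^(-2l) - 1) / (u0 sigma^(-2) - 1)."""
    return sigma2 ** l * u0 ** (-L) * (u0 ** l * sigma2 ** (-l) - 1.0) / (u0 / sigma2 - 1.0)

L, sigma2, u0 = 6, 0.25, 0.9   # sigma << 1 regime, u0 near 1 (illustrative values)
# Strength of the correction relative to the GP term, which scales as sigma^(2l):
relative = [m_l(l, L, sigma2, u0) / sigma2 ** l for l in range(1, L + 1)]
```

Since $u_0\sigma^{-2} > 1$ here, the geometric sum $(u_0^l\sigma^{-2l}-1)/(u_0\sigma^{-2}-1)$ grows with $l$, so deeper layers carry a proportionally larger task-dependent correction, as stated above.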
The situation is richer in the case of multiple outputs. Here the layerwise mean kernels ($1 \le l \le L$) are given by Eq. 51, where $M_l$ is the diagonal matrix whose $k$-th eigenvalue is $m_{kl} = \sigma^{2l} u_{k0}^{-L}\,\frac{\sigma^{-2l}u_{k0}^{l} - 1}{\sigma^{-2}u_{k0} - 1}$. For all $l$, the maximal eigenvalue of $M_l$ corresponds to the mode with the smallest eigenvalue of $R_0$, and this mode may dominate the correction to the mean kernel matrix (Appendix D and SM VB). Note that with multiple outputs the corrections represented by the last term in Eq. 51 are not simply proportional to the output similarity matrix, as they are in the single-output case. The difference is pronounced if the spectrum of $U_0$ (or, equivalently, that of $R_0$) departs substantially from uniformity. A synthetic example is shown in Fig. 8, in which the $P$ input vectors $x^\mu$ are linear combinations of $P$ orthonormal vectors $z_i$ with $z_i^\top z_j = \delta_{ij}$.
The linear coefficients $w_i^\mu$ are sampled i.i.d. from $\mathcal N(0,1)$ for $i \le P-1$ but from $\mathcal N(0,\tfrac{1}{10})$ for $i = P$. The output is two-dimensional, classifying the inputs according to the signs of their projections on the $(P-1)$-th and $P$-th basis vectors, respectively (i.e., $Y = {\rm sgn}([z_{P-1}, z_P]^\top X) + \sigma_0\eta$). As a result, the output similarity matrix $\frac1m Y^\top Y$ shows 4 blocks corresponding to the 4 categories (Fig. 8(b)). However, because the input is not fully aligned with the output, and the output direction corresponding to $z_P$ has smaller variance than that corresponding to $z_{P-1}$, the block corresponding to the $P$-th classification direction is suppressed in the non-GP correction to the kernel of the hidden layer, Eq. 51 (Fig. 8(c,d)). Note that the observed similarity pattern is not a linear combination of the input similarity matrix $\frac1N X^\top X$ (which is almost structureless, Fig. 8(a)) and the output similarity matrix (Fig. 8(b)). Another example, shown in Fig. 9, considers a linear network with 3 hidden layers and 6 output units, each performing a binary classification task on MNIST input images of 4 digits (see details of this task in Appendix E). Here the input similarity matrix $\frac1N X^\top X$ shows a weak but noticeable 4-block structure (Fig. 9(a)) corresponding to the 4 different digits. The output similarity matrix $\frac1m Y^\top Y$ exhibits a pronounced hierarchical block structure (Fig. 9(b)). We ask how this block structure is modified in the different hidden-layer kernels (Fig. 9). We observe three major effects in the changes of the average kernels across layers. First, the magnitude of the task-related contribution to the mean kernel increases with $l$, as expected from the theory (Appendix D and SM VB). Second, in this example we find that the finer-scale structure is more pronounced in the mean layer kernels than in the output similarity matrix, as can be seen by comparing Fig. 9(b) and Fig. 9(c).
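The 4-block structure of the output similarity matrix in the synthetic example can be reproduced directly (noiseless case $\sigma_0 = 0$ for clarity; the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
P = 40                                             # examples = basis vectors
Z = np.linalg.qr(rng.standard_normal((P, P)))[0]   # orthonormal vectors z_i (columns)
w = rng.standard_normal((P, P))                    # coefficients w_i^mu
w[P - 1, :] *= np.sqrt(0.1)                        # variance 1/10 along z_P
X = Z @ w                                          # inputs x^mu as columns
Y = np.sign(Z[:, [P - 2, P - 1]].T @ X)            # two sign labels -> 4 categories

sim = Y.T @ Y / 2.0                                # (1/m) Y^T Y with m = 2
# sim[mu, nu] = +1 (same category), 0 (one label differs), -1 (both differ):
# exactly the 4-block pattern of Fig. 8(b).
```

The reduced variance along $z_P$ enters only through the inputs $X$; it is invisible in this output similarity matrix, which is why its suppression shows up only in the hidden-layer corrections discussed above.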
Third, the contributions from the finer-scale structure become less pronounced in deeper layers, as seen in Fig. 9(c). The second and third points can also be observed more directly in Fig. 9(d), where we show the ratio between the mean of the second and third largest eigenvalues (corresponding to the 4 smaller blocks) and the largest eigenvalue (corresponding to the 2 larger blocks) of the non-GP correction terms in the layerwise mean kernels. This ratio decreases with $l$, indicating that the finer structure becomes less pronounced for large $l$; for all hidden layers this ratio is larger than for the output layer. The origin of the agreement of the mean kernels with their GP limit to leading order in $N$ is that the second-order statistics of the hidden-layer weights retain their GP values to leading order in $N$. Their renormalization appears only in the $O(1/N)$ corrections to their covariance matrix (Appendix D). This is because the learning-induced terms in the effective Hamiltonians, such as Eq. 9, are of order $P$ (as there are $P$ training constraints), which is of the order of $N$, while the $L_2$ Gaussian term is of the order of the number of weights in each layer, which is $N^2$. On the other hand, the leading term and the correction terms scale differently with $\sigma$, such that in the low-noise limit the strength of the correction relative to the GP term grows as $\sigma^{-2l}/N$ (SM VB).

B. Mean inverse kernels
While the average kernel retains, to leading order, its GP value, the average inverse kernel does not; to leading order in $N$ we obtain Eqs. 52 and 53. However, similar to the mean kernels, the average inverse kernels encode the target outputs only in the correction terms. Eq. 53 implies that the mean inverse kernel matrix diverges as $\alpha \to 1$ at zero temperature (Fig. 10). In fact, its trace for all $\alpha > 1$ is proportional to $1/T$ for small $T$ (see Fig. 10 and SM VI). This divergence of the zero-temperature mean inverse kernels in narrow networks is expected, as discussed in Section II. Note that due to the averaging over weights, the mean kernel, Eq. 50, has rank $P$ even when $N < P$. Furthermore, the divergence of the mean inverse kernels for $\alpha \ge 1$ does not lead to divergence of the mean squared readout parameters $r_l$ or the renormalization scalars $u_l$, as observed above (and explained in Appendix A and SM IA, IB). Concluding this section, we note that even though the second-order weight statistics and the related mean kernels equal, to leading order, their GP limits, other statistics of the weights and kernels, in particular the predictor statistics and the generalization error, deviate from the GP limit already at leading order, for all $\alpha = O(1)$, as shown here and in Section III.
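The $1/T$ divergence of the trace of the inverse kernel for $\alpha > 1$ can be illustrated with a single random rank-deficient kernel: the $P-N$ zero modes each contribute $1/T$ (the sizes below are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
P, N = 60, 30                        # alpha = P/N = 2 > 1: K has P - N zero modes
W = rng.standard_normal((N, P))
K = W.T @ W / N                      # P x P kernel of rank N

def trace_inv(T):
    """Trace of (K + T I)^(-1); the zero modes contribute (P - N)/T."""
    return np.trace(np.linalg.inv(K + T * np.eye(P)))

t1, t2 = trace_inv(1e-4), trace_inv(2e-4)
# Halving T doubles the trace: the small-T divergence is proportional to 1/T.
```

The nonzero eigenvalues contribute only an $O(1)$ additive term, so for small $T$ the trace is dominated by $(P-N)/T$, matching the behavior described for Fig. 10.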

A. Approximate BPKR for ReLU networks
Our theory applies to deep networks with linear units, which are limited in their expressive power. To enhance the system's expressivity, one might adopt an architecture comprising a fixed (non-learned) nonlinear mapping of the input to a shallow layer that then projects to a deep linear network with learned synapses, as has been studied extensively in recent years (e.g., inputs projecting to a nonlinear kernel representation or to a layer of nonlinear neurons via random weights [35, 38, 39]). Since our theory does not rely on specific assumptions about the input statistics, the BPKR applies readily to this architecture, with the input vectors and the associated input kernel defined by the nonlinear representation of the shallow layer. Our theory is not expected to hold for architectures where the learned weights project to nonlinear units, as is the case in most applications of DNNs; in such cases, integrating out even one layer of synapses is hard. Here we ask to what extent our theory can be adapted to nonlinear networks to yield a reasonable approximation in some parameter regimes. For simplicity, we assume a single linear output unit. We recall that in the GP limit, the properties of DNNs are accounted for by the GP kernels appropriate for the chosen nonlinearity [29]. For example, infinitely wide deep networks with ReLU nonlinearity, studied here, yield GP kernels for the $l$-th layer of the form $K_l^{GP}(x,y) \propto \sqrt{K_{l-1}^{GP}(x,x)\,K_{l-1}^{GP}(y,y)}\;J(\theta_{l-1})$, with $J(\theta) = \sin\theta + (\pi-\theta)\cos\theta$, where $\theta_{l-1}$ is the angle between the layer-$(l-1)$ representations of $x$ and $y$, and the superscript GP in $K_l^{GP}$ specifies the GP kernel of the $l$-th layer, differentiating it from the $K_l$ previously defined in Eq. 4, which represents the dot product of the network's activations. These equations can be solved by iteration from the initial condition $K_0^{GP}(x,y) = K_0(x,y) = \frac{\sigma^2}{N_0}x^\top y$, for a pair of input vectors $x$ and $y$. The average is the result of (self-)averaging w.r.t. Gaussian weights.
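The ReLU GP kernel recursion, with $J(\theta)$ as above, can be iterated in a few lines. For inputs of equal norm the recursion reduces to an update of the cosine similarity; the amplitude prefactor (and hence $\sigma^2$) cancels in this ratio, so the sketch below only assumes the standard arccosine form:

```python
import math

def relu_cosine_after_L(cos0, L):
    """Iterate the arccosine GP kernel recursion for ReLU layers,
    tracking the cosine similarity c = K(x,y)/K(x,x) for equal-norm inputs:
        c  ->  J(arccos c) / pi,   J(theta) = sin(theta) + (pi - theta) cos(theta).
    The overall kernel amplitude cancels in this ratio."""
    c = cos0
    for _ in range(L):
        theta = math.acos(max(-1.0, min(1.0, c)))
        c = (math.sin(theta) + (math.pi - theta) * math.cos(theta)) / math.pi
    return c

# Depth drives distinct inputs toward alignment (cosine similarity -> 1 from below):
cosines = [relu_cosine_after_L(0.2, L) for L in (1, 3, 10)]
```

The monotonic growth of the cosine toward 1 is a well-known property of the infinite-width ReLU kernel and shows concretely how depth reshapes the GP kernel away from the fine-tuned $\sigma = 1$ picture of the linear case.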
In the GP limit, the predictor statistics of a network with $L$ layers are given in terms of these kernels, with $k_\mu^{GP}(x) = K_L^{GP}(x, x^\mu)$. To extend the BPKR to ReLU networks at finite $\alpha$, we make the ansatz that the weight statistics are modified relative to their GP values by a scalar kernel renormalization, $u_0$. Because for the ReLU nonlinearity $K_l$ is a linear function of the amplitude of $K_{l-1}$, we reason that the iterative equation has a structure similar to the linear-network case, culminating in a single overall renormalization of the kernel. Consequently, the mean predictor is unchanged from Eq. 56, while the variance is given by the correspondingly renormalized expression. Note that in the linear case the Gaussian-averaged $K_L^{GP} = \sigma^{2L}K_0$, which reduces the above equations to the exact BPKR equations (see Eqs. 20, 21). Also, for $\alpha = 0$, Eq. 58 yields $u_0 = \sigma^2$ and the theory reduces to the GP limit for ReLU networks.
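For reference, the GP-limit mean predictor has the familiar kernel-regression form $\langle f(x)\rangle = k^{GP}(x)^\top (K^{GP} + TI)^{-1} Y$. The sketch below uses the linear kernel for concreteness (the finite-$\alpha$ renormalized expressions, Eqs. 56-58, are not reproduced here) and checks that at small $T$ the mean predictor interpolates the training data:

```python
import numpy as np

rng = np.random.default_rng(1)
N0, P = 30, 15                       # alpha_0 < 1: the input kernel is full rank
X = rng.standard_normal((N0, P))     # training inputs as columns
y = rng.standard_normal(P)           # training labels
sigma2, T = 1.0, 1e-8                # small T approximates the zero-temperature limit

# Linear-network GP kernel, K(x, x') = (sigma^2/N0) x^T x'; a stand-in for K_L^GP.
K = sigma2 / N0 * X.T @ X + T * np.eye(P)
k_vec = lambda x: sigma2 / N0 * X.T @ x          # k_mu(x) = K(x, x^mu)
f_mean = lambda x: k_vec(x) @ np.linalg.solve(K, y)

# At small T the GP mean predictor interpolates the training data.
f_train = np.array([f_mean(X[:, mu]) for mu in range(P)])
```

At finite $T$ the same expression shows directly why the predictor no longer reproduces $y^\mu$ on the training set: the $TI$ term shifts $(K + TI)^{-1}$ away from the interpolating solution.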

B. Generalization in ReLU networks
Our approximate BPKR predicts that the generalization error increases with $\alpha$ for low $\sigma$ and decreases with $\alpha$ for high $\sigma$. We checked these predictions for a ReLU network with a single hidden layer trained on the noisy linear teacher task described in Appendix E; results are shown in Fig. 11. We also checked them for a ReLU network with a single hidden layer trained on an MNIST binary classification task (see details in Appendix E), as shown in Fig. 12.
The simulations behave qualitatively as predicted by the theory: $\varepsilon_g$ in the ReLU network increases with $\alpha$ for small noise and decreases with $\alpha$ at high noise. Furthermore, and surprisingly, there is also good quantitative agreement between the simulations and the approximate BPKR for ReLU networks, even for small $N$ (i.e., $\alpha$ as large as $\sim 10$). The mean predictor contributing to the bias component of the generalization error is constant, and its value fits the prediction of Eq. 56. The predictor variance, as well as $\varepsilon_g$, varies with $N$ in close agreement with the approximate BPKR prediction. Furthermore, the order parameter $u_0$, measured through its operational definition, varies with $N$ in close agreement with Eq. 58.
In the examples above, $\alpha_0 < 1$. In the linear network we found a divergence of the bias and the generalization error at $\alpha_0 = 1$ and a vanishing of the predictor variance for $\alpha_0 > 1$. These features are not expected to hold for the nonlinear network, due to the stronger expressivity contributed by the trained nonlinear hidden layer. We asked whether our ansatz serves as a good approximation also in the regime $\alpha_0 > 1$, where the nonlinearity plays a crucial role in allowing zero training error. The results are shown in Fig. 13. Surprisingly, even here, the results for both the predictor statistics and the order parameter are in good agreement with the theory.
In all of the previous examples we did not observe double descent in $\varepsilon_g$, because we are in the regime where the network achieves zero training error for all $N \ge 2$ (our $N_0$ is sufficiently large). We therefore test our results with small $N_0$, pushing the network closer to its interpolation threshold, which occurs roughly when $N \sim \alpha_0$, i.e., $P \sim NN_0$ (i.e., the number of learned parameters equals the number of training data [40]). Indeed, we see significant deviations from the approximate BPKR as $N$ decreases toward $\alpha_0$, which suggests that a scalar renormalization of the kernel becomes inadequate as the network approaches its expressivity capacity. While the simulation shows double descent, our theoretical ansatz does not (Fig. 14). The theoretical results agree with the simulations only on the RHS of the interpolation threshold, i.e., at larger $N$, and they fit the simulations significantly better than the GP approximation, as shown in Fig. 14(b,g,l). Incidentally, it is interesting to compare the generalization behavior in the three tasks, which differ in their complexity. In the linear teacher task (Appendix E), the minimum generalization error is on the LHS of the interpolation threshold. However, for the random labeling task (Appendix E) and the classification of MNIST data (Appendix E), due to the nonlinearity of the task, a large number of network parameters is required in order to generalize well, and the minimum generalization error is achieved on the RHS of the interpolation threshold, similar to the linear network (Fig. 6). Importantly, for $N$ below the interpolation threshold, while the training error is nonzero, the minimal-training-error solution is not unique, and this degeneracy in the weights induces variability in the input-output mapping of the network, as shown by the non-vanishing of the predictor variance on the left side of the peak in Fig. 14, except at $N = 1$.
This is different from the linear case, where for α 0 > 1 the predictor variance vanishes (see Eq.21 and Fig.6).
All the examples above were for a single hidden layer. We also test our ansatz against simulation results for ReLU networks with multiple hidden layers. As seen in Fig. 15, our approximate BPKR agrees reasonably well with the simulations for $L = 1$-$5$, and is significantly better than the predictions of the GP limit, but the agreement fails for large $L$. This suggests that for finite $\alpha$, as $L$ grows, renormalizing the kernel by a single scalar becomes inadequate.

VII. DISCUSSION
Summary: Since the seminal work of Gardner [41, 42], statistical mechanics has served as one of the major theoretical frameworks for understanding the complexity of supervised learning. However, so far it has focused mostly on shallow architectures and addressed the classical bias-variance tradeoff, where calculated learning curves displayed improvement of generalization when the number of examples was large compared to the system size [43, 44]. Statistical mechanics has also focused on phase transitions, local minima, and spin-glass properties due to the underlying nonlinearity of the learning cost function and the quenched randomness of the training data [26, 31]. It is well known that deep learning challenges many of the above intuitions, calling for a new theory of learning in deep architectures. In this work we have developed a new statistical mechanics framework for learning in a family of networks with deep architectures. To make the theory analytically tractable, we have focused on networks with linear units. Despite their limited expressive power, they share important aspects of nonlinear deep networks, as highlighted in our work. Importantly, unlike most previous statistical mechanical theories of learning, which resorted to extremely simplifying assumptions about the input statistics and the target labels, our theory is general: it fully exposes the relation between network properties and task details.
DLNNs have been the focus of several studies. Refs. [45, 46] prove the absence of sub-optimal local minima under mild conditions, a result consistent with ours. A very interesting work [18] studied the gradient-descent dynamics of learning in DLNNs, with results that depended critically on the initial conditions (small random weights) and became tractable only with simplifying assumptions about the data ($XX^\top = I$ and $P > N$). Keeping $N$ and $P$ fixed for most of the simulations and analysis, [18] addressed the changes in representation during training and across the different layers. Under these restricted assumptions, they found that (when there are multiple outputs) the learning dynamics can be decomposed into multiple modes which evolve independently, qualitatively similar to the multiple modes found in our analysis (see further below). However, they did not address basic questions about the system's performance, such as the predictor statistics and the generalization error and its critical dependence on the various network parameters. Here we study the nature of the Gibbs distribution in weight space induced by learning, with the training mean squared error as the Hamiltonian. We have focused mainly (but not exclusively) on the properties of the feasible weight space consisting of weight vectors that yield zero training error, as is the case in many real-world applications of DNNs, with the well-known $L_2$ regularization (with an amplitude parameterized by the inverse noise, $\sigma^{-2}$). Due to the highly nonlinear nature of the training Hamiltonian, evaluating the statistical mechanical properties of DLNNs seems intractable. Here, we developed the BPKR method to integrate out the weight matrices layer by layer, allowing us to derive equations for the system's properties which are exact in the thermodynamic limit.
Importantly, in contrast to most kernel-based theories of deep networks, our thermodynamic limit is defined by letting both the width $N$ and the training set size $P$ diverge while the load $\alpha = P/N$ remains of order 1, extending the well-known thermodynamic limit of the statistical mechanics of learning to deep architectures [16, 25]. We have shown that the effect of the finite load is to change the effective Hamiltonians through an $\alpha$-dependent kernel renormalization at each successive step of weight integration.
In addition to the load $\alpha$, the depth $L$, and the weight noise parameter $\sigma^2$, the inputs and their labels in the training data affect the properties of the system through the mean squared input-layer readout parameter, $\sigma^2 r_0$, Eq. 12. Our results yield rich phase diagrams specifying the dependence of the generalization error on the width and the depth of the network, Figs. 3, 4, 5. Importantly, depending on $\sigma^2$ and $\sigma^2 r_0$, the generalization error may decrease upon increasing width (i.e., decreasing $\alpha$) and increasing depth (i.e., increasing $L$). Since this occurs within the overparameterized regime where the training error is zero throughout, our results prove that in an exactly solvable deep network, increasing network complexity may lead to a substantial improvement in generalization. We were also able to identify the parameter regimes where this improvement happens.
Importantly, the BPKR also enables us to evaluate the posterior properties of each layer's weights imposed by learning. We leverage this to explore the effect of input and output data on the layerwise similarity matrices induced through learning and show that due to different renormalization strengths, amplification of modes in the layer representations is not uniform, as demonstrated in the examples in Figs. 8,9. Recent studies have analyzed the similarity matrices of neuronal activities for structured tasks and compared them with the representations at the hidden layers of DNNs [47][48][49]. Therefore, our work may provide theoretical understanding of how neuronal representations are constrained by the task structure.
BPKR and GD learning: In the case of multiple outputs, we show in Section IV A that the layerwise renormalization order parameters are not scalars but matrices. The renormalization order parameter after full averaging is diagonalized by the unitary matrix which diagonalizes the input-layer readout covariance matrix, Eq.33. Different eigenvalues corresponding to different modes obey an independent set of equations, analogous to [18], which showed modes evolving independently with time during GD learning. However, our renormalization modes are defined by diagonalizing input-layer mean squared readout matrices, and not by a fixed input-output covariance matrix as in [18]. As stated above, our results rely on the equilibrium assumption but not on the special structure of the data nor on an initial condition.
It is interesting to explore the similarity of the behavior of our system with the properties of gradient descent, with implicit regularization induced by early stopping starting from random initial conditions [50][51][52]. As we show in SM VIII, the generalization properties of the early stopping dynamics may exhibit qualitatively similar features to those predicted by our theory for Gibbs learning, with the initial variance of the weights in the early stopping dynamics playing the role of our noise parameter σ 2 . For example, the generalization error of weights learned through the early stopping dynamics increases with the network width for large initial weight variance, and decreases with the network width for small initial weight variance, which qualitatively agrees with the behavior of the generalization error in our BPKR theory in different regimes of the noise parameter σ. Extending our theory to the learning dynamics is an interesting ongoing study.
Nonlinear DNNs: We have extended our theory to ReLU networks by applying a scalar kernel renormalization scheme to the GP nonlinear kernels (Section VI, Eqs. 56-59). Testing this approximation against numerical simulations of a few learning tasks with ReLU networks with a moderate number of layers revealed strikingly good qualitative and quantitative agreement regarding the width and noise dependencies of the predictor statistics and the generalization error, as well as the layerwise mean squared readout order parameters, with much greater accuracy than the GP theory. Importantly, this approximate BPKR theory for ReLU networks holds even in cases where a linear network would be in a highly under-parameterized regime (Fig. 13), such that the neuronal nonlinearity plays a crucial role in the ability of the system to yield zero training error and low generalization error (Fig. 15). The failure of the approximation for deeper networks (L ≥ 5) is expected. The GP theory for nonlinear networks predicts that as L → ∞ not only does the magnitude of the kernel matrices converge to a (finite or infinite) fixed point, but its matrix structure converges to a fixed point as well, implying the loss of information about the structure of the inputs in deep networks with infinite width. Thus, when the width is finite, i.e., α = O(1), we expect a renormalization not only of the kernels' magnitudes (as in our scalar renormalization) but also of their shape. In addition, in nonlinear networks with finite width, the basic description of the system may depend on higher-order statistics beyond the kernel matrices, as suggested by the recent work of [53] and [54].
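As a concrete reference point for this scheme, the infinite-width GP kernel of a ReLU network obeys the arccosine (Cho-Saul) recursion, which the scalar BPKR then rescales layer by layer. A minimal sketch of the unrenormalized recursion, with σ^2 playing the role of the prior weight variance and synthetic data standing in for a real task:

```python
import numpy as np

# Sketch of the infinite-width GP kernel recursion for ReLU networks
# (the arccosine kernel of Cho & Saul). The scalar BPKR scheme would then
# renormalize these kernels by width-dependent factors, not computed here.

def relu_gp_kernel(X, L, sigma2=1.0):
    """Return the P x P GP kernel of an L-hidden-layer ReLU network."""
    N0, P = X.shape
    K = sigma2 * (X.T @ X) / N0          # input-layer kernel K_0
    for _ in range(L):
        d = np.sqrt(np.clip(np.diag(K), 1e-12, None))
        c = np.clip(K / np.outer(d, d), -1.0, 1.0)   # cosine of the angle
        theta = np.arccos(c)
        K = sigma2 * np.outer(d, d) / (2 * np.pi) * (
            np.sin(theta) + (np.pi - theta) * np.cos(theta))
    return K

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))        # N0 = 50 inputs, P = 10 points
K3 = relu_gp_kernel(X, L=3)
print(K3.shape)                          # (10, 10); symmetric positive semi-definite
```

The diagonal of this kernel shrinks by a factor σ^2/2 per layer, illustrating the convergence of the kernel magnitude and structure to a fixed point as L grows.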
Even for shallow nonlinear networks, the approximate nonlinear BPKR breaks down in the under-parameterized regime, on the left side of the interpolation threshold at N ≈ α_0. Thus, if α_0 is large, there is a substantial range of small N for which the system is in the under-parameterized regime, giving rise to a peak in the generalization error (as a function of N) near the interpolation threshold, a genuine 'double descent' phenomenon as studied in [34,35]. Naturally, our approximate theory predicts a monotonic dependence on N; hence it is valid only on the right side of the double descent peak, i.e., in the over-parameterized regime, Fig. 14.
Relation to other methods: Successive integration of the random variables of joint distributions is used in belief propagation algorithms [15,55-60]. However, despite the Markovian property of the distribution of the deep network activations, the posterior distribution of the weights takes a complicated form, as described in Section II A (Eqs. 3, 5), rendering the layer-wise weight integration intractable in general; even in linear networks it can be performed only in the thermodynamic limit, as shown here. Therefore, although Bayesian inference algorithms such as message passing are commonly applied to study the distribution of hidden-layer activations of Bayesian Neural Networks [61-63], they are not directly applicable for computing the posterior distribution of the weights. Recent works on inference of the posterior weight distribution have proposed extending backprop learning algorithms to also update the variances of the weights, approximating the weight distribution as an independent Gaussian [64,65]. As our work shows, the posterior distribution is far from i.i.d. Gaussian. Importantly, backprop learning algorithms do not necessarily provide insight into the final solutions. In contrast, our work is a theoretical study of the properties of the posterior distribution of the weights after learning.
Our BPKR also bears some analogy to the Renormalization Group (RG) approach in physics. Like BPKR, RG evaluates properties of high-dimensional systems by successive integration of subsets of the system's DoFs [66]. However, the analogy is limited: in contrast to RG, here there is no obvious notion of coarse graining of DoFs. Our system combines properties of layered physical systems [67-70] with mean-field aspects arising from the full layer-to-layer connectivity. The latter is demonstrated by the fact that the behavior at the critical point σ^2(1 − α) = 1 is mean-field-like, see Eq. 30.
Extensions of the present work: There are several paths for extending our theory to deeper nonlinear networks. Exact mean-field equations are possible for specific forms of nonlinearities. For a generic nonlinearity, approximate methods might be possible; these would likely involve renormalization not only of the kernels but also of other terms in the effective Hamiltonian, such as 4th-order kernels [53]. These are topics of ongoing work. Our theory applies to fully connected networks without additional constraints on the network structure, while in practice other types of neural networks, such as Convolutional Neural Networks (CNNs), are commonly used for image processing, speech recognition, and various other tasks. Recent work discussed extensions of the GP theory to CNNs [71-73]. Incorporating such architectural restrictions into our theory induces a shape renormalization of the kernel (i.e., not simply renormalization by a scalar) and is a topic of ongoing work. Other extensions of our theory include loss functions other than MSE and regularization terms other than L2.

Acknowledgements: We have benefitted from helpful discussions with Andrew Saxe, Gadi Naveh and Zohar Ringel, and useful comments on the manuscript from Itamar Landau, Dar Gilboa, Haozhe Shan, and Jacob Zavatone-Veth. This research is partially supported by the Swartz Program in Theoretical Neuroscience at Harvard, the NIH grant from the NINDS (1U19NS104653), and the Gatsby Charitable Foundation.

APPENDIX

Appendix A: The Back-Propagating Kernel Renormalization for DLNNs
We begin with the partition function and introduce P auxiliary integration variables, t_µ (µ = 1, ..., P), to linearize the quadratic training error.
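The linearization can be checked directly: for a single residual Δ, the Gaussian identity behind the auxiliary-variable representation, exp(−Δ^2/2T) = √(T/2π) ∫ dt exp(−T t^2/2 + i t Δ), is exact. A quick numerical sketch (the values of T and Δ are arbitrary; prefactor conventions in Eqs. A1-A3 may differ):

```python
import numpy as np

# Numerical check of the Gaussian identity used to linearize the quadratic
# training error via the auxiliary variables t_mu:
#   exp(-Delta^2 / (2T)) = sqrt(T / (2*pi)) * Integral dt exp(-T t^2 / 2 + i t Delta).
# T and Delta are arbitrary test values; the integral is a plain Riemann sum.

T, Delta = 0.5, 1.3
t = np.linspace(-50.0, 50.0, 200001)
dt = t[1] - t[0]
# The imaginary part integrates to zero by symmetry, so only cos(t*Delta) remains.
integrand = np.exp(-T * t**2 / 2) * np.cos(t * Delta)
lhs = np.exp(-Delta**2 / (2 * T))
rhs = np.sqrt(T / (2 * np.pi)) * integrand.sum() * dt
print(lhs, rhs)   # the two sides agree to high precision
```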
Integrating over a, we have Z = ∫ dW Z_L(W), where the kernel matrix K_L is defined in Eq. 4 with l = L.
Integrating over t yields the partition function restricted to the remaining weights. To make further progress we assume all the units are linear, so that the hidden units are x_{i,l} = (1/√N) w_i^l · x_{l−1} (and similarly for the first layer), where the average is w.r.t. a single N-dimensional weight vector w_i^L with i.i.d. N(0, σ^2) components, and K_{L,µν} is the corresponding kernel matrix element. To integrate over t, we enforce the identity Eq. A7 via a Fourier representation of the delta function, introducing the auxiliary variable u_{L−1}, and then integrate over t. In the limit N → ∞, P → ∞ with fixed α, we solve this integral by the saddle-point method. One of the saddle-point equations yields u_{L−1} = σ^2/(1 + h_{L−1}); plugging this back into Eq. A9, we obtain the effective Hamiltonian. Thus, integrating over W_L results in the appearance of an auxiliary scalar DoF, u_{L−1}. Finally, we eliminate u_{L−1} through a saddle-point equation; in the T → 0 limit, we obtain Eq. 7. This procedure can be iterated layer by layer. We demonstrate it by computing H_{L−2}(W) defined via Z_{L−2}, where W denotes all weight matrices upstream of W_{L−1}, and where the average is w.r.t. a single N-dimensional Gaussian vector with i.i.d. N(0, σ^2) components. Performing this average yields G(t) = −log(1 + h_{L−2}). As above, we introduce two additional scalar integration variables, u_{L−2} and h_{L−2}, integrate over t, and plug in the saddle point of h_{L−2} (u_{L−2} = σ^2/(h_{L−2} + 1)), obtaining the effective Hamiltonian. Finally, u_{L−1} and u_{L−2} are computed via saddle-point equations; the solution obeys u_{L−1} = u_{L−2}. Evaluating u_{L−2} via the saddle-point equation then yields a kernel K_{L−2} renormalized by u_{L−2}^2. Note that in the integration of W_{L−2}, both u_{L−1} and u_{L−2} are auxiliary integration variables (hence independent of the weights) and are determined at the last step by the new saddle-point equation, Eq. A22, as functions of W. In contrast, the saddle-point value of u_{L−1} in the first renormalization step, Eq. A12, is a function of W.
In fact, the average of u_{L−1} of the first renormalization step over W_{L−1} obeys u_{L−1} = u_{L−2} of the second renormalization step (see the paragraph below on renormalization of the order parameters). Similarly, iterating this renormalization l times yields Eqs. 45, 46.

Narrow network at zero temperature: At finite temperature the above derivation holds for all α. However, in the zero-temperature limit, we need to address the singularity of the hidden layers' kernel matrices when α > 1. We begin with the partition function after integrating the readout layer at zero temperature, writing the eigendecomposition of the kernel as K_L = V Σ V^T, where V is a unitary P × P matrix and Σ is a P × P diagonal matrix with elements (Σ_1, ..., Σ_N, 0, ..., 0). After the orthogonal change of variables V^T t → t, integrating over t_⊥ yields δ(V_⊥^T Y). This δ-function enforces the projection of Y onto the directions perpendicular to X_L to vanish: in the zero-temperature limit, this constraint on the weights ensures zero training error, so Y must lie in the subspace spanned by X_L. We next integrate over t_∥, and obtain an effective Hamiltonian in which K_L^+ = V_∥ Σ_∥^{-1} V_∥^T is the pseudo-inverse of K_L, and C_L = (σ^2/N) X_L X_L^T has the same determinant as Σ_∥. Similarly, we have Z_{L−l}(W) = δ(V_⊥^T Y) exp[−H_{L−l}(W)], where V_⊥ are the eigenvectors of K_{L−l} spanning its null space. Differentiating Eq. A27 w.r.t. u_{L−l}, we obtain Eqs. 14, 15.
Renormalization of the order parameters: Here we show that the order parameters u_l undergo a trivial renormalization upon averaging. For any function f(u_l), the average over the remaining weights equals the saddle-point value of f(u_l). Since u_l = u_{l−1} at the saddle point, where u_{l−1} obeys the saddle-point equation Eq. 46 appropriate for L − l + 1 iterations, we have ⟨f(u_l)⟩ = f(u_{l−1}), which holds for all 0 ≤ α < ∞ and all T.
Interpretation of the order parameters: The order parameters u_l have a simple interpretation, given by Eq. 13 for 1 ≤ l ≤ L at zero temperature for α < 1 (see SM IVA for the derivation at finite temperature). We evaluate the average of u_l by performing the integration over W_l with the same approach we used to compute the partition function Z_{l−1} above; introducing the same order parameter u_{l−1} reduces the expression to a saddle-point form. At zero temperature for α < 1, this expression leads to Eq. 13 for all l (1 ≤ l ≤ L). Note that this result is conditioned on the upstream weights {W_k}_{k<l}.
For α > 1 at zero temperature, the order parameter obeys Eq. 13 for 1 ≤ l < L (partial averaging of the weights), but for averaging over all hidden weights the relation is replaced by Eq. 16. The details of these results are given in SM IA.
Mean squared readout weights: From Eq. A2, the W-dependent average of the readout weights can be expressed through the statistics of t, which are obtained from Eq. A3. Therefore we have a = (σ^2/√N) Φ (K_L + T I)^{-1} Y, and ⟨a^T a⟩ = σ^2 P r_L = σ^2 Y^T (K_L + T I)^{-2} K_L Y. In the zero-temperature limit, ⟨a^T a⟩ = σ^2 Y^T K_L^{-1} Y for α < 1, and ⟨a^T a⟩ = σ^2 Y^T K_L^+ Y for α > 1. Similarly, we can define a_l as the readout weight vector trained with inputs from the l-th layer of the trained network to produce the target output Y; the statistics of a_l are obtained by simply replacing K_L in the above equations with K_l. At zero temperature, we have ⟨a_l^T a_l⟩ = σ^2 Y^T K_l^{-1} Y for α < 1, and ⟨a_l^T a_l⟩ = σ^2 Y^T K_l^+ Y for α > 1. In Eqs. 8, 15, the definition of r_l is equivalent to r_l = σ^{-2} P^{-1} ⟨a_l^T a_l⟩; we therefore call r_l the mean squared layer readout.
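These relations are easy to verify numerically: with K = (σ^2/N) X^T X for activations X and ridge parameter T, the squared norm of the mean readout converges to σ^2 Y^T K^{-1} Y as T → 0 when α < 1. A small sketch, with random X standing in for the last hidden layer's activations X_L:

```python
import numpy as np

# Numerical check of the readout-weight statistics: with kernel
# K = (sigma^2/N) X^T X and temperature (ridge) T, the mean readout is
# a = (sigma^2/sqrt(N)) X (K + T I)^{-1} Y, and a.a -> sigma^2 Y^T K^{-1} Y
# as T -> 0 for alpha = P/N < 1. X stands in for the hidden-layer activations;
# all sizes are illustrative.

rng = np.random.default_rng(1)
N, P, sigma2 = 400, 100, 0.8                 # alpha = P/N = 0.25 < 1
X = rng.standard_normal((N, P))              # activations (N x P)
Y = rng.standard_normal(P)                   # targets

K = sigma2 * (X.T @ X) / N                   # P x P kernel, invertible for P < N
for T in (1e-2, 1e-4, 1e-6):
    a = (sigma2 / np.sqrt(N)) * X @ np.linalg.solve(K + T * np.eye(P), Y)
    print(T, a @ a)                          # converges to sigma^2 * Y^T K^{-1} Y

print(sigma2 * Y @ np.linalg.solve(K, Y))    # zero-temperature limit
```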
The second-order statistics of a l , including its variance and its norm, are discussed further in SM IB.
Order parameter in the L → ∞ limit: Earlier in this section we presented the detailed derivation of the self-consistent equation for the order parameter at finite temperature, given by Eq. 46. In the L → ∞ limit, in the low-noise regime σ^2(1 − α) < 1, u_0 approaches 1. We assume that u_0 approaches 1 as u_0 ≈ 1 − v_0/L, as discussed in Section III for zero temperature. Plugging this into Eq. 46 yields a self-consistent equation that determines exp(v_0), which is the limit of λ as L → ∞, as presented in Fig. 7(b,c) in Section IV B.

Appendix B: Generalization
The mean squared generalization error depends only on the mean and variance of the predictor, which can be computed using a generating function in which x is an arbitrary new point; the statistics of the predictor are given by its derivatives. The integral can be performed as in Appendix A, by introducing P auxiliary integration variables t_µ (µ = 1, ..., P), integrating over W, and introducing the order parameters u_l layer by layer.
After integrating the weights of the entire network, we obtain an expression in which T_0 = u_0^{-L} T, as defined in Section IV B. Differentiating Z, we obtain the mean predictor. Because the derivative is evaluated at t_{P+1} = 0, the saddle point u_0 satisfies the same equation as Eq. 46 for l = L. Similarly, we calculate the second-order statistics of the predictor. Taking the T → 0 limit, we obtain Eq. 20 and Eq. 21.
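Concretely, the resulting mean predictor takes a kernel-regression form on the input kernel with the renormalized temperature T_0; a minimal sketch of our reading of this result (u_0 below is a placeholder constant rather than the solution of the saddle-point equation Eq. 46, and the data are synthetic):

```python
import numpy as np

# Sketch of the predictor mean after full BPKR integration: kernel regression
# on the input kernel K_0 = (sigma^2/N_0) X^T X with the effective temperature
# T_0 = u_0^{-L} T (Section IV B). u0 is a placeholder, not the saddle-point
# solution of Eq. 46; all sizes and values are illustrative.

rng = np.random.default_rng(2)
N0, P, sigma2, L, T, u0 = 50, 30, 1.0, 3, 1e-3, 0.9

X = rng.standard_normal((N0, P))             # training inputs (columns)
Y = rng.standard_normal(P)                   # training labels
x_new = rng.standard_normal(N0)              # arbitrary new point

K0 = sigma2 * (X.T @ X) / N0                 # P x P input kernel
k0 = sigma2 * (X.T @ x_new) / N0             # test-train overlaps
T0 = T / u0**L                               # renormalized temperature

f_mean = k0 @ np.linalg.solve(K0 + T0 * np.eye(P), Y)
print(f_mean)                                # mean predictor at x_new
```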
The dependence of the generalization error on σ, N and L is determined by the behavior of σ^2 u_0 w.r.t. σ, N and L, which is shown in SM IIA, IIB, IIC. The dependence on P (which affects both α and α_0) hinges on the specific statistics of the input and output. Here we analyze the relatively simple case of inputs sampled from an i.i.d. Gaussian distribution and a target output generated by a linear teacher with additive noise, and we focus on the behavior near α_0 = 1.
For P > N_0, because the network can no longer achieve zero training error, we replace the Y in r_0 with X^T (XX^T)^{-1} X Y, the output the network actually learns on the training data. Near α_0 = 1, since (1/P) Tr((XX^T)^{-1}) diverges as α_0^{-1}(α_0 − 1)^{-1} [74], r_0 is also dominated by the contribution from the noise term in Y. The generalization error is dominated by the divergent bias as α_0 → 1, and diverges as (α_0 − 1)^{-1}.
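The divergence of (1/P) Tr((XX^T)^{-1}) can be checked against the exact inverse-Wishart mean, E[Tr((XX^T)^{-1})] = N_0/(P − N_0 − 1), valid for X with i.i.d. N(0,1) entries and P > N_0 + 1 (normalization conventions for X may differ from the text):

```python
import numpy as np

# Monte Carlo check of the inverse-Wishart trace identity driving the
# divergence near alpha_0 = 1: E[Tr((X X^T)^{-1})] = N0 / (P - N0 - 1)
# for X of shape (N0, P) with i.i.d. standard Gaussian entries.

rng = np.random.default_rng(3)
N0, P, trials = 40, 80, 200          # alpha_0 = P / N0 = 2

vals = []
for _ in range(trials):
    X = rng.standard_normal((N0, P))
    vals.append(np.trace(np.linalg.inv(X @ X.T)))
est = np.mean(vals)
print(est, N0 / (P - N0 - 1))        # Monte Carlo estimate vs exact mean
```

As P approaches N_0 + 1 from above, the denominator P − N_0 − 1 vanishes and the trace (and with it the bias of the generalization error) blows up.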
The case of clustered inputs, as in our template model, is treated analytically in SM IID.
Appendix C: Multiple Outputs

BPKR for multiple outputs: Here we extend the calculations in Appendix A to multiple outputs (m > 1) in the zero-temperature limit for α < 1. For m > 1, the integration variables t_µ form a P × m matrix. Integrating over w yields G(t) = −(1/2) log det(I + H_{L−1}), where H_{L−1} = (σ^2/N) t^T K_{L−1} t is an m × m matrix, a relation enforced by an auxiliary matrix variable. For α < 1, we can integrate over t. Again substituting the saddle point of H_{L−1}, i.e., I + H_{L−1} = σ^2 U_{L−1}^{-1}, and differentiating w.r.t. U_{L−1}, we obtain the self-consistent equation for U_{L−1}. A similar conclusion extends to the subsequent integration steps. From these equations, it follows that for all l, U_{L−l} can be diagonalized by the eigenvectors of the mean squared readout matrix. Writing the eigenvalue matrix of the readout matrix as diag(r_{1,L−l}, ..., r_{m,L−l}), the renormalization eigenvalue matrix can be written as diag(u_{1,L−l}, ..., u_{m,L−l}) = V_{L−l}^T U_{L−l} V_{L−l}, and Eq. C7 reduces to independent equations for the eigenvalues {u_{k,L−l}}_{1≤k≤m}, as given by Eq. 34. However, Eq. C7 holds only for α < 1; due to the singularity of K_{L−l} for l < L at α > 1, the equation for the eigenvalues of U_{L−l} is replaced by Eq. 14 for narrow networks (see SM IIIB for details).
For wide networks, we can calculate the statistics of Y^T K_l^{-1} Y with a similar approach as in the single-output case, by relating it to the average of t (see SM IIIA for details), obtaining Eq. 37. For narrow networks, we calculate the statistics of Y^T K_l^+ Y; for the same reason as in the single-output calculations, we relate this quantity to the second-order moments of t. The procedure is again similar to the single-output case in Appendix A, and we obtain Eqs. 37, 41 (see SM IIIC for details).
Eq. C8 holds for all α as long as α_0 < 1. We also note that a straightforward generalization of Eq. A29 leads to a corresponding relation for the matrix order parameters, which also holds for all 0 ≤ α < ∞.
Generalization for multiple outputs: The generalization error and related quantities can be calculated similarly as in Appendix B, replacing the scalar order parameter with a matrix.
Integrating over t and taking l = L,
we obtain the predictor statistics as described in Section IV A.

Multiple outputs at finite temperature: In Section IV A we focused on results for multiple outputs at zero temperature. Here we present the results for multiple outputs at finite T (see SM IVB for detailed derivations). The partition function after integrating over l layers involves Ŷ, an mP-dimensional vector denoting the vectorized Y. In the corresponding saddle-point equation for U_{L−l} for 1 ≤ l ≤ L, the matrices on the RHS before taking the trace are of dimension mP × mP, and Tr_P denotes summing the P diagonal blocks of size m × m. Unlike the zero-temperature case, we cannot obtain saddle-point equations for each of the eigenvalues of U_{L−l}. The kernel undergoes a kernel renormalization in the form of a Kronecker product with the renormalization matrix U_l, and the predictor statistics of the multiple-output case at finite temperature follow accordingly. See an alternative formulation of the results, without the Kronecker product, in SM IVB.

Appendix D: Weight Covariance and Mean Layer Kernels
We derive the mean layer kernels for a multiple-output network, starting from the calculation of ⟨w_L w_L^T⟩_L, where w_L is the weight vector corresponding to a single node in the L-th hidden layer, conditioned on the weights of the previous L − 1 layers. Using Eqs. C1, C2, this quantity can be expressed through a function A(t) averaged over t. Plugging A(t) back in, the term Tr(W^T W) does not depend on t, and we therefore ignore it below for simplicity.
We first compute the integral over t by introducing the order parameters U_{L−1} and H_{L−1} as in Appendix C; with the change of variables t̃ = K_{L−1}^{1/2} t and the saddle-point relation U_{L−1}(I + H_{L−1}) = σ^2 I, the integration over t takes a Gaussian form. Plugging this back into Eq. D3 yields the mean layer kernel. In particular, the weight variance ⟨w_L^T w_L⟩_L shows that while the GP term is of O(N), as expected, the non-GP correction term is of O(1).
Using similar methods, we can derive the mean kernels for all layers; since K_{L−l} involves X_{L−l−1}^T w_{L−l}^T w_{L−l} X_{L−l−1}, the mean layer kernels follow from the weight covariances.

Instead of the 'standard' input statistics of many synthetic models, namely inputs sampled i.i.d. from a normal distribution, we assume a 'template' model, in which inputs are clustered and the target rule largely obeys this structure, in order to introduce a strong correlation between the input structure and the target outputs, as explained below. Such examples are common in practice, for instance in MNIST and CIFAR-10, where the network is trained on clustered input data and the target labels exhibit significant correlations with the cluster structure (see Figs. 9, 12, 14(k-o) for results on the MNIST example). Because of these input-output correlations, the template model can yield good generalization performance even well below the interpolation threshold. (The predicted dependences of the system's properties on the various parameters are general.) In principle, both training and test data would be sampled from the same cluster statistics.
To simplify the analysis, we instead assume that the training inputs consist of the cluster centers, or what we call the templates. The test inputs are sampled by adding Gaussian noise to the training inputs, such that the test data are clustered around the training data, as illustrated in Fig. 3(a).
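A minimal sketch of this construction (the test-point formula below is our reading of Eq. E1, chosen so that the expected squared norm is preserved; labels follow the noisy linear teacher described next, and all parameter values are illustrative):

```python
import numpy as np

# Sketch of the template model: training inputs are the templates themselves,
# and test inputs are noisy copies clustered around them. We take the test
# points as x_test = sqrt(1 - gamma) * x + sqrt(gamma) * eta (our reading of
# Eq. E1); labels come from a noisy linear teacher y = w0.x/sqrt(N0) + noise.

rng = np.random.default_rng(4)
N0, P = 1000, 300
gamma, sigma0, sigma_w = 0.05, 0.1, 1.0

X = rng.standard_normal((N0, P))                     # templates = training inputs
w0 = sigma_w * rng.standard_normal(N0)               # linear teacher weights
Y = (w0 @ X) / np.sqrt(N0) + sigma0 * rng.standard_normal(P)

eta = rng.standard_normal((N0, P))                   # test-point noise
X_test = np.sqrt(1 - gamma) * X + np.sqrt(gamma) * eta
Y_test = Y                                           # each test point inherits its template's label
print(X_test.shape)                                  # (1000, 300)
```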
x^µ_test = √(1 − γ) x^µ + √γ η    (E1)

We consider two types of labeling of the data, which differ in task complexity. One is a noisy linear teacher task, where the labels are generated by a noisy linear teacher network, y = (1/√N_0) w_0^T x + σ_0 η. Here w_0 ∼ N(0, σ_w^2 I) and η ∼ N(0, I) are both Gaussian i.i.d., σ_w represents the amplitude of the linear teacher weights, and σ_0 represents the noise level of the linear teacher. Here the optimal weights of the linear DNNs are those that yield a linear input-output mapping identical to that given by the teacher weights. Because of the linearity of the rule, if σ_0 is small the system can yield a small training error even when P > N_0. If γ is also small, the system will again yield a good generalization error, approaching its minimum on the RHS of the interpolation threshold (α_0 > 1), where the network is in the under-parameterized regime, and converging to the optimal error for α_0 ≫ 1; see Fig. 6(a)-(d). The second task is random labeling of the data clusters that are centered around the templates. We assign random binary labels Y^µ ∈ {−1, 1} to the data x^µ. A test point is assigned the label Y^µ if it was generated by adding noise to the training point x^µ. Here, for P > N_0 the task is inherently nonlinear (even for small γ), and the minimum generalization error is achieved on the LHS of the interpolation threshold (α_0 < 1), because small P implies not only a small training set but also an easier task.

Parameters for the noisy linear teacher: Fig. 3: The parameters are N_0 = 1000, P = 300, γ = 0.05, σ_0 = 0.1, σ_w = 1. For the top panels (c-f), we are in the small-noise regime where the generalization error decreases with N; here we choose σ = 0.5. For the bottom panels (g-j), we are in the larger-noise regime where the generalization error increases with N, with σ = 1.3. Fig. 4: The parameters are N_0 = 200, P = 100, γ = 0.05, σ_0 = 0.1.
For the top panels (a-d), we are in the sub-regime where the generalization error decreases with L; here the parameters are σ_w = 0.3, σ = 1.1, α = 0.5. For the middle panels (e-h), we are in the sub-regime where the generalization error increases with L but approaches a finite limit as L → ∞; here the parameters are σ_w = 0.9, σ = 1.35, α = 0.5. For the bottom panels (i-l), we are in the high-noise regime where the generalization error increases with L and diverges as L → ∞; here the parameters are σ_w = 0.9, σ = 1.35, α = 0.4. Fig. 6(a-d): The parameters are N_0 = 500, N = 200, γ = 0.1, σ_0 = 0.3, σ_w = 1, σ = 1. Fig. 11: The parameters are N_0 = 400, P = 100, γ = 0.05, σ_0 = 0.1, σ_w = 1. For the top panels (a-d), we are in the small-noise regime where the generalization error decreases with N; here σ = 1. For the bottom panels (e-h), we are in the high-noise regime where the generalization error increases with N; here σ = 2. Fig. 13: The parameters are N_0 = 100, P = 200, γ = 0.05, σ_0 = 0.1, σ_w = 1. For the top panels (a-d), we are in the small-noise regime where the generalization error decreases with N; here σ = 1. For the bottom panels (e-h), we are in the high-noise regime where the generalization error increases with N; here σ = 1.3. Fig. 14(a-e): The simulation parameters are N_0 = 20, P = 300, γ = 0.05, σ_0 = 0.3, σ_w = 1, σ = 1. Fig. 15(a-d): The simulation parameters are N_0 = 200, P = 100, γ = 0.05, α = 0.7, σ_0 = 0.1, σ_w = 0.3, σ = 1.

Parameters for the random cluster labeling: In Fig. 8, we present the layerwise mean kernels trained on a synthetic example with an output similarity matrix that exhibits a block structure. The parameters used in the simulation are N = N_0 = 100, P = 80, σ_0 = 0.1. We choose σ = 0.1, so that the non-GP correction term (∼ σ^2/N) is of the same order as the input term (∼ σ^4) for the single-hidden-layer network we consider.
In Fig. 8(c,d) we show the simulation and theoretical results for the non-GP correction, given by (σ^2/N) Y V U_1 V^T Y^T.

c) Binary classification of randomly projected MNIST data: The example shown in Fig. 9 is trained on a binary classification task on a subset of randomly projected MNIST data. For MNIST the input dimension is fixed at 784, the number of pixels in the images. In this example, we first normalize and center the data so that it has zero mean and standard deviation 1, then randomly project the MNIST data to N_0 dimensions with a Gaussian i.i.d. weight matrix W_0 ∈ R^{N_0×784}, W_0 ∼ N(0, I), and apply a ReLU nonlinearity to the projected data.
We then train the network with the inputs x^µ (µ = 1, ..., P) and their corresponding labels. In Fig. 9 the network is trained on a subset of the MNIST data with 4 different digits (1, 5, 6, 7). The output of the network is 6-dimensional (y ∈ {1, −1}^6), designed to have a hierarchical block structure: 4 of the binary outputs are 'one-hot' vectors, each encoding one digit; the 4 digits are divided into 2 categories ((1,7) and (5,6)), and the other two binary outputs each classify one of the two categories. The parameters are N = N_0 = 1000, P = 100. We again choose a small σ (σ = 0.1) so that the non-GP correction term becomes evident in the layerwise mean kernels.
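The preprocessing for this example can be sketched as follows (random data stands in for the MNIST images; W_0 and the sizes match the description above):

```python
import numpy as np

# Sketch of the preprocessing for the randomly projected MNIST example:
# standardize the raw inputs, project with a random Gaussian matrix W0, and
# apply a ReLU. Random data stands in for the MNIST images here.

rng = np.random.default_rng(6)
P, D, N0 = 100, 784, 1000                      # samples, raw pixels, projected dim

raw = rng.random((P, D))                       # stand-in for MNIST images
raw = (raw - raw.mean()) / raw.std()           # zero mean, unit standard deviation
W0 = rng.standard_normal((N0, D))              # random projection, W0 ~ N(0, I)
X = np.maximum(0.0, raw @ W0.T)                # ReLU nonlinearity; inputs are rows
print(X.shape)                                 # (100, 1000)
```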
In Fig. 14 we use the same example to train a ReLU network with a single output to perform binary classification on the 2 digits 0 and 1. The parameters are N_0 = 20, P = 300, σ = 1.

d) Binary classification on subsets of the MNIST data: We show in Fig. 12 the results for a ReLU network trained on a binary classification task on a subset of the MNIST data directly. The MNIST data determine the input dimension N_0 = 784. We normalize and center the data so that it has zero mean and standard deviation 1. We train on a subset of the MNIST data with digits 0 and 1; the network output is y ∈ {1, −1}. The parameters are N_0 = 784, P = 100. For the top panels (a-d), we are in the small-noise regime where the generalization error decreases with N; here σ = 0.8. For the bottom panels (e-h), we are in the large-noise regime where the generalization error increases with N; here σ = 1.3.

Langevin dynamics
We run simulations to sample from the Gibbs distribution corresponding to the energy E given by Eq. 2, defined in Section II, and compute statistics from the distribution to compare with our theory. We use the well-known result that Langevin dynamics generates a time-dependent distribution on the state space that converges at long times to the Gibbs distribution. At each iteration of the Langevin dynamics, we compute the predictor on a set of new points and r_l for the current weights, thereby obtaining samples of the predictor and r_l from the underlying Gibbs distribution. We can then calculate statistics of the predictor, including ⟨f(x)⟩, ⟨δf(x)^2⟩, and r_l/r_0, and compare them with our theoretical results. Because we focused mostly on T → 0, in our simulations we also choose a small T (T = 0.001 for the results presented in all figures except Figs. 8, 9, where, due to the small σ values, we needed to use T = 0.0001).
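A minimal sketch of such a sampler for a one-hidden-layer linear network. The potential below, U = E_train + (T/2σ^2)||weights||^2, is our reading of the Gibbs measure (training energy at temperature T times the Gaussian weight prior), not the verbatim Eq. 2, and all sizes, step size, and step count are illustrative:

```python
import numpy as np

# Minimal Langevin sampler for a one-hidden-layer linear network. We assume
# (our reading, not the exact Eq. 2) the potential
#   U(W, a) = E_train + (T / (2 sigma^2)) * (||W||^2 + ||a||^2),
# so the stationary measure is exp(-E_train/T) times the Gaussian weight prior.

rng = np.random.default_rng(5)
N0, N, P = 30, 40, 10
sigma2, T, eta, steps = 1.0, 1e-3, 5e-4, 50000

X = rng.standard_normal((N0, P))                     # training inputs
Y = rng.standard_normal(P)                           # training labels
W = np.sqrt(sigma2) * rng.standard_normal((N, N0))   # hidden-layer weights
a = np.sqrt(sigma2) * rng.standard_normal(N)         # readout weights

def train_error(W, a):
    f = a @ (W @ X) / np.sqrt(N * N0)                # network predictor on the data
    return 0.5 * np.sum((f - Y) ** 2)

E0 = train_error(W, a)
for _ in range(steps):
    err = a @ (W @ X) / np.sqrt(N * N0) - Y          # residuals on the P points
    gW = np.outer(a, X @ err) / np.sqrt(N * N0) + (T / sigma2) * W
    ga = (W @ X) @ err / np.sqrt(N * N0) + (T / sigma2) * a
    W += -eta * gW + np.sqrt(2 * eta * T) * rng.standard_normal(W.shape)
    a += -eta * ga + np.sqrt(2 * eta * T) * rng.standard_normal(a.shape)

print(E0, train_error(W, a))   # initial vs final training error
```

At long times, snapshots of the predictor and of r_l along this trajectory provide the Gibbs-distribution samples used in the comparisons with the theory.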

Finite T effects
In the simulations of Langevin dynamics, we chose a small T in order to compare with the T → 0 theoretical results. In most cases presented in this paper, choosing T = 0.001 (or, in Section V A, T = 0.0001) is sufficient to approximate the T → 0 limit. However, in Fig. 6, where we consider the dependence of the generalization error on P, the kernel K_0 becomes singular as α_0 → 1, and the closer α_0 is to 1, the lower the temperature needs to be to approximate the zero-T limit. For this reason, in Fig. 6 the solid curves show the theory for finite T with T = 0.001, as in the simulation, with the mean predictor and the variance ⟨δf(x)^2⟩ given by Eqs. 47, 48, and the order parameter u_0 given by Eq. 46 with l = L. In SM IIE we show the same results as in Fig. 6, but with an extra curve plotting the theoretical result at zero temperature. We see that T = 0.001 is a good approximation to the zero-T theory when α_0 is not close to the interpolation threshold α_0 = 1, but deviates from it as α_0 approaches this threshold.