Modelling the influence of data structure on learning in neural networks: the Hidden Manifold Model



I. INTRODUCTION
The data sets on which modern neural networks are most successful, such as images [1,2] or natural language [3], are characterised by complicated correlations. Yet, most theoretical works on neural networks in statistics or theoretical computer science do not model the structure of the training data at all [4,5], which amounts to assuming that the data set is chosen in a worst-case (adversarial) manner. A line of theoretical works complementary to the statistics approach emanated from statistical physics [6][7][8][9]. These works model inputs as elementwise i.i.d. draws from some probability distribution, with labels that are either random or given by some random, but fixed function of the inputs. Despite providing valuable insights, these approaches are by construction blind to even basic statistical properties of real-world data sets such as their covariance structure. This lack of mathematical models for data sets is a major impediment for understanding the effectiveness of deep neural networks.
The structure present in realistic data sets can be illustrated well with classic data sets for image classification, such as the handwritten digits of MNIST [10] or the images of CIFAR10 [11]. The inputs that the neural network has to classify are images, so a priori the input space is the high-dimensional R^N, corresponding to the number of pixels, with N large. However, the inputs that can be recognised as actual images rather than random noise span only a lower-dimensional manifold within R^N, see Fig. 1. This manifold hence constitutes the actual input space, or the "world", of our problem. While the manifold is not easily defined, it is tangible: for example, its dimension can be estimated based on the neighbourhoods of inputs in the data set [12-15], and was found to be around D ≈ 14 for MNIST and D ≈ 35 for CIFAR10, compared to N = 784 and N = 3072, respectively. We will call inputs structured if they are concentrated on a lower-dimensional manifold and thus have a lower-dimensional latent representation, which consists of the coordinates of the input within that manifold. A complementary view on the data manifold is provided by today's most powerful generative models, Generative Adversarial Networks (GANs) [16]. A GAN G is a neural network that is trained to take random noise as its input and to transform it into outputs that resemble a given target distribution. For example, GANs can generate realistic synthetic images of human faces [17,18]. From this point of view, the mapping from the hidden manifold to the input space is given by the function that the GAN G computes.

A. Main results
In this paper, and specifically in Sec. II, we introduce a generative model for structured data sets in the above sense that we call the hidden manifold model (HMM) [19]. It is a generative model that produces tuples (x, y*) of high-dimensional inputs x ∈ R^N and their scalar labels y*. The key idea is to construct the inputs such that they lie on a lower-dimensional manifold; their labels are then a function of only their position within that manifold. The way the inputs are generated is akin to a learnt single-layer decoder with random inputs; it can also be viewed as the single-layer generator network of a learnt GAN. As a result, inputs drawn from the HMM have non-trivial correlations, do not follow a normal distribution, and their labels y* cannot be written as a simple function of the inputs x.

FIG. 1. We illustrate the notion of a hidden manifold in input space using CIFAR10 example images. Each black point indicates a possible input in a high-dimensional input space R^N. Most points in this space cannot be interpreted as images at all; however, those points that can be interpreted as real images tend to concentrate on a lower-dimensional manifold, here sketched as a two-dimensional curved surface in a three-dimensional space. The intrinsic dimension D of these lower-dimensional manifolds has been measured numerically [12-15].
Our key theoretical result, presented in Sec. III, is to show that despite these correlations, the HMM is amenable to an analytical treatment in a thermodynamic limit of large dimensions N and D, large number of samples P, and fixed respective ratios D/N, P/N. We derive the solution by first demonstrating a "Gaussian Equivalence Property" or GEP (Proposition III.1). As a first application, we use the GEP to derive a set of integro-differential equations that captures the behaviour of two-layer neural networks, with K = O(1) hidden units, trained using stochastic gradient descent. These equations extend the classical analysis of the dynamics of two-layer neural networks on unstructured data [20-23] to the hidden manifold and provide detailed insight into the dynamics of learning.
We then use these equations to study the dynamics and the performance of two-layer neural networks trained on data generated by the HMM, in Sec. IV. We recover the specialisation of hidden units known from the canonical teacher-student model. We analyse the learning for different feature matrices, and show that Hadamard matrices perform slightly better than i.i.d. Gaussian ones. We show analytically that the generalisation performance deteriorates as the manifold dimension D grows. We show that the learning rate has a very minor influence on the asymptotic error, and analyse how the final error of the network changes as a function of the width of the hidden layer.
Sec. V is devoted to a comparison of learning on the HMM and on real data sets such as MNIST [10], fashion-MNIST [24] or CIFAR10 [11]. In particular, we demonstrate that neural networks learn functions of increasing complexity over the course of training on both the HMM and real data sets. We also compare the memorisation of some samples during the early stages of training between the HMM and real data. These comparisons provide strong evidence that the HMM captures the properties of learning with one-pass SGD and two-layer neural networks on some of the standard benchmark data sets rather faithfully.

B. Further related work
The need for models of structured data. Several works have recognised the importance of the structure in the data sets used for machine learning, and in particular the need to go beyond the simple component-wise i.i.d. modelling [25][26][27][28][29][30]. While we will focus on the ability of neural networks to generalise from examples, two recent papers studied a network's ability to store inputs with lower-dimensional structure and random labels: Chung et al. [31] studied the linear separability of general, finite-dimensional manifolds and their interesting consequences for the training of deep neural networks [32,33], while Rotondo et al. [34] extended Cover's classic argument [35] to count the number of learnable dichotomies when inputs are grouped in tuples of k inputs with the same label. Recently, Yoshida and Okada [36] analysed the dynamics of online learning for data having an arbitrary covariance matrix, finding an infinite hierarchy of ODEs. Their study implicitly assumes that inputs have a Gaussian distribution, while our approach handles a more general data structure. The importance of the spectral properties of the data was recognised for learning in deep neural networks by Saxe et al. [37] in the special case of linear neurons, where the whole network performs a linear transformation of the data.
Relation to random feature learning. The hidden manifold model has an interesting link to random feature learning with unstructured i.i.d. input data. The idea of learning with random features goes back to the mechanical perceptron of the 1960s [38] and was extended into the random kitchen sinks of Rahimi and Recht [39,40]. Remarkably, random feature learning in the same scaling limit as used in the theoretical part of this paper was analysed in several recent and concurrent works, notably in [41,42] for ridge regression, and in [43] for max-margin linear classifiers. These papers consider full batch learning, i.e. all samples are used at the same time, which makes one difference from our online (one-pass stochastic gradient descent) analysis. Another important difference is that we study learning in a neural network with two layers of learned weights, while the existing works study simpler linear (perceptron-type) architectures where only one layer is learned. Perhaps more importantly, in our analysis the features do not need to be random, but can be chosen deterministically or even be learnt from data using a GAN or an autoencoder. The principles underlying the analytic solution of this paper as well as [41][42][43] rely on the Gaussian Equivalence Property, which is stated and used independently in those papers.
Gaussian equivalence and random matrix theory. Special cases of the Gaussian Equivalence Property were in fact derived previously using random matrix theory in [44][45][46][47], and this equivalent Gaussian covariates mapping was explicitly stated and used in [42,43]. This formulation has recently been extended to a broader setting of concentrated vectors encompassing data coming from a GAN in [48,49], a version closer to our formulation.

C. Reproducibility
We provide the full code to reproduce our experiments as well as an integrator for the equations of motion of two-layer networks online [50].

D. Learning setup
This paper focuses on the dynamics and performance of fully-connected two-layer neural networks with K hidden units and first- and second-layer weights W ∈ R^{K×N} and v ∈ R^K, respectively. Given an input x ∈ R^N, the output of a network with parameters θ = (W, v) is given by

φ(x, θ) = Σ_{k=1}^{K} v_k g(w_k · x / √N),   (1)

where w_k is the kth row of W, and g : R → R is the non-linear activation function of the network, acting component-wise. We study sigmoidal and ReLU networks with g(x) = erf(x/√2) and g(x) = max(0, x), respectively. We will train the neural network on data sets with P input-output pairs (x_µ, y*_µ), µ = 1, ..., P, where we use the starred y*_µ to denote the true label of an input x_µ. Networks are trained by minimising the quadratic training error E(θ) = ½ Σ_{µ=1}^{P} Δ_µ², with Δ_µ = φ(x_µ, θ) − y*_µ, using stochastic (one-pass, online) gradient descent (SGD) with constant learning rate η and mini-batch size 1, Eq. (2). Initial weights for both layers were always taken component-wise i.i.d. from the normal distribution with mean 0 and standard deviation 10^{-3}.
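For concreteness, the forward pass of Eq. (1) and a single online SGD step on the per-sample quadratic loss can be sketched as follows. This is a minimal illustration: the plain per-sample gradient step and the use of a single learning rate for both layers are simplifying assumptions of the sketch, not a verbatim transcription of Eq. (2).

```python
import numpy as np
from scipy.special import erf

g  = lambda z: erf(z / np.sqrt(2))                     # sigmoidal activation
dg = lambda z: np.sqrt(2 / np.pi) * np.exp(-z**2 / 2)  # its derivative

def student_output(x, W, v):
    """Two-layer network of Eq. (1): pre-activations lambda_k = w_k . x / sqrt(N)."""
    lam = W @ x / np.sqrt(x.size)
    return v @ g(lam), lam

def sgd_step(x, y_star, W, v, eta):
    """One online SGD step on the per-sample quadratic loss 1/2 * Delta_mu^2."""
    phi, lam = student_output(x, W, v)
    delta = phi - y_star                               # prediction error Delta_mu
    W -= eta * np.outer(delta * v * dg(lam), x) / np.sqrt(x.size)
    v -= eta * delta * g(lam)
    return W, v

# initial weights: component-wise i.i.d. normal with mean 0 and standard deviation 1e-3
rng = np.random.default_rng(0)
N, K = 784, 4
W = 1e-3 * rng.standard_normal((K, N))
v = 1e-3 * rng.standard_normal(K)
```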
The key quantity of interest is the test error or generalisation error of a network, i.e. the mean squared error between its predictions and the labels of a test set composed of P* input-output pairs (x_µ, y*_µ), µ = 1, ..., P*, that are not used during training. The test set in our setting is generated by the same probabilistic model that generated the training data.

The canonical teacher-student model
The joint probability distribution of input-output pairs (x µ , y * µ ) is inaccessible for realistic data sets such as CIFAR10, preventing analytical control over the test error and other quantities of interest. To make theoretical progress, it is therefore promising to study the generalisation ability of neural networks for data arising from a probabilistic generative model.
A classic model for data sets is the canonical teacher-student setup where inputs x_µ are drawn element-wise i.i.d. from the standard normal distribution and labels are given by a random, but fixed, neural network with weights θ* acting on the inputs: y*_µ = φ(x_µ, θ*). The network that generates the labels is called the teacher, while the network that is trained is called the student. The model was introduced by Gardner & Derrida [6], and its study has provided many valuable insights into the generalisation ability of neural networks from an average-case perspective, particularly within the framework of statistical mechanics [7-9,51-56], and also in recent works in theoretical computer science, e.g. [42,57-59]. However, it has the notable shortcoming that its analysis crucially relies on the fact that inputs are i.i.d. Gaussians and hence uncorrelated.

II. THE HIDDEN MANIFOLD MODEL
We now introduce a new generative probabilistic model for structured data sets with correlations. To generate a data set containing P inputs in N dimensions, we first choose D feature vectors f_r, r = 1, ..., D. These are vectors in N dimensions and we collect them in a feature matrix F ∈ R^{D×N}. Next we draw P vectors c_µ with random i.i.d. components drawn from the normal distribution with mean zero and unit variance and collect them in the matrix C ∈ R^{P×D}. The vector c_µ gives the coordinates of the µth input on the lower-dimensional manifold spanned by the feature vectors in F. We will call c_µ the latent representation of the input x_µ, which is given by the µth row of

X = f(CF/√D),   (4)

where f is a non-linear function acting component-wise.

FIG. 2. The hidden manifold model proposed here is a generative model for structured data sets, where inputs x, Eq. (4) (blue and green balls), concentrate on a lower-dimensional manifold in input space (yellow surface). Their labels y* are a function of their position on the manifold; here we show the setup of a classification task with two classes y* = ±1. In our analysis the labels are generated according to Eq. (5).
In this model, the "world" of the data on which the true label can depend is a D-dimensional manifold, which is obtained from the linear span of F through a "folding" process induced by the nonlinear function f . We note that the structure of data of the same type arises in a learned variational autoencoder network [60] with single layer, or in a learned GAN network [16] with a single layer generator network, the matrix C then corresponds to the random input, the F to the learned features, f is the corresponding output activation. The matrix F can be generic with a certain normalisation, such that its elements are O(1). For our analysis to be valid, we will later assume the normalisation given in Eq. (13) and balance condition given by (14), other than that our analysis hold for arbitrary matrices F . The labels are obtained by applying a two-layer neural network with weights θ = ( W ∈ R M ×D , v ∈ R M ) within the unfolded hidden manifold according to We draw the weights in both layers component-wise i.i.d. from the normal distribution with unity variance, unless we note it otherwise. The key point here is the dependency of labels y * µ on the coordinates of the lower-dimensional manifold c µ rather than on the highdimensional data x µ . We expect the exact functional form of this dependence not to be crucial for the empirical part of this work, and that there are other forms that would present the same behaviour. Notably it would be interesting to consider ones where the latent representation is conditioned to the labels as in conditional GANs [61] or the manifold model of [33].

III. THE SOLUTION OF THE HIDDEN MANIFOLD MODEL
A. The Gaussian Equivalence Property

The difficulty in analysing the HMM comes from the fact that the various components of one given input pattern, say x_µi and x_µj, are correlated. Yet, a key feature of the model is that it is amenable to an analytical treatment.
To that end, we will be studying the standard thermodynamic limit of the statistical physics of learning where the size of the input space N → ∞, together with the number P → ∞ of patterns that are presented for learning, while keeping the ratio α ≡ P/N fixed. In statistics this corresponds to the challenging high-dimensional limit. The hidden manifold model can then be studied analytically if one assumes that the latent dimension D, i.e. the dimension of the feature space, also scales with N, meaning that it goes to ∞ with a fixed ratio δ ≡ D/N which is of order 1 with respect to N, so that we have

N, P, D → ∞, with fixed α ≡ P/N and δ ≡ D/N.   (6)

In this limit, the relevant variables are the "local fields" or pre-activations that are acted upon by the neurons in the hidden layer. They can be shown to follow a Gaussian distribution in the thermodynamic limit (6). We will now make this statement precise by formulating a "Gaussian Equivalence Property" (GEP). We will demonstrate the power of this equivalence by deriving a set of exact equations for online learning in Sec. III B.

Statement of the Property
Let {C_r}, r = 1, ..., D, be D i.i.d. Gaussian random variables distributed as N(0, 1). In the following we shall denote by E the expectation value with respect to this distribution. Define N variables u_i, i = 1, ..., N, as linear superpositions of the C_r variables,

u_i ≡ (1/√D) Σ_r F_ir C_r,   (7)

and M variables ν_m, m = 1, ..., M, as other linear superpositions,

ν_m ≡ (1/√D) Σ_r w̃^m_r C_r,   (8)

where the w̃^m_r are the teacher weights of Eq. (5). Define K variables λ_k as linear superpositions of f(u_i), where f is an arbitrary function:

λ_k ≡ (1/√N) Σ_i w^k_i f(u_i),   (9)

where the w^k_i are the student weights of Eq. (1). We will occasionally write (λ, ν) to denote the tuple of all local fields λ_k and ν_m. Denoting by ⟨ψ(u)⟩ the expectation of a function ψ(u) when u is a normal variable with distribution u ∼ N(0, 1), we also introduce for convenience the "centered" variables

λ̄_k ≡ (1/√N) Σ_i w^k_i [f(u_i) − a],  with a ≡ ⟨f(u)⟩.   (10)

As the C_r are Gaussian, the u_i variables are also Gaussian variables, with mean zero and a matrix of covariance

U_ij ≡ E u_i u_j = (1/D) Σ_r F_ir F_jr.   (11)

Note that the covariances of the u_i variables scale in the thermodynamic limit as

U_ii = O(1),  U_ij = O(1/√D)  for i ≠ j.   (12)

We assume that, in the thermodynamic limit, the W, W̃ and F matrices have elements of O(1) and that, for i = j,

U_ii = (1/D) Σ_r F_ir² = 1.   (13)

Notice that the only variables which are drawn i.i.d. from a Gaussian distribution are the coefficients C_r. Most importantly, the matrices F and W can be arbitrary (and deterministic) as long as they are "balanced" in the sense that, for all p, q ≥ 1 and all distinct indices k_1, ..., k_p and distinct r_1, ..., r_q, we have

S^{k_1 k_2 ... k_p}_{r_1 r_2 ... r_q} ≡ (1/√N) Σ_i w^{k_1}_i ⋯ w^{k_p}_i F_{i r_1} ⋯ F_{i r_q} = O(1).   (14)

We also have a similar scaling for the combinations involving the teacher weights w̃^m_r. This is the key assumption behind the Gaussian Equivalence Property, and we will discuss its interpretation immediately after the statement of the GEP:

Property III.1 (Gaussian Equivalence Property, GEP). In the asymptotic limit when N → ∞, D → ∞, with K, M and the ratio D/N finite, and under the assumption (14), {λ_k} and {ν_m} are K + M jointly Gaussian variables, with mean

E λ̄_k = E ν_m = 0   (15)

and covariances

Q_kℓ ≡ E λ̄_k λ̄_ℓ = (c − a² − b²) W_kℓ + b² Σ_kℓ,   (16)

R_km ≡ E λ̄_k ν_m = b S_km,   (17)

T_nm ≡ E ν_n ν_m = (1/D) Σ_r w̃^n_r w̃^m_r.   (18)

The "folding function" f(·) appears through the three coefficients a, b, c, which are defined as

a ≡ ⟨f(u)⟩,  b ≡ ⟨u f(u)⟩,  c ≡ ⟨f(u)²⟩,   (19)

where ⟨ψ(u)⟩ denotes the expectation value of the function ψ when u ∼ N(0, 1) is a Gaussian variable. The covariances are defined in terms of the three matrices

S_km ≡ (1/(D√N)) Σ_i Σ_r w^k_i F_ir w̃^m_r,   (20)

W_kℓ ≡ (1/N) Σ_i w^k_i w^ℓ_i,   (21)

Σ_kℓ ≡ (1/D) Σ_r S^k_r S^ℓ_r,  with S^k_r ≡ (1/√N) Σ_i w^k_i F_ir,   (22)

whose elements are assumed to be of order O(1) in the asymptotic limit. The derivation of the property is given in Appendix A.
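As a quick numerical illustration of the GEP, the following sketch compares the empirical covariance of the student local fields λ_k with the Gaussian prediction (c − b²) W_kℓ + b² Σ_kℓ, the form it takes here because the odd folding function gives a = 0; the system sizes and the choice f = erf are illustrative.

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
N, D, K, n_samples = 1000, 100, 3, 10000
f = lambda u: erf(u / np.sqrt(2))          # odd folding function, so a = <f(u)> = 0

F = rng.standard_normal((D, N))            # feature matrix
W = rng.standard_normal((K, N))            # student weights, not aligned with F

# coefficients b = <u f(u)> and c = <f(u)^2>, estimated by Monte Carlo
u = rng.standard_normal(10**6)
b, c = np.mean(u * f(u)), np.mean(f(u) ** 2)

# empirical covariance of the local fields lambda_k = w_k . f(c F / sqrt(D)) / sqrt(N)
C = rng.standard_normal((n_samples, D))
lam = f(C @ F / np.sqrt(D)) @ W.T / np.sqrt(N)
Q_emp = lam.T @ lam / n_samples

# GEP prediction Q_kl = (c - b^2) W_kl + b^2 Sigma_kl for a = 0
W_amb = W @ W.T / N                        # ambient student-student overlap W_kl
S = W @ F.T / np.sqrt(N)                   # S^k_r = sum_i w^k_i F_ir / sqrt(N)
Sigma = S @ S.T / D                        # latent student-student overlap Sigma_kl
Q_gep = (c - b**2) * W_amb + b**2 * Sigma

print(np.abs(Q_emp - Q_gep).max())         # shrinks as N, D and n_samples grow
```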
In Sec. III B, we will see that the GEP allows us to develop an analytical understanding of learning with the hidden manifold model. We first discuss several aspects of the GEP in detail.

Discussion
The Gaussian Equivalence Property states that the local fields (λ, ν) follow a joint normal distribution if the weights of the student fulfil the balance condition (14).
In the simplest case, where the x_i are element-wise i.i.d. Gaussian, joint Gaussianity of (λ, ν) follows immediately from the central limit theorem (CLT). The CLT can also be applied directly when input vectors x are drawn from a multi-dimensional Gaussian with fixed covariance matrix, which is the setup of Yoshida and Okada [36]. In the Hidden Manifold Model considered here, the inputs are x_i = f(u_i), and thus the x_i are not normally distributed and have a non-trivial covariance matrix. The GEP can be seen as a central limit theorem for sums of weakly correlated random variables, λ_k ∼ Σ_i w^k_i x_i. In this case, the GEP establishes that λ_k is Gaussian, provided that the weights of the student w^k_i do not align "too much" with the weights of the generator F_ir. More precisely, we require that a sum such as S^k_r = (1/√N) Σ_i w^k_i F_ir remains of order 1 in the thermodynamic limit (6). The balance condition is a generalisation of this idea to the higher-order tensors defined in Eq. (14).
The expansion from the hidden manifold in R^D to the input space R^N can equivalently be seen as a noisy transformation of the latent variables C. As far as the local fields (λ, ν) are concerned, we can replace the data matrix X = f(CF/√D) by

X ≃ a 1 + b CF/√D + √(c − a² − b²) Z,   (23)

where Z is a P × N matrix with entries drawn i.i.d. from the normal distribution and 1 is a matrix of the same size as X with all entries equal to one. We use the symbol ≃ here to emphasise that the two matrices on the left- and right-hand side have matching first and second moments, and are hence equivalent in terms of their low-dimensional projections, but are not the same matrix. We can thus think of the inputs X as a noisy transformation of the latent variables, even without any explicit noise in Eq. (4).
We could also add noise to the expansion explicitly, with a noise matrix ζ of appropriate dimensions. These noise injections would indeed make the data high-dimensional, or, if added directly to the latent variables C, result in correlated noise in the input space. In all these cases, the GEP applies and guarantees that the noise ζ would only change the variance of the noise term Z that appears in the equivalent Gaussian model (23). Our results are thus robust to the injection of additional noise. Finally, the GEP shows that there is a whole family of activation functions f(x) (those that have the same values for a, b and c from Eq. (19)) that will lead to equivalent analytical results for the learning curves studied in this paper.
Related results in random matrix theory. A result related to the Gaussian Equivalence Property was in fact known in random matrix theory [41,44-48]. These works study quantities that can be written as an integral over the spectral density of the distribution of inputs x, such as the test and training errors for linear regression problems. However, this spectral density is inaccessible analytically for realistic data. The key idea is then to re-write these integrals by replacing the intractable spectral density with the spectral density of a Gaussian model with matching first and second moments. Using tools from random matrix theory (RMT), one can show that certain integrals over both spectra coincide. This mapping was explicitly used in [42,43]. In order to apply tools from RMT, these works have to assume that the weights F of the generator are random. The advantage of the formulation of the GEP above is that it does not require the matrix F to be random, and it is valid as well for deterministic or learnt weight matrices, as long as the balance condition stated in Eq. (14) holds. This allows us to generalise these mappings to the case of deterministic features using Hadamard and Fourier matrices, such as the ones used in Fastfood [62] or ACDC [63] layers. These orthogonal projections are actually known to be more effective than purely random ones [64]. It also allows a generalisation of the analysis in this paper to data coming from a learned GAN, along the lines of [48,49]. We shall illustrate this point below by analysing the dynamics of online learning when the feature matrix F is a deterministic Hadamard matrix (cf. Sec. IV B).

B. The dynamics of stochastic gradient descent for the Hidden Manifold Model
To illustrate the power of the GEP, we now analyse the dynamics of stochastic gradient descent (2) in the case of online learning, where at each step of the algorithm µ = 1, 2, . . ., the student's weights are updated according to Eq. (2) using a previously unseen sample (x µ , y µ ). This case is also known as one-shot or single-pass SGD. The analysis of online learning has been performed previously for the canonical teacher-student model with i.i.d. Gaussian inputs [20][21][22][23]65], and has recently been put on a rigorous foundation [55]. Here, we generalise this type of analysis to two-layer neural networks trained on the Hidden Manifold Model.
The goal of our analysis is to track the mean-squared generalisation error of the student with respect to the teacher at all times,

ε_g(θ, θ̃) ≡ ½ E [φ(x, θ) − y*]²,   (24)

where the expectation E denotes an average over an input drawn from the hidden manifold model, Eq. (4), with label y*_µ = φ(c_µ, θ̃*) given by a teacher network with fixed weights θ̃* acting on the latent representation, Eq. (5). Note that the weights of both the student and the teacher, as well as the feature matrix F_ir, are held fixed when taking the average, which is an average only over the coefficients c_µr. To keep notation compact, we focus on cases where a = E f(u) = 0 in Eq. (19), which leads to λ̄_k = λ_k in Eq. (10). A generalisation to the case a ≠ 0 is straightforward but lengthy.
We can make progress with the high-dimensional average over x in Eq. (24) by noticing that the input x and its latent representation c only enter the expression via the local fields ν_m and λ_k, Eqs. (8, 9):

ε_g(θ, θ̃) = ½ E [Σ_k v_k g(λ_k) − Σ_m ṽ_m g̃(ν_m)]².   (25)

The average is now taken over the joint distribution of the local fields {λ_{k=1,...,K}, ν_{m=1,...,M}}. The key step is then to invoke the Gaussian Equivalence Property III.1, which guarantees that this distribution is a multivariate normal distribution with covariances Q_kℓ, R_km, and T_nm, Eqs. (16)-(18). Depending on the choice of g(x) and g̃(x), this makes it possible to compute the average analytically; in any case, the GEP guarantees that we can express ε_g(θ, θ̃) as a function of only the second-layer weights v_k and ṽ_m and the matrices Q_kℓ, R_km, and T_nm, which are called order parameters in statistical physics [20-22]:

ε_g(θ, θ̃) = ε_g(Q_kℓ, R_km, T_nm, v_k, ṽ_m),   (26)

where in taking the thermodynamic limit, we keep the ratio δ ≡ D/N finite (see Eq. (6)).

The physical interpretation of the order parameters
The order parameter R_kn, defined in Eqs. (17, 20), measures the similarity between the action of the kth student node on an input x_µ and the nth teacher node acting on the corresponding latent representation c_µ. In the canonical teacher-student setup, where (i) the input covariance is simply E x_i x_j = δ_ij and (ii) labels are generated by the teacher acting directly on the inputs x, it can be readily verified that the overlap has the simple expression R_kn ≡ E λ_k ν_n ∼ w_k · w̃_n. It was hence called the teacher-student overlap in the previous literature. In the HMM, however, where teacher and student networks act on different vector spaces, it is not a priori clear how to express the teacher-student overlap in suitable order parameters.
The matrix Q_kℓ = (c − b²) W_kℓ + b² Σ_kℓ quantifies the similarity between two student nodes k and ℓ, and has two contributions: the latent student-student overlap Σ_kℓ, which measures the overlap of the weights of two student nodes after they have been projected onto the hidden manifold, and the ambient student-student overlap W_kℓ, which measures the overlap between the vectors w_k, w_ℓ ∈ R^N. Finally, the overlaps of the teacher nodes are collected in the matrix T_nm, which is not time-dependent, as it is a function of the teacher weights only.

Statement of the equations of motion
We have derived a closed set of equations of motion that describe the dynamics of the order parameters R_km, Σ_kℓ, W_kℓ and v_k when the student is trained using online SGD, Eq. (2). We stress at this point that in online learning, at each step of SGD a new sample is given to the network. The weights of the network are thus uncorrelated with this sample, and hence the GEP can be applied at every step. This is in contrast with full-batch learning, where the correlations between weights and inputs have to be taken into account explicitly [43]. Integrating the equations of motion and substituting the values of the order parameters into Eq. (26) gives the generalisation error at all times. Here, we give a self-contained statement of the equations and relegate the details of the derivation to Appendix B. A key object in our analysis is the spectrum of the matrix

Ω_rs ≡ (1/N) Σ_i F_ir F_is.   (27)

We denote its eigenvalues and corresponding eigenvectors by ρ and ψ_ρ, and write p_Ω(ρ) for the distribution of eigenvalues. It turns out that it is convenient to rewrite the teacher-student overlap as an integral over a density r_km(ρ, t), which is a function of ρ and of the normalised number of steps t = P/N, which can be interpreted as a continuous time-like variable. We then have

R_km(t) = b ∫ dρ p_Ω(ρ) r_km(ρ, t),   (28)

with b ≡ ⟨u f(u)⟩ as defined in Eq. (19). In the canonical teacher-student model, introducing such a density and the integral that comes with it is not necessary; in the HMM it is a consequence of the non-trivial correlation matrix E x_i x_j between input elements. We adopt the convention that the indices j, k, ℓ, ι = 1, ..., K always denote student nodes, while n, m = 1, ..., M are reserved for teacher hidden nodes.
The equation of motion of the teacher-student density r_km(ρ, t), Eq. (29), involves the matrix T̃_nm, the overlap of the teacher weights after rotation into the eigenbasis of Ω_rs, weighted by the eigenvalues ρ, Eq. (30). In writing the equations, we used the following shorthand for the three-dimensional Gaussian averages introduced by Saad & Solla [22],

I_3(k, n, m) ≡ E g'(λ_k) ν_n g̃(ν_m).   (31)

Arguments passed to I_3 are translated into local fields on the right-hand side by the convention that the indices j, k, ℓ, ι always refer to student local fields λ_j, etc., while the indices n, m always refer to teacher local fields ν_n, ν_m. Similarly, having the index j as the third argument, as in I_3(k, n, j), means that the third factor is g(λ_j), rather than g̃(ν_m) as in Eq. (31). The average in Eq. (31) is taken over a three-dimensional normal distribution with mean zero and a covariance matrix built from the corresponding overlaps of the local fields involved. For the latent student-student overlap Σ_kℓ, it is again convenient to introduce a density σ_kℓ(ρ, t),

Σ_kℓ(t) = ∫ dρ p_Ω(ρ) σ_kℓ(ρ, t),   (34)

whose equation of motion is given in Eq. (35). This equation involves again the integrals I_3 and a four-dimensional average I_4 that we denote using the same notational conventions as for I_3; the corresponding four-dimensional covariance matrix is given in Eq. (36). The equation of motion for the ambient student-student overlap W_kℓ can be written directly, Eq. (38). Finally, the ODE for the second-layer weights v_k, Eq. (39), is given in terms of the final short-hand I_2(k, j) ≡ E g(λ_k) g(λ_j).

FIG. 3. The analytical description of the hidden manifold generalisation dynamics matches experiments even at moderate system size. We plot the time evolution of the generalisation error ε_g(α) and the order parameters R_km, Q_kℓ and second-layer weights v_k obtained by integration of the ODEs (solid) and from a single run of SGD (crosses).
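When closed forms for the Gaussian averages are not available for a particular pair of activation functions, they can be estimated by Monte Carlo over the corresponding low-dimensional Gaussian. In the sketch below, the covariance matrix is a hypothetical example, and the closed form quoted for I_2 is the classical result for g(x) = erf(x/√2) [21,22].

```python
import numpy as np
from scipy.special import erf

def gaussian_average(func, cov, n_samples=10**6, rng=None):
    """Monte Carlo estimate of E[func(z)] for z ~ N(0, cov), as a fallback for
    the averages I_2, I_3, I_4 when no closed form is available."""
    rng = np.random.default_rng(rng)
    z = rng.multivariate_normal(np.zeros(len(cov)), cov, size=n_samples)
    return func(z).mean()

g = lambda x: erf(x / np.sqrt(2))

# example: I_2(k, j) = E[g(lambda_k) g(lambda_j)] for a hypothetical 2x2 covariance
cov = np.array([[1.0, 0.3],
                [0.3, 1.0]])
I2_mc = gaussian_average(lambda z: g(z[:, 0]) * g(z[:, 1]), cov)

# closed form for g = erf(x/sqrt(2)):
I2_exact = (2 / np.pi) * np.arcsin(cov[0, 1] / np.sqrt((1 + cov[0, 0]) * (1 + cov[1, 1])))
print(I2_mc, I2_exact)
```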

Solving the equations of motion
The equations of motion are valid for any choice of f(x), g(x) and g̃(x). To solve the equations for a particular setup, one needs to compute the three constants a, b, c of Eq. (19) and the averages I_3 and I_4, Eqs. (31, 36). Choosing g(x) = g̃(x) = erf(x/√2), they can be computed analytically [21]. Finally, one needs to determine the spectral density of the matrix Ω_rs. When drawing the entries of the feature matrix F_ir i.i.d. from some probability distribution with finite second moment, the limiting distribution of the eigenvalues p_Ω(ρ) in the integrals (28) and (34) is the Marchenko-Pastur distribution [66]:

p_Ω(ρ) = √((ρ_+ − ρ)(ρ − ρ_−)) / (2π δ ρ)  for ρ ∈ [ρ_−, ρ_+],  with ρ_± ≡ (1 ± √δ)²,   (40)

where we recall that δ ≡ D/N. Note that our theory crucially also applies to non-random matrices; we will see such an example in Sec. IV B, where we also discuss the importance of this use case. A complete numerical implementation of the equations of motion is available on GitHub [50].
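As a sanity check of Eq. (40), one can diagonalise Ω for a random feature matrix and compare the empirical spectrum with the Marchenko-Pastur density; the sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 4000, 400                                   # delta = D / N = 0.1
F = rng.standard_normal((D, N))                    # i.i.d. feature matrix
Omega = F @ F.T / N                                # D x D matrix of Eq. (27)
eigs = np.linalg.eigvalsh(Omega)

delta = D / N
lo, hi = (1 - np.sqrt(delta))**2, (1 + np.sqrt(delta))**2
rho = np.linspace(lo + 1e-6, hi - 1e-6, 200)
p_mp = np.sqrt((hi - rho) * (rho - lo)) / (2 * np.pi * delta * rho)   # Eq. (40)

hist, edges = np.histogram(eigs, bins=40, density=True)
# hist follows p_mp evaluated at the bin centres as N and D grow
```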
We illustrate the content of the equations of motion in Fig. 3, where we plot the dynamics of the generalisation error, the order parameters R_km, Σ_kℓ and W_kℓ and the second-layer weights v_k obtained from a single experiment with N = 10000, D = 100, M = K = 2, starting from small initial weights (crosses). The elements of the feature matrix are drawn i.i.d. from the standard normal distribution, as are the elements of the latent representations c. The solid lines give the dynamics of these order parameters obtained by integrating the equations of motion. The initial conditions for the integration of the ODEs were taken from the simulation. The ODE description matches this single experiment really well even at moderate system sizes.

Discussion
Yoshida and Okada [36] recently analysed online learning for two-layer neural networks (1) trained on Gaussian inputs, with a two-layer teacher acting directly on the inputs x. Their approach consists of introducing distinct order parameters R_km^(i), Q_kℓ^(i) etc. for each distinct eigenvalue of the input covariance matrix Ω. They analysed their equations for covariance matrices with one and two distinct eigenvalues. Here, we first introduced the GEP (III.1) to show that inputs which are not normally distributed, such as X = f(CF/√D), can be reduced to an effective Gaussian model as far as the dynamics of learning are concerned. Furthermore, the description of the learning dynamics we just discussed allows us to analyse inputs with any well-defined spectral density with just a single set of order parameters Q_kℓ, R_km and T_nm. This is made possible by introducing the integral over the order parameter densities r_km(ρ) etc. As we will see below, this integral can actually be solved for small δ, which simplifies the equations of motion considerably and allows for a detailed analysis (cf. Sec. IV C).
We lastly comment on the role of the dimensionality in our setup. Inspection of the test error (25) reveals that a student has to recover the local fields of the teacher ν_m in order to perform well (if she has the same activation function as the teacher). If the student was trained directly on the latent variables C, she could recover these local fields perfectly and we would be back in the setup of Saad and Solla [22]. In the HMM, the student is only given the high-dimensional inputs X, which can be seen as a noisy projection of the latent variables C (23). The high dimensionality of the student inputs is thus a constraint that must be overcome to learn well, because projection to high dimensions is part of the data-generating process. This is to be contrasted with setups like random features [39,40] or certain neural circuits in sensory processing [67,68], where projection of the inputs to higher-dimensional spaces is part of the analysis and generally simplifies the subsequent learning problem.

IV. ANALYTICAL RESULTS
The goal of this section is to use the analytic description of online learning to analyse the dynamics and the performance of two-layer neural networks in detail.

A. Specialisation of student nodes in the HMM
An intriguing feature of both the canonical teacher-student setup and the hidden manifold model is that they both exhibit a specialisation phenomenon. Upon closer inspection of the time evolution of the order parameter R_km in Fig. 3 (b), we see that during the initial decay of the generalisation error, up to a time t = P/N ∼ 10, all elements of the matrix R_km are comparable. In other words, the correlations between the pre-activation λ_k of any student node and the pre-activation ν_m of any teacher node are roughly the same. As training continues, the student nodes "specialise": the pre-activation of one student node becomes strongly correlated with the pre-activation of only a single teacher node. In the example shown in Fig. 3, we have strong correlations between the pre-activations of the first student and the first teacher node (R_11), and similarly between the second student and second teacher node (R_22). The specialisation of the teacher-student correlations is concurrent with a de-correlation of the student units, as can be seen from the decay of the off-diagonal elements of the latent and ambient student-student overlaps Σ_kℓ and W_kℓ, respectively (bottom of Fig. 3). Similar specialisation transitions have been observed in the canonical teacher-student setup for both online and batch learning [22,69]; see Engel and Van den Broeck [8] for a review.
B. Using non-random feature matrices

Our first example of the learning dynamics in Sec. IV A was for a feature matrix F whose entries were taken i.i.d. from the normal distribution. The derivation of the ODEs for online learning however does not require that the feature matrix F be random; instead, it only requires the balance condition stated in Eq. (14). As an example, we use a deterministic feature matrix built from an N × N Hadamard matrix H, which has entries ±1 and mutually orthogonal rows,

H H^⊤ = N I_N,   (41)

where I_N is the N × N identity matrix. As we can see from Fig. 4, the ODEs capture the generalisation dynamics of the Hadamard case just as well.
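One convenient way to generate such a matrix is Sylvester's construction, available in scipy. In the sketch below we simply keep D rows of the Hadamard matrix as the feature matrix; the particular rows and the sizes are our illustrative choices.

```python
import numpy as np
from scipy.linalg import hadamard

N, D = 1024, 128                 # N must be a power of two for Sylvester's construction
H = hadamard(N)                  # entries +/-1, mutually orthogonal rows: H @ H.T = N * I_N
F = H[:D, :].astype(float)       # deterministic feature matrix made of D rows of H
assert np.allclose(F @ F.T, N * np.eye(D))

# F can now replace the random Gaussian feature matrix in the HMM generator
```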

C. The limit of small latent dimension
The key technical challenge in analysing the analytical description of the dynamics is handling the integro-differential nature of the equations. We can simplify the equations in the limit of small δ ≡ D/N. Numerical integration of the equations reveals that at convergence, the continuous order parameter densities r_km(ρ) and σ_kℓ(ρ) are approximately constant:

r_km(ρ, t) ≈ r_km(t),   σ_kℓ(ρ, t) ≈ σ_kℓ(t),  independently of ρ.   (42)

This is a key observation, because making the ansatz (42) allows us to transform the integro-differential equations for the dynamics of r_km(ρ, t), Eq. (29), and σ_kℓ(ρ, t), Eq. (35), into first-order ODEs, provided we can perform the integral over the eigenvalue distribution p_Ω(ρ) in Eqs. (28) and (34) analytically. This is for example the case if we take the elements of the feature matrix F i.i.d. from any probability distribution with bounded second moment, in which case p_Ω(ρ) is given by the Marchenko-Pastur distribution (40). We will focus on this case for the remainder of this section. Let us note that the regime of small δ is also the relevant regime for image data sets such as MNIST and CIFAR10, whose δ has been estimated previously to be around δ_MNIST ∼ 14/784 and δ_CIFAR10 ∼ 35/3072, respectively [12-15]; cf. our discussion in the Introduction.

The effect of the latent dimension D = δN
As a first application of this approach, we analyse the dependence of the asymptotic test error ε*_g on the latent dimension D of the hidden manifold when teacher and student have the same number of hidden nodes, K = M.
From inspection of the form of the order parameters after integrating the full set of ODEs until convergence, we made the following ansatz for the overlap matrices:

R_km = R δ_km + r (1 − δ_km),  Σ_kℓ = S δ_kℓ + s (1 − δ_kℓ),  W_kℓ = W δ_kℓ + w (1 − δ_kℓ),  v_k = v.   (43)

Substituting this ansatz into the ODEs allowed us to derive closed-form expressions for the ODEs governing the dynamics of the seven order parameters R, r, S, s, W, w and v that are valid for small δ and for any K = M. The teacher-related order parameters T, t, T̃ and t̃ describe the teacher and are constants of the motion. They have to be chosen to reflect the distribution from which the weights of the teacher network are drawn in an experiment. The full equations of motion are rather long, so instead of printing them here in full we provide a Mathematica notebook for reference [50].

FIG. 5. Estimates [12-15] for δ for the CIFAR10 data set (left) and the MNIST data set (right).
The key idea of our analytical approach is to look for fixed points of this ODE system and to substitute the values of the order parameters at those fixed points into the expression for the generalisation error (26). To understand the structure of the fixed points of the ODEs, we ran a numerical fixed point search of the ODEs from 1000 initial values for the order parameters drawn randomly from the uniform distribution. We found two types of solution. First, there exist solutions of the form R = r, S = s and W = w. This solution is a saddle point of the equations and is thus not a stable fixed point of the dynamics. Instead, it corresponds to a well-known "unspecialised" phase, when networks with K > 1 hidden nodes have not yet specialised and hence achieve only the performance of a network with K = 1 hidden unit (cf. our discussion in Sec. IV A). The learning dynamics approaches this saddle point at an intermediate stage of learning, but finally drifts away from it towards a "specialised" solution. This second solution corresponds to the asymptotic fixed point of the learning dynamics where the student has specialised, i.e. we have R large and r small, etc. Substituting the values of the order parameters of this solution into Eq. (26) yields the asymptotic generalisation error of a student.
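Numerically, such a fixed-point search can be carried out with a standard root finder started from many random initial conditions. In the sketch below, `order_parameter_rhs` is a hypothetical callable returning the time derivatives of the seven order parameters; it stands in for the full equations provided in the notebook [50].

```python
import numpy as np
from scipy.optimize import fsolve

def find_fixed_points(order_parameter_rhs, n_params=7, n_inits=1000, rng=None):
    """Search for fixed points of d(order parameters)/dt = order_parameter_rhs(op)
    from many random initial conditions, and deduplicate the solutions found."""
    rng = np.random.default_rng(rng)
    solutions = []
    for _ in range(n_inits):
        x0 = rng.uniform(0.0, 1.0, n_params)                 # random starting point
        x_star, _, ier, _ = fsolve(order_parameter_rhs, x0, full_output=True)
        if ier == 1 and np.allclose(order_parameter_rhs(x_star), 0.0, atol=1e-9):
            solutions.append(np.round(x_star, 6))
    if not solutions:
        return np.empty((0, n_params))
    return np.unique(np.array(solutions), axis=0)            # distinct fixed points
```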
Making this argument rigorous would require a proof of global convergence of the coupled, non-linear integro-differential equations of motion, Eqs. (29, 35, 38, 39), from random initial conditions. This is a challenging mathematical problem that remains open, despite some recent progress for two-layer neural networks with finite N and large hidden layer [73-77]. Thus all predictions obtained in this way ultimately need to be compared to simulations to verify their accuracy.
We show the results of this analysis in Fig. 5. The crosses are experimental results for which we trained networks with M = K = 2 on data from a hidden manifold with latent dimension D = 25, 50, 100 and 200, choosing the input dimension N to obtain the range of δ desired for each curve. We plot the asymptotic error averaged over five runs with dots; error bars indicate two standard deviations. The lowest solid line in Fig. 5 is the theoretical prediction obtained by the procedure just explained when assuming that T = 1, t = 0, T̃ = 1, t̃ = 0.
While the experimental results are approaching the theoretical line as the latent dimension D increases, there are qualitative differences in the shape of the δ dependence for small δ. These differences arise due to the following finite-size effect. While it is numerically easy to enforce T = 1, t = 0 by orthogonalising the teacher weight matrix, it is not possible to explicitly control the re-weighted teacher-teacher overlap T̃_nm (30). The deviations of T̃_nm from the identity lead to the deviations we see at small δ. We demonstrate this in Fig. 5 by also plotting theoretical predictions for T̃ = 1 − x, t̃ = x and choosing x = 1/D. These curves match the experiments much better. Plotting the data with a linear y-scale (not shown) reveals that the solution obtained by making the small-δ ansatz (42) is valid up to δ ∼ 0.2.

Learning rate η
We found that the asymptotic test error ε*_g depends only weakly on the learning rate η, as we show in Fig. 6 for M = K = 2 and M = K = 6, together with the theoretical prediction for t̃ = 0. This theoretical prediction is again obtained by using the ansatz (43) for the order parameters and solving the resulting fixed point equations, as described in the previous section, but this time varying the learning rate η. The weak dependence of ε_g on η should be contrasted with the behaviour of the canonical teacher-student setup, where the generalisation error is proportional to the learning rate in the case of additive Gaussian output noise [55,78].
In the inset of Fig. 6, we plot the generalisation dynamics of a neural network trained on the HMM at different learning rates. As expected, the learning rate controls the speed of learning, with increased learning rates leading to faster learning until the learning rate becomes so large that learning is not possible anymore; instead, the weights just grow to infinity.

The impact of student size
Another key question in our model is how the performance of the student depends on her number of nodes K. Adding hidden units to a student who has fewer hidden units than her teacher (K < M) improves her performance, as would be expected. This can be understood in terms of the specialisation discussed in Sec. IV A: each additional hidden node of the student specialises to another node of the teacher, leading to improved performance. We will see an example of this below in Sec. V A.
But what happens if we give the student more nodes than her teacher has, K > M? It is instructive to first study the overlap matrices at the end of training. We show two examples from an experiment with M = 2, K = 6 at δ = 0.01 for networks starting from different initial conditions. In particular, we plot the rescaled teacher-student overlap matrix v_k R_km in Fig. 7 (b, c). We rescale R_km by the second-layer weights to account for two effects: first, the relative influence of a given node on the output of the student, which is determined by the magnitude of the corresponding second-layer weight; and second, the symmetry in the output of the student, since for sigmoidal activation functions flipping the signs of both the first- and second-layer weights of a hidden unit leaves the output unchanged. In the two overlap plots for K > M in Fig. 7, the student nodes display many-to-one specialisation: several hidden units of the student specialise to the same hidden node of the teacher, essentially providing several estimates of the value of this teacher node. Note that each student node specialises to one and only one of the teacher nodes, rather than a combination of two or more teacher nodes. We found this pattern of activations consistently across all of our runs for various K and M. The fact that student nodes are evenly distributed across teacher nodes is further motivated by the fact that such an arrangement minimises the generalisation error if the second-layer teacher weights ṽ_m have equal magnitude and its first-layer weights w̃_m have the same norm. We anticipate that this specialisation pattern is at least in part due to the sigmoidal form of the activation function g(x). We note that the same many-to-one specialisation of hidden units has been previously reported for the same two-layer networks trained on i.i.d. inputs [55], and that a similar pattern of specialisation was observed for networks with finite input and wide hidden layer, where this type of specialisation was referred to as "distributional dynamics" [74-77].
These observations motivate a block-structured ansatz, analogous to Eq. (43), for the overlaps of a student with K = ZM hidden nodes (Z ∈ ℕ), and similarly for W_kℓ, while we use the same parameterisation for the teacher order parameters T, t, T̃, t̃, A and v. Searching again for specialised fixed points of the resulting equations for the seven time-dependent order parameters R, r, S, s, W, w and v and substituting their values into Eq. (26) yields the predictions we indicate by solid lines in Fig. 7, where we plot the asymptotic test error as a function of Z ≡ K/M. We can see small performance improvements as the student size increases. We also plot, for the three values of M used, the asymptotic test error measured in experiments with D = 50, 100 and 200. As we increase D, the experimental results approach the theoretical prediction for D → ∞.
We finally note that fixed points of the online dynamics with many-to-one specialisation have been described previously in the canonical teacher-student setup [55], where it was found that this behaviour leads to a more significant improvement of student performance as K increases for teacher tasks with y* = φ(x, θ*) than the improvement we observe for the HMM. The same type of many-to-one specialisation was also found by Mei et al. [74] and Chizat and Bach [76], who considered a complementary regime where the input dimension N stays finite while the size of the hidden layer goes to infinity.

V. COMPARING THE HIDDEN MANIFOLD MODEL TO REAL DATA
We finally turn our attention to the comparison of the hidden manifold model to more realistic data sets, in our case classic image databases such as CIFAR10 (see Fig. 1 for two examples of images in CIFAR10).

A. Neural networks learn functions of increasing complexity
The specialisation transition that we discussed in Sec. IV A has an important consequence for the performance of the neural network, as we show in Fig. 8. As we train increasingly large student networks on a teacher with M = 10 hidden units and second-layer weights v_m = 1/M, we observe that learning proceeds in two phases. First, there is an initial decay of the generalisation error until all students have roughly the same test error as the student with a single hidden unit, K = 1. In a second phase, students with K > 1 break away from this plateau after further training and achieve superior performance, with the larger networks performing better. These improvements are a result of specialisation after ∼ 10³ epochs, which permits the student networks to capitalise on their additional hidden nodes.
This way of visualising specialisation not only illustrates its importance for student performance, it is also applicable when training the same two-layer neural networks on more realistic data sets such as MNIST (Fig. 8 b) or Fashion MNIST [24] and CIFAR (Fig. 10). The plots demonstrate clearly that in all these cases, the larger networks proceed by first learning functions that are equivalent to the smaller networks.
In all cases, specialisation is preceded by a plateau where the generalisation error stays constant because the student is stuck at a saddle point in its optimisation landscape, corresponding to the unspecialised solution. This plateau has been discussed extensively in the canonical teacher-student setup [8,23,79,80] and more recently in the context of recurrent and deep neural networks [37,81]. By comparing students of different sizes, this plateau can also be demonstrated on image data sets, as we have done above. This learning of functions with increasing complexity has also been observed in deep convolutional networks by Kalimeris et al. [82], who used quantities from information theory to quantify how well one model explains the performance of another.
These observations are interesting because they suggest how to explain the ability of neural networks to generalise well from examples when they have many more parameters than samples in their training data set. This is a key open problem in the theory of deep learning, since the intuition from classical statistics suggests that in these cases, the networks overfit the training data and thus generalise poorly [5,83]. It is possible that by learning functions of increasing complexity, networks are biased towards simple classifiers and avoid over-fitting if their training is stopped before convergence. This topic is an active research area [84,85].

B. Memorisation of random and realistic data
An interesting difference between random and realistic data was demonstrated in a recent paper by Arpit et al. [86]. They trained 100 two-layer networks (K = 4096 hidden units with ReLU activation, 10 output units with softmax activation) for a single epoch on the ten-class image classification task of the CIFAR10 data set, starting from different initial conditions each time. At the end of training, they measured the frequency with which each individual image was classified correctly by the network across runs, which we will call the memorability of an image; it should be thought of as a function of both the image and the data set that contains it. We repeated this experiment on CIFAR10 and added three different synthetic data sets (colour codes refer to Fig. 9):

CIFAR10: x_µ: CIFAR10 images; y*_µ ∈ [0, 9]: CIFAR10 label giving the class of that image.

Gaussian: teacher acting on Gaussian inputs: x_µ i.i.d. standard Gaussians; y*_µ = arg max φ(x_µ, θ*).

TeacherS: teacher acting on structured inputs x_µ = f(F c_µ); y*_µ = arg max φ(x_µ, θ*).

HMM: teacher acting on the latent representation of structured inputs x_µ = f(F c_µ); y*_µ = arg max φ(c_µ, θ̃*).

The labels for the synthetic data sets were generated by two teacher networks, one with input dimension N for the Gaussian and TeacherS data sets, and another with input dimension D for the HMM. The teachers were two-layer fully connected networks having M = 2K hidden units with ReLU activation function and 10 nodes in the last readout layer. Thus the teacher's output φ(·, θ*) ∈ R^10, and the class for a given input was obtained as the index of the output node with the highest value for that input. We plot the memorabilities of all images in the training set, sorted by their memorability, in Fig. 9. On the left, we first reproduce the memorability curve for CIFAR10 that was found by Arpit et al. [86] (solid blue), which demonstrates that many examples are consistently classified correctly or incorrectly after a single epoch of training. The memorability curve for a data set containing the same images with random labels (dashed blue) demonstrates that randomised CIFAR10 does not contain images that are particularly hard or easy to memorise. The smaller variation in memorability for the randomised data set is largely due to the fact that it takes more time to fit randomised data sets [87]. After one epoch, the network thus has a lower training accuracy on the randomised data set (cf. the inset of Fig. 9), which leads to the smaller area underneath the curve. We verified that no easy or hard samples appear when training the randomised data sets to comparable training accuracy (not shown). In fact, the memorability of data sets with random labels seems to coincide after accounting for differences in the training error, regardless of whether the inputs are CIFAR10 images, Gaussian inputs or structured inputs X = f(CF) (4) (dashed lines in Fig. 9 a).
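The memorability measurement itself only requires repeating a one-epoch training run from many independent initialisations and recording which samples end up correctly classified. In the sketch below, `train_one_epoch` and `predict` are hypothetical callables standing in for the actual training and prediction routines.

```python
import numpy as np

def memorability(train_one_epoch, predict, X, y, n_runs=100, seed=0):
    """Fraction of independently initialised runs in which each training sample is
    classified correctly after a single epoch (the quantity plotted in Fig. 9)."""
    correct = np.zeros(len(X))
    for run in range(n_runs):
        params = train_one_epoch(X, y, seed=seed + run)   # fresh initialisation per run
        correct += (predict(params, X) == y)
    return np.sort(correct / n_runs)[::-1]                # sorted memorability curve
```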
The memorability curves for the Gaussian, TeacherS and HMM data sets in Fig. 9 (b) reveal that hard and easy examples exist for TeacherS and HMM, which both contain structured inputs X = f(CF), but not for the Gaussian data set. The number of easy examples, but not their existence, correlates well with the training accuracy on these data sets, shown in the inset. In that sense, the hidden manifold model is thus a more realistic model of image-like data than the canonical teacher-student setup.
Note that by making the teacher network larger than the students (M = 2K), the learning problem is unrealisable for all three synthetic data sets, i.e. there is no set of weights for the student that achieves zero generalisation error. The absence of easy examples in the Gaussian data set thus suggests that unrealisability alone is insufficient to obtain a data set with easy examples. Our results also demonstrate that memorability is not just a function of the input correlations: CIFAR10 images, Gaussian inputs and structured inputs yield the same memorability curves when their labels are randomised. We leave it to future work to identify some criterion, statistical or otherwise, that predicts either whether a sample (x_µ, y*_µ) is easy (or hard!) to memorise or whether a training set contains easy examples at all.

FIG. 9. Memorability curves for the different data sets. Panel (a) shows that the existence of particularly easy and hard examples disappears in all models when the labels are reshuffled (dashed lines). The insets indicate the training accuracy after training, using circles for randomised data sets and squares for unmodified data sets. Panel (b) shows that these hard and easy examples also exist in the structured data models, TeacherS and the HMM, but not in the unstructured Gaussian one.

VI. CONCLUDING PERSPECTIVES
We have introduced the hidden manifold model as a generative model for structured data sets that displays some of the phenomena that we observe when training two-layer neural networks on realistic data. The HMM has two key ingredients, namely high-dimensional inputs which lie on a lower-dimensional manifold, and labels for these inputs that depend on the inputs' position within the low-dimensional manifold. We derived an analytical solution of the model for online SGD learning of two-layer neural networks. We thus provide a rich test bed for exploring the influence of data structure on learning in neural networks.
Let us close this paper by outlining several important directions in which our work is being (or should be) extended.
Comparison to more deep learning phenomenology In the spirit of our experiments in Section V, it is of great interest to identify more properties of learning that are consistently reproduced across experiments with realistic data sets and network architectures, and to test whether the HMM reproduces these observations as well. Of particular interest will be those cases where learning on realistic data deviates from the HMM, and how we can extend the HMM to capture these behaviours.
Beyond online SGD Our analytical results on online SGD rely on the assumption that each new sample seen during training is conditionally independent from the weights of the network up to that point. In practice, samples are seen several or even many times during training, giving rise to additional correlations. Taking those correlations into account to analyse those cases is an important future direction. First steps towards a solution to this challenging problem were made using the dynamical replica method [88,89] for two-layer networks, and for single-layer neural networks trained using full-batch gradient descent, where all the samples in the training set are used at every step of the algorithm [42,43,90]. Generalising these results to two-layer networks is clearly a direction for future work as well.
Learning with a multi-layer network The present work should be extended to learning with multi-layer networks in order to identify how depth helps to deal with structured data. This is a serious challenge, and it remains an open problem to find explicitly solvable models of multi-layer (non-linear) networks even in the canonical teacher-student model where inputs are uncorrelated.
Multi-layer generative model The hidden manifold model is akin to a single-layer generator of a GAN. A natural extension would be to take a generator with an arbitrary number of layers. Multi-layer generators are explored in [41,49], whose results are analogous to the Gaussian equivalence property and suggest that the full solution of online SGD or of full-batch gradient descent might also be within reach.
Conditioning the inputs on the labels In the HMM, the true label y * of an input x is conditioned on its latent representation c, i.e. its coordinates in the manifold. It may be more realistic to consider models where instead, the latent representation is conditioned on the label of the input, i.e. p(c|y). A simple case of such a model that reduces to a Gaussian mixture of two clusters was explored recently [91]. This is also the point of view taken implicitly in [33]. More generally, exploring different approaches to modelling realistic inputs will allow us to better understand how data structure influences learning.
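For contrast, here is an equally minimal sketch of a label-conditioned latent model in the simplest two-cluster case mentioned above: the label is drawn first and the latent representation is then sampled from p(c|y). The single cluster direction mu and the erf feature map are illustrative assumptions, not the model of [91].

    import numpy as np
    from scipy.special import erf

    def sample_label_conditioned(P, N, D, rng):
        """Label drawn first; latent representation c sampled from p(c | y) (two Gaussian clusters)."""
        F = rng.standard_normal((D, N))             # fixed feature map, as in the HMM
        mu = rng.standard_normal(D) / np.sqrt(D)    # cluster direction (illustrative)
        y = rng.choice([-1.0, 1.0], size=P)         # labels first ...
        C = y[:, None] * mu[None, :] + rng.standard_normal((P, D))   # ... then c | y
        X = erf(C @ F / np.sqrt(D))                 # inputs again live near the hidden manifold
        return X, y

    X, y = sample_label_conditioned(P=1000, N=784, D=14, rng=np.random.default_rng(0))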

ACKNOWLEDGMENTS
We would like to thank Bruno Loureiro and Federica Gerace for useful discussions. We thank Stanisław Jastrzębski for discussing the experiments of [86]. We are grateful to the Kavli Institute For Theoretical Physics for its hospitality during an extended stay, during which parts of this work were conceived and carried out. We acknowledge funding from the ERC under the European Union's Horizon 2020 Research and Innovation Programme Grant Agreement 714608-SMiLe, from "Chaire de recherche sur les modèles et sciences des données", Fondation CFM pour la Recherche-ENS, and from the French National Research Agency (ANR) grant PAIL. This research was supported in part by the National Science Foundation under Grant No. NSF PHY-1748958.
Appendix A: The Gaussian Equivalence Property

Nonlinear functions of weakly correlated Gaussian random variables
In order to derive the GEP we first establish some auxiliary lemmas concerning the correlations between nonlinear functions of weakly correlated random variables.

a. Correlations of two functions
Lemma A.1. Given n + p random variables organised in two vectors x and y, with a joint Gaussian distribution, denote by E the expectation with respect to this distribution. The first moments are supposed to vanish, and we denote the covariances by Q, R and εS. Let f(x) and g(y) be two functions of x and y, respectively, regular enough that E_x[x_i x_j f(x)] and E_y[y_i y_j g(y)] exist, where E_x denotes the expectation with respect to the distribution N(a, Q) of x and E_y denotes the expectation with respect to the distribution N(b, R) of y. Then, in the ε → 0 limit, Eq. (A4) holds.
Proof. The result is obtained by a straightforward expansion in ε.
The joint distribution of x and y is given in Eq. (A5). One can expand the inverse matrix M⁻¹ to first order in ε and substitute this expansion into the joint distribution (A5); using the resulting expression, the result (A4) follows immediately.
An immediate application of the lemma to the case n = p = 1 is the following. Consider two Gaussian random variables u_1, u_2 with mean zero, unit variance and covariance ⟨u_1 u_2⟩ = ε m_12, and two functions f_1 and f_2. Define, for i ∈ {1, 2}, a_i = ⟨f_i(u)⟩ and b_i = ⟨u f_i(u)⟩, where ⟨·⟩ denotes the average over a random Gaussian variable u distributed as N(0, 1). Then, in the ε → 0 limit, the correlation between f_1(u_1) and f_2(u_2) is given by Eq. (A12). This means that, if we consider the centered functions f̃_i(u_i) = f_i(u_i) − a_i, their covariance is E[f̃_1(u_1) f̃_2(u_2)] = ε b_1 b_2 m_12 to leading order. This result generalises to correlation functions of higher order, as stated in the following lemma.
b. Higher-order correlations

Lemma A.2. Consider m Gaussian random variables u_1, . . . , u_m with mean zero and covariance U = I + εm, i.e. unit variances and off-diagonal covariances ⟨u_i u_j⟩ = ε m_ij, and m functions f_1, . . . , f_m. Define a_i = ⟨f_i(u)⟩ and b_i = ⟨u f_i(u)⟩ as before, and define the centered functions as f̃_i(u_i) = f_i(u_i) − a_i. Then Eq. (A16) holds, where Π denotes all the m!/(2^{m/2} (m/2)!) partitions of {1, . . . , m} into m/2 disjoint pairs. This result means that, for the moments involving only distinct indices, the random variables f̃_1(u_1)/√ε, . . . , f̃_m(u_m)/√ε behave, in the ε → 0 limit, like Gaussian variables with covariance matrix b_i b_j m_ij.
Proof. The covariance matrix U of the variables u_1, . . . , u_m has elements 1 on the diagonal and elements of order ε off the diagonal: U = I + εm. One can expand U⁻¹ in powers of ε, and the integration measure over the variables u_1, . . . , u_m can be expanded accordingly, see Eq. (A18). When we compute the integral of f̃_1(u_1) · · · f̃_m(u_m) with the measure (A18), because ⟨f̃_i(u_i)⟩ = 0, we only need to include the terms G_p(u_1, . . . , u_p) of this expansion that involve at least one power of each of the variables u_1, . . . , u_m. When m is even, say m = 2r, the term of this kind with the smallest power of ε, as ε → 0, is the monomial u_1 · · · u_2r that comes from the r-th order term in G_1. This gives a sum over indices i_1 j_1 . . . i_r j_r running over all permutations of the indices 1, . . . , 2r, which leads to (A16) for m even. When m is odd, m = 2r + 1, the leading terms coming from G_p that give a non-zero result as ε → 0 are monomials of the type u_1² u_2 · · · u_{2r+1}. They are of order O(ε^{r+1}). This proves (A16) for m odd.

Corollary A.3. In the special case m = 3, we obtain Eq. (A21).
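The leading-order behaviour established above, namely that the centered variables f̃_i(u_i)/√ε have covariance b_i b_j m_ij in the ε → 0 limit, is straightforward to check numerically. The sketch below compares the empirical covariance of f̃_1(u_1) and f̃_2(u_2) with ε b_1 b_2 m_12; the particular choices of ε, m_12 and the test non-linearities (an erf and a cubic) are arbitrary illustrations.

    import numpy as np
    from scipy.special import erf

    rng = np.random.default_rng(1)
    eps, m12, n_samples = 0.05, 0.7, 2_000_000

    # two weakly correlated standard Gaussians with E[u1 u2] = eps * m12
    cov = np.array([[1.0, eps * m12], [eps * m12, 1.0]])
    u1, u2 = rng.multivariate_normal([0.0, 0.0], cov, size=n_samples).T

    f1 = erf                                   # first test non-linearity
    f2 = lambda u: u**3                        # second test non-linearity (arbitrary choice)
    z = rng.standard_normal(n_samples)         # b_i = E[u f_i(u)], estimated by Monte Carlo
    b1, b2 = np.mean(z * f1(z)), np.mean(z * f2(z))

    empirical = np.cov(f1(u1), f2(u2))[0, 1]   # covariance of the centered functions
    predicted = eps * b1 * b2 * m12
    print(empirical, predicted)                # should agree to leading order in eps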

Derivation of the Gaussian Equivalence Property
The derivation is based on the computation of moments of the variables λ_k and ν_m, showing that, in the thermodynamic limit, all the moments are those of Gaussian random variables. Here we spell out the derivation up to fourth-order moments, and leave the daunting higher-order moments for a future formal proof of the GEP.

a. Covariances
We first compute the covariance matrix G_{kℓ} = E[λ_k λ_ℓ]. In the last piece, we need to compute E[(f(u_i) − a)(f(u_j) − a)] for two Gaussian random variables u_i and u_j which are weakly correlated in the large-N limit. In fact, for i ≠ j their correlation is of order 1/√D. In the thermodynamic limit, we can apply Lemma A.1, which gives Eq. (A25). From Eqns. (A23) and (A25), we get the covariance of the λ variables as written in (16). The covariance E[ν_m ν_n] is analogous. We now compute the covariance E[λ_k ν_m], given in Eq. (A26). The two variables u_i and c_r are Gaussian random variables with a correlation which goes to zero as O(1/√N) in the thermodynamic limit. We can thus use Lemma A.1, and more precisely Eq. (A12), to get the leading behaviour; using this result in (A26) gives Eq. (17).

b. Fourth moments of the λ_k variables
We study the fourth moment of the λ variables, decomposing the sum over i_1, i_2, i_3, i_4 according to the number of distinct indices.

a. Four distinct indices

Let us study the first piece of the fourth moment E[λ_{k_1} λ_{k_2} λ_{k_3} λ_{k_4}], where the sum runs over four indices i_1, i_2, i_3, i_4 which are all distinct from each other. We can use the factorisation property of the fourth moments of f(u) of Lemma A.2. The correction terms come from pieces where the intersection between {i_1, i_2} and {i_3, i_4} is non-empty. If we first neglect this correction, we find the expression (A32). Now we show that the corrections are negligible. Consider the term i_1 = i_3, i_2 = i_4: using (A12) we get an expression for this correction and, using our hypothesis that the quantities S are of order one, it is clearly at most of order O(1/√N), and therefore negligible. The last correction that we need to consider is the term where i_1 = i_3 = i and i_2 = i_4 = j, which is again negligible in the large-N limit.

b. Three distinct indices

Let us study the contributions to the fourth moment of λ coming from three distinct indices. We study the case i_1 = i_4. Using the expression for the third moment of functions of u_1, u_2, u_3 found in (A21), we get Eq. (A38). The corrections come from cases where i_1 = i_2 or i_1 = i_3; for instance, the piece with i_1 = i_2 vanishes in the large-N limit. The only pieces that do not vanish in the large-N limit are thus the pieces similar to the one computed in (A38). Putting all of them together, we find that the contribution to E[λ_{k_1} λ_{k_2} λ_{k_3} λ_{k_4}] coming from pieces with exactly three distinct indices among i_1, i_2, i_3, i_4 is given by (A40).

c. Two distinct indices

Let us now study the contribution to the fourth moment of λ coming from two distinct indices. We first study the piece corresponding to i_1 = i_2 = i, i_3 = i_4 = j. To leading order in the thermodynamic limit, we obtain the expression (A43) (the correction coming from i = j being obviously at most O(1/N)).
We now study the second piece of this contribution, corresponding to i_1 = i_2 = i_3 = i, i_4 = j; it is negligible. Therefore all the contributions to the fourth moment of λ coming from exactly two distinct indices are of the type (A43), and they give the total contribution (A47).

d. One distinct index

The contribution to the fourth moment E[λ_{k_1} λ_{k_2} λ_{k_3} λ_{k_4}] coming from i_1 = i_2 = i_3 = i_4 is clearly of O(1/N) and can be neglected.
e. Final result for the four-point correlation function of the λ variables

We can now put together all the contributions to the fourth moment E[λ_{k_1} λ_{k_2} λ_{k_3} λ_{k_4}]: the pieces with four distinct indices found in (A32), those with three distinct indices found in (A40), and those with two distinct indices found in (A47). Recalling the definition (A40) of the X variables, the result is a sum of terms of the form X_{k_a k_b; k_c k_d} over the pairings of the indices, X_{k_1 k_4; k_2 k_3} + X_{k_2 k_3; k_1 k_4} + X_{k_2 k_4; k_1 k_3} + X_{k_3 k_4; k_1 k_2} + . . . , and one can check that the resulting expression is exactly the fourth moment of Gaussian variables with the covariances computed above. With this, it is clear how to proceed with the calculation of the fourth moments involving λ and ν variables: we first need to study the moments with three λ and one ν, then the moments with two λ and two ν, and finally the moments with one λ and three ν variables. In the interest of conciseness, we do not spell out the full details of these calculations here; they proceed very similarly to the calculations performed hitherto. The generalisation to higher moments of the λ variables employs the same combination of repeated use of Lemma A.2 and a careful decomposition into subsets of distinct indices. As a result, it can be seen that the set of λ variables has a Gaussian distribution in the thermodynamic limit.
Note the different re-scaling of the learning rate for the first- and second-layer weights. From here on, we shall drop the index µ on the right-hand side, as we work at a fixed iteration time. We keep the convention of Sec. III A, where extensive indices (taking values up to N or D) are written as subscripts, while indices taking finite values up to M or K are written as superscripts. The challenge of controlling the learning in the thermodynamic limit will be to write closed equations using matrices with only "upper" indices left. Furthermore, we adopt the convention that the indices j, k, ℓ, ι = 1, . . . , K always denote student nodes, while n, m = 1, . . . , M are reserved for teacher hidden nodes.

First steps
We will start by focussing on the dynamics of the first layer, Eq. (B1a). When we study the evolution of quantities that are linear in the weights, like S^k_r and the order parameters constructed from it, e.g. Σ^{kℓ}, we need to study the corresponding update while keeping the second-layer weights v^k fixed. We can thus follow the dynamics of S^k_r (20), which is linear in the weights and enters the definition of the order parameters R^{km} (17) and Σ^{kl} (22). We want to average this update equation over a new incoming sample, i.e. over the c_r variables. Upon contraction with F_{ir} in Eq. (B6), we are thus led to computing the averages (B7), (B8) and (B9). The crucial fact that allows for an analytic study of online learning is that, at each step µ of SGD, a previously unseen input x^µ is used to evaluate the gradient. The latent representation c^µ of this input is given by a new set of i.i.d. Gaussian random variables c_{µr}, which are thus independent of the current weights of the student at that time. In the thermodynamic limit, the GEP of the previous section shows that, for one given value of r, the K + M + 1 variables {λ^k}, {ν^m} and β_r have a joint Gaussian distribution, making it possible to express the averages over {λ^k, ν^m, β_r} in terms of only their covariances. Looking closer, we see that the average of (B7,B8,B9) over this Gaussian distribution involves two sets of random variables: on the one hand, the M + K local fields {ν^m, λ^k}, which have correlations of order 1, and on the other hand the variable β_r (for one given value of r). It turns out that β_r is only weakly correlated with the local fields {ν^m, λ^k} (the correlation is O(1/√N)). In Appendix A 1, we discuss how to compute this type of average and prove Lemma A.1, which for the averages (B7-B9) yields Eq. (B14), with only the single extensive index r left. While this equation would appear to open up a way to write down the equation of motion for the "teacher-student" overlap R^{km} by contracting (B14) with the teacher weights w̃^m_r, we show in Appendix C that such a program leads to an infinite hierarchy of equations. To avoid this problem, we rotate the problem to a different basis, as we explain in the next section.

Changing the basis to close the equations
We can close the equations for the order parameters by studying their dynamics in the basis given by the eigenvectors of the operator Ω_{rs}, which is a D × D symmetric matrix with diagonal elements Ω_{rr} = 1 and off-diagonal elements of order 1/√N. Consider the orthogonal basis of eigenvectors ψ^τ, τ = 1, . . . , D, of this matrix, with corresponding eigenvalues ρ_τ, such that Σ_s Ω_{rs} ψ^τ_s = ρ_τ ψ^τ_r (B16). We will suppose that the components of the eigenvectors ψ^τ_r are of order 1, and we impose a normalisation on them. In this basis, the teacher-student overlap R^{km} (17) is expressed in terms of the projections Γ^k_τ and ω̃^m_τ of the student and teacher weights onto the eigenvectors. Since ω̃^m_τ is a static variable, the time evolution of Γ^k_τ follows directly. When we compute the remaining sum over r, two types of terms appear, and we define d_τ = (c − b²)δ + b²ρ_τ. Putting everything together, we obtain the final evolution of Γ^k_τ, Eq. (B24). At this point, we note that the remaining averages appearing in this expression, such as E[λ^k g′(λ^k) g̃(ν^m)], depend only on subsets of the local fields {λ^{k=1,...,K}, ν^{m=1,...,M}}. As discussed above, the GEP guarantees that these random variables follow a multi-dimensional Gaussian distribution, so these averages depend only on the covariances of the local fields R^{km}, Q^{kℓ}, and T^{mn}. To simplify the subsequent equations, we will use the shorthand notation I_3(·, ·, ·) for the three-dimensional Gaussian averages, which was introduced by Saad & Solla [22] and which we discuss in the main text. To make this section self-contained, we recall that arguments passed to I_3 should be translated into local fields on the right-hand side by using the convention that the indices j, k, ℓ, ι always refer to student local fields λ^j, etc., while the indices n, m always refer to teacher local fields ν^n, ν^m. Similarly, having the index j as the third argument means that the third factor is g(λ^j), rather than g̃(ν^m) as in Eq. (B26). The average in Eq. (B26) is taken over a three-dimensional normal distribution with mean zero and covariance matrix Φ_3 (B27).

3. Dynamics of the teacher-student overlap R^{km}

We are now in a position to write the update equation for R^{km}, where we use that the ω̃^m_τ are static. When performing the last remaining sum over τ, two types of terms appear. First, there is the constant T̃, which depends only on the choice of the feature matrix F_{ir} and the teacher weights w*_{mr} and is thus a constant of the motion. The second type of term is a sum that cannot be reduced to a simple expression in terms of other order parameters. Instead, we are led to introduce the density r^{km}(ρ), Eq. (B31), where 1(·) is the indicator function which evaluates to 1 if the condition given as its argument is true, and to 0 otherwise. We take the limit ε_ρ → 0 after the thermodynamic limit. Then we can rewrite the order parameter R^{km} as an integral over the density r^{km}, weighted by the distribution p_Ω(ρ) of eigenvalues of the operator Ω_{rs}. If, for example, we take the elements of the feature matrix F_{ir} to be element-wise i.i.d. draws from the normal distribution with mean zero and unit variance, then the limiting density of eigenvalues of Ω is given by the Marchenko-Pastur law [66], with edges ρ_min = (1 − √δ)² and ρ_max = (1 + √δ)².
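As a numerical illustration of the last point, the following sketch builds Ω from an i.i.d. Gaussian feature matrix and compares its spectrum to the Marchenko-Pastur density with edges ρ_min = (1 − √δ)² and ρ_max = (1 + √δ)². The construction Omega = F Fᵀ/N and the ratio δ = D/N are our reading of the conventions above; the sizes are illustrative.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    N, D = 4000, 400
    delta = D / N                               # assumed ratio delta = D / N

    F = rng.standard_normal((D, N))
    Omega = F @ F.T / N                         # D x D; diagonal ~ 1, off-diagonal O(1/sqrt(N))
    eigs = np.linalg.eigvalsh(Omega)

    rho_min, rho_max = (1 - np.sqrt(delta))**2, (1 + np.sqrt(delta))**2
    rho = np.linspace(rho_min + 1e-6, rho_max - 1e-6, 500)
    mp = np.sqrt((rho_max - rho) * (rho - rho_min)) / (2 * np.pi * delta * rho)

    plt.hist(eigs, bins=50, density=True, alpha=0.5, label="spectrum of Omega")
    plt.plot(rho, mp, label="Marchenko-Pastur")
    plt.xlabel("rho"); plt.ylabel("p(rho)"); plt.legend(); plt.show()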
The update equation for r^{km}(ρ) can be obtained immediately by substituting the update equation for Γ^k_τ (B24) into its definition (B31). Finally, in the thermodynamic limit, the normalised number of steps t = P/N can be interpreted as a continuous time-like variable, and we recover the equation of motion for r^{km}(ρ), which we re-state here for ease of reading, with d(ρ) = (c − b²)δ + b²ρ. Note that while we have dropped the explicit time dependence from the right-hand side to keep the equation readable, all the order parameters on the right-hand side are explicitly time-dependent, i.e. Q^{jj} = Q^{jj}(t), r^{km}(ρ) = r^{km}(ρ, t), and the averages I_3(·) are also time-dependent via their dependence on the order parameters (see Eq. (B26) and the subsequent discussion). In order to close the equations of motion, we now need to find the equations for the order parameters that are quadratic in the weights.

Order parameters that are quadratic in the weights
There are two order parameters that appear when evaluating the covariance of the λ variables, W^{kℓ} and Σ^{kℓ}; we will look at both in turn now.

a. Equation of motion for W^{kℓ}

For the student-student overlap matrix W^{kℓ} we find, after some algebra, its update equation. For the terms linear in the learning rate η, we can immediately carry out the sum over i, which yields averages over local fields of the type already encountered. The term quadratic in the learning rate η requires the evaluation of a different type of term: the sum over i makes this second-order term contribute to the total variation of W^{kℓ} at leading order, and we are left with an average over four local fields, for which we introduce the short-hand I_4(·, ·, ·, ·), using the same notation as for I_3(·) (B26). This yields the full equation of motion for W^{kℓ}.

b. Equation of motion for Σ^{kℓ}

After rotating to the basis ψ^τ, the terms linear in η can be obtained directly by substituting the update equation for Γ^k_τ (B24), and they are similar to the update equation for r^{km}(ρ). As for the term quadratic in η, to leading order it involves the covariance of β_r. To deal with the remaining sum over τ, we again use the fact that the equation of motion for Σ^{kℓ} depends on the eigenvector index τ only through the eigenvalue ρ_τ. Introducing a density as we did for r^{km}(ρ) (B31), we obtain terms of the form
Σ_n ṽ^n [T^{nn} I_3(k, k, n) − R^{kn} I_3(k, n, n)] / [Q^{kk} T^{nn} − (R^{kn})²] − bρ v^k Σ_n ṽ^n r^{ℓn}(ρ) [Q^{kk} I_3(k, n, n) − R^{kn} I_3(k, k, n)] / [Q^{kk} T^{nn} − (R^{kn})²],
plus all of the above with ℓ → k, k → ℓ. Finally, we treat each of the second-layer weights v^ℓ of the student as an order parameter in its own right. Their equations of motion are readily found from their SGD update (B1b), and involve one final short-hand, I_2, for which we again use the notation introduced for I_3(·) (B26).

Generalisation error
Having introduced the short-hand for the integrals I_2(k, j) (B51), we realise that their form also enters the formula for the generalisation error ε_g(θ, θ̃). For example, for a student with g(λ^k) = erf(λ^k/√2) and a teacher with g̃(ν^m) = max(0, ν^m), we find a closed-form expression. We have now completed the programme that we embarked upon at the beginning of this Appendix B: we have derived a closed set of equations of motion for the teacher-student overlap R^{km} (B32, 29), the student-student overlap Q^{kℓ} = (c − b²) W^{kℓ} + b² Σ^{kℓ} (38, 34, 35), and the student's second-layer weights v^k (39). These equations give us complete access to the dynamics of a neural network performing online (one-pass) stochastic gradient descent on a data set generated by the hidden manifold model. We can now integrate these equations and substitute the values of the order parameters at any time into the expression for the generalisation error (26), thereby tracking the dynamics of the generalisation error at all times. We describe this procedure in more detail next.

8. Explicit form of the integrals I_3 and I_4 for sigmoidal students

The explicit forms of the integrals I_3 and I_4 that appear in the equations of motion for the order parameters and the generalisation error for networks with g(x) = g̃(x) = erf(x/√2) were first given by [21, 22]. Here we state them to make the paper as self-contained as possible. Denoting the elements of a covariance matrix such as Φ_3 (B27) by φ_ij, we have

    I_3(·, ·, ·) = (2/π) (1/√Λ_3) [φ_23 (1 + φ_11) − φ_12 φ_13] / (1 + φ_11),   (B54)

with

    Λ_3 = (1 + φ_11)(1 + φ_33) − φ_13².   (B55)
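Since the closed form (B54) for I_3 is what makes the numerical integration of the equations of motion fast, it is worth checking it against a direct Monte-Carlo estimate. The sketch below assumes the reading I_3 = E[g′(z_1) z_2 g(z_3)] for three zero-mean jointly Gaussian fields with covariance matrix Φ_3 and g(x) = erf(x/√2); this interpretation of the arguments, and the example covariance matrix, are our assumptions for illustration.

    import numpy as np
    from scipy.special import erf

    def g(x):                                   # sigmoidal activation used in the main text
        return erf(x / np.sqrt(2))

    def g_prime(x):                             # derivative of erf(x / sqrt(2))
        return np.sqrt(2 / np.pi) * np.exp(-x**2 / 2)

    def I3_analytic(phi):
        """Closed form (B54); phi is the 3 x 3 covariance matrix of the three local fields."""
        lam3 = (1 + phi[0, 0]) * (1 + phi[2, 2]) - phi[0, 2]**2
        return (2 / np.pi) * (phi[1, 2] * (1 + phi[0, 0]) - phi[0, 1] * phi[0, 2]) \
               / (np.sqrt(lam3) * (1 + phi[0, 0]))

    def I3_monte_carlo(phi, n=2_000_000, seed=0):
        """Direct estimate of E[g'(z1) z2 g(z3)] for zero-mean Gaussians with covariance phi."""
        z = np.random.default_rng(seed).multivariate_normal(np.zeros(3), phi, size=n)
        return np.mean(g_prime(z[:, 0]) * z[:, 1] * g(z[:, 2]))

    phi = np.array([[1.0, 0.3, 0.2],
                    [0.3, 0.8, 0.4],
                    [0.2, 0.4, 1.2]])           # an illustrative positive-definite covariance
    print(I3_analytic(phi), I3_monte_carlo(phi))   # the two values should agree closely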

Order parameters that are linear in the weights
To start with a variable that is linear in the weights, take the time-evolution of S^k_r. It is clear that the tensor structure of the result (B14) will involve quantities D^k and E^{km}, which are known and expressed in terms of the matrices Q, T, R, together with the operator Ω introduced above, which has diagonal elements equal to 1 and off-diagonal elements of order 1/√N. In particular, we can use this evolution to study the evolution of R. The point of this analysis is to show that the time evolution of S^k_r involves (ΩS)_r. Therefore, to know the evolution of S we need that of ΩS. This is not innocuous because, in order to have dynamical evolution equations with only "upper" indices, we need to contract it. The evolution of R^{km}, which is proportional to the scalar product (in the D-dimensional manifold space) of S^k and w̃^m, is thus given by the scalar product of ΩS^k and w̃^m.
It is not difficult to see that the evolution of ΩS will require knowing Ω²S, etc. We thus have an infinite hierarchy of coupled equations, which would be hard to analyse. Yet, we can find closed equations by changing the basis for S.

For the experiments demonstrating the learning of functions of increasing complexity discussed in Sec. V A, we constructed data sets for binary classification by splitting the image data sets as follows:
MNIST: even vs. odd digits;
Fashion MNIST: t-shirt/top, pullover, dress, sandal and bag vs. trouser, coat, shirt, sneaker, ankle boot;
CIFAR10: airplane, bird, deer, frog, ship vs. automobile, cat, dog, horse and truck.
[Figure caption: Mean-squared error as a function of training time for sigmoidal networks with increasing hidden-layer size, trained on the three data sets. Curves are averaged over ten runs starting from different initial weights; error bars indicate one standard deviation. For all plots, g(x) = erf(x/√2), Gaussian initial weights with std. dev. 10⁻³, batch size 32.]
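For concreteness, the binary splits listed above can be constructed along the following lines; the use of torchvision, the data path and the ±1 label convention are assumptions made for this sketch.

    import numpy as np
    from torchvision import datasets

    # MNIST: even vs. odd digits (binary labels in {-1, +1}, an assumed convention)
    mnist = datasets.MNIST(root="./data", train=True, download=True)
    x_mnist = mnist.data.numpy().reshape(len(mnist), -1) / 255.0
    y_mnist = np.where(mnist.targets.numpy() % 2 == 0, 1.0, -1.0)

    # CIFAR10: {airplane, bird, deer, frog, ship} vs. the rest
    cifar = datasets.CIFAR10(root="./data", train=True, download=True)
    positive = [0, 2, 4, 6, 8]                  # torchvision class indices of the first group
    y_cifar = np.where(np.isin(np.array(cifar.targets), positive), 1.0, -1.0)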
We first demonstrate in Fig. 10 that sigmoidal networks show the same learning of functions of increasing complexity discussed in Sec. V A for CIFAR10 when trained on MNIST or FMNIST. Note that for CIFAR10 in particular, we see the effects of over-training set in after several epochs, when the generalisation error starts to increase again (we use plain SGD without any explicit regularisation in these experiments).
We also repeated these experiments for ReLU networks with activation function g(x) = max(0, x). While the dynamics of ReLU students also show a progression from simple to more complex classifiers, the run-to-run fluctuations are much larger than for the sigmoidal students. This is true both quantitatively and qualitatively: for example, networks sometimes get stuck in strongly sub-optimal minima for a long time. Hence, plotting the mean trajectories is not very informative, since the standard deviations would be very large; instead, in Fig. 11 we show representative curves for individual runs of ReLU students for all three data sets and for online learning from a teacher with