Learning Interacting Theories from Data



I. INTRODUCTION
Models of physical systems are frequently described on the microscopic scale in terms of interactions between their degrees of freedom. Often one seeks to understand the collective behavior that arises in the system as a whole. The interactions can feature symmetries, such as spatial or temporal translation invariance. Prominent examples of such theories can be found in statistical physics and high energy physics, but also in neuroscience. The nature of the interactions is often derived as an approximation of a more complex theory.
The description of systems on the microscopic scale is key to their understanding. In the absence of an underlying theory, the inverse problem has to be solved: one needs to infer the microscopic model from measurements of the collective states. This is typically a hard problem. A recent route towards a solution comes from studies [1][2][3][4][5][6][7][8] that explore the link between the learned features of artificial neural networks and the statistics of the data they were trained on. This inspection yields insights both into the mechanisms by which artificial neural networks achieve stellar performance on many tasks and into the nature of the data. In this study, we make the link between learned parameters and data statistics explicit by studying generative neural networks.
Generative models learn the statistics which underlie the data they are trained on. As such they must possess an internal, learned model of the data which is encoded in the network parameters. In this work, we gain insights into the nature of the training data by extracting the model from the network parameters, thus bridging the gap between the learned model and its interpretation.
One class of generative models are invertible neural networks (INNs), also called normalizing flows. INNs are invertible mappings trained to approximate the unknown probability distribution of the training set [9,10]. They can be used to generate new samples from the same distribution as the training set, or to manipulate existing data consistent with the features of the training set (for example, transitions between images [11][12][13]). This is achieved by mapping the highly structured input data to a completely unstructured latent space. The model learned by the network is expressed through the inverse mapping, as this must generate all interactions in the data. However, the network mapping is typically high-dimensional and depends on many parameters, which does not allow for a direct interpretation.
In this work, we derive interpretable microscopic theories from trained INNs. We extract an explicit data distribution, formulated in terms of interactions, from the trained network parameters. These interactions form the building blocks of the microscopic theory that describes the distribution of the training data. Furthermore, the process of extracting the microscopic theory makes the relation between the trained network parameters and the learned theory explicit. We show how interactions are hierarchically built through the composition of the network layers. This approach provides an interpretable relation between the network parameters and the learned model.
FIG. 1. Learning actions from data. We observe a physical system of interacting degrees of freedom (gray dots), whose precise interactions are unknown (shaded areas). We train a neural network on measurements of the system. The network learns, in an unsupervised fashion, an estimate of the distribution of the training data. We extract the action from the network parameters layer by layer, using a diagrammatic language. The final action coefficients A^(k) represent the learned interactions (pink nodes).

We illustrate and test this framework on several examples where the underlying theory is exactly known. We find that the networks are able to learn nontrivial interacting theories. Furthermore, we show that theories with higher-order interactions emerge as the network depth increases. Thus, we show how to leverage the power of machine learning to extract interacting models from data.
Solving inverse problems is a well-known challenge, to which several approaches have been developed. Previous approaches either rely on prior knowledge of dynamical systems or stochastic processes, such as specifics of basis functions [14][15][16][17][18] or the form of update equations [19][20][21][22], or they are restricted to pairwise couplings between discrete variables [23][24][25][26][27][28][29][30][31][32], or they require particular symmetries and specific approximation schemes to infer higher-order couplings [33,34]. In contrast, our approach considers continuous variables, is not restricted to pairwise interactions, and does not require prior knowledge of the interaction structure.

This paper is structured as follows: in Section II we introduce the action as the central object of a theory. We then describe how to extract the action from a trained INN in Section III. Subsequently, we test this framework in several settings in Section IV. Finally, we summarize and discuss the main findings, compare our work to different previously proposed inference schemes, and provide an outlook on how to extend the framework in Section V.

II. ACTIONS IN PHYSICS
Physical theories are often formulated in terms of microscopic interactions of many constituents, the degrees of freedom {x_i}_{1≤i≤d}, where d is the number of constituents and x describes the system's state. The degrees of freedom {x_i}_{1≤i≤d} can be Ising spins, firing rates of neurons, social agents, or field points. The interactions of the constituents provide a mechanistic understanding of the system. The energy or Hamiltonian H of the system can be written as the sum of these interactions. One can then ask how probable it is to observe the system in a specific microscopic state x given an average energy ⟨H⟩. The most unbiased (maximum entropy) estimate of the probability density is then

p_X(x) = Z^{-1} exp(−βH(x)),

with Z the normalization factor or partition function.
Here p_X is also known as the Boltzmann distribution [35]. In statistical physics, the prefactor β is identified as the inverse temperature.
The action S_X of this system is defined as the log probability, hence S_X(x) = ln p_X(x) = −βH(x) − ln Z. Therefore the system is fully characterized by S_X, and measurements of the system state x correspond to drawing samples from p_X. Furthermore, up to the constant prefactor −β and the constant ln Z, the terms in the action are the same interaction terms as those in the Hamiltonian.
Consider an action of the form

S_X(x) = ∑_k ∑_{i_1,...,i_k} A^(k)_{i_1...i_k} x_{i_1} ⋯ x_{i_k},    (1)

where the coefficients A^(k) are tensors of rank k and dimension d. Without loss of generality we choose the A^(k) to be symmetric tensors, A^(k)_{i_1,...,i_k} = A^(k)_{P(i_1,...,i_k)} for any permutation P of the indices. These coefficients encode the coordination between the different degrees of freedom x_i. In general, we refer to a term of type A^(k)_{i_1,...,i_k} x_{i_1} ⋯ x_{i_k} as a k-point interaction, since this term describes a coaction of k degrees of freedom for i_1, ..., i_k ∈ {1, ..., d} all unequal. In this work, we focus exclusively on actions of the form of Eq. (1), which are suitable only for describing classical fields x_i, as the fields and the action coefficients are tensors and scalars, not operators. However, we are not limited to equilibrium statistical mechanics: the samples could as well stem from a time-dependent process; in this case, the action describes the measure on a path, which is allowed to come from a non-equilibrium system. Furthermore, even for quantum systems where the interactions are exactly known, a renormalized classical theory is sometimes sought to effectively describe the influence of quantum fluctuations [36,37].
Notation. In the following, we use the notation u^⊗k for the outer product of k instances of a tensor u, and A^(k) ⋅ (u)^⊗l for l ≤ k to denote the contraction of the first l indices of a rank-k tensor A^(k) with the first index of each tensor u:

(A^(k) ⋅ u^⊗l)_{β_1,...,β_l,i_{l+1},...,i_k} = ∑_{i_1,...,i_l} A^(k)_{i_1,...,i_k} u_{i_1,β_1} ⋯ u_{i_l,β_l},

with multi-indices β_1, ..., β_l whose rank depends on the rank of u. In the special case that u is a vector, the indices β_1, ..., β_l vanish from the expression. If additionally k = l, the result is a scalar. Hence, Eq. (1) becomes

S_X(x) = ∑_k A^(k) ⋅ x^⊗k.

We symmetrize a tensor by averaging over the set P(α) of all permutations of the multi-index α,

Sym(A^(k))_α = |P(α)|^{-1} ∑_{α' ∈ P(α)} A^(k)_{α'};

this operation does not change the result of polynomial contractions: Sym(A^(k)) ⋅ x^⊗k = A^(k) ⋅ x^⊗k. A typical objective in statistical physics is understanding how the microscopic interactions A^(k) determine the macroscopic properties of the system. In this work, we take a different approach: given samples from a system with unknown microscopic properties, we extract the interactions. Generative models such as INNs are a powerful tool to approximate data distributions p_X [9,10,13,38]. In the next section, we demonstrate how we can extract the interaction coefficients A^(k) from trained networks.
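As a minimal sketch of this notation, the contraction A^(k) ⋅ u^⊗l (for vector u) and the symmetrization can be written in a few lines of NumPy; the function names `contract` and `symmetrize` are our own, chosen for illustration:

```python
import itertools
import numpy as np

def contract(A, u, l):
    # Contract the first l indices of A with the vector u, one factor
    # of u at a time; each contraction removes one index of A.
    out = A
    for _ in range(l):
        out = np.tensordot(u, out, axes=([0], [0]))
    return out

def symmetrize(A):
    # Average over all permutations of the tensor's indices.
    perms = list(itertools.permutations(range(A.ndim)))
    return sum(np.transpose(A, p) for p in perms) / len(perms)

rng = np.random.default_rng(0)
A3 = rng.normal(size=(4, 4, 4))  # a rank-3 coefficient A^(3)
x = rng.normal(size=4)

s = contract(A3, x, 3)                  # A^(3) . x^{ox3}: a scalar
s_sym = contract(symmetrize(A3), x, 3)  # symmetrization leaves the value unchanged
```

The last two lines illustrate the stated property that symmetrization does not change polynomial contractions.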

III. LEARNING ACTIONS WITH INVERTIBLE NEURAL NETWORKS
Invertible neural networks are used to learn the data distribution p_X from a data set D of training samples [9]. They implement a bijective mapping f_θ: R^d → R^d, where the network output z = f_θ(x) is often referred to as the latent variable. The parameters θ of the network are trained such that the latent variables follow a prescribed target distribution. We follow a common choice for this latent distribution as a set of d uncorrelated centered Gaussian variables with unit variance [9],

p_Z(z) = (2π)^{−d/2} exp(−z^T z / 2),    (2)

where z^T z = ∑_{i=1}^d z_i^2 denotes the Euclidean scalar product. Given p_Z, the probability assigned to a specific input is given by the change of variables formula

p_θ(x) = p_Z(f_θ(x)) |det J_{f_θ}(x)|,    (3)

which depends only on the network mapping f_θ and its Jacobian J_{f_θ}. The training objective is to minimize the negative log-likelihood

L(θ) = −∑_{x ∈ D} ln p_θ(x)
     = −∑_{x ∈ D} S_θ(x),    (4)

where in the second line we used that ln p_θ is precisely the action S_θ of the learned distribution.
In this manner, we optimize Eq. (3) to approximate the unknown underlying data distribution using stochastic gradient descent. Since the target distribution, Eq. (2), is a set of independent Gaussians, the mapping f_θ: x ↦ z of the network aims to eliminate cross-correlations and higher-order dependencies between the components of the latent variable. On the level of the action, this means that all k-point interactions with k ≥ 3 between components of z must vanish. In turn, the inverse mapping defines a generative process that induces interactions in the learned distribution from a non-interacting latent theory.
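The change of variables formula, Eq. (3), and the resulting objective, Eq. (4), can be sketched as follows, using a single linear map z = W x + b as a hypothetical stand-in for the full network f_θ (function names are ours):

```python
import numpy as np

def log_p_Z(z):
    # Standard-normal latent log-density, Eq. (2).
    d = z.shape[-1]
    return -0.5 * np.sum(z**2, axis=-1) - 0.5 * d * np.log(2 * np.pi)

def learned_action(X, W, b):
    # S_theta(x) = ln p_theta(x) via the change of variables formula:
    # ln p_theta(x) = ln p_Z(f_theta(x)) + ln |det J_f(x)|.
    # For this toy linear "network" f(x) = W x + b the Jacobian is constant.
    Z = X @ W.T + b
    log_det_J = np.log(np.abs(np.linalg.det(W)))
    return log_p_Z(Z) + log_det_J

def nll(X, W, b):
    # Training objective: negative mean log-likelihood, i.e. -<S_theta(x)>.
    return -np.mean(learned_action(X, W, b))
```

For W = 𝟙 and b = 0 the learned density reduces to the latent one, which serves as a quick sanity check of the formula.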
We now define the architecture that allows us to obtain a polynomial action S_θ from the network parameters θ. The network is composed of multiple layers l; every layer maps its input to its output as

x_{l+1} = f_l(x_l).

For each layer, we define an output action S_{X,l+1}, which transforms into the input action S_{X,l} via the change of variables formula

S_{X,l}(x_l) = S_{X,l+1}(f_l(x_l)) + ln |det J_{f_l}(x_l)|.    (5)

Using Eq. (3) we compute the input action S_{X,l} of each layer given the output action S_{X,l+1}, starting with the polynomial action of the latent variable, S_Z(y) = ln p_Z(y). We construct all layer mappings f_l such that a polynomial action S_{X,l+1} generates a polynomial action S_{X,l} in Eq. (5); by induction we thus obtain a polynomial learned input action S_θ. Each layer mapping f_l is composed of a linear mapping L_l and a nonlinear mapping φ_l:

f_l = φ_l ∘ L_l.

In the overall architecture, linear and nonlinear mappings are therefore stacked alternately (see Fig. 2). After the last nonlinear transform φ_L, we add an additional linear transform L_{L+1}, such that the network architecture begins and ends with a linear transform. The transform of the action via a single layer, Eq. (5), similarly decomposes into two steps, with h_l = L_l(x_l) the intermediate activation:

S_{H,l}(h_l) = S_{X,l+1}(φ_l(h_l)) + ln |det J_{φ_l}(h_l)|,
S_{X,l}(x_l) = S_{H,l}(L_l(x_l)) + ln |det J_{L_l}(x_l)|.

The transformations from S_{X,l+1} to S_{H,l} and from S_{H,l} to S_{X,l} are therefore determined by φ_l and L_l, respectively, which we express schematically as

S_{X,l+1} --φ_l--> S_{H,l} --L_l--> S_{X,l}.    (9)

The remainder of this section is concerned with expressing Eq. (9) in terms of transforms of the action coefficients: B^(k)_l denotes the rank-k coefficient of the action S_{H,l} of the pre-activations h_l in layer l, and A^(k)_l the rank-k coefficient of the action S_{X,l} prior to linear layer l.
Since the actions are polynomials, at each step we can write S_{X,l} and S_{H,l} in terms of coefficients,

S_{X,l}(x) = ∑_{k=0}^{K_l} A^(k)_l ⋅ x^⊗k,    S_{H,l}(h) = ∑_{k=0}^{K_l} B^(k)_l ⋅ h^⊗k,

where the sum runs over the ranks k of the coefficient tensors. Here K_l is the degree of the polynomial in h_l, which depends on the layer index l. We show in the following that the degree of the polynomial does not change between S_{X,l} and S_{H,l} in the latter step of Eq. (9). The coefficients further uniquely determine the action. Therefore Eq. (9) is equivalent to the coefficient mapping

{A^(k)_{l+1}}_k --φ_l--> {B^(k)_l}_k --L_l--> {A^(k)_l}_k.

In Fig. 2 we illustrate the order in which the coefficients are transformed. We now derive the recursive equations for the coefficient transforms, beginning with the linear mapping.
Linear mapping. The linear mapping is given by

L_l(x_l) = W_l x_l + b_l,    (13)

with W_l, b_l ∈ θ. Combining the change of variables formula with Eq. (13) yields

S_{X,l}(x_l) = ∑_{k=0}^{K_l} B^(k)_l ⋅ (W_l x_l + b_l)^⊗k + ln |det W_l|.    (14)

We find the transformed coefficients A^(k)_l by expanding Eq. (14) and ordering the terms by powers of x_l. Higher-rank coefficients of S_{H,l} contribute to lower-rank coefficients of S_{X,l} via the contraction with the bias b_l; all remaining indices of B^(k)_l must then be contracted with the first index of W_l:

A^(k)_l = ∑_{k'=0}^{K_l − k} (k+k' choose k') (B^(k+k')_l ⋅ b_l^⊗k') ⋅ W_l^⊗k + δ_{k,0} ln |det W_l|.    (15)

The combinatorial factor (k+k' choose k') arises due to the symmetry of the coefficients B^(k): since they are symmetric under permutations of the indices, we only need to fix the number of contractions k' with the bias term b_l and count all (k+k' choose k') possible combinations of k' indices in B^(k+k'). Since L_l is linear, S_{X,l} and S_{H,l} both have the same degree K_l.
We illustrate Eq. (15) for the final linear mapping of the network, the first coefficient transform, which starts from the known latent-space coefficients {B^(k)_{L+1}}_k (on the far right in Fig. 2). The action of the latent distribution given in Eq. (2) describes a centered Gaussian with unit covariance; its first three coefficients are therefore B^(0)_{L+1} = −(d/2) ln(2π), B^(1)_{L+1} = 0, and B^(2)_{L+1} = −(1/2) 𝟙, with all coefficients of rank k ≥ 3 being zero.
The transformed zeroth-rank coefficient A^(0)_{L+1}, which ensures that the action stays normalized, reads

A^(0)_{L+1} = B^(0)_{L+1} + B^(2)_{L+1} ⋅ b_{L+1}^⊗2 + ln |det W_{L+1}|.    (16)

The bias in the linear mapping shifts the mean of the probability distribution, which is induced by the first-order coefficient

A^(1)_{L+1} = 2 (B^(2)_{L+1} ⋅ b_{L+1}) ⋅ W_{L+1} = −W_{L+1}^T b_{L+1}.    (17)

The second-rank coefficient is transformed in accordance with the rotation and scaling of the space due to W_{L+1}:

A^(2)_{L+1} = B^(2)_{L+1} ⋅ W_{L+1}^⊗2 = −(1/2) W_{L+1}^T W_{L+1}.    (18)

In the case of actions with higher-order interactions k ≥ 3, the coefficient transform Eq. (15) requires the computation of many more terms. To simplify the coefficient transforms, we therefore employ a diagrammatic language. A common practice in statistical physics is to represent k-th order interactions as vertices with k legs [35]. Accordingly, we here express tensors of rank k as vertices with k legs and the contraction between tensors by an attachment of legs. This facilitates the computation of combinatorial factors, which can be read off from the diagram topology. The diagrammatic representations of Eqs. (16), (17), and (18) depict A^(0), A^(1), and A^(2) as vertices whose legs are attached to b_{L+1} and W_{L+1}. A complete presentation of the diagrammatic method is provided in Appendix A.
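The coefficient transforms for the Gaussian latent action can be checked numerically against a direct evaluation of the transformed action; the following sketch assumes random stand-ins for the trained W_{L+1} and b_{L+1}:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
W = rng.normal(size=(d, d))  # stand-in for the trained W_{L+1}
b = rng.normal(size=d)       # stand-in for the trained b_{L+1}

# Latent Gaussian coefficients: B^(1) = 0, B^(2) = -1/2 * identity.
B0 = -0.5 * d * np.log(2 * np.pi)
B2 = -0.5 * np.eye(d)

# Transformed coefficients after the final linear layer:
log_det = np.log(np.abs(np.linalg.det(W)))
A0 = B0 + b @ B2 @ b + log_det   # rank 0: keeps the action normalized
A1 = 2.0 * W.T @ (B2 @ b)        # rank 1: bias shifts the mean
A2 = W.T @ B2 @ W                # rank 2: rotation and scaling by W

# Cross-check against a direct evaluation of the action
# S(x) = ln p_Z(W x + b) + ln |det W| at a random point.
x = rng.normal(size=d)
z = W @ x + b
S_direct = -0.5 * d * np.log(2 * np.pi) - 0.5 * z @ z + log_det
S_coeff = A0 + A1 @ x + x @ A2 @ x
```

By construction, S_direct and S_coeff agree for every x, confirming that the three coefficients capture the full transformed action.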
Nonlinear mapping. We follow Dinh et al. [9] to define an invertible nonlinear mapping: we split the activation vectors and activation functions into two halves, denoting them by h_l = (h_l^1, h_l^2). The first half is passed on to the next layer unchanged; we then add a nonlinear function φ̂_l(h_l^1) of the first half onto the second half,

φ_l(h_l) = (h_l^1, h_l^2 + φ̂_l(h_l^1)).    (22)

We choose a quadratic nonlinearity,

φ̂_l(h_l^1) = χ̂_l ⋅ (h_l^1)^⊗2,    (23)

where χ̂_l is a third-order tensor of dimensions ⌈d/2⌉ × ⌊d/2⌋ × ⌊d/2⌋ whose coefficients are trained, χ̂_l ∈ θ. In the following, we will use the shorthand notation φ_l(h_l) = h_l + χ_l ⋅ (h_l)^⊗2, with χ̂_l being the non-zero part of χ_l.
Equation (23) is the most elementary nonlinearity which is compatible with a polynomial action. Through the composition of L layer transforms f_l, the network mapping becomes a polynomial of order 2^L. The advantage of decomposing such a transform in terms of multiple applications of Eqs. (13) and (22) is a particularly simple form of the update equations for the coefficients.
Splitting the nonlinear mapping (22) makes it trivially invertible,

φ_l^{−1}(h_l) = (h_l^1, h_l^2 − φ̂_l(h_l^1)),    (24)

as one can verify by evaluating the composition of Eqs. (22) and (24) on an arbitrary vector h_l.
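The coupling layer of Eqs. (22)-(24) and its exact invertibility can be sketched as follows; the even dimension and the function names are our own choices for illustration:

```python
import numpy as np

def phi_forward(h, chi_hat):
    # Additive coupling, Eq. (22): the first half h1 passes through unchanged;
    # a quadratic function chi_hat . h1^{ox2}, Eq. (23), is added to the second half.
    d2 = h.shape[0] // 2
    h1, h2 = h[:d2], h[d2:]
    return np.concatenate([h1, h2 + np.einsum('ijk,j,k->i', chi_hat, h1, h1)])

def phi_inverse(g, chi_hat):
    # Inverse, Eq. (24): subtract the same quadratic function of the first half.
    d2 = g.shape[0] // 2
    g1, g2 = g[:d2], g[d2:]
    return np.concatenate([g1, g2 - np.einsum('ijk,j,k->i', chi_hat, g1, g1)])

rng = np.random.default_rng(2)
d = 6
# Trained third-order tensor; small entries, as observed in trained networks.
chi_hat = 0.1 * rng.normal(size=(d - d // 2, d // 2, d // 2))
h = rng.normal(size=d)
h_rec = phi_inverse(phi_forward(h, chi_hat), chi_hat)  # exact round trip
```

The round trip recovers h to machine precision, with no numerical inversion required.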
We compute the action S_{H,l} from S_{X,l+1} using the change of variables formula. Since the Jacobian J_{φ,l} of φ_l is a triangular matrix with ones on the diagonal, we have ln det J_{φ,l} = 0. Therefore, the transform of the action induced by φ_l is just the composition

S_{H,l}(h_l) = S_{X,l+1}(φ_l(h_l)).    (25)

Equation (25) yields a polynomial of order K_l = 2 K_{l+1}. We expand the products in Eq. (25) and reorder the terms to obtain the transform of the action coefficients.
Since each factor of χ_l increases the rank of the resulting tensor by one, lower-order coefficients A^(k−k')_{l+1} contribute to the coefficients B^(k)_l: each contraction with χ_l removes one index of the tensor it is contracted with and the first index of χ_l, but adds two indices to the resulting tensor. As a result, for each χ_l in the contraction, the rank is raised by one. Therefore k' factors of χ_l are needed to increase the rank from k − k' to k of the tensor to which the factors of χ_l are contracted. However, contractions of this type are no longer symmetric tensors, because the resulting k-th order tensor has 2k' indices stemming from χ_l and the remaining ones from A^(k−k')_{l+1}. To express this result as a symmetric tensor, we symmetrize the result. This yields

B^(k)_l = ∑_{k'=0}^{⌊k/2⌋} (k−k' choose k') Sym( A^(k−k')_{l+1} ⋅ χ_l^⊗k' ).    (26)

Diagrammatically, the contraction with χ_l is represented by splitting the legs of a vertex. By counting the number of splits we can therefore infer the number of factors of χ_l in any diagram. To illustrate this, we here show the mapping {A^(k)_{L+1}}_{k≤2} → {B^(k)_L}_{k≤4}, which generates interactions up to the fourth order. The zeroth- and first-rank coefficients remain unchanged, B^(0)_L = A^(0)_{L+1} and B^(1)_L = A^(1)_{L+1}, since A^(0)_{L+1} has no legs to split; the splitting of the leg in A^(1)_{L+1} gives a contribution to B^(2)_L. This diagrammatic expression corresponds to

B^(2)_L = A^(2)_{L+1} + Sym( A^(1)_{L+1} ⋅ χ_L ).

No further diagrams are generated from A^(1)_{L+1}, as all legs are split. The higher-order interactions B^(3)_L and B^(4)_L arise from splitting the legs of A^(2)_{L+1}. In tensor notation, the same expressions read

B^(3)_L = 2 Sym( A^(2)_{L+1} ⋅ χ_L ),
B^(4)_L = Sym( A^(2)_{L+1} ⋅ χ_L^⊗2 ).

This exemplifies how the interactions are built hierarchically, as further layer transforms contribute contractions with W_{l≤L}, b_{l≤L}, and χ_{l<L} to previous coefficients. See Appendix A for further details.
The degree K_l of the action doubles with each layer, starting from the output action with degree two. Networks of depth L thus generate actions of degree 2^{L+1}. Through the composition of several nonlinear mappings like Eq. (22), the prefactors χ_l of later layers are exponentiated alongside their activations. Increasing the number of layers will therefore generate terms of arbitrarily high degree, both in x and in the prefactors; the contribution of χ_1 to x_L, for example, appears at exponentially high order in the depth. As a result, large values in χ_l are unfavorable, as they make the activations diverge. In practice, we find that the entries of the tensors χ_l of trained networks are typically small, (χ_l)_{ijk} ≪ 1.
We note that all coefficients with rank k > 2 must contain at least k − 2 factors of χ_l, where the different factors in general originate from different layers. These terms of rank k > 2 constitute the non-Gaussian part of the action. Therefore, for applications where the data can be described as a perturbed Gaussian, small entries in χ_l are sufficient. The higher the rank of the coefficient, the smaller its entries. Consequently, we place a cutoff of two on the number of factors of χ_l in the action coefficients, thus ignoring negligible contributions. This effectively imposes a maximum rank of k = 4 on the action coefficients.
Coefficients of high rank can be numerically intractable for large dimension d, as their size grows as O(d^k). To mitigate this, we make use of Eq. (23) to write the coefficients in a decomposed form, focusing on the most significant contributions, which speeds up the computations and allows for tractable reductions in the size of the stored tensors. We specify this decomposition in Appendix B.
We began this section by equating the transform of the action through the network to the transform of its coefficients, decomposed as alternating linear and nonlinear transforms. We then made these transforms explicit in Eqs. (15) and (26). Given a trained network, these expressions allow us to extract the learned action through the iterative application of the coefficient transforms from the last layer to the first. In this way, we can describe the data distribution constructively, by tracking how the latent distribution is transformed through successive network layers.
Equation (26) shows how higher-rank coefficients hierarchically emerge through repeated contractions with the parameters χ_L, ..., χ_1 of the nonlinear mappings and W_L, ..., W_1, b_L, ..., b_1 of the linear mappings of different layers. In the data space, these coefficients correspond to interactions; this approach therefore establishes an explicit relation between the network parameters θ and the characteristic properties of the learned distribution p_θ. In the next section, we test this method in several cases with known ground-truth distributions.

IV. EXPERIMENTS
In this section, we test the learning of actions in three different settings. In Section IV A, we use a randomly initialized teacher network to generate samples. The teacher network has the same architecture as the one described in Section III, enabling us to compute the ground-truth action. In Section IV B, the ground-truth action coefficients themselves are generated randomly, leading to multimodal data distributions. Finally, in Section IV C, we study a physics-inspired model system with interactions on a square lattice of d = 10^2 sites.

A. In-class distributions
First, we test whether we can recover the action coefficients of a known action from samples drawn from the respective distribution. We initialize a teacher network with random weights and compute the corresponding action coefficients {T^(k)}_{k≤4} with the method outlined in Section III. We then generate a training set D by sampling Gaussian random variables z ∼ p_Z and applying the inverse network transform of the teacher to them; the elements of D are therefore samples from the teacher distribution. Since the teacher distribution is by construction part of the set of learnable student distributions, we refer to this as an in-class distribution. We then initialize a student network to the identity, W_l = 𝟙, b_l = 0, χ_l = 0 ∀l, and train it on the training set D. This choice of initialization ensures that all variability in the trained result is due to the training data and the sequence of random batches drawn during training. After training, we extract the student coefficients {A^(k)}_{k≤4} as described in Section III. The student network has learned the teacher distribution if the associated action coefficients match,

T^(k) = A^(k) ∀k.

Note that T^(k) = A^(k) ∀k does not imply that the parameters θ of the teacher and student network are equal. Due to the rotational invariance of the latent space, an additional linear transform which rotates the latent space z does not result in a change in the learned action. Hence T^(k) = A^(k) ∀k implies only that the two networks learn the same statistics.
We compare teacher and student coefficients in Fig. 3 for two different training set sizes |D|. For a sufficiently large data set, the student can learn the teacher coefficients arbitrarily well: in Fig. 3 (a-d), the coefficient entries coincide while the network parameters θ do not align (see Appendix C for a comparison between network parameters). This confirms that the extracted coefficients, as opposed to the parameters, are indeed characteristic of what the network has learned. Given sufficient samples, we therefore find that the method recovers the correct action coefficients.
For a smaller training set with |D| = 10^3, we find that the student network overfits the training set. To see this, we compute the test loss L_test(θ) on a test set of 10^4 samples. In Fig. 3 (e) the test loss is significantly larger than the training loss. This is reflected in deviating coefficient entries in Fig. 3 (a-d). To quantify this disparity, we compute the cosine similarity between the tensors,

cos(T^(k), A^(k)) = (∑_α T^(k)_α A^(k)_α) / sqrt( ∑_α (T^(k)_α)^2 ∑_α (A^(k)_α)^2 ),

where the sum runs over all independent indices α, excluding duplicate tensor entries which are equal due to the symmetry. For perfect alignment, the cosine similarity is equal to one. Fig. 3 (f) shows how the cosine similarity between teacher and student coefficients increases with the training set size. The lower-order coefficients T^(k≤2) are approximated well even in the case of little training data. The learned higher-order coefficients clearly deviate from the teacher coefficients in this case (see Fig. 3 (c,d,f)), indicating that the higher-order coefficients, corresponding to higher-order interactions in the teacher distribution, can only be conveyed through larger data sets.
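A sketch of this similarity measure, summing only over the independent entries of the symmetric tensors (the function name is ours):

```python
import itertools
import numpy as np

def cosine_similarity(T, A):
    # Sum only over independent (sorted) index tuples, so entries that are
    # equal by symmetry are not counted multiple times.
    idx = list(itertools.combinations_with_replacement(range(T.shape[0]), T.ndim))
    t = np.array([T[i] for i in idx])
    a = np.array([A[i] for i in idx])
    return (t @ a) / np.sqrt((t @ t) * (a @ a))
```

For identical coefficients (or coefficients differing only by a positive scale), the similarity evaluates to one, as stated above.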
Learning rules in coefficient space. Higher-order interactions depend on higher-order statistics of the data distribution, which must be expressed through a limited number of samples. We train the network using stochastic gradient descent (SGD), which updates all parameters of the network at training time t according to the dependence of the loss L on a subset of training data D_t ⊂ D. In SGD, the update of a single weight ∆θ_i = θ_i(t+1) − θ_i(t) is

∆θ_i = −η ⟨∂L/∂θ_i⟩_{D_t},

where η is the learning rate and ⟨⋅⟩_{D_t} denotes the average over the current training batch D_t. In Appendix D, we show that this leads to a noisy update in the coefficients,

∆A^(k)_θ ∝ ⟨x^⊗k⟩_D − ⟨x^⊗k⟩_A + ξ^(k)_t,    (32)

where ⟨⋅⟩_D is the empirical average over all samples in the full training set D, and ⟨⋅⟩_A is the expectation with regard to the current estimate of the density depending on the learned coefficients {A^(k)}_k. The random variable ξ^(k)_t encodes the difference between the mean estimated on the whole training set and a training batch. One can show that on average over all batches the noise vanishes, ⟨ξ^(k)⟩ = 0, and the variance also decreases with the training set size (see e.g. [39]). Smaller batch sizes |D_t| therefore increase the noise in the updates of the action coefficients. The expected update ⟨∆A^(k)_θ⟩ vanishes on average over all batches when the learned moments and the moments on the training set match: ⟨x^⊗k⟩_D = ⟨x^⊗k⟩_A. However, for any finite training set, there will furthermore be a deviation between ⟨x^⊗k⟩_D and the true moment ⟨x^⊗k⟩_T of the teacher network. This expected difference scales as O(|D|^{−1/2}). Therefore, not only will there be a batch-size-dependent variability in the coefficient updates, but also a bias induced by the limited amount of training data. This drift introduces a bias in the training, leading to overfitting. For many distributions, both ⟨x^⊗k⟩_D − ⟨x^⊗k⟩_T and ⟨⟨ξ^(k)⟩⟩ increase with k [40,41]. In this case, both the bias and the variability of the training procedure increase with k, explaining why it is harder to learn higher-order statistics.

B. Out-of-class distributions
In the previous section, we have demonstrated that invertible networks accurately learn any distribution generated by the image set of inverse mappings f −1 θ .However, it is interesting to investigate how our approach deals with distributions outside this set.A first step to this end lies in understanding the nature of mappings employed by the invertible network.
The proposed network architecture belongs to the class of volume-preserving networks [9,10]: the additive nature of the nonlinearity in Eq. (22) leads to a constant Jacobian determinant det J_{f_θ}(x). The Jacobian determinant of a mapping states how the image of the mapping is locally stretched. Since the only contribution to det J_{f_θ}(x) comes from the linear transforms, Eq. (13), the Jacobian determinant is constant, and this stretch is therefore homogeneous everywhere. As a result, an invertible network with det J_{f_θ}(x) = const. and a Gaussian target distribution p_Z can only learn unimodal distributions p_θ, i.e. distributions with only one maximum. To see this, we compute the gradient of Eq. (3) with respect to x:

∇_x ln p_θ(x) = J_{f_θ}(x)^T (∇_z ln p_Z)(f_θ(x)),

since ln |det J_{f_θ}| is constant. Because J_{f_θ}(x) is invertible, this shows that the learned input distribution p_θ(x) has an extremum at x_0 if and only if the target distribution has an extremum at f_θ(x_0). Since p_Z has a single extremum, so does p_θ. This limitation is not unique to the choice of polynomial activations, but a consequence of the special structure of the Jacobian of the nonlinearity Eq. (22) and the latent distribution p_Z.
Since not every distribution is unimodal, we discuss in Section V how to extend the framework introduced here to incorporate multimodal distributions. An effective unimodal model, however, may also prove useful: while volume-preserving invertible networks with a unimodal latent distribution p_Z cannot learn a multimodal distribution p_θ exactly, they can learn an approximation. In this section we show how volume-preserving networks can therefore be used to extract an effective theory in the multimodal case.
To avoid tying results to a particular choice of distribution, we generate actions S_R with random coefficients {R^(k)}_{k≤3}. A diagonal negative action coefficient R^(4) is then added to obtain a normalizable distribution. We ensure that the corresponding distributions are multimodal and that their terms are balanced in strength using a sampling method for the coefficients that is detailed in Appendix E. Although the action S_R is an unnormalized log-probability, we can still sample a training set D using Markov chain Monte Carlo; for this work we used a Hamiltonian Monte Carlo [42][43][44] sampler implemented in PyMC3 [45] (for details see Appendix F). Given a known random action S_R and a corresponding training set D of generated samples, we train networks of different depths and compare their action coefficients. In Fig. 4 (a) we show a two-dimensional example of such a randomly generated distribution as well as the learned unimodal approximation. Fig. 4 (b)-(d) shows comparisons of the true and the learned coefficients in the d = 10 dimensional case. As expected, some action coefficients cannot be learned correctly. The largest deviations from the true coefficients occur in the diagonal entries. However, we observe that many action coefficient entries with at least two different indices (shown as A^(2) offdiag. in Fig. 4 (d)) recover approximately the correct value.
Using the network to generate samples, which hence belong to the learned distribution p_θ, we compare the cumulants of the two distributions. The cumulants of a distribution can be computed from its moments and vice versa. For example, the first three cumulants are equal to the mean, the variance, and the centered third moment of a distribution. Cumulants are better suited than moments to compare two different distributions because they contain only independent statistical information. We distinguish k-th order cumulants from moments by using single brackets ⟨x^k⟩ for moments and double brackets ⟨⟨x^k⟩⟩ for cumulants. Despite the disparity in the coefficients, we find that the cumulants agree up to third order (compare panels (e)-(g) in Fig. 4). The learned distribution is therefore an effective theory that reproduces the statistics of the system beyond the Gaussian order, since in a Gaussian model the third-order cumulants are zero. Equation (32) shows that the action coefficient A^(k) converges in expectation either when the moments of the training set and the learned distribution coincide, ⟨x^⊗k⟩_D = ⟨x^⊗k⟩_A, or when the network cannot tune the coefficients in the relevant direction. Therefore the training aims to match the moments ⟨x^⊗k⟩_D and ⟨x^⊗k⟩_A (and thereby the cumulants) within the bounds of the flexibility allowed by the network architecture. We find that the higher-order cumulants are learned later in training (see Fig. 4 (h)).
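The sample estimates of the first three cumulants used in such a comparison reduce to the mean, the variance, and the centered third moment, e.g.:

```python
import numpy as np

def first_three_cumulants(x):
    # Per-component cumulants of samples x with shape (n_samples, d):
    # <<x>> = mean, <<x^2>> = variance, <<x^3>> = centered third moment.
    mu = x.mean(axis=0)
    xc = x - mu
    return mu, (xc**2).mean(axis=0), (xc**3).mean(axis=0)
```

For Gaussian samples the third cumulant vanishes (up to sampling noise), which is the sense in which nonzero third cumulants probe the statistics beyond the Gaussian order.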
Multimodality often appears as a result of symmetry breaking. Consider the classical example of an Ising model [46]. The action of this system is symmetric under a global flip of all spins. Below the critical temperature, two modes appear, one for positive and one for negative net magnetization. However, in a physical system, this multimodality cannot be observed, because the probability of a global sign flip approaches zero as the system size increases. Furthermore, external factors such as coupling to the environment or a measurement device will also break the symmetry. In such a setting, the network can nevertheless find an informative theory, characterizing the observed unimodal distribution.

C. Interaction on a lattice
Physical theories often feature a local structure of the interactions, for example, a lattice structure. We here construct such a system by introducing nearest-neighbor couplings on a square lattice of d = 10 × 10 sites with periodic boundary conditions. Furthermore, we introduce self-interaction terms of second and fourth order. The resulting action is symmetric under a global sign change and under translations along the lattice. In an experiment, such symmetries may be broken by the coupling of the system to an external environment. We model the breaking of both symmetries by introducing a heterogeneous external field which adds a bias to each degree of freedom.
The action therefore reads

S_I(x) = β ( Σ_i h_i x_i + (1/2) Σ_{i,j} x_i Λ_{ij} x_j − (r_0/2) Σ_i x_i^2 − u Σ_i x_i^4 ),

where the I^(k) are the coefficients of the true distribution and we have omitted the normalization. Here β serves as an inverse temperature. The matrix Laplacian Λ_{ij} = −δ_{ij} deg(i) + a_{ij}, with a_{ij} the adjacency matrix (a_{ij} = 1 if i and j are connected and a_{ij} = 0 otherwise), constitutes an interaction with the deg(i) = 4 nearest neighbors on the lattice. Both the diagonal part of Λ and r_0 encode a second-order self-interaction. The fourth-order term is another self-interaction, −βu x_i^4. This model can be considered the lattice version of the effective long-distance theory of an Ising model in two dimensions [46]. We illustrate the network topology and external field in Fig. 5(a).
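The lattice Laplacian described above is straightforward to construct explicitly; the following sketch (our own, with an assumed row-major flattening of site (i, j) to index i·n + j) verifies its defining properties for the 10 × 10 periodic lattice:

```python
import numpy as np

# Sketch of the nearest-neighbor lattice Laplacian Λ for a periodic square
# lattice of side n (d = n*n sites). The flattening convention is our choice.
def lattice_laplacian(n: int) -> np.ndarray:
    d = n * n
    a = np.zeros((d, d))          # adjacency matrix a_ij
    for i in range(n):
        for j in range(n):
            s = i * n + j
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                t = ((i + di) % n) * n + (j + dj) % n  # periodic boundaries
                a[s, t] = 1.0
    deg = a.sum(axis=1)           # deg(i) = 4 on a periodic square lattice
    return -np.diag(deg) + a      # Λ_ij = -δ_ij deg(i) + a_ij

L = lattice_laplacian(10)
assert np.all(np.diag(L) == -4)       # four nearest neighbors everywhere
assert np.allclose(L.sum(axis=1), 0)  # rows of a graph Laplacian sum to zero
assert np.allclose(L, L.T)            # symmetric, as required for an action
```

The zero row sums and symmetry make Λ negative semidefinite, so the quadratic part of the action is well behaved.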
As in Section IV B, we sample from this distribution using an MCMC sampler (see Appendix F for details) and train networks of different depths L. In Fig. 5(b-d) we compare the learned action coefficients A^(k) to the corresponding target values I^(k). We find good agreement for A^(1) and the off-diagonal values A^(2)_ij, i ≠ j, independent of network depth. The external field and nearest-neighbor coupling are therefore recovered accurately. The self-interaction A^(2)_diag is typically lower than the target value, while the fourth-order self-interaction A^(4)_diag is typically larger, i.e. smaller in magnitude, since it is negative. Since both parameters control the width of the distribution, the slightly lower A^(2)_diag can compensate for the too small magnitude of A^(4)_diag, producing an effective theory. Nevertheless, the higher-order coefficients A^(k≥3) of the learned theory are relevant. We show in Fig. 5(f-h) that the first, second, and third cumulants of the learned and true distribution approach each other as the network depth increases. A Gaussian approximation of S_I would only tune A^(1) and A^(2), with A^(k≥3) = 0, to match the first and second cumulants shown in Fig. 5(f,g). As in the multimodal case, we therefore learn an effective non-Gaussian theory. The effective theory may result from the depth of the network being too small in comparison to the dimensionality of the system to tune all higher-order coefficients A^(k≥3). The number of independent entries in the action coefficients up to fourth order is O(d^4), roughly 4.6·10^6 for the case of d = 10^2, with by far the most entries in the highest-order coefficient A^(4). In contrast, the number of free parameters of a single layer is (d/2)^3 + d(d + 1). Although the coefficients do not depend linearly on the network parameters, this gives a rough estimate of the required depth L = 34 of the network, at which the number of free parameters in the network and in the coefficients coincide. Since the number of entries in the coefficients grows as O(d^4), but the number of free parameters in the network is only O(L d^3), the depth required to tune all coefficient entries grows with d. Below this depth, it may well be that the flexibility of the network is too small to tune all fourth-order action coefficients. Even though ⟨(x^⊗k)_α⟩_D − ⟨(x^⊗k)_α⟩_A in Eq. (32) is likely non-zero then, the coefficients may still reach stationary values by the combination of the terms on the right-hand side of Eq. (32) vanishing. Indeed we find in Appendix G that in lower dimensions smaller network depths are sufficient to tune the higher statistical orders. Furthermore, the learned coefficients A^(4) approach the target value as the depth increases. In all examples studied here, the alignment between learned and true statistics improves with depth, which increases the network flexibility. For shallow networks, although the learned action is not equal to the true one, it effectively describes the statistics of the true distribution beyond the Gaussian order, as indicated by the good agreement of the cumulants. This behavior is equivalent to that of renormalized theories [46], which feature the same statistical correlations while changing the interaction strengths in a consistent manner.
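The count of independent coefficient entries quoted in the text can be reproduced directly; a symmetric rank-k tensor over d variables has C(d + k − 1, k) independent entries by the stars-and-bars argument (our own check of the quoted figure):

```python
from math import comb

# Count the independent entries of the symmetric action coefficients
# A^(1)..A^(4) for d degrees of freedom.
def n_independent_entries(d: int, max_order: int = 4) -> int:
    return sum(comb(d + k - 1, k) for k in range(1, max_order + 1))

d = 10**2  # lattice of 10 x 10 sites as in the main text
total = n_independent_entries(d)
print(total)  # 4598125, i.e. roughly 4.6e6
assert comb(d + 3, 4) / total > 0.9  # A^(4) holds by far the most entries
```

This confirms both the quoted 4.6·10^6 and that the fourth-order tensor dominates the count.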

V. DISCUSSION
We have developed a method to learn a microscopic theory from data: concretely, we learn a classical action that assigns a probability to each observed state. For this data-driven approach, we employ a specific class of deep neural networks that are invertible and that can be trained in an unsupervised manner, without the need for labeled training data. Such networks have been used before as generative models [9,10,13] but are generally considered a black box: after training, the learned information is stored in a large number of parameters in an accessible, yet distributed and generally incomprehensible manner.
The diagrammatic formalism developed here allows us to extract the data statistics from the trained network in terms of an underlying set of interactions, a common formulation used throughout physics. To achieve this, we designed the network architecture as a trade-off between flexibility and analytical tractability. The choice of a quadratic polynomial, along with a volume-preserving invertible architecture, allows us to obtain explicit expressions for the interaction coefficients. This formalism shows how the interplay between linear and nonlinear mappings in the network composes non-Gaussian statistics, and hence higher-order interactions, in a hierarchical manner. As a consequence of the quadratic interaction constituting the fundamental building block, higher-order interactions are decomposed into this simplest possible form of nonlinear interplay. As a result, the order of interaction in the data maps directly to the required depth of the network in an understandable manner, thus providing an explanation of why deep networks are required to learn higher-order interactions.
The analytical framework we developed can also readily be extended to higher-order nonlinearities: in terms of the diagrammatic language, the quadratic activations used in this work amount to a splitting of "legs" in the Feynman diagrams into pairs. Likewise, one obtains a three-fold splitting from a cubic term, a four-fold splitting from a quartic term, and so on. Such higher-order nonlinearities would allow the building of more complex interactions with fewer layers.
From a physics point of view, one may regard the trained network as a device to solve an interacting classical field theory in a data-driven manner: once the network has been trained, it maps each configuration of the interacting theory in data space to samples in latent space that follow a Gaussian, hence non-interacting, theory. Such a mapping allows one to compute arbitrary connected correlation functions of the interacting theory. The framework offers two routes to this end. The traditional one uses common rules of diagrammatic perturbation theory to obtain controlled approximations of connected correlation functions in terms of connected diagrams constructed from propagators and interaction vertices of the inferred action. An alternative one directly constructs connected correlation functions hierarchically across the layers of the network, ultimately reducing them to pairwise interactions on the level of the latent Gaussian. For example, the second-order correlations read ⟨⟨x_i x_j⟩⟩_{p_θ} = ⟨⟨f^{-1}_{θ,i}(z) f^{-1}_{θ,j}(z)⟩⟩_{z∼N(0,1)}. For the n-th order correlations ⟨⟨x_{i_1}⋯x_{i_n}⟩⟩_{p_θ} one can therefore either work out the coefficients of the polynomial f^{-1}_{θ,i_1}(z)⋯f^{-1}_{θ,i_n}(z), in a similar manner to the action transform, and then average the resulting function over p_Z, or estimate the correlations by drawing samples from the generative network. An open avenue to explore further in this regard is the link between the presented framework and asymptotically free theories, where an interacting theory becomes non-interacting at high-energy (UV) scales. In that case, different scales are connected by a renormalization group (RG) flow. It would be interesting to investigate whether the change of couplings described by the RG flow can be related to the transformations performed by the network. More broadly, the ability of the network to learn an interacting theory can be considered an alternative to asymptotic freedom, as the flow across layers does not have to correspond to a change of length scale.
To provide the most transparent setting, we have here chosen the simplest and most common case of a latent Gaussian distribution, which has the aforementioned advantage of mapping to a non-interacting theory. A consequence is that the latent distribution has only a single maximum. Since for invertible volume-preserving networks the number of modes in data space and in latent space is identical, these network architectures can therefore only learn monomodal distributions. For many settings of interest this is sufficient: multimodal distributions in physical systems often occur together with non-ergodic behavior such as spontaneous symmetry breaking, which selects one of the modes of the distribution. As presented, our approach necessarily learns the statistics of the selected mode, and thus obtains an effective theory of the single selected phase of the system. The simplest way to learn genuinely multimodal distributions is the use of a multimodal latent distribution, such as a Gaussian mixture. Our framework would then provide one set of action coefficients for each mixture component, correspondingly offering one effective theory for each phase.
We complement this study with a characterization of the training process, confirming the expectation that both larger training set size and network depth improve learning. Larger data sets in general decrease the bias of the learned distribution due to undersampling of the true distribution. This point is most severe for higher-order statistics, while the first two orders of the statistics are typically learned robustly even from limited data. We provide an approximate expression, Eq. (32), to investigate the convergence properties of statistics of different orders. While deeper networks are required to offer sufficient flexibility to learn higher-order statistics, their larger number of trainable parameters at the same time requires more data to learn the statistics accurately. Alternatively, the network flexibility can be increased by raising the order of the polynomial activation function. Finding the optimal trade-off between local nonlinearity and depth is an interesting point for future research.
Several studies have highlighted the role of the stochasticity of the training algorithm for networks performing classification [47][48][49]. A possible starting point for understanding SGD for generative models could be Eq. (32), which relates the trajectory of the learned distribution in coefficient space to the training algorithm. Equation (32) closely resembles the training of restricted Boltzmann machines (RBMs), where the pairwise coupling matrix between hidden and visible layers is updated according to the difference between learned and observed pairwise correlations [23]. Studying Eq. (32) could shed light on the dynamics of unsupervised learning, for which the architecture used in this work is a fully tractable prototype.
Another challenge in learning interacting theories of higher order is the necessarily large size of the action coefficients, which grows with the dimension, irrespective of the manner in which these interactions are inferred. However, it is plausible that not all terms in these tensors are equally important: for spatially or temporally extended systems, interactions between distant degrees of freedom may be irrelevant. The framework explicitly shows how higher-order interactions are composed of lower-rank coefficients. This may be leveraged to extract the most relevant contributions in a tractable manner (see Appendix B).
The general problem of inferring models from data discussed in this work is a well-known challenge. In the dynamical-systems setting, the authors of [16][17][18][50] use regression to infer the right-hand side of the governing differential equation of a system from a set of basis functions. Other studies [14,15] infer rules for the time dependence of couplings (synaptic plasticity) using regression and genetic programming. These approaches produce interpretable models, but require a predetermined set of basis functions or operations, through the combination of which the system dynamics are approximated. Inference of the parameters of stochastic processes [19][20][21][22] also relies on the specific form of the update equations. Prior knowledge about likely terms in the dynamical equations or their exact functional form is therefore needed in these works.
Many approaches exist to infer pairwise couplings of binary degrees of freedom, such as for Ising models [46]. Prominent among them are Boltzmann machines [23]. Other techniques for Ising models first solve the forward problem, namely the statistics given the couplings, using variations of mean-field theory [24][25][26] or the TAP equations [27,51,52], and then invert these relations explicitly or iteratively [27][28][29][53]. Maximum-likelihood methods or the TAP equations have also been used to infer the patterns of Hopfield models [30,31]. In the special case of a tree-like, known network topology, or of translationally invariant higher-order couplings along a linear chain, the inverse problem can be solved exactly [34]. Further works maximize the likelihood of the network model given the data, using belief propagation to reconstruct the network structure from infection cascades [32], or Monte Carlo sampling to infer amino acid sequences in proteins [54]. In contrast to these works, in this study we consider continuous rather than discrete variables. Furthermore, we are not restricted to pairwise interactions and do not require prior knowledge of the structure of interactions.
Zache et al. [33] also solve the forward problem: they approximate a higher-order interacting theory at tree level or one-loop order in the effective action, and thereby obtain an invertible relation between interactions and correlations. This approach relies on the validity of the approximations, namely for the typically difficult step from interactions to correlation functions. These approximations are not necessary to train INNs, as the correlation functions are implicitly generated by the network mapping.
Neural networks have also previously been used to treat inverse problems. They are trained to infer the posterior probability of characteristic parameters given data [55,56], and to compute renormalized degrees of freedom that are maximally informative about the global state of a system [57]. For systems of interacting identical particles, Cranmer et al. [58] use symbolic regression on trained graph neural networks to derive interpretable interactions. However, in contrast to the present study, these approaches make no explicit connection between the learned model and the parameters of the neural network.
With the extraction method proposed here for the action of a physical system at hand, one can now proceed to extract hitherto unknown interacting models from data. One interesting application of this approach is to learn actions for systems for which a microscopic or mesoscopic description is not known, for example biological neuronal networks: the inferred coefficients would determine the importance of nonlinear interactions in biological information processing. The approach may also be fruitful when applied to systems in physics where the microscopic theory may be known, but an effective theory is sought that captures an observed macroscopic phenomenon.

ACKNOWLEDGMENTS
We are grateful to Thorben Finke, Sebastian Goldt, Christian Keup, Michael Krämer, and Alexander Mück for helpful discussions. This work was partly supported by the German Federal Ministry for Education and Research (BMBF Grant 01IS19077A to Jülich and BMBF Grant 01IS19077B to Aachen) and funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 368482240/GRK2416, the Excellence Initiative of the German federal and state governments (ERS PF-JARA-SDS005), and the Helmholtz Association Initiative and Networking Fund under project number SO-092 (Advanced Computing Architectures, ACA). Open access publication funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 491111487.

APPENDIX A: Diagrammatic update equations
We here describe the diagrammatic rules to compute the action coefficient transforms in Eqs. (15) and (26).
Following the structure in Section III, we first treat the linear transform (13). Each action coefficient A^(k)_{l+1} is represented by a vertex, where the number of outgoing lines, also called legs, equals the rank of the coefficient. Each leg is assigned one of the indices (i_1, ..., i_k) of the coefficient entry of A^(k)_{l+1} it represents. Equation (15) shows that each index of the coefficient must either be contracted with W_l, and therefore remains a leg of the resulting vertex via the second index of W_l, or it must be contracted with b_l, and therefore drops out. We represent the contraction with W_l by an elongated line decorated with W_l, and the contraction with b_l as a leg ending on an empty circle. Thus, to compute the transformed action coefficients, we must add empty circles to the previous vertices in all possible ways and then elongate the remaining legs using W_l. A diagram with k legs therefore produces the following diagrams: ...
where the combinatorial factors are binomial coefficients, arising from the different ways of choosing the legs that are contracted with b_l. For example, there are k ways of choosing a single leg from a vertex with k legs, which is then contracted with b_l. We first compute the new diagrams for all k, then sum up all diagrams that have equal numbers of legs into one coefficient. This illustrates how higher-order action coefficients, through the contraction of their indices with the biases, i.e. the attachment of empty circles to their legs, contribute to lower-order coefficients.
For the nonlinear transform, we have the opposite effect: each index of the coefficient either remains as it is or is contracted with χ_l, which increases the rank of the transformed diagram by one. Therefore, either the legs of vertices remain as they are, or they must be split into two to signify the contraction with χ_l, which increases the number of legs of the vertex by one. We keep the split legs distinguishable from three-point vertices by using curved lines for the split legs, and we sum over all possible ways to split legs. A diagram with k legs therefore produces the following diagrams: ... Here, the combinatorial factors arise from the number of ways in which to choose the split legs. The number of factors χ_l in any diagram can then be read off from the number of leg splits. As in the linear transform, to compute the transformed action coefficients of rank k, we must sum over all diagrams with equal numbers of legs. This illustrates how higher-order action coefficients arise through the splitting of legs by factors of χ_l.
Second, we must compute contractions with matrices W, such as T^(k) · W^⊗k, which arise due to the linear coefficient transform Eq. (15). For these contractions, we must compute the sum over all entries; without further simplification, this entails the computation of d^k entries of T^(k) · W^⊗k from d^k terms each, so the total number of floating-point operations scales as O(d^(2k)). Even though the number of steps required therefore grows only polynomially with d, for realistic data set sizes and k = 4 this number increases very fast.
To facilitate the computation of coefficients with rank k = 4, we exploit that they are built from coefficients of lower rank to write the tensors in a decomposed form.
As a first step, we decompose the network parameters χ_l. Without loss of network expressivity, we may choose χ_l to be symmetric in its latter two indices, (χ_l)_{μjk} = (χ_l)_{μkj}. In the following, we drop the layer index l for brevity, as the structure of the computation is the same for any layer. We then rearrange the tensor into a list of d symmetric matrices β̄^μ, μ = 1, ..., d, such that χ_{μkj} = β̄^μ_{kj}. Using the eigendecomposition of these matrices β̄^μ, we may write

χ = Σ_{μ,ν} γ^{μ,ν} ⊗ β^{μ,ν} ⊗ β^{μ,ν},   (B2)

where γ^{μ,ν}, β^{μ,ν} are vectors: β^{μ,ν} is the ν-th eigenvector of β̄^μ, and γ^{μ,ν}_τ = δ_{τ,μ} λ^μ_ν has only one non-zero entry, namely the ν-th eigenvalue of the μ-th matrix β̄^μ. To store this object we require 2d^2 vectors of length d, namely d^2 vectors β^{μ,ν} and d^2 vectors γ^{μ,ν}. The magnitude of entries in χ is directly related to the magnitude of the eigenvalues λ^μ_ν, which is typically small for trained networks. We show distributions of eigenvalues from trained networks in Fig. 6. The distributions broaden with increasing depth; however, the peak of the distribution remains at λ^μ_ν = 0. It is therefore possible to reduce the space required to store χ, and all tensors related to it, by placing a cutoff λ ≥ 0 on the eigenvalues, keeping only the n ≤ d^2 largest eigenvalues with magnitude at least λ. Then the number of entries required to store χ scales as O(2nd), as again we need 2n vectors of length d each. To further simplify the expression, we absorb the sum over μ, ν into a single index τ = 1, ..., n.
An alternative way to achieve the decomposition of χ into a reduced number of components would be to use the decomposed form Eq. (B2) directly during training and limit the number of independent vectors β τ .This approach effectively trades the network expressivity for the tractability of the action coefficient transforms.
In addition to reduced storage requirements, the decomposed form (B2) also has the advantage that the decomposition translates to all tensors computed via contraction with χ, which is how higher-order coefficients are originally generated (compare with Eq. (26)). The contraction between a rank-k symmetric tensor T^(k) and χ is

T^(k) · χ = Σ_τ (T^(k) · γ^τ) ⊗ β^τ ⊗ β^τ.   (B3)

If k = 1, the result is just a matrix. If k = 2, then T^(k) · γ^τ =: α^τ is a vector, so T^(k) · χ can be written as a sum of outer products between three vectors. If k = 3, the result is a sum of outer products between a matrix T^(k) · γ^τ = ᾱ^τ and two vectors; the matrices ᾱ^τ are symmetric since T^(k) is. The case k ≥ 4 does not arise, as any coefficient of degree k ≥ 4 must already contain at least two factors χ.
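The slice-wise eigendecomposition and the contraction identity can be checked numerically on a small random example; this is our own toy verification, not the paper's implementation, and it keeps all eigenvalues (no cutoff):

```python
import numpy as np

# Decompose a tensor chi, symmetric in its last two indices, by
# eigendecomposing each slice, then verify the decomposed contraction
# for a rank-2 tensor T.
rng = np.random.default_rng(1)
d = 5
chi = rng.normal(size=(d, d, d))
chi = 0.5 * (chi + chi.transpose(0, 2, 1))     # symmetrize last two indices

# Eigendecompose every slice: chi[mu] = sum_nu lam[mu,nu] v v^T
lam, vecs = np.linalg.eigh(chi)                # batched over the first index

# Flatten (mu, nu) -> tau: gamma^tau = lam[mu,nu] e_mu, beta^tau = vecs[mu,:,nu]
gammas = [lam[mu, nu] * np.eye(d)[mu] for mu in range(d) for nu in range(d)]
betas = [vecs[mu, :, nu] for mu in range(d) for nu in range(d)]

# chi = sum_tau gamma^tau (x) beta^tau (x) beta^tau
chi_rebuilt = sum(np.einsum('i,j,k->ijk', g, b, b) for g, b in zip(gammas, betas))
assert np.allclose(chi, chi_rebuilt)

# Decomposed contraction for k = 2: sum_m T_im chi_mjk as outer products
T = rng.normal(size=(d, d)); T = 0.5 * (T + T.T)
lhs = np.einsum('im,mjk->ijk', T, chi)
rhs = sum(np.einsum('i,j,k->ijk', T @ g, b, b) for g, b in zip(gammas, betas))
assert np.allclose(lhs, rhs)
```

With a cutoff, one would simply drop the (gamma, beta) pairs whose eigenvalue magnitude falls below the threshold, trading exactness for storage.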
To store the factors of Eq. (B3), we therefore require n matrices ᾱ^τ and n vectors β^τ. The number of matrix and vector entries required to store this object is therefore O(nd^2). The contraction with matrices W along all indices is

Σ_τ (W^T ᾱ^τ W) ⊗ (W^T β^τ) ⊗ (W^T β^τ),

which corresponds to n matrix-vector products W^T β^τ and 2n matrix-matrix products for ᾱ^τ W and W^T (ᾱ^τ W).
Each entry in a matrix-matrix product is computed from d terms; the number of operations required to compute W^T ᾱ^τ W is therefore O(2d^ω), with the matrix-multiplication exponent ω, which depends on the concrete algorithm used, e.g. ω ≈ 2.8 for the Strassen algorithm [59]. To evaluate the contraction, we therefore need to compute O(2nd^ω) terms. Even in the case of no cutoff, λ = 0 ⇒ n = d^2, this approach significantly reduces the required computations compared to the naive implementation.
In [60] it was shown that the number of floating-point operations needed to compute general contractions of the type of Eq. (B1) can be reduced by exploiting the symmetry of the tensors. The authors propose a simple scheme that reduces the number of floating-point operations (using ω = 3) to roughly O(d^(k+1)), and a more complex structure for storing these tensors, which further speeds up the computations at the expense of storing more intermediate entries. In our experiments, we have used a maximal dimensionality of d = 10^2 in Section IV C and no cutoff, λ = 0. In the absence of any cutoff, we find a scaling of our algorithm roughly equal to the simpler scheme proposed in [60]. The combination of a cutoff, more efficient storage of symmetric tensors as suggested in [60], or restricting the number of free components in χ directly facilitates the extension of the coefficient transforms to higher dimensions d.

APPENDIX C: Dissimilarity of parameters
We here show that although the learned statistics of two networks may be the same, their parameters need not align. To this end, we use the teacher-student setup presented in Section IV A and compute the cosine similarities between pairs of network parameters b_l, W_l, χ_l, both between the teacher and the student, and between the teacher and a network of the same architecture with random Gaussian weights. We then average over the cosine similarities of the different parameters to obtain the average network cosine similarity. We show in Fig. 7 that although the teacher and student coefficients approach each other for increasing data set size D, their network parameters remain dissimilar. The appropriate object for comparing the learned statistics is therefore the action, not the parameters of the network.

APPENDIX D: Learning action coefficients with SGD
The learned action S_θ(x) depends on the parameters θ only through the coefficients A^(k). We can therefore also view the training as a nonlinear optimization of the action coefficients via the parameters. However, we cannot move freely in this coefficient space: we must ensure that the action stays normalized, ∫ exp(S_θ(x)) dx = 1. To this end we fix the constant term in the action,

A^(0) = − log ∫ exp( Σ_{k≥1} A^(k) · x^⊗k ) dx.   (D1)

Given this constraint, we rewrite the update equation (31) in terms of the coefficients using the chain rule,

Δθ_i = η Σ_k Σ_{α_k} ( ∂⟨S_θ(x)⟩_{D_t} / ∂A^(k)_{α_k} ) ( ∂A^(k)_{α_k} / ∂θ_i ),   (D2)

where Σ_{α_k} runs over all possible (multi-)indices of A^(k).
The update in the parameters induces a change in the action coefficients. We use this to approximate the update step for the coefficient entry A_α to linear order in the parameter updates,

ΔA_α ≈ Σ_i ( ∂A_α / ∂θ_i ) Δθ_i.

The update step ΔA_α therefore depends on the network architecture through the derivatives ∂A^(k)_α/∂θ_i. The derivative of the expectation of the action with respect to the coefficients in the former factor of Eq. (D2) is

∂⟨S_θ(x)⟩_{D_t} / ∂A^(k)_α = ⟨(x^⊗k)_α⟩_{D_t} − ⟨(x^⊗k)_α⟩_A,   (D3)

where ⟨·⟩_A denotes the current average of the learned distribution; the second term arises from the normalization constraint Eq. (D1). This term induces a variability in the coefficient updates, as ⟨x^⊗k⟩_{D_t} will vary from batch to batch due to the finite size of each batch. We define the random variable ξ^(k)_t = ⟨x^⊗k⟩_{D_t} − ⟨x^⊗k⟩_D to be the deviation between the moment estimated on the current batch D_t and the moment estimated on the entire data set D. Combining Eqs. (D3) and (D2) yields Eq. (32).
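The moment-matching character of these updates can be illustrated with a deterministic one-dimensional caricature (our own toy example, not the paper's setup): the coefficients of a quadratic action are updated directly by the mismatch between data and model moments, full-batch and noiseless, so the batch-noise term ξ_t is absent.

```python
# Toy 1-D sketch of moment-matching dynamics: update the coefficients of a
# quadratic action S(x) = a1*x + a2*x^2 (a2 < 0) by the mismatch between
# data and model moments.
mu_D, var_D = 1.0, 0.5                 # target Gaussian: N(1, 0.5)
m1_D, m2_D = mu_D, var_D + mu_D**2     # its first two raw moments

a1, a2, eta = 0.0, -0.5, 0.05          # initial coefficients, learning rate
for _ in range(100_000):
    var_A = -1.0 / (2.0 * a2)          # model variance from the action
    mu_A = a1 * var_A                  # model mean: stationarity of S
    m1_A, m2_A = mu_A, var_A + mu_A**2
    a1 += eta * (m1_D - m1_A)          # coefficient update = moment mismatch
    a2 += eta * (m2_D - m2_A)

# Fixed point: a2 = -1/(2 var_D) = -1, a1 = mu_D / var_D = 2
assert abs(a1 - 2.0) < 1e-3 and abs(a2 - (-1.0)) < 1e-3
```

The fixed point is exactly where the model moments match the data moments, mirroring the stationarity condition discussed in the main text; for a Gaussian model the update dynamics are the concave exponential-family likelihood ascent and converge for small enough learning rate.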

APPENDIX E: Random generation of multimodal actions
Coefficient distributions for random actions

In Section IV B we use multimodal actions S_R constructed from randomly drawn coefficients R^(k). A basic condition these coefficients must satisfy is that the resulting action be normalizable: ∫ exp S_R(x) dx < ∞. We note that for large enough x, the action is dominated by the highest-order terms: it is therefore necessary and sufficient for normalizability that R^(4) be negative definite, which we ensure by choosing R^(4) to be a diagonal tensor with negative coefficients. Here d is the dimensionality of the data, and x_r ∈ R is a length scale which we are free to choose; one can view R^(4) as a regulator term, and x_r as the value at which it becomes strongly suppressing. For our experiments we used x_r = 1.0.
Having ensured that the action is normalizable, we can define the probability p_R(x | {R^(k)}_{k≤4}) = exp S_R(x) / ∫ exp S_R(x) dx. We choose the S_R such that the data can be described as a perturbation of a Gaussian theory; we therefore also choose R^(2) to be negative definite (i.e. corresponding to a valid precision matrix). Since any d-dimensional multivariate Gaussian can be written as a linear combination of d independent Gaussian variables, we define R^(2) as follows, with c = −0.1 in our experiments. This is equivalent to transforming a Gaussian variable z ∼ N(0, 1) by a linear transform x = W^{-1} z and then computing the action of x (compare to Eq. (18)). Finally, the coefficients R^(1) and R^(3) are chosen as follows (i, j, k = 1, ..., d). The scaling with respect to x_r^{-1} and x_r^{-3} ensures that neither the linear nor the cubic terms are negligible within the region where the regulator term R^(4) is non-suppressing. The variable γ_α = P(α) in the denominator of σ_α is the multiplicity of the index α. This is the number of times the component R_ijk appears in R^(3); since the coefficients are symmetric, it is equal to the number of distinct permutations of (i, j, k). We scale the multiplicity by s_α, the number of different components which have the same number of permutations; for example, the permutations of the indices (2, 1, 1) and (5, 3, 3) appear γ_ijj = 3 times each, and there are s_ijj = d(d − 1) distinct entries with such indices. Scaling σ_α by γ_α and s_α ensures that both the on-diagonal and off-diagonal components of R^(2), R^(3) are significant.
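The multiplicity γ_α is the standard count of distinct permutations of a multi-index, n!/∏(counts!); a small sketch (our own) reproduces the values quoted above:

```python
from math import factorial
from collections import Counter

# The multiplicity gamma_alpha of a symmetric-tensor index alpha is the
# number of distinct permutations of its entries: n! / prod(counts!).
def multiplicity(alpha) -> int:
    reps = 1
    for c in Counter(alpha).values():
        reps *= factorial(c)
    return factorial(len(alpha)) // reps

assert multiplicity((2, 1, 1)) == 3   # (2,1,1), (1,2,1), (1,1,2)
assert multiplicity((5, 3, 3)) == 3
assert multiplicity((1, 2, 3)) == 6
assert multiplicity((1, 1, 1)) == 1

# For rank 3, there are d(d-1) index classes of the pattern (i, j, j) with
# i != j, matching the s_ijj = d(d-1) quoted above.
d = 4
pattern_count = sum(1 for i in range(d) for j in range(d) if i != j)
assert pattern_count == d * (d - 1)
```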

Multimodality of random actions
We here provide evidence that the distribution in Section IV B is indeed multimodal. To do so, we initialize an optimization algorithm at random points and attempt to find the maxima of S_R(x). We initialized at 10^3 different values. In Fig. 8 we show the maximal values S_R(x*) found by the algorithm, as well as the eigenvalues of the Hessian for selected, distinct final values x*. Since all eigenvalues are below zero, the action S_R is locally concave in all directions at both points. The action S_R therefore has at least two local maxima.
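The multi-start procedure can be illustrated with a one-dimensional stand-in (our own toy, not the paper's S_R): gradient ascent on a double-well action from random initial points finds two distinct maxima, each with a negative second derivative playing the role of the Hessian.

```python
import numpy as np

# Multi-start gradient ascent on the double-well action S(x) = 2x^2 - x^4.
rng = np.random.default_rng(2)

def grad(x):      # S'(x) = 4x - 4x^3
    return 4 * x - 4 * x**3

def hess(x):      # S''(x) = 4 - 12x^2, the 1-D "Hessian"
    return 4 - 12 * x**2

maxima = set()
for x in rng.uniform(-2, 2, size=100):
    for _ in range(2000):             # plain gradient ascent
        x += 0.01 * grad(x)
    maxima.add(round(x, 3))

assert maxima == {-1.0, 1.0}          # two distinct modes at x = ±1
assert all(hess(m) < 0 for m in maxima)
```

As in the appendix, the negative curvature at every converged point certifies that each is a genuine local maximum, and finding more than one certifies multimodality.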

APPENDIX F: Sampling actions with MCMC
Given ground-truth coefficients T^(k), we create a data set D by drawing samples according to the probability p_C(x | {T^(k)}_{k≤4}). Markov chain Monte Carlo (MCMC) is well suited to this task, since it requires only the unnormalized log-probability, i.e. S_C(x), and is guaranteed to converge to the true distribution (in contrast to variational methods like ADVI [61]). To generate D, we used the No-U-Turn Sampler (NUTS) [62] implementation provided by PyMC3 [45]; the sampler parameters mostly followed the recommended defaults, with 10^3 tuning steps and a mass matrix initialized to unity. The target acceptance rate was raised to 0.95 to increase the sensitivity to small features of the probability distribution.
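The key property used above, that MCMC needs only the unnormalized log-probability, is easiest to see in a minimal Metropolis-Hastings sketch (our own illustration; the paper uses NUTS, not this sampler):

```python
import numpy as np

# Minimal Metropolis-Hastings sampler: like NUTS, it needs only the
# unnormalized log-probability S(x). Here S(x) = -x^2/2, i.e. a standard
# Gaussian up to normalization.
def log_prob(x):
    return -0.5 * x**2

def metropolis(log_p, n_samples, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x, out = 0.0, np.empty(n_samples)
    for t in range(n_samples):
        prop = x + step * rng.normal()
        # accept with probability min(1, p(prop)/p(x)); the normalization
        # constant cancels in the ratio
        if np.log(rng.uniform()) < log_p(prop) - log_p(x):
            x = prop
        out[t] = x
    return out

samples = metropolis(log_prob, 50_000)[5_000:]  # drop burn-in
assert abs(samples.mean()) < 0.1
assert abs(samples.var() - 1.0) < 0.1
```

NUTS follows the same acceptance logic but proposes moves along simulated Hamiltonian trajectories, which is what makes it efficient in high dimensions.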

APPENDIX G: Lattice model in low dimensions
We here show a lower dimensional version of the lattice model introduced in Section IV C.
Here, the combined number of independent entries in the first four action coefficients is only 4844, which corresponds to roughly 6 network layers. Figure 9 shows a comparison of true vs. learned coefficients. Independent of network depth, we find that A^(1) and the off-diagonal entries A^(2)_offdiag align with the target values, while the entries in the fourth-order coefficient A^(4)_diag are approximately zero. Their magnitudes increase with L (see Fig. 9(d)). Increasing the depth L also speeds up learning (see Fig. 9(c)). Furthermore, we find that up to fourth order, the cumulants of the learned distribution increasingly align with those of the true distribution as we increase the network depth. We can therefore conclude that increasing the depth of the network increases the accuracy of the learned distribution, both in terms of its coefficients and of its cumulants.
We repeated the experiment for d = 9 without any heterogeneous external field, i.e. h = 0. Again, we find an alignment of most entries in A^(1) and A^(2), with A^(2)_diag slightly lower than expected and A^(4)_diag larger than expected (see Fig. 10(a,b)). In this setting, the symmetry of the action causes the first and third cumulants to vanish. We therefore only compute the alignment of the second and fourth cumulants. As in the previous cases, this alignment increases with depth, as shown in Fig. 10(d). The effective nature of the learned theory therefore does not depend on the system's symmetry being broken.
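The cumulant alignment reported above is a cosine similarity between flattened cumulant tensors. A minimal sketch for the second cumulant, estimated from samples of two Gaussian distributions standing in for the true and learned models (the distributions and sample sizes here are illustrative assumptions, not the paper's setup):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two tensors, flattened to vectors."""
    a, b = a.ravel(), b.ravel()
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Two sample sets drawn from the same covariance, standing in for
# samples of the true and the learned distribution
rng = np.random.default_rng(1)
cov = np.array([[1.0, 0.3],
                [0.3, 0.5]])
x_true = rng.multivariate_normal(np.zeros(2), cov, size=50000)
x_model = rng.multivariate_normal(np.zeros(2), cov, size=50000)

# Second cumulants (covariances) estimated from samples
c2_true = np.cov(x_true.T)
c2_model = np.cov(x_model.T)

# Dissimilarity 1 - cos<angle>, small when the cumulants align
dissim = 1.0 - cosine_similarity(c2_true, c2_model)
```

Higher cumulants are treated the same way: the k-th cumulant tensor is flattened to a vector before the cosine is taken, so the measure is insensitive to an overall scale.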

FIG. 3. Teacher-student coefficient comparison for varying training set sizes D. Both teacher and student have depth L = 1. (a-d) Student coefficients A^(k) over teacher coefficients T^(k) up to fourth order, for D = 10^3 in green and D = 10^5 in pink. (e) Training loss (full lines) and test loss (dashed lines) over training steps. (f) Cosine similarity of coefficients over number of training samples.

FIG. 4. Learning an effective monomodal theory. (a) Two-dimensional example of a random density with multiple maxima. White lines are level lines of the learned distribution for a five layer network. All other panels show results from a d = 10-dimensional data set. (b-d) Learned over true coefficients for a three layer network. We distinguish diagonal tensor entries from off-diagonal ones, where at least two indices differ. (e-g) Learned over true cumulants, computed from samples. Error bars are typically smaller than the marker size. (h) Dissimilarity of true and learned cumulants, 1 − cos∠(⟨⟨x^⊗k⟩⟩_A, ⟨⟨x^⊗k⟩⟩_R), over training steps. We record the cumulants at logarithmically spaced intervals during training. The curves are then smoothed by averaging over ten adjacent recording steps. Shaded areas show the variation due to the estimation of the cumulants from samples. Dots indicate the training stage of the cumulants shown in (e-g).

FIG. 5. Symmetry broken lattice model for networks of varying depth trained on a d = 10^2 dimensional data set with D = 10^6 samples. (a) Sites of a square lattice with periodic boundary conditions distributed on a two-dimensional torus. Colored dots show the strength of the external field h at connected lattice sites. (b) Learned over true first order coefficients for network depth L = 3. (c) Distribution of learned coefficient entries A^(2) compared to target values (black crosses) for network depth L = 3. We distinguish self-interaction terms A^(2)_diag, off-diagonal entries A^(2)_offdiag, and entries belonging to adjacent lattice sites A^(2),adj._offdiag. (d) Training loss (solid curves) and test loss (dashed curves). Colors distinguish different network depths L. (e) Distribution of learned fourth order self-interactions as a function of network depth. The dashed line marks the target value. (f-h) Learned over true cumulants of up to third order. Cumulants were computed on a subset of 10 randomly chosen lattice sites. Colors distinguish different network depths L.

FIG. 6. Eigenvalue distributions of decomposed χ_l for networks of different depths. We decompose trained network parameters χ_l from Section IV C into the form of Eq. (B2) and distinguish eigenvalues from the decomposed form of different layers l.

FIG. 7. Dissimilarity of parameters. Stars show the cosine similarities of the teacher and student network parameters trained on varying data set sizes D. The average cosine similarity between the teacher and 10^2 randomly generated networks is marked by the grey line; the shaded area encompasses one standard deviation. The remaining markers display the cosine similarity between the teacher coefficients T^(k) and student coefficients S^(k).

FIG. 8. Multiple local maxima in S_R. (a) S_R along the straight line connecting two local maxima x*_0, x*_1 of S_R found by the optimization algorithm. (b,c) Eigenvalues of the Hessian H of S_R at the local maxima x*_0, x*_1. All eigenvalues λ of H(x*_i) are negative; the action is therefore locally concave in all directions.

FIG. 9. Coefficients of the lattice model for networks of varying depth trained on a d = 16 dimensional data set with D = 10^5 samples. (a) Learned (L^(1)) over true (A^(1)) first order coefficients. (b) Distribution of learned coefficient entries A^(2) compared to target values (black crosses). Self-interaction terms are labeled A^(2)_diag, off-diagonal entries A^(2)_offdiag; entries belonging to adjacent lattice sites A^(2),adj._offdiag are shown separately. (c) Training loss (full curves) and test loss (dashed curves). Colors distinguish different network depths L. (d) Distribution of learned fourth order self-interactions over network depth. The dashed line marks the target value. (e) Dissimilarity of true and learned cumulants, 1 − cos∠(⟨⟨x^⊗k⟩⟩_A, ⟨⟨x^⊗k⟩⟩_R), over training steps. We record the cumulants at logarithmically spaced intervals during training. The curves are smoothed by averaging over ten adjacent recording steps. Shaded areas show the variation due to the estimation of the cumulants from samples. Dots indicate the training stage of the coefficients shown in (a,b).

FIG. 10. Coefficients of the lattice model without external field for networks of varying depth trained on a d = 9 dimensional data set with D = 10^5 samples. (a) Distribution of learned coefficient entries A^(1), A^(2) compared to target values (black crosses). Self-interaction terms A^(2)_diag, off-diagonal entries A^(2)_offdiag, and those entries belonging to adjacent lattice sites A^(2),adj._offdiag are shown separately. (b) Distribution of learned fourth order self-interactions as a function of network depth. The dashed line marks the target value. (c) Training loss (full curves) and test loss (dashed curves). Colors distinguish different network depths L. (d) Dissimilarity of true and learned cumulants, 1 − cos∠(⟨⟨x^⊗k⟩⟩_A, ⟨⟨x^⊗k⟩⟩_R), over training steps. We record the cumulants at logarithmically spaced intervals during training. The curves are smoothed by averaging over ten adjacent recording steps. Shaded areas show the variation due to the estimation of the cumulants from samples. Dots indicate the training stage of the coefficients shown in (a,b).