Structures of Neural Network Effective Theories

We develop a diagrammatic approach to effective field theories (EFTs) corresponding to deep neural networks at initialization, which dramatically simplifies computations of finite-width corrections to neuron statistics. The structures of EFT calculations make it transparent that a single condition governs criticality of all connected correlators of neuron preactivations. Understanding of such EFTs may facilitate progress in both deep learning and field theory simulations.

Introduction - Machine learning (ML) has undergone a revolution in recent years, with applications ranging from image recognition and natural language processing to self-driving cars and playing Go. Central to all these developments is the engineering of deep neural networks, a class of ML architectures consisting of multiple layers of artificial neurons. Such networks are apparently rather complex, with a daunting number of trainable parameters, which means practical applications have often been guided by expensive trial and error. Nevertheless, extensive research is underway toward opening the black box.
That a theoretical understanding of such complex systems is possible has to do with the observation that a wide range of neural network architectures actually admit a simple limit: they reduce to Gaussian processes when the network width (number of neurons per layer) goes to infinity [1-6], and evolve under gradient-based training as linear models governed by the neural tangent kernel [7-9]. However, an infinitely wide network neither exists in practice nor provides an accurate model for deep learning. It is therefore crucial to understand finite-width effects, which have recently been studied by a variety of methods [10-23].
This line of research in ML theory has an intriguing synergy with theoretical physics [24]. In particular, it has been realized that neural networks have a natural correspondence with (statistical or quantum) field theories [25-34]. Infinite-width networks, which are Gaussian processes, correspond to free theories, while finite-width corrections in wide networks can be calculated perturbatively as in weakly-interacting theories. This allows for a systematically-improvable characterization of neural networks beyond the (very few) exactly-solvable special cases [35-37]. Meanwhile, from an effective theory perspective [21], information propagation through a deep neural network can be understood as a renormalization group (RG) flow. Examining scaling behaviors near RG fixed points reveals strategies to tune the network to criticality [38-40], which is crucial for mitigating the notorious exploding and vanishing gradient problems in practical applications. In the reverse direction, this synergy also points to new opportunities to study field theories with neural networks [33].
Inspired by recent progress, in this letter we further explore the structures of effective field theories (EFTs) corresponding to archetypical deep neural networks. To this end, we develop a novel diagrammatic formalism (see also Refs. [13,17,18,26,27,31,32,41] for Feynman diagram-inspired approaches to ML). Our approach largely builds on the frameworks of Refs. [21,22], which enable systematic calculations of finite-width corrections. The diagrammatic formalism dramatically simplifies these calculations, as we demonstrate by concisely reproducing known results in the main text and presenting further examples with new results in the Supplemental Material. Interestingly, the structures of diagrams in the RG analysis suggest that neural network EFTs are of a quite special type, where a single condition governs the critical tuning of all neuron correlators. The study of these EFTs may lend new insights into both neural network properties and novel field-theoretic phenomena.
EFT of deep neural networks - The archetype of deep neural networks, the multilayer perceptron, can be defined by a collection of neurons whose values φ_i^{(ℓ)} (called preactivations) are determined by the following operations given an input x ∈ R^{n_0}:

φ_i^{(1)}(x) = b_i^{(1)} + Σ_{j=1}^{n_0} W_{ij}^{(1)} x_j ,   φ_i^{(ℓ)}(x) = b_i^{(ℓ)} + Σ_{j=1}^{n_{ℓ−1}} W_{ij}^{(ℓ)} σ(φ_j^{(ℓ−1)}(x))   (ℓ ≥ 2) .   (1)

Here superscripts in parentheses label layers, subscripts i, j label neurons within a layer (of which there are n_ℓ at the ℓth layer), and σ(φ) is the activation function (common choices include tanh(φ) or ReLU(φ) ≡ max(0, φ)). The weights W_{ij}^{(ℓ)} and biases b_i^{(ℓ)} are drawn independently from zero-mean Gaussian distributions with variances C_W^{(ℓ)}/n_{ℓ−1} and C_b^{(ℓ)}, respectively. The statistics of this ensemble encode both the typical behavior of neural networks initialized in this manner and how a particular network may fluctuate away from typicality. In the field theory language, these are captured by a Euclidean action, S[φ] = − log P(φ), for all neuron preactivation fields φ_i^{(ℓ)}(x), where P(φ) is the joint probability distribution.
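For concreteness, the following is a minimal sketch (in Python, not the paper's code) of the ensemble just described: weights and biases are drawn from zero-mean Gaussians with variances C_W^{(ℓ)}/n_{ℓ−1} and C_b^{(ℓ)}, and preactivations follow Eq. (1). The widths, input, and hyperparameter values are illustrative choices.

```python
# Minimal sketch of an ensemble of multilayer perceptrons at initialization, per Eq. (1).
import numpy as np

def init_network(rng, widths, C_W, C_b):
    """Weights ~ N(0, C_W/n_in), biases ~ N(0, C_b), layer by layer."""
    params = []
    for l in range(1, len(widths)):
        n_in, n_out = widths[l - 1], widths[l]
        W = rng.normal(0.0, np.sqrt(C_W / n_in), size=(n_out, n_in))
        b = rng.normal(0.0, np.sqrt(C_b), size=n_out)
        params.append((W, b))
    return params

def preactivations(params, x, sigma=np.tanh):
    """List of preactivation vectors phi^(l)(x), l = 1, ..., L, following Eq. (1)."""
    phi = params[0][0] @ x + params[0][1]          # first layer acts on the raw input
    phis = [phi]
    for W, b in params[1:]:
        phi = W @ sigma(phi) + b
        phis.append(phi)
    return phis

rng = np.random.default_rng(0)
widths = [2, 300, 300, 300]                        # n_0 = 2, three hidden layers of width 300
x = np.array([0.7, -0.3])
ensemble = [init_network(rng, widths, C_W=1.0, C_b=0.1) for _ in range(200)]
last = np.array([preactivations(p, x)[-1] for p in ensemble])
print(last.mean(), last.var())                     # zero mean; variance approaches the kernel
```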

As we review in the Supplemental Material, at initialization the conditional probability distribution at each layer is Gaussian,

P(φ^{(ℓ)} | φ^{(ℓ−1)}) = [det(2π G^{(ℓ)})]^{−n_ℓ/2} e^{−S_0^{(ℓ)}} ,   (2)

S_0^{(ℓ)} = (1/2) Σ_{i=1}^{n_ℓ} ∫_{x_1, x_2} φ_i^{(ℓ)}(x_1) G^{(ℓ)−1}(x_1, x_2) φ_i^{(ℓ)}(x_2) ,   (3)

where G^{(ℓ)}(x_1, x_2) = C_b^{(ℓ)} + (C_W^{(ℓ)}/n_{ℓ−1}) Σ_{j=1}^{n_{ℓ−1}} σ(φ_j^{(ℓ−1)}(x_1)) σ(φ_j^{(ℓ−1)}(x_2)) for ℓ ≥ 2, and G^{(1)}(x_1, x_2) = C_b^{(1)} + (C_W^{(1)}/n_0) Σ_j x_{1j} x_{2j}. We have taken the continuum limit in input space to better parallel field theory analyses. G^{(ℓ)−1} is understood as the pseudoinverse when G^{(ℓ)} is not invertible. We see that for ℓ ≥ 2, G^{(ℓ)}(x_1, x_2) is an operator of the (ℓ−1)th-layer neurons, so Eq. (3) is actually an interacting theory with interlayer couplings. This also means the determinant in Eq. (2) is not a constant prefactor. To account for its effect, we introduce auxiliary anticommuting fields ψ, ψ̄, which are analogs of ghosts and antighosts in the Faddeev-Popov procedure. Including all layers, we have S = Σ_{ℓ=1}^{L} (S_0^{(ℓ)} + S_g^{(ℓ)}), where S_0^{(ℓ)} is given by Eq. (3) above and S_g^{(ℓ)} is the ghost bilinear built from the same kernel, schematically Σ_i ∫_{x_1, x_2} ψ̄_i^{(ℓ)}(x_1) G^{(ℓ)−1}(x_1, x_2) ψ_i^{(ℓ)}(x_2), normalized such that integrating out the ghosts reproduces the determinant in Eq. (2). The ℓth-layer neurons interact with the (ℓ−1)th-layer and (ℓ+1)th-layer neurons via S_0^{(ℓ)} and S_0^{(ℓ+1)}, respectively, while their associated ghosts have opposite-sign couplings to the (ℓ−1)th-layer neurons but do not couple to the (ℓ+1)th-layer neurons. This means φ^{(ℓ)} and ψ^{(ℓ)} loops cancel as far as their couplings to φ^{(ℓ−1)} are concerned, which must be the case since the network has directionality: neurons at a given layer cannot be affected by what happens at deeper layers.
Neuron statistics from Feynman diagrams - We are interested in calculating neuron statistics, i.e. connected correlators of neuron preactivation fields φ_i^{(ℓ)}(x), in the EFT above. More precisely, we would like to track the evolution of neuron correlators as a function of network layer ℓ, which encodes how information is processed through a deep neural network and has an analogous form to RG flows in field theory. To this end, we develop an efficient diagrammatic framework to recursively determine ℓth-layer neuron correlators in terms of (ℓ−1)th-layer neuron correlators.
Starting from the action Eq. (3), we can derive the Feynman rule of Eq. (7) (see Supplemental Material for details).
In Eq. (7), a blob means taking the expectation value of the operator (or product of operators) attached to it. Eq. (7) contains both what we normally call propagators and vertices: the first term on the right-hand side, when summed over j, is the full propagator (or two-point correlator) for φ, while the second term is an interaction vertex between φ_i^{(ℓ)} bilinears and single-neuron operators ∆_j^{(ℓ−1)}(x_1, x_2) built from φ_j^{(ℓ−1)}, defined for ℓ ≥ 2, with ∆_j^{(0)}(x_1, x_2) = 0. From Eq. (7) it is clear that each φ²∆ vertex comes with a factor of 1/n (where n collectively denotes n_1, ..., n_{L−1}). In the infinite-width limit, n → ∞, the EFT is a free theory, whereas for large but finite n, we have a weakly-interacting theory where higher-point connected correlators can be perturbatively calculated as a 1/n expansion. To see how this works, let us first take a closer look at the two-point correlator, Eq. (8), which is automatically connected since we have normalized ∫Dφ e^{−S} = 1, meaning the sum of vacuum bubbles vanishes. We can write it as an expansion in 1/n,

⟨φ_i^{(ℓ)}(x_1) φ_i^{(ℓ)}(x_2)⟩ = K_0^{(ℓ)}(x_1, x_2) + (1/n) K_1^{(ℓ)}(x_1, x_2) + O(1/n²) .

The leading-order (LO) term K_0^{(ℓ)} is known as the kernel; it is the propagator for φ_i^{(ℓ)} in the free-theory limit n → ∞. Evaluating G^{(ℓ)}(x_1, x_2) in this limit amounts to using free-theory propagators K_0^{(ℓ−1)} for the previous-layer neurons φ_j^{(ℓ−1)} in the blob in Eq. (7):

K_0^{(ℓ)}(x_1, x_2) = C_b^{(ℓ)} + C_W^{(ℓ)} ⟨σ(x_1) σ(x_2)⟩_{K_0^{(ℓ−1)}} .   (11)

Here the subscript K_0^{(ℓ−1)} means the expectation value is computed with the free-theory propagator K_0^{(ℓ−1)} (cf. Eq. (A.15) in the Supplemental Material). We have dropped both neuron and layer indices on σ because the K_0^{(ℓ−1)} subscript already indicates the layer, and the expectation value is identical for all neurons in that layer. One can further evaluate ⟨σσ⟩ for specific choices of activation function σ, but we stay activation-agnostic for the present analysis.
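To make the Gaussian expectation ⟨σ(x_1)σ(x_2)⟩_K concrete, here is a hedged Monte Carlo sketch (ours, not the paper's): the two preactivations are drawn as a zero-mean bivariate Gaussian whose covariance entries are the kernel evaluated at the two inputs. The sample size and example kernel values are arbitrary.

```python
# Monte Carlo estimate of <sigma(x1) sigma(x2)>_K for a 2x2 kernel matrix [[K11, K12], [K12, K22]].
import numpy as np

def gauss_expect_sigma(K11, K12, K22, sigma=np.tanh, n_samples=200_000, seed=0):
    rng = np.random.default_rng(seed)
    z1, z2 = rng.standard_normal((2, n_samples))
    phi1 = np.sqrt(K11) * z1
    rho = K12 / np.sqrt(K11 * K22)                           # correlation between the two inputs
    phi2 = np.sqrt(K22) * (rho * z1 + np.sqrt(max(1.0 - rho ** 2, 0.0)) * z2)
    return np.mean(sigma(phi1) * sigma(phi2))

print(gauss_expect_sigma(1.0, 0.5, 1.0))                     # <tanh tanh> for an example kernel
```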
Eq. (11) allows us to recursively determine K_0^{(ℓ)}, and has been well known from studies of infinite-width networks. It may also be viewed as the RG flow of K_0, with ultraviolet boundary condition K_0^{(1)}(x_1, x_2) = G^{(1)}(x_1, x_2) (Eq. (12)). It is straightforward to extend the diagrammatic calculation to K_{p≥1}. We present a simple derivation of the RG flow of K_1 in the Supplemental Material.
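A minimal sketch of this kernel RG flow, restricted for simplicity to a single input (the diagonal kernel K_0^{(ℓ)}(x, x)), with the first-layer kernel as the ultraviolet boundary condition. Here the Gaussian expectation is evaluated deterministically by Gauss-Hermite quadrature rather than Monte Carlo; layer-independent C_W, C_b and all numerical values are illustrative assumptions.

```python
# Iterate Eq. (11) for a single input: K0^(l) = C_b + C_W <sigma(phi)^2>_{K0^(l-1)},
# starting from K0^(1)(x, x) = C_b + (C_W/n_0) x.x (Eq. (12)).
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

NODES, WTS = hermegauss(80)                                  # quadrature for E[f(z)], z ~ N(0, 1)
def expect(f):
    return (WTS * f(NODES)).sum() / np.sqrt(2.0 * np.pi)

def kernel_flow(x, C_W, C_b, depth, sigma=np.tanh):
    K = C_b + C_W * np.dot(x, x) / len(x)                    # first-layer kernel
    Ks = [K]
    for _ in range(depth - 1):
        K = C_b + C_W * expect(lambda z: sigma(np.sqrt(K) * z) ** 2)   # Eq. (11)
        Ks.append(K)
    return Ks

print(kernel_flow(np.array([0.7, -0.3]), C_W=1.0, C_b=0.0, depth=10))
```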
To evaluate the diagram, we need to consider two cases, j_1 = j_2 and j_1 ≠ j_2. For j_1 = j_2 ≡ j, the blob takes its free-theory value at LO, Eq. (14); as in Eq. (11), we have dropped the layer and neuron indices on ∆. For j_1 ≠ j_2, free-theory propagators cannot connect ∆_{j_1} and ∆_{j_2}, and the leading contribution is from inserting a connected four-point correlator of the (ℓ−1)th layer, Eq. (15). In the corresponding diagram, internal solid lines denote φ_{j_1}^{(ℓ−1)}, φ_{j_2}^{(ℓ−1)} propagators. Exchanging the two φ lines results in the same diagram, hence a symmetry factor 1/2² = 1/4. The smaller blob at the center (together with the attached propagators) represents a connected four-point correlator of the (ℓ−1)th layer, V_4^{(ℓ−1)}(y_1, y_2; y_3, y_4). The larger blobs give rise to the correlators in the second line of Eq. (15); they are automatically connected since ∆ has vanishing expectation value. A correlator ⟨φ(x)...⟩ by its standard definition includes the propagators K_0^{(ℓ−1)}(x, y)... (with y to be integrated over), so when we use correlators to build up diagrams, each internal propagator connecting two correlators (blobs) is counted twice. To avoid double counting we thus insert an inverse propagator for each internal line in the diagram. This explains the factors of (K_0^{(ℓ−1)})^{−1} in the first line of Eq. (15), which effectively amputate the connected four-point correlator (or equivalently the larger blobs in the diagram). The final expression in Eq. (15) is obtained by Wick contraction. Adding up Eqs. (14) and (15) gives the final result for V_4^{(ℓ)} in terms of (ℓ−1)th-layer correlators, i.e. the RG flow of V_4, which agrees with Refs. [14,21]; V_4^{(ℓ)}, defined via Eq. (13), is O(1). The diagrammatic calculation extends straightforwardly to higher-point connected correlators, and provides a concise framework to systematically analyze finite-width effects in deep neural networks. In the Supplemental Material we present new results for the connected six-point and eight-point correlators as further examples.
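As a sanity check of the 1/n counting (an illustration of ours, not a computation from the paper), one can verify numerically that the connected four-point correlator of a single second-layer preactivation at one input scales as 1/n_1: conditioned on the first layer, φ^{(2)} is Gaussian with variance G^{(2)} (cf. Eqs. (2)-(3)), so its fourth cumulant equals 3 Var[G^{(2)}], the variance of a mean over n_1 neurons.

```python
# Check that the connected four-point correlator of phi^(2) at one input is O(1/n_1).
import numpy as np

def stochastic_kernel_G2(rng, n1, n_nets, x, C_W=1.0, C_b=0.0, sigma=np.tanh):
    """One realization of G^(2)(x, x) = C_b + (C_W/n_1) sum_j sigma(phi_j^(1)(x))^2 per network."""
    n0 = len(x)
    W1 = rng.normal(0.0, np.sqrt(C_W / n0), size=(n_nets, n1, n0))
    b1 = rng.normal(0.0, np.sqrt(C_b), size=(n_nets, n1))
    phi1 = np.einsum('aij,j->ai', W1, x) + b1
    return C_b + C_W * np.mean(sigma(phi1) ** 2, axis=1)

rng = np.random.default_rng(1)
x = np.array([0.7, -0.3])
for n1 in (50, 100, 200):
    G2 = stochastic_kernel_G2(rng, n1, 50_000, x)
    # Conditioned on layer 1, phi^(2)(x) is Gaussian with variance G^(2), so its fourth
    # cumulant equals 3 Var[G^(2)]; multiplying by n1 gives a roughly width-independent number.
    print(n1, 3 * n1 * G2.var())
```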
The RG flow can also be formulated at the level of the EFT action. The idea is to consider a tower of EFTs, S_eff^{(ℓ)} (ℓ = 1, ..., L), obtained by integrating out the neurons and ghosts in all but the ℓth layer. They take the schematic form of a Gaussian piece plus a quadratic coupling µ^{(ℓ)} and a quartic coupling λ^{(ℓ)}, where summation over repeated indices is assumed. The EFT couplings µ^{(ℓ)}, λ^{(ℓ)} ∼ O(1/n) can be determined from the connected correlators, so their RG flows directly follow from those of the latter discussed above. For example, matching the connected four-point correlator relates λ^{(ℓ)} to V_4^{(ℓ)} and K^{(ℓ)} as in Eq. (17), where we use an elongated vertex to indicate pairing of the four arguments of λ^{(ℓ)}. For the two-point correlator, the calculation involves a further set of diagrams. The alternative pairing of legs at the quartic vertex in the last of these diagrams results in an O(n_ℓ/n_{ℓ−1}) contribution, which, however, is canceled by diagrams with ghost loops due to the opposite-sign coupling. Similar cancellations also explain the absence of O(n_ℓ/n_{ℓ−1}²) loop diagrams in the calculation of the connected four-point correlator in Eq. (17). (Without the ghosts one would encounter such terms, necessitating either working in the regime n_ℓ ≪ n_{ℓ−1} or marginalizing the action over all but an O(1) number of neurons in the (ℓ−1)th layer in order to have a perturbative 1/n expansion. Retaining the ghosts avoids such subtleties, rendering µ^{(ℓ)} genuinely O(1/n) and the difference between λ^{(ℓ)} and the (amputated) connected four-point correlator genuinely O(1/n²).)

Structures of RG flow and criticality - The RG analysis of neuron statistics is highly relevant for the critical tuning of deep neural networks. The necessity of tuning has long been appreciated in practical applications of deep learning, especially in the context of mitigating the infamous exploding and vanishing gradient problems, which make it difficult to train deep networks given finite machine precision. In the EFT framework, this is related to the fact that generic choices of hyperparameters C_b^{(ℓ)}, C_W^{(ℓ)} lead to exponential scaling of neuron correlators under RG. Taming the exponential behaviors requires tuning the network to criticality by judiciously setting these hyperparameters [38-40]. At the kernel level, the criticality analysis of Ref. [21] reveals two prominent universality classes into which networks with a variety of activation functions fall: scale-invariant (including e.g. ReLU) and K* = 0 (including e.g. tanh). In each case, K_0^{(ℓ)} flows toward a nontrivial fixed point as ℓ increases; crucially, the scaling near the fixed point is power-law rather than exponential, which allows information to propagate through the layers so the network can learn nontrivial features from data.
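For the scale-invariant class the single-input recursion has a closed form: ⟨ReLU(φ)²⟩_K = K/2 for zero-mean Gaussian φ, so Eq. (11) reduces to K^{(ℓ)} = C_b + (C_W/2) K^{(ℓ−1)}. The sketch below (layer-independent hyperparameters and numerical values are illustrative) shows that the standard critical choice C_b = 0, C_W = 2 keeps the kernel fixed, while detuned values flow exponentially.

```python
# Single-input ReLU kernel flow: K^(l) = C_b + (C_W/2) K^(l-1).
def relu_kernel_flow(K1, C_W, C_b, depth):
    Ks = [K1]
    for _ in range(depth - 1):
        Ks.append(C_b + 0.5 * C_W * Ks[-1])
    return Ks

print(relu_kernel_flow(1.0, C_W=2.0, C_b=0.0, depth=6))   # stays at 1.0: critical
print(relu_kernel_flow(1.0, C_W=1.6, C_b=0.0, depth=6))   # exponential decay
print(relu_kernel_flow(1.0, C_W=2.4, C_b=0.0, depth=6))   # exponential growth
```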
While previous criticality analyses have mostly focused on the two-point correlator, it is important to also consider higher-point correlators because they encode fluctuations across the ensemble. In other words, it is not sufficient to require that the networks are well behaved on average; the scaling behavior of each network must also be close to the average. At first sight, criticality seems to impose more constraints than the number of tunable hyperparameters, if we require power-law scaling of all higher-point correlators at arbitrary input points. However, as we will show, the structures of RG flow, manifest in the diagrammatic formulation, are such that tuning K_0 to criticality actually ensures power-law scaling of all higher-point connected correlators near the fixed point. Let us start with the two-point correlator. The asymptotic scaling behavior (exponential vs. power-law) can be inferred from the following question: upon an infinitesimal variation at the (ℓ−1)th layer, how does ⟨G^{(ℓ)}(x_1, x_2)⟩ change? Diagrammatically, this can be calculated as in Eq. (21), where a blob labeled "δ" denotes the variation of the (two-point) correlator. The result, Eqs. (22) and (23), exhibits a clear structural similarity to Eq. (15) above. Ultimately, the same subdiagram enters both equations, and we can write it as in Eq. (24), where the φ_j^{(ℓ−1)} legs are amputated, and an exchange symmetry between them in the full diagram is assumed (hence a symmetry factor 1/2 in Eq. (24)).
The same pattern persists for higher-point connected correlators. For the connected four-point correlator, an infinitesimal variation of V_4^{(ℓ−1)}(x_1, x_2; x_3, x_4) results in a change in V_4^{(ℓ)}(x_1, x_2; x_3, x_4), Eq. (26), and we find δV_4^{(ℓ)} in terms of χ^{(ℓ)} acting on δV_4^{(ℓ−1)}, Eq. (27), where the symmetrized form arises because V_4^{(ℓ)}(x_1, x_2; x_3, x_4) = V_4^{(ℓ)}(x_3, x_4; x_1, x_2). Generally, the connected 2k-point correlator is defined in terms of a coefficient function V_{2k}^{(ℓ)}, which can be shown to be O(1) [22]. Its variation follows from the diagram in Eq. (29), and we find an analogous expression, where "sym." means symmetrizing in the same way as in Eq. (27). The quantity χ^{(ℓ)}(x_1, x_2; y_1, y_2) in the equations above is a generalization of the parallel and perpendicular susceptibilities, χ_∥ and χ_⊥, introduced in Ref. [21] when analyzing the special case of two nearby inputs. In the nearby-inputs limit, tuning the network to criticality means adjusting the hyperparameters C_b^{(ℓ)}, C_W^{(ℓ)} such that the kernel recursion Eq. (11) has a fixed point K* where χ_∥ = χ_⊥ = 1. In the Supplemental Material, we show that, at least for the scale-invariant and K* = 0 universality classes, this tuning actually implies that a stronger condition, Eq. (31), is satisfied at LO in 1/n. Eq. (31) ensures that perturbations around the fixed point stay constant through the layers, not just for the two-point correlator, δ⟨G^{(ℓ)}(x_1, x_2)⟩ = δ⟨G^{(ℓ−1)}(x_1, x_2)⟩, but for the entire tower of higher-point connected correlators, δV_{2k}^{(ℓ)}(x_1, ..., x_{2k}) = δV_{2k}^{(ℓ−1)}(x_1, ..., x_{2k}). This in turn implies that for all of them, the RG flow toward the fixed point is power-law instead of exponential, once the single condition Eq. (31) is satisfied. The discussion above makes it transparent that power-law scaling of higher-point connected correlators at criticality (previously observed in Refs. [21,22] up to the eight-point level in the degenerate-input limit) has its roots in the structures of EFT interactions, as manifested by the common structure shared by the diagrams in Eqs. (21), (26) and (29).
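A rough numerical illustration of the degenerate-input limit (our sketch, with illustrative values): the derivative of the single-input kernel map K ↦ C_b + C_W⟨σ(φ)²⟩_K corresponds to the parallel susceptibility of Ref. [21] in this limit, and at the standard critical tunings (C_W = 2, C_b = 0 for ReLU; C_W = 1, C_b = 0 for tanh, in the K → 0 limit) it approaches 1, consistent with the discussion above.

```python
# Finite-difference estimate of the degenerate-input susceptibility chi = d/dK [C_b + C_W <sigma^2>_K].
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

NODES, WTS = hermegauss(80)
def expect(f):
    return (WTS * f(NODES)).sum() / np.sqrt(2.0 * np.pi)     # E[f(z)], z ~ N(0, 1)

def kernel_map(K, C_W, C_b, sigma):
    return C_b + C_W * expect(lambda z: sigma(np.sqrt(K) * z) ** 2)

def chi_parallel(K, C_W, C_b, sigma, eps=1e-5):
    return (kernel_map(K + eps, C_W, C_b, sigma) - kernel_map(K - eps, C_W, C_b, sigma)) / (2 * eps)

relu = lambda z: np.maximum(z, 0.0)
print(chi_parallel(1.0, C_W=2.0, C_b=0.0, sigma=relu))       # = C_W/2 = 1 (scale-invariant class)
print(chi_parallel(1e-3, C_W=1.0, C_b=0.0, sigma=np.tanh))   # -> 1 as K -> 0 (K* = 0 class)
```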
Summary and outlook - In this letter, we introduced a diagrammatic formalism that significantly simplifies perturbative calculations of finite-width effects in EFTs corresponding to archetypical deep neural networks. The concise reproduction of known results and the derivation of new results highlight the efficiency of the diagrammatic approach, while the incorporation of ghosts vastly simplifies 1/n counting in the EFT action. Our analysis also made transparent the structures of such EFTs which underlie the success of critical tuning in deep neural networks. In fact, a universal diagrammatic structure emerges in the RG analysis of all higher-point connected correlators of neuron preactivations, which means criticality (i.e. power-law as opposed to exponential scaling) of all the neuron statistics at initialization is governed by a single condition, Eq. (31).
From the deep learning point of view, an obvious next step is to extend the diagrammatic formalism to incorporate gradient-based training and simplify perturbative calculations involving the neural tangent kernel [7,8] and its differentials [11-13,21]. From the fundamental physics point of view, we are hopeful that much more can be learned from the intimate connection between neural networks and field theories. Understanding the structures of EFTs corresponding to other neural network architectures (e.g. recurrent neural networks [32,42] and transformers [43]) will allow us to gain further insights into this connection and potentially point to novel ML architecture designs for simulating field theories.

Supplemental Material

In a standard field theory setting, one would read off propagators and interaction vertices from the action, and use them to build diagrams from which one can calculate correlators in terms of parameters of the theory. In the present case, however, our goal is to derive RG flows, which are relations between correlators. The general strategy here is to first write ℓth-layer φ correlators in terms of (ℓ−1)th-layer ∆ correlators, i.e. expectation values of (products of) ∆_j^{(ℓ−1)}'s, as summarized by the Feynman rule Eq. (7) and exemplified in Eqs. (A.12) and (A.13) above (upon replacing the G^{(ℓ)}'s by ∆_j^{(ℓ−1)}'s to isolate the connected contribution), and then calculate these (ℓ−1)th-layer ∆ correlators in terms of (ℓ−1)th-layer φ correlators. In the second step, if a ∆ correlator involves identical neuron indices (e.g. in Eq. (14)), it simply takes its free-theory value expressed in terms of free propagators (i.e. two-point φ correlators) at LO; if distinct neuron indices are involved, we need to insert mixed ∆-φ correlators (e.g. the larger blobs in Eq. (15)) to bridge the ∆'s and four- or higher-point φ correlators. In either case, we can express the result in terms of free-theory expectation values of (ℓ−1)th-layer single-neuron operators ⟨O⟩_{K_0^{(ℓ−1)}}, where O represents a product of ∆'s and φ's (with the exception of the LO two-point correlator K_0, where O = σ(x_1)σ(x_2); see Eq. (11)). By Wick contractions we can then rewrite these expectation values in terms of those of functional derivatives of ∆'s (as in e.g. Eq. (15)).
To systematically implement this general procedure, it is convenient to introduce the *-blob notation of Eq. (A.16): a *-blob with m ∆'s and 2r φ's, all carrying the same neuron index j, is defined as the full blob minus all diagrams in which the φ^{2r}∆^m blob becomes disconnected due to Wick contractions between any number of pairs of φ legs. In the main text, we only encountered the m = 2, r = 0 and m = r = 1 cases when calculating the connected four-point correlator. In these cases, there is no distinction between *-blobs, full blobs and connected blobs, because disconnecting those blobs in any way would give zero due to the tadpole-free condition Eq. (A.14). For general m, r, though, we have to keep in mind that Wick-contracting the φ's may not be the only way to disconnect the φ^{2r}∆^m blob; nor does disconnecting a φ^{2r}∆^m blob in a diagram necessarily make the full diagram disconnected. Nevertheless, we will see in the examples in the next section that the use of *-blobs conveniently organizes the derivation of RG flows of connected φ correlators and neatly takes care of subtleties regarding double counting. The *-blobs also admit simple expressions: requiring that each φ be Wick contracted with one of the ∆'s, we arrive at the general formula of Eq. (A.17) at LO in 1/n. As explained below Eq. (15) in the main text, when calculating a full diagram we would always convolve the expression for the subdiagram in Eq. (A.17) with inverse propagators associated with the φ legs (which would become internal lines in the full diagram); in other words, we would amputate the φ legs in Eq. (A.17). This would leave us with just the expectation value on the right-hand side of Eq. (A.17). In what follows we will use *-blobs to build up diagrams, although they coincide with full blobs (in which case the "*" label is redundant) when r = 0 or m = r = 1. We finally remark on the comparison of our results with Ref. [21], which also presented the two-point correlator up to NLO and the connected four-point correlator at LO. The results in Ref. [21] are also written in terms of free-theory expectation values of single-neuron operators, but the operators there are products of σ's and φ's, whereas our results are expressed in terms of ∆'s and their functional derivatives.

By the definition of the *-blob in Eq. (A.16), the diagram in Eq. (A.20) does not include contributions where two of the four φ_j legs are Wick contracted (Eq. (A.21)). Contracting φ_j legs like this disconnects the φ_j^4 ∆_j subdiagram while the full diagram remains connected. On the other hand, these contributions were in fact already included in Eq. (A.19), because the upper parts of these diagrams (involving the (ℓ−1)th-layer connected four-point correlator) are simply NLO corrections to the φ_j^{(ℓ−1)} propagator. Quite generally, our definition of the *-blob in Eq. (A.16) conveniently avoids double counting of such diagrams. Adding up both contributions discussed above, we obtain the RG flow of the NLO two-point correlator, which can be used to recursively determine K_1^{(ℓ)} from K_1^{(ℓ−1)} and K_0^{(ℓ−1)}.

Connected six-point correlator
We next demonstrate the derivation of the RG flow of the connected six-point correlator at LO. Similarly to Eq. (13) for the connected four-point correlator, we start from Eq. (A.23) and need to consider three cases. First, if j_1, j_2, j_3 are all equal, j_1 = j_2 = j_3 ≡ j, we can simply use free-theory propagators K_0^{(ℓ−1)} to connect the φ_j fields contained in ∆_j to obtain the LO result, Eq. (A.24), where the sum over j yields a factor of n_{ℓ−1} which cancels one of the three factors of 1/n_{ℓ−1} from the φ²∆ vertices and renders the result O(1/n²). Second, if j_1, j_2, j_3 take two distinct values, we need to use a connected four-point correlator at the (ℓ−1)th layer to connect neurons with distinct indices (while still using free propagators to connect neurons with identical indices), Eq. (A.25). The diagram is automatically connected as a result of our *-blob definition Eq. (A.16). To arrive at the expression in Eq. (A.25), we have used Eq. (A.17) with (m, r) = (2, 1) and (1, 1) for the two *-blobs, respectively, and the symmetry factor 1/2² = 1/4 comes from exchanging y_1 ↔ y_2 and y_3 ↔ y_4 among the four φ_j legs attached to V_4^{(ℓ−1)}(y_1, y_2; y_3, y_4). Compared to the first contribution, Eq. (A.24), here the j sum yields an additional factor of n_{ℓ−1} while the inserted connected four-point correlator carries a factor of 1/n_{ℓ−2}, so the end result is again O(1/n²). Note that Eq. (A.25) holds regardless of whether j_1 = j_2 terms are included in the sum (the same is true for Eq. (15) in the main text).
Finally, if j_1, j_2, j_3 are all distinct, we must use either a connected six-point correlator or two connected four-point correlators to connect the (ℓ−1)th-layer neurons: in the former case the diagram involves V_6^{(ℓ−1)}(y_1, y_2; y_3, y_4; y_5, y_6), while in the latter case it involves two insertions of V_4^{(ℓ−1)}.

Numerical experiments

We generate an ensemble of N_net = 1,000 neural networks, with hidden-layer width n = 300 and depth L = 30, initialized according to Eq. (A.3). We consider an n_0 = 2 dimensional input space and pick four points x_1, x_2, x_3, x_4 as shown in Fig. 1 for illustration. The activation function is chosen to be either ReLU or tanh. For each network, we compute neuron preactivations at every layer according to Eq. (1). The connected two-point and four-point correlators are then obtained from ensemble averaging.
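A hedged sketch of such ensemble estimators follows (smaller ensemble, width, and depth than in the experiment described above, and arbitrary input points; the exact estimator used in the paper may differ).

```python
# Ensemble estimators for the connected two-point and four-point correlators of last-layer preactivations.
import numpy as np

def ensemble_preacts(rng, n_nets, widths, xs, C_W, C_b, sigma):
    """Last-layer preactivations, shape (n_nets, n_L, n_inputs), following Eq. (1)."""
    phi = np.repeat(xs[None, :, :], n_nets, axis=0)           # (n_nets, n_0, n_inputs)
    for l in range(1, len(widths)):
        n_in, n_out = widths[l - 1], widths[l]
        W = rng.normal(0.0, np.sqrt(C_W / n_in), size=(n_nets, n_out, n_in))
        b = rng.normal(0.0, np.sqrt(C_b), size=(n_nets, n_out, 1))
        act = phi if l == 1 else sigma(phi)                    # first layer acts on the raw input
        phi = np.einsum('aij,ajk->aik', W, act) + b
    return phi

rng = np.random.default_rng(2)
xs = np.array([[1.0, 0.5, -0.5, -1.0],                         # illustrative input points,
               [0.2, -0.3, 0.4, -0.1]])                        # columns are x_1 ... x_4 (n_0 = 2)
widths = [2] + [100] * 10                                      # reduced width and depth for illustration
relu = lambda z: np.maximum(z, 0.0)
phi = ensemble_preacts(rng, 2000, widths, xs, C_W=2.0, C_b=0.0, sigma=relu)
n_nets, n, _ = phi.shape

# Connected two-point correlator G(x_a, x_b): average over networks and neurons
# (one-point functions vanish, so no subtraction is needed).
G = np.einsum('aim,ain->mn', phi, phi) / (n_nets * n)

# Connected four-point correlator for neuron pairs i != j:
# <phi_i(x1) phi_i(x2) phi_j(x3) phi_j(x4)> - <phi_i(x1) phi_i(x2)> <phi_j(x3) phi_j(x4)>
# (the remaining Wick-pair subtractions involve <phi_i phi_j> with i != j, which vanish).
pp12 = phi[:, :, 0] * phi[:, :, 1]
pp34 = phi[:, :, 2] * phi[:, :, 3]
S12, S34 = pp12.sum(axis=1), pp34.sum(axis=1)
pair_mean = ((S12 * S34 - (pp12 * pp34).sum(axis=1)) / (n * (n - 1))).mean()
connected_4pt = pair_mean - pp12.mean() * pp34.mean()          # expected to be O(1/n)
print(G)
print(n * connected_4pt)
```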


FIG. 1. Input points chosen for illustration in our numerical experiments.

FIG. 2. Two-point correlators for ReLU networks. Asymptotic scaling toward the fixed point is power-law at criticality (middle panels) and exponential away from criticality (left and right panels). The fixed points K* in the middle panels are given by Eq. (A.37). See text for details.

FIG. 3. Two-point correlators for tanh networks. Asymptotic scaling toward the fixed point is power-law at criticality (middle panels) and exponential away from criticality (left and right panels). The fixed point K* in the top-right panel is given by the nonzero solution to Eq. (A.38). See text for details.

FIG. 4. Connected four-point correlators for ReLU networks at criticality. Asymptotic scaling is power-law for all input choices. See text for details.

FIG. 5. Connected four-point correlators for tanh networks at criticality. Asymptotic scaling is power-law for all input choices. See text for details.