Large Deviations Approach to Random Recurrent Neuronal Networks: Parameter Inference and Fluctuation–Induced Transitions

We here unify the field-theoretical approach to neuronal networks with large deviations theory. For a prototypical random recurrent network model with continuous-valued units, we show that the effective action is identical to the rate function and derive the latter using field theory. This rate function takes the form of a Kullback-Leibler divergence, which enables data-driven inference of model parameters and calculation of fluctuations beyond mean-field theory. Lastly, we expose a regime with fluctuation-induced transitions between mean-field solutions.

Introduction.- Biological neuronal networks are systems with many degrees of freedom and intriguing properties: their units are coupled in a directed, nonsymmetric manner, so that they typically operate outside thermodynamic equilibrium [1,2]. The primary analytical method to study neuronal networks has been mean-field theory [3-8]. Its field-theoretical basis has been exposed only recently [9,10]. However, to understand the parallel and distributed information processing performed by neuronal networks, the study of the forward problem - from the microscopic parameters of the model to its dynamics - is not sufficient. One additionally faces the inverse problem of determining the parameters of the model given a desired dynamics and thus function. Formally, one needs to link statistical physics with concepts from information theory and statistical inference.
We here expose a tight relation between statistical field theory of neuronal networks, large deviations theory, information theory, and inference. To this end, we generalize the probabilistic view of large deviations theory, which yields rigorous results for the leading-order behavior in the network size N [11,12], to arbitrary single-unit dynamics, transfer functions, and multiple populations. We furthermore show that the central quantity of large deviations theory, the rate function, is identical to the effective action in statistical field theory. This link exposes a second relation: Bayesian inference and prediction are naturally formulated within this framework, spanning the arc to information processing. Concretely, we develop a method for parameter inference from transient data for single- and multi-population networks. Lastly, we overcome the inherent limit of mean-field theory - its neglect of fluctuations. We develop a theory for fluctuations of the order parameter when the intrinsic timescale is large and discover a regime with fluctuation-induced transitions between two coexisting mean-field solutions.
First, we introduce the model in its most general form. Then, we develop the theory for a single population. Last, we generalize it to multiple populations.
Model.- We consider block-structured random networks of N = ∑_α N_α nonlinearly interacting units x_i^α(t) driven by an external input ξ_i^α(t). The dynamics of the i-th unit in the α-th population is governed by the stochastic differential equation

τ_α ẋ_i^α(t) = −U_α′(x_i^α(t)) + ∑_β ∑_{j=1}^{N_β} J_ij^{αβ} φ_β(x_j^β(t)) + ξ_i^α(t).  (1)

In the absence of recurrent and external inputs, the units undergo an overdamped motion with time constant τ_α in a potential U_α(x). The J_ij^{αβ} are independent and identically Gaussian-distributed random coupling weights with zero mean and population-specific variance ⟨(J_ij^{αβ})²⟩ = g_αβ²/N_β, where the coupling strength g_αβ controls the heterogeneity of the weights. The time-varying external inputs ξ_i^α(t) are independent Gaussian white-noise processes with zero mean and correlation functions ⟨ξ_i^α(t₁) ξ_j^β(t₂)⟩ = 2D_α δ_ij δ_αβ δ(t₁ − t₂). The single-population model corresponds to the one studied in Ref. [4] if the external input vanishes, D = 0, the potential is quadratic, U(x) = x²/2, and the transfer function is sigmoidal, φ(x) = tanh(x); for D = 1/2, U(x) = −log(A² − x²), and φ(x) = x it corresponds to the one in Ref. [11], which is inspired by the dynamical spin glass model of Ref. [13].
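As a concrete illustration, the single-population version of Eq. (1) can be integrated with an Euler-Maruyama scheme. The sketch below is our own minimal implementation (the function name and all parameter values are assumptions, not taken from the paper), using U(x) = x²/2 and φ = tanh:

```python
import numpy as np

# Minimal sketch (our own implementation; names and parameter values are
# assumptions): Euler-Maruyama integration of the single-population version of
# Eq. (1), tau * dx_i/dt = -U'(x_i) + sum_j J_ij phi(x_j) + xi_i(t),
# with U(x) = x^2/2 (so U'(x) = x), phi = tanh, and noise intensity 2*D.
def simulate(N=200, g=1.5, D=0.0, tau=1.0, dt=1e-2, steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    J = rng.normal(0.0, g / np.sqrt(N), size=(N, N))  # <J_ij^2> = g^2 / N
    x = rng.normal(0.0, 1.0, size=N)                  # random initial condition
    traj = np.empty((steps, N))
    for t in range(steps):
        drift = -x + J @ np.tanh(x)                   # -U'(x) + recurrent input
        noise = np.sqrt(2.0 * D * dt) * rng.normal(size=N)
        x = x + (dt / tau) * drift + noise / tau
        traj[t] = x
    return traj

traj = simulate()          # g = 1.5 > 1, D = 0: chaotic regime
q = traj[-1].var()         # network-averaged variance as order parameter
```

For g > 1 and D = 0 this reproduces the chaotic regime of Ref. [4] with a nonzero network-averaged variance, while for g < 1 the activity decays to the fixed point x = 0.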
Field theory.- The field-theoretical treatment of Eq. (1) employs the Martin-Siggia-Rose-de Dominicis-Janssen path integral formalism [14-17]. We denote the expectation over paths across different realizations of the noise ξ as (Appendix A1)

⟨⋅⟩_x^J = ∫ Dx ∫ Dx̃ ⋅ e^{S₀(x, x̃) − x̃ᵀ J φ(x)},

where ⟨⋅⟩_x^{J,ξ} integrates over the unique solution of Eq. (1) given one realization ξ of the noise. Here, S₀(x, x̃) = x̃ᵀ(ẋ + U′(x)) + D x̃ᵀx̃ is the action of the uncoupled neurons. We use the shorthand notation aᵀb ≡ ∑_i ∫₀ᵀ dt a_i(t) b_i(t). For large N, the system becomes self-averaging, a property known from many disordered systems with large numbers of degrees of freedom: the collective behavior is stereotypical, independent of the realization J_ij. A self-averaging observable has a sharply peaked distribution over realizations of J - the observable always attains the same value, close to its average. This, however, only holds for observables averaged over all units, reminiscent of the central limit theorem. These are generally of the form (1/N) ∑_{i=1}^N f(x_i), where f is an arbitrary functional of a single unit's trajectory. It is therefore convenient to introduce the scaled cumulant-generating functional

W_N(f) = (1/N) ln ⟨⟨e^{∑_{i=1}^N f(x_i)}⟩_x^J⟩_J,  (2)

where the prefactor 1/N makes sure that W_N is an intensive quantity, reminiscent of the bulk free energy [18].
In fact, we will show that the N -dependence vanishes in the limit N → ∞ because the system decouples.
Performing the average over J, i.e. evaluating ⟨e^{−x̃ᵀJφ(x)}⟩_J, and introducing the auxiliary field C as well as the conjugate field C̃, we can write W_N in terms of these fields (Appendix A1). The effective action is defined as the Legendre transform of W_N(f),

Γ_N(µ) = ∫ Dy µ(y) f_µ(y) − W_N(f_µ),  (5)

where f_µ is determined implicitly by the condition µ = W_N′(f_µ) and the derivative W_N′(f) has to be understood as a generalized derivative, the coefficient of the linearization akin to a Fréchet derivative [19].
Note that W_N and Γ_N are, respectively, generalizations of a cumulant-generating functional and of the effective action [20] because both map a functional (f or µ) to the reals. For the choice f(x) = jᵀx, where j(t) is an arbitrary function, we recover the usual cumulant-generating functional of the single unit's trajectory (Appendix A4) and the corresponding effective action.
Rate function.- Any network-averaged observable, for which we may expect self-averaging to hold, can likewise be obtained from the empirical measure

µ_x(y) = (1/N) ∑_{i=1}^N δ(y − x_i).

Of particular interest is the leading-order exponential behavior of the distribution of empirical measures P(µ) = ⟨⟨P(µ_x)⟩_x^J⟩_J across realizations of J and ξ. This behavior in the large-N limit is described by what is known as the rate function in large deviations theory [see e.g. 21]; H(µ) captures the leading exponential probability

P(µ) ≍ e^{−N H(µ)}.

For large N, the probability of an empirical measure that does not correspond to the minimum H′(µ̄) = 0 is thus exponentially suppressed. Put differently, the system is self-averaging and the statistics of any network-averaged observable can be obtained using µ̄. Similar as in field theory, it is convenient to introduce the scaled cumulant-generating functional of the empirical measure. Because (1/N) ∑_{i=1}^N f(x_i) = ∫ Dy µ(y) f(y) holds for an arbitrary functional f(x_i) of the single unit's trajectory x_i, Eq. (2) has the form of the scaled cumulant-generating functional for µ at finite N.
Using a saddle-point approximation for the integrals over C and C̃ in Eq. (4) (Appendix A1), we obtain the scaled cumulant-generating functional in the large-N limit. Both C and C̃ are determined self-consistently by the saddle-point equations, where ∂_C denotes a partial functional derivative.
From the scaled cumulant-generating functional, Eq. (8), we obtain the rate function via a Legendre transformation [22]. Note that H(µ) is still convex even if µ itself is multimodal. Comparing with Eq. (5), we observe that the rate function is equivalent to the effective action. The Legendre transformation can be solved for µ to obtain a closed expression for the rate function viz. effective action (Appendix A2), one main result of our work,

H(µ) = ∫ Dx µ(x) ln [ µ(x) / ⟨δ[ẋ + U′(x) − η]⟩_η ],  (9)

where η is a zero-mean Gaussian process with a correlation function that is determined by µ(x),

⟨η(t₁) η(t₂)⟩ = g² ∫ Dx µ(x) φ(x(t₁)) φ(x(t₂)) + 2D δ(t₁ − t₂).  (10)

For U(x) = −log(A² − x²), D = 1/2, and φ(x) = x, Eq. (9) can be shown to be equivalent to the mathematically rigorous result obtained in the seminal work by Ben Arous and Guionnet (Appendix A3) [11,12].

[Fig. 1 caption, panel F: Power spectra on the left- (dark, solid curves) and right-hand sides (light, dotted curves) of Eq. (12) for the inferred parameters. Further parameters: N = 10,000, temporal discretization dt = 10⁻², simulation time T = 1,000, time span discarded to reach steady state T₀ = 100.]
The rate function Eq. (9) takes the form of a Kullback-Leibler divergence. Thus, it possesses a minimum at

µ̄(x) = ⟨δ[ẋ + U′(x) − η]⟩_η.  (11)

This most likely measure corresponds to the well-known self-consistent stochastic dynamics that is obtained in field theory [4,9,10,23]. Note that the correlation function of the effective stochastic input η at the minimum depends self-consistently on µ̄(x) through Eq. (10). However, the rate function H(µ) contains more information: it quantifies the suppression of departures µ − µ̄ from the most likely measure and therefore allows the assessment of fluctuations that are beyond the scope of the classical mean-field result.
Parameter Inference.- The rate function opens the way to address the inverse problem: given the network-averaged activity statistics, encoded in the corresponding empirical measure µ, what are the statistics of the connectivity and the external input, i.e. g and D?
We determine the parameters using maximum likelihood estimation. Using Eq. (7) and Eq. (9), the likelihood of the parameters is given by ln P(µ | g, D) ≃ −N H(µ | g, D),

[Fig. 2 caption (start truncated): … (A,B) and meta-stability for φ(x) = clip(tan(x), −1, 1) (C,D). A: Temporal order parameter statistics across ten simulations (bars) and theory (solid curve) from Eq. (13). B: Order parameter variance for 10 realizations of the connectivity with standard error of the mean (symbols) and theory (solid curve) from Eq. (13). C: Mean order parameter for different initial values q₀ from simulations (symbols) and self-consistent theory (solid curves). D: Fluctuation-induced bistability of the order parameter for N = 750, g = 0.95. T = 5,000 in A,D; U(x) = x²/2; further parameters as in Fig. 1.]
where ≃ denotes equality in the limit N → ∞ and we made the dependence on g and D explicit. The maximum likelihood estimate of the parameters g and D corresponds to the minimum of the Kullback-Leibler divergence H, Eq. (9), on the right-hand side. Evaluating the derivative of H(µ | g, D) with respect to a ∈ {g, D} (Appendix B1), with C₀(t₁, t₂) ≡ ∫ Dx µ(x) (ẋ(t₁) + U′(x(t₁))) (ẋ(t₂) + U′(x(t₂))), shows that the derivative vanishes for C₀ = C_η. Assuming stationarity, in Fourier domain this condition reads

S_{ẋ+U′(x)}(f) = g² S_{φ(x)}(f) + 2D,  (12)

where S_X(f) denotes the network-averaged power spectrum of the observable X. Using non-negative least squares [24], Eq. (12) allows a straightforward inference of g and D (Fig. 1). To determine the transfer function φ and the potential U, one can use model comparison techniques (Appendix B2). Using the inferred parameters, we can also predict the future activity of a unit from the knowledge of its recent past (Appendix B3).
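Since Eq. (12) is linear in the unknowns g² and 2D, the fit reduces to non-negative least squares on estimated power spectra. The sketch below is our own illustration (the helper `infer_g_D`, the spectral estimator, and the discretization choices are assumptions, not the authors' code):

```python
import numpy as np
from scipy.optimize import nnls

# Our own illustration of the inference step: Eq. (12) is linear in (g^2, 2D),
#   S_{xdot + U'(x)}(f) = g^2 * S_{phi(x)}(f) + 2D,
# so given network-averaged spectra it reduces to non-negative least squares.
def infer_g_D(x, dt, phi=np.tanh, dU=lambda u: u):
    lhs = np.diff(x, axis=0) / dt + dU(x[:-1])   # xdot + U'(x) via finite differences
    rhs = phi(x[:-1])
    n = lhs.shape[0]
    # network-averaged one-sided power spectra (mean over units)
    S_lhs = (np.abs(np.fft.rfft(lhs, axis=0)) ** 2).mean(axis=1) * dt / n
    S_phi = (np.abs(np.fft.rfft(rhs, axis=0)) ** 2).mean(axis=1) * dt / n
    A = np.column_stack([S_phi, np.ones_like(S_phi)])  # unknowns: g^2 and 2D
    (g2, twoD), _ = nnls(A, S_lhs)
    return np.sqrt(g2), twoD / 2

# synthetic sanity check on uncoupled units (ground truth g = 0, D = 0.5)
rng = np.random.default_rng(0)
dt, steps, N, D_true = 1e-2, 4000, 100, 0.5
x = np.zeros((steps, N))
for t in range(steps - 1):
    x[t + 1] = x[t] - dt * x[t] + np.sqrt(2 * D_true * dt) * rng.normal(size=N)
g_hat, D_hat = infer_g_D(x, dt)
```

In the uncoupled check, the left-hand-side spectrum is flat at 2D, so the fit should return a coupling close to zero and D̂ close to its ground-truth value.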
Fluctuations.- The rate function allows us to go beyond mean-field theory and examine fluctuations of the order parameter. Here, we use the network-averaged variance q(t) = C(t, t) from Eq. (3) as an order parameter and restrict the discussion to the case U(x) = x²/2. Fig. 2A shows the distribution of q(t) across time and across realizations of the connectivity. The fluctuations across realizations of the connectivity can be computed from the curvature of the rate function I(C) that is obtained from Eq. (9) by the contraction principle (Appendix C1). In a stationary state and considering only the fluctuations across realizations of the connectivity, for slow recurrent dynamics τ_c ≫ 1 we obtain an approximation, Eq. (13), for the fluctuations of q. Here, ⟨fg⟩₀ ≡ ⟨f(x(t)) g(x(t))⟩₀ denotes an expectation w.r.t. the self-consistent measure (11). For vanishing noise, D = 0, and g > 1, the dynamics are slow and the theory matches the empirical fluctuations very well (Fig. 2A,B). Deviations in Fig. 2B are caused by two effects: for g ↘ 1, periodic solutions appear as a finite-size effect; for growing g, the timescale τ_c decreases, eventually violating the assumption τ_c ≫ 1 entering Eq. (13). Rate functions like I(C) in general also allow one to estimate the tail probability P(q > θ) ≈ exp(−N I(θ)), which here shows a quadratic decline for large departures (Fig. 2A).
When the denominator in Eq. (13) vanishes, fluctuations grow large, indicative of a continuous phase transition. For φ‴(0) < 0, the denominator vanishes for g ≥ 1 (Fig. 2B), in line with the established theory, the breakdown of linear stability of the fixed point x = 0 [4]. For φ‴(0) > 0, however, Eq. (13) predicts qualitatively different behavior: the denominator vanishes at g < 1, in the linearly stable regime. In fact, we find that this regime features the coexistence of two stable mean-field solutions (Fig. 2C, Appendix C2) and fluctuation-driven first-order transitions between them (Fig. 2D). The solution with larger q corresponds to self-sustained activity; the solution with smaller q corresponds to the fixed point x = 0 and is stable (Appendix C2), in contrast to the case of a threshold-power-law transfer function [6].
Multiple Populations.- For multiple populations, any population-averaged observable can be obtained from the empirical measure

µ_x^α(y) = (1/N_α) ∑_{i=1}^{N_α} δ(y − x_i^α).

The joint distribution of all population-specific empirical measures {µ^α} is determined by the rate function (Appendix D)

H({µ^α}) = ∑_α γ_α ∫ Dx µ^α(x) ln [ µ^α(x) / ⟨δ[ẋ + U_α′(x) − η^α]⟩_{η^α} ],

where γ_α = N_α/N and η^α is a zero-mean Gaussian process with

⟨η^α(t₁) η^α(t₂)⟩ = ∑_β g_αβ² ∫ Dx µ^β(x) φ_β(x(t₁)) φ_β(x(t₂)) + 2D_α δ(t₁ − t₂).

[Fig. 3 caption fragment: …, and D = 0; simulation parameters as in Fig. 1.]
Again, the rate function can be interpreted as a log-likelihood; its derivative leads to (Appendix E1)

S_{ẋ^α+U_α′(x^α)}(f) = ∑_β g_αβ² S_{φ_β(x^β)}(f) + 2D_α,  (16)

which generalizes Eq. (12) to multiple populations. Using Eq. (16), the inferred connectivity g_αβ matches the ground truth well; accordingly, two unconnected populations (Fig. 3A,B) can be clearly distinguished from a more involved network where one population (α = 1) is only active due to the recurrent input from the other population (α = 2, Fig. 3C,D). The method can thus distinguish intrinsically generated activity from a case where activity is driven from outside the network. However, inference of a unique set of parameters is only possible if the output spectra S^α_{φ(x)}(f) differ sufficiently across α. If the output spectra match closely, Eq. (16) leads to a degenerate set of solutions that satisfy ∑_β g_αβ² = const. and are all equally likely given the data (Appendix E2).
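The degeneracy can be made explicit numerically: if two output spectra coincide, the design matrix of the least-squares problem behind Eq. (16) loses rank, so only the sum ∑_β g_αβ² is identifiable. A constructed toy example (the Lorentzian spectra are our own choice, purely for illustration):

```python
import numpy as np

# Toy illustration (our own construction): identical output spectra
# S^1_phi(f) = S^2_phi(f) make the columns of the design matrix of the
# multi-population condition Eq. (16) linearly dependent.
f = np.linspace(0.0, 5.0, 200)
S1 = 1.0 / (1.0 + f ** 2)        # Lorentzian output spectrum, population 1
S2 = S1.copy()                   # population 2: identical spectrum
A = np.stack([S1, S2, np.ones_like(f)], axis=1)  # columns: g_a1^2, g_a2^2, 2D_a
rank = np.linalg.matrix_rank(A)
# rank 2 < 3: shifting weight along the null direction (1, -1, 0), i.e.
# g_a1^2 + delta, g_a2^2 - delta, leaves the fit residual unchanged
null_dir = np.array([1.0, -1.0, 0.0])
residual_change = np.linalg.norm(A @ null_dir)
```

Any split of g_α1² + g_α2² along the null direction fits the data equally well, which is exactly the degeneracy discussed above.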
Discussion.- In this Letter, we found a tight link between the field-theoretical approach to neuronal networks and its counterpart based on large deviations theory. We obtained the rate function of the empirical measure for the widely used and analytically solvable model of a recurrent neuronal network [4] by field-theoretical methods. This rate function generalizes the seminal result by Ben Arous and Guionnet [11,12] to arbitrary potentials, transfer functions, and multiple populations. Intriguingly, our derivation elucidates that the rate function is identical to the effective action and takes the form of a Kullback-Leibler divergence, akin to Sanov's theorem for sums of i.i.d. random variables [21,22]. The rate function can thus be interpreted as a distance between an empirical measure, for example given by data, and the activity statistics of the network model. This result allows us to address the inverse problem of inferring the parameters of the connectivity and external input from a set of trajectories and to determine the potential and the transfer function.
We here restricted the analysis to networks with independently drawn random weights with zero mean. Since correlated weights have a profound impact on the dynamics that can be captured using both field theory [25] and large deviations theory [26,27], it is an interesting challenge to extend the analysis in this direction. Likewise, synaptic weights with non-vanishing mean, as they appear in sparsely connected networks, present an interesting extension, because they promote fluctuation-driven states when feedback is sufficiently positive. Another important deviation from independent weights in biological neural networks are motifs [28], which pose a significant challenge already for the field-theoretical approach [29]. Beyond the weight statistics, we assumed that the dynamics are governed by the first-order differential equation (1). Indeed, the field-theoretical approach can be generalized to a much broader class of dynamics that do not necessarily possess an action [30]; hence, it seems possible to also derive large deviations results for more general dynamics. In this regard, the extension to spiking networks is a particularly interesting but also challenging future direction. Whether the model, Eq. (1), with its current limitations - the independent weights and the first-order dynamics - allows accurate inference of network parameters from cortical recordings is an intriguing question for further research.
The unified description of random networks by statistical field theory and large deviations theory opens the door to established techniques from either domain to capture beyond-mean-field behavior. Such corrections are important for small or sparse networks with non-vanishing mean connectivity, to explain correlated neuronal activity, and to study information processing in finite-size networks with realistically limited resources. We here make a first step by computing fluctuation corrections from the rate function. The quantitative theory explains near-critical fluctuations for g ∈ [1, 1 + δ(N)] and we discover that expansive gain functions, as found in biology [31], lead to qualitatively different collective behavior than the well-studied contractive sigmoidal ones: the former feature meta-stable network states with noise-induced first-order transitions between them; the latter allow for only a single solution and show second-order phase transitions.
We are grateful to Olivier Faugeras and Etienne Tanré for helpful discussions on LDT of neuronal networks, to Anno Kurth for pointing us to the Fréchet derivative, and to Alexandre René, David Dahmen, Kirsten Fischer,

Scaled Cumulant Generating Functional

Here, we derive the scaled cumulant generating functional and the saddle-point equations. The first steps of the derivations are akin to the manipulations presented in [1,2]; thus we keep the presentation concise. We interpret the stochastic differential equations governing the network dynamics in the Itô convention. Using the Martin-Siggia-Rose-de Dominicis-Janssen path integral formalism, the expectation ⟨⋅⟩_x^J of some arbitrary functional G(x) can be written as a path integral, where we used the Fourier representation δ(x) = (1/2πi) ∫_{−i∞}^{i∞} e^{x̃x} dx̃ in every timestep in the second step and defined the action. An additional average over realizations of the connectivity J_ij i.i.d.
∼ N(0, N⁻¹g²) only affects the term −x̃ᵀJφ(x) in the action, where we introduced the network-averaged auxiliary field via a Hubbard-Stratonovich transformation. The average over the connectivity and the subsequent Hubbard-Stratonovich transformation decouple the dynamics across units; afterwards the units are only coupled through the global fields C and C̃. Now, we consider the scaled cumulant generating functional of the empirical density. Using the above results and the abbreviation φ(x) ≡ φ, it can be written in terms of a single-particle cumulant generating functional Ω, where the N in front of Ω results from the factorization of the N integrals over x_i and x̃_i each; thus it is a hallmark of the decoupled dynamics. Next, we approximate the C and C̃ integrals in a saddle-point approximation, where C and C̃ are determined by the saddle-point equations and ∂_C denotes a partial functional derivative. In the limit N → ∞, the remainder O(ln(N)/N) vanishes and the saddle-point approximation becomes exact.

Rate Function
Here, we derive the rate function from the scaled cumulant generating functional. According to the Gärtner-Ellis theorem [3], we obtain the rate function via the Legendre transformation, with µ implicitly defined by the stationarity condition of the Legendre transform. Using the Gärtner-Ellis theorem, we implicitly assume that H(µ) is convex [3]. This is, however, not the same as assuming that µ, or the most likely empirical measure µ̄, is unimodal. The latter would be a serious restriction, as it would prohibit for example treating the bistable case we investigate in the manuscript. A concave P(µ), and hence a convex H(µ), simply corresponds to the situation with a single most likely measure µ̄, but it does not put restrictions on µ̄ itself. In particular, µ̄ may still be bimodal. Due to the saddle-point equations, the derivative of the cumulant generating functional in Eq. (2) simplifies, because the derivative only acts on the f that is explicit in Ω(f, C, C̃) and not on the implicit dependencies through C, C̃.
Taking the logarithm and inserting µ_f(x) into the Legendre transformation (1), we identify µ(x) in the saddle-point equation and thus C_{µ_f} = C_µ. Accordingly, the last two terms in the Legendre transformation cancel and we arrive at the rate function, where C_µ is still determined self-consistently by µ. In the main text, we use a shorthand for the measure appearing in the rate function; using the Martin-Siggia-Rose-de Dominicis-Janssen formalism, the two notations are equivalent, since x̃ᵀ(ẋ + U′(x)) generates the corresponding action.

Equivalence to Ben Arous and Guionnet (1995)

Here, we show explicitly that the rate function we obtained generalizes the rate function obtained by Ben Arous and Guionnet [4], whose limitation to finite temperature and time was lifted later [5]. We start with Theorem 4.1 in [4], which establishes a good rate function. Now we relate H to the rate function derived above, Eq. (3). Using the Onsager-Machlup action, we rewrite the reference measure. Next, we transform gy → y, ỹ/g → ỹ and solve the integral over y in G(µ). The Onsager-Machlup actions in the logarithm in D_KL(µ ‖ Q) and in G(µ) cancel and we arrive at the result, up to an additive constant that we set to zero. Since C_µ(u, v) = ∫ Dx µ(x) x(u) x(v), the rate function by Ben Arous and Guionnet is thus equivalent to Eq. (3) with φ(x) = x and D = 1/2.

Background on Rate Function
Relation to Sompolinsky, Crisanti, Sommers (1988)

Here, we relate the approach that we laid out in the main text to the approach pioneered by Sompolinsky, Crisanti, and Sommers [6] (reviewed in [2,7]), using our notation for consistency. Therein, the starting point is the scaled cumulant-generating functional Ŵ_N(j), which gives rise to the cumulants of the trajectories. For the linear functional f(x) = jᵀx we have ∑_{i=1}^N f(x_i) = jᵀx and thus W_N(jᵀx) = Ŵ_N(j). Put differently, the scaled cumulant-generating functional of the trajectories Ŵ_N(j) is a special case of the more general scaled cumulant-generating functional W_N(f) we consider in this manuscript. Of course, one can start from the scaled cumulant-generating functional of the observable of interest and derive the corresponding rate function. Conversely, we show below how to obtain the rate function of a specific observable from the rate function of the empirical measure.
Contraction Principle

Here, we relate the rather general rate function of the empirical measure H(µ) to the rate function of a particular observable I(C). As an example, we choose the correlation function because it is a quantity that arises naturally during the Hubbard-Stratonovich transformation. The generic approach to this problem is given by the contraction principle [3]: the infimum is constrained to the empirical measures that give rise to the correlation function C, i.e. those that fulfill C(u, v) = ∫ Dx µ(x) φ(x(u)) φ(x(v)). Writing H(µ) as the Legendre transform of the scaled cumulant-generating functional, the empirical measure only appears linearly. Using a Lagrange multiplier k(u, v), the infimum over µ leads to the constraint f(x) = φᵀkφ and we arrive at the contracted rate function. Once again, we see how to relate W_N(f) to a specific observable - this time for the choice f(x) = φᵀkφ.
Up to this point, the discussion applies to any observable. For the current example, we can proceed a bit further. With the redefinition C̃ + k → C̃, we obtain an expression in which Ω₀ is independent of k. Now we can take the infimum over k. The remaining extremum gives rise to a self-consistency condition for the correlation function.
As a side remark, we mention that the expression in the brackets of Eq. (4) is the joint effective action for C and C̃, because for N → ∞, the action equals the effective action. This result is therefore analogous to the finding that the effective action in the Onsager-Machlup formalism is given as the extremum of its counterpart in the Martin-Siggia-Rose-de Dominicis-Janssen formalism [8, Eq. (24)]. The only difference is that here, we are dealing with second-order statistics and not just mean values. The origin of this finding is the same in both cases: we are only interested in the statistics of the physical quantity (the one without tilde, x or C, respectively). Therefore we only introduce a source field (k in the present case) for this one, but not for the auxiliary field, which amounts to setting the source field of the latter to zero. This is translated into the extremum in Eq. (4) over the auxiliary variable [8, Appendix 5].
Tail Probability

Large deviations results are often stated for the tail probability P(x > θ), where θ is in the tail. Since the notion of a tail cannot be unambiguously defined for quantities like the empirical measure or correlation functions, at least not in an obvious way, we here give an example of how to relate the rate function of the empirical measure to a tail probability.
First, we use the contraction principle to get a rate function for a scalar quantity, e.g. the order parameter q = ∫ Dx µ(x) φ(x(t)) φ(x(t)), where t is large but fixed such that the measure becomes stationary. Since q is a scalar quantity, one obtains the tail probability as ln P(q > θ) ≃ −N I(q = θ). Below, we calculate both the mean and the variance of q. In general, this would not be sufficient to obtain a tail estimate. However, the numerics indicate that the tail is indeed Gaussian (Fig. 3D) such that the first two cumulants are indeed sufficient.
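For a quadratic rate function, the estimate ln P(q > θ) ≃ −N I(θ) agrees with the Gaussian survival function to leading exponential order. A small numerical illustration (all numbers are toy values of our own choosing):

```python
import math

# Toy illustration (our own numbers): for a quadratic rate function
# I(q) = (q - qbar)^2 / (2 sigma^2), the large-deviations tail estimate
# ln P(q > theta) ~ -N I(theta) captures the exponential order of the exact
# Gaussian survival probability for q ~ N(qbar, sigma^2 / N).
qbar, sigma, N, theta = 0.5, 0.2, 1000, 0.55

I_theta = (theta - qbar) ** 2 / (2 * sigma ** 2)
ld_log_tail = -N * I_theta                        # large-deviations estimate

z = (theta - qbar) / (sigma / math.sqrt(N))       # standardized threshold
gauss_log_tail = math.log(0.5 * math.erfc(z / math.sqrt(2)))
```

The exact Gaussian log-tail differs from −N I(θ) only by a subexponential prefactor, so the two agree on the exponential scale.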

Log-Likelihood Derivative
Here, we calculate the derivatives of the log-likelihood with respect to the parameters g and D. In terms of the rate function, we have ∂_a ln P(µ | g, D) ≃ −N ∂_a H(µ | g, D), where a denotes either g or D. The parameters appear only in the cross entropy. Above, we showed that the reference measure is ⟨δ(ẋ + U′(x) − η)⟩_η. Because x̃ appears at most quadratically in the exponent, the integral is solvable and we obtain a Gaussian density in ẋ + U′(x) with covariance C_η and normalization det(2πC_η)^{−1/2}.
Note that the normalization det(2πC_η)^{−1/2} does not depend on the potential U. Now we can take the derivatives of ln ⟨δ(ẋ + U′(x) − η)⟩_η, where we used ln det C = tr ln C. With this, we arrive at the derivative of the cross entropy, where the integral over the empirical measure gave rise to C₀ = ∫ Dx µ(x) (ẋ + U′(x))(ẋ + U′(x)) and we used the correlation function C_η as stated in the main text. The derivative vanishes for C₀ = C_η. Assuming stationarity, in Fourier domain this condition reads

S_{ẋ+U′(x)}(f) = g² S_{φ(x)}(f) + 2D,

where S_X(f) denotes the network-averaged power spectrum of the observable X.
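The identity ln det C = tr ln C used above is easy to verify numerically for a discretized correlation matrix; the following is our own toy check on a symmetric positive-definite matrix:

```python
import numpy as np

# Toy check (our own construction) of ln det C = tr ln C for a symmetric
# positive-definite matrix C, standing in for a discretized C_eta.
rng = np.random.default_rng(0)
B = rng.normal(size=(6, 6))
C = B @ B.T + 6 * np.eye(6)          # symmetric positive definite

sign, logdet = np.linalg.slogdet(C)  # ln det C, computed stably
eigvals = np.linalg.eigvalsh(C)
tr_ln_C = np.sum(np.log(eigvals))    # tr ln C via the spectrum
```

Both routes agree to machine precision, as they must for any positive-definite C.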

Model Comparison
Parameter estimation allows us to determine the statistical properties of the recurrent connectivity g and the external input D. However, this leaves the potential U and the transfer function φ unspecified. Here we determine U and φ using model comparison techniques [9].
We consider two options to obtain U and φ: comparing the mean squared error in Eq. (5) for the inferred parameters and comparing the likelihood of the inferred parameters. For the latter option, we can use the rate function from Eq. (3). Given two choices U_i, φ_i, i ∈ {1, 2}, with corresponding inferred parameters ĝ_i, D̂_i, the comparison is governed by H_i ≡ H(µ | U_i, φ_i, ĝ_i, D̂_i). The difference H₁ − H₂ equals the difference of the minimal cross entropies for the respective choices U_i, φ_i. Assuming an infinite observation time, this difference can be expressed as an integral that is straightforward to evaluate numerically (see below).
To illustrate the procedure, we consider a potential which is bistable for s > 1 [10] and determine s using the mean squared error and the cross entropy difference (see Fig. 1). Parameter estimation yields estimates ĝ and D̂ that depend on s (Fig. 1A,B). The mean squared error displays a clear minimum at the true value s = 1.5 (Fig. 1C), whereas the cross-entropy criterion attains its optimum at a value larger than s = 1.5 (Fig. 1D). The latter effect arises because the cross entropy is dominated by the parameter estimates; thus the mean squared error provides a more reliable criterion in this case.
Cross Entropy Difference

Here, we express the cross entropy difference in a form that can be evaluated numerically. Using the rate function and the Gaussian normalization det(2πC_η)^{−1/2}, the difference can be written in terms of traces. For stationary correlation functions over infinite time intervals, we can evaluate the traces as integrals over the power spectra. Accordingly, the cross entropy difference can be evaluated with integrals over the respective power spectra that can be obtained using the Fast Fourier Transform.

Activity Prediction
If the potential of the model is quadratic, U(x) ∝ x²/2, the measure µ̄ that minimizes the rate function corresponds to a Gaussian process. For Gaussian processes, it is possible to perform Bayes-optimal prediction based only on the correlation function [9,11]. Denoting the correlation function of the process as C_x (Appendix B4), the prediction is given by

x̂ = kᵀ K⁻¹ x,

with k_i = C_x(t̂, t_i), K_ij = C_x(t_i, t_j), and x_i = x(t_i). Here, t̂ denotes the time point of the prediction and {t_i} a set of time points where the activity is known. The predicted value x̂ itself is Gaussian distributed with variance

σ_x̂² = κ − kᵀ K⁻¹ k,

where κ = C_x(t̂, t̂). The variance σ_x̂² quantifies the uncertainty associated with the prediction x̂. We use the self-consistent autocorrelation function from Eq. (3) to predict the future activity of two arbitrary units using Eq. (7) and Eq. (8) (Fig. 2A,B). The network-averaged mean squared error is well predicted by Eq. (8), as shown in Fig. 2C. The timescale of the error is half of the timescale of the autocorrelation function (Appendix B5). We plot (C_x(0) − σ_x̂²)/C_x(0) against an exponential decay C exp(−2τ/τ_c), where C_x(τ)/C_x(0) ∼ exp(−τ/τ_c), and find a very good agreement (Fig. 2D). Since τ_c diverges for g ↘ 1 (cf. [6]), the timescale of the error diverges as well.
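The predictor above is standard Gaussian-process regression. The sketch below is our own self-contained illustration; it uses an exponential autocorrelation as a stand-in for the self-consistent C_x, and all names and parameter values are assumptions:

```python
import numpy as np

# Sketch (our own helper; names are assumptions) of the GP predictor
# xhat = k^T K^{-1} x and sigma^2 = kappa - k^T K^{-1} k, with an exponential
# autocorrelation C_x(t, t') = sx2 * exp(-|t - t'| / tc) standing in for the
# self-consistent one.
def predict(t_obs, x_obs, t_new, sx2=1.0, tc=1.0, jitter=1e-10):
    C = lambda a, b: sx2 * np.exp(-np.abs(a[:, None] - b[None, :]) / tc)
    K = C(t_obs, t_obs) + jitter * np.eye(len(t_obs))  # K_ij = C_x(t_i, t_j)
    k = C(t_obs, np.array([t_new]))[:, 0]              # k_i = C_x(t_new, t_i)
    xhat = k @ np.linalg.solve(K, x_obs)
    var = sx2 - k @ np.linalg.solve(K, k)              # kappa = C_x(t_new, t_new)
    return xhat, var

t_obs = np.linspace(0.0, 5.0, 6)
x_obs = np.sin(t_obs)
xhat, var = predict(t_obs, x_obs, t_new=5.0)
```

At an observed time point the predictive variance collapses to (numerically) zero, while far outside the training window it approaches κ = C_x(0), as expected.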

Self-Consistent Correlation Function
Here, we describe how the self-consistent correlation function can be obtained efficiently for quadratic single-unit potentials U(x) = x²/2. The first part is a brief recapitulation of the approach in [2,6]; the second part, specific to the error function, is novel to the best of our knowledge. For quadratic potentials, the most likely (self-consistent) measure corresponds to the Gaussian process ẋ = −x + η, where η is a zero-mean Gaussian process with self-consistent correlation function. Using the linearity of the dynamics of x, one obtains an ODE for its stationary autocorrelation function C_x(τ),

C̈_x(τ) = C_x(τ) − g² C_Φ(τ),

with initial conditions C_x(0) = σ_x² and Ċ_x(0) = −D [2,6]. Using Price's theorem, Eq. (9) can be cast into an equation of motion C̈_x = −∂_{C_x} V(C_x; σ_x²) in a potential V. Due to the implicit dependence of C_Φ on C_x and σ_x², this is not an initial value problem. To determine σ_x², we use energy conservation, ½Ċ_x² + V(C_x; σ_x²) = const. We restrict ourselves to solutions where C_x(τ → ∞) = 0 and Ċ_x(τ → ∞) = 0. With this, energy conservation evaluated at τ = 0 and τ → ∞ yields an equation for σ_x². With σ_x² determined, Eq. (9) becomes an initial value problem that is straightforward to solve numerically. Instead of solving Eq. (11) for given D to get σ_x², we can use it to answer the inverse question: given g and a desired activity level σ_x², how strong does the external noise D need to be? The answer directly follows from Eq. (11). We use Eq. (12) to uncover the multiple self-consistent solutions; they correspond to a non-monotonicity of D(σ_x²). For arbitrary transfer functions, we solve the integrals for C_Φ numerically using an appropriate Gaussian quadrature.
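The Gaussian-quadrature evaluation of C_Φ(τ) = ⟨φ(x(t)) φ(x(t+τ))⟩ for jointly Gaussian arguments can be sketched with Gauss-Hermite nodes in the probabilists' convention; the helper below is our own construction, not the authors' code:

```python
import numpy as np

# Our own sketch: C_Phi = <phi(x1) phi(x2)> for jointly Gaussian (x1, x2) with
# variance s2 and covariance c, evaluated by Gauss-Hermite quadrature via the
# conditional decomposition x1 = s z1, x2 = (c/s) z1 + sqrt(s2 - c^2/s2) z2.
def C_phi(phi, s2, c, n=40):
    z, w = np.polynomial.hermite_e.hermegauss(n)  # nodes/weights for exp(-z^2/2)
    w = w / np.sqrt(2.0 * np.pi)                  # normalize to a N(0,1) average
    s = np.sqrt(s2)
    x1 = s * z[:, None]
    x2 = (c / s) * z[:, None] + np.sqrt(max(s2 - c ** 2 / s2, 0.0)) * z[None, :]
    return np.einsum("i,j,ij->", w, w, phi(x1) * phi(x2))
```

For φ(x) = x the quadrature returns the covariance c itself, a convenient sanity check; for odd φ and c = 0 the result vanishes by symmetry.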
Error Function. For the transfer function φ(x) = erf(√π x/2), we can leverage an analytical expression for C_φ [12, Appendix]: C_φ(τ) = (2/π) arcsin[(π/2) C_x(τ) / (1 + (π/2)σ_x²)]. For convenience, we introduce the scaled correlation function y(τ) = (π/2) C_x(τ) / (1 + (π/2)σ_x²). Since y depends linearly on C_x, we get from Eq. (9) an equation of motion for y, with y(0) = y_0 ≡ (π/2)σ_x² / (1 + (π/2)σ_x²) and ẏ(0) = −(π/2)(1 − y_0)D, which again can be rewritten as ÿ = −∂_y V(y, y_0). Using Eq. (13), we get the explicit expression for the potential V(y, y_0) = −½y² + g²(1 − y_0)[y arcsin(y) + √(1 − y²) − 1]. We chose the offset of the potential such that V(0, y_0) = 0, which reduces Eq. (11) to ½[(π/2)(1 − y_0)D]² = −V(y_0, y_0). We solve Eq. (15) numerically using the Newton-Raphson method implemented in SciPy [13] and Eq. (14) using lsoda from the FORTRAN library odepack through the corresponding SciPy interface. From Eq. (14), we can determine the timescale of y or, equivalently, of C_x. Since y(τ → ∞) → 0, we linearize Eq. (14) for τ ≫ 0 to ÿ ≈ [1 − g²(1 − y_0)] y. From here, we can directly read off the timescale: τ_c = [1 − g²(1 − y_0)]^{−1/2}. We use Eq. (16) to determine the timescale of the prediction error (see below).
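The described workflow, Newton-Raphson for the energy-conservation condition and lsoda for the equation of motion, can be sketched as follows. This assumes the potential V(y, y_0) = −½y² + g²(1 − y_0)[y arcsin(y) + √(1 − y²) − 1] reconstructed above; g and D_target are arbitrary illustrative values, not those of the paper's figures.

```python
import numpy as np
from scipy.optimize import newton
from scipy.integrate import odeint

g = 0.5  # illustrative coupling

def V(y, y0):
    # assumed reconstructed potential; offset chosen such that V(0, y0) = 0
    return -0.5 * y**2 + g**2 * (1 - y0) * (y * np.arcsin(y) + np.sqrt(1 - y**2) - 1)

def D_of_y0(y0):
    # energy conservation: 0.5 * ((pi/2) * (1 - y0) * D)^2 = -V(y0, y0)
    return np.sqrt(-2.0 * V(y0, y0)) / ((np.pi / 2) * (1 - y0))

# inverse problem via Newton-Raphson: which y0 corresponds to a given noise level?
D_target = 0.2
y0 = newton(lambda y: D_of_y0(y) - D_target, 0.1)

# integrate the equation of motion y'' = -dV/dy with lsoda (wrapped by odeint)
def rhs(state, tau):
    y, v = state
    return [v, y - g**2 * (1 - y0) * np.arcsin(np.clip(y, -1.0, 1.0))]

tau = np.linspace(0.0, 8.0, 161)
sol = odeint(rhs, [y0, -(np.pi / 2) * (1 - y0) * D_target], tau)
```

The trajectory starts at y(0) = y_0 and relaxes monotonically toward the fixed point y = 0, as required for a decaying autocorrelation function.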

Timescale of Prediction Error
We here relate the timescale of the prediction error to the timescale of the autocorrelation function, C_x(τ)/C_x(0) ∼ exp(−τ/τ_c). The predicted variance in the continuous time limit is determined by the corresponding limit of Eq. (8), where T denotes the training interval. Writing t = T + τ and approximating C_x(T + τ, u) ≈ C_x(T, u)e^{−τ/τ_c}, we arrive at C_x(0) − σ²_{x*} ∝ e^{−2τ/τ_c}, where C_x(0) = σ_x². Thus, for large τ, the timescale of the prediction error is given by τ_c/2.
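The τ_c/2 error timescale can be checked numerically. The sketch below assumes an exponential autocorrelation C_x(τ) = exp(−|τ|/τ_c), an illustrative, Markovian special case for which the gap C_x(0) − σ²_{x*} equals exp(−2τ/τ_c) exactly when T is a training point.

```python
import numpy as np

tau_c = 1.0
C = lambda s, t: np.exp(-np.abs(s - t) / tau_c)   # assumed exponential C_x
T = 5.0
t_train = np.linspace(0.0, T, 101)                # training interval [0, T]
K = C(t_train[:, None], t_train[None, :])

def pred_var(tau):
    k = C(T + tau, t_train)                       # k_i = C_x(T + tau, t_i)
    return C(0.0, 0.0) - k @ np.linalg.solve(K, k)

taus = np.array([0.5, 1.0, 1.5])
gap = np.array([1.0 - pred_var(t) for t in taus]) # C_x(0) - sigma_*^2 per tau
```

The gap decays as exp(−2τ/τ_c), i.e. with timescale τ_c/2, matching the derivation above.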

Order Parameter Fluctuations
Here, we derive an expression for the fluctuations of the variance, valid for slow dynamics, τ_c ≫ 1. According to Eq. (16), this holds for g of order 1; in practice, we choose g not too close to 1, however, because of the periodic solutions occurring in finite-size systems in this case [6]. We start with the Legendre transform of the rate function of C, Eq. (4), which is the scaled cumulant generating functional W∞, where we redefined φᵀkφ → k in the argument of W∞ to simplify the notation. To determine the fluctuations, we need to calculate the second derivative of the scaled cumulant generating functional, W″(0).
Due to the saddle-point equations, only the explicit k-dependence contributes to the first derivative, which we thus get immediately. The second derivative then follows by differentiating once more; using the saddle-point equations again, we obtain the expression for W″(0).
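How the curvature W″(0) of a scaled cumulant generating function controls order-parameter fluctuations can be illustrated with a toy model. The example below is not the network model; it uses the empirical mean of i.i.d. exponential variables, where W(k) = −ln(1 − k) is known in closed form, so W″(0) = 1 and the variance of the order parameter is W″(0)/N.

```python
import numpy as np

# Toy model (i.i.d., not the network): for the empirical mean of N Exp(1)
# variables, the scaled cumulant generating function is W(k) = -ln(1 - k),
# hence W''(0) = 1 and Var[mean] = W''(0) / N.
rng = np.random.default_rng(1)
N, trials = 500, 5000
means = rng.exponential(1.0, size=(trials, N)).mean(axis=1)
var_empirical = means.var()
var_theory = 1.0 / N   # W''(0) / N
```

The same logic underlies the network calculation: Gaussian fluctuations of the order parameter with variance W″(0)/N around the mean-field value.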
In the main text, we show the fluctuations of the order parameter across time and realizations of the connectivity in Fig. 2A. To supplement this, Fig. 3 shows the order parameter fluctuations for two individual realizations of the connectivity (Fig. 3A,B) and averaged across ten realizations (Fig. 3C,D). A logarithmic y-axis reveals that the tails are Gaussian as well.

Coexisting Mean-Field Solutions
Here, we determine a regime where two mean-field solutions coexist. We restrict ourselves to quadratic potentials U(x) = ½x² and start from Eq. (12), D(σ_x²) = √(2[V(0, σ_x²) − V(σ_x², σ_x²)]), which determines the external noise necessary to reach a given activity level σ_x². Non-monotonicities of D(σ_x²) give rise to multiple solutions, since they indicate a case where the same external noise can lead to different activity levels.
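Scanning D(σ_x²) for non-monotonicity can be sketched as follows. This is a hedged illustration: it assumes the reconstructed potential V(C, σ²) = −C²/2 + g²∫₀^C C_Φ(c, σ²) dc, so that ½D² = σ⁴/2 − g²∫₀^{σ²} C_Φ(c, σ²) dc; C_Φ is evaluated by Gauss-Hermite quadrature and the potential integral by the trapezoidal rule.

```python
import numpy as np

def C_phi(phi, c, sigma2, n=60):
    # <phi(x1) phi(x2)> for bivariate Gaussian (variance sigma2, covariance c)
    u, wu = np.polynomial.hermite_e.hermegauss(n)
    w = wu / np.sqrt(2.0 * np.pi)
    s = np.sqrt(sigma2)
    r = np.sqrt(max(sigma2 - c**2 / sigma2, 0.0))
    inner = np.array([w @ phi((c / s) * ui + r * u) for ui in u])
    return w @ (phi(s * u) * inner)

def D_of_sigma2(phi, g, sigma2, m=201):
    # V(0, s2) - V(s2, s2) = s2^2/2 - g^2 * int_0^{s2} C_Phi(c; s2) dc
    c = np.linspace(0.0, sigma2, m)
    vals = np.array([C_phi(phi, ci, sigma2) for ci in c])
    dc = c[1] - c[0]
    integral = dc * (vals.sum() - 0.5 * (vals[0] + vals[-1]))  # trapezoid
    gap = 0.5 * sigma2**2 - g**2 * integral
    return np.sqrt(2.0 * max(gap, 0.0))
```

Evaluating D_of_sigma2 on a grid of σ² values and checking for a sign change in the finite differences reveals the non-monotonic regime; for a linear transfer function the closed form D = σ²√(1 − g²) serves as a sanity check.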
In the noiseless case, D = 0, the silent fixed point σ_x² = 0 is one of the two solutions. Using the stability criterion g²⟨φ′(x)²⟩ < 1 from [14], we get for the transfer function φ(x) = clip(tan(x), −1, 1) the limit ⟨φ′(x)²⟩ → φ′(0)² = 1 for σ_x → 0; hence the silent fixed point is stable for g < 1.
D. Rate Function (Multiple Populations)

Scaled Cumulant Generating Functional
Here, we derive the scaled cumulant generating functional and the saddle-point equations for networks with multiple populations. The steps are similar to the single-population case, hence we keep the presentation brief. Throughout, we use Greek indices for populations and Latin indices for individual neurons within a given population: x^α_i denotes the trajectory of neuron i of population α, x^α the trajectories of all neurons in population α, and x the trajectories of all neurons. The same convention applies to the connectivity: J^{αβ}_{ij} governs the connection from neuron j in population β to neuron i in population α, J^{αβ} the connections from all neurons in population β to all neurons in population α, and J all connections. Furthermore, we denote the size of an individual population by N_α and set N = ∑_α N_α.
The expectation ⟨⋅⟩_{x,J} of an arbitrary functional G(x) can again be written as a path integral, where the action S^α_0 now depends on the population. The average over realizations of the connectivity, J^{αβ}_{ij} i.i.d. ∼ N(0, N_β^{−1} g²_{αβ}), only affects the term −∑_{α,β} x̃^{αT} J^{αβ} φ(x^β). Due to the independence of the entries of J, the average factorizes. Next, we introduce the population-averaged auxiliary fields via Hubbard-Stratonovich transformations. As in the single-population case, the average over the connectivity and the subsequent Hubbard-Stratonovich transformation decouple the dynamics across units; afterwards, the units are only coupled through the global fields C^α and C̃^α. Now, we consider the empirical densities of the populations. The corresponding scaled cumulant generating functional W_N({ ○ }) introduces one source functional per μ_α; { ○ } denotes the collection of all of them. Using the above results and the abbreviation φ(x) ≡ φ, W_N can be written in terms of single-particle cumulant generating functionals Ω_α. Again, the factor N_α in front of the single-particle cumulant generating functionals Ω_α results from the factorization of the N_α integrals over x^α_i and x̃^α_i each; thus it is a hallmark of the decoupled dynamics. Note that W_N({ ○ }) is still coupled across populations, because each Ω_α depends on the set of all auxiliary fields, {C ○ }.
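The multi-population setup described above can be made concrete with a small simulation sketch. This is an illustration under stated assumptions, not the paper's code: two populations, J^{αβ}_{ij} ∼ N(0, g²_{αβ}/N_β), an illustrative tanh transfer function, and simple Euler integration of ẋ = −x + ∑_β J^{αβ} φ(x^β).

```python
import numpy as np

rng = np.random.default_rng(2)
N_pop = [200, 100]                       # population sizes N_alpha
g = np.array([[0.5, 0.3], [0.4, 0.5]])   # coupling strengths g_{alpha beta}
phi = np.tanh                            # illustrative transfer function

# build the full connectivity block by block: J^{ab}_{ij} ~ N(0, g_ab^2 / N_b)
blocks = [[g[a, b] / np.sqrt(N_pop[b]) * rng.standard_normal((N_pop[a], N_pop[b]))
           for b in range(2)] for a in range(2)]
J = np.block(blocks)

# Euler integration of x' = -x + J phi(x)
N = sum(N_pop)
x = rng.standard_normal(N)
dt = 0.01
for _ in range(2000):
    x = x + dt * (-x + J @ phi(x))
```

With the subcritical couplings chosen here, the activity relaxes toward the silent fixed point; stronger couplings would produce the chaotic regime analyzed by the mean-field theory.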
Next, we approximate the C^α and C̃^α integrals in a saddle-point approximation, where γ_α = N_α/N denotes the population fractions. The fields C^α_{ ○ } and C̃^α_{ ○ } are determined by the saddle-point equations. Here, the asymmetry in the saddle-point equations reflects the fact that Ω_α depends on a single C̃ but on all {C ○ }.

Rate Function
Here, we derive the rate function from the scaled cumulant generating functional for the multi-population case. We obtain the rate function via the Legendre transformation, with the sources implicitly defined as functions of {µ ○ } by the stationarity condition. Due to the saddle-point equations, Eq. (21) and Eq. (22), the derivative of the cumulant generating functional in Eq. (24) simplifies: the derivative only acts on the source that is explicit in Ω_α and not on the implicit dependencies through {C ○ { ○ }} and C̃^α{ ○ }. Thus, Eq. (24) yields the rate function, which contains the Gaussian normalization factor det(2πC_η).
With this, we can take the derivatives of ln⟨δ(τ_α ẋ + U′_α(x) − η_α)⟩_{η_α}. The derivative vanishes for C^α_0 = C^α_η. Assuming stationarity, a Fourier transformation yields the expression stated in the main text.

Degeneracy of Inference Equation
Here, we show that parameter inference using Eq. (29) can be degenerate because different models are equally plausible.

Figure 1. Model comparison for φ(x) = erf(√π x/2) and U(x) = ½x² − s ln cosh(x). A,B Maximum likelihood estimates of ĝ and D̂ for given choices of s. True values of g and D indicated by gray lines; estimates at the true value s = 1.5 indicated by gray symbols. C Mean squared error between left- and right-hand side of Eq. (5) for given s. D Cross-entropy difference between the model with s = 0 and with given s. Further parameters as in Fig. 1 in the main text.

Figure 3. Order parameter fluctuations. A,B Temporal order parameter statistics, each for a single realization of the random connectivity. C,D Temporal order parameter statistics across ten simulations with linear and logarithmic y-axis (panel D is identical to Fig. 2A in the main text).