Transient chaotic dimensionality expansion by recurrent networks

Neurons in the brain communicate with spikes, which are discrete events in time and value. Functional network models often employ rate units that are continuously coupled by analog signals. Is there a qualitative difference implied by these two forms of signaling? We develop a unified mean-field theory for large random networks to show that first- and second-order statistics in rate and binary networks are in fact identical if rate neurons receive the right amount of noise. Their response to presented stimuli, however, can be radically different. We quantify these differences by studying how nearby state trajectories evolve over time, asking to what extent the dynamics is chaotic. Chaos in the two models is found to be qualitatively different. In binary networks we find a network-size-dependent transition to chaos and a chaotic submanifold whose dimensionality expands stereotypically with time, while rate networks with matched statistics are nonchaotic. Dimensionality expansion in chaotic binary networks aids classification in reservoir computing and optimal performance is reached within about a single activation per neuron; a fast mechanism for computation that we demonstrate also in spiking networks. A generalization of this mechanism extends to rate networks in their respective chaotic regimes.

Neurons in the brain communicate with spikes, which are discrete events in time and value. Functional network models often employ rate units that are continuously coupled by analog signals. Is there a qualitative difference implied by these two forms of signaling? We develop a unified mean-field theory for large random networks to show that first-and second-order statistics in rate and binary networks are in fact identical if rate neurons receive the right amount of noise. Their response to presented stimuli, however, can be radically different. We quantify these differences by studying how nearby state trajectories evolve over time, asking to what extent the dynamics is chaotic. Chaos in the two models is found to be qualitatively different. In binary networks we find a networksize-dependent transition to chaos and a chaotic submanifold whose dimensionality expands stereotypically with time, while rate networks with matched statistics are nonchaotic. Dimensionality expansion in chaotic binary networks aids classification in reservoir computing and optimal performance is reached within about a single activation per neuron; a fast mechanism for computation that we demonstrate also in spiking networks. A generalization of this mechanism extends to rate networks in their respective chaotic regimes.

I. INTRODUCTION
While biological neurons communicate by spikes, which are discrete all-or-nothing events, artificial neural networks overwhelmingly use continuous-valued units, commonly referred to as "rate neurons". The ramifications of this fundamental distinction between discrete and continuous signaling have been debated concerning learning algorithms [1,2], energy efficiency [3], and information coding [4][5][6][7][8][9][10][11]. Here we study how differences in signaling impact network dynamics underlying classification performance in a reservoir setting [12][13][14]: Input stimuli influence the dynamical state of a randomly connected network which then acts as the representation, from which the desired output is extracted by a * C.K. and T.K. contributed equally to this work. The divergence of trajectories with different initial conditions in a chaotic network (a) can be computed by a replica calculation (b), leading to a curve describing the average temporal evolution of the distance (c). Sketch of linearly nonseparable data (d) supplied as initial condition to the dynamics (a). Chaotic dynamics expands the representation into spaces with increasing dimension (c). Interclass separation (green) initially grows quicker than variability within class (dark orange) up to point topt of optimal separability (e). Subsequently, chaotic mixing causes separability to decline, ultimately leading to a completely mixed state (f ).
linear readout. For a classification task, the representation thus needs to allow a linear separation of classes. Dynamics promotes this separability by nonlinearly embedding the input into its high-dimensional state space. This embedding is analogous to the kernel trick used in support vector machines [15]: A generic mapping into a high-dimensional nonlinear feature space tends to improve separability, because in N dimensions dichotomies of 2N random points can be linearly separated with high probability [16]. Presenting input stimuli as initial conditions to the dynamics of a network, the nonlinear transformation of the representation is determined by the subsequent temporal evolution (Figure 1a,d). For example, consider stimuli belonging to different classes, each given by a centroid and local noise (Figure 1d). Two properties are needed for classification: Differences between stimulus classes must be maintained or amplified to foster discrimination (Figure 1c, green). Similar stimuli, however, should lead to similar representations to support generalization; the distance between trajectories of data points belonging to the same class should have limited growth (Figure 1c, dark orange). This view exposes the tight link to chaos, the sensitivity of the dynamics to initial conditions. For rate networks close to the edge of chaos, separation and generalization are well balanced, leading generally to optimal performance [17][18][19][20]. While the theory of deterministic [21,22] and stochastic rate networks [23] is well understood and predicts a clearly defined transition to chaos, its link to chaos in binary networks [24,25] remains elusive. Binary networks are the simplest class of models with discrete signaling between neurons.
Here we develop a systematic and model-independent approach to derive mean-field theories for large random networks (Section II A). The formalism finds the same set of mean-field equations simultaneously describing binary and rate networks. It shows that a stochastic rate network with properly chosen noise has the same first-and second-order activity statistics as a binary network. The approach allows for replica calculations, the study of ensembles of pairs of networks with identical connectivity in each realization, but different stimuli, as required to assess chaos and computation (Figure 1b). For stochastic dynamics one compares two systems with slightly different initial conditions but identical realization of stochasticity [26]. The replica theory exposes that chaos and signal processing in statistically matched rate and binary networks are qualitatively different: Binary networks show a transition to chaos that depends on network size (Section II B and Section II C). In the chaotic regime, distances between states in binary networks increase transiently in a stereotypical manner, confined to a chaotic submanifold whose dimension depends on the coupling strength and is a fraction of the entire state space (d max in Figure 1c, Section II D). Rate networks with statistically matched activity, in contrast, are nonchaotic (Section II E). Giving up on the statistical match, rate networks with weak noise in their corresponding chaotic regime show a qualitatively different divergence of state trajectories that sensitively depends on the coupling strength (Section II E). Given a distribution of input data whose within-class variability is smaller than the average between-class distances (Figure 1d, dark orange and green), the dimensionality expansion of presented stimuli by chaotic binary networks leads to a separation that is optimal for classification after t opt τ = 2 ln 2 ≃ 1.4 activations per neuron (Figure 1c,e, Section II F). Sub-sequently, the chaotic mixing leads to a gradual decline of separability (Figure 1f ). Despite the qualitative differences between rate and binary networks, both mechanisms of chaos can be employed to increase classification performance deep in the chaotic regime in a wide range of networks models, including long-short-term-memory (LSTM) and spiking networks (Section II G).

II. RESULTS
A. Model-independent field theory of neuronal networks Here we derive a framework to compute the statistics of neuronal networks in a manner that is largely independent of the employed neuron model. Such a framework is needed to systematically compare different model classes and to assess the generality of results. It must be flexible enough to enable the use of methods such as disorder averages and replica calculations; techniques that are required to systematically derive mean-field equations that allow us to compare networks on a statistical level and to assess how distances between different dynamical states evolve over time and classification of input signals can be achieved ( Figure 1).
We consider a network of N neurons with connectivity matrix J , where individual entries are independently and identically distributed as J ij assumptions on the statistics can easily be relaxed as long as higherorder cumulants are suppressed by the large network size. The N neurons have inputs h = (h 1 (t), ..., h N (t)) and outputs x = (x 1 (t), ..., x N (t)). The input-to-output relation of a neuron is often stochastic, so that a conditional probability ρ[x h] of the output given the input is the most general description of the neural dynamics. The joint statistics of input and output is then amounting to a separation of the neurons' input-output functional ρ[x h] ∶= ∏ i ρ[x i h i ] and the input statistics ρ[h]. Here we denote functionals by angular brackets and vectors of neuron indices by bold-font symbols. Any observable O of a neuronal network can be expressed as a functional of the inputs h, which have the advantage of being closer to a Gaussian distribution than x, due to the convergence of many outputs on one input. Because we do not know the disorder realization (e.g. of the connectivity) in detail, but at most its statistics, we can access only quenched disorder-averaged quantities like The description of the network dynamics is selfconsistently closed by using a delta distribution ρ[h] = δ[h − Jx] to enforce that the input to each neuron is composed of a sum of outputs weighted by the synaptic connectivity J . The idea of splitting the system into a neuron and a coupling model is illustrated in Figure 2. Note that (1) (1) can be understood as a spiral moving forward in time (see also Appendix 1).
Q (t, s) = g 2 ⟨x(t)x(s)⟩ Ω(R,Q) , where the average ⟨. . .⟩ Ω(R,Q) is defined in (37) of Appendix 1 as an average over x ∼ ⟨ρ[x h]⟩ h and h is a Gaussian process h ∼ N (R, Q). We may think of x(t) as the representative neuron of a homogeneous population, because all neurons with statistically identical connectivity and properties are identical after the disorder average. On the intuitive level, DMFT corresponds to modeling the inputs of all neurons as independent Gaussian processes h ∼ N (R, Q). Thus, one obtains the DMFT using only the output-toinput relation given by the disordered connectivity while staying agnostic of the neuron model. To instantiate the approximation for the binary model studied here, we must now provide knowledge about the input-to-output relation ρ[x h].
Binary neuron model We consider the binary neuron model, or kinetic Ising model, with state x i ∈ {−1, 1} [28,29]. The states of all neurons are updated asynchronously by independent Poisson processes with rate τ −1 and an activation probability function T p ∶ R → Figure 2. Summary of model-independent field theory. Conceptual idea to split network into neuronal dynamics, described by conditional probability ρ[xi hi] of neuronal output xi given its input hi, and the mapping of output to input by connectivity Jij . The connectivity average affects only the output-to-input mapping and can thus be performed without specifying the neuron model. The formal saddle-point approximation in auxiliary fields R(t) ≃ R(t) and Q(t, s) ≃ Q(t, s), eqs. (2) and (3), amounts to a Gaussian approximation of the input hi ∼ N (R, Q).
[0, 1]. It is clear that the form of ρ[x h] depends on the realization of the update times, which constitute a source of noise, or temporal stochasticity. The update sequence may be thought of as another type of disorder in the sense that it breaks the homogeneity of the time axis by selecting a set of time points where the neuronal state can change. As with the random connectivity, one may study the behavior of the system averaged over this disorder. In this case, the probability of finding a neuron active at time t is given by the probability T p (h i (t ′ )) to be activated at any prior update time point t ′ and the survivor function [30], the probability that no further update happened since. While this knowledge is far from knowing the complete probability functional ρ[x h] across its infinite time dimension, the information about this single time slice is sufficient to plug into (2) and obtain, after taking a time derivative, the mean-field equation where Details are provided in Appendix 2. In (5) only equal-time autocorrelations Q(t, t) appear, because the dynamics is a Markov process; its evolution at time t depends only on the statistics at this very time point, not on the prior history. Closing the equation is thus simple for binary neurons, because, by x i ∈ {−1, 1}, their autocorrelation is always 1, so that Q (t, t) = g 2 N ∑ i ⟨x i (t) x i (t)⟩ = g 2 when cross-correlations are negligible (see Appendix 1).
To compute Q(t, t + ∆t) for binary neurons, we need more information about ρ[x h], namely the joint probability distribution over two time slices for a neuron: To construct ρ [x (t) x (s) , h] for binary neurons, the basic idea is to iterate the 2×2 states a neuron can assume at the points in time s and t and consider all possible evolutions that match the respective initial and final condition. From such a consideration, we derive Q(∆t) = Q(t, t+∆t) for stationary dynamics in Appendix 3 by again taking a time derivative of the saddle-point equation (3), yielding This equation is the analogon of the integral equation (5.17) of van Vreeswijk and Sompolinsky [27]. The advantage of the form (8) compared to the classical result is, as detailed in Appendix 3, that by differentiating once more with respect to ∆t and then using Price's theorem [31], it can be cast into a Newtonian form where T is a primitive of T, which is ∂ h T (h) = T(h), and N R,Q(0),Q is the bivariate Gaussian with stationary mean R and covariance matrix Q(0) Q Q Q(0) . We exploit this result in Section II E to construct rate models with exactly the same DMFT solution as a binary network.
B. Binary networks are always chaotic in the thermodynamic limit In the setting of reservoir computing (Figure 1), a particularly important measure for the classification performance of a network is how the distance between two different dynamical states, each caused by one stimulus, evolves over time. Tracking the evolution of initially small differences between the states amounts to the characterization of chaos [17,18,21,23,32,33]. We assess chaos by studying the time evolution of two systems with infinitesimally different initial conditions but identical connectivity and identical realization of stochasticity, thus the same sequences of update time points. Technically, this approach amounts to a replica calculation, where one studies the network-averaged correlation between the states of the two systems over time, an approach pioneered by Derrida and Pomeau [34]. Here we do not use the classical annealed approximation of this original work, where the connectivity is redrawn in every time step, but compute the full quenched averages, where the connectivity is constant in time. The calculation leads to a dynamic mean-field theory for the correlation between replicas.
In Appendix 4, by an approach analogous to the derivation of the ordinary differential equation (ODE) for the autocorrelation (10), we obtain the evolution of the cross-replica equal-time correlation Q (12) (t) in the binary network as Here (h (1) , h (2) ) ∼ N (R, Q) is a measure of a pair of Gaussian processes with means ⟨h (α) ⟩ = R (α) = ⟨T (h)⟩ h∼N R (α) ,g 2 , the stationary solution of (5) and co- Since the two replicas are nearly perfectly correlated in the beginning, we know that the correlation between a neuron and its "copy" in the other replica is given by the autocorrelation at first, motivating the ansatz As shown in Appendix 4, an expansion for small ǫ leads to the approximate equation governing the evolution of which generalizes the result of van Vreeswijk and Sompolinsky [27] to arbitrary activation functions. As was their conclusion for neurons with hard threshold, we see from (12) that for any activation function with average positive slope and independent of the parameters, the positive term ∝ √ ǫ is always larger than the negative linear term for small ǫ; so an initial deviation between the replicas will grow, indicating chaotic dynamics. Since the calculation becomes exact in the thermodynamic limit, the conclusion is that infinitely large binary networks are always chaotic, with formally infinite maximum Lyapunov exponent since the slope of the right-hand side of (12) at ǫ = 0 + is infinite, leading to an initial growth of ǫ that is faster than exponential. More specifically, ǫ(t) ∼ t 2 for ǫ ≪ 1, meaning that ǫ(0) = 0+ grows to a finite value in finite time, as opposed to an exponential function. See Appendix 5 for additional details. Since the slope of the activation function appears only averaged over the input distribution, there is no qualitative difference between different activation functions. In particular, going from a stochastic activation function to the deterministic Heaviside limit changes only the second term in (12) by a finite factor and thus does not qualitatively alter the chaotic behavior. This result can also be understood by noting that for the stochastic activation function, the function value at each update is compared to a random number to decide the activity state. The comparison is just like using a Heaviside function but with randomly drawn threshold at each update.

C. Transition to chaos in finite-size binary networks
In contrast to the theoretical prediction, simulations of binary networks in fact show parameter regimes with regular dynamics (Figure 4). Since the theory is only exact in the limit of infinite network size, this behaviour suggests a finite size effect. But the result of the replica calculation (12) does not rely on carrying out the N → ∞ limit. Rather it is expected to be a good approximation for finite, yet large networks N ≫ 1. How can the theory be reconciled with the simulation?
First, while the square-root term in (12) is always larger for sufficiently small ǫ, there also exists a point ∂ t ǫ * ! = 0 where this relationship reverses and the linear term starts to dominate: the point where the right-hand side of (12) vanishes, √ This point corresponds to a stable average distance between (partly) decorrelated trajectories, as illustrated in Figure 3. Second, in a finite network of N binary neurons, an infinitesimal perturbation cannot be realized, since the smallest possible perturbation is to flip a single spin at index i flip . A single flip implies for the minimally perturbed cross-replica correlation (49) so that ǫ min, chaos ǫ * ǫ min, regular ǫ terms of rhs Figure 3. Fixed-point average distance between replicas implies chaotic subspace. The square-root term ∝ √ ǫ (dark orange curve) and the linear term ∝ ǫ in (12) intersect and produce a fixed point ǫ * for the covariance Q(0) − ǫ between the two replicas. The resulting average Hamming distance d * between states in the two copies of the system is given by ǫ * and (16) as d * = N ǫ * g 2 . Depending on whether ǫmin, the minimum decorrelation due to a single flipped spin, is smaller or larger than ǫ * , the replicas will either decorrelate, or they will converge and forget the perturbation.
Therefore, if ǫ min > ǫ * the replicas will tend toward more correlation. But as the only possible step below ǫ min is having zero different spins and thus perfect correlation, the initial difference should tend to be completely forgotten, resulting in regular dynamics. On the other hand, if ǫ min < ǫ * an increase of the initial difference is possible.
Thus the chaos transition criterion in the finite binary network is ǫ min ! ≤ ǫ * resulting in Because of the scaling with √ N , it is clear that networks with thousands or even only hundreds of neurons are only nonchaotic if the connectivity is very weak, g N − 1 2 , or the dynamics is saturated (which gives a small ⟨T ′ (h)⟩ h ). Also, N → ∞ clearly recovers the limit of strictly chaotic dynamics. For the special case of a Heaviside activation function and vanishing mean connectivityḡ = 0, the network is always chaotic, since ⟨H ′ (h)⟩ h∼N (0,g 2 ) = 2 (πg 2 ) results in π 2 ≤ √ N , which is certainly true for typical network sizes.
The predicted transition and the residual correlation ǫ * fit those observed in simulations quite well ( Figure 4). The dependence of the transition on the positive mean connectivityḡ in the upper panels arises because the network settles in a state with nonzero mean activity that depends onḡ; it selects one of the two degenerate states in this bistable, "ferromagnetic", regime. The symmetry with respect to a global sign flip of the activity is spontaneously broken. In this state neurons show a very small average slope ⟨T ′ (H)⟩ H , thus shifting the point of transition to larger g with increasingḡ. The predicted residual correlation is independent of N ( Figure 4c, compare (13)), while the chaos transition depends on N (compare (15)).
We obtain the same criterion (15) through a less general, but more intuitive perspective by analyzing the probability that, given a single-spin difference, the difference in inputs is such that during the next updates, another neuron will also be updated to a "wrong" state (see Appendix 7). This view provides an expression for the average rate of decorrelation caused by an initial single spin flip. Requiring this rate to be unity, we obtain the same chaos transition criterion as (15). The approach is inspired from and very similar to calculating the divergence rate of flux tubes in spiking networks [35]. Such flux tubes are stable local environments of a phase-space trajectory, while the network is globally unstable. Thus, the phase space can be partitioned into tubes which diverge from each other, while perturbations within a tube decay. Indeed, the binary network has relatively trivial flux tubes in the input phase space given by those regions that result in the same updated state.
Note that the chaos transition shown in Figure 4d in simulations happens at slightly larger slopes T ′ (0) than predicted by (15). Considering the cascade of spin flips evoked by the initial perturbation provides an explanation: If the average proliferation rate of spin flips per time constant is only slightly above one, the cascade triggered by a single flipped spin still has a large probability of dying out.

D. Dynamics in binary networks is governed by a chaotic submanifold
The chaos in binary networks found in Section II B causes nearby trajectories to diverge at first. Because of the fixed point value ǫ * of the residual correlation found in Section II C, however, the network states of the replicas do not decorrelate completely. Instead, any pair of trajectories has an average maximal distance determined by ǫ * . This limited distance is a result of the two trajectories evolving by the same network connectivity and update sequence. Also trajectories that are very far apart will converge to this residual correlation. The fixed-point distance ǫ * is thus a representative of the average distance between any two trajectories in the long-time limit. The corresponding Hamming distance H (12) = 1 4 x (1) −x (2) 2 , that is, the number of different spins between a pair of binary states, is given by where we use Q (12) = g 2 N N − 2H (12) and the prefactor of H (12) arises because every flipped spin causes a decrease by 2 (from +1 to −1). Even though the H (12) spin flips distinguishing two trajectories can in principle be distributed across any of the N neurons, the subspace spanned by the set of possible trajectories has an approximate dimensionality of This relation can be understood by considering two independently drawn binary random vectors of dimension d that have, on average, the distance H (12) = d 2, because the average distance between any pair of spins that take the values Therefore, in the following, we quantify dimensionality via the Hamming distance H (12) using (17). Thus, if ǫ min > ǫ * , then H (12) (∞) < 1 and the set of long-term trajectories contains only a single trajectory, thus constituting a limit cycle (although the return time is astronomically large [36]). Irrespective of the initial state, the network is attracted to a stereotypical trajectory; the dynamics is regular. This situation arises for very weak coupling.
If ǫ min < ǫ * , then H (12) (∞) > 1 and there are many trajectories that constitute the attractive subspace. The evolution within the space is chaotic, because for any pair of states with an initial distance ǫ < ǫ * the distance increases; thus small differences are amplified. A set of trajectories that initially spans a low-dimensional subspace is thus expanded into a higher-dimensional space. For long times, however, any two states differ in only typically d(∞) 2 of their neurons. This limiting dimensionality grows proportional to N g 2 as seen by inserting (13) into (16) The time evolution when starting with a set of trajectories with dimensionality d(0) is given by obtained by integrating (12) (see Appendix 5), and shown in Figure 6a. This explicit solution shows that the expansion happens very quickly on a timescale of 2τ , where τ is the average time to have one update per neuron, and then converges to the residual value d * (18) for long times. This exclusive dependence on τ can intuitively be understood from the right-hand side of (15), which can be interpreted as the average number of flips n spawns caused by an initial spin flip within one time constant (as obtained in Appendix 7). Hence, using (13), one has  Figure 4. Chaos transition and residual correlation in theory (a,c) and simulations (b,d). a Theoretical prediction of chaos transition (green line, eq. (15)) and residual correlation coefficient between replicas (dark orange shading, c * 12 = 1 − Q(0) −1 ǫ * and (13)) for varying meanḡ and variance g 2 of the connectivity. Other parameters are N = 5000, T = tanh, τ = 10 ms. b As in (a) but each pixel is colored dark orange (chaotic) or green (stable) according to a network simulation. Two identical networks are evolved with identical random numbers, only one being perturbed by flipping four spins, and after Tsim = 2500 ms the correlation between the state vectors is computed. The few scattered green dots in the chaotic regime are algorithmic artifacts where the perturbation is unsuccessful, see Appendix 11 for the perturbation method. c Theoretical prediction for varying network size N and slope T ′ (0) of the activation function T (h) = tanh(T ′ (0)h). Other parameters arē g = 0, g 2 = 0.01 and τ = 10 ms. d As in (c) but from network simulations, with procedure as described for (b). so that after two time constants have passed, the residual correlation would be reached if the functional form of initial decorrelation would be extrapolated to later times, neglecting saturating terms (see Figure 6a and Appendix 5). But because the residual correlation limits the spread of the cascade of flips, in a similar way as the population size limits the growth of an epidemic [37], the growth slows down and asymptotically approaches the residual correlation. Having quantified how binary networks with discrete signaling separate different states, as required to understand classification in reservoir computing (Figure 1), we now turn to the well-established alternative of units with continuous-valued activity and signaling, commonly referred to as "rate models" and typically employed in ar-tificial neuronal networks. Concretely, we consider the coupled set of stochastic differential equations [21,23] with the activation function T ∶ R → [−1, 1] given by (6), timescale τ and a white noise process ξ with ⟨ξ i (t)ξ j (s)⟩ = σ 2 ξ δ(t − s)δ ij . Chaos in such networks has been intensely studied [21,23,33].
We show in Appendix 8 that the model-independent field theory applied to this stochastic rate model yields the same set of self-consistency equations for the first-(5) and second-order statistics (9) as the binary model; also the conditions on Q ∞ ∶= lim τ →∞ Q (τ ) agree. In contrast to the binary network, however, where the initial value Q(0) = g 2 is known, (9) must be solved with an initial condition for the slopeQ(0+). This slope is determined by the variance σ 2 ξ of the noise in (21). Demanding identical mean-field solutions for the two neuron types, the variance of the noise follows as (see Appendix 8)  Figure 5. Matched second order statistics in binary and rate networks. a Autocorrelation functions in simulations of binary (dark orange dots) and rate networks (blue stars). Theoretical curve (black) given by the solution of (10). Noise amplitude of the rate network chosen by (22) to obtain matched statistics. Other parameters: N = 5000, g = 1.5, g = 0, τ = 1 ms and T(h) = tanh(h − Θ) with Θ = 1.173 such that ⟨x⟩ N (R,Q) ≈ −0.5 according to the stationary solution of (5) (see Appendix 11 for details). We simulate a single realization of the binary network for 5000 ms and average over five realizations of the rate network running for 1000 ms each. We average over neurons in both cases.
Equation (22) tells us that, given a pair of equivalent activation functions (6), moments of connectivityḡ, g 2 , and timescale τ , asynchronously updated binary networks are statistically equivalent in DMFT approximation to rate networks with appropriately chosen Gaussian white noise input. This result is confirmed in simulations by comparing the autocorrelation functions averaged across many neurons in Figure 5. The good agreement between the autocorrelation that is averaged over all neurons in a network with a single random realization of the coupling matrix and the theoretical curves, which describe ensembles of networks averaged over many realizations of the random couplings, moreover shows that these quantities are self-averaging.

Condition for chaos in rate networks
Having established their equivalence on the level of statistics, we now compare the chaotic evolution of binary and rate networks. As its binary counterpart, the rate neuron model can be studied in a replica calculation in dynamical mean-field approximation, which yields the equation of the cross-replica time-lagged covariance of the form [23,33] ( with f T (Q 0 , Q (12) ) = ⟨T(x 1 )T(x 2 )⟩ and the average is The approximation for small differences Q (12) = Q 0 − ǫ, to linear order in ǫ, is which is solved by where λ max (g) is the largest Lyapunov exponent that follows from an eigenvalue problem, see ( No chaos in rate networks with matched statistics Applying criterion (26) to a network of rate neurons with the noise matched to its binary counterpart via (22), we obtain Q 0 = g 2 by construction. Using that ⟨T (h)T (h)⟩ ≤ 1 because of T ≤ 1, we observe that the condition (26) cannot be fulfilled. The dynamics is therefore always in the regular regime because the frozen noise of amplitude given by (22) is so large that it drives the dynamics and suppresses chaos. Only asymptotically the chaos transition is approached for an infinite slope of the activation function, T ′ → ∞, or equivalently g 2 → ∞.
Everything else being identical, the only difference between the two models is the type of signals exchanged between units, either being discrete or continuous. This demonstrates that chaos in binary networks is intrinsically caused by the discrete signaling. Formally, the difference between the two forms of signaling here shows up in the effective noise ξ: In the rate network, the realization of this noise is identical across the two replicas, because it represents the random realizations of the discrete variables of the binary network whose statistics we want to match. In the binary network, this noise itself changes, because it is intrinsically generated by the discrete switching dynamics, so that the realization is not external and frozen, but depends on the microscopic state.

Qualitative differences of chaos between rate and binary networks
In the following, we give up on matching the statistics between rate and binary networks to discuss the qualitative differences of the respective chaotic dynamics.

Residual correlation
The first qualitative difference concerns the residual correlation of the replicas. Since there is no term ∝ √ ǫ in (24), there is no residual correlation for small ǫ. Furthermore, equation (23) is also valid for small Q (12) and shows that the completely decorrelated state Q (12) = 0 is always a fixed point: the expectation value factorizes, and for any point symmetric T the right-hand side vanishes. In the thermodynamic limit, the residual correlation in rate networks is thus zero for any g > 1. A network that has been infinitesimally perturbed eventually has a state that is completely uncorrelated to the unperturbed system.
Trajectories in rate networks of finite size, in fact, show very small residual correlation closely beyond the edge of chaos g 2 ≳ 1. However, as already noted by Sompolinsky et al. [21], the transition at g = 1 is not completely sharp in finite-size networks [see also 38]. For larger networks, however, the residual correlation approaches zero; this is in contrast to binary networks, which in otherwise identical settings have a finite residual correlation (18) even in the large-N limit.
This qualitative difference is shown in Figure 6: In binary networks the decorrelation between the original and the perturbed system d(t) N = 1 − Q (12) (t, t) Q 0 in the long-time limit saturates below unity, on a level that depends on the coupling g by Eq. (18) and is bounded by 8 π 2 ≃ 0.81 in the limit g → ∞. The quantity d(t) N can also be interpreted as the relative dimensionality of the explored space. In rate networks with otherwise identical parameters, the relative decorrelation reaches unity independent of g. The decorrelation in rate networks cannot be interpreted in terms of dimensionality, however. Indeed, a recent work demonstrates a structured chaotic attractor in such networks [39].

Transient of decorrelation
The second qualitative difference concerns the transient of decorrelation. The solution (19) shows that the characteristic timescale of decorrelation is 2τ , where τ is the average interval between two state changes of a neuron, exposing that the microscopic state drives the chaotic evolution. Decorrelation slows down only mildly for weaker coupling g, as shown in Figure 6a. Moreover, it has a finite slope shortly after the infinitesimal perturbation of the system, reflecting the infinite Lyapunov exponent.
In rate networks, the maximal Lyapunov exponent λ max (g) is finite and depends continuously on the coupling g. Decorrelation therefore starts with a vanishing speed for infinitesimal perturbations, well described by the exponential behavior (25), as shown in Figure 6b. The time to reach a given level of decorrelation, moreover, strongly depends on the coupling strength g, corresponding to a critical slowing-down at the transition to Figure 6. Discrete chaos versus continuous chaos. a Evolution of the decorrelation between replicas in the binary network (19) with T = tanh. Increasing coupling strength from dark to light gray, g = {0.5, 1, 2, 3, 5}. Dark orange curves: limit g → ∞ or Heaviside T (dashed); quadratic solution for nonsaturated growth derived in Appendix 5 (dotted). b Evolution of decorrelation between replicas in the rate network (21) obtained by numerical solution of (24) (details in Appendix 6). Color code as in a, but using parameters g = {1.1, 1.5, 2.5, 5.10} and initial decorrelation of 2% instead of 0%.
chaos, which does not occur in binary networks.

F. Computation by transient chaotic dimensionality expansion
We now return to the question how the separation of trajectories by the chaotic dynamics of binary networks affects computation in a setting of reservoir computing ( Figure 1). We investigate the network performance in a pattern classification task: We consider P fixed patterns, numbered by the index 1 ≤ α ≤ P , each given by a randomly drawn binary vector of length L. Noisy realizations of a pattern are then created by adding Gaussian independent noise of variance σ 2 to each entry of the original pattern, creating P classes of noisy pattern realizations. The network is prepared at t = 0 in a fixed initial state consistent with its stationary statistics. A noisy pattern is presented to the network as the initial state of a (fixed) subset of L of the N neurons. The corresponding network state is denoted as per pattern class α ′ by linear regression to provide the output S α ′ (t) = 1 if the α ′ -th pattern has been presented (α = α ′ ) and 0 else (α ≠ α ′ , see Appendix 10 a for details). Thus we have P readouts, one for detecting each of the presented patterns (one-hot encoding). Classification is performed by selecting the strongest readout signal. Additional noise sources are present at the readout and classification to ensure robustness. The setup, training and following theory are detailed in Appendix 10.
Clearly, the set of possible trajectories resulting from the different initial-state preparations has dimensionality d s (0) = L at t = 0. The linear separability of pattern classes is thus initially low, if L ≪ P ≪ N . From Section II D we know that the chaotic dynamics will quickly increase the dimensionality of the state space that en- b Evolution of signal and noise subspace dimensionality. Dimensionality ds(t) given by (17) explored by the network across different patterns (green; solid curve: using averaged simulated distances across all pairs of patterns; dashed curve: theory (19)). Dimensionality dn(t) explored across different noisy realizations of a pattern (dark orange; solid curve: using averaged simulated distances across all pairs of 20 realizations per pattern; dashed curve: theory (19)). Noisy realizations of patterns have Gaussian noise with standard deviation σ = 0.3 added to each of the L entries of the initial pattern state. Difference between signal and noise dimensionality (blue, theory is dashed). c Linear readout S α ′ (t) = w α ′ (t) T (xα(t) + ξpre) + ξpost trained for each time point t to detect stimulus identity by minimizing the quadratic error ∑ α (S α ′ − δ αα ′ ) 2 for the correct stimulus α = α ′ (blue) and nonmatching stimuli α ≠ α ′ (dark orange). Error bars show the variability across patterns and noise realizations, excluding ξpost. Theoretical prediction (28) (dashed blue). Readout by the theoretical weight vector (75) using the approximation (78) (light blue). Inset: classification accuracy of the initial input stimulus based on choosing the readout with largest signal for trained readouts (blue) and approximate readouts (light blue), including ξpost. The approximate readout vectors yield a higher average signal, but also the variance is higher (not shown) resulting in slightly worse classification accuracy. Other parameters: N = 500 neurons, coupling strength g = 0.8, prereadout noise σ ξ,pre = 0.1 and postreadout noise σ ξ,post = 0.1 as detailed in Appendix 10. The training set comprised 100 noisy realizations of each pattern, and the test set 20. All theoretical curves are corrected for the probability that a noise realization does not leave the original flux tube, see Appendix 10 c.
codes the patterns, eventually approaching that of the chaotic submanifold d s (∞) = d * . To explain the effect on the separability of patterns and the classification performance, we must distinguish between the dimensionality of the total set of trajectories (including all patterns and their noisy variations), referred to as the signal dimensionality d s (t), and the dimensionality of the set of trajectories given by noisy variations of a single pattern, referred to as the noise dimensionality d n (t). Let us at first neglect the noise. With the increase of d s (t) also the linear separability of patterns increases. From the property of the linear regression this means that the average readout signal S α (t) of the correct pattern class α increases; if network responses were pairwise orthogonal, which to good approximation is satisfied in the high-dimensional signal subspace, the maximal attainable signal would bê as shown in Appendix 10 b. However, also the noise dimensionality d n (t), spanned by all noisy realizations of the same pattern, increases in the same way as d s (t) due to the chaotic dynamics, as shown in Figure 7b. But d s (t) has a head start because noisy realizations of one pattern are more similar to each other than to other patterns. Now let us assume that different noise realizations cause different responses that lie entirely within and are uniformly distributed across the signal subspace. This means that the noise randomly flips a number of spins that encode the pattern and thus effectively reduces the dimensionality of the space that faithfully encodes the signal. The effective dimension of the space that is available to represent the signal is then The expected signal is then given by This approximate expression overestimates, but captures quite well the overall shape of the average readout signal shown in Figure 7c: Initially, the signal rises in relation to the ratio of dimension of representation space and number of patterns, but ultimately, the signal declines, because the dimensionality spanned by the noise approaches that spanned by the signal. As seen in the inset of Figure 7c, the classification accuracy mirrors the behavior of the average readout signal. This transient increase of the linear separability of the pattern classes is rooted in the property of the system that the signal dimensionality initially rises faster than the noise dimensionality. , δ = ds(0) d * (56) marked by crosses. The time to maximum, for weak initial stimuli δ ≪ 1, is well approximated byt ≃ 2 ln 2 τ (vertical dashed line). From dark to light gray: g = 0.5, 1, 2, 3, 5 and tanh gain function; dashed dark orange curve: Heaviside gain function, limit g → ∞. Initial parameter N ds(0) = 200. b Maximal effective dimension ∆d(t) relative to initial ∆d(0) as a function of N ds(0). Same color code as a. In blue, corresponding simulation results, averaged over ten connectivity seeds each, and corrected for the probability that a noise realization does not leave the original flux tube, see Appendix 10 c. Values of g from dark to light as in a.
For small initial d s (0) and in the limit of vanishing noise d n (0) ↘ 0 the maximum of ∆d = d s −d n , and therefore S α , is reached att ≃ 2 ln 2 τ ≃ 1.39 τ (details given in Appendix 5, Eq. (57)); the peak time depends only weakly on g, as shown in Figure 8a. The improvement of the separability due to the transient expansion, the maximum ∆d(t) compared to its initial value at t = 0 is given by (58) where the latter expression is the limit of g → ∞ of the former. The improvement of the signal scales with N 1 2 and d s (0) − 1 2 , as shown in Figure 8b. The theory slightly underestimates the maximum of ∆d in simulations, but captures the scaling relation, predicting a slope of ≈ 1 2. So the peak classification accuracy can be improved by larger networks.
Note that in Figure 7c, neither in theory nor in the simulation the average signal drops to zero for t → ∞, in b the noise curve saturates to a smaller value than that of the signal; this is because not all noise realizations leave the flux tube of the original pattern, so that the average distance between realizations remains smaller than that between patterns (analyzed in Appendix 10 c).
G. Generalization to other network models: Transient chaotic SNR amplification The computational effect described in the last section relies on two factors: First, a high-dimensional space in which trajectories are nonlinearly embedded, and second, the decorrelation dynamics eq. (12) and Figure 7b that causes a small deviation to initially grow slower than a larger deviation. In the setting of a classification task, this mechanism thus enhances the signal-to-noise ratio (SNR).
The first factor is a general feature of all nonlinear neuronal networks. The second factor, we conjecture, should also be a typical property of chaotic networks, because the expansion of distances between three arbitrary trajectories should imply the largest distance to grow faster (in absolute terms) than the two smaller distances, as by a triangle inequality. Therefore, we expect the transient disentanglement of pattern classes to be a general phenomenon in strongly chaotic neuronal networks. To substantiate this claim, we demonstrate this effect in rate networks, and, in a proof-of-principle manner, in a spiking leaky-integrate-and-fire (LIF) network and a longshort-term-memory (LSTM) network [40].

Rate networks
A network of stochastic rate units with first-and second-order statistics matched to those of a binary network, as in Section II E, has a classification performance close to zero, as shown in Figure 10a. This is because the matched stochastic rate network is not chaotic; trajectories for different presented patterns converge. A small readout noise thus destroys classification accuracy.
However, how is the performance in the chaotic regime of a rate network? For this we no longer consider matched statistics and from here on again use rate networks without effective noise and g 2 > 1, which are chaotic [21].
Similar to binary networks, finite-size chaotic activity ( Figure 9a) in rate networks shows a transiently smaller difference between noisy realizations of a pattern than between patterns in the classification task, as shown in Figure 9b. This also follows from (24) and (25) since the acceleration of the decorrelation is smaller for smaller initial ǫ, making the noise distance grow slower than the signal distance.
Decorrelation in the rate network, however, can take many neuronal time constants if the network is in the mildly chaotic regime (Figure 9b). The timescale sensitively depends on the recurrent coupling strength, as shown in Figure 6b. This can be understood in terms of the decay constant of the time lagged autocorrelation function: In a noiseless rate network, the timescale of the autocorrelation diverges at the transition to chaos [21].
In the rate network classification performance jumps to a high value already after the first time step (Figure 10c). The reason is that all neurons' states are im- Gray scale shows the activity of each neuron between −1 (white) and 1 (black). b Time evolution of average distances between patterns (green) and between noisy realizations of a pattern (dark orange). Difference between signal and noise distances (blue). Numerical solutions of (24) (dashed), for details see Appendix 6. Network as in (a) and P = 50, L = 10, σ = 0.3. mediately nonlinearly affected by the input pattern. The stimulus is thus immediately projected nonlinearly into an N -dimensional representation that allows linear classification. This becomes apparent by considering that in the time step after stimulus presentation, the input to the network contains a term ∝ δt τ ∑ j J ij T (h j (0)). But even though the dimensionality is immediately N dimensional, the amplitude grows continuously with time and is thus very small at first, ∝ δt. This behaviour exposes the qualitative difference in the interpretations of ǫ: In the binary network, there is a direct link between ǫ and the dimensionality of the signal space, while in the rate network ǫ is a measure of the Euclidean distance between trajectories and is only very indirectly related to the number of dimensions across which this distance is distributed. Hence, the difference between inter-and intrapattern distances in Figure 9b peaks at a later time point than the classification accuracy in Figure 10b, because the rate network performs a transient signal amplification rather than a transient dimensionality expansion as in the binary network case.
Even though the dimensionality of the representation immediately after stimulus presentation equals N , distances between stimuli in the new directions are small at first. The classification in the rate network therefore  Figure 10. Stimulus representation in a rate network. a,b,c Signal of correct readout (blue), signal of wrong readouts (dark orange), and classification accuracy (inset). a Including noise σ 2 ξ matched to the statistics of a binary network according to (22). Parameters otherwise as in Figure 9. Because the frozen effective noise suppresses chaos, trajectories for different stimuli converge toward the same state and a small readout noise σ readout = 10 −4 results in classification at chance level. b Removing the noise in the network, and no readout noise. The network is chaotic as in Figure 9, and classification accuracy jumps to 1 in the first time step. c Including readout noise σ readout = 10 −4 impedes classification accuracy until trajectories are sufficiently separated. d Average norm of individual readout weights for the setting in (b). Since trajectories are very close in the initial time period (compare also Figure 9b), initial readout weights are very large, explaining the sensitivity to readout noise in (c).
relies on fine-tuned and very large readout weights in the beginning (Figure 10d). Therefore, the initial classification accuracy is severely impaired by adding even weak noise to the readout (Figure 10c). This addition of noise, in turn, results in a peak of the accuracy predictable by the theoretical peak in signal amplification (Figure 9b).
In summary, rate networks with continuous signaling perform a transient amplification of the signal-to-noise ratio, rather than a transient dimensionality expansion. Compared to binary networks with discrete signaling, the resulting empirical differences are the strong dependence of the decorrelation timescale on the coupling strength, the existence of a minimal coupling strength required for amplification, the absence of the residual correlation, and the initially high sensitivity to readout noise.  ) distribution, with additional detail given in Appendix 11. Gray scale shows the activity of each hidden unit between −1 (white) and 1 (black). b Evolution of average Euclidean distance between different patterns (green), noisy realizations of a pattern (dark orange), and the difference of the two (blue). The short initial decay in pattern distances is likely due to a short networkwide contraction of the dynamics after pattern presentation, caused by the interaction of the gating variable states and the L changed hidden states. c Signal of correct readout (blue), incorrect readouts (dark orange) and classification performance (inset).

Spiking networks
To demonstrate that the same computational effect translates to spiking networks which are furthermore not all-to-all connected, we consider the same task in a purely inhibitory network of LIF neurons with fixed in-degree in the asynchronous-irregular firing state. Trajectories in these networks are known to have a small stable local environment (flux tube) but exhibit chaos for perturbations leaving the flux tube [35,41], just as the binary networks considered in this manuscript. In the binary networks, the flux-tube borders are given by input perturbations that are just sufficient to cause a single spin to flip. By binning the spike trains of the LIF network with a bin width equal to the membrane time constant, we obtain approximately binary vectors if the bin width is small compared to the inverse firing rate. Therefore we can use the same training and analysis procedure as for the binary networks, and also interpret distances in terms of dimensionality. Further details are given in Appendix 11. The LIF network dynamics and performance show the same features found in binary networks: Distances between states quickly grow toward a residual correlation ( Figure 11). Dimensionality expansion takes place on a timescale within which each neuron fires only a single spike or less on average (ν −1 ≃ 55 ms). The distance between pattern classes initially grows faster than the noise distance, Figure 11b, and the readout signal and classification performance show a corresponding transient peak, Figure 11c. As in the binary network, some noise realizations do not leave the flux tube of the unperturbed pattern, causing a reduced long-term average noise distance and a nonzero plateau of the residual classification performance.

LSTM networks
Finally, to demonstrate the existence of the computational effect in a powerful specialized machine-learning architecture, we consider the same task in a recurrent LSTM network [40]. The architecture is similar to the rate networks we consider, but contains a large number of additional dynamical "gating" variables which control when, where and by how much the cell states interact with each other, resulting in considerably more complex dynamics than in a rate network with fixed coupling matrix. We us a vanilla pytorch implementation, choosing the initialization parameters such that the network exhibits spontaneous chaotic fluctuations over a moderate range of timescales, as seen in Figure 12a. Details about the parameters and task implementation are given in Appendix 11. In Figure 12b and c, we see a behavior in close analogy to that shown in Figure 9b and Figure 10 for rate networks. Signal and noise distances increase differentially fast, and there is a pronounced transient peak in the classification accuracy. Other than in the binary and LIF networks, which use discrete signaling, the rate and LSTM networks do not have locally stable flux tubes and no residual plateau in the classification performance.

III. DISCUSSION
This manuscript compares the effect of discrete and continuous signaling on the dynamics and function of neuronal networks. Focusing on binary classification as a fundamental computation, it addresses the question how the temporal dynamics can be used to represent stimuli. Separating representations translates into asking how state trajectories diverge or converge if different stimuli are presented. Technically this amounts to quantifying chaos in such networks.
A model-independent path-integral approach enables comparisons across models. We find that the dynamic mean-field theory is of identical structure for networks of binary units and for continuous rate networks. In binary networks we discover a network-size-dependent transition to chaos and the existence of a chaotic submanifold. We elucidate the qualitative differences to chaos in rate networks in terms of the mechanism causing chaos, timescales, and parameter regimes.
Applied to classification, chaotic dynamics causing a relative dimensionality expansion of representations leads to a mechanism of fast and transient computation in binary networks with discrete signaling. We describe a generalization of this effect as a transient signal-to-noise amplification in chaotic rate networks with continuous signaling.
The remainder of the discussion puts these results into context of the literature, mentions limitations, and provides an outlook.

A. Differences and similarities across neuron models
Transition to chaos in binary networks at finite size We demonstrate that there is a transition to chaos in finite-size binary networks, described by a fieldtheoretical replica calculation. Our results are consistent with works on sparse random boolean networks with synchronous update showing a chaos transition for in-degree K = 2 [34,42], which relates here to the transition at N ≈ 2 for the Heaviside activation function. Derrida and Pomeau [34] approximate the disorder by annealed averages and use synchronous update, while we compute the quenched averages and employ asynchronous update. In mean-field theory, the in-degree in sparse networks plays a role similar to the network size in dense networks. Other works that investigated the edge of chaos numerically in discretely coupled networks have also found small in-degrees as critical coupling [18,20,43].

Correspondence of DMFT in binary and rate networks
The model-independent field theory presented here exposes a one-to-one match of the stationary activity statistics in dynamical mean-field approximation of binary and rate networks. Exposing identities between neuron models is useful to see if and how the results generalize. Steps in this direction where already taken in Grytskyy et al. [44], who showed that weak pairwise correlations can be explained by linearizing LIF neurons, Hawkes processes, and binary neurons, mapping them to noisy linear rate models. The results presented here are more general since they apply not only to the linearization of the models but hold for the nonlinear behavior as well. The equivalence of time-lagged autocorrelations is shown here for stationary statistics; for nonstationary dynamics also the effective noise strength should vary as a function of time [45, chap. 6.4].

Assumptions on connectivity
The assumption of Gaussian connectivity J ij ∼ N (ḡ N , g 2 N ) straightforwardly generalizes to other connectivities, as long as higher than second cumulants are suppressed by powers of N −1 . Scaling the mean connectivity asḡ N yields a consistent approximation in 1 N . Sparse connectivity, however, typically leads to a scalingḡ √ N . Formally, a consistent treatment therefore requires Gaussian fluctuations of the mean activity field ⟨R 2 ⟩ ∼ 1 N , which is possible in the presented framework. Such fluctuations are, however, suppressed by negative feedback [46] in the inhibition-dominated (balanced) regimeḡ < 0. For multiple populations, the DMFT equations acquire population indices, but stay structurally the same (cf. [27] for binary neurons and [33,47] for rate neurons). Scale-free distributions of weights can violate the assumptions and require a different approach [48].

Relation of the model-independent path integral formulation to earlier work
The seminal work by Sompolinsky et al. [21] on rate neurons used statistical field theory [22], and the work by van Vreeswijk and Sompolinsky [27] on binary neurons relied on a disorder average of the master equation [28,29]. The statistical field theory that we develop here captures both model classes, and is similar to the Martin-Siggia-Rose-de Dominicis-Janssen (MSRDJ) formalism [49,50] for rate neurons [reviewed e.g. in [51][52][53]. In particular, this formulation exposes the identical structure of the mean-field approximations.
Binary networks with asymmetric connectivity show nonequilibrium dynamics, so that the Ising Hamiltonian cannot be used. Instead, complete information about the system dynamics needs to be captured. Full information is supplied by the master equation, for which an established approach is the Doi-Peliti formalism [54,55]. The fields in the latter approach, however, have no intuitive physical interpretation, even though they allow the construction of mean-field equations and fluctuation corrections [56]. Closer to our method are the approaches by Sommers [57], Andreanov et al. [58], Lefevre and Biroli [59], which can be obtained as special cases from our formulation.

Different chaotic dynamics in binary and rate networks
Qualitative differences between chaos in binary and rate networks can be summarized as follows: i) Rate networks with activity statistics matched to that of binary networks are nonchaotic. This shows that discrete signaling provides a different mechanism that drives chaos in binary networks, and that the mechanism causing chaos in rate networks is not effective in binary networks. A marginally chaotic solution is approached in matched rate networks when sending the activation function to the Heaviside limit. This is consistent with the finding that rate networks are always chaotic if the activation function has an infinite slope [33]. ii) In the limit of large numbers of neurons, (now unmatched) rate networks have a critical coupling strength beyond which they transition to chaos, while binary networks are always chaotic in this limit. At finite network sizes, binary networks have a size-dependent, critical coupling strength, which is typically very low. iii) Decorrelation of trajectories in binary networks is generally faster than in rate models. In binary networks it takes place on a timescale given by the interval between state changes of individual neurons and is only mildly affected by the network coupling. In rate models, it strongly depends on the cou- pling strength, showing a critical slowing-down at the transition to chaos. Stochasticity gradually smooths out this divergence [23]. iv) Trajectories in binary networks decorrelate only up to a residual correlation, while those in rate networks completely decorrelate. v) Binary networks have an infinite Lyapunov exponent, so that decorrelation starts off with a finite slope even for infinitesimal initial perturbations. In rate networks, the initial decorrelation is an exponential function, whose slope therefore vanishes for infinitesimal perturbations.

Origin of the difference
In the rate network, the noise is external and frozen, but a binary network's noise realization depends acutely on the initial value of the system. Thus, perturbing the initial value also changes the noise realization. In particular, due to the thresholding operation that produces the discrete signal, a tiny perturbation in the input can cause a flip of the neuron, and consequently a macroscopic change of the network state; the probability that this change happens increases with the number of targets that receive this perturbation, and thus with network size; this increase is due to the strong synapses J ij ∝ N − 1 2 (for ⟪J 2 ij ⟫ ∼ gN −1 ). For large networks the growth of the perturbation corresponds to a macroscopic change in the noise realization.
The presented replica calculation provides a complementary explanation for the qualitative difference between binary and rate neurons. The question of a chaos transition is reduced to studying how correlations are transferred from the inputs of a pair of neurons to their outputs: The change of the correlation between repli-cas is proportional to the mismatch between the correlation at time t, the second line on the right of Eq. (11), and the correlation transmitted through a pair of neurons and connectivity, given by the third line, a function c out (c in ). Such a transmission curve c out (c in ) is shown for discrete signaling and for continuous signaling in Figure 13. Clearly, if c out (c in ) < c in for a c in close to perfect correlation just below unity, the correlation decreases over time, the dynamics is chaotic; if c out (c in ) > c in , the correlation regenerates, the dynamics is regular. While the slope c ′ out (c in → 1) for discrete signaling diverges as ∝ (1−c in ) − 1 2 , it stays finite for continuous signaling. The infinite slope for networks with discrete signaling leads to an infinite Lyapunov exponent for N → ∞. For continuous signaling, the slope is finite as long as the slope of the activation function is bounded, resulting in finite Lyapunov exponents. This analytical view of the different transitions to chaos in continuous and discrete networks is also consistent with numerical findings [60].
Correlation transmission by pairs of neurons is well studied experimentally and theoretically [e.g. [61][62][63][64]. A diverging slope of the correlation-transmission curve, here shown for binary neurons, has also been demonstrated for spiking neurons without reset [63] and for the LIF model [62,[65][66][67][68], suggesting that these model classes behave similarly with regard to the transition to chaos.

Flux tubes in binary and spiking networks
There is a tight link between the replica calculation and chaos in spiking networks examined in terms of the divergence rate between flux tubes [41]. Flux tubes are neighboring portions of the phase space within which perturbations of the state do not cause a global change of subsequent activity [35]. Binary neurons are formally simpler than spiking models, because one can investigate changes of network states directly instead of analyzing spike patterns (Appendix 7), binary neurons do not have additional internal degrees of freedom, such as the membrane potential, and their activation times are given by predetermined update times. Therefore perturbations inside a flux tube are not forgotten exponentially as in LIF neurons, but instantly. Just like in LIF networks, the distance to a flux-tube boundary shrinks with network size as ∼ 1 N . The divergence rate between flux tubes scales as ∼ √ N , opposed to ∼ N in the LIF network [41]. Both are consistent with an infinite Lyapunov exponent for N → ∞. Chaotic spiking activity has also been investigated in Lajoie et al. [69] and Lajoie et al. [70], who showed that chaotic quadratic-integrate-andfire networks exhibit a reduced spike pattern entropy, indicating that they explore only a lower-dimensional manifold in phase space.

Transient chaotic SNR amplification
We find that strongly chaotic neuronal networks of different types invariably exhibit a transiently improved separability of low-dimensional inputs. This is at first sight surprising, because chaotic dynamics amplifies noise as well as informative differences. However, the variability within a class (noise) is typically smaller than the variability across classes. We find the latter to be amplified more strongly than the former, thus improving the linear separability of the classes. We argue that this relative amplification is a general effect in nonlinear, high-dimensional chaotic systems. There is an analogy to astrophysics: Space in the Universe is locally expanding everywhere. Thus all points are drawn apart, forming diverging trajectories. As a result of the ubiquitous expansion, galaxies move apart ever faster the greater the distance between them, as described by the Hubble constant. In the same way, trajectories in the network that are farther apart (different classes) separate faster than trajectories that are initially closer (noisy realizations within the same pattern class). Unlike the Universe, however, the network state space is higher dimensional and inherently expands along highly curved directions. Therefore the faster expanding, larger differences are also more strongly affected by the nonlinearity and are more quickly embedded into the surrounding higher dimensions. Also unlike the Universe, the state space volume of the network is finite and constant, so some directions must shrink to conserve the total volume. Trajectories therefore do not diverge indefinitely but reach a stable average distance determined by the volume of the limiting chaotic attractor. They continue to be mixed by the expanding and shrinking dynamics, such that information about their initial distance relations is eventually forgotten and classification performance subsides.

Elucidation by dimensionality in binary networks
This qualitative picture is made concrete in binary networks: Their discrete state space allows an interpretation of the growing average distances in terms of dimensionality. This allows us to express the improvement of classification accuracy in terms of the difference between the dimensionality of the representation of the signal d s and the number of dimensions corrupted by noise d n . Their temporal evolution follows stereotypic decorrelation curves obtained from a replica calculation. Dimensionality and separability are linked, because each dimension allows the linear separation of two additional random features [16,71]. The effect can also be viewed as a dynamical version of the usually statically applied kernel trick [15].

Relation of chaos and computational power
It has been argued that close to the edge of chaos, networks show an optimal trade-off between separation of stimuli and generalization [17,18,20,32]. Our analysis provides a time-dependent perspective on this hypothesis. Indeed, the signal and noise dimensionalities that approximate the classification performance can be compared to the Kernel Quality and Generalization Rank [VC dimension,15], introduced by Legenstein and Maass [20]. Our results show that binary networks generally have a short memory lifetime. But in contrast to rate networks, it also does not reduce strongly when moving deeper into the chaotic regime. For tasks that require only short memory, performance can benefit from the increased separation even deep in the chaotic regime, in particular because the peak of the informative dimensionality max(d s − d n ) increases with network size. This is in line with the results of Snyder et al. [43], where for small readout delay, performance stays high when increasing the in-degrees (e.g. their figure 3 a,b). Overall, these observations raise the question, whether the effect could be combined with a prolonged memory lifetime, for example by heterogeneous time constants of neurons or synapses, clustered connectivity [72], or by feeding the readouts back into the network.

Other related works
Recurrent networks can be transformed to deep feedforward networks with weight sharing by "unrolling" them in time. Successive layers then correspond to adjacent time steps in the recurrent network. Chaotic iterative maps were shown to yield an increase of the dimensionality of representations toward deeper layers [73]. Training specifically on low-dimensional representations yields facilitated feature generation by dimensionality expansion in early layers and feature selection and generalization by dimensionality suppression in later layers [74]. Farrell et al. [75] investigated (continuous) recurrent networks in a classification task similar to ours, also considering the strongly chaotic regime. They focused on late-time compression of the representation by training and found that chaos benefits learning of the task, for which our results provide a principled explanation. Their results provide clues on how training interacts with the random connectivity, an interesting avenue for future work.
The mechanism of transient computation we investigate coexists with nonnormal amplification [76? -78], which is caused by effective feed-forward structures embedded in nonorthogonalizable coupling matrices. Nonnormal amplification is especially strong in the chaotic regime [76]. The here described mechanism, however, also applies to normal matrices. Finally, recent developments in statistical mechanics of computation in neuronal networks are reviewed by Bahri et al. [79].
A mechanism of fast computation in spiking networks The peak classification performance in binary reservoirs is reached on the scale of a single neuronal time constant after stimulus onset. This scale also holds approximately in the LIF network. However, the peak time can likely be even shorter for higher in-degrees or firing rates, as these influence the divergence rate of flux tubes in such networks [41]. The mechanism may explain how computation can spread rapidly through the hierarchical networks of the brain: The time window corresponds to one activation per contributing neuron on average, allowing the fast feed-forward processing latencies of 50 ms per stage measured experimentally in cortical areas [80,81]. This perspective suggests how networks that employ discrete communication may compute rapidly on the basis of a few spikes rather than requiring a prolonged averaging over time. A possible impediment to computation in a chaotic system is its sensitivity to initial conditions, which, apart from the input patterns, are kept fixed in this manuscript. One solution would be a mechanism which quenches variability at appropriate times. Another possibility was exposed by Lajoie et al. [82], who showed that chaotic spiking networks can reliably encode inputs despite changing initial conditions, because each input confines the chaotic activity to a different manifold.

Experimental evidence and predictions
The olfactory system is a potential candidate to rely on the transient computational mechanism we describe, because it is specialized on classification of patterns without a temporal component. In the vertebrate olfactory bulb, an odor activates a comparably low-dimensional pattern of glomeruli, the input layer to a higher dimensional recurrent network that needs to separate representations to enable classification of odor identity by subsequent processing stages. The insect antennal lobe shares this basic organization. Recordings in the olfactory systems in zebrafish [83], locust [84], and rats [85] show a representation of stimuli that is consistent with the here found mechanism of transient dimensionality expansion.
In zebrafish, the activities of mitral cells in the olfactory bulb show a high correlation for similar odors shortly after stimulus presentation. Subsequently they decorrelate on a timescale of ∼ 800 ms, reaching a residual correlation of about 40% [83, Fig 2E]. The discriminability of these similar odors by a linear readout from the mitral cells improves within the same time span to nearly error-free classification [83, Fig 2I]. In the process, the population statistics stays approximately constant. These features are in line with transient dimensionality expansion, except that the classification accuracy does not decline again after the improvement. However, this missing decline could be caused by feedback stabilizing the representation after recognition, or be related to the sustained presentation of the odor stimulus. Recordings in the locust antennal lobe show qualitatively similar behavior [84]. In particular, these experiments report a decoding accuracy that is highest within the transient phase. In the rat olfactory bulb, inhalation also triggers a fast decorrelation transient of ∼ 100 ms, during which odor identity is encoded in the instantaneous spike pattern, and decoding accuracy rapidly peaks after ∼ 70 ms before declining to a lower level [85]. Future work should systematically investigate if the dynamics in these biological systems is in fact chaotic, for which suitable analysis methods are available [86,87]. Also, the analysis of inter-and intraclass distances and classification by linear readouts is applicable to experimental spiking data. The computational mechanism we describe needs a reliable initial state from which trajectories diverge. In the rat olfactory system, this reset could be tied to inhalation onset. In cortex, stimuli seem to quench the variability of spontaneous activity to evoke relatively low-dimensional responses [88][89][90][91]. To check for chaotic dynamics, one could therefore analyze the growth of intertrial variability during and after stimulation offset. We finally list concrete testable predictions for neural systems that implement classification by transient chaotic dimensionality expansion: 1. Variability is small or quenched at stimulus onset, then transiently increases and reaches a stable value.
2. Not only the interclass distances, but also the intraclass (noise) distances increase, although initially slower.
3. Decoding accuracy based on linear readouts trained at each time point shows a peak, and this peak occurs before the distances saturate. the RWTH University and the JARA Center for Doctoral studies within the graduate School for Simulation and Data Science, funded by the Deutsche Forschungs  This section presents a self-contained derivation of the model-independent mean-field theory for networks with Gaussian random connectivity J ij (h 1 (t), ..., h N (t)) and outputs x(t) = (x 1 (t), ..., x N (t)) and the neuronal dynamics is described by the conditional probability functional We use vectorial notation to denote because, given their inputs {h i }, neurons are otherwise pairwise independent. The probability functional ρ[x i h i ] is assumed to be strictly causal, which is x i (t) is independent of h i (s > t); a more explicit notation would be ρ[x i (○+) h i (○)], denoting that the time-argument x i (t + ǫ) must be infinitesimally advanced by ǫ > 0 compared to the argument of h i (t) for ρ to depend on h.
The joint statistics of input and output is then The distribution of the inputs h is given as the marginalization over x as The connectivity J couples the outputs x of the neurons to the input h as So in the marginalization (31) over x we need to set and the inner product is meant aŝ which is ordered in time, resulting in a spiraling structure.
Performing the disorder average ⟨. . .⟩ J of (31), the only term affected is the last exponential factor in the second line of (32), which yields Here, the scalar product in the last term in the exponent rewrites explicitly as The terms suggest the introduction of the auxiliary fields where the bi-linear form is to be read aŝ The appearance of the product sign and the neuronindependent fields R and Q signifies that the problem becomes completely symmetric with regard to neurons. Enforcing the definitions of the auxiliary fields by Dirac distributions, represented in Fourier domain, analogous to (32), yields another pair of fieldsR andQ and brings (31) into the form Adding the normalization condition by integrating over h we note that this integral affects only the last two lines in (36). The exponent in the second line can be considered an action of a field theory for the auxiliary fields with We now compute the values of the auxiliary fields that provide the dominant contribution to the probability mass. The appearance of N in the exponent N Ω[R, Q] suggests to perform the integration over the fields {Q, R,Q,R} in saddle-point approximation, demanding δΩ δ{Q,R,Q,R} ! = 0, which yields four conditions for the saddle-point values R, Q,R,Q of the fields Here the expectation value is ⟨.
The denominator coming from the outer derivative of the logarithm appearing in the expression for Ω does not contribute, because the normalization condition of the latter distribution is unity, since the exponential term is the moment-generating functional of a Gaussian process h ∼ N (R, Q) and ρ[x h] is normalized, allowing us to rewrite The auxiliary fieldsĥ and all their powers are zero on expectation, which is a consequence of the normalization [53, 92, Section X]. The equations thus leads to the result in the main text, eqs. (2) and (3).

Derivation of the mean-field equation for binary networks
Having obtained the saddle-point solution to the path integral developed in the previous section, the first result is the time evolution of the mean input activity R(t). For binary networks, we need to insert information specific to the neuron model in order to compute ⟨x(t)⟩ h∼N (R(t),Q(t,t)) in (2). This means we need ρ[x i (t) h i ] for a binary neuron. Note that only the probability distribution of the activity at a single time point is needed, which is much simpler to obtain than a distribution across all time points, which would include not only the dependence on the input history, but also on the neuron's own activity state. For the most compact presentation, we will here use the bitlike n ∈ {0, 1} representation instead of the Ising x ∈ {−1, 1} representation used in the main text. The results between the two can be easily related by the mapping The bit-like n ∈ {0, 1} representation has the advantage that we only need to consider the active state in averages, since the inactive n = 0 state does not contribute. In this case, the probability of finding a neuron active at some time t is given by the probability T p (h i (t ′ )) to be activated at any prior update time point t ′ and the survivor function e − t−t ′ τ [30], the probability that no further update happened since. Therefore, plugging into (2) from the second to the third line. Taking a time derivative and using (6), we obtain the mean-field equation Note that here we need only the input variance Q(t, t), which is trivially given by the mean activity and, potentially, zero-time-lag cross-correlations of the outputs. However, the mean-field equation does not depend on the autocorrelation at nonzero time lag, which is derived in the next section. Therefore, at least as far as crosscorrelations are negligible, (40) is closed.

Derivation of the ODE for autocorrelations in binary networks
Here we derive the form (9) for the evolution of the autocorrelation.
The correlation functions in the Ising and bitlike representation are, according to (38), related as where by Q(t, s) = g 2 q I (t, s) we obtain the quantity considered in the main text in (9). Defining q(t, s) ∶= ⟨n(t)n(s)⟩ we have where we write ⟨. . .⟩ h∼ρ as a short form of ∫ dh ρ(h) . . .. The latter joint probability is decomposed, analogous to (1), as We obtain the first conditional probability on the right by considering the possibilities to reach the final state Likewise we obtain the latter conditional probability on the right of (43) as Combining (42), (44) and (45) we get In the stationary state, the first integral in the last line reduces to ⟨T p (h)⟩ h∼ρ e − t−s τ . Also q(t, s) =∶ q(t − s) is then a function of the time lag ∆t ∶= t − s alone Differentiating by ∆t we get where we substitute s − t ′ → t in the last step and used the stationarity to shift the time arguments of the h by t − s. Using (38), (41), and (6) and assuming stationarity we get the result (8) in the main text, where the gain function T instead of T p appears. Shifting the integration variable t by ∆t, we obtain and performing another derivative τ d d∆t yields where for the second equality, we use the first-order differential equation (46). Closing the equation in the meanfield approximation, amounts to setting the measure of h ∼ ρ ≡ N (R, Q) to the Gaussian process with mean R and variance Q, as determined by the saddle-point equations (2) and (3). This approximation neglects fluctuations of R and Q, which is justified if the system is not close to the critical point and the average connectivity scales at most like 1 N [29], but even if the latter condition is relaxed to a 1 √ N -scaling, this merely leads to an additional term in the input fluctuations taking into account pairwise correlations [93].
Moving to the [−1, 1] representation by using (38), (41), and (6) and multiplying (47) by g 2 changes T p → T and g 2 q → Q so that we obtain (9). Here, in addition, we introduce N R,Q(0),Q as the bivariate Gaussian with stationary mean R and covariance matrix This Gaussian expectation value allows us to employ Price's theorem [31] which states that

Model-independent replica calculation
To assess the transition to chaos, we perform a replica calculation that considers a pair of networks with identical connectivity but slightly different initial conditions for the neurons. We use superscripts (1) and (2) to distinguish the two systems. The correlation between the two replicas is a measure of the distance between their respective states in terms of the squared Euclidean distance d (12) i .
The first term, on expectation over realizations of the activity, approaches the average autocorrelation in the two replicas and the latter term the inter-replica correlation. For Ising spins the expression simplifies to i . The formal derivation of mean-field equations that approximate these quantities proceeds analogous to Appendix 1: The analog expression to (31) and (32) reads Here the conditional density ρ[x (1) , x (2) h (1) , h (2) ] is a joint distribution across the two replicas, because it must allow the representation of update processes or stochastic activations of corresponding neurons that have identical realizations between the two replicas. The important point is the identical matrix J appearing in the product of the latter two Dirac distributions, which, after introducing Fourier representations as in (32) and taking the disorder average over J , analogous to (33), yields The penultimate line is the same contribution for each replica as in the single system; it is treated in the same manner by introducing pairs of auxiliary fields {R (α) ,R (α) , Q (αα) ,Q (αα) } α∈{1,2} . The last line couples the two replicas and can be decoupled similarly by defining This definition is enforced by inserting a δ constraint, represented as a Fourier integral with the corresponding conjugate fieldQ (12) (s, t). The integral over {R (α) ,R (α) , Q (αβ) ,Q (αβ) } α,β∈{1,2} is then taken in saddle-point approximation with the resulting nontrivial saddle-point equations The remaining response fields vanish,R (α) =Q (αβ) ≡ 0. The expectation value in (49) is taken with the measure where (h (1) , h (2) ) ∼ N ({R (α) , Q (αβ) }) is a pair of Gaus-sian processes with cumulants ⟪h (α) (t)⟫ = R (α) (t), The distance (48) between the replicas in mean-field approximation can then be written as d (12)

Application to binary networks
The zero-lag cross-replica correlation is then given with (49) and (50) as To construct ρ(x (1) , x (2) , t h (1) , h (2) ), first note that both neurons are updated by the same stochastic realizations of the update process. This process has two random components: The drawing of the update time point t ′ , which, for the Poisson updates, has a distribution of e − t−t ′ τ dt ′ τ for the last event to have appeared in [t ′ , t ′ + dt], and the stochastic activation depending on the gain function T p ∈ [0, 1], whose value for both replicas is compared to the same realization of a uniformly distributed random number r ∈ [0, 1] . The four possible outcomes of this update of states (x (1) , x (2) ) are (−1, −1), (1, 1), both of which lead to x (1) ⋅ x (2) = +1 and (−1, 1), (1, −1), both of which lead to x (1) ⋅ x (2) = −1. One thus only needs to distinguish two outcomes: The event x (1) ⋅ x (2) = −1 takes place if the random variable r is in between the values of the two gain functions, T p (h (1) ) < r < T p (h (2) ), which happens with probability p diff = T p (h (1) ) − T p (h (2) ) ; the other event x (1) ⋅ x (2) = +1 with 1 − p diff . So in total we get at the time t ′ of update Taken together with the asynchronous update time point, we thus have Taking a derivative with respect to t, we obtain an ODE governing the time evolution of the cross-replica correlation Note that Q (12) (t) also appears implicitly in the distribution of h (1) , h (2) , rendering the equation nonlinear. This implies the result (11) in the main text, which follows by replacing 2(T p (h (1) ) − T p (h (2) )) = T(h (1) ) − T(h (2) ) due to (6). So far we have proceeded without approximation apart from the saddle-point approximation. Perfect correlation of the replicas Q (12) (t, t) = Q 0 = g 2 is clearly a fixed point, since then h (1) (t) = h (2) (t) and the right-hand side vanishes. We now wish to assess the stability of this solution, that is, whether a perturbation of one replica results in recovery of perfect correlation (regular dynamics) or in a decorrelation of the replicas (chaos). Making the ansatz Q (12) (t, t) = Q 0 − ǫ(t) and using the mean-field approximation of the input distribution, the last term of the ODE (54) becomes, by substituting H ∶= (h (1) + h (2) ) 2, h ∶= (h (1) − h (2) ) 2 and then expanding in h and ǫ (2Q 0 ): Plugging this result and the ansatz Q (12) (t) = Q 0 − ǫ(t) back into (54) then yields (12).

Growth of perturbations in binary networks
Starting from (12) where c = 2 √ π g 2 ⟨T ′ (h)⟩ h∼N (R,g 2 ) , we integrate the differential equation so the solution is Expressed in terms of the dimensionality d(t) = N g 2 ǫ(t) and d * = N g 2 c 2 (18) 2 .
In the long-time limit the solution reaches the fixed point The fastest increase happens for a Heaviside gain func- This shows that the binary network decorrelates only to a dimensionality of d * N = 8 π 2 ≃ 0.81. The distance as a function of time is then The maximal signal-to-noise ratio is obtained by using d s with a nonzero initial d s (0) and a noise distance d 0 n that initially vanishes d 0 n (0) = 0, leading to: where we define δ ∶= ds(0) d * . The maximum of this function is at For small δ ≪ 1 this expression yields so the time of the maximum becomes approximately independent of the initial value δ and thus independent of the signal-to-noise ratiô The maximum is with The latter expression shows that this maximum, relative to the initial signal is For the Heaviside nonlinearity the result becomes with

Growth of perturbations in rate networks
On small timescales the distance evolves in proportion to the Lyapunov exponent In particular, there is no maximum expected on a timescale of the neuronal dynamics. The typical timescale instead is determined by the maximal Lyapunov exponent.
To derive an equation for the time evolution of the correlation between replicas in the rate network, we use the pair of equations obtained from the mean-field description We may thus write So the correlation function c αβ (t, s) ∶= ⟨x α (t)x β (s)⟩ obeys which becomes in differential form This differential equation allows the integration along the t direction by one time-step δ which requires only c αβ and ⟨hh⟩ in the t−past and in the s−past. The integration can be done for t ∈ [0, T ], where T is a desired final point. As a result, one has c αβ (t, s) for t ∈ [0, T ]. In the next update step we move into the s−direction by where the latter integral can be computed because it requires only ⟨h α (t)h β (s)⟩ = g 2 ⟨T (x 1 )T (x 2 )⟩ (x1,x2)∼N (0,c αβ (t,s)) (60) computed in the previous step.
It makes sense to introduce as an auxiliary variable for which we can assume the symmetry q αβ (t, s) = q βα (t, s) to write the updates The auxiliary variable q αβ (t, s) obeys the differential equation (τ ∂ t + 1) q αβ (t, s) = ⟨h α (t)h β (s)⟩, which yields the update equation So the required sequence of updates is: 6. Compute q αβ (t ′ ≤ t + δ, t + δ) by iterating (63) with zero initial condition and starting with t ′ sufficiently far back in the past.
8. Go to the next time slice t → t + δ, return to step 3.

Flux tubes in binary networks
It has been shown by Puelma Touzel and Wolf [35] that the borders of flux tubes in spiking networks of inhibitory LIF neurons are related to changes in the global order of spikes. In particular, if a perturbation creates an additional spike or causes the omission of an expected one, the mean firing rate will stay constant but the order of future spikes is very likely to be irrevocably changed. The divergence rate of two trajectories can be assessed by calculating the mean number of unexpected spike order changes caused by a single such perturbation, resulting in a branching process. In the context of binary networks, we can ask the equivalent question: Given a flip of a single neuron's activity variable, how many "wrong" update results will occur on average in the following time τ ?
The flip of one neuron x j → −x j causes a change ∆h i = −2J ij x j in the input of neurons it is connected to. Across different target neurons ∆h is therefore distributed as and the probability of a neuron to be updated into the wrong state due to the perturbation in the input is where the absolute value enters because both directions of perturbation cause a positive probability of "wrong" updating, and we assume T ′ ≥ 0 for simplicity. Now we ask the following question: How many downstream flips n spawns will, on average, be triggered in the network during one time constant, given a single original flip? This quantity controls whether the decorrelating flips will proliferate or not, because since every neuron is updated on average once per time constant, if n spawns < 1 and the neuron carrying the original flip is updated again, it is most likely updated "correctly" again and the average number of flips in the network has decreased. If n spawns > 1 on the other hand, the average number of flips increases. Being interested in the transition point, we can assume n spawns ≈ 1 so that we do not need to take the interaction of several flips into account. Then and while we take the mean input R into account, we neglect the perturbation of the mean input ⟨∆h⟩ = ±2ḡ N = O(N −1 ) ≈ 0 as it is small compared to the standard de- , allowing the simple calculation Finally, accounting for T ′ p (h) = T ′ (h) 2, given by (6), the chaos transition is expected at which is exactly the result (15), derived via the completely different route of the replica calculation. While the derivation here is nicely and intuitively interpretable, the derivation via field theory and replica calculation allows for systematic generalizations. For example, it is not clear how to obtain the residual correlation (13) in the ad hoc approach.

Fluxtube size
The flux-tube diameter is not a very informative measure for a binary network, since the system trajectory in phase space is typically not in the middle of a "tube" but close to some of its boundaries (given by the thresholds). Therefore, the distance to a boundary strongly depends on the direction of perturbation. As a relatively informative measure, we consider smearing the trajectory in all directions with some variance Var(∆h) = σ 2 fl , which is chosen such that on average, one flux-tube boundary is crossed. This procedure makes sense insofar, as it is similar to adding noise onto the input. It is important to be aware that σ fl is not strictly the average distance to the closest boundary, although the two quantities should covary.
The situation is analogous to the above calculation, because we again need to consider the flips occurring during an update in (on average) all N neurons, which is given by (64) only with σ ∆h replaced by σ fl . Demandinḡ n spawns ! = 1 then yields .
Of course, the 1 N scaling needs to be taken with caution, since our perturbation goes into all N phase-space directions, resulting in a total length scaling as 1 √ N .

Equivalence of dynamical mean-field theories of binary and rate networks
The dynamics (21) can equivalently be written as where the noise η i is an Ornstein-Uhlenbeck process [94], (τ ∂ t + 1) η i = ξ i . This form allows the application of the model-independent field theory. The single-neuron, single-time-slice probability functional is ρ[ , and the noise term is taken into account in Plugging this expression into (2) we obtain the same equation (5) for the mean activity as for the binary neuron, if we choose the strength of the noise σ ξ such that Q(t, t) = g 2 as well. The reason for the equivalence is that the exponential function appearing in the convolution equation is the Green's function of τ ∂ t + 1. In a stationary state, the saddle-point solution for Q(∆t) ∶= Q(t, t+∆t) = g 2 ⟨x(t)x(t + τ )⟩, moreover, follows the same Newtonian equation of motion (9) as for the binary model [21, eq. 7].

Matching initial conditions
Knowing that the differential equations for the timelagged autocorrelations are the same, we have to adjust their respective initial conditions to establish full equivalence. Here we use the subscripts b and r to refer to the quantities of the binary and rate model, respectively. Two initial conditions are needed for a unique solution. One is to require that lim t→∞Q (t) = 0, which is the same in both cases. So the autocorrelation for infinite time-lags is described by a single value Q ∞ , which vanishes for point-symmetric activation functions, but is in Figure 14. Equivalence of binary (left) and rate models in the two different forms (66) (middle) and (21) (right): Mapping from output to input by identical matrix J; asynchronous update process U with rate τ −1 implies exponential convolution kernel, leading to leaky-integration (cf. (5)), identical to operator L = (τ ∂t + 1) −1 present explicitly in (66) and (21). Transitions between discrete binary states effect red noise η in input h (middle), which corresponds to white noise ξ that is low-pass filtered by L (right). Rate models differ in the order of application of this kernel and the connectivity, which yields equivalent dynamics because the two operators commute.
general nonzero and self-consistently determined by the static variability across neurons, caused by the disorder (compare Figure 5). In the binary case, the second condition is Q b (0) = g 2 because the zero-lag autocorrelation of a single spin is always one. In a rate network, however, the input noise strength determines how quickly the autocorrelation decays, resulting in the condition on the derivativeQ r (0+) = −σ 2 ξ 2 [23] [95]. The idea is to choose the variance of the noise σ 2 ξ in the rate network such that Q r (0) = Q b (0), so that the time-lagged solutions for the variance Q(∆t) match.
To do so, usingQ(∞) = 0 and conservation of total "energy" V Q0 + τ 2Q2 2 implied by the Newtonian form of (10), the condition Q 0 ∶= Q r (0) ! = Q b (0) = g 2 can be expressed as a condition for the derivative and thus the noise amplitude Plugging inQ 0 = −σ 2 ξ 2 and solving for σ 2 ξ yields the condition (22) in the main text. This proves that the binary and rate model with appropriate noise have equivalent mean activities and time-lagged autocorrelations in dynamical mean-field approximation.

Explanation of the result
Taking a step back, what is the intuition behind this result? When the binary neurons are averaged over realizations of the update time disorder, the Poisson update process with rate τ −1 becomes an exponential kernel corresponding to that of the rate network. The discrete jumps of the binary neurons around their mean become red noise [96,97], corresponding to the low-pass-filtered noise η of the rate network (66). By nice conspiracy, this noise corresponds to simple white noise ξ in (21), which is also the version treated in most works on rate networks with noise, such as [22,23,33,47,98,99]. This tight relation between the binary and rate models is summarized conceptually in Figure 14.

Slope of correlation transmission in binary and rate neurons
Here we show that the difference between discrete signaling and continuous signaling leads to a qualitative difference in the slope of the correlation-transmission curve and thus the transition to chaos.
Assume, as an approximation, that two neurons receive inputs that are jointly Gaussian distributed as where the covariance matrix is given by Here c in ∈ [−1, 1] controls the correlation between the inputs.

Continuous signaling
A neuron with continuous signaling has the output where T ∈ [−1, 1] is an activation function. The mean output is thus For a point-symmetric gain function that we assume in the following the mean vanishes so that the variance of the outputs is The correlation coefficient between the outputs of a pair of neurons is which has the slope by Price's theorem [31]. Evaluated at c in = 1 this is For activation functions T with finite slope T ′ < ∞ this slope is thus finite. For the signum function T(x) = 2H(x) − 1 we get a = 1 and where the latter line comes from the normalization condition of the two-dimensional Gaussian distribution. Thus, the slope diverges if and only if the output of the neuron becomes discrete.

Discrete signaling
Now consider a neuron with discrete output, but smooth activation function T ∈ [−1, 1]; a smooth function here corresponds to a probabilistic activation The mean output is thus the same as for the continuous signaling (68). For a pointsymmetric gain function that we assume in the following the mean vanishes so that the variance of the outputs is a = 1. The correlation coefficient of the output is then identical to the second moment between the outputs of a pair of neurons The latter expression is related to the probability p diff = 1 2 ⟨ T(h 1 ) − T(h 2 ) ⟩ that the two neurons are in different states. This expression is of course the same as found in (55). In the limit of c in → 1 we thus have So the slope diverges for c in → 1 as This divergence is present even if the gain function has a finite slope T ′ < ∞. This is in qualitative contrast to the finite slope found for the continuous signaling in (70). The infinite slope for continuous signaling in the limit of a sharp activation function, (71), can be shown to have the same form of divergence for c in → 1 when expanded for c in = 1 − 1 q ǫ in the limit of small ǫ ≪ 1.

Noisy binary pattern classification task
We implement a classification task by training one linear readout S α ′ (t) = w α ′ (t) T (x α (t) + ξ pre ) + ξ post (73) of the network state x α (t) at time t for each of the α ′ = 1, . . . , P = 50 patterns to be detected. Here ξ pre and ξ post are additional Gaussian readout noises of standard deviation σ ξ,pre and σ ξ,post , respectively. ξ pre controls how precisely a single neuron's state can be read out. ξ post represents a noise component of the classification mechanism. Training of the readout w α ′ (t) is performed for each time point t by linear regression (see Appendix 10 a), minimizing the quadratic error of detecting the stimulus identity, i.e. minimizing (S α ′ − δ αα ′ ) 2 . The patterns are presented to the network by initializing the first L = 10 of the N = 500 neurons to the stimulus. All other neurons are in an initial state corresponding to the stationary statistics. Each stimulus α is a random binary pattern of length L with {−1, 1} appearing equally likely, superimposed with Gaussian noise of standard deviation σ. Note that because of the noise added to the binary values, these initial states are not strictly ∈ {−1, 1}. This freedom in the initial states is just a way to introduce the noise; after their first update the neurons' states are strictly {−1, 1} again. The resulting evolution of the network state given this initial condition is termed x α (t).

a. Linear regression
Minimizing the quadratic error over all patterns amounts to linear regression; we consider a single scalar readout target value y α ∈ R for each pattern α; in the example above y α ∈ {0, 1}. Then w, x ∈ R N and the quadratic error is Demanding stationarity with regard to w by differentiating by ∂ wi we get N equations 0 = P α=1 2 w T x α − y α x αi ∀i w T P α=1 x α x T α = P α=1 y α x T α .
The value w * to achieve stationarity is with C ∶= P α=1 x α x T α , where we use the symmetry of C. Inserted into (74) In the case of classification, the latter expression simplifies even further: The first term is a constant ∑ P α=1 y 2 α = 1 for labels y α ∈ {0, 1}, where y α = 1 if the presented pattern α is the pattern α ′ to be detected and y α = 0 else, y α = δ αα ′ . The second term is then identical to the definition (73) for ξ = 0, obtained by inserting w * from (75). The expression shows that the signal amplitude S α ′ actually depends on the signal-to-noise ratio, the length of the selected vector x α ′ measured with regard to the variability C across all patterns The generalization to stochastic realizations of x α , for example due to the presentation of noisy patterns is straightforward. We need to replace ∑ P α=1 . . . by ∑ P α=1 ⟨. . .⟩ in the measure for the error (74) and thus throughout this calculation, where ⟨. . .⟩ is the expectation over the noise realizations.

b. Approximation of orthogonal patterns and uniform noise
If the patterns x are sufficiently orthogonal in the signal subspace, we can think of the entries k of any state vector x α to be drawn independently. So the x α,k ∈ {−1, 1} appear with equal probability for those entries that lie in the subspace of dimension d s . All remaining entries are assumed to be constant across patterns. We may thus restrict the space to the d s informative components. The kl-th element of the covariance matrix for independently drawn entries is x αk x αl ≃ P δ kl .
The signal of the readout α ′ , following from (77), then takes the simple form If the signal is perfectly reliable, that is, if for all noise realizations i the response x αi is equal to the stereotypical response x αi =x α , and if the dimension of the informative subspace is d s , sox ∈ R ds , we get with x 2 ds = d s S α ′ ,max d s P .
If noisy realizations of patterns cause flips in random entries of x α ′ , which is an approximation since spins are expected to differ in their susceptibility, the responses are not perfectly reliable, so we need to replace x α ′ by ⟨x α ′ ⟩ in (77) and thus The above expressions thereby link the readout signal to the dimensionality of the responses, as discussed in the main text in Section II F.
c. Nonzero plateau of the signal.
In the simulations, the noise distance somewhat unintuitively saturates slightly below the signal distance. This is explainable by taking into account that not all the initial noise realizations actually cause a crossing of the flux-tube boundary. Instead, those realizations simply follow the unperturbed pattern trajectory, so that d n,i = 0 in those cases. Then it is clear that the average noise distance is smaller than the one predicted based on the assumption of diverging trajectories: We can estimate the probability that no flip occurred due to the noise by using the results from Appendix 7, where we calculate the average number of flips in the network after one time constant given an additional (noise) variance in the input of the neurons σ 2 ∆h . In our present case the noise is given by adding ξ i ∼ N (0, σ 2 ) on the output activities of the L original neurons of the pattern, so that the corresponding input variance felt by all neurons in the network isσ (1 − p single flip (k)) This result fits well with the simulations, yielding the predicted offset of the asymptotic average signal-and noise distances shown in Figure 7b and the asymptotic plateau of the approximated average signal in Figure 7c.

Description of simulations
Simulations for Figure 5 and Figure 4b,d were implemented using NEST [100]. NEST treats binary neurons in the bitlike {0, 1} representation. To let every neuron "see" inputs from {−1, 1} (Ising spins) we add to each neuron i a bias ∑ j J ij and then connect the neurons by the connections 2J ij instead of J ij ; thereby effectively simulating an Ising system. To obtain the autocorrelations for the Ising case, (41) is used, leading to the result shown in Figure 5. Furthermore, we use a non-point-symmetric activation function T(h) = tanh(h − Θ) by choosing a Θ to be nonzero for this plot. The reason is, first, that Θ = 0 leads to the theoretical prediction of maximal output variance ⟪x 2 ⟫ = 1 because the mean output activity ⟨x⟩ is 0. However, due to disorder, the time-averaged activity is actually a fluctuating quantity across the population and therefore, the population-averaged variance is always below 1 for finite systems. This systematic underestimation of the peak of the autocorrelation at zero time lag can be avoided only by choosing an activation function that is not point symmetric. Second, this choice is also natural because mean activity 0 would imply that neurons are active half of the time on average [101,Supp. Mat. II B], which is considerably more than indicated by the low firing rates measured in cortex [102]. For Figure 5, we have therefore shift the working point by numerically inverting ⟨x⟩ = ⟨tanh(h − Θ)⟩ h∼N (R,Q) to obtain the value for Θ, which in mean-field approximation corresponds to ⟨x⟩ = −0.5. In the simulations it turns out that ⟨x⟩ = −0.501, which agrees well with mean-field theory.
For Figure 4b and d, for each point of the grid one simulation of two identical networks was performed. After 1000 ms, in one replica, the first two neurons are set to the active state and the third and fourth neurons are set to the inactive state, after which the simulation continues for 2500 ms. This method of perturbation entails the small probability that these four neurons are already in exactly this state, so that nothing is changed; this is the explanation for the scattered single green dots in Figure 4b,d. The advantage of the method is, however, that it guarantees the same state of the random number generators across both replicas.
The simulations for Figure 7, Figure 9 and Figure 10 are performed using a custom FORTRAN kernel.
The simulations of the LIF network for Figure 11 are implemented in NEST, using the "iaf_psc_delta" neuron with its default parameter settings V rest = V reset = −70 mV, V th = −55 mV, τ m = 10 ms, t refr = 2 ms, C m = 250 pF. Asynchronous-irregular firing in the inhibitory network is evoked by supplying external excitatory Poisson input with rate of 3000 Hz and unit weight, leading to an average network activity of 19 Hz. After simulating for 10 τ m to obtain a state with stationary statistics, the input pattern is applied by causing a spike in the corresponding neurons. Noise is added to the pattern by additional external input spikes perturbing the membrane potentials of the L pattern neurons. Further pa-rameters as mentioned in figure caption: N = 500, J ii = −1mV, K = 125, τ m = 10 ms.
The LSTM network for Figure 12 is simulated using the vanilla PYTORCH implementation with a hidden layer of size N = 200. To obtain chaotic fluctuations over a range of timescales, the hidden weights W hi , W hf , W hg are initialized as ∼ N (0, 5.8 N ). This excludes the W ho weights, which would cause very rapid, erratic dynamics. Instead, these and all remaining weights and biases used the default uniformly distributed ∼ U(−1 √ N , 1 √ N ) initialization. After simulating for 50 time steps to obtain stationary statistics, the input patterns are supplied using the standard input function.
Analysis of simulation data and numerical solutions are implemented in PYTHON. The code to generate all figures is available as a Zenodo archive at doi.org/10.5281/zenodo.4705262.