Quality of internal representation shapes learning performance in feedback neural networks

A fundamental feature of complex biological systems is the ability to form feedback interactions with their environment. A prominent model for studying such interactions is reservoir computing, where learning acts on low-dimensional bottlenecks. Despite the simplicity of this learning scheme, the factors contributing to or hindering the success of training in reservoir networks are in general not well understood. In this work, we study non-linear feedback networks trained to generate a sinusoidal signal, and analyze how learning performance is shaped by the interplay between internal network dynamics and target properties. By performing exact mathematical analysis of linearized networks, we predict that learning performance is maximized when the target is characterized by an optimal, intermediate frequency which monotonically decreases with the strength of the internal reservoir connectivity. At the optimal frequency, the reservoir representation of the target signal is high-dimensional, de-synchronized, and thus maximally robust to noise. We show that our predictions successfully capture the qualitative behaviour of performance in non-linear networks. Moreover, we find that the relationship between internal representations and performance can be further exploited in trained non-linear networks to explain behaviours which do not have a linear counterpart. Our results indicate that a major determinant of learning success is the quality of the internal representation of the target, which in turn is shaped by an interplay between parameters controlling the internal network and those defining the task.


Introduction
A fundamental feature of the brain, and biological networks in general, is the ability to form closed-loop interactions with their environment.Such interactions are often implemented through a dimensionality bottleneck: while networks typically consist of large numbers of units, signals exchanged with the environment are low-dimensional.In fact, external stimuli can often be represented in terms of a few scalar variables (e.g. the angle and speed of a tennis ball approaching); these low-dimensional variables are encoded in the high-dimensional activity of a large population of neurons [1,2] before being again transformed into low-dimensional decision variables and motor outputs (e.g. the angle and speed of the hand holding the racket).
Simple but effective models for studying closed-loop interactions are feedback networks.These models implement a simple form of closed-loop interaction: the output (or readout) signal, which is extracted from a reservoir of randomly connected units as a linear combination of unit activities, is directly injected back into the reservoir as external input [3,4].By adjusting the weights which specify how reservoir activity is mapped to the output, feedback networks can be trained to produce the desired readout signal.In the most common training algorithms [5,6,7], readout weights are updated through least-squares (LS) regression; this can be performed only once, by using a complete batch of activity samples [5], or in an online fashion, by recursively integrating activity samples as they are simulated [7,8].
What kind of closed-loop dynamics can feedback networks implement?Despite some theoretical advancement [4,9,10,11,12], computational properties of feedback networks are still poorly understood.
Early theoretical work has indicated that most feedback models are expected to be able to approximate readout signals characterized by arbitrarily complex dynamics [4].However, it has been reported that not all feedback architectures and target dynamics result in the same performance: trained networks can experience dynamical instabilities [10,11], and converge to fragile solutions for certain choices of the feedback architecture and parameters [7,13].
For a fixed task, several studies have reported that training performance is strongly influenced by the overall strength of recurrent connections in the reservoir [14,7,6].Specifically, performance is high when recurrent connections are strong, but not strong enough to lead to the appearance of chaotic activity [15] -a parameter region named edge-of-chaos [16].Intuitively, the edge-of-chaos defines an optimal tradeoff point where the internal reservoir dynamics are rich but stable.
Reservoir activity, however, is not determined by connectivity alone: because the system is coupled to the environment, activity depends also on the statistics and dynamics of the target signal, which specify the task.How the internal reservoir dynamics interact with the target in determining trained networks performance is a fundamental question in feedback systems which is still not well understood [17].In particular: are there specific target features which optimize performance, and how do they depend on internal properties of the reservoir network?For given values of the target parameters, what are the properties of reservoir activity that support optimal training?How sensitive is the optimal performance to the learning algorithm?To the current date, these questions remain largely unsolved.
In this work, we consider a simple setup consisting of a non-linear reservoir of rate units which is trained to sustain a sinusoidal output with given frequency ω.Consistently across three different training techniques, we find that learning performance is maximized at a finite "preferred" frequency ω, which in turn depends on reservoir connectivity: as the connectivity strength is increased towards the edge of chaos, ω decreases towards zero.This nontrivial dependence of performance, even in a simple task, provides a test case to study the interplay between reservoir and target properties and its effects on learning.
To gain analytical insight into this phenomenon, we consider a simplified setup where reservoir dynamics are linearized, and perform exact mathematical analysis.By averaging over the ensemble of random reservoir networks, we characterize reservoir activity in response to the target signal, and show that a "resonance" frequency ω * emerges, which decreases with the connectivity strength.Under this frequency, dimensionality of neural activity is maximal and synchrony across different units in the reservoir is minimal.When training the network to output the target signal, feedback interactions are most robust in the vicinity of the resonance frequency, thus resulting in optimal performance.Moreover, this behaviour is predicted to be qualitatively consistent across different training algorithms, even if performance itself is sensitive to the algorithm used.We show that our theoretical predictions correctly capture the qualitative behaviour of learning performance observed numerically in non-linear network models.Overall, our results shed light on the learning capacity of recurrent network architectures by quantifying how learning precision is determined by the interaction between internal reservoir connectivity and target dynamics.

Emergence of a preferred frequency in trained feedback networks
We consider a reservoir network consisting of N units characterized by the evolution dynamics: where Φ(x) = tanh(x) is applied to the activation vector x element-wise.Recurrent weights J are fixed, and are generated independently from the normalized Gaussian distribution N (0, g 2 /N ) [15,7], so that the parameter g controls the strength of reservoir connectivity.The one-dimensional external signal u(t) acts as a forcing on the reservoir through input weights m, which are fixed and drawn as independent standard Gaussian variables.
The output of the reservoir network is a one-dimensional readout signal, defined as: through a set of decoding weights n that are assumed to be plastic.The feedback is realized by using the output signal as input: u(t) = z(t) (see Fig. 1A for an illustration), which yields the final autonomous dynamics ẋ(t) = −x(t) + (J + mn )Φ(x(t)). ( During training, the vector n is updated until the output z(t) best matches the desired target f (t).The target function that we consider is a simple sinusoidal wave of frequency ω, i.e. f (t) = A cos(ωt).
We trained multiple instances of this feedback architecture and analyzed how performance depends on the frequency of the target signal ω and on internal coupling strength g (Fig. 1).Three common training algorithms (least-squares (LS) regression, ridge regression [18] and recursive least-squares (RLS) [19,7]) were used (training details are reported in Appendix 4.1).We quantified the error as the mismatch between the target f (t) and the readout z(t) averaged over a finite number of target cycles in the post-training activity.
We observe that, for fixed reservoir connectivity g, the accuracy of signal reconstruction by the output strongly depends on the target frequency: while training on one frequency results in highly precise readout for many cycles, others result in a runaway from the target signal (Fig. 1B).For every value of g, the error has a non-monotonous dependence on ω, and reaches a minimum at a finite frequency that we name ω (Fig. 1C).Each curve, corresponding to a different value of g, has a different optimal frequency: specifically, ω decreases as the strength of reservoir connectivity g increases from zero towards the edgeof-chaos (Fig. 1D; see Appendix 4.11 for a characterization of the edge-of-chaos in our framework).
Although the exact value of the preferred frequency ω is found to be algorithm-dependent, the same qualitative behaviour is observed consistently across the three different algorithms we used for training.
It is also observed for both small and large amplitudes of the target signal A, which are expected to elicit, respectively, weakly or strongly non-linear activity in the reservoir.
The observations from Fig. 1 provide a striking example of the non-trivial interplay between reservoir features (the connectivity parameter g) and external task parameters (the target frequency ω) in determining learning performance.Because the network is completely random, one might naively think that its dynamics do not exhibit a typical timescale, and are thus blind to the signal frequency; instead, the network appears to have its preference even for a simple task.In the rest of this paper, we aim to understand this observation in detail through mathematical analysis.
To this end, we consider a simplified model which greatly eases the analysis: the case of linear reservoir dynamics (Φ(x) = x).The analysis strategy we use consists of two steps [10].To begin with, we examine the feedback network in an open-loop setup (Fig. 1A, yellow), where the encoding of the input and the decoding of the output signals can be analyzed separately.In the encoding phase, we take the input to the reservoir network to be identical to the target function: Preferred frequency Frequency ω C.
Frequency ω analytically the reservoir response x(t) both at the level of single units and the population as a whole (Section 2.2).In the decoding phase, we use the reservoir response to pick a readout n which allows the network to reconstruct the correct output: n x(t) = f (t) (Section 2.3).At that point, our feedback architecture admits the desired target as a solution; to investigate success of such solutions in performing the task, in Section 2.4 we close the loop (Fig. 1A, purple), and characterize dynamics stability.

Open loop: encoding the target signal
We begin our analysis by examining encoding in the open-loop framework: this corresponds to a random reservoir with linear dynamics driven by the target signal.The time evolution is described by here J is a Gaussian random matrix as defined above; to avoid dynamic instabilities, we consider g < 1 [20].The linear dynamics is indifferent to the amplitude of the input, so we set A = 1; in response to the periodic input f (t) = cos(ωt), the stationary solution for t → ∞ is where are complex conjugate vectors representing the reservoir activity in Fourier space (see Appendix 4.2).
A geometric description The stationary solution may be written as showing that activity occupies the plane spanned by two vectors v ± , which are the real and imaginary parts of x ± : R(x + ) = v + and I(x + ) = −v − .In this plane, the state-space trajectory is a closed, elliptic curve (Fig. 2A), with geometry determined by the spanning vectors.
The spanning vectors v ± , in turn, depend both on the recurrent connectivity J and on the driving frequency ω (Eq.( 6)).Their geometry is self-averaging in the limit of large networks, and can be computed by averaging over the ensemble of randomly connected reservoir networks (see Appendix 4.3).
Fig. 2B shows the dependence upon ω of the norms v ± .For very small frequencies, the trajectory follows the drive adiabatically and v − ≈ 0; there is practically only one spanning vector.As frequency Frequency ω Angle θ (rad) Frequency ω  increases, the response acquires a phase shift and the second spanning vector v − becomes non-negligible.
At high frequencies, both norms decrease due to the filtering property of the network; the second spanning vector thus obtains a maximal norm at an intermediate frequency.
We quantify the elliptical trajectory by its linear dimensionality, i.e. the participation ratio computed from the principal components of reservoir activity [21,22].Denoting the activity crosscorrelation matrix by C := 1 T T 0 x(t)x dt and its eigenvalues by ν i , the trajectory dimensionality d is defined as By using Eq. ( 7), and by integrating out time, we find that C is a rank-two matrix, C = 1 2 v + v + + v − v − , whose non-zero eigenvalues (which we take to be ν 1 , ν 2 ) are identical to those of the 2 × 2 reduced crosscorrelation matrix [23 Explicitly computing the eigenvalues of C R yields the expression We observe that the linear dimensionality, which is bounded between 1 and 2, is insensitive to the overall trajectory magnitude, but depends on the ratio of norms r = v − / v + and on the angle θ between the spanning vectors: The ratio r indicates how much the curve is squeezed along a single direction, with both extremes (r very small or very large) resulting in trajectories squeezed along the dominant spanning vector.For a fixed angle, as the ratio passes through r = 1, the trajectory goes through a shape which is most similar to a circle and has maximal dimensionality.For a fixed r, the angle θ determines to what degree the curve is skewed relative to a perfect ellipse; the dimensionality increases monotonically as θ opens up from zero to π/2.
Examination of the vector norms v ± in Fig. 2B indicates that they intersect at a frequency value that we name ω * , where r = 1.Fig. 2C shows how the angle θ varies as a function of frequency; surprisingly, we find that it displays a maximum at ω * .These dependencies are reflected in the behaviour of the dimensionality (Fig. 2D), which itself attains a maximum at frequency ω * .Our mathematical analysis reveals that (see Appendix 4.4) i.e. the resonance frequency ω * decreases to zero as g increases towards the instability boundary (g = 1).
This analytic result is in excellent agreement with finite network simulations, as shown in Fig. 2E.
A single-unit description The analysis above considered the geometry of trajectories spanned by the reservoir population in its high-dimensional activity space, and revealed that trajectory dimensionality is maximized at the resonance frequency ω * .An alternative viewpoint is obtained by considering the statistics of single-unit activity profiles across the population.As we shall see, this alternative perspective reveals that the optimal frequency ω * has a second natural interpretation in terms of population synchrony.
To do so, we derive a self-consistent expression for x + by inserting Eq. ( 5) into the evolution equations (Eq.( 4)): This form highlights that vector x + is given by the sum of two contributions: one associated with the external forcing via the input vector m, and one associated with the reservoir response via the recurrent input Jx + .Since J is random, the direction of the latter contribution is random (i.e., it varies across realizations of J), but its amplitude is self-averaging and depends on the strength of recurrent connectivity g [15].
We use Eq. ( 13) to gain intuition about how the network encodes the external oscillatory signal at the level of single-unit activity.To this end, we visualize the entries of the x + vector as points in the complex plane: (x + ) i = R i e iφi , where R i and φ i represent the amplitude and phase with which a single unit responds to the forcing input (Fig. 3A).How are points corresponding to different units distributed on the complex plane?When recurrent connections are very weak (g 0), different units behave as uncoupled filters of the input; we have x + m/(1 + iω), implying that the real and imaginary part of (x + ) i for different i are proportional one to each other.As a consequence, points on the complex plane are collinear (Fig. 3A left), and phases are identical: φ i = φ.Responses of different units are thus synchronized (Fig. 3B left).As g grows from 0, the second term in Eq. ( 13), which originates from recurrent interactions, starts spreading the real and imaginary parts of (x + ) i away from the line φ i = φ (Fig. 3A right), and introduces variability in response phases (Fig. 3B right).
For fixed values of g and ω, the distribution of dots on the complex plane is a bivariate Gaussian (Fig. 3A); a narrow distribution corresponds to highly synchronized units, and its broadening at stronger coupling indicates their desynchronization.As both m and J are generated from a centered Gaussian distribution, the mean of the distribution vanishes.The covariance is given by: implying that the shape distribution is controlled by the statistics of the spanning vectors v + and v − .
The similarity between the covariance matrix and the reduced cross-correlation matrix C R (Eq. ( 9)) analyzed in the previous paragraph suggests that synchrony in single-unit response and dimensionality of state-space trajectories are deeply related properties of reservoir activity.To formalize this relationship, A. C. we compute the spread of phases φ i across the reservoir population, i.e.
where p(φ) is the probability distribution of phases for a bivariate Gaussian distribution ( [24], see Appendix 4.5).The phase spread for different values of recurrent strength g and frequency ω is plotted in Fig. 3C.These results show that it monotonically increases with g; for any fixed g, it reaches a maximum at a finite frequency value, given again by ω * = 1 − g 2 (Fig. 3D).
To conclude, we have examined the behaviour of single-unit activity in response to a sinusoidal forcing input.In line with classical mean-field studies, we have analyzed the statistical distribution of single-unit activity profiles across the reservoir population [15,25,9].This approach has revealed that, for fixed g, ω * corresponds to the frequency at which single-unit activity is maximally desynchronized.
Note that historically, desynchronized single-unit profiles have been pointed out as a desirable feature of reservoir activity, as temporally heterogeneous profiles form a rich set of basis functions from which complex target functions can be reconstructed [26].

Open-loop setup: decoding the internal representation
After having characterized the reservoir activity during stimulus encoding, we turn to the decoding step of the open-loop analysis.Decoding corresponds to finding a readout vector n ∈ R N which satisfies: the projection of driven reservoir activity along n needs thus to match the target f (t) (Fig. 4A, yellow).
In terms of the Fourier-space representation (Eq.( 5)), n is a solution to the set of two linear equations given by When g = 0, interactions vanish and the equations above read n m = 1±iω, which cannot be satisfied by any n.This scenario corresponds to completely synchronized reservoir activity, or equivalently, activity spanning one-dimensional state-space trajectories.For any g > 0, on the other hand, this system of equations is under-determined, since it fixes only 2 among the N degrees of freedom in n.
We explore the effect of these degrees of freedom by defining a family of readout vectors n parametrized by an integer k, where k = 2, . . ., N ; k indicates the number of reservoir units from which the readout signal is reconstructed.We term such solutions from-k regression.To obtain such a solution, we set all elements except for the first k of n to zero, and then solve Eq. ( 17) by considering the leastsquares (LS) solution of minimal norm, which can be computed through the pseudo-inverse (see Appendix 4.6).When k = N , we obtain the full LS solution, which reads: or, in terms of v ± vectors: where we defined the short-hand notation All the readouts within the from-k family exactly solve the task in the open-loop setup.However, it is not clear a-priori whether all of them are equivalent when closing the loop, i.e. when the feedback network is required to autonomously generate the target signal (Eq.( 3)).In the following, we assess dynamics and stability of closed-loop networks corresponding to the different choices of the readout n.

Closing the loop: autonomous signal generation
In the previous two sections, we have analyzed how random networks encode a one-dimensional periodic signal, and how the network response can be used to reconstruct the same signal as output.Ultimately, we want the encoding and the decoding steps to be self-consistent, i.e. we require which is equivalent to transforming our problem from an open-loop to a closed-loop setup, where dynamics are autonomous and follow Eq. ( 3) with linear interactions: and n satisfies Eq. ( 16).This step is illustrated in Fig. 4A by the purple feedback arrow connecting the reservoir output to the input.If closing the loop does not perturb activity in the reservoir by changing its stability properties, then at every time point the readout n x(t) = cos(ωt) is fed back into the system, and the solution obtained through the open-loop setup is self-consistent.
The solutions to Eq. ( 21) and their stability are fully characterized by the eigenspectrum of J = J + mn (the leak term in the dynamics contributes by uniformly shifting the spectrum by −1).
For N sufficiently large, the eigenvalues of J are distributed uniformly in a disk of radius g < 1 [20].
The position of some or all of the eigenvalues can, however, be modified by the rank-one perturbation mn ; we refer to these as outliers.In order for the closed-loop system to stably sustain the periodic activity we found in the encoding step, the eigenspectrum of J must satisfy two key requirements: (i) a pair of complex outlier eigenvalues with value: λ ± = 1 ± iω (which ensures that a periodic trajectory of frequency ω is realized), and (ii) a stable bulk of remaining eigenvalues: R(λ) < 1 ∀λ = λ ± (which ensures that no runaway activity is generated along other directions).
All eigenvalues of J are roots of the characteristic polynomial det (J + mn ) − λI = 0. Fraction unst.networks The Matrix Determinant Lemma [13,12] allows us to decompose this polynomial into two factors, corresponding to the two sets of eigenvalues: It is seen that the second term vanishes on the spectrum of J, whereas the first term vanishes for the outlier eigenvalues.The outliers therefore satisfy If n satisfies Eq. ( 16), then λ ± = 1 ± iω are indeed solutions, implying that condition (i) is satisfied.
Note that the eigenvectors corresponding to λ ± are identical to the vectors x ± , as (J + mn Thus fixing n in the open-loop framework is equivalent to directly controlling the value of the targetrelevant eigenvalues in the eigenspectrum of the closed-loop network. We next examine whether this pair of eigenvalues are the only outliers generated by closing the loop: while Eq. ( 24) is guaranteed to have λ ± = 1 ± iω as solutions, other solutions might be admitted which could violate requirement (ii).Such potential solutions depend on the overlap between the vectors n and x λ = (λI − J) −1 m.Note that if the readout n was random (and thus orthogonal to J and m), this overlap would vanish and no additional outliers would be generated.
In the case of the full LS solution (k = N , Eqs. ( 18)-( 19)), the readout vector n LS is contained in the plane spanned by vectors v ± .As a consequence, the overlap between n and x λ can be expanded in terms of x ± x λ .In the limit N → ∞, these terms have a simple form which can be evaluated analytically (see Appendix 4.8), yielding an equation in λ which reads: where P 11 and P 21 are the elements of the first column of P = C R −1 and depend on g and ω.As the equation above is quadratic, it admits λ = λ ± as unique solutions.Therefore, in large networks, LS training is guaranteed to result in stable dynamics, as no additional outliers are generated in the eigenspectrum other than the task-relevant ones.This is confirmed by numerical simulation in the left panels of Fig. 4B.
In the more general case of from-k LS regressors with k < N , readout vectors might contain extra components that are correlated with J and m and are not fully contained within the spanning plane; as a consequence, more than two outlier eigenvalues and unstable dynamics can be expected.The right panels of Fig. 4B show an example of such a situation (in the simulation, k = 2).One outlier eigenvalue with R(λ) > 1 is seen in the top panel, which induces the dynamic instability seen in the bottom panel.
As k decreases from the maximal value (N ) to the minimal one (2), the component of the readout vector n outside of the v ± plane becomes larger (Fig. 4C).Numerical analysis indicates that, correspondingly, the fraction of networks with unstable dynamics increases (Fig. 4D).
In summary, we have shown that -although the open-loop setup admits multiple exact solutions -different solutions are not equivalent in terms of dynamical stability when the loop is closed.
Stability properties are related to the orientation of the readout vector relative to the driven open-loop trajectory.In the case of the full LS solution (Eq. ( 18)), the readout n LS is completely aligned with the trajectory plane, and closed-loop dynamics are guaranteed to be stable.Other solutions generally contain components outside of this plane, which can cause activity to diverge.

Predicting performance of trained linear networks
We now turn back to the problem of understanding performance in trained feedback networks and its dependence on the target frequency ω.We start by considering linear feedback networks which are trained (as in Fig. 1C  Consider first the encoding phase of learning (Section 2.2), where reservoir activity is stimulated.
Because of noise, learning algorithms may not have access to the true spanning vectors v ± .Rather, we assume that corrupted versions ṽ± = v ± + ξ ± (where the entries of ξ ± are independent Gaussian noise) are measured.The estimated LS readout then reads: where P is the inverse reduced cross-correlation matrix which includes the noise disturbance.
As in Section 2.4, we can characterize closed-loop dynamics by computing the outlier eigenvalues of J + m ñ LS .The second term in the r.h.s. of Eq. ( 27) is random and orthogonal to m and J, and therefore does not affect the position of outlier eigenvalues.In contrast, the first term is a vector fully aligned with the noise-free spanning vectors v ± , which generates two outlier eigenvalues λ± .Because of the noise, their values deviate from the target eigenvalues λ ± ; they are solutions of an equation identical to Eq. ( 26), but with P replaced by P .For every noise realization, the inverse reduced cross-correlation matrix P is perturbed in a random direction, yielding random modifications to the target eigenvalues λ ± .We can estimate the average mismatch between λ± and λ ± from the sensitivity of matrix P to perturbations, which is quantified by the condition number of the reduced correlation matrix C R , i.e. the ratio between the largest and smallest eigenvalue [27,17]: The value of c and its dependence on ω and g can be computed by taking the limit N → ∞ and averaging over the networks ensemble.Fig. 5A shows that, for fixed connectivity strength g, the condition number is a non-monotonic function of the forcing frequency ω, and attains a minimum at the resonance frequency ω * = 1 − g 2 (see Appendix 4.9).Thus, when training linear feedback networks through noisy LS regression we expect that, for fixed g, the readout would be closest to the desired one at ω = ω * , where the reduced cross-correlation matrix is most robust to noise.This robustness directly reflects the properties of the internal representation of the target signal within the reservoir, which is characterized by maximal dimensionality and minimal synchrony at ω = ω * .
We tested this prediction on finite-size trained networks.Examples from Fig. 5B confirm that the task-related eigenvalue pair λ± deviate from the target ones.As in the case of non-linear networks (Fig. 1B-C), we find that the error is frequency dependent (Fig. 5C, left); furthermore, for fixed strength of the internal connectivity g, we observe that the error is minimized at a frequency ω which is very close to ω * (Fig. 5D, left).
As a second way to characterize performance, like in Fig. 1, we considered feedback networks trained via Ridge regression [18].In this case, the readout vector is deterministic; in the Fourier space, it can be expressed as (see Appendix 4.10): where P = C R −1 + N σ 2 I and I is the 2 × 2 identity matrix.As in the case of noisy LS regression, also this readout vector generates only two outlier eigenvalues λ± , whose values can be again computed through Eq. ( 26); in this case a closed-form expression can be computed (Appendix 4.10).
For moderate values of the regularization parameter σ, the resulting λ± are complex conjugates that deviate somewhat from λ ± (see Supp.Fig. 10 for the full bifurcation diagram).Specifically, their real part is always smaller than 1, implying that the resulting autonomous dynamics are always stable (see Appendix 4.10).The amplitude of the mismatch between the real and the imaginary parts of λ± and the target eigenvalues λ ± depends both on g and ω (see Appendix 4.10), and is minimized at a finite frequency ω which monotonically decreases with increasing g (Fig. 5D center, solid lines).Importantly, the value of ω is predicted to behave similarly (although not identically) to ω * .Fig. 5D (middle) shows an excellent match between these predictions and simulation results.
As a third and final example, we considered linear networks trained via the RLS algorithm [19,7].In this case, an analytical description of the closed-loop spectrum and resulting dynamics is much harder to obtain; we thus computed the value of the preferred frequency ω from simulations.We found that the mismatch between λ± and λ ± displays a strong, non-monotonic dependence on the target frequency (Fig. 5C, right); the preferred frequency ω is, again, quite close to ω * (Fig. 5D, right).To conclude, we analyzed performance in linear feedback networks; as for non-linear networks (Fig. 1), we found that performance is maximized for a preferred frequency ω which decreases with the connectivity strength g.Analysing how the simple LS readout solution interacts with noise, we predicted that the preferred frequency ω is expected to lay close to ω * , i.e. the resonance frequency where encoding dynamics has maximal dimensionality and is minimally synchronized.This prediction is exactly verified in networks trained via LS regression, but also carry over in a qualitative fashion to networks trained via different training algorithms.In fact, we showed that different algorithms are affected by different kinds of biases, whose effect is to shift the value of the preferred frequency ω away from ω * without changing its overall qualitative behaviour.

Internal representation in non-linear networks
We finally turn back to the original problem of analyzing training performance in non-linear feedback networks (Fig. 1).Our analysis of linear networks revealed that a key feature which determines training performance is the quality of representation of the target signal within the reservoir.This representation can be characterised by its dimensionality or, equivalently, by the synchrony of activity across units in the reservoir.
Guided by these insights, we examined the properties of open-loop dynamics (Eq.( 1)) in nonlinear networks.Because of the non-linearity, the neural trajectory x(t) is in this case not planar, but curved along many dimensions (Fig. 6A); most of its variance, however, is still explained by two directions (Fig. 6B).We investigated numerically the properties of non-linear target representations by using the same measures as for linear networks, namely the dimensionality and the spread of phases across units.Although in non-linear systems these are not equivalent measures, we find that their behaviour is qualitatively similar to one another, and to the behaviour of their analogues in linear systems (Fig. 6C-D left).First, both measures increase monotonically with the connectivity strength g.Second, for any fixed value of g, both measures display a maximum at an intermediate frequency ω * .
In the middle panels of Fig. 6C-D, we display the resonance frequency ω * computed from both measures of non-linear representations (left panels) across various values of g and for three target amplitudes (legend).As in the linear case, we find that the value of ω * decreases with the connectivity strength g; unlike the linear case, however, it depends on the target amplitude A. For small target amplitudes (light gray), both measures of non-linear representations yield values of ω * which are quantitatively very close to the values predicted by the linear theory, i.e. 1 − g 2 (yellow line).This is expected, as for low-amplitude driving the reservoir activity mostly remains in the vicinity of the origin, a region where the non-linear dynamics are approximately linear.In the nonlinear case, however, as the target amplitude A increases (darker shades of gray, see legend), the resonance frequency ω * also increases.The decrease of ω * with g is retained, but to a lesser extent.
In the right panels of Fig. 6C-D, we compare the resonance frequency ω * predicted from an-alyzing non-linear representations to the preferred frequency ω which minimizes training performance (Fig. 1).Although the two quantities do not exactly coincide, they display significant correlations.Remarkably, the value of ω * correctly captures the behaviour of the preferred frequency ω with the target amplitude A: like ω * , ω increases with A, as can be seen by the clustering of different shades of grey in

Discussion
Ubiquitously across biology, complex high-dimensional systems interact with their environment through low-dimensional channels.The computational modelling of such setups has advanced considerably in the past two decades with the emergence of reservoir computing techniques [3,26], where learning acts on such low-dimensional bottlenecks.Despite the simplicity of this learning scheme, the factors contributing to or hindering the success of training in reservoir networks are in general not well understood [17].In particular, a theory is lacking for predicting -based on the characteristics of the reservoir and the target function -dynamics and performance of trained feedback networks.
In this work, we studied learning performance of feedback networks trained to self-sustain a sinusoidal readout signal.Through mathematical analysis, we showed that learning performance is mostly controlled by the quality of the internal representation of the target signal.This quality can be quantified by analyzing the open-loop dynamics and measuring the condition number of their cross-correlation matrix, a number that characterizes to what extent the network dynamics is robust to training noise.
We found that the condition number displays, like training performance, a complex dependence on the parameters controlling the reservoir internal properties (strength of reservoir connectivity g) and the readout target function (frequency ω).The parameter values where the condition number is minimized, , define an optimal spot for learning.At this optimal point, internal representations are characterized by maximal dimensionality and minimal synchrony, which are two ways of quantifying the richness of the dynamic repertoire available to the learning algorithm.Our insights were derived by studying linearized dynamics and were later tested on non-linear networks, where they successfully capture non-trivial aspects of training performance.
The condition number of the cross-correlation matrix has been pointed out in several studies as a key quantity in determining performance [17,27].Our work analytically quantifies those empirical observations in the framework of networks trained on a simple task via common LS-based algorithms.
We have shown, however, that performance might depend on other features, such as closed-loop stability, D. Using phase spread of driven trajectories to predict performance.Unit activities x i (t) were fitted with sinusoidal functions of the driving frequency ω, and the variance of the phase distribution was measured.
Left, center and right panels are the same as in C.
for other non-standard algorithms (see Fig. 4).
Importantly, our analysis differentiates between two properties that might hinder network performance: high-norm readouts and non-normality.In a number of classic studies [7,17,12], large norms of the readout vector have been associated with impaired performance.In addition, recent observations indicate that training performance is low in parameter regions where the open-loop dynamics is highly non-normal [13], and link low performance to large readout vectors.In our framework, the two properties can be analyzed separately.Non-normality can be measured from the angle θ between the two activity eigenvectors v ± ; Fig. 2C indicates that non-normality is minimal at the resonance frequency ω * .The norm of the readout vector n LS can be instead derived from Eq. ( 16) (see Appendix 4.7); we show in Supp.Fig. 9 that, for every value of connectivity g, the norm of the readout vector is monotonic in the target frequency ω.We conclude that these two quantities are not equivalent predictors of learning performance; in our setting, training performance is optimal close to ω * , so that non-normality is identified as the dominating factor in controlling performance.
Several studies have supported the hypothesis that learning capability is maximized in the parameter region where dynamics is close to the boundary between ordered and chaotic activity, i.e. the edge-of-chaos [14,7,6].Our findings are consistent with this hypothesis: we have shown that the condition number (and, consequently, the training error) monotonically decreases as the strength of reservoir connectivity g is increased from 0 towards its critical value.However, our analysis has shown that, together with the strength of internal connectivity, learning performance is crucially shaped by the properties of the target function.By analysing non-linear networks, furthermore, we have found that the parameter region characterized by maximally high-dimensional and de-synchronized internal representations does not necessarily coincide with the edge-of-chaos; the two regions in fact diverge as the target amplitude A is increased and activity becomes strongly non-linear (Figs. 6 and 11).Specifically, as A increases, the critical frequency where activity becomes chaotic moves to very high values [25] (Fig. 11), while the resonance frequency ω * (which measures activity dimensionality and synchrony) remains close to the training-preferred frequency ω (Fig. 6).This result suggests that future research should focus on characterizing the properties of driven non-linear activity rather than analysing the transition to chaos per-se.
The numerical analysis of non-linear networks (Fig. 6), which was led by the insights gained from the linear theory, suggests that representation quality is a major determinant of closed-loop performance also in the case of non-linear networks.Exploiting the link between the two, we were able to predict the dependence of the preferred frequency ω on both the internal connectivity g and the target amplitude A, which plays no role in the linear counterpart.This is despite the fact that the non-linearity of the dynamics introduces, in trained networks, new qualitative behaviours which do not exist in linear networks.In particular, we observe that the training error (and, consequently, the value of ω) strongly depends on the hyper-parameters controlling the stability of the limit cycle which constitutes the internal representation (see Appendix 4.1 and Supp.Fig. 8).In this respect, a more detailed analysis is called for; we hope that future work would extend our analytic framework to cover non-linear reservoirs.
and Srdjan Ostojic and Manuel Beiran for their feedback on a previous version of the manuscript.LS would like to thank Friedrich Schuessler for helpful discussions.

Training of feedback networks
In the following, we report the procedures used to train feedback architectures (Figs. 1 and 5).Procedures are detailed for the general case of non-linear networks; the case of linear networks corresponds to taking Φ(x) = x.Results are averages across 1000 different network and training realizations.the cross-correlation matrix and to ease local stability in non-linear networks, white noise is then added on top of activity: Φ = Φ + σ LS ξ, where ξ is a L × N matrix of standard Gaussian variables.The trained readout vector n is finally computed as:

LS regression training
In linear networks, training performance is measured in terms of the mismatch between the target outlier eigenvalues λ ± (see Section 2.4) and the outlier eigenvalues λ± , defined as the pair of complex conjugate eigenvalues of J = J + mn whose real part is maximally close to one.In non-linear networks, performance is measured on closed-loop activity.To this end, closed loop dynamics (Eq.( 3)) is simulated from t = 0 to t = T tot .The initial condition is taken to be equal to activity in the last time step of the open-loop simulation; on top of it, an N -dimensional vector of white noise of amplitude σ pert A is added.The latter perturbation was used to take into account training error generated by unstable local dynamics; we take σ pert = 0 in linear networks.To measure the test error we fitted a sinusoidal function F (t) of fixed amplitude A and frequency ω to the readout signal z = n Φ obtained in the closed-loop simulation, yielding a novel L-dimensional vector F .Readout error is finally measured as , where the average is taken over all the integration time points from t = 0 to t = T tot .If the fit fails, we set the readout error to 1. Parameters used in Fig. 1

RLS training
Training is performed in the closed-loop setup, from t = 0 to t = T tot , with T = N tot 2π/ω and N tot = 20 (Fig. 1) or 10 (Fig. 5).At t = 0, an N × N -dimensional matrix P is initialized as: P = I/α, where I indicates the N -dimensional identity matrix and α is a free parameter.Matrix P represents a running estimate of the inverse of the activity cross-correlation matrix [7].Readout vector n is further initialized with zero entries.At every learning step, closed-loop activity is simulated from t 0 to t 0 + τ (Eq.( 3)), with τ = (2π/ω)/500.Activity at t = t 0 + τ is stored in an N -dimensional vector Φ.
Matrix P is then updated as [7]: The readout vector n is then updated as: where the error e is measured as: e = z(t 0 + τ ) − f (t 0 + τ ).Once training is completed, performance is measured as in the LS case.Parameters used in Fig. 1 are N = 400, α = 1 and σ pert = 0.1.Parameters used in Fig. 5 are N = 400, α = 1 and σ pert = 0.

Analysis of linear open-loop reservoirs
For a general input f (t), the system of linear equations Eq. ( 4) admits the asymptotic solution (t → ∞) which for a complex exponential f (t) = e st for s ∈ C, simplifies to In particular, for f (t) = cos(ωt) = 1 2 (e iωt + e −iωt ), one finds the expression in Eq. ( 5) in the main text.Note that x ± defined in Eq. ( 5) correspond to the amplitude of the peaks of the Fourier transform of x(t); indeed, we have: In deriving Eq. ( 7), we defined:  A.

Statistics of spanning vectors v + and v −
In this section, we characterize the geometry of vectors v + and v − in terms of their norms and overlap.
We start by evaluating the dot product: with a, b ∈ C. If the eigenvalues of J/a and J/b have real part smaller than one, we can use the power series expansion so that Eq. ( 38) becomes Since J is random, the value of this expression randomly fluctuates across different realizations of recurrent connectivity J.We thus turn to a statistical characterization, and evaluate Eq. ( 38) by computing its mean and variance with respect to different realizations of J.
The mean yields, to the leading order in N [12]: We have used: which comes from observing that J p is a random matrix, which is uncorrelated to J q for q = p has variance g 2p /N .This yields: from which we obtain Eq. (42).
The variance can be computed in a similar way.Like the mean, the variance is characterized by O(N ) scaling [12]; as a consequence, variability due to different realizations of J does not enter the dot product Eq.(38) to the leading order in N , and dot products can be replaced with their mean (Eq.( 41)) when N → ∞.
We can now compute the mean norm of the spanning vectors from combining Eqs.(37) and (41): the calculation of the norm of v − is very similar, and only differs in the sign of the first summand in the second line above: Finally, when computing the dot product between the two vectors, the cross terms cancel to yield With these expressions, the angle θ between v + and v − can be written as By denoting ε = 1 − g 2 , we now summarize the statistics of the spanning vectors v ± : and angle between the two spanning vectors reads:

Analysis of geometric properties of driven trajectories
We can use the expressions computed in Appendix 4.3 to evaluate the participation ratio d: which, again, implies a nontrivial maximum of d at ω * (g) = 1 − g 2 .
Finally, we observe that g = 1 is a singular point, as in the limit of low frequency we have but the overlap vanishes and the participation ratio attains its global maximum Note that the limits lim g→1,ω→0 d and lim g→1,ω→0 cos(θ) do not exist, since they depend on the order of limits taken.To see this, compare Eqs. ( 57) and (62).

Distribution of response phases
The phase spread in response of different units (Eq.( 15)) was computed as a numerical integral performed over the probability distribution p(φ) whose analytical form is available in [24].We used: where ρ = cos(θ).From [24] we also used:

From-k regression
In this section, we explain how from-k least-squares regression (Fig. 4) is performed.
For 2 ≤ k ≤ N , we define the cropped spanning vectors as where [v ± ] i indicates the i-th element of vectors v ± .For every k, the from-k LS regressor is given by the pseudo-inverse, which yields the LS regressor of minimum norm: Note that n N LS = n LS .
In the following, we show that any from-k readout vector n k LS with k < N is not fully contained in the plane spanned by vectors v ± .The from-k readout vector n k LS is contained in the plane spanned by cropped vectors v k ± .Consider now any vector a which is orthogonal to the reservoir trajectory plane spanned by v ± .We have We have that cropped vectors v k ± do overlap with vector a, because: As a result, the readout vector n k LS also has a nonzero overlap with a.

Analysis of least-squares regression: norm
We analytically compute the norm of the least-squares readout solution n LS (k = N , Eqs. ( 18) and ( 19)).
We start from Eq. ( 19) to write: The vector within the parenthesis above is contained in the activity-spanning plane and is orthogonal to v − , and has norm the norm of n LS is therefore given by We find that this expression is monotonically increasing in both g and ω, as shown in Supp.Fig. 9.

Analysis of least-squares regression: outlier eigenvalues
In this section, we compute the outlier eigenvalues of J which result from full LS regression (k = N , Eqs. ( 18) and ( 19)).As derived in the main text, outliers obey: Because of Eq. ( 17), we know that the equation above admits the solutions λ = λ ± = 1 ± ω.In the following, we show that λ ± is in fact the only solution admitted.To this end, we use Eq. ( 19) to rewrite the equation as: where we defined the short-hand notation P := C R −1 .A little algebra yields: We can evaluate dot products in the form: by following Eq.(41), which was derived in Appendix 4.3 by averaging over the random connectivity J.
This yields: As we know that the quadratic equation above is satisfied by λ = λ ± , we conclude that Eq. (72) cannot admit other solutions beyond these two.

Condition number of the cross-correlation matrix
The condition number of the reduced cross-correlation matrix C R is defined as: where ν 1 and ν 2 are the two eigenvalues of C R .Their value can be computed as a function of the statistics of the spanning vectors v ± , which in turns depend on ω and g (see Appendix 4.3).
In this section, we show that for fixed g, the condition number c is minimized at the same value of ω which maximizes the participation ratio d; this frequency coincides with ω * (see Appendix 4.4). Using: we have: The participation ratio d is maximized when the quantity ∆ γ is minimized, which is precisely where the condition number c is minimized.

Analysis of ridge regression
We start by computing the readout vector which performs ridge regression in the Fourier space.The ridge regressor of Eq. ( 17) can be written in terms of the v ± spanning vectors as [18]: where P = C R + N σ 2 I −1 i.e.
This yields the ridge regressor readout: where v indicates a normalized vector.By comparison with the LS regressor, Eq. (70), we observe that σ has two effects on the readout: first, it reduces the norm, as expected from a regularizer.Second, it biases the readout vector towards v + .
The outlier eigenvalues λ± imposed by the ridge regressor can be found by utilizing the same strategy as in Appendix 4.8.We insert P 11 = P11 , P 21 = P21 and λ ± = 1 ± ω into Eq.( 77) to obtain the equation for the outlier eigenvalues λ: Depending on the values of σ, ω and g, the equation above admits real or complex conjugate eigenvalues (see Supp.Fig. 10).For low frequencies, the dynamics are characterised by two real eigenvalues λ± .As ω increases, a complex conjugate pair of eigenvalues is formed.Importantly, their real part is always smaller than 1, yielding stable closed-loop dynamics.To see this, we approximate the real part of the solution to Eq. ( 86) by assuming that σ 1: Now, by use of Eq. ( 48) we evaluate implying that the pre-factor for the σ 2 term in the denominator is always larger than 1; as a consequence, σ > 0 always reduces the real part.
To conclude, note that the analysis above allows to predict the behaviour of outlier eigenvalues when ridge regression is performed in the Fourier space (i.e. from the 2-dimensional system of equations in Eq. ( 17)).In Figs. 1 and 5, however, regression is performed in the temporal domain, on a higherdimensional (L -dimensional) set of equations (see Appendix 4.1).In order to compare the analytical prediction with trained networks, in Fig. 5 we thus scale the regularization parameter w.r.t. the value of σ which is used to derive analytical predictions, i.e. we set σ R 2 = σ 2 • L /2 (see Appendix 4.1).

Characterization of the edge-of-chaos in non-linear networks
In analysing non-linear networks (Figs. 1 and 6), we varied the strength of internal connectivity g in such a way that the open-loop dynamics driven by the target function f (t) remains non-chaotic [15] for every value of the forcing frequency tested.The critical value of connectivity strength g c at which open-loop dynamics becomes chaotic depends on the target frequency ω and amplitude A [25], and was investigated numerically (Fig. 11).
In order to find the critical values g c , we start by computing the Lyapunov dimension d L of driven activity [28,29], which is defined based on the Lyapunov spectrum Λ = {µ i } N i=1 .Intuitively,

Figure 1 :
Figure 1: Emergence of preferred frequency in non-linear feedback networks trained to sustain a sinusoidal output.A. Illustration of network architecture used in open (yellow) and closed (purple) loop.B. Example readout signal (dark grey; target shown in yellow) for three learning trials corresponding to the frequency values indicated in green in C (w = 0.1, 0.7, 2.1; g = 1).Other parameters as in C, except (for illustration purposes) training is performed on a smaller number of target cycles (N tot = 4 and N tr = 2, see Appendix 4.1).LS regression was used for training; examples trials for Ridge and RLS are reported in Supp.Fig. 8. C. Readout error as a function of ω, for a range of g values (blue shades), for networks trained via LS (left), ridge regression (middle) and RLS (right).We take here A = 1.Training details and parameters are reported in Appendix 4.1.D. Error-minimizing frequency ω as a function of g, for the three learning algorithms as in C. Three different target amplitudes A were tested (grey shades).

Figure 2 :
Figure 2: Encoding of the target signal: geometry of activity trajectories.A. Projection of one example trajectory x(t) (Eq.(7)) on the plane spanned by vectors v ± .B. Norm of the two vectors v ± as a function of the target frequency ω. C. Angle between the two spanning vectors v ± .In A-B-C we used g = 0.5.In B-C, the black vertical line indicates the resonance frequency ω * where the two norms are equal (r = 1, B) and the angle is maximized (C).D. Linear dimensionality: participation ratio computed from the principal components of activity.We plot results for five increasing values of g (blue shades); black stars indicate the position of ω * for every value of g.E. Resonance frequency ω * .In all panels, continuous lines indicate the analytical predictions.In D, dots show results averaged over 20 simulations of finite size networks, N = 2000.

Figure 3 :
Figure 3: Encoding of the target signal: single-unit description.A. Entries of vector x + in the complex plane.Left: g = 0.1, right: g = 0.5; w = 0.6 for both panels.Continuous lines are contour lines of the bivariate Gaussian distribution predicted by the theory.The most external contour indicates a probability of 0.01.Grey points are results of a finite network simulation (N = 400).B. Sample of activity from four randomly selected units chosen from the corresponding panel in A. C. Spread of response phases across the population (Eq.(15)) for increasing values of g and as a function of ω.Black stars indicate the maximum value.D. Value of the frequency which maximizes the spread of phases from C.

Figure 4 :
Figure 4: Closing the loop.A. Transforming the open-loop encoding/decoding setup (yellow) into a closed-loop system (purple).B. Sample networks trained through full LS (from-N, left) or from-2 (right) regression.The top panels show the eigenspectra of the closed-loop connectivity matrix J (red dots); small black dots indicate the unperturbed eigenspectrum of J.The bottom panels show the output generated by the corresponding networks.Here we used parameters g = 0.8, ω = 0.6 and N = 400.C.Overlap between readout n and the principal components (PC) of driven reservoir activity (Eq.(7)), of which the first two span the v ± plane.The same parameters as in B were used.D. Fraction of unstable closed-loop systems (over 2000 sample networks) as a function of connectivity strength g, for several values of k, measured over 1000 different realizations.We used N = 1000 and ω = 0.6.

9 Figure 5 :
Figure 5: Training performance in linear networks is maximized at ω * .A. Condition number of the reduced cross-correlation matrix C R computed analytically for different values of g (blue shades in C).Stars denote minimal condition number.B. Closed-loop outlier eigenvalues λ± for example networks from three learning trials corresponding to the frequencies marked by green triangles in C (w = 0.1, 0.4, 1.5; g = 0.9).C. Spectrum error as a function of ω, for a range of g values (blue shades), for networks trained via noisy LS (left), ridge regression (middle) and RLS (right).Error is measured from the imaginary part of outlier eigenvalues as |I( λ± ) − I(λ ± )|/ω, and similarly for the real part; the spectrum error is an average of the two.Training details and parameters are reported in Appendix 4.1.D. Frequency ω minimizing the error in the real and imaginary part of λ± as a function of g, for the three training algorithms.Solid orange lines in the middle panel show theoretical prediction for ridge regression (see Appendix 4.10).

Figure 6 :Fig. 1 )
Figure 6: Using open-loop reservoir dynamics to predict training performance in non-linear networks.A. Reservoir trajectories x(t) in driven non-linear networks: example trajectory projected onto the first three PCs of network activity (note scale of PC3 axis).B. Variance explained by projecting trajectories on the first two PC axes (note scale of variance).C. Using representation dimensionality to predict training performance.Left: participation ratio of driven trajectory as a function of ω for a range of g values (blue shades).Results are averages over 20 simulations of networks of size N = 2000, with A = 1.Center: frequency ω * , measured as the position of maximum dimensionality, as a function of g for three different values of A (grey shades).Right: error-minimizing frequency ω (from Fig. 1) plotted against ω * (from center panel).Results are shown for three training algorithms (legend).
Training is performed in the open-loop setup.In a first phase, open-loop activity (where we enforce u(t) = A cos(ωt) in Eq. (1)) is simulated by using the Scipy odeint routine from t = 0 to t = T tot , with T = N tot 2π/ω.Activity is stored in a L × N matrix Φ, where L indicates the number of time points used for integration.The L-dimensional vector F is constructed by computing the target function f (t) at the same time points.Activity and target function from t = 0 to T tr = N tr 2π/ω are later discarded, resulting in L × N and L × 1 matrices Φ and F , where L indicates the number of time points kept after discarding the transient.We used N tot = 20 and N tr = 8.In order to regularize are N = 400, σ LS = 0.01 and σ pert = 0.1.Parameters used in Fig. 5 are N = 400, σ LS = 0.01 and σ pert = 0. Ridge regression training As in the LS case, training is performed in the open-loop setup.The L × N open-loop activity Φ matrix is obtained as above.The trained readout vector n is then computed as:n = (Φ Φ + (σ R ) 2 I) −1 Φ F (31)where I indicates the N -dimensional identity matrix.Training performance is measured as in the LS case.Parameters used in Fig.1are N = 400, σ R = 1 and σ pert = 0.1.Parameters used in Fig.5areN = 400, σ R 2 = L /2• σ 2 and σ pert = 0; σ 2 = 10 −7 is the regularization parameter used for regression in the Fourier space (see Appendix 4.10), which was used to compute the theoretical prediction for outlier eigenvalues.

Figure 7 :
Figure 7: Emergence of preferred frequency in non-linear feedback networks, supplementary results obtained for different hyper-parameters.A. LS training.Parameters are as in Fig. 1, except σ pert = 0.01 (top) and σ LS = 0.001 (bottom).B. Ridge training.Parameters are as in Fig. 1, except σ pert = 0.01 (top) and σ R = 0.1 (bottom).C. RLS training.Parameters are as in Fig. 1, except N tot = 10 (top and bottom), σ pert = 0.01 (top) and α = 5 (bottom).In the bottom row, we have removed from the plot the preferred frequencies ω at low g values in the cases where training fails (i.e. it is characterized by very high error) for every value of ω tested.

Figure 8 :
Figure 8: Emergence of preferred frequency in non-linear feedback networks, example trials.Example trials as in Fig. 1B where training is performed via Ridge regression (A) or RLS (B).Parameters are as in Fig. 1B, except N tot = 3 in B.

9 Figure 9 :
Figure 9: Norm of LS-trained readout vector n.Computed via Eq.(71) for a range of values g and ω.

1 = N 2 P
11 + iP 21 λ + λ − g 2 + P 11 − iP 21 λ − λ − g 2 (77)which can be re-cast as a quadratic equation in λ:(1 + ω 2 )λ 2 − 2g2 + N (P 11 + ωP 21 ) λ + g 4 + N g 2 P 11 = 0. (78) left) via LS regression.The analysis of the previous sections has shown that linear networks trained via LS regression can exactly implement the feedback task with stable dynamics.Due to noise, however, real-life LS regression never converges to this ideal solution.Noise arises in training from multiple sources, such as finite sampling of training data, variability due to different initial conditions or regularization noise.In order to characterize training performance, we thus use our theoretical framework to analyze the effect of noise on the dynamics of feedback linear networks trained via LS regression. )