Takens-inspired neuromorphic processor: a downsizing tool for random recurrent neural networks via feature extraction

We describe a new technique which minimizes the number of neurons in the hidden layer of a random recurrent neural network (rRNN) for time series prediction. Merging Takens-based attractor reconstruction methods with machine learning, we identify a mechanism for feature extraction that can be leveraged to reduce the network size. We obtain criteria specific to the particular prediction task and derive the scaling law of the prediction error. The consequences of our theory are demonstrated by designing a Takens-inspired hybrid processor, which extends a rRNN with an a priori designed delay external memory. Our hybrid architecture therefore comprises both real and virtual nodes. Via this symbiosis, we demonstrate the performance of the hybrid processor by stabilizing an arrhythmic neural model. Thanks to the design rules we obtain, we can reduce the size of the stabilizing neural network by a factor of 15 with respect to a standard system.


I. INTRODUCTION
Artificial neural networks (ANNs) are systems prominently used in computational science as well as in investigations of biological neural systems. In biology, of particular interest are recurrent neural networks (RNNs), whose structure can be compared, among others, with the nervous-system networks of advanced biological species [1]. In computation, RNNs have been used to solve highly complex tasks which pose problems to other classical computational approaches [2][3][4][5][6]. Their recurrent architecture allows the generation of internal dynamics, and consequently RNNs can be studied utilizing principles of dynamical systems theory. Therefore, the network's nonlinear dynamical properties are of major importance to its information processing capacity. In fact, optimal computational performance is often achieved in a stable equilibrium state of the network, yet near criticality [7,8].
Among the recurrent networks reported in the current literature, rRNNs are popular models for investigating fundamental principles of information processing. In these models the synaptic neural links are randomly weighted, typically following a uniform [9] or Gaussian distribution [10][11][12][13]. Recently, there has been increasing interest in particular types of random recurrent networks with a simplified design, where only the output layers are trained using a supervised learning rule. Such rRNNs are typically referred to as Reservoir Computers, which include echo state networks [2] and also liquid state machines [14]. Reservoir Computing provides state-of-the-art performance for challenging problems like high-quality long-term prediction of chaotic signals [2,15].
Prediction corresponds to estimating the future developments of a system based on knowledge about its past. Chaotic time series prediction is of importance to a large variety of fields, including the forecasting of weather [16], the evolution of some human pathologies [17], population density growth [18], or dynamical control as found in the regulation of chaotic physiological functions [19][20][21]. In order to build a predictor for chaotic systems, the most common techniques can be divided into the following groups [22]: (i) linear and nonlinear regression models such as Autoregressive-Moving-Average, Multi Adaptive Regression Spline [23], and Support Vector Machine [24]; (ii) state space based techniques for the prediction of continuous-time chaotic systems, which utilize attractor reconstruction and interactions between internal degrees of freedom to infer the future [22,[25][26][27][28]. The attractor reconstruction method is based on the embedding of the original state space in a delay-coordinate space [29]. And (iii) the connectionist approach, including recurrent and feedforward [30,31], deep [32], and convolutional ANNs [33]. This approach usually comprises the design of ANNs using large amounts of neurons to process information [34].
The high dimensionality of an ANN's hidden layer commonly translates into a computationally expensive problem when such networks are optimized to solve a task. In the reservoir computing approach, by contrast, the training effort is reduced because training is applied to the output layer only. However, the more neurons in the hidden layer are connected to the output layer, the higher the computational cost of the training step.

FIG. 1. The internal layer has m neurons whose synaptic weights are defined by the elements of the matrix W. The neurons' nonlinear activation functions are hyperbolic tangents. Node responses are fed back internally to the internal layer, yielding the recurrent architecture of the network. A readout state y^out is created via the readout weight matrix W^out.

Introducing a novel methodology, we develop a state space based concept which guides the downsizing of the rRNN's hidden layer. To achieve this objective, we describe rRNNs and state space based models within the same framework, from which we show state space patterns revealed by spontaneous reconstructions inside the high dimensional space of our random recurrent network. Furthermore, we introduce a technique based on the Takens embedding theorem to identify the embedding dimensions of the input system's spontaneous reconstruction, and their relevance to the system's prediction performance.
We immediately exploit our insight and devise a new hybrid Takens-inspired ANN concept, in which a network is extended by an a priori designed delay external memory. The delay term is used to virtually extend the size of the network by introducing virtual nodes which exist in the delay path. We use this new design to first validate our interpretation, and then devise an advanced hybrid rRNN to stabilize a non-periodic neuronal model which requires 15 times fewer neurons than a benchmark rRNN. As this system is driven by a stochastic signal, we show how our approach can leverage properties of the underlying deterministic system even for the case of a stochastic drive.

II. RANDOM RECURRENT NETWORKS FOR PREDICTION
A rRNN is illustrated in Fig. 1, indicating the temporal flow of information received by each neuron or node. The rRNN consists of a reservoir of m nodes in state x_n at integer time n. Nodes are connected through random, uniformly distributed internal weights, defined as coefficients of the matrix W of dimensionality m × m. The resulting randomly connected network is injected with 1D input data y^{IN}_{n+1} according to input weights, defined as random coefficients of the vector W^{IN} of dimensionality m × 1. The time-discrete equation that governs the network is [2]

x_{n+1} = f_{NL}\left( \mu W \cdot x_n + \alpha W^{IN} y^{IN}_{n+1} + W^{off} b \right),

where \mu is the bifurcation parameter, \alpha the input gain, f_{NL}(\cdot) a nonlinear sigmoid-like activation function, and b the constant phase offset injected through offset weights, defined as random coefficients of the vector W^{off} of dimensionality m × 1. In practice, we construct a network with m = 1000, using the Matlab routine random. Connection weights W_{i,j} are distributed around zero. In order to set our recurrent network in its steady state, W is normalized by its largest eigenvalue. The network's connectivity is set to one, hence it is fully connected.
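The network update above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' Matlab code: the reservoir is shrunk from m = 1000 to 200 to keep it fast, and the sine drive is an arbitrary stand-in for real input data.

```python
import numpy as np

rng = np.random.default_rng(42)

# m = 1000 in the text; a smaller reservoir keeps this sketch fast
m = 200
mu, alpha, b = 1.1, 0.8, 0.2               # bifurcation parameter, input gain, offset

W = rng.uniform(-0.5, 0.5, (m, m))         # random, uniformly distributed weights around zero
W /= np.max(np.abs(np.linalg.eigvals(W)))  # normalize by the largest eigenvalue
W_in = rng.uniform(-0.5, 0.5, m)           # input weights W^IN (m x 1)
W_off = rng.uniform(-0.5, 0.5, m)          # offset weights W^off (m x 1)

def step(x_n, y_in):
    """One time-discrete network update:
    x_{n+1} = f_NL(mu W x_n + alpha W^IN y^IN_{n+1} + W^off b), f_NL = tanh."""
    return np.tanh(mu * W @ x_n + alpha * W_in * y_in + W_off * b)

x = np.zeros(m)
for y_in in np.sin(0.2 * np.arange(100)):  # arbitrary 1D drive for illustration
    x = step(x, y_in)
```

The tanh nonlinearity matches the hyperbolic-tangent activations of Fig. 1; normalizing W by its largest eigenvalue places the autonomous network at the edge of its steady state, as described in the text.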
An output layer creates the solution y^{out} to the prediction task. In this step the network approximates the underlying deterministic law that rules the evolution of the input system. The output layer provides the computational result according to

y^{out}_{n+1} = W^{out} \cdot x_{n+1}.

The output weights vector W^{out} is calculated according to a supervised learning rule, using a reference teacher/target signal y^T_{n+1} [35]. We calculate the optimal output weights vector W^{out}_{op} via the pseudo-inverse,

W^{out}_{op} = X^{+} \cdot y^{T},

where X is the matrix of node-response states x_{n+1} and X^{+} its Moore-Penrose pseudo-inverse, computed using singular value decomposition with the Matlab routine pinv. Equation (3) therefore minimizes the error between output W^{out} \cdot x_{n+1} and teacher y^T_{n+1}. As training error measure we use the normalized mean squared error (NMSE) between output y^{out}_{n+1} and target signal y^T_{n+1}, normalized by the standard deviation of the teacher signal y^T_{n+1}. When we use a rRNN for time series prediction of a chaotic oscillator such as the Mackey-Glass (MG) time-delayed system [17], we can achieve good long-term prediction performance using a network of 1000 neurons. Here, long-term predictions are defined as predictions far beyond one step into the future. The task is to predict future steps of the chaotic MG system in its discrete-time version

y_{n+1} = y_n + \delta \left( \vartheta \frac{y_{\tau_m}}{1 + y_{\tau_m}^{\nu}} - \psi y_n \right),

where y_{\tau_m} = y_{(n - \tau_m/\delta)}, \tau_m = 17 is the time delay, and \delta = 1/10 is the stepsize, indicating that the time series is subsampled by 10. The other parameters are set to \vartheta = 0.2, \nu = 10, \psi = 0.1. For any prediction task in this paper, we consider an ensemble of 20 network models via different initializations of {W, W^{IN}, W^{off}}. The prediction horizon is estimated to be 300 time steps, defined by the inverse of the largest Lyapunov exponent of the MG system (\lambda^{MG}_{max} = 0.0036). For this prediction horizon, and for predicting the value of 20 different MG sequences, we obtain the average of all NMSEs, resulting in 0.091 ± 0.013. This performance was obtained for b = 0.2, \alpha = 0.8 and \mu = 1.1, which was found to offer the best prediction performance in the range \mu ∈ [0.1, 1.3].
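The discrete-time MG map of Eq. (4) can be iterated directly. A minimal sketch: the delay τ_m = 17 corresponds to τ_m/δ = 170 integration steps, and the output is subsampled by 10; the constant initial history (1.2) is our assumption, since the text does not specify one.

```python
import numpy as np

def mackey_glass(n_points, theta=0.2, nu=10.0, psi=0.1, tau_m=17, delta=0.1, sub=10):
    """Iterate the discrete-time Mackey-Glass map, Eq. (4), and subsample by `sub`.
    The constant initial history is an assumption; Eq. (4) leaves it open."""
    d = round(tau_m / delta)                  # delay of tau_m / delta = 170 steps
    total = n_points * sub + d + 1
    y = np.empty(total)
    y[: d + 1] = 1.2                          # assumed initial history
    for n in range(d, total - 1):
        y_tau = y[n - d]                      # y_{n - tau_m/delta}
        y[n + 1] = y[n] + delta * (theta * y_tau / (1.0 + y_tau ** nu) - psi * y[n])
    return y[d + 1 :: sub][:n_points]
```

After discarding a transient, the returned series exhibits the familiar chaotic MG oscillation around y ≈ 1 and can serve as the input y^{IN} of the rRNN.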
Moreover, the network was trained with 3000 values of the MG system, with a teacher signal given by y^T_{n+1} = y^{IN}_{n+1}. We subtracted the average of the MG time series before injection into the rRNN, which is common practice [35]. We then discarded the first 1000 points of the network's response to avoid initial transients. After training, in which we determined W^{out}, we connected y^{out}_{n+1} to y^{IN}_{n+1} and let the network run freely for 300 time steps, the prediction horizon.
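The training procedure can be sketched as follows. This is a toy version: a small reservoir driven by a simple periodic signal stands in for the 1000-node network and the mean-subtracted MG series, and the pseudo-inverse step mirrors Equation (3).

```python
import numpy as np

rng = np.random.default_rng(7)
m, mu, alpha, b = 200, 0.9, 0.8, 0.2

W = rng.uniform(-0.5, 0.5, (m, m))
W /= np.max(np.abs(np.linalg.eigvals(W)))   # steady-state normalization
W_in = rng.uniform(-0.5, 0.5, m)
W_off = rng.uniform(-0.5, 0.5, m)

signal = np.sin(0.2 * np.arange(2000))      # stand-in for the mean-subtracted MG series

# collect node responses, to be truncated by a washout transient
X, x = [], np.zeros(m)
for y_in in signal:
    x = np.tanh(mu * W @ x + alpha * W_in * y_in + W_off * b)
    X.append(x)
X = np.array(X)

washout = 200
# optimal readout via the Moore-Penrose pseudo-inverse (Equation (3)),
# with teacher y^T_{n+1} = y^IN_{n+1}
W_out = np.linalg.pinv(X[washout:]) @ signal[washout:]

y_hat = X[washout:] @ W_out
nmse = np.mean((signal[washout:] - y_hat) ** 2) / np.var(signal[washout:])
```

For free-running prediction, the readout y_hat would then be fed back as the next input, as described in the text.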
Given that good long-term prediction performance is obtained, we wonder how input information is represented in the network's space. Such qualitative investigations are, for instance, common practice in image classification tasks, where the extracted features are often identified and illustrated together with the hidden layers that have generated them [36]. In the following, we introduce a technique that allows us to identify feature representations of the input information in the rRNN's high dimensional space, which are linked to good prediction performance.

III. A METHOD FOR FEATURE EXTRACTION IN RANDOM RECURRENT NETWORKS
As shown previously, our rRNN is able to predict the future values of 1D input chaotic data y^{IN}. This kind of data comes from a continuous-time chaotic system. As is known in chaos theory, a minimum of three dimensions is required in a continuous-time nonlinear system to generate chaotic solutions. Typically, continuous-time chaotic solutions come from models consisting of a system of at least three nonlinear ordinary differential equations (ODEs). However, there are other ways to obtain such chaotic dynamics. One of these is the introduction of an explicit temporal variable into an ODE, such as a time delay with respect to the main temporal variable. We refer to these models as delay differential equations (DDEs), where a DDE is in fact equivalent to an infinite-dimensional system of differential equations [37]. The number of solutions to a DDE is in theory infinite, owing to the infinite number of initial conditions on a continuous interval required to solve the equation. Each initial condition initializes an ODE from the infinite-dimensional system of equations. Thus, the introduction of a time delay into an ODE, resulting in a DDE, provides sufficient dimensionality to allow for the existence of chaotic solutions. For instance, this is how the MG system can develop chaotic solutions. In such a case, one only has access to a time series of the single accessible variable y_{n+1} in Eq. (4), while all others remain hidden. Nevertheless, the hidden variables participate in the development of the global dynamics alongside the accessible variable.
In order to approximate a full-dimensional representation of these oscillators, we can embed the 1D sequence y_{n+1} into a high dimensional space, consequently reconstructing its state space. Among the most practiced methods to embed 1D information, we highlight state space reconstruction techniques such as delay reconstruction (Whitney and Takens embedding theorems [38,39]) or the Hilbert transform [40].
Delay reconstruction is a widely used method to complete missing state space information. According to the Takens embedding theorem, time delayed versions of a time series suffice to reveal the structure of the state space trajectory. Let us represent the data in a M-dimensional space by the vectors y_n = [y_n, y_{(n+τ_0)}, . . . , y_{(n+(M−1)τ_0)}]^†, where [·]^† denotes the transpose and y_n is the original time series. The essential parameters for state space reconstruction are M and the time delay τ_0. The embedding delay τ_0 is often estimated by applying autocorrelation analysis or time delayed mutual information to the signal y_n. The temporal position of the first zero [25] of either method maximizes the chance of extracting additional information containing independent observations of the original signal, and hence of obtaining trajectories or dynamic motion along potentially orthogonal state-space dimensions. If successful, information of such maximized linear independence can enable the inference of the missing degrees of freedom [39,41,42]. In turn, we estimate the minimum number of required embedding dimensions M using the method of false nearest neighbors [25]. Crucially, the following concepts are not restricted to a particular method of determining τ_0 or M.
The autocorrelation function (ACF) is often employed to identify the temporal position τ_0 used to reconstruct missing coordinates that define a continuous-time dynamical system. In order to show how Takens-based attractor reconstruction performs, we use a dataset of 1 × 10^4 values provided by Eq. (4). The outcome of the autocorrelation analysis reveals that the ACF has its first minimum at lag τ_0 = −12. In Fig. 2 we show the absolute value of the ACF together with three examples of 2D delay-reconstructions for three different values of the embedding lag τ_0. As can be seen, 2D-reconstructions based on lags τ_0 = −7 and −17 also unfold the geometrical object within a state space. However, aiming at a maximally orthogonal embedding, we base our analysis on the attractors reconstructed by exactly the first minimum of the ACF, τ_0 = −12. According to the false nearest neighbor analysis, the minimum dimension M required to reconstruct the MG attractor is 4. The Takens scheme therefore provides a set of coordinates y_n = {y_n, y_{n−12}, y_{n−24}, y_{n−36}} which reconstructs the state space object.
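The two ingredients above, estimating the embedding lag from the first minimum of the |ACF| and building the delay vectors, can be sketched as follows. Here positive lags reach into the past, which is equivalent to the negative-lag convention of the text; the sinusoidal test signal in the usage note is a placeholder, not MG data.

```python
import numpy as np

def acf_first_minimum(y, max_lag=60):
    """Return the lag at the first local minimum of |ACF(y)|."""
    y = y - y.mean()
    a = np.abs([np.corrcoef(y[:-k], y[k:])[0, 1] for k in range(1, max_lag + 1)])
    for i in range(1, len(a) - 1):
        if a[i] < a[i - 1] and a[i] <= a[i + 1]:
            return i + 1          # index i corresponds to lag i + 1
    return max_lag

def delay_embed(y, M, tau):
    """Takens delay vectors [y_n, y_{n-tau}, ..., y_{n-(M-1)tau}] as rows."""
    N = len(y) - (M - 1) * tau
    return np.column_stack([y[(M - 1 - i) * tau : (M - 1 - i) * tau + N]
                            for i in range(M)])
```

For example, for a sine of period 48 the first |ACF| minimum sits at a quarter period (lag 12), mirroring how τ_0 = −12 is read off the MG autocorrelation; `delay_embed(y, 4, 12)` then yields the coordinate set {y_n, y_{n−12}, y_{n−24}, y_{n−36}}.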
As can be seen, this is a classic method to extract the feature representation y_n from the original scalar observable y_n. Next, we want to identify comparable attractor reconstructions inside our rRNN's state space.

A. A Takens-inspired feature extraction technique
The network's state space is defined by the set of orthogonal vectors in W. We therefore analyze the representation of the injected signal y^{IN}_{n+1} in the network's node responses. As the input information y^{IN}_{n+1} is randomly injected into the network's high dimensional space, we search for possible spatial representations of the 1D input, where network nodes serve as embedding dimensions.
With that goal in mind, we proceed with an analysis comparable to the ACF used in delay embedding, based on the maximum absolute value of the cross-correlation function |CC(x_i, y^{IN})|_max between all node responses {x_i} and the input data y^{IN}. As our aim is to identify nodes providing observations approximately orthogonal to y^{IN}, we record for every network node this cross-correlation maximum and its temporal position, or lag. Figure 3(a) shows nodes with maximum cross-correlation at time lags covering all Takens embedding delays {0, τ_0, 2τ_0, 3τ_0}. Nodes lagged around {−36, −24, −12, 0} should therefore approximate the Takens embedded attractor well, and a state space representation of the input sequence is present within the rRNN's space.
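This cross-correlation analysis (CCA) can be sketched as below: `xcorr_lag` returns, for one node response, the lag and value of the maximum of |CC|, and sweeping it over the columns of the state matrix gives the per-node lag distribution of Fig. 3. The shifted-noise test signal is purely illustrative.

```python
import numpy as np

def xcorr_lag(x, y, max_lag=60):
    """Lag L maximizing |CC(x, y)|, with CC(L) ~ E[x_n * y_{n+L}].
    A node holding a delayed copy of the input, x_n = y_{n-k}, peaks at L = -k."""
    x = (x - x.mean()) / (x.std() + 1e-12)
    y = (y - y.mean()) / (y.std() + 1e-12)
    best_lag, best_cc = 0, 0.0
    for L in range(-max_lag, max_lag + 1):
        if L < 0:
            c = np.mean(x[-L:] * y[:L])   # pairs x_n with y_{n+L}, L negative
        elif L > 0:
            c = np.mean(x[:-L] * y[L:])
        else:
            c = np.mean(x * y)
        if abs(c) > best_cc:
            best_lag, best_cc = L, abs(c)
    return best_lag, best_cc
```

Applied to every node response x_i of the driven rRNN, the returned lags can be histogrammed to see whether they cluster around the Takens delays {0, τ_0, 2τ_0, 3τ_0}.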
In Fig. 3(b), we illustrate the numerous possible extracted features from the originally injected attractor embedded in the rRNN's space for µ = 1.1. The first column of the figure shows simple 2D-projections of the delay-reconstructed attractor using the set of lags {−24, −18, −12, −6}. The attractors reconstructed by network nodes lagged at {−24, −18, −12, −6} are shown in the next three columns. Attractors reconstructed by network nodes with maximum cross-correlation values at lags {−18, −6} do not belong to the set of original Takens delay-coordinates. We included these additional delay-dimensions to better illustrate the breadth of features provided by the rRNN.
As shown above, the CCA also finds node lags and correlations that do not agree with the Takens framework. In the following, we introduce a methodology to exclude such additional delay-dimensions from our predictor. As a starting point, we suppress nodes with specific CCA-lag positions during the training step of the output layer, such that they will not be available to the readout matrix but still take part in the rRNN's state evolution. To that end, we select nodes whose CCA-lag positions are within windows of width δτ^{net}_0, centered at integer multiples of τ^{net}_0, where τ^{net}_0 represents the time lag used for delay reconstruction in the Takens scheme. The window width δτ^{net}_0 defines a time-lag uncertainty associated with the identified rRNN delay-coordinates. All nodes with CCA-lag positions not inside the set (nτ^{net}_0 ± δτ^{net}_0), n ∈ Z, will not be available to the readout layer.
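The window filter reduces to a boolean mask over the node lags found by the CCA; a sketch (the helper name and the example lags are illustrative):

```python
import numpy as np

def takens_node_filter(node_lags, tau0, dtau, n_max=4):
    """Keep nodes whose CCA-lag falls within n * tau0 +/- dtau for some
    integer n in [0, n_max); the rest remain in the reservoir but are
    hidden from the readout during training."""
    node_lags = np.asarray(node_lags)
    keep = np.zeros(node_lags.shape, dtype=bool)
    for n in range(n_max):
        keep |= np.abs(node_lags - n * tau0) <= dtau
    return keep

# nodes at lags -18 and -6 fall outside the Takens windows and are dropped
mask = takens_node_filter([-36, -24, -18, -12, -6, 0], tau0=-12, dtau=3)
```

Training then proceeds exactly as before, but with the readout regression restricted to the columns of the state matrix selected by `mask`.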
To illustrate our method and its effect, we show the non-filtered CCA of the rRNN when driven by the MG signal and with bifurcation parameter µ = 1.1 in Fig. 4(a). Using a constant CCA-window width of δτ^{net}_0 = 3, the performance achieved by setting τ^{net}_0 = τ_0 in our approach reduces the NMSE more than twofold, even though the system has on average significantly fewer network nodes available to the output layer (∼500), see Fig. 4(g). For τ^{net}_0 = −7, the performance is better by one order of magnitude, achieved using approximately the same initial set of nodes available to the output (∼1000), see Fig. 4(g). We are therefore able, for the first time, to identify node families that show attractor embedding features in the network's space, based on their CCA-lag.
In summary, it can be seen that for filters based on τ^{net}_0 ∈ [−16, −5] the prediction performance is either equal to or better than in the non-filtered CCA case. This result is in agreement with Fig. 2, where we show that lags in this interval still appear to unfold the object in the state space.
Finally, some further aspects are seen in our data. For τ^{net}_0 > −4 the CCA-filter windows overlap, and nodes with a lag inside such overlap regions are assigned to multiple windows. This artificially increases the number of nodes available to the system's output beyond m = 1000. The resulting NMSE increases strongly beyond that of the original rRNN. We attribute this characteristic to over-fitting during learning, where the same few delay-coordinates are merely repeated.

B. Characteristics of the extracted features
Our previous analysis identifies and harnesses meaningful features related to good prediction performance. As this was realized by randomly connected networks, attractor reconstruction was achieved by randomly mapping the originally 1D input onto the high-dimensional rRNN space. Such random mapping is treated by the framework of random projections theory (RPT) [43][44][45][46][47]. An active area of study within this field treats the question of whether, when original input data are randomly mapped onto the dimensions of the projection space, the structural damage to the original object is minimized.
To determine the degree of potential structural distortion to the original input attractor after random mapping onto the network's high-dimensional space, we estimate the distortion bounds {ε_1, ε_2} according to

(1 − ε_1) ‖y^{IN}_n − y^{IN}_{n+1}‖ ≤ ‖ϕ(y^{IN}_n) − ϕ(y^{IN}_{n+1})‖ ≤ (1 + ε_2) ‖y^{IN}_n − y^{IN}_{n+1}‖,

where {ϕ(y^{IN}_n), ϕ(y^{IN}_{n+1})} are states built from rRNN node responses and assigned to an embedding dimension using the CCA. Details of the estimation of {ε_1, ε_2} are given in Appendix A.
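A minimal numerical sketch of such distortion bounds under a generic random projection (not the Appendix A procedure, which we do not reproduce): here ε_1 and ε_2 are simply read off the extreme ratios of mapped to original consecutive inter-state distances, and the 4D Gaussian trajectory is placeholder data.

```python
import numpy as np

rng = np.random.default_rng(1)

def distortion_bounds(Y, Phi):
    """eps1, eps2 such that
    (1 - eps1) d(y_n, y_{n+1}) <= d(phi_n, phi_{n+1}) <= (1 + eps2) d(y_n, y_{n+1})
    holds with equality at the extreme consecutive pairs."""
    d_in = np.linalg.norm(np.diff(Y, axis=0), axis=1)      # original inter-state distances
    d_out = np.linalg.norm(np.diff(Phi, axis=0), axis=1)   # mapped inter-state distances
    ratio = d_out / d_in
    return 1.0 - ratio.min(), ratio.max() - 1.0

Y = rng.normal(size=(300, 4))                  # a 4D trajectory (placeholder data)
R = rng.normal(size=(4, 100)) / np.sqrt(100)   # Johnson-Lindenstrauss-style random map
eps1, eps2 = distortion_bounds(Y, Y @ R)
```

For a well-conditioned random map both distortions stay small, which is the RPT statement that random projection largely preserves the geometry of the embedded object.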
As τ^{net}_0 = −12 was found to be associated with good prediction performance using approximately half of the initial set of nodes, we show in Fig. 5(a) the estimation of {ε_1(µ), ε_2(µ)} for the rRNN whose nodes' CCA-lag positions are within windows of width δτ^{net}_0 = 3, centered at integer multiples of τ^{net}_0 = −12. Here, we present the average of the statistical distribution over 20 network models, each predicting 20 different MG time series 300 time steps into the future. In panel (b), we schematically illustrate the relevant geometrical properties of the attractors mapped onto the rRNN's space. This study considered τ_0 = −12, with which we obtain the set of coordinates y_n = {y_n, y_{n−12}, y_{n−24}, y_{n−36}} used to unfold the state space object, where y^{IN}_n = y_n. The consequence of an increasing µ in Fig. 5(a) can be explained with the graphic representation of the limits shown in Fig. 5(b). Here, we illustrate the three general cases (I, II, III) connected to their corresponding ranges of µ in Fig. 5(a). The evolution of the system's trajectory is illustrated along a curve sampled at the positions of the big black dots. Gray dots are the network's neighbor states to the state y^{IN}_{n+1}. The first case (I) corresponds to neighbors which form a dense cloud of samples since ε_1 ∼ 1, but are arranged closely around the original sample since ε_2 < 1. The neighbor samples insufficiently enhance diversity in feature extraction, and the network cannot predict the system's future evolution. This is confirmed by Fig. 5(c,d), which show bad prediction performance and unity divergence: the rRNN cannot predict the system. Case (II) includes the values of µ where ε_1 ∼ 1 and ε_2 ≥ 1. Within this parameter range, the system's prediction performance strongly increases until reaching the lowest prediction error. Our analysis reveals the following mechanism behind this improvement.
According to ε_2, the maximum inter-state distance possible inside the rRNN's space is twice the inter-state distance of the original trajectory. As a consequence, the rRNN samples neighbors of the state y^{IN}_{n+1}. Hence, as the state neighborhood is broader, there now is sufficient random scanning of the attractor's vicinity, such that the network can use the different features to solve the prediction task. The network can therefore use the projected objects to predict, which is confirmed by the excellent performance in Fig. 5(c,d).
The last case (III) appears for µ > 1.3, where all approximated distances of the embedded attractor are much larger than the original distance. The rRNN's autonomous dynamics therefore enlarge the sampling distance such that no dense nearest neighbors remain available for prediction. The result is the distortion caused by the autonomous network dynamics, typically found to be chaotic for µ ≳ 1.4, as already identified in our previous work [9]. Consequently, the network's folding property distorts the projected features. Therefore, ε_1 becomes undefined, meaning that information about the structure of the embedded trajectory is lost. As a consequence, prediction performance strongly degrades, see Fig. 5(c,d).
The process described in this section allowed us to identify and use relevant features that enable good long-term prediction performance with a downsized rRNN. Additionally, we found that the neighborhood generated by the inter-state distances of such features has a meaningful impact on the network's ability to predict at all. At this point, we show how to simplify our process even further and package all the steps described above that allow us to generate features related to good prediction performance.

IV. HYBRID TAKENS-INSPIRED RANDOM RECURRENT NETWORKS
In this section, we directly exploit our newly gained insight and introduce a modified version of the classical random neural network for time series prediction. Here, we design a system which aims to only take into account those Takens-like dimensions that we found to be relevant for prediction. As previously described, the actions provided by the nodes of the rRNN can be interpreted in the light of delay-embedding. We consequently modify the classical rRNN by including a Takens-inspired external memory:

y^{out}_{n+1} = W^{out} \cdot [x_{n+1}, x_{n+1+τ_T}]^†,

where x_{n+1+τ_T} is a delayed term added to the output layer, see Fig. 6(a). All elements of the reservoir layer have been copied and then time shifted by a delay term τ_T, which could be the Takens embedding delay τ_0. This process allows us to add virtual nodes to our network, which are distributed in the delay lines. This Takens rRNN (TrRNN) combines nonvolatile external memory (virtual nodes) with a neural network (real nodes) and therefore shares functional features of the recently introduced hybrid computing concept [49]. Yet, our concept makes any additional costly optimization unnecessary. We start our analysis by identifying the embedding delay related to the best prediction performance. Here, we fix µ = 0.1 and then vary the delay term τ_T ∈ [−20, −1]. For µ = 0.1, the delay-coordinates found in the network's space only span approximately two Takens-like embedding dimensions of the MG system, with delays {2τ_0, 0} when τ_0 = −12, see Fig. 6(b). Furthermore, most node responses are distributed along the columns centered on lags {2τ_0, 0}. Consequently, as shown in Fig. 5(c,d), the prediction performance is almost the lowest possible due to insufficient dimensionality to obtain attractor-like features. Additionally, we set the number of nodes to 350, which is 70% of the nodes that the best CCA-window-filtered rRNN had at its disposal, see Sec. III.A, and 35% of the nodes used by the non-filtered classical rRNN.
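The hybrid readout above amounts to augmenting each reservoir state with a delayed copy of itself before training the output weights; a sketch, assuming the state matrix X holds one reservoir state per row:

```python
import numpy as np

def trrnn_features(X, tau_T):
    """Rows [x_{n+1}, x_{n+1+tau_T}]: each reservoir state concatenated with
    its copy delayed by |tau_T| steps (the 'virtual nodes' in the delay line).
    tau_T must be negative, i.e. point to a past state."""
    d = -tau_T
    return np.hstack([X[d:], X[:-d]])

X = np.arange(12.0).reshape(6, 2)       # 6 toy states of a 2-node reservoir
F = trrnn_features(X, tau_T=-2)
```

The readout W^{out} is then trained on F exactly as in Equation (3), with the teacher aligned to the undelayed states; the delayed half of each row costs no extra neurons, only memory.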
According to Fig. 7(a,b), the best average prediction performance, over an ensemble of 20 network models on 20 different MG time series at 300 time steps into the future, is found for τ_T = −12, belonging to the interval τ_T ∈ [−13, −10] with the lowest NMSE values and divergence rates. The delay τ_T therefore agrees with the one identified for Takens attractor embedding, τ_0, and is almost identical to one of the lags τ^{net}_0 found optimal in the CCA-window filtering. Furthermore, here the system only has as many CCA-windows at its disposal as dimensions required to embed the MG attractor. This lifts the ambiguity present in the CCA-window filtering analysis, and consequently the optimum performance is found only for a TrRNN embedding exactly along lags according to the Takens embedding. In comparison to the classical rRNN, our TrRNN achieves the same performance while reducing the number of nodes in the reservoir layer from 1000 to 350, also fewer than were available at the output of the CCA-filtered system. Compared to the pristine rRNN, we obtain one order of magnitude better performance with a network three times smaller. Figure 7(c) shows the estimation of {ε_1, ε_2} as τ_T varies. As can be seen, ε_2 ≈ 1 is associated with good prediction performance, found for τ_T ∈ [−13, −10]. This result agrees with the results provided by the classical random network in Sec. III.B. The CCA for τ_T = −12 is shown in Fig. 7(d), where we can find the set of nodes with all delay-coordinates required to fully reconstruct the MG attractor. In the cases where prediction was not possible, the CCA identifies the inadequacy of the rRNN delay embedding as the reason, see τ_T = −3.

A. Application: control an arrhythmic neuronal model
We directly utilize our TrRNN as part of an efficient feedback control mechanism in an arrhythmic excitable system. We task the TrRNN to aid in stabilizing a system which models the firing behavior of a noise-driven neuron: the FitzHugh-Nagumo (FHN) neuronal model [50,51],

ε dv/dt = v (v − g)(1 − v) − w + I,
dw/dt = v − D w − H + ξ(t),

where v(t) and w(t) are voltage and recovery variables. I = 0.3 is a tonic activation signal, ξ is Gaussian white noise with zero mean and standard deviation ∼0.02, ε = 0.005, g = 0.5, D = 1.0, and H = 0.15. These equations have been solved with the Euler-Maruyama algorithm for the integration of stochastic differential equations. In its resting state, the neuron's membrane potential is slightly negative. Once the membrane voltage v(t) is sufficiently depolarized through an external stimulus, the neuron spikes due to the rise of the action potential [52,53]. The time between consecutive spikes is defined as the interspike interval (ISI). These random ISIs are shown as black dots in Fig. 8(a), clearly showing that spiking is non-regular.
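Under the model as written above, the Euler-Maruyama integration can be sketched as follows. The placement of ξ in the recovery equation, the √dt noise scaling, the step size, and the initial condition are our assumptions; they are not specified in the text.

```python
import numpy as np

rng = np.random.default_rng(5)

# parameters from the text
I, g, D, H = 0.3, 0.5, 1.0, 0.15
eps, sigma = 0.005, 0.02
dt, steps = 1e-4, 50_000          # step size and duration are assumptions

v, w = -0.1, 0.0                  # assumed initial condition near rest
vs = np.empty(steps)
for n in range(steps):
    dv = (v * (v - g) * (1.0 - v) - w + I) / eps
    dw = v - D * w - H
    v += dt * dv
    # Euler-Maruyama: deterministic drift plus sqrt(dt)-scaled Gaussian noise
    w += dt * dw + np.sqrt(dt) * sigma * rng.normal()
    vs[n] = v
```

The fast variable v repeatedly fires action-potential-like excursions toward v ≈ 1 before returning near rest, and the noise in w jitters the interspike intervals, producing the arrhythmic spiking described in the text.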
We aim to control this random spiking behavior of the FHN neuronal model by the proportional perturbation feedback (PPF) method [54] and by using either a TrRNN or a rRNN. The PPF method consists of applying perturbations to place the system's unstable fixed point onto a stable trajectory [54,55]. This method is used to counteract instabilities in the FHN neuronal model through the design of a control equation. In our case, the goal of using the PPF method is to build a control subsystem which applies an external stimulus to trigger spiking and reduce the degree of chaos.
In our approach, the past information provided by the voltage v(t) in the FHN model is used to determine two things: (i) the parameters to design the control equation, and (ii) training parameters for the rRNN (µ = 1.1) and TrRNN (µ = 0.1, and τ_T = −166, obtained via the ACF minimum of v(t) as in the Takens scheme) to predict future values of v(t). The predicted v(t) is used to calculate the full control signal with which we stabilize the neuron's spiking activity. This methodology allows us to replace the quantity under control, v(t), by a reconstructed signal, which in control theory is related to the replacement of sensors in an exemplary control system. To train the network, we inject 1 × 10^5 values, and we then let the network run freely for another 4 × 10^6 steps, allowing us to stabilize 5619 ISI points. We then evaluate the quality of the stabilization for networks ranging from 11 to 340 nodes. In Fig. 8(a), the set of blue dots along a constant line shows how the TrRNN stabilizes the ISI activity. As can be seen, the network controls the random ISI starting from n = 7465. This excellent stabilization was achieved with a TrRNN of only 12 nodes. Figure 8(b) shows the full comparison between rRNN and TrRNN. The mean value of the ISI is calculated for the different sizes of rRNN and TrRNN and then normalized by the mean value of the random ISI. The TrRNN starts inferring the inner dynamics of the FHN system for an extremely small network containing just 12 nodes, from which point on it is always capable of correctly stabilizing the ISI. In contrast, the classical rRNN does not predict at all until its architecture has at least 80 nodes, and performance remains poor in comparison with the TrRNN. At 200 nodes the rRNN starts predicting the dynamics of the FHN system more or less correctly, allowing the control signal to fully stabilize the ISI. Yet, for more than 200 nodes the good performance can still fluctuate, even dropping significantly again.
This indicates that, in general, stabilization via a classical rRNN is not robust. Furthermore, with the TrRNN one can reduce the number of nodes to 15 times fewer than the classical rRNN. This stark difference in performance highlights (i) the difficulty of the task, and (ii) the excellent efficiency that the addition of a simple, linear delay term, adjusted to the Takens embedding delay, brings to the system. Our TrRNN is therefore not only an interesting novel ANN concept for the prediction of complex systems. It also once more confirms our theory: embedding inside ANNs is essential if the systems are to be used for prediction.

V. COMPARISON WITH CLASSICAL STATE SPACE PREDICTION
Up to now, we have described a method for downsizing the hidden layer of rRNNs which can be summarized by the same steps one would follow when solving continuous-time chaotic signal prediction in the state space framework. This framework addresses three fundamental aspects [22,25,26]: (i) insufficient information to represent the complete state space trajectory of the chaotic system. This problem originates from the fact that in many cases one does not have access to all state space dimensions. In this case a reconstruction of the dynamics along the chaotic system's missing degrees of freedom is required. Knowledge of all dimensions then allows us to design predictors based on the full state space trajectories. (ii) The second aspect is related to the sampling resolution. All information that is acquired, be it from simulations or from experiments, comes with a particular resolution. To minimize the divergence between a prediction and the correct value, the sampling resolution has to be maximized. This is of particular importance for the prediction of chaotic systems, as these by definition show exponential divergence. (iii) For deterministic chaotic systems, future states of a given trajectory can in principle be approximated from exact knowledge of the present state. Therefore, the final step towards prediction is approximating the underlying deterministic law ruling the dynamical system's evolution. In the rRNN, step (i) is fulfilled by the random mapping which takes place in the high-dimensional space of the network. Here, random projection theory (RPT) supports the fact that the original input data are randomly mapped onto the dimensions of the projection space while the structural damage to the original object is minimized.
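The distance-preserving behavior that RPT guarantees for step (i) can be checked numerically. The sketch below, with hypothetical dimensions of our own choosing, draws a Gaussian random map and verifies that pairwise distances between input states are distorted only within a narrow (1 − ε, 1 + ε) band after projection:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical dimensions: project 3-D input states into a 300-node space
d_in, n_nodes, n_points = 3, 300, 200
X = rng.standard_normal((n_points, d_in))                     # input trajectory points
W = rng.standard_normal((n_nodes, d_in)) / np.sqrt(n_nodes)   # Gaussian random map

Y = X @ W.T  # randomly projected states, shape (n_points, n_nodes)

# compare pairwise distances before and after the projection
i = rng.integers(0, n_points, 1000)
j = rng.integers(0, n_points, 1000)
orig = np.linalg.norm(X[i] - X[j], axis=1)
proj = np.linalg.norm(Y[i] - Y[j], axis=1)
mask = orig > 1e-9           # skip identical point pairs
ratios = proj[mask] / orig[mask]

# worst-case distortion over the sampled pairs
eps = max(1.0 - ratios.min(), ratios.max() - 1.0)
```

With a projection space this much larger than the input dimension, eps stays small, which is precisely what allows the network's high-dimensional random mapping to preserve the geometry of the input trajectory.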
Step (ii) is fulfilled by the analysis made in Section III.B, where the sampling resolution is maximized to cover the region between the states y^IN_n and y^IN_{n+1}. Finally, step (iii) relates to the training of the rRNN itself, where W_out has to be determined via regression.
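Step (iii), determining W_out via regression, is commonly solved in closed form with regularized (ridge) least squares over the recorded node responses. A minimal sketch under that assumption follows; the shapes, ridge value, and toy data are illustrative, not taken from the paper.

```python
import numpy as np

def train_readout(states, targets, ridge=1e-6):
    """Train the linear output layer W_out of a reservoir by
    regularized least squares:
        W_out = Y S^T (S S^T + ridge * I)^(-1)
    states:  node responses S, shape (n_nodes, n_steps)
    targets: desired outputs Y, shape (n_outputs, n_steps)"""
    S = np.asarray(states)
    Y = np.asarray(targets)
    n = S.shape[0]
    return Y @ S.T @ np.linalg.inv(S @ S.T + ridge * np.eye(n))

# toy usage: recover a known readout from noiseless node responses
rng = np.random.default_rng(1)
W_true = rng.standard_normal((1, 50))
S = rng.standard_normal((50, 2000))
W_out = train_readout(S, W_true @ S)
```

The ridge term keeps the inversion well conditioned when node responses are correlated, which is typical for recurrent networks driven by a smooth input signal.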

VI. CONCLUSION
We have introduced a new method of rRNN analysis which demonstrates how prediction is potentially achieved in high-dimensional nonlinear dynamical systems. Random recurrent networks and the prediction of a specific signal can consequently be described via a common methodology. Quantifying measures such as the memory-related cross-correlation analysis and the feature extraction are quantitatively interpretable. We therefore significantly extend the toolkit previously available for random neural network analysis. The tools developed in this article might be compared to the use of the t-SNE technique [56] for analyzing ANNs during a classification task.
Our scheme has numerous practical implications. The most direct is motivating the development and analysis of new learning strategies. Furthermore, we have already designed a novel hybrid computer, including both virtual and real nodes, that predicts efficiently via a priori defined external memory access rules. This approach allows us to improve the design of our neural network and thereby reduce the number of nodes and connections required to solve the prediction task. Finally, our work partially removes the black-box property of random recurrent networks for prediction, possibly giving translational insight into how such tasks can be solved in comparable systems.
where {ϕ_{l1}(y^IN_n), ϕ_{l2}(y^IN_n), …} are node responses lagged at [l_min, l_max]. The size of the interval [l_min, l_max] depends on the value of µ, as was shown in Fig. 2(b,c), where we find a broader distribution of delay-coordinates for higher values of µ.
Under these conditions, we can claim that the transformation by the rRNN acts as a nonlinear random projection. Estimating the limits {ε_1, ε_2} requires finding the inferior, ∆_min, and superior, ∆_max, interstate distance limits, where ε_1 and ε_2 are calculated by isolating these constants from ∆_min = (1 − ε_1) and ∆_max = (1 + ε_2). These limits contain information about the minimum and maximum distortions that we can find in order to obtain the best neighbors in the rRNN. ‖ϕ(y^IN_{n+1}) − ϕ(y^IN_n)‖_min and ‖ϕ(y^IN_{n+1}) − ϕ(y^IN_n)‖_max are calculated using the Euclidean distance under minimum and maximum norms, where ϕ_{lg}(y^IN_n) are node responses lagged at l_g ∈ [l_min, l_max], ∀g = 1, 2, …, h.
Here, we therefore identify the smallest and largest distances [ϕ_{lg}(y^IN_{n+1}) − ϕ_{lg}(y^IN_n)]_{min,max} along each delay coordinate. These smallest and largest coordinate-wise distances then bound the Euclidean distance between the full node-response vectors.
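This bounding relation can be illustrated numerically: if every coordinate-wise distance lies in [d_min, d_max], then the Euclidean distance over h delay coordinates lies in [√h · d_min, √h · d_max]. A short sketch with arbitrary node responses (the value of h and the random data are our own):

```python
import numpy as np

rng = np.random.default_rng(2)
h = 8  # number of delay coordinates (hypothetical)

# node responses at two consecutive input states, one value per delay coordinate
phi_n  = rng.standard_normal(h)
phi_n1 = rng.standard_normal(h)

diffs = np.abs(phi_n1 - phi_n)           # per-coordinate distances
d_min, d_max = diffs.min(), diffs.max()
euclid = np.linalg.norm(phi_n1 - phi_n)  # full Euclidean distance

# the coordinate-wise extremes bound the Euclidean distance
lower = np.sqrt(h) * d_min
upper = np.sqrt(h) * d_max
```

Since the squared Euclidean distance is a sum of h squared coordinate-wise distances, each between d_min² and d_max², the bounds hold for any choice of node responses.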