Predicting high-dimensional heterogeneous time series employing generalized local states

We generalize the concept of local states (LS) for the prediction of high-dimensional, potentially mixed chaotic systems. The construction of generalized local states (GLS) relies on deﬁning distances between time series on the basis of their (non-)linear correlations. We demonstrate the prediction capabilities of our approach based on the reservoir computing (RC) paradigm using the Kuramoto-Sivashinsky (KS), the Lorenz-96 (L96), and a combination of both systems. In the mixed system a separation of the time series belonging to the two different systems is made possible with GLS. More importantly, prediction remains possible with GLS, where the LS approach must naturally fail. Applications for the prediction of very heterogeneous time series with GLSs are brieﬂy outlined.


I. INTRODUCTION
Tremendous advances in predicting the short and long term behavior of complex systems have been made in recent years by applying machine learning [1][2][3][4][5][6][7].reservoir computing turned out to be a very promising machine learning approach as it combines outstanding performance with conceptual advantages like very fast and comparably transparent learning and possibly appealing hardware realizations of reservoir computing [8,9].
For high-dimensional systems reservoir computing suffers like other machine learning methods from the "curse of dimensionality" meaning that the number of nodes of the network representing the reservoir has to be considerably larger than the dimensionality of the input data rendering the training unfeasible with a naive reservoir computing approach.With a parallel prediction scheme based on local states [10], however, the forecasting of high-dimensional chaotic spatiotemporal systems of arbitrarily large extent becomes possible [11,12].
The definition of local states relies on defining spatial local neighborhoods for each time series to be predicted.Thus, the knowledge of the position of the time series in space is a necessary prerequisite for defining local states.
The similarity of time series can be defined in a much more general way by deducing a distance measure and thus a local neighborhood from the correlations among the time series [13].Those generalized similarities ledamong others -to a reasonable, fully data-driven taxonomy of the stock market [14] while crucial differences between the linear and nonlinear correlation structure of the stock market especially during crises were reported later [15].
In this Article, we employ this approach to define generalized local states for the prediction of high dimensional systems with which some of the shortcomings of the local states approach can be overcome.

II. SYSTEMS AND SIMULATION DETAILS
For modeling high dimensional, spatiotemporal, chaotic systems, the L96 and KS systems have become widely used in the RC community [3,[16][17][18].The L96 [19,20] system is defined as where x(t) is the D-dimensional state vector of the system and F the forcing parameter.In this study, we set D = 40 and F = 5 resulting in a chaotic system for which we calculate a maximal Lyapunov exponent (MLE) of Λ max = 0.45 using Sprott's method of orbit separation (OS) [21].We use the fourth-order Runge-Kutta method [22] for the simulation, with a time step of ∆t = 0.05.The second model, widely used to model a variety of weakly turbulent fluid systems [23,24], is the KS system [25].Its PDE reads where the field u(x, t) is defined on some domain size L. In this study, we use a domain size of L = 22 with periodic boundary conditions u(x + L, t) = u(x, t) for all 0 ≤ x ≤ L. For the numerical treatment, the equations are discretized on a grid of D = 40 points, the same size as the L96 system, and numerically integrated, with a time step of ∆t = 0.5, using the fourth order time-stepping method ETDRK4 [26].Using OS we find a MLE of Λ max = 0.049.

III. RESERVOIR COMPUTING
We use a standard setup for reservoir computing.At the center of RC is a network, in the following called the reservoir A, created as a sparse random D r ×D r network with average node degree κ.After network generation, all its random, uniformly distributed connection strengths are scaled to have a predetermined fixed spectral radius ρ.The D in dimensional input x(t) interacts with the reservoir state r(t) through an input coupler which in our case is a sparse D r × D in matrix W in .Following [1], W in is created such that one element in each row is chosen uniformly between [−ω, ω] where ω is the input coupler scaling parameter.All other elements of W in are zero.The input data then connects with the reservoir state r(t) of the previous time step via activation function tanh(•) to advance the reservoir state by one step in time We choose a simple matrix as output coupler W out , whose elements are determined in the training phase via ridge regression where y T (t) is the D out dimensional target output.r is a nonlinear transformation of the reservoir state r here chosen to be r = [r, r 2 ] T = [r 1 , r 2 , ..., r Dr , r 2 1 , r 2 2 , ...r 2 Dr ] T .This blow-up of the reservoir states first introduced by Lu et al. [16] actually serves to break the symmetries in the reservoir equations [27].
To avoid that the arbitrary initial state of the reservoir influences the regression results, training only starts after a washout phase of 20000 time steps.
Once trained, the output y(t) can be calculated from the reservoir states r(t) as y(t) = W out r(t).When using RC for prediction, it can then be run autonomously by using the prediction of the previous time step y(t) as the input x pred (t) to calculate the next predicted time step y(t + ∆t).In this case one finds r(t + ∆t) = tanh (Ar(t) + W in x pred (t)) . (5)

A. Generalizing Local States
GLS is based on the LS approach proposed by Parlitz et al. [10] and used for RC prediction by Pathak et al. [3].While non-local implementations of RC algorithms use just one network to process all input data, LS and GLS partitions the input data into multiple subsets of smaller dimension each with their own reservoir.The reservoirs are then trained on and predict these subsets only.This provides an effective workaround for the curse of dimensionality by essentially parallelizing the prediction of one high-dimensional dataset using many lower-dimensional subsets.These subsets are in the following called neighborhoods.Each neighborhood itself consists of a number of core variables and neighbor variables and is assigned its own reservoir.Each reservoir in question then uses only the variables of the input making up its neighborhood to predict its core variables as accurately as possible.
As such, the input time series for the reservoir assigned to the i-th neighborhood is not the full D in dimensional input time series x anymore, but instead a slice unique to this neighborhood x i of dimension D i in ≤ D in .Similarly, the trained output of the i-th neighborhood y i is given by just its core variables and hence is an even smaller subset of the neighborhood of dimension D i out ≤ D i in .These different neighborhoods are depicted in Figure 1.Schematic depicting different neighborhoods and associated data flow.Yellow circles mark the core variables, green the neighbors and purple all other variables of the full input.(a) The neighborhood of simple, non-local RC.All dimensions of the input vector x(t) are core dimensions of the reservoir R. No neighbors or other reservoirs exist.(b) LS RC.The highlighted neighborhood of the reservoir has one core variable xi and four locally adjacent neighbors.These five variables are used as input for the reservoir Ri to predict the core variables future state yi.Many more reservoirs exist, each with their own neighborhood.(c) GLS RC.As in (b), the neighborhood has one core variable and four neighbors, with the crucial difference being that the neighbors do not have to be locally adjacent to the core.
Between each prediction step the neighborhoods need a new input which, as typically D i out < D i in , can only come from the other neighborhoods' prediction.Hence between each prediction step, a new input vector x pred is formed by the combined core variable prediction of all neighborhoods.As a consequence, each variable of the original input must be in one and only one neighborhood as a core variable.
Pathak et al. originally introduced their LS approach in the context of the KS system, a purely locally interacting system.As such, their neighborhoods were chosen to have only spatially adjacent cores surrounded by contiguous buffer regions of neighbors (see Figure 1b).
For systems where such a local interaction is not present, this scheme cannot be used.Nonetheless, the idea of locality in the sense of importance to a prediction is generalizable to something which we will call similarity.As the choice of neighborhoods is essentially arbitrary, this allows the creation of neighborhoods not by which variables are locally closest to the core variables, but by instead including the variables most similar to the cores as neighbors.Such a non-local GLS neighborhood is shown in Figure 1c).
The effectiveness of this procedure of course depends on an appropriate choice of neighborhoods, such that the information necessary to predict their core variables is present in their (non-local) neighbors.

B. Similarity Measures
Our first measure to estimate the similarity of the time series is the (linear) cross-correlation (CC) coefficient C xi,xj , where x i is the i-th variable of the time series x.
To transform the CC into a similarity measure (SM) we take its absolute value and associate a larger value as more similar.As such, both a high correlation as well as a high anti-correlation correspond to a high similarity.
Considering that chaotic time series are by their very nature nonlinear, we use one more measure that captures these nonlinear relationships, the mutual information (MI).As we are working with long (10 5 time steps), well behaved time series, we will implement the widely used binning method [15,28,29] as estimator for the MI.Akin to [15] we find empirically that for our time series of length T = 10 5 a choice of T /4 = 158 bins of equal size works well.We normalize the MI as described by Strehl et al. [30] leading to our MI SM where I (X i , X j ) is the MI while H (X i ) and H (X j ) are the Shannon entropies of X i and X j respectively.

C. Neighborhood Definitions
Once a SM has been chosen and calculated for all variables of the full input data, one needs to use it to create the neighborhoods.As the prediction output of each neighborhood is only given by its core variables, predicting these core variables as accurately as possible is most important.While a variety of choices are possible, we restrict ourselves to associate each neighborhood with exactly one core variable.In the LS case, the neighbors of the core are simply the variables spatially closest to that core.We will call this a spatial neighborhood (SN).In the GLS case, the exact analog is possible, where the neighbors of the core are the variables most similar to it as defined be the SM.We call the corresponding neighborhoods CC or MI neighborhoods, depending on the similarity measure used.
Lastly, it should be emphasized that the GLS method is in principle independent of not only the specific SM used, but also of the chosen prediction method.Many other choices for the SM, e.g. the transfer entropy, or the network, e.g.LSTMs, are conceivable.

V. EXPERIMENTS A. Methods and Parameters
To enable a fair comparison of the different neighborhood generation methods, the RC hyperparameters are optimized for the LS neighborhoods and then copied for the GLS neighborhoods without further adjustments.Furthermore, we added noise to all training data following Vlachas et al. [2].Without this noise we found the short term prediction accuracy to often be higher, but at the cost of an increased rate of failed realizations, and lower quality long term predictions in general.The added noise proved decisive in minimizing variance between network realizations and reducing the number of failed realizations, especially for the L96 system.Heuristically, we found normally distributed noise with standard deviation σ noise = 1% σ data , where σ data is the standard deviation of the training data, to be a sweet spot.The hyperparameters used for all RC are given in table I. Transient effects of the simulated data were discarded before any synchronization, training or prediction took place.Similarly, each reservoir training and prediction is preceded by 2000 synchronization steps.All reservoirs were trained for a total of 10 5 time steps using the noisy training data.
To calculate the SMs we use the same noisy data used to train the reservoirs.As a result all neighborhoods shown here are representative examples, but not uniform for all realizations.The neighborhoods are calculated for a single core and 18 neighbors, in the case of SN and MI neighborhoods, and 28 neighbors, in the case of a CC neighborhood.
These fixed neighborhood sizes for SN and MI were chosen to be similar to Pathak's original paper [3] which we found to be a good compromise between computation speed (small neighborhoods) and prediction accuracy (large neighborhoods).While for them this results in good predictions, for the CC SM this leads to essentially all predictions diverging.This is likely the result of the linear CC SM not recognizing the importance of the core's nearest neighbors in these systems.As such, the CC neighborhood size was increased to a total size of 29, the minimum where no predictions diverged.CC neighborhoods of different sizes for the L96 system are discussed in appendix A.

B. Predicting the L96 System
Example neighborhoods for the L96 systems are shown in Figure 2.While the SN neighborhoods are simply defined as a single core and its nearest neighbors, the CC and MI neighborhoods warrant a closer look.First and foremost, even though the MI neighborhoods were calculated dynamically, without directly using the knowledge of L96 being a locally interacting system, the resulting MI neighborhoods closely resemble the SN neighborhoods.The CC neighborhoods in contrast include many variables spatially far away from the core.100 distinct random network realizations are generated for each system-SM combination.They are then trained and used to predict the same 300 sections of 10 Lyapunov times length on the chaotic attractor of the L96 and KS systems.
To quantify the short term prediction accuracy, we define the normalized root mean square error (NRMSE) as where ŷ ∈ R D is the prediction at a single time-step, y ∈ R D the true signal and y max (y min ) the largest (smallest) value taken of any variable in the simulated data set.Short term prediction results are shown in Figure 3. Looking at the short term predictions of the L96 system, it is very striking that the averaged NRMSE of the SN and MI neighborhoods coincide more or less exactly, while the CC prediction is significantly worse.This performance drop is likely the result of the CC neighborhood's inclusion of many variables which are far from the, ostensibly most important, close region around the core (see Figure 2b).
It should be noted that the statistical stability that has been assessed here for the first time is remarkable.Often RC algorithms exhibit a much larger spread of prediction qualities between different random networks [31].This stability is partly attributable to finding the correct hyperparameters for the systems at hand, as suboptimally chosen hyperparameters often leave the best predictions intact while increasing the variance towards completely failed predictions drastically [17,31].However, empirically we found that the role of the noise added to the training data in achieving this stability, especially for the L96 system, cannot be understated either.
Using the OS method we can calculate the MLEs of the predictions as another characteristic to quantify long term prediction accuracy.For this purpose we let each of our reservoir realizations predict three distinct sections of the system attractor each 1000 Lyapunov times in length which are then used to calculate the MLEs shown in table II.
System SIM SN CC MI L96 0.45 0.43 ± 0.01 0.47 ± 0.02 0.44 ± 0.02 TABLE II.MLEs calculated via the OS method for the simulated (SIM) and predicted trajectories using the SN, CC and MI neighborhoods for the KS and L96 data.The errors represent the 1σ standard deviation between network realizations.
The MLEs of all three neighborhoods agree well with the MLEs calculated directly from the simulations.The details of the OS procedure used are described in appendix B.

C. Predicting the KS System
In preparation for the non-local system discussed in the following section, we predict and analyze the KS system as we have just done for the L96 system.
Example neighborhoods and short term prediction results for the KS system are shown in Figure 4   Looking at the short term predictions of the KS system shown in Figure 5, it is striking that the averaged NRMSE of all three neighborhoods coincide more or less exactly.This is remarkable especially when compared to the L96 system in which the CC neighborhoods performed significantly worse.
Using OS to calculate the MLE we find them again to agree excellently with the simulated MLE as depicted in table III.
System SIM SN CC MI KS 0.049 0.046 ± 0.004 0.048 ± 0.001 0.048 ± 0.001 TABLE III.Same as table II but for the KS instead of the L96 system.

D. Predicting a non-local System
To test the usefulness of GLS for non-locally interacting systems, we use the KS and L96 systems to artificially create such a non-locally interacting test system.As depicted in Figure 6 we do this by concatenating both systems and then randomly shuffling the 80 variables of the combined system.As this combined system now is a composite of two systems with different time steps, it does not have a well defined Lyapunov exponent any more.For the sake of consistency we nonetheless continue the time axis rescaling in terms of Lyapunov exponents.To do so we use the larger of the two system's time steps per Lyapunov time as calculated in table IV, hence at worst slightly underestimating our short term prediction results.
As before, we can also calculate the SN, CC and MI neighborhoods for this new concatenated system.The CC and MI neighborhoods are shown in Figure 7.The SN neighborhoods have been omitted from this Figure as they are, by definition, always the same.Fascinatingly, both the CC and MI neighborhoods in the combined but not shuffled systems look like they are composed of the individual system's neighborhoods.In fact, exactly this is the case as both the CC and the MI SMs are able to completely separate the KS and L96 systems.In the case of experimental data, such an analysis of the similarity structure can be highly informative as it exposes the underlying relationships between dimensions.
As before, we quantify the short term prediction accuracy using the NRMSE.The results are shown in vergence of all SN predictions.This is of course expected, considering the SN neighborhood's assume a locally interacting system which the shuffled system is not.Furthermore, the NRMSE of the CC and MI neighborhoods is the combination of the NRMSE of the individual systems.This, again, makes sense due to the perfect separation between the L96 and the KS system shown in Figure 7.As for the KS system, this results in the MI SM delivering significantly better results than the CC SM.
For the long term statistical analysis we cannot use the MLE as it is not well defined due to the different time step sizes of the sub-systems.Therefore we look at the probability distribution functions (PDFs) to roughly estimate the system climate as also used in [4].The details of the histogram generation are described in appendix C. CC and MI PDFs are depicted in Figure 9.The predicted PDFs show excellent agreement with the simulated data.Similarly to the MLE results, the worse short term predictive performance of the CC SM compared to the MI SM is not reflected in its climate reproduction quality.

VI. CONCLUSIONS
We have proposed and discussed a generalization of the concept of local states in the sense of using similarity measures derived from correlations among (instead of spatial distances between) time series.This offers a much more versatile approach for the prediction of highdimensional complex systems.
First, generalized local states can still make excellent predictions in the case of mixed systems, where local states are doomed to fail.Here it is worth noticing that the perfect neighborhood separation we found in the combined KS-L96 system (see Figure 7) suggests that generalized local states can be used to separate mixed data sets and to thus infer different origins of a set of heterogeneous time series, for which the generating processes are unknown.
Second, prediction of high-dimensional systems remains feasible, when for a number or for all time series no spatial information is available.This is more and more the typical case in real world applications, when analyzing such heterogeneous data sets ranging from remote sensing data, financial data, social media data (e.g.Instagram, Twitter) to nowadays also infection rates during the COVID-19 pandemic, etc. and a combination of the aforementioned.Current research explores applications of generalized local states based predictions in those cases.

Appendix A: CC Neighborhood Size
As discussed in the text, the CC neighborhood sizes were necessarily chosen to be significantly larger than the SN and MI neighborhood sizes.
This necessity is likely the result of the CC SM not recognizing the importance of the core's nearest neighbors in the L96 and KS systems.As such, the CC neighborhood size was increased to a total size of 29, the minimum where no predictions diverged and where the core's nearest neighbors were consistently included in its neighborhood.CC neighborhoods of size 19, 27 and 29 for the L96 system are depicted in Figure 10.While the true importance of each variable regarding the prediction of another is of course unknown, at least for the locally interacting L96 and KS systems studied here, the spatial nearest neighbors of the cores are critical for achieving a sensible prediction.
FIG. 1.Schematic depicting different neighborhoods and associated data flow.Yellow circles mark the core variables, green the neighbors and purple all other variables of the full input.(a) The neighborhood of simple, non-local RC.All dimensions of the input vector x(t) are core dimensions of the reservoir R. No neighbors or other reservoirs exist.(b) LS RC.The highlighted neighborhood of the reservoir has one core variable xi and four locally adjacent neighbors.These five variables are used as input for the reservoir Ri to predict the core variables future state yi.Many more reservoirs exist, each with their own neighborhood.(c) GLS RC.As in (b), the neighborhood has one core variable and four neighbors, with the crucial difference being that the neighbors do not have to be locally adjacent to the core.

FIG. 2 .
FIG. 2. Example Neighborhoods for the L96 system.Each row depicts one variable of the time series, while each column represents a neighborhood.Each neighborhood is defined by the core variable (yellow) and its neighbors (green).(a) SN neighborhoods.The SN neighborhood consists only of a core and its 18 nearest neighbors.(b) CC neighborhoods.Each CC neighborhood includes 28 neighbors chosen with the CC SM.(c) MI neighborhoods with 18 neighbors.

4 FIG. 3 .
FIG. 3. L96 system short term prediction comparisons.(a) Simulated data of the L96 system.(b-d) Exemplary difference of the RC prediction to the simulated data when using the (b) SN, (c) CC and (d) MI neighborhoods.(e) NRMSE of SN, CC, and MI prediction data averaged over first the 300 predicted sections and then the 100 network realizations.The error bands correspond to the 3σ standard deviation of the random network realizations.We multiply t by the MLE Λmax of the model, so that each unit on the horizontal axis represents one Lyapunov time.
and Figure 5 respectively.

FIG. 4 .
FIG. 4. Example neighborhoods for the KS system.Each row depicts one variable of the time series, while each column represents a neighborhood.Each neighborhood is defined by the core variable (yellow) and its neighbors (green).(a) SN, (b) CC, (c) MI neighborhoods.

3 FIG. 5 .
FIG. 5. KS system short term prediction comparisons.(a) Simulated KS data.(b-d) Exemplary error in the RC prediction when using the (b) SN, (c) CC and (d) MI neighborhoods.(e) NRMSE of SN, CC, and MI prediction data averaged over first the 300 predicted sections and then the 100 network realizations.The error bands correspond to the 3σ standard deviation of the random network realizations.

FIG. 6 .
FIG. 6. Concatenated System (a) Simulated data combined from the L96 (variables 1-40) and KS (variables 41-80) systems.(b) Simulated data as in (a) with the variables shuffled.The time axis is scaled by the MLE Λmax of the KS model.

FIG. 7 .
FIG. 7. Neighborhoods of the concatenated L96 and KS systems.(a) is the CC and (b) the MI neighborhoods for the concatenated, un-shuffled system.Variable 1-40 of the data used to compute these neighborhoods come from the L96 system with variables 41-80 originating from the KS simulation In (c) the CC and (d) the MI neighborhoods for the shuffled system are shown.

FIG. 8 .
FIG. 8. Shuffled system short term prediction comparison.(a) Simulated data of the combined shuffled KS-L96 system.(b-d) Exemplary difference of the RC prediction to the simulated data when using the (b) SN, (c) CC and (d) MI neighborhoods.The color scale of the diverging prediction was cut to the ensure legibility of the other plots (e) NRMSE of SN, CC, and MI prediction data averaged over first the 300 predicted sections and then the 100 network realizations.The error bands correspond to the 3σ standard deviation of the random network realizations.

8 FIG. 9 .
FIG. 9. PDFs for the longterm predictions of the combined and shuffled KS-L96 system.The PDFs are estimated via two superimposed 100 bin histograms of the simulated (gray) and the predicted (orange) data of the (a) CC or (b) MI Neighborhood.Each entry in the histogram is generated by the prediction of a single variable value, with the dataset coming from the prediction of three 1000 Lyapunov time long sequences.The error bars represent the 1σ standard deviation of the predicted histogram bins.

FIG. 10 .
FIG. 10.CC neighborhoods of the L96 system.(a) Total neighborhood size 19, the nearest neighbors of the core variables are not included in the neighborhood.(b) Neighborhood size 27.Only a couple nearest neighbors of the cores are missing.Nevertheless, predictions using these neighborhoods diverge regularly.(c) Neighborhood size 29.All nearest neighbors of the cores are included in the neighborhood.This is the smallest neighborhood size for which predictions consistently succeed.