Model scale versus domain knowledge in statistical forecasting of chaotic systems

Chaos and unpredictability are traditionally synonymous, yet large-scale machine learning methods recently have demonstrated a surprising ability to forecast chaotic systems well beyond typical predictability horizons. However, recent works disagree on whether specialized methods grounded in dynamical systems theory, such as reservoir computers or neural ordinary differential equations, outperform general-purpose large-scale learning methods such as transformers or recurrent neural networks. These prior studies perform comparisons on few individually-chosen chaotic systems, thereby precluding robust quantification of how statistical modeling choices and dynamical invariants of different chaotic systems jointly determine empirical predictability. Here, we perform the largest to-date comparative study of forecasting methods on the classical problem of forecasting chaos: we benchmark 24 state-of-the-art forecasting methods on a crowdsourced database of 135 low-dimensional systems with 17 forecast metrics. We find that large-scale, domain-agnostic forecasting methods consistently produce predictions that remain accurate up to two dozen Lyapunov times, thereby accessing a new long-horizon forecasting regime well beyond classical methods. We find that, in this regime, accuracy decorrelates with classical invariant measures of predictability like the Lyapunov exponent. However, in data-limited settings outside the long-horizon regime, we find that physics-based hybrid methods retain a comparative advantage due to their strong inductive biases.


I. INTRODUCTION
Chaos traditionally implies the butterfly effect: a small change in a system grows exponentially over time, complicating efforts to reliably forecast the system's longterm evolution.Predicting chaos therefore represents a longstanding problem at the interface of physics and computer science [1], even motivating early applications of artificial neural networks during the 1991 Santa Fe forecasting competition [2].Recent successes in statistical forecasting motivate revisiting this problem, by providing compelling examples of data-driven prediction of diverse systems such as cellular signaling pathways [3], hourly precipitation forecasts [4], active nematics [5], and tokamak plasma disruptions [6].
However, there is little consensus whether the practical success of emerging forecasting methods stems from fundamental advances in representing and parameterizing chaos, or simply from the availability of larger datasets, model capacities, and computational resources [7][8][9].Recent fundamental advances in representing chaos include works demonstrating that chaotic systems appear more linear when lifted to higher-dimensional representations than strictly necessary to describe their dynamicssuch as those implicitly learned by large, overparameterized learning methods [10][11][12].These works partly explain the recent emergence of reservoir computers as strong forecasting methods for dynamical systems [13][14][15][16][17][18][19]; these models use emergent properties of random networks to lift complex time series into a random fea- * wgilpin@utexas.eduture space, thereby simplifying learning at the expense of increasing model complexity [20,21].Other recentlyintroduced hybrid models directly encode dynamical constraints within their model formulation [22]; among these are neural ordinary differential equations [23], physicsinformed neural networks [24], and recurrent neural networks with domain-specific architectural modifications [25][26][27].Broadly, these physics-based models can be seen as containing inductive biases-architectural or modeling choices informed by knowledge that the target time series are drawn from dynamical systems-that effectively reduce the variance of potential fitted models in exchange for more efficient training [28].
In contrast, recent works by the computer science community advocate training large models with minimal domain-specific inductive biases [7,8,29].In controlled experiments, state-of-the art forecasting results have been achieved by large, overparametrized statistical learning models, such as transformers and hierarchical neural network architectures [9,30,31].In principle, these models leverage their scale and the availability of large time series datasets to overcome lack of domain knowledge, and they have demonstrated consistently improving performance in major time series forecasting competitions and benchmarks [32][33][34].When these domain-agnostic models have previously been applied to time series generated by chaotic dynamical systems, these models exhibit strong performance when sufficient training data is available [18,19].
Complicating comparison of domain-agnostic versus physics-based chaotic forecasting models is a lack of systematic comparison on the same datasets.Prior works have compared methods on a handful of well-known chaotic attractors like the Mackey-Glass or Lorenz equa- Each attractor is featurized using 747 invariant properties such as entropy, fractal dimension, et al., and then embedded in a two-dimensional vector space with UMAP.Contours denote 50% confidence intervals in each system's embedding across 500 random initial conditions and feature subsets; points denote centroids for each system.
tions, or have used small domain-specific time series datasets like weather or biomedical time series [9,27,[35][36][37][38].A larger set of representative systems, and a controlled comparison among methods, is necessary to disentangle the relationship between chaoticity, predictability, and forecasting model architecture, as well as to understand how the properties of different black-box machine learning methods interact with the systems they predict.
Here, we systematically quantify the relationship between chaos and empirical predictability in a largescale controlled experiment.We introduce a largescale dataset of 135 distinct low-dimensional chaotic attractors.For each system, we benchmark 24 forecasting methods using 17 forecast metrics that quantify both pointwise accuracy, and ability to capture invariant properties of the underlying attractors.When sufficient training data is available, we find that large, domain-agnostic forecasting models outperform physicsbased models at both short and long forecasting horizons.However, when limitations are imposed on computational resources or data availability, models with inductive biases-particularly reservoir computers-perform more strongly.We find that invariant properties of the underlying dynamical systems only weakly correlate with the ability of the best-performing forecast models to forecast them, suggesting that scale and dataset availability, rather than intrinsic dynamical properties, limit the current ability of large models to forecast chaos.

A. The chaotic attractors benchmark dataset
We introduce a benchmark dataset containing 135 low-dimensional differential equations describing known chaotic attractors [39].Originally curated from published works to include well-known systems such as the Lorenz, Rössler, and Chua attractors, since initial release the dataset has grown through crowdsourcing to include examples spanning diverse domains such as climatology, neuroscience, and astrophysics.Each dynamical system is aligned with respect to its dominant timescale and integration timestep using surrogate significance testing [40], and is annotated with calculations of its invariant properties such as the Lyapunov exponent spectrum, fractal dimension, and metric entropy (Appendix B).
Although each system has a distinct largest Lyapunov exponent (λ max ) indicating its putative chaoticity, some systems are closely related to each other.For example, the 19 members of the Sprott attractor subfamily (λ max ∈ [0.01, 1.1]) exhibit similar qualitative structure such as paired lobes, owing to the presence of predominantly quadratic nonlinearities in the governing differential equations [41,42].To identify relationships among attractors in our dataset, we convert each dynamical system into a high-dimensional vector by first generating a long trajectory, and then computing 747 characteristic mathematical signal properties such as the metric entropy, power spectral coefficients, Hurst exponents, and others that are invariant to the initial conditions and sampling rate [43,44] (Appendix G).We then use uniform manifold approximation and projection (UMAP) to visualize these high-dimensional vectors in a two-dimensional plane (Fig. 1) [45,46].The resulting space of chaotic systems shows clear structure, with the Sprott and other scroll-like subfamilies clustering together, while qualitatively distinct systems separate.These results suggest chaoticity λ max represents just one among many invariant properties that relate different chaotic systems, as λ max correlates only weakly with the embedding (ρ = 0.15 ± 0.03, bootstrapped Spearman rank-order coefficient).This visualization allows us to consider our dynamical systems dataset not as merely a list of differential equations, but rather as a set of points in a space of dynamical systems parametrized by their invariant properties B. Forecasting models evaluated.
We evaluate 24 statistical forecasting models across all 135 dynamical systems.We choose forecasting methods representing the broad diversity of methods available in the recent literature [7,47].Traditional methods include standard linear regression, autoregressive moving averages (ARIMA), exponential smoothing, Fourier mode extrapolation, boosted random forest models [40], and newly-introduced linear models that account for trends and distribution shift [48].Current state-of-the-art models for general time series forecasting are based on deep neural networks: the transformer model [30], long-shortterm-memory networks (LSTM), vanilla recurrent neural networks (RNN), temporal convolutional neural networks [37], and neural basis expansion/neural hierarchical interpolation (NBEATS/NHiTS) [31,49].The latter methods generate forecasts hierarchically by aggregating separate forecasts at distinct timescales (NBEATS), and can explicitly coarse-grain the time series to further reduce computational costs (NHiTS).We also consider hybrid physics-motivated methods such as neural ordinary differential equations [23], which approximate the continuous-time differential equation underlying time series; and echo-state networks (ESN), which train a linear model on a fixed "reservoir" of random nonlinearities [20,21].We include nonlinear vector autoregressive models (nVAR), a generalization of classical ESN that removes the need for an explicit reservoir-hence their designation as next-generation reservoir computers [13].In order to provide reference values for observed scalings, we also include several naive models that underfit the data, including naive mean and simple seasonal estimators, as well as a Kalman filter with internal state frozen at its value at the end of training data availability [32,50].

C. Forecasting benchmark design
Our dynamical systems dataset allows us to systematically compare the forecasting ability of different statistical forecasting methods across diverse dynamical systems.Forecasting dynamical systems from observations is a well-established field [40], and we structure our experiments as a standard long-term autoregressive forecasting task [51].For each D-dimensional dynamical system, we generate two time series arising from distinct initial conditions on the system's attractor y train (t ′ ), y test (t ′ ) ∈ R D , t ′ ∈ [0, T ], which we subdivide into past and future series at a time t * ∈ (0, T ).The model parameters are first fit using y train (t ′ ) on the interval t ′ ∈ [0, t * ], and the accuracy of the resulting predictions ŷtrain relative to the true values y train on the remaining interval t ′ ∈ (t * , T ] are used for model selection and hyperparameter tuning.Next, the final model from each model class is fit to y test (t ′ ) on the interval t ′ ∈ [0, t * ], and the resulting future predictions ŷtest (t ′ ) are compared against the as-yet unseen true values y test (t ′ ) on the interval t ′ ∈ (t * , T ], producing an error score ϵ ik (t) representing the performance of the i th forecasting model on the k th dynamical system at forecast horizon t ∈ (0, T − t * ] after the end of training data availability (note that t ≡ t ′ − t * ).We compute 17 different error metrics, including root mean-squared error, pointwise correlation, mutual information, and Granger causality.We report our results in the main text in terms of the symmetric mean absolute percent error (sMAPE) to its common use, conceptual simplicity, and correlation with other metrics [9,11,35,39,52].
Each forecasting method represents a class of possible models parameterized by choices made regarding architecture (model size, number of layers, units per layer, activation function) or training (optimization epochs, batch sizes).Such choices can strongly affect the performance of different methods [9,15,40], yet different forecasting methods do not necessarily have equivalent adjustable hyperparameters.For all methods, we initialize all hyperparameters at their default values used in the publications from which they were drawn.We use either the original authors' code when available, or widely-used reference implementations based on the original works [47].However, because prior works primarily feature individual systems like the Lorenz attractor or Mackey-Glass equations, we perform additional hyperparameter tuning for each dynamical system and forecasting method pair.Because different methods have different hyperparameters, we restrict hyperparameter tuning to the equivalent of the lookback window T ℓ for each model.T ℓ corresponds to the input size for deep neural networks, the number of features for random forests, the lag order of autoregressive models, the inverse leakage rate for reservoir computers, or the time lag for state space models.Importantly, while t * determines the total history length available for training, T ℓ < t * affects both training and inference; it effectively determines how many past timepoints are simultaneously used as inputs, which the model processes to output a prediction.
Several timescales characterize our benchmark design: the lookback window T ℓ is a tunable hyperparameter for each forecasting method that corresponds to the number of past timepoints seen simultaneously by a model at a given time; the history length t * ≥ T ℓ represents the total number of timepoints available to learn the model's parameters during training.t * is typically several times larger than T ℓ , because fitting model parameters requires supervised training on several subsets of length T ℓ drawn from the available history.The forecast horizon t rep-resents the number of unseen timepoints into the future that are predicted autoregressively; and the Lyapunov time λ −1 max is an invariant property of each distinct dynamical system representing the characteristic timescale over which forecasts are expected to lose accuracy due to the butterfly effect.We report our forecast results scaled by this quantity, in units of λ max t.
Our long-term forecasting experiments require ∼ 10 19 floating point operations for training, model selection, and hyperparameter tuning, a figure comparable to the scale of other recent large-scale machine learning benchmarks [53].We have previously validated our experiment design in a smaller-scale trial univariate study [39]; this new, larger-scale computational study applies the same experiment design to larger and more varied benchmark models, more dynamical systems, and an orderof-magnitude longer multivariate forecasts.For brevity, we highlight results for the best-performing forecasting models, and defer the full tabular results and alternative accuracy metrics to Appendix F and our open-source repository.

III. RESULTS
A. Large, domain-agnostic time series models effectively forecast diverse chaotic systems Our main results are summarized in Figure 2. We observe that modern statistical learning methods successfully forecast diverse chaotic systems, with the strongest methods consistently succeeding across diverse systems and forecasting horizons.We highlight the strong relative performance of machine learning models in Figure 2C, where NBEATS successfully forecasts the Mackey-Glass equation for ∼22 Lyapunov times without losing track of the global phase (see Fig. S1 for additional examples).Across all systems, the best-performing models achieve an average prediction time equal to 14 ± 2 λ −1 max .These results extend beyond the ∼ 10 λ −1 max reported for single systems in recent works [16,51], and sharply improves on the ∼ 5 λ −1 max typically achieved before the widespread adoption of large machine learning models [40].Our results underscore rapid progress since the ∼ 1 λ −1 max targeted in the original Santa Fe competition [2].The strongest-performing methods include NBEATS, NHiTS, transformers, and LSTM, which are all large models originally designed for generic sequential datasets, and which do not assume that input time series arise from a dynamical system.This observation suggests that more flexible, generic architectures may prove preferable for problems where some physical structure is present (e.g., analytic generating functions in the form of ordinary differential equations) but stronger domain knowledge (e.g., symmetries or symplecticity) is unavailable to further constrain learning.However, we note that the vanilla RNN and Temporal Convolutional Neural network exhibit unexpectedly weak performance, particularly given the lat- ter model's relationship to NHiTS and the strong performance of the LSTM [31,37,54].Inspection of individual forecasts shows that both models exhibit instability at long forecast horizons, in which they quickly diverge from the underlying attractor and rapidly accrue errors.The chaotic nature of the time series amplifies this weakness compared to traditional time series forecasting benchmarks.
Among the remaining forecasting models, we note that the naive baselines perform poorly, as expected.Because chaotic systems are ergodic and thus statistically stationary over long intervals, baseline models that include components for constant linear drift perform par-ticularly poorly: these models tend to fit a small but nonzero constant drift term given finite training data, which later causes the models to linearly diverge from the bounded attractor set during testing [32].This effect explains the uncharacteristically low performance of the frozen Kalman model and exponential smoothing, two model that often perform well in short-horizon forecasting tasks [32][33][34]50] where extrapolating the most recent monotonic trend plays a more significant role in determining aggregate accuracy.Conversely, among the classical models, the Fourier regression, ARIMA, and linear models perform comparatively well, due to their ability to model oscillating time series.These results underscore the unique aspects of our long-horizon chaos forecasting task, where models must correctly anticipate turning points and non-monotonic changes due to underlying attractor geometry.
The strong performance of NBEATS/NHiTS suggests that this model has structural features favoring the chaotic systems dataset.Prior work has shown that hierarchical forecasting methods can flexibly integrate information across multiple timescales in a manner inaccessible to classical statistical models [31].While chaotic systems exhibit continuous spectra and thus contain information relevant to forecasting at a variety of timescales, many systems exhibit topologically-preferred timescales such as unstable periodic orbits-like the "loops" on either side of the Lorenz attractor-that dominate the system's underlying measure [55], and which therefore may represent higher priority motifs for learning.Highly performant model architectures therefore likely contain implicit inductive biases that advantage them on chaotic systems relative to other time series.Consistent with this finding, we note that reservoir computers (nVAR/ESN) also perform strongly on the chaotic systems dataset [20,21], in agreement with prior observations for individual chaotic systems [13][14][15][16][17][18][19].
We contrast our results with recent reservoir computing studies that consider a subset of our 24 forecast methods and 135 chaotic systems [19,56,57].The comparatively strong performance of nVAR on our dataset likely stems from the quadratic nonlinearities used within the default set of fixed nonlinear kernels used by this method.Because most chaotic systems in our dataset feature predominantly quadratic nonlinearities, recent works show that the fully-trained nVAR can effectively learn to implement an exact multistep integrator for the dynamics [58].Moreover, while some recent works suggest that ESN/nVAR systematically outperform large-scale models on chaotic time series, we show below that larger domain-agnostic models regain their advantage when given sufficient training data.In contrast to prior works, we emphasize that our study performs model selection with respect to an equivalent hyperparameter found in all forecast methods, the lookback window T ℓ .When this parameter is left untuned at T ℓ = 1, machine learning methods like RNN or Transformers cannot use past context to inform their predictions, a condition analogous to creating a memoryless ESN/nVAR model by setting the leakage rate equal to one.This distinction potentially explains the weak relative performance of deep learning models compared to reservoir-based methods in recent chaotic forecasting benchmarks [19,56].Additionally, while some ESN variants essentially represent untrained RNN with randomly-initialized reservoirs, the particular ESN/nVAR models used here are taken from prior works that introduce additional architectural modifications like imposed sparsity and ridge regularization [13].These domain-specific modifications allow these models to outperform trained RNN on forecasting tasks, despite their architectural similarities.
While the utility of deep learning methods for forecasting general time series has been questioned [9,35], our results agree with recent benchmarks suggesting that large models strongly outperform classical forecasting methods on long-horizon forecasting tasks [30].We find that classical methods like exponential smoothing or ARIMA do not appear among the top models, implying that the size and diversity of our chaotic systems dataset, as well as the long duration of the forecasting task, require larger models with greater intrinsic capacity to represent complex nonlinear systems.Relative performance among models remains stable across two orders of magnitude in Lyapunov time, indicating that strong models better approximate the underlying propagator for the flow even at small forecasting horizons.Given the autoregressive nature of forecasting, an initial accuracy advantage compounds over time due to the exponential sensitivity of chaotic systems to early errors.However, in the supplementary material we show that the best-performing models also reproduce dynamical invariants such as Lyapunov exponent spectra and fractal dimensions better than other methods, suggesting that pointwise forecast accuracy is a prerequisite to accurately reconstructing dynamical manifolds.

B. The inductive biases of physics-based models
provide advantages in data or compute-limited settings .
While domain-agnostic time series methods perform well overall, we note that the different forecasting methods have different intrinsic model complexities and thus capacities.Fig. 3A shows the forecasting error at λ −1 max versus the computational walltime required to train each model on one central processing unit.Training walltime measures model efficiency, and we interpret it as a loose proxy for model complexity because model size and trainable parameter count are not directly quantifiable across highly-distinct and regularized architectures [53].We find that error and training time exhibit negative correlation (ρ = −0.31± 0.04, bootstrapped Spearman coefficient), which persists within most method groups.The best-performing machine learning models require consid-erable training times; in contrast, reservoir computers (both nVAR and ESN) exhibit competitive performance with two orders of magnitude less training time due to their linear structure.The strong performance of reservoir computers implicates an inductive bias for learning complex dynamical systems, due to their fixed kernel structure allowing them to more readily represent continuous spectra [21,[59][60][61].In particular, the nVAR model regresses a set of fixed nonlinearities that includes quadratic terms, making it possible for this model to learn an exact multistep integration scheme for many models in our dataset [58].In contrast, the transformer model has high intrinsic capacity and likely a low inductive bias for dynamical systems [30], and thus requires the most computational resources.

C. Invariant properties fail to explain long-term predictability of different chaotic systems
. This general tradeoff between performance and training difficulty motivates us to search for universal similarities across different forecasting methods.In Fig. 3B we compute the Spearman correlation between each method's instantaneous and time-averaged forecast error as a function of the forecast horizon, ρ k (t) where k indexes the forecast method.We find universal nonmonotonic behavior, in which nearly all methods exhibit peak correlation at one Lyapunov time λ −1 max (Fig 3B).This observation underscores that the largest Lyapunov exponent represents an appropriate timescale for comparing different dynamical systems, and that diverse forecasting models interact with this property in a shared manner.λ −1 max is sufficiently long to distinguish dynamical systems based on their invariant properties, but short enough that forecast methods do not accrue instabilities, large phase offsets, and other artifacts that saturate forecast error and mask intrinsic differences among systems.This observation aligns with recent work from statistical learning theory that draws analogies between trained learning models and disordered systems [62]: when models become most strongly coupled to the specific properties of individual systems, they exhibit peak correlation with their mean-field prediction.Additionally, we find the predictions of different large models become correlated at long forecasting horizons, suggesting that they agree on which particular dynamical systems prove hardest to forecast (Fig. 3D).Surprisingly, their performance only weakly correlates with λ −1 max and thus intrinsic chaoticity, with any correlation vanishing at long forecasting horizons.This unintuitive finding suggests either (a) that large forecasting models have not yet reached sufficient scale and refinement that their performance is bounded by λ max ; or (b) that λ max is not the only invariant quantity that governs the empirical long-term predictability of a dynamical system.
Taken together, these results suggest that the greater intrinsic capacity and flexibility of overparameterized domain-agnostic models allows them to access a new long-horizon forecasting regime, in which their forecast accuracy decorrelates with Lyapunov exponentand thus intrinsic chaoticity.To further investigate model complexity and performance, we next perform a series of experiments in which we titrate the history length t * , which determines the total amount of training data available to each method before generating a forecast (Fig. 3C).Unlike training time or parameter count, this quantity determines how effectively different methods utilize additional observations of a chaotic attractor.
As expected, all models asymptotically improve given additional training data.However, while the neural ordinary differential equation and nVAR models both exhibit initially steep drops-indicating favorable performance in the low-data regime-they fail to reach asymptotic errors lower than the larger-scale models that dominate when more data is available.However, NBEATS performs well in both the low-data and asymptotic regimes, suggesting that the neural basis expansions used in its earlier stages provide an inductive bias that dominates when less training data is available.This matches the intuition that performant models should require both high intrinsic capacity and inductive biases for dynamical systems.

IV. DISCUSSION
Our results show that recently-developed large, overparameterized statistical forecasting models efficiently leverage long-term observations of chaotic attractors, producing best-in-class forecasts that can remain accurate for up to two dozen Lyapunov times.Commonalities in predictions across highly distinct model classes suggest that performance arises primarily from model capacity and generalization ability, rather than specific architectural choices, and that performance at long prediction times is ultimately limited by a model's ability to learn long-term properties of a dynamical system's underlying attractor.The strong performance of generic large models echoes recent findings from other domains, and it represents an intuitive consequence of the "no free lunch" theorem for model selection [63,64].Nonetheless, our results are practically informative for forecasting real-world time series driven by underlying dynamical systems.In the absence of restrictions on data availability or training resources, large domain-agnostic models are likely to produce high-quality forecasts without the need for system-specific knowledge.However, in restricted settings, domain-specific methods such as reservoir computers exhibit the strongest performance relative to their computational requirements [13].
While certain methods perform particularly well in our experiments, we refrain from endorsing specific models to the detriment of others: our results may be specific to our chaotic systems dataset and, more importantly, the  recent literature contains a broad variety of new forecasting models, as well as infinite possible variations of each method due to hyperparameter and architectural choices, which could potentially exhibit comparable performance.Rather, we have chosen a representative set of forecasting models bridging different foci of the literature [7,35], and highlight general trends and the emerging strength of new models on the classical problem of forecasting chaos.
Our observation that λ max fails to fully determine whether a system remains empirically predictable over extended horizons introduces the possibility that our empirical forecasting results may instead correlate with other invariant properties of the different dynamical systems in our dataset, such as various measures of fractality and entropy [8,65], or covariant Lyapunov spectra [66][67][68].Such characterization could improve the interpretability of machine learning-based forecasting models, which ostensibly provide less insight into a time series's structure than classical methods [9,49].However, the strong empirical performance of machine learning suggests the potential for these methods to reveal new properties of nonlinear dynamics and, ultimately, new bounds on the intrinsic predictability and thus reducibility of chaotic systems.

V. CODE AVAILABILITY
All code used in this study is available online at https: //github.com/williamgilpin/dysts

VI. ACKNOWLEDGMENTS
We thank E. L. Florin and Yuanzhao Zhang for feedback on the manuscript.Computational resources for this study were provided by the Texas Advanced Computing Center (TACC) at The University of Texas at Austin.This project has been made possible in part by grant number DAF2023-329596 from the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation.

Appendix A. EXAMPLE FORECAST TRAJECTORIES
To complement our quantitative results, in Figure S1 we show particular examples of the most accurate forecasts at > 20λ −1 max for several attractors in our dataset.In all examples, the best forecast is produced by NBEATS/NHiTS, and the predictions both matche the global attractor geometry and point-to-point variations in the dynamics.We note that these particular forecasts represent especially predictable systems from within our dataset; however, if we define a prediction time based on the latest time when SMAPE < 50, the average valid forecast horizon of the best-performing model averaged across all 135 systems is equal to 13.9 ± 2.8 λ −1 max .If we instead define the prediction time based on the first doubling of accumulated pointwise errors, we find a horizon of 15 ± 5.3 λ −1 max .
Our dynamical systems dataset corresponds to an expanded version of our initial benchmark [39]; after the release of our initial benchmark, additional systems were suggested and submitted by users of the open-source code.Each dynamical system in our dataset has the form ẋ = f k (x), k ∈ {1, 2, ..., 135}, and non-autonomous systems are lifted by defining a time-like dynamical variable.For each dynamical system we compute the maximum Lyapunov exponent λ max , correlation fractal dimension, Kaplan-Yorke fractal dimension, and the multivariate multiscale entropy.Figure S2 shows distributions of each quantity across the 135 systems.
All systems in our database are timescale-aligned to have matching dominant timescales and sampling rates: for each system, we calculate the optimal integration timestep by computing the power spectrum, and then using random phase surrogates to identify the smallest significant frequency 1/t max (a lower bound on the Lipschitz constant) and the dominant significant frequency 1/t peak [40].The smallest frequency determines the integration timestep t max /10 for all numerical integration.In order to ensure that trajectories are timescale-aligned, after integration trajectories are resampled to 100 timepoints per t peak .Unless otherwise noted, we measure natural time in units of the dominant significant Fourier timescale t peak , though we rescale this quantity by the Lyapunov exponent when reporting results in the main text.We explore dependence of forecasting on time series granularity (sampling rate) and added stochasticity Forecast Horizon (λ max t) Forecast Horizon (λ max t) Forecast Horizon (λ max t) Forecast Horizon (λ max t)

Figure S1.
The most predictable systems in the dataset.Long-term forecasts for a range of systems in the dataset for which accurate forecasts are achieved for > 20λ −1 max .The predictions (blue) correspond to the bestperforming forecast model on that particular system, relative to a held-out true trajectory (gray).Predictions versus time for a single dynamical variable are underlaid, in order to emphasize pointwise accuracy.Each forecast is generated fully autoregressively, receiving initial conditions and their preceding values but no other input.
in previous work [39]; here we focus on a fixed finegranularity consisting of trajectories with 100 timepoints per dominant Fourier period t peak .
Properties of our dataset, invariant property calculation, and system selection procedure are described in detail in prior work [39], and all code used to prepare and analyze our dataset is included in our open-source code.
Having identified natural units of time for each system in terms of t peak , we next define the structure of our forecasting task in terms of these units.There are several timescales for our forecasting task: 1. Total trajectory length T .This corresponds to the total length of the trajectory computed for separate initial conditions for the train and test sets.This is the longest timescale used in our experiments, and it is later subdivided into history (for training model parameters) and the forecast horizon.
2. Total trajectory index t ′ .This time satisfies t ′ ∈ [0, T ], and it measures absolute time along the original trajectory.We use this notation in order to reserve the quantity t to refer to forecast horizon (time since training data is no longer available).6.The Lyapunov time λ −1 max .An invariant property distinct to each chaotic system, representing the characteristic timescale over which forecasts are expected to lose accuracy due to the butterfly effect.We scale many of our forecast results to this time, λ max t.
For all systems, we use a fixed duration of t * = 10 t peak for all training data.For validation, model selection, and hyperparameter tuning, we generate forecasts up to an additional t = 2 t peak after t * .For evaluating performance, we generate predictions and timepoint-wise forecast error metrics for all future horizons spanning t ∈ [0.01 − 50] t peak , thus allowing us to calculate a horizon-dependent error score ϵ ik (t).This testing dataset corresponds to a trajectory generated from different initial conditions than the trajectory used for training and validation (model selection).

Appendix D. FORECAST MODELS
EVALUATED.
We choose 24 forecasting methods spanning a variety of areas of the forecasting and dynamical systems literature, including state-of-the-art methods [9,49,69,70].Our forecasting methods can be grouped into several categories: A. Physics-based methods.
These models contain inductive biases for time series that would give them an advantage when forecasting time series generated by dynamical systems.
• Echo state networks (ESN).Our echo-state network implementations represent standard configurations used in recent works [19,71].We note that many variants of echo state networks and reservoir computers exist, which use different nonlinear activation functions, reservoir sizes and initializations, and other structural features.Much like architecture choices for deep learning models, it is infeasible to consider the space of all possible models, and so we default to standard architectural choices used in prior works.This includes a fixed reservoir size of 500 units, spectral radius of 0.99, reservoir connectivity of 0.1, input scaling of 1.0, input connectivity of 0.2, and a ridge regularizer of strength 10 −4 in the readout layer.During model selection, we evaluate the leakage rate hyperparameter across the range 0.01 to 1.2.

• Nonlinear Vector Autoregression (nVAR).
This model uses a single hidden layer to produce nonlinear combinations of the input features, which correspond to past time points.Regularized linear regression on these lifted features is used to generate forecasts.Recent work has shown that these models are equivalent to reservoir computers given a sufficient number and diversity of nonlinear features [13,72].Following prior works, we use default hyperparameter values, including a fixed reservoir delay of 100 (1 λ −1 max in our units), and apply a ridge regularizer of strength 10 −4 in the readout layer.During model selection, we evaluate the leakage rate parameter in the range 0.01 and 1.2.
• Neural Ordinary Differential Equations (nODE).These models use deep neural networks to represent learn the function f in an equation ẋ(t) = f (x, t) [23].The neural network takes in the initial state x(t 0 ) and produces the trajectory of x(t) over time via numerical integration.The network is trained by comparing the integrated trajectory with the true trajectory, and using the error to update the function's parameters using either the adjoint method or backpropagation-which become equivalent in continuous time [73].We use default hyperparameters, corresponding to a twolayer 30 unit residual network with SiLU activation, trained for 500 epochs with a learning rate of 0.01 and a batch size of 128 samples.During model selection, we provide the network a varying amount of previous timepoints, corresponding to tuning the lookback window hyperparameter across the range 2 − 120.

B. Deep learning methods.
These methods represent large models with many trainable parameters, which are usually trained iteratively using variants of gradient descent.The term "deep" traditionally refers to models with many trainable layers, though here we use it more informally to refer to overparameterized models with hierarchical structure, in contrast to classical machine learning methods.
• NBEATS.NBEATS (Neural basis expansion analysis for interpretable time series forecasting) is an artificial neural network architecture that uses a stack of fully-connected layers, and residual connections among layers, to model the past and future values of a time series [49].It does not rely on any time-series-specific components such as recurrent or convolutional layers, and can produce interpretable outputs by using either pre-specified or fully-trainable functions.Here, we use fullytrainable basis functions, in order to avoid making assumptions about the structure of the training data.We use default hyperparameters from a reference implementation, corresponding to 4 layers each containing 256 units and ReLU activation functions.During model selection, we evaluate the input length parameter in the range 2 − 50 timepoints.
• NHiTS.A model that builds upon NBEATS by using hierarchical interpolation and multi-rate input processing to specialize its predictions for different significant frequencies in the input signal [31].NHiTS consists of several fully-connected blocks with residual connections, which operate on downsampled versions of the time series before upsampling their forecasts back to the input shape.NHiTS has been shown to match or exceed the performance of NBEATS on standard reference datasets, while substantially reducing required computational resources.We use default hyperparameters from a reference implementation, corresponding to 2 layers each containing 256 units and ReLU activation functions.During model selection, we evaluate the input length parameter in the range 2 − 50 timepoints.
• Transformer.A type of artificial neural network architecture that use self-attention mechanisms to process sequential data, such as natural language or time series [74].Transformers can capture longterm dependencies and complex patterns in time series data by using positional encoding and multihead attention, leading to their widespread use for diverse problems such as machine translation, text summarization, and question answering [63].We use an architecture based on the Informer, a model recently shown to exhibit state-of-the-art performance on long-duration forecasting tasks [30].We use default hyperparameter values from a reference implementation corresponding to 4 attention heads, 3 encoder layers, and 512 nodes in the feedforward layers.During model selection, we evaluate the input length parameter in the range 2 − 50 timepoints.
• RNN.A neural network architecture that sequentially updates a hidden state based on a combination of the hidden state's previous value, and new inputs.RNN models are widely-used for sequential data, and represent a starting point for more recent, specialized architectures for time series and natural language processing.We use default hyperparameter values from a reference implementation containing 2 recurrent layers with 25 units.During model selection, we evaluate the input length parameter in the range 2 − 50 timepoints.
• LSTM.A type of recurrent neural network with specialized gating architecture that better allows incorporation of long-range information [75].The architecture prevents vanishing gradients during training, leading to strong performance on data assimilation, time series representation, and language modelling tasks.We use default hyperparameter values from a reference implementation containing 2 recurrent layers with 25 units.During model selection, we evaluate the input length parameter in the range 2 − 50 timepoints.
• Temporal Convolutional Network.A neural network architecture that uses one-dimensional convolutional layers with causal connections to capture the temporal dependencies [54].Unlike a traditional convolutional neural network, convolutions are strided to ensure that predictions only depend on past values of the time series.Our TCN consists of several stacks of dilated convolutional layers with residual skip connections that increase the receptive field while preserving the input length.
The same architecture recently achieved best-inclass performance for unsupervised time series featurization [76].We use default hyperparameter values from a reference implementation with a dilation factor of 2 and 3 convolutional filters of width 3 [37].During model selection, we evaluate the input length parameter in the range 2 − 50 timepoints.
C. Modified linear models.
These recently-proposed linear models represent ablations isolating different properties of the Transformer architecture [48], in order to identify which aspects of the architecture most strongly determine its performance on a given dataset.These methods represent common statistical forecasting techniques, which are widely-used but which are not overparameterized relative to the training dataset size.
• ARIMA.Autoregressive Integrated Moving Average (ARIMA) is a linear time series forecasting model that combines autoregression (AR), differencing (I), and moving average (MA) components to capture various patterns in the data, such as trends and seasonality [32].Specified by three hyperparameters (p, d, q), the ARIMA model is effective in situations where the underlying datagenerating process can be well-approximated by a linear combination of past values and errors.• Exponential Smoothing.A family of forecasting methods that apply exponentially decreasing weights to past observations, with more recent observations receiving higher weights [32,77].
• Theta.A univariate forecasting technique that transforms the time series into two new time series in which short-term and long-term fluctuations are amplified, respectively [78].The two modified time series are separately forecast using exponential smoothing, and the resulting predictions are then combined to generate a final forecast.We use a fixed theta value equal to 2, in order to differentiate the model from pure exponential smoothing.
• Four Theta.An extension of the Theta method, which performs additional smoothing and amplification transforms in order to isolate fluctuations over a greater range of timescales.We use a fixed theta value equal to 2, in order to differentiate the model from pure exponential smoothing.

E. Classical Machine learning.
These represent common regression methods used in machine learning, which are not overparameterized relative to the training dataset size.
• Linear Regression.A standard linear regression between past values of the time series and the next value.A weak, untuned ridge regularization term of amplitude 0.01 prevents the model's weights from diverging during training.During model selection, we evaluate the input size hyperparameter in the range of 2 − 50 previous timepoints.
• Fourier Transform Regression.A set of dominant frequencies and relative phases are identified from a time series' power spectrum.Forecasts are generated by repeating these frequency components indefinitely into the future, with an appropriate phase offset.
• Random Forest.A model that trains a series of decision tree regressors on individual subsets of the timepoints in the lookback window, and then averages their predictions to produce a consensus forecast [79].In contrast, the gradient-boosting model XGBoost trains an ensemble of trees sequentially, such that each tree prioritizes forecasting timepoints on which previous trees underperformed [80].We use default hyperparameters consisting of 100 trees, with each tree's depth allowed to grow until either all leaves are pure, or all leaves contain less than 2 samples.During model selection, we evaluate the number of input features hyperparameter in the range of 2 − 50 previous timepoints.
• XGBoost.A variant of the Random Forest, in which individual decision trees are trained sequentially to improve on earlier trees' outputs.XG-Boost approaches state-of-the-art performance on regression and classification of tabular data [81].Surprisingly, recent benchmarks suggest that XG-Boost can outperform artificial neural networks on time series forecasting [29], despite the model lacking structural inductive biases that exploit temporal correlations and continuity.During model selection, we evaluate the number of input features hyperparameter in the range of 2 − 50 previous timepoints.
These models isolate single properties of the training time series, and use it to generate minimal forecasts based on solely that attribute.Much like ablations used to evaluate machine learning models [82], these baselines provide minimal reference values against which to compare the results of more sophisticated methods.
• Naive Mean.A model that forecasts all future values of the time series as equal to the mean of the previous values.
• Naive Drift.A model that extracts the dominant linear trend from the previous values, and extrapolates that trend into the future.
• Naive Seasonal.A model that determines the dominant phase and timescale using the peak of the power spectrum, and then computes the average of a set of non-overlapping consecutive windows with width equal to the dominant timescale (after first applying the phase shift).The model then continues this repeated motif indefinitely to generate a singly-periodic forecast of future values.
• Unforced Kalman.A recursive Bayesian estimator that uses a set of linear equations to optimally estimate the state of a dynamic system, given noisy and partial observations [40].The Kalman filter recursively updates its state estimate by predicting the next state using the state-transition model and refining the estimate with new observations via the observation model.In a forecasting context, the transition matrix and other parameters are fit using the training data.To generate a forecast after the training history ends, the internal model state is held constant without updating in order to propagate the forecast autoregressively without additional input data.This naive approach provides a lower-bound on the performance of linear models directly fit on the training data [32,50,83].
For most deep learning models, we use reference implementations provided by the darts Python library [47].For other models, we use reference implementations in the statsmodels, Gluon-TS, sktime, and scikit-learn libraries.We use the authors' original codes for the neural ODE and reservoir computer models [23,70,84].All untuned hyperparameters (e.g.batch size, training epochs, model width, number of layers, etc) are kept at default values used in reference implementations.We tune hyperparameters separately for each forecasting model and dynamical system pair.For each trajectory, 10 full periods comprising 1000 timepoints are used to train the model, and 2 additional periods comprising 200 timepoints are used to estimate sMAPE errors for each combination of hyperparameters.Measured in terms of Lyapunov times λ −1 max , this corresponds to an average of 11 λ −1 max used for training each system.Though distinct forecasting methods have different hyperparameters and architectural details, we focus on tuning whichever hyperparameter most closely corresponds to the lookback window T ℓ for each method.This corresponds to the number of timepoints that the model sees simultaneously when generating a prediction for the next timepoint.In the context of the specific methods considered here, it corresponds to the "lag order" of traditional auto-regressive models like ARIMA.For classical machine learning methods and artificial neural networks, this corresponds to the "input size," or the number of features seen by the model simultaneously as input.For reservoir computers, it corresponds to the leakage rate, which indirectly influences the reservoir's spectral radius [85].
We treat classical models accepting a seasonality hyperparameter as multiple distinct models, and select the best-performing one.A standard grid search via time series cross validation on the training data determines the optimal hyperparameters separately for each method and dynamical system pair.
Caveats of model selection and justification.Complex forecasting models contain many hyperparameters and architectural choices, which parameterize an infinite set of possible models related to each specific method that we test.For example, deep neural networks offer many choices regarding number of layers, depth, learning rate, and batch size, while reservoir computers require choices regarding the random initialization scheme for the reservoir, reservoir size, and unit dynamics.In the spirit of previous large-scale benchmarks [86,87], we seek to perform comparable degrees of model selection for each of the 24 forecasting models that we consider.Thus the best-performing methods represent those that achieved strongest performance given the particular hyperparameters and value ranges we consider, and do not necessarily preclude other models from exhibiting comparable performance in certain regimes.Nonetheless, our results suggest that certain forecasting methods lend themselves to producing strong results with minimal fine-tuning.

Appendix F. ACCURACY METRICS
We consider 16 different forecast accuracy metrics, though we report our main text results in terms of one metric, sMAPE, due to its widespread use, favorable properties, and interpretability [9,11,35,39,52].We include all other results in tabular form in our open-source code repository.

A. Pointwise metrics
A summary of our results in terms of various pointwise accuracy metrics is shown in Figure S3, which pairs the accuracy of each method over gradually increasing forecasting horizons t with a snapshot of the distribution of scores at one Lyapunov time λ −1 max .Here, we describe each error metric in terms of a true trajectory y(t ′ ) and predicted trajectory ŷ(t ′ ).We suppress the subscripts i and k, which we use elsewhere to denote the performance of the i th forecasting model on the k th dynamical system.
1. Symmetric mean absolute percent error.The sMAPE scales the absolute percent error based on the magnitude of the two input time series.If y(t ′ ) is the true trajectory and ŷ(t ′ ) is a predicted trajectory sampled at a discrete set of values i ′ ∈ 1, 2, ...t, then the sMAPE is defined as This metric is widely-used in prior forecasting studies due to its compact range and interpretability [9,11,35,39,52].The argument t indicates that this instantaneous error signal depends on how far into the future a forecast is generated-the forecast horizon.When referring to the error associated with a specific forecasting method on a particular dynamical system, we use subscripts ϵ ik (t), with i indexing the forecast method and k indexing the dynamical system.
2. Spearman correlation.This metric measures the tendency of the true and forecasted time series to co-vary, independently of the relative magnitude of either series.We find that this metric performs well and cleanly differentiates the models, though it tends to zero over extended periods as the true time series and the forecast decorrelate.
The relative ordering of different forecasting models remains largely the same under this metric, though the nVAR model gains a slight relative advantage.
3. Normalized Root Mean Squared Error.The NRMSE rescales the RMSE relative to the average fluctuations within the time series, where y, ŷ ∈ R d and σ corresponds to the standard deviation of y(t ′ ).There is some ambiguity regarding the calculation of σ, which can be estimated from the full training data, the forecast interval, or external measurements.In order to mitigate sampling errors in the calculation of σ due to uneven forecast horizons, we estimate σ over the entire training dataset, making it a constant for each dynamical system.This metric is prone to divergence, like the other metrics considered here.Nonetheless, the relative ordering of the models remains consistent, though the neural ODE model fares slightly better under this metric.
4. Mean absolute scaled error.MASE is a recently-proposed metric that scales the mean absolute error relative to a naive forecast based on forward propagation of the most recent time series value [88], where m indexes the dimensions of a d-dimensional trajectory.We find that this metric is prone to divergence at long forecasting times, reducing its interpretability.However, it yields a nearly identical model ranking to the sMAPE metric.averaged across dimensions. 1 where m indexes the dimensions of a d-dimensional trajectory and ȳm = (1/t) t 0 y m (t ′ )dt ′ , or equivalently (1/t) t t ′ =1 y m for a discrete time series.This quantity has a well-defined upper bound at 1, and tends to smoothly decay.We find that the general ranking of models remains the same under this method, though it exhibits less dynamic range than sMAPE or Spearman correlation.
where m indexes the dimensions of a d-dimensional trajectory.This metric has been proposed as a stable metric for comparing forecasts across highly distinct time series, especially those with varying lengths [33].In practice, we find that the accrual of errors tends to cause the numerator to diverge at intermediate forecasting horizons.Nonetheless, we find that the ranking of models produced by this metric agrees with other metrics.

7.
Other metrics: mean squared error (MSE), mean absolute error (MAE), Pearson correlation, Kendall-Tau correlation, mean absolute percentage error (MAPE), Mutual Information (MI), mean absolute ranged relative error (MARRE), root mean squared logarithmic error (RMSLE), and Coefficient of Variation (CV).Apart from mutual information, these metrics have very similar properties to those highlighted here, and so we defer discussing them in detail.The mutual information calculation uses recently-introduced density estimation methods [89][90][91][92][93], and it can, in principle, capture nonlinear dependencies between a forecast and the true time series.However, we find it to be too sensitive to fluctuations in time series values to smoothly illustrate forecast quality.
We note, however, our recent work showing that all of these metrics empirically correlate strongly with sMAPE [39,52], and our forecasting results in terms of these metrics are available in our opensource code repository.

B. Reconstructing invariant measures
We further assess quality of forecasts by computing invariant properties of the learned chaotic attractors.In all cases, we compute the invariant property separately on the test dataset's true values and on the forecast generated by each method.For a given invariant measure with ground-truth value η k for the k th dynamical system, we compute an estimate ηik for the i th forecasting method using the full predicted time series over the maximum forecast horizon trajectory t ∈ (0, T − t * ).In Figure S4, we show the error |η k − ηik | for four different invariant quantities, across all 135 dynamical systems and 24 forecast methods.
1. Power spectrum.Chaotic systems have continuous power spectra due to their fractal nature.By comparing the power spectrum of the original and forecasted systems, we can assess whether forecasts predominantly capture dominant periodic trends in the time series, or whether they capture variation across scales.We find that the nonlinear vector autoregressive model, which is related to echo state networks, most strongly preserves the underlying spectrum.We hypothesize that the fixed nonlinearities in this method allow it to preserve spectral resolution relative to the fully-trainable neural networks, which learn nonlinear relationships from finite data and thus have finite resolution.Other strongly-performing models include the Fourier transform regression model (FFT), which explicitly preserves the power spectrum, as well as simple seasonality models like the Theta family, which model mixed seasonality.Among the general-purpose forecasting models, the related models NBEATS and NHiTS both perform competitively relative to other systems.
2. Fractal Dimension.The fractal dimension quantifies the space-filling properties of an attractor relative to filled solids or planes.We compute the correlation dimension using the robust, nonparametric Grassberger-Procaccia algorithm [94].
We find the NBEATS model and its lightweight variant, NHiTS, sharply outperform other methods, suggesting that these methods not only generate pointwise-accurate forecasts but also capture fundamental structural properties of the underlying attractor.We note that the neural ordinary differential equation performs strongly on this metric, despite performing comparatively weakly in absolute pointwise accuracy, suggesting that this model captures the attractor geometry even in the absence of pointwise accuracy.
3. Lyapunov Spectrum.We estimate the full Lyapunov exponent spectrum using standard techniques based on continuous QR factorization of a bundle of tangent vectors transported with the flow [95,96].We note that this algorithm fails when the forecasts are constant (as occurs in the naive models), leading the eigenvalue spectrum of the Jacobian matrix to become singular, and resulting in empty regions on our plot due to the spectrum becoming ill-defined.The nonlinear vector autoregressive model, which is related to reservoir computers, performs strongly on this task, suggesting that the high intrinsic capacity of these models allows them to capture the long-term structure of the underlying attractor.Interestingly, on this particular task, the LSTM model outperforms the NBEATS, in contrast to other tasks.
4. Largest Lyapunov exponent.Rather than consider the full Lyapunov spectrum, we instead compare only the largest Lyapunov exponent λ max of the reconstructed attractor.This quantity represents a putative measure of chaoticity for each dynamical system.We find that the neural ODE model performs surprisingly strongly on this task, and we speculate that the learning quality of these models (which are trained using adjoint optimization on numerically-integrated trajectories) leads the model to exhibit particular sensitivity to this invariant quantity.Otherwise, for this quantity, as in others, the best-performing models NBEATS and its relative NHiTS represent the strongest baselines.
Our results show that the forecasting methods with the highest pointwise accuracy generally also exhibit the highest accuracy in recovering global invariant properties of the underlying attractors.However, we note that nVAR performed better on this task, which may stem from its inductive bias on the chaotic systems dataset due to the appearance of quadratic kernels in its reservoir [13,58].Additionally, recent works have shown that both reservoir-based and traditional recurrent models are capable of learning diverse global dynamical properties from high-dimensional chaotic time series when given sufficient training data [66,68].In particular, these works highlight the ability of certain methods to learn covariant Lyapunov vectors, which encode geometric properties of transport in chaotic flows [66,67].

Appendix G. CORRELATION OF MODEL PERFORMANCE WITH INVARIANT PROPERTIES
We first investigate the degree to which each forecasting model tends to agree with other models regarding which dynamical systems are easier or harder to forecast.In Figure S5A we show the mutual correlation C i (t) for each model, while in Figure S5B we correlate model forecasts with the largest Lyapunov exponent λ max for each dynamical system.
We calculate this quantity as follows: Let ϵ ik (t) denote the error of the i th forecasting model on the k th dynamical system at future time t ∈ [0, T fut ] after the end of training data availability.If there are K total dynamical systems and N forecasting models, then to compute the Spearman correlation we first calculate the ordinal rank variables R ik (t) ∈ {1, 2, ..., N } for each ϵ ik (t).We may define the time-dependent Spearman correlation matrix C(t) as: where Ri (t) and Rj (t) are the time-dependent mean rank variables for i and j, respectively.
We define the mutual correlation C i (t) for each forecasting model by taking the sum of this quantity over rows, In Figure S5A, we find that the high-capacity models that lack structural priors for dynamical systems (NBEATS/NHiTS, transformer, LSTM, RNN) all remain correlated with each other across a wide range of forecasting horizons, and that their degree of mutual correlation increases at long forecasting horizons.In contrast, the echo state networks and neural ODE models, despite being performant overall in the forecasting task, disagree with the remaining models regarding which dynamical systems are easier or harder to forecast, particularly at long forecasting times.
In order to determine whether this effect can be explained by interactions between forecasting models and invariant properties of different dynamical systems, in Figure S5B, we correlate forecasting results with the largest Lyapunov exponent λ max for each system.While a weak correlation between forecast error and λ max holds at short forecasting horizons (< λ −1 max ), this correlation degrades at long forecasting horizons.This effect implies that the intrinsic chaoticity of different dynamical systems does not determine their empirical predictability over long forecasting horizons, at least when they are learned by modern large forecasting methods.

Appendix H. TIMING EXPERIMENTS
Timing experiments are performed separately for each forecasting method and dynamical system pair, for a total of 135 × 24 = 3240 experiments.Our experiment design, implementation, and interpretation closely follows standard methods used in previous works that benchmark time series methods [97].Timing experiments are performed on identical hardware restricted to a single CPU core per dataset and per run, with 32 GB RAM and an AMD EPYC 7763 Processor (2.45 GHz).Timing results are averaged over all 135 chaotic systems for the performance benchmarks.We note that the presence of GPU and various hardware-level optimizations could improve timing results for larger models with parallelizable training methods, but the underlying number of hardware operations would remain the same.We generate 40 trajectories emanating from distinct random initial conditions on the attractor.Each trajectory has a length 2000 timepoints, with a sampling rate 100 points per dominant period as determined by surrogate methods described above.For each system and trajectory, we compute 787 features based on known signal processing transforms using the tsfresh toolkit [43]; these include properties such as wavelet coefficients, Friedrich coefficients, and statistical cumulants.The full list of signal features can be found in the tsfresh publication, as well as its accompanying open-source codebase [43].We also calculate an additional 118 features typically used to characterize the complexity of dynamical time series using the neurokit2 toolkit [44,98].These include the various measures of entropy (e.g.sample entropy, permutation entropy), detrended fluctuation analysis, and Hurst exponents.The full list of signal features can be found in the neurokit2 publication, as well as its accompanying open-source codebase [44,98].Across all dynamical systems and trajectories, we subselect only the features with greater variance across different dynamical systems than across replicate trajectories within each system, resulting in a vector containing 747 informative features representing each dynamical system.
We embed each trajectory by using the uniform manifold approximation and projection (UMAP) nonlinear embedding technique, which seeks to represent the original, high-dimensional feature vectors in a lowerdimensional space that preserves local topology and nearest neighbors [45].For each dynamical system, we compute the median position in the reduced-order UMAP space as the exemplary position of that particular system.

Appendix J. INVARIANT PROPERTY CALCULATION
For each system in the dataset, the largest Lyapunov exponent λ max is calculated using multiple methods in order to ensure accuracy.The first method continuously calculates the Jacobian of the dynamical equation along a trajectory using its analytical expression, which can be quickly calculated using modern automatic differentiation software [99].Given a time-dependent Jacobian matrix along a trajectory, the full Lyapunov spectrum can be calculated using composition of the instantaneous QR factorization of the matrix at each timepoint [95,96].As a secondary check, we also calculate the largest Lyapunov exponent using a naive method based purely on the classical definition of the Lyapunov exponent, λ max = lim t→∞ log(||x(t) − x ′ (t)|| 2 /||x(0) − x ′ (0)|| 2 ), where x(0) = x 0 and x ′ (0) = x 0 (1 + ξ).In practice, we set ||ξ|| 2 ≤ 10 −14 , a quantity well above the floating-point precision floor, and stop the calculation when ||x(t) − x ′ (t)|| 2 > 10 −8 .
For both the QR factorization and naive methods, we perform two calculations for consistency: a long-time calculation in which we average the Lyapunov exponent estimates along 1000 distinct trajectories each of length 5000 timepoints, and a short-time calculation in which we average Lyapunov exponent estimates along 50000 distinct trajectories each of length 100.The two calculations agree to within two significant figures, validat-ing that integration steps and durations are sufficient to reach the ergodic limit of each dynamical system.
Having obtained the largest Lyapunov exponents, we estimate other invariant properties using the same procedures.The correlation fractal dimension is estimated for each attractor using the Grassberger-Procaccia algorithm [94], and the multiscale entropy is estimated using algorithms for multivariate time series [100].The Kaplan-Yorke fractal dimension is estimated directly from the Lyapunov exponent spectrum.

Figure 1 .
Figure1.A space of low-dimensional chaotic systems.(A) A dataset of 135 distinct low-dimensional chaotic systems, colored by largest Lyapunov exponent (λmax).(B) A nonlinear embedding of the attractors.Each attractor is featurized using 747 invariant properties such as entropy, fractal dimension, et al., and then embedded in a two-dimensional vector space with UMAP.Contours denote 50% confidence intervals in each system's embedding across 500 random initial conditions and feature subsets; points denote centroids for each system.

Figure 2 .
Figure 2. Statistical forecasting across an ensemble of chaotic systems.(A) The average error of 24 forecasting methods ⟨ϵ ik (t)⟩ k as a function of Lyapunov time, averaged across 135 distinct chaotic systems.Colors denote highperforming models with properties of particular interest.(B) Distributions of the forecast errors when t = λ −1 max .(C) The predictions of the best-performing forecast model (red), relative to a held-out true trajectory from the Mackey-Glass model (gray) at short and long forecasting horizons.
Horizon (λ max t) Forecast Horizon (λ max t) Correlation with λ max History length (λ max t)

Figure 3 .
Figure 3. Universal relationships among forecasting methods.(A) Error versus training time at fixed forecast horizon t = λ −1 max for all models.Bar lengths denote standard deviations along principal axes, with angle indicating Spearman correlation within each model group in order to detect Simpson's paradox.The underlaid linear fit indicates the overall correlation ρ = −0.31± 0.04.(B) Median relative correlation of each forecasting method with its average prediction, across different forecast horizons.(C) Median model errors at t = λ −1 max as the amount of history data increases.(D) Correlation of forecasting error with Lyapunov exponent λmax as a function of forecasting horizon.All error bars correspond to 95% confidence intervals, and colors match methods from previous figures.

Figure S2 .
Figure S2.Invariant properties of the dynamical systems dataset.Histograms showing the number of systems (out of 135 total) with invariant properties in each bin range.Across the systems, the maximum Lyapunov exponent λmax = 0.24 ± 0.31, the correlation dimension D2 = 1.84 ± 0.22, the Kaplan-Yorke dimension DKY = 2.19 ± 0.45, and the multiscale entropy E = 0.78 ± 0.15.
Appendix E. HYPERPARAMETER TUNING, VALIDATION, AND MODEL SELECTION.

5 .Forecast
Figure S3.Alternative error metrics for the forecasting experiments.Many normalized metrics exhibit sensitivity to fluctuations in their denominators, resulting in values exceeding the axes bounds as the errors diverge.Colors correspond to highlighted models from the main text.

6 .
Weighted absolute percent error.WAPE scales the absolute percent error based on the ab-solute values of the true time series, 1 d

Figure S4 .
Figure S4.The root-mean-squared error between the true values of various dynamical properties, and the forecasts generated by different forecasting methods.The Lyapunov exponent calculation is ill-defined for constant-valued naive forecasts, leading to missing bars in the Lyapunov exponent spectrum and largest Lyapunov exponent comparisons.Colors correspond to highlighted models from the main text.

Figure S5 .
Figure S5.Dynamical system properties and forecasting performance.(A) Mutual correlation of each forecast model with other models, as a function of forecasting horizons.Higher values indicate that the particular model tends to agree with other models regarding which dynamical systems are easier or harder to forecast at that forecasting horizon.(B) Correlation of forecasting performance of each model with the largest Lyapunov exponent of each dynamical system, as a function of forecasting horizon.Colors correspond to highlighted models from the main text, and error bars represent 95% confidence intervals.
Appendix I. EMBEDDING THE DYNAMICAL SYSTEMS DATASET

3 .
Lookback window T ℓ .The number of datapoints seen simultaneously by a forecast method at any given time during both training and forecasting.This quantity is akin to the number of features seen simultaneously by a linear regression or random forest model, or the number of lags in autoregressive state space models.Physically, T ℓ represents the amount of context information from past values that a fully-trained model can access when generating a forecast.4. History length t * .This corresponds to the total amount of training data t * ≥ T ℓ , or the total number of unique past values seen by the model across all training iterations and epochs.For our experiments, t * is equal to 10 periods of t peak .
* ).In discrete time, t = 1 represents the next timepoint immediately after timepoints 1, 2, ..., t * have elapsed.Note that t is defined as the time since the end of training data availability, t ≡ t ′ − t * .For our experiments, t ∈ (0, T − t * ], which is equivalent to t ′ ≡ (t * , T ].