Extracting Interpretable Physical Parameters from Spatiotemporal Systems using Unsupervised Learning
arXiv:1907.06011v2 [physics.comp-ph] 19 Aug 2019

Experimental data is often affected by uncontrolled variables that make analysis and interpretation difficult. For spatiotemporal systems, this problem is further exacerbated by their intricate dynamics. Modern machine learning methods are particularly well-suited for analyzing and modeling complex datasets, but to be effective in science, the result needs to be interpretable. We demonstrate an unsupervised learning technique for extracting interpretable physical parameters from noisy spatiotemporal data and for building a transferable model of the system. In particular, we implement a physics-informed architecture based on variational autoencoders that is designed for analyzing systems governed by partial differential equations (PDEs). The architecture is trained end-to-end and extracts latent parameters that parameterize the dynamics of a learned predictive model for the system. To test our method, we train our model on simulated data from a variety of PDEs with varying dynamical parameters that act as uncontrolled variables. Numerical experiments show that our method can accurately identify relevant parameters and extract them from raw and even noisy spatiotemporal data (tested with roughly 10% added noise). These extracted parameters correlate well (linearly with $R^2>0.95$) with the ground truth physical parameters used to generate the datasets. Our method for discovering interpretable latent parameters in spatiotemporal systems will allow us to better analyze and understand real-world phenomena and datasets, which often have unknown and uncontrolled variables that alter the system dynamics and cause varying behaviors that are difficult to disentangle.


I. INTRODUCTION
Physics has traditionally relied upon human ingenuity to identify key variables, discover physical laws, and model dynamical systems. With the recent explosion of available data coupled with advances in machine learning, fundamentally new methods of discovery are now possible. However, a major issue with applying these novel techniques to scientific and industrial applications is their interpretability: neural networks and deep learning are often seen as inherently black box methods. To make progress, we must incorporate scientific domain knowledge into our network architecture design and algorithms without sacrificing the flexibility provided by deep learning models [1][2][3]. In this work, we show that we can leverage unsupervised learning techniques in a physics-informed architecture to build models that learn to both identify relevant interpretable parameters and perform prediction. Because relevant parameters are necessary for predictive success, the two tasks of extracting parameters and creating a predictive model are closely linked, and we exploit this relationship to do both using a single architecture.
We focus our attention on spatiotemporal systems with dynamics governed by partial differential equations (PDEs). These systems are ubiquitous in nature and include physical phenomena in fluid dynamics and electromagnetism. Recently, there has been significant interest in the data-driven analysis and discovery of PDEs [4][5][6][7][8]. However, previous works on PDE discovery and parameter extraction often assume the entire dataset is governed by the same dynamics and also explicitly provide the key dynamical parameters. In more complex scenarios, we may have limited control over the systems that we are studying and yet still want to model them and extract relevant physical features from their dynamics. If we attempt to study such systems by naïvely training a predictive model, we are likely to fail in one of two ways: first, a single explicit PDE model will be unable to capture the variations in the dynamics caused by uncontrolled variables in the data, and second, a generic deep learning method for time-series prediction, such as long short-term memory (LSTM)-based models [9,10], will not be interpretable or provide any physical insight and may also produce unphysical solutions at later times due to overfitting. To avoid these problems and gain a better understanding of the physical system, we must first identify important parameters or variables that are uncontrolled and that change in the raw data, producing varying dynamics. Recent work on learning parametric PDEs has taken steps toward addressing this issue [11]. We will use an unsupervised learning method to automate the process of determining the relevant parameters that control the system dynamics and constructing a predictive model. We propose a model architecture (Figure 1(a)) based on variational autoencoders (VAEs) [12].
VAEs are widely used for dimensionality reduction and unsupervised learning tasks [13] and have been shown to be effective for studying a wide variety of physical phenomena, e.g. discovering simple representations of systems in classical and quantum mechanics [14], modeling protein folding and molecular dynamics [15,16], and identifying condensed matter phase transitions [17,18]. In terms of interpretability, the VAE architecture and its derivatives have also been shown to disentangle independent factors of variation [19][20][21].
Our architecture consists of an encoder (Figure 1(b)) that extracts physical parameters characterizing the system dynamics and a decoder (Figure 1(c)) that acts as a predictive model and propagates an initial condition forward in time given the extracted parameters. This differs from a traditional VAE due to the additional initial condition provided to the decoder, allowing the encoder to focus on extracting latent parameters that parameterize the dynamics of the system rather than the physical state. Our architecture can be thought of as a conditional VAE [22], although only the decoder is conditional. While similar architectures have been recently proposed for physical systems such as interacting particles [23] and moving objects [24], our model is specifically designed to study spatiotemporal phenomena, which have a continuous set of degrees of freedom.
To take advantage of the spatiotemporal structure of PDE-governed systems, we use convolutional layers, commonly used for image recognition tasks [25,26], which efficiently represent local features in both the encoder and decoder portions of our architecture. The translation invariance of the convolutions also allows us to train on small patches of data and then evaluate on arbitrarily large systems with arbitrary boundary conditions. In addition, we construct the decoder to efficiently parameterize the dynamics of a PDE by making its convolutional kernels and biases a function of the extracted latent parameters using a fully-connected neural network. In this way, the latent parameters directly control the local propagation of the physical states in the decoder, resulting in more stable predictions for the future propagation of the system and a more physical encoding of the dynamics. Using convolutions to represent PDE propagation in the decoder is also analogous to applying a finite difference approximation [4].
To demonstrate the capabilities of this approach, we test our method on simulated data from PDE models for chaotic wave dynamics, optical nonlinearities, and convection and diffusion (Section III). These numerical experiments show that our method can accurately identify and extract relevant physical parameters that characterize variations in the observed dynamics of a spatiotemporal system, while at the same time constructing a flexible and transferable predictive model. We further show that the parameter extraction is robust to noisy data and can still be effective for chaotic systems where accurate prediction is difficult. The goal of this approach is to provide an additional tool for studying complex spatiotemporal systems when there are unknown and uncontrolled variables present.

II. MODEL ARCHITECTURE
Our model ( Figure 1) has an encoder-decoder architecture based on a variational autoencoder (VAE) [12,19]. Given a dataset of time-series from a spatiotemporal system, the dynamics encoder (DE) extracts latent parameters which parameterize the varying dynamics in the dataset. These latent parameters are then used by the propagating decoder (PD) to simulate the system given an initial condition and boundary conditions. During training, the model is optimized to match the output of the PD to a time-series example from the dataset. The goal of the VAE architecture is to allow the PD to push the DE to extract useful and informative latent parameters.
For training, the network requires time-series data that are grouped in pairs: the input series $\{x_t\}_{t=0}^{T_x}$ is the input to the DE, and the target series $\{y_t\}_{t=0}^{T_y}$ provides the initial condition $y_0$ and the training targets $\{y_t\}_{t=1}^{T_y}$ for the PD. Each pair of time-series $(\{x_t\}_{t=0}^{T_x}, \{y_t\}_{t=0}^{T_y})$ must follow the same dynamics and thus correspond to the same latent parameters. We can construct such a dataset from the raw data by cropping each original time-series to produce a pair of smaller time-series. This cropping can be performed randomly in both the time and space dimensions, which allows the network to train on a reduced system size while still making use of all the available data.
In our examples, we also choose to crop dynamically during training, akin to data augmentation methods used in image recognition [26].
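As an illustration, the random cropping of a raw time-series into an (input, target) pair sharing the same dynamics can be sketched in a few lines of numpy. The function name is our own, and the crop sizes follow the 1D Kuramoto-Sivashinsky settings listed in Appendix B:

```python
import numpy as np

def crop_pair(series, in_t=64, in_x=94, tgt_t=64, tgt_x=76, rng=None):
    """Randomly crop one raw time-series (shape [T, X]) into an
    (input, target) pair drawn from the same underlying dynamics."""
    rng = rng or np.random.default_rng()
    T, X = series.shape

    def crop(t_len, x_len):
        t0 = rng.integers(0, T - t_len + 1)
        x0 = rng.integers(0, X - x_len + 1)
        return series[t0:t0 + t_len, x0:x0 + x_len]

    return crop(in_t, in_x), crop(tgt_t, tgt_x)

raw = np.random.randn(256, 256)      # stand-in for one simulated example
x_in, y_tgt = crop_pair(raw)         # input series and target series
```

Performing this cropping freshly at every training step acts as data augmentation, since each epoch sees different patches of the same raw examples.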
In detail, the DE network takes the full input series $\{x_t\}_{t=0}^{T_x}$ and outputs a mean $\mu_z$ and a variance $\sigma_z^2$ representing a normal distribution for each latent parameter $z$. During training, each $z$ is sampled from its corresponding distribution $\mathcal{N}(\mu_z, \sigma_z^2)$ using the VAE reparameterization trick: $z = \mu_z + \sigma_z \epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$ is independently sampled for every training example during each training step. During evaluation, we simply take $z = \mu_z$. These parameters $z$, along with an initial condition (the first state $y_0$ in the target series), are then used by the PD network to predict the future propagation of the system. The predicted propagation series $\{\hat{y}_t\}_{t=1}^{T}$ produced by the PD can be computed up to an arbitrary future time $T$, where $T = T_y$ during training to match the target series. By providing the PD with an initial condition, we allow the DE to focus on encoding parameters that characterize the dynamics of the data rather than encoding a particular state of the system. This is further reinforced by training on randomly cropped pairs of time-series as well as by the VAE regularization term (Appendix D).
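The reparameterization trick above can be sketched directly in numpy (the function name is illustrative; in the actual model the sampling sits between two neural networks so that gradients flow through $\mu_z$ and $\sigma_z$):

```python
import numpy as np

def sample_latent(mu_z, log_var_z, training=True, rng=None):
    """VAE reparameterization: z = mu + sigma * eps during training,
    z = mu at evaluation time."""
    if not training:
        return mu_z
    rng = rng or np.random.default_rng(0)
    sigma_z = np.exp(0.5 * log_var_z)
    eps = rng.standard_normal(mu_z.shape)  # eps ~ N(0, 1), fresh each step
    return mu_z + sigma_z * eps

mu = np.array([0.3, -1.2])
log_var = np.array([-2.0, -4.0])
z_train = sample_latent(mu, log_var)                  # stochastic sample
z_eval = sample_latent(mu, log_var, training=False)   # deterministic mean
```

Parameterizing the variance by its logarithm, as the DE does (Appendix A), keeps $\sigma_z^2$ positive without any constraint on the network output.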
The full architecture is trained end-to-end using a mean-squared error loss between the predicted propagation series $\{\hat{y}_t\}_{t=1}^{T_y}$ from the PD and the target series $\{y_t\}_{t=1}^{T_y}$. We also add the VAE regularization loss, the KL divergence term $D_{\mathrm{KL}}(\mathcal{N}(\mu_z, \sigma_z^2) \,\|\, \mathcal{N}(0, 1))$, which encourages each latent parameter distribution $\mathcal{N}(\mu_z, \sigma_z^2)$ generated by the DE to approach the standard normal prior distribution $\mathcal{N}(0, 1)$. The total loss function is
$$\mathcal{L} = \frac{1}{T_y} \sum_{t=1}^{T_y} \left\| \hat{y}_t - y_t \right\|^2 + \beta \sum_z D_{\mathrm{KL}}\big(\mathcal{N}(\mu_z, \sigma_z^2) \,\|\, \mathcal{N}(0, 1)\big),$$
where $T_y$ is the length of the target series without the initial $y_0$, and $\beta$ is a regularization hyperparameter which we tune for each dataset. The $\beta$ parameter is key to learning disentangled representations [19,20]. By using the VAE sampling method and regularizer, we compel the model to learn independent and interpretable latent parameters (Appendix D). For additional training details, see Appendix B.
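Using the closed form $D_{\mathrm{KL}}(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0,1)) = \tfrac{1}{2}(\mu^2 + \sigma^2 - \ln\sigma^2 - 1)$, the total loss can be sketched as follows (a minimal numpy illustration, not the training code itself):

```python
import numpy as np

def vae_loss(y_pred, y_true, mu_z, log_var_z, beta):
    """Mean-squared prediction error plus beta-weighted KL divergence
    D_KL(N(mu, sigma^2) || N(0, 1)) summed over the latent parameters."""
    mse = np.mean((y_pred - y_true) ** 2)
    kl = 0.5 * np.sum(mu_z ** 2 + np.exp(log_var_z) - log_var_z - 1.0)
    return mse + beta * kl

# For mu = 0 and sigma = 1 the KL term vanishes and only the MSE remains.
loss = vae_loss(np.zeros(10), np.zeros(10),
                mu_z=np.zeros(2), log_var_z=np.zeros(2), beta=0.02)
```

A collapsed latent parameter sits exactly at this zero of the KL term, which is why the regularizer drives unneeded latent dimensions to the prior.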

A. Dynamics Encoder (DE)
The dynamics encoder (DE) network is designed to take advantage of existing symmetry or structure in the time-series data. We implement the DE as a deep convolutional network in both the time and space dimensions to allow the network to efficiently extract relevant features. To ensure the DE can handle arbitrary system sizes and time-series lengths, the architecture only contains convolutional layers with a weighted average applied at the output to obtain the latent parameters. The weights for the final averaging are also learned by the network and interpreted as variances so that the overall variance can also be computed. In this way, the network is able to focus on areas of the input series that are most important for estimating the latent parameters, akin to a visual attention mechanism [27].
Explicitly, we first compute local latent distribution parameters
$$\mu_z(\mathbf{r}, t) = f_{\mathrm{DE},\mu}(\{x_t\}), \qquad \sigma_z^2(\mathbf{r}, t) = \exp\!\big[f_{\mathrm{DE},\sigma^2}(\{x_t\})\big],$$
where $f_{\mathrm{DE},\mu}$ and $f_{\mathrm{DE},\sigma^2}$ are multilayer convolutional networks in space and time (see Appendix A for details). Then, instead of using a fully-connected layer to compute the final mean $\mu_z$ and variance $\sigma_z^2$ for each latent parameter, we combine the local quantities by performing an inverse-variance weighted average using weights given by
$$w(\mathbf{r}, t) = \frac{\sigma_z^{-2}(\mathbf{r}, t)}{\sum_{\mathbf{r}', t'} \sigma_z^{-2}(\mathbf{r}', t')}$$
to obtain
$$\mu_z = \sum_{\mathbf{r}, t} w(\mathbf{r}, t)\, \mu_z(\mathbf{r}, t), \qquad \sigma_z^2 = C^d \Big( \sum_{\mathbf{r}, t} \sigma_z^{-2}(\mathbf{r}, t) \Big)^{-1},$$
where $C$ is a constant chosen to correct for the correlations between nearby points and $d$ is the total number of time and space dimensions of the input. This averaging serves two purposes: it allows the DE to scale to arbitrary system sizes and geometries, and it improves the parameter extraction by placing greater emphasis on regions of high confidence. Assuming that non-overlapping patches should be treated as independent while overlapping patches are increasingly correlated, we take $C = 31$ to be the linear size of the receptive field of the convolutional networks $f_{\mathrm{DE},\mu}$ and $f_{\mathrm{DE},\sigma^2}$.
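The inverse-variance weighted averaging can be sketched in numpy (an illustration under the equations above, with the local estimates flattened into 1D arrays; the function name is our own):

```python
import numpy as np

def combine_local_estimates(mu_local, var_local, C=31, d=2):
    """Inverse-variance weighted average of local latent estimates.
    The factor C^d inflates the final variance to correct for the
    correlations between overlapping receptive fields (C = linear
    receptive-field size, d = number of space-time dimensions)."""
    w = 1.0 / var_local
    mu_z = np.sum(w * mu_local) / np.sum(w)
    var_z = C ** d / np.sum(w)
    return mu_z, var_z

# Confident estimates (low variance) dominate the average.
mu_local = np.array([1.0, 1.0, 3.0])
var_local = np.array([1.0, 1.0, 2.0])
mu_z, var_z = combine_local_estimates(mu_local, var_local, C=31, d=2)
```

Without the $C^d$ correction, averaging over many highly correlated patches would spuriously shrink the reported variance.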

B. Propagating Decoder (PD)
The propagating decoder (PD) network is designed as a predictive model for spatiotemporal systems. We structure the PD as a multilayer convolutional network $f_{\mathrm{PD}}$ (see Appendix A for details) with a residual skip connection that maps a state $\hat{y}_t$ to the next state in the time-series $\hat{y}_{t+1}$. Thus, each propagation step is given by
$$\hat{y}_{t+1} = \hat{y}_t + f_{\mathrm{PD}}(\hat{y}_t; z).$$
To generate the predicted propagation series $\{\hat{y}_t\}_{t=1}^{T_y}$ for comparison with the target series $\{y_t\}_{t=1}^{T_y}$, we begin with the initial condition $\hat{y}_0 = y_0$ and then recursively apply the PD to propagate $\hat{y}_t \to \hat{y}_{t+1}$, forming a recurrent network. The PD acts as a physics simulator or, in this case, a PDE integrator with explicit time stepping. This architecture reflects both the spatial and temporal structure of PDE-governed systems and incorporates boundary conditions by properly padding $\hat{y}_t$ at each time step before applying the convolutional layers. For example, to use periodic boundary conditions during evaluation, we apply periodic padding at each time step. During training, we treat the edges of the target series (cropped from a full training example) as a boundary condition by using the spatial boundary of each $y_t$ in the target series to pad the corresponding state $\hat{y}_t$ before propagation.
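The recursive rollout with periodic padding can be illustrated with a toy single-convolution stand-in for $f_{\mathrm{PD}}$ (the real decoder has three dynamic convolutional layers; this sketch only shows the residual update and the boundary handling):

```python
import numpy as np

def pd_step(y, kernel, bias):
    """One residual propagation step y_{t+1} = y_t + f_PD(y_t), using a
    single 1D convolution with periodic (circular) padding as a toy
    stand-in for the full multilayer network."""
    p = len(kernel) // 2
    y_pad = np.concatenate([y[-p:], y, y[:p]])    # periodic boundary
    update = np.convolve(y_pad, kernel, mode="valid") + bias
    return y + update

def propagate(y0, kernel, bias, steps):
    """Recursively apply the decoder to roll out a predicted series."""
    ys = [y0]
    for _ in range(steps):
        ys.append(pd_step(ys[-1], kernel, bias))
    return np.stack(ys[1:])

y0 = np.sin(np.linspace(0, 2 * np.pi, 64, endpoint=False))
traj = propagate(y0, kernel=np.array([0.1, -0.2, 0.1]), bias=0.0, steps=10)
```

Swapping the padding function (periodic, fixed-boundary, etc.) changes the boundary condition without touching the learned weights, which is what makes the trained propagator transferable.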
Unlike the convolutional layers of the DE, the kernels and biases for f PD are not directly trained. Instead, the kernel weights and biases are a function of the latent parameters z. This type of layer is known as a dynamic convolution [28] or a cross-convolution [24]. Each convolutional kernel and corresponding bias is constructed by a separate fully-connected latent-to-kernel network that maps the latent parameters to each kernel or bias, forming a multiplicative connection in the PD. Thus, we can interpret the PD convolutional kernels and biases as encoding the dynamics of the system parameterized by z.
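A minimal sketch of one latent-to-kernel network, following the two-layer fully-connected structure described in Appendix A (the weight shapes and names here are illustrative):

```python
import numpy as np

def latent_to_kernel(z, W1, b1, W2, b2):
    """Two-layer fully-connected network mapping the latent parameters z
    to one flattened convolutional kernel; the PD has one such network
    per kernel and per bias."""
    h = np.maximum(0.0, W1 @ z + b1)   # ReLU hidden layer
    return W2 @ h + b2                 # unactivated output = kernel weights

rng = np.random.default_rng(0)
z = np.array([0.5])                               # one latent parameter
W1, b1 = rng.standard_normal((8, 1)), np.zeros(8)
W2, b2 = rng.standard_normal((5, 8)), np.zeros(5)
kernel = latent_to_kernel(z, W1, b1, W2, b2)      # a size-5 1D kernel
```

Because the kernels themselves are functions of $z$, changing the latent parameters directly changes the local propagation rule, which is the multiplicative connection described above.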

III. SIMULATED PDE DATASETS
To study the ability of our architecture to perform parameter extraction, we generate simulated datasets of spatiotemporal systems that have spatially uniform, time-independent local dynamics in a box with periodic boundary conditions, i.e. we consider PDEs of the form
$$\frac{\partial u}{\partial t} = F\big(u, \nabla u, \nabla^2 u, \ldots\big),$$
where $F$ is a general space- and time-independent, nonlinear local operator acting on $u$. This allows us to design an optimized, physics-informed model architecture. We test our model on a variety of spatiotemporal systems by creating the following three datasets that cover linear, nonlinear, and chaotic dynamics as well as giving both 1D and 2D examples. For details on the generation of the simulated datasets, see Appendix C.

A. 1D Kuramoto-Sivashinsky
The Kuramoto-Sivashinsky equation,
$$\frac{\partial u}{\partial t} = -\frac{\partial^2 u}{\partial x^2} - \gamma \frac{\partial^4 u}{\partial x^4} - u \frac{\partial u}{\partial x},$$
is a nonlinear scalar wave equation with a viscosity damping parameter $\gamma$. This is a key example of a chaotic PDE [29] due to the instability caused by the negative second derivative term and was originally derived to model laminar flame fronts [30,31]. The 1D Kuramoto-Sivashinsky dataset has a training set with 5,000 examples and a test set with 10,000 examples.

B. 1D Nonlinear Schrödinger
The nonlinear Schrödinger equation,
$$i \frac{\partial \psi}{\partial t} = -\frac{1}{2} \frac{\partial^2 \psi}{\partial x^2} + \kappa |\psi|^2 \psi,$$
is a complex scalar wave equation with a cubic nonlinearity controlled by the coefficient $\kappa$. In our data, we represent $\psi = u_1 + i u_2$ as a real two-component vector $u = (u_1, u_2)$. This equation can be used to model the evolution of wave-packets in nonlinear optics and is known to exhibit soliton solutions [32]. The 1D nonlinear Schrödinger dataset has a training set with 5,000 examples and a test set with 10,000 examples.

C. 2D Convection-Diffusion
The 2D convection-diffusion equation,
$$\frac{\partial u}{\partial t} = D \nabla^2 u - \mathbf{v} \cdot \nabla u,$$
is a linear scalar wave equation consisting of a diffusion term with constant $D$ and a velocity-dependent convection term with velocity field $\mathbf{v}$. The equation describes a diffusing quantity that is also affected by the flow or drift of the underlying medium.

IV. NUMERICAL EXPERIMENTS
We perform numerical experiments by training the model on both the original noiseless datasets and the datasets with added $\sigma = 0.1$ Gaussian noise, corresponding to 10% noise relative to the initial conditions. Then, we evaluate the trained models on the full-size noiseless test set examples (no cropping). By also training on noisy datasets, we test the robustness of our method and show the effect of noise on the extracted parameters and prediction performance.

A. Parameter Extraction
During training, the model will only use a minimal set of latent parameters to encode the variation in the dynamics and will align each latent parameter in this subspace with an independent factor of variation due to the VAE regularization [19,20]. Intuitively, the regularization encourages each latent parameter to independently collapse to a non-informative prior $\mathcal{N}(0, 1)$, and so the model prefers to minimize its use of the latent parameters and maintain their independence (Appendix D). Therefore, the number of latent parameters provided to the model is not critical as long as it is greater than the number of independent factors of variation. In our experiments, we allow the model to use five latent parameters. Because the 1D datasets have only one varying physical parameter and the 2D dataset has three varying physical parameters, the trained model will only make use of one or three latent parameters, respectively, and the rest will collapse to the prior.

FIG. 2. In each case, the model has correctly identified the number of relevant parameters (one for the Kuramoto-Sivashinsky and nonlinear Schrödinger datasets, and three for the convection-diffusion dataset), which are characterized by high variance in $\mu_z$ and a low mean $\sigma_z^2$. These relevant latent parameters correspond to interpretable physical parameters that parameterize the dynamics of the system. The other latent parameters, with near-zero variance in $\mu_z$ and high mean $\sigma_z^2$, have collapsed to the prior and are non-informative. Note that while one would expect these collapsed parameters to have $\sigma_z^2 = 1$, the actual extracted $\sigma_z^2$ for the collapsed non-informative parameters is less than one. This is an artifact of evaluating the model on a larger system size and longer time-series than the cropped patches used during training (see Appendix E for details).
We can determine the number of relevant latent parameters and empirically verify this claim by examining the statistics of the extracted distribution parameters $\mu_z$ and $\sigma_z^2$ from the dynamics encoder (DE) for each dataset. A latent parameter that is useful to the propagating decoder (PD) for predicting the target series will have high variance in $\mu_z$ and a low mean $\sigma_z^2$, implying that the extracted parameter is precise and informative. A parameter which has collapsed to the prior and is non-informative will have low variance in $\mu_z$ and high mean $\sigma_z^2$. These statistics indeed show that our model can correctly determine the number of relevant parameters for each dataset (Figure 2). In real applications, we will not have access to the ground truth physical parameters, so we must rely on these metrics to identify the relevant parameters extracted by the model.
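This diagnostic can be sketched as a simple numpy check over the extracted $(\mu_z, \sigma_z^2)$ statistics; the thresholds below are illustrative choices, not values from the paper:

```python
import numpy as np

def relevant_latents(mu_z, var_z, var_threshold=0.1, mean_threshold=0.5):
    """Flag informative latent dimensions: high variance of mu_z across
    the dataset and low mean extracted variance sigma_z^2.
    mu_z, var_z: arrays of shape [num_examples, num_latents]."""
    return (np.var(mu_z, axis=0) > var_threshold) & \
           (np.mean(var_z, axis=0) < mean_threshold)

rng = np.random.default_rng(1)
n = 1000
mu_z = np.stack([rng.standard_normal(n),           # informative latent
                 0.01 * rng.standard_normal(n)],   # collapsed latent
                axis=1)
var_z = np.stack([0.05 * np.ones(n), np.ones(n)], axis=1)
mask = relevant_latents(mu_z, var_z)
```

Note that this uses only quantities the model itself produces, so it works even when no ground truth parameters are available.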
To evaluate the performance of our parameter extraction method, we compare the extracted latent parameters from the model with the true physical parameters used to generate our simulated datasets: the viscosity damping parameter $\gamma$ for the 1D Kuramoto-Sivashinsky dataset, the nonlinearity parameter $\kappa$ for the 1D nonlinear Schrödinger dataset, and the diffusion constant $D$ and drift velocity components $v_x$, $v_y$ for the 2D convection-diffusion dataset. Because these simulated physical parameters are drawn from normal distributions (Appendix C), we expect the relevant latent parameters, which have prior distribution $\mathcal{N}(0, 1)$, to be linearly related to the true parameters (Appendix D). For real experimental systems, this is also a reasonable assumption for uncontrolled variables because natural parameters tend to be normally distributed due to the central limit theorem. We assess the quality of the extracted parameters by linearly fitting the relevant latent parameters with the ground truth physical parameters to obtain parameter predictions and $R^2$ correlation coefficients. Our numerical experiments show excellent parameter extraction on all three datasets (Figure 3) with $R^2 > 0.95$ for all parameters (Table I) and no degradation in performance with added Gaussian $\sigma = 0.1$ noise. However, we do observe some nonlinear behavior at the edges of the parameter range, likely due to data sparsity in those regions.
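The linear-fit evaluation can be sketched in numpy; the data below are synthetic stand-ins (a latent parameter linearly related to a $\gamma$-like quantity plus small noise), not results from the paper:

```python
import numpy as np

def linear_fit_r2(z, theta):
    """Least-squares linear fit theta ~ a*z + b between an extracted
    latent parameter and a ground-truth physical parameter, returning
    the predictions and the R^2 correlation coefficient."""
    a, b = np.polyfit(z, theta, 1)
    theta_pred = a * z + b
    ss_res = np.sum((theta - theta_pred) ** 2)
    ss_tot = np.sum((theta - np.mean(theta)) ** 2)
    return theta_pred, 1.0 - ss_res / ss_tot

rng = np.random.default_rng(2)
z = rng.standard_normal(500)                                # latent values
gamma = 1.0 + 0.125 * z + 0.01 * rng.standard_normal(500)   # near-linear
_, r2 = linear_fit_r2(z, gamma)
```

Since the latent prior is $\mathcal{N}(0,1)$ and the true parameters are (truncated) normal, a first-degree fit is the natural choice here; curvature at the edges of the parameter range would show up as a depressed $R^2$.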
Looking more closely at the results for the three-parameter convection-diffusion dataset, we see that the trained model correctly extracts three relevant latent parameters: one latent parameter corresponds to the diffusion constant and the remaining two-dimensional latent subspace corresponds to the drift velocity vector (Table II). In particular, the model learns a rotated representation of the drift velocity vector as a two-dimensional latent subspace due to the inherent rotational symmetry of the dynamics (Appendix E), so we can recover the physical velocity components by fitting a rotation of this latent subspace (Table II).

B. Prediction Performance
In addition to testing parameter extraction, we evaluate the prediction performance of the trained models on their corresponding test sets. Due to training speed and stability considerations, our models are initially trained with a PD architecture containing only 16 hidden channels (Appendix A). To show the potential for further model refinement, we fix the weights of the DEs trained on the original noiseless datasets and then train additional PDs, each with an expanded 64 hidden channels. These refined predictive models perform better than the original predictive models used during end-to-end training. For comparison, the datasets are also evaluated with a stiff exponential integrator for semilinear differential equations (ETDRK4 [33]) using a second-order finite difference discretization on the same time and space meshes provided in the datasets. Note that using a non-stiff integrator fails on many of the examples in the datasets, so a stiff integrator is required. The models trained on the 1D Kuramoto-Sivashinsky and 1D nonlinear Schrödinger datasets both perform reasonably when compared with the traditional finite difference method (Figures 5(a), 5(b)), with the model trained on the Kuramoto-Sivashinsky dataset maintaining a higher accuracy than its traditional counterpart. The prediction error of the 2D convection-diffusion model is dominated by the uncertainty in the parameter extraction, so the prediction performance is comparable to a finite difference exponential integrator with similar noise in the PDE parameters (Figure 5(c)). For the models trained on datasets with added $\sigma = 0.1$ noise, we do see some negative impact on prediction performance (Figure 5), but there is no effect on the parameter extraction quality (Tables I, II).
The refined models, trained on the noiseless datasets, demonstrate that the PD-the predictive network-can be improved independently of the DE-the parameter extraction network. Moreover, the solutions generated by these models remain stable and physically reasonable well beyond the number of time steps propagated during training, suggesting that the models have indeed learned physically meaningful propagators of the PDE-governed systems (Figure 4).

V. DISCUSSION
We have developed a general unsupervised learning method for extracting unknown dynamical parameters from noisy spatiotemporal data without detailed knowledge of the underlying system or the exact form of the governing PDE. While we do not explicitly extract the governing PDE, our method provides a set of tunable relevant parameters, which characterize the system dynamics in independent and physically interpretable ways, coupled with a highly transferable predictive model.
The flexibility and robustness of our model comes from using a generic physics-informed neural network model for nonlinear PDEs. The interpretability of the resulting extracted parameters is a result of the variational autoencoder (VAE) training and regularization (Appendix D) as well as the inductive biases imposed by our physics-informed network design. By using appropriate spatial averaging in the dynamics encoder (DE) and dynamic convolutions in the propagating decoder (PD), we ensure that both the parameter extraction and the propagation prediction from our model are physically motivated and generalizable to arbitrary system sizes and geometries. The dynamic convolutions, in particular, are an important physical inductive bias for encouraging the model to learn latent parameters which govern the propagation dynamics. As a result, the learned parameter-to-kernel mappings in the trained predictive model are fully transferable, which we can demonstrate by evaluating the predictive model without retraining on a different set of boundary conditions (see Appendix F).
Our strategy for modeling spatiotemporal systems is to retain the expressiveness of a neural network model while imposing general constraints-such as locality through using convolutional layers-to help the network learn more efficiently. For particular applications, this could also include spatial symmetries [34][35][36], e.g. properly transforming fluid flow vectors, as well as additional symmetries of the internal degrees of freedom, e.g. the global phase of the nonlinear Schrödinger equation. These architecture-based constraints encourage the model to learn physically relevant representations and can be tailored to individual applications, allowing us to incorporate domain knowledge into the model. This also lets us use datasets that are much smaller than is traditionally required by deep learning methods. In fact, we have been able to successfully train the model with as few as 10 examples (Appendix G). Additionally, the model's robustness to noise is a powerful feature of deep learning methods and provides a promising avenue for studying dynamical systems in a data-driven fashion [37].
We believe that this unsupervised learning method can be further adapted in the future for a more general class of spatiotemporal systems by incorporating spatially inhomogeneous latent parameters and will also be able to make use of data with incomplete physical states by inferring the missing information. Parameter extraction will improve following the rapid advances in unsupervised learning and disentangled representations, e.g. through a deeper theoretical understanding of the β-VAE [20,38,39] and alternative formulations [21,40]. Physics-informed inductive biases, however, will remain the key ingredient for ensuring the representations are interpretable [41]. The predictive model will also likely achieve better accuracies by using more sophisticated architectures, which have been shown to perform extremely well on even chaotic PDEs [42], and by explicitly incorporating differentiable PDE solvers with derivatives computed by the adjoint method [43]. This would allow the PD network to focus on encoding the PDE rather than also learning a stable integration method, and thus improve the interpretability of the model.
The ultimate goal of this work is to provide additional insight into complex spatiotemporal dynamics using a data-driven approach. Our method is an example of a new machine learning tool for studying the physics of spatiotemporal systems with an emphasis on interpretability.

Appendix A: Model Architecture Details

For the 1D datasets, the dynamics encoder (DE) uses 2D convolutions with output channel sizes (4, 16, 64, 64, 5), linear kernel sizes (3, 3, 3, 3, 1), and dilation factors (1, 2, 4, 8, 1). For the 2D dataset, the DE uses 3D convolutions with output channel sizes (8, 64, 64, 64, 5) with the same kernel sizes and dilation factors. The $f_{\mathrm{DE},\mu}$ and $f_{\mathrm{DE},\sigma^2}$ networks share the same convolution weights for the first four layers and have distinct final layers to produce $\mu$ and $\log \sigma^2$. The final output channel size is determined by the number of latent parameters; in our tests, we use five latent parameters.
For the 1D datasets, the propagating decoder (PD) architecture uses three 1D dynamic convolutional layers with output channel sizes (16, 16, data channel size), linear kernel size 5, and periodic padding. For the 2D datasets, the PD uses three 2D dynamic convolutional layers with the same output channel sizes, kernels, and padding. The refined models increase the number of hidden channels in the PD from 16 to 64, resulting in output channel sizes (64, 64, data channel size).
The latent-to-kernel network, which maps the latent parameters to each kernel or bias in the PD, consists of two fully-connected layers, i.e. one hidden layer. For the 1D datasets, the hidden layers have size (4 × input channel size × output channel size) for kernels and (4 × output channel size) for biases, where the input and output channel sizes refer to the corresponding dynamic convolution in the PD. For the 2D dataset, the hidden layers have size (16 × input channel size × output channel size) for kernels and (4 × output channel size) for biases.
Our architecture uses ReLU activations throughout except for the unactivated output layers of the DE, PD, and latent-to-kernel networks. The output of the PD convolutional network $f_{\mathrm{PD}}$ uses a tanh activation with a learnable scaling parameter ($x \to \lambda \tanh(x/\lambda)$ with learnable parameter $\lambda$ initialized to 1) to stabilize the recurrent architecture. The network $f_{\mathrm{PD}}$ also has an additional fixed multiplicative pre-factor set to $10^{-6}$ to improve the initial training stability.
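The scaled tanh output activation can be sketched directly; it behaves like the identity for small inputs but saturates at $\pm\lambda$, which bounds each residual update during the recurrent rollout:

```python
import numpy as np

def scaled_tanh(x, lam=1.0):
    """Output activation x -> lam * tanh(x / lam): near-identity for
    small inputs, saturating at +/- lam for large ones, which keeps the
    recurrent rollout of the decoder from diverging."""
    return lam * np.tanh(x / lam)

small = scaled_tanh(np.array([1e-3]), lam=1.0)   # approximately identity
large = scaled_tanh(np.array([100.0]), lam=1.0)  # saturates near lam
```

Because $\lambda$ is learnable, the network can widen the linear regime if the dynamics require larger per-step updates.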

Appendix B: Training Details
All models are trained using batch size 50 and the Adam optimizer [45] with learning rate $10^{-3}$. Models for the 1D datasets and the noisy 2D dataset were trained for 2,000 epochs; the model for the noiseless 2D dataset was trained for 4,000 epochs; and the refined models were all trained for 2,000 epochs. The VAE regularization hyperparameter is set to $\beta = 0.02$ for the 1D datasets and $\beta = 10^{-4}$ for the 2D dataset. All hyperparameter tuning was done using the training set for validation.
For the 1D Kuramoto-Sivashinsky dataset, we train the model using a 64 × 94 crop (in the time and space dimensions, respectively) for the input series and a 64 × 76 crop for the target series. For the 1D nonlinear Schrödinger dataset, we train using a 64 × 94 crop for the input series and a 32 × 76 crop for the target series. For the 2D convection-diffusion dataset, we train using a 45 × 62 × 62 crop (in the one time and two space dimensions, respectively) for the input series and a 16 × 44 × 44 crop for the target series. During evaluation on each dataset, we use the same full-size time-series from the test set for both the input and target series.

Appendix C: Dataset Generation Details
For each time-series example in the 1D Kuramoto-Sivashinsky dataset, we sample the viscosity damping parameter $\gamma$ from a truncated normal distribution ($\mu = 1$, $\sigma = 0.125$, and truncation interval $[0.5, 1.5]$). We then use the ETDRK4 integrator [33] to generate each time-series to within a local relative error of $10^{-3}$. Each time-series consists of a uniform time mesh with 256 points for a total time $T = 32\pi$ and a space mesh with $M = 256$ points for an $L = 64\pi$ unit cell. These are produced by solving on a finer time and space mesh to ensure convergence and then resampling to the dataset mesh sizes. Each initial state is generated from independently sampled, normally distributed Fourier components with a Gaussian band-limiting envelope of varying width (uniformly sampled in the interval $[\pi/L, M\pi/8L]$) and then normalized to unit variance.
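The initial-state generation described above can be sketched as follows. This assumes the Gaussian envelope acts on the wavenumber magnitude and that the sampled width is the envelope's standard deviation; those conventions are not pinned down by the text.

```python
import numpy as np

def random_initial_state(M=256, L=64 * np.pi, rng=None):
    """Initial state from independently sampled, normally distributed
    Fourier components under a Gaussian band-limiting envelope of random
    width (uniform in [pi/L, M*pi/8L] for the KS dataset), normalized
    to unit variance. Envelope convention is an assumption."""
    rng = np.random.default_rng() if rng is None else rng
    k = 2 * np.pi * np.fft.fftfreq(M, d=L / M)          # wavenumber grid
    width = rng.uniform(np.pi / L, M * np.pi / (8 * L)) # envelope width
    env = np.exp(-0.5 * (k / width) ** 2)               # Gaussian band limit
    coeffs = (rng.normal(size=M) + 1j * rng.normal(size=M)) * env
    u = np.fft.ifft(coeffs).real                        # real-space field
    return u / u.std()                                  # unit variance

u0 = random_initial_state(rng=np.random.default_rng(1))
```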
For each time-series example in the 1D nonlinear Schrödinger dataset, we sample the nonlinearity coefficient $\kappa$ from a truncated normal distribution ($\mu = -1$, $\sigma = 0.25$, and truncation interval $[-2, 0]$). We then use the ETDRK4 integrator [33] to generate each time-series to within a local relative error of $10^{-3}$. Each time-series consists of a uniform time mesh with 256 points for a total time $T = \pi$ and a space mesh with $M = 256$ points for an $L = 8\pi$ unit cell. These are produced by solving on a finer time and space mesh to ensure convergence and then resampling to the dataset mesh sizes. The initial states are generated in an analogous manner to the Kuramoto-Sivashinsky dataset.
For each time-series example in the 2D convection-diffusion dataset, we vary the parameters by sampling the diffusion constant $D$ from a truncated normal distribution ($\mu = 0.1$, $\sigma = 0.025$, and truncation interval $[0, 0.2]$) and each velocity component $v_x, v_y$ from a normal distribution ($\mu = 0$, $\sigma = 0.2$). Because the convection-diffusion equation is linear, we use the exact solution to generate the dataset. Each time-series consists of a uniform time mesh with 64 points for a total time $T = 4\pi$ and an $M \times M = 256 \times 256$ space mesh for an $L \times L = 16\pi \times 16\pi$ unit cell. Each initial state is generated from independently sampled, normally distributed Fourier components with a Gaussian band-limiting envelope of varying width (uniformly sampled in the interval $[\pi/L, M\pi/4L]$) and then normalized to unit variance.

In these plots of the raw extracted latent parameters, we see the direct correspondence between the relevant latent parameters identified in Figure 2 and the true physical parameters, as well as the collapse of the unused latent parameters. Note that the relevant latent parameters corresponding to the drift velocity $\mathbf{v}$ in the convection-diffusion model are not precisely aligned with the two components $v_x, v_y$ due to the inherent rotational symmetry. The gray shaded bars are the 95% confidence intervals ($\pm 1.96\sigma_z$) produced by the model.
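Because the convection-diffusion equation is linear, each periodic Fourier mode evolves independently as $\hat{u}(\mathbf{k}, t) = \hat{u}(\mathbf{k}, 0)\, e^{-(D|\mathbf{k}|^2 + i\,\mathbf{k}\cdot\mathbf{v})t}$. A sketch of exact-solution generation on this basis (periodic boundary conditions are assumed from the unit-cell description; the function is illustrative, not the authors' code):

```python
import numpy as np

def convection_diffusion_exact(u0, t, D, v, L):
    """Exact solution of u_t + v . grad(u) = D * laplacian(u) on a
    periodic L x L cell: evolve each Fourier mode by
    exp(-(D |k|^2 + i k.v) t) and transform back."""
    M = u0.shape[0]
    k = 2 * np.pi * np.fft.fftfreq(M, d=L / M)
    kx, ky = np.meshgrid(k, k, indexing="ij")
    u0_hat = np.fft.fft2(u0)
    factor = np.exp(-(D * (kx**2 + ky**2) + 1j * (kx * v[0] + ky * v[1])) * t)
    return np.fft.ifft2(u0_hat * factor).real

# Pure-diffusion check: a single cosine mode decays by exp(-D * t).
M, L = 64, 2 * np.pi
x = np.arange(M) * L / M
X, Y = np.meshgrid(x, x, indexing="ij")
ut = convection_diffusion_exact(np.cos(X), t=0.5, D=0.1, v=(0.0, 0.0), L=L)
```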
where $\mathrm{CDF}_p(\cdot)$ is the cumulative distribution function for the probability distribution $p(\cdot)$. One caveat, in addition to the ambiguities introduced by symmetries of the physical parameters, is that the relationship may not be monotonic for physical parameter distributions that have support on a topologically distinct space from the real line, e.g. a uniformly distributed periodic angle parameter. However, the result may still be interpretable; an angle parameter, for example, may be encoded as a ring in a two-dimensional latent subspace.
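This monotonic CDF relationship can be illustrated with a stdlib-only sketch. We assume here that the map takes the form $z = \mathrm{CDF}_{\mathcal{N}}^{-1}(\mathrm{CDF}_p(\theta))$, sending a physical parameter $\theta$ with distribution $p$ to a standard-normal latent $z$, consistent with the surrounding discussion; the function names and the bisection inverse are ours.

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard-normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def norm_ppf(q, lo=-10.0, hi=10.0, iters=80):
    """Inverse standard-normal CDF by bisection (stdlib-only sketch)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if norm_cdf(mid) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def latent_from_physical(theta, phys_cdf):
    """Monotonic map z = CDF_N^{-1}(CDF_p(theta)): equal quantiles of the
    physical distribution and the standard-normal latent prior correspond."""
    return norm_ppf(phys_cdf(theta))

# Example: a uniform parameter on [0, 1]; its median maps to z = 0.
z_mid = latent_from_physical(0.5, phys_cdf=lambda t: t)
```

Because both CDFs are monotone on the real line, the composed map is monotone, which is exactly what fails for, e.g., a periodic angle parameter.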
Although this decomposition is suggestive of the effects of VAE regularization, the study of the performance of VAE-based models and of the relative importance and model dependence of each of these effects is still very much ongoing [20, 21, 38-40, 47]. While training our model, we empirically observe that the latent parameters retain their independence and that their marginal distributions match the standard normal priors, so better prediction performance is traded only for an increase in the information stored in the latent space. We attribute this to the physics-informed inductive biases present in our architecture, which allow our model to achieve its best performance using a minimal set of independent and normally distributed latent parameters.

For the 2D convection-diffusion dataset, the extracted latent parameters do not align precisely with the velocity components $v_x, v_y$ due to the inherent ambiguity of choosing a coordinate basis, introduced by the rotational symmetry of the velocity vector, which makes judging the extraction performance more difficult. Instead of examining one latent parameter at a time, we must consider the two-dimensional latent subspace associated with the velocity vector. Taking the two relevant latent parameters that are correlated with the drift velocity (Table II), we can perform a multivariate linear regression of the velocity components $v_x, v_y$ in this two-dimensional latent subspace to verify that the model has indeed learned a simple rotated representation of the velocity vector (Figures 3(e), 3(f)).