J ul 2 01 9 Precision annealing Monte Carlo methods for statistical data assimilation and machine learning

In statistical data assimilation (SDA) and supervised machine learning (ML), we wish to transfer information from observations to a model of the processes underlying those observations. For SDA, the model consists of a set of differential equations that describe the dynamics of a physical system. For ML, the model is usually constructed using other strategies. In this paper, we develop a systematic formulation based on Monte Carlo sampling to achieve such information transfer. Following the derivation of an appropriate target distribution, we present the formulation based on the standard Metropolis-Hasting (MH) procedure and the Hamiltonian Monte Carlo (HMC) method for performing the high dimensional integrals that appear. To the extensive literature on MH and HMC, we add (1) an annealing method using a hyperparameter that governs the precision of the model to identify and explore the highest probability regions of phase space dominating those integrals, and (2) a strategy for initializing the state space search. The efficacy of the proposed formulation is demonstrated using a nonlinear dynamical model with chaotic solutions widely used in geophysics.


I. INTRODUCTION
Two seemingly distinct challenges for systematically transferring information from a well curated (but noisy) data set to a model of the processes producing the data, namely statistical data assimilation (SDA) [1][2][3][4][5][6] and machine learning [7][8][9][10][11][12][13][14], have been shown to be equivalent in their formal structure [7]. In artificial neural networks, the rules that direct the activities from layer to layer are equivalent to the rules for the temporal development of dynamical models in statistical data assimilation. In SDA, the number of measurements at each observation time plays the same role as the number of independent input-output pairs in machine learning.
In this paper, we first formulate the information transfer problem as a Monte Carlo evaluation of a highdimensional expected value integral. This formulation can be applied to both physical dynamical systems and machine learning problems. We then explore the evaluation of such integrals and add to the Monte Carlo procedures a strategy to identify the dominant contribution to the expected values. We use a problem description close to SDA in physical sciences [1], nevertheless, the statements and lessons identified here are directly usable in machine learning [7].
We first describe the problem formulation and then briefly discuss the Hamiltonian Monte Carlo (HMC) method [15][16][17]. HMC overcomes difficulties in traditional Monte Carlo by complementing the original state space with a set of canonical variables moving in "time". Here, in particular, we study HMC with two additional features.
(1) We use a Precision Annealing (PA) method de-rived from the formulation of the "action", A(X) = − log[π(X | Y)]. The target distribution π(X | Y), conditioned on all observations, governs the expected values of the model state variables and parameters. (2) We use a novel way to initialize the Monte Carlo search. We call the combination of these strategies Precision Annealing Hamilton Monte Carlo (PAHMC).
We also report some results from standard Metropolis-Hastings Monte Carlo calculations [18,19] in which the proposals for movements in the state space are based on random perturbations of the present location. We label these methods as random proposal (RP) searches. We employ our PA and initialization strategies in both the RP and the HMC analyses.
To show the effectiveness of our formulation, both approaches are demonstrated on a chaotic dynamical model widely used in geophysics [20]. We present results on estimating the unobserved state variables of the model, estimating the unknown model parameters, as well as predicting forward in time.

II. SDA AND MACHINE LEARNING
We begin with a description of an SDA framework within which we will discuss the transfer of information to a dynamical model of the processes producing the observations. This sets the notation and frames our subsequent discussions.
Within an observation window [t 0 , t final ], we make a set of measurements at times t = {τ 0 , . . . , τ F } where t 0 ≤ τ 0 < τ F ≤ t final . At each of these measurement times k = 0, . . . , F , we observe an L-dimensional vector y(τ k ) = (y 1 (τ k ), . . . , y L (τ k )). The number of observations L at each measurement time t = τ k is typically less, sometimes much less, than the number of degrees of freedom D in the model of the observed system, i.e., L ≤ D. These data are stored in a library and otherwise unaltered during the remainder of the discussion. We designate the set of observations made from time t = τ 0 up to τ F as Y ≡ {y(τ 0 ), . . . , y(τ F )} .
Our quantitative characterization of the dynamical processes that produced these data is through the chosen model. In SDA, our choices are based on knowledge of the physical dynamics assumed to be describing the data source.
If we are using the structures here to perform a task in machine learning, the model may be divorced from knowledge of physics and constructed through statistical principles.
The model describes the interactions among and the evolution of the states of the target system. From the data Y, we want to estimate the unmeasured states of the model as a function of time as well as to estimate any time independent physical parameters in the model. At the end of the observation window t = t final , we use all the estimated values of model states and parameters to predict the model response to new forcing of the system for t ≥ t final . The predictions are then used to validate (or not) the model completed by us.
We call the D-dimensional state of the model x(t) = (x 1 (t), . . . , x D (t)). The model is constructed to describe the dynamical behavior of the observations through a set of differential equations in continuous time and is written as dx a (t) dt = F a (x(t), θ), a = 1, . . . , D.
In discrete time with equal intervals, ∆t, t m = t 0 +m∆t where m = 0, . . . , M and t M = t final , the dynamics can be written as x a (t m+1 ) = f a (x(t m ), θ) or as x a (m + 1) = f a (x(m), θ), where θ is a set of time independent parameters θ = {θ 1 , . . . , θ Np } associated with the model. The discretetime model f a (x(m), θ) is related to its continuous counterpart F a (x(t), θ) via the choice for solving the continuous time flow for x(t) through a discrete time numerical solution method [21]. We work henceforth in discrete time. For convenience we choose the observation times τ k to be integer multiples of ∆t as well, where {n 0 , . . . , n F } is a set of integers and {n k } ≤ M . Starting from the initiation of the observation window at t 0 , we must use our model equations to move the state variables x(0) from t 0 to τ 1 = t 0 + n 1 · ∆t, where the first measurement is made. Then we use the model dynamics again to move along to τ 2 = t 0 + n 2 · ∆t, where the second measurement is made, and so forth until we reach the time of the last measurement τ F = t 0 + n F · ∆t and finally move the model from x(τ F ) to x(t final ) = x(M ).
We collect all the D-dimensional vectors x(m) for m = 0, . . . , M along with all of the N p parameters into what we call the path X, X ≡ x(0), . . . , x(M ); θ 1 , . . . , θ Np . The dimension of the path is (M + 1)D + N p .
A visual representation of this discussion is given in Fig. 1.
We need three ingredients to effect our transfer of information from the collection of all measurements Y to the model x(m + 1) = f (x(m), θ) along the path X through the observation window [t 0 , t final ]: (1) our noisy data Y, and (2) a model of the processes producing Y. This model is devised by our experience and knowledge of those processes. The third ingredient is comprised of methods to achieve the transfer of information from Y to properties of the model. We would like to estimate all the components in X. This will be our focus from here on.
In this paper we describe and implement identical test problems on two well explored methods of Monte Carlo search.
(1) The original method [18,19]-this starts at some location X i in path space, makes proposals to move to a candidate location X c by sampling a modification to X i drawn from a symmetric distribution. It then accepts or rejects X c using the Metropolis-Hastings rule. This procedure is also outlined in Sec. III A. Under some conditions [21,22], this will explore the target distribution π(X | Y) well and allow the approximate evaluation of expected values of functions G(X) = dX G(X) π(X | Y). We label this very well explored construction as the "random proposal" (RP) Monte Carlo method. (2) The Hamiltonian Monte Carlo method [15][16][17].
HMC takes off in another direction. It proposes moves from X i in an enlarged space reached by adding canonical momenta P to X. The proposals are made using Hamilton's equations of classical mechanics. HMC is reviewed in Sec. III. A numeri-cal example is given in Sec. IV. The computational complexity of HMC is discussed in Sec. VII. If the transfer of information is successful and, according to some metric of success, we arrange matters so that at the measurement times τ k , the L model variables x(t = τ k ) associated with y(τ k ) are approximately equal to each other. However, we are not finished yet.
At this point, we have only demonstrated that the model is consistent with the known data Y. We must further use the model, completed by the estimates of θ and the full state of the model at the final time x(M ), to predict forward for t > t M , and we should succeed in comparison with measurements for y(τ ) for τ > t M . As the measure of success, we may use the same metric as used in the observation window. No further information from the observations is passed to the model in the prediction window.
The same overall setup also applies to artificial neural networks in supervised machine learning [8,9], where the training set is used to transfer information into the model (by adjusting the weights), the test set is used to make predictions and validate the model. The process of prediction is called generalization.
A. The data are noisy, the model has errors It is inevitable that the data we collect are noisy and the model we construct to describe the dynamics has errors. This means we must address the conditional probability distribution π(X | Y) along the process of transferring information from Y to the model.
Assume for the moment that we have an equal number of observation and model time steps, so n k = k in Eq. (3) for k = 1, . . . , M . We now define the shorthand notation: We have used the Markov property of the model π x(n + 1) X n 0 , Y n 0 = π x(n + 1) x(n) and identified CMI(a, b | c) = log π(a,b | c) π(a | c) π(b | c) which is Shannon's conditional mutual information [23] telling us how many bits (when using log 2 ) we know about a when observing b conditioned on c. For us a = {y(n + 1)}, b = {x(n + 1), X n 0 }, and c = {Y n 0 }. The appearance of the CMI is an explicit indication that it is information that is transferred from the observations to the model at each measurement time. Now set aside the condition n k = k, and use Eq. (4) to move backwards from the end of the observation window at t final = t 0 + M ∆t, through the observations at times τ k , to the start of the window at t 0 . Up to factors independent of X, we arrive at , defines the action A(X). Conditional expected values for functions G(X) along the path X are given by Np j=1 dθ j . All factors in the action independent of X cancel out here.
From Eq. (5), the action reads The first sum in Eq. (7) represents the modification to π(X | Y) each time an observation y(τ k ) is made. The second sum includes the stochastic transitions of the model. The last term includes the distribution of the initial condition of the model. We often have no knowledge about π(x(0)), just as we may have no knowledge about the distribution of the parameters θ. In this situation, we take them to be uniform over the dynamical range of the model variables (or the parameters). They then cancel between the numerator and the denominator in Eq. (6).
What G(X) are of interest? A natural choice is the path of the model states and parameters itself: G(X) = X. Another could be the covariance matrix of X.
The action further simplifies to what we call the "standard model" of SDA when (1) the observations y(τ k ) are related to their model counterparts through Gaussian noise with zero mean and a diagonal precision matrix R m = R m I, and (2) the model errors are associated with Gaussian errors of mean zero and a diagonal precision matrix R f = R f I. The standard model action takes the form The first term in Eq. (8) is the measurement error and the second term, the model error.
One way to explicitly see the balance condition between the measurement error and the model error is to examine the resulting Euler-Lagrange equation when the model time becomes continuous when ∆t → 0. This is discussed in [7].

B. The goal of SDA
Our challenge is to perform expected value integrals such as Eq. (6). One should anticipate that the dominant contribution comes from the maxima of π(X | Y), or equivalently, the minima of A(X). Near such minima, the two contributions to the action, the measurement error and the model error, balance each other to accomplish the explicit transfer of information from the data to the model.
When f (x(n), θ) is nonlinear in X, the expected value integral Eq. (6) is not Gaussian, and we need to approximately evaluate integrals of this form in order to complete the task of transferring information.
Two generally useful approaches for evaluating this kind of high dimensional integral are Laplace's method [24,25] and the collection of techniques using Monte Carlo sampling [15,18,19,21,26].
The Laplace-method evaluations of expected value integrals is discussed in [27][28][29][30]. They do not sample from π(X | Y) away from its maximum. Evaluating corrections to the leading Laplace contributions is familiar as perturbation theory in statistical physics [31]. The convergence of such perturbation methods can depend sensitively on the functional form of the action in X. In addition, the need for repeatedly using the Hessian matrix of A(X) may further hinder their use in practice.
The rest of the paper is organized as follows. In Sec. III, we briefly review the random proposal and the Hamiltonian Monte Carlo methods. We then give a two-dimensional example comparing the two Monte Carlo methods in Sec. IV. Then, in Sec. V, we formulate the proposed method to evaluate Eq. (6). We then present a detailed study on a nonlinear, chaotic dynamical model [20] in Sec. VI as a demonstration. Following the numerical calculations, we discuss in Sec. VII the complexity and scaling properties of PAHMC. We summarize the results and discuss the implications in Sec. VIII.

A. Random proposal Monte Carlo
Performing integrals such as Eq. (6) via Monte Carlo searches requires a method to sample from π(X | Y) in order to identify and use those regions in the path space X yielding the dominant contribution to the integral. The original Monte Carlo procedure devised by Metropolis and Hastings [18,19,32] has been widely explored. It samples from a probability distribution such as π(X | Y), seeking the maxima of that distribution. Proposals on how to move about path space are made by perturbing the present location at random. This random-proposal (RP) procedure is in theory ergodic, meaning that the proposals can reach any region in the path space with nonzero probability given sufficient proposed steps [33]. This property makes it an unbiased method for performing the expected value integrals. In practice, the RP procedure often suffers from slow mixing in high dimensional models [22,34,35].
This method proposes new steps via a (usually symmetric) proposal distribution, from a present path space position X i to a new, candidate position X c . Then, according to RP's acceptance rule assuring the detailed balance condition, accept X c as the new sample X i+1 with probability min{π( then serves as the starting point for a subsequent proposal procedure. All accepted paths are then used in the evaluation of G(X) .
The convergence of this procedure to the desired π(X|Y) may be slow especially when X is highdimensional. One reason is that it treats X as a randomwalker in making proposals, often resulting in low acceptance rates and rather limited movements in the path space. The asymptotic distribution π(X | Y) may, in practice, be unreachable.

General idea of HMC-like sampling
Hamiltonian Monte Carlo (HMC) [15][16][17] strikes out in a new direction. It adds to the path X an additional set of variables, which we call P, and identifies a search in (X, P) space that preserves some invariants of the rules for moving about that space. If, for example, the motion in the extended space preserved P · P + X · X, then moving about the space by performing rotations would allow us to move interesting distances in (X, P) space by simply rotating the variables in the high dimensional extended space while staying on the P · P + X · X = const. surface. Making this motion in the extended space may allow large moves in X space with large probabilities of acceptance. If this were the case, one might be sampling exp[−A(X)] much more efficiently than detailed balancepreserved random walking as in the RP procedures.
As introduced in [15] the added conjugate variables are selected from the well explored examples of the canonical momentum familiar to physicists and thoroughly analyzed for two centuries. By choosing the additional variables P to be canonical conjugates of X, one can use the rules of classical mechanics to move around in (X, P) space and preserve the underlying symplectic structure.
If the movement is labeled by a scalar s which we call "time", then the rule d ds preserves the value of the scalar function H(X, P) as well as volumes in (X, P) space and a collection of other quantities known as the Poincaré invariants [39,40]. For HMC, in addition to the original target distribution π(X | Y) ∝ exp[−A(X)], one needs to choose an arbitrary distribution π(P) for the canonical momenta P in order to fully specify the joint HMC target π(X, P | Y). The Hamiltonian can be written as Eq. (10) tells us that the expected values, the goals of SDA and machine learning, are unchanged under HMC for an arbitrary choice of the canonical momenta.
The combination of symmetry preserving movements in (X, P) and invariance in the expected values is appealing. These properties make higher acceptance rates and faster mixing for sampling π(X | Y) possible. Other symmetries in (X, P) requiring different rules for generating motions might work equally well. HMC should be viewed as one of a class of approaches within this overall idea [41,42]. It should also be noted that a high acceptance rate itself does not guarantee efficiency-combining high acceptance with fast exploration of phase space is a goal of all Monte Carlo methods.
One must distinguish the use of the scalar label s in HMC from the time labels t and τ as described in Sec. II. s is just a label to track movements in the enlarged space. Actual times t and τ are labels used in identifying the path of model dynamical variables x(t) and the data y(τ ) through an observation window.

HMC itself
As discussed above, the HMC procedure [15][16][17] doubles the dimension of path space X, introducing a physics-motivated but arbitrarily chosen canonical momentum P associated with the path space position X.
While certainly not required, the choice for the "kinetic energy", h(P) = P · P/2, is convenient and the resulting distribution exp[−h(P)] is Gaussian. As pointed out by Kadakia [43], small modifications of this standard choice to include higher order than quadratic polynomials of P can introduce chaotic behavior and substantial mixing in motions generated by H(X, P) and thus may avoid some potential problems in the implementation of a quadratic choice for h(P).
We proceed with the quadratic choice and the Hamiltonian becomes Why does one want to sample in the even larger (X, P) space given that X is already high-dimensional? The answer lies in the invariance properties of Hamiltonian dynamics. If HMC proposals are made using integration of Eq. (9) from "time" 0 to s with the choice of H as in Eq. (11), namely, d ds then H(X, P) is conserved along the canonical phase space orbit labeled by s. As discussed, many other quantities are preserved. Among them, HMC makes use of the conservation of phase space volume, i.e., d X The core of HMC is to propose (X(s), −P(s)) starting from (X(0), P(0)). This is precise when the integration is performed with s taken as a continuous variable, so a Hamiltonian flow is realized and H(X, P) is conserved. A complete HMC proposal includes a negative sign before P(s), meaning that we need to flip the momentum after the Hamiltonian integration is finished. This flipping is to make the proposal reversible in s and symmetric in (X, P), thus ensuring detailed balance.
In practice, the form of A(X) precludes analytical evaluations, and one must work with discrete s in order to integrate Eq. (12). When we use a symplectic integrator to perform the integration of Hamilton's equations in discrete s [41], a result of Ge and Marsden [44] tells us that we cannot preserve both the symplectic form and H(X, P) the same time. Therefore, we expect that by using a discrete s symplectic integrator, we will find H(X(s), P(s)) ≈ H(X(0), P(0)), but not exactly equal. As a consequence, when determining the acceptance or rejection of the proposal (X(s), P(s)), the acceptance probability can usually be near unity, but not precisely unity, i.e., The use of −P(s) can usually be omitted when performing HMC as P does not enter the calculation of expectations in Eq. (10).
As the parameters θ in SDA and machine learning are taken to be components of the vector X, they also have conjugate variables η included in P. Therefore, HMC provides a principled manner of exploring π(X | Y) on both state variables and time independent parameters.

HMC in practice
When implementing HMC, one must select a symplectic integrator in order to preserve the symplectic invariants [16,41,42], and many choices are available. We have used a fairly common one, the leapfrog symplectic integrator [41,45], which possess simplicity and accuracy. Using this choice, we select a small step size ǫ in s to produce a candidate proposal (X c , P c ) by sampling exp[−H(X, P)] to select P(0) and choosing X(0) to be the last accepted X.
We first move (X(0), P(0)) to (X(ǫ), P(ǫ)) using the leapfrog symplectic stepping rule, Then, starting from (X(ǫ), P(ǫ)), we move to the next point at (X(2ǫ), P(2ǫ)). After proceeding for S steps, we arrive at a candidate proposal (X c , P c ) = (X(Sǫ), −P(Sǫ)). The candidate proposal (X c , P c ) is then accepted or rejected according to Eq. (13). When using PAHMC (as described below in Sec. V), the HMC procedure is repeated for each value of R f (Eq. (8)). In addition, we choose to perform N I independent PAHMC calculations in parallel starting from N I independent initializations.

IV. HMC VS. RP SAMPLING IN A SIMPLE 2D EXAMPLE
We provide a two-dimensional example that illustrates the advantages of sampling the joint distribution π(X, P) as in HMC over the standard RP Metropolis-Hastings procedure. In our example, we applied both methods to We started both methods at (x, y) = (0, 0.8). Each of HMC and RP then generates 500 samples, and the last 301 of these samples are retained and displayed in Fig. 2(b). The RP method explores only a small percentage of the distribution while HMC clearly explores most of the distribution in this comparison. The efficiency of HMC compared to RP is clear from this elementary example.
The example in Fig. 2 shows that HMC typically achieves better mixing than the RP method. In addition, it is important to compare the scaling properties of both methods as the dimension of the target distribution increases. With the potential applications to practical problems in mind, it is critical that the sampling process remains efficient in higher dimensions. It has been shown that HMC typically scales better than RP in terms of exploration efficiency. After equilibrium is reached, for a d-dimensional target distribution, the computational complexity for a given acceptance rate typically grows as d 5/4 for HMC [46][47][48] and d 2 for RP [49].

V. THE PRECISION ANNEALING METHOD
We now turn to the precision annealing method suggested in [29,30,50]. It is used here to facilitate the search for the global minimum of the action A(X) as we gradually increase the model precision parameter R f . As we will see, this procedure drives the sampling area of HMC to the vicinity of the global minimum in A(X) in Eq. (8) where the path X is consistent with both the nonlinear model f (x(n), θ) and the observations Y. This procedure plays a key role in evaluating the integral in Eq. (6). Because we only have partial observations of the model state, the unobserved degrees of freedom in the state variable X are completely uninformed upon initialization. If we impose the model to a high precision at the outset, the convoluted model nonlinearities may prevent a sampling scheme from finding the desired minimum in A(X), therefore, the dominant contribution to the integral in Eq. (6) may not be identified.
The samples generated by RP are shown as blue down pointing triangles, and the samples generated by HMC are shown as red up pointing triangles. To arrive at one proposed sample, the discrete Hamiltonian dynamics is simulated using the leapfrog method for S = 50 steps with step size ǫ = 0.01. The overall acceptance rate for HMC is 0.998. For the Metropolis algorithm, to arrive at one proposed sample, the x and y directions are perturbed simultaneously by drawing from N (0, 0.01). The overall acceptance rate for the Metropolis algorithm is 0.56.

A. Initialization in the path space
Our strategy in this paper is to vary the precision parameter R f that enforces the model error term in A(X). When R f = 0, the model is completely unresolved and the path space is highly degenerate for the unobserved degrees of freedom. In the opposite limit for the hyperparameter R f where R f → ∞, the model f (x(n), θ) becomes deterministic and provides nonlinear constraints on the minimization of the action. Moving R f from very small to quite large by a slow increase permits us to start our Monte Carlo procedure near the maximum of π(X | Y) and remain there.
We now present the initialization procedure at R f = 0. In this case, the standard model action becomes a quadratic form as which apparently has its (degenerate) global minimum at To complete our choice of initial path, we now integrate the model forward through M discrete times. x(0) is initialized such that x a (0) = y a (0) for a = 1, . . . , L and x a (0) is drawn from a uniform distribution covering the dynamical range of the model for a = L + 1, . . . , D. The parameters θ are also drawn from an appropriate uniform distribution. The model is then integrated forward in time, with the observed dimensions in x(t m ) replaced by y ℓ at each time where we have a measurement, t k = τ k for k = 0, . . . , F − 1: This completes our construction of an initial path X init . The freedom in choosing x(0) and θ gives us flexibility to generate multiple such initial paths X (q) init , q = 1, . . . , N I at R f = 0. These are retained for future use.

B. Precision Annealing as R f is increased
Precision Annealing arises from the idea that if we choose an initial path for a Monte Carlo or a variational optimization [27][28][29][30] at the global minimum for small R f as we slowly increase R f , we will stay in a region of phase space where our sampling will arrive at the smallest minimum of the action, for a variational calculation, or sample the neighborhood of the maximum of π(X, P), for a Monte Carlo search. We adopted the following annealing schedule for R f : with α > 1 and β = 0, 1, . . . , β max . R f0 should be small. A choice of α near unity leads to the slow increase in R f as β increases, introducing the nonlinearity of the model in an adiabatic manner. At each R f value, N I MC calculations are performed starting from the solution generated by the procedure in the previous R f .
At β = 0, we select, without repeats, each of the N I initial paths X (q) init identified at R f = 0 to be an initial condition. We then start to sample from the HMC joint distribution π(X, P) ∝ exp[−h(P, X) − A(X)] which has R f0 as the hyperparameter. We collect N β=0 sampled paths, denoted by X (q) β=0, j with j = 1, . . . , N β=0 , that are generated along the Markov chain. We take the sample mean of each of the N I Markov chains developed so far, as the initial path for the next β value. Since N I calculations are done independently and in parallel, this results in a set of N I averages X (q) β=0 with q = 1, . . . , N I . Next, we move to β = 1. We have N I selections of initial conditions, i.e., X (q) β=0 with q = 1, . . . , N I , available for our HMC evaluations. For each initial condition we collect N β=1 sampled paths and form the N I means for each We continue this procedure until we reach a useful β max . By plotting two quantities versus R f or β we will have insight on how to select this.
First, we should make a plot of the action determined by each of the N I paths at each R f as a function of β. We find that the action becomes independent of β as the precision of the model increases. The origin of this is seen in a second graphic where we plot the model error term in the action, as a function of β. If the model is consistent with the data, the model error will rapidly decrease to a small value as β increases.The action is the sum of the model error, the measurement error and the initial condition, as the model error tends numerically to a small number, the action levels out to the sum of the measurement error and the initial condition, which are independent of R f . Indeed, this second sum both provides a lower bound to the action and indicates whether the selected model is "right". This independence of the action with respect to β will also depend in SDA on the number of measurements L at each time observations are made. As a result of these MC calculations, the expected value of functions G(X) in the original space, Eq. (6), will be estimated as since the precision annealing procedure locates the paths that dominate π(X | Y) ∝ exp[−A(X)] [27][28][29][30] for a = 1, . . . , D. In Eq. (22) ν is a constant forcing term; the solutions of these equations for D ≥ 4 are chaotic when ν ≥ 8.0 or so. We report on calculations with D = 20 with ν = 8.17.
We are performing a "twin experiment" [1] here in which we solve Eq. (22) for x(t) beginning with an arbitrary initial value x(0) and using a time step ∆t = 0.025. To each of the D time series we add Gaussian noise with standard deviation σ = 0.4 throughout the observation window {t 0 , . . . , t final } which contains M + 1 time steps spaced at ∆t. These serve as our data. In the twin experiments, we know the (noisy) time series y(t) and the value of the forcing parameter ν so that we are able to validate the results of the proposed method.
We select L < D components of our data set as observations and seek to estimate the unobserved D − L state variables as well as the forcing ν. Assuming at each model time step t = t k we have an observation y(k), i.e., n k = k in Eq. (3) for k = 1, . . . , M , the corresponding action is (23) The first term on the right in Eq. (8) is the measurement error, and the second, the model error. The trapezoidal rule is used to discretize Eq. (22), such that where F a (x, ν) is the model vector field.
In using Monte Carlo methods, we sample the conditional probability distribution π(X | Y) ∝ exp[−A(X)] so we are able to estimate both an average value for the path X as well as variations about this mean. Both will be presented. We will discuss results for L = 7, 8, 10, and 12.

HMC with L = 7
We begin with L = 7 measured noisy time series out of 20 generated as our data set. We used the seven data series y ℓ (m) with ℓ ∈ {1, 4, 7, 10, 13, 16, 19} as the observations. In Fig. 3(a), we display the action levels as a function of R f for L = 7. In this figure, many action levels are present, corresponding to many peaks in the probability π(X | Y) ∝ exp[−A(X)] identified. At large R f values, the N I = 30 action levels start to plateau. However, they are all well above the expected action value which is the measurement error piece of the action. This is an indication that though HMC has located some paths that agree with the Lorenz96 model in Eq. (22), it fails to transfer information from the observations Y to the model due to an insufficient number of observations at each τ k . We expect that as the number of observations L increases, more information will become available to the precision annealing HMC method and that the structure of the action levels will change.
The inadequacy of using only 7 observations at each measurement time, L = 7, is shown in the failure of the model error to significantly decrease with increasing R f . This is seen in Fig. 3(b). Similarly, in the estimation of the model forcing parameter shown in Fig. 3(c), we see substantial inaccuracies.
Further in the predictions for an observed state variable x 7 (t), Fig. 4(a), and for an unobserved state variable x 14 (t), Fig. 4(b), we see many errors in both the estimation window 0 ≤ t ≤ 5 and the prediction window 5 < t ≤ 11.

HMC with L = 8
In Fig. 5(a), we present the L = 8 action levels. Here, 7 out of 30 action levels split from the rest at an early stage in the precision annealing process and coincide with the anticipated action level for the measurement error term. This implies that the transfer of information has been successful. Indeed, as we will see in the figures that follow, the estimations in the model parameter ν as well as the predictions beyond the observation window show that the paths with lowest action levels yield consistent results. In addition, Fig. 5(b) shows the model error calculations for L = 8. It is clear from there that the 7 lowest action levels are numerically dominated by their respective measurement errors instead of model errors. Fig. 5(c) shows N I estimations of the Lorenz96 forcing parameter ν. The true value is ν = 8.17. If we combine this plot with Fig. 5(a), we can conclude that although the method has located some path with high model precision, some of the paths (23 out of 30) differ from the observations. The "wrong" models have been identified because there are not enough measurements L at each time an observation is made leading to errors in the estimated forcing parameter.
Until this point, we have examined the indicators, i.e., action level, model error, estimated forcing, representing the quality of the transfer of information. All of these are inferred from the calculations happening in the observation window 0 ≤ t ≤ 5 and from the data Y available to the PAHMC method. To validate our information transfer method, we need to take advantage of the twin experiment setting. We will compare the estimated unobserved state variables with the data generated in the twin experiment. We will also predict forward in time beyond the observation window and compare the predictions with the actual data. Fig. 6(a) shows the estimate of an observed variable x 9 (t) within the estimation window 0 ≤ t ≤ 5 as well as its prediction within the window 5 < t ≤ 11, for Lorenz96 L = 8. We can see that the estimation agrees quite well with the data. The prediction, with the HMC calculation of errors shown, agrees with the data in the prediction window for some time (about 4 inverse Lyapunov times) and eventually diverges due to the chaotic nature of the Lorenz96 system at ν = 8.17. It is worth noticing that the errors in the prediction are quite small in earlier stages. This is because the PAHMC procedure has accurately estimated all of the unobserved state variables as well as the forcing parameter ν. Fig. 6(b) presents the same information, but for an unobserved variable x 12 (t). In this case, the data for x 12 (t) is not presented to the PAHMC method in the observation window, yet it still achieves high accuracy in both estimation and prediction. This is a strong indication that the transfer of information from Y to Lorenz96 has been successful for L = 8 observations. The nonlinearity of the model is at work here transferring information from observed to unobserved state variables.

HMC with L = 10
If we further increase the number of observations to L = 10, we see that all the action levels plateau almost at the anticipated expected value, as shown in Fig. 7(a). This indicates that the information is sufficient for the method to locate the global minimum of the action. Also, in Fig. 7(b), the rapidly decaying model errors show that the dynamics, x(m + 1) = f (x(m), ν) is enforced strictly in the large R f regime. Fig. 7(c) shows the estimated forcing parameter for L = 10 as a function of R f . The 30 estimated forcing parameters quickly converge to the true value of 8.17, which indicates that the correct model has been found by all the N I = 30 HMC calculations. Conclusions similar to the L = 8 case can be drawn from the prediction results in Figs. 8(a) for the observed vari-   the model error and the Lorenz96 forcing parameter for the N I = 30 initial paths used in the calculations. We forgo showing that measured and unmeasured state values are estimated quite well both in the observation window 0 ≤ t ≤ 5] and the prediction window 5 < t ≤ 11, only noting that the earlier calculations have already informed us that L near 8 or 9 is enough to capture the properties of the noisy data.

B. Using Random Proposal Monte Carlo
Our numerical calculations are "twin experiments" in which for a selected D we choose x(t 0 ) = x(0) and using a time step ∆t = 0.025 generate solutions x(t) over an To each x a (t), we add Gaussian noise with mean zero and variance σ 2 = 0.25, these now comprise our library of observed data. We then select L ≤ D of these noisy data, and form the action  In Fig. 10, we display the action levels as a function of R f at L = 5. We can see that RP identifies many action levels, corresponding to many peaks in the conditional probability distribution P (X | Y) ∝ exp[−A(X)] in Eq. (25). From R f = 2 × 10 4 , we see one level moving away from the collection of larger action levels as β increases. However, no action level has become essentially independent of R f suggesting that the accuracy with which the model is enforced remains too small. We expect that as the number of measurements at each τ k is increased more information about the phase space instabilities will be passed from the data to the model and that the structure of the action level plot will change.

RP with L = 12
In Fig. 11, we now display the action levels and its components, the measurement errors and the model errors, when L = 12. Here the behavior of the action levels is quite different. The model error decreases over a large range of R f until the numerical stability of the evaluation of this term is reduced as small errors in x(m + 1) − f (x(m), ν) are magnified by large values of R f . As this result appears, the action for each of the N I paths at each β levels off, becoming essentially independent of R f , and matches the measurement error, as it must do for consistency [27][28][29][30].  Until this point, we have examined outcomes of the RP Precision Annealing estimation procedure. All of the state variables, measured and unmeasured, as well as the forcing parameter were reported over the observation window 0 ≤ t ≤ 5. In a "twin experiment" as here, we have generated the data by solving a known dynamical equation and adding noise to the output of the D = 20 times series with a known value of ν. The point of a twin experiment is to test the method of transfer of information in SDA. As we have D − L unmeasured state variables at each L, and an unobserved parameter ν, the only tool to determine how well the estimation procedure has done in its task is to predict for t > 5 into a prediction window where no information from observations is passed back from the model. We now examine how well the estimation has been performed by predicting both an observed and an unobserved time series among the D available.
We already see from Fig. 11(c) that the input value of ν = 8.17 has not been estimated very accurately. The apparent bias in this parameter estimation has also been seen in an earlier Monte Carlo twin experiment [26,51]. However, notice that HMC can accurately estimate the parameter ν without an apparent bias as shown in Fig. 9. Fig. 12(a) shows the observed model variable x 2 (t) for the Lorenz96 model with total dimensions D = 20, observed dimensions L = 12, and ∆t = 0.025. The noisy data from solutions of the model equations from the observed variables {1, 2, 4, 6, 7, 9, 11, 12, 14, 16, 17, 19}. The estimation of x 2 (t) during the observation window using RP to transfer information from the data to the model is shown in red, and the prediction using all the estimated states of the model, x(t = 5), and the estimated model parameter, is shown in green x(t ≥ 5). Our knowledge of this dynamical system [51] indicates that the largest global Lyapunov exponent is approximately 1.2 in the time units indicated by ∆t. The deviation of the predicted trajectory x 2 (t) from the data near t ≈ 6.0 is consistent with the accuracy of the estimated state x(t) and this Lyapunov exponent. Fig. 12(b) shows the unobserved model variable x 20 (t) for the same Lorenz96 model. The noisy data from solutions of the model equations from the observed variables {1, 2, 4, 6, 7, 9, 11, 12, 14, 16, 17, 19}. The estimation of x 20 (t) during the observation window using RP to transfer information from the data to the model is shown in red, and the prediction using all the estimated states of the model, x(t = 5), and the estimated model parameter, is shown in blue x(t ≥ 5). Our knowledge of this dynamical system [51] indicates that the largest global Lyapunov exponent is approximately 1.2 in the time units indicated by ∆t. The deviation of the predicted trajectory x 20 (t) from t ≈ 6.4 is consistent with the accuracy of the estimated state x(t) and this Lyapunov exponent.

VII. COMPUTATIONAL COMPLEXITY OF PAHMC
It is important to consider the computational complexity of the proposed method for utilization in a practical task, and it is vital that the method achieves good scaling properties when applied to large systems. Here we give an estimate on the scaling as the system size increases. We first consider the complexity of an elementary step of HMC, which is that of an one-iteration leapfrog simulation. This requires evaluating ∇A(X) once and therefore has dominant time complexity of O(DM ), where, as usual, D is the dimension of the underlying dynamical system and M is the number of discrete time steps in the observation window. As the dimension of the path X grows, the number of leapfrog steps in one HMC pro-  posal needs to be adjusted accordingly to reach a nearly independent point. In fact, it has been noted [16,46,47] that the computation typically grows as the 5/4 power of the dimensions of the model, given a constant acceptance rate.
As a result, the time complexity of the proposed method in this paper is O((DM ) 5/4 ). Apparently, there exists a multiplicative constant-independent of D and M -that is determined by the choices of the number of samples N β and the number of precision annealing steps β max .
The mixing time itself is hard to estimate, yet it has been empirically tested in many cases that HMC achieves faster mixing than other well known Monte Carlo meth-  ods. An elegant modification of the term h(P, X) in the Hamiltonian has been made by Kadakia [43] that may improve the mixing rate. It is useful to provide a benchmark as to the actual computation time for a real system. To this end, we report the time for the Lorenz96 system considered in this paper. In Lorenz96, we are dealing with a 4000dimensional system for which D = 20 and M = 200 (ignoring the additional dimension introduced by the model parameter ν). To obtain one HMC proposal X prop , we simulate the discrete Hamiltonian dynamics given in Eq. (14) for S = 50 steps with step size ǫ = 10 −3 . In an implementation in Matlab, the computation time is around 0.02s. We then make N β = 10 3 proposals for every β up to β max = 30. As a result, the entire calculation takes about 10 minutes to finish.
To provide a direct comparison to the HMC method, we have run the same twin experiments for the Lorenz 96 system, D = 20 and M = 200. For every β up to β max = 50, we make 6000 perturbations on X within each β value-this means we make 2.4 × 10 7 proposals for each β, given that we only perturb one component of X. Implemented in C, the calculation took about 4 hours to complete. The computation time is much longer than that of the HMC method, since the RP method requires more proposals due to inefficient sampling of exp[−A(X)].
Besides the above analyses of the O((DM ) 5/4 ) rule and the actual computation time, we also point out that the precision annealing HMC method could gain a significant speedup if run in parallel, provided that the action is constructed as Eq. (8). We will again use Lorenz96 to illustrate this principle, but it should be noted that the only change in Eq. (8) for a different system would be the choice for the vector field f a (x(m), θ). We will focus on the evaluation of ∇A(X) since this is the only calculation that scales directly with D and M .
Firstly, for the D vector field F a (x(m), ν) in Eq. (22) and a discretization rule (in time) given by Eq. (24), the Jacobian for t = t m in the observation window is given as with i, j = 1, . . . , D. The time consuming part in calculating the D components in ∇A(X) corresponding to t = t m is to multiply the Jacobian J (m) with a D vector. This can be fully parallelized to have a D times speedup. Secondly, to complete the calculation of ∇A(X), we need to repeat the above process for each time mark m from 1 to M . These repetitions are independent and can therefore be fully parallelized to have an M times speedup.
The good potential of parallelization discussed above makes precision annealing HMC especially suitable for very large dynamical systems in terms of both the dimensionality of the vector field and the length of the observation window.

VIII. SUMMARY AND DISCUSSION
We have presented a systematic method to transfer information from observations Y of the degrees of freedom X of a nonlinear dynamical system to a model x(m) → x(m + 1) = f (x(m), θ) of the processes in the system. The observations are noisy, and the model has errors, so we must address the conditional probability distribution π(X | Y) ∝ exp[−A(X)], conditioned on the observations, as the quantity of our interest. Using this probability, the expected value of any function of X, G(X), can be expressed as The path through the observation window X ≡ {x(0), . . . x(M ), θ} is the collection of the model states at the discrete times in that window along with the timeindependent model parameters. The approximate numerical evaluation of expected value integrals is the central focus of this paper. We have shown [7] this to be the case both in SDA and in neural networks for supervised machine learning.
The hallmark of success in this information transfer process is that one can accurately predict the future states of the dynamical model for some time window, a prediction or generalization interval, beyond the observation window, given the final estimated state x(M ) and parameters θ at the termination of the observation window. Before being able to predict, one must first complete the model f (x(m), θ) by estimating all the components in X.The observed components in X must also be estimated because of the noise in the data Y.
In the use of the SDA methods described in this paper when analyzing physical or biophysical systems, the estimation of the unobserved state variables in the dynamical model f (x(m), θ) may turn out to be as important as the estimation of the observed state variables. In numerical weather prediction, for example, good estimates of the full model state of the earth system provides information not necessarily available in the observations themselves. It is the connection of observed and unobserved state variables within the dynamical model that permits this feature.
The present work develops a systematic way of locating the minima of the action A(X) in path, X, space, as well as efficiently sampling from the distribution π(X | Y) ∝ exp[−A(X)] in order to evaluate the path integral. This accurate sampling of π(X | Y) permits estimation of the errors in prediction in addition to expected paths.
We reported on the use of two well-developed Monte Carlo sampling methods in this paper. One is the 'hybrid' MC (or Hamiltonian MC) method [15][16][17] and the other is the familiar Metropolis-Hastings (MH) MC approach where random proposals for moving about in path space for sampling π(X | Y) are made and evaluated. Both on an instructive two dimensional example and on richer examples drawn from solutions to the Lorenz 1996 [20] model we found that the HMC approach was preferable.
HMC was introduced as an innovative version of Monte Carlo sampling for high-dimensional probability distributions. The core idea is to achieve high efficiency by introducing additional degrees of freedom into the target distribution and avoid the random-walk behavior of MH procedures that have lower acceptance rates of proposed moves in path space as well a markedly slower exploration of the target distribution π(X | Y).
HMC proceeds by making proposals according to Hamilton's equations for the model state variables X and their canonical conjugates, P. The HMC samplings occur in the enlarged in canonical phase space (X, P), and the target distribution becomes π(X, P) ∝ exp[−H(X, P)] with H(X, P) = h(P) + A(X). A collection of conserved quantities are associated with this method of making proposals. Among them are the phase space volume and H(|X, P) itself. As a consequence, the overall acceptance rate is close, but not equal, to unity.
For a nonconvex, high-dimensional action A(X), HMC itself does not guarantee we will capture the set of maxima of the target distribution without enlightened guidance to find the minima in the action. In fact, for a random initialization, it is unlikely for an HMC sampler to travel for a long distance in the X space. Therefore, accurate approximation of the path integral in Eq. (27) also hinges on locating the global minimum in the action of the form To address this issue, we have extended the Precision Annealing (PA) procedure developed in the context of variational SDA calculations [27][28][29][30] to MC procedures; in this paper to MH and HMC strategies alike.
In PA the precision, R f , of the dynamical model is varied gradually from R f ≈ 0 to very large values facilitating remaining near the global minimum. At R f = 0, the dynamical model is completely unresolved and the global minimum is easily identified. As one increases R f , MC sampling begins at a position in the path space X that is well informed by the samples within previous R f values. This enables the locating of the global minimum in A(X) to be more and more precise as R f becomes larger. Eventually, if R f → ∞, the deterministic version of the model is imposed on the calculation.
We tested our method on the Lorenz96 model, with dimension D = 20. We report on results where the number of observations ranges from L = 7 to L = 12. We have shown that, at small L, the minima located by PAHMC do not necessarily agree with the observations, though they can be in accordance with the dynamical model. As the number of observations becomes increases, the method takes advantage of the additional information available and successfully locates the desired minimum of the action. The prediction results also showed that the transfer of information to the model has resulted in enhanced predictability of the model system.
This can be traced to improvements in the estimation of the full state of the model at the termination of the observation window and to the stabilization of intrinsic instabilities in the chaotic nonlinear model [51].
A direct comparison between HMC and RP implementations of Precision Annealing Monte Carlo on the Lorenz96 Model, D = 20 shows that PAHMC performs approximately 25 times faster. This does not account for the fact that the RP calculations were done using a compiled code written in C while the HMC code was done using a Matlab interpreted code. The core algorithms utilized by MC procedures can be performed in parallel using GPUs [50]. Contemporary GPU capability suggests a plausible factor of 1000 or more in the speed-up of MC calculations. Using PAHMC on very high dimensional physical (SDA) or supervised ML tasks seems quite promising.
We started by recalling the equivalence of SDA and machine learning [7]. We finish by noting that the role of L is played in machine learning by the number of input/output pairs presented to the proposed network model during training.