Flow-based generative models for Markov chain Monte Carlo in lattice field theory

A Markov chain update scheme using a machine-learned flow-based generative model is proposed for Monte Carlo sampling in lattice field theories. The generative model may be optimized (trained) to produce samples from a distribution approximating the desired Boltzmann distribution determined by the lattice action of the theory being studied. Training the model systematically improves autocorrelation times in the Markov chain, even in regions of parameter space where standard Markov chain Monte Carlo algorithms exhibit critical slowing down in producing decorrelated updates. Moreover, the model may be trained without existing samples from the desired distribution. The algorithm is compared with HMC and local Metropolis sampling for $\phi^4$ theory in two dimensions.


I. INTRODUCTION
A key problem in lattice field theory and statistical mechanics is the evaluation of integrals over field configurations, referred to as path integrals. Typically, such integrals are evaluated via a Markov chain Monte Carlo (MCMC) approach: field configurations are sampled from the desired probability distribution, dictated by the action of the theory, using a Markov chain. A significant practical concern is the existence of correlations between configurations in the chain. Critical slowing down [1] refers to the divergence of the associated autocorrelation time as a critical point in parameter space is approached. This behavior drastically increases the computational cost of simulations in these parameter regions [2,3]. For some models, algorithms have been found which significantly reduce or eliminate this slowing down [4][5][6][7][8][9][10][11], enabling efficient simulation. For field theories, a number of methods have been proposed to circumvent critical slowing down by variations of Hybrid Monte Carlo (HMC) techniques [12][13][14][15], multi-scale updating procedures [16][17][18], open boundary conditions or non-orientable manifolds [19][20][21], metadynamics [22], and machine learning tools [23,24]. In important classes of theories, however, critical slowing down remains limiting; for example, in lattice formulations of Quantum Chromodynamics (QCD, the piece of the Standard Model describing the strong nuclear force) it is a major barrier to simulations at the fine lattice spacings required for precise control of the continuum limit.
Here, a new flow-based MCMC approach is proposed and is applied to lattice field generation. The resulting Markov chain has autocorrelation properties that are systematically improvable by an optimization (training) step before sampling. The approach defines an improved Markov chain step using a flow-based generative model from which sampling is efficient and in which the probability density associated with each sample is easy to compute. In this method, a set of tunable model parameters defines a change-of-variables-a "flow" f -that maps input samples from a simple prior distribution to output samples from a new distributionp f . By training the model parameters, the output distributionp f can be made to approximate a desired probability distribution p. A flow-based Markov chain is then defined to guarantee asymptotic exactness of sampling by using a Metropolis-Hastings step withp f taken as a proposal distribution. Since proposed samples are independent of the previous samples in the chain, the autocorrelation time and acceptance rate are coupled; the autocorrelation time drops to zero as the acceptance rate approaches 1. This is true even in regions of parameter space where standard algorithms exhibit critical slowing down. Under mild conditions (detailed in Section II), this approach is guaranteed to generate samples from the desired probability distribution in the limit of a large number of updates.
This method has several features that make it attractive for the evaluation of path integrals in lattice field theories: 1. The autocorrelation time of the Markov chain can be systematically decreased by training the model; 2. Each step of the Markov chain requires only the model evaluation and an action computation; 3. Each update proposal is independent of the previous sample, thus proposals can be generated in parallel and efficiently composed into a Markov chain; 4. The model is trained using samples produced by the model itself, without the need for existing samples from the desired probability distribution.
The proposed flow-based MCMC algorithm is detailed in Section II. A numerical study of its effectiveness in the context of two-dimensional φ 4 theory is presented in Section III. Finally, Section IV outlines the further development and scaling of the approach that will be required for applications to theories defined in a larger number of spacetime dimensions and to more complicated field theories such as QCD.

II. A FLOW-BASED MARKOV CHAIN MONTE CARLO ALGORITHM
In lattice field theory, a Markov chain Monte Carlo (MCMC) process is an efficient way to generate field configurations φ ∈ R D distributed according to a target probability distribution where j indexes the D components of φ, S(φ) is the action that defines the theory and Z is the partition function. Here, φ is defined to be a vector of D real components representing the combined internal (α) and spacetime (x) degrees of freedom of the field φ(x, α) evaluated on a finite, discrete spacetime lattice (generalizations to gauge fields are discussed in Section IV). A MCMC process generates a chain φ (0) → φ (1) → . . . φ (N ) by steps through configuration space starting with an arbitrary configuration φ (0) . The steps are stochastic and are determined by the probabilities T (φ, φ ) associated with each possible transition φ → φ . These probabilities must be non-negative and normalized: They must also satisfy the conditions of ergodicity and balance to ensure that samples in the chain are drawn from a distribution that converges to p(φ) after thermalization. For the chain to be ergodic, it must be possible to transition from a starting configuration φ to any other configuration φ in a finite number of steps, i.e., and the chain must not have a period, for which it is sufficient that a single state has non-zero self-transition probability, i.e., Balance is the condition that p(φ) is a stationary distribution of the transition: Any procedure which satisfies these conditions will, in the limit of a sufficiently long Markov chain, produce field configurations {φ (i) } distributed according to p(φ).

A. Metropolis-Hastings with generative models
Given a model that allows sampling from a known probability distributionp(φ), a Markov chain for a desired probability distribution p(φ) can be constructed via the independence Metropolis sampler, a specialization of the Metropolis-Hastings method [25]. For each step i of the chain, an update proposal φ is generated by sampling fromp(φ), independent of the previous configuration. This proposal is accepted with probability If the proposal is accepted, . This procedure defines the transition probabilities of the Markov chain. The general Metropolis-Hastings algorithm has been proven to satisfy balance [26] for any proposal scheme. For the independence Metropolis sampler, under the further condition that every state φ has non-zero proposal density and non-zero desired density, the Markov chain is also ergodic and thus guaranteed to converge to the desired distribution [25]. This Markov chain can be intuitively considered a method to correct an approximate distributionp(φ) to the desired distribution p(φ). The accept/reject statistics of the Metropolis-Hastings algorithm serve as a diagnostic for closeness of the approximate and desired distributions; if the distributions are equal, proposals are accepted with probability 1 and the Markov chain process is equivalent to a direct sampling of the desired distribution. This is made precise in Section II C.

B. Sampling using normalizing flows
Here, a normalizing flow model is used to define a proposal distributionp(φ) for a generative Metropolis-Hastings algorithm. Normalizing flows [27] are a machine learning approach to the task of sampling from complicated, intractable distributions. They do so by learning a map from an input distribution that is easy to sample to an output distribution that approximates the desired distribution. Normalizing flow models produce both samples and their associated probability densities, allowing the acceptance probability in Eq. (6) to be calculated.
A normalizing flow enacts the transformation between distributions by a change-of-variables 1 : a smooth, bijective function, f −1 : R D → R D maps samples z from a prior distribution r(z) to φ = f −1 (z). This mapping implicitly defines an output distributionp f (φ), parameterized by the invertible function f . Typically, the prior distribution is a simple and analytically-understood distribution (e.g., a normal distribution). While the desired distribution p(φ) is often complicated and difficult z ϕ to sample from directly, optimizing the function f allows one to generate samples fromp f (φ) ≈ p(φ). The function f is chosen to have a tractable Jacobian such that the probability density associated with any sample φ can be computed exactly: To encode a map from a simple distribution r(z) to a complicated distributionp f (φ), the map f must be highly expressive while also being invertible and having a computable Jacobian. Here, the real non-volume-preserving (real NVP) flow [28] machine learning approach is used: f is constructed by the composition of affine coupling layers that scale and offset half of the components of the input at a time; the choice of which components of the data are transformed is part of the layer definition. Splitting the D-dimensional vector φ into (D/2)-dimensional pieces φ a and φ b according to this choice, a single coupling layer g i transforms φ to z = g i (φ) via where s i and t i are neural networks mapping from R D/2 to R D/2 and denotes element-wise multiplication. Importantly, each affine layer g i is invertible, because the half of the data that is used as input to the neural networks remains unchanged by the layer: The Jacobian matrix is lower-triangular and its determinant can be easily computed. For coupling layer g i : where j indexes the D/2 components of the output of s i . Stacking many coupling layers g 1 , . . . , g n which alternate which half of the data is transformed, the function f is defined as Using the chain rule, the determinant of the Jacobian of f is a product of the contributions from each g i . By increasing the number of coupling layers and the complexity of the networks s i and t i , f can systematically be made more expressive and general. Figure 1 depicts how composing many coupling layers incrementally modifies a prior distribution which is easy to sample into a more complex output distribution that approximates a distribution of interest. For a fixed initial distribution r(z), the neural networks within each affine coupling layer of f can be trained to bringp f (φ) close to the desired distribution p(φ). This training is undertaken by minimizing a loss function. Here, the loss function used is a shifted Kullback-Leibler (KL) divergence 2 between the target distribution of the form p(φ) = e −S(φ) /Z and the proposal distributionp f (φ): This loss function has been successfully applied in related generative approaches to statistical lattice models [30,31]. The formal shift by log Z in Eq. (13) eliminates the need to compute the true partition function, and does not affect the gradients or location of the minima. By non-negativity of the KL divergence, the lower bound on the loss is − log Z, and this minimum is achieved exactly whenp f = p. In practice, the loss is stochastically estimated by drawing batches of M samples from the model φ (i) ∼p f and computing the sample mean: The loss minimization can then be undertaken using stochastic optimization techniques such as stochastic gradient descent or momentum-based methods including Adam and Nesterov [32,33]. By construction, the flow model allows sampling from p f efficiently. The training process can thus be performed by drawing samples from the model itself, rather than using existing samples from the desired distribution as training data. This self-training is a key feature of the proposed approach to Monte Carlo sampling for field theories, where samples from the desired distribution are often computationally expensive to obtain. If samples do exist, they can be used to 'pre-train' the network, although in the numerical studies undertaken here this was not found to be markedly more efficient in network optimization than using only self-training. Given a trained model with distributionp f (φ) ≈ p(φ), samples fromp f can be used as proposals to advance a Markov chain using the generative Metropolis-Hastings algorithm described above. This forms the basis for the flow-based MCMC algorithm proposed here: 1. A flow-based generative model (here, a real NVP model) is trained using the shifted KL loss given in Eq. (13) to have output distributionp f (φ) ≈ p(φ); 2. N proposals φ (i) ∼p f are produced by sampling from the flow-based model (this can be done in parallel) and the associated action S(φ ) is computed for each proposal; 3. Starting from an arbitrary initial configuration, each proposed sample is successively accepted or rejected using the Metropolis-Hastings algorithm given in Eq. (6) to build a Markov chain of length N .
When the prior distribution r(z) is strictly positive, the invertibility and continuity of f guarantees that the generated distributionp f (φ) is also strictly positive. For all models with finite action, and thus p(φ) > 0, the resulting Markov chain is then ergodic by the arguments detailed in Section II A.

C. Autocorrelation time for generative Metropolis-Hastings
For any Markov chain constructed via a generative Metropolis-Hastings algorithm (with independent update proposals), an observable-independent estimator for autocorrelation time can be defined from the accept/reject statistics of the chain. This serves both as a measure of the similarity between the proposal and desired distributions and enables proper error estimation for lattice observables [34].
Precisely, the autocorrelation at Markov chain separation τ , for all observables, is given by the probability of τ rejections in a row, where 1 rej (i) is an indicator variable taking value 1 when the proposed step from i − 1 to i was rejected in the Metropolis-Hastings algorithm, and 0 otherwise. In practice, for a near-equilibrium, finite Markov chain with length N , a finite-sample estimator provides a good approximation to ρ(τ )/ρ(0): This measure of autocorrelation is consistent with the usual definition; it is shown in Appendix A that the standard estimator for the autocorrelation of any given observable O, also converges to p τ rej in the limit N → ∞.
The qualitative relation between acceptance rate and autocorrelations gives a convenient measure of the autocorrelation characteristics of a Markov chain. Precisely, the autocorrelation at distance τ can be bounded in terms of the average acceptance rate a = 1 − E [p rej ]: Increasing the acceptance rate of an independence Metropolis sampler is thus a necessary condition to reduce autocorrelations. Additionally, a = 1 exactly when the proposal and desired distributions are equal. In this case, p rej (φ) = 0 for each φ, and there are no autocorrelations. While bringing a close to 1 does not provide an upper bound on autocorrelation, stochastically improving a loss function that measures distance between distributions is expected to reduce autocorrelations on average. In practice, autocorrelations should be evaluated as a test metric alongside the training loss to confirm improvement over the course of training the model.
The correspondence between loss minimization, acceptance rate, and autocorrelations is studied in the context of φ 4 theory in Section III, where a clear correlation between a and ρ(τ )/ρ(0) is observed.

D. Critical slowing down
When a distribution is sampled using a Markov chain with large autocorrelation time, many updates are required to produce decorrelated samples. Critical slowing down (CSD) is defined as the divergence of the autocorrelation time of Markov chain sampling as a critical point in parameter space is approached [1]. A numerically-stable definition of the characteristic autocorrelation time of a Markov chain is the integrated autocorrelation time: As a critical point is approached, analysis of standard local-update algorithms for lattice models suggests τ int O is typically well-described by a power law in the lattice spacing, or for fixed physical volume, a power law in the lattice sites per dimension, L. A dynamical critical exponent z O is thus defined by a fit to τ int O = α O L z O along a line of constant physics. An update algorithm for which the critical exponent is zero is unaffected by CSD.
In any generative Metropolis-Hastings simulation, the autocorrelation time is completely fixed by the expected accept/reject statistics, which in turn result from the structure of the proposal and desired distributions. For models trained with a target value of the integrated autocorrelation time used as a stopping criterion, CSD associated with the Markov chain sampling is thus trivially removed at the expense of up-front training costs. The difficulty of CSD is in essence shifted to the training of the model, i.e., to the optimization of the proposal distribution.
In this study, the viability of using machine-learned models to producep is demonstrated for a simple system. It remains to be shown for theories of more physical interest that models can be trained to produce approximate distributionsp from which decorrelated samples can be efficiently generated. There are, however, reasons for optimism. Generative models, and in particular flow-based models, are rapidly evolving towards more efficient representation capacity. Complex coupling layers have been implemented [28,35], as have generalized convolutions [36,37] and transformations with continuous dynamics that are not dependent on restricted coupling layers [38]. Moreover, the algorithm proposed here can be trained with constant memory cost [39], alleviating the storage limitations that can arise in gradient-based optimization. Large-scale parallelization of generative models can also be achieved with simultaneous inference and sampling [40]. Finally, models trained with respect to an action at a given set of parameter values can either be used to initialize training or as a prior distribution for models targeting that action at nearby parameter values, reducing the cost associated with parameter tuning.

E. Related machine learning approaches to MCMC
A number of other machine learning approaches to accelerate MCMC have been explored, primarily in the context of quantum many-body systems.
For example, self-learning Monte Carlo (SLMC) methods construct, by a variety of techniques, an effective Hamiltonian for a theory that can be more easily sampled than the original Hamiltonian [41][42][43][44][45]. The effective Hamiltonian is learned using supervised learning techniques based on training data drawn from a combination of existing MCMC simulations, randomly-mutated samples, and the accelerated Markov chain itself (hence the term "self-learning"). The flow-based MCMC algorithm proposed here similarly learns an approximate distribution by self-learning. In contrast to SLMC methods, the model used here allows direct sampling, eliminating the need for intermediate MCMC steps to produce training data and enabling final sampling using an independence Metropolis sampler.
Normalizing flows have also been used to sample from Boltzmann distributions in condensed matter systems. In Ref. [30], a renormalization group architecture was employed to specify an expressive change-of-variables that exploited translational invariance and a hierarchy of scales. The normalizing flow model was trained using the shifted KL divergence also used here. In that work, accelerated MCMC sampling was achieved using HMC in the space of variables drawn from the prior distribution, in contrast to the independence Metropolis sampler proposed here. The renormalization group architecture for normalizing flows is an intriguing possibility for future applications of flow-based MCMC.
Finally, several machine learning generalizations of HMC have been developed. In Ref. [46], volumepreserving flows were learned in an augmented space that introduced auxiliary variables analogous to HMC momenta. Likelihood-free adversarial training [47] was employed to optimize these flows. Acquiring approximately independent training samples, however, requires running the Markov chain itself for many steps. Moreover, the volume-preserving constraint on the normalizing flow results in less expressive power in comparison to generic normalizing flows [35]. By careful construction, non-volume-preserving flows have also been applied to construct Markov chains in spaces augmented with auxiliary variables [48].  The theory with a massive scalar field φ(x) and a quartic self-interaction is one of the simplest interacting field theories that can be constructed. It is thus a convenient testing ground for new algorithms for lattice field theory, such as the flow-based MCMC approach proposed here.
In a d-dimensional Euclidean spacetime, a discretized formulation of φ 4 theory can be defined on a lattice with sites x µ = an µ , where a denotes the lattice spacing, µ ∈ {1, . . . , d} labels spacetime dimension, and n µ ∈ Z d . Here, a finite lattice volume V = (aL) d is considered, with periodic boundary conditions in all dimensions. The lattice action (in units where a = 1) can be expressed as where the parameters m 2 and λ are the bare mass squared and bare coupling, respectively, and the lattice d'Alembert operator is defined by (21) By taking expectation values over the distribution p(φ) = e −S(φ) /Z, observables in the theory can be estimated.
The observables studied here are the connected twopoint Green's function and its momentum-space representatioñ where x µ = ( x, t), as well as the corresponding pole mass and the two-point susceptibility In the limit λ → ∞, with m 2 /λ < 0 fixed, scalar φ 4 theory reduces to an Ising model. Another observable of interest is therefore the average Ising energy density [49], defined by where the sum runs over single-site displacements in all dimensions.
The action of φ 4 theory is invariant under the discrete symmetry φ(x) → −φ(x). Depending on the value of the parameters m 2 and λ, this symmetry can be spontaneously broken. The theory thus has two phases: a symmetric phase and a broken-symmetry phase.

A. Model definition and training
For this proof-of-principle study, the flow-based MCMC algorithm detailed in Section II was applied to φ 4 theory in two dimensions with L = {6, 8, 10, 12, 14} lattice sites in each dimension. The parameters m 2 and λ were chosen to fix m p L ≈ 4 for each lattice size; their numerical values are given in Table I. For simplicity in this initial work, all parameters were chosen to lie in the symmetric phase. In principle, the flow-based MCMC algorithm can be applied with identical methods to the broken-symmetry phase of the theory, but it remains to be shown that models can be trained for such choices of parameters.
For each set of parameters, real NVP models were defined using 8-12 affine coupling layers. The coupling layers were defined to update half of the lattice sites in a checkerboard pattern; successive layers alternately updated the odd and even sites. The neural networks s i and t i used in coupling layer g i (see Eq. (9)) were constructed from two to six fully-connected layers of 100-1024 hidden units with leaky rectified linear units [50] as non-linear activation functions between each stage. The prior distribution r(z) was chosen to be an uncorrelated Gaussian distribution in which the value at each lattice site was independently sampled from a Gaussian with mean 0 and standard deviation 1. The models were trained to minimize the shifted KL loss between the output distributioñ p f (φ) and the desired distribution p(φ) = e −S(φ) /Z using gradient-based updates with the Adam optimizer [32]. A mean absolute error loss, defined in Appendix B, was optimized before training in the case of the 14 2 model where it was found to accelerate convergence to the KL loss minimum.
An exhaustive study of the optimal choice of prior distribution r(z), model depth, architecture and initialization of the neural networks, and of the mode of coupling of the affine layers, is beyond the scope of this proof-of-principle study. The parameters used here, however, proved to define sufficiently expressive models such that the Metropolis-Hastings algorithm applied to output from the trained models easily achieved acceptance

FIG. 2:
Histograms of length of consecutive runs of Metropolis rejections in machine-learned (ML) models at both 50% and 70% mean acceptance. Also shown is the same statistic for Markov chains generated via HMC, where mean acceptance was tuned to 70%. The frequency of long runs of rejections is consistently reduced for models trained to reach higher average acceptance. The ML and HMC ensembles at 70% acceptance display very similar distributions of rejection streaks. rates of well over 50%. With further investment in hyperparameter optimization, higher rates of acceptance could be achieved. In any Markov chain using the Metropolis-Hastings algorithm, there is a tradeoff between computational cost and correlations resulting from low acceptance rates. The optimal acceptance rate minimizes the cost per decorrelated sample from the chain. Here, the cost of training, and not just model evaluation, must be considered, and the optimal level of training in future applications will depend on many factors, such as the desired ensemble size.
For each set of parameters studied, instances of the model were trained to reach both 50% and 70% average Metropolis acceptance. Figure 2 shows histograms of the number of updates between accepted configurations for models at both levels of training. Models trained to reach the higher acceptance rate are seen to have shorter runs of consecutive rejections. Because autocorrelation is related to rejections by ρ(τ )/ρ(0) = p τ rej for independence Metropolis sampling, a reduced frequency of rejection runs with length longer than τ directly implies a reduction in ρ(τ )/ρ(0). Implications for critical slowing down of the generation of decorrelated configurations are discussed in Section III C.
For comparison, ensembles of 100,000 lattice configurations were generated using the machine-learned models in flow-based MCMC as well as standard local Metropolis [51] and Hybrid Monte Carlo (HMC) [52] algorithms at matched parameters. The local Metropolis algorithm employed a fixed order of sequential updates to each site,    with proposed updates to φ(x) sampled uniformly from the interval [φ(x) − δ, φ(x) + δ] followed by a Metropolis-Hastings accept/reject step; for all parameters considered, the width δ was tuned to achieve a 70% acceptance rate. The HMC method was implemented us-ing a leapfrog integrator with a fixed division of trajectory length τ into 10 steps; the trajectory length τ was also tuned to achieve a 70% acceptance rate. method. The flow-generated configurations cannot be distinguished by eye from those generated using local Metropolis and HMC; the generated samples are varied and display correlations on similar length scales as those produced by the standard MCMC techniques. A quantitative comparison of the ensembles generated by the different algorithms is made in the following section.

B. Tests: physical observables and error scaling
Since the flow-based MCMC algorithm satisfies ergodicity and balance, it is guaranteed to produce samples from the desired probability distribution in the limit of an infinite chain. To test the performance of the algorithm for a finite number of samples, each of the physical observables defined above was computed on ensembles of configurations at the parameters of Table I, generated both using standard HMC and local Metropolis methods, as well as with the trained flow-based MCMC algorithm. Figures 4-6 compare the observables computed on ensembles generated using all three methods.
To estimate the pole mass m p , an effective mass is defined based on the zero-momentum Green's functions at various time separations: For all observables, the values computed using the flowbased MCMC ensembles are consistent within statistical uncertainties with those computed using the standard methods. Moreover, Figure 7 shows that the statistical uncertainties of the observables scale as 1/ √ N with the number of samples N , as expected for decorrelated samples.

C. Critical slowing down
For φ 4 theory, a number of algorithms have been developed that mitigate CSD to various extents, such as worm algorithms [49], multigrid methods [53], Fourieraccelerated Langevin updates [54] and cluster updates via embedded Ising dynamics [55]. The path towards generalizing those algorithms to more complicated theories such as QCD, however, is not clear. Algorithms such as HMC and local Metropolis, which are also used for In (c) the upper sets of points in blue correspond to models trained to a mean acceptance rate of 50%, while the lower sets of points in green correspond to models trained to a mean acceptance rate of 70%. Dashed red lines display power law fits to L = {10, 12, 14} with labels L z specifying the scaling. The HMC and local Metropolis methods demonstrate power-law growth of τint, while τint for the flow-based MCMC is consistent with a constant in L and decreases as mean acceptance rate increases. Dot-dashed blue and green lines for the flow-based ensembles display lower bounds in terms of mean acceptance rate based on Eq. (18). Error bars indicate 68% confidence intervals estimated by bootstrap resampling and error propagation.
studies of QCD and pure gauge theory, exhibit CSD for φ 4 (as well as more complicated theories) as the continuum limit is approached.
The parameter sets chosen for the study of φ 4 theory in this work (Table I) correspond to a critical line with constant m p L as L → ∞. For the flow-based MCMC approach proposed here, as well as for ensembles generated using the HMC and local Metropolis algorithms, the autocorrelation times of the set of physical observables discussed previously were fit to leading-order power laws in L to determine the dynamical critical exponents z O for that observable. Figure 8 shows the autocorrelation times for each observable for each approach to ensemble generation. The absolute values of τ int are not directly comparable between methods because the cost per update differs. The scaling with lattice size, on the other hand, indicates the sensitivity of each method to critical slowing down. For both HMC and local Metropolis, the critical behavior and consequently the performance of the algorithm was found to depend on the observable. In each case, the critical exponent was 0.8 z O 2.7. In comparison, for the flow-based MCMC ensembles at a fixed acceptance, the critical exponent was found to be consistent with zero, with the autocorrelation time observableindependent and in agreement with the acceptance-based estimator defined in Section II C.
Since the the mean acceptance rate was used as the stopping criterion for training these models, it was not guaranteed a priori that the measured integrated autocorrelation time would be constant across the different models used. The results in Figure 8, however, suggest that beyond the simple lower bound from Eq. (18) there is a strong correlation between the mean acceptance rate and integrated autocorrelation time for models trained using a shifted KL loss. This is further confirmed by the similarity of the rejection run histograms across lattice sizes for flow-based MCMC, as shown in Figure 2.
As discussed in Section II D, while CSD in the sampling step for the flow-based MCMC is eliminated, the cost is potentially transferred to the training of the flowbased generative model. For the models produced here, achieving the target acceptance rate on larger lattice volumes required more training epochs and more care in hyperparameter choices than for smaller volumes, although standard training techniques were sufficient in all cases.

IV. SUMMARY
This work defines a flow-based MCMC algorithm to sample lattice field configurations from a desired probability distribution: 1. A real NVP flow model is trained to produce approximately the desired distribution; 2. Samples are proposed from the trained model; 3. Starting from an arbitrary configuration, each proposal is accepted or rejected to advance a Markov chain using the Metropolis-Hastings algorithm.
The approach is shown to define an ergodic and balanced Markov chain, thus guaranteeing convergence to the desired probability distribution in the limit of a long Markov chain. In essence, the flow-based MCMC algorithm combines the expressiveness of normalizing flows based on neural networks with the theoretical guarantees of Markov chains to create a trainable and asymptotically-correct sampler. Since these flows are applicable for arbitrary configurations with continuous, real-valued degrees of freedom, one can generically apply this method to any of a broad class of lattice theories.
Here, the algorithm is implemented in practice for φ 4 theory, and is demonstrated to produce ensembles of configurations that are indistinguishable from those generated using standard local Metropolis and HMC algorithms, based on studies of a number of physical observables.
A key feature of the approach is that models trained to a fixed acceptance rate do not experience critical slowing down in the sampling stage. In particular, the autocorrelation time for all observables is dictated entirely by the accuracy with which the flow model has been trained; perfect training corresponds to decorrelated samples and 100% acceptance in the Metropolis-Hastings step of the MCMC process. Nevertheless, the efficiency with which the training step of this approach can be scaled to larger model sizes, and to more complicated theories such as QCD, remains to be studied. Recent advances in the training and scaling of flow models, and in particular demonstrations of constant memory cost training [39], provide reasons for optimism on this front. Further, incorporating symmetries generally improves data efficiency of training, and implementing spacetime and gauge symmetries [56] may be a natural next step to practically train these flow models for lattice gauge theories like QCD.
In moving towards lattice gauge theories such as QCD, several theoretical developments are also required. The real NVP model chosen to parameterize the normalizing flows here is described in terms of vectors of variables φ ∈ R D . Gauge configurations, however, live in a compact manifold arising from the Lie group structure. Extending this method will require a normalizing flow model that can act on this manifold while remaining sufficiently expressive. The choice of prior likewise will need to be extended to a distribution over the manifold of lattice gauge configurations which can be easily sampled. A uniform distribution, for example, may be a candidate for a prior, but this choice must be tested in the context of a specific flow model.
If the flow-based MCMC algorithm proposed here can be implemented for a complex theory such as QCD, the advantages would be significant; arbitrarily large ensembles of field configurations could be generated at minimal cost. The independence of the proposal step from any previous configuration allows parallel generation of proposals, and the continually-improving support in hardware and software for neural network execution suggests future practical gains for this style of ensemblegeneration. Given efficient sample generation from a trained model, ensembles would not need to be stored long-term. Moreover, a model trained for one action could either be re-trained or used as a prior for another flow model targeting an action with nearby parameter values. This would allow efficient tuning of parameters and generation of additional ensembles interpolating between and extrapolating from existing models. 1748958. PES is supported by the National Science Foundation under CAREER Award 1841699, GK is supported by the U.S. Department of Energy under the SciDAC4 award de-sc0018121, and PES and MSA are supported in part by NSERC and the Perimeter Institute for Theoretical Physics. Research at the Perimeter Institute is supported by the Government of Canada through the Department of Innovation, Science and Economic Development and by the Province of Ontario through the Ministry of Research and Innovation.
Appendix A: Acceptance rate estimator for autocorrelation Here it is shown that the standard estimator for the autocorrelation of an observable O converges in the limit of infinite path length to p τ rej , as claimed in Section II C. The standard estimator is defined by: where the final equality is true assuming the Markov chain is initialized with a sample from the stationary distribution p(φ) (i.e., it is assumed that enough prior iterations were discarded that the chain is thermalized).
The expectation value can then be split by cases and, conditioning on the fixed accept/reject pattern, the expectation values can be computed by identifying the distributions of observables O i and O i+τ : O i −Ō 2 all proposals i + 1, . . . , i + τ rejected + (A4) O i −Ō 2 some proposal i + 1, . . . , i + τ accepted (A5) The limit N → ∞ is used to drop biases arising from conditioning on behavior within the region [i, i + τ ].