Applications of flow models to the generation of correlated lattice QCD ensembles

Machine-learned normalizing flows can be used in the context of lattice quantum field theory to generate statistically correlated ensembles of lattice gauge fields at different action parameters. This work demonstrates how these correlations can be exploited for variance reduction in the computation of observables. Three different proof-of-concept applications are demonstrated using a novel residual flow architecture: continuum limits of gauge theories, the mass dependence of QCD observables, and hadronic matrix elements based on the Feynman-Hellmann approach. In all three cases, it is shown that statistical uncertainties are significantly reduced when machine-learned flows are incorporated as compared with the same calculations performed with uncorrelated ensembles or direct reweighting.


I. INTRODUCTION
Understanding the strongly interacting sector of the Standard Model of particle physics, described by the theory of quantum chromodynamics (QCD), is essential for advancing particle and nuclear physics.The numerical framework of lattice QCD is a systematically improvable tool to explore the dynamics of the strong nuclear force.This approach has enabled precise calculations across applications spanning from hadron structure to high-temperature QCD and nuclear physics [1,2].Nevertheless, there is great potential to extend the reach of lattice QCD beyond the current state of the art if computational challenges such as critical slowing down, topological freezing, and signal-to-noise problems can be overcome.In this context, emerging machine learning techniques offer a promising avenue towards mitigating these computational obstacles [3,4].
A growing community effort is developing at the intersection of machine learning and lattice QCD-see e.g.Refs.[5][6][7][8][9] for a selection of applications.In particular, generative flow models [10][11][12] are one of several promising pathways which show potential to accelerate the sampling of lattice field configurations.This line of investigation is developing, with demonstrations in 2D theories [9, and first applications to 4D gauge theories with and without fermions [41][42][43].While the field is progressing rapidly, achieving high-quality models that can be applied at the scale of state-of-the-art calculations still requires further engineering [44].In addition to their promise in the context of sampling, flow models-functioning as approximate maps between distributions-can be used to accelerate lattice QCD calculations in qualitatively different ways.For example, flow models provide a promising new approach to determining thermodynamic observables [9,30,39,45].
In this work, we explore applications which utilize flows to map gauge field configurations between distributions defined by different Euclidean lattice action parameters.Such flows can be used to generate multiple statistically correlated ensembles at different parameters.As we explore in this work, this may be particularly valuable when the variation of some quantity with respect to the action parameter is of physical or computational interest-see also Refs.[46,47].The advantage of flows in this context originates from correlated cancellations of uncertainties between expectation values evaluated at different action parameters, which leads to reductions in the number of configurations needed to achieve a fixed statistical error.
Examples of physically relevant applications of derivatives with respect to action parameters include continuum and chiral extrapolations as well as the computation of matrix elements such as the chiral condensate, the nucleon sigma term, or other observables, using Feynman-Hellmann techniques.Another is derivatives with respect to the electromagnetic coupling for scale setting or to compute isospin breaking corrections in QCD+QED [48,49].One may also consider applications in theories with a sign problem, e.g., to derivatives with respect to the baryon chemical potential or the QCD θterm.In all of these cases, the distributions to be related by a flow transformation are much more similar than in applications intended to accelerate sampling, and current flow methods can already be applied at the scale of typical lattice QCD calculations.Three selected applications are investigated, namely the continuum extrapolation of gradient flow scales, the computation of the gluon momentum fraction of the pion in quenched lattice QCD using the Feynman-Hellmann approach, and the mass dependence of observables in N f = 2 QCD.This paper is organized as follows.In Section II, we discuss preliminaries on flows, their applicability in the context of correlated ensembles, and the residual flow architectures used in this work.The three numerical demonstrations are presented in Section III.We conclude in Section IV.Appendix A provides further details of the flow models used in this work.

II. FLOWS FOR THE GENERATION OF CORRELATED ENSEMBLES
A. Flows for lattice QCD This section presents an introduction to normalizing flows [10][11][12], reviewing the key ideas relevant for the present work.
A "flow" is defined as a diffeomorphism f between probability distributions that maps samples from a base (or prior) distribution, r(U ), to a model distribution with density where V = f (U ).Flows can be constructed such that they have many free, trainable parameters.These parameters may be optimized such that the model distribution approximates some target distribution p, i.e., q(V ) ≃ p(V ).
For the applications explored in this work, flow models are constructed in which the samples U are lattice gauge-field configurations, and the probability distributions p(U ) and r(U ) are defined in terms of Euclidean lattice actions such that r(U ) ∝ exp(−S 0 (U )), and p(V ) ∝ exp(−S 1 (V )).In most cases, it is not necessary to know the normalization of p or r (the exception being thermodynamic observables [9]).
Expressive flow transformations can be constructed in a variety of ways, for example as the composition of n invertible layers Architectures for invertible layers g i which act on lattice gauge fields have been discussed in Ref. [43].The particular constructions used in this work are detailed in Section II C. Given a model, its trainable parameters may be optimized in various ways.One choice is to minimize the Kullback-Leibler (KL) divergence [50] between the model and target distributions.Approaches such as path gradients [51], related control variate methods [43], as well as the "REINFORCE" algorithm [52], may be be used to improve and accelerate training dynamics by reducing the variance associated with stochastic gradient estimates.After optimization, model quality can be characterized using the Effective Sample Size per configuration (ESS), estimated using N gauge field configurations generated from q(V ), and where w(V i ) = p(V i )/q(V i ) is the reweighting factor of the ith configuration.The values of the ESS lie in the interval ESS ∈ [1/N , 1], with ESS = 1 corresponding to a perfect model.
In practice, a learned flow is not perfect, but may function as an approximate map between distributions.To ensure correctness of expectation values computed on the flowed configurations, one may use the independence Metropolis algorithm [53][54][55] or simply reweighting, with the weight of each configuration given by w(U ).Expectation values of observables such as plaquettes, hadronic correlation functions, or the topological charge can be directly reweighted as where the notation ⟨⟩ q is used to refer to expectation values with respect to the probability distribution q, and we assume the reweighting factors have been properly normalized such that ⟨w⟩ q = 1.Derived quantities, such as gradient flow scales or hadron masses, can be computed from reweighted correlation functions.Statistical uncertainties in reweighted quantities are typically larger than those before reweighting.A rough estimate of the increase in the variance is a factor of ≃ 1/ESS.

B. Correlated ensembles and flows
While applications of flows to accelerate the generation of field configurations continue to advance, here we describe another avenue for flow models to improve lattice QCD calculations by reducing the variance of observables that can be computed from differences between quantities at different action parameters.The key idea is the following.Consider a generic parameter of the action, α.The goal is to compute some observable O as a function of α, and in particular the derivative where the right-hand side is a finite-difference approximation of the derivative using ∆α = α 1 − α 2 , with ⟨⟩ α denoting the expectation under the distribution defined by the action parameter α, i.e., p α .Higher order derivatives, or derivatives of one observable with respect to another, may be computed in a similar way.
In this work, we consider three qualitatively different approaches to the computation of the quantity in Eq. (5).The first two are standard tools in common use: 1. Use a very small step ∆α = ϵ, and compute the numerator in Eq. ( 5) with ϵ reweighting: where w ϵ = p α1+ϵ /p α1 .For this approach, the ESS generically degrades as ESS = 1 − k(∆α) 2 + . .., where k is a problem-specific constant.The separation ϵ may be made small without compromising signal-to-noise due to correlated noise cancellations between the two expectation values.As ϵ → 0 it becomes exact, recovering an estimate statistically identical to that obtained by applying the derivative analytically.
2. Generate independent ensembles to separately compute expectation values at α 1 and α 2 in Eq. ( 5).This enables use of much more widely separated α 1 and α 2 than accessible with reweighting, thereby allowing exploitation of the bias-variance tradeoff to reduce statistical uncertainties while accepting additional discretization artifacts from the finite-difference approximation in order to improve signal-to-noise.However, this effect must be sufficiently large to compensate for the lack of correlated noise cancellations.
These two methods each have different capabilities, with each useful for different applications.Incorporating flows provides an additional approach that combines some of the advantages of both: 3. Use a trained flow model to map configurations between the distributions given by α 1 and α 2 .Including flow reweighting factors, correlated differences can be calculated as: where w(f (U )) = p α2 (f (U ))/q(f (U )), such that a perfect flow would remove the reweighting factors entirely.This approach benefits from the same correlated cancellation of uncertainties as does ϵ reweighting, while allowing for larger steps in ∆α to exploit the bias-variance tradeoff as does the approach using independent ensembles.
In Section III below, we provide numerical demonstrations of the advantages of this flow-based approach.Note that the latter two approaches, with finite separation in α, can be combined with improved finitedifference estimators of derivatives to reduce the O(∆α) bias, or by fitting the α dependence at the cost of introducing model dependence.

C. Architecture based on residual flows
The flow architecture used in this work is based on that introduced in Ref. [43], with a series of improvements that are detailed below.The flow transformation is defined as the composition of trainable gauge-equivariant layers that act directly on the gauge links.The transformation of a gauge field U → U ′ through an SU(N )residual layer can be expressed as where g x (U ) is an algebra-valued matrix which can in principle have an arbitrary dependence on the entire gauge-field configuration, as long as it transforms locally under gauge transformations, g x (U ) → Ω † x g x (U )Ω x ; here Ω x denotes a gauge transformation and the subscript labels the spacetime dependence.This transformation can be inverted by fixed point iteration, with a unique solution guaranteed if the Lipschitz continuity condition is satisfied [43].
For numerical tractability, each layer partitions the gauge field and transforms only the active links, defined as those with fixed direction µ on a subset of lattice sites {x a }, conditioned on the values of the remaining frozen links U f .Each layer acts as that is, g x for any given active link depends on all frozen links but only the same active link.This separation of variables allows efficient computation of the Jacobian of the transformation using automatic differentiation as described in Eq. ( 26) of Ref. [43].In the present work, we use two partitioning schemes (also referred to as "masking patterns") for the site index: 1.A checkerboard or "mod 2" masking pattern, where the active links are those with direction µ in the positions that satisfy (p + µ x µ ) = 0 (mod 2) for for p ∈ 0, 1.A stack of 8 layers is needed to transform all links, i.e., 2 complementary checkerboards in each of the 4 directions µ.This is a simple nontrivial choice that updates all variables within a small number of layers.

2.
A "mod 4" masking pattern, where the positions of active links satisfy (p + µ x µ ) = 0 (mod 4), for p ∈ 0, 1, 2, 3. 16 layers are thus needed to transform every link on the lattice.This choice is more expensive than the "mod 2" pattern described above, but it can also be more expressive by allowing a more complicated dependence of the transformation on the frozen links.
The function g x (U f , U µ (x a )) must be constructed in a way that is expressive but simple to evaluate.One simple construction utilizes 1 × 1 staples (depicted in Figure 1), such that the 1 × 1 loops, have the same gauge transformation as g x .One can then define a covariant algebra-valued object as, e.g., where W x,µν = W R x,µν + W L x,µν , and P(W ) is the gauge-covariant traceless anti-Hermitian projection of W .Moreover, α x,µ ] is thus gauge covariant and can be used to construct g x (U ).One choice of such a construction is: where f (x) is e.g., a ratio of polynomials-see Appendix A for an example.
A useful modification to this construction is to consider Wilson loops that are larger than 1 × 1. Sums of such loops can be constructed iteratively, by repeatedly adding together links and staples which transform in the same way, and finally computing a 1 × 1 loop.This is inspired by similar transformations used in Refs.[41,56] and resembles the learned smearing of Ref. [57].This gauge-equivariant "convolution" can be written explicitly as the recursion where η ℓ i,ρ are trainable coefficients, and L ℓ and R ℓ label generic staple-like objects that transform in the same way as the gauge links.Here we use two explicit choices, R 1 µν = (S R x,µν ) † in Eq. ( 10) and R 2 µν = W R x,µν U µ , and similarly for L ℓ µν ; see Figure 1.Note that in Eq. ( 14), these objects are computed using the variables V (i) .After iterating, V (i) is not an element of the gauge group, but this is not important since ultimately there is a projection to the algebra to construct G µ in Eq. (12).
The iterative procedure in Eq. ( 14) can be used to construct expressive residual layers.After applying n pt iterations of Eq. ( 14) to Eq. ( 15), the resulting values of V (npt) can be used to construct the quantity g x (V (npt) , U µ (x a )) that enters in the transformation of the residual layer defined in Eq. ( 9).Specifically, the convoluted frozen links, V (npt) , are used to construct the staples in Eq. ( 11) instead of U f .

III. EXAMPLE APPLICATIONS
Physics contexts in which derivatives of the form of Eq. ( 5) arise are ubiquitous; here we discuss three examples.First, derivatives with respect to the gauge coupling β can be used to constrain continuum extrapolations.Second, matrix elements may be computed using Feynman-Hellmann techniques, where derivatives with respect to action parameters correspond to single insertions of the corresponding operator.Second-order FIG. 1. Sketch of the recursive transformation, Eq. ( 14), to build generic Wilson loops in the residual layers.
derivatives using Feynman-Hellmann also access physically relevant processes, e.g., Compton scattering.Third, derivatives with respect to the quark mass can be employed to constrain chiral extrapolations or in calculations of e.g., sigma terms.This section presents numerical demonstrations using flows to improve estimates of these three kinds of derivatives.
The flow models used in these applications are summarized in Table I.All flow models have been optimized using path gradients [51] as described in Ref. [43].Gauge field samples for both training and evaluation are obtained using standard Markov Chain Monte Carlo methods, specifically the (pseudo-)heatbath algorithm with overrelaxation [58][59][60][61][62] for Yang-Mills theory and the Hybrid/Hamiltonian Monte Carlo [63] (HMC) algorithm for QCD.

A. Continuum limit of gauge theories
One application in lattice QCD for flow-correlated ensembles is in taking the continuum limit.For a numerical demonstration, we consider gradient flow scales.
We use the pure-gauge SU(3) theory, with action where β is the inverse squared bare gauge coupling and U µν is the plaquette.The continuum limit of lattice spacing a → 0 corresponds to β → ∞.
One class of observables is obtained by using the gradient flow; in particular, a scale t c can be defined implicitly from where c is a numerical constant, and E(t) is the energy density at flow time t, for which we use the plaquette definition; see Eq. (3.1) in Ref. [64].The choice c = 0.3 defines the scale t 0.3 , often referred to as "t 0 ".One can compute the ratio of two gradient flow scales t 0.3 /t 0.35 , which can be related to the ratio of the the strong coupling at two different energy scales [64].The continuum limit of this quantity takes the form where k 1 is a dimensionless constant, the ellipsis indicates higher orders in a 2 , the subscripts "lat" and "cont" refer to finite-a and continuum values, and discretization effects are parameterized by powers of a 2 /t 0.3 .The standard approach for performing a continuum extrapolation in lattice QCD relies on computing the desired quantity at several different lattice spacings using independent ensembles and extrapolating.This method can be improved by additional constraints on such an extrapolation in the form of derivatives Without generating more ensembles, this derivative can be computed using finite differences combined with ϵ reweighting or with flows to nearby values of the lattice spacing, or equivalently, values of the bare gauge coupling β: Note that the gradient flow scales t c are derived quantities, so we use the notation "| β " to indicate that they have been computed in a theory with the given β.
To demonstrate the advantage gained by using flows, we compute Eq. ( 20) using ϵ reweighting (Eq.( 6)) and the flowed approach (Eq.( 7)) and compare.For this test, we use 96k configurations at β = 6.02 on volume L 4 = 16 4 .For ϵ reweighting, we use a step of ∆β = 0.001, leading to an ESS of 96% on this ensemble.For the flowed approach, we use Model A of Table I, which maps from β = 6.02 to β = 6.03, that is ∆β = 0.01.This model achieves an ESS of 67%, which is significantly higher than direct reweighting, which has an ESS of 2% at the same target parameters.Using these approaches, we find Flow: k(a 2 ) = −0.0167(41) , that is, the statistical uncertainly using ϵ reweighting is 50% larger than that obtained with flows.In other words, one needs about 2.4× fewer samples using the flow method as compared with ϵ reweighting to achieve the same statistical uncertainty.
Assuming that cutoff effects are already in the linear regime at this value of the lattice spacing, one can use this procedure to perform a simple continuum extrapolation of the ratio of flow scales.The continuum-extrapolated results show the same hierarchy of uncertainties as in Eq. ( 21): Flow: t 0.3 /t 0.35 | cont = 0.8539 (13) , ϵ reweighting: t 0.3 /t 0.35 | cont = 0.8552 (20) . ( These results are shown in Figure 2 for the two methods.

B. Hadron structure with Feynman-Hellmann techniques
Another promising application of machine-learned flows is in the calculation of matrix elements via the Feynman-Hellmann (FH) approach-see Refs.[65][66][67][68] for recent applications.In this framework, a matrix element where h is a stable hadron at rest and O is the operator of interest projected to zero momentum, is computed by taking derivatives with respect to a parameter in the action.Specifically, adding the operator to the action as the matrix element can be obtained as where M h is the hadron mass.In practice, this can be estimated using a finite-difference approximation of the derivative, e.g., Other improved finite-difference approximations or modeling-based approaches may also be used to better control the O(λ 2 ) bias.As a numerical demonstration, we consider a Feynman-Hellmann calculation of the gluon momentum fraction of the pion in the quenched approximation of lattice QCD, similar to Ref. [65].In this case the operator O may be defined as where i, j ∈ (1, 2, 3), which is a discretization of the Energy-Momentum-Tensor (EMT).The matrix element can then be related to the gluon momentum fraction of the hadron ⟨x⟩ g by where the superscript "latt" emphasizes that it is a bare matrix element.When adding this operator to the gauge action with a small parameter λ, the full action can be seen as an anisotropic action with different couplings for the temporal and spatial plaquettes: It is therefore possible to use flow transformations to map from the isotropic pure gauge action at λ = 0 to nonzero values of λ.This target is referred to as "Feynman-Hellmann" in Table I.
We test the flowed approach by computing the difference in Eq. ( 26) using an ensemble generated at λ = 0 and flowed to non-zero ±λ values.The choice λ = 0.01 is small enough that O(λ 2 ) discretization artifacts in the derivative are negligible; compare to the results in Ref. [65].We train two flows, B1 and B2 in Table I, which achieve an ESS of 84% at the evaluation volume, cf. the direct reweighting ESS of around 2% at the same values of λ.The target parameters are matched to Ref. [65], albeit at a smaller volume.The value of β = 6 corresponds to a lattice spacing of a ≃ 0.09 fm (using the Sommer radius to set the scale [69]), and the hopping parameter κ in the quenched Dirac operator-related to the bare quark mass as κ = 1/(2m 0 + 4)-is taken to be κ = 0.132.The lattice spatial and temporal extent are L = 8 and T = 16, such that M π L > 4. For the purpose of this demonstration, we approximate the pion masses using the effective mass at the center of the lattice, where C π (t) is the pion correlator.In physical units, M π ≃ 1.2 GeV.For evaluation, 14k gauge-field configurations are generated using 1 heatbath step with 5 overrelaxation steps between measurements for each independent ensemble.Correlation functions are measured with four smeared sources per configuration with point sinks, using Chroma [70].The pion mass as a function of λ is shown in Figure 3a, as determined using ϵ reweighting, independent ensembles, and flowed ensembles.Since the flow model quality at the volume of interest is very high, uncertainties in the observables computed on flowed ensembles are very similar to those computed using ensembles generated with heatbath.
The physical quantity of interest, ⟨x⟩ latt g , depends on the difference between the pion mass determined at different values of λ.When this difference is computed using independent ensembles, statistical uncertainties add in the usual way, and the error in the correlated difference is larger than that of each M π (λ) estimate.In contrast, for flowed ensembles or ϵ reweighting, cancellations of correlated fluctuations significantly reduce the variances.This can be seen in Figure 3b, which shows ⟨x⟩ latt g computed following the different methods outlined in Section II.The use of flowed ensembles reduces the uncertainty by a factor of ≃ 7 with respect to independent ensembles, and ≃ 5 with respect to ϵ reweighting (using λ = 10 −4 with an ESS of 99.93%).Thus, incorporating flows into this calculation leads to a reduction of more than 20× in the number of configurations necessary to achieve the same statistical error.
It is also possible to compute the second derivative of M π with respect to λ, which can be approximated as While for the particular case of the gluon energymomentum tensor this derivative is not physically relevant, second derivatives are related to matrix elements of two-current insertions-see for instance Compton scattering applications [71,72].Using the same three methods as for the first derivative, we find: Indep.ens.: All the determinations yield numbers that are zero within two standard deviations, but the relative magnitude of the uncertainties can nevertheless be used to assess the advantage of the flowed approach.In particular, for the second derivative, the error reduction when using flows is larger than for the case of the first derivative, a factor of 7 − 10 smaller than that obtained using ϵ reweighting or independent ensembles.This, in turn, leads to requiring one to two orders of magnitude fewer configurations to achieve some target statistical precision.

C. Mass dependence of QCD observables
As a third example, we compute derivatives with respect to the quark mass in QCD with N f = 2 unimproved Wilson fermions.As a simple demonstration, we work directly with the action including the exact fermion determinant, where S g (U ) is the plaquette gauge action and D w is the discrete standard Wilson operator.The quark mass enters in the action via the hopping parameter κ.This target is referred to as "N f = 2 QCD" in Table I.
For this test, we compute the derivative of some simple observables (generically labelled as X) with respect to κ, approximated via finite differences: Depending on the observable, such derivatives can be useful, e.g., to extract sigma terms or to constrain chiral extrapolations.Here we specifically consider the average plaquette, the squared topological charge measured using the gradient flow at flow time t/a 2 = 2, and gradient flow scales t c .We train a flow to map configurations from κ = 0.1530 to κ = 0.1545 at β = 5.6 (Model C in Table I).Such parameters are close to those in Ref. [73].9k configurations are generated using standard HMC with pseudofermions.Note, however, the reweighting factor and KL divergence for each configuration are computed with Eq. (33); this is statistically consistent and introduces no approximations.At the evaluation volume of 8 4 , the flow achieves ESS = 48%, which should be compared with the ESS = 28% obtained using direct reweighting to the same target parameters.
The results are given in Figure 4, which compares the (normalized) values of several observables computed using the two methods, i.e., correlated flowed ensembles and ϵ reweighting (with ∆κ = 3•10 −4 for an ESS of 95%).At these statistics and for these choices of κ, independent ensembles result in statistical errors ≳ 2× larger than those attained with flows, and we do not display them.In all cases, flows provide a variance reduction and the central values are consistent within a standard deviation with those obtained with independent ensembles, which indicates that systematic errors in the finite-difference approximation of the derivative are not significant in this example.The error reduction varies between observables in the range ∼ 20% − 40%.In particular, the largest reduction is seen for the 1 × 1 plaquette loop, while the smallest is seen for the topological charge.Thus, depending on the observable of interest, one requires a factor of 1.5−2× fewer configurations to obtain a comparable statistical error when using flows.

IV. CONCLUSION
In this work, we present the application of machinelearned flows to the computation of observables involving derivatives.Specifically, we use flows to map ensembles between distributions defined by different parameters in the lattice action.By exploiting correlated cancellations of uncertainties between these ensembles, this application has the potential to provide a computational advantage in the evaluation of finite-difference approximations of derivatives.
To illustrate this idea, we showcase three numerical demonstrations in the context of lattice QCD: continuum limit extrapolations, matrix elements using the Feynman-Hellmann approach, and the mass dependence of observables.In all cases, flows provide a reduction of variance, which implies that fewer configurations are needed to achieve the same statistical error.The improvement factor for all demonstrations of this work, defined as the variance reduction in observables computed using flows with respect to ϵ reweighting, is summarized in Figure 5.These values are in the range of 1.5× for observables in QCD to more than 20× for quantities in the Feynman-Hellmann approach.With higher-quality flow models, these factors can be improved.This comparison does not account for the differing costs of the different steps in each method, namely generating the initial ensemble with heatbath, applying the flow (in the flowed case), and measuring correlation functions.Of course, the potential advantages of this approach depend sensitively on not only the model used, but on the particular application, the cost of evaluating observables, how autocorrelations are treated, and the precision goal.For a ballpark comparison, consider the results for the computation of matrix elements in the Feynman-Hellmann approach.In this application, the cost of applying the flow is comparable to the cost of measuring correlation functions, while the cost of a heatbath update is less by an order of magnitude.This amounts to a factor of ≲ 3 increase in computational cost to achieve a variance reduction by a factor of more than 20.This constitutes a real computational advantage of approximately one order of magnitude, neglecting the costs of training.Given expected further improvements through the continued development of flow architectures, these results are promising.
This work focuses on target actions that only depend on the gauge fields, e.g., pure gauge SU(3), quenched FIG. 3. (a) Pion mass in lattice units as a function of the coupling to the gluonic energy-momentum tensor λ.Marker shapes denote how the ensembles were obtained: orange circles for heatbath ensembles at fixed values of λ, blue squares for ensembles flowed from λ = 0, and red triangles when using configurations generated at λ = 0 and reweighted to λ = ϵ = 10 −4 .The pion mass is evaluated in quenched lattice QCD at β = 6.0, κ = 0.132, L = 8 and T = 16.(b) Bare gluon momentum fraction of the pion from Eq. ( 28) using a finite-difference approximation computed using the three different methods: independent heatbath ensembles, ϵ reweighting, and correlated flowed ensembles.QCD, and exact-determinant QCD.To generalize these results to state-of-the-art lattice QCD scales, where the fermion determinant cannot be explicitly evaluated, one must combine these flows with pseudofermion flows for QCD, as explored in Refs.[18,41,42].
As flow model technology for lattice QCD continues to advance, applications of correlated ensembles could be extended to compute other interesting quantities, such as sigma terms of hadrons or observables in QED+QCD.If the success seen in the proof-of-principle applications of this work can be achieved in such contexts, it holds the potential to drive substantial advances in the field.5. Summary of the variance reduction in observables computed from derivatives with respect to the action parameters when using flows compared with ϵ reweighting.The improvement factor is defined as the ratio of variances of the observables computed with ϵ reweighting over flows.The label "N f = 2 QCD" denotes derivatives of observables with respect to κ in two-flavor QCD, the label "Pure Gauge" corresponds to the result for the continuum limit extrapolation of gradient flow scales in the pure gauge theory, and the label "Feynman-Hellmann" indicates observables computed using the Feynman-Hellmann approach in quenched QCD.
ing Facility which is a DOE Office of Science User Facility supported under Contract DEAC02-06CH11357.The authors acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center [74] for providing HPC resources that have contributed to the research results reported within this paper.Numerical experiments and data analysis used PyTorch [75], JAX [76], Haiku [77], Horovod [78], NumPy [79], and SciPy [80].Figures were produced using matplotlib [81].
Appendix A: Details of models In this appendix, we provide some additional details of the models of this work and the scheme used to train them.It is important to stress that the hyperparameters and training schemes of these models have not been fine-tuned to be optimal, but they suffice for the present demonstration.It is therefore likely that the model quality can be increased with further training or simple modifications of the hyperparameters.
The layers considered in this work use a ratio of polynomials to construct g x in Eq. ( 13), where a i and b i are trainable parameters.
All models have n pt = 6, where n pt is the number of iterations of Eq. ( 14) in each layer.This choice has been found to be empirically better than lower values of n pt .In models A, B1, and B2 we alternate the masking pattern between mod 2 or mod 4, since empirically this results in slight improvements compared to just using the mod 2 masking at the same computational cost (a mod 4 stack is computationally equivalent to two mod 2 stacks).The model architectures are shown in Table II.
The models are optimized by minimizing the reverse KL divergence: [log q(U i ) + S(U i )] + const., (A2) where the sum runs over the B configurations in a batch and the (unknown) normalization constant need not be evaluated for optimization.Samples from the prior distribution are generated using heatbath/overrelaxation (pure gauge) or HMC (QCD).The training scheme consists of a constant learning rate for a fixed number of gradient steps.We use a constant batch size to train each model.Between gradient steps, each configuration in the batch is evolved independently using the corresponding update algorithm.These details are summarised in Table II.In all cases, we use path gradients, which are implemented by computing the gradients for optimization using the path derivative rather than the total derivative: This reduces the variance of the gradients without changing their expectation.See Ref. [51] for more details.These models have been trained for different wall times: 10 days using 6 nodes with 8 NVIDIA A100 GPUs each for model A, 2 days using 2 nodes for models B1 and B2, and 2 days using 4 nodes for model C. Note that no attempts have been made to optimize either the training

( 1 )
µν and α (2) µνρ are d − 1 and (d − 1) 2 trainable parameters in d spacetime dimensions for fixed µ, respectively.Any polynomial function of G x,µ with coefficients that are arbitrary function of Tr[G x,µ G †

FIG. 4 .
FIG.4.Illustration of the error reduction in derivatives of observables with respect to the action parameter κ.Wn×n is the average square Wilson loop of size n, Q 2 is the squared topological charge defined via the gradient flow, and tc labels gradient flow scales, as in Eq. (17).The y-axis shows the values of the observables and their statistical errors normalized to the value obtained with flows.Results that incorporate flows are shown as blue squares, while the errors with ϵ reweighting are denoted by red triangles.

TABLE I .
Summary of flow models used in this work.All flow models have been trained on a hypercubic lattice volume of size 44, while the evaluation lattice volume at which the flows are used (Eval.vol.) is given explicitly in the table.