Control variates for lattice field theory

In most lattice field theories, correlators are plagued by a signal-to-noise problem whose difficulty grows exponentially with the time separation. We propose a method for improving the signal-to-noise ratio, in which control variates are systematically constructed from lattice Schwinger-Dyson relations. The method is demonstrated on various two-dimensional lattices in scalar field theory, and a strategy for scaling to larger systems is explored.

With the advent of supercomputing, numerical lattice field theory has become the method of choice for first-principles calculations beyond perturbation theory. In particular, lattice QCD calculations provide the most reliable estimates for quantities of interest in hadronic physics [1]. It is a systematically improvable scheme in which the only uncertainties stem from the extrapolations to the chiral-continuum-infinite-volume limit and from statistical noise. For situations where the measure of the path integral (determined by the Euclidean action of the theory) is real and positive, it has become routine to carry out high-statistics, large-volume simulations of lattice QCD with physical quark masses at a number of lattice spacings, leading to reliable extrapolations. This is especially true for pseudoscalar meson correlators, whose signal-to-noise ratio stays constant at long Euclidean time separation.
Extending these successes to precise calculations with other mesons and baryons is, however, meeting barriers. In particular, since the interpolating fields used in an interacting field theory couple to multiple asymptotic states, calculations from short separations in Euclidean time have irreducible systematic uncertainties due to contributions involving excited states. For most of these states, at the Euclidean distances needed to isolate the lowest state of interest, the signal-to-noise ratio degrades and the statistical noise becomes insurmountable with today's resources [2]. A canonical example of the signal-to-noise problem is in the baryon-baryon correlator of QCD. An argument of Parisi and Lepage [3,4] shows that the signal and noise both fall exponentially, but the exponential fall of the signal is faster than that of the noise, and therefore the signal-to-noise ratio falls exponentially as well. The situation for gluonic correlators is similar: the accuracy with which we can evaluate the expectation values of large Wilson loops or the static-quark potential is severely limited by noise. Since the most accurate determination of the lattice scale is from such gluonic observables [5], this is a serious problem for any precise lattice calculation (see [6] for an example). In this work we will examine the simpler situation of a scalar field theory in two spacetime dimensions, where the signal falls exponentially while the noise does not; the methods developed are nevertheless generic and expected to generalize to other theories.
Previous work has addressed this problem and developed techniques for variance reduction. Of particular interest is a class of techniques in which the observable of interest is approximated by an observable that is less expensive to evaluate but whose fluctuations are highly correlated [7,8]. The statistical fluctuations in both the approximant and the difference can then be evaluated with smaller effort, leading to an improvement in the signal-to-noise performance. A different class of methods starts from the observation that the signal-to-noise problem is closely connected to the sign problem, and the same methods have often been applied to both. In particular, the method of contour deformations used in [9,10] has its history in approaches to the sign problem; see [11] for a review. It relies on the fact that the path integrals defining the expectation values of the observables are analytic functions, and so a continuation of the domain of the path integral into complex field space can leave the value of the integral unchanged. The statistical variance, however, is not analytic, and it can be reduced.
In this work, we take an alternative course to variance reduction that merges the strengths of the two methods. It is similar to the correlated-approximant method in that we find a correction to the integrand whose fluctuations are correlated with the observable. But, like the contour deformation method, the subtraction is of a term that can be guaranteed to integrate to zero. This method has previously been explored for the fermion sign problem, via perturbative expansion [12] and machine learning [13].
Thus, the core of the approach is the construction of a control variate: a function of the fields which is correlated with the observable of interest, but can be proven to have vanishing expectation value. Naming the control variate f, we define an improved observable Õ ≡ O − f. First note that, since ⟨f⟩ = 0, this new observable obeys ⟨Õ⟩ = ⟨O⟩. The variance of Õ, however, has changed:

    Var Õ = Var O − 2 Cov(O, f) + Var f.    (1)

It is apparent that, when the correlation ⟨Of⟩ is large and positive, the variance of the improved observable is substantially reduced.
It remains to minimize the variance of Õ over a set of functions f that can be guaranteed to integrate to 0. The main approach we put forward is to obtain a large basis {F_i} of functions such that ⟨F_i⟩ = 0, and then optimize the coefficients c_i that define the control variate f ≡ Σ_i c_i F_i. This optimization can be performed efficiently as follows. Measure the covariance matrix of the F_i, and the cross-correlations between the F_i and the target observable O, defining

    M_ij = ⟨F_i F_j⟩  and  v_i = ⟨F_i O⟩.    (2)

The optimal control variate is defined to be the linear combination of the F_i that minimizes the variance of Eq. (1).
Assuming that M and v have been measured accurately, this minimum is readily shown to be given by the coefficients

    c = M⁻¹ v.    (3)

To obtain a basis {F_i}, we note that for any function g of the fields, the integral of ∂g (where "∂" denotes any derivative with respect to the fields) reduces to boundary terms, and is therefore 0 under mild assumptions about the asymptotic behavior of g.
Having obtained these coefficients, and therefore the definition of the control variate f, we may compute the expectation value of Õ (and therefore ⟨O⟩) by standard Monte Carlo methods. It is important, to avoid introducing a bias, that this final computation of the expectation value be performed over a set of samples uncorrelated with those used to determine the coefficients c in the first place.
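As a concrete illustration of Eqs. (1)-(3) and of the unbiased train/evaluate split, the following sketch fits the coefficients on one half of a set of samples and measures the improved observable on the other half. The data here are synthetic Gaussian stand-ins and all variable names are our own; this is not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: K samples of B zero-mean basis functions F_i,
# and an observable O whose fluctuations are correlated with them.
K, B = 4000, 5
F = rng.normal(size=(K, B))
true_c = np.array([1.0, -0.5, 0.3, 0.0, 0.8])      # hypothetical couplings
O = 2.0 + F @ true_c + 0.1 * rng.normal(size=K)

# Unbiasedness requires fitting c on one half, measuring on the other.
F_tr, F_ev = F[: K // 2], F[K // 2 :]
O_tr, O_ev = O[: K // 2], O[K // 2 :]

# Eq.-(2)-style estimates: M_ij = Cov(F_i, F_j), v_i = Cov(F_i, O).
M = np.cov(F_tr, rowvar=False)
v = np.array([np.cov(F_tr[:, i], O_tr)[0, 1] for i in range(B)])

c = np.linalg.solve(M, v)        # c = M^{-1} v, as in Eq. (3)

O_tilde = O_ev - F_ev @ c        # improved observable: same mean, less noise
print(O_ev.var(), O_tilde.var())
```

On this toy data the variance drops by roughly the ratio of the noise term to the correlated fluctuations, while the mean is unchanged within errors.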
Let us see how this method works in scalar field theory in two spacetime dimensions. This model is defined by the action

    S[φ] = (1/2) Σ_⟨x,y⟩ (φ_x − φ_y)^2 + Σ_x [ (m^2/2) φ_x^2 + λ φ_x^4 ],    (4)

where m^2 (not necessarily non-negative) defines the bare mass and λ the bare coupling constant. The first summation is taken over pairs of neighboring lattice sites. Lattice units of a = 1 are assumed throughout, and we note that the action has a Z_2 symmetry φ → −φ.
The object of primary interest for us is the correlation function in the Z_2-vector channel, defined by

    C(τ) ≡ ⟨φ(τ, x) φ(0, 0)⟩.    (5)

As examined in [9], this correlator has a signal-to-noise problem that grows exponentially at large values of τ. We will reduce this signal-to-noise problem by identifying a set of observables that have vanishing expectation values, and constructing a linear combination that is maximally covariant with φ(τ, x)φ(0, 0). To obtain the basis vectors {F_i}, we note that for any polynomial P(φ) of the fields and any choice of site x, we have

    ⟨ ∂P/∂φ_x ⟩ = ⟨ P(φ) ∂S/∂φ_x ⟩.    (6)

As mentioned above, this identity follows straightforwardly from the fact that the integral of a total derivative reduces to a sum of boundary terms, all of which vanish as long as P(φ) is sub-exponential. This identity is the lattice version of the Schwinger-Dyson equations described in [14]. We will restrict ourselves to low-order polynomials. Due to the Z_2 symmetry of the action, Schwinger-Dyson relations derived from even-order polynomials are uninteresting, relating only Z_2-odd expectation values which are known to vanish by symmetry. First- and third-order monomials at site y yield the following relations for all pairs of sites x, y:

    ⟨ δ_xy ⟩ = ⟨ φ_y ∂S/∂φ_x ⟩,    (7a)
    ⟨ 3 φ_y^2 δ_xy ⟩ = ⟨ φ_y^3 ∂S/∂φ_x ⟩.    (7b)

Note that, in the case of λ = 0, the lattice Schwinger-Dyson relations of Eq. (7a) are closed. In particular, they entirely determine the correlation function C(τ) being studied. A minimization of the variance in this case would be tantamount to solving the theory exactly, reducing the statistical noise to zero. At any nonzero coupling, this is no longer true: even the full infinite set of Schwinger-Dyson relations is no longer sufficient to uniquely fix expectation values. This can be seen by noticing that Eq. (6) is satisfied on any complex integration contour, as long as the integral of e^{−S} along that contour converges and is non-zero. Multiple non-homologous integration contours are available in the presence of a φ^4 interaction, and therefore the solution to the Schwinger-Dyson relations is not unique.

Concretely, our procedure for constructing a control variate from the Schwinger-Dyson relations of Eqs. (7a) and (7b) is as follows. Suppose we have K samples on an L × L lattice, and the observable of interest is O. We divide the samples into a training set, used to determine the coefficients defining the control variate, and an evaluation set, used to compute expectation values; here we choose them to be of equal size K/2. From Eqs. (7a) and (7b), we define the following zero-mean functions, a linear combination of which will become the control variate:

    F^(1)_xy = δ_xy − φ_y ∂S/∂φ_x,    (8a)
    F^(3)_xy = 3 φ_y^2 δ_xy − φ_y^3 ∂S/∂φ_x.    (8b)

The first set of observables will be referred to throughout as the first-order candidates, and the second set as the third-order candidates. For convenience, we will denote all the candidate basis functions being used in the optimization as F_i for i = 1, ..., B. We now compute, over the training set, the covariance matrix M of the candidates and the correlation v of the candidates with the target observable, as defined by Eq. (2). With this done, Eq. (3) yields the optimal coefficients c_i defining the control variate, and the variance-reduced unbiased observable

    Õ = O − Σ_i c_i F_i.    (9)

Note that this full procedure must be repeated, yielding a separate control variate, for each observable of interest.
In particular, when computing a correlator ⟨φ(z)φ(0)⟩, a separate control variate must be optimized for each possible value of z.

FIG. 2. Scalar correlator on a 24 × 24 lattice, with bare parameters m^2 = 0 and λ = 2. The raw result, for which the correlator cannot be distinguished from zero at large time separations, is compared to correlators obtained with first-order control variates alone, and those using both first- and third-order control variates. A total of 10^4 samples were used, with half being reserved for optimization when a control variate is used.
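The total-derivative argument behind Eq. (6) can be checked numerically in a zero-dimensional (single-site) toy model. The sketch below is ours, not part of the paper's method: it takes S(φ) = m²φ²/2 + λφ⁴ (an illustrative normalization) and verifies by quadrature that ⟨∂P/∂φ⟩ = ⟨P ∂S/∂φ⟩ for the first- and third-order monomials.

```python
import numpy as np

# Single-site toy action S = m^2 phi^2 / 2 + lam phi^4 (illustrative choice).
m2, lam = 0.1, 2.0
phi = np.linspace(-10.0, 10.0, 200001)
w = np.exp(-(m2 * phi**2 / 2 + lam * phi**4))   # Boltzmann weight e^{-S}

def ev(f):
    # Expectation value by simple quadrature (the step size cancels in the ratio).
    return np.sum(f * w) / np.sum(w)

dS = m2 * phi + 4 * lam * phi**3                # dS/dphi

# P = phi (first order):   <1>        = <phi   dS/dphi>
res1 = ev(np.ones_like(phi)) - ev(phi * dS)
# P = phi^3 (third order): <3 phi^2>  = <phi^3 dS/dphi>
res3 = ev(3 * phi**2) - ev(phi**3 * dS)
print(res1, res3)   # both vanish up to quadrature error
```

The same cancellation of boundary terms is what makes each F^(1)_xy and F^(3)_xy a valid zero-mean candidate on the full lattice.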
In practice the above procedure has high memory consumption. To make it more feasible on large datasets, we make two simplifications. First, only translationally-invariant sums of the F_xy need to be used, reducing the size of the basis from ∼L^2 to ∼L. Second, before computing the correlations M and v, we group the samples into blocks, still maintaining a sufficiently large number of blocks: specifically, 8 times the size of the basis being used.
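The blocking step can be sketched as follows (the helper name is ours). For decorrelated samples, blocking rescales M and v by the same factor, so the solution c = M⁻¹v of Eq. (3) is essentially unaffected while the memory cost of the covariance estimate drops from K × B to n_blocks × B.

```python
import numpy as np

def block_means(samples, n_blocks):
    """Average consecutive samples (axis 0) into n_blocks block means.
    A sketch; drops the remainder if n_blocks does not divide the count."""
    K = samples.shape[0] - samples.shape[0] % n_blocks
    return samples[:K].reshape(n_blocks, -1, *samples.shape[1:]).mean(axis=1)

# 100 scalar samples -> 8 block means of 12 consecutive samples each.
x = np.arange(100.0)
print(block_means(x, 8))
```

The same function applies unchanged to an array of shape (K, B) holding the candidate basis functions, yielding (n_blocks, B) block means from which M and v are then estimated.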
Figure 1 shows the efficacy of this approach to variance reduction. On an 8 × 8 lattice with bare parameters m^2 = 0.1 and λ = 0.5, we obtain 10^5 decorrelated samples by MCMC techniques. The correlator determined from the first 10^3 samples is plotted with and without the variance reduction procedure, using only the first-order candidates F^(1). To demonstrate that the correlator is accurate, the results without variance reduction from the full set of 10^5 samples are also plotted. The variance-reduced correlator is in agreement with the high-statistics correlator, indicating the correctness of the calculation.
To illustrate the need for higher-order Schwinger-Dyson relations as we go to more strongly correlated systems, we show in Figure 2 the same procedure performed on a larger, 24 × 24 lattice, with bare parameters m^2 = 0 and λ = 2. Variance reduction is performed with F^(1) alone and with the combination of F^(1) and F^(3). At this strong coupling, the use of the additional third-order control variates makes a substantial difference at large time separations. This is not surprising. As mentioned above, the first-order Schwinger-Dyson relations of Eq. (7a) give perfect variance elimination at zero coupling, and continue to perform well at weak couplings. At larger couplings, the statistical fluctuations remaining after the use of these relations become larger, and the inclusion of higher-order relations (in this work, only Eq. (7b) was investigated) becomes more important. Figure 3 shows this in more detail: the estimates obtained as a function of the lattice coupling λ (on a 20 × 20 lattice with bare mass parameter m^2 = 0.1) clearly show the need for the third-order term as the coupling gets stronger. At the largest couplings, using the third-order term reduces the number of samples required to reach a given statistical error by a factor of ∼10^2.
When optimizing the control variate, it is important to have more samples available than degrees of freedom in the control variate. Otherwise, the optimization results in an over-fit to the training data, and the control variate does not generalize to the samples used to compute the expectation value. The control variates discussed so far have V or 2V degrees of freedom, and therefore at least that many samples must be used in the optimization procedure. On the other hand, there is no requirement that the training and evaluation sets be of equal size; with limited sampling, it may be advantageous to use a larger training set. Nevertheless, this presents a problem for the scaling of this method to larger systems, where it is common to have far fewer samples available than the volume of the lattice. Indeed, many lattice QCD calculations use O(10^3) configurations when the number of field degrees of freedom is ∼10^8 [15]. A full treatment of this problem is left for future work, but here we show one approach in the present context.
As a straightforward approach to avoid overfitting, we might try to design a family of control variates with relatively few parameters and optimize over that family. However, this assumes a priori knowledge of which basis functions are heavily correlated with the desired observable. In this work we take the less opinionated approach of requiring the vector of coefficients c to be sparse. Each of the ∼V coefficients begins unknown, but we demand that only a small number of them be sizable in the final result. To learn such a sparse control variate, we introduce an additional term into the objective function:

    E_µ(c) = c^T M c − 2 v^T c + µ Σ_i |c_i|^p.    (10)

The added term is an L_p regularization, and small positive values of p enforce high sparsity. The optimization is non-convex for p < 1, so we choose to work with the L_1 norm (p = 1). For well-chosen values of µ, the minimum of this objective will be near a minimum of the original objective function, while having only a small number of contributing entries. This sparsity acts to prevent overfitting. Heuristically, rather than requiring a number of samples greater than the total number of degrees of freedom, we need only have as many samples as the number of non-negligible degrees of freedom. Of course, this objective function no longer has a closed-form solution for the minimum. Instead, we train the coefficients using the Iterative Shrinkage/Thresholding Algorithm (ISTA) [16], which performs substantially better than naïve gradient descent for such non-smooth objectives. Sparsity is not a property of a vector alone, but rather of the components of the vector as written in a certain basis. Equivalently, the L_1 norm added to the objective function of Eq. (10) is not basis-independent. We choose to define the L_1 norm with respect to the momentum basis; that is, the Fourier transform of the position basis in which F^(1)_xy was defined above. This approach is motivated by the limit of weak coupling, as at λ = 0 only a single component need be non-zero to control the fluctuations.
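A minimal ISTA sketch for the L_1-regularized objective of Eq. (10) with p = 1 (the synthetic M, v, and all variable names here are ours): each iteration takes a gradient step on the smooth quadratic part c^T M c − 2 v^T c and then applies the soft-thresholding proximal operator of the L_1 term.

```python
import numpy as np

def soft_threshold(x, t):
    # Proximal operator of t * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(M, v, mu, n_iter=2000):
    """Minimize c^T M c - 2 v^T c + mu * ||c||_1 by iterative
    shrinkage/thresholding (a sketch; implementations may differ)."""
    step = 1.0 / (2 * np.linalg.norm(M, 2))   # 1/L, L = Lipschitz const of grad
    c = np.zeros_like(v)
    for _ in range(n_iter):
        c = soft_threshold(c - step * 2 * (M @ c - v), step * mu)
    return c

# Tiny illustration: a well-conditioned synthetic M with a sparse optimum.
rng = np.random.default_rng(1)
A = rng.normal(size=(8, 8))
M = A @ A.T + 8 * np.eye(8)
v = M @ np.array([1.0, 0, 0, 0, 0, 0, 0, 0.5])  # unregularized optimum is sparse
c = ista(M, v, mu=1e-3)
print(np.round(c, 3))   # close to (1, 0, ..., 0, 0.5), with small shrinkage
```

With small µ the recovered coefficients sit near the unregularized minimum M⁻¹v while the negligible entries are driven exactly to zero, which is the overfitting protection described above.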
Putting this together, we define a control variate

    f = Σ_{p,q} c_{p,q} Σ_{x,y} e^{ipx + iqy} F^(1)_xy,    (11)

where p, q are summed over all momentum modes on the lattice while x, y are summed over all L^2 lattice sites; as before, we have used translational invariance to reduce the implemented basis size. We optimize the coefficients c_{p,q} to minimize the objective function E_µ(c) defined in Eq. (10). Different values of µ must be tried to find a control variate that makes an acceptable tradeoff between sparsity (avoiding overfitting) and quality; there is no guarantee that such a tradeoff is available.
Figure 4 shows the result of this optimization on a 50 × 50 lattice at bare parameters m^2 = λ = 0.1. Without L_1 regularization, at least 2.5 × 10^4 samples would be required to avoid overfitting; here only 5 × 10^2 samples were used in the optimization of the control variates, with an L_1 regularization strength of µ = 10^−4, and another 5 × 10^2 were used to compute the expectation value of the improved observable. Each data point corresponds to a different observable and therefore a different set of coefficients, but the typical vector of coefficients had ∼100 elements of appreciable size, allowing optimization with no noticeable overfitting from only 500 samples.

Thus, we have developed a method in which straightforward optimization techniques can be used to choose combinations of Schwinger-Dyson equations to reduce the variance of observables. The strength of the method lies in the fact that existing ensembles of field configurations can be used, the improved observable is exactly equivalent to the original, and the path-integral measure is not distorted, alleviating any concerns about the overlap of distributions. Additionally, the use of Schwinger-Dyson equations allows physical intuition to guide the selection of the sparsity penalty that makes the method workable. The efficacy of the method for multipoint functions and for non-scalar fields is currently under investigation.
We are indebted to Rajan Gupta, Frederic Koehler, and Boram Yoon for helpful discussions over the course of this work.
S.L. is grateful for the hospitality of the Los Alamos National Laboratory, where this project began.

FIG. 1. Demonstration of variance reduction via Schwinger-Dyson relations on an 8 × 8 lattice at m^2 = 0.1 and λ = 0.5. The variance-reduced data are compared to raw correlators with 100 times as many samples as a check of the correctness of the procedure. Note that the same set of samples is re-used between different values of τ, and therefore the displayed one-sigma uncertainties are correlated.

FIG. 3. For a fixed time separation of τ = 10, the scalar correlator on a 20 × 20 lattice with m^2 = 0.1, as a function of the bare coupling. A total of 10^4 samples were used, with half being reserved for optimization when a control variate is used. Inset: the size of the error bars alone, for the same data.
S.L. is supported by the U.S. Department of Energy under Contract No. DE-SC0017905, and T.B. and J.-S.Y. under Contract No. DE-AC52-06NA25396. T.B. and J.-S.Y. were also supported by the LDRD program of the Los Alamos National Laboratory. Code used in the demonstrations in this paper is available at [17].