Leveraging neural control variates for enhanced precision in lattice field theory
arXiv:2312.08228v2 [hep-lat] 8 May 2024

Results obtained with stochastic methods have an inherent uncertainty due to the finite number of samples that can be collected in practice. In lattice QCD this problem is particularly salient for some observables, for instance those involving one or more baryons, and it is the main obstacle to computing nuclear forces from first principles. The method of control variates, used extensively in statistics, amounts to computing the expectation value of the difference between the observable of interest and another observable whose average is known to be zero but which is correlated with the observable of interest. Recently, control variate methods have emerged as a promising tool in the context of lattice field theories. In the present study, instead of relying on an educated guess for the control variate, we parametrize this function with a neural network. Using 1+1 dimensional scalar field theory as a testbed, we demonstrate that this neural-network approach yields substantial improvements. Notably, our findings indicate that the neural-network ansatz is particularly effective in the strong-coupling regime.


I. INTRODUCTION
Monte Carlo methods have achieved enormous success in the numerical study of non-perturbative field theory phenomena. Still, they face problems in some models and/or observables where the statistical noise overwhelms the signal. That is the case for theories with a sign problem (see, for instance, [1–3] for reviews), including all real-time (as opposed to imaginary-time) calculations, and for models/observables with infinite variance [4–6]. It is also the case for certain correlators in lattice QCD whose signal-to-noise ratio decreases exponentially with (imaginary) time, making the extraction of energy levels extremely challenging [7, 8]. The signal-to-noise ratio of n-baryon correlators, for instance, decays as ∼ e^{−(M − 3m_π/2)nt} (M and m_π are the baryon and pion masses). Given that excited states contaminate the correlator at small t, this decaying signal-to-noise ratio at larger t is a serious problem and is, in fact, the main obstacle one faces in computing nuclear forces from lattice QCD.
In this paper we study the use of control variates to reduce the variance of lattice observables. The basic idea is simple and well known [9, 10]. The expectation value ⟨O⟩ of an observable O can be computed as ⟨O − f⟩ = ⟨O⟩, where f is a function known to have vanishing expectation value, ⟨f⟩ = 0. While the expectation values of O and O − f are the same, their variances are not: if a control variate f can be found that is strongly correlated with O while maintaining ⟨f⟩ = 0, the variance of O − f is smaller than that of O, and O − f is therefore a better estimator of ⟨O⟩. This basic strategy takes many forms depending on how one goes about finding a suitable f. For instance, recently in [11] control variates for a scalar field theory calculation were found by expressing them as linear combinations of the optimal control variates of a free field theory. It has also been argued that control variates can be a possible way to remove the sign problem exactly [12, 13].
In this work, we find control variates for lattice field theory observables using machine learning techniques: a very general function f is parametrized by a feed-forward neural network in such a way that the condition ⟨f⟩ = 0 is automatically satisfied. Then, standard minimization techniques are used to minimize the variance. In the language of machine learning, the variance becomes the "cost function" whose minimization can be thought of as a form of unsupervised learning. The fact that neural networks are universal function approximators (in the sense of being able to approximate any function with a sufficiently large network) means we are searching for a control variate within a very large class of functions. A similar approach has been used for Monte Carlo integration in low-dimensional spaces [14–17].

II. METHOD
This section reviews the basics of control variates and explains how we parametrize them with neural networks.

A. Control variates
Let us denote by ⟨•⟩ the average with respect to the Boltzmann factor,

⟨O⟩ = (1/Z) ∫ Dϕ e^{−S[ϕ]} O[ϕ],   with Z = ∫ Dϕ e^{−S[ϕ]}.

If ⟨f⟩ = 0, O and Õ = O − f will have the same expectation value,

⟨Õ⟩ = ⟨O − f⟩ = ⟨O⟩.     (3)

However, their variances differ:

Var(Õ) = Var(O) + Var(f) − 2 Cov(O, f).

The function f is called a control variate. The goal is then to find a function f that is highly correlated with the observable O, so as to minimize the variance of Õ.
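As a toy illustration of this variance reduction (not the lattice construction used below; the Gaussian observable and the linear control variate here are purely illustrative), one can estimate ⟨e^x⟩ for x ∼ N(0, 1) using f = c·x, which has ⟨f⟩ = 0 exactly:

```python
import numpy as np

# Toy example: O = exp(x) with x ~ N(0,1); f = c*x has <f> = 0 exactly
# and is correlated with O, so O - f has the same mean but lower variance.
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
O = np.exp(x)

# Near-optimal coefficient for a linear control variate: c = Cov(O, x)/Var(x).
c = np.cov(O, x)[0, 1] / np.var(x)
O_cv = O - c * x

print(np.var(O), np.var(O_cv))  # the variance is substantially reduced
```

The two estimators agree on the mean up to sampling noise, but the subtracted one fluctuates far less, which is the entire point of the method.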
We write our control variate candidate f as

f[ϕ] = Σ_x ( ∂g_x[ϕ]/∂ϕ_x − g_x[ϕ] ∂S[ϕ]/∂ϕ_x ),     (5)

where x indexes the sites on the spacetime lattice and g[ϕ] : R^V → R^V (V is the spacetime volume) has yet to be defined. For any g[ϕ] = (g_1, …, g_V), integration by parts shows that

∫ Dϕ e^{−S[ϕ]} Σ_x ( ∂g_x[ϕ]/∂ϕ_x − g_x[ϕ] ∂S[ϕ]/∂ϕ_x ) = 0,

and therefore ⟨f⟩ = 0 by construction.
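The mechanism behind Eq. (5) can be checked numerically in a zero-dimensional sketch: one degree of freedom, an illustrative quartic action, and an arbitrary trial g (all choices here are assumptions made for the demonstration):

```python
import numpy as np

# Single-variable analogue of Eq. (5): f = g'(phi) - g(phi) S'(phi).
# Integration by parts gives <f> = 0 for any smooth g, since e^{-S}
# kills the boundary terms.
lam = 0.5
phi = np.linspace(-8.0, 8.0, 400_001)
dphi = phi[1] - phi[0]

S_prime = phi + lam * phi**3           # S = phi^2/2 + lam*phi^4/4
g = np.sin(phi) + 0.3 * phi**3         # arbitrary trial function
g_prime = np.cos(phi) + 0.9 * phi**2

w = np.exp(-(0.5 * phi**2 + lam * phi**4 / 4.0))
f = g_prime - g * S_prime              # the control-variate candidate

Z = w.sum() * dphi
mean_f = (w * f).sum() * dphi / Z
print(mean_f)                          # ~ 0, up to quadrature error
```

Replacing g with any other smooth, polynomially bounded function leaves ⟨f⟩ = 0, which is what makes the whole family usable as control variates.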

B. Machine learning
While Eq. (5) does not define the most general function f[ϕ], we aim at a universal representation of g[ϕ].
The only constraint we impose is space-time translational symmetry, where the translation operator T_y displaces a field configuration by y: T_y[ϕ_x] = ϕ_{x+y}. This can be achieved if g is covariant,

g[T_y ϕ]_x = g[ϕ]_{x−y}.

We impose this by defining a function g_0 : R^V → R from which we define g[ϕ]_x by

g[ϕ]_x = g_0[T_{−x} ϕ].     (9)

It can easily be shown that the resulting control variate f is translationally invariant. We define g_0[ϕ] by a fully connected feed-forward neural network with V inputs and a single output. For theories with parity symmetry ϕ → −ϕ, such as the model we consider in the next section, we choose the activation function σ(x) = arcsinh(x), which is odd. In addition, we remove the bias terms in the linear transformations, so that the network as a whole is an odd function. In this way we also reduce the number of parameters, which makes training faster.
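A minimal numerical sketch of this construction (plain NumPy rather than the Flax networks used in the paper; layer sizes and weights are arbitrary) checks both properties: the oddness of the bias-free arcsinh network and the translational covariance of Eq. (9):

```python
import numpy as np

rng = np.random.default_rng(1)
V = 8                                    # sites of a small 1-d toy lattice
W1 = 0.1 * rng.standard_normal((4, V))   # hidden layer, no bias term
W2 = 0.1 * rng.standard_normal((1, 4))   # output layer, no bias term

def g0(phi):
    """Bias-free network with the odd activation arcsinh: g0(-phi) = -g0(phi)."""
    return (W2 @ np.arcsinh(W1 @ phi)).item()

def g(phi):
    """g[phi]_x = g0[T_{-x} phi]: evaluate g0 on every translate of phi."""
    return np.array([g0(np.roll(phi, -x)) for x in range(len(phi))])

phi = rng.standard_normal(V)
print(g0(phi), g0(-phi))                 # equal magnitude, opposite sign
print(np.allclose(g(np.roll(phi, 2)), np.roll(g(phi), 2)))  # covariance holds
```

Because the symmetries hold exactly by construction, the network never has to spend training effort "learning" them from data.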
The ideal cost function for the training of g_0 is the variance of O − f. We estimate the variance of Õ from a sample {ϕ_a}, a = 1, …, N, of N field configurations:

L(w) = (1/N) Σ_a [ O(ϕ_a) − f_w(ϕ_a) ]²,

where f_w, given by Eq. (5) and Eq. (9), depends on the parameters w of the network defining g_0. Note that the term coming from ⟨O − f⟩² can be omitted: by Eq. (3) it equals ⟨O⟩², independent of the network parameters, and so does not affect the minimization.
Since, in practice, N will be small (typically of the order of hundreds in lattice QCD) while the neural network can easily contain a much larger number of parameters, there is a risk of overfitting. In that case, even though ⟨f⟩ = 0 by construction, the sample variance of Õ is minimized by having f(ϕ_a) ≈ O(ϕ_a) for every ϕ_a in the sample, leading to a sample average f̄ ≈ Ō. This is not a good approximation to the ideal control variate f. This problem was recognized in [14], where the authors suggest minimizing instead the quantity

L(w, µ_0) = (1/N) Σ_a [ O(ϕ_a) − f_w(ϕ_a) − µ_0 ]²,

where µ_0 is the sample average of the observable, µ_0 = Ō. In this case, overfitting leads to f_w(ϕ_a) ≈ O(ϕ_a) − µ_0 for every ϕ_a in the sample and f̄_w ≈ 0, a much better approximation to the ideal control variate. In fact, [14, 18] suggested using, instead of a fixed value µ_0, a variable µ that is updated during the training process just like the parameters w of f_w. We verified that in our examples this last procedure was significantly better, and all results presented below are obtained with it. Finally, to further reduce overfitting we applied L_2 regularization,

L(w, µ) = (1/N) Σ_a [ O(ϕ_a) − f_w(ϕ_a) − µ ]² + δ w²,

where w² is the sum of the squares of all neural network parameters and δ is the regularization parameter.
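Schematically, the regularized cost with the offset µ can be written as follows (a sketch only; in the actual method f_vals would be the network outputs f_w(ϕ_a) on the sample, and all names here are illustrative):

```python
import numpy as np

def cost(O_vals, f_vals, mu, w_flat, delta):
    """L(w, mu) = (1/N) sum_a [O(phi_a) - f_w(phi_a) - mu]^2 + delta * w^2."""
    residual = O_vals - f_vals - mu
    return np.mean(residual**2) + delta * np.sum(w_flat**2)

# For fixed w, the optimal mu is the sample mean of O - f_w, which is what
# a trainable mu converges to; a fixed mu = mean(O) is the simpler variant.
O_vals = np.array([1.0, 2.0, 3.0, 4.0])
f_vals = np.zeros(4)
w_flat = np.array([0.3, -0.2])
best_mu = np.mean(O_vals - f_vals)
print(cost(O_vals, f_vals, best_mu, w_flat, delta=0.0))  # sample variance, 1.25
```

With µ at its optimum the cost reduces to the sample variance of O − f_w, which is exactly the quantity the training is meant to shrink.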

III. RESULTS
In order to test our approach we use scalar field theory in 1+1 dimensions. Due to its simplicity, the same model has been used as a testbed for other approaches to the signal-to-noise problem and the sign problem [11, 19–21], so a direct comparison is possible. We consider an L_0 × L_1 lattice (so V = L_0 L_1) and work in units where the lattice spacing is 1. The action on the lattice can be written as

S = Σ_x [ (1/2) Σ_µ (ϕ_{x+µ̂} − ϕ_x)² + (m²/2) ϕ_x² + (λ/4!) ϕ_x⁴ ],

where µ = 0, 1 indexes the directions on the lattice and x = (x_0, x_1). The observable we choose to consider is the two-point correlator at momentum p = 0,

C(t) = ⟨ ϕ̃(t) ϕ̃(0) ⟩,   with ϕ̃(t) = (1/L_1) Σ_{x_1} ϕ_{(t, x_1)}.     (15)

Table I summarizes the bare parameters and neural network parameters used in this paper. We use two values of the coupling, λ = 0.5 and λ = 24.0, in order to explore both the weak- and strong-coupling regimes. The number of hidden layers is chosen from 1 to 5, and the number of neurons per hidden layer varies from 1 to 32. The training of the neural networks is performed with the usual stochastic gradient method and the ADAM optimizer [22], implemented with the help of the JAX [23] and Flax [24] libraries. The choice of learning rate η is critical to the success of the training: if the learning rate is too small, the training might get stuck at local minima, while a too-large learning rate leads to a process that misses minima of L(w, µ) altogether. After a lengthy trial-and-error process we found it efficient to use an exponentially decreasing learning rate at the beginning of the training and a constant one after that, with the crossover step n_0 in the range (6.8 × 10^5, 6.9 × 10^5), where n denotes the training step.
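A sketch of the lattice action and the zero-momentum correlator in NumPy (the λ/4! normalization of the quartic term and the correlator normalization are our assumptions about conventions, not statements of the paper):

```python
import numpy as np

def action(phi, m2, lam):
    """S = sum_x [ (1/2) sum_mu (phi_{x+mu} - phi_x)^2
                   + (m2/2) phi_x^2 + (lam/4!) phi_x^4 ]
    on a periodic L0 x L1 lattice (lam/4! convention assumed)."""
    kinetic = sum(0.5 * (np.roll(phi, -1, axis=mu) - phi) ** 2 for mu in (0, 1))
    return float(np.sum(kinetic + 0.5 * m2 * phi**2 + lam * phi**4 / 24.0))

def correlator(phi, t):
    """Zero-momentum two-point function between time slices 0 and t."""
    slices = phi.mean(axis=1)        # project each time slice onto p = 0
    return slices[t] * slices[0]

phi = np.ones((20, 20))
print(action(phi, m2=1.0, lam=0.5))  # kinetic term vanishes for constant phi
print(correlator(phi, t=10))
```

In a Monte Carlo run one would average `correlator` over configurations sampled with weight e^{−S}, and then subtract the trained control variate from it configuration by configuration.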
Since the gradient of L(w, µ) is estimated stochastically, we need to choose the number n of samples used per step. The stochastic error of each gradient-descent step scales as ∼ η/√n, while the computational cost of following the gradient-descent path per unit of "time" scales as ∼ n/η. Therefore the optimal choice is n = 1, the value we used in all our calculations. We have tried other choices, n ≠ 1, and confirmed that n = 1 works better empirically.
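The exponential-then-constant schedule described above can be sketched as follows (the precise functional form and the rate values are our assumptions; the text fixes only the qualitative shape and the range of n_0):

```python
def learning_rate(n, eta0=1e-3, eta_min=1e-5, n0=680_000):
    """Exponential decay from eta0 to eta_min over the first n0 steps,
    constant eta_min afterwards (illustrative parameter values)."""
    if n >= n0:
        return eta_min
    return eta0 * (eta_min / eta0) ** (n / n0)

print(learning_rate(0), learning_rate(680_000), learning_rate(10**7))
```

A schedule of this kind is available off the shelf, e.g. as an exponential-decay schedule in common optimizer libraries, so in practice one would not hand-roll it.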
In Fig. 1 we show the training history on a 20 × 20 lattice, at both small and large coupling and with networks with and without hidden layers. The observable we aim to improve is the mid-lattice correlator, that is, Eq. (15) with L_0 = L_1 = 20 and t = L_0/2 = 10. Fig. 1 shows the ratio of standard deviations (the improvement in the uncertainty) as a function of the training step for small and large couplings. We used 10^4 and 10^3 fully decorrelated field configurations, respectively, to train the network and 10^3 samples to estimate the variance. An epoch is defined as N = 10^4 or 10^3 gradient-descent steps, that is, one step with each training configuration. The results show that the variance can be greatly improved, but also that the outcome of the training depends on the size of the training set. These results can be compared to those in [11], where a different, more direct method was used to obtain control variates for the same theory. That method is exact for a free theory and performs better than ours at weak coupling. Neural control variates also work well at weak coupling, but they outperform the method of [11] at strong coupling. As Fig. 1 shows, a linear transformation (a network with no hidden layers) performs well at small coupling, since a control variate linear in the fields is sufficient for a free theory. At strong coupling, the variance is not reduced much by just a linear ansatz, shown as the dashed line in the figure. However, a better control variate is found by introducing more non-linear terms, i.e., by increasing the number of hidden layers of the network.
The same method, applied to larger 40 × 40 lattices, shows a smaller improvement than in the 20 × 20 case, both at weak and strong coupling (see Fig. 2). It is notable that networks of different sizes lead to similar results and, in some cases, larger networks do not lead to a larger improvement in the variance. This contradicts the theoretical expectation: as one increases the number of hidden layers or the number of neurons per hidden layer, the expressivity of the neural network increases, and the larger network should therefore perform at least as well as the smaller one. Our conclusion is that there is room for improvement in the training of large networks that we have not yet been able to realize. Future work will concentrate on that.
Reducing the uncertainty of one time slice of a propagator, even by a large factor, does not necessarily imply a comparable reduction of the uncertainties of parameters extracted from it, like the value of masses. To investigate this question we applied our method to anisotropic 40 × 10 lattices with couplings fairly close to the continuum limit. Ideally, one would use a different control variate at each time slice, but this procedure is too expensive. Instead, we found neural control variates optimizing the uncertainty at the t = L_0/2 = 20 or t = L_0/4 = 10 time slices. The correlators are shown in Fig. 3, and the results of fitting the correlator to a single-state form are summarized in Table II. Note that the correlation between different points of the correlator is taken into account in the fits. The regularization strength δ is chosen as 0.1. The reduction of the uncertainty due to the control variate is about a factor of ≈ 20 at the time slice t = L_0/2 = 20 (or t = L_0/4 = 10). Since the data at different time slices are correlated, reducing the variance of one time slice also reduces the variance of nearby time slices, albeit by a smaller amount. In the example we discussed, the uncertainty in the mass estimate is reduced by a factor of ≈ 3, corresponding to a configuration set ∼ 9 times larger.
While we improved the error by a factor of about 20 at one point, the error of the fit giving the mass estimate is improved by a smaller factor (∼ 3). A better mass estimate is obtained if a different control variate is used at each time slice. In order to reduce the computational cost of training, we use as the starting point of the training at one time slice the final result of a previous time slice ("transfer training"). In doing so, we obtain an improvement of the standard deviation of around a factor of 20 at every point, and the uncertainty of the mass estimate is improved by the same amount. We include in Table II the results of fitting a correlator obtained with much higher statistics, in order to verify that the reduced-variance result obtained with neural control variates and low statistics is consistent with it.
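The transfer-training procedure amounts to a simple warm-start loop over time slices; schematically (train_fn stands in for one slice's full optimization, and every name here is a hypothetical skeleton, not the paper's code):

```python
def train_all_slices(init_params, train_fn, time_slices):
    """Warm-start the training at each time slice from the parameters
    obtained at the previous slice ("transfer training")."""
    params, trained = init_params, {}
    for t in time_slices:
        params = train_fn(params, t)   # e.g. minimize L(w, mu) for slice t
        trained[t] = params
    return trained

# Dummy train_fn just to exercise the control flow:
demo = train_all_slices(0, lambda p, t: p + 1, [10, 11, 12])
print(demo)  # {10: 1, 11: 2, 12: 3}
```

Because neighboring slices have similar optimal control variates, each warm-started training should need far fewer steps than training from scratch.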

IV. DISCUSSION
We showed how control variates parametrized by neural networks reduce the variance of observables in lattice field theories. Using the simple example of 1+1 dimensional ϕ⁴ theory, we established the feasibility of the method and learned some qualitative lessons that can be summarized as follows: • Reductions of the variance by factors of tens or hundreds are feasible.
• It is essential to design the neural network so that it automatically incorporates the symmetries of the model, as those symmetries are difficult to "learn" from a sample of configurations of realistic size (hundreds to thousands of configurations). Although we have not yet explored them, other techniques for encoding these symmetries can plausibly work better.
• Techniques to avoid overfitting, like the ones we used, are essential for good results.
• More direct methods, like that of [11], are more efficient at small coupling, but neural networks tend to win out at larger coupling.
A natural question is how control variate results compare to more standard methods. After all, the variance of an observable can be reduced simply by collecting a larger number N of configurations. This process is slow, as the uncertainties scale as ∼ 1/√N. On the other hand, the training of the neural control variate is computationally expensive, even if it needs to be performed only once per observable. The "break-even" point where our method wins out over a brute-force increase in statistics depends on the cost of collecting a new independent configuration. For the scalar theory we studied, the computational cost of a configuration is very small, and for the precision we achieved it is clearly more efficient simply to increase the statistics. In other theories, especially those with dynamical fermions, the configurations are very expensive, and it makes sense to spend substantial computing power extracting as much information as possible from them. Thus, whether (neural) control variates are more efficient has to be studied for every model separately.
Among the directions for future work, the most pressing is the extension of neural control variates to gauge theories. There are conceptual issues to be solved. Given our observation that it is important to impose the symmetries of the model on the neural network, a method to impose gauge invariance seems essential. Also, it is unclear what the analogue of Eq. (5) would be. Other issues are of a more practical nature and may be more important. For instance, finding more efficient ways of training neural networks will be crucial to extend the method to realistic, 4-dimensional theories.

FIG. 1. Training histories at small and large coupling on a 20 × 20 lattice. The figure shows the improvement in the standard deviation from using control variates, relative to the raw result (σ_Raw/σ_CV). The dashed line represents the zero-hidden-layer result (a linear transformation), and the solid lines represent the results with hidden layers. The networks for the small and large couplings have 5 hidden layers with 4 neurons each. For the left and right panels, 10^4 and 10^3 samples, respectively, are reserved for training the network, and 10^3 samples are used to estimate the variance.

FIG. 2. Training histories of control variates at t = L_0/2 = 20 on a 40 × 40 lattice. The left plot shows the improvement in the standard deviation with hidden layers. The right panel displays the training histories for networks of different depths at the large coupling, λ = 24.0. 10^3 samples are used to train the neural control variates and 10^3 samples are employed to estimate the standard deviation.

FIG. 3. Correlation functions with m² = 0.01 and λ = 0.1 on a 40 × 10 lattice. The raw result and the result with control variates are shown. 2 × 10^3 samples are used in total; for the control variate result, 10^3 samples are used for training and the whole set is used for estimating the observables. For the raw result with large statistics, the correlators are calculated with 2 × 10^5 samples. The left plot shows the correlation functions with their fits; results are shifted horizontally for better visualization. The right plot displays the errors of the correlators in the left plot.

TABLE I. Bare parameters of the scalar field theory and neural network parameters used in Sec. III.