Machine Learning Estimators for Lattice QCD Observables

A novel technique using machine learning (ML) to reduce the computational cost of evaluating lattice quantum chromodynamics (QCD) observables is presented. The ML is trained on a subset of background gauge field configurations, called the labeled set, to predict an observable $O$ from the values of correlated, but less compute-intensive, observables $\mathbf{X}$ calculated on the full sample. By using a second subset, also part of the labeled set, we estimate the bias in the result predicted by the trained ML algorithm. A reduction in the computational cost by about $7\%-38\%$ is demonstrated for two different lattice QCD calculations using the boosted decision tree regression ML algorithm: (1) prediction of the nucleon three-point correlation functions that yield isovector charges from the two-point correlation functions, and (2) prediction of the phase acquired by the neutron mass when a small Charge-Parity (CP) violating interaction, the quark chromoelectric dipole moment interaction, is added to QCD, again from the two-point correlation functions calculated without CP violation.


I. INTRODUCTION
Simulations of lattice QCD provide values of physical observables from correlation functions calculated as averages over gauge field configurations, which are generated using a Markov chain Monte Carlo method with the action as the Boltzmann weight [1,2]. Each measurement is computationally expensive, and a standard technique to reduce the cost is to replace the "high precision" average of an observable $O$ by a "low precision" (LP) version of it, $O^{\rm LP}$ [3,4], and then perform bias correction (BC), i.e., $\langle O \rangle = \langle O^{\rm LP} \rangle + \langle O - O^{\rm LP} \rangle$. The method works because the second term can be estimated with sufficient precision from a smaller number of measurements if the covariance between $O$ and $O^{\rm LP}$ is positive and comparable to the variance of $O$, which is the case if, for example, the fluctuations in both are controlled by effects common to both. One can replace $O^{\rm LP}$ in this formulation with any quantity; however, improved results are obtained when a quantity with statistical fluctuations similar to those of $O$ is chosen for $O^{\rm LP}$. Since most underlying gauge dynamics affect a plethora of observables in a similar way, such quantities surely exist; the trick, however, is to find suitable sets of them.
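Explicitly, the fluctuations of the correction term obey the standard identity

$\mathrm{Var}(O - O^{\rm LP}) = \mathrm{Var}(O) + \mathrm{Var}(O^{\rm LP}) - 2\,\mathrm{Cov}(O, O^{\rm LP}),$

so when the covariance is positive and comparable to the individual variances, $O - O^{\rm LP}$ fluctuates much less than $O$ itself, and its average converges with correspondingly fewer measurements.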
Machine learning (ML) algorithms build predictive models from data. In contrast to conventional curve-fitting techniques, ML does not use a "few parameter functional family" of forms for the prediction. Instead, it searches over the space of functions approximated using a general form with a large number of free parameters that require a correspondingly large amount of training data to avoid overfitting. ML has been successful in various applications where such data are available, including exotic particle searches [5] and Higgs → ττ analyses [6] at the Large Hadron Collider. It has recently been applied to lattice QCD studies [7-9]. Here we introduce a general ML method for estimating observables calculated using expensive Markov chain Monte Carlo simulations of lattice QCD that reduces the computational cost.
Consider $M$ samples of independent measurements of a set of observables $\mathbf{X}_i$, $i = 1, \ldots, M$, with the target observable $O_i$ available only on $N$ of these. These $N$ are called the labeled data, and the remaining $M - N$ are called the unlabeled data (UD). Our goal is to build a ML model $F$ that predicts the target observable $O_i \approx O^P_i \equiv F(\mathbf{X}_i)$ by training a ML algorithm on a subset $N_t < N$ of the labeled data. The bias-corrected estimate $\overline{O}$ of $\langle O \rangle$ is then obtained as

$\overline{O} = \frac{1}{M-N} \sum_{i \in \rm UD} O^P_i + \frac{1}{N_b} \sum_{j \in \rm BC} \left( O_j - O^P_j \right), \quad (1)$

where the second sum, over the $N_b \equiv N - N_t$ remaining labeled samples (the BC set), corrects for possible bias. Here $O^P_i$ depends explicitly on $\mathbf{X}_i$ and implicitly on $N_t$ and all training data $\{O_j, \mathbf{X}_j\}$. For a fixed ML model $F$, the sampling variance of $\overline{O}$ is then given by $\sigma^2_{\overline{O}} = \sigma^2_{O^P}/(M-N) + \sigma^2_{O-O^P}/N_b$; this, however, neglects the dependence of $F$ on the statistical fluctuations of the training data. In this work, we account for the full error, including the sampling variance of the training and the bias correction datasets, by using a bootstrap procedure [10] that independently selects $N$ labeled and $M - N$ unlabeled items for each bootstrap sample. Two additional remarks regarding bias correction are in order. First, while the bias correction removes the systematic shift in the prediction, it can increase the final error; i.e., the systematic error can get converted to a statistical error. In practice, for the two examples discussed below, the BC does not increase the error significantly. Second, there are two ways of bootstrapping the training and BC samples: (i) first partitioning the labeled data into training and BC sets and bootstrapping these, and (ii) bootstrapping over the full labeled set and then partitioning each bootstrap sample. We used the latter approach.
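The estimator and its bootstrap error can be sketched in a few lines of Python; the function names and the generic regressor interface (scikit-learn-style fit/predict) are our own illustrative choices, not a prescribed implementation. The sketch implements Eq. (1) and bootstrap option (ii) described above:

    import numpy as np

    def bias_corrected_estimate(model, X_lab, O_lab, X_unlab, n_train):
        """Eq. (1): train on the first n_train labeled samples, predict on
        the unlabeled set, and bias-correct with the remaining labeled ones."""
        model.fit(X_lab[:n_train], O_lab[:n_train])
        pred = model.predict(X_unlab).mean()                           # first term
        bc = (O_lab[n_train:] - model.predict(X_lab[n_train:])).mean()  # second term
        return pred + bc

    def bootstrap_error(model, X_lab, O_lab, X_unlab, n_train,
                        n_boot=500, seed=0):
        """Option (ii): resample the full labeled set and the unlabeled set
        independently, then partition each labeled bootstrap sample."""
        rng = np.random.default_rng(seed)
        n_lab, n_unlab = len(O_lab), len(X_unlab)
        est = []
        for _ in range(n_boot):
            lab = rng.integers(n_lab, size=n_lab)        # resampled labeled set
            ulab = rng.integers(n_unlab, size=n_unlab)   # resampled unlabeled set
            est.append(bias_corrected_estimate(
                model, X_lab[lab], O_lab[lab], X_unlab[ulab], n_train))
        return np.mean(est), np.std(est)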

II. EXPERIMENT A: NUCLEON ISOVECTOR CHARGES
For a first example, we demonstrate that this method reduces the computing cost for the isovector ($u - d$) combination of the axial (A), vector (V), scalar (S), and tensor (T) charges of the nucleon [11,12]. On the lattice, the nucleon charges are extracted from the ratio of the three-point [$C^{A,S,T,V}_{\rm 3pt}(\tau, t)$] to two-point [$C_{\rm 2pt}(\tau)$] correlation functions of the nucleon. In the three-point function, a quark bilinear operator $\bar{q}\Gamma q$ is inserted at Euclidean time $t$ between the nucleon source and sink. The desired ground-state result is obtained by removing the excited-state contamination [13,14] using calculations at multiple source-sink separations, $\tau$, and extrapolating the results to $\tau \to \infty$.
The results presented use correlation functions already calculated on the a09m310 ensemble generated by the MILC Collaboration [15,16] at lattice spacing $a \approx 0.089$ fm and pion mass $M_\pi \approx 313$ MeV [11,12]. The data consist of 144,832 measurements on 2263 gauge configurations. On each configuration, 64 measurements from randomly chosen and widely separated source positions were made. The quark propagators were calculated using the multigrid inverter [17,18], implemented in the CHROMA software suite [19], with a sloppy stopping criterion. The bias introduced by using a sloppy convergence condition is much smaller than the statistical uncertainty for nucleon observables [12,20] and is, therefore, neglected in this study. If necessary, however, it can easily be incorporated by modifying Eq. (1).
The correlation coefficients between the various $C_{\rm 3pt}$ measured at $t = \tau/2 = 5a$ and $C_{\rm 2pt}$ at various values of $\tau$ are shown in Fig. 1. The strongest correlation is with the value of $C_{\rm 2pt}$ near the sink of $C_{\rm 3pt}$ at $\tau = 10a$, and not near the $t = 5a$ of the operator insertion. Our intuitive understanding of why the correlation is strongest with $C_{\rm 2pt}(10a)$ is as follows: the spectral decompositions of the two correlation functions are similar except for the insertion of the operator at $t = 5a$ in $C_{\rm 3pt}(10a)$. If the ground state saturates these correlation functions, then the extra term in $C_{\rm 3pt}$ is the matrix element of this operator within the ground state of the proton. This matrix element can be considered as inserting a number (the charge) at $t = 5a$ in $C_{\rm 2pt}$. If the configuration-to-configuration fluctuations in the matrix element are small, then one expects a strong correlation between $C_{\rm 2pt}(10a)$ and $C_{\rm 3pt}(10a)$. In addition, there are strong correlations between successive time slices of $C_{\rm 2pt}$; thus, one expects the correlation of $C_{\rm 3pt}(10a)$ with $C_{\rm 2pt}$ to be spread over a few time slices about $t = 10a$, as also indicated by the data in Fig. 1. In the more realistic case, in which the nucleon wave function at $t = 5a$ has significant contributions from a tower of excited states, the operator can also cause transitions between these states, and its insertion can no longer be approximated by just one number. One can still expect that operators for which these transition matrix elements are small will have stronger correlations. Based on the observed pattern of excited states, discussed in Ref. [11], we expect the ordering of correlations V > T > A > S; the observed pattern is shown in Fig. 1.

It is the existence of such correlations that allows the prediction of $C_{\rm 3pt}$ from $C_{\rm 2pt}$ using a boosted decision tree (BDT) regression algorithm available in the SCIKIT-LEARN PYTHON ML library [21]. BDT is a ML algorithm that builds an ensemble (tower) of simple decision trees such that each successive decision tree corrects the prediction error of the previous one. The result is a powerful regression algorithm with a small number of tuning parameters and a low risk of overfitting. It is also fast: for the data sizes we are considering, it takes only a couple of minutes on a laptop to find an appropriate predictor and evaluate it on the unlabeled samples. The SCIKIT-LEARN implementation of the BDT used in this study is based on the Classification and Regression Trees (CART) algorithm [22] with gradient boosting [23,24]. For the prediction of $C_{\rm 3pt}$, we use 100 boosting stages of depth-3 trees with a learning rate of 0.1 and a subsampling fraction of 0.7. Note that, in this example, the pattern of correlation is such that a linear regression algorithm (such as LASSO [25,26] or RIDGE [27]) gives predictions with reasonable precision. Such a simplification does not occur for the second example described later.

[TABLE I. Average of $\langle C^\Gamma_{\rm 3pt}(10a, 5a)\rangle / \langle C_{\rm 2pt}(10a)\rangle$ on the unlabeled dataset. DM is the directly measured result; P1 is the BC prediction defined in the text, with the bias correction factor given in column 4. For the prediction without BC, we used the full 680 labeled configurations for training of the BDT. Note that for this large dataset, the bias correction and the increase in the error in the prediction with BC are negligible.]
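As a concrete illustration, a minimal SCIKIT-LEARN setup with the hyperparameters quoted above might look as follows; the file names and array shapes are assumptions for the sketch, and one regressor is trained per $(\tau, t)$ and per $\Gamma$ since GradientBoostingRegressor predicts a single output:

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    # Assumed files: rows are measurements, columns are C_2pt(tau) for
    # tau/a = 0, ..., 20, each normalized by <C_2pt(tau)> so inputs are O(1).
    X_train = np.load("c2pt_train.npy")   # shape (n_meas, 21)
    y_train = np.load("c3pt_train.npy")   # C_3pt at one (tau, t), one Gamma

    bdt = GradientBoostingRegressor(
        n_estimators=100,   # 100 boosting stages
        max_depth=3,        # depth-3 trees
        learning_rate=0.1,
        subsample=0.7,      # stochastic gradient boosting
    )
    bdt.fit(X_train, y_train)

    # Predict C_3pt from unseen C_2pt data (the unlabeled set)
    c3pt_pred = bdt.predict(np.load("c2pt_unlabeled.npy"))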
The outline of the calculation is as follows:

1. For each $(\tau, t)$, the BDT is trained using the set of $C_{\rm 2pt}$ data (input) and $C^{A,S,T,V}_{\rm 3pt}(\tau, t)$ (output). This trained BDT can now take as input the unseen $C_{\rm 2pt}$ data and output the predicted $C^{A,S,T,V}_{\rm 3pt}(\tau, t)$. To predict $C_{\rm 3pt}$ at a given $(\tau, t)$, one can use the data for $C_{\rm 2pt}$ on all time slices. The essence of a trained BDT is that it gives larger weight to the input $C_{\rm 2pt}$ elements with higher correlation with the target observable.
2. The trained BDT is first used on the subset of labeled data designated for bias correction to predict $C^{A,S,T,V}_{\rm 3pt}(\tau, t)$. The bias correction factor is then determined by comparing this prediction with the corresponding directly measured values on the same BC set.
3. The trained BDT is next used on the unlabeled $C_{\rm 2pt}$ dataset to give the predicted $C^{A,S,T,V}_{\rm 3pt}(\tau, t)$.

4. To the average of this predicted $C^{A,S,T,V}_{\rm 3pt}(\tau, t)$ set, the bias correction factor is added to give the BC prediction we call P1.
5. The statistical precision can be improved by constructing the weighted average of the BC prediction P1 and the directly measured (DM) results on the labeled dataset. We call this estimate P2. Note that the direct measurements on the labeled data and the predictions on the unlabeled data are not identically distributed because the prediction is not exact; the bias-corrected means, however, are the same. Therefore, when performing the excited-state fits discussed below, we simultaneously fit the two data sets with common fit parameters. A sketch of steps 2-5 is given after this list.
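A minimal sketch of steps 2-5 for one $(\tau, t)$, given a trained BDT, follows; the inverse-variance weighting used for P2 is our illustrative assumption, whereas the analysis in the text combines the two datasets in a simultaneous fit:

    import numpy as np

    def predict_p1_p2(bdt, c2pt_bc, c3pt_bc_dm, c2pt_unlab,
                      c3pt_lab_dm, var_p1, var_dm):
        """Steps 2-5 for one (tau, t) and one Gamma, given a trained BDT.
        var_p1 and var_dm are bootstrap variances, assumed precomputed."""
        # Step 2: bias correction factor from the BC subset of labeled data
        bc_factor = np.mean(c3pt_bc_dm) - np.mean(bdt.predict(c2pt_bc))
        # Steps 3 and 4: predict on unlabeled data and add the BC factor (P1)
        p1 = np.mean(bdt.predict(c2pt_unlab)) + bc_factor
        # Step 5: weighted average with the direct measurement (P2);
        # inverse-variance weights are an illustrative choice
        w1, w2 = 1.0 / var_p1, 1.0 / var_dm
        p2 = (w1 * p1 + w2 * np.mean(c3pt_lab_dm)) / (w1 + w2)
        return p1, p2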
The training and prediction steps treat data from each source position as independent, whereas the bias-corrected estimates for each bootstrap sample are obtained using configuration averages in Eq. (1). In this case, the errors are obtained using 500 bootstrap samples.
For the first example, we choose 680 of the 2263 configurations, separated by three configurations in trajectory order, as the labeled data. To determine the number of configurations to use for training, we varied the number between 30 and 180. We found that the variance of the prediction on the unlabeled dataset was smallest and roughly constant between 60 and 120. We, therefore, picked 60 configurations from the labeled set for training and 620 for bias correction. The 1583 unlabeled configurations were used for prediction. The BDT regression algorithm was trained to predict $C^{A,S,T,V}_{\rm 3pt}(\tau, t)/\mathcal{N}$ for all $\tau$ and $t$ with $\{C_{\rm 2pt}(\tau)/\mathcal{N}$ for $\tau/a = 0, 1, 2, \ldots, 20\}$ as input. The normalization $\mathcal{N} \equiv \langle C_{\rm 2pt}(\tau) \rangle$ was needed to make the inputs $O(1)$ for numerical stability of the BDT in the SCIKIT-LEARN library.
Data in Table I show that the statistical errors in the bias correction term are large; however, the error in the BC estimate is essentially identical to that in the DM estimates. This implies strong correlations between the two terms, the uncorrected prediction and the BC factor. Fig. 2 shows that the statistical fluctuations in the DM data are larger than the prediction error (${\rm PE} \equiv C^{\rm DM}_{\rm 3pt} - C^{\rm Pred}_{\rm 3pt}$) of the ML algorithm. The ratios of the standard deviations, $\sigma_{\rm PE}/\sigma_{\rm DM}$, of the PE and DM data are given in Table II. This pattern of smaller variance leads us to believe that, with further optimization, the reduction in computational cost given in Table IV can be increased significantly.
We have carried out two kinds of tests of the efficacy of the method. In Table III, we show data for $\langle C^\Gamma_{\rm 3pt}(10a, 5a)\rangle / \langle C_{\rm 2pt}(10a)\rangle$ for different numbers of labeled data, keeping (i) the full 2263 configurations and (ii) 500 configurations. We find that the results, denoted ${\rm Prediction}(N, N_t)$, are consistent for different numbers of labeled data in both cases. Even when only ten configurations (640 measurements) are used for the training dataset, one gets reasonable estimates. The errors scale roughly with the total number of configurations, as can be seen by comparing the upper and lower tables.
In Fig. 3, we compare the prediction P2 of $C^{A,S,T,V}_{\rm 3pt}$ at all $\tau$ and $t$ [column (c)] with the DM on the labeled and full data shown in columns (a) and (b), respectively. The observed dependence on $\tau$ and $t$ is due to contributions from excited states of the nucleon, and the desired ground-state result is given by the limit $\tau \to \infty$. This can be obtained by fitting the data at various $t$ and $\tau$ using the spectral decomposition of $C^{A,S,T,V}_{\rm 3pt}$. Fig. 3 shows such a fit assuming only the lowest two states contribute to the spectral decomposition, i.e., the two-state fit described in Refs. [11,12,28]. The lines show the result of this fit for the various $\tau$, and the gray band gives the $\tau \to \infty$ value. We find that the prediction P2 in column (c) is consistent with the DM results on the full dataset.
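For reference, the two-state truncation of the spectral decompositions used in these fits takes the standard form (written schematically; amplitudes $\mathcal{A}_i$ and energies $E_i$ refer to the lowest two nucleon states, with conventions as in Refs. [11,12,28]):

$C_{\rm 2pt}(\tau) = |\mathcal{A}_0|^2 e^{-E_0 \tau} + |\mathcal{A}_1|^2 e^{-E_1 \tau},$

$C^\Gamma_{\rm 3pt}(\tau, t) = \sum_{i,j=0}^{1} \mathcal{A}_i \mathcal{A}_j^* \,\langle j | \mathcal{O}_\Gamma | i \rangle\, e^{-E_i t - E_j(\tau - t)},$

so that the ratio $C^\Gamma_{\rm 3pt}(\tau, t)/C_{\rm 2pt}(\tau)$ approaches the ground-state matrix element $\langle 0 | \mathcal{O}_\Gamma | 0 \rangle$, and hence the charge $g_\Gamma$, as $\tau \to \infty$.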
We can further improve the prediction if data for a single value of $\tau$, say $C^{A,S,T,V}_{\rm 3pt}(\tau/a = 12)$, are available on the full dataset. Then, in the training stage, we use as input both $C_{\rm 2pt}$ and $C^{A,S,T,V}_{\rm 3pt}(\tau/a = 12)$. Having trained the BDT on the labeled data, we now use $C_{\rm 2pt}$ and $C^{A,S,T,V}_{\rm 3pt}(\tau/a = 12)$ as input to predict $C_{\rm 3pt}(\tau/a = 8, 10, 14)$, which we label VP2. These results are shown in Fig. 3, column (d). Including $C^{A,S,T,V}_{\rm 3pt}(\tau/a = 12)$ in the training and prediction stages increases the computational cost relative to P2 but reduces the errors. For a fixed size of error, VP2 is more efficient than P2, as shown in Table IV. A comparison of the predictions from $C_{\rm 2pt}$ (P2) and from $C_{\rm 2pt}$ and $C^{A,S,T,V}_{\rm 3pt}(\tau/a = 12)$ (VP2) vs DM is shown in Table IV for the charges $g_{A,S,T,V}$ obtained after the extrapolation $\tau \to \infty$ using the four values of $\tau$. While both estimates, P2 and VP2, are consistent with the DM, VP2 is closer to DM in both the central value and the error. Taking into account the increase in the statistical uncertainty (scaling the cost by the square of the number of measurements) in the predicted results, the ML analysis VP2 provides between a 7% and 26% reduction in the computational cost. The amount of gain is observable dependent.
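In code, VP2 changes only the feature matrix; the sketch below (file names assumed) appends the measured $C^{A,S,T,V}_{\rm 3pt}(\tau/a = 12)$ as an extra input column in both the training and prediction stages:

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    def vp2_features(c2pt, c3pt_tau12):
        """Append C_3pt(tau/a = 12), measured on all configurations, as an
        extra input column alongside the C_2pt time slices."""
        return np.hstack([c2pt, c3pt_tau12[:, None]])

    # Hypothetical files holding normalized per-measurement data
    bdt = GradientBoostingRegressor(n_estimators=100, max_depth=3,
                                    learning_rate=0.1, subsample=0.7)
    bdt.fit(vp2_features(np.load("c2pt_train.npy"),
                         np.load("c3pt12_train.npy")),
            np.load("c3pt10_train.npy"))          # e.g. target tau/a = 10
    pred = bdt.predict(vp2_features(np.load("c2pt_unlab.npy"),
                                    np.load("c3pt12_unlab.npy")))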

III. EXPERIMENT B: CP VIOLATING PHASE IN THE NEUTRON STATE
The second example is taken from the calculation of the matrix element of the chromoelectric dipole moment (cEDM) operator, $\mathcal{O}_{\rm cEDM} \equiv i\bar{q}(\sigma^{\mu\nu} G_{\mu\nu})\gamma_5 q$, where $G_{\mu\nu}$ is the gluon field strength tensor, within the neutron state. It arises in theories beyond the standard model and violates the parity P and time-reversal T symmetries, or equivalently, the charge C and CP symmetries in theories invariant under CPT. Since any CP violating (CPV) operator gives a contribution to the neutron electric dipole moment (nEDM), a bound on or a nonzero value for the nEDM in coming experiments will constrain novel CP violation [29-31]. So far, only preliminary lattice QCD calculations exist, and cost-effectively improving the statistical signal is essential [32-34]. We have proposed a Schwinger source method (SSM) approach [35,36] that exploits the fact that the cEDM operator is a quark bilinear. In the SSM, the effects of the cEDM interaction are incorporated into the two- and three-point functions by modifying the Dirac clover fermion action, $D_{\rm clov} \to D_{\rm clov} + i\varepsilon\,\sigma^{\mu\nu} G_{\mu\nu}\gamma_5$ and $D_{\rm clov} \to D_{\rm clov} + i\varepsilon_5\,\gamma_5$. The second modification is for the pseudoscalar operator $\mathcal{O}_{\gamma_5} \equiv i\bar{q}\gamma_5 q$ that mixes with the cEDM operator due to quantum effects [37]. With CP violation, the Dirac equation for the neutron spinor $u$ becomes $(i p_\mu \gamma^\mu + m e^{-i 2\alpha\gamma_5}) u = 0$; i.e., the neutron spinor acquires a CP-odd phase $\alpha$ ($\alpha_5$), which is expected to be linear in $\varepsilon$ ($\varepsilon_5$) for small $\varepsilon$ ($\varepsilon_5$). At leading order, these phases can be obtained from the four two-point functions $C_{\rm 2pt}$, $C^P_{\rm 2pt}$, $C^{P,\varepsilon}_{\rm 2pt}$, and $C^{P,\varepsilon_5}_{\rm 2pt}$, where the superscript $P$ indicates that an additional factor of $\gamma_5$ is included in the spin projection [34,38]. ($C^P_{\rm 2pt}$ has a zero mean but fluctuations correlated with those of $C^{P,\varepsilon}_{\rm 2pt}$ and $C^{P,\varepsilon_5}_{\rm 2pt}$; it can, therefore, be used for variance reduction [34].) The correlator $C^{P,\varepsilon}_{\rm 2pt}$ ($C^{P,\varepsilon_5}_{\rm 2pt}$) is constructed using quark propagators with the $\mathcal{O}_{\rm cEDM}$ ($\mathcal{O}_{\gamma_5}$) term and is expected to be imaginary and to vanish as $\varepsilon \to 0$ ($\varepsilon_5 \to 0$). In a first step, we show predictions of the BDT regression algorithm for these two correlators using only $C_{\rm 2pt}$ and $C^P_{\rm 2pt}$. For the training and prediction, we use the $C_{\rm 2pt}$, $C^P_{\rm 2pt}$, $C^{P,\varepsilon}_{\rm 2pt}$, and $C^{P,\varepsilon_5}_{\rm 2pt}$ measured in Refs. [35,36] on 400 MILC highly improved staggered quarks (HISQ) lattices at $a = 0.12$ fm and $M_\pi = 310$ MeV (the a12m310 ensemble) with clover fermions. On each configuration, these correlators are constructed using 64 randomly chosen, widely separated sources with a sloppy stopping condition, the effects of which are again ignored. Out of the 400 configurations, 120 configurations, separated by three configurations in trajectory order, are chosen as the labeled data, and the remaining 280 configurations are used as the unlabeled data. From the labeled data, 70 randomly chosen configurations are used for training. Only 50 configurations sufficed for bias correction in this case because the ratio of standard deviations of the prediction error vs the DM ($\sigma_{\rm PE}/\sigma_{\rm DM}$) is small, as shown in Fig. 4. Errors are obtained using 200 bootstrap samples.

[FIG. 4. Distribution of ${\rm Im}\,C^{P,\varepsilon}_{\rm 2pt}(10a)$ (left) and ${\rm Im}\,C^{P,\varepsilon_5}_{\rm 2pt}(10a)$ (right), averaged over sources in each configuration, shown in light gold, with the prediction error in dark red. The ratio of the standard deviations $\sigma_{\rm PE}/\sigma_{\rm 2pt} \approx 0.18$ for $\mathcal{O}_{\rm cEDM}$ and $0.4$ for $\mathcal{O}_{\gamma_5}$.]
The BDT regression algorithm is trained to predict the imaginary parts of $C^{P,\varepsilon}_{\rm 2pt}$ and $C^{P,\varepsilon_5}_{\rm 2pt}$ using both the real and imaginary parts of $C_{\rm 2pt}$ and $C^P_{\rm 2pt}$. Note that in the absence of the CPV terms, $C^P_{\rm 2pt}$ and the imaginary part of $C_{\rm 2pt}$ average to zero, but they have nonzero correlations with the target imaginary parts of $C^{P,\varepsilon}_{\rm 2pt}$ and $C^{P,\varepsilon_5}_{\rm 2pt}$. The BDT regression algorithm with 500 boosting stages of depth-3 trees, a learning rate of 0.1, and a subsampling fraction of 0.7 gives a good prediction, as shown in Fig. 4. Because of nonlinear correlations, the BDT works better than linear regression algorithms in this case; the prediction error is about 50% larger with linear models at $t/a = 1$ and the excess decreases to less than 10% by $t/a = 8$. Again, for numerical stability, all data fed into the BDT algorithm are normalized by $\langle C_{\rm 2pt}(\tau) \rangle$.
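The corresponding setup differs from the earlier sketch only in the inputs, the two targets, and the number of boosting stages; the feature layout and file names are again assumptions:

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    # Assumed layout: real and imaginary parts of C_2pt and C^P_2pt on all
    # time slices, normalized by <C_2pt(tau)>, stacked as feature columns.
    X_train = np.hstack([np.load(f) for f in
                         ("re_c2pt.npy", "im_c2pt.npy",
                          "re_c2ptP.npy", "im_c2ptP.npy")])

    models = {}
    for target in ("im_c2ptP_eps.npy", "im_c2ptP_eps5.npy"):
        bdt = GradientBoostingRegressor(n_estimators=500,  # 500 stages here
                                        max_depth=3, learning_rate=0.1,
                                        subsample=0.7)
        bdt.fit(X_train, np.load(target))   # one regressor per target
        models[target] = bdt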
Using the predicted $C^{P,\varepsilon}_{\rm 2pt}$ and $C^{P,\varepsilon_5}_{\rm 2pt}$ on all time slices, we calculate the CPV phases $\alpha$ and $\alpha_5$ by taking their ratio with $C_{\rm 2pt}$, because $C^{\varepsilon,\varepsilon_5}_{\rm 2pt}$ differ from $C_{\rm 2pt}$ only at $O(\varepsilon^2)$. Fig. 5 shows the comparison between the CPV phases calculated from the DM on the full and labeled data and from the ML predicted data. The horizontal lines give the averages over the plateau region, where the excited-state contamination is small. Results for $\alpha$ and $\alpha_5$ are summarized in Table V. To get the improved ML predictions, we combine the prediction on the 280 unlabeled configurations with the DM data on the 120 labeled configurations. These combined data are analyzed following the same bootstrap resampling procedure used in the first example discussed earlier.
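A minimal sketch of this extraction, with an assumed plateau window, takes the ratio time slice by time slice and then averages:

    import numpy as np

    def cpv_phase(im_c2pt_cpv, c2pt, plateau=slice(6, 11)):
        """CPV phase from the ratio of Im C^{P,eps}_2pt (or the eps_5
        analogue) to C_2pt; the plateau window t/a = 6-10 is an assumed
        example, chosen where excited-state contamination is small."""
        alpha_t = im_c2pt_cpv / c2pt     # ratio on each time slice
        return alpha_t[plateau].mean()   # plateau average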
The prediction uses 30% of the data for $C^{P,\varepsilon}_{\rm 2pt}$ and $C^{P,\varepsilon_5}_{\rm 2pt}$ and 100% for $C^P_{\rm 2pt}$ and $C_{\rm 2pt}$. This reduces the total number of propagator calculations by 47% compared to the direct measurement: every configuration needs the unmodified propagators, but only 30% need the $\varepsilon$- and $\varepsilon_5$-modified ones, so the cost falls roughly from 3 to $1 + 0.3 \times 2 = 1.6$ propagator units. Taking into account the increase of the statistical uncertainty, the computational cost reduction is about 30%, as shown in Table V.

IV. CONCLUSION
In conclusion, the proposed ML method, used to predict compute-intensive observables from simpler measurements, gives a modest computational cost reduction of 7%-38% for the observables analyzed here, as summarized in Tables IV (VP2) and V (P2). The technique is, however, general, provided one can find inexpensive measurements that correlate well with the observable of interest; the computational cost reduction depends on the degree of correlation. We are investigating other ML methods to further improve the quality of the prediction and reduce the computational cost.

ACKNOWLEDGMENTS
We thank the MILC Collaboration for providing the 2+1+1-flavor highly improved staggered quark (HISQ) lattices.