Machine learning action parameters in lattice quantum chromodynamics

Numerical lattice quantum chromodynamics studies of the strong interaction are important in many aspects of particle and nuclear physics. Such studies require significant computing resources to undertake. A number of proposed methods promise improved efficiency of lattice calculations, and access to regions of parameter space that are currently computationally intractable, via multi-scale action-matching approaches that necessitate parametric regression of generated lattice datasets. The applicability of machine learning to this regression task is investigated, with deep neural networks found to provide an efficient solution even in cases where approaches such as principal component analysis fail. The high information content and complex symmetries inherent in lattice QCD datasets require custom neural network layers to be introduced and present opportunities for further development.


I. INTRODUCTION
Lattice quantum chromodynamics (LQCD) [1] is a well established numerical method [2,3] used to study quantum chromodynamics (QCD), the theory of the strong interaction.A central part of the Standard Model (SM) of nuclear and particle physics, strong interactions bind quarks and gluons into protons and nuclei, and dictate the emergence of complex nuclear structure in nature.High-precision LQCD calculations are important in determining the parameters of the SM and guide searches for evidence of new physics beyond it [4].Recent LQCD calculations also provide new insights into the quark and gluon structure of protons [5] and the structure and interactions of light nuclei [6,7].Similarly, LQCD calculations have enabled investigations of QCD matter at extreme temperatures, and efforts to understand QCD matter at high density are underway [8].These calculations are extremely computationally demanding, consuming significant fractions of the computational resources that are available for scientific research worldwide.
LQCD calculations are performed on a discrete 4-dimensional space-time grid (typically a hypercubic lattice), and use Monte-Carlo importance sampling [9] to determine the dynamics of the quark and gluon fields defined on this space.Achieving physical results requires a series of calculations at different discretisation scales (referred to as the lattice spacing), and different lattice volumes, and a subsequent extrapolation to the continuum (where the discretisation vanishes) and infinite volume limits.Particularly challenging is the approach to the continuum limit; the computational cost of the Hybrid Monte-Carlo (HMC) algorithm [10] typically used scales with a high inverse power of the lattice spacing, a, approximately a −z with z > 6 for a fixed physical lattice volume [11].Known as critical slowing down, this occurs because of the quasi-local nature of the HMC updating procedure, requiring an increasing number of steps to update physics on a fixed physical volume as the lattice spacing decreases.A number of methods attempt to circumvent this issue by acting at multiple physical length scales.Examples include perfect actions [12][13][14][15] that aim to achieve almost-continuum physics at finite lattice spacings, and multi-scale thermalisation techniques [16][17][18][19][20][21].Such approaches require careful renormalisation group matching [22,23] of the LQCD actions defined at different scales such that they describe the same long-distance physics.An essential challenge is to solve the parametric regression task: Which action parameters best represent the coarse-scale physics of an ensemble of samples generated at a finer resolution, and vice-versa?Similar parameter regression problems of LQCD datasets arise in the context of mixed action LQCD simulations (see for example Ref. [24][25][26]).
In this work, machine learning (ML) techniques, in particular neural networks, are applied to the regression problem of determining LQCD action parameters from an ensemble of samples.Significant progress in ML over the last few years has led to new scientific applications of ML tools, including to a number of statistical and quantum mechanics problems.In one set of studies, ML has been used to infer the presence of phase transitions and thermodynamic properties in simple condensed matter models [27][28][29][30].In another study, variational methods have been optimized for many-body problems using ML techniques [31,32].Novel approaches to the Monte-Carlo method that is ubiquitous in numerical simulations of many systems have also been developed using ML ideas [33][34][35][36][37][38][39].Finally, ML regression for matching Hamiltonians in condensed matter contexts has recently been investigated [34,40] and shows promise.Very few studies, however, have applied ML techniques to investigate gauge field theories such as LQCD (LQCD is a particularly important example of a more general class of theories defined with a local invariance known as a gauge symmetry), and new techniques and adaptations are required because of the unique and complex symmetry structures of these theories 1 .Averaged over Monte-Carlo importance sampling, LQCD data is invariant under discrete spacetime translations and hypercubic group transformations, although individual samples do not have these symmetries.In addition, internal symmetries based on the continuous Lie group SU(3) associated with each spacetime location must be respected.Exploiting these symmetries is essential to the success of the approach used here; it is found that suitably customised deep neutral networks can provide an efficient and practical method of determining the action parameters describing the physics of a given set of configurations.
This article is arranged as follows.In Section II, the basic aspects of the lattice QCD calculations that are used to train and test parametric regression by neural networks are discussed, and a principal component analysis (PCA) is used to ascertain the difficulty of the regression tasks that are attempted.In Section III, a number of different neural network structures are studied.First, in Section III A, a fully connected neural network is used.This easily solves the parameter regression problem on training ensembles, but suffers from over-fitting due to the inverted hierarchy of the information content of each sample to the number of samples available for training.Despite its failure to generalise, this network finds features that persist in the LQCD data for Monte-Carlo times considerably longer than those seen for typical physics-motivated observables.The overfitting problem is remedied in Section III B, where several custom symmetry-enforcing layers are introduced to define neural network structures that efficiently solve the regression problem.The trained networks correctly resolve parameter differences even between ensembles which are essentially indistinguishable under the PCA analysis.Section IV provides a summary.Two appendices provide additional details of aspects of machine learning and of the lattice QCD calculations.

II. LATTICE QCD
Lattice QCD calculations are performed by approximating the QCD path integral by a Monte Carlo sum over gauge field configurations on a discrete four-dimensional space-time.The expecta-tion value of an operator O that defines some physical quantity is given by: where Z = DψD ψDA e −S[ψ, ψ,A] , the (anti-)fermion and gluon fields (gauge fields) are denoted by ψ( ψ) and A, and S[ψ, ψ, A] is the discretised QCD action (defined in Appendix B 1).In the second line, the fermion and anti-fermion fields are integrated out exactly, and the gauge fields are transformed to link fields U = e iA , to give an effective action S[U ] and operator Õ[U ] depending only on the gluon link fields.The resulting integral can be approximated as where the gauge field configurations U i (i indexes the configurations in a given "ensemble" of fields) are distributed according to the probability measure e − S [U ] .In practice, this is guaranteed by sampling the fields from a Markov chain Monte-Carlo stream for which this probability measure is a fixed point.These representative gauge fields are the input data for the ML approaches to parametric regression studied here.For additional details of the LQCD approach, see Refs.[2,3] and Appendix B 1.
Lattice QCD gauge fields are represented as links between sites on a 4-dimensional lattice of volume2 V = L3 × T , with the lattice sites separated by some physical distance a, typically 0.05-0.15fm.Each link, labelled by U µ (x), where x denotes the spacetime coordinates of the origin site and µ the direction of the link, is encoded by an SU(3) matrix (a 3 × 3 complex matrix M with M −1 = M † and det[M ] = 1) 3 .Links in opposing directions are related via U −µ (x) = U † µ (x − μ), and only links in the positive direction are stored.In this format, a gauge field used in typical modern lattice QCD calculations, where for example L = 64 and T = 128, is described by L 3 × T × 4 × 18 ≈ O(10 9 ) floating point or double precision numbers, where the factor of 4 arises from the number of positive spacetime directions (labelled by µ).In order to recover QCD results, calculations must be performed on a number of ensembles of field configurations with different lattice spacings a and lattice volumes V , and the continuum (a → 0) and large-volume (V → ∞) limits must be taken.
The governing equations of QCD and their lattice counterparts have a variety of symmetries, some that are highly non-trivial.The symmetries satisfied by ensembles of gauge fields are of particular interest in the context of the ML approaches studied here, as they place strong restrictions on numerical operations that can be performed on lattice data to extract physically meaningful results.In particular, lattice QCD is invariant under a local symmetry of the gauge fields known as a gauge symmetry; this is an invariance under local multiplications of link variables by SU(3) matrices referred to as a gauge transformation (note that the matrix Ω(x) differs at every spacetime point).This symmetry is not apparent from the numerical representation of a QCD configuration, but rather constrains physical observables calculated on a given gauge field to be invariant under all gauge transformations of that field.In addition, lattice QCD defined on a discretised finite volume is invariant under discrete translations and under 4-dimensional rotations and reflections (transformations generated by the hypercubic group, H 4 [42]).Unlike gauge symmetry, these latter symmetries do not hold on a configuration-by-configuration basis, but rather emerge after averaging physical quantities over all gauge fields in an ensemble.An additional important property of QCD is that a characteristic length scale, 1/Λ QCD ∼ 1 fm, emerges dynamically from the interactions of the theory, setting a spacetime distance over which values of the link fields are correlated.

A. Lattice QCD ensembles
A number of different ensembles of lattice QCD gauge field configurations were used for this first exploratory study.Each ensemble was generated using a two-colour N c = 2 Wilson gauge action with N f = 2 flavours of dynamical Wilson fermions (defined in Appendix B 1).This action depends on two bare couplings/parameters, β and m 0 .QCD with N c = 2 exhibits similar rich dynamical structure to the full theory with N c = 3 and is a natural testing ground for the new approaches developed here.Ensembles were generated with a standard HMC algorithm using a leapfrog integrator to take molecular dynamics trajectory steps of length τ M D = 0.5 in 15-40 substeps (tuned to keep the acceptance rate ∼ 70%).In each case, the streams were initialised from a hot start or from a thermalised lattice from a nearby set of couplings, and the initial 500 trajectories were not included in the further analysis.For most ensembles, configurations were saved every 10 trajectories to generate ensembles of O(10 3 ) independent configurations, with the separation determined from studies of the autocorrelation times of typical observables (for some ensembles, configurations were saved every trajectory to allow studies of autocorrelation times to be undertaken).Since N c = 2 in these calculations, rather than N c = 3 in full QCD, the lattice data structures used here are somewhat smaller than those used for state-of-the-art calculations, with each configuration represented by O(10 6 ) double precision numbers.All ensembles were generated using a modified version of the chroma lattice field theory library [43] that was previously [44] found to produce results consistent with an independent code [45].
Simple physical observables, including the pion and rho meson masses and scale setting observables w 0 and t 0 [46], have been calculated on Grids A and B; contour plots displaying the variation of these quantities across the ensembles are shown in Fig. 1.
In order to check the validity of the HMC streams, the evolution of simple quantities along the trajectories has been monitored.The simplest, and computationally cheapest, way to produce a gauge invariant quantity from links is to take the trace of products of links over closed loops ("Wilson loops").Wilson loops are defined from gauge links as shown schematically in Fig. 2, and detailed in Appendix B 1. Planar Wilson loops W k×l (x), with indices k and l denoting the dimensions of the loop (with orientation label suppressed), were computed for square loops up to 6 × 6, as well as rectangular loops of size 1 × n for n = 2, . . ., 12, and all possible planar orientations.The evolution of representative loop types for the ensembles in Grids A, B, and C, averaged over orientations and spacetime position, is shown in Appendix B 2. For each case, this evolution indicates that the data is well thermalised after approximately 500 trajectories.
To determine the number of HMC steps required for gauge field configurations to be independent, the autocorrelation times of the pion and rho two-point correlation functions, and of the same sets of Wilson loops introduced above, have been calculated.The autocorrelation function for a given operator O is defined as where τ is the trajectory difference in the autocorrelation.This function decays exponentially as ρ(τ ) ∼ exp[−τ /τ exp ] at large Monte-Carlo times τ .The decay constant τ exp defines an autocorrelation time.Calculations of the autocorrelation time using this definition can suffer from large uncertainties, especially when τ exp is small.Another definition of the autocorrelation time is [3,47] which approaches a constant as τ max → ∞.The autocorrelation functions and integrated autocorrelation times τ int for the Wilson loops, and those for the zero-momentum projected pion and rho two point correlation functions, C π(ρ) (defined in Appendix B 1), are shown in Fig. 3.In all cases, the integrated autocorrelation time is 10 trajectories, validating the choice to take trajectories spaced by this distance as an uncorrelated set to form an ensemble.Other observables may have different autocorrelation times, but the observables considered here are relatively representative4 .

B. Ensemble discrimination using principle component analysis
To guide the application of ML methods to parametric regression of gauge fields in the space defined by the sample ensembles, the differentiability of the ensembles was assessed using a principle component analysis (PCA) [48][49][50].Since Wilson loops are the simplest gauge-invariant objects, the basis for the PCA was generated by calculating a set of square planar loops of sizes up to L/2 × L/2, as well as 1 × n for n up to L, averaged over all possible planar orientations and space-time locations.Averaged loops are denoted W j×l = O(j×l) x W j×l (x), where the sum over O(j × l) is over all hypercubic transformations of the indicated loop.The averaged loop data are sufficiently small in dimension that it is possible to display them for a representative set of ensembles.Fig. 4 shows contour plots of ln |W n×m | from evaluations on each ensemble in the two L/a = 12 grids (Grids A and B).Figs.20,22,and 24 (in Appendix B 2) show histograms for a subset of the loops for each ensemble in each of Grid A, B, and C, respectively.Clearly, some of the loops are statistically well determined, and subsets of the ensembles can be clearly distinguished.Ensembles in Grid C have loop distributions that are more sharply defined than those in Grids A and B as their larger spacetime volume enables more statistical averaging.For large loop sizes, all ensembles become hard to distinguish.FIG.3: Autocorrelation functions ρ(τ )/ρ(0) (left, defined in Eq. ( 5)) and autocorrelation times τ int (right, defined in Eq. ( 6)) of the pion (top) and ρ (centre) two-point correlation functions at different Euclidean time separations, and of the various space-time averaged n × m planar Wilson loops (bottom).Measurements are performed on a subset of ensemble F1, for N traj = 4000 sequential trajectories (N traj = 7980 for the loops).The colours identify the type of loop and the shaded bands correspond to the uncertainties on these quantities as determined from a bootstrap procedure using N boot = 100 bootstrap resamplings of size N traj .
To perform the PCA on the loop data, a correlation matrix between the various loop observables can be constructed, either for a given ensemble, or, as is done here, across a collection of ensembles.The correlation matrix elements are where i ∈ {1 × 1, 2 × 2, . ..}, and e and c label the ensemble and the configuration in that ensemble, respectively.The summation over ensembles is for all ensembles in a given grid, and X and σ(X) denote the mean and standard deviation of the given quantity over the particular ensemble of configurations.The eigenvalues, e i , and eigenvectors, v i , of this correlation matrix for Grid A are shown in Fig.   the information encoded in a collection of the simplest gauge-invariant objects is sufficient to distinguish all but a few of the ensembles in Grid A.
The Jensen-Shannon divergence [51,52] provides a measure of the overlap of probability distributions and can be used to quantify the distinguishability of such distributions.Given two probability distributions P and Q, defined over a space X, the Jensen-Shannon divergence is given by where M = 1 2 (P + Q), and D KL (P ||Q) is the Kullback-Leibler divergence [53], defined as The Jensen-Shannon divergence is bounded by 0 ≤ D JS (P ||Q) ≤ 1, with D JS = 0 if and only if Q = P almost everywhere, and larger values denoting lower overlap between distributions.The square root of the Jensen-Shannon divergence provides a well-defined metric [54,55].As a test of differentiability, the Jensen-Shannon divergences were calculated between all pairs of three-dimensional probability distributions defined by the three dominant eigenvectors of the loop correlation matrix for each ensemble in Grid A5 .To do this, each distribution was first interpolated over the samples from the given ensemble using smooth kernel distributions.The resulting values of D JS are shown pictorially in Fig. 7 for all pairs of the 19 ensembles in Grid A.
-0.9 -0.9 -0.8 -0.8 -0.7 -0.7 -0.9 -0.9 -0.8 -0.8 -0.7 -0.7 Clearly, the dominant eigenvectors in loop space allow excellent differentiation between most pairs of ensembles, with approximately 8 out of 171 independent pairs that are only weakly, or not at all, differentiable.A more challenging test of distribution differentiability is provided by the ensembles in Sets D and E, each designed to have maximal overlap of Wilson loops on each of the ensembles in the set, but different parameters in the {β, m 0 } plane.Fig. 8 shows histograms of the combinations of Wilson loops corresponding to the dominant eigenvectors of the loop correlation matrix for ensemble Sets D and E, while Fig. 9 displays the Jensen-Shannon divergence between pairs of ensembles in these sets.As the ensembles in each of Sets D and E are very poorly distinguishable in the space of Wilson loops, accurate differentiation between them presents a key challenge to parametric regression via ML.

III. NEURAL NETWORKS FOR PARAMETRIC REGRESSION OF LATTICE QCD GAUGE FIELDS
Machine learning techniques, and in particular neural networks, offer a promising solution to parameter regression problems.The main focus of this work is to address such a problem in the context of LQCD: given an ensemble of lattice gauge fields, determine the parameters of a given action that are most likely to have generated it.As discussed in the introduction, this challenge arises, for example, in attempts to ameliorate critical slowing down by the matching of coarse and fine lattice actions, and in the context of perfect actions.Its solution will allow for more efficient LQCD calculations, enabling studies in regions of parameter space which are currently computationally unreachable.
To determine the action parameters of a given ensemble (for a particular choice of lattice action), one possible approach is to calculate a sufficiently large set of physics observables both on that ensemble and on a set of ensembles for which the parameters are known, and perform an interpolation and matching task using the calculated observables.The alternative considered here is to train a neural network to perform the regression directly.In principle, this approach is far more general than one based on a set of physics quantities, as the network can use all of the information encoded in a gauge field configuration.On the other hand, this is also challenging.As discussed in Section II A, a single gauge field configuration is represented by O(10 9 ) real numbers in modern lattice QCD calculations.In comparison, a typical ensemble used for such calculations consists of O(10 3 ) configurations.This hierarchy implies that the stochastic learning of features of the relevant degrees of freedom of the gauge field configurations-in particular that extracted physics results must be invariant under spacetime translations, reflections, and hypercubic rotations as well as under gauge transformations-is challenging.
This challenge is approached in two ways, described in the following two sections.First, a multi-layer perceptron (a fully connected feed-forward neural network) is trained to learn the action parameters corresponding to lattice gauge field configurations.As anticipated, using gauge fields as input with no symmetry constraints leads to over-fitting of the spacetime and gauge features of the data which are not related to the physics encoded by a given ensemble.Nevertheless, this exploration reveals a number of interesting features of the problem at hand.Second, a practical solution to the parametric regression problem is provided in the form of a network with a structure that imposes the spacetime and gauge symmetries of LQCD (or, equivalently, involves preprocessing gauge field data into a format that respects these symmetries).

A. Fully-connected network
The simplest approach to the parametric regression of lattice QCD gauge fields using neural networks is to use a multi-layer perceptron [56][57][58][59], i.e., a fully-connected feed-forward network structure (a glossary of neural network terminology is provided in Appendix A).For each of the ensembles of gauge-field configurations in Grid B, 850 configurations were randomly selected as training data, while 100 were held out as validation data [60,61].Each gauge field configuration, consisting of O(10 6 ) real numbers, was treated as an individual input.As physical quantities are only defined on ensemble average, regression on these inputs can not be exact; a given gauge configuration could, with various probabilities, have been generated from an action differing in both form and parameters from the one that it was in fact generated with, so a perfectly functioning network will necessarily have some spread in predictions from a given ensemble.Quantifying this maximum resolution is possible in principle, but computationally prohibitive, and for this reason has not been undertaken.Investigations into new ensemble-based training approaches that would sharpen the maximum regressor predictability are ongoing.A simple fully-connected neural network structure, represented graphically in Fig. 10, was trained on the regression task 6 .The network was initialised by setting the biases to zero and the weights to a truncated normal distribution centred at zero with a width of 0.02.A tanh activation function was applied to the nodes in each layer, as well as an L2 regulariser with weight decay set to 0.001.Dropout [62][63][64] was also applied to each layer.While many variations of the network structure were investigated, a systematic hyperparameter tuning was not undertaken due to computational limitations.In general, it was found that fewer hidden units and layers than in the illustrated network led to less optimal minima of the loss function, while a greater number did not appreciably change the outcome.Dropouts in the range 0.3-0.6 were required to eliminate over-fitting.A range of regularisation prescriptions and hyperparameters, as well as a range of activations including tanh, reLU [65][66][67], and sigmoid were studied.The Adam optimiser [68] reached the minimum loss with less training than stochastic gradient descent (SGD), and a loss function based on least absolute deviations (L1) rather than least square errors (L2), performed better.
The predictions of the best-performing network for the held-out validation data are shown in Fig. 11.While these results appear to signal the success of this approach, the generalisation ability of the network, i.e., its ability to interpolate in parameter space, is poor.In particular: • New ensembles, even those in the 10 ensembles of Set F, generated from separate HMC streams but with the same {β,m 0 } as one of the training ensembles, were predicted to sit at the average β and m 0 values of all ensembles included in training.This indicates that the network did not succeed in learning the gauge-invariance properties of lattice QCD gauge fields, nor in parametrising the parameter space of the grid of ensembles; • Configurations from the continuation of the HMC streams used to generate the training and validation configurations were also predicted to have different parameters.Specifically, the next configurations in the HMC streams were predicted to have the correct m 0 and β values, but these predictions drifted towards the average over all training ensembles within a few steps.This indicates that the network is identifying some quantity with a longer autocorrelation time than the physics quantities studied in Sec.II A, i.e., that the configurations separated in MC time such that they are independent by the measure of various physics observables, are not independent by the alternative measure found by the network.
The majority of these features are unsurprising; information content suggests that with O( 103 ) samples containing O(10 6 ) real numbers each, it is not feasible to stochastically learn symmetries such as the gauge invariance of the data, and that generalisation will be challenging.This could be remedied by using far larger ensembles of gauge field configurations for training, if that were computationally feasible.
The ability of the network to distinguish different streams generated at the same values of β and m 0 is interesting.In the limit of infinite stream lengths, no calculated quantity, corresponding to a physical observable or otherwise, can achieve this distinction.Such distinguishability indicates that the streams are not completely sampling the gauge field configuration space and is tied to the existence of a feature, identified by the network, that has a longer autocorrelation than those of the physics observables studied in Sec.II A. An autocorrelation time of the neural network feature was obtained from the output of classification networks trained on each of the pairs of streams in Set F, generated at the same set of action parameters.Rather than training this network to identify the {β, m 0 } of a given gauge field as for the regression network described previously, the classifier was trained to produce a classification: {1, 0} for configurations from one stream, and {0, 1} for those from a second.The network structure used was identical to that shown in Fig. 10, with a softmax [69] activation function used for the final layer to provide a normalised probability interpretation for the output: an output {a, 1 − a} for a given configuration indicates that that sample can be with the first stream with a probability a.A categorical crossentropy [70,71] loss function was used for this training.For each pair of streams, 600 trajectories from each stream were used to train an instance of the network.The output of that instance for the trajectories sequentially following the training data defines an autocorrelation function: where P {α,β} (c) denotes the probability for configuration c to be in stream α or β (i.e., the first and second element of the network output for that configuration).The configuration is labelled c {α,β} (τ ) to denote a trajectory from stream α or β, τ steps in Monte-Carlo time after the end of the sequence used as training data.The autocorrelation function, and an autocorrelation time determined from this function by Eq. ( 6), are shown in Fig. 12. Comparing to Fig. 3, it is clear that the autocorrelation time of the feature used by the network to distinguish streams is approximately three times longer than the longest autocorrelation time of the physics observables that were calculated in Section II.
It is natural to speculate that the strong autocorrelation observed in the neural network output is based on some local features of the data, rather than features encoding the physics of interest7 .Further investigation did not find evidence for this interpretation; neither Moran's I [72] nor Geary's C [73] tests supported the existence of correlated spatial regions in the derivatives of the loss  10)) and autocorrelation time (right, defined in Eq. ( 6)) of the feature distinguishing two streams at the same set of parameters, trained on sequences of gauge field configurations.The autocorrelation function was generated by averaging over many different results (trained using all different pairs of the 10 streams, F 1,...,10 , at the same parameters), and was found to be robust under changes of the network structure used to generate it.The dashed horizontal line on the right figure shows the maximum autocorrelation time of various physics observables (see Fig. 3).function with respect to inputs.There is also no correlation of these derivatives with known spatially-varying physical quantities such as topological charge density and action density.While the long-correlation-time feature could not be identified in this study, it provides an interesting topic for further study.In particular, it will be informative to investigate how this scale changes with parameter range, particularly in regions of parameter space where topological charge freezing becomes a difficult problem for simulations.

B. Custom symmetry enforcing network structure
As described in the previous section, experiments with simple fully-connected neural networks were not successful at parametric regression of lattice QCD gauge fields for the training data sets used in this study.This is not unexpected; learning the symmetries of gauge field configurations stochastically is certain to be a challenging task.Symmetries of lattice QCD, however, act to reduce the effective degrees of freedom of the problem, and can be incorporated into the structure and training of neural networks in several ways.First, the stochastic learning of symmetries can be accelerated through data augmentation (i.e., randomly performing a gauge transformation and/or translation/lattice rotation on a configuration).This is analogous to typical uses of data augmentation [74] in, for example, image recognition [75,76], to introduce symmetries such as rotational symmetry 8 .In practice, this was found to be untenable for the case studied here as a result of the large number of symmetries that must be learned, their complex nature, and the requirement that they be strictly observed.Secondly, custom network layers can be designed (or equivalently, data can be pre-processed) to only allow gauge invariant and lattice-symmetry invariant outputs of the network.This approach is found to be successful.
To incorporate the symmetries of lattice QCD gauge fields into neural network structures, several custom networks were designed, featuring an initial pre-processing layer that forms only quantities that respect the invariances of the problem, followed by fully-connected layers operating on these quantities.The possible gauge and translation-invariant degrees of freedom that are allowed by the first layer are specified by hand; in principle this choice could be part of the learning process, although naïve implementations are prohibitively expensive.Wilson loops of all shapes and sizes, along with their correlated products, suitably averaged over spacetime, provide a natural choice of gauge-invariant, translation-invariant quantities that can be formed from a gauge field configuration.The number of such loops is exponentially large in the spacetime volume and it is computationally intractable to allow all to be generated, so a suitable subset must be chosen.As used in the PCA analysis in Section II B, one such subset is the set of square planar loops of sizes up to L/2 × L/2, as well as 1 × n rectangular loops for n up to L, averaged over all possible planar orientations and space-time locations.Another natural choice is the set of all correlated products of two Wilson loops, similarly averaged: where the sum over ∈ O(j ×k) is over all lattice rotations of loops of size j ×k, and these loops are chosen from the same list as the single loops described above.Histograms of these correlated loop products for each ensemble in Grids A, B, and C are shown in Figs.21, 23, and 25 in Appendix B 2.
A third choice of simple, gauge-invariant quantities is the set of subtracted correlated products of loops, Network structures that allow each of these sets-labelled as single loops (SL), unsubtracted products of two loops (CP), and single loops plus the subtracted correlated products of two loops (SLCP)-to be formed in the first layer, are studied.The complete network structures used for regression are illustrated in Fig. 13 for each of the SL, CP, and SLCP cases.Each network was trained using 850 independent configurations from each ensemble in a given grid, with a further 100 held out as validation data.As for the fullyconnected network described in the previous section, the networks were initialised by setting the biases to zero and the weights to a truncated normal distribution centred at zero with a width of 0.02.Although no rigorous tuning of the hyperparameters of the networks was undertaken for the various structures, a large number of variations were investigated.In general, networks with fewer hidden units, or fewer layers, than those illustrated in Fig. 13 were found to produce less optimal solutions, while larger networks did not significantly improve on the results that are presented.As for the fully-connected networks, an L1 distance in the two-dimensional parameter space was used as the loss function, and this was found to perform considerably better than the L2 distance.For a given network structure and loss function, the same minimum loss was achieved using different choices of optimiser, including SGD, Adam [68], and Nesterov [81], with various parameters, although the number of epochs required to convergence varied.
The outputs of neural networks allowing each of the SL, SLCP, or CP loop sets to be formed in the first layer, trained on the ensembles in Grid A, are shown in Fig. 14.In each case, the results display accurate regression and clear differentiation between the ensembles, with the shapes of the confidence ellipses of network predictions elongated in the direction of constant 1 × 1 plaquette, the simplest and most precise gauge-invariant object.The mild distortion of the regression results towards the centre of the grid is natural, as this will always lead to a smaller loss in the case of misidentifications than any alternative.With additional tuning and larger or denser parameter grids for training, one might expect that this distortion can be removed.The training and validation losses of each network are shown against training epoch in Fig. 15.The CP network performs slightly better than the SL network, as one may anticipate, given that it allows a larger number of degrees of freedom to be utilised.The SLCP network, while also having more degrees of freedom than the SL network, displays over-fitting: while the training loss is as good as that of the CP network, the validation loss remains higher.It is likely that tuning the network hyperparameters individually for each network structure would improve these results.For the purpose of the present proof-of-principle study, the CP network is taken as the best example for further study.
Unlike the fully-connected network described in the previous section, the symmetry-respecting networks generalise successfully, both correctly identifying the parameters of other streams generated with the same action as the training data, which are indistinguishable from the validation distributions, and interpolating to intermediate ensembles.This interpolation is illustrated in Fig. 16, which shows the predictions of the CP network on both the evenly-spaced intermediate ensembles of Grid B, and on ensembles in Sets D and E, generated to lie along lines of constant plaquette (isoplaquette lines).While the latter ensembles are essentially indistinguishable along each isoplaquette by various Wilson loops, even using a principal component analysis (see Sec. II B), the parameter predictions from the trained network are distinguishable, and, most importantly, have the correct relative positions in parameter space.The overlap between the network predictions for the very closely-spaced ensembles from Set E is anticipated; as described in earlier sections, there is a maximum resolution inherent in this regression problem.Nevertheless, the ordering of the central values of the distributions remains robust.This shows accurate regression of dense points in a region of parameter space significantly smaller than the space between adjacent training ensembles, confirming that the network has successfully parametrised the relevant features of lattice QCD gauge fields.
The accurate regression achieved with the CP network relies on having a sufficient density of points in the {β, m 0 } plane in the training data set to enable interpolation.Reducing this density by half, for example, and training the same network structure in the same manner, yields a network instance that generalises poorly to intermediate ensembles.Fig. 17 shows the results of such a test, using the Grid A ensembles.Despite the poor generalisation performance, both training and validation loss converge to the same values as for the CP network trained on the entirety of Grid A; that is, the training does not indicate over-fitting.
The successful parametric regression of lattice QCD gauge fields presented here must be extended to larger-volume lattices more typical of modern lattice QCD calculations for the method to be applied in practice.As lattice volume increases, Wilson loop distributions become more sharply peaked, and as a result become more distinct, as can be seen by comparing Figs.21 and 25 which display these loops for data sets with spacetime volumes V = L 3 × T = 12 3 × 36 and 16 3 × 48, respectively.It can thus be anticipated that regression performance with the network structures developed here will improve on larger lattice volumes.Fig. 18 shows the results of a CP network structure trained on Grid C. As expected, the regression performance is better than for the smaller-volume ensembles.Extending these results to even larger volumes, and to N c = 3 QCD, is essential.

IV. SUMMARY
Deep neural networks with custom symmetry-preserving layers provide a solution to the parameter regression problem in lattice QCD, for the proof-of-principle case considered here.Specifically, neural networks regressors trained on grids of ensembles in action parameter space were able to accurately identify the parameters used to generate streams of ensembles, generalising successfully and accurately to ensembles densely spaced and between grid points in the training space.Nonsymmetry preserving networks were also studied.While these were unsuccessful at the regression task, they revealed an unknown feature of the lattice ensembles with a longer correlation length 14: Predictions of β and m 0 for the validation ensembles in Grid A at the same parameter values of the training ensembles, using SL (left panel), SLCP (right panel) and CP (bottom panel) network structures.The stars show the location of each ensemble in parameter space, while the ellipses show the 1σ confidence regions generated from the variation of the predictions for the 100 validation samples from each ensemble.than any of the physics observables that were studied.
Extending this work to SU(3) gauge groups and to larger lattice volumes will be essential for the practical application of the methods developed.In addition to the symmetries exploited here, a typical length scale, 1/Λ QCD ∼ 10 −15 m, emerges dynamically in LQCD calculations.Consequently, there are potential advantages for a convolutional approach [82][83][84] at larger lattice volumes.Convolutional layers would again have to be customised, respecting the gauge symmetry of the problem.Particular use-cases of LQCD parameter regression may also impose additional constraints.For example, regression for the matching of coarse and fine lattice actions requires the identification of ensembles generated in a coarse space with ensembles describing the same physics, but generated via a coarsening prescription [20,21].The latter ensembles, by renormalisation-group evolution,     are described by lattice actions with more parameters than those generated in the coarse space.Preliminary investigation suggests that regression under these conditions will require network structures invariant under irrelevant short-distance degrees of freedom, or the marginalisation over such degrees of freedom in the learning procedure.Regression of the larger number of parameters in such actions (and used in the construction of perfect actions [12][13][14][15]), must also be investigated further.
Clearly, having demonstrated the feasibility of neural network approaches to LQCD in the present work, significant further study is warranted.In particular, the use of lattice symmetries to overcome the dramatic inverted data hierarchy of LQCD-the feature that there are typically far fewer samples than degrees of freedom per sample available-opens the door to many novel applications of machine learning in LQCD.

Further details of ensemble properties
In this appendix, the properties of the various LQCD data sets used in this work are presented.

FIG. 2 :
FIG. 2: Diagrammatic representation of the construction of planar Wilson loops W k×l (x), with indices k and l denoting the dimensions of the loop (with orientation label suppressed), from gauge links U µ (x).

6 FIG. 4 :
FIG. 4: Contours show ln |W n×m | from evaluations on each ensemble in the two L/a = 12 grids.The stars show the locations of the ensembles from Grids A (blue) and B (orange).

3 FIG. 5 :
FIG.5: Eigenvalues e n (left panel) and eigenvectors v n (right panel) of the loop correlation matrix for Grid A. The strength of the contribution of each loop to each eigenvector is represented by the tone of the corresponding box in the right panel (i.e., darker = larger contribution).

FIG. 6 :
FIG.6: Combinations of loops corresponding to the four largest eigenvectors of the loop correlation matrix for Grid A. Each colour denotes a different ensemble in Grid A.

FIG. 7 :
FIG. 7:The Jensen-Shannon divergence, D JS , between pairs of ensembles in Grid A, calculated over the three-dimensional distributions defined by the three dominant eigenvectors of the loop correlation matrix used for the PCA.D JS = 1 implies completely distinguishable distributions.

FIG. 8 :FIG. 9 :
FIG.8: Combinations of loops corresponding to the dominant eigenvectors of the loop correlation matrix for ensemble Sets D and E. Each colour denotes a different ensemble.

FIG. 10 :
FIG. 10: A schematic representation of the neural network structure used for parametric regression.Gauge links, expressed in an SU(2) basis as 4 real numbers, are used as inputs to the network.There are 4 links in each positive direction from a given site, giving a total of 4 × 4 × V = 16 × 12 3 × 36 = 995328 real numbers per gauge field.Two fully connected layers, each with 96 nodes, are used.Each hidden layer features a tanh activation function and dropout.A random set of connections between layers are omitted to denote dropout.

7 FIG. 11 :
FIG. 11: Predictions of β and m 0 on validation ensembles at the same parameter values as the training ensembles.The stars in the left panel denote the parameters used to generate the ensembles, while the ellipses show the one-standard deviation confidence interval of the predictions for the validation ensembles.The same validation data are shown as histograms in the right figure, with the intersections of the grid lines indicating the parameters used for ensemble generation.

40 FIG. 12 :
FIG.12: Autocorrelation function in Monte-Carlo time (left, defined in Eq. (10)) and autocorrelation time (right, defined in Eq. (6)) of the feature distinguishing two streams at the same set of parameters, trained on sequences of gauge field configurations.The autocorrelation function was generated by averaging over many different results (trained using all different pairs of the 10 streams, F 1,...,10 , at the same parameters), and was found to be robust under changes of the network structure used to generate it.The dashed horizontal line on the right figure shows the maximum autocorrelation time of various physics observables (see Fig.3).

FIG. 13 :
FIG.13: Diagrams of the neural network structure used.In the first layer, SL, CP, or SLCP structures are formed, e.g., in the CP case, products of the 18 different types of loops separated by lattice distance R < 13 (averaged in integer space bins of R) are allowed, for a total of 18 × 18 × 13 = 4212 loop products.The first layer is followed by 3 fully connected hidden layers with 1024, 512, and 256 nodes.Each hidden layer uses a tanh activation function, with dropouts between layers.

16 :
Predictions of β and m 0 from the CP network trained on Grid A, for the ensembles in Grid B (left panel) and Sets D and E panel).The open circles show the the location of each ensemble in parameter space, while the ellipses show the 1σ confidence regions generated from the variation of the predictions for the 100 validation samples from each ensemble.The greyed-out stars and ellipses show the validation data and training ensemble locations.

7 FIG. 17 :
FIG. 17: Predictions of β and m 0 from a CP network trained on a subset of the ensembles in Grid A. The stars show the location of each ensemble in parameter space, while the ellipses show the 1σ confidence regions generated from the variation of the predictions for the 100 validation samples from each ensemble.In the right panel, the open circles show the location of testing ensembles, that were not included in training, in the parameter space, while the matched-colour ellipses show the 1σ confidence regions of the network predictions.

7 FIG. 18 : 10 FIG. 19 :
FIG.18: Predictions of β and m 0 for the validation ensembles in Grid C at the same parameter values of the training ensembles, using a CP network structure.The stars show the location of each ensemble in parameter space, while the ellipses show the 1σ confidence regions generated from the variation of the predictions for the 100 validation samples from each ensemble.

Fig. 19
shows the evolution of various Wilson loops with HMC trajectory for the ensembles in Grids A, B, and C, while Figs.20-25 present histograms of the Wilson loops and correlated products of Wilson loops on each ensemble in these grids.

FIG. 20 :
FIG.20:The various Wilson loops, W m×n , on each of the ensembles in Grid A for all m, n combinations used in this work.
Loss for networks trained on the ensembles in Grid A with SL (orange), CP (blue), and SLCP (green) structures in the first layer, optimised with the Adam optimiser.The dark lines indicate the training loss and the pale lines show loss on the validation data.