Flow-based sampling for fermionic lattice field theories

Algorithms based on normalizing flows are emerging as promising machine learning approaches to sampling complicated probability distributions in a way that can be made asymptotically exact. In the context of lattice field theory, proof-of-principle studies have demonstrated the effectiveness of this approach for scalar theories, gauge theories, and statistical systems. This work develops approaches that enable flow-based sampling of theories with dynamical fermions, which is necessary for the technique to be applied to lattice field theory studies of the Standard Model of particle physics and many condensed matter systems. As a practical demonstration, these methods are applied to the sampling of field configurations for a two-dimensional theory of massless staggered fermions coupled to a scalar field via a Yukawa interaction.


I. INTRODUCTION
Lattice field theory is among the most successful approaches for regularizing and computing path integral expectation values in quantum field theory. In particular, the path integral can be numerically evaluated by formulating a stochastic process weighted by the Euclidean lattice action and applying Markov Chain Monte Carlo (MCMC) sampling [1]. The Euclidean lattice results can then be systematically related to the corresponding continuum Minkowski theory. This procedure enables the investigation of equilibrium properties in the theory of interest, and is a powerful and well-established method to study strongly coupled quantum field theories nonperturbatively. Key areas of application include fundamental interactions, most prominently quantum chromodynamics (QCD), as well as problems in condensed matter theory; see Refs. [2][3][4][5][6][7][8][9] for recent reviews.
The sequential nature of the Markov chain is a potential drawback to the MCMC sampling approach for computing path integrals in lattice field theory. In particular, known Markov chain update schemes for many theories of interest are local or diffusive, which can result in severe autocorrelations between successive elements of the chain. Naturally, the larger the autocorrelation between samples, the more samples must be drawn to achieve a result at fixed statistical precision. Close to criticality, e.g. when approaching the continuum limit of lattice field theories, autocorrelations also diverge rapidly for such local or diffusive Markov chains. This issue, referred to as critical slowing down, can render computations prohibitively expensive [10][11][12]. These challenges have motivated extensive work to replace local/diffusive MCMC algorithms, such as Hybrid Monte Carlo (HMC) [13], with other sampling procedures. Specialized Markov chain steps have been developed in a number of specific contexts, including cluster updates [14][15][16][17][18][19][20][21][22], worm algorithms [23][24][25], sampling in terms of dual variables [26][27][28], and event-chain algorithms [29][30][31][32][33]. Though these methods have been shown to mitigate critical slowing down in some settings, they cannot be applied to many theories of interest, including lattice QCD. In this light, development of sampling algorithms for lattice field theory based on machine learning techniques is underway, and previous works have applied a variety of tools such as adversarial learning and self-learning Monte Carlo methods [34][35][36][37][38][39][40][41][42][43][44][45][46][47].
Progress has recently been made in substituting the proposal mechanism in MCMC with a variational ansatz based on a class of samplers known as normalizing flows [48][49][50], which can be optimized to approximately sample from the target Boltzmann distribution [51][52][53]. Within this approach, proposed samples are by construction uncorrelated, and asymptotic exactness can be guaranteed by implementing a Markov chain with a Metropolis accept/reject step or through reweighting.
Though flow-based models have been extended to exactly incorporate the gauge symmetries inherent in many quantum field theories [54,55], existing applications are so far restricted to purely bosonic theories. For theories involving fermions, the anti-commuting nature of the associated operators must be treated by formulating the theory in terms of integrals over Grassmann-valued field variables. These integrals may be evaluated analytically, resulting in a purely bosonic theory described by an effective action which incorporates the dynamics of the fermion fields via 'fermion determinant' terms. While flow-based methods may in principle be applied to learn this effective action over bosonic fields, the cost of computing such determinants scales unfavorably with the number of fermionic degrees of freedom, and their exact evaluation is typically intractable at the scale of state-of-the-art calculations.
In this paper, a framework is presented for the application of flow-based sampling algorithms to lattice quantum field theories with dynamical fermions, such as lattice QCD and theories describing many condensed matter systems. We construct approaches based primarily on the pseudofermion method [56] to avoid an explicit computation of the fermion determinant while guaranteeing asymptotic exactness of the sampling schemes. We investigate the application of a number of flow-based samplers in the context of a simple, two-dimensional Yukawa model with a real scalar field coupled to staggered fermions. Our results demonstrate that lattice field theories with dynamical fermions are amenable to flow-based sampling and provide a starting point for extensions to higher-dimensional settings as well as theories involving gauge fields. The architectures developed here are also applicable to flow-based acceleration of traditional MCMC approaches [57][58][59][60][61]. The primary contributions of this work are:

• In Section III, identifying four distinct sampling schemes based on generative models that capture different decompositions/marginalizations of the target distribution over boson and pseudofermion field configurations;

• In Section IV, constructing and optimizing efficient, expressive flow model architectures that respect the symmetries of the pseudofermion action, in particular translational symmetry with antiperiodic temporal boundary conditions;

• In Section V, implementing and numerically benchmarking these sampling approaches in the context of a two-dimensional field theory with one pair of mass-degenerate fermions.
The remainder of this paper is organized as follows. In Section II, we review the description of fermions in lattice field theory, the use of pseudofermions, and the boundary conditions and translational symmetry of pseudofermion actions. In Section III, we outline four exact generative sampling schemes for fermionic theories and subsequently develop suitable flow architectures as the generative models for use in these sampling schemes in Section IV. Details and numerical results of the application of our framework to a Yukawa theory in two dimensions are presented in Section V. In Section VI, we discuss the applicability of our developments to update-based approaches.
We summarize our findings and provide an outlook in Section VII.

II. FERMIONS ON THE LATTICE
The simulation of dynamical fermion degrees of freedom in lattice field theory is a highly non-trivial task for many theories of physical interest, both conceptually and computationally. In this section, we briefly review the main concepts behind formulations of lattice fermions and their numerical implementation. For a comprehensive treatment we refer the reader to one of the standard textbooks; see e.g. Refs. [62][63][64].

A. Path integrals with fermions
We consider field theories of interacting fermionic and bosonic degrees of freedom discretized on a d-dimensional Euclidean hypercubic lattice with periodic boundary conditions. The action of such a theory can be expressed as

    S(φ, ψ, ψ̄) = S_B(φ) + S_F(φ, ψ, ψ̄),    (1)

where the subscripts B and F denote the bosonic and fermionic contributions to the action, the discretized boson field variables are collectively denoted by φ, and the discretized fermion field variables are denoted by ψ, ψ̄.
For the present work, we assume that the fermionic action is bilinear in N_f flavors of Dirac fermions ψ_f, ψ̄_f and is given by

    S_F(φ, ψ, ψ̄) = Σ_f ψ̄_f D_f(φ) ψ_f,    (2)

where the Dirac operator D_f(φ) includes the kinetic terms, mass terms, and any coupling to bosonic fields for each fermion flavor f. The precise form of D_f(φ) is determined by the theory of interest and the choice of discretization. Expectation values of observables O are computed via path integrals of the form

    ⟨O⟩ = (1/Z) ∫ Dφ Dψ Dψ̄ O(φ, ψ, ψ̄) e^{-S(φ,ψ,ψ̄)},    (3)

with the partition function

    Z = ∫ Dφ Dψ Dψ̄ e^{-S(φ,ψ,ψ̄)},    (4)

and the fermion fields ψ and ψ̄ are defined in terms of anti-commuting Grassmann numbers. For bilinear actions of the form given in Equation (2), integration over the Grassmann-valued fermion fields can be performed explicitly, giving

    ∫ Dψ Dψ̄ e^{-S_F(φ,ψ,ψ̄)} = Π_f det D_f(φ).    (5)

By applying Wick's theorem, the dependence of the observable on the fermions can be integrated out. Path integral expectation values can then be written in terms of purely bosonic degrees of freedom as

    ⟨O⟩ = (1/Z) ∫ Dφ O(φ) Π_f det D_f(φ) e^{-S_B(φ)}.    (6)

This expectation value may be estimated via MCMC sampling by computing an average over a statistical ensemble of configurations φ,

    ⟨O⟩ ≈ (1/N) Σ_{φ∼p} O(φ),    (7)

where φ∼p denotes a sum over N configurations sampled from the probability distribution

    p(φ) = (1/Z) Π_f det D_f(φ) e^{-S_B(φ)}.    (8)

Direct sampling schemes for high-dimensional lattice distributions are typically not known, even for theories without fermions. Nevertheless, the distribution p(φ) can be sampled via MCMC methods with guaranteed asymptotic exactness under certain ergodicity and balance constraints [65]. Among these, the HMC algorithm [13] has been established as the de facto standard method for producing configurations in lattice field theory and is routinely employed in state-of-the-art QCD studies and beyond. This algorithm is based on the numerical treatment of Hamiltonian equations of motion in a fictitious time dimension, where quantum fluctuations are encoded by the random sampling of the associated canonical momenta. Given a field configuration and a set of momenta, this evolution is computed with a symplectic integrator such as the leapfrog algorithm.
A Metropolis-Hastings accept/reject step [66,67] results in an algorithm satisfying detailed balance, despite the accumulation of numerical errors along the discretized integration trajectory. At scale, the fermion determinants are treated stochastically via the pseudofermion method introduced in the following section.
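The leapfrog evolution underlying HMC can be sketched in a few lines of numpy. The sketch below is generic and illustrative, not the implementation used in this work: `grad_S` stands in for the gradient of whatever lattice action is being simulated, and the step size `eps` and step count `n_steps` are free parameters.

```python
import numpy as np

def leapfrog(phi, pi, grad_S, eps, n_steps):
    """One leapfrog trajectory for the fictitious-time Hamiltonian
    H(phi, pi) = pi.pi/2 + S(phi): symplectic and time-reversible,
    with an O(eps^2) energy error that the Metropolis test corrects."""
    phi, pi = phi.copy(), pi.copy()
    pi = pi - 0.5 * eps * grad_S(phi)      # initial half-step for the momenta
    for _ in range(n_steps - 1):
        phi = phi + eps * pi               # full step for the fields
        pi = pi - eps * grad_S(phi)        # full step for the momenta
    phi = phi + eps * pi                   # final field step
    pi = pi - 0.5 * eps * grad_S(phi)      # final half-step for the momenta
    return phi, pi
```

Because the integrator is reversible and volume-preserving, running it backwards with negated momenta recovers the starting configuration, which is what makes the subsequent accept/reject step valid.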

B. Pseudofermions
The fermion determinants in Equation (8) cannot be calculated directly at scale because the Dirac matrices D_f are high-dimensional. For d-dimensional field configurations with L sites per spatial dimension and L_t sites in the temporal dimension, the total number of fermionic degrees of freedom scales as the total number of lattice sites, V = L_t L^{d-1}. Each Dirac matrix D_f then has dimensions O(V × V), and an exact computation of the determinants becomes intractable at the scales of many theories of interest. Instead, Gaussian integrals over auxiliary bosonic fields can be used to replace the direct evaluation of determinant factors, based on the identity

    det M = (1/Z_N) ∫ Dϕ_R Dϕ_I e^{-ϕ† M^{-1} ϕ},    (9)

where the normalization constant Z_N is defined as

    Z_N = ∫ Dϕ_R Dϕ_I e^{-ϕ† ϕ}.    (10)

Here, ϕ_R, ϕ_I denote the real and imaginary components of the auxiliary complex field ϕ, and the matrix M must be positive-definite. Since the Dirac matrices D_f are typically not positive-definite, one cannot directly apply this identity to each factor of det D_f. However, for fermion flavors f_1 and f_2 appearing as degenerate pairs, based on γ_5-hermiticity one can instead use the equality

    det D_{f_1} det D_{f_2} = det(D_{f_1} D†_{f_1}),    (11)

and then apply Equation (9) to the positive-definite matrix M ≡ D_{f_1} D†_{f_1}. For fermion flavors f not included in any degenerate pair, one can apply one-flavor algorithms [68][69][70] to replace D_f with a positive-definite matrix M capturing identical dynamics.
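The degenerate-pair construction can be checked numerically at small scale. The sketch below is a toy check, not a lattice Dirac operator: a random complex matrix `D` stands in for D_{f_1}, and the γ_5-hermiticity property det D_{f_2} = (det D_{f_1})* for the degenerate partner is taken as an assumption.

```python
import numpy as np

def paired_determinant(D):
    """det(D D^dagger) for a mass-degenerate pair of flavors; D D^dagger is
    Hermitian positive-definite, so this determinant is real and positive."""
    return np.real(np.linalg.det(D @ D.conj().T))
```

For any invertible D, det(D D†) = det D · (det D)*, which is the product of the two paired flavor determinants under the γ_5-hermiticity assumption above.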
Using the pseudofermion approach, a path integral as in Equation (6) can thus be rewritten in terms of an action involving the auxiliary pseudofermion fields ϕ, after replacing the fermion determinants in the given lattice theory by the determinants of N_pf positive-definite matrices M_k, as

    ⟨O⟩ = (1/(Z_N^{N_pf} Z)) ∫ Dφ Dϕ O(φ) e^{-S_PF(φ,ϕ)},    (12)

with the pseudofermion action

    S_PF(φ, ϕ) = S_B(φ) + Σ_{k=1}^{N_pf} ϕ†_k M_k^{-1}(φ) ϕ_k.    (13)

Each term ϕ†_k M_k^{-1} ϕ_k in the pseudofermion action can be efficiently computed using iterative solvers such as the conjugate gradient method. Having formulated the theory using pseudofermions in Equation (12), evaluation of the path integral via MCMC can then be performed in this augmented space by sampling from the joint distribution

    p(φ, ϕ) = (1/(Z_N^{N_pf} Z)) e^{-S_PF(φ,ϕ)}.    (14)

The distribution p(φ) given in Equation (8) is obtained from this joint distribution by marginalizing over the auxiliary pseudofermion fields ϕ. While the pseudofermion method renders the treatment of fermion determinants tractable in principle, the joint action may strongly fluctuate in certain limits.

TABLE I. Probability densities derived from the joint distribution p(φ, ϕ) of Equation (14), each listed with its name, probability density, and use case: the joint density p(φ, ϕ) (Row 1), the φ-marginal p(φ) (Row 2), the ϕ-conditional p(ϕ|φ) (Row 3), the ϕ-marginal p(ϕ) (Row 4), and the φ-conditional p(φ|ϕ) (Row 5). The normalizing constant Z is given by Equation (4) and Z_N is defined in Equation (10). Notes: (A) Only the joint, ϕ-conditional, and φ-conditional densities can be efficiently computed (up to normalization). (B) The ϕ-conditional can be sampled exactly by the method specified in Equation (16). (C) A closed form for the ϕ-marginal density is not generally known (even unnormalized).
This feature can slow down MCMC sampling of the joint distribution and can lead to an unfavorable volume scaling of the associated computational effort, especially when many components of the bosonic field are updated simultaneously. Accordingly, numerous modifications of the pseudofermion formulation have been developed to improve the structure of the action; see e.g. Refs. [71][72][73][74][75][76]. These developments are complementary to the application of generative models for sampling the pseudofermion distribution and could be combined with any of the approaches presented here. For example, in this work we improve the efficiency of modeling and sampling the distributions under study by applying even-odd preconditioning [77], as discussed in Section V A and Appendix B.
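The conjugate gradient evaluation of the action term ϕ† M^{-1} ϕ mentioned above can be sketched as follows. This is a minimal dense-matrix illustration with hypothetical helper names; production lattice codes apply M as a sparse operator and use far more sophisticated solvers and preconditioners.

```python
import numpy as np

def cg_solve(M, b, tol=1e-10, max_iter=500):
    """Conjugate gradient for M x = b, valid for Hermitian positive-definite M,
    such as the pseudofermion matrices M_k = A A^dagger."""
    x = np.zeros_like(b)
    r = b - M @ x                      # initial residual
    p = r.copy()
    rs = np.vdot(r, r)                 # vdot conjugates the first argument
    for _ in range(max_iter):
        Mp = M @ p
        alpha = rs / np.vdot(p, Mp)
        x += alpha * p
        r -= alpha * Mp
        rs_new = np.vdot(r, r)
        if np.sqrt(abs(rs_new)) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def pf_action_term(M, phi):
    """Pseudofermion action contribution phi^dagger M^{-1} phi via CG,
    avoiding any explicit inverse or determinant."""
    return np.real(np.vdot(phi, cg_solve(M, phi)))
```

The point of the iterative solve is that only matrix-vector products with M are required, which is what keeps the pseudofermion formulation tractable at scale.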

C. Boundary conditions and translational symmetry
In lattice studies of purely bosonic theories, it is common to choose periodic boundary conditions in all directions of the lattice, allowing the incorporation of an exact discrete translational symmetry in lattice actions for such theories. For fermion fields, one needs to impose antiperiodic boundary conditions in the time direction in order to obtain a consistent definition of the trace for the Euclidean partition function. Actions for theories involving fermionic fields are then invariant under simultaneous spatio-temporal translations of φ, ψ, andψ, with appropriate boundary conditions applied for each field.
To be consistent with the boundary conditions for ψ andψ, each Dirac matrix D f (φ) must include appropriate signs for any terms coupling fields across the temporal boundary. As a result, these boundary conditions affect the pseudofermion formulation of the theory as well, and the pseudofermion action S P F (φ, ϕ) is invariant under simultaneous translations of φ and ϕ with antiperiodic temporal boundary conditions applied to ϕ.
In general, the discretization chosen for the Dirac operator determines which particular lattice translations are included in the translational symmetry group. In the staggered formulation [78], for example, the spinor components of each flavor of fermion are distributed over the components of hypercubes with 2^d sites each, and the translational symmetry group includes all translations by an even number of sites. Translations by an odd number of sites in particular directions correspond to more complicated internal symmetry transformations that mix spinor degrees of freedom, and must involve sign flips on specific field components to leave the staggered action invariant [79]. These translational symmetries of the staggered formulation play a role in the application to the staggered-fermion Yukawa model presented in Section V.

III. EXACT GENERATIVE SAMPLING SCHEMES FOR FERMIONIC THEORIES
Generating importance-weighted field configurations for a lattice field theory involving fermions can proceed via the marginal distribution p(φ) defined in Equation (8), the joint distribution p(φ, ϕ) defined in Equation (14), or through other choices of marginalized distributions defined in Table I. In this work, we develop exact sampling schemes based on generative models that directly approximate these distributions. In defining these sampling schemes, we assume that the model probability density may be computed for each generative model (this property holds for the flow-based models defined below). This section details four asymptotically exact schemes for constructing Markov chains to draw samples of φ, as illustrated in Figure 1.

A. Modeling and sampling of p(φ)

Since we must ultimately sample only the field φ, one could directly model the φ-marginal distribution (Row 2 of Table I) by constructing a generative sampler providing a distribution q(φ) approximating p(φ). Samples drawn from the model distribution q(φ) can be used in asymptotically exact sampling schemes by either constructing an independence Metropolis Markov chain or applying reweighting/resampling based on reweighting factors p(φ)/q(φ). A direct application of either approach requires computing p(φ), including the aforementioned determinant factors. The acceptance probability for such a Metropolis Markov chain would be

    A(φ → φ′) = min(1, [p(φ′) q(φ)] / [p(φ) q(φ′)]).    (15)

This sampling scheme is illustrated in Figure 1a.
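The independence Metropolis construction used here can be sketched generically. The example below is a toy one-dimensional version under stated assumptions: `log_p` may be unnormalized (as p(φ) is, up to Z), while `log_q` must be the exact model density, which is the property the flow-based models of Section IV provide.

```python
import numpy as np

def independence_metropolis(log_p, log_q, draw_q, n_steps, x0, rng):
    """Independence Metropolis chain: each proposal x' ~ q is drawn
    independently of the current state and accepted with probability
    min(1, p(x') q(x) / (p(x) q(x'))), so normalizing constants cancel."""
    x = x0
    lw = log_p(x) - log_q(x)           # log importance weight of current state
    chain = [x]
    for _ in range(n_steps):
        xp = draw_q(rng)
        lwp = log_p(xp) - log_q(xp)
        if np.log(rng.uniform()) < lwp - lw:
            x, lw = xp, lwp            # accept: replace state and weight
        chain.append(x)
    return np.array(chain)
```

Because proposals do not depend on the current state, rejected steps simply repeat the previous configuration, and the quality of the model q controls both the acceptance rate and the residual autocorrelation.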
Instead of evaluating the ratio det M(φ′)/det M(φ) directly, which becomes prohibitive at scale, it is possible to apply the pseudo-marginal method [80] using stochastic approximations of both the numerator and denominator of Equation (15) in a way that retains asymptotic exactness. In this stochastic generalization of the Metropolis algorithm, one computes an estimate of p(φ) using an unbiased stochastic estimator when φ is initially proposed. This estimate of p(φ) is then used in all subsequent accept/reject tests for the next element in the Markov chain. Applied to a theory with fermions, this amounts to computing a stochastic estimate of the fermion determinant for each proposed configuration. For example, we can use an unbiased estimator based on pseudofermions. An (unnormalized) estimate for p(φ) can be obtained by generating a pseudofermion ϕ from the conditional p(ϕ|φ) and measuring the quantity e^{-ϕ†(M^{-1}(φ)-1)ϕ} e^{-S_B(φ)}. The ϕ-conditional can be directly sampled according to

    ϕ = A(φ) χ,  with χ drawn from the density ∝ e^{-χ†χ}.    (16)

The matrix A is defined by the identity M(φ) ≡ A(φ)A†(φ) and reduces to the Dirac matrix D(φ) in a two-flavor example. This estimate can be extremely noisy in practice and may give poor statistical performance. However, it can be improved upon by using multiple pseudofermion draws; see e.g. Refs. [74,76,82,83]. In principle, any unbiased stochastic estimator of the fermion determinant can be applied (whether based on pseudofermions or entirely distinct). The limit of taking arbitrarily precise estimators recovers the exact acceptance probability in Equation (15). Acceptance rates obtained by using the exact form can thus be interpreted as an upper bound on sampling performance.

B. Gibbs sampling using p(φ|ϕ), p(ϕ|φ)

An alternative to modeling p(φ) directly is to construct samplers for both conditional distributions p(φ|ϕ) and p(ϕ|φ) and build an asymptotically exact Gibbs sampler that alternatingly samples from these distributions to update φ and ϕ.
For such a Gibbs sampler to satisfy detailed balance, the update to φ must satisfy detailed balance for p(φ|ϕ) and the update to ϕ must satisfy detailed balance for p(ϕ|φ). The ϕ-conditional can be exactly and directly sampled as described in Equation (16), automatically fulfilling this requirement. On the other hand, the φ-conditional (Row 5 of Table I) can be approximated by a generative model distribution q(φ|ϕ) ≈ p(φ|ϕ). This model can be incorporated into an exact Markov chain transition for the φ-conditional distribution as follows. Start with a state φ, sample ϕ′ from p(ϕ|φ) using Equation (16), conditionally propose φ′ from q(φ|ϕ′), and then apply a Metropolis-Hastings accept/reject step with the acceptance probability

    A(φ → φ′; ϕ′) = min(1, [p(φ′|ϕ′) q(φ|ϕ′)] / [p(φ|ϕ′) q(φ′|ϕ′)]).    (17)

This step satisfies detailed balance for the φ-conditional distribution p(φ|ϕ′) as required and guarantees asymptotic exactness. Note that in contrast to the φ-marginal sampler described in Section III A, computing the acceptance probability at scale for the sampling scheme described here and in Sections III C and III D does not rely on unbiased stochastic determinant estimators. In this approach, the field ϕ′ is independently resampled conditioned on φ at each step of the Markov chain, and therefore does not need to be stored. This Gibbs sampler can thus be interpreted as an exact Markov chain over φ alone, with the sampling of ϕ contained inside each Markov chain step as depicted in Figure 1b. The approach closely mirrors the typical sampling strategy employed in HMC, in which pseudofermions ϕ′ are sampled according to the exact conditional distribution p(ϕ|φ) and Hamiltonian evolution is used to construct an update step that satisfies detailed balance for the conditional distribution p(φ|ϕ′).
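The exact ϕ-conditional sampling of Equation (16), used both by the stochastic determinant estimator above and inside each Gibbs step, can be sketched with a small dense matrix standing in for A(φ). This is an illustrative toy, with the Gaussian normalization chosen so that the density of χ is ∝ e^{-χ†χ}.

```python
import numpy as np

def sample_pf_conditional(A, rng):
    """Exact draw from p(pf | phi) ∝ exp(-pf^dagger M^{-1} pf) with
    M = A A^dagger: draw chi with density ∝ exp(-chi^dagger chi) and
    set pf = A chi; the action term then collapses to chi^dagger chi."""
    n = A.shape[1]
    chi = (rng.normal(size=n) + 1j * rng.normal(size=n)) / np.sqrt(2)
    return A @ chi, chi
```

The collapse of the action term is what makes this draw exact: for invertible A, pf† M^{-1} pf = χ† A†(A A†)^{-1} A χ = χ†χ.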
The generative model proposal and Metropolis-Hastings step for p(φ|ϕ′) can thus be considered an optimizable replacement for the molecular dynamics trajectory utilized in HMC, with the difference that the mechanism generating a proposal configuration φ′ does not depend directly on φ (as is the case for a symplectic integrator), but only indirectly through ϕ′. However, this also means that, in contrast to all other schemes described here, the Gibbs sampler is not an independence sampler: drawing configurations from the model and constructing the Markov chain cannot be done asynchronously, since the generation of a proposal explicitly depends on the previous element of the chain.

C. Autoregressive modeling and sampling of p(φ)p(ϕ|φ)

The joint distribution p(φ, ϕ) can be autoregressively decomposed as the product p(φ, ϕ) = p(φ)p(ϕ|φ) in terms of the φ-marginal and ϕ-conditional (Rows 2 and 3 of Table I). A generative model for the joint distribution can therefore be produced by approximating both components independently, i.e., q(φ, ϕ) = q(φ)q(ϕ|φ). This autoregressive decomposition allows the joint distribution to be reproduced in terms of two potentially simpler distributions. Note that although the exact sampling procedure described in Equation (16) can be applied to draw samples from p(ϕ|φ), computing the normalizing constant of this ϕ-conditional distribution is not tractable. This is not an obstacle when one is only interested in conditionally sampling ϕ, as is the case for HMC or the approaches of Sections III A and III B, but it motivates modeling the distribution in the case where an approximation with a tractable density is required.
Exactness can be straightforwardly enforced in this approach by employing Markov chain steps in which joint samples (φ′, ϕ′) are proposed independently from q(φ, ϕ) = q(φ)q(ϕ|φ), and a Metropolis-Hastings accept/reject step is applied for the proposed transition (φ, ϕ) → (φ′, ϕ′) according to the acceptance probability

    A((φ, ϕ) → (φ′, ϕ′)) = min(1, [p(φ′, ϕ′) q(φ)q(ϕ|φ)] / [p(φ, ϕ) q(φ′)q(ϕ′|φ′)]).    (18)

This sampling scheme is illustrated in Figure 1c. Furthermore, unique reweighting factors can be tractably computed for each configuration φ as p(φ, ϕ)/[q(φ)q(ϕ|φ)], so reweighting approaches may also be used as alternatives to MCMC in order to guarantee exactness here.

D. Fully joint modeling and sampling of p(φ, ϕ)
Rather than modeling the factors p(φ) and p(ϕ|φ), one could instead apply generative models to jointly sample the fields φ and ϕ according to a distribution q(φ, ϕ) that directly approximates the joint distribution (Row 1 of Table I). This results in joint samples and density estimates analogous to the autoregressive case above, but is a qualitatively distinct approach to modeling this distribution. Exactness can be enforced using a similar Metropolis-Hastings Markov chain transition with acceptance probability

    A((φ, ϕ) → (φ′, ϕ′)) = min(1, [p(φ′, ϕ′) q(φ, ϕ)] / [p(φ, ϕ) q(φ′, ϕ′)]),    (19)

or by applying reweighting or direct resampling techniques. This approach is illustrated in Figure 1d.

IV. FERMIONIC FLOWS VIA PSEUDOFERMIONS
The sampling approaches discussed above for theories involving fermions can in principle use any generative models that enable both efficient sampling and density estimation for the relevant model distributions. Normalizing flows are one such class of probabilistic models for which these operations are made possible using a change-of-variables formula [48,84,85]. Flow-based methods have been successfully implemented to model the unnormalized Boltzmann distributions of φ^4-theory as well as U(1) and SU(N) gauge theories [51,52,54,55], opening up a wealth of potential applications in high energy and nuclear physics as well as condensed matter theory. For an in-depth introduction to normalizing flows for lattice field theory with further implementation details and explanations, we refer the interested reader to Ref. [86]. For a general introduction to machine learning for physicists, we recommend Ref. [87].
Below, we outline the key concepts behind flows relevant for our approach to modeling the distributions listed in Table I:

1. A flow-based model consists of an invertible 'flow' f and a prior distribution with probability density r(·), which can together be used to produce samples by first drawing z according to r(z), then returning x = f(z).

2. The Jacobian J = ∂f/∂z, combined with the prior density, allows the output probability density q to be evaluated as

    q(x) = r(z) |det J|^{-1}, with z = f^{-1}(x).    (20)

3. The flow f is parameterized by free model parameters which may be optimized by minimizing a suitable 'loss function' that quantifies the difference between the model distribution q(x) and the target distribution. A common choice, also employed in the present work, is the Kullback-Leibler (KL) divergence [88].
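The sampling and density-evaluation mechanics can be illustrated with a deliberately minimal one-dimensional affine flow, where the scalars `a` and `b` play the role of the model parameters (a hypothetical example, far simpler than the architectures used in this work):

```python
import numpy as np

def flow_sample(a, b, n, rng):
    """Minimal flow: invertible affine map f(z) = a*z + b applied to a
    standard-normal prior r(z); returns samples together with their
    exact log-density from the change-of-variables formula."""
    z = rng.normal(size=n)
    x = a * z + b                                    # x = f(z)
    log_r = -0.5 * z**2 - 0.5 * np.log(2 * np.pi)    # prior log-density
    log_q = log_r - np.log(abs(a))                   # subtract log|det J|
    return x, log_q
```

Composing many such invertible maps, each with a tractable Jacobian, yields an expressive model whose density remains exactly computable, which is the property the sampling schemes of Section III rely on.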
The remainder of this section describes how models for each of the distributions required for sampling may be constructed. First, a common training procedure for all such models is described in Section IV A, based on the idea that each distribution aims to approximate some marginalization of the same joint distribution p(φ, ϕ). This common training procedure motivates some of the architectural decisions for the construction of the models described in Sections IV B and IV C. In the following, we label the model densities according to their corresponding target densities in Table I as q(φ, ϕ), q(φ), q(ϕ|φ), and q(φ|ϕ). In each sampling approach, using model distributions that better approximate the associated target will generally result in higher acceptance rates with potentially lower autocorrelations.

Footnote 2: We use the generic notation z and x here to stand for potentially high-dimensional variables acted on by the flow. In our applications these may include boson and/or pseudofermion field variables. Flows are described in field-theoretic notation wherever we work with flows particular to lattice field theory sampling.

A. Optimizing flow-based models
We first detail a procedure to optimize the model density q(φ, ϕ) to directly approximate p(φ, ϕ). The KL divergence between these distributions is defined as

    D_KL(q ∥ p) = ∫ Dφ Dϕ q(φ, ϕ) [log q(φ, ϕ) - log p(φ, ϕ)].    (21)

It is minimized if and only if the target and model densities match, in which case D_KL = 0. In practice, a loss function based on this divergence is computed stochastically as

    L = (1/N) Σ_{(φ,ϕ)∼q} [log q(φ, ϕ) + S_PF(φ, ϕ)],    (22)

evaluated over N samples drawn from q(φ, ϕ). The unknown normalizing constant log Z has been removed in the definition of Equation (22), since it is just an overall constant shift and does not affect the relevant structure of the loss function.
If the model probability density q(φ, ϕ) can be directly computed, we can evaluate the gradient of Equation (22) with respect to the model parameters defining this probability density. Gradient-based optimization methods can then be applied to minimize L. This training procedure is immediately applicable to the models required for the joint sampling approaches derived in Sections III C and III D. In the former, the distribution q(φ, ϕ) is defined by q(φ)q(ϕ|φ), and this pair of model distributions is simultaneously optimized by minimizing the loss function in Equation (22). In the latter, a model for q(φ, ϕ) is directly constructed and optimized.
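The stochastic loss can be sketched directly from its definition. The helper below is illustrative: `log_q` and `action` are placeholders for the model log-density and the pseudofermion action, and the batch is assumed to have been drawn from the model itself (the 'reverse' KL setup, which requires no pre-existing training data).

```python
import numpy as np

def reverse_kl_loss(log_q, action, samples):
    """Stochastic reverse-KL loss: L = E_{x~q}[ log q(x) + S(x) ],
    estimated over a batch drawn from q; the target's normalizing
    constant only shifts this by a constant."""
    return np.mean([log_q(x) + action(x) for x in samples])
```

A useful sanity check: if the model exactly equals the normalized target e^{-S}/Z, then log q(x) + S(x) = -log Z for every sample, so the loss estimate is constant and equal to its minimum.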
The remaining distributions required in Sections III A and III B, namely q(φ) and q(φ|ϕ), do not naturally define a joint model probability density. To optimize these distributions using the loss function above, we extend the model architectures by pairing q(φ|ϕ) with a sampler q(ϕ) and pairing q(φ) with a sampler q(ϕ|φ). The resulting joint models can be optimized as above, and the auxiliary components can be discarded after training. In this work, these auxiliary models are constructed as follows.
We first consider extending the φ-conditional model q(φ|ϕ) to a joint model, which requires a marginal distribution q(ϕ). None of the sampling approaches presented in Section III directly require this marginal distribution; however, as we discuss further in Section IV C, we choose to model q(φ|ϕ) by a restricted form of a joint sampler which simultaneously models a marginal distribution q(ϕ). In this extended model, both q(φ|ϕ) and q(ϕ) are described by parameters that are optimized.
A φ-marginal model q(φ) can be extended to a joint sampler by pairing it with a conditional distribution q(ϕ|φ). In principle such an auxiliary model could be constructed solely for the purposes of training. However, in this case we are free to instead use the exact conditional distribution p(ϕ|φ), which can be exactly and efficiently sampled. The result is a joint distribution defined by first sampling φ from the φ-marginal model and then sampling the ϕ-conditional using Equation (16), resulting in the joint density q(φ)p(ϕ|φ). Evaluating the joint KL divergence between this model distribution and the target joint distribution p(φ, ϕ) requires the evaluation of the normalized density p(ϕ|φ), which unfortunately includes a normalizing factor of det M(φ). However, we only require an unbiased stochastic estimator of the gradients of Equation (22) for optimization. Appendix C details a particular stochastic estimator for these gradients which can be used to avoid the costly determinant evaluation; this estimator is used for all φ-marginal models trained in this work.

B. Building blocks
Flow-based models are generally constructed by composing several simple, invertible transformation layers, each described by a number of free parameters. This composition produces an expressive overall transformation that is nevertheless invertible and has a tractable Jacobian determinant. Coupling layers [49] are one common choice of simple transformation in which the degrees of freedom of each sample are divided into two subsets and one subset is updated conditioned on the other, 'frozen' subset, as shown in Figure 2. A 'masking pattern' describes the division into subsets. Transformations of the updated subset are parameterized by 'context functions' accepting the frozen subset as input, which are typically implemented using neural networks. For example, a simple coupling layer for a real scalar field φ(x) ∈ R could be constructed based on a checkerboard division into even/odd sites, where the field at even sites is (invertibly) transformed by an element-wise rescaling operation plus an additional offset. The scaling factors and offsets are given by the output of an arbitrary context function, which may be parametrized by a neural network acting on the odd sites. The transformation is applied alternatingly between even and odd sites; see Ref. [86] for a concrete implementation of such coupling layers. Symmetries may be incorporated in such models using appropriate choices of masking patterns, context functions, and transformations. Other choices of layers are also possible (see Section IV B 5 below) and are similarly encoded using generic neural networks.
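The checkerboard affine coupling layer described above can be sketched in numpy. This is a simplified illustration: the functions `s` and `t` are stand-ins for the neural-network context functions, and the two-dimensional array `phi` represents a real scalar field configuration.

```python
import numpy as np

def coupling_layer(phi, s, t, parity):
    """Affine coupling layer with a checkerboard mask: sites of the given
    parity are rescaled and shifted using parameters computed only from the
    frozen sites, so the map is invertible and the log Jacobian determinant
    is simply the sum of log-scales over the updated sites."""
    x, y = np.indices(phi.shape)
    mask = ((x + y) % 2 == parity)            # sites to update
    frozen = np.where(mask, 0.0, phi)         # context sees only frozen sites
    scale, shift = s(frozen), t(frozen)
    out = np.where(mask, phi * np.exp(scale) + shift, phi)
    logdet = np.sum(scale[mask])
    return out, logdet
```

Because the frozen sites pass through unchanged, the same context values can be recomputed from the output, and the masked sites can be inverted in closed form; alternating the parity between layers eventually updates every site.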
The target densities defined in Table I are all invariant under translations with appropriate boundary conditions, as discussed in Section II C. Previous works have shown that exactly incorporating known symmetries into machine learning models can accelerate their training and improve their final quality [89][90][91][92][93][94]. In the context of normalizing flows, ensuring that the model density is invariant under a symmetry group is achieved by choosing an invariant prior distribution and building transformation layers that are equivariant under the symmetry. Below, we introduce several 'building blocks' which are designed to handle these symmetries. These building blocks are used in the implementation of various layers and flow-based models constructed in this work.

Translation-equivariant convolutions via P-fields and AP-fields
The joint distribution p(φ, ϕ) given in Equation (14) is invariant under simultaneous translations of both fields for any shift (δx, δt) in the translational symmetry group of the discretized theory; the bosonic field transforms as in Equation (23), while the pseudofermionic field additionally picks up a sign whenever a temporal translation wraps it around the antiperiodic boundary, as in Equation (24). In this work, we label fields transforming as Equation (23) as P-fields, and we label fields transforming as Equation (24) as AP-fields. In previous applications of flow models to sampling configurations in lattice field theory, translational symmetry has been implemented for bosonic fields by applying convolutional neural networks [95] with circular padding (periodic boundary conditions) to generate parameters for transformations implemented in each flow layer [51,52,59]. All input, intermediate, and output fields in these applications were P-fields. As a building block for translation-equivariant coupling layers acting on both bosonic and pseudofermionic fields, we extend this approach to define translation-equivariant convolutions that act on a generic set of input P-fields and AP-fields, producing output fields with a desired set of transformation properties, i.e., a specification of whether each channel of the output should be a P-field or an AP-field. To implement such convolutional neural networks, we exploit the fact that P-fields and AP-fields form an algebra under pointwise addition and multiplication and restrict the operations appropriately to satisfy the desired output transformation properties; see Appendix D 1 for explicit implementation details.

Translation-equivariant convolutions via group averages
As an alternative to defining equivariant convolutional neural networks, one can symmetrize a non-equivariant architecture by explicitly averaging over the whole symmetry group [89]. Convolutional layers with periodic padding in all dimensions are already equivariant under translations of P-fields and under all spatial translations of AP-fields, thus only the subgroup of temporal translations needs to be averaged over to ensure equivariance for AP-fields. The result is a generic method to produce convolutional neural networks with prescribed P-field and AP-field transformation properties of each output channel. Compared with standard convolutions or the restricted equivariant architecture given in Section IV B 1, this method requires a greater computational effort by a factor proportional to the temporal extent of the lattice, L t . However, it allows the use of unrestricted convolutional architectures, including arbitrary activation functions and learned biases; see Appendix D 2 for further details.
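The temporal group average can be sketched as follows, under simplifying assumptions: a single AP-field channel and an odd, bias-free map `f` (a toy stand-in for a convolutional network; more general architectures require the full construction). Here `shift_ap` translates an AP-field in time, flipping the sign of entries that wrap across the antiperiodic boundary:

```python
import numpy as np

def shift_ap(chi, delta):
    """Temporal translation of an AP-field (time = last axis, |delta| < Lt):
    entries that wrap around the antiperiodic boundary flip sign."""
    Lt = chi.shape[-1]
    rolled = np.roll(chi, -delta, axis=-1)  # chi'(t) = chi(t + delta)
    sign = np.ones(Lt)
    if delta > 0:
        sign[Lt - delta:] = -1.0            # forward wrap
    elif delta < 0:
        sign[:-delta] = -1.0                # backward wrap
    return rolled * sign

def group_average(f, chi):
    """Symmetrize f over temporal translations: average of T_d^{-1} f(T_d chi)."""
    Lt = chi.shape[-1]
    out = np.zeros_like(chi)
    for d in range(Lt):
        out += shift_ap(f(shift_ap(chi, d)), -d)
    return out / Lt
```

A plain periodic convolution in time is not equivariant for AP-fields, but its group average is: `group_average(f, shift_ap(chi, 1))` agrees with `shift_ap(group_average(f, chi), 1)`.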

Affine coupling layers
Translation-equivariant networks constructed by either of the methods discussed in Sections IV B 1 and IV B 2 can immediately be applied in the construction of translation-equivariant affine coupling layers suitable for transforming real-valued scalar fields. An affine coupling layer transforms a field x to ax + b (multiplication and addition are applied pointwise), where a and b are fields produced by context functions acting on the frozen components of the field x. Coupling layers, context functions, and masking patterns are illustrated in Figure 2; see also Ref. [86]. Using translation-equivariant convolutional neural networks to produce a and b, either a bosonic field or a pseudofermionic field can be updated in a translation-equivariant manner as long as:
• the parameters a and b are both P-fields if x is a bosonic field; or
• the parameter a is a P-field and b is an AP-field if x is a pseudofermionic field.
Such coupling layers can be composed to produce translation-equivariant flows.
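As an illustrative sketch (not the implementation used in this work), a checkerboard-masked affine coupling layer and its inverse can be written in a few lines of NumPy; `toy_context` is a hypothetical stand-in for the context networks producing the scale and offset fields:

```python
import numpy as np

def checkerboard_mask(shape, parity):
    """1.0 on sites whose coordinate sum has the given parity, else 0.0."""
    return (np.indices(shape).sum(axis=0) % 2 == parity).astype(float)

def affine_coupling(phi, context, parity, inverse=False):
    """One affine coupling layer: update sites of `parity`, freeze the rest."""
    frozen = checkerboard_mask(phi.shape, 1 - parity)
    active = 1.0 - frozen
    s, t = context(phi * frozen)  # parameters depend only on frozen sites
    if inverse:
        out = frozen * phi + active * ((phi - t) * np.exp(-s))
        logdet = -np.sum(active * s)
    else:
        out = frozen * phi + active * (phi * np.exp(s) + t)
        logdet = np.sum(active * s)
    return out, logdet

def toy_context(frozen_field):
    # stand-in for a convolutional network acting on the frozen sites
    s = 0.1 * np.roll(frozen_field, 1, axis=0)
    t = 0.2 * np.roll(frozen_field, 1, axis=1)
    return s, t
```

Because the frozen sites pass through unchanged, the inverse pass recovers identical (s, t) from the frozen subset, making the layer exactly invertible with log-Jacobian given by the sum of s over the updated sites.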

Equivariant linear operators
The conditional distribution p(ϕ|φ) is exactly Gaussian, suggesting that it may be efficiently modeled by flows based on architectures other than coupling layers. For example, one may define a linear operator W = W(φ) to transform the pseudofermion fields. The model distribution q(ϕ|φ) may then be defined by computing ϕ = Wχ, where χ is drawn from the Gaussian r_ap(χ) = (1/Z_N) e^(−χ†χ) given in Equation (25). To effectively use this flow model, det(WW†) must be tractable to compute. In the case of a degenerate pair of fermion flavors, the target distribution is defined by p(ϕ|φ) ∝ exp(−ϕ†(DD†)^(−1)ϕ). While it is clearly sufficient for W to approximate D in this case, it is in fact only necessary that WW† approximate DD†, allowing some freedom in the learned matrix W.
We build the operator W as a composition of simple linear operators W = W_n ∘ ... ∘ W_1, where each W_k has only local interactions along a fixed dimension, in a fixed direction (that is, with only positive or negative offsets, but not both), allowing the determinant of each matrix to be efficiently computed. We choose to parameterize the components of each operator W_k by two P-fields a and b, produced from learned translation-equivariant functions of φ. More specifically, we consider 2d types of operators, where each type is defined by a sign s = ±1 and a choice of one of the d lattice directions. For the two-dimensional application described below, there are thus four distinct operator types. The different types of operators are applied alternatingly in the composition, but the specific order can be chosen arbitrarily. The operator type with couplings in the spatial direction and sign s thus updates a field χ as (W_k χ)_{i,j} = a_{i,j} χ_{i,j} + b_{i,j} χ_{i+s,j}, with periodic boundary conditions along the space dimension: χ_{L+1,j} = χ_{1,j} and χ_{0,j} = χ_{L,j}. An operator with temporal couplings updates a field χ as (W_k χ)_{i,j} = a_{i,j} χ_{i,j} + b_{i,j} χ_{i,j+s}, with antiperiodic boundary conditions along the time dimension: χ_{i,L+1} = −χ_{i,1} and χ_{i,0} = −χ_{i,L}. This construction may be understood as a convolutional layer with appropriate boundary conditions and an additional constraint on the kernel to have non-zero entries only in the center and at one of the 2d adjacent sites. With these definitions, each operator W_k is block diagonal (for a suitable choice of basis), each block being a bidiagonal matrix with diagonal entries a_h, off-diagonal entries b_h, and a single wrap-around corner entry carrying the boundary sign, where we have dropped a (spatial or time) index to simplify the notation. The determinant of each block is simply ∏_h a_h ± ∏_h b_h, so the Jacobian determinant associated with the full composition can be tractably computed.
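The per-block determinant can be checked numerically; the sketch below (with random illustrative coefficients) assembles one bidiagonal block with a wrap-around corner entry and compares against ∏_h a_h ± ∏_h b_h, where the sign depends on the boundary condition (shown here for even extent L = 4):

```python
import numpy as np

def bidiagonal_block(a, b, antiperiodic):
    """One block of W_k: diagonal a_h, coupling b_h to the next site,
    and a wrap-around corner entry carrying the boundary sign."""
    L = len(a)
    M = np.diag(a)
    for h in range(L - 1):
        M[h, h + 1] = b[h]
    M[L - 1, 0] = -b[L - 1] if antiperiodic else b[L - 1]
    return M

rng = np.random.default_rng(0)
a = rng.normal(size=4) + 2.0  # kept away from zero for a well-conditioned block
b = rng.normal(size=4)

# temporal (antiperiodic) block, even extent L = 4: det = prod(a) + prod(b)
M_ap = bidiagonal_block(a, b, antiperiodic=True)
# spatial (periodic) block, even extent L = 4: det = prod(a) - prod(b)
M_p = bidiagonal_block(a, b, antiperiodic=False)
```

Cofactor expansion along the first column reproduces the closed form, so the full Jacobian is a product of cheap per-block factors rather than a dense determinant.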

Convex Potential Flows
Because of the non-local nature of the effective action, we consider an alternative flow architecture to produce a model distribution q(φ) approximating p(φ). Convex Potential Flows (CPFs) are normalizing flows defined via the gradients of a potential that is strongly convex and twice differentiable almost everywhere [96,97]. Strong convexity of the potential on a convex support X guarantees the flow to be invertible on X , and this family of normalizing flows can be shown to be a universal density approximator [97]. We can parameterize strongly convex functions by neural networks with mild constraints on their architecture and weights [98].
More specifically, given a convex potential function u : X → R, we define the map x_i = f_i(z) = ∂u(z)/∂z_i, where the index i runs over the degrees of freedom of z. Starting from a base density r(z), the resulting probability density produced by mapping through f follows from Equation (20) as q(x) = r(z) det H_u(z)^(−1), where x = f(z) and H_u(z)_{ij} = ∂²u(z)/∂z_i∂z_j is the Hessian matrix of u(z). Training by minimizing the KL divergence D_KL(q||p) between the model q and a target density p only requires the gradients ∇_θ log det H_u(x) with respect to the model's parameters θ. Since the Hessian is symmetric and positive-definite for strongly convex potentials, we can directly employ a stochastic trace estimator [97,99], ∇_θ log det H_u(x) = Tr[H_u(x)^(−1) ∇_θ H_u(x)] = E_χ[χᵀ H_u(x)^(−1) ∇_θ H_u(x) χ], with noise vectors χ of identity covariance. The sample mean over noise vectors χ can be used to estimate this quantity in practice, and the inverse Hessian applied in H_u(x)^(−1)χ can be efficiently computed by the conjugate-gradient method. Note that this estimator only requires the computation of Hessian-vector products Hχ, which is particularly convenient when the Hessian is sparse.
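The estimator can be sketched as follows: Tr[H⁻¹A] is estimated from Rademacher noise vectors χ, with H⁻¹ applied only through conjugate-gradient solves and matrix-vector products. The small dense matrices here are stand-ins; A plays the role of ∇_θ H:

```python
import numpy as np

def cg_solve(matvec, b, tol=1e-12, max_iter=500):
    """Minimal conjugate gradient for symmetric positive-definite systems."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def hutchinson_trace(H, A, num_samples, rng):
    """Estimate Tr[H^{-1} A] = E[chi^T H^{-1} A chi] with Rademacher chi."""
    total = 0.0
    for _ in range(num_samples):
        chi = rng.choice([-1.0, 1.0], size=H.shape[0])
        total += chi @ cg_solve(lambda v: H @ v, A @ chi)
    return total / num_samples
```

For commuting diagonal H and A the Rademacher estimator is exact sample by sample, which gives a convenient sanity check; in the flow setting H would be accessed only through Hessian-vector products.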
CPFs can be straightforwardly applied to construct flows that sample bosonic fields φ. They can also be constrained to be translation-equivariant by using appropriate convolutions. In contrast to coupling layers, the CPF potential is a scalar function based on global information, which may result in transformations of the field φ that can in general be quite non-local. Evaluating the model probability density for use in asymptotically exact Markov chains requires a precise approximation of the log-det Hessian to avoid systematic errors. An exact calculation of the determinant is feasible only for small lattice volumes. For larger field configurations, one could apply a more scalable estimator, such as the estimator based on Lanczos tridiagonalization and the quadrature method described in Ref. [100].

C. Flow models
We next define particular architectures for modeling each of the distributions required for the four sampling approaches introduced in Section III. While the space of possible architectures that may be defined from the building blocks of Section IV B is large and the present discussion is not exhaustive, the use of each sampling method and each building block is demonstrated at least once. The architectures for each approach detailed in this section are summarized in Figure 3.

Modeling p(φ) for φ-marginal sampling
The φ-marginal sampler defined in Section III A requires a flow whose model distribution q(φ) approximates p(φ). Such a flow only needs to manipulate P-fields. We build this φ-marginal model using a composition of CPF layers, where the output of each layer is defined by computing the gradient of a potential u_i(·) (see Section IV B 5). These layers act on samples ζ drawn from some base distribution r_p(ζ). Figure 3a depicts this type of φ-marginal architecture defined by a composition of CPFs acting on ζ. Each u_i contributes a determinant factor det H_{u_i}^(−1) to q(φ), where H_{u_i} denotes the Hessian of u_i. As discussed in Section IV A, this marginal model is extended to the joint density q(φ, ϕ) = q(φ)p(ϕ|φ) for training. The density cannot be computed efficiently due to the determinants involved in the definition of q(φ) as well as in the normalizing constant of p(ϕ|φ), but the flow is nevertheless trainable using stochastic estimates of the gradients. For sampling, the joint density itself may also be estimated using stochastic approximations of the determinant factors. The architecture of the convex potential network u(φ) is based on Ref. [97] and is modified appropriately to account for the periodic boundary conditions. It consists of K convolutional layers built from the following components: L_j, a convolution layer with periodic boundary conditions and unconstrained weights; L^+_j, a convolution layer with periodic boundary conditions and positive-only weights; SoftPlus(x) = log(1 + exp(x)); and ActNorm(x) = (x − µ)/σ, a layer that normalizes its inputs using a learnable offset µ and scale σ, where µ and σ are initialized as the mean and standard deviation of the inputs over an initialization batch [101]. Learnable weights w_1, w_2 control the closeness of the flow to the identity map at initialization. The use of periodic boundary conditions for L_j and L^+_j and the final Sum operation ensures that u(φ) is invariant under translations.
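A minimal dense sketch of an input-convex potential (positive weights on the composed path, convex nondecreasing activations) illustrates the constraint that makes u convex; the periodic convolutions, ActNorm layers, and identity-control weights described above are omitted, and all names here are illustrative:

```python
import numpy as np

def softplus(x):
    # numerically stable SoftPlus(x) = log(1 + exp(x))
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)

rng = np.random.default_rng(0)
n, h = 16, 8
W0 = rng.normal(size=(h, n))              # unconstrained: acts directly on the input
W1_pos = np.abs(rng.normal(size=(h, h)))  # positive-only: preserves convexity
W1 = rng.normal(size=(h, n))              # unconstrained skip connection from the input

def u(phi):
    """Convex potential: SoftPlus composed with affine maps, with positive
    weights applied to the already-convex intermediate activations."""
    z1 = softplus(W0 @ phi)
    z2 = softplus(W1_pos @ z1 + W1 @ phi)
    return z2.sum()
```

Convexity follows because a convex nondecreasing function of a nonnegative combination of convex functions is convex; numerically one can verify midpoint convexity, u((a+b)/2) ≤ (u(a)+u(b))/2.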

Modeling p(φ|ϕ) for Gibbs sampling
Section III B describes a Gibbs sampling scheme that utilizes the exact conditional p(ϕ|φ) and a modeled conditional density q(φ|ϕ). A ϕ-marginal model q(ϕ) is required to extend q(φ|ϕ) to the joint distribution q(φ|ϕ)q(ϕ) for training. We achieve this simultaneous modeling of q(φ|ϕ) and q(ϕ) by using a fully joint architecture with restricted information flow, as shown in Figure 3c. The model consists of a prior distribution over the base configurations ζ, χ, denoted by r_p(ζ) and r_ap(χ), followed by the application of two types of affine coupling layers. First, the layers g^p_k(·; χ_k) update the P-field configuration conditioned on the AP-field, along with the frozen components of ζ_k, to produce q(φ|ϕ); each coupling g^p_k contributes a Jacobian factor J_{g^p_k} to this density. Second, the couplings g^ap_k(·) transform the AP-field χ conditioned solely on its frozen components to obtain q(ϕ). To conditionally re-sample φ from q(φ|ϕ) while leaving ϕ unchanged, the bosonic prior variable is re-sampled and the output of the flow is re-evaluated while holding the pseudofermionic prior variable χ fixed. When ϕ is re-sampled from p(ϕ|φ) in the alternate step of the Gibbs sampler, it is important to update the value of χ by passing ϕ through the inverse of the bottom branch of the flow depicted in the figure. This allows future re-sampling of φ as well as the calculation of the conditional probability density defined by the model.

Autoregressive modeling of p(φ, ϕ) = p(φ)p(ϕ|φ)
Section III C defines an independence sampler based on an autoregressive joint model with the probability density given by q(φ, ϕ) = q(φ)q(ϕ|φ). We implement q(φ) using masked affine coupling layers, as was done in Refs. [51,54,55]. The parameters of the affine coupling layers are given by convolutional networks satisfying translational equivariance through standard periodic boundaries. We implement q(ϕ|φ) using a deep linear flow consisting of learned linear operators W_k(φ) as detailed in Section IV B 4. The parameters of these linear operators are all P-fields obtained by similar periodic convolutional networks. The full joint model is given by the autoregressive combination of these two models, i.e. drawing φ from the affine model with distribution q(φ), then drawing ϕ from the conditional deep linear flow with distribution q(ϕ|φ), as shown in Figure 3d. The marginal model is defined by sampling ζ from the prior distribution r_p(ζ), then applying the sequence of coupling layers g^p_k(·), such that the marginal model probability density q(φ) is given by the prior density multiplied by the coupling-layer Jacobian factors. The conditional linear flow is defined by sampling χ from the prior distribution r_ap(χ) and applying the linear operators W_k(φ) to obtain the model density q(ϕ|φ). We define r_ap(χ) = (1/Z_N) e^(−χ†χ) to match the choice for the linear operator flow in Equation (25).

FIG. 3. Architectures for the flow-based models defined in Section IV C for each sampling approach. Note that each coupling layer g^p_k or g^ap_k employs masking of the updated field as shown in Figure 2, such that the frozen components of the field are included as input to context functions. Superscripts on coupling layers indicate the translational equivariance structure of coupling layer inputs and outputs (either consistently transforming as P-fields or AP-fields). Many other choices of architectures are possible to model each distribution; the figure reflects the choices utilized in the numerical study undertaken in Section V A.
Note that the learned components in this approach may also be combined in novel ways. For example, it is possible to discard the conditional flow with distribution q(ϕ|φ) after training and simply use q(φ) for φ-marginal sampling as described in Sections III A and IV C 1. This may be advantageous in situations where gradients from an exactly sampleable distribution are not available and training must be fully variational. On the other hand, the conditional deep linear flow may be used by itself as a determinant estimator for given configurations φ.

Fully joint modeling of p(φ, ϕ)
Finally, we construct a model that simultaneously samples φ and ϕ in a fully joint approach, which can be employed in the exact sampler defined in Section III D. The joint model implemented in this work is constructed from affine coupling layers that alternatingly transform the bosonic fields conditioned on the pseudofermionic fields, and vice versa, as shown in Figure 3b. The model is defined to sample ζ and χ from the prior distributions r_p(ζ), r_ap(χ) and subsequently apply alternating layers. Coupling layers g^p_k(·, χ_k) transform the P-field base configuration ζ conditioned on its frozen components and the AP-field configuration χ_k, while couplings g^ap_k(·, ζ_k) update the AP-field base configuration χ_k conditioned on its frozen components and ζ_k. This gives rise to the joint density given in Equation (39).

V. APPLICATION TO A SCALAR YUKAWA THEORY IN TWO DIMENSIONS
As a demonstration, we implement the sampling algorithms defined above for a two-dimensional model of a scalar field coupled to fermions via a Yukawa interaction. This model provides a testbed which features fermionic fields, but without the additional complications brought on by gauge symmetry. Apart from providing a suitable test case for the development of flows capable of modeling fermions, studying Yukawa interactions is also interesting in its own right, e.g. for Higgs physics [102] or the quark-meson model [103].

A. Yukawa theory on the lattice
We consider a real, scalar field φ coupled to one mass-degenerate pair of Kogut-Susskind staggered fermions [78]. We emphasize that our method can in principle be straightforwardly applied to other discretization schemes, such as Wilson fermions, without additional conceptual difficulties.
The action for this theory is defined as S(ψ̄, ψ, φ) = S_B(φ) + S_F(ψ̄, ψ, φ) (see Equation (1)). For the scalar field we choose the usual discretized Klein-Gordon action with quartic self-interaction, S_B(φ) = Σ_{x∈Λ} [ (1/2) Σ_{µ=1}^{d} (φ(x+µ̂) − φ(x))² + (m²/2) φ(x)² + λ φ(x)⁴ ], where Λ denotes the set of lattice sites, m the bare scalar mass parameter, λ the coupling, and d the dimension. The fermionic action S_F is defined by the bilinear form in Equation (2) with N_f = 2, where both fermion flavors are defined by the discretized Dirac operator D_{x,y} = (m_f + g φ(x)) δ_{x,y} + (1/2) Σ_{µ=1}^{d} η_µ(x) (δ_{x+µ̂,y} − δ_{x−µ̂,y}), where m_f is the bare mass of the fermion and g the Yukawa coupling. The staggered factor η_µ is obtained from the Dirac γ-matrices after the staggered transformation and is defined as η_µ(x) = (−1)^(x_1 + ... + x_{µ−1}). The Kronecker δ appearing in the hopping terms is defined to have antiperiodic boundary conditions in the time direction (conventionally taken to be µ = d) and periodic boundary conditions in the spatial directions, i.e., it acquires a factor of −1 whenever its arguments are identified across the temporal boundary. We employ an even-odd preconditioning scheme for the Dirac operator for all models except for the autoregressive model using linear operators. In contrast to the default lexicographic ordering, sorting lattice sites into even and odd allows one to bring the matrix into a form that is amenable to an explicit block factorization of the determinant, which improves conditioning and solver performance. This reduces the variance and cost of computing the pseudofermion action required for optimizing models and sampling. Most previous work on improved orderings has focused on techniques for Wilson fermions in the context of gauge theory [71,73,104], but the same insights can be applied to the staggered fermion formulation used in this work, as detailed in Appendix B.
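For small lattices the operator can be assembled as a dense matrix; the sketch below assumes a 2D lattice with the spatial direction first, time last (antiperiodic), and η_1 = 1, η_2 = (−1)^(x_1), with all conventions illustrative rather than those of any production code:

```python
import numpy as np

def staggered_dirac(phi, m_f, g):
    """Dense 2D staggered Dirac operator with a Yukawa term (illustrative).
    phi: (Lx, Lt) real scalar field; antiperiodic BC along the time axis."""
    Lx, Lt = phi.shape
    idx = lambda x, t: x * Lt + t  # lexicographic site index (C order)
    D = np.zeros((Lx * Lt, Lx * Lt))
    for x in range(Lx):
        for t in range(Lt):
            i = idx(x, t)
            D[i, i] = m_f + g * phi[x, t]          # Yukawa mass term
            # spatial hops: eta_1 = 1, periodic boundary
            D[i, idx((x + 1) % Lx, t)] += 0.5
            D[i, idx((x - 1) % Lx, t)] -= 0.5
            # temporal hops: eta_2 = (-1)^x, antiperiodic boundary
            eta = (-1) ** x
            sp = -1.0 if t == Lt - 1 else 1.0      # forward wrap sign
            sm = -1.0 if t == 0 else 1.0           # backward wrap sign
            D[i, idx(x, (t + 1) % Lt)] += 0.5 * eta * sp
            D[i, idx(x, (t - 1) % Lt)] -= 0.5 * eta * sm
    return D
```

A quick consistency check: the staggered hopping term is antisymmetric, so D + Dᵀ equals twice the diagonal Yukawa mass term.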
All results reported in this work are computed on a 16 × 16 lattice geometry using the two choices of action parameters in the symmetric phase given in Table II. For this theory and lattice discretization, there is no additive renormalization to the bare fermion mass m f . Accordingly, we directly probe the case of vanishing mass by setting m f = 0. The first set of parameters, for which the Yukawa coupling is chosen to be g = 0.1, already provides a realistic test scenario in the sense that the average ratio of fermionic to scalar force magnitudes is around 3%, which is similar to the ratio of fermionic to gauge forces reported in the literature for some lattice QCD computations; see e.g. Refs. [105,106]. The second  choice with g = 0.3 features a much larger force ratio amounting to about 39%, which thus provides a testbed for theories with more prominent fermionic effects. For simplicity, we will refer to these two parameter choices by the associated value of the Yukawa coupling g.
To evaluate the performance of the sampling approaches presented here, we consider the following quantities and observables. The magnetization of a scalar field configuration is defined as M = (1/V) Σ_{x∈Λ} φ(x), where V is the number of lattice sites. We measure the average absolute value ⟨|M|⟩, which provides a non-zero order parameter that is large in the broken-symmetry phase and exponentially suppressed in the symmetric phase. The connected two-point correlation function of φ is defined as G_c(x, y) = ⟨φ(x)φ(y)⟩ − ⟨φ(x)⟩⟨φ(y)⟩, where we fix ⟨φ(x)⟩ = ⟨M⟩ = 0 analytically. We evaluate the source-averaged correlator projected to zero momentum, obtained by summing G_c over the spatial coordinates of source and sink and averaging over source positions. Fermionic observables can be computed from the matrix elements of the inverse Dirac operator. The chiral condensate of the fermion field is defined by ψ̄ψ = (1/V) Tr[D^(−1)], and we measure ⟨|ψ̄ψ|⟩ as for the magnetization. Using the off-diagonal matrix elements, we also evaluate the average fermionic two-point correlator in the time direction, C_f(t) = ⟨(D^(−1))_{y,0}⟩, where y = (0⃗, t) with t odd. The particular choices of offsets y select staggered spinor components at the sink ψ(y) that result in a non-zero average correlation function originating from the source ψ̄(0).
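Per-configuration estimators for these observables can be sketched as follows (normalization conventions are illustrative):

```python
import numpy as np

def magnetization(phi):
    """Volume-averaged field value M for one configuration."""
    return phi.mean()

def zero_momentum_correlator(phi):
    """Source-averaged, zero-momentum-projected scalar correlator
    for a single (Lx, Lt) configuration."""
    Lx, Lt = phi.shape
    slices = phi.sum(axis=0)  # project each time slice to zero spatial momentum
    return np.array([np.mean(slices * np.roll(slices, -t)) for t in range(Lt)])

def chiral_condensate(Dinv):
    """Per-configuration condensate estimate from the inverse Dirac operator."""
    return np.trace(Dinv) / Dinv.shape[0]
```

Ensemble averages of these per-configuration values (with the connected subtraction applied analytically, as above) give the quoted observables.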

B. Model architectures
For each of the four sampling approaches outlined in Section III and the corresponding model architectures detailed in Section IV C, we create specific models for both choices of target action parameters given in Table II:
• For the sampling scheme described in Section III A, we construct a CPF model defining a φ-marginal distribution q(φ) approximating the corresponding p(φ). The model architecture and training follow the generic procedure outlined in Section IV C 1.
• To build a conditional model q(φ|ϕ) for the Gibbs sampler described in Section III B, we implement a restricted affine coupling layer flow as described in Section IV C 2. To achieve translational equivariance, we employ the method of group averages described in Section IV B 2 for each convolutional network.
• To produce an autoregressive joint model density q(φ, ϕ) = q(φ)q(ϕ|φ) for the sampling scheme described in Section III C, we use a model consisting of affine coupling layers followed by learned equivariant linear transformations as described in Section IV C 3.
• For the fully joint sampling scheme described in Section III D, we implement a model with unrestricted affine coupling layers acting on both φ and ϕ, using translation-equivariant convolutions as described in Section IV B 1. This results in a fully joint model distribution q(φ, ϕ) as detailed in Section IV C 4.
The models are optimized for each approach based on the joint KL divergence discussed in Section IV A. Prior distributions for the initial P-field ζ and AP-field χ, where they are used according to Figure 3, are Gaussians with widths σ_ζ and σ_χ chosen for each model to enhance training stability. Details of the model hyperparameters and training procedure for each of the models can be found in Appendix A. An exhaustive search over the available parameter space is beyond the scope of the present work, and as such it can be expected that tuning the various model hyperparameters may further improve the reported performance metrics.

TABLE III. Observable measurements from each sampling scheme; the HMC baseline results for the action parameters in Table II are consistent with the measurements from our models. Autocorrelation times τ_int are computed for each of the 100 chains and then averaged, and errors are obtained with statistical jackknife. The results are discussed in more detail in Section V C. All models except the autoregressive make use of even-odd preconditioning of the action.

C. Discussion and comparison of sampling schemes
After optimization, we use each of the models to construct asymptotically exact samplers for their respective target distributions according to the four schemes given in Section III. For each case, we produce 100 distinct Markov chains consisting of 10k steps each, of which the first 1k steps are discarded for thermalization. These Markov chains are used for observable measurements and to investigate and compare metrics of the efficiency of sampling via each of these methods.
First, we confirm that each of the observables described above is measured to be consistent across sampling schemes and with HMC baseline results. Calculations of ⟨|M|⟩ and ⟨|ψ̄ψ|⟩ using each of the generated ensembles are detailed in Table III and are all consistent with the results obtained through HMC. The scalar and fermionic two-point correlators produced by the four exact Monte Carlo sampling schemes are also consistent with the HMC baseline, as shown in Figures 4 and 5.
For the various sampling approaches, Table III compares the autocorrelations of the magnetization and chiral condensate, as well as the Markov chain acceptance rates. Given a chain of N measurements X_1, ..., X_N of some real-valued observable X, its autocorrelation function is defined as Γ_X(τ) = (1/(N−τ)) Σ_{i=1}^{N−τ} (X_i − X̄)(X_{i+τ} − X̄), where τ denotes the number of Markov chain steps separating the pair of measurements and X̄ is the sample mean. The integrated autocorrelation time for observable X is given as τ_int^X = 1/2 + Σ_{τ=1}^{τ_max} Γ_X(τ)/Γ_X(0). The sum can be truncated at a sufficiently large τ_max due to the exponential suppression of Γ_X(τ); 1 ≪ τ_max ≪ N should be satisfied to ensure that the values of Γ_X(τ) are reliable. For the τ_int values reported in this work, we use the Madras-Sokal windowing procedure [108] to choose a suitable τ_max by identifying the earliest point where c·τ_int ≤ τ_max, with c = 10. The integrated autocorrelation times τ_int^M and τ_int^(ψ̄ψ) are given in Table III together with the mean acceptance rates for all four sampling procedures.
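The windowing procedure can be sketched directly from these definitions; the constant c = 10 and the estimator normalization below are illustrative:

```python
import numpy as np

def autocorr(x, tau):
    """Unnormalized autocorrelation Gamma_X(tau) of a 1D chain."""
    xc = x - x.mean()
    n = len(x) - tau
    return np.dot(xc[:n], xc[tau:tau + n]) / n

def tau_int(x, c=10.0):
    """Integrated autocorrelation time with Madras-Sokal windowing:
    truncate at the earliest tau_max satisfying c * tau_int <= tau_max."""
    gamma0 = autocorr(x, 0)
    t = 0.5
    for tau_max in range(1, len(x) // 2):
        t += autocorr(x, tau_max) / gamma0
        if c * t <= tau_max:
            return t
    return t
```

For uncorrelated data tau_int is close to 1/2, while strongly autocorrelated chains (e.g. an AR(1) process with coefficient 0.9) give substantially larger values.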
To understand the relative performance of the four sampling approaches detailed in Table III, we note that the dependence of the acceptance rate and autocorrelations on model quality is quite distinct in several of these approaches. For one, the φ-marginal sampler involves an exact determinant measurement in the sampling step used for the numerical study above, which is not expected to scale efficiently. If replaced with the pseudo-marginal estimator discussed in Section III A, the variance of the noisy estimates of each determinant would degrade the statistical performance achieved by even an optimally trained model and improved estimators, in particular when encountering large condition numbers. This is an obstacle to working with light fermion masses or field configuration geometries with many lattice sites (e.g. near the thermodynamic limit), independent of the challenge of training accurate model approximations to the target distribution. Thus, the relatively higher acceptance rates and lower autocorrelation times achieved by the φ-marginal sampler constructed from CPFs in this study must be weighed against the potentially difficult scaling challenges or requirements for more precise stochastic estimators in this approach. By comparison, there is no non-trivial upper bound on the acceptance rate of the other three sampling approaches, and they will achieve an acceptance rate of 100% when perfect model distributions are constructed.
Among these three approaches, the Gibbs sampler must also be further contrasted against the autoregressive and fully joint samplers. In particular, the remaining conditional structure of updates to φ and ϕ in the Gibbs sampler results in autocorrelations even if the acceptance rate is 100%. The magnitude of these residual autocorrelations may be small, but nevertheless puts a bound on the performance that is theoretically achievable by a Gibbs sampler, even in the asymptotic limit of perfect models of the involved distributions. Thus only joint models (either autoregressive or fully joint) can completely eliminate autocorrelations in the ideal limit of perfect models. In practice, however, the distinctions between joint models and Gibbs sampling may be minor. For example, the results presented in Table III demonstrate that at the similar acceptance rates of roughly 40%-50% for the Gibbs and autoregressive samplers, the integrated autocorrelation times for the magnetization and condensate are similar, despite the additional autocorrelations introduced by the particular conditional structure of the Gibbs sampling scheme. The fully joint sampler shows a lower acceptance rate and greater autocorrelations, indicating that the differences are largely based on the model approximation qualities.
The particular flow-based models implemented to approximate the various distributions used for the four sampling approaches also have distinct scaling prospects. It has been found in previous work [55] that flows based on coupling layers using convolutional networks may be easily transferred between different lattice volumes and thereby trained efficiently. This generalizability applies to the affine coupling layer implementations used for the Gibbs, autoregressive, and fully joint samplers described in this work.
The CPF implementation for φ-marginal sampling is also based on convolutional networks for the construction of the convex potentials, thus enabling efficient measurements of these potentials at all lattice volumes. However, in this case computing the Jacobian of the transformation to calculate q(φ) is potentially expensive, because it requires the evaluation of the Hessian of each u i . Stochastic estimation of these Hessian factors may introduce additional noise in exact sampling schemes based on these particular flow architectures.
These results numerically demonstrate the effectiveness of our proposed flow models and sampling schemes. The observed performance differences cannot immediately be attributed to inherent advantages of the chosen building blocks, but may also depend strongly on the model implementation details and theory-specific characteristics. The situation may also be quite different for larger volumes and dimensions as well as other types of fields and interactions, and disentangling the effects of implementation details from asymptotic scaling properties will be the subject of future research. Furthermore, there is a large space of possible combinations of the building blocks introduced here that could be explored in future work to determine models that may have more efficient training, sampling, and scaling prospects. While an exhaustive search over this space is beyond the scope of this exploratory work, the present results serve as a guide for the design of custom flows for lattice simulations with dynamical fermions in other applications.

VI. APPLICABILITY TO UPDATE-BASED APPROACHES
While the sampling schemes presented in this paper are based on the independent proposal of new field configurations (except for the Gibbs sampler; see Section V for further discussion), the flow-based models defined here may also be used in methods that instead propose configuration updates, rather than independent samples. A simple way to achieve this with our architectures would be to formulate stochastic processes in the flow prior that guarantee asymptotic exactness under the target distribution, such as partial heatbath resampling, HMC, or Langevin-type algorithms, rather than independently drawing a completely new prior sample in every update step; such partial updates have previously been studied in the context of other generative models [96,109,110] as well as trivializing map approaches [111,112]. In contrast to these update-based methods, direct sampling approaches have the advantage that autocorrelations in the flow-based Markov chain are in principle eliminated for an ideal model. Imperfect models, however, can still result in residual correlations caused by rejections in the Metropolis step. Whether these residual correlations from an imperfect model can outweigh the autocorrelations in corresponding update-based methods is an open question, the answer to which may also depend strongly on the model details and the specific problem under consideration.
Apart from devising modified sampling schemes for the types of flows presented in this work, one may also consider defining flows that directly transform configurations in order to produce proposals for Markov chain updates. Related work on learning improved HMC-like updates includes A-NICE-MC [57], its recent application to the lattice simulation of scalar φ 4 -theory [59], L2HMC [58], and DLHMC [60], which was demonstrated to successfully mitigate topological freezing in the context of U (1) lattice gauge theory in two dimensions. These approaches require the implementation of flows suitable for transforming the primary fields and conjugate momenta conditioned on each other. The flows over pseudofermion variables developed in this work can therefore be used to extend such methods to the setting of lattice field theory involving dynamical fermion fields. These insights may also inform the design of novel building blocks for the self-learning Monte Carlo method (SLMC) mentioned in the introduction, which was recently applied to non-Abelian gauge theory with dynamical fermions [47].

VII. SUMMARY AND OUTLOOK
In this paper, we introduce four approaches to flow-based sampling of fermionic lattice field theories, based on different decompositions of the joint action over bosonic and pseudofermionic fields. All approaches satisfy detailed balance and thus provide asymptotically exact sampling schemes.
We further introduce several techniques to model the distributions required in these sampling approaches, including the construction of flow-based models respecting the more complex translational symmetry group arising from the pseudofermion boundary conditions (which must be antiperiodic in time). All four sampling methods are demonstrated to successfully produce asymptotically exact samplers in a proof-of-principle application to a two-dimensional Yukawa theory.
The flow-based model architectures presented here represent a selection from a large class of possible ways to model the distributions required for the four sampling methods detailed in this work. The observed relative performance of the methods in the application to the Yukawa theory provides a starting point for understanding the distinctions between these sampling approaches and architectures, but should not be considered a definitive indicator of their performance in the context of other theories or at larger scales. Applying these methods to QCD will require future work to understand the scaling of the approach with lattice volume, spacetime dimension, and the inclusion of gauge fields. Importantly, investigating the continuum limit of flow-based samplers is relevant to determine their potential to mitigate critical slowing down. This question arises with or without fermions, and empirical studies are required to understand the scaling of these methods both for fermionic and purely bosonic theories. Nonetheless, the "building blocks" of flows suitable for fields including pseudofermions, and the sampling strategies outlined in this work, provide the basis for developing efficient flow-based samplers for fermionic theories.
Continued work on improved stochastic approximations of determinants [83,113-118] complements the flow-based sampling approach presented here, and such developments may be combined with our proposed flow-based sampling framework. For example, most flow-based samplers constructed in this work were designed to target the action after the application of even-odd preconditioning, a standard improvement technique for the fermionic determinant. Other preconditioners for the action may also be applied to increase the performance of flow-based samplers for these theories. Furthermore, improvements to unbiased stochastic estimates of fermionic determinants may increase the efficiency of the φ-marginal sampler at scale.
In summary, this work sets the stage for flow-based sampling of lattice field theories with fermions and paves the way towards an application of generative neural samplers to lattice QCD with dynamical quarks and similar problems in condensed matter theory. The next steps in this endeavor are the transfer of insights gained from the Yukawa model studied in this work to interactions between fermions and gauge fields, and the study of the scalability of the method. If these steps can be achieved, high-quality flow-based models at the scale of state-of-the-art calculations may be able to circumvent the limitations of traditional sampling algorithms, thereby expanding the frontiers of lattice QCD and other lattice field theories.

Appendix A: Model and training details

This appendix lists all details necessary to reproduce the flow architectures discussed in Section V B for the application to the two-dimensional Yukawa theory studied in this work. In particular, all relevant model and training hyperparameters are given in Table IV. Additional peculiarities of the linear operator and fully joint model implementations not listed in the table are discussed below. Furthermore, we provide references for the elementary machine learning components used in the implementation and optimization.
All models are trained using the well-known Adam optimizer [120] with default settings. In some cases, clipping of the gradient value and norm was employed to stabilize training [121]. The deep neural networks providing the context functions and convex potentials in the flow architectures are implemented exclusively in terms of convolutional networks with several hidden layers and channels [95]. The non-linear activation functions employed in these networks are the rectified linear unit (ReLU) [122], in particular the LeakyReLU variant [123], as well as the SoftPlus function in the case of the CPF layers, as detailed in Section IV C 1. In some cases, an additional Tanh activation is applied to the output of each network, as specified in Table IV.
For all experiments in this work using the CPF architecture, we set w_1 = 5 × 10^-3 and w_2 = 1 at initialization (cf. Equation (34)), and all convolutional layers use a stride of 1.
As detailed in Section V A, even-odd preconditioning of the Dirac matrix is not applied for the autoregressive architecture with linear operators, as we observe that the model gives a better approximation to the non-preconditioned action with standard lexicographic ordering. The space of possible model adjustments to make the even-odd decomposition compatible with linear operators is large, and modifications could be explored to improve the results. The conditional density q(ϕ|φ) is implemented using the composition of 128 equivariant linear operators {W_k}_{k=1}^{128}. The 128 linear operators are jointly defined by stacking a single squeezing layer, which breaks invariance under odd translations as explained in Appendix D 1, followed by a convolutional network with periodic boundary conditions. This network features 10 hidden layers with 64 channels each and uses intermediate LeakyReLU activations. In total, the network has 256 output channels, with each pair of output channels providing the values of a and b in the definition of one of the 128 linear operators. The a output is additionally transformed using a normalized SoftPlus function. We also find it useful to add an L^2 regularization loss for both outputs, with a weight of 10^-5.
For the fully joint model, active components of the scalar field are transformed using its frozen components as well as the pseudofermion field. The updated scalar field together with the frozen pseudofermion components is then used to update the active pseudofermion sites.

Appendix B: Even-odd preconditioning

Ordering lattice sites into even and odd subsets allows writing the Dirac matrix D as the 2 × 2 block matrix
\[
D = \begin{pmatrix} D_{ee}(\varphi_e) & D_{eo} \\ D_{oe} & D_{oo}(\varphi_o) \end{pmatrix} \equiv \begin{pmatrix} A & C \\ B & D \end{pmatrix}, \tag{B1}
\]
where we denote the blocks as A, B, C, D for simplicity. The constant blocks B = D_oe and C = D_eo couple odd to even sites and vice versa, and φ_o and φ_e indicate the components of φ respectively associated with odd and even sites of the lattice. This form allows a more efficient stochastic approximation of the determinant by decomposing it into the determinant of either diagonal block and the associated Schur complement as
\[
\det D = \det(A)\,\det\!\left(D - B A^{-1} C\right) = \det(B)\,\det\!\left(A B^{-1} D - C\right) \tag{B2}
\]
or, equivalently,
\[
\det D = \det(D)\,\det\!\left(A - C D^{-1} B\right) = \det(C)\,\det\!\left(D C^{-1} A - B\right). \tag{B3}
\]
Rewriting from the first to the second form in Equations (B2) and (B3) ensures that the resulting expression does not involve terms that mix A and D with their respective inverses. The first form may lead to numerical instabilities if m_f = 0, which can result in ill-conditioned A and D.
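The equivalence of the Schur-complement form of the block determinant, det D = det(A) det(D − B A⁻¹ C), with the rewritten form det(B) det(A B⁻¹ D − C) that avoids the inverses of the diagonal blocks can be checked numerically. This is an illustrative sketch only: the random, well-conditioned blocks below stand in for the Dirac-matrix blocks and carry none of the structure of the actual staggered operator.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4

# Random well-conditioned blocks standing in for A = D_ee, D = D_oo
# (scalar-field dependent) and the constant hopping blocks B = D_oe, C = D_eo.
A = rng.normal(size=(n, n)) + 3 * np.eye(n)
B = rng.normal(size=(n, n)) + 3 * np.eye(n)
C = rng.normal(size=(n, n)) + 3 * np.eye(n)
Dm = rng.normal(size=(n, n)) + 3 * np.eye(n)

full = np.block([[A, C], [B, Dm]])
det_full = np.linalg.det(full)

# Schur-complement form: involves the inverse of the diagonal block A.
det_schur = np.linalg.det(A) * np.linalg.det(Dm - B @ np.linalg.inv(A) @ C)

# Rewritten form: only the constant off-diagonal block B is inverted.
det_rewritten = np.linalg.det(B) * np.linalg.det(A @ np.linalg.inv(B) @ Dm - C)

assert np.allclose(det_full, det_schur)
assert np.allclose(det_full, det_rewritten)
```

The second form is preferable when the diagonal blocks are nearly singular, since the only inverse appearing is that of a constant hopping block.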
Since B and C are constant, the factors det(B) and det(C) drop out of the path integral and thus affect neither Metropolis-Hastings acceptance rates nor gradients for the optimization of flow-based models. Hence, they can be ignored for the purpose of training and sampling flow models. The reduced V/2 × V/2 form of the Dirac operator makes determinant estimation significantly cheaper while keeping the additional computational overhead minimal. Half of the pseudofermion degrees of freedom completely decouple from the scalar field and can be discarded. The reduced operator partially retains the original periodic and antiperiodic boundary conditions when applied to the even or odd sub-lattices, respectively, reducing to a translation symmetry for even shifts. When utilizing affine coupling layers with a checkerboard mask, it is exactly this subset of the translational symmetry group that is preserved, and the translationally equivariant architecture is directly applicable to learning a distribution over the reduced subset of pseudofermions.
The improvement can be pushed to higher order by noting that the diagonal matrix elements of the preconditioned Dirac operator are close to unity, which makes it possible to employ an ILU preconditioning scheme [71,73]. It relies on the fact that the preconditioning matrices for the even-odd ordering step described above can be computed explicitly, which is not generally true for other ordering schemes. Though originally designed for the Wilson Dirac operator, the same procedure can be applied to the staggered fermions studied in this work. Since ILU preconditioning breaks the translation symmetry of the pseudofermion action completely, it is not directly compatible with any of our equivariant flow constructions that target the distribution of ϕ. However, in an experiment modeling the even-odd preconditioned φ-marginal distribution using an affine coupling layer model, additional ILU preconditioning for the same architecture led to a moderately improved acceptance rate.
Appendix C: Stochastic estimator for gradients of log det M

The calculation of loss gradients required for the optimization of some of the models introduced in this work requires the evaluation of
\[
\nabla_\varphi \log\det M(\varphi) = \operatorname{tr}\!\left[ M^{-1}(\varphi)\, \nabla_\varphi M(\varphi) \right], \tag{C1}
\]
taken with respect to the field φ. In general, M(φ) is a positive-definite matrix arising either from the Dirac matrices of a pair of mass-degenerate fermions as M = D D†, or from one-flavor methods (see Section II B). The calculation of the exact determinant det M(φ) may be intractable because of the scaling with the number of lattice degrees of freedom, thus a stochastic estimator is instead defined in this section to tractably evaluate Equation (C1). By assumption, M is a positive-definite matrix and thus the following stochastic trace estimator is applicable:
\[
\operatorname{tr}\!\left[ M^{-1}\, \nabla_\varphi M \right] = \mathbb{E}_{\chi}\!\left[ \chi^\dagger M^{-1} \left( \nabla_\varphi M \right) \chi \right]. \tag{C2}
\]
Here, the noise vector χ is assumed to be drawn from the unit-variance isotropic distribution with an appropriate number of degrees of freedom to match the dimensions of M.
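A minimal NumPy sketch of this Gaussian trace estimator follows. It assumes real fields and dense matrices purely for illustration; in a lattice computation, M⁻¹(∇M)χ would be obtained by an iterative solver rather than a dense solve, and the function name is hypothetical.

```python
import numpy as np

def stochastic_logdet_grad(M, dM, n_samples=2000, rng=None):
    """Estimate tr(M^{-1} dM) = d(log det M)/d(phi_x) with Gaussian noise.

    Uses E_chi[ chi^T M^{-1} dM chi ] = tr(M^{-1} dM), valid for
    unit-variance isotropic noise chi.
    """
    rng = rng or np.random.default_rng(0)
    dim = M.shape[0]
    total = 0.0
    for _ in range(n_samples):
        chi = rng.normal(size=dim)                  # unit-variance noise
        total += chi @ np.linalg.solve(M, dM @ chi) # chi^T M^{-1} dM chi
    return total / n_samples

# Check against the exact trace on a small positive-definite example.
rng = np.random.default_rng(2)
X = rng.normal(size=(6, 6))
M = X @ X.T + 6 * np.eye(6)      # positive definite by construction
dM = rng.normal(size=(6, 6))
dM = dM + dM.T                   # a symmetric stand-in for dM/dphi_x

exact = np.trace(np.linalg.solve(M, dM))
approx = stochastic_logdet_grad(M, dM, rng=np.random.default_rng(7))
assert abs(approx - exact) < 0.5
```

The estimator is unbiased for any noise distribution with identity covariance; Gaussian noise matches the convention used in the text.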
In the case of two degenerate fermionic flavors, an interesting connection can also be made to the gradient of the negative pseudofermion action (where the sign is chosen to match the positive sign of log det M(φ)). This gradient can be evaluated to be
\[
-\nabla_\varphi S_{\mathrm{pf}}(\varphi, \phi) = \eta^\dagger \left[ \nabla_\varphi \left( D D^\dagger \right) \right] \eta, \tag{C3}
\]
where η ≡ (D D†)^{-1} ϕ = (D†)^{-1} χ, in terms of the noise vector χ ∼ e^{-χ†χ} used to generate the pseudofermion field. A short derivation shows that this is equivalent to the stochastic estimator of the two-flavor determinant,
\[
\mathbb{E}_{\phi}\!\left[ \eta^\dagger \left[ \nabla_\varphi \left( D D^\dagger \right) \right] \eta \right] = \mathbb{E}_{\chi}\!\left[ \chi^\dagger D^{-1} \left[ \nabla_\varphi \left( D D^\dagger \right) \right] (D^\dagger)^{-1} \chi \right] = \nabla_\varphi \log\det\left( D D^\dagger \right). \tag{C4}
\]
This relation allows the gradient estimator to be computed using the same tools utilized for the evaluation of HMC forces with respect to the pseudofermion action.
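The identity η = (D D†)⁻¹ ϕ = (D†)⁻¹ χ for pseudofermions generated as ϕ = D χ, and the resulting unbiasedness of the force-based estimator, can be checked numerically on a small random stand-in for D (real matrices here, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
D = rng.normal(size=(n, n)) + 3 * np.eye(n)   # stand-in "Dirac" matrix
M = D @ D.T                                    # two-flavor operator D D^dagger
dM = rng.normal(size=(n, n))
dM = dM + dM.T                                 # stand-in for d(D D^dagger)/dphi_x

# Pseudofermion generation: chi ~ N(0, 1), phi = D chi. The two expressions
# for eta then agree: (D D^dagger)^{-1} phi = (D^dagger)^{-1} chi.
chi = rng.normal(size=n)
phi = D @ chi
eta_a = np.linalg.solve(M, phi)
eta_b = np.linalg.solve(D.T, chi)
assert np.allclose(eta_a, eta_b)

# Averaging the pseudofermion force eta^dagger dM eta over many noise vectors
# reproduces the exact tr(M^{-1} dM) = d(log det M)/dphi_x.
etas = [np.linalg.solve(D.T, rng.normal(size=n)) for _ in range(4000)]
estimate = np.mean([eta @ dM @ eta for eta in etas])
exact = np.trace(np.linalg.solve(M, dM))
assert abs(estimate - exact) < 0.5
```

Since η ~ N(0, M⁻¹), the force average reproduces exactly the Gaussian trace estimator of the previous paragraph.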
Appendix D: Translation-equivariant networks

1. Convolutional networks with P-fields and AP-fields

In Section IV B 1, we introduced P-fields and AP-fields; that they form an algebra can be seen as follows. The set of P-fields is closed under linear combinations and pointwise multiplication. The set of AP-fields, on the other hand, is only closed under linear combinations: the product of two AP-fields is a P-field, while the product of a P-field with an AP-field is an AP-field. In other words, the set of P-fields and AP-fields forms a superalgebra [124] under pointwise addition and multiplication. Pointwise application of a function to a P-field results in a new P-field. For AP-fields, more care is required, as not all functions can be applied pointwise: application of an odd function to an AP-field results in another AP-field, while pointwise application of an even function results in a P-field. In this appendix, we explain in more detail how the properties of these fields allow us to build expressive neural networks that are equivariant under translations. This means that if T ∈ Z^d is an arbitrary space-time translation and f(φ, ϕ) = (φ′, ϕ′) is one of our neural networks, then f(T · φ, T · ϕ) = (T · φ′, T · ϕ′). This discussion is not specific to two-dimensional spacetime and applies in any dimension d.
First, convolutions can be built for both types of fields. For P-fields, this is achieved by first padding the field periodically and then applying an ordinary convolution. For a convolution kernel of size 2k + 1, all fields must be padded by k sites in each direction. As a concrete example, consider a one-dimensional lattice of size 5, a P-field with values [1, 2, 3, 4, 5], and a convolution kernel [1, 1, 1]. The padded P-field is [5, 1, 2, 3, 4, 5, 1], and applying the convolution results in a new P-field with values [8, 6, 9, 12, 10]. For AP-fields, the only necessary change is to use antiperiodic padding along the time dimension (wrapped-around values enter with a sign flip) and periodic padding along the space dimension.

The above construction of P- and AP-convolutions was illustrated with only one channel, but the extension to multiple channels is straightforward. Any non-linearity can be applied between convolutions for a P-field without spoiling translational equivariance; as mentioned above, the LeakyReLU activation function is used throughout this work. For convolutions applied to an AP-field, the non-linearities used as activation functions must be restricted to odd functions, for which we choose
\[
\varphi \mapsto \operatorname{sign}(\varphi) \log\left(1 + |\varphi|\right). \tag{D4}
\]
With a P-convolution, a bias can be applied along with the convolution at each step, since a bias is constant across all sites and thus transforms as a P-field. However, a traditional bias cannot be applied to the convolution of an AP-field without spoiling translational equivariance. For an AP-field ϕ, a bias-like operation ϕ → ϕ + b sign(ϕ) in terms of a constant b can be applied instead. To avoid potential issues with the non-differentiability of the sign function, we used a differentiable approximation, applying ϕ → ϕ + b tanh(ϕ/4) instead.
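The one-dimensional worked example above can be reproduced directly, together with its antiperiodic counterpart. This is an illustrative sketch; `pad_periodic`, `pad_antiperiodic`, and `conv1d_valid` are hypothetical helper names, not part of the implementation described in the text.

```python
import numpy as np

def pad_periodic(x, k):
    # Periodic padding by k sites on each side (P-field).
    return np.concatenate([x[-k:], x, x[:k]])

def pad_antiperiodic(x, k):
    # Antiperiodic padding: wrapped values pick up a sign flip (AP-field).
    return np.concatenate([-x[-k:], x, -x[:k]])

def conv1d_valid(x, kernel):
    # Plain "valid" correlation, sufficient for this symmetric kernel.
    k = len(kernel) // 2
    return np.array([x[i:i + len(kernel)] @ kernel for i in range(len(x) - 2 * k)])

field = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.ones(3)

out_p = conv1d_valid(pad_periodic(field, 1), kernel)
assert np.allclose(out_p, [8, 6, 9, 12, 10])   # matches the worked example

out_ap = conv1d_valid(pad_antiperiodic(field, 1), kernel)
# Only the boundary sites differ: wrapped neighbours enter with a minus sign.
assert np.allclose(out_ap, [-2, 6, 9, 12, 8])
```

Here the antiperiodic padding of [1, 2, 3, 4, 5] is [-5, 1, 2, 3, 4, 5, -1], so only the first and last outputs change relative to the periodic case.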
All of the above constructions (P- and AP-convolutions, non-linearities, and biases) are equivariant with respect to translations on the lattice. By stacking them, we create expressive translation-equivariant neural networks. Our neural networks can also jointly transform pairs of P- and AP-fields. For example, we found the following transformation to work well:

    Input: P-field P and AP-field A
    P'  = conv(P)
    A'  = conv(A)
    P'' = concatenate(P', |A'|)
    A'' = concatenate(A', P'A')
    Output: P-field P'' and AP-field A''

The group of translational symmetries of a staggered action described in Section II C only includes translations by even numbers of lattice sites. Implementing symmetries in a network that are not symmetries of the target function restricts expressivity and may make it difficult or impossible for the network to represent an effective approximation of the target function. For the models targeting the staggered fermion action in the study described in the main text, we therefore avoided encoding translational symmetry by an odd number of sites by explicitly breaking equivariance with respect to odd translations. For example, to break the symmetry under odd translations along the first dimension of a field x, we fold its even and odd indices along the channel dimension; this doubles the number of channels while halving the number of points along the first dimension. We then apply a convolution with stride 1 and kernel size 1, which mixes all the channels. Finally, we split the channels in two and fold them back along the first dimension to obtain a new field with the same lattice size as x. This approach mirrors the 'squeezing' operation applied in Real NVP flows [49].
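The fold-mix-unfold operation described above can be sketched in a few lines of NumPy. This is an illustration under simplifying assumptions: `squeeze`, `unsqueeze`, and the 2 × 2 mixing matrix `W` standing in for the kernel-size-1 convolution are all hypothetical names, and the field is one-dimensional with a single channel.

```python
import numpy as np

def squeeze(x):
    # Fold even/odd sites along the lattice dimension into the channel axis:
    # (C, L) -> (2C, L//2). Afterwards, only even translations of the
    # original field act as translations of the squeezed field.
    return np.concatenate([x[:, 0::2], x[:, 1::2]], axis=0)

def unsqueeze(y):
    # Inverse of squeeze: interleave the two channel halves back.
    c = y.shape[0] // 2
    x = np.empty((c, 2 * y.shape[1]))
    x[:, 0::2] = y[:c]
    x[:, 1::2] = y[c:]
    return x

rng = np.random.default_rng(4)
x = rng.normal(size=(1, 8))
W = np.array([[1.0, 2.0], [0.5, -1.0]])   # kernel-size-1 conv = channel mixing

out = unsqueeze(W @ squeeze(x))

# Even translations still commute with the transformation...
x_even = np.roll(x, 2, axis=1)
assert np.allclose(unsqueeze(W @ squeeze(x_even)), np.roll(out, 2, axis=1))

# ...while odd translations do not, which is the intended symmetry breaking.
x_odd = np.roll(x, 1, axis=1)
assert not np.allclose(unsqueeze(W @ squeeze(x_odd)), np.roll(out, 1, axis=1))
```

An odd shift of the original field swaps (and shifts) the two channel halves of the squeezed field, and a generic channel-mixing matrix does not commute with that swap, so equivariance under odd translations is indeed broken while even translations survive.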

2. Explicit symmetrization by group averages
Equivariance of neural networks under discrete symmetry groups can also be achieved by explicitly averaging over the group. For discrete translations with antiperiodic boundary conditions in particular, this approach avoids the restriction to odd activation functions and vanishing biases, but increases the computational effort by a factor proportional to the temporal extent of the lattice.
Let T^a_{(x,t)} denote a translation by (x, t) ∈ Z^d where antiperiodic boundary conditions are applied in time and periodic boundary conditions are applied in space, and let T^p_{(x,t)} denote a translation by (x, t) with periodic boundary conditions in all directions. The action of T^a and T^p is the same along all spatial dimensions, and we define T_x = T^p_{(x,0)} = T^a_{(x,0)}. For simplicity we now restrict to a two-dimensional L × L lattice with coordinates (x, t), but the following construction immediately generalizes to higher dimensions and asymmetric lattices. Both P-fields and AP-fields are maps from Z_L × Z_L to R^c, where c is the number of channels. Under lattice translations, P-fields are acted upon by T^p, while AP-fields are acted upon by T^a.
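A minimal realization of T^p and T^a on a one-dimensional time axis makes the period-2L structure of the antiperiodic translations explicit. The helper names `t_p` and `t_a` are illustrative, not part of the implementation described in the text.

```python
import numpy as np

def t_p(x, n):
    # Periodic translation by n sites along the time axis.
    return np.roll(x, n)

def t_a(x, n):
    # Antiperiodic translation: values wrapped around the time boundary
    # pick up a factor of -1, so the group has period 2L rather than L.
    L = len(x)
    n = n % (2 * L)
    sign = -1 if n >= L else 1
    n = n % L
    y = np.roll(x, n)
    y[:n] = -y[:n]
    return sign * y

L = 6
rng = np.random.default_rng(5)
x = rng.normal(size=L)

assert np.allclose(t_a(x, L), -x)                 # shift by L negates the field
assert np.allclose(t_a(x, 2 * L), x)              # the group has period 2L
assert np.allclose(t_a(t_a(x, 3), 4), t_a(x, 7))  # composition law holds
assert np.allclose(t_p(x, L), x)                  # periodic shifts have period L
```

The period 2L of T^a is why the group averages below must run over 2L time translations rather than L.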
Consider a function f that maps the pair (φ, ϕ) of a P-field and an AP-field to another field f(φ, ϕ). Assume that f is equivariant under spatial translations, for which the output transforms with periodic boundary conditions, that is,
\[
f(T_x \varphi, T_x \phi) = T_x f(\varphi, \phi). \tag{D5}
\]
Using averaging, we will now construct two maps u and v with the transformation properties
\[
u(T^p_{(x,t)} \varphi, T^a_{(x,t)} \phi) = T^p_{(x,t)}\, u(\varphi, \phi) \tag{D6}
\]
and
\[
v(T^p_{(x,t)} \varphi, T^a_{(x,t)} \phi) = T^a_{(x,t)}\, v(\varphi, \phi). \tag{D7}
\]
We define
\[
u(\varphi, \phi) = \frac{1}{2L} \sum_{n=0}^{2L-1} T^p_{(0,-n)}\, f\!\left(T^p_{(0,n)} \varphi,\, T^a_{(0,n)} \phi\right) \tag{D8}
\]
and
\[
v(\varphi, \phi) = \frac{1}{2L} \sum_{n=0}^{2L-1} T^a_{(0,-n)}\, f\!\left(T^p_{(0,n)} \varphi,\, T^a_{(0,n)} \phi\right). \tag{D9}
\]
We now wish to prove that the transformation properties in Equations (D6) and (D7) apply to these definitions. The proof is roughly the same for both cases, so we will only write it for Equation (D6):
\[
u(T^p_{(x,t)} \varphi, T^a_{(x,t)} \phi)
= \frac{1}{2L} \sum_{n=0}^{2L-1} T^p_{(0,-n)}\, f\!\left(T^p_{(0,n)} T^p_{(x,t)} \varphi,\, T^a_{(0,n)} T^a_{(x,t)} \phi\right)
= \frac{1}{2L} \sum_{n=0}^{2L-1} T^p_{(0,-n+t)}\, T_x\, f\!\left(T^p_{(0,n)} \varphi,\, T^a_{(0,n)} \phi\right)
= T^p_{(x,t)}\, u(\varphi, \phi). \tag{D10}
\]
The second equality in Equation (D10) was obtained using the change of variables n → n − t together with the equivariance property given in Equation (D5), and the last follows from T^p_{(0,t)} T_x = T^p_{(x,t)}.
The functions u and v may be used to define equivariant affine coupling layers for the construction of equivariant flows. To achieve equivariance, the underlying function f needs to be evaluated 2L times instead of once, hence the aforementioned increase in computational cost. Since the masked affine couplings employed in this work already restrict the translational equivariance to multiples of two, one may also consistently use only every second term in the sums defining u and v without breaking the symmetry further, implying a factor L increase of the cost instead of 2L. Still, the additional computational requirements are significant compared to the approach detailed in Appendix D 1, and for large-scale implementations one may have to partially trade equivariance against efficiency by excluding more terms from the sums.
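The group-averaging construction for u can be checked numerically in the simplest setting of a one-dimensional (time-only) lattice, where spatial equivariance is trivial and any map f becomes equivariant after averaging. This is an illustrative sketch with hypothetical helper names; f here is an arbitrary, deliberately non-equivariant map.

```python
import numpy as np

def t_p(x, n):
    # Periodic translation along the time axis.
    return np.roll(x, n)

def t_a(x, n):
    # Antiperiodic translation: values wrapped in time pick up a minus sign.
    L = len(x)
    n = n % (2 * L)
    sign = -1 if n >= L else 1
    n = n % L
    y = np.roll(x, n)
    y[:n] = -y[:n]
    return sign * y

def f(phi, psi):
    # A deliberately non-equivariant map (site-dependent weights).
    w = np.arange(1.0, len(phi) + 1.0)
    return np.tanh(w * phi) + phi * psi ** 2

def u(phi, psi):
    # Group average over all 2L time translations; the result transforms
    # as a P-field, as in the definition of u in the text.
    L = len(phi)
    return sum(t_p(f(t_p(phi, n), t_a(psi, n)), -n) for n in range(2 * L)) / (2 * L)

L = 4
rng = np.random.default_rng(6)
phi = rng.normal(size=L)   # P-field
psi = rng.normal(size=L)   # AP-field

t = 3
lhs = u(t_p(phi, t), t_a(psi, t))
rhs = t_p(u(phi, psi), t)
assert np.allclose(lhs, rhs)   # u is equivariant after averaging

# The raw f is not equivariant, so the averaging was actually necessary:
assert not np.allclose(f(t_p(phi, t), t_a(psi, t)), t_p(f(phi, psi), t))
```

The 2L evaluations of f per call to u correspond exactly to the factor-2L (or factor-L, when only even translations are kept) cost increase discussed above.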