Domain Wall Fermion QCD with the Exact One Flavor Algorithm

Lattice QCD calculations including the effects of one or more non-degenerate sea quark flavors are conventionally performed using the Rational Hybrid Monte Carlo (RHMC) algorithm, which computes the square root of the determinant of $\mathscr{D}^{\dagger} \mathscr{D}$, where $\mathscr{D}$ is the Dirac operator. The special case of two degenerate quark flavors with the same mass is described directly by the determinant of $\mathscr{D}^{\dagger} \mathscr{D}$ --- in particular, no square root is necessary --- enabling a variety of algorithmic developments, which have driven down the cost of simulating the light (up and down) quarks in the isospin-symmetric limit of equal masses. As a result, the relative cost of single quark flavors --- such as the strange or charm --- computed with RHMC has become more expensive. This problem is even more severe in the context of our measurements of the $\Delta I = 1/2$ $K \rightarrow \pi \pi$ matrix elements on lattice ensembles with $G$-parity boundary conditions, since $G$-parity is associated with a doubling of the number of quark flavors described by $\mathscr{D}$, and thus RHMC is needed for the isospin-symmetric light quarks as well. In this paper we report on our implementation of the exact one flavor algorithm (EOFA) introduced by the TWQCD collaboration for simulations including single flavors of domain wall quarks. We have developed a new preconditioner for the EOFA Dirac equation, which both reduces the cost of solving the Dirac equation and allows us to re-use the bulk of our existing high-performance code. Coupling these improvements with careful tuning of our integrator, the time per accepted trajectory in the production of our 2+1 flavor $G$-parity ensembles with physical pion and kaon masses has been decreased by a factor of 4.2.

A number of recent developments in the HMC algorithm used by the RBC/UKQCD collaboration have driven down the cost of simulating degenerate pairs of isospin-symmetric quarks with the same mass. These developments include: extensive force tuning via Hasenbush mass preconditioning [1], the zMöbius domain wall fermion action [2], reduced L s approximations to the light quark determinant [2], and the use of implicitly restarted, mixedprecision defect correction methods in the conjugate gradient algorithm [3]. In Table I we list timings for a recent large-scale calculation which utilizes these techniques. We now find that the single-flavor strange and charm quark determinants, which we simulate using the RHMC algorithm, are collectively the most expensive part of the calculation. To address this, we have turned to exploring TWQCD's recently proposed exact one flavor algorithm (EOFA), which allows for simulating single quark flavors without the need for RHMC [4].
Preliminary results have suggested that EOFA simulations can outperform RHMC simulations, both in terms of computer time and a reduced memory footprint, while producing exactly the same physics [5,6].  N f = 2 + 1 + 1 ensemble with physical quark masses and a −1 ≈ 3 GeV on a 12,288-node Blue Gene/Q partition [2].
The RBC/UKQCD collaboration's ongoing efforts to probe direct CP -violation in K → ππ decays provide a second motivation for exploring EOFA. The collaboration has recently reported the first calculation of the ∆I = 1/2 K → ππ decay amplitude with physical kinematics in Ref. [7], which, when combined with previous results for the ∆I = 3/2 amplitude [8] determines the Standard Model CP -violating parameters and entirely from first principles. An important ingredient in this calculation was the introduction of G-parity boundary conditions for the quark fields [7,9]: since the pion is G-parity odd, the pion momenta are quantized along G-parity directions as allowing the ensemble parameters to be tuned such that the K → ππ decay has both physical kinematics and the final pions in the ground state. Since the G-parity transformation G = for ensembles with periodic boundary conditions -including defect correction solvers, the forecasted force gradient integrator [10], and Hasenbusch mass preconditioning -are not applicable or of limited utility for RHMC simulations, but are expected to perform well in the context of EOFA. More generally, since there is no straightforward way to start the multishift conjugate gradient solver used for RHMC with a nonzero initial guess, techniques which rely on forecasting or restarting the solver are not applicable.
In this work we discuss the RBC/UKQCD collaboration's implementation and tests of the exact one flavor algorithm, as well as the use of EOFA in generating gauge field configurations for our ongoing first-principles calculation of the ratio of Standard Model parameters / from ∆I = 1/2 K → ππ decays with G-parity boundary conditions. We have independently implemented EOFA in the Columbia Physics System (CPS), BAGEL fermion sparse matrix library (BFM), and the Grid data parallel C++ QCD library (Grid), for Shamir and Möbius domain wall fermions, with periodic, anti-periodic, and G-parity boundary conditions. We will demonstrate in the following sections that a significant improvement over the RHMC algorithm in terms of wall clock time is indeed possible with EOFA after introducing a variety of preconditioning and tuning techniques. Early work in this direction was presented at the 34th International Symposium on Lattice Field Theory [6]; here we will elaborate on the details and discuss our first large-scale EOFA calculation.

II. THE EXACT ONE FLAVOR ALGORITHM
The exact one flavor algorithm was developed by the TWQCD collaboration and used to enable efficient simulations of single quark flavors on GPU clusters, where memory usage is a significant constraint. In Ref. [11] the authors discuss their construction of a positivedefinite pseudofermion action describing a single flavor of Wilson or domain wall quark, and elaborate on the details of this construction in Ref. [4]. The key is their observation that a ratio of determinants of domain wall fermion (DWF) Dirac operators can be factorized as with M L and M R Hermitian and positive-definite. In a subsequent paper the authors benchmark EOFA against RHMC for N f = 1 and N f = 2 + 1 lattice QCD simulations, and demonstrate a number of advantages of the EOFA formalism [5]. These include substantial reductions in the pseudofermion force and in the memory footprint of the algorithm, since, in the context of EOFA, inversions of the Dirac operator can be performed using the or-dinary conjugate gradient (CG) algorithm rather than the multishift CG used for RHMC.
They ultimately find that they are able to generate HMC trajectories 15-20% faster using EOFA rather than RHMC after retuning their integration scheme to take advantage of these properties. More recently, TWQCD has used EOFA to generate N f = 2 + 1 + 1 domain wall fermion ensembles with dynamical strange and charm quarks [12].
We note that the construction of the exact one flavor pseudofermion action has been detailed by TWQCD in Ref. [4,11] and summarized in our own formalism in Ref. [6]. We will not repeat this discussion here, other than to give a brief overview and to introduce the notation used in this work. We write the 5D Möbius domain wall fermion (MDWF) operator D DWF in terms of the 4D Wilson Dirac operator D W and 5D hopping matrix L ss as (D DWF ) xx ,ss = ( (c + d) with Here x and s are spacetime indices in the 4D bulk and along the fifth dimension, respectively, with L s denoting the total number of s sites, P ± = (1 ± γ 5 )/2 denoting the chiral projection operators, and (R 5 ) ss ≡ δ s,Ls−1−s denoting the operator which performs a reflection in the fifth dimension. We recover four-dimensional quark fields q and q with definite chiralities from the five-dimensional quark fields ψ and ψ described by D DWF at the boundaries of the Green's functions constructed from q and q approximate continuum QCD arbitrarily well in the limit of vanishing lattice spacing and infinite 5D spacetime volume. with The operator D relating D DWF and D EOFA has no dependence on the gauge field, so we are free to replace D DWF with D EOFA in Eqn.
(2) without modifying physical observables described by a properly normalized path integral. In fact, it can be shown analytically using the explicit form of D listed in Appendix A that where V = L 3 T is the 4D spacetime volume. This substitution facilitates the construction of a proper action since the operator γ 5 R 5 D EOFA is manifestly Hermitian for any choice of the Möbius scale α, whereas D DWF satisfies a less trivial γ 5 -Hermiticity condition when α = 1 [15]. However, this comes at the cost of substantially more expensive inversions, since D EOFA is dense in ss whereas D DWF has a well-known tridiagonal block structure.
After introducing D EOFA , TWQCD's construction proceeds by applying the Schur identity to D EOFA , treated as a 2 × 2 block matrix in its spinor indices, and rearranging terms to arrive at the right-hand side of Eqn. (2). Crucially, factors of γ 5 R 5 can be freely inserted under the determinant to replace D EOFA with the Hermitian operator H ≡ γ 5 R 5 D EOFA , since det(γ 5 ) = det(R 5 ) = 1. The final form of the exact one flavor pseudofermion action is In Appendix A we collect explicit expressions for k, Ω ± , ∆ ± , D EOFA , and D for Shamir

III. SUMMARY OF ENSEMBLES USED IN THIS WORK
The properties of the lattices used in this work are summarized in Tables II and III. In all cases we use the Iwasaki gauge action (I) [16], and on some ensembles supplement this with the dislocation suppressing determinant ratio (DSDR) [17,18]; we abbreviate the combined action including both terms as "ID". The additional DSDR term is designed to suppress the dislocations of the gauge field associated with tunneling between topological sectors, thereby reducing the degree of residual chiral symmetry breaking. For strong coupling simulations, where these dislocations occur frequently, the DSDR term reduces the costs associated with light quark masses while still maintaining good topological sampling. We simulate N f = 2+1 quark flavors using domain wall fermions, with either the Shamir (DWF) [19,20] or Möbius (MDWF) [21][22][23] kernel. Finally, on ensembles marked "-G" we use G-parity boundary conditions in one or more of the spatial directions.   The 16I ensemble was first generated and used to study light meson spectroscopy with domain wall fermions in Ref. [24]. The 16I-G ensemble is identical to the 16I ensemble except for the boundary conditions along the x-direction, which have been changed from periodic to G-parity. Likewise, the parameters of the 16ID-G ensemble have been chosen based on a series of β = 1.75 DSDR ensembles generated in Ref. [25], but have G-parity boundary conditions in all three spatial directions. Collectively, these three lattices are used as inexpensive, small-volume test ensembles with unphysical, heavy pion masses to perform cross-checks of the EOFA algorithm and its implementation in the BFM and CPS code libraries. The larger 24ID [26] and 32ID-G [27] ensembles have physical pion masses and are currently being generated as part of production RBC/UKQCD calculations.  Refs. [24] and [7], respectively. On the 16I-G (16ID-G) ensemble we assume the lattice cutoff is the same as the 16I (32ID-G) ensemble since the same action and value of β has been used. The pion masses on the 16I-G and 16ID-G ensembles have been extracted using the fitted value of the lowest energy pion states from Table VIII and the continuum dispersion relation. Finally, the determination of the lattice scale for the 24ID ensemble was performed in Ref. [28], and the determination of the pion mass in Ref. [26].

IV. HYBRID MONTE CARLO WITH EOFA
In lattice QCD correlation functions are computed in terms of a discretized Euclidean path integral Here U is the gauge field, ψ f is the quark field associated with flavor f , and S[U, ψ f , ψ f ] is the action, which decomposes into a sum of contributions from the gauge field, fermions, and possibly other terms (e.g. the dislocation suppressing determinant ratio). To avoid having to deal with anticommuting Grassman variables in a computer, dynamical fermion flavors are integrated out and then reintroduced in terms of bosonic "pseudofermion" fields φ as provided M is positive-definite. While pseudofermions can be represented straightforwardly in a computer, they come at the cost of applications of M −1 rather than M , which is not typically available in an explicit form. Even after discretization the integration in Eqn. (11) is far too expensive to perform directly due to the enormous number of degrees of freedom on a typical lattice. Instead, Monte Carlo techniques are used to ergodically sample a sequence of representative configurations of the gauge field {U i }, for which The standard Monte Carlo technique used in modern lattice QCD calculations is known as the Hybrid Monte Carlo (HMC) algorithm.
HMC generates a Markov chain of gauge field configurations {U i } by evolving a Hamiltonian system in unphysical Molecular Dynamics (MD) "time". This Hamiltonian system is constructed by treating U µ (x) as a generalized coordinate, introducing an su(3)-valued conjugate momentum π µ (x), and forming the standard Hamiltonian The associated equations of motion can then be integrated using numerical integration techniques. The integration is performed over intervals of length ∆τ -referred to as a single MD trajectory -as a sequence of N small steps δτ , with N = ∆τ /δτ . Finite precision integration errors are corrected stochastically with a Metropolis accept/reject step: after every N integration steps by δτ the total change in the Hamiltonian ∆H is computed, and the current gauge field U µ (x) is accepted as the next configuration in the Markov chain with probability One can show that the resulting algorithm satisfies detailed balance provided the scheme used to numerically integrate Eqn. (15) is reversible [29]. Ergodicity is achieved by performing a heatbath step each time the integration is restarted to pick a new conjugate momentum In the pseudofermion formalism applications of the operator (D † D) −1/2 are approximated by a matrix-valued function f (D † D), where f (x) is a suitably constructed approximation to the inverse square root, valid over the spectral range of D † D. Variants of the HMC algorithm which construct f from different classes of functions have been proposed and used in the literature; the most common is the rational HMC (RHMC) algorithm [30], where is a rational function. While rational functions are in many ways a good choice -they are economical in the sense that the inverse square root can usually be well-approximated by a modest number of terms, and the multishift CG algorithm can be used to efficiently invert This product can then be represented as a path integral over a bosonic pseudofermion field with a two-term action (Eqn. (10)) leading to an algorithm which is "exact" in the sense that it avoids the numerical approximations required to implement the square root in RHMC (Eqn. (18)) and related HMC variants. EOFA is also expected to be somewhat faster than RHMC, since there is no rational approximation entering into evaluations of the Hamiltonian or the pseudofermion force, eliminating the overhead associated with multishift CG. In the remainder of this section we elaborate on the details of the action, heatbath step, and pseudofermion force entering into the Hamiltonian equations of motion (Eqn. (15)) for HMC with EOFA.

A. Action
The EOFA action (Eqn. (10)) computes a ratio of determinants of D EOFA upon integrating out the pseudofermion fields (Eqn. (20)). This ratio can be related to the conventional determinant ratio computed by the RHMC algorithm through Eqns. (6) and (8), leading to the relationship We use this relationship as a test of the equivalence of RHMC and EOFA, as well as our implementation of the EOFA action, by stochastically computing the left side of Eqn. (21) with the RHMC action and the right side with the EOFA action (Eqn. (10)) on the same gauge field configuration.
Observing that we can, in general, rewrite a determinant as suggests the following simple Monte Carlo integration scheme: we draw random pseudofermion vectors by independently sampling the real and imaginary parts of each component from the standard normal distribution N (µ = 0, σ = 1), and compute the expectation value where the average is computed using the jackknife resampling technique. This will accurately approximate the true log determinant for finite, realistically calculable values of N provided m 1 and m 2 are sufficiently close that the integrand is well-approximated by a Gaussian with unit variance. To address this latter systematic, we consider splitting Eqn. (21) as a product of determinants with equally-spaced intermediate masses and study the dependence of the result on N m (this procedure is identical to the method introduced in Ref. [31] for computing quark mass reweighting factors). In the upper panel of Figure 1 we plot the log determinants of M −1 RHMC and M −1 EOFA as a function of N m , with N = 10 stochastic evaluations, computed using a single thermalized trajectory of the 16I, 16I-G, and 16ID-G ensembles. For the case of the 16ID-G ensemble, which uses the Möbius DWF fermion action, we also include the overall constant multiplying the right side of Eqn. (21) so that in all cases we are computing the same determinant ratio of D DWF using either action.
We observe, as expected, that both formalisms agree for sufficiently large N m . Likewise, we observe that at sufficiently small N m the calculation generally becomes unreliable since we do not attempt to account for the systematic error associated with approximating the integrand of the path integral by a Gaussian with unit variance (i.e. setting Σ = 1 in Eqns. (23) and (24).). In both cases "sufficiently" small or large N m is controlled by the size of the splitting between m 1 and m 2 . We also observe that, for a given choice of N m and N , both the statistical and systematic errors of the determinant ratio computed via EOFA are suppressed relative to the errors of the determinant ratio computed via RHMC. We argue that the observed error suppression can be explained by comparing the spectrum of M RHMC to the spectrum of M EOFA , which we plot in the lower panels of Figure 1 for a very small lattice volume (4 5 ) where the complete spectrum can be computed directly. While both operators have similar condition numbers, we find that most of the spectrum of the EOFA action is concentrated into a small interval [1, 1 + ∆] with ∆ ∼ O(0.1), leading to an action which is easier to estimate stochastically.
We propose that TWQCD's EOFA construction can be thought of as a kind of preconditioning which computes the same determinant ratio as RHMC but modifies the operator inside the determinant (M RHMC ), mapping its spectrum onto a more compact interval. This suggests an additional application of the EOFA formalism: quark mass reweighting factors can be computed substantially more cheaply using the EOFA action than using the RHMC action, especially at light quark masses, even if the ensemble was generated using RHMC.
This could be useful, for example, to include the dynamical effects of isospin breaking in ensembles generated with isospin-symmetric up and down quarks.

B. Heatbath
At the beginning of each HMC trajectory we wish to draw a random pseudofermion field φ To do this, we first draw a random vector η by independently sampling the real and imaginary parts of each component from the normal distribution with µ = 0 and σ 2 = 1/2, and then compute φ = M −1/2 EOFA η. As with the RHMC algorithm we approximate the inverse square root by an appropriately constructed rational function, but we stress that in the context of EOFA this rational approximation enters only into the heatbath and is not necessary to compute the EOFA action itself or the associated pseudofermion force. Naively applying a rational approximation with the form of Eqn. (19) to the operator M EOFA results in where we have defined γ l ≡ (1 + β l ) −1 . In this form, the nested inversions required to seed the heatbath would make EOFA prohibitively expensive. However, the Woodbury matrix (28) and the cancellation between cross-terms involving products of the chiral projection operators can be used to manipulate this expression into the equivalent form With this expression the EOFA heatbath step can be performed at the cost of 2N p CG inversions using a rational approximation with N p poles. Unlike the case of RHMC, multishift CG algorithms are not applicable to the EOFA heatbath since each of the 2N p operators in Eqn. (29) generates a different Krylov space. Furthermore, since the operators ∆ ± P ± have a large number of zero modes and are therefore not invertible, there is no simple transformation by which this system can be recast into a form amenable to multishift CG.
In the left panel of Figure 2 we test Eqn. (29) on a single thermalized configuration of the 16I ensemble by computing the quantity after seeding the pseudofermion field φ with a random Gaussian vector η. In exact arithmetic ε = 0; in practice it measures the relative error in the heatbath step arising from the choice of CG stopping conditions and rational approximation to the inverse square root. We repeat this calculation, varying the number of poles in the rational approximation but keeping the stopping conditions fixed, and observe that ε reaches the limits of double-precision arithmetic even with a relatively modest number of poles compared to what is typically required to compute non-integer powers of D † D accurately in the context of RHMC. In the right panel of

C. Pseudofermion Force
The pseudofermion force measures the back-reaction of the pseudofermions on the HMC system (Eqn. (15)) under an infinitesimal variation of the gauge field. In our notation {T a } is a basis for the Lie algebra su (3), with the CPS normalization convention The EOFA pseudofermion force can be worked out explicitly by differentiating the EOFA action (Eqn. (10)) and applying the matrix identity with and Standard manipulations can be used to write a more general Dirac bilinear as allowing Eqn. (34) to be efficiently computed locally in terms of a trace over spinor indices, at the cost of the two inversions required to form χ 1 and χ 2 . Since Dirac bilinears of the form a † ∂ a x,µ D W b enter into the pseudofermion forces associated with many of the standard pseudofermion actions for Wilson and domain wall fermions -including the RHMC action -implementing the EOFA pseudofermion force requires little new code beyond what is required to implement the EOFA Hamiltonian.

V. SMALL VOLUME REPRODUCTION TESTS
To further test our implementation of EOFA we have reproduced the 16I (16I-G, 16ID-G) ensemble using EOFA for the strange quark (light quarks) in place of RHMC. For these tests we have made no serious effort to tune EOFA for performance; we have simply checked that replacing RHMC with EOFA, but leaving all other details of the simulation fixed, has no discernible impact on physical observables such as the average plaquette, topological susceptibility, and low energy spectrum.

A. Ensemble Generation
The details of the integrator parameters and nesting are summarized in Tables IV and   V, respectively. We use the abbreviations and

B. Basic Observables
In Table VI we summarize results for the average plaquette P , light quark and strange quark chiral ψψ and pseudoscalar ψγ 5 ψ condensates, and the topological susceptibility where the bin size has been conservatively chosen based on the integrated autocorrelation times measured in Ref. [24] for the 16I ensemble and Ref. [25] for a series of β = 1.75 DSDR ensembles. We expect that the runs produced for this study are too short to reliably compute integrated autocorrelation times directly, but note that there is no evidence of a difference in an integrated autocorrelation time between the EOFA and RHMC ensembles in the time evolution plots of Section C.

C. Low Energy Spectra
In Table VII we list results for the pion, kaon, Omega baryon, and residual masses, computed on the 16I ensemble. These calculations were performed using a measurement package previously introduced in Ref. [13], and based on the all-mode averaging (AMA) technique of Ref. [34].  The pion and kaon masses were extracted by fitting to the asymptotic, large Euclidean time limit of the respective two-point correlation function, where O denotes the choice of interpolating operator, X ∈ {π, K} is the ground state to which O couples, and V and T are the spatial volume and temporal extent of the lattice, respectively. In particular, we performed simultaneous fits to the P P LW , P P W W , and , and the first (second) superscript denotes the sink (source) type. The Omega baryon mass was extracted from the two-point correlation function with the interpolating operator and simultaneously fit to double exponential ansätze with common mass terms Finally, the residual mass was determined by fitting the ratio to a constant, where j a 5q is the five-dimensional pseudoscalar density evaluated at the midpoint of the fifth dimension, and j 5 a is the physical pseudoscalar density constructed from the surface fields.
Corresponding effective mass plots can be found in Appendix C.
In addition, we have also measured the ground state pion energy, kaon mass, and residual mass on the 16I-G and 16ID-G ensembles. While the ground state of the kaon is at rest, the ground state of the pion has nonzero momentum p 100 = (±π/L, 0, 0) on the 16I-G ensemble and p 111 = (±π/L, ±π/L, ±π/L) on the 16ID-G ensemble due to the boundary conditions.
These calculations make use of an extension of the AMA measurement package described above to G-parity ensembles; as discussed in Ref. [9], this requires the inclusion of additional QCD simulation with domain wall quarks performed in Ref. [4]. Following this observation, we examine the forces on the RHMC and EOFA variants of the 16I ensemble. We define a norm on the space of su(3)-valued pseudofermion force matrices F a µ (x) ≡ ∂ a x,µ S(U ) by and consider two measures of the force associated with a given configuration of the gauge field: the first is the RMS force and the second is the maximum force in both cases taken over all lattice sites and link directions. While we expect Equation (48) to be a more pertinent definition in the context of HMC simulations -we have empirically found that acceptance is controlled by the size of F max -both F RMS and F max are, a priori, reasonable measures of the pseudofermion force.
In Figure 3 we compare histograms of F RMS and F max between the RHMC and EOFA 16I ensembles. Each data point corresponds to a single evaluation of the pseudofermion force falling between MD trajectories 500 and 1500. We find that comparing the relative sizes of the RHMC and EOFA forces is highly dependent on whether one chooses F RMS or F max ; the mean EOFA F RMS is roughly 30% smaller than the mean RHMC F RMS , but the distributions of F max are nearly indistinguishable. This observation suggests that while the EOFA force distribution may have a smaller mean than the RHMC force distribution, the EOFA distribution also likely has longer tails, such that the largest forces have similar magnitudes. Since we expect the magnitude of the largest forces to correlate more strongly with the efficiency of the integrator than the magnitude of the average forces, as we have argued above, we interpret these results as suggesting that the optimal step size for an EOFA evolution should be similar to that of an RHMC simulation with the same mass parameters, even if the average force is somewhat smaller.
TWQCD has also observed a large hierarchy of scales in the pseudofermion forces associated with each of the two terms in Eqn. (34); in Ref. [5] they find that the average force We also note that in these small volume test runs we have not considered applying the While the left-hand side can be simulated using a single pseudofermion field, the associated forces can be large if m 1 m 2 , requiring a small step size to maintain reasonable acceptance. The right-hand side, in contrast, involves N + 1 independent pseudofermion fields, but with possibly substantially reduced forces, allowing larger step sizes to be used. For light m 1 one typically observes that the gain from increasing the step size offsets the cost of simulating extra heavy flavors, leading to a more efficient simulation. In Section VII we demonstrate that Hasenbusch preconditioning allows for a substantial speed-up in the context of the 32ID-G ensemble. We also note that in addition to reducing the size of the pseudofermion forces, the Hasenbusch technique preconditions the EOFA force in the sense that the size hierarchy between the left-handed and right-handed force contributions to a single determinant disappears in the limit m i → m i+1 . In practice, we find that the mass preconditioned simulation has comparable left-handed and right-handed force contributions even in the RMS sense.

VI. OPTIMIZATION AND TUNING
In this section we discuss preconditioning and algorithmic techniques which reduce the cost of EOFA simulations. In some cases these are extensions of well-known lattice techniques to the EOFA formalism, while in other cases they are specific to EOFA. We illustrate these techniques using bechmark tests computed with the physical quark mass, Möbius DWF 24ID ensemble, and report timing results for code written in the Columbia Physics System (CPS) and running on 256-node or 512-node Blue Gene/Q partitions.

A. Inversions of D EOFA
Since the majority of the computational effort in an HMC simulation is associated with repeatedly inverting the Dirac operator, techniques to more efficiently apply the Dirac operator or to otherwise accelerate these inversions can have a dramatic impact on the overall efficiency of the integrator. To address the former, we make use of the BAGEL assembler generation library [35] to produce highly optimized kernels and fermion solvers for the Blue Gene/Q hardware. To address the latter, we make use of multiple preconditioning techniques, as well as a mixed precision defect correction CG solver.
The first preconditioning technique we apply -"even-odd" or "red-black" preconditioning -is well-known in the lattice QCD community. Lattice sites are labeled as even if (x + y + z + t) ≡ 0 (mod 2), or odd if (x + y + z + t) ≡ 1 (mod 2), inducing a 2 × 2 block structure on fermion operators Standard tricks can then be used to relate the linear system M ψ = φ to a better conditioned linear system involving only the odd sub-lattice; this preconditioned system is substantially cheaper to invert since the size of the problem has been halved. After inverting on the odd sub-lattice, the even component of ψ can also be recovered at modest cost, without ever needing to explicitly invert on the even sub-lattice. The details of this construction, and its extension to EOFA, are described in Appendix B.
The second preconditioning technique we apply -Cayley-form preconditioning -is unique to EOFA, and was introduced in Ref. [6]. The generic linear system one needs to solve in the context of EOFA has the form where H = γ 5 R 5 D EOFA . For Möbius domain wall fermions D EOFA is dense in ss , and thus considerably more expensive to invert than D DWF , which has a tridiagonal ss stencil, in terms of wall clock time. However, Eqn. (6) suggests that Eqn. (51) can be related to an equivalent system in terms of D DWF by using D −1 as a preconditioner. We elaborate on the mathematical details in Appendix B 2, and, in particular, demonstrate that ∆ ± D has a relatively simple, rank-one form, allowing for substantially more efficient EOFA inversions -even when β = 0 -by working with the preconditioned system. This technique also has the advantage that it allows for EOFA simulations which re-use existing high-performance code for applying D DWF with little modification.
Finally, we use a restarted, mixed precision defect correction solver to perform the conjugate gradient inversions of the fully preconditioned EOFA system. For memory bandwidthlimited calculations -such as applying the Dirac operator -single precision computations can be performed at approximately half the cost of full double precision computations. In the defect correction approach to mixed precision CG, the following algorithm is used: 1. Solve the Dirac equation in single precision arithmetic using a reduced stopping tolerance (typically 10 −4 or 10 −5 ).
2. Compute the current residual using the (single precision) solution in full double precision arithmetic.
3. If the desired final tolerance (typically 10 −8 or smaller) has been reached, stop. Otherwise, return to step 1, using the residual vector computed in step 2 as the new CG source.
We observe that this algorithm outperforms straight double precision CG by approximately a factor of 2 -as one would expect if the calculation is truly memory bandwidth-limitedprovided the local lattice volume on each node is sufficiently large to avoid communications bottlenecks.
In Figure 5 we plot the CG residual as a function of the wall clock running time of the inverter for a series of benchmark inversions of Equation (51) on the 24ID ensemble.
These benchmarks show the inverter performance as we sequentially introduce even-odd preconditioning, Cayley-form preconditioning, and finally, mixed precision CG. We also plot the time required to solve the family of linear systems using multishift CG for the same set of poles {β k } used in the rational approximation to x −1/2 in the RHMC evolution that generated the 24ID ensemble. This allows a baseline estimate of the cost of evaluating the EOFA Hamiltonian or pseudofermion force against the cost of evaluating the RHMC Hamiltonian or pseudofermion force at the same quark mass. We observe a factor of 3.9 speed-up for fully preconditioned EOFA over the evenodd preconditioned RHMC system. In both cases the underlying operator being inverted is D DWF ; the slower RHMC benchmark demonstrates the overhead associated with multishift CG relative to solving a single system with standard CG, both due to the inability to fully utilize mixed precision methods and due to the additional linear algebra required at each iteration.

B. Heatbath
Achieving the full performance improvement suggested by the inversion benchmarks in Section VI A is complicated by the form of the EOFA heatbath, which is expected to be  29)) requires two independent CG inversions per pole used in the rational approximation to x −1/2 , since multishift CG is not applicable: we use two algorithmic techniques to reduce this cost. The first is a forecasting technique initially proposed by Brower et al. [36] in the context of more general HMC simulations, and later used successfully by TWQCD in the context of the EOFA heatbath [4]. The idea is the following: given a set of solutions {ψ k } N k=1 to Equation (51) for N different poles {β k } N k=1 , one can use the linear combination minimizing the functional as the initial CG guess for the next inversion with pole β N +1 . The coefficients c k satisfy and can be computed explicitly using e.g. Gauss-Jordan elimination. Since Equation (54) is the same functional minimized by the conjugate gradient algorithm itself, accurate initial guesses can be computed for modest N provided the {β k } N +1 k=1 are similar in magnitude. In Figure 6 we test this forecasting technique using the 24ID ensemble and a rational approximation with 8 poles, and find that the iteration count required to solve Eqn. (51) to a tolerance of 10 −10 is more than halved for the last few poles.  (29), while the second 8 poles (β = −β l γ l ) are associated with the second (RH) term. We find no improvement from using solutions to the LH system to forecast solutions to the RH system and vice-versa, since the Dirac operator being inverted in either case is evaluated with a different quark mass.
The second technique we have used to accelerate the heatbath is motivated by observing that the coefficients entering into Equation (29) span several orders of magnitude for a typical rational approximation to x −1/2 . We find typical values kα l γ 2 suggesting that the inversions can be performed with reduced stopping tolerances relative to the desired accuracy of M −1/2 EOFA ψ, since the solution vectors are ultimately multiplied by small coefficients when the result is formed. We have explored the following simple optimization scheme to relax the stopping conditions for each pole: 1. Choose a desired tolerance for the heatbath, ε tol , where ε is defined by Equation (30). 2. Choose one of the inversions required to compute M −1/2 EOFA according to Equation (29), and relax the stopping tolerance until the overall error in the heatbath ε reaches ε tol .

Iterate over each inversion until all stopping conditions have been tuned.
We report results for the 24ID ensemble in Table IX. Using a rational approximation with 6 poles, and ε tol = 10 −10 , we observe that the total heatbath time is more than halved while only slightly increasing the error. We have also checked that the final error and heatbath running time after tuning is insensitive to the exact order in which the stopping tolerances are tuned.  The relative error (ε) and total running time for the EOFA heatbath on the 24ID ensemble before and after applying the tuning algorithm discussed in the text.

VII. LARGE-SCALE EOFA CALCULATIONS
In this section we turn to two ongoing ensemble generation calculations currently being performed by the RBC/UKQCD collaboration. The first is a strong-coupling N f = 2 + 1 24 3 ×64×24 Iwasaki+DSDR lattice (24ID) intended for exploratory studies and calculations requiring high statistics [26]. The second (32ID) has been used for a first-principles calculation of the ratio of Standard Model CP -violation parameters / from ∆I = 1/2 K → ππ decays in Ref. [7]. RBC/UKQCD is currently generating more gauge field configurations to reduce the statistical errors in the ∆I = 1/2 decay amplitudes. Both ensembles have

A. 24ID Ensemble
We use the 24ID ensemble as a straightforward benchmark of RHMC against an equivalent EOFA simulation to describe a physical heavy quark flavor. Here this is the strange quark, but N f = 2+1+1 simulations with dynamical strange and charm quarks are another obvious target of EOFA. We make no serious attempt to retune the integrator after switching to EOFA beyond tuning the heatbath step with the following procedure: 1. Compute the largest and smallest eigenvalues of M EOFA (Eqn. (10)) for a few thermalized configurations of the gauge field, and use these measurements to inform the   and without ("dense") Cayley-form preconditioning.
We observe that the dense EOFA formalism is actually somewhat slower than RHMC: the additional complexity of the EOFA heatbath, together with the more expensive inversions of the dense 5D operator D EOFA , negate the expected performance gains from the simpler forms of the Hamiltonian and force evaluations. We emphasize, however, that we have made no attempt to retune the integrator details to optimize for EOFA. After introducing Cayley-form preconditioning -so that we are inverting the tridiagonal operator D DWF rather than D EOFA when we solve Equation (51) -we find that EOFA outperforms RHMC by a significant factor of 3.5.

B. 32ID-G Ensemble
One particularly promising feature of EOFA in the context of G-parity ensembles is the potential for aggressive Hasenbusch mass preconditioning of the light quark determinant; this makes the 32ID-G ensemble a particularly interesting case study since the EOFA formalism is used to describe a physical mass light quark pair. In Ref. [7] the RBC/UKQCD collaboration observed that mass preconditioning is not particularly effective for the RHMC light quark determinant, since each molecular dynamics step requires one multishift inversion of D † D evaluated at the numerator quark mass and two multishift inversions of D † D evaluated at the denominator quark mass. The latter two solves become prohibitively expensive if many intermediate masses are introduced, negating the expected gain from integrating the preconditioned pseudofermion forces with larger step sizes. The EOFA force, on the other hand, is no more expensive to evaluate than the force associated with the standard quotient action (Eqn. (38)), so it is natural to expect better performance from Hasenbusch preconditioning.
In Table XIII  acceptance. The initial RHMC scheme used in Ref. [7] corresponds to run 1, and the final EOFA scheme we have adopted for our continuing ensemble generation corresponds to run 12.
We find, in practice, that Hasenbusch mass preconditioning is extremely effective for the EOFA light quark determinant. In addition to reducing the size of the pseudofermion force, we also observe that the largest eigenvalue of the EOFA action, Equation (10) TABLE XIII: HMC details for the production ensemble generation run (1) of Ref. [7], as well as 13 tuning runs after switching to EOFA light quarks (2)(3)(4)(5)(6)(7)(8)(9)(10)(11)(12)(13)(14). We use the following notation: "O" denotes the Omelyan integrator, "FG" denotes the force gradient integrator, "N traj " is the number of trajectories generated for the timing run, "acceptance" is the fraction of gauge field configurations which were accepted in the final Monte Carlo step, and "efficiency" is the ratio of the total job time per trajectory for the specified integration scheme to the total job time per trajectory of the scheme used in run 1. Entries in bold correspond to the original RHMC scheme (1) and the final, fully tuned EOFA scheme (12).
used for each determinant. Table XIV summarizes the measured spectral range, the heatbath error, and the total heatbath cost for each of the runs 2-7. For this ensemble the first Hasenbusch mass reduces the cost of the heatbath by more than a factor of two, and subsequent Hasenbusch masses essentially leave the cost fixed.
For each of the runs 2-7 we generated ten trajectories, beginning from the same seed configuration, and analyzed the resulting distributions of F RMS and F max . In panel (a) of Figure 7 we plot distributions of F max from 850 trajectories of the production RHMC ensemble generation calculation (run 1). Since we are using exactly the same RHMC action  shows an analogous force distribution for the latter mass preconditioning scheme.
In runs 8-10 we explore further tuning of a scheme with four light Hasenbusch masses, and in runs 11-14 we explore further tuning of a scheme with five light Hasenbusch masses.
We note that the Monte Carlo acceptance is relatively poor in runs 8-10 -as we argued in Section V, this is consistent with the view that the acceptance should be controlled by the largest integration errors accrued during the trajectory, which are proportional to  Table XIV, to approximately 850 s after tuning.
Comparing the fully tuned EOFA scheme (12) to the original RHMC scheme (1) in Table   XIII, we find that we are able to generate EOFA trajectories a factor of 4.2 times faster than RHMC trajectories, while maintaining a slightly higher acceptance rate of 93%. We emphasize, however, that this improved performance is only partially attributable to the simpler form of the EOFA Hamiltonian and force evaluations: we have also switched from an Omelyan integrator to a force gradient integrator, retuned the step sizes and integrator layout, and, in some cases, applied optimizations to the EOFA simulation that are not applicable to RHMC simulations (e.g. mixed precision CG). Figure 8 briefly summarizes the respective techniques used in the RHMC and EOFA evolution schemes. We have now adopted the EOFA scheme tested in run 12 for ensemble generation in our ongoing ∆I = 1/2 K → ππ calculation [37]. We expect the resulting performance gain to enable up to four times as many measurements in our current production run as we would have been able inverter; we instead invert the preconditioned system discussed in Appedix B 2. (

Appendix B: Four-Dimensional Even-Odd Preconditioning
The inversions required to compute the exact one flavor Hamiltonian can be accelerated using a standard checkerboarding technique: we label lattice sites as "even" if (x + y + z + t) ≡ 0 (mod 2) or "odd" if (x + y + z + t) ≡ 1 (mod 2). This naturally induces a block structure in the Dirac operator D, which can be LDU decomposed as Left-multiplying the linear system Dψ = φ by results in the equivalent system leading to the following trick: assuming D −1 ee is available in an explicit form, it suffices to invert This system only involves the odd sublattice, and is thus substantially cheaper to invert than D using an iterative algorithm like CG. The solution on the even sublattice can then be reconstructed for a trivial additional cost as This technique is already well understood in the context of RHMC with Shamir or Möbius DWF; in this appendix we describe how to generalize the method to the exact one flavor algorithm.
In the context of EOFA, the generic linear system one needs to invert takes the form We choose to multiply by an overall factor of γ 5 R 5 , rewriting the system as for the following reasons: first, we wish to re-use the existing high-performance implementation of the Wilson / D kernel in the BAGEL library without modification, and second, overall factors of γ 5 R 5 will cancel inside the inverter since we use CG applied to the normal equations and (γ 5 R 5 ) † (γ 5 R 5 ) = 1. Since (D EOFA ) eo = (D DWF ) eo and ∆ ± ∝ δ xx in the 4D bulk, only the operators coupling sites of the same parity need to be modified to implement even-odd preconditioned EOFA. We take somewhat different approaches for the Shamir and Möbius cases. and where the index i denotes the chirality of the shift operator. D −1 ee can be efficiently applied using the LDU decomposition of D ee , again as a slight generalization of the standard Shamir DWF case.

Möbius Kernel and Cayley-Form Preconditioning
Using Eqn. (7) we can write D EOFA in the form The action of the operator appearing in Eqn. (B7) on lattice sites of the same parity, then, is given by with M + , M − , and ∆ ± as defined in equations (A15)-(A18). The matrix elements of D −1 ee = D −1 oo can be found by explicit numerical inversion as part of the setup cost; this is a trivial overhead since it suffices to invert only the ss subblock of D ee . In this form, the exact factorization of the fermion determinant in Eqn. (2) comes at the cost of dense L s × L s matrix operations. We argue that it is possible to do significantly better by introducing an additional preconditioning step.
We note that the system defined by Eqn. (B7) can be more efficiently inverted for the case of Möbius DWF by using the operator D −1 as a right preconditioner, resulting in an equivalent system in terms of D DWF . For the special case β = 0 this is straightforward: observing that the relationship between D EOFA and D DWF (Eqn. (6)) can be used to manipulate it suffices to solve D DWF ψ = γ 5 R 5 φ, from which ψ = Dψ can be recovered at the cost of a single additional matrix multiplication. While we observe that D † DWF D DWF has a slightly larger condition number than D † EOFA D EOFA , leading to a modest increase in the total number of CG iterations required to invert the system, D DWF also has a tridiagonal stencil in the fifth dimension, and can thus be applied in O(L s ) operations -unlike the O(L 2 s ) operations required for the dense D EOFA -leading to a substantial reduction in wall clock time for the inversion.
The β = 0 case is more involved, but can be treated in a similar manner. Right preconditioning Eqn. (B7) with D −1 leads to where we have used γ 5 P ± = ±P ± . We define a new, preconditioned, shift operator ∆ ± by and note that since ( ∆) eo = ( ∆) oe = 0, Eqn. (B16) can be inverted efficiently even with β = 0 provided we can apply the operator (D DWF ) ee ± β ∆ ± and its inverse in O(L s ) operations.
This turns out to be possible after observing that ∆ ± is rank-one, i.e. it can be written as a vector outer product To see this, we start by decomposing D into its chiral components -D = D + P + + D − P − -in terms of which we can also decompose The 5D structure of these operators can be worked out by direct calculation, leading to Matrix-vector products involving the preconditioned shift operator and a pseudofermion field can be computed from this decomposition as and The inverses can be applied using the Sherman-Morrison formula: In terms of which can be constructed numerically using the tridiagonal matrix algorithm [42], the necessary factors can be written as which allow Eqn. (B23) to be applied to a pseudofermion vector in O(L s ) operations.
In Figure 9 we benchmark representative even-odd preconditioned inversions of Eqn. (B7) on the 24ID ensemble, with and without additional preconditioning by D −1 , at the physical strange quark mass. In addition to observing a substantial improvement in terms of wall clock time for the inversion, we note that this preconditioning scheme also has the advantage that it requires little new code -assuming an existing high-performance implementation of D DWF -since D EOFA is never applied directly in the preconditioned formalism.