Speeding up the ab initio diffusion Monte Carlo by a smart lattice regularization

One of the most significant drawbacks of the all-electron ab initio diffusion Monte Carlo (DMC) is that its computational cost drastically increases with the atomic number ($Z$), which typically scales with $Z^{\sim 6}$. In this study, we introduce an algorithm based on a very efficient implementation of the Lattice Regularized Diffusion Monte Carlo (LRDMC), where the conventional time discretization is replaced by its lattice space counterpart. This scheme enables us to conveniently adopt a small lattice space in the vicinity of nuclei, and a large one in the valence region, by which a considerable speedup is achieved, especially for large atomic number $Z$. Indeed, the computational performances of our algorithm can be theoretically established by using the Thomas-Fermi model for heavy atoms, yielding an almost affordable scaling with the atomic number, i.e., $Z^{\sim 5}$. This opens the way for efficient and accurate all-electron ab initio DMC in electronic structure calculations.

Introduction − In recent years, the grand challenge in materials modeling is to provide extremely accurate reference energetics often well beyond the standard benchmark provided by the Density Functional Theory (DFT) that notoriously is not enough predictive in several materials of both scientific [1,2] and technological interests [3,4]. This is also particularly important in view of existing progress in Machine Learning algorithms to define accurate classical force field potentials with reference data as unbiased as possible [5][6][7][8]. For such problems, explicitly correlated wave-functionbased approaches are necessary [9][10][11][12][13], such as the ones used in quantum chemistry and the ones relying on statistical approaches that are known under the generic name "quantum Monte Carlo" (QMC) [14]. In practice, for electronic systems containing more than a handful of atoms, QMC remains the only possible wave function based reference method, partly because of its favorable scaling with system size and the fact that it can be used efficiently on massively parallel supercomputers. One of the most powerful QMC techniques is based on a systematic ground state projection of a carefully determined trial state [15], using the so-called diffusion Monte Carlo (DMC) with the fixed node approximation (FN). This choice represents a good compromise between accuracy and efficiency because FN is necessary for avoiding the wellknown sign problem, and gives the best (i.e., the lowest energy) variational state with the same sign of the trial function. Despite this, FN remains a highly expensive computational tool, especially for systems containing nuclei with large atomic number Z.
In order to avoid an almost prohibitive computational cost, many sophisticated pseudopotentials for QMC calculations have been developed and intensively used so far [16][17][18][19][20][21]. However, they are usually determined within other schemes and require further approximations (e.g., the locality) [14] that spoil the consistency of the method and often sacrifices the variational principle. At present, it is embarrassing to observe that several pseudopotentials used in QMC (e.g., the socalled BFD ones [22,23]) are based on the Hartree-Fock (HF) approximation that completely misses the correlation energy. Their use can be therefore justified only empirically and does not guarantee any consistency, namely that the FN energy differences are consistent with or without pseudopotentials.
All-electron calculations are rarely applied for atoms of large atomic number Z in QMC due to the expensive computational cost. The major drawback of the all-electron calculations is that, in the electronic wave function, the core and the valence regions are characterized by very different length scales. Therefore, within the most straightforward QMC algorithm, the smallest scale (∼ Z −1 ) should be adopted for the proposed random displacement of the electrons, in order to avoid significant biases. Unfortunately, this implies several Markov iterations to obtain a new uncorrelated sample, causing a high computational cost. To solve this drawback, Umrigar et al. have devised an accelerated Metropolis algorithm for the variational Monte Carlo (VMC) [24,25], by which electrons in the vicinity of nuclei are displaced with a step much shorter than the one used in the valence region. They also developed another scheme for the diffusion Monte Carlo (DMC) [26], in which the velocity is reduced in the vicinity of nuclei to prevent from overshooting electrons. Despite this improves the accuracy by a sizable amount, the major drawback of the conventional FN, is that the time step has to remain necessarily the same both for the valence and the core region [26]. Instead, LRDMC can straightforwardly handle different length scales of a wave function [15,27], so that electrons in the vicinity of the nuclei and those in the valence region can be appropriately diffused. Henceforth this remedy is referred to as "double-grid algorithm," as well as "singlegrid algorithm" refers to the simpler version that adopts only a single lattice space as introduced in Ref. 27.
So far, the double-grid algorithm has been used only for a very limited number of applications, specifically, for light elements such as carbon [27] and sodium [28]. For large atomic number (Z), too large computational resources were required, also because the originally proposed algorithm was very in- A two-dimensional schematic picture of the double-grid LRDMC algorithm. Each electron at r is displaced by a shorter (a) or a longer (a ′ ) lattice space according to the probability of p r or 1 − p r , where 0 p r 1. Since p r is a function decaying suddenly far from nuclei (e.g., Gaussian function), most electrons in the vicinity of nuclei are displaced by a, and those far from nuclei are by a ′ . efficient (see later). In this study, we develop a generalized double-grid algorithm that drastically accelerates the calculation especially for large atomic number Z without introducing biases, thus improving the computational scaling from Z ∼6 to Z ∼5 .
Boosting the double-grid LRDMC − In LRDMC, the original continuous Hamiltonian is regularized by an approximate one H a such that H a → H for a → 0, where a is the lattice mesh size used to discretize the continuous space [15,27]. Indeed, the kinetic part is approximated by a finite difference form: where ∆ a,p i and ∆ a ′ ,1−p i are discretized laplacians by a small lattice space (a) and a large one (a ′ ), respectively. The function p r , defining ∆ a,p i and ∆ a ′ ,1−p i , parametrizes the probability to use the smaller and therefore more accurate lattice space (a) when an electron is close to an heavy nucleus. In the previous works, p(r) was chosen to be a simple Pade' function [27]: and a Gaussian-type function [28]: where R c is the position of the nucleus closest to the electron in r, and r c is an important parameter determining the electrons treated with the smaller lattice space a (henceforth referred to as core electrons), in other words, the ones inside the sphere of radius r c (see Fig. 1). This scheme enables us to use always the larger lattice space a ′ in the valence region, while the most expensive smaller lattice space a is used only when the electron is very close to the nucleus.
The key parameters of the double-grid LRDMC are a ′ /a and r c . A smaller r c (larger a ′ /a) accelerates the double-grid scheme as compared with the corresponding single-grid one, whereas the bias (i.e., the difference between the single-grid and the double-grid LRDMC energies at the same a) is correspondingly increased. Therefore, a proper determination of the two parameters is essential to balance accuracy and efficiency of the double-grid algorithm. a ′ /a was originally parametrized as: with the simple function p r in Eq.
(2) and r c = 1 2 Z [27], where Z is an atomic number. However, as it is shown in the following, the above choice is not suitable for large atomic number (Z).
In the following, we briefly describe the developed scheme. First, we discuss a new strategy to properly determine a ′ /a. Since, the computational cost of LRDMC is proportional to the inverse square of the lattice spaces (a and a ′ ), the acceleration of the double-grid vs. single-grid LRDMC (denoted as speedup) can be analytically estimated in terms of a ′ /a and the average number of electrons in the core/valence regions, according to the following relation: where N core (r c ) and N valence (r c ) = Z − N core (r c ) are the average numbers of electrons that are diffused with the smaller (a) and the larger (a ′ ) lattice spaces, respectively [29] . On physical grounds, the average numbers of core and valence electrons satisfies the inequality N core ≪ N valence . On the other hand, the systematic error of the double-grid scheme referred to the corresponding single-grid one at the same a (denoted as bias) cannot be analytically estimated. This is because it is a very complicated function of a, a ′ , r c , and Z. It is, however, possible to estimate an appropriate value according to the following consideration: Once r c and a ≃ 1 Z are given, it is clear that the corresponding bias increases with a ′ . This implies that it is convenient to choose a ′ as small as possible as long as speedup −1 does not sizably increase. Therefore, we determine a ′ /a in a way that the speedup becomes the half of the That the determination of a ′ /a (Eq. (6)) corresponds to nearly the optimal compromise between efficiency and accuracy can be justified by the following argument: (i) If we chose a too large value of a ′ /a, most of the computational time would be spent for the core electrons, and we could certainly decrease the bias by a smaller a ′ without affecting much the efficiency. ii) On the other hand, if we chose a ′ /a (> 1) too much close to one, the bias is minimal (i.e. equal to the single-grid algorithm), but the speedup can be substantially increased by a larger a ′ , a choice that should be clearly possible for the valence electrons. Notice that, N core (r c ) and N valence (r c ) can be readily estimated from an atomic electron density calculated by an effective model such as the Thomas-Fermi [30] and the Slater ones [31]: Next, we discuss a new strategy to properly determine r c . Since here we are interested in the asymptotic behavior of the algorithmic accuracy and efficiency for large Z, it is convenient to adopt the Thomas-Fermi approximation [30], according to which N core (r c ) is given by: where b is a constant value and f (x) is a universal function independent of Z. After integration (see the supplemental material in detail), we can obtain for Z 1/3 r c /b ≪ 1: At this point, it is important to consider that the bias depends on the two lengths, namely, the value of r c (the bias is minimal for r c → ∞) and the value of a ′ (the bias is minimal if a ′ = a ≃ 1/Z). Now, these two contributions are expected to be of the same order if we take a ′ ≃ r c because we can assume that for r > r c , far from the core region, the wave function is smooth and the laplacian can be discretized with a lattice space a ′ r c . This represents the most balanced choice, providing a good compromise between efficiency (smaller r c and larger a ′ ) and accuracy (the other way around). With the above condition, by substituting the Thomas-Fermi expression of Eq. (9) in Eq. (6) for a ∝ 1/Z, we obtain: yielding r c ∝ Z −θ with θ = 5/7. Therefore, our choice in the following is r c (Z) = β·Z −5/7 , where β is a Z-independent prefactor. Although the above discussion based on the Thomas-Fermi model is exact only for Z → ∞, our VMC calculations show that the scaling (i.e., N core (r c ) ∝ Z 3/7 ) is undoubtedly correct even for small Z, as shown in Fig. S-1 (see. supplemental material). The prefactor β should be small enough so that the scaling is valid in a wide range of Z values, even outside the asymptotic power law regime. Therefore, β = 0.75 is employed in this study.
As a summary, in our algorithm, we determine r c (Z) according to r c (Z) = β · Z −5/7 , with β = 0.75. Then, a corresponding appropriate a ′ /a is determined according to Eq. (6), wherein N core (r c ) and N valence (r c ) are estimated by the Slater's effective models [31] with the exponents that Clementi proposed, based on HF calculations [32,33]. Since the computational cost of the all-electron single-grid DMC has turned out to scale with Z 5.5−6.5 [34][35][36], and the single-grid LRDMC similarly behaves, it is obviously very important to accelerate the double-grid LRDMC for heavy elements. In the following, we assume that the unbiased a → 0 fixed node estimate can be obtained by a low order polynomial fit of several energy calculations corresponding to different a ≥≃ 1 Z . This implies, according to Eq. (5) and Eq. (10), that the new algorithm improves the complexity of the well known and widely used DMC algorithm by ∼ Z 4/7 ≃ Z 0.57 , that represents a remarkable achievement especially for large Z.
Practical test of the developed algorithm − In Table I  In practice, it is important to evaluate the actual computational time for a fixed reference error in the total energy, as a function of the atomic number Z. We measured the computational times for He, Be, Ne, Ar, Kr and Xe, wherein a = (3.5Z) −1 is employed [37] . This is consistent with the typical setting of the time step in the standard DMC (τ ∝ Z −2 [36]). Figure 2 shows that our new algorithm accelerates the singlegrid LRDMC calculations by × 1.1, × 1.4, × 2.3, × 3.3, × 4.3, and × 5.3 for He, Be, Ne, Ar, Kr, and Xe, respectively [38] . Our practical test shows that the single-grid LRDMC scales with Z 5.55 , which is already slightly better than the previous report for the standard DMC algorithm (Z 5.97 with τ ∝ Z −2 [36], where the Umrigar's improvement [26] was employed), and the double-grid one improves the scaling to Z 5.06 . The improvement of the computational time by the double-grid algorithm (Z 0.49 ) is consistent with our expectation (Z 4/7 ). To our best knowledge, Z 5.06 is the best scaling for the all-electron FN calculations so far.
Application to large systems − We discuss possible applications of the double-grid algorithm to large systems. For a polyatomic system, the smallest length scale is determined by the heaviest atom in the system with Z = Z max . Therefore, in this case, we can change the definition of R c in Eq. (3) slightly, by considering only the distances of the electrons with the heaviest atoms. In this way, when electrons are close to the lighter elements, they always move with the larger lattice space a ′ , without introducing a sizable bias. Conversely, for r c , one can adopt the value calculated with a single reference heavy atom, as we have done in this work. It is clear, therefore, that a more significant speedup can be achieved by using Eq (5), especially when the number of heavy atoms in the system is very small (e.g., transition-metal porphyrin complexes, metallofullerenes). As the first step to large systems, we considered the benzene molecule (Z max = 6). Table II shows that the bias of the double-grid LRDMC is as small as in the atomic cases while the computational time is accelerated by × 1.9, significantly larger than the one ≃ 1.5 estimated from Fig. 2, demonstrating that the double-grid algorithm is already advantageous for polyatomic systems, even without too heavy nuclei and too many light ones. Finally, we have compared the computational costs between all-electron and pseudopotential calculations [39] . Thanks to the significant acceleration, the CPU time of the all-electron doublegrid LRDMC for the benzene molecule is just 5.7 times larger than the pseudo-potential single-grid one. Thus, the doublegrid LRDMC should make possible the application of the allelectron FN to realistic materials, allowing extremely accurate and easily reproducible reference energies in the future.
Summary − In this study, we develop a new generalized algorithm of the double-grid Lattice Regularized Diffusion Monte Carlo (LRDMC). The speedup of the algorithm is predicted theoretically within the standard Thomas-Fermi model for atoms with large atomic number, and the calculation is indeed accelerated in practice by a large amount, especially for large atomic number Z and without inducing significant biases. As a result, the computational scaling is improved from Z 5.55 to Z 5.06 . Our double-grid algorithm can be applied to polyatomic systems with further significant speedups. Last but not least it should be possible with the present technique to treat ions and electrons without relying on the Born-Oppenheimer approximation, because the corresponding much different length scales should be efficiently considered within the proposed double-grid scheme.
Acknowledgments − The computations in this work have been mainly performed using the facilities of Research Center for Advanced Computing Infrastructure at Japan Advanced Institute of Science and Technology (JAIST). K. Nakano is grateful for a financial support from Simons Foundation. R. Maezono is grateful for financial supports from MEXT-KAKENHI (19H04692 and 16KK0097), from FLAGSHIP2020 (project nos. hp190169 and hp190167 at K-computer), from Toyota Motor Corporation, from I-O DATA Foundation, from the Air Force Office of Scientific Research (AFOSR-AOARD/FA2386-17-1-4049;FA2386-19-1-4015), and from JSPS Bilateral Joint Projects (with India DST). than those of acceptance ratios (e.g., × 3.9 and × 2.3 for CPU time and acceptance ratio, respectively, in the neon atom). This is because the double-grid algorithm consumes more CPU times when computing the discretized laplacians and potentials.
[39] For the benzene molecule, we measured the CPU times for a fixed reference error in the total energy at a = 0.1 Bohr and a = 0.3 Bohr for the all-electron and the pseudopotential calculations, respectively. We determined these values (a min ) such that the extrapolation error obtained with 3 parameter polynomial fit (E (a) = E 0 + k 1 · a 2 + k 2 · a 4 ) of independent energy calculations corresponding to 8 different values of a ≥ a min becomes ∼ 2.0 mHa referenced to the safest extrapolation value (i.e., the smallest a min ). Notice that the energy-consistent BFD pseudopotentials with the VDZ basis were employed. This supplemental material gives details about the scaling of the number of electrons within r c obtained by the Thomas-Fermi model (Sec. I), a remedy for smooth extrapolations (Sec. II), and the variational (VMC) and the lattice regularized quantum Monte Carlo (LRDMC) calculations mentioned in the main text (Sec. III).

THE SCALING OF THE NUMBER OF ELECTRONS WITHIN r c OBTAINED BY THE THOMAS-FERMI MODEL
According to the Thomas-Fermi model [S1], the electron density of the atomic number Z can be represented by: and χ (x) is the universal function independent of Z. Therefore, the number of electrons within r c is defined by: When r is replaced with Z 1/3 r/b = x, the number of electrons within r c is represented as: χ (x) can be approximated by the following polynominal expression at small x region (x ≪ 1) [S1]: where A is the constant value A = 1.8858. If ξ is large enough (i.e., Z 1/3 r c /b ≪ 1), only small x region contributes to the integral and the high-order terms can be neglected. Therefore, the above equation can be approximated by: Since the integral can be replaced by the gamma function: the number of electrons within r c can be represented as: By substituting r c with β · Z −θ , we finally get the relation: and for θ = 5/7: Eq. (S-10) is valid only when the inequality ξ ≫ 1 is satisfied. This depends on the prefactor β as well as the atomic number Z. In practice, the prefactor β should be small enough so that N core (r c ) ∝ Z 3/7 is valid in a wide range of Z values, even outside the asymptotic power law regime. Fig. S-1 shows that the plot of N core (r c ) divided by Z 3/7 obtained by VMC calculations v.s β for Ne, Ar, Kr and Xe atoms. This figure shows that β = 0.75 is small enough to satisfy the above scaling. where k = 4π sin θ/λ, 2θ is the scattering angle and λ is the wavelength. TurboRVB enables us to calculate a radial distribution function as well as an electron density from a manybody wave function both in VMC and DMC levels using the forward walking technique.

A REMEDY FOR SMOOTH EXTRAPOLATIONS
In order to improve the quality of the energy extrapolation for a → 0, it is important to increase r c as a increases. This is because if r c is fixed (i.e., a ′ /a is also fixed according to Eq. (6) in the main text), a ′ ≃ r c is no longer satisfied in a large a region, which introduces a large bias by the larger lattice space a ′ especially in the vicinity of the border between the core and valence regions. A simple parametrization to solve this problem is r c (a, Z) = r c (Z) f (a), where f (a) is an arbitrary function satisfying f (0) = 1 and f (∞) = const. In this study, a simple polynominal function: is employed, where κ is a prefactor, and a = (α · Z) −1 . Therefore, r c (a, Z) can be parametrized as: Eq. (6) in the main text indicates that N core (r c ) should be smaller than N valence (r c ) for any a and Z, otherwise the doublegrid LRDMC becomes useless (i.e., a ′ /a < 1) in a certain case. According to the Thomas-Fermi theory, N core (r c ) becomes equal to N valence (r c ) at r c = 1.33Z −1/3 [S1]. Therefore, the following inequality should be satisfied for all Z and a, Thus, we obtain κ < 2.69 for β = 0.75. κ = 2.5 is employed here. The new algorithm determines r c (a, Z) using Z and a according to Eq. (S-13), then, a corresponding proper a ′ /a is calculated by the obtained r c (a, Z) according to Eq. (6) in the main text, wherein N core (r c ) and N valence (r c ) are estimated by the Slater's effective models [S3] with the exponents that Clementi proposed based on HF calculations [S4, S5].
In Fig. S-3 and Table S-I, we show the LRDMC energies of Be (Z = 4), Ne (Z = 10) and Kr (Z = 36) atoms obtained by the single-grid, the previous and the new double-grid algorithms with the above remedy. We remark that in Figure S-3 the LRDMC energies obtained by the previous parametrization are significantly biased in the large a region especially in Kr (Z = 36). On the other hand, our new parametrization suppresses these significant biases, and the obtained LRDMC energies in the small a region (a → 0) are essentially unbiased for all Z. In Tables S-I, it is evident that r c (a, Z) increases as a increases, and a ′ /a decreases as a increases, by which the condition a ′ ≃ r c is satisfied for any a and Z. In this way, unnecessary large biases ( i.e. not saving computational time) are suppressed, and smooth extrapolations are achieved. Notice that Table S-I indicates that this modified algorithm implies smaller speedups as a increases. This is because r c (a, Z) (Eq. (S-13)) becomes slightly larger than the original r c (Z) due to f (a) as a increases. However, the effect is negligible in practice because small a calculations are much more important than large ones (i.e., the computational cost is proportional to a −2 ). Thus, within the remedy, the new double-grid algorithm achieves both acceleration and smooth extrapolation.

THE DETAILS OF VMC AND LRDMC CALCULATIONS
The variational (VMC) and lattice regularized quantum Monte Carlo (LRDMC) calculations for He, Be, Ne, Ar, Kr, Xe and C 6 H 6 were performed using TurboRVB [S6]. In the VMC calculations, Jastrow Slater (JSD) and Antisymmetrized Geminal Power (JAGP) [S7] ansatz were employed, and the cc-pVDZ (He, Be, Ne, Ar, Kr, and C 6 H 6 ) or ADZP (Xe) basis set taken from EMSL Basis Set Library [S8, S9] were used in the determinant and the Jastrow parts. The variational JSD and JAGP wave functions were optimized using the stochastic configuration in combination with the linear method [S10, S11], by which all variational parameters in the Jastrow and the determinant parts including the exponents were optimized. In LRDMC calculations, the wave func-tions optimized based on JAGP ansatz were used for the guiding functions. The obtained VMC and LRDMC energies are shown in Table S-II. Thanks to our careful optimizations, our VMC-JSD energies are lower than the previous results, especially when Z becomes larger. Remarkably, the JAGP ansatz further improves the variational energies. Our JAGP ansatz also improves the LRDMC energies (i.e., the nodal surfaces) as well as the variational energies.