Rayleigh-Gauss-Newton optimization with enhanced sampling for variational Monte Carlo

Variational Monte Carlo (VMC) is an approach for computing ground-state wavefunctions that has recently become more powerful due to the introduction of neural network-based wavefunction parametrizations. However, efficiently training neural wavefunctions to converge to an energy minimum remains a difficult problem. In this work, we analyze optimization and sampling methods used in VMC and introduce alterations to improve their performance. First, based on theoretical convergence analysis in a noiseless setting, we motivate a new optimizer that we call the Rayleigh-Gauss-Newton (RGN) method, which can improve upon gradient descent and natural gradient descent to achieve superlinear convergence at no more than twice the computational cost. Second, in order to realize this favorable comparison in the presence of stochastic noise, we analyze the effect of sampling error on VMC parameter updates and experimentally demonstrate that it can be reduced by the parallel tempering method. In particular, we demonstrate that RGN can be made robust to energy spikes that occur when the sampler moves between metastable regions of configuration space. Finally, putting theory into practice, we apply our enhanced optimization and sampling methods to the transverse-field Ising and XXZ models on large lattices, yielding ground-state energy estimates with remarkably high accuracy after just 200 parameter updates.


I. INTRODUCTION
Computing the ground-state wavefunction of a many-body Hamiltonian operator is a demanding task, requiring the solution of an eigenvalue problem whose cost grows exponentially with system size in traditional numerical approaches. Variational Monte Carlo (VMC) [1,2] is an alternative strategy that avoids this curse of dimensionality by using stochastic optimization to find the best wavefunction within a tractable function class.
VMC has recently seen rapid and encouraging development due to the incorporation of insights from the machine learning community. In 2017, Carleo and Troyer [3] applied VMC with a two-layer neural network ansatz to accurately represent the ground-state wavefunction of quantum spin systems with as many as 100 spins. Since then, there has been major progress in extending neural network-based VMC to the setting of electronic structure, including the development of the neural network backflow ansatz for second-quantized lattice problems [4], as well as of FermiNet [5] and PauliNet [6] for quantum chemistry problems in first quantization. These new approaches have been extended to systems as large as bicyclobutane (C₄H₆), which has 30 interacting electrons [5,7].
VMC is highly flexible, since it extends without significant modification to systems of arbitrary spatial dimension. However, the price paid for this flexibility is a difficult optimization problem that relies on Monte Carlo sampling. Efficiently solving this optimization problem has proven challenging. Recent works [4,6-10] raise concerns about the speed and stability of wavefunction training and report that VMC can suffer from long training times [5,7], lose stability [10], or converge to unreasonable solutions [11]. Thus there is motivation for the development of faster and more stable optimization and sampling solutions.
Our goal is to apply numerical and probabilistic analysis to evaluate and improve upon the optimization and sampling strategies in VMC. To this end, we first provide a unified perspective on several major VMC optimizers, namely gradient descent, quantum natural gradient descent (also known as stochastic reconfiguration), and the linear method. Reviewing these methods in a unified way clarifies a path toward improvement. Specifically, we introduce a new Rayleigh-Gauss-Newton (RGN) method and prove that RGN achieves superlinear convergence as the wavefunction approaches the ground state.
Next we analyze the Markov chain Monte Carlo (MCMC) sampling used in VMC. We establish a quantitative extension of the zero-variance principle [1,2] of VMC that we call the vanishing-variance principle. This principle guarantees that the energy estimates converge to the true energy as the wavefunction nears an eigenstate. However, away from an eigenstate, the accuracy of the energy estimates is not guaranteed. The energy estimates can have a high variance and can even exhibit energy spikes (see Figure 4). To stabilize these energy estimates, variance reduction strategies are needed. Using a standard MCMC sampler as in [3], the wavefunction is slow to recover from the energy spikes (∼10³ iterations); however, using the parallel tempering MCMC method [12], the recovery period is much quicker (∼10² iterations). Variance reduction strategies such as parallel tempering can be essential for realizing the full potential of VMC in large-scale applications.
Lastly, by using the Rayleigh-Gauss-Newton method along with parallel tempering, we obtain highly accurate variational estimates for the ground-state energies of transverse-field Ising and XXZ models with as many as 400 spins. Compared to past benchmark results obtained using natural gradient descent [3], we obtain the same or higher accuracy in fewer iterations. Since RGN is only slightly more expensive than natural gradient descent, by less than a factor of two in our tests, we conclude that RGN can improve the overall efficiency of VMC.
The rest of the paper is organized as follows. Section II gives an overview of variational Monte Carlo, Section III analyzes optimization methods, Section IV analyzes sampling methods, Section V presents numerical experiments, and Section VI concludes.
Throughout the paper, $\Re z$ denotes the real part of a complex number $z$. The symbols $v^{\mathsf T}$, $\bar v$, and $v^*$ denote the transpose, complex conjugate, and conjugate transpose of a vector $v$, and similar conventions are adopted for matrices. We use single bars $|\cdot|$ for the Euclidean norm of a scalar, vector, or matrix and $\|\cdot\|_2$ for the spectral norm of a matrix. Lastly, we consider a finite- or infinite-dimensional Hilbert space of unnormalized wavefunctions $\psi$ and use $\langle \cdot, \cdot \rangle$ and $\|\cdot\|$ to denote the associated inner product and norm.

II. OVERVIEW OF VMC
The main goal of variational Monte Carlo (VMC) is the identification of the ground-state energy and wavefunction of the Hamiltonian operator H for a quantum many-body system. We denote the ground-state energy and wavefunction using λ₀ and ψ₀, respectively. In addition to solving the eigenvalue equation Hψ₀ = λ₀ψ₀, these admit a variational characterization in terms of the energy functional

$$E[\psi] = \frac{\langle \psi, H\psi \rangle}{\langle \psi, \psi \rangle}.$$

The ground-state energy λ₀ is the minimum value of E, and the ground-state wavefunction ψ₀ is the minimizer, which we assume to be unique up to an arbitrary multiplicative constant.
Identifying λ₀ and ψ₀ becomes difficult when the Hilbert space associated with H is high-dimensional or infinite-dimensional. For example, in the Heisenberg model for spin-1/2 particles on a graph [13], H is the operator

$$H = -\sum_{i\sim j}\left(J_x\,\sigma_i^x\sigma_j^x + J_y\,\sigma_i^y\sigma_j^y + J_z\,\sigma_i^z\sigma_j^z\right) - h\sum_i \sigma_i^x,$$

where σᵢˣ, σᵢʸ, and σᵢᶻ are Pauli operators for the i-th spin, i ∼ j signifies that i and j are neighboring spins, and Jₓ, J_y, J_z, and h are real-valued parameters. In the case, e.g., of a 10 × 10 square lattice, the ground-state wavefunction can be viewed as a vector of length 2¹⁰⁰, which is far too large to store in memory, much less handle with any conventional eigensolver, direct or iterative.
VMC must approximate this high-dimensional eigenvector using a tractable parametrization ψ = ψ_θ, where θ is a vector of real- or complex-valued parameters. VMC uses an iterative approach for updating the θ parameters, with the goal of minimizing the energy within the parametric class. VMC iterates over the following three steps.
1. Draw random samples σ₁, σ₂, …, σ_T from the wavefunction density ρ(σ) ∝ |ψ_θ(σ)|².

2. Use the random samples to estimate the energy E[ψ_θ], the energy gradient ∇_θ E[ψ_θ], and possibly other quantities needed for the optimization.

3. Update the θ parameters to reduce the energy.
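As a concrete illustration, the three-step loop can be sketched on a toy problem: a 4-spin transverse-field Ising ring with a two-parameter Jastrow-style ansatz. All choices below (the ansatz, field strength, sample size, and step size, and the use of exact enumeration in place of MCMC sampling) are illustrative simplifications for this sketch, not the settings used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h = 4, 1.0                                          # toy system size and field
configs = np.array([[1 if (s >> i) & 1 else -1 for i in range(n)]
                    for s in range(2 ** n)])           # all 2^n spin configurations

def log_psi(theta, sigma):
    # Illustrative two-parameter ansatz: log psi = theta_0 * (nn correlation) + theta_1 * (magnetization)
    zz = np.sum(sigma * np.roll(sigma, 1, axis=-1), axis=-1)
    return theta[0] * zz + theta[1] * np.sum(sigma, axis=-1)

def local_energy(theta, sigma):
    # E_L = (H psi)(sigma) / psi(sigma) for H = -sum_{i~j} s^z s^z - h sum_i s^x
    diag = -np.sum(sigma * np.roll(sigma, 1))
    offdiag = 0.0
    for j in range(n):                                 # sigma^x flips one spin
        flipped = sigma.copy()
        flipped[j] = -flipped[j]
        offdiag += np.exp(log_psi(theta, flipped) - log_psi(theta, sigma))
    return diag - h * offdiag

theta = np.zeros(2)
for _ in range(200):
    # Step 1: sample from rho ∝ |psi_theta|^2 (exact enumeration stands in for MCMC)
    logp = 2 * log_psi(theta, configs)
    p = np.exp(logp - logp.max()); p /= p.sum()
    samples = configs[rng.choice(len(configs), size=500, p=p)]
    # Step 2: estimate the energy and its gradient from local energies
    eloc = np.array([local_energy(theta, s) for s in samples])
    O = np.stack([np.sum(samples * np.roll(samples, 1, axis=1), axis=1),
                  np.sum(samples, axis=1)], axis=1)    # d log psi / d theta
    energy = eloc.mean()
    grad = 2 * ((O * eloc[:, None]).mean(0) - O.mean(0) * energy)
    # Step 3: plain gradient-descent update of the parameters
    theta -= 0.02 * grad
```

On this toy problem the estimated energy falls from −4 (the uniform-wavefunction value) toward the ground-state energy of the 4-spin ring, illustrating the qualitative convergence behavior described below.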
In VMC, we ideally find that the estimated energies fall quickly in the first iterations and decrease more slowly at subsequent iterations, yielding increasingly accurate estimates of λ₀, as shown in Figure 1. Additionally, as seen in Figure 1, there is a vanishing-variance principle by which the energy estimator's variance converges to zero as the wavefunction approaches the ground state of H (see Proposition 3). Because of this principle, reductions in the energy mean and reductions in the energy variance both indicate that the wavefunction is approaching the ground state. The vanishing-variance principle is essential in applications, since it enables VMC to provide accurate energy estimates even though the variance at the early stages of the optimization would appear to render such high accuracy impossible.

III. OPTIMIZATION APPROACHES
In this section, we obtain formulas for the energy gradient and Hessian, use these formulas to motivate optimization methods for VMC, and lastly derive theoretical convergence rates for VMC optimizers. Throughout the section, we assume that optimization methods are applied exactly without any Monte Carlo sampling.

III.1. The energy gradient and Hessian
To begin, we derive formulas for the energy gradient and Hessian with respect to the parameters. By adopting the convention of intermediate normalization [14], we obtain compact expressions for these quantities that differ from past presentations, e.g., [1, ch. 9].
We fix a vector of parameters θ and consider a small parameter update θ + δ. The resulting wavefunction, after intermediate normalization, is written

$$\bar\psi_{\theta+\delta} = \frac{\langle \psi_\theta, \psi_\theta \rangle}{\langle \psi_\theta, \psi_{\theta+\delta} \rangle}\, \psi_{\theta+\delta}.$$

This intermediate-normalized wavefunction is a scalar multiple of the unnormalized wavefunction ψ_{θ+δ} and hence has the same energy. However, ψ_{θ+δ} has been rescaled to fix the inner product with ψ_θ. We assume that δ ↦ ψ̄_{θ+δ} is a locally analytic function of real or complex parameters and consider the second-order Taylor series expansion

$$\bar\psi_{\theta+\delta} = \psi + \sum_i \delta_i\, \psi_i + \frac{1}{2}\sum_{i,j} \delta_i \delta_j\, \psi_{ij} + O(|\delta|^3),$$

where ψ, ψᵢ, and ψᵢⱼ denote the normalized wavefunction and its partial derivatives, evaluated at δ = 0. Manipulating this expansion, we then decompose the energy difference E[ψ̄_{θ+δ}] − E[ψ_θ] into the sum of gradient and Hessian terms

$$E[\bar\psi_{\theta+\delta}] - E[\psi_\theta] = 2\,\Re(\delta^* g) + \Re\!\left(\delta^* H\,\delta + \delta^{\mathsf T} J\,\delta\right) + O(|\delta|^3).$$

These gradient and Hessian terms are given explicitly by

$$g_i = \frac{\langle \psi_i, \bar H \psi \rangle}{\langle \psi, \psi \rangle}, \qquad H_{ij} = \frac{\langle \psi_i, \bar H \psi_j \rangle}{\langle \psi, \psi \rangle}, \qquad J_{ij} = \frac{\langle \bar H \psi, \psi_{ij} \rangle}{\langle \psi, \psi \rangle},$$

where H̄ = H − E[ψ] is an energy-shifted version of the operator H. These expressions offer transparent formulas for the energy gradient and Hessian. In the case of real-valued parameters, the energy gradient is 2g, and the energy Hessian is 2H + 2J. In the case of complex-valued parameters (such as in the setting of [3]), the Wirtinger gradient [15,16] of the energy is $\binom{g}{\bar g}$, and the Wirtinger Hessian is

$$\begin{pmatrix} H & \bar J \\ J & \bar H \end{pmatrix}.$$

The structure of the Hessian simplifies near the ground state, since J → 0 as the wavefunction approaches any eigenstate of H.

Proposition 1. The matrix J is bounded entrywise by

$$|J_{ij}| \le \frac{\|\psi_{ij}\|}{\|\psi\|}\, \min_{\lambda\in\mathbb{R}} \frac{\|(H-\lambda)\psi\|}{\|\psi\|}.$$

Therefore, J → 0 as $\min_{\lambda\in\mathbb{R}} \|(H-\lambda)\psi\| / \|\psi\| \to 0$, assuming uniformly bounded ‖ψᵢⱼ‖/‖ψ‖ terms.
As the wavefunction approaches an eigenstate, Proposition 1 reveals that the Hessian or Wirtinger Hessian takes a simple structure, depending only on first derivatives of the wavefunction. To our knowledge this fact has not been previously identified. An important implication, to be spelled out below in Subsection III.4, is that first derivatives suffice to achieve superlinear convergence in VMC optimization, under the assumption that the true ground state lies within our parametric class.

III.2. Gradient descent methods
The main idea in gradient descent methods is to first approximate the energy using the linearization

$$E[\bar\psi_{\theta+\delta}] \approx E[\psi_\theta] + 2\,\Re(\delta^* g) \tag{10}$$

and then choose δ to minimize (10), plus a penalization term that keeps the update small. The penalization term may take the form

$$\epsilon^{-1}\, |\delta|^2 \qquad \text{or} \qquad \epsilon^{-1}\, \angle(\psi_\theta, \psi_{\theta+\delta})^2,$$

where ε > 0 is a tunable parameter. In the first case, we are restricting the Euclidean norm |δ|, and the resulting method is standard gradient descent. In the second case, we are restricting the angle between the wavefunctions ψ_θ and ψ_{θ+δ}. This leads to a method called 'stochastic reconfiguration' or '(quantum) natural gradient descent' [2], which has been used extensively to optimize traditional [17,18] and more recent [3,5] VMC wavefunction ansatzes. In a high-dimensional or infinite-dimensional Hilbert space, the angle ∠(ψ_θ, ψ_{θ+δ}) cannot be computed exactly, so natural gradient descent takes advantage of the Taylor series expansion

$$\angle(\psi_\theta, \psi_{\theta+\delta})^2 = \delta^* S\, \delta + O(|\delta|^3), \tag{13}$$

where

$$S_{ij} = \frac{\langle \psi_i, \psi_j \rangle}{\langle \psi, \psi \rangle}$$

is a positive semidefinite matrix known as the Fubini-Study metric or quantum information metric [19]. However, instead of directly using a penalization term ε⁻¹ δ*Sδ, natural gradient descent uses a slightly modified penalization term ε⁻¹ δ*(S + ηI)δ. Here, η > 0 is a parameter that makes the matrix S + ηI positive definite and prevents large updates when the Taylor series expansion (13) is not very accurate.
To make the preceding discussion precise, we formalize gradient descent (GD) methods as follows.
Algorithm 1 (GD methods). Choose δ to solve

$$\min_\delta\, \left\{ 2\,\Re(\delta^* g) + \epsilon^{-1}\, \delta^* R\, \delta \right\},$$

where R = I in GD and R = S + ηI in natural GD. Equivalently, set

$$\delta = -\epsilon\, R^{-1} g.$$

In addition to GD and natural GD, alternative gradient descent methods such as Adam [20] and AMSGrad [21] have recently gained traction in the VMC community [5,22-24]. These 'momentum-based' methods form updates by combining the energy gradient at the current iteration and past iterations. While such strategies are potentially promising, recent tests [5,25] suggest that natural GD still outperforms momentum-based methods on several challenging VMC test problems. Therefore, we focus on GD and natural GD, leaving analysis of other gradient descent methods for future work.
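In code, one step of Algorithm 1 reduces to a single (regularized) linear solve. The sketch below uses illustrative values of ε and η, not the ones used in our experiments.

```python
import numpy as np

def gd_update(g, S=None, eps=0.05, eta=1e-3):
    """One step of Algorithm 1: delta = -eps * R^{-1} g,
    with R = I for plain GD and R = S + eta*I for natural GD.
    (Sketch with illustrative eps and eta values.)"""
    if S is None:                        # plain gradient descent
        return -eps * g
    R = S + eta * np.eye(len(g))         # regularized Fubini-Study metric
    return -eps * np.linalg.solve(R, g)  # natural gradient descent

# Example: with S = I the two updates coincide up to the eta shift
g = np.array([1.0, -2.0])
delta_gd = gd_update(g)
delta_nat = gd_update(g, S=np.eye(2))
```

The regularization η keeps the solve well-posed even when the sampled metric S is nearly singular, mirroring the role of the ηI shift in the text.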

III.3. Rayleigh-Gauss-Newton method
Whereas gradient descent methods are based on a linear approximation of the energy, we now introduce the Rayleigh-Gauss-Newton (RGN) method, which is based on the following quadratic energy approximation:

$$E[\bar\psi_{\theta+\delta}] \approx E[\psi_\theta] + 2\,\Re(\delta^* g) + \delta^* H\, \delta. \tag{19}$$

Here, δ*g and g*δ are the exact gradient terms, while δ*Hδ is just one of the Hessian terms. There is a strong practical motivation for ignoring the other Hessian term δᵀJδ, since evaluating this term would require taking second derivatives of the wavefunction with respect to all pairs of parameters, which becomes burdensome as the number of parameters grows large.
In the RGN method, we minimize the quadratic objective function (19) plus a 'natural' penalization term, as described below.
Algorithm 2 (RGN method). Choose δ to solve

$$\min_\delta\, \left\{ 2\,\Re(\delta^* g) + \delta^* H\, \delta + \epsilon^{-1}\, \delta^* R\, \delta \right\},$$

where R = S + ηI. Equivalently, set

$$\delta = -\left(H + \epsilon^{-1}(S + \eta I)\right)^{-1} g.$$

The parameter η > 0 is again chosen to make H + ε⁻¹(S + ηI) positive definite and to help prevent large parameter updates.
To our knowledge, the RGN method has not appeared before in the literature. However, it is closely connected to the classical Gauss-Newton method for nonlinear least-squares problems [26], which can be viewed as deriving from a similar Hessian splitting. RGN is also related to previous VMC optimization methods, including the linear method for energy minimization [2] and a Gauss-Newton-like method for variance minimization [27]. All these approaches can be described by first linearizing a class of functions and then minimizing a nonlinear loss function applied to the linearized function class. For example, in the linear method for VMC, one first linearizes the intermediate-normalized wavefunction,

$$\bar\psi_{\theta+\delta} \approx \psi + \sum_i \delta_i\, \psi_i,$$

and then minimizes the Rayleigh quotient of the linearized wavefunction,

$$E\Big[\psi + \sum_i \delta_i\, \psi_i\Big], \tag{24}$$

plus an additional penalization term. Minimization of (24) is equivalent to solving a generalized eigenvalue problem for the smallest eigenvalue-eigenvector pair [1,2]. As a penalization term, the matrix H is padded with a diagonal matrix ε⁻¹I, which is similar to the penalization term used in GD.
The linear method has been observed to yield fast asymptotic convergence in VMC applications with small parameter sets. However, extending the linear method to larger parameter sets is an ongoing challenge [22,28]. Motivated by the linear method's potential for fast asymptotic convergence, we introduce RGN as an updated strategy with improved convergence. RGN differs from the linear method in two ways, as detailed below.
First, instead of approximating the energy using the formula (24), RGN uses the quadratic approximation (19). This quadratic approximation agrees with (24) up to O(|δ|³) terms, but it only requires the solution of a linear system instead of a generalized eigenvalue problem. Although the parametrizations considered in our numerical experiments below are small enough that neither of these linear algebra routines imposes a computational bottleneck, the distinction may become important for large parameter sets. For example, [28] introduced a matrix-free approach for solving the linear system in stochastic reconfiguration, but a similar scheme for the generalized eigenvalue problem has not achieved the same success [29].
Second, RGN uses a 'natural' penalization term ε⁻¹ δ*(S + ηI)δ, which differs from the penalization term used in the linear method. Because of its penalization, the linear method converges as ε → 0 to standard GD updates. In contrast, RGN converges as ε → 0 to natural GD updates, which are more efficient than standard GD updates when optimizing many VMC wavefunction ansatzes [5,25].

III.4. Convergence rate analysis
GD, natural GD, and RGN can all be presented in the standardized form

$$\theta_{i+1} = \theta_i - P_i^{-1}\, g_{\theta_i}.$$

Here, the parameter update θ_{i+1} − θ_i is written as the solution to a linear system involving a positive definite preconditioning matrix P_i and the negative energy gradient −g_{θ_i}. Table I shows the different preconditioners corresponding to the different optimization approaches.
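The standardized form can be sketched in a few lines of code. The preconditioner expressions follow Table I; the specific ε and η values are illustrative, and the bookkeeping for complex parameters is omitted.

```python
import numpy as np

def preconditioner(method, S, H, eps, eta):
    """Preconditioners P for the standardized update (sketch following Table I):
    GD: P = I/eps; natural GD: P = (S + eta*I)/eps; RGN: P = H + (S + eta*I)/eps."""
    n = S.shape[0]
    if method == "gd":
        return np.eye(n) / eps
    if method == "natural_gd":
        return (S + eta * np.eye(n)) / eps
    if method == "rgn":
        return H + (S + eta * np.eye(n)) / eps
    raise ValueError(method)

def update(theta, g, P):
    # theta_{i+1} = theta_i - P^{-1} g, computed by a linear solve
    return theta - np.linalg.solve(P, g)

# Illustrative example with small symmetric matrices
S = np.array([[2.0, 0.1], [0.1, 1.0]])
H = np.array([[1.0, 0.2], [0.2, 3.0]])
g = np.array([1.0, 1.0])
theta_gd = update(np.zeros(2), g, preconditioner("gd", S, H, eps=0.1, eta=1e-3))
```

Note the limiting behavior: as ε → ∞ the RGN preconditioner tends to H (a Newton-like step on the partial Hessian), while for small ε it matches the natural GD preconditioner, consistent with the discussion in Subsection III.3.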
Proposition 2, whose proof appears in Appendix A, quantifies the convergence rates of these preconditioned methods. The convergence rate in Proposition 2 depends on the matrix J, which vanishes at the ground state. Therefore, if the RGN method is applied with penalization parameters ε = εᵢ tending to infinity and wavefunctions ψ_{θᵢ} approaching the ground state, the rate of convergence is superlinear, i.e.,

$$\lim_{i\to\infty} \frac{E[\psi_{\theta_{i+1}}] - \lambda_0}{E[\psi_{\theta_i}] - \lambda_0} = 0.$$

In practice, our parametric class does not usually contain the exact ground state of H, but if ε is large and the energy minimizer is close to the ground state, then Proposition 2 still quantifies a fast linear convergence rate for RGN. In the numerical experiments presented in Section V, we achieve such a fast convergence rate by gradually moving the parameter ε⁻¹ closer to zero as we make progress in optimizing the wavefunction.

IV. VMC SAMPLING ANALYSIS
In this section, we review VMC sampling and prove a vanishing-variance principle that quantifies the sampling error in the estimated energies and gradients. Then, we discuss challenges in VMC sampling and motivate strategies to improve the sampling.

IV.1. VMC sampling
VMC requires quantities such as E, g, S, and H that are constructed as sums or integrals over a high-dimensional or infinite-dimensional state space. To compute such quantities, VMC relies on the power of Monte Carlo sampling. In VMC, we first generate a large number of samples σ₁, σ₂, …, σ_T from the normalized wavefunction density

$$\rho(\sigma) = \frac{|\psi_\theta(\sigma)|^2}{\|\psi_\theta\|^2}$$

using an appropriate Markov chain Monte Carlo (MCMC, [30]) sampler. Then we approximate E, g, and S using the estimators

$$\hat E = \mathbb{E}_\rho[E_L], \qquad \hat g_i = \mathrm{cov}_\rho(\nu_i, E_L), \qquad \hat S_{ij} = \mathrm{cov}_\rho(\nu_i, \nu_j),$$

and H is approximated by similar covariance formulas. Here, 𝔼_ρ and cov_ρ denote expectations and covariances with respect to the empirical measure

$$\rho_T = \frac{1}{T}\sum_{t=1}^{T} \delta_{\sigma_t},$$

and we have introduced the functions

$$E_L(\sigma) = \frac{(H\psi)(\sigma)}{\psi(\sigma)}, \qquad \nu_i(\sigma) = \frac{\psi_i(\sigma)}{\psi(\sigma)}.$$

The functions E_L and νᵢ are known as the local energy and logarithmic derivative, respectively.
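The covariance-based estimators above can be sketched directly from arrays of sampled local energies and logarithmic derivatives. This is a minimal sketch: the estimator for the Hessian block H involves further covariance terms and is omitted, and conjugation conventions for complex parameters follow our own bookkeeping.

```python
import numpy as np

def vmc_estimators(eloc, nu):
    """Monte Carlo estimators of E, g, and S from local energies eloc
    (shape [T]) and logarithmic derivatives nu (shape [T, p]).
    Sketch of the standard covariance forms."""
    E = eloc.mean()
    dnu = nu - nu.mean(axis=0)                           # centered log-derivatives
    g = (dnu.conj() * (eloc - E)[:, None]).mean(axis=0)  # cov(nu, E_L)
    S = dnu.conj().T @ dnu / len(eloc)                   # cov(nu, nu)
    return E, g, S

# Example: at an exact eigenstate the local energy is constant, so g vanishes
T, p = 200, 3
rng = np.random.default_rng(1)
nu = rng.standard_normal((T, p))
E_hat, g_hat, S_hat = vmc_estimators(np.full(T, -1.5), nu)
```

The example illustrates the zero-variance mechanism discussed below: a constant local energy makes the gradient estimator exactly zero, sample by sample.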
Next we state the vanishing-variance principle, which quantifies the asymptotic variance of several VMC estimators of interest.

Proposition 3. Assume the MCMC sampler is geometrically ergodic with respect to ρ and that, for some ε > 0, the moments 𝔼_ρ|E_L|^{4+ε} and 𝔼_ρ|νᵢ|^{4+ε} are finite. Then, as T → ∞,

$$\sqrt{T}\,\big(\hat E - E\big) \Rightarrow \mathcal{N}(0, v^2), \qquad \sqrt{T}\,\big(\hat g - g\big) \Rightarrow \mathcal{N}(0, \Sigma),$$

where the asymptotic variances v² and Σ are given by the stationary autocovariance sums

$$v^2 = \sum_{t=-\infty}^{\infty} \mathrm{cov}\big(E_L(\sigma_0),\, E_L(\sigma_t)\big), \qquad \Sigma = \sum_{t=-\infty}^{\infty} \mathrm{cov}\Big(\bar\nu(\sigma_0)\,[E_L(\sigma_0) - E],\; \bar\nu(\sigma_t)\,[E_L(\sigma_t) - E]\Big).$$

Proof. See Appendix A.
As a major takeaway from Proposition 3, the energy and energy gradient both have zero asymptotic variance, i.e., v² = 0 and Σ = 0, when the local energy E_L is constant, as occurs at any eigenstate of H. Proposition 3 can thus be viewed as a robust and quantitative extension of the classic zero-variance principle [1,2] of VMC. The vanishing-variance principle is robust, since it holds when the wavefunction is not an eigenstate, and quantitative, since it gives a precise formula for the asymptotic variance of the energy and energy gradient estimators.

IV.2. Improving estimation quality
Near an eigenstate, the vanishing-variance principle ensures the relative accuracy of VMC estimated energies. However, away from an eigenstate, VMC estimated energies and energy gradients can have a high variance and change erratically over the course of VMC estimation [11,24]. Therefore, variance reduction strategies are needed to ensure VMC's success.
Proposition 3 suggests three strategies for reducing the variance. The first strategy is to increase the number of Monte Carlo samples. We can do this either by running one MCMC sampler for a long time or by running many MCMC samplers in parallel and combining samples. The parallel sampling approach often leads to computational advantages, since vectorized code runs quickly on modern computers and MCMC samplers can be run on multiple nodes/cores to further cut down on the runtime. In the numerical experiments in Section V, we run 50 MCMC samplers per CPU core and use 48 CPU cores, thus running 2400 MCMC samplers in parallel.
The second variance reduction strategy is to reduce correlations among the samples σ₁, σ₂, …, σ_T by applying a fast-mixing MCMC method such as parallel tempering [12]. Parallel tempering introduces interacting MCMC samplers that target a sequence of densities ρ₁, ρ₂, …, ρ_m, where ρ_m = ρ is the density of interest and the remaining densities interpolate toward an easily sampled distribution. Periodically, the samplers targeting adjacent densities ρᵢ and ρ_{i+1} swap positions according to a Metropolis acceptance probability [31], which improves the mixing time for each of the samplers. Lastly, the samplers targeting ρ_m are used for estimating E, g, S, and H. Parallel tempering has reduced correlations in challenging VMC test problems in the past [11,32], and in Subsection V.2 we apply parallel tempering to improve the sampling for XXZ models on large lattices.

The third variance reduction strategy is to directly alter the VMC update formula to improve its stability. For example, [5] and [6] use an alternative gradient estimator in which the most extreme local energy values are adjusted to be closer to the median. Similarly, [4] rounds all positive gradient entries to +1, rounds all negative gradient entries to −1, and then assigns a random independent magnitude to each entry. The approaches [4-6] all improve the stability of VMC updates, but they discard gradient information that could potentially be helpful. Therefore, we adopt a slightly different stabilization approach in Section V. At each iteration, we check that the parameter update is less than twice as large as the previous parameter update (in Euclidean norm). If not, we shrink ε in half repeatedly until the parameter update is sufficiently small. This stabilization code eliminates the most erratic parameter updates in our experiments. The code is rarely triggered for TFI models (just 0-2 times per 1000 updates), but it is more commonly triggered for XXZ models (9-34 times per 1000 updates).
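The replica-swap step at the heart of parallel tempering can be sketched as follows. The tempered targets ρᵢ ∝ ρ^{βᵢ} and the inverse-temperature ladder below are illustrative assumptions; this is a generic sketch, not the tempering schedule used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(2)

def swap_replicas(xs, log_rho, betas):
    """One round of parallel-tempering swaps between adjacent replicas.
    Replica i targets rho_i(x) ∝ rho(x)**betas[i]; a swap between replicas
    i and i+1 is accepted with Metropolis probability
    min(1, exp((betas[i+1] - betas[i]) * (log_rho(x_i) - log_rho(x_{i+1}))))."""
    for i in range(len(xs) - 1):
        dlog = (betas[i + 1] - betas[i]) * (log_rho(xs[i]) - log_rho(xs[i + 1]))
        if np.log(rng.random()) < dlog:
            xs[i], xs[i + 1] = xs[i + 1], xs[i]   # exchange replica states
    return xs

# Example with a toy Gaussian target: swaps only permute the replica states
xs0 = [0.0, 1.0, 2.0, 3.0]
out = swap_replicas(list(xs0), lambda x: -0.5 * x * x, [0.25, 0.5, 0.75, 1.0])
```

Because the flatter (small-β) replicas move freely between modes, the swaps allow the β = 1 replica to escape metastable regions much faster than direct sampling.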

V. NUMERICAL EXPERIMENTS
To test the performance of VMC optimization and sampling methods, we estimate the ground-state energies for the transverse-field Ising (TFI) and XXZ models on 1-D and 2-D lattices with periodic boundary conditions. These models are specified by the Hamiltonians

$$H_{\mathrm{TFI}} = -\sum_{i\sim j} \sigma_i^z \sigma_j^z - h \sum_i \sigma_i^x, \tag{45}$$

$$H_{\mathrm{XXZ}} = \sum_{i\sim j} \left( -\sigma_i^x\sigma_j^x - \sigma_i^y\sigma_j^y + \Delta\,\sigma_i^z\sigma_j^z \right), \tag{46}$$

where h > 0 and Δ > 0 are positive-valued parameters. The XXZ model is sometimes alternatively defined as

$$H = \sum_{i\sim j} \left( \sigma_i^x\sigma_j^x + \sigma_i^y\sigma_j^y + \Delta\,\sigma_i^z\sigma_j^z \right), \tag{47}$$

which is a unitary transformation of (46), assuming a bipartite lattice. As a consequence of the Perron-Frobenius theorem and translational symmetry, the models (45) and (46) both admit unique, nonnegative, translationally invariant ground-state wavefunctions. For 1-D lattices, but not 2-D lattices, the ground-state wavefunctions are known exactly [33,34].
As a wavefunction ansatz, we use a restricted Boltzmann machine (RBM), which can be written as

$$\psi_{w,b}(\sigma) = \prod_{k=1}^{\alpha} \prod_{T} 2\cosh\!\big(b_k + w_k \cdot T\sigma\big).$$

Here, α is the hidden-variable density that controls the number of parameters, T ranges over the translation operators on the periodic lattice, and w and b are vectors of complex-valued parameters, called weights and biases. This ansatz is an example of a two-layer neural network and is a simplification of the RBM ansatz used for VMC optimization in [3]. The ansatz involves α(n + 1) parameters, where n is the number of spins; we set α = 5 for all of our numerical experiments. We report additional implementation details in Appendix B.
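A sketch of evaluating the log-amplitude of such a translation-symmetrized RBM on a 1-D ring follows. Real-valued parameters are used for simplicity, and the precise symmetrization convention is our assumption; it is meant only to illustrate the structure of the ansatz.

```python
import numpy as np

def log_psi_rbm(sigma, w, b):
    """log psi(sigma) = sum_k sum_t log(2 cosh(b_k + w_k . T_t sigma)),
    where T_t cyclically translates the 1-D configuration by t sites.
    Sketch of a translation-symmetrized RBM amplitude."""
    n = len(sigma)
    shifts = np.stack([np.roll(sigma, t) for t in range(n)])  # all translations, (n, n)
    pre = shifts @ w.T + b                                    # pre-activations, (n, alpha)
    return np.sum(np.log(2.0 * np.cosh(pre)))

# Small illustrative instance with alpha = 2 hidden units per site and n = 6 spins
alpha, n = 2, 6
rng = np.random.default_rng(3)
w = 0.01 * rng.standard_normal((alpha, n))
b = 0.01 * rng.standard_normal(alpha)
sigma = np.array([1, -1, 1, 1, -1, -1])
```

Because the product runs over the full translation group, the resulting amplitude is invariant under lattice translations of σ, which is the point of the symmetrization.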

V.1. Comparing optimization methods
To compare different VMC optimizers in the noiseless setting, we apply VMC to the TFI model on a 10 × 1 lattice, which is small enough that E, g, S, and H can be computed by exact summation without the need for Monte Carlo sampling. Figure 2 evaluates the performance of different optimizers in this setting, i.e., with deterministic parameter updates. The figure shows that RGN leads to faster convergence and lower errors than GD, natural GD, and the linear method.
In light of Proposition 2, we expect the most rapid energy convergence when the preconditioner is close to the true Wirtinger Hessian $\begin{pmatrix} H & \bar J \\ J & \bar H \end{pmatrix}$. Indeed, Figure 3 confirms that the Hessian approximation used in RGN closely approximates the true Hessian, in concordance with the fast observed convergence rate.

V.2. Challenges in VMC sampling
We next apply VMC to larger lattices by incorporating MCMC sampling. For the TFI model, we initialize the MCMC samplers from a configuration chosen uniformly at random and propose random updates based on flipping a single spin. For the XXZ model, we confine the MCMC samplers to 'balanced' configurations for which the magnetization is the same on both components of the bipartite lattice, since the ground-state wavefunction is only supported on these configurations. We initialize from a random balanced configuration and propose random balanced updates based on flipping two spins.
At every new optimization step, the MCMC samplers are continued from the final configurations at the previous step. The MCMC samplers are then run for 20 × n time steps, and the local energies and logarithmic derivatives are evaluated at intervals of n time steps, where n denotes the number of spins.
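The sampling protocol above can be sketched as a Metropolis sweep targeting ρ ∝ |ψ|² with single-spin-flip proposals, recording a sample every `thin` steps. This is a simplified stand-in for the samplers used in our experiments, written for a generic `log_psi` function.

```python
import numpy as np

rng = np.random.default_rng(4)

def mcmc_sweep(sigma, log_psi, n_steps, thin):
    """Metropolis sampling of rho ∝ |psi|^2 with single-spin-flip proposals,
    recording a sample every `thin` steps (a sketch of the protocol above)."""
    samples = []
    cur = 2.0 * log_psi(sigma)              # log of the unnormalized density
    for step in range(1, n_steps + 1):
        j = rng.integers(len(sigma))
        sigma[j] = -sigma[j]                # propose flipping one spin
        new = 2.0 * log_psi(sigma)
        if np.log(rng.random()) < new - cur:
            cur = new                       # accept the flip
        else:
            sigma[j] = -sigma[j]            # reject: undo the flip
        if step % thin == 0:
            samples.append(sigma.copy())
    return samples

# Example with a uniform wavefunction (every proposal is accepted)
out = mcmc_sweep(np.ones(4, dtype=int), lambda s: 0.0, n_steps=40, thin=4)
```

For the XXZ model, the single-spin flip would be replaced by a balanced two-spin flip, restricting the chain to the support of the ground-state wavefunction.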
The MCMC samplers are guaranteed to mix quickly when sampling the ground-state wavefunctions for the TFI model at h = ∞ or the XXZ model at Δ = −1. For these extreme parameter settings, every Metropolis proposal is accepted, the relaxation time for the TFI sampler is n/2 [35], and the relaxation time for the XXZ sampler is n/4 [36]. Yet, there is no guarantee that the MCMC samplers remain efficient for the h and Δ values more typically encountered in practice.
Indeed, the RGN and natural GD optimizers encounter difficulties when calculating ground-state energies for the XXZ model, as shown in Figure 4. Initially during the optimization, the RGN energies decrease quickly, but at iteration 360 (Δ = 1.5) or 370 (Δ = 1.0), the energies exhibit a large spike, which persists over roughly 1000 optimization steps. The natural GD energies exhibit a spike later, during iterations 1500-4000 (Δ = 1.5), which makes sense because the natural GD optimizer converges more slowly than the RGN optimizer overall.

The energy spikes are a major difference between exact VMC energies and energy estimates from MCMC sampling. The exact energies change slowly and continuously, as seen in Figure 2. However, the MCMC energy estimates can spike if a slowly-mixing MCMC sampler moves between metastable regions of configuration space. Indeed, Figure 5 establishes that all 2400 MCMC samplers typically lie in the ferromagnetic region of configuration space. At the onset of the energy spikes, a few MCMC samplers enter the antiferromagnetic region of configuration space, encountering high densities and high local energies that have not been experienced before. The densities and local energies are extremely large due to generalization error, and the optimizers require 1000+ iterations to respond to the new MCMC data and eliminate the spikes.

To improve the sampling for XXZ models, we apply the parallel tempering method described in Section IV.2 using fifty target densities. Parallel tempering speeds up the mixing of the MCMC samplers and reduces the magnitude and longevity of the energy spikes, as shown in Figure 6. With parallel tempering, the same number of MCMC samples (2400 × 20) are generated per iteration as in direct MCMC sampling, but the quality of the samples is much higher. The high-quality sampling reduces the generalization error and improves the overall stability of VMC.

V.3. Results for larger systems
Lastly, we apply VMC to estimate ground-state energies for TFI and XXZ models on lattices with up to 400 spins. We train highly accurate VMC wavefunctions for these large lattices by using RGN and (for XXZ models) parallel tempering. For 1-D systems, we compare the estimated ground-state energies against the exact energies in Table II. For 2-D systems, we report the estimated energies themselves in Table III. Summarizing Tables II and III, we find that RGN energies converge more quickly and achieve greater accuracy than natural GD energies. In 1-D lattices, RGN is more accurate than natural GD by up to four orders of magnitude, reaching error levels as low as 1.0 × 10⁻⁹ and 1.6 × 10⁻⁹. In 2-D lattices, for which exact reference energies are not available, the energy estimates obtained by RGN are typically lower than those obtained using natural GD, and the convergence is very fast. After 200 iterations, RGN is converged to 4-6 significant digits, whereas natural GD is only converged to 1-4 significant digits.
We further illustrate the comparison between natural GD and RGN for TFI models in Figures 7 and 8. These figures, showing the complete time series of energy estimates over 1000 optimization steps, demonstrate that RGN results after 200 iterations are typically more accurate than natural GD results after 1000 iterations. Because RGN is only slightly more expensive than natural GD (less than a factor of two in our experiments), we conclude that RGN makes it possible to obtain accurate ground-state estimates with reduced training time and computational cost.

VI. CONCLUSION
This work has analyzed VMC optimization and sampling methods, leading to both theoretical and computational advancements. First, we showed that the energy Hessian simplifies dramatically near an eigenstate, depending only on first derivatives of the wavefunction with respect to the parameters. Taking advantage of this simplification, we introduced a new Rayleigh-Gauss-Newton (RGN) optimizer that can achieve superlinear convergence. Second, we proved a vanishing-variance property that guarantees VMC energy estimates exhibit reduced variance near an eigenstate. This principle ensures accuracy in the energies near the ground state but not away from the ground state, so we suggested a parallel tempering approach to improve energy and gradient estimation for challenging test problems.
We highlight two opportunities for improving our optimization and sampling methods even further. First, for very large parametrizations, the linear system solve in the RGN method becomes numerically challenging. To address this difficulty, the Kronecker-factored approximate curvature method for efficient matrix inversion within natural gradient descent [5,37], in addition to the aforementioned matrix-free approach [28], could potentially be adapted to RGN. Second, while parallel tempering is a simple, broadly applicable enhanced sampling method, there exist a variety of alternative methods [38]. We anticipate that further analysis of enhanced MCMC sampling will play an important role in realizing the full potential of VMC in future applications.
Appendix A: Proofs

Proof of Proposition 2. We prove only the second result; the first result is well-known, and the proof is similar. Because the wavefunction is analytic at θ* and the Wirtinger Hessian is positive definite, the parameter updates and the energy ratios can be controlled as i → ∞, yielding the stated superlinear convergence rate.

Proof of Proposition 3. The Markov chain central limit theorem [39] guarantees the convergence in distribution of the empirical averages as T → ∞. Substituting these empirical averages into the identities defining the energy and gradient estimators yields the asymptotic variance formulas stated in Proposition 3.

Appendix B: Computations
Here, we discuss the computational details for our experiments. These computations are implemented using the JAX library for Python [40], and complete scripts and output are available on GitHub [41]. Using these scripts, estimating the ground-state wavefunction for a large lattice is relatively fast, requiring less than four days on a single 48-core CPU node (see Table IV). The resulting energies are presented in Table II for TFI models and Table III for XXZ models. We initialize our neural network wavefunction parameters as independent complex-valued N(0, 0.001) random variables, using a random seed of 123. We then update our parameters using GD, natural GD, LM, or RGN over 1000 iterations, as described in Sections III.2 and III.3. During the optimizations, we increase the penalization parameter from ε = ε_min to ε = ε_max and increase η from η = η_min to η = η_max at a geometric rate over τ iterations. Our specific choices of the parameters ε_min, ε_max, η_min, η_max, and τ are detailed in Table V.

ε_min: 0.001
ε_max: 0.01 for GD and natural GD; 1 for LM; 1000 for RGN
η_min: 0.001
η_max: 0.001 for natural GD; 0.1 for RGN
τ: 100 for deterministic updates; 500 for stochastic updates

TABLE V: Penalization parameters
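The geometric schedule for ε and η can be sketched as follows. Holding the parameter at its maximum after τ iterations is our assumption about the schedule's tail behavior.

```python
import numpy as np

def geometric_schedule(v_min, v_max, tau, i):
    """Penalization-parameter schedule (sketch): grow from v_min to v_max
    at a geometric rate over tau iterations, then hold at v_max."""
    if i >= tau:
        return v_max
    return v_min * (v_max / v_min) ** (i / tau)
```

For example, with the natural GD settings from Table V (ε_min = 0.001, ε_max = 0.01, τ = 100), the schedule multiplies ε by a constant factor each iteration until iteration 100.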
Before evaluating the wavefunction ψ(σ) or wavefunction derivative ψᵢ(σ), we check whether σ has 'mostly negative' magnetization, as defined by

$$\sum_i \sigma_i \le 0. \tag{B1}$$

If we encounter a configuration σ for which (B1) is not satisfied, we transform it to −σ. Indeed, the mostly negative configurations suffice to determine the complete wavefunction given the symmetry condition ψ(σ) = ψ(−σ), and VMC can lead to low-quality ground-state wavefunction estimates when this symmetry condition is not enforced.

When optimizing VMC wavefunctions, we occasionally encounter a sudden increase in the norm of the parameter updates, here defined as a factor of two or greater. When such a large update occurs, in addition to immediately restricting the size of the parameter update (by decreasing ε), we restore ε = ε_min and η = η_min and restart the geometric progression.
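The symmetry reduction above amounts to a one-line canonicalization. In this sketch, leaving zero-magnetization configurations unchanged is our tie-breaking assumption.

```python
import numpy as np

def canonicalize(sigma):
    """Map sigma to its 'mostly negative' representative, using the
    symmetry psi(sigma) = psi(-sigma).  Sketch: configurations with
    exactly zero magnetization are left unchanged (our assumption)."""
    return -sigma if sigma.sum() > 0 else sigma
```

Applying this transformation before every wavefunction evaluation enforces ψ(σ) = ψ(−σ) exactly, rather than relying on the optimizer to learn the symmetry.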
Lastly, to obtain the energy estimates reported in Tables II and III, we run the MCMC chains for an additional 2000 × n time steps and evaluate the local energies at intervals of n time steps.

FIG. 4. VMC can lead to energy spikes if direct MCMC sampling is used. Plot shows relative error in ground-state energy estimates for an XXZ model on a 100 × 1 lattice.

FIG. 5. Energy spikes occur when a few MCMC samplers enter the antiferromagnetic region of configuration space, defined by $\sum_{i\sim j} \sigma_i \sigma_j < 0$. Plot shows the number of samplers in the antiferromagnetic region, evaluated every 100 iterations.

FIG. 6. VMC recovers more quickly from energy spikes if parallel tempering is used.

TABLE I: Different preconditioners for energy minimization.

TABLE II: Relative errors in ground-state energy estimates after 1000 iterations. Lower errors are marked in bold.

TABLE III: Ground-state energy estimates, normalized by the number of sites. Changes between iteration 200 and iteration 1000 are marked in bold.

TABLE IV: Runtimes per 1000 optimization steps on a single 48-core CPU node, with 2400 × 20 MCMC samples per optimization step.