Automatic Differentiable Monte Carlo: Theory and Application

Differentiable programming has emerged as a key programming paradigm empowering rapid developments of deep learning, while its applications to important computational methods such as Monte Carlo remain largely unexplored. Here we present the general theory enabling infinite-order automatic differentiation on expectations computed by Monte Carlo with unnormalized probability distributions, which we call "automatic differentiable Monte Carlo" (ADMC). By implementing ADMC algorithms on computational graphs, one can also bring state-of-the-art machine learning frameworks and techniques to traditional Monte Carlo applications in statistics and physics. We illustrate the versatility of ADMC with two applications: fast search of phase transitions and accurately finding ground states of interacting many-body models in two dimensions. ADMC paves a promising way to innovate Monte Carlo in various aspects to achieve higher accuracy and efficiency, e.g., easing or solving the sign problem of quantum many-body models through ADMC.


I. INTRODUCTION
Differentiation is a broadly important concept and a widely useful method in subjects such as mathematics and physics. Automatic differentiation (AD) evaluates derivatives of any function specified by a computer program [1,2] by propagating derivatives of primitive operations via the chain rule. It differs from conventional symbolic differentiation by avoiding complicated analytic expressions of derivatives altogether, and it is advantageous over numerical differentiation by eliminating discretization errors entirely. Moreover, AD is particularly successful in calculating higher-order derivatives and in computing gradients with respect to a large number of variables, as is the case for gradient-based optimization algorithms. Emerging as a new programming paradigm, AD is now extensively utilized in machine learning. As one of the most important infrastructures for machine learning, it enables massive exploration of neural network structures.
The great application potential of AD in fields beyond machine learning has started to emerge. Specifically, AD has been applied to certain areas of computational physics; for instance, its interplay with tensor networks was investigated very recently [3][4][5][6]. One may naturally ask whether AD can be leveraged for Monte Carlo (MC) methods, another big family of computational algorithms. Initial investigations on encoding MC methods into the AD framework [7] all assumed normalized probability distributions for Monte Carlo sampling. However, for nearly all interesting problems the normalization factor is not known a priori and the probability distribution is usually unnormalized. Fortunately, knowing the ratio of probabilities between different configurations is sufficient to perform MC simulations in the Metropolis-Hastings algorithm [8]; Monte Carlo with unnormalized probability distributions is now a widely employed numerical approach in statistics and physics. Consequently, it is highly desired to integrate AD into generic Monte Carlo to achieve high accuracy and efficiency in various MC applications such as solving interacting many-body models in physics.
* These two authors contributed equally to this work.
† yaohong@tsinghua.edu.cn
In this paper, we fill in the gap by proposing the general theory enabling infinite-order automatic differentiation on expectations computed by Monte Carlo with unnormalized probability distributions, which we call "automatic differentiable Monte Carlo" (ADMC). Specifically, ADMC employs the method of AD to compute gradients of MC expectations, which is a key quantity used in statistics and machine learning [7], without a priori knowledge of normalization factor or partition function which is the case for nearly all application scenarios in Markov chain Monte Carlo (MCMC). As MC gradient problem lies at the core of probabilistic programming [9] and plays a central role in various fields including optimization, variational inference, reinforcement learning, and variational Monte Carlo (VMC), ADMC can be employed to a wide range of MC applications to achieve high accuracy and efficiency.
ADMC not only works for gradients of expectations in MC with unnormalized distributions but also holds for higher-order derivatives of MC expectations. In contrast, MC estimation of higher-order derivatives has rarely been considered [10]. In addition, ADMC can be embedded seamlessly in the general stochastic computational graph framework [11] and play a critical role at the interplay between differentiable programming and probabilistic programming. By introducing ADMC, we can build Monte Carlo applications on state-of-the-art machine learning infrastructure to achieve high accuracy and efficiency in addressing questions such as fast search of phase transitions and ground states of interacting quantum models. For the models we studied by ADMC, comparable or higher accuracy has been obtained compared with previous methods such as restricted Boltzmann machines (RBM) and tensor networks. Moreover, ADMC paves a promising way to innovate Monte Carlo in various aspects, e.g., easing or solving the sign problem [12][13][14][15][16][17][18][19][20][21] of quantum Monte Carlo (QMC) [22][23][24][25][26].
The organization of this paper is as follows. In Sec. II, we review important background knowledge required to understand the general theory of ADMC and its applications, including automatic differentiation, estimation of Monte Carlo gradients, and variational Monte Carlo methods. In Sec. III, we elaborate our theory of ADMC, including the detach function, ADMC estimators for normalized and unnormalized probability distributions, the general theory of the Fisher information matrix (FIM) with unnormalized probabilities, and the general theory of the AD-aware version of VMC (ADVMC). In Sec. IV, we present two explicit ADMC applications in physics: fast search of phase transitions and the critical temperature of the 2D Ising model, and an end-to-end general-purpose ADVMC algorithm that accurately finds the ground state of the 2D quantum spin-1/2 Heisenberg model. In particular, we demonstrate how to leverage the power of state-of-the-art machine learning in ADMC algorithms. In Sec. V, we further discuss other possible applications of ADMC as well as some outlooks.

II. BACKGROUND
In this section, we provide some background knowledge to make the paper self-contained. Specifically, we introduce the basics of AD, Monte Carlo gradient estimation, and variational Monte Carlo, which are related to the general theory and applications of ADMC.

A. Automatic differentiation
Conventional methods of computing gradients of a given function include symbolic and numerical approaches. It is challenging to symbolically compute gradients of complicated functions, as deriving the analytical expression of the gradient is often nearly impossible. Numerical differentiation computes the gradient by finite discretization and thus suffers from discretization errors. In addition, both conventional methods face further challenges or errors in computing higher-order derivatives, especially when the number of input parameters is large.
AD, on the contrary, by tracing the propagation of derivatives of primitive operations via the chain rule, can render numerically exact derivatives (including higher-order derivatives) for any program [1,2,27]. The program is specified by a computational graph composed of function primitives. Such a directed acyclic graph shows the data shape and data flow of the corresponding program.
There are two ways to compute derivatives on the graph with respect to the graph's inputs: forward AD and reverse (backward) AD.

FIG. 1. Forward mode (a) and reverse mode (b) automatic differentiation on computational graphs. Black arrows label the forward pass from inputs to outputs. Red arrows represent forward chain rules in (a) and backpropagation for adjoints in (b).

Forward AD iteratively computes the recursive expression shown in Fig. 1(a):

$$\frac{\partial T_i}{\partial T_0} = \frac{\partial T_i}{\partial T_{i-1}} \frac{\partial T_{i-1}}{\partial T_0}, \qquad (1)$$

where $T_i$ stands for the nodes on the computational graph; $T_0$ is the input and $T_n$ the final output. The gradient we aim to obtain is $\partial T_n / \partial T_0$. Here $\partial T_i / \partial T_{i-1}$ corresponds to the derivative of the operator primitive $T_i = f(T_{i-1})$, and these derivatives are implemented as AD-infrastructure built-ins or user customizations. One drawback of forward-mode AD is that one needs to keep track of every intermediate derivative $\partial T_i / \partial T_0$ when the input has many parameters, which is normally expensive and inefficient.
Reverse-mode AD avoids the inefficiency of forward-mode AD when the input parameters far outnumber the outputs, which is the case in many applications including machine learning and variational Monte Carlo. Defining the adjoint as $\bar{T}_i = \partial T_n / \partial T_i$, reverse-mode AD, as shown in Fig. 1(b), iteratively computes the recursive expression

$$\bar{T}_{i-1} = \bar{T}_i \frac{\partial T_i}{\partial T_{i-1}}. \qquad (2)$$

The final aim is to compute $\bar{T}_0$. In this approach, one first computes the output and saves all intermediate node values $T_i$ in the forward pass, and then backpropagates the gradients in the reverse pass. This workflow, known as backpropagation in machine learning language [28], is opposite to forward-mode AD, where all computations happen in the same forward pass. Reverse-mode AD is particularly suitable for scenarios with many input parameters and one output value, which is the case in most deep learning setups [29] and many MC approaches such as variational Monte Carlo.
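The adjoint recursion above can be sketched in a short, framework-free program. The following is a minimal reverse-mode AD for scalar graphs; the `Var` class and its operations are illustrative toy constructions, not any library's API:

```python
import math

class Var:
    """Scalar node in a computational graph supporting reverse-mode AD."""
    def __init__(self, value, parents=()):
        self.value = value          # forward-pass value T_i
        self.parents = parents      # pairs (parent node, local derivative dT_i/dT_parent)
        self.adjoint = 0.0          # accumulated adjoint bar(T_i) = dT_n/dT_i

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value, [(self, other.value), (other, self.value)])

    def exp(self):
        v = math.exp(self.value)
        return Var(v, [(self, v)])

def backward(output):
    """Backpropagate adjoints from the output node through the graph."""
    output.adjoint = 1.0
    order, seen = [], set()
    def visit(node):                # topological order, output last
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    for node in reversed(order):    # reverse pass: bar(T_{i-1}) += bar(T_i) * dT_i/dT_{i-1}
        for parent, local_grad in node.parents:
            parent.adjoint += node.adjoint * local_grad

# Example graph: f(x, y) = exp(x * y) + x at x = 0.5, y = 2.0
x, y = Var(0.5), Var(2.0)
f = (x * y).exp() + x
backward(f)
# x.adjoint == y * exp(x*y) + 1 and y.adjoint == x * exp(x*y)
```

One forward pass plus one reverse pass yields the gradient with respect to all inputs at once, which is exactly why this mode dominates when inputs outnumber outputs.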

B. Gradients of Monte Carlo expectations
As explained in the introduction, it is of great importance in various fields to compute gradients of Monte Carlo expectation values $\nabla_\theta \langle O(x, \theta) \rangle_{p(x,\theta)}$, where $\theta$ represents the set of input parameters, $x$ labels MC configurations, $p(x, \theta)$ is the (generally unnormalized) probability distribution, and $\langle O(x, \theta) \rangle_{p(x,\theta)}$ is the MC expectation value of $O$ under the probability distribution $p(x, \theta)$. We often call $O$ the loss function. Currently, there are two main methods for evaluating MC gradients: the score function estimator [30] (also known as REINFORCE [31]) and the pathwise estimator (also known as the reparametrization trick [32], stochastic backpropagation [33], or the "push out" method [34]). Although the pathwise estimator in general gives lower variance for MC gradient estimation, it can only be applied in quite limited settings due to strict requirements on the differentiability of the transformations and probability distributions. Therefore, it is nearly impossible to apply the pathwise estimator to evaluate MC gradients sampled from the vastly complicated distributions encountered in most physics problems. We thus focus on the score function method in the present paper, as it is more universal and general-purpose.
The score function estimator is a general-purpose MC gradient estimator with gradient given by

$$\nabla_\theta \langle O(x,\theta) \rangle_p = \big\langle O(x,\theta)\, \nabla_\theta \ln p + \nabla_\theta O(x,\theta) \big\rangle_p, \qquad (3)$$

where $p$ is a shortcut for $p(x, \theta)$. Note that Eq. (3) is quite general, taking into account the dependence of the loss function $O$ on the parameters $\theta$. To leverage AD for gradient estimation, it is desirable to construct an AD-aware version of the MC expectation which correctly reproduces the MC expectation itself together with its derivatives to all orders. For a normalized probability distribution $p$, the following AD-aware version of the MC expectation was proposed [10]:

$$\langle O \rangle^{AD} = \Big\langle \frac{p}{\perp(p)}\, O \Big\rangle_{x \in S(p)}, \qquad (4)$$

where $\perp$ denotes the detach (stop-gradient) function defined in Sec. III A and $S(p)$ denotes the set of samples drawn from $p$. However, for nearly all interesting physics problems, the normalization factor is not known a priori and the probability distribution is usually unnormalized. It is of central importance to sample from such unnormalized probability distributions in applications such as computing physical quantities without knowing the partition function a priori, or approximating the posterior distribution of latent variables with knowledge only of the likelihood and the prior. Nevertheless, MC gradient estimation from such unnormalized probability distributions had not been constructed by any previous method. In Sec. III of the present paper, we develop a general framework and construct the AD-aware MC expectation which correctly reproduces both the expectation value and all of its higher-order derivatives for unnormalized probability distributions.
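As a concrete numerical check of Eq. (3), the following sketch enumerates a small softmax model exactly (exact enumeration standing in for MC sampling) and compares the score-function gradient against a finite-difference derivative of the expectation. The three-state model and the observable are illustrative assumptions, not taken from the paper:

```python
import math

# Three-state softmax model with logits a_x * theta and observable
# O(x, theta) = theta * o_x (both chosen for illustration only).
A = (1.0, 2.0, -1.0)
OBS = (1.0, 2.0, 3.0)

def probs(theta):
    m = max(a * theta for a in A)
    ws = [math.exp(a * theta - m) for a in A]
    s = sum(ws)
    return [w / s for w in ws]

def expectation(theta):
    return sum(p * theta * o for p, o in zip(probs(theta), OBS))

def score_function_gradient(theta):
    # <O d ln p/d theta + dO/d theta>_p, i.e. Eq. (3), with exact enumeration
    # standing in for sampling.
    p = probs(theta)
    mean_a = sum(pi * a for pi, a in zip(p, A))
    grad = 0.0
    for pi, a, o in zip(p, A, OBS):
        dlnp = a - mean_a              # d ln p_x / d theta for a softmax model
        grad += pi * ((theta * o) * dlnp + o)
    return grad

theta = 0.3
fd = (expectation(theta + 1e-6) - expectation(theta - 1e-6)) / 2e-6
sf = score_function_gradient(theta)
# fd and sf agree: the score-function estimator reproduces the true gradient,
# including the explicit theta-dependence of O.
```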

C. Variational Monte Carlo in physics
VMC is a powerful numerical algorithm for searching the ground state of a given quantum Hamiltonian, based on the variational principle that a physical Hamiltonian has energy bounded from below [35,36]. By sampling the amplitudes of a variational wave function $|\psi_\theta\rangle$, where $\theta$ represents the set of variational parameters, one can compute the energy expectation $E_\theta = \langle\psi_\theta| H |\psi_\theta\rangle / \langle\psi_\theta|\psi_\theta\rangle$, where the ansatz wave function $|\psi_\theta\rangle$ is in general not normalized. The energy expectation can be evaluated through MC:

$$E_\theta \doteq \big\langle E_{loc}(\sigma, \theta) \big\rangle_{p(\sigma,\theta)}, \qquad (5)$$

where $E_{loc}(\sigma, \theta) = \frac{\langle\sigma|H|\psi_\theta\rangle}{\langle\sigma|\psi_\theta\rangle}$, $\sigma$ runs over a complete basis of the quantum system's Hilbert space, and $p(\sigma, \theta) = |\langle\sigma|\psi_\theta\rangle|^2$ is the probability distribution. Note that the probability distribution $p(\sigma, \theta)$ is in general unnormalized, since the ansatz wave function $\psi_\theta(\sigma) = \langle\sigma|\psi_\theta\rangle$ is in general unnormalized (it is usually challenging to normalize the ansatz wave function due to its complicated structure). Since $E_\theta$ depends on the variational parameters $\theta$, one can in principle optimize $E_\theta$ obtained by Eq. (5) against $\theta$, giving rise to the optimal ground-state energy and wave function within the ansatz.
Stochastic gradient descent (SGD) is the de facto standard for optimization in machine learning [37][38][39] and can also be employed in computational physics, e.g. for optimization in VMC [40]. There are various generalizations beyond the vanilla SGD optimizer that incorporate momentum and adaptive behaviors, amongst which Adam [41] is a common optimizer for training neural networks. Natural gradient descent, a concept that emerged from information geometry, is an optimization technique in which the local curvature of the distribution space defined by neural networks is taken into account [42][43][44][45]. Efficient approximations of the natural gradient have also been investigated, such as FANG [46] and K-FAC [47][48][49][50][51]. For optimization problems such as VMC, gradient descent and natural gradient descent methods can be applied, and various machine learning techniques can be utilized to boost VMC. Natural gradient optimization is exactly equivalent to the stochastic reconfiguration (SR) method [52,53] in VMC [54][55][56][57].
Recently, there have been various studies focusing on using restricted Boltzmann machines (RBM) or related neural networks as ansatz wave functions for quantum systems composed of spins [55,[58][59][60][61][62][63][64][65][66][67][68], bosons [54,69,70], and fermions [56,71]. In previous studies, to incorporate such wave function ansatz into the framework of VMC, one either computes all derivatives $\nabla_\theta \psi_\theta(\sigma)$ analytically when the neural network ansatz is simple enough [58] or applies AD to the wave function to compute $\nabla_\theta \psi_\theta(\sigma)$ [67], and then estimates the gradient $\nabla_\theta E_\theta$ by MC sampling via (for real $\psi_\theta$)

$$\nabla_\theta E_\theta = 2\big[\big\langle E_{loc}(\sigma, \theta)\, \nabla_\theta \ln \psi_\theta(\sigma) \big\rangle - \big\langle E_{loc}(\sigma, \theta) \big\rangle \big\langle \nabla_\theta \ln \psi_\theta(\sigma) \big\rangle\big], \qquad (6)$$

where $\langle \cdot \rangle$ denotes MC sampling of configurations $\sigma$ from the probability distribution $|\psi_\theta(\sigma)|^2$. However, applying AD directly to the energy expectation $E_\theta = \langle\psi_\theta|H|\psi_\theta\rangle$ to obtain the gradient $\nabla_\theta E_\theta$, the most intuitive way to optimize the ground state, was still lacking, partly due to the lack of an AD technique for MC expectations sampled from unnormalized probability distributions. With the introduction of ADMC in this work, we can implement AD-aware VMC, which directly optimizes the energy expectation without any analytical derivation of derivatives for MC expectations or wave functions and is therefore much more straightforward to implement; we call it "end-to-end" ADVMC.

III. THEORY
In this section, we present the general theory of ADMC, which enables infinite-order automatic differentiation on MC expectations with unnormalized probability distributions, along with detailed derivations.

A. Detach function
We first introduce the detach function $\perp(x)$, which features $\perp(x) = x$ and $\frac{\partial \perp(x)}{\partial x} = 0$. Here we list some basic formulas for detach functions utilized later: $f(\perp(x)) = \perp(f(x))$, $\perp(\perp(x)) = \perp(x)$, $\perp(x + y) = \perp(x) + \perp(y)$, and $\perp(xy) = \perp(x)\perp(y)$. The detach function is easily implemented in modern machine learning frameworks (it corresponds to stop_gradient in TensorFlow [72] and detach in PyTorch [73]); we call this primitive the detach function in this work. This odd-looking function has a natural interpretation in the context of machine learning, especially in terms of computational graphs: such an operator corresponds to a node in the graph which passes values forward while stopping the backpropagation of gradients.
Utilizing the detach function, we can construct functions whose derivatives at each order are mutually independent. For example, the function $O(x) = x - \perp(x)$ equals 0 irrespective of $x$, although its first-order derivative is 1. The detach function is mathematically sound, as we shall prove a completeness theorem for it below.
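The detach algebra can be simulated with a few lines of forward-mode "dual number" arithmetic, making the decoupling of value and derivative concrete. The `Dual` class below is a toy stand-in for a real AD framework, assumed only for illustration:

```python
class Dual:
    """Forward-mode AD pair (value, derivative with respect to x)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    def __sub__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val - other.val, self.dot - other.dot)
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def detach(t):
    """The detach function: same value, zero derivative (cf. tf.stop_gradient)."""
    return Dual(t.val, 0.0)

x = Dual(3.0, 1.0)          # seed dx/dx = 1
y = x - detach(x)           # O(x) = x - detach(x): value 0, yet derivative 1
z = detach(x) * detach(x)   # detach(x) * detach(x) = detach(x^2): value 9, derivative 0
```

The pair `y` realizes exactly the example in the text: its value vanishes for every $x$ while its first derivative is 1, and `z` illustrates the rule $\perp(x)\perp(y) = \perp(xy)$.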
Theorem 1 Any "weird" function, whose value and derivatives of every order are defined separately, can always be expressed by "normal" functions together with the detach function $\perp$.
Proof. For a function $\mathcal{F}(x)$ whose derivatives of each order are defined as $\mathcal{F}^{(n)}(x) = h_n(x)$, the construction with $\perp$ is:

$$\mathcal{F}(x) = \sum_{n=0}^{\infty} \frac{h_n(\perp(x))}{n!} \big(x - \perp(x)\big)^n. \qquad (7)$$

When translated into TensorFlow language, Theorem 1 states that every function defined with tf.custom_gradient can instead be defined with tf.stop_gradient.
Corollary 1 For any multivariate "weird" function $\mathcal{F}(x_1, \ldots, x_m)$ whose derivatives $\mathcal{F}^{(n_1 \ldots n_m)}$ are defined separately, irrelevant to the original function, it can always be expressed by "normal" functions together with the single-variate detach function $\perp$.
The corollary above is obvious by considering a Taylor-expansion construction similar to Eq. (7). The introduction of the imaginary number $i$ enlarges the meaning of the equal sign to two equivalence relations: $x = y \Leftrightarrow \mathrm{Re}(x) = \mathrm{Re}(y)$ and $\mathrm{Im}(x) = \mathrm{Im}(y)$. Similarly, with the introduction of the detach function, the equal sign is enlarged to infinitely many independent equivalence relations: $\perp(f^{(n)}(x)) = \perp(g^{(n)}(x))$, $n = 0, 1, 2, \cdots$. The conventional "equal" is re-expressed as the $n = 0$ relation of the above series: $\perp(f(x)) = \perp(g(x))$.

B. ADMC
We are now ready to construct a general theory for MC expectations which renders AD able to correctly obtain derivatives at all orders (including the zeroth-order derivative, the expectation itself). We employ the extended score function method to enable AD on MC expectations for any complicated distribution, both normalized and unnormalized. Theorem 2 below is the central theoretical result of the present paper.
Theorem 2 The MC estimator

$$\langle O \rangle^{AD} = \frac{\sum_{x \in S(p)} \frac{p}{\perp(p)}\, O}{\sum_{x \in S(p)} \frac{p}{\perp(p)}} \qquad (8)$$

is automatically differentiable to all orders and works for both normalized and unnormalized probability distributions $p = p(x, \theta)$.
To prove Theorem 2, we first introduce the following lemma:

Lemma 1 For both normalized and unnormalized probability distributions $p = p(x, \theta)$,

$$\sum_{x \in S(p)} \frac{p}{\perp(p)} \doteq \frac{Z}{\perp(Z)}. \qquad (9)$$

Here $Z$ is the shortcut for the partition function $Z_\theta = \sum_{x \in all} p(x, \theta)$, with $x \in all$ representing the summation over all configurations $x$, and $\sum_{x \in S(p)}$ denotes the average obtained through MC sampling according to the probability distribution $p$. Note that, for brevity, we omit the $\frac{1}{N_s}$ factor before the MC sum $\sum_{x \in S(p)}$ in Eq. (9) and hereafter; the sum should be understood as the average $\frac{1}{N_s}\sum_{x \in S(p)}$, where $N_s$ is the number of sampled configurations. In Eq. (9) and hereafter, "$\doteq$" means the same as the equal sign, since the MC estimation can be made exact in the limit of large $N_s$; the equality also holds for derivatives of any order. Therefore, to prove the lemma we just need to demonstrate the following formula:

$$\perp\!\Big(\nabla_\theta^{n} \sum_{x \in S(p)} \frac{p}{\perp(p)}\Big) = \perp\!\Big(\nabla_\theta^{n} \frac{Z}{\perp(Z)}\Big), \qquad (10)$$

where $\nabla_\theta^{n} = \partial^{n}/\partial\theta_1^{n_1} \cdots \partial\theta_m^{n_m}$ with $n_j = 0, 1, 2, \cdots$. For $n = 0$, the equation is simply true since both sides give 1. For arbitrary $n$, it is straightforward to show that

$$\perp\!\Big(\nabla_\theta^{n} \sum_{x \in S(p)} \frac{p}{\perp(p)}\Big) \doteq \perp\!\Big(\sum_{x \in all} \frac{p}{Z} \frac{\nabla_\theta^{n} p}{p}\Big) = \perp\!\Big(\frac{\nabla_\theta^{n} Z}{Z}\Big) = \perp\!\Big(\nabla_\theta^{n} \frac{Z}{\perp(Z)}\Big), \qquad (11)$$

which finishes the proof of the lemma.

The proof above can be significantly simplified. With the enlarged meaning of the equal sign, derivatives of every order automatically agree as long as expressions with the detach function are treated accordingly. In other words, to prove that some relation holds for derivatives of any order, $\perp(f^{(n)}(x)) = \perp(g^{(n)}(x))$ with $n = 0, 1, 2, \cdots$, we only need to prove that $f(x) = g(x)$ is true. This simplification is the power of the detach function. The proof of the lemma can thus be shortened to:

$$\sum_{x \in S(p)} \frac{p}{\perp(p)} \doteq \sum_{x \in all} \perp\!\Big(\frac{p}{Z}\Big) \frac{p}{\perp(p)} = \frac{1}{\perp(Z)} \sum_{x \in all} p = \frac{Z}{\perp(Z)}. \qquad (12)$$

Note that the simplification from the involved proof in Eqs. (10) and (11) to the neat one in Eq. (12) reflects the brevity and power of the detach function and its algebra.

Now we are ready to prove Theorem 2. Proving the theorem is equivalent to showing the following equation (together with all of its derivatives):

$$\Big\langle \frac{p}{\perp(p)}\, O \Big\rangle_{\perp(p)} \Big/ \Big\langle \frac{p}{\perp(p)} \Big\rangle_{\perp(p)} = \langle O \rangle_{\frac{p}{Z}}, \qquad (13)$$

where $O$ is the shortcut for $O(x, \theta)$.
Note that, in the average $\langle \cdot \rangle_{\perp(p)}$, the probability distribution $\perp(p)$ is the background distribution of the sampling and is not involved in derivatives. Based on the spirit of the detach-function algebra, it is enough to show:

$$\Big\langle \frac{p}{\perp(p)}\, O \Big\rangle_{\perp(p)} = \frac{Z}{\perp(Z)}\, \langle O \rangle_{\frac{p}{Z}}. \qquad (14)$$

By utilizing the lemma in Eq. (9) and observing the fact that $Z_\theta$ is independent of $x$, it is straightforward to prove Eq. (14) as follows:

$$\Big\langle \frac{p}{\perp(p)}\, O \Big\rangle_{\perp(p)} = \sum_{x \in all} \perp\!\Big(\frac{p}{Z}\Big) \frac{p}{\perp(p)}\, O = \frac{1}{\perp(Z)} \sum_{x \in all} p\, O = \frac{Z}{\perp(Z)}\, \langle O \rangle_{\frac{p}{Z}}. \qquad (15)$$

This finishes the proof of Theorem 2, which is the central result of the present paper. We believe Theorem 2 can provide endless opportunities to build applications combining AD infrastructure with MC algorithms. We emphasize that Theorem 2 is general and applies to both normalized and unnormalized probability distributions $p$. For the case of a normalized distribution, $\sum_{x \in all} p = 1$, we obtain $\langle O \rangle^{AD} \doteq \sum_{x \in S(p)} \frac{p}{\perp(p)} O$, which is the MC estimator applicable only to normalized probability distributions. For nearly all interesting applications with unnormalized probability distributions $p$, Theorem 2 is the correct one to use, as we demonstrate in the applications below.
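Theorem 2 can be checked numerically on a small discrete model. The sketch below works in the exact-sampling limit (the sampled average is replaced by the exact average under the detached distribution) and uses a toy forward-mode `Dual` class in place of a real AD framework; the unnormalized weight $p(x,\theta) = e^{\theta x}$ and the observable are illustrative assumptions:

```python
import math

class Dual:
    # value + first derivative with respect to theta
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)
    __rmul__ = __mul__
    def __truediv__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val / o.val,
                    (self.dot * o.val - self.val * o.dot) / o.val ** 2)
    def exp(self):
        v = math.exp(self.val)
        return Dual(v, v * self.dot)

def detach(t):
    return Dual(t.val, 0.0)

xs = [0.0, 1.0, 2.0]
theta0 = 0.4

def p(x, theta):                     # unnormalized weight exp(theta * x)
    return (theta * x).exp() if isinstance(theta, Dual) else math.exp(theta * x)

def O(x, theta):                     # observable with explicit theta dependence
    return x * x + theta * x

def exact(theta):                    # true <O> under p/Z, as a plain function
    Z = sum(math.exp(theta * x) for x in xs)
    return sum(math.exp(theta * x) * (x * x + theta * x) for x in xs) / Z

# ADMC estimator of Theorem 2: MC weights are the detached probabilities
# p(x, theta0)/Z(theta0); the ratio p/detach(p) carries the theta dependence.
theta = Dual(theta0, 1.0)
Z0 = sum(math.exp(theta0 * x) for x in xs)
num = sum((p(x, theta0) / Z0) * (p(x, theta) / detach(p(x, theta))) * O(x, theta)
          for x in xs)
den = sum((p(x, theta0) / Z0) * (p(x, theta) / detach(p(x, theta)))
          for x in xs)
admc = num / den

fd = (exact(theta0 + 1e-6) - exact(theta0 - 1e-6)) / 2e-6
# admc.val reproduces <O>(theta0) and admc.dot reproduces d<O>/d theta,
# even though p is never normalized inside the estimator.
```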
It is worthwhile to provide a heuristic explanation of Theorem 2. By discretizing the parameters $\theta$ as in numerical differentiation, a rigorous MC gradient can be obtained in the limit of zero discretization interval. Specifically, to get gradients at $\theta_0$, one can directly compute MC expectations of $O$ by sampling separately from $p(\theta)$ and from $p(\theta_0)$, with $\theta$ very close to $\theta_0$. However, it is highly inefficient to sample separately from the two distributions. As $\theta$ is close to $\theta_0$ (in the limit $\theta \to \theta_0$), one can actually reuse the samples from $p(\theta_0)$ to evaluate the expectation at $\theta$ by reweighting:

$$\langle O \rangle_{p(\theta)} \approx \frac{\sum_{x \in S(p(\theta_0))} \frac{p(x,\theta)}{p(x,\theta_0)}\, O(x, \theta)}{\sum_{x \in S(p(\theta_0))} \frac{p(x,\theta)}{p(x,\theta_0)}}, \qquad (16)$$

which is precisely the structure of Eq. (8), with $p(\theta_0)$ playing the role of $\perp(p)$. Finally, we make a note on implementation. For numerical stability, $\ln p$ instead of $p$ is in general referenced, and the AD version of the MC estimator for a generic probability distribution $p$ is then given by:

$$\langle O \rangle^{AD} = \frac{\sum_{x \in S(p)} e^{\ln p - \perp(\ln p)}\, O}{\sum_{x \in S(p)} e^{\ln p - \perp(\ln p)}}. \qquad (17)$$

From the computational-graph perspective, $p$ is never explicitly calculated, since the numerical value of $\exp(\ln p - \perp(\ln p))$ is exactly one. Therefore, the ADMC approach using $\ln p$ is automatically free from the numerical instability encountered in approaches directly using $p$.

C. Fisher information matrix and KL divergence in ADMC
For the optimization method of natural gradient descent, the parameters $\theta$ are updated as $\theta_{t+1} = \theta_t - \lambda F^{-1} \nabla_\theta \mathcal{L}$, where $\lambda$ is the learning rate and $\mathcal{L}$ the objective. The FIM $F$ is of great importance in numerical optimization and, for a normalized distribution $p(x, \theta)$, is defined as

$$F_{ij} = \big\langle \partial_{\theta_i} \ln p\; \partial_{\theta_j} \ln p \big\rangle_{p}, \qquad (18)$$

where $i, j$ label the components $\theta_i, \theta_j$. The FIM is also the Hessian (with respect to $\theta'$) of the KL divergence between $p(x, \theta)$ and $p(x, \theta')$ as $\theta'$ approaches $\theta$. Hence, it defines the local curvature in distribution space.
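The FIM-as-Hessian-of-KL statement can be verified numerically on the simplest possible model. The sketch below uses a Bernoulli distribution with a logistic parametrization (an illustrative choice), for which the FIM reduces analytically to $p(1-p)$, and compares it with a second difference of the KL divergence:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def kl(theta0, theta):
    """KL divergence between Bernoulli(sigmoid(theta0)) and Bernoulli(sigmoid(theta))."""
    p0, p = sigmoid(theta0), sigmoid(theta)
    return p0 * math.log(p0 / p) + (1 - p0) * math.log((1 - p0) / (1 - p))

theta0 = 0.7
h = 1e-4

# FIM as the Hessian (second difference) of the KL divergence at theta = theta0
fim_from_kl = (kl(theta0, theta0 + h) - 2 * kl(theta0, theta0)
               + kl(theta0, theta0 - h)) / h ** 2

# FIM from the score-function definition: E[(d ln p / d theta)^2].
# For the logistic model this evaluates analytically to p (1 - p).
p = sigmoid(theta0)
fim_direct = p * (1 - p)
# The two agree: the KL Hessian is the local curvature measured by the FIM.
```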
In the following, we derive useful formulas related to the FIM with an unnormalized probability distribution $p$ in the context of ADMC. For unnormalized $p$, the expectation of the score function is not zero; instead it is given by $\langle \nabla_\theta \ln p \rangle_{x \in S(p)} \doteq \nabla_\theta \ln Z$. The FIM for unnormalized $p$ can then be defined as

$$F_{ij} = \big\langle \partial_{\theta_i} \ln p\; \partial_{\theta_j} \ln p \big\rangle_{x \in S(p)} - \big\langle \partial_{\theta_i} \ln p \big\rangle_{x \in S(p)} \big\langle \partial_{\theta_j} \ln p \big\rangle_{x \in S(p)}. \qquad (19)$$

To apply the AD approach, we can obtain the FIM through the KL divergence, whose Hessian is the FIM. The AD-aware KL divergence is given by

$$D_{KL}\big(\perp(p)\,\|\,p\big) = -\Big\langle \ln \frac{p}{\perp(p)} \Big\rangle_{x \in S(p)} + \ln \frac{Z}{\perp(Z)} \doteq -\Big\langle \ln \frac{p}{\perp(p)} \Big\rangle_{x \in S(p)} + \ln \Big\langle \frac{p}{\perp(p)} \Big\rangle_{x \in S(p)}, \qquad (20)$$

where the second equality is due to Eq. (9) in Lemma 1. Therefore, for any unnormalized $p$, we can construct the objective function in Eq. (20) and compute its Hessian by ADMC. This approach is preferable to direct estimation from Eq. (19) in some scenarios (see the SM [74] for details). Following the path of Eq. (20), we can further derive the AD-aware formula for the general KL divergence between unnormalized probabilities $p$ and $q$ parameterized by $\theta$:

$$D_{KL}(p\,\|\,q) \doteq \frac{\big\langle \frac{p}{\perp(p)} \ln\frac{p}{q} \big\rangle_{x \in S(p)}}{\big\langle \frac{p}{\perp(p)} \big\rangle_{x \in S(p)}} - \ln \frac{\big\langle \frac{p}{\perp(p)} \big\rangle_{x \in S(p)}}{\big\langle \frac{q}{\perp(q)} \big\rangle_{x \in S(q)}} + \perp\!\Big(\ln \frac{Z_q}{Z_p}\Big), \qquad (21)$$

where the last term is a detached constant that does not affect any derivative.

D. End-to-end ADVMC

As discussed in Sec. II, VMC is an important approach attempting to find the ground-state wave function of a Hamiltonian by optimizing parametrized wave functions. Here we describe how to implement end-to-end VMC with ADMC, which we call ADVMC. We focus on the case where the ansatz wave functions are positively valued; for the general case of complex-valued ansatz wave functions, ADVMC can also be implemented (see the SM [74] for details). As in Eq. (5), the energy expectation value $E_\theta = \langle\psi_\theta| H |\psi_\theta\rangle / \langle\psi_\theta|\psi_\theta\rangle$ of a Hamiltonian $H$ with wave function $\psi_\theta(\sigma) = \langle\sigma|\psi_\theta\rangle$ can be evaluated through Monte Carlo sampling

$$E_\theta \doteq \big\langle E_{loc}(\sigma, \theta) \big\rangle_{p(\sigma,\theta)}, \qquad (22)$$

where $p(\sigma, \theta) = |\psi_\theta(\sigma)|^2$ is a usually unnormalized probability distribution. To optimize (minimize) $E_\theta$ using a gradient-based approach, we need to evaluate the gradient with respect to the variational parameters $\theta$:

$$\nabla_\theta E_\theta = \nabla_\theta \big\langle E_{loc}(\sigma, \theta) \big\rangle_{p(\sigma,\theta)}. \qquad (23)$$

It is clear that $E_{loc}(\sigma, \theta)$ in VMC plays a role similar to $O(x, \theta)$ in the MC setting discussed earlier. It is therefore natural to integrate AD into VMC so that an end-to-end ADVMC can be constructed.
The ADVMC version of the energy estimator can be constructed from Theorem 2 as:

$$E_\theta^{AD} = \mathrm{Re}\, \frac{\sum_{\sigma \in S(p)} \frac{p}{\perp(p)}\, E_{loc}(\sigma, \theta)}{\sum_{\sigma \in S(p)} \frac{p}{\perp(p)}}, \qquad (24)$$

where $\mathrm{Re}$ guarantees that the energy estimator is real. Taking into account the variance-reduction trick, the ADVMC energy estimator for a real wave function can also be constructed as [74]:

$$E_\theta^{AD} = \big\langle \perp(E_{loc}) \big\rangle_{\sigma \in S(p)} + 2\, \big\langle \big(\perp(E_{loc}) - \perp(\langle E_{loc} \rangle)\big) \big(\ln \psi_\theta - \perp(\ln \psi_\theta)\big) \big\rangle_{\sigma \in S(p)}. \qquad (25)$$

The objective in Eq. (25) actually has better performance than the original estimator in Eq. (24), since $E_{loc}(\sigma, \theta)$ is detached in Eq. (25) and no further backpropagation behind this node is needed. Note that Eq. (25) as an estimator of $E_\theta$ can only reproduce the first-order derivative within the AD framework, while the original estimator in Eq. (24) is correct for derivatives of all orders (see the SM [74] for details). The end-to-end ADVMC framework is universal and easy to implement. Instead of computing derivatives of wave functions and plugging the results into the formula for energy gradients by hand, as conventional VMC approaches do in Eq. (6), end-to-end ADVMC optimizes the energy expectation directly and leaves all remaining work to the machine learning infrastructure. Analytic and implementation work is done automatically by the AD infrastructure, vectorization/broadcast mechanisms, built-in optimizers, and GPU acceleration provided by standard ML frameworks. For different quantum models, the only difference is a different $E_{loc}(\sigma, \theta)$. After implementing $E_{loc}$, we plug it into Eq. (25) as the AD-aware energy estimator. Then we can use AD to compute the gradients and a gradient-based optimizer to optimize the energy.
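The ADVMC loop can be illustrated end to end on the smallest nontrivial model: a two-site spin-1/2 Heisenberg dimer, whose exact ground-state energy is $-3/4$. In the dependency-free sketch below, exact summation stands in for MC sampling and finite-difference gradients stand in for AD; the positive ansatz $\psi_\theta(\sigma) = e^{\theta_\sigma}$, the Marshall-rotated Hamiltonian (so the ground state is positive), and all hyperparameters are illustrative assumptions:

```python
import math

# Two-site spin-1/2 Heisenberg model after the Marshall sign rotation, so the
# ground state (energy -3/4 for J = 1) has a positive wave function.
# Basis order: |uu>, |ud>, |du>, |dd>. Diagonal matrix elements are S^z S^z;
# the spin-flip term connects |ud> <-> |du> with amplitude -1/2.
DIAG = (0.25, -0.25, -0.25, 0.25)
FLIP = {1: 2, 2: 1}   # configurations connected by the off-diagonal term

def psi(theta, s):
    return math.exp(theta[s])   # positive ansatz psi_theta(sigma) = exp(theta_sigma)

def energy(theta):
    """<psi|H|psi>/<psi|psi> via E_loc, with exact summation standing in for MC."""
    p = [psi(theta, s) ** 2 for s in range(4)]   # unnormalized |psi|^2
    Z = sum(p)
    E = 0.0
    for s in range(4):
        e_loc = DIAG[s]
        if s in FLIP:                            # off-diagonal contribution to E_loc
            e_loc += -0.5 * psi(theta, FLIP[s]) / psi(theta, s)
        E += p[s] / Z * e_loc
    return E

theta = [0.0, 0.0, 0.0, 0.0]
lr, h = 1.0, 1e-5
for step in range(2000):
    grad = []
    for i in range(4):                           # finite differences stand in for AD
        tp, tm = list(theta), list(theta)
        tp[i] += h
        tm[i] -= h
        grad.append((energy(tp) - energy(tm)) / (2 * h))
    theta = [t - lr * g for t, g in zip(theta, grad)]

E = energy(theta)   # approaches the exact ground-state energy -0.75 from above
```

Swapping in a neural-network `psi`, MC sampling of configurations, and a real AD framework recovers the full ADVMC algorithm; only `E_loc` changes from model to model.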
Besides SGD-based optimizers, natural gradient optimizers (SR methods) can also be incorporated into the AD framework. In the context of VMC, natural gradient descent updates the variational parameters as $\Delta\theta = -\lambda F^{-1} \nabla_\theta E_\theta$, where $F$ can be obtained by Monte Carlo:

$$F_{ij} = \Big\langle \frac{\partial_{\theta_i} \psi}{\psi} \frac{\partial_{\theta_j} \psi}{\psi} \Big\rangle - \Big\langle \frac{\partial_{\theta_i} \psi}{\psi} \Big\rangle \Big\langle \frac{\partial_{\theta_j} \psi}{\psi} \Big\rangle, \qquad (26)$$

where $\psi$ is a shortcut for $\psi_\theta(\sigma)$ and the dependence on the parameters $\theta$ is implicit. Note that Eq. (26) is connected to Eq. (19) when the distribution is $p = |\psi|^2$ and $\psi$ is real. The relation between the FIM and the SR method for complex wave functions can also be analyzed by generalizing the KL divergence to the complex case (see the SM [74] for details).

IV. APPLICATIONS
The general theory of ADMC presented above has broad applications, including achieving high accuracy and efficiency in studying interesting many-body interacting models in physics. As mentioned earlier, by introducing ADMC we can leverage not only AD but also other powerful features of machine learning frameworks for traditional Monte Carlo. AD together with vectorization, GPU acceleration, and state-of-the-art optimizers can build faster and more capable Monte Carlo applications to study challenging issues in statistics and physics. Here we present two explicit applications of ADMC to interacting many-body systems [75], where comparable or higher accuracy is achieved compared with previous studies using RBM-based VMC and tensor network methods.

A. Fast search of phase transitions by ADMC
For many-body systems in physics, it is of central interest to find distinct phases and the phase transitions between them. We shall show that ADMC provides a general and efficient way to find phase transitions in many-body interacting models. At a given phase transition, certain quantities such as the specific heat and the ordering susceptibility reach a maximum. This naturally enables ADMC to locate the phase transition in a fast and efficient way by searching for the maximum. A phase transition can occur when a parameter such as temperature, pressure, or magnetic field is tuned across a critical value, and ADMC can efficiently find the critical value of the tuning parameter, such as the transition temperature.
For concreteness, we use ADMC to find the transition temperature of the 2D Ising model on the square lattice as an example, although the approach is general and can be applied to both classical and quantum models. For quantum models, we call the corresponding AD approach "AD quantum Monte Carlo" (ADQMC). The 2D Ising model is given by $H = -J\sum_{\langle ij \rangle} \sigma_i \sigma_j$, where $\sigma_i = \pm 1$ is the Ising spin on site $i$ of the square lattice and we take $J = 1$ as the energy unit. It is well known that there is a phase transition at a critical temperature $T_c$ below which the system orders spontaneously [76]. The Ising model can be MC sampled with the unnormalized probability distribution $p(\sigma, T) = \exp(-H(\sigma)/T)$. As the specific heat reaches a maximum at the phase transition, conventional MC methods usually compute the specific heat at many temperature points and then locate the peak of the specific-heat curve as the phase transition. These conventional approaches require an analytical derivation of the formula for the specific heat, since MC sampling usually cannot compute the specific heat directly. This is relatively simple for the specific heat thanks to the fluctuation-dissipation theorem, i.e. $C_v(T) = (\langle H^2 \rangle_p - \langle H \rangle_p^2)/T^2$. However, it is generally challenging to analytically derive quantities such as gradients or higher-order derivatives of physical quantities.
ADMC provides a general way to search for phase transitions by directly using the specific heat $C_v(T)$ or other physical quantities as the objective function, which avoids the drawback of conventional MC approaches mentioned above. With the help of ADMC, we can find the peak of the specific-heat curve much faster and more efficiently: without invoking the fluctuation-dissipation theorem, we can locate the peak very accurately with total computation time that is orders of magnitude shorter. In ADMC, we first directly compute the energy using the AD-aware MC energy estimator of Eq. (8), and then, following the spirit of SGD, we update the temperature (starting from any $T_0$) based on the second-order derivative of the MC expectation energy after every few MC updates:

$$T_{n+1} = T_n + \eta\, \frac{\partial^2 \langle H \rangle^{AD}}{\partial T^2}\Big|_{T = T_n}, \qquad (27)$$

where $\eta$ is a learning rate; this is gradient ascent on $C_v = \partial\langle H\rangle/\partial T$. Although the number of MC updates in each round of temperature update is small, rendering a noisy estimation of the specific heat, such a noisy gradient estimator can still converge to $T_c$ very quickly. This is the essence of SGD: noisy gradient estimation may lead to better and faster convergence to the minimum or maximum. This is also why mini-batch gradient estimates are used in general neural network training; for instance, one MC sample per pass in the training of variational auto-encoders [32] and the CD-1 algorithm in RBM training [77] work quite well. Following the same philosophy, we can combine SGD with the ADMC framework applied here. Specifically, to maximize an MC expectation value against variational parameters, $\theta^{*} = \mathrm{argmax}_\theta \langle O(x, \theta) \rangle_{p(x,\theta)}$, we may obtain a noisy estimation of the gradients by performing a few MC update steps. Such noisy gradient estimation can render stable and faster optimization if the learning rate is small enough.
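The identity underlying this search, $C_v = \partial\langle H\rangle/\partial T = (\langle H^2\rangle - \langle H\rangle^2)/T^2$, can be checked exactly on a lattice small enough to enumerate. The sketch below uses a 3×3 periodic Ising model (an illustrative size, 512 configurations) and compares the fluctuation formula against a finite-difference derivative of the exact energy:

```python
import itertools
import math

L = 3  # 3x3 periodic Ising model: small enough for exact enumeration (2^9 states)

def config_energy(spins):
    """H = -sum_<ij> s_i s_j with periodic boundary conditions."""
    E = 0
    for i in range(L):
        for j in range(L):
            s = spins[i * L + j]
            E -= s * spins[((i + 1) % L) * L + j]   # vertical bond
            E -= s * spins[i * L + (j + 1) % L]     # horizontal bond
    return E

STATES = [tuple(1 if b else -1 for b in bits)
          for bits in itertools.product((0, 1), repeat=L * L)]
ENERGIES = [config_energy(s) for s in STATES]

def thermal(T):
    """Exact <H> and <H^2> at temperature T under Boltzmann weights exp(-E/T)."""
    ws = [math.exp(-E / T) for E in ENERGIES]
    Z = sum(ws)
    mean_E = sum(w * E for w, E in zip(ws, ENERGIES)) / Z
    mean_E2 = sum(w * E * E for w, E in zip(ws, ENERGIES)) / Z
    return mean_E, mean_E2

T = 2.5
mE, mE2 = thermal(T)
cv_fluct = (mE2 - mE * mE) / T ** 2                          # fluctuation-dissipation
h = 1e-4
cv_deriv = (thermal(T + h)[0] - thermal(T - h)[0]) / (2 * h)  # C_v = d<H>/dT
# The two agree, so differentiating the AD-aware energy estimator with respect
# to T indeed yields the specific heat without any hand-derived formula.
```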
Moreover, one can also utilize the third-order derivative of the expectation energy and apply Newton's method to update the temperature, which convincingly shows the value of infinitely differentiable MC estimators.
In terms of implementation, we also combine vectorization into the ADMC workflow above, treating the Markov-chain index as an extra dimension of the spin-configuration tensors, which enables MC simulation of tens of thousands of Markov chains simultaneously. Such a vectorization scheme is highly efficient compared to conventional parallel schemes such as one Markov chain per CPU core. Besides, GPUs support such vectorization very well, providing further speedup. The combination of SGD and the vectorized Wolff algorithm leads to a relatively accurate estimation of T_c in just a few seconds. (Fig. 3 caption: A computational graph is constructed to give the logarithm of the wave function, ln ψ_θ, which is also vectorized. Loop 1 is the conventional MCMC approach for updating the configurations according to the Metropolis-Hastings algorithm. Loop 2 is the ADMC approach that evaluates the AD-aware energy estimator and updates the parameters θ via optimizers. One iteration of our algorithm includes many (often of order the system size) configuration updates (loop 1) to decrease the autocorrelation, followed by one step of parameter update (loop 2).)
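As a minimal sketch of the vectorization scheme, the following numpy code runs a batch of 2D Ising Markov chains with the chain index as a leading batch dimension. It uses single-spin-flip Metropolis rather than the Wolff algorithm of the actual implementation, and the lattice size, chain count, and step count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
L, n_chains, T = 8, 256, 1.0          # lattice size, number of Markov chains, temperature

# Spin configurations carry an extra leading "chain" (batch) dimension.
spins = rng.choice([-1, 1], size=(n_chains, L, L))

def energy_per_site(s):
    # Nearest-neighbor Ising energy H = -J sum_<ij> s_i s_j with J = 1,
    # periodic boundaries, vectorized over the chain dimension.
    return -(s * np.roll(s, 1, axis=1) + s * np.roll(s, 1, axis=2)).sum(axis=(1, 2)) / L**2

e0 = energy_per_site(spins).mean()

# Vectorized single-spin-flip Metropolis: one random site per chain per step,
# all chains accepted/rejected elementwise in a single array operation.
chains = np.arange(n_chains)
for _ in range(4000):
    i = rng.integers(L, size=n_chains)
    j = rng.integers(L, size=n_chains)
    s = spins[chains, i, j]
    nn = (spins[chains, (i + 1) % L, j] + spins[chains, (i - 1) % L, j]
          + spins[chains, i, (j + 1) % L] + spins[chains, i, (j - 1) % L])
    dE = 2.0 * s * nn                  # energy change of flipping site (i, j)
    accept = rng.random(n_chains) < np.exp(-dE / T)
    spins[chains, i, j] = np.where(accept, -s, s)

e1 = energy_per_site(spins).mean()    # below T_c the energy drops well under the random value
```

Because every step is an elementwise array operation over the chain axis, the same code maps directly onto GPU execution in an ML framework.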
We emphasize that the approach presented here is general and can be straightforwardly generalized to other classical or quantum models where fast estimation of critical values is desired. Knowledge of the critical values helps reduce unnecessary calculations on data points deep inside the phases and enables fast searches for phase transitions in interacting many-body systems.

B. Accurate search of ground states by ADVMC
The integration of AD with VMC provides a powerful tool to accurately study the ground states of many-body quantum models in one and higher dimensions. In particular, ADVMC can study generic quantum models (including frustrated models) in two and higher dimensions using general neural-network states as ansatz wave functions.
The workflow of the end-to-end ADVMC is sketched in Fig. 3. In the ADVMC algorithm, we take advantage of the vectorization technique to track and update thousands of independent Markov chains in parallel, as shown in Fig. 3. (Fig. 4 caption: The variational wave function was chosen to be a fully connected neural network with 7 layers. The number of nodes on each layer is (16l², 8l², 4l², 8l², 4l², l², 1), with l = 8. The activations were set to ReLU for all layers except the last one. The Adam optimizer was used to update the parameters. The dashed line is the benchmark ground-state energy given by the SSE method. The inset shows the convergence of the energy near the exact value. The ADVMC result is energetically competitive with or advantageous over results obtained by various state-of-the-art methods including EPS, PEPS, and RBM-based VMC.)
The computational graph can be constructed from mean-field wave functions with Jastrow factors, matrix product states [78,79], deep neural networks, or any other program with variational parameters and a single scalar output. ln ψ_θ also has an extra dimension of the same size as n_mc. In evaluating the computational graph, this extra dimension behaves as the batch dimension in ML language and can easily be handled with the broadcasting techniques supported by ML frameworks. With the wave-function amplitudes in hand, we update the configurations using MCMC so that they follow the distribution given by the computational-graph wave-function ansatz.
Here we demonstrate this new paradigm of VMC by an ADVMC study of the spin-1/2 quantum Heisenberg model on the square lattice. The model is given by H = J Σ_⟨ij⟩ S_i · S_j, where S_j is the spin-1/2 operator on site j. Because of the Marshall sign rule, the ground-state wave-function amplitudes of the Heisenberg model can be rendered positive definite. Consequently, for simplicity we shall use positive ansatz wave functions in our ADVMC simulations. The computational graph we utilize in this problem is a fully connected neural network with 7 layers and ReLU activations [80]. The numbers of nodes on these layers are 16l², 8l², 4l², 8l², 4l², l², 1, where l² = 64 is the size of the system. Such a neural-network design is generic, without any symmetry or geometric information built in. In total there are more than one million variational parameters, and the number of Markov chains is 5000 in our ADVMC simulations. With such a large number of independent Markov chains and variational parameters, the ADVMC implementation is still very efficient on GPU in terms of time and storage due to the highly parallelized structure of our algorithm, as shown in Fig. 4(a). The much shorter running time compared with CPU also demonstrates the increasing significance of GPU acceleration as the number of Markov chains increases.
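A minimal numpy sketch of the forward pass of such a network is given below; the weights are random placeholders standing in for the variational parameters, and the small batch size is illustrative (the actual implementation uses an ML framework with AD and 5000 chains):

```python
import numpy as np

rng = np.random.default_rng(0)
l = 8
n_sites = l**2                          # 64 lattice sites
sizes = [16*l**2, 8*l**2, 4*l**2, 8*l**2, 4*l**2, l**2, 1]   # the 7 layer widths

# Random placeholder weights; in practice these are the variational parameters.
dims = [n_sites] + sizes
weights = [rng.normal(0, np.sqrt(2.0 / dims[i]), size=(dims[i], dims[i + 1]))
           for i in range(len(sizes))]
biases = [np.zeros(d) for d in sizes]

def log_psi(x):
    # Forward pass, vectorized over the leading Markov-chain (batch) axis.
    h = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = h @ W + b
        if i < len(weights) - 1:        # ReLU on all layers except the last
            h = np.maximum(h, 0.0)
    return h[:, 0]                      # one scalar ln psi per configuration

n_chains = 128                          # small batch for illustration
configs = rng.choice([-1.0, 1.0], size=(n_chains, n_sites))
ln_psi = log_psi(configs)
n_params = sum(W.size for W in weights) + sum(b.size for b in biases)
```

These layer widths give slightly over one million parameters, consistent with the count quoted in the text.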
The approximate ground-state energy optimized by Adam converges to -0.6733 (in units of J) per site, averaged over the last 5000 energy data points, as shown in Fig. 4(b). This result has a 3 × 10⁻⁴ relative error compared with the benchmark ground-state energy obtained by SSE [81]. It is also energetically competitive with or advantageous over results obtained by various state-of-the-art methods including EPS [82], PEPS [83,84] and RBM-based VMC [58]. This convincingly demonstrates that end-to-end ADVMC enables us to reach state-of-the-art numerical results with very moderate effort for quantum models.

V. DISCUSSIONS AND CONCLUSIONS
One central issue in Monte Carlo simulations of interacting many-body quantum models is the notorious sign problem. Although solving the sign problem generically is shown to be NP-hard [85], it is still possible to ease [86][87][88] or solve [16] the sign problem of a given specific quantum model in QMC simulations by certain basis transformations. We propose to employ ADMC as a general way to find the optimal basis that eases or solves the notorious sign problem in QMC simulations of interesting quantum models, such as the repulsive Hubbard model away from half filling. An appropriate objective in ADMC would be the expectation value of the sign, which depends on the parameters characterizing the basis choice. ADMC can help find an optimal basis in which the sign problem is alleviated. From the ADMC-optimized basis with an eased sign problem, one may simulate strongly correlated models at lower temperatures and larger system sizes than QMC in the usual basis.
ADMC as proposed in the present paper is based on score-function estimators. For the specific models we have studied, the present ADMC obtains accurate results without suffering from high variance in the MC estimations. It is possible to further improve ADMC by reducing the variance of MC estimations of expectations. In other words, it would be desirable to find baselines or general control variates that could systematically reduce the variance of MC estimations. One future route to improve ADMC is to introduce baselines suitable for derivatives of any order, as in the case of normalized probability distributions [89].
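The variance reduction offered by a baseline can be illustrated on a toy normalized Gaussian, p = N(θ, 1) with O(x) = x²: subtracting the batch mean of O from the observable before multiplying by the score leaves the gradient estimator essentially unbiased but shrinks its variance. The distribution, observable, and batch sizes here are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, batch, trials = 1.5, 64, 2000

grads_plain, grads_base = [], []
for _ in range(trials):
    x = rng.normal(theta, 1.0, size=batch)
    O = x**2
    score = x - theta                     # d/dtheta ln p for p = N(theta, 1)
    grads_plain.append(np.mean(O * score))              # plain score-function estimator
    grads_base.append(np.mean((O - O.mean()) * score))  # with a mean-of-O baseline

var_plain = np.var(grads_plain)           # empirical variance of the plain estimator
var_base = np.var(grads_base)             # baseline version has visibly lower variance
```

Both estimators target the exact gradient d⟨O⟩/dθ = 2θ; the baseline only changes the fluctuations around it.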
In conclusion, we have presented the general theory and framework of ADMC. We also showed how Monte Carlo expectations, the KL divergence, and objectives from various settings can be expressed in an infinitely AD fashion. We further applied the ADMC approach to various Monte Carlo applications, including classical Monte Carlo and end-to-end VMC. In particular, ADVMC enables us to efficiently study interacting quantum models in higher dimensions. We believe the ADMC approach can inspire more accurate and efficient Monte Carlo designs with the machine learning toolbox in the future. At the intersection of differentiable programming and probabilistic programming, the ADMC framework provides a promising route to advance Monte Carlo applications in statistics, machine learning, and physics.

A. Automatic differentiation approach for Fisher information matrix
In this part, we further discuss the implementation details and advantages of the AD approach to the FIM. The test cases for our FIM implementations are simple distributions such as the multivariate Gaussian distribution N(µ, σ), in which µ and σ depend on variational parameters θ. For the simplest case, σ is constant and µ(θ) is determined by the parameters θ. We can obtain an analytical expression for the FIM in this case: F_ij = (∂µ/∂θ_i)^T σ⁻¹ (∂µ/∂θ_j). If we further assume σ = I and that each component µ_i = µ(θ_i) has the same functional form, the FIM simplifies to F_ij = δ_ij [µ'(θ_i)]². In our code example, we test with a three-dimensional Gaussian distribution and µ(θ) = (θ + 1)², where θ = 0. The expected FIM is 4I_{3×3} in this case. Such test cases can also be used to test the implementation for unnormalized probabilities if the normalization factor of the Gaussian distribution is deliberately dropped.
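This test case can be checked with plain Monte Carlo: sampling x ~ N(µ(θ), I) at θ = 0 and averaging the outer product of the scores ∂_i ln p should reproduce 4I_{3×3}. A minimal numpy sketch (without AD; the derivative of µ is entered by hand for this analytic test case):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 400_000
theta = np.zeros(d)
mu = (theta + 1.0)**2                 # mu_i = (theta_i + 1)^2, so mu = 1 at theta = 0
dmu = 2.0 * (theta + 1.0)             # d mu_i / d theta_i = 2 at theta = 0

x = rng.normal(mu, 1.0, size=(n, d))  # samples from N(mu, I)

# Score d ln p / d theta_i = (x_i - mu_i) * d mu_i / d theta_i for sigma = I
score = (x - mu) * dmu

# FIM estimated as the MC average of the outer product of scores;
# it approaches 4 * I_{3x3} as n grows.
fim = score.T @ score / n
```

Note that this MC estimate carries statistical noise on the off-diagonal zeros, which is precisely the shortcoming the AD approach discussed below can avoid.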
The first advantage of the AD approach to the FIM is that zero elements can be kept exactly, without MC fluctuations or error bars. Take the test case above as an example: all off-diagonal elements of the FIM should be zero analytically. If one computes the FIM in the conventional way, by MC averaging products of first-order derivatives of ln p, the resulting off-diagonal elements are not zero due to the error bars introduced by MC. However, with the advanced graph optimization and smart compiler infrastructure provided by TensorFlow, unnecessary computations can be identified and removed from the runtime graph. With such a state-of-the-art execution engine for computational graphs, the off-diagonal terms can be pinned at zero in the AD approach, because the zero nature of these terms has already been identified at graph-building time by the TensorFlow engine. That is to say, the numerical result can even reach theoretical precision with the help of AD. It is worth noting that such a gain is not guaranteed, since the TensorFlow engine can fail to recognize a complicated series of unnecessary operations. For example, AD with unnormalized probability objectives gives nonzero off-diagonal elements of the FIM for the same Gaussian test case.
The second advantage of the AD approach is its better compatibility with the vectorization scheme. Suppose we vectorize Markov chains as the batch dimension, as in our implementation of the example applications. The conventional way to evaluate the FIM involves terms like ⟨∂_i ln p(x) ∂_j ln p(x)⟩_{x∼p}, where the configurations x live on the extra vectorization dimension in our setup. It is hard to evaluate such terms by treating the batch dimension as a whole, since x differs between chains. This restriction is mainly imposed by the AD infrastructure of modern ML libraries, in which derivatives of multiple outputs can only be obtained one by one and no tensorized AD is implemented. Instead, the KL-divergence objective only involves terms such as ⟨ln p_⊥⟩_{x∼p}, with p_⊥ the gradient-detached copy of p, which are easy to parallelize by a simple reduce-mean. The computation time of the conventional approach scales with the number of Markov chains or configuration samples N, which is typically thousands to millions, while the computation time of the AD approach scales with the number of parameters (one has to apply AD to each derivative to obtain the Hessian in ML libraries), which can be far smaller than the number of configurations vectorized in the batch dimension. Our numerical experiments indeed show that the AD approach is clearly faster than the conventional approach in both graph-building and graph-execution time.
B. End-to-end ADVMC setup for general complex wave functions

Computational graph setup for general wave function
If the ground-state wave function is not always real and positive, its general form can be expressed as ψ_σ = e^{r_σ} e^{iθ_σ}, where r characterizes the log-amplitude ln|ψ| and θ characterizes the phase of the wave function. Therefore, we need two separate computational graphs for computing r and θ, and we train them together towards minimal energy. We discuss the most reliable form of the AD-aware energy estimators and the auxiliary estimator for natural gradients in the following.
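A minimal numpy sketch of the two-graph setup, with toy linear/tanh "networks" standing in for the two computational graphs (all sizes and parameter shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_mc, n_sites = 1000, 16

configs = rng.choice([-1.0, 1.0], size=(n_mc, n_sites))

# Two independent parameter sets ("graphs"): one for r = ln|psi|, one for the phase.
Wr = rng.normal(0, 0.1, size=n_sites)
Wt = rng.normal(0, 0.1, size=n_sites)

def r_net(x):      # real part of ln psi (log-amplitude), batched over configurations
    return np.tanh(x @ Wr)

def theta_net(x):  # phase of psi, batched over configurations
    return x @ Wt

# psi_sigma = exp(r_sigma) * exp(i * theta_sigma)
psi = np.exp(r_net(configs)) * np.exp(1j * theta_net(configs))

# The sampling probability |psi|^2 = exp(2 r) depends only on the r graph.
prob = np.abs(psi)**2
```

Since |ψ|² depends only on the r graph, the MCMC updates need only evaluate r, while the θ graph enters the energy estimator.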

Infinite order AD estimator for VMC
VMC works because of the following fact: the quantum expectation value of the energy can be approximated by the classical Monte Carlo average of E_loc.
Thus we have proved lim_{N_mc→∞} diff = 0. In other words, if we directly use Eq. (A5) as the estimator, AD will give the gradients with an additional term whose expectation is theoretically zero. Since this term is in general nonzero in MC calculations, it adds more variance to the gradient estimation. We can safely drop the diff term, following the philosophy of the baseline method. In other words, it is better to find an estimator whose first-order derivative is directly Eq. (A7) without the diff term. Such an estimator is easy to construct.
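The construction can be mimicked in a classical toy: reweighting fixed samples by p(x; θ)/p_⊥, with p_⊥ a copy frozen at the current parameters (a stand-in for stop-gradient), gives an estimator whose value is the plain sample mean but whose derivative is exactly the covariance (score-function) form on those samples. The sketch below uses a one-dimensional unnormalized p(x; θ) ∝ exp(-θx²) and f = x² as a stand-in for E_loc, and checks the identity by finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
theta0, n = 0.8, 10_000

# Fixed MC samples drawn at theta0 from p(x; theta) ~ exp(-theta * x^2) (unnormalized)
x = rng.normal(0.0, np.sqrt(0.5 / theta0), size=n)
f = x**2                                   # toy stand-in for a local estimator like E_loc

def reweighted_mean(theta):
    # AD-aware-style estimator: reweight by p(x; theta) / p_detached, where the
    # detached copy is frozen at theta0 (mimicking a stop-gradient).
    w = np.exp(-(theta - theta0) * x**2)
    return np.sum(w * f) / np.sum(w)

# Finite-difference derivative of the estimator at theta0 ...
h = 1e-5
fd = (reweighted_mean(theta0 + h) - reweighted_mean(theta0 - h)) / (2 * h)

# ... matches the covariance (score-function) form on the same samples
score = -x**2                              # d/dtheta ln p(x; theta) at theta0
cov_form = np.mean(f * score) - np.mean(f) * np.mean(score)
```

At θ = θ0 all weights equal one, so the estimator's value is unchanged; only its derivative, the quantity AD propagates, is modified.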
We can verify it by directly calculating the gradient of this estimator, whose numerical value is just the same as Eq. (A7). Thus the AD-aware estimator in Eq. (A10) gives the right approximation of the gradients of ⟨H⟩ with lower variance than the general estimator Eq. (A5), though it is only valid for the first-order derivatives.
If the wave function is real, Eq. (A10) reduces to

SR and Natural gradients
The SR method (natural gradient descent) is reported to give faster convergence in VMC. In this part, we explore the relation between natural gradient descent and the SR method in general settings where the wave function can be complex.
For the real wave-function case, the KL divergence plays the role of the distance on distribution space, and its Hessian, the FIM, gives the same formalism as the SR method, as we have shown in the main text. In the complex case, the traditional KL divergence defined with p = ψ*ψ loses the information about the wave function's phases. Thus we need a better distance measure to describe the difference between wave functions.
Infinitesimal distances are thus given by ds² = ⟨δψ|δψ⟩/⟨ψ|ψ⟩ − ⟨δψ|ψ⟩⟨ψ|δψ⟩/⟨ψ|ψ⟩², where δψ = ∂_i ψ dθ_i.