Unbiased Monte Carlo Cluster Updates with Autoregressive Neural Networks

Efficient sampling of complex high-dimensional probability distributions is a central task in computational science. Machine learning methods like autoregressive neural networks, used with Markov chain Monte Carlo sampling, provide good approximations to such distributions, but suffer from either intrinsic bias or high variance. In this Letter, we propose a way to make this approximation unbiased while keeping the variance low. Our method uses physical symmetries and variable-size cluster updates which exploit the structure of the autoregressive factorization. We test our method on first- and second-order phase transitions of classical spin systems, showing its viability for critical systems and in the presence of metastable states.


I. INTRODUCTION
Markov chain Monte Carlo [1] (MCMC) is an unbiased numerical method that allows sampling from unnormalized probability distributions, a central task in many areas of computational science. MCMC is commonly used, for example, in molecular dynamics [2], as well as statistical and quantum physics [3][4][5][6]. In addition to fundamental applications, MCMC serves as a physics-inspired approach to solve a variety of computational problems, including combinatorial optimization [7,8] and computer graphics [9]. While MCMC is a generically applicable technique, its implementation can be plagued by long mixing or autocorrelation times [10]. Various techniques have been proposed to increase the efficiency of MCMC [11], for example, cluster updates [12,13], parallel tempering [14], the worm algorithm [15], and event-chain Monte Carlo [16]. However, these faster MCMC algorithms rely on details of the physical system considered, and they cannot be applied generically.
Machine learning (ML) methods, given their intrinsic flexibility in addressing problems in computational physics [17], are being intensively investigated as a way to improve MCMC. Applications in this direction include, for example, self-learning Monte Carlo methods [18][19][20][21][22][23], enhanced sampling driven by neural networks [24,25], and neural importance sampling [26]. Strongly rooted in the principles of statistical physics, variational sampling techniques are among the most promising ML-driven approaches. Generative neural samplers (GNS) [27][28][29] are a chief example of ML-driven variational methods. These approaches build on the idea of constructing approximate representations of the original probability distribution at hand. The resulting variational approximations can efficiently perform sampling by construction, thus completely bypassing MCMC. A particularly interesting aspect of this approach is its systematic improvability when using the free energy bound minimization as the guiding principle to gauge the approximation accuracy. The main drawback of the variational approach, however, is that the estimators of expectation values are intrinsically biased by the representation error of the approximated distribution. As unbiased estimators are of central importance in many fundamental applications in physics, recent research has started addressing the key problem of removing the bias induced by ML variational representations, for example, through importance sampling and by reincorporating MCMC strategies [26,30,31].
In this Letter, we propose a way to combine variational techniques with MCMC by using autoregressive neural networks [32,33] to propose cluster updates. We first show that existing unbiased sampling schemes using global updates proposed by GNS can be plagued by an ergodicity issue due to the generic presence of "exponentially suppressed configurations", which have a limited effect on the variational free energy but a rather strong effect on the autocorrelation time. Our workaround to this problem consists of two ingredients. On one hand, we consider physical symmetry operations that leave the Hamiltonian invariant. When applied to the MCMC states, these symmetry operations significantly reduce the exponential suppression of configurations belonging to the same equivalence class. On the other hand, we take advantage of the structure of autoregressive factorization to propose MCMC updates with clusters of spins. The cluster update scheme is automatically learned for any Hamiltonian and is therefore particularly helpful for Hamiltonians with no known cluster update scheme. We benchmark our technique on the two-dimensional Ising model, showing that our solution eliminates the ergodicity issue of the global update approach in the critical region. We then study an Ising-like frustrated plaquette model for which traditional cluster algorithms are not applicable, and we find a first-order transition from a paramagnetic state to a "ferrimagnetic" state that breaks the $Z_2 \times Z_2 \times Z_2$ symmetry of the Hamiltonian. We show that the method greatly alleviates the metastability issue, as it can rapidly thermalize by cluster updates.

A. Bias in neural sampling
In the following we consider a system of V classical Ising spins $s := (s_1, \ldots, s_V)$, $s_i \in \{-1, 1\}$, at inverse temperature β. We use a GNS $q_\theta$ with parameters θ that variationally approximates the Boltzmann probability distribution $p(s) \propto \tilde p(s) := e^{-\beta E(s)}$ by minimizing, in the language of statistical physics, a free energy bound [27], which is equivalent to minimizing the Kullback-Leibler (KL) divergence [34]:

$D_{\mathrm{KL}}(q_\theta \,\|\, p) = \sum_s q_\theta(s) \ln \frac{q_\theta(s)}{p(s)}$.   (1)

To construct an expressive $q_\theta$, we use an autoregressive neural network to decompose it into a product of conditional probabilities,

$q_\theta(s) =: \prod_{i=1}^{V} q_{\theta;i}(s_i \mid s_{<i})$,

where $s_{<i} := (s_1, \ldots, s_{i-1})$. This specific choice for the model allows us to efficiently sample from the distribution $q_\theta(s)$ by sampling from the conditional probabilities $\{q_{\theta;i}\}$ sequentially [27].
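As an illustration of this sequential (ancestral) sampling, here is a minimal Python sketch; the conditional-probability function is a toy stand-in of our own, not the trained network:

import numpy as np

def sample_autoregressive(cond_prob, V, rng):
    # Draw s from q(s) = prod_i q_i(s_i | s_<i), one spin at a time.
    # cond_prob(s_prefix) returns the probability that the next spin is +1.
    s = np.zeros(V, dtype=int)
    for i in range(V):
        p_up = cond_prob(s[:i])
        s[i] = 1 if rng.random() < p_up else -1
    return s

# Toy stand-in for a trained network: a weak bias toward the previous spin.
def toy_cond_prob(s_prefix):
    return 0.5 if len(s_prefix) == 0 else 0.5 + 0.2 * s_prefix[-1]

rng = np.random.default_rng(0)
s = sample_autoregressive(toy_cond_prob, V=16, rng=rng)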
The variational autoregressive approach is systematically improvable and allows exact sampling. However, the fact that the two distributions are only approximately equal, $q_\theta(s) \approx p(s)$, also implies that the samples $\{s^{(1)}, \ldots, s^{(N)}\}$ drawn from the network carry an intrinsic bias. When these samples are used to compute the expectation value of an observable, the resulting estimator $\bar O = \frac{1}{N} \sum_{i=1}^{N} O(s^{(i)})$ is biased, and most importantly, it is not possible in general to reliably estimate the direction and the magnitude of this bias.

B. Neural importance sampling and global updates
Refs. [30,31] have proposed two closely related solutions to the bias problem. The first method, which we denote neural importance sampling (NIS) in the following, consists of using the modified unbiased estimator

$\bar O = \sum_{i=1}^{N} w(s^{(i)})\, O(s^{(i)})$,

where $w(s^{(i)}) := \tilde w(s^{(i)}) / \sum_{j=1}^{N} \tilde w(s^{(j)})$ and $\tilde w(s) := \tilde p(s) / q_\theta(s)$ are the normalized and the unnormalized weights, respectively. The second proposed solution, which we denote neural global updates (NGU) hereafter, consists of using the GNS as a MCMC proposer: if s is the Markov chain state, a proposed state s′ is drawn from the GNS and accepted with the Metropolis probability

$P_{\mathrm{acc}}(s \to s') = \min\!\left(1, \frac{\tilde w(s')}{\tilde w(s)}\right)$.   (2)
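Both schemes are compact in code. The following Python sketch is our own illustration, not the authors' implementation; logw stands for $\ln \tilde p(s) - \ln q_\theta(s)$, the log of the unnormalized weight:

import numpy as np

def nis_estimate(O_vals, logw):
    # NIS: reweight exact samples from q_theta with normalized weights w.
    w = np.exp(logw - logw.max())   # subtract max for numerical stability
    w /= w.sum()
    return np.sum(w * O_vals)

def ngu_accept(logw_old, logw_new, rng):
    # NGU: Metropolis test on the ratio wtilde(s') / wtilde(s) of Eq. (2).
    return np.log(rng.random()) < min(0.0, logw_new - logw_old)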

II. METHODS

A. Exponentially suppressed configurations
We point out an elementary property of the KL divergence, Eq. (1): the cost of allowing a single bad approximation $q_\theta(s)$ scales only logarithmically with the ratio $q_\theta(s)/p(s)$. Therefore, it is reasonable to expect that, even when the free energy is well approximated after the variational training, $q_\theta(s)$ is still exponentially smaller than $p(s)$ for a small portion of configurations. We call them exponentially suppressed configurations (ESC) [Footnote 1]. We denote by $p_{\mathrm{ESC}}$ the probability that a configuration is an ESC [Footnote 2]. A well-trained network has $p_{\mathrm{ESC}} \ll 1$, and ESC have a limited effect on the variational free energy but a rather strong effect on the autocorrelation time.
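To make this quantitative (an illustrative estimate we add here), consider a configuration s whose probability is suppressed by a factor $e^{-\Delta}$, i.e., $q_\theta(s) = e^{-\Delta} p(s)$ with $\Delta \gg 1$. Its contribution to the KL divergence of Eq. (1) is

$q_\theta(s) \ln \frac{q_\theta(s)}{p(s)} = -\Delta\, e^{-\Delta}\, p(s)$,

which is exponentially small, so the training loss barely penalizes it; yet $\tilde w(s) \propto e^{\Delta}$, so once the chain visits s, the acceptance ratio $\tilde w(s')/\tilde w(s)$ of Eq. (2) is of order $e^{-\Delta}$ for typical proposals s′, and the chain remains stuck for of order $e^{\Delta}$ steps.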
Let us consider a Markov chain evolution using NGU, and suppose that the Markov chain state s is an ESC. The ratio $\tilde w(s')/\tilde w(s)$ in Eq. (2) will be exponentially small for almost any other configuration s′; therefore, the Markov chain will be essentially stuck at s for a long time before accepting any new proposal, and the autocorrelation time of the whole chain will be impractically large. A similar argument applies when considering the variance of the NIS method.

B. Symmetry-enforcing updates
To solve the generic ergodicity problem of neural global update methods, we start by proposing an enhanced MCMC method that enforces the symmetries. At each Monte Carlo step, we apply a random element of the symmetry group G, composed of a translation and reflections, to the current configuration [Footnote 3]. There is no need to reject this move because the energy is invariant under the group action. In the following, we refer to this method as neural global updates with symmetries (NGUS).
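A minimal NumPy sketch of such a symmetry move (our own illustration; the concrete group elements, square-lattice translations, reflections, and a global spin flip for Z_2-symmetric Hamiltonians, follow Alg. 1 below):

import numpy as np

def random_symmetry(s, rng):
    # Apply a random symmetry-group element to an L x L configuration.
    # The energy is invariant, so this move is accepted with probability 1.
    L = s.shape[0]
    s = np.roll(s, shift=(int(rng.integers(L)), int(rng.integers(L))), axis=(0, 1))
    if rng.random() < 0.5:
        s = s[::-1, :]   # reflection along the x axis
    if rng.random() < 0.5:
        s = s[:, ::-1]   # reflection along the y axis
    if rng.random() < 0.5:
        s = s.T          # reflection along the diagonal
    if rng.random() < 0.5:
        s = -s           # global spin flip ("z reflection"), for Z_2-symmetric E
    return s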
Assume that the current configuration s is an ESC, and we use a random symmetry operation to change s into another configuration s* in the equivalence class C. The probability that all configurations in C are ESC is on the order of $p_{\mathrm{ESC}}^{\#C}$, so it is extremely unlikely to get stuck within the equivalence class. The occurrence of ESC does not depend on the physical symmetries but rather on the structure of the network.

Footnote 1: In the context of variational inference, it was empirically found [35] that q tends to cover fewer modes than p in the probability landscape. However, the problem of exponentially suppressed configurations is more general, as it is present even when all the modes are represented.

Footnote 2: A distribution of $\tilde w$ is shown in Fig. S3 in the Supplemental Material [36], which can be used to estimate $p_{\mathrm{ESC}}$.

Footnote 3: Formally, this is equivalent to multiplying the Markov transition matrix by another matrix that leaves the equilibrium distribution invariant, as discussed in the Supplemental Material [36].

C. Neural cluster updates
With autoregressive neural networks, it is particularly natural to consider cluster updates where only a subset of the lattice is changed. Indeed, for any given k, it is possible to propose an update s → s′ by setting $s'_{\le V-k} := s_{\le V-k}$ and sampling only $s'_{>V-k}$ from the conditionals. The weight ratio entering the Metropolis-Hastings acceptance probability then becomes

$\frac{\tilde p(s')\, q_\theta(s_{>V-k} \mid s_{\le V-k})}{\tilde p(s)\, q_\theta(s'_{>V-k} \mid s_{\le V-k})}$,

which is not too far from 1 when k is small. In this way, the new configuration is closer to the old one and is more easily accepted, so we expect a lower autocorrelation time than with global update methods. In the following, we refer to this method as neural cluster updates (NCU).
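A sketch of one NCU step in Python may clarify the bookkeeping. The network interface cond_logprob(s, i), returning $\ln q_{\theta;i}(s_i = +1 \mid s_{<i})$, and the helper names are hypothetical placeholders of our own; a masked autoregressive network guarantees the output depends only on $s_{<i}$:

import numpy as np

def tail_logprob(s, k, cond_logprob):
    # log q_theta(s_{>V-k} | s_{<=V-k}), accumulated over the last k spins.
    V = len(s)
    lp = 0.0
    for i in range(V - k, V):
        p_up = np.exp(cond_logprob(s, i))
        lp += np.log(p_up) if s[i] == 1 else np.log1p(-p_up)
    return lp

def ncu_step(s, k, cond_logprob, energy, beta, rng):
    # Resample the last k spins from the conditionals, then accept/reject.
    V = len(s)
    s_new = s.copy()
    for i in range(V - k, V):
        p_up = np.exp(cond_logprob(s_new, i))   # conditioned on current prefix
        s_new[i] = 1 if rng.random() < p_up else -1
    # Metropolis-Hastings: ptilde(s')/ptilde(s) times the proposal ratio
    # q(old tail | prefix) / q(new tail | prefix).
    log_ratio = (-beta * (energy(s_new) - energy(s))
                 + tail_logprob(s, k, cond_logprob)
                 - tail_logprob(s_new, k, cond_logprob))
    if np.log(rng.random()) < min(0.0, log_ratio):
        return s_new
    return s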

Algorithm 1 A step of NCUS.
1: Input the current configuration s
2: Sample an integer k ∈ {1, …, V} from P_cluster
3: Sample the last k spins and propose the configuration s′
4: Accept s ← s′ with probability P_acc(s → s′)
5: Translate s by a random displacement
6: Reflect s along the x axis, the y axis, and the diagonal, each with 50% probability
7: Reflect s along the z axis (flip all spins) with 50% probability
8: Output s as a sample in the Markov chain

As symmetry-enforcing and cluster updates are compatible with each other, we use the two at the same time, and we call the resulting method neural cluster updates with symmetries (NCUS), which still falls into the category of MCMC. As described in Alg. 1, we randomly choose a cluster size k and take the last k spins $s_{>V-k}$ in the autoregressive order [Footnote 4] to be inside the cluster. Then we sample those spins from the approximate distribution $q_\theta$ of the physical system learned by the network, conditioned on the spin configuration $s_{\le V-k}$ outside the cluster. After that, we accept the new configuration s′ with the probability $P_{\mathrm{acc}}(s \to s')$, then randomly apply the symmetry operations. See Fig. 1 for a schematic illustration.
Although the cluster size k can come from an arbitrary distribution $P_{\mathrm{cluster}}(k)$, numerical experiments have shown that the uniform distribution $P_{\mathrm{cluster}}(k) \equiv 1/V$ already works better than many other choices we have explored [Footnote 5]. A sketch of the full step is given below.
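Putting the pieces together, a full NCUS step following Alg. 1 can be sketched as follows; this fragment composes ncu_step and random_symmetry from the illustrations above, so it is not self-contained on its own:

import numpy as np

def ncus_step(s, cond_logprob, energy, beta, rng, L):
    # One NCUS step (Alg. 1). s has length V = L * L, flattened in
    # autoregressive (row-major) order, matching the k = (i - 1) L + j labeling.
    V = L * L
    k = int(rng.integers(1, V + 1))   # uniform P_cluster(k) = 1/V
    s = ncu_step(s, k, cond_logprob, energy, beta, rng)
    return random_symmetry(s.reshape(L, L), rng).reshape(V)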

III. RESULTS

A. Ising model
We start by demonstrating the effectiveness of NCUS on the conventional two-dimensional Ising model,

$E(s) = -\sum_{i,j=1}^{L} \left( s_{i,j}\, s_{i+1,j} + s_{i,j}\, s_{i,j+1} \right)$,

with periodic boundary conditions $s_{L+1,j} = s_{1,j}$, $s_{i,L+1} = s_{i,1}$. The model can be solved exactly and has a critical point at $\beta_c = \ln(1 + \sqrt{2})/2 \approx 0.44$ [37]. Our network architecture is based on PixelCNN [38], combined with dilated convolutions [39] to reduce the total number of parameters. Overall, our networks are lightweight, with 3 convolutional layers and approximately $4 \times 10^3$ parameters. Thanks to the MCMC bias removal, we do not need the network to approximate the true distribution to extremely high precision, which would in any case be increasingly difficult for larger lattices. As we use the same network for all the experiments, we can compare the performances of the various unbiased sampling methods. After training the network, we generate $10^3$ Markov chains in parallel, each containing $10^5$ samples [Footnote 6].
When comparing the efficiencies of different MCMC algorithms, the main metric is the integrated autocorrelation time τ [Footnote 7], which determines the variance of the estimator when the variance of the observable and the sample size are given. Here, τ is an intrinsic property of the algorithm and the physical system, without dependence on the sample size, provided we have enough samples to obtain a converged estimate of it. For NIS, the autocorrelation time is equal to one by definition; however, there is an increased variance arising from the reweighting procedure, which we convert into an effective autocorrelation time for the sake of comparison with the other techniques. From Fig. 2 (a), we see that both NGU and NIS have pathologically high autocorrelation times in the critical region. An inspection of their autocorrelation functions [see Fig. 2 (b)] shows that the Markov chain of NGU is essentially nonergodic in the available simulation time. By contrast, our proposed method NCUS has no issue in the critical region. A closer inspection of the inset of Fig. 2 (a) shows that the autocorrelation time of NCUS still increases in the critical region, and the sampling efficiency is improved typically by 2 orders of magnitude compared with the global update methods. The performance of NCUS is also comparable with the celebrated Wolff cluster update method [13], which is specifically tailored for the Ising model. Both NCUS and NGUS perform well in the critical region, and NCUS is to be preferred, as the cluster update allows us to achieve a lower autocorrelation time and, more importantly, a better asymptotic behavior of the autocorrelation function.

Footnote 4: The autoregressive order is a one-dimensional labeling $s_k$ of the spins in the lattice, mapped from the two-dimensional labeling $s_{i,j}$, where $k = (i - 1) \times L + j$.

Footnote 5: A comparison of different choices of $P_{\mathrm{cluster}}$ can be found in Figs. S1 and S2 in the Supplemental Material [36].

Footnote 6: Details of the network structure, training, and sampling are described in the Supplemental Material [36].

Footnote 7: For completeness, we provide the definitions of the autocorrelation function and the integrated autocorrelation time in the Supplemental Material [36].

B. Frustrated plaquette model
We now study another model that presents richer physics than the Ising model and for which, to our knowledge, no traditional cluster update method is applicable. We consider a classical spin-1/2 system with nearest-neighbor $J_1$, next-next-nearest-neighbor $J_3$, and plaquette $K$ interactions,

$E(s) = -J_1 \sum_{\langle i,j \rangle} s_i s_j - J_3 \sum_{\langle\langle\langle i,j \rangle\rangle\rangle} s_i s_j - K \sum_{[ijkl]} s_i s_j s_k s_l$,

where the first sum runs over nearest-neighbor pairs, the second over next-next-nearest-neighbor pairs (two lattice spacings apart along a lattice axis), and the third over elementary plaquettes [ijkl], with periodic boundary conditions; we denote this model the frustrated plaquette model (FPM). In this Letter, we set $J_1 = J_3 = -1$ and $K > 0$. We sketch the expected phase diagram in Fig. 3 (a). The ground state of the FPM depends on the competition of $J_1$ and $K$. For small $K$, we expect a transition as a function of temperature between a paramagnetic (PM) and a ferromagnetic (FM) phase, analogous to what is found in the conventional Ising model. For $K > 1$, the ground state is a repetition of a 2 × 2 unit cell containing one spin pointing in the direction opposite to the other three, as shown in Fig. 3 (a). The ground state breaks the $Z_2$ spin-inversion symmetry and the $Z_2 \times Z_2$ symmetry of translations by one lattice spacing in the x and y directions. As this phase has an average magnetization per site of 1/2, we refer to it as ferrimagnetic (fM). At finite temperature, there must be a phase transition between the PM phase and the fM one, the nature of which we investigate in this Letter.
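For concreteness, here is a NumPy evaluation of this energy on an L × L periodic lattice (our own sketch; reading the $J_3$ term as a two-site shift along each axis is our assumption, spelled out in the comments):

import numpy as np

def fpm_energy(s, J1=-1.0, J3=-1.0, K=2.0):
    # Energy of the frustrated plaquette model on an L x L periodic lattice,
    # following the Hamiltonian written above. The J3 term is read as a
    # coupling between spins two lattice spacings apart along each axis.
    nn = s * np.roll(s, 1, axis=0) + s * np.roll(s, 1, axis=1)
    nnnn = s * np.roll(s, 2, axis=0) + s * np.roll(s, 2, axis=1)
    plaq = (s * np.roll(s, 1, axis=0) * np.roll(s, 1, axis=1)
            * np.roll(s, (1, 1), axis=(0, 1)))
    return -J1 * nn.sum() - J3 * nnnn.sum() - K * plaq.sum()

rng = np.random.default_rng(0)
s = rng.choice([-1, 1], size=(16, 16))
print(fpm_energy(s))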
We present the numerical results for the energy per site ε and the spontaneous magnetization per site $m := \frac{1}{V} \sum_i s_i$ in Fig. 3 (b, c), respectively, as functions of temperature with K = 2 and lattice sizes up to L = 32. Our results in Fig. 3 (b) strongly suggest that, in the thermodynamic limit L → ∞, the energy is discontinuous at the critical point $\beta_c = 0.2145 \pm 0.0012$, with a latent heat $Q = 1.36 \pm 0.20$, estimated using the standard finite-size scaling procedure of Ref. [40]. Another indication of the first-order nature of the phase transition comes from the spontaneous magnetization shown in Fig. 3 (c), which, when extrapolated to the thermodynamic limit, shows a discontinuity from 0 to a value close to 1/2, as expected for the fM phase [Footnote 8]. The comparison of autocorrelation times from different sampling methods is presented in Fig. 4, which provides numerical evidence that NCUS greatly alleviates the metastability issue expected near first-order phase transitions [44][45][46][47]. Theoretically, a first-order phase transition occurs when the distribution of energy $p(E) \propto N(E)\, e^{-\beta E}$ has two peaks of the same size, as shown in Fig. 5, where N(E) is the number of configurations with energy E. A GNS-based sampling method has equal probabilities to generate a sample from either peak, and the probability to accept such a proposal will be close to 1, if the network is ideally trained and there is no problem of ESC. Meanwhile, traditional local-update MCMC methods can only take small horizontal steps in Fig. 5, so it takes more steps ($\sim L^2$) and an exponentially lower probability ($\sim e^{-\beta \delta E L^2}$) for them to walk from the low-energy peak to the high-energy one, where δE is the typical energy difference in a local update, which does not scale with L. In other words, the exponentially large number of configurations in the high-energy peak does not make it easier for local-update MCMC methods to sample from that peak, because it is exponentially hard for the walker to move between those configurations along locally connected paths. NCUS strikes a balance between the two extremes, which solves the problem of ESC and keeps the autocorrelation time practically low, even though the network is lightweight and cannot ideally approximate the true distribution. Another potential issue at first-order phase transitions is the strong divergence of the specific heat, resulting in a high variance of the energy. Despite this, NCUS still allows us to estimate the energy with high accuracy, and the error bars in Fig. 3 (b, c) are too small to be visible.

Footnote 8: A naive Ginzburg-Landau approach for a three-component $Z_2 \times Z_2 \times Z_2$-symmetric order parameter predicts a second-order transition when truncated at the quartic level. A first-order phase transition is also found in the q = 8 Potts model [41][42][43], but the broken symmetry group there is $Z_8$.

IV. CONCLUSIONS
In this Letter, we have presented a strategy to systematically remove the bias of variational autoregressive neural network methods and, at the same time, keep the variance of observables under control. Our approach exploits the autoregressive structure of the models to generate cluster Monte Carlo updates. After having shown that global updates proposed from networks trained with the KL divergence are generically expected to fail because of a small number of exponentially suppressed configurations (ESC), we have provided a workaround, namely NCUS, that takes advantage of the symmetries of the physical system and of the chainlike graphical structure of the autoregressive model to help the Markov chain rapidly escape from ESC. We have benchmarked our technique on the two-dimensional Ising model, showing its efficacy in the critical region, where a straightforward implementation of neural global updates fails. We have further shown the potential of our method for systems for which no traditional cluster updates are known by considering a frustrated plaquette Ising model, where we were able to determine the first-order nature of a paramagnetic-ferrimagnetic phase transition breaking a $Z_2 \times Z_2 \times Z_2$ symmetry, and where the automatic cluster updates allowed us to alleviate the metastability issue.
While we have been mainly concerned with the metric of autocorrelation time, we recognize that the wall-clock time is another important metric for practical computations. In this respect, when computing the energy of the system has a negligible computational cost, current neural network-based methods are not yet competitive with traditional MCMC methods. It can then be argued that the ideal application scenarios for ML-based methods are those where evaluating the integrand is expensive, for example, in determinant quantum Monte Carlo [48] and lattice field theory [49]. In future work, computational efficiency can be addressed on multiple fronts, for example, by introducing techniques such as hierarchy and sparsity in the neural network models, to reduce the computation time and scale up the lattice size by orders of magnitude. We then expect that the slow asymptotic growth of the autocorrelation time of GNS will eventually make them outperform traditional MCMC methods in terms of wall-clock time.

S1. DECOMPOSITION OF TRANSITION MATRIX
The transition matrix M of a Markov chain is defined by

$\pi_{t+1} = M \pi_t$,

where $\pi_t$ is a state vector containing the probabilities of the $2^V$ configurations at sampling time t, and $M_{ij}$ is the probability to move from configuration j to configuration i. M has the left stochastic property

$\sum_i M_{ij} = 1$ for all j.

To improve the acceptance rate of the chain, we can write M as a convex combination of two transition matrices,

$M = \lambda M^{(1)} + (1 - \lambda) M^{(2)}$, with $0 \le \lambda \le 1$,

or as a product of two transition matrices,

$M = M^{(1)} M^{(2)}$,

where $M^{(1)}$ and $M^{(2)}$ are easier to sample than M. Given that $M^{(1)}$ and $M^{(2)}$ are transition matrices, M is also a transition matrix and satisfies the left stochastic property.
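As a quick numerical sanity check (our own toy example, not from the main text), both compositions indeed preserve the left stochastic property:

import numpy as np

rng = np.random.default_rng(0)
n = 8
M1 = rng.random((n, n)); M1 /= M1.sum(axis=0)   # left stochastic: columns sum to 1
M2 = rng.random((n, n)); M2 /= M2.sum(axis=0)
lam = 0.3
for M in (lam * M1 + (1 - lam) * M2, M1 @ M2):
    assert np.allclose(M.sum(axis=0), 1.0)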
In NCUS, we decompose M as

$M = M_S M_K$,

where $M_S$ contains the symmetry operations and $M_K$ represents the cluster update. Specifically, we write $M_K$ as a convex combination of different-size cluster updates,

$M_K = \sum_{k=1}^{V} P_{\mathrm{cluster}}(k)\, M^{(k)}$,

where $\sum_k P_{\mathrm{cluster}}(k) = 1$, and each $M^{(k)}$ contains the proposals and the rejections when only the last k spins are sampled. $M^{(k)}$ with a smaller k usually has a higher acceptance rate. See Fig. S1 and Fig. S2 for a comparison of different choices of $P_{\mathrm{cluster}}$. The sampling of $M_S$ does not need rejection, because the symmetric configurations always have the same energy as the original one. We further decompose $M_S$ as

$M_S = M_{R_x} M_{R_y} M_{R_d} M_{R_z} M_{T_x} M_{T_y}$,

where $M_{R_x}$, $M_{R_y}$, $M_{R_d}$, and $M_{R_z}$ contain reflections along the x axis, the y axis, the diagonal, and the z axis, respectively, and $M_{T_x}$ and $M_{T_y}$ are uniform mixtures of the translations by i spins in the x and the y directions, respectively. In this way, we naturally decompose the whole symmetry group into the direct product of subgroups, and avoid enumerating all the symmetry operations in sampling, while the number of symmetry operations can grow exponentially with the number of symmetry subgroups.

Figure S2. IATs when sampling with each k individually ($P_{\mathrm{cluster}}(k') = \delta(k' - k)$), computed on the 16 × 16 Ising model. The dashed horizontal line denotes the result with the uniform distribution. Although some medium-sized k give a slightly lower IAT than the uniform distribution, the uniform distribution still performs better than most individual k.

S3. AUTOCORRELATION TIME
For Markov chain-based algorithms, including NCUS, NGUS, and NGU, we use the integrated autocorrelation time (IAT) τ to characterize the efficiency of the algorithm [10]:

$\tau = 1 + 2 \sum_{t \ge 1} r(t)$, with $r(t) = \frac{1}{\mathrm{Var}[O]} \frac{1}{N - t} \sum_{i=1}^{N-t} \left( O(s^{(i)}) - \bar O \right) \left( O(s^{(i+t)}) - \bar O \right)$,

where $\{s^{(1)}, \ldots, s^{(N)}\}$ are the samples in the Markov chain, O is the observable we are interested in, and $\bar O = \frac{1}{N} \sum_{i=1}^{N} O(s^{(i)})$. Because the estimate of r(t) becomes significantly noisy when t is large, we cut off the summation for τ when r(t) first crosses 0. The resulting IAT is insensitive to the cut-off point: if we change the threshold from 0 to 0.1, or cut off only when 100 consecutive values of r(t) are below the threshold, the relative change of τ is < 10%.
As we are drawing multiple Markov chains in parallel, we need an effective IAT to represent all of them. We first compute the variance $\mathrm{Var}[\bar O_i]$ of the observable estimator for each chain i,

$\mathrm{Var}[\bar O_i] = \frac{\tau_i}{n} \mathrm{Var}[O_i]$,

where n is the chain length and $\mathrm{Var}[O_i]$ is the variance of the data in this chain. Then we compute the expectation of the observable over all chains, and propagate the variance using the fact that the chains are independent of each other,

$\mathrm{Var}[\bar O] = \frac{1}{m^2} \sum_{i=1}^{m} \mathrm{Var}[\bar O_i]$,

where m is the number of chains. The effective IAT $\tau_{\mathrm{eff}}$ can then be solved from

$\mathrm{Var}[\bar O] = \frac{\tau_{\mathrm{eff}}}{n m} \mathrm{Var}[O]$,

where $\mathrm{Var}[O]$ is the variance of the data in all chains. $\tau_{\mathrm{eff}}$ is independent of the number of chains or the chain length, as long as we have enough samples to obtain a converged estimate. We define an effective autocorrelation time for NIS by using the increased variance created by the reweighting procedure [50],

$\tau_{\mathrm{eff}} = N \sum_{i=1}^{N} w(s^{(i)})^2$.   (S14)
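The following NumPy sketch (our own illustration) computes the per-chain IAT with the zero-crossing cutoff described above, and the effective IAT across independent chains:

import numpy as np

def integrated_autocorrelation_time(x):
    # IAT of one chain; the sum over r(t) is cut where r(t) first crosses 0.
    x = np.asarray(x, dtype=float) - np.mean(x)
    n, var = len(x), np.var(x)
    tau = 1.0
    for t in range(1, n):
        r = np.dot(x[:n - t], x[t:]) / ((n - t) * var)
        if r <= 0:
            break
        tau += 2.0 * r
    return tau

def effective_iat(chains):
    # tau_eff across m independent chains of length n, solved from
    # Var[Obar] = tau_eff / (n m) * Var[O].
    chains = np.asarray(chains, dtype=float)   # shape (m, n)
    m, n = chains.shape
    var_means = [integrated_autocorrelation_time(c) / n * np.var(c) for c in chains]
    var_mean_all = np.sum(var_means) / m**2    # chains are independent
    return var_mean_all * n * m / np.var(chains)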

S4. DETAILS OF NUMERICAL EXPERIMENTS
Our network has 3 convolutional layers, each with kernel size 5. The convolutions are masked to implement the autoregressive property, as introduced in PixelCNN [38]. The numbers of input, hidden, and output channels are 1 → 16 → 16 → 1. SiLU activations [51] are applied after the first and the second convolutional layers, which are reported to produce lower loss than ReLU. A sigmoid activation is applied after the third convolutional layer to restrain the output to (0, 1). To be efficient over a large number of sampling steps, we keep the network lightweight, while its receptive field should approximately cover the whole lattice. We therefore use dilated convolutions [39] to expand the receptive field, increasing the dilation rate in each convolutional layer by a step size d, so that layer l has dilation rate 1 + (l − 1) d. The receptive field radius can then be calculated as

$R = \frac{s - 1}{2} \sum_{l=1}^{D} \left[ 1 + (l - 1) d \right] = \frac{s - 1}{2} \left[ D + \frac{D (D - 1)}{2} d \right]$,

where D = 3 is the number of convolutional layers, s = 5 is the convolution kernel size, and d is the dilation step size. For lattice sizes L = 8, 16, 24, 32, the dilation step sizes are 1, 2, 3, 4, respectively. The network has 3,761 non-masked parameters in total, regardless of the lattice size. During training, we use the Adam optimizer [52] with the conventional learning rate $10^{-3}$ and batch size 64, and take $2 \times 10^4$ training steps. To avoid being trapped in local minima, especially at low temperatures, in the first $10^4$ steps we linearly anneal β from 0 to the desired value, which is reported to produce a lower loss than exponential annealing. We do not use weight regularization or gradient clipping, because the network is shallow and there is no significant instability in training.
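A PyTorch sketch of such a masked, dilated network is given below. It is our own reconstruction under stated assumptions: the dilation schedule 1, 1 + d, 1 + 2d and the zero padding are assumptions, and the masking follows the standard PixelCNN convention (type "A" in the first layer, type "B" afterwards):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    # Raster-order autoregressive mask, as in PixelCNN. Type "A" also masks
    # the center pixel (first layer); type "B" keeps it (deeper layers).
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kh, kw = self.kernel_size
        mask = torch.zeros(kh, kw)
        mask[:kh // 2, :] = 1.0          # all rows strictly above the center
        mask[kh // 2, :kw // 2] = 1.0    # center row, strictly left of center
        if mask_type == "B":
            mask[kh // 2, kw // 2] = 1.0
        self.register_buffer("mask", mask)

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)

def make_net(d):
    # Dilations 1, 1 + d, 1 + 2d; padding keeps the spatial size fixed.
    layers = []
    for l, (c_in, c_out) in enumerate([(1, 16), (16, 16), (16, 1)]):
        dil = 1 + l * d
        layers.append(MaskedConv2d("A" if l == 0 else "B", c_in, c_out,
                                   kernel_size=5, padding=2 * dil, dilation=dil))
        layers.append(nn.SiLU() if l < 2 else nn.Sigmoid())
    return nn.Sequential(*layers)

net = make_net(d=2)                      # e.g. L = 16, per the table above
s_hat = net(torch.ones(1, 1, 16, 16))    # conditional probabilities in (0, 1)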
For sampling, we generate $10^3$ Markov chains in parallel, each containing $10^5$ samples. The chains are initialized with samples from the network. The first $10^4$ samples in each chain are discarded, to make sure only samples taken after thermalization are used. For each experiment of NCUS up to L = 32, the Gelman-Rubin diagnostic [53] is less than 1.1, which confirms that the chains are thermalized (a minimal sketch of this diagnostic is given below). The IAT is less than $4 \times 10^3$, which is shorter than the remaining chain length by orders of magnitude.

Figure S4. System size dependence of the integrated autocorrelation time τ from the Metropolis single spin flip method (SSF) and NCUS on the FPM around the phase transition temperature. Due to technical limitations, we are currently unable to use neural networks for lattice sizes larger than L = 32.
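For reference, a minimal NumPy implementation of the Gelman-Rubin diagnostic mentioned above (our own sketch of the standard potential-scale-reduction formula):

import numpy as np

def gelman_rubin(chains):
    # Potential scale reduction factor R-hat for m chains of length n.
    chains = np.asarray(chains, dtype=float)   # shape (m, n)
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    var_hat = (n - 1) / n * W + B / n          # pooled variance estimate
    return np.sqrt(var_hat / W)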