Estimation of Thermodynamic Observables in Lattice Field Theories with Deep Generative Models

In this work, we demonstrate that applying deep generative machine learning models for lattice field theory is a promising route for solving problems where Markov Chain Monte Carlo (MCMC) methods are problematic. More specifically, we show that generative models can be used to estimate the absolute value of the free energy, which is in contrast to existing MCMC-based methods which are limited to estimating only free energy differences. We demonstrate the effectiveness of the proposed method for two-dimensional φ⁴ theory and compare it to MCMC-based methods in detailed numerical experiments.

Introduction. The free energy of a physical system is of great importance since it can be related to several thermodynamical observables. In particular, at non-zero temperature, it allows one to compute the entropy, the pressure or, more generally, the equation of state of the considered physical system. For example, QCD at high temperature, as a generic strongly interacting field theory, plays an essential role in the physics of the early universe and is now extensively probed in large-scale heavy ion experiments [1]. Hence, knowing such thermodynamic quantities from QCD alone is of very high relevance.
The main tool to study strongly-coupled field theories, such as QCD, is to discretize them on a spacetime lattice and use Markov Chain Monte Carlo (MCMC) methods to numerically calculate the relevant physical quantities. Unfortunately, these thermodynamical quantities are challenging to compute using existing MCMC methods. The fundamental difficulty is that MCMC is not able to directly estimate the partition function of the lattice field theory. Therefore, the absolute value of the free energy cannot be estimated straightforwardly.
Instead, there are a number of MCMC methods to estimate differences of free energies. One typically chooses a free energy difference ∆F = F_b − F_a such that F_a is known either exactly or approximately. One can then deduce the value of the free energy F_b = ∆F + F_a at the desired point in parameter space. If the free energy F_a is not known exactly, this induces an unwanted approximation error. Most of the methods to estimate ∆F rely on integrating a derivative of the partition function over a trajectory in the parameter space of the lattice field theory [2]. Alternatively, one can use a reweighting procedure to calculate free energy differences between neighbouring points of the discretized trajectory and then sum them up [2,3]. These approaches require simulations at each parameter point of the discretized trajectory, which is numerically costly and leads to an accumulation of errors. This effect is often the dominant contribution to the error, especially if the trajectory passes a phase transition. Such situations arise for example in the context of studying the deconfined phase of SU(3) Yang-Mills theory [4,5]. We stress that the accumulation of the statistical error along the trajectory and the approximation error of its starting point are not independent. The former could be reduced if a better starting point was available. There are also non-equilibrium methods based on Jarzynski's identity to estimate free energy differences without the need for integration [5-7]. However, these methods also require expensive repeated simulations corresponding to an ensemble of non-equilibrium trajectories through phase space.
It is therefore desirable to develop methods which allow the direct estimation of the free energy at a given point in parameter space.
In the following, we will propose such a method based on deep generative machine learning models. As we will discuss, our method comes with rigorous error estimators and asymptotic guarantees. Over the last years, deep generative models have been applied with great success to generate, for example, high-resolution images, natural speech, and text (see [8] for an overview). In [9], a machine-learning-based regression algorithm for determining action parameters from an ensemble of field configurations is proposed and [10] uses a neural network to predict the structure of phase transitions from field configurations. References [11][12][13][14][15] conjecture a relation between Restricted Boltzmann Machines and Quantum Fields in the context of the holographic duality. In the recent works [16][17][18], deep generative models have also been used in the context of lattice quantum field theories (see also [19,20]). The main objective of these works was to reduce the integrated autocorrelation of the simulations. In contrast, this work demonstrates that deep generative models can be used to estimate quantities which are not (directly) obtainable by MCMC approaches.
We also note that generative models have been used in [21] to estimate free energy differences in the context of statistical mechanics by combining these models with the Zwanzig free energy perturbation method [22]. Contrary to this approach, our method estimates the absolute value of the free energy. We furthermore note that the free energy can also be directly computed using the Tensor Renormalization Group method, see [23] for an application to φ⁴ theory. For other novel approaches to obtain thermodynamic quantities and, in particular, the equation of state, see [24,25].
In the following, we will give a brief overview of relevant aspects of lattice field theories and generative models. We will then discuss how generative models can be used to estimate the free energy and compare this approach to MCMC-based methods in numerical experiments.
Lattice Field Theory. A lattice field theory can be described by an action S(φ). In the following, we will consider (Euclidean) real scalar field theory for concreteness, i.e. φ(x) ∈ R for each lattice site x ∈ Λ of the lattice Λ. The path integral then reduces to an ordinary high-dimensional integral. Therefore, expectation values of operators O(φ) can be calculated by

    ⟨O⟩ = (1/Z) ∫ D[φ] O(φ) e^{−S(φ)},

where we defined D[φ] = ∏_{x∈Λ} dφ(x) and the partition function Z is given by

    Z = ∫ D[φ] e^{−S(φ)}.

If we impose periodic boundary conditions in time for a lattice with temporal extent N_T, the theory is at finite temperature T = 1/β = 1/(N_T a), where a denotes the lattice spacing. The free energy is then defined by

    F = −(1/β) ln Z

and can be related to the pressure p = −F/V, where V denotes the spatial volume of the lattice Λ, whose number of lattice sites we denote by |Λ|. Similarly, the entropy H can be obtained from the free energy via F = U − T H, where U is the internal energy.
Deep Generative Models. We focus on a particular subclass of generative models called normalizing flows (see [26] for a recent review). These flows are distributions q θ with learnable parameters θ. They also have the appealing property that they allow for efficient sampling and calculation of the probability of the samples.
In more detail, these flows are constructed by defining an invertible neural network g_θ. For a brief overview of neural networks, we refer to the Supplement. The samples φ ∈ R^|Λ| are obtained by applying this network to samples z ∈ R^|Λ| drawn from a simple prior distribution q_Z, such as a standard normal N(0, 1):

    φ = g_θ(z),  z ∼ q_Z.

Since the network g_θ is invertible by assumption, it then follows from the change-of-variables theorem that φ ∼ q_θ with

    q_θ(φ) = q_Z(g_θ^{−1}(φ)) |det(dg_θ/dz)|^{−1}.  (3)

The architecture of the neural network g_θ is chosen such that i.) invertibility of g_θ and ii.) efficient evaluation of the Jacobian determinant det(dg_θ/dz) are ensured. A particular example of such an architecture is Non-linear Independent Component Estimation (NICE) [27], for which the neural network g_θ consists of invertible coupling layers y^l : R^|Λ| → R^|Λ|, i.e.

    g_θ = y^L ∘ ⋯ ∘ y^1.  (4)
Invertibility and efficient evaluation of the Jacobian determinant are then ensured by splitting the components of the layer y^l = (y^l_u, y^l_d) into two parts y^l_u ∈ R^{|Λ|−k} and y^l_d ∈ R^k for given k ∈ {1, ..., |Λ| − 1}. The layer y^{l+1} = (y^{l+1}_u, y^{l+1}_d) is then recursively defined by

    y^{l+1}_u = y^l_u,
    y^{l+1}_d = y^l_d + m(y^l_u),  (5)

where m is another neural network (not necessarily satisfying the two requirements from above). Due to the splitting, this can easily be inverted by

    y^l_u = y^{l+1}_u,
    y^l_d = y^{l+1}_d − m(y^{l+1}_u).

The total Jacobian determinant is then det(dg_θ/dz) = 1, since it is the product of the Jacobian determinants of the individual layers, each of which is equal to one.
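As a sketch, a single additive coupling layer and its inverse fit in a few lines of NumPy. The network m below is a small tanh MLP with arbitrary random weights, chosen purely for illustration; it is not the architecture used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

class AdditiveCoupling:
    """NICE-style additive coupling layer (simplified sketch): split z = (u, d),
    keep u fixed, and shift d by m(u). The Jacobian is unit triangular, so the
    layer contributes |det J| = 1 to the total Jacobian determinant."""

    def __init__(self, dim, hidden=16):
        assert dim % 2 == 0
        self.k = dim // 2
        # m: a small MLP with tanh non-linearities and no biases
        self.W1 = rng.normal(scale=0.5, size=(hidden, self.k))
        self.W2 = rng.normal(scale=0.5, size=(self.k, hidden))

    def m(self, u):
        return self.W2 @ np.tanh(self.W1 @ u)

    def forward(self, z):
        u, d = z[:self.k], z[self.k:]
        return np.concatenate([u, d + self.m(u)])

    def inverse(self, phi):
        u, d = phi[:self.k], phi[self.k:]
        return np.concatenate([u, d - self.m(u)])

layer = AdditiveCoupling(8)
z = rng.normal(size=8)
phi = layer.forward(z)
assert np.allclose(layer.inverse(phi), z)  # exact invertibility
```

Stacking several such layers with alternating splits yields an invertible map of the full configuration with tractable (here trivial) Jacobian determinant.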
Training. We want to train a generative model which samples field configurations φ ∼ q_θ approximately from the path-integral distribution

    p(φ) = e^{−S(φ)}/Z.  (6)

For this, the Kullback-Leibler (KL) divergence [28] between the normalizing flow q_θ and the target distribution p is minimized, i.e.

    KL(q_θ | p) = ∫ D[φ] q_θ(φ) ln(q_θ(φ)/p(φ)) = β (F_q − F) ≥ 0,

where we have defined the variational free energy F_q = (1/β) E_q[S(φ) + ln q_θ(φ)], as well as the expectation value E_q[O] = ∫ D[φ] q_θ(φ) O(φ) and the free energy F = −(1/β) ln(Z). This divergence vanishes if and only if the distributions q_θ and p are identical [29].
The KL divergence is minimized by gradient descent with respect to the parameters θ of the flow q_θ. Since the free energy F does not depend on the flow q_θ, the variational free energy F_q can equivalently be minimized. Therefore, the training procedure does not require a target distribution (6) with a tractable partition function Z. Using the explicit expression (3) for the probability of the flow, we can rewrite the variational free energy as

    F_q = (1/β) E_q[S(φ) + ln q_θ(φ)].  (7)

In training, the expectation value is approximated by its Monte-Carlo estimate. In machine learning, this approach of learning a model from an unnormalized target distribution is very well established [30-33]. Recently, the same method has been used in the context of lattice field theories [16]. Furthermore, this approach has been applied to quantum chemistry [34] and statistical physics [35-37].

The variational free energy does not allow us to infer the value of the KL divergence since the free energy F is not known. In order to alleviate this shortcoming, we define the random variable C(φ) = S(φ) + ln q_θ(φ), which is related to the variational free energy by βF_q = E_q[C]. In the Supplement, we show that

    KL(q_θ | p) = ½ Var_q(C) + higher-order moments of (w − 1),

where we have defined the importance weight w(φ) = p(φ)/q_θ(φ). Thus, convergence of training will result in a small variance Var_q(C). In practice, a Monte-Carlo estimate of this quantity can be calculated without any significant overhead during training, as C(φ) is also needed for the Monte-Carlo estimation of the variational free energy F_q, see (7). It is therefore advisable to closely monitor the variance of C during training.
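The Monte-Carlo estimates of F_q and Var(C) can be sketched as follows for a single-site toy "action" S(φ) = φ⁴ with a fixed standard-normal model q and β = 1; both choices are illustrative assumptions, not the setup of the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

def S(phi):                      # toy single-site action, stand-in for the lattice action
    return phi ** 4

def log_q(phi):                  # log-density of the model q = N(0, 1)
    return -0.5 * phi ** 2 - 0.5 * np.log(2.0 * np.pi)

phi = rng.normal(size=100_000)   # samples phi ~ q
C = S(phi) + log_q(phi)          # C(phi) = S(phi) + ln q(phi)
F_q = C.mean()                   # Monte-Carlo estimate of beta * F_q
var_C = C.var()                  # training diagnostic: shrinks as q approaches p
```

Both quantities come from the same array C, which is why monitoring Var(C) adds essentially no overhead to evaluating the loss.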
Estimation of Thermodynamical Observables. The partition function Z can be rewritten as

    Z = ∫ D[φ] q_θ(φ) e^{−S(φ)}/q_θ(φ) = E_q[w̃(φ)],

where we have defined the unnormalized importance weight w̃(φ) = e^{−S(φ)}/q_θ(φ). Therefore, the partition function can be estimated by Monte-Carlo as follows:

    Ẑ = (1/N) Σ_{i=1}^N w̃(φ_i),  φ_i ∼ q_θ.  (9)

[Figure/Table caption: Errors are obtained with the delta method for the flow and the uwerr method [42] for HMC, respectively (see Supplement for the Jackknife error analysis).]
We emphasize that the sampling procedure does not need to be sequential (as for a Markov chain). As a result, it can be parallelized very efficiently and does not suffer from autocorrelation. From Ẑ, one can then easily estimate the free energy by

    F̂ = −(1/β) ln Ẑ.  (10)

From the free energy (10), one can then straightforwardly obtain estimates for the pressure and the entropy, as explained above. The estimator (10) has been extensively studied in the context of training Importance Weighted Variational Autoencoders (IWAE) [43-45]. It was shown in [43] that it is a statistically consistent estimator if the support of q_θ contains the support of the target p.
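A minimal sketch of the direct estimator, again for the illustrative single-site toy action S(φ) = φ⁴ with model q = N(0, 1) and β = 1 (assumptions for this example only); the log-sum-exp trick guards against overflow in the weights.

```python
import numpy as np

rng = np.random.default_rng(2)

def S(phi):                      # toy single-site action (illustrative)
    return phi ** 4

def log_q(phi):                  # model q = N(0, 1)
    return -0.5 * phi ** 2 - 0.5 * np.log(2.0 * np.pi)

phi = rng.normal(size=200_000)               # i.i.d. samples phi ~ q, no Markov chain
log_w = -S(phi) - log_q(phi)                 # ln w_tilde = -S(phi) - ln q(phi)
m = log_w.max()                              # log-sum-exp for numerical stability
log_Z_hat = m + np.log(np.mean(np.exp(log_w - m)))
F_hat = -log_Z_hat                           # F_hat = -(1/beta) ln Z_hat, beta = 1
```

For this toy model the exact answer is Z = 2Γ(5/4), so the estimate can be checked directly; on the lattice, Z is of course unknown and (10) is the point of the method.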
In [44], its variance and bias were derived using the delta method (see also [36,45]). For convenience, we summarize the relevant results in the Supplement. Alternatively, one can use the Jackknife method to estimate the bias and variance [46].

Numerical Experiments. We apply the proposed method to two-dimensional real scalar field theory with action

    S(φ) = Σ_{x∈Λ} [ −2κ Σ_{μ=1}^2 φ(x)φ(x+μ̂) + (1 − 2λ) φ(x)² + λ φ(x)⁴ ],

where μ̂ denotes the unit vector in direction μ, κ is the hopping parameter, and λ denotes the bare coupling constant of the theory. The action is invariant under Z₂ transformations, i.e. φ → −φ. Figure 1 shows the absolute magnetization |φ| as a function of the hopping parameter κ. As the hopping parameter κ increases, spontaneous magnetization is observed.
In the following, we will estimate the free energy F e at λ e = 0.022 and κ e = 0.3 for lattice sizes |Λ| = N L × N T of 64 × 8, 48 × 8, 32 × 8, 16 × 8 with both the flow-based and an MCMC-based method.
Using the flow method, we can directly estimate these free energies. We modify the NICE architecture to ensure that the flow q_θ is invariant under Z₂ transformations, i.e. q_θ(φ) = q_θ(−φ). By the definition (3) of q_θ, an odd function g_θ(−z) = −g_θ(z) implies Z₂-invariance of q_θ. The map g_θ is odd if all its coupling blocks y^l are odd, see (4). The latter condition can be ensured by choosing an odd neural network m for the coupling (5), which we achieve by using tanh non-linearities and vanishing biases for the network m.
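The oddness argument is easy to check numerically: tanh is odd and, without biases, each layer is the composition of a linear map and an odd function, so the whole network is odd. A minimal check with arbitrary (illustrative) layer sizes:

```python
import numpy as np

rng = np.random.default_rng(3)

# With tanh non-linearities and vanishing biases, every layer u -> tanh(W u)
# is odd, hence so is the composed network m.
Ws = [rng.normal(size=(8, 4)), rng.normal(size=(4, 8)), rng.normal(size=(4, 4))]

def m(u):
    for W in Ws:
        u = np.tanh(W @ u)
    return u

u = rng.normal(size=4)
assert np.allclose(m(-u), -m(u))   # m is odd: m(-u) = -m(u)
```

An odd m makes each additive coupling odd, hence g_θ odd, hence q_θ(φ) = q_θ(−φ) by the change-of-variables formula.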
After training has completed, the free energy is computed using the proposed estimator (10). For error analysis, we use both the Jackknife and the delta method and check that they lead to consistent error estimates. In many applications, generative models suffer from mode dropping [47], i.e. some modes of the target p are not captured by the model q_θ. For our specific estimation method, however, a simple consistency check can be performed to ensure that mode dropping does not occur. To this end, we estimate Z = (E_p[w̃⁻¹])⁻¹ by a single Markov chain at the target point in parameter space and ensure that this leads to a compatible estimate, see Supplement.
For MCMC, we use a reweighting procedure [2,3] which is significantly more involved. It uses the relation F_e = ∆F^e_b + F_b, where F_b is the free energy at κ_b = 0 and λ_b = λ_e. The value of F_b can be calculated analytically since, for vanishing hopping parameter κ, the partition function factorizes into single-site integrals which can be expressed in terms of the modified Bessel function of the second kind K_n; we prove this relation in the Supplement, where |Λ| again denotes the number of sites of the lattice Λ. The free energy difference is obtained from the ratio of partition functions,

    e^{−β∆F^e_b} = Z_e/Z_b = E_{p_b}[e^{−(S_e(φ) − S_b(φ))}].

We estimate this expectation value with an overrelaxed HMC algorithm [38-41]. In practice, the variance of the estimator will become prohibitively large if the two distributions p_b and p_e do not have sufficient overlap. We therefore choose intermediate distributions p_{i_1}, ..., p_{i_K}, ensuring that neighbouring distributions p_{i_k} and p_{i_{k+1}} have sufficient overlap. The free energy difference can then be obtained by summing the differences between neighbouring intermediate distributions,

    ∆F^e_b = Σ_{k=0}^{K−1} ∆F^{i_{k+1}}_{i_k},  i_0 = b,  i_K = e.

In our numerical experiments, we keep λ = 0.022 fixed and only vary the hopping parameter κ of the intermediate distributions p_i. We choose a difference in hopping parameter of δκ = 0.01 for κ ∈ [0.2, 0.3] and δκ = 0.05 for all other intermediate hopping parameters. We therefore use K = 14 Markov chains with 400k steps each; thus, a total of 5.6 million configurations is used for estimation. For a detailed analysis of the dependence of our results on this choice of δκ, we refer to the Supplement. The error analysis is performed with both the uwerr [42] and the Jackknife method, which are checked to lead to consistent estimates. We again refer to the Supplement for a more detailed description.

Figure 2 shows that the estimates of both the flow and MCMC are compatible within error bars. However, the trajectory of the MCMC method has to pass the critical region, which is challenging due to critical slowing down. The flow-based estimate can be performed directly at the desired point in parameter space and therefore does not suffer from this problem.
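The reweighting chain above can be sketched with a toy family of one-dimensional Gaussian "actions" S_k(x) = x²/(2 s_k²), whose log partition functions ln Z_k = ln(s_k √(2π)) are known exactly; exact sampling stands in for the HMC chains, and all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

sigmas = np.linspace(1.0, 2.0, 6)              # chain of intermediate distributions

def S(x, s):                                   # action of intermediate point with width s
    return x ** 2 / (2.0 * s ** 2)

log_Z_ratio = 0.0                              # accumulates ln(Z_e / Z_b)
for s0, s1 in zip(sigmas[:-1], sigmas[1:]):
    x = rng.normal(scale=s0, size=100_000)     # x ~ p_k (stand-in for a Markov chain)
    # reweighting estimate of ln(Z_{k+1} / Z_k) = ln E_{p_k}[exp(-(S_{k+1} - S_k))]
    log_Z_ratio += np.log(np.mean(np.exp(-(S(x, s1) - S(x, s0)))))

delta_F = -log_Z_ratio                         # beta * dF = -ln(Z_e / Z_b), beta = 1
```

Here the exact answer is ln(Z_e/Z_b) = ln(σ_K/σ_0) = ln 2. The per-step estimates only converge well because neighbouring widths overlap; widening the steps inflates the weight variance, which is exactly the overlap problem discussed above.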
This conceptual difference leads to a significantly more precise estimate by the flow-based method. For regions in parameter space which do not require crossing a phase transition, MCMC-based methods have errors of a comparable order of magnitude (see Supplement). The ability of the flow to perform direct estimates is of both practical and conceptual importance. For example, in finite-temperature QCD one often uses a trajectory whose initial free energy is approximated by the hadron resonance gas model (see for example [2]), leading to an undesirable systematic error. Furthermore, summing up the free energy differences along the trajectory leads to an accumulation of errors. This effect is often the dominant contribution to the error and is particularly pronounced in situations for which the trajectory has to cross a phase transition. Such situations are of great practical relevance, for example in the deconfined phase of SU(3) gauge theory [4,5]. We stress that both error sources are related, since the initial free energy is the starting point of the trajectory.
Conclusion. In this letter, we have proposed a method to directly estimate the free energy and hence thermodynamical observables of lattice field theories using deep generative models. This method is of great conceptual appeal as it avoids cumbersome integration through parameter space and does not require an exactly or approximately known integration constant. Future work will focus on scaling this approach to four-dimensional gauge theories.
Recent work has successfully constructed flows which are manifestly gauge-invariant [17,18]. This recent progress, combined with the enormous ongoing advances in deep learning, makes it very promising that our method can be applied to non-abelian gauge theories, and ultimately QCD, in the not too distant future.
Acknowledgements. This work was supported in part by the German Ministry for Education and Research (BMBF) under Grants 01IS14013A-E, 01GQ1115, 01GQ0850, 01IS18025A and 01IS18037A. This work is also supported by the Fraunhofer Heinrich-Hertz-Institut (HHI) and by the grant funded by the DFG (EXC 2046/1, Project-ID 390685689).

Conventions for Action
The form of the action S used in the main text is

    S(φ) = Σ_{x∈Λ} [ −2κ Σ_μ φ(x)φ(x+μ̂) + (1 − 2λ) φ(x)² + λ φ(x)⁴ ].

It can be obtained by starting from the (more standard) action

    S(φ₀) = Σ_{x∈Λ} [ ½ Σ_μ (φ₀(x+μ̂) − φ₀(x))² + ½ m₀² φ₀(x)² + (g₀/4!) φ₀(x)⁴ ]

(in lattice units) and performing the following re-definitions:

    φ₀ = √(2κ) φ,  m₀² = (1 − 2λ)/κ − 2d,  g₀ = 6λ/κ²,

with d the number of spacetime dimensions.

A Brief Overview of Deep Learning

Neural Networks

Neural networks are a machine learning algorithm which has proven to be particularly powerful. A neural network is built from layers, which are defined by

    y^l = σ(W^l x + b^l),

where x ∈ R^n and y^l ∈ R^m are the input and output of the layer. The output of the layer is also often called its activation. The weights W^l ∈ R^{m×n} and the bias b^l ∈ R^m are the learnable parameters of the neural network. The non-linearity σ : R → R is a non-linear function which is applied element-wise to the components of W^l x + b^l. Widely-used activation functions are σ(x) = max(x, 0) and σ(x) = tanh(x).
A neural network consists of L such layers, i.e.

    f(x) = y^L ∘ y^{L−1} ∘ ⋯ ∘ y^1(x).
It is important to note that the weights W^l and biases b^l do not have to be of the same dimensionality for each layer (although we did not make this explicit in our notation). It is also important to note that we merely described the simplest type of neural network, namely a fully-connected neural network. There is a zoo of other neural network architectures, but we will refrain from a more detailed discussion as it is not needed for our purposes (see [8] for an overview).
Learning Parameters with Backpropagation

The parameters of the neural network are determined by minimizing a certain loss function L by gradient descent (see (7) for the particular loss function used in this work). It is important to emphasize that the number of parameters is typically large (of order 10³ to 10⁶ for typical modern neural networks). It is therefore clear that one cannot determine the gradient by finite differences, as we would need to calculate the finite-difference ratio for each of these parameters, which is prohibitively expensive.
The basic idea for calculating the gradient ∇_W L is to use the fact that we know the functional form of the neural network: by the chain rule, the gradient of the loss with respect to the activation of layer l is given by

    ∂L/∂y^l = (∂L/∂y^L) (∂y^L/∂y^{L−1}) ⋯ (∂y^{l+1}/∂y^l).

Each term in this expression is known; for example, ∂y^{l+1}/∂y^l = σ′(W^l y^l + b^l) W^l, where σ′ is applied element-wise.
For a fixed non-linearity, we know the analytical form of the derivative σ′. This observation leads to the following algorithm: we first perform a forward pass of the neural network, i.e. starting from the input x, we calculate the activations y^l for each layer and store them in memory. This process ends with the final activation y^L which is, by definition, the output of the neural network. The Jacobian of each layer ∂y^{l+1}/∂y^l = σ′(W^l y^l + b^l) W^l can then be calculated directly (as we have stored the activation y^l). Crucially, we only need the matrix product of these Jacobians, and it is efficient to start by calculating the gradient with respect to the output layer L, then layer L − 1, and so forth. This is because the loss function has a scalar output value and therefore the matrix product of the Jacobians (∂L/∂y^L)(∂y^L/∂y^{L−1}) ⋯ (∂y^{l+1}/∂y^l) is a vector with the same number of components as y^l. We can therefore save memory by simply overwriting the stored activation y^l. This algorithm is called backpropagation and allows us to calculate the gradient ∇_W L for roughly the same cost as a forward pass of the neural network.
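The forward-then-backward scheme above can be sketched for a two-layer tanh network with a toy loss L = ½‖y^L‖² (an illustrative choice; layer sizes and weights are arbitrary), including a finite-difference check of one weight entry.

```python
import numpy as np

rng = np.random.default_rng(5)

# two fully-connected tanh layers without biases, toy loss L = 0.5 * ||y_L||^2
W = [rng.normal(scale=0.5, size=(4, 3)), rng.normal(scale=0.5, size=(2, 4))]

def forward(x):
    acts = [x]                                # store activations for the backward pass
    for Wl in W:
        acts.append(np.tanh(Wl @ acts[-1]))
    return acts

def loss_and_grads(x):
    acts = forward(x)
    L = 0.5 * np.sum(acts[-1] ** 2)
    g = acts[-1]                              # dL/dy_L
    grads = [None] * len(W)
    for l in reversed(range(len(W))):         # multiply Jacobians from the output inward
        pre = W[l] @ acts[l]
        g = g * (1.0 - np.tanh(pre) ** 2)     # through the tanh non-linearity
        grads[l] = np.outer(g, acts[l])       # dL/dW_l
        g = W[l].T @ g                        # propagate to the previous activation
    return L, grads

x = rng.normal(size=3)
L0, grads = loss_and_grads(x)

# finite-difference check of a single weight entry
eps = 1e-6
W[0][0, 0] += eps
L1, _ = loss_and_grads(x)
fd = (L1 - L0) / eps
assert abs(fd - grads[0][0, 0]) < 1e-4
```

Note that the backward pass reuses the stored activations and, at each step, only carries a vector of the size of the current layer, which is exactly the memory-saving property described above.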

Relation between Var(C) and KL divergence
Theorem. Let C(φ) = S(φ) + ln q(φ). The following relation between the KL divergence and the variance of C holds:

    KL(q | p) = ½ Var_q(C) + O(E_q[(w − 1)³]),

where w(φ) = p(φ)/q(φ) is the normalized importance weight.

Proof. The expectation value of the normalized importance weight is

    E_q[w] = ∫ D[φ] q(φ) p(φ)/q(φ) = 1.

The Kullback-Leibler divergence can be rewritten in terms of the normalized importance weight:

    KL(q | p) = E_q[ln(q/p)] = −E_q[ln w].

We now expand the KL divergence around the expectation value E_q[w] = 1:

    KL(q | p) = −E_q[(w − 1) − ½ (w − 1)² + O((w − 1)³)] = ½ E_q[(w − 1)²] + O(E_q[(w − 1)³]),

where the linear term vanishes because E_q[w] = 1. We now relate this expression to the variance of C. To this end, we first observe that C = −ln w̃, where w̃ = e^{−S}/q is the unnormalized importance weight, and that w̃ = Z w. We then rewrite the expectation value of C as

    E_q[C] = −ln Z − E_q[ln w] = βF + KL(q | p),

where the last step uses the expansion for the KL divergence derived above. It then follows that its variance is given by

    Var_q(C) = Var_q(ln w) = E_q[(ln w)²] − (E_q[ln w])².

Expanding the logarithm around E_q[w] = 1 again, we obtain

    E_q[(ln w)²] = E_q[(w − 1)²] + O(E_q[(w − 1)³]),

while (E_q[ln w])² = KL(q | p)² is of higher order. Combining this expression with the expansion derived for the KL divergence, we obtain the claim of the theorem.
The higher-order moments will be small towards the end of the training process for which q ≈ p and thus w ≈ 1. Thus, the variance of C will become small. We indeed observe this behaviour in our numerical experiments, see Figure 3 for an example.
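The relation can be illustrated numerically with two unit-variance Gaussians, q = N(0, 1) and a normalized target p = N(μ, 1) (an illustrative choice for which KL(q‖p) = μ²/2 is known in closed form; for this family the leading-order relation KL = ½ Var_q(C) even holds exactly):

```python
import numpy as np

rng = np.random.default_rng(6)

mu = 0.3
x = rng.normal(size=500_000)            # samples from q = N(0, 1)

# C(x) = S(x) + ln q(x) with S = -ln p; p is normalized here, so Z = 1 and
# the additive constant drops out of the variance anyway
C = 0.5 * (x - mu) ** 2 - 0.5 * x ** 2

kl_exact = mu ** 2 / 2.0                # KL(q || p) for two unit-variance Gaussians
print(0.5 * C.var(), kl_exact)          # the two values agree
```

For less closely matched pairs of distributions, the higher-order moments of (w − 1) make ½ Var_q(C) an approximation rather than an identity, which is why it serves as a training diagnostic rather than an exact KL estimator.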

Analytic Solution for Partition Function
The action of the scalar field theory is given by

    S(φ) = Σ_{x∈Λ} [ −2κ Σ_μ φ(x)φ(x+μ̂) + (1 − 2λ) φ(x)² + λ φ(x)⁴ ].

We want to calculate the partition function

    Z = ∫ D[φ] e^{−S(φ)},

which for vanishing hopping parameter κ decouples into independent integrals over each lattice site of the lattice Λ:

    Z = ∏_{x∈Λ} ∫ dφ(x) e^{−(1−2λ) φ(x)² − λ φ(x)⁴} = I(λ)^{|Λ|}.

The single-site integral can be calculated analytically using

    ∫_{−∞}^{∞} dφ e^{−a φ² − b φ⁴} = ½ √(a/b) e^{a²/(8b)} K_{1/4}(a²/(8b)),

where K_n is the modified Bessel function of the second kind. Using this formula with a = 1 − 2λ and b = λ, we obtain the following analytic form of the free energy:

    βF = −ln Z = −|Λ| ln I(λ),

where we have defined

    I(λ) = ½ √((1 − 2λ)/λ) e^{z(λ)} K_{1/4}(z(λ)),  z(λ) = (1 − 2λ)²/(8λ).

This result corresponds to the zeroth order of the hopping expansion [46] of the partition function Z, and one may, in principle, also calculate higher-order corrections. However, they are not needed for our purposes.
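The closed form of the single-site integral is easy to verify against direct numerical quadrature; the sketch below assumes SciPy is available and uses the λ value from the experiments purely as an example.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import kv  # modified Bessel function of the second kind

lam = 0.022                            # bare coupling (illustrative value)
a = 1.0 - 2.0 * lam                    # coefficient of phi^2 at kappa = 0
z = a ** 2 / (8.0 * lam)

# closed form: (1/2) sqrt(a/b) exp(a^2/(8b)) K_{1/4}(a^2/(8b)) with b = lam
closed = 0.5 * np.sqrt(a / lam) * np.exp(z) * kv(0.25, z)

# direct quadrature of the single-site integral
numeric, _ = quad(lambda p: np.exp(-a * p ** 2 - lam * p ** 4), -np.inf, np.inf)
```

Raising the single-site value to the power |Λ| (i.e. multiplying the log by |Λ|) then gives the κ = 0 partition function used as the starting point of the MCMC trajectory.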

Error Analysis for Free Energy Estimator
Our discussion is based on [44], which discussed the same results in the context of variational inference. We provide a review here since these results may be hard to extract for physicists not familiar with variational inference. We also point out a subtlety that was not discussed in that work.
The estimator for the free energy is given by

    F̂ = −(1/β) ln Ẑ,  Ẑ = (1/N) Σ_{i=1}^N w̃(φ_i),  φ_i ∼ q.

Theorems for the variance and bias of this estimator are discussed in the following. For this, we use the delta method for moments, which is summarized in the following theorem.
Theorem. Let X̄ = (1/N) Σ_{i=1}^N X_i be the sample mean of independent and identically distributed random variables X_i with mean μ and variance σ². Let h be a real-valued function with uniformly bounded derivatives. It then holds that

    E[h(X̄)] = h(μ) + ½ h″(μ) σ²/N + O(N⁻²).

We refer to Chapter 5.3 of [48] for a proof and more details.
The application of the delta method to the free energy estimator F̂ is, in practice, subject to a subtlety regarding the bounded differentiability of the function h. We will ignore this subtlety in the following and return to it at the end of the section.
Theorem. The bias of F̂ is given by

    E_q[F̂] − F = Var_q(w)/(2βN) + O(N⁻²),  (17)

where w = w̃/Z is the normalized importance weight.

Proof. The bias of −βF̂ = ln Ẑ is obtained from the delta method for moments, where we have used that h(x) = ln(x) has second derivative h″(x) = −1/x²:

    E_q[ln Ẑ] = ln Z − Var_q(w̃)/(2NZ²) + O(N⁻²).  (18)

The proof then concludes by observing that Z = E_q[w̃], so Var_q(w̃)/Z² = Var_q(w).

Theorem. The variance of F̂ is given by

    Var(F̂) = Var_q(w)/(β²N) + O(N⁻²),  (19)

assuming that E_q[w^{2k+2}] < ∞ for k = 0, 1.
Proof. The variance can be written as

    Var(ln Ẑ) = E_q[(ln Ẑ)²] − (E_q[ln Ẑ])².

We now evaluate both terms on the right-hand side individually using the delta method. For the first term, we use the delta method with h(x) = (ln x)², which has second derivative

    h″(x) = 2 (1 − ln x)/x².

Using this expression, we then obtain that E_q[(ln Ẑ)²] is equal to

    (ln Z)² + (1 − ln Z) Var_q(w̃)/(NZ²) + O(N⁻²).

For the squared expectation value, we use the expansion (18) derived in the proof for the bias. This gives that (E_q[ln Ẑ])² is equal to

    (ln Z)² − ln Z · Var_q(w̃)/(NZ²) + O(N⁻²).

Subtracting these two expressions, it then follows that

    Var(ln Ẑ) = Var_q(w̃)/(NZ²) + O(N⁻²),

and the proof concludes by observing that Z = E_q[w̃], so Var_q(w̃)/Z² = Var_q(w).
A few remarks are in order: from the theorems, it follows that the standard deviation of the estimator F̂ is of order O(1/√N). In the large-N limit, we can therefore neglect the bias correction as it is of order O(N⁻¹). Furthermore, we can replace the expectation values in the theorems by the sample mean up to (negligible) higher-order corrections. In practice, we therefore use these results to estimate the variance and bias of F̂. Alternatively, one can use a standard Jackknife analysis to estimate variance and bias (see for example [46]). In our experiments, we use both methods to estimate the errors and check that they lead to consistent results. Lastly, we remark that error estimators for general observables involving the partition function can be derived, see [36].
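In practice, the delta-method error bars reduce to a few lines once the importance weights are available; the sketch below uses the illustrative single-site toy action S(φ) = φ⁴ with model q = N(0, 1) and β = 1 (assumptions for this example, not the setup of the main text).

```python
import numpy as np

rng = np.random.default_rng(7)

phi = rng.normal(size=100_000)                       # samples phi ~ q = N(0, 1)
log_q = -0.5 * phi ** 2 - 0.5 * np.log(2.0 * np.pi)
w_tilde = np.exp(-phi ** 4 - log_q)                  # unnormalized importance weights
N = w_tilde.size

Z_hat = w_tilde.mean()
F_hat = -np.log(Z_hat)                               # free energy estimate, beta = 1
var_w = w_tilde.var(ddof=1)
var_F = var_w / (Z_hat ** 2 * N)                     # delta method: Var(F_hat) ~ Var_q(w)/N
bias_F = var_w / (2.0 * Z_hat ** 2 * N)              # O(1/N) bias, subleading to sqrt(var_F)
```

Here Z and Var_q(w) in the theorems are replaced by their sample estimates, which is accurate up to the higher-order corrections noted above; a binned Jackknife over the same weights provides an independent cross-check.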
As mentioned above, the delta method requires that the derivatives of the function h are (uniformly) bounded. For a generic lattice quantum field theory, this will not be the case for h(x) = ln(x), since its derivatives diverge for x → 0⁺. To the best of our knowledge, the same problem generically arises in the context of variational inference but seems not to have been discussed in the literature.
To address this subtlety, one could require that the action of the lattice quantum field theory is bounded. For example, this can be ensured by putting the field theory in a box potential. Since only very high-energy configurations are affected by this (for a suitably large choice of the box potential), and since these configurations are extremely unlikely to be sampled, this modification has no practical effect on the numerical experiments. After this modification, Ẑ is bounded from below and h^{(n)}(Ẑ) is also bounded as a result.
More rigorously, the result for the variance can be derived without assumptions on a bound for the derivatives by using the delta method for approximation in law, which takes the following form.

Theorem. Let X̄ = (1/N) Σ_{i=1}^N X_i be the sample mean of independent and identically distributed random variables X_i with mean μ and finite variance σ², and let h be differentiable at μ with h′(μ) ≠ 0. Then

    √N (h(X̄) − h(μ)) → N(0, σ² h′(μ)²)  in distribution.

For a proof, we again refer to [48], see Theorem 5.3.3. Applying this theorem to the free energy estimator −βF̂ = ln Ẑ, we obtain the same expression for its variance as derived above. However, the theorem does not require any bound on the derivatives of h(x) = ln(x).

Systematic Errors
In this section, we discuss the various sources of systematic error relevant for both the flow-based and the MCMC-based estimation methods and how they can be assessed.
Mode-dropping for the generative model: We ensure that no mode-dropping for the generative model takes place, i.e. all modes of the target distribution p are captured by the generative model q. Mode-dropping can cause underestimation of the partition function Z. To this end, we use the relation

    Z = (E_p[w̃(φ)⁻¹])⁻¹.

The left-hand side can be estimated using a single Markov chain. Note that we sample from the target distribution p as opposed to the generative model q. Therefore, the estimator is consistent even if q is mode-dropping. The resulting estimate Ẑ for the partition function can then be plugged into (10) to obtain an estimate for the free energy. It is then checked that the result is consistent with the one obtained from (9), see Figure 5. We stress that this consistency check only requires a single Markov chain and is therefore equal in cost to the overlap consistency check of the MCMC method discussed below.

Bias due to imperfect training: From the theoretical analysis of the variance of the estimator (19), it is expected that it has a standard deviation with the typical N^{−1/2} fall-off in the number of samples N. On the other hand, the bias of the estimator (17) due to imperfect training has a subleading N^{−1} fall-off. We check carefully that our error estimates indeed show the theoretically predicted N^{−1/2} behaviour in our numerical experiments.

Repeated runs: We repeated the estimate for the smallest lattice ten times. The resulting estimates are shown in Figure 6. The sample standard deviation is consistent with our error estimates. Furthermore, the MCMC-based method does not systematically over- or underestimate with respect to the flow.
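The mode-dropping check can be sketched with a toy pair of distributions where Z is known; exact sampling from the target stands in for the single Markov chain, and the unnormalized target exp(−x²/2) with Z = √(2π) and the model q = N(0, 1.2²) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(8)

x = rng.normal(size=200_000)                              # x ~ p = N(0, 1)
log_q = -0.5 * (x / 1.2) ** 2 - np.log(1.2 * np.sqrt(2.0 * np.pi))
log_w = -0.5 * x ** 2 - log_q                             # ln w_tilde = -S - ln q
Z_hat = 1.0 / np.mean(np.exp(-log_w))                     # Z = (E_p[1/w_tilde])^(-1)
```

Because the expectation is taken under p rather than q, a mode of p missed by q is still visited by the chain and shows up as a discrepancy between this Ẑ and the flow-based estimate (9).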
Step size for MCMC: As explained in the main text, the free energy difference ∆F^e_b is calculated in steps of varying size in the hopping parameter κ. In the following, we will analyze the dependency of our results on the chosen step sizes.
We start from an initial step size corresponding to a change in hopping parameter κ of δκ = 0.05. Between κ = 0.2 and κ = 0.3, we however take a finer step size of δκ = 0.01. Since we are interested in the free energy difference ∆F^e_b between κ_b = 0.0 and κ_e = 0.3, this corresponds to running K = 14 Markov chains. We focus on the 16 × 8 lattice and use an overrelaxed HMC algorithm to sample 400k configurations for each chain. The overrelaxation is performed every 10 steps.

[Table I caption: For the flow-based method, we use the same number of samples as for the corresponding refined MCMC method. As a result, the error of the flow's estimate also decreases.]
We then repeatedly refine the step size in a certain subregion around the critical κ value. The details can be found in Table I. The results of this analysis are shown in Figure 7. We observe that the error of the estimator does not significantly decrease. We note that the error of the flow decreases during refinement because its free energy estimation uses the same number of samples as all Markov chains combined (and this number increases with the additional refinement steps).

Mode-dropping for MCMC: In order to ensure that the distributions p_i and p_{i+1} have sufficient overlap, we also estimate Z_i/Z_{i+1} by exchanging p_i with p_{i+1} in the reweighting relation above. We then check that this leads to compatible results, see Figure 8. We note that this consistency check is relatively cheap, as it requires running only one additional Markov chain. We also study the dependence of the integrated autocorrelation of the free energy on the refinement of δκ, see Figure 4.

Additional point in parameter space
In the main text, we demonstrated our method in a case where the MCMC-based method had to cross the critical region to calculate the absolute value of the free energy, because we consider this one of the main applications of our method. Since this leads to large integrated autocorrelation times for the MCMC-based method, its errors are significantly larger than those of the generative-model estimate. In the following, however, we will focus on a point in parameter space for which the MCMC approach does not need to cross a critical region. Namely, we use the same value of the bare coupling λ = 0.022 as in the main text but set the hopping parameter to κ = 0.2. As can be seen from Figure 1, this corresponds to a point before the critical regime. The resulting estimates for the free energy are shown in Figure 9. The MCMC-based estimate has a variance of the same order of magnitude as the flow-based one in this regime.

Details on Numerical Experiments
HMC: We use an HMC algorithm with overrelaxation. Each Markov chain has 5k thermalization steps followed by 400k estimation steps. The sign of the field configuration is flipped every ten steps.
Training of the flow: For every lattice, we use a normalizing flow with six coupling layers. Each coupling layer (5) has a neural network m with five fully-connected layers with no biases and tanh non-linearities. The hidden layers of m consist of 1000 neurons each. For each coupling layer, we split the input in half to obtain y_u and y_d, see (5), using a checkerboard-type partitioning. Consecutive coupling blocks use alternating checkerboard partitionings in order to ensure that all lattice sites are updated. We train the flow for 1M steps with a mini-batch size of 8k. We use the ReduceLROnPlateau learning-rate scheduler of PyTorch with an initial learning rate of 5 × 10⁻⁴ and a patience of 3k steps. The minimum learning rate was set to 1 × 10⁻⁷.
Estimation: As described in the main text, for HMC-based estimation we use a step size of δκ = 0.01 for κ ∈ [0.2, 0.3] and a step size of δκ = 0.05 for all other values of the hopping parameter. As a result, 14 Markov chains are run. In total, the HMC-based method therefore uses 14 × 400k = 5.6M configurations. We use the same number of samples for the flow-based estimation. For efficiency, we sample these configurations in mini-batches of 3k samples.
Error estimation: We use both the uwerr [42] and the Jackknife method to estimate uncertainties for HMC. In order to deal with autocorrelation in the Jackknife analysis, we perform binning with a bin size of 1k. Error estimation for the flow is performed with the delta method and also with the Jackknife, see Figure 10. From the free energy estimates, one can then derive other thermodynamic observables such as the entropy; we refer to the main text for a discussion. Figure 11 shows the estimate of the entropy. Errors were estimated using both the Jackknife and the uwerr method; both error analysis methods lead to consistent results.
Runtime: Normalizing flows allow for very efficient sampling. Specifically, we can sample the total of 5.6M configurations in under a minute for all lattices considered in the main text. Furthermore, this sampling procedure can be perfectly parallelized over multiple GPUs, as the flow can generate each configuration independently. On the other hand, flows require a substantial up-front training cost which is independent of the number of samples used for free energy estimation and is therefore amortized over the entire estimation procedure. As a result, the relative runtime comparison between the flow-based and the HMC-based algorithm depends strongly on the number of samples (as well as, of course, on the implementation and hardware setup used). In our numerical experiments, we use an Intel Xeon 2.4 GHz CPU with an NVIDIA P100 graphics card with 16 GB of memory. Neither our HMC implementation nor our generative-model implementation was optimized. Training of the generative models takes about 20 hours to converge. Generating the samples with 14 different Markov chains takes about 25 hours. It is likely that both the MCMC runtime and the training time could be substantially reduced by using more efficient implementations and by tuning the hyperparameters of the training process, such as the initial learning rate, its decay schedule, the choice of optimizer, the initialization of the weights, and early stopping (see [49] for an overview). Since the goal of this work was to introduce the method and provide a proof of principle, we did not explore such techniques as part of this study.