Detecting and Mitigating Mode-Collapse for Flow-based Sampling of Lattice Field Theories

We study the consequences of mode-collapse of normalizing flows in the context of lattice field theory. Normalizing flows allow for independent sampling. For this reason, it is hoped that they can avoid the tunneling problem of local-update MCMC algorithms for multi-modal distributions. In this work, we first point out that the tunneling problem is also present for normalizing flows but is shifted from the sampling to the training phase of the algorithm. Specifically, normalizing flows often suffer from mode-collapse for which the training process assigns vanishingly low probability mass to relevant modes of the physical distribution. This may result in a significant bias when the flow is used as a sampler in a Markov-Chain or with Importance Sampling. We propose a metric to quantify the degree of mode-collapse and derive a bound on the resulting bias. Furthermore, we propose various mitigation strategies in particular in the context of estimating thermodynamic observables, such as the free energy.


I. INTRODUCTION
Using normalizing flows for sampling in lattice field theory has gained significant attention over the last few years. Several works have been carried out in the domain of scalar field theories [1][2][3][4][5][6][7][8][9][10], U(1) [11,12] and SU(N) [13][14][15] pure gauge theories, and fermionic gauge theories [16,17]. This rapid development is attributed to the appealing conceptual properties of flow-based sampling. A well-trained flow approximately acts as a trivializing map [18] and can therefore significantly reduce the integrated autocorrelation time of physical observables. The practical obstruction to harnessing this conceptual advantage is that the training process becomes increasingly challenging as the dimensionality of the lattice increases, resulting in poor volume scaling [19][20][21][22]. Furthermore, it is well-known that generative models struggle to learn long-range correlations [23], which is crucial as a critical point is approached. When the continuum limit of the theory is taken, both challenges manifest simultaneously: the volume needs to be increased as the critical point is approached. As a result, it remains an open question whether useful architectures can be found for addressing critical slowing down in the continuum limit.
Another conceptually appealing property of normalizing flows is that they allow for independent sampling, thus making flow densities suitable for being combined with Metropolis-Hastings accept-reject schemes. This approach is often referred to as Neural-MCMC [1,[24][25][26]. As a result, it may be hoped that flows can avoid the tunneling problem which arises when local-update MCMC algorithms are applied to theories that have degenerate minima separated by high action barriers. However, normalizing flows are typically trained by self-sampling in the context of lattice field theory [1]. As we will discuss, this bears the risk that the training will assign vanishingly low probability mass to some of the modes of the theory [2][3][4], since the training objective does not strongly penalize this. If mode-collapse happens, certain modes of the theory will not be probed by the sampler. This problem therefore leads to substantially biased estimators of physical observables, as shown in fig. 1. In this sense, the tunneling problem reappears for normalizing flows in the form of mode-collapse and is thus shifted from the sampling to the training phase of the algorithm.
This work studies mode-collapse and the more general mode-mismatch phenomenon, both theoretically and numerically. We first discuss in detail the mode-seeking nature of the standard self-sampling-based training procedure, which corresponds to minimizing the reverse Kullback-Leibler (KL) divergence [3]. We compare this to an alternative training procedure which is based on minimizing the forward (as opposed to the reverse) KL divergence and review why it is equivalent to maximum likelihood training. This objective has the advantage that it is substantially less vulnerable to mode-collapse but has the disadvantage that it requires representative configurations sampled from the theory.
In many applications, this prevents this objective from being of any use since, if such configurations are available, we can directly measure physical observables on them and a flow is not necessary. However, we point out that there is an important exception to this: for thermodynamic observables, such as the free energy, it is still useful to train a flow. This is because these observables are typically obtained by integration through the parameter space of the theory and thus require a significant number of Markov chains along a discretized trajectory in the parameter space. By training a flow on samples generated at a single point in parameter space, we can completely avoid the need for these additional Markov chains. In this important scenario, it is thus sensible and, as we argue, advisable to use forward KL training for the flow to significantly reduce mode-collapse. Besides modifying the training procedure, we also propose to mitigate mode-collapse by combining two flow-based estimators for the free energy.
We then study the bias induced by mode-collapse theoretically.Specifically, we derive a bound on the bias of the estimator for physical observables.This allows us to propose a natural metric to quantify the degree of mode-collapse of the sampler.
The effectiveness of our proposed methods is then demonstrated on a two-dimensional φ⁴ scalar theory.

II. TRAINING A GENERATIVE MODEL
Normalizing flows [27][28][29] are a particular class of generative models giving access to an analytic form of the likelihood. While this work focuses on flows for concreteness, we stress that the theoretical arguments made in the following sections hold for any generative model allowing for exact likelihood estimation. For such models, a variational density q_θ, the sampler, parameterized by a set of weights θ, is optimized to approximate the target density of the lattice field theory,

p(φ) = e^{−S(φ)} / Z,

where Z is the partition function and S is the action of the theory.
During the training of a normalizing flow, an efficient transformation mapping a base density q_z into a non-trivial target is learned. In practice, the base distribution is chosen such that it allows for efficient sampling. Common choices for the base density are therefore normal or uniform distributions.
The flow uses a diffeomorphism f_θ between the base space Z and the configuration space X, hence

x = f_θ(z) with z ∼ q_z.

The diffeomorphism f_θ is a composition of bijective transformations f_θ^i referred to as coupling blocks. Each of these blocks satisfies the following requirements:
1. f_θ^i is a bijection,
2. both f_θ^i and its inverse are in C^∞,
3. the determinant of the Jacobian is efficient to evaluate.
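As a concrete illustration of these requirements, the following minimal sketch implements a single RealNVP-style affine coupling block in NumPy. The conditioner functions `s` and `t` are hypothetical stand-ins for learned neural networks; the block is exactly invertible regardless of their form.

```python
import numpy as np

def coupling_forward(z, s, t):
    """One RealNVP-style affine coupling block (sketch).
    The first half of z passes through unchanged and conditions an
    affine map applied to the second half; s and t never need inverting."""
    z1, z2 = np.split(z, 2)
    x2 = z2 * np.exp(s(z1)) + t(z1)      # elementwise affine map
    log_det = np.sum(s(z1))              # log|det J| is just a sum
    return np.concatenate([z1, x2]), log_det

def coupling_inverse(x, s, t):
    x1, x2 = np.split(x, 2)
    z2 = (x2 - t(x1)) * np.exp(-s(x1))
    return np.concatenate([x1, z2])

# toy conditioner networks (hypothetical stand-ins for learned nets)
s = lambda h: 0.5 * np.tanh(h)
t = lambda h: h ** 2

z = np.array([0.3, -1.2, 0.7, 0.1])
x, log_det = coupling_forward(z, s, t)
assert np.allclose(coupling_inverse(x, s, t), z)   # exact invertibility
```

Because `z1` passes through unchanged, inverting the block never requires inverting `s` or `t`, and the log-determinant of the Jacobian reduces to a sum of the scale outputs.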

FIG. 1: Estimation of the free energy density in the broken and symmetric phases using a reverse-KL trained flow (panel labels: Symmetric Phase, Broken Phase), following the approach proposed in [2] and [4]. The second-order phase transition is represented with a color gradient from red to green in the background. The free energy density is estimated using samples drawn from the flow and the target distribution respectively. The flow estimates (purple) are compared to the HMC baseline (pink). The experiments consider lattices Λ = 64 × 8 at fixed coupling λ = 0.022 for the φ⁴ theory as in [2]. In the broken phase (green) of the theory, e.g., κ ≥ 0.3, the target density has two modes with very little tunneling probability between them. Using a deep generative sampler, i.e., a flow, trained with the reverse KL in the broken regime leads to biased estimates. This problem lays the foundation of the study presented in this work.
The inverse transformation f_θ^{−1}(x) = z therefore always exists by construction.
Leveraging these properties, an analytic expression for the likelihood of the flow-based model reads

q_θ(x) = q_z( f_θ^{−1}(x) ) | det ∂f_θ^{−1}(x)/∂x |.

Different coupling blocks satisfying the requirements above have been proposed; these include Non-Linear Independent Component Estimation (NICE) [30], Real Non-Volume Preserving (RealNVP) [31], and Generative Flow (GLOW) [32]. We refer to [28,29] for an overview of the existing coupling blocks and further technical details.

FIG. 2: Comparison between a reverse-KL (purple) and forward-KL (orange) optimization approach: a sketch of normalizing flows trained with both KL objectives as described in section II A. Training with the reverse KL shows a mode-seeking behavior and is thus prone to mode-dropping. The forward KL instead has a mode-covering behavior which leads to larger support over the sampling space.
A. The Forward- and Reverse-KL Divergences

During training, the normalizing flow is optimized by density matching. It is common practice to minimize the so-called KL divergences to this end, although other types of generalized divergences can be used [33][34][35][36][37][38][39][40]. As we will discuss in this section and in section III, choosing an appropriate divergence is crucial to ensure successful training.
The so-called reverse-KL divergence reads

KL(q_θ ‖ p) = ∫ D[φ] q_θ(φ) ln( q_θ(φ) / p(φ) ),    (4)

where D[φ] represents the measure of the high-dimensional integral. It is worth stressing that the KL divergence is not symmetric, hence

KL(q_θ ‖ p) ≠ KL(p ‖ q_θ).    (5)

The right-hand side of eq. (5) is usually referred to as the forward KL, which can be written as an expectation value with respect to the target density p,

KL(p ‖ q_θ) = E_p[ ln( p(φ) / q_θ(φ) ) ].

These two choices for the divergence lead to different training procedures, as we will discuss in sections II A 1 and II A 2. We also note that the use of the reverse and forward KL is not mutually exclusive. In the context of quantum chemistry [41], for instance, a combination of the two is typically chosen.

1. Reverse-KL: training by self-sampling
The reverse KL divergence is the standard choice for training normalizing flows on the lattice. This is because lattice field theory comes with an action S which is known in closed form, in contrast to many other machine learning applications.
The reverse KL divergence (4) can be approximated by a Monte-Carlo estimate [2,24] as follows:

KL(q_θ ‖ p) = E_{q_θ}[ ln( q_θ(φ) / p(φ) ) ] ≈ (1/N) Σ_{i=1}^{N} [ ln q_θ(φ_i) + S(φ_i) ] + ln Z.

Here, the field configurations are sampled from the flow, i.e., φ_i ∼ q_θ, and thus the training relies on self-sampling.
In particular, the partition function Z contributes a shift term ln Z that is constant with respect to the parameters of the flow and can thus be ignored when optimizing by gradient descent.
Training using the reverse KL is therefore very efficient because it does not require samples from the target density p due to self-sampling. Unfortunately, this comes at a cost, as this objective is known to be prone to mode-collapse. The sketch on the left-hand side of fig. 2 shows that a reverse-KL-trained flow tends to focus its support on a subset of the modes when the target density is multimodal. This undesirable behavior strongly affects the reverse KL as the dropped modes are not probed in the self-sampling process.
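The mode-seeking pathology can be reproduced in a one-dimensional toy model. The sketch below (our illustration, not the paper's lattice setup) evaluates the self-sampling Monte-Carlo estimate of the reverse KL, up to the constant ln Z, for a Gaussian q_θ and a double-well "action": a q_θ sitting entirely on one well scores favorably because the dropped well is never sampled and hence never penalized.

```python
import numpy as np

rng = np.random.default_rng(0)

def action(phi):
    """Toy double-well action S(phi) = (phi^2 - 1)^2 (illustrative,
    not the paper's phi^4 lattice action)."""
    return (phi**2 - 1.0)**2

def reverse_kl_loss(mu, sigma, n=100_000):
    """Self-sampling Monte-Carlo estimate of KL(q||p) up to the
    constant ln Z: E_q[ln q(phi) + S(phi)]."""
    phi = rng.normal(mu, sigma, n)               # phi ~ q_theta
    log_q = -0.5*((phi - mu)/sigma)**2 - np.log(sigma*np.sqrt(2*np.pi))
    return np.mean(log_q + action(phi))

# A narrow q covering only ONE well is never penalized for the well
# it drops -- the mode-seeking pathology in miniature.
single_mode = reverse_kl_loss(1.0, 0.3)   # all mass on the +1 well
between     = reverse_kl_loss(0.0, 0.3)   # mass on the barrier
assert single_mode < between
```

The objective cannot distinguish a sampler covering one well from one covering both, since configurations in the dropped well never enter the expectation.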
This mode-seeking nature of the reverse KL represents a major drawback and limits the applicability of this framework when the physical density to be learned has more than one mode [3,4,42].

2. Forward-KL: training by maximum likelihood
The forward KL divergence can be written as an expectation value with respect to the target density p and can thus also be approximated by Monte-Carlo:

KL(p ‖ q_θ) ≈ −(1/N) Σ_{i=1}^{N} ln q_θ(φ_i) + const.

In contrast to the reverse KL divergence, the samples are here drawn from the density p of the theory, i.e., φ_i ∼ p. As can be seen from the equation, minimizing the forward KL corresponds to maximizing the likelihood of the model. In the machine learning literature, this training procedure is thus also known as maximum likelihood training. Indeed, it has already been used in the context of lattice field theory [3].
This training procedure has the advantage that it is mode-covering, since all modes of the physical target density p will necessarily be probed in training. It has, however, the disadvantage that it requires samples from the target p. In lattice applications, these are typically generated by a Monte-Carlo algorithm, such as HMC. However, if these configurations are available, one can directly measure physical observables on them, and there is therefore no need to train a flow in the first place.
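A toy counterpart to the reverse-KL example (our illustration, with a bimodal Gaussian mixture standing in for HMC configurations in the broken phase) shows why the maximum-likelihood objective is mode-covering: any q_θ that drops one of the modes is heavily penalized on the samples from that mode.

```python
import numpy as np

rng = np.random.default_rng(1)

# "target" samples: a symmetric bimodal mixture standing in for
# HMC configurations in the broken phase (illustrative toy data)
phi_p = np.concatenate([rng.normal(-1, 0.2, 5000),
                        rng.normal(+1, 0.2, 5000)])

def nll(mu, sigma, data):
    """Forward-KL objective up to a model-independent constant:
    the negative mean log-likelihood -E_p[ln q_theta]."""
    return np.mean(0.5*((data - mu)/sigma)**2
                   + np.log(sigma*np.sqrt(2*np.pi)))

# A single-mode fit covering one well is heavily penalized on the half
# of the data it drops; a broad mode-covering fit wins.
assert nll(0.0, 1.1, phi_p) < nll(1.0, 0.2, phi_p)
```

Since the expectation runs over the target samples, no mode of p can be dropped without paying an unbounded likelihood penalty.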
One may thus wonder if this training procedure is of any use. For thermodynamic observables, however, such as the free energy, one typically requires not only a single Markov chain for the target density p but a whole series of Markov chains along a discretized trajectory in the parameter space of the theory. As we will review in the next section, a flow allows us to completely avoid the need for these additional Markov chains. For the important class of thermodynamic observables, forward KL training is thus well-justified and, as we will show, advisable.

III. RELIABLE ESTIMATORS IN PRESENCE OF MODE-COLLAPSE
Combining deep generative models, e.g., normalizing flows, with neural importance sampling (NIS) has been shown to be a fruitful approach for estimating thermodynamic observables in lattice field theory [2,4], statistical mechanics [24], and chemistry [41,43]. This approach enables direct estimation of the free energy as well as other thermodynamic observables because flow-based sampling allows for estimating those observables at arbitrary points in the parameter space. Remarkably, this is in stark contrast to standard Markov-Chain Monte-Carlo methods, which instead require non-trivial integration in the parameter space [2]. More specifically, NIS allows the computation of a direct Monte-Carlo estimate of the partition function Z, which is crucial for many thermodynamic observables [2] such as the entropy and the free energy

F = −T ln Z.    (9)

In the following, we review two different estimators of the free energy, namely the p-estimator and the q-estimator. These allow estimation using samples drawn from the target and from the generative model respectively [4]. We stress that these estimators are applicable for any generative model that has a tractable likelihood, such as normalizing flows [28], autoregressive neural networks [44], and diffusion models [45].

A. Different Estimators for the Partition Function
Given a generative model, irrespective of whether it has been trained using the reverse or forward KL, the resulting sampler q_θ is an approximation of the target density p. Leveraging this, recent works proposed to estimate the partition function of a physical system [2,24] directly at a given point in parameter space. This approach samples lattice configurations from the generative model q_θ and estimates the partition function Z with a so-called q-estimator,

Ẑ_{q_θ} = (1/N) Σ_{i=1}^{N} w(φ_i),  φ_i ∼ q_θ,    (10)

with w(φ) = e^{−S(φ)} / q_θ(φ). Alternatively, when samples from the target p are available, e.g., from a thermalized Markov chain, one can estimate the inverse partition function with the so-called p-estimator,

Ẑ_p^{−1} = (1/N) Σ_{j=1}^{N} q_θ(φ_j) / e^{−S(φ_j)},  φ_j ∼ p.    (11)

Combining these results with eq. (9), one immediately derives corresponding estimators for the free energy,

F_q = −T ln Ẑ_{q_θ},    (12)
F_p = −T ln Ẑ_p.    (13)

Both the p-estimator and the q-estimator can be shown to be asymptotically consistent under the assumption that the supports of the flow q_θ and the target density p match, i.e., supp(q_θ) = supp(p) [24].
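A minimal sketch of the two estimators, using a one-dimensional toy theory with a known partition function in place of a lattice action (our illustration), so that both estimates can be checked against the exact value:

```python
import numpy as np

rng = np.random.default_rng(2)

# toy theory with known partition function: S = phi^2/2, Z = sqrt(2*pi)
S = lambda phi: 0.5 * phi**2
Z_true = np.sqrt(2*np.pi)

mu, sig = 0.0, 1.3                      # an imperfect sampler q_theta
log_q = lambda phi: -0.5*((phi - mu)/sig)**2 - np.log(sig*np.sqrt(2*np.pi))

# q-estimator: Z ~ (1/N) sum_i exp(-S(phi_i))/q(phi_i), phi_i ~ q
phi_q = rng.normal(mu, sig, 200_000)
Z_q = np.mean(np.exp(-S(phi_q) - log_q(phi_q)))

# p-estimator: 1/Z ~ (1/N) sum_j q(phi_j)/exp(-S(phi_j)), phi_j ~ p
phi_p = rng.normal(0.0, 1.0, 200_000)   # exact samples, standing in for HMC
Z_p = 1.0 / np.mean(np.exp(log_q(phi_p) + S(phi_p)))

assert abs(Z_q - Z_true) < 0.05 and abs(Z_p - Z_true) < 0.05
```

Here q_θ has full support, so both estimators converge to the true Z; the interesting regime, discussed next, is when the supports of q_θ and p do not match.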
More generally, we prove in appendix 2 that the estimators serve as upper and lower bounds of the free energy:

Theorem 1. The reverse free energy F_q and the forward free energy F_p provide an upper and a lower bound on the free energy F respectively, i.e.,

F_q ≥ F ≥ F_p.

Furthermore, if supp(q_θ) ⊇ supp(p), it follows that F_q = F; similarly, if supp(q_θ) ⊆ supp(p), then F_p = F.

In the presence of mode-collapse, the flow has smaller support than the target, i.e., supp(p) ⊋ supp(q_θ).
In this case, the q-estimator in eq. (12) may thus lead to (possibly strongly) biased results, since it cannot cover all the modes irrespective of the number of samples. On the other hand, the q-estimator has the advantage that it is typically more efficient to sample directly from the flow, while the p-estimator requires the (possibly costly) generation of configurations by a Markov chain. Nevertheless, it is advisable to estimate the free energy with both estimators if there is a risk of mode mismatch and to check that both lead to consistent results. The phenomenon of mode-collapse is a widely known issue in the field of density estimation [46][47][48][49]. In particular, when deploying generative models for physical systems this becomes of crucial importance, as neglecting a subset of the modes of a target density inevitably leads to highly biased estimates of physical quantities. Moreover, this may sometimes not even be detected unless appropriate estimators are used [4]. We stress that this problem is not restricted to lattice field theories [3,4] but is also found in other contexts, such as molecular systems [41,50,51]. Having an estimator which quantifies the amount of probability mass dropped by a variational ansatz is therefore highly desirable for more reliable and unbiased estimation of physical quantities.
When the trained model neglects some modes of the target density, hence missing full support over the target domain, estimates of physical observables may be biased. A desirable property of our framework is the ability to detect such bias by providing reliable bounds on the error when the model is mode-dropping. When no further assumptions are made on the support of q_θ, the expected value of the importance weights w(φ) = p(φ)/q_θ(φ) under the sampler reads

⟨w⟩ = E_{q_θ}[ p(φ) / q_θ(φ) ] = ∫_{supp(q_θ)} D[φ] p(φ) ≤ 1.    (14)

This expectation value thus measures the degree to which the support of the target density p is covered by the sampler q_θ. As a result, ⟨w⟩ ∈ [0, 1] provides us with a natural quantity to measure the sampler's ability to probe the entire support of the target density p.
We will now derive an estimator for this expectation value. To this end, we rewrite the above expression as

⟨w⟩ = (1/Z) E_{q_θ}[ e^{−S(φ)} / q_θ(φ) ].    (15)

As shown in the last section, the partition function Z can be approximated by the p-estimator (11) when samples from the target density are available. As a result, a Monte-Carlo estimate of eq. (14) is given by

ŵ = [ (1/N) Σ_{i=1}^{N} e^{−S(φ_i)} / q_θ(φ_i) ] · [ (1/M) Σ_{j=1}^{M} q_θ(φ_j) / e^{−S(φ_j)} ],    (16)

where φ_i ∼ q_θ and φ_j ∼ p are sampled from the flow and the target density p respectively.
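A sketch of the estimator on a bimodal toy model (our illustration): a q_θ that covers only one of two equal-weight modes yields ŵ ≈ 0.5, i.e., half of the probability mass is dropped.

```python
import numpy as np

rng = np.random.default_rng(3)
sig = 0.2

# bimodal "theory": exp(-S) is a sum of two Gaussian bumps at +-1
def boltzmann(phi):
    return (np.exp(-0.5*((phi - 1)/sig)**2)
            + np.exp(-0.5*((phi + 1)/sig)**2))

# mode-collapsed sampler: q_theta covers only the +1 mode
def q(phi):
    return np.exp(-0.5*((phi - 1)/sig)**2) / (sig*np.sqrt(2*np.pi))

# samples from q and from p (mixture sampling stands in for HMC)
phi_q = rng.normal(1.0, sig, 100_000)
phi_p = rng.choice([-1.0, 1.0], 100_000) + rng.normal(0, sig, 100_000)

Z_q_hat = np.mean(boltzmann(phi_q) / q(phi_q))          # q-estimator of Z
Z_p_hat = 1.0 / np.mean(q(phi_p) / boltzmann(phi_p))    # p-estimator of Z

w_hat = Z_q_hat / Z_p_hat   # mode-dropping estimator: fraction of p covered
assert abs(w_hat - 0.5) < 0.02   # half the probability mass is dropped
```

Note that ŵ is simply the ratio of the q- and p-estimators of Z, so it comes essentially for free once both free energy estimates have been computed.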

B. Bounding the Bias of Physical Observables
Following [2], given a physical observable O, we define the estimator

Ô = Σ_{i=1}^{N} w(φ_i) O(φ_i) / Σ_{i=1}^{N} w(φ_i),  φ_i ∼ q_θ,

with importance weight w(φ) = p(φ)/q_θ(φ). This estimator is not necessarily asymptotically unbiased if the sampler q_θ is affected by mode-collapse, i.e., supp(p) ⊋ supp(q_θ). In the asymptotic limit, the expected value of the estimator reads

Ō = E_{q_θ}[ w(φ) O(φ) ] / E_{q_θ}[ w(φ) ] = (1/⟨w⟩) ∫_{supp(q_θ)} D[φ] p(φ) O(φ).

This may not agree with the true value of the observable,

O* = E_p[ O(φ) ] = ∫_{supp(p)} D[φ] p(φ) O(φ).

Indeed, in the presence of mode-dropping, supp(p) ⊋ supp(q_θ), the bias reads

Ō − O* = (1/⟨w⟩) ∫_{supp(q_θ)} D[φ] p(φ) O(φ) − ∫_{supp(p)} D[φ] p(φ) O(φ),

i.e., the bias arises due to the insufficient support of the sampler.
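The bias can be made concrete in the same kind of bimodal toy model (our illustration): with a mode-collapsed q_θ, the self-normalized importance-sampling estimate of ⟨φ⟩ converges to the mean over the covered mode rather than the true, Z₂-symmetric value of zero, no matter how many samples are drawn.

```python
import numpy as np

rng = np.random.default_rng(6)
sig = 0.2

# bimodal "theory" with wells at +-1; observable O(phi) = phi
boltz = lambda phi: (np.exp(-0.5*((phi - 1)/sig)**2)
                     + np.exp(-0.5*((phi + 1)/sig)**2))
# mode-collapsed sampler covering only the +1 well
q = lambda phi: np.exp(-0.5*((phi - 1)/sig)**2) / (sig*np.sqrt(2*np.pi))

phi = rng.normal(1.0, sig, 100_000)      # phi ~ q_theta (collapsed)
w = boltz(phi) / q(phi)                  # importance weights (up to Z)
O_hat = np.sum(w * phi) / np.sum(w)      # self-normalized NIS estimate

# the true <phi> vanishes by Z2 symmetry, but the collapsed sampler
# never sees the -1 well, so the estimate is biased toward +1
assert abs(O_hat - 1.0) < 0.02
```

Reweighting cannot repair the bias: configurations outside supp(q_θ) are never proposed, so their contribution to the expectation is simply absent.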

IV. THE MODE-DROPPING ESTIMATOR
In the following, we aim to derive a bound on this bias. In the process, we will also obtain a natural measure for the degree of mode-collapse. To this end, we foliate the sampling space by disjoint sets ∆_n, labeled by n ∈ ℕ and depending on a constant D_∨ ∈ ℝ_{≥0}; we refer to fig. 3 for a visual illustration of the foliation. For fixed n ∈ ℕ, we can define weights α_n which measure the probability mass of p inside ∆_n that is not covered by the sampler q_θ. Leveraging these definitions, we derive a bound on the bias in appendix 4, which is summarized in the following theorem.
Theorem 2. Let the action S of the theory be polynomially bounded, i.e.,

C_∨ ‖φ‖^{α_∨} − D_∨ ≤ S(φ) ≤ C_∧ ‖φ‖^{α_∧} + D_∧

for some C_{∨,∧}, D_{∨,∧}, α_{∨,∧} ∈ ℝ_{≥0}, and let the observable O be polynomially bounded, i.e.,

|O(φ)| ≤ c ‖φ‖^α + d

for some c, α, d ∈ ℝ_{≥0}. The bias, i.e., the difference between the expectation value Ō of the estimator Ô and the true value O*, then satisfies a bound given by a weighted sum over the α_n.

The weighting of each summand depends on the observable O of interest. We note that many physical observables are simple powers of the fields, i.e., O(φ) = ‖φ‖^k for k ∈ ℕ. It can be shown that the foliation (21), along with the polynomial bound on the action S, implies a bound on ‖φ‖ for φ ∈ ∆_n in terms of u_n = (n/C_∨)^{1/α_∨}; we refer to appendix 3 for more details. For such observables, the bias can thus be bounded explicitly.

FIG. 4: Spontaneous symmetry breaking for the φ⁴ theory as a function of the hopping parameter, using the action in eq. (29). The top and bottom panels show a histogram of the magnetization, self-normalized by its absolute value, in linear and log scale respectively. The self-normalization centers the modes around +1 and −1. As the hopping parameter κ increases, the number of configurations between the modes decreases.
The theorem also naturally relates to the quantity ⟨w⟩ introduced in the last section, which quantifies the degree of mode-collapse. In order to provide a single number for the degree of mode-collapse of the sampler, it is natural to choose a uniform, i.e., observable-agnostic, weighting. It then follows from the definition of the α_n, see (22), that this weighting measures the mismatch in support between the sampler and the target density and is thus directly related to the mode-dropping estimator ŵ derived in the last section.

V. NUMERICAL EXPERIMENTS
We evaluate our proposed methods to detect and mitigate mode-collapse using the two-dimensional scalar φ⁴ theory with action

S(φ) = Σ_{x∈Λ} [ −2κ Σ_{μ=1}^{2} φ_x φ_{x+μ̂} + (1 − 2λ) φ_x² + λ φ_x⁴ ],    (29)

where λ is the bare coupling and κ is the hopping parameter. We refer to [2] for more details on this hopping parameterization of the action. Throughout all our experiments, we keep the bare coupling fixed at λ = 0.022 and vary κ such that the theory crosses the phase transition due to the spontaneous breaking of its Z₂ symmetry φ → −φ. As the hopping parameter κ increases, spontaneous magnetization is observed. This is illustrated in fig. 4, for which the hopping parameter takes values through the critical region around κ_c ≈ 0.275. The curves show the density (top) and log density (bottom) of the normalized magnetization, with different colors referring to different values of the hopping parameter κ. Spontaneous symmetry breaking is observed as the distribution of the magnetization changes from a wider single-mode density to a bi-modal density with a suppressed tunneling probability between the two modes. This suppression is accentuated as the value of κ increases.

FIG. 5: Histogram of the magnetization (left) and free energy estimates (right) for two flow-based models trained with the forward and reverse KL at κ = 0.5 and Λ = 64 × 8. The left-hand side shows a histogram of the magnetization for configurations sampled from a forward-KL trained normalizing flow (orange), a reverse-KL flow (purple), and an overrelaxed HMC (pink). The models were used to sample configurations on lattices of shape Λ = 64 × 8 at κ = 0.5, i.e., well in the broken phase. Dashed black lines mark the expected values of the absolute magnetization (and its negative) from HMC. We observe good agreement between the HMC and forward-flow histograms. The right-hand side shows estimates of the free energy density using the estimators introduced in section III for both flow models, compared to the reference HMC baseline (solid black line, with the pink band showing the standard deviation), following the method discussed in [2] and appendix 5. The q- and p-estimators for the free energy are shown with different markers. Error bars, based on the standard deviation, for the flow-based estimates of the free energy density are not visible as they are several orders of magnitude smaller. The mismatch between the reverse-KL flow and the HMC histograms, shown in the left-hand plot, is reflected in highly biased estimates (dark blue markers).
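For reference, a NumPy sketch of a hopping-parameterized φ⁴ action; the normalization conventions are those we assume for eq. (29) and may differ from [2] in detail:

```python
import numpy as np

def phi4_action(phi, kappa, lam):
    """Hopping-parameterized phi^4 lattice action (assumed form):
    S = sum_x [ -2*kappa*sum_mu phi_x phi_{x+mu}
                + (1 - 2*lam)*phi_x^2 + lam*phi_x^4 ],
    with periodic boundary conditions via np.roll."""
    hop = sum(phi * np.roll(phi, -1, axis=d) for d in range(phi.ndim))
    return np.sum(-2*kappa*hop + (1 - 2*lam)*phi**2 + lam*phi**4)

phi = np.random.default_rng(4).normal(size=(64, 8))
# Z2 symmetry of the action: S(phi) == S(-phi)
assert np.isclose(phi4_action(phi, 0.3, 0.022),
                  phi4_action(-phi, 0.3, 0.022))
```

The Z₂ symmetry φ → −φ is manifest in the code: every term is even in φ, which is exactly why the broken phase exhibits two degenerate modes.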
A. Free Energy Estimators.
Our first numerical experiment analyzes the performance of two normalizing flows trained with the two objectives described in section II A. We refer to these flows as the forward-KL flow and the reverse-KL flow, depending on whether they were trained with maximum likelihood or by self-sampling respectively. We train at a hopping parameter κ = 0.5 such that the theory is in its broken phase, see fig. 5. For maximum likelihood training, we use 50M samples generated by an overrelaxed HMC. Following [2], we choose an architecture for the normalizing flows such that they are manifestly invariant under the Z₂ symmetry.
The reference estimates for the true free energy were obtained via HMC simulations. Similarly to the approach followed in [2], such estimates are obtained by discretizing the hopping parameter space so that free energy differences can be estimated via HMC along the trajectory. These contributions are then summed by integrating along the trajectory up to the desired point at which the free energy is to be estimated. Further technical details on how the deep generative models were trained and how the HMC reference values were obtained can be found in appendix 5.
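The HMC baseline procedure can be sketched on a toy theory with a known free energy (our illustration): measure dF/dκ = ⟨∂S/∂κ⟩ at each point of a discretized κ-trajectory and integrate with the trapezoidal rule.

```python
import numpy as np

# Toy theory S(phi; kappa) = (1 + kappa) * phi^2 / 2 with known free
# energy F(kappa) = -ln Z = 0.5 * ln((1 + kappa) / (2*pi)); it stands
# in for the kappa-dependence of the phi^4 theory.
def dF_dkappa(kappa, n=200_000, rng=np.random.default_rng(5)):
    """dF/dkappa = <dS/dkappa>_p, measured here on exact samples
    (in the real workflow: one Markov chain per kappa point)."""
    phi = rng.normal(0.0, 1.0 / np.sqrt(1.0 + kappa), n)
    return np.mean(0.5 * phi**2)

kappas = np.linspace(0.0, 1.0, 21)
derivs = [dF_dkappa(k) for k in kappas]

# trapezoidal integration of the free energy difference along the path
delta_F = sum(0.5 * (derivs[i] + derivs[i + 1]) * (kappas[i + 1] - kappas[i])
              for i in range(len(kappas) - 1))

assert abs(delta_F - 0.5 * np.log(2.0)) < 1e-2
```

The cost of one chain per discretization point is precisely what the flow-based q- and p-estimators avoid, since they estimate Z directly at a single point in parameter space.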
As can be seen on the left-hand side of fig. 5, the forward-KL flow (orange) very closely reproduces the reference distribution by HMC (pink) while the reverse-KL flow (blue) is substantially off.
On the right-hand side of fig. 5, we estimate the free energy density of the system using both flows. We use the same color scheme as on the left-hand side and measure the free energy using both the p-estimator, eq. (13), and the q-estimator, eq. (12). Our numerical results indeed agree with the theoretical prediction of theorem 1. Specifically, for the model trained with maximum likelihood, both estimators lead to predictions compatible with the HMC estimate. This is consistent with the left-hand side of the plot, which suggests that no mode-dropping took place for this model. For the reverse-KL flow, however, such an agreement cannot be expected, as the left-hand side of fig. 5 suggests a mismatch in support. Indeed, we find that the q-estimator substantially overestimates the true value F and the p-estimator substantially underestimates it, i.e., F_q ≥ F ≥ F_p, as predicted by theorem 1.

FIG. 6: Analysis of the free energy estimates for forward- and reverse-KL flows for a 64 × 8 lattice and different hopping parameters κ. Colors denote flow models trained with different objectives, namely the forward KL (orange) and reverse KL (purple). Markers refer to the two different estimators of the free energy introduced in eq. (12) and eq. (13); specifically, plus and square markers are used respectively. Every point in the inset shows that flow and HMC estimates are compatible, i.e., the gap is within the statistical uncertainty. The lower plot relates the results from above to the mode-dropping estimator eq. (16). This demonstrates that when the flow is a good approximation of p, the estimator is close to one. When mode-collapse happens, instead, the estimator of ⟨w⟩ quickly decays to zero.
This experiment thus illustrates that: a) a mode-covering objective such as the forward KL should be preferred when the target density is multimodal, and b) when the supports do not match, the q- and p-estimators of the free energy from section III give an upper and a lower bound respectively.
We repeated this analysis for a number of values of the hopping parameter κ. The results are summarized in fig. 6. We evaluate the gap between the neural importance sampling (NIS) estimate F and the HMC reference, normalized by the total standard error. Namely, if the normalized gap is within the range [−1, +1], both estimators are compatible (see inset in the top plot of fig. 6). Dashed curves connect q-estimates (12), while solid curves connect p-estimates (13), of the free energy at different values of κ. Again, the results obtained with the reverse-KL and forward-KL flows are shown in blue and orange respectively. The inset shows close agreement of both estimators and both flow models for κ ∈ {0.2, 0.3}. However, deep in the broken phase, e.g., κ ≥ 0.4, the two modes of the distribution lie further apart, resulting in a failure of the mode-seeking objective, i.e., the reverse KL, to properly capture the target density. As a result, the probability mass transport induced by the normalizing flow fails to reproduce the correct target distribution p, leading to a larger gap between the p- and q-estimators. This phenomenon is consistent with the behavior of the magnetization in the left-hand plot of fig. 5. When using the forward-KL trained flow instead, the support of the sampler closely matches the support of the target, making both free energy estimators asymptotically unbiased and compatible with the HMC reference even at the higher values of the hopping parameter κ. This effect is clearly shown in the inset, where there is good agreement between both estimators of the free energy. This observation suggests that the mode-covering nature of the forward KL is crucial to ensure that the flow leads to asymptotically unbiased estimates of physical observables.

FIG. 7: Mode-dropping estimator eq. (16) for lattices Λ = 16 × 8 and Λ = 64 × 8 respectively. Unsurprisingly, we observe that mode-collapse gets more severe as the lattice size increases. This is reflected in a stronger decay toward zero of the estimator for larger volumes.
In fig. 6, it is also shown that our proposed mode-dropping estimator (14) correlates well with the observed gap in the free energy estimates. Lastly, we use our estimator ŵ to evaluate the mode-dropping of forward and reverse flow models trained at several κ values and different lattice sizes, as shown in fig. 7. The left plot refers to a lattice of size Λ = 16 × 8, while the right plot refers to Λ = 64 × 8. These results demonstrate that the quality of the sampler very quickly deteriorates in the broken phase due to mode-collapse for the model trained by self-sampling. This is not the case for models trained with the forward KL. Indeed, as shown in fig. 7, these models scale significantly better with the volume of the system. It would be interesting to study whether this superior scaling also holds in the unbroken phase. Still, as it is unrelated to the question of mode-collapse, we leave such an analysis to future work.

VI. OUTLOOK AND SUMMARY
Mode-collapse constitutes a fundamental obstruction for flow-based sampling on the lattice. In many ways, it represents an analog of the tunneling problem in local MCMC algorithms. In this work, we have studied this important limitation of flow-based sampling in detail. We argue that in the important case of thermodynamic observables, practical and theoretically grounded mitigation strategies are available. Specifically, the flow can be trained using the forward KL divergence, and the free energy can be evaluated with two estimators that bound the true value from above and below. Furthermore, we have analyzed mode-mismatch theoretically and derived a bound on its induced bias as well as a quantitative measure for its severity. Normalizing flows are currently still limited to toy models. Encouragingly, we also observed, as a side-product of our analysis, that the forward KL objective leads to better scaling in the system size. This observation may be worthwhile to study further in future work.

APPENDIX 1. Forward-KL training
Training a normalizing flow with the forward KL in the context of lattice field theory requires pre-generated samples at a given point in parameter space. Before training a flow model, one should instantiate a thermalized Markov chain at a fixed value of the coupling parameters and generate a sufficient number of Monte-Carlo configurations, which are then used to train the flow. Pseudo-code for this approach is presented in algorithm 1. We note that this approach may not always be practically feasible. For example, the number of pre-generated configurations needed to train a flow to an acceptable accuracy increases as the size of the lattice grows. For instance, training a flow for a 64 × 8 lattice for the φ⁴ field theory in the broken phase already requires more than fifty million samples. This limits the practical deployment of forward-KL training schemes at larger scales. Moreover, another limitation of this approach is that generating samples with HMC may not always be possible. Indeed, in the proximity of a phase transition, long-range autocorrelations prevent the generation of a sufficiently large number of uncorrelated samples in a reasonable time. One would therefore need to be very careful in generating a suitable dataset of HMC configurations to avoid incorporating additional unwanted bias when training the flow.
Algorithm 1: To train an NF with the forward KL in the context of lattice field theory, we need to generate training configurations from a thermalized HMC chain. This HMC pre-sampling process is made more efficient by running C_max independent chains in parallel, so that the runtime to sample the entire dataset Φ is constant in the total number of samples B. The total number of samples is therefore B = n C_max. The sampling is done between line 2 and line 5 in the algorithm below. Once the dataset is sampled and stored on disk, one trains for up to T_max iterations. Per iteration, one draws batches of m configurations from Φ in line 7 and uses them to evaluate the expectation value of the log-density of q_θ and to compute the gradient of the forward-KL objective in line 8 and line 9, respectively. The model weights are then updated, and the learned bijection f_θ along with the variational density q_θ are returned by the algorithm at the end of its training steps.

Input:
• prior density, e.g., q_z ∼ N(0, I)
• parametric model with parameters θ
• parametric action S(φ, κ, λ) with fixed coupling parameters λ and κ
• empty tensor for storing a batch of B configurations Φ ∈ R^(B×N_S×N_T)
Result:
• learned bijective transformation f_θ along with the variational density q_θ

2. Free-energy bounds

Theorem. The reverse free energy F_q and the forward free energy F_p provide an upper and a lower bound on the free energy F, respectively, i.e.

F_p ≤ F ≤ F_q.
Furthermore, if supp(q_θ) ⊇ supp(p) it follows that F_q = F, and similarly if supp(q_θ) ⊆ supp(p) it follows that F_p = F.

Proof. From the definition of the free energy F = −T ln Z, we first note that F_q ≥ F is equivalent to Z_{q_θ} ≤ Z. Using the fact that supp(e^{−S(φ)}) = supp(p), we obtain

Z_{q_θ} = ∫_{supp(q_θ)} Dφ e^{−S(φ)} ≤ ∫ Dφ e^{−S(φ)} = Z,

where the last inequality holds because e^{−S(φ)} ≥ 0. Thus, we conclude F_q ≥ F, with the corollary that supp(q_θ) ⊇ supp(p) implies equality, F_q = F. Similarly, the analogous argument with the roles of p and q_θ exchanged shows F_p ≤ F in general, and F_p = F given supp(q_θ) ⊆ supp(p). Hence, by combining the inequalities we can conclude F_p ≤ F ≤ F_q.
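The two-sided bound can be checked numerically in a one-dimensional toy model (our illustrative assumption, not the lattice theory): take S(φ) = φ²/2 + ½ ln 2π so that Z = 1 and F = 0 (setting T = 1), and a model q that is uniform on [0, 2], so that supp(q) is a strict subset of supp(p). The reverse importance-sampling estimator then converges to −ln ∫_{supp(q)} e^{−S} > F, while the forward estimator recovers F exactly since supp(q) ⊆ supp(p):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1_000_000

def S(phi):
    # Action chosen so that Z = ∫ e^{-S} dφ = 1, hence F = -ln Z = 0 (T = 1).
    return 0.5 * phi**2 + 0.5 * np.log(2 * np.pi)

# Model density q: uniform on [0, 2]  ->  supp(q) ⊊ supp(p) = R.
def q_pdf(phi):
    return np.where((phi >= 0) & (phi <= 2), 0.5, 0.0)

# Reverse estimator (samples from q): converges to -ln ∫_{supp q} e^{-S} ≥ F.
phi_q = rng.uniform(0.0, 2.0, size=N)
F_q = -np.log(np.mean(np.exp(-S(phi_q)) / q_pdf(phi_q)))

# Forward estimator (samples from p = N(0, 1)): converges to -ln Z = F
# because supp(q) ⊆ supp(p) and q is normalized.
phi_p = rng.normal(size=N)
F_p = np.log(np.mean(q_pdf(phi_p) * np.exp(S(phi_p))))
```

Here F_p ≈ 0 = F while F_q ≈ −ln(Φ(2) − Φ(0)) ≈ 0.74 > F, in line with the theorem: the reverse estimator is blind to the probability mass that q never visits.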

3. Bound on the configuration norm
Let us assume S(φ) to be polynomially bounded with non-negative coefficients C_∨, C_∧, D_∨, D_∧, α_∨, α_∧ ∈ R_≥0. The left-hand and right-hand sides represent the lower and the upper bounds on the action S. One needs to find appropriate coefficients C_{∨,∧}, α_{∨,∧}, D_{∨,∧} such that the inequalities are satisfied. We now perform a foliation of the sampling space, which can be seen as a redistribution of the lattice configurations φ into infinitely many buckets labeled by the index n. Combining eq. (30) and eq. (31), one can derive a condition on the norm of φ [52]. For a configuration φ ∈ ∆_n, this implies a lower bound on the norm for n > 0. On the other side, it follows that an upper bound holds for C_∧ > 0 and n > 0. Combining eq. (33) and eq. (35), we obtain two-sided bounds on the norm of the lattice configuration. In particular, this can be shown to imply a bound on the volume of ∆_n, making the volume finite, where we used the volume of the N-ball.

4. Proof of Theorem 2
We now leverage the result from appendix 3 to derive a bound on the bias of a general observable O when supp(p) ⊋ supp(q_θ), in which case the importance-sampling estimator is in principle not asymptotically unbiased.
Theorem. Let the action S of the theory and the observable O be polynomially bounded, i.e.,
for c, α, d ∈ R_≥0. Then, the bias between the estimated observable Ô and the true value O* is bounded as follows, where Ō = E_{φ∼q_θ}[Ô(φ)].
Proof. Let us assume the generic observable O to be polynomially bounded. When supp(q_θ) ⊊ supp(p), the bias from appendix 4 can be bounded in terms of coefficients α_n defined per bucket, i.e., such that the corresponding sum over n ∈ N holds. One therefore concludes that the bias is bounded by a series over the buckets. In order to obtain convergence of this series, we observe that polynomial boundedness of the observable implies that |max_{φ∈∆_n} O(φ)| grows polynomially in n. Similarly, from eq. (46), it follows that α_n decays exponentially in n. Thus, |max_{φ∈∆_n} O(φ)| · α_n decays exponentially in n. This implies convergence of the series on the right-hand side of eq. (48).
It is important to note that each α n is weighted by the maximum of the observable on the corresponding volume ∆ n which makes the bias inherently dependent on the observable while the α n coefficients are universal and represent the amount of mode dropping per bucket.
As an explicit example, let us consider S : R → R, S(φ) = φ², and the observable O(φ) = φ³. Since the Boltzmann weight e^{−φ²} is even while φ³ is odd, the true value of the observable is ⟨φ³⟩ = 0. Assume now a mode-dropping model q whose support covers only part of the real line. For the definition of the ∆_n, we can choose D_∨ = 1 and thus ∆_n = {φ : √(n−1) ≤ |φ| ≤ √n}, and obtain the corresponding α_n. Hence, the bias is within the bound given by the theorem.
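The bias in this example is easy to see numerically. The sketch below assumes, purely as an illustration (the precise form of q is our assumption), that the mode-dropping model keeps only the φ ≥ 0 half of p ∝ e^{−φ²}. A self-normalized importance-sampling estimate of ⟨φ³⟩ then converges to the half-space average E[|φ|³] = 1/√π ≈ 0.564 instead of the true value 0:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 1_000_000

def S(phi):
    return phi**2  # target p(φ) ∝ e^{-φ²}, so <φ³> = 0 by symmetry

# Assumed mode-dropping model q: the φ >= 0 half of p (a half-normal).
phi = np.abs(rng.normal(scale=np.sqrt(0.5), size=N))
q_density = 2.0 * np.exp(-phi**2) / np.sqrt(np.pi)

# Self-normalized importance sampling with weights w = e^{-S} / q.
w = np.exp(-S(phi)) / q_density
O_hat = np.sum(w * phi**3) / np.sum(w)
```

Since q is proportional to p on its own support, the importance weights are constant and cannot repair the missing half: the estimator is consistent only for the restricted distribution, exactly the mechanism the bound quantifies.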

5. Details on the Numerical Experiments
In the following, we summarize the details and setup used to perform the training of both forward- and reverse-KL normalizing flows, as well as to estimate the HMC reference values. For our experiments, we focused on the action S(φ) from eq. (29) as a function of κ, while keeping the coupling λ = 0.022 fixed throughout the analysis.

a. HMC sampling
For estimating the HMC reference values of the free energy density reported in section V, we followed the same approach as in [2]. The idea is to discretize the trajectory in κ-space (hopping parameter) into a sequence of finite steps for which free energy differences can be calculated by running an HMC. The target free energy at an arbitrary point κ* is then obtained by summing up all the free energy differences from κ = 0 to κ = κ*. We note that the larger the target κ*, the more steps one needs to discretize the trajectory up to the target point in parameter space. It follows that the uncertainty of the estimates also grows with κ*, as more terms are combined to obtain the free energy at the desired κ*. Specifically, in our experiments, we chose a regular step-size between two subsequent κ values in the trajectory of ∆κ = 0.01. This step-size is used to discretize the trajectory starting from κ = 0.0 all the way up to the target. For instance, measuring the free energy density at κ* = 0.3 would therefore require thirty steps, hence 30 independent HMC chains. Each of these chains is initialized around the vacuum expectation value (vev), performs an overrelaxation step every 10 updates, and runs a total of 10k thermalization steps, i.e., discarded configuration updates, followed by 500k sampling steps. The configurations from the equilibrium distribution are used to estimate the free energy difference at a single given point of the trajectory. The total number of HMC samples needed to estimate the free energy at an arbitrary point thus depends on κ*. For instance, referring to the previous example, 30 chains with 500k steps each add up to 15M HMC samples.
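Schematically, the accumulation over the κ-trajectory looks as follows; `estimate_dF` is a hypothetical placeholder for the per-step HMC measurement, and the returned numbers are made up for illustration:

```python
import numpy as np

dkappa = 0.01
kappa_target = 0.3
n_steps = int(round(kappa_target / dkappa))  # 30 independent HMC chains

# Placeholder for the per-step measurement: returns (dF, sigma) for the
# free-energy difference between kappa and kappa + dkappa.
def estimate_dF(kappa):
    return -0.1 * dkappa, 1e-4  # illustrative values only

dFs, sigmas = zip(*(estimate_dF(n * dkappa) for n in range(n_steps)))

F_target = sum(dFs)                            # F(κ*) = F(0) + Σ_i ΔF_i
sigma_F = np.sqrt(np.sum(np.square(sigmas)))   # independent-chain errors
                                               # add in quadrature
```

This makes explicit why the uncertainty grows with κ*: each additional step along the trajectory contributes one more variance term to the quadrature sum.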

b. Reverse-KL Flow
To train the reverse-KL normalizing flows, we followed the same strategy presented in [2] with the same hyperparameter setup. We used a batch size of 8k samples and updated the learning rate according to the ReduceLROnPlateau scheduler of PyTorch, with an initial learning rate of 5 × 10⁻⁴ and a patience of 3k steps. The flows have the same number of coupling blocks and the same type of checkerboard partitioning discussed in [2]. Models were trained for 700k steps in total, and the last saved checkpoint is used for sampling. Every reverse-KL model was trained in parallel on two GPUs, either P100 or A100 NVIDIA devices. Depending on the lattice volume and the model type, the training took up to 50 hrs of wall time.
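For readers unfamiliar with the scheduler, the plateau logic behind this schedule can be sketched in a few lines. This is a simplified re-implementation of the idea, not PyTorch's actual ReduceLROnPlateau class (whose bookkeeping additionally includes thresholds and cooldowns):

```python
class PlateauScheduler:
    """Decay the learning rate when the monitored loss stops improving."""

    def __init__(self, lr=5e-4, factor=0.1, patience=3000):
        self.lr = lr
        self.factor = factor
        self.patience = patience      # non-improving steps tolerated
        self.best = float("inf")
        self.bad_steps = 0

    def step(self, loss):
        if loss < self.best:
            self.best = loss
            self.bad_steps = 0
        else:
            self.bad_steps += 1
            if self.bad_steps > self.patience:
                self.lr *= self.factor  # decay on plateau
                self.bad_steps = 0
        return self.lr
```

With the settings above (initial rate 5 × 10⁻⁴, patience 3k), the rate is reduced only after 3000 consecutive training steps without a new best loss.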

c. Forward-KL Flow
Training a forward-KL flow requires a different procedure, which was outlined in appendix 1. For every flow model, we used 50M pre-sampled HMC configurations as input data. These were sampled in batches of 100 independent HMC chains, each of which had 10k equilibration (discarded) and 500k sampling (stored) steps. The stored configurations from each chain in the batch were concatenated to generate the full training set.
At the training stage, the 50M configurations are loaded in batches of 8k samples per iteration (training step). When the entire dataset has been processed once, the full set of configurations is reshuffled and reused (as is standard practice in deep learning) until the desired number of training iterations is reached. Again, forward-KL models were trained for 700k steps in parallel on two GPUs, either P100 or A100 NVIDIA devices. Depending on the lattice volume and the model type, the training took up to 55 hrs of wall time.
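The reshuffle-and-reuse loop can be sketched as follows; this is a schematic stand-in for the actual data pipeline, with names and sizes chosen for illustration (the paper's setup uses 50M configurations with 8k-sample batches):

```python
import numpy as np

def batch_iterator(dataset, batch_size, num_iterations, seed=0):
    """Yield minibatches; reshuffle the whole dataset after each full pass."""
    rng = np.random.default_rng(seed)
    n = len(dataset)
    order = rng.permutation(n)
    pos = 0
    for _ in range(num_iterations):
        if pos + batch_size > n:       # epoch finished: reshuffle, restart
            order = rng.permutation(n)
            pos = 0
        idx = order[pos:pos + batch_size]
        pos += batch_size
        yield dataset[idx]

# Toy-scale usage: 100 configurations, batches of 8, 30 training iterations.
configs = np.arange(100)
batches = list(batch_iterator(configs, batch_size=8, num_iterations=30))
```

Within one epoch the batches are disjoint, so every stored configuration is seen once before any is repeated, which is what the reshuffling scheme guarantees.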

d. Flow sampling
For sampling configurations from both forward- and reverse-KL normalizing flows, we proceed as follows. For a fully fair comparison with HMC, one would need to sample as many configurations as those needed to integrate the trajectory in hopping-parameter space, as discussed in appendix 5 a. However, for our flow estimates, we took only 1M configurations and used the estimators for mean and variance introduced in [2,24] and proposed in section III A. Though 1M is in general a lower bound on the total number of configurations used to compute the HMC estimates, we empirically observed that this was sufficient to obtain estimates with errors several orders of magnitude smaller than those of HMC. We therefore took this as a sufficient number of samples for comparing the two sampling approaches.

FIG. 3: Foliation - (a) visualization of the foliation according to eq. (21), showing ∆_1, ∆_2, and ∆_3 explicitly. (b) Field configurations are distributed into buckets. The labels n_b and b_w represent the bucket index and the width of each bucket, respectively.

FIG. 7: Mode-dropping evaluation as a function of the hopping parameter κ for fixed λ and two different lattice sizes - mode-dropping estimator for different values of the hopping parameter and different lattice sizes. For each setup, two normalizing flows with the exact same architecture were trained with reverse- and forward-KL objectives as described in section II A. The left- and right-hand plots show the mode-dropping estimator of eq. (16) for lattices Λ = 16 × 8 and Λ = 64 × 8, respectively. Unsurprisingly, we observe that mode-collapse gets more severe as the lattice size increases. This is reflected in a stronger decay toward zero of the estimator for larger volumes.