Efficient Modelling of Trivializing Maps for Lattice $\phi^4$ Theory Using Normalizing Flows: A First Look at Scalability

General-purpose Markov Chain Monte Carlo sampling algorithms suffer from a dramatic reduction in efficiency as the system being studied is driven towards a critical point. Recently, a series of seminal studies suggested that normalizing flows - a class of deep generative models - can form the basis of a sampling strategy that does not suffer from this 'critical slowing down'. The central idea is to use machine learning techniques to build (approximate) trivializing maps, i.e. field transformations that map the theory of interest into a 'simpler' theory in which the degrees of freedom decouple, and where the statistical weight in the path integral is given by a distribution from which sampling is easy. No separate process is required to generate training data for such models, and convergence to the desired distribution is guaranteed through a reweighting procedure such as a Metropolis test. In a proof-of-principle demonstration on two-dimensional $\phi^4$ theory, Albergo et al. (arXiv:1904.12072) modelled the trivializing map as a sequence of pointwise affine transformations. We pick up this thread, with the aim of quantifying how well we can expect this approach to scale as we increase the number of degrees of freedom in the system. We make several modifications to the original design that allow our models to learn more efficient representations of trivializing maps using much smaller neural networks, which leads to a large reduction in the computational cost required to train models of equivalent quality. After making these changes, we find that sampling efficiency is almost entirely dictated by how extensively a model has been trained, while being unresponsive to further alterations that increase model flexibility. However, as we move towards the continuum limit the training costs scale extremely quickly; further work is urgently required to fully understand and mitigate this scaling.


I. INTRODUCTION
Lattice field theory involves the computation of expectation values of the form
$$\langle O \rangle = \frac{1}{Z} \int \mathcal{D}\phi \; O(\phi) \, e^{-S(\phi)}, \qquad (1)$$
resulting from the discretisation of Euclidean path integrals onto a space-time lattice Λ. O(φ) is a generic observable defined for the field configuration φ, and the (Euclidean) action S(φ) encodes all of the dynamics and interactions of the fields. In most interesting scenarios integrals of this form are not tractable and we are forced to resort to sampling. Concretely, estimating Equation (1) by sampling means evaluating a statistical average,
$$\bar{O} = \frac{1}{N} \sum_{n=1}^{N} O(\phi^{(n)}), \qquad (3)$$
over a representative sample {φ^(n)} comprising N field configurations drawn from a statistical ensemble with Boltzmann factor $e^{-S(\phi)}$. The error on this estimator scales as $1/\sqrt{N_{\rm eff}}$, where the effective sample size $N_{\rm eff}$ reaches a maximum value of N in the absence of correlations between configurations.
Markov Chain Monte Carlo (MCMC) methods are the best known tool for sampling from high-dimensional distributions. However, the configurations in the resulting sequence are indeed correlated, and the effective sample size is diminished by a factor of twice the integrated autocorrelation time, 1 defined for each observable in terms of its autocorrelation function
$$\Gamma_O(t) = \big\langle O(\phi^{(n+t)}) \, O(\phi^{(n)}) \big\rangle - \langle O \rangle^2,$$
where t represents a number of steps separating pairs of configurations in a Markov chain, and n is arbitrary provided the process has equilibrated to its stationary distribution [2].
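To make these definitions concrete, the autocorrelation function and a naive fixed-window estimate of the integrated autocorrelation time can be sketched as follows (a minimal illustration; the function names and the fixed summation window are our own choices, not taken from the text):

```python
import numpy as np

def autocorrelation(obs, t_max):
    """Estimate Gamma_O(t) = <O_{n+t} O_n> - <O>^2 for t = 0, ..., t_max - 1."""
    obs = np.asarray(obs, dtype=float)
    mean = obs.mean()
    return np.array(
        [np.mean((obs[: len(obs) - t] - mean) * (obs[t:] - mean)) for t in range(t_max)]
    )

def integrated_autocorrelation_time(obs, t_max=100):
    """Naive fixed-window estimate: tau = 1/2 + sum_{t>=1} Gamma(t) / Gamma(0)."""
    gamma = autocorrelation(obs, t_max)
    return 0.5 + np.sum(gamma[1:] / gamma[0])
```

For uncorrelated data the estimate fluctuates around 1/2, so the effective sample size N/(2τ) reduces to N, as stated above.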
Under normal conditions this issue is manageable. Most MCMC algorithms, however, suffer from an acute condition known as critical slowing down, associated with a quite catastrophic reduction in their sampling efficiency as the system under study approaches a critical point [3]. Critical slowing down typically manifests as a power-law scaling of the integrated autocorrelation time with the system's correlation length ξ,
$$\tau_O \sim \xi^{z_O}. \qquad (5)$$
This is rather unfortunate since ξ (in lattice units) diverges as we take the continuum limit of our lattice field theory. Algorithms based on random walks or classical molecular dynamics [4][5][6] have a lower limit of $z_O = 1$ owing to the maximum speed of information propagation, but in the absence of very careful tuning [7] they typically exhibit $z_O = 2$ scaling, corresponding to diffusive information transport. Furthermore, there is substantial evidence that the picture is even worse when considering theories which possess non-trivial topology in the continuum limit [8][9][10][11][12], including QCD itself [13,14]. As the continuum limit is approached, the rapid increase in energy barriers between topological sectors can result in $z_O > 2$ for topological observables, and potentially even exponential scaling [15].
Collective update algorithms can, in principle, fare much better, since they need not be restricted to local dynamics. Indeed, several celebrated algorithms based on collective updates have been devised for certain systems [16][17][18][19][20], but as yet a general-purpose collective update algorithm is lacking, and critical slowing down remains an unsolved problem in the majority of cases, including lattice QCD.
Given this state of affairs, there has recently been a great deal of interest in augmenting the MCMC 'toolkit' with novel techniques from the rapidly maturing field of machine learning. Perhaps the most eye-catching new additions to the toolkit are deep generative models [21,22], which can be viewed as a category of highly flexible statistical models that, using stochastic optimisation techniques (a.k.a. 'training'), can approximate complicated probability densities. The unique property of generative models is that, once trained, they can be directly sampled from to generate samples of potentially 'realistic' data. The relevance of this becomes clear upon a change of terminology: let 'data' mean 'field configurations' and 'probability density' refer to their statistical weight in the path integral. Already, a number of prototypical hybrid algorithms have been proposed in which deep generative models either guide or replace traditional MCMC update procedures [1,[23][24][25][26][27][28][29][30][31][32][33][34][35][36][37]. Deep generative models have also been used to automatically identify relevant variables, leading to novel algorithms for characterising phase diagrams [30,[38][39][40][41] and enacting renormalization group transformations [42][43][44][45][46].
Normalizing flows [47][48][49] are a class of deep generative model which approximate the distribution of interest by learning an invertible map from a set of 'latent' variables whose distribution is much easier to sample from. Typically, the map is built out of a sequence of relatively simple pointwise transformations. The capacity to model complex, correlated probability distributions such as those corresponding to Euclidean lattice field theories arises due to the fact that the parameters of these transformations are generated by neural networks that take the field variables themselves as inputs [50,51]. Put another way, the process of training such a model is an encoding of the correlations between the degrees of freedom in the path integral into the weights and biases of these neural networks.
In spirit, normalizing flows are very similar to Lüscher's trivializing maps [52], which are field transformations that map an interacting theory to a limit where the field variables decouple, at which point sampling becomes extremely efficient. Reference [52] provides a power-series expression for the generators of a class of flows which trivialize lattice gauge theories, although only the first two terms in this series are tractable in practice, and additional finite-step errors are accumulated through numerical integration of the flow. Unfortunately, the degree to which the result of this procedure approximates a trivializing map proved insufficient to improve the scaling of integrated autocorrelation times [53]. However, there has recently been a renewed interest in the potential to construct (approximate) trivializing maps with additional leverage provided by modern machine learning techniques [1,[31][32][33][34][35].
As noted in Reference [52], the existence of a trivializing map 2 implies that it is theoretically possible to evaluate Equation (1) without MCMC, by simply generating uncorrelated random fields. In Sections II B and III we outline a more practical procedure in which a normalizing flow generates statistically independent field configurations that act as proposals for the Metropolis-Hastings algorithm. From a theoretical perspective, this has the potential to become more efficient than traditional sampling, since the statistical efficiency of the sampling algorithm is decoupled from the correlation length of the system. The caveat is that, somehow, the costs associated with the highly non-trivial task of sampling from the path integral of an interacting field theory are transferred to the training of the model. Therefore, to answer the question of whether a generative sampling algorithm can be expected to outperform traditional methods is a matter of understanding how these training costs scale as the continuum limit is approached.
Albergo et al. [1] first demonstrated that the procedure just described is a viable approach to sampling in lattice field theory. In their proof-of-principle study, which focused on two-dimensional scalar $\phi^4$ theory on lattices with up to $14^2$ sites, the normalizing flow was a sequence of pointwise affine transformations parameterised by neural networks. Here, we continue the same thread, with the aim of establishing how well this approach scales to lattices with up to $20^2$ sites. Finding that the recipe used in Reference [1] yields relatively inefficient representations of trivializing maps for the particular theory of interest, we make a number of adjustments; most importantly, we bring in a more expressive transformation based on a spline, and replace the deep neural networks by networks with a single, rather narrow hidden layer. We quantify the performance and scaling of our models using hardware-independent metrics: the Metropolis-Hastings acceptance rate, the number of trainable parameters in the models, and the total number of field configurations generated during the training phase.

II. SAMPLING IN LATTICE FIELD THEORY
The problem we are trying to solve can be phrased as follows: we would like to generate samples {φ^(1), φ^(2), . . . , φ^(N)} of the discrete random field $\{\phi_x \mid x \in \Lambda\} \equiv \phi \in \mathcal{M}^{|\Lambda|}$, where |Λ| is the number of sites on the lattice and $\mathcal{M}^{|\Lambda|}$ is a direct product called the field space, that are representative of the lattice field theory we seek to study. By 'representative' we mean that the probability of a particular configuration appearing in the sample is to be proportional to its Boltzmann weight,
$$p(\phi) = \frac{e^{-S(\phi)}}{Z}.$$
We will refer to p(φ) as the target density.

A. Markov Chain Monte Carlo
MCMC sampling methods work by generating a sequence of transitions φ^(n) → φ^(n+1) which together comprise a Markov chain $(\phi^{(n)})_{n=1}^{N}$. Thus, implicit in any MCMC method is a transition kernel W(φ → φ′), which is required to have a stationary distribution that is equal to the distribution from which we wish to sample, implying the following:
$$\int \mathcal{D}\phi \; p(\phi) \, W(\phi \to \phi') = p(\phi'). \qquad (7)$$
If W(φ → φ′) is also ergodic, then the stationary distribution is unique and the Markov chain is guaranteed to converge to p(φ) [55]. However, this does not imply that any finite section of the chain is representative of p(φ), since the configurations will be correlated. In practice, this results in statistical errors that scale as $(N/2\tau_O)^{-1/2}$ rather than $N^{-1/2}$, leading to a trade-off between algorithmic efficiency - the amount of effort taken to generate a transition - and statistical efficiency - how many transitions are required to produce a statistically independent configuration.
To guarantee Equation (7) it is sufficient to impose detailed balance,
$$p(\phi) \, W(\phi \to \phi') = p(\phi') \, W(\phi' \to \phi). \qquad (8)$$
Generating transitions that satisfy Equation (8) directly is difficult in general; the celebrated algorithms that do so [16][17][18][19][20] involve contrived update procedures which are only applicable within certain classes of models. A simple and robust alternative due to Metropolis et al. [4,5] is to generate configurations φ′ via a distribution q(φ′ | φ) which is easy to sample from, and accept or reject these proposals based on an acceptance probability chosen such that detailed balance is satisfied. The standard choice is the 'Metropolis test',
$$A(\phi \to \phi') = \min\left(1, \; \frac{q(\phi \mid \phi') \, p(\phi')}{q(\phi' \mid \phi) \, p(\phi)}\right), \qquad (9)$$
which, importantly, does not require the calculation of normalizing factors. A rejection of the proposal corresponds to a duplication of the current state in the chain. Thus, the Markov chain can be seen as a reweighting of the set of proposals in which the configurations pick up integer weights. The proposal distribution q(φ′ | φ) can be anything which guarantees ergodicity of W(φ → φ′), and it is sufficient for it to have non-zero density everywhere on $\mathcal{M}^{|\Lambda|}$ [56]. In the following snippet of Python code, which implements the Metropolis-Hastings algorithm, generator yields proposals drawn from q(φ′ | φ), and acceptance is a function which evaluates Equation (9).

Metropolis-Hastings Algorithm (Python)
    from numpy.random import rand  # uniform random numbers on [0, 1)

    # `generator` and `acceptance` are defined elsewhere, as described above
    chain = []
    current = next(generator)  # initialise
    for n in range(N):
        proposal = next(generator)
        prob = acceptance(current, proposal)
        if rand() < prob:
            chain.append(proposal)
            current = proposal
        else:
            chain.append(current)

The Metropolis-Hastings algorithm is completely agnostic towards the process through which proposals are generated - be it a single spin flip, a molecular dynamics trajectory or an independent configuration explicitly drawn from some proposal distribution - provided any 'selection bias' is properly accounted for by the factor q(φ | φ′)/q(φ′ | φ). This makes it very appealing as a kernel around which to construct collective update algorithms. The difficulties arise due to the rapidly increasing sparsity of p(φ) as the number of degrees of freedom increases and as we move towards the continuum, which puts extremely stringent constraints on how proposals may be generated if we are to sample the path integral in an acceptable amount of time. Two main approaches to this problem are:

• Local updates: Generate proposals that are close to the current configuration by updating individual lattice sites. Changes in p(φ) can be made arbitrarily small by tuning the step size so as to yield a desired acceptance rate.
• Hybrid Monte Carlo: Generate proposals by numerically integrating a fictitious Hamiltonian system, and by doing so update all of the lattice sites. In this case it is the number of integration steps that must be balanced against the acceptance rate.
Both of these methods become less efficient when we take the continuum limit, as we are forced to trade down on step size to keep the acceptance rate reasonably high, meaning each statistically independent configuration requires more steps to produce. This is what is meant by critical slowing down, and the decline in statistical efficiency is quantified by the dynamical critical exponent in Equation (5).
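For concreteness, the acceptance function referenced in the earlier snippet can be sketched in log space, where the unknown normalizing factors cancel; this variant takes precomputed log-densities rather than configurations, and the signature is our own invention:

```python
import numpy as np

def acceptance(log_p_current, log_p_proposal, log_q_forward=0.0, log_q_reverse=0.0):
    """Metropolis-Hastings acceptance probability, Equation (9), in log space.

    log_q_forward is log q(proposal | current) and log_q_reverse is
    log q(current | proposal); both default to zero, i.e. a symmetric proposal.
    Working with logarithms avoids overflow and makes the cancellation of
    normalizing factors explicit.
    """
    log_a = (log_p_proposal - log_p_current) + (log_q_reverse - log_q_forward)
    return min(1.0, float(np.exp(log_a)))
```

For a symmetric proposal and equal densities this returns exactly 1, i.e. the move is always accepted.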

B. A generative approach to global updates
Though it might seem extremely ambitious, the simplest possible generative sampling algorithm would be one in which a deep generative model generates entire, statistically independent field configurations with a probability close to their true weight in the path integral. Let us entertain this ambition. For reasons which will shortly become clear, we will focus the following discussion on parametric models with an explicit probability density, p̃(φ). The intention is to construct a model and identify a set of model parameters, θ, such that the approximation p̃(φ) ≈ p(φ) is a 'good' one. The notion of a 'good approximation' is made quantitative through the Kullback-Leibler divergence [57],
$$D_{\rm KL}(p \,\|\, \tilde p) = \int \mathcal{D}\phi \; p(\phi) \log \frac{p(\phi)}{\tilde p(\phi)} \geq 0, \qquad (10)$$
which vanishes if and only if p̃(φ) = p(φ). When we speak of 'training' such a model, what we really mean is optimising a particular function with respect to the model's parameters. In the present work, this 'objective function' will be (a variant of) the Kullback-Leibler divergence, and the goal of training will be to find the set of parameters θ* which satisfy
$$\theta^* = \underset{\theta}{\arg\min} \; D_{\rm KL}.$$
Successfully training a generative model to generate uncorrelated samples of field configurations with a probability close to their true weight in the path integral certainly appears to be an auspicious starting point for constructing an efficient sampling algorithm. Of course, the problem is that anything less than a perfect fit, i.e. p̃(φ) = p(φ), implies that the samples generated by the model are not truly representative of the field theory, with discrepancies between p̃(φ) and p(φ) manifesting as biases in expectation values. Yet it is possible to exactly correct for these biases through reweighting or a Metropolis step, provided we have access to p̃(φ).
Hence, as well as restricting ourselves to models with an explicit density function, we will also demand that p̃(φ) is tractable, by which we mean it is given exactly by a closed-form expression computable in polynomial time (in order to be scalable) and whose repeated evaluation (for us, $10^5$-$10^9$ times during training) does not constitute an unacceptably large overhead. 3 Consider a variant of the Metropolis-Hastings algorithm in which generator is a generative model equipped with an explicit and tractable density that is capable of generating independent configurations with probability given by p̃(φ). With q(φ′ | φ) = p̃(φ′), the acceptance probability of Equation (9) becomes
$$A(\phi \to \phi') = \min\left(1, \; \frac{\tilde p(\phi)}{p(\phi)} \frac{p(\phi')}{\tilde p(\phi')}\right).$$
If the model were a perfect approximation, such that p̃(φ) = p(φ) for all φ, then A(φ → φ′) = 1 identically and 100% of proposals would be accepted. In a more realistic situation where there are discrepancies, the inefficiency of generating proposals with a probability proportional to p̃(φ) rather than p(φ) manifests itself through multiplicities in the Markov chain due to rejections, which are in turn measurable as autocorrelations. However, if proposals are drawn independently, then rejections are the only source of autocorrelation. As explained in Reference [1], the autocorrelation at separation t is given, for all observables, by
$$\frac{\Gamma(t)}{\Gamma(0)} = \Pr\big[\text{$t$ consecutive rejections}\big]. \qquad (13)$$
This is an extremely appealing feature that is not present in traditional algorithms, where local dynamics combined with energy barriers can lead to the decoupling of autocorrelation times for topological and non-topological observables [10]. An estimate of Equation (13) is trivial to obtain from the accept/reject history of a Metropolis-Hastings simulation, with which we can compute an estimate for the integrated autocorrelation time that we denote $\tau_{\rm rej}$.
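A minimal sketch of estimating $\tau_{\rm rej}$ from the accept/reject history; `rejected` is a hypothetical boolean array marking which Metropolis steps were rejections, and the fixed summation window is our own choice:

```python
import numpy as np

def gamma_rej(rejected, t):
    """Fraction of length-t windows in which every proposal was rejected,
    which estimates Equation (13) for separation t."""
    rejected = np.asarray(rejected, dtype=bool)
    if t == 0:
        return 1.0
    windows = np.lib.stride_tricks.sliding_window_view(rejected, t)
    return windows.all(axis=1).mean()

def tau_rej(rejected, t_max=50):
    """tau_rej = 1/2 + sum_{t>=1} Gamma(t)/Gamma(0), truncated at t_max."""
    return 0.5 + sum(gamma_rej(rejected, t) for t in range(1, t_max + 1))
```

A history with no rejections gives the minimum value 1/2, i.e. every configuration is statistically independent.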
Since Equation (13) is strictly larger than the average rejection rate $r \equiv 1 - \mathbb{E}_{\phi \sim p} \mathbb{E}_{\phi' \sim \tilde p}[A(\phi \to \phi')]$ raised to the t-th power, a lower bound on the integrated autocorrelation time can be given in closed form by a geometric series,
$$\tau \geq \frac{1}{2} + \sum_{t=1}^{\infty} r^t = \frac{1}{2} + \frac{r}{1 - r}.$$
This expression is not particularly useful per se, but we will be interested in how close to this lower bound the actual integrated autocorrelation falls. Although this is not the approach we will take, reweighting can instead be done at the level of computing ensemble averages through a change of measure in Equation (1) to $\mathcal{D}\phi \, \tilde p(\phi) w(\phi)$, where the reweighting factor $w(\phi) \equiv p(\phi)/\tilde p(\phi)$ is the same factor used in the Metropolis test [32]. The mean estimator from Equation (3) then reads
$$\bar{O} = \frac{\sum_{n=1}^{N} w(\phi^{(n)}) \, O(\phi^{(n)})}{\sum_{n=1}^{N} w(\phi^{(n)})}.$$
Of course, while this approach makes use of all of the generated configurations, there is still a price to be paid for drawing samples from p̃(φ) rather than p(φ); the weights ensure that the number of configurations yielding non-negligible contributions to the sum drops rapidly as the approximation p̃(φ) ≈ p(φ) degrades. As remarked on in Reference [33], this a posteriori reweighting approach is appealing if O(φ) is cheap to compute relative to the cost of generating configurations from the model.
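The reweighted mean estimator can be sketched as follows, working with log-densities for numerical stability; the function name and signature are our own:

```python
import numpy as np

def reweighted_mean(obs, log_p, log_ptilde):
    """<O> estimated as sum_n w_n O_n / sum_n w_n with w = p / ptilde.

    Log-weights are shifted by their maximum before exponentiating;
    the shift cancels in the ratio, so unnormalized log-densities suffice.
    """
    log_w = np.asarray(log_p) - np.asarray(log_ptilde)
    w = np.exp(log_w - log_w.max())
    return np.sum(w * np.asarray(obs)) / np.sum(w)
```

When the model density matches the target exactly, all weights are equal and the estimator reduces to the plain sample mean of Equation (3).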

III. NORMALIZING FLOWS
For our purposes, we define a normalizing flow as a bijective mapping that is continuously differentiable in both directions, 4
$$f_\theta : z \mapsto \phi,$$
between 'latent' random variables, z ∼ r(z), and 'candidate' field configurations, φ = f_θ(z). 5 We will immediately restrict ourselves to the special case of M = R, which applies to scalar $\phi^4$ theory. 6 Hence, the density associated with the candidate field configurations is given by the familiar formula for a change of variables, involving the Jacobian determinant,
$$\tilde p(\phi) = r(z) \left| \det \frac{\partial f_\theta(z)}{\partial z} \right|^{-1}, \qquad z = f_\theta^{-1}(\phi). \qquad (16)$$
In practice, we only ever require the logarithm of this equation.
We will draw latent variables from an uncorrelated Gaussian distribution, which one may interpret, in the spirit of References [52,53], as a trivial limit of φ 4 theory. 7 In principle one could put more effort into generating latent variables that reduce the workload for the flow. 8 However, even with a priori knowledge of some features of the target density this comes with a high risk of over-engineering the problem; i.e. leading to marginal improvements in the model's approximation to the target, which are completely negated by the increased costs of generating samples and computing the density.
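As a toy illustration of the change-of-variables density, consider a trivial pointwise linear map standing in for $f_\theta$; everything here (the map, the parameter values) is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow(z, a=2.0, b=0.5):
    """Toy invertible pointwise map f(z) = a*z + b, a stand-in for f_theta."""
    return a * z + b

def log_ptilde(z, a=2.0):
    """log ptilde(phi) = log r(z) - log|det df/dz| evaluated at phi = f(z).

    For this pointwise map the Jacobian is diagonal with constant entries a,
    so the log-determinant is just n * log(a) for n lattice sites.
    """
    n = z.size
    log_r = -0.5 * np.sum(z**2) - 0.5 * n * np.log(2 * np.pi)
    return log_r - n * np.log(a)

z = rng.standard_normal(16)   # uncorrelated Gaussian latents on a tiny "lattice"
phi = flow(z)
```

Note that the density is evaluated using the latent variables themselves, so generating a sample and computing its log-density share a single pass through the map.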
Although the bijective construction is not the most flexible a priori, normalizing flows have several advantages over other generative models. Crucially, it is straightforward in principle to ensure that the density is tractable, by choosing a map whose Jacobian determinant |∂f_θ(z)/∂z| is tractable. It is this feature that provides us with a means of guaranteeing convergence to the correct target density through the Metropolis test. Additional benefits relate to the training, which is discussed in the next subsection. Finally, as an added bonus, the intermediate states of the flow correspond to valid probability densities in their own right, from which we can draw samples. In this sense, normalizing flows are more 'interpretable' than other generative models, which can behave more like a 'black box'.

A. Training a flow model
Since normalizing flows are differentiable by construction, they can be trained using standard gradient-based optimisation algorithms such as stochastic gradient descent. The algorithm used in this work is a variant that incorporates momentum, called ADAM [60]. The conventional approach to training is to expose the model to a set of data drawn from the distribution of interest, p(φ), via a separate process, and tune the parameters of the model in order to optimise some objective function. If we were to take the conventional approach here, the set of training configurations would be divided into 'batches', passed through the layers of the flow model in the reverse direction, and the resulting variables $f_\theta^{-1}(\phi)$ used to estimate the following objective function by averaging over the batch:
$$\hat D_{\rm KL}(p \,\|\, \tilde p) = -\frac{1}{N_{\rm batch}} \sum_{\{\phi\}} \left[ \log r\big(f_\theta^{-1}(\phi)\big) + \log \left| \det \frac{\partial f_\theta^{-1}(\phi)}{\partial \phi} \right| \right]. \qquad (18)$$
Equation (18) is an estimator for the Kullback-Leibler divergence defined in Equation (10), up to an unknown self-information term, $\mathbb{E}_{\phi \sim p}[\log p(\phi)]$, that does not depend on the model's parameters and is therefore irrelevant for the purposes of optimisation.
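A minimal sketch of this data-driven objective, with hypothetical callables standing in for the inverse flow, the latent log-density, and the log-Jacobian of the inverse map:

```python
import numpy as np

def forward_kl_objective(phi_batch, inverse_flow, log_r, log_det_jac_inv):
    """Estimate D_KL(p || ptilde) up to the self-information constant:
    the negative batch mean of log ptilde(phi) over samples drawn from p."""
    z_batch = inverse_flow(phi_batch)
    return -np.mean(log_r(z_batch) + log_det_jac_inv(phi_batch))
```

In a real implementation the batch would pass backwards through the coupling layers, accumulating the log-Jacobian terms along the way.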
However, for our purposes this strategy is clearly not satisfactory since the problem has gone full circle; the ability to train models would then be tied to the ability to generate large representative samples of configurations to act as training data, which is exactly what we are prevented from doing by critical slowing down. Thankfully, an alternative path presents itself in the typical scenario where we are interested in sampling from a theory for which S(φ) is completely specified. In this training paradigm, favoured by Reference [1] and many subsequent studies, one considers the alternative definition of the Kullback-Leibler divergence following a reversal of the arguments with respect to Equation (10),
$$D_{\rm KL}(\tilde p \,\|\, p) = \int \mathcal{D}\phi \; \tilde p(\phi) \log \frac{\tilde p(\phi)}{p(\phi)}. \qquad (19)$$
This allows us to define an objective function that can be minimised using estimates based on configurations generated exclusively by the model:
$$\hat D_{\rm KL}(\tilde p \,\|\, p) = \frac{1}{N_{\rm batch}} \sum_{\{z\}} \left[ \log r(z) - \log \left| \det \frac{\partial f_\theta(z)}{\partial z} \right| + S\big(f_\theta(z)\big) \right] + \log Z. \qquad (20)$$
In Equation (20), the irrelevant terms that do not depend on the model's parameters are $\mathbb{E}_{z \sim r}[\log r(z)]$ and the normalizing factor in the path integral, log Z. One could be forgiven for thinking that the difference between optimising $\hat D_{\rm KL}(p \,\|\, \tilde p)$ and optimising $\hat D_{\rm KL}(\tilde p \,\|\, p)$ is no more than a matter of exchanging a pre-existing training set for configurations drawn from the model. In practice, however, the two modes of training have distinct quirks which are important to appreciate. A recent contribution, Reference [61], includes a comparison of the two training schemes in situations where p(φ) is multi-modal.
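The self-sampling objective of Equation (20) (dropping the constant log Z) can be sketched analogously, again with hypothetical callables standing in for the model components and the action:

```python
import numpy as np

def reverse_kl_objective(z_batch, flow, log_r, log_det_jac, action):
    """Estimate D_KL(ptilde || p) up to the constant log Z, using only
    configurations generated by the model: the batch mean of
    log r(z) - log|det df/dz| + S(f(z))."""
    phi_batch = flow(z_batch)
    return np.mean(log_r(z_batch) - log_det_jac(z_batch) + action(phi_batch))
```

Every term here is computable from the latent batch alone, which is what removes the need for an external training set.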
In fact, by insisting on not having to obtain training inputs from an external process, we have actually sidestepped several of the major difficulties normally faced during training. In particular, since each batch of training inputs is stochastically generated on demand, and never recycled, 'over-fitting' of training data is relegated to a non-issue. Furthermore, we need not be concerned about bias in the training inputs, since they are obtained through exact sampling from the latent distribution. The problem that we are most likely to encounter is one of insufficient flexibility to resolve all of the features in the target density, leading to a model which 'under-fits' the target. This is an important example of qualitatively different results arising from the choice of training scheme. Optimising $\hat D_{\rm KL}(p \,\|\, \tilde p)$ typically results in 'smoothed' approximations to the target, whereas the approach we take, optimising $\hat D_{\rm KL}(\tilde p \,\|\, p)$ by sampling from the model, has a tendency to fit (not necessarily all of) the modes of the target, and set p̃(φ) ≈ 0 elsewhere. This behaviour is explained and its implications discussed in Sections VII B and VII C.

5 We refer to the configurations generated by the model as 'candidates' since they might be rejected by the Metropolis test. 6 For discussion and examples of flows on non-Euclidean manifolds, see References [58,59]. 7 Cf. Equation (40) with λ → 0 and β → 0. 8 In Section VIII B we briefly discuss the use of free fields as the latent variables.

B. Building flexible models
When considering potential transformations for the layers g_i, there are two conflicting requirements that will need to be met with a potentially very delicate compromise. Firstly, the flow will need to be highly flexible in order to start with uncorrelated Gaussian variables and distil the complex features of a system near to criticality, which will include non-trivial correlations on multiple scales. On the other hand, the Jacobian determinant in Equation (16) must be tractable since we are still required to evaluate p̃(φ) in order to perform the reweighting that guarantees convergence to the correct target. Furthermore, the speed at which models can be trained and sampled from will depend on the efficiency with which the (logarithm of the) Jacobian determinant can be computed. This is a significant constraint and one that is very much at odds with the goal of using invertible transformations to map simple densities to complex ones; it is challenging to define sufficiently expressive transformations without rendering the Jacobian term intractable.
The key component that initially enabled normalizing flows to become competitive with more flexible generative models at performing benchmark tasks (such as image synthesis) was a particular type of transformation now widely referred to as a coupling layer [50,51,62]. Coupling layers are essentially a template for building flexible, pointwise, invertible transformations that are guaranteed to have a triangular Jacobian matrix. One divides the inputs to a coupling layer into two groups, only one of which will actually undergo a non-trivial transformation that is conditioned on information derived from the remaining, non-transformed variables. Since we are interested in $\phi^4$ theory with a single degree of freedom at each lattice site, this splitting equates to an (arbitrary) partitioning of the lattice into Λ_A and Λ_P, which we refer to as the 'active' and 'passive' partitions, respectively. The normalizing flow is then built out of several coupling layers by function composition,
$$f_\theta = g_I \circ g_{I-1} \circ \cdots \circ g_1.$$
Defining $v_1 \equiv z$ and $v_{I+1} \equiv \phi$ lets us write the action of the i-th coupling layer as
$$v_{i+1}^P = v_i^P, \qquad v_{i+1}^A = C_i\big(v_i^A \,;\, \mathbf{N}_i(v_i^P)\big),$$
where $v_i^A$ denotes the restriction of $v_i$ to the active partition, and $v_i^P$ analogously. The set of functions $C_i$ (which are as yet unspecified) transform the active partition and are conditioned on a set of parameters $\mathbf{N}_i(v_i^P)$ that are themselves functions of the passive variables.
In the examples that we will consider here, these parameters are the outputs of one or more fully-connected feed-forward neural networks. 9 Throughout this paper, neural network outputs, exclusively, will be denoted by bold letters, and it will be left to the presence or absence of indices (e.g. i for the layer index, x for the lattice sites) to specify the cardinality of sets (for example, $v_{i,x}$ is a number while $v_i$ is a vector with |Λ| components).
The Jacobian for a coupling layer is, in block notation,
$$\frac{\partial g_i(v_i)}{\partial v_i} = \begin{pmatrix} \mathbb{1} & 0 \\[4pt] \dfrac{\partial C_i}{\partial v_i^P} & \dfrac{\partial C_i}{\partial v_i^A} \end{pmatrix}, \qquad (23)$$
where the lower-right block is understood to be a diagonal matrix whose diagonal elements are $\partial C_{i,x} / \partial v_{i,x}^A$ for each $x \in \Lambda_A$. As promised, this matrix is triangular, so the determinant is simply equal to the product of terms on the leading diagonal.

9 There is an important distinction to be made between the neural network outputs, $\mathbf{N}_{i,x}$, that parametrise a function, $C_{i,x}$, and the 'model parameters', θ, that are tuned during training, which are the aggregation of all of the parameters (weights and biases) from each neural network, from each coupling layer.
By swapping the active and passive partitions after every coupling layer (Λ_A ↔ Λ_P) and composing at least three layers, we ensure that each lattice site is updated using information from every other one. Thus, coupling layers allow us to sample from correlated target densities using uncorrelated latent variables at no additional expense in the computation of the Jacobian determinant; the calculation is the same as it would be if we replaced the neural networks $\mathbf{N}_i$ with parameters that had no dependence on $v_i^P$. Since det AB = det A det B, a sequence of I coupling layers induces the following Jacobian determinant:
$$\log \left| \det \frac{\partial f_\theta(z)}{\partial z} \right| = \sum_{i=1}^{I} \sum_{x \in \Lambda_A} \log \frac{\partial C_{i,x}}{\partial v_{i,x}}. \qquad (24)$$
These terms can be accumulated alongside the transformations of field variables, so that a single pass through all of the coupling layers yields both a set of candidate field configurations and the left hand side of Equation (24) for each configuration in the batch, ready to evaluate log p̃(φ) (if sampling) or the objective function (if training). Given freedom to divide the lattice in whichever way seems fit, we will implement the 'checkerboard' partitioning shown in Figure 1, which ensures that each lattice site is directly influenced by its closest neighbours. We will often refer to a pair of coupling layers, which together transform every degree of freedom once, as a 'coupling block'. From here on, we will drop the A and P superscripts and assume we are always talking about transforming a set of variables $v_i$ belonging to the active partition. Furthermore, we will denote the neural networks without an explicit dependence on the passive partition.
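The checkerboard partitioning and the layer/block structure described above can be sketched as follows; the `transform` callable is a stand-in for $C_i$ conditioned on the passive sites, and all names are our own:

```python
import numpy as np

def checkerboard(L):
    """Boolean mask over an L x L lattice: True marks the active partition."""
    x, y = np.meshgrid(np.arange(L), np.arange(L), indexing="ij")
    return (x + y) % 2 == 0

def coupling_layer(v, mask, transform):
    """Transform the active sites, conditioned on the frozen passive sites."""
    out = v.copy()
    out[mask] = transform(v[mask], v[~mask])  # C_i(v^A; N_i(v^P))
    return out

def coupling_block(v, mask, transform):
    """A pair of layers with swapped partitions updates every site once."""
    v = coupling_layer(v, mask, transform)
    return coupling_layer(v, ~mask, transform)
```

Because the passive sites pass through unchanged, each layer's Jacobian retains the triangular block structure of Equation (23) by construction.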

C. Affine and additive transformations
Affine coupling layers were introduced by Dinh et al. [51] as part of the Real NVP architecture. The pointwise transformation multiplies and shifts each degree of freedom, and is commonly written in vector form,
$$C_i^{\rm aff}(v_i^A) = v_i^A \odot e^{\mathbf{s}_i} + \mathbf{t}_i, \qquad (25)$$
where $\mathbf{s}_i$ and $\mathbf{t}_i$ are modelled by neural networks with $|\Lambda_A|$ outputs, and ⊙ is the element-wise product.
FIG. 1. Graphical representation of the first coupling block under the 'checkerboard' partitioning scheme. The latent Gaussian variables, $v_1 \equiv z$, are split into two partitions (the red and black nodes). The i-th coupling layer transforms the active partition, $C_i : v_i^A \to v_{i+1}^A$, using information from the passive partition, $v_i^P$, via the neural network(s) $\mathbf{N}_i(v_i^P)$. I denotes the identity transformation. Note that there is no need to merge the active and passive partitions until after the final coupling layer.

Using Equations (23) and (24), a single affine coupling layer contributes
$$\log \left| \det \frac{\partial g_i^{\rm aff}}{\partial v_i} \right| = \sum_{x \in \Lambda_A} \mathbf{s}_{i,x}$$
to the logarithm of the Jacobian determinant. The precursor to Real NVP uses volume-preserving 'additive' coupling layers [50], such that Equation (25) reduces to the shift by $\mathbf{t}_i$ only,
$$C_i^{\rm add}(v_i^A) = v_i^A + \mathbf{t}_i,$$
which contributes nothing to the Jacobian determinant. In our implementation of these coupling layers, we standardise the inputs to the neural networks such that they have unit variance, and do not apply activation functions to the output layer of these neural networks. We also append a global rescaling transformation after all of the coupling layers have acted, which can have a learnable scale parameter.
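A minimal sketch of the affine transformation and its exact inverse, assuming (as in Real NVP) that the scale enters through an exponential, so that the log-Jacobian contribution is simply the sum of the network outputs s; here s and t are passed in directly rather than produced by neural networks:

```python
import numpy as np

def affine_coupling(v_active, s, t):
    """v -> v * exp(s) + t, contributing sum(s) to the log-Jacobian.

    In a real model, s and t would be neural-network outputs conditioned
    on the passive partition.
    """
    out = v_active * np.exp(s) + t
    log_det = np.sum(s)
    return out, log_det

def affine_coupling_inverse(v_active, s, t):
    """Exact inverse: v -> (v - t) * exp(-s)."""
    return (v_active - t) * np.exp(-s)
```

The exponential parameterisation keeps the scale strictly positive, so invertibility holds for any real network outputs.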
As an inexpensive yet remarkably expressive flow architecture, Real NVP has achieved widespread success and is frequently taken as a benchmark model to which new flow models are compared. However, more sophisticated flows using more flexible transformations have since achieved superior results on a number of standard datasets (mostly images) -see e.g. References [62][63][64][65][66][67][68][69][70]. This motivates us to explore one of the prominent alternatives.

D. Rational quadratic splines
Splines are functions defined piecewise by polynomials. Coupling layers using spline-based transformations were introduced in Reference [66] and further developed by Durkan et al. in References [68,69]. We will focus on the most flexible member of the family as described in Reference [69], which is based on a continuously differentiable spline interpolant first considered in Reference [71].
A rational quadratic spline (RQS) transformation C^rqs_{i,x} is defined for a single degree of freedom by K rational quadratics, referred to as the 'segments' of the spline. These segments are joined end-to-end at a set of K + 1 'knots', indexed k = 0, . . . , K, such that the result is a strictly monotonic, C^1-differentiable function on the interval [−a, a], which is chosen so as to contain essentially all of the probability mass.
Given a reference point, the parameterisation provided by Reference [71] requires 3K + 1 strictly positive parameters to uniquely specify this function: the side lengths (w^k_{i,x}, h^k_{i,x}) of the K rectangles which have adjacent knots on their opposing corners (often referred to as 'widths' and 'heights'), and the derivatives d^k_{i,x} at the K + 1 knots. Note that our choice of labelling, shown in Figure 2, means that the endpoints of the k-th segment are the (k − 1)-th and k-th knots. This gives us a set of |Λ^A| × (3K + 1) parameters for the coupling layer. For later convenience, we define the slopes of the straight lines connecting adjacent knots, h^k_{i,x}/w^k_{i,x}, and re-express the variables being transformed as

α_{i,x} = ( v_{i,x} − V^{k−1}_{i,x} ) / w^k_{i,x} ,

which corresponds to the fractional position of v_{i,x} within the specific segment in which it is located. Note that each degree of freedom v_{i,x} must first be sorted into the appropriate segment (i.e. its segment index determined) using e.g. bisection search, which is not too expensive since the knots are already sorted into ascending order. Using the 0th knot, (V^0_{i,x}, C^rqs_{i,x}(V^0_{i,x})) = (−a, −a), as the reference point, the RQS transformation and its gradient can then be written, for each degree of freedom, using Equations (30) and (31).
Taking the logarithm of Equation (31) and summing over all x in the active partition yields the contribution to the logarithm of the Jacobian determinant from one RQS coupling layer. The advantage of using this parameterisation should now be clear: all that is required to guarantee that the gradient is strictly positive is that every parameter produced by N_i is also strictly positive. It is also particularly simple to enforce the desired normalisation. The function defined by Equation (30) is an interpolant for the set of knots,10 so the problem of representing complicated transformations reduces to one of generating a sufficient number of knots with sufficient accuracy.

After fixing the derivatives d^0 and d^K at the boundary knots, we let a single neural network generate the remaining |Λ^A| × (3K − 1) parameters in the RQS layer, using the |Λ^P| field variables in the passive partition as inputs. We take the unconstrained outputs of the neural net (denoted below with a hat) and split them into widths, heights, and derivatives. Positivity of the w^k_{i,x} and h^k_{i,x}, as well as the correct normalisation, is enforced by passing the unconstrained widths and heights through a 'softmax' activation function,

w^k_{i,x} = 2a exp(ŵ^k_{i,x}) / Σ_{k'=1}^{K} exp(ŵ^{k'}_{i,x}),

and similarly for the heights. The derivatives are instead passed through a 'softplus',

d^k_{i,x} = log( 1 + exp(d̂^k_{i,x}) ).

To ensure that the inputs of a spline layer fall within the interval [−a, a], we generally chose a = 5 and standardised the inputs before the first RQS layer by dividing them by the standard deviation, taken over both the batch and the lattice sites. To catch the edge cases of inputs falling outside of this interval, we extended the definition of the transformation to be the identity outside of [−a, a], while fixing the derivatives at the external knot points to be unity to ensure that the transformation remains everywhere differentiable.

10 To see this, substitute α_{i,x} = 0 or 1.
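The softmax and softplus constraints can be sketched as follows, for a single degree of freedom with K = 8 segments (so the network supplies K widths, K heights and K − 1 interior derivatives); the function name is our own:

```python
import numpy as np

def constrain_rqs_params(w_hat, h_hat, d_hat, a=5.0):
    """Map unconstrained network outputs to valid RQS parameters.

    The softmax makes the K widths (heights) strictly positive and
    normalises them to sum to the interval length 2a; the softplus
    makes the interior derivatives strictly positive.
    """
    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    widths = 2 * a * softmax(w_hat)
    heights = 2 * a * softmax(h_hat)
    derivs = np.log1p(np.exp(d_hat))  # softplus
    return widths, heights, derivs

K = 8
w, h, d = constrain_rqs_params(np.zeros(K), np.zeros(K), np.zeros(K - 1))
```

With all unconstrained outputs equal to zero the spline degenerates to uniformly spaced knots: each width and height is 2a/K, and every interior derivative is softplus(0) = log 2.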

E. Neural networks
The principle behind this entire approach is that correlations in the target density can be encoded in the weights and biases of neural networks. Once these weights and biases have been fixed (i.e. once we have finished training), a neural network is simply a function. In this case, the role of these functions is to take a set of field variables (the passive partition) as inputs, and use this information to return a set of parameters that govern the transformation of a different set of field variables (the active partition).
There are theoretical grounds to believe that this works; specifically, there is a universal approximation theorem implying that any well-behaved function defined on a compact subspace of R^|Λ^P| can be approximated with arbitrary accuracy by functions of the form

N(v^P) = A^(2) σ( A^(1) v^P + b^(1) ), (35)

which represent feed-forward neural networks with a single hidden layer containing H elements and a non-linear activation function σ [72]. A^(1), A^(2) are matrices containing the network weights, and b^(1) is a vector of biases. This theorem does not tell us how large the hidden layer ought to be in order to represent the desired function with a given accuracy, though the number will be related to the degree of non-linearity and sensitivity with which the function must respond to variations in its inputs. Coupling layers will need to be able to dramatically alter the field variables in response to their closest neighbours, while also accounting for more subtle corrections due to fields at larger separations, which implies that the functions being modelled by the neural networks must be extremely sensitive. However, there is no hard-and-fast rule dictating the optimal size and depth of neural networks for our specific problem; this must be discovered through experimentation.
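For concreteness, a network of the form in Equation (35) is just two matrix multiplications wrapped around an element-wise non-linearity. A minimal numpy sketch, with sizes chosen arbitrarily for illustration:

```python
import numpy as np

def feed_forward(v_passive, A1, b1, A2, sigma=np.tanh):
    """Single-hidden-layer network N(v) = A2 @ sigma(A1 @ v + b1)."""
    return A2 @ sigma(A1 @ v_passive + b1)

rng = np.random.default_rng(0)
n_passive, H, n_out = 8, 16, 8   # |Lambda_P|, hidden width, outputs
A1 = rng.normal(size=(H, n_passive))
b1 = rng.normal(size=H)
A2 = rng.normal(size=(n_out, H))
params = feed_forward(rng.normal(size=n_passive), A1, b1, A2)
```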
Universal approximation theorems have also been proven, quite recently, for so-called (deep) convolutional neural networks [73][74][75]. The building blocks of these networks are convolutions, denoted by ⋆, with a set of 'kernels', W, whose weights are trainable parameters,

(W ⋆ v)(x_1, x_2) = Σ_{k_1, k_2} W(k_1, k_2) v(x_1 + k_1, x_2 + k_2).

In the above, k = (k_1, k_2) and lattice coordinates have temporarily been written as arguments rather than the usual subscripts. A 'stride' size of one is also implied, meaning that the convolution is applied at every lattice site. In practice, W is usually taken to be non-zero only inside a small window −K ≤ k_1, k_2 ≤ K, where K tends to be rather small indeed (often just one or two). Since K represents the largest distance over which correlations can be encoded into the trainable weights of any given convolutional layer, theories with long correlation lengths ξ ≫ K require convolution-based models to be sufficiently deep in order to indirectly model correlations over large scales.
Rather than a number of nodes H in the hidden layers of a fully-connected network, we may specify a number of input and output 'channels', each of which has its own convolutional kernel. Since we are dealing with a single-component scalar field, the first layer will have one input channel. The number of output channels will depend on how many numbers are required to parametrise the transformation of each degree of freedom, i.e. one for additive layers, two for affine layers, and 3K − 1 for the splines. Note that, unlike the fully-connected case, the geometry of the objects being convolved must remain intact; one cannot simply pass the passive elements v^P into a convolutional network as an arbitrarily-ordered one-dimensional vector, as we have previously been doing. Following Reference [34], we pass two-dimensional configurations into the convolutional networks, but with zeroes as placeholder values for the active partition to maintain the diagonal Jacobian structure of the coupling layer.
Let M_{Λ^P} denote an L × L matrix with ones at the positions corresponding to the passive partition, and zeroes at the active lattice sites. Further, let c_i = 1, . . . , H_i label the channels in the i-th layer of the network, and σ and b^(i) once again represent activation functions and vectors of biases, respectively. The depth-D convolutional networks tested here can then be written as the recursion

v^(0) = M_{Λ^P} ⊙ v ,    v^(i)_{c_{i+1}} = σ( Σ_{c_i} W^(i)_{c_{i+1} c_i} ⋆ v^(i−1)_{c_i} + b^(i)_{c_{i+1}} ),  i = 1, . . . , D. (37)

In the present work we have limited the scope of our quantitative study to models using fully-connected networks as described by Equation (35), supplying only indicative examples of models using convolutional networks as described above.
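The masking trick can be illustrated with a single-channel convolutional layer on a periodic lattice: zeroing the active sites before convolving guarantees that the outputs depend only on the passive partition, preserving the diagonal Jacobian structure. This sketch (function names ours) is purely illustrative and is not the ANVIL implementation:

```python
import numpy as np

def periodic_conv2d(field, kernel):
    """Cross-correlation of an L x L field with a (2K+1) x (2K+1)
    kernel, stride one, periodic boundary conditions."""
    K = kernel.shape[0] // 2
    out = np.zeros_like(field)
    for k1 in range(-K, K + 1):
        for k2 in range(-K, K + 1):
            out += kernel[k1 + K, k2 + K] * np.roll(field, (-k1, -k2), axis=(0, 1))
    return out

def masked_conv_layer(field, kernel, passive_mask):
    """Zero the active sites before convolving, so that the output
    depends only on the passive partition."""
    return periodic_conv2d(field * passive_mask, kernel)

rng = np.random.default_rng(0)
x1, x2 = np.indices((6, 6))
passive_mask = ((x1 + x2) % 2 == 1).astype(float)
kernel = rng.normal(size=(3, 3))
f1 = rng.normal(size=(6, 6))
f2 = f1.copy()
f2[0, 0] += 1.0   # perturb an active site (even parity)
out1 = masked_conv_layer(f1, kernel, passive_mask)
out2 = masked_conv_layer(f2, kernel, passive_mask)
```

Because the active sites are zeroed at the input, perturbing an active field variable leaves the layer's output unchanged everywhere.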

F. Enforcing equivariance
Convolutions are frequently favoured over linear transformations because they possess the very desirable property that translations of the inputs induce nothing more than translations of the outputs, a trait often referred to as equivariance (with respect to translations).11 Hence, convolutional networks are not required to 'learn' that inputs related by a global translation should be considered equivalent.

11 The translational equivariance of the convolutional networks described by Equation (37) deserves comment. These networks are really equivariant with respect to the discrete group of translations that are isometries of the checkerboard sub-lattices. A consequence of the checkerboarding is that we sacrifice equivariance under the full group of lattice translational isometries.
In addition to the symmetries of the lattice, φ^4 theory possesses a Z_2 symmetry corresponding to invariance of the action under a global sign-reversal of the field, S(−φ) = S(φ). Given that equivariant maps are those which commute with the symmetry transformation, it is not difficult to show that, for the Z_2 symmetry, the equivariant maps are odd functions, and that equivariance of f_θ requires the coupling layers g_i(v_i) to be equivariant, meaning that the transformations satisfy

g_i(−v_i ; −v^P_i) = −g_i(v_i ; v^P_i). (39)

In our models, the N_i(v^P) are (almost always) fully-connected feed-forward networks as defined by Equation (35), which are odd functions if we drop the biases and use odd activation functions (e.g. tanh) [32]. If we make these choices for the neural networks in the affine coupling layers, s_i and t_i, then Equation (39) is trivially satisfied by implementing one additional step, namely taking the absolute value of the output of the s_i network.
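The affine construction can be verified numerically: a bias-free network with tanh activations is an odd function, so the shift t is odd, and taking the absolute value makes the scale s even, which is what the equivariance condition requires. A sketch with random weights (shapes arbitrary):

```python
import numpy as np

def odd_network(v, A1, A2):
    """Bias-free single-hidden-layer network with an odd activation,
    so that N(-v) = -N(v)."""
    return A2 @ np.tanh(A1 @ v)

rng = np.random.default_rng(1)
A1, A2 = rng.normal(size=(16, 8)), rng.normal(size=(8, 16))
v = rng.normal(size=8)

t_plus, t_minus = odd_network(v, A1, A2), odd_network(-v, A1, A2)  # odd shift
s_plus, s_minus = np.abs(t_plus), np.abs(t_minus)                  # even scale
```

With t odd and s even, negating both partitions gives −v ⊙ e^s − t = −(v ⊙ e^s + t), i.e. the coupling layer is odd.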
Enforcing Z_2-equivariance in the RQS transformations is less straightforward; the terms in Equation (30) cannot all be simultaneously odd. We implemented a rather crude workaround that involves splitting the batch of latent variables according to sgn( Σ_{x∈Λ} z_x ) (i.e. the sign of the initial 'magnetisation' of each configuration), and treating the two groups slightly differently within the transformation.
The key observation is that if we take a as the 0th knot (instead of −a) and construct the spline in the reverse direction, then this is equivalent to taking C^rqs_{i,x} → −C^rqs_{i,x}. In practice, it is simpler to reverse the ordering of the k indices in the network outputs h_i, w_i and d_i for one of the two groups.
Unfortunately, although the equivariance condition is satisfied, this approach is not entirely legitimate, since the result is a transformation that is not a continuous function of the inputs, thereby failing to satisfy the conditions required for Equation (16). To see this, observe that C^rqs_{i,x}(v_{i,x}) may take two possible values for fixed v_{i,x} and v^P_i, experiencing a discontinuous jump when changes in the active partition cause the overall magnetisation of z to flip sign. So, while we include results using these 'equivariant splines', we do not recommend using this prescription in future.
It is worth bearing in mind that enforcing symmetries is not absolutely necessary; the Metropolis-Hastings algorithm is guaranteed to converge to the correct target, and therefore reproduce all of its symmetries, as long as the transition kernel is ergodic. We remind the reader that a sufficient condition is p̃(φ) > 0 ∀ φ ∈ R^|Λ| [56] (so that every configuration has a finite probability of being generated), and that this is guaranteed (for a sensible choice of r(z)) since f_θ is a bijection. Nevertheless, a guiding principle of optimisation is that it is generally more efficient to enforce known constraints by construction, and benefits of doing so for normalizing flows have been reported in References [31, 33, 76].

IV. RELATED WORK
The first demonstration of a normalizing flow forming the basis of a sampling algorithm for lattice field theory was provided by Albergo et al. [1] for two-dimensional φ 4 theory, using the Real NVP architecture described in Section III C. Still with φ 4 as the target theory, Nicoli et al. [32] used an even more bare-bones flow where the coupling layers simply shift the field variables in such a way that the symmetry under φ → −φ is preserved. Our work draws on ideas from both of these studies, though we pivot in the opposite direction with respect to Reference [32] by using coupling layers that are more flexible than those in Real NVP. More recent work along these lines has been undertaken by Hackett et al., who compared several optimisation strategies for flow-based sampling from bimodal distributions, including φ 4 in its broken phase.
Further progress has mostly been on the side of developing the necessary machinery to apply these ideas to lattice gauge theories; specifically, those that are invariant under local U(N) or SU(N) transformations. Rezende et al. [59] explored several possible approaches to using normalizing flows in cases where the field variables are defined on an n-sphere or n-torus. A procedure for constructing normalizing flows that are equivariant under gauge transformations was initially developed by Kanwar et al. [31] for the U(1) case and then extended to SU(2) and SU(3) by Boyda et al. [33]. By definition, a gauge-equivariant flow is one that commutes with the action of the gauge group, which implies that gauge invariance is preserved by the flow. Hence, representative samples of gauge fields can be generated by drawing latent variables from the uniform (Haar) measure for the gauge group and passing them through a gauge-equivariant flow. More recently, Albergo et al. extended the flow-based approach to theories with dynamical fermions. A code-based introduction to these methods has been provided by Albergo et al. [34], which we made use of when implementing convolution-based flow models.
Another recent and highly relevant contribution was made by Lawrence and Yamauchi [35] who showed that, in certain cases at least, it is possible to use a normalizing flow to sample from a theory possessing a 'sign problem', which is to say the action is complex and exp(−S) cannot be interpreted as a measure of probability.
Several alternative ideas that involve training parametric models to perform collective updates predate the use of normalizing flows. For example, in the 'self-learning Monte Carlo' method [78] the parametric model describes an effective action for a spin system with n-th nearest neighbour interactions whose couplings have been inferred from pre-generated training data, which can then be used to generate Wolff cluster updates [17]. Restricted Boltzmann machines (RBMs) have been embedded in traditional MCMC algorithms [25,26], though training the RBM requires pre-generated configurations and its sampling procedure (Gibbs sampling) introduces its own autocorrelation. Generative adversarial networks (GANs), which can be more flexible than normalizing flows but for which p̃(φ) is defined implicitly and cannot be directly computed, have also been used to generate candidate field configurations, but require a lot of additional machinery on top of the GAN itself to ensure that the distribution being sampled from is close to the correct one [27,29], or otherwise estimate the discrepancy [30].
While there have been substantial advances in the use of machine learning to extract physical information from lattice field theories, the generation of samples from some approximation of the true path integral has generally come as an add-on when the tool being used is a generative model. In contrast, the key strength of normalizing flows is the explicit and tractable density p̃(φ), which makes exact sampling possible using the Metropolis test.

A. Field theory and observables
For the main part of our study we used the following action:

S(φ) = Σ_{x∈Λ} [ −β Σ_{μ=1,2} φ_{x+e_μ} φ_x + φ_x^2 + λ (φ_x^2 − 1)^2 ], (40)

which describes a discretised analogue of two-dimensional scalar φ^4 theory with dimensionless couplings β and λ, defined on a periodic lattice Λ, using e_μ to denote a unit lattice vector in the μ-th dimension. Experiments with the non-interacting theory used the 'standard' action described in Appendix A, which is given by Equation (A6) with g_0 = 0. We focus on isotropic lattices with 6^2 ≤ |Λ| ≤ 20^2 sites. A nice feature of the parameterisation given above is that the limit λ → ∞, φ^2 → 1 is very clearly identified as the Ising model at temperature T ≡ β^{−1}. Indeed, in the continuum limit φ^4 theory belongs to the Ising universality class, with spontaneous breaking of the φ → −φ symmetry occurring along a critical line T_c(λ) in the space of couplings. Defining the reduced temperature t = (T − T_c(λ))/T_c(λ), the asymptotic behaviour of observables as the system approaches criticality is described by power-law dependence on t. For example, the magnetic susceptibility diverges as χ ∼ t^{−γ}, and the correlation length diverges with a different critical exponent, ξ ∼ t^{−ν}. Eliminating t, we see that χ ∼ ξ^{γ/ν}.
In a finite volume, observables depend on both the couplings and the system size in a non-trivial manner, and their behaviour in the critical region is described by finite-size scaling. For example, the susceptibility in a volume of linear extent L can be written as

χ = ξ^{γ/ν} g_χ(L/ξ),

in which finite-volume effects have been bundled into a dimensionless scaling function g_χ(L/ξ), which must tend towards a constant value as L → ∞ and approach (L/ξ)^{γ/ν} for L ≪ ξ so as to act as a cutoff.

We now need to specify how we actually measure observables on the lattice. The basic building blocks are the two-point correlation function,

G(x) = |Λ|^{−1} Σ_{y∈Λ} ( ⟨φ_{y+x} φ_y⟩ − ⟨φ_{y+x}⟩⟨φ_y⟩ ), (42)

and its Fourier transform,

G̃(q) = Σ_{x∈Λ} e^{−i q·x} G(x).

We have used the translation invariance of Equation (40) to take a volume-average in Equation (42), for the simple reason that it improves the statistics. The susceptibility is identified with G̃(0), but in the classical spin setting it is often expressed in terms of the magnetisation M(φ) = Σ_{x∈Λ} φ_x. Estimators for these observables are easily obtained by exchanging ⟨ · ⟩ for a sample mean, with uncertainties estimated using the bootstrap method [79, 80]. However, without explicitly breaking the Z_2 symmetry one will always measure ⟨φ⟩ = |Λ|^{−1} ⟨M⟩ = 0, so if one is interested in the phase transition one can compute separate sample averages for configurations with positive and negative magnetisation, to properly account for the fact that the field variable distribution is bimodal.
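As an illustration of these estimators, the volume-averaged connected correlator can be computed with FFTs (the volume average of products of translated fields is a circular autocorrelation), and the susceptibility obtained by summing it over the lattice. Function names are our own; for uncorrelated unit-variance fields the susceptibility should come out close to one:

```python
import numpy as np

def two_point(sample):
    """Volume-averaged connected correlator G(x) from a sample of
    configurations with shape (N, L, L), using FFTs to perform the
    average over translations."""
    phi = sample - sample.mean()                 # crude connected part
    f = np.fft.fft2(phi, axes=(1, 2))
    corr = np.fft.ifft2(f * np.conj(f), axes=(1, 2)).real
    return corr.mean(axis=0) / sample[0].size    # average over sample and volume

def susceptibility(sample):
    """chi = Gtilde(0) = sum_x G(x)."""
    return two_point(sample).sum()

# Uncorrelated unit-variance fields: G(0) ~ 1, G(x != 0) ~ 0, chi ~ 1.
sample = np.random.default_rng(0).normal(size=(2000, 8, 8))
chi = susceptibility(sample)
```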
The correlation length requires a little more work to measure. It is given by the longest-range mode in the spectrum of Σ_{x_1=0}^{L−1} G(x_1, x_2), the correlation function in time-momentum representation at momentum q_1 = 0, which we denote G̃(0, x_2). For sufficiently large separations x_2, this takes the form of a pure exponential (a cosh, due to lattice periodicity), from which the correlation length can be extracted through a fit12 or by computing

ξ_eff^{−1}(x_2) = arcosh( [ G̃(0, x_2 + 1) + G̃(0, x_2 − 1) ] / [ 2 G̃(0, x_2) ] ). (46)

In general this can be challenging due to the low signal-to-noise ratio at large separations, but with L ≤ 20 we also suffer from having very few data points to fit. To slightly improve the situation, we average over the two dimensions when computing Equation (46). Another option exploits the fact that the lattice propagator takes the form G̃(q) ∝ [ 4 Σ_μ sin^2(q_μ/2) + ξ^{−2} ]^{−1} in the low-momentum limit, which lets us write [82]

ξ^2 = ( 1 / (4 sin^2(π/L)) ) ( 2 G̃(0) / [ Re G̃(q̂_1) + Re G̃(q̂_2) ] − 1 ).

Here, q̂_1 = (2π/L, 0) and q̂_2 = (0, 2π/L) are the smallest possible non-zero momenta, and we have used G̃(q) + G̃(−q) = 2 Re G̃(q). Our intention will be to tune the couplings so as to obtain systems with correlation length ξ = L/4, meaning that as we increase the lattice size we are studying essentially the same theory with an increasingly fine resolution. The purpose of doing this is so that we only see the effect of the number of degrees of freedom on algorithmic efficiency, as we keep the physical size in units of the correlation length constant. The choice of proportionality constant (four) is a reasonable trade-off between the rate at which criticality is approached as we increase L, and the size of finite-volume effects contained within scaling functions. Fixing λ = 0.5 and allowing β to vary, we obtained three separate predictions for the value of β that corresponded to ξ = L/4 on the symmetric side of the phase transition. These values are provided in Table I.
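Assuming the pure cosh form of the zero-momentum correlator, an effective correlation length can be extracted from ratios at neighbouring separations, using the identity cosh(a − h) + cosh(a + h) = 2 cosh(a) cosh(h). A sketch on synthetic data (function name ours):

```python
import numpy as np

def xi_effective(G0, x2):
    """Effective correlation length from the q1 = 0 correlator G0,
    assuming the cosh form G0(x2) ~ cosh((x2 - L/2) / xi)."""
    return 1.0 / np.arccosh((G0[x2 + 1] + G0[x2 - 1]) / (2.0 * G0[x2]))

# Synthetic correlator with xi = 4 on an L = 20 lattice: for exact
# cosh data the estimator recovers xi at every interior separation.
L, xi = 20, 4.0
x2 = np.arange(L)
G0 = np.cosh((x2 - L / 2) / xi)
```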

B. Model details
When investigating the scaling of training costs (Section VI D), we used normalizing flows that are a specific hybrid of affine coupling layers and rational quadratic splines, with the parameters of the transformations generated by fully-connected feed-forward neural networks containing a single hidden layer of size H = |Λ|. In Section VI C we report on the observations that led us to converge on this particular design.
The metric we use to measure the quality of trained models is the average rate at which configurations generated by the model are accepted when used as proposals for a Metropolis-Hastings simulation. In Section VI A we verify that this acceptance rate entirely governs the integrated autocorrelation times of the resulting Markov chains, as claimed in Section II B and specifically Equation (13).
We used the ADAM optimisation algorithm [60] to update the parameters of our models. The step size or 'learning rate' was annealed during training according to a cosine schedule,

η_t = (η_0 / 2) ( 1 + cos(π t / T) ),

where T is the total number of training iterations. Note that this learning schedule requires that we specify T before training begins; there are no additional 'stopping criteria'. The ADAMW variant [83] along with 'warm restarts' [84] (which amount to resetting t = 0) is a useful generalisation which permits us to continue training (perhaps with a larger batch size) if we are not happy with the outcome after T iterations. After some experimentation with faster initial learning rates, which typically resulted in lower acceptances if the number of training iterations was large, we generally opted for η_0 = 0.001. The batch size, i.e. the number of configurations used to estimate the objective function at each training iteration, varied from 250 to 32000 configurations. In the vast majority of cases the ratio between the batch size and the number of training iterations was a factor of one, two or four. Note that these batch sizes are much larger than those conventionally used in stochastic optimisation.
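Assuming the standard cosine-annealing form, the schedule interpolates smoothly from η_0 at t = 0 to zero at t = T:

```python
import numpy as np

def cosine_lr(t, T, eta0=0.001):
    """Cosine learning-rate schedule: eta0/2 * (1 + cos(pi t / T))."""
    return 0.5 * eta0 * (1 + np.cos(np.pi * t / T))
```

At t = T/2 the rate has decayed to η_0/2; a warm restart simply resets t to zero, so the schedule begins again from η_0.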
In fact, it is quite typical to intentionally aim for a highly stochastic trajectory through the space of parameters, by using a very small number of training inputs (as low as 2 in Reference [85]) for each update of the model's parameters. This may seem surprising, particularly since the graphical processing units (GPUs) on which these models are run are entirely optimised for highly parallel computations, so a small batch size is an under-utilisation of these capabilities. The motivations behind this choice are that the stochasticity reduces the tendency of the model to over-fit the training inputs or otherwise get stuck in local optima [85,86], and tends to find 'better' global optima [87]. However, we have no reason to prefer small batch sizes a priori; as explained in Section III A, the problem of over-fitting training inputs does not apply to us, and we expect the issue of local optima to be alleviated, to some extent, thanks to stochasticity inherited from the random number generator that produces our training inputs.
Unless stated otherwise, one can assume the following for all models presented in the remainder of this paper:

• The φ^4 couplings are given by Table I.
• The flow comprises a number of affine coupling blocks followed by a single rational quadratic spline coupling block.
• Z 2 equivariance is enforced in the affine and additive coupling layers, as described in Section III F.
• The splines have 8 segments and do not have Z 2 equivariance enforced.
• Neural networks are of the fully-connected kind with a single hidden layer containing exactly |Λ| (i.e. L^2) elements, as defined in Equation (35) with H = |Λ|.
• We do not apply an activation function to the output layer of the s and t networks in the affine (or additive) layers.
• In figures, data points and error bars are an average and range taken over three identical models with different random initialisations.
Our code, ANVIL [88], is publicly available. It uses the PyTorch library [89] for constructing and training models, and Reportengine [90], a declarative framework for performing scientific analysis.

C. Summary of the procedure
A training iteration consists of the following steps:

1. Use Equation (17) to generate a batch of N 'latent configurations', {z^(1), z^(2), . . . , z^(N)} where z^(n) ∼ r(z), with each configuration z^(n) comprising |Λ| uncorrelated Gaussian numbers.

2. Pass these variables through the layers of the model, calculating the logarithm of the Jacobian determinant, log |∂g_i/∂v_i|, for each layer g_i as it transforms one of the two partitions. This results in N candidate field configurations and N Jacobian determinants log |∂f_θ/∂z| corresponding to the full transformation φ = f_θ(z).

3. Compute the action, Equation (40), for the batch of candidate field configurations.

4. Average the action and Jacobian over the batch, to provide an estimate of the reverse Kullback-Leibler divergence, Equation (20). 13

5. Update the parameters of the model by a small increment in the direction of steepest descent, using the ADAM or ADAMW optimisation algorithms.
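Steps 3 and 4 reduce to a one-line batch average once the actions and log-Jacobians are in hand. The sketch below (names ours) evaluates the reverse Kullback-Leibler divergence up to the constant log Z; for a perfectly trained model, with log p̃(φ) = −S(φ), the shifted estimate vanishes:

```python
import numpy as np

def reverse_kl_estimate(actions, log_det_jacobians, log_r):
    """Batch estimate of the reverse KL divergence, up to log Z.

    log ptilde(phi) = log r(z) - log|det df/dz|, so the estimator is
    the sample mean of log ptilde(phi) + S(phi).
    """
    log_ptilde = log_r - log_det_jacobians
    return np.mean(log_ptilde + actions)

# A 'perfect' model: log r - log|det J| = -S exactly, so the shifted
# divergence estimate is zero.
actions = np.array([1.0, 2.0])
log_det = np.array([2.0, 3.0])
log_r = log_det - actions
loss = reverse_kl_estimate(actions, log_det, log_r)
```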
Once we have a trained model, we move on to the sampling. We generate a large sample of candidate configurations from the model, along with their Jacobian terms, and immediately calculate the quantity log w(φ) = −log p̃(φ) − S(φ) for each candidate configuration. We are now fully equipped to run a Metropolis-Hastings simulation as described in Section II; for the Metropolis test we simply exponentiate log w(φ′) − log w(φ) to obtain the acceptance probability A(φ → φ′) (see Equation (9) with q(φ′ | φ) = p̃(φ′) and p(φ) ∝ exp(−S(φ))).
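The sampling phase then reduces to a scan over the pre-computed log-weights. A minimal sketch of the Metropolis test with independence proposals (names ours); with equal weights every proposal is accepted, while a proposal of vanishing weight is always rejected:

```python
import numpy as np

def metropolis_independence(log_w, rng):
    """Build a Markov chain from a stream of model proposals.

    log_w[i] = -log ptilde(phi_i) - S(phi_i).  Since the proposal
    density does not depend on the current state, the acceptance
    probability is min(1, w(proposal) / w(current)).  Returns the
    index of the configuration occupying each step of the chain.
    """
    chain = [0]                      # initialise with the first proposal
    for i in range(1, len(log_w)):
        if np.log(rng.uniform()) < log_w[i] - log_w[chain[-1]]:
            chain.append(i)          # accept
        else:
            chain.append(chain[-1])  # reject: repeat the current state
    return np.array(chain)

rng = np.random.default_rng(0)
chain_accept = metropolis_independence(np.zeros(100), rng)          # all accepted
chain_reject = metropolis_independence(np.array([0.0, -1e9]), rng)  # all rejected
```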

A. Proof of principle
As a basic check that the types of models described in Section III have the capacity to encode the information necessary to trivialize field theories, we trained a set of models to generate free fields. For this, we found that a sequence of 2-4 blocks of additive transformations performed on a par with the more flexible affine and spline layers.13 Figure 3 demonstrates that the candidate field configurations generated by models with very high acceptance rates are indeed representative of the desired field theory. In this case we know exactly what is required of f_θ: it must perform a rescaling of the latent degrees of freedom followed by a Fourier transform to real space. However, it would be wrong to suppose that this is a trivial exercise, because the map is built out of a peculiar set of transformations that individually transform half of the degrees of freedom, conditioned on the other half, and the actual transformation learnt by the model does not, and cannot, decompose into the simple steps described above.14

Moving on to the φ^4 theory, we fixed λ = 0.5 and trained hybrid affine-spline models at various values of the inverse temperature β so as to cross the phase transition.

13 Strictly speaking we require the gradient of Equation (20) with respect to the parameters of the model. This is performed automatically by PyTorch's 'autograd' machinery [89]. Gradients are propagated through neural networks using the backpropagation algorithm [91].

FIG. 4. (...) Table I. The inset figures show the magnetisation and susceptibility over the same range of β. Models consisted of a block of affine layers followed by a spline block. The affine layers had Z2 equivariance enforced as described in Section III F, but this was only true of the spline layers in the low-temperature phase (see Figure 12 for explanation). For each lattice size, models were identical (as outlined in Section V B) and were trained in an identical fashion (16000 iterations with a batch size of 16000).
In this study, emphasis was placed on like-for-like comparison of models trained against different targets, rather than on maximising the acceptance rate. Figure 4 shows that high acceptance rates are possible in both the symmetric and the broken phase of φ^4 when using a unimodal Gaussian prior. We also see that, as should be expected, trivializing the theory becomes increasingly challenging as the phase transition is approached. Though this is still interesting, one should be cautious when interpreting Figure 4. In reality the problem is probably much easier for short correlation lengths than implied by these results, in part because the fully-connected neural networks will contain a high level of redundancy, since many degrees of freedom are effectively decoupled.
As an aside, we found that this sort of 'parameter scan' can be performed efficiently using a single model that is initially trained at a high temperature, by adjusting the temperature over a sequence of training phases; in other words, a model trained at temperature β_0 can be re-trained at temperature β_0 + δβ with relatively little effort. A thorough investigation into the potential of this feature was recently provided in Reference [61] (see 'adiabatic retraining').

FIG. 5. (...) Equation (14) is also plotted in green, for comparison. The data was obtained by sampling from models trained on a broad range of systems, including those specified in Table I, those depicted in Figure 4, and others trained during various experiments.

B. Acceptance rates and autocorrelation times
In Figure 5 we used the traditional approach to estimating the integrated autocorrelation time, based on autocorrelations in the magnetisation of each configuration in the Markov chain, which is described in Appendix B. We now compare this with the observable-independent estimator, τ_rej, defined by Equations (13) and (4). The results, shown in Figure 6, display generally good agreement between the two, though where significant discrepancies exist they are always such that the rejection-based estimator returns a larger integrated autocorrelation time than that calculated using the magnetisation.
In Section II B it was claimed that, when the proposal distribution q(φ′ | φ) in the Metropolis step, Equation (9), is not conditioned on the current state of the Markov chain and is instead given by p̃(φ′), the integrated autocorrelation time is determined entirely by the rate at which proposals are rejected or, more precisely, by Equation (13). Since this has nothing to do with the specifics of the flow model, the system size or the values of the couplings, we simply combined results from a large number of previously trained models (539, to be precise) to verify this property. Figure 5 provides the necessary empirical evidence that we have indeed nullified any dependence of τ_O on the correlation length of the system, and have therefore eliminated critical slowing down in the sampling phase. The geometric lower bound on the integrated autocorrelation time from Equation (14) is also plotted, but we find that the relationship between acceptance rate and integrated autocorrelation time is fit rather well by a power law. Table II contains no new information, but rephrases this power-law relation in terms of the acceptance rate required for the effective sample size to be a particular fraction of the Markov chain length. However, for reasons discussed shortly, we take this scaling relation with a pinch of salt, and advise against extrapolating to lower acceptances and larger autocorrelation times.

FIG. 7. Discrepancy between the alternative estimates of the integrated autocorrelation time shown in Figure 6, with x-coordinates corresponding to the length of the longest consecutive run of rejections occurring in the sampling phase.
As shown in Figure 7, the discrepancy between τ M and τ rej is directly related to the presence of long phases in the Metropolis-Hastings simulation in which every proposed configuration was rejected. These 'rare events', arising from the tails of the distributions depicted in Figure 8, are problematic when it comes to estimating the integrated autocorrelation time. The rejection-based estimator is highly sensitive to them and hence picks up a large statistical error (the situation resembles the slow convergence of estimators when local-update sampling algorithms are required to traverse large energy barriers). Conversely, the traditional estimator is relatively insensitive to a small number of uncharacteristically long periods of consecutive rejections, which may result in an underestimate of the true integrated autocorrelation time.
In Figure 8 we see the distribution describing the length of runs of consecutive rejections become less long-tailed as the acceptance rate increases. Although this trend is somewhat obvious in a qualitative sense, the circumstances by which these long-tailed distributions arise are an interesting facet of the scheme used to train these models. We expand on this in Section VII B.
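The run-length statistics discussed above are straightforward to extract from a Metropolis accept/reject history. The following is a small self-contained sketch (the function name is ours); applied to the output of a sampling phase, it yields the data behind a histogram like Figure 8.

```python
def rejection_run_lengths(accepted):
    """Return the lengths of all runs of consecutive rejections in a
    Metropolis accept/reject history (True = proposal accepted).

    The maximum of the returned list is the 'longest consecutive run
    of rejections' used as the x-coordinate in Figure 7."""
    runs, current = [], 0
    for a in accepted:
        if a:
            if current > 0:
                runs.append(current)
            current = 0
        else:
            current += 1
    if current > 0:  # history may end mid-run
        runs.append(current)
    return runs
```

For example, the history [accept, reject, reject, accept, reject, accept, accept] contains one run of length 2 and one of length 1.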

C. Finding efficient representations
The size of models, i.e. the number of trainable parameters, |θ|, will obviously play an important role in determining the scaling of training costs. Hence, it is not simply a question of building normalizing flows out of highly expressive transformations, but also one of minimising the number of redundant parameters in the model. More specifically, our ambition must be to find architectures which are able to learn the most efficient representations of trivializing maps, for which |θ| grows slowly as we increase the number of degrees of freedom in the target density.
As stated previously, we found that the flexibility of affine and rational quadratic spline transformations was not put to good use when the target density corresponded to a free theory, with models built from additive layers performing equally well using fewer trainable parameters in total. We now turn to our main point of focus: finding efficient representations of trivializing maps for strongly interacting theories. It is a tremendously useful feature of normalizing flows that we are able to sample from the intermediate layers, which define probability densities in their own right (see Figure 9). This gives us insight into the role played by each individual layer.
The transition from disorder to long-range order in lattice φ 4 theory is characterised by a gradual separation of the initial unimodal probability density into two distinct peaks, corresponding to a positive and a negative net magnetisation. Therefore, in the regime where the density is bimodal, a normalizing flow transforming Gaussian latent variables must at some point enact a Z2-symmetric bulk shifting of probability density towards ±|φ|. For flows using solely affine layers, we observed that the task of transforming a unimodal density into a bimodal one was almost always delegated to the final coupling block, after earlier layers had resolved the general structure of correlations in the unimodal setting. We suspect that this is because the functions being modelled by the neural networks simplify when their inputs are distributed unimodally around zero, whereas they must become more strongly non-linear, and hence harder for the neural networks to approximate, when their inputs are distributed bimodally. When an RQS block was introduced, it always took on the unimodal-to-bimodal transformation, regardless of its position in the flow. However, Figure 10, which compares various orderings of affine and spline blocks, shows that the strongest design starts with affine layers and applies a single block of spline transformations last.
Our expectation was that the additional flexibility offered by RQS transformations might help to perform the challenging splitting and shifting of density in a way that is more sensitive to subtle differences in the inputs than is possible with affine layers. This is confirmed emphatically by the results in Figure 11, for which we took a seven-block affine flow and replaced six of its affine coupling blocks with a single RQS block, in order to compare models with an approximately equal number of trainable parameters. However, swapping many inexpressive affine layers for a single, highly flexible RQS block is a one-off trick; the results in Figure 10 show that adding a second spline block fails to improve the acceptance rate any more than adding another affine block does. It appears that, with the unimodal-to-bimodal transformation taken care of in the final layer, the remainder of the trivializing map can be modelled more efficiently using the simpler transformations.
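To make the structure of these flows concrete, the following is a minimal numpy sketch of an affine coupling layer with a checkerboard partitioning. In the actual models the scale and shift functions are neural networks; here simple closures stand in for them, and all names are our own. The key properties are visible in code: the passive sites are untouched, the transformation is exactly invertible, and the log-det-Jacobian is just the sum of the log-scales.

```python
import numpy as np

def checkerboard_mask(L):
    """Boolean checkerboard over an L x L lattice, flattened;
    True marks the 'active' sites updated by the coupling layer."""
    return (np.indices((L, L)).sum(axis=0) % 2).astype(bool).ravel()

def affine_coupling_forward(phi, s_net, t_net, active):
    """One affine coupling layer: active sites are scaled and shifted
    by functions of the (unchanged) passive sites.
    Returns the transformed batch and the log-det-Jacobian."""
    passive_vals = phi[:, ~active]
    s = s_net(passive_vals)            # log-scales for the active sites
    t = t_net(passive_vals)            # shifts for the active sites
    out = phi.copy()
    out[:, active] = phi[:, active] * np.exp(s) + t
    return out, s.sum(axis=1)

def affine_coupling_inverse(phi_out, s_net, t_net, active):
    """Exact inverse: s and t are recomputed from the passive sites,
    which the forward pass left untouched."""
    passive_vals = phi_out[:, ~active]
    s = s_net(passive_vals)
    t = t_net(passive_vals)
    phi = phi_out.copy()
    phi[:, active] = (phi_out[:, active] - t) * np.exp(-s)
    return phi
```

A full coupling block alternates the active and passive partitions so that every site is transformed; stacking such blocks, with an RQS transformation in place of the affine one in the final block, gives the general design discussed above.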
From Figure 11 we also see that enforcing Z2-equivariance in the affine layers, as described in Section III F, leads to higher acceptance rates that fall off less steeply as the lattice size increases. Despite the theoretical flaw in the design of the Z2-equivariant spline layers, we still report on a brief investigation into their performance. Figure 12 shows that in the symmetric phase (which includes the couplings in Table I) we did not find that using these layers improved acceptance rates, though they appear to do so in the broken phase.
We also report on some preliminary experiments with convolutional networks, summarised in Figure 13. The convolution-based models are highly parameter-efficient in comparison with the fully-connected ones, since the number of parameters is decoupled from the lattice size (with the caveat that deeper models are required when the correlation length grows proportionally to the lattice size). We expect this parameter efficiency to translate to improved scaling of training costs with respect to models using fully-connected networks. However, on these small lattices we were able to reach significantly higher acceptance rates using fully-connected networks, in less time than it took to train even the simplest convolution-based model. We are currently working on a more systematic and large-scale comparison of models based on convolutions versus fully-connected networks, focusing particularly on the scaling towards the continuum limit.
Having converged on a general recipe for f θ (some Z2-equivariant affine layers followed by a single RQS block, using fully-connected neural networks), we have a number of possible ways to make the map more flexible: adding more affine layers; increasing the size of the neural networks; increasing the number of segments in the splines. Our goal is to figure out the extent to which each of these improves the ability of the model to fit p(φ) while minimising the amount of redundancy in the model's parameters.

FIG. 11. Comparison between three groups of flow models using different coupling layers. One RQS block contains approximately the same number of parameters as six affine blocks. 'Equivar' refers to the Z2-equivariant affine layers described in Section III F. The green data points correspond to flows in which the affine layers were similar to those used in Reference [1] (though ours used neural networks that were both narrower and shallower). Models were trained in an identical fashion, for 16000 iterations with a batch size of 16000. The physical parameters are given in Table I.
Firstly, our experiments, shown in Figure 14, indicate that prepending more affine layers to the flow leads to modest improvements in the acceptance rate for a fixed training length, becoming increasingly worthwhile as the system size increases. However, there is certainly a law of diminishing returns, and on the lattices that we explored for this work the acceptance seems to plateau by the time we reach five affine layers. The law of diminishing returns also applies to adding more segments to a spline transformation; we found that using 8 segments was reasonable, and resulted in RQS layers that were substantially more flexible than affine layers without being too slow to train.
During numerous experiments with deep fully-connected networks (only some of which are reported in Figure 15) we failed to observe any consistent improvements, in terms of the Kullback-Leibler divergence or the acceptance rate, over an equivalent model using neural networks with a single hidden layer, i.e. exactly as defined in Equation (35). We checked that the networks did not suffer from the vanishing gradient problem by training them for at least as many epochs as were required for the gradients from all layers to reach the same order of magnitude. We also substituted the tanh activation function for a ReLU, which only saturates in one direction, although this meant we were no longer able to enforce Z2-equivariance within the affine layers, which proved to be a poor exchange.

FIG. 12. Changes in the Metropolis-Hastings acceptance rate (denoted by Ā) of spline-based models, due to enforcing Z2-equivariance in the spline layers, as described in Section III F. Couplings and training hyper-parameters were identical to those in Figure 4. We verified (by inspection of histograms) that each non-equivariant model retained an approximate symmetry, such that both positive and negative magnetisations were sampled with similar frequency.

FIG. 13. Results from preliminary experiments using affine flows with convolutional networks, compared to the fully-connected networks used throughout the rest of this study (single hidden layer of size |Λ|). The number of parameters, |θ|, varies with the number of affine coupling blocks (1 to 12). Convolutional networks had square kernels of size 3 × 3. The brackets in the legend denote the number of channels per hidden convolutional layer, and the number of hidden layers, respectively. Models were trained for 16000 iterations with a batch size of 16000. Physical parameters are as in Table I for 6², 10² and 14² lattices.

FIG. 14. Improvements in the Metropolis-Hastings acceptance rate due to adding more affine coupling blocks, before a rational quadratic spline block performs the final transformation. Models were trained for 16000 iterations with a batch size of 16000, and the physical parameters are those found in Table I.
As shown by Figure 15, we also found that the benefits of increasing the width of neural networks quickly diminished once the hidden layer contained more than H = |Λ| elements (i.e. twice the size of the input layer, which is the passive partition only). Since a doubling of the neural network widths in spline layers increases the number of parameters by a factor of approximately 3K|Λ|, incurring a considerable increase in training costs, it was almost never a better investment of resources than increasing the batch size or the number of training iterations. Figure 15 exemplifies one of the key conclusions of our work: the acceptance rate of our models is strongly dependent on the amount of effort we put into training, while being relatively insensitive to increases in the total number of trainable parameters. In the following section we explore this in a more quantitative fashion.
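As a rough illustration of how parameter counts enter these trade-offs, the following sketch counts weights and biases for single-hidden-layer networks inside affine and RQS coupling layers. The exact layout assumed here (two coupling layers per block, two networks per affine layer, roughly 3K outputs per transformed site for a K-segment spline) is our own simplification for illustration, not a statement of the paper's precise architecture.

```python
def fc_params(n_in, hidden, n_out):
    """Weights + biases of a fully-connected network with one hidden layer."""
    return (n_in * hidden + hidden) + (hidden * n_out + n_out)

def affine_block_params(lattice_sites, hidden):
    """One affine coupling block, assumed to contain two coupling
    layers (one per partition), each with an s-network and a t-network
    mapping the passive half of the sites to the active half."""
    half = lattice_sites // 2
    per_net = fc_params(half, hidden, half)
    return 2 * 2 * per_net  # 2 layers per block, 2 networks per layer

def spline_layer_params(lattice_sites, hidden, n_segments):
    """One RQS coupling layer, assumed to output ~3K spline parameters
    (segment widths, heights and derivatives) per transformed site."""
    half = lattice_sites // 2
    return fc_params(half, hidden, 3 * n_segments * half)
```

On a 6² lattice with H = |Λ| = 36 and K = 8 segments, a single spline layer under these assumptions already carries several times the parameters of a whole affine block, which is consistent with the rule of thumb quoted in Figure 11 that one RQS block weighs in at roughly six affine blocks.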

D. Scaling of training costs
Neglecting algorithmic or hardware-related factors that contribute to scalability, the cost of training a model up to a given value of some performance metric (e.g. acceptance rate) can be measured in terms of the number of training iterations and the batch size. We will combine these two factors into a single number, |Φ train |, that is the total number of configurations that the model has been exposed to during training. This also happens to be rather convenient for making comparisons with critical slowing down in traditional algorithms, where the integrated autocorrelation time is proportional to the total number of configurations that must be generated to achieve a target error on expectation values. See Appendix C for a set of indicative training times, measured in seconds.

FIG. 15. Comparison of Metropolis-Hastings acceptance rates for flow models with a different number of trainable parameters, |θ|, within their neural networks. The colour axis, |Φtrain|, denotes the total number of configurations used in the optimisation (the batch size multiplied by the number of training iterations). The dashed lines connect models whose neural networks had a single hidden layer, of varying width, whereas following the solid lines equates to increasing the number of hidden layers without changing the widths. Models comprised one affine block followed by one RQS block and were trained for 32000 training iterations with a range of batch sizes. Physical parameters are provided in Table I.
For this part of the study, we trained a large number of models, as per the recipe of the previous subsection (1-5 affine blocks followed by a single spline), using the φ 4 couplings given in Table I. For training we covered a range of batch sizes (250-32000) and training lengths (500-64000 iterations), both of which were incremented in factors of two. In Figure 16 we plot the mean and standard deviation of the acceptance, taken over each set of models trained with the same |Φ train |. The error bars are typically much smaller than the difference between adjacent data points, indicating that the overall quality of optimisation is relatively stable under mixing of the batch size and the number of training iterations, provided the total number of configurations from which the model can learn remains fixed. It is striking that the acceptance rate does not appear to plateau, even for the smallest lattice. This is a sign that the acceptance is limited not by the expressivity of the model, but by the amount of training (whereas Figure 11 shows that pure affine flows are genuinely limited by their expressivity).
When attempting to quantify the scaling of |Φ train | in a way that can be contrasted with critical slowing down, we found that, in order to get sets of points that could be fit reasonably well using a power law, we needed to group the models by sorting them into bins according to their integrated autocorrelation time, and select the 'best' model from each group. These fits are shown in Figure 17. The results are sobering; despite the seemingly large improvements compared to the original formulation of Reference [1], the amount of effort required to train these models is growing at an astonishing rate. While it is true that |Φ train | is a one-off overhead cost, unlike the number of configurations required in a traditional sampling simulation, scaling that goes with the 9th power of the correlation length makes it impossible to avoid the conclusion that this prescription remains far away indeed from a solution to critical slowing down. In Section VII D we discuss why it is that models with relatively few trainable parameters appear to require such a colossal effort to train.

FIG. 16. Relationship between the average acceptance rate from fully-trained models and the total number of configurations exposed to the models during training, i.e. the product of the batch size and the number of training iterations. The φ 4 couplings are given in Table I. Error bars are standard deviations over a number of different models trained using different batch sizes (≤ 32000) and training lengths (≤ 32000), as well as a variable number of affine layers (≤ 5).

FIG. 17. Scaling of the 'training cost', measured by the total number of configurations used for optimisation, for models to reach a certain target sampling efficiency. Models were sorted into 'bins' by the integrated autocorrelation time measured during the sampling phase, and data points represent the best model (lowest τM) from each bin. As usual, the correlation length was fixed at ξ ≈ L/4 (see Table I). Hence, the values in the legend can in some sense be compared to the critical exponent zO which determines the critical slowing down in traditional simulations.

With few exceptions, machine learning has followed a trend towards increasingly deep architectures, in spite of the fact that many of these architectures, including the fully-connected feed-forward neural networks used in this work, require only a single hidden layer to act as universal approximators. The preference for deep architectures stems from two key principles [92]. Firstly, deep architectures are less inclined to over-fit training data and hence generalise better than shallow ones. However, as explained in Section III A, over-fitting is not a problem we will face, since each training iteration exposes the model to a set of previously-unseen training inputs. Secondly, deep architectures can represent some highly non-linear functions more efficiently than shallow architectures. This is connected to their ability to represent functions through hierarchical dependencies between abstract 'features' (Reference [93] argues, convincingly, that this is essentially a consequence of the laws of physics). The truth of this can be rigorously demonstrated for a number of architectures comprising specific functions [93-95], and the empirical evidence is so overwhelming that 'deep and cheap' has become something of a mantra in machine learning.
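The kind of power-law fits used for Figure 17 amount to ordinary least squares in log-log space. The following is a minimal sketch (function name ours), which recovers both the prefactor and the exponent, the latter being the quantity that plays the role of the critical exponent in the comparison with traditional simulations.

```python
import numpy as np

def fit_power_law(x, y):
    """Least-squares fit of y = a * x^b, performed in log-log space.

    np.polyfit returns coefficients highest-degree first, so the slope
    of the straight line in log-log space is the exponent b and the
    intercept is log(a)."""
    b, log_a = np.polyfit(np.log(x), np.log(y), 1)
    return np.exp(log_a), b
```

For exact power-law data the fit reproduces the parameters to machine precision; for noisy training-cost data, the exponent it returns is what one would compare against the z ≈ 2 of cluster or local-update algorithms.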
Nevertheless, the question of whether deep architectures are more efficient is task-specific and should be answered through experimentation. In our case, experiments indicate that increasing the depth of neural networks does not increase the expressive capabilities of the model in a way that translates to a better fit to the target density. It is possible that this result is not merely a peculiar outcome of building the map out of coupling layers, whose structure is quite unusual; we draw attention to Reference [28], in which the conclusion was that a single-layered restricted Boltzmann machine provides a more efficient representation of an Ising system near criticality than any of its deep generalisations. Furthermore, given a fixed number of trainable parameters, we found that a shallow flow with a more flexible spline layer dramatically outperformed a deeper flow using affine layers, for the interacting theory at least.

B. A closer look at the training algorithm
Conventional training, via optimising Equation (18) using a fixed training set, penalises the model for underestimating the density at any point at which there is a training input. This tends to result in models that are smoothed approximations of p(φ), spanning the space of training inputs. If we were to use such a model as the basis of our sampling algorithm, we would expect to see a steady flux of configurations originating from regions of configuration space 'between the peaks' in p(φ), which would typically be rejected by the Metropolis test. The situation for our models is quite different. Instead, the characteristic behaviour, known as 'zero-forcing', is for the optimisation to quickly purge density from any region of configuration space in which p(φ) is very small, resulting in models that fit the modes of the target well but frequently underestimate the low-density tails [96]. When these underestimated regions are eventually sampled, the probability of transitioning away is suppressed by a factor of p̃(φ)/p(φ) due to the Metropolis test. Thus, Markov chain histories in which the acceptance rate is typically quite high, but which contain occasional, 'surprisingly' long periods of consecutive rejections, are a generic feature of models trained in this way.
It is enlightening to look more closely at the precise way in which optimisation based on Equation (20) acts to fit a model to a target density. For this purpose it will be more convenient to write the Kullback-Leibler divergence in the following form:

D_KL(p̃ ∥ p) = E_{φ∼p̃(φ)} [ log p̃(φ) ] − E_{φ∼p̃(φ)} [ log p(φ) ] , (50)

i.e. with the 'irrelevant terms' that are neglected in Equation (20) put back in. The origin of zero-forcing is clear; if log p̃(φ) is large in some region of configuration space in which p(φ) is small, then the optimisation receives a strong gradient signal, because configurations from this region are generated with high frequency and each contributes a large positive term to the objective function.
Using the chain rule, the gradient of Equation (50) with respect to the model parameters can be written as

∇_θ D_KL(p̃ ∥ p) = ∇_θ E_{φ∼p̃(φ)} [ −log p(φ) ] − ∇_θ E_{φ∼p̃(φ)} [ −log p̃(φ) ] ,

where the expectations are evaluated by reparametrising φ = f θ (z) with z ∼ r(z). The gradient with respect to generated configurations therefore contains two terms whose roles can be understood intuitively. One of these acts to reduce the action, E_{φ∼p̃(φ)}[−log p(φ)], driving the optimisation towards the mode(s) of the target density. This process is 'boosted' by the zero-forcing property described previously. The second term acts to increase the entropy, E_{φ∼p̃(φ)}[−log p̃(φ)], driving density away from the mode(s), which is necessary to fit the tails of the target. Figure 18 shows a typical training profile, in which the rapid fitting of the modes, boosted by zero-forcing, yields a fairly high acceptance rate early on in the training, after which a more gradual process of expanding around the modes to fit the low-density tails takes over. Note also that, as the model expands to fill the tails of the target and the acceptance rate goes up, the distribution of consecutive rejections becomes less long-tailed, as demonstrated by Figure 8.
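The decomposition into an 'action' and an 'entropy' term is easy to exhibit in a Monte-Carlo estimate of the reverse Kullback-Leibler divergence. The sketch below (names ours) takes the log-densities evaluated on a batch of model-generated configurations and returns the estimator together with its two competing parts; in an actual training loop this scalar would be the loss that is differentiated with respect to θ.

```python
import numpy as np

def reverse_kl_terms(log_model, log_target):
    """Monte-Carlo estimate of D_KL(p~ || p) from configurations drawn
    from the model, split into the two competing signals:

    - neg_entropy = E_{phi~p~}[log p~(phi)]; minimising it expands the
      model density away from the modes (the weak 'expansion signal'),
    - action_term = E_{phi~p~}[-log p(phi)]; minimising it contracts
      the density onto the modes of the target."""
    log_model = np.asarray(log_model, dtype=float)
    log_target = np.asarray(log_target, dtype=float)
    neg_entropy = np.mean(log_model)
    action_term = np.mean(-log_target)
    return neg_entropy + action_term, neg_entropy, action_term
```

If the model matched the target exactly (up to the neglected constant), the two terms would cancel sample by sample and the estimator would vanish.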

C. A small warning regarding ergodicity
It is worth considering the expected outcome of training a model that is seriously deficient in its inherent capacity to approximate the target density. In such a situation, Equation (50) can be expected to drive the model towards prioritising a certain subset of features of p(φ) at the expense of almost completely ignoring others; the objective function cannot penalise the model for making such a terrible error if the model never generates a configuration from the ignored region. This is well-documented in the literature [97, 98]. Recall that the flow-based approach to sampling is guaranteed to be ergodic; by construction, p̃(φ) has support on R^|Λ| (since it is bijectively related to a Gaussian distribution), which means there is a non-zero probability of any flow model proposing any configuration. However, when the probability of sampling from important regions of configuration space is suppressed to the extent that they are almost never sampled on the timescales of practical simulations, we might speak of an effective breaking of ergodicity.
This appears to be worrying, and might remind the reader of the 'mode collapse' problem faced when training generative adversarial networks. Fortunately, in our case the problem is much less severe, because underestimates of p(φ) must necessarily be compensated by overestimates elsewhere in configuration space, which are then more likely to be penalised (this is a key strength of likelihood-based training). This is not to say that the sort of dramatic error just described cannot occur, but in practice we found that it is far more likely to be caused by instability in the training, due to using a large learning rate and a small batch size, than by the objective function genuinely being minimised by learning a density that is qualitatively different from the target due to model inflexibility.
In the top sub-figure of Figure 19 we show an example of a model that has erroneously broken the Z2 symmetry by being trained too aggressively, the result being that it generates samples of field configurations whose magnetisations all possess the same sign. The other two sub-figures demonstrate that the problem is easily resolved by decreasing the learning rate or increasing the batch size, thereby reducing the likelihood of making a sequence of optimisation steps which collapse one of the two modes. Note that the 'wrong' model attained a higher acceptance rate than both of the 'correct' models. However, this is misleading; the non-symmetric model is a highly inefficient generator of proposals for the Metropolis-Hastings algorithm. If the sampling were run for long enough, eventually a configuration from the collapsed mode would be generated, at which point the Markov chain would freeze due to the enormously suppressed probability of transitioning away. Of course, this is exactly the level of inefficiency we should expect from a process that is attempting to perform a reweighting to a distribution with which there is very little overlap.
Although this example is rather contrived and very easily avoided by simply choosing sensible hyper-parameters, it serves as a warning not to rely on the acceptance rate as the sole indicator of model quality. This conclusion was also reached by the authors of Reference [61]. Looking ahead to more complicated field theories, it will be sensible to check for violations of the symmetries which one knows to be present but which may be broken by the model. In the φ 4 case, we can easily see if the Z 2 symmetry has been broken by looking at histograms of the field variables, and broken translational or rotational symmetries leave clear imprints in the correlation function.
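A check of the kind suggested above (inspecting the sign of the magnetisation across a sample) is cheap to automate. The following sketch (names and tolerance ours) flags an effectively broken Z2 symmetry when the fractions of positive- and negative-magnetisation configurations differ noticeably from one half.

```python
import numpy as np

def z2_symmetry_check(configs, tol=0.1):
    """Diagnostic for an effectively broken Z2 symmetry.

    configs: array of shape (n_samples, ...) of field configurations.
    For a Z2-symmetric model, positive and negative net magnetisations
    should occur with (roughly) equal frequency; returns (ok, frac_pos)."""
    mag = configs.reshape(len(configs), -1).sum(axis=1)
    frac_pos = float(np.mean(mag > 0))
    return abs(frac_pos - 0.5) < tol, frac_pos
```

A model like the one in the top sub-figure of Figure 19, whose samples all share one sign of magnetisation, would fail this check immediately despite its high acceptance rate.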

D. Critical slowing down of the training
Our results strongly suggest that the quality of the overall fit of our models to the target theory is limited by how extensively they have been optimised, not by the inherent expressivity of the models. For the φ 4 couplings given in Table I, we saw the training costs grow dramatically with the correlation length (Figure 17); an explanation for this disappointing result would be nice.

FIG. 19. Histograms of field variables taken from three samples of 10^5 configurations, generated by three different models. The colour labels the sign of the magnetisation. The φ 4 parameters are {L, β, λ} = {6, 0.8, 0.5}. The models had two blocks of affine layers for which Z2-equivariance was not enforced. In the top sub-figure, the large learning rate and small batch size result in the breaking of the Z2 symmetry during optimisation. This is easily avoided by using more sensible learning rates and batch sizes. However, note that the acceptance rate for the top model is larger.
Of course, by moving to larger lattices we also increased the number of trainable parameters. Since we fixed the neural networks to have a hidden layer of size H = |Λ|, the size of the models grows as |θ| ∝ |Λ|^2 = L^4.[15] The results of Figure 17 then imply that, for a fixed target acceptance rate, the number of configurations |Φ train | needed to train models scales with at least the square of the number of parameters requiring optimisation. The precise numbers here are less important than the observation that the exploding training costs are not driven by an exploding dimensionality of the optimisation problem (this was the point of looking for efficient representations).
A convincing explanation for this apparent reduction in efficiency arises from a thorough analysis of the training procedure; specifically, the two terms in the gradient of the Kullback-Leibler divergence described above. Huang et al. [96] showed, both theoretically and empirically,[16] that in several typical cases the 'expansion signal' due to the entropic term is feeble in comparison to the strength of the signal that drives the optimisation towards the mode(s) of the target, due to the action-minimisation term. Put differently, the probability of generating a configuration which produces a gradient, ∇_θ D_KL(p̃ ∥ p), that points away from the local mode of p(φ) is related to the spectrum of eigenvalues of the target's covariance matrix; the more ill-conditioned this matrix (i.e. the more eigenvalues that are close to zero, relative to the principal eigenvalue), the lower the probability. The consequence of this is that the later stages of the training can be extremely inefficient, since a potentially tiny fraction of training inputs contribute to the process of 'expanding' around the mode(s) of the target, through which the model learns to match the low-density tails.

[15] The weights of the connections between two neural network layers of widths n and m can be represented by an n × m matrix.

[16] Albeit in a slightly different context, in which the target density is given by a second generative model.
We refer the reader to the original reference [96] for the details of this argument and empirical results, although the argument is quite intuitive: an ill-conditioned covariance matrix arises when the probability density effectively resides on a low-dimensional sub-manifold -this is sometimes referred to as having a low intrinsic dimension with respect to the actual number of degrees of freedom in the distribution -and in such situations we would expect the majority of training inputs generated by the model (which after all started life as an isotropic Gaussian) to fall outside this low-dimensional sub-manifold, producing a signal for the model to contract further towards the mode. Meanwhile, the rate at which configurations are generated in a direction along which the model needs to expand is very low. Our suspicion is, therefore, that the later stages of the training are predominantly spent compressing the model density onto a low-dimensional sub-manifold, and only very slowly expanding on this sub-manifold to better approximate the tails of the target.
The observations of Reference [96] are highly relevant for data-driven applications, because realistic data distributions tend to have low intrinsic dimension. They are also relevant here; the action becomes increasingly ill-conditioned as the correlation length increases (i.e. as the effective mass tends to zero). If further convincing is required, Reference [99] recently provided empirical evidence that, for a number of lattice models, critical points exist at minima of the intrinsic dimension. Figure 20 shows how the acceptance of the L = 12 models from Figure 4, with varying correlation length but a constant number of degrees of freedom, improved over the course of training. Intriguingly, the profiles look very similar, other than being shifted with respect to each other. Although interpreting this plot in light of the previous discussion is an exercise in speculation, it is possible that a consequence of increasing the correlation length is that a greater proportion of the target density is fit during the slow phase, as it becomes more concentrated on a manifold of low dimension.
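The ill-conditioning referred to above can be made explicit in the Gaussian (free-field) case, where the eigenvalues of the lattice action are known in closed form. The sketch below (names ours) computes the condition number of the free-field covariance on an L × L periodic lattice; as the bare mass squared is taken towards zero, the condition number (m² + 8)/m² blows up, which is the mechanism by which a growing correlation length starves the training of its expansion signal.

```python
import numpy as np

def covariance_condition_number(L, m_sq):
    """Condition number of the free-field covariance on an L x L
    periodic lattice.

    The eigenvalues of the Gaussian action are
        m^2 + 4 * sum_mu sin^2(pi * k_mu / L),  k_mu = 0, ..., L-1,
    and the covariance eigenvalues are their inverses, so the
    condition number is lambda_max / lambda_min of the action."""
    s = 4.0 * np.sin(np.pi * np.arange(L) / L) ** 2
    eigs = m_sq + s[:, None] + s[None, :]
    return eigs.max() / eigs.min()
```

For even L the maximum action eigenvalue is m² + 8 (at momentum (L/2, L/2)) and the minimum is m² (at zero momentum), so for example m² = 1 gives a condition number of exactly 9, while m² = 0.01 gives 801.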
To be clear, it is unsurprising that model optimisation should become increasingly challenging as a critical point is approached. What is perhaps more interesting is the way in which a long correlation length afflicts the particular training scheme used here, although it is very possible that the effect just described is subdominant on the small lattices studied in this work (in which case we need to identify the dominant source of inefficiency). Much more work will be needed to disentangle the various factors contributing to the overall scaling of training costs and to paint a more quantitative picture, but if this alternative manifestation of critical slowing down turns out to be a key bottleneck, then attention should turn to establishing whether it can be tamed more readily than the familiar version that hampers traditional MCMC. In Section VIII B we discuss some possible avenues for further investigation.

VIII. CONCLUSIONS AND OUTLOOK
We have verified the key results of Reference [1]: firstly, that there exist approximately trivializing maps in two-dimensional lattice φ 4 theory that are accessible in practice to normalizing flow models with tractable Jacobian determinant; and secondly, that using such models as generators of proposals for a Metropolis-Hastings simulation represents a complete transfer of the computational costs normally associated with critical slowing down to the cost of training the flow model. As is often the case in machine learning, the efficiency of the procedure depends on the tuning of hyper-parameters, which we have investigated in this work. We have shown that fairly modest modifications of the original prescription (inserting a more expressive transformation at the final layer of the flow and drastically reducing the size of the neural networks) lead to much more efficient representations of approximately trivializing maps for this system. The extent of the associated reduction in training costs is such that the systems studied here and in Reference [1] are accessible with the sorts of computing resources typically found in personal computers. However, our main finding is that the rate at which training costs scale as we move towards the continuum limit is extremely large, increasing far more quickly than the size of the models. Our work demonstrates a rather urgent need to understand and mitigate inefficiencies in the training algorithm itself when the target of optimisation is a lattice field theory in the critical regime, and this must be done in parallel with efforts towards building more sophisticated flow models.
Below, we outline our immediate intentions for further research and describe several potential options for improving on the scaling of training costs.
A. Non-linear σ models
Further work on our part will focus on the O(N) and CP^{N−1} non-linear σ models. The latter class of models is a good milestone on the journey to QCD since they exhibit many of the interesting non-perturbative features observed in QCD whilst being much more amenable to numerical study, being two-dimensional theories without even a fundamental gauge field. There is precedent for expecting CP^{N−1} to challenge this method; previous studies uncovered pathological critical slowing down of the topological charge Q, consistent with τ_Q ∼ e^{cξ} [10], and it was precisely this situation which the original formulation of trivializing maps failed to resolve [53].
Non-linear σ models also have a non-trivial field space, though one can parameterise any O(N) element or CP^{N−1} representative using the unit n-spheres. Fortunately for us, several recent efforts have augmented the set of tools at our disposal with methods dedicated to problems involving angular variables [58,59,100,101]. The generalisation is straightforward in principle: as well as being invertible and continuously differentiable, the flow is required to respect the topological properties of the field space. However, one should be aware that in most implementations of normalizing flows, the present work included, a Euclidean setting is implicit in the change-of-variables formula defined by Equation (16) as well as, more fundamentally, in the neural networks, which naturally act on Euclidean vectors.
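To illustrate what respecting the topology entails, a minimal example of a valid flow on a single angular variable is a diffeomorphism of the circle such as θ → θ + α + β sin(θ) (mod 2π), which is invertible and periodic provided |β| < 1. This is our own toy sketch for illustration, not a transformation taken from the cited references:

```python
import numpy as np

def circle_flow(theta, alpha=0.3, beta=0.5):
    """A simple diffeomorphism of the circle: theta -> theta + alpha
    + beta*sin(theta) (mod 2*pi). Valid for |beta| < 1, since the
    Jacobian 1 + beta*cos(theta) then stays strictly positive."""
    return np.mod(theta + alpha + beta * np.sin(theta), 2 * np.pi)

def circle_flow_logdet(theta, beta=0.5):
    """Log-Jacobian of the map above; periodic in theta, as the
    change-of-variables formula on the circle requires."""
    return np.log1p(beta * np.cos(theta))
```

A generic affine layer acting on θ as a Euclidean coordinate would fail both requirements: it need not map [0, 2π) onto itself, and its Jacobian need not be periodic.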

B. Improving on the scalability with physics
More work is required to determine just how large a role this analogue of critical slowing down (see Section VII D) plays in the exploding cost of training. Our suspicion is that the correlation length of the target theory will contribute significantly as one moves to larger lattices. If our understanding is correct, it may be more efficient to learn a trivializing map to the corresponding free theory, since (with a suitably chosen bare mass) the latent density r(z) then resides on a similarly low-dimensional manifold as the target. We can easily generate real-space configurations of non-interacting fields by simply rescaling Gaussian noise in momentum space and performing a Fourier transform, and the log-density term is simply given by the action of the free theory. A cursory look at some models trained with free fields acting as latent variables indicates that starting from free fields does not dramatically alter the overall quality of optimisation if the target theory is strongly interacting. However, this could be because any structure in the latent density is obfuscated during the fast zero-forcing stage, which may fail to keep the intrinsic dimension low. A more robust strategy might involve gradually flowing through the space of couplings from the free theory to the theory of interest, more akin to the original trivializing maps of Reference [52]. This could be achieved by updating the couplings during training, or by a layer-wise approach to training that forces the trivializing map to follow a constrained path through the space of couplings.
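The free-field sampler described above can be sketched in a few lines. This is a minimal NumPy illustration under our own conventions (unit lattice spacing, periodic boundaries, function names ours), not code from Reference [1]:

```python
import numpy as np

def sample_free_fields(L, m0, batch, rng):
    """Draw real scalar fields on an L x L periodic lattice from the free
    theory S = (1/2) sum_x phi (-laplacian + m0^2) phi, by rescaling white
    noise in momentum space and transforming back to real space."""
    p = 2 * np.pi * np.fft.fftfreq(L)
    # eigenvalues of (-laplacian + m0^2) on the lattice
    lam = m0**2 + 4 * np.sin(p[:, None] / 2) ** 2 + 4 * np.sin(p[None, :] / 2) ** 2
    z = rng.standard_normal((batch, L, L))  # unit-variance white noise
    # phi = F^{-1} Lambda^{-1/2} F z has covariance (-laplacian + m0^2)^{-1};
    # Hermitian symmetry of the spectrum makes the result real up to rounding
    return np.fft.ifft2(np.fft.fft2(z) / np.sqrt(lam)).real

def free_action(phi, m0):
    """The log-density of the samples above is, up to a constant, minus
    this free action, evaluated here with nearest-neighbour differences."""
    s = 0.5 * m0**2 * phi**2
    for ax in (-2, -1):
        s += 0.5 * (np.roll(phi, -1, axis=ax) - phi) ** 2
    return s.sum(axis=(-2, -1))
```

Since the field is Gaussian, one quick sanity check is that the action averages to half the number of degrees of freedom per configuration.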
Notwithstanding issues related to specific training schemes, there are reasons to be a touch more optimistic about the scalability of this method. Most actions or Hamiltonians that are of interest in physics contain exclusively local interaction terms, possess symmetries under certain transformations, and are often approximately or exactly self-similar over multiple scales. Each of these qualities might be exploited to build highly constrained models that yield efficient representations of trivializing maps.
Locality is partly responsible for the success of convolutional architectures, which pass localised filters (convolution kernels) over the data. 'Features' can be extracted at multiple scales by downsampling the data between convolutional layers, a procedure which shares much in common with Kadanoff's block decimation [102], the precursor to renormalization group transformations [103]. One of the implications of renormalizability is that coarse-grained representations of the fields contain useful information encoded as relevant variables, providing a physical motivation for the use of multi-scale architectures which progressively resolve features at finer scales. Multi-scale flows form the basis of the information-preserving renormalization group algorithm of [45], and similar architectures have performed well in the classic application of image synthesis [51], where the same concepts of locality and coarse-grained descriptions hold true, albeit less formally. We also note that convolutional networks that are equivariant or covariant with respect to other types of symmetry transformations have been introduced in References [104-108].
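The two ingredients just mentioned, translation-equivariant convolutions with periodic boundaries and Kadanoff-style blocking, can be made concrete in a few lines. This is our own minimal NumPy illustration, not code from any cited reference:

```python
import numpy as np

def periodic_conv(phi, kernel):
    """2d cross-correlation with periodic boundary conditions: shifting
    the input field shifts the output identically, i.e. the operation is
    equivariant under lattice translations."""
    k = kernel.shape[0]
    r = k // 2
    out = np.zeros_like(phi)
    for i in range(k):
        for j in range(k):
            out += kernel[i, j] * np.roll(phi, (r - i, r - j), axis=(0, 1))
    return out

def block_decimate(phi):
    """Kadanoff-style blocking: average each 2x2 block to produce a
    coarse-grained field with half the linear extent."""
    return 0.25 * (phi[0::2, 0::2] + phi[1::2, 0::2]
                   + phi[0::2, 1::2] + phi[1::2, 1::2])
```

Stacking these two operations, convolve then decimate, is the basic building block of the multi-scale architectures discussed above.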

C. Alternatives for improving scalability
One might describe the ideas suggested above as 'physics-based'; the ambition is to somehow incorporate known physics to construct models and training schemes that are 'better informed' about the nature of the target density. However, there are a number of 'technical' options which might (at least partially) circumvent the issue with training. One of these is to devote some effort to generating a training set of configurations via traditional MCMC (potentially as a bootstrapping of the generative training update), and include these in the optimisation. Even a relatively small number of training examples sampled from the low-density tails may help to improve the rate at which the model expands to fit them, though we note that Reference [1] did not report any significant improvements by using a pre-generated training set. There are also ways forward in which we just accept that sampling from regions of low density is a weakness of the generative approach. For example, it would be trivial to incorporate regular local updates alongside generative proposals in the sampling phase, potentially getting the best of both worlds; local moves could offset biases in regions of low target density more efficiently than the Metropolis test alone, which as we have seen results in long periods of consecutive rejections. Another possibility, more appealing in terms of scalability, is to embed generative proposals within a multi-level sampling algorithm such as those developed in References [109,110]. In effect, this means training models to trivialize sub-volumes of the lattice in which the correlation length is cut off by L subvol , trading improved efficiency in the training stage for the additional cost of having to stitch together these sub-volumes.
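A hybrid scheme of the kind just described might interleave a flow-based independence Metropolis-Hastings step with a handful of local random-walk moves. The following is our own toy sketch of one composite update, not an implementation from the references:

```python
import numpy as np

def composite_update(phi, log_p, flow_sample, flow_log_q, rng,
                     step=0.5, n_local=4):
    """One hybrid MCMC update: a flow-based independence proposal,
    followed by n_local single-site random-walk Metropolis moves."""
    # independence proposal: accept with min(1, p(y) q(x) / (p(x) q(y)))
    prop, log_q_prop = flow_sample(rng)
    log_a = (log_p(prop) - log_q_prop) - (log_p(phi) - flow_log_q(phi))
    if np.log(rng.random()) < log_a:
        phi = prop
    # local moves: symmetric proposal, so the ratio is just p(y) / p(x)
    for _ in range(n_local):
        trial = phi.copy()
        i = rng.integers(trial.size)
        trial.flat[i] += step * rng.standard_normal()
        if np.log(rng.random()) < log_p(trial) - log_p(phi):
            phi = trial
    return phi
```

Note that once local moves are allowed to change the state, the flow acceptance test requires evaluating the model density q at the current configuration, which is exactly the evaluation that the invertibility of a normalizing flow makes cheap.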
Even in very optimistic scenarios where improvements to the training scheme alleviate the worst effects of critical slowing down, we would still expect larger batches and longer training runs to universally lead to improvements (up to a point where the expressivity of the model prevents it from being able to fit the target any better), since there is no over-fitting to speak of. Distributing batches over multiple processing units, a.k.a. 'data-parallel' training, is therefore likely to be an essential ingredient of future attempts to scale this method up to larger systems. The efficiency of the data-parallel approach is reduced as node memory limits are exceeded and the batch size per node must be decreased. Hence, it bodes well that we found shallow neural networks to outperform their deep counterparts, suggesting that memory costs will increase relatively slowly. By way of contrast, several state-of-the-art models based on residual networks with hundreds or thousands of layers have such high memory requirements that the layers themselves must be distributed over multiple nodes. In fact, the situation regarding memory is even better, since normalizing flows lend themselves to low memory requirements by their very design; backpropagation through reversible layers (such as coupling layers) can be performed without storing the intermediate vectors [111], meaning that memory requirements do not increase in proportion to the number of coupling layers.
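The memory argument rests on the fact that a coupling layer's input can be reconstructed exactly from its output, so intermediate configurations need not be stored for backpropagation. A minimal sketch of this invertibility, with arbitrary stand-in functions s and t playing the role of the neural networks (our own illustration, not the construction of Reference [111]):

```python
import numpy as np

def coupling_forward(x, s, t, mask):
    """Affine coupling layer: sites where mask is True pass through
    untouched; the rest are rescaled and shifted, conditioned on them."""
    y = x.copy()
    y[~mask] = x[~mask] * np.exp(s(x[mask])) + t(x[mask])
    return y

def coupling_inverse(y, s, t, mask):
    """Exact inverse: the passive sites are unchanged, so s and t receive
    the same inputs as in the forward pass and can be undone exactly."""
    x = y.copy()
    x[~mask] = (y[~mask] - t(y[mask])) * np.exp(-s(y[mask]))
    return x
```

During the backward pass, each layer's input can thus be recomputed from its output on the fly, rather than cached during the forward pass.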

D. Wider context
On a final note, we find it thought-provoking that several generic features of 'real-world' data, such as locality, symmetries and scale invariance, arise in lattice field theories, but in a manner that is more precisely defined, e.g. by terms in the action, renormalization group transformations, etc. Meteorological data, for want of an example, contains correlations over multiple scales and emergent phenomena arising from purely local interactions. We also draw attention to an intriguing study in which machine learning techniques were used to measure the presence of various symmetries (e.g. Z 2 , O(2)) in pieces of artwork [112]. The framework of lattice field theory provides low-level control over these quasi-universal properties through our ability to simply write down an action. Hence, lattice field theory is in many ways an ideal test-bed for improving our understanding of how these properties can be efficiently encoded into statistical models, even when the technique or model under study is destined for completely unrelated applications. A further convenience of working with lattice field theories is that there are several options for generating reproducible data for training or validation in a way that can be done on-demand (with the caveat that state-of-the-art ensembles are extremely expensive to generate) instead of requiring permanent storage. Perhaps it is not too outrageous to imagine that a set of lattice field theories may, in future, become a standard suite for testing and benchmarking new models and algorithms.
In practice, the integrated autocorrelation time defined by Equation (4) must be estimated from a Markov chain of finite length N. However, the statistical error on the autocovariance estimator increases with t, so it is preferable to truncate the sum at some separation W < N. We thus have the estimator

τ̂_{int,O}(W) = 1/2 + Σ_{t=1}^{W} ρ̂_O(t),    (B1)

and must attempt to find the value of W which minimises the sum of:

1. The bias due to truncating the sum,

ε_trunc(W) ≈ −T_O e^{−W/T_O},    (B2)

where we have assumed that W is sufficiently large that the autocorrelation takes a pure exponential form, with T_O being the characteristic relaxation time of the slowest mode of O.

2. The statistical error, approximated by the Madras-Sokal formula [2,114],

ε_stat(W) ≈ 2 τ_{int,O} √(W/N),    (B3)

which uses the approximation T_O ≪ W ≪ N.
We would therefore like to find the minimum of ∆(W) = |ε_trunc(W)| + |ε_stat(W)|. To do so we follow the 'automatic windowing' procedure detailed in Reference [115].

[Table caption: Measurements of the real time taken to train our models to reach an acceptance rate of 70% for the systems studied in Reference [1]. The models were trained on a desktop PC with an Intel i7-7700K quad-core CPU and 16GB RAM. Since we were aiming for speed of training rather than reaching the highest possible acceptance rates, we increased the initial learning rate η0 with respect to our main study, although we do not recommend doing this in general.]
First, note that we can re-cast the integrated autocorrelation time in terms of the equivalent pure exponential decay,

τ*_O = [log((2τ̂_{int,O} + 1)/(2τ̂_{int,O} − 1))]^{−1},

which is equal to zero, rather than 1/2, for uncorrelated data, and so offers improved precision in situations where decorrelation occurs very quickly. We furthermore assume that it is valid to substitute λτ*_O for the slowest mode T_O, with λ being a small constant factor, to be tuned such that the smallest value of W for which

g(W) = e^{−W/(λτ*_O)} − λτ*_O/√(WN)

drops below zero occurs, generally, at a point at which τ̂_{int,O}(W) levels off to a plateau. Since we generally encountered very small integrated autocorrelation times, the approximation in Equation (B3) is probably a poor one. By the same token, however, statistical errors were minimal. Ultimately, a little tuning of λ by visual inspection of τ̂_{int,O}(W) for a small number of experiments was sufficient.
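A compact implementation of a windowing procedure of this type might look as follows. This is a hedged sketch: the stopping rule follows the standard automatic-windowing recipe rather than being copied from Reference [115], with the parameter lam playing the role of λ:

```python
import numpy as np

def tau_int_auto(x, lam=2.0):
    """Estimate the integrated autocorrelation time of a scalar time
    series with automatic windowing (lam plays the role of lambda)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    x = x - x.mean()
    # unbiased autocovariance estimates via FFT with zero padding
    f = np.fft.rfft(x, 2 * n)
    acov = np.fft.irfft(f * f.conj())[:n] / np.arange(n, 0, -1)
    rho = acov / acov[0]
    tau = 0.5
    for w in range(1, n // 2):
        tau += rho[w]
        # equivalent pure exponential decay time; zero for uncorrelated data
        if tau > 0.5:
            tau_exp = lam / np.log((2 * tau + 1) / (2 * tau - 1))
        else:
            tau_exp = 1e-10
        # stop at the first W where truncation bias and statistical
        # error are balanced, i.e. g(W) changes sign
        g = np.exp(-w / tau_exp) - tau_exp / np.sqrt(w * n)
        if g < 0:
            return tau
    return tau
```

For a process with a single exponential autocorrelation mode, e.g. an AR(1) chain with coefficient a, the estimate should plateau near the exact value τ_int = (1 + a) / (2(1 − a)).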