Event Generation with Normalizing Flows

We present a novel integrator based on normalizing flows which can be used to improve the unweighting efficiency of Monte-Carlo event generators for collider physics simulations. In contrast to machine learning approaches based on surrogate models, our method generates the correct result even if the underlying neural networks are not optimally trained. We illustrate the new strategy with Drell-Yan type processes at the LHC, both at leading and partially at next-to-leading order QCD.


I. INTRODUCTION
Numerical simulation programs are a cornerstone of collider physics. They are used for the planning of future experiments, the analysis of current measurements and, finally, reinterpretation based on an improved theoretical understanding of nature. They employ Monte-Carlo methods to link theory and experiment by generating virtual collider events, which can then be analyzed like actual events observed in detectors [1,2].
With more and more data available from the Large Hadron Collider (LHC) and its high-luminosity upgrade, the task of simulating collisions at high precision becomes a matter of concern for the high-energy physics community. The projected amount of computational resources falls far short of the needs for precision event generation [3]. Past studies of the scaling behavior of multi-jet simulations have shown that the compute needs are largely determined by the gradually decreasing unweighting efficiency [4,5]. Except for dedicated integrators, which require a detailed understanding of the physics problem at hand, adaptive Monte-Carlo methods seem the only choice to address this problem [6-13].
With the rise of machine learning, this topic has recently seen a resurgence of interest. The possibility of using these techniques for integration in high-energy physics was first discussed in Ref. [14], where Boosted Decision Trees and Generative Adversarial Networks (GANs) were investigated as possible general-purpose integrators. This new technique improved the integration of non-separable, high-dimensional functions for which traditional algorithms fail. The first true physics application was presented in Ref. [15]. The authors used Dense Neural Networks (DNNs) to perform a variable transformation and demonstrated significantly larger efficiencies for three-body decay integrals than standard approaches [16]. The major drawback of this method is its computational cost: since the network acts as a variable transformation, its gradient must be computed for each inference point in order to determine the Jacobian. This becomes computationally heavy for high-multiplicity processes.
A completely orthogonal approach utilizes machine-learning techniques directly for amplitude evaluation [17] or event generation [18-24]. Training data for these approaches are obtained from traditional event-generation techniques, hence the problem of efficient event generation remains. In addition, one needs to ensure that the neural networks are trained well enough to approximate the original integrand. If this is not the case, the resulting generator will not only be inefficient, but may actually yield the wrong result.
In this publication we propose a novel idea to address the problem: we replace standard adaptive algorithms like Vegas [6,7] by the extension [25,26] of the Nonlinear Independent Components Estimation technique (NICE) [27,28], also known as a Normalizing Flow. This algorithm is combined with a recursive multi-channel approach [29,30] to form a generic integrator for collider event generation. We test its performance in Drell-Yan type processes at the LHC, computed both at leading and partially at next-to-leading order QCD. We focus our study on the event-generation efficiency in comparison to the general-purpose matrix-element generator Comix [30]. While the training of the neural networks during the adaptation stage of the Normalizing Flow integrator is a very time-consuming operation, event generation is inexpensive, because no gradients need to be computed. This could make the technique a prime candidate for LHC event generation in the near future.
This manuscript is organized as follows. Section II briefly reviews the technique of Monte-Carlo integration and introduces the concept of Normalizing Flows. Section III presents our new integrator. Section IV discusses its computing performance and presents first applications to LHC event generation. Section V contains an outlook.

II. NORMALIZING FLOWS
Monte-Carlo or quasi-Monte-Carlo methods are known as the only viable option to tackle high-dimensional integration problems. The basic technique relies on approximating the integrand by randomly sampling points in the integration domain Ω and weighting each point, x, by the value of the integrand, f(x). The value of the integral is then obtained as the statistical average of all points, and the uncertainty is determined by its variance:

    I = \int_\Omega dx\, f(x) = \langle f(x) \rangle_x , \qquad \sigma_I^2 = \frac{\langle f(x)^2 \rangle_x - \langle f(x) \rangle_x^2}{N - 1} .   (1)

In this context, \langle\cdot\rangle_x indicates that the average is taken with respect to a uniform distribution in x. The variance of the integral can be reduced by importance sampling or stratified sampling [31]. In particular, using the transformation dx = dG(x)/g(x), with G(x) the primitive of g(x), one obtains

    I = \left\langle \frac{f(x)}{g(x)} \right\rangle_{G(x)} .   (2)

The function g(x) can now be chosen appropriately, so as to minimize the variance. In the limit g(x) → f(x)/I, Eq. (2) would be estimated with vanishing uncertainty. The goal is thus to find a distribution g(x) that resembles the shape of f(x) as closely as possible, while being integrable and invertible in order to allow for fast sampling.
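As an illustration of Eqs. (1) and (2), the following minimal sketch (toy integrand, illustrative names) estimates ∫₀¹ 3x² dx = 1 once with flat sampling and once with importance sampling under the density g(x) = 2x:

```python
import math
import random

def mc_estimate(f, g_density, g_sample, n, rng):
    """Monte-Carlo estimate of I = ∫₀¹ f(x) dx per Eq. (2): sample x ~ g and
    average the weights f(x)/g(x); the error follows from the weight variance
    as in Eq. (1)."""
    w = [f(x) / g_density(x) for x in (g_sample(rng) for _ in range(n))]
    mean = sum(w) / n
    var = sum((v - mean) ** 2 for v in w) / (n - 1)
    return mean, math.sqrt(var / n)

rng = random.Random(7)
f = lambda x: 3.0 * x * x                                    # exact integral: 1

# Flat sampling, g(x) = 1:
flat, err_flat = mc_estimate(f, lambda x: 1.0, lambda r: r.random(), 10000, rng)
# Importance sampling with g(x) = 2x (inverse-CDF sampling, x = sqrt(u)):
imp, err_imp = mc_estimate(f, lambda x: 2.0 * x,
                           lambda r: math.sqrt(r.random()), 10000, rng)
```

Since g(x) = 2x tracks the growth of the integrand, the weight variance, and hence the Monte-Carlo error, drops noticeably relative to flat sampling.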
For multi-dimensional integrals, where the variable transformation reads d\bar{x} → dx\, |\partial\bar{x}(x)/\partial x|, we can simply replace the Jacobian, g(x) → |\partial\bar{x}/\partial x|, and Eq. (2) remains valid. This forms the basis for the concept of a Normalizing Flow: for a bijective map, G(\bar{x}), of a random variable \bar{x} that is drawn from a flat probability distribution, the variable x = G(\bar{x}) follows the probability distribution

    g(x) = \left| \frac{\partial G(\bar{x})}{\partial \bar{x}} \right|^{-1} .   (3)

Applying a series of transformations, G_k, where k = 1, …, K, one defines the Normalizing Flow as a bijective mapping between the statistical distributions of the random variables \bar{x} and x_K, with

    g(x_K) = \prod_{k=1}^{K} \left| \frac{\partial G_k(x_{k-1})}{\partial x_{k-1}} \right|^{-1} ,   (4)

where x_0 = \bar{x}.
In practice, G_k is often limited to simple functions, in order to make the determinant of the Jacobian easy to compute. This constrains the level of complexity that can be modeled by the Normalizing Flow. The complexity can be increased using so-called Coupling Layers, which were first introduced in Refs. [27,28]. An alternative technique is based on autoregressive models [32,33]. In this study we focus on Refs. [25,26], which generalize the design of the Coupling Layers proposed in Ref. [28].

A. Coupling Layers
A coupling layer is a special design of a bijector, first proposed in Refs. [27,28]. Figure 1 shows its basic structure. For each bijective mapping, the input variable x = \{x_1, …, x_D\} is partitioned into two subsets, x_A = \{x_1, …, x_d\} and x_B = \{x_{d+1}, …, x_D\}. Under the bijective map, G, the resulting variable transforms as

    x'_A = x_A , \qquad x'_B = C\big(x_B; m(x_A)\big) .   (5)

In this context, m represents the output of a neural network that takes x_A as input and returns the parameters of the invertible "Coupling Transform", C, that is applied to x_B. The inverse map, G^{-1}, is given by

    x_A = x'_A , \qquad x_B = C^{-1}\big(x'_B; m(x'_A)\big) ,   (6)

which leads to the simple Jacobian

    \left| \frac{\partial G(x)}{\partial x} \right| = \left| \det \frac{\partial C\big(x_B; m(x_A)\big)}{\partial x_B} \right| .   (7)

Note that Eq. (7) does not require the computation of the gradient of m(x_A), which would scale as O(D³) with D the number of dimensions. In addition, since C acts component-wise on x_B, its Jacobian ∂C/∂x_B is diagonal, so the computation of the determinant scales linearly with the number of dimensions and is therefore tractable even for high-dimensional problems. This makes the Normalizing Flow method far cheaper than integration techniques that use a neural network directly as the variable transformation. To construct a complete Normalizing Flow, one simply compounds a series of Coupling Layers, with the freedom of choosing any of the input dimensions to be transformed in each layer. We show in Ref. [34] that at most 2 log₂ D Coupling Layers are required in order to express arbitrarily complicated, non-separable structures of the integrand.
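The mechanics of a Coupling Layer can be sketched with a piecewise-linear coupling transform on the unit interval. The "network" below is a fixed stand-in for a trainable m(x_A), and all names are illustrative; note that the Jacobian is read off from the bin density, with no gradient of m required:

```python
import math

def m_network(x_a, n_bins=4):
    # Stand-in for the neural network m(x_A): returns unnormalized bin
    # weights. In practice this is a trainable DNN; any smooth map of x_A
    # suffices to illustrate the transform.
    s = sum(x_a)
    return [math.sin(3.0 * s * (k + 1)) for k in range(n_bins)]

def coupling_forward(x_a, x_b, n_bins=4):
    """Piecewise-linear coupling transform on [0, 1]: x_A passes through
    unchanged, while x_B is mapped through the CDF of a histogram whose
    (softmax-normalized) bin heights come from m(x_A)."""
    w = m_network(x_a, n_bins)
    z = [math.exp(v) for v in w]
    q = [v / sum(z) for v in z]               # normalized bin probabilities
    width = 1.0 / n_bins
    k = min(int(x_b / width), n_bins - 1)     # active bin index
    cdf_left = sum(q[:k])                     # CDF at the left bin edge
    y_b = cdf_left + q[k] * (x_b - k * width) / width
    jac = q[k] / width                        # |dC/dx_B|: the local density
    return x_a, y_b, jac
```

Inverting the map only requires inverting the piecewise-linear CDF, which is why sampling through such a flow is cheap.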
In order to implement a Normalizing Flow integrator in practice, the user must provide a Neural Network, represented by m(x_A), a function f to integrate, and the definition of a loss function. This is discussed in more detail in Ref. [34].

B. Piecewise Polynomial and Rational Quadratic Spline Coupling Transforms
So far we have not specified the invertible coupling transform C. The design of this function determines the flexibility of coupling-layer based Normalizing Flow algorithms and is an active field of research. A very powerful definition of C was introduced in Ref. [25]. Both the domain and codomain of each Coupling Layer are defined to be the unit hypercube. If the transformed variable x' is uniformly distributed, it follows from Eq. (7) that the initial variable x follows the distribution |∂C/∂x_B|. The coupling transform can thus be interpreted as the Cumulative Distribution Function (CDF) of x_B. Each dimension is then divided into K bins, and the CDF is modeled by a monotonically increasing polynomial in each bin. In particular, Ref. [25] experiments with piecewise-linear and piecewise-quadratic coupling transforms. In the piecewise-quadratic implementation, the bin widths are also allowed to vary in order to increase the flexibility of the Coupling Layer.
It may seem natural to generalize the piecewise-polynomial coupling transform to even higher orders in order to increase the expressivity of the Coupling Layer. This has been proposed in Ref. [26], which generalized the piecewise-quadratic coupling transform to a monotonically increasing rational-quadratic function in each bin. To implement this, the bin heights, bin widths and the derivatives at the bin boundaries are allowed to vary and are predicted by the Neural Network.
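The rational-quadratic construction can be illustrated for a single bin. The sketch below (a minimal stand-in, not the implementation of Ref. [26]) interpolates monotonically between two bin edges, given the edge values and edge derivatives that the network would predict:

```python
def rq_spline_bin(x, x0, x1, y0, y1, d0, d1):
    """Monotone rational-quadratic interpolant on one bin, in the spirit of
    the transform of Ref. [26]: it matches the values (y0, y1) and the
    derivatives (d0, d1) at the bin edges and is monotone for d0, d1 > 0."""
    s = (y1 - y0) / (x1 - x0)             # average slope of the bin
    xi = (x - x0) / (x1 - x0)             # position within the bin, in [0, 1]
    num = (y1 - y0) * (s * xi * xi + d0 * xi * (1.0 - xi))
    den = s + (d1 + d0 - 2.0 * s) * xi * (1.0 - xi)
    return y0 + num / den
```

Because both numerator and denominator are quadratic in ξ, the function (unlike a cubic polynomial) stays monotone for any positive edge derivatives, which is what makes it attractive as a CDF model.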

III. PHASE-SPACE INTEGRATION
In this section we briefly summarize the diagram-based [35] recursive [30] multi-channel [29] integration used in our numerical routines. These routines are designed to cope with especially large numbers of outgoing particles and exhibit exponential scaling, reduced from factorial scaling by means of dynamic programming.
Consider a 2 → n scattering process, where we denote the incoming particles by a and b and the outgoing particles by 1 … n. The corresponding n-particle differential phase-space element reads

    d\Phi_n(a, b; 1, …, n) = \left[ \prod_{i=1}^{n} \frac{d^4 p_i}{(2\pi)^3}\, \delta(p_i^2 - m_i^2)\, \Theta(p_i^0) \right] (2\pi)^4\, \delta^4\Big( p_a + p_b - \sum_{i=1}^{n} p_i \Big) ,   (8)

where m_i are the on-shell masses of the outgoing particles. The full phase space can be factorized as [31]

    d\Phi_n(a, b; 1, …, n) = d\Phi_{n-m+1}(a, b; \pi, m+1, …, n)\, \frac{ds_\pi}{2\pi}\, d\Phi_m(\pi; 1, …, m) ,   (9)

where π = \{1, …, m\} corresponds to a set of particle indices. Denoting the complement as \bar{\alpha} = \{a, b, 1, …, n\} \setminus \alpha for all α ⊂ \{a, b, 1, …, n\}, Eq. (9) allows us to decompose the complete phase space into building blocks corresponding to t- and s-channel decay processes, T_{\alpha,b}^{\pi, \overline{\alpha b \pi}} = d\Phi_2(\alpha, b; \pi, \overline{\alpha b \pi}) and S_\pi^{\rho, \pi\setminus\rho} = d\Phi_2(\pi; \rho, \pi\setminus\rho), and an s-channel production process, D^{\alpha,b}, which corresponds to overall momentum conservation and the associated overall weight factor. These objects were introduced as phase-space vertices in Ref. [30], while the integral P_\pi = \int ds_\pi/2\pi (see Eq. (9)) was called a phase-space propagator. In this notation, there is a one-to-one correspondence between the computation of hard matrix elements in the Berends-Giele recursion [36-39] and the computation of phase-space weights. Thus, the computation of phase-space weights can be carried out in a recursive fashion, yielding the same (exponential) scaling as the computation of the hard matrix element. The basic building blocks of phase-space integration can be summarized as

    d\Phi_2(\pi; \rho, \pi\setminus\rho) = \frac{\lambda^{1/2}(s_\pi, s_\rho, s_{\pi\setminus\rho})}{32\pi^2\, s_\pi}\, d\cos\theta\, d\phi , \qquad P_\pi = \int \frac{ds_\pi}{2\pi} .   (10)

Here, λ is the Källén function, λ(a, b, c) = (a − b − c)² − 4bc. Note that, in the context of Monte-Carlo integration, each basic integral ds, dcos θ and dφ in Eq. (10) corresponds to a random variable, chosen in the appropriate range. Details on the construction of the phase-space recursion are given in Ref. [30]. Here we simply recall the result in schematic form:
• Pick an ordered partition of the current multi-index into two subsets, π₁ and π₂.
• If the splitting is of s-channel type, insert an s-channel vertex S_\pi^{\pi_1,\pi_2}; else insert a t-channel vertex T_{\alpha,b}^{\pi_1,\pi_2}.
• Proceed until there is no multi-index left.
We can improve this procedure by forming an average over all possible ordered partitions (OP) of each multi-index.
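The two-body building block of Eq. (10) can be sketched as follows. The function names are illustrative; the per-point weight is simply the flat (cosθ, φ) sampling volume, 2 · 2π, times the standard two-body density λ^{1/2}(s, s₁, s₂)/(32π² s):

```python
import math
import random

def two_body_weight(s, s1, s2):
    """Phase-space weight for one flat (cos θ, φ) point of the dΦ₂ building
    block: sampling volume 2 · 2π times the density λ^{1/2}/(32 π² s)."""
    lam = (s - s1 - s2) ** 2 - 4.0 * s1 * s2      # Källén function
    if lam <= 0.0:
        return 0.0                                # below threshold
    return math.sqrt(lam) / (32.0 * math.pi ** 2 * s) * (2.0 * 2.0 * math.pi)

def sample_two_body(s, s1, s2, rng):
    """Draw cos θ and φ flat, as the random variables of Eq. (10)."""
    cos_t = 2.0 * rng.random() - 1.0
    phi = 2.0 * math.pi * rng.random()
    return cos_t, phi, two_body_weight(s, s1, s2)
```

In the massless limit the weight reduces to the familiar ∫dΦ₂ = 1/(8π); chaining such decays along the recursion, with one s_π integral per propagator, builds up the full n-body phase space.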
Assigning each splitting an adjustable weight defines the recursive multi-channel. Schematically, for s-channel splittings,

    d\Phi(\pi) = \sum_{\pi_1} \omega_\pi^{\pi_1, \pi\setminus\pi_1}\, S_\pi^{\pi_1, \pi\setminus\pi_1}\, P_{\pi_1}\, P_{\pi\setminus\pi_1}\, d\Phi(\pi_1)\, d\Phi(\pi\setminus\pi_1) ,   (11)

with an analogous sum over t-channel vertices T_{\alpha,b}^{\pi_1, \overline{\alpha\pi_1}} for splittings involving the initial state, and where a propagator factor P accompanies every multi-particle set. In this context we define the one- and no-particle phase spaces as dΦ(i) = 1 and dΦ(∅) = 0. The numbers ω correspond to vertex-specific weights, which are normalized as

    \sum_{\pi_1} \omega_\pi^{\pi_1, \pi\setminus\pi_1} = 1 \qquad \text{and} \qquad \sum_{\pi_1} \omega_\alpha^{\pi_1, \overline{\alpha\pi_1}} + \omega^{\alpha,b} = 1 ,   (12)

and can be adapted to optimize the integrator. The sums run over all possible S- and T-type vertices which have a correspondence in the matrix element. The full differential phase-space element is obtained by attaching the s-channel production vertex D^{\alpha,b} to this recursion. The recursive integrator is typically improved by combining it with the Vegas algorithm [6,7]. In this configuration, Vegas generates the input random numbers, x, that are used to perform the basic integrals ds, dcos θ and dφ in Eq. (10). We adopt the same strategy and simply replace Vegas by the Neural Network + Normalizing Flow algorithm proposed in Refs. [27,28] and extended in Refs. [25,26], which is implemented in the i-flow package [34].
We include the multi-channel weights, ω, in the integration, which allows us to work with a single network for all channels.All other components of the integrator remain the same.This makes it possible to directly compare the performance to existing algorithms.As the Normalizing Flow method is capable of capturing correlations in the multi-dimensional phase space, while still exhibiting polynomial scaling with the problem size, one would expect a performance improvement particularly when the number of dimensions in the problem is large.
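The multi-channel idea itself can be sketched in a few lines (toy channels on the unit interval, hypothetical names): a channel is picked according to its weight α_i, but each point is weighted with the combined density g(x) = Σ_i α_i g_i(x), so the estimate is unbiased for any choice of the α_i:

```python
import math
import random

def multichannel_sample(channels, alphas, rng):
    """Pick a channel with probability alpha_i, sample from it, and return
    the point together with the combined mixture density g(x)."""
    u, acc, x = rng.random(), 0.0, None
    for (sample, density), alpha in zip(channels, alphas):
        acc += alpha
        if u <= acc:
            x = sample(rng)
            break
    g = sum(alpha * density(x) for (_, density), alpha in zip(channels, alphas))
    return x, g

# Two hypothetical channels on [0, 1]: a flat one and one peaked toward x = 1.
channels = [
    (lambda rng: rng.random(),            lambda x: 1.0),      # g1(x) = 1
    (lambda rng: math.sqrt(rng.random()), lambda x: 2.0 * x),  # g2(x) = 2x
]
alphas = [0.5, 0.5]

rng = random.Random(11)
f = lambda x: 3.0 * x * x                 # toy integrand with integral 1
n = 20000
estimate = sum(f(x) / g for x, g in
               (multichannel_sample(channels, alphas, rng) for _ in range(n))) / n
```

Adapting the α_i (here held fixed) to minimize the variance is exactly the role of the vertex weights ω in the phase-space recursion.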

IV. NUMERICAL RESULTS
This section presents a numerical study of the novel phase-space integrator. All computations are performed using the event-generation framework Sherpa [40-42], with matrix elements provided by Comix [30]. This use case differs slightly from Ref. [43], where matrix elements are computed by Amegic [44]. Amegic performs an explicit sum over color degrees of freedom, while Comix uses Monte-Carlo sampling. The Neural Network + Normalizing Flow integrator could in principle be trained on a combined color-momentum space when used with Comix. We have tested this approach, but have not found an advantage for processes beyond W/Z + 1 jet. Hence we refrain from using this technique.

A. Definition of efficiency
In order to assess the performance of the new integrator, we investigate its unweighting efficiency. The most basic definition of the unweighting efficiency is the average weight during event generation (i.e. the integral, Eq. (1)) divided by the maximum weight,

    \eta = \frac{\langle f/g \rangle}{\max\{f/g\}} .   (13)
Its inverse corresponds to the number of events that must be drawn on average before an event is accepted with unit weight. If the distribution of weights does not exhibit a sharp upper edge, the denominator in Eq. (13) will depend on the sample size, and the efficiency will decrease with increasing number of events. This is particularly worrisome when events are generated in a distributed computing approach (e.g. on the LHC Computing Grid). Each individual compute job will have its own, individual maximum, say max\{f/g\}_i. In order to retain the accuracy of the combined event sample, each subsample must then be weighted by max\{f/g\}_i / max_j\{max\{f/g\}_j\}.
Here we follow a different approach. It has recently been pointed out that a weighted combination of event samples is prone to outliers in the weight distribution, unless adaptive algorithms continue to be optimized during event generation [45]. If event generation is performed in a distributed fashion, this cannot be achieved, as the individual compute jobs do not communicate. The efficiency should therefore be computed based on the number of points used during the last iteration of the optimization. This definition is still prone to possible outliers, which we remove by performing a bootstrap:
1. Assuming the number of events during optimization was N_opt, draw n·N_opt events.
2. From these events, select m replicas of N_opt events each and compute their maximum weight.
3. Compute the total maximum, w_max, as the median of the individual maxima.
The efficiency is then given by

    \eta = \frac{\langle w \rangle}{w_{\max}} .   (14)

There will be a number of event weights that exceed w_max. We can account for the mismatch on an event-by-event basis by recording their relative weight. Formally, if an event is generated with weight w_i, we keep it with the weight

    \tilde{w}_i = \max\left\{ 1, \frac{w_i}{w_{\max}} \right\} \Theta\!\left( \frac{w_i}{w_{\max}} - R \right) ,   (15)

with R ∈ [0, 1] a uniformly distributed random number. The event sample will then be partially weighted, unweighted against a maximum that corresponds, on a statistical basis, to the largest weight probed by the adaptive integrator during optimization. This allows one in principle to reach arbitrary precision by applying the bootstrap and jackknife techniques of Ref. [46]. Possible practical implementations of this method are discussed in Ref. [45].
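The bootstrap definition of w_max and the partial unweighting step can be sketched as follows (toy exponential weight distribution, illustrative names and parameters):

```python
import random

def bootstrap_wmax(weights, n_opt, m, rng):
    """Steps 1-3 above: resample m replicas of n_opt weights and take the
    median of the replica maxima as w_max."""
    maxima = sorted(max(rng.choice(weights) for _ in range(n_opt))
                    for _ in range(m))
    return maxima[m // 2]

def partial_unweight(w, w_max, rng):
    """Partial unweighting: accept if w > R * w_max; an accepted event
    keeps the weight max(1, w / w_max)."""
    if w <= rng.random() * w_max:
        return None                       # event rejected
    return max(1.0, w / w_max)            # accepted, possibly overweighted

rng = random.Random(3)
# Hypothetical long-tailed weight distribution, as left after adaptation.
weights = [rng.expovariate(1.0) for _ in range(5000)]
w_max = bootstrap_wmax(weights, n_opt=1000, m=101, rng=rng)
kept = [v for v in (partial_unweight(w, w_max, rng) for w in weights) if v]
```

Since the acceptance probability times the kept weight equals w/w_max for every event, the weighted sum of accepted events reproduces the total weight of the input sample, which is what keeps the procedure unbiased.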

B. Hyperparameter Optimization
We use a quasi-random search strategy to optimize the hyperparameters of the Neural Networks. The basic idea of random search was proposed in Ref. [47], and follows the known strategy used in tuning Monte-Carlo event generators [48]. It was pointed out that a random search typically covers the space of hyperparameters better than a grid-based search, since the influence of the individual hyperparameters is often uncorrelated. Since computing resources are limited and the computational cost of the training is fairly large, we resort to Sobol sequences to adequately populate the hyperparameter space, H. The figure of merit for a configuration c ∈ H is its unweighting efficiency. Due to the limited computing resources, we redefine, for the purpose of tuning, w_max in Eq. (14) to be the 99.9th percentile of the distribution of nonzero absolute values of the weights. We have tested that this makes the definition of w_max stable against variations in the random seed and gives results that are qualitatively compatible with the technique described in Sec. IV A. The main advantage of the alternative definition is that it allows us to determine the target number of points for the computation of w_max as 1/(η_cut σ² ε), where η_cut is the cut efficiency, σ is the desired Monte-Carlo accuracy, and ε is the percentile defining w_max.
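A quasi-random scan can be sketched as follows. For simplicity the sketch uses a Halton sequence as a stand-in for the Sobol points of the actual scan, and a toy figure of merit in place of the unweighting efficiency; all names are illustrative:

```python
def van_der_corput(i, base):
    """Radical-inverse sequence in the given base; one prime base per
    dimension yields a Halton point (a simple low-discrepancy stand-in
    for Sobol points)."""
    f, r = 1.0, 0.0
    while i > 0:
        f /= base
        r += f * (i % base)
        i //= base
    return r

PRIMES = (2, 3, 5, 7, 11)

def quasi_random_scan(space, n_points, figure_of_merit):
    """Evaluate the figure of merit on a low-discrepancy scan of the
    hyperparameter ranges in 'space' and return the best configuration."""
    best_cfg, best = None, float("-inf")
    for i in range(1, n_points + 1):
        cfg = [lo + van_der_corput(i, PRIMES[d]) * (hi - lo)
               for d, (lo, hi) in enumerate(space)]
        val = figure_of_merit(cfg)
        if val > best:
            best_cfg, best = cfg, val
    return best_cfg, best

# Toy figure of merit peaked at (0.3, 0.7), standing in for the
# unweighting efficiency of a hyperparameter configuration.
space = [(0.0, 1.0), (0.0, 1.0)]
cfg, val = quasi_random_scan(space, 200,
                             lambda c: -((c[0] - 0.3) ** 2 + (c[1] - 0.7) ** 2))
```

Because the low-discrepancy points fill the space evenly, even a modest budget of configurations lands close to the optimum, which is the practical appeal over a grid when training is expensive.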
Table I shows the sampling boundaries for H, and the optimal choice for W + 1 jet. Figure 2 shows the distribution of sampling points in the N_epochs-LR and N_bins-N_samples planes. The first suggests that a high learning rate, coupled with a large number of epochs, is beneficial. The second suggests a strong preference for a small number of bins. We also performed a parameter scan for W + 2 jets and found the best configuration to be comparable to the above, yielding a similar unweighting efficiency as the best configuration in the W + 1 jet setup. Due to limited computing resources, we did not perform a separate hyperparameter scan for W + ≥ 3 jets.

C. Comparison to existing approaches
In this section we compare the performance of our integrator based on the Normalizing Flow technique to the best alternative method available in the public event-generation program Sherpa [40-42]. The basic computational setup is analogous to Ref. [4]. We consider W±- and Z-boson production in proton-proton collisions at the high-luminosity LHC at √s = 14 TeV. We use the NNPDF3.0 NNLO PDF set [49] and evaluate the strong coupling accordingly. Jets are defined using the k_T clustering algorithm with R = 0.4, p_T,j > 20 GeV and |η_j| < 6. Following the good agreement between parton-level and particle-level results established in Ref. [50], and the good agreement between fixed-order and MINLO [51] results established in Ref. [52], the renormalization and factorization scales are set accordingly [53].

TABLE II: Unweighting efficiencies at the LHC at √s = 14 TeV using the NNPDF 3.0 NNLO PDF set and a correspondingly defined strong coupling. Jets are identified using the k_T clustering algorithm with R = 0.4, p_T,j > 20 GeV and |η_j| < 6. In the case of Z/γ* production, we also apply the invariant-mass cut 66 GeV < m_ll < 116 GeV.
Table II shows a comparison of unweighting efficiencies defined according to Eq. (14), with N_opt = 2·10⁴, n = 50, and m = 10³. We give results both for leading-order cross sections and for the subtracted real-emission corrections to next-to-leading order cross sections, using the dipole method of Catani and Seymour [54]. The subtracted real-emission corrections typically present the biggest challenge in cross-section calculations at next-to-leading order in the perturbative expansion, and therefore drive the computing demands of precision simulations for LHC experiments. The new integrator based on Neural Networks and Normalizing Flows gives a much larger unweighting efficiency than Sherpa in processes with few jets, both at LO and at NLO precision. In processes with more final-state jets it performs similarly to the existing integration techniques in Sherpa. The Neural Network technique generally performs better when we do not combine it with a multi-channel approach for initial-state integration. Only the major features of the final state should be mapped out, for example the Breit-Wigner resonance in W± production. This indicates, unsurprisingly, that the Normalizing Flow approach is more efficient at approximating smooth structures of the integrand than at differentiating between effectively independent integration domains. It explains the strikingly lower efficiencies in Z-boson production processes, where we have combined a 1/ŝ integrator and a Breit-Wigner mapping. In general, the new method is best applied to low-multiplicity problems, where the training of the neural networks can be performed at reasonable speed with relatively few samples per epoch. We expect that in the high-multiplicity cases the Neural Network + Normalizing Flow technique will also outperform Sherpa, if it can be trained over sufficiently many epochs with sufficiently many sample points. However, due to restricted computing resources, we were not able to verify this claim for the case of W±/Z + 4 jet production. The picture might be altered by future implementations of matrix-element generators on accelerators. For exploratory work on this topic, see Refs. [55,56].

V. CONCLUSIONS
We have presented a novel approach to phase-space integration for collider physics simulations, based on Neural Networks and Normalizing Flows. The integrator is implemented as an add-on to the existing event generator Sherpa and can be used with both internal matrix-element generators, Comix and Amegic. Neural-network hyperparameters were tuned using a quasi-random search strategy. For the optimal set of parameters, the unweighting efficiency of the integrator exceeds that of conventional methods by a factor of two to three in simple processes. In high-multiplicity processes, traditional techniques tend to perform similarly well, while also requiring fewer computing resources. We expect this picture to change as implementations of matrix-element generators on accelerators such as GPUs and TPUs become available. Additional possible improvements of Normalizing Flows are discussed in Ref. [34]. These findings are corroborated by the results presented in Refs. [34,43].

FIG. 1: Structure of a Coupling Layer. m is the output of a neural network and defines the Coupling Transform, C, that is applied to x_B.

FIG. 2: Projections of the sampled parameters and color-coded acceptances. The plot on the left suggests that a high learning rate, coupled with a large number of epochs, is beneficial. The plot on the right suggests a strong preference for a small number of bins. The best-performing configuration is indicated with a star.

TABLE I: Parameter ranges and prior distributions in H, and the values of the optimal point in the scan. We use LR as an abbreviation for the learning rate.