Exhaustive Neural Importance Sampling applied to Monte Carlo event generation

The generation of accurate neutrino-nucleus cross-section models needed for neutrino oscillation experiments require simultaneously the description of many degrees of freedom and precise calculations to model nuclear responses. The detailed calculation of complete models makes the Monte Carlo generators slow and impractical. We present Exhaustive Neural Importance Sampling (ENIS), a method based on normalizing flows to find a suitable proposal density for rejection sampling automatically and efficiently, and discuss how this technique solves common issues of the rejection algorithm.

The generation of accurate neutrino-nucleus cross-section models needed for neutrino oscillation experiments require simultaneously the description of many degrees of freedom and precise calculations to model nuclear responses. The detailed calculation of complete models makes the Monte Carlo generators slow and impractical. We present Exhaustive Neural Importance Sampling (ENIS), a method based on normalizing flows to find a suitable proposal density for rejection sampling automatically and efficiently, and discuss how this technique solves common issues of the rejection algorithm.

I. MOTIVATION
In modern science and engineering disciplines, the generation of random samples from a probability density function to obtain data sets or compute expectation values has become an essential tool. These theoretical models can be described by a target probability density function p (x). Ideally, to generate samples following p (x), the inverse transformation method is used. To perform the inverse transformation, the cumulative probability has to be calculated and the inverse to this function has to be found. Numerical methods have to be applied to obtain the Monte Carlo (MC) data when this is not feasible computationally. This is especially true for high-dimensional spaces, where the integrals required to find such inverse transformation become analytically challenging.
A standard numerical method to obtain such data set is to perform a Markov Chain Monte Carlo (MCMC) algorithm [1], which provides good results for expected value calculations. Compared to other methods, it has the advantage that, in general, it requires very little calibration, and high dimensions can be broken down into conditional smaller dimension densities [2]. However, the MCMC method produces samples that form a correlated sequence. Also, the convergence of the samples' chain to the target density cannot be guaranteed for all possible models.
Another standard algorithm to produce MC samples is the acceptance-rejection or simply rejection sampling [3][4][5][6], which produces i.i.d. (independent and identically distributed) samples from the target density via an auxiliary proposal function. The proposal has to satisfy being a density which can both be sampled from and evaluated efficiently, as well as being as close to the target density as possible. The main disadvantages of the method are the following [7]: 1. Designing the proposal function close to a particular target density can be very costly in human dedication. * pinas@aia.es 2. If taken a generic proposal function, such as a uniform distribution over the domain, the algorithm is usually very inefficient.
3. The inefficiency grows rapidly with the number of dimensions.
Ideally, to avoid these inconveniences, one would like to have a method to find a proposal function that adapts to a given target density automatically. This would solve simultaneously the human time cost as well as the inefficiency of generic proposal densities. An approach of the usage of normalizing flows to find a suitable proposal for a given target density has been suggested previously as Neural Importance Sampling (NIS) [8], focused on the integration of functions via importance sampling [9]. Normalizing flows provide an expressive family of parametrized density functions q φ (x) through neural networks, by defining a differentiable and invertible transformation from a base distribution to a target distribution, allowing to evaluate and sample from complex distributions by transforming simpler ones. The concept of integrating via importance sampling with normalizing flows for High Energy Physics (HEP) has been explored in [10], to simulate collider experimental observables for the Large Hadron Collider.
In this work we further explore the possibility of utilizing normalizing flows to find a proposal function for a given target density to perform rejection sampling for MC samples , and analyze its viability through the following points: • We discuss the importance of adding a background density to the target one to assure the coverage of the whole phase space when performing rejection sampling.
• We define a two-phase training scheme for the normalizing flow to boost initial inefficiency in the optimization when adjusting the initialized random density towards the target one.
• We measure the performance of the method and argue for relaxing the rejection sampling constant arXiv:2005.12719v1 [hep-ex] 26 May 2020 factor k to improve largely the efficiency of acceptance while quantifying the error committed in doing this approximation via the concept of coverage.
Considering the proposed algorithm covers the whole phase space by modifying NIS with the background density, we denote this method by Exhaustive Neural Importance Sampling (ENIS). We apply the above algorithm to a HEP problem, in the form of the charged current quasi-elastic (CCQE) crosssection for anti-neutrinos interactions with nuclei, performing in-depth analysis and discussion of the efficiency of the method. Neutrino-nucleus cross-section modeling is one of the main sources of systematic uncertainties in neutrino oscillations measurements [11]. Cross-section models are either analytically simple but poorly describing the experimental data or involving complex numerical computations, normally related to the description of the nucleus, that imposes limitations in their MC implementation. New tendencies in the field also call for a fully exclusive description of the interaction adding complexity to the calculations. The analytical model utilized in this paper is simple, but it is a realistic one and a good reference to demonstrate the capabilities of the proposed method to generate neutrino-nucleus cross-sections efficiently. We will show that ENIS opens the possibility to incorporate efficiently complex theoretical models in the existing MC models enhancing the physics reach of running and future neutrino oscillation experiments.
ENIS algorithm may be used beyond the scope of neutrino physics. Further applications to be evaluated in detail in the future are particle/nuclear physics experiments, detector responses for medical physics, engineering studies or theoretical modelling. In general, it could be applied to any Monte Carlo simulation that is limited by the algorithm's speed, such as for importance sampling to provide fast Monte Carlo with sufficient accuracy (i.e. fast detector simulation, design studies, minimum bias background simulations, etc.). Additionally, the technique may help model developers extract expected values from their theoretical predictions in realistic conditions by including simple detector effects in models, such as effects of detector acceptance cuts, impact of model degrees of freedom on the predictions or uncertainty propagation.

II. FRAMEWORK
In this Section we will describe background and framework needed for the rest of the paper. Sec. II A explains the physical model of charged current quasi-elastic neutrino interaction we will apply ENIS to in Sec. IV. As a summary and to introduce our notation, Sec. II B overviews the rejection sampling algorithm. Finally, in Sec. II C we make a modest introduction to normalizing flows, focusing on the implementation of Neural Spline Flows.

A. Model of Charged Current Quasi-Elastic Anti-Neutrinos Interactions with Nuclei
The Charged Current Quasi-elastic (CCQE) is a basic model of neutrino interactions that might be expressed in simple formulae. The CCQE model has many advantages during this exploratory work, as it can be implemented in a simple software function, while at the same time it is also a realistic environment to understand the implications of modeling cross-sections with the proposed methodology. The selected model to describe CCQE is the well established Smith-Monith [12]. The nucleon momentum distribution follows a Relativistic Fermi Gas (non-interacting nucleons in a nuclear potential well) with a 0.225 GeV/c Fermi level. The model includes the Pauli blocking, preventing the creation of final state nucleons below the nucleus Fermi level. The model can be applied both to neutrinos and antineutrinos interactions. Antineutrinos are selected for this study due to the vector axial current cancellation imposing more stringent conditions at the edges of the kinematic phase space. The model includes the following degrees of freedom generated by the Monte Carlo model: the neutrino energy, the µ ± momentum and angle, and the target nucleon Fermi momentum. Contrary to other MC implementations, the neutrino energy is not a fixed input value but it is generated by the algorithm to add complexity to the calculations and to check the capabilities of the calculations to reproduce the cross-section as a function of the neutrino energy. The implementation of this model for fixed energy value is also possible. The basic kinematic distributions obtained with this model will be discussed in Sec. IV.

B. Rejection sampling
Rejection sampling is a well known technique [3][4][5][6][7] to obtain MC samples from a target density p (x) which can be evaluated (up to a constant), but cannot be sampled from through the inverse transform. It relies on an auxiliary proposal function q (x), from which one should be able to sample from and evaluate efficiently. A constant k > 0 is introduced which has to satisfy that The resulting function k · q (x) is called the comparison function.
The procedure to sample from the target density is then the following: 2. A random number u is generated uniformly in the range [0, k · q (x)], u ∼ Unif(0, k · q (x)).
3. If u fulfills the condition u ≤ p (x), the sample is accepted; otherwise, it is rejected.
Additionally, if p (x) is normalized, the probability that a sample is accepted is proportional to p acc ∝ 1/k, i.e., k gives an intuition of the number of tries until we obtain an accepted sample.

C. Neural density estimation using Neural Spline Flows
A family of density functions q φ (x) over the real Ddimensional space R D parametrized by φ satisfies that q φ (x) ≥ 0 for all x, φ, and that q φ (x) dx = 1 for all φ. Normalizing flows are a mechanism of constructing such flexible probability density families q φ (x) for continuous random variables x ∈ R D . A comprehensive review on the topic can be found in [13], from which a brief summary will be shown in this Section on how normalizing flows are defined, and how the parameters φ are obtained, together with a specific implementation, the Neural Spline Flows (NSF) [14].
Consider a random variable u defined over R D , with known probability density p u (u). A normalizing flow characterizes itself by a transformation T from another density p (x) of a random variable x defined also over R D , the target density, to this known density, via The density p u (u) is known as base density, and has to satisfy that it is easy to evaluate (e.g., a multivariate D-dimensional normal, as will be chosen through this work, or a uniform distribution in dimension D). The transformation T has to be invertible, and both T and T −1 have to be differentiable, i.e., T defines a diffeomorphism over R D . This allows us to evaluate the target density by evaluating the base density using the change of variables for density functions, where the Jacobian J T (x) is a D × D matrix of the partial derivatives of the transformation T : The transformation T in a normalizing flow is defined partially through a neural network with parameters φ, as will be described below, defining a density The subindex of T φ will be omitted in the following, simply denoting the transformation of the neural network by T . If the transformation is flexible enough, the flow could be used to evaluate any continuous density in R D . In practice, however, the property that the composition of diffeomorphisms is a diffeomorphism is used, allowing to construct a complex transformation via composition of simpler transformations. Consider the transformation T as a composition of simpler T k transformations: Assuming z 0 = x and z K = u, the forward evaluation and Jacobian are These two computations (plus their inverse) are the building blocks of a normalizing flow [15]. Hence, to make a transformation efficient, both operations have to be efficient. From now on forth, we will focus on a simple transformation u = T (x), since constructing a flow from it is simply applying compositions.
To define a transformation satisfying both operations to be efficient, the transformation is broken down into autoregressive one-dimensional ones for each dimension of R D : where u i is the i-th component of u and x i the i-th of x. τ is the transformer, which is a one-dimensional diffeomorphism with respect to x i with parameters h i . c i is the i-th conditioner, a neural network, which takes as input x <i = (x 1 , x 2 , . . . , x i−1 ), i.e., the previous components of x, and φ are the parameters of the neural network. The conditioner provides the parameters h i of the i-th transformer of x i depending on the previous components x <i , defining implicitly a conditional density over x i with respect to x <i . The transformer is chosen to be a differentiable monotonic function, since then it satisfies the requirements to be a diffeomorphism. The transformer also satisfies that it makes the transformation easily computable in parallel and decomposing the transformation in one dimensional autoregressive transformers allows the computation of the Jacobian to be trivial, because of its triangular shape. To compute the parameter h i of each transformer, one would need to process a neural network with input x <i for each component, a total of D times.
Masked autoregressive neural networks [16] enable us to compute all the conditional functions simultaneously in a single forward iteration of the neural network. This is done by masking out, with a binary matrix, the connections of the h i -th output with respect to all the components with index bigger or equal to i, ≥ i, making it a function of the < i components.
In their work on Neural Spline Flows [14], Durkan et al. advocate for utilizing monotonic rational-quadratic splines as transformers τ , which are easily differentiable, more flexible than previous attempts of using polynomials for these transformers, since their Taylor-series expansion is infinite, and are analytically invertible.
Each monotonic rational-quadratic function in the splines is defined by a quotient of two quadratic polynomial. In particular, the splines map the interval [−B, B] to [−B, B], and outside of it the identity function is considered. The splines are parametrized following Gregory and Delbourgo [23], where K different rational-quadratic functions are used, with boundaries set by the pair of Having this in mind, the conditioner given by the neural Stacking up many of these transformations, a highly flexible neural density estimator, the NSF, can be build, which satisfies: 1. It is easy to sample from q φ (x) using the inverse transform T −1 in Eq. (2) by sampling u ∼ p u (u).
2. Eq. (3) allows to evaluate the densities q φ (x) of these samples when generating them in an efficient way.
This density estimator will be the one utilized during this work.

III. METHODOLOGY
With the framework introduced in Sec. II, we are now in a position to define the ENIS method and the different magnitudes we will use to measure its performance.
We start in Sec. III A by showing the objective function to be minimized by the NSF to adjust its proposal function q φ (x) to the target density p (x). Then, in Sec. III B, we discuss the importance of adding background noise to both ensure coverage of the whole phase space of p (x) and to boost the initial phase of the training of ENIS. The exact training scheme is then shown in Sec. III C, differentiating the warm-up phase from the iterative phase. Finally, in Sec. III D, the performance metrics are introduced, explaining the concept of coverage and effective sample size when considering a more relaxed condition on the rejection constant k.
A. Optimizing the parameters of the NSF Consider a target probability density function p (x) which can be evaluated for all x but from which we are unable to sample data directly through an analytical inverse transform. If we could approximate this target density by our neural density estimator q φ (x), then we could exactly sample from the target density using rejection sampling, since we can both sample and evaluate from q φ (x).
To obtain the parameters φ of q φ (x) given a density p (x) which can be evaluated, we want to minimize the Kullback-Leibler divergence (KL-divergence) [24] between both distributions, which is ≥ 0 and only equal to zero if both distributions match: When minimizing with respect to φ, the KL-divergence is simplified to The last expression could be approximated numerically if we could sample x ∼ p (x), since it corresponds to approximating an expected value of a function we can evaluate, log q φ (x), and is equal to maximizing the loglikelihood of these samples: Müller et al. [8] propose a solution for computing the gradient with respect to φ for this maximization problem. They suggest using importance sampling [9] for this particular expected value: with x i ∼ q φ (x) and the weights defined as w(x) = p (x) /q φ (x). Notice how we only need to be able to evaluate p (x) to compute this quantity. With this gradient, we are able to minimize the KL-divergence in Eq. (4) if the support of q φ (x) (i.e., the domain where the function is non-zero) contains the support of p (x) to perform the importance sampling of Eq. (6) correctly. Notice that, in order to properly optimize the parameters φ, p (x) does not need to be normalized, since this simply changes the magnitude of the gradient, but not its direction. The lack of proper normalization can be properly handled by standard neural network optimizers such as Adam [25]. The method described by Eq. (6) implies an iterative way of optimizing q φ (x) with the following steps: 1. A batch of x is generated according to the current state of the neural network, q φ (x).
2. Using this batch, the parameters φ of the neural network are optimized via the gradient of Eq. (6).
3. This updated neural network then generates the next batch.

B. Relevance of background noise
As briefly discussed in previous Section II, in order to optimize the neural network following Eq. (6), the gradient is only correctly computed if the support of q φ (x) contains the one of p (x). Moreover, if we want to use q φ (x) as our proposal function to sample from p (x) via rejection sampling, this also has to hold.
To ensure the proper p (x) support, we introduce the concept of a background density function, p bg (x). In HEP, as in many other scientific areas, the density is restricted to a certain domain of x ∈ R D , e.g., the cosine has to be in [0, 1], the magnitude of the momentum in an experiment has to be positive and has a maximum value of p max , there are constraints in the conservation of energy and momentum, etc... Hence p bg (x) should be a density that has a support beyond these phase-space boundaries. In what follows, a uniform distribution will be considered, with limits in each dimension according to the phase space of that coordinate. The selection of the functional form of the p bg (x) is arbitrary and it can be selected to adapt to the requirements of each project.
The background density p bg (x) will be used for two tasks: (i) Improve initial training: At the beginning of the training, we cannot assure that the support of q φ (x) contains the one of p (x). Hence, instead of using q φ (x) for the importance sampling of Eq. (6), p bg (x) will be used during the warm-up phase of the training. The distribution of the weight function w(x) = p (x) /p bg (x) might span several orders of magnitude, but this way we ensure the full support of p (x). This strategy gives a better approximation than the one obtained by the randomly initialized neural network q φ (x) at the start of the training.
(ii) Ensure exhaustive coverage of the phase space: The target density p target (x) that the neural network will learn will be constructed as a linear combination of the true target density p (x) and the background p bg (x): with α ∈ (0, 1). This implementation adds a certain percentage α of background noise to the target density, spreading it over all the domain of the background density, allowing to properly apply the methods rejection and importance sampling with q φ (x) as the proposal function, covering exhaustively the phase space. Experimentally we have found good compromise with α = 0.05.
Optimizing q φ (x) to match p target (x) of Eq. (7) instead of p (x) will make the proposal q φ (x) slightly worse for rejection/importance sampling. By performing the optimization to p (x) directly in an iterative way, as explained at the end of the last Section, some regions of the phase space might disappear for future samplings. These regions are located normally close to the boundaries of sampled volume. Having a constant background noise prevents these losses from appearing, as the neural network has to also learn to generate this noise, covering properly the required phase-space volume. We will discuss the impact of the background term on the method performance in Section IV.

C. ENIS training scheme of the proposal function
The training procedure to obtain q φ (x) from p (x) following ENIS consists of two phases: 1. Warm-up phase: (i) Sample x p ∼ p bg (x) and compute their weights 2. Iterative phase: (i) Sample x q ∼ q φ (x) and compute their weights w q (x q ) = p (x q ) /q φ (x q ).
(ii) Sample background x bg ∼ p bg (x) with associated weights w bg (x bg ) = C w bg p bg (x bg ), where (iii) Optimize the parameters of q φ (x) via Eq. (6) using The proposed method is not to use q φ (x) as a direct approximation of p (x), but as proposal function to perform either rejection (Sec. II B) or importance sampling [9]. This allows for the methods to correct any deviation in the neural network modeling of the exact density while utilizing its proximity to such density.
We use the learned probability density function q φ (x) to generate samples via rejection sampling (see Sec. II B), which, in HEP, is of high interest and costly via standard procedures. The parameter k of the rejection algorithm has to be estimated empirically. Consider n samples • The average of the weights is where C is the normalization of the density p (x), i.e., its volume.
• k max , the smallest constant k > 0 such that the inequality of Eq. (1) holds, is equal to (max w(x i )) −1 .
In real conditions, the parameter k can be relaxed. Instead of choosing the maximum value among the empirically computed weight distribution, it can be taken as the inverse of the Q-quantile of these weights, w Q , denoted by k Q : This is equivalent of clipping the weights' maximum value to the Q-quantile of w, capping the desired density function p (x) we are generating using these weights for the rejection. The new weights w (x) are simply: The ratio of volume with respect to the original density p (x) we are maintaining by clipping the weights this way defines the coverage we have of the rejection sampling, and is equal to This allows us to quantify the error we are committing when choosing a quantile over the maximum of weights when defining a constant k for rejection sampling.
The idea behind relaxing this constant k is that we will approximate wrongly only a small region of p (x) with q (x). In that small region, the ratio p (x) /q (x) is large compared to the rest of the domain but still it is occupying a small volume of the density p (x). This region can be ignored by relaxing k, making the overall ratio of p (x) /(k · q (x)) closer to 1 and improving drastically the rejection sampling at the cost of this small discrepancy which we are committing, quantified in Eq. (9).
As an additional qualitative measurement of the goodness of different proposals under different constants k, the effective sample size (ESS) will be used [26], which corresponds approximately to the number of independent samples drawn. The ESS for n samples of weights {w(x i )} n i=1 is defined as: This is a rule of thumb to obtain the number of independent samples. The closer N ESS is to the number of samples n, the more uncorrelated the weighted samples are. If large weights appear, then the independence of the samples will be diminished, as a same sample gets represented many times. We define N ESS /N as a rough estimate for the ratio of independence of the samples.

IV. MONTE CARLO GENERATION OF THE CCQE ANTINEUTRINO CROSS-SECTION
We will now proceed to apply ENIS to the CCQE antineutrino cross-section density. In Sec. IV A, we discuss how the training for the NSF was performed, describing the background we added to cover the phase space. We show qualitatively the obtained densities and compare them to the target one. After obtaining a suitable proposal, we discuss in depth the performance of the obtained result in Sec. IV B, comparing the ENIS proposal to a generic uniform one, demonstrating its potential while justifying the relaxation on the constant k for the rejection sampling.

A. Training
To find the proposal function q φ (x) via NSF for the CCQE antineutrino interaction cross-section density, described in Sec. II A, we followed the training scheme from Sec. III C.  The background chosen is a uniform distribution, covering a range of [0, 10] for the incoming neutrino energy E ν (in GeV), [0, 10] for the outgoing muon momentum p µ (in GeV/c), [0, π] for the angle of the outgoing muon θ µ (in rad), and [0, 0.225] for the target nucleon Fermi momentum p nucleon (in GeV/c). These bounds were expanded by covering a slightly more extended domain, of an additional 2% at the beginning and end of each dimension, to assure that the physical boundaries are completely covered. This expanded background was added with a α = 0.05 contribution to the cross-section density in Eq. (7), as well as used during training for the warm-up phase.
As for the hyperparameters of the NSF, we have chosen the Adam optimizer [25] with learning rate 0.0005, batch size 5 000, training steps 400 000, 5 flow steps, 2 transform blocks, 32 hidden features and 8 bins. This gives a total dimension of 37 220 for the parameters φ of q φ (x). This configuration for the NSF was chosen experimentally to have a relatively low number of parameters (one can have easily six million parameters instead of the ≈ 37 000 we have) since a lower number speeds up the generation and evaluation of samples x ∼ q φ (x). Additionally, the learning rate was decreased during the training using a cosine scheduler to ensure stabilization at the end of the training procedure.
The training consists in maximizing the log-likelihood of Eq. (5) by computing its gradient via Eq. (6), and is shown over the 400 000 iterations in Fig. 2. In the grey area, the training is performed with samples of the background distribution p bg (x), while in the white area the samples of the training samples are generated by the current proposal distribution q φ (x). Notice that since the samples are generated in real-time during the training, there is no need to worry about possible overfitting of the parameters of the neural network, which is a common issue in many machine-learning applications. The values of Fig. 2 are computed every one thousand steps, for a batch of 200 000 samples x ∼ q φ (x). The log probability can be seen to converge at the end of the training, which is mainly due to the cosine scheduler, but also due to the saturation over the family of parametrized densities q φ (x).
To have a visual representation, Fig. 3 shows the marginalized 1-dimensional densities of the four crosssection variables of the target density p (x) (blue) vs the NSF proposal q φ (x) (orange). The plots show a small discrepancy in each variable, but an overall agreement between the two densities. Aside from a mismodeling on the side of q φ (x) in certain regions, the differences can also come from the fact that the NSF is learning a modified target density (Eq. (7)). To asses qualitatively that the correlation between the variables are also captured by q φ (x), Fig. 4 shows 2D-histograms for both the real density p (x) (left) and the the proposal density q φ (x) (right). Visually, an overall agreement can be seen. There is a slight discrepancy for high energy p nucleon values, where the attenuation indicates that for the NSF proposal function it is more spread due to the background noise p bg (x) it is also learning (Eq. (7)).
In what follows the performance of the NSF proposal will be discussed in more quantitative ways, and compared it to a uniform proposal.

B. Performance and discussion
In this section we will focus on analyzing the performance of the proposal density obtained by the NSF while also comparing it to a uniform proposal density, p Unif (x), which in our case will be the same as p bg (x), defined in Sec. IV A.
We start by generating ten million samples from each proposal density and compute their associated weights. The proportion of samples with weight equal to zero is 5.56% for the NSF proposal, compared to the 98.03% for the uniform one. To understand the distribution of such weights, Fig. 5 shows the logarithmic scale of them (for the weights > 0), assuming the average of the weights is equal to 1. For the NSF q φ (x) (left), all weights are concentrated around log 10 w = 0 with a small dispersion around it. Notice that there are only three weights in ten million barely greater than hundred. This shape justifies using not the maximum value of w to perform rejection sampling, but some quantile of it, as we will discuss below. Contrary, for the uniform distribution p Unif (x) (right) we can see that the spectrum of weights goes over nine orders of magnitude. The mean for log 10 w q φ is 0.023 ± 0.040, while for log 10 w p Unif we obtain an average of 0.85 ± 0.88, indicating a huge fluctuation in the magnitude of the weights.
The results of the performance test for rejection sampling are summarized in Tab. I, where we compare various quantities for the NSF q φ (x) and uniform p Unif (x) proposal functions. For this, different quantiles for the constant k for the rejection method are used, following Eq. (8), relaxing its restriction as discussed in Sec. III D. The quantiles for k were chosen using the ten million weights computed for the previous discussion of the weight magnitudes, as well as the probability of acceptance, the coverage, and the effective sample size. We considered a case of sampling one million accepted samples via rejection sampling, where samples from the proposal were generated and checked for acceptance/rejection in parallel, in batches of 300 000 samples. The purpose of the parallelization is to exploit the computational capacities of a GPU. We denoted each of these batches of generating and checking a cycle of the rejection sampling. The values in Tab. I, for each quantile value and a proposal function, are the following: • p accept : probability of accepting a single event, given FIG. 4. 2D histograms comparison of cross-section density for the real cross-section p (x) (left) and the proposal density q φ (x)(right). Visually, an overall agreement can be seen. by the average of p (x) /(k · q φ (x)). If p (x) /(k · q φ (x)) > 1, it is taken as 1 for the computation.
• Cycles: number of rejection sampling cycles of size 300 000 samples needed to obtain one million ac-cepted samples: Cycles = 10 6 p accept · 3 × 10 5 (11) • Time: seconds it takes to compute these cycles and obtain one million accepted samples: t cycle · Cycles.
• N ESS /N : the ratio of effective sample size over the total number of samples, quantifying an estimate of the ratio of independence of the events. This was computed for a sample size of 10 million.
The probability of acceptance, p accept , for the NSF is at least one order of magnitude higher than the one obtained from uniform sampling. Additionally, NSF grows rapidly towards ∼ 70% acceptance while also covering > 99% of the original density volume, as shown in the Coverage column. This is not the case for the uniform distribution, which, while being only one order of magnitude behind NSF with regards to acceptance, is missing a large volume of coverage of the original density.
The number of rejection sampling cycles needed to achieve the desired number of accepted samples is inversely proportional to p accept , as shown in Eq. (11). In a cycle, the algorithm has to sample from the proposal and evaluate both the proposal and p (x). For the NSF, the cycles get stalled when reaching a high percentage TABLE I. Performance values for different quantile choices of k for rejection sampling, as discussed in Sec. III D, comparing both NSF q φ (x) and uniform p Unif (x) proposal functions. For this exercise, one million samples were generated, performing rejection sampling in batches of 300 000 tries. The quantities are the probability of accepting a single sample paccept, the number of rejection cycles (batches of 300 000) used to obtain one million accepted samples, the time it took in seconds to generate these accepted samples, the coverage of the target density for that particular quantile (Eq. (9)) and the ratio of effective sample size NESS (Eq. (10) of acceptance, since the number of cycles has to be a whole number, which is equivalent to the whole number +1. Notice how the number of cycles for the NSF is two orders of magnitude smaller compared to the uniform one, however, the coverage of the uniform drops drastically when decreasing the quantile, and hence the quality of the samples. We will discuss more in-depth in Appendix A. When looking at the time it takes to obtain one million accepted samples, it is directly proportional to the cycles for each proposal. The main difference is that a cycle for the uniform proposal takes a fraction of the time of a cycle for the NSF. This is because sampling and evaluating for the NSF is heavy computationally compared to doing this task for a uniform distribution. As mentioned before, see Appendix A for a more in-depth discussion.
The coverage is the main quantity of measurement of the quality of the produced samples since it measures the volume conserved of the original distribution when performing rejection sampling with certain quantiles. For all the chosen quantiles, the NSF drops a volume < 1 %, while for the uniform distribution the loss is of > 5 % for quantile 0.9999, > 38 % for 0.999, and > 91 % for 0.99, which is unacceptable when trying to produce samples from the original distribution. For the NSF this level of performance when taking the above quantiles is expected, as in Fig. 5 we have seen that the upper tail of weights with large magnitudes is barely a percentage over the whole distribution. However, for the uniform distribution, the loss of coverage is caused by two facts: (i) 98.03% of the weights are zero, hence placing the whole distribution on a 1.97% of the weights. (ii) These weights, as seen in Fig. 5, span over many orders of magnitude, making a cut on the quantile of their distribution more noticeable, as will be discussed below.
To visualize the coverage and the regions missing by choosing a quantile k Q0.999 and different proposal functions, Fig. 6 shows 2-dimensional histogram representation of the marginalized coverage bin-to-bin, taking the variables in pairs, where each bin quantifies the coverage of that bin (i.e., the sum of weights in that bin after choosing a certain quantile over the sum of weights of those weights without clipping). On the left, the coverage for the NSF is presented and shows that only few regions of the phase space have values smaller than 1, and even in those regions the coverage has no noticeable discrepancies. On the right, the coverage of the uniform proposal is shown for the same quantile 0.999, marking clear regions where the coverage drops drastically to values close to zero.
When comparing both coverage regions in Fig. 6, a clear pattern be seen for the uniform one, while it looks quite random for the NSF. This is because the coverage is related to the ratio p (x) /q (x), with q (x) the corresponding proposal density. For the NSF, q φ (x) has a shape very closely related to p (x), as shown in Fig. 3 and Fig. 4, so the coverage would correspond to regions where the discrepancy is large, which has a chaotic behavior. Contrary, for the uniform proposal, p Unif (x), this ratio is proportional to p (x), hence, by clipping, we are doing so according to that particular shape, making the coverage less chaotic and more structured. This translates into making highly probable areas equally likely than others with less probability, affecting this exact group of regions as we will now analyze. Fig. 6 gives us an overall picture of where the densities are wrongly estimated by choosing certain quantile, but it does not quantify or indicate the amount of error, that is, it is not telling us whether the coverage is poor in areas of small or high density. To answer this question, a multidimensional histogram overall four dimensions was performed, with a binning of 20 in each dimension. Then, for each bin, we compute the percentage of weight for a proposal q, which is equivalent to the percentage of density p (x) in that bin, and the coverage for the quantile k Q0.999 . Fig. 7 shows a histogram of the number of bins according to their % w q of bin vs their coverage. Notice how % w q is taken on the logarithmic scale. For the NSF (Fig. 7 left), the regions of coverage visibly smaller than one are two to four orders of magnitude smaller in % w q than the denser high % w q region on the top right. This means that the areas being misrepresented by taking the quantile k Q0.999 are relatively small. Also, most of the area with coverage < 1 are close to full coverage. Contrary, the uniform proposal (Fig. 7 right) shows the coverage dropping for high values of % w q , indicating that important regions of For the NSF (left), the coverage is barely lower than one in most of the bins, and when it drops it is for low %wq. Contrary, for the uniform proposal (right), the coverage drops for high %wq, making the overall coverage way smaller than the one for NSF, as shown in Tab. I.
Concerning the ratio of ESS, we can see that for the NSF it is larger than 90%, giving highly uncorrelated events. For the uniform proposal however, the ESS drops to the range of 0.16 − 1.95%, even for lower quantiles. The differences can be understood from Eq. (10) and by looking at the weight distribution for each proposal in Fig. 5. A large percentage of NSF weights have the same order of magnitude while for the uniform we go over a spectrum of 8 orders of magnitudes. Additionally this ratio of ESS has to be considered for the area of the original distribution given by the coverage, where the uniform distribution misses a lot by choosing smaller quantiles.
To summarize the analysis performed on the results of Tab. I, in the case of the NSF, by lowering the quantile, the probability of acceptance grows until reaching almost 80%, reducing the time to a 0.7 % of the original one while maintaining coverage of over 0.99 %. These scores allow us to justify using a smaller quantile for rejection sampling to improve significantly the performance in time, while also quantifying the misrepresentation we are doing by lowering the constant k. Contrary, for the uniform proposal, the analysis showed the weakness when trying to utilize smaller quantiles, lowering the coverage by over 38 % when using a quantile of only 0.999. Additionally, looking at the N ESS and Fig. 5, we can see that most of the distribution is concentrated in a relatively small number of samples.

V. CONCLUSION
In this letter we have presented Exhaustive Neural Importance Sampling (ENIS), a framework to find accurate proposal density functions in the form of normalizing flows. This proposal density is subsequently used to perform rejection sampling on a target density to obtain MC data sets or compute expected values. We argue that ENIS solves the main issues associated with rejection sampling algorithms as described in the introduction: (i) The training to find a good proposal is done automatically from the target density, with little configuration needed from the human point of view. (ii) Compared to generic proposal functions such as the uniform one, the normalizing flow adapts its density over the target one, getting rid of the inefficiencies which usually are on the downside of the method. (iii) The proposal function is generated based on a set of trivial normally distributed random numbers transformed through the flow, without any rejection method applied.
The performance of the method has been demonstrated and analyzed in a real case scenario, the CCQE crosssection of the antineutrino interaction with a nucleus , where the density is four-dimensional. We have shown that, for the normalizing flow proposal, we can relax the condition in the constant k, used to construct the comparison function, boosting greatly the efficiency (up to ≈ 80% of acceptance rate) while committing a very small error on the target density (less than a 1%), bringing orders of magnitude of speed up in computing time compared to the same error committed by a uniform proposal. Additionally, we investigated the coverage of the generation method as a function of the constant k. We showed that the areas of the phase space where the error is committed are less relevant to the final result compared to the error in the uniform case.
The method can be used to generate fast MC samples in cases where the precision is less relevant versus the algorithm speed. High Energy Physics presents some of these examples such as extensive statistical studies based on "Asimov data sets", fast detector simulations, or simply in fast studies for detector developments and designs. In those cases, the learned proposal function might be sufficiently precise and easy to generate with a set of simple normal random generators transformed through the flow.
As of usage, ENIS brings the possibility of applying the same normalizing flow for rejection sampling of similar densities, e.g., densities coming from a model where the parameters are changed slightly, altering the overall density smoothly. The weight distribution of the ratio between target and proposal will be altered, but no additional training of the neural network would be needed regarding the theoretical model remains similar to the original one. This is a huge advantage compared to methods like MCMC, where one would have to rerun the complete algorithm to obtain samples from each of the different densities.
As a last remark, we have demonstrated the potential of ENIS for the four-dimensional CCQE interaction density. We believe this will only be more noticeable when applying it to higher dimensions, as the original paper of NSF [14] shows a remarkable performance on spaces of dimensions up to sixty-three. The advantages will be also more obvious as the underlying cross-section model becomes more complex and computationally involved. and uniform proposals to generate ten million samples via rejection sampling. Choosing kQ as one of the below quantiles, we increase the time to evaluate the model p (x) for each rejection cycle by a factor tincrease. tNSF (t Unif ) is the time it takes to obtain ten million accepted samples, in seconds, for the NSF (uniform) proposal. t Unif /tNSF is the ratio of the time between both proposals, showing that for more complex models than the CCQE cross-section of this paper, NSF outperforms a uniform proposal by orders of magnitude.
Quantile tincrease tNSF t Unif t Unif /tNSF k Q0.9999 and k Q0.9990 with the same acceptance probability as in Tab. I, consider different hypothetical models with factors t increase ∈ {1, 10, 100, 1000} compared to the CCQE cross-section. We show the time (in seconds) it takes to generate these ten milllion samples for the NSF, t NSF , and for the uniform proposal, t Unif , while also computing their ratio t Unif /t NSF . For models of p (x) with larger computationally complexity, we can see that even for the maximum quantile k Q1.0 NSF is faster to obtain ten million accepted samples than using a uniform proposal, gaining a whole magnitude in time when the model is at least hundred times more complex. Choosing quantiles smaller than 1.0, for models which take thousand times more to compute compared to the CCQE crosssection one, the NSF proposal generates these ten million samples over hundred times faster than the uniform distribution. Even for simpler models with only hundred more computation complexity, the gain of using the NSF proposal is noticeable.