Neural Networks for Full Phase-space Reweighting and Parameter Tuning

Precise scientific analysis in collider-based particle physics is possible because of complex simulations that connect fundamental theories to observable quantities. The significant computational cost of these programs limits the scope, precision, and accuracy of Standard Model measurements and searches for new phenomena. We therefore introduce Deep neural networks using Classification for Tuning and Reweighting (DCTR), a neural network-based approach to reweight and fit simulations using all kinematic and flavor information -- the full phase space. DCTR can perform tasks that are currently not possible with existing methods, such as estimating non-perturbative fragmentation uncertainties. The core idea behind the new approach is to exploit powerful high-dimensional classifiers to reweight phase space as well as to identify the best parameters for describing data. Numerical examples from $e^+e^-\rightarrow\text{jets}$ demonstrate the fidelity of these methods for simulation parameters that have a big and broad impact on phase space as well as those that have a minimal and/or localized impact. The high fidelity of the full phase-space reweighting enables a new paradigm for simulations, parameter tuning, and model systematic uncertainties across particle physics and possibly beyond.

In collider-based high-energy physics, parton-, particle-, and detector-level Monte Carlo (MC) simulation programs enable scientific inference by connecting fundamental theories to observable quantities. However, these tools are often computationally slow and emulate probability distributions that are analytically intractable. This has resulted in three key simulation challenges for particle physics: (1) an insufficient number of simulated events, (2) unaccounted for biases from simulation parameters, and (3) the inability to utilize the full phase space for parameter tuning.
A variety of approaches have been proposed to address the above challenges. The two existing solutions to (1) are to use more [1][2][3] and/or faster computers or accelerators [4,5] or to build fast surrogate models ('fast simulation'). Machine learning tools hold great promise for augmenting [6] or replacing [7][8][9][10][11][12][13][14][15][16][17][18][19][20] current fast detector simulation approaches, but are not yet precise enough to match the full, physics-based detector simulators that are often the limiting factor in the overall software pipeline. Deep learning methods to circumvent expensive simulations for hypothesis testing were studied in the context of effective field theory fits [21][22][23]; related ideas will also be useful for reweighting. The only solution for (2), aside from generating a large set of simulations or interpolating between bins of low-dimensional histograms [24], is to assign event weights for parameter variations. Currently, this is only possible for a small number of perturbative parameters in parton shower programs [25][26][27] and for parton distribution functions [28,29]. Pseudo-automated procedures exist for tuning parton shower models [24,30], but the format of the existing public data means that these algorithms are restricted to a set of mostly one-dimensional inputs that must be assumed to be independent. The variational method proposed in Ref. [31] has been demonstrated with high-dimensional data, but utilizes a minimax optimization technique and requires running the simulator many times during training.
This letter introduces Deep neural networks using Classification for Tuning and Reweighting (Dctr, pronounced "doctor"), a new approach to solve all three computational challenges. In particular, deep neural network-based classifiers are used to (continuously) reweight one particle-level simulation into another and, additionally, to use the full phase space to fit parameters within a given model. When the nominal particle-level sample has a corresponding detector-level simulation, this procedure produces a new detector-level sample as well. Non-deep machine learning tools have been used in the past for discrete reweighting [18,32-34] with a small number of observables. Deep-learning-based discrete reweighting was considered in [18,35], and continuous single-observable reweightings were presented in [36]. The reweighting presented here combines a full phase-space deep learning architecture [37] with parameterization [38,39] to fully morph one simulation into another. There are no restrictions on the size of the input feature space nor on the number of interpolated parameters. In addition to reweighting, we show how Dctr can be used with a differentiable reweighting function (such as the one just mentioned) to optimize simulation parameters. Fitting parameters based on parameterized classifiers was proposed in Ref. [38]; here, the fitting procedure uses a classifier to construct the loss function, which can readily incorporate all of the information from the (potentially high-dimensional) input features and be optimized using standard deep learning tools.
The first ingredient of the full phase-space reweighting procedure is a prescription to derive event weights. Consider two simulations that cover the same phase space $\Omega$ and are described by probability densities $p_0(x)$ and $p_1(x)$ for $x \in \Omega$. Assuming that $p_0$ and $p_1$ have the same support (footnote 1), the function $w(x) = p_0(x)/p_1(x)$ is the ideal per-event weight to morph the second simulation into the first one. A key observation made by multiple groups in the past is that $w$ can be well-approximated by training a machine learning classifier to distinguish the two simulations. For example, let $f(x)$ be a neural network trained with the binary cross-entropy loss

$L[f] = -\sum_{x_i \in 0} \log(f(x_i)) - \sum_{x_i \in 1} \log(1 - f(x_i))$, (1)

where 0 and 1 represent the sets of examples from the two simulations. Then a well-known result (footnote 2) is that the optimal classifier approaches $f(x) = p_0(x)/(p_0(x)+p_1(x))$, so that

$w(x) = \frac{f(x)}{1-f(x)} \approx \frac{p_0(x)}{p_1(x)}$.

The benefit of parameterizing $f$ as a neural network is that deep learning can readily analyze all of $\Omega$, which was not possible with shallow learning attempts built on a similar statistical foundation. The closest attempt at a full phase-space approach directly tried to learn $p_i(x)$ using the full kinematic (i.e. non-flavor) part of $\Omega$ [35,40], but this is much harder than learning the ratio.
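This likelihood-ratio trick can be sketched numerically. The toy below (not from the paper) uses two unit-width Gaussian "simulations" with known densities, so the idealized classifier output $f = p_0/(p_0+p_1)$ can be written in closed form; a trained network would only approximate it. The densities, seed, and sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def p0(x):  # "simulation 0" density: N(0.5, 1)
    return np.exp(-0.5 * (x - 0.5) ** 2) / np.sqrt(2 * np.pi)

def p1(x):  # "simulation 1" density: N(0, 1)
    return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

def f(x):   # idealized classifier output that a trained network approaches
    return p0(x) / (p0(x) + p1(x))

x1 = rng.normal(0.0, 1.0, 1_000_000)  # events from simulation 1
w = f(x1) / (1.0 - f(x1))             # per-event weights, equal to p0/p1

# The weighted sample-1 events should reproduce sample-0 statistics.
mean = np.average(x1, weights=w)
var = np.average((x1 - mean) ** 2, weights=w)
print(round(mean, 2), round(var, 2))  # ≈ 0.5 and 1.0
```

The weighted sample drawn from $p_1$ reproduces the mean and variance of $p_0$: this morphing is exactly what the classifier-based weights enable, with the neural network replacing the closed-form $f$ when the densities are intractable.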
An important reweighting scenario is when the two simulations come from the same simulation program, but with different model parameters $\theta$. For example, when model uncertainties are evaluated, one may want to transform $p_\theta(x)$ into $p_{\theta+\delta\theta}(x)$. When these uncertainties are profiled in a fit, it is important that the transformation procedure be able to continuously interpolate between model parameters. The neural network reweighting approximation can be extended to this continuous case by adding $\theta$ as a feature [38,39]: $f(x, \theta)$. In the examples presented below, the training data are generated with a uniform distribution in $\theta$, but this probability density can be optimized per application and can even be discrete.
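A minimal sketch of how such a parameterized training set could be assembled, with a one-dimensional toy generator standing in for the simulator (the generator, ranges, and pairing scheme are illustrative assumptions, not the paper's setup). The key point is that $\theta$ is appended as a feature to both classes, with the nominal events paired to the same uniformly drawn $\theta$ values so the classifier cannot separate the classes from $\theta$ alone:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(theta, n):
    """Toy stand-in for the event generator: one observable x ~ N(theta, 1)."""
    return rng.normal(theta, 1.0, n)

n = 50_000
theta = rng.uniform(-1.0, 1.0, n)   # parameter drawn uniformly, one value per event
x_var = rng.normal(theta, 1.0)      # class 1: events at the varied parameter
x_nom = simulate(0.0, n)            # class 0: nominal sample (theta0 = 0)

# Each training example is (x, theta); nominal events carry the same
# theta values as the varied ones, so theta alone is uninformative.
X = np.concatenate([np.stack([x_nom, theta], axis=1),
                    np.stack([x_var, theta], axis=1)])
y = np.concatenate([np.zeros(n), np.ones(n)])
print(X.shape, y.shape)
```
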
Even though generators have many parameters that must be fit to data, gradient methods cannot be used directly with the models because the phase space they produce is usually not differentiable (or at least the derivative is intractable) with respect to the model parameters. Surrogate generative models built from neural networks can be used for gradient-based parameter fitting, but may not have sufficient quality to be reliable. Reweighting is a robust alternative to surrogate generative models. A neural network-based continuous reweighting function is essentially a differentiable (in model parameters) version of the original simulator and can be used to perform inference on the parameters themselves. This is especially powerful for particle-level parameter tuning to data, where one sample with a computationally expensive full detector simulation can be continuously reweighted to other parameter points with the same detector model at no extra simulation cost.

Footnote 1: In most physical applications, this is always the case. If there are regions where $p_0(x)/p_1(x)$ is far from unity, one can add a regularization parameter to the training to mitigate large weights, which may significantly reduce the statistical power of the reweighted dataset. We found that this works well, but was unnecessary for the examples presented in this paper.

Footnote 2: See Appendix A for the derivation.
An ideal loss function for fitting model parameters makes use of the full observable phase space. Typical metrics such as the $\chi^2$ between histogram approximations to probability densities become impractical when $\Omega$ is high dimensional. As described above, classifiers are powerful tools for accessing all of the available information, so one can use a classifier to construct the loss. When a classifier trained to distinguish some $\theta_0$ from a $\theta_1$ performs poorly, the two samples are close. While using classification to quantify differences between event samples has been used for anomaly detection [41][42][43], we are unaware of an example where it is used for parameter fitting. The idea of using the classifier loss as a metric is similar to the minimax strategy in Generative Adversarial Networks [44], only in this context the generative part is a reweighter and is trained independently. A more elegant way of implementing this approach is to fit the unknown parameters to the values that minimize the nominal classifier loss. In particular, suppose that a reweighter neural network $f$ is trained as described above. Such a function will satisfy, for all $\theta$,

$f = \mathrm{argmin}_{f'}\left[-\sum_{x_i \in \theta_0} \log f'(x_i, \theta) - \sum_{x_i \in \theta} \log(1 - f'(x_i, \theta))\right]$. (2)

Note that the $f'$ in the first sum takes the parameter $\theta$ and not $\theta_0$; otherwise the discrimination task would be trivial. Now, suppose there is a new sample $\theta_1$ where $\theta_1$ is unknown (for instance, $\theta_1$ are collider data). The claim is that if $\theta^*$ is chosen as

$\theta^* = \mathrm{argmin}_{\theta}\left[-\sum_{x_i \in \theta_0} \log f(x_i, \theta) - \sum_{x_i \in \theta_1} \log(1 - f(x_i, \theta))\right]$, (3)

then $\theta^* = \theta_1$. As $f$ minimizes the cross-entropy loss for any $\theta$ (Eq. 2), the loss in Eq. 3 evaluated at $\theta = \theta_1$ cannot exceed its value at $\theta^*$, so $L(\theta_1) \le L(\theta^*)$ must hold. However, the converse inequality must also be true since $\theta^*$ minimizes the cross-entropy loss as well, and therefore $\theta^* = \theta_1$. Since $f$ is differentiable, Eq. 3 can be solved using standard gradient-based methods.

While Eq. 3 performs the fit on the same particle-level phase space as the reweighting, it can be readily extended to perform the fitting (via the classification loss) at detector level while the reweighting is performed at particle level, using one fully detector-simulated event sample (see Appendix B).

The last ingredient of Dctr is a suitable neural network architecture that can effectively capture all the salient features of $\Omega$. A natural tool for this task is the Particle Flow Network (PFN) [37], built on the Deep Sets framework [45]. While many deep learning architectures incorporate the symmetries and structure of high energy physics events [8,35,[46][47][48][49][50][51][52][53][54][55][56], PFNs are particularly effective because they can operate on variable-length sets of particles and respect the quantum-mechanically induced permutation invariance of particle labels. These networks can also readily incorporate non-kinematic information such as particle flavor. A particle flow network is a composition of two neural networks $F$ and $\Phi$:

$\mathrm{PFN}(\{p_i\}) = F\left(\sum_i \Phi(p_i)\right)$, (4)

where $p_i$ is the set of features belonging to particle $i$ (momentum and flavor) as well as $\theta$. The function $\Phi$ embeds the input particles into an $\ell$-dimensional latent space and $F$ is a simple $\mathbb{R}^\ell \to \mathbb{R}$ neural network. References [37,45] proved that this structure is sufficiently flexible to approximate any function and in practice, $\ell \sim O(10)$.
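The sum-then-map structure of a PFN can be sketched with random, untrained weights; the point is purely architectural, namely that summing the per-particle embeddings before applying F makes the output independent of particle ordering and of the (variable) number of particles. The layer sizes below are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

L, D = 8, 4                        # latent dimension and per-particle features
W_phi = rng.normal(size=(D, L))    # one-layer stand-in for the Phi network
W_f = rng.normal(size=L)           # one-layer stand-in for the F network

def relu(z):
    return np.maximum(z, 0.0)

def pfn(particles):
    """particles: (n_particles, D) array -> scalar event-level output."""
    latent = relu(particles @ W_phi).sum(axis=0)  # permutation-invariant sum
    return float(relu(latent) @ W_f)

jet = rng.normal(size=(30, D))     # a toy jet with 30 particles
perm = rng.permutation(30)
print(np.isclose(pfn(jet), pfn(jet[perm])))  # True: ordering does not matter
```

In the real architecture Phi and F are multi-layer networks and the per-particle features include kinematics, flavor, and the parameter theta, but the invariance argument is exactly this sum.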
To illustrate the potential of Dctr, full phase-space reweighting and parameter tuning is performed on a sample of generated events from the Pythia 8.230 [57,58] event generator. Particle-level $e^+e^- \rightarrow Z \rightarrow$ dijet events with about 100 particles per event are clustered into jets using the anti-$k_t$ clustering algorithm [59] ($R = 0.8$) with Fastjet 3.0.3 [60,61]. The jets are presented to the neural network for training, with each jet constituent represented by $(p_T, \eta, \phi, \text{particle type}, \theta)$, where $\theta$ is the parameter in Eq. (2). One million events were generated for each set of Pythia parameters. In addition to a default parameter set using the Monash tune [62], three separate samples were generated with uniformly sampled TimeShower:alphaSvalue, StringZ:aLund, and StringFlav:probStoUD in the ranges [0.10, 0.18], [0.50, 0.90], and [0.10, 0.30], respectively. We also generated one sample in which all three parameters were simultaneously uniformly sampled. These parameters were chosen because they represent both perturbative and non-perturbative physical effects, and the ranges are similar to those studied in Ref. [30]. The Monash values of the three parameters are 0.1365, 0.68, and 0.217, respectively.
The reweighting and fitting were found to work well without any hyperparameter modifications from Ref. [37]. In particular, $\Phi$ has two hidden layers and a latent dimension of $\ell = 128$, while $F$ is composed of three hidden layers and two output nodes for binary classification; all hidden layers have 100 nodes. The activation function for all layers is ReLU, with the exception of the classification output, which uses softmax. All models were implemented in Keras [63] with the TensorFlow backend [64] and trained using the cross-entropy loss with the Adam [65] optimizer for 50 epochs, using early stopping with patience 10 and batch size 1000. Each training set contained $8 \times 10^5$ training and $10^5$ validation jets of each class. Training time was 10-15 minutes per model (20 seconds per epoch) on an NVIDIA GeForce GTX 1080.
As a first test of Dctr, a single parameter (TimeShower:alphaSvalue) is reweighted using the full phase space of the generated jets. Results for discrete and continuous reweighting from a varied parameter to the nominal sample are presented in Fig. 1. The entire phase space is reweighted, but it is too high dimensional to visualize. Instead, three histograms of physically relevant one-dimensional observables are presented: the number of particles inside the jet (multiplicity), an n-subjettiness ratio $\tau_{32}$ [66,67], and a four-point Energy Correlation Function [68], ECF($N = 4$, $\beta = 4$). By definition, $\tau_{32} = \tau_3/\tau_2$, where $\tau_n = \sum_{i\in\text{jet}} p_{T,i}\, \min_{j=1,\ldots,n}\{\Delta R(i,j)\}$ for a set of $n$ axes indexed by $j$; likewise, ECF($N$, $\beta$) is the sum over all $N$-tuples inside the jet weighted by the product of the momenta and the product of all pairwise opening angles raised to the power $\beta$. The large values of $n$, $N$, and $\beta$ are used to expose complex features with a nontrivial dependence on all particles inside the jet. Many more observables were studied, but these are representative.
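Of the observables above, the energy correlation functions are simple enough to compute directly from the constituent list. Below is a brute-force, unnormalized sketch assuming the standard definition as a sum over $N$-tuples; the n-subjettiness ratios are omitted because they additionally require axis finding, and the toy constituents are illustrative:

```python
import itertools
import math

def delta_r(a, b):
    """Opening angle between constituents a, b given as (pt, eta, phi)."""
    deta = a[1] - b[1]
    dphi = math.pi - abs(math.pi - abs(a[2] - b[2]))  # wrap |dphi| into [0, pi]
    return math.hypot(deta, dphi)

def ecf(particles, N, beta):
    """Unnormalized ECF(N, beta): sum over all N-tuples of the product of
    transverse momenta times the product of all pairwise opening angles
    raised to the power beta."""
    total = 0.0
    for combo in itertools.combinations(particles, N):
        pt_prod = math.prod(p[0] for p in combo)
        ang = math.prod(delta_r(a, b)
                        for a, b in itertools.combinations(combo, 2))
        total += pt_prod * ang ** beta
    return total

# Three toy constituents: (pt, eta, phi)
jet = [(100.0, 0.0, 0.0), (50.0, 0.1, 0.1), (20.0, -0.1, 0.2)]
print(ecf(jet, 2, 1.0) > 0, ecf(jet, 4, 4.0) == 0.0)  # ECF(4) needs >= 4 particles
```

The brute-force loop is $O(n^N)$ in the number of constituents, which is why large-$N$ correlators are expensive for jets with ~100 particles; the sketch is for clarity, not performance.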
The reweighted distributions are in excellent agreement with the target nominal distribution. The samples used to make the histograms shown for TimeShower:alphaSvalue values of 0.1365 and 0.1600 were not used during training or validation. The fidelity of the continuous reweighting is quantified in the lower right plot of Fig. 1, which presents the $\chi^2/\text{ndf}$ as a function of the initial $\alpha_s$ parameter value.
Variations in TimeShower:alphaSvalue modify many aspects of jet fragmentation and therefore it may be an easy parameter for the reweighting network to learn. In contrast, the hadronization parameters StringZ:aLund and StringFlav:probStoUD may be more difficult because the size of their effects on the phase space is small and/or localized. Figure 2 shows that despite these potential challenges, the reweighting procedure is able to effectively capture subtle and isolated modifications to the phase space. Variations in the StringZ:aLund parameter result in mostly percent-level differences in the presented distributions, which are corrected in the reweighted model. Modifying the StringFlav:probStoUD parameter only changes strange particles such as kaons, which highlights the importance of including flavor in the full phase space network. Importantly, this model learns to only change the distributions related to strange particles, leaving other observables untouched. Simultaneously reweighting all three parameters also works well, but is more difficult to visualize. We also verified that Dctr works for pp MC simulations by reweighting from Pythia to Herwig for both the quark and gluon samples taken from [37,69,70].
The well-trained Dctr model can now be used to demonstrate the potential for parameter tuning following Eq. 3.

[Figure 1 caption: To quantify the quality of the reweighting, and to illustrate that one trained model can continuously reweight for any parameter value, the lower right panel shows the $\chi^2/\text{ndf}$ for the multiplicity as a function of $\alpha_s$ when reweighting to $\alpha_s = 0.1600$. For each value, the $\chi^2$ relative to $\alpha_s = 0.1600$ is compared before and after reweighting. Each $\chi^2$ value is averaged over 10 runs and the grey band marks the standard deviation, which is consistent with $\chi^2/\text{ndf} \approx 1$.]

As a first step, Table I presents the result of a fit where the 'data' are the same as the nominal sample, but with each parameter changed one at a time (each row is a separate fit). To illustrate the sensitivity to the randomness in the model initialization, each fit is performed ten times. This variation could be reduced with a more sophisticated neural network and/or more training data. For each of these one-dimensional fits, the fitted value is consistent with the target value within these statistical fluctuations from initialization, which are 1-3%. As TimeShower:alphaSvalue has a larger impact on the phase space, it is less sensitive to the initialization fluctuations. For a fit to data, the statistical and systematic uncertainties could be determined with toys and even profiled, as is standard for parameter fitting. As a next step, the top part of Table II shows the result of a simultaneous fit to the three parameters. As with the one-dimensional fits, the fitted values are all statistically consistent with the target values. Interestingly, the sensitivity to the initialization fluctuations is about the same for the three-dimensional fit as for the one-dimensional fits, providing confidence in the scaling to more parameters. In practice, the fitting procedure would be validated on a variety of simulations with known parameters, as just described. An illustration of the fit itself is shown in Fig. 3, where a two-dimensional slice through the likelihood landscape is presented and the fit execution is demonstrated with markers and dashed lines. The broadness of the loss in the StringZ:aLund direction relative to the TimeShower:alphaSvalue direction reflects the significantly smaller impact of fragmentation-function variations on the observable phase space compared with modifications of the final-state-shower strong coupling. After this validation, the model can be deployed on data, where the parameters are unknown.
The lower part of Table II replicates this scenario: the Pythia parameters were blinded during the fit. This closure test indicates that the method is robust to user bias.
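The logic of this closure test can be reproduced in a toy setting. Below, a one-dimensional Gaussian generator stands in for Pythia and the exact parameterized classifier stands in for the trained Dctr network (both are illustrative assumptions); scanning the classifier loss over theta recovers the "data" parameter, mirroring the Eq. 3 fit with a grid scan in place of gradient descent:

```python
import numpy as np

rng = np.random.default_rng(3)

def pdf(x, mu):
    """Unit-width Gaussian density."""
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

# Idealized parameterized classifier for a toy generator x ~ N(theta, 1)
# with nominal theta0 = 0; stands in for the trained Dctr network.
def f(x, theta):
    return pdf(x, 0.0) / (pdf(x, 0.0) + pdf(x, theta))

theta_true = 0.7
x_nom = rng.normal(0.0, 1.0, 200_000)           # nominal simulation
x_data = rng.normal(theta_true, 1.0, 200_000)   # "data" at an unknown theta

# Cross-entropy loss with the data sample in place of the theta sample.
def loss(theta):
    return (-np.mean(np.log(f(x_nom, theta)))
            - np.mean(np.log(1.0 - f(x_data, theta))))

grid = np.linspace(0.1, 1.4, 131)               # coarse scan instead of gradients
theta_star = grid[np.argmin([loss(t) for t in grid])]
print(round(theta_star, 1))   # ≈ 0.7: the fit recovers the data parameter
```

In the paper's setting the same scan would be done by gradient descent through the differentiable network, with the full jet phase space replacing the single toy observable.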
The empirical results demonstrate that Dctr is ready to be deployed. The discrete reweighting could be used to generate new full-detector simulated samples with a different particle-level simulation when at least one fully simulated sample exists. This could be particularly useful for systematic uncertainties computed using pairs of simulations (e.g. comparing Pythia and Herwig) and for legacy data analyses in which the original detector simulation is no longer available [71]. Continuous reweighting will enable systematic parameter variations for uncertainty estimation that were not possible before (for most parameters). Such variations can even be profiled during any statistical test that fits phase-space regions sensitive to the varied nuisance parameters. Finally, the full power of Dctr can be used for parameter tuning. Unlike traditional tuning, which uses unfolded data that are usually one-dimensional and lack observable-observable correlations, a new paradigm is now possible in which high-dimensional detector-level data can be used directly. The full power of the data can be utilized and all of the correlations are correctly accounted for in the fit. For the first time, this may allow proper covariance matrices (and thus correlated uncertainties) to be determined for simulation parameter values. All of these opportunities illustrate the broad applicability of full phase-space reweighting and parameter tuning and the power of Dctr to extend the scope, precision, and accuracy of collider-based particle physics analyses.
BN would like to thank Luke de Oliveira, Michela Paganini, Chase Shimmin, and Paul Tipton for collaboration on an early stage prototype of this project. We thank Steve Mrenna and Peter Skands for helpful discussions about automated variations in Pythia. We also thank Kyle Cranmer, Aviv Cukierman, Patrick Komiske, Eric Metodiev, and Jesse Thaler for lively discussions about deep learning-based reweighting. Additionally, we are grateful to Kyle Cranmer, Phil Ilten, and Jesse Thaler for feedback on the manuscript. This work was supported by the U.S. Department of Energy, Office of Science under contract DE-AC02-05CH11231. Finally, we are grateful for the opportunity to use the Cori supercomputing resources at NERSC.
suppose that h(x) is a function with a strictly smaller average loss in Eq. A1 than g. Since the average loss for h is below that of g, by the intermediate value theorem there must be an x for which the conditional (fixed-x) loss of h is below that of g, contradicting the construction of g as the pointwise minimizer of the conditional loss.
Now, consider the case where the loss is cross-entropy,

$L[g] = -\sum_{x_i \in 0} \log(g(x_i)) - \sum_{x_i \in 1} \log(1 - g(x_i))$,

for which the pointwise optimum is

$g(x) = \frac{p_0(x)}{p_0(x) + p_1(x)}$,

where $p_i$ is the probability density of class $i$. Coincidentally, the exact same result holds if using the mean squared error loss. When using either loss function with two outputs and the softmax activation for the last neural network layer, the first output will asymptotically approach $g(x)$ and the other, by construction, will be $1 - g(x)$. The ratio of these two outputs is then

$\frac{g(x)}{1 - g(x)} = \frac{p_0(x)}{p_1(x)}$.

Therefore, the output ratio is proportional to the likelihood ratio. The proportionality constant is the ratio of the fractions of the two classes used during training. In the paper, the two classes always have the same number of examples and thus this factor is unity.
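The pointwise optimum can be checked numerically: for fixed class probabilities $(q, 1-q)$ at a given $x$, the cross-entropy is minimized at $g = q$, and the output ratio then gives $q/(1-q)$, the local likelihood ratio. The value of $q$ below is arbitrary:

```python
import numpy as np

# Pointwise cross-entropy at one x with class probabilities (q, 1-q):
#   -q*log(g) - (1-q)*log(1-g)
# is minimized at g = q, i.e. g(x) = p0(x) / (p0(x) + p1(x)).
q = 0.73
g = np.linspace(0.01, 0.99, 9801)     # fine grid over allowed outputs
ce = -q * np.log(g) - (1 - q) * np.log(1 - g)
g_opt = g[np.argmin(ce)]
print(round(g_opt, 2))                # 0.73
print(round(g_opt / (1 - g_opt), 3))  # ratio approaches q/(1-q) = p0/p1
```
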
at truth level (before detector simulation) while the fit happens in data (after the effects of the detector), this procedure will not work. It works only if the reweighting and the fitting both happen at detector level or both happen at truth level. The following is an alternative method:

$\theta^* = \mathrm{argmax}_{\theta}\, \min_{g} \left[-\sum_{x_i \in \theta_1} \log g(x_i) - \sum_{x_i \in \theta_0} w(x_i, \theta) \log(1 - g(x_i))\right]$,

where $w(x_i, \theta) = f(x_i, \theta)/(1 - f(x_i, \theta))$ is a trained Dctr model using binary cross-entropy as in the main body. The intuition behind the above equation is that the classifier $g$ tries to distinguish the two samples, and we try to find a $\theta$ that makes $g$'s task maximally hard. If $g$ cannot tell the two samples apart, then the reweighting has worked. This is similar to the minimax training of a GAN, only now the analog of the generator network is the reweighting network, which is fixed, so the only trainable parameters are the $\theta$. The advantage of this second approach is that it readily generalizes to the case where the reweighting happens on a different level:

$\theta^* = \mathrm{argmax}_{\theta}\, \min_{g} \left[-\sum_{x_{D,i} \in \theta_1} \log g(x_{D,i}) - \sum_{(x_{T,i},\, x_{D,i}) \in \theta_0} w(x_{T,i}, \theta) \log(1 - g(x_{D,i}))\right]$,

where $x_T$ is the truth value and $x_D$ is the detector-level value. In simulation (the second sum), these come in pairs, so one can apply the reweighting at one level and the classification at the other. Asymptotically, both this method and the one in the main body of the paper learn the same result: $\theta^* = \theta_1$. To see this for the second method, consider the same logic as in Appendix A. Conditioning on $x$ and $\theta$, the optimal $g$ is given by

$g(x) = \frac{p_{\theta_1}(x)}{p_{\theta_1}(x) + w(x, \theta)\, p_{\theta_0}(x)}$,

which reduces to the result of the previous appendix when $w = 1$. Given this optimal $g$, the loss is maximized when $g$ is independent of $x$, which happens if and only if $w(x, \theta)\, p_{\theta_0}(x) \propto p_{\theta_1}(x)$, i.e. when $w(x, \theta)$ is proportional to the likelihood ratio between the two samples. An example implementation of this method in Keras can be found at Ref. [72].
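The minimax structure can be illustrated in a toy where all densities are known, so the inner classifier optimization can be done in closed form (the Gaussian generator and exact weight function are illustrative assumptions, not the Keras implementation of Ref. [72]). The inner-optimal cross-entropy is largest, approaching $2\log 2$, exactly when the reweighted nominal sample matches the data:

```python
import numpy as np

rng = np.random.default_rng(4)

def pdf(x, mu):
    """Unit-width Gaussian density."""
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

theta_true = 0.5
x_data = rng.normal(theta_true, 1.0, 100_000)   # "data" (unknown parameter)
x_nom = rng.normal(0.0, 1.0, 100_000)           # nominal simulation, theta0 = 0

def w(x, theta):
    """Exact reweighting function p_theta / p_theta0 (a trained Dctr
    network would approximate this via f/(1-f))."""
    return pdf(x, theta) / pdf(x, 0.0)

def inner_loss(theta):
    """Cross-entropy of the *optimal* inner classifier g between the data
    and the w-reweighted nominal sample; computable in closed form here
    because the toy densities are known."""
    def g(x):
        return pdf(x, theta_true) / (pdf(x, theta_true)
                                     + w(x, theta) * pdf(x, 0.0))
    return (-np.mean(np.log(g(x_data)))
            - np.mean(w(x_nom, theta) * np.log(1.0 - g(x_nom))))

# Outer step: pick the theta whose reweighting is hardest to detect.
grid = np.linspace(0.0, 1.0, 101)
theta_star = grid[np.argmax([inner_loss(t) for t in grid])]
print(round(theta_star, 1))   # ≈ 0.5: the minimax fit finds the data parameter
```

In practice the inner classifier must itself be trained (the min over g), which is what makes this a GAN-like alternating optimization rather than the closed-form scan shown here.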