Parameter Estimation using Neural Networks in the Presence of Detector Effects

Histogram-based template fits are the main technique used for estimating parameters of high energy physics Monte Carlo generators. Parameterized neural network reweighting can be used to extend this fitting procedure to many dimensions and does not require binning. If the fit is to be performed using reconstructed data, then expensive detector simulations must be used for training the neural networks. We introduce a new two-level fitting approach that only requires one dataset with detector simulation and then a set of additional generation-level datasets without detector effects included. This Simulation-level fit based on Reweighting Generator-level events with Neural networks (SRGN) is demonstrated using simulated datasets for a variety of examples including a simple Gaussian random variable, parton shower tuning, and the top quark mass extraction.


I. INTRODUCTION
Synthetic data produced from Monte Carlo (MC) generators are a key tool for statistical analysis in high energy particle physics. These MC generators have a number of parameters that can be measured by producing multiple synthetic datasets and finding the one that agrees best with data. This procedure can be computationally expensive, especially when detector simulations are involved. In some cases, one can avoid detector simulations by using unfolded data for parameter estimation. Until recently [1,2], unfolding methods were only available for low-dimensional or binned data. Even with the advent of new methods, one can achieve a higher precision with folding instead of unfolding. For example, template-based fitting is the standard approach for extracting the top quark mass [3], one of the most precisely measured quantities at hadron colliders.¹
Machine learning may provide a solution to the simulation challenge. One possibility is to replace or augment synthetic data from slow physics-based generators with synthetic data generated from neural generative models [2, …]. This requires neural networks to learn p(data|parameters) accurately, which is a difficult task. An alternative solution is to instead learn the ratio p(data|parameters)/p(data|reference), where the reference may be a particular synthetic dataset generated with a fixed set of parameters. It is well-known [37,38] (also in high energy physics [1,39–49]) that a suitably structured and trained neural network-based classifier learns to approximate this likelihood ratio, so one can turn the difficult problem of probability density estimation into the relatively easier task of classification. Applying this idea to full phase space reweighting and parameter estimation was recently proposed with the Deep neural networks using Classification for Tuning and Reweighting (Dctr) protocol [39]. When used to perform an unbinned fit, the original Dctr algorithm first learns a parameterized reweighting function and then continuously (and differentiably) modifies the MC generator parameters until the classifier loss used to define the reweighting function is optimized. The Dctr fitting protocol is effective because it factorizes the reweighting and fitting steps. Furthermore, the fit can be performed with gradient-based methods due to the differentiability of neural networks. However, a key challenge with this approach is that the reweighting function must be trained with data of the same type as the data used in the fit. In other words, if the fit is performed with data at detector-level, the reweighting function must be trained with a large number of synthetic examples that include detector effects. As detector simulations can be computationally expensive, this can be a significant challenge.

* ajandreassen@google.com
† schsu@uw.edu
‡ bpnachman@lbl.gov
§ nsuaysom@uw.edu
¶ adisurtya@berkeley.edu
¹ Even with ∼ 1 GeV theoretical ambiguity [4,5], the uncertainty is still at the 0.5% level.
We propose a new approach whereby only one synthetic dataset with detector effects ('simulation') is required and all of the reweighting is performed at particle-level ('generation'), following the nomenclature from Ref. [1]. This new Simulation-level fit based on Reweighting Generator-level events with Neural networks (Srgn) approach still factorizes the problem into a reweighting step and a fitting step, except that now each step includes training classifiers: one at generator-level and one at simulation-level, respectively. This approach is the same as Dctr in the reweighting step but differs in the fitting step. In the form proposed in this paper, the fitting step is not differentiable, but it is amenable to non-gradient-based optimization procedures. Given the computational efficiency of particle-level generation compared with detector-level simulation, this approach will enable new fitting strategies for analyses like the top quark mass measurement, related tasks at the Large Hadron Collider (LHC), and beyond.

This paper is organized as follows. Section II reviews neural network reweighting and introduces the new two-level approach for incorporating detector effects. A variety of numerical results are presented in Sec. III. In particular, (1) a simple Gaussian example is used to first demonstrate the salient features of the new approach, then (2) parton shower tuning provides a high-dimensional example without detector effects, and finally (3) the top quark mass measurement is deployed for a multi-dimensional use case including detector effects. The paper ends with conclusions and outlook in Sec. IV.

II. NEURAL NETWORK REWEIGHTING AND DETECTOR EFFECTS
Suppose that features X ∈ R^N follow a probability density p(x|θ), where θ are parameters of the model. A reweighting function w_θ0(x, θ) is designed so that a sample drawn from p(x|θ0) weighted by w is statistically identical to a sample drawn from p(x|θ). The ideal reweighting function is w_θ0(x, θ) = p(x|θ)/p(x|θ0). One strategy for constructing w is to model the probability density p(x|θ) and then take the ratio². Density estimation is a significant challenge, especially in the case of collision events, where X is a variable-length and high-dimensional object and p(x) has significant symmetry. One solution is to turn the challenge of density estimation into the relatively easier task of classification. Suppose that f is a neural network trained to distinguish between a sample of events drawn from p(x|θ) and a sample of events drawn from p(x|θ0). If f is trained using the binary cross entropy loss function

  L[f] = −Σ_{x ∼ p(x|θ)} log f(x) − Σ_{x ∼ p(x|θ0)} log(1 − f(x)),  (1)

then with a flexible enough architecture, an effective training protocol, and sufficient training data,

  f(x) ≈ p(x|θ) / (p(x|θ) + p(x|θ0)).  (2)

Therefore, one can construct w using f as w_θ0(x, θ) = f(x)/(1 − f(x)) ≈ p(x|θ)/p(x|θ0). Furthermore, if the training of f includes a variety of values of θ, then it will naturally learn to interpolate and become f(x, θ); consequently, w becomes a parameterized reweighting function.
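The likelihood-ratio trick described above can be sketched numerically. The toy example below (not from the paper's code) trains a logistic classifier with the binary cross entropy loss to separate two unit-width Gaussians and then uses f/(1 − f) as the weight; the model, sample sizes, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
theta0, theta = 0.0, 0.5            # nominal and target Gaussian means
x0 = rng.normal(theta0, 1, 50000)   # reference sample, label 0
x1 = rng.normal(theta, 1, 50000)    # target sample, label 1
x = np.concatenate([x0, x1])
y = np.concatenate([np.zeros_like(x0), np.ones_like(x1)])

# Logistic classifier f(x) = sigmoid(a*x + b), trained by gradient
# descent on the binary cross entropy (Eq. 1 with a linear logit).
a, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    f = 1 / (1 + np.exp(-(a * x + b)))
    grad = f - y                    # d(BCE)/d(logit), averaged below
    a -= lr * np.mean(grad * x)
    b -= lr * np.mean(grad)

def w(z):
    """Likelihood-ratio trick: f/(1-f) approximates p(z|theta)/p(z|theta0)."""
    f = 1 / (1 + np.exp(-(a * z + b)))
    return f / (1 - f)

# Exact ratio for unit-width Gaussians, for comparison.
def true_w(z):
    return np.exp((theta - theta0) * z - (theta**2 - theta0**2) / 2)

print(w(1.0), true_w(1.0))
```

For Gaussian samples the log-likelihood ratio is exactly linear in x, so this tiny model is well specified; in the paper's setting the linear logit is replaced by a deep network.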
The original Dctr fitting protocol is expressed symbolically as an optimization of the classifier loss over the generator parameters (Eq. 3), where θ⋆, the true parameter value, is not known. If f is the optimal classifier, then θ*_Dctr = θ⋆.

Detector effects distort the feature space. Let X_S|X_G ∈ R^M represent simulation-level features given generator-level features X_G. In synthetic samples, we have the corresponding pairs of X_G and X_S for every collision event. However, X_G is not known for real data. Therefore, it would be ideal to perform the fit using X_S but perform the reweighting using X_G, as reweighting only requires events from generation.
The Srgn protocol is a two-step procedure as illustrated in Fig. 1. First, a reweighting function is trained. Then, a classifier is trained to distinguish the target data from reweighted simulation. As this classifier is trained, the parameters θ are also modified. When the classifier is unable to distinguish the target data from the reweighted simulation, then the current parameters are the fitted parameters.
FIG. 1. An illustration of Srgn, applied to a set of synthetic and natural data. There is one synthetic dataset where the particle-level data ("generation") are passed through a detector emulation ("simulation"). Srgn is a two-step process. First, a parameterized reweighting function is learned using the generation dataset and a set of additional synthetic generator-level datasets. Second, the synthetic simulation is reweighted and compared with real data, iterating to converge on the parameters θ. Illustration is inspired by Ref. [1].
Symbolically, suppose that w_θ0(x_G, θ) is a reweighting function learned at generator-level, where θ0 represents the nominal parameter for the synthetic sample. Furthermore, suppose that g is a neural network trained with the weighted binary cross entropy loss

  L[g] = −Σ_{x ∼ θ⋆} log g(x_S) − Σ_{x ∼ θ0} w_θ0(x_G, θ) log(1 − g(x_S)).  (4)

Then,

  θ*_Srgn = argmin_θ [Pr(g(X_S^{θ⋆}) > g(X_S^{θ0, w}))],  (5)

where the quantity in [·] is typically called the area under the receiver operating characteristic curve, or AUC. We calculate the AUC between g's predictions on events from the unknown sample and g's predictions on reweighted events from the synthetic sample; effectively, if we reweight events from the synthetic sample θ0 to events from θ, then we calculate the AUC between g's predictions on θ and g's predictions on θ⋆. In analogy to Eq. 3, one might think to define θ*_Srgn as the value of θ that maximizes the loss in Eq. 4. This would make the Srgn procedure differentiable, in contrast to Eq. 5 (the AUC is not generically differentiable). However, one can show that (see Appendix A)

  g(x_S) = p / (p + (1 − p) w̄(x_S)),  (6)

where p = Pr(θ = θ⋆|x_S) and w̄(x_S) = E[w_θ0(X_G, θ)|x_S, θ0]. When w = 1, Eq. 6 is the usual result that the classifier is simply the probability of the target class given the features. Plugging Eq. 6 into Eq. 4 and optimizing with respect to θ does not generally result in θ* = θ⋆ (see Appendix B). The Srgn result defined by Eq. 5 achieves θ*_Srgn = θ⋆ when the features x_G include the full phase space, defined below.
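A minimal numerical sketch of the two-level fitting step is below. For transparency it replaces both networks with analytically known quantities: the learned generator-level weight w is replaced by the exact Gaussian density ratio, and the classifier score g by the observable itself (so the AUC is symmetrized, since a fixed score cannot be retrained per θ). All sample sizes, the smearing width, and the scan grid are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)
n, eps = 20000, 0.5
theta_true, theta0 = 1.0, 0.0

# "Data": available at simulation-level only, generated at theta_true.
data_S = rng.normal(theta_true, 1, n) + rng.normal(0, eps, n)

# One synthetic sample with both levels, generated at theta0.
sim_G = rng.normal(theta0, 1, n)
sim_S = sim_G + rng.normal(0, eps, n)

def auc(s_data, s_sim, w_sim, m=2000):
    """Weighted AUC using the observable itself as the classifier score."""
    diff = s_data[:m, None] > s_sim[None, :m]
    ww = w_sim[:m]
    a = (diff * ww).sum() / (m * ww.sum())
    return max(a, 1 - a)   # symmetrize: a fixed score has no preferred sign

# Scan theta: weights computed at generator-level, AUC at simulation-level.
thetas = np.arange(0.0, 2.01, 0.25)
aucs = [auc(data_S, sim_S,
            np.exp((t - theta0) * sim_G - (t**2 - theta0**2) / 2))
        for t in thetas]
best = thetas[int(np.argmin(aucs))]
print(best)  # minimum near theta_true = 1.0
```

The key structural point survives the simplifications: the weights depend only on generator-level quantities, while the comparison with data happens entirely at simulation-level.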
The probability density of the features x_S weighted by w_θ0(x_G, θ) is given by

  p_w(x_S) = ∫ dx_G p(x_S|x_G, θ0) p(x_G|θ0) w_θ0(x_G, θ)  (7)
           ≈ ∫ dx_G p(x_S|x_G, θ0) p(x_G|θ0) p(x_G|θ)/p(x_G|θ0)  (8)
           = ∫ dx_G p(x_S|x_G, θ0) p(x_G|θ),  (9)

where the approximation in Eq. 8 depends on the fidelity of the neural network optimization. Equation 9 is equal to p(x_S|θ) if p(x_S|x_G, θ0) = p(x_S|x_G, θ); in this case θ*_Srgn = θ⋆. The equality p(x_S|x_G, θ0) = p(x_S|x_G, θ) holds if x_G contains all of the relevant information about the detector response, so that changing θ has no impact on the resolution. In this case, the feature space is said to contain the full phase space (later denoted Ω). Note that it is common in experimental analyses to perform generator-level reweighting for estimating theoretical modeling uncertainties. These reweighting schemes typically use histograms and are therefore constrained to one- or two-dimensional feature spaces. The above calculation suggests³ that this is likely insufficient for an unbiased estimate of the impact on simulation-level quantities.
The various properties of the Srgn method will be illustrated in the next section with a variety of examples.

III. RESULTS
Three sets of examples are used to illustrate various aspects of the Srgn method. First, simple Gaussian examples are used, where the probability density is known and thus the reweighting function can be computed analytically. The features of Srgn described in the previous section are explored with these examples. The parton shower examples from Ref. [39] are used as a second example; these show how the new method can be effective with high-dimensional features, but they do not incorporate detector effects. A measurement of the top quark mass is used as a third example to demonstrate both multivariate fitting and detector effects.
The Srgn protocol calls for two neural networks: one called f that is used to construct the reweighting function w and another called g that is used to perform the fit. These neural networks are implemented using Keras [51] with the Tensorflow backend [52] and optimized with Adam [53]. Networks are trained using the binary cross entropy loss function. The network architectures vary by example and are described below.

A. Gaussian Example
The generator-level feature space is one-dimensional and follows a Gaussian distribution: X_G ∼ N(µ, σ²). Detector effects are modeled as independent Gaussian noise:

  X_S = X_G + Z,  Z ∼ N(0, ε²).

The detector smearing ε and the generator width σ = 1 are known, but µ is not known. In this case, the reweighting function can be computed analytically:

  w_µ0(x, µ) = exp(−(x − µ)²/2 + (x − µ0)²/2).  (10)

The parameterized reweighting is trained with µ values sampled uniformly at random in the range [−2, 2]. One million examples are used for both data and the nominal synthetic dataset, and ε = 0.5. These data for µ = 0 are presented in Fig. 2. A reweighting function is derived using a neural network with 3 hidden layers of 50 nodes each. Rectified Linear Units (ReLU) connect the intermediate layers and the output is Softmax. The network is trained for 200 epochs with early stopping using a patience of 10. The batch size is 10⁵. A comparison of the analytical (Eq. 10) and learned reweighting is shown in Fig. 3, using weights based on generator-level features in both cases. The reweighting effectively morphs the µ = 0 distribution into one with µ = 1.5.

The goal of Srgn is to use simulated features with reweighting based on generator-level features. This is explored in Fig. 4. In order to show that the features need not be the same at generator-level and simulation-level, X_G is two-dimensional. We construct the detector response such that the simulation-level observable X_S depends explicitly on the primary generator-level feature, but its detector smearing depends on the secondary generator-level feature. That is, detector effects are non-uniform and depend on the generator-level parameter(s). In particular, we choose the primary generator-level feature X_G,0 ∼ N(µ, 1) and the secondary generator-level feature X_G,1 ∼ N(0, ν²), where ν = (ω0 + ω1 µ)² for two constants ω0 and ω1. (Specifically, we choose ω0 = 0.7 and ω1 = 0.2 for this example.)
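The analytic weight (Eq. 10) can be checked directly: for unit-width Gaussians the exact weight is the density ratio exp((µ − µ0)x − (µ² − µ0²)/2), and weighting a µ0 = 0 sample by it reproduces the moments of a µ = 1.5 Gaussian. This standalone check is not part of the paper's code.

```python
import numpy as np

rng = np.random.default_rng(2)
mu0, mu = 0.0, 1.5
x = rng.normal(mu0, 1.0, 10**6)        # nominal generator-level sample

# Exact density ratio for unit-width Gaussians (analytic w, Eq. 10):
w = np.exp(-(x - mu)**2 / 2 + (x - mu0)**2 / 2)

mean = np.average(x, weights=w)
var = np.average((x - mean)**2, weights=w)
print(mean, var)  # close to the target mu = 1.5 and unit variance
```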
Then, on a per-event basis, detector effects are emulated by X_S = X_G,0 + Z, where Z ∼ N(4|x_G,1|, (x_G,1)⁴); the term 4|x_G,1| represents a net shift bias and (x_G,1)² represents a smearing bias. Importantly, the resolution depends on the secondary generator-level feature. Figure 4 shows the result of a reweighting derived at generator-level for ten million events, using the same architecture and training procedure as the previous example. By construction, both the smearing and the shifting are more intense for the µ = 1.5 simulation-level distribution. When using both generator-level features (constituting the full phase space Ω), the reweighting is successful. However, if only the primary generator-level feature is used for w, then the reweighting fails to reproduce the simulation-level probability density.

So far, the results have only illustrated the efficacy of reweighting; the rest of the plots in this section demonstrate how the reweighting can be used for fitting. To begin, the one-dimensional generator-level setup is used for the fit. The fitting data consist of one million events with ε = 0.5 for detector effects. Then, a classifier is trained with different values of µ to distinguish the unknown dataset from the reweighted synthetic dataset, and the AUC from Eq. 5 is plotted as a function of µ for a fit at both generator-level and simulation-level. The architecture of this neural network consists of 2 hidden layers of 128 nodes each. Rectified Linear Units (ReLU) connect the intermediate layers and the output is a sigmoid. The network is trained for 200 epochs with early stopping using a patience of 5. The batch size is 1000. In both cases, the reweighting is performed at generator-level. Figure 5 illustrates several aspects of the proposed fitting method with Srgn. First, the minimum of the AUC is 0.5 and occurs at µ = 1 in both cases, which is the correct value.
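The full-phase-space requirement can be demonstrated numerically with the two-feature construction above. The sketch below uses exact generator-level density ratios as stand-ins for the learned weights; the target value µ = −1.5 (rather than the µ = 1.5 shown in the figures) is an illustrative choice that keeps the importance weights well behaved. Weighting with both features reproduces the simulation-level mean of the target, while weighting with the primary feature alone does not.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10**6
om0, om1 = 0.7, 0.2                 # omega_0, omega_1 from the text
mu0, mu = 0.0, -1.5                 # nominal and (illustrative) target means

def nu(m):
    return (om0 + om1 * m)**2       # width of the secondary feature

def generate(m, size):
    xg0 = rng.normal(m, 1, size)                  # primary feature
    xg1 = rng.normal(0, nu(m), size)              # secondary feature
    z = rng.normal(4 * np.abs(xg1), xg1**2)       # shift and smearing biases
    return xg0, xg1, xg0 + z                      # simulation-level observable

# Target sample (plays the role of data) and nominal synthetic sample.
_, _, data_S = generate(mu, n)
g0, g1, sim_S = generate(mu0, n)

def gauss(x, m, s):
    return np.exp(-(x - m)**2 / (2 * s**2)) / s

# Full-phase-space weight: ratio of the 2D generator-level densities.
w_full = (gauss(g0, mu, 1) / gauss(g0, mu0, 1)) * \
         (gauss(g1, 0, nu(mu)) / gauss(g1, 0, nu(mu0)))
# Primary-feature-only weight: ignores the feature driving the resolution.
w_prim = gauss(g0, mu, 1) / gauss(g0, mu0, 1)

m_data = data_S.mean()
m_full = np.average(sim_S, weights=w_full)
m_prim = np.average(sim_S, weights=w_prim)
print(m_data, m_full, m_prim)
```

Because the detector shift 4|x_G,1| changes with ν(µ), the primary-only weights leave the secondary feature at its nominal width and the weighted simulation-level mean is visibly biased.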
Second, the rise in the AUC function away from the minimum is steeper at generator-level than at simulation-level, as expected given the loss of statistical power from detector smearing. In addition to showing the AUC function, the values of fits using a non-differentiable optimizer are also presented as markers in Fig. 5. At both generator-level and simulation-level, the fit correctly identifies µ⋆ = 1. The numerical results of the one-dimensional Gaussian fit are presented in Table I; the reported measurements are the averages and standard deviations over 40 runs using a non-differentiable optimizer. A small number of runs resulted in defective optimization terminating at the fitting bounds; these were identified and removed by examining values outside a 2σ window around the mean as well as all values within 5% of the range of the bounds.

As a next illustration, a fit is performed for both µ and σ. A two-dimensional reweighting function is parameterized in the one-dimensional Gaussian mean and standard deviation within the ranges [−2, 2] and [0.25, 4], respectively. The reweighting function can still be computed analytically and is given by

  w_{µ0,σ0}(x, µ, σ) = (σ0/σ) exp(−(x − µ)²/(2σ²) + (x − µ0)²/(2σ0²)),  (11)

where µ0 and σ0 denote the nominal values for the Gaussian distribution. As before, one million events are used for the fit and detector effects are modeled with ε = 0.5. The efficacy of a two-dimensional reweighting function is presented in Fig. 6 for a case with µ0 = 0, σ0 = 1. The neural network weights are just as effective as the analytic weights at morphing the default distribution into a distribution with µ = 1 and σ = 1.25. A two-dimensional fit to µ and σ is demonstrated in Fig. 7. The AUC function is minimized at the correct values of µ = −1 and σ = 0.75 for both generator-level and simulation-level, with the reweighting function derived at generator-level in both cases. The contours in Fig. 7 indicate that the AUC function rises more steeply away from the minimum at generator-level, as would be expected from the enhanced statistical power of the dataset without detector effects. The numerical results of the two-dimensional Gaussian fit are presented in Table II. The reported measurements and uncertainties are again the averages and standard deviations over 40 runs using a non-differentiable optimizer. The same procedure for identifying and removing outliers as in the one-dimensional fit was employed.
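As a quick check of the two-parameter analytic weight, reweighting a standard normal sample to µ = 1, σ = 1.25 (the case shown in Fig. 6) reproduces the target moments. This snippet is illustrative and independent of the paper's code.

```python
import numpy as np

rng = np.random.default_rng(4)
mu0, s0 = 0.0, 1.0        # nominal Gaussian parameters
mu, s = 1.0, 1.25         # target parameters (Fig. 6 case)
x = rng.normal(mu0, s0, 10**6)

# Two-parameter analytic weight: ratio of N(mu, s^2) to N(mu0, s0^2).
w = (s0 / s) * np.exp(-(x - mu)**2 / (2 * s**2)
                      + (x - mu0)**2 / (2 * s0**2))

m = np.average(x, weights=w)
sd = np.sqrt(np.average((x - m)**2, weights=w))
print(m, sd)  # close to the target mu = 1.0 and sigma = 1.25
```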

B. Parton Shower Monte Carlo Tuning
The parton shower tuning examples from Ref. [39] are presented in this section. There are no detector effects, but we show that the new fitting methodology works with high-dimensional features and in particular can be integrated with particle flow networks [54] which are based on deep sets [55]. The event generator details can be found in Ref. [39] and are briefly reviewed here. In particular, e + e − → Z → dijets are generated using Pythia 8.230 [56,57] and anti-k t [58] R = 0.8 jets are clustered using FastJet 3.03 [59,60]. The jets are presented to the neural network for training, with each jet constituent represented by (p T , η, φ, particle type, θ), where θ are the generator parameters to be determined. The neural network setup is the same as in Ref. [39], which uses the default particle flow network parameters from Ref. [54].
The default generator parameters follow the Monash tune [61]. Three representative generator parameters are used here to illustrate the Srgn fitting procedure. First, TimeShower:alphaSvalue is varied to illustrate a parameter that has a significant impact on the entire phase space and is thus relatively easy to tune. Second, StringZ:aLund is a parameter that also impacts the entire phase space, but to a lesser extent than the strong coupling constant used in final state radiation. Finally, StringFlav:probStoUD is a parameter that has a large impact on a narrow region of phase space. The Monash tune values of the three parameters are 0.1365, 0.68, and 0.217, respectively. For TimeShower:alphaSvalue and StringFlav:probStoUD, two nearly sufficient one-dimensional statistics are known: the number of particles inside the jets and the number of strange hadrons, respectively. Fits using these simple observables will be compared with the full phase space fit below.
Generator-level features illustrating variations in each of the three parameters are shown in Figure 9. The full phase space will be used in the fit, but these are representative features to illustrate the effects of parameter variations. These features are the same as used in Ref. [39]: the number of particles inside the jet (multiplicity), the number of kaons inside the jet, an n-subjettiness ratio τ2/τ1 [62,63], and a four-point Energy Correlation Function using angular exponent β = 4 [64], ECF(N = 3, β = 4). As advertised, the final state shower αs and hadronization parameters affect all four observables, with a bigger shift from αs. In contrast, the strangeness parameter only affects the number of kaons and has no impact on the other observables. To perform a given fit, we scan the AUC as a function of the parameter to search for the minimum; the step sizes are 0.001, 0.01, and 0.005 for TimeShower:alphaSvalue, StringZ:aLund, and StringFlav:probStoUD, respectively.
One-dimensional fits to each of the three parton shower parameters are shown in Fig. 8. Since TimeShower:alphaSvalue has such a large effect on the phase space, it is the most precisely measured parameter, as indicated by the steepness of the AUC curve near the minimum. The steepness of the full phase space fit also shows that it carries slightly more information than multiplicity alone. The StringZ:aLund parameter has the smallest effect on the phase space of all three parameters and is thus the least precisely measured. StringFlav:probStoUD primarily affects the number of strange particles, so the full phase space does not offer much more information than the number of strange hadrons alone, and the precision is comparable for both approaches. The reported measurements and plots are the averages and standard deviations over 40 runs, each with a different reweighting function and classifier that differed only in their random initialization. A small number of the runs resulted in defective reweighting functions; these were identified and removed by examining the runs with fitted values outside a 2σ window around the mean. Across the 40 runs, most of the results clustered around the mean, and the outliers look systematically different from the fits with effective reweighting functions.
The numerical results of the three fits are presented in Table III. The fitted values are statistically consistent with the target values and the uncertainties are generally comparable to or smaller than the values from the original Dctr protocol [39].

C. Top Quark Mass
Top quark pair production is generated using Pythia 8.230 [56,57], and detector effects are modeled using Delphes 3.4.1 [65–67] with the default CMS run card. One of the W bosons is forced to decay to µ + νµ while the other W boson decays hadronically. Each event is recorded as a variable-length set of objects consisting of jets, muons, and neutrinos. At simulation-level, the neutrino is replaced with the missing transverse momentum. Generator-level and simulation-level jets are clustered with the anti-kt algorithm using R = 0.4 and are labeled as b-tagged if the highest-energy parton inside the jet cone (∆R < 0.5) is a b quark. Jets are required to have pT > 20 GeV and can only be b-tagged if |η| < 2.5. Furthermore, jets overlapping with the muon are removed.
Events are only saved if they have at least two b-tagged jets and at least two additional non-b-tagged jets. Four observables are formed for the top quark mass extraction. First, the b-jet closest to the muon is labeled b1. Of the remaining b-tagged jets, the highest-pT one is labeled b2. The two highest-pT non-b-tagged jets⁴ are labeled j1 and j2. The four observables are given by m_b1µν, m_b2µν, m_b1j1j2, and m_b2j1j2, where the four-momentum of the detector-level neutrino is determined by solving the quadratic equation for the W boson mass⁵.
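The W-mass constraint mentioned above leads to a quadratic equation for the unmeasured longitudinal neutrino momentum. A standalone sketch of this standard reconstruction is below; the massless-lepton approximation, the W mass value, and the toy kinematics are assumptions for illustration, not the paper's implementation.

```python
import math

MW = 80.4  # assumed W boson mass in GeV

def neutrino_pz(E, px, py, pz, metx, mety):
    """Solve (p_mu + p_nu)^2 = MW^2 for the neutrino longitudinal momentum.

    The muon (E, px, py, pz) and neutrino are treated as massless; the
    neutrino transverse momentum is identified with the MET (metx, mety).
    Returns both quadratic roots (the real part if the discriminant is
    negative, as is common practice when resolution pushes it below zero).
    """
    lam = 0.5 * MW**2 + px * metx + py * mety
    pt2 = px**2 + py**2
    disc = lam**2 - pt2 * (metx**2 + mety**2)
    root = math.sqrt(max(disc, 0.0))
    return ((lam * pz + E * root) / pt2,
            (lam * pz - E * root) / pt2)

# Toy check: W decaying at rest into a massless muon-neutrino pair,
# so the MET equals the true neutrino transverse momentum exactly.
px, py = 20.0, 10.0
E = 0.5 * MW
pz = math.sqrt(E**2 - px**2 - py**2)
sols = neutrino_pz(E, px, py, pz, -px, -py)
print(sols)  # one solution recovers the true neutrino pz = -pz
```

With real detector resolution the two-fold ambiguity is typically resolved by an auxiliary criterion (e.g. the smaller |pz| solution), which is a choice left to the analysis.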
Histograms of the four observables for generator-level and simulation-level are presented in Fig. 11. At both particle-level and detector-level, one can see that varying the top quark mass Mt has the greatest effect on m_b1µν and m_b2µν, as opposed to m_b2j1j2 and m_b1j1j2. However, the latter two still have some visible dependence on Mt. Therefore, it is expected that fitting all four observables (denoted O4 = {m_b1µν, m_b2µν, m_b2j1j2, m_b1j1j2}) should yield a more precise fit than fitting any single one.

The application of the Srgn technique to the top quark mass fit is presented in Fig. 10. Both the reweighting and classifying neural networks are implemented identically to the Gaussian example, with the exception of increasing the early stopping patience to 20. To perform a given fit, we scan the AUC as a function of the top quark mass with a step size of 0.1 GeV to search for the minimum. In all cases, the fitted value agrees with the correct mass, Mt = 175 GeV. The top plot in Fig. 10 shows that the generator-level fit is much more precise than the simulation-level fit, based on the curvature of the AUC near the minimum. The other two plots in the figure demonstrate the superior precision of the four-dimensional fit compared with the one-dimensional fit. The same ensembling and outlier removal procedure is applied here as in the previous section. Horizontal error bars are the standard deviation across 40 runs (outliers removed), each with a different random initialization.
Numerical values for the top quark mass fit are presented in Table IV.

IV. CONCLUSIONS AND OUTLOOK
This paper addresses a key challenge with simulation-based inference in the presence of detector effects. In particular, detector simulations are computationally expensive, so it is desirable to construct a method that uses as little detector simulation as possible. We have introduced the Srgn approach, which requires only one synthetic event sample with a detector simulation; all other synthetic event samples need only be known at generator-level. A variety of similar methods have been proposed in Refs. [39, 43–46], but they typically require many synthetic event samples with detector simulation.
The Srgn protocol is unbinned and can process multidimensional feature spaces and parameter spaces. In its current form, there is a non-differentiable step required to optimize the area under the receiver operating characteristic curve. Future refinements of this method may result in a fully differentiable pipeline.

CODE AND DATA
The code for this paper can be found at https://github.com/hep-lbdl/SRGN.