Machine learning-based event generator for electron-proton scattering

We present a new machine learning-based Monte Carlo event generator using generative adversarial networks (GANs) that can be trained with calibrated detector simulations to construct a vertex-level event generator free of theoretical assumptions about femtometer scale physics. Our framework includes a GAN-based detector folding as a fast-surrogate model that mimics detector simulators. The framework is tested and validated on simulated inclusive deep-inelastic scattering data along with existing parametrizations for detector simulation, with uncertainty quantification based on a statistical bootstrapping technique. Our results provide for the first time a realistic proof-of-concept to mitigate theory bias in inferring vertex-level event distributions needed to reconstruct physical observables.


I. INTRODUCTION
Since the early 1970s, Monte Carlo event generators (MCEGs) have played a vital role in facilitating studies of QCD in high-energy scattering processes. From the experimental perspective, MCEGs are a crucial part of the procedure used for modeling the detector response folded into measured quantities ("detector-level") to extract the true energies and momenta of final state particles as produced at the interaction point ("vertex-level"). The development of modern MCEGs, such as PYTHIA [2], HERWIG [3], and SHERPA [4], has been driven by a combination of high-precision experimental data and theoretical inputs.
The latter have involved a mix of perturbative QCD methods, describing the dynamics of quarks and gluons at short distances, and phenomenological models that map the transition from quarks and gluons to observable hadrons, as well as nonperturbative inputs such as parton distribution functions for applications involving hadrons in the initial state [5][6][7][8][9][10].
While the theoretical assumptions are usually well justified, an approach that mixes data with a model for the underlying physical law that we want to infer can potentially lead to biased results. Moreover, correcting for detector effects typically becomes increasingly difficult in higher dimensions and prevents a faithful reconstruction of vertex-level events in a model-independent way. In this work we present a novel approach to build an event-level interpolation tool based on machine learning (ML) that avoids theoretical assumptions about the femtometer-scale physics, together with a strategy to correct for detector effects at the event level.
MCEGs in general can be viewed as a type of data compactification utility, encapsulating large amounts of data collected from multiple experiments which can be regenerated. On the other hand, the reliance of existing MCEGs on theoretical assumptions of factorization and hadronization models limits their ability to capture the full range of possible correlations between the produced particles' momenta and spins. Moreover, existing theory-based MCEGs are limited in the scope of applications. For instance, to date no MCEG is able to reproduce all the possible single-spin or double-spin asymmetries in inclusive or semi-inclusive electron-proton deep-inelastic scattering (DIS).
Having an MCEG that faithfully simulates particle reactions by preserving all the correlations among the particles' momenta is extremely valuable for theoretical developments.
In practice, such correlations are integrated out or projected onto customized degrees of freedom, which effectively limits the ability to fully test the theory. In processes such as semi-inclusive DIS, for example, different regions of phase space are expected to be dominated by different physical mechanisms, and having access to all correlations among the particles' momenta is essential to understanding these mechanisms.
A detailed survey of ML-based event generators (MLEGs) can be found in Ref. [19]. A crucial feature of GANs (as well as generative models in general) is their ability to generate synthetic data by learning from real samples without explicitly knowing the underlying physical laws of the original system. We present a case study for inclusive DIS with realistic pseudodata generated from phenomenological models. We first train an MLEG that can faithfully reproduce the phase space of inclusive DIS along with uncertainty quantification (UQ) stemming from finite statistics and model architectures. Subsequently, we implement detector effects using an effective parametrization of detectors and train the MLEG and folding algorithms on simulated detector-level DIS events. For the first time, a closure test for reconstructing vertex-level DIS events, free of theoretical assumptions, is performed.
The results provide a new opportunity for experimental data analysis to use the GAN approach to build theory-free event generators which mitigate biases induced in reconstructing physical observables from experimental data. Moreover, the technique provides a new form of data representation that can be easily distributed, in contrast to the traditional data representation via histograms, which is of limited use for processes with high-dimensional phase space.
We begin the discussion in Sec. II with a schematic overview of the MLEG training with our GAN-based event-level interpolator. This is followed in Sec. III by a description of the ML detector surrogate that we use in order to simulate the effects of real particle detectors.
The application to inclusive electron-proton DIS is discussed in Sec. IV, where we examine GAN training both without and with detector effects. In Sec. V we summarize our findings and discuss future extensions and applications.

II. GAN-BASED EVENT-LEVEL INTERPOLATOR
A schematic view of the training workflow of our MLEG GAN is illustrated in Fig. 1, where, as usual, the GAN model is composed of a generator and a discriminator. The generator converts noise through a deep neural network into event-level features, which are customized for a given reaction. The generated event features are then passed into a detector simulator to convert them into "trial" detector-level events. The discriminator learns through another deep neural network to differentiate the true detector-level event samples from the ones produced by the generator and the detector simulator. The GAN training evolves as the generator and discriminator compete adversarially, each updating its parameters during the training process. Eventually, the generator is able to produce synthetic samples that the discriminator can no longer distinguish from the real samples, at which point the training of the MLEG is complete.
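The data flow of Fig. 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_generator`, `toy_detector`, and `build_discriminator_batch` are hypothetical stand-ins (closed-form functions rather than neural networks or a real detector simulator), used to show which samples reach the discriminator and how they are labeled.

```python
import random

def toy_generator(noise):
    """Hypothetical stand-in for the generator network: noise -> 2 event features."""
    return (sum(noise) / len(noise), max(noise) - min(noise))

def toy_detector(event, rng, resolution=0.05):
    """Hypothetical stand-in for the detector simulator: Gaussian smearing."""
    return tuple(f + rng.gauss(0.0, resolution) for f in event)

def build_discriminator_batch(real_detector_events, batch_size, noise_dim, rng):
    """Return (features, label) pairs: label 1 for real events, 0 for generated ones."""
    batch = [(ev, 1) for ev in rng.sample(real_detector_events, batch_size)]
    for _ in range(batch_size):
        noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
        vertex_event = toy_generator(noise)            # vertex-level features
        trial_event = toy_detector(vertex_event, rng)  # "trial" detector-level event
        batch.append((trial_event, 0))
    return batch

rng = random.Random(0)
# Toy "true" detector-level events; real training would use measured or simulated data.
real_events = [(rng.gauss(0.0, 1.0), rng.gauss(3.0, 1.0)) for _ in range(1000)]
batch = build_discriminator_batch(real_events, batch_size=8, noise_dim=100, rng=rng)
```

The discriminator only ever sees detector-level quantities, so the vertex-level output of the generator is constrained indirectly through the detector step.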
Although GANs have demonstrated impressive results in various applications, including generating near-realistic images [20], music [21], and videos [22], training a successful GAN model is known to be notoriously difficult. Many GAN models suffer from major problems, such as mode collapse, non-convergence, model parameter oscillation, destabilization, vanishing gradient, and over-fitting due to unbalanced training of the generator and discriminator. Approaches and techniques to address these general problems have been proposed and discussed recently in the literature [23][24][25][26][27].
Unlike common GAN applications, such as the generation of realistic high resolution images, the success of our GAN application as a nuclear and high-energy physics event generator relies on its ability to faithfully reproduce correlations among the particles' momenta, which becomes increasingly difficult in higher (greater than one or two) dimensions. At the same time, the corresponding multidimensional momentum distributions or histograms display rapid changes in the phase space that span several orders of magnitude. The challenge is then to design suitable GAN architectures capable of reproducing all of the correlations among the particles, along with a faithful reproduction of the multidimensional histograms across the phase space. In Sec. IV we will discuss in detail how to customize this for our specific application of inclusive DIS.

III. ML DETECTOR SURROGATE
Experimental data, provided in the form of final state particle momenta, are affected by distortions introduced by experimental detectors. A correction procedure is usually necessary to extract the true information from the measured cross sections and provide the vertex-level distributions used in physics analysis. Such detector effects have multiple causes, including limited acceptance, finite resolution, efficiency distortion, and bin migrations due to radiation and rescattering. Corrections are commonly taken into account using unfolding procedures that attempt to correct for the detector effects at the histogram level, which requires ad hoc corrections for each type of observable.
In order to demonstrate that our framework is realizable in a real experimental analysis, such detector effects must be incorporated. For this purpose, we use the "eic-smear" software package [39], which was developed at Brookhaven National Laboratory as a fast simulation tool for the future Electron-Ion Collider [40], and provides a simplified parametrization of the response of the detectors.
We develop ML-based detector surrogates using a secondary conditional GAN, as illustrated in Fig. 2. The idea is to train a conditional generator simulating the smearing effect of the detector by converting input vertex-level event features and noise into detector-level event features, as dictated by eic-smear. To do this we build training samples using trial vertex-level guess event samples and the associated eic-smear detector-level samples to train the conditional GAN. Once the conditional GAN is trained, the ML detector surrogate (represented by the dashed box in Fig. 2) can be integrated as the detector simulator in Fig. 1.
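The construction of the training pairs for the conditional GAN can be illustrated with a short sketch. It is not the eic-smear interface: `toy_smear` is a hypothetical Gaussian smearing function standing in for the detector parametrization, and `make_training_pairs` and `conditional_generator_input` are illustrative helper names. The 100-dimensional noise size follows the value quoted for the generators in Sec. IV.

```python
import random

NOISE_DIM = 100  # noise dimension quoted in the text for the generator inputs

def toy_smear(vertex_event, rng, resolution=0.1):
    """Hypothetical stand-in for eic-smear: independent Gaussian smearing."""
    return tuple(v + rng.gauss(0.0, resolution) for v in vertex_event)

def make_training_pairs(vertex_events, rng):
    """Pair each trial vertex-level event with its smeared detector-level event."""
    return [(v, toy_smear(v, rng)) for v in vertex_events]

def conditional_generator_input(vertex_event, rng):
    """Concatenate the conditioning vertex-level features with a fresh noise vector."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]
    return list(vertex_event) + noise

rng = random.Random(1)
# Trial vertex-level "guess" samples; the ranges here are arbitrary toy values.
vertex_events = [(rng.uniform(-5.0, 3.0), rng.uniform(0.0, 4.0)) for _ in range(500)]
pairs = make_training_pairs(vertex_events, rng)
gen_input = conditional_generator_input(pairs[0][0], rng)
```

Because a new noise vector is drawn for every call, the same vertex-level event can map to different detector-level outputs, which is what allows the surrogate to represent a stochastic response rather than a fixed mapping.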
It is worth noting that for a more realistic description of detector effects, the eic-smear parametrization should be replaced by a full GEANT-based [46] detector model. However, its integration within our MLEG models using standard ML libraries is beyond the scope of the present analysis, and will be the subject of future work.

IV. APPLICATION TO INCLUSIVE ELECTRON-PROTON SCATTERING
In this section we describe the application of our MLEG strategy to the inclusive unpolarized DIS of electrons (four-momentum $k$) from protons (four-momentum $P$). Our goal is solely to produce the scattered electron phase space, labeled by the four-momentum $k'$. As a surrogate for real experimental data, we use pseudodata generated from the JAM QCD global analysis framework [1] that has been tuned to describe world data on inclusive DIS and other high-energy scattering processes.
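For orientation, the standard DIS invariants can be computed directly from the incident electron, proton, and scattered electron four-momenta. The sketch below uses the usual definitions $Q^2 = -(k - k')^2$ and $x = Q^2 / 2P \cdot (k - k')$; the beam energies, the massless approximation, and the function names are illustrative assumptions, not values taken from the analysis.

```python
import math

def minkowski_dot(a, b):
    """Dot product with metric (+, -, -, -); vectors are (E, px, py, pz)."""
    return a[0] * b[0] - a[1] * b[1] - a[2] * b[2] - a[3] * b[3]

def dis_invariants(k, P, k_prime):
    """Return (Q2, x) with q = k - k', Q2 = -q.q and x = Q2 / (2 P.q)."""
    q = tuple(ki - kpi for ki, kpi in zip(k, k_prime))
    Q2 = -minkowski_dot(q, q)
    x = Q2 / (2.0 * minkowski_dot(P, q))
    return Q2, x

# Assumed collider-like beams in the massless approximation (illustrative values):
# 27.5 GeV electron along -z and 920 GeV proton along +z.
E_e, E_p = 27.5, 920.0
k = (E_e, 0.0, 0.0, -E_e)
P = (E_p, 0.0, 0.0, E_p)

# Scattered electron with energy 20 GeV at 90 degrees to the beam axis.
E_out, theta, phi = 20.0, math.pi / 2.0, 0.3
k_prime = (E_out,
           E_out * math.sin(theta) * math.cos(phi),
           E_out * math.sin(theta) * math.sin(phi),
           E_out * math.cos(theta))

Q2, x = dis_invariants(k, P, k_prime)
```

Both invariants are independent of the azimuthal angle, which is why an inclusive event is fully specified by two variables plus a uniformly distributed azimuth.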
The inclusive electron DIS samples are generated at a center of mass energy of 318.2 GeV, compatible with HERA kinematics, by integrating the two-dimensional differential cross section $d\sigma/dx\,dQ^2$, computed at next-to-leading order in perturbative QCD, using importance sampling, and unweighting events over a very dense binning in $(x, Q^2)$ space. Each event is transformed into an outgoing electron momentum in the HERA laboratory frame by generating an azimuthal angle relative to the beam axis sampled from a uniform distribution.
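The unweighting step can be illustrated with a simple accept-reject sketch. The weight function `toy_weight` is a hypothetical stand-in and is not the JAM next-to-leading order cross section; the proposal that is flat in $(\ln x, \ln Q^2)$ is one simple choice of importance sampling, and the kinematic limit on $Q^2$ at fixed $x$ is ignored for brevity.

```python
import math
import random

def toy_weight(x, Q2):
    """Hypothetical stand-in for dsigma/dx dQ2; NOT the JAM NLO cross section."""
    return (1.0 - x) ** 3 / (x * Q2 ** 2)

def generate_unweighted(n_events, rng, x_range=(1e-4, 0.5), Q2_range=(10.0, 1e4)):
    """Accept-reject unweighting with proposals flat in (ln x, ln Q2)."""
    lx = (math.log(x_range[0]), math.log(x_range[1]))
    lq = (math.log(Q2_range[0]), math.log(Q2_range[1]))
    # The density in (ln x, ln Q2) is weight * x * Q2 (Jacobian factor). For this
    # monotonically falling toy weight it is largest at the lower corner.
    f_max = toy_weight(x_range[0], Q2_range[0]) * x_range[0] * Q2_range[0]
    events = []
    while len(events) < n_events:
        x = math.exp(rng.uniform(*lx))
        Q2 = math.exp(rng.uniform(*lq))
        f = toy_weight(x, Q2) * x * Q2
        if rng.random() * f_max < f:
            phi = rng.uniform(0.0, 2.0 * math.pi)  # uniform azimuth about the beam axis
            events.append((x, Q2, phi))
    return events

events = generate_unweighted(200, random.Random(2))
```

Each accepted event carries unit weight, so the resulting sample can be used directly as GAN training data.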
While our ultimate goal is to apply this approach to real data, this case study provides unique insights into our ML workflow and allows us to identify challenges in formulating a suitable feature space to be learned by the model.
When training the GAN solely using the electron momentum in the laboratory frame as event features, the generator was found to create electron samples that violate momentum conservation near the edge of the phase space, and the model was not sensitive enough to prevent the production of these samples [45]. To alleviate this problem and aid the training, we use a change of variables that enhances the discriminator awareness in these difficult regions. Specifically, we define the scaled variables $\nu_1$ and $\nu_2$ through Eqs. (1), where $E_e$ is the incident electron energy, and $k'_0$ and $k'_z$ are the scattered electron energy and longitudinal momentum, respectively. In Eqs. (1) the energies and momenta in the arguments of the logarithms are implicitly in units of GeV. These variables can be easily inverted into the original momentum space. In particular, the variable $\nu_2$ changes rapidly as the energy of the outgoing electron approaches its limit, which makes the discriminator sensitive to this region.
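The idea of a logarithmic, invertible change of variables can be illustrated as follows. The explicit expressions of Eqs. (1) are not reproduced in this text, so the functional form used below is an assumption, chosen only because it has the properties described above (logarithms of energy-momentum combinations, a simple inverse, and a second variable that varies rapidly near the kinematic limit). The names `to_scaled`, `from_scaled`, and `E_REF` are hypothetical.

```python
import math

def to_scaled(k0, kz, e_ref):
    """Map (energy, longitudinal momentum) to (nu1, nu2); assumed illustrative form."""
    return math.log(k0 - kz), math.log(2.0 * e_ref - k0 - kz)

def from_scaled(nu1, nu2, e_ref):
    """Invert the assumed transform back to (energy, longitudinal momentum)."""
    a = math.exp(nu1)                # a = k0 - kz
    b = 2.0 * e_ref - math.exp(nu2)  # b = k0 + kz
    return 0.5 * (a + b), 0.5 * (b - a)

E_REF = 27.5  # assumed reference (incident electron) energy in GeV, illustrative only
nu1, nu2 = to_scaled(20.0, 5.0, E_REF)
k0, kz = from_scaled(nu1, nu2, E_REF)
```

With a form of this kind, the argument of the second logarithm shrinks toward zero as the electron approaches the edge of the phase space, so small changes in momentum there translate into large changes in the feature seen by the discriminator.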
In the following, we present details of our chosen ML architecture used for the event-level interpolation and the ML detector surrogate.
• MLEG: The input to the generator in Fig. 1
• ML detector surrogate: The detector surrogate model is based on a conditional GAN architecture [44]. As shown in Fig. 2, we have a generator that receives vertex-level event features as input in addition to a 100-dimensional white noise centered at 0 with unit standard deviation. The generator will learn to fold the inputs and produce detector-level events that mimic the detector response dictated by eic-smear. By conditioning the model on vertex-level event features we can enforce learning the correlations between vertex-level and detector-level events, as opposed to learning a deterministic mapping between inputs and outputs. As for the MLEG, the generator will produce a 2-neuron output corresponding to the detector-level variables $\nu_1$ and $\nu_2$, activated by a linear function representing the generated features, and the discriminator will similarly produce "1" or "0" for training and generated samples, respectively. In both the generator and discriminator architectures of the ML detector surrogate, we use the same number of hidden layers, neurons, dropout rates, and activation functions as in our MLEG. A similar idea of using a GAN for detector effects has been proposed by Bellagente et al. [43], where, in contrast to our folding procedure, parton-level data is mapped to detector-level data using a conditional GAN model.
For both of our GAN architectures we adopt the Least Squares GAN (LSGAN) [38], which replaces the cross entropy loss function in the discriminator of a regular GAN by a least squares term. For the conditional model, $P_v$ denotes the conditioned vertex-level samples that are fed as inputs to the ML detector surrogate. The main advantage of the LSGAN is that by penalizing the samples that are far from the decision boundary, the generator is prompted to generate samples closer to the manifold of the true samples.
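As a reminder of what a least squares loss looks like in practice, the sketch below implements the generic LSGAN objectives in plain Python. The target values `a`, `b`, and `c` follow the usual LSGAN notation (fake target, real target, and the value the generator wants the discriminator to assign to fake samples); the defaults are the common 0/1 coding and are an assumption, not a statement of the values used in this analysis.

```python
def lsgan_discriminator_loss(d_real, d_fake, a=0.0, b=1.0):
    """0.5 * mean((D(real) - b)^2) + 0.5 * mean((D(fake) - a)^2)."""
    real_term = sum((d - b) ** 2 for d in d_real) / len(d_real)
    fake_term = sum((d - a) ** 2 for d in d_fake) / len(d_fake)
    return 0.5 * real_term + 0.5 * fake_term

def lsgan_generator_loss(d_fake, c=1.0):
    """0.5 * mean((D(fake) - c)^2)."""
    return 0.5 * sum((d - c) ** 2 for d in d_fake) / len(d_fake)

# Discriminator outputs for a small batch of real and generated samples (toy numbers).
d_real = [0.9, 1.1, 1.0]
d_fake = [0.1, -0.1, 0.0]
d_loss = lsgan_discriminator_loss(d_real, d_fake)
g_loss = lsgan_generator_loss(d_fake)
```

Unlike a cross entropy loss, the quadratic penalty keeps growing for samples that lie far from the target value even when they are on the correct side of the decision boundary. A choice such as a = -1, b = 1, c = 0 satisfies b - a = 2 and b - c = 1.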
Our networks are trained adversarially for 100,000 epochs, where an epoch is defined as one pass through the training data set. For the optimizer, in both cases we use Adam [31] with a $10^{-4}$ learning rate, $\beta_1 = 0.5$, and $\beta_2 = 0.9$. To balance the generator and discriminator training, the training ratio is set to 5.
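The hyperparameters quoted above can be collected in a small configuration object. The sketch below also shows one possible reading of the training ratio, namely five discriminator updates for each generator update; this interpretation is an assumption, since the text does not spell it out, and the names `TrainingConfig` and `update_schedule` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    epochs: int = 100_000
    learning_rate: float = 1e-4
    adam_beta1: float = 0.5
    adam_beta2: float = 0.9
    training_ratio: int = 5  # assumed: discriminator updates per generator update

def update_schedule(n_generator_updates, config):
    """Yield 'D' or 'G' in the order the two networks would be updated."""
    for _ in range(n_generator_updates):
        for _ in range(config.training_ratio):
            yield "D"
        yield "G"

config = TrainingConfig()
steps = list(update_schedule(3, config))
```

Updating the discriminator more often than the generator is a common way to keep the two networks balanced, at the cost of more discriminator evaluations per epoch.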

A. GAN training without detector effects
As a first step in our numerical analysis, we train the MLEG using the DIS pseudodata samples without detector effects in order to establish the baseline agreement between training and synthetic data, without the complications introduced by the detector folding. In Fig. 3 we compare the training and synthetic normalized inclusive ep phase space distributions for the scattered electron in the variables $\nu_1$ and $\nu_2$. The uncertainty bands were generated by training 10 independent GANs, where for each training the samples were prepared using the bootstrapping procedure (i.e., taking random samples with replacement). It is useful to define the "pull" metric between the training (JAM) and synthetic (GAN) data by Eq. (4), where $E[P(O|\mathrm{bin})]$ and $V[P(O|\mathrm{bin})]$ are the expectation values and variances of the discrete probability density $P$ of an observable $O$. As expected, the synthetic distributions for $\nu_1$ and $\nu_2$ match well with the distributions from the training samples, within the statistical uncertainties, since for these variables the deviation from the training set is explicitly disfavored by the discriminator. Also shown in Fig. 3 are distributions of derived quantities that are physically relevant for the DIS process, namely, the four-momentum transfer squared, $Q^2 = -(k - k')^2$, and the Bjorken scaling variable $x = Q^2 / 2P \cdot (k - k')$. While these observables are obtained by nonlinear transformations of the original variables $\nu_1$ and $\nu_2$, the resulting distributions agree, within uncertainties, with the corresponding spectra from the training data.
FIG. 3 (continued). The band size reflects the uncertainty evaluated using the bootstrap procedure (see text). The bottom of each panel shows the pull distributions (red circles) defined in Eq. (4), with the two horizontal dotted lines corresponding to ±1σ.
FIG. 4. Comparison of the reduced inclusive ep cross section $\sigma_r^{ep}$ versus $Q^2$ at fixed values of Bjorken-$x$ from the HERA collider [32] (red circles) with data generated from the JAM global QCD analysis [1] (black solid lines) and the trained GAN (yellow bands). No detector effects are included, and for clarity the cross sections are scaled by a factor $2^i$, with $i$ ranging from $i = 0$ for the highest-$x$ value to $i = 17$ for the lowest-$x$ value.
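The bootstrap uncertainty bands and a pull metric of the kind used in this subsection can be sketched as follows. Two simplifications are assumed: the "synthetic" sample is an independent toy sample rather than GAN output, and the pull is taken to be the difference of the per-bin means divided by the combined standard deviation, which is a common definition but is not confirmed by the text. All function names are hypothetical.

```python
import random
import statistics

def histogram_density(samples, edges):
    """Normalized histogram: fraction of samples per bin divided by the bin width."""
    counts = [0] * (len(edges) - 1)
    for s in samples:
        for i in range(len(counts)):
            if edges[i] <= s < edges[i + 1]:
                counts[i] += 1
                break
    n = len(samples)
    return [c / n / (edges[i + 1] - edges[i]) for i, c in enumerate(counts)]

def bootstrap_densities(samples, edges, n_replicas, rng):
    """Histogram each bootstrap replica (resampling with replacement)."""
    return [histogram_density(rng.choices(samples, k=len(samples)), edges)
            for _ in range(n_replicas)]

def per_bin_mean_var(replicas):
    """Mean and sample variance over replicas, bin by bin."""
    cols = list(zip(*replicas))
    return [statistics.mean(c) for c in cols], [statistics.variance(c) for c in cols]

def pull(mean_a, var_a, mean_b, var_b):
    """Assumed pull: difference of means over the combined standard deviation."""
    return [(ma - mb) / (va + vb) ** 0.5
            for ma, va, mb, vb in zip(mean_a, var_a, mean_b, var_b)]

rng = random.Random(3)
edges = [-2.0, -1.0, 0.0, 1.0, 2.0]
training = [rng.gauss(0.0, 1.0) for _ in range(2000)]
synthetic = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # stand-in for GAN output
mean_t, var_t = per_bin_mean_var(bootstrap_densities(training, edges, 10, rng))
mean_s, var_s = per_bin_mean_var(bootstrap_densities(synthetic, edges, 10, rng))
pulls = pull(mean_s, var_s, mean_t, var_t)
```

In the analysis itself each bootstrap replica is used to train a separate GAN, so the spread also includes the variability of the training; the sketch captures only the resampling part.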
In Fig. 4 we illustrate the reduced inclusive ep DIS cross section, $\sigma_r^{ep}$ (in practice the reaction involved positrons scattering from protons), as a function of $Q^2$ in multiple bins of $x$ for the HERA data [32] and for the parametrization of the data from the JAM global QCD analysis [1]. These are compared with the reduced cross sections reconstructed by the GAN.
Within the statistical uncertainties, the empirical results are well reproduced by the MLEG simulation in most of the regions of the phase space. Note that the agreement between the JAM fit and the HERA data deteriorates at the largest $Q^2$ values for each fixed-$x$ spectrum due to the vanishing of the phase space. Nonetheless, as Fig. 4 demonstrates, the GAN is able to reproduce this feature of the parametrization, indicating that the GAN has accurately learned the complex correlations present in the unpolarized DIS phase space.

B. GAN training with detector effects
Having established a baseline agreement for our MLEG framework, we proceed to include detector effects, as would be present in actual experimental situations, which inevitably increases the complexity of the analysis. As discussed above, we separately train an ML detector surrogate using a detector parametrization provided by the eic-smear software [39]. For the trial vertex-level event samples we directly use the samples from the JAM global QCD analysis instead of the flat phase space so as to optimize the GAN training. However, we stress that in principle the model architecture for the detector surrogate can be trained with any samples.
In Fig. 5 we show the vertex- and detector-level distributions for $\nu_1$ and $\nu_2$, where significant distortions are observed for the latter. An issue regarding the change of variables in Eqs. (1) is that after smearing the detector-level $k'_z$ variable can exceed the physical limit given by the incident beam energy $E_e$, rendering the transformation singular for those unphysical cases. However, since the change of variables, in particular for $\nu_2$, is solely designed to increase the discriminator awareness in the difficult regions, we can replace $E_e$ in Eqs. (1) by the maximum energy found for the detector-level samples to achieve the same goal, and avoid the singularity of the variable transform. This disparity, however, creates an impression of higher levels of distortion in the $\nu_2$ variable compared to $\nu_1$.
We also illustrate the smearing effects by focusing on small intervals in $\nu_1$ and $\nu_2$, as shown in the insets of Fig. 5. There are regions where the GAN does not match eic-smear precisely, namely, the tail regions at small and large $\nu_2$, which correspond to the edges of the reaction phase space. For the scope of this study, the GAN output represents a reasonable proxy for the true detector, allowing us to carry out the vertex-level learning closure test and validate the proof of principle of our MLEG framework.
With the ML detector surrogate we proceed with training the MLEG with detector effects.
In Fig. 6 we show similar results as in Fig. 3, but this time with detector effects included.
As expected, the variables $\nu_1$ and $\nu_2$ are well reproduced, since the discriminator directly supervises these variables during the training. Similarly, the predicted DIS variables $x$ and $Q^2$ at the detector level are well reproduced within the uncertainties.
FIG. 6. As in Fig. 3, but with detector effects present.
As the final step, we examine the quality of the MLEG at the vertex level by analysing the direct output of its generator, and plot in Fig. 7 the corresponding vertex-level distributions.
Relative to the detector level, the vertex-level distributions are observed to have, on average, larger values for the pull than those in Fig. 6. This is expected since we do not directly supervise at the vertex level, but instead these are inferred quantities. A more detailed examination of this is shown in Fig. 8, where we plot the reduced cross sections as in Fig. 4, but in the presence of detector effects. As expected, the uncertainties increase due to the detector effects. However, within uncertainties, the synthetic reduced cross sections are in agreement with the true vertex-level cross sections. This can be seen as confirmation that our MLEG training passes the closure test in the presence of detector effects.

V. CONCLUSIONS
We have presented a new approach based on generative adversarial networks to extract physics observables from pseudodata in a physics-agnostic manner. To illustrate the strategy, we developed a GAN-based MLEG capable of generating synthetic data that mimic inclusive deep-inelastic ep scattering pseudodata generated from PDFs in the kinematics of the ZEUS and H1 experiments at HERA. To demonstrate the veracity of our approach we performed a closure test, extracting the original PDFs from synthetic particle four-momenta.
To simulate real experimental scenarios, we introduced distortions into the analysis that would be induced by a real detector, implementing a resolution smearing function, and after repeating the test obtained good agreement between original and extracted PDFs.
Pulls quantified the uncertainty associated with the unfolding procedure, showing not only that we were able to extract the desired physics observables, but providing an uncertainty quantification for the unfolding procedure.To our knowledge this is the first time that detector effects were unfolded from pseudodata on an event basis.
While our long-term goal remains to construct an MLEG for real experimental events across multiple channels for QCD studies, the present analysis is a necessary and important proof of concept that demonstrates the viability of applying ML techniques to mitigate theoretical bias in experimental data analysis. The promising results found with the case study of inclusive ep DIS suggest potentially important applications of the GAN-based MLEGs to physical processes beyond inclusive reactions.
As obvious improvements, and in view of its application to data analysis, we envision the implementation of a more realistic detector simulator based on GEANT to further study this technology.We expect that the use of our framework in ep scattering will be a valuable complementary tool for nuclear and particle physics programs at current and planned facilities, such as Jefferson Lab [36] and the Electron-Ion Collider [37].

FIG. 1. Schematic view of the MLEG GAN training framework. The MLEG (dashed box) uses a generator which transforms noise into event-level features. The generator is concatenated with a detector simulator to mimic synthetic detector-level event features. The deep neural network based discriminator compares detector-level event features in order to build gradients to update the generator of the MLEG.

FIG. 2. Schematic view of the ML detector surrogate, where a generator converts input vertex-level event features and noise to detector-level event features. The training samples are obtained from guess vertex-level samples and the corresponding detector-level samples using a detector simulator. The discriminator (right hand side of the figure) is trained simultaneously with vertex-level and detector-level event features in order to minimize the dependence of the generator on the input vertex-level guess samples.
The input to the MLEG generator in Fig. 1 is a 100-dimensional white noise array centered at 0 with unit standard deviation. The generator network consists of 5 hidden dense layers, with 512 neurons per layer, activated by a leaky Rectified Linear Unit (ReLU) function. The number of layers and neurons is optimized to balance execution time and convergence. The last hidden layer is fully connected to a 2-neuron output corresponding to the variables $\nu_1$ and $\nu_2$, activated by a linear function representing the generated features. The corresponding discriminator also consists of 5 hidden dense layers with 512 neurons per layer, optimized as for the generator, and activated by a leaky ReLU function. To avoid overfitting, a 10% dropout rate is applied to each hidden layer. The last hidden layer is fully connected to a single-neuron output, where "1" indicates a true event and "0" a fake event. The discriminator $D$ is trained to give $D(F) = 1$ for each training sample $F$, and $D(\tilde{F}) = 0$ for each sample $\tilde{F}$ produced by the generator.
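The layer layout described above can be written down as a minimal NumPy forward pass. This is a sketch of the network shapes only, with no training: the weight initialization, the leaky ReLU slope of 0.2, and the use of inverted dropout are assumptions not specified in the text, and `init_mlp`, `leaky_relu`, and `forward` are hypothetical helper names.

```python
import numpy as np

def init_mlp(layer_sizes, rng):
    """Random weights and zero biases for a fully connected network."""
    return [(rng.normal(0.0, 0.05, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def leaky_relu(z, slope=0.2):
    # The slope is an assumed value; the text does not quote it.
    return np.where(z > 0.0, z, slope * z)

def forward(params, inputs, rng=None, dropout=0.0):
    """Leaky-ReLU hidden layers (optional inverted dropout) and a linear output."""
    h = inputs
    for w, b in params[:-1]:
        h = leaky_relu(h @ w + b)
        if rng is not None and dropout > 0.0:
            mask = rng.random(h.shape) >= dropout
            h = h * mask / (1.0 - dropout)
    w, b = params[-1]
    return h @ w + b

rng = np.random.default_rng(0)
generator = init_mlp([100] + [512] * 5 + [2], rng)    # noise -> (nu1, nu2)
discriminator = init_mlp([2] + [512] * 5 + [1], rng)  # features -> single score

noise = rng.standard_normal((4, 100))
fake_features = forward(generator, noise)
scores = forward(discriminator, fake_features, rng=rng, dropout=0.1)
```

With these layer sizes the generator has about 1.1 million trainable parameters, almost all of them in the four 512-by-512 hidden-to-hidden weight matrices.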
where $P_G$ denotes the distribution of the generated samples and $P_T$ the distribution of the training samples. As a result, by setting $b - a = 2$ and $b - c = 1$, minimizing the loss function of LSGAN implies minimizing the Pearson $\chi^2$ divergence. For the conditional model, the objective functions can be defined as $\min_D$

FIG. 3. Comparison of distributions of training and derived variables from JAM training samples (black circles) and GAN-generated synthetic data (yellow bands) for the case of no detector effects;

FIG. 5. Comparison of training features at the vertex level (generated, blue histograms) and detector level (smeared, green histograms) with the MLEG generated synthetic data (red histograms). The insets illustrate the local smearing effect at the points indicated by the green vertical dashed lines.

FIG. 8. As in Fig. 4, but with the synthetic reduced cross sections generated by the GAN including detector effects and unfolding.