Efficient precision simulation of processes with many-jet final states at the LHC

We present a scalable technique for the simulation of collider events with multi-jet final states, based on an improved parton-level event file format. The method is implemented for both leading-and next-to-leading order QCD calculations. We perform a comprehensive analysis of the I/O performance and validate our new framework using Higgs-boson plus multi-jet production with up to seven jets. We make the resulting code base available for public use.


I. INTRODUCTION
The simulation of events with high-multiplicity final-states in experiments at the Large Hadron Collider (LHC) is a challenging computational problem [1][2][3][4].Using the best available algorithms, the calculation of the integrand for multi-jet processes scales at best exponentially with increasing particle multiplicity.The integration over the many-body phase space calls for suitable importance-sampling techniques, which often also scale exponentially [5][6][7][8][9][10][11][12][13][14][15].Hence, while being a solved problem in principle, the calculation of cross sections and the production of unweighted events at high jet multiplicity is still a hard problem to date.Computing techniques have remained conceptually identical since their inception four decades ago and typically make use of dynamic programming [15][16][17][18][19][20][21][22][23][24].
While the calculation of hard cross sections with full quantum interference effects is a considerable challenge even at tree level, the Markovian methods used in parton showers are often sufficient to describe the dynamics of collider events, and in fact they are necessary to properly account for the all-orders resummation of virtual corrections [1,25,26].The combination of the evolution implemented in parton showers with the exact calculations implemented by hard matrix elements provides the best available physics modeling of LHC events, accounting for both inter-jet correlations and intra-jet evolution by means of NLO matching [27][28][29][30][31] and multi-jet merging [32][33][34][35][36][37][38][39][40][41][42][43][44][45].However, parton showers and the subsequent hadronization [46][47][48][49][50][51] and multiple interaction [52][53][54] models are associated with various free parameters.Varying these parameters is key to assessing the uncertainty of LHC simulations.The large difference in required computation time, and the need to perform the simulation for multiple parton shower, underlying event and hadronization parameters makes it natural to separate the generation of LHC events into the calculation of the hard interaction and the simulation of the remaining physics aspects.
A number of approaches have been proposed to address this problem.The earliest and most widely used include the user process functionality of Pythia [55], and the LesHouches event file format [56][57][58].More recently, with the need for high-statistics event simulation leading to the use of high-performance computing facilities [59], the need for improved I/O performance and truly parallel I/O has become apparent.In this manuscript we report on the extension of a new event file format [60] for particle-level events.Firstly, we enable the storage of the standard and hard events needed for next-to-leading matching in the MC@NLO method.Secondly, we propose a new layout of the event file in order to increase the performance in large-scale parallel processing.Thirdly, we implement the new technology in various parton-level and particle-level event generators.At parton level we use Sherpa [61,62] with the two internal matrix-element generators Amegic [63] and Comix [15], as well as the new GPU-accelerated leading-order parton-level event generator Pepper [14,64].At particle level, we use the event generators Pythia [55,65] and Sherpa [61,62].
We also provide a first physics application of our new algorithms.We simulate Higgs boson plus multi-jet events with up to seven jets at tree level, and up to two jets at next-to-leading order QCD precision.With the High-Luminosity LHC (HL-LHC) expected to collect 3 ab −1 of data, these predictions can be used to test QCD associated Higgs production over a large dynamic range.Moreover, Higgs-boson plus multi-jet events play an important role as irreducible backgrounds to more detailed tests of the Higgs sector of the Standard Model, and especially in weak vector boson fusion.Using the MEPS@NLO merging method [41,42], in particular the reweighting of higher-multiplicity tree-level predictions with the help of Born-local K-factors from the Higgs plus two-jet setup, our new code base enables the most precise predictions of Higgs plus ≥4 jet events to date.We make the corresponding input event samples available for public use 1 and provide a publicly available version of the parton-and particle-level event generators that can be used to process these event files.This manuscript is structured as follows: Section II discusses the challenges faced in previous simulations and provides a documentation and performance assessment of our new event file format and parallel event generation techniques, including the changes needed for the simulation of events at next-to-leading order QCD precision.Section III discusses the new GPU-accelerated parton-level simulation pipeline.Section IV presents the first phenomenological application and discusses the impact of NLO matching and multi-jet merging in Higgs plus multi-jet production.We conclude with an outlook in Sec.V.

II. EXTENDED PARALLELIZATION FRAMEWORK
The first scalable particle-level event generator was based on legacy versions of Alpgen [22] and Pythia 6 [55] and was introduced in Ref. [59].To make state-of-the art parton-level and particle-level simulation tools available for use on HPC systems, Ref. [60] proposed a new event generation framework, based on Sherpa 2 [62] and Pythia 8 [65].This framework is based on a parallelized main routine for Pythia 8, and a new parton-level event file format, using the HDF5 library for parallel I/O. 2 It solved the main problem of making the event production scalable to thousands of MPI ranks, but still suffers from an I/O bottleneck.Various modern high-performance computing systems are not well suited for the fast writing and reading of large amounts of data, as is common in parton-level event simulations.The solution to this problem is discussed in this section.
Reference [60] also did not provide a means to store information for parton-level events simulated at NLO QCD precision.At present, a pure leading-order based event simulation falls short of the precision requirements at the LHC experiments.Therefore, an extension of the previous event file format to NLO QCD precision is an additional problem we will address.At leading order QCD, one can write the expectation value of an arbitrary infrared safe observable, O, at particle level as where B is the differential Born cross section, including flux and symmetry factors, as well as the parton luminosity, and dΦ B is the differential phase-space element, including the integration over the light-cone momentum fractions of the initial-state partons.The functional F MC (O, Φ B ) implements the parton shower and is explained in more detail in [27,30].In the MC@NLO [27] or POWHEG [28] NLO QCD matching technique, Eq. ( 1) becomes where V and R are the virtual and real-emission corrections, S and I are the differential and integrated NLO infrared subtraction counterterms, and KP are the factorization scale dependent finite corrections arising from the combination of the integrated infrared subtraction counterterms with collinear mass-factorization counterterms.The infrared subtraction counterterms are most commonly defined in the Frixione-Kunszt-Signer [66] or the Catani-Seymour [67,68] subtraction scheme and depend on the momenta and flavors of two partons i and j that are to be combined, as well as a spectator or recoil momentum, k.The differential phase-space element for the real-emission process is given by dΦ R .It can be factorized into a differential Born phase-space element and a single-emission phase-space element as dΦ R = dΦ B,ijk dΦ +1,ijk [30].
The NLO QCD expression, Eq. ( 2), shows a number of important differences compared to the LO expression, Eq. (1): • There are two different classes of events, one with parton-shower starting condition Φ B (the so-called standard events, or S-events), and one with parton-shower starting condition Φ R (the so-called hard events, or H-events).
• The H events require a simple, leading-order like phase-space generator for dΦ R , which implies that the event file format is the same as at leading order.
• The S events require two additional integrations, one for the KP counterterms, and one for the one-emission phase space dΦ +1,ijk .They also require a sampling of the indices i, j and k.
In order to make MC@NLO S-events reproducible and enable a reweighting of events to arbitrary PDF sets and/or scales at NLO QCD, the event format therefore requires the storage of the indices i, j and k, as well as the phase-space point (Φ B , Φ +1,ijk ) and the momentum fractions z 1 and z 2 .We will define the corresponding structures in Sec.II A.

A. Event file format
In this subsection we describe the new event file layout, which includes the optimizations that will be described in Sec.II B as well as extensions for event simulation at NLO QCD.Both are inspired by Les Houches Event Files (LHEF) standard [56], which is widely used in the high-energy physics community.It is based on XML, which makes it flexible enough to add any desired feature, but poses a challenge for I/O operations at scale.In contrast the HDF5 format uses a computing model similar to databases, making it rigid, but highly efficient in parallel workflows.
The LHEF format comprises global properties as well as event-wise properties [56].Global properties include process information (i.e. the type of collisions), total cross-sections as well as reweighting information.The event-wise properties are the process ID, the event weight, the scale of the hard process as well as the values of α QCD , α QED and the list of particles generated.The latter contain information about momentum four-vectors, particle ID, charge, spin and lifetime, as well as production history.We collect this information in consolidated datasets, reflecting the global, event-wise and particle properties.In addition, we introduce two new datasets, which include the event-wise and particle-wise information needed to simulate NLO QCD events in the MC@NLO matching scheme.We will call this structure the LHEH5 event file format. 3e follow the naming scheme of Ref. [60] and define datasets called init and procInfo that are used to store basic information about the entirety of events contained in the file.We also add a new dataset, version that identifies the version of the event file format.Event-wise properties for leading-order events are stored in the dataset events and for MC@NLO S-events in the dataset ctevents.Particle-wise properties for leading-order events are stored in the dataset particles and for MC@NLO S events in the dataset ctparticles.
Each dataset is a two-dimensional array and has an HDF5 attribute properties that identifies the individual columns of the dataset in order of appearance.For example the procInfo dataset has the properties procId, npLO, npNLO, xSection, error and unitWeight.In future updates of the event file format, these attributes can be used to communicate the content of the individual entries to the user of the file, similar, although not quite as flexible as in the case of XML-based Les Houches event files.The content of all datasets is summarized in Tab.I.In addition to Ref. [60], we introduce the following entries: • Process properties npLO and npNLO.If the process is computed at leading order QCD, we set npLO to the final-state particle multiplicity.If the process is computed at next-to-leading order QCD, we instead set npNLO to the final-state particle multiplicity at Born level.• At NLO QCD, we include the minimal information needed to reconstruct the complete event weight in the MC@NLO matching method.For hard remainder events (H-events), the leading-order type information is sufficient.For standard events (S-events) we add the following: -Counterterm properties in the ctevents dataset: ijt and kt refer to the Born-level QCD dipole used to generate a real-emission phase-space point in Eq. ( 2), i, j and k correspond to the respective particle IDs at real-emission level.tlpsw is the phase-space weight dΦ B and bbpsw is the corresponding single-emission phase-space weight dΦ +1,ijk .The variables z1 and z2 are the MC points of the integration variables in the KP contribution.See Eq. ( 2) for details.
-Counterterms properties in the ctparticles dataset: px, py, pz and e store the momenta of all particles in the phase-space point (Φ B , Φ +1,ijk ).See Eq. ( 2) for details.

B. I/O operations at scale
Our event simulation frameworks read and write event data through a multi-layered I/O software stack based on HDF5 [69], an array-oriented library and data model, which in turn uses MPI-IO [70].Within Sherpa and Pythia, HDF5 is accessed through the HighFive header library [71].Each layer typically provides tuning parameters.Optimal performance can in principle be achieved with the help of sophisticated I/O tuning systems [72].However, selecting the right parameters with the help of subject expertise is often more efficient and more reliable.In our case, some straightforward changes to the I/O layer resulted in large performance gains, reducing the I/O for particle-level events to a very small fraction of the overall runtime.
Figure 1 shows profiling results obtained with the help of Darshan [73,74] for a parton-level simulation of Higgs Figure 2 shows profiling results obtained with the help of Darshan [73,74] for a particle-level simulation of Higgs plus four jets with leading-order multi-jet merging, run on 1024 ranks of the Perlmutter system at NERSC. 5 For this test, we used the CPU-only nodes and did not access the scratch file system in order to give a more reliable estimate of the expected I/O time on typical computing clusters and HPC machines.The total amount of data read during the test was 128.85 GiB, and the time spent in I/O operations was less than 5% of the runtime.The POSIX-level data rate was 103.44 GiB/s and the MPI-IO level data rate was 14.43 GiB/s.Figure 2 shows that the I/O operations in our improved code are spread evenly over the runtime of the simulation, leading to more file access operations, but smaller data transfers per operation.While the file system would support larger transfer rates, storing the data for processing in the simulation program would require larger RAM allocations, leading to slower overall execution times.This effect becomes particularly important at larger scales, of the order of 1000 ranks and beyond, where the aggregate time needed for heap allocation would constitute a substantial part of the total runtime and break the strong scaling.This concludes our optimization of the particle-level simulation.
Figure 3 shows the strong scaling tests for the parton-level and the particle-level component of the simulation.The test is performed on the Perlmutter system at NERSC.We begin at a scale similar to the upper end of the tested range in [60].For parton-level calculations, we observe good scaling properties up to 8192 MPI ranks.We note that the work for this test was selected such that the minimum runtime would correspond to about 30s, below which the initialization time of the executable consumes a significant fraction of the overall runtime.In practice, one would rather choose runtimes that are significantly longer, in order to minimize the impact of initialization.At the particle level, we observe scaling up to about 1024 MPI ranks, above which the behavior depends on the particular implementation of MPI-IO.Figure 3 compares the Cray implementation to ROMIO.Above 16384 ranks, the Cray implementation suffers from a problem that prevents collective open calls through HDF5.The ROMIO implementation does allow collective open calls, but does not reach the full performance of the Cray MPI-IO library in data transfer.However, we note that at this scale only about 6400 events are processed per rank, leading to an overall runtime of about one minute for a total of 105 million events.There is no practical need to perform a calculation of this scale in less than an hour, therefore our example should be seen as a test of the absolute limits of the code.We believe that further optimization is not needed at this stage.In addition, we note that the particle-level simulation was limited to the perturbative event phases, i.e. we did not include hadronization, underlying event simulation and hadron decays.Due to the reduced event processing time in this scenario, any scaling violations observed in our test are more severe than in practical applications.We would like to conclude this section with a seemingly obvious but practically very important remark on the limits of scalability.The aim of an efficient parallel code is to maximize the effective computation time per worker node, i.e. the time spent in useful computations between I/O operations, with the I/O contributing an insignificant fraction of the overall runtime.One of the main reasons for scaling violations to occur is that the time between I/O operations becomes too short because of the limited size of input files.This can lead to significant problems at very large scales, where the input files must then be tens or hundreds of Gigabytes in size.Therefore, it is not practical for us to attempt scaling tests for particle-level simulations that go beyond O(10 4 ) MPI ranks.We note that this intermediate scale parallelism is actually advantageous, because it allows to access backfill queues at large computing centers.

III. PARTON-LEVEL EVENT GENERATION ON GPUS
We have added support for the generation of the proposed parton-level HDF5 event files to the Pepper partonlevel event generator.This recently presented generator, previously called BlockGen [64,75], is being developed to deliver performance portability for standard-candle processes, which currently include V+jets, t t+jets and pure jet production at tree level.This is achieved by focusing on this subset of processes, and choosing the right algorithm for such parallelization [64], for example by making use [75] of a minimal color basis [76][77][78][79].Furthermore, parallelized execution on accelerators like GPUs is supported.Preliminary results have been reported in [64,75] for various processes.Detailed speed comparisons e.g. to Comix and comparing GPU with CPU evaluation will be reported in a forthcoming publication that will also mark the first public release of Pepper, in combination with the Chili phase-space generator [14].As in the Sherpa case (see Sec. II B), we added MPI-IO HDF5 support to Pepper via the HighFive header library, and enabled HDF5's collective data and collective metadata features.II.Benchmarks for the production and HDF5 writeout of pp → Z + jets events, comparing Sherpa's Comix with Pepper+Chili, on a single core of an Intel(R) Core(TM) i3-8300 CPU at 3.70GHz and 8MB L3 cache.Event samples are generated with a given target for the total cross section uncertainty ("Tot.unc.")."Speed-up" gives the walltime gain factor of Pepper+Chili vs. Sherpa (Comix).For Pepper+Chili, the lower multiplicities Z+0j and Z+1j are generated using helicity summing, while the higher ones are generated using helicity sampling, in order to achieve the best possible performance in each case.Table II shows benchmarks for the production of parton-level HDF5 event files on a single CPU thread for the pp → Z + n jets process with n = 0, . . ., 4 for a given uncertainty target of the total cross section ("Tot.unc."), comparing Sherpa's Comix generator with Pepper, where the latter uses Chili for the phase-space sampling.Prior to event generation, the different phase-space generators are optimized until a given accuracy target is reached to ensure a fair comparison.The benchmark metrics are the walltime for the generation of the event sample, the memory consumption in terms of the applications' unique set size (USS) in RAM, and the fraction of the number of non-zero events over the total number of events generated, i.e. the measured combined efficiency of the phase-space sampling and unweighting ("Eff.").For Pepper+Chili, we switch from using helicity summing for the n = 0, 1 multiplicities to using helicity sampling for the n = 2, 3, 4 ones, in order to achieve the best performance.We find that the walltimes are significantly lower for Pepper+Chili for the given multiplicities, with the speed-up factor ranging between 2 and 10.For the higher multiplicities, n = 3, 4, the speed-up becomes smaller, but is still significant with factors of 2.6 and 1.7, respectively.The efficiencies for Pepper+Chili are also better by factors between 2 and 5.We remind the reader that Pepper is based on explicit color sums to achieve a perfect lock-step parallel processing, which is essential for achieving excellent performance on a GPU.The computational complexity for these explicit sums scales factorially and will eventually cause Pepper to become slower (and more memory-consuming) than Comix, which uses an algorithm with overall exponential scaling.For gluon scattering, this transition occurs at seven final-state gluons [64].This is also the reason for the larger memory usage growth going from n = 3 to n = 4 for Pepper+Chili compared to Sherpa.
Figure 4 shows a cross-check of differential distributions of k T jet rates between Pepper and Comix, after leadingorder multi-jet merging [38] with Sherpa 2.2 [62].The first ratio panel compares the predictions obtained with Pepper+Sherpa to the results from Comix+Sherpa, normalized to the statistical uncertainty of the latter.The second ratio panel shows the relative contributions from the event samples with Z + 0 jets, Z + 1 jet, Z + 2 jets and Z + 3 jets  5. Transverse momentum spectra of the leading, sub-leading and sub-sub-leading jet in Z+jets production at the LHC.Simulations have been performed with up to 2-jet, 3-jet and 4-jet matrix elements at leading order QCD.The colored lines represent the contributions from the parton-level inputs with the specified multiplicity.
to the overall prediction.It can be seen that the results are in agreement up to statistical fluctuations, which are typically at or below the 1σ level, as expected.

IV. PHENOMENOLOGICAL APPLICATIONS
In this section we present first phenomenological applications of our new framework.We show that it can be used to perform simulations with two of the main particle-level simulation tools for LHC physics, Pythia 8 and Sherpa 2. We also present the first computation of Higgs with up to seven final-state jets, and we make the corresponding parton-level event samples available for public use.The source codes for our study can be found at https://gitlab.com/hpcgenand at https://gitlab.com/sherpa-team/sherpa/-/tree/rel-2-3-0.

A. Systematic comparison of particle-level simulations
The systematic assessment of uncertainties in particle-level simulations has been vital for the success of the LHC physics program.It is particularly important in cases where the uncertainty is not of parametric type, such as when switching between two formally equivalent, but practically different, NLO QCD matching schemes [30].When used correctly, the residual variations between event generator predictions give the best possible non-parametric estimate of perturbative (and non-perturbative) uncertainties in the simulation.Our new event generation framework allows to obtain such uncertainty estimates based on the same parton-level input configurations and at minimal computational cost.This contributes to creating a sustainable computing model for high-energy physics research.We carry out an example study of this type for the standard candle process of Z-boson plus multi-jet production.We consider proton-proton collisions at the high-luminosity LHC at √ s = 14 TeV.The complete setup has been described in [60].In particular, we use the CT14 NNLO PDF set [80] and define the strong coupling accordingly.Our modified parton-level event generator is based on Comix [15] as included in Sherpa version 2.2.4 [61,62].Our modified particle-level event generators are based on Pythia 8 [65] and Sherpa 2.2 [62], including the improvements reported in [81,82].Jets are defined using the k T clustering algorithm with R = 0.4, p T,j > 20 GeV and |η j | < 6.Following the good agreement between parton-level and particle-level results established in [83,84], and the good agreement between fixed-order and MINLO [85] results established in [86,87], the renormalization and factorization scales are set to Ĥ′ T /2, where Ĥ′ T = j∈jets p t,j + m 2 l l + p 2 T,l l.
Figure 5 shows the transverse momentum spectra of the leading, sub-leading and sub-sub-leading jet in the simulation.The colored lines correspond to the contributions from the individual parton-level input samples after the full simulation.The upper ratio panel shows the ratio between the Sherpa and the Pythia predictions.This ratio is of the order of 10%, which can be ascribed to differences in the parton-shower algorithm used in the two different generators.6. Transverse momentum spectrum of the leading jet in Z+jets production at the LHC, simulated using multi-jet merging for up to two jets at NLO with up to zero, one and two additional jets at leading-order precision (from left to right), compared to a purely LO multi-jet merged prediction with the same overall multiplicities.The colored lines represent the contributions from the parton-level inputs with the specified multiplicity, and the hatched and solid bands indicate the uncertainties from renormalization and factorization scale variations at leading-and at next-to-leading order.
This uncertainty should be added as a variation to the parametric scale uncertainties, which we investigate next.
Figure 6 shows the transverse momentum spectrum of the leading jet in a multi-jet merged setup with up to two jets computed at next-to-leading order precision, and with up to zero, one and two additional jets computed at leading order precision. 6For reference, we also show the prediction from a leading-order multi-jet merged event sample with identical jet multiplicity (dashed lines).The leading-order predictions have been scaled such as to reproduce the total cross section of the next-to-leading order predictions.The colored lines correspond to the contributions from the individual parton-level input samples after the full simulation.The hatched bands indicate the scale uncertainties from a seven-point scale variation at leading order, and the solid bands represent the corresponding uncertainties at next-to-leading order precision.Note that the scale uncertainties increase with increasing jet multiplicity in the merging.This is an artifact of the method to estimate the scale uncertainty in the complete calculation, and is due to the fact that scales are varied in the computation of the hard matrix elements alone.It also indicates the importance of higher-multiplicity final states for the experimental observable.To obtain a comprehensive picture of the uncertainty, the renormalization scale dependent terms of the parton-shower resummation at higher logarithmic order should be taken into account.This is the topic of active research elsewhere [88,89], and we will therefore not discuss the effect in this publication.We emphasize, however, that the simulation of additional radiation at tree level is necessary for a proper physics modeling of high-multiplicity final states, and it is therefore not sufficient to limit the fixed-order perturbative calculations to low multiplicity.This is where the increased efficiency of our event generation framework becomes relevant for practical applications at the LHC.

B. Higgs boson plus multi-jet production as an example
With an anticipated 3 ab −1 at the high-luminosity LHC, Higgs-boson plus multi-jet events will be copiously produced, and even the six jet final state will be measurable at good precision.While not a discovery channel in its own right, the Higgs-boson plus multi-jet signature can be used to test the dynamics of the Standard Model, and it also provides the background to a number of Higgs-boson related measurements and searches, such as Di-Higgs production.In anticipation of these analyses it behooves us to provide precision simulations.In this subsection, we therefore present the first study of Higgs-boson production through gluon fusion at the LHC, with up to seven additional jets computed at LO QCD and up to two jets computed at NLO QCD in the Higgs effective theory [90,91].We use the MEPS@NLO algorithm [41,42] to merge these calculations into an inclusive event sample.The parton-level inputs are generated using Amegic [63], Comix [15] and MCFM [92][93][94][95][96][97][98] 7 .the maximum number of jets computed at NLO precision is n max,NLO = 2, and we apply local K-factors based on this calculation to higher jet multiplicities.The panels on the right show the ratio between different predictions, normalized to the result for n max = N + 2. It can be seen that the NLO merged predictions are more stable with respect to variations of n max at N = 1, as expected from the higher precision of the calculation at low jet multiplicity.At N = 2, this effect is diluted by higher-multiplicity tree-level contributions, as explained in Sec.IV A, Fig. 6.NLO accurate predictions for 3 jets at parton level would help to alleviate this problem [99,100].However, we could not generate the corresponding unweighted event samples within the limited computing budget for this publication, and we therefore leave a detailed investigation to future work.

C. Implementation in state of the art experimental simulations
We have validated our new event processing framework in the ATLAS benchmark setups described in [81] 8 .For practical applications where multiple particle-level simulations are generated with the same parton-level input, the LHEH5 event file technology will result in a significantly reduced overall production cost.There may however be remaining obstacles to implementing the method in large-scale event production for the LHC experiments, in particular the access of sub-samples and the synchronization of sub-samples across various sites of the WLCG.The solution to this problem must be found in collaboration with experts from the LHC experiments, who are proficient in WLCG workflows.We therefore postpone the discussion to a future publication.

V. CONCLUSIONS
We have presented a new framework for the precise and efficient simulation of events in collider experiments, with particular emphasis on the high-luminosity Large Hadron Collider.The new technique is especially suited for the physics modeling of high-multiplicity final states as it allows to match parton-level calculations at next-to-leading order QCD precision to parton showers and merge multiple exclusive calculations into inclusive predictions.Parametric uncertainty estimates can be computed on the fly, using the techniques from [101].There are no restrictions on the variations that can be performed, and the variations do not need to be included at the time of parton-level event production.We have demonstrated scalability of our approach on a state of the art high-performance computer at a leadership class computing facility.With the computing demands of the LHC experiments becoming an ever more pressing problem due to increased precision in the measurements, our new framework presents an important step towards a more flexible as well as economically and ecologically sustainable approach to event generation in the high-luminosity era.We have validated the new technology against previous simulation programs and enabled event production with a modern, portable parton-level event generator.

FIG. 1 .
FIG. 1. Darshan graphs depicting the I/O behavior of Sherpa during parton-level event generation for H+3 jets at leading order QCD on 1024 MPI ranks.Left: POSIX I/O behavior.Middle: MPI-IO behavior.Right: Overall I/O cost.Top row: Before optimizations.I/O operations are fragmented and uncoordinated among processes.Bottom row: Including collective I/O, improved file layout to reduce metadata operations and limiting stat calls to the master rank.The number of events was held constant for this test, and the total amount of data written was 1.01 GiB.
FIG. 2. Darshan graphs depicting the I/O behavior of Sherpa during particle-level event generation for H+4 jet leading order multi-jet merging on 1024 MPI ranks.Left: POSIX I/O behavior.Middle: MPI-IO behavior.Right: Overall I/O cost.All figures after including collective I/O, improved file layout to reduce metadata operations and limiting stat calls to the master rank.The number of events was held constant for this test, and the total amount of data read was 128.85 GiB.

FIG. 3 .
FIG.3.Strong scaling test of the simulations.Left: Parton level.Right: Particle level.At particle level, we do not include the simulation of multiple interactions and hadronization, and we do not process events further.All times are normalized to the individual results obtained on 512 MPI ranks.

log 10 (FIG. 4 .
FIG.4.Differential jet rates for the leading, sub-leading and sub-sub-leading jet clustering in Z+jets production at the LHC.Simulations have been performed with up to 1-jet, 2-jet and 3-jet matrix elements at leading order QCD.The colored lines represent the contributions from the parton-level inputs with the specified multiplicity.