Exploring the Space of Jets with CMS Open Data

We explore the metric space of jets using public collider data from the CMS experiment. Starting from 2.3/fb of 7 TeV proton-proton collisions collected at the Large Hadron Collider in 2011, we isolate a sample of 1,690,984 central jets with transverse momentum above 375 GeV. To validate the performance of the CMS detector in reconstructing the energy flow of jets, we compare the CMS Open Data to corresponding simulated data samples for a variety of jet kinematic and substructure observables. Even without detector unfolding, we find very good agreement for track-based observables after using charged hadron subtraction to mitigate the impact of pileup. We perform a range of novel analyses, using the "energy mover's distance" (EMD) to measure the pairwise difference between jet energy flows. The EMD allows us to quantify the impact of detector effects, visualize the metric space of jets, extract correlation dimensions, and identify the most and least typical jet configurations. To facilitate future jet studies with CMS Open Data, we make our datasets and analysis code available, amounting to around two gigabytes of distilled data and one hundred gigabytes of simulation files.

The unprecedented release of public collider data by the CMS experiment [35] starting in November 2014 [36] has enabled new exploratory studies of jets. The first such jet analyses [37,38] were performed using the CMS 2010 Open Data [39], corresponding to 31.8 pb$^{-1}$ of 7 TeV data from Run 2010B at the Large Hadron Collider (LHC). Among other aspects of jets, these studies explored the groomed momentum fraction $z_g$ [40], which has subsequently been measured in proton-proton and heavy-ion collisions by CMS [41], ALICE [42], and STAR [43]. The CMS Open Data release from LHC Run 2011A includes detector-simulated Monte Carlo (MC) samples, facilitating machine learning studies [44][45][46], an underlying event study [47], as well as a novel search for dimuon resonances [48]. CMS has also released data from Runs 2012B and 2012C, which have been used to search for non-standard sources of parity violation in jets [49] and extract standard model cross sections [50]. Beyond CMS, archival ALEPH data [51] have been used by Ref. [52] to search for new physics and by Refs. [53][54][55] to perform QCD studies. While analyses using public collider data cannot match the sophistication or scope of official measurements by the experimental collaborations, they can enable proof-of-concept collider investigations and help stress-test archival data strategies.
In this paper, we perform the first exploratory study of the "space" of jets using the CMS 2011 Open Data. This data and MC release corresponds to 2.3 fb$^{-1}$ of proton-proton collisions collected at a center-of-mass energy of $\sqrt{s} = 7$ TeV. The key idea, as proposed in Ref. [56], is to compute the pairwise distance between jet energy flows, and then use this information to construct a metric space. This enables a variety of distance-based jet analyses, including quantitative characterizations and qualitative visualizations. Because this is an exploratory study, we neither unfold for detector effects nor estimate systematic uncertainties, but the general agreement between the CMS Open Data and simulated MC samples provides evidence for the experimental robustness of these methods.
The metric we use is the "energy mover's distance" (EMD) [56], inspired by the famous earth mover's distance [57][58][59][60][61] sharing the same acronym. The EMD has units of energy (i.e. GeV) and quantifies the amount of "work" in energy times angle to make one jet radiation pattern look like another, including the cost of creating energy for jets with different $p_T$. While we focus on the EMD between pairs of jets in this study, the same concept could be applied to pairs of events as a whole. Crucially, the CMS Open Data contains full information about reconstructed particle flow candidates (PFCs) [62][63][64], which provide a robust proxy for the energy flow of a jet. It also contains information about primary vertices, allowing us to mitigate pileup (multiple proton-proton collisions per beam crossing) through charged hadron subtraction (CHS) [65]. Because of the improved resolution and pileup insensitivity of charged particles (i.e. tracks), we use a track-based variant of EMD for these exploratory studies.
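For orientation, the description above (work in energy times angle, plus a cost for creating energy) corresponds to an optimal transport problem of the following form. This is a sketch based on our reading of Ref. [56]; the symbols $f_{ij}$ (energy moved from particle $i$ to particle $j$), $\theta_{ij}$ (their angular distance), and $R$ (an angular scale) are not defined elsewhere in this section and should be checked against that reference.

```latex
\mathrm{EMD}(\mathcal{E}, \mathcal{E}') =
  \min_{\{f_{ij} \ge 0\}} \sum_{i,j} f_{ij}\, \frac{\theta_{ij}}{R}
  + \Big| \sum_i E_i - \sum_j E'_j \Big|,
\qquad
\sum_j f_{ij} \le E_i, \quad
\sum_i f_{ij} \le E'_j, \quad
\sum_{i,j} f_{ij} = \min\Big( \sum_i E_i, \sum_j E'_j \Big).
```

The first term is the transport "work" and the second is the cost of creating or destroying energy; for jets, the particle energies $E_i$ are naturally replaced by transverse momenta.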
The remainder of this paper is organized as follows. We begin in Sec. II by describing the CMS Open Data and the MOD software framework used for our analysis. In Sec. III, we validate the Jet Primary Dataset by comparing the basic kinematic and substructure properties of jets between the CMS data and MC samples. The core of our analysis is in Sec. IV, where we perform a variety of exploratory studies using the EMD. We conclude in Sec. V with a discussion of future jet studies on public collider data.

II. PROCESSING THE CMS OPEN DATA
In this section, we describe the main steps for processing the CMS Open Data. Our eventual analyses will be based on a single unprescaled trigger above its turn-on threshold, but we include additional details here about the analysis pipeline in order to demonstrate the general capabilities of our framework. The reader already familiar with how CMS data is processed can safely skip to Sec. II E, where we describe the baseline jet selection criteria used for our substructure and EMD studies.

A. Jet Primary Dataset
The CMS Open Data is available on the CERN Open Data Portal [36], which currently hosts data collected by CMS in 2010 [95], 2011 [96], and 2012 [97], as well as specialized samples for machine learning studies [98]. It also contains limited datasets from ALICE [99], ATLAS [100], and LHCb [101], as well as data from the OPERA neutrino experiment [102]. Accompanying the CMS 2011 Open Data is a virtual machine which runs version 5.3.32 of the CMS software (CMSSW) framework. This open data initiative complements efforts like HEPData [103], Rivet [104], and Reana [105] to preserve the results and workflows of official collider analyses (see further discussion in Ref. [106]).
The CMS Open Data is grouped into primary datasets that contain a subset of the triggers used for event selection [107]. There are 19 primary datasets included in the 2011 release, along with corresponding MC samples (see Sec. II D below). All of the primary datasets are provided by CMS in their analysis object data (AOD) format, which provides high-level reconstructed objects used for the bulk of official CMS analyses in Run 1. A subsample of some primary datasets (e.g. Jet [108] and MinimumBias [109]) are also provided in the RAW format, containing the full readout of the CMS detector.
Our analysis is based on the Jet Primary Dataset [66], which includes a variety of single jet and dijet triggers. This primary dataset contains 30,726,331 events spread across 1,223 AOD files, totaling 4.7 TB. The 2011A data-taking period is subdivided into 318 runs, and the runs are subdivided into 109,428 luminosity blocks (LBs) [110]. A luminosity block is the smallest unit of data-taking for which there is calibrated luminosity information, and during one block, the triggers are guaranteed to have consistent requirements and prescale factors (see Sec. II C below). Of the events in the Jet Primary Dataset, 26,275,768 are contained in "valid" LBs which are certified by CMS for use in physics analyses [111].
Each event in the AOD format has a complete list of PFCs, which are particle-like objects containing the reconstructed four-momentum and a probable particle identification (PID) code. In addition, the AOD format has AK5 jets, which are clusters of PFCs identified by the anti-$k_t$ jet algorithm [112] with radius parameter $R = 0.5$. Jet energy correction (JEC) factors are obtained for the AK5 jets, including a correction for pileup using the area-median subtraction procedure [113]. The jets also have the information needed to impose jet quality criteria (JQC).

B. MIT Open Data Framework
Because of the technical challenges involved in using CMSSW, we only use it to extract information from the AOD files, performing the actual physics analyses outside of the virtual machine. Building on the MOD software framework introduced in Ref. [38], we use a custom MODProducer module in CMSSW to translate each AOD file into a plain text MOD file. We then use a custom framework called MODAnalyzer to read in each MOD file and perform various jet analysis tasks using FastJet 3.3.1 [114]. Finally, we convert the MOD files into HDF5 [115] files for universal usability.
As described in more detail in Sec. II E, we consider the hardest and second-hardest jets for our analysis, after correcting the jet $p_T$ using the JEC factors and imposing the "medium" JQC [116,117]. To access the constituents of jets, we first recluster the complete set of PFCs into AK5 jets and then compare against the CMS-provided preclustered AK5 objects. Due to rare numerical rounding issues, there are cases where the AK5 objects disagree, and we discard jets whose transverse momenta differ from the CMS-provided jets by more than one part in $10^6$ or whose four-vectors are more than $10^{-6}$ apart in the rapidity-azimuth plane. When the AK5 objects agree, we associate them in the HDF5 files.
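The agreement test between reclustered and CMS-provided jets can be illustrated with a minimal sketch. The function names and the `(pt, rapidity, phi)` tuple layout are hypothetical and are not taken from MODAnalyzer; only the two tolerances come from the text.

```python
import math

def delta_phi(phi1, phi2):
    """Azimuthal difference wrapped into (-pi, pi]."""
    dphi = (phi1 - phi2) % (2.0 * math.pi)
    return dphi - 2.0 * math.pi if dphi > math.pi else dphi

def jets_agree(reclustered, provided, tol=1e-6):
    """Compare two jets given as (pt, rapidity, phi) tuples.

    Hypothetical helper: the jets agree if their pT values differ by at
    most one part in 10^6 and their axes are at most 10^-6 apart in the
    rapidity-azimuth plane.
    """
    pt_a, y_a, phi_a = reclustered
    pt_b, y_b, phi_b = provided
    relative_pt_difference = abs(pt_a - pt_b) / pt_b
    distance = math.hypot(y_a - y_b, delta_phi(phi_a, phi_b))
    return relative_pt_difference <= tol and distance <= tol
```

Jets failing this test would be discarded rather than written to the output files.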
A number of substantial improvements have been made to MODProducer compared to Ref. [38]. We have added additional physics information in the MOD format, including metadata about files, LBs, and triggers. We have added primary vertex information to implement CHS for pileup mitigation (see Sec. III B below), made possible because the AOD files have a VertexCollection handle that can assign a charged-particle track to the closest collision vertex. We also added the ability to process MC files provided by CMS in the AODSIM format, including both generation-level particles and reconstructed PFCs.
After the jet selection stage in MODAnalyzer, the rest of our workflow is in Python 3. We used NumPy [118] for data manipulation, Matplotlib [119] to produce figures, Python Optimal Transport [120] to calculate the EMD, and EnergyFlow 0.13 [84] for a variety of jet analysis tasks. In addition, we embedded our code in Jupyter notebooks [121] for enhanced transparency and portability. To assist future jet studies on the CMS Open Data, our complete set of Jupyter notebooks is available [85], and the corresponding reduced jet datasets are on the Zenodo platform [86][87][88][89][90][91][92][93][94].

C. Triggers, Prescales, and Luminosities
The Jet Primary Dataset contains 30 triggers [107]. We summarize these triggers in Table I, indicating the number of valid LBs and events for which the trigger is present, as well as the number of valid events for which the trigger fired. There are single jet and dijet triggers, where the trigger names include the nominal $p_T$ requirement for the jet(s). For simplicity, we do not distinguish between trigger versions, denoted by suffixes like v2, in our analysis. (The documentation for Ref. [66] lists 5 L1FastJet trigger variants in the Jet Primary Dataset, but as far as we can tell, these triggers were introduced after Run 2011A was complete.) There are 7 triggers that were operational during the entire 2011A run, corresponding to 109,339 LBs. This can be compared to the luminosity information in Ref. [110], which lists 109,428 valid LBs in this run, leaving 89 LBs unaccounted for in the Jet Primary Dataset. These "missing" LBs only contribute 6 nb$^{-1}$ to the recorded integrated luminosity, so their absence has a negligible impact on our studies. We investigate the missing LBs in more detail in App. A. There are also 643 LBs that are on the list of validated runs [111] but absent from the luminosity table [110]; we omit these from our analysis under the assumption that they are not in fact valid runs. Finally, we omit 143 valid LBs that contain events but have zero recorded luminosity, and we investigate these "zeroed" LBs further in App. A.
Because the total data-taking rate is limited, the lower $p_T$ jet triggers are prescaled to only fire a fraction of the time they are active. The prescale factors satisfy $p^{\rm trig} \ge 1$, with $p^{\rm trig} = 1$ indicating an unprescaled trigger. (Strictly speaking, there are separate and independent prescale factors for the Level 1 (L1) trigger and the high-level trigger (HLT), but we always use $p^{\rm trig}$ to refer to the product of these factors.) The trigger prescale factors are fixed within a LB but can change between LBs. The effective luminosity for a given trigger is:

$$ L^{\rm trig}_{\rm eff} = \sum_b \frac{L_b}{p^{\rm trig}_b}, \tag{1} $$

where $b$ labels a LB, $L_b$ is the recorded integrated luminosity in that block, and $p^{\rm trig}_b$ is the associated prescale factor. The effective luminosities for the Jet Primary Dataset triggers are reported in Table I, along with their average prescale factors and effective cross sections:

$$ \langle p^{\rm trig} \rangle = \frac{L^{\rm trig}_{\rm total}}{L^{\rm trig}_{\rm eff}}, \qquad \sigma^{\rm trig}_{\rm eff} = \frac{N^{\rm trig}}{L^{\rm trig}_{\rm eff}}, \tag{2} $$

where $L^{\rm trig}_{\rm total} = \sum_b L_b$ is the total luminosity of the run while the trigger was present, and $N^{\rm trig}$ is the total number of events for which the trigger fired.
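The per-trigger quantities described above can be computed with a few lines of Python. This is a minimal sketch that assumes the average prescale is the ratio of total to effective luminosity and that the effective cross section is the number of fired events divided by the effective luminosity; the function name and input layout are hypothetical.

```python
def trigger_summary(lumi_blocks, n_fired):
    """Summarize one trigger from per-LB information.

    lumi_blocks: list of (recorded_luminosity, prescale) pairs, one per
                 LB in which the trigger was present (hypothetical layout).
    n_fired:     number of events for which the trigger fired.
    """
    # Total luminosity while the trigger was present.
    l_total = sum(lumi for lumi, _ in lumi_blocks)
    # Effective luminosity: each LB is down-weighted by its prescale.
    l_eff = sum(lumi / prescale for lumi, prescale in lumi_blocks)
    return {
        "L_eff": l_eff,
        "avg_prescale": l_total / l_eff,
        "sigma_eff": n_fired / l_eff,
    }
```

For an unprescaled trigger every prescale equals 1, so the effective and total luminosities coincide and the average prescale is 1.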
Our analysis is based on the substructure of individual jets, so we focus our attention on the 9 single-jet triggers in Table I [...] relatively few events. Their effective luminosities as a function of the number of cumulative time-ordered LBs are plotted in Fig. 1a. We see that as the integrated luminosity increases, some of the jet triggers have to be prescaled. We also see that the HLT Jet300 trigger only starts acquiring data partway through the 2011A run, coinciding with the HLT Jet240 trigger being prescaled.

TABLE I. Triggers in the Jet Primary Dataset [66], restricted to LBs that have been identified as valid for physics analyses by CMS [111] and that have non-zero recorded luminosity [110]. Shown are the number of valid LBs and events for which the trigger is present and the number of valid events for which the trigger fired. Also provided are the effective luminosity $L^{\rm trig}_{\rm eff}$ defined in Eq. (1), and the average prescale value $\langle p^{\rm trig} \rangle$ and effective cross section $\sigma^{\rm trig}_{\rm eff}$ defined in Eq. (2). As discussed in App. A, there are 89 "missing" LBs in the CMS 2011A luminosity table [110] that are not represented in the Jet Primary Dataset, but they have a negligible impact on our analysis. We also omit 143 "zeroed" LBs during which events were detected but zero luminosity was recorded. The HLT Jet300 trigger (bolded) is the one used for the jet studies in Secs. III and IV.
In Fig. 1b, we plot the effective cross section in each time-ordered LB for the 9 single-jet triggers. The trigger behaviors are relatively stable over the course of the 2011A run, though there is a noticeable shift in the HLT Jet80 trigger when its selection criteria changed. One can also see when the HLT Jet300 trigger turned on and when the HLT Jet80 and HLT Jet150 triggers were turned off.
Since HLT Jet300 is the lowest p T single-jet trigger that is unprescaled, it will be the sole trigger used in our substructure and EMD studies (see further discussion in Sec. II E). For reference, the recorded luminosity for HLT Jet300 as a function of time is plotted in Fig. 18 of App. A.

D. Monte Carlo Event Samples
A key feature of the CMS 2011 data release compared to the initial one from 2010 is the inclusion of MC event samples. (Some MC samples corresponding to the 2010 dataset have been subsequently released.) For our analysis, we use samples of hard QCD scattering generated by Pythia 6.4.25 [82] with tune Z2 [122]. As summarized in Table II, there are 15 MC samples with non-overlapping hard-scattering parton $\hat{p}_T$ ranges [67][68][69][70][71][72][73][74][75][76][77][78][79][80][81], totaling 13.4 TB. They are labeled by CMS as QCD_Pt-MINtoMAX_TuneZ2_7TeV_pythia6, where $\hat{p}_T \in [{\rm MIN}, {\rm MAX}]$ GeV. These events are then simulated and reconstructed using the CMS detector simulation based on Geant 4 [83]. Throughout this paper, we use "generation" to refer to the output of the parton shower generator, and "simulation" to refer to the output of the detector simulation.

FIG. 1. (a) [...] Note that the Jet300 trigger used for our jet studies turns on after around 50 pb$^{-1}$ has already been collected, but this is a relatively small fraction of the total 2.3 fb$^{-1}$ collected over the course of Run 2011A. The luminosity profile as a function of date is shown in Fig. 18 of App. A. (b) Effective cross section for the single-jet triggers in each LB where the trigger fired. The flatness of these curves indicates that the trigger behavior is roughly constant across the entire run, apart from moments where the trigger criteria or prescale factors changed. The horizontal dashed lines correspond to the total effective cross section for that trigger from Table I.
Both the generation-level and simulation-level objects are stored in AODSIM format by CMS, and we convert them to our MOD format using MODProducer. Apart from the generation-level event record from Pythia, the AODSIM format is very similar to AOD. In particular, AODSIM includes reconstructed AK5 jets, simulated trigger information, as well as the addition of pileup. We store the simulated PFCs, the final-state particles in the Pythia event record, and the 2 → 2 hard-scattering process for anticipated future studies related to parton flavor. If an association between simulation-level and generation-level jets is needed, jets are matched if their jet axes are within ∆R = 0.5 of each other. To enable future jet flavor studies, generation-level jets are also matched to hard-process partons if they are less than ∆R = 1.0 apart.
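A $\Delta R$ matching of this kind can be sketched as follows. The text only states the matching radius; choosing the closest generation-level jet when several lie within the radius, and representing jet axes as `(eta, phi)` pairs, are assumptions made here for illustration.

```python
import math

def delta_r(axis_a, axis_b):
    """Distance between two (eta, phi) jet axes, with phi wrapped."""
    dphi = abs(axis_a[1] - axis_b[1]) % (2.0 * math.pi)
    if dphi > math.pi:
        dphi = 2.0 * math.pi - dphi
    return math.hypot(axis_a[0] - axis_b[0], dphi)

def match_jets(sim_jets, gen_jets, max_dr=0.5):
    """For each simulation-level jet, return the index of the closest
    generation-level jet within max_dr, or None if there is none.
    (Closest-jet tie-breaking is an assumption, not from the text.)"""
    matches = []
    for sim in sim_jets:
        best_index, best_dr = None, max_dr
        for index, gen in enumerate(gen_jets):
            dr = delta_r(sim, gen)
            if dr < best_dr:
                best_index, best_dr = index, dr
        matches.append(best_index)
    return matches
```

The same helper could be reused with `max_dr=1.0` for the association of generation-level jets to hard-process partons.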
Because of the steep dependence of the QCD dijet cross section on $\hat{p}_T$, the MC events have different weights, though the weights for all events in a single MC sample are the same. Therefore, when filling histograms, we have to weight each MC event by the generated cross section $\sigma^{\rm MC}_{\rm eff}$ divided by the number of events in the MC sample, as given in Table II. As discussed in App. B, we also weight the MC events according to the number of primary vertices in order to match the distribution of pileup seen in the data.
One subtlety in using the generation-level Pythia information is that there is a cutoff on the hadron lifetime above which hadrons are considered stable. This cutoff is set to $c\tau_{\rm stable} = 10$ mm, which means that various hadrons with non-zero strangeness are considered stable, notably the $K^0_S$ meson. Typically, these strange hadrons decay within the CMS detector volume and are often reconstructed as if the decay products came from the primary vertex. For example, $K^0_S \to \pi^+ \pi^-$ will typically be reconstructed as two pion-labeled PFCs. This leads to a mismatch in observables like track multiplicity unless we manually decay these strange hadrons. As a workaround, we load the generation-level event record into Pythia 8.235 [123] and adjust the hadron lifetime threshold to $c\tau_{\rm stable} = 1000$ mm. Because the kinematics and flavors of the hadron decay will not be the same as in the CMS detector simulation, there is a slight mismatch when comparing a generation-level event to its simulation-level counterpart, though this issue does not arise when comparing histograms.
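The per-sample weighting can be sketched in a few lines. The function names are hypothetical, and the pileup weight is treated as an externally supplied multiplicative factor, which is an assumption about how the two weights are combined.

```python
def mc_event_weight(sigma_mc, n_events, pileup_weight=1.0):
    """Weight shared by all events of one MC sample: the generated cross
    section divided by the number of events in the sample, optionally
    multiplied by a per-event pileup weight (assumed multiplicative)."""
    return sigma_mc / n_events * pileup_weight

def fill_histogram(values, weights, edges):
    """Weighted histogram with half-open bins [edges[i], edges[i+1])."""
    counts = [0.0] * (len(edges) - 1)
    for value, weight in zip(values, weights):
        for i in range(len(counts)):
            if edges[i] <= value < edges[i + 1]:
                counts[i] += weight
                break
    return counts
```

With these weights, histograms filled from different MC samples can be added directly to give a distribution with units of cross section.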

E. Jet and Trigger Selection
The jet studies in Secs. III and IV are based on the two hardest $p_T$ jets in an event. This is motivated by the fact that $2 \to 2$ QCD dijet production at leading order yields two jets of equal $p_T$. Therefore, considering the substructure of just the hardest $p_T$ jet (as in the studies of Refs. [37,38]) is IRC unsafe, since an infinitesimally soft emission can change the relative jet ordering. On the other hand, considering more than two jets requires information beyond leading order, so we only consider the two hardest $p_T$ jets in our analysis. (See Ref. [124] for further discussions of single-jet inclusive cross section definitions.) The CMS single-jet triggers are designed to fire any time an event has a jet whose $p_T$ is above a given threshold. We independently analyze the two hardest jets in an event, correcting their $p_T$ values by the appropriate JEC factors. When we perform our substructure analysis, we require that the jets satisfy $|\eta^{\rm jet}| < 1.9$ to make sure that the $R = 0.5$ jets are reconstructed fully within the tracking volume that covers $|\eta^{\rm tracker}| < 2.4$. We impose "medium" JQC (see Table III) [116,117] throughout this study.
In Fig. 2a, we show the $p_T$ spectrum of just the hardest jet in the CMS 2011 Open Data, separated into the 9 single-jet triggers. (The spectrum for the two hardest jets will be shown in Fig. 5a.) We see that the triggers start to collect an appreciable number of jets when the jet $p_T$ matches the trigger name, asymptoting to a common smooth $p_T$ spectrum. The small population of jets at low $p_T$ values below the turn-on is due primarily to trigger misfirings, for example from fake jets that do not satisfy the jet quality criteria. In Fig. 2b, we show the same $p_T$ spectrum in the CMS simulation, separated into the 9 most relevant MC samples for our analysis (out of 15 total). We see that the MC files have support mainly in their designated $\hat{p}_T$ ranges, albeit with a spread due to phenomena like initial state radiation (ISR) that change the overall event kinematics.
To simplify our physics studies, we use just one of the single-jet triggers. As mentioned above, we select HLT Jet300 since this has the lowest $p_T$ threshold among the unprescaled single-jet triggers. Looking at Fig. 2a, we can estimate that Jet300 is fully efficient for $p_T > 375$ GeV. Looking at Fig. 2b, we see that all of the MC samples with $\hat{p}_T > 170$ GeV contribute appreciably to the $p_T > 375$ GeV region, corresponding to 8 required MC event samples.
To determine where the Jet300 trigger is fully efficient, we compare its behavior to the Jet240 trigger; see related trigger efficiency studies in Refs. [107,125]. In Fig. 3a, we consider events where the Jet300 trigger is present and the Jet240 trigger fired. We then plot the fraction of events where Jet300 fired as a function of jet $p_T$. Fitting the resulting fraction to an error function, we estimate that the Jet300 trigger is 99% efficient (relative to Jet240) at 367 GeV, justifying our choice of $p_T > 375$ GeV. We can cross check our trigger efficiency study using the simulated MC samples. In Fig. 3b, we plot the fraction of events where the simulated Jet300 trigger fired as a function of jet $p_T$. Doing the same error function fit, we find that the simulated Jet300 trigger is 99% efficient (on an absolute scale) at 350 GeV, which is again consistent with our $p_T > 375$ GeV choice. For completeness, we provide efficiency plots for all of the triggers in Fig. 21 of App. C. Since we are performing an exploratory jet study, we do not correct for this small trigger inefficiency in our analysis.
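The way a 99% efficiency point follows from an error-function turn-on can be illustrated as below. The text does not specify the exact parameterization of the fit, so the Gaussian-CDF form with midpoint `mu` and width `sigma` is an assumption, and the fit itself (obtaining `mu` and `sigma` from the measured fractions) is not shown.

```python
import math
from statistics import NormalDist

def turn_on(pt, mu, sigma):
    """Assumed turn-on curve: Gaussian CDF with midpoint mu and width sigma."""
    return 0.5 * (1.0 + math.erf((pt - mu) / (math.sqrt(2.0) * sigma)))

def pt_at_efficiency(efficiency, mu, sigma):
    """Invert the turn-on curve: the pT at which it reaches `efficiency`."""
    return mu + sigma * NormalDist().inv_cdf(efficiency)
```

For this parameterization, the 99% point sits about 2.33 widths above the midpoint, so a sharper turn-on brings the fully efficient region closer to the nominal trigger threshold.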

FIG. 3. Shown are (a) the relative efficiency of the Jet300 trigger with respect to Jet240 in the CMS Open Data, and (b) the absolute efficiency of the Jet300 trigger in the MC simulation. Both of these curves are fit to an error function (ERF) to estimate the efficiency boundaries. From these, we conclude that the Jet300 trigger is fully efficient above $p_T^{\rm jet} > 375$ GeV. This analysis is repeated for the other triggers in Fig. 21.

TABLE IV. [...] The selections in the first block ensure that the Jet300 trigger fired in a valid LB, the requirements in the second block ensure that the Jet300 trigger is fully efficient, and the cuts in the third block impose the JQC and the baseline analysis criteria. Because our analysis is based on the two hardest jets, there is an increase by a factor of about two between the first and second blocks.

Our initial workflow is summarized in Table IV. Because we consider the two hardest jets with $p_T^{\rm jet} > 10$ GeV, there are about twice as many jets in the analysis as the number of events. In order to have a more homogeneous jet sample, we impose the narrower $p_T^{\rm jet} \in [375, 425]$ GeV range for our substructure and EMD studies below. An example event from the CMS 2011 Open Data passing our kinematic jet selections is displayed in Fig. 4, including information about the charges and vertices of the PFCs.
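The baseline jet selection for the substructure and EMD studies can be summarized in a short sketch. The dictionary keys and the function name are hypothetical, the jet `pt` is assumed to be already JEC-corrected, and treating the $p_T$ window as closed at both ends is an assumption.

```python
def select_jets(event_jets, pt_min=375.0, pt_max=425.0, abs_eta_max=1.9):
    """Return the jets among the two hardest that pass the baseline cuts.

    event_jets: list of dicts with keys 'pt' (JEC-corrected, in GeV),
                'eta', and 'passes_medium_jqc' (hypothetical layout).
    """
    # Only the two hardest jets in the event are considered.
    two_hardest = sorted(event_jets, key=lambda jet: jet["pt"], reverse=True)[:2]
    return [
        jet
        for jet in two_hardest
        if pt_min <= jet["pt"] <= pt_max
        and abs(jet["eta"]) < abs_eta_max
        and jet["passes_medium_jqc"]
    ]
```

Because the two jets are tested independently, an event can contribute zero, one, or two jets to the final sample.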

III. ANALYZING JET SUBSTRUCTURE
To validate the performance of the CMS detector for jet reconstruction, we present a variety of jet kinematics and jet substructure distributions derived from the CMS 2011 Open Data. There are two main differences compared to a similar analysis performed in Ref. [38]. First, we can now compare the open data distributions to detector-simulated MC samples to check for robustness. Second, we have proper luminosity information [110] such that we can plot (uncorrected) differential cross sections, instead of just normalized probability distributions.

A. Overall Jet Kinematics
In Fig. 5a, we show the $p_T$ spectrum of the two hardest jets (i.e. two histogram entries per event), restricted to the region $|\eta^{\rm jet}| < 1.9$ and $p_T^{\rm jet} > 375$ GeV. Here, we compare the CMS Open Data in black to the simulated MC samples in orange. We find very good agreement in the shape of the $p_T$ spectrum after including appropriate K-factors described below, though there are small disagreements and discontinuities for $p_T^{\rm jet} > 750$ GeV. We also show the generation-level Pythia distribution without detector simulation in blue, which matches very well to the orange simulation-level distribution with detector response, indicating that the overall JEC factors have been chosen appropriately. (Of course, the JEC factors also include data-driven corrections beyond just those captured by the detector simulation.) Note that these distributions only include statistical uncertainties, without any estimate of systematic uncertainties.
Because Pythia is a leading-order generator, we have rescaled the MC events by a next-to-leading-order (NLO) K-factor. This $p_T$-dependent K-factor is derived from Ref. [126] for $R = 0.5$ jets, with $K_{\rm NLO} \simeq 1.135$ in the vicinity of 400 GeV. As discussed further in App. B, we reweight the MC in order that the pileup level in the simulation matches the data. Finally, we multiply by an additional factor of $K_{375} = 0.961$ to ensure that the lowest bin in the simulation has the same normalization as the actual data. This factor partially accounts for effects like the efficiency of the medium JQC, which is difficult to extract reliably from the CMS simulation, as well as QCD corrections beyond NLO and uncertainties on the recorded luminosity.
In Fig. 5b, we show the jet pseudorapidity spectrum. After relaxing the $|\eta^{\rm jet}| < 1.9$ requirement, we find a small number of jets at larger pseudorapidities. Compared to the simulated data, the open data has more jets in the vicinity of $|\eta^{\rm jet}| \simeq 1.2$ and fewer in the vicinity of $|\eta^{\rm jet}| \simeq 0.0$, indicating a possible issue with the Pythia prediction or with the pseudorapidity dependence of the JEC factors. That said, the overall agreement is very good, giving us confidence that we can make basic kinematic jet selections. For completeness, the jet azimuth spectrum is shown in Fig. 22 of App. C, which exhibits the expected flat spectrum with small fluctuations due to detector inhomogeneities.

B. Jet Constituents
In addition to the reconstructed AK5 jets, the CMS Open Data contains the complete list of PFCs, which allows us to calculate a wide range of jet substructure observables. Due to detector effects, one has to be careful when interpreting the PFC information. Ultimately, we will focus on track-based observables which have better reconstruction performance as well as better pileup stability.
In Table V, we list the PID codes of the PFCs and their absolute counts in the jet sample with $|\eta^{\rm jet}| < 1.9$ and $p_T^{\rm jet} \in [375, 425]$ GeV. Note that there are more events in the MC samples than in the open data, so there is a corresponding increase in the number of total PFCs. The PID codes indicate the most likely particle candidate, using the PDG MC numbering scheme [127]. In particular, code 211 includes $\pi^+$, $K^+$, and proton candidates, code 22 includes photon and merged $\pi^0 \to \gamma\gamma$ candidates, and code 130 includes $K^0_L$ and neutron candidates. The counts in Table V include contamination from pileup. As shown in Fig. 19a of App. C, there are typically $\sim 5$ pileup events per beam crossing. While the CMS Open Data already includes a pileup correction for the jet $p_T$ via the JEC factors, this is insufficient to correct substructure distributions. We have two ways to mitigate the effect of pileup. First, we apply the CHS procedure [65] to remove charged particles not associated with the primary vertex. This is possible since MODProducer now stores vertex information (see Sec. II B above), so we can remove charged jet constituents assigned to pileup vertices. Though CHS cannot remove neutral particles from pileup, it does reduce the overall pileup contamination by a factor of $\sim 2/3$. Second, inspired by the SoftKiller procedure [128], we impose a $p_T^{\rm PFC} > 1$ GeV cut on all PFCs, where this value is motivated by Fig. 6 below. This helps control the level of neutral pileup, though we will still focus on track-based observables in our subsequent analyses.
The $p_T$ spectrum of neutral PFCs is shown in Fig. 6a. The neutral PFCs do not benefit from CHS, so there is a significant excess of neutral PFCs from pileup below around 2 GeV, compared to generation-level expectations. That said, the CMS simulation appropriately captures this neutral pileup contamination. Because of finite calorimeter granularity, there is a depletion of moderate $p_T$ neutral PFCs as a result of merging. This merging results in an excess of higher $p_T$ neutral PFCs, which can be seen in Fig. 23a of App. C.

FIG. 5. (a) [...] We consider up to two of the hardest $p_T$ jets, restricted to $|\eta^{\rm jet}| < 1.9$ and $p_T^{\rm jet} > 375$ GeV. In addition to having a $p_T$-dependent NLO K-factor, the MC events have been normalized to match the lowest $p_T$ bin. (b) Jet pseudorapidity spectrum, with the $|\eta^{\rm jet}|$ requirement removed. For both jet spectra, we see very good agreement between data and simulation, indicating that we have properly processed the CMS Open Data, including appropriate JEC factors. In these and all subsequent plots, the error bars indicate statistical uncertainties only, with no attempt at estimating systematic uncertainties. The jet azimuth spectrum is shown in Fig. 22.

The $p_T$ spectrum of charged PFCs is shown in Fig. 6b. With CHS, the PFC $p_T$ spectrum is rather similar between the CMS Open Data and the MC event samples, even at the generator level and even going out to higher $p_T$ in Fig. 23b of App. C. The main difference is below 1 GeV, where one sees the impact of tracking inefficiencies and momentum misreconstruction. For this reason, we impose a cut of $p_T^{\rm PFC} > 1$ GeV for all of our jet substructure studies, which results in better data/MC agreement for observables like track multiplicity that are sensitive to such effects. Note that this same $p_T^{\rm PFC}$ cut was advocated for in Ref. [38], though a looser cut of 500 MeV is used by CMS in its track multiplicity study [129].
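The two pileup mitigation steps can be sketched as a single constituent filter. The dictionary keys and the convention that vertex index 0 denotes the primary vertex are assumptions made for illustration; they are not taken from the MOD file format.

```python
def mitigate_pileup(pfcs, pt_min=1.0, primary_vertex=0):
    """Apply CHS and a constituent pT cut to a list of PFCs.

    pfcs: list of dicts with keys 'pt' (GeV), 'charge', and 'vertex'
          (hypothetical layout; 'vertex' is ignored for neutral PFCs).
    """
    kept = []
    for pfc in pfcs:
        # CHS: drop charged PFCs assigned to a pileup vertex.
        if pfc["charge"] != 0 and pfc["vertex"] != primary_vertex:
            continue
        # SoftKiller-inspired cut applied to all remaining PFCs.
        if pfc["pt"] <= pt_min:
            continue
        kept.append(pfc)
    return kept
```

A track-based analysis would additionally keep only the constituents with non-zero charge after this filter.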

C. Jet Substructure Observables
We now plot a representative sample of jet substructure observables, comparing the CMS Open Data to the MC samples, both before and after detector simulation. Based on the conclusions of Sec. III B, we always implement CHS and impose the $p_T^{\rm PFC} > 1$ GeV cut. In order to analyze jets with similar total $p_T$, we focus on the relatively narrow range of $p_T^{\rm jet} \in [375, 425]$ GeV. In Fig. 7, we show three classic substructure distributions: jet mass, constituent multiplicity, and $p_T^D$ [130]. Using all PFCs, shown in the left column of Fig. 7, there is good agreement between the CMS Open Data and the simulation-level MC events. This suggests that Pythia 6 with tune Z2 has a reasonable model for jet fragmentation and that the CMS simulation provides a faithful characterization of the detector response; see related studies in Ref. [129], as well as Ref. [131] for alternative Pythia tunes.
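The three observables can be computed directly from the jet constituents, as in the following sketch. Treating each constituent as massless with coordinates `(pt, eta, phi)` is a simplifying assumption, and the function name is hypothetical; $p_T^D$ is taken to be the square root of the sum of squared constituent $p_T$ values divided by their scalar sum.

```python
import math

def substructure(constituents):
    """Return (mass, multiplicity, pTD) for a list of (pt, eta, phi)
    constituents, each treated as massless (simplifying assumption)."""
    energy = px = py = pz = 0.0
    sum_pt = sum_pt2 = 0.0
    for pt, eta, phi in constituents:
        energy += pt * math.cosh(eta)
        px += pt * math.cos(phi)
        py += pt * math.sin(phi)
        pz += pt * math.sinh(eta)
        sum_pt += pt
        sum_pt2 += pt * pt
    # Guard against tiny negative values from floating-point rounding.
    mass_squared = max(energy**2 - px**2 - py**2 - pz**2, 0.0)
    return math.sqrt(mass_squared), len(constituents), math.sqrt(sum_pt2) / sum_pt
```

Jet mass is IRC safe, whereas multiplicity and $p_T^D$ are not, which is why the latter two are more sensitive to the constituent-level detector effects discussed in the text.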
That said, there are significant differences when comparing the generation-level and simulation-level MC distributions, even after applying CHS for pileup mitigation. Roughly speaking, the CMS detector reconstructs fewer PFCs than expected, which is consistent with merging of neutral PFCs due to finite calorimeter granularity. On the other hand, the CMS detector reconstructs a larger jet mass than expected, which is consistent with residual neutral pileup contamination.
We can improve the generation-level and simulation-level agreement by restricting our analysis to just charged PFCs, as shown in the right column of Fig. 7. The agreement improves most notably for the IRC-unsafe observables of multiplicity and p D T . While the CMS detector reconstructs fewer charged PFCs than expected from Pythia at the generation level, the difference is well within the theoretical uncertainties in MC generation (see further discussion in Ref. [132]). Since we will not attempt to unfold the data in this paper, it is important for us to use observables that are robust to detector effects. For this reason, the focus of our EMD studies will be on track-based observables.
It is worth remarking that the good agreement in the track multiplicity distribution in Fig. 7d is due in part to using the medium JQC. If we were to use the loose JQC, there would be an excess of events with very low track multiplicity in the CMS Open Data. Most likely, these are prompt photons which barely pass the loose JQC, and to describe these properly, we would need to include photon-plus-jet MC samples. This excess is removed by the medium JQC, with only a modest impact on other substructure distributions.
We investigate three additional jet substructure distributions in Fig. 8: N 95 [133], z g [40], and D 2 [134] with β = 1. These observables probe, respectively, the uniformity of jet activity, the momentum sharing between subjets, and the two-prong substructure of jets. We implement N 95 as the minimum number of pixels in a 33×33 jet image from −R to R required to account for at least 95% of the total p T . The soft drop jet grooming [135,136] parameters used to define the groomed momentum fraction z g are z cut = 0.1 and β = 0. Jets with z g = 0 indicate that the grooming procedure results in just a single remaining particle. Again, we find good agreement between the CMS Open Data and the simulation-level MC samples when using all PFCs, but the generation-level and simulation-level distributions agree somewhat better when restricted to track-based observables. Using our released samples [86][87][88][89][90][91][92][93][94], it is straightforward to plot a wide range of jet substructure observables [137], a number of which have already been implemented in the EnergyFlow package [84].
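The pixel-counting definition of N 95 given above can be sketched as follows. This is a minimal illustration, assuming constituents are already centered on the jet axis, that the image spans −R to R in both rapidity and azimuth, and that the "total p T" refers to the p T falling inside the image; the default R = 0.5 and the function name `n95` are assumptions for illustration.

```python
import numpy as np

def n95(pt, y, phi, R=0.5, npix=33, frac=0.95):
    """Minimum number of pixels containing at least `frac` of the image pT."""
    edges = np.linspace(-R, R, npix + 1)
    # Jet image: pT deposited in each (rapidity, azimuth) pixel.
    image, _, _ = np.histogram2d(y, phi, bins=(edges, edges), weights=pt)
    pixels = np.sort(image.ravel())[::-1]        # brightest pixels first
    cumulative = np.cumsum(pixels)
    # First index at which the running sum reaches the target fraction.
    idx = np.searchsorted(cumulative, frac * cumulative[-1], side="left")
    return int(idx) + 1

# Three particles in three different pixels: 90 GeV alone is below 95%,
# so two pixels are needed.
print(n95(pt=[90.0, 6.0, 4.0], y=[0.0, 0.2, -0.2], phi=[0.0, 0.2, -0.2]))
```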

IV. EXPLORING THE SPACE OF JETS
We now turn from considering individual substructure observables at the histogram level to studying the radiation pattern in jets more broadly. In this section, we will use the energy mover's distance [56] as a metric to compare the energy flow of jets. We perform a range of exploratory EMD studies on the CMS Open Data to universally probe jet modifications, explore the space of jets, and visualize the most representative jets.

A. Review of the Energy Mover's Distance
The jet energy flow can be characterized by an energy density on a two-dimensional surface, corresponding to an idealized detector at infinity [21][22][23]. For proton-proton collisions, we typically use transverse momentum p T instead of energy and we indicate angular directions via rapidity y and azimuth φ. In these coordinates, the energy flow (more precisely, the transverse momentum flow) is:
$$\rho_J(y, \phi) = \sum_{j \in J} p_{Tj}\, \delta(y - y_j)\, \delta(\phi - \phi_j), \tag{3}$$
where j labels the constituents of the jet J . The expression in Eq. (3) is IRC safe by construction, since a particle with zero p T does not contribute to the sum and a collinear splitting does not change the sum. The energy flow does not include any PID information, which is important to ensure IRC safety. To handle constituent masses, one could include velocity information [138], but that is beyond the scope of this paper.
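Given two jets I and J , the EMD is [56]:
$$\mathrm{EMD}(I, J) = \min_{\{f_{ij}\}} \sum_{i \in I} \sum_{j \in J} f_{ij} \frac{R_{ij}}{R} + \Big| \sum_{i \in I} p_{Ti} - \sum_{j \in J} p_{Tj} \Big|, \tag{4}$$
where $R_{ij}^2 = (y_i - y_j)^2 + (\phi_i - \phi_j)^2$ is the rapidity-azimuth distance, R is the jet radius, and f ij is the amount of transverse momentum "transported" from particle i in jet I to particle j in jet J , subject to the constraints:
$$f_{ij} \geq 0, \qquad \sum_{j \in J} f_{ij} \leq p_{Ti}, \qquad \sum_{i \in I} f_{ij} \leq p_{Tj}, \tag{5}$$
$$\sum_{i \in I} \sum_{j \in J} f_{ij} = \min\Big( \sum_{i \in I} p_{Ti},\, \sum_{j \in J} p_{Tj} \Big). \tag{6}$$
Finding the minimum over {f ij } in Eq. (4) is an optimal transport problem which can be solved efficiently using the network simplex algorithm [139][140][141].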
The expression in Eq. (4) is non-negative, symmetric, and satisfies the triangle inequality:
$$\mathrm{EMD}(I, J) \leq \mathrm{EMD}(I, K) + \mathrm{EMD}(K, J). \tag{7}$$
Therefore, EMD is a proper metric on the space of energy flows, with units of energy (i.e. GeV). If the EMD between two jets is zero, then they are treated as identical. For this reason, it is often convenient to perform symmetry transformations on the jets prior to calculating the EMD. (This transformation procedure is closely related to the tangent earth mover's distance [142].) For all of the EMD studies in this paper, we longitudinally boost and azimuthally rotate each jet such that its four-vector is at the (y, φ) origin. The second term in Eq. (4) is a cost term when two jets have different values of their scalar sum p T . Because we are primarily interested in relative jet energy flows and not absolute jet energy scales, it is convenient to rescale the jets to make this cost term vanish. For jets with p jet T ∈ [375, 425] GeV, we rescale the jet constituents uniformly such that
$$\sum_{j \in J} p_{Tj} \Rightarrow 400~\mathrm{GeV}. \tag{8}$$
Since we are working in a relatively narrow p T range and since QCD is a quasi-scale-invariant theory, this rescaling has only a mild impact on our results. Experimentally, this rescaling has the nice feature of reducing the dependence of our results on the JEC factors and on any PFC selection criteria. Theoretically, this rescaling has the nice feature of making the EMD identical (up to an overall energy scale) to the 1-Wasserstein metric between probability densities [143,144]. Changing the baseline from 400 GeV to some other scale would just proportionally rescale all the results below. As motivated by Sec. III (and further motivated by Sec. IV B below), we often restrict our attention to charged particles with p PFC T > 1 GeV. Strictly speaking, such a PFC restriction breaks the collinear safety (though not the soft safety) of the EMD, though there are calculational strategies to account for this using track functions [145][146][147][148]. Note that we always apply the rescaling in Eq. (8) after applying any PFC-level restrictions, such that our track-only results are similar in spirit to track-assisted observables [149,150]. Crucially, the PFC restriction and overall rescaling still preserve the metric properties of the EMD in Eq. (7).
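The ordering of the preprocessing steps matters: PFC-level restrictions first, then the rescaling to 400 GeV. A minimal sketch of that sequence is given below. It assumes the jet axis (`y_jet`, `phi_jet`) is supplied by the caller and that constituents carry an integer charge; the function name `preprocess` and its arguments are hypothetical.

```python
import numpy as np

def preprocess(pt, y, phi, charge, y_jet, phi_jet, pt_min=1.0, pt_target=400.0):
    """Track-only selection, pT cut, centering on the jet axis, then rescaling."""
    pt, y, phi, charge = (np.asarray(a, dtype=float) for a in (pt, y, phi, charge))
    # 1) PFC-level restrictions: charged constituents above the pT threshold.
    keep = (charge != 0) & (pt > pt_min)
    pt, y, phi = pt[keep], y[keep], phi[keep]
    # 2) Translate so that the jet axis sits at the (y, phi) origin,
    #    wrapping azimuthal differences into [-pi, pi).
    y_c = y - y_jet
    phi_c = (phi - phi_jet + np.pi) % (2 * np.pi) - np.pi
    # 3) Rescale uniformly so the scalar sum pT equals the target (applied last).
    pt_scaled = pt * (pt_target / pt.sum())
    return pt_scaled, y_c, phi_c
```

Because the rescaling is applied after the selection, the surviving charged constituents always sum to the target value, which is what makes the second term of the EMD vanish between any two preprocessed jets.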
An example EMD computation for two jets in the CMS Open Data is shown in Fig. 9. In the top row, we show two jets plotted in the style of Fig. 4. Here, the size of the dots indicates the transverse momenta of the PFCs, the colors indicate whether the PFCs are neutral or charged, and the crosses indicate charged PFCs that have been removed by CHS. In the bottom row, we drop the PID information and switch to the energy flow representation in Eq. (3). We overlay the two jets, with the red dots corresponding to the first jet, the blue dots corresponding to the second jet, and the gray lines indicating the optimal transport {f ij }. Because we have rescaled the jets by Eq. (8), all p T from the first jet can be transported to the second jet.
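For small jets, the optimal transport problem defining the EMD can be written directly as a linear program. The sketch below uses `scipy.optimize.linprog` with the HiGHS solver in place of the network simplex algorithm cited in the text, so it is an illustration of the definition rather than an efficient implementation. It assumes the jets are already centered (no azimuthal wrapping is applied), and the default R = 0.5 and the function name `emd` are assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def emd(pt_i, y_i, phi_i, pt_j, y_j, phi_j, R=0.5):
    """EMD between two jets via a dense linear program (illustrative only)."""
    pt_i, pt_j = np.asarray(pt_i, dtype=float), np.asarray(pt_j, dtype=float)
    dy = np.subtract.outer(np.asarray(y_i, dtype=float), np.asarray(y_j, dtype=float))
    dphi = np.subtract.outer(np.asarray(phi_i, dtype=float), np.asarray(phi_j, dtype=float))
    cost = np.sqrt(dy**2 + dphi**2) / R            # R_ij / R
    n_i, n_j = len(pt_i), len(pt_j)
    # Flattened flow f[i * n_j + j]; each particle ships or receives at most its pT.
    rows = np.kron(np.eye(n_i), np.ones((1, n_j)))   # sum over j for each i
    cols = np.kron(np.ones((1, n_i)), np.eye(n_j))   # sum over i for each j
    result = linprog(
        cost.ravel(),
        A_ub=np.vstack([rows, cols]), b_ub=np.concatenate([pt_i, pt_j]),
        A_eq=np.ones((1, n_i * n_j)), b_eq=[min(pt_i.sum(), pt_j.sum())],
        bounds=(0, None), method="highs")
    # Transport cost plus the penalty for unequal scalar sum pT.
    return result.fun + abs(pt_i.sum() - pt_j.sum())

# One 400 GeV particle at the origin versus two 200 GeV particles at y = +-0.2.
print(emd([400.0], [0.0], [0.0], [200.0, 200.0], [0.2, -0.2], [0.0, 0.0]))
```

In this toy case each 200 GeV particle is moved a distance 0.2, so with R = 0.5 the transport cost is 2 × 200 × 0.2 / 0.5 = 160 GeV.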

B. Quantifying Detector Effects
As a first application of the EMD, we investigate a novel way to quantify the impact of detector effects and pileup. An example MC jet is shown in Fig. 10, where the EMD is computed between the same jet before and after detector simulation. See Sec. II D for how we associate simulation-level and generation-level jets. Pileup is removed with CHS and a variety of PFC cuts are applied to improve the agreement between the particle-level and detector-level jets. This is explicitly shown by the decreasing EMD as the cuts are applied, quantifying the fact that the radiation patterns within the jets are becoming more similar.
To see the impact of these cuts on the jet ensemble as a whole, in Fig. 11 we histogram the EMDs between the same MC jet evaluated at generation level and simulation level. Here, we impose p jet T ∈ [375, 425] GeV on the simulation-level jet, while the generation-level jet could fall outside of this range. We emphasize that these EMD calculations are performed after the rescaling in Eq. (8), so this only quantifies the change in the radiation pattern, not the change in radiation intensity. As emphasized in Ref. [56], jets that are close in EMD are close in any (Lipschitz-bounded) IRC-safe measure, so small values of the generation-to-simulation EMD correspond to small differences between, for example, the generation- and simulation-level jet mass. In this way, the EMD provides a universal bound on the impact detector effects can have on IRC-safe observables, which is a convenient alternative to studying the impact on specific observables individually.
Considering all PFCs in Fig. 11a, the generation-to-simulation EMD peaks at around 17 GeV. We can decrease the generation-to-simulation difference by sequentially applying CHS and the p PFC T > 1 GeV cut, though the impact is relatively modest. In evaluating the EMD, the p PFC T > 1 GeV restriction is applied at both the generation and simulation levels. Imposing the track-only restriction in Fig. 11b, the generation-to-simulation EMD peak is shifted downward by a factor of about 2. Now, CHS has a much more pronounced impact, since it substantially decreases the relative pileup contamination. The p PFC T > 1 GeV cut has a modest, but non-negligible, impact. As expected, the impact of detector effects and pileup is minimized for track-based observables after CHS. In Fig. 20 in App. B, we further investigate the performance of CHS for pileup mitigation. In Fig. 24 in App. C, we investigate the impact of the p PFC T cut in more detail.
From these studies, we conclude that our default selection (charged PFCs with p PFC T > 1 GeV) is a reasonable compromise between reconstruction performance and substructure sensitivity. More generally, we see that the EMD is an effective and intuitive way to quantify the impact of detector effects and pileup contamination.

C. Visualizing the Space
It is interesting to directly visualize the metric space of jets defined by EMD. There are a variety of techniques to visualize high-dimensional data in low dimensions, which provide a fascinating way to see the broad features of a dataset. Here, we apply t-Distributed Stochastic Neighbor Embedding (t-SNE) [151][152][153][154], which finds a low-dimensional embedding of the data in a way that respects the distance between data points. We run t-SNE with a two-dimensional embedding space, in which the procedure defines two axes and attempts to place data points in this two-dimensional plane in such a way that jets close in EMD are nearby and jets far in EMD are distant.
Though there are techniques to implement t-SNE on N data points in O(N log N ) runtime [154], due to current limitations in the scikit-learn [155] implementation that we use, we have to perform O(N 2 ) operations. To make this computationally tractable, we restrict our attention to the p jet T ∈ [399, 401] GeV range, which yields approximately 40,000 jets in the CMS Open Data. We also subsample and unweight the MC events to obtain around 40,000 generation-level and 40,000 simulation-level jets as well. (Because there are insufficient events in the p T ∈ [170, 300] GeV MC sample [74], we have to downweight them by a factor of around 10 to achieve an approximately unweighted sample.) We apply CHS, the p PFC T > 1 GeV cut, and the track-only restriction on all jets. To reduce the effective dimensionality of the dataset and remove a trivial isometry, we rotate the jets around the jet axis such that the principal component of the transverse radiation pattern is aligned vertically in the rapidity-azimuth plane, breaking the two-fold degeneracy by enforcing that the jet has more scalar sum p T at positive azimuth. We also keep only the particles within a jet radius of the jet axis.
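The rotation step described above can be sketched with a small eigenvalue computation. This illustration assumes that constituents are already centered on the jet axis, that the principal component is taken from the p T-weighted second-moment matrix about the origin, and that "vertical" means the azimuthal direction; the function name `align_jet` is hypothetical.

```python
import numpy as np

def align_jet(pt, y, phi):
    """Rotate a centered jet so its principal axis lies along the azimuth axis."""
    pt = np.asarray(pt, dtype=float)
    coords = np.column_stack([y, phi]).astype(float)
    # pT-weighted second-moment matrix of the radiation pattern about the jet axis.
    moments = (coords * pt[:, None]).T @ coords
    _, evecs = np.linalg.eigh(moments)       # eigenvalues returned in ascending order
    v = evecs[:, -1]                         # unit vector along the principal axis
    # Proper rotation that maps v onto the +phi direction (0, 1).
    c, s = v[1], v[0]
    rot = np.array([[c, -s], [s, c]])
    rotated = coords @ rot.T
    # Break the two-fold degeneracy: more scalar sum pT at positive azimuth.
    if pt[rotated[:, 1] > 0].sum() < pt[rotated[:, 1] < 0].sum():
        rotated = -rotated                   # equivalent to a rotation by pi
    return rotated[:, 0], rotated[:, 1]
```

Since only rotations about the jet axis are used, the distance of each constituent from the axis is unchanged, and the EMD between two aligned jets is still a valid distance between the corresponding equivalence classes of jets.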
The results of t-SNE embedding into a two-dimensional space are shown in Fig. 12 for the CMS Open Data and for the simulation-level and generation-level MC samples. For visual clarity, we rotate the t-SNE manifold such that the three embeddings exhibit roughly the same large-scale structure. The gray contours represent the density of the embedded jets. Example jets are sprinkled throughout the space and color coded by their jet mass fractile (i.e. fraction of events with smaller jet mass than the color coded value).
For the CMS Open Data in Fig. 12a, the t-SNE embedding exhibits a dominant cluster of jets with typically low jet mass, with a long slope extending out to typically higher jet masses. The most exotic jets are furthest away from the dominant cluster. The t-SNE embeddings of the MC samples in Figs. 12b and 12c are qualitatively similar, though the specific density distributions differ. Using smaller jet samples, we find that the variability between the data and MC t-SNE embeddings is comparable to the variability when running t-SNE multiple times on the same sample. No obvious anomalies in the CMS Open Data appear visually, though we return to anomalous jet configurations in Sec. IV F.

D. Correlation Dimension
To gain more quantitative insight into the space of jets, we can use the EMD to compute its dimensionality.
While a variety of definitions exist for intrinsic dimension, we use the correlation dimension [156,157], which is a type of fractal dimension and was the measure used in Ref. [56]. From a matrix of pairwise EMDs between jets, the correlation dimension is defined as:
$$\dim(Q) = Q \frac{\partial}{\partial Q} \ln \sum_{1 \leq k < \ell \leq N} \Theta\big[\mathrm{EMD}(J_k, J_\ell) < Q\big]. \tag{9}$$
Here, N is the total number of jets in the sample and the Heaviside theta function indicates whether the jet k is within an EMD Q of jet ℓ. [74].) We also perform the same jet rotation as in Sec. IV C.
After the rescaling in Eq. (8), the maximum possible EMD between two jets is 400 GeV, so dim(Q) is zero for Q > 400 GeV. Because we cluster jets with the anti-k T algorithm, though, the jet configurations that could in principle lead to this maximum EMD value are not present in our samples. For example, consider two jets of equal scalar sum p T : one consists of a single particle; the second consists of two particles, each with transverse momentum p T /2, separated by ∆R. The EMD between these configurations is (1/2) p T ∆R. Within a jet region of size R, ∆R could in principle be as large as 2R (i.e. EMD as large as p T ), but anti-k T would split the second jet in two unless ∆R < R (i.e. EMD of p T /2). In practice, we find that dim(Q) indeed goes to zero around Q ≃ 200 GeV.
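Given a matrix of pairwise distances, the correlation dimension is the logarithmic derivative of the number of pairs closer than Q. A minimal numerical sketch is shown below; it uses a finite-difference derivative on a grid of Q values and is tested on points scattered uniformly in a unit square with Euclidean distances, a stand-in for a precomputed EMD matrix. The function name `correlation_dimension` and the toy sample are assumptions for illustration.

```python
import numpy as np

def correlation_dimension(dists, Q):
    """Finite-difference estimate of d ln C(Q) / d ln Q from a distance matrix."""
    upper = np.triu_indices_from(dists, k=1)          # each pair counted once
    pair_dists = np.sort(dists[upper])
    # C(Q): number of pairs with distance strictly below each Q value.
    counts = np.searchsorted(pair_dists, Q, side="left")
    with np.errstate(divide="ignore"):
        log_counts = np.log(counts)                   # -inf where no pairs are found
    return np.gradient(log_counts, np.log(Q))

# Toy check: uniform points in a square should give a dimension close to 2
# at scales well below the size of the square.
rng = np.random.default_rng(0)
points = rng.random((1000, 2))
diff = points[:, None, :] - points[None, :, :]
dists = np.sqrt((diff**2).sum(axis=-1))
Q = np.geomspace(0.02, 0.1, 10)
dims = correlation_dimension(dists, Q)
print(dims.round(2))
```

The Q grid should be chosen so that every value has at least a handful of pairs below it; otherwise the logarithm diverges and the estimate becomes dominated by statistical noise.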
In Fig. 13a, we compare the correlation dimension between the CMS Open Data and the MC samples, again with CHS and tracks only with p PFC T > 1 GeV. The agreement between the open data and the MC sample at simulation-level is very good, though the correlation dimension is roughly 0.5 above the generation-level curve for much of the plotted Q range. Naively, one might think that detector effects would decrease the correlation dimension, since finite granularity effects decrease the relative complexity of jet configurations. Instead, the added half dimension suggests that the detector has more of a smearing effect, analogous to the way that smearing a zero-dimensional point generates a higher-dimensional manifold.
The fact that the correlation dimension in Fig. 13 increases logarithmically with decreasing Q is expected from first principles QCD. The number of jet constituents scales up logarithmically with decreasing energy scale (see e.g. [158,159]), as does the entropy of a jet [160], and both of these quantities are related to the effective dimensionality of the space of QCD jets. We leave a QCD calculation of dim(Q) to future work, noting that the result will depend on the strong coupling constant α s as well as on the relative fraction of quark and gluon jets in the sample.
The correlation dimension gives us an interesting handle to understand the impact of applying cuts on the PFCs, complementary to the studies in Sec. IV B. In the bottom row of Fig. 13, we show dim(Q) for all PFCs and just tracks, as well as the effect of the p PFC T > 1 GeV cut, always with CHS applied. For the CMS Open Data in Fig. 13b and for the simulation-level MC in Fig. 13c, there is relatively little impact on the correlation dimension for Q ≳ 40 GeV. Below this scale, though, the correlation dimension is significantly smaller when restricting to just tracks and/or when imposing p PFC T > 1 GeV. Interestingly, for the generation-level curves in Fig. 13d, there is a much more modest impact from these restrictions. In fact, restricting to charged PFCs can sometimes increase the correlation dimension, since after applying the rescaling in Eq. (8), the charged PFC restriction acts like a kind of smearing. From this we conclude that dim(Q) is a robust measure of dimensionality at high Q, and very sensitive to QCD fragmentation and detector effects at small Q.
FIG. 14. The distance of our jet dataset to a selection of 25 representative jets, shown for (red) jets selected with the k-medoids algorithm as well as (gray) randomly selected jets. The k-medoids are systematically closer to the dataset, demonstrating that jets chosen in this way are significantly more representative than a random selection of jets.

E. The Most Representative Jets
Computing the EMD also allows us to visualize the space of jets in such a way that observable values can be correlated with jet topologies. Specifically, given a set of jets, we can find the k jets {K 1 , · · · , K k } (called medoids) that minimize the sum of the distances of each jet to its closest medoid:
$$V_k = \sum_{i=1}^{N} \min\big\{\mathrm{EMD}(J_i, K_1), \ldots, \mathrm{EMD}(J_i, K_k)\big\}. \tag{10}$$
The value of Eq. (10) provides a quantitative notion of how well approximated the dataset is by the k jets. Inspired by the N -subjettiness observables of Refs. [161,162], this quantity can be thought of as the "k-eventiness" of the dataset. While naively optimizing the choice of the medoids takes O(N^{k+1}) runtime, we use a fast iterative approximation technique from the pyclustering Python package [163]. This k-medoids procedure provides a significantly more representative selection of jets than a random subsample, as quantified by the V k distribution in Fig. 14 for the case of k = 25. Along these lines, one might also consider clustering the full dataset of jets, for instance using iterative reclustering similar to techniques used to cluster particles into jets [112,[164][165][166][167], though we leave further explorations in this direction to future work.
FIG. 15 (caption fragment). ... Fig. 12 and their area is proportional to the number of jets nearest to them. The medoid jets try to "tile" the space in a rigorous sense.
In Fig. 15, we show the 25 most representative jets in the p jet T ∈ [399, 401] GeV subsample from Sec. IV C, arranged according to t-SNE and sized according to the number of closest neighbors. Because these medoids are representative (and not just randomly selected) in that they try to minimize V k , there is a rigorous sense in which understanding the structure of these 25 jets captures the structure of the CMS Open Data jet ensemble as a whole.
If we apply the k-medoid procedure to jets occupying the same histogram bins of a specific observable, we can then visualize how the jet topology changes as observable values change. In Fig. 16, we show histograms for the six substructure observables from Sec. III C, using the CMS Open Data with CHS and only tracks with p PFC T > 1 GeV. In each histogram bin, we show the four most representative jets, as determined by the 4-medoids procedure. For jet mass in Fig. 16a, we see a steady evolution from one-prong topologies to two-prong topologies. The reverse behavior is shown for D 2 in Fig. 16b, with two-prong topologies evolving into one-prong ones. One low-D 2 medoid jet consists of two highly overlapping prongs, distinct from the one-prong high-D 2 configurations, highlighting the Sudakov safety of D 2 [40,134]. For the IRC-unsafe observables of track multiplicity in Fig. 16c and p D T in Fig. 16e, we see evolutions between simple topologies and jets with more complex substructure. For N 95 in Fig. 16d, there is a progression from narrow jets to diffuse jets. Finally, for z g in Fig. 16f, there is an evolution from unbalanced subjets to balanced subjets, with its Sudakov safety apparent from the one-prong configurations throughout. While all of these behaviors can be understood from the definition of these observables, the k-medoids procedure offers an intuitive visualization of the jet configurations that contribute to each observable value.
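To make the medoid selection concrete, the sketch below implements a simple alternating k-medoids heuristic on a precomputed distance matrix and returns the sum of distances of each element to its closest medoid. This is not the pyclustering routine used in the analysis; it is a plain NumPy stand-in with random initialization, and the function name `k_medoids` and its parameters are hypothetical.

```python
import numpy as np

def k_medoids(dists, k, n_iter=100, seed=0):
    """Alternating k-medoids on a precomputed distance matrix; returns (medoids, V_k)."""
    rng = np.random.default_rng(seed)
    n = len(dists)
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assignment step: each element goes to its closest medoid.
        labels = np.argmin(dists[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if len(members):
                # Update step: the member with the smallest total distance
                # to the rest of its cluster becomes the new medoid.
                within = dists[np.ix_(members, members)].sum(axis=0)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    v_k = dists[:, medoids].min(axis=1).sum()
    return medoids, v_k

# Toy example: two well-separated groups of points on a line.
x = np.array([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
toy_dists = np.abs(x[:, None] - x[None, :])
toy_medoids, toy_v = k_medoids(toy_dists, k=2)
print(sorted(x[toy_medoids]), toy_v)
```

Each iteration can only lower or preserve the summed distance, so the result is never worse than the initial random selection, although it may be a local rather than global optimum. Applying the same function to the sub-matrix of jets in one histogram bin gives the per-bin medoids.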

F. Towards Anomaly Detection
As the last application of the EMD in this paper, we present a first step towards using it for anomaly detection. Instead of finding the most representative jets as in Sec. IV E, we can find the least representative jets. As one way to quantify this, we can find the n-th moment of the EMD distribution of one jet to the rest of the dataset,
$$Q_n(I) = \Big( \frac{1}{N} \sum_{k=1}^{N} \mathrm{EMD}(I, J_k)^n \Big)^{1/n}, \tag{11}$$
where we applied the n-th root such that Q n has units of GeV. Small values of Q n indicate a common jet configuration. Large values of Q n indicate a jet which is far from the rest of the dataset, and therefore anomalous. In Fig. 17, we show the distribution of Q n for n = 1 (i.e. mean EMD) along with the four medoids in each histogram bin. As expected from the t-SNE visualization in Fig. 12, the most typical jet configurations have a single hard prong, while the least typical configurations have multi-prong or diffuse topologies. In Fig. 25 of App. C, we show a similar plot for n = 1/2 and n = 2. The most anomalous jets isolated by Q n for n = 1/2, 1, and 2 agree for the six most anomalous jets, with the top three such jets shown in the bottom row of Fig. 17. The most anomalous jets are all highly complex three-prong topologies, hinting at a close relationship between this measure of anomalousness and observables such as N -subjettiness [161,162].
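The anomaly measure Q n is straightforward to evaluate once the pairwise EMD matrix is available. The following sketch takes the n-th root of the mean n-th power of each row of the distance matrix; it assumes the average runs over all N jets including the jet itself (whose self-distance is zero), which only changes the overall normalization slightly. The function name `anomaly_score` is hypothetical.

```python
import numpy as np

def anomaly_score(dists, n=1.0):
    """Q_n for every element: n-th root of the mean n-th power of its distances."""
    dists = np.asarray(dists, dtype=float)
    return np.mean(dists**n, axis=1) ** (1.0 / n)

# Toy example: four nearby points and one far-away point on a line.
x = np.array([0.0, 1.0, 2.0, 3.0, 100.0])
toy_dists = np.abs(x[:, None] - x[None, :])
for n in (0.5, 1.0, 2.0):
    scores = anomaly_score(toy_dists, n)
    print(n, int(np.argmax(scores)))   # index of the least representative point
```

Larger n weights the largest distances more heavily, while smaller n is more sensitive to how isolated a jet is from its nearest neighbors.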
The anomalousness of a jet, quantified by Q n , is nontrivially correlated with the jet mass, which is easily confirmed by observing the medoids in each bin in Fig. 17. While this is expected and understandable from QCD, this correlation can complicate searches for resonant new physics by sculpting the background. To circumvent this correlation in the case of these searches, the EMD-based approach can be combined with mass decorrelation techniques [168][169][170] or with ideas such as CWoLa hunting [171] to look for anomalies within mass bins compared to sidebands.

V. CONCLUSIONS
The CMS Open Data is an exciting resource for performing exploratory studies in collider physics. In this paper, we performed the first ever exploration of the metric space of QCD jets on real collider data, using the EMD [56] as our measure of jet similarity. The EMD provides complementary information to traditional histogram-based analyses, and it also provides new strategies for data visualization in particle physics. In terms of quantitative measures, we showed how to use the EMD to characterize the impact of detector effects and to calculate the intrinsic dimension of a jet ensemble. For qualitative studies, we showed how to use the EMD to identify the most representative jets in a histogram bin and the least representative jets in the ensemble as a whole, where the latter analysis is particularly interesting in the context of anomaly detection for new physics searches [171][172][173][174][175][176][177].
Beyond the specific EMD studies here, a key outcome of this research is a processed and validated jet sample for use in future jet studies consisting of jets in the CMS 2011 Open Data with a p T above 375 GeV. This processed single-jet dataset is available on the Zenodo platform [86][87][88][89][90][91][92][93][94] along with the analysis tools needed to make the bulk of plots in this paper [84,85]. This sample is ready to use out-of-the-box by future users, since JEC factors and JQC are available and easy to apply, and baseline event selection criteria have been chosen to ensure that the Jet300 trigger is fully efficient. Because we apply the same processing pipeline to corresponding simulated MC events, one can assess the impact of detector effects on new jet analysis strategies. While we have not performed detector unfolding or estimation of systematic uncertainties in this exploratory study, our dataset contains sufficient information to implement these important elements, which we leave to future work. As an important stress test of this archival strategy, we plan to perform our next jet physics analysis directly on the released datasets without ever accessing the underlying CMS AOD files.
There are a number of future directions to pursue using the EMD. We focused on a narrow p T range of [375,425] GeV in this paper in order to have a more uniform jet sample, but it would be interesting to perform EMD studies on higher p T jets. This is particularly relevant in the context of the intrinsic dimension; in a preliminary QCD calculation of the correlation dimension as a function of jet p T , we find non-trivial dependence both on Q and on the quark/gluon composition of the sample. One application suggested in Ref. [56] is using EMD for jet classification, and it would be interesting to do a data/simulation classification study in the spirit of Refs. [178,179] to identify regions of phase space that are not well modeled by the current generation/simulation tools. In this study, we focused on applying the EMD to individual jets, but it could also be applied to events as a whole, which would be a novel strategy to explore the MinimumBias Primary Dataset. It would also be interesting to explore alternative EMD definitions that incorporate PID information.
Finally, we applaud the commitment shown by the CMS experiment to releasing research-grade public data. The inclusion of simulated datasets in the 2011 release was essential for us to gain confidence in the robustness of track-based observables for jet substructure studies. Even without the actual data files, the simulated datasets are a valuable resource for phenomenological studies, since they cover a wide range of final states with fully realistic detector information. As CMS continues to release research-grade data, we hope that more researchers take advantage of this unique resource for particle physics.

ACKNOWLEDGMENTS
We thank CERN, the CMS collaboration, and the CMS Data Preservation and Open Access (DPOA) team for making research-grade collider data available to the public. We specifically thank Edgar Carrera, Kati Lassila-Perini, and Tibor Simko for help processing the CMS Open Data, and Salvatore Rappoccio for help implementing CHS. We thank Maximilian Henderson, Edward Hirst, and Ziqi Zhou for collaboration in the early stages of this work. We benefitted from additional feed-
As mentioned in Sec. II C, there are 89 valid LBs tabulated in Ref. [110] that do not appear anywhere in the Jet Primary Dataset [66]. There is of course the possibility that we made a mistake in processing the data, though we verified that MODProducer recovers the total number of events (both valid and not) quoted in Ref. [66]. Also, the missing LBs do not appear to represent a missing AOD file, which was an issue that had to be resolved for Ref. [38]. In particular, the missing LBs do not appear to be linked in time, whereas a given AOD file typically has consecutive sequences of LBs. Moreover, there are strange characteristics of the missing LBs that suggest that there might be more systematic issues at play.
We can classify the missing LBs into two main categories:
1. Near zero luminosity. For 17 missing LBs, the recorded luminosity was less than 0.03 µb −1 . It is plausible that none of the jet triggers fired during these LBs, in which case they should count (negligibly) toward the integrated luminosity of the run.
2. Large delivered/recorded discrepancy. For 71 missing LBs, the recorded luminosity was at least an order of magnitude smaller than the delivered luminosity. It is plausible that these LBs should not have been classified as valid, in which case it is consistent to ignore them.
Curiously, there was one missing LB where the discrepancy between the delivered and recorded luminosities was only 2.3%. This is consistent with the typical delivered/recorded mismatch for the valid LBs in the Jet Primary Dataset, which is around 3%.
Another issue raised in Sec. II C is that there are 201 valid LBs present in Ref. [110] which have zero recorded luminosity. The 164 such LBs in Run A can be categorized as follows:
1. Exactly zero delivered luminosity. For 3 zeroed LBs, the delivered luminosity was also zero. Of these, 1 LB contained 0 events; the other 2 contained a total of 3 events that were triggered in the Jet Primary Dataset.
2. Near zero delivered luminosity. For 20 zeroed LBs, the delivered luminosity was less than 0.05 µb −1 , so it is expected that the recorded luminosity could be zero. Of these, 11 LBs contained 0 events; the other 9 contained a total of 23 events that were triggered on in the Jet Primary Dataset, so we can safely ignore these as well.
3. Sizable delivered luminosity. For 141 zeroed LBs, the delivered luminosity was greater than 2.7 nb −1 , so one expects at least one of the Jet triggers to have fired. Of these, 9 LBs contained 0 events; the other 132 contained a total of 20,850 events, even though the recorded luminosity was zero. Most likely, these were misclassified as valid LBs.
Tallying these together, there are 21 zeroed LBs that have zero events, which are already counted as missing LBs above. The remaining 143 zeroed LBs have a total of 20,876 events, which is the number listed in Table I. Following the recommendation of CMS, we omit all of the zeroed LBs from our analysis. While these missing and zeroed LBs do not affect the conclusions of our physics studies, they do highlight the importance of stress-testing archival data strategies to make sure that there is validated information available to future generations of collider enthusiasts [180].
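The bookkeeping in this appendix can be cross-checked with a few lines of arithmetic. The snippet below simply re-adds the category counts quoted in the text; the dictionary keys are descriptive labels chosen here and do not correspond to any field in the released files.

```python
# Missing LBs by category, as quoted in the text.
missing = {
    "near zero luminosity": 17,
    "large delivered/recorded discrepancy": 71,
    "2.3% discrepancy": 1,
}

# Zeroed LBs in Run A: (number of LBs, LBs with zero events, total events).
zeroed = {
    "exactly zero delivered": (3, 1, 3),
    "near zero delivered": (20, 11, 23),
    "sizable delivered": (141, 9, 20850),
}

n_missing = sum(missing.values())
n_zeroed = sum(n_lb for n_lb, _, _ in zeroed.values())
n_zeroed_empty = sum(n_empty for _, n_empty, _ in zeroed.values())
n_zeroed_events = sum(n_events for _, _, n_events in zeroed.values())

print(n_missing)                    # 89
print(n_zeroed)                     # 164
print(n_zeroed_empty)               # 21
print(n_zeroed - n_zeroed_empty)    # 143
print(n_zeroed_events)              # 20876
```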
For completeness, in Fig. 18, we plot the total delivered and recorded luminosities for Run 2011A as a function of date, along with the effective luminosity for the Jet300 trigger. Note that the loss of luminosity due to the late turn-on of the Jet300 trigger has a negligible effect on our analyses.

Appendix B: Aspects of Pileup
The CMS simulated MC samples include the effect of pileup, but the number of overlapping events differs from what is observed in the CMS 2011 Open Data. To correct for this, we reweight the MC events to match the observed number of primary vertices (N PV ). Note that a larger number of primary vertices is associated with a larger amount of pileup contamination.
The effect of this reweighting is shown in Fig. 19a, where we plot the number of primary vertices associated with each event in the CMS Open Data compared to the MC simulation, both before and after reweighting. The reweighting factor is derived from all "medium" quality jets with p jet T > 375 GeV and |η jet | < 1.9, though the plot only shows the p jet T ∈ [375, 425] GeV range. As a cross check, in Fig. 19b, we plot the number of primary vertices with at least one track associated with the reconstructed jet of interest. From this, we conclude that the event-wide reweighting does indeed correct the in-jet pileup contamination level.
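A reweighting of this kind can be sketched as a per-bin ratio of normalized N PV distributions. The example below assumes unweighted input samples and integer N PV values per event; in a real analysis the MC events would already carry generator weights, which would multiply the factors computed here. The function name `npv_weights` is hypothetical.

```python
import numpy as np

def npv_weights(npv_data, npv_mc):
    """Per-event MC weights that match the MC N_PV distribution to the data."""
    npv_data = np.asarray(npv_data, dtype=int)
    npv_mc = np.asarray(npv_mc, dtype=int)
    n_bins = int(max(npv_data.max(), npv_mc.max())) + 1
    # Normalized N_PV distributions, one bin per integer value.
    f_data = np.bincount(npv_data, minlength=n_bins) / len(npv_data)
    f_mc = np.bincount(npv_mc, minlength=n_bins) / len(npv_mc)
    # Ratio data/MC, set to zero where the MC has no events.
    ratio = np.divide(f_data, f_mc, out=np.zeros_like(f_data), where=f_mc > 0)
    return ratio[npv_mc]

# Toy samples: after reweighting, the MC histogram matches the data histogram.
toy_data = [1, 1, 2, 3, 3, 3]
toy_mc = [1, 2, 2, 2, 3, 3]
w = npv_weights(toy_data, toy_mc)
print(np.bincount(toy_mc, weights=w, minlength=4))
```

N PV values that occur in the data but not in the MC cannot be recovered by any weight, so the MC sample needs to cover the full observed pileup range.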
We can quantify the performance of CHS for pileup mitigation by performing an EMD analysis analogous to Sec. IV B. In Fig. 20, we show the generation-to-simulation EMD before and after CHS is applied, split into low (N PV < 5), medium (N PV ∈ [5, 10]), and high (N PV > 10) levels of pileup contamination. First, we see that the EMD grows (i.e. reconstruction degrades) as the pileup levels increase, though for these modest levels of pileup, the distortions are not so large. As already shown in Fig. 11, CHS does mitigate the impact of pileup, with better performance when considering just tracks.
One surprise in Fig. 20d is that the track-only EMD gets smaller as the pileup contamination increases. We are not sure of the origin of this behavior. It might be related to the use of the rescaling factors in Eq. (8), or it might indicate a bias where low N PV events often have unreconstructed primary vertices, so CHS does not remove tracks that it should. Regardless, we see that the EMD is a useful way to quantify the performance of pileup mitigation strategies.

Appendix C: Additional Plots
In this appendix, we provide additional plots to complement the ones in the text.
In Fig. 21, we plot the turn-on behavior for all of the relevant single-jet triggers, to compare to the Jet300 study in Fig. 3. In making this plot, we have to address the fact that some of the triggers share the same L1 trigger seed and their firing rates are therefore correlated. For uncorrelated triggers, if trigger A has prescale factor $p^{\rm trig}_A$ and trigger B has prescale factor $p^{\rm trig}_B$ and both triggers are fully efficient, then the probability of B firing given that A fired is:

$$\Pr(B \mid A) = \frac{1}{p^{\rm trig}_B},$$

which is independent of $p^{\rm trig}_A$ since the triggers are uncorrelated. On the other hand, if two triggers have the same L1 seed, then the probability of B firing given that A fired is: [...] HLT Jet370 triggers, which are all seeded by the same L1 SingleJet92 trigger [66].

In Fig. 22, we plot the azimuthal angle ($\phi$) distribution for the two hardest jets. As expected, we observe a flat spectrum in both the CMS Open Data and the MC simulation, though the bin-to-bin fluctuations in the open data are larger than one would expect from statistics alone, possibly indicating an issue with the lack of $\phi$-dependence of the JECs.
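One simple way to check whether bin-to-bin fluctuations in an azimuthal distribution exceed the statistical expectation is a chi-square test against a flat spectrum. The sketch below is an illustration and not the procedure used for Fig. 22. It assumes unweighted jets with azimuthal angles in [-π, π] and uses the expected count per bin as the variance; the function name and the toy sample are hypothetical.

```python
import math
import random

def flatness_chi2(phis, n_bins=20):
    """Chi-square per degree of freedom of a phi histogram against a flat spectrum.

    Assumes unweighted entries with phi in [-pi, pi].
    """
    counts = [0] * n_bins
    for phi in phis:
        # Map phi to a bin index; phi == pi is clamped into the last bin.
        idx = int((phi + math.pi) / (2 * math.pi) * n_bins)
        counts[min(idx, n_bins - 1)] += 1
    expected = len(phis) / n_bins
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    # One degree of freedom is lost because the total count is fixed.
    return chi2 / (n_bins - 1), counts

# Toy sample drawn from a flat distribution (hypothetical, fixed seed).
rng = random.Random(0)
toy_phis = [rng.uniform(-math.pi, math.pi) for _ in range(20000)]
chi2_ndf, counts = flatness_chi2(toy_phis)
```

For a truly flat spectrum, the returned chi-square per degree of freedom should be close to one, while values well above one would suggest fluctuations beyond what statistics alone can explain.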
In Fig. 23, we plot the complete PFC $p_T$ spectra for both neutral and charged constituents, going beyond the limited range shown in Fig. 6. This highlights the tighter relationship between generation-level and simulation-level information when using charged particles alone. Though not shown, we used this plot when deciding to impose the medium JQC, since with only the loose JQC, there was an excess of events with high-$p_T$ neutral PFCs, most likely from photon-plus-jet events.
We now use EMD to study the impact of the $p_T^{\rm PFC}$ cut in our analysis. In the top row of Fig. 24, we do an apples-to-apples comparison with the same particle selection at generation level and simulation level. As the $p_T$ cut on the PFCs gets more aggressive, the generation-to-simulation EMD decreases, indicating better agreement. Of course, this $p_T^{\rm PFC}$ cut removes information about jet substructure, so there is a balance between minimizing detector effects and maximizing sensitivity to the underlying radiation pattern.

In the bottom row of Fig. 24, the baseline generation-level jet contains all particles ("raw"), regardless of what selections are made at simulation level. When using all PFCs in Fig. 24c, the EMD decreases (i.e. reconstruction improves) as the $p_T^{\rm PFC}$ cut gets more stringent, up until the 2 GeV point where we start to see degradation. When using just charged PFCs in Fig. 24d, the peak of the EMD distribution shifts to lower values but there is a long tail, and the reconstruction always degrades with increasing $p_T^{\rm PFC}$ cut.

In Fig. 25, we study the most anomalous jets according to $Q_n$ from Eq. (11) for the additional choices of $n = \frac{1}{2}$ and $n = 2$. The results are comparable to the $n = 1$ case shown in Fig. 17, with all three choices of $n$ agreeing on the three most anomalous jets.