Learning to Isolate Muons in Data

We use unlabeled collision data and weakly supervised learning to train models that distinguish prompt muons from non-prompt muons using patterns of low-level particle activity in the vicinity of the muon, and we interpret the models in the space of energy flow polynomials. Particle activity near a muon is a valuable tool for separating prompt muons, those produced in heavy boson decays, from muons produced in the decay of heavy-flavor jets. This high-dimensional information is typically reduced to a single scalar quantity, isolation, but previous work on simulated samples suggests that valuable discriminating information is lost in the reduction. We extend these studies to LHC collisions recorded by the CMS experiment, where true class labels are not available, requiring the use of the invariant mass spectrum to obtain sample-level information. This allows us to employ Classification Without Labels (CWoLa), a weakly supervised learning technique, to train models. Our results confirm that isolation does not describe events as well as the full low-level calorimeter information, and we identify single energy flow polynomials capable of closing the performance gap. These polynomials are not the same ones derived from simulation, highlighting the importance of training directly on data.


I. INTRODUCTION
Data collected in hadronic collisions offer a significant opportunity to precisely test the Standard Model (SM) and to search for physics beyond the SM (BSM). The identification of muons resulting from electroweak boson decays (called 'prompt') is a crucial part of many such studies, as muons are typically well measured and have low rates of background. An important source of background for these events comes from muons produced within jets from decays in flight. This 'non-prompt' background is largest at the lower end of the muon transverse momentum spectrum, which has become important in searches for supersymmetry [1][2][3][4][5][6] as well as for low-mass resonances [7][8][9][10].
Prompt muons tend to have less nearby detector activity than muons from jets, which are found near hadrons from the rest of the jet. The concept of isolation is therefore central to much of the work on discriminating prompt muons from non-prompt backgrounds. A complete description of the isolation requires capturing the high-dimensional data in the vicinity of the muon. In practice, high-dimensional data are challenging to analyze, and isolation is typically reduced to a scalar quantity [11][12][13]. However, in the reduction from a high-dimensional (low-level) representation of the data to a lower-dimensional (high-level) one, information can be lost.
Deep learning with low-level inputs has been demonstrated to exceed the performance of engineered high-level observables on a number of tasks in high energy physics, starting with Refs. [14,15] and now including many studies [16]. In the context of prompt muon identification, deep neural networks were able to outperform classical isolation definitions on simulated data, by as much as 50% in non-prompt background rejection at a prompt muon efficiency of 50% [17]. This was achieved by processing all of the calorimeter cells in the vicinity of the muon, corresponding to roughly 1000 dimensions per event. Significant suppression of non-prompt backgrounds with a deep learning approach has the potential to improve the precision and sensitivity of many measurements and searches involving muons at the Large Hadron Collider (LHC).
However, previous studies were based on simulations with relatively simple detector effects. Hadronic final states are complex and difficult to model, so it is reasonable to be concerned that the performance of a deep-learning-based isolation strategy trained on simulated events may depend on details of the simulation that are not faithful reproductions of collider data. Scale factors derived using standard tag-and-probe methods [18,19] may correct the efficiency, but the performance in data would be suboptimal [20]. Achieving optimal performance in data requires training with data. The limitation is that data are not labeled as prompt or non-prompt, so the supervised machine learning strategies used in previous studies, which require such labels, cannot be applied to data.
We propose to overcome this limitation with weakly supervised learning. In contrast to supervised learning, where every event is labeled with certainty as prompt or non-prompt, weakly supervised learning is trained with noisy labels, which describe the overall composition of the sample but not individual events. Specifically, we use the Classification Without Labels (CWoLa) [21] approach to weak supervision, in which two samples of training events are prepared. One sample is dominated by prompt muons and receives the noisy label of 'signal' (and will be called 'prompt abundant'); the second sample, while still mostly containing prompt muons, has a relatively higher fraction of non-prompt muons and receives the noisy label of 'background' (and will be called 'prompt moderate'). Under mild assumptions, training a standard classifier with these noisy labels converges to the same classifier found in a supervised setting. While weak supervision has been used previously for data analysis [22][23][24][25][26], these studies only used 2-18 inputs. Our goal is to approach the muon isolation problem with weak supervision directly on low-level, high-dimensional (O(100)) inputs. While the inputs are high-dimensional enough to hold a large number of detected objects, this capacity is only necessary for a small number of events, as on average the inputs have O(10) non-zero entries.
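The CWoLa principle can be illustrated with a minimal toy sketch: a linear classifier trained only on noisy mixture labels still learns the same decision direction as a fully supervised classifier. Everything here is illustrative (one-dimensional Gaussian "events", a hand-rolled logistic regression); only the mixture fractions echo those reported later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "prompt" (signal) and "non-prompt" (background) events in one feature.
def sample(n, f_sig):
    n_sig = int(n * f_sig)
    sig = rng.normal(0.0, 1.0, n_sig)      # prompt: low feature values
    bkg = rng.normal(2.0, 1.0, n - n_sig)  # non-prompt: higher values
    return np.concatenate([sig, bkg])

# Two mixed samples with different (unknown to the trainer) prompt fractions,
# mirroring the prompt-abundant and prompt-moderate samples.
x_a = sample(20000, 0.99)  # "prompt abundant" -> noisy label 1
x_b = sample(20000, 0.56)  # "prompt moderate" -> noisy label 0

x = np.concatenate([x_a, x_b])
y = np.concatenate([np.ones(len(x_a)), np.zeros(len(x_b))])

# Logistic regression trained on the noisy labels via plain gradient descent.
w, b = 0.0, 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))
    g = p - y
    w -= 0.1 * np.mean(g * x)
    b -= 0.1 * np.mean(g)
```

Because signal sits at lower feature values, the mixed-sample training drives the learned weight negative, i.e. the classifier ranks true signal above true background exactly as a supervised classifier would.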
Even if proven effective in data, deep networks operating on low-level observables can be opaque. To improve the interpretability and compactness of the network, we follow Ref. [17], bridging the performance gap between the low-level observables and classical isolation variables with a small set of additional high-level observables identified from the decisions of a network operating at the low level. We search for new high-level observables among the Energy Flow Polynomials (EFPs) [27], a set of relatively simple combinations of energies and angles of reconstructed objects within the isolation cone. EFP observables are identified automatically using the Average Decision Ordering (ADO) method [28], which uses the decisions of the low-level network as a guide. While still complex, the resulting EFP is more physically interpretable than the original deep neural network. Interestingly, the first EFP selected through this process was not identified in the previous study as a top candidate for closing the corresponding gap in simulation [17]. This is one more reason why it is essential here to train directly on data.
This paper is organized as follows. Section II introduces the dataset, which is from the CMS experiment [29,30]. Then, Sec. III describes the machine learning strategy. Numerical results are presented in Sec. IV. The paper ends with conclusions and outlook in Sec. V.

II. DATASET
Proton-proton collisions at √s = 8 TeV were recorded in 2012, curated by the CMS Collaboration, and made available through the CERN Open Data Portal [30]. The number of collisions corresponds to 19.5 fb−1. Reconstruction was performed with the Particle Flow (PF) algorithm [12], which integrates calorimeter and tracker information to approximate individual particle four-vectors. The PF algorithm also assigns a particle identification (PID) from one of the following types: muon, charged hadron, neutral hadron, photon, or pileup. For the charged PF objects, the sign of the charge is reconstructed. PF object momenta are represented by their transverse momentum (pT), pseudorapidity (η), and azimuthal angle (φ). We select events with exactly two muons, both with pT ≥ 25 GeV and |η| ≤ 2.1, and with a dimuon invariant mass between 70 and 110 GeV to accommodate the Z boson mass of approximately 91 GeV [31]. Events are separated into two samples with different mixtures of prompt and non-prompt muon events, as required by the CWoLa method. One sample, with a higher fraction of non-prompt muons, consists of all events in which the muons have identical electric charge, as well as events with muon pairs of opposite electric charge but reconstructed invariant mass far from the Z boson mass, below 84 GeV or above 96 GeV. This sample is referred to as the "prompt muon moderate sample." The remaining events, which are almost entirely prompt muons, form the complementary sample and are referred to as the "prompt muon abundant sample." These regions are illustrated in Fig. 1. The opposite-sign sample is almost entirely from Z boson decays and so is peaked at the Z boson mass. The same-sign sample is mostly from decays in flight and has a nearly smooth and steeply falling spectrum.
To ensure that the two samples have similar kinematic distributions, event weights are computed so that the muon pT and η spectra are the same between the prompt abundant and prompt moderate samples.
The unbinned likelihood ratio is estimated using a two-dimensional Kernel Density Estimator with Gaussian kernels. The pre-weighted spectra are displayed in Fig. 2. The pT spectrum is peaked near mZ/2, and the sharp features in the muon histogram are due to detector acceptance effects. We additionally validate the core assumption of CWoLa (see Sec. III), that the (non-)prompt muons look the same in both samples, using samples of simulated muons; see Appendix A. Once events are selected, they are formatted as inputs to the neural networks. The low-level inputs comprise the pT, η, φ, and PID of each constituent within a cone of radius 0.45 around a given muon. We additionally preprocess the low-level inputs by centering on the muon and dividing the momenta by the muon transverse momentum. A visualization of the momentum in the vicinity of the muon, not including the muon itself, is shown for both samples in Fig. 3. We see that the sample means per pixel have distinct distributions, with the more prompt sample being more uniform.
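The KDE-based reweighting described above can be sketched as follows: estimate each sample's (pT, η) density with a Gaussian KDE and weight one sample by the density ratio so its kinematics match the other. The toy spectra (a gamma-distributed pT and a Gaussian η) are illustrative assumptions, not the analysis distributions.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)

# Toy (pT, eta) spectra for two samples with slightly different kinematics.
abundant = np.vstack([rng.gamma(4.0, 10.0, 2000), rng.normal(0.0, 1.0, 2000)])
moderate = np.vstack([rng.gamma(4.5, 10.0, 2000), rng.normal(0.0, 1.1, 2000)])

# Two-dimensional Gaussian KDEs estimate each sample's (pT, eta) density.
kde_abundant = gaussian_kde(abundant)
kde_moderate = gaussian_kde(moderate)

# Per-event weights: the (unbinned) likelihood ratio of the two densities,
# normalized to unit mean so the sample size is preserved.
weights = kde_abundant(moderate) / kde_moderate(moderate)
weights *= len(weights) / weights.sum()
```

After weighting, the moderate sample's kinematic moments move toward those of the abundant sample, which is the goal of the reweighting step.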
Traditional high-level scalar observables are calculated from the low-level data. These observables include the summed pT of non-muon objects in an event, isolation, and EFP observables. We calculate isolation as defined in Eq. 1, where h± and h0 denote charged and neutral hadrons, respectively. This definition quantifies the activity around a muon within a given radius strictly in terms of Particle Flow objects and treats the objects differently according to their Particle Flow ID. The expression is composed of terms that sum over the transverse momenta of the non-muon Particle Flow candidates within the chosen radius, and the result is normalized by the muon momentum. Pileup is mitigated by subtracting half of its sum from the neutral hadron and photon sums, clamping the result of this subtraction at 0. Distributions of the isolation for two choices of cone radius are shown in Fig. 4. We calculate isolation quantities for a set of radii from 0.025 to 0.45 in steps of 0.025. CMS has previously studied isolation at a radius of 0.3 [12], which is included in our generated set.
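The prose definition of the isolation sum can be sketched in code. This is a minimal illustration assuming a simple candidate representation of (pT, ΔR, PID) tuples; the function name and PID strings are hypothetical, not CMS software identifiers.

```python
def pf_isolation(muon_pt, candidates, r0=0.3):
    """Relative PF isolation in a cone of radius r0 around a muon.

    candidates: iterable of (pt, delta_r, pid) for non-muon PF objects,
    with pid one of 'h+-' (charged hadron), 'h0' (neutral hadron),
    'gamma' (photon), or 'pileup'.
    """
    sums = {"h+-": 0.0, "h0": 0.0, "gamma": 0.0, "pileup": 0.0}
    for pt, delta_r, pid in candidates:
        if delta_r < r0:
            sums[pid] += pt
    # Pileup mitigation: subtract half the pileup sum from the neutral
    # hadron and photon sums, clamping the result at zero.
    neutral = max(0.0, sums["h0"] + sums["gamma"] - 0.5 * sums["pileup"])
    # Normalize the total by the muon transverse momentum.
    return (sums["h+-"] + neutral) / muon_pt
```

Scanning r0 from 0.025 to 0.45 in steps of 0.025 with this function yields the set of cone observables used as high-level network inputs.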
While in principle the demonstration of weak supervision as a technique for learning to improve muon isolation beyond cone-based quantities could use simulation instead of data, we have chosen to use collider data for a number of reasons. First, realistic simulation of muon isolation is very challenging for both the prompt and non-prompt categories; see App. A. Second, a demonstration in data can confirm (or refute) the results of earlier studies in simulation, which showed a significant gap between the power of isolation cones and full use of the lower-level data. If such a gap exists in collider data, it would indicate that additional information is available in nature; the interpretation of that gap in terms of EFP observables will provide clues as to the physical processes involved, and the size of the gap can motivate further study in a complete experimental context. For this reason, we also do not estimate systematic uncertainties, which would be required before application to searches and measurements. As a data-driven method, there are no simulation-based uncertainties, but there would be method-closure uncertainties related to the underlying assumptions of CWoLa and sPlots.

III. METHODS
Classification Without Labels (CWoLa) defines a weakly supervised setting which relies on the principle that, given two classes, an optimal classifier may be obtained by training to discriminate between two samples composed of different mixtures of the classes, rather than training directly on two pure class samples. This technique only requires that the two samples have different class mixtures, and these mixtures do not need to be known in order for training to proceed. The essential assumption is that class fraction is the only feature that determines the different properties of the two samples. This means that the spectrum of radiation around the muon for prompt leptons is identical between the prompt muon abundant and the prompt muon moderate samples. Similarly, the probability density for hadrons around the muon for non-prompt leptons should be the same within the prompt muon abundant and the prompt muon moderate samples. We expect this to be the case here, since the invariant mass of the muons and their relative electric charges should be statistically independent of the radiation pattern around the muons given the prompt status. This expectation is validated in simulation in App. A.
While CWoLa does not need class labels to derive a classifier, some class information is required to determine the performance of the method. The only information needed is the proportion of prompt muons in each sample; from this information, it is possible to characterize the full tradeoff between signal efficiency and background rejection. The prompt-muon fraction is measured directly from the data in each sample by modeling the invariant mass distribution as a mixture model with two components: one peaking component of Z bosons which decay to two prompt muons, and a second, non-peaking component. The invariant mass spectrum is fit using a Voigt profile and an exponential function for the respective components. Fitting is done with Scipy v1.7.3 [32] and visually demonstrated in Fig. 5, where the fit is applied to the full dataset, finding an overall prompt fraction of 95.6 ± 0.6%, where the error bar corresponds to the 1σ statistical uncertainty. In the "prompt muon abundant" sample, the prompt fraction is measured to be 98.9%; in the "prompt muon moderate" sample, the prompt fraction is measured to be 56.0%. This is the first application of weak supervision in particle physics where the relative proportions have also been extracted directly from data.
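A Voigt-plus-exponential fit of this kind can be sketched with SciPy. The toy masses, yields, and resolution below are invented for illustration (a Voigt-shaped peak is sampled as Gaussian-plus-Cauchy noise around the Z mass); only the fit structure follows the text.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import voigt_profile

rng = np.random.default_rng(2)

# Toy dimuon masses: a Voigt-shaped Z peak over an exponential background.
n_sig, n_bkg = 19000, 2000
m_sig = 91.2 + rng.normal(0.0, 1.0, n_sig) + 1.2 * rng.standard_cauchy(n_sig)
m_bkg = 70.0 + rng.exponential(15.0, n_bkg)
masses = np.concatenate([m_sig, m_bkg])
masses = masses[(masses > 70.0) & (masses < 110.0)]

edges = np.linspace(70.0, 110.0, 81)
counts, _ = np.histogram(masses, bins=edges)
centers = 0.5 * (edges[:-1] + edges[1:])

# Binned model: Voigt peak plus exponential, each with a free normalization.
def model(m, a_sig, mz, sigma, gamma, a_bkg, k):
    return (a_sig * voigt_profile(m - mz, sigma, gamma)
            + a_bkg * np.exp(-k * (m - 70.0)))

p0 = [counts.sum() * 0.5, 91.0, 1.0, 1.0, float(counts[0]), 0.05]
bounds = ([0.0, 85.0, 0.01, 0.01, 0.0, 0.0],
          [np.inf, 95.0, 10.0, 10.0, np.inf, 1.0])
popt, pcov = curve_fit(model, centers, counts, p0=p0, bounds=bounds,
                       maxfev=20000)

# Prompt fraction: fitted signal yield over total yield in the window.
sig = popt[0] * voigt_profile(centers - popt[1], popt[2], popt[3])
bkg = popt[4] * np.exp(-popt[5] * (centers - 70.0))
f_prompt = sig.sum() / (sig.sum() + bkg.sum())
```

The fitted parameter covariance `pcov` is what the analysis samples from when propagating the fit's statistical uncertainty to the performance metrics.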
Characterizing the network performance is non-trivial without pure samples. To measure the efficiency of a varying network threshold in the prompt and non-prompt samples, one could fit the distribution of the invariant mass of events surpassing each threshold. Measurement of the efficiencies of each class allows calculation of performance metrics, such as the standard Receiver Operating Characteristic (ROC) and its associated statistics. However, fits are expensive and stochastic. Fitting the mass spectrum for each threshold can be avoided using the sPlots technique [33], which can decompose the prompt and non-prompt contributions to distributions of the network output given weights from the single invariant mass fit to the full sample. sPlots assumes that the variable being weighted is statistically independent of the invariant mass within the individual classes. This is approximately true for our discriminating variables, such as model outputs, and so the method can be applied. Once the variable has been separated into its components, the resulting histograms may be integrated to calculate true and false positive rates and construct a ROC curve. Performance is evaluated through the Area Under the Curve (AUC) and the signal efficiency at 50% background efficiency. While we do not perform a full determination of the uncertainty, we do consider statistical sources of uncertainty from the training and from the fit (these are the only sources of uncertainty quantified in Table I; other sources are present, such as a bias due to imperfect description of the mass distribution by the fit function). While not an uncertainty per se [34], the statistical variation from the finite size of the training dataset (the random initialization of the network is also folded into this estimate) gives a sense of the stability and optimality of the result. This effect is estimated using bootstrapping [35] with 100 event ensembles, with a new classifier trained per ensemble. Additionally, we propagate the statistical uncertainty from the fit in each ensemble by sampling 100 times from the fitted parameter covariance matrix. Metrics are recomputed and averaged across each ensemble, and we report the 1σ confidence intervals according to the resulting set of values.
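The final step, turning component-weighted histograms of the network output into a ROC curve, can be sketched as follows. For clarity the per-event weights here are idealized stand-ins for the sWeights a real invariant-mass fit would provide, and the score distributions are toy Gaussians.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy network outputs with per-event component weights (in a real analysis
# these weights would be the sWeights extracted from the mass fit).
n = 20000
is_sig = rng.random(n) < 0.7
score = np.where(is_sig, rng.normal(0.7, 0.2, n), rng.normal(0.3, 0.2, n))
w_sig = is_sig.astype(float)  # stand-in for signal sWeights
w_bkg = 1.0 - w_sig           # stand-in for background sWeights

# Weighted histograms of the network output, one per component.
edges = np.linspace(score.min(), score.max(), 101)
h_sig, _ = np.histogram(score, bins=edges, weights=w_sig)
h_bkg, _ = np.histogram(score, bins=edges, weights=w_bkg)

# Integrate from the high-score end to get efficiencies, then the ROC.
tpr = np.cumsum(h_sig[::-1]) / h_sig.sum()
fpr = np.cumsum(h_bkg[::-1]) / h_bkg.sum()
auc = np.sum(np.diff(fpr) * 0.5 * (tpr[1:] + tpr[:-1]))  # trapezoid rule

# Signal efficiency at 50% background efficiency.
eff_at_half = np.interp(0.5, fpr, tpr)
```

Repeating this over bootstrapped ensembles, with sWeights re-drawn from the fit covariance, yields the quoted confidence intervals on AUC and the fixed-background-efficiency metric.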
We consider two types of neural networks: high-level networks with an increasing list of engineered observables (such as isolation) and low-level networks that process the full muon image. For the high-level networks, one of our goals is to determine the minimal set of isolation observables that saturates the performance. To do this, we start by training a network using the single isolation cone corresponding to the largest radius in our set and subsequently train networks with incrementally smaller cones included as inputs. The summed event pT is included as an input in all of these sets, in order to be sensitive to overall normalization effects. The low-level networks take the full, high-dimensional representations of the events as inputs. We use the deep sets architecture [36], implemented as Particle Flow Networks (PFNs) [37], to process these data. This architecture was chosen because the inputs are a permutation-invariant, variable-length set of four-vectors, and so a point-cloud model is the natural choice for processing them. Deep sets models are composed of two fully connected networks. The first network embeds each particle flow object (represented by (pT, η, φ, PID)) into a latent space. The second network processes the sum of these latent-space vectors across all inputs. The sum operation is permutation invariant and can readily process variable-length inputs.
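The deep sets forward pass described above can be sketched in a few lines of NumPy: a per-particle embedding network, a permutation-invariant sum pooling, and an event-level network. Random weights stand in for trained parameters, and the layer sizes are illustrative, not those of the trained PFN.

```python
import numpy as np

rng = np.random.default_rng(4)

def relu(x):
    return np.maximum(x, 0.0)

# Illustrative layer sizes; random weights stand in for trained parameters.
d_in, d_latent, d_hidden = 4, 16, 8
W_embed = rng.normal(0.0, 0.5, (d_in, d_latent))    # per-particle network
W_event = rng.normal(0.0, 0.5, (d_latent, d_hidden))  # event-level network
w_out = rng.normal(0.0, 0.5, d_hidden)

def pfn_forward(particles):
    """particles: array of shape (n_particles, 4) holding (pT, eta, phi, PID)."""
    latent = relu(particles @ W_embed)  # embed each particle independently
    pooled = latent.sum(axis=0)         # permutation-invariant sum pooling
    hidden = relu(pooled @ W_event)
    return 1.0 / (1.0 + np.exp(-hidden @ w_out))  # scalar classifier score

event = rng.normal(0.0, 1.0, (7, 4))  # a variable-length set of PF objects
score = pfn_forward(event)
```

Because the only interaction between particles is the sum, shuffling the particle list leaves the output unchanged, and events with different numbers of constituents are handled without padding.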
Additionally, we strive to close the gap in performance between low- and high-level networks using relatively simple variables. Energy Flow Polynomials (EFPs) [27] serve as a set of candidate variables for this purpose. EFPs are a set of parameterized functions which sum over objects within an event, where each term is weighted using the angular relations between those objects. EFPs can be represented using graphs, where each node corresponds to a sum over the objects' momentum fractions and each edge to a power of the angular distance between the objects at its endpoints. When κ = 1, the EFPs form a basis for Infrared- and Collinear (IRC)-safe observables [27]. We compute a set of EFPs which contains IRC-safe, as well as unsafe, information, using the same (κ, β) parameterizations as in Ref. [17], for graphs with up to n = 7 nodes and up to d = 7 edges.
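The EFP construction can be made concrete for the simplest non-trivial graph, a single edge between two nodes: sum z_a^κ z_b^κ θ_ab^β over all object pairs, with θ_ab the (η, φ) angular distance. This sketch is for illustration only; the analysis uses the full set of graphs up to n = 7 nodes and d = 7 edges.

```python
import numpy as np

def efp_line(z, eta, phi, kappa=1.0, beta=0.25):
    """EFP for the single-edge (two-node) graph:
       sum over a, b of z_a^kappa * z_b^kappa * theta_ab^beta,
    where theta_ab = sqrt(d_eta^2 + d_phi^2)."""
    deta = eta[:, None] - eta[None, :]
    # Wrap phi differences into [-pi, pi] before forming the distance.
    dphi = np.angle(np.exp(1j * (phi[:, None] - phi[None, :])))
    theta = np.sqrt(deta**2 + dphi**2)
    zk = z**kappa
    return np.sum(zk[:, None] * zk[None, :] * theta**beta)
```

Larger graphs simply add one z^κ factor per node and one θ^β factor per edge inside the multi-index sum, which is how the four-node observable of Eq. 6 is built.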
We use the Average Decision Ordering (ADO) [28] metric to determine which EFPs from this generated set might bridge the performance gap to the PFN. ADO compares two classifiers on signal and background input pairs, measuring how often the classifiers rank the inputs in the same way. This is quantified with a Heaviside step function on many different pairs, and the results are averaged to obtain the ADO. The ADO can be interpreted as the probability that a given pair will be ordered in the same way by the two classifiers. This is intuitively similar to the AUC metric, which measures the probability that a given signal example will be ranked higher than a given background example. While AUC can be seen as comparing a classifier to the truth, the ADO compares two classifiers to one another without regard for correct ordering. To avoid training a large set of new high-level networks, one for each EFP being considered as an additional observable, we follow the strategy of Ref. [28] and search for EFPs which have a high ADO with our PFN on the subset of events where the PFN and the high-level network disagree. In general, this process can be iterated, selecting new observables until the ADO no longer improves.
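The ADO itself reduces to a short computation: over all signal-background pairs, average a Heaviside step applied to the product of the two classifiers' decision differences. The following is a minimal sketch with generic classifier functions, not the analysis implementation.

```python
import numpy as np

def ado(f, g, x_sig, x_bkg):
    """Average Decision Ordering between classifiers f and g.

    Returns the fraction of signal/background pairs that the two
    classifiers order the same way (ties count one half).
    """
    # Decision differences of each classifier over all (signal, background) pairs.
    df = f(x_sig)[:, None] - f(x_bkg)[None, :]
    dg = g(x_sig)[:, None] - g(x_bkg)[None, :]
    # Heaviside of the product is 1 when both order a pair the same way.
    return np.mean(np.heaviside(df * dg, 0.5))
```

An ADO of 1 means the two classifiers order every pair identically, 0.5 means their orderings are unrelated, and 0 means they always disagree; the search targets EFPs with high ADO to the PFN on the disagreement subset.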

IV. RESULTS
The performance of each network is measured through the ROC AUC as well as the signal efficiency at a fixed background efficiency of 50%. Fig. 6 illustrates the effects of including additional isolation cones as network input features. Adding cones tends to increase performance up until nine cones are used, after which there is no clear further gain in AUC. There is a significant performance gap between the network which uses nine cones and the PFN, which respectively yield AUCs of 0.848(1) and 0.874(1), as well as signal efficiencies of 0.939(1) and 0.957(1). This suggests that isolation cones alone do not capture all discriminating information available in the low-level data. This is consistent with previous results on simulation [17], and it is notable that it holds for real collider data.
We use the ADO metric to search among the EFP observables for ways to close the gap to the PFN performance. Note that the EFPs lack the built-in radial symmetry of the isolation cones, and so may contain additional useful information. The networks using EFP features are also provided the nine largest isolation cones and the summed event pT. Remarkably, the ADO search method is able to identify a single IRC-safe EFP which obtains an AUC of 0.871(1) and a signal efficiency of 0.953(1), almost fully closing the gap in AUC to the PFN from 0.026 to 0.003. The graph representation of this EFP, as well as class distributions separated through the sPlots technique, are illustrated in Fig. 7. This EFP corresponds to parameters κ = 1 and β = 0.25, and the full expression is provided in Eq. 6. An additional scan is done over the quadratic EFPs included in our full set of calculated EFPs, as these are simple in structure and therefore more interpretable. This identifies another single EFP with κ = 1 and β = 0.25 which yields performance close to that of the one identified by the first ADO search, with an AUC of 0.870(1) and a signal efficiency of 0.956(1). We further check the performance of sets of EFPs identified as useful by previous work on simulation [17], which selected an IRC-safe set of EFPs, as well as a set not restricted to be safe. The IRC-safe set yields an AUC of 0.868(1) with a signal efficiency of 0.949(1), while the unsafe set yields an AUC of 0.865(1) with a signal efficiency of 0.954(1). While these sets identified in simulation close much of the performance gap, they require more features and are outperformed by the EFPs identified directly on the CMS data, underscoring the importance of training in data.
A full summary of performance across the methods used is presented in Table I and depicted in Fig. 8. Our results indicate that we are able to construct a minimal set of high-level observables which perform comparably to the low-level inputs, allowing for the use of more physically intuitive features and less complex networks without making concessions regarding performance.

V. CONCLUSIONS
On collision data from the LHC, we apply neural networks to the problem of prompt muon discrimination. We investigate how much information is present in high- and low-level representations of the data, finding that the traditionally used scalar isolation does not capture all useful classification information present at the low level. Furthermore, we find that another set of high-level observables, the EFPs, may be used to create a network which performs almost as well as one operating at the low level, while providing the advantage of being less complex and more human-interpretable. In addition to being notable for using real rather than simulated data, this study demonstrates the use of weakly supervised training with CWoLa on low-level features, as well as performance evaluation without access to individual class labels. Future work may include investigating the interpretation of the observables selected here, exploring how much information might be captured by other types of high-level observables, and the generalizability of these results. While our study indicates that additional information is available beyond the use of simple cones, and the identification of a single EFP observable which captures that information allows for simple application and interpretation, further work would be required before implementation within an experimental context. A robust estimate of the systematic uncertainties involved has not been done, which would be necessary to establish the optimal observables. Our result does not replace work by the experimental collaborations, but motivates further study.

FIG. 2: Histograms of muon pT and pseudorapidity η in the two samples with varying fractions of prompt muons, as defined in text and Fig. 1.

FIG. 3: The average image of hadronic activity in the vicinity of an identified muon, in angular coordinates of azimuthal angle φ and pseudorapidity η, for our two training samples, one of which is dominated by prompt muons (top) and a second which has a more moderate mixture of prompt and non-prompt muons (bottom). The muon itself is excluded from these visualizations, but the energies are normalized by that of the muon.

FIG. 4: Histograms of the muon isolation (defined in Eq. 1) for each of our training samples, one of which is dominated by prompt muons, for two choices of the isolation cone radius parameter, R0 = 0.025 (top) and R0 = 0.45 (bottom).

Eq. 6: $\sum_{a,b,c,d=1}^{N} z_a\, z_b\, z_c\, z_d\; \theta_{ab}^{1/4}\, \theta_{ac}^{1/4}\, \theta_{bd}^{1/4}\, \theta_{cd}^{1/4}$

FIG. 6: Isolation network performance as a function of the number of input cones. Performance of the PFN and the best-performing high-level network are shown as benchmarks. The ROC AUC is shown for each model (top), as well as the signal efficiency at a fixed background efficiency (bottom).

FIG. 7: Distribution of the EFP observable identified in the search described in the text. The samples shown are separated by class using the sPlots weighting technique after applying a 50% background efficiency cut according to the outputs of the nine-isolation-cone network. Also shown is the graph representation of the EFP.

FIG. 10: (MG5+Pythia+Delphes) Average event images similar to Fig. 3, but for the simulated dataset and separated into prompt and non-prompt events.
FIG. 5: A visualization of the dimuon invariant mass distribution overlaid with the fit and its prompt and non-prompt components. The shaded regions indicate events which are included in the prompt muon moderate sample. Here we fit the full CMS sample used in the study, finding that it is 95.6 ± 0.6% prompt overall.
FIG. 8: Comparison of the performance of the networks described in Table I, via ROC curves. Shown is background rejection (the inverse of the background efficiency) versus signal efficiency.