Applying Machine Learning Techniques To Intermediate-Length Cascade Decays

,


Introduction
While the Standard Model (SM) of particle physics is extremely successful in describing the known particles and their interactions, it is also known to be an incomplete description of fundamental physics. Some of the best studied extensions of the Standard Model that aim to stabilize the electroweak scale or explain the observed dark matter (DM) relic abundance hint at the existence of new degrees of freedom at roughly the TeV scale. Unfortunately, collider searches for such new particles have yielded only null results until now. Therefore, as the Large Hadron Collider (LHC) is getting ready to start its third run, even if new physics is finally discovered, the signal will in all likelihood either have low statistics, or be difficult to distinguish from backgrounds, or possibly both. This makes it all the more crucial that LHC searches be optimized for maximal efficiency with these challenges in mind. Similarly, post-discovery, the measurement of the properties of the new particles, such as their masses, will be challenging for the same reasons. Among possible final states for new physics at the LHC, supersymmetry (SUSY)like production and decay channels of color-neutral particles, especially with a compressed spectrum for the new particles, offer good examples for the type of signatures mentioned above, as electroweak-only charged particles have low production cross sections, and compressed spectra result in soft momenta in the final state, so backgrounds cannot be reduced by using hard cuts. Our definition of a SUSY-like channel is that new particles can only be produced in pairs due to a Z 2 symmetry, and that each one decays to a lighter new particle plus SM particles, until the lightest new particle is reached, which is collider-stable, a DM candidate and which cannot be detected. Note that this definition applies equally well to scenarios where the new particles have the same spin as their SM partners, such as in the case of extra dimensional models and others. Since there is an invisible particle at the end of any decay chain in such final states, no resonance can be reconstructed from any subset of visible particles.
For short decay chains, very little kinematic information is available event-byevent. Observables such as missing transverse energy (MET), and transverse mass variables such as m T and m T 2 [1][2][3][4][5][6][7][8][9] fully use this available information and provide the best chance for discovery, and for the mass measurement of the unknown particles. As a result, all available information can be extracted from one-dimensional distributions of a small number of kinematic variables. In the other extreme, for sufficiently long decay chains, there is sufficient kinematic information available in the events for the determination of the complete spectrum by algebraic methods . Both of these possibilities have been well studied, and there is little room for improvement. For an extensive review of kinematic variables used in collider phenomenology, we direct the reader to [31].
There are however final states that lie between these extremes, where algebraic methods cannot be used, but there are sufficiently many kinematic observables such that their correlations also contain crucial information that cannot be extracted by only plotting commonly used one-dimensional distributions such as kinematic edges and endpoints. It was shown in ref. [32] that a decay chain that proceeds via three consecutive two-body decays has this property (see figure 1), and subsequent papers [33,34] explored how an analysis based on the full dimensionality of the Lorentz-invariant observables can be used to enhance discovery prospects as well as to improve the precision of mass measurements. In other words, in these event topologies there are features of the phase space distribution that only reveal themselves with a multidimensional study, but not in the standard one-dimensional projections. In particular, it was shown that the variable ∆ 4 , to be reviewed in the next section, is a highly effective observable for measuring not only the mass differences between successive particles in the cascade decay, but also the overall mass scale along the "flat direction", where the locations of kinematic edges and endpoints in each step of the cascade remain fixed. In this paper, we also adopt the same decay chain as the benchmark case for our study. The decay chain that we consider in this paper has been previously studied using end-points in kinematic observables [35]. The effectiveness of a number of kinematic observables for spin determination has also been studied [27,[36][37][38].
Since we are considering SUSY-like collider signatures, the new particles need to be pair produced. However, if the decay chain of figure 1 appears on both sides of the event, then this event topology would in fact contain sufficiently many visible final state particles that algebraic methods could be used. We instead focus on scenarios where the associated production of X-χ is the dominant production channel (or possibly the next-to-dominant channel, after χ-χ production, but for sufficiently heavy χ, this channel is challenging to observe). One can for instance consider a t-channel production diagram, where the mediator in the t-channel couples more strongly to χ than it does to X. This is illustrated in figure 2 for two choices of particle spins, which we will introduce in section 4. All the visible information in the event is then coming from the X-side of the event, and the χ on the other side of the event only affects the total MET vector among the observables.
Recently, advances in machine learning techniques have led to significant improvements in multidimensional data analysis, and in this paper we evaluate the effectiveness of these techniques for the discovery, spin determination and mass measurement in this benchmark decay chain, as a representative of SUSY-like decays that have sufficiently many kinematic observables to warrant a multidimensional analysis, but not enough for a full reconstruction via algebraic techniques. In this study, we first consider how well a deep neural network (DNN) can achieve these goals without human guidance, that is, by taking the final state momenta in the events as its only input. We then proceed to interpret the DNN in terms of human-level (HL) kinematic variables, and we show that an analysis based on the HL variables results in performance comparable to that of the DL inputs. In particular, our results confirm that ∆ 4 is highly correlated with the output of the DNN, in agreement with the conclusions of references [32][33][34].
Supervised machine learning based analyses have been previously applied for the discovery of exotic collider events in [39,40]. In [41], the authors have proposed methods for discovery across a range of model parameters. Applications also include  resolution of combinatorics in complex topologies [42], and parameter inference [43,44].
The visible particles p 1,2,3 can be chosen among a large set of SM final states. In order to keep our analysis as simple as possible, we adopt a benchmark scenario that is as clean as possible in terms of its collider signatures. Once the proof of concept for the methods we present has been established, additional analyses can be performed to adapt these methods to more challenging final states as well. To be specific, we choose p 1 and p 2 to be same-flavor, opposite-charge leptons, such as µ ± -µ ∓ , and we choose p 3 to be a photon. This final state can arise for example if Y is a muon partner, while X, Z and χ are partners to gauge bosons. We emphasize that we are not promoting this to be a particularly plausible scenario of beyond-the-SM physics. It is only meant to provide a relatively clean example of our benchmark event topology, to study the effectiveness of the methods we are proposing. X, Z and χ can all be fermions and Y a boson, or the other way around, and we in fact study both assignments to address not only the questions of discovery and mass measurement, but of spin determination as well.
With these choices, the final state we focus on in this study is µ + µ − γ + MET. Since there is no reason to expect a resonant structure in the µ + µ − system in the signal, we impose a Z-veto in the analysis to eliminate the leading backgrounds. We also veto any significant hadronic activity. With these choices, there are two SM backgrounds contributing to this final state. The first is Z-γ pair production, with the Z decaying to taus, and both taus decaying to muons. The second is triboson production, such as W + W − γ (with both W 's decaying to muons) and ZZ * γ (with the on-shell Z decaying invisibly and the off-shell Z decaying to muons). These are the backgrounds we include in our analysis.
Given the signal and the background described above, a problem that needs to be dealt with early on is the following: Since information about the masses of the new particles and their spins is not known prior to their discovery, we cannot train the neural networks using Monte-Carlo events generated with the correct signal spectrum and the correct spin assignments. As a possible resolution to this problem, we study the effectiveness of 'scanning' over the parameter space using a performance metric that quantifies discovery (S/ √ B in our case). A full scan over the parameter space is computationally very expensive and we will only perform a local scan in order to demonstrate that the true spectrum is at least a local maximum. Scan over local spectra

Event-by-event Analysis
Ensemble-based Analysis Figure 3: Flowchart detailing the overall procedure for discovery, mass gap determination, spin determination and overall mass scale determination.
First, we perform the scan using networks trained solely on the final state momenta, which we will refer to as the Detector Level (DL) variables. We show that over the region of this local scan, the discovery metric is maximized for the networks trained using mass spectra for which the mass differences, or more precisely, the kinematic edges/ endpoints in the three m ij variables have the correct values. Then, we proceed to identify a set of Human Level (HL) variables that match the DNN. In particular, we show that performing the scan using the HL variables rather than the final state momenta results in an improved resolution for the mass gap measurement. As we will see, both the DL and the HL variable-based scans do not perform well in measuring the spins and the overall mass scale. We show that analyses based on ensemble methods using HL kinematic variables can be used to determine the spins and the overall mass scale. Fig 3 shows a schematic representation of our procedure. This paper is organized as follows: We start with a short review of the relevant kinematic observables in section 2, and a review of the relevant machine learning techniques in section 3. Then in section 4, we study the prospects for discovery and mass measurement relying on the final state momenta of the visible particles, and a deep neural network trained on individual events. We then proceed to interpret the DNN in terms of HL variables in section 5, and we perform the second, ensemblebased, stage of the analysis in order to determine the spins of the new particles (in section 6), and their masses (in section 7) much more accurately. We conclude in section 8.

Review of Phase Space and Kinematic Observables
In this section we review important properties of kinematic observables that are used in the analysis of the next few sections. Using the notation of figure 1, we label the four-momenta of the new particles in a given event as p µ X etc, and those of the visible final states as p µ 1 etc. Very generally, the distribution of events in phase space is given by where dΠ P S stands for the phase space volume element, and |M| 2 for the amplitude squared. While the latter contains valuable information about the spins of the particles, and the angular correlations of the final state particles, the information about the spectrum of the new particles and about the boundary of the kinematically available phase space is entirely contained in the phase space factor. As mentioned in the introduction, we consider two possible spin assignments for the new particles. In our analysis of discovery and mass measurement prospects, we consider both possibilities, and we look for commonalities as well as differences in the list of the optimal HL variables in these two cases.
The simplest way to characterize the data is of course through the momentum vectors ⃗ p 1,2,3 of the three visible final state particles, and the of the transverse missing energy (MET). As usual in collider analyses, we express these vectors in terms of p T , the pseudorapidity η (except in the case of the MET) and the azimuthal angle ϕ of the final state particles. Considering boost invariance along the beam direction, differences ∆η and ∆ϕ between any two particles are very useful observables, as is the combination ∆R = ∆η 2 + ∆ϕ 2 . The fully Lorentz-invariant observables include the pair invariant masses m 2 ij = p µ i + p µ j 2 , as well as the total invariant mass of all three visible particles m 2 123 . One less well known Lorentz-invariant observable is ∆ 4 which we define below. This completes the list of kinematic observables that we consider in our paper.
The maximum values of the m 2 ij and m 2 123 variables for this final state topology are well known [2,8,19,35,45,46], and often used for mass (difference) measurements in this event topology. They are given by: It is straightforward to see that these formulas are sensitive to differences of masses, however there exists a flat direction along which the masses can be varied such that the endpoints of all m 2 ij variables remain constant. The kinematic variable ∆ 4 we are about to describe has been shown to be useful for measurements of the mass spectrum along this challenging direction [32][33][34].
For each decaying X particle, the momenta of the final states, p 1,2,3 and χ can be represented as a point in 4-body phase space. While it is not commonly used, there is an elegant description of 4-body phase space [47] that is manifestly Lorentzinvariant. Consider the 4 × 4 matrix Z ij whose elements are given by p i · p j (where we are taking χ to be the fourth particle). Define the functions ∆ i of these momenta as the coefficients in the characteristic polynomial of Z, namely The kinematically allowed region in phase space for X decay can be shown [47] to be defined by the conditions ∆ i > 0 for all i = 1, . . . , 4, with the boundary of the region defined by ∆ 4 = 0 (with ∆ 1,2,3 still positive). As already mentioned, ∆ 4 has been shown to be a powerful observable for analyzing this decay chain, both for discovery and for mass measurement purposes [32-34, 48, 49]. There is one subtlety that needs to be mentioned. Using the definition above, calculating ∆ 4 requires knowledge of all final state momenta, including that of χ, which is invisible. However, if the masses {M X , M Y , M Z , M χ } in the spectrum are assumed known, then all dot products p i · p χ (and therefore ∆ 4 ) can be calculated from only the visible particle momenta, by using the on-shell conditions for the intermediate particles.
The reason that ∆ 4 is such a useful variable is that the volume element of 4-body phase space, expressed in the differentials dm 2 ij , is given by ∆ −1/2 4 , up to a constant and an energy-momentum conserving delta function. As a result, in the Lorentzinvariant coordinates m 2 ij , the distribution of signal events is strongly clustered near ∆ 4 = 0. Background events on the other hand do not arise from X decays, and their ∆ 4 distribution has no similar sharp feature at ∆ 4 = 0. As a result, if the spectrum were somehow known, then an excess of signal events over background can be easily discovered due to the sharp peak in the ∆ 4 distribution near zero. Of course, discovery must precede a knowledge of the spectrum, however ∆ 4 can be used to accomplish both goals simultaneously. In principle, one can scan over the possible spectra {M X , M Y , M Z , M χ }, and look for an excess near ∆ 4 = 0. This feature will be most significant when the correct spectrum is used, so the presence of the excess will both serve as a discovery variable, and as a way to measure the unknown particles masses. Of course, in practice it is computationally prohibitive to perform a scan in all four mass variables. Fortunately, the well-known kinematic edges and endpoints in the m 2 ij distributions already provide good sensitivity for the mass differences, and therefore they can be used first, leaving only the overall mass scale undetermined (parametrized by M χ , say) along the flat direction. Then, one can perform a onedimensional scan over M χ and use ∆ 4 to fix the spectrum completely. Our analysis in the rest of this paper will demonstrate that neural network techniques based on ∆ 4 do indeed result in a high precision measurement of the masses along the flat direction.

Review of Machine Learning Tools
For the results presented here, all neural networks were implemented using the Keras [50] package with the TensorFlow [51] backend. Wherever we implement neural networks in our analysis, we provide details about the number of nodes, layers and activation functions within the respective section.
In the initial phase of our analysis focused on discovery, we implement fully connected deep networks as binary classifiers, the two target classes being the signal and the background. We use the binary cross-entropy as the loss function.
Unlike the first stage of the analysis described in sections 4 and 5, where the neural network analyzes one event at a time and assigns each event an output in the range [0,1] where 0 corresponds to background and 1 corresponds to signal, we will see in sections 6 and 7 that the spin determination and the measurement of the overall mass scale are more challenging, and an ensemble-based analysis is required. In the ensemble-based analysis, the entire data-set is considered as a single input. We also identify the HL variables whose distributions are particularly effective in separating the different hypotheses. Histograms in these variables are then used for the training and evaluation of the networks. Specific details about the histogram binning are provided in the respective sections of the paper.
In our case, post-discovery, the determination of the mass spectrum reduces to determining a single mass value along the flat direction, which we parameterize by M χ . We treat this as a regression problem and we use the Mean Squared Error loss function for training the network.
The Average Decision Ordering (ADO) metric introduced in [52] is very useful for quantifying the correlation between two functions, one of which may be the neural network output and the other an analytic function defined in terms of the inputs. ADO is constructed to quantify the degree to which two functions f and g rank pairs of events belonging to the two classes A and B in the same order. We use the following discrete version of the ADO: where Θ is the Heaviside function.

Analysis Based Solely on Final State Momenta
In this section, we start by studying how well a 'black box' DNN can discriminate between signal and background based on detector level variables alone, by which we mean the momenta of the final state particles, in other words without using any guidance in the form of human level variables. Since the DL variables represent all the available information at the detector, the DNN learns an approximation of the optimal discriminating function, namely the ratio of signal and background distributions. In the next section, we will identify a small combination of HL variables that are sufficient to approximate the optimal discriminator, relying on ADO as a metric. Finally, in sections 6 and 7, we will combine the strength of these HL variables with ensemble-based analysis to tackle the more challenging problems of spin determination and the measurement of the overall mass scale. Figure 3 provides a visual overview of the stages of the analysis. We treat the signal cross section essentially as a free parameter varying in a range consistent with the new particles having electroweak couplings and masses of a few hundred GeV. In order to work on a specific example, we choose the following signal spectrum as a benchmark: This will be denoted as the truth spectrum for the rest of the paper. For the background cross sections, we use the leading order values obtained from Monte Carlo simulations.
For the spin assignments of the new particles, we work with two possibilities, denoting these as the fDM and sDM models, based on whether χ is a fermion or boson (see figure 2). In these two models, the new particles are taken to be: • fDM: X, Z, χ are neutral fermions while Y is a charged scalar.
• sDM: X, Z, χ are neutral scalars while Y is a charged fermion.
We begin by describing the details of Monte Carlo methods we use in our analysis.

Monte Carlo methods and selection cuts
As mentioned in the introduction, the production mechanism of interest to us is pp → Xχ. We take the production to proceed via a heavy t-channel mediator ∆ (which is a scalar/fermion for the fDM/sDM signal model). We take this mediator to couple to up-type quarks. This choice is mostly arbitrary, and motivated by the fact that the constraints on new physics from flavor violation are weaker compared to new particles coupling to down-type quarks. Since the analysis below is based entirely on the decay of X, and we treat the signal cross section as a free parameter, with our results parameterized by this parameter. For the background, we generate pp → µ + µ − γ ν lνl and pp → µ + µ − γ ν lνl ν lνl events, with all possible neutrino flavor combinations. These contain the dibson and triboson processes (including offshell W /Z-bosons) discussed in the introduction. Signal and background events are all generated using MadGraph [53].
The final state can be described by 9 momentum components of the 3 observable particles. We remind the reader that any significant hadronic activity will be vetoed -as a result, the MET does not carry additional information to the momenta of the visible particles. There is a combinatorial ambiguity in the final state since p 1 can be a µ + and p 2 a µ − or the other way around (see figure 1). Therefore it is useful to denote the final state momenta by their charges, namely p + , p − , and p γ .
In Monte Carlo generation, we use relatively loose cuts, demanding only p T,γ > 4GeV to avoid singularities in the matrix element, and we impose the acceptance cut of |η| < 2.5. In order to simulate the detector energy resolution for muons and photons, we use the parameters estimated by ATLAS for the high luminosity run. For muons, these are given by [54] σ(E) and for photons they are given by [55] σ(E) E = 9.84 × 10 −3 ⊕ 9.41 × 10 −2 GeV 1/2 There are two (non-prescaled) ATLAS triggers that are relevant for our final state: a dimuon trigger requiring p T > 14 GeV for both muons, and a µµγ trigger requiring p T > 10 GeV for both muons and p T > 15 GeV for the photon. We include both trigger paths in our event selection. Since our main interest in this paper is in signal spectra with small mass splittings, the invariant mass values take on relatively small values, and a Z-veto can be applied to the µ + -µ − system to eliminate the large backgrounds with on-shell Z's. In order to also avoid the increased background at low invariant mass due to photon conversions, we impose the selection cut After all selection cuts, the background cross section is given by 2.88 fb. We will work with event samples that correspond to the full HL-LHC luminosity of 3000 fb −1 , and we report on the performance of the analysis as a function of the signal strength, quantified by S/B, where S and B are defined as the number of signal and background events after the selection cuts listed above.

Measurement of mass differences
We start our analysis using a DNN trained on DL variables. As mentioned before, the questions of discovery and mass measurement are linked since the DNN needs to be trained without prior knowledge of the spectrum. We will see below that the first phase of our analysis is efficient in measuring the mass differences between the new particles, but not the overall mass scale.
As mentioned in section 1, it is beyond our computational resources to scan over all possible spectra. We settle for a less ambitious goal of at least demonstrating that the correct mass differences present an optimal point in a local scan of a 'testing spectrum' (M 0 ) given by which has the correct mass differences, but is displaced from the truth spectrum (equation 4.1) along the flat direction, which corresponds approximately to the vector (1, 1, 1, 1) in the (M X , M Y , M Z , M χ ) space. We set up the local scan around this testing spectrum, using the following orthogonal basis: where v 4 is taken along the true flat-direction, which deviates slightly from the vector (1,1,1,1). The spectra over which we perform the scans are parametrized as: For each of the spectra M , we generate two sets of Monte-Carlo signal events: one with the spins assigned according to the fDM hypothesis and the other according to the sDM hypothesis. Then, we train an ensemble of DNNs, two for each M , corresponding to the two possible spin assignments, to distinguish the respective signal from the background. Based on the events used to train the network, we refer to these networks as the fDM network and the sDM network respectively.
Each DNN is made up of 3 hidden layers containing 200 hidden nodes. Nodes in the intermediate layers are assigned ReLu activation functions, and the output node is assigned the sigmoid activation function. The input layer contains 9 nodes corresponding to the 3 × 3 = 9 observable momentum components. Additionally, we implement an early stopping monitor, which has a patience of 20 epochs on the validation set. Each network is trained using 1M events each for the signal and the background.
Once the DNNs are trained, we study their output on the benchmark sample (signal at the truth spectrum plus background). In order to infer the dependence of the performance on the signal cross-section, we list our results below for three values of S/B, namely 1.0, 0.1 and 0.01.
In order to provide context for these values of S/B, figure 4 shows how S/B depends on the mass of the mediator (M ∆ ) in the sDM and fDM spin assignments, with the couplings at all vertices of figure 2 taken to have the numerical value 1.
We quantify the performance by ϵ S / √ ϵ B , defined as the enhancement in S/ √ B after applying a cut on the output of the network, compared to the selection cuts of section 4.1. The choice of ϵ S / √ ϵ B (as opposed to ϵ S /ϵ B ) as the performance metric is motivated by the expectation that statistical uncertainties will dominate over systematic ones. In figure 5, we plot ϵ S / √ ϵ B for S/B = 1.0. It can be seen that the performance continues to improve towards stronger cuts on the network output. Here we only plot the metric with the correct spin assignment for simplicity.  : For the local mass scan around the testing spectrum with S/B = 1.0, the parameters α, β and γ favor the correct mass differences in the spectrum, and stronger cuts give rise to better performance. In contrast, the scan along the flat direction shows no preference for the true value of the overall mass scale. The discovery cut we impose corresponds to a background rejection of 0.995. In each panel, the solid green contour corresponds to the true value of the parameter.
However, the trend remains the same even if the incorrect spin assignment is used. In order to preserve sufficient signal statistics for the later stages of the analysis, in each network, we apply a cut on the network output that results in ϵ B = 0.005 (this tends to correspond to ϵ S ≈ 0.5 for the best performing networks in our analysis). Hereafter, this will be referred to as the "discovery cut". The value of the discovery cut is inferred from Monte Carlo samples at each mass point. With the discovery cut, the minimum value of S/B such that the signal statistical significance is boosted above 5σ with the discovery cut is (S/B) min ∼ 6.7 × 10 −3 . We will therefore focus on the range S/B > 0.01 for the rest of our analysis.
We perform one-dimensional scans using the parametrization of equation 4.6. The winning mass hypothesis is chosen to be the one that results in the highest ϵ S / √ ϵ B on the benchmark sample, with the discovery cut, setting ϵ B to 0.005. To account for statistical uncertainties, we run 50 iterations of the scan. From the example scan in figure 5, it can be seen that this procedure results in an accurate measurement of the parameters α, β and γ, thereby fixing the mass-gaps. In contrast, we find that the ϵ S / √ ϵ B metric (at a fixed background rejection) fluctuates randomly as δ is varied, and therefore the value of δ maximizing the metric is uncorrelated with the true value.
To most efficiently utilize our computing resources, we limit the local mass scan to the following range around the testing spectrum: α, β, γ ∈ (−20, 20) GeV and δ ∈ (−500, 500) GeV, with spacings of ∆α = ∆β = ∆γ = 0.5 GeV and ∆δ = 20 GeV. Note that with these choices, the mass hierarchy M X > M Y > M Z > M χ is automatically preserved.  Table 1: The average values and uncertainties of the scan parameters obtained over 50 scan iterations along each direction (all numbers in GeV). Table 1 summarizes the results of the scans (averaged over the scan iterations). As expected, we find that the mass differences are determined to a precision of a few GeV by this procedure, while the overall mass scale is left essentially unconstrained. Note that for high values of S/B there is a bias in the results for α, β and γ. As we will see in the next section, when HL variables are used, these biases are eliminated.

Identification of Optimized Variables
After the DNN analysis based on the DL inputs, our next goal is to search for HL variables which match the performance of this analysis. We employ the ADO-guided method prescribed in ref. [52] for identifying the set of HL variables that have the highest correlation with respect to the DL network. The definition of ADO was given in section 3. For our purposes here, the ADO is calculated from the fraction of pairs of events (one drawn from the signal generated at the testing spectrum and the other drawn from the background) which are ranked in the same order by both the deep network operating on the DL inputs and a given set of HL variables.
To summarize the method presented in ref. [52], one picks the HL variable with the highest ADO with respect to the DL network (say f 1 ) in the first iteration. In the next iteration, only the pairs of events for which f 1 and the DL network result in dissimilar orderings are considered. The second HL variable f 2 is then picked to be the one with the highest ADO over these subset of pairs and so on. We terminate the process when the additional HL variable does not lead to a significant improvement in the AUC.
As we described in section 2, we will consider the following list of HL variables: • The transverse momenta of the three visible final state particles : (p T + , p T − , p T γ ) • The transverse missing energy : MET • The invariant masses of the three visible final state particle pairs : (m +− , m +γ , m −γ ) •     Table 3: Triplets of HL variables with and without ∆ 4 having the highest AUCs over the benchmark data-set. For simplicity, we only list the numbers for when the training data set uses the correct spin model. AUC of the DNN based on DL inputs. We also illustrate this visually in figure 6, showing the performance of the HL variables (MET, ∆ 4 , m +−γ ) (in green) and the DL-based neural network (in red). For comparison, we also show the leading HL triplet that does not include ∆ 4 (in blue). Our results confirm the power of ∆ 4 as a kinematic variable in analyzing this decay chain, in accordance with prior studies. In figure 7, we show the distributions of the HL variables for the two signal models and for the background.
Next, we turn our attention to studying the performance of the HL variables for determining the spectrum, namely fixing the parameters α, β, γ, δ.    training samples and network parameters, it is hard for the DL network to converge towards the global minimum of the loss function. Within the limitations of our analysis, we find that identification of the optimal set of HL variables leads to a better convergence towards the global minimum. This is evident from comparing the resolution for mass differences (Tables 1 and 4). The overall mass scale however still remains unresolved. In the next two sections, we will further improve the performance of the neural networks in order to determine the spins of the new particles and  measure the overall mass scale at a much higher precision. In order to achieve this goal, we will first purify the signal in the data by passing the data through the HL classifier network (trained at the testing spectrum), and apply the discovery cut, which eliminates 99.5% of background events in each network. Since this corresponds to a roughly 50% efficiency for signal events, it results in an enhancement in S/B by a factor of ∼ 100.
We can also estimate S/B, without prior knowledge of the overall mass scale. This will be used in the second stage of the analysis, where we employ ensemblebased methods. We have where ϵ S is evaluated in the network with the highest ϵ S / √ ϵ B after applying the discovery cut, using signal events generated at the testing spectrum. ϵ S+B is the efficiency of the discovery cut on the actual data. In other words, ϵ B and ϵ S are parameters obtained from simulations, while ϵ S+B is measured from data. Figure 9 shows the accuracy of this procedure.

Spin Determination
We now turn our attention to determining the spins of the new particles. We treat this as a binary classification problem. We start by introducing an ensemble-based analysis method which will also allow us to determine the overall mass scale in the next section. With the ensemble-based method, we assess the performance of DL variables first, and we then identify the HL variables that result in a similar performance.

Ensemble-based method
Consider binary classification of ensembles containing N events each, {x i } ∈ E N , where each event x i ∈ E = R k . In our case, the ensemble corresponds to the entire data-set. A neural network performing this classification task must have k ×N nodes in the input layer. Also, one must take into account the permutational invariance of the input vectors (in other words, the ordering of events within an ensemble should not affect the outcome). This can be achieved either by having a special architecture for the neural network or by considering all the allowed permutations of the input vectors during training. Given the size of the event samples in our case and the dimensionality of the final state (k = 9 for DL inputs), we will not attempt to do this. We instead apply the method introduced in ref. [56], to build an ensemble-based classifier based on an event-by-event classifier. We first build a simple event-by-event classifier operating on events x i represented by the complete set of DL variables. Then, using the output of the event-by-event classifier y(x i ), the ensemble-based classification function y N ({x i }) is generated: . (6.1)

Performance with DL variables
The event-by-event network is made up of 3 hidden layers containing 100 nodes each. We use 1M events of each spin model to train the networks. Since our analysis so far has not allowed us to measure the overall mass scale, we continue generating the training samples at the testing spectrum. The network is trained using the labelŝ y(x i ) = 1 if x i ∈ sDM andŷ(x i ) = 0 if x i ∈ fDM. We use 100k ensembles of each spin model.
We remind the reader that we are performing the spin determination analysis after the first stage of the analysis described in the previous section has already been performed. With the amount of signal purification we gained by applying the discovery cut, even in the most conservative cases (S/B > ∼ 0.01 needed for discovery), the signal fraction in the events passing the cut has been boosted to an O(1) number. In Figure 10, we present the ROC curves for representative values of S/B, for an integrated luminosity of 3 ab −1 . As can be seen in that figure, the spin determination is very accurate and only starts to degrade near the lowest S/B values of interest.  The corresponding AUCs are given in brackets.

Performance with HL variables
Next, we identify the HL variables that match the performance of the event-by-event DL spin determination network, and then use those to perform the ensemble-based analysis in terms of these variables.
As before, we compute the ADOs of the HL variables (at the testing spectrum). The variables with the highest ADOs for spin determination are shown in table 5. Replacing the DL variables with (m +− , MET) results in a similar performance for event-by-event spin determination. m +− is identified by the ADO method to be very effective in spin determination, matching the results presented in ref. [36].
The HL network we use has two hidden layers containing 30 units each. Using the output of this event-by-event HL classifier network, we construct ensemble-based classifiers as we did for the DL variables. Figure 11   It is worth spending some time looking into how the spin information is encoded in the variables m +− and MET. In figure 12, we show the distribution of events in these two variables within the sDM and fDM signal models after the discovery cut. Note that the discovery cut eliminates almost all events with MET< 100 GeV, as that region of phase space is background-dominated. We also show in figure 13 the contours of the event-by-event HL network in the (m +− , MET) space. The shape of these contours can be understood with the following observations. The matrix element squared of the decay process X → µ + µ − γχ is proportional to a factor of m 2 +− in the sDM model, but not in the fDM model, resulting in a significant difference in the m +− distributions in the two models, which can also be seen in figure 12. In fact, at low m +− , the network output is basically purely based on m +− , as can be seen on the left side of figure 13. As m +− approaches its maximum value, most of the kinetic energy from the decay in the X-frame is taken by the muons, leaving the χ with little kinetic energy. As a result, there is a boost imbalance between this softer χ produced in the X-decay, and the one produced directly from the initial state and recoiling against the X (see figure 2). In this region, the MET and m +− variables are correlated, which leads to the contours on the right side of figure 13 bending towards the diagonal.

Optimizing the cut thresholds
For the ensemble-based method described above, AUC quantifies the separation between the two spin models achieved by the output function y N ({x i }). However, we are ultimately interested in finding the optimal value of the cut on the output that maximizes the classification accuracy. Note that the HL variables of interest do not include ∆ 4 , it is therefore sufficient to use the correct mass gaps and the overall mass scale is not needed. This being the case, we can use the testing spectrum for the determination of the optimal cut values.
For a given cut y ′ N on the output function y N ({x i }), we define the overall accuracy rate (AR) as the sum of the accuracy rates of sDM and fDM ensembles, as a function of S/B. Then, scanning through the range of the output function, the cut y * N that maximizes the overall AR is found. The accuracy rates for spin determination of the benchmark signal, using HL variables, are shown in Figure 14.

Mass Scale Determination
Having fixed the mass differences in the earlier stage of the analysis (section 5), the spectrum hypotheses between which we are trying to distinguish at this stage are labeled in terms of a single variable, which we can take to be M χ . It was shown in ref. [32,34] that using ∆ 4 as a kinematic observable helps break the degeneracy along the flat direction and determine the overall mass scale, and we therefore expect ∆ 4 to be a powerful variable when performing a neural network based analysis as well. Even so, we saw at the end of section 5 that the overall mass scale still has a large uncertainty when analyzing the data event-by-event, even with ∆ 4 as one of the HL variables. In this section we combine the power of ∆ 4 with an ensemble-based analysis to pin down the overall mass scale.
We remind the reader that a histogram of ∆ 4 depends not only on the true value of M χ in the data, but also the input value M χ that is assumed when calculating ∆ 4 . The core concept for the method we are about to present relies on training the neural network on the shape of this distribution as either the true or input value for M χ is varied.
To determine the true value of M χ , we specify a range [M 1 , M 2 ] that we expect it to lie in, and we transform each event x i to a point in the following two-dimensional space, with M 1 and M 2 used as input values: Any data sample with the true value M χ is then converted into a two-dimensional scatter plot, and with an appropriate binning, into a pixellated heat map. These heat maps are used to train the neural network, as the true M χ is varied between M 1 and M 2 , and M χ being assigned as the output of the neural network during the training. In the testing phase, the output of the neural network is then taken as the measured value of M χ .
To minimize bias, we choose the relatively broad mass range of (100, 900) GeV for this last stage of our analysis. We use a step size of δM χ = 20 when varying M χ . We have checked that a smaller step size does not result in an improvement of the accuracy of the final result.
To generate the training data, we construct the heat maps described above from the data samples for each value of m χ and S/B of interest, after having applied the discovery cut (see section 4.2). We restrict the heat maps to the range −∆ 4,max ≤ ∆ 4 ≤ ∆ 4,max for each sample. Events lying beyond this range are discarded. We refine the resolution of the heat maps until a saturation in performance is reached in terms of the loss function. In our case, this is achieved for a bin size of ∆ 4,max /20 along each of the two axes. Based on this number, we construct 41 × 41 pixel heat maps. The networks have 1681 nodes in the input layer, and three hidden layers with 100, 300, 100 nodes respectively.
Having completed the training phase, we move on to the determination of M χ from the data. In the testing phase, the trained network is run on the data, with the estimated S/B value obtained from equation 5.1 as input. In 1000 iterations for each value of S/B, the prediction for M χ is obtained from the network output in each run. In figure 15, we show the central value and standard deviation for M χ

Conclusions
We have applied machine learning techniques to optimize discovery sensitivity, spin determination and mass measurement in the SUSY-like decay chain of figure 1. This event topology was chosen as a representative case for when signal cross sections are relatively low, and the final state particles do not have high p T due to a compressed signal spectrum. The decay chain is not long enough for algebraic reconstruction methods to be effective, and yet long enough that commonly used one-dimensional distributions of Lorentz-invariant or boost invariant kinematic observables do not capture the full amount of useful kinematic information.
In order to narrow down the parameter space, we started our analysis with a simple neural network that was effective in determining the mass gaps in the spectrum, and in enhancing signal over background for the second stage of the analysis. We have identified the kinematic observables that match the performance of a 'black-box' network in the first stage of the analysis, confirming the importance of the observable ∆ 4 which had previously been proposed to be an effective observable for similar decay chains.
In the second, ensemble-based, stage of the analysis, we were able to achieve a much higher accuracy in determining the spins of the new particles compared to an event-by-event analysis. Similarly, we were also able to measure the overall mass scale, a significant challenge for methods based on commonly used observables such as kinematic edges and endpoints, with an ensemble-based analysis. Once again, ∆ 4 proved to play a significant role in maximizing the precision of the measurement of the overall mass scale.
We point out to the reader that ∆ 4 is an O(8) polynomial of the Lorentz invariant pairs m ij . To test the efficiency of ∆ 4 for discovery, a scan can be performed over polynomial functions of m ij . Specifically, the binary cross-entropy loss can be minimized with respect to the coefficients of the polynomials of m ij . We will address such global optimization in future work.
The application of similar methods to more general event topologies and more challenging final state particles, to which traditional SUSY searches are not sensitive, is of great interest. Having provided a proof-of-concept with this analysis of an idealized final state, we will take on these challenges in future work.