Anomaly Detection for Resonant New Physics with Machine Learning

Despite extensive theoretical motivation for physics beyond the Standard Model (BSM) of particle physics, searches at the Large Hadron Collider (LHC) have found no significant evidence for BSM physics. Therefore, it is essential to broaden the sensitivity of the search program to include unexpected scenarios. We present a new model-agnostic anomaly detection technique that naturally benefits from modern machine learning algorithms. The only requirement on the signal for this new procedure is that it is localized in at least one known direction in phase space. Any other directions of phase space that are uncorrelated with the localized one can be used to search for unexpected features. This new method is applied to the dijet resonance search to show that it can turn a modest 2 sigma excess into a 7 sigma excess for a model with an intermediate BSM particle that is not currently targeted by a dedicated search.

The main goal of high energy physics is to identify the elementary building blocks of matter and to characterize the laws governing their motion. In order to achieve this goal, experiments at the energy frontier collide particles with extremely high momenta in a quest to directly produce the elementary particles and study their interactions. The collider currently able to directly probe the smallest distance scales is the Large Hadron Collider (LHC). Building on decades of effort at previous experiments, the ATLAS and CMS collaborations at the LHC discovered the Higgs boson in 2012 [1,2], completing the Standard Model (SM) of particle physics. While the SM has been enormously successful, it is not a complete theory of nature: it lacks a description of dark matter and gravity, and it has various technical or aesthetic problems. Despite an intensive and impressive program to search directly for physics beyond the SM at the LHC [3][4][5][6][7], there is still no direct evidence for any new structures in nature. However, there are numerous compelling theoretical motivations for physics beyond the SM (BSM) at the energy scales accessible to the LHC [8]. While it could be that the BSM particles are too massive or produced with too low a cross-section to be discovered yet, it is also possible that the current search program is simply not sensitive to the regions of phase space populated by BSM physics.
In order to mitigate the possibility of uncovered regions of phase space, collider experiments have implemented model-independent anomaly detection techniques. Traditionally, there are two such approaches: general searches and bump hunts. The idea of general searches is to compare data and simulation in a large number of event topologies, characterized by the number and type of various physics objects such as leptons or hadronic jets resulting from high energy quark and gluon production [9][10][11][12][13][14][15][16][17][18][19][20]. While this approach has broad coverage, it is restricted to simple observables because it relies heavily on simulations for background estimation. In contrast, bump hunts [21] often do not use any simulation for background estimation, other than to motivate and validate the background fit procedure: after identifying a region of phase space where a signal is expected to be localized, the background is fit with a smooth function and interpolated to the signal-sensitive region. Excesses over this background prediction would be an indication of BSM physics. To enhance the resonance structure from a di-object invariant mass, modern classification tools [22] can be used to select target objects such as b-quark [23,24], top-quark [25,26], W/Z [25,26] or Higgs boson [27,28] jets from generic quark or gluon jets. However, these classifiers are trained in simulation and calibrated in data, which may lead to suboptimal classifiers. Furthermore, it is not possible in this paradigm to develop classifiers for BSM objects, since no calibration sample exists.
This letter presents a new technique to search for BSM physics that significantly extends the bump hunt approach by using classifiers trained directly on data. Consider a signal that is localized in one kinematic variable (the resonant variable, $m_{\rm res}$) on top of a smoothly varying background, for example a dijet resonance that can be reconstructed from the invariant mass of two jets. Suppose that each event has additional auxiliary information (such as substructure in the two jets) that may provide additional discriminating power between signal and background, but the detailed signal characteristics in these auxiliary variables are unknown a priori. Our proposal is that a classifier can be trained to discern the auxiliary characteristics of the signal (if present) directly from data, without reference to any specific signal model hypothesis. The output of this classifier can then be used to select signal-like events and reject background events, producing a new distribution in the resonant variable that remains smooth if no signal is present, but that may enhance the significance of the bump if a real signal is present. In the event that a signal is discovered, the output of the classifier can then be studied to infer the signal characteristics.
The key feature of resonant signals that is utilized in our approach is that their localization in one kinematic variable on top of a smoothly varying background allows the identification of a potentially signal-enhanced signal region and signal-depleted sideband regions with almost identical background characteristics. A classifier trained to distinguish the auxiliary characteristics of the signal region events from those of the sideband may in principle be as powerful as a classifier trained to distinguish pure samples of signal and background events; this is a specific application of Classification Without Labels (CWoLa) [29]. To see why this is the case, suppose that it is possible to define an ideal sideband selection that contains only background and no signal, and an ideal signal region that contains background identical to that in the sideband but also a small signal that is distinct from the background. By the Neyman-Pearson lemma [30], the most powerful test statistic for discriminating signal (sig) events from background (bg) events using some observables $Y$ is the likelihood ratio
$$L(Y) = \frac{p_{\rm sig}(Y)}{p_{\rm bg}(Y)}, \tag{1}$$
and a fully supervised classifier is trained to approximate any monotonic rescaling of this function. A classifier that is trained to discriminate signal region events (sig+bg) from sideband region events (bg) will instead ideally learn to approximate a monotonic rescaling of the function
$$\frac{f_{\rm sig}\,p_{\rm sig}(Y) + f_{\rm bg}\,p_{\rm bg}(Y)}{p_{\rm bg}(Y)} = f_{\rm sig}\,L(Y) + f_{\rm bg}, \tag{2}$$
where $f_{\rm sig}$ and $f_{\rm bg}$ are the proportions of signal and background events in the signal region. The fact that Eq. 2 is itself a monotonic rescaling of $L(Y)$ from Eq. 1 shows that there is no fundamental obstruction for the CWoLa-based classifier to identify the ideal decision boundaries for signal selection. The above argument also holds if the sideband region has a small amount of signal, as long as its signal proportion is smaller than that of the signal region [29]. Practical limits will arise from limited statistics (particularly for the signal) and other technical difficulties that may prevent a trainable classifier from reaching the performance achievable with labeled simulations, and also from the small differences in the background characteristics between signal and sideband regions.
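The monotonicity argument can be checked numerically. The following is a minimal sketch, assuming a toy one-dimensional feature with Gaussian signal and background densities; the means, widths, and signal fraction are illustrative assumptions, not values from this study. It checks that ranking events by the signal-region-versus-sideband ratio of Eq. 2 gives the same ordering as ranking by the ideal ratio of Eq. 1.

```python
import math

def gauss_pdf(y, mean, sigma):
    """Normal probability density, used here as a toy p(Y)."""
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Hypothetical toy densities: signal centred at 2, background centred at 0.
def p_sig(y):
    return gauss_pdf(y, 2.0, 1.0)

def p_bg(y):
    return gauss_pdf(y, 0.0, 1.0)

def lr_pure(y):
    """Ratio of Eq. 1: pure signal versus pure background."""
    return p_sig(y) / p_bg(y)

def lr_mixed(y, f_sig):
    """Ratio of Eq. 2: signal region (sig+bg) versus sideband (bg only)."""
    f_bg = 1.0 - f_sig
    return (f_sig * p_sig(y) + f_bg * p_bg(y)) / p_bg(y)

# Rank a grid of feature values with both ratios and compare the orderings.
ys = [-3.0 + 0.25 * i for i in range(25)]
rank_pure = sorted(range(len(ys)), key=lambda i: lr_pure(ys[i]))
rank_mixed = sorted(range(len(ys)), key=lambda i: lr_mixed(ys[i], f_sig=0.01))
same_ordering = rank_pure == rank_mixed
```

Because the two orderings agree, any threshold on one ratio corresponds to a threshold on the other. The sketch uses exact densities, so it says nothing about the finite-statistics effects discussed above.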
Our extended bump hunt procedure also has some features in common with the sPlot [55] technique. In particular, sPlot provides a method for determining the distribution of multiple event classes for a resonant feature ('control variable' in the language of Ref. [55]) using a set of uncorrelated auxiliary features ('discriminating variables' in Ref. [55]). The key differences between sPlot and the extended bump hunt are that (1) we are interested in using machine learning to isolate a signal-rich region of phase space and (2) we do not take the probability distribution for the auxiliary features as input; the classification procedure learns useful information directly from the data.
A danger that is present when training and testing a classifier on the same dataset is that it may overfit the training data and learn the specific statistical fluctuations in that dataset rather than the true underlying distribution. Classifiers used in this way will preferentially select signal-region events based on their statistical fluctuations, and will create a fake bump in the resonance-variable distribution even when no real signal is present. A simple way to mitigate this background sculpting is to split the underlying dataset randomly into a training set and a test set, which will have uncorrelated statistical fluctuations. This would, however, result in an effective loss of luminosity available both for training and for testing. Instead we advocate an n-fold cross-validation procedure, in which the data are randomly partitioned into n sets of equal size (stratified by $m_{\rm res}$ bin). The selection on each of the n partitions is performed using the output of a classifier trained and validated on the remaining n−1 partitions, resulting in a total of n classifiers. Any statistical fluctuations learned by a classifier from its training data will be uncorrelated with those in the data on which it is used for event selection. The effects of overtraining on the performance of the classifiers can be mitigated by a nested cross-validation procedure, as described in detail in Ref. [32].
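The partitioning described above can be sketched as follows. This is an illustrative outline only, not the implementation of Ref. [32]: the function names, the `train` and `predict` callables, and the toy "classifier" are hypothetical, and the nested validation step is omitted. Each event is scored by a model trained on the other folds, and folds are balanced within each $m_{\rm res}$ bin.

```python
import random
from collections import defaultdict

def stratified_folds(mres_bins, n_folds, seed=0):
    """Assign each event a fold index in [0, n_folds), balanced within each m_res bin."""
    rng = random.Random(seed)
    by_bin = defaultdict(list)
    for event_index, bin_index in enumerate(mres_bins):
        by_bin[bin_index].append(event_index)
    fold_of = [0] * len(mres_bins)
    for bin_index in sorted(by_bin):
        indices = by_bin[bin_index]
        rng.shuffle(indices)  # random assignment within the bin
        for position, event_index in enumerate(indices):
            fold_of[event_index] = position % n_folds
    return fold_of

def cross_validated_scores(features, labels, fold_of, n_folds, train, predict):
    """Score every event with a model that never saw it during training."""
    scores = [None] * len(features)
    for k in range(n_folds):
        train_idx = [i for i, f in enumerate(fold_of) if f != k]
        model = train([features[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        for i, f in enumerate(fold_of):
            if f == k:
                scores[i] = predict(model, features[i])
    return scores

# Toy stand-in for a classifier: remember the mean feature of label-1 events.
def train_mean(xs, ys):
    selected = [x for x, y in zip(xs, ys) if y == 1]
    return sum(selected) / len(selected)

def predict_closeness(model, x):
    return -abs(x - model)  # larger means more similar to label-1 events

# Hypothetical data: three m_res bins, with bin 1 playing the signal region.
mres_bins = [i % 3 for i in range(30)]
features = [float(i) for i in range(30)]
labels = [1 if b == 1 else 0 for b in mres_bins]
fold_of = stratified_folds(mres_bins, n_folds=5)
scores = cross_validated_scores(features, labels, fold_of, 5,
                                train_mean, predict_closeness)
```

With n folds, each model trains on a fraction (n−1)/n of the data while every event is still available for the final selection, which is the advantage over a single train-test split noted above.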
In using the cross-validation procedure there is a danger that the bin counts become non-Poissonian due to correlations between the selections, which would need to be accounted for with a computationally expensive test statistic calibration based on a large number of simulated toys. If this were found to be prohibitive in a specific application, a simple test-train split remains a possibility to avoid this difficulty. However, in the examples studied in Ref. [32] we find that the cross-validation procedure does not distort the test statistic distributions, and that asymptotic formulae [33] or toys thrown with counts based on the merged selected events provide accurate p-values.
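As a rough illustration of what throwing toys means in the simplest setting, the sketch below estimates a p-value for a single-bin counting experiment and compares it with the exact Poisson tail. It assumes the expected background count is known exactly, which is a simplification of the fit-based procedure used in this letter; the function names and the example counts are hypothetical.

```python
import math
import random
from statistics import NormalDist

def poisson_sample(mean, rng):
    """Draw one Poisson variate with Knuth's product method (adequate for small means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def toy_p_value(n_obs, b_expected, n_toys=20000, seed=1):
    """Fraction of background-only toys with a count at least as large as observed."""
    rng = random.Random(seed)
    n_extreme = sum(poisson_sample(b_expected, rng) >= n_obs for _ in range(n_toys))
    return n_extreme / n_toys

def exact_p_value(n_obs, b_expected):
    """Exact Poisson tail probability P(N >= n_obs) for mean b_expected."""
    term = math.exp(-b_expected)  # P(N = 0)
    cdf_below = 0.0
    for k in range(n_obs):
        cdf_below += term
        term *= b_expected / (k + 1)  # P(N = k + 1) from P(N = k)
    return 1.0 - cdf_below

def z_score(p_value):
    """One-sided Gaussian significance corresponding to a p-value."""
    return NormalDist().inv_cdf(1.0 - p_value)

# Hypothetical example: 20 events observed on an expected background of 10.
p_toys = toy_p_value(20, 10.0)
p_exact = exact_p_value(20, 10.0)
z_exact = z_score(p_exact)
```

If correlations between the cross-validated selections made the counts non-Poissonian, the toy and exact values in an analogous study would disagree, which is the kind of distortion the checks in Ref. [32] look for.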
To summarize, the extended bump hunt algorithm proceeds as follows, for a single resonance mass hypothesis $m_{\rm res}$:
1. Identify an observable $m_{\rm res}$ in which a signal is expected to be resonant, and a set of auxiliary variables $Y$ that are to be used for signal selection. The variables $Y$ must be independent of $m_{\rm res}$; there are a number of methods for correcting this if it is not inherently true [34][35][36][37][38][39][40][41]. A background model $f(m_{\rm res})$ is needed for $m_{\rm res}$. Typically (and in the example below) this is done with a parametric fit, though non-parametric methods are also possible [42].
2. Define a signal region in a window around $m_{\rm res}$.
3. Define sideband regions that are disjoint from the signal region but still sufficiently close that the background distribution in $Y$ is expected to be nearly identical.
4. Use a cross-validation procedure to separate training samples from test samples. For each test subsample:
(a) Train a classifier to discriminate training events drawn from the sideband regions from those drawn from the signal region, using the variables $Y$.
(b) Select a fraction of the most signal-like test events as determined by the classifiers.
6. Perform a statistical test for the presence of an excess in the signal region of the $m_{\rm res}$ distribution after the cut has been applied, using the data outside of the signal region for background determination with the background model $f(m_{\rm res})$. The statistical analysis can be performed using pseudoexperiments generated by sub-sampling from the data itself.
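The region definitions in steps 2 and 3 and the selection in step 4(b) can be sketched as below. This is a minimal illustration under assumed conventions: the symmetric window, the widths, the mass values in GeV, and the function names are all hypothetical choices made for the example.

```python
def region_label(m_res, m_hyp, signal_half_width, sideband_width):
    """Return 'signal', 'sideband', or None for an event with resonant variable m_res."""
    offset = abs(m_res - m_hyp)
    if offset <= signal_half_width:
        return "signal"
    if offset <= signal_half_width + sideband_width:
        return "sideband"
    return None  # too far from the hypothesis to be used for training

def select_most_signal_like(scores, efficiency):
    """Indices of the top `efficiency` fraction of events, ranked by classifier score."""
    n_keep = max(1, round(efficiency * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])

# Hypothetical masses in GeV for a 3000 GeV hypothesis.
masses = [2500.0, 2800.0, 3000.0, 3100.0, 3300.0, 3900.0]
regions = [region_label(m, 3000.0, 200.0, 400.0) for m in masses]

# Hypothetical classifier outputs; keep the top half.
selected = select_most_signal_like([0.1, 0.9, 0.5, 0.7], efficiency=0.5)
```

Repeating the labelling for each mass hypothesis gives the sliding signal region of the scan; the sideband width trades training statistics against how similar the sideband background is to that in the signal region.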
This procedure is repeated starting from step 2 for a series of resonance mass hypotheses, as in a usual bump hunt. This entails the usual trials factor associated with the scan over the resonance variable, but does not invoke any additional trials factor associated with the space of auxiliary variables. Using asymptotic formulae [33] or throwing toys with counts based on the merged selected events provides accurate p-values [32].
As a concrete example of the new bump hunting strategy, suppose there is a new resonance that decays into unusual jets. We do not know a priori how to look for the new resonance, but we can consider the substructure of each jet to look for an anomalous radiation pattern. The left plot of Fig. 1 shows the invariant mass distribution of the two jet four-vectors in simulated QCD dijet events [31]. To illustrate the power of the technique, we have also injected events from the decay of a W′ particle with a mass of 3 TeV. This W′ is constructed to decay to a W boson ($m_W \approx 80$ GeV) and a new X particle ($m_X \approx 400$ GeV), which itself decays into two W bosons, as described in Ref. [44][45][46]. We consider the all-hadronic channel, in which each W boson decays into a quark pair. The signal is thus characterized as having two large jets, one with a two-prong substructure and one with a four-prong substructure. The shaded histogram in the left plot of Fig. 1 peaks at the resonance mass of 3 TeV with a broad width due to jet fragmentation and clustering effects. Without any selection on the jets' substructure, there is no significant indication of the signal hiding under the smooth background from generic quarks and gluons.
To enhance the sensitivity of this search using the extended bump hunt method described above, a suite of classifiers is trained to distinguish a sliding signal region from sideband regions. For each jet, the substructure information ($Y$) used includes the jet mass $m_J$, the number of charged particles (tracks) $n_{\rm trk}$ in the ungroomed jet, and the N-subjettiness ratios $\tau_{MN} = \tau_M^{(1)}/\tau_N^{(1)}$, where the observables $\tau_N$ are defined in Ref. [47]. The outputs of the classifiers are then used to select signal-like events over the full range of the $m_{JJ}$ distribution. The resulting distributions are shown in Fig. 1 (left) after applying thresholds on the neural network (NN) output with overall efficiencies of 10%, 1%, 0.2%, and 0.02%, in descending order. Prior to applying any threshold, the resonant signal has $S/B = 6.4 \times 10^{-3}$ and significance $S/\sqrt{B} = 1.8$ in the signal region, and the $m_{JJ}$ distribution has no discernible resonant feature. However, after applying the threshold determined by the classifier, a clear bump develops in the signal region, with a local significance of 7σ at the 0.2% threshold.
Of course, if the resonance mass is not known in advance, a scan must be performed over possible resonance masses. It is important that the procedure does not create fake bumps in the background when no signal is present. We show in Fig. 1 (right) the p-values obtained in the mass scan over this distribution in the case that (a) no signal is present and (b) the signal has been injected. We find that no significant bumps are created in the signal-free test. Furthermore, we find that traditional searches aimed at finding di-boson resonances using jet substructure-based supervised learning algorithms (but for SM bosons) are not able to enhance the significance of this signal for a wide range of S/B and classifier working points [32].
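The quoted pre-selection ratios fix the approximate signal and background yields in the signal region, and a simple counting estimate shows how a selection can raise the significance. The sketch below is back-of-the-envelope arithmetic only: it uses the naive $S/\sqrt{B}$ estimator rather than the fit-based local significance quoted in the text, and the signal and background efficiencies passed to `significance_after_cut` are hypothetical placeholders, not measured values.

```python
import math

def counts_from_ratios(s_over_b, s_over_sqrt_b):
    """Recover S and B from the two ratios S/B and S/sqrt(B)."""
    sqrt_b = s_over_sqrt_b / s_over_b  # (S/sqrt(B)) / (S/B) = sqrt(B)
    b = sqrt_b ** 2
    s = s_over_b * b
    return s, b

def significance_after_cut(s, b, eff_sig, eff_bg):
    """Naive S/sqrt(B) after a cut keeping eff_sig of signal and eff_bg of background."""
    return (eff_sig * s) / math.sqrt(eff_bg * b)

# Yields implied by S/B = 6.4e-3 and S/sqrt(B) = 1.8 before any threshold.
s, b = counts_from_ratios(6.4e-3, 1.8)

# Hypothetical efficiencies, chosen only to illustrate the scaling.
z_before = significance_after_cut(s, b, eff_sig=1.0, eff_bg=1.0)
z_after = significance_after_cut(s, b, eff_sig=0.2, eff_bg=0.002)
```

In this estimator the significance scales by eff_sig / sqrt(eff_bg), so a cut that rejects background much more strongly than signal can turn a small excess into a large one, provided enough signal events survive.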
In order to characterize the signal that the classifier has found, we can study the distribution of selected signal-like events, as illustrated in Fig. 2. We see that the classifier trained in the presence of a true signal has identified a population of events with a heavier jet with mass $m_{J_A} \approx 400$ GeV, a small number of tracks, and small $\tau_{43}$, and a lighter jet with mass $m_{J_B}$, a small number of tracks, and small $\tau_{21}$.
FIG. 1. Left: $m_{JJ}$ distribution of dijet events (including injected signal, indicated by the filled histogram) before and after applying jet substructure cuts using the NN classifier output for the $m_{JJ} \approx 3$ TeV mass hypothesis. The dashed red lines indicate the fit to the data points outside of the signal region, with the gray bands representing the fit uncertainties. The top set of markers represents the raw dijet distribution with no cut applied, while the subsequent sets of markers have cuts applied at thresholds with efficiencies of $10^{-1}$, $10^{-2}$, $2 \times 10^{-3}$, and $2 \times 10^{-4}$. Right: Local $p_0$-values for a range of signal mass hypotheses in the case that no signal has been injected (left), and in the case that a 3 TeV resonance signal has been injected (right). The dashed lines correspond to the case where no substructure cut is applied, and the various solid lines correspond to cuts on the classifier output with efficiencies of $10^{-1}$, $10^{-2}$, and $2 \times 10^{-3}$.

In conclusion, we have presented a new technique to search for physics beyond the SM that requires very little prior knowledge of the signal. The method was demonstrated in simulation on an all-hadronic resonance search at the LHC, where an uninteresting excess was enhanced to a level of discovery. There are many other possibilities for applying this technique directly to data, in any case where the signal is expected to be localized in one dimension. By naturally exploiting the power of modern machine learning, we hope that this extended bump hunt will help to expose new distance scales in nature on the quest for BSM physics at the LHC and beyond.
The datasets and code used for the case study can be found at Refs. [48,49].

FIG. 2. 2D projections of the 12D feature space of the signal region dataset. First column: all signal region events. This is repeated in the other columns to aid comparisons. Second column: truth-level simulated signal events highlighted in red. Third column: The red dots are the 0.2% most signal-like events selected by the classifier described in the text. Fourth column: The red dots are the 0.2% most signal-like events selected by a classifier trained on the same sample but with true-signal events removed.