Simulation-Assisted Decorrelation for Resonant Anomaly Detection

A growing number of weak- and unsupervised machine learning approaches to anomaly detection are being proposed to significantly extend the search program at the Large Hadron Collider and elsewhere. One of the prototypical examples for these methods is the search for resonant new physics, where a bump hunt can be performed in an invariant mass spectrum. A significant challenge to methods that rely entirely on data is that they are susceptible to sculpting artificial bumps from the dependence of the machine learning classifier on the invariant mass. We explore two solutions to this challenge by minimally incorporating simulation into the learning. In particular, we study the robustness of Simulation Assisted Likelihood-free Anomaly Detection (SALAD) to correlations between the classifier and the invariant mass. Next, we propose a new approach that only uses the simulation for decorrelation but the Classification without Labels (CWoLa) approach for achieving signal sensitivity. Both methods are compared using a full background fit analysis on simulated data from the LHC Olympics and are robust to correlations in the data.


Introduction
Despite compelling experimental (e.g. dark matter) and theoretical (e.g. the hierarchy problem) evidence for new phenomena at the electroweak scale, experiments at the Large Hadron Collider (LHC) have not yet discovered any physics beyond the Standard Model (BSM). There are major search efforts across LHC experiments [1][2][3][4][5][6][7], where most analyses target a particular class of BSM models. While this work is well-motivated and continuing to improve in sensitivity (in part due to machine learning [8][9][10][11]), there is also a growing need for new search strategies capable of discovery in unexpected scenarios.
A variety of automated anomaly detection techniques using innovative machine learning methods are being proposed to cover the unexpected. An important subset of these proposals targets resonant new physics, where sideband methods can be used to estimate the SM background directly from data. A key challenge facing such methods is that the machine learning classifiers must be relatively independent from the resonant feature, for otherwise artificial bumps can be formed. Many automated decorrelation methods have been proposed to ensure that classifiers are decorrelated from particular features by construction [39][40][41][42][43][44][45][46][47][48][49][50], but they may not apply in all cases. In particular, weakly supervised approaches that learn directly on the signal region cannot be simply combined with a decorrelation scheme because such an approach could degrade the performance in the presence of a signal. A localized signal would manifest as a dependence between the resonant feature and other features for classification, so forcing independence could eliminate signal sensitivity.
In this paper, two weakly supervised approaches are studied: Classification without Labels (CWoLa) [13][14][15][51] and Simulation Assisted Likelihood-free Anomaly Detection (Salad) [27]. CWoLa is a method that does not depend on simulation and achieves signal sensitivity by comparing a signal region with nearby sideband regions in the resonance feature. As a result, CWoLa is particularly sensitive to dependencies between the classification features and the resonant feature. Salad uses a reweighted simulation to achieve signal sensitivity. Since it never directly uses the sideband region, Salad is expected to be more robust than CWoLa to dependencies. In order to recover the performance of CWoLa in the presence of significant dependence between the classification features and the resonant feature, a new method called simulation augmented CWoLa (SA-CWoLa) is introduced. The SA-CWoLa approach augments the CWoLa loss function to penalize the classifier for learning differences between the signal region and the sideband region in simulation, which is signal-free by construction. All of these methods will be investigated using the correlation test proposed in Ref. [28].
This paper is organized as follows. Section 2 reviews the Salad and CWoLa methods and introduces the simulation augmented CWoLa search strategy. Furthermore, the sideband analysis is set up in Sec. 2. The simulations used for illustrating the various approaches are described in Sec. 3. Results for the different strategies are presented in Sec. 4. The paper ends with conclusions and outlook in Sec. 5.

Methods
For a set of features $(m, x) \in \mathbb{R}^{n+1}$, let $f : \mathbb{R}^{n} \to [0, 1]$ be parameterized by a neural network. The observable $m$ is special, for it is the resonance feature that should be relatively independent from $f(x)$. The signal region (SR) is defined by an interval in $m$ and the sidebands (SB) are neighboring intervals.
All neural networks were implemented in Keras [52] with the TensorFlow backend [53] and optimized with Adam [54]. Each network is composed of three hidden layers with 64 nodes each and uses the rectified linear unit (ReLU) activation function. The sigmoid function is used after the last layer. Training proceeds for 10 epochs with a batch size of 200. None of these parameters were optimized; it is likely that improved performance could be achieved with an in-situ optimization based on a validation set.

Simulation Assisted Likelihood-free Anomaly Detection (SALAD)
The Salad network [27] is optimized using the following loss:

$$\mathcal{L}_{\text{SALAD}}[f] = -\sum_{i\in\text{SR,data}} \log f(x_i) - \sum_{i\in\text{SR,sim.}} w(x_i|m_i)\,\log\big(1 - f(x_i)\big),$$

where $w(x|m) = g(x,m)/(1-g(x,m))$ are a set of weights using the Classification for Tuning and Reweighting (Dctr) [55] method. The function $g$ is a parameterized classifier [56,57] trained to distinguish data and simulation in the sideband:

$$\mathcal{L}[g] = -\sum_{i\in\text{SB,data}} \log g(x_i,m_i) - \sum_{i\in\text{SB,sim.}} \log\big(1 - g(x_i,m_i)\big).$$

The above neural networks are optimized with binary cross entropy, but one could use other functions as well, such as the mean-squared error. Intuitively, the idea of Salad is to train a classifier to distinguish data and simulation in the SR. However, there may be significant differences between the background in data and the background simulation, so a reweighting function is learned in the sidebands that makes the simulation look more like the background in data.
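The two-step procedure described above can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: the function names and toy inputs are hypothetical, the weight is taken to be the standard likelihood-ratio form w = g/(1 − g), and in a real analysis f and g would be neural networks rather than lists of precomputed outputs.

```python
import math

def _clip(p, eps=1e-7):
    """Keep a probability away from 0 and 1 so the logarithms stay finite."""
    return min(max(p, eps), 1.0 - eps)

def dctr_weight(g_out):
    """Likelihood-ratio weight w = g / (1 - g) from a data-vs-simulation classifier output."""
    g = _clip(g_out)
    return g / (1.0 - g)

def salad_loss(f_data_sr, f_sim_sr, w_sim_sr):
    """Weighted binary cross entropy in the SR: data is label 1, reweighted simulation is label 0."""
    data_term = -sum(math.log(_clip(f)) for f in f_data_sr)
    sim_term = -sum(w * math.log(1.0 - _clip(f)) for f, w in zip(f_sim_sr, w_sim_sr))
    return data_term + sim_term

# Toy inputs (hypothetical): outputs of g on three simulated SR events,
# and outputs of f on two data events and the same three simulated events.
g_on_sim = [0.5, 0.6, 0.4]
weights = [dctr_weight(g) for g in g_on_sim]
loss = salad_loss([0.7, 0.8], [0.3, 0.2, 0.4], weights)
```

A sideband classifier output of 0.5 gives a weight of 1, so simulation that already matches the data is left unchanged by the reweighting.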

Simulation Augmented Classification without Labels (CWoLa)
The idea of CWoLa [51] is to construct two mixed samples of data that are each composed of two classes. Using CWoLa for resonant anomaly detection [13,14], one can construct the mixed samples using the SR and SB. In the absence of signal, the SR and SB should be statistically identical and therefore the CWoLa classifier does not learn anything useful. However, if there is a signal, then it can detect the presence of a difference between the SR and SB. In practice, there are small differences between the SR and SB because there are dependencies between $m$ and $x$, and so CWoLa will only be able to find signals that introduce a bigger difference than is already present in the background. The CWoLa anomaly detection strategy was recently used in a low-dimensional application by the ATLAS experiment [15]. We propose a modification of the usual CWoLa loss function in order to construct a simulation-augmented (SA) CWoLa classifier:

$$\mathcal{L}_{\text{SA-CWoLa}}[f] = -\sum_{i\in\text{SR,data}} \log f(x_i) - \sum_{i\in\text{SB,data}} \log\big(1 - f(x_i)\big) + \lambda\left(\sum_{i\in\text{SR,sim.}} \log f(x_i) + \sum_{i\in\text{SB,sim.}} \log\big(1 - f(x_i)\big)\right),$$

where $\lambda > 0$ is a hyper-parameter. The limit $\lambda \to 0$ is the usual CWoLa approach and for $\lambda > 0$, the classifier is penalized if it can distinguish the SR from the SB in the (background-only) simulation. In order to help the learning process, the upper and lower sidebands are given the same total weight as each other and together, the same weight as the SR.
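As a rough sketch of the loss just described, the following pure-Python functions evaluate the SR-versus-SB binary cross entropy on data and subtract λ times the same quantity evaluated on background-only simulation, together with the sideband weighting. The function names, the tuple layout, and the toy classifier outputs are hypothetical, and writing the penalty as the negative of the simulation cross entropy is an assumption about the sign convention.

```python
import math

def _clip(p, eps=1e-7):
    """Keep a probability away from 0 and 1 so the logarithms stay finite."""
    return min(max(p, eps), 1.0 - eps)

def sideband_weights(n_sr, n_sb_low, n_sb_high):
    """Per-event weights giving each sideband half of the total SR weight (SR events have weight 1)."""
    return 0.5 * n_sr / n_sb_low, 0.5 * n_sr / n_sb_high

def sr_vs_sb_bce(f_sr, f_sb, w_sb):
    """Binary cross entropy with SR events as label 1 and weighted SB events as label 0."""
    sr_term = -sum(math.log(_clip(f)) for f in f_sr)
    sb_term = -sum(w * math.log(1.0 - _clip(f)) for f, w in zip(f_sb, w_sb))
    return sr_term + sb_term

def sa_cwola_loss(data, sim, lam):
    """data and sim are (f_sr, f_sb, w_sb) tuples; lam = 0 reduces to plain CWoLa."""
    return sr_vs_sb_bce(*data) - lam * sr_vs_sb_bce(*sim)

# Toy classifier outputs (hypothetical).
data = ([0.8, 0.7], [0.3, 0.4], [1.0, 1.0])
sim_separated = ([0.9], [0.1], [1.0])  # classifier tells SR from SB in simulation
sim_blind = ([0.5], [0.5], [1.0])      # classifier cannot tell them apart
loss_separated = sa_cwola_loss(data, sim_separated, lam=0.5)
loss_blind = sa_cwola_loss(data, sim_blind, lam=0.5)
```

With identical behaviour on data, the classifier that separates the SR from the SB in simulation receives the larger loss, which is the intended penalty.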

Bump Hunt Analysis
In addition to quantifying performance with Receiver Operating Characteristic (ROC) curves, it is also useful to emulate a proper background estimation based on a bump hunt. A histogram of the $m_{jj}$ spectrum, possibly after applying a threshold on one of the classifiers described above, is fit to a parametric function of $x = m_{jj}/\sqrt{s}$ with fit parameters $p_i$. This functional form has a long history and has also been recently used by the ATLAS and CMS collaborations (see e.g. [58,59]). Alternative non-parametric functions are also possible (such as Gaussian processes [60]), but these are not needed for the demonstration considered here. The SR is masked during the fit and then a p-value of the observed data is computed in the usual way. In particular, a test statistic is formed from the profile likelihood ratio:

$$\lambda_0 = \frac{\max_{\theta'}\, p(n|\mu'=0, \theta')}{\max_{\mu',\theta'}\, p(n|\mu', \theta')},$$

where $n$ is the number of observed events in the SR and $\theta$ is a nuisance parameter constrained by the sideband fit, with $b$ and $\sigma$ the number of events and uncertainty from the sideband fit, respectively. The test statistic itself is $q_0 = -2\log(\lambda_0)$ when the extracted signal strength $(\hat{\mu}, \hat{\theta}) = \operatorname{argmax}_{\mu',\theta'}\, p(n|\mu', \theta')$ satisfies $\hat{\mu} > 0$, and $q_0 = 0$ otherwise. Asymptotic formulae from Wald and Wilks then give the significance $Z = \sqrt{q_0}$ [61][62][63].
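To make the significance calculation concrete, here is a minimal sketch under an explicit assumption about the likelihood: the SR count n is taken to be Poisson distributed with mean µ + θ, and θ is constrained by a Gaussian with mean b and width σ. This parametrization is an illustrative choice and is not necessarily the exact form used in the analysis; the function name is hypothetical. Under this assumption the profiled value of θ for µ = 0 has a closed form.

```python
import math

def profile_significance(n, b, sigma):
    """Asymptotic significance Z = sqrt(q0) for n observed events, given a
    background estimate b with Gaussian uncertainty sigma (assumed model:
    Poisson(n | mu + theta) * Normal(theta | b, sigma))."""
    if n <= b:
        # Best-fit signal strength is not positive, so q0 = 0.
        return 0.0
    # Unconditional maximum: mu_hat = n - b, theta_hat = b.
    loglik_free = n * math.log(n) - n
    # Conditional maximum at mu = 0: positive root of
    # theta^2 + (sigma^2 - b) * theta - n * sigma^2 = 0.
    s2 = sigma ** 2
    theta0 = 0.5 * ((b - s2) + math.sqrt((b - s2) ** 2 + 4.0 * n * s2))
    loglik_null = n * math.log(theta0) - theta0 - (theta0 - b) ** 2 / (2.0 * s2)
    q0 = 2.0 * (loglik_free - loglik_null)
    return math.sqrt(max(q0, 0.0))

# Toy numbers (hypothetical): 130 observed events on an expected 100 +/- 10.
z = profile_significance(130, 100.0, 10.0)
```

The background uncertainty reduces the significance relative to the pure counting case, which is why the fitted significances are smaller than a naive estimate from the event counts alone.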
In practice, one would scan the signal region across the $m_{jj}$ spectrum. In this analysis, we will focus on a single region with or without signal injected. The signal region is defined by $m_{jj} \in [3.3, 3.7]$ TeV and the sideband for CWoLa training is defined as $m_{jj} \in [3.1, 3.3] \cup [3.7, 3.9]$ TeV. Long sidebands extended by 300 GeV in either direction are used to train the Salad reweighting function. The background fit is performed between 2.6 and 5 TeV using 30 equally spaced bins.
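The region definitions above can be written down as a small configuration sketch. The names are hypothetical, the intervals are treated as half-open for convenience, and the long sidebands are assumed here to be the CWoLa sidebands extended outward by 300 GeV, i.e. [2.8, 3.3] and [3.7, 4.2] TeV, which is one reading of the description.

```python
# All masses in TeV.
SR = (3.3, 3.7)
SB_LOW = (3.1, 3.3)
SB_HIGH = (3.7, 3.9)
LONG_EXTENSION = 0.3  # assumed outward extension of each sideband
FIT_RANGE = (2.6, 5.0)
N_BINS = 30

def region(mjj):
    """Return 'SR', 'SB_low', 'SB_high', or None for a dijet mass in TeV."""
    if SR[0] <= mjj < SR[1]:
        return "SR"
    if SB_LOW[0] <= mjj < SB_LOW[1]:
        return "SB_low"
    if SB_HIGH[0] <= mjj < SB_HIGH[1]:
        return "SB_high"
    return None

def in_long_sideband(mjj):
    """True if the event falls in the extended sidebands used for the reweighting."""
    low = SB_LOW[0] - LONG_EXTENSION <= mjj < SB_LOW[1]
    high = SB_HIGH[0] <= mjj < SB_HIGH[1] + LONG_EXTENSION
    return low or high

def fit_bin_edges():
    """Edges of the equally spaced bins used for the background fit."""
    lo, hi = FIT_RANGE
    return [lo + (hi - lo) * i / N_BINS for i in range(N_BINS + 1)]
```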

Simulation
The simulations used for this study were produced for the LHC Olympics 2020 community challenge [64]. In particular, the background process is composed of generic dijet events with a requirement for at least one such jet with $p_T > 1.3$ TeV. Signal events are $W' \to XY$ for $m_{W'} = 3.5$ TeV and hypothetical particles $X$ and $Y$ of mass 500 and 100 GeV, respectively, each decaying into pairs of quarks. Due to the mass hierarchy between the $W'$ boson and its decay products, the final state is characterized by two large-radius jets with two-prong substructure. The background and signal are simulated using Pythia 8 [65,66] and an alternative background sample is simulated using Herwig++ [67]. A detector simulation is performed with Delphes 3.4.1 [68][69][70] using the default CMS detector card. Particle flow objects are the input to jet clustering, implemented using Fastjet [71,72] and the anti-$k_t$ algorithm [73] using $R = 1.0$ for the radius parameter. In what follows, Pythia will play the role of 'data' and the Herwig sample will be used as the 'simulation'. There are one million events for both background samples, corresponding to an integrated luminosity of about 100 fb$^{-1}$. In order to simplify the analysis, the dataset is divided in half for training and testing. More complicated procedures based on k-folding to use the entire dataset for both training and testing are also possible, but are not considered here [13,14].
Both the CWoLa and Salad methods have been demonstrated on the unmodified LHC Olympics dataset. Following Ref. [28], the dependence between the jet masses and $m_{jj}$ is artificially strengthened by redefining $m_j \to m_j + \alpha\, m_{jj}$ for $\alpha = 0.1$. As shown in Ref. [28], this shift is sufficient to reduce the efficacy of the unmodified CWoLa method.
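The redefinition of the jet masses is simple enough to illustrate directly. The sketch below applies the shift to both jets of an event; the function name and the toy event (jet masses of 500 and 100 GeV at a dijet mass of 3500 GeV) are hypothetical.

```python
def shift_jet_masses(m_j1, m_j2, m_jj, alpha=0.1):
    """Apply m_j -> m_j + alpha * m_jj to both jets (all masses in GeV)."""
    return m_j1 + alpha * m_jj, m_j2 + alpha * m_jj

# Toy signal-like event: the shift moves each jet mass up by 350 GeV.
heavy, light = shift_jet_masses(500.0, 100.0, 3500.0)
mass_difference = heavy - light
```

Because both jets receive the same shift, the difference between the two jet masses is unchanged, while each individual jet mass acquires a linear dependence on the dijet mass.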
In addition to the dijet invariant mass, four features are used for the anomaly detection: the invariant mass of the lighter jet, the mass difference of the leading two jets, and the $\tau_{21}$ [74,75] of the leading two jets. The N-subjettiness $\tau_{21}$ quantifies the extent to which a jet is characterized by two subjets or one subjet. Histograms of the four input features for the background are shown in Fig. 1. The signal jet masses are localized at the $X$ and $Y$ masses (shifted by $\alpha\, m_{W'}$) and the $\tau_{21}$ are shifted to lower values, indicating two-pronginess. In addition to presenting the data and simulation histograms, Fig. 1 also shows the reweighted background simulation using parameterized weights learned from a long sideband.

Figure 1. Left: the jet mass and $\tau_{21}$ of the jet with a smaller mass. Right: the difference between the heavier and lighter jet masses and $\tau_{21}$ of the heavier jet. In addition to showing the data, simulation, and signal, the histogram labeled 'Sim.+DCTR' is the simulation with weights derived from a parameterized reweighting function trained on long sidebands.

Results
As a benchmark, 1500 signal events corresponding to a fitted significance of about 2σ are injected into the data for training. For evaluation, the entire signal sample (except for the small number of injected events) is used. Figure 2 shows the performance of various configurations. The fully supervised classifier uses high-statistics signal and background samples in the SR with full label information. Since the data are not labeled, this is not achievable in practice. A solid red line labeled 'Optimal CWoLa' corresponds to a classifier trained using two mixed samples, one composed of pure background in the signal region and the other composed of mostly background (independent from the first sample) in the SR with the 1500 signal events. This is optimal in the sense that it removes the effect from phase space differences between the SR and SB for the background. The Optimal CWoLa line is far below the fully supervised classifier because the neural network needs to identify a small difference between the mixed samples over the natural statistical fluctuations in both sets. The actual CWoLa method is shown with a dotted red line. By construction, there is a significant difference between the phase space of the SR and SB and so the classifier is unable to identify the signal. At low efficiency, the CWoLa classifier actually anti-tags because the SR-SB differences are such that the signal is more SB-like than SR-like. Despite this drop in performance, the simulation augmenting modification (solid orange) with λ = 0.5 nearly recovers the full performance of CWoLa. For comparison, a classifier trained using simulation directly is also presented in Figure 2. The line labeled 'Data vs. Sim.' directly trains a classifier to distinguish the data and simulation in the SR without reweighting. Due to the differences between the background in data and the simulated background, this classifier is not effective. In fact, the signal is more like the background simulation than the data background and so the classifier is worse than random (preferentially removes signal). The performance is significantly improved by adding in the parameterized reweighting, as advocated by Ref. [27]. With this reweighting, the Salad classifier is significantly better than random and is comparable to SA-CWoLa. The Optimal CWoLa line also serves as the upper bound in performance for Salad because it corresponds to the case where the background simulation is statistically identical to the background in data.
The SA-CWoLa method has one free parameter that must be tuned. Figure 3 quantifies the performance of the SA-CWoLa classifier as a function of λ. The performance of SA-CWoLa is strong and relatively stable for 0.3 < λ < 0.6. For λ 0.2, the classifier is effectively blinded to differences between the SR and SB as illustrated in the orange lines in Fig. 3.

Figure 2. A Receiver Operating Characteristic (ROC) curve (left) and significance improvement curve (right) for various anomaly detection methods described in the text. The significance improvement is defined as the ratio of the signal efficiency to the square root of the background efficiency. A significance improvement of 2 means that the initial significance would be amplified by about a factor of two after employing the anomaly detection strategy. The supervised line is unachievable unless there is no mismodeling and one designed a search for the specific $W'$ signal used in this paper. The curve labeled 'Random' corresponds to equal efficiency for signal and background.

While ROC and significance improvement curves are effective for quantifying performance, they do not communicate the complete story because they ignore the impact of background estimation. Figures 6 and 7 show the results of the sideband fit and statistical test (see Sec. 2.3). The fit quality is excellent when considering all bins (see Fig. 4), but there happens to be a small local deficit in the SR. The right plot of Fig. 6 removes this effect by subtracting the fitted residuals in the background-only case for each value of the NN background efficiency. The spectra after applying the nominal CWoLa classifier cannot be fit to the same shape and are thus not included -see

There is a small local deficit in the simulation. The left plot shows the fitted excess without modifying the background while the right plot corrects for the initial deficit by subtracting the residuals of the background-only fit before performing the signal+background fit. In the latter case, the significances are still not $S/\sqrt{B}$ due to the uncertainty from the sideband fit.
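The significance improvement used in the performance curves, defined as the signal efficiency divided by the square root of the background efficiency, can be computed for a single threshold as in the following sketch. The function names and the toy classifier scores are hypothetical.

```python
import math

def efficiency(scores, threshold):
    """Fraction of events whose classifier score exceeds the threshold."""
    return sum(s > threshold for s in scores) / len(scores)

def significance_improvement(eps_sig, eps_bkg):
    """Signal efficiency divided by the square root of the background efficiency.
    Requires eps_bkg > 0."""
    return eps_sig / math.sqrt(eps_bkg)

# Toy scores (hypothetical): 4 signal events and 16 background events.
sig_scores = [0.9, 0.8, 0.7, 0.2]
bkg_scores = [0.95] + [0.1] * 15
eps_s = efficiency(sig_scores, 0.5)   # 3 of 4 signal events pass
eps_b = efficiency(bkg_scores, 0.5)   # 1 of 16 background events passes
improvement = significance_improvement(eps_s, eps_b)
```

A classifier with equal signal and background efficiency gives an improvement of at most 1, so values below 1 indicate that applying the threshold would reduce the significance.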

Figure 7. Fit excess without signal injected using the statistical procedure described in Sec. 2.3, shown for Optimal CWoLa, SA-CWoLa, and SALAD. Without any signal injected, there is a small (∼1.5σ) deficit in the simulation. The right plot shifts the curves so that the 100% efficiency point corresponds to 0σ (axis: shifted SR excess, in units of σ).

Conclusions
This paper has investigated the impact of dependencies between $m_{jj}$ and classification features for the resonant anomaly detection methods Salad and CWoLa. A new simulation-augmented approach has been proposed to remedy challenges with the CWoLa method. This modification is shown to nearly recover the performance of CWoLa in the ideal case where dependencies are ignored in the training. In both the Salad and SA-CWoLa methods, background-only simulations provide a critical tool for mitigating the sensitivity of the classifiers to dependencies between the resonant feature and the classifier features. These weakly supervised methods are particularly promising, but they are not the only recently proposed machine-learning-based anomaly detection methods. In particular, unsupervised methods also have great potential. The Anomaly Detection with Density Estimation (Anode) method [28] does not use simulation at all and has been shown to be relatively robust to dependencies between the resonant feature and the classifier features. Additionally, autoencoder methods have been combined with explicit decorrelation to build in robustness to such dependencies [18].
Each of these unsupervised and semisupervised methods has advantages and weaknesses and it is likely that multiple approaches will be required to achieve broad sensitivity to BSM physics. Therefore, it is critical to study the sensitivity of each technique to dependencies and propose modifications where possible to build robustness. This paper is an important step in the decorrelation program for automated anomaly detection with machine learning. Tools like the ones proposed here may empower higher-dimensional versions of the existing ATLAS search [15] as well as other related searches by other experiments in the near future.

Code and Data
The code for this paper can be found at https://github.com/bnachman/DCTRHunting and the simulated data are available from the LHC Olympics [64].

and Scientists (WDTS) under the Science Undergraduate Laboratory Internships Program (SULI). BN would like to thank NVIDIA for providing Volta GPUs for neural network training.