Simulation assisted likelihood-free anomaly detection

Given the lack of evidence for new particle discoveries at the Large Hadron Collider (LHC), it is critical to broaden the search program. A variety of model-independent searches have been proposed, adding sensitivity to unexpected signals. There are generally two types of such searches: those that rely heavily on simulations and those that are entirely based on (unlabeled) data. This paper introduces a hybrid method that makes the best of both approaches. For potential signals that are resonant in one known feature, this new method first learns a parametrized reweighting function to morph a given simulation to match the data in sidebands. This function is then interpolated into the signal region, and then the reweighted background-only simulation can be used for supervised learning as well as for background estimation. The background estimation from the reweighted simulation allows for nontrivial correlations between features used for classification and the resonant feature. A dijet search with jet substructure is used to illustrate the new method. Future applications of Simulation Assisted Likelihood-free Anomaly Detection ( SALAD ) include a variety of final states and potential combinations with other model-independent approaches.


I. INTRODUCTION
An immense search effort by the LHC collaborations has successfully probed many extreme regions of the Standard Model phase space [1][2][3][4][5][6][7].Despite strong theoretical and noncollider experimental motivation, there is currently no convincing evidence for new particles or forces of nature from the LHC searches.However, many final states are uncovered [8,9] and the full hypervariate phase space accessible by modern detector technology is only starting to be probed holistically with deep learning methods [10][11][12][13].There is a great need for new searches that can identify unexpected scenarios.
Until recently, nearly all model independent searches relied heavily on simulation.Generically, these searches operate by comparing data with background-only simulation in a large number of phase space regions.Such searches have been performed without machine learning at D0 [14][15][16][17], H1 [18,19], ALEPH [20], CDF [21][22][23], CMS [24,25], and ATLAS [26][27][28].A recent phenomenological study proposed extending this idea to deep learning classifiers [29,30].While independent of signal models, these approaches are dependent on the fidelity of the background model simulation for both signal sensitivity and background accuracy.If the background simulation is inaccurate, then differences between simulation and (background-only) data will hide potential signals.Even if a biased simulation can find a signal, if the background is mismodeled, then the signal specificity will be poor.
A detailed overview of model independent approaches can be found in Ref. [43].
While it is desirable to be robust to background model inaccuracies, it is also useful to incorporate information from Standard Model simulations.Even though these simulations are only an approximation to nature, they include an extensive set of fundamental and phenomenological physics models describing the highest energy reactions all the way to signal formation in the detector electronics.This paper describes a method that uses a background simulation in a way that depends as little on that simulation as possible.In particular, a model based on the deep neural networks using classification for tuning and reweighting (DCTR) procedure [56] is trained in a region of phase space that is expected to be devoid of signals.In a resonance search, there is one feature where the signal is known to be localized and the sideband can be used to train the DCTR model.This reweighting function learns to morph the simulation into the data and is parametrized in the feature(s) used to mask potential signals.
Then, the model is interpolated to the signal-sensitive region and the reweighted background simulation can be used for both enhancing signal sensitivity and estimating the Standard Model background.As deep learning classifiers can naturally probe high dimensional spaces, this reweighting model can in principle exploit the full phase space for both enhancing signal sensitivity and specificity.
This paper is organized as follows.Section II introduces the Simulation Assisted Likelihood-free Anomaly Detection (SALAD) method.A dijet search at the LHC is emulated to illustrate the new method.The simulation and deep learning setup are introduced in Sec.III, and then the application of DCTR is shown in Sec.IV.The signal sensitivity and specificity are presented in Secs.V and VI, respectively.The paper ends with conclusions and outlook in Sec.VII.

II. METHODS
Let m be a feature (or set of features) that can be used to localize a potential signal in a signal region (SR).Furthermore, let x be another set of features which are useful for isolating a potential signal.The prototypical example is a resonance search where m is the single resonant feature, such as the invariant mass of two jets, while x are other properties of the event, such as the substructure of the two jets.The SALAD method then proceeds as follows: ( where the last factor in Eq. (2.1) is an overall constant that is the ratio of the total amount of data to the total amount of simulation.This property of neural networks to learn likelihood ratios has been exploited for a variety of full phase space reweighting and parameter estimation proposals in high energy physics [56,57,[59][60][61][62]. (2) Simulated events in the SR are reweighted using wðxjmÞ.The function wðxjmÞ is interpolated automatically by the neural network.A second classifier gðxÞ is used to distinguish the reweighted simulation from the data.This can be achieved in the usual way with a weighted loss function such as the binary cross-entropy: ð2:2Þ Events are then selected with large values of gðxÞ.
Asymptotically,1 gðxÞ will be monotonically related with the optimal classifier: It is important that the same data are not used for training and testing.The easiest way to achieve this is by using different partitions of the data for these two tasks.One can make use of more data with a cross-validation procedure [41,42].(3) One could combine the previous step with a standard data-driven background estimation technique such as a sideband fit or the ABCD method.However, one can also directly use the weighted simulation to predict the number of events that should pass a threshold requirement on gðxÞ: for some threshold value c and where I½• is the indicator function that is one when its argument is true and zero otherwise.The advantage of Eq. (2.4) over other data-based methods is that gðxÞ could be correlated with m; for sideband fits, threshold requirements on g cannot sculpt local features in the m spectrum.

III. SIMULATION
A large-radius dijet resonance search is used to illustrate the SALAD method.The simulations are from the LHC Olympics 2020 community challenge R&D dataset [63].The background process is generic 2 → 2 parton scattering (labeled QCD for quantum chromodynamics) and the signal is a hypothetical W 0 boson that decays into an X boson and a Y boson.Each of the X and Y decays to quarks.The masses of the W 0 , X, and Y particles are 3.5, 0.5, and 0.1 TeV, respectively.The mass hierarchy between the W 0 particle and its decay products means that the X and Y particles are Lorentz boosted in the lab frame, and therefore their two-prong decay products are captured inside a single large-radius jet.Particle-level simulations are produced with PYTHIA8 [64,65] or HERWIG++ [66] without pileup or multiple parton interactions and a detector simulation is performed with DELPHES3.4.1 [67][68][69].Particle flow objects are clustered into jets using the FASTJET [70,71] implementation of the anti-k t algorithm [72] using R ¼ 1.0 as the jet radius.Events are selected by requiring at least one such jet with p T > 1.3 TeV.In the remaining studies, the PYTHIA dataset is treated as "data," while the HERWIG dataset is treated as "simulation," to mimic the scenario in practice where the simulation is different from data.
Figure 1 presents the invariant mass of the leading two jets.The p T selection is evident from the peak around 3 TeV.The signal peaks around the W 0 mass, and aside from the kinematic feature from the jet selection, the background distribution is featureless.The spectra from PYTHIA and HERWIG are nearly identical, which may be expected since the invariant mass is mostly determined by hard-scatter matrix elements and not final state effects.
To demonstrate the SALAD approach, two features2 about each of the leading jets are used for classification.The first feature is the jet mass, and the second feature is the N-subjettiness ratio [74,75] This second feature is the most widely used feature for differentiating jets that have two hard prongs (as in the signal) from jets that have only one hard prong (as for most of the background).The two jets are ordered by their mass and the four features used for machine learning are presented in Fig. 2. As expected, the signal mass distributions show peaks at the X and Y masses and the τ 21 distributions are small, indicating two-prong substructure.PYTHIA and HERWIG differ mostly at low mass and across the entire τ 21 distribution.
The baseline performance for classifying signal versus the QCD background is presented in Fig. 3.As is the case for all neural networks presented in the following sections, three fully connected layers with 100 hidden nodes on each intermediate layer are implemented using KERAS [76] and TENSORFLOW [77] with the ADAM [78] optimization algorithm.Rectified linear units are the activation function for all intermediate layers and the sigmoid is used for the final output layer.Networks are trained with binary cross entropy for 50 epochs with early stopping (with patience 10).The supervised classifier presented in Fig. 3 effectively differentiates signal from background, with a maximum significance improvement of about 10.It is expected that the performance of any model independent approach will be bounded from above by the performance of this classifier.

IV. PARAMETRIZED REWEIGHTING WITH DCTR
The first step of the DCTR reweighting procedure is to train a classifier to distinguish the data (PYTHIA) from the simulation (HERWIG) in a sideband region.The output of such a classifier is shown in Fig. 4, where the signal region is defined as m jj ∉ ½3250; 3750 GeV.There are about 850k events in the sideband region and 150k events in the signal region.Unlike the classifier in Fig. 3, the separation in Fig. 4 is not as dramatic because PYTHIA and HERWIG are much more similar than the signal is with QCD.While the background data and simulation are still significantly different, it is important that the simulation be relatively close to the data.One can always ensure that there are no regions of phase space where the data probability density is nonzero while the simulation probability density is zero by augmenting the simulation with e.g., uniform phase space.However, this will lead to very large or very small weights in the DCTR method and therefore a large effective statistical uncertainty in the background prediction and a suboptimal classifier.As expected, the network is a linear function of the likelihood ratio so the ratio plot in Fig. 4 is linear.Interestingly, the signal is more HERWIG-like than PYTHIA-like.The reweighting function is applied to the HERWIG in Fig. 4 to show that the reweighted simulation (Sim þ DCTR) looks nearly identical to the data.All of the events used for Fig. 4 are independent from the ones used for training the network.Figure 5 shows that this reweighting works for all of the input distributions to the neural network as well.
The next step for SALAD is to interpolate the reweighting function.The neural network presented in Fig. 4 is trained to be conditional on m jj , and so it can be evaluated in the SR for values of the invariant mass that were not available during the network training.Note that the signal region must be chosen large enough so that the signal contamination in the sideband does not bias the reweighting function.For this example, for 25% signal fraction in the signal region, the contribution in the sideband is about 1% and has no impact on the DCTR model.Figure 6 shows a classifier trained to distinguish data and simulation in the signal region before and after the application of the interpolated DCTR model.There is excellent closure, also for each of the input features to the classifier as shown in Fig. 7. FIG. 6.A histogram of the classifier output for a neural network trained to distinguish data (PYTHIA) and simulation (HERWIG) in the signal region.The ratio between the simulation (HERWIG) or simulation þ DCTR and data (PYTHIA) is depicted by orange circles (green squares) in the lower panel.FIG. 7. The four features used for machine learning in the signal region, before and after applying DCTR: jet mass (top) and the N-subjettiness ratios τ 21 (bottom) for the more massive jet (left) and the less massive jet (right).The ratio between the simulation (HERWIG) or simulation þ DCTR and data (PYTHIA) is depicted by orange circles (green squares) in the lower panels.

V. SENSITIVITY
After reweighting the signal region to match the data, the next step of the search is to train a classifier to distinguish the reweighted simulation from the data in the signal region.If the reweighting works exactly, then this new classifier will asymptotically learn pðsignal þ backgroundÞ= pðbackgroundÞ, which is the optimal classifier by the Neyman-Pearson lemma [79].If the reweighting is suboptimal, then some of the classifier capacity will be diverted to learning the residual difference between the simulation and background data.If the reweighted simulation is nothing like the data, then all of the capacity will go toward this task and it will not be able to identify the signal.There is therefore a tradeoff between how different the (reweighted) simulation is from the data and how different the signal is from the background.If the signal is much more different from the background than the simulation is from the background data, it is possible that a suboptimally reweighted simulation will still be able to identify the signal (see Sec. VI for problems with background estimation).
Figure 8 shows the sensitivity of the SALAD tagger to signal as a function of the signal-to-background ratio (S=B) in the signal region.In all cases, the background is the QCD simulation using PYTHIA. 3The PYTHIA lines correspond to the case where the simulation follows the same statistics as the data (¼ PYTHIA).The area under the curve (AUC) should be as close to one as possible, and a tagger that is operating uniformly at random will produce an AUC of 0.5.Antitagging (preferentially tagging events that are not signal-like) results in an AUC of less then 0.5.The maximum significance improvement is calculated as the largest value of ϵ S = ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ϵ B þ 0.01% p , where the 0.01% offset regulates statistical fluctuations at low efficiency.
When the S=B ∼ Oð1Þ, then the performance in Fig. 8 is similar to the fully supervised classifier presented in Sec.III.As S=B → 0, the PYTHIA curves approach the random classifier, with an AUC of 0.5 and a max FIG. 8. Four metrics for the sensitivity of the SALAD classifier as a function of the signal-to-background ratio (S=B) in the signal region: the AUC in the top left, the maximum significance improvement (top right), the false positive rate at a fixed 50% signal efficiency (bottom left), and the significance improvement at the same fixed 50% signal efficiency (bottom right).The evaluation of these metrics requires signal labels, even though the training of the classifiers themselves do not have signal labels.Error bars correspond to the standard deviation from training five different classifiers.Each classifier is itself the truncated mean over ten random initializations.Note that the full one million PYTHIA events are divided in two pieces, one that acts as the test set for all methods and one that is used for further study.The remaining half is further split in half to represent the data or the simulation for the lines marked "PYTHIA" in Fig. 8.For a fair comparison, the HERWIG statistics are comparable to 25% of the full PYTHIA dataset.
significance improvement of unity.The HERWIG curve has an AUC less than 0.5 as S=B → 0 because the signal is more HERWIG-like than PYTHIA-like (see Fig. 4), and thus a tagger that requires the features to be datalike (data ¼ PYTHIA) will antitag the signal.Likewise, the efficiency of the tagger on the simulation is higher than 50% when placing a threshold on the NN that keeps 50% of the events in data.The maximum significance improvement quickly drops to unity for HERWIG when S=B ≲ 1%, indicating the network is spending more capacity on differentiating PYTHIA from HERWIG than finding a signal.
For all four metrics, SALAD significantly improves the performance of the HERWIG-only approach.In particular, the SALAD tagger is effective to about S=B ≲ 0.5%, whereas the HERWIG-only tagger is only able to provide useful discrimination power down to about S=B ∼ 1%.For the significance improvement and false positive rate at a fixed true positive rate, the SALAD tagger tracks the PYTHIA tagger almost exactly down to below 1%.The AUC about halfway between PYTHIA and HERWIG at high S=B, which is indicative of poor performance at low efficiency.

VI. BACKGROUND ESTIMATION
The performance gains from Sec. V can be combined with a sideband background estimation strategy, as long as threshold requirements on the classifier do not sculpt bumps in the m jj spectrum.However, there is also an opportunity to use SALAD to directly estimate the background from the interpolated simulation.Figure 9 illustrates the efficacy of the background estimation for a single classifier trained in the absence of a signal.Without the DCTR reweighting, the predicted background rate is too low by a factor of 2 or more below 10% data efficiency.With the interpolated reweighting function, the background prediction is accurate within a few percent down to about 1% data efficiency.
In practice, the difficulty in using SALAD to directly estimate the background is the estimation of the residual bias.One may be able to use validation regions between the signal region and sideband region, but it will never require as much interpolation as the signal region itself.One can rely on simulation variations and auxiliary measurements to estimate the systematic uncertainty from the direct SALAD background estimation, but estimating high-dimensional uncertainties is challenging [80,81].With a low-dimensional reweighting or with a proper high-dimensional systematic uncertainty estimate, the parametrized reweighting used in SALAD should result in a lower uncertainty than directly estimating the uncertainty from simulation.In particular, any nuisance parameters that affect the sideband region and the signal region in the same way will cancel when reweighting and interpolating.

VII. CONCLUSIONS
This paper has introduced SALAD, a new approach to search for resonant anomalies by using parametrized reweighting functions for classification and background estimation.The SALAD approach uses information from simulation in a way that is nearly background-model independent while remaining signal-model agnostic.The only requirement for the signal is that there is one feature where the signal is known to be localized.In the example presented in the paper, this feature was the invariant mass of two jets.The location of the resonance need not be known ahead of time and can be scanned using a series of signal and sideband regions.This scanning will result in a trials factor per nonoverlapping signal region.An additional look elsewhere effect is incurred by scanning the threshold on the neural network.In practice, one could use a small number of widely separated thresholds to be broadly sensitive.As long as the data used for training and testing are independent, there is no additional trials factor for the feature space used for classification.Strategies for maximally using the data for training can be found in Refs.[41,42].
While the numerical SALAD results presented here did not fully achieve the performance of a fully supervised classifier trained directly with inside knowledge about the data, there is room for improvement.In particular, a detailed hyperparameter scan could improve the quality of the reweighting.Additionally, calibration techniques could be used to further increase the accuracy [57].Future work will investigate the potential of SALAD to analyze higher-dimensional feature spaces as well as classifier features that are strongly correlated with the resonant feature.It will also be interesting to compare SALAD with other recently proposed model independent methods.When the nominal background simulation is an excellent FIG. 9.The predicted efficiency normalized to the true data efficiency in the signal region for various threshold requirements on the NN.The x axis is the data efficiency from the threshold.The error bars are due to statistical uncertainties.model of nature, SALAD should perform similarly to the methods presented in Refs.[29,30] and provide a strong sensitivity to new particles.In other regimes where the background simulation is biased, SALAD should continue to provide a physics-informed but still mostly background/signal model-independent approach to extend the search program for new particles at the LHC and beyond.

FIG. 1 .
FIG.1.The invariant mass of the leading two jets.

FIG. 3 .
FIG.3.A supervised classifier trained to distinguish signal from PYTHIA QCD.The top plot is a histogram of the neural network output, the left bottom plot is a receiver operating characteristic (ROC) curve, and the right bottom plot is a significance improvement (SIC) curve.ϵ S is the signal efficiency or true positive rate and ϵ B is the background efficiency or false positive rate.

FIG. 4 .
FIG.4.A histogram of the classifier output for a neural network trained to distinguish data (PYTHIA) and simulation (HERWIG) in the sideband region.The ratio between the simulation (HERWIG) or simulation þ DCTR and data (PYTHIA) is depicted by orange circles (green squares) in the lower panel.