Anomaly detection with density estimation

We leverage recent breakthroughs in neural density estimation to propose a new unsupervised ANOmaly detection with Density Estimation (ANODE) technique. By estimating the conditional probability density of the data in a signal region and in sidebands, and interpolating the latter into the signal region, a fully data-driven likelihood ratio of data versus background can be constructed. This likelihood ratio is broadly sensitive to overdensities in the data that could be due to localized anomalies. In addition, a unique potential benefit of the ANODE method is that the background can be directly estimated using the learned densities. Finally, ANODE is robust against systematic differences between signal region and sidebands, giving it broader applicability than other methods. We demonstrate the power of this new approach using the LHC Olympics 2020 R&D dataset. We show how ANODE can enhance the significance of a dijet bump hunt by up to a factor of 7 with a 10% accuracy on the background prediction. While the LHC is used as the recurring example, the methods developed here have a much broader applicability to anomaly detection in physics and beyond.


I. INTRODUCTION
Despite an impressive and extensive search program from ATLAS [1][2][3], CMS [4][5][6], and LHCb [7] for new particles and forces of nature, there is no convincing evidence for new phenomena at the Large Hadron Collider (LHC).However, there remain compelling theoretical (e.g., naturalness) and experimental (e.g., dark matter) reasons for fundamental structure to be observable with current LHC sensitivity.The vast majority of LHC searches are designed with specific signal models motivated by one of these reasons (e.g., gluino pair production from supersymmetry) in mind, and these searches are optimized with a heavy reliance on simulations, for both the signal and the Standard Model (SM) background.Given that it is impossible to cover every model with a specially optimized search (see e.g., [8,9] for comprehensive lists of currently uncovered models), and given that there are vast regions of unexplored LHC phase space, it is critical to consider extending the search program to include more model-agnostic methods.
A variety of model-agnostic approaches have been proposed to search for physics beyond the Standard Model (BSM) at colliders.These approaches are designed to be broadly sensitive to anomalies in data without focusing on specific models.Yet, they have varying degrees of both signal model and background model independence, as there is often a tradeoff between the broadness of a search and how sensitive it is to particular classes of signal scenarios.Existing and proposed model-agnostic searches range from fully signal model independent but fully background model dependent [10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26] (because they compare data to SM simulation), to varying degrees of partial signal model and background model independence [27][28][29][30][31][32][33][34][35][36][37][38][39][40].A comprehensive overview of existing model-agnostic approaches and how they are classified in terms of signal and background model independence will be given in Sec.II.
This paper introduces a new approach called ANOmaly detection with Density Estimation (ANODE) that is complementary to existing methods and aims to be largely background and signal model agnostic.Density estimation, especially in high dimensions, has traditionally been a difficult problem in unsupervised machine learning.The objective of density estimation is to learn the underlying probability density from which a set of independent and identically distributed examples were drawn.In the past few years, there have been a number of breakthroughs in density estimation using neural networks and the performance of high-dimensional density estimation has greatly improved.The idea of ANODE is to make use of these recent breakthroughs in order to directly estimate the probability density of the data.Assuming the signal is localized somewhere, one can attempt to use sideband methods and interpolation to estimate the probability density of the background.Then, one can use this to construct a likelihood ratio generally sensitive to new physics.
As with any search for BSM, it is not enough to have a discriminant that is sensitive to signals, one must also have a valid method of background estimation, otherwise it will be impossible to claim a discovery of new physics.The method of background estimation can further introduce possible sources of signal and background model dependence, and it is important to avail oneself of data-driven background methods in any truly model-agnostic search.This paper will explore two methods of data-driven background estimation, one based on importance sampling, and the other based on directly integrating the background density estimate obtained in the ANODE procedure.
Other neural network approaches to density estimation have been studied in high energy physics.Such methods include generative adversarial networks (GANs) , autoencoders [56,68], physically inspired networks [69,70], and flows [71,72].GANs are efficient for sampling from a density and are thus promising for accelerating slow simulations, but they do not provide an explicit representation of the density itself.For this reason, ANODE is built using normalizing flows [71] and in particular the recently proposed masked autoregressive flow (MAF) [73].These methods estimate densities by using a succession of neural networks to gradually map the original data to a transformed dataset that follows a simple distribution (e.g., normal or uniform).
The ANODE method is demonstrated using a simulated large-radius dijet search based on the LHC Olympics 2020 R&D dataset [74].In particular, properties of hadronic jets are used as discriminating features to enhance a bump hunt in the invariant mass of pairs of jets.ANODE learns a parametrized density of the features using a sideband and this is combined with a density estimation of the same features in the signal region.The resulting likelihood ratio is able to enhance the sensitivity of a traditional bump hunt from S= ffiffiffi ffi There is currently no dedicated search for generic dijet signatures where each of the jets can also originate from a BSM resonance [8,[75][76][77][78].Therefore, this particular application could be directly useful for extending the LHC physics search program.Many other applications to resonant new physics searches involving jets and other final states are also possible.
In order to benchmark the performance of ANODE, it is compared with the compared with classification without labels (CWoLa) hunting method [33,34].The CWoLa approach is also a neural network-based resonance search, but does not involve density estimation.Instead, CWoLa hunting uses neural networks to identify differences between signal regions and neighboring sideband regions.By turning the problem into a supervised learning task [79], CWoLa is able to effectively find rare resonant signals.However, CWoLa hunting has certain requirements on the independence of the discriminating features and the resonant feature.ANODE does not have this requirement, and the potential for exploiting correlated features is studied by introducing correlations.
This paper is organized as follows.Section II reviews the landscape of model-independent searches at the LHC to provide context for the ANODE method.Section III introduces the details of the ANODE approach and provides a brief introduction to normalizing flows.The reminder of the paper illustrates ANODE through an example based on a dijet search using jet substructure.Details of the simulated samples are provided in Sec.IV, and the results for the signal sensitivity and background specificity are presented in Sec.VA and V B, respectively.A study of correlations between the discriminating features and the resonant feature is in Sec.V C. The paper ends with conclusions and outlook in Sec.VI.

II. AN OVERVIEW OF MODEL-(IN)DEPENDENT SEARCHES
A viable search for new physics generally must have two essential components: it must be sensitive to new phenomena and it must also be able to estimate the background under the null hypothesis (Standard Model only).The categorization of a search's degree of model (in)dependence requires consideration of both of these components.Figure 1 illustrates how to characterize model independence for both BSM sensitivity and SM background specificity.We will now consider each in turn.

A. BSM sensitivity
For BSM sensitivity, the various types of searches are categorized as follows: (i) Almost all searches at the LHC are optimized (with or without machine learning) using simulations of both the SM and particular signal models.This is represented as the lower-left corner of Fig. 1(a).(ii) A handful of searches use signal simulation and unlabeled data to optimize the event selection.These are background model agnostic and are depicted in the upper-left corner of Fig. 1(a).For example, this was used in the γγ channel of the recent t th observation, using events with inverted selection criteria to define the background data sample for optimization [81,82].(iii) A series of signal model agnostic, but background model-dependent searches have been performed by D0 [10][11][12][13], H1 [14,15], ALEPH [16], CDF [17][18][19], CMS [20,21], and ATLAS [22][23][24].All of these searches share essentially the same approach: they BENJAMIN NACHMAN and DAVID SHIH PHYS.REV.D 101, 075042 (2020) 075042-2 compared histograms of data to histograms of SM simulations and looked for discrepancies.Such searches are represented in the lower-right part of Fig. 1(a).Recently, there have been proposals to extend these searches with deep learning [25,26].(iv) More recently, a variety of approaches have been proposed, often relying on sophisticated deep learning techniques, that attempt to be both signal and background model agnostic, to varying degrees.These include approaches based on autoencoders [27][28][29][30][31][32], weak supervision [33,34], nearest neighbor algorithms [35][36][37], probabilistic modeling [38], reweighted simulation [39], and others [40].These are indicated in the upper-right corner of Fig. 1(a).In the upper-right corner of Fig. 1(a), we have also attempted to illustrate in finer detail the differences between some recent model-agnostic approaches.For example, the autoencoder is in the farthest corner since it assumes almost nothing about the signal or the background but can be run directly on the data, as long as the signal is sufficiently rare [27,28].The tradeoff is that there is no optimality guarantee for the autoencoder-any signal that it does find will be found in a rather uncontrolled manner.Meanwhile, CWoLa hunting [33,34] is somewhat more signal and background model dependent than autoencoders, since this approach assumes that the signal is localized in a particular feature, and that there is an uncorrelated set of additional features on which one can train a classifier to distinguish signal region and sideband.In return, one obtains a guarantee of asymptotic optimality-the classifier approaches the likelihood ratio [83] in the limit of infinite statistics. 1he ANODE method introduced in this paper complements the other recently proposed techniques and is asymptotically optimal.To do this, ANODE estimates the density of the background-only scenario using sidebands and compares that with the density estimated in a signal-sensitive region (details are in Sec.III).Like the CWoLa hunting method, the new approach is broadly sensitive to resonant new physics and thus it is placed in the upper-right part of Fig. 1(a).The reason that ANODE is further right and above CWoLa hunting is that it is less sensitive to correlations, a feature that is discussed more below.

B. Background estimation
A variety of methods are commonly used for background estimation and are highlighted in Fig. 1(b).Generally, background estimation is less dependent on the signal model than achieving signal sensitivity and therefore the xaxis range of Fig. 1(b) is more compressed than Fig. 1(a).
(i) In some cases, the simulation is used to directly estimate the background.This is often the case for well-understood backgrounds such as electroweak phenomena or very rare processes that are difficult to constrain with data.[20,21] and general search [22][23][24] strategies are from CMS and ATLAS, respectively.LDA stands for latent dirichlet allocation [38,80], ANOmaly detection with Density Estimation (ANODE) is the method presented in this paper, CWoLa stands for classification without labels [33,34,79], and SALAD stands for simulation assisted likelihood-free anomaly detection [39].Direct density estimation is a form of sidebanding where the multidimensional feature space density is learned conditional on the resonant feature (see Sec. III B).
(ii) Most searches use data in some way to constrain the background prediction.One common approach is the control region method, where a search is complemented by an auxiliary measurement to constrain the simulation.Knowledge of the signal is used to ensure that the auxiliary measurement is not biased by the presence of signal.(iii) The two most common methods for background estimates that do not directly use simulation are the ABCD method and the sideband method (bump hunt).The ABCD method operates by identifying two independent features, each of which is sensitive to the presence of signal.Four regions, labeled A, B, C, and D are constructed by (anti)requiring a threshold on the two features.The background rate in the most signal sensitive region is estimated from the other three regions.Background simulations are required to verify independence of the two features.(iv) Finally, the sideband fit only requires that the background be smooth in the region of a potential signal so that a parametric (or not [84]) function can be fit to sidebands and interpolated.However, this method only works for resonant new physics.While strategies from Fig. 1(a) can often be matched with any approach in Fig. 1(b), there is often one combination that is used in practice.Table I provides examples of various searches and the background estimation technique that typically is associated with that search.Searches with a complex background may use multiple background estimation procedures.
ANODE can be combined with any background estimation technique, but it can also be used directly since the background density is already estimated to construct a signal-sensitive classifier.Even though directly providing an accurate background estimation puts stringent requirements on the accuracy of the density estimation, it also reduces the need for a full decorrelation between classification features and the resonant feature.A variety of decorrelation techniques exist [94][95][96][97][98][99][100][101][102][103][104], but ultimately decorrelating removes information available for classification.

III. THE ANODE METHOD
This section will describe the ANODE proposal for an unsupervised method to search for resonant new physics using density estimation.
Let m be a feature in which a signal (if it exists) is known to be localized around some m 0 .The value of m 0 will be scanned for broad sensitivity and the following procedure will be repeated for each window in m.It is often the case that the width of the signal in m is fixed by detector properties and is signal model independent.A region m 0 AE δ is called the signal region (SR) and m ∉ ½m 0 − δ; m 0 þ δ is defined as the sideband region (SB).A traditional, unsupervised, model-agnostic search is to perform a bump hunt in m, using the SB to interpolate into the SR in order to estimate the background.
Let x ∈ R d be some additional discriminating features in which the signal density is different than the background density.If we could find the region(s) where the signal differs from the background and then cut on x to select these regions, we could improve the sensitivity of the original bump hunt in m.The goal of ANODE is to accomplish this in an unsupervised and model-agnostic way, via density estimation in the feature space x.
More specifically, ANODE attempts to learn two densities p data ðxjmÞ and p background ðxjmÞ for m ∈ SR.Then, classification is performed with the likelihood ratio, RðxjmÞ ¼ p data ðxjmÞ p background ðxjmÞ : ð3:1Þ In the ideal case that p data ðxjmÞ ¼ αp background ðxjmÞ þ ð1 − αÞp signal ðxjmÞ for 0 ≤ α ≤ 1 and m ∈ SR, Eq. (3.1) is the optimal test statistic for identifying the presence of signal.In the absence of signal, RðxjmÞ ¼ 1, so as long as p signal ðxjmÞ ≠ p background ðxjmÞ, R data ðxjmÞ has a nonzero density away from 1 in a region with no predicted background.
In practice, both p data ðxjmÞ and p background ðxjmÞ are approximations and so RðxjmÞ is not unity in the absence of signal.The densities pðxjmÞ are estimated using conditional neural density estimation as described in Sec.III A. The function p data ðxjmÞ is estimated in the signal region and the function p background ðxjmÞ is estimated using the sideband region and then interpolated into the signal region.The interpolation is done automatically by the neural conditional density estimator.Effective density estimation will result in RðxjmÞ in the SR that is localized near unity and then one can enhance the presence of signal by applying a threshold RðxjmÞ > R cut for R cut > 1.The interpolated p background ðxjmÞ can then also be used to estimate the background, as described in Sec.III B.

A. Neural density estimation
The ANODE procedure as described in the previous subsection is completely general with regards to the method of density estimation.In this work, we will demonstrate a proof of concept using normalizing flow models for density estimation.Since normalizing flows were proposed in Ref. [71], they have generated much activity and excitement in the machine learning community, achieving stateof-the-art performance on a variety of benchmark density estimation tasks.
The core idea behind a normalizing flow is to apply a change of variables from a random variable with a simple density (e.g., Gaussian or uniform) to one with a complex density that matches some training dataset.The transformation from one density describing random variable X to another density describing random variable Y follows the usual change of variables formula using the Jacobian, where The first neural density estimation with normalizing flows had the following form for x ∈ R n : where σ is an elementwise nonlinearity and x ∈ R n ; w ∈ R n ; b ∈ R are trainable parameters.The benefit of Eq. (3.4) is that the Jacobian evaluation is simple from the chain rule.
Since the first development of normalizing flows, there has been significant development in extending their expressivity.One innovation is to combine flows with autoregressive density estimation [105].An autoregressive flow [106] modifies the change of variables so that for , where the indices α denote the dimension of X i and Y i for α ¼ 1; …; n.Any f that satisfies this condition is amenable to neural density estimation because the Jacobian determinant evaluation is simple.In particular, the Jacobian is upper triangular and therefore the determinant is the product of the diagonal elements: Q n α¼1 ∂f i;α =∂x α .ANODE is built on an MAF [73].For an MAF, where σ i;α > 0 and μ i;α are arbitrary functions and Y i;1 ¼ μ i;1 þ σ i;1 X i;1 for arbitrary numbers σ i;1 > 0, μ i;1 .As in Eq. (3.3), this procedure is repeated multiple times to build a deep autoregressive flow.The masking in MAF comes from its use of masked autoencoder for distribution estimation (MADE) [107] to evaluate μ i;α and σ i;α for all α in one forward pass.This approach eliminates the need for the recursion in Eq. (3.5).MAF is nearly the same as inverse autoregressive flows (IAFs) [105], which also use Gaussian autoregressions and are built on MADE.The main difference is that MAF is very efficient for density estimation and slow for sampling, while IAF is slow for density estimation and fast for sampling.As ANODE only needs to estimate the density without producing new samples, MAF is selected as the method of choice.
The estimation of p background ðxjmÞ for ANODE requires that the MAF provides a conditional density.This can be accomplished by adding m as an input to all functions μ i and σ i .

B. Estimating the background
An anomaly detection technique is only useful for finding new particles if the Standard Model background can be estimated.As mentioned earlier, one benefit of the direct density estimation in ANODE is that the background can be directly estimated with p background ðxjmÞ.This results in the following multiple possibilities for background estimation that are considered in this work: (i) Direct density estimation: These methods use the interpolated p background ðxjmÞ to directly compute the efficiency ϵ bg ðR c jmÞ of the background after a threshold requirement on RðxjmÞ.Density sampling.One could directly sample events from p background ðxjmÞ using the stacked change of variables specified by Eq. (3.5).As mentioned in Sec.III A, this is less efficient for MAF compared with IAF.This sampling is not pursued in this paper.Density integration.Another approach is to directly integrate p background ðxjmÞ for events with RðxjmÞ > R c , Importance sampling.Analytically integrating a function in high dimensions is impractical, so one can estimate the integral with importance sampling.An effective method to implement this sampling is to make the following observation: The last line in Eq. (3.7) can be estimated by computing the fraction of events in the SR (representing the full distribution) with R > R c and then weighting each event in the counting by 1=R.(ii) Sideband in m: As long as the requirement RðxjmÞ > R c does not sculpt a localized feature in m, one can estimate the background prediction by performing a fit in the m spectrum from the SB and interpolating to the SR.This is a standard approach, as discussed in Sec.I. Further details about background estimation are presented in Sec.V B for the numerical example described in the next section.

C. Comparison with the CWoLa hunting method
The CWoLa hunting method [33,34] is a recently proposed model-agnostic sideband method that also uses machine learning and will serve as a benchmark for ANODE.In the CWoLa hunting approach, the signal sensitivity is achieved by training a classifier to distinguish the SR from the SB.This classifier will approach the likelihood ratio R CWoLa , which is optimal under certain conditions, where the second equality is true in the absence of signal in the sideband2 and the third equality is true when x and m are independent.The background is estimated using a sideband fit after placing a selection based on the above classifier.
A key assumption of the CWoLa method is that x and m are independent.This condition is stronger than the requirement for the background fit, but is necessary for achieving signal sensitivity.In particular, in the presence of a dependence between x and m, the CWoLa classifier will learn the true differences between SB and SR.If these differences are larger than the differences between signal and background in the SR, the CWoLa classifier may not succeed in finding the signal.
In contrast, the ANODE method does not require any particular relationship between x and m to achieve signal sensitivity.In fact, the information about m could be fully contained within x, and ANODE could still succeed in principle.Therefore, ANODE can make use of features which are strongly correlated with m, thus extending the potential sensitivity to new signals.This is possible because of the two step density estimation, interpolating p background ðxjmÞ from the sideband and then estimating p data ðxjmÞ from the SR.Such an approach is not possible with CWoLa hunting, which directly learns the likelihood ratio.The only requirement for ANODE is that there are no nontrivial features in the SR that cannot be smoothly predicted from the SB.Section V C illustrates the ability of ANODE to cope with correlated features.

IV. DETAILS OF THE SAMPLE
A simulated resonance search using large-radius dijets is used to illustrate ANODE.The simulated datasets are from the LHC Olympics 2020 challenge research and development dataset [74].For a background process, one million quantum chromodynamic (QCD) dijet events are simulated with PYTHIA 8 [108,109] without pileup or multiple parton interactions.The signal is a hypothetical W 0 boson (m W 0 ¼ 3.5 TeV) that decays into an X boson (m X ¼ 500 GeV) and a Y boson (m Y ¼ 100 GeV), with the same simulation setup as the QCD dijets.The X and Y bosons decay promptly into quarks and due to their large Lorentz boost in the laboratory frame, the resulting hadronic decay products are captured by a single largeradius jet.The detector simulation is performed with Delphes 3.4.1 [110][111][112] and particle flow objects are clustered into jets using the Fastjet [113,114] implementation of the anti-k t algorithm [115] using R ¼ 1.0 as the jet radius.Events are selected by requiring at least one such jet with p T > 1.3 TeV.While there exist LHC searches for the case that X and Y are electroweak bosons [116,117], the generic case is currently uncovered by a dedicated search.
The resonant feature m will be the invariant mass of the leading two jets, m JJ .These two jets are ordered by their mass m J so that by construction, m J 1 < m J 2 .The discriminating features x are four-dimensional, consisting of the observables, where τ 21 is the n-subjettiness ratio [118,119].This observable is the most widely used single feature for identifying jets with a two-prong substructure.While the ultimate goal of ANODE is to perform density estimation on high-dimensional, low-level features, there is already utility in a search with high-level features from Eq. (4.1).
Thus, to demonstrate how ANODE works, this will be the focus for the rest of this paper.Simulated data are constructed by injecting 1000 signal events to the full background sample.A histogram of m JJ is presented in Fig. 2. As expected, the signal peaks near m W 0. The signal region is defined by m JJ ∈ ½3.3; 3.7 TeV and then the sideband is the rest of the spectrum.The simulated data are divided into two equal samples for training and testing; thus, we have ≈500; 000 background and ≈500 signal events in each sample.In the SR, we are left with ≈60; 000 background and ≈400 signal events in each sample.This corresponds to S= ffiffiffi ffi B p ¼ 1.6 and S=B ¼ 0.6% in the SR.This value of S= ffiffiffi ffi B p would be the approximate significance from a sideband fit (ignoring the fit errors).Section VA will show how much this can be enhanced from ANODE.
The additional four features for classification are shown in Fig. 3.The lighter jet mass peaks near m Y and the difference between masses peaks at about m X − m Y ¼ 400 GeV.The τ 21 observables are lower for the two-prong signal jets than for the mostly one-prong background jets.Jet mass and τ 21 are negatively correlated for QCD jets [95] and so τ 21 is higher for J 2 than for J 1 .

FIG. 2. Histograms for the invariant mass of the leading two jets for the Standard Model background as well as the injected signal.
There are 1 million background events and 1000 signal events.

FIG. 3. The four features used for classification
21 (bottom left), and τ J 2 21 (bottom right).These histograms are inclusive in m JJ .There are 1 million background events and 1000 signal events for the mass histograms.
The conditional MAF (along with most methods of density estimation) has difficulty at sharp, discontinuous edges and boundaries, so we first transform the dataset before performing density estimation.First, all features are linearly scaled to be ðfeatureÞ ↦ x ∈ ½0; 1.Then, the logit transformation logðx=ð1 − xÞÞ is applied to map the scaled features to be between ð−∞; ∞Þ.The Jacobian for this map is accounted for when computing probability densities for the original feature space.Even with this transformation, density estimation is difficult near the boundaries.Therefore, the scaled features are required to have 0.05 < x < 0.95.This keeps 95% (72%) of the signal (background) in the SR.Below we will refer to this as the "fiducial region."All results below are computed with respect to the number of events after this truncation.

A. Sensitivity
The conditional MAF is optimized3 using the loglikelihood loss function, logðpðxjmÞÞ.All of the neural networks are written in PyTorch [120].For the hyperparameters, there are 15 MADE blocks (one layer each) with 128 hidden units per block.Networks are optimized with Adam [121] using a learning rate 10 −4 and weight decay of 10 −6 .The SR and SB density estimators are each trained for 50 epochs.No systematic attempt was made to optimize these hyperparameters, and it is likely that better performance could be obtained with further optimization.For the SR density estimator, the last epoch is chosen for simplicity and it was verified that the results are robust against this choice.The SB density estimator significantly varies from epoch to epoch.Averaging the density estimates pointwise over 10 consecutive epochs results in a stable result.Averaging over more epochs does not further improve the stability.All results with ANODE present the SB density estimator with this averaging scheme for the last 10 epochs.
Figure 4 shows a scatter plot of RðxjmÞ versus log p background ðxjmÞ for the test set in the SR.As desired, the background is mostly concentrated around RðxjmÞ ¼ 1, while there is a long tail for signal events at higher values of RðxjmÞ and between −2 < log p background ðxjmÞ < 2. This is exactly what is expected for this signal: it is an overdensity (R > 1) in a region of phase space that is relatively rare for the background (p background ðxjmÞ ≪ 1).
The background density in Fig. 4 also shows that the RðxjmÞ is narrower around 1 when p background ðxjmÞ is large and more spread out when p background ðxjmÞ ≪ 1.This is evidence that the density estimation is more accurate when the densities are high and worse when the densities are low.This is also to be expected: if there are many data points close to one another, it should be easier to estimate their density than if the data points are very sparse.
Another view of the results is presented in Fig. 5, with one-dimensional information about RðxjmÞ in the SR.The left plot of Fig. 5 shows that the background is centered and approximately symmetric around R ¼ 1 with a standard deviation of approximately 17%.This width is due to various sources, including the accuracy of the SR density, the accuracy of the SB density, and the quality of the interpolation from SB to SR.Each of these sources has contributions from the finite size of the datasets used for training, the neural network flexibility, and the training procedure.The right plot of Fig. 5 presents the number of background and signal events as a function of a threshold R > R c .The starting point are the original numbers background (40,000) and signal (400) numbers in the SR window and the fiducial window.Starting from low S=B and S= ffiffiffi ffi B p one can achieve S=B > 1 and a high S= ffiffiffi ffi B p with a threshold requirement on R. Figure 6 shows that the signal is clearly visible in the x distribution after applying such a threshold requirement.
The performance of R as an anomaly detector is further quantified by the receiver operating characteristic (ROC) and significance improvement characteristic (SIC) curves in Fig. 7.These metrics are obtained by scanning R and computing the signal efficiency (true positive rate) and background efficiency (false positive rate) after a threshold requirement on R. The area under the curve for ANODE is 0.82.For comparison, the CWoLa hunting approach is also shown in the same plots.The CWoLa classifier is trained using sideband regions that are 200 GeV wide on either side of the SR.The sidebands are weighted to have the same number of events as each other and in total, the same as the SR.A single NN with four hidden layers with 64 nodes each is trained using Keras [122] and TensorFlow [123].Dropout [124] of 10% is used for each intermediate layer.Intermediate layers use rectified linear unit activation functions and the last layer uses a sigmoid.The classifier is optimized using binary cross entropy and is trained for FIG. 5. Left: histogram of RðxjmÞ evaluated on the test set; right: the integrated number of events that survive a threshold on RðxjmÞ.The two distributions are scaled to represent the rates for 500,000 total background events and 500 total signal events, as introduced in Sec.IV.FIG. 6. Distributions of m J 1 (left) and m J 2 − m J 1 (right) in the signal region after applying a threshold requirement on R. 300 epochs.As with ANODE, ten epochs are averaged for the reported results. 4he performance of ANODE is comparable to CWoLa hunting in Fig. 7, which does slightly better at higher signal efficiencies and much better at lower signal efficiencies.This may be a reflection of the fact that CWoLa makes use of supervised learning and directly approaches the likelihood ratio, while ANODE is unsupervised and attempts to FIG. 8. Left: the number of events after a threshold requirement R > R c using the two integration methods described in Sec.III B, as well as the true background yield.Right: the ratio the predicted and true background yields from the left plot, as a function of the actual number of events that survive the threshold requirement.The shaded bands around the central predictions are the 1σ statistical (Poisson) uncertainty derived from the observed background counts.The black dashed and dotted lines are 10% and 20% around a ratio of 1. FIG. 9. A comparison of the four features x between the SR and two nearby sidebands defined by m jj ∈ ½3.1; 3.3 TeV (lower sideband) and m jj ∈ ½3.7; 3.9 TeV (upper sideband).learn both the numerator and denominator of the likelihood ratio.With this dataset, ANODE is able to enhance the signal significance by about a factor of 7 and would therefore be able to achieve a local significance above 5σ given that the starting value of S= ffiffiffi ffi B p is 1.6.

B. Background estimation
This section explores the possibility of using the estimate of p background ðxjmÞ to directly determine the background efficiency in the SR after a requirement on R > R c . Figure 8 presents a comparison between integration methods (direct integration and importance sampling) described in Sec.III B and the true background yields.Qualitatively, both methods are able to characterize the yield across several orders of magnitude in background efficiency.However, both methods diverge from the truth in the extreme tails of the R distribution.The right plot of Fig. 8 offers a quantitative comparison between methods.For efficiencies down to about 10 −3 , both methods are accurate within about 25%.The direct integration method has a smaller bias of about 10%.This is consistent with Fig. 5, for which the standard deviation is between 10% and 20%.

C. Performance on a dataset with correlated features
The results presented in the previous sections have established that ANODE is able to identify the signal and estimate the corresponding SM backgrounds introduced in Sec.IV.One fortuitous aspect of the chosen features x introduced in Sec.IV is that they are all relatively independent of m jj .This is illustrated in Fig. 9, using the SR and neighboring sideband regions.As a result of this independence, the CWoLa method is able to find the signal and presumably the ANODE interpolation from SB to SR is easier than if there was a strong dependence.
The purpose of this section is to study the sensitivity of the ANODE and CWoLa hunting methods to correlations in the features x with m jj .Based on the assumptions of the two methods, it is expected that with strong correlations, CWoLa hunting will fail to find the signal while ANODE should still be able to identify the presence of signal in the SR as well as estimate the background.To study this sensitivity in a controlled fashion, correlations are introduced artificially.In practice, adding more features to x will inevitably result in some dependence with m jj ; the artificial example here illustrates the challenges already in low dimensions.New jet mass observables are created, which are linearly shifted, ð5:1Þ where c ¼ 0.1 for this study.The resulting shifted lighter jet mass is presented in Fig. 10.New ANODE and CWoLa models are trained using the shifted dataset and their performance is quantified in Fig. 11.As expected, the fully supervised classifier is nearly the same as Fig. 7. ANODE is still able to significantly enhance the signal, with a maximum significance improvement near 4.While in principle ANODE could achieve the same classification accuracy on the shifted and nominal datasets, the performance on the shifted examples is not as strong as in Fig. 7.In practice, the interpolation of p background into the SR is more challenging now due to the linear correlations.This could possibly be overcome with improved training, better choices of hyperparameters, or more sophisticated density estimation techniques.By construction, there are now bigger differences between the SR and SB than between the SR background and the SR signal.Therefore, the CWoLa hunting classifier is not able to find the signal.This is evident from the ROC curve in the left plot of Fig. 11, which shows that the signal-versus-background classifier is essentially random while the SR-versus-SB classifier has learned something non-trivial.
Last, Fig. 12 shows the performance of direct density estimation for the background prediction using the shifted dataset.The performance is comparable to the unshifted dataset (Fig. 8), meaning that ANODE could potentially be used as a complete anomaly detection method even in the presence of feature spaces.

VI. CONCLUSIONS
This paper has presented a powerful new modelindependent search method called ANODE, which is built on neural density estimation.Unlike other approaches, ANODE directly learns the background probability density and data probability density in a signal region.The ratio of these densities is a powerful classifier and the background density can be directly used to estimate the background efficiency from a threshold requirement on the classifier.Finally, ANODE is robust against correlations in the data, which tend to break other model-agnostic sideband methods such as CWoLa.
The results presented in this paper are meant to be a proof of concept of the general method, and there are many exciting future directions.For example, while this paper focused on collider searches for BSM, the ANODE method is completely general and could be applied to many areas beyond high energy physics, including astronomy and astrophysics.Similarly, while the demonstrations here were based on the innovative MAF density estimation technique, the ANODE method can be used in conjunction with any density estimation algorithm.Indeed, there are numerous other neural density estimation methods from the past few years that claim state-of-the-art performance, including neural autoregressive flows [125] and neural spline flows [126]; exploring these would be an obvious way to attempt to improve the results in this paper.In addition, it would be interesting to attempt the ANODE method on even higherdimensional feature spaces, all the way up to the full lowlevel feature set of the four vectors of all the hadrons in the event.This might already be feasible with existing neural density estimators, at is it common to evaluate their performance on high-dimensional datasets ranging from UCI datasets [127] with up to ∼50 features, to image datasets such as MNIST [128] and CIFAR-10 [129] which have hundreds and thousands of features, respectively.The prospects for the ANODE method are exciting: as the field of neural density estimation continues to grow within the machine learning community, ANODE will become more sensitive to resonant new physics in collider high energy physics and beyond.ACKNOWLEDGMENTS D. S. is grateful to Matt Buckley and John Tamanas for many fruitful discussions on neural density estimation.We are especially grateful to John Tamanas for help with the FIG.12.The same as Fig. 8, but for the shifted dataset.In particular, these plots compare the background prediction from two direct density estimation techniques with the true background yield after a threshold requirement RðxjmÞ > R c .BENJAMIN

APPENDIX: COMMENTS ON OPTIMALITY
The Neyman-Pearson lemma only applies to simple hypothesis tests.The lemma states that for a fixed probability of rejecting the null hypothesis when it is true (level), the probability for rejecting the null hypothesis when the alternative is true (power) is maximized with the likelihood ratio test statistic.For supervised searches with profiled nuisance parameters or for anomaly detection with a composite alternative hypothesis, there is no uniformly most powerful classifier.The goal of this brief section is to clarify what is meant by asymptotically optimal anomaly detection.
For any given BSM model, the procedures labeled asymptotically optimal are likely not optimal.The sense in which they are optimal is as follows.Let the null hypothesis H 0 be that the data are distributed according to p background , a density describing the phase space of the background-only.Furthermore, let the alternative hypothesis H A be that the data are distributed according to p data , the learned density of the data.Distinguishing H 0 from H A is a simple hypothesis test.Therefore, the test statistic p background =p data has the property that for a fixed probability for rejecting H 0 given data ∼ p background , the probability for rejecting H 0 is as high as possible when H A is true (which it is).If p background ¼ p data , then power ¼ level.So ANODE is asymptotically optimal for rejecting the data as background-only, but is not "optimal" for rejecting any particular BSM model.
FIG.1.A graphical representation of searches for new particles in terms of the background and signal model dependence for achieving signal sensitivity (a) and background specificity (b).The Model Unspecific Search for New Physics (MUSiC)[20,21] and general search[22][23][24] strategies are from CMS and ATLAS, respectively.LDA stands for latent dirichlet allocation[38,80], ANOmaly detection with Density Estimation (ANODE) is the method presented in this paper, CWoLa stands for classification without labels[33,34,79], and SALAD stands for simulation assisted likelihood-free anomaly detection[39].Direct density estimation is a form of sidebanding where the multidimensional feature space density is learned conditional on the resonant feature (see Sec. III B).

1 ;
p Y ðyÞ ¼ p X ðxÞ det ∂f ∂x −ð3:2Þ where x and y are realizations of X and Y, respectively, X and Y have the same dimension, and Y ¼ fðXÞ is an invertible function.The process in Eq. (3.2) can be repeated to build a normalizing flow, p Y ðyÞ ¼ p X ðxÞ

FIG. 4 .
FIG. 4. Scatter plot of RðxjmÞ versus log p background ðxjmÞ across the test set in the SR. Background events are shown (as a twodimensional histogram) in gray scale and individual signal events are shown in red.

FIG. 10 .
FIG.10.The lighter jet mass for the SR and the lower and upper sideband regions after the shift defined by Eq. (5.1).

TABLE I .
A table with the common pairings of search strategy for signal sensitivity (left column), the background estimation (middle column), and an example search (right column).
NACHMAN and DAVID SHIH PHYS.REV.D 101, 075042 (2020) 075042-12 conditional MAF code.Additionally, we would like to thank Kyle Cranmer and Uroš Seljak for helpful discussions and Nick Rodd and John Tamanas for helpful comments on the draft.This work was supported by the U.S. Department of Energy, Office of Science under Contract No. DE-AC02-05CH11231.D. S. is supported by DOE Grant No. DOE-SC0010008.D. S. thanks LBNL, BCTP, and BCCP for their generous support and hospitality during his sabbatical year.