Full Phase Space Resonant Anomaly Detection

Physics beyond the Standard Model that is resonant in one or more dimensions has been a longstanding focus of countless searches at colliders and beyond. Recently, many new strategies for resonant anomaly detection have been developed, where sideband information can be used in conjunction with modern machine learning, in order to generate synthetic datasets representing the Standard Model background. Until now, this approach was only able to accommodate a relatively small number of dimensions, limiting the breadth of the search sensitivity. Using recent innovations in point cloud generative models, we show that this strategy can also be applied to the full phase space, using all relevant particles for the anomaly detection. As a proof of principle, we show that the signal from the R\&D dataset from the LHC Olympics is findable with this method, opening up the door to future studies that explore the interplay between depth and breadth in the representation of the data for anomaly detection.


I. INTRODUCTION
In the pursuit of physics beyond the Standard Model (BSM), it is apparent that a parallel path is required to the traditional signal-specific paradigm.Many physics models have too many parameters to scan efficiently and often specific assumptions are still required to interpret the scanned parameter space [1][2][3][4][5][6].Additionally, it may be that new phenomena are not described by known physics models.While most BSM searches posit a specific model or models in the design and execution of an analysis, anomaly detection is an alternative strategy that tries to be agnostic to any particular signal model.
While a number of anomaly detection methods have been applied to resonances, the most sensitive approaches use weakly-supervised learning techniques that are tailored for the resonant case [7,15,17].Of these, nearly all use only a handful of features to avoid sculpting the resonant feature and/or to cope with previous computational limitations.Feature selection implicitly introduces a prior over theoretical models, as only certain signals would be observable with the chosen features.While general symmetry considerations can provide the basis for a model agnostic feature reduction, the number of relevant features required to maintain broad sensitivity is still large [27,28].
We propose a strategy to perform weakly supervised, resonant anomaly detection using a maximal feature set -all of the relevant particles in an event.Our approach is based on the Cathode method introduced in [16]: first, conditional generative models are trained on sideband regions and interpolated into the signal region in order to create a synthetic sample of events following the background distribution; second, a classifier is trained to distinguish between background and data, and the classifier output serves as the anomaly score.In [16], the method was demonstrated using an ordinary normalizing flow trained on a small set of high level features (masses and n-subjettinesses).Here, we extend Cathode using recent innovations in point cloud generative models in order to generate the full set of low-level features -the four momenta of all of the jet constituents.
We will see that both generative models have comparable performance on the anomaly detection task.They are able to use all of the particles inside jets to identify the anomaly in the R&D dataset of the LHC Olympics data challenge [7,37].Currently, our approach is limited in the sense that the sensitivity to the anomaly is only present when the BSM cross section is quite large.However, we expect this sensitivity to improve with future innovations with the generative modeling and/or the classifier.We expect that searching over so many dimensions enhances the breadth of sensitivity at the cost of depth of sensitivity to particular physics models.Future studies will explore the details of the breadth/depth tradeoff.
This paper is organized as follows.Section II briefly introduces the dataset.Our machine learning methods are described in Sec.III.The numerical results are presented in Sec.IV and the paper ends with conclusions and outlook in Sec.V.

II. DATASET
To illustrate our approach, we use the R&D dataset from the LHC Olympics data challenge [7,37] which is briefly described in the following.The background is modeled as inclusive dijets and the signal is resonant boson production A → B(→ qq ′ )C(→ qq ′ ) with masses m A , m B , m C = 3.5, 0.5, 0.1 TeV, respectively.The events are generated with Pythia8 [38] and Delphes3.4.1 [39][40][41] provides a detector simulation.Jets are defined using the anti-k T jet clustering algorithm [42] as implemented in FastJet [43] with R = 1.The focus is on events defined by two high energy jets capturing the decay products of the B and C particles.The pair of jets has a large dijet mass (corresponding to m A ) and nontrivial substructure (corresponding to the two-prong decay).Consequently, the leading jet is required to have p T > 1.2 TeV.
Each event is represented by up to 700 particles with a three-momentum per particle; after selection of the two most energetic jets, the maximum particle multiplicity observed is reduced to 279 per jet.For many previous studies, this high-dimensional information was compressed into a small set of features, such as the masses of the leading jets and the n-subjettiness observables τ 1 and τ 2 [44,45] of the two jets.Instead, we consider all of the particles within the leading two jets.For applications at the LHC, we envision a similar approach using inputs based on reconstructed objects such as Unified Flow Objects [46] in ATLAS or particle flow inputs [47] in CMS.

III. METHODS
We introduce two approaches for generating the dijet system.Both approaches are based on the same structure: one model generates overall jet properties and a second model generates the jet constituents conditioned on the overall features.For the first model, each jet is represented by its transverse momentum p T , pseudo-rapidity η, azimuthal angle ϕ, jet mass m, and constituent particle multiplicity N .Jet constituents (particles) are generated in the relative set of coordinates defined by (p Trel , ∆η, ∆ϕ), where p Trel = p Tpart./p Tjet , ∆η = η part.− η jet and ∆ϕ = ϕ part.− ϕ jet .
Similarly to previous studies, we define the signal region of interest for events with 3300 GeV < m jj < 3700 GeV.The events used for the training in the sidebands are restricted between the interval 2300 GeV < m jj < 3300 GeV and 3700 GeV < m jj < 5000 GeV.The interval range for the sideband region is defined such that enough training data is available for the generative model while avoiding boundary effects at the extremes of the dijet mass distribution.Moreover, 60% of the events in the sideband regions are used for training and testing, while 40% are used for the validation and further refinement of the generative model.In order to get the m jj values in the SR, we perform a kernel density estimate (KDE) fit to the training data.The KDE was implemented with a gaussian kernel and a bandwidth of 0.001 using the Scikit-learn [48] library.

A. Diffusion Generative Models
Our first generative approach is based on the diffusion models introduced in Ref. [32].The neural network architecture used for both the dijet system generation and particle generation are based on a combination of DeepSets [49] with attention layers [50] such that permutation equivariance for the network outputs is guaranteed both at the jet-and particle-generation level.Both diffusion models improve the sensitivity to the time information by feeding random Fourier features [51] through two fully connected layers.This conditional information is then combined with values of m jj after a log-transformation.In the case of the particle generation model, the additional jet kinematic information is passed through a fully connected layer before being concatenated with the other conditional inputs.A diagram showing the interplay between the generative models is shown in Fig. 1.Hyperparameter choices are listed in Tab.I.
To generate events, we use the trained diffusion models with the DDIM [52] sampler with 512 time-intervals and additional second-order correction [53] to improve precision.A simple pre-processing is applied to all inputs to the generative models to ensure all distributions, for particle and jet kinematics, are transformed to have zero mean and unit variance.The training is carried out using the Perlmutter supercomputer interfaced with the Horovod package [54] for distributed training.16 NVIDIA A100 GPUs are used simultaneously during training, while a single GPU is used for evaluation.All models are trained for up to 300 epochs with a cosine learning rate schedule [55] with initial learning rate of 1.6 × 10 −3 .If the loss function does not decrease for 30 consecutive epochs, evaluated in a separate testing set representing 10% to the sample size, the training is stopped.The implementation of all models is carried out using Keras backend [56] with TensorFlow [57].

B. Flow Matching Generative Models
The second approach is based on the optimal transport flow matching training objective for training continuous normalizing flows (CNFs) [34] introduced in Ref. [35].For the jet feature model, the architecture is a fully connected network.The particle feature model is based on the EPiC-FM model introduced in Ref. [58] and combines the flow matching objective with fast and linearly scalable equivariant point cloud (EPiC) layers [36].Before feeding the point cloud into the EPiC layers, the data is projected into a higher dimensional representation.Because EPiC layers require additional global features as inputs, these are calculated with an average and summation pooling.
Both models incorporate an embedded time vector and interpolation into the signal region is possible due to the conditioning of the jet feature model on m jj .The conditioning for the particle feature model consists of the embedded time and the jet features p T , η, ϕ, and m.In both models the conditioning is added to all multilayer perceptrons (MLPs).A visualization of the model architecture can be found in Fig. 2, and for an overview of all hyperparameter choices we refer to Tab.I.
In the generation process, the second order midpoint ordinary differential equation solver with 250 steps (500 function evaluations) from the Torchdyn [59] library is used.As for the diffusion models, a simple pre-processing ensures all distributions to have zero mean and unit variance.The models are implemented with PyTorch [60] using the PyTorch Lightning [61] framework.

C. Classifier
To obtain the final anomaly score, we employ classifiers trained to distinguish the generated, synthetic background samples from the data -this is the weaklysupervised anomaly detection approach proposed in Ref. [16] following the CWoLa framework of Refs.[10,11,62].In this work, we employ a transformer-based classifier architecture with nearly the same structure as the transformer used in the diffusion model.It first takes the dijet system consisting of the jet kinematic information and uses DeepSets combined with attention layers to map the jets to a common latent embedding where the clustered particles are also represented after applying a similar set of DeepSets layers with added attention.The attention layers used to process the particle information consider in the attention matrix only particles belonging to the same jet, with same set of functions reused between the two jets to ensure the entire classifier is permutation invariant with respect to the order of the input jets and the order of the input particles belonging to each jet.In the same latent embedding space, the dijet system and particles are then concatenated in the particle dimension, resulting in a hybrid graph with N 1 + N 2 + 2 "particles", represented by the two jets containing N 1 and N 2 particles, respectively.A fully connected layer is then added after the concatenation, operating only at the feature level, followed by a dropout operation [63] and LeakyRelu activation function.An average pooling operation is then used in the particle dimension to combine the information of all the objects.A second fully connected layer with dropout and LeakyRelu activation are used before the output layer with a fully connected layer of single output and sigmoid activation function.The classifier shares the same learning rate schedule and early stopping criteria as the diffusion models.A visual representation of the classifier architecture is shown in Fig. 3.

IV. RESULTS
We evaluate the performance of the generative model first by comparing distributions of generated quantities with the ones present in the simulation of sideband events.Histograms of the jet and particle kinematic information are shown in Fig. 4. We observe a good agreement between generated and simulated quantities, often to better than 10% accuracy.
We further evaluate the quality of the generative models using a binary classifier trained to distinguish between generated and reference samples, as proposed in [64].Using the same classifier architecture described in Sec.III, trained on generated vs. reference events in the sideband region, we obtain an area under the curve (AUC) metric calculated over 10 independent trainings of 0.514 ± 0.07 and 0.522 ± 0.006 for the diffusion and flow matching models respectively, values very close to the best attainable result of 0.50 representing the scenario where the classifier is unable to tell the difference between the samples.Uncertainties of the AUC values are derived from 10 independent trainings of the classifier using different random initialization.
The next step is to generate background observations in the signal region.Values of m jj are first generated by sampling values from a kernel density estimator trained to learn the probability density of the m jj spectra based on the values of the observations coming from the sideband regions.The values sampled in the signal region are then used as conditional inputs to the generative models.A total of 200k background events are generated in the signal region, roughly a factor of two more than the amount of true background events in the data within the signal region. 1 A comparison of the simulated and generated background distributions in the signal region is shown in Fig. 5.We observe a good agreement between simulated and generated background events with only noticeable differences observed at the extremes of the relative η and ϕ distributions.In the absence of signal, the classifier trained to distinguish between generated vs. reference samples yield an AUC of 0.52 ± 0.01 and 0.53 ± 0.01 for the diffusion and flow matching models respectively.
As a last step, we derive anomaly scores by training the classifier in the signal region to distinguish the neural network-generated background events from the data observed in the signal region.During the training, 200k generated samples are trained against the data consisting of 100k background events and different amount of injected signal.From this dataset, 80% is used for training and 20% for validation.The training is stopped when the loss in the validation set does not decrease for 20 consecutive epochs.For each different signal injection value we train ten classifiers with different random initialization.
The evaluation of the classifier is performed using an independent dataset [37,65] consisting of 100k signal and  background events, used to study different performance metrics.From the classifier outputs, we calculate the significance improvement characteristic curve (SIC) defined as the signal efficiency divided by the square root of the background efficiency versus the signal efficiency.
The SIC represents a multiplicative scaling by which the significance of a signal would increase corresponding to a threshold cut given by a particular signal efficiency.
In Fig. 6, we show an example of the median of the receiver operating characteristic (ROC) and SIC curves, obtained for a signal region containing 2000 signal and 100k background events.The uncertainty band is derived from the 68% confidence bands of 10 independent classifier trainings.The values are compared with an idealized anomaly detector [16] created by training the classifier with independent background events simulated with the same physics simulator used to generate the background in the signal region.A total of 200k background events are used to train the classifier taken from an independent background sample [65].The idealized classifier training follows the same training strategy discussed before.The evaluation is always performed on a large set of signal events and independent background sample [65], to mitigate the statistical fluctuations from the signal injection.Both SIC and AUC values are always calculated using the signal and background efficiencies.Both generative models show excellent performance at identifying the signal component, showing a maximum value of the SIC curve near 40, limited by the amount of background events available in the signal region.We also investigate the maximum SIC value and AUC obtained by training the classifier with different amounts of signal injection.The results are shown in Fig. 7, compared with the idealized anomaly detector results and with a retrained version of the original Cathode method [16] in Fig. 7.
At lower amounts of signal injection, low-dimensional Cathode outperforms even the idealized anomaly detector due to the high dimensionality of the inputs to the latter vs. the hand-picked high-level features input to the former.
Above 1k signal injected events, the idealized anomaly detector matches the maximum performance obtained by the low-dimensional Cathode and continues to improve as the amount of signal increases.Similar conclusions are made for the diffusion and flow matching generative models, which are able to reach a maximum SIC value as high as 50 for large amounts of signal injection.At lower amounts of signal injection, the AUC of the ROC curve obtained by the models is compatible with 50%, showing that the background prediction in the signal region is precise.

V. CONCLUSIONS AND OUTLOOK
Anomaly detection provides a new set of tools for exploring particle physics data in search of BSM physics.Machine learning is empowering these approaches to make the most of our complex datasets.We have explored how state-of-the-art generative neural networks can be deployed in a resonant anomaly detection mode to use all of the available information in hadronic final states.On the LHC Olympics dataset, we find that this method is able to automatically identify the presence of anomalies, even in the high-and variable-dimensional feature space so long as there is enough signal.This performance comes at the tradeoff of specificity.A simpler model trained on a small set of features known to be useful for the signal performs better at low signal injection and worse at relatively high signal injection.This crossover point currently happens around 5-6σ in injected signal, but future innovations may be able to bring this down.It will also be exciting to explore the breadth of signals that this new approach can find.Quantifying the discovery volume will be a complex challenge that is already difficult for existing methods and ever more acute with the full phase space.

FIG. 4 .
FIG. 4. Comparison between jet (first five histograms) and particle (last three histograms) kinematic distributions for samples in the dijet mass sideband.Distributions generated by the diffusion and flow matching models are shown in red and blue respectively, while expected distributions from the pseudo-data are shown as filled histograms.

FIG. 5 .
FIG. 5. Comparison between jet (first five histograms) and particle (last three histograms) kinematic distributions for samples in the dijet mass signal region.Distributions generated by the diffusion and flow matching models are shown in red and blue respectively, while expected background distributions from the pseudo-data are shown as filled histograms.
TrelN , η relN , ϕ relN ), Network architecture of the diffusion generative model.Dijet kinematic information is first generated with a dedicated diffusion model.For each jet created, particles are generated through a second diffusion process conditioned on the jet kinematic information.A conditional embedding is used to process the conditional inputs used for both generative models consisting of the invariant mass of the dijet system (mjj), the time information, and the jet kinematic information in the case of the particle generation model.Network architecture of the flow matching generative models.The upper architecture first generates the kinematic dijet information, the lower architecture generates the particles of each jet based on the jet kinematic information.The jet feature model is conditioned on time t and dijet mass mjj, while the particle feature model is explicitly conditioned on the jet features pT, η, ϕ, and m.N is used to determine the number of particles per jet.
FIG.6.Significance improvement characteristic curve (SIC) (left) and receiver operating characteristic (ROC) (right) obtained by a classifier trained on a sample containing 2000 signal and 100k background events.SIC and ROC values are left empty if the number of background events is below 10 after applying a given threshold with the classifier output to avoid large statistical fluctuations.FIG.7.Median of the Maximum value of the significance improvement characteristic (SIC) curve (left) and area under the curve of the receiver operating characteristic (AUC) (right) as a function of the number of injected signal events.Uncertainties are derived from the 68% confidence bands calculated over ten independent classifier trainings with random initialization.