Self-supervised Anomaly Detection for New Physics

We investigate a method of model-agnostic anomaly detection through studying jets, collimated sprays of particles produced in high-energy collisions. We train a transformer neural network to encode simulated QCD "event space" dijets into a low-dimensional "latent space" representation. We optimize the network using the self-supervised contrastive loss, which encourages the preservation of known physical symmetries of the dijets. We then train a binary classifier to discriminate a BSM resonant dijet signal from a QCD dijet background in both the event space and the latent space representations. We find the classifier performances on the event and latent spaces to be comparable. We finally perform an anomaly detection search using a weakly supervised bump hunt on the latent space dijets, again finding a performance comparable to that of a search run on the physical space dijets. This opens the door to using low-dimensional latent representations as a computationally efficient space for resonant anomaly detection in generic particle collision events.


I. INTRODUCTION
A central goal of high energy physics is to find the theory that will supersede the Standard Model. A number of competing models exist, with many involving new resonant particles [1,2] such as supersymmetric partners or weakly interacting massive particles. However, the number of candidate particles is too large to justify a hunt-and-pick procedure of data analysis. Therefore it is advantageous to consider methods of anomaly detection that are agnostic to a particular underlying model of new physics.
There are a number of recent machine learning (ML) based anomaly detection proposals designed to reduce model dependence (see Refs. [3][4][5][6] for overviews). Notably, most existing methods in the field (including the first result with data from ATLAS [7]) are best-performing in low-dimensional spaces. However, a single event from a particle collision experiment can have on the order of a thousand degrees of freedom.
A resolution to this tension is to reduce the dimensionality of an entire particle collision event while preserving its essential character such that a search for anomalies can be done in the reduced-dimension space. A number of methods exist for carrying out this phase space reduction. For example, one could choose a set of observables (e.g. mass, multiplicity) and perform anomaly detection on this set. However, attempts to reduce dimensionality by selecting a set of observables implicitly favor a certain class of models.
Dimensionality reduction can also be performed with unsupervised ML techniques, which ensure a model-agnostic approach. A common tool in the ML literature for this compression is the autoencoder (AE). An autoencoder is a pair of neural networks whereby one function encodes event space data into a latent space and a second function decodes the latent space back into the event space. No labels are needed because the AE is trained so that the composition of the encoder and decoder is close to the identity, producing a high reconstruction efficiency. While effective (see e.g. [8][9][10][11]), AE-based tools may not be ideal for anomaly detection. For one, nothing in their architectures or loss functions ensures that anomalies, or even basic physical (i.e. geometric) properties of events, are preserved by the encoder. Additionally, most AEs studied so far cannot process a variable number of inputs per event, which would require the decoder to generate a variable number of outputs and the loss to compare events of variable dimensionality.
These problems can be circumvented by making use of self-supervised contrastive learning techniques, which use only the inherent symmetries of physical data to perform a dimensionality reduction. Recent studies have explored this; for example with astronomical images in Ref. [12], and with constituent level jet data in Ref. [13]. In the former, a ResNet50 architecture [14] is used to map each astronomical image to a latent representation, while in the latter, a permutation-invariant transformer-encoder architecture maps jet constituents to a latent representation. The networks are trained on the contrastive loss, which ensures that the latent space representations faithfully model the physical symmetries of the original objects. Further analysis is then done directly on the latent space representations.
In this paper, we continue the explorations of latent space representations of particle collisions originally carried out in Ref. [13]. We first demonstrate that particle collisions can be well-modeled in a latent space representation with a dimensionality that is an order of magnitude smaller than that of the original events. As part of this work, we extend the per-jet approach of Ref. [13] to a per-event structure. We then conduct a low-dimension model-agnostic anomaly search in the latent space representations of the particle collision events. For this we use the Classification Without Labels (CWoLa) technique [15][16][17], which uses deep neural-network classifiers to distinguish between anomaly-enriched and anomaly-depleted events. We conduct these studies on dijet resonance events in the LHC Olympics dataset [18].

[Fig. 1: Event space dijets and their symmetry-augmented versions are fed as input into the network, which creates a mapping into the latent space by training on the contrastive loss function. The output of the transformer-encoder network is then passed through a head network, consisting of two fully connected layers (FCN1 and FCN2). In practice, the representations from the first fully connected layer perform the best in signal versus background classification tasks.]
The structure of this paper is as follows. In Sec. II, we motivate the relevance of contrastive learning to modeling particle collisions and introduce a dataset of dijet events. We further outline a set of symmetry augmentations for the contrastive loss function that leave the essential character of dijet events invariant and explain how these are used in the contrastive learning approach. Lastly, in this section we outline the CWoLa anomaly detection method where we will use the self-supervised event representations. In Sec. III, we implement the contrastive learning method using a transformer neural network to map the dijet events from the event space into a latent space and evaluate the efficiency of the encoding. In Sec. IV, we use the CWoLa method to perform a relevant, but simplified, anomaly bump-hunt analysis using the latent space representations for the dijet events. The paper ends with conclusions and outlook in Sec. V.

II. METHODS
Our overarching goal is to optimize a mapping from the event space of particle collision events (i.e. the representation in the space of the individual particles) to a new latent space representation. The mapping between the event and latent space representations of particle collisions should maximally exploit the physical symmetries of particle collision events. In this way, the dimensionality of the events can be reduced from a few hundred degrees of freedom (corresponding to the momentum 4-vectors of the particles) to a few tens.
Such a mapping can be realised by using a transformer-encoder neural network architecture [19]. Event space events are fed into a transformer neural network where they are embedded into a reduced-dimension latent space. A distinguishing feature of the transformer architecture is permutation invariance: the latent space representations are invariant with respect to the order in which the event constituents are fed into the network. The transformer-encoder network consists of 4 attention heads and 2 layers. The output of the transformer-encoder network is fed into two fully connected layers of the same latent space size. These architecture parameters were not heavily optimized.
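To make the architecture concrete, here is a minimal PyTorch sketch of such a permutation-invariant transformer encoder with a two-layer head network. All class and parameter names are our own illustration, not the reference implementation (which is listed under Code Availability):

```python
import torch
import torch.nn as nn

class TransformerEncoderNet(nn.Module):
    """Permutation-invariant transformer encoder mapping (pT, eta, phi)
    constituents to a fixed-size latent vector. Hyperparameters follow
    the text: 4 attention heads, 2 encoder layers, followed by two
    fully connected head layers (FCN1, FCN2)."""

    def __init__(self, latent_dim=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(3, latent_dim)  # per-constituent embedding
        layer = nn.TransformerEncoderLayer(
            d_model=latent_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # head network: two fully connected layers of the same size
        self.fcn1 = nn.Linear(latent_dim, latent_dim)
        self.fcn2 = nn.Linear(latent_dim, latent_dim)

    def forward(self, x, mask=None):
        # x: (batch, n_constituents, 3); mask flags zero-padded slots
        h = self.encoder(self.embed(x), src_key_padding_mask=mask)
        h = h.mean(dim=1)  # order-independent pooling over constituents
        z1 = torch.relu(self.fcn1(h))  # FCN1 output: the representation
        z2 = self.fcn2(z1)             # FCN2 output: fed to the loss
        return z1, z2
```

Because no positional encoding is added and the constituent axis is mean-pooled, the output is invariant under any reordering of the input constituents.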
See Fig. 1 for a schematic of the full network. We find, as do the authors of Ref. [19] and Ref. [13], that the output of the first fully connected head layer gives a better latent space representation than that of the final output layer or that of the transformer-encoder itself.

A. Data Selection and Preparation
For this paper, we focus on the LHC 2020 Olympics R&D dataset [3,18]. The full dataset consists of 1,000,000 background dijet events (Standard Model quantum chromodynamics (QCD) dijets) and 100,000 signal dijet events. The signal comes from the process Z → X(→ qq)Y(→ qq), with three new resonances Z (3.5 TeV), X (500 GeV), and Y (100 GeV). The only trigger is a single large-radius jet (R = 1) trigger with a pT threshold of 1.2 TeV. The events are generated with Pythia 8.219 [20,21] and Delphes 3.4.1 [22]. Each event contains up to 700 particles, each with three degrees of freedom (DoF): pT, η, φ. The average number of nonzero DoF per event is 506 ± 174.
For each event, we cluster the jets using FastJet [23,24] with a radius R = 0.8. We select the two highest-mass jets from each event, and from each jet we select the 50 hardest (highest-pT) constituents, zero-padding any jet with fewer than 50. Note that the average number of constituents per event is 81 with a standard deviation of 16.15. However, we found that including more than 50 constituents per jet did not lead to an appreciable improvement in the performance of the transformer-encoder network. The constituents are assumed to be massless, and so the relevant degrees of freedom for each constituent are (pT, η, φ).
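The per-jet constituent selection described above can be sketched as follows; the function name and array layout are our own illustration:

```python
import numpy as np

def select_constituents(jet, n_keep=50):
    """Keep the n_keep highest-pT constituents of a jet, zero-padding
    jets with fewer constituents. `jet` is an (n, 3) array of
    (pT, eta, phi) for massless constituents."""
    order = np.argsort(jet[:, 0])[::-1]  # sort by descending pT
    jet = jet[order[:n_keep]]
    padded = np.zeros((n_keep, 3))       # zero-padding for short jets
    padded[:len(jet)] = jet
    return padded
```

A jet with, say, three constituents is then represented by its three hardest constituents followed by 47 all-zero rows.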
For analysis, we select jets from the windows pT ∈ [800, 3000] GeV and η ∈ [−3, 3]. The cut on pT was chosen such that the invariant dijet mass mJJ has a lower bound at approximately 2 TeV. This cut removes approximately 12% of eligible events from the LHCO dataset and has the benefit of removing a small tail of events with mJJ below 2 TeV that could appear artificially anomalous to the transformer-encoder network.

B. Contrastive Learning
The contrastive learning method is self-supervised, meaning that it is trained using "pseudo-labels" rather than truth labels. Supervised approaches use truth labels, which exactly identify the nature of each training example. Pseudo-labels are artificial labels created from the data alone, without access to the truth labels. This means that the contrastive learning method is effectively unsupervised and receives no information as to whether the training samples are signal or background. Following JetCLR [13], the pseudo-labels identify jets that are related to each other via some augmentation, for example a symmetry transformation. Using the pseudo-labels, this technique aims to construct a latent space representation of events that exploits their physical symmetries.
As an example, consider the transformer-encoder network's encoding r_j of a dijet event, and the encoding r_j′ of an augmented version of that event. The exact physical symmetries considered in this analysis are outlined in Sec. II C, but for this example, let the augmentation be a random rotation of the dijet event about the beam axis. These two events represent the same underlying physics, as we expect physical events to be symmetric about the beam axis; hence, r_j and r_j′ are often called a positive pair. We would therefore want the transformer-encoder network to map the event and its augmented version into similar regions of the latent space. In contrast, we would expect the transformer to map the event r_j and a different event r_k to distant points of the latent space, since we do not expect a high degree of similarity between two arbitrary events; r_j and r_k are therefore called a negative pair. These positive and negative pairs are exactly the pseudo-labels for the contrastive learning method.
These requirements on the transformer-encoder mapping motivate the expression for the contrastive loss [13],

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\,\mathrm{sim}(r_i,\, r_i')/\tau}}{\sum_{k \neq i} \left[ e^{\,\mathrm{sim}(r_i,\, r_k)/\tau} + e^{\,\mathrm{sim}(r_i,\, r_k')/\tau} \right]}, \qquad (1)$$

where the sum runs over a batch of N events, r_i′ denotes the augmented version of event r_i, and sim(r_i, r_k) = r_i · r_k / (|r_i||r_k|) is the cosine similarity between two latent space representations. The similarity is parameterized by a temperature τ, which balances the numerator and denominator in the contrastive loss. The numerator of the contrastive loss optimizes for alignment, which tries to map events and their augmented versions to similar regions in the latent space. The denominator maximizes uniformity, which tries to use up the entirety of the latent space when creating representations (see Fig. 2).
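As an illustration, a contrastive loss of this kind takes only a few lines of PyTorch. This is a sketch of the closely related SimCLR-style variant (in which the denominator also retains the positive pair), not the reference implementation:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z, z_aug, tau=0.1):
    """SimCLR-style contrastive loss sketch. z and z_aug are
    (batch, dim) latent representations of events and their augmented
    versions; tau is the temperature."""
    z = F.normalize(z, dim=1)          # cosine similarity via unit vectors
    z_aug = F.normalize(z_aug, dim=1)
    sim_pos = (z * z_aug).sum(dim=1) / tau        # alignment term
    # similarities of each anchor to all events and all augmentations
    sim_all = torch.cat([z @ z.T, z @ z_aug.T], dim=1) / tau
    n = z.size(0)
    # mask out each anchor's similarity to itself
    mask = torch.cat([torch.eye(n, dtype=torch.bool),
                      torch.zeros(n, n, dtype=torch.bool)], dim=1)
    sim_all = sim_all.masked_fill(mask, float('-inf'))
    # -log( exp(positive) / sum over remaining pairs )
    return (torch.logsumexp(sim_all, dim=1) - sim_pos).mean()
```

Lowering the loss requires large positive-pair similarity (alignment) and small similarity to the rest of the batch (uniformity), matching the interpretation given above.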

C. Event Augmentations
We now outline the list of symmetry augmentations used to create physically equivalent latent space jets.
We define the following single-jet augmentations: 1. Rotation: each jet is randomly (and independently) rotated about its central axis in the η − φ plane. This is not an exact symmetry, but correlations between the radiation patterns of the two jets are negligible.
2. Distortion: each jet constituent is randomly shifted in the η − φ plane. The shift is drawn from a Gaussian of mean 0 and standard deviation ∼ 1/pT, where pT is the transverse momentum of the constituent being shifted. This shift represents the smearing from detector effects.
3. Collinear split: a small number of the jet constituents ("mothers") are split into two constituents ("daughters") such that the daughters have η and φ equal to that of the mother, and the transverse momenta of the daughters sum to that of the mother.
We define the following event-wide augmentations: 1. η-shift: the dijet event is shifted in a random η direction.
2. φ-shift: the dijet event is shifted in a random φ direction.
Augmentations are applied to each training batch of the transformer. Each jet in the dijet event receives all three of the single-jet augmentations. The full event then receives both event-wide augmentations. See Fig. 3 for a visualization of the jet augmentations in the η − φ plane.
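A minimal numpy sketch of these augmentations follows; the function names, the smearing scale, and the shift ranges are our own illustrative choices:

```python
import numpy as np

def rotate_jet(jet, rng):
    """Rotate constituents about the jet's central axis in eta-phi.
    `jet` is an (n, 3) array of (pT, eta, phi)."""
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    center = np.average(jet[:, 1:], axis=0, weights=jet[:, 0])  # pT-weighted axis
    out = jet.copy()
    out[:, 1:] = center + (jet[:, 1:] - center) @ rot.T
    return out

def distort_jet(jet, rng, scale=1.0):
    """Smear each constituent in eta-phi with width ~ scale/pT."""
    out = jet.copy()
    sigma = scale / np.clip(jet[:, :1], 1e-6, None)
    out[:, 1:] += rng.normal(0.0, sigma, size=out[:, 1:].shape)
    return out

def collinear_split(jet, rng, n_split=2):
    """Split n_split 'mother' constituents into two collinear
    'daughters' whose pT sums to the mother's."""
    idx = rng.choice(len(jet), size=n_split, replace=False)
    frac = rng.uniform(0.0, 1.0, size=n_split)
    daughters = jet[idx].copy()
    daughters[:, 0] *= 1.0 - frac
    out = jet.copy()
    out[idx, 0] *= frac
    return np.vstack([out, daughters])

def shift_event(event, rng, d_eta=1.0):
    """Event-wide shifts in eta and phi applied to all constituents."""
    out = event.copy()
    out[:, 1] += rng.uniform(-d_eta, d_eta)
    out[:, 2] += rng.uniform(-np.pi, np.pi)
    return out
```

By construction, rotation and the event-wide shifts leave every constituent pT untouched, and the collinear split conserves the total pT of the jet.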
The jet augmentations are meant to leave all of the important physical properties of the jets unmodified. As a test, we plot the jet masses of the hardest and second-hardest jets from a subset of the LHC Olympics dataset in Fig. 4a, as well as the n-subjettiness variables τ21 and τ32 in Fig. 4b and Fig. 4c, both before and after the jet augmentations are applied, and find no significant change in the distributions.
We also plot m JJ for the dijet system in Fig. 5, again finding good agreement before and after the augmentations are applied. This confirms that our set of jet augmentations can be seen as true symmetry transformations of the dijet events.

D. Training procedure
We train the transformer-encoder network on a dataset of 50,000 background dijet events and up to 50,000 signal events, optimized on the contrastive loss in Eq. 1. The batch size is set to 400, the largest possible given the computing resources available. The network is trained with a learning rate of 0.0001, an early stopping patience of 20 epochs, and a temperature parameter τ of 0.1. All of these hyperparameters were empirically found to deliver the best transformer performance. The transformer-encoder network is implemented in PyTorch 1.10.0 [25] and optimized with Adam [26]. Jet augmentations are applied batchwise, with each dijet event receiving a different randomized augmentation.

We also construct a binary classification dataset used to evaluate the latent space jet representations. This dataset consists of 85,000 signal and 85,000 background dijet events. We consider two types of binary classification tasks. Fully connected binary classifiers (FCNs) are implemented in PyTorch and optimized with Adam. These networks consist of three linear layers of sizes (64, 32, 1) with ReLU activations, a dropout of 0.1 between each layer, and a final sigmoid layer. The FCN is trained with a batch size of 400, a learning rate of 0.001, and an early stopping patience of 5 epochs. Linear classifier tests (LCTs) are implemented in scikit-learn [27]. Both binary classification tasks discriminate signal from background in the latent space and have access to signal and background labels (i.e. they are fully supervised).
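The FCN architecture described above can be sketched in PyTorch as follows (a minimal version of the described classifier; the helper name is ours):

```python
import torch
import torch.nn as nn

def make_fcn(input_dim):
    """Binary classifier used to probe the latent space: three linear
    layers of sizes (64, 32, 1) with ReLU activations, dropout of 0.1
    between layers, and a final sigmoid, as described in the text."""
    return nn.Sequential(
        nn.Linear(input_dim, 64), nn.ReLU(), nn.Dropout(0.1),
        nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.1),
        nn.Linear(32, 1), nn.Sigmoid(),
    )
```

For a 128-dimensional latent space, `make_fcn(128)` maps each latent vector to a scalar signal probability in [0, 1].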
We further define a "standard test" dataset consisting of 10,000 signal and 10,000 background dijet events, which is used for all binary classifier evaluations.

E. Anomaly Detection
The usefulness of the latent space dijet representations is evaluated in a realistic model-agnostic anomaly detection search setup. CWoLa (Classification Without Labels) [15] is a weakly supervised training method that allows for signal versus background discrimination in cases where training samples of pure signal and background cannot be provided. Such a scenario might occur in resonant bump-hunting, where it is common to define signal and sideband regions, both of which will have a non-negligible fraction of background events [16,17]. The authors of Ref. [15] show that a classifier trained to distinguish two mixed samples (each with a different signal fraction) is in fact also optimal for classifying signal versus background.
In Sec. IV, we run a CWoLa training procedure on the latent space representations. Our mixtures consist of one background-only sample and one sample with a suppressed signal fraction, representing a mixture of background and a rare unknown anomaly. This is the idealized anomaly detection setup in which a sample of pure background can be generated (and is also the starting point of Refs. [28][29][30]). In practice, this may not be possible, and other methods must be used to obtain this dataset directly from data using sideband information [31][32][33][34][35][36].
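The construction of the two CWoLa mixed samples can be sketched as follows; the array names and helper function are our own illustration:

```python
import numpy as np

def make_cwola_samples(background, signal, signal_fraction, rng):
    """Build the two CWoLa mixed samples: sample 1 is background only,
    sample 2 mixes background with a fraction of signal events. A
    classifier trained to separate the two samples is, in the optimal
    limit, also optimal for separating signal from background."""
    n = len(background) // 2
    sample1 = background[:n]                     # background-only mixture
    n_sig = int(signal_fraction * n)
    sample2 = np.vstack([background[n:2 * n - n_sig], signal[:n_sig]])
    labels = np.concatenate([np.zeros(len(sample1)), np.ones(len(sample2))])
    data = np.vstack([sample1, sample2])
    idx = rng.permutation(len(data))             # shuffle for training
    return data[idx], labels[idx]
```

A classifier is then trained on `(data, labels)` exactly as in the fully supervised case, but the labels now refer to the mixtures rather than to signal versus background truth.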

III. EVALUATING THE LATENT SPACE REPRESENTATIONS
We evaluate the ability of our transformer-encoder network to faithfully translate dijet events into a latent space in the following way: we first train the network to embed event space particle collisions into a reduced-dimension latent space. We then perform a binary classification task on the latent space. We test the sensitivity of this setup to the amount of signal present in the training of the transformer as well as in the training of the classifier. The latter test demonstrates the anomaly detection capability of the approach.

A. Quantifying the effect of each augmentation
As a first study, we explore the importance of each of the five symmetry augmentations outlined in Sec. II C. In Fig. 6, we plot the rejection (1 / false positive rate) versus the true positive rate for a FCN trained on the latent space dijet representations. Each curve in the figure represents a transformer-encoder network trained with all of the symmetry augmentations except the one indicated. The transformer-encoder network is trained on 50,000 signal and 50,000 background dijet events, and the dimension of the transformer latent space is held at 128 dimensions.
In general, the performance of the transformer-encoder (as quantified by the receiver operating characteristic (ROC) area under the curve (AUC) for the FCN) drops sizably if any of the symmetry augmentations is not used during the transformer-encoder network training. The worst-performing transformers are those that do not receive the event-wide η-shift or collinear split augmentations. However, the decrease in AUC is significant for every removed augmentation. It is likely that adding further symmetry augmentations would lead to additional improvement in the transformer performance. The AUC scores for the FCNs trained on latent space representations are given in Table I. We additionally include as a performance metric the maximum of the significance improvement characteristic, max(SIC), defined as the maximum over classifier thresholds of (true positive rate)/√(false positive rate). The max(SIC) can be seen as the multiplicative factor by which signal significance improves after performing a well-motivated cut on the dataset.
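The max(SIC) metric can be computed directly from classifier scores; here is a numpy sketch (the function name and the FPR floor are our own choices):

```python
import numpy as np

def max_sic(y_true, scores, min_fpr=1e-3):
    """Maximum significance improvement, max over thresholds of
    TPR/sqrt(FPR), scanned over all score thresholds. A floor on the
    FPR avoids divergence from low-statistics bins."""
    order = np.argsort(scores)[::-1]      # sort by descending score
    y = np.asarray(y_true, dtype=float)[order]
    tpr = np.cumsum(y) / y.sum()          # signal efficiency per threshold
    fpr = np.cumsum(1 - y) / (1 - y).sum()  # background efficiency
    keep = fpr > min_fpr
    return float(np.max(tpr[keep] / np.sqrt(fpr[keep])))
```

A value above 1 means that cutting on the classifier output improves the naive S/√B significance by that factor.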

B. Exploring the dimensionality of the latent space
We next gauge how the size of the latent space affects the usefulness of the representations. The latent space jet representations would ideally be lower in dimension than the physical space versions so as to save computational processing time by removing nonessential degrees of freedom from a given dataset. However, a latent space embedding with too few dimensions might not contain enough parameters to encode the essential physical dynamics of the jets.
In Fig. 7, we plot the rejection versus the true positive rate for a FCN trained on the latent space dijet representations. (See Fig. 10 in App. A for curves from a linear classifier.) We scan the latent space size in powers of two from 512 down to 8 dimensions. For all latent space dimensions, the transformer-encoder network is trained on 50,000 signal and 50,000 background dijet events.
The performance of the transformer-encoder improves as the dimension of the latent space increases. We find that a FCN trained on latent space jet representations cannot outperform a FCN trained on event space jet representations, but it can outperform a LCT trained on event space jet representations. Perhaps more striking is that the linear classifier trained on the compressed latent representations outperforms the linear classifier trained on the full event space data. This indicates that the self-supervised representations are highly expressive despite being compressed, and it agrees with the top-tagging results obtained in Ref. [13].
A selection of AUC scores for the FCNs trained on latent space representations is given in Table II, which also contains scores for a LCT trained on the latent space representations, as well as for a binary classifier constructed from the transformer architecture with an additional sigmoid function as the final layer (Trans+BC), trained using the binary cross-entropy loss. The Trans+BC network has access to all of the input particles and is not trained post-hoc on the self-supervised latent space, so we expect it to perform the best of all configurations. The hope is that the FCN performance is as close as possible to the performance of the Trans+BC (on the largest latent space). Table II does show a performance gap between a FCN trained on the latent space and the Trans+BC network. In App. B, we evaluate the performance of both such networks when trained on increasing amounts of data. In both cases, we find that the networks are "data-hungry": the classifier performances increase with the amount of training data and do not saturate when trained on the 85,000 signal and 85,000 background dijet events sampled from the LHCO dataset. Therefore the performance of the FCN could likely reach that of the Trans+BC network with a larger training dataset than what was used in this study.

C. Varying the amount of training signal
In practice, we want to use the transformer-encoder network for model-agnostic anomaly detection. In this case, we would not be able to train the transformer on a known signal fraction, as the training data would contain an unknown (and extremely tiny, if any) percentage of BSM signal. It is therefore useful to see if the transformer-encoder network is effective at translating rare events into a latent space. We might expect this to be true if the transformer is learning only generic features about collider events that hold for both signal and background events. This is encouraged by the universality of the symmetry augmentations in the contrastive loss.
In Fig. 8a, we hold the dimension of the transformer latent space fixed at 128, then scan the signal-to-background ratio S/B down from 1.0 to 0.0. In Fig. 8b, we repeat the previous steps for a transformer latent space of dimension 48. (See Fig. 11 in App. A for curves from a linear classifier.) Note that for this study, the transformer-encoder network is always trained on 50,000 background dijet events, but the number of signal dijet events changes with S/B. We find that the classifier performance is robust with respect to the signal-to-background ratio, as was found in Ref. [13]. This demonstrates that the transformer-encoder network can be trained on background alone and still faithfully model rare signal events.

IV. ANOMALY DETECTION
We now test the usefulness of the latent space jet representations in a more practical setting by performing a CWoLa-style anomaly search. To create the latent space representations, we use a transformer-encoder network trained on a background-only sample of 50,000 dijet events. As before, we use the "standard test" dataset of 10,000 signal and 10,000 background events for all binary classifier tests.
We first create a baseline against which to compare the CWoLa analyses by carrying out a fully supervised binary classification task in the event space representation. For this study, we use 42,500 signal and 42,500 background dijet events. For the anomaly-detection analysis, we set one CWoLa "mixed sample" to be a set of 42,500 background-only dijets (the same events as in the fully supervised task). The other mixed sample is a mixture of 42,500 signal and background dijets, with the signal fraction scanned from 0% to 100%. We run the analysis three times: once for the event space dijets and once each for latent space dijets at 128 and 48 dimensions. The evaluation of the performance is always computed with pure signal and background labels. Comparisons of the CWoLa classifier performances are shown in Fig. 9. In Fig. 9a, we use the ROC AUC as a metric for evaluating the CWoLa classifier; in Fig. 9b, we instead use the max(SIC) as the metric; in Fig. 9c, we provide the false positive rate at a fixed true positive rate of 50%.
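The FPR at fixed TPR metric of Fig. 9c can likewise be computed directly from classifier scores; a numpy sketch with our own function name:

```python
import numpy as np

def fpr_at_tpr(y_true, scores, target_tpr=0.5):
    """Background efficiency at fixed signal efficiency: the FPR at
    the loosest threshold whose TPR first reaches target_tpr."""
    order = np.argsort(scores)[::-1]      # sort by descending score
    y = np.asarray(y_true, dtype=float)[order]
    tpr = np.cumsum(y) / y.sum()
    fpr = np.cumsum(1 - y) / (1 - y).sum()
    i = np.searchsorted(tpr, target_tpr)  # first threshold reaching target
    return float(fpr[i])
```

A perfect classifier gives an FPR of 0 at 50% TPR, while a random classifier gives approximately 0.5, matching the degradation pattern described below for vanishing signal fractions.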
We find that the CWoLa weakly supervised classifier performance on a small-dimensional latent space is comparable to (but cannot match) that of the full particle event space, with an improvement in performance for a larger-dimension latent space as evaluated by the max(SIC) metric. The most notable difference between the classifier performance on the latent space and on the event space is that in the former case, the classifier performance diminishes to no better than random at a higher signal fraction in the training data. (This is indicated by the ROC AUC dropping to 0.5, the max(SIC) dropping to 1, and the FPR @ TPR = 0.5 rising to 0.5.) More specifically, the small-dimension latent space classifier hits random performance at a signal fraction of just below 10^-2, while the event space classifier does better than random at all nonzero signal fractions.
Overall, the classifier performances at anomaly-level signal fractions shown in Fig. 9 are lower than what has been seen in other recent anomaly-detection methods on the LHCO data. In fact, the SIC curves shown in Refs. [31][32][33][34][35][36] are typically an order of magnitude greater than those in Fig. 9. However, such curves were constructed by training on standard jet observables (e.g. mJ, τ21), and thus represent training methods that are inherently model-dependent. Evidently, this performance decrease relative to model-dependent anomaly detection methods is the price to pay, for this particular signal, for using a more widely applicable, model-agnostic method.
There exist a number of avenues for future work to improve on this contrastive-learning trained classifier. For one: we have considered a small set of symmetry augmentations specific to dijet events. However, additional augmentations for dijet events could be added to the contrastive loss. Alternatively, a different selection of augmentations that leads to an even more general event representation could be chosen. As another avenue: we mentioned earlier (and illustrate in App. B) that the transformer-encoder network is data-hungry. It would therefore be reasonable to expect an improvement in classifier performance if the training dataset were made larger.

V. CONCLUSIONS
In this paper, we have used transformer-encoder neural networks to embed entire collider events into low and fixed-dimensional latent spaces. This embedding is constructed using self-supervised learning based on symmetry transformations. Events that are related by symmetry transformations are grouped together in the latent space while other pairs of events are spread out in the latent space.
We have shown that the latent space preserves the essential properties of the events for distinguishing certain BSM events from the SM background. This latent space can then be used for a variety of tasks, including anomaly detection. We have shown that anomalies can still be identified in the reduced representation as long as there is enough signal in the dataset. For the particular signal model studied, the required amount of signal is much higher than reported by other studies using high-level features. This illustrates the tradeoff between signal sensitivity and model specificity: our reduced latent space knows nothing of particular BSM models and is thus broadly useful but not particularly sensitive. Future work that explores the continuum of approaches by adding more augmentations to the contrastive learning may result in superior performance for particular models.

CODE AVAILABILITY
The code can be found at https://github.com/rmastand/JetCLR_AD.
Appendix A: Linear classifier tests

In this section, we provide the analogues to Fig. 7 and Fig. 8 with the transformer-encoder efficiency curves calculated for a binary linear classifier test run on the latent space representations (rather than a binary FCN). This aligns with the field-standard way to evaluate representations of jets, through a LCT. However, these plots (Fig. 10 and Fig. 11) are not shown in the main text of this report, as a realistic anomaly detection analysis would be carried out using fully connected networks. For comparison, we also provide efficiency curves for a FCN and a LCT run on the event space dijets.
Appendix B: How data-hungry are the neural networks?
In this section, we provide plots illustrating the data-hungry nature of the transformer-encoder network. The performance of the binary classifier trained on the latent space dijet representations, as shown in Fig. 9, is admittedly low. However, it is likely that the performance could improve if the classifiers were trained on a larger amount of data.
In Fig. 12a, we train a FCN on a varying fraction of the available dijet dataset (a 100% training fraction makes use of all 85,000 signal and 85,000 background events). In Fig. 12b, we repeat this procedure for a Trans+BC network. In both cases, the ROC AUCs of the trained binary classifiers do not appear to be saturated when trained on the full dataset.