Searching for New Physics with Deep Autoencoders

We introduce a potentially powerful new method of searching for new physics at the LHC, using autoencoders and unsupervised deep learning. The key idea of the autoencoder is that it learns to map “normal” events back to themselves, but fails to reconstruct “anomalous” events that it has never encountered before. The reconstruction error can then be used as an anomaly threshold. We demonstrate the effectiveness of this idea using QCD jets as background and boosted top jets and RPV gluino jets as signal. We show that a deep autoencoder can significantly improve signal over background when trained on backgrounds only, or even directly on data which contains a small admixture of signal. Finally we examine the correlation of the autoencoders with jet mass and show how the jet mass distribution can be stable against cuts in reconstruction loss. This may be important for estimating QCD backgrounds from data. As a test case we show how one could plausibly discover 400 GeV RPV gluinos using an autoencoder combined with a bump hunt in jet mass. This opens up the exciting possibility of training directly on actual data to discover new physics with no prior expectations or theory prejudice. ar X iv :1 80 8. 08 99 2v 1 [ he pph ] 2 7 A ug 2 01 8

Although the LHC has performed hundreds, if not thousands, of searches for new physics since its inception, so far no definitive evidence for physics beyond the Standard Model has turned up. All the searches for new physics in the expected places (supersymmetry, composite Higgs, fourth generations, Z s, etc) have turned up empty. This strongly motivates methods to look for physics without as much top-down theory prejudice. We need more ways to discover the unexpected at the LHC, and here is where unsupervised machine learning comes into play.
In this paper, we study one promising avenue to perform open-ended searches for new physics at the LHC: anomaly detection with autoencoders and deep learning. An autoencoder [35] is a simple idea with various incarnations and many real world applications to anomaly detection, denoising [36], generative models [37], feature selection and more. (For an introduction to autoencoders and their applications, see e.g. [38][39][40].) In its simplest form it is a lossy algorithm that maps an input to a latent compressed representation and then back to itself. This is illustrated in the cartoon in Fig. 1. A measure for how well the autoencoder performs is the difference between input and output according to some distance metric -the "reconstruction error". For example, for images, it could be the pixel-wise, summed mean-squared difference between input and output. Typically one trains an autoencoder on a sample of background events with the objective of minimizing reconstruction error on the sample. In this way, it learns what background "looks like". Any anomaly (the signal, e.g. new physics) is then expected to be poorly reconstructed by an autoencoder optimized on a sufficiently different background. Hence we can use a cut on the reconstruction error as an anomaly threshold.
For concreteness, we will focus in this work on distinguishing "fat" QCD jets from other types of heavier, boosted resonances decaying to jets. Building on previous work on top tagging [12], we will concentrate on machine learning algorithms that take jet images as inputs. For signal, we will consider all-hadronic top jets, as well as 400 GeV gluinos decaying to 3 jets via RPV. Obviously, this is not meant to be an exhaustive study of all possible backgrounds and signals and methods but is just meant to be a proof of concept. The idea of autoencoders for anomaly detection is fully general and not limited to these signals. We will comment on other forms of inputs in section 5. Moreover there are many other anomaly detection techniques that are not based on autoencoder and/or on reconstruction (loss) which are worth exploring in future work. At the same time autoencoders have been recently used in other high energy physics applications: in parton shower simulation [28], for feature selection of a supervised classification [30], and for automated detection of detector aberrations in CMS [31].
We will explore various architectures for the autoencoder, from simple dense neural networks to convolutional neural networks (CNNs), as well as a shallow linear representation in the form of Principal Component Analysis (PCA). We will see that while they are all effective at improving S/B by factors of ∼ 10 or more, they have important differences. The reconstruction errors of the dense and PCA autoencoders correlate more highly with jet mass, leading to greater S/B improvement for the 400 GeV gluinos compared to the CNN autoencoder. While this may seem better at first glance, we discuss how one might want to use an autoencoder that is decorrelated with jet mass, in order to obtain data-driven side-band estimates of the QCD background and perform a bump hunt in jet mass. Indeed, we show how cutting on the reconstruction error of the CNN autoencoder results in stable jet mass distributions, and we show how this can be used to improve S/B by a factor of ∼ 6 in a jet mass bump hunt for the 400 GeV gluino signal.
We will study the performance of the autoencoder in two modes: a version where it is trained on background-only events, and a version where it is trained on a mixed sample containing both background and signal, meant to be representative of the actual data. An autoencoder trained on a sample of background-only events is an example weakly-supervised machine learning. One could still imagine applying this directly to data, provided one can prepare a control sample that consists only of representative backgrounds. Or one could train on MC backgrounds and hope that the MC is an accurate representation of the background events in the data. As a first test of this assumption, we will train on Pythia and evaluate on both Pythia and Herwig and we will see that the results are similar.
By contrast, the autoencoder trained on mixed samples of background and signal is an example of fully-unsupervised machine learning, and as such is a much more exciting potential application. We will show that, surprisingly, the autoencoder performance is remarkably stable against signal contamination: the performance is barely degraded even if signal is 10% of the training sample! Evidently, there is not much difference between the weakly-supervised and fully-unsupervised modes. Somehow, the autoencoder learns to preferentially reconstruct the background, and still poorly reconstructs the signal, even though it sees the signal as part of the training process. This raises the exciting possibility that the autoencoder could be trained directly on the data, and then could potentially discover any anomalous signal of new physics in the background (perhaps when combined with other variables, for instance a mass cut or bump hunt), provided it looks different enough from SM objects. This would be an ideal method to discover the unexpected or to perform open ended searches for new physics at the LHC.
Aside from open ended anomaly detection, the autoencoder could be viewed as a general-purpose background-cleaner. That is, we could train it on the background (or directly on the data) and then cut on reconstruction loss in order to remove "boring" QCD events, leaving behind a sample that is presumably more signal-rich. We could then study these events in more detail, using other techniques and variables to isolate the signal.
The outline of the paper is as follows. In Section 2 we define autoencoders quantitatively and present the architectures employed in the rest of the paper. We also describe the details of event generation used to obtain the data sets. Section 3 is devoted to the main results of the weakly-supervised mode (with pure background training set). We compare the performance of the different architectures, discuss the methods by which we choose the size of the latent space, and perform a MC comparison in the form of Pythia vs. Herwig. In Section 4 we turn our attention to the fully unsupervised mode. We study the consequences of having a small fraction of signal in the training set, and then we discuss correlation between jet mass and reconstruction loss of the trained autoencoders. We show how by using the CNN autoencoder, a bump hunt in jet mass could potentially reveal the presence of 400 GeV RPV gluinos in the actual data. Finally, we conclude in Section 5 with a summary and list of future directions.

Methods
Let us start with a more detailed introduction to autoencoders. Given an input x ∈ R n we want to learn a mapping intox ∈ R n while passing through a latent representation y ∈ R k . This mapping is implemented by two functions: the encoder f : R n → R k and the decoder g : R k → R n . The functional forms of f and g are determined by the autoencoder architecture; they are parametrized by sets of learnable weights, θ f and θ g , respectively. The aim of the autoencoder (and the aim of the machine-learning training process) is to ensure that x andx = g(f (x; θ f ); θ g ) are as close as possible under a given metric. Useful results are obtained when the dimension of the latent space is smaller than the input one, k n, so that the trivial mapping cannot be learned. Thus the autoencoder learns a compressed representation of the input, optimized on its features.
To evaluate the distance between x andx we will use the L 2 norm, also known as mean-squared reconstruction error: By training the autoencoder to minimize L on a sample of background events, we learn to encode and decode the typical events that arise from the background distribution. Then when the autoencoder is evaluated on signals that do not come from the background distribution, the hope is that it will result in a larger L than usual. Thus, the tails of the L distribution are more likely to be signal than background, and by cutting on L we can cut out background and better detect signals. This one of the possible ways to use an autoencoder as anomaly detector.

Sample generation
The jet image samples used in this work follow the exact same specifications as the "CMS jets" used in [12]. We describe this briefly here but we refer the reader to [12] for more detailed information. The jets are generated using Pythia 8.219 [41] for hadronization and Delphes 3.4.1 [42] for detector simulation. All jets are clustered with FastJet 3.0.1 [43]. We use anti-k T jets with R = 1 and we require p T ∈ [800, 900] GeV and |η| < 1.
The background (used for training the autoencoders) consists of light QCD jets, while for examples of signal we will employ top quark jets and gluino jets with mass mg = 400 GeV. The tops are assumed to decay hadronically, while the gluinos decay to three light-quark jets via RPV SUSY. All the samples are generated by simulating pair production of the heavy resonance starting from pp collisions at 13 TeV (LHC Run II conditions).
In order to ensure that the decay products of the heavy resonance are predominantly contained within the fat jet, we apply a merge requirement of ∆R < 0.6 at the truth level on the partonic daughters of the decayed heavy resonance. We also require a geometric match requirement of ∆R < 0.6 between the fat jet and the original heavy resonance.
In all of our studies, we use sample sizes of 100k for training and testing. We have checked with smaller sample sizes that the performance of the autoencoders seems to saturate at 100k, but we have not performed a detailed study.
After generating the fat jets, we apply several pre-processing steps described in [12] (center, rotate, flip, normalize) and then we pixelize the jets into 37×37 images whose pixel intensities correspond to total p T . We stick to grayscale images in this work for simplicity.

Autoencoder architectures
In this work, we compare two deep-learning autoencoder architectures, as well as a simpler autoencoder based on principal component analysis (PCA) that could be considered as a baseline. All of our autoencoders take the jet images as inputs. In this subsection we will describe them briefly and qualitatively. In appendix A, we will provide full descriptions in the form of Keras code.
• For preliminary exploration will use Principal Component Analysis (PCA). The principal components correspond to the eigenvectors of the correlation matrix, ordered in decreasing eigenvalues. The encoder is just a projection on the first k components and the decoder the projection back to the original space. It can be shown that this minimizes the mean-squared error in the space of linear projections. Thus in this sense PCA is comparable to a linear model (e.g. one layer with linear activations and k the dimension of the latent space) with the convenient property of being deterministic.
• The simplest architecture we consider is just a series of dense (fully-connected) layers. One starts by flattening the N × N image into a single column vector of length N 2 . This is then fed to the dense layers of successively smaller size until one arrives at the latent layer. Then this process is reversed until one arrives back at a column vector of the initial size.
• For a more sophisticated autoencoder, we consider a convolutional neural network (CNN). Here the dimensionality reduction is accomplished via the usual maxpooling layers. After a series of convolutional and max-pooling layers, the output is fed to a series of dense layers, resulting finally in the latent representation. The entire process is reversed (with 2D upsampling layers in place of the max-pooling layers) to arrive back at an image with the same dimensions. (For the arithmetic of the max-pooling and upsampling to work out, we zero-pad the inputs to the CNN autoencoder so that they are 40×40 pixels.) All the architectures have been implemented using Keras 2.1.5 with Tensorflow 1.7.0 backend on Nvidia GPUs (Pascal 100 and GeForce GTX 1080). For training, we used the default Adam algorithm with minibatch size of 1024 1 and a mild early stopping criterion: threshold= 0 and patience= 3 (= 5) for the CNN (dense) autoencoder. As this is a proof-of-concept paper, we have not optimized heavily the training algorithm (e.g. we have not studied the effect of learning rate annealing or momentum). We see that the autoencoder works as advertised: it learns to reconstruct the QCD background that it has been trained on (to be precise, we train on 100k QCD jets and then we evaluate the autoencoder on a separate sample of QCD jets), and it fails to reconstruct the signals that it has never seen before. This is further illustrated in Fig. 3, which shows the average QCD, top and gluino jet image before and after autoencoder reconstruction. We see by eye that the QCD images are reconstructed well on average, while the others contain more errors.
By sliding the reconstruction loss threshold L > L S around, we can turn the histograms in Fig. 2 into ROC curves. The ROC curves for the different autoencoder architectures are shown in Fig. 4 for the top and gluino signals. For comparison we have also included the ROC curve obtained by cutting on jet mass as an anomaly threshold. While the three architectures have comparable performances it is clear there are some important differences. For tops, the CNN outperforms the others, while for gluinos the situation is largely reversed. Surprisingly, for gluinos, the CNN is even outperformed by the humble PCA autoencoder at all but the lowest signal efficiencies! We will explore this in more detail in section 4.2, but a clue as to what's going on is shown in the comparison of the PCA ROC curve with the jet mass ROC curve. For gluinos, they track each other extremely closely, suggesting that the PCA reconstruction error is highly correlated with jet mass. We will confirm this in section 4.2. Evidently, the PCA autoencoder (and to a lesser extent the dense autoencoder) has learned to reconstruct the more numerous low mass QCD jets at the expense of the rarer high mass QCD jets.
Meanwhile the CNN has learned information that is not as correlated with the mass, e.g. details about the jet substructure.
In Table 1, we show the signal efficiency at 90% and 99% background rejection (which we refer to as E 10 and E 100 respectively). The values reported in each case are the average over 5 independent training runs to ameliorate the intrinsic variance (apart from PCA which is deterministic). We see that rejecting 99% of background will keep more than 10% of the signals for both of the deep-learning-based autoencoders.

Choosing the latent dimension
Here we will explore the dependence of the autoencoder on the dimension of the latent space. This is one of the most important choices to make in the design of an autoencoder for anomaly detection. If the dimensionality is too low, the autoencoder is not able to capture all the salient features of the training set. On the other hand, as the encoding space gets larger, we get closer to the trivial representation. Hence we would like to find an optimal compromise. In choosing the latent dimension of the autoencoder, it is important to keep in mind the unsupervised nature of our endeavor. So optimizing the latent dimension using various signals is not the approach we want to take.
One unsupervised method for finding an optimal working point is to use PCA as the initial step. Shown in Fig. 5 (left) is the amount of variance in the data explained by each eigenvector of PCA, in descending order. (This kind of plot is conventionally referred to as a "scree plot" by PCA practitioners who also happen to be mountaineers.) An obvious and common prescription is to choose the number of principal components close to the "elbow" of the scree plot; other choices might be motivated upon more detailed inspection of the cumulative accounted variance (e.g. one might choose the number of encoding dimensions corresponding to 95% or 99% of the total variance). We could then use the same value for the dimensionality of the encoding space in our deep networks.
We can also search for a similar behaviour in the loss function. This is shown in Fig. 5 tg   (right) for the different autoencoders. We see the loss plateaus around the same place for the various autoencoders, and that corresponds roughly to the elbow of the PCA scree plot. The loss function first sharply decreases as more important and meaningful features are learned by the encoded representation. It reaches a plateau supposedly when only marginal information is added to the encoding space. Following the above logic we choose k = 6 encoding dimensions for all of the autoencoders presented in the paper.
Finally, let's examine the wisdom of our choice by looking at the top signal for example. Shown in Fig. 6 is E 10 and E 100 for the top signal (averaged over 5 training runs) as a function of the latent dimension. This shows the same behavior as we saw above -the performance of the autoencoders plateau around k = 6. This is encouraging evidence for our unsupervised method of choosing the latent dimension based on PCA and reconstruction loss.

Robustness with other Monte Carlo
Before turning to unsupervised approaches in the next section, let us consider here the main weakness of the weakly-supervised approach: the reliance on accurate backgroundonly samples for training.
One data-driven approach would be to define a control sample of fat jets, e.g. by inverting a lepton selection. This of course assumes the signal is never produced in association with leptons.
Alternatively, one would train on background Monte Carlo, and then apply the autoencoder to data. This would work only insofar as the Monte Carlo accurately represents the background in the data. Or that any artifacts special to the Monte Carlo are not learned by the autoencoder. In particular different hadronization schemes could have an impact on the final shape of the jets we study and deteriorate the results of an autoencoder.
In this subsection, we will explore the dependence of the autoencoder on the choice of MC generator by evaluating our CNN autoencoder (trained on Pythia) on fat jets produced with Herwig. Fig. 7 shows the resulting distributions of the reconstruction error. The differences are small, and crucially the separation between background and anomaly is preserved. This can be seen as another proof that the autoencoder has mostly learned fundamental jet features which should depend only weakly on the hadronization scheme details.
We can quantify the degradation in performance by fixing a common threshold. For convenience we choose it such that on Pythia we have the usual 90% and 99% background rejection. We select one training instance of the CNN autoencoder at random, which corresponds to E 10 = 0.71 and E 100 = 0.19. Applying the same threshold and the same algorithm to the Herwig set we obtain precisions of s = 0.74 and s = 0.21 respectively, with corresponding background rejection of 87% and 98%.
4 Training directly on data: unsupervised mode 4

.1 Contamination study
In the previous section, we have explored how autoencoders can be trained on samples of background-only jets, and then be used to discover signals such as top quarks and RPV gluinos. This is a prime example of "one-class classification" and weakly-supervised learning. It could potentially have direct applications to LHC searches for new physics, provided the background sample can be validated somehow.
In this section, we will turn to a potentially much more exciting application of autoencoders in the form of unsupervised learning. Rather than train on a sample of background-only jets, we will train on a sample of backgrounds "contaminated" by a small fraction of signal events. We will see how, somewhat surprisingly, the autoencoder still succeeds in detecting anomalies in the test set even though they are present in the training set. Evidently, as long as the autoencoder doesn't see "too many" anomalies in the course of its training, its performance will be largely preserved. Figure 8 shows how the amount of contamination with anomalous events in the training set affects the performance of autoencoders. Here, we use top jet samples for anomalous events. The horizontal axis denotes the fraction of top jets in the entire training set. In the left and right panels, the values of E 10 and E 100 for top jet signals are shown respectively. For dense and CNN autoencoders, each point represents the average of 5 runs. In every architecture, as the contamination ratio increases up to 0.1, the values of E 10 and E 100 tend to gradually decrease but the reduction is not dramatic. This indicates that the contamination does not give a significant impact on the performance of our autoencoders. Just to emphasize how powerful this method potentially is, we see that with the CNN autoencoder, even with 10% signal present in the training sample, the autoencoder arrives at E 100 ∼ 0.1, so after this cut on reconstruction loss, we would end up with S/B ∼ O(1)! Of course, without some way of estimating the background, this unsupervised method of searching for new physics would still probably have limited utility. With just a pure counting experiment (counting the number of events above some reconstruction error threshold), we would have no way of knowing whether we have found new physics, unless we knew beforehand what to expect from the SM background. In the next subsection, we will explore the possibility of combining the autoencoder with a variable like jet mass, in order to perform a bump hunt, with data-driven background estimates coming from sidebands. We see that they are stable against increasingly hard cuts on the reconstruction error.

Correlation with jet mass
In this subsection, we will explore the correlation of the different autoencoders with jet mass. We are motivated by how the autoencoder would be applied in the real world to look for new physics. We are looking for subtle signals in an open-ended way buried in the QCD background. Given that there is no reliable way to estimate the QCD background other than data-driven methods, and given that we are not expecting to achieve extremely high S/B significances, a pure counting experiment seems implausible. Instead, we will still need another variable to side-band in order to estimate the QCD background from the data. Since a large class of new physics starts from the decay of a heavy new resonance, jet mass is an obvious candidate to side band in. From this point of a view, the ideal autoencoder would be one whose reconstruction error is minimally correlated with jet mass. We could then cut hard on the reconstruction error to "clean" out the QCD background, and then look for a bump in the jet mass distribution, confident that the autoencoder cut did not sculpt an artificial peak into the jet mass distribution of the QCD background.
Shown in Fig. 9 (left) is the mean jet mass computed in bins of increasing autoencoder loss, for the QCD background. We see that PCA (gray) and dense (blue) reconstruction errors are correlated with jet mass all the way up to 400 GeV. So cutting on the PCA loss is roughly equivalent to cutting on the jet mass. However, for CNNs the correlation stops for jet masses above ∼ 250-300 GeV. Equivalently, the jet mass distribution should be stable against cutting on the CNN loss for cuts above ∼ 10 −6 . This is borne out in Fig. 9 (right). Here we see the jet mass distribution after cuts on CNN loss that reduce the QCD background by a factor of 10 (blue), 100 (orange), and 1000 (green). The jet mass distribution is remarkably stable as we cut harder on CNN loss. This makes it the superior autoencoder for doing a bump hunt in jet mass for jet masses above ∼ 300 GeV.
To illustrate the possibilities of searching for new physics in this way, by first "cleaning" the QCD background using the CNN autoencoder and then doing a bump hunt in jet mass, we include Fig. 10. These are the jet mass histograms for QCD background and 400 GeV gluinos, now normalized to the LO gluino and QCD cross sections, before (left) and after (right) a cut on CNN autoencoder loss that removes a factor of 1000 of the QCD background. Importantly, we have trained to autoencoder on a mixed sample containing the expected fraction of gluino jets, corresponding to a contamination fraction of 10 −3 . This would be representative of the actual data, if it really contained these gluinos. We see that the S/B achievable here is ≈ 25%. As can be seen clearly from the histograms, this is an impressive improvement on the S/B before the cut (i.e. just from the raw jet mass histogram), which is only ≈ 4%. One could plausibly discover new physics this way!

Discussion
In this paper, we have shown how autoencoders -machine-learning algorithms that learn how to compress and decompress a sample of inputs -are potentially powerful new tools for performing open-ended searches for new physics at the LHC. While autoencoders have many real-world applications to anomaly detection, they have up till now not been widely adopted in high energy physics.
We explored autoencoders in both weakly-supervised and unsupervised forms. In the former mode, we trained autoencoders based on dense and convolutional neural networks on a sample of high p T , R = 1 QCD jet images and showed how they could learn to accurately reconstruct these jet images. Then the hope of using autoencoders for openended anomaly detection is that it would fail to reconstruct signals it hadn't been trained on, and then one could use the reconstruction error as an anomaly threshold. In this paper we demonstrated that the deep autoencoders work as advertised, by applying it to signals consisting of all-hadronic top jets and RPV gluinos. We saw that by thresholding on reconstruction error, the autoencoder could improve S/B's on these signals by sizable amounts.
We also showed how the autoencoder could operate in an unsupervised mode, and discover signals despite having been trained on data that actually contained those signals! In fact, we saw that varying the signal fraction even up to 10% the autoencoder performance was remarkably stable. This implies that one could simply train the autoencoder directly on the data, and then look for a feature corresponding to new physics. As a proof-of-concept, we showed how this could be done with a jet mass bump hunt. We showed that the CNN autoencoder is reasonably decorrelated with jet mass, meaning that we could use the autoencoder to reduce the QCD background and then search for a bump in the jet mass distribution. We saw that it could achieve S/B ∼ 25% for a 400 GeV RPV gluino signal, an improvement of over a factor of 6 from the bump hunt without autoencoder.
We believe this is a very exciting new direction in the search for new physics at the LHC, very unlike conventional approaches. There are many future directions that we envision. Some of these include: • Testing out the autoencoder on other signals and backgrounds. For concreteness, we focused fat jets in a narrow range of p T 's, treating QCD as background and heavy resonances with three subjets as signal. But obviously the idea is general and can be applied to any training and test samples in principle. One could envision applying this to other numbers of subjets, dark showers, non-resonant particles, etc.
• Going further, it would be fascinating to train an autoencoder to flag entire events as anomalous, instead of just individual fat jets.
• We focused on just a few autoencoder architectures in this paper, for the proof of principle, but there are many others on the market. For instance, LSTMs and recurrent neural networks. These have proven to be useful for boosted-object tagging [7,8,11,17] so we expect they will also be useful here. There are also even more complex types of anomaly detection in the computer-science literature based on the idea of GANs [44][45][46] that may also prove useful in this context.
• It would be interesting to dive deeper into the latent representation that is learned by the autoencoder. Do signals and backgrounds cluster in this latent space? Do the latent dimensions correlate strongly with known variables such as jet mass and N-subjettiness?
• We saw here how the CNN autoencoder was reasonably decorrelated with mass. It would be interesting to explore ways to more explicitly decorrelate in mass. The "variable planing" ideas of [3,47] may be useful in this context. Or one could envision training an ensemble of autoencoders on jet samples corresponding to different bins in jet mass. A small enough bin width would probably ensure practical absence of correlation between mass and reconstruction loss. This is well beyond the scope of our study; we reserve this for future work.
Autoencoders are a form of weakly-supervised or unsupervised machine learning which could be ideally suited to the current situation at the LHC, where many topdown-motivated searches have not turned up any evidence for new physics, and many people are wondering what we should be looking for. With an autoencoder approach, one doesn't need to know what one is looking for. It is a powerful new method to search for any signal of new physics in the data, without prejudice.
Note added: While this work was being completed, we learned of the work of [48], who also studied the applications of autoencoders to anomaly detection and searching for new physics at the LHC.