Novelty Detection Meets Collider Physics

Novelty detection is the machine learning task of recognizing data that belong to an unknown pattern. Complementary to supervised learning, it allows data to be analyzed in a model-independent way. We demonstrate the potential role of novelty detection in collider physics, using an autoencoder-based deep neural network. Explicitly, we develop a set of density-based novelty evaluators, which are sensitive to the clustering of unknown-pattern testing data or new-physics signal events, for the design of detection algorithms. We also explore the influence of the known-pattern data fluctuations, arising from non-signal regions, on detection sensitivity, and propose strategies to address it. The algorithms are applied to detecting fermionic di-top partner and resonant di-top productions at the LHC, and exotic Higgs decays of two specific modes at a future $e^+e^-$ collider. With parton-level analysis, we conclude that the new-physics benchmarks can potentially be recognized with high efficiency.


INTRODUCTION
Since the early developments in the 1950s [1], Machine Learning (ML) has evolved into a science addressing various big data problems. The techniques developed for ML, such as decision tree learning [2] and artificial neural networks (ANN) [3], allow computers to be trained to perform specific tasks usually deemed too complex for hand-crafted algorithms. In supervised learning, the algorithm is first trained on labeled data and then classifies testing data into the categories defined during training. In contrast, in semi-supervised and unsupervised learning, where partially labeled or unlabeled data are provided, the algorithm is expected to find the relevant patterns unassisted.
The last decade has seen rapid progress in ML techniques, in particular the development of deep ANNs. A deep ANN is a multi-layered network of threshold units [4]. Each unit computes only a simple nonlinear function of its inputs, which allows each layer to represent a certain level of relevant features. Unlike traditional ML techniques (e.g. boosted decision trees), which rely heavily on expert-designed features in order to reduce the dimensionality of the problem, deep ANNs automatically extract pertinent features from data, enabling data mining without prior assumptions. Fueled by vast amounts of data and the fast development of training techniques and parallel computing architectures, modern deep learning systems have achieved major successes in computer vision [5], speech recognition [6], and natural language processing [7], and have recently emerged as a promising tool for scientific research [8][9][10][11], where the plethora of experimental data presents a challenge for insightful analysis.
High Energy Physics (HEP) is a big data science and has a long history of using supervised ML for data analysis. Recently, pioneering works have demonstrated the capability of deep ANNs in understanding jet substructure [12][13][14][15] and in identifying particles [16] or even whole signal signatures (see e.g. [17], where weakly supervised learning is applied). However, the primary goal of HEP experiments is to detect predicted or unpredicted physics Beyond the Standard Model (BSM) in order to establish the underlying fundamental laws of nature. Despite their significant role in current data analysis, supervised ML techniques suffer from the model dependence introduced during training. This problem can potentially be addressed by the semi-supervised and unsupervised techniques developed for novelty detection (for a review see, e.g., [18]). Novelty detection is the ML task of recognizing data belonging to an unknown pattern. If BSM physics is interpreted as a novel signal, it could be detected without specifying an underlying theory during data analysis. Hence, a combination of novelty detection and supervised ML may lay out a framework for future HEP data analysis.
Some preliminary and at least partially related efforts have been made at the jet [19,20] and event [21][22][23][24][25][26] level. For novelty detection with a given feature representation, the sensitivity depends crucially on the performance of the novelty evaluators. Well-designed evaluators allow the novelty of data to be evaluated efficiently and precisely. As a matter of fact, the design of novelty evaluators or the relevant test statistics defines the frontier of novelty detection [18]. In this letter, we propose a set of density-based novelty evaluators. In contrast to traditional density-based ones, which only quantify the isolation of testing data from the known patterns, the new novelty evaluators are sensitive to the clustering of testing data. On this basis, we design algorithms for novelty detection using an autoencoder, which are subsequently applied to detecting several BSM benchmarks at the LHC and at future $e^+e^-$ colliders.

ALGORITHMS
Novelty detection using a deep ANN can be separated into three steps: 1) feature learning, 2) dimensional reduction, 3) novelty evaluation. During the first step the ANN is trained under supervision, using labeled known patterns. The nodes of the trained ANN contain the information gathered for classification and constitute the feature space, which typically has a large dimension. In order to reduce the sparse error and to improve the efficiency of the analysis, one removes the irrelevant features by dimensional reduction, which can be implemented using an autoencoder [27]. An autoencoder is an ANN with an identical number of nodes for the input and output layers and fewer nodes for the hidden layers. Its loss function measures the difference between input and output, defined as the reconstruction error $\|x - x'\|^2$. Here $x$ and $x'$ are the vectors of input and output nodes, respectively. Hence the autoencoder learns, without supervision, how to reconstruct its input. This allows it to form a submanifold in the full feature space. Afterwards, the novelty of the testing data is evaluated for the final significance analysis. The algorithm is shown in FIG. 1. For HEP data analysis, the data with known and unknown patterns can be interpreted as SM background and BSM signal, respectively.
FIG. 2: Comparison between traditional and new novelty evaluators. The toy data are shown in panels (a) (training data) and (b), while the novelty response is given in (c) and (d).

We generate Monte Carlo data using MadGraph5_aMC@NLO [28] and rely on Keras [29] (TensorFlow-based [30]) for the ANN construction. For the supervised classification of events with $n$ visible-particle four-momenta (which we internally normalise by 200 GeV) and $l$ labeled patterns we use an ANN with $4n$ input nodes, $l$ output nodes, and three hidden layers with 30, 30 and 10 nodes, respectively. We use Nesterov's accelerated gradient descent optimizer [31] with a learning rate of 0.3, a learning momentum of 0.99 and a decay rate of $10^{-4}$. The batch size is fixed to be 30 and the loss function is the categorical cross entropy [32,33]. The collection of all nodes constitutes the feature space with dimension $m = 4n + 30 + 30 + 10 + l$. This ensures that it contains the non-linear information learned from classification. We normalize the axes of the feature space to $[-1, 1]$ and use tanh as the activation function for the autoencoder. Finally, an autoencoder consisting of five hidden layers with 40, 20, 8, 20 and 40 nodes, respectively, and a learning rate of 2.0 projects this feature space onto an eight-dimensional sub-space. We have checked that the results of all ANNs are stable against variations in the numbers of hidden layers and nodes.
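The node counting described above can be made concrete with a short sketch. The choice of $n = 4$ visible particles (two bottom quarks and two charged leptons) and $l = 3$ labeled patterns is an illustrative assumption, and the function names are hypothetical; only the layer widths are taken from the text.

```python
def feature_space_dimension(n_particles, n_patterns, hidden=(30, 30, 10)):
    """Return m = 4n + 30 + 30 + 10 + l, the number of nodes in the classifier."""
    n_inputs = 4 * n_particles  # one four-momentum per visible particle
    return n_inputs + sum(hidden) + n_patterns


def autoencoder_layer_sizes(m, hidden=(40, 20, 8, 20, 40)):
    """Layer widths of the autoencoder: m inputs, the quoted hidden layers, m outputs."""
    return [m, *hidden, m]


# Assumed example values: n = 4 visible particles, l = 3 labeled patterns.
m = feature_space_dimension(n_particles=4, n_patterns=3)
layers = autoencoder_layer_sizes(m)
print(m)       # dimension of the full feature space
print(layers)  # the narrowest layer sets the dimension of the sub-space
```

Under these assumptions the narrowest layer has eight nodes, which is the dimension of the sub-space onto which the feature space is projected.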

NOVELTY EVALUATION
Novelty evaluation of testing data is a crucial step for novelty detection. Various approaches have been developed in the past decades [18]. For non-time-series data, one of the most popular approaches is density-based [34], in which a Local Outlier Factor (LOF), i.e., the ratio of the local density of a given testing data point and the local densities of its neighbors, is proposed as a novelty measure. Explicitly, this traditional measure is [35,36]
$$\Delta_{\text{trad}} = \frac{d_{\text{train}} - \langle d'_{\text{train}} \rangle}{\langle d'^2_{\text{train}} \rangle^{1/2}} ,$$
where $d_{\text{train}}$ is the mean distance of a testing data point to its $k$ nearest neighbors, $\langle d'_{\text{train}} \rangle$ is the average of the mean distances defined for its $k$ nearest neighbors, and $\langle d'^2_{\text{train}} \rangle^{1/2}$ is the standard deviation of the latter. The subscript "train" indicates that all quantities are defined with respect to the training dataset. We calculate $\langle d'^2_{\text{train}} \rangle^{1/2}$ using the method suggested in [35,36]. The probabilistic novelty evaluator can be defined as the cumulative distribution
$$O_{\text{trad}} = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{\Delta_{\text{trad}}}{c\sqrt{2}}\right)\right] .$$
Here $c$ is a normalization factor, defined as the root mean square of the measure values for all testing data. This evaluator measures the isolation of testing data from training data. A testing data point located away from, or at the tail of, the training data distribution thus tends to be scored high by $O_{\text{trad}}$ [34,35].
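A minimal sketch of such an isolation-based score is given here, using only the Python standard library. It assumes the measure is the standardized difference between the mean $k$-nearest-neighbor distance of a testing point and that of its neighbors, mapped to $[0, 1]$ with a Gaussian cumulative distribution of width $c$. The plain standard deviation used for the denominator is a simplifying assumption and not necessarily the estimator of [35,36]; the Gaussian toy data and all function names are hypothetical.

```python
import math
import random
import statistics


def knn_mean_distance(point, data, k):
    """Mean Euclidean distance from `point` to its k nearest neighbours in `data`.

    Zero distances are skipped so a training point is not its own neighbour.
    """
    dists = sorted(d for d in (math.dist(point, q) for q in data) if d > 0.0)
    return statistics.fmean(dists[:k])


def delta_trad(point, train, k):
    """Isolation measure of a testing point with respect to the training set."""
    d = knn_mean_distance(point, train, k)
    neighbours = sorted(train, key=lambda q: math.dist(point, q))[:k]
    d_prime = [knn_mean_distance(q, train, k) for q in neighbours]
    # Assumption: plain standard deviation of the neighbours' mean distances.
    return (d - statistics.fmean(d_prime)) / statistics.pstdev(d_prime)


def probabilistic_score(deltas):
    """Map measures to [0, 1]; c is the root mean square over all testing data."""
    c = math.sqrt(statistics.fmean([v * v for v in deltas]))
    return [0.5 * (1.0 + math.erf(v / (c * math.sqrt(2.0)))) for v in deltas]


rng = random.Random(0)
train = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
test = [(0.0, 0.0), (0.5, -0.5), (6.0, 6.0)]  # last point lies far from the bulk
scores = probabilistic_score([delta_trad(p, train, k=5) for p in test])
print(scores)
```

The brute-force neighbor search costs one full pass over the training set per query, which is adequate for a toy example but not for large datasets.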
However, $O_{\text{trad}}$ is blind to the clustering of testing data, which generically exists in BSM datasets and may result in non-trivial structures such as a resonance. In order to utilize this feature, we introduce a measure:
$$\Delta_{\text{new}} = \frac{d_{\text{test}}^{-m} - d_{\text{train}}^{-m}}{d_{\text{train}}^{-m/2}} ,$$
with $m$ being the dimension of the feature space. Here $d_{\text{test}}$ is the mean distance of the testing data point to its $k$ nearest neighbors in the testing dataset, whereas $d_{\text{train}}$ is the SM prediction of the same, which can be approximately calculated using the training dataset. This measure is reminiscent of the test statistic introduced in [37,38], where a similar idea is employed for estimating the divergence of data distributions. As $\Delta_{\text{new}}$ is approximately [...]
The influence of fluctuations on detection sensitivity can be compensated for as the luminosity $L$ increases, if $k$ scales with $L$. In this case more and more data are used to calculate $1/d_{\text{test}}^m$ in the local bin, which itself barely changes. This compensation is approximately predicted by the Central Limit Theorem (CLT), which states in this context that the standard deviation of the $\Delta_{\text{new}}$ response scales as $1/\sqrt{k}$ or $1/\sqrt{L}$, for testing data with known patterns only. We show this in Fig. 3, using the known-pattern Gaussian datasets defined before. Indeed, as the number of testing data increases, $\Delta_{\text{new}}$ becomes less and less sensitive to the fluctuations (see Fig. 3a).
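The clustering-sensitive measure can be sketched in the same spirit. The snippet assumes the form $\Delta_{\text{new}} = (d_{\text{test}}^{-m} - d_{\text{train}}^{-m}) / d_{\text{train}}^{-m/2}$, which compares local densities (proportional to $d^{-m}$) in the testing and training datasets. It further assumes that both datasets contain the same number of known-pattern events, that the known pattern is a unit Gaussian, and that the measure may be probed at arbitrary points rather than only at events. Function names are hypothetical.

```python
import math
import random
import statistics


def knn_mean_distance(point, data, k):
    """Mean Euclidean distance from `point` to its k nearest neighbours in `data`."""
    dists = sorted(d for d in (math.dist(point, q) for q in data) if d > 0.0)
    return statistics.fmean(dists[:k])


def delta_new(point, test, train, k):
    """Density excess of the testing data over the training-data prediction."""
    m = len(point)  # dimension of the feature space
    d_test = knn_mean_distance(point, test, k)
    d_train = knn_mean_distance(point, train, k)
    return (d_test ** -m - d_train ** -m) / d_train ** (-m / 2)


def gaussian_sample(rng, centre, sigma, size):
    return [tuple(rng.gauss(c, sigma) for c in centre) for _ in range(size)]


rng = random.Random(1)
# Known pattern (assumed): N(0, I). Unknown pattern: N((1.5, 1.5), 0.1 I), S/B = 1/20.
train = gaussian_sample(rng, (0.0, 0.0), 1.0, 2000)
test = gaussian_sample(rng, (0.0, 0.0), 1.0, 2000)
test += gaussian_sample(rng, (1.5, 1.5), math.sqrt(0.1), 100)

delta_cluster = delta_new((1.5, 1.5), test, train, k=20)  # centre of the cluster
delta_bulk = delta_new((-1.0, -1.0), test, train, k=20)   # known-pattern region
print(delta_cluster, delta_bulk)
```

At the cluster centre the testing data are denser than the training data predict, so the measure is expected to be positive there, whereas in the bulk it only fluctuates around zero.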
If the fluctuations are not fully compensated for by luminosity, the known-pattern testing data could still be scored high by $\Delta_{\text{new}}$, and hence diminish the detection sensitivity. This is often true if $S_{\text{tot}}/B_{\text{tot}}$ is small, as typically occurs in analyses at the LHC. To address this potential problem, we propose one more evaluator, $O_{\text{comb}}$, which combines $O_{\text{trad}}$ and $O_{\text{new}}$. This evaluator utilizes the fact that the known-pattern testing data with high $O_{\text{new}}$ scores often come from the high-density regions in the feature space, whereas such data are typically scored low by $O_{\text{trad}}$. As indicated in Fig. 4, $O_{\text{comb}}$ performs very well in a typical case where the known- and unknown-pattern data distributions partially overlap, and many of the known-pattern data, especially the ones in the central region, are scored high by $O_{\text{new}}$ due to the fluctuations. The known-pattern datasets used here are the same as before, containing $10^4$ events. The unknown pattern is defined as $\mathcal{N}((1.5, 1.5)^T, 0.1I)$, with $S_{\text{tot}}/B_{\text{tot}} = 1/20$. Indeed, many high-scoring data of known pattern in Fig. 4a are pushed to the low-scoring end in Fig. 4c, due to the compensation of $O_{\text{trad}}$. This effect results in a $\sim 50\%$ improvement in sensitivity, compared to the results based on $O_{\text{trad}}$ or $O_{\text{new}}$ only. Here (and similarly below) the significance is calculated against the known + unknown-pattern hypothesis for testing data, using the Poisson-probability-based test statistic [39].
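The compensation mechanism can be illustrated with a small sketch. The geometric mean used here to combine the two scores is an assumption made for illustration, chosen because it suppresses any event that scores low on either evaluator. The score values are hypothetical and are not taken from the figures.

```python
import math


def o_comb(o_trad, o_new):
    """Combine two scores in [0, 1]; the geometric mean is an assumed form."""
    return math.sqrt(o_trad * o_new)


# Hypothetical scores, chosen only to illustrate the compensation.
background_event = {"o_trad": 0.1, "o_new": 0.9}  # dense region, upward fluctuation
signal_event = {"o_trad": 0.7, "o_new": 0.9}      # clustered and away from the bulk

print(o_comb(**background_event))
print(o_comb(**signal_event))
```

With this choice, a known-pattern event that fluctuates to a high clustering score in a dense region is pulled down by its low isolation score, while an event scoring high on both evaluators stays high.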

STUDY ON BENCHMARK SCENARIOS
In order to illustrate their performance, we apply the algorithms designed above to two parton-level analyses, with two BSM benchmarks defined for each. Although unrealistic, this is sufficient for a proof of concept.
In the first analysis, we simulate the final state $b\bar{b}\,l^+l^-E_T^{\text{miss}}$ at the 14 TeV LHC, with a luminosity of 3 ab$^{-1}$. We require exactly two bottom quarks with $p_T > 20$ GeV and two charged leptons ($e^\pm$ and $\mu^\pm$) with $p_T > 10$ GeV. The SM background stems mainly from
• $pp \to \bar{t}_l t_l$, $\sigma = 11.5$ fb,
• $pp \to t_l b W^\pm_l$, $\sigma = 0.365$ fb,
• $pp \to Z_b Z_l$, $\sigma = 0.0765$ fb.
Here the physical cross sections have been universally suppressed by a factor of 2000 for simplification. The signal could arise from multiple BSM scenarios in this analysis. Here we consider: [...]
In the second analysis, we simulate unpolarized $e^+e^- \to Zh$ production with the final state $b\bar{b}\,l^+l^-E_T^{\text{miss}}$ at $\sqrt{s} = 240$ GeV, with a luminosity of 5 ab$^{-1}$. We require exactly two bottom quarks with $p_T > 10$ GeV and two charged leptons ($e^\pm$ and $\mu^\pm$) with $p_T > 5$ GeV. The SM background arises mainly from
• $e^+e^- \to hZ \to Z^*_{\text{inv}} Z_{b\bar{b}}\, l^+l^-$, $\sigma = 0.00686$ fb,
• $e^+e^- \to hZ \to Z^*_{b\bar{b}} Z_{\text{inv}}\, l^+l^-$, $\sigma = 0.00259$ fb.
For BSM scenarios, we consider two specific modes of exotic Higgs decay [40]:
• $Y_1$: $h \to \chi_1\chi_2 \to \chi_1\chi_1 a$. This decay topology can arise from the nearly Peccei-Quinn symmetric limit in the NMSSM [41,42], where $\chi_2$ and $\chi_1$ are bino- and singlino-like neutralinos, respectively, and $a$ is a light CP-odd scalar.
• $Y_2$: $h \to Za$ in the 2HDM and the NMSSM [40].
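As a quick cross-check of the scale of the two analyses, the expected background yields follow from $N = \sigma L$. The sketch assumes that the quoted cross sections can be multiplied directly by the integrated luminosity, with no further efficiency factor; the dictionary keys are hypothetical labels.

```python
# Unit conversion: 1 ab^-1 = 1000 fb^-1.
FB_INV_PER_AB_INV = 1000.0


def expected_events(cross_section_fb, luminosity_ab_inv):
    """Expected event count N = sigma * L for sigma in fb and L in ab^-1."""
    return cross_section_fb * luminosity_ab_inv * FB_INV_PER_AB_INV


# Keys are hypothetical labels; the numbers are the quoted cross sections in fb.
lhc_backgrounds_fb = {"ttbar": 11.5, "tbW": 0.365, "ZZ": 0.0765}
ee_backgrounds_fb = {"Z*(inv) Z(bb)": 0.00686, "Z*(bb) Z(inv)": 0.00259}

lhc_yields = {k: expected_events(xs, 3.0) for k, xs in lhc_backgrounds_fb.items()}
ee_yields = {k: expected_events(xs, 5.0) for k, xs in ee_backgrounds_fb.items()}

print(lhc_yields, sum(lhc_yields.values()))
print(ee_yields, sum(ee_yields.values()))
```

Under this assumption the LHC analysis contains roughly $3.6 \times 10^4$ background events, dominated by top-pair production, while the $e^+e^-$ analysis contains about 47.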
The parameter values and cross sections for the four benchmark scenarios are summarized in TAB. I.
The sensitivity performance of the algorithms is presented in FIG. 5. In each panel, we show one curve for the "Ideal" case (assuming 100% signal efficiency and background rejection) and one curve for supervised learning as references for performance evaluation. In the first analysis, the toy model discussed above precisely mimics what happens in benchmark $X_1$. In this case, the BSM signal and the SM data partially overlap in the feature space. Many of the SM data in the non-signal regions have a strong $O_{\text{new}}$ response due to fluctuations, and hence diminish the detection sensitivity. However, with the $O_{\text{trad}}$ compensation, a sizable improvement in sensitivity is achieved. As shown in Fig. 5a, the sensitivity is approximately doubled using $O_{\text{comb}}$, compared to the results using $O_{\text{new}}$ or $O_{\text{trad}}$ only. For benchmark $X_2$, $S_{\text{tot}}/B_{\text{tot}}$ is about one order of magnitude larger than that in benchmark $X_1$, as indicated in TAB. I. This tends to enhance the $O_{\text{new}}$ response of the signal, compared to the SM data, and hence results in comparable sensitivities for the analyses based on $O_{\text{trad}}$, $O_{\text{new}}$ and $O_{\text{comb}}$, respectively. For the benchmarks $Y_1$ and $Y_2$ in the second analysis, the fluctuation effect on $O_{\text{new}}$ is negligibly small, due to $S_{\text{tot}}/B_{\text{tot}} > 1$ (typical for analyses at an $e^+e^-$ collider), while the known- and unknown-pattern data distributions are not fully separated, hence limiting the efficiency of $O_{\text{trad}}$. This results in a sensitivity performance for $O_{\text{new}}$ which is universally better than the others.

SUMMARY AND DISCUSSION
In this letter, we proposed a set of density-based novelty evaluators, $O_{\text{new}}$ and $O_{\text{comb}}$, which are sensitive to the clustering of the unknown-pattern testing data, for novelty detection in HEP data analysis. These evaluators allow the design of algorithms with broad applications in detecting BSM physics. They can also be applied to measuring SM processes yet to be discovered, if we interpret them as "novel" events. As these algorithms are designed using only general assumptions, their application could be extended to other big-data domains as well.
This study could be generalized in multiple directions. We have focused on developing the algorithms for novelty detection in HEP, using parton-level analysis to demonstrate their sensitivity performance. To bridge the gap between the concept and its application to real data analysis, a hadron-level analysis is needed. In addition, the algorithms could be improved in several aspects. First, the feature selection in the ANN training process might not yet be fully optimized. The features learned from the classification of data with labeled known patterns are likely to be sub-optimal for enhancing the isolation or clustering of the unknown-pattern data. Nevertheless, we may introduce dynamical ML or some feedback mechanisms using the testing dataset, to reinforce the learning of the unknown-pattern features. Second, the definition of the distance between data depends on the geometry of the feature space. We adopted the Euclidean geometry for simplicity, but it is worthwhile to explore other possibilities. Third, the amount of memory and time needed to implement $O_{\text{trad}}$ increases rapidly with the data size and dimension, which renders $O_{\text{trad}}$ inefficient for large datasets. Ways of accelerating the calculation might be needed. Furthermore, we would like to extend the performance analysis of the algorithms to other BSM scenarios, e.g., the ones with interference between the known and unknown patterns, or with non-trivial data clusters such as a dip [43]. Finally, we mention that a full analysis of the systematic and theoretical uncertainties, which is beyond the scope of this study, is absent (for a recent effort partially addressing this see [44]). We leave these topics to a future study.
Note added: While this letter was being finalized, [45] appeared. Both the novelty evaluators proposed here and the test statistic defined in [45] (as well as the one developed recently in [26]) are able to measure the clustering of testing data with an unknown pattern. We would like to stress that we developed this project and the relevant ideas independently. In particular, two significant differences exist between them. First, unlike the test statistic in [26,45], which measures the divergence of the testing dataset from the training dataset, the evaluators proposed here quantify the novelty of individual testing data. Such a design difference enables the evaluators to probe the fine or differential structure of the clustering, such as a peak-dip (a famous BSM example can be found in [43]), more efficiently. Second, as the look-elsewhere effect (LEE) could be a severe problem for novelty detection at hadron colliders, we explored how to diminish its influence on detection sensitivity (in relation to this, $O_{\text{comb}}$ was designed). This was not developed in [26,45].