Distance

While deep learning has proven to be extremely successful at supervised classification tasks at the LHC and beyond, for practical applications, raw classification accuracy is often not the only consideration. One crucial issue is the stability of network predictions, either versus changes of individual features of the input data or against systematic perturbations. We present a new method based on a novel application of “distance correlation,” a measure quantifying nonlinear correlations, that achieves equal performance to state-of-the-art adversarial decorrelation networks but is much simpler and more stable to train. To demonstrate the effectiveness of our method, we carefully recast a recent ATLAS study of decorrelation methods as applied to boosted, hadronic W tagging. We also show the feasibility of regularization with distance correlation for more powerful convolutional neural networks, as well as for the problem of hadronic top tagging.

Introduction.-Recent breakthroughs in deep learning have begun to revolutionize many areas of high energy physics. One area that has received considerable focus is the problem of classifying different types of jets at the LHC. Deep neural networks have been applied, for example, to distinguishing top quarks from light quark and gluon jets. For this problem, a large number of architectures based on fully connected neural networks [1,2], image-based methods [3,4], recursive clustering [5,6], physics variables [7][8][9][10], sets [11], and graphs [12,13] have been studied [14][15][16]. Related challenges of identifying vector bosons [17,18], b-quarks [19,20], and Higgs bosons [13,21] and of distinguishing light quark from gluon jets [22][23][24][25] have seen similar progress. Beyond classifying single particles in an event, there is also work on developing holistic methods that classify full events according to the likely physics process that produced them [26,27]. Finally, some of these novel deep learning methods are beginning to be applied to concrete experimental analyses (see, e.g., [28][29][30]).
So far, the recent activity in developing better jet classifiers with deep learning has focused on maximizing their raw performance. However, the most accurate classifier is often not the best one for actual experimental applications. Instead, what is often desired is the most accurate classifier given the constraint that it is decorrelated with one or more auxiliary variables.
The underlying reason for this requirement is that classifiers are trained on Monte Carlo (MC) simulated examples (for which perfect truth labels are available) but are applied to (unlabeled) collision data. While the simulated events are of high fidelity, they do not perfectly reproduce the real data, and this gives rise to systematic differences between training and testing data. Understanding and mitigating these systematic differences is essential in any experimental analysis, and having a decorrelated classifier has many applications in this regard. For example, if the sources of systematic uncertainty are known, one can attempt to explicitly decorrelate a classifier against them in order to reduce or eliminate their effects [31][32][33][34]. Or, one can attempt to control for these systematic differences using data-driven methods such as sidebanding in the invariant mass (Although different auxiliary variables can be used in experimental analyses, one of the most common choices is invariant mass. So for concreteness, and without loss of generality, we will focus on the case of invariant mass for the remainder of this paper). If the signal is localized but the background is smooth in mass, the sideband method allows one to calculate MC vs data correction factors, define control samples, and estimate backgrounds. But if the classifier sculpts features (e.g., bumps) into the background mass distribution, it cannot be relied on for sidebanding. A classifier that is decorrelated with mass is sufficient (although not necessary) to guarantee smoothness of the background mass distribution.
The issue is especially acute for powerful multivariate classifiers such as neural networks, which will have a strong incentive to "learn the mass" when building the optimal discriminant. Even if one excludes mass from the list of inputs to the machine learning algorithm, it may not be enough to achieve a decorrelated classifier: many of the other inputs may be correlated with mass, and machine learning methods in general are flexible enough to exploit correlations of inputs. Such improvements will be especially relevant for (but not limited to) searches for new resonances with unknown mass. The identification of resonances in invariant mass distributions is historically the main avenue to discovery in experimental particle physics and relies on robust background estimates. Therefore, an important and significant challenge is to design classifiers that are as fully decorrelated from mass as possible while using maximal information.
In this Letter, we will present a new method for training decorrelated classifiers that achieves performance comparable to state-of-the-art methods while being much easier to train. The key observation is that a statistical measure called "distance correlation" (DisCo) [35][36][37][38] is sensitive to general, nonlinear correlations between two random variables and can be efficiently computed from finite samples. DisCo is well known in statistics and has been applied to various fields, including data science [39] and biology [40]. To our knowledge, this is the first application of DisCo to particle physics.
By including DisCo as an additive regularizer term in the loss function, we demonstrate that we can achieve a state-of-the-art decorrelated classifier with just one additional hyperparameter (the coefficient of the DisCo regularizer). By varying this coefficient, we can control the tradeoff between classification performance and decorrelation, interpolating between a fully decorrelated tagger and a fully performant one.
To validate our methods and rigorously demonstrate that they are state of the art, we will carefully reproduce the results of a recent ATLAS study of decorrelated taggers for identifying boosted W bosons [41]. This study includes a comprehensive set of decorrelation methods, including [31,42-44]. The most promising technique so far (in terms of achieving the highest classifier performance for a given level of decorrelation) has been adversarially training a pair of neural networks: a classifier distinguishing different classes and an adversary predicting the mass [31,44] for a given classifier output.
The downside of the adversarial method has been that it is extremely difficult to implement in practice. Not only does one have to essentially train two separate neural networks, each with its own set of hyperparameters, but one has to carefully tune these two neural networks against each other. This stems from the nature of adversarial training: the objective is not to minimize a loss function but rather to find a saddle point where the classifier loss is minimized but the adversary loss is maximized. Without careful tuning of learning rate schedules, number of epochs, minibatch sizes, etc., the training easily becomes unstable (since the loss is unbounded from below) and can quickly run away to a meaningless result.
By contrast, DisCo regularization maintains the convex objective of the original loss function (i.e., the DisCo term is a positive measure of nonlinear correlations), making it much more stable to train. And since it only has one additional hyperparameter, no additional tuning is required. We will show, in the context of the ATLAS W-tagging study, that the result of DisCo decorrelation is comparable to that of adversarial decorrelation. In the Supplemental Material [45], we will also demonstrate the state-of-the-art performance for top tagging with jet images and convolutional neural networks (CNNs).
Distance correlation.-Given a sample of paired vectors $(\vec{x}_i, \vec{y}_i)$ (where the index $i$ runs over the sample) drawn randomly from some distribution, we would like a function that measures the extent to which they are drawn from independent distributions, i.e., the extent to which the joint distribution factorizes,

$$p(\vec{x}, \vec{y}) = p(\vec{x})\, p(\vec{y}).$$

In order for this function to be applicable in a deep learning context, we also require that this function be differentiable and that it can be computed directly from the sample.
In our case, the vectors are one dimensional and correspond to mass $X = m$ and classifier output $Y = y$, but clearly one can imagine many more applications of such a measure at the LHC and beyond.
The usual Pearson correlation coefficient R only measures linear dependencies, so it is not suitable for our purposes. Specifically, features can have nonlinear dependencies and still exhibit zero Pearson R (While the Pearson correlation coefficient is nonzero only if features are correlated, it can, however, be used to actively correlate features (see, e.g., [53])). There are many information-theoretic measures of similarity of distributions such as KL divergence, Jensen-Shannon distance, and mutual information. These are difficult to compute directly from the sample without binning. One can approximate these measures by training a classifier and using the likelihood ratio trick, but this again leads to adversarial methods (see, e.g., [33,54-57]).
One measure that seems to fit the bill perfectly is "distance correlation", which originated in the works of [35][36][37][38]. It can be computed from the sample, and it has the key property of being zero iff X and Y are independent.
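The definition of distance covariance is as follows:

$$\mathrm{dCov}^2(X,Y) = \int d^p s\, d^q t\, \left| f_{X,Y}(s,t) - f_X(s)\, f_Y(t) \right|^2 w(s,t), \tag{1}$$

where $X \in \mathbb{R}^p$, $Y \in \mathbb{R}^q$, $f_X$ and $f_Y$ are the characteristic functions for the random variables $X$ and $Y$, and $f_{X,Y}$ is the joint characteristic function for $X$ and $Y$. Finally,

$$w(s,t) \propto |s|^{-(p+1)}\, |t|^{-(q+1)} \tag{2}$$

is a weight function that is uniquely determined up to an overall normalization by the requirement that dCov is invariant under constant shifts and orthogonal transformations and equivariant under scale transformations [58].
Equation (1) makes it clear that distance covariance is a measure of the independence of $X$ and $Y$ that is zero iff $X$ and $Y$ are independent: the integrand vanishes everywhere exactly when the joint characteristic function factorizes, $f_{X,Y} = f_X f_Y$.
PHYSICAL REVIEW LETTERS 125, 122001 (2020) 122001-2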
Using the definition of the characteristic function, it is straightforward to verify that we can also express dCov as

$$\mathrm{dCov}^2(X,Y) = \langle |X - X'|\,|Y - Y'| \rangle + \langle |X - X'| \rangle \langle |Y - Y'| \rangle - 2 \langle |X - X'|\,|Y - Y''| \rangle, \tag{3}$$

where $|\cdot|$ refers to the Euclidean vector norm (In fact, there is a family of distance covariance measures parameterized by $0 < \alpha < 2$ where one uses $|X - X'|^{\alpha}$ instead of $|X - X'|$. These relax the requirement of strict equivariance under rescalings. In this Letter, we will focus on $\alpha = 1$, but in principle this would be another hyperparameter to explore) and $(X, Y)$, $(X', Y')$, $(X'', Y'')$ are independent and identically distributed draws from the joint distribution of $(X, Y)$ [$X''$ is not used in Eq. (3)]. Using this alternative form of $\mathrm{dCov}^2$, it is straightforward to compute a sampling estimate of $\mathrm{dCov}^2$ from a dataset of $(x_i, y_i)$ (In the following we will be reweighting by $p_T$, so we actually need a weighted form of distance correlation. That follows easily from the sample form of Eq. (3)). Finally, we normalize the distance covariance by the individual distance variances to obtain distance correlation:

$$\mathrm{dCorr}^2(X,Y) = \frac{\mathrm{dCov}^2(X,Y)}{\mathrm{dCov}(X,X)\, \mathrm{dCov}(Y,Y)}. \tag{4}$$

The distance correlation is bounded between 0 and 1. Normalizing ensures equally strong decorrelation independent of the overall scale. We will add $\mathrm{dCorr}^2$ as a regularizer term to the usual classifier loss function in the following (In principle, another hyperparameter is the exact power of dCorr that one adds to the loss function. We have not explored this in much detail). In detail,

$$L = L_{\rm classifier}(\vec{y}, \vec{y}_{\rm true}) + \lambda\, \mathrm{dCorr}^2_{y_{\rm true} = 0}(\vec{m}, \vec{y}), \tag{5}$$

where $\lambda$ is a single hyperparameter that controls the tradeoff between classifier performance and decorrelation, $\vec{y}$ is the output of the neural network on a single minibatch, and $\vec{y}_{\rm true}$ and $\vec{m}$ are the true labels and masses, respectively (Our implementation of DisCo is available at [59]). The subscript $y_{\rm true} = 0$ indicates that the distance correlation is only calculated for the subset of the minibatch that is background; this is the appropriate mode for W tagging.
Of course, for other applications it may be more appropriate to apply the decorrelation to all events or even to signal events only.
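As an illustration of Eqs. (3)-(5), the following is a minimal PyTorch sketch of a weighted sample estimate of $\mathrm{dCorr}^2$ and of the regularized loss. It is not the implementation referenced in [59]. The function names `distance_corr_sq` and `disco_loss`, the single-logit binary cross-entropy classifier loss, and the small stabilizer `eps` are assumptions made for this example. The estimate uses double-centered pairwise distance matrices, which is algebraically equivalent to inserting weighted sample means into Eq. (3).

```python
import torch
import torch.nn.functional as F


def distance_corr_sq(x, y, weights=None, eps=1e-12):
    """Weighted sample estimate of dCorr^2 for two 1D tensors of equal length."""
    n = x.shape[0]
    if weights is None:
        weights = torch.ones(n, dtype=x.dtype, device=x.device)
    w = weights / weights.sum()  # normalize event weights to sum to one

    def centered(v):
        d = (v[:, None] - v[None, :]).abs()               # |v_i - v_j|
        row = (d * w[None, :]).sum(dim=1, keepdim=True)   # weighted row means
        total = (row[:, 0] * w).sum()                     # weighted grand mean
        # d is symmetric, so the column means are the transpose of the row means
        return d - row - row.t() + total

    a, b = centered(x), centered(y)
    ww = w[:, None] * w[None, :]
    dcov2_xy = (a * b * ww).sum()   # sample dCov^2(X, Y), Eq. (3)
    dcov2_xx = (a * a * ww).sum()   # sample dCov^2(X, X)
    dcov2_yy = (b * b * ww).sum()   # sample dCov^2(Y, Y)
    # Eq. (4); eps (an assumption of this sketch) guards against a zero denominator
    return dcov2_xy / torch.sqrt(dcov2_xx * dcov2_yy + eps)


def disco_loss(logits, labels, mass, lam, weights=None):
    """Eq. (5): classifier loss plus lam * dCorr^2 evaluated on background only."""
    clf = F.binary_cross_entropy_with_logits(logits, labels.float(), weight=weights)
    bkg = labels == 0                       # y_true = 0 subset of the minibatch
    scores = torch.sigmoid(logits[bkg])     # classifier output y
    w_bkg = None if weights is None else weights[bkg]
    return clf + lam * distance_corr_sq(mass[bkg], scores, w_bkg)
```

Because every operation is differentiable, calling `backward()` on the returned loss propagates the decorrelation penalty to the network parameters. The pairwise distance matrices scale quadratically with the minibatch size, which is the main cost of the regularizer in this form.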
Sample.-As discussed in the Introduction, we will focus in this paper on W tagging, for which there is a detailed study of existing decorrelation methods by the ATLAS collaboration [41]. (See the Supplemental Material [45] for a brief demonstration of DisCo decorrelation for top tagging.) By recasting the ATLAS study as closely as possible, we will be able to validate our methods and rigorously demonstrate that our method of distance correlation is state of the art.
Following the ATLAS study, we generate the standard model processes $pp \to WW$ and $pp \to jj$ in PYTHIA 8.219 [60] at $\sqrt{s} = 13$ TeV with a generator-level cut of $p_T > 250$ GeV on the initial particles. We use DELPHES 3.4.1 with the default detector card for detector simulation [61]. We also use the built-in functionality of DELPHES to simulate pileup with $\langle N_{\rm PU} \rangle = 24$ as per the ATLAS study [41]. Jets are reconstructed using FASTJET 3.0.1 [62] and the anti-$k_T$ algorithm [63] with distance parameter $R = 1$. Jets are required to have $|\eta| < 2$ and to be within $\Delta R < 0.75$ of the original parton. The daughters of the W are also required to be within $\Delta R < 0.75$ of the original W. Finally, jets are trimmed [64] with parameters $R_{\rm sub} = 0.2$ and $f_{\rm cut} = 5\%$. For the final sample, jets are required to have $m \in [50, 300]$ GeV and $p_T \in [300, 400]$ GeV; the mass distributions for signal and background are shown in Fig. 1. Apart from the very last requirement on $p_T$, these all follow the ATLAS study. Here we choose to focus on a narrower range in $p_T$ for simplicity.
From this sample of jets, we compute the complete list of high-level kinematic variables shown in Table 1 of the ATLAS study (see [41] for more details and original references). These form the inputs for all the methods in the ATLAS study. We will also use them as inputs for the dense neural network (DNN) plus distance correlation.
Since we will also study the decorrelation of CNN classifiers (see below), we will also form jet images in the same way as [65]. We form images with $\Delta\eta = \Delta\phi = 2$ and 40 × 40 pixel resolution. For simplicity, we stick to gray-scale images (with pixel intensity equal to $p_T$) for this study. Figure 2 shows the average of 100 000 W and QCD jet images.
For all methods, we reweight the training samples so that the $p_T$ distributions of signal and background are flat, following the ATLAS study. We use 50 evenly spaced $p_T$ bins between 300 and 400 GeV. For evaluation, ATLAS also reweights the signal $p_T$ distribution to look like background. But since we are taking such a narrow $p_T$ slice, our $p_T$ distributions are basically identical, so we skip this step.
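The flattening step can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function name `flat_pt_weights`, the inverse-bin-count weighting, and the normalization to the number of events are assumptions, and all jets are assumed to lie inside the 300 to 400 GeV window. It would be applied separately to the signal and background training samples.

```python
import numpy as np


def flat_pt_weights(pt, lo=300.0, hi=400.0, n_bins=50):
    """Per-jet weights that make the binned pT distribution flat."""
    edges = np.linspace(lo, hi, n_bins + 1)          # 50 evenly spaced bins
    counts, _ = np.histogram(pt, bins=edges)
    # bin index of each jet; the clip keeps pt == hi in the last bin
    idx = np.clip(np.digitize(pt, edges) - 1, 0, n_bins - 1)
    w = 1.0 / np.maximum(counts[idx], 1)             # inverse bin population
    return w * len(pt) / w.sum()                     # weights sum to the number of jets
```

After reweighting, every populated bin carries the same total weight, so the classifier cannot use differences in the $p_T$ spectra of the two classes.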
All of the data samples used for this study will be made publicly available here [66].
Methods.-Following [41], we measure the tagging performance by the rejection factor $R_{50}$, corresponding to the inverse of the false positive rate (the probability of misidentifying a QCD jet as a W jet) at a true positive rate (the probability of correctly identifying a W jet) of 50%. The decorrelation is quantified by the inverse of the Jensen-Shannon divergence (JSD), $1/\mathrm{JSD}_{50}$, between the inclusive background mass distribution and the background mass distribution passing the selection corresponding to a true positive rate of 50%. The JSD is calculated from histograms with 50 bins between the lowest and highest values. The binned entropy is measured in bits.
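The two figures of merit can be computed from classifier scores as in the following NumPy sketch. The function name `r50_and_inv_jsd50` and its argument names are hypothetical, event weights are ignored for brevity, and the sketch assumes that at least one background jet passes the cut and that the two histograms are not exactly identical.

```python
import numpy as np


def r50_and_inv_jsd50(scores_sig, scores_bkg, mass_bkg, n_bins=50, eff=0.5):
    """Background rejection and inverse JSD at a fixed signal efficiency."""
    cut = np.quantile(scores_sig, 1.0 - eff)   # threshold kept by 50% of the signal
    passed = scores_bkg > cut
    r50 = 1.0 / passed.mean()                  # inverse false positive rate

    # 50 bins between the lowest and highest background mass values
    edges = np.linspace(mass_bkg.min(), mass_bkg.max(), n_bins + 1)
    p, _ = np.histogram(mass_bkg[passed], bins=edges)   # background passing the cut
    q, _ = np.histogram(mass_bkg, bins=edges)           # inclusive background
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl_bits(a, b):
        nz = a > 0                             # empty bins contribute zero
        return np.sum(a[nz] * np.log2(a[nz] / b[nz]))

    jsd = 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)     # in bits, between 0 and 1
    return r50, 1.0 / jsd
```

With base-2 logarithms the JSD lies between 0 and 1 bit, so $1/\mathrm{JSD}_{50} \geq 1$, and larger values mean that the cut leaves the background mass shape closer to the inclusive one. For a finite sample the value is limited by statistical fluctuations of the histograms even for a perfectly decorrelated classifier.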
We have implemented the following pairs of (W tagging, decorrelation) methods in our work. From the ATLAS study: [$\tau_{21}$, designed decorrelated tagger (DDT)] [42,67], [$D_2$, k-Nearest Neighbors regression (kNN)] [68][69][70], [Adaboost boosted decision tree (BDT), uBoost] [71], and (DNN, adversary) [31]. We will additionally include the simplest and possibly oldest decorrelation method, namely "planing", or reweighting events so that the mass histograms of signal and background are identical. As this approach is relatively simple to implement and does not add much computational cost, it is a good baseline procedure (See [72] for a recent comparison study of planing against other methods). Finally, to all of this we will add our new method (DNN, DisCo regularization) for comparison. For details on all these methods, see the Supplemental Material [45].
In addition, we will go beyond the ATLAS study and examine a CNN classifier acting on jet images, together with adversarial and DisCo decorrelation. This will demonstrate that DisCo regularization is effective enough to decorrelate more powerful deep learning classifiers that use low-level, high-dimensional features. For the CNN classifier, we use a scaled-down version of the classifier in [65]. There are four convolutional layers with 64, 32, 32, and 32 filters (size 4 × 4), with 2 × 2 max pooling after the second and fourth layers. This is followed by three hidden layers with 32, 64, and 64 nodes. All activations are rectified linear units. Finally, the output layer is a softmax.
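For concreteness, the architecture described above could be written in PyTorch roughly as follows. The class name `JetImageCNN`, the absence of padding, unit stride, and the two-class output are assumptions; these details are not specified in the text, and this sketch is not the authors' code.

```python
import torch
import torch.nn as nn


class JetImageCNN(nn.Module):
    """Four 4x4 conv layers (64, 32, 32, 32 filters), then 32-64-64 dense layers."""

    def __init__(self, n_classes=2, image_size=40):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=4), nn.ReLU(),    # single gray-scale channel
            nn.Conv2d(64, 32, kernel_size=4), nn.ReLU(),
            nn.MaxPool2d(2),                               # pooling after second conv
            nn.Conv2d(32, 32, kernel_size=4), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=4), nn.ReLU(),
            nn.MaxPool2d(2),                               # pooling after fourth conv
            nn.Flatten(),
        )
        # infer the flattened size from a dummy image instead of hard-coding it
        with torch.no_grad():
            dummy = torch.zeros(1, 1, image_size, image_size)
            n_flat = self.features(dummy).shape[1]
        self.classifier = nn.Sequential(
            nn.Linear(n_flat, 32), nn.ReLU(),
            nn.Linear(32, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):
        # softmax output: one probability per class for each image
        return torch.softmax(self.classifier(self.features(x)), dim=1)
```

Under these assumptions a 40 × 40 image shrinks to 5 × 5 after the second pooling layer, so the first dense layer receives 32 × 5 × 5 = 800 inputs.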
For both CNN and DNN with DisCo regularization, we used the Adam optimizer with minibatch size of 2048 and a fixed learning rate of $10^{-4}$. We found that the relatively large batch size of 2048 helped with the numerical stability of the DisCo regularizer. We note that the sampling estimate of Eq. (3) for distance covariance is known to be statistically biased, and an unbiased estimator was given in [73]. The bias goes to zero as $\sim 1/n$, where $n$ is the size of the sample (the minibatch size in our case). We have verified that, as our minibatch size is sufficiently large, there is no practical benefit to using the unbiased estimate of distance covariance in our case.
For the DNN (CNN), we performed a scan in DisCo parameter λ in the range 0-600 (0-250). All classifiers were trained for 200 epochs; no early stopping was used. We have checked that 200 epochs is enough to ensure convergence in the sense that training for more epochs does not improve things. Then, for each λ and training instance, the model with the best validation loss is selected. This procedure is repeated six times with different random seeds to obtain a sense of the variability in the training outcomes.
In all of the machine learning-based methods we use 250 000/80 000/80 000 signal jets and 110 000/330 000/770 000 background jets for training/validation/testing. We use so many background jets in order to minimize the statistical error on the JSD calculation (which is calculated only for the background).
The deep learning algorithms were implemented with PyTorch and trained on an NVIDIA P100 GPU.
Results.-Our final result is shown in Fig. 3, where the performance of various decorrelation methods on the test set is compared. The qualitative (and even quantitative) agreement with Fig. 11(a) of [41] is excellent, and we see a clear tradeoff between classifier performance and the amount of decorrelation.
Comparing DNN + DisCo to the other methods, we find that it has comparable performance to DNN + adversary. Meanwhile, DNN + DisCo is much easier to train: whereas DisCo adds exactly one hyperparameter and no additional neural network parameters to the DNN, the adversary more than doubles the number of hyperparameters and adds an entire second neural network to the story. See the Supplemental Material [45] for a complete list of hyperparameters for the adversarial training. These were found through manual tuning and their sheer complexity nicely illustrates the need for a simpler method of decorrelation.
We see that DisCo regularization is equally capable of decorrelating the more powerful CNN classifier and again achieves comparable performance to CNN + adversary. One concern could have been that a more powerful deep learning method such as the CNN could overpower the DisCo regularizer, but our result demonstrates that this is not the case. At the highest levels of decorrelation, we note that both DNN and CNN performances are comparable.
In Fig. 4, we indicate more directly the level of decorrelation in the background mass distribution for the pure CNN case (no decorrelation) and for the CNN + DisCo method at a working point that achieves $1/\mathrm{JSD}_{50} \sim 10^3$. We see that DisCo is quite effective at stabilizing the background mass distribution against a cut on the classifier.
Finally, let us also comment briefly on the performance of planing. Unlike DisCo regularization and some of the other methods studied here, planing yields a single working point instead of a tunable tradeoff between decorrelation and classifier performance. Since its performance depends on the joint probability distribution for mass and the other observables (Planing replaces $p(x, m)$ with $p(x, m)/p(m)$, which does not guarantee independence), planing is not guaranteed to achieve strong results. But it is interesting to see that in this case (and in many of the cases studied in [72]), planing the DNN and CNN classifiers achieves very good performance. The performance lies on the DisCo regularization curve, and DisCo is capable of further decorrelation.
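A minimal NumPy sketch of planing weights is given below, assuming a simple histogram-based estimate of $p(m)$ for each class. The function name `planing_weights`, the choice of 50 mass bins, and the per-class normalization are assumptions made for illustration and are not taken from the Supplemental Material.

```python
import numpy as np


def planing_weights(mass, labels, lo=50.0, hi=300.0, n_bins=50):
    """Per-event weights 1/p(m), estimated separately for each class."""
    edges = np.linspace(lo, hi, n_bins + 1)
    idx = np.clip(np.digitize(mass, edges) - 1, 0, n_bins - 1)  # mass bin of each event
    w = np.zeros(len(mass))
    for cls in np.unique(labels):
        sel = labels == cls
        counts = np.bincount(idx[sel], minlength=n_bins)  # class mass histogram
        w[sel] = 1.0 / counts[idx[sel]]                   # divide out p(m) for this class
        w[sel] *= sel.sum() / w[sel].sum()                # keep total weight per class
    return w
```

Each class becomes flat in mass over the bins it populates, so the weighted signal and background mass histograms agree wherever both classes have events. Bins populated by only one class remain unmatched, and nothing in this procedure removes the dependence of the other observables on mass.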
Conclusion.-Deep learning is greatly increasing the classification performance for a wide number of reconstruction problems in particle physics. With the increasing adoption of these powerful machine learning solutions, a thorough understanding of their stability is needed.
In this Letter, it was shown how a simple regularization term based on the distance correlation metric can achieve state-of-the-art decorrelation power. Training is easier to set up, has far fewer hyperparameters to optimize, and is more stable than adversarial networks, while simultaneously being more powerful than simpler approaches.
DisCo regularization is an effective and promising new method for decorrelation that should have a host of immediate experimental applications at the LHC. At the same time, the potential use cases are much wider and include problems of fairness and bias of decision algorithms in social applications. This will be an extremely interesting direction for future exploration.