ABCDisCo: Automating the ABCD Method with Machine Learning


Abstract:
The ABCD method is one of the most widely used data-driven background estimation techniques in high energy physics. Cuts on two statistically-independent classifiers separate signal and background into four regions, so that background in the signal region can be estimated simply using the other three control regions. Typically, the independent classifiers are chosen "by hand" to be intuitive and physically motivated variables. Here, we explore the possibility of automating the design of one or both of these classifiers using machine learning. We show how to use state-of-the-art decorrelation methods to construct powerful yet independent discriminators. Along the way, we uncover a previously unappreciated aspect of the ABCD method: its accuracy hinges on having low signal contamination in control regions not just overall, but relative to the signal fraction in the signal region. We demonstrate the method with three examples: a simple model consisting of three-dimensional Gaussians; boosted hadronic top jet tagging; and a recasted search for paired dijet resonances. In all cases, automating the ABCD method with machine learning significantly improves performance in terms of ABCD closure, background rejection and signal contamination.

Introduction
A key component of high energy physics data analysis, whether for Standard Model (SM) measurements or searches beyond the SM, is background estimation. While powerful simulations and first-principles calculations exist and are constantly improving, they still remain inadequate for the task of precisely estimating backgrounds in many situations. For example, events with a large number of hadronic jets have high-multiplicity SM backgrounds whose cross sections are difficult to estimate. Therefore, methods for data-driven background estimation remain a crucial part of the experimental toolkit. The idea behind all data-driven background estimation strategies is to extrapolate or interpolate from some control regions which are background dominated into a signal region of interest.
One classic (see e.g. Ref. [1]) data-driven background method which is used in a multitude [2][3][4][5][6] of physics analyses at the Large Hadron Collider (LHC) and elsewhere is the ABCD method. The idea of the ABCD method is to pick two observables f and g (for example, the invariant mass of a dijet system and the rapidity of that system) which are approximately statistically independent for the background, and which are effective discriminators of signal versus background. Simple thresholds on these observables partition events into four regions. Three of these regions, called B, C and D, are background dominated. The fourth, A, is the signal region. If the observables are independent then the background in the signal region can be predicted from the other three regions via

N_A^predicted = N_B N_C / N_D, (1.1)

where N_i is the number of events in region i. This setup is depicted schematically for signal and background distributions in Fig. 1.
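As a concrete illustration, the prediction in Eq. (1.1) is a one-liner; the toy counts below are invented and simply realize two independent Bernoulli cuts:

```python
# Minimal sketch of the ABCD background estimate, Eq. (1.1):
# N_A^predicted = N_B * N_C / N_D. Region labels follow the text: A is the
# signal region; B, C, D are the background-dominated control regions.

def abcd_prediction(n_b: float, n_c: float, n_d: float) -> float:
    """Predict the background count in region A from the three control regions."""
    if n_d <= 0:
        raise ValueError("Region D must contain events for the estimate to be defined.")
    return n_b * n_c / n_d

# Toy check with counts drawn from two independent Bernoulli cuts:
# Pr(f > f_c) = 0.2, Pr(g > g_c) = 0.1, 10000 background events in total.
n_total = 10_000
pf, pg = 0.2, 0.1
n_a = n_total * pf * pg            # f > f_c, g > g_c (signal region)
n_b = n_total * pf * (1 - pg)      # f > f_c, g < g_c
n_c = n_total * (1 - pf) * pg      # f < f_c, g > g_c
n_d = n_total * (1 - pf) * (1 - pg)

# For exactly independent cuts, the prediction closes exactly.
assert abs(abcd_prediction(n_b, n_c, n_d) - n_a) < 1e-9
```

For truly independent cuts the estimate closes identically; the rest of the paper is about engineering that independence for learned classifiers.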
Typically, the observables f and g for the ABCD method are chosen to be simple, physically well-motivated features such as mass, H T , missing E T , etc. Their independence is always ensured manually, e.g. by choosing features that are known physically to have little correlation or by trial-and-error. In some cases, independence can be guaranteed by using completely orthogonal sources of information, such as measurements from different sub-detectors or properties of independently produced particles. However, more often than not, the features are not 100% independent and one has to apply a residual correction derived from simulations. Ideally, this simulation correction has small uncertainties - either because the effect itself is small, or because the correction is robust. But such corrections, together with the fact that simple kinematic features are typically not optimal discriminants of signal versus background, generally limit the effectiveness of the ABCD method and the sensitivity of the analysis in question.
(See [8], however, for a proposal for extending the ABCD method using higher-order information when the features are not independent.) In this paper, we will explore the systematic application of deep learning to the ABCD method. Deep learning has already demonstrated impressive success in finding observables that are effective at discrimination and that are uncorrelated with other observables [64][65][66][67][68][69][70][71][72][73][74][75][76][77][78][79]. Building on these successes, we will aim to use deep learning to automate the selection of features used in the ABCD method, simultaneously optimizing their discrimination power while ensuring their independence.
The main tool we will use in automating the ABCD method will be a recently proposed method for training decorrelated deep neural networks [71]. This method uses a well-known statistical measure of non-linear dependence known as Distance Correlation (DisCo) [80][81][82][83]. DisCo is a function of two random variables (or samples thereof) and is zero if and only if the variables are statistically independent, otherwise it is positive. Therefore it can be added as a regularization term in the loss function of a neural network to encourage the neural network output to be decorrelated against any other feature. In [71] it was shown that DisCo decorrelation achieves state-of-the-art decorrelation performance while being easier and more stable to train than approaches based on adversarial methods. Therefore it is ideally suited to automating the ABCD method.
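Distance correlation is straightforward to compute from pairwise distances. Below is a minimal numpy sketch of the plug-in (biased) estimator for one-dimensional samples; the numerical thresholds in the checks are loose illustrative values, not results from the paper:

```python
import numpy as np

def distance_correlation(x, y) -> float:
    """Sample distance correlation of two 1-d samples (plug-in, biased estimator)."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    a = np.abs(x - x.T)                       # pairwise distance matrices
    b = np.abs(y - y.T)
    # Double-center: subtract row and column means, add back the grand mean.
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    dvar_x = (A * A).mean()
    dvar_y = (B * B).mean()
    if dvar_x * dvar_y == 0:                  # a constant sample carries no dependence
        return 0.0
    return float(np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y)))

rng = np.random.default_rng(0)
x = rng.normal(size=500)
z = rng.normal(size=500)
assert abs(distance_correlation(x, x) - 1.0) < 1e-9   # identical samples: dCorr = 1
assert distance_correlation(x, z) < 0.2               # independent samples: near 0
# dCorr also detects *nonlinear* dependence that linear correlation misses:
assert distance_correlation(x, x**2) > 0.3
```

The last assertion illustrates the key property exploited in the text: x and x² have zero linear correlation for a symmetric distribution, yet dCorr flags their dependence.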
We will propose two new ideas for automating the ABCD method, which we will call Single DisCo and Double DisCo, respectively. In Single DisCo, we will train a single neural network classifier on signal and background and use DisCo regularization to force it to be independent in the background of a second, fixed feature (such as invariant mass). In Double DisCo, we will train two neural network classifiers and use DisCo regularization to force them to be independent of one another.
We will study three examples to illustrate the effectiveness of these methods. The first example is a simple model where signal and background are drawn from three-dimensional Gaussian distributions. Here the aim is to understand many of the features of Single and Double DisCo in a fully controlled environment. The second example is boosted hadronic top tagging, where often sideband interpolation in mass is employed. For the ABCD method we treat a window selection on the mass as a classifier variable. Thus we use the invariant mass as the Single DisCo fixed feature, and we then show how Double DisCo can improve on this by combining mass with other information to produce more effective classification. Finally, we examine a search that currently uses the conventional ABCD method: the ATLAS paired dijet resonance search, motivated by RPV squark decays [84] (for a similar search by CMS, see [85]). We show that significant performance gains are possible using Single and Double DisCo.
In the course of our study of the ABCD method, we will uncover a hitherto unappreciated limitation of the method, which we call normalized signal contamination. Usually, practitioners are concerned with the overall signal-to-background ratio in the control regions; if this is small then they are usually satisfied. We point out that in fact another relevant quantity for the significance calculation is the signal-to-background ratio in the control regions relative or normalized to the signal-to-background ratio in the signal region. In other words, the actual requirement on signal contamination is that the ratios N_i,s / N_i,b in the control regions be small compared to N_A,s / N_A,b in the signal region (where N_i,s and N_i,b are the numbers of signal and background events in region i = A, B, C, D). In many analyses (e.g. [84]), the signal fraction in the signal region is quite small, meaning that even a small amount of signal contamination in the control regions can bias the p-values reported by the search. We will show that Single and Double DisCo not only improve the discrimination power and background closure of the ABCD method but can also significantly reduce the level of signal contamination at the same time.
This paper is organized as follows. Section 2 reviews the ABCD method and Sec. 3 describes how the method can be automated using deep learning. Numerical results for examples described above are presented in Sec. 4. The paper ends with conclusions and outlook in Sec. 5.

The ABCD method
The ABCD method starts with two features f and g. Imposing thresholds f_c and g_c divides the feature space into four rectangular regions, A, B, C and D, with corresponding event counts:

N_A,t = N_t Pr(f > f_c, g > g_c), N_B,t = N_t Pr(f > f_c, g < g_c),
N_C,t = N_t Pr(f < f_c, g > g_c), N_D,t = N_t Pr(f < f_c, g < g_c), (2.1)

where N_t is the total number of events of type t, t ∈ {signal (s), background (b), all (a)}, and Pr(·) is the probability. The regions B, C and D can be used to predict N_A:

N_predicted = N_B,a N_C,a / N_D,a. (2.2)

For the ABCD method to be valid, we would need N_A,b = N_predicted. There are two requirements for N_predicted to be accurate. First, the Bernoulli random variables f < f_c and g < g_c must be independent for the background in order to guarantee that

N_A,b = N_B,b N_C,b / N_D,b. (2.3)

To see this, note that (2.3) is equivalent to

Pr(f > f_c, g > g_c) = Pr(f > f_c) Pr(g > g_c), (2.4)

which is a definition of independence. While it is sufficient to have one set of thresholds, having a range over which independence holds adds robustness to the estimation procedure. If the ABCD method holds for all values of f_c and g_c, then f and g themselves must be independent. Note that this condition is stronger than requiring zero linear correlation. Two random variables can have zero linear correlation yet be nonlinearly dependent. In general, such a case would invalidate Eq. (2.3).

The second requirement for the ABCD method involves the signal and the background. Each region contains both,

N_i,a = N_i,b + N_i,s, (2.5)

and we quantify the signal contamination in region i by

δ_i ≡ N_i,s / N_i,b. (2.6)

In particular, if the signal contamination in regions B, C and D is large, then Eq. (2.2) will not hold. But what does large mean in this context? Typically, large signal contamination is taken to be an overall statement, i.e.

δ_i ≪ 1 (2.7)

for regions i = B, C, D. However, we will now show that in addition to this criterion, another relevant quantity is the normalized signal contamination

r ≡ (δ_B + δ_C − δ_D) / δ_A, (2.8)

and for the ABCD method to be valid, it must satisfy

r ≪ 1. (2.9)

Note that this is often a much stronger requirement than (2.7). It is not enough that the signal fractions in each control region are small - they must be small compared to the signal fraction in the signal region. In many searches (e.g. the RPV stop search in Sec. 4.3), signal to background can be quite small in the signal region, meaning that this can be a significant (and underappreciated) constraint on the ABCD method. To see why (2.9) is required, suppose that the ABCD method closes exactly, so that Eq. (2.3) holds, but there is some signal contamination in all four regions. Then,

N_predicted = N_A,b (1 + δ_B)(1 + δ_C) / (1 + δ_D) ≈ N_A,b (1 + δ_B + δ_C − δ_D) = N_A,b (1 + r δ_A). (2.10)

This will be compared with the number of events in region A, N_A,a = N_A,b (1 + δ_A), to decide whether there is an excess or not. In order to detect the signal in A, one needs Eq. (2.9) to be satisfied. Note that we are still assuming that δ_B,C,D ≪ 1 in order for the subleading terms in Eq. (2.10) to be negligible.
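A small numerical sketch (all numbers invented) makes the point of Eqs. (2.7)-(2.10) concrete: control-region contaminations that look small in absolute terms can still be large relative to δ_A:

```python
# Toy illustration of how "small" control-region contamination biases the
# ABCD estimate. All delta values below are invented for illustration;
# delta_i = N_{i,s} / N_{i,b} as in the text.

n_ab = 1000.0                                            # true background in region A
delta = {"A": 0.10, "B": 0.04, "C": 0.03, "D": 0.001}

# Assume exact background closure, N_{A,b} = N_{B,b} N_{C,b} / N_{D,b},
# so any bias comes purely from the signal leaking into B, C and D:
bias = (1 + delta["B"]) * (1 + delta["C"]) / (1 + delta["D"])
n_pred = n_ab * bias

# Leading-order normalized signal contamination, r = (dB + dC - dD) / dA:
r = (delta["B"] + delta["C"] - delta["D"]) / delta["A"]

assert n_pred > n_ab                              # the background is overestimated
assert abs(bias - (1 + r * delta["A"])) < 5e-3    # bias ≈ 1 + dB + dC - dD
assert 0.5 < r < 1.0   # here r = 0.69: most of a 10% signal excess is eaten
```

Each δ_i is individually at the few-percent level, comfortably satisfying (2.7), yet r is order one and the excess in A is largely absorbed into the background prediction.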
Another point is that generally δ_D can be neglected compared to δ_B and δ_C (as region D is diagonally opposite the signal region and its contamination should therefore be doubly suppressed). So we expect r > 0 and an overestimate of the background in the signal region. This will make it much harder to discover new physics.
Finally, let us make the connection between the normalized signal contamination and classifier performance. For fixed thresholds f_c and g_c, the signal (ε_s) and background (ε_b) efficiencies for each individual classifier can be computed as:

ε_s^f = Pr(f > f_c | s), ε_b^f = Pr(f > f_c | b), ε_s^g = Pr(g > g_c | s), ε_b^g = Pr(g > g_c | b). (2.11)

With these definitions and neglecting N_D,s, Eq. (2.8) can be re-written as

r ≈ (1 − ε_s^g) ε_b^g / [ε_s^g (1 − ε_b^g)] + (1 − ε_s^f) ε_b^f / [ε_s^f (1 − ε_b^f)]. (2.12)

Each term in Eq. (2.12) is essentially an inverse diagnostic odds ratio and, importantly, is minimized for a given signal efficiency when the background efficiency is as small as possible. This demonstrates that "classification performance" and "signal contamination" are synonymous in this context - the better a classifier is, the more likely it will be that there is a threshold which ensures a small relative signal contamination.
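The efficiency form of r, Eq. (2.12), can be cross-checked numerically. The sketch below assumes, for simplicity, that the cuts factorize for the signal as well as the background, so all region counts are products of per-classifier efficiencies (invented numbers):

```python
# Cross-check of the efficiency rewriting of r on invented efficiencies.
# Assumption for this sketch: f and g are independent for signal as well as
# background, so each region count factorizes into per-classifier efficiencies.

eps_s_f, eps_b_f = 0.6, 0.05   # Pr(f > f_c) for signal / background
eps_s_g, eps_b_g = 0.5, 0.10   # Pr(g > g_c) for signal / background
n_s, n_b = 500.0, 100_000.0

def counts(eps_f, eps_g, n):
    """Counts in A, B, C, D for one event class with factorizing cuts."""
    return {"A": n * eps_f * eps_g,
            "B": n * eps_f * (1 - eps_g),
            "C": n * (1 - eps_f) * eps_g,
            "D": n * (1 - eps_f) * (1 - eps_g)}

sig, bkg = counts(eps_s_f, eps_s_g, n_s), counts(eps_b_f, eps_b_g, n_b)
delta = {i: sig[i] / bkg[i] for i in "ABCD"}
r_exact = (delta["B"] + delta["C"] - delta["D"]) / delta["A"]

# Efficiency form: each term is an inverse diagnostic odds ratio; N_{D,s} is neglected.
r_approx = ((1 - eps_s_g) / eps_s_g * eps_b_g / (1 - eps_b_g)
            + (1 - eps_s_f) / eps_s_f * eps_b_f / (1 - eps_b_f))

# Up to the neglected delta_D term, the two expressions agree identically.
assert abs(r_exact + delta["D"] / delta["A"] - r_approx) < 1e-9
```

Shrinking either ε_b^f or ε_b^g at fixed signal efficiency shrinks the corresponding term, which is the "better classifier = less contamination" statement in the text.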
To illustrate these points, we show in Fig. 2 the effect of signal contamination on the p-value. The left plot shows the interplay between the relative signal contamination r, the number of events N_A in the signal region, and δ_A. For example, if the signal fraction in the signal region is δ_A = 10% and N_A = 1000, the true p-value is 0.0015, while the reported value assuming negligible signal contamination would be 0.03 or 0.1 with an unaccounted-for signal contamination of 4% or 6% in region B, respectively.
Correctly accounting for this signal contamination would require a signal-model-dependent ABCD estimation. This could be done (see e.g. [86]), but would be much more complicated than most applications of the ABCD method. Adding an uncertainty to account for potential signal contamination is also not ideal - this is shown in the bottom plot of Fig. 2. Once again, for N_A = 1000 and δ_A = 10%, the true p-value is 0.0015 and a signal contamination of 4% in region B results in a p-value of 0.03. Adding an uncertainty of 5% increases this to 0.16, and an uncertainty of 10% further increases the p-value to 0.29. So while this procedure would result in a conservative p-value, it means that potential discoveries would be masked.
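The dilution of a p-value by a contaminated background estimate can be sketched with a simple Gaussian significance estimate (an illustrative approximation, not the statistical treatment behind Fig. 2):

```python
import math

def gaussian_pvalue(n_obs: float, n_bkg: float) -> float:
    """One-sided p-value for an excess in the crude Gaussian approximation,
    with sqrt(n_bkg) as the background uncertainty (illustrative only)."""
    z = (n_obs - n_bkg) / math.sqrt(n_bkg)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Invented benchmark in the spirit of the discussion: 1000 observed events in A
# and a 10% signal fraction, compared against an ABCD estimate inflated by
# signal leaking into the control regions (here a 4% inflation).
n_obs = 1000.0
delta_a = 0.10
n_bkg_true = n_obs / (1 + delta_a)

p_true = gaussian_pvalue(n_obs, n_bkg_true)
p_biased = gaussian_pvalue(n_obs, n_bkg_true * 1.04)

assert p_true < 0.01          # a clear excess with the correct background
assert p_biased > 5 * p_true  # contamination substantially dilutes the excess
```

The exact numbers in Fig. 2 come from a proper statistical treatment, but the qualitative effect is the same: inflating the background estimate by only a few percent can wash out an otherwise significant excess.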

Automating the ABCD Method
Having described the requirements for the ABCD method (two strong classifiers that are independent for background), we now turn to the main idea of the paper: automating the ABCD method with machine learning. Typically, when the ABCD method is used in experimental analyses, the two features are chosen by hand, based on physical intuition. Usually the features are simple quantities, such as mass, H T , p T , or missing E T . In the remainder of the paper, we will investigate the benefits of allowing the ABCD features to be more complicated functions of the inputs. These functions will be obtained by training neural networks with suitable loss functions that ensure the ABCD objectives. We will see that machine learning has the potential to greatly improve the performance of the ABCD method.
The basic idea is that we want to train a classifier f (X) where X are the input features (either low level inputs such as four vectors or images, or high level inputs such as p T , mass, etc.) that is forced to be decorrelated against another classifier g(X). This will achieve the first ABCD requirement of independent features. If the two classifiers are both good discriminants, this will satisfy the second ABCD requirement.
One can imagine two versions of this idea, both of them new: 1. The second classifier is a simple, existing high-level variable (e.g. mass). In this case the problem is basically identical to the one that has been solved in the literature on decorrelation. We then just have to apply these approaches to the ABCD method.
2. The second classifier is also a neural network. In this case we need to train two neural networks simultaneously while keeping them decorrelated from one another. This requires us to go beyond the usual literature on decorrelation against a fixed feature.
For the Single DisCo ABCD method, we take the loss function to be the same as in [71]:

L[f] = L_classifier[f(X), y] + λ dCorr²_{y=0}[f(X), X_0], (3.1)

where X are the features used for classification, y ∈ {0, 1} are the labels, X_0 is the feature that one wants to be decorrelated from f(X) (X_0 could be part of X), and L_classifier is the classifier loss such as the commonly used binary cross entropy. The subscript y = 0 in the second term of Eq. (3.1) ensures that the decorrelation is only applied to the background (class 0). Furthermore, λ ≥ 0 is a hyperparameter that determines the decorrelation strength. The function dCorr²[f, g] is the squared distance correlation defined in [80][81][82][83] (see App. A). It has the property that 0 ≤ dCorr[f, g] ≤ 1 and dCorr[f, g] = 0 if and only if f and g are independent. For Single DisCo, g(X) = X_0.
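A non-differentiable numpy sketch of the Single DisCo loss of Eq. (3.1) (toy arrays, no training loop; a real implementation would compute the same quantities with a differentiable TensorFlow or PyTorch graph):

```python
import numpy as np

def dcorr2(u, v):
    """Squared sample distance correlation (plug-in estimator)."""
    a = np.abs(u[:, None] - u[None, :])
    b = np.abs(v[:, None] - v[None, :])
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return float((A * B).mean() / denom) if denom > 0 else 0.0

def single_disco_loss(f_out, y, x0, lam):
    """Binary cross entropy plus lam * dCorr^2[f, X0], background only."""
    eps = 1e-7
    f_out = np.clip(f_out, eps, 1 - eps)
    bce = -np.mean(y * np.log(f_out) + (1 - y) * np.log(1 - f_out))
    bkg = y == 0                     # decorrelate only on the background class
    return bce + lam * dcorr2(f_out[bkg], x0[bkg])

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=400)
x0 = rng.normal(size=400)
f_ind = rng.uniform(0.1, 0.9, size=400)   # toy output independent of x0
f_dep = 1 / (1 + np.exp(-x0))             # toy output driven entirely by x0

# With lam = 0 the loss is plain BCE; a large lam penalizes dependence on x0.
assert single_disco_loss(f_ind, y, x0, 0.0) <= single_disco_loss(f_ind, y, x0, 100.0)
assert single_disco_loss(f_dep, y, x0, 100.0) > single_disco_loss(f_ind, y, x0, 100.0)
```

Because the DisCo term is a smooth function of the batch outputs, the differentiable version can simply be added to the classifier loss and minimized by standard gradient descent.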
In practice, f is parameterized as a neural network and Eq. (3.1) is minimized using gradient-based methods. The distance correlation is computed for the batches of data used to stochastically estimate the gradient. In the limit of small numbers of events, the naive distance covariance computed by replacing expectation values with sample averages is a biased estimator of the true distance correlation. Analogously to the case of the sample variance (in which a factor of 1/(N − 1) instead of 1/N, where N denotes the minibatch size, is inserted to remove bias), there is an analytic low-N correction to the distance covariance that is unbiased [81,83]. Numerical results suggest that this correction is useful when N is low, but for sufficiently large training datasets with large enough batches, the correction has little impact on the results.
For the Double DisCo ABCD method, we use the loss function

L[f, g] = L_classifier[f(X), y] + L_classifier[g(X), y] + λ dCorr²_{y=0}[f(X), g(X)], (3.2)

where now f and g are two neural networks that are trained simultaneously. When λ = 0, the loss will be minimized when f = g is the optimal classifier (up to degeneracies). When λ → ∞, f and g will be forced to be independent even if one or both of them does not classify well at all. In practice, if λ is taken too large, the DisCo term will tend to overwhelm the training and poor classification performance will result. Thus there should be an optimal λ at some finite value, which can be determined by scanning over λ.
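The Double DisCo loss differs only in that the second argument of the DisCo term is itself a trained network. A numpy sketch analogous to the Single DisCo case (toy arrays, no training loop):

```python
import numpy as np

def dcorr2(u, v):
    """Squared sample distance correlation (plug-in estimator)."""
    a = np.abs(u[:, None] - u[None, :])
    b = np.abs(v[:, None] - v[None, :])
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return float((A * B).mean() / denom) if denom > 0 else 0.0

def bce(p, y, eps=1e-7):
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def double_disco_loss(f_out, g_out, y, lam):
    """Two classifier losses plus lam * dCorr^2[f, g] on background only."""
    bkg = y == 0
    return bce(f_out, y) + bce(g_out, y) + lam * dcorr2(f_out[bkg], g_out[bkg])

rng = np.random.default_rng(2)
y = rng.integers(0, 2, size=400)
f_out = rng.uniform(0.05, 0.95, size=400)
g_same = f_out                               # g identical to f: maximally dependent
g_ind = rng.uniform(0.05, 0.95, size=400)    # g independent of f

# A large lam heavily penalizes identical outputs and rewards independent ones,
# while lam = 0 reduces the loss to the sum of the two classifier losses.
assert double_disco_loss(f_out, g_same, y, 100.0) > double_disco_loss(f_out, g_ind, y, 100.0)
assert abs(double_disco_loss(f_out, g_same, y, 0.0)
           - bce(f_out, y) - bce(g_same, y)) < 1e-12
```

This makes the trade-off in the text explicit: at λ = 0 nothing stops f and g from collapsing onto the same classifier, while at large λ independence is enforced at the expense of classification power.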

Applications
This section explores the efficacy of Single and Double DisCo in some applications of the ABCD method.

Simple Example: Three-Dimensional Gaussian Random Variables
We begin with a simple example to build some intuition and validate our methods. Consider a three-dimensional space (X 0 , X 1 , X 2 ), where the signal and background are both multivariate Gaussian distributions. We choose the means µ and covariance matrices Σ for background and signal such that, for the background, all three features are centered at the origin and features X 0 and X 1 are correlated with each other but independent of X 2 , while for the signal, all three features are independent but are centered away from the origin. The first feature X 0 will play the role of the known feature for Single DisCo in Sec. 3. All of the neural networks presented in this section use three hidden layers with 128 nodes per layer. The rectified linear unit (ReLU) activation function is used for the intermediate layers and the output is a sigmoid function. A hyperparameter of λ = 1000 is used for both Single and Double DisCo to ensure total decorrelation. The Single DisCo training converged after 100 epochs while the Double DisCo training required 200 epochs. Other networks only needed ten epochs. The Double DisCo networks were trained using a single neural network with a two-dimensional output. All models were trained using TensorFlow [87] through Keras [88] with Adam [89] for optimization. Two million examples were generated with 15% used for testing. A batch size of 1% of the total was used for all networks to ensure an accurate calculation of the DisCo term in the relevant loss functions.
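The exact means and covariances used in the paper are not reproduced in this copy of the text; the sketch below generates a toy dataset with the stated structure, using illustrative parameter values:

```python
import numpy as np

# Toy dataset in the spirit of this section. The correlation coefficient 0.8
# and the signal shift 1.5 are invented values that only realize the stated
# structure: background X0, X1 correlated and independent of X2; signal
# uncorrelated and centered away from the origin.
rng = np.random.default_rng(0)
n = 50_000

mu_b = np.zeros(3)
cov_b = np.array([[1.0, 0.8, 0.0],
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
mu_s = np.array([1.5, 1.5, 1.5])
cov_s = np.eye(3)

bkg = rng.multivariate_normal(mu_b, cov_b, size=n)
sig = rng.multivariate_normal(mu_s, cov_s, size=n)

corr_b = np.corrcoef(bkg, rowvar=False)
assert corr_b[0, 1] > 0.7               # X0 and X1 strongly correlated
assert abs(corr_b[0, 2]) < 0.05         # X0 independent of X2
assert np.abs(sig.mean(0) - 1.5).max() < 0.05
```

With such a dataset, the behavior described next (a baseline classifier inheriting X 0 dependence through the X 0 -X 1 correlation) can be reproduced directly.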
We first consider two classifiers: a baseline classifier f BL (X 1 , X 2 ) trained only on X 1 and X 2 and a Single DisCo classifier f SD (X 1 , X 2 ) which includes a penalty for correlations between f SD and X 0 . The values of these classifiers for events drawn from the distributions are plotted in Fig. 3 against the X 0 , X 1 , or X 2 values of these events. We see that even though X 0 was not used in the training of the baseline, the classifier output is still correlated with X 0 because of the correlations between X 0 and X 1 . In contrast to the baseline classifier, the Single DisCo classifier is independent of both X 0 and X 1 and is simply a function of X 2 . Intuitively, it makes sense that a classifier that must be independent of X 0 must also be independent of X 1 . This is justified rigorously in Appendix B.
For Double DisCo, we train two classifiers f DD (X 0 , X 1 , X 2 ) and g DD (X 0 , X 1 , X 2 ) according to the Double DisCo loss function. The results are illustrated in Fig. 4. The first classifier depends mostly on X 2 and the second classifier depends mostly on X 0 and X 1 . However, the residual dependence on all three observables is not a deficit of the training procedure: even though the three random variables are separable into two independent subsets (X 0 , X 1 ) and X 2 , the two classifiers learned by Double DisCo are non-trivial functions of all three variables. There is a large freedom in choosing the two functions f DD and g DD with a very small distance correlation and also excellent classification performance. Evidently, Double DisCo prefers to partition the information differently than the naive partitioning in order to achieve better classification performance.

Fig. 5 shows the performance of the Single and Double DisCo classifiers. The curve for the ABCD method is constructed by scanning 100 values of independent thresholds on the two features, evenly spaced in percentile of one classifier or the other to ensure a fixed signal efficiency. Above 50% signal efficiency, the ABCD Double DisCo has nearly the same performance as the fully supervised classifier using all of the available information. The Single DisCo performance is much lower than the Double DisCo performance and is comparable to the best of the two Double DisCo classifiers. The right plot of Fig. 5 demonstrates that Double DisCo is not only more effective at rejecting background, but it also has lower signal contamination.

[Figure 3 caption: Scatter plots showing the relationship (or lack thereof) between the three random variables X 0 , X 1 , and X 2 and (1) a baseline classifier f BL (X 1 , X 2 ) trained on X 1 and X 2 with no regularization, and (2) a classifier f SD (X 1 , X 2 ) trained with the Single DisCo loss function that penalizes correlations with X 0 . Only the background events are shown in these plots. The solid lines are the averages of the classifiers over events with the same value of X 0 , X 1 or X 2 ; in the third panel, the scatter of the Single DisCo classifier is already a line, so no average is needed.]

[Figure 5 caption: Left: performance of the ABCD method with Single and Double DisCo. In the Single DisCo case, one of the two classifiers is simply a NN trained with only X 0 (marked 'X 0 only' in the legend). For comparison, the left plot also shows the performance of the two Double DisCo functions separately, the Single DisCo function on its own, as well as a fully supervised classifier using all the available information at once. Right: a scatter plot between background rejection and the normalized signal contamination for ABCD closure within 20%.]

Boosted Tops
Next we turn to a physical example: boosted, hadronically decaying, top quarks. When top quarks are highly boosted, their hadronic decay products can be collimated into a single large jet, and jet substructure methods are often necessary to distinguish them from QCD jet backgrounds [90]. One can estimate these backgrounds using sidebands in the jet mass around the top mass m t . For the application of Single and Double DisCo, we first reframe this estimation as an ABCD method and map the mass to a variable for which the signal peaks at 1 and the background peaks at a lower value. For our studies we use the community top tagging comparison sample [11,44]. There are 2 million jets total, 1 million each of signal (top jets) and background (light quark and gluon QCD jets). Of these, half are used for training and the other half for validation. We compute the set of high level features suggested by [41]: the jet mass m, the transverse momentum p T , and eleven N-subjettiness variables τ N a . Here, the τ N a are the subjettiness variables introduced in [91,92] and are computed using fastjet [93]. This set of 13 variables is a complete basis for 5-body phase space and therefore it provides a complete description of the physics at the parton level [41,42,47,94]. It also offers a useful [94] feature space for modelling the top quark jets and inclusive jets after hadronization. Histograms of these features for signal and background are presented in Fig. 6. All the features are rescaled to be between 0 and 1. The neural network specification is 3 hidden layers of 64 nodes each, ReLU activations, and batch normalization after the first hidden layer. We train for 200 epochs with a fixed learning rate of 10 −3 and the default Adam optimizer. We use a large batch size of 10k to ensure an accurate DisCo sampling estimate.
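The precise mass-to-score mapping is elided in this copy of the text. A hypothetical mapping with the stated behavior (signal peaking at 1 near m t , background peaking lower) is a Gaussian bump centered on the top mass; both the functional form and the width below are assumptions made purely for illustration:

```python
import math

M_TOP = 172.5   # GeV; the width sigma below is purely illustrative

def mass_score(m: float, sigma: float = 20.0) -> float:
    """Hypothetical map from jet mass to [0, 1], peaking at 1 for m = m_t."""
    return math.exp(-0.5 * ((m - M_TOP) / sigma) ** 2)

assert mass_score(M_TOP) == 1.0                      # signal peak at 1
assert mass_score(50.0) < mass_score(150.0) < 1.0    # QCD-like masses score lower
```

Any monotone-in-|m − m t | mapping of this kind turns the usual mass-window selection into a one-sided threshold, which is what the ABCD construction needs.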
For Single DisCo, we train a single neural network on just the subjettiness variables (we could have included m and p T too with little change). For Double DisCo, we train two neural networks on all the features (m, p T , and the subjettiness variables). The neural network specifications, feature preprocessing, and training details are all the same for Single and Double DisCo. However, for Double DisCo, in addition to the usual DisCo loss term described in Eq. (3.2), we include a second DisCo term which only takes the tail of the neural network outputs (again for background only) as inputs. This was found to help with the stability of the ABCD prediction at lower signal efficiencies, which can be sensitive to the extreme tails of the background. For the tail we required the simultaneous cuts y 1 > (y 1 ) bg,50 and y 2 > (y 2 ) bg,50 , where y 1,2 are the outputs of the two neural networks and "bg,50" refers to the 50 th percentile of the background distributions. Values of λ larger than 200 tended to destabilize the training. We show in Fig. 7 the background rejection at 30% signal efficiency vs. the normalized signal contamination r defined in Eq. (2.8), for every epoch, DisCo parameter, and value of rectangular cuts on the two classifiers that achieves the required signal efficiency (same method as Sec. 4.1), subject only to the requirement that the ABCD closure for the background is accurate to within 10%, i.e. |N predicted / N A,b − 1| < 0.1. We see that Double DisCo is able to achieve both higher background rejection and significantly lower signal contamination than Single DisCo. Fig. 8 shows the "best" models for Single DisCo and Double DisCo, where "best" corresponds to an epoch and λ that robustly reaches the upper left corner of Fig. 7.
Here each point in the plot represents a choice of the rectangular cut that achieves 30% signal efficiency. We see that both Single DisCo and Double DisCo are able to achieve accurate ABCD closure and low signal contamination across a wide range of rectangular cuts.
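The scan over rectangular cuts at a fixed overall signal efficiency can be sketched as follows (toy uniform scores; the sketch assumes the two classifier outputs are approximately independent for signal, so the per-classifier efficiencies multiply):

```python
import numpy as np

def cuts_at_signal_eff(f_sig, g_sig, target_eff, n_scan=50):
    """Scan pairs of rectangular cuts (f_c, g_c) that jointly keep
    `target_eff` of the signal, assuming roughly independent signal scores."""
    pairs = []
    for eff_f in np.linspace(target_eff, 1.0, n_scan):
        eff_g = target_eff / eff_f
        if eff_g > 1.0:
            continue
        f_c = np.quantile(f_sig, 1.0 - eff_f)   # cut keeping eff_f of signal in f
        g_c = np.quantile(g_sig, 1.0 - eff_g)   # cut keeping eff_g of signal in g
        pairs.append((f_c, g_c))
    return pairs

rng = np.random.default_rng(3)
f_sig = rng.uniform(size=100_000)   # toy signal scores for the two classifiers
g_sig = rng.uniform(size=100_000)

for f_c, g_c in cuts_at_signal_eff(f_sig, g_sig, 0.30):
    joint_eff = np.mean((f_sig > f_c) & (g_sig > g_c))
    assert abs(joint_eff - 0.30) < 0.02   # each cut pair keeps ~30% of the signal
```

Each surviving cut pair then defines one candidate ABCD configuration, to be kept or discarded according to the closure and contamination requirements described above.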
Next we turn to the question of what Single and Double DisCo learned - specifically, how the available information was used by the individual NNs. Shown in Fig. 9 are a number of ROC curves. These include ROC curves for mass, for the individual classifiers in Single and Double DisCo, as well as for additional NN classifiers obtained from training simple DNNs on various combinations of mass and NN1, NN2 from Double DisCo.
A first observation is that one of the Double DisCo classifiers (g) outperforms all the other individual classifiers without explicitly added mass information for all values of the signal efficiency. The next best performance is achieved by the Single DisCo classifier, followed by the second of the Double DisCo classifiers (f). Jet mass by itself is very effective for loose selections (corresponding to high signal efficiency). This can be understood from the good separation observed in Fig. 6 (top left). However, for tighter selections additional substructure information is needed.
Combining mass with the stronger of the Double DisCo classifiers (g) does not strongly alter its performance. This implies that the information contained in mass is learned by this NN. However, g clearly outperforms mass, meaning that it contains more features than just mass. On the other hand, combining mass with the weaker Double DisCo classifier (f) dramatically improves it - it becomes almost, but not quite, optimal. This is to be expected, as f is forced to be independent of g for background examples. If g contains the mass information completely, then f should be mostly independent of mass, and adding mass to f should result in a major performance boost.
Finally, there is no real difference between a combination of the two Double DisCo classifiers (f + g), a further combination also including the mass (mass + f + g), and a direct training on all input features. This further confirms that the mass information has been fully absorbed by f + g - specifically by g, via the argument above. The maximally inclusive mass + f + g classifier of course should not be used as input to the ABCD method. However, we can compare its performance to results on the same dataset in Ref. [11]. A classifier based on multi-body N-subjettiness trained following the procedure suggested in Ref. [94] achieved a background rejection of up to around 1/900 for a signal efficiency of 30%. We observe a slightly weaker 1/700, which is to be expected as a smaller number of N-subjettiness observables is used as input here.
In the scatter plots of the Double DisCo discriminators in Fig. 10, we again observe the larger discrimination power of g compared to f. Looking at the top left distribution, we indeed see no dependence of f on the mass, while in the top right a clear correlation is visible for g. On the other hand, in the bottom left, we see a trend between f and τ 32 , which encodes the degree to which the jet is compatible with a 3-prong substructure. This information is largely not learned by g.
We conclude that Double DisCo can do better than Single DisCo because it is partitioning the information differently than just mass versus everything else.

RPV SUSY
For our third example, we consider an actual "real-life" application of the ABCD method on LHC data: the √ s = 13 TeV ATLAS search for paired dijet resonances [84]. Similar searches were conducted by CMS [85] and by both experiments at √ s = 8 TeV [95,96]. These searches were motivated by pair production of identical squarks which each decay promptly to two jets via RPV couplings. For background estimation, these searches all used the standard ABCD method. In this section we will describe our recast of this search and the performance gains derived from training Single and Double DisCo on it.
The ATLAS search consisted of the following steps: • Preselection: Events are required to have at least four jets with p T > 120 GeV and |η| < 2.4. The leading four such jets are used to form two squark candidates based on nearest proximity in ∆R = √((∆φ)² + (∆η)²). The minimum ∆R from the resulting pairings is defined as ∆R min , and the two dijet masses are used to form the average mass m avg = (m dijet 1 + m dijet 2 )/2 and the fractional mass asymmetry A mass = |m dijet 1 − m dijet 2 |/(2 m avg ). Events with m avg < 255 GeV must have ∆R min < 0.72 − 0.002(m avg /GeV − 255) and events with m avg ≥ 255 GeV must have ∆R min < 0.72 − 0.0013(m avg /GeV − 255).
• Final selection: ATLAS performs counting experiments in successive windows of m_avg, and for background estimation uses the ABCD method in |cos θ*| and A_mass, where θ* is the polar angle of one of the squarks in the squark-squark center-of-mass frame. The signal region is defined as A_mass < 0.05 and |cos θ*| < 0.3.

ATLAS ended up setting a limit at approximately m_squark = 500 GeV, so we also focus our analysis on this value of the squark mass. We repeat the preselection cuts, but instead of the final selection on m_avg, A_mass and cos θ*, we feed a list of inputs to Single and Double DisCo to learn the optimal features. The inputs are:

∆R_min, m_avg, cos θ*, A_mass, z_12, z_34, ∆R_12, ∆R_34, m_12, m_34, ∆η, ∆φ, p_T,12, p_T,34, (4.6)

where z_12 (z_34), ∆R_12 (∆R_34), m_12 (m_34) and p_T,12 (p_T,34) are, respectively, the p_T of the subleading jet divided by the sum of the transverse momenta of both jets, the opening angle between the two jets, the invariant mass of the two jets, and the p_T of the dijet system, for the stop dijet pair containing the leading small-radius jet (and for the other stop dijet pair). Histograms of these features are shown in Fig. 11. All features are rescaled to the range [0, 1] before being fed to the NNs. For Single DisCo we use cos θ* rather than A_mass as the fixed variable X_0 (cos θ* is the stronger of these two features in the ATLAS RPV squark analysis) and feed everything else to the NN classifier. For Double DisCo we feed everything to the two NN classifiers.

Squark pair events and multijet events are generated with Pythia 8.230 [97, 98] at a center-of-mass energy of √s = 13 TeV, interfaced with Delphes 3.4.1 [99] using the default CMS run card. Jets are clustered using the anti-k_t algorithm [100] with radius parameter R = 0.4, as implemented in FastJet 3.2.1 [93, 101].
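To make the preselection above concrete, here is a small NumPy sketch of one plausible reading of the ∆R-based pairing and of the m_avg and A_mass observables. The pairing criterion (minimizing the summed ∆R over the three possible pairings), the helper names, and the convention A_mass = |m_1 − m_2|/(2 m_avg) are our assumptions for illustration, not code from the analysis:

```python
import itertools
import numpy as np

def four_vec(pt, eta, phi, m):
    """(E, px, py, pz) from collider kinematics."""
    px, py = pt * np.cos(phi), pt * np.sin(phi)
    pz = pt * np.sinh(eta)
    E = np.sqrt(m**2 + px**2 + py**2 + pz**2)
    return np.array([E, px, py, pz])

def delta_r(j1, j2):
    """Jets are (pt, eta, phi, m); wrap delta-phi into [-pi, pi]."""
    dphi = np.mod(j1[2] - j2[2] + np.pi, 2 * np.pi) - np.pi
    return np.hypot(j1[1] - j2[1], dphi)

def inv_mass(v1, v2):
    E, px, py, pz = v1 + v2
    return np.sqrt(max(E**2 - px**2 - py**2 - pz**2, 0.0))

def pair_jets(jets):
    """Among the three ways to split the four leading jets into two
    dijet candidates, pick the pairing with the smallest summed dR
    (one reading of 'nearest proximity in dR'), then return
    dR_min, m_avg and A_mass = |m1 - m2| / (2 m_avg)."""
    pairings = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]
    best = min(pairings,
               key=lambda p: sum(delta_r(jets[a], jets[b]) for a, b in p))
    dr_min = min(delta_r(jets[a], jets[b]) for a, b in best)
    masses = [inv_mass(four_vec(*jets[a]), four_vec(*jets[b]))
              for a, b in best]
    m_avg = 0.5 * (masses[0] + masses[1])
    a_mass = abs(masses[0] - masses[1]) / (2 * m_avg)
    return dr_min, m_avg, a_mass
```

By construction A_mass lies in [0, 1], so the signal-region cut A_mass < 0.05 selects nearly mass-degenerate dijet pairs.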
1M signal events and 10M background events were generated, of which about 100k signal events and 60k background events pass the preselection. In order to ensure a high event selection efficiency for the background, events are generated using 2 → 3 matrix elements with a minimum separation of R = 0.8 and a minimum p_T of 100 GeV for the softest parton and 200 GeV for the hardest parton. Signal events are produced using the SLHA [102, 103] card from the recent ATLAS search [84, 104].

Table 2. Relative fractions f_i of data in the regions i = D, A, F and C used in the ATLAS RPV SUSY analysis and our recast, and signal-to-background ratios δ_i in each region.
The validation of our recasting of the ATLAS analysis is shown in Tables 1 and 2. In the former we show the relative signal efficiencies after successive cuts. In the latter we show the relative fractions f_i of data in the ATLAS regions i = D, A, F, C (for ATLAS, D is the SR) — relative fractions, since we do not attempt to get the overall normalizations of our simulations correct — and the signal-to-background ratio δ_i in each region. Following ATLAS, the data-fraction counts are taken after the inclusive selection with no mass-window cut, while the signal-to-background ratios are taken after the mass-window cut. Overall, we see excellent agreement between the ATLAS numbers and our recasted numbers.
For training the NNs, we use 100k signal and 360k background events, while the validation sample consists of 25k signal and 250k background events. In the classifier loss, we rebalance the signal and background contributions as if they were 50/50.
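The 50/50 rebalancing can be implemented with simple per-event weights; a minimal sketch (the function name is ours):

```python
import numpy as np

def balanced_weights(labels):
    """Per-event weights making signal (label 1) and background
    (label 0) contribute equally (50/50) to the classifier loss,
    regardless of the raw sample sizes (here e.g. 100k signal vs
    360k background training events)."""
    labels = np.asarray(labels)
    w = np.empty(len(labels), dtype=float)
    for cls in (0, 1):
        mask = labels == cls
        w[mask] = 0.5 * len(labels) / mask.sum()
    return w
```

The weights sum to the total number of events, so the overall loss scale is unchanged while each class carries half the total weight.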
We used the same hyperparameters as in the top tagging example. (We also explored using 128 nodes per hidden layer but found that it did not help.) For the DisCo parameter we scanned λ = 10, 20, 30, 40, 60, 100. Unlike in the top tagging example, we do not add the additional DisCo term sensitive to the tails of the background distributions when training Double DisCo; because the background rejections in this case were not as high as for top tagging, the additional term was found not to help. The comparison of Single and Double DisCo is shown in Fig. 12. As in the top tagging section, we have plotted every epoch, every rectangular cut and every value of the DisCo parameter satisfying the 10% accuracy condition on the ABCD prediction. This shows the performance of the models in the plane of R_10 (background rejection factor at 10% signal efficiency) vs. total fractional signal contamination. We see that while Double DisCo cannot surpass Single DisCo in terms of raw performance (as measured by R_10), it can achieve dramatically lower signal contamination for roughly the same R_10.
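As a concrete illustration of the closure test on the ABCD prediction, the following sketch (our own; here region A denotes the signal region, with B and C failing exactly one cut and D failing both) computes the estimate N_B N_C / N_D on background-only scores and its relative non-closure, which the 10% condition bounds:

```python
import numpy as np

def abcd_closure(f, g, cf, cg):
    """Given background-only scores from two classifiers f and g and
    rectangular cuts (cf, cg), compare the ABCD estimate of the
    signal-region yield, N_B * N_C / N_D, with the true yield N_A."""
    f, g = np.asarray(f), np.asarray(g)
    hi_f, hi_g = f > cf, g > cg
    n_a = np.sum(hi_f & hi_g)    # signal region: pass both cuts
    n_b = np.sum(hi_f & ~hi_g)   # control regions: fail exactly one
    n_c = np.sum(~hi_f & hi_g)
    n_d = np.sum(~hi_f & ~hi_g)  # fail both
    pred = n_b * n_c / n_d
    return n_a, pred, abs(pred / n_a - 1.0)  # yield, estimate, non-closure
```

For statistically independent classifiers the non-closure vanishes up to Poisson fluctuations, which is exactly what the DisCo term is pushing towards.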
We have also included scans over the features used in the ABCD method of the ATLAS RPV search; these are the green points in Fig. 12. We note that ATLAS had significant normalized signal contamination with their selection (40-80%). Both Single and Double DisCo offer a marked improvement in both signal contamination and background rejection compared to the standard ABCD method with manually-chosen high-level features.

Conclusions
Estimating backgrounds is essential for every experimental analysis in particle physics. One of the most well-established data-driven techniques for background estimation is the ABCD method. In this paper we have re-examined the criteria for the ABCD method to be effective and proposed a way to find the variables used to establish the ABCD regions using machine learning.
A general observation we make in this paper is that the quality of the ABCD background estimate is driven by the signal contamination in the control regions normalized to the signal fraction in the signal region. This observation is independent of any machine-learning approach to determining the features. We argue that controlling this normalized signal contamination should become a default step in applying the ABCD method, since neglecting it can lead to incorrect, and typically overly conservative, p-values.
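To illustrate the point numerically, a toy sketch (our own, not the paper's derivation): signal leaking into the control regions biases the ABCD estimate, and to first order in small signal fractions the relative bias is δ_B + δ_C − δ_D, which must be compared against the signal fraction expected in the signal region:

```python
def abcd_bias(bkg, sig):
    """ABCD estimate for signal region 'A' built from the observed
    (background + leaked signal) counts in control regions B, C, D.
    bkg, sig: dicts of true counts per region.
    Returns (estimate, true background in A, relative bias)."""
    obs = {r: bkg[r] + sig[r] for r in "BCD"}
    est = obs["B"] * obs["C"] / obs["D"]
    return est, bkg["A"], est / bkg["A"] - 1.0

# Background counts that close exactly (5000 * 4000 / 20000 = 1000),
# plus a small signal leak into the control regions:
bkg = {"A": 1000.0, "B": 5000.0, "C": 4000.0, "D": 20000.0}
sig = {"A": 0.0, "B": 100.0, "C": 80.0, "D": 50.0}
est, true_bkg, bias = abcd_bias(bkg, sig)
```

Here the background prediction is inflated by a few percent even though the absolute leaked yields are small, showing why contamination must be judged relative to the expected signal fraction in A rather than in absolute terms.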
Regardless of how one estimates contamination of the background samples, a necessary condition for the ABCD method to work is the availability of two independent classifiers. These classifiers are usually found by guessing observables that, on physical grounds, seem like they would be independent, and then verifying their independence with simulations or validation regions. Such a procedure is by no means guaranteed to yield optimal results. Indeed, observables designed for classification, whether by hand or by machine, often have better discrimination power than observables chosen to be independent. However, optimal observables aim to make maximal use of the available information and will in general exhibit complex dependencies on all other observables.
In this paper, we proposed to use machine learning to optimize the ABCD method. We considered two use cases: 1) Single DisCo, where a first variable (such as mass) is fixed and a second is learned to be decorrelated with it while optimizing discrimination, and 2) Double DisCo, where both variables are learned. In both cases, our machine learning approach builds on the DisCo loss term, a recently developed method for automated decorrelation. This technique allows for the automated construction of a robust data-driven background estimate, assuming a specific signal model.
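A minimal sketch of the Double DisCo objective under our reading of the setup, in NumPy rather than an autodiff framework and with the decorrelation term evaluated on background events only (all function names are ours):

```python
import numpy as np

def bce(p, labels, w):
    """Weighted binary cross-entropy on predicted probabilities p."""
    eps = 1e-7
    p = np.clip(p, eps, 1 - eps)
    return -np.average(labels * np.log(p) + (1 - labels) * np.log(1 - p),
                       weights=w)

def dcorr2(x, y):
    """Squared sample distance correlation (biased estimator)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    return (A * B).mean() / np.sqrt((A * A).mean() * (B * B).mean())

def double_disco_loss(pf, pg, labels, weights, lam):
    """L = BCE(f) + BCE(g) + lambda * dCorr^2(f, g) on background.
    pf, pg are the outputs of the two learned classifiers."""
    bkg = labels == 0
    return (bce(pf, labels, weights) + bce(pg, labels, weights)
            + lam * dcorr2(pf[bkg], pg[bkg]))
```

Increasing λ trades classification power of each network against statistical independence of the pair, which is the knob scanned in the examples above.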
We considered three examples: 1) a simple model of correlated random variables that demonstrates how Single and Double DisCo work, 2) boosted top tagging, and 3) an RPV squark search based on an existing ATLAS analysis. We found that while Single DisCo offers competitive performance in terms of pure background rejection, Double DisCo achieves lower signal contamination levels in both of the physical examples considered. We note that while DisCo was used to demonstrate decorrelation in this paper, the general idea can be combined with any decorrelation method [64-79], and the best approach may be application-specific.
On the surface, one advantage the traditional ABCD method has over the proposed automated approaches is that it is largely signal-model independent. However, even then, it is necessary to explicitly verify low signal contamination for all considered models using simulations. On the other hand, the training of Single DisCo or Double DisCo can be extended to a cocktail of signal models or parametrized as a function of the considered signal [105, 106].
While the Single and Double DisCo approaches achieve excellent performance, even better sensitivity might be obtained by optimizing the necessary criteria of low signal contamination and good ABCD closure more directly. We argued in earlier sections that the Single and Double DisCo loss qualitatively capture these requirements, but direct optimization of the conditions is challenging as they cannot be readily cast in a differentiable form. One might, for example, try an iterated learning approach or one based on reinforcement learning, where the final p-value for ABCD searches is used as a score. Further studies in this direction are left to future work.
Finally, it is important to consider the task of background estimation in the broader context of analysis optimization. A variety of methods have been proposed to directly optimize analysis sensitivity, including uncertainties [107-110]. Background estimation is a key part of analysis design, and the ABCD method could be integrated with such approaches in order to further optimize the overall discovery potential. An orthogonal approach is to construct searches for new physics in a model-independent way [12, 78]. Such searches will also require robust and automated data-driven background predictions and, at least partially, can be trained with a Single or Double DisCo method.
In summary, we are able to increase the discovery potential of physics analyses by enabling robust background estimates for more powerful classifiers. This improvement is made possible by clearly defining the objectives and then using automated tools to optimize a parametric function to achieve them. The present work shows that even time-tested and widely deployed analysis methods can benefit from systematic optimization.
The distance correlation is then defined analogously to the usual correlation:

dCorr²(X, Y) = dCov²(X, Y) / √(dCov²(X, X) dCov²(Y, Y)). (A.2)
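A minimal NumPy implementation of the (biased) sample version of this quantity, useful for sanity checks; the double-centering of the pairwise distance matrices follows the standard construction of Székely et al.:

```python
import numpy as np

def dcorr2(x, y):
    """Squared sample distance correlation of 1-D arrays x and y
    (biased estimator; O(n^2) memory)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])   # pairwise distance matrices
    b = np.abs(y[:, None] - y[None, :])
    # Double-center: subtract row and column means, add grand mean.
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    return dcov2 / np.sqrt((A * A).mean() * (B * B).mean())
```

Unlike Pearson correlation, this quantity vanishes (in the large-sample limit) only for statistically independent variables, which is why it is a suitable penalty for enforcing the independence the ABCD method requires.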

B Single DisCo in the Gaussian Case
In Sec. 4.1, we observed that for the simple model with three Gaussian random variables X_0, X_1 and X_2, the Single DisCo classifier f(X_1, X_2), trained to be independent of X_0 (which is correlated with X_1 but not X_2), is only a function of X_2 and does not depend on X_1. The purpose of this appendix is to prove this.

We start by rotating from (X_0, X_1, X_2) into another set of three Gaussian random variables that are mutually independent: X_0, W and X_2, with X_1 = αX_0 + βW, where α and β depend on ρ_b and W is independent of (X_0, X_2). Then we can define h(X_0, W, X_2) ≡ f(αX_0 + βW, X_2). Let Q = (W, X_2). Suppose that h(X_0, Q) and X_0 are independent. Then for all sets A and B:

Pr[h(X_0, Q) ∈ A and X_0 ∈ B] = Pr[h(X_0, Q) ∈ A] × Pr[X_0 ∈ B]. (B.1)

For any B, define A_B = {h(x_0, q) : x_0 ∈ B, for all q}. Then the probability that h(X_0, Q) ∈ A_B given X_0 ∈ B is unity:

Pr[h(X_0, Q) ∈ A_B and X_0 ∈ B] = Pr[h(X_0, Q) ∈ A_B | X_0 ∈ B] × Pr[X_0 ∈ B] = Pr[X_0 ∈ B], (B.2)

and so Eq. (B.1) with A = A_B simply reduces to Pr[h(X_0, Q) ∈ A_B] = 1. Since this holds for arbitrarily small sets B, h(x_0, q) cannot depend on x_0. Therefore, we conclude that if h(X_0, Q) and X_0 are independent, then h does not depend on X_0. Since α ≠ 0 (as ρ_b ≠ 0), the only way for h not to depend on X_0 is for f not to depend on X_1.
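The rotation at the start of the proof is easy to check numerically; a small sketch (ρ_b = 0.6 is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.6                              # assumed correlation rho_b of X0 and X1
alpha, beta = rho, np.sqrt(1 - rho**2)

n = 200_000
x0, w, x2 = rng.normal(size=(3, n))    # mutually independent standard Gaussians
x1 = alpha * x0 + beta * w             # the rotation used in the proof

corr = np.corrcoef([x0, x1, x2])
# x1 carries correlation rho with x0, while x2 carries none; any function
# of x2 alone is therefore independent of x0, consistent with the Single
# DisCo classifier reducing to f(X1, X2) = f(X2).
```

With α² + β² = 1 the marginal of X_1 remains standard normal, so the rotation only reshuffles dependence without changing the one-dimensional distributions.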