Feature Selection with Distance Correlation

Choosing which properties of the data to use as input to multivariate decision algorithms -- a.k.a. feature selection -- is an important step in solving any problem with machine learning. While there is a clear trend towards training sophisticated deep networks on large numbers of relatively unprocessed inputs (so-called automated feature engineering), for many tasks in physics, sets of theoretically well-motivated and well-understood features already exist. Working with such features can bring many benefits, including greater interpretability, reduced training and run time, and enhanced stability and robustness. We develop a new feature selection method based on Distance Correlation (DisCo), and demonstrate its effectiveness on the tasks of boosted top- and $W$-tagging. Using our method to select features from a set of over 7,000 energy flow polynomials, we show that we can match the performance of much deeper architectures using only ten features and two orders of magnitude fewer model parameters.


I. INTRODUCTION
Recently there has been enormous progress in training supervised deep learning classifiers to perform object and event identification at the LHC. Deep learning classifiers that make use of low-level information (such as the four-vectors of all the reconstructed particles in a jet or event) have been shown to achieve impressive performance gains over cut-based methods and shallow classifiers trained on high-level kinematic features, translating directly into better physics performance [1][2][3].
However, all of these high-performing deep learning methods are black boxes, and there has been a parallel effort in AI interpretability/explainability to understand "what the machine learns" [25][26][27][28][29]. Recently, an important step in this direction came from [30], which developed a new forward feature selection technique to efficiently scan through more than 7,000 energy flow polynomials (EFPs) [31], i.e. quantities that measure the energy distribution inside a jet, in order to identify a small number (typically of order ten) that together reproduce as closely as possible the performance of a state-of-the-art black-box NN classifier. Their method relied on a score called "average decision ordering" (ADO), which measures how often a given feature has the same decision ordering as the reference classifier. This method has been applied to W-jets [30], muons [32], electrons [33], and semi-visible dark-jets [34].
Aside from shedding light on "what the machine learns", constructive feature selection methods can have several other interesting applications. Classifiers based on high-level features (HLFs) could be more robust against domain shifts and easier to calibrate with collider data (as a smaller number of distributions needs to be validated). Also, a classifier trained on only a few inputs could be made much more lightweight (far fewer parameters), leading to less intensive training and faster evaluation time. This could have important applications to ML with microsecond inference times, e.g. for the LHC trigger. Finally, even if an attempt to replicate a state-of-the-art deep learning classifier with a set of HLFs falls short, it might have important physics implications, as it could teach us that the set of HLFs being used is incomplete and does not fully capture all the correlations in the data.
In this paper, inspired by [30], we present a new method for forward feature selection. It is based on the measure of statistical independence called "distance correlation" (DisCo) [35][36][37][38], which was first used in the HEP literature to decorrelate top taggers against jet mass [39], and was subsequently applied to ABCD background estimation [40] and anomaly detection [41]. We use DisCo (instead of ADO) to measure how relevant (statistically dependent) a given set of features is for the classifier output. We show that our DisCo-based forward feature selection method outperforms [30] on both hadronic W tagging and hadronic top tagging, in the sense that it selects features more efficiently, ultimately achieving better performance with fewer features. The upshot is that on top tagging, our method selects as few as 9 EFPs (from the same sample of 7,000+ as [30]), and by training a very compact DNN on this small number of EFPs, we achieve nearly state-of-the-art performance, matching the rejection power of ParticleNet-Lite [14] with only a fraction of the parameters.
Importantly, our method does not require a previously-obtained reference classifier, but can be trained equally well ab initio, using the "truth labels" (0 for background and 1 for signal). This is unlike the method of [30], whose performance suffered when trained on truth labels. Therefore, our DisCo-based forward feature selection method is able to operate in two conceptually different modes: (1) as an ab-initio feature selector that aims to produce the best possible classifier given a set of features; or (2) as a feature selector that aims to "explain" a previously-obtained "black box" classifier.
Note that the proposed forward or constructive feature selection is very different from backward elimination methods, which iteratively remove features starting from the full set, or from feature attribution methods, which use Shapley values [42][43][44][45][46][47][48][49][50][51][52] to assign contributions of each feature in order to explain the outcome of a pre-trained classifier. As we will see in the numerical examples, the performance of a classifier trained on the full space of ≈ 7,000 features is much lower than what a carefully selected set of ≈ 10 features can achieve, further motivating the forward feature selection strategy.
In the following, we first introduce a strategy for forward feature selection in Section II and show how DisCo can be used as a scoring function for promising features. Section III then discusses the concrete application to top tagging. We show that our method reaches performance equal to much more complex architectures, using only a fraction of the features and complexity, even matching LorentzNet [23] in ablation studies. There, we also investigate the leading eight EFPs chosen (as well as their stability under repeated application of our method) and attempt to use them to understand "what the machine learns". We observe that the same leading six EFPs are found under multiple iterations of our method, indicating their relevance for this task. Finally, Section IV provides a discussion of results and further outlook.

II. METHOD
For supervised classification tasks1, forward feature selection methods operate on a feature space F = {f 1 , f 2 , . . ., f N } of N candidate features. We should think of each feature f i as a pre-determined function (e.g. an EFP) that operates on the low-level data x ∈ R d of each event, i.e. f i = f i (x). Given an already-selected set of n features F n = {f i1 , f i2 , . . ., f in }, the goal of forward feature selection is to identify the next feature f in+1 which is expected to improve the performance on the classification task the most.
It is assumed here that the full feature space F is so large, and the training of the classifier sufficiently expensive, that one cannot simply brute-force the choice of the next feature by training N − n classifiers on all possible additional features f i ∉ F n . Therefore, what is needed is a much cheaper-to-compute relevance score that stands in as a proxy for the classifier itself.
The relevance score takes as input a given set of features, together with a reference label, evaluated over the dataset. The reference label could be either the truth labels, in which case we are performing ab initio forward feature selection in order to produce the highest-performing classifier that we can; or it could be a pre-trained state-of-the-art classifier, in which case we are performing forward feature selection for the purposes of AI explainability (explaining the pre-trained "black box" classifier).
In any event, for a given set of features, the point is that the relevance score can be computed much more quickly than a classifier can be trained on those features, so the forward feature selection algorithm can simply select the feature with the highest score as the next feature.
The four steps involved in our feature selection algorithm are illustrated in Fig. 1 and explained in the following.

Step 1: Train on known features. Train a classifier network on a set of features F n = {f i1 , f i2 , . . ., f in } using the full training sample of all events X all , and obtain the classifier output y pred for all events in X all .
For simplicity and best possible performance, we use a dense neural network (details in Appendix B), although any other classification algorithm (e.g. XGBoost, a logistic regressor) could be used as well.

Step 2: Select the confusion set X 0 ⊂ X all . Instead of calculating the relevance scores using the full dataset, we choose to focus on a subset of the full data X 0 ⊂ X all that we call the "confusion set". These are the events where we believe the features in F n are least effective in separating signal from background, and where adding a new feature may have the largest impact. To identify this subset, we select all events in a window around y pred = 0.5, as shown in Fig. 2; these should be the events where the classifier is most confused about whether it is looking at signal or background. We observe that using a confusion set instead of the full dataset improves performance.
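As a minimal illustration, the confusion-set selection can be sketched in NumPy as follows (the window half-width `delta` is our own placeholder; the paper does not quote a specific value at this point):

```python
import numpy as np

def confusion_set(y_pred, delta=0.1):
    """Boolean mask for events whose classifier output lies in a
    window around y_pred = 0.5, i.e. the 'confusion set' X_0."""
    return np.abs(y_pred - 0.5) < delta

# Toy example: classifier outputs drawn uniformly in [0, 1].
rng = np.random.default_rng(0)
y_pred = rng.uniform(size=1000)
mask = confusion_set(y_pred)
X0_indices = np.flatnonzero(mask)  # indices of the confusion-set events
```

In practice the mask can be applied to signal and background events separately, as the paper notes later when comparing with the DO-ADO confusion set.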

Step 3: Assign a relevance score to each feature. To each feature f i in the feature space F, we assign a relevance score s fi , which gauges how much the feature will improve classification performance.
The relevance score is calculated from the feature vectors evaluated on the events in the confusion set X 0 , together with a reference label y ref . The relevance score assigned to each feature f i is

$s_{f_i} = \text{DisCo}\big(y_{\rm ref},\, (f_{i_1}, \ldots, f_{i_n}, f_i)\big)$,

evaluated on X 0 . As described in the Introduction, DisCo is short for distance correlation [35][36][37][38], a measure of statistical dependence that is zero iff the random vectors X and Y are statistically independent, and positive (and ≤ 1) otherwise. Therefore, it is well suited to judging whether adding f i to the feature vector (f i1 , . . ., f in ) produces a stronger correlation with the reference label y ref or not. Here we use the affine-invariant version of DisCo [53], which is invariant under arbitrary invertible linear transformations of X and Y, in order to make it more robust against basis reparametrizations in the EFP space. The multivariate affine-invariant DisCo calculation is described in more detail in Appendix C.
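To make this step concrete, here is a sketch of the relevance score using the plain (non-affine-invariant) biased sample distance correlation, built directly from double-centered pairwise distance matrices. The helper names and the toy features are our own; a production version would use the affine-invariant variant of Appendix C:

```python
import numpy as np

def _centered_dists(Z):
    """Double-centered pairwise Euclidean distance matrix of an (n, d) array."""
    Z = Z.reshape(len(Z), -1)
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    return D - D.mean(0) - D.mean(1)[:, None] + D.mean()

def distance_correlation(X, Y):
    """Biased sample distance correlation (Szekely et al.), in [0, 1]."""
    A, B = _centered_dists(X), _centered_dists(Y)
    dcov2 = (A * B).mean()
    return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

def relevance_scores(F_sel, candidates, y_ref):
    """Score each candidate f_i by DisCo(y_ref, (f_i1, ..., f_in, f_i))."""
    return np.array([
        distance_correlation(np.column_stack([F_sel, f]), y_ref)
        for f in candidates
    ])

# Toy check: a feature the label depends on should outscore pure noise.
rng = np.random.default_rng(1)
n = 400
F_sel = rng.normal(size=(n, 1))          # one already-selected feature
f_good = rng.normal(size=n)              # informative candidate
f_noise = rng.normal(size=n)             # uninformative candidate
y_ref = np.tanh(F_sel[:, 0] + f_good)    # reference label depends on f_good
scores = relevance_scores(F_sel, [f_good, f_noise], y_ref)
```

The forward selection step then simply takes `candidates[np.argmax(scores)]` as the next feature.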

Step 4: Add the feature with the best relevance score to the list of known features. We select the feature with the best score and add it to F n . Then we return to the first step and train a network on the updated set of features F n+1 . The procedure is stopped when the performance metric saturates, and the final set of features is returned.
While the above explicitly describes our DisCo-based Forward Feature Selection algorithm (DisCo-FFS), the protocol is general enough to accommodate other iterative feature selection techniques as well. In Appendix A, we use the same framework to outline how the Forward Feature Selection of [30] operates. It is based on Decision Ordering (DO) for the confusion set and Average Decision Ordering (ADO) for the relevance score, and we will refer to it as DO-ADO-FFS throughout this work.

III. APPLICATION TO TOP-TAGGING

A. Data set
We study the performance of the DisCo feature selection algorithm on the top quark tagging landscape data set [1,54]. This data set contains boosted, hadronically-decaying top jets as signal, and QCD (i.e. light quark and gluon) jets as background, generated using Pythia8 [55] at a center-of-mass energy of 14 TeV. Multiple interactions and pile-up are not included in this data set. The detector simulation is done using Delphes [56] with the ATLAS detector card. FastJet [57] is used to cluster jets with the anti-k T algorithm [58] with R = 0.8. Only jets in the p T range [500, 650] GeV, and with |η j | < 2, are considered. The data set contains only kinematic information, in the form of energy-momentum four-vectors of all the reconstructed particles in each jet, extracted using the Delphes energy-flow algorithm. No additional tracking or particle-identification information is included.
The full data set contains 2 million events, with 1 million signal events and 1 million background events. This data is split into 1.2M events in the training set, 400k in the validation set, and 400k in the test set, each set containing equal numbers of signal and background events.

B. Feature Space
For top-tagging we start with the initial feature set F 0 = {m J , p T , m W −candidate }, where m J is the mass of the jet, p T is the transverse momentum of the jet, and m W −candidate is the mass of the W -candidate in the jet, calculated with a very simple method: we recluster each fat jet using the exclusive k T algorithm with R = 0.3 into exactly three subjets. Then we pick the pair of subjets whose invariant mass comes closest to m W . This pair of subjets gives us the W -candidate, and its mass is m W −candidate . The distributions of the initial features are illustrated in Fig. 3.
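The W-candidate pairing step can be sketched as follows, assuming the three subjet four-vectors (E, px, py, pz) have already been obtained from the exclusive-kT reclustering (which itself would require a jet clustering library such as FastJet; the helper names are ours):

```python
import numpy as np
from itertools import combinations

M_W = 80.4  # GeV, nominal W mass used as the pairing target

def inv_mass(p):
    """Invariant mass of a four-vector p = (E, px, py, pz)."""
    E, px, py, pz = p
    return np.sqrt(max(E**2 - px**2 - py**2 - pz**2, 0.0))

def w_candidate_mass(subjets):
    """Of all subjet pairs, return the invariant mass closest to M_W."""
    pair_masses = [inv_mass(np.add(subjets[i], subjets[j]))
                   for i, j in combinations(range(len(subjets)), 2)]
    return min(pair_masses, key=lambda m: abs(m - M_W))

# Toy example: three massless subjets.
subjets = [np.array([50.0, 50.0, 0.0, 0.0]),
           np.array([50.0, 0.0, 50.0, 0.0]),
           np.array([10.0, 0.0, 0.0, 10.0])]
m_w_cand = w_candidate_mass(subjets)
```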
We then apply feature selection algorithms to a large set of Energy Flow Polynomials (EFPs) [31]. EFPs are functions of the energy fractions and angular separations of jet constituents,

$z_a = \dfrac{p_{T,a}}{\sum_{b \in J} p_{T,b}}\,, \qquad \theta_{ab} = \left(\Delta\eta_{ab}^2 + \Delta\phi_{ab}^2\right)^{\beta/2}\,,$

where p T a is the transverse momentum of the a-th jet constituent, and the denominator in z a is summed over all constituents of the jet J. EFPs have a one-to-one correspondence with a multigraph G, with each node corresponding to an energy-weighted sum over constituents and each edge (m, ℓ) ∈ G corresponding to a factor of θ. Thus, given a graph G with N nodes and edges (m, ℓ) ∈ G, the EFP is

$\text{EFP}_G = \sum_{i_1 \in J} \cdots \sum_{i_N \in J} z_{i_1}^\kappa \cdots z_{i_N}^\kappa \prod_{(m,\ell) \in G} \theta_{i_m i_\ell}\,.$

The original EFPs [31] were introduced as IRC-safe observables, with κ = 1. However, in our feature space we are motivated by [30] to consider other values of κ as well. Following [30], we use Energy Flow Polynomials with all combinations of d ≤ 7, β = [0.5, 1, 2] and κ = [−1, 0, 0.5, 1, 2], which form a space of 7,320 unique features.
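As an illustration, the EFP sum can be evaluated by brute force for a small graph. This sketch uses one common convention for the κ-generalization (raising the IRC-safe energy fraction z_a to the power κ); it is exponentially slow in the number of nodes and is meant only to make the formula concrete:

```python
import numpy as np
from itertools import product

def efp(pt, eta, phi, edges, n_nodes, kappa=1.0, beta=1.0):
    """Brute-force EFP for a multigraph with nodes 0..n_nodes-1.

    edges is a list of (m, l) node pairs; repeating a pair encodes a
    multigraph edge of higher multiplicity.
    """
    z = (pt / pt.sum()) ** kappa                 # node weights z_a^kappa
    deta = eta[:, None] - eta[None, :]
    dphi = np.mod(phi[:, None] - phi[None, :] + np.pi, 2 * np.pi) - np.pi
    theta = (deta**2 + dphi**2) ** (beta / 2.0)  # angular factors theta_ab
    total = 0.0
    for idx in product(range(len(pt)), repeat=n_nodes):
        term = np.prod([z[i] for i in idx])
        for m, l in edges:
            term *= theta[idx[m], idx[l]]
        total += term
    return total

# Two particles separated by delta-eta = 1; dumbbell graph (one edge).
pt = np.array([1.0, 1.0])
eta = np.array([0.0, 1.0])
phi = np.array([0.0, 0.0])
val = efp(pt, eta, phi, edges=[(0, 1)], n_nodes=2, kappa=1.0, beta=2.0)
```

For this configuration the only nonzero terms are the two cross terms z_0 z_1 θ_01, giving 2 × 0.5 × 0.5 × 1 = 0.5.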
C. Results

Ab initio feature selection using truth labels
First, we consider the ab initio feature selection task, using the truth labels to guide the algorithms so as to yield the best-possible classifier.
We apply the truth-guided DisCo-FFS and DO-ADO-FFS3 to the training and validation sets, and use the test set only for evaluating performance. (Network architectures and hyperparameters used in this section are described in Appendix B.) The performance metric chosen for top-tagging is R 30 (the QCD rejection factor at 30% top-tagging efficiency). It allows a better separation of different methods, as the area under the curve (AUC) saturates, and it is more indicative of the performance at a realistic working point.
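The R 30 metric can be computed from per-event classifier scores as follows (a sketch; the variable names and toy score distributions are ours):

```python
import numpy as np

def rejection_at_efficiency(scores_sig, scores_bkg, eff=0.30):
    """Background rejection 1/eps_B at fixed signal efficiency eps_S = eff,
    assuming larger scores are more signal-like."""
    thr = np.quantile(scores_sig, 1.0 - eff)  # cut keeping a fraction eff of signal
    eps_b = np.mean(scores_bkg > thr)
    return np.inf if eps_b == 0 else 1.0 / eps_b

# Toy example: overlapping Gaussian score distributions.
rng = np.random.default_rng(2)
scores_sig = rng.normal(1.0, 1.0, size=100_000)
scores_bkg = rng.normal(0.0, 1.0, size=100_000)
r30 = rejection_at_efficiency(scores_sig, scores_bkg, eff=0.30)
```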
As shown in Fig. 4, the R 30 value increases as more features are added by the two feature selection methods. This shows that both DisCo-FFS and DO-ADO-FFS are selecting useful features. After 9 features, the performance of the DisCo-selected features saturates at R 30 ≈ 1250. We also see that our proposed method outperforms DO-ADO-FFS, achieving a higher R 30 at each step.
Any worthwhile feature selection algorithm should do better than randomly selecting features. To test this, we randomly select each number of features 10 times, and use the average and standard deviation of the R 30 as our "random baseline", shown in Fig. 4. Interestingly, we see that randomly selected EFPs also improve performance as more and more features are added, but not as much as the FFS methods.

Feature selection using pre-trained classifier
Next we turn to feature selection using a pre-trained classifier (so-called "black-box guiding" in [30]). For the pre-trained classifier, we use the state-of-the-art LorentzNet tagger [23].
We see in Fig. 4 that DO-ADO-FFS with LorentzNet actually performs slightly better than DO-ADO-FFS with truth labels. This somewhat counterintuitive result was also observed by [30] in the context of boosted W-tagging, and we confirm it here. As explained there, the confusion set of the DO-ADO method consists of signal-background pairs which are incorrectly ordered by the classifier trained at each step (called y pred in Sec. II) with respect to the reference labels. When using truth labels for the latter, the confusion set can be significantly contaminated by signal-background pairs which may never be ordered properly, even by the ideal Neyman-Pearson classifier. This can in turn distort the ADO score, which is calculated on the confusion set. This explains why the LorentzNet-guided DO-ADO-FFS performs better than the truth-guided DO-ADO-FFS.
Meanwhile, we see from Fig. 4 that there is no significant difference in performance between truth-guided and LorentzNet-guided DisCo-FFS. This is perhaps the more expected and intuitive result. We believe the reason DisCo-FFS does not suffer a degradation in performance when using truth labels is that our confusion set is determined solely by the classifier trained at each step, and does not involve the reference labels at all. Also, our confusion set is determined on background and signal jets separately. Therefore, the issue of the forever-incorrectly-ordered signal-background pairs never arises here. It would be interesting to test this explanation further, for example by combining the different ways of choosing the confusion set (DO or y pred ) with different relevance scores (ADO or DisCo). We reserve this for future work.

D. Comparison with other taggers
The top-tagging comparison study [1] includes two methods which use high-level features as inputs for top-tagging: one uses a NN with multi-body N-subjettiness input features [6,59], and the other uses a linear classifier (Fisher's Linear Discriminant) on EFPs. All other taggers are based on low-level jet information. The proposed DisCo-FFS selection strategy, based on 9 EFPs and 3 initial features, outperforms all methods in the published study [1]. However, it falls short of even more state-of-the-art taggers that were published after [1]: ParticleNet [14], LorentzNet [23], ParT (the particle transformer tagger) [19], and PELICAN [24]. Nevertheless, our tagger is able to achieve very competitive performance with only 1440 parameters, as shown in Table I and Fig. 5.
We also compare our performance to that of a network (architecture described in Appendix B) trained on all 7k EFPs. As shown in Table I, this network is only able to reach R 30 = 844. This is significantly worse than the performance using the small subset of EFPs selected by DisCo-FFS. Clearly, the inclusion of uninformative features in the training deteriorates the performance of the network. In principle, it should be possible to optimize the hyper-parameters to recover the lost performance, but this is not so straightforward in practice, given the amount of time and resources it takes to train a network on all 7k EFPs. 4 This emphasizes the need for feature selection.
As a further aside, this result also indicates why another popular feature selection method, based on assigning feature attributions using Shapley values, is not suitable here. Shapley values assume the existence of a high-performing classifier trained on a set of features, and then rank those features in terms of their estimated contributions to the classifier outputs. In fact, the original Shapley values [43,44,47] are ill-suited to the problem at hand: their computational complexity grows exponentially with the number of features, so in practice they can never be computed for more than ∼ 10 features. Moreover, the features are assumed to be uncorrelated in the computation of Shapley values. With 7k highly correlated features, this is clearly not the right approach. Later approaches such as SHAP [48] attempt to overcome the computational complexity issue by approximating the Shapley values in various ways. SHAP also used (approximate) Shapley values to unify different feature attribution methods [42,45,46,60]. But generally, all these works still assume independence of the features. This is an area of active research, and it is possible a Shapley-inspired approach will work well on this problem in the future. Suffice it to say that in our experiments (based on Deep SHAP [46,48] and the sub-par DNN trained on 7k EFPs), we obtained results that were only marginally better than random selection.

E. Ablation studies
To showcase another important benefit of feature selection, we compare the performance of the features obtained using DisCo-FFS to that of ParticleNet and LorentzNet on smaller training datasets. We take the set of features obtained in Section III C and train the same neural network with the same hyper-parameters on 0.5%, 1% and 5% of the training dataset, as shown in Fig. 6.

F. Robustness of the feature selection
It is interesting to ask whether the DisCo-FFS algorithm selects the same features every time. This is not a priori guaranteed, because there is some stochasticity in the algorithm, coming from the training of the NN classifier at each step (which in turn determines the confusion set on which the relevance score is calculated). Shown in Fig. 7 is R 30 vs. the number of features selected, after running the DisCo-FFS algorithm five independent times. We see that DisCo-FFS repeatedly chooses the same first six EFPs. After that, the features selected start to deviate from being fully deterministic, at first only slowly (there appear to be two possibilities for the pairs of EFPs selected in the 7th and 8th iterations), and then quickly from the 9th EFP onwards (at the 9th EFP, the five trials selected five different EFPs). This is broadly consistent with Fig. 4. There we see the R 30 shooting up rapidly during the first six EFPs, indicating that they provide a lot of classification power and should produce a strong signal for the relevance score in the DisCo-FFS selection procedure. Then the R 30 plateaus but still rises a little, from six EFPs to nine EFPs. This is consistent with a much weaker signal coming from the relevance score and more room for randomness. Finally, after nine EFPs, the R 30 no longer rises and instead fluctuates around 1250. This is consistent with the remaining EFPs being selected essentially at random, not providing any real signal to the relevance score.

The selected Energy Flow Polynomials can be used to gain physical insight into top tagging. Shown in Tables II and III are the graphs, chromatic numbers c, (κ, β) values, and cumulative R 30 values of the first eight EFPs selected by DisCo-FFS. We see that 5 of the first 6 EFPs selected have c = 3. The chromatic number of a graph is the minimum number of colours that can be assigned to its nodes such that no edge connects two nodes of the same colour. As noted in [31], the chromatic number of an EFP is also a proxy
for the number of prongs in the jet. In other words, c = 3 EFPs are probes of 3-prong substructure, exactly what one would expect to be relevant for top tagging.
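For the small graphs relevant here (at most a handful of nodes), the chromatic number can be computed by brute force; a sketch (the helper is our own, and exponential in the number of nodes):

```python
from itertools import product

def chromatic_number(n_nodes, edges):
    """Smallest number of colours k such that the nodes can be coloured
    with no edge joining two nodes of the same colour (brute force)."""
    for k in range(1, n_nodes + 1):
        for colouring in product(range(k), repeat=n_nodes):
            if all(colouring[u] != colouring[v] for u, v in edges):
                return k
    return n_nodes

# A triangle graph needs three colours (c = 3, a 3-prong probe);
# a single edge needs two (c = 2, a 2-prong probe).
c_triangle = chromatic_number(3, [(0, 1), (1, 2), (0, 2)])
c_edge = chromatic_number(2, [(0, 1)])
```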
Interestingly, there is one c = 2 EFP selected among the first six EFPs. This probe of 2-prong substructure could be related to the two prongs consisting of the b-quark and the boosted W inside the top jet.
We also see from Table II that both IRC-safe and IRC-unsafe probes of 3-prong substructure are useful for tagging. The first two EFPs have κ = 2, and hence are IRC-unsafe probes of hard radiation, with the first being a 3-point correlator and the second a 4-point correlator. 5 IRC-safe EFPs (κ = 1) are not selected until the fourth and fifth iterations.
In the seventh and eighth iterations, there appear to be two possible paths for the FS algorithm to take, i.e. two distinct possibilities for the pairs of EFPs selected. These are shown in Table III. In one of the paths, two IRC-unsafe EFPs probing the 2-prong substructure are selected, one of them probing small-angle radiation (β = 0.5) and the other probing hard/wide-angle radiation (β = 2); the latter is actually the first selected feature that probes wide-angle radiation. In the other path, we see the appearance of the first EFP which probes 4-prong substructure with small-angle radiation (β = 0.5), and this is followed by an IRC-safe EFP probing 3-prong substructure.
Interestingly, in our single run of LorentzNet-guided DisCo-FFS, the first 6 features are the same as in Table II, and the 7th EFP is the same one selected in Path 1 of Table III. This confirms that the similar performance between truth-guided and LorentzNet-guided DisCo-FFS is no coincidence, and is likely because LorentzNet (being so high-performing) is quite close to the truth labels.

IV. CONCLUSIONS
In this work, we have introduced a new forward feature selection method based on the distance correlation measure of statistical dependence, dubbed DisCo-FFS. Our method can operate equally well on either truth labels (for ab initio feature selection) or on the outputs of a pre-trained classifier (for explaining a "black box" AI).
We demonstrated the performance of our method using the task of boosted top tagging, as boosted top jets have a rich substructure and many subtle correlations that have proven to be a fruitful laboratory for developing increasingly powerful state-of-the-art taggers in the HEP literature.
Following [30], we have trained our DisCo-FFS method on a large set (7,000+) of Energy Flow Polynomials, which aim to provide a complete description of jet substructure. We have seen that DisCo-FFS is very effective at selecting EFPs from this large feature set; it achieves nearly state-of-the-art top-tagging performance (matching that of ParticleNet-Lite [14]) with a selection of just a small number of EFPs (fewer than 10). We also showed that it outperforms the DO-ADO-FFS method of [30] (which we have attempted to replicate as closely as possible), consistently achieving higher tagging performance after each EFP that is selected.

FIG. 7. Performance vs. iteration for 5 trials of DisCo-FFS (performance is the mean R30 of 10 trainings). We see that the feature selection is deterministic for the first six EFPs selected (superimposed), and there is a corresponding sharp rise in R30. This is followed by 2 paths (marked path 1 and path 2) in the 7th and 8th iterations. After that, DisCo-FFS finds different sets of features to achieve similar performance.
The fact that our method falls short of the most state-of-the-art deep learning methods (ParT [19], PELICAN [24], and LorentzNet [23]) is interesting. Either our method is not fully optimal at selecting features, or the 7,000+ EFPs we used as the basis of our study do not capture all the physics underlying top tagging. A possible follow-up study to further probe this question would be to supplement the 7,000+ EFPs with additional jet substructure variables, for instance the subjettiness variables of [59,61], the jet spectra and morphological features of [62][63][64], or Boost Invariant Polynomials [65]. This observation also raises the possibility that there may be more meaningful jet substructure variables out there, beyond those that are presently known, waiting to be discovered. This is obviously an interesting avenue for future research.
Beyond simple object tagging, DisCo-FFS might also shine for tasks, such as building supervised classifiers for new physics discovery, where calibration of the algorithm is difficult and a small number of well-understood features is preferable. While particle physics is in an especially good position due to the presence of well-motivated bases of features (such as the EFPs used here), such decompositions also exist in other domains, e.g. in the form of wavelets applied to images (e.g. building on [66]).
In general, the selected EFPs could make for a very lightweight and performant top tagger. This could have important applications to triggering [67]. For that, a fast way to calculate EFPs on FPGAs would be required; this will be interesting to explore further.
It would also be potentially illuminating to study the robustness of the selected EFPs under domain shift. For example, ATLAS recently released an official top tagging dataset [68]. One could compare the EFPs selected by DisCo-FFS on the different top tagging datasets, and see how one set of EFPs performs on the other dataset. One could also imagine training this method on a restricted set of HLFs (EFPs or otherwise) that are deemed to be "well-modeled" by simulations. This could help with the calibration and robustness of taggers developed on simulation and deployed on data.
Overall, we observe the start of a positive feedback loop between deep learning method development and physics-motivated feature discovery. Each one drives the other. Early top taggers [69] started with jet substructure variables like N-subjettiness. Then it looked like deep learning was able to go far beyond HLFs, and that we would have to rely on fully-automated feature engineering. Now there are signs that we are coming full circle. Ultimately we may hope to match the performance of the SOTA deep learning taggers with just a handful of (yet-to-be-invented?) HLFs. This would be a very satisfying outcome, showing that deep learning doesn't have to be a black box but can drive fundamental physics discoveries.
powerful than that: it can also measure the statistical dependence of multivariate distributions, a powerful property that enables the forward feature selection algorithm described in this work.
For our case, X = y truth is a 1-D vector, and Y = (f i1 , f i2 , . . ., f in ) is an n-dimensional feature vector. The population value of the squared distance covariance of X and Y is given by

$\text{dCov}^2(X, Y) = \mathbb{E}\big[|X - X'|\,|Y - Y'|\big] + \mathbb{E}\big[|X - X'|\big]\,\mathbb{E}\big[|Y - Y'|\big] - 2\,\mathbb{E}\big[|X - X'|\,|Y - Y''|\big]\,,$

where $(X', Y')$ and $(X'', Y'')$ are i.i.d. copies of $(X, Y)$ and $|\cdot|$ denotes the Euclidean norm. Distance correlation is given by

$\text{dCorr}(X, Y) = \dfrac{\text{dCov}(X, Y)}{\sqrt{\text{dCov}(X, X)\,\text{dCov}(Y, Y)}}\,,$

which is normalized between 0 and 1.
Finally, using the covariance matrices Σ X , Σ Y , the affine-invariant distance correlation is simply

$\text{dCorr}_{\rm aff}(X, Y) = \text{dCorr}\big(\Sigma_X^{-1/2} X,\; \Sigma_Y^{-1/2} Y\big)\,.$
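A numerical sketch of this statistic (biased sample version, our own implementation): whitening each variable with Σ^{-1/2} before computing the ordinary distance correlation makes the result invariant under invertible affine maps, which can be verified directly:

```python
import numpy as np

def _centered_dists(Z):
    """Double-centered pairwise Euclidean distance matrix of an (n, d) array."""
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    return D - D.mean(0) - D.mean(1)[:, None] + D.mean()

def dcorr(X, Y):
    """Biased sample distance correlation of (n, d) arrays."""
    A, B = _centered_dists(X), _centered_dists(Y)
    return np.sqrt((A * B).mean() / np.sqrt((A * A).mean() * (B * B).mean()))

def whiten(X):
    """Map X to Sigma_X^{-1/2} (X - mean(X)) via an eigendecomposition."""
    Xc = X - X.mean(axis=0)
    w, V = np.linalg.eigh(np.atleast_2d(np.cov(Xc.T)))
    return Xc @ V @ np.diag(w ** -0.5) @ V.T

def affine_dcorr(X, Y):
    return dcorr(whiten(X), whiten(Y))

# Invariance check: an invertible affine map of Y leaves the statistic unchanged.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))
Y = np.column_stack([X[:, 0] + 0.5 * rng.normal(size=300),
                     rng.normal(size=300),
                     X[:, 1] ** 2])
A = rng.normal(size=(3, 3)) + 3.0 * np.eye(3)   # (almost surely) invertible
b = rng.normal(size=3)
d0 = affine_dcorr(X, Y)
d1 = affine_dcorr(X, Y @ A.T + b)
```

The invariance holds because the whitened versions of Y and of its affine image differ only by an orthogonal transformation, which leaves all pairwise distances unchanged.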

FIG. 1. Overview of the proposed forward feature selection algorithm.

FIG. 3. Initial features chosen for top tagging: jet mass mJ (left), jet pT (center), and mass of the W -candidate (right).

FIG. 4. Performance comparison between the DisCo-FFS and DO-ADO-FFS methods, truth-guided and LorentzNet-guided. Shown in gray is also the random selection baseline. The shaded bands around each curve come from training the NN classifier ten times on the same set of features (similar to [1]). Overall, DisCo-FFS seems to select more relevant features than DO-ADO-FFS, resulting in a higher-performing classifier at every step. Interestingly, while DO-ADO-FFS with truth labels actually performs worse than with LorentzNet (a phenomenon also observed in [30]), no degradation in performance is observed for DisCo-FFS with truth labels.
While both LorentzNet and ParticleNet had superior performance on the full training dataset, our set of features outperforms ParticleNet at lower training fractions, and more-or-less matches LorentzNet at 0.5% and 5% of the same training data.

FIG. 5. $R_{30}$ vs. number of parameters of the model, for many different approaches to top tagging. LorentzNet [23], ParticleNet [14], ParT [19], and PELICAN [24] are some of the recent taggers with very good performance. "DisCo-FFS on EFPs" corresponds to the simple DNN trained on the first nine EFPs selected by DisCo-FFS, while "DNN EFPs" is our DNN trained on all 7,000 EFPs. The remaining taggers are taken from [1]. We see that the nine EFPs selected using DisCo-FFS have very competitive performance, especially given the number of parameters.

FIG. 6. Performance when training on 0.5%, 1%, and 5% of the training data. The EFPs selected using DisCo outperform ParticleNet, and match the performance of LorentzNet [23] at 0.5% of the total training data.
TABLE I. AUC and $R_{30}$ comparison of different taggers on the dataset from

TABLE II. The EFPs selected by DisCo-FFS in the first 6 iterations

TABLE III. Two paths selected by the EFPs in the 7th and 8th iterations

G. Physical interpretation of the selected features