Automating the Construction of Jet Observables with Machine Learning

Machine-learning assisted jet substructure tagging techniques have the potential to significantly improve searches for new particles and Standard Model measurements in hadronic final states. Techniques with simple analytic forms are particularly useful for establishing robustness and gaining physical insight. We introduce a procedure to automate the construction of a large class of observables that are chosen to completely specify $M$-body phase space. The procedure is validated on the task of distinguishing $H\rightarrow b\bar{b}$ from $g\rightarrow b\bar{b}$, where $M=3$ and previous brute-force approaches to construct an optimal product observable for the $M$-body phase space have established the baseline performance. We then use the new method to design tailored observables for the boosted $Z'$ search, where $M=4$ and brute-force methods are intractable. The new classifiers outperform standard $2$-prong tagging observables, illustrating the power of the new optimization method for improving searches and measurements at the LHC and beyond.


I. INTRODUCTION
Effective identification of hadronic decays of boosted heavy particles like the top quark or the W, Z, and Higgs (H) bosons is essential for analyses at the Large Hadron Collider (LHC). Jet substructure observables that identify specific discriminating information in the radiation pattern of jets originating from different particles are now necessary, both in searches for new physics and in precision Standard Model (SM) measurements. As a result, there is an extensive literature developing observables and techniques for identifying boosted topologies to increase the efficacy of LHC analyses probing extreme regions of phase space [1,2].
One of the key challenges with ML taggers is to identify what information the machine is using for classification. Understanding the origin of discrimination can lead to robustness when taggers are applied outside of the region where they were trained, can result in new theoretical insight for other applications, and may produce new simple observables that capture most of the information. While there are many proposals for ML metacognition [4,5,7,8,12,17,40], one particularly powerful approach is to identify simple product observables that capture most of the information from an ML algorithm trained on the full phase space [8]. This approach results in analytically tractable observables that capture nearly all of the power of a more complicated algorithm, yet are also robust and insightful. One of the most challenging aspects of the approach presented in Ref. [8] is the fitting process for picking the optimal simple product observable.

* Electronic address: kdatta@ethz.ch
† Electronic address: larkoski@reed.edu
‡ Electronic address: bpnachman@lbl.gov
In this paper, we describe a new procedure based on ML for automating the feature extraction originally presented in Ref. [8]. The method is first applied to derive an optimal product observable for discriminating H → bb vs. g → bb, and the outcome is compared to the result of Ref. [8], which used a brute-force approach. Having validated the method, a new classifier is developed to distinguish a Z′ from generic quark and gluon jets. The phase-space scan required in this latter tagging task is too large for the brute-force approach, so the automated method is required to find the optimal tagger. The resulting classifier has a simple form and is competitive with a tagger using high-dimensional, low-level inputs. Together with Ref. [42], this is one of the only studies of the dependence on the mass of the new boson, which is timely given new searches for light boosted bosons [49-51].
This paper is organized as follows. The method for constructing product observables is described in Sec. II and the machine learning approaches are detailed in Sec. III. Results for both the Higgs and Z classification tasks are presented in Sec. IV. The paper ends with conclusions and future outlook in Sec. V.

II. N-SUBJETTINESS PRODUCT OBSERVABLES
The information about the kinematic phase space of M subjets in a jet is resolved with a set of (3M − 4) N-subjettiness [52-54] observables. By increasing M, one can identify the number of subjets required to saturate the classification performance based on the spanning set of N-subjettiness observables [7]:

$$\left\{ \tau_1^{(0.5)}, \tau_1^{(1)}, \tau_1^{(2)}, \; \tau_2^{(0.5)}, \tau_2^{(1)}, \tau_2^{(2)}, \; \ldots, \; \tau_{M-2}^{(0.5)}, \tau_{M-2}^{(1)}, \tau_{M-2}^{(2)}, \; \tau_{M-1}^{(1)}, \tau_{M-1}^{(2)} \right\}, \tag{1}$$

where

$$\tau_N^{(\beta)} = \frac{1}{p_{TJ}\, R^{\beta}} \sum_{i \in \mathrm{jet}} p_{T,i} \min\left\{ \Delta R_{1,i}^{\beta},\, \Delta R_{2,i}^{\beta},\, \ldots,\, \Delta R_{N,i}^{\beta} \right\}$$

for some choice of N axes within the jet; R is the jet radius parameter, and $(\Delta R)^2 = (\Delta\phi)^2 + (\Delta\eta)^2$. Given the minimal M, one can posit an ansatz¹ for a simple product observable that captures most of the information contained in a neural network trained on the entire spanning set:

$$\beta_M^{\mathrm{ML}} = \left(\tau_1^{(0.5)}\right)^{a} \left(\tau_1^{(1)}\right)^{b} \left(\tau_1^{(2)}\right)^{c} \left(\tau_2^{(1)}\right)^{d} \left(\tau_2^{(2)}\right)^{e} \cdots \tag{2}$$

For distinguishing H → bb vs. g → bb jets, Ref. [8] showed that the useful information for classification is saturated by M = 3, and $\beta_3^{\mathrm{ML}}$ has nearly the same tagging performance as the full 3-body phase space. The parameters a, b, c, d, e that specify $\beta_3^{\mathrm{ML}}$ were identified by randomly scanning the five-dimensional parameter space and exploiting minimal correlations between some of the parameters. This becomes intractable when the optimal M is larger than 3.
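For M = 3, the product ansatz is a five-parameter family of observables. A minimal numerical sketch (a sketch only, assuming the basis ordering {τ₁^(0.5), τ₁^(1), τ₁^(2), τ₂^(1), τ₂^(2)} and made-up τ values; the function name `beta_3` is our own):

```python
import numpy as np

def beta_3(tau, a, b, c, d, e):
    """Product observable for M = 3: the five exponents weight the
    spanning set {tau_1^(0.5), tau_1^(1), tau_1^(2), tau_2^(1), tau_2^(2)}.
    `tau` has shape (n_jets, 5), holding those five values per jet."""
    exponents = np.array([a, b, c, d, e])
    return np.prod(tau ** exponents, axis=1)

# Example: two jets with illustrative (made-up) N-subjettiness values
taus = np.array([[0.30, 0.20, 0.10, 0.15, 0.05],
                 [0.40, 0.25, 0.12, 0.20, 0.08]])
obs = beta_3(taus, 2.0, -1.0, 0.5, 1.0, -0.5)
```

Since the τ values are positive, the product is well defined for any real exponents, including the negative values that arise in the optimizations below.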
In this paper, we explore methods to overcome the difficulties of extending this procedure to higher dimensions. In one approach, we replace the random sampling segment of the procedure with a combination of neural networks carrying out regression from the parameter space to the distributions of the product observable for individual jets. Off-the-shelf minimization routines can then be used to optimize any metric of the classifier performance. A complementary and simpler approach is to directly use the form in Eq. 2 in the machine learning optimization, where the learnable parameters are the exponents {a, b, c, ...}. Further details are described in the next sections.

B. Construction of optimized product observables
Following the approach of Ref. [8], the point of saturation of discrimination power is first identified using a deep neural network (DNN) classifier. For Z′ vs. QCD and H → bb vs. g → bb discrimination, we find that discrimination power saturates at 4-body (8-dimensional) and 3-body (5-dimensional) phase space, respectively. It is then simple to form the product observable from the elements of the M-body basis corresponding to saturation.
We examine two approaches for finding the optimal product observable. The first approach follows a similar method as the brute-force algorithm: neural networks approximate the signal and background probability distributions conditioned on the parameters {a, b, c, ...}, and then any automated optimization procedure can be used to identify the best exponents. For each task, the product observable is calculated for 25,000 signal and background jets for different values of the parameters a–e (H → bb) or a–h (Z′), each in the range [−5, 5]. These distributions are then stored to generate training sets for the neural networks that carry out regression from the parameter space to the values of $\beta_M^{\mathrm{ML}}$ computed with those exponents. While there are multiple possibilities for learning the probability distribution of β_M given {a, b, c, ...}, such as generative adversarial networks [72] and variational autoencoders [73,74], the method that we found works well for the product observables is illustrated in Fig. 1. The network takes 5 (Higgs) or 8 (Z′) inputs and outputs 25,000 numbers, which represent a dataset of the same size as the training data, but with the specified parameter values {a, b, c, ...}. From these 25,000 values, the probability distribution of β is formed for signal and background, and the one-dimensional likelihood ratio is constructed for optimizing the classifier performance. Variations on this setup are possible, such as (significantly) reducing the number of points needed to specify the probability distributions, but this approach was found to be robust to perturbations in initialization and network architecture. For this paper, it was found that the network did not work well with fewer than 25k example jets per parameter point. For each network, 250k parameter points were used for training in the Z′ and ungroomed Higgs cases (450k in the groomed Higgs case).
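The training-set generation step described above can be sketched as follows: sample exponent vectors uniformly in [−5, 5] and, for each one, record the product-observable values over a fixed set of jets (a sketch with a toy jet sample and a hypothetical helper name; the sorted log-values stand in for the stored distributions, following the log transformation described in the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_regression_dataset(taus, n_points, n_params):
    """Sample exponent vectors uniformly in [-5, 5] and, for each one,
    record the product observable evaluated on all jets.  `taus` has
    shape (n_jets, n_params); in the paper n_jets would be 25,000."""
    X, Y = [], []
    for _ in range(n_points):
        params = rng.uniform(-5.0, 5.0, size=n_params)
        beta = np.prod(taus ** params, axis=1)
        # Store sorted log(beta): the log tames numerical instability and
        # the sorted values characterize the one-dimensional distribution.
        X.append(params)
        Y.append(np.sort(np.log(beta)))
    return np.array(X), np.array(Y)

# Toy example: 100 "jets" with 5 basis observables (the H -> bb case)
taus = rng.uniform(0.01, 1.0, size=(100, 5))
X, Y = make_regression_dataset(taus, n_points=10, n_params=5)
```

Each row of `X` is a parameter point and the matching row of `Y` is the target output of the regression network.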
In the groomed Higgs case only, a single network was trained for signal and background, with a 1/0 switch added to the input. Separate networks were trained for signal and background in the Z′ and ungroomed Higgs cases. To reduce the effects of numerical instability on the training of these networks, we train on samples after taking the natural logarithm of the 25k measured values of the product observables.
Aside from the use (or not) of the switch input, both the H → bb and Z′ tasks use simple fully-connected neural networks with two hidden layers. The input layer is followed by a dense layer with either 250 or 500 nodes, then another dense layer with 100 or 250 nodes, followed by an output layer of 25,000 nodes with a linear activation. The number of nodes in the hidden layers was larger for the groomed Z′ case than for the Higgs case or the ungroomed Z′ case. We use leaky rectified linear units (Leaky ReLU) as the activations for the hidden layers. The networks were compiled with a mean squared error loss function (on the penultimate layer shown in Fig. 1, not on p(β_M) directly), using Adam optimization [75]. The regression networks were each trained for ∼10,000 epochs. All deep learning tasks were carried out with the Keras [76] deep learning library, using the TensorFlow [77] backend.
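A sketch of this regression network in Keras (layer widths are taken from the text for the smaller configuration; everything else, including the builder name, is an assumption for illustration):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_regression_net(n_params=5, n_outputs=25_000,
                         width1=250, width2=100):
    """Map a parameter vector {a, b, c, ...} to 25,000 values that
    characterize the distribution of log(beta_M) for those exponents."""
    model = keras.Sequential([
        keras.Input(shape=(n_params,)),
        layers.Dense(width1),
        layers.LeakyReLU(),
        layers.Dense(width2),
        layers.LeakyReLU(),
        layers.Dense(n_outputs, activation="linear"),
    ])
    # Mean squared error on the 25,000 outputs, optimized with Adam.
    model.compile(optimizer="adam", loss="mse")
    return model

model = build_regression_net()
fake_params = np.random.uniform(-5, 5, size=(4, 5)).astype("float32")
pred = model.predict(fake_params, verbose=0)
```

The 25,000-dimensional output plays the role of a sampled dataset at the requested parameter point, from which the signal and background distributions are histogrammed.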
Given the set of 25,000 values of the β_M observable for a given set of parameters, it is straightforward to use these networks in an optimality scan. For this purpose, we use SciPy's [78] basin-hopping [79] global minimum finder with the non-linear, derivative-free COBYLA (Constrained Optimization BY Linear Approximation) [80] minimizer to scan over local minima. In the optimization, the networks are used to predict the background and signal distributions for a given set of parameters. The one-dimensional binned likelihood distributions² of the observable, constructed from the network outputs, were then used to calculate the area under the ROC curve (AUC) as an estimate of the discrimination power, where (1 − AUC) was explicitly chosen as the metric for the basin-hopping minimization. Appendix A illustrates that the regression networks accurately model the dependence of the AUC on the parameters. The observable selected using this procedure will be denoted $\beta^{\mathrm{ML}}_{3,H\to bb}$ in the next sections. We also note that the space of possible inputs is degenerate, since a monotonic function of an observable has the same discrimination power as the original observable. However, due to the finite binning required to calculate the AUCs from the likelihood distributions, and statistical fluctuations in a given data sample, the observables do not have precisely the same power as monotonic functions of themselves. The issue of degeneracies is not explicitly dealt with in the minimization procedure, but if the networks are adequately trained over the input space, it is sufficient to locate any one 'global' minimum among local minima of similar depth, using basin-hopping or any other global minimizer.
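The optimality scan can be sketched with SciPy and scikit-learn. This is a sketch only: the trained regression networks are stubbed with toy functions (`sig_net`, `bkg_net` are our own stand-ins), and for brevity the AUC is computed directly from the predicted samples rather than from the binned likelihood ratio described in the text:

```python
import numpy as np
from scipy.optimize import basinhopping
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Stand-ins for the trained regression networks: each maps a parameter
# vector to a sample of log(beta) values for signal or background.
def sig_net(params):
    return rng.normal(loc=params.sum(), scale=1.0, size=1000)

def bkg_net(params):
    return rng.normal(loc=0.0, scale=1.0, size=1000)

def objective(params):
    """1 - AUC of the one-dimensional observable; basin-hopping minimizes this."""
    sig, bkg = sig_net(params), bkg_net(params)
    labels = np.concatenate([np.ones_like(sig), np.zeros_like(bkg)])
    scores = np.concatenate([sig, bkg])
    return 1.0 - roc_auc_score(labels, scores)

# COBYLA is derivative-free, so the noisy network-based objective is usable.
result = basinhopping(objective, x0=np.zeros(5), niter=10,
                      minimizer_kwargs={"method": "COBYLA"})
best_params = result.x
```

`best_params` then specifies the exponents {a, ..., e} of the selected product observable.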
A second approach to optimizing {a, b, c, ...} uses Eq. 2 directly: the product form itself serves as a tunable discriminant for predicting signal versus background. This is a more direct way of identifying the optimal solution, without explicitly modeling the probability distributions. Optimizing a generic function is possible with methods like stochastic gradient descent, but the product observable is amenable to a significant simplification³. In particular, two classifiers that are monotonic transformations of each other yield the same classification performance. By taking the logarithm of Eq. 2, one can transform the problem into linear regression⁴, where the inputs are log(τ) and the coefficients are the exponents. This approach uses the mean squared error loss to identify {a, b, c, ...}. The observable selected using this procedure will be denoted $\tilde{\beta}^{\mathrm{ML}}_{3,H\to bb}$ in the next sections.
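After the logarithm, the fit reduces to ordinary least squares of the jet label on the log τ inputs. A minimal sketch with scikit-learn on toy data (the τ values here are made up; only the structure of the fit mirrors the text):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Toy N-subjettiness values for signal (label 1) and background (label 0);
# the 5-column case corresponds to the 3-body basis.
sig = rng.uniform(0.01, 0.5, size=(5000, 5))
bkg = rng.uniform(0.05, 1.0, size=(5000, 5))
X = np.log(np.vstack([sig, bkg]))
y = np.concatenate([np.ones(5000), np.zeros(5000)])

# Mean-squared-error fit of the label on log(tau); the fitted coefficients
# are the exponents {a, b, c, d, e}, since
#   log(beta) = a*log(tau_1^(0.5)) + b*log(tau_1^(1)) + ...
fit = LinearRegression().fit(X, y)
exponents = fit.coef_
```

Because the classifier is only defined up to monotonic transformations, the intercept and overall scale of the fit are irrelevant; only the exponent vector matters.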
In the limit of infinite data and an arbitrarily flexible neural network, both the ensemble-learning and linear-regression approaches should achieve the same performance. The latter is significantly easier to train, but the more complex approach may provide additional benefits: by giving access to the probability distributions, it allows one to directly optimize any performance metric. This includes batch-level losses like the AUC, the false-positive rate at a fixed true-positive rate, etc. The mean squared error loss should be sufficient to optimize all of these metrics, but may be prevented from reaching the desired optimum due to limited training statistics. In practice, we do not find this to be the case with the setup presented here, but the structure may be useful for related tasks in the future.

² In principle, one can estimate the AUC without binning, but it was found that there was not a significant sensitivity to the choice of binning.
³ We thank Eric Metodiev for this insightful observation.
⁴ Linear regression was proven to be sufficient for all IRC-safe observables in Ref. [5]; however, our results need not be IRC safe.

IV. RESULTS
In this section, we present the new observables obtained for the different classification tasks for the ungroomed Z′ samples (the groomed case is in Appendix C). As a closure test, we first demonstrate that the new procedure produces an observable for ungroomed H → bb discrimination with the same performance as the β₃ observable proposed in Ref. [8] (the groomed case is in Appendix B). Then we extend the procedure to higher M-body phase space by applying it to Z′ discrimination for three values of m_{Z′}, and propose new observables for those classification tasks.
Utilizing the result that discrimination power for ungroomed H → bb vs. g → bb discrimination saturates at 3-body phase space, we use the procedures proposed in the previous section to find the optimal product observable. The final values of the parameters {a, ..., e} obtained through the optimization are presented in Table I, along with those obtained in the previous study. Interestingly, the exponents from the ensemble method are nearly the same for a, b, d, and e, but slightly different for c. For the regression method, the exponents are nearly the same as those of the ensemble method up to a constant factor (approximately −2) for c, d, and e, but not for a and b. These results indicate the presence of multiple observables with comparable performance. In Fig. 2a, we plot the distributions of the new observable computed for signal and background, along with the prediction from the ensemble neural network. We note that the network provides a good match to the true distribution, where the latter is also calculated on 10 times more jets. Further, in Fig. 2b we plot the distributions of the observable obtained via the ML regression method. We then compare the ROC curves of the new observables to the $D_2^{(2)}$ [67], $N_2^{(2)}$ [68], and $\tau_{21}^{(2)}$ observables in Fig. 2c. In addition, we compare the new observables to β₃ in Fig. 2d to demonstrate that the three observables have essentially the same discrimination power, as expected. This allows us to proceed to applying the procedure to higher-dimensional problems.
We first train neural network classifiers on the M-body N-subjettiness bases to identify the point of saturation of discrimination power for each value of m_{Z′}.⁵ The results are presented in Fig. 3, showing that saturation occurs at the 4-body phase space in each case.
We then construct the $\beta^{\mathrm{ML}}_{4,Z'}$ and $\tilde{\beta}^{\mathrm{ML}}_{4,Z'}$ product observables from the elements of the 8-dimensional 4-body basis, running the procedure described in Sec. III to obtain new observables optimized for Z′ discrimination at three different values of m_{Z′}.
We present the distributions of the new observables for Z′ discrimination in Fig. 4 and then compare their discrimination power to standard observables and to DNNs trained on the spanning N-subjettiness bases in Fig. 5.
The corresponding values of {a, b, c, ..., h} and the AUCs are given in Tables II, III, and IV, respectively. The comparison of the true and predicted distributions in Fig. 4 illustrates the excellent quality of the regression network. The ROC curves in Fig. 5 show that the learned $\beta^{\mathrm{ML}}$ and $\tilde{\beta}^{\mathrm{ML}}$ outperform the state-of-the-art single physics-motivated observables (top row), though the product observables do not fully saturate the performance of the DNN trained on the full 4-body phase space (bottom row). This suggests that a more flexible form (other than a simple product) is required to capture more of the classification information in a simple observable. The parameter values obtained from the ensemble and regression methods are not simple rescalings of each other, though the fact that both have similar performance suggests that one is a monotonic transformation of the other.

⁵ A single neural network architecture, consisting of seven fully connected (five hidden) layers, was used for all of the classification tasks. The first four dense layers consisted of 1000, 1000, 750, and 500 nodes, respectively, and were assigned a Dropout [81] regularization of 0.2 to prevent over-fitting on the training data. The next two dense layers consisted of 250 nodes with Dropout regularization of 0.1 and 100 nodes without Dropout. The input layer and all hidden layers used the ReLU activation function [82], while the output layer, consisting of a single node, used a sigmoid activation. The network was compiled with the binary cross-entropy loss, using Adam optimization [75].
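The classifier architecture described in footnote 5 can be sketched in Keras as follows (layer widths, dropout rates, activations, and loss are from the text; the builder name and input dimension, shown for the 4-body basis, are assumptions):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_classifier(n_inputs=8):
    """Fully connected classifier: dense layers of 1000, 1000, 750, and
    500 nodes with 20% dropout, then 250 nodes (10% dropout) and 100
    nodes, ReLU activations throughout, and a single sigmoid output."""
    model = keras.Sequential([keras.Input(shape=(n_inputs,))])
    for width in (1000, 1000, 750, 500):
        model.add(layers.Dense(width, activation="relu"))
        model.add(layers.Dropout(0.2))
    model.add(layers.Dense(250, activation="relu"))
    model.add(layers.Dropout(0.1))
    model.add(layers.Dense(100, activation="relu"))
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

model = build_classifier()
out = model.predict(np.zeros((2, 8), dtype="float32"), verbose=0)
```

Training on the M-body basis for increasing M, and watching when the AUC stops improving, identifies the saturation point used throughout the paper.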
The optimized $\beta^{\mathrm{ML}}$ and $\tilde{\beta}^{\mathrm{ML}}$ observables are not identical for the different values of m_{Z′} (Tables II and III); it would be interesting to study to what extent the trends are physical or are due to the existence of multiple observables with similar performance. We leave this study to future work. However, a first indication that the observables contain similar physical information is presented in Appendix D, where the product observable optimized for one mass is applied to another mass. The ROC curves are similar for all three product observables when applied to the same m_{Z′}.

V. CONCLUSIONS
This paper has extended the growing literature of machine-learning assisted jet substructure-based tagging in two ways. First, we have developed a procedure to automatically identify the optimal product observable, using the N-subjettiness features as an example. This is an important innovation because observables with relatively simple analytic forms are robust complements to complex neural network classifiers, and prior to this work there was no efficient way to identify the best exponents in the product. Second, we have used this automated framework to identify the optimal product observables for searches for boosted resonances like the Z′ boson, but with beyond-the-Standard-Model masses. Jet substructure has proven to be a powerful toolset for such searches, but until now there have been few studies of the mass dependence of the optimal observables. Future extensions of the methods introduced in this paper may simplify the regression procedure, as well as study the connections between different classifiers with similar performance (including those connected by monotonic functions). The power of the method may also be extended by considering parametric forms other than products. Classification problems demanding a higher M-body phase space are a natural extension of the examples presented here.
As machine learning techniques are used more widely to guide the optimal selection of classifiers, there will be a growing need to simplify and interpret the guidance from the machines. We have presented an automated approach to construct optimal observables with simple, analytic forms, which can be used for further theoretical and experimental studies. This technique will form the basis of multiple extensions in the future to improve classification performance and increase the robustness of searches and measurements at the LHC and beyond.

Appendix B: Groomed H → bb vs. g → bb

In Fig. 7a, we plot the distributions of the new observable, as proposed in Ref. [8] and as constructed via the procedure presented in this work, computed for signal and background, along with the prediction from the neural network. We note that the network provides a good match to the true distribution, where the latter is also calculated on 10 times more jets.
In addition, we compare the new observable to $\beta_3^{(g)}$ in Fig. 7d to demonstrate that both observables have essentially the same discrimination power, as expected. This allows us to proceed to applying the procedure to higher-dimensional problems. Further, we plot the ROC curve for the 4-body product observable from the linear regression method, noting that it provides the best performance of the observables that have been explored for this problem.⁶

Appendix C: Groomed Z′ vs. QCD

In this appendix we carry out the same set of studies for mMDT-groomed Z′ discrimination as for the ungroomed cases of Sec. IV B. As in the ungroomed case, Fig. 8 indicates that the saturation of discrimination power occurs at 4-body phase space. The results for the final observables at the three m_{Z′} points are presented in Tables VI and VII, and the observable distributions are plotted in Fig. 9. The performance of the new observables is compared to standard ones and to M-body DNNs in Fig. 10, and the corresponding AUCs are shown in Table VIII for the different mass points. The conclusions of this appendix are qualitatively the same as those of Sec. IV B, with a slightly lower AUC for both the product observables and the physics-motivated observables. Importantly, the product observables in the groomed case come closer to saturating the bounds from the M-body phase space than in the ungroomed case.
Appendix D: Mass dependence of $\beta^{\mathrm{ML}}_M$

Here, we briefly study the performance of the new observables presented in Sec. IV B when they are tested on a different combination of signal and background samples from the one they were optimized on; for example, we calculate the new observable optimized for m_{Z′} = 130 GeV on signal samples with m_{Z′} = 90 GeV, and on background events that pass the mass window in which the 90 GeV observable was optimized. The results of this study are presented in Fig. 11 and indicate that, while these observables are optimized on samples from a specific mass point, they can be applied to other classification tasks and still provide better discrimination than standard observables. This also suggests that the different parameter sets in Tables II and III may represent observables with very similar physical information, even though the N-subjettiness variables are not invariant under transverse boosts.
the product observables up to M = 8 in Fig. 12.
We observe that discrimination power gradually increases up to the inclusion of 7- or 8-body phase-space variables. Compared to the ROC curve at the point of saturation, from the 4-body DNN classifier, these results suggest that, while a DNN can adjust thresholds on the M-body inputs such that there is effectively only redundant discriminating information in higher M-body bases, as is also expected from the physics study in Ref. [7], the product observables do still benefit from including N-subjettiness variables from beyond the point of saturation.
Depending on the classification task, the product observables may even come very close to matching the performance of a saturated ML classifier (Fig. 10). Ultimately, however, they cannot capture all of the available information, due to the limited flexibility of the product-form ansatz. These observations will of course vary based on the objects being studied. We leave further physics studies of the product form or other equivalent ansätze to future work.

[Figure legend: 13 TeV, p_T > 500 GeV, R = 0.8, Pythia8, m ∈ [45, 85]]