Jet grooming through reinforcement learning

We introduce a novel implementation of a reinforcement learning (RL) algorithm which is designed to find an optimal jet grooming strategy, a critical tool for collider experiments. The RL agent is trained with a reward function constructed to optimize the resulting jet properties, using both signal and background samples in a simultaneous multi-level training. We show that the grooming algorithm derived from the deep RL agent can outperform state-of-the-art techniques used at the Large Hadron Collider, resulting in improved mass resolution for boosted objects. Given a suitable reward function, the agent learns how to train a policy which optimally removes soft wide-angle radiation, allowing for a modular grooming technique that can be applied in a wide range of contexts. These results are accessible through the corresponding GroomRL framework.


I. INTRODUCTION
Jets are one of the most common objects appearing in proton-proton colliders such as the Large Hadron Collider (LHC) at CERN. They are defined as collimated bunches of high-energy particles, which emerge from the interactions of quarks and gluons, the fundamental constituents of the proton. In modern analyses, final-state particle momenta are mapped to jet momenta using a sequential recombination algorithm with a single free parameter, the jet radius R, which defines up to which angle particles can get recombined into a given jet.
An example of an LHC collision resulting in two jets is shown in figure 1, where the towers correspond to energy deposits in the calorimeter. The right-hand side gives a schematic visualization of two different representations of jets, either as an image where the pixel intensity encodes the energy flow in that phase-space region [1], or as a tree defined by the recombination sequence of the jet algorithm.
Due to the very high energies of its collisions, the LHC is routinely producing heavy particles, such as top quarks and vector bosons, with transverse momenta far greater than their rest mass. When these objects are sufficiently energetic (or boosted), they can often generate very collimated decays, which are then reconstructed as a single fat jet. These fat jets originating from boosted objects can be distinguished from standard quark and gluon jets by studying differences in their radiation patterns. Since the advent of the LHC program, the physics of the substructure of jets has matured into a remarkably active field of research that has become notably conducive to applications of recent Machine Learning techniques [2][3][4][5][6][7][8].
A particularly useful set of tools for experimental analyses are jet grooming algorithms [9][10][11][12][13], defined as a postprocessing treatment of jets to remove soft wide-angle radiation which is not associated with the underlying hard substructure. Grooming techniques play a crucial role in Standard Model measurements [14,15] and in improving the boson- and top-tagging efficiencies at the LHC.
In this article we introduce a novel framework, which we call GroomRL, to train a grooming algorithm using reinforcement learning (RL) [16,17]. To this end, we decompose the problem of jet grooming into successive steps for which a reward function can be designed taking into account the physical features that characterize such a system. We then use a modified implementation of a Deep Q-Network (DQN) agent [16,18] and train a dense neural network (NN) to optimally remove radiation unassociated with the core of the jet. The trained model can then be applied to other data sets, showing improved resolution compared to state-of-the-art techniques as well as a strong resilience to non-perturbative effects. The framework and data used in this paper are available as open-source and published material in [19,20].

Algorithm 1 Grooming
Input: policy πg, binary tree node

II. JET REPRESENTATION
Let us start by introducing the representation we use for jets. We take the particle constituents of a jet, as defined by any modern algorithm, and recombine them using a Cambridge/Aachen (CA) sequential clustering algorithm [21]. The CA algorithm does a pairwise recombination, adding together the momenta of the two particles with the closest distance as defined by the measure

$$\Delta_{ij}^2 = (y_i - y_j)^2 + (\phi_i - \phi_j)^2, \tag{1}$$

where $y_i$ is the rapidity, a measure of relativistic velocity along the beam axis, and $\phi_i$ is the azimuthal angle of particle $i$ around the same axis. This clustering sequence is then used to recast the jet as a full binary tree, where each of the nodes contains information about the kinematic properties of the two parent particles. For each node $i$ of the tree we define an object $T^{(i)}$ containing the current observable state $s_t$, as well as a pointer to the two children nodes and one to the parent node. The children nodes $a$ and $b$ are ordered in transverse momentum such that $p_{t,a} > p_{t,b}$, and we label $a$ the "harder" child and $b$ the "softer" one. The set of possible states is defined by a five-dimensional box, such that the state of the node is a tuple

$$s_t = \{z, \Delta_{ab}, \psi, m, k_t\}, \tag{2}$$

where $z = p_{t,b}/(p_{t,a} + p_{t,b})$ is the momentum fraction of the softer child $b$, $\psi$ is the azimuthal angle around the $i$ axis, $m$ is the mass, and $k_t = p_{t,b} \Delta_{ab}$ is the transverse momentum of $b$ relative to $a$.
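As an illustration of the quantities stored at each node, the following minimal Python sketch computes the CA distance and three of the node observables from two children. The `PseudoJet` class and the function names are hypothetical and not part of GroomRL; the momentum fraction is assumed to follow the usual Soft Drop convention $z = p_{t,b}/(p_{t,a}+p_{t,b})$.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class PseudoJet:
    """Hypothetical minimal particle: transverse momentum, rapidity, azimuth."""
    pt: float
    y: float
    phi: float


def delta(a: PseudoJet, b: PseudoJet) -> float:
    """CA distance sqrt(dy^2 + dphi^2), with dphi wrapped into [-pi, pi)."""
    dphi = (a.phi - b.phi + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(a.y - b.y, dphi)


def node_features(p1: PseudoJet, p2: PseudoJet) -> Tuple[float, float, float]:
    """Return (z, Delta_ab, k_t), labelling the higher-pt child as a."""
    a, b = (p1, p2) if p1.pt >= p2.pt else (p2, p1)
    d = delta(a, b)
    z = b.pt / (a.pt + b.pt)   # assumed Soft Drop momentum fraction
    kt = b.pt * d              # transverse momentum of b relative to a
    return z, d, kt
```

The azimuthal difference is wrapped so that two particles on either side of $\phi = 0$ are still treated as close.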

A. Grooming algorithm
A grooming algorithm acting on a jet tree can be defined by a simple recursive procedure which follows each of the branches and uses a policy $\pi_g(s_t)$ to decide, based on the values of the current tuple $s_t$, whether to remove the softer of the two branches. This is shown in Algorithm 1, where the minus sign is understood to mean the update of the kinematics of a node after removal of a soft branch. The grooming policy $\pi_g(s_t)$ returns an action $a_t \in \{0, 1\}$, with $a_t = 1$ corresponding to the removal of a branch, and $a_t = 0$ leaving the node unchanged. The state $s_t$ is used to evaluate the current action-values $Q^*(s, a)$ for each possible action, which in turn are used to determine the best action at this step through a greedy policy.
An example of the action of a grooming algorithm on a tree is shown in figure 2, where the groomed branches are indicated in red. The tree nodes whose kinematics have been modified by the removal of a branch are indicated with a prime.
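The recursive procedure described above can be sketched in Python as follows. This is a minimal illustration under simplifying assumptions, not the GroomRL implementation: the `Node` class, the `join` helper and the use of plain `(px, py, pz, E)` tuples are hypothetical, and the policy is any callable returning 0 or 1 for a node.

```python
import math


class Node:
    """Hypothetical tree node holding a four-momentum (px, py, pz, E)."""
    def __init__(self, p, harder=None, softer=None):
        self.p = p
        self.harder = harder
        self.softer = softer
        self.parent = None


def pt(p):
    return math.hypot(p[0], p[1])


def join(n1, n2):
    """Recombine two nodes, labelling the higher-pt one as the harder child."""
    a, b = (n1, n2) if pt(n1.p) >= pt(n2.p) else (n2, n1)
    node = Node(tuple(x + y for x, y in zip(a.p, b.p)), harder=a, softer=b)
    a.parent = b.parent = node
    return node


def groom(policy, node):
    """Recursively apply a grooming policy to the tree rooted at node."""
    if node is None or node.harder is None:
        return  # leaf: nothing to decluster
    if policy(node) == 1:
        removed = node.softer.p
        anc = node
        while anc is not None:  # update kinematics of node and all ancestors
            anc.p = tuple(x - y for x, y in zip(anc.p, removed))
            anc = anc.parent
        kept = node.harder      # the node now plays the role of its harder child
        node.harder, node.softer = kept.harder, kept.softer
        for child in (node.harder, node.softer):
            if child is not None:
                child.parent = node
        groom(policy, node)
    else:
        groom(policy, node.harder)
        groom(policy, node.softer)
```

For example, `groom(lambda n: 1 if pt(n.softer.p) < 5.0 else 0, root)` removes every softer branch below a fixed transverse momentum threshold, a toy policy chosen only for illustration.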
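It is easy to translate modern grooming algorithms into this language. For example, Recursive Soft Drop (RSD) [13] corresponds to a policy

$$\pi_\text{RSD}(s_t) = \begin{cases} 0 & \text{if } z > z_\text{cut} \left(\Delta_{ab}/R_0\right)^{\beta}, \\ 1 & \text{otherwise,} \end{cases} \tag{3}$$

where $z_\text{cut}$, $\beta$ and $R_0$ are the parameters of the algorithm, and 1 corresponds as before to the action of removing the tree branch with smaller transverse momentum.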
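As a concrete example of a policy in this language, the sketch below encodes the standard Soft Drop condition $z > z_\text{cut}(\Delta_{ab}/R_0)^\beta$ as a function returning the action. The function name and signature are hypothetical; the default parameter values are the benchmark values quoted later in this article.

```python
def rsd_policy(z: float, delta_ab: float,
               z_cut: float = 0.05, beta: float = 1.0, r0: float = 1.0) -> int:
    """Return 0 (keep) if the Soft Drop condition holds, 1 (remove softer branch) otherwise."""
    passes = z > z_cut * (delta_ab / r0) ** beta
    return 0 if passes else 1
```

With these defaults, a splitting at $\Delta_{ab} = 0.5$ is kept when $z > 0.025$ and groomed otherwise.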

III. SETTING UP A GROOMING ENVIRONMENT
In order to find an optimal grooming policy $\pi_g$, we introduce an environment and a reward function, formulating the problem in a way that can be solved using an RL algorithm.
We initialize a list of all trees used for the training, from which a tree is randomly selected at the beginning of each episode. We then start by adding the root of the current tree to an empty priority queue, which orders the nodes it contains according to their $\Delta_{ab}$ value. Each step consists of removing the first node from the priority queue and deciding, based on the state $s_t$ of that node, which of its branches to keep. Once a decision has been taken on the removal of the softer branch, and the parent nodes have been updated accordingly, the remaining children of the node are added to the priority queue. The reward function is then evaluated using the current state of the tree. The episode terminates once the priority queue is empty.
The framework described here deviates from usual RL implementations in that the range of possible states for any episode is fixed at the start. The transition probability between states $P(s_{t+1}|s_t, a_t)$ therefore does not necessarily depend very strongly on the action, although a grooming action can result in the removal of some of the future states and will therefore still have an effect on the distribution.
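The episode structure described in this section can be sketched with a small priority-queue environment. This is a simplified illustration, not the GroomRL environment: the class and attribute names are hypothetical, the kinematic update of parent nodes and the reward are omitted, and we assume that nodes with the largest $\Delta_{ab}$ are visited first.

```python
import heapq
import itertools
import random


class TreeNode:
    """Hypothetical node: opening angle delta and optional (harder, softer) children."""
    def __init__(self, delta, harder=None, softer=None):
        self.delta = delta
        self.harder = harder
        self.softer = softer


class GroomingEnv:
    def __init__(self, trees, seed=None):
        self.trees = list(trees)
        self.rng = random.Random(seed)
        self._tie = itertools.count()  # tie-breaker so nodes are never compared
        self.queue = []

    def _push(self, node):
        # Only nodes with children carry a decision; leaves are skipped.
        if node is not None and node.harder is not None:
            heapq.heappush(self.queue, (-node.delta, next(self._tie), node))

    def _current(self):
        return self.queue[0][2] if self.queue else None

    def reset(self):
        """Start a new episode from a randomly selected tree."""
        self.queue = []
        self._push(self.rng.choice(self.trees))
        return self._current()

    def step(self, action):
        """Apply action to the first node; action 1 drops the softer branch."""
        _, _, node = heapq.heappop(self.queue)
        self._push(node.harder)
        if action == 0:
            self._push(node.softer)
        return self._current(), not self.queue
```

Because a removed branch is never pushed onto the queue, a grooming action directly removes the corresponding future states from the episode.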

A. Finding optimal hyper-parameters
The optimal choice of hyper-parameters, both for the model architecture and for the grooming parameters, is determined using the distributed asynchronous hyper-parameter optimization library hyperopt [22].
(The priority queue used in the grooming environment allows for easier extensions to fixed-depth algorithms such as the modified Mass Drop Tagger [11] and Soft Drop [12].)
The performance of an agent is quantified by a loss function evaluated on a distinct validation set consisting of 50k signal and background jets. For each sample, we evaluate the jet mass after grooming of each jet and derive the corresponding distribution. To calculate the loss function $\mathcal{L}$, we start by determining a window $(w_\text{min}, w_\text{max})$ containing a fraction $f = 0.6$ of the final jet masses of the groomed signal distribution, defining $w_\text{med}$ as the median value on that interval. The loss function is then built from the width of this window, $w_\text{max} - w_\text{min}$, the distance $|m_\text{target} - w_\text{med}|$ of the median from the reference value, and $f_\text{bkg}$, where $f_\text{bkg}$ is the fraction of the groomed background sample contained in the same interval, and $m_\text{target}$ is a reference value for the signal. We scan hyper-parameters using 1000 iterations and select the ones for which the loss $\mathcal{L}$ evaluated on the validation set is minimal. In practice we will do three different scans: to determine the best parameters of the reward function, to find an optimal grooming environment, and to determine the architecture of the DQN agent. The scan is performed by requiring hyperopt to use a uniform search space for continuous parameters, a log-uniform search space for the learning rate and a binary choice for all integer or boolean parameters. The optimization used in all the results presented in this work relies on the Tree-structured Parzen Estimator (TPE) algorithm.
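The ingredients of the loss can be computed as in the following sketch. Two points are assumptions rather than statements of the method used here: the window is taken to be the narrowest interval containing the requested fraction of signal masses, and the three ingredients are combined linearly with placeholder weights `c_width`, `c_peak` and `c_bkg`, which are hypothetical names and do not reproduce the coefficients used for the results of this article.

```python
import math
import statistics


def mass_window(masses, fraction=0.6):
    """Narrowest interval holding `fraction` of the values: (w_min, w_max, w_med)."""
    m = sorted(masses)
    k = max(1, math.ceil(fraction * len(m) - 1e-9))  # number of jets in the window
    start = min(range(len(m) - k + 1), key=lambda i: m[i + k - 1] - m[i])
    inside = m[start:start + k]
    return inside[0], inside[-1], statistics.median(inside)


def background_fraction(masses, w_min, w_max):
    """Fraction of background jet masses falling inside the signal window."""
    return sum(w_min <= x <= w_max for x in masses) / len(masses)


def loss(signal, background, m_target, fraction=0.6,
         c_width=1.0, c_peak=1.0, c_bkg=1.0):
    """Linear combination of window width, peak offset and background fraction.

    The weights are placeholders for illustration only.
    """
    w_min, w_max, w_med = mass_window(signal, fraction)
    f_bkg = background_fraction(background, w_min, w_max)
    return (c_width * (w_max - w_min)
            + c_peak * abs(m_target - w_med)
            + c_bkg * f_bkg)
```

A function of this kind could then be passed as the objective to a hyper-parameter scan, with the groomed validation masses recomputed for each candidate model.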

B. Defining a reward function
One of the key ingredients for the optimization of the grooming policy is the reward function used at each step during the training. We consider a reward with two components: a first piece evaluated on the full tree, and another that considers only the kinematics of the current node.
The first component of the reward compares the mass of the current jet to a set target mass, typically the mass of the underlying boosted object. We implement this mass reward using a Cauchy distribution, which has two free parameters, the target mass $m_\text{target}$ and a width $\Gamma$, so that

$$R_M(m) = \frac{\Gamma^2}{\pi\left(|m - m_\text{target}|^2 + \Gamma^2\right)}.$$

Separately, we calculate a reward on the current node which gives a positive reward for the removal of wide-angle soft radiation, as well as for leaving intact hard-collinear emissions. This provides a baseline behavior for the groomer. We label this reward component "Soft-Drop" due to its similarity with the Soft Drop condition [12], and implement it through exponential distributions

$$R_\text{SD}(a_t, \Delta, z) = a_t \min\left(1, e^{-\alpha_1 \ln(1/\Delta) + \beta_1 \ln(z_1/z)}\right) + (1 - a_t) \max\left(0, 1 - e^{-\alpha_2 \ln(1/\Delta) + \beta_2 \ln(z_2/z)}\right),$$

where $a_t = 0, 1$ is the action taken by the policy, and $\alpha_i$, $\beta_i$, $z_i$ are free parameters. The two terms determining $R_\text{SD}$ are shown in the lower panel of figure 3, using parameter values determined through asynchronous hyper-parameter optimization, shown in the upper row of the figure.
The total reward function is then given by

$$R(m, a_t, \Delta, z) = R_M(m) + \frac{1}{N_\text{SD}} R_\text{SD}(a_t, \Delta, z).$$

Here $N_\text{SD}$ is a normalization factor determining the weight given to the second component of the reward.
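The two reward components can be sketched in Python as follows. The function names are hypothetical, and two details are assumptions made for this illustration: the Cauchy shape is normalized as $\Gamma^2/\pi$ over the usual denominator, and the term rewarding $a_t = 0$ mirrors the removal term with its own parameters $(\alpha_2, \beta_2, z_2)$. The identity $\min(1, e^x) = e^{\min(x, 0)}$ is used to avoid overflow for very small $z$.

```python
import math


def mass_reward(m, m_target, gamma):
    """Cauchy-shaped reward peaked at m_target (normalization is an assumption)."""
    return gamma ** 2 / (math.pi * ((m - m_target) ** 2 + gamma ** 2))


def _gate(delta, z, alpha, beta, z_ref):
    """min(1, exp(-alpha*ln(1/delta) + beta*ln(z_ref/z))), computed without overflow."""
    x = -alpha * math.log(1.0 / delta) + beta * math.log(z_ref / z)
    return math.exp(min(x, 0.0))


def soft_drop_reward(action, delta, z, alpha1, beta1, z1, alpha2, beta2, z2):
    """Reward removal of soft wide-angle branches and retention of hard collinear ones."""
    remove_term = _gate(delta, z, alpha1, beta1, z1)
    keep_term = 1.0 - _gate(delta, z, alpha2, beta2, z2)  # assumed form for action 0
    return action * remove_term + (1 - action) * keep_term


def total_reward(m, action, delta, z, m_target, gamma, n_sd, **sd_params):
    """Mass reward plus the Soft-Drop component weighted by 1/n_sd."""
    return (mass_reward(m, m_target, gamma)
            + soft_drop_reward(action, delta, z, **sd_params) / n_sd)
```

With this form, removing a soft wide-angle emission ($\Delta$ close to 1, small $z$) saturates the first term at 1, while keeping a hard collinear emission drives the second term towards 1.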

C. RL implementation and multi-level training
For the applications in this article, we have implemented a DQN agent that contains a groomer module, which is defined by the underlying NN model and the test policy used by the agent. The groomer can be extracted after the model has been trained, using a greedy policy to select the best action based on the Q-values predicted by the NN. This allows for straightforward application of the resulting grooming strategy on new samples.
The training sample consists of 500k signal and background jets simulated using Pythia 8.223 [23]. We will construct two separate models by considering two signal samples, one with boosted $W$ jets and one with boosted top jets, while the background always consists of QCD jets. We use the $WW$ and $t\bar{t}$ processes, with hadronically decaying $W$ and top, to create the signal samples, and the dijet process for the background. Jets are clustered using the anti-$k_t$ algorithm [24] with radius $R = 1.0$, and are required to pass a selection cut, with transverse momentum $p_t > 500$ GeV and rapidity $|y| < 2.5$. All samples used in this article are available online [20]. The grooming environment is initialized by reading in the training data and creating an event array containing the corresponding jet trees.
To train the RL agent, we use a multi-level approach taking into account both signal and background samples.
At the beginning of each episode, we select either a signal jet or a background jet, with probabilities $1 - p_\text{bkg}$ and $p_\text{bkg}$ respectively. For signal jets, the reward function uses a reference mass set to the $W$-boson mass, $m_\text{target} = m_W$, or to the top mass, $m_\text{target} = m_t$, depending on the choice of sample. In the case of the background, the mass reward function in equation (7) is replaced by a dedicated background form with its own width parameter $\Gamma_\text{bkg}$. The width parameters $\Gamma$ and $\Gamma_\text{bkg}$ are set to different values for the signal and background reward functions, and are determined through a hyper-parameter scan. We found that while this multi-level training only marginally improves the performance, it noticeably reduces the variability of the model.
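The sample selection at the start of each episode can be illustrated with a few lines of Python. The function name and the returned label are hypothetical, and we assume that the background is drawn with probability $p_\text{bkg}$ and the signal with probability $1 - p_\text{bkg}$.

```python
import random


def sample_episode_tree(signal_trees, background_trees, p_bkg, rng):
    """Pick the tree for a new episode, with a label selecting the reward to use."""
    if rng.random() < p_bkg:
        return rng.choice(background_trees), "background"
    return rng.choice(signal_trees), "signal"
```

The label can then be used by the environment to switch between the signal and background versions of the mass reward.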

D. Determining the RL agent
The DQN agent uses an Adam optimizer [25], and the training is performed with a Boltzmann policy, which chooses an action according to weighted probabilities, with the current best action being the likeliest.
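The difference between the Boltzmann policy used during training and the greedy policy used by the extracted groomer can be illustrated as follows. This is a generic sketch of the two policies; the function names and the temperature parameter `tau` are assumptions and do not refer to the agent implementation used here.

```python
import math
import random


def boltzmann_probs(q_values, tau=1.0):
    """Softmax of the Q-values at temperature tau (numerically stabilized)."""
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann_action(q_values, rng, tau=1.0):
    """Sample an action; the current best action is the likeliest one."""
    probs = boltzmann_probs(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs)[0]


def greedy_action(q_values):
    """Deterministically pick the action with the largest Q-value."""
    return max(range(len(q_values)), key=q_values.__getitem__)
```

In the limit of small `tau` the Boltzmann policy reduces to the greedy one, while larger values give more exploration of the disfavoured action.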
Let us now determine the remaining parameters of the DQN agent. To this end, we perform two independent scans, for the grooming environment and for the network architecture.
The grooming environment has several options, which are shown in figure 4. Here the distributions of loss values for discrete options are displayed using violin plots, showing both the probability density of the loss values and its quartiles. The first plot is the dimensionality of the state observed at each step, which can be a subset of the tuple given in equation (2). We can observe that as the dimension of the input state is increased, the NN is able to leverage this additional information, leading to a decrease of the loss function. The scan over the normalization parameters of the reward functions shows that it is preferable to use a small width $\Gamma$ for the signal, with a large value $\Gamma_\text{bkg}$ for the background, as well as a small value for the $1/N_\text{SD}$ factor. One can also see that the multi-level training described in section III C leads to a distribution of loss values concentrated at smaller values. We have also allowed for several functional forms of the signal mass reward function, although for our final model we will use a Cauchy distribution.
The parameters of the network architecture are shown in figure 5, with the first plot showing the mass window containing 60% of the signal distribution, with the median of that interval shown in blue. The scatter plot of the learning rate used for the Adam optimizer shows that a value slightly above $10^{-4}$ yields the best result. The scan shows a preference for a dense network with a large number of units and layers as well as a dropout layer as the architecture of the NN. Finally, we see that using duelling networks [26] leads to a small improvement of the model, while double Q-learning [27] does not.

E. Optimal GroomRL model
The final GroomRL model is trained using the full training sample with 500k signal/background jets for 1M epochs. Training takes four hours on a single NVIDIA GTX 1080 Ti GPU with 12 GB of memory, which includes all the training jet trees and the DQN parameters.
The parameters of the best GroomRL model obtained following the strategy presented in the previous sections are listed in table I. Here two values are given for the $m_\text{target}$ parameter, which are used to train on either a sample consisting of $W$ bosons or of top quarks. The resulting models are labeled GroomRL-W and GroomRL-Top respectively.
In figure 6 we show the reward value during the training of GroomRL for $W$ bosons and top quarks, after applying the LOESS smoothing algorithm on the original curve. We observe an improvement of the reward function during the first 300k training epochs, with the reward becoming relatively stable after that point.

F. Alternative approaches
In this section, we have introduced a novel implementation of RL to tackle the problem of tree pruning. A number of alternative methods could be studied to approach this problem, most notably Monte-Carlo Tree Search (MCTS) algorithms [28,29] and binary classifiers. The heuristic search methods from MCTS explore the tree through random sampling, taking random actions to progress through the tree. Once an endpoint is reached, the result is used to weight the nodes and improve future decisions.
More recently, a NN based MCTSnet implementation was proposed [30], which introduces a framework to learn how to search the tree, integrating simulation-based planning into a NN.
These techniques might provide an interesting basis to construct an efficient groomer. However due to the wide variability of the trees considered in our case study, where each new episode starts from a unique tree, this would require a substantial modification of the algorithm.
Alternatively, one could use a contextual bandit solver [31,32] to train a jet grooming policy. We would expect this method to yield similar results. However, it does not allow for the modification of the future nodes by the current grooming decision, and is not as easily extendable as our current framework.
Finally, one could attempt to build a jet grooming algorithm from a binary classifier, which uses an input state to determine which action to take next. The main drawback of this method is that one cannot straightforwardly impose the mass resolution of the tree as the loss function, as this depends on previous states of the current episode. As such, the problem we consider is particularly well adapted to an RL approach.
We leave a more thorough study of the application of these alternative tools to jet grooming for future work.

IV. JET MASS SPECTRUM
Let us now apply the GroomRL models defined in section III E to new data samples. We consider three test sets of 50k elements each: one with QCD jets, one with $W$-initiated jets and one with top jets. The size of the window containing 60% of the mass spectrum of the $W$ sample, as well as the corresponding median value, are given in table II for each different grooming strategy. As a benchmark, we compare to the RSD algorithm, using parameters $z_\text{cut} = 0.05$, $\beta = 1$ and $R_0 = 1$. One can notice a sizeable reduction of the window size after grooming with the ML-based algorithms, while all groomers are able to reconstruct the peak location to a value very close to the $W$ mass. The distribution of the jet mass after grooming for each of these samples is shown in figures 7 and 8. Each curve gives the differential cross section $d\sigma/dm_j$ normalized by the total cross section. Figure 7 shows results for the grooming algorithm trained on a $W$ sample, while the results of the algorithm trained on top data are given in figure 8. As references, the ungroomed (or plain) jet mass and the jet mass after RSD grooming are also given, in blue and orange respectively. As expected, one can observe that for the ungroomed case the resolution is very poor, with the QCD jets having large masses due to wide-angle radiation, while the $W$ and top mass peaks are heavily distorted. In contrast, after applying RSD or GroomRL, the jet mass is reconstructed much more accurately. One interesting feature of GroomRL is that it is able to lower the jet mass for quark and gluon jets, further reducing the background contamination in windows close to a heavy particle mass.
For the $W$ case, shown in figures 7b and 8b, there is a sharp peak around the $W$ mass $m_W$, with the GroomRL method providing slightly better resolution. It is also particularly noteworthy that both the GroomRL-W and the GroomRL-Top algorithms have similar performance, despite the latter one having been trained on a completely different data set. This demonstrates that the tools derived from our framework are robust and can be applied to data sets beyond their training range with good results.
In top jets, displayed in figures 7c and 8c, the enhancements are even more noticeable. Here again, the performance of both algorithms is similar, despite the fact that the training of GroomRL-W did not involve any top-related data.
Finally, in figure 9, we show the primary Lund jet plane density as defined in [7] after grooming with GroomRL-W and GroomRL-Top, averaged over 50k jets. This gives a useful visualization of radiation patterns within a jet, providing a physical interpretation of the grooming behavior. The primary Lund jet plane is defined through the $(\ln 1/\Delta_{ab}, \ln k_t)$ coordinates of each of the states of the "primary" declustering sequence, i.e. traversing the jet tree by successively following the hardest branch $T^{(i)} \to a$. The upper boundary of the triangle is due to the kinematic limit of emissions. In contrast, the lower edge corresponds to radiation that gets removed by the grooming algorithm, so that only sufficiently energetic or collinear partons remain in the groomed jet.
An interesting feature of figure 9 is that, despite producing similar jet mass spectra, the GroomRL-W and GroomRL-Top algorithms differ somewhat, with the former retaining more radiation at wide angles than the latter.
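The primary declustering sequence entering the Lund plane can be extracted from a jet tree with a simple loop over the harder branch. The sketch below is illustrative only; the `LundNode` class and its attribute names are hypothetical.

```python
import math


class LundNode:
    """Hypothetical node storing Delta_ab, k_t and its two children."""
    def __init__(self, delta=None, kt=None, harder=None, softer=None):
        self.delta = delta
        self.kt = kt
        self.harder = harder
        self.softer = softer


def primary_lund_coordinates(root):
    """Collect (ln 1/Delta_ab, ln k_t) while following the harder branch."""
    coords = []
    node = root
    while node is not None and node.harder is not None:
        coords.append((math.log(1.0 / node.delta), math.log(node.kt)))
        node = node.harder
    return coords
```

Filling a two-dimensional histogram with these coordinates over many jets, and dividing by the number of jets, gives an averaged density of the kind discussed here.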

A. Robustness to non-perturbative effects
Let us now consider the impact of non-perturbative effects such as hadronization and underlying event on groomed jets. A key feature of grooming algorithms such as mMDT and Soft Drop is that they reduce the sensitivity of observables to non-perturbative effects, allowing for precise comparisons between theoretical predictions and experimental measurements.
To study the robustness of GroomRL to these contributions, we consider three different QCD jet samples generated through Pythia's dijet process. The first one, which we denote as "truth-level" and used already in the previous sections, includes all non-perturbative effects. A "hadron-level" sample is obtained by removing multiple parton interactions from the simulation, and finally a "parton-level" sample is generated by further turning off the hadronization step in Pythia.
The jet mass spectrum for each sample is shown in figure 10, with results for ungroomed jets as well as after grooming with GroomRL-W, GroomRL-Top and RSD. One can see immediately that the ungroomed jet mass spectrum is strongly affected by non-perturbative effects, while groomed jets become much more robust to these contributions. For masses m > 50 GeV, both GroomRL models become very robust, showing a resilience to hadronization and underlying event similar to that of RSD. In the low mass range, GroomRL remains robust to multiple parton interactions, but starts to show some dependence on hadronization effects.
We note that no parton-level or hadron-level data was used in the training, such that one would not a priori expect the derived algorithm to be particularly resilient to these effects. Although GroomRL already performs surprisingly well, one could easily further improve the robustness of the model by including some of this data with a suitable modification of the reward function in the training of the DQN agent.

V. CONCLUSIONS
We have shown a promising application of RL to the issue of jet grooming. Using a carefully designed reward function, we have constructed a groomer from a dense NN trained with a DQN agent.
This grooming algorithm was then applied to a range of data samples, showing excellent results for the mass resolution of boosted heavy particles. In particular, while the training of the NN is performed on samples consisting of W (or top) jets, the groomer yields noticeable gains in the top (or W ) case as well, on data outside of the training range.
The improvements in resolution and background reduction compared to alternative state-of-the-art methods provide an encouraging demonstration of the relevance of machine learning for jet grooming. In particular, we showed that it is possible for an RL agent to extract the underlying physics of jet grooming and distill this knowledge into an efficient algorithm.
Due to its simplicity, the model we developed also retains most of the calculability of other existing methods such as Soft Drop. Accurate numerical computations of groomed jet observables are therefore achievable, allowing for the possibility of direct comparisons with data. Furthermore, given an appropriate sample, one could also attempt to train the grooming strategy on real data, bypassing some of the limitations due to the use of parton shower programs.
The GroomRL framework, available online [19], is generic and can easily be extended to higher-dimensional inputs, for example to consider multiple emissions per step or additional kinematic information. While the method presented in this article was applied to a specific problem in particle physics, we expect that with a suitable choice of reward function, this framework is in principle also applicable to a range of problems where a tree requires pruning.