Supervised Jet Clustering with Graph Neural Networks for Lorentz Boosted Bosons

Jet clustering is traditionally an unsupervised learning task because there is no unique way to associate hadronic final states with the quark and gluon degrees of freedom that generated them. However, for uncolored particles like $W$, $Z$, and Higgs bosons, it is possible to approximately (though not exactly) associate final state hadrons to their ancestor. By labeling simulated final state hadrons as descending from an uncolored particle, it is possible to train a supervised learning method to create boson jets. Such a method must operate on individual particles and identify connections between particles originating from the same uncolored particle. Graph neural networks are well-suited for this purpose as they can act on unordered sets and naturally create strong connections between particles with the same label. These networks are used to train a supervised jet clustering algorithm. The kinematic properties of these graph jets better match the properties of simulated Lorentz-boosted $W$ bosons. Furthermore, the graph jets contain more information for discriminating $W$ jets from generic quark jets. This work marks the beginning of a new exploration in jet physics to use machine learning to optimize the construction of jets and not only the observables computed from jet constituents.


Introduction
Lorentz-boosted massive bosons are a common feature of theories that extend the Standard Model (SM) of particle physics. In particular, new heavy particles introduced to solve one of the challenges with the SM may predominantly decay into bosons, and if there is a large mass hierarchy between the heavy particle and the bosons, the latter will be produced in the lab frame with a significant Lorentz boost. Singly produced bosons can also have a significant Lorentz boost when produced in association with initial state radiation. The ATLAS and CMS collaborations have performed extensive searches involving boosted bosons decaying hadronically in the $VV$ [1-4] [20-27]. These methods range from physically motivated features such as groomed jet mass [28-32], $N$-subjettiness [33,34] and $D_2$ [35] to complex observables built using machine learning [21]. ATLAS and CMS have integrated and extended these methods as well as studied them using collision data [36-40]. One feature that all of these algorithms have in common is that they start from a collection of constituents selected using a jet clustering algorithm. Various studies have investigated optimizing the jet clustering algorithm by considering many options [41-43]. While important for converging on a method in the traditional paradigm, these approaches are fundamentally limited by the discreteness of the algorithm types and the flexibility offered by the tunable parameters of a given algorithm.
The most common approach for forming the initial Lorentz-boosted boson candidate jets is the anti-$k_t$ algorithm [44]. This algorithm is a form of unsupervised learning because no per-particle labels are used to form the jets$^1$. Instead, a distance measure motivated by the fragmentation of quarks and gluons is used to collect constituents that were likely produced from the same initiating high-energy quark or gluon. This last statement does not have a precise meaning because quark and gluon jets are not well-defined objects [47,48]. Due to the strength of the strong force, the energy flows from outgoing quarks and gluons are interconnected with each other and with the beam remnants. In contrast, the quarks and gluons from color singlet massive bosons are isolated from the rest of the event. In the limit that the number of colors $N_c \to \infty$ or the width of the boson resonance $\Gamma \to 0$, there is a unique mapping between final state hadrons and the ancestor color singlet. The corrections to this picture are suppressed by at least $(1/N_c)^2$ ('color reconnection') and by powers of $\Gamma/\Lambda_{\mathrm{QCD}}$.
Given the approximate (but not exact) mapping between hadrons and color singlets, it makes sense to ask if one could construct a supervised approach to forming jets. In particular, a machine could be trained to label individual particles as originating from a color singlet or not based on the particle kinematic properties as well as the relationship with other particles in the event. While such an approach may give up the calculability afforded by algorithms like anti-$k_t$, it may provide an optimal approach to constructing jets for searches where calculability is not necessarily required. If the jets are constructed optimally, then their substructure should contain as much information as possible for identifying their origin. One could even co-optimize the jet construction and the jet classification in an end-to-end approach [49,50], but there are many benefits to first building jets, such as the jet energy calibration.
Modern machine learning has proven to be a powerful toolkit for jet substructure. For example, a wide range of architectures and applications have been studied for tagging the origin of jets. To construct a supervised jet clustering algorithm, a machine learning architecture is needed that can process variable-length sets as input. Multiple such point cloud methods have been studied for jet substructure [70,72,73,78,79,104], but the structure chosen here is the graph neural network (GNN) (see Refs. [70,72,73,79,104-107]). This is because GNNs not only can process variable-length sets, but can also label the relationship between elements (not unique to GNNs, but natural given their construction). This property is critical for labeling particles as originating from the color singlet ancestor or not. Labeling constituents is also known as semantic segmentation and has been studied for other tasks in high energy physics ranging from pileup particle identification [72,108] to liquid argon time projection chamber labeling [109,110]. In addition, a recent study [111] shows that GNNs can be executed with a latency of less than 1 µs on field-programmable gate arrays, making such networks very promising for real-time data learning and filtering.
This paper is organized as follows. Section 2 introduces the simulated samples used to train the supervised jet clustering algorithm, where Lorentz-boosted $W$ bosons provide a recurring example. The graph neural network methods are described in Section 3 and numerical results are presented in Section 4. The paper ends with outlook and conclusions in Section 5.
$^1$ Previous attempts at combining jet finding with unsupervised machine learning [45,46] do not have the benefits of the supervised approaches discussed here.

Simulation
Proton-proton collisions are simulated with Pythia 8.183 [112,113] at a center-of-mass energy of $\sqrt{s} = 13$ TeV. Lorentz-boosted $W$ bosons are generated from the decay of a hypothetical $W'$ boson with a mass of 600 GeV that decays 100% of the time to a $W$ boson and a $Z$ boson. The $W$ boson is forced to decay hadronically and the $Z$ boson decays into neutrinos. To simulate a quark jet with nearly the same kinematic properties, a hypothetical excited quark $q^*$ with a mass of 600 GeV is generated and decays 100% of the time into a quark and a $Z$ boson. This $Z$ boson is then forced to decay into neutrinos. The widths of the $W'$ boson, the $q^*$, and the $W$ boson are set to 0.01 GeV. In total, 100,000 $W'$ and $q^*$ events were generated.
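To make the size of the boost concrete, the following minimal sketch evaluates standard two-body decay kinematics for a 600 GeV parent decaying to a $W$ and a $Z$ boson. The boson masses used here (80.4 GeV and 91.2 GeV) are approximate values assumed for illustration, the function name is hypothetical, and the rule of thumb $\Delta R \approx 2m/p$ is only a rough estimate of the angular separation of the decay quarks. The parent is taken at rest, which ignores its own motion in the lab frame.

```python
import math


def two_body_momentum(m_parent, m1, m2):
    """Momentum of either daughter in the parent rest frame (natural units)."""
    a = m_parent ** 2 - (m1 + m2) ** 2
    b = m_parent ** 2 - (m1 - m2) ** 2
    if a < 0:
        raise ValueError("decay is kinematically forbidden")
    return math.sqrt(a * b) / (2.0 * m_parent)


M_PARENT = 600.0  # GeV, heavy resonance mass quoted in the text
M_W = 80.4        # GeV, approximate W mass (assumed value)
M_Z = 91.2        # GeV, approximate Z mass (assumed value)

p_w = two_body_momentum(M_PARENT, M_W, M_Z)  # W momentum in the parent rest frame
e_w = math.hypot(p_w, M_W)                   # W energy, E = sqrt(p^2 + m^2)
gamma = e_w / M_W                            # Lorentz factor of the W
delta_r = 2.0 * M_W / p_w                    # rough quark-quark separation

print(f"p_W = {p_w:.1f} GeV, gamma = {gamma:.2f}, delta R ~ {delta_r:.2f}")
```

Under these assumptions the $W$ momentum is a little under 300 GeV and the estimated separation is of order 0.5, which is comfortably inside a jet of radius 1.0.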
Because Pythia is a leading-$N_c$ generator, it is possible to uniquely trace final state hadrons to the $W$ boson. Individual final state hadrons are then labeled based on the existence (or not) of a $W$ boson in their ancestry from the event record. This is illustrated for one event in Fig. 1.
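The ancestry-based labeling can be sketched as a walk up the mother links of an event record. The `Particle` layout, the function names, and the toy record below are hypothetical and chosen only for illustration; they are not the Pythia interface, and the toy particle content is not meant to be physically complete.

```python
from dataclasses import dataclass, field

W_PDG_ID = 24  # PDG code of the W+ boson; the W- is -24


@dataclass
class Particle:
    """Toy event-record entry (hypothetical layout, not the Pythia API)."""
    pdg_id: int
    mothers: list = field(default_factory=list)  # indices of parent entries
    is_final: bool = False                       # True if detector-stable


def has_w_ancestor(index, record, cache):
    """Return True if any ancestor of record[index] is a W boson."""
    if index in cache:
        return cache[index]
    cache[index] = False  # provisional value, guards against malformed cycles
    result = any(
        abs(record[m].pdg_id) == W_PDG_ID or has_w_ancestor(m, record, cache)
        for m in record[index].mothers
    )
    cache[index] = result
    return result


def label_final_state(record):
    """Map each final-state index to 1 (W descendant) or 0 (anything else)."""
    cache = {}
    return {
        i: int(has_w_ancestor(i, record, cache))
        for i, particle in enumerate(record)
        if particle.is_final
    }


# Toy record: a W+ decays to two quarks that hadronize into two stable
# hadrons; a third stable hadron has no W boson in its ancestry.
record = [
    Particle(pdg_id=24),                                  # 0: W+
    Particle(pdg_id=4, mothers=[0]),                      # 1: c quark
    Particle(pdg_id=-3, mothers=[0]),                     # 2: anti-s quark
    Particle(pdg_id=321, mothers=[1, 2], is_final=True),  # 3: hadron from W
    Particle(pdg_id=130, mothers=[1, 2], is_final=True),  # 4: hadron from W
    Particle(pdg_id=211, mothers=[], is_final=True),      # 5: unrelated hadron
]
labels = label_final_state(record)
print(labels)
```

The cache avoids re-walking shared ancestors, which matters when many hadrons descend from the same string or cluster.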
Figure 1. An illustration of the $W \to cs$ decay tracing for a single event. At each step, every non-detector-stable particle is replaced with its immediate descendants from the Pythia event record. The order per row is arbitrary. (Row labels in the figure include "Intermediate Hadrons" and "Detector-stable Particles".)
To compare with the graph neural network-based clustering scheme described in the next section, jets are clustered using the anti-$k_t$ algorithm [44] with radius parameter $R = 1.0$ implemented in FastJet 3.0.3 [114,115]. Jets are only kept if they have $p_T > 100$ GeV. These jets are subsequently trimmed [30] by keeping only $R = 0.2$ subjets with at least 5% of the ungroomed jet's transverse momentum. Trimming is not the only jet grooming algorithm [28-32], but it is widely used (see e.g. Refs. [41,42]). Figure 2 presents histograms of basic quantities in $W'$ events. The number of detector-stable particles with a $W$ ancestor is about the same as the number of constituents inside the leading jet; however, it only accounts for about 10% of the total number of detector-stable particles in the event. The mass computed from the detector-stable particles originating from a $W$ boson is nearly exactly $m_W$, while the leading jet mass is peaked around $m_W$ with a broad width. On the other hand, the mass computed from all detector-stable particles in the event is far from the $W$ boson mass, making it non-trivial for a machine to reconstruct the $W$ boson mass. The low-mass peak corresponds to cases where both quarks from the $W$ decay are not mostly contained within the leading jet or the leading jet is unrelated to the quarks from the $W$ decay. Figure 3 shows that the kinematic properties of the jets in $W'$ and $q^*$ events are similar. The jet transverse momentum spectra are not identical because the radiation pattern outside of the jet cone is different for the color singlet $W$ and single color triplet quarks.
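For readers less familiar with the baseline, the following is a naive sketch of anti-$k_t$ sequential recombination using the standard distances $d_{ij} = \min(p_{T,i}^{-2}, p_{T,j}^{-2})\,\Delta R_{ij}^2/R^2$ and $d_{iB} = p_{T,i}^{-2}$. It is not the FastJet implementation used in the paper: it scales as $O(N^3)$, assumes inputs given as (E, px, py, pz) with non-zero transverse momentum, and omits the trimming step. Function names are illustrative.

```python
import math


def to_pt_y_phi(p):
    """Convert (E, px, py, pz) to (pT, rapidity, azimuth)."""
    e, px, py, pz = p
    return math.hypot(px, py), 0.5 * math.log((e + pz) / (e - pz)), math.atan2(py, px)


def antikt_cluster(particles, radius=1.0):
    """Naive anti-k_t clustering with four-momentum (E-scheme) recombination."""
    pseudojets = [tuple(p) for p in particles]
    jets = []
    while pseudojets:
        kin = [to_pt_y_phi(p) for p in pseudojets]
        # Start from the smallest beam distance d_iB = pT_i^-2.
        i_beam = min(range(len(kin)), key=lambda i: kin[i][0] ** -2)
        best_d, best_pair = kin[i_beam][0] ** -2, (i_beam, None)
        # Look for a pairwise distance d_ij that is smaller still.
        for i in range(len(kin)):
            for j in range(i + 1, len(kin)):
                dphi = abs(kin[i][2] - kin[j][2])
                if dphi > math.pi:
                    dphi = 2.0 * math.pi - dphi
                dr2 = (kin[i][1] - kin[j][1]) ** 2 + dphi ** 2
                dij = min(kin[i][0] ** -2, kin[j][0] ** -2) * dr2 / radius ** 2
                if dij < best_d:
                    best_d, best_pair = dij, (i, j)
        i, j = best_pair
        if j is None:
            # The beam distance wins: promote the pseudojet to a final jet.
            jets.append(pseudojets.pop(i))
        else:
            # Merge the pair by adding four-momenta (j > i, so pop j first).
            merged = tuple(a + b for a, b in zip(pseudojets[i], pseudojets[j]))
            pseudojets.pop(j)
            pseudojets.pop(i)
            pseudojets.append(merged)
    return sorted(jets, key=lambda p: math.hypot(p[1], p[2]), reverse=True)


# Toy event: two nearby massless particles and one recoiling particle.
event = [
    (100.0, 100.0, 0.0, 0.0),
    (50.0, 50.0 * math.cos(0.3), 50.0 * math.sin(0.3), 0.0),
    (30.0, -30.0, 0.0, 0.0),
]
jets = antikt_cluster(event, radius=1.0)
print(len(jets), jets[0][0])
```

In the toy event the two nearby particles merge into the leading jet and the recoiling particle forms a second jet; a $p_T$ threshold such as the 100 GeV requirement in the text would then be applied to the returned list.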

Graph Neural Network Methods
A graph contains a set of nodes, a set of edges each connecting a pair of nodes, and a set of node-, edge-, and graph-level attributes, collectively called graph attributes. Graph Neural Networks (GNNs) are trainable functions that operate on a graph to learn latent graph attributes as well as to form a parameterized message-passing scheme by which information is propagated across the graph, ultimately learning sophisticated graph attributes. Each collision event is represented as a fully connected bidirectional graph in which the nodes are the final state particles and the edges are the connections between all pairs of particles. The node-level attributes are the four-momenta of the particles, and the edge- and graph-level attributes are learned by a GNN. The GNN architecture is the same as the one in Ref. [71], which is based on the model in Ref. [116], and is composed of four trainable components: 1) a node encoder, which transforms the node-level attributes into their latent representations; 2) an edge encoder, which transforms the aggregated latent attributes of its neighbouring nodes into their latent representations; 3) an interaction network [117]; and 4) a decoder that computes graph- or edge-level classification scores. The encoders and the decoder use basic deep learning building blocks including multilayer perceptrons.
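The graph representation described above can be sketched in a few lines of NumPy. The dictionary layout, function names, and weight shapes are assumptions made for illustration, and the message-passing step is a generic single-layer update with random weights; it is not the architecture of Refs. [71,116,117].

```python
import numpy as np


def fully_connected_graph(four_momenta):
    """Build a fully connected, bidirectional graph from an (N, 4) array.

    Every ordered pair (i, j) with i != j becomes an edge, so each pair of
    particles is connected in both directions and there are no self-loops.
    """
    nodes = np.asarray(four_momenta, dtype=np.float64)
    n = nodes.shape[0]
    senders, receivers = np.nonzero(~np.eye(n, dtype=bool))
    return {"nodes": nodes, "senders": senders, "receivers": receivers}


def message_passing_step(graph, node_latent, w_edge, w_node):
    """One generic message-passing update (illustrative, single linear layers)."""
    s, r = graph["senders"], graph["receivers"]
    # Edge update: combine the latent features of the two endpoint nodes.
    edge_in = np.concatenate([node_latent[s], node_latent[r]], axis=1)
    edge_latent = np.tanh(edge_in @ w_edge)
    # Aggregation: sum the incoming edge messages at each receiver node.
    aggregated = np.zeros((node_latent.shape[0], edge_latent.shape[1]))
    np.add.at(aggregated, r, edge_latent)
    # Node update: combine each node with its aggregated messages.
    node_in = np.concatenate([node_latent, aggregated], axis=1)
    return np.tanh(node_in @ w_node), edge_latent


rng = np.random.default_rng(0)
momenta = rng.normal(size=(4, 4))          # stand-in for (E, px, py, pz) inputs
graph = fully_connected_graph(momenta)

w_edge = rng.normal(size=(8, 8))           # (2 * node_dim, edge_dim)
w_node = rng.normal(size=(12, 4))          # (node_dim + edge_dim, node_dim)
new_nodes, edges = message_passing_step(graph, graph["nodes"], w_edge, w_node)
print(graph["senders"].size, new_nodes.shape, edges.shape)
```

A graph with N nodes has N(N-1) directed edges in this representation, so the cost of each step grows quadratically with the particle multiplicity.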
The boosted $W$ boson is reconstructed by training a GNN, namely the edge classifier, to learn the relational information of the final state hadrons. Specifically, each edge in the simulated $W$ boson events is labeled as 1 if the two hadrons it connects come from the same $W$ boson and 0 otherwise. The edge classifier outputs edge-level classification scores, abbreviated as edge scores, which are compared with the edge labels using the binary cross-entropy loss. Trainable parameters in the classifier are optimized by the gradient-based stochastic optimizer Adam [118]. The reconstructed $W$ boson candidate for each event is built from the hadrons that are connected by edges with scores larger than 0.5. The four-momentum of the reconstructed $W$ boson candidate is the sum of the four-momenta of the selected hadrons. The edge classifier was trained with 90,000 simulated $W$ boson events and tested with 5,000 events.
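The edge labeling, loss, and candidate reconstruction described above can be sketched as follows. The (E, px, py, pz) ordering, the function names, and the toy edge scores are assumptions for illustration; in practice the scores would come from the trained edge classifier.

```python
import numpy as np


def edge_labels(node_labels, senders, receivers):
    """An edge is true (1) only if both endpoints descend from the W boson."""
    node_labels = np.asarray(node_labels)
    return (node_labels[senders] * node_labels[receivers]).astype(np.float64)


def binary_cross_entropy(scores, labels, eps=1e-7):
    """Mean binary cross-entropy between edge scores and edge labels."""
    s = np.clip(scores, eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(s) + (1.0 - labels) * np.log(1.0 - s)))


def reconstruct_candidate(four_momenta, senders, receivers, scores, threshold=0.5):
    """Sum the four-momenta of all hadrons attached to an edge above threshold."""
    keep = scores > threshold
    selected = np.unique(np.concatenate([senders[keep], receivers[keep]]))
    p_sum = np.asarray(four_momenta, dtype=np.float64)[selected].sum(axis=0)
    mass_sq = p_sum[0] ** 2 - np.sum(p_sum[1:] ** 2)
    return selected, p_sum, float(np.sqrt(max(mass_sq, 0.0)))


# Toy event: two massless hadrons from the W boson and one unrelated hadron.
momenta = np.array([
    [50.0, 30.0, 40.0, 0.0],
    [50.0, -30.0, -40.0, 0.0],
    [10.0, 0.0, 0.0, 10.0],
])
node_truth = np.array([1, 1, 0])
senders, receivers = np.nonzero(~np.eye(3, dtype=bool))

labels = edge_labels(node_truth, senders, receivers)
scores = np.where(labels == 1.0, 0.9, 0.1)  # stand-in for classifier output
loss = binary_cross_entropy(scores, labels)
selected, p_sum, mass = reconstruct_candidate(momenta, senders, receivers, scores)
print(selected, p_sum, mass, loss)
```

Because a hadron is selected as soon as one of its edges passes the threshold, a single high-scoring false edge can pull an unrelated hadron into the candidate, which is one reason the edge purity matters.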
The reconstructed $W$ boson candidates from the GNN-based edge classifier carry unique information which other machine learning architectures (or traditional jet substructure observables) can use in order to separate the $W$ boson events from background events, such as the $q^*$ events. In this study, another GNN with the same architecture as the edge classifier is used, namely the event classifier. The input graphs are the fully connected bidirectional graphs constructed from the hadrons selected by the trained edge classifier. The graph-level attributes are labeled as 1 for the $W$ boson events and 0 for the $q^*$ events. The event classifier outputs graph-level classification scores, abbreviated as event scores, which are compared with the graph labels using the binary cross-entropy loss. Trainable parameters in the classifier are optimized by the gradient-based stochastic optimizer Adam. The event classifier was trained with 90,000 $W$ boson events and 90,000 $q^*$ events, and tested with another 5,000 $W$ boson events and 5,000 $q^*$ events. As a comparison, the GNN is also trained with the inputs from the anti-$k_t$ algorithm. In this case, the input graphs are the fully connected bidirectional graphs constructed from the hadrons inside the leading jet, which in turn is constructed from the anti-$k_t$ algorithm. To facilitate the discussions below, the GNN trained with the inputs from the anti-$k_t$ algorithm is called tGNN while that trained with the inputs from the trained edge classifier is called eGNN. All training was performed on an NVIDIA V100 GPU.

Results
The edge classifier was trained for 30 epochs, after which no improvement was seen when the model was evaluated on the testing data. The performance of the edge classifier is shown in Figure 4. Two important metrics are the edge efficiency, defined as the ratio of the number of true edges passing the threshold to the total number of true edges, and the purity, defined as the ratio of the number of true edges passing the threshold to the total number of edges passing the threshold. Varying the threshold on the edge scores results in different values of edge efficiency and purity. Table 1 shows the edge efficiency and purity for three different thresholds on the edge scores. The nodes that are connected by the edges passing a threshold of 0.5 are considered to be the hadrons coming from the $W$ boson. The four-momentum of the reconstructed $W$ boson is the sum of the four-momenta of these surviving hadrons. Figure 5 compares the number of hadrons selected by the edge classifier and the anti-$k_t$ algorithm. On average, the edge classifier selects about 20% more hadrons than the anti-$k_t$ jet contains, disregarding the events with no anti-$k_t$ jet with $p_T > 100$ GeV. There are also many particles chosen by one algorithm but not the other. It will be interesting in the future to examine the properties of such particles to identify which features the GNN is learning differently than anti-$k_t$ (and vice versa). Furthermore, Figure 6 compares the kinematic distributions of the $W$ boson candidates reconstructed from hadrons selected by the edge classifier, by the anti-$k_t$ algorithm, or by the truth labels. About 3% of the time, there is no reconstructed jet with $p_T > 100$ GeV, which results in the spike at zero. In addition, the fractions of the reconstructed $W$ energy and mass relative to the true $W$ energy and mass are compared between the two methods in Figure 7.
In both cases, the GNN-based method significantly outperforms the anti-$k_t$-based method in reconstructing the boosted $W$ boson. For $q^*$ events, the jets from the GNN clustering are compared with those from the anti-$k_t$ algorithm in the Appendix (Fig. 9); there is a small tendency of the jet mass to be near the $W$ mass, but it is not as sharp as for $W'$ events.
The event classifiers were trained for 25 epochs for the tGNN and 15 epochs for the eGNN. In both cases, no improvement was seen after these epochs when the GNNs were evaluated on the testing data. Figure 8 shows a comparison of the receiver operating characteristic (ROC) curves of the two trained GNNs as well as the area under the ROC curve (AUC). The GNN trained with the inputs from the edge classifier outperforms the GNN trained with inputs from the traditional anti-$k_t$ algorithm by more than 40% in AUC.
Figure 8. Comparison of the ROC curves from the GNNs trained with the inputs from the anti-$k_t$-based jet clustering and the inputs from the trained edge classifier. Note that the small inefficiency from the $p_T$ requirement for the anti-$k_t$ jets is not included.
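The metrics used in this section can be computed directly from scores and truth labels. The sketch below follows the definitions of edge efficiency and purity given in the text and evaluates the AUC with the pairwise (Mann-Whitney) estimator; the function names, toy scores, and thresholds are illustrative assumptions and are not the values from Table 1 or Figure 8.

```python
import numpy as np


def edge_efficiency_and_purity(scores, labels, threshold):
    """Efficiency = true passing / all true; purity = true passing / all passing."""
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels).astype(bool)
    passed = scores > threshold
    true_passed = np.count_nonzero(passed & labels)
    n_true = np.count_nonzero(labels)
    n_passed = np.count_nonzero(passed)
    efficiency = true_passed / n_true if n_true else float("nan")
    purity = true_passed / n_passed if n_passed else float("nan")
    return efficiency, purity


def roc_auc(scores, labels):
    """AUC as the fraction of (signal, background) pairs ranked correctly.

    Ties count as one half. Memory scales with n_signal * n_background.
    """
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels).astype(bool)
    diff = scores[labels][:, None] - scores[~labels][None, :]
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)


# Toy edge scores: three true edges followed by three false edges.
edge_scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
edge_truth = [1, 1, 1, 0, 0, 0]
for threshold in (0.3, 0.5, 0.7):
    eff, pur = edge_efficiency_and_purity(edge_scores, edge_truth, threshold)
    print(f"threshold {threshold}: efficiency {eff:.2f}, purity {pur:.2f}")

# Toy event scores: three signal events followed by three background events.
event_scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]
event_truth = [1, 1, 1, 0, 0, 0]
auc = roc_auc(event_scores, event_truth)
print(f"AUC = {auc:.3f}")
```

Raising the threshold trades efficiency for purity, which is the behavior summarized for three working points in Table 1.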

Conclusions
Traditional jet clustering based on unsupervised learning has proven to be an effective tool for studying hadronic final states at the LHC. In particular, the widely used anti-$k_t$ algorithm is both theoretically and experimentally powerful for studying the SM and searching for physics beyond the SM. A wide variety of jet substructure techniques using these jets with and without machine learning are being developed and many have already been deployed in data analysis. However, there is a unique opportunity with color singlet decays to re-examine the construction of jets.
In particular, we have exploited the precise mapping between color singlet particles and final-state hadrons to construct a supervised jet clustering algorithm based on graph neural networks. These jets match the kinematic properties of true $W$ bosons more precisely than the leading anti-$k_t$ jet. Furthermore, we have shown that the graph network jets contain more information about the originating particle than anti-$k_t$ jets. In particular, a classifier trained using jet constituents to distinguish $W$ boson jets from quark jets is more effective for GNN jets than for anti-$k_t$ jets.
This work marks the beginning of a new exploration in jet physics to use machine learning to optimize the construction of jets and not only the observables computed from jet constituents. Tagging Lorentz-boosted color singlet jets is an integral part of measurement and search efforts at the LHC, and so further developments in this area have significant potential to enhance the sensitivity of the LHC physics program. A variety of further studies will be required to integrate supervised jets into the experimental workflow. In particular, future work will investigate how event topology affects GNN jets (i.e. what happens when there are more ($W$) jets in the event). Furthermore, it is important to study the impact of detector effects and to investigate how well such jets could be calibrated, including pileup stability.
The studies presented in this paper have only considered boosted $W$ bosons, but the same ideas could be applied to any color-singlet particles and it will be interesting to see how GNN jets can be integrated with additional information such as b-jet tagging in the case of Higgs bosons. Examining the structure of the supervised jets may also provide useful physical insight about where the information about the initiating particle is embedded in the event radiation pattern. Finally, it may be that the ultimate performance is achievable when supervised learning is combined with unsupervised techniques and this could lead to new insight for traditional quark and gluon jet reconstruction.
Figure 9. Comparisons of the four-momenta of the reconstructed jet for the $q^*$ events between the anti-$k_t$ jet clustering and the GNN jet clustering.