The Boosted Higgs Jet Reconstruction via Graph Neural Network

By representing each collider event as a point cloud, we adopt the Graph Convolutional Network (GCN) with focal loss to reconstruct the Higgs jet in the event. This method provides higher Higgs tagging efficiency and better reconstruction accuracy than traditional methods that use jet substructure information. The GCN, which is trained on events of the $H$+jets process, is capable of detecting a Higgs jet in events of several different processes, although the performance degrades when there are boosted heavy particles other than the Higgs in the event. We also demonstrate the signal and background discrimination capacity of the GCN by applying it to the $t\bar{t}$ process. Taking the outputs of the network as new features to complement the traditional jet substructure variables, the $t\bar{t}$ events can be separated further from the $H$+jets events.


I. INTRODUCTION
[…] tracker reconstruction [25,26], jet identification [19,27-32] and event classification [33,34]. In particular, the studies in Refs. [26,35] demonstrate that the GNN is capable of labelling the constituents of a specific jet after supervised training. However, the effects of pileup events are not considered in those studies, so that the particles to be labeled make up ∼10% of all particles. In practice, with an average of $n = 50$ pileup interactions, the total number of particles in an event can reach $O(10^4)$, whereas only $O(10^2)$ particles are labeled. This makes the training of the GNN very inefficient. In this work, we adopt the dynamic Graph Convolutional Network (GCN) [36] to reconstruct a boosted Higgs jet in any event, taking into account the pileup effects. The GCN is trained on events of the $H$+jets process (with $p_T(H) > 200$ GeV) overlaid with $n = 50$ pileup events. It is designed with focal loss [37] in order to train on the imbalanced dataset, in which the number of unlabeled particles is around two orders of magnitude larger than that of the labeled ones. We will demonstrate that the GCN is capable of detecting the Higgs jet efficiently in events of several other processes, provided that there are no boosted heavy particles other than the Higgs in the event. Moreover, we will show that the GCN method has better background discrimination power than the traditional jet substructure method.
The paper is organized as follows. In Sec. II, we introduce how the event samples are generated and preprocessed. Some concepts related to the network are briefly introduced in Sec. III. The performance of the network is discussed in Sec. IV. We provide conclusions and an outlook in Sec. V.

II. EVENT SIMULATION AND DATA PREPARATION
Events in the analysis are simulated within the MG5 aMC@NLO [38] framework. Pythia8 [39] is used to perform the parton shower, hadronization and hadron decays. Events of the $H$+jets process with the Higgs boson decaying into a pair of $b$-quarks are used for training and validating our GCN. The Higgs boson is required to have transverse momentum $p_T > 200$ GeV. Moreover, there are multiple proton-proton collisions (referred to as pileup) during a bunch crossing at the LHC. We adopt the A3 tune of Pythia8, with phenomenological parameters provided in Refs. [40,41], to simulate pileup events. In the preparation of event samples, the final states of each hard-process event are overlaid with the final states of pileup events, whose number follows a Poisson distribution with mean $\mu = 50$. This dramatically increases the difficulty of detecting the Higgs jet constituents with the GCN.
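The overlay step can be illustrated with a short sketch. It is not the simulation chain used in this work: the pool of pre-generated pileup events, the function names and the choice of Knuth's sampling method are all assumptions made for illustration.

```python
import math
import random

def sample_poisson(mean, rng):
    """Draw a Poisson-distributed integer with Knuth's multiplication method.

    This simple method is adequate for mean ~ 50 (exp(-50) is still
    representable as a float); it is an illustrative choice.
    """
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def overlay_pileup(hard_event, pileup_pool, mean_pileup=50, rng=None):
    """Append the particles of n_pu randomly chosen pileup events.

    hard_event:  list of particles of the hard process
    pileup_pool: hypothetical list of pre-generated pileup events,
                 each itself a list of particles
    Returns the merged particle list and the sampled n_pu.
    """
    rng = rng or random.Random()
    n_pu = sample_poisson(mean_pileup, rng)
    merged = list(hard_event)
    for _ in range(n_pu):
        merged.extend(rng.choice(pileup_pool))
    return merged, n_pu
```

Passing a seeded `random.Random` instance makes the overlay reproducible, which is convenient when the same hard events are to be compared with and without pileup.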
With the contamination of pileup events, the total number of final state particles for each Higgs production event can reach $O(10^4)$. However, we find that most of the Higgs constituents have relatively large transverse momenta. In order to reduce the complexity of the algorithm, we only consider the leading 1024 (or 2048) particles with the highest $p_T$ in each event. For the $H$+jets process with $p_T(H) > 200$ GeV, the lowest $p_T$ of the selected particles is $\sim 0.9$ GeV ($\sim 0.6$ GeV when selecting 2048 particles) for most of the events, so the selection also helps to alleviate the infrared effects of events. In the left panel of Fig. 1, we show the invariant mass distributions for the vector sum of the momenta of the selected Higgs constituents ($m^{\rm sum}_{\rm inv}$). We find a certain distortion (reduction) in the invariant mass distribution due to our selection. The distortion is slightly less severe in the case with 2048 particles. We will show later that this amount of distortion is acceptable, in the sense that the GCN can still perform better than the traditional jet substructure method. After the selection, the Higgs constituents still make up only a small fraction of the total particles. The right panel of Fig. 1 shows the distributions of the ratio between the number of unlabeled particles (particles not belonging to the Higgs jet) and that of the labeled ones (Higgs constituents). Because the typical number of Higgs constituents is 30-50, the ratio is $\sim 25$ (50) in the case of selecting 1024 (2048) particles. The GCN takes the pseudo-rapidity ($\eta$), azimuthal angle ($\phi$), and transverse momentum ($p_T$) of all particles in an event as the input features. We have also tried feeding the particle type (lepton, photon or hadron) and the particle electric charge to the GCN, without finding much improvement¹. Because the detector is a cylinder, the azimuthal angle ($\phi$, in the range $[0, 2\pi)$) is periodic. 
We find that the GCN achieves the best detection efficiency when $\phi$ is defined such that the Higgs constituents are not distributed across the line $\phi = 0$. In the training sample, this condition is fulfilled by applying two pre-processing procedures: (1) if the azimuthal angle of the Higgs boson is outside the range $[\pi/2, 3\pi/2]$, it is shifted by $\pm\pi$ in the $\phi$-coordinate to keep $\phi(H) \in [\pi/2, 3\pi/2]$; (2) the Higgs constituents whose angular distances to the Higgs boson ($\Delta R = \sqrt{\Delta y^2 + \Delta\phi^2}$, where $y$ is the rapidity) are larger than $\pi/2$ are not labeled. During testing, however, the angular position of the Higgs boson cannot be known in advance. In that case the condition can be fulfilled by defining $\phi$ such that the high-$p_T$ particles of an event are away from the line $\phi = 0$. In addition, the training sample is purified by dropping events with $m^{\rm sum}_{\rm inv}$ smaller than 115 GeV, in order not to confuse the network with events that do not include any featured Higgs jets (all events are used in the testing sample).
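The preparation of a training event described in this section (leading-$p_T$ selection, the $m^{\rm sum}_{\rm inv}$ purification and the $\phi$ shift) can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the results: particles are hypothetical `(pt, eta, phi, is_higgs)` tuples treated as massless, the whole event is assumed to be rotated together with the Higgs boson, and the un-labelling of constituents with $\Delta R > \pi/2$ is omitted for brevity.

```python
import math

TWO_PI = 2.0 * math.pi

def select_leading(particles, n_keep=1024):
    """Keep the n_keep highest-pT particles.

    Each particle is a hypothetical tuple (pt, eta, phi, is_higgs).
    """
    return sorted(particles, key=lambda p: p[0], reverse=True)[:n_keep]

def m_sum_inv(particles):
    """Invariant mass of the vector sum of the labeled (Higgs) particles.

    Particles are treated as massless, an assumption of this sketch.
    """
    px = py = pz = e = 0.0
    for pt, eta, phi, is_higgs in particles:
        if not is_higgs:
            continue
        px += pt * math.cos(phi)
        py += pt * math.sin(phi)
        pz += pt * math.sinh(eta)
        e += pt * math.cosh(eta)
    return math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))

def rotate_phi(particles, phi_higgs):
    """Rotate the event by pi if phi(H) lies outside [pi/2, 3pi/2]."""
    if math.pi / 2 <= phi_higgs <= 3 * math.pi / 2:
        return list(particles)
    return [(pt, eta, (phi + math.pi) % TWO_PI, h)
            for pt, eta, phi, h in particles]

def prepare_training_event(particles, phi_higgs, n_keep=1024, m_min=115.0):
    """Return the preprocessed event, or None if it fails the mass cut."""
    kept = select_leading(particles, n_keep)
    if m_sum_inv(kept) < m_min:
        return None
    return rotate_phi(kept, phi_higgs)
```

For test events, where $\phi(H)$ is unknown, the same `rotate_phi` helper could be driven by the azimuthal angle of the hardest particles instead; that choice is likewise an assumption of this sketch.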
Events of several processes different from the $H$+jets process are also simulated to test the generality of the network: 1) two Higgs bosons plus three QCD jets; 2) one Higgs boson plus a top quark pair; 3) a hypothetical Supersymmetric (SUSY) model process, $pp \to \tilde{t}_1^{*}\tilde{t}_1 \to \bar{t}\tilde{\chi}^0_1\, t\tilde{\chi}^0_2 \to \bar{t}\tilde{\chi}^0_1\, tH\tilde{\chi}^0_1$, with $m_{\tilde{\chi}^0_1} = 100$ GeV, $m_{\tilde{\chi}^0_2} = 800$ GeV and $m_{\tilde{t}_1} = 1$ TeV. The Higgs bosons in the first two processes are required to be boosted ($p_T > 200$ GeV) and to decay into two $b$-quarks. Each event of all those processes is overlaid with an average of 50 pileup events.

¹ The GCN is sensitive to the collinear splittings of input momenta. However, infrared effects in hadron-level events are already alleviated in Monte Carlo simulation, since the very soft and collinear gluons in parton emission are not resolved in an infrared safe fragmentation framework, such as the string model [42] in Pythia8. For example, the MC simulations of the charged-particle multiplicities at the LHC are found to match the data well [43]. (The charged particles are required to have $p_T > 500$ MeV to improve the infrared stability.) The number of charged particles in a jet is also used to distinguish quark and gluon jets in experiments [44].

III. GRAPH NEURAL NETWORK AND FOCAL LOSS
The performance of the network in jet detection depends on both the data representation and the network architecture. In this section, we will give a brief introduction to the setups of our GCN and explain how focal loss works.

A. Event Representation and Graph Neural Network
An event at the LHC corresponds to a collection of detected particles, with each particle assigned a four-momentum ($\eta$, $\phi$, $p_T$, $E$) and an identity². The point cloud, which provides a flexible representation for a collection of points in two- or three-dimensional space, can be naturally used for representing an event. Compared with the image representation of collider events, which requires a uniform granulation, the point cloud representation makes it easier to implement the different angular resolutions for different types of particles.
Treating the point cloud as a graph, the GCN is an efficient network for analyzing it. A graph consists of nodes and edges. For collider events, nodes correspond to particles, and edges are the connections between particles. Our GCN connects each node to its K-nearest-neighbours (KNN) in the feature space, with the squared distance defined as the sum of squared differences in all coordinates. In the first edge-convolution (EdgeConv) block, it is
$$ d_{ij}^2 = (\eta_i - \eta_j)^2 + (\phi_i - \phi_j)^2 . \qquad (3.1) $$
As for the second and third EdgeConv blocks, the feature vectors learned by the previous block are viewed as new coordinates of the nodes in the latent space. The neighbours of each node are thus dynamically updated by the EdgeConv operations.
In the edge-convolution blocks, the edge feature is calculated by [28,36]
$$ e_{ijm} = \mathrm{ReLU}\big(\theta_m \cdot (x_i \oplus (x_j - x_i))\big), \qquad (3.2) $$
where index $i$ runs over all particles in an event (1024 in our case), index $j$ runs over the $K$ neighbours of particle $i$, and index $m$ labels the convolutional filters. The direct sum $\oplus$ indicates that the feature vector of the $i$-th particle, $x_i$ (with dimension $C$), is concatenated with the vector difference between the feature vectors of its $j$-th neighbour and particle $i$. $\theta_m$ is a $2C$-dimensional vector of weights that are adjusted during training. The inner product between $\theta_m$ and the concatenated feature vector is passed to the Rectified Linear Unit (ReLU) activation function to give the final expression. An aggregation function over the $K$ neighbours is applied after obtaining the edge features:
$$ x'_{im} = \underset{j}{\mathrm{Agg}}\; e_{ijm} . \qquad (3.3) $$
The new feature $x'_{im}$ is passed to the next edge-convolution block.
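A single edge-convolution step, as described in the two paragraphs above, can be sketched in plain Python. This is a minimal illustration and not the TensorFlow implementation used in this work: the function names are ours, a node is assumed not to be its own neighbour, and the aggregation over neighbours is taken to be the maximum purely for illustration (it can be swapped through the `aggregate` argument).

```python
def knn_indices(coords, k):
    """For each point, the indices of its k nearest neighbours.

    The squared distance is the sum of squared coordinate differences.
    Excluding the point itself is an assumption of this sketch.
    """
    neighbours = []
    for i, ci in enumerate(coords):
        dist = [
            (sum((a - b) ** 2 for a, b in zip(ci, cj)), j)
            for j, cj in enumerate(coords) if j != i
        ]
        dist.sort()
        neighbours.append([j for _, j in dist[:k]])
    return neighbours

def edge_conv(features, coords, theta, k, aggregate=max):
    """One EdgeConv step.

    features:  list of C-dimensional feature vectors x_i
    coords:    list of coordinates used for the neighbour search
    theta:     list of M filters, each a 2C-dimensional weight vector
    aggregate: aggregation over the k neighbours (max is an assumption)
    Returns a list of M-dimensional feature vectors.
    """
    out = []
    for i, nbrs in enumerate(knn_indices(coords, k)):
        xi = features[i]
        new_xi = []
        for th in theta:
            edges = []
            for j in nbrs:
                # Concatenate x_i with the difference x_j - x_i.
                concat = list(xi) + [b - a for a, b in zip(xi, features[j])]
                z = sum(w * v for w, v in zip(th, concat))
                edges.append(max(z, 0.0))  # ReLU
            new_xi.append(aggregate(edges))
        out.append(new_xi)
    return out
```

To chain blocks dynamically, the output of one call would be passed as both `features` and `coords` of the next call, so that the neighbour search is repeated in the learned latent space.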
FIG. 2. Architecture of our GCN. The edge-convolution block is the same as the one proposed in Ref. [28].
The architecture of our GCN is illustrated in Fig. 2. The input coordinates are $\eta$ and $\phi$, and the input features are $\eta$, $\phi$ and $p_T$. The network contains three EdgeConv blocks, with a universal $K = 32$. Each EdgeConv block is implemented as three multilayer perceptrons, in the same way as the one proposed in Ref. [28]. The numbers of channels (filters) of these perceptrons are 32, 64 and 32, respectively. Two convolutional layers with kernel size [1,1] are used for the classification task. The first convolutional layer contains 32 kernels, and the second contains 2 kernels. To prevent overfitting, a dropout layer with a drop rate of 0.25 is inserted. The final classification scores for the input particles are assigned by the softmax function.
Note that a spatial transformer network (STN) [45] is usually inserted in front of the edge-convolution block to make the GCN invariant under translation, scaling and rotation of the input data. However, in our case, the coordinates $\eta$ and $\phi$ are not fully interchangeable, i.e., they have different distributions for given processes. We find that including the STN module in the GCN does not improve the Higgs jet detection efficiency, as long as the training sample is sufficiently large. So the STN module is not adopted in our GCN. Moreover, by taking $\eta$, $\phi$ and $p_T$ as input features for each particle (node), our GCN is sensitive to the position and the orientation of the graph, and each event can be associated with a graph without ambiguity.

B. Focal Loss
Focal loss was first introduced in Ref. [46] and is especially useful in object detection tasks. It addresses the large class imbalance between signal and background, and makes the model focus on the signal, which is far less abundant than the background. It is a modification of the cross entropy, given as
$$ L_{\rm FL} = -\alpha\, y\, (1-p)^{\gamma} \log p \;-\; (1-\alpha)\,(1-y)\, p^{\gamma} \log(1-p), \qquad (3.4) $$
where $p$ is the predicted probability of the positive class, and $y$ is the label, which is given beforehand in supervised learning. Compared with the cross entropy, the hyperparameter $\alpha$ is used to balance positive and negative samples, and the $(1-p)^{\gamma}$ and $p^{\gamma}$ terms suppress the loss contribution of well classified samples so that the loss function is dominated by the difficult ones. In our case, where the ratio between the number of unlabeled particles and the number of labeled ones is around 30-50, we find that setting $\alpha = 0.65$ (0.75) and $\gamma = 1$ for the $N_{\rm particle} = 1024$ (2048) samples works well.
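The effect of the two hyperparameters can be made concrete with a per-particle sketch of the loss. The code below assumes the standard binary form of the focal loss, in which $\alpha$ weights the positive class and $1-\alpha$ the negative class; the function names and the clipping constant `eps` are illustrative choices.

```python
import math

def cross_entropy(p, y, eps=1e-12):
    """Binary cross entropy for one particle."""
    p = min(max(p, eps), 1.0 - eps)
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

def focal_loss(p, y, alpha=0.65, gamma=1.0, eps=1e-12):
    """Binary focal loss for one particle.

    p: predicted probability of the positive (Higgs constituent) class
    y: true label, 1 for signal and 0 for background
    The defaults are the values quoted for the 1024-particle sample.
    """
    p = min(max(p, eps), 1.0 - eps)
    if y == 1:
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)
```

With $\gamma = 1$, a signal particle scored at $p = 0.9$ contributes only a fraction $\alpha(1-p) = 0.065$ of its cross-entropy loss, while one scored at $p = 0.1$ keeps a fraction $0.585$, so the poorly classified particles dominate the sum.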

IV. RESULT
Our GCN model is implemented with TensorFlow [47]. The model is trained on one million $H$+jets events with $p_T(H) > 200$ GeV. The learning rate is set to $10^{-3}$. In this section, we present some results obtained with the trained GCN.

A. Number of input particles
As discussed in Sec. II, we consider input datasets with $N_{\rm particle} = 1024$ and $N_{\rm particle} = 2048$ particles per event, respectively. We note that the GCN does not require a fixed input size, so a GCN trained on either the $N_{\rm particle} = 1024$ or the $N_{\rm particle} = 2048$ event sample can be tested on both samples. This gives four different applications in total. We provide the recall (number of correct signal/number of actual signal), accuracy (number of correct predictions/total number of predictions) and precision (number of correct signal/number of predicted signal) for those cases in Tab. I. It is not surprising that the recalls are the highest for the cases where the training and testing samples have the same input size. The accuracies are always high because most of the particles belong to the background and are easy to classify. Even though the event sample with larger $N_{\rm particle}$ provides a more complete list of Higgs constituents, the fraction of signal is significantly smaller (as shown in Fig. 1). As a result, the precision is higher (the detected signal is purer) when the GCN is tested on the event sample with $N_{\rm particle} = 1024$. In the following, we will only consider the first case.
TABLE I. The recall, accuracy and precision of the GCN that is trained on the $N_{\rm particle} = N_{\rm train}$ event sample and tested on the $N_{\rm particle} = N_{\rm test}$ event sample.
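The three per-particle figures of merit can be computed as in the following sketch, which uses the standard definitions (recall = correct signal / actual signal, precision = correct signal / predicted signal). The function name and the label encoding (1 for Higgs constituents, 0 otherwise) are illustrative assumptions.

```python
def classification_metrics(y_true, y_pred):
    """Return (recall, accuracy, precision) for binary particle labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return recall, accuracy, precision
```

The toy case of a classifier that labels every particle as background shows why the accuracy alone is not informative on such an imbalanced sample: with 2 signal particles out of 10, it reaches an accuracy of 0.8 with zero recall.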

B. Reconstructed kinematic variables
The aim of our GCN is to mark out the Higgs constituents (those with a classification score greater than 0.5) from a particle list with heavy pileup contamination. The four-momentum of the reconstructed Higgs boson candidate is given by the sum of the four-momenta of the marked hadrons. In most cases, our GCN works efficiently, as shown in the left panel of Fig. 3. In other cases, particles with relatively large transverse momenta that are far away from the Higgs in the $\eta$-$\phi$ plane are also mis-assigned to the Higgs, as shown in the right panel of Fig. 3. So after the detection by the GCN, we further require that the Higgs constituents lie within
$$ \Delta R < 1.5 \qquad (4.1) $$
of the reconstructed Higgs jet; otherwise the predicted Higgs constituents are unlabeled. To obtain an intuitive impression of the GCN performance, we compare it to a popular Higgs tagging method which is composed of Cambridge-Aachen (C/A) jet clustering [11], the mass-drop tagger [9] and trimming [12], denoted by MDT in the following. In this method, the final state particles are clustered with the C/A jet algorithm with an appropriate cone size parameter $R_0$ in order to capture most of the Higgs decay products. The resulting jets are referred to as fat jets. The mass-drop tagger uses two criteria to characterize the substructure of the fat jet: by undoing the jet clustering, there is at least one step which breaks the jet $j$ into two subjets $j_1$ and $j_2$ ($m_{j_1} > m_{j_2}$) such that the mass drop is significant ($m_{j_1} < 0.67\, m_j$) and the splitting is not too asymmetric […]. The trimming method selects the hard subjets inside a fat jet, in order to mitigate the pileup contamination. The cone size parameter of the jet clustering and the trimming parameters in the MDT analysis are optimized for each Higgs production process to achieve the highest reconstruction efficiency of the Higgs jet within the $125 \pm 5$ GeV mass window.
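The GCN-side reconstruction described at the beginning of this subsection (score threshold, momentum sum, angular separation requirement) can be sketched as below. The value $\Delta R < 1.5$ is the cut quoted for the GCN method in this paper. Treating particles as massless and recomputing the momentum sum after removing the distant particles are assumptions of this sketch, as are all function names.

```python
import math

def four_vector(pt, eta, phi):
    """(px, py, pz, E) of a particle treated as massless (sketch assumption)."""
    return (pt * math.cos(phi), pt * math.sin(phi),
            pt * math.sinh(eta), pt * math.cosh(eta))

def sum_p4(p4s):
    """Component-wise sum of a non-empty list of four-vectors."""
    return tuple(sum(c) for c in zip(*p4s))

def eta_phi(p4):
    """Pseudo-rapidity and azimuthal angle of a four-vector with pT > 0."""
    px, py, pz, _ = p4
    return math.asinh(pz / math.hypot(px, py)), math.atan2(py, px)

def reconstruct_higgs(particles, scores, threshold=0.5, max_dr=1.5):
    """Return (mass, pt, n_constituents) of the Higgs candidate, or None.

    particles: list of (pt, eta, phi); scores: GCN scores in [0, 1].
    """
    tagged = [p for p, s in zip(particles, scores) if s > threshold]
    if not tagged:
        return None
    eta0, phi0 = eta_phi(sum_p4([four_vector(*p) for p in tagged]))
    kept = []
    for pt, eta, phi in tagged:
        dphi = (phi - phi0 + math.pi) % (2.0 * math.pi) - math.pi
        if math.hypot(eta - eta0, dphi) < max_dr:
            kept.append(four_vector(pt, eta, phi))
    if not kept:
        return None
    px, py, pz, e = sum_p4(kept)
    mass = math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))
    return mass, math.hypot(px, py), len(kept)
```

Because the jet axis is computed before the cut, a single very energetic mis-assigned cluster can pull the axis towards itself and survive the requirement, which is the failure mode a fixed $\Delta R$ cut cannot remove.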
In Fig. 4, we plot the invariant mass distributions of the Higgs jets from the ground truth selection, the MDT analyses and the GCN analyses, respectively. The ground truth selection corresponds to selecting the Higgs constituents that belong to the 1024 highest-$p_T$ particles of each event. The Higgs jet momentum is given by the vector sum of the momenta of all selected Higgs constituents. In the figure, the height of the histogram for the ground truth Higgs invariant mass is scaled by a factor of 0.4 for visibility. In the MDT analyses, $R_0 = 1.$[…] trimmed by re-clustering the constituents into $R_{\rm sub} = 0.20$ (0.19) $k_t$-subjets and discarding those with $p_T^{\rm subjet} < f_{\rm cut}\, p_T^{\rm jet}$, where $f_{\rm cut} = 0.05$. Among those jets with substructure, the Higgs jet candidate is selected either as the leading-$p_T$ trimmed fat jet (denoted by MDT Lead, which is oversimplified but realistic) or as the one closest (in terms of the angular separation $\sqrt{\Delta\eta^2 + \Delta\phi^2}$) to the true Higgs boson (denoted by MDT Close, which corresponds to the ideal case that can be approached by using a sophisticated Higgs tagging method). The performances of the GCN both with and without the angular separation cut (Eq. 4.1) are shown. We find that the GCN method with the angular separation cut performs the best, in the sense that it has the largest number of events with a reconstructed Higgs jet invariant mass close to 125 GeV. In particular, when the Higgs decay products are well separated, the Higgs constituents cannot be fully captured by the C/A jet algorithm in the MDT Lead and MDT Close analyses. This leads to the peaks at 10 GeV in the distributions of the reconstructed Higgs invariant mass in these two methods. The problem is less severe in the GCN methods. Comparing the left and right panels of Fig. 4, because the higher transverse momentum of the Higgs leads to more remarkable Higgs jet features, the efficiency and the precision of the Higgs reconstruction are improved in all methods³, and the fake peak at low invariant mass is greatly suppressed in the event sample with Higgs $p_T > 300$ GeV.
The accuracies of the Higgs momentum reconstruction from the MDT Close method and from the GCN method with the $\Delta R$ limit are shown in Fig. 5, in terms of two-dimensional distributions on the $\Delta m/m$ vs. $\Delta p_T/p_T$ plane and on the $\Delta\eta$ vs. $\Delta\phi$ plane, where $m = 125$ GeV and $p_T$ is the Higgs boson transverse momentum at parton level; $\Delta m$, $\Delta p_T$, $\Delta\eta$ and $\Delta\phi$ are the deviations between the reconstructed Higgs momentum and the truth-level Higgs boson momentum. Two event samples of the $H$+jets process, with $p_T(H) > 200$ GeV and $p_T(H) > 300$ GeV, are analyzed for illustration. The different shades of gray regions and the black contours indicate 20%, 40% and 60% of events, respectively. The closer they are to the center, the higher the accuracy. The GCN method outperforms the MDT method in the reconstruction of kinematic variables except for the transverse momentum, where a systematic excess is prominent. This is because the pileup effects are mitigated by trimming in the MDT method, while no such pileup mitigation procedure is applied in the GCN method. Since we use the vector sum of the momenta of the Higgs constituents as the reconstructed Higgs momentum, the transverse momentum of the Higgs jet can be very sensitive to mis-assigned particles from pileup events, leading to an overestimated Higgs $p_T$. Using the GCN method to reconstruct the Higgs jet with $p_T > 200$ GeV, more than half of the events can be reconstructed with […]

C. Applications to different processes
Even though the GCN is trained on the events of the H+jets process, it may serve as a general Higgs tagger for many other processes. For illustration, we perform the tests of the GCN on three processes different from the one used for training. Each process contains at least one Higgs jet in the final state. They are two Higgs plus jets process, Higgs plus a top quark pair process, and a SUSY process. The details of their simulations have been discussed in Sec. II.
In Fig. 6, we present the Higgs jet invariant mass distributions obtained by the MDT methods and the GCN methods for these three processes. As before, the ground truth distributions are superposed. For the two Higgs plus jets process, the Higgs jet candidate can be selected without ambiguity in the GCN method and the MDT Lead method. Unlike the image based method proposed in Ref. [22], which can detect multiple Higgs jets in an event, the GCN is only capable of detecting one Higgs jet per event. According to the left panel of Fig. 6, the GCN tends to find the Higgs with the more remarkable substructure in the Higgs pair production process. Thus, the peak of the invariant mass distribution at $\sim 125$ GeV is sharper than the one shown in the left panel of […]. The performance of the GCN degrades in the second process due to the existence of energetic top quark decay products. In the third process, one of the top quarks is boosted, producing clusters of particles with high transverse momenta. These can easily be mis-assigned as Higgs constituents by the GCN (as in the right panel of Fig. 3, the GCN tends to assign high classification scores to the energetic particles within clusters), and they are too energetic to be removed by the angular separation cut $\Delta R < 1.5$. As a result, the Higgs detection of the GCN method is not successful in most of the cases for this process.
In practice, if the performance of our GCN on a given target process is not satisfactory, one can improve it by transfer learning: initialize the model with our trained weights, then train it on events of the target process with a lower learning rate.

D. Signal and Background Discrimination
To study the background rejection capability of the GCN, we carry out a comparison study of the performances of our GCN on the $H$+jets process with $p_T(H) > 200$ (300) GeV and on the top quark pair production process. In the simulation of the $t\bar{t}$ events, the parton level $t\bar{t}$ production is matched to the parton shower with up to two additional jets, and each event is overlaid with an average of 50 pileup events.
The finite momentum resolutions are considered by smearing each component of a momentum with a Gaussian distribution. The standard deviations (using the parameters of the ATLAS detector [48,49]) for different particles are listed in Tab. III.
We adopt the Boosted Decision Tree (BDT) technique to obtain the combined discriminating power of several variables. The variables are classified into two categories:
• Jet substructure variables for the Higgs jet candidate⁵.
• Variables reconstructed from the scores that the GCN assigns to Higgs constituents.
The first category includes the N-subjettiness variables [50] $\tau_{21} = \tau_2/\tau_1$ and $\tau_{32} = \tau_3/\tau_2$, with $\tau_N$ being defined as
$$ \tau_N = \frac{\sum_k p_{T,k}\, \min\{\Delta R_{1,k}, \Delta R_{2,k}, \dots, \Delta R_{N,k}\}}{\sum_k p_{T,k}\, R_0}, $$
where the summations run over all Higgs jet constituents, $R_0$ is the cone size parameter in the original jet clustering algorithm, and $\Delta R_{I,k}$ denotes the angular distance ($\sqrt{\Delta\eta^2 + \Delta\phi^2}$) between the subjet $I$ and the jet constituent $k$. There are also variables that characterize the jet profile: […] The second category includes the average $\bar{S}$ and the standard deviation $\sigma_S$ of the scores that are assigned to the constituents of the Higgs jet by the GCN. Moreover, we define a $p_T$ weighted score $S_w$ for the Higgs jet candidate:
$$ S_w = \frac{\sum_i s_i\, p_{T,i}}{\sum_i p_{T,i}}, $$
where $i$ runs over all Higgs constituents; $s_i$ and $p_{T,i}$ are the GCN score and the transverse momentum of the $i$-th constituent. The distribution of $S_w$ is presented in the left panel of Fig. 7. Only events which contain a GCN tagged Higgs jet are used. This requirement retains 96% of the simulated signal events and 87% of the simulated background events. The Higgs jet candidates in events of the signal process tend to have higher $S_w$ than those in events of the background process, which makes this variable useful for signal and background discrimination. Besides the variables in those two categories, the transverse momentum $p_T(H)$ and the invariant mass $m_H$ of the reconstructed Higgs are also powerful discriminating variables. In the middle and right panels of Fig. 7, we plot the distributions of the reconstructed Higgs invariant mass for the signal ($H$+jets with $p_T(H) > 200$ GeV) and background ($t\bar{t}$) processes, taking into account the detector effects. For the signal process, compared with the left panel of Fig. 4, we find that the detector effects broaden the peaks at 125 GeV in all methods. The GCN method still outperforms the MDT method after taking into account the detector effects. 
The fake Higgs jet invariant mass distributions for the background are typically concentrated around 50-100 GeV in the MDT and GCN analyses, and the cut $\Delta R < 1.5$ in the GCN method is found to be effective in suppressing the distribution in the large-$m_H$ region.
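The score-based variables of the second category can be computed from the tagged constituents as in the sketch below. The normalization of the $p_T$ weighted score by the total constituent $p_T$ and the use of the population standard deviation are assumptions of this illustration, and the function name is hypothetical.

```python
import math

def score_features(scores, pts):
    """Return (mean score, score standard deviation, pT weighted score).

    scores: GCN scores of the constituents of the Higgs jet candidate
    pts:    transverse momenta of the same constituents (sum must be > 0)
    """
    n = len(scores)
    mean = sum(scores) / n
    # Population standard deviation (an assumption of this sketch).
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    # pT weighted score, normalized to the total pT (assumed form).
    s_w = sum(s * pt for s, pt in zip(scores, pts)) / sum(pts)
    return mean, std, s_w
```

For two constituents with scores 0.9 and 0.6 and transverse momenta of 30 and 10 GeV, the plain average is 0.75 while the weighted score is 0.825, since the harder constituent carries the higher score.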
In the following, the BDT analysis that takes as input the variables in the first category, as well as $p_T(H)$ and $m_H$ calculated on the MDT Lead tagged Higgs jet, will be referred to as the MDT analysis. The BDT analysis that takes as input the variables in both categories, as well as $p_T(H)$ and $m_H$ calculated on the GCN tagged Higgs jet, will be referred to as the GCN analysis. The receiver operating characteristic (ROC) curves for these two BDT analyses, applied to two signal samples (with $p_T(H) > 200$ GeV and $p_T(H) > 300$ GeV, respectively), are plotted in the left panel of Fig. 8. The solid and dashed lines correspond to the results with and without considering the finite momentum resolutions, respectively. We find that the GCN analysis has better signal and background discriminating power than the MDT analysis for both signal samples. Moreover, the results obtained by the GCN analyses are less sensitive to the momentum resolution than the results of the MDT analyses. The significance improvement curves are presented in the right panel of Fig. 8. They show that the signal significances improve most when the signal selection efficiency is $\epsilon_S \sim 0.2$. The GCN analysis outperforms the MDT analysis by a factor of 1.5 in signal significance for both signals.
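The ROC and significance improvement curves can be obtained from the BDT outputs by a threshold scan, as sketched below. The significance improvement is taken here in its commonly used form $\epsilon_S/\sqrt{\epsilon_B}$, which is an assumption of this sketch, as are the function names and the toy inputs.

```python
import math

def roc_points(sig_scores, bkg_scores):
    """Scan thresholds and return a list of (threshold, eff_s, eff_b).

    An event passes when its classifier score is >= threshold.
    """
    thresholds = sorted(set(sig_scores) | set(bkg_scores), reverse=True)
    points = []
    for t in thresholds:
        eff_s = sum(s >= t for s in sig_scores) / len(sig_scores)
        eff_b = sum(b >= t for b in bkg_scores) / len(bkg_scores)
        points.append((t, eff_s, eff_b))
    return points

def significance_improvement(eff_s, eff_b):
    """eff_s / sqrt(eff_b); infinite when no background survives."""
    return eff_s / math.sqrt(eff_b) if eff_b > 0 else float("inf")
```

A value above 1 means that the cut increases $S/\sqrt{B}$ relative to applying no cut at all, so the maximum of this curve indicates the most useful working point.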

V. CONCLUSION AND OUTLOOK
Representing each collider event as a point cloud, we adopt the GCN with focal loss to reconstruct the Higgs jet in the event. The $H$+jets process with the cut $p_T(H) > 200$ GeV is taken as the benchmark process, on events of which the GCN is trained. We find that the GCN analysis outperforms the traditional jet substructure analysis in both Higgs tagging efficiency and momentum reconstruction accuracy, and that the GCN method is less sensitive to pileup contamination. In the GCN method without dedicated pileup mitigation, 60% of $H$+jets events with the cut $p_T > 200$ (300) GeV can be reconstructed with $|\Delta\phi| \sim |\Delta\eta| < 6\times10^{-2}$ (2.[…]). The GCN trained on events of the $H$+jets process is also capable of reconstructing the Higgs jet in events of other processes, even though its performance is degraded when there are boosted particles other than the Higgs boson in an event. Finally, we show that the features learned by the GCN are complementary to jet substructure variables in separating events of the $H$+jets process from those of the $t\bar{t}$ process. There are several limitations of the current GCN method: (1) The method can only reconstruct a single Higgs jet per event (even in events with multiple Higgs bosons); (2) Although we try to propose a general Higgs jet detection method which does not depend on the Higgs production process, it turns out that the current method is not efficient in detecting the Higgs jet in processes where the Higgs is accompanied by other energetic particles. Since events of those processes are not used in training, modifying the network and the loss function cannot help to improve the performance. On the other hand, if the target process has been set, we can always improve the performance by transfer learning (probably with a modified loss); (3) The $b$-tagging, which is a very important step towards Higgs tagging, is not considered in our study. These subjects will be studied in future works.