Probing triple Higgs coupling with machine learning at the LHC

Measuring the triple Higgs coupling is a crucial task in the LHC and future collider experiments. We apply the Message Passing Neural Network (MPNN) to the study of the non-resonant Higgs pair production process $pp \to hh$ in the final state with $2b + 2\ell + E_{\rm T}^{\rm miss}$ at the LHC. Although the MPNN can improve the signal significance, it is still challenging to observe such a process at the LHC. We find that a $2\sigma$ upper bound (including a 10\% systematic uncertainty) on the production cross section of the Higgs pair is 3.7 times the predicted SM cross section at the LHC with the luminosity of 3000 fb$^{-1}$, which will limit the triple Higgs coupling to the range of $[-3,11.5]$.

The discovery of a 125 GeV Higgs boson [1,2] is a great leap in the quest to the origin of mass. The precision measurement of the Higgs couplings is one of the primary goals of the LHC experiment, which will further reveal the electroweak symmetry breaking mechanism and shed lights on the new physics beyond the Standard Model (SM). Although the current measurements of the Higgs couplings with fermions and gauge bosons are compatible with that predicted by the SM, testing the triple and quartic Higgs self-interactions is rather challenging at the LHC (for recent reviews, see e.g., [3][4][5][6][7][8][9][10][11][12][13][14][15][16][17]).
In the Brout-Englert-Higgs mechanism of electroweak symmetry breaking [18][19][20][21][22], the Higgs boson is a massive scalar with self-interactions. The Higgs self-couplings are determined by the structure of the scalar potential, The values of λ 3 and λ 4 are measured via the double and triple Higgs production processes, respectively. In many extensions of the SM, these couplings can be altered by Higgs mixing effects or higher order corrections induced by new particles, such as Two Higgs Doublet Model [23][24][25] and (Next-to-)Minimal Supersymmetric Standard Model [26][27][28][29]. Since the Higgs self-coupling plays an important role in vacuum stability [30] and electroweak baryogenesis [31,32], measuring the Higgs self-coupling will provide a crucial clue to new physics [33].
The triple Higgs coupling can be indirectly probed by using the loop effects in some observables, for example, the single Higgs production [34][35][36], and the electroweak precision observables [37]. With 80 fb −1 of LHC Run-2 data, the triple Higgs coupling has been constrained in the range −3.2 < λ 3 /λ SM 3 < 11.9 at 95% C.L. [38]. On the other hand, the Higgs pair production provides a direct way to measure the triple Higgs coupling at the LHC. Such a production is dominated by the gluon-gluon fusion process, which has two main contributions: one is from the triangle diagram induced by the triple Higgs coupling, and the other is from the box diagram mediated by the top quark, as shown in Fig. 1.
It should be noted that these two amplitudes interfere destructively, and thus results in a small cross section of 38.65 fb for the production process gg → hh at 14 TeV LHC, which is computed at next-to-next-to-next-to-leading order (N 3 LO) and including finite top quark mass effects [39]. The new physics effects that can significantly modify the Higgs pair production have been intensively studied at the LHC (see, for examples [40][41][42][43][44][45][46][47][48][49][50][51] and references therein).
Among these channels, the process of hh → 4b has the largest branching ratio, while the process of hh → bbγγ has a more promising sensitivity because of the low backgrounds.
In this paper, we focus on the Higgs pair production at the HL-LHC with 3ab −1 lumi-  [75] to explore the potential of observing such di-Higgs events through the channel pp → hh → bbW W * .
In addition to the conventional kinematic cut-flow analyses, the machine learning methods have been proposed to accelerate the discovery of new physics . The MPNN framework inherits the generality and powerfulness of Graph Neutral Network (GNN) [111,112]. It abstracts the commonalities between several of the most popular models for graph-structured data, such as spectral approaches [113][114][115] and non-spectral approaches [116] in graph convolution, gated graph neural networks [117], interaction networks [118], molecular graph convolutions [119], deep tensor neural networks [120] and so on [121]. In the MPNN, a collision event is represented as a numerical geometrical graph formed by a number of final state objects, which are non-linear models with a bunch of parameters that relates the output to the input graphs. The supervised learning is used to find optimized parameters, and will help to recognize the pattern in the collision events efficiently. Different from DNN, MPNN is a dynamic neural network and independent of the number and ordering of final state particles. Therefore, the MPNN is suitable for processing the graph representation of collision event. Recently, this method has been successfully applied to collider phenomenological studies, such as jet physics [122], Higgs physics [123] and supersymmetry [124].
This paper is organized as follows. In Section II, we describe the event generation and reconstruction for the signal and backgrounds. Next, in Section III, we illustrate the event graph and network architecture for the MPNN approach. In Section IV, we present numerical results and discussions. Finally, we draw our conclusions in Section V.
The signal cross section is normalized to the next-to-next-to-leading-order (NNLO) accuracy in QCD [127], that is σ gg→hh = 40.7 fb. The main background tt cross section is normalized to the NNLO QCD value 953.6 pb [128]. Along with the signal and tt, all other backgrounds and their normalized cross sections are listed in Table I.
We generate the low-Q 2 soft QCD pile-up events and apply hadronization via package In this work, we follow the default parametrization of Delphes ATLAS card to perform the pile-up subtraction and use the spatial vertex resolution parameter |z| to perform charged pile-up subtraction. We consider every charged particle originating from a reconstructed vertex with |z| > 0.01 cm as coming from pile-up events and only keep those tracks that pass through the TrackPileUpSubtractor in Delphes.
Similar to the tracks, the reconstructed jets are supposed to corrected from low-Q 2 pile-up events containing neutral particles. Jet pile-up subtraction is done via the JetPileUpSubtractor module that takes as input the jet constitutes and pile-up density ρ based on the jet area. This technique helps to correct the jet momenta by calculating pile-up density (ρ) and jet area. Jets are clustered with the calorimeter tower elements using Fastjet 3.3.2 [131] with anti-k T jet algorithm [132], jet radius R = 0.4 with p T > 20GeV, and we allow the default estimation of ρ with the calorimeter towers. As for the pile-up subtraction of missingET, we calculate it based on the pile-up subtracted jets, photons and leptons.
The Delphes card for ATLAS detector simulation is modified as : • Jets, including b-jets, with p T (j) > 20 GeV and |η j | < 2.5 are selected.
• Flat b-tagging efficiency is b→b = 0.75, mis-tagging efficiency for c quark as b is c→b = 0.1, and mis-tagging rates of other jets are j→b = 0.01 [133].
• Maximum transverse momenta ratio for lepton isolation is set as i =e p T i p T < 0.15, where the sum is taken over the transverse momenta p T i of all final state particles i, i = , with p T i > 0.5 GeV and within angular distance ∆R i < 0.3 with lepton candidate .
• Isolation of photons also require i =γ p T i p T γ < 0.12 for particles i, without including γ, with p T i > 0.5 GeV and within angular distance ∆R iγ < 0.3 with photon candidate γ. Photons are required to have p T (γ) > 25 GeV and |η γ | < 2.5 to be selected.
After the reconstruction, the missing transverse momentum E miss T is defined as the negative vector sum of the transverse momenta of the accepted photons, leptons and jets, and unused tracks as in [134]: where the tracks with p T > 0.4 GeV and |η| < 2.5 are considered.
We further apply the following cuts to reduce background events sufficiently relevant to the signals: • The two leading jets must be b-tagged, each with p T > 30 GeV.
• Exactly two opposite sign leptons, each with p T > 25 GeV.
• Modulus of E miss T is required to be E miss T > 20 GeV.
• Angular distances for two leptons and for two b jets are ∆R < 1.0 and ∆R bb < 1.3, respectively.
• Invariant masses for two leptons and for two b jets respectively are m < 65 GeV and 95 GeV < m bb < 140 GeV.
We export only the four momenta (also contain the corresponding charge signs of leptons and b-jet tagging information) of those events which passed the above cuts for later network training.

III. EVENT GRAPH AND NETWORK ARCHITECTURE
Each collider event obtained in the preceding section is converted to an event graph as the input for our neural network. Fig. 2  Due to the rotation invariance of the differential cross section of the collider events around the beam axis, we can get rid of the information of azimuthal angle dependence of the event from the node features, and the difference of azimuthal angles is encoded in edge weights.
This will make sure that the classification is not dependent on the definite azimuthal angle of the final states of an event, and stable w.r.t. the rotation of the event around the beam axis.
The other two advantages of such an event graph design are: (1) The number of nodes equal to the number of final state objects, i.e., number of nodes is not fixed, which guarantees to use full information of final state objects; 1 (2) The node features and edge weights are easily transformed by the four momenta of the object, no sophisticated discriminants are needed to be constructed, which makes the model quite general and easy to implement to other scenarios as well.
The structure of our MPNN is shown in Fig. 3, which consists of one embedding layer, N message passing layers and one output layer. The embedding transformation for input data is given by, where W 0 m and b 0 m are learnable weight and bias vectors, and the activation function ReLU is the rectified linear unit [135]. The dimension of m 0 i is higher than x i . It can be seen that i-th node m 0 i in the embedding layer is a vector which only contains information from input feature x i without including any geometrical pattern of event graph. Then, the i-th node in the n-th message passing layer is obtained by the following transformation, where i and j are indices of nodes, s n i is intermediate vector, + ○ represents vector concatenation, W s and bs are learnable weights and biases. The message passing process is realized by two sub-processes: first, Eq. 6 collects information from all previous nodes and distances between nodes; second, Eq. 7 passes this information together with previous node to the next one. By repeating this process, each note in the message passing layer gets knowledge of other nodes and relationships between them and updates itself. Therefore, the messagepassing mechanism is the key for automatically extracting features of the input event graph, which efficiently disseminates the information among all the nodes taking into account the connections between nodes. After N iterations, each node state vector can be viewed as an encoding of the whole event graph representing the whole information of both the kinematic features of all final states and the geometrical relationship between them. Here, we expand edge weight d ij onto 21 Gaussian bases to make it more suitable for linear transformation [124], and the k-th component of this weight vector is where µ k is linearly distributed in range of [0, 5] and σ = 0.25. Such an expansion is inspired by radial basis function networks [136,137] which can solve non-linear problems by mapping input into high dimensions.
At the output layer, we use the sigmoid function on the vector m N i to get the probability p i of the node i as and then average the probabilities from all nodes at the output layer by with V being the number of nodes in the input event which is the number of final state particles in an event. It should be mentioned that V is not a constant, e. g., if there are two extra light jets and one photon in an event apart from the required two b-jets, two leptons and one MET, then we have V = 7.
The MPNN can be efficiently trained using supervised learning method. We adopt binarycross-entropy as the loss function. Although increasing the number of hidden layers can enable the network to lean more complex features in the data, it may have disadvantages like overfitting and time-consuming. We find that for our network N =3 is the most optimal choice 2 . W 0 m , W n m , W n s , and W p s in Eq. (5 -7, 9) are 30×7, 30×51, 30×60 and 1×30 matrices, respectively. Thus, the overall number of learnable parameters in our MPNN model  Table. I).

IV. RESULTS AND DISCUSSIONS
In order to estimate the observability of the signal, we calculate the signal significance (α) with the following formula, where S and B denote number of signal and background events after our selections, respectively. L is the integrated luminosity of the collider. It should be mentioned that the main  At the last row of the   each method. the signal significance given by MPNN is about 1.12 σ.
Finally, we apply our method to constrain the production cross section of the Higgs pair and the Higgs trilinear coupling in the BSM at 14 TeV LHC. We adopt the modelindependent way to present the 2σ limits on the ratio of σ hh /σ SM hh in the left panel of Fig. 5, where we take the systematic uncertainty β = 0, 10%, 20%, 30% for example. It can bee seen that the production cross section of the Higgs pair larger than 13.5 times of the SM prediction can be excluded for the luminosity L = 139 fb −1 and systematic error β = 30%.
If β can be controlled at 10%, the 2σ upper bound on the ratio of σ hh /σ SM hh will be reduced to 9.5. Such results can be improved to be 10.2 for β = 30% and 3.7 for β = 10% at the HL-LHC. Provided β = 0, this limit on σ hh /σ SM hh will become 1.5. Besides, we reinterpret these bounds for triple Higgs coupling in the right panel of Fig. 5. We find that the ratio of λ 3h /λ SM 3h can be constrained to the range of [−10, 18] for L = 139 fb −1 and β = 30%, and will be further narrowed down to the range of [−3, 11.5] for L = 3000 fb −1 and β = 10% at 2σ level.

V. CONCLUSIONS
In this paper, we explored the discovery potential of Higgs pair production process pp → hh → bbW W * → 2b+2 +E miss T with the Message Passing Neural Network at the (HL-)LHC.
In the MPNN, we can represent each collision event as an event graph that consists of the final state objects, and use the supervised learning to optimize training parameters. By using the MPNN, we obtained that the significance of the SM Higgs pair production process can reach the maximum of about 1.5σ at the HL-LHC. Then, we extended our study to constrain the production cross section of the non-resonant Higgs pair and the triple Higgs trilinear coupling in a model-independent way. We found that the production cross section of the Higgs pair larger than 10.2 times of the SM prediction can be excluded at 2σ level for the HL-LHC when a 30% systematic uncertainty is included. If the systematic error can be well controlled, such as 10%, this upper bound can be improved to 3.7 times of the predicted by the SM, which will constrain the triple Higgs coupling to the range of [−3, 11.5].
Therefore, we expect this channel can play an important role in enhancing the sensitivity of the combining analysis of SM Higgs pair production at the HL-LHC .