Improving the measurement of the Higgs boson-gluon coupling using convolutional neural networks at $e^+e^-$ colliders

In this paper we propose to use convolutional neural networks (CNNs) to improve the precision measurement of the Higgs boson-gluon effective coupling at lepton colliders. The CNN is employed to recognize the associated production of a Higgs boson and a $Z$ boson, with the Higgs boson decaying to a gluon pair and the $Z$ boson decaying to a lepton pair, at a center-of-mass energy of 250 GeV and an integrated luminosity of 5 ab$^{-1}$. By using CNNs, the uncertainty of the effective coupling measurement can be decreased from $1.94\%$ to about $1.28\%$ using the PYTHIA data and from $1.82\%$ to about $1.22\%$ using the HERWIG data in the Monte Carlo simulation. Moreover, the performance of CNNs using different final state constituents shows that the energy distributions of the constituents of the leading and subleading jets play a major role in the identification, and the optimal uncertainty of the effective coupling using CNNs is reduced by about $35\%$ compared to that using the conventional method.


I. INTRODUCTION
The Higgs boson occupies a distinct place in the Standard Model (SM) of particle physics. Many lingering physics problems are linked to the Higgs boson, for instance, the stability of the vacuum, the electroweak hierarchy problem and dark matter. These problems imply the existence of new physics beyond the SM and require a good understanding of the Higgs properties. The effective coupling of the Higgs boson to a gluon pair is one of the most important parameters. Many theories beyond the SM predict that the Higgs boson-gluon coupling may deviate from the SM prediction through direct or indirect effects. For example, the stop in supersymmetry or the T quark in little Higgs models can contribute to the coupling through loop effects [1][2][3][4][5][6][7][8][9][10]. Therefore, the precision measurement of the Higgs boson-gluon coupling will be a touchstone of the SM and may lead to a breakthrough for new physics.
Although gluon fusion is the most important Higgs boson production process at the CERN Large Hadron Collider, the Higgs boson-gluon coupling is still difficult to determine accurately there because of the overwhelmingly large QCD radiation [11,12]. Electron-positron colliders, with their clean environment and high luminosity, are better candidates for the precision measurement of the Higgs boson-gluon coupling. The possible future electron-positron colliders, usually called Higgs factories when operated at a center-of-mass energy of 250 GeV, include the Circular Electron-Positron Collider [13][14][15], the Future Circular Collider-electron-positron [16][17][18] and the International Linear Collider [19][20][21][22][23]. At a Higgs factory, the measurement of most of the Higgs properties can reach percent-level accuracy [11,12,24]. For the Higgs boson-gluon effective coupling, the parameter $\kappa_g$ [5,14] is commonly used to parameterize its deviation from the SM prediction, where $\kappa_g^{\rm SM} = 1$. With the conventional method (using only the kinematic cuts and b tagging) [25], the uncertainty of $\kappa_g$ will reach about 2.2% for the channel of a Z boson decaying to a lepton pair, including the detector effect, at the Circular Electron-Positron Collider.
The measurement accuracy of the Higgs boson-gluon coupling can be further improved through an effective identification of jet types. In the last few decades, many different observables motivated by color charge, color connections, electrical charge, or spin have been proposed and achieved good performance [26][27][28]. For example, the jet energy profile is one of the useful jet substructure observables to distinguish quark and gluon jets by the energy distribution of jet constituents. By using the jet energy profile, the uncertainty of the Higgs boson-gluon coupling can be further reduced to about 1.6% for the channel of a Z boson decaying to a lepton pair [29].
However, a single observable usually describes only a certain aspect of the jets or applies only to some special processes. Although it is better to choose a set of complementary observables to extract more comprehensive characteristics for identifying different types of jets or events, the range of applicability of the different observables and the degree of correlation between them then become difficult problems in their own right. Moreover, the deeper correlations between the jet or event constituents may be difficult to extract with hand-crafted observables.
Deep learning has been applied to solve many complicated problems in particle physics. In particular, deep neural networks have been employed to distinguish different types of jets, including Higgs boson tagging [30], boosted W boson tagging [31,32], boosted top tagging [33,34], single merged jet tagging [35], heavy-light quark discrimination [36] and quark-gluon discrimination [37][38][39][40]. They all achieve a promising recognition capability and are superior to the conventional methods. A convolutional neural network (CNN) is one of the most popular and powerful algorithms. Its strong image recognition ability makes it easy to extract more comprehensive and deeper features for analyzing the jet substructure. It is very suitable for jet tagging and also for testing different shower and hadronization schemes by comparing different Monte Carlo (MC) generators.
In this paper, we propose to use the CNN for the precision measurement of Higgs boson-gluon effective coupling by distinguishing the background processes from the process of a Z boson decaying to a lepton pair and a Higgs boson decaying to a gluon pair (2ℓ2g) at lepton colliders. The global information in an event is used for the training of the CNN instead of the jet information. We will use events from different event generators for neural network training and testing to illuminate the difference between the different shower and hadronization schemes.
The content is organized as follows. In the next section, the CNN is briefly reviewed. In the third section, the MC events are generated by PYTHIA and HERWIG. The production of images and CNN architecture are introduced in the fourth section. In the fifth section, we show the results using the CNN. The conclusion is made in the last section.

II. CONVOLUTIONAL NEURAL NETWORKS
A neural network is one of the most popular algorithms in machine learning. Generally, a neural network consists of an input layer, hidden layers, and an output layer. A layer is dense if each of its units connects to all of the units in the previous layer. If a neural network consists entirely of dense layers, it has a large number of parameters to tune and wastes a lot of computing resources. In fact, for image recognition each neuron only needs to perceive a local region of the image instead of the whole image, and the global information can then be obtained by integrating the local information at a higher level. This motivates the design of the CNN [41]. In the last few years, driven by the development of computer technology, the CNN has been a mainstay of many major breakthroughs in various fields.
In image identification, the images in the CNN pass through convolutional layers, pooling layers, and dense layers. The function of the convolutional layer is to extract features of the image. This is implemented by the convolution of a filter with the image. A filter is an n × n grid of weights, where n is the filter size. In the convolution, each weight in the filter multiplies the corresponding pixel intensity in a patch of the image that has the same size as the filter. Then, we sum the products, add a bias, and feed the result to an activation function. Activation functions introduce nonlinearity into neural networks, which enables them to learn deeper information. The most widely used activation function in CNNs is the rectified linear unit (ReLU), which is defined as $f(x) = \max\{0, x\}$. Each convolutional layer usually has many different filters to extract different features of an image. For multichannel images, such as color images, each filter has a separate grid of weights for each channel. Each channel is convolved with its corresponding grid, as for a single-color image, and the results are summed in the final step.
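The convolution step described above can be sketched in a few lines of plain Python. This is only an illustration with made-up pixel values and weights, not the implementation used in the analysis; the function names `relu` and `conv2d_valid` are hypothetical. As in common CNN libraries, the filter is applied without flipping.

```python
def relu(x):
    """Rectified linear unit, f(x) = max(0, x)."""
    return max(0.0, x)


def conv2d_valid(image, kernel, bias=0.0):
    """Slide an n x n filter over a 2D image (stride 1, no padding).

    Each output pixel is ReLU(sum(weight * pixel over the patch) + bias).
    """
    n = len(kernel)
    rows = len(image) - n + 1
    cols = len(image[0]) - n + 1
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            total = bias
            for i in range(n):
                for j in range(n):
                    total += kernel[i][j] * image[r + i][c + j]
            out_row.append(relu(total))
        out.append(out_row)
    return out


# Toy 4 x 4 "image" and a 3 x 3 filter (illustrative values only).
image = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 3.0, 1.0, 0.0],
    [2.0, 1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0, 2.0],
]
kernel = [
    [1.0, 0.0, -1.0],
    [1.0, 0.0, -1.0],
    [1.0, 0.0, -1.0],
]
feature_map = conv2d_valid(image, kernel)  # 2 x 2 feature map
```

Without padding, a 3 × 3 filter removes one pixel from each edge, so the 4 × 4 input yields a 2 × 2 feature map.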
Then, a pooling layer, following the convolutional layer, is used to reduce the size of the feature maps and hence the number of parameters in the subsequent layers. The filter of the pooling layer is an m × m grid, where m is the pooling size. Max pooling and average pooling are the most common pooling functions. Max pooling takes the largest value, while average pooling takes the average of all values in a filter region. A dropout layer is usually added to avoid overfitting. Dropout refers to randomly discarding some neural network units with a certain probability in each training step [42]. Finally, dense layers are added to integrate the features in the feature maps extracted by the convolutional and pooling layers, to obtain the high-level meaning of the features, and then to use them for image recognition.
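The two pooling functions can be illustrated with a short sketch. The function name `pool2d` and the numbers are hypothetical, and non-overlapping windows (stride equal to the pooling size) are assumed.

```python
def pool2d(feature_map, m=2, mode="max"):
    """Non-overlapping m x m pooling (stride m) over a 2D feature map."""
    rows = len(feature_map) // m
    cols = len(feature_map[0]) // m
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            # Collect the m x m window that maps onto this output pixel.
            window = [
                feature_map[r * m + i][c * m + j]
                for i in range(m)
                for j in range(m)
            ]
            if mode == "max":
                out_row.append(max(window))
            else:
                out_row.append(sum(window) / len(window))
        out.append(out_row)
    return out


# Illustrative 4 x 4 feature map.
fmap = [
    [1.0, 3.0, 0.0, 2.0],
    [4.0, 2.0, 1.0, 1.0],
    [0.0, 0.0, 5.0, 6.0],
    [1.0, 3.0, 7.0, 8.0],
]
max_pooled = pool2d(fmap, m=2, mode="max")
avg_pooled = pool2d(fmap, m=2, mode="average")
```

Either choice halves each spatial dimension here; max pooling keeps the strongest response in each window, while average pooling smooths over it.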
The error of the model can be quantified by the binary cross entropy loss function [43]
$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log Y_i + (1 - y_i)\log(1 - Y_i)\right], \qquad (1)$$
where $N$ is the number of training events, and $y_i$ and $Y_i$ are the real value and the value predicted by the CNN for the $i$th event, respectively. The training process tunes the parameters of the model to minimize the loss function.
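As a minimal sketch, assuming the standard form of the binary cross entropy averaged over the events, the loss can be evaluated as follows. The function name and the clipping parameter `eps` are illustrative choices and are not taken from the analysis code.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross entropy over a batch of events.

    y_true: true labels (1 for signal, 0 for background).
    y_pred: predicted signal probabilities in [0, 1].
    eps:    clipping to avoid log(0); an illustrative safeguard.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# A confident, correct classifier gives a smaller loss than a hesitant one.
loss_good = binary_cross_entropy([1, 0], [0.9, 0.1])
loss_poor = binary_cross_entropy([1, 0], [0.6, 0.4])
```

A classifier that outputs 0.5 for every event gives a loss of log 2, which is a useful reference value when monitoring training.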

III. PRE-PROCESSING
The main Higgs boson production process at the future $e^+e^-$ colliders is $e^+e^- \to Z^*/\gamma^* \to Zh$. We choose the process with the Z boson decaying to a lepton pair and the Higgs boson decaying to a gluon pair (2ℓ2g) as the signal process, since the Z boson can be reconstructed very well from the lepton pair. The two Z boson decay modes, $Z \to e^+e^-$ and $Z \to \mu^+\mu^-$, are discussed first. Then the two lepton channels are combined as $Z \to \ell^+\ell^-$. The backgrounds are divided into two-fermion leptonic (final states are a lepton pair from the Z or $\gamma^*$ intermediate states), two-fermion hadronic (final states are two quarks), four-fermion leptonic (final states are four leptons from the vector boson pair intermediate states), four-fermion semileptonic (final states are a pair of charged leptons and a pair of quarks from the vector boson pair intermediate states), four-fermion hadronic (final states are four quarks), and Higgs boson production with final states different from the signal [mainly the associated production of a Higgs boson and a Z boson, with the Z boson decaying to a lepton pair and the Higgs boson decaying to a b/c quark pair (hbb/hcc) or a W/Z boson pair (hWW/hZZ)] [44,45]. Both the signal and background events are simulated for the future $e^+e^-$ colliders [13][14][15][16][17][18][19][20][21][22][23] at a center-of-mass energy of 250 GeV and an integrated luminosity of 5 ab$^{-1}$. The parton level MC events are generated by WHIZARD 1.95 [46,47] and transferred to hadron level by PYTHIA 6 [48] and HERWIG 7 [49], respectively. For clarity, we call them PYTHIA data and HERWIG data, respectively.
We select a pair of isolated leptons to reconstruct the Z boson. The rest of the final state constituents are clustered into jets via FASTJET 3.3.0 [50] using the anti-$k_T$ algorithm with a large jet cone of R = 1.5, and the energy of each jet is required to be more than 5 GeV. To suppress the two-fermion leptonic and four-fermion leptonic backgrounds [45], we first add two cuts. One is on the number of stable charged particles in the final state, $N_{\rm charge} \geq 10$, and the other is on the electromagnetic energy ratio in the final state, $R_{\rm EM} < 0.99$. Then, the kinematic cuts, i.e., the invariant mass, recoil mass, and other constraints on the lepton pair and jet pair, are used to ensure that the lepton pair and the jet pair come from the Z boson and the Higgs boson, respectively, and to reject the two-fermion hadronic and four-fermion hadronic backgrounds. More details of the analysis can be found in Ref. [29]. The reference also shows that c tagging cannot decrease the $\kappa_g$ uncertainty effectively, since its mistag rate for gluon jets will exclude some gluon jets. Therefore, we only use b tagging in this paper.
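The two loose preselection cuts and the jet energy requirement quoted above can be written as a short sketch. The function names and the event inputs are hypothetical, and $R_{\rm EM}$ is assumed here to be the ratio of electromagnetic energy to total visible energy, which the text does not spell out.

```python
def select_jets(jet_energies, min_energy=5.0):
    """Keep jets with energy above the threshold (in GeV)."""
    return [e for e in jet_energies if e > min_energy]


def passes_preselection(n_charged, em_energy, total_energy):
    """Apply the cuts N_charge >= 10 and R_EM < 0.99.

    R_EM is taken as em_energy / total_energy, an assumed definition.
    """
    r_em = em_energy / total_energy
    return n_charged >= 10 and r_em < 0.99


# Illustrative events (numbers are made up).
hadronic_like = passes_preselection(n_charged=24, em_energy=60.0, total_energy=240.0)
leptonic_like = passes_preselection(n_charged=4, em_energy=239.0, total_energy=240.0)
jets = select_jets([120.0, 95.0, 3.2])
```

Purely leptonic final states have few charged particles and are dominated by electromagnetic energy, which is why these two cuts remove them efficiently.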
The kinematic cuts and b tagging can remove a large number of the distinct backgrounds, which greatly improves the efficiency of the neural network. The remaining backgrounds are hbb, hcc, hWW, hZZ and four-fermion semileptonic. The jets in the hbb/hcc and four-fermion semileptonic backgrounds are mainly heavy quark jets and light quark jets, respectively. The jets in the hWW/hZZ backgrounds, however, are W/Z jets and light quark jets, since quite a few of the light quark jets are merged into W/Z jets with the large jet cone of R = 1.5. It is the complex mixture of jet types in the backgrounds that makes the signal identification a challenge.
After all the cuts, the uncertainties of $\kappa_g$ should be evaluated. The evaluation of the systematic uncertainties requires a detailed detector study and is not yet available for the Higgs factory. The statistical uncertainty of $\kappa_g$ around the SM prediction, however, can be explicitly expressed as
$$\delta\kappa_g = \frac{1}{2}\frac{\sqrt{N}}{N_g}, \qquad (2)$$
where $N_g$ and $N$ are the numbers of events with the Higgs boson decaying to a gluon pair and of total events, respectively. In Table I, the second and the third lines are the uncertainties of $\kappa_g$ with the conventional method using the PYTHIA data and the HERWIG data, respectively. The difference between the results using the PYTHIA data and the HERWIG data may come from the different shower and hadronization schemes. The $k_T$-ordered and the angular-ordered schemes are used for the shower, and the Lund string and the cluster models are used for the hadronization, in PYTHIA 6 and HERWIG 7, respectively.
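The form of the statistical uncertainty can be motivated by a short error propagation. The sketch below assumes that the signal yield scales as $\kappa_g^2$ (with any change of the total Higgs width neglected) and that the fluctuation of the selected sample is Poissonian, $\sqrt{N}$; these assumptions are ours and are not spelled out in the text.

```latex
N_g \propto \kappa_g^2
\;\Rightarrow\;
\frac{\delta N_g}{N_g} = 2\,\frac{\delta\kappa_g}{\kappa_g},
\qquad
\delta N_g \simeq \sqrt{N}
\;\Rightarrow\;
\delta\kappa_g\Big|_{\kappa_g = 1} \simeq \frac{1}{2}\,\frac{\sqrt{N}}{N_g}.
```

For example, with purely hypothetical yields $N_g = 1000$ and $N = 1600$, this gives $\delta\kappa_g \simeq \frac{1}{2}\cdot\frac{40}{1000} = 2\%$.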

IV. ARCHITECTURE OF CNN
For the training of the CNN, we use the combined lepton channel $Z \to \ell^+\ell^-$. The entire spherical surface, with the azimuthal angle $\phi \in [-\pi, \pi]$ and the polar angle $\theta \in [0, \pi]$, is treated as a two-dimensional plane image. Each image has 66 pixels in the $\phi$ direction and 34 pixels in the $\theta$ direction. The energies of all the final state stable particles are discretized into pixels and used as the pixel intensity at lepton colliders. The images of the signal process 2ℓ2g are given the label one, and the other images, as background processes, are given the label zero. All the images are divided into training, validation and test sets in the proportion 8:1:1. The neural network is implemented using Keras [43] with the TensorFlow backend. Our CNN architecture is inspired by the VGGNet [51] architectures and consists of four iterations of convolutional layers and max-pooling layers, as shown in Fig. 1. Then the feature map is flattened and fed to a dense layer with 128 units. Finally, a dense layer with one unit and a sigmoid activation is added to classify the signal and background processes. Each convolutional layer consists of 64 or 128 filters with filter size 3 × 3 and a ReLU activation. The uniform distribution is used to initialize the filters. The stride length of the convolution is 1. The first convolutional layer is set without padding to weaken the influence of the edge information of the image at the beginning, while the others are set with padding to keep all the information of the feature map. Each max-pooling layer performs a 2 × 2 down-sampling with a stride length of 2. A dropout layer follows each max-pooling layer and the dense layer to avoid overfitting. The dropout rate of every dropout layer is 0.5, except for the first one, which is 0.25.
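The image construction and the resulting feature map sizes can be sketched as follows. This is an illustration only: the function names are hypothetical, and uniform binning in $\theta$ and $\phi$ is assumed, since the text does not specify the binning. The shape trace uses only what is stated above: a first 3 × 3 convolution without padding, padded convolutions afterwards, and 2 × 2 max pooling with stride 2 in each of the four iterations.

```python
import math

N_PHI, N_THETA = 66, 34  # pixels along phi and theta, as quoted in the text


def pixelize(particles, n_phi=N_PHI, n_theta=N_THETA):
    """Fill an n_theta x n_phi image with summed particle energies.

    particles: iterable of (energy, theta, phi) with theta in [0, pi]
    and phi in [-pi, pi]. Uniform binning is an assumption.
    """
    image = [[0.0] * n_phi for _ in range(n_theta)]
    for energy, theta, phi in particles:
        row = min(int(theta / math.pi * n_theta), n_theta - 1)
        col = min(int((phi + math.pi) / (2.0 * math.pi) * n_phi), n_phi - 1)
        image[row][col] += energy
    return image


def trace_shape(height, width, n_blocks=4):
    """Spatial size of the feature map after each conv + pool iteration."""
    shapes = []
    for block in range(n_blocks):
        if block == 0:
            # The first 3 x 3 convolution is unpadded: one pixel lost per edge.
            height, width = height - 2, width - 2
        # Later convolutions are padded, so they leave the size unchanged.
        height, width = height // 2, width // 2  # 2 x 2 max pooling, stride 2
        shapes.append((height, width))
    return shapes


# Two particles in the same pixel add up; values are illustrative.
image = pixelize([(10.0, math.pi / 2, 0.0), (5.0, math.pi / 2, 0.0)])
shapes = trace_shape(N_THETA, N_PHI)
```

Under these assumptions the 34 × 66 input becomes 32 × 64 after the first unpadded convolution, so that the four pooling steps divide evenly down to a 2 × 4 feature map before flattening.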
The binary cross entropy is used as the loss function. The training is optimized with the Adam algorithm [52] with a learning rate of 0.0005. The training is set with a batch size of 128, a maximum of 100 epochs, and an early stopping patience of 5. Thus, the training stops early if the validation loss has not decreased for 5 consecutive epochs.
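The early stopping rule can be illustrated with a small sketch of the patience logic. This is a plain-Python illustration of the idea, not the Keras callback itself; the function name and the loss values are hypothetical.

```python
def epochs_run(val_losses, patience=5):
    """Return the number of epochs run before early stopping triggers.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)


# Illustrative validation losses: the best value is reached at epoch 3.
losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.45, 0.48, 0.46, 0.50, 0.44]
stopped_at = epochs_run(losses, patience=5)
```

In this example the loss never drops below its epoch-3 value during the following five epochs, so training ends at epoch 8 and the later improvement at epoch 10 is never reached.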
The receiver operating characteristic (ROC) curve is usually used to quantify the performance of neural networks. A ROC curve is generated by plotting the true positive rate against the false positive rate. The area under the curve (AUC) is used to compare the overall performance of the neural networks. In this paper, the true positive rate is the signal process (2ℓ2g) acceptance efficiency $R_g$ and the false positive rate is the mistag efficiency $R_B$ of the background processes. We test the performance of our neural network and compare it to several different neural networks. Fig. 2 shows the background rejection rate $1 - R_B$ as a function of the signal acceptance efficiency $R_g$ for CNNs with different architectures. The lines marked as "3-conv", "Alex" and "MiniVGG" represent the performance of the CNN architectures in Refs. [40,53,54], respectively. The green dotted line is the result using the neural network that contains three iterations of a convolutional layer and a max-pooling layer. The blue dashed line is the result using the well-known AlexNet, which uses a stack of convolutional layers to increase the nonlinearity of the neural network and a bigger filter size to increase the receptive field. As a result, the performance of the AlexNet is significantly better than that of the 3-conv. The red dash-dotted line is the result using the neural network that is inspired by the MiniVGGNet architecture but with a bigger filter size in the first two convolutional layers. More iterations of the convolutional layer stack further enhance the nonlinearity of the neural network and lead to improved performance. Following the advantages of the VGGNet, our neural network uses a stack of convolutional layers with 3 × 3 filter size instead of a single convolutional layer with a big filter size, which increases the nonlinearity of the neural network and reduces the number of parameters. The black solid line is the result using our neural network architecture, which is better than the other three neural network structures for the identification of our signal and background processes.
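The ROC curve and the AUC described above can be computed with a short sketch. The function names and the toy scores are hypothetical, and the simple threshold scan assumes that no two events share the same score.

```python
def roc_points(scores, labels):
    """Return (R_B, R_g) pairs obtained by lowering the score threshold.

    scores: network outputs; labels: 1 for signal, 0 for background.
    Assumes no tied scores.
    """
    n_sig = sum(labels)
    n_bkg = len(labels) - n_sig
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Sort by decreasing score and accept one more event at each step.
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_bkg, tp / n_sig))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example: one background event is ranked above one signal event.
points = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
area = auc(points)
```

Plotting $1 - R_B$ against $R_g$, as done in the figures, mirrors this curve and encloses the same area.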

V. RESULTS
In this section, we present the improvement of the $\kappa_g$ uncertainty achieved by using the CNN. Fig. 3 shows the background rejection rate $1 - R_B$ as a function of the signal acceptance efficiency $R_g$ for our CNN. The areas under these curves are the AUC values of the different cases. Both training and testing have been applied to the PYTHIA and HERWIG data. For convenience, the symbol "P(H)+P(H)" is used to represent training with the PYTHIA (HERWIG) data and testing with the PYTHIA (HERWIG) data. At around $R_g = 80\%$ the background rejection rate reaches about 80%, so a strong background suppression is obtained while the signal acceptance efficiency remains acceptable. Furthermore, the AUC value of the "H+H" is slightly better than that of the "P+P". More specifically, the curves of the P+P and H+H are very similar in the low signal acceptance efficiency region $R_g < 70\%$, but the curve of the H+H is higher than that of the P+P in the high signal acceptance efficiency region $R_g > 70\%$. In general, the performance of the P+P and the H+H is similar, which indicates that the shower and hadronization schemes in PYTHIA and HERWIG lead to similar performance.
The "H+P" and "P+H" cases, with training and testing on different data, serve as a cross-check to illustrate the universality of the CNN model. It makes sense to compare the performance of CNN models that are trained with different data but tested with the same data. The CNN models are universal if their performance is similar. Comparing the "P+P(H)" to the "H+P(H)" in Fig. 3, the performance of the CNN model tested with different data is only slightly worse than that tested with the same data over the whole signal acceptance efficiency range. This means that our CNN models do not overfit much, since they are not overly dependent on a particular data set.
Different ratios of the remaining signal and backgrounds can be obtained along the ROC curve in Fig. 3. The uncertainty of $\kappa_g$ after using the CNN at each point $(R_g, R_B)$ can be expressed as
$$\delta\kappa_g^{\rm CNN} = \frac{1}{2}\frac{\sqrt{R_g N_g + R_B (N - N_g)}}{R_g N_g}, \qquad (3)$$
where $N_g$ and $N$ are the numbers of signal events and total events before the CNN is applied. Fig. 4 presents the uncertainty of $\kappa_g$ after the CNN as a function of the signal acceptance efficiency $R_g$ using the PYTHIA and HERWIG data. At the optimal point $R_g = 70\%$, $\delta\kappa_g^{\rm CNN}$ can reach about 1.28% by using the P+P and 1.22% by using the H+H. Compared to Table I, this shows that the uncertainty can be further reduced by 34% for the P+P and 33% for the H+H. The result using the H+H is about 5% smaller than that using the P+P. The small difference between the results may come from the different shower and hadronization schemes in PYTHIA and HERWIG. The results of the cross-check are slightly worse than those of the training and testing with the same data. Comparing the P+P to the H+P, the uncertainty of $\kappa_g$ using the H+P is slightly worse than that using the P+P. But the difference between the P+P and the H+P is less than 0.1%, which is well below the measurement accuracy expected at the future electron-positron colliders. The H+H and the P+H are in the same situation. The similar results mean that our CNN models do not overfit much and that the results are reliable.
In the previous part, one image is constructed with the information of all the final state stable particles in an event. To gain insight into the improvement by the CNN and find the most important features of the signal and background, different images are constructed with different final state constituents. The following analysis only uses the PYTHIA data. Fig. 5 shows the uncertainty of $\kappa_g$ after the CNN as a function of the signal acceptance efficiency $R_g$ using the different images.
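The scan for the optimal working point along the ROC curve can be sketched as follows. The sketch assumes that the uncertainty after the CNN is half the square root of the retained total yield divided by the retained signal yield, with $R_g$ and $R_B$ applied to the signal and background yields. The function names, the event yields and the ROC points are hypothetical and chosen only for illustration.

```python
import math


def delta_kappa_cnn(r_g, r_b, n_g, n_b):
    """Statistical uncertainty of kappa_g after a CNN cut.

    r_g, r_b: signal acceptance and background mistag efficiencies.
    n_g, n_b: signal and background yields before the CNN cut.
    """
    kept_signal = r_g * n_g
    kept_total = kept_signal + r_b * n_b
    return 0.5 * math.sqrt(kept_total) / kept_signal


def best_working_point(roc, n_g, n_b):
    """Return (uncertainty, R_g, R_B) at the ROC point minimizing it."""
    candidates = [
        (delta_kappa_cnn(r_g, r_b, n_g, n_b), r_g, r_b)
        for r_g, r_b in roc
        if r_g > 0.0
    ]
    return min(candidates)


# Hypothetical yields and ROC points (not taken from the paper).
roc = [(1.0, 1.0), (0.9, 0.5), (0.7, 0.2), (0.5, 0.1)]
best = best_working_point(roc, n_g=1000, n_b=3000)
```

The first point, $(R_g, R_B) = (1, 1)$, corresponds to applying no CNN cut at all. A tighter cut removes background faster than signal at first, but eventually the loss of signal statistics dominates, which is why an optimum appears at intermediate $R_g$.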
The line marked as "all" is the result using the images constructed with the information of all the final state stable particles, and the line marked as "multijet" is the result using the images constructed with the information of all the jets clustered by the anti-$k_T$ algorithm in an event. The "multijet" result is slightly better than the "all" result in the region $R_g \in [60\%, 70\%]$. However, the difference between the "all" and "multijet" results is less than 0.2% at the optimal points and can be ignored. This indicates that the information of the jets makes the major contribution to the identification of the signal and background processes. The reason is that most of the information in an event other than the jets comes from the lepton pair, which is very similar in the signal and background processes after the kinematic cuts. The line marked as "dijet" is the result using the images constructed only with the information of the leading and subleading jets. The "all" and "dijet" results are very similar, which shows that the leading and subleading jets contribute nearly all the features for the CNN. The "multijet" and "dijet" results are also very similar, since most of the events have only two jets with the large jet cone of R = 1.5. If the images are constructed only with the leading and subleading jets, the center of the two jets can be chosen as the image center. The constituents of the two jets are then discretized into pixels to obtain the "dijet translation" images. With this operation, the jets are not split into two parts at the margins of the image. The "dijet translation" and the "dijet" results are also very similar, which indicates that the symmetry property in the $\phi$ direction has been recognized by the CNN. After showing that the information of the leading and subleading jets makes the major contribution to the identification of the signal and background processes, we further analyze the contribution of each jet.
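The translation that produces the "dijet translation" images can be sketched for the periodic $\phi$ direction. This is a hedged illustration: the function names are hypothetical, the shift is applied in $\phi$ only, and the center is taken midway between the two jet axes along the shorter arc, all of which are assumptions not specified in the text.

```python
import math


def wrap_phi(phi):
    """Map an azimuthal angle to the interval [-pi, pi)."""
    return (phi + math.pi) % (2.0 * math.pi) - math.pi


def dijet_center_phi(phi1, phi2):
    """Azimuth midway between two jet axes, along the shorter arc."""
    return wrap_phi(phi1 + 0.5 * wrap_phi(phi2 - phi1))


def translate(constituents, phi_center):
    """Shift (energy, theta, phi) constituents so phi_center maps to 0."""
    return [
        (energy, theta, wrap_phi(phi - phi_center))
        for energy, theta, phi in constituents
    ]


# Two jets on either side of the phi = +-pi seam (illustrative values).
center = dijet_center_phi(3.0, -3.0)
shifted = translate([(50.0, 1.2, 3.0), (45.0, 1.9, -3.0)], center)
```

In this example the two jets, which would sit at opposite margins of the untranslated image, end up close together around $\phi = 0$ after the shift.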
Fig. 6 shows the uncertainty of $\kappa_g$ after the CNN as a function of the signal acceptance efficiency $R_g$ using the different single-jet images. Each single-jet image has the size 2R × 2R with the jet cone R = 1.5 and is designed to have 34 × 34 pixels. The jet axis is placed at the image center so that the single-jet image contains a complete jet. The lines marked as "leading jet" and "subleading jet" represent the results using the leading jet images and the subleading jet images, respectively. We can see that the leading and subleading jets are equally important for the identification. Then the leading and subleading jet images are combined as two different channels into the "dijet 2-channel" image, by analogy with the recognition of color images, where the red, green and blue intensities are treated as separate input channels. Compared to the "dijet", which puts the leading and subleading jets in one image, the "dijet 2-channel" removes the relative location information of the two jets. The "dijet 2-channel" result is only slightly worse than the "dijet" result, so the relative location information of the jets is not important for this discrimination. From the above analysis, we can conclude that the leading and subleading jets make the major contribution to the identification of the signal and background processes.
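A possible construction of the single-jet and two-channel images is sketched below. The function names are hypothetical, and the image axes are assumed to be the offsets in $\theta$ and $\phi$ from the jet axis with uniform binning, which the text does not state explicitly.

```python
import math

R, N_PIX = 1.5, 34  # jet cone and pixels per side, as quoted in the text


def single_jet_image(constituents, jet_theta, jet_phi):
    """Build a 34 x 34 image of size 2R x 2R centered on the jet axis.

    constituents: iterable of (energy, theta, phi).
    Axes are (theta - jet_theta, phi - jet_phi); an assumed convention.
    """
    image = [[0.0] * N_PIX for _ in range(N_PIX)]
    for energy, theta, phi in constituents:
        d_theta = theta - jet_theta
        # Wrap the azimuthal offset into [-pi, pi).
        d_phi = (phi - jet_phi + math.pi) % (2.0 * math.pi) - math.pi
        if abs(d_theta) >= R or abs(d_phi) >= R:
            continue  # outside the 2R x 2R window
        row = int((d_theta + R) / (2.0 * R) * N_PIX)
        col = int((d_phi + R) / (2.0 * R) * N_PIX)
        image[row][col] += energy
    return image


def stack_channels(leading, subleading):
    """Combine two single-jet images into one 34 x 34 x 2 image."""
    return [
        [[leading[r][c], subleading[r][c]] for c in range(N_PIX)]
        for r in range(N_PIX)
    ]


# Illustrative jets: one constituent on each axis, one outside the window.
lead = single_jet_image([(40.0, 1.0, 0.5), (5.0, 1.0, 2.5)], 1.0, 0.5)
sub = single_jet_image([(25.0, 2.0, -2.6)], 2.0, -2.6)
two_channel = stack_channels(lead, sub)
```

Because each jet is centered in its own channel, the two-channel image no longer encodes where the jets sit relative to each other, in line with the remark above.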
In the third section, the analysis shows that the jets in the signal process are mainly gluon jets, while the jets in the background processes can mainly be divided into quark jets and W/Z jets. The three types of jets have different energy distributions of their constituents. Each jet image using energy as pixel intensity records the energy distribution of the jet constituents. This information can be extracted from the jet images by the CNN to identify the signal and background processes. Therefore, the energy distributions of the constituents of the leading and subleading jets make the major contribution to the identification of the signal and background processes. Fig. 7 shows the best result using the CNN (the line marked as "multijet") and the result using the conventional method (the line marked as "conventional") for the PYTHIA data. Compared to the conventional method, the CNN gives a significant improvement over a wide signal acceptance efficiency region. At the optimal point $R_g = 70\%$, the uncertainty of $\kappa_g$ can be decreased from 1.94% to about 1.26% by using the CNN, a reduction of about 35% compared to the conventional method for the PYTHIA data. Moreover, the result using the HERWIG data is similar to that using the PYTHIA data.
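The quoted relative reductions follow directly from the uncertainties given in the text, as the short arithmetic check below shows. The first line is the best ("multijet") PYTHIA result, and the other two are the P+P and H+H results using all final state particles.

```latex
\frac{1.94\% - 1.26\%}{1.94\%} \approx 35\%, \qquad
\frac{1.94\% - 1.28\%}{1.94\%} \approx 34\%, \qquad
\frac{1.82\% - 1.22\%}{1.82\%} \approx 33\%.
```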

VI. CONCLUSIONS
In this paper, the CNN is used to improve the precision measurement of the Higgs boson-gluon effective coupling at lepton colliders. By using the CNN, the uncertainty of $\kappa_g$ can be decreased from 1.94% to about 1.28% using the PYTHIA data and from 1.82% to about 1.22% using the HERWIG data in the channel of a Z boson decaying to a lepton pair in the MC simulation for a center-of-mass energy of 250 GeV and an integrated luminosity of 5 ab$^{-1}$. The difference between the expected $\kappa_g$ uncertainties using the PYTHIA and the HERWIG data is less than 0.1%. Moreover, the performance of the CNN using different final state constituents shows that the energy distributions of the constituents of the leading and subleading jets play a major role in the identification, and the optimal uncertainty of $\kappa_g$ using the CNN is reduced by about 35% compared to that using the conventional method.