Polarization fraction measurement in ZZ scattering using deep learning

Measuring longitudinally polarized vector boson scattering in the ZZ channel is a promising way to investigate unitarity restoration with the Higgs mechanism and to search for possible new physics. We investigated several deep neural network structures and compared their ability to improve the measurement of the longitudinal fraction, Z_L Z_L. Using fast simulation with the Delphes framework, a clear improvement is found by using a previously investigated 'particle-based' deep neural network on a preprocessed dataset and applying principal component analysis to its outputs. A significance of around 1.7 standard deviations can be achieved with the integrated luminosity of 3000 fb⁻¹ that will be recorded at the High-Luminosity LHC. The technique developed in this article is also useful for other LHC analyses involving helicity fraction measurements.

Keywords: HL-LHC, vector boson scattering, electroweak
Vector boson scattering (VBS) is a rare standard model (SM) process which plays a crucial role in electroweak symmetry breaking. The LHC and High-Luminosity LHC (HL-LHC) have enormous potential to both initially observe and study the features of rare processes such as VBS. Our knowledge of the VBS topology at hadron colliders can be combined with advanced data analysis techniques, such as deep learning, to make this pursuit even more promising.
Many VBS studies have been performed with LHC data, including measurements of W±W± [1,2], W±Z [3,4] and Zγ [5,6]. The topic of this paper is the channel ZZ → 4l. While this channel has the advantage of a clean final state, it suffers from a low production cross section, the small branching ratio of the Z boson to charged leptons, and two large irreducible QCD backgrounds: the production of ZZ via quark-antiquark annihilation (qqZZ) and via a gluon box diagram (ggZZ). ZZ scattering has recently been observed by ATLAS with a significance larger than 5 standard deviations, using 139 fb⁻¹ of LHC Run II data collected at √s = 13 TeV [7]. A prior measurement made by CMS with 35.9 fb⁻¹ collected at √s = 13 TeV reported an observed significance of 2.7 standard deviations [8]. Measuring the longitudinally polarized component of VBS (the LL component) is a critical next step for the field, as it is closely related to the important theoretical property of unitarity restoration, through the Higgs mechanism and possible new physics [9,10]. In the context of the Higgs and VBS discoveries, it becomes one of the next big targets: to test directly that the Higgs boson regulates Goldstone boson scattering [11]. Probing LL scattering can also provide an alternative, model-independent way of measuring Higgs couplings [12]. Fig. 1 [left] compares the four-lepton invariant mass distributions of the SM and its LL component with those of a modified SM in which the Higgs couplings to the W and Z bosons are scaled by a factor of 0.8, together with its LL component. In this example, the LL component is sensitive to the Higgs couplings to massive gauge bosons, as deviations from the SM prediction lead to large enhancements of the LL mode, especially in the high-mass tail. However, because the LL cross section is only ∼10% of the sum of the transverse (TT) and mixed (TL) cross sections, and its kinematic features are very similar to theirs, this measurement will be very challenging, and advanced techniques will be essential for its success, as can be seen in Fig. 1 [right], where a large improvement can be achieved.
Previous studies include WW scattering studies that employ a two-dimensional fit and a deep neural network (DNN) [13,14], and a ZZ scattering study that employs a boosted decision tree (BDT) [15]. However, due to the low signal yield expected in the ZZ case, a significance of only 1.4σ is expected for LL with 3000 fb⁻¹ collected at √s = 14 TeV. In this study, we compare the performance of several machine-learning models, including a BDT [16] as implemented in TMVA [17] and a DNN as implemented in the Keras library [18] with the TensorFlow back-end [19]. A performance comparison for extracting the LL component in the same-sign WW channel, using a particle-based DNN, a dense DNN, and a BDT, has been reported in ref. [14]; the particle-based DNN performs best. We further enhance the classification power of the particle-based DNN by applying standardization and the Yeo-Johnson power transformation [20] to each input variable. Finally, principal component analysis (PCA) is applied to the outputs of the DNN, and two- or three-dimensional fits are performed to achieve a further enhancement of the signal significance.
VBS ZZ and QCD ZZ background samples are simulated using MadGraph5_aMC@NLO [21] for the hard process and Pythia 6 [22] for the parton showering and hadronization. Delphes version 3 [23] was used for detector simulation and was configured to simulate the CMS HL-LHC detector (pileup has been neglected, as in ref. [24]). The LL, TL, and TT samples are obtained from the VBS ZZ sample and processed separately starting from generator level. The Z bosons of each sample are decayed into charged-lepton pairs by MadGraph's DECAY package, keeping the polarization information for each boson. The following event selection is applied to reconstructed Delphes objects. We require four leptons that can be clustered into two Z boson candidates, with each Z boson candidate consisting of two oppositely charged, same-flavor leptons with an invariant mass (m_ll) satisfying 60 GeV < m_ll < 120 GeV. If more than one ZZ combination is possible, we select the one that minimizes (m_ll1 − 90.188 GeV)² + (m_ll2 − 90.188 GeV)². Each selected lepton's transverse momentum (p_T^l) must be larger than 5 GeV, and the leading and sub-leading leptons' transverse momenta must be larger than 20 GeV and 10 GeV, respectively. We select jets with transverse momentum (p_T^j) larger than 25 GeV and absolute pseudorapidity (|η^j|) smaller than 4.7, and we require at least two selected jets. In order to suppress QCD-induced backgrounds, we require that the invariant mass of the two leading jets (m_jj) be larger than 400 GeV, that the absolute value of their pseudorapidity separation (|Δη_jj|) be larger than 2.4, and that the event contain no b-tagged jets. After the event selection, 100,000, 150,000, 240,000, 48,000, and 40,000 unweighted events remain for LL, TL, TT, qqZZ, and ggZZ, respectively.
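For illustration, the Z-candidate pairing described above could be implemented along the following lines. This is a minimal Python sketch under our own assumptions about the lepton data layout (dictionaries with pt, eta, phi, charge, flavor) and uses a massless-lepton approximation for the dilepton mass; it is not the analysis code.

```python
# Minimal sketch of the ZZ candidate pairing: among the three ways of grouping four
# leptons into two pairs, keep opposite-charge, same-flavor pairs with
# 60 < m_ll < 120 GeV and minimize (m_ll1 - 90.188)^2 + (m_ll2 - 90.188)^2.
import math

M_REF = 90.188  # reference mass used in the selection, in GeV
PAIRINGS = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]


def dilepton_mass(l1, l2):
    """Invariant mass in the massless-lepton approximation:
    m^2 = 2 pT1 pT2 (cosh(d_eta) - cos(d_phi))."""
    m2 = 2.0 * l1["pt"] * l2["pt"] * (
        math.cosh(l1["eta"] - l2["eta"]) - math.cos(l1["phi"] - l2["phi"]))
    return math.sqrt(max(m2, 0.0))


def best_zz_pairing(leptons):
    """Return ((i, j), (k, l)) for the best ZZ pairing of four lepton dicts,
    or None if no pairing satisfies the Z-candidate requirements."""
    best, best_metric = None, float("inf")
    for pair_a, pair_b in PAIRINGS:
        masses = []
        for i, j in (pair_a, pair_b):
            li, lj = leptons[i], leptons[j]
            if li["charge"] + lj["charge"] != 0 or li["flavor"] != lj["flavor"]:
                break
            m = dilepton_mass(li, lj)
            if not (60.0 < m < 120.0):
                break
            masses.append(m)
        if len(masses) != 2:
            continue
        metric = (masses[0] - M_REF) ** 2 + (masses[1] - M_REF) ** 2
        if metric < best_metric:
            best, best_metric = (pair_a, pair_b), metric
    return best
```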
We now compare the LL vs. TL and TT discriminating power of several variables, including machine-learning discriminants and kinematic variables. We consider two DNNs, a BDT, the transverse momentum of the leading lepton (p_T^l1), and the azimuthal angle difference between the two leading jets (|Δφ_jj|). The DNNs and the BDT take as input the four-momenta of the four leptons and two jets. We have constructed two differently structured DNN models: a dense DNN and a particle-based DNN, a structure previously explored in ref. [14]. The dense DNN model uses 6 hidden layers with 150 hidden units. In the particle-based model, the four-momentum of each particle forms one input layer, and the model contains two hidden layers with 10 nodes for each particle. The lepton and jet branches are each merged through 3 layers of 20 nodes before being combined into one layer of 60 nodes; finally, 2 layers of 100 nodes are added ahead of the output layer. Both DNN models use the ReLU activation function on every hidden unit, while a sigmoid function is used at each output node. He's uniform initialization [25] is adopted for the weights, and the Adam optimizer [26] with a learning rate of 0.001 is used to train each DNN. No overtraining was found: the loss values are comparable for the training and test samples, as is the distribution of the DNN output. Fig. 1 [right] compares the performance of each discriminant, with LL taken as signal and TL and TT as background. The particle-based DNN model has the best discrimination power among all of the discriminants and will be the main focus of this paper.

Next, we re-organize the particle-based DNN output structure such that it has five output nodes instead of two, corresponding to the LL, TL, TT, qqZZ, and ggZZ scores, respectively. We also studied the impact of several data preprocessing methods on the DNN performance: standardization (STD), the Yeo-Johnson power transformation [20] together with standardization (YJ&STD), and no preprocessing. The STD method uses the LL training dataset as a base dataset to construct a transformation that shifts and scales each input variable so that its mean and standard deviation are 0 and 1; the transformation derived from the base dataset is then applied to all of the remaining training and test datasets for the signal and all of the backgrounds. The Yeo-Johnson power transformation maps the distribution of the LL training dataset to an approximately normal distribution; it is an extension of the Box-Cox power transformation [27] to negative values and is therefore well suited to our input variables. Fig. 2 compares the distributions of p_T^j1 obtained with the different data preprocessing methods, where the signal is LL and the background is TL, TT, qqZZ, and ggZZ merged together. The LL distribution of the DNN output in the YJ&STD case is shown in Fig. 3. Fig. 4 shows the impact of the different preprocessing methods on the ROC curves of the particle-based DNN; the best discriminating power comes from YJ&STD, which we therefore adopt. The signal significance is calculated based on the Asimov dataset [28]. We report significances for three machine-learning methods: 1) a two-step BDT [15], 2) a two-step DNN, and 3) a DNN with PCA applied to its outputs (DNN-PCA), which are described in detail below.
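Before turning to the three methods, the YJ&STD preprocessing and the particle-based architecture described above can be sketched in Keras as follows. Layer sizes follow the text; the exact way the lepton and jet branches are merged, the ordering of the four-momentum components, and the loss function are our assumptions, not the published implementation.

```python
# Minimal sketch (our reading of the text, not the authors' code) of the YJ&STD
# preprocessing and the particle-based DNN with five output scores.
from sklearn.preprocessing import PowerTransformer
import tensorflow as tf
from tensorflow.keras import layers, Model

# --- Yeo-Johnson + standardization, fitted on the LL training set only ---
# x_train_LL, x_other: arrays of shape (n_events, 24) = 6 particles x 4-momentum
pt = PowerTransformer(method="yeo-johnson", standardize=True)
# pt.fit(x_train_LL)                  # base dataset
# x_other = pt.transform(x_other)     # same transformation for all other samples

# --- particle-based DNN: one input branch per particle (4 leptons + 2 jets) ---
def particle_branch(name):
    inp = layers.Input(shape=(4,), name=name)   # one four-momentum per particle
    x = layers.Dense(10, activation="relu", kernel_initializer="he_uniform")(inp)
    x = layers.Dense(10, activation="relu", kernel_initializer="he_uniform")(x)
    return inp, x

lep_inputs, lep_branches = zip(*[particle_branch(f"lep{i}") for i in range(4)])
jet_inputs, jet_branches = zip(*[particle_branch(f"jet{i}") for i in range(2)])

def merge_block(branches, n_layers=3, width=20):
    """Merge a group of particle branches and pass it through dense layers."""
    x = layers.Concatenate()(list(branches))
    for _ in range(n_layers):
        x = layers.Dense(width, activation="relu",
                         kernel_initializer="he_uniform")(x)
    return x

leptons = merge_block(lep_branches)              # 3 layers of 20 nodes
jets = merge_block(jet_branches)                 # 3 layers of 20 nodes
x = layers.Concatenate()([leptons, jets])
x = layers.Dense(60, activation="relu", kernel_initializer="he_uniform")(x)
for _ in range(2):
    x = layers.Dense(100, activation="relu", kernel_initializer="he_uniform")(x)
out = layers.Dense(5, activation="sigmoid", name="scores")(x)  # LL, TL, TT, qqZZ, ggZZ

model = Model(inputs=list(lep_inputs) + list(jet_inputs), outputs=out)
# Loss not specified in the text; binary cross-entropy over the five sigmoid
# scores is our assumption.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy")
```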
1) The two-step BDT first classifies QCD vs. VBS (BDT1) and then classifies LL against all backgrounds (BDT2). BDT2 is trained on the events remaining after a selection on the output of BDT1 that maximizes S/√B. We obtain a significance of 1.41σ when only statistical uncertainties are considered, and 1.23σ when a 10% uncertainty is applied to both signal and background in addition to the statistical uncertainties. These numbers are comparable to the expected significance of 1.4σ reported in ref. [15], which considers statistical uncertainties, systematic uncertainties, and a 10% uncertainty in the ggZZ background yield.
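The significances quoted here and below are obtained from binned Asimov estimates. A back-of-the-envelope sketch of how such numbers could be computed is given below; it is a hypothetical helper based on the asymptotic formulae of ref. [28], and the actual statistical treatment of the 10% systematic uncertainty in the paper may differ.

```python
# Per-bin Asimov significance, with an optional background uncertainty,
# combined in quadrature over statistically independent bins.
import numpy as np

def asimov_significance(s, b, sigma_b=None):
    """s, b: expected signal and background yields per bin.
    sigma_b: absolute uncertainty on b (e.g. 0.1 * b for a 10% systematic)."""
    s, b = np.asarray(s, float), np.asarray(b, float)
    if sigma_b is None:
        z2 = 2.0 * ((s + b) * np.log1p(s / b) - s)
    else:
        sb2 = np.asarray(sigma_b, float) ** 2
        z2 = 2.0 * ((s + b) * np.log((s + b) * (b + sb2) / (b * b + (s + b) * sb2))
                    - (b * b / sb2) * np.log1p(sb2 * s / (b * (b + sb2))))
    return np.sqrt(np.maximum(z2, 0.0))

def combined_significance(s_bins, b_bins, sigma_b_bins=None):
    """Combine independent bins in quadrature."""
    z = asimov_significance(s_bins, b_bins, sigma_b_bins)
    return float(np.sqrt(np.sum(z ** 2)))
```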
2) Similarly to the two-step BDT, in the two-step DNN a second DNN is applied to the output of the first DNN. The first DNN is the previously described particle-based DNN, while a shallow dense NN is used as the second DNN. Since the first DNN has only five output nodes, representing the LL, TL, TT, qqZZ, and ggZZ scores, the second DNN is shallow, with 2 hidden layers. Unlike the two-step BDT, no cut is applied in any part of the two-step DNN. The expected significance obtained with this method is 1.47σ (1.38σ when systematic uncertainties are considered). The significances obtained with the other data preprocessing methods are listed in Table I.

3) The DNN-PCA uses only one particle-based DNN, the same model as in the first part of the two-step DNN, but with principal component analysis (PCA) applied to its outputs. The PCA algorithm rotates the original feature axes into new axes containing decorrelated features: given an n-dimensional target data distribution, the first principal component has the largest possible variance, and each succeeding component in turn has the highest possible variance under the constraint that it is orthogonal to the preceding components [29]. The five-dimensional output of the particle-based DNN is transformed into five principal components (a code sketch of this step is given at the end of this section). Distributions of the leading three principal components for each sample are shown in Fig. 5, and the explained variance ratios of the PCs are listed in Table II. Using PC1 as a discriminant, the signal significance is found to be 1.55σ (1.46σ including systematic uncertainties). We have investigated whether the signal significance can be enhanced by using a combination of PC1, PC2, and PC3. Fig. 6 shows two-dimensional histograms of PC1 and PC2 for the LL, TL, and qqZZ samples. Using the two-dimensional distribution of PC1 and PC2, a significance of 1.65σ is obtained (1.57σ when systematic uncertainties are considered). The three-dimensional distributions of PC1, PC2, and PC3 are shown in Fig. 7; when they are used to extract the signal, a significance of 1.74σ is achieved (1.66σ when systematic uncertainties are considered). This scenario gives the largest signal significance among all of the models we have investigated. Fig. 8 summarizes the expected significances obtained with the various algorithms we have studied.

In summary, unitarity restoration and possible new physics can be probed by measuring longitudinally polarized VBS, and the VBS ZZ → 4l channel, with its clean final state, is one of the most promising channels for this purpose. We have investigated the performance of several deep neural network structures in the task of extracting the longitudinally polarized component of ZZ scattering. The structure giving the best results is a particle-based DNN [14] working on pre-processed data, whose outputs are further decorrelated with principal component analysis and used in a multi-dimensional fit to separate the different kinds of backgrounds. Using fast simulation, an improvement of about 40% in the expected significance is found with respect to a previous study using a BDT. Assuming an integrated luminosity of 3000 fb⁻¹ at the HL-LHC and considering all backgrounds, a significance of 1.7 standard deviations is expected, which would increase to more than 2 standard deviations when CMS and ATLAS measurements are combined. The technique developed in this article is also useful for other LHC analyses involving helicity fraction measurements.
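As a final illustration, the PCA-based decorrelation of the five DNN scores and the construction of the binned templates used in the multi-dimensional fits of method 3) could look like the sketch below (an assumed scikit-learn workflow, not the authors' code; the sample and binning names are placeholders).

```python
# Minimal sketch: fit PCA on the five DNN output scores, take the leading
# principal components, and build per-process histograms for a template fit.
import numpy as np
from sklearn.decomposition import PCA

# scores_*: arrays of shape (n_events, 5) holding the (LL, TL, TT, qqZZ, ggZZ) scores
pca = PCA(n_components=5)
# pca.fit(scores_train)                      # fit on the training sample
# print(pca.explained_variance_ratio_)       # cf. Table II
# pcs_LL = pca.transform(scores_LL)[:, :3]   # leading three principal components

def make_template(pcs, weights, bins):
    """3D histogram of (PC1, PC2, PC3) used as a fit template for one process."""
    hist, _ = np.histogramdd(pcs, bins=bins, weights=weights)
    return hist
```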
Moreover, as shown in Fig. 1 [left], the longitudinally polarized VBS component can be sensitive to deviations in the Higgs couplings and also to dimension-8 anomalous gauge boson couplings. More detailed studies probing this new physics with machine learning are being examined, following methods reported recently [30,31].