Polarization measurement for the dileptonic channel of $W^+ W^-$ scattering using generative adversarial network

Measuring the polarization fractions of $W^+W^-$ scattering probes the interactions of the Higgs boson as well as new neutral states related to the electroweak symmetry breaking of the standard model. The dileptonic channel has a relatively low background rate, but the kinematics of its final state cannot be fully reconstructed because of the two neutrinos. We propose neural networks that establish maps between the distributions of measurable quantities and the distributions of the lepton angles in the $W$ boson rest frames. New physics contributions and the collision energy can strongly affect kinematic properties of $W^+W^-$ scattering other than the lepton angles. To make the network insensitive to that information, the loss function is modified in two different ways. We show that the networks are promising in reproducing the lepton angle distributions, and that the precision of the polarization fractions fitted from the network predictions is comparable to that obtained with the truth lepton angles. Although the best-fit values of the polarization fractions do not change much after including the background uncertainty, the precision is substantially reduced. Our trained models are available at GitHub.


I. INTRODUCTION
Vector Boson Scattering (VBS) [1-4] is a sensitive probe of any new physics that interacts with the electroweak sector of the Standard Model (SM). If the Higgs sector is extended or the couplings between the Higgs boson and the gauge bosons deviate from the SM predictions, the scattering amplitudes for the longitudinal mode of the VBS will increase with center-of-mass energy and violate unitarity.
At hadron colliders, the VBS processes result in final states with two gauge bosons and a pair of forward-backward jets. VBS channels have been observed at the LHC Run II, including the dileptonic same-sign $W^\pm W^\pm$ [5,6], fully leptonic $ZZ$ [7,8], fully leptonic $WZ$ [9,10], and semileptonic $WV/ZV$ with the $V$ decaying hadronically [11,12]. Investigating the polarization modes of the VBS processes is an important next step. The polarization of the vector bosons can be measured through their decay products. The interferences among different polarization channels vanish when the azimuthal angles of the decay products are integrated over. Although the selection cuts in analyses induce a certain amount of interference, it is still possible to extract polarization fractions by fitting the data with simulated templates. There have been many studies of polarization measurement for the $W^+W^-$ channel [13,14], the fully leptonic $W^\pm W^\pm$ channel [15], the fully leptonic $WZ/ZZ$ channel [16], as well as $WW/ZZ$ from the SM Higgs decay [17] and generic processes with a boosted hadronically decaying $W$ boson [18]. The CMS Collaboration studied the prospects for measuring the longitudinal modes of the $W^\pm W^\pm$ and $WZ$ channels at the future HL-LHC [19,20]. Some recent studies take advantage of deep learning techniques. Taking the final-state momenta as input, the network is able to either regress the lepton angle in the gauge boson rest frame [21,22] or classify events from different polarizations [23,24].
In this work, we study the polarization measurement for the dileptonic $W^+W^-$ channel, as it has a large production cross section at the LHC and is relevant to neutral scalar bosons. Although resolving the polarization of the hadronically decaying $W$ boson is possible [18], it suffers from large uncertainties and backgrounds. For $WW$ scattering with fully leptonic decay in the SM, the $W$ boson polarization fractions can be determined from the distributions of many kinematic variables [15,25], e.g. the transverse momenta of the leptons and the invariant mass of the two leptons. The studies in Refs. [23,24] use neural networks to discriminate different polarization modes of VBS processes, and the network output can be used to extract the fraction of each polarization mode. However, those methods use information that also depends on properties of the process other than the polarization. Thus they cannot be applied directly to VBS processes with significant beyond-SM (BSM) contributions. The most exclusive variable that characterizes the vector boson polarization is the angle between the charged lepton in the gauge boson rest frame and the gauge boson direction of motion (denoted by $\theta^*$ hereafter). Because of the two neutrinos in the final state, the lepton angles cannot be fully reconstructed. There have been attempts to use a neural network to regress the two lepton angles in the gauge boson rest frames [21,22] for same-sign $W^\pm W^\pm$ scattering.
We further develop the machine learning methods by adopting the Transformer network [26] and the generative adversarial network. The Transformer network is known to be quite successful in extracting polarization features for VBS processes [27]. The generative adversarial network is used to regress the distribution of the lepton angle $\theta^*$. As a result, the network can be used to measure the polarization fractions of $W^+W^-$ scattering in a wide class of models. For illustration, we apply the network to a simplified model with an effective operator and to the two-Higgs-doublet model (2HDM). In particular, the 2HDM contains an extra neutral Higgs boson that induces resonant $W^+W^-$ production, so the kinematic properties of the $W^+W^-jj$ final state in the 2HDM are quite different from the SM ones. We show that our network works well in regressing the lepton angles for both BSM scenarios. The polarization fractions can be extracted by fitting the distribution of the predicted lepton angles to a linear combination of pure longitudinal/transverse templates. However, to reduce the background events, some preselection cuts need to be applied before constructing the templates, and those cuts affect the shapes of the templates. We find that the shapes of the templates (for our preselection) are similar in different models but exhibit some dependence on the collision energy. This means that a different set of templates should be constructed for each collision energy.
This paper is organized as follows. The analysis framework is explained in Sec. II, including the setup of the network, event preparation, and fitting procedure. In Sec. III and Sec. IV, we study the performance of the network applied to different models and different collision energies. The effects of backgrounds are discussed in Sec. V. We summarize our work and conclude in Sec. VI.
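The definition of $\theta^*$ can be made concrete with a short truth-level sketch. It assumes four-momenta are given as `(E, px, py, pz)` tuples in GeV; the function names are illustrative and not taken from the paper's code. Because the $W$ momentum is needed, this calculation is only possible on Monte Carlo events, which is exactly why the angle has to be inferred by a network in data.

```python
import math

def boost_to_rest_frame(p, parent):
    """Boost four-vector p = (E, px, py, pz) into the rest frame of `parent`."""
    e_par, px_par, py_par, pz_par = parent
    beta = (px_par / e_par, py_par / e_par, pz_par / e_par)
    b2 = sum(b * b for b in beta)
    if b2 == 0.0:
        return tuple(p)  # parent is already at rest
    gamma = 1.0 / math.sqrt(1.0 - b2)
    e, px, py, pz = p
    bp = beta[0] * px + beta[1] * py + beta[2] * pz
    # Standard Lorentz boost: the spatial part shifts along the beta direction
    coef = (gamma - 1.0) * bp / b2 - gamma * e
    return (gamma * (e - bp),
            px + coef * beta[0],
            py + coef * beta[1],
            pz + coef * beta[2])

def cos_theta_star(lepton, w_boson):
    """Cosine of the angle between the lepton (in the W rest frame)
    and the W direction of motion (in the lab frame)."""
    _, lx, ly, lz = boost_to_rest_frame(lepton, w_boson)
    _, wx, wy, wz = w_boson
    dot = lx * wx + ly * wy + lz * wz
    norm = math.sqrt(lx * lx + ly * ly + lz * lz) * math.sqrt(wx * wx + wy * wy + wz * wz)
    return dot / norm
```

A quick consistency check is to build a lepton with a known angle in the $W$ rest frame, boost it to the lab frame, and confirm that `cos_theta_star` returns the input value.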

A. Definitions of loss functions and the network
The following issues need to be addressed in network construction:
• Because the information on the two neutrinos in the final state is missing, it is not possible to fully determine the two lepton angles in the rest frames of the $W$ bosons for each event. The lepton decay angles of events with the same values of observables (including the momenta of leptons and jets, as well as the missing transverse momentum) form a distribution. Our network is built to establish a map between the distributions of measurable quantities and the distributions of the lepton angles, based on a large number of events.
• Since we expect that the VBS process is affected by unknown new physics, the network should be able to extract the $W$ boson polarization for processes whose kinematic properties are quite different from the SM. This means that the features extracted by the network should be related only to $\theta^*$ and decorrelated from other process-dependent variables.
• Extracting the polarization fractions requires fitting to given templates. The shapes of the templates are affected by the preselection cuts; thus, they will differ between collision energies of a hadron collider. On the other hand, the network used to extract the polarization information needs to provide features that do not change with collision energy.
Because of the first issue, we cannot use the mean square error loss function, which can only reproduce the average value of the lepton angle distribution (for given values of the observables) and leads to a deviation between the truth-level distribution and the predicted distribution [21,22]. Events with the same measurable final-state momenta but different $\theta^*_\pm$ are grouped into subsets denoted by $\mathbf{e}_i$. The measurable momenta for the subset $\mathbf{e}_i$ are denoted by $\mathbf{p}_i$ (the same for all events in $\mathbf{e}_i$), and the set of $\theta^*_\pm$ for the events in $\mathbf{e}_i$ is denoted by $\mathbf{t}_i$. The goal of the network is to establish a map that maximizes the probability $P(\mathbf{t}_i|\mathbf{p}_i)$ while minimizing the probability $P(\mathbf{t}_i|\mathbf{p}_j)$ for $j \neq i$, where $i$ and $j$ run over all subsets. The loss function of the Conditional Generative Adversarial Network (CGAN) [28] meets these needs:
$$\min_G \max_D \; \mathbb{E}\left[\log D(\mathbf{t}_i|\mathbf{p}_i)\right] + \mathbb{E}\left[\log\left(1 - D(G(z|\mathbf{p}_i)|\mathbf{p}_i)\right)\right], \qquad \text{(II.1)}$$
where $z$ is sampled from a Gaussian distribution and $\mathbb{E}$ denotes the average over all events. For simplicity, the distribution of $(\theta^*_+, \theta^*_-)$ in the subset $\mathbf{t}_i$ is replaced by a two-dimensional Gaussian distribution $\mathbf{t}_i$, centered on the mean of $(\theta^*_+, \theta^*_-)$ with standard deviation 0.01. The discriminative network ($D$) evaluates the consistency between $\mathbf{p}_i$ and a lepton angle distribution. The generative network ($G$) aims to reproduce the $\mathbf{t}_i$ distribution from the inputs $z$ and $\mathbf{p}_i$. The GAN enables us to obtain the lepton angle by sampling instead of taking the average, and it transforms the random distribution $z$ into meaningful distributions based on the information obtained from the training samples.
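To illustrate how the two terms of a CGAN objective are evaluated in practice, the following minimal sketch computes the discriminator and generator losses from lists of discriminator outputs. It follows the standard formulation of Ref. [28] and is not the authors' training code; the function names and the small `eps` regulator are assumptions made for illustration.

```python
import math

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Negative of E[log D(t|p)] + E[log(1 - D(G(z|p)|p))].

    d_real: discriminator outputs in (0, 1) for truth (angles, condition) pairs.
    d_fake: discriminator outputs in (0, 1) for generated (angles, condition) pairs.
    """
    real_term = sum(math.log(d + eps) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d + eps) for d in d_fake) / len(d_fake)
    return -(real_term + fake_term)

def generator_loss(d_fake, eps=1e-12):
    """E[log(1 - D(G(z|p)|p))], which the generator tries to minimize."""
    return sum(math.log(1.0 - d + eps) for d in d_fake) / len(d_fake)
```

When the discriminator cannot tell generated angles from truth ones, all its outputs are 0.5 and the discriminator loss equals $2\ln 2$, the usual equilibrium value of this objective.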
To address the second problem, we adopt the Mutual Information (MI) to measure the nonlinear correlation between the features and the target variables. For any two sets of variables $X$ and $Y$, the MI is defined as
$$I(X;Y) = \int dx\, dy\; P(x,y)\, \log\frac{P(x,y)}{P_x(x)\, P_y(y)}, \qquad \text{(II.2)}$$
where $P(x,y)$ is the joint probability density function, and $P_x$ and $P_y$ are the marginal probability density functions. $I(X;Y)$ is larger if $X$ and $Y$ share similar information, while $I(X;Y) = 0$ if $X$ and $Y$ are independent of each other. However, the MI is difficult to calculate in practice. We instead use the following approximation to estimate the MI [29]:
$$I(X;Y) \geq \sup_{\omega}\; \mathbb{E}_{P(x,y)}\left[T_\omega\right] - \log\left(\mathbb{E}_{P_x P_y}\left[e^{T_\omega}\right]\right), \qquad \text{(II.3)}$$
where $T_\omega$ is an arbitrary function described by a neural network whose weights $\omega$ are trained to approach the supremum, i.e. the tightest bound on $I(X;Y)$. In our study, the loss function is written such that the MI between the two lepton angles $\theta^*_\pm$ and the features (the Transformer output, which has dimension 64 and will be discussed later) is maximized, while the MI between the $W$ boson pair momentum (including the invariant mass $m(WW)$, energy $E(WW)$, rapidity $y(WW)$, and azimuth $\phi(WW)$) and the features is minimized, to reduce the dependence of the network performance on the $W$ boson pair production mechanism. The term $L_{\rm MI} = I(F; WW) - I(F; \Theta^*)$ is therefore added to the classification loss of the Transformer network, where $F$, $WW$, and $\Theta^*$ denote the sets of feature variables, $W$ boson pair momentum, and two lepton angles, respectively. The $T_\omega$ networks in $I(F; WW)$ and $I(F; \Theta^*)$ are fully connected neural networks consisting of four layers with $[272 = 4 \times (64 + 4), 272, 272, 1]$ and $[264 = 4 \times (64 + 2), 264, 264, 1]$ neurons, respectively; the ReLU function acts on all layers except the last.
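The neural MI estimate can be illustrated with a small sketch that evaluates a Donsker-Varadhan-type lower bound for a fixed test function $T$. The assumption here is that the estimator of Ref. [29] has this form; in the actual analysis $T_\omega$ is a trained network, whereas below it is a hand-picked function, and samples from the product of marginals are approximated by shuffling the partner variable. All names are illustrative.

```python
import math
import random

def dv_lower_bound(t_func, joint_pairs, product_pairs):
    """E_joint[T] - log E_product[exp(T)] for a fixed function T(x, y)."""
    joint_term = sum(t_func(x, y) for x, y in joint_pairs) / len(joint_pairs)
    exp_mean = sum(math.exp(t_func(x, y)) for x, y in product_pairs) / len(product_pairs)
    return joint_term - math.log(exp_mean)

def shuffle_partner(pairs, rng):
    """Approximate samples from P_x * P_y by shuffling the y values."""
    ys = [y for _, y in pairs]
    rng.shuffle(ys)
    return [(x, y) for (x, _), y in zip(pairs, ys)]

# Toy data: y is fully determined by x, so the variables share information.
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(2000)]
correlated = [(x, x) for x in xs]
independent = shuffle_partner(correlated, rng)

# A constant T carries no information; a T aligned with the correlation
# yields a strictly positive lower bound on the mutual information.
bound_trivial = dv_lower_bound(lambda x, y: 0.0, correlated, independent)
bound_informed = dv_lower_bound(lambda x, y: 0.3 * x * y, correlated, independent)
```

Training $T_\omega$ amounts to searching for the function that makes this bound as large as possible; the Transformer is then pushed to raise the bound for $\Theta^*$ and lower it for $WW$.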
The transverse momenta of the particles in the final state are approximately linearly related to the initial collision energy. Thus we further add the Pearson Correlation Coefficient (PCC) to the loss function to reduce the dependence of the feature variables on the collision energy. The PCC is defined as
$$\rho_{XY} = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2}\, \sqrt{\sum_i (Y_i - \bar{Y})^2}}, \qquad \text{(II.4)}$$
where $i$ runs over all events, and $\bar{X}$ and $\bar{Y}$ denote the averages of the variables. In our case, the features from the Transformer output are taken as $X$, and the variable $Y$ stands for the transverse momenta of the $W$ bosons, the leptons, and the forward-backward jets.$^1$ Thus, $64 \times 8$ values of $\rho_{XY}$ can be calculated. The average $\bar{\rho}_{XY}$ is added to the loss of the Transformer network.
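As a minimal sketch of the PCC penalty, the code below computes $\rho_{XY}$ for every (feature, kinematic variable) pair and averages the results. Averaging the absolute values is an assumption made here so that positive and negative correlations do not cancel; the text only states that the average is added to the loss. The function names and the column-wise data layout are illustrative.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def mean_abs_pcc(feature_columns, kinematic_columns):
    """Average |rho_XY| over all feature/kinematic pairs.

    feature_columns: one list of per-event values per feature (64 in the text).
    kinematic_columns: one list per kinematic variable (8 in the text).
    """
    values = [abs(pearson(f, k)) for f in feature_columns for k in kinematic_columns]
    return sum(values) / len(values)
```

With 64 feature columns and 8 kinematic columns this loops over the $64 \times 8$ coefficients mentioned above.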
Having defined the loss function, we can construct a network based on the Transformer network and the CGAN. The Transformer network [26], with its multi-head self-attention mechanism, provides a variety of different attentions and improves the learning ability; thus, it can effectively extract the internal connections of the features. In Ref. [27], it was found to be efficient in extracting polarization information for $W^+W^-$ scattering in both the semileptonic and dileptonic channels. In this work, we adopt the same Transformer network as used there for the dileptonic channel. The low-level inputs (momenta of final-state particles) are transformed into a 64-dimensional feature variable, which is supposed to contain the full polarization information. In this study, the loss function of the Transformer network is modified according to the discussion above, using $L_{\rm MI}$ and $\bar{\rho}_{XY}$. The CGAN uses the feature as a condition and reproduces the two-dimensional lepton angle distribution. The Generator takes as input the condition (64 dimensions) and a 64-dimensional Gaussian distribution, and aims to regenerate the lepton angle distribution. The Discriminator takes as input the condition and $\Theta$ (or $\Theta^*$)$^2$ and determines whether the input $\Theta$ (or $\Theta^*$) is consistent with the condition.
More details of the data processing and the architecture of the network are depicted in Fig. 1. The upper-left plot illustrates the processing pipeline of the modified Transformer network. Note that the variables in the sets $\Theta^*$, $WW$, and PCC, which can only be calculated on Monte Carlo events, are used only for training. During the inference stage, only the measurable final-state momenta are required as inputs. The lower plots show the architectures of the generative network and the discriminative network. In both networks, the condition is processed by a dense network with eight layers. The outputs of the dense networks are reused multiple times in the network, as indicated by labels 1 and 2. The ResBlock was proposed to address the degradation problem [30] in training deep networks; we illustrate its decomposition in the upper-right plot. The StyleBlock combines the condition with its input by a convolution operation (for more detail, see Ref. [31]). The parameters NF, SF, and S in the one-dimensional convolution Conv1D[NF, SF, S] and in StyleBlock[NF, SF, S] are the number of filters, the filter size, and the stride, respectively. The MergeBlock multiplies the condition with the lepton angle information in matrix form. The LeakyReLU is an activation function that coincides with the ReLU for $x \geq 0$ and equals $0.1x$ for $x < 0$. The symbol $\oplus$ denotes the element-wise sum; ⓒ denotes concatenation along the channel dimension; upsampling×2 (downsampling/2) means using linear interpolation to double (halve) the dimension of the input.
$^1$ We also include the pseudorapidities of the $W$ bosons, although they are not linearly correlated with the collision energy.
$^2$ The predicted distribution of the lepton angles $\theta^*_\pm$ is denoted by $\Theta$, and the truth distribution of the lepton angles $\theta^*_\pm$ is denoted by $\Theta^*$.

B. Event simulation and network training
New physics models are implemented in FeynRules [32] (in our case, we consider the effective field theory [33] and the 2HDM). Events at the LHC are simulated within the MG5_aMC@NLO framework [34], including those with fixed helicities of the gauge bosons in the final state [25]. MadSpin [35] is used to preserve the polarization information in the decay products of the gauge bosons. Pythia8 [36] is used for the parton shower, hadronization, and decays of hadrons. The final-state jets are reconstructed by FastJet [37] using the anti-$k_T$ algorithm with cone size parameter $R = 0.4$. The detector effects are simulated by Delphes3 [38] with the ATLAS configuration card, where the b-tagging efficiency is set to 70%, and the mistagging rates for charm- and light-flavor jets are 0.15 and 0.008, respectively [39].
$W^+W^-$ scattering is simulated at $\mathcal{O}(\alpha_{\rm EW}^4)$ in the SM. There is also $W^+W^-jj$ production at $\mathcal{O}(\alpha_{\rm EW}^2 \alpha_s^2)$ with a much higher rate, but it does not correspond to VBS. It is treated as a background to $W^+W^-$ scattering, because the interference contributions at $\mathcal{O}(\alpha_{\rm EW}^3 \alpha_s)$ are found to be small [14,40,41]. For the BSM processes, the new physics coupling ($\alpha_{\rm NP}$) is assumed to be close to the electroweak coupling, and the processes at $\mathcal{O}(\alpha_{\rm EW}^a \alpha_{\rm NP}^b)$ with $a + b = 4$ are considered.
In order to separate signal and background events in the dileptonic channel, the following preselections are applied:
• exactly two opposite-sign leptons with $p_T(\ell) > 20$ GeV, $|\eta(\ell)| < 2.5$;
• at least two jets with $p_T(j) > 20$ GeV, $|\eta(j)| < 4.5$;
• the two jets with leading $p_T$ should give a large invariant mass ($m_{jj} > 500$ GeV) and have a large pseudorapidity separation ($|\Delta\eta|_{jj} > 3.6$);
• no b-tagged jet in the final state.
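The preselection can be sketched as a simple event filter. The event representation below (dictionaries with `pt`, `eta`, `charge`, `p4`, and `btag` keys) and the helper names are hypothetical, and the sketch assumes that only leptons and jets passing the kinematic thresholds are counted; the paper does not spell out how softer objects are treated.

```python
import math

def make_jet(pt, eta, phi, btag=False):
    """Build a massless jet record from (pT, eta, phi); momenta in GeV."""
    p4 = (pt * math.cosh(eta), pt * math.cos(phi), pt * math.sin(phi), pt * math.sinh(eta))
    return {"pt": pt, "eta": eta, "p4": p4, "btag": btag}

def invariant_mass(p4_a, p4_b):
    """Invariant mass of the sum of two (E, px, py, pz) four-vectors."""
    e, px, py, pz = (a + b for a, b in zip(p4_a, p4_b))
    return math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))

def passes_preselection(leptons, jets):
    # Exactly two opposite-sign leptons with pT > 20 GeV and |eta| < 2.5
    good_leptons = [l for l in leptons if l["pt"] > 20.0 and abs(l["eta"]) < 2.5]
    if len(good_leptons) != 2:
        return False
    if good_leptons[0]["charge"] * good_leptons[1]["charge"] >= 0:
        return False
    # At least two jets with pT > 20 GeV and |eta| < 4.5, ordered by pT
    good_jets = sorted((j for j in jets if j["pt"] > 20.0 and abs(j["eta"]) < 4.5),
                       key=lambda j: j["pt"], reverse=True)
    if len(good_jets) < 2:
        return False
    # No b-tagged jet in the final state
    if any(j["btag"] for j in good_jets):
        return False
    # VBS topology: large dijet mass and rapidity gap for the two leading jets
    j1, j2 = good_jets[:2]
    return invariant_mass(j1["p4"], j2["p4"]) > 500.0 and abs(j1["eta"] - j2["eta"]) > 3.6
```

For example, two back-to-back jets with $p_T$ of 100 and 80 GeV at $\eta = \pm 2.5$ have $m_{jj}$ above 1 TeV and $|\Delta\eta| = 5$, so an event with such jets and an opposite-sign lepton pair passes.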
The preselected events are used for training and testing the network. The network input consists of the momenta $(p_x, p_y, p_z, E)$ of the two leptons, the forward and backward jets, the vectorial sum of all detected particles, and the vectorial sum of the jets that are not assigned as forward-backward jets. The Transformer network that is used to extract the features of different polarizations is trained on events of SM $W^+W^-$ scattering (with given final-state polarizations) at the 13 TeV LHC. Moreover, as discussed in the previous subsection, the variables in the sets $\Theta^*$, $WW$, and PCC are calculated for each Monte Carlo event and used at the training stage. To show the performance gain from adding Eq. II.3 and Eq. II.4 to the loss function, three versions of the network are trained:
• the network with the normal Transformer loss function, denoted by TRANS;
• the network with $L_{\rm MI}$ added to the loss function, denoted by TRAMI;
• the network with both $L_{\rm MI}$ and $\bar{\rho}_{XY}$ added to the loss function, denoted by TMIPCC.
The CGAN is trained independently of the Transformer (with modified loss). It takes as input the condition provided by the well-trained TRANS, TRAMI, and TMIPCC networks, respectively. Events of $W^+W^-$ scattering in both the SM and BSM scenarios at several collision energies are used for training the CGAN. The BSM scenarios include the EFT and the 2HDM with several choices of benchmark parameters, as will be discussed later.
In Tab. I, we present the classification accuracies of the Transformer networks and the correlation information for the well-trained full networks (Transformer+CGAN). The classification accuracy reaches 44% for the TRANS network and decreases slightly in TRAMI and TMIPCC. However, the large enhancement (reduction) of MI($F;\Theta^*$) (MI($F;WW$)) in TRAMI indicates that the features have changed dramatically. The $\bar{\rho}_{XY}$ is effectively reduced in TMIPCC, although there is also a mild reduction in TRAMI. The correlation between the truth lepton angles and the predicted ones is much increased in TRAMI, although adding $\bar{\rho}_{XY}$ reduces the value by a small amount. We note that the absolute value of the MI is not meaningful; only the relative size has physical meaning.
For demonstration, we show in Fig. 2 the distributions of the lepton angles $\cos\theta^*_\pm$ predicted by the TRAMI network for different polarization modes of SM $W^+W^-$ scattering at 13 TeV. The truth-level $\cos\theta^*_\pm$ distributions are also presented for comparison. Ideally, the truth-level distributions for transversely and longitudinally polarized $W$ bosons are proportional to $(1 \pm \cos\theta^*)^2$ and $\sin^2\theta^*$, respectively. In practice, those shapes are distorted by the preselection cuts, especially around $\cos\theta^* \sim \pm 1$. We conclude that the TRAMI network reproduces the lepton angle distributions well, although its performance on the $W^+_L W^-_T / W^+_T W^-_L$ processes is slightly worse than that on the $W^+_L W^-_L / W^+_T W^-_T$ processes. The situation may change for different networks: in TMIPCC, the performance on the $W^+_L W^-_T / W^+_T W^-_L$ processes is improved, while that on the $W^+_L W^-_L$ process becomes worse. We provide the trained networks in the GitHub repository.
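The ideal shapes quoted above can be written as unit-normalized densities in $\cos\theta^*$, which is a convenient reference before any cuts are applied. The sketch below uses the normalizations $3/4$ and $3/8$ that follow from integrating $\sin^2\theta^*$ and $(1 \pm \cos\theta^*)^2$ over $[-1, 1]$; the function names are illustrative, and which sign belongs to which helicity depends on the lepton charge convention, so `sign` is left as a parameter.

```python
def longitudinal_pdf(c):
    """Ideal density for a longitudinal W: (3/4) * sin^2(theta*), with c = cos(theta*)."""
    return 0.75 * (1.0 - c * c)

def transverse_pdf(c, sign):
    """Ideal density for one transverse helicity: (3/8) * (1 + sign * c)^2, sign = +1 or -1."""
    return 0.375 * (1.0 + sign * c) ** 2

def mixture_pdf(c, f_long, f_plus, f_minus):
    """Polarization mixture; the three fractions are expected to sum to one."""
    return (f_long * longitudinal_pdf(c)
            + f_plus * transverse_pdf(c, +1)
            + f_minus * transverse_pdf(c, -1))

def integrate(pdf, n=2000):
    """Midpoint-rule integral of pdf over cos(theta*) in [-1, 1]."""
    h = 2.0 / n
    return sum(pdf(-1.0 + (k + 0.5) * h) for k in range(n)) * h
```

The preselection cuts deplete the regions near $\cos\theta^* = \pm 1$, which is why the fit uses simulated templates rather than these analytic shapes.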

C. Templates and fitting procedure
To obtain the polarization fractions, we need to fit the predicted two-dimensional $\cos\theta^*_+ - \cos\theta^*_-$ distribution to the predefined templates for the $W^+_L W^-_L$, $W^+_T W^-_L$, $W^+_L W^-_T$, and $W^+_T W^-_T$ polarizations. The templates are obtained by applying the network to events of SM $W^+W^-$ scattering with fixed final-state polarization. However, because of the preselections, the templates exhibit some dependence on the collision energy, so different sets of templates are needed for different collision energies. In Fig. 3, we plot the two-dimensional templates obtained from both the TRAMI network predictions and the truth-level lepton angles. The network predictions act as good proxies for the truth lepton angles. Having established the template for each polarization state ($T_i$, $i$ = LL, TL, LT, TT), given a two-dimensional $\cos\theta^*_+ - \cos\theta^*_-$ distribution $O$, we can perform a binned $\chi^2$ fit to estimate the fraction ($f_i$) of each polarization mode. Because of limited statistics, the two-dimensional lepton angle distribution is divided into $10 \times 10$ bins. Particle swarm optimization [42] is adopted to minimize the $\chi^2$ with the constraints $\sum_i f_i = 1$ and $f_i \in [0, 1]$.
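The constrained template fit can be sketched with a toy example. Two simplifications are made and should be read as assumptions: the $\chi^2$ uses the expected count in each bin as the variance (the text does not specify the form), and a deterministic scan over a grid on the simplex $\sum_i f_i = 1$ replaces the particle swarm optimization used in the paper. The four-bin templates are invented for illustration only.

```python
def chi2(observed, templates, fractions, total):
    """Binned chi-square between observed counts and a mixture of unit-normalized templates."""
    value = 0.0
    for b, obs in enumerate(observed):
        expected = total * sum(f * t[b] for f, t in zip(fractions, templates))
        if expected > 0.0:
            value += (obs - expected) ** 2 / expected
    return value

def fit_fractions(observed, templates, steps=20):
    """Scan all (f_LL, f_LT, f_TL, f_TT) on a grid with sum f_i = 1 and 0 <= f_i <= 1."""
    total = sum(observed)
    best = None
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            for k in range(steps + 1 - i - j):
                l = steps - i - j - k
                fractions = (i / steps, j / steps, k / steps, l / steps)
                value = chi2(observed, templates, fractions, total)
                if best is None or value < best[0]:
                    best = (value, fractions)
    return best

# Toy, linearly independent templates in the order LL, LT, TL, TT (hypothetical shapes)
TEMPLATES = [
    [0.1, 0.4, 0.4, 0.1],
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.2, 0.3, 0.4],
    [0.7, 0.1, 0.1, 0.1],
]
truth = (0.1, 0.2, 0.3, 0.4)
n_events = 10000.0
observed = [n_events * sum(f * t[b] for f, t in zip(truth, TEMPLATES)) for b in range(4)]
best_chi2, best_fractions = fit_fractions(observed, TEMPLATES)
```

A grid scan is only practical for a handful of parameters at coarse resolution; a stochastic optimizer such as particle swarm handles the same constraints with finer resolution on the 100-bin problem.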

III. TEST ON THE DIFFERENT MODELS AT 13 TEV
A. The $W^+W^-$ polarization in the SM and EFT
We first apply our network to SM $W^+W^-$ scattering at the 13 TeV LHC. The one-dimensional lepton angle distributions and the fitted polarization fractions obtained from the three networks are shown in Fig. 4. The left panels compare the truth-level $\cos\theta^*_\pm$ distributions with the network outputs, where we have projected the two-dimensional $\cos\theta^*_+ - \cos\theta^*_-$ distributions onto each component for visibility. The middle and right panels show the $\Delta\chi^2 = 1$ contours on the $f_{LT} - f_{TL}$ plane and the $f_{LL} - f_{TT}$ plane for integrated luminosities of 30 ab$^{-1}$ (which may not be realistic) and 3 ab$^{-1}$. We find that the networks reproduce the distributions of the truth-level lepton angles well. In particular, the TRAMI network, whose features focus on $\cos\theta^*_\pm$ and are decorrelated from the momentum of the $W$ boson pair, has almost the same reconstruction precision as the truth $\cos\theta^*_\pm$; i.e., the sizes of the contours are similar. For the TMIPCC network, while the precision of the $f_{LT}$ and $f_{TL}$ fractions is similar to that obtained with the truth $\cos\theta^*_\pm$, the precision of both $f_{LL}$ and $f_{TT}$ is worse, partly because of the difficulty in training the network with a more complex loss function. Overall, given the cross section of SM $W^+W^-$ scattering after preselection cuts of 4.36 fb, each polarization fraction can be resolved to within about 0.2 at the LHC for an integrated luminosity of 3 ab$^{-1}$.
$W^+W^-$ scattering could be affected by any new physics that is related to the electroweak symmetry breaking of the SM. A general framework to describe the new physics effects is the effective field theory (EFT). To study the network performance in new physics models, following the strategy discussed in Ref. [27], we consider the operator of Refs. [43,44] with coefficient $\bar{c}_H$, in which $\Phi$ is the Higgs doublet and $h$ denotes the SM Higgs boson field.
This operator modifies the Higgs couplings. Although the updated global fit requires $\bar{c}_H \lesssim 0.4$ [45], we apply our network to the case with $\bar{c}_H = -1$ to illustrate the network performance when the new physics contributions are sizable. The results are given in Fig. 5. Compared to the SM case, the fraction of longitudinal $W$ bosons is greatly enhanced because of the incomplete cancellation in $W_L W_L \to W_L W_L$ scattering. However, the overall kinematic properties and the total production cross section (4.82 fb after preselection for $\bar{c}_H = -1$) of $W^+W^-$ scattering in the EFT with non-zero $\bar{c}_H$ are similar to the SM ones, so the performance of the three networks on the EFT is similar to that on the SM shown in Fig. 4. The one-dimensional lepton angle distributions predicted by all three networks match the truth lepton angle distributions well. Among the three networks, TRAMI performs best; its discrimination power is quite close to that of the truth $\cos\theta^*_\pm$. In the TMIPCC network, on the other hand, only the precision of $f_{LT}$ and $f_{TL}$ is comparable to that obtained with the truth $\cos\theta^*_\pm$.
Many new physics models predict light states mediating $W^+W^-$ scattering, the effects of which cannot be fully described in the EFT. We consider the type-II 2HDM [46,47] as a benchmark model, as it features both a modification of the Higgs couplings and the existence of another scalar mediator (besides the SM Higgs) in $W^+W^-$ scattering. There are six parameters: the masses of the scalar bosons ($m_{H_1}$, $m_{H_2}$, $m_A$, and $m_{H^\pm}$), the mixing angle $\alpha$ between the two CP-even scalars, and the ratio $\tan\beta$ between the two vacuum expectation values. The mass $m_{H_1}$ has been measured to be around 125 GeV (we only consider the case of $H_1$ being the SM-like Higgs boson). The masses $m_A$ and $m_{H^\pm}$ are not relevant to $W^+W^-jj$ production, assuming they are much larger than $m_{H_2}$.
The couplings of the CP-even scalars to the $W$ boson depend on $\alpha$ and $\beta$ only through the combination $\alpha - \beta$. The value of $\tan\beta$ alone does not affect the $HWW$ coupling, although it can influence $WW$ scattering indirectly by changing the decay width of $H_2$. We fix $\tan\beta = 5$ without loss of generality and only need to deal with two free parameters: $m_{H_2}$ and $\sin(\alpha - \beta)$. In Fig. 6, we show the projected one-dimensional lepton angle ($\cos\theta^*_\pm$) distributions and the $\Delta\chi^2$ contours on the polarization fraction planes for $W^+W^-$ scattering in the 2HDM with $m_{H_2} = 300$ GeV and sin(α) = 0.7 at the 13 TeV LHC. Because of the resonant contribution from the $H_2$, the cross section of $W^+W^-$ scattering is increased to 8.362 fb. As in the SM and EFT, the lepton angle distributions are reproduced well by all three networks. However, the precision of the polarization fractions obtained from the TRANS network is not as good as that obtained from the truth lepton angles. This is because the features in the TRANS network (which is trained only with SM events) contain SM kinematic information (in particular, the invariant mass of the $W$ boson pair). The differences between the kinematic properties of the 2HDM and the SM degrade the performance. The situation is much improved for the TRAMI network, in which the information on the $W$ boson pair momentum is decorrelated from the features.
To illustrate the performance in a more extreme case, we apply the networks to $W^+W^-$ scattering that proceeds solely through a heavy resonance. Both the polarization pattern and the kinematic features are dramatically different from the SM ones. Assuming the mediator is a scalar boson, only the polarization modes $W_L W_L$ and $W_T W_T$ are allowed. The ratio $f_{LL}/f_{TT}$ is about 67 for a scalar mass around 400 GeV and increases quickly for a heavier scalar. In Fig. 7, we compare the truth-level $\cos\theta^*_\pm$ distributions with the network outputs for $W^+W^-$ scattering through a heavy scalar resonance at the 13 TeV LHC. As expected, the TRAMI network performs much better than the TRANS network, since TRAMI is designed to be insensitive to the $WW$ production mechanism. The lepton angle distributions reconstructed by the TRAMI network remain close to the truth ones for resonance masses below about 1 TeV. The heavier the resonance, the larger the deviation between the network predictions and the truth values.
Up to this point, we have not found it necessary to add the PCC to the loss function, as the TMIPCC network does not match the performance of the TRAMI network in any of the cases above. This is mainly because the training samples for the Transformer network are generated at 13 TeV, and decorrelating the collision energy is not necessary if the $W$ polarization fractions are extracted at the same collision energy. In this case, the TRAMI network performs best and is recommended. However, if we want to apply the same network to processes at a different collision energy, removing the collision energy dependence becomes essential. In the next section, we further study the network performance at a different collision energy, taking $W^+W^-$ scattering in the 2HDM at 100 TeV as an example.

IV. THE $W^+W^-$ POLARIZATION IN 100 TEV p-p COLLISION
As discussed above, although the distributions of $\theta^*_\pm$ are supposed to be related only to the $W$ boson polarization, the effects of the preselection cuts, which distort the lepton angle distributions, depend on the collision energy. The same preselection cuts as proposed in Sec. II B for 13 TeV are also adopted here. Fig. 8 shows the two-dimensional templates on the $\cos\theta^*_+$-$\cos\theta^*_-$ plane for the TMIPCC network prediction and the truth lepton angles at 100 TeV. Comparing with Fig. 3, we find that the distributions of the truth lepton angles vary with collision energy. In particular, the effects of the preselection cuts are milder at 100 TeV, leading to slightly sharper lepton angle distributions at the truth level. On the other hand, the templates from the network predictions become less precise in the 100 TeV case, even after applying both the MI and the PCC in the loss function. This is attributed to the fact that the Transformer networks are trained only on events at the 13 TeV LHC. The results at 100 TeV could possibly be improved by training the Transformer networks on 100 TeV event samples.
The projected one-dimensional $\cos\theta^*_\pm$ distributions, as well as the polarization fractions fitted from the network predictions and the truth lepton angles, are presented in Fig. 9. $W^+W^-$ scattering in the 2HDM with $m_{H_2} = 300$ GeV and sin(α) = 0.7 at 100 TeV is taken as an example. After preselection, the production cross section for the process is 148.74 fb. Unlike in the 13 TeV case, the reduced performance of the TRANS network is now visible in the $\cos\theta^*_\pm$ distribution. As for the polarization fractions, the TRAMI network no longer works well, and decorrelating the collision energy dependence is essential. The TMIPCC network outperforms TRAMI in the 100 TeV case, although there is still a certain amount of deviation between the predicted fractions and the truth ones.
The meanings of the plots are the same as in Fig. 4, except that the different shades of the $\Delta\chi^2$ contours, from the inside out, correspond to $\Delta\chi^2 = 1$ calculated on datasets with integrated luminosities of 3 ab$^{-1}$, 1 ab$^{-1}$, and 500 fb$^{-1}$, respectively.
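The post-preselection cross sections quoted in Secs. III and IV translate directly into expected signal yields, which set the statistical scale of the fits. The following sketch does this arithmetic using $N = \sigma \times L$ with 1 ab$^{-1}$ = 1000 fb$^{-1}$; the helper names are illustrative, and no additional efficiencies are applied because the cross sections are already quoted after preselection.

```python
import math

FB_INV_PER_AB_INV = 1000.0  # 1 ab^-1 = 1000 fb^-1

def expected_events(xsec_fb, lumi_ab):
    """Expected event count for a cross section in fb and a luminosity in ab^-1."""
    return xsec_fb * lumi_ab * FB_INV_PER_AB_INV

def relative_stat_uncertainty(n_events):
    """Poisson relative uncertainty on a total count, 1/sqrt(N)."""
    return 1.0 / math.sqrt(n_events)

# Cross sections after preselection, as quoted in the text (fb)
yields = {
    "SM, 13 TeV, 3 ab^-1": expected_events(4.36, 3.0),
    "EFT c_H = -1, 13 TeV, 3 ab^-1": expected_events(4.82, 3.0),
    "2HDM, 13 TeV, 3 ab^-1": expected_events(8.362, 3.0),
    "2HDM, 100 TeV, 3 ab^-1": expected_events(148.74, 3.0),
}
```

The SM signal at 13 TeV corresponds to roughly 1.3 × 10^4 events at 3 ab$^{-1}$, spread over the $10 \times 10$ bins of the fit, while the 100 TeV sample is more than an order of magnitude larger.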

V. SUBTRACTING THE BACKGROUNDS: SM AS A CASE STUDY
So far, we have only considered the application of the networks to the W + W − scattering processes. In practice, there will be events of non-VBS processes that pass the preselections, behaving as backgrounds in our analysis. As a result, we can only obtain the superposed distribution of cos θ * + -cos θ * − , from which the contributions from the background processes need to be subtracted out before applying the fit to the templates. However, due to the uncertainties in the backgrounds simulation, the background subtraction can not be perfect. This will lead to reduced precision in extracting the polarization fractions.
Since we are considering the dileptonic channel of the W + W − scattering, the dominant background processes are the dileptonic tt and tW processes, mixed electroweak-QCD W + W − jj production, as well as the W Zjj production (both at orders of O(α 4 EW ) and O(α 2 EW α 2 s )) with gauge bosons decaying leptonically. The production cross sections at 13 TeV for the simulated background events before (σ fid ) and after (σ ) the preselection cuts are listed in Tab. II. For diboson processes, the transverse momenta of final state jets are required to be greater than 20 GeV. We will use the measured inclusive cross sections at the LHC for the tt [48] and tW [49] processes, and use the leading order cross sections which are calculated by MG5_aMC@NLO for the diboson processes. We note that background events are simulated with at least one lepton in the final state because there could be a misidentified fake lepton due to detector effects. The production cross sections of background processes before and after preselections at the 13 TeV LHC. The superscripts EW and QCD denote the processes at order of O(α 4 EW ) and O(α 2 EW α 2 s ), respectively. The subscript denotes the leptonic decay of that particle. With background contamination, we adopt the results from the TRAMI network to extract the W + W − polarization fractions for the SM production at the 13 TeV LHC. The results are shown in Fig. 10 with varying uncertainties in background subtraction. We have assumed uncorrelated systematic uncertainties for the event numbers in lepton angle bins (10 × 10 on the cos θ * + − cos θ * − plane). The size of the systematic uncertainty in each bin is indicated in the legend. The left panel shows the projected lepton angle (cos θ * ± ) distributions given by the summed templates with the best-fitted fractions, as well as that obtained at the truth level. 
Since the results with three levels of background uncertainties have similar best-fit values, the lepton angle distributions are similar for all three cases. However, because the total background cross section after preselection is around two orders of magnitude larger than the signal cross section, the uncertainties of the fitted fractions are very sensitive to the background uncertainties; i.e., the size of the ∆χ² = 1 contour is substantially enlarged for increasing background uncertainty. The precision of the extracted fractions is promising only if the uncertainty in background subtraction can be controlled at the 0.1% level. Note that this uncertainty can be much smaller than that of the total cross section. For larger systematic uncertainties of the background, more refined cuts are necessary. In that case, the template for each polarization should be adjusted accordingly, and the χ² fit should be done on the (network-predicted) lepton angle distribution after cuts. Moreover, with more stringent cuts, a larger number of background events needs to be simulated in order to guarantee relatively small statistical uncertainties in the analysis. Since the main point of this paper is to reproduce the lepton angle distribution, we leave those more involved analyses for future work.
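The sensitivity of the ∆χ² = 1 range to the background uncertainty can be illustrated with a toy template fit. The following is a minimal sketch under stated assumptions, not the fitting code used for Fig. 10: the template shapes (products of 1 − cos²θ and 1 + cos²θ), the total signal yield, and the flat background are hypothetical, the signal normalization is taken as known, and the fractions are constrained to sum to one by eliminating the last one. Because the model is linear in the fractions, the ∆χ² = 1 half-widths follow from the inverse of the weighted normal matrix.

```python
import numpy as np

def fit_fractions(n_obs, bkg, templates, n_sig, rel_syst):
    """Weighted least-squares fit of polarization fractions (sum fixed to 1).

    n_obs     : observed counts per bin (signal + background), shape (n_bins,)
    bkg       : expected background per bin, shape (n_bins,)
    templates : unit-normalized template per polarization, shape (n_pol, n_bins)
    n_sig     : total signal yield, assumed known in this sketch
    rel_syst  : relative, bin-uncorrelated systematic uncertainty on bkg
    Returns all n_pol fractions and the delta-chi2 = 1 half-widths of the
    first n_pol - 1 (free) fractions.
    """
    n_obs = np.asarray(n_obs, dtype=float)
    bkg = np.asarray(bkg, dtype=float)
    templates = np.asarray(templates, dtype=float)
    data = n_obs - bkg                           # background-subtracted yields
    w = 1.0 / (n_obs + (rel_syst * bkg) ** 2)    # inverse per-bin variance
    last = templates[-1]
    A = n_sig * (templates[:-1] - last).T        # design matrix after eliminating f_last
    y = data - n_sig * last
    cov = np.linalg.inv(A.T @ (w[:, None] * A))  # covariance of the free fractions
    f_free = cov @ (A.T @ (w * y))
    return np.append(f_free, 1.0 - f_free.sum()), np.sqrt(np.diag(cov))

# Toy 10 x 10 templates on the (cos theta+, cos theta-) plane; shapes are hypothetical.
edges = np.linspace(-1.0, 1.0, 11)
x = 0.5 * (edges[:-1] + edges[1:])
lon = 1.0 - x**2
lon /= lon.sum()
tra = 1.0 + x**2
tra /= tra.sum()
templates = np.stack([
    np.outer(lon, lon).ravel(),                                # LL-like
    0.5 * (np.outer(lon, tra) + np.outer(tra, lon)).ravel(),   # LT + TL-like
    np.outer(tra, tra).ravel(),                                # TT-like
])
true_f = np.array([0.1, 0.3, 0.6])
n_sig = 1000.0
bkg = np.full(100, 1000.0)                    # total background = 100 x signal
n_obs = n_sig * true_f @ templates + bkg      # Asimov (fluctuation-free) data

for rel_syst in (0.001, 0.01, 0.05):
    f, err = fit_fractions(n_obs, bkg, templates, n_sig, rel_syst)
    print(rel_syst, f, err)
```

With fluctuation-free toy data the best-fit fractions are unchanged by the assumed background uncertainty, while the half-widths grow with it, mirroring the qualitative behavior discussed above.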

VI. CONCLUSION
We propose networks composed of a Transformer network and a CGAN to predict the distributions of the angles between the charged leptons in the gauge boson rest frames and the gauge boson directions of motion for the dileptonic channel of W + W − scattering, so that the polarization fractions of the W + W − final state can be obtained by fitting the predicted lepton angle distribution to the given templates.
There could be unknown new physics contributing to the W + W − scattering, which may lead to dramatically different kinematic properties for the final states. To ensure that the network is able to predict the lepton angle distribution precisely, irrespective of the W + W − production mechanism, the loss function of the Transformer network is modified with MI and PCC as defined in Eq. II.3 and Eq. II.4, so that the features produced by the Transformer network contain as much lepton angle information as possible while being decorrelated from other kinematic variables. For comparison, three different versions of networks are trained, denoted by TRANS, TRAMI, and TMIPCC.
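To make the decorrelation idea concrete, the sketch below computes a generic Pearson-correlation penalty between a batch of features and a variable to be decorrelated (for example, a label of the collision energy). It is an illustration only: the exact form of the PCC term is the one defined in Eq. II.4, which is not reproduced here, and the function names `pearson_cc` and `decorrelation_penalty` as well as the averaging over feature columns are assumptions. In training, such a term would be written in a differentiable framework and added to the loss with a weight.

```python
import numpy as np

def pearson_cc(x, y, eps=1e-12):
    """Pearson correlation coefficient between two 1D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    # eps guards against division by zero for constant inputs
    return float((xc * yc).sum() / (np.sqrt((xc**2).sum() * (yc**2).sum()) + eps))

def decorrelation_penalty(features, nuisance):
    """Mean absolute PCC between each feature column and the nuisance variable.

    features : array of shape (n_events, n_features)
    nuisance : array of shape (n_events,), the variable to decorrelate from
    """
    features = np.asarray(features, dtype=float)
    return float(np.mean([abs(pearson_cc(features[:, j], nuisance))
                          for j in range(features.shape[1])]))

# Toy batch (hypothetical): the first feature tracks the nuisance variable exactly,
# the second is uncorrelated with it by construction.
nuisance = np.array([1.0, 2.0, 3.0, 4.0])
features = np.column_stack([2.0 * nuisance, np.array([1.0, -1.0, -1.0, 1.0])])
penalty = decorrelation_penalty(features, nuisance)
print(penalty)
```

A penalty close to zero indicates that, at the linear level, the features carry no information about the nuisance variable; nonlinear dependence is not captured by this measure.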
To illustrate the performance of the networks, we apply them to the events of W + W − scattering with dileptonic decay in the SM, in the EFT with non-zero c H , as well as in the 2HDM with chosen benchmark points. The results are summarized in Tab. III. The TRAMI network performs best at 13 TeV for all models, as its features have been trained to focus on the lepton angle while being insensitive to the W boson pair production mechanism. The fitting precision of the polarization fractions based on the TRAMI predictions is quite similar to that obtained using the truth lepton angle, except for f T T in the 2HDM with m H 2 = 300 GeV and sin(α − β) = 0.7, where there is a certain amount of deviation, mainly due to the remaining information of the kinematic variables in the features of the TRAMI network. The 1σ ranges for the fitted fractions are around 0.2-0.3 for an integrated luminosity of 3 ab −1 . When the networks are applied to the events at 100 TeV, the reduced performance of the TRANS and TRAMI networks becomes visible in the projected one-dimensional lepton angle cos θ * ± distributions. The situation is much improved for the TMIPCC network, in which the decorrelation with collision energy is performed, although there are still mild deviations between the polarization fractions obtained from the TMIPCC network and from the truth lepton angle. Benefiting from the increased production rate at higher collision energy, the 1σ ranges for the fitted fractions can reach ∼ 0.1 (0.05) for an integrated luminosity of 1 ab −1 (3 ab −1 ).
In practice, the opposite-sign dileptonic channel of W + W − scattering suffers from backgrounds of dileptonically decaying tt̄, tW , mixed electroweak-QCD W + W − jj, as well as W Zjj production. Once the uncertainty in background subtraction is taken into account, the fitting precision of the polarization fractions is substantially reduced, mainly due to the relatively small signal-to-background ratio after applying the preselections.