Deep neural network application: Higgs boson CP state mixing angle in H → ττ decay and at the LHC

The consecutive steps of cascade decay initiated by H → ττ can be useful for the measurement of Higgs couplings and in particular of the Higgs boson parity. In the previous papers we have found that multidimensional signatures of the τ (cid:1) → π (cid:1) π 0 ν and τ (cid:1) → 3 π (cid:1) ν decays can be used to distinguish between scalar and pseudoscalar Higgs state. The machine learning techniques (ML) of binary classification, offered break-through opportunities to manage such complex multidimensional signatures. The classification between two possible CP states: scalar and pseudoscalar, is now extended to the measurement of the hypothetical mixing angle of Higgs boson parity states. The functional dependence of H → ττ matrix element on the mixing angle is predicted by theory. The potential to determine preferred mixing angle of the Higgs boson events sample including τ -decays is studied using deep neural network. The problem is addressed as classification or regression with the aim to determine the per-event: (a) probability distribution (spin weight) of the mixing angle; (b) parameters of the functional form of the spin weight; (c) the most preferred mixing angle. Performance of proposed methods is evaluated and compared.


I. INTRODUCTION
Machine learning (ML) techniques find increasing number of applications in high energy physics (HEP) phenomenology. Being used at Tevatron and LHC experiments, they have became an analysis standards. For the recent reviews see, e.g., [1][2][3]. The most common approach is via classification routines, however the impact of regression methods is not negligible as well. Let us point to two such examples in LHC experimental analysis. The measurement of polarization fractions in WW pair production using deep neural network (DNN) [4] explores both; the classification [5] and regression [6] approaches. The regression technique is also used in [7] for parton distribution functions.
In this paper we present how ML techniques can be helpful to exploit substructure of the hadronically decaying τ leptons in the measurement of the Higgs boson CP-state mixing angle in H → ττ decay. This problem has a long, decades long, history [8,9] and was studied both for electron-positron [10,11] and for hadron-hadron [12,13] colliders. Despite these efforts, the Higgs boson CP state was for the first time measured at LHC, from H → ττ decay only recently see preliminary document [14]. Until that reference the ML has not been even presented for the analysis design, contrary to the classical experimental analysis strategies, see, e.g., [15] for High Luminosity LHC. One of the reasons is that ML adds complexity to the data analysis. ML solutions need to be investigated in context of their suitability for work on systematic ambiguities.
In [14] only some of the τ decay modes are explored. In particular impact parameter is used for the τ AE → π AE ν and τ AE → μ AE νν and secondary decays of τ → 3πν are only in part used. This is because complexity of the signatures and its multi-dimensional nature. In principle all τ decay modes have the same sensitivity to spin [16] necessary to constrain Higgs CP state. It requires precise reconstruction of its all decay products, including neutrinos and knowledge of τ decay matrix elements. In [14] reconstruction of all decay products separately in τ → 3π decays is not attempted for example. Relatively simple observable for H → ττ → νρνρ which can be understood with the help of single optimal variable [17], for other cases evolve into multitude of Higgs CP sensitive variables. For the case of H → τ − τ → ν3π − ν3π that would be 16 variables even if neutrino momenta were not taken into account.
That variables cannot be called optimal, but only optimallike ones, but may be used in studies of systematic ambiguities. Already in [18] we have investigated such variables. See Fig. 3 there.
Obviously ν τ escapes direct detection, that is why our paper documents a step of the path, this time toward the measurement of the CP mixing angle and not, as in our previous papers, toward distinguishing two hypotheses of the parity states. We include neutrino momenta, as if they were well accessible, in the feature list used to evaluate options of ML techniques in CP mixing angle measurement. This is obviously an idealistic case, but very useful for exploring potential of the ML techniques. We do not address issues of systematic ambiguities due to reconstruction of neutrino momenta, but we rely on Ref. [19] which was dedicated to the discussion on different approximations and their impact on the ML performance in discriminating between two CP mixing hypotheses. Those results indicate that approximate knowledge on the neutrino momenta can be used for that purpose too. In practice, neutrino momenta need to be reconstructed from energy-momentum conservation of the whole event (in pp case transverse components only) and of the τ-lepton and Higgs boson mass constraints. Even that turns out not to be sufficient. In Ref. [19], devoted to distinction of Higgs CP parity states, we have exploited fact that neutrinos orientation angles can be reconstructed with the help of τ decay vertex position. Even more promising, the τ lepton direction can be reconstructed from τ decay vertex position and ansatz on τ lepton time of flight and then neutrinos momenta can be inferred. The later case turned out to be the best performing of all studied approximations on ν τ kinematics.
Identification of neutrinos orientation angles is helpful to develop intuition on the observable structure, also as it is the most difficult to grab, to discuss systematic in contributions from τ → 2π and τ → 3π decay modes. Such studies are of a value to cross check ML results. They may be useful in evaluation of systematic ambiguities, it is more suitable to evaluate systematic ambiguities for one dimensional distributions. Note that in Ref. [14] there were no attempts to use τ decay position vertices in case of τ → 2π and τ → 3π decay modes. The identification and studies of one dimensional distributions important for sensitivity, may be useful in evaluation of systematic ambiguities, it is more standard to evaluate systematic ambiguities for one dimensional distributions. It was a necessary preliminary step for our present study. Obviously they will need, as it was mentioned in Ref. [19], to be repeated with detailed detector response. Not only for momenta smearing, but also for background. The second point will become more important once statistics will grow and systematic errors will get of more concern, than in prototype studies.
On the other hand, theoretical basis for the measurement is simple, the cross section dependence on the parity mixing angle has the form of the first order single angle trigonometric polynomial. It can be implemented in the Monte Carlo simulations as per event spin weight wt, see [20] for more details. In [18,19] we have performed analysis for the three channels of the τ lepton-pair decays, respectively ρ AE ν τ ρ ∓ ν τ , a AE 1 ν τ ρ ∓ ν τ , and a AE 1 ν τ a ∓ 1 ν τ but we limited ourselves to the scalar-pseudoscalar classification case. In the scope of our interest was the kinematics of outgoing decay products of the τ leptons and geometry of decay vertices.
With these concerns in mind, in the following we extend our previous work on the physics of the Higgs CP parity scalar/pseudoscalar classification, to a measurement of scalar-pseudoscalar mixing angle ϕ CP of the Hττ coupling. We do not intend to investigate possible extensions the Standard Model and avoid discussion on the motivations. We constrain ourselves to the measurement of the coupling and the simplest channel of H → τ þ τ − → ρ þ ν τ ρ − ν τ → π þ π 0 ν τ π − π 0 ν τ decay. and focus on comparative studies for potential of different ML techniques.
Possible solutions are analyzed with deep neural network (DNN) algorithms [4] implemented in Tensorflow environment [21] which we have previously found working well for the binary classification [18,19] between scalar or pseudoscalar Higgs boson variants (which correspond to ϕ CP ¼ 0 and ϕ CP ¼ π=2). Our goals for the DNN algorithms are to determine per event: (i) Spin weight as a function of the mixing angle.
(ii) Decay configuration dependent coefficients, for the known functional form of the spin weight distribution. (iii) The most preferred mixing angle, i.e., where the spin weight is at a maximum. These goals are complementary or even to large extent redundant, e.g., with functional form of the spin weight we can easily find the mixing angle at which it has a maximum. But the precision of predicting that value would not be necessarily the same for different methods. All three cases are studied as classification and as regression problems. By this we mean, that the underlying DNN cost functions is either designed for classification or regression tasks. We show quantitative comparison of the performance of DNN learning on the distributions which are relevant for physics analyses and then draw some conclusions.
Paper is organized as follows. In Sec. II we describe a basic phenomenology of the problem. Properties of the matrix elements and distributions at the level of final, measurable quantities as well as unmeasurable quantities are presented. In Sec. III we review lists of features (variables) used as an input to DNN and present samples prepared for analyses. As a straightforward extension of [18,19], still using binary classification, we analyze possibility to distinguish between scalar and arbitrary mix of scalar/pseudoscalar states. This study is covered in Sec. IV. The multi-class classification approach is covered in Sec. V. The regression approach is discussed in Sec. VI.
The comparison of the classification and regression is covered in Sec. VII. Observations relevant for the future studies of systematic errors are addressed. The summary, Sec. VIII, closes the paper.
In Appendix more technical details on the DNN architecture are given together with arguments supporting such a choice. In addition, we describe briefly the data preprocessing chain.

II. PHYSICS CONTENT OF THE PROBLEM
The most general Higgs boson Yukawa coupling to τ lepton pair, expressed with the help of the scalarpseudoscalar parity mixing angle ϕ CP reads as where N denotes normalization, h Higgs field andτ, τ spinors of the τ þ and τ − . As we will see later, this simple analytic form translates itself into useful properties of observable distributions convenient for our goal, determination of the ϕ CP . Recall of the definitions is thus justifiable, and helpful to systematize properties and correlations of the observable quantities (features). The matrix element squared for the scalar/pseudoscalar/ mix parity Higgs, with decay into τ þ τ − pairs can be expressed as where h AE denote polarimetric vectors of τ decays (solely defined by τ decay matrix elements) and R i;j density matrix of the τ lepton pair spin state. In Ref. [22] details of the frames used for R i;j and h AE definition are given. The corresponding CP sensitive spin weight wt has the form: The formula is valid for h AE defined in τ AE rest-frames, h z stands for longitudinal and h ⊥ for transverse components of h. The Rð2ϕ CP Þ denotes the 2ϕ CP angle rotation matrix around the z direction: where τ decay products π AE , π 0 and ν τ 4-momenta are denoted respectively as p π AE , p π 0 ; p ν and q ¼ p π AE − p π 0 . Obviously, complete CP sensitivity can be extracted only if p ν is known (for τ AE → π AE π AE π ∓ ν formula is longer, dependence on modeling of the decay appear too [23], but is of no principle differences). Note that the spin weight wt is a simple first order trigonometric polynomial in a 2ϕ CP angle. This observation is valid for all τ decay channels. For the clarity of the discussion on the DNN results, we introduce α CP ¼ 2ϕ CP , which spans over (0, 2π) range. The α CP ¼ 0; 2π corresponds to scalar state, the α CP ¼ π to pseudoscalar one. Spin weight can be expressed as where depend on the τ decays only.
Distribution of the C 0 , C 1 , C 2 coefficients, for the sample of H → ττ events used for our numerical results is shown in Fig. 1. The C 0 spans (0, 2) range, while C 1 and C 2 of ð−1; 1Þ range have a similar shape, quite different than the one of C 0 .
The amplitude of the wt as function of α CP depends on the multiplication of the length of the transverse components of the polarimetric vectors. The longitudinal component h z þ h z − is defining shift with respect to zero of the wt mean value over a full (0, 2π) range. The maximum of the wt distribution is reached for the opening angle of transverse components of the polarimetric vectors.
The spin weight of formula (5) can be used to introduce transverse spin effects into the event sample when for the generation transverse spin effects were not taken into account at all. The above statement is true, independently if longitudinal spin effects were included and which τ decay channels complete cascade of H → ττ decay. The shape of weight dependence on the Higgs coupling to τ parity mixing angle is preserved.
In Fig. 2 we show distribution of spin weight wt for five example H → ττ events collected in Table I. For each event, depending on the polarimetric vectors, single value of α CP is preferred (by the largest weight). For a physics model with α CP the sample will be more abundantly populated with events for which the angle between polarimetric vectors, ∡ðh T þ ; h T − Þ, is close to α CP . We show distributions when complete polarimetric vectors are used for spin weight wt and when only hadronic parts of polarimetric vectors are used. The second case is indicating easier to attain sensitivity part of observables. The α CP at which spin weight has its maximum is then a bit shifted. Table I specifies values of the polarimetric vectors and the resulting coefficients C i calculated from formulas (6) and for events of Fig. 2. It also explicitly gives ∡ðh T þ ; h T − Þ calculated from complete polarimetric vectors and (in brackets) from their hadronic parts only.

III. MONTE CARLO SAMPLES AND FEATURE LISTS
For compatibility with our previous publications [18,19], we use the same generated event samples, namely Monte Carlo events of the Standard Model, 125 GeV Higgs boson, produced in pp collision at 13 TeV center-of-mass energy, generated with PYTHIA 8.2 [24] and with spin correlations introduced with TAUSPINNER [20] package. For τ lepton decays we use TAUOLAPP library [25]. All spin and parity effects are implemented with the help of weight wt [26,27]. The sample is generated without spin effects, and the spin weights wt i for few different values of CP mixing angle α CP i are stored. Spin weight, formula (3), is calculated using R i;j density matrix and polarimetric vectors h AE .
Later, for a given event it is possible to calculate coefficients C 0 , C 1 , C 2 , using three α CP and linear equation (5). Figure 3 shows the cross-check how well this procedure works. The functional form (orange line) and evaluated spin weights (blue dots) for two example events are shown. The C 0 , C 1 , C 2 coefficients for the functional form are calculated solving Eq. (5) for wt stored in the generated event samples at three values of α CP .
In this paper we present results for the case when both τ's decay τ AE → ρ AE ν τ and about 5 × 10 6 simulated Higgs events are used. To partly emulate detector conditions, a minimal set of cuts is used. We require that the transverse momenta of the visible decay products combined, for each τ, are larger than 20 GeV. It is also required that the transverse momentum of each π AE is larger than 1 GeV. In experimental conditions complex cuts will be applied. Our largely optimistic ones can be used in feasibility estimations though. The emphasis of the paper is to explore different ML approaches to the problem, and we discuss only the case of the Variant-All feature list from paper [19]. It contains the four-momenta of all decay products of τ leptons defined in the rest frame of intermediate resonance pairs, and with sum of hadronic decay products aligned with z-axis are. This represents an ideal benchmark case scenario, for performance monitoring.

IV. BINARY CLASSIFICATION
The use of the DNN for binary classification have been discussed in our previous papers [18,19]. The focus was on discriminating between CP-scalar (H 0 hypothesis) and CP-pseudoscalar (H 1 hypothesis). Now we apply the same procedure but with alternative hypothesis (H α CP ) representing the scalar-pseudoscalar mixed state of mixing parameter α CP . To quantify performance for Higgs CP state classification the weighted area under curve (AUC) [28,29] is used again. For each simulated event we know also Bayes optimal probability that it is sampled from H 0 or H α CP hypothesis, see more detailed description in Appendix. This forms the so called oracle predictions, i.e., ultimate discrimination for this problem. We calculate oracle predictions and evaluate the results of DNN. This is a straightforward extension of the method used in [18,19]. That is why, simple attempt on future discussion of systematic error may follow that suggested in [19]: variations within expected range of detector response can be easily introduced and biases studied.
The oracle predictions for discriminating between H 0 and H α CP hypotheses is increasing with α CP and reach AUC ¼ 0.78 for α CP ¼ π. The performance of DNN is following similar pattern, reaching maximum at α CP ¼ π  (6) and angle ∡ðh T þ ; h T − Þ between transverse components of polarimetic vectors for five example events of H → τ þ τ − ; τ AE → ρ AE ν τ . In brackets, angle of only hadronic part of polarimetric vector is given.

Events
Polarimetric vectors  (5), encapsulating sensitivity to α CP is not symmetric, see Fig. 3. In Table II we show numerical results for few α CP .

V. MULTICLASS CLASSIFICATION
The binary classification discussed in the previous section is easy to generalize to the multiclass case. The DNN is learning to provide per-event probabilities to associate with each class. Single class represents either discrete point or a specific range in 1-dimensional parameter space. We explore three approaches, each providing complementary physics information, but all allowing to quantify, on the per-event basis, which is the preferred mixing angle of the studied Higgs sample: (i) The DNN classifier is learning per-event spin weight as a function of mixing angle α CP . The range of mixing angle ð0; 2πÞ is discretized into equally spaced points called classes. This approach is described in Sec. VA, and used for the figures labeled with: Classification:wt. (ii) The DNN classifier is learning per-event coefficients C 0 , C 1 , C 2 . The allowed range of coefficients is split into several equal size ranges (classes), single class represents a range for a coefficient value. The DNN is trained for each coefficient separately. This approach is described in Sec. V B and used for the figures labeled with: Classification: The DNN classifier is learning per-event most probable mixing angle α CP max , i.e., value of α CP at which spin weight is maximal. The range of mixing angle ð0; 2πÞ is split into several equally spaced points (classes). This approach is described in Sec. V C and used for the figures labeled with: Classification: α CP max . We monitor performance of the learning process in a standard manner, with the loss function on the training and validation sets. Respective distributions are shown in Fig. 20 of Appendix. Note that the loss function, the tf.nn.softmax_cross_entropy_with_logits of the Tensorflow, allows to predict probabilities of the class labels, and not the actual value of the observable at a given class. In case of predicting spin weight distribution, only the normalized to unity shape is predicted. In case of predicting values of C i coefficients or α CP max , vector of probabilities is returned, and the one-hot encoding transformation selecting most probable class is then applied to retrieve actual predicted value of the parameter.

A. Learning spin weight wt
The DNN classifier is trained with per-event feature list and as a label normalized to unity N class -dimensional vector of spin weights [30] wt i is given, each component of wt norm ðα CP Þ vector corresponds to the ith discrete value of mixing angle α CP i . N class denotes number of points to which range ð0; 2πÞ was discretized. The number of classes is kept odd, to assure that α CP ¼ 0; π; 2π, corresponding respectively to scalar/ pseudoscalar/scalar cases, are always represented as a separate class. Training of DNN is performed with N class varying from 3 to 51. This is to understand the trade-off between the better approximation given by high number of classes and smaller complexity of the low-class system.
We quantify the DNN performance for classification problem in the context of physics relevant criteria. The first question is how well DNN is able to reproduce per-event shape of the spin weight wt norm . For two example events,  true and predicted spin weight wt norm distribution with N class ¼ 21 is shown in Fig. 5 as a function of either continuous mixing parameter α CP i or class index i (representing discretized mixing parameter α CP i ). Blue line denote true weights while orange steps denote weights predicted by DNN classifier. In overall, predicted weights follow smoothly true shape of linear cosðα CP Þ and sinðα CP Þ combination. This is encouraging, because the loss function is not correlating explicitly nearby classes. The DNN is discovering this pattern in the process of learning.
To quantify those observations, performance of DNN is monitored on the statistical basis with l 2 norm. The l 2 norm is defined as a square root of the integral of squared difference between predicted p k and true wt norm k over the whole interval ð0; 2πÞ. It then averaged over the number of events N evt . Although p k and wt norm k are functions of α CP , we shall usually skip the argument for the notation brevity.
The p k corresponds to the kth event and is represented as a step function, with step levels given by a N classdimensional output of DNN. For true weights, represented as continuous function (5), we scale them in such a way that R 2π 0 wt norm dα CP ¼ 1, to enable the comparison. Distribution of l 2 norm is shown in Fig. 6, as a function of class multiplicity N class . With increasing number of classes, l 2 decreases. The slope remains very steep up to N class ¼ 21, and seems to flatten around N class ¼ 51. These two values of N class we have chosen as representative for the rest of the paper.
From physics perspectives, learning the shape of wt distribution as function of α CP , is equivalent to learning components of the polarimetric vectors. But, because only the shape, not the normalization, is available the C i coefficients cannot be fully retrieved from formula (5). It is not necessary the aim anyway. The physics interest is more to learn α CP which is preferred by events of the analyzed sample, i.e., value at which wt distribution has its maximum. This corresponds to determining CP mixing angle of the analyzed sample.
The second criterium is the difference between most probable predicted class and most probable true class, denoted as Δ class . When calculating difference between class indices, periodicity of the functional form (5) is taken into account. Class indices represent discrete values of α CP , in range ð0; 2πÞ. The distance between the first and the last class is zero. We take the distance which corresponds to the smaller angle difference and we take the sign according to clockwise orientation vs class index at which true wt has its maximum. Let idp max denote the index of most probable predicted class, idc max be index of true most probable class. The distance jΔ class j is defined as: and the sign is attributed if ðjidp max − idc max jÞ < ððN class − 1Þ − jðidp max − idc max ÞjÞ, or otherwise. In Fig. 7 distributions of Δ class for N class ¼ 21 (top) and 51 (bottom) are shown. The shapes are Gaussian-like and centered around zero. The mean hΔ class i ¼ −0.006 [rad] in both cases and this we can interpret as the bias of the method. The standard deviation of per-event distribution is σ Δ class ¼ 0.165 ½rad for N class ¼ 21 and σ Δ class ¼ 0.126 ½rad for N class ¼ 51. As we can see, the performance has not improved significantly with N class exceeding 21.
The DNN classifier which is predicting normalized spin weight wt norm , provides enough information to identify the most probable mixing angle α CP max with high precision. The information is not sufficient though to reconstruct complete set of C i coefficients and the polarimetric vectors.
B. Learning C 0 , C 1 , C 2 coefficients The second approach is to learn formula (5) coefficients C 0 , C 1 , C 2 for the spin weight wt. They can be then used to predict not only normalized wt norm , but also original wt. Coefficients C 0 , C 1 , C 2 represent physical observables, products of longitudinal and transverse components of polarimetric vectors, as shown in formulas (6).
The classification technique using DNN is configured to learn each of the C i with separate training. The allowed range is well known, the C 0 spans the range (0.0, 2.0) and C 1 , C 2 the range ð−1.0; 1.0Þ, see Fig. 1. The allowed range is binned into N class and as a label, the N classdimensional vector with one-hot encoded value of the C i parameter is associated with each event. Therefore in this case, a single class represents range of the C i coefficient. During training, the DNN is learning per-event association between feature list and the class labels. The output is a probability N class -dimensional vector, which is then converted to one-hot encoded representation, i.e., the most probable class is chosen as a predicted value of the C i coefficient.
Distributions of the difference between true and predicted C i coefficients are shown in Fig. 8. In that case, as there is no periodicity involved, Δ class ¼ idp − idc where idp, idc denote respectively true and predicted class index. Mean of ΔC i is close to zero and standard deviation is of 0.038-0.051, which is less than 5% of the range. Precision with which C i coefficients are predicted is clearly limited by the N class .
We use the true and predicted C 0 , C 1 , C 2 coefficients to calculate wt distribution of (5). It is then discretized with N class points (the N class could be different than the one used for learning coefficients), and the α CP max is determined from the class of maximal weight. The difference between true and predicted α CP max is shown in Fig. 9 for N class ¼ 21 and 51. The Gaussian-like shape of those distributions, centered around zero, clearly demonstrated that method works as expected. The mean and standard deviation of the distributions are close to those obtained with Classification:wt approach, of Fig. 7. Finally, as sanity check we have compared the true distributions of C 0 , C 1 , C 2 with the predicted ones. As we can see in Fig. 10, both distributions match very well for all C i .

C. Learning the α CP max
The third approach is to directly learn per-event most preferred mixing angle, α CP max . The allowed range (0, 2π) is again binned into N class classes, where single bin represents discrete α CP . For training, for every event we take the onehot encoded vector of N class -dimension as a label. The DNN is returning N class -dimensional vector of probabilities, which is then transformed into a single number, that is the class of the highest probability α CP max . With this approach, neither spin weight nor C i coefficients are predicted.
As the event sample is generated without any CP mixture favored, the distribution of the α CP max is expected to be uniform, and such sanity check is demonstrated in the top plot of Fig. 11. The DNN is well reproducing this behavior. The Δα CP max , the difference between true and predicted value of the α CP max is shown in the bottom plot of Fig. 11. In the case of N class ¼ 21, it has a Gaussian-like shape with the mean hΔα CP max i ¼ 0.003 AE 0.001 ½rad and standard deviation 0.139 [rad]. Results are again comparable with the ones obtained with the previously discussed approaches.

VI. REGRESSION
The ML regression is not so commonly used in the high energy physics analyses. The main feature is, that contrary FIG. 8. Difference between true and predicted coefficients C 0 , C 1 , C 2 of formula (5). For DNN training the granularity of N class ¼ 21 was used.
FIG. 9. The difference between true and predicted most probable mixing angle α CP max , calculated using formula (5) and coefficients C 0 , C 1 , C 2 learned with classification method. The granularity of α CP max , N class ¼ 21 and 51 was used respectively for top and bottom plot.
to the classification case, we get a continuous parameter (or set of parameters) as a DNN output. We explore three approaches, defined similarly as in Sec. V (i) The DNN is learning to predict per-event spin weight as a function of mixing angle α CP . The range of mixing angle ð0; 2πÞ is split into discrete points of α CP at which value of spin weight is learned. This approach is described in Sec. VI A and used for the figures labeled with: Regression:wt. (ii) The DNN is learning to predict per-event value of the coefficients C 0 , C 1 , C 2 of the functional form (5). The DNN is trained for all coefficients simultaneously. This approach is described in Sec. VI B and used for the figures labeled with: Regression:C 0 , C 1 , C 2 . (iii) The DNN is learning to predict per-event most probable mixing angle α CP max , i.e., where α CP spin weight has maximum. This approach is described in Sec. VI C and used for the figures labeled with: Regression: α CP max . We continue with the TENSORFLOW package, but now with tf.losses.mean_squared_error function as a loss in the training procedure of Secs. VI A, VI B and self-defined function in the training procedure of Sec. VI C. Mentioned self-defined function is discussed in the Appendix.

A. Learning spin weight wt
Similarly as in the classification case, the DNN regression is trained on an input information consisting of per-event feature list. As a training output we provide a vector of the spin weight wt i for the discrete values of α CP . Training is performed for different granularities of α CP discretization, to monitor performance sensitivity. Again in this case we use odd number of equally spaced points α CP i , so the α CP ¼ 0; π; 2π coincide with a single point. It is worth noting, that in case of regression, both shape and normalization of the wt are learned by the DNN.
For two example events in Fig. 12, true continuous spin weight wt distribution as well as step-function prediction is shown as a function of mixing parameter α CP . In overall, predicted weights follow smoothly expected shape of linear cosðα CP Þ and sinðα CP Þ combination, even if no attempt to regularize for such smooth behavior was made.
Distributions of l 2 norm, defined in the same way as in the classification case, as a function of N class (granularity for discretising α CP ) is shown in Fig. 13. For more compatibility with the classification case of Sec. VA we present results for original wt, as well as normalized to unity wt norm . The results are comparable, with a visible flattening of l 2 for higher values of N class .
In Fig. 14 distributions of Δ class for N class ¼ 21 and 51 used to train DNN regression are respectively shown. The shape is Gaussian-like and as expected centered around Δ class ¼ 0.
B. Learning C 0 , C 1 , C 2 coefficients Regression approach allows us to predict C 0 , C 1 , C 2 coefficients directly, without any need of discretization. The differences between true and predicted ones are shown in Fig. 15. On average, all three coefficients are predicted reasonably well. Consistent are the statistical summaries of ΔC i : means remain in the range AE0.004 and standard deviations in range (0.029-0.042). Coefficients C i are then used to calculate predicted spin weight wt of formula (5).
We have investigated also, how well predicted C 0 , C 1 , C 2 can be used to estimate the most preferred mixing angle, α CP max . For consistency, we evaluate it using the  same criteria as for classification approaches. This is achieved by using coefficients C 0 , C 1 , C 2 to calculate spin weight wt, and then turning it into discrete predictions for wt and wt norm in the N class points. As in Sec. V for classification approach, we use Δ class , defined by formulas (8)- (10).
The distributions of the true and predicted most probable class, α CP max and their difference are shown in Fig. 16 for the N class ¼ 51. We expect the distributions to be flat as sample was generated without any polarization correlation (carrier of CP effects) included, and this sanity check seems to be positive. The difference between true and predicted α CP max forms a narrow peak with the mean value hΔα CP max i ¼ −0.001 AE 0.001 ½rad and standard deviation 0.138 [rad].
Finally, as a sanity check, we have compared the true overall distribution of C 0 , C 1 , C 2 with the predicted one. As we can see in Fig. 17, both distributions match very well.
C. Learning the α CP max As was in the previous subsection, the implementation of the regression method allows a direct, nondiscrete estimation of continuous parameters. This is also desired with the most preferred mixing angle α CP max . The distributions of the true and predicted most probable class, α CP max and their difference are shown in Fig. 18 for the N class ¼ 51. We expect the distributions to be flat as sample was generated without any polarization correlation (carrier of CP effects) included, and this sanity check seems to be positive. As the used event sample is generated without any polarization, the distribution of the α CP max is expected to be uniform, see the top plot of Fig. 18. The DNN is reproducing this feature well. The difference between true and predicted α CP max forms a narrow peak with the mean hΔα CP max i ¼ 0.020 AE 0.003 ½rad and standard deviation 0.458 [rad].

VII. CLASSIFICATION OR REGRESSION: COMPARISON AND COMPLEMENTARITY
In this section we shortly compare classification and regression approaches. In Table III we collect the mean and standard deviation for difference between true and predicted with classification and regression methods C i . There is no clear winner, both methods give predictions of similar precision, with only C 0 being better predicted with regression.
In Table IV we compare the difference between true and predicted α CP max obtained with different methods. With the classification method comparable performance is achieved when learning spin weight wt, coefficients C 0 , C 1 , C 2 or directly α CP max . For the regression method learning directly α CP max is significantly worse performing. Otherwise, there is no clear winner between different methods.
In Ref. [18] we have compared performance of the NN method with the one which can be deduced from the set of one dimensional optimal-variable-like distributions. Assuming partial correlations the classification strength was comparable, somewhat smaller though. This conclusion holds now as well. Note that optimal-like subvariables may turn to be useful for evaluation of systematic ambiguities. For one dimensional distributions established strategies are ready to be used.

VIII. SUMMARY
We have performed a proof-of-concept for the DNN methods in the measurement of Higgs boson H → ττCP mixing angle dependent coupling. That extends work of Refs. [18,19] of classification between scalar and pseudoscalar Higgs CP state. Several solutions of classification and of regression types were prepared and numerical results were collected. For the measurement we have studied approaches where; (i) spin weights, (ii) coefficients for the functional form of the spin weight (iii) directly the mixing angle at which the weight has its maximum, were targeted. In cases (i) and (ii) the classification approach seemed comparable to the regression, but the comparisons relied on the discretized and normalized quantities due to classification limitations. The regression approach seems more natural for continuous observables and does not have such limitations. On the other hand, regression approach has performed much worse in the case of direct α CP max prediction.
For the feature list we have chosen idealistic case, assuming that complete set of τ decay products 4-momenta is known, including challenging to reconstruct neutrinos. We have exploited then the τ → ρν decay mode. The results are encouraging, the understanding of environment for future discussion of measurement ambiguities was not compromised with respect to what was achieved in previous publications for scalar/ pseudoscalar classifications.
The mean value of the preferred mixing angle α CP max can be constrained by the trained DNN with per-event resolution better than 0.15 [rad] using a classification approach. Both classification and regression approaches allow to learn spin weight with uncertainties (average l 2 norm) better than 15%. Both approaches allow also to learn coefficients C 0 , C 1 , C 2 of the functional spin weight form. The coefficients are directly related to the polarimetric vectors of decaying τ AE leptons. This provides interesting possibility for the future studies of experimental ambiguities with samples of the Z → ττ decays, much more abundant and available for the LHC measurements. Departure from SM predictions on Zττ coupling can reveal itself in the observables build from polarimetric vectors of decaying τ AE leptons too.
We plan, following [18,19], to extend our studies to more realistic feature lists and other decay modes. Already now, the variety of ML methods for the determination of most preferred CP state mixing angle, demonstrated potential TABLE III. The mean and standard deviations of ΔC i , the difference between generated and predicted C i , obtained from DNN with classification and regression methods for N class ¼ 51.

Coefficients
Classification Regression and robustness for future experimental analyses and measurements with the LHC data. While analyzing the realistic feature list, one can, in parallel, construct one dimensional distributions which feature partial sensitivity to CP. This can be used to monitor, which improvements are of importance, and to understand why it is the case. Also, by hand combination of such partial significance can be helpful to understand performance of ML solution, even if important correlations between distributions are then ignored. On the other hand, it can help to understand consequences of background contaminations, which may mimic the signatures. In Ref. [27] we have discussed potential consequences of such contaminations, originating from Drell-Yan background, but to exhaust, more complete studies with detector simulation details are to be applied, and not only to optimal-like distributions. In that paper, we have concentrated on a tool, potentially useful for background impact on multidimensional distributions.
Background contamination is a valid concern for the discussed signatures. It may even mimic the signal. That it is not the case needs to be checked. In Ref. [20] we have studied the transverse spin effects of the Drell Yan process; a good example of the background for CP sensitive observables of Higgs to τ decays. It was found, see Fig 4. there, that because sign of transverse spin correlation depends on orientation with respect to reaction plane, spanned on incoming quarks and outgoing τ leptons momenta, the effect cancels out if averaged over the sample. That observation trivially extends to the case of other τ decays and signatures prepared with the help of ML techniques. Only very specific cuts could change that. For that however very detailed knowledge of the detector responses and cut implementation would be needed. This is to be achieved only with the help of collaborations detector response software. Our intuition is that such cuts are rather not an issue. The background will just contribute with events of, on average, no CP-sensitive signatures of any sort. That should be the case of any processes mediated by intermediate spin 1 state, and τ-pairs of no common origin as well. In presence of such background, we expect sensitivity of observables based on optimal variables and ML techniques alike, to be reduced in proportion to possible background fluctuations.

ACKNOWLEDGMENTS
We would like to thank J. Kurek and P. Winkowska for help with technical implementation and testing of the analysis code used for this paper preparation. This project was supported in part from funds of Polish National Science Centre under decisions No. DEC-2017/27/B/ ST2/01391. The majority of the numerical calculations were performed at the PLGrid Infrastructure of the Academic Computer Centre CYFRONET AGH in Krakow, Poland.

APPENDIX: DEEP NEURAL NETWORK
The structure of the simulated data and the DNN architecture follows what was published in our previous papers [18,19]. It is prepared for TENSORFLOW [31], an open-source machine learning library.
We consider H → ττ channel of both τ AE → ρ AE ν decay. The data point is thus an event of the Higgs boson production and τ lepton pair decay products. The structure of the event is represented as follows: The situation, which does not happen in many other cases of ML classification. The A; B; …M distributions highly overlap in the ðf i;1 ; …; f i;D Þ space, the more detailed discussion in case of two hypotheses only, scalar and pseudoscalar, can be found in [18]. Thanks to similar DNN architecture, we have prepared three implementations for measuring Higgs boson CP state: binary classification, multiclass classification and regression: (i) For binary classification the aim is to discriminate between two hypothesis, H 0 and H α CP . (ii) For multiclass classification, the aim is to simultaneously learn weights (probabilities) for several H α CP hypotheses; learn coefficients of the weight functional form or directly learn the mixing angle at which spin weight has its maximum, α CP max . A single class can be either single discretized α CP or a range FIG. 20. The DNN loss for classification (left-side) and regression (right-side), as function of number of epochs used for training. It is shown for learning spin weight (top plots), C i coefficients (middle plots) and most likely mixing angle α CP max (bottom plots). For the classification, N class ¼ 21 was used.
for the C i parameters. The system is learning probabilities for classes to associate with the event.
(iii) For the regression case, the aim is similar as for multiclass classification case, but now the problem is defined as a continuous case. The system is learning value to associate with the event. The value can be a vector of spin weights for a set of H α CP hypotheses, set of C i coefficients or α CP max . The network architecture consists of 6 hidden layers, 1000 nodes each with ReLU activation functions and is initialized with random weights. Such architecture has been found as a good trade-off between the performance and computation time, what can be seen in Fig. 19. Learning procedure is optimized using a variant of stochastic gradient descent algorithm called Adam [32] and batch normalization [33].
The last layer is specific to the implementation case, different is dimension of the output vector, activation function and a loss function. In the following, we will describe details.
Classification: The loss function used in stochastic gradient descent is a cross entropy of valid values and neural network predictions [4]. It is a common choice in case of binary or multiclass classification models. The loss function for sample of N evt events and classification for N class reads as follows: where k stands for consecutive event and i for class index. The y i;k represents neural-network predicted probability for event k being of class i while p i;k represents true probability used in supervised training. Regression: In case of predicting wt the last layer of DNN is N dimensional output (granularity with which we want to discretize it). For predicting C 0 , C 1 , C 2 the last layer of DNN is N ¼ 3 dimensional output, i.e., values of C 0 , C 1 , C 2 . Activation of this layer is a linear function. Loss functions is defined as mean squared error (MSE) between true and predicted parameters Loss ¼ where k stands for event index and i for index of function form parameter. The y i;k represents predicted value of C i th parameter for event k while p i;k represents true value. For predicting the α CP max the last layer of DNN is N ¼ 1 dimensional output, i.e., values of α CP max . The tf.reduce_mean method of TENSORFLOW is used, with the loss function Loss ¼ where y k , p k denotes respectively predicted and true value of α CP max . In Fig. 20, for all problems considered, distributions of the loss functions on the training and validation samples, as a function of number of epochs used for training are shown. Left plots are for the classification and right plots for the corresponding regression. The values of the loss are case specific and should not be directly compared, their shape is monitoring the training process. For all cases the loss is decreasing with number of epochs, both on training and validation samples. It is overlapping for all cases except [Regression: α CP max ] (bottom right plot), for that single case one small loss in performance is observed for validation sample compared to training sample. Training with 25 epochs seems sufficient for both classification and regression for all presented scenarios. We have not observed any gain of interest in any of extended to larger number of epochs Fig. 20 plots.