Machine learning classification: Case of Higgs boson CP state in H → ττ decay at the LHC

Machine learning (ML) techniques are rapidly finding a place among the methods of high-energy physics data analysis. Different approaches are explored concerning how much effort should be put into building high-level variables based on physics insight into the problem, and when it is enough to rely on low-level ones, allowing ML methods to find patterns without an explicit physics model. In this paper we continue the discussion of previous publications on the measurement of the Higgs boson CP state in the H → ττ decay channel, with the consecutive cascade decays τ± → ρ±ν; ρ± → π±π0 and τ± → a1±ν; a1± → ρ0π± → 3π±. The discrimination of the Higgs boson CP state is studied as a binary classification problem between CP-even (scalar) and CP-odd (pseudoscalar) states using a deep neural network (DNN). Improvements in the classification from constraints on the directly nonmeasurable outgoing neutrinos are discussed. We find that, once added, they enhance the sensitivity sizably, even if only imperfect information is provided. In addition to DNNs we also evaluate and compare other ML methods: boosted trees, random forests, and support vector machines.


I. INTRODUCTION
Machine learning (ML) techniques are finding an increasing number of applications in high-energy physics phenomenology. With the Tevatron and LHC experiments, ML has become a standard analysis tool. ML techniques are used for event selection, event classification, background suppression for the signal events of interest, etc. For recent comprehensive reviews, see Refs. [1][2][3]. Over recent years the most significant progress in phenomenology due to ML techniques (in particular the recent developments in neural network methods) has been in hadronic jet reconstruction and classification: jet substructure, flavor, charge, and mass. Some long-standing challenges of more classical algorithms have been addressed; see, e.g., Refs. [4][5][6][7][8][9][10].
In this paper we present studies on the seemingly related problem of how the substructure and pattern of hadronically decaying τ leptons can be used to determine the CP state of the Higgs boson in the decay H → ττ. The theoretical description of the process, including the τ-lepton decays, is relatively simple and suffers only from minor theoretical ambiguities. On the other hand, a complete detection approach remains a challenge. For example, indirect constraints had to be devised and validated in place of the nonmeasurable τ-neutrino momenta, and the related part of the sensitivity was often compromised.
This problem has a long history [11,12]. It was studied for both electron-positron [13,14] and hadron-hadron colliders [15,16]. Despite some interest, CP states in H → ττ decay have not yet been measured or even explored in LHC analysis designs. While more classical experimental analysis strategies have been prepared and documented (see, e.g., Ref. [17]), ML methods for exploring HL-LHC strategies are still at an early stage.
A typical experimental data sample consists of events. Each event can be understood as a point in a multidimensional coordinate space, representing the four-momenta and flavors of observed particles or groups of particles. The physics goal is to identify properties of distributions constructed from these events and to interpret them in a physically meaningful way. ML algorithms with only low-level features of the event are not necessarily able to efficiently capture all available information. The best-performing strategy still seems to be mixing low-level information with human-derived high-level features based on insight into the physics of the problem. Examples of such analyses are presented in Refs. [18,19], where the strategy of mixing low-level and high-level features to remove trivial (physics-wise) symmetries was successfully explored. Then, the ML algorithms do not need to learn basic physics rules, like rotational symmetry.
In previous papers [20,21] we have demonstrated that ML methods, like deep neural networks (DNNs) [22], can serve as a promising analysis method to constrain the Higgs boson CP state in the decay channel H → ττ. We considered two decay modes of the τ leptons: τ± → ρ±ν and τ± → a1±ν, followed by ρ± → π±π0 and a1± → ρ0π± → 3π±. This forms three possible hadronic final-state configurations (ρ±ρ∓, a1±ρ∓, and a1±a1∓), each accompanied by a τ-neutrino pair. The information about the Higgs boson CP state is encoded in the angles between the outgoing decay products and the angles between the intermediate resonance decay planes. In earlier studies [12,23] performed with the rather classical optimal-variable approach [24], we observed that the best discrimination was achievable from features constructed in the rest frame of the primary intermediate resonance pair of the τ decays, with the z axis aligned with the resonance direction. This idea was also explored in Ref. [20] and will be studied in this paper. We have investigated inputs consisting of mixed low-level and high-level features. Many of the high-level features turned out not to be necessary, but they nevertheless provided benchmark results. On the other hand, (actually simple) nontrivial choices for the representation of some low-level features were necessary to achieve any significant result.
The studies presented in Ref. [20] were limited to input from the hadronic decay products π± and π0; no detector effects were taken into account. That study was followed by a more systematic evaluation within the context of experimental analysis [21], namely, applying simplified detector effects to the input features. The conclusions of Ref. [20] on the performance of the DNN method still stand, and we will not repeat this evaluation here.
The studies presented in Ref. [20] showed that the ρ±ρ∓ case, followed by a1±ρ∓, is the most sensitive to the Higgs CP state, while a somewhat weaker sensitivity is achieved in the a1±a1∓ case. Should all of the decay channels be equally sensitive to the Higgs CP state? In Ref. [26] it was demonstrated that the sensitivity of each τ decay channel to spin is indeed the same. Unfortunately, exploiting this requires control of all momenta of the τ decay products, in particular those of the nonmeasurable neutrinos. The studies presented in Ref. [20] did not rely on the complete information, limiting the input to the hadronic (visible) decay products only. However, it is possible to overcome this limitation and (approximately) reconstruct the neutrino momenta from the τ decay vertex position and the event kinematics (the momenta of the visible τ decay products, the overall missing p_T, and the overall collision center-of-mass energy). Such a reconstruction is challenging from both the experimental and analysis design perspectives: the relations between the necessary features are more complicated.
Nevertheless, this provides new opportunities for ML methods, which we will explore with the help of expert variables: the azimuthal angles of the neutrino orientation. That these angles may become experimentally available with adequate precision can be concluded from recent experimental publications of the LHC collaborations on the measurement of the H → ττ signal [27,28], on τ substructure reconstruction and classification [29,30], and from progress on the precision of B-meson decay vertex position measurements [31][32][33].
We attempt to reconstruct the two neutrinos' four-momenta (i.e., six quantities) from the experimentally available quantities and examine when such approximate information can be useful. To achieve this goal we propose the following three steps: (1) Reconstruct the neutrino 4-momentum components collinear to the directions of the visible decay products of the τ leptons, from the missing transverse energy of the event, E_x^miss, E_y^miss, and the invariant mass of the Higgs boson, m_H.
(2) Reconstruct the transverse part of the neutrino momenta from the τ-lepton invariant-mass constraint. (3) Reconstruct the two remaining azimuthal angles ϕ_ν1, ϕ_ν2 of the neutrinos (or equivalent information), with the help of the τ-decay vertex positions. After step 1 we have four independent variables left to constrain, and after step 2 only two independent variables remain. The load on the constraints from the τ decay vertex position (probably the least precisely measured parameter) is thus minimized. This approach can be understood as an attempt to construct high-level features with expert-supported design. If useful, this may later be replaced with better choices. Several papers with optimal variables in mind followed such a strategy [13,14,16].
For compatibility, we use the same simulated samples as in Ref. [20], namely, Monte Carlo events of the 125 GeV Standard Model Higgs boson, produced in pp collisions at 13 TeV center-of-mass energy, generated with PYTHIA 8.2 [34] and with spin correlations simulated using TauSpinner [23]. For τ-lepton decays we use Tauolapp [35]. All spin and parity effects are implemented with the help of the TauSpinner weight wt. This is why the samples prepared for the CP-even and CP-odd Higgs are correlated. For each channel we use about 10^7 simulated Higgs events [36]. In order to partly emulate detector conditions, a minimal set of cuts is used. We require that the combined transverse momentum of the visible decay products of each τ is larger than 20 GeV. We also require that the transverse momentum of each π± is larger than 1 GeV.
As in Ref. [20], we perform a DNN analysis for the three channels of the Higgs, i.e., τ-lepton-pair decays, denoted respectively as ρ±ρ∓, a1±ρ∓, and a1±a1∓. Only two hypotheses on the Higgs parity are compared. However, an extension to a parametrized classification (similar to the approach taken in Ref. [37]) could be envisaged as an obvious next step, e.g., the measurement of the Higgs CP parity-mixing angle. Our paper can also be understood as a step in that direction.
Our baseline ML method is the DNN; nonetheless, we also work with more classical ML techniques like boosted trees (BTs) [38], random forests (RFs) [39], and support vector machines (SVMs) [40]. A comparative analysis is presented for the ρ±ρ∓ case and for smaller event samples of about 10^6 events.
Our paper is organized as follows. In Sec. II we briefly recall the physics of the problem and the previous results from Ref. [20]. In Sec. III we discuss how to reconstruct (with some approximation) the outgoing neutrino momenta. We exploit the collinear approximation, mass constraints, and information on the spatial positions of the production and decay vertices. In Sec. IV we present an improvement on the DNN classification from information on the neutrinos. We quantify the necessary precision on the neutrinos' azimuthal angle to improve the performance of the classifier. In Sec. V the main results are summarized and an outlook is provided.
In Appendix A, details concerning the implementation of the DNN analysis are given. In Appendix B we present results obtained with the other ML techniques (BTs, RFs, and SVMs). We also discuss technical benchmarks, like CPU usage and transient memory.

II. CLASSIFICATION BASED ON HADRONIC DECAY PRODUCTS
Let us comment briefly on a few selected results [41] from Ref. [20], summarized in Table I. For the DNN classification, only the directly measurable 4-momenta of the hadronic decay products of the τ leptons were considered. They were boosted to the rest frame of the primary intermediate resonance pairs: ρ±ρ∓, a1±ρ∓, or a1±a1∓. All 4-vectors were then rotated to the frame where the primary resonances are placed along the z axis. This greatly improved the learning process. The DNN algorithm did not have to, e.g., rediscover rotational symmetry, and from the very beginning the internal weights of the DNN could separate the transverse CP-sensitive degrees of freedom from the longitudinal ones. To quantify the performance of the Higgs CP classification we used the weighted area under the receiver operating characteristic (ROC) curve (AUC) [42,43]. For each simulated event we know (from the calculated matrix elements) the probability that the event is sampled as scalar or pseudoscalar (for details, see Appendix A). This forms the so-called oracle predictions, i.e., the ultimate discrimination for the problem, which is about 0.782, independent [44] of the τ decay channels. Random classification corresponds to 0.500.
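The weighted AUC figure of merit can be illustrated with a minimal pairwise implementation (the function name is ours, and treating the spin weights wt as per-event sample weights is our assumption; an analysis would more likely use sklearn.metrics.roc_auc_score with its sample_weight argument):

```python
def weighted_auc(labels, scores, weights):
    """Weighted AUC: probability that a randomly chosen (weight-sampled)
    positive event is ranked above a randomly chosen negative one;
    ties count as 0.5. O(n^2) pairwise form, for illustration only."""
    pos = [(s, w) for l, s, w in zip(labels, scores, weights) if l == 1]
    neg = [(s, w) for l, s, w in zip(labels, scores, weights) if l == 0]
    num = 0.0
    for sp, wp in pos:
        for sn, wn in neg:
            if sp > sn:
                num += wp * wn          # correctly ordered pair
            elif sp == sn:
                num += 0.5 * wp * wn    # tie
    den = sum(w for _, w in pos) * sum(w for _, w in neg)
    return num / den
```

A perfect classifier gives 1.0, random ranking 0.5, in line with the oracle and random benchmarks quoted above.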
For the studied τ-pair decay channels, an AUC in the range 0.557-0.638 was achieved. Note that the AUC score is so much lower than the oracle predictions due to missing information on the neutrino momenta, which are important carriers of spin information but are not directly accessible by measurements. Let us briefly explain the physics context of the problem.
The Higgs boson Yukawa coupling is expressed with the help of the scalar-pseudoscalar parity-mixing angle ϕ; in the corresponding formula N denotes a normalization, h is the Higgs field, and τ̄, τ the spinors of τ+ and τ−. The matrix element squared for the scalar/pseudoscalar/mixed-parity Higgs decaying into τ+τ− pairs can be expressed in terms of the polarimetric vectors h± of the τ decays (solely defined by the τ decay matrix elements) and the density matrix R_{i,j} of the τ-lepton-pair spin state. Details of the frames used for the definitions of R_{i,j} and h± can be found in Ref. [45]. The corresponding CP-sensitive spin weight wt is simple; the formula is valid for h± defined in the τ± rest frames, with h_z and h_⊥ standing for the longitudinal and transverse components of h, and R(2ϕ) denoting the matrix of rotation by the angle 2ϕ around the z direction.

Table I. Results from Ref. [20] for discrimination between scalar and pseudoscalar Higgs CP states. For the DNN classification only the hadronic decay products' 4-momenta were used.
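The displayed formulas referred to above were lost in extraction; a hedged reconstruction of their standard form, consistent with the surrounding definitions (overall signs and frame conventions are fixed in Ref. [45] and should be checked there), is:

```latex
% Yukawa coupling with scalar-pseudoscalar mixing angle phi
\mathcal{L}_Y = N\, h\, \bar{\tau}\left(\cos\phi + i\gamma_5 \sin\phi\right)\tau ,
% matrix element squared, with the tau-pair spin density matrix R_{i,j}
|\mathcal{M}|^2 \;\propto\; 1 + \sum_{i,j} R_{i,j}\, h^+_i\, h^-_j ,
% CP-sensitive spin weight in terms of longitudinal and transverse h
wt \;=\; 1 + h^+_z h^-_z + \vec{h}^{+\,T}_{\perp}\, R(2\phi)\, \vec{h}^{-}_{\perp},
\qquad
R(2\phi) = \begin{pmatrix} \cos 2\phi & -\sin 2\phi \\ \sin 2\phi & \cos 2\phi \end{pmatrix}.
```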
The τ decay products' (π±, π0, and ν_τ) 4-momenta are denoted as p_π±, p_π0, and p_ν, respectively, and q = p_π± − p_π0. The formula is longer for the h_i± of the decay τ± → π±π±π∓ν due to a dependence on the modeling of the decay [21]. Obviously, complete CP sensitivity can be extracted only if p_ν is known. Note that the spin weight wt is a simple first-order trigonometric polynomial in the (doubled) Higgs CP parity-mixing angle. This observation is valid for all τ decay channels.
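For τ± → ρ±ν the polarimetric vector discussed above has a compact standard form, quoted here from the Tauola literature as an assumption (up to normalization, evaluated in the τ rest frame):

```latex
h^i \;\propto\; 2\,(q \cdot p_\nu)\, q^i \;-\; q^2\, p_\nu^i ,
\qquad q = p_{\pi^\pm} - p_{\pi^0} .
```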

III. APPROXIMATING COMPONENTS OF NEUTRINO MOMENTA
Our conjecture is that some of the steps listed in the Introduction and presented below may in the future be replaced or optimized by solutions present in ML libraries. The expert variables, in particular ϕ_ν1 and ϕ_ν2, would then no longer be needed. We first need to explain our construction in detail.
We start with the approximate neutrino momenta in the ultrarelativistic (collinear) approximation. We temporarily assume that the neutrino momenta and the visible τ products' momenta are collinear; later, we relax this simplification. This gives a reasonable approximation for the largest, collinear components, not only in the laboratory frame but also in the Higgs rest frame and in the rest frame of its visible decay products.

A. Collinear approximation
The basic kinematical constraint on the 4-momenta of each τ± → had±ν decay (where had± stands for the combined hadronic system produced in the decay, i.e., π±, π0, etc.) relates p_τ1, p_τ2, the 4-momenta of the decaying τ leptons; p_had1, p_had2, the 4-momenta of their combined hadronic (i.e., measurable) decay products; and p_ν1, p_ν2, the 4-momenta of the decay neutrinos. We temporarily assume that the directions of the hadronic decay products and neutrinos are parallel to the direction of the decaying τ, with the energy fraction x in the range (0,1). Then, for τ+ and τ−, we can write the neutrino momenta as multiples α1, α2 of the hadronic momenta, and from Eq. (7) we obtain α1 and α2 in terms of x. These relations hold in both the laboratory frame and the rest frame of the hadronic decay products, a consequence of the properties of Lorentz transformations of ultrarelativistic particles. This is why we can calculate α1 and α2 in the laboratory frame but use them in the rest frame of the combined hadronic decay products. That frame seems to be optimal [20] for the construction of expert variables for ML classification.
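The relations lost from the extracted text above can be reconstructed in their standard collinear form (a sketch; symbols as defined in the text, the original equation numbering is not reproduced):

```latex
p_{\tau_i} = p_{\mathrm{had}_i} + p_{\nu_i},\qquad
p_{\nu_i} = \alpha_i\, p_{\mathrm{had}_i},\qquad
x_i = \frac{E_{\mathrm{had}_i}}{E_{\tau_i}} = \frac{1}{1+\alpha_i}
\;\Longrightarrow\;
\alpha_i = \frac{1-x_i}{x_i},\qquad i = 1,2 .
```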
1. The E_x^miss, E_y^miss constraints

The laboratory-frame event momentum imbalance in the plane transverse to the beam direction, usually denoted as E_x^miss, E_y^miss, can be used to constrain the neutrino momenta. It can be attributed to the sum of the transverse components of the neutrino momenta, but it also accumulates all imperfections of the reconstruction of the other outgoing particles of the event. Then, thanks to Eq. (7), the two transverse balance conditions form a linear system; by solving it for α1 and α2 we obtain expressions which are useful for studies of ML classification.
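In the collinear approximation the E_x^miss, E_y^miss balance conditions form a 2×2 linear system for α1 and α2, which can be sketched as follows (a minimal illustration; the function name and argument conventions are ours, not the paper's code):

```python
def solve_alphas(p_had1, p_had2, ex_miss, ey_miss):
    """Solve the collinear-approximation system
        alpha1 * p1x + alpha2 * p2x = Ex_miss
        alpha1 * p1y + alpha2 * p2y = Ey_miss
    where p_had1, p_had2 are the (px, py) of the visible systems
    and the missing pT is attributed to the two neutrinos."""
    (p1x, p1y), (p2x, p2y) = p_had1, p_had2
    det = p1x * p2y - p1y * p2x
    if abs(det) < 1e-12:
        # visible systems exactly back-to-back in phi: system degenerate
        raise ValueError("degenerate configuration")
    alpha1 = (ex_miss * p2y - ey_miss * p2x) / det
    alpha2 = (p1x * ey_miss - p1y * ex_miss) / det
    return alpha1, alpha2
```

Events yielding nonpositive α1 or α2 would be rejected, as described in Sec. III B.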

2. Using the m_H constraint
Equation (12) alone provides solutions for α1 and α2. However, E_x^miss and E_y^miss have large experimental uncertainties. At the same time, a high-quality constraint follows from the known Higgs-boson and τ-lepton masses; E_had1 and E_had2 denote the energies of the hadronic systems had1 and had2, and later we will use the similar notation E_ν for the neutrino energy. Unfortunately, only the product (1 + α1) · (1 + α2) can be controlled in this way.

3. Choosing an optimal solution for the longitudinal neutrino momentum

To constrain α1 and α2, we have three independent equations [Eqs. (12) and (14)] at our disposal. We have checked that all three options lead to comparable predictions and marginal differences in the ML performance, at least as long as the measurement ambiguities of E_x^miss, E_y^miss are not taken into account: (i) Approx-1: Eq. (12) only. (ii) Approx-2: Eq. (14) and α1 from Eq. (12).
(iii) Approx-3: Eq. (14) and α2 from Eq. (12). The actual choice may optimize the experimental precision. For now, the option Approx-1 is chosen as a baseline for the results [46] without much elaboration.
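Under the collinear approximation, the m_H constraint referred to as Eq. (14) can be reconstructed as follows (a sketch, neglecting the masses of the visible systems relative to their energies; m_vis denotes the invariant mass of the combined visible system):

```latex
m_H^2 = (p_{\tau_1} + p_{\tau_2})^2
\;\approx\; (1+\alpha_1)(1+\alpha_2)\; 2\, p_{\mathrm{had}_1}\!\cdot p_{\mathrm{had}_2}
\;\approx\; (1+\alpha_1)(1+\alpha_2)\, m_{\mathrm{vis}}^2 ,
```

which makes explicit that only the product (1+α1)(1+α2) is controlled by this constraint.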
To illustrate the effectiveness, the correlation between α1-true [47] and α1-Approx-1 is shown for the a1±ρ∓ case in the top panel of Fig. 1. In the bottom panel, as a consistency check, the correlation between the a1±ρ∓ rest-frame and laboratory-frame energy fractions x1 calculated using α1-Approx-1 is given. A sample of 10^4 events was used for these scatter plots. The fraction of events contained in the band Δα1/α1 = ±5% (±10%) is about 25% (39%), and the fraction in the band Δx1/x1 = ±1% is about 85%. This relatively poor resolution in α1 will be reflected in the resolution of the approximate neutrino momenta. It will be interesting to observe how much it affects the classification capability of trained DNNs, as discussed in Sec. IV.

B. Energy and transverse components of neutrino momenta
Now, with the help of the approximate p_ν^z (aligned with the direction of the combined visible decay products), we can turn our attention to p_ν^x and p_ν^y. In the rest frame of the hadronic decay product system the p_had1,2 momenta are set along the z direction, and thus p_had^x = p_had^y = 0. The τ mass constraint, together with the masslessness of the ν_τ, leads to relations for E_ν and (p_ν^T)^2, where for p_ν^z = α · p_had^z one of the α approximations from Sec. III A 3 is used. α1, α2, E_ν1, and E_ν2 must be positive; otherwise, the approximation fails and the event cannot be used. Also, events with a negative approximate (p_ν^T)^2 could be rejected, but for our studies we instead set this component to zero. In total, about 17% of events are rejected for Approx-1. A further 11% of events are rejected if the above criteria are required to be fulfilled for Approx-2 and Approx-3 too. In Fig. 2 we show the distribution of the relative shifts from the generated to the approximate E_ν, p_ν^z, and p_ν^T for the a1±ρ∓ case. The p_ν^T is approximated better than E_ν and p_ν^z.
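The relations above can be sketched numerically; the τ-mass constraint fixes E_ν given p_ν^z = α p_had^z, and the massless-neutrino condition gives (p_ν^T)². The following is an illustrative reading (function and variable names are ours), including the rejection and zeroing rules quoted in the text:

```python
import math

M_TAU = 1.77686  # GeV, PDG tau mass

def neutrino_energy_and_pt(e_had, pz_had, m_had, alpha):
    """Sketch: neutrino energy and transverse momentum in the hadronic
    rest frame from the tau-mass constraint, with pz_nu = alpha * pz_had
    taken from the collinear step (illustrative, not the paper's code)."""
    if alpha <= 0:
        return None                      # approximation failed; reject event
    pz_nu = alpha * pz_had
    # m_tau^2 = (p_had + p_nu)^2 = m_had^2 + 2*(E_had*E_nu - pz_had*pz_nu)
    e_nu = (M_TAU**2 - m_had**2 + 2.0 * pz_had * pz_nu) / (2.0 * e_had)
    if e_nu <= 0:
        return None                      # approximation failed; reject event
    pt2 = e_nu**2 - pz_nu**2             # massless neutrino: E^2 = pz^2 + pT^2
    pt_nu = math.sqrt(pt2) if pt2 > 0 else 0.0  # negative (pT)^2 set to zero
    return e_nu, pt_nu
```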
We remain encouraged because for ML classifications even approximate observables (expert variables) may be useful to improve classification scores.

C. Azimuthal angles of neutrinos
After completing step B, only the two azimuthal angles defining the orientation of p_ν1^T and p_ν2^T remain undetermined. To capture the Higgs boson CP sensitivity they have to be known, preferably in the rest frame of the visible τ-pair decay products. Those two angles can be inferred from the positions of the τ decay vertices and then related, through boosts and rotations, to the azimuthal angles in the visible decay product frame.
The transverse coordinates of the primary interaction point are to a good precision consistent with zero. At the same time, the tracks of the τ decay products will not point to this interaction vertex, but rather to the position of the τ decay vertex shifted by the τ's flight path. The direction of the τ's flight path and (as a consequence) its momentum components can be reconstructed. This provides a constraint on the ν τ momentum as well. We do not intend to go into the details of this challenging secondary vertex position measurement. We refer to Refs. [31,32], which discuss the similar problem of the secondary vertex in the case of B-meson decay and its application to the classification of hadronic jets. One may assume that such a measurement is possible for a τ lepton, and that the orientation of the ν τ momentum around the direction of the visible hadronic τ decay products can be constrained.
To determine how precisely we need to know this information, we take the true azimuthal angles ϕ_ν1 and ϕ_ν2 in the rest frame of the visible decay products and smear them. For the smearing probability of Δϕ_ν = |ϕ_ν^smear − ϕ_ν^true| we take an exponential shape instead of the often-used Gaussian one [48]. Note that the length of the τ flight path itself follows an exponential distribution. We choose the sign of the shift with equal probabilities. We think that at present it is premature to attempt a realistic detector smearing. Attempts to investigate experimental smearings of the secondary vertex position have so far been reported only for the τ± → π±π±π∓ν decay channel of Z/γ* → ττ production at the LHC [49].
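The exponential smearing described above can be sketched as follows (a minimal illustration; the function name and the angle-wrapping convention are ours, and identifying the exponential scale with the parameter β is our reading of the formula lost in extraction):

```python
import math
import random

def smear_phi(phi_true, beta, rng=random):
    """Smear an azimuthal angle with an exponentially distributed shift
    of scale beta, sign chosen with equal probability (sketch)."""
    dphi = rng.expovariate(1.0 / beta)       # |shift| ~ Exp(mean = beta)
    sign = 1.0 if rng.random() < 0.5 else -1.0
    phi = phi_true + sign * dphi
    # wrap the result back into [-pi, pi]
    return math.atan2(math.sin(phi), math.cos(phi))
```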

D. Ansatz for the direction of the τ leptons
In Sec. III C we discussed the possibility of adding approximate information on the angle of the outgoing neutrino in the decay plane to the feature list. However, for the multivariate methods this angle does not have to be explicitly present in the feature list. In fact, indirect information such as the approximate direction of the outgoing τ lepton may be good enough.
From the primary and secondary vertex positions, the direction of the laboratory-frame τ-lepton momentum, i.e., p_x, p_y, p_z, is constrained. Assuming a known τ time of flight (t_flight) and mass m_τ, we calculate the momentum components from the displacement between the reconstructed secondary and primary vertex positions, i_sec-vtx − i_prim-vtx, in the laboratory (collision) frame (i = x, y, z). Instead of the unknown true time of flight, we use the value from the Particle Data Group, cτ_τ = 87 μm [50]. The true time of flight follows an exponential distribution with mean ⟨t_flight⟩ = τ_τ. This implies that the approximation used to estimate p_x, p_y, and p_z is also characterized by an exponential distribution, with mean and sigma close to their true values. The energy of the τ lepton is then calculated using the τ mass constraint. Finally, the complete 4-momentum of each τ is boosted into the ρ±ρ∓, a1±ρ∓, or a1±a1∓ system rest frame and added to the feature lists for DNN training.
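The ansatz above can be sketched numerically (an illustrative reading; names and units are ours, with vertex displacements in meters, momenta in GeV, and natural units c = 1, so that p_i = m_τ Δx_i / (cτ_τ)):

```python
import math

C_TAU = 87e-6    # m, PDG tau c*tau (mean flight length at gamma*beta = 1)
M_TAU = 1.77686  # GeV, PDG tau mass

def tau_momentum_from_vertices(prim_vtx, sec_vtx):
    """Ansatz for the tau 4-momentum: the unknown proper flight time
    is replaced by its PDG mean, so p_i = m_tau * dx_i / (c * tau_tau);
    the energy then follows from the tau mass constraint."""
    p = tuple(M_TAU * (s - q) / C_TAU for s, q in zip(sec_vtx, prim_vtx))
    e = math.sqrt(M_TAU**2 + sum(pi * pi for pi in p))
    return p, e
```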

IV. CLASSIFICATION WITH DNNs
The structure of the data and neural network architecture follows Ref. [20]. We start from the code used there. For the convenience of the reader, we summarize the technical description of our DNN model in Appendix A.
Simulated data consist of events where all decay products are stored together with their flavors. The 4-momenta of the laboratory frame are stored and (whenever they are needed) transformed to their respective rest frames, as explained in Sec. II. With respect to the analysis published in Ref. [20] we explore approximate information on neutrino momenta derived from the kinematical constraints of the Higgs decay products. We show that significant improvement may originate from even very inaccurate information on the azimuthal angles of the neutrinos' directions.
We explore the potential of classification with the DNN technique with several variants of the feature lists, as detailed in Table II. They are grouped and marked as Variant-X.Y, where X labels the choice of the main features and Y (in most cases) labels whether they are calculated from the generator-level 4-momenta or from the approximation; it may also mark whether additional high-level variables are used. This gives us a very useful tool to quantify how much of the DNN performance is lost due to particular approximations made on the groups of features.
In Table III, we collect the AUC and average precision (APS) scores [51] obtained on the test sample of simulated data (i.e., events not used for training or validation) with the DNN trained for 50 epochs and with a dropout of 0.20. Both are comparable, with the APS scores being systematically slightly lower, except for a few cases of the a1±a1∓ channel. This configuration was found to be the most stable for the comparison of Variant-X.Y classifications, but it does not necessarily represent the optimal performance of a particular variant of the feature list. In the first line of Table III we recall the oracle predictions [52]; for details, see Appendix A. They cannot be outperformed by the DNN of any Variant-X.Y. They may not be reached even with a feature list containing the complete set of 4-momenta of τ decay products, denoted as Variant-All.
In the following subsections we discuss these results in detail.

Table II. Lists of features for ML classification, marked as Variant-X.Y. In the third column, the numbers of features for the ρ±ρ∓, a1±ρ∓, and a1±a1∓ channels are given. All components of the 4-momenta are taken in the hadronic decay product rest frame. The primary resonances (ρ±, a1±) are aligned with the z axis. E_x^miss and E_y^miss are in the laboratory frame. In practice, instead of p_ν^T and ϕ_ν, the pair of variables p_ν^T cos ϕ_ν and p_ν^T sin ϕ_ν is used.

A. Benchmarks using all or only hadronic decay products
For the first benchmark each event is represented by the 4-momenta of both τ leptons' decay products (including neutrinos) in the rest frame of all hadronic decay products. This set of features is denoted as Variant-All. The results are displayed in the second and third lines of Table III. The DNN should be able to reproduce the oracle predictions, which is almost the case if dropout is not used; it only approaches them for the baseline configuration with a dropout of 0.20. The dropout lowers the DNN's performance with Variant-All, but we have verified that for other feature lists this is not always the case. It helps to suppress overfitting, as illustrated in Fig. 8 of Appendix A. In the top panel of Fig. 3 we show, for the a1±ρ∓ channel and Variant-All, the AUC score as a function of the number of epochs used for training and validation. Scores up to about 0.75 are reached for the validation sample.
For the second benchmark, following Ref. [20], the same events are used but with the features limited to the 4-momenta of the visible τ-lepton decay products and quantities derived directly from them [53]. In the bottom panel of Fig. 3 we show the ROC curves in the true positive rate (TPR) versus false positive rate (FPR) plane for Variant-All and Variant-1.0.
The achieved AUC and APS scores are collected in Table III. The large difference in the AUC and APS performance between the Variant-All and Variant-1.0 feature sets is present for all channels. In the following we attempt to improve the performance using information on the neutrino momenta and in particular their azimuthal angles.

B. Adding neutrino momenta
In this subsection we present improvements due to the energy and longitudinal neutrino momenta. Such an extension of the feature list is not expected to be very beneficial, as CP information is carried by the transverse degrees of freedom, but it may optimize the use of information learned from the correlations of hadronic decay products.
With the assumptions explained in Sec. III, we approximate each of the neutrino momentum components E_ν, p_ν^z, and p_ν^T in the rest frame of the hadronic decay products. It is interesting to first check the potential impact of that information, i.e., when truth-level values are used. We add the laboratory-frame E_x^miss, E_y^miss, redundant to some extent, as they were already used in Eq. (7) for p_ν^z. The augmented feature list using the true components of the neutrino momenta is denoted as Variant-2.0; its counterparts based on the approximate components are denoted as Variant-2.1 and Variant-2.2.

Table III. The AUC and APS scores for discriminating scalar and pseudoscalar CP states of the Higgs boson, obtained on the test sample. The DNN was trained for 50 epochs with a dropout of 0.2 (except for the explicitly marked case of Variant-All). Results for the ρ±ρ∓, a1±ρ∓, and a1±a1∓ channels are given. The first column labels the choice of features. For details, see Table II.

In Fig. 4 we show the DNN performance for the a1±ρ∓ samples as a function of the number of training epochs: the AUC score achieved as a function of the number of epochs and the ROC curves.
All three approximations for E_ν, p_ν^z, and p_ν^T are studied for the feature sets Variant-2.1 and Variant-2.2.
The differences between Approx-1, Approx-2, and Approx-3 are small but will certainly be apparent once detector effects are included.
Clearly, the improvement from the approximate information on the neutrino energy and momenta (longitudinal component and overall size of the transverse components) is rather small for all three channels. The most sensitive information on the CP state lies in the azimuthal angles of the individual neutrinos, that is, in the individual p_ν^x, p_ν^y components in the hadronic decay products rest frame, rather than in the overall size of p_ν^T. Realistically, any information on the individual p_ν^x, p_ν^y could be reconstructed only if the measurement of the τ decay vertices were possible. In the next section, we evaluate how accurately this information has to be known to become useful. This constitutes a separate experimental challenge. Note that at this step all components of the ν_τ momenta except the individual p_ν^x, p_ν^y are reconstructed sufficiently well from the measurable quantities.

C. Azimuthal angles of neutrinos from decay vertices
The azimuthal angles ϕ_ν1 and ϕ_ν2 can be obtained from the measurement of the τ-lepton decay vertices. This allows one to reconstruct the τ-lepton momenta and hopefully can be used for our purpose as well. This is a rather widely used technique in experimental measurements (see, e.g., Ref. [54]), but so far it has been used for τ-mass and τ-lifetime measurements rather than for neutrino azimuthal angles.
We do not aim to reconstruct these angles; instead, we simply calculate them from the neutrino 4-momenta and add them to the feature lists of Variant-3.0 and Variant-3.1.β [55]. The first includes the true ϕ_ν1^true and ϕ_ν2^true, and the second the smeared ϕ_ν1^smear and ϕ_ν2^smear. The AUC scores are evaluated for β in the range (0,2). In Fig. 6 the AUC scores for the test samples of the three channels are given as a function of β. The AUC scores for β = 0.0 reproduce (as they should) those of Variant-3.0 and are not very far from the scores of Variant-All. This is because the only difference is the approximate information on the energy and on the longitudinal and transverse momenta of the neutrinos. For β above 1.4, the AUC scores decrease to those of the Variant-2.1 sets, which are then equivalent to not having information on the neutrino azimuthal angles at all. Even ϕ_ν1^smear and ϕ_ν2^smear corresponding to the rather large β = 0.4 contribute sizably to the Higgs CP sensitivity. The derivative of the sensitivity with respect to β reaches its maximum at about β = 0.35 and remains roughly constant until β = 0.9; beyond that, nearly all of the sensitivity gain is lost. For even larger β, the loss of sensitivity continues, but as the contribution is then already small, the deterioration is small too.
Let us now check whether the DNN algorithm is sensitive to the precise modeling of the ϕ_ν1,2 resolution. To this end, for the validation and test samples, we introduce [56] an additional polynomial component for the smearing. This should mimic the impact of inefficiencies (mismodeling) of the DNN training sample with respect to what is present in the validation or test samples. In Fig. 5 the distribution of ϕ_ν^true − ϕ_ν^smear is given for β = 0.4 and b, c = 0.3, 0.8. In Table IV the AUC scores are given for the ρ± − ρ∓, a1± − ρ∓, and a1± − a1∓ channels and for β = 0.2, 0.4, 0.6 with further choices of b and c. The additional polynomial component of the smearing introduced to the test sample barely affects the DNN performance: the degradation due to b, c = 0.3, 0.8 is small. The results provide some encouraging insight into the DNN's capacity to exploit imprecise information and point to a possible direction for studies of systematic uncertainties [57].
In our study, when the precision of the experimental inputs was expected to be better than that from the decay-vertex impact parameters, we reconstructed the neutrino momentum components from the hadronic products and conservation laws. Only the ϕ_ν angles required this rather low-precision input. From Fig. 6 we can expect that an approximate ϕ_ν angle with an ambiguity of up to ±π/4 may sizably improve the sensitivity.
Such a conjecture on the size of the ϕ_ν smearing that is critical for CP sensitivity is of interest for any ML application. For β = 1.2 the shift Δϕ_ν is bigger than π/4 in a sizable fraction of events. In this case, the DNN solution does not gain sensitivity from ϕ_ν. Still, an approach relying less on the ϕ_ν measurement and more on restricting which events should be dropped from the analysis could be useful. For large smearing, the elimination of events with a high risk of ϕ_ν misreconstruction may be appropriate, as was attempted in Ref. [14]. A discussion of the physics properties guiding the choice of ML algorithms may then be of interest.

D. Tau lepton direction
The approximate information on the τ-lepton direction enables the DNN to constrain the neutrinos and significantly improve the classification. For this purpose, Variant-4.0 and Variant-4.1 are defined in Table II. In Table III, the performances of the DNN are presented when the true-level or approximate τ-lepton spatial momentum components in the ρ± − ρ∓, a1± − ρ∓, and a1± − a1∓ rest frames are added. We observe a significant improvement of the performance with respect to Variant-2.1 and a performance comparable to the Variant-3.1.X family. In fact, the performance of Variant-4.0 is close to that of Variant-All. Variant-4.1 is a bit lower, close to Variant-3.1.4. In that case, only the τ direction in the laboratory frame is exact and the energy is obtained from the simple ansatz of Sec. III D. When such a τ 4-momentum is boosted to the ρ± − ρ∓, a1± − ρ∓, and a1± − a1∓ rest frames, its direction absorbs some biases. The results of Variant-4.1 indicate that the DNN efficiently converts such an input into information on ν_τ.

V. SUMMARY
From the perspective of theoretical modeling, the CP-parity phenomenology in the cascade decay H → ττ, τ± → had± ν_τ is rather simple, because the matrix elements can be easily defined. On the other hand, the parity effect manifests itself in rather complicated features of multidimensional distributions, where kinematic constraints related to ultrarelativistic boosts and detection ambiguities play an important role in the reconstruction of the τ decay kinematics. Our aim was to evaluate the precision requirements for experimental features to become useful.
In our previous paper [20] we studied the performance of the DNN binary classification technique for the hadronic τ-lepton decay products only. In this paper we turned our attention to the ν τ momenta.
Whenever possible, we exploited constraints on the τ mass, the H mass, and energy-momentum conservation to minimize the dependence on the highly smeared neutrino kinematics deduced from the impact parameters of the τ decay and production vertices. The resulting set of expert variables helps the DNN algorithms identify physics-sensitive variables useful for distinguishing the event classes.
The longitudinal components of the neutrino momenta, reconstructed with the help of an approximation (but only from the visible decay products), alone improved the AUC scores from 0.656, 0.609, and 0.580 to about 0.664, 0.622, and 0.591 for the ρ± − ρ∓, a1± − ρ∓, and a1± − a1∓ cases, respectively. The improvement of the Higgs boson CP sensitivity is rather minuscule, even when detector effects are not taken into account.
A more significant improvement came when the transverse components of the neutrino momenta were known, even imprecisely. This can be achieved if the τ-lepton decay vertices are measured and used to reconstruct the directions of the τ-lepton momenta. The performance of such a reconstruction is detector specific and is a challenge. We have estimated how large an improvement in CP sensitivity is obtained as a function of the detection smearing of the azimuthal angles ϕ_ν1 and ϕ_ν2. Even with a large smearing of β = 0.4, the AUC scores improved from 0.664, 0.622, and 0.591 to about 0.738, 0.714, and 0.687 for the ρ± − ρ∓, a1± − ρ∓, and a1± − a1∓ cases, respectively. Note that ϕ_ν1 and ϕ_ν2 represent an intermediate step in the transition from expert variables to DNN algorithms with direct use of low-level features. We leave the topic of the angle measurements and their use to future work.
A similar performance is expected when high-quality τ-lepton laboratory-frame direction data (as seen in the rest frame of all visible Higgs decay products) are available for the evaluation of the τ direction. The ambiguity in the laboratory-frame τ energy is not that important. With the approximate τ± directions (Variant-4.1), the AUC scores reached 0.738, 0.704, and 0.683 for the ρ± − ρ∓, a1± − ρ∓, and a1± − a1∓ cases, respectively.
In Fig. 7 we show the ROC curves for the different feature lists discussed in this paper.
The concept of optimal observables has been used for many years to obtain phenomenologically sound results. It provides essential tests for ML classification, where multidimensional input is used. An approach where sophisticated methods are used to measure h± of Eq. (2) should also be mentioned. All of the complexity of the hadronic τ decays and detector response is then hidden in each τ± polarimetric vector h±. Once an algorithm for h± reconstruction is obtained, the last step of the CP phenomenology is straightforward: the details of the τ± decay channels and detector effects are resolved. The complexity of h± is smaller than that of the entire H → ττ cascade decay. It is independent of the Higgs phenomenology, and calculations can rely on the much more abundant Z → ττ data. Such a possibility was mentioned in Ref. [58] and is being pursued by, e.g., the CMS Collaboration. Then, ML techniques could be used to reconstruct the h± vectors from the complex detector responses to particular τ decay channels and the details of their decay vertex positions.
The evaluation of which of the methods is best, or even how complementary the methods can be, requires the work of experimental groups.
Recently, in Ref. [59], classifiers specifically tuned to tackle the Lorentz-group features of high-energy physics signatures were prepared and used. This could be useful for Variant-1.0, where only the 4-momenta of secondary H → ττ decay products were used. In the present work this may be less straightforward, because some of the features are intimately related to the laboratory frame and their transformation to other frames may be poorly defined. This is why an expert-variable-style reconstruction of the neutrino azimuthal angles may be an efficient course to follow, or at least useful to better understand the limitations and ambiguities of these methods.

The structure of the simulated data and the DNN architecture follow those published in our previous paper [20]. The implementation is prepared for TensorFlow [60], an open-source machine learning library. The learning procedure is optimized using a variant of the stochastic gradient descent algorithm called Adam [61]. We also use batch normalization [62] (which has regularization properties) and dropout [63] (which prevents overfitting) to improve the training of the DNN. The problem of determining the Higgs-boson CP state is framed as a binary classification because the aim is to distinguish between the two possible (scalar and pseudoscalar) Higgs CP states.
We consider three separate problems, one for each H → ττ channel, and solve all three using the same neural network architecture. Depending on the decay channel of the outgoing τ pairs, each case contains a different number of dimensions to describe an event, i.e., the production of the Higgs boson decaying into a τ-lepton pair. Each data point consists of features representing the observables/variables of the corresponding event; a data point is thus an event of Higgs-boson production and decay into a τ-lepton pair. The structure of the ith event is represented as

(f_{i,1}, …, f_{i,D}; w_i^a, w_i^b),

where f_{i,1}, …, f_{i,D} are numerical features, and w_i^a and w_i^b are weights proportional to the likelihood that the event comes from set A or B (binary scalar or pseudoscalar classification). The weights calculated from the quantum-field-theory matrix elements are available and stored in the simulated data files. This is a convenient situation that does not occur in many other cases of ML classification. The A and B distributions overlap strongly in (f_{i,1}, …, f_{i,D}) space; a more detailed discussion can be found in Ref. [20]. Perfect separation is therefore not possible, and w_i^a/(w_i^a + w_i^b) corresponds to the Bayes optimal probability that an event is sampled from set A and not set B. The weights w_i^a and w_i^b are used to compute targets during the training procedure. Because the model A and B samples are prepared with the same events and differ only in the spin weights w_i^a and w_i^b, the statistical fluctuations of the learning procedure are greatly reduced. This also has consequences for the actual implementation of the DNN metric and code.
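The soft training target implied by the weights, w_i^a/(w_i^a + w_i^b), can be computed directly from the stored spin weights; a minimal sketch (the function name is ours):

```python
import numpy as np

def bayes_targets(w_a, w_b):
    """Bayes optimal probability that an event belongs to set A (scalar)
    rather than set B (pseudoscalar), given the two spin weights."""
    w_a = np.asarray(w_a, dtype=float)
    w_b = np.asarray(w_b, dtype=float)
    return w_a / (w_a + w_b)
```

Because both weights are attached to the same simulated event, the targets are smooth per-event probabilities rather than hard 0/1 labels, which is what reduces the statistical fluctuations of the training mentioned above.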
To quantify the classification performance, weighted AUC and APS scores are used [64], and we have not explored further alternatives. For each data point x_i, the DNN returns a probability p_i that it is of type A; the event contributes to the final loss function twice, as type A with weight w_i^a and as type B with weight w_i^b. With this definition, AUC = 0.5 would be obtained for a random assignment, while AUC = 1.0 would be obtained for perfect separation. As in the studied problems the distributions overlap, the best achievable AUC ≃ 0.78 is obtained only with p_i = w_i^a/(w_i^a + w_i^b) (oracle predictions). The value depends slightly on the case studied, due to the applied minimal set of cuts on the kinematics of the τ decay products, which partly emulate detector conditions.
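The weighted AUC described above can be evaluated by letting every event enter twice, once per class with its respective weight. A numpy-only sketch of this scheme (an O(N²) pairwise implementation, for illustration only; a production analysis would use a library routine such as scikit-learn's roc_auc_score with sample weights):

```python
import numpy as np

def weighted_auc(scores, w_a, w_b):
    """Each event enters twice: as class A with weight w_a and as class B
    with weight w_b.  The AUC is the weighted probability that a random A
    event scores higher than a random B event (ties count one half)."""
    s = np.asarray(scores, dtype=float)
    wa = np.asarray(w_a, dtype=float)
    wb = np.asarray(w_b, dtype=float)
    gt = (s[:, None] > s[None, :]).astype(float)   # A scored above B
    eq = (s[:, None] == s[None, :]).astype(float)  # ties
    num = np.sum(wa[:, None] * wb[None, :] * (gt + 0.5 * eq))
    den = np.sum(wa) * np.sum(wb)
    return num / den
```

With oracle predictions p_i = w_i^a/(w_i^a + w_i^b) this construction yields the best achievable score, while identical scores for all events give 0.5, matching the random-assignment baseline quoted above.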
Weighted events with w^a and w^b are used for convenience and to limit statistical fluctuations. We have repeated a part of the DNN classification chain (training, validation, testing) using unweighted events as well [66] and found very good consistency of the performance.
The DNN architecture [67] consists of a D-dimensional input (the feature list) followed by six hidden layers of 300 nodes each, with ReLU [68] activation functions, and a one-dimensional output layer returning the probability p_i of the choice, calculated using the softmax function. The metric minimized by the model is the negative log-likelihood of the true targets under Bernoulli distributions.
The parameter that was optimized with respect to Ref. [20] was the dropout [63]. For the analysis presented in Ref. [20], the AUC score was obtained after training with five epochs, which was considered sufficient. Here, given the variety of feature lists, we performed training over a much larger number of epochs and studied the optimal working point in the number of epochs and the dropout. A 20% dropout seemed to be the optimal choice to avoid the overfitting that occurs with a larger number of training epochs. The best performance was achieved after 5-50 epochs, depending on the case. Training with 50 epochs was used on the test samples to quote the final AUC score. While optimizing the dropout level, we observed that although it leads to a more robust DNN (smaller risk of overfitting), the performance was sometimes somewhat reduced. The positive impact of the dropout is illustrated (on a data set consisting of one million event samples) in Fig. 8 for the ρ± − ρ∓ channel and the feature list Variant-1.1, for training and validation.
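The architecture described above can be sketched in tf.keras. This is only an approximation of the setup of Ref. [20]: the ordering of batch normalization and dropout within each block, and the use of a two-node softmax (equivalent to a sigmoid on a single logit, so its first component plays the role of the one-dimensional probability p_i) are our assumptions:

```python
import tensorflow as tf

def build_model(n_features, n_layers=6, n_nodes=300, dropout=0.2):
    """Six hidden ReLU layers of 300 nodes with batch normalization and
    20% dropout, and a softmax output whose first component is p_i."""
    inputs = tf.keras.Input(shape=(n_features,))
    x = inputs
    for _ in range(n_layers):
        x = tf.keras.layers.Dense(n_nodes, activation="relu")(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.Dropout(dropout)(x)
    outputs = tf.keras.layers.Dense(2, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    # cross-entropy against the soft targets (w_a, w_b normalized) is the
    # negative log-likelihood under Bernoulli distributions
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss="categorical_crossentropy")
    return model
```

Training against the soft targets (w_i^a/(w_i^a + w_i^b), w_i^b/(w_i^a + w_i^b)) then implements the twice-weighted loss contribution of each event.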

APPENDIX B: ALTERNATIVE ML TECHNIQUES
Although deep neural networks are often used for classification tasks in high-energy physics, many other, more classical techniques are used as well and are often able to achieve similar classification performance. Despite the promising results often reported in papers, one should always remember that a machine-learning technique that performs well on one data set with specific features may deliver less promising results on another.
This observation is of a fundamental nature and results from the mathematical assumptions behind particular ML libraries. The solutions developed for the libraries depend on the application domains the particular systems were prepared for. Reference [69] can be used as a guide. The recent study in Ref. [70] compared several machine learning algorithms. While we cannot contribute much to this general topic, we show that for the application discussed throughout this paper the DNN technique indeed outperforms the more classical approaches by far. The following machine learning techniques were chosen for the comparative study: (1) boosted trees (BTs) [38], (2) random forests (RFs), and (3) support vector machines (SVMs).
The AUC scores were used to evaluate the performance of the ML methods. This is one of the recommended approaches when a single number is used to evaluate machine learning algorithms on binary classification problems. To minimize bias, the comparisons were carried out on the same data sets [72].
For the BT method the point of interest was to check the dependence of the obtained results on the depth of a tree. The AUC score as a function of tree depth is given in Fig. 9 for the ρ± − ρ∓ case and several feature lists. As the depth affects the complexity of the computation, a search for an optimal choice in terms of both results and computational efficiency was performed. A tree depth between 3 and 10 was suggested in Ref. [38]. At first we used depths from 3 to 20; the upper bound was then increased to see the trend in the AUC score plots. The scores seemed to increase up to a depth of 20. Additional evaluations with the depth equal to the number of features of a given Variant-X.Y were also carried out.
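A depth scan of this kind can be sketched with scikit-learn's gradient-boosted trees (the specific BT implementation of Ref. [38] may differ; the function name and the loop structure are ours):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def auc_vs_depth(X_train, y_train, X_test, y_test, depths):
    """Train one boosted-tree classifier per tree depth and record the
    AUC score obtained on the held-out test set."""
    scores = {}
    for depth in depths:
        clf = GradientBoostingClassifier(max_depth=depth, random_state=0)
        clf.fit(X_train, y_train)
        p = clf.predict_proba(X_test)[:, 1]
        scores[depth] = roc_auc_score(y_test, p)
    return scores
```

Plotting the returned dictionary against depth reproduces the kind of scan shown in Fig. 9.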
As suggested in the literature [73], 128 trees (estimators) were used for the RF method. For the next-best split during tree building, a number of features equal to log2(N_f) or √N_f, where N_f denotes the number of original features, was tried; the performances of the two choices are comparable. The optimal tree-depth value found for the BTs was used. Tests with a larger number of trees (300) and greater tree depths (30) were also carried out.
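The two split-candidate settings can be compared with scikit-learn's RandomForestClassifier, whose max_features option accepts exactly the "sqrt" and "log2" rules mentioned above (a sketch; the helper name and toy usage are ours):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def rf_auc(X_train, y_train, X_test, y_test,
           max_features="sqrt", max_depth=None):
    """128-tree random forest; max_features selects the number of
    split candidates per node ('sqrt' or 'log2' of N_f)."""
    clf = RandomForestClassifier(n_estimators=128,
                                 max_features=max_features,
                                 max_depth=max_depth, random_state=0)
    clf.fit(X_train, y_train)
    return roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
```

Calling the helper once with max_features="sqrt" and once with "log2" gives the comparison described in the text.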
For the SVM method, tests were first performed to determine which kernel, linear or radial basis function (RBF), gives more promising results. This resulted in the choice of the RBF kernel, with more stable results. In the next step, the parameters C and γ (soft margin and kernel parameter) were fine-tuned; the tested values were evenly distributed on a logarithmic scale from 10^-3 to 10^3. To avoid excessive computation, the fine-tuning was performed only for Variant-All and on a smaller event sample. The obtained parameters, C = 10 and γ = 0.1, were then used to train the classifier on the other feature lists as well.
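The C-γ grid described above can be sketched with scikit-learn (the grid values follow the text; the cross-validation setup and function name are our assumptions):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def tune_svm(X, y):
    """Grid search over C and gamma, evenly spaced on a log scale from
    1e-3 to 1e3, for an RBF-kernel SVM, selecting on the AUC score."""
    grid = {"C": np.logspace(-3, 3, 7),
            "gamma": np.logspace(-3, 3, 7)}
    search = GridSearchCV(SVC(kernel="rbf"), grid,
                          scoring="roc_auc", cv=3)
    search.fit(X, y)
    return search.best_params_, search.best_score_
```

Running the search on a reduced sample, as done in the text, keeps the 49-point grid affordable; the best parameters are then reused for the full feature lists.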
A comparison of the best performance for the ρ AE − ρ ∓ channel and different feature lists is shown in Table V. Clearly, the performance of the DNN is outstanding.
We compared the execution times, memory usage, and efficiency for all of the classifiers, using the Prometheus cluster [74]. Jobs were executed on one node with four tasks per node and 5 GB of memory per task. The comparison is reported in Table VI. By far, the SVM training took the longest. The BT training took the least time, under 10 minutes, which made it 8 times faster to train than the DNN. Both the DNN and RF used the resources well, achieving efficiencies of 68.8% and 96.1%, respectively.
The training times varied significantly between the ML methods. This is probably due to relations between the features that the ML algorithms do not anticipate. The Higgs CP state classification turned out to be a challenge for the more classical ML algorithms.