Jet-Origin Identification and Its Application at an Electron-Positron Higgs Factory


Introduction.-Quarks and gluons are standard model (SM) particles that carry color charges of the strong interaction. Due to the color confinement of quantum chromodynamics (QCD), colored particles cannot travel freely in spacetime and are confined to composite particles like hadrons. Once generated in high-energy collisions, quarks and gluons fragment into numerous particles that travel in directions approximately collinear to the initial colored particles. These sprays of collinear particles are called jets, see Fig. 1.
We define jet origin identification as the procedure to determine from which colored particle a jet is generated, and consider 11 different kinds: b, b̄, c, c̄, s, s̄, u, ū, d, d̄, and gluon. A successful jet origin identification is critical for experimental particle physics at the energy frontier. At the Large Hadron Collider (LHC), successfully distinguishing quark jets from gluon jets could efficiently reduce the typically large background from QCD processes [1-7]. Jet flavor tagging is essential for the Higgs property measurements at the LHC [5,6,8,9]. The determination of jet charge [10,11] was essential for weak mixing angle measurements at both LEP and LHC [12], is critical for time-dependent CP measurements [14,15], and could have a significant impact on Higgs boson property measurements [16].
We realize the concept of jet origin identification in physics events at an electron-positron Higgs factory using a Geant4-based simulation [18] (referred to as full simulation for simplicity), since the electron-positron Higgs factory is identified as the highest-priority future collider project [19,20]. We develop the necessary software tools, Arbor [21? ] and ParticleNet [22], for the particle flow event reconstruction and the jet origin identification. We demonstrate the jet origin identification performance using an 11-dimensional confusion matrix (referred to as M11 for simplicity), which exhibits the performance of jet flavor tagging and jet charge measurements. We apply the jet origin identification to rare and exotic Higgs boson decay measurements under the CEPC nominal Higgs operation scenario. This scenario expects an integrated luminosity of 20 ab⁻¹ at √s = 240 GeV, and could accumulate 4 million Higgs bosons [20,23]. We analyze the rare decays H → ss̄, uū, and dd̄, and the flavor-changing neutral current (FCNC) decays H → sb, ds, db, and uc (here sb denotes s b̄ or s̄ b, and similarly for ds, db, and uc). We derive upper limits ranging from 10⁻³ to 10⁻⁴ for these seven processes. In the SM, the predicted branching ratio for the H → ss̄ process is 2.3 × 10⁻⁴ [24] and the derived upper limit corresponds to three times the SM prediction. The branching ratios for H → uū and dd̄ are expected to be smaller than 10⁻⁶ [24-27], while branching ratios for the above-mentioned FCNC processes are expected to be smaller than 10⁻⁷ from loop contributions [28].
FIG. 1. Event display of an e⁺e⁻ → νν̄H → νν̄gg (√s = 240 GeV) event simulated and reconstructed with the CEPC baseline detector [17]. Different particles are depicted with colored curves and straight lines: red for e±, cyan for µ±, blue for π±, orange for photons, and magenta for neutral hadrons.
Detector Geometry and Software Tools.-We simulate νν̄H, H → uū, dd̄, ss̄, cc̄, bb̄, and gg processes at 240 GeV center-of-mass energy with the CEPC baseline detector [17]. The CEPC baseline detector design is a particle-flow-oriented concept composed of a high-precision vertex system, a large-volume gaseous tracker, high-granularity calorimetry, and a large-volume solenoid. We use Pythia-6.4 [29] for the event generation and MokkaPlus [30,31] for the Geant4-based detector simulation [18]. The simulated samples are processed with the Arbor particle flow algorithm, which reconstructs all final-state particles and identifies their species. The reconstructed final-state particles in a physics event are clustered into two jets using the e⁺e⁻ kt algorithm [32,33]. For each jet, the kinematic and species information of all its final-state particles, including the track impact parameters associated with charged final-state particles, is input to a modified ParticleNet algorithm. The algorithm calculates the likelihoods corresponding to the 11 different jet categories. For each process, one million physics events are simulated, of which 600k events are used for training, 200k for validation, and 200k for testing. The model is trained for 30 epochs, and the epoch demonstrating the best accuracy on the validation sample is selected and applied to the testing sample to extract the numerical results.
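The clustering step can be illustrated with a minimal sketch of an e⁺e⁻ kt (Durham-type) algorithm run in exclusive two-jet mode. The distance measure y_ij = 2 min(E_i², E_j²)(1 − cos θ_ij)/E_vis² and the four-momentum-sum recombination are the textbook choices and are assumed here; the function names and the toy four-momenta are hypothetical, and the analysis itself relies on the implementations of Refs. [32,33].

```python
import math

def durham_distance(p, q, e_vis):
    """Durham measure y_ij for four-momenta given as (E, px, py, pz)."""
    mag_p = math.sqrt(p[1] ** 2 + p[2] ** 2 + p[3] ** 2)
    mag_q = math.sqrt(q[1] ** 2 + q[2] ** 2 + q[3] ** 2)
    if mag_p == 0.0 or mag_q == 0.0:
        return 0.0
    cos_theta = (p[1] * q[1] + p[2] * q[2] + p[3] * q[3]) / (mag_p * mag_q)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding
    return 2.0 * min(p[0], q[0]) ** 2 * (1.0 - cos_theta) / e_vis ** 2

def cluster_exclusive(particles, n_jets=2):
    """Merge the closest pair repeatedly until n_jets pseudo-jets remain."""
    jets = [tuple(p) for p in particles]
    e_vis = sum(p[0] for p in jets)
    while len(jets) > n_jets:
        best = None
        for i in range(len(jets)):
            for j in range(i + 1, len(jets)):
                y = durham_distance(jets[i], jets[j], e_vis)
                if best is None or y < best[0]:
                    best = (y, i, j)
        _, i, j = best
        # Recombine by adding the four-momenta (assumed E-scheme).
        merged = tuple(a + b for a, b in zip(jets[i], jets[j]))
        jets = [jet for k, jet in enumerate(jets) if k not in (i, j)]
        jets.append(merged)
    return jets

# Hypothetical toy event: two particles near +z and two near -z.
toy_particles = [
    (30.0, 1.0, 0.0, 29.98),
    (20.0, -1.0, 0.0, 19.97),
    (25.0, 0.5, 0.0, -24.99),
    (15.0, -0.5, 0.0, -14.99),
]
toy_jets = cluster_exclusive(toy_particles)
toy_energies = sorted(jet[0] for jet in toy_jets)
print(toy_energies)
```

In this toy event the two particles in each hemisphere are merged first, leaving one jet per hemisphere.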
Information on the species of the final-state particles is critical for jet origin identification. We compare three scenarios to understand the impact of particle identification. The first scenario assumes perfect identification of charged leptons, i.e., e± and µ± can be perfectly differentiated from each other and from charged hadrons. The second scenario further assumes perfect identification of the species of charged hadrons (proton, antiproton, π±, and K±). On top of the second scenario, the third one assumes perfect identification of K⁰_S and K⁰_L. For simplicity, the assignment of particle identification is based on MC truth. On the other hand, full simulation performance studies show that the CEPC baseline detector could identify leptons with an efficiency of 99.5% with a hadron-to-lepton misidentification rate below 1% [34,35]. It could also distinguish different species of charged hadrons (π±, K±, proton, and antiproton) to better than 2σ [36-38] and reconstruct K⁰_S and Λ with a typical efficiency (purity) of 80% (90%) if they decay into charged particles [39]. Therefore, the second scenario is used as the default one since it matches the CEPC baseline detector performance, while the third scenario is used for comparison as the K⁰_L identification remains challenging.
Figure 2 shows the overall jet origin identification performance with an 11-dimensional confusion matrix, M11, derived by classifying each jet into the category with the highest likelihood. In the quark sector, M11 is approximately symmetric and block-diagonal with 2 × 2 blocks, each corresponding to a specific species of quark. Meanwhile, gluon jets can be identified with an efficiency of 67%.
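The construction of such a matrix can be sketched as follows: each jet is assigned to the category with the highest likelihood, and each truth row is normalized to unity, as stated in the caption of Fig. 2. The label ordering, the helper `peaked`, and the three toy jets are hypothetical and serve only to illustrate the bookkeeping.

```python
LABELS = ["b", "bbar", "c", "cbar", "s", "sbar", "u", "ubar", "d", "dbar", "g"]

def confusion_matrix(truth_indices, likelihoods, n_classes=len(LABELS)):
    """Row-normalized confusion matrix from per-jet likelihood vectors."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for truth, scores in zip(truth_indices, likelihoods):
        # Classify the jet into the category with the highest likelihood.
        predicted = max(range(n_classes), key=lambda k: scores[k])
        counts[truth][predicted] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows without any truth entries are left as zeros.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

def peaked(k, peak=0.6, n_classes=len(LABELS)):
    """Hypothetical likelihood vector peaking at category k."""
    rest = (1.0 - peak) / (n_classes - 1)
    return [peak if i == k else rest for i in range(n_classes)]

# Toy input: two true b jets (one mistaken for bbar) and one true gluon jet.
toy_truth = [0, 0, 10]
toy_scores = [peaked(0), peaked(1), peaked(10)]
toy_matrix = confusion_matrix(toy_truth, toy_scores)
print(toy_matrix[0][:2], toy_matrix[10][10])
```

With real inputs, the 2 × 2 blocks along the diagonal collect the quark versus antiquark confusion of each flavor.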
The performance of the jet origin identification can be studied in more detail via jet flavor tagging efficiencies and charge flip rates. For each jet, we compare the gluon likelihood and the five sums of quark and anti-quark likelihoods of every kind. The jet flavor is then defined as the kind with the highest value. The jet charge is determined by comparing the likelihoods between the quark and the anti-quark. Figure 3 illustrates the derived jet flavor tagging efficiencies and charge flip rates, which slightly differ from M11 due to the different procedure described above.
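The procedure above can be written as a short sketch. The ordering of the 11 likelihoods (b, b̄, c, c̄, s, s̄, u, ū, d, d̄, gluon) and the function name are assumptions made for illustration. The example jet is chosen so that the flavor decision differs from a plain highest-likelihood classification, which is one way the results of Fig. 3 can deviate from M11.

```python
QUARKS = ["b", "c", "s", "u", "d"]

def flavor_and_charge(scores):
    """Return (flavor, charge label) from 11 likelihoods.

    Assumed ordering: b, bbar, c, cbar, s, sbar, u, ubar, d, dbar, g.
    """
    # Sum quark and anti-quark likelihoods for each of the five flavors.
    summed = {q: scores[2 * i] + scores[2 * i + 1] for i, q in enumerate(QUARKS)}
    summed["g"] = scores[10]
    flavor = max(summed, key=summed.get)
    if flavor == "g":
        return flavor, None
    i = QUARKS.index(flavor)
    # The charge follows from comparing quark and anti-quark likelihoods.
    label = "quark" if scores[2 * i] >= scores[2 * i + 1] else "antiquark"
    return flavor, label

# Hypothetical jet: gluon has the largest single likelihood (0.30),
# but the summed strange likelihood (0.22 + 0.20) is larger.
example = [0.05, 0.05, 0.02, 0.02, 0.22, 0.20, 0.03, 0.03, 0.04, 0.04, 0.30]
print(flavor_and_charge(example))  # ('s', 'quark')
```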
Figure 3 additionally compares the performance under different particle identification scenarios. In the default scenario, represented by the solid lines, the b/c/s jets could attain tagging efficiencies of 92%/79%/67% and charge flip rates of 19%/7%/17%, respectively. The identification of u and d jets is less accurate, amounting to tagging efficiencies of 37% to 41% and jet charge flip rates of 13% to 24%. Noticeably, the down-type jets have a significantly higher jet charge flip rate than the up-type jets, since the latter carry twice the absolute charge of the former. Of all types, the c jets have the lowest charge flip rate as they are heavier and of the up-type. Figure 3 also exhibits the impact of final-state particle identification on jet origin identification. Compared to the scenario with only lepton identification, introducing charged hadron identification (the default scenario) enhances the s-tagging efficiency from 47% to 67%. Concurrently, it reduces the jet charge flip rates across all types except for u. Additionally, it significantly improves the d-tagging efficiency. The third scenario that includes neutral kaon information further enhances the s-tagging efficiency to 74%. However, the jet charge flip rates remain the same as in the second scenario, since K⁰_S and K⁰_L are superpositions of the |s d̄⟩ and |s̄ d⟩ states, meaning their identification has no impact on distinguishing quarks from anti-quarks.
Benchmark Physics Analyses.-The precise measurement of Higgs boson properties is a central objective for particle physics. The anticipated precision of Higgs measurements at future Higgs factories has been extensively studied, showing that the major SM decay modes can be measured with a relative accuracy of 0.1% to 1% at electron-positron Higgs factories [20,40-42], surpassing the expected precision at the High Luminosity-LHC (HL-LHC) by one order of magnitude [43]. Meanwhile, the rare and FCNC decays of the Higgs boson are of great interest.
FIG. 3. Jet flavor tagging efficiencies (ε) and charge flip rates (P) with perfect identification of leptons (the first scenario, denoted as ℓ± in the legend), plus identification of charged hadrons (the second and default scenario, denoted as K±) and neutral kaons (the third scenario, denoted as K⁰_L/S).
We explore the anticipated upper limits of H → ss̄, uū, dd̄, and H → sb, ds, db, uc at the CEPC, where Higgs bosons are mainly produced via the Higgsstrahlung (ZH) and vector boson fusion (e⁺e⁻ → ν_e ν̄_e H, e⁺e⁻ → e⁺e⁻H) processes [48]. Our simulation analyses focus on the νν̄H, µ⁺µ⁻H, and e⁺e⁻H channels, with expected event yields of 0.926, 0.135, and 0.141 million under the CEPC nominal Higgs operation scenario, respectively.
We begin with the existing analyses of νν̄H, H → bb̄, cc̄, gg [49,50] at a center-of-mass energy of 240 GeV. These analyses consist of two stages: the first stage performs event selection to concentrate the Higgs to di-jet signal in the entire SM data sample, and the second stage identifies different flavor combinations using the LCFIPlus [32] flavor tagging algorithm. For the Higgs rare and exotic decay analyses, we re-optimize the event selections in the first stage and replace the flavor tagging in the second stage with the jet origin identification. After the event selections (described briefly in Appendix A), the leading SM backgrounds are ℓν_ℓW, νν̄Z, and ℓ⁺ℓ⁻Z events. Taking the νν̄H, H → jj analyses as an example, the event selection in this stage has a final signal efficiency of 24%, and reduces the backgrounds by six orders of magnitude, leading to a background yield of 23k.
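To put these numbers in perspective, the following back-of-the-envelope sketch estimates how many SM H → ss̄ events survive the first-stage selection in the νν̄H channel. It assumes that the 24% efficiency quoted for H → jj also applies to H → ss̄, which is not stated explicitly in the text; the variable names are hypothetical.

```python
import math

n_nunu_higgs = 0.926e6     # expected nu-nubar-H events (quoted above)
br_ss_sm = 2.3e-4          # SM branching ratio of H -> s sbar [24]
selection_eff = 0.24       # first-stage efficiency, assumed valid for s sbar
n_background = 23e3        # background yield after the first stage

n_ss_produced = n_nunu_higgs * br_ss_sm
n_ss_selected = n_ss_produced * selection_eff
naive_significance = n_ss_selected / math.sqrt(n_background)

print(f"produced: {n_ss_produced:.0f}, selected: {n_ss_selected:.0f}, "
      f"S/sqrt(B): {naive_significance:.2f}")
```

Under these assumptions roughly 50 signal events sit on top of 23k background events, which is why the second stage has to separate jet origins efficiently.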
A toy MC simulation is then applied to the remaining events to mimic the jet origin identification, by sampling the 11 likelihoods of each jet according to its origin.A gradient boosting decision tree (GBDT) classifier [51] is trained to distinguish signal and background processes using the 22 likelihoods of the jet pair in a physics event.
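The toy MC step can be sketched as below. The text does not specify how the 11 likelihoods are sampled; here it is assumed, for illustration only, that complete likelihood vectors are resampled from a pool of reference jets with the same true origin. The function names, the helper `peaked`, and the placeholder vectors are hypothetical.

```python
import random

N_CATEGORIES = 11

def build_pool(reference_jets):
    """Group reference likelihood vectors by the true origin of the jet."""
    pool = {}
    for origin, scores in reference_jets:
        pool.setdefault(origin, []).append(tuple(scores))
    return pool

def toy_event_features(origins, pool, rng):
    """Draw one likelihood vector per jet and concatenate them (2 x 11 = 22)."""
    features = []
    for origin in origins:
        features.extend(rng.choice(pool[origin]))
    return features

def peaked(k, peak=0.6):
    """Placeholder likelihood vector peaking at category k."""
    rest = (1.0 - peak) / (N_CATEGORIES - 1)
    return [peak if i == k else rest for i in range(N_CATEGORIES)]

# Placeholder reference sample; a real pool would hold full-simulation outputs.
reference = [("s", peaked(4)), ("s", peaked(8)), ("sbar", peaked(5)), ("g", peaked(10))]
pool = build_pool(reference)
rng = random.Random(1)
features = toy_event_features(("s", "sbar"), pool, rng)
print(len(features))  # 22
```

The resulting 22-component vector is the kind of input a signal-versus-background classifier would receive for each event.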
For the νν̄H, H → ss̄ analysis, the combined GBDT scores of the remaining events are illustrated in the upper panel of Fig. 4. Defining the signal strength as the ratio of the observed event yield to the SM prediction, the anticipated upper limit on the signal strength of H → ss̄ at 95% confidence level (CL) [52,53] as a function of cut value is shown in Fig. 4. With the optimal cut on the combined scores, there remain 37 events of H → ss̄ and 5.1k background events, leading to an upper limit of 3.8 on the signal strength of H → ss̄ at 95% CL. A fit to the combined score distributions further improves the upper limit to 3.5. Combined with the e⁺e⁻H and µ⁺µ⁻H channels, an expected upper limit of 3.2 on the signal strength is achieved at 95% CL. It is worth noting that in the analysis of H → ss̄, the branching ratios of all other Higgs decays are assumed to be at their SM predictions.
We analyze the H → uū and H → dd̄ decay modes using the same method. By combining all three channels, the branching ratios of H → uū and dd̄ can be constrained to 0.091% and 0.095% at 95% CL, respectively. These results are less stringent than those for H → ss̄ since the identification of u and d jets is much more challenging than that of s jets. We also analyze the H → sb, ds, db, and uc decay modes and obtain upper limits ranging from 0.02% to 0.1% for these decay modes. These results are summarized in Table I and Fig. 5.
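The order of magnitude of the cut-based limit can be cross-checked with a simple counting-experiment estimate. The sketch below uses the Gaussian (asymptotic) approximation of the median expected limit under the background-only hypothesis, µ_up ≈ Φ⁻¹(1 − α/2) · √B / S with α = 0.05. This approximation and the neglect of systematic uncertainties are assumptions of the sketch and are not necessarily the procedure of Refs. [52,53]; the function name is hypothetical.

```python
import math
from statistics import NormalDist

def expected_upper_limit(n_signal, n_background, cl=0.95):
    """Approximate median expected upper limit on the signal strength."""
    alpha = 1.0 - cl
    # Quantile for the median expected limit in the Gaussian approximation.
    z = NormalDist().inv_cdf(1.0 - 0.5 * alpha)
    # Statistical uncertainty on the signal strength for B >> S.
    sigma_mu = math.sqrt(n_background) / n_signal
    return z * sigma_mu

mu_up = expected_upper_limit(n_signal=37.0, n_background=5.1e3)
print(f"{mu_up:.2f}")
```

With 37 signal and 5.1k background events the estimate comes out close to the quoted limit of 3.8.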
Discussion and Summary.-We propose the concept of jet origin identification that distinguishes jets generated from 11 types of colored SM particles. State-of-the-art algorithms are developed to realize the concept of jet origin identification at the future electron-positron Higgs factory, achieving jet flavor tagging efficiencies ranging from 67% to 92% for bottom, charm, and strange quarks, and jet charge flip rates of 7% to 24% for all species of quarks. We analyze the impact of final-state particle identification on jet origin identification and find that charged hadron identification is critical for both jet flavor tagging and charge measurement. The identification of neutral kaons further enhances jet flavor tagging performance but has no impact on jet charge measurement, as expected.
Utilizing jet origin identification, we estimate the upper limits for seven rare and FCNC hadronic decay modes of the Higgs boson. We conclude that the branching ratios of these decay modes could be constrained to 0.02% to 0.1% at 95% CL in the nominal CEPC Higgs operation scenario. For the H → ss̄ decay, the expected upper limit is approximately three times the SM prediction, which improves upon previous studies [24,45] by more than a factor of two. The improvement here is largely attributed to our state-of-the-art jet origin identification algorithm, which is capable of exploiting the information of all particles in a jet, not just the kaons. The upper limits for H → uū/dd̄ can be interpreted as constraints on the Higgs-quark couplings of < 101 and < 37 times the SM predictions, respectively (i.e., κ_u < 101 and κ_d < 37). This improves upon existing analyses by roughly one order of magnitude. Regarding the Higgs-boson FCNC decay, a previous study using DELPHES [56] fast simulation indicated that the branching fraction for H → sb (H → db) could be constrained to 10⁻² with an integrated luminosity of 30 ab⁻¹ [57], while our results show an improvement of two (one) orders of magnitude. We also quantify the upper limits for H → uc and H → ds in Table I.
Many systematic and theoretical uncertainties are relevant to jet origin identification, including detector performance, beam-induced backgrounds, the number of pile-up events, jet kinematics, jet clustering algorithms, hadronization models, etc. Appendix B summarizes a series of relevant comparison studies. In short, we conclude that jet origin identification performance is stable with respect to jet kinematics in the relevant energy range, see Fig. 6. We observe that the performances obtained from hadronic Z processes at 91.2 GeV and the νν̄H processes at 240 GeV are statistically consistent within the detector's fiducial region, as shown in Figs. 7 and 8. In other words, the jet origin identification could be calibrated using the large number of events in the Z-pole sample to control the performance-relevant systematic uncertainties for the physics measurements, including the Higgs property measurements. We observe comparable performance for different hadronization models with small but visible differences, see Fig. 9. These analyses lay the foundation for the application of jet origin identification at the energy frontier, especially in physics measurements with relatively larger statistical uncertainties, while more dedicated studies are certainly needed.
The jet origin identification algorithm reads critical information from all the reconstructed particles and provides much higher separation power between jets stemming from different species of colored SM particles. Consequently, this could significantly enhance the scientific discovery potential for physics measurements with multijet final states, such as those expected at future Higgs factories. Jet origin identification benefits from a detector capable of efficiently distinguishing final-state particles and identifying their species, as demonstrated in Fig. 3. Recent studies also suggest that a light, precise vertex detector located close to the interaction point is favorable for jet origin identification [58,59]. Co-evolving with state-of-the-art detector technology, reconstruction algorithms, and artificial intelligence, the jet origin identification algorithm developed here indicates that colored SM particles could potentially be identified with performance comparable to that of leptons and photons.
We would like to thank Christophe Grojean and Michele Selvaggi for the delightful discussions, and Qiang Li, Gang Li, Congqiao Li, Yuxuan Zhang, and Sitian Qian for their support with the software tools. We would also like to thank Xiaoyan Shen for her continuous support. This work is supported by the Innovative Scientific Program of the Institute of High Energy Physics, the National Natural Science Foundation of China under grant No. 12342502, and the Fundamental Research Funds for the Central Universities, Peking University. We thank the Computing Center at the Institute of High Energy Physics for providing the computing resources.

Appendix A: Event selection of benchmark analyses
This appendix describes the event selection for physics benchmark analyses presented in the letter.
We take as reference the existing full-simulation analysis of νν̄H, H → bb̄, cc̄, gg at the CEPC [49]. This reference simulation analysis considers a nominal luminosity of 5.6 ab⁻¹. It includes all major SM backgrounds, with a total of 4.6 × 10⁷ physics events simulated and processed using the CEPC baseline software, and concludes that the signal strength of the νν̄H, H → bb̄, cc̄, gg processes can be measured with a relative precision of 0.49%, 5.8%, and 1.8%, respectively.
All benchmark analyses of νν̄H, H → jj in this letter use the same kinematic variables for the event selection as in the reference analysis. These kinematic variables include the total recoil mass (M_recoil), total visible mass (M_invariant), total visible energy (E_vis), total transverse momentum (P_T), the energies of the leading lepton candidate and leading neutral particle, and the Durham distance y_23 [33] that describes the event topology. A loose cut is applied to the sample, with an efficiency of 40% on the νν̄H, H → jj process and a reduction of the background to 495k. A BDT cut that combines these kinematic and topological variables is then applied, which further suppresses the SM background to 23k and has an efficiency of 24% on the νν̄H, H → jj signal, see Table II.
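Several of these variables follow directly from the summed four-momentum of the visible particles. The sketch below assumes natural units (c = 1), a nominal √s = 240 GeV with the initial state at rest, and the standard definition M_recoil² = (√s − E_vis)² − |p_vis|²; the function name and the two-particle toy event are hypothetical.

```python
import math

SQRT_S = 240.0  # GeV, nominal center-of-mass energy

def event_kinematics(particles, sqrt_s=SQRT_S):
    """Global event variables from visible four-momenta (E, px, py, pz)."""
    e_vis = sum(p[0] for p in particles)
    px = sum(p[1] for p in particles)
    py = sum(p[2] for p in particles)
    pz = sum(p[3] for p in particles)
    p_squared = px ** 2 + py ** 2 + pz ** 2
    return {
        "E_vis": e_vis,
        # Invariant mass of the visible system.
        "M_invariant": math.sqrt(max(e_vis ** 2 - p_squared, 0.0)),
        # Transverse momentum with respect to the beam (z) axis.
        "P_T": math.hypot(px, py),
        # Mass recoiling against the visible system.
        "M_recoil": math.sqrt(max((sqrt_s - e_vis) ** 2 - p_squared, 0.0)),
    }

# Toy input only: two massless, back-to-back objects of 62.5 GeV each.
toy_event = [(62.5, 62.5, 0.0, 0.0), (62.5, -62.5, 0.0, 0.0)]
kin = event_kinematics(toy_event)
print(kin)
```

In a realistic νν̄H event the visible system carries net momentum, so both the P_T and M_recoil entries differ from this idealized example.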
TABLE II. The event selection of νν̄H (H → qq̄/gg) when CEPC operates as a Higgs factory at the center-of-mass energy of 240 GeV and collects an integrated luminosity of 20 ab⁻¹. The γγ label is the abbreviation of the γγ → hadrons process, and SW/SZ refers to single W and single Z processes. The units for mass, energy, and momentum are GeV/c², GeV, and GeV/c, respectively.
         νν̄H, qq̄/gg    2f/γγ    SW/SZ    WW/ZZ    ZH
Total    6.4×10⁵        4.6×10
The remaining events are then processed with toy MC to mimic the jet origin identification and the GBDT classifier, leading to the distribution shown in Fig. 4 in the letter.

Appendix B: Comparative analyses of jet origin identification
This appendix compares the performance of jet origin identification for different samples.These samples are all full simulation samples using the CEPC baseline detector geometry and perfect lepton and charged hadron identification corresponding to the default scenario of particle identification.
1. Dependence on the jet energy and jet polar angle. We extract the jet flavor tagging efficiencies and charge flip rates for various jet energies and polar angles. On top of the νν̄H, H → jj sample at 240 GeV center-of-mass energy, we simulate a Higgs boson at rest with varying mass, forced to decay into a pair of jets. The Higgs boson mass is set to 91.2, 200, 360, and 500 GeV, corresponding to jets with energies from 45.6 to 250 GeV. Fig. 6 shows the performance at different jet energies, where the extracted jet tagging efficiencies and charge flip rates are rather stable. Fig. 8 shows the performance versus the jet polar angle, which is flat in the barrel region of the detector (|cos θ| < 0.8) and exhibits slight degradation in the endcap region.
2. Comparison between different physics processes. We compare the jet origin identification performance between the Z → qq̄ process at a center-of-mass energy of 91.2 GeV and the νν̄H, H → qq̄ process at 240 GeV center-of-mass energy. We observe that the jet origin identification performance agrees between these processes, especially in the fiducial barrel region of the detector for the flavor tagging performance of b, c, and s, see Fig. 7.
It should be noted that, since the Z boson does not decay into a pair of gluons, the gluon jet calibration is an open and interesting question, where dedicated QCD studies and usage of hadron collider data could be very helpful.
3. Comparison between different hadronization models. Jet origin identification directly uses the information of reconstructed final-state particles, while the hadronization process is responsible for generating final-state particles from initial quarks or gluons. The dependence of jet origin identification performance on the hadronization model is a natural concern.
We compare the jet origin identification performance of samples derived from different hadronization models, namely Pythia-6.4 and Herwig-7.2.2 [60,61]. The predictions of the multiplicity of different final-state particles of these two hadronization models could differ by roughly 10% [? ]. Figure 9 shows the performance with different training and test samples. To first order, the performance agrees between models, especially for b, c, and s jets. The performance exhibits small but visible differences for u and d jets.
These comparative analyses show that the jet origin identification performance, especially for the heavy and strange quarks, is rather stable versus the jet kinematics (in the relevant energy range), different physics processes, and even different hadronization models.The observed stability is vital for applying jet origin identification in real experiments.Meanwhile, it is a critical and challenging task to determine and validate the fragmentation behavior of colored particles at a future Higgs factory.
FIG. 2. The confusion matrix M11 with perfect identification of leptons and charged hadrons for νν̄H, H → jj events at 240 GeV center-of-mass energy. The matrix is normalized to unity for each truth label (row).
FIG. 5. Expected upper limits on the branching ratios of rare Higgs boson decays from this work (green) and the relative uncertainties of Higgs couplings anticipated at CEPC [20] (blue) and HL-LHC (orange) under the kappa-0 fit scenario [54] and scenario S2 of systematics [55], as cited in Ref. [20]. The limit on B_ss corresponds to an upper limit of 1.7 on the Higgs-strange coupling modifier κ_s (not shown).

FIG. 6. The jet origin identification performance: jet flavor tagging efficiencies (ε) and charge flip rates (P) for various jet energies. The error for each value is less than the per-thousand level.
FIG. 7. The comparison of flavor tagging efficiencies and charge flip rates between the Z → qq̄ process (dashed) at 91.2 GeV center-of-mass energy and the νν̄H, H → qq̄ process (solid) at 240 GeV. This result is obtained using a 10-category classification for quarks, instead of an 11-category classification that includes a gluon category as presented in the main text.