Search for Higgs boson and observation of Z boson through their decay into a charm quark-antiquark pair in boosted topologies in proton-proton collisions at $\sqrt{s}$ = 13 TeV

A search for the standard model (SM) Higgs boson (H) produced with transverse momentum greater than 450 GeV and decaying to a charm quark-antiquark ($\mathrm{c\bar{c}}$) pair is presented. The search is performed using proton-proton collision data collected at $\sqrt{s}$ = 13 TeV by the CMS experiment at the LHC, corresponding to an integrated luminosity of 138 fb$^{-1}$. Boosted H $\to$ $\mathrm{c\bar{c}}$ decay products are reconstructed as a single large-radius jet and identified using a deep neural network charm tagging technique. The method is validated by measuring the Z $\to$ $\mathrm{c\bar{c}}$ decay process, which is observed in association with jets at high $p_\mathrm{T}$ for the first time with a signal strength of 1.00 $_{-0.14}^{+0.17}$ (syst) $\pm$ 0.08 (theo) $\pm$ 0.06 (stat), defined as the ratio of the observed process rate to the standard model expectation. The observed (expected) upper limit on $\sigma$(H) $\mathcal{B}$(H $\to$ $\mathrm{c\bar{c}}$) is set at 47 (39) times the SM prediction at 95% confidence level.


The standard model (SM) Higgs boson (H) has been observed at the LHC [1-3] in all its expected primary production modes and most of its dominant decay channels. With the observations of direct couplings to τ leptons [4,5], top quarks [6,7], and bottom quarks [8,9] confirming that the SM Yukawa sector gives rise to the masses of third-generation fermions, attention naturally turns to probing the second generation, specifically muons and charm quarks.
The search for H decays to muon pairs is the most experimentally accessible channel, and has been explored by the ATLAS [10, 11] and CMS [12] Collaborations. The latter recently found evidence of the H coupling to muons [13].
In contrast, the search for H decays to charm quark-antiquark pairs (H $\to$ $\mathrm{c\bar{c}}$) is considerably more challenging, because such decays are difficult to identify and the multijet background is enormous. However, recent advances in jet substructure and flavor tagging techniques [14] have greatly improved the experimental sensitivity to this decay mode. Prior searches by ATLAS [15,16] and CMS [17,18] have focused on H production in association with a vector boson (VH, where V stands for a W or Z boson), which benefits from strong background rejection thanks to leptonic decays of the vector bosons. The gluon-gluon fusion (ggF) and vector boson fusion (VBF) production modes have larger cross sections, but have yet to be explored.
This Letter reports the first search for the H $\to$ $\mathrm{c\bar{c}}$ decay at the LHC in which the H is produced with transverse momentum ($p_\mathrm{T}$) greater than 450 GeV, a regime enriched in events from ggF production. The search employs the same general strategy as earlier CMS searches for boosted H in the $\mathrm{b\bar{b}}$ decay channel [19,20], but uses new mass-decorrelated discriminators to define a charm-enriched signal region. This strategy complements the existing ATLAS and CMS measurements by constraining the decay process in a different production mode and H $p_\mathrm{T}$ regime.
The search is performed using a dataset of proton-proton collisions at $\sqrt{s}$ = 13 TeV, collected with the CMS detector at the LHC and corresponding to an integrated luminosity of 138 fb$^{-1}$. Candidate events are selected by requiring a high-$p_\mathrm{T}$, large-radius jet with substructure observables compatible with those expected from an H $\to$ $\mathrm{c\bar{c}}$ decay. Deep neural network (DNN) discriminators are employed to separate the H signal events from the dominant background, namely quantum chromodynamics (QCD)-induced multijet events. The discriminators are designed to be independent of the jet mass, which allows both an estimation of the QCD background from control samples in data and a validation of the analysis procedure through a search for Z $\to$ $\mathrm{c\bar{c}}$ decays. A model of the jet mass distributions for the H $\to$ $\mathrm{c\bar{c}}$ and Z $\to$ $\mathrm{c\bar{c}}$ signals, QCD multijet events, and other background processes is fit simultaneously in several disjoint signal and control regions to extract the signal production cross sections.
The CMS apparatus [21] is a multipurpose, nearly hermetic detector, designed to trigger on [22,23] and identify electrons, muons, photons, and (charged and neutral) hadrons [24-26]. A global "particle-flow" (PF) algorithm [27] aims to reconstruct all individual particles in an event, combining information provided by the all-silicon inner tracker and by the crystal electromagnetic and brass-scintillator hadron calorimeters, operating inside a 3.8 T superconducting solenoid, with data from the gas-ionization muon detectors embedded in the flux-return yoke outside the solenoid. The reconstructed particles are used to build τ leptons, jets, and the missing transverse momentum ($p_\mathrm{T}^\mathrm{miss}$) [28-30]. Simulated samples of signal and background events are produced at the matrix element level using various Monte Carlo (MC) event generators. The QCD multijet, Z+jets, and W+jets processes are modeled at QCD leading order (LO) accuracy using the MADGRAPH5_aMC@NLO v2.4.2 generator. The $\mathrm{t\bar{t}}$ process constitutes a subdominant nonresonant background across the soft-drop (SD) jet mass ($m_\mathrm{SD}$) spectrum, the shape of which is taken from simulation, while the total yield is estimated from data. Other electroweak (EW) processes, including diboson, triboson, and $\mathrm{t\bar{t}}$V production, are estimated from simulation and found to be negligible.
Events containing leptons are vetoed to reduce SM EW backgrounds. The selection criteria for electrons, muons, and hadronic τ leptons are $p_\mathrm{T}$ > 10, 10, and 20 GeV and |η| < 2.5, 2.4, and 2.3, along with "veto", "loose", and "very loose" identification requirements [24,28,61], respectively. In addition, muons are required to have a relative isolation (the scalar $p_\mathrm{T}$ sum of the PF candidates within a cone with a distance parameter of 0.4, divided by the lepton $p_\mathrm{T}$) of less than 0.25. Events with $p_\mathrm{T}^\mathrm{miss}$ > 140 GeV, as well as events containing b-tagged AK4 jets (clustered with the anti-$k_\mathrm{T}$ algorithm with a distance parameter of 0.4) with $p_\mathrm{T}$ > 30 GeV opposite in azimuth to the large-radius (AK8) H candidate jet, $\Delta\phi(\mathrm{AK4},\mathrm{AK8}) > \pi/2$, are removed to reduce the top quark background. The AK4 b jet identification is performed using the DEEPCSV DNN algorithm [62] with a working point corresponding to a 1% misidentification probability for light (u, d, s quark, or gluon) jets.
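The relative isolation requirement above can be illustrated with a short sketch; the helper name and the toy inputs are not from the analysis:

```python
import numpy as np

def relative_isolation(muon_pt, muon_eta, muon_phi, pf_pt, pf_eta, pf_phi, cone=0.4):
    """Scalar pT sum of PF candidates within dR < 0.4 of the muon, divided by the muon pT."""
    dphi = np.mod(pf_phi - muon_phi + np.pi, 2 * np.pi) - np.pi  # wrap to [-pi, pi]
    dr = np.sqrt((pf_eta - muon_eta) ** 2 + dphi ** 2)
    return pf_pt[dr < cone].sum() / muon_pt

# Toy example: one PF candidate inside the cone, one well outside it
pf_pt = np.array([5.0, 30.0])
pf_eta = np.array([0.1, 1.5])
pf_phi = np.array([0.2, 2.0])
iso = relative_isolation(25.0, 0.0, 0.0, pf_pt, pf_eta, pf_phi)
print(iso)  # 5.0 / 25.0 = 0.2 -> passes the < 0.25 requirement
```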
The dimensionless mass scale variable $\rho = 2\ln(m_\mathrm{SD}/p_\mathrm{T})$ [63,64] is used to parametrize the QCD background model (described below), as its distribution is approximately invariant with jet $p_\mathrm{T}$, unlike that of $m_\mathrm{SD}$. A selection of $-6.0 < \rho < -2.1$ is imposed to avoid instabilities and edge effects from the SD algorithm and jet clustering [65]. The lower ρ threshold implies an upper jet $p_\mathrm{T}$ threshold, which is made explicit by requiring $p_\mathrm{T}$ < 1200 GeV. In simulation, less than 1% of signal events are found above this upper bound.
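As a quick check of the kinematic relation above, the lower bound $\rho > -6.0$ is equivalent to $p_\mathrm{T} < m_\mathrm{SD}\,e^{3}$; a minimal sketch, where the 60 GeV mass value is illustrative only:

```python
import math

def rho(m_sd, pt):
    """Dimensionless mass scale rho = 2 ln(m_SD / pT)."""
    return 2.0 * math.log(m_sd / pt)

def in_rho_window(m_sd, pt, lo=-6.0, hi=-2.1):
    """The analysis window -6.0 < rho < -2.1."""
    return lo < rho(m_sd, pt) < hi

# The lower rho bound translates into an upper pT bound:
# rho > -6.0  <=>  pT < m_SD * exp(3).  For an illustrative m_SD of 60 GeV:
pt_max = 60.0 * math.exp(3.0)
print(round(pt_max))  # ~1205 GeV, close to the explicit 1200 GeV requirement
```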
The $N_2^1$ variable [66], a ratio of energy correlation functions [67], is a powerful discriminant for two-pronged signatures. However, selecting on it distorts the background jet mass distribution as a function of $p_\mathrm{T}$. To mitigate this effect, the designing decorrelated taggers (DDT) technique [64], effectively a sliding selection, is applied. The selection is $N_2^{1,\mathrm{DDT}} \equiv N_2^1 - q_{0.26}(p_\mathrm{T}, \rho) < 0$, where $q_{0.26}(p_\mathrm{T}, \rho)$ is the $N_2^1$ value corresponding to a 26% efficiency for the QCD background, as a function of jet $p_\mathrm{T}$ and ρ. The target percentile is chosen to optimize the expected H $\to$ $\mathrm{c\bar{c}}$ significance.
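The quantile-subtraction idea behind the DDT technique can be sketched on a toy background sample; the binning, sample size, and the assumed $N_2^1$ dependence on ρ below are all illustrative, not the analysis inputs:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy QCD background sample with a mild, hypothetical N2-vs-rho correlation
n = 200_000
pt = rng.uniform(450, 1200, n)
r = rng.uniform(-6.0, -2.1, n)
n2 = rng.normal(0.25 + 0.02 * (r + 4.0), 0.05, n)

# Bin the (pt, rho) plane and take the 26% quantile of N2 in each cell
pt_edges = np.linspace(450, 1200, 11)
rho_edges = np.linspace(-6.0, -2.1, 11)
ipt = np.digitize(pt, pt_edges) - 1
irho = np.digitize(r, rho_edges) - 1
q26 = np.empty((10, 10))
for i in range(10):
    for j in range(10):
        q26[i, j] = np.quantile(n2[(ipt == i) & (irho == j)], 0.26)

# Decorrelated variable: N2DDT = N2 - q26(pt, rho); the selection is N2DDT < 0
n2ddt = n2 - q26[ipt, irho]
eff = np.mean(n2ddt < 0)
print(f"background efficiency ~ {eff:.3f}")  # ~0.26 by construction, flat in (pt, rho)
```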
Finally, the jet flavor is determined by the DEEPDOUBLEX DNN algorithm [68]. The model comprises convolutional and recurrent units processing low-level features of secondary vertices and PF candidates, the outputs of which are joined with expert variables [62] in a fully connected layer. The application of feature importance ranking techniques, such as integrated gradients [69] and deep Taylor decomposition [70], indicates that the key features are the angular distances of the PF candidates from both the jet and 2-subjettiness [71] axes. The kinematic properties of the PF candidates defined relative to the parent jet have subleading importance. The model is trained to distinguish between two-pronged H-like signatures of bottom and charm flavors, as well as the QCD background, yielding two per-jet classifiers: charm versus light, referred to as DEEPDOUBLECVL (DDCvL), and charm versus bottom, referred to as DEEPDOUBLECVB (DDCvB). The performance of the two classifiers is shown, prior to any analysis-specific selection, in Fig. 1. The optimal working point, maximizing the expected H $\to$ $\mathrm{c\bar{c}}$ significance after all previous selections are applied, is found with respect to both classifiers at a QCD efficiency of 0.5% and an H $\to$ $\mathrm{c\bar{c}}$ efficiency of 20.6%. The corresponding efficiency for H $\to$ $\mathrm{b\bar{b}}$ events is 4.8%. As the classifiers are mass independent, the quoted efficiencies also apply to Z $\to$ $\mathrm{c\bar{c}}$ and Z $\to$ $\mathrm{b\bar{b}}$.
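Choosing a working point at a fixed background efficiency amounts to taking a quantile of the background score distribution; the score distributions below are hypothetical stand-ins for the real tagger outputs:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical classifier score distributions, not the real DDCvL outputs
qcd_scores = rng.beta(2, 8, 1_000_000)  # background peaks at low score
sig_scores = rng.beta(6, 2, 100_000)    # signal peaks at high score

# Threshold giving a 0.5% background (mis)identification efficiency
threshold = np.quantile(qcd_scores, 1 - 0.005)
eff_qcd = np.mean(qcd_scores > threshold)
eff_sig = np.mean(sig_scores > threshold)
print(f"threshold={threshold:.3f}, QCD eff={eff_qcd:.4f}, signal eff={eff_sig:.3f}")
```

The signal efficiency at that threshold then follows from applying the same cut to the signal sample, which is how a point on the ROC curve of Fig. 1 is read off.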
The relative contributions of H production modes to the overall signal yield are 55%, 25%, and 20% for ggF, VBF, and VH, respectively, and are similar for the $\mathrm{c\bar{c}}$ and $\mathrm{b\bar{b}}$ decay modes. The $\mathrm{t\bar{t}}$H contribution is suppressed by the top quark veto and found to be negligible. Events passing all of the selection requirements described above constitute the signal or "passing" region (SR); events failing the charm tagging requirement but satisfying all other criteria define the control or "failing" region (CR). The QCD background is not accurately predicted in simulation. Since the flavor discrimination is nearly independent of the jet $p_\mathrm{T}$ and mass, the ratio of the passing and failing region distributions, $R_\mathrm{p/f}$, is expected to be approximately flat in jet $p_\mathrm{T}$ and $m_\mathrm{SD}$. This can be exploited to obtain an SR prediction of the QCD background from the CR via an appropriate efficiency scaling. A residual difference in shapes is accounted for by parametrizing the $R_\mathrm{p/f}$ shape in the two dimensions. To account for a potential mass dependence of the flavor selection efficiency, a correction factor, $R^\mathrm{QCD}_\mathrm{p/f}(p_\mathrm{T}, \rho)$, is fit to the simulated QCD background shapes. A second correction factor of the same functional form, $R^\mathrm{data}_\mathrm{p/f}(p_\mathrm{T}, \rho)$, then accounts for mismodeling in the simulation. Both are parametrized in terms of Bernstein polynomials [72] in $p_\mathrm{T}$ and ρ:
$$R(p_\mathrm{T}, \rho) = \sum_{k=0}^{n_{p_\mathrm{T}}} \sum_{\ell=0}^{n_\rho} a_{k,\ell}\, b_{k,n_{p_\mathrm{T}}}(p_\mathrm{T})\, b_{\ell,n_\rho}(\rho),$$
where $n_{p_\mathrm{T}}$ is the degree of the polynomial in $p_\mathrm{T}$, $n_\rho$ is the degree of the polynomial in ρ, $a_{k,\ell}$ is a Bernstein coefficient, and $b_{\nu,n}$ is a Bernstein basis polynomial of degree $n$. The coefficients $a_{k,\ell}$ of $R^\mathrm{QCD}_\mathrm{p/f}(p_\mathrm{T}, \rho)$ are determined in a fit to simulated QCD background events; those of $R^\mathrm{data}_\mathrm{p/f}(p_\mathrm{T}, \rho)$ are unconstrained and are determined during the maximum likelihood fit to data. The total effective $R_\mathrm{p/f}$ is then expressed as the product $R_\mathrm{p/f} = R^\mathrm{QCD}_\mathrm{p/f} \times R^\mathrm{data}_\mathrm{p/f}$. The $R_\mathrm{p/f}$ are expected to vary between data-taking years because of changes in detector conditions, and are thus fit independently for each year.
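The two-dimensional Bernstein parametrization can be written compactly in code; the degrees and coefficient values below are illustrative only, and the inputs are assumed to be rescaled to the unit interval:

```python
import numpy as np
from math import comb

def bernstein_basis(nu, n, x):
    """Bernstein basis polynomial b_{nu,n}(x) on [0, 1]."""
    return comb(n, nu) * x**nu * (1 - x) ** (n - nu)

def r_pf(pt01, rho01, a):
    """2D Bernstein surface with coefficient matrix a of shape (n_pt + 1, n_rho + 1).
    pt01 and rho01 are pT and rho linearly rescaled to [0, 1]."""
    n_pt, n_rho = a.shape[0] - 1, a.shape[1] - 1
    return sum(
        a[k, l] * bernstein_basis(k, n_pt, pt01) * bernstein_basis(l, n_rho, rho01)
        for k in range(n_pt + 1)
        for l in range(n_rho + 1)
    )

# Degree (0, 2) in (pT, rho): constant in pT, quadratic in rho; coefficients are made up
a = np.array([[1.0, 0.8, 1.2]])
print(r_pf(0.5, 0.0, a))  # at rho01 = 0 only b_{0,2} survives -> 1.0
```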
The minimal degree of the polynomials necessary to fit the QCD simulation and the data is determined by a Fisher F-test [73] and found to be $(n_{p_\mathrm{T}}, n_\rho)$ = (0, 2), (1, 2), and (0, 2) for $R^\mathrm{QCD}_\mathrm{p/f}(p_\mathrm{T}, \rho)$, and (1, 0), (0, 0), and (1, 0) for $R^\mathrm{data}_\mathrm{p/f}(p_\mathrm{T}, \rho)$, for the years 2016, 2017, and 2018, respectively. Bias tests were performed with respect to the choice of parametrization, and no significant bias was found.
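A sketch of the Fisher F-test used to decide whether extra polynomial parameters are warranted, comparing the chi-square values of two nested fits; all numbers are hypothetical:

```python
from scipy.stats import f as f_dist

def f_test(chi2_1, p1, chi2_2, p2, n_bins):
    """p-value for the hypothesis that the extra p2 - p1 parameters of the
    larger (nested) model bring no real improvement."""
    f_stat = ((chi2_1 - chi2_2) / (p2 - p1)) / (chi2_2 / (n_bins - p2))
    return f_dist.sf(f_stat, p2 - p1, n_bins - p2)

# Hypothetical chi2 values: the added parameter barely improves the fit
p_value = f_test(chi2_1=52.0, p1=3, chi2_2=51.0, p2=4, n_bins=50)
print(f"p-value = {p_value:.3f}")  # large p-value -> keep the lower-degree polynomial
```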
The V+jets processes are modeled using simulation. The differential $\mathrm{t\bar{t}}$ contribution is taken from simulation; however, the normalizations in the SR and CR are corrected via two scale factors measured in a dedicated $\mathrm{t\bar{t}}$-enriched control region, parametrizing the overall normalization and the efficiency of the DDCvL selection between the SR and CR. This control region is obtained from the SR selection by lowering the H candidate $p_\mathrm{T}$ threshold, requiring exactly one muon, and inverting the selection requirements on $p_\mathrm{T}^\mathrm{miss}$ and b-tagged AK4 jets. The scale factor measurement is performed in situ during the signal extraction, separately for each data-taking period, and the values are given in Table 1. The H $\to$ $\mathrm{b\bar{b}}$ contribution is taken from the simulation and is fixed to the SM expectation. While its expected SR yield is approximately 5 times larger than that of the H $\to$ $\mathrm{c\bar{c}}$ signal, its impact is negligible with respect to the overall background uncertainty.
The dominant systematic uncertainties in this search are related to the flavor tagging efficiency and the jet mass shape. Corrections of the jet mass scale, jet mass resolution, and $N_2^{1,\mathrm{DDT}}$ and DDCvB efficiencies are derived from data using W boson jets in semileptonic $\mathrm{t\bar{t}}$ events. These corrections are measured independently of jet flavor and, as such, are correlated among all considered resonant (H, Z, W) production and decay processes. The DDCvL misidentification efficiency for the W process is measured here as well. The corrections and their associated uncertainties are given in Table 1. The efficiency of the DDCvL selection for the signal processes is estimated using data and simulation samples enriched in $\mathrm{c\bar{c}}$ pairs from gluon splitting [62]. Signal-like events are selected by requiring each of the two SD subjets of an AK8 jet to contain a muon, targeting semileptonic decays of bottom and charm hadrons. The efficiency is extracted from a template fit to the combined mass of all matched secondary vertices; the measured correction factors are given in Table 1. A relative uncertainty of 30% is assigned to the misidentification efficiency for $\mathrm{b\bar{b}}$ decays; varying this value between 10 and 50% has a negligible effect on the reported results.
Other systematic uncertainties are assigned to cover potential mismodeling of the H signal, in particular for the ggF and VBF production modes [48], and higher-order corrections to the W and Z processes [44]. Finally, systematic uncertainties from experimental effects, including the jet energy scale and resolution [74], trigger and veto efficiencies [75,76], variations in the measured pileup [77], the finite size of the simulated samples [78], and the integrated luminosity measurement [79-81], are also included, but are found to have a comparatively small effect.
The parameters of interest in this analysis are the signal strengths $\mu_\mathrm{H}$ and $\mu_\mathrm{Z}$, defined as the ratios of the observed to the SM expected H or Z boson production cross section times the H $\to$ $\mathrm{c\bar{c}}$ or Z $\to$ $\mathrm{c\bar{c}}$ branching fraction, respectively. These parameters are extracted from a binned maximum likelihood fit in ($m_\mathrm{SD}$, $p_\mathrm{T}$) to the observed data, where the expected value is the sum of the signal contribution (scaled by the signal strength parameter) and the background contributions, each modified by nuisance parameters that account for the previously discussed systematic effects. The magnitude of each systematic uncertainty is encoded in the likelihood model as an additional constraint, treated according to the frequentist paradigm [82]. The fit is performed simultaneously across all subdivisions of the SR and CR described previously, as well as the per-year $\mathrm{t\bar{t}}$-enriched CRs.
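A minimal sketch of a binned maximum likelihood signal-strength fit, using toy templates, pseudo-data, and no nuisance parameters; all yields are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy binned templates -- hypothetical yields, not the analysis inputs
signal = np.array([2.0, 8.0, 15.0, 8.0, 2.0])            # resonant peak
background = np.array([120.0, 100.0, 85.0, 70.0, 60.0])  # smoothly falling spectrum
rng = np.random.default_rng(3)
observed = rng.poisson(1.0 * signal + background)         # pseudo-data with mu = 1

def nll(mu):
    """Binned Poisson negative log-likelihood (constant terms dropped)."""
    expected = mu * signal + background
    return np.sum(expected - observed * np.log(expected))

result = minimize_scalar(nll, bounds=(-5, 20), method="bounded")
print(f"best fit mu = {result.x:.2f}")
```

In the analysis itself the expected yields additionally depend on nuisance parameters with constraint terms, and the fit runs simultaneously over all regions and years.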
To validate the analysis strategy and to confirm the presence of Z $\to$ $\mathrm{c\bar{c}}$ decays, the Z signal strength $\mu_\mathrm{Z}$ is measured via a profile likelihood fit, treating $\mu_\mathrm{H}$ as a nuisance parameter, and is found to be 1.00$^{+0.17}_{-0.14}$ (syst) $\pm$ 0.08 (theo) $\pm$ 0.06 (stat). This corresponds to an excess, both observed and expected, over the $\mu_\mathrm{Z}$ = 0 hypothesis with a significance well above 5 standard deviations. The precision of the $\mu_\mathrm{Z}$ measurement is primarily limited by the systematic uncertainty in the DDCvL signal tagging efficiency. The subleading uncertainty comes from the modeling of the Z+jets production cross section.
For the extraction of $\mu_\mathrm{H}$, since the Z boson cross section has been measured in leptonic decay channels and found to agree with theoretical predictions within 5% in this $p_\mathrm{T}$ regime [83], and since the Z $\to$ $\mathrm{c\bar{c}}$ branching fraction is known to 2% precision [84], we fix $\mu_\mathrm{Z} \equiv 1$, constraining the expected Z contribution to be within the applicable uncertainties of its SM value. This further constrains in situ the uncertainty in the DDCvL signal tagging efficiency. The measured efficiencies are compatible with the values quoted in Table 1 and have approximately 30% smaller uncertainty.
An observed (expected) upper limit is placed on the signal strength $\mu_\mathrm{H}$ using the profile likelihood ratio test statistic [82], the CL$_\mathrm{s}$ criterion [85,86], and asymptotic formulae [87], and is found to be 47 (39) at 95% confidence level. For the best fit value of $\mu_\mathrm{H}$ = 9.4$^{+20.3}_{-19.9}$, the total $m_\mathrm{SD}$ distributions in the passing and failing regions are shown in Fig. 2, and a breakdown of the sources of uncertainty affecting the measurement is given in Table 2. Tabulated results are provided in the HEPData record for this analysis [88].
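The CL$_\mathrm{s}$ construction can be illustrated with a toy single-bin counting experiment using the observed count as the test statistic; the analysis itself uses the profile likelihood ratio with asymptotic formulae, and all yields below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def cls_value(n_obs, s, b, n_toys=200_000):
    """Toy-based CLs = CL_{s+b} / CL_b for a single-bin counting experiment."""
    p_sb = np.mean(rng.poisson(s + b, n_toys) <= n_obs)  # CL_{s+b}
    clb = np.mean(rng.poisson(b, n_toys) <= n_obs)       # CL_b
    return p_sb / clb

# Scan the signal strength upward until CLs drops below 0.05 -> 95% CL upper limit
b, s_sm, n_obs = 100.0, 1.0, 105  # hypothetical yields: tiny SM signal on a large background
mu = next(m for m in np.arange(1.0, 60.0, 0.5) if cls_value(n_obs, m * s_sm, b) < 0.05)
print(f"95% CL upper limit: mu < {mu}")
```

Dividing by CL$_\mathrm{b}$ protects against excluding signal strengths to which the experiment has no sensitivity when the data fluctuate below the background expectation.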
Table 2: Sources of uncertainty in the measurement of the signal strength $\mu_\mathrm{H}$ = 9.4$^{+20.3}_{-19.9}$, and their observed impact ($\Delta\mu_\mathrm{H}$) in the fit to the full data set. The impact of each uncertainty is evaluated by computing the uncertainty excluding that source and subtracting it in quadrature from the total uncertainty. The total uncertainty does not match the sum in quadrature of the individual sources because of correlations among the components.

In conclusion, a search for standard model (SM) Z and Higgs bosons produced with transverse momenta greater than 450 GeV and decaying to charm quark-antiquark ($\mathrm{c\bar{c}}$) pairs has been performed in a data sample corresponding to an integrated luminosity of 138 fb$^{-1}$ at $\sqrt{s}$ = 13 TeV. New algorithms based on deep neural networks have been developed to identify jets originating from charm quark pairs. The Z $\to$ $\mathrm{c\bar{c}}$ process is observed in association with jets at a hadron collider for the first time, with a signal strength of 1.00$^{+0.19}_{-0.17}$ relative to the SM prediction. This observation establishes Z $\to$ $\mathrm{c\bar{c}}$ as an important reference for future X $\to$ $\mathrm{c\bar{c}}$ searches. An observed (expected) upper limit on the product of the Higgs boson production cross section and branching fraction to $\mathrm{c\bar{c}}$ of 47 (39) times the SM expectation is set at 95% confidence level.

[5] ATLAS Collaboration, "Measurements of Higgs boson production cross-sections in the H $\to$ $\tau^{+}\tau^{-}$ decay channel in pp collisions at $\sqrt{s}$ = 13 TeV with the ATLAS detector", JHEP 08 (2022) 175, doi:10.1007/JHEP08(2022)175, arXiv:2201.08269.