Data-driven quark and gluon jet modification in heavy-ion collisions

Whether quark- and gluon-initiated jets are modified differently by the quark-gluon plasma produced in heavy-ion collisions is a long-standing question that has thus far eluded a definitive experimental answer. A crucial complication for quark-gluon discrimination in both proton-proton and heavy-ion collisions is that all measurements necessarily average over the (unknown) quark-gluon composition of a jet sample. In the heavy-ion context, the simultaneous modification of both the fractions and substructure of quark and gluon jets by the quark-gluon plasma further obscures the interpretation. Here, we demonstrate a fully data-driven method for separating quark and gluon contributions to jet observables using a statistical technique called topic modeling. Assuming that jet distributions are a mixture of underlying"quark-like"and"gluon-like"distributions, we show how to extract quark and gluon jet fractions and constituent multiplicity distributions as a function of the jet transverse momentum. This proof-of-concept study is based on proton-proton and heavy-ion collision events from the Monte Carlo event generator Jewel with statistics accessible in Run 4 of the Large Hadron Collider. These results suggest the potential for an experimental determination of quark and gluon jet modifications.

High-energy collisions between large nuclei at the Relativistic Heavy Ion Collider (RHIC) and the Large Hadron Collider (LHC) are a critical laboratory for studying the deconfined phase of QCD matter, the quark-gluon plasma, created in these collisions. Collimated sprays of high-momentum hadrons, called jets, are produced copiously in these collisions and provide an important probe of the quark-gluon plasma they pass through.
A long-standing question is how the quark-gluon plasma resolves the color charge of high-energy QCD partons [1][2][3][4][5][6]. Since jets can originate from either a quark or gluon, and subsequently carry information about their respective total color charge, it is crucial to understand differences in the energy loss and modification of these two categories of jets. Unfortunately, accessing independent information about quark and gluon jets experimentally is very challenging because all jet measurements involve a mixture of contributions from both.
In this letter, we demonstrate a data-driven method to estimate both the quark and gluon jet fractions and their separate substructure modification in heavy-ion collisions. Our method is based on a statistical technique called topic modeling, which was pioneered for applications to quark and gluon jet separation in proton-proton collisions in Refs. [7,8] and has been applied experimentally in Ref. [9]. We present a proof-of-concept that an extension of that technique can be used to extract differences in the modification of quark and gluon jets in heavy-ion collisions with the statistics anticipated in Run 4 of the LHC. This is a critical step toward a modelindependent determination of quark and gluon jet modification in heavy-ion collisions, which would have dramatic consequences for understanding the microscopic structure of the quark-gluon plasma.
A similar type of analysis was recently performed in Ref. [10], which used a measurement of the jet charge and templates for the jet charge distributions of quark and gluon jets to extract the gluon fraction in proton-proton and heavy-ion collisions. In that study, the same Monte Carlo (MC) distributions were used as templates in both proton-proton and heavy-ion collisions, which makes the implicit assumption that the jet charge distributions of quark and gluon jets are unmodified by the quark-gluon plasma. Here, we present a method that does not require templates and does not assume that substructure observables are unmodified by the plasma, allowing for simultaneous estimates of the modification of quark and gluon jet fractions and of their distributions.
Our method is based on a statistical technique called Demix [11] that separates a pair of mixed probability distributions into two common underlying base distributions. This method was demonstrated in Refs. [7,8] as a way to obtain excellent proxies for quark and gluon jets in proton-proton collisions. Consider two probability distributions p 1 (x), p 2 (x) for a jet observable x that are a distinct mixture of the same two underlying base probability distributions b 1 (x), b 2 (x). Namely, we can express the mixture distributions as for distinct fractions f j . This expression of the p j is always ambiguous, however, since there are infinitely many ways to mix the base distributions b i (x) among themselves to obtain new distributionsb i (x) from which the mixture distributions can be expressed as p j (x) = f jb1 (x) + (1 −f j )b 2 (x) with new fractionsf j . The idea behind Demix is to resolve this ambiguity by further requiring that the base distributions are mutually irreducible [12]. Qualitatively, this means that neither base distribution contains any component of the other; a precise definition will follow shortly. We refer to the mutually irreducible base distributions as topics, for their relation to the broader field of topic modeling established in Ref. [7]. See Refs. [13][14][15] for other uses of topic modeling arXiv:2008.08596v1 [hep-ph] 19 Aug 2020 techniques in collider physics.
The notion of "quark-and gluon-initiated jets" is not well-defined at the hadron level. Even at the level of a MC generator, where parton information from the hard process is available, there is still an ambiguity about how to associate final-state jets with their initiating parton. Therefore, the quark and gluon topics we discuss in this letter do not correspond directly to any parton-level intuition about quark-and gluon-initiated jets [16]. Instead, these topics correspond to the operational definition of jet categories introduced in Ref. [8], which defines the quark and gluon categories as the mutually irreducible (i.e., maximally separable) distributions underlying a pair of jet samples. To minimize potential confusions, we will often use the language of quark-like and gluon-like (or "quark" and "gluon" in quotes) to refer to this operational definition.
Since the base distributions extracted from a jet observable x using Demix are mutually irreducible, they can only agree with the MC quark and gluon jet distributions of x if those are also mutually irreducible. It was argued in Ref. [7] that quark-gluon mutual irreducibility is approximately satisfied for the constituent multiplicity (number of constituents) of groomed and ungroomed jets and n SD [17] in proton-proton collisions, though not for other common jet observables like jet mass. This stems from counting observables having exact quark-gluon mutual irreducibility in the high-energy limit [17]. Reference [8] further showed that constituent multiplicity is a nearly optimal classifier to separate operationally defined quark and gluon jets (see Ref. [18] for further developments). We thus focus on constituent multiplicity for extracting quark and gluon fractions in our case study.
The algorithm to extract the mutually irreducible base distributions is straightforward. Base distributions are computed from the mixture distributions via with The reducibility factor κ ij is the maximum fraction of p j (x) that can be subtracted from p i (x) such that the resulting function remains positive for every x, and can thus be normalized to yield a proper probability distribution. Mutual irreducibility of the p i is precisely the condition that κ 12 = κ 21 = 0. The κ ij are directly related to the mixture fractions: For two analytically known mixtures of two base distributions, it is essentially trivial to compute κ ij using eq. (2) and then to extract the mutually irreducible underlying distributions using eq. (1). As an example, two Gaussian distributions with different mean and the same standard deviation are mutually irreducible, so any two convex mixtures built from them can be demixed exactly.
When dealing with finite-sampled distributions, however, one encounters substantial technical difficulties using eq. (2) directly. A histogram of samples from a probability distribution p(x) has a finite, discretized range of histogram bins {x k } at which p(x) is estimated from the finite-statistics sampled distribution,p(x k ). We need a method of defining the reducibility factorsκ ij for a pair of sampled histograms. Naively, the infimum of eq. (2) becomes a minimum of the ratio of the histograms over {x k }; simply taking the minimum, however, is very sensitive to statistical fluctuations. A more robust approach, introduced in Ref. [8], is to defineκ ij to be the ratio of histograms in the bin for which the ratio plus its uncertainty is minimized. This method turns out to be insufficient to deal with the much more limited statistics we aim to utilize in this work, particularly becauseκ ij is typically extracted at the low-statistics end points of the distributions.
One way to address the issue of limited statistics is by using fitting to leverage information about the interior of the distribution, where the statistics are better, to put additional constraints on the tails. We note that the histograms shown later in figs. 1a and 1d are exceptionally well-described by a simultaneous fit to two distinct sums of a pair of skew-normal distributions SN(x; µ, σ, s). That is, they are well-described by the form with N = 2. Here θ = (µ 1 , σ 1 , s 1 , . . . , µ N , σ N , s N ), and α i = (α i,1 . . . , α i,N ) contains N − 1 independent fractions, with the N th fraction constrained by N k=1 α i,k = 1, for jet samples i = 1, 2 (dijet and γ+jet in figs. 1a and 1d). For further generality of the functional form, we consider N = 4 and simultaneously fit the two input distributions to f 4 (x; α 1 , θ) and f 4 (x; α 2 , θ), respectively, with 18 fit parameters α 1 , α 2 , and θ. To estimate the uncertainty on such fits, we use the Markov Chain Monte Carlo (MCMC) ensemble sampler emcee [19] to do posterior estimation using the likelihood function [20] ln .
(5) Here, j indexes the histogram bins of jet sample i, with the jth bin having constituent multiplicity x i,j and probability density y i,j , and the ith sample having total count n i . This form assumes that the number of counts in each histogram bin, n i,j = n i y i,j , are independently Poisson-distributed around the value f (x i,j ; α i , θ), and estimates distributions of the parameters α i , θ for which the observed data is most likely. Following Refs. [20,21], the likelihood function p in eq. (5) is rescaled by a fitindependent constant C that cancels a ln(n i,j !) that arises when taking the log of the Poisson probability distribution. We take a uniform prior on the parameters θ and α i in the range µ k ∈ [0, 50], σ k ∈ [1, 15], s k ∈ [−20, 20], and α i,k ∈ [0, 1], and we start the MCMC walkers in a Gaussian ball of standard deviation 10% around the least-squares fit parameters. We use the distribution of fits to obtain distributions ofκ ij via eq. (2). To combat finite statistics effects, we compute the infimum in eq. (2) as a minimum of the MCMC walkers over a reduced range. We consider only the range for which at least one input histogram is non-zero. For each reducibility factor, we further identify whether the minimum will occur on the left or right side of the range, and truncate the opposite tail at the outermost bin that has at least 10 data points for each input histogram. The distribution ofκ ij is used to compute a distribution of fractions, and its mean and standard deviation are used as the value and uncertainty ofκ ij used to extract the topics.
The samples for our proof-of-concept study come from the heavy-ion MC event generator Jewel 2.1.0 [22,23], based on vacuum jet production in Pythia 6.4.25 [24]. We consider two mixed distributions coming from photon-jet (γ + jet) production and dijet production. For each process, we generate proton-proton and 0 % to 10 % centrality heavy-ion events at 5.02 TeV and reconstruct anti-k t jets using FastJet 3.3.0 [25,26] with radius parameter R = 0.4 within the pseudorapidity range |η| < 1. We include initial-state radiation, but do not include medium recoil effects. (There is no underlying event model in Jewel, but we verified that similar results can be obtained after aggressively grooming jets using the Soft Drop algorithm [27] with z cut = 0.5 and β = 1.5 as in Ref. [28].) For γ+jet events, we consider the recoiling jet with the highest transverse momentum (p T ), and for dijet events, we consider the two highest-p T jets. In the case of heavy-ion collisions, we downsample our Jewel events to mimic the statistics that will be available with the anticipated luminosity L dt = 13 nb −1 after Run 4 [29] (see Supplementary Material for details). The equivalent statistics of our dijet sample are already less than those achievable in Run 4, but this substantially reduces the statistics of our γ + jet sample. We emphasize that we are only using Jewel for demonstration purposes, and this data-driven technique can be applied directly to experimental collider measurements for a range of jet observables beyond just multiplicity. Starting with proton-proton collisions in the top row of fig. 1, we show the distributions of jet constituent multiplicity for γ + jet and dijet samples ( fig. 1a) and the "quark-like" and "gluon-like" topics extracted from these distributions via the data-driven method described above (fig. 1b). The corresponding heavy-ion results are shown in figs. 1d and 1e, keeping in mind that the proton-proton and heavy-ion analyses are completely independent. The extracted topics are in good agreement with the distributions of constituent multiplicity for quark-and gluoninitiated jets as defined at the MC level.
Furthermore, we can use eq. (3) to extract the topic fractions, i.e., the proportions of the topics in the original input distributions. Figures 1c and 1f show the extracted fraction of the gluon-like topic in the γ + jet and dijet samples as a function of jet p T . The gluon topic fractions are marginally higher than the MC-level fraction of gluon-initiated jets in proton-proton collisions, and more dramatically higher in heavy-ion collisions. In interpreting these results, however, one has to be mindful of the inherent ambiguity in using MC-level information to label jets, which we explore in the Supplementary Material. In addition, limited statistics drive the extraction of κ from eq. (2) into the interior of the distribution where the true minimum is not yet achieved. In the Supplementary Material, we repeat the heavy-ion analysis using a γ + jet sample with a factor of about 2.8 higher luminosity. Though the results are consistent within (large) uncertainties, the method with limited statistics will tend to overestimate the gluon-like fraction.
Even accounting for these issues, though, we find a persistently larger gluon-like fraction compared to the MC labeling, at least in the context of Jewel. One possi-ble explanation for this effect is that a "quark-initiated" jet may become more gluon-like through gluon radiation, an effect which may be enhanced by medium-induced gluon radiation in heavy-ion collisions. For methods like this one, as well as for the method in Ref. [10], this would result in a larger fraction of jets being classified as gluon jets. It is also possible that constituent multiplicity, though apparently nearly mutually irreducible in proton-proton collisions, may be less mutually irreducible in the presence of medium effects, so alternative observables (perhaps from machine learning [8]) might be required. Understanding these issues will be important for interpreting eventual LHC Run 4 data, but we emphasize that the operational definition used to define the quark-like and gluon-like topics is independent of its interpretation.
As a final proof-of-concept, in fig. 2 we show the modification of the jet constituent multiplicity distributions for the quark-like ( fig. 2a) and gluon-like (fig. 2b) jet topics as a function of p T . To our knowledge, this represents the first fully data-driven method to separate the modification of a jet observable for "quark" and "gluon" jets. Though we show here the modification of the constituent multiplicity distribution for clarity, we emphasize that once the topic fractions have been extracted, they can be used to extract separate quark and gluon distributions for any jet observable. Since both jet observable distributions and the quark and gluon fractions may change between proton-proton and heavy-ion collisions, it is substantially simpler to interpret the separate modification of quark and gluon topics compared to, e.g., the modification of the dijet distribution. Though not shown here, the method of Ref. [30] could be used to match the proton-proton and heavy-ion jet p T quantiles and further clarify the interpretation.
In summary, we have illustrated a data-driven method to extract quark-like and gluon-like topic fractions and distributions in proton-proton and heavy-ion collisions. Using Jewel samples of comparable statistics to those anticipated in Run 4 of the LHC, we have shown that these topics have a similar qualitative interpretation to the (physically ambiguous) definition of the quark and gluon jets at parton level available from MC generators. We have further shown, as an example, the modification of the constituent multiplicity in heavy-ion collisions separately for quark and gluon jet topics. This study offers an exciting proof-of-concept demonstration of the power of the topic modeling to interpret future heavy-ion collision data, though more quantitative studies will of course be necessary to understand the feasibility of this analysis and the best way to incorporate systematic uncertainties. We leave a detailed study of the impact of underlying event and background subtraction, medium response, and experimental inefficiencies as important future work. This method is well-defined for any jet observable, and the aforementioned effects may render it important to study the performance of this analysis with additional observables beyond the constituent multiplicity.

SUPPLEMENTARY MATERIAL
In this supplement, we provide additional information about limited statistics effects and MC label ambiguities.
Monte Carlo Unweighting: To match the statistics available in Run 4 of the LHC, we had to downsample our heavy-ion γ + jet events. The expected integrated luminosity for heavy-ion collisions after Run 4 is L AA,tot = L dt = 13 nb −1 . We should note, however, that the luminosity we are interested in is multiplied by a factor of A 2 = 208 2 for lead-lead collisions, L nn,tot = 208 2 L AA,tot [31]. With the total crosssection σ for a process, this gives the number of events N = σL nn,tot .
Since our Jewel sample is based on weighted events, we perform a standard unweighting procedure to simulate the statistics possible after Run 4. If we have N w weighted events, the unweighting procedure probabilistically chooses some subset of these events to act as unweighted events. To obtain an expected number of kept events N , the jth event is (independently) kept with probability p j = N wj wtot , where w tot = j w j . Note that this puts a constraint on the required size of N w to be able to unweight it to a sample of size N , namely that 1 ≥ p max = N wmax wtot , with w max being the maximum value of w j . We unweight our proton-proton sample per p T bin by fixing p max = 1 in each bin and downsampling to a sample of expected size N . After this unweighting we have 63603 and 42466 jets in dijet and γ+jet samples, respectively, with p T ∈ [100, 120] GeV, 25046 and 18488 with p T ∈ [120, 140] GeV, and 13707 and 9260 with p T ∈ [140, 160] GeV. For clarity, we downsample our heavy-ion results uniformly across all p T bins so that the resulting statistics correspond to a fixed luminosity of 13 nb −1 (and later 37 nb −1 for fig. 3). Our 13 nb −1 γ+jet sample has 6453, 2825, and 1436 jets in p T ranges [100, 120] GeV, [120, 140] GeV, and [140, 160] GeV, respectively. Our dijet sample has lower luminosity with 71098, 32318, and 15747 jets, respectively, in the same p T bins.
Limited Statistics Effects: While the proton-proton results shown in the main body of the text are relatively robust to statistical effects, the heavy-ion results are impacted by the lower statistics expected for LHC Run 4. This particularly affects the γ + jet sample, where the statistics only enable a rough estimate of the constituent multiplicity distribution. In fig. 3, we show the result of topic extraction after increasing the γ+jet sample to have a factor of about 2.8 more events than expected in LHC Run 4. With limited statistics, the method tends to overestimate the gluon-like fraction, though the results are consistent within uncertainties. With the higher statistics, the gluon-like topic fractions still remain somewhat higher than the MC-level fractions, which could simply be due to residual limited statistics effects. As mentioned in the main text, this effect could alternatively have a physical origin and might arise from "quark-initiated" jets becoming more gluon-like via gluon radiation, larger deviations from mutual irreducibility for constituent multiplicity in the presence of medium effects, or ambiguities Monte Carlo Label Ambiguities: It is important to remember that MC-level information is inherently ambiguous. In the main text, we defined the MC quark-and gluon-initiated jet distributions (γ + quark and γ + gluon in fig. 1) to include only those from jets whose axis is within ∆R = ∆η 2 + ∆φ 2 = 0.4 from a quark or gluon in the initial hard scattering matrix element. This requirement is only satisfied, however, by 87% and 90% of the jets in our heavy-ion sample of γ + jet and dijets, respectively, with p T ∈ [100, 120] GeV. In the analogous proton-proton samples, it is satisfied for 94% and 95% of the jets in γ + jet and dijet samples, respectively. (These percentages can be further increased by requiring the two jets, or the photon and jet, be nearly back-to-back in azimuthal angle.) Particularly for our heavy-ion results, this implies that the γ + jet and dijet samples are not direct combinations of the MC-level quark and gluon jet distributions.
Thus, one might wonder whether part of the mismatch between the gluon-like fractions seen in fig. 1f could be due to the fundamental ambiguity in jet flavor labeling. In fig. 4, we study this by (unphysically) restricting the γ + jet and dijet distributions to those jets whose axis is within ∆R = 0.4 of an initiating parton, and then using these restricted samples to perform the topic extraction. This has relatively little impact on the gluon-like topic fraction, though it tends to make it somewhat more similar to the MC-level gluon-initiated jet fraction. * jtbrewer@mit.edu † jthaler@mit.edu ‡ apturner@mit.edu