A background estimator for jet studies in p+p and A+A collisions

Experimentally, jet physics studies face an unavoidable task: distinguishing, at the detector level, the particles produced in the hard partonic scattering from the ones created in unrelated soft processes such as pileup interactions in high-luminosity proton-proton scattering or the underlying event in heavy-ion collisions. The fluctuating nature of the background constitutes the main source of uncertainty for any subtraction algorithm. Aiming at mitigating the effect of such fluctuations, we present a new method to estimate the background contribution to the transverse momentum on a jet-by-jet basis. Our approach is based on estimating the median background momentum density stored above a $p_{\rm T}$-cut applied at the constituent level and an experimentally accessible correction term related to the signal contribution below the cut. This allows to trade part of the uncertainty due to background contamination for that of the signal below the cut, similarly to SoftKiller method. We propose to reduce the fluctuations of the latter by exploiting intrinsic correlations among the soft and hard sectors of QCD jets. Our data-driven approach is tested against PYTHIA8 and JEWEL di-jet events embedded in a thermal background and compared to the area-median and SoftKiller methods. The main result of this study is a $\sim 5\!-\!40\%$ improvement on the resolution of the reconstructed jet $p_T$ compared to previous methods in a high-luminosity proton-proton scenario. Its applicability in a heavy-ion context is also discussed.


I. INTRODUCTION
The study of jets in ultra-relativistic hadronic collisions has decisively contributed to our understanding of the perturbative and non-perturbative aspects of Quantum Chromodynamics (QCD). Currently, at the Large Hadron Collider (LHC), events with jets in the final state are being extensively exploited to search for signatures of new physics. In the context of heavy-ion collisions, jet physics is immersed into the precision era both from a theoretical and an experimental perspective. Traditional observables, sensitive to the formation of a Quark-Gluon Plasma (QGP), such as the nuclear modification factor [1,2] or di-jet asymmetry [3] , are being complemented with substructure analysis techniques inherited from the p+p community both at RHIC and the LHC [4][5][6].
To perform meaningful theory-to-data comparisons, jet physics relies on two main ingredients: jet reconstruction algorithms and background subtraction methods. The former provide an operational definition of a jet by clustering the particles in the event following an infrared and collinear-safe method based on the notion of a distance between them either in momentum (k T ) or angular space [7][8][9]. The latter deals with the fact that the situation in which an isolated hard partonic scattering is responsible for all the clustered particles is hardly * mehtartani@bnl.gov † ontoso@bnl.gov ‡ m.verweij@uu.nl realized in current collider experiments. In high-energy hadronic collisions two competing mechanisms are relevant: pileup interactions, ubiquitous in high-luminosity facilities and the underlying event whose importance grows with the collision system size, i.e. from p+p to A+A [10]. Both of them populate the low-p T sector of the event and are commonly referred to as background.
The aforementioned clustering algorithms do not distinguish signal from background particles. Consequently, any reconstructed jet property X has two independent contributions, i.e. X reco jet = X sig jet + X bkg jet . Background subtraction methods are developed to subtract an estimated value for the X bkg jet term. We shall focus in the present work on jet transverse momentum reconstruction so that X = p T . To this aim, two steps have to be addressed. First, the value of p bkg T,jet has to be estimated. This is an intricate task due to the fact that the background contribution fluctuates on an event-by-event basis and locally within the same event. Next, the question on how to remove this estimated background at the particle level arises. These two elements are strongly correlated as an improvement on the p bkg T,jet estimate largely impacts the reconstruction of other features such as the jet mass [11][12][13]. In what follows we focus on background estimation. The subtraction part will be addressed in a future work.
Several methods are used to estimate the background transverse momentum contribution on a jet-by-jet basis in p+p and A+A collisions [14][15][16][17]. In this work we would like to draw the attention on the subtle interplay between signal and background in the low-p T , few GeV, region. We pinpoint the underlying mecha-nisms driving the mean and standard deviation of the reconstructed jet momentum distribution. This systematic study leads to the design of a new background estimator that improves the jet p T reconstruction resolution. Furthermore, we propose an experimentally measurable quantity, to be described in the following, to correct for signal contamination. Our method, dubbed ρ-correction as it corrects for the background energy density, aims at mitigating both background and signal fluctuations on a jet-by-jet basis.
This paper is organized as follows. In the next Section we provide a detailed explanation of the generator samples together with a review of existing background estimators and emphasize the importance of signal contamination. The new approach and its application to p+p and A+A collisions are presented in Section III. To conclude, future lines of work are outlined in Section IV.

II. BACKGROUND ESTIMATION
In this Section, the simulation framework together with the key ingredients of existing methods that aim to provide an accurate estimate of the background transverse momentum contribution to a jet p T are exposed. In the last part of this Section we analyze the impact of signal contamination below a given soft p cut T .

A. Simulation set-up
First, we introduce the toy data samples that have been used to compute all the results presented in this manuscript.
The situation of a p+p collision at 13 TeV with large pileup is mimicked by superimposing a PYTHIA8 [18] di-jet event into a number n PU of minimum bias events. The virtuality of the hard partonic scattering is set tô p T = 100 GeV/c and the number of pileup interactions is set to n PU = 200 for the high-luminosity LHC [19]. For completeness, we also consider current LHC running conditions with n PU = 60.
In the case of a central heavy-ion collision, which will be addressed in Sec. III B, we make use of the JEWEL [20] event generator without recoil (JEWEL NR) embedded into a thermal background, as a proxy for the QGP. Di-jet events are produced at a center-of-mass energy per nucleus-nucleus collisions √ s = 5.02 TeV. Then, the parton shower is modified by collisions with the background constituents and medium-induced radiative processes. Background particles are obtained by sampling a Boltzmann distribution with a fixed multiplicity N = 7000 and an average momentum µ = 1.2 GeV/c. This corresponds to an average momentum density of ρ = 250 GeV, a realistic value for central Pb+Pb collisions. In this case, background particles are massless and uniformly distributed in (η, φ)-space with |η| < 3.
For each event, background and signal particles are clustered into anti-k T [9] jets with R = 0.4 using Fast-Jet3.1 [21]. The true, p jet T,truth , and the reconstructed, p jet T,reco , jet momenta are defined as follows where p jet T,raw refers to the jet momentum resulting from the clustering and the subscript sig labels the particles from PYTHIA/JEWEL. The goal of this paper is to provide an estimate of p jet T,bkg that leads to a (p jet T,reco − p jet T,truth )-distribution with a close to zero mean and minimal standard deviation. Finally, the analysis is performed on jets with p jet T,truth > 120 GeV/c.

B. Previous methods
The most largely used background estimation method by the jet experimental community is the areamedian [14] whose biggest strength is its data-driven and unbiased nature. On an event-by-event basis, the particles are clustered with the k T algorithm [7]. After discarding the two hardest clusters (patches), under the assumption that they are predominantly composed of signal particles, ρ defined as is computed for each cluster, where p T and A are the total transverse momentum and area in the (η, φ) plane of the corresponding patch, respectively. The background contribution to the reconstructed jet p T is then given by where med(ρ) is the median p T density of the set of patches.
Although highly successful at estimating p jet T,bkg on average, this method lacks control over background fluctuations. This translates into a (p jet T,reco − p jet T,truth )distribution centered at zero but, as shown in Section III, with a sizable standard deviation [10].
An effective way of improving the resolution of the area-median method is to introduce a p cut T at the constituent level. If signal and background do not overlap in momentum space, removing all particles in the event below the separation scale would result into an exact estimation of the background in each patch (jet). In this ideal scenario the (p jet T,reco − p jet T,truth )-distribution would be a Dirac Delta. This is the underlying idea behind Soft-Killer (SK) [15]. In this case, the event is clustered using a grid in (η, φ)space of size a that is a free parameter of the method whose role is discussed below 1 . For every cell in the grid, the maximum constituent's momentum, p max T,i , is found in order to compute the p cut T defined as Then, the jet-by-jet background estimate of this method is expressed like Behind its apparent simplicity, this algorithm hides a few remarkable subtleties that we would like to comment on. Clearly, SoftKiller's performance depends on the scale separation between background and signal. In general, soft signal particles would be removed if they fall below the p cut T leading to a narrower but shifted to negative values (p jet T,reco − p jet T,truth )-distribution. An effective solution to overcome this bias is to tune the size of the patches where the p cut T is computed. That is, by construction there is a value of the parameter a or equivalently R kT (the resolution parameter in the k T clustering) for which the shift of the (p jet T,reco − p jet T,truth )distribution is zero. This is so because a non-trivial interplay exists between the grid size/clustering radius and the p cut T : modifying R kT is equivalent to change the value of p cut T and, consequently, the grid size can be adjusted to balance the signal contamination and background contribution below and above the cut. Therefore, in the following, we tune the value of R kT such that the absolute mean value of the p T reconstruction distribution is smaller than 0.5 GeV/c. We shall refer to this method as "SoftKiller R adj ". Notice that this choice does not constrain, in principle, the standard deviation of the momentum reconstruction distribution. Finally, we would like to comment on the fact that, in this work, we use SoftKiller as a background estimator and not as a subtraction method. In our opinion, the resulting p jet T,bkg could be used as an input to different subtraction algorithms, e.g. [16], to avoid distortions in the jet substructure due to the removal of all particles below p cut T .

C. Impact of signal contamination
In the previous Section, we remarked that any background estimator that relies on a soft momentum cut is sensitive to QCD radiation below this threshold. To quantify the signal contamination, we have computed the average contribution of the total jet p T accounted by constituents whose momentum is below a certain p T threshold: The event-averaged value of this quantity, for the PYTHIA8 sample, as a function of multiples of the momentum cut at the constituent level, p cut T , is shown in Fig. 1. For completeness, we include the values of Soft-Killer p cut T obtained by clustering the event containing both signal and background particles from n PU = 60 minimum bias events into k T -patches of R kT = 0.2, 0.4, removing the two hardest ones and then applying Eq. 5. As expected, the signal contribution is a monotonous increasing function of p cut T . It is worthwhile to mention that even for considerably low thresholds such as 2 GeV/c the contribution of signal particles amounts to almost 15 GeV/c, which is more than 10% of the jet p T selection threshold. This indicates a strong overlap between signal and background. Further, the aforementioned direct relationship between R and the Soft-Killer p cut T (see Eq. 5) is explicitly shown, i.e. p cut T grows when increasing R kT . More concretely, increasing the p cut T by 1 GeV leads to a signal contamination growth of ∼ 5 GeV. This is an important aspect given the fact that SoftKiller R adj is based on canceling out the signal contamination with the background contribution above the cut. That is, minimizing the shift requires a precise fine-tuning of R kT and small deviations from the optimal value would result into a sizable bias in the p T reconstruction.  other jet features. However, QCD dynamics generates intrinsic correlations between the soft and hard p T sectors within a jet owing to the tree structure of QCD parton cascades. The constituents whose momenta is below a given p cut,soft T , responsible for the p sig T,< -term, originate from the branching of hard p T partons. Each of these multiple soft radiations is not independent from the rest of the shower as its phase space is restricted by mechanisms such as energy conservation or color coherence. This correlated radiation pattern is expected to manifest itself in jet substructure observables. For instance, an enhancement of soft radiation for broader jets is expected. We test this hypothesis within PYTHIA8 and the result is shown in Fig. 2 where the width is defined as [22] width >,hard = p T ,i >p cut,hard and p cut,hard T is a hard momentum threshold above which the width is computed, R refers to the jet radius and ∆R i is the cylinder (η, φ)-distance from each constituent to the jet axis. A clear correlation between p sig T,< and width >,hard is observed. Further, the standard deviation of p sig T,< is reduced when fixing width >,hard as compared to the inclusive case (see Fig. 1). Therefore, signal fluctuations in the low-p T regime can be mitigated by using hard-soft correlations within a jet. For simplicity, we restrict the results of this manuscript to the width and leave a more systematic study of other metrics and combinations of them for future work.
All the elements discussed in this Section, i.e. a momentum cut, the median ρ and the average signal in the soft sector, are combined into a new background estimator to be described below.

III. THE ρ-CORRECTION METHOD
On an event-by-event basis, the ρ-correction algorithm proceeds as follows: 1. Divide the event into patches with R kT = 0.4 via the k T algorithm and remove the two hardest ones.
2. For each patch, compute ρ with the particles whose p T is above a p cut,soft T (which will be discussed below) where A denotes the area of the patch.
4. Re-cluster the whole event via the anti-k T algorithm with R = 0.4.

5.
For each anti-k T cluster, add up the p T 's of those particles that are below p cut,soft 6. The final estimate for the background p T of a given anti-k T jet is where the subscripts "<" , ">" refer to the region where these quantities are computed, i.e. below or above the p cut,soft T .
The pocket-formula of the ρ-correction method, Eq. 11 neglecting the last term, interpolates between the areamedian and SoftKiller for which p cut,soft T = 0 and p cut,SK T (see Eq. 5), respectively. The role of the last element, p shift T,< , is to correct for the offset induced by signal contamination (see Figs. 1,2). To address the definition of p shift T,< , an explicit distinction between jets arising from p+p and A+A collisions is needed. This is due to the impact of the well established phenomenon of energy loss i.e. parton showers are modified in the presence of a QGP leading to a degradation on the jet p T with respect to its parent parton [23].
• p+p: we explore two possibilities for p shift T,< . First, using an inclusive determination of p sig T,< (ρcorrection) i.e. p shift T,< = p sig T,< . Second, the background p T estimate of each jet is corrected for signal contamination with a fluctuating value of p sig T,< depending on its width >,hard such that p shift T,< = (p sig T,< width >,hard ) Given the fact that we exploit hard-soft correlations we name this option "ρ-correction HS corr ". This choice requires a value of p cut,hard T so that the width is computed in the background-free region. We fix it to be 5µ. We have checked that the results are robust against modifications of this value. Also, when p cut,hard T 5µ the soft and hard sectors become de-correlated and the inclusive result is recovered. At this point the natural question on how to extract this information, outside a Monte Carlo environment, arises. We propose to measure it at low pileup proton-proton collisions. The idea of using low pileup events as relatively backgroundfree samples was also explored in [24]. We have studied the behavior of p sig T,< with the jet p T finding a logarithmic dependence. This is an important aspect in terms of applicability of the method to experimental data as inaccuracies in the determination of the true jet p T due to detector effects or background contamination would not substantially affect the value of p sig T,< . Therefore, the ρ-correction method ensures a (p jet T,reco − p jet T,truth )distribution that is centered around zero thanks to the p shift T,< -term.
• A+A: the impossibility of experimentally accessing a background-free environment in a heavy-ion collision in which p shift T,< could be determined in an unbiased way demands an alternative definition of it. For the ρ-correction case we propose where p T,< is obtained by performing an ensemble average of Eq. 10 over a set of unsubtracted jets i.e. containing signal and background particles. The second term in Eq. 13, med(ρ < ), is extracted from the same sample of k T jets that we used to compute Eq. 9. Again, · represents an ensemble average over the unsubtracted jets with area A. Remarkably, every quantity in Eq. 13 can be extracted from data making the proposed method directly applicable to experimental analyses in heavy-ion collisions. If hard-soft correlations where to be exploited, the correction reads as follows The two extra terms in Eq. 14, as compared to Eq. 13, have a transparent interpretation. As previously discussed, (p sig T,< width >,hard ) p+p cannot be determined from heavy-ion data. Therefore, one has to use the same correction as in the p+p case i.e. obtained from low pileup events. Clearly, this introduces an additional shift that can be compensated with p sig T,< p+p , defined in Eq. 7.
Finally, the determination of p cut,soft T proceeds as follows. We select a fixed value of p cut,soft T requiring that it minimizes the standard deviation of the (p jet T,reco − p jet T,truth )-distribution. We have checked that the optimal value of p cut,soft T does not strongly depend on whether the inclusive or exclusive version of p sig T,< is used to correct for the signal contamination. Hence, we believe that the most direct way to apply our method to experimental data is to use Eq. 11 with the inclusive value of p sig T,< to find the optimal p cut,soft T . Once this value is known one can use the width-dependent version of p sig T,< to get a further reduction on the p T resolution. The results of the method in two different contexts are presented in the next Section.

A. Proton-proton collisions
We first test the performance of the method on p+p collisions with different pileup settings: n PU = 60, the actual running conditions of the LHC and n PU = 200, the foreseen scenario at the high-luminosity phase. In Fig. 3, the resolution of the reconstructed jet p T as a function of p cut,soft T for both cases is displayed. The generic shape of this curve can be understood using Eq. 11. For low values of p cut,soft T , med(ρ > ) dominates in Eq. 11 and, as expected, no significant improvement is achieved with respect to the traditional area-median method. In the opposite regime, when p cut,soft T > p cut,SK T , the contribution of med(ρ > ) is identically zero and σ monotonically increases with p cut,soft T due to an increased number of signal particles being misidentify as background, as shown in Fig. 1.
The minimal resolution is achieved for p cut,soft T = 1.4 GeV/c in the n PU = 60 case while in the highluminosity scenario the p T -resolution is optimized for p cut,soft T = 2.2 GeV/c. In the case of SoftKiller, the optimal values of R kT are R kT = 0.23 for n PU = 60 and R kT = 0.22 for n PU = 200. We observe a weak dependence of the standard deviation on p cut,soft T in the region near the minimum as shown in Fig. 3. This is a clear advantage with respect to SoftKiller where a precise finetuning was needed in order to minimize the shift.
The precise values for the mean and standard deviation of the p T -reconstruction distributions for both background settings can be found in Tables II,III for the four explored methods: area-median, SoftKiller R adj ,ρcorrection and ρ-correction HS corr where the width-dependent p sig T,< correction is obtained for the optimal value of p cut,soft T . First of all, any method that includes a momentum threshold highly reduces the standard deviation as compared to the area-median. We find that the p T resolution of ρ-correction and SoftKiller R adj are comparable. These two methods, although approaching the problem from different angles, are in fact similar and, as noted in [15], the value of R kT for which the shift is minimized results into a close-to-minimal resolution. In the high-luminosity scenario, the ρ-correction HS corr method results into a (35%, 5%) improvement on the standard deviation with respect to area-median and SoftKiller, respectively. Remarkably, our method is resilient against the particular density of the background i.e. the general trends persist when reducing the number of minimum bias events as observed in the bottom part of Fig. 3. As in the previous case, the best performance is achieved by ρ-correction HS corr , i.e. when p sig T,< is computed for bins of width >,hard . We conclude that the use of a fundamental property of QCD such as hard-soft correlations improves the background momentum estimation in jet experimental analyses. This is one of the main results of this paper.    The application of the ρ-correction algorithm to a heavy-ion context starts by obtaining the value of p cut,soft T by minimizing σ(p jet T,reco − p jet T,truth ) within the JEWEL sample. The main source of uncertainty with respect to the p+p case is the impossibility of determining p jet T,truth due to energy loss. Therefore, to avoid Monte Carlo dependence, the adjustment of free parameters in any method has to be done with p+p data. The standard deviation in vacuum is minimized, as shown in Fig. 3, for a value of p cut,soft T ∼ 2.5µ for which we compute p shift T,< (both in the inclusive and the width-dependent case) in JEWEL without energy loss. In the case of SoftKiller R adj we follow a similar procedure: first we optimize the value of R kT in JEWEL without energy loss R kT = 0.21 ( p cut T = 3, σ(p cut T ) = 0.05 GeV/c) and then use it in the heavy-ion context.
These are all the ingredients that we need to apply Eq. 11. The two first statistical moments of the momentum reconstruction distributions in the JEWEL NR case are displayed in Table III. Regarding the standard deviation we reach a similar conclusion to that in the p+p case: the ρ-correction HS corr method reduces it by 0.6 GeV/c and 4.4 GeV/c when compared to SoftKiller R adj and area-median, respectively.
The fact that the mean value within SoftKiller deviates from zero, as compared to the p+p case, reflects in-medium jet modification absent in the vacuum baseline. In the case of both ρ-correction and ρcorrection HS corr the mean value of the (p jet T,reco −p jet T,truth )distribution is close to zero thanks to the p shift T,< -term as given in Eqs. 13,14, respectively. This is a remarkable and non-trivial feature as the applicability of our unbiased method to heavy-ions doesn't lie on the common trade-off between resolution improvement and the shift on the mean due to energy-loss. The ρ-correction approach leads to simultaneously obtain a momentum reconstruction distribution centered at zero alike the areamedian with a substantial reduction of its standard deviation along the lines of SoftKiller.
For completeness, the momentum reconstruction distributions are shown in Fig. 4. In there, the positive shift of SoftKiller together with the broadness of the areamedian distribution are clearly visible. We conclude that ρ-correction HS corr is an unbiased method ready to be used as background estimator in jet experimental analyses of heavy-ion collisions.

IV. SUMMARY AND OUTLOOK
In this work we propose a new background estimation method dubbed ρ-correction that aims at improving the resolution on jet p T reconstruction analyses in hadronic collisions. To that end, we analyze the role of key elements of existing algorithms, such as the areamedian and SoftKiller, in terms of their impact on the mean and standard deviation of the (p jet T,reco − p jet T,truth )distribution. In the spirit of SoftKiller, we introduce a p cut T at the constituent level that we determine by minimizing the standard deviation of the jet momentum re-construction distribution, in contrast to their mean value optimization. The price to pay for having a reduction on the standard deviation thanks to a p cut T is an overestimation of the background component due to abundant soft QCD radiation. Our method, summarized in a compact way by Eq. 11, overcomes this offset with an experimental accessible quantity: the average momentum stored below a given p cut T inside a jet to be measured in a clean environment like low pileup p+p collisions. We show how the average of this distribution is enough to correct for the shift in high-luminosity proton-proton collisions while a couple of additional measurable terms are needed in a heavy-ion context due to in-medium jet modification. The method results into a comparable performance to SoftKiller in the high-luminosity scenario while it outperforms both area-median and SoftKiller in heavy-ion collisions.
Finally, we show that exploiting correlations between the soft and hard sectors within a jet, signal fluctuations are mitigated and, consequently, the jet momentum reconstruction resolution is improved with respect to the area-median and SoftKiller methods. The use of QCD correlations generated in the branching process of the shower to reduce the resolution in jet momentum reconstruction is the main novelty of this work. The natural continuation would be to implement a machine learning set up, in the spirit of [17], to systematically identify the set of substructure observables that better correlate with p sig T,< . Work in this direction is on-going. Finally, the combination of this background estimator with a novel subtraction method will be the subject of a follow-up publication [13].