A New Method for Reducing PDF Uncertainties in the High-Mass Drell-Yan Spectrum

Uncertainties in the parametrization of Parton Distribution Functions (PDFs) are becoming a serious limiting systematic uncertainty in Large Hadron Collider (LHC) searches for Beyond the Standard Model physics. This is especially true for measurements at high scales induced by quark and anti-quark collisions, where Drell-Yan continuum backgrounds are dominant. Tools are recently available which enable exploration of PDF fitting strategies and emulate the effects of new data in a future global fit. ePump is such a tool and it is shown that judicious selection of measurable kinematical quantities can reduce the assigned systematic PDF uncertainties by significant factors. This will be made possible by the huge statistical precision of future LHC Standard Model datasets.

Constraining PDFs and their uncertainties is now an intense research program. The systematic uncertainty in the PDF models arises from the 1) experimental uncertainties of the input data used in a global fit, 2) any theoretical assumptions assumed by the fitting groups, and/or 3) the chosen parameterizations characterizing the functional forms of the PDFs themselves. All of the global PDF fitting groups (CTEQ-TEA [1], MMHT [2], and NNPDF [3]) characterize their fits with complex error matrices and so experiments can legitimately include PDF uncertainties as a component to any theoretical error for any measurement or limit.
In this paper we explore the PDF uncertainties as they apply to the BSM search for a resonant Z gauge boson in the dilepton invariant mass spectrum. The dominant and irreducible background process to this search is the Drell-Yan (DY) process. Both ATLAS [4] and CMS [5] have recently completed their searches for new high-mass phenomena from the first √ s = 13 TeV data-taking runs at the LHC. Both set comparable lower bounds on the mass of a hypothetical new vector boson and both publish extensive lists of their systematic uncertainties, including uncertainties attributed to our limited knowledge of PDF fitting.
To date, only 5% of the planned LHC data are in hand and yet these PDF uncertainties might already have limited future mass reaches for such searches. Not only are resonant Z boson searches "at risk" but also W boson searches and especially non-resonant (such as contact interactions) searches, which are very sensitive to sloped shape changes in the background. Therefore, it is critical that we improve our understanding of PDFs and their associated uncertainties. One way to do that in the short-term (on search timescales) is to explore the prospect of using the DY data collected at the LHC in a well-measured control region, to constrain the theoretical uncertainties in a kinematic region relevant to BSM searches.
The machinery of PDF global fitting groups is very complex and for physicists outside of the PDF groups, testing new PDF analysis strategies can be cumbersome. This will change with the recent development of tools like ePump [6] (the Error PDF Updating Method Package, see below for details), which makes it possible to explore the effects of new kinematic inputs to a global fit without requiring a full global analysis.
While other PDF profiling tools exist such as xFitter [7], in this paper we choose to use ePump which has been thoroughly tested [6] against the CT14 global fits. [8] The work in this paper is the first published use of this tool. Through its use, we demonstrate that new insight into kinematics of the DY process has emerged, and that considerable reduction in the quark and anti-quark PDF uncertainties is possible with new data inputs to PDF global fitting, and perhaps also a new analysis strategy.

A. Our Goals
Our goals in this paper are limited. ePump preserves all of the important features of a standard global fitting such as sum rules and full QCD evolution and we use it here as a tool to predict the influence of new data sources by "simulating" PDF global fitting. The result is an updated PDF set, augmented as if the new data were a regular input to a global fit. This pseudo-set includes the full complement of PDF uncertainties. We limit ourselves to the CT14HERA2 parameterization and do not attempt to explore the space of possible functional forms. We do not attempt to optimize theoretical uncertainties associated with any other theoretical considerations like the strong coupling constant, electroweak couplings, or higher order electroweak and QCD effects. We also make no effort to optimize or explore the full set of possible experimental uncertainties. We simply ask the optimistic question: can qualitatively new data when combined with the current inputs of CT14HERA2 reduce future PDF uncertainties and if so, by how much?
Our ansatz is to treat BSM DY searches as consisting of a control region-from which we envision mining DY data for global fitting-and signal region to where those new global fits are extrapolated. Of course as in any control-signal region analysis, the assumption is that the control-region contains only SM physics. We specifically explore the possibility that LHC DY data in a safe control region might be useful to further constrain PDFs appropriate to high-mass BSM searches for which the continuum DY is the dominant background.
Having determined that this is worth consideration, our ultimate proposal is that the LHC experiments and the PDF fitting teams work together to explore inclusion of LHC DY data into global fitting when prepared in a particularly useful way.

III. THE DRELL-YAN PROCESS
The general Drell-Yan process [11] of pp → + − + X at leading order originates from an s-channel exchange of an electroweak boson Here, X denotes any additional final-state particles (radiated partons, the underlying event, multi-parton interactions, etc.). At next-to-leading order, the real corrections introduce three t-channel processes, listed in order of decreasing cross section at LHC energies, The leading order process is depicted in Fig. 2.
In each case, the vector boson decays into a pair of same-flavor, oppositely-charged leptons. For simplicity, our discussion will center on the leading order process, but our simulations and results are based on the calculations using the ResBos [12][13][14] package which resums effects of multiple soft-gluon radiation to all order at the next-to-next-to-leading-log (NNLL) accuracy with Wilson coefficients calculated at next-to-leading-order (NLO) in the CSS transverse momentum resummation formalism.
The DY triple-differential cross section can be represented as a function of the dilepton invariant mass m , the dilepton rapidity y , and the cosine of the lepton polar angle in the Collins-Soper rest frame cos θ * . For the LO s-channel process, this is Here √ s is the CM energy of the LHC, and P 1 and P 2 are the 4-momenta of protons 1 and 2. In the standard fashion, x 1 and x 2 are the incoming parton momentum fractions such that p 1 = x 1 P 1 and p 2 = x 2 P 2 .
The functions f q/P 1 (x 1 , Q 2 ) and fq /P 2 (x 2 , Q 2 ) are the PDFs for quark flavors q andq, respectively. The term (q ↔ q) accounts for the fact that either proton can carry a sea quark, as the LHC is a proton-proton collider.
Finally, the quantity P q accounts for the parton-level dynamics in terms of important electroweak parameters, and exhibits dependencies on both dilepton mass and cos θ . We will be concerned with each factor in this formula, and it is discussed in detail in Sec. III C.
The energy scale of the collision is set by the transferred four-momentum squared Q 2 , which can be identified with the square of the dilepton invariant mass m 2 . Well-known kinematic definitions include and, which parametrizes the dilepton rapidity in terms of the x fractions of the initial-state partons at LO. From these, the variables are related, also at LO, by  This significant lack of precision is the source of the large systematic uncertainties required in a high mass, dilepton Z search. Figure 6 shows the iconic invariant mass distribution of dilepton pairs calculated using the ResBos As DY data inputs are the only way to constrain high-x PDFs, a strategy is explored here that turns this lack of sensitivity into an opportunity. The DY continuum is well- measured and reliably SM physics. If PDF global fits were to include LHC DY data well below any search region, but high enough in invariant mass to better constrain the fits, this uncertainty could be reduced. Moreover, the amount of LHC data that will become available in the coming years will be staggering, so we've decided to explore DY kinematics further in hopes of finding/discovering sensitivities that would help to enhance the potential of high-x PDF fits.
We will show that there are DY observables, such as cos θ * , that could in principle be incorporated in PDF global fitting, and the use of ePump tells us approximately how much reduction in PDF uncertainty is possible, as well as how much smaller the PDF systematic uncertainty might become in the DY differential mass spectrum. Due to the importance of cos θ * , and the role it plays in our fitting strategy, a brief review is given in the next section.  toẑ in the CS frame and can be calculated directly from lab frame lepton quantities with The sign of the z axis is defined on an event-by-event basis as the sign of the lepton pair momentum with respect to the z axis in the laboratory frame. Here, P T and P z are the transverse and longitudinal momentum of the dilepton system, respectively, and, where the lepton (anti-lepton) energy and longitudinal momentum are E 1 and p z,1 (E 2 and p z,2 ), respectively. This definition requires the electric charge identification of each lepton.
We define DY events as forward (cos θ * > 0) or backward (cos θ * < 0) according to the direction of the outgoing lepton in this frame of reference.

C. Relearning Drell-Yan Kinematics
We will exploit a novel feature of the DY subprocess cross section from Eq. (5) and the definition of cos θ * of Eq. (9), to form the basis of the PDF update with ePump in Sec. IV.
v The function P q of Eq. (5) encodes the parton-level dynamics with which is a weighted sum of an even function (1 + cos 2 θ * ) and an odd function cos θ * . In the calculation of the total inclusive cross section, the odd term integrates to zero, but is responsible for inducing the well-known γ * /Z forward-backward asymmetry A F B .
The asymmetry coefficients C 0 q and C 1 q of Eq. (11) include the electroweak couplings of the initial-state quarks and final-state leptons, and describe the m spectrum as Where Here m Z and Γ Z are the mass and decay width of the SM Z boson, and Q f , v f , and a f are the electric charge, and the electroweak vector and axial-vector couplings of each fermion, whose values are shown in We found particular practical significance in focusing on the polar angle. Figure 7 shows several cos θ * distributions of Eq. (9) in discrete slices of dilepton invariant mass. Each mass-slice is further decomposed into sub-processes that consist distinctly of up-type or down-type initial-state quarks. The up-type sub-processes include initial-states of uu, ug, and ug, where u is the up quark or charm quark and g is the gluon. A similar definition applies to the d-type (down, strange, bottom) sub-processes, with u replaced by d. This is in accordance with the four DY reactions in Eqs. (1) and (4).
Intriguingly, the relative up-type and down-type sub-processes are highly dependent on both mass and polar angle θ * . This is especially true above the Z boson mass peak, in which the forward region (cos θ * > 0) shows an increasing degree of separation between the rates associated with the up-type and down-type DY sub-processes. Indeed, in this region the contribution to the total cross section is due almost entirely to the up-type sub-process by itself. At high mass and high polar angle, the LHC DY process proceeds almost entirely through the uū sub-process, effectively making the LHC a uū collider.
Why is this the case? Figure 7(f) shows that as cos θ * nears +1, the up quark dominates DY production by almost a factor of four over that of the down quark. Taking apart Eqs. (5), (11), (12), and (13) explains this observation. Figure 8 shows the quantities P u and P d evaluated at cos θ * = ±1. The closed (open) circles tag P u (P d ) at a √ŝ = 1 TeV for the cos θ * = 1.0 curves. The ratio of P u /P d is about 2. Figure 9 shows the separate parton luminosity functions L uū and L dd for the leading order uū and dd sub-processes in accordance with the CT14HERA2 PDF set. Here too, the closed and open circles refer to the up-quark and down-quark parton luminosity functions and again, the ratio of L uū /L dd is approximately 1.5. The product of these contributions (i.e., P u /P d × (L uū /L dd ) to the rates confirms the near factor of 4 ratio observed in Fig. 7 at cos θ * near 1.
Notice that we've not really learned anything new! DY kinematics is an old subject but CS * θ cos the high-mass behavior is revealing and the question is whether cos θ * behavior as a function of mass should be an important discrimination as an input to global PDF fitting. This is Theory template of a PDF set (parameters + uncertainties) and binned Data template of (pseudo-) data, including statistical uncertainties from integrated luminosity assumptions.
where ePump comes in.

IV. A PROPOSED STRATEGY TO PDF ERROR REDUCTION FOR DY
We attempt to shed light on two questions: 1. If cos θ * data were incorporated in global fitting, how significant might the reduction in PDF uncertainties be?
2. Would those decreased errors be a significant reduction in the overall theoretical uncertainties in future BSM, high-mass DY searches?
In order to answer Question 1, ePump was used, which can update an existing PDF set with new experimental data (or pseudo-data) in order to produce an improved best-fit and Hessian Error PDFs. The ePump workflow can be seen in Fig. 10.
For this analysis, "pseudo-data" are used to mimic a possible future LHC dataset for PDF fitting. As any dataset has finite statistics, the resulting uncertainties in the new PDFs will reflect whatever statistical precision is modeled in the pseudo-data. The effects of new PDFs and uncertainties can then be used to re-evaluate the PDF systematic uncertainty on the high-mass dilepton event yield.
Furthermore, we imagine a Signal Region (SR) as m > 1 TeV and a Control Region (CR) to be for 0.04 < m < 1 TeV. Since new physics should lie above the current limits of approximately m ∼ 3 TeV (as in Sec. II), it would be "fair" to use low-mass DY data to constrain the high-mass DY spectrum.

A. The ePump Package
In a standard PDF global fit, the PDFs are determined by minimizing the function, which consists of contributions from N exp fitted experiments, χ 2 n . In the simplest case with no correlations between data points, the contribution from an experiment can be written where D ni is the experimental data value, σ ni is the experimental error (combined systematic and statistical), and T ni (z) is the theory prediction, which depends on the PDFs, which in turn are described by a finite number of parameters, z. In practice, the χ 2 n for modern experiments will include correlated errors among data points, and there may be additional terms added to impose constraints on the theoretical parameters, but the general procedure is unchanged. The central or best-fit PDFs are obtained by minimizing χ 2 global with respect to z. In addition, χ 2 global is, to a good approximation, a quadratic function of the parameters around the minimum. This is the basis for the Hessian approximation for PDF errors, which utilize PDF eigenvector sets, two for each PDF parameter, to evaluate the uncertainty due to This is where a tool such as ePump is advantageous. ePump works by using the fact that the original χ 2 global is well-approximated by the known quadratic function and the fact that the theory predictions for the new observables, T Nexp+1,i (z), can be approximated using the original Hessian error PDFs. Under these approximations, the minimization and Hessian diagonalization can be performed algebraically [6,18], with the numerical computations taking seconds, rather than hours or days. The output of ePump is an updated central and Hessian eigenvector PDFs, which approximate the result that would be obtained from a full global re-analysis that includes the new data. As an additional benefit, ePump can also directly output the updated predictions and uncertainties for any other observables of interest (such as the cross section in the signal region), without the necessity to recalculate using the updated PDFs. For more details about the use of ePump, see Ref. [6]. The code for ePump and more specific details of its usage can be obtained at the website http://hep.pa.msu.edu/epump/.

B. PDF Update Strategy
ePump requires standard inputs to emulate the global fit-the templates in Fig. 10. We describe our strategy here. The analysis was performed at "truth level," such that the acceptance and efficiency effects associated with the reconstruction and identification of prompt, high-p T leptons in an LHC detector are neglected. However, leptons are well measured at the LHC, so this is an acceptable first look at this technique. Additional dilepton backgrounds were neglected, but are well understood by the LHC experiments as can be seen in Fig. 1.
These backgrounds include tt production, W t Single Top production, W W , W Z, and ZZ Diboson production, and W +jets & Multi-jet production in the electron channel.

C. ePump Template Construction
Naively, one might imagine only using m in the CR to predict the improvement in the SR, but our awareness of the significant differential quark sensitivities to cos θ * (and moderate sensitivity to y ) plus the knowledge that future LHC running will provide enormous continuum DY datasets led us to explore dividing pseudo-data into many bins of dilepton mass m , as well as y and cos θ * .
The fiducial region considered for our analysis is designed explicitly to probe the PDFs at high x, and is defined by 40 GeV < m < 1000 GeV, |y | < 3.6, −1 < cos θ * < 1. Events passing these selections were binned in ePump template histograms, which parametrize the triple-differential cross section of Eq. (5), according to where i, j, and k correspond to the bin indices of each distribution of interest.
The total number of pseudo-data events are given by N ijk pseudo−data , the integrated luminosity of the pseudo-dataset is L int , and (∆m ) i , (2∆|y |) j , and (∆ cos θ * ) k are the corresponding bin widths. [19] The factor of two in the denominator accounts for the modulus in the rapidity bin width. The bins used to parametrize Eq. (17) are where the C and F regions are explicitly called out for the CC and CF selections. The total number of measurement bins is N bins = 12 × 18 × 6 = 1296 for the fiducial region considered and they define the N new data points that supplement Eq. (14).
Events were generated as if they came from a future integrated luminosity and so uncertainties in the ePump results are scattered according to the statistics of such a hypothetical LHC input dataset. For each bin the DY cross section estimate σ ijk Drell−Yan was scaled by a characteristic integrated luminosity L int to arrive at a definite DY event yield N ijk Drell−Yan . The resulting yield was assumed to be the mean of a Poisson distribution, which was then used to throw a random number according to Poisson statistics, thereby populating the bin with N ijk pseudo−data pseudo-data events. Note that the pseudo-data were treated as those of one "experiment," but in practice ATLAS, CMS, and LHCb would all be sources of fitting input data. For illustration we chose two future LHC scenarios for integrated luminosities: L int = 300 fb −1 approximating the data set for one experiment following Run-3 of the LHC, and L int = 3000 fb −1 , approximating that of the final dataset for one experiment of the HL-LHC.

V. PDF UPDATE RESULTS
We can answer Question 1 by re-evaluating the effect of the 3000 fb −1 DY pseudo-dataset on the CT14HERA2 PDFs, as well as Question 2 by assessing the reduction of the PDF systematic uncertainty in the high-mass dilepton spectrum.
A. Impact on CT14HERA2 PDFs Question 1 asked whether explicit inclusion of cos θ data might have a useful effect in reducing the uncertainties in the parton fits. The answer can be seen in the following four plots in Figs. 11 and 12. In order to see the effect of each of the quantities in the ePump-simulated refitting, there are four sets of results in each plot: • The shaded background shows the uncertainties resulting from the current CT14HERA2 uncertainties.
• The dotted curve labeled "mass" corresponds to the error reduction by sending only binned (∆m ) to ePump; i.e., integrated over the y and cos θ * dimensions. • The dashed curve labeled "rapidity" adds the cumulative effect of binned (∆|y |) and (∆m ) to ePump.
• Finally, the solid curve labeled "angle" adds the cumulative effect of binned (∆ cos θ * ), (∆|y |), and (∆m ) to ePump. Figure 11 shows the impact of the ePump update with the 3000 fb −1 scenario on theū(x) and d(x) sea distributions and Fig. 12, the impact on the u v (x) and d v (x) valence distributions.
The sea distributions show a considerable reduction in uncertainty at high x. For example, in both theū(x) andd(x) distributions, the PDF uncertainty is reduced from its pre-update   value of approximately 70% to 20% at x = 0.5. The improvement in the valence distributions at x 0.5 is less dramatic, but substantial improvement is observed in the ranges of x 0.5.
The post-update u v (x) distribution remains better constrained than d v (x) at high x, where the uncertainty measures 2.6% as compared to 11% at x = 0.5, respectively. Table IV lists the pre-and post-update uncertainties for several parton flavors and values of x explicitly. Figures 13 and 14 show the reduction in uncertainties for the 300 fb −1 scenario and Table V is the corresponding comparison for the 300 fb −1 scenario.
The answer to Question 1 is that a global PDF fit which includes DY LHC data below 1 TeV in mass, and binned in rapidity and cos θ * , would dramatically improve the precision u   The results are presented in Fig. 15, which shows the impact of the 3000 fb −1 pseudodataset on the high-mass PDF systematic uncertainty. The PDF uncertainty is evaluated at several characteristic values of dilepton mass, which are listed in Table VI. At m = 5 TeV, the PDF systematic uncertainty is reduced from 31% to 8.9%, a reduction of roughly a factor of 3.5. Similarly, at m = 3 TeV, the uncertainty is reduced from 15% to 3.7%, roughly a factor of 4. In each case, a substantial improvement is obtained compared to the current state-of-the-art predictions (as depicted in Fig. 7). The PDF uncertainty assessed in the ATLAS dilepton analysis is, for example, 13% and 29% at m = 3 and m = 5 TeV, respectively.

C. Additional Analysis Strategies
Remembering that positive cos θ * is essentially all up-quark collisions, an additional analysis strategy might be to only analyze data from specific regions of cos θ * . Figure 15( cos θ * > 0, where the error is assessed with the updated PDFs. Such a combination provides for a marginal, but non-trivial further reduction in the PDF uncertainty at high-mass at the expense of a reduction in statistics. Table VI quantifies the PDF uncertainty before and after the update in several high-mass bins of interest for each of these selections.
For the 300 fb −1 scenario, the results are shown in Fig. 16 and Table VII. While not as dramatic, this is an un-tuned effort and Run-3 LHC data will benefit significantly from the analysis approach we outlined here. Statistical precision can be enhanced by judiciously choosing the pseudo-data granularity.
The answer to Question 2 is that with the triply differential global fitting and analysis  Fig. 16. The current CT14HERA2 uncertainty estimates are shown in the first column, the result of the 300 fb −1 update is shown next, and the 300 fb −1 update with an additional cos θ * > 0 requirement on the dilepton selection is shown last.
enormous PDF uncertainties in any BSM search at high mass can be significantly reduced below the current experimental uncertainties by including DY data in bins of m , y , and cos θ * in future global PDF fitting. We were pleased to see that E. Accomando, J. Fiaschi, F.
Hautmann, S. Moretti reached similar conclusions with a different, but interesting approach.
See [20] and [21]. theoretical uncertainties in the electron channel of the dilepton analysis. As the PDF uncertainty will be reduced well below the current experimental uncertainty, attention will be shifted to the reduction of others, such as the "PDFChoice" uncertainty, improving the discovery potential of future iterations of the dilepton analysis.

VI. OUTLOOK
The impact of a future DY cross section measurement on the CT14HERA2 PDF uncertainty was assessed using the ePump package at the √ s = 13 TeV LHC with 300 fb −1 and 3000 fb −1 of DY pseudo-data. The fiducial region considered for the PDF update was based on three variables: the dilepton mass (m ), the dilepton rapidity (y ), and the cosine of the polar angle in the CS-frame (cos θ * ). This region was divided into 1296 histogram bins and used to construct ePump pseudo-data and signal templates, which were designed to probe the PDFs in the extreme kinematic regions of (x,Q 2 ) only accessible at the LHC.
The CT14HERA2 PDF set was used for the update, but similar effects would be observed in other PDF sets. The results showed a significant reduction in the uncertainties associated with all parton flavors, especiallyū(x) andd(x) sea distribution at high x. Likewise, these reduced PDF uncertainties, when propagated to the dilepton invariant mass spectrum, lead to a significantly improved description at high mass.
These proof-of-concept results indicate a great deal of improvement can still be obtained from precision PDF measurements at LHC. The use of cos θ * as an additional dimension in future PDF global fits is absolutely crucial, as it supplements the more standard doubledifferential measurements in invariant mass and rapidity; when used in conjunction, as was done here, the reduction in uncertainty can be dramatic.
For these reasons, DY cross section measurements could be vital to the success of future searches and measurements at the LHC. Not only will the PDF uncertainty that affects the high-mass dilepton analysis be reduced, improving the discovery potential of many nonresonant new physics models, but also the inclusion of new and robust data into the modern PDF global fits might even bring the uncertainty estimates of the various global fitting groups into better agreement.
Such an opportunity might result in a reduction of the "PDF choice" uncertainty when all PDF groups include triply differential DY data as discussed here. Obviously the goal would be to reach a stage in which the largest uncertainty would cease to be due to the PDFs.
Table VIII compares these uncertainties explicitly. [22] Therefore, for the reasons outlined in this paper, experiments at the LHC and global fitting groups should seriously consider the inclusion of precision measurements of the DY triple-differential cross section in order to further constrain the PDF uncertainties in future PDF global fits. In the short term the methodology described here could also be used by experiments when judiciously choosing careful control regions to update current PDF uncertainties, perhaps vastly improving search sensitivity at high mass.