Study of proton parton distribution functions at high x using ZEUS data

Proton parton distribution functions (PDFs) are poorly constrained by existing data for x larger than 0.6, and the PDFs extracted from global fits differ considerably from each other. A technique for comparing predictions based on different PDF sets to observed event numbers is presented. It is applied to compare predictions from the most commonly used PDFs to published ZEUS data at high Bjorken x. A wide variation is found in the ability of the PDFs to predict the observed results. A scheme for including the ZEUS high-x data in future PDF extractions is discussed.


Introduction
Important questions related to the nature of the strong interaction can be addressed by studying the structure of particles composed of quarks and gluons. The spatial and momentum distributions of these constituents in a hadron are not understood from first principles owing to theoretical and calculational limitations. Thus, our knowledge stems principally from measurements. An important motivation for precision measurements is that they may trigger new theoretical insight.
Knowledge of the proton structure is also of great importance, as protons are used to reach the high-energy frontier in particle colliders such as the Large Hadron Collider (LHC). Calculated cross sections for processes involving protons are based on sets of parton distribution functions (PDFs), i.e., quark and gluon distribution functions. These PDFs have been extracted by a number of collaborations by selecting data sets from various experiments. The PDF sets used for making predictions for LHC processes, and a prescription for how uncertainties related to the PDFs are to be evaluated, have been published elsewhere [1,2]. Identifying effects due to new physics beyond the Standard Model at high center-of-mass energies requires precise knowledge of the PDFs at high x, where x is the fractional longitudinal momentum of the struck parton inside the proton. This knowledge is essential for the isolation of some types of new physics.
Most of the data available for x ≥ 0.6 were obtained in fixed-target experiments in a range of Q 2 , the negative 4-momentum-transfer squared, where perturbative quantum chromodynamics (pQCD) may not be fully applicable. The HERA (Hadron−Elektron Ring Anlage) storage ring, where 920 GeV protons collided with 27.5 GeV electrons or positrons, offered an opportunity to probe the region of high Bjorken 1 x at high Q 2 where pQCD and the Dokshitzer-Gribov-Lipatov-Altarelli-Parisi (DGLAP) evolution dynamics [3][4][5][6][7] are believed to be reliable. The ZEUS collaboration published high-x data collected in the period 2004 − 2007, with integrated luminosities of 187 pb −1 for e − p and 142 pb −1 for e + p scattering [8]. The data cover the Q 2 range of 650 − 20000 GeV 2 and the x range of 0.03 − 1.0.
In the following, predictions based on the principal PDF sets used at the LHC are compared to these ZEUS high-x data. While there is some overlap with data used in other ZEUS publications, a substantial fraction of the high-x data from ZEUS has not been previously used in the extraction of PDFs. In this paper, the predictions for the expected numbers of events from the different PDF sets are compared to the numbers observed in the ZEUS data 1 From here on, no distinction is made between x and Bjorken x. and the results are discussed. A short prescription is given on the use of the high-x data in future PDF extractions.

Method to test high-x predictions
The previous ZEUS publication [8] provides the observed numbers of events in (∆x, ∆Q 2 ) intervals ('bins'). The evaluation of the probability to have observed these numbers of events according to different PDF sets is discussed in the present paper. The probability, P (D|PDF k ), for a prediction based on a given PDF set k to predict the data set D is where the index j labels the bins in (∆x, ∆Q 2 ), ν j,k is the expected number of events in bin j as predicted from PDF set k and n j is the observed number of events. The effect of systematic uncertainties is evaluated by varying the predictions ν j,k according to the sources of systematic uncertainty as described in Section 5 below.
The probability values in Eq. (1) are very small absolute quantities and only the ratio of probabilities for the different PDF sets are of interest; i.e., the Bayes Factors. Using these, an effective ∆χ 2 between two different PDF sets, here labeled k, l, is defined via These quantities are used to evaluate the relative goodness-of-fit of the different PDF sets. The tail-area probabilities (p-values [9]) for the different PDF sets are also provided, based on the expected probability distribution of the quantities P (D|PDF k ). These are used to evaluate the overall goodness-of-fit for the prediction based on the PDF set.
To calculate the probabilities as outlined above, the predictions from the PDF sets must be evaluated. The PDFs are used as input to calculate cross sections for the e ± p → e ± X neutral current cross sections at the Born level with fixed fine structure constant, α(Q 2 = 0). The predictions for the observed number of events in measured kinematic variables, (x rec , Q 2 rec ), are given by integrating over the full kinematic phase space: Here, L is the luminosity, d 2 σ(x,Q 2 |PDF k ) dxdQ 2 is the differential cross section at (x, Q 2 ) for PDF set k using kinematic quantities defined at the Born level and A(x rec , Q 2 rec |x, Q 2 ) transforms the Born level cross sections to observed cross sections including all relevant effects (radiative corrections, detector resolution and acceptance, selection criteria, etc.). This integral is approximated as where λ i,k is the expected number of events for the i th (∆x, ∆Q 2 ) bin at the Born level for PDF set k, and A ji gives the transformation to the expectation in the measured quantities in bin j. In the following, λ k represent the vectors {λ i,k } and ν k the {ν j,k }. For the prediction in Eq. (4) to be accurate and sufficiently precise, it is necessary to have sufficiently fine binning, and to account for all effects through reliable simulation packages. Thus, the bins defined at the Born level are not identical to those used in the measurements; the index i spans a larger range than the index j.
The matrix A (with components A ji ) was decomposed into a matrix representing the QED and QCD radiative effects, R, and a matrix, T , representing the ZEUS specific detector effects: The effects of QED and QCD radiative corrections, including the running of the coupling constants, were determined by running dedicated simulation programs which provide predictions at the 'generator level'. The matrix R transforms the Born-level quantities, λ i,k , to the generator-level quantities, µ l,k : where the same binning was used for µ l,k and λ i,k .
The matrix T provides the transformation of the generator-level quantities to the observed quantities. It accounts for all experimental and analysis-related effects, and is, in principle, independent of the PDF set used in the generation of the Monte Carlo simulated events.
Given the notation above, for a given PDF set k, the vector ν k is given as where 3 Evaluation of R and T

Monte Carlo samples
The event simulation was performed using the HERACLES [10] event generator together with the NLO CTEQ5D [11] PDF set, which is extracted using the zero-mass convention for heavy flavors. The HERACLES program applies O(α) QED and O(α s ) QCD corrections, where α s is strong coupling constant. The QCD simulation is based on a combination of ARIADNE4.12 [12] and MEPS6.5.1 [13] parton-shower simulations as used in the ZEUS high-x analysis [8]. The detector simulation was carried out using GEANT3.21 [14] tuned to match the ZEUS detector performance. Exactly the same analysis steps that were carried out in the extraction of the observed event numbers [8] were applied in the extraction of the Monte Carlo (MC) event numbers.
The MC events were produced in sets with different minimum Q 2 values in order to have sufficient statistical precision at the highest x and Q 2 values. The MC event sets were then combined to produce the final sample. This necessitated the introduction of MC event weights, ω MC , to ensure the correct overall cross section, d 2 σ(x,Q 2 ) dxdQ 2 , as a function of the kinematic variables. Event weights were also introduced in order to improve the MC description of various experimental quantities such as the event-vertex distribution and selection-cut efficiencies [8]. The product of these weights is denoted as ω sim . The total weight for an event m is then ω m = ω MC m ω sim m .

Determination of matrix elements
The ZEUS analysis [8] used 153 (∆x, ∆Q 2 ) bins, fixing the size of the vector ν k . The bins for the event-generator quantities were chosen on a finer scale, and also extend beyond the kinematic region covered in the measured variables in order to capture all 'migrations' from the generated to the measured quantities. Stable results for the predictions were obtained by choosing 429 bins for the λ k and µ k vectors, so that the matrix T has 153 × 429 entries. In particular, it was verified that the predictions for the expected event numbers ν k=CTEQ5D as found in the original analysis [8] were reproduced.
The matrix R is a diagonal 429 × 429 matrix whose elements were evaluated as where M i is the number of generated events in the i th (∆x, ∆Q 2 ) bin, with the kinematic quantities calculated using the four-vectors of the exchanged boson and incoming proton, and L MC and σ i,CTEQ5D are the MC luminosity and Born-level cross section for bin i. Nondiagonal elements of R were set to zero. It was verified that this simplified treatment of the radiative corrections is adequate for the purposes of this analysis and that R does not depend on the PDF set.
The elements of T were calculated as The indicator function I(m ∈ j) = 1 if event m is reconstructed in bin j, else I(m ∈ j) = 0. Figure 1 (a) shows the distribution of ν k=CTEQ5D in bins of the measured kinematic quantities. Figure 1 (b) shows the distribution of the same events in the µ k binning. Figures 2 and 3 show the ratios of the elements of µ k to the elements of µ k=HERAPDF2.0 as functions of x in different Q 2 intervals for selected PDF sets (CT14 [15], MMHT2014 [16], NNPDF3.1 [17], NNPDF2.3 [18], ABMP16 [19] and ABM11 [20]). The ratio is shown at the average x value from HERAPDF2.0 [21] in the bins 2 . The value of Q 2 is for the center of the respective bin. The expected number of events from the different PDF sets differ by more than a factor of 2 at large x values. There is also a systematic difference of several percent in the expectations at the smaller values of x amongst the different sets, which also predict different shapes as a function of x. The differences between the PDF predictions significantly exceed the one-standard-deviation uncertainties given by both HERAPDF2.0 and NNPDF3.1. The overall probabilities for the observed data sets given the expectations from the different PDF sets were calculated using Eq. (1). The natural logarithms of these probabilities are −538.3 and −581.5 for the e + p and e − p data sets, respectively, for HERAPDF2.0. The corresponding p-values taking these probabilities as test statistics are 0.3 and 0.03. These probabilities account only for the statistical uncertainties. A summary of the results for different PDF sets is given in Table 1. The table gives the ratio of the probabilities found for different PDF sets (P1) to those found using HERAPDF2.0 (P2), see Eq. (2). In addition, the effective ∆χ 2 as defined in Eq. (2) is provided.

Comparison of predictions to data
The predictions from all PDF sets yield p-values in an acceptable range for the e + p data, indicating good overall agreement with the observed data. There are nevertheless significant differences in the probabilities. The highest probability for the e + p data is found with the CT14 PDF set, with a Bayes Factor of 5.9 · 10 5 relative to HERAPDF2.0, corresponding to ∆χ 2 = −27. For the e − p data set, large differences are seen in the p-values, with only the HERAPDF2.0 and the ABM11 and ABMP16 sets yielding sizeable p-values. The Bayes Factors vary by more than 6 orders of magnitude, resulting in ∆χ 2 values up to 31. The bin-by-bin comparison of the nominal predictions for the different PDF sets to the the observations are given in the Appendix A.
The probability evaluation was carried out in two x ranges: the 'higher x' range is defined as the three highest-x bins in each Q 2 range. The remaining x range is labeled 'lower x'. A substantial part of the data in the higher-x range has not been used in PDF extractions, while the data in the lower-x range have been included (albeit using different reconstruction techniques and coarser binning). The results for the different x ranges for both e + p and e − p data are given in Table 2. There are significant differences in both x ranges. The predictions from the ABM11 and ABMP16 sets yield lower probabilities than HERAPDF2.0 in the lower-x range of the e − p data, but higher probabilities in the higher-x range, while most of the difference in ∆χ 2 between CT14 and HERAPDF2.0 occurs in the higher-x range. For the e + p data, similar values of ∆χ 2 are observed in both x ranges. These results indicate that the high-x data indeed have some discriminating power.
An effective statistical power of the higher-x data was estimated as follows. The normalization of the HERAPDF2.0 prediction was allowed to vary and an optimum was found using only the higher-x data. The change in the normalization resulting in a decrease of the logarithm of the likelihood of 0.5 was then determined. This normalization change was found to be small: 1.4 % for the e + p data and 1.2 % for the e − p data, indicating that these data should bring important new information.
All PDF sets provide not only nominal values for the parton distribution functions but also uncertainties that translate into uncertainties on the predicted cross sections. The uncertainties for HERAPDF2.0 are provided as so-called variants; for all of these, predictions for the observed number of events were calculated. The resulting spread in the predictions is typically much smaller than the difference seen between predictions from different PDF sets as discussed above. The probabilities to observe the data for the different variants of HERAPDF2.0 were evaluated, with Bayes Factors relative to the nominal predictions ranging from 0.4 → 3, the largest of this range corresponding to ∆χ 2 ≈ 2. Given that the HERAPDF extraction uses a ∆χ 2 = 1 convention for evaluating uncertainties, a number of these variants would likely be excluded if the ZEUS high-x data were included in the PDF extraction.

Systematic uncertainties
In the discussion above, the nominal predictions from different PDF sets were used to predict event numbers, and a Poisson probability was calculated based on the observed number of events. This procedure accounts only for statistical uncertainties. In this section, the impact of systematic uncertainties on the probabilities is discussed. Two classes of systematic uncertainties can be distinguished − those that affect the predictions at the physics simulation level (the µ and R values) and those that affect the matrix T accounting for detector and analysis effects. These are discussed separately for HERAPDF2.0.
For many of the PDF sets discussed in this paper, the combined H1 and ZEUS data [21] were included in their fitting procedure. A large fraction of the ZEUS neutral current data [22,23] entering the combination was taken during the same running period as the high-x data set considered here, and some of the systematic uncertainties are either identical to those discussed below or correlated to them. These uncertainties were already accounted for in the PDF extraction; i.e., the systematic uncertainties described below are not independent of the systematic uncertainties assigned to the PDF sets. A new PDF extraction would be necessary to account properly for the correlated systematic uncertainties. This point is discussed further below.

Uncertainties on µ and R
Possible systematic effects on the predictions are due to imperfect values of R and a small net polarization of the electron and positron beams. The latter effect was found to be negligible [8]. To test whether the diagonal matrix for R determined using the CTEQ5D PDF set was sufficiently accurate, the ratio of the generator-level predictions to the Bornlevel predictions was compared for different PDF sets for all bins used in the analysis. The ratios were found to be within the MC statistical uncertainties and typically well within 1%.
No systematic trends were observed. Variations in the Born-level cross sections are therefore not expected to produce significant uncertainties in the determination of R. Limitations could still be present due to missing effects in the MC simulations used in this analysis. Should an improved simulation of radiative effects become available, the R matrix should be reevaluated and implemented.
The primary source of systematic uncertainty affecting µ k is the luminosity uncertainty, which acts as a scale factor on the elements of µ k . The effect of the luminosity uncertainty was evaluated by increasing and decreasing µ k by ±1.8%, the luminosity uncertainty of the data set [8], separately for the e + p and e − p predictions, and recalculating the probabilities.
The results are shown in Table B.1.
For e − p data, an upward shift of the luminosity improved the agreement between predictions from the CT14, MMHT2014, NNPDF2.3 and NNPDF3.1 sets and the data, while worsening the agreement for HERAPDF2.0 and the ABM11 and ABMP16 sets. The negative shift gave lower (for the CT14, MMHT2014 and NNPDF2.3 and NNPDF3.1 much lower) probabilities. For HERAPDF2.0, the probability was not significantly changed.
For the e + p data, the positive normalization shift made small changes for the CT14, MMHT2014, NNPDF2.3 and NNPDF3.1 sets and resulted in significantly worse probabilities for the HERAPDF2.0 and the ABM11 and ABMP16 sets. For the negative shift, the probability was significantly improved for HERAPDF2.0 and the ABM11 and ABMP16 sets while it generally became worse for the other PDF sets. Given that the normalization uncertainty is related to the data, it cannot be applied separately for the different PDFs sets. This implies that the significant difference in the predictions of the PDFs is not resolved by a change in the normalizations and no clear preference is seen for a given PDF set based strictly on the normalization of the data.

Uncertainties on T
One source of uncertainty on T is the loss of information due to the finite bin sizes in the kinematic variables. A related source of uncertainty derives from using the same matrix for all PDF sets. To test the first effect, the bin sizes at the generator and Born level were systematically decreased until no further significant changes were observed in the predictions. Both effects were then evaluated by replacing the matrix procedure with an event-by-event reweighting procedure. The values of ν k were evaluated in this method as: The sum runs over all generated MC events. The indicator function is used to select events reconstructed in bin j; the ratio of differential cross sections is evaluated using the Bornlevel differential cross sections calculated from the kinematic quantities at the generator level. The numerator is the differential cross section for a desired PDF set k, while the denominator is the differential cross section from the CTEQ5D PDF set. It was verified that the results are effectively identical for the different PDF sets considered in this paper. The differences in ν k were typically at the level of < 0.1% with a maximum difference of 1% seen in the highest x and Q 2 bins.
The limited size of the MC sample resulted in an uncertainty on T . The statistical uncertainties on individual matrix elements are typically ≪ 1%, ranging to 1% in the highest Q 2 and x bins. Propagating these uncertainties to the entries in ν k results in negligible statistical uncertainties in the predicted numbers of events.
The variation in T from the dominant sources of uncertainty given in the ZEUS analysis [8] was further investigated. For each systematic effect considered, T was reevaluated and used to produce a new set of predictions for the number of reconstructed events in the crosssection bins. The probabilities can change by up to a factor 10, corresponding to effective χ 2 changes of almost 5 units. Although these changes are considerably smaller than those resulting from a change in the normalization, they should clearly be included in any new PDF extraction.

Prescription for PDF extractions including the high-x data
Combining the ZEUS high-x data with other HERA data in the extraction of PDFs will require some care. The details will depend on the procedure, the emphasis of the study, and whether the high-x data are used in addition to the combined HERA data [21] or as a substitute for a part of the ZEUS data in a PDF extraction using individual ZEUS and H1 data sets. The latter ansatz could be used to answer the question whether and how the extended x range and the finer binning of the high-x data [8] impact the high-x part of the PDFs. For the HERA II data with √ s = 318 GeV, the full set of H1 e + and e − neutral current data [24] could be used together with the standard ZEUS neutral current data [22,23] for Q 2 < 650 GeV 2 and the ZEUS high-x data [8] for Q 2 ≥ 650 GeV 2 . The standard and high-x ZEUS data have a completely correlated normalization uncertainty while most of the remaining systematic uncertainties can be considered uncorrelated, since the reconstruction methods were significantly different. If wanted, the individual HERA II data sets could be augmented with the combined HERA I data [25] and HERA II combined data sets for lower √ s. However, these data are not expected to have a large impact on the high-x part of the PDFs.
The use of the high-x data in addition to the combined HERA data [21] may however be preferred, since this permits a coherent treatment of systematic uncertainties over as much of the phase space as possible. Care has to be taken to avoid double counting by ensuring that bins individually added do not overlap with the bins from the remaining combined data. Table C.1 provides one possible selection. Combined e + and e − data [21] for x = 0.4 and x = 0.6 were removed for Q 2 values at which ZEUS data [22,23] contributed. The finer binned ZEUS high-x data including the integrated bins were added as also listed in Table C.1. The individual e + and e − H1 data points [24] for the x and Q 2 values of the removed combined data points should also be added back individually.
For the procedure using combined and individual data, the normalizations of the individual data sets have to be adjusted. During the combination of H1 and ZEUS data sets, the normalizations of all individual data sets were shifted [21]. The normalization shifts of the ZEUS e + and e − data sets as applied in the combination have to be applied to the individual ZEUS NC high-x e + and e − data sets. This implies a reduction in the luminosity by the factors given in Table C.2 for these data sets. For the individual NC e + and e − H1 data points [24] to be reintroduced in the analysis, the cross sections have to be increased by the factors given in Table C.2. The normalization uncertainties quoted for the individual data sets should be treated as correlated systematic uncertainties on the individual ZEUS and H1 data points. The remaining systematic uncertainties should be treated as uncorrelated due to the different reconstruction techniques.
Different selections from the choices as listed in Table C.1 are conceivable. Parameterizations with sufficient flexibility in the high-x region could benefit from a more extended use of the high-x data. In any case, the procedures using Poisson statistics as outlined in this paper should be employed to make the optimal use of the high-x data.

Conclusions
The ZEUS high-x data are unique and, to-date, have not been used in the extraction of proton parton distribution functions. They should be included in future PDF extractions using the matrix approach described in this paper. This data set will give access to a previously unused kinematic region, and will help constrain the large uncertainties on the partonic structure at the highest values of x.
The comparison of predictions from modern PDF sets in the kinematic range covered by the ZEUS high-x data show large differences that are well beyond the uncertainties associated with the predictions, indicating that the uncertainties on the PDFs are underestimated. At the highest values of x, they could be significantly larger than currently thought, making the usage of high-x data even more important. The quantitative effect of the ZEUS high-x data on reducing the uncertainties is difficult to ascertain without carrying out the PDF extraction procedure in full detail. A proposal for including these data in future PDF extractions has been outlined.  Table 2: The Bayes Factor (P1/P2) and ∆χ 2 (calculated relative to HERAPDF2.0) from comparisons of predictions using different NNLO PDF sets to the observed numbers of events are shown for two different x ranges. The 'higher-x' region is defined as the highest three x bins in each Q 2 range. The remaining x range is labeled 'lower-x'.  The comparison of event counts in data and in MC for the e − p sample, using different PDFs. The first two columns of the table contain the Q 2 and x values for the center of the bin and the third column contains the number of events reconstructed in the bin in data (n). The further columns contain expectations (ν) and the probability P (n|ν) for the PDFs discussed in the paper.             The comparison of event counts in data and in MC for the e + p sample, using different PDFs. The first two columns of the table contain the Q 2 and x values for the center of the bin and the third column contains the number of events reconstructed in the bin in data (n). The further columns contain expectations (ν) and the probability P (n|ν) for the PDFs discussed in the paper.         The results from comparisons of predictions using different PDF sets increased by 1.8 % (top) and decreased by 1.8 % (bottom) with respect to the observed numbers of events. The Bayes Factor (P1/P2) and ∆χ 2 (calculated relative to the respective values for the given PDFs as given in Table 2) are shown for the two different x ranges defined in the text for the e − p and e + p data sets.