Updating and Optimizing Error PDFs in the Hessian Approach. Part II

In an earlier publication, we introduced the software package ePump (error PDF Updating Method Package), which can be used to update or optimize a set of parton distribution functions (PDFs), including the best-fit PDF set and the Hessian eigenvector pairs of PDF sets (i.e., the error PDFs), and to update any other set of observables, in the Hessian approach. Here, we validate the ePump program with a detailed comparison against a full global analysis, and we demonstrate the potential of ePump by presenting selected phenomenological applications relevant to the Large Hadron Collider (LHC). For example, we use the package to estimate the impact of recent LHC measurements of W, Z boson and top quark pair differential distributions on the CT14HERA2 PDFs.


I. INTRODUCTION
An understanding of uncertainties due to parton distribution functions (PDFs) is crucial to precision studies of the standard model, as well as to searches for new physics beyond the standard model at hadron colliders, such as the CERN Large Hadron Collider (LHC).
As extensively discussed in Ref. [1], a technique for estimating the impact of new data on the PDFs, without performing a full global analysis, is extremely useful. (See also Refs. [2][3][4].) For this purpose, we have developed a software package, ePump (error PDF Updating Method Package), which can be used to obtain both the updated best-fit PDF and updated eigenvector PDFs from an earlier global analysis. The package can also directly update the predictions for experimental observables and their PDF uncertainties without requiring the use of the updated PDFs to re-calculate the theory predictions. Finally, an alternative use of the package is to optimize a given set of Hessian PDFs for a particular set of observables, so that a reduced number of error PDFs can be used, while maintaining the PDF uncertainty on the observables to any desired precision.
In Ref. [1] some examples were given comparing the results of ePump with a full global analysis, together with several phenomenological analyses using ePump. In addition, an exercise using ePump was performed in Ref. [5] to show how to assess the potential of precision measurements of triple-differential distributions of high-mass (up to sub-TeV) Drell-Yan pairs to reduce the PDF-induced errors in predicting the cross section of an extra Z boson, with mass greater than a few TeV, produced at the LHC. In this work we provide further checks and more details of the validation of ePump against the full global analysis machinery, and we provide more examples of using ePump to update current PDFs with new LHC data.
In a global analysis of experimental data, the PDFs are defined as functions of a number of fitting parameters, and in turn, the global χ2 and the theoretical prediction for any observable are also functions of these parameters. The crucial approximations used by ePump are these: 1) the global χ2 is a quadratic function of the parameters around its global minimum; 2) all other relevant quantities (including theoretical predictions of the new observables used in the update, as well as the PDFs themselves) are linear functions of the parameters.
It is these simplifying assumptions that allow ePump to obtain updated best-fit PDFs and error PDFs, and to update the predictions and uncertainties for any other observable, in just a few seconds of CPU time. Note that these approximations are exactly the same as those used to calculate the PDF uncertainty of any observable in the Hessian method. However, the impact of these approximations must still be considered when interpreting the results from ePump. In addition, subtleties in the calculation of PDF uncertainties, such as the use of dynamical tolerances and Tier-2 penalties, could induce further discrepancies between the predictions of ePump and those of a full global analysis. Thus, it is useful to validate ePump against a full global analysis in as many distinct applications as possible.
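Under these two approximations the update can be carried out analytically. The following sketch (ours, not the actual ePump code; all function and variable names are hypothetical) illustrates the idea in the rescaled eigenvector basis, where the original Hessian is the unit matrix and the error-PDF pairs provide the linear response of the new-data predictions:

```python
import numpy as np

def hessian_update(X0, Xplus, Xminus, D, cov):
    """One quadratic-plus-linear updating step (illustrative sketch, not ePump).

    X0            : (N,) predictions of the new data at the old best fit
    Xplus, Xminus : (N, n) predictions from the n plus/minus error PDF pairs
    D             : (N,) new data central values
    cov           : (N, N) experimental covariance matrix

    Works in the rescaled eigenvector basis, where the old chi^2 is
    chi2_old = z.z and the old Hessian is the unit matrix.
    """
    J = 0.5 * (Xplus - Xminus)                # linear response dX/dz_i
    Cinv = np.linalg.inv(cov)
    M = np.eye(J.shape[1]) + J.T @ Cinv @ J   # updated Hessian
    # Minimize z.z + (X0 + J z - D)^T Cinv (X0 + J z - D) analytically:
    z = np.linalg.solve(M, -J.T @ Cinv @ (X0 - D))
    return z, M
```

The updated best-fit prediction for the new data is then `X0 + J @ z`, and updated error directions would follow from rediagonalizing `M`; the real program additionally handles tolerances and Tier-2 penalties, which this sketch omits.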
The paper is structured as follows. In Sec. II, we perform the aforementioned validation of ePump. To do this we start with a base best-fit and error PDF set, obtained from a global analysis using the CT14HERA2 parametrization; the data included are the CT14HERA2 data sets minus some subset of the data. We then use ePump to update the PDFs by adding back the excluded data sets, and we compare with the standard CT14HERA2 PDFs. If the Hessian approximations were exact, the ePump predictions would reproduce the CT14HERA2 PDFs exactly. Thus, we can test how well the approximations work for different classes of data sets. In Ref. [1] results were shown for this exercise where the jet data were added back by ePump. In this paper we present more details of this check with jet data, and also present additional checks with Deeply-Inelastic Scattering (DIS) and Drell-Yan data. In each of these cases, we will see that the updated PDFs obtained from ePump are very close to the global-fit results, i.e., the CT14HERA2 PDFs in this case. Furthermore, in Sec. II C, we show, as an example, how to use ePump to directly update the theoretical prediction of the Higgs boson production cross section σ(gg → h) from gluon fusion in proton-proton collisions, including its uncertainty induced by the updated error PDFs.
The speed of ePump makes it very useful for analyses that investigate the influence of multiple data sets on the PDFs, which would otherwise require many time-consuming global fits. In Sec. III, we demonstrate how to use ePump to quickly identify the experimental data sets that most stringently constrain the CT14HERA2 PDFs. We find that among the 33 data sets included in the CT14HERA2 fit, fewer than half are necessary to effectively constrain the CT14HERA2 PDF errors. Detailed information on the impact of the individual data sets in constraining the CT14HERA2 PDFs, such as which parton flavors are affected and at which x values, will also be discussed.
Of course, one of the main uses for a tool such as ePump is to quickly assess the impact of new data sets prior to updating with a full global analysis. In Sec. IV we provide two detailed examples of this by using ePump to update the CT14HERA2 PDFs with some recent LHC data. First, we examine the impact of the LHC top quark pair (tt) production data provided by the ATLAS and CMS Collaborations. Second, we examine the impact of the ATLAS 7 TeV data on W and Z production [6]. We find that while the tt data can provide potential constraints on the g-PDF, its impact is quite minimal once the inclusive high transverse momentum (pT) jet production data from the Tevatron and the LHC are included in the same fit. On the other hand, we find a large impact on the quark PDFs, particularly in the small-x region, when updating with the ATLAS 7 TeV W and Z data [6]. This large deviation of the updated PDFs from the original CT14HERA2 PDFs suggests that the ePump result should only be trusted qualitatively in this case; for quantitative results with this data set, a full global fit is required. This conclusion is further supported by examining the magnitude of the two measures d̄0 and d0, introduced in Ref. [1], which give the distance between the original and updated PDFs in the parameter space, relative to the updated and original errors, respectively. For the ATLAS 7 TeV W and Z data, the value d̄0 = 1.49 indicates that the original best-fit PDF lies far outside the error band of the updated PDFs, so that the new best fit obtained by ePump is more likely to be affected by nonlinearities in the dependence of the observables and the PDFs on the fitting parameters. This, in turn, could produce results that differ from a true global fit.
Finally, concluding remarks are given in Sec. V.

II. VALIDATION OF EPUMP USING DATA SETS IN CT14HERA2
The data sets used for PDF global fitting in CT14HERA2 [7] consist of the HERA Run I+II combined data [8], 15 other sets of DIS data, 14 sets of Drell-Yan data, and 4 sets of jet production data, as listed in Tables I and II of Ref. [9]. Here, we will take the CT14HERA2 PDFs, including the best-fit and error sets, as the full global-fit result to compare against the results of ePump. We shall see how well ePump reproduces the best fit and uncertainty of CT14HERA2 for different classes of experimental data. The test goes as follows. First, we perform a full global analysis with the CT14HERA2 parametrization, using all of the CT14HERA2 data except for a particular subset. For instance, when we perform the global analysis with the jet data excluded, we obtain a new set of best-fit and error PDFs, called CT14HERA2mJ. We then use ePump to update CT14HERA2mJ by treating the excluded jet data as "new" data, with the updated PDFs called CT14mJeAll. A comparison between the CT14mJeAll and the CT14HERA2 best-fit and error PDFs can then be used to show how well ePump reproduces the full global analysis for this subset of data. Note that since ePump depends on quadratic and linear approximations, we should not expect perfect agreement. In addition, the ePump prediction makes assumptions about how the Tier-2 penalties from the new data affect the updated error PDFs, and therefore the uncertainties of the updated PDFs. However, as we shall see, the updated best-fit PDFs and their uncertainties from ePump are quite close to those from the full global analysis.
We shall perform this analysis three times, by removing different subsets of data from CT14HERA2: 1) excluding all of the DIS data except the HERA I+II combined data (CT14HERA2mD), 2) excluding all Drell-Yan data (CT14HERA2mY), and 3) excluding all jet data (CT14HERA2mJ). We then add the excluded data back with ePump and compare the updated PDFs with the CT14HERA2 PDFs. To be precise, for the CT14HERA2 parametrization there are 27 parameters, corresponding to 54 error PDFs. In addition, two gluon extreme sets (i.e., eigen-PDF sets 55 and 56) were introduced via the Lagrange Multiplier method in the CT14HERA2 fit to enlarge the uncertainty of the g-PDF in the small-x region. For our CT14HERA2mD, CT14HERA2mY, and CT14HERA2mJ fits, we did not produce these extra gluon extreme sets. Thus, everywhere in this section we shall also exclude the two gluon extreme sets from the CT14HERA2 PDF errors, in order to have a fairer comparison. For convenience, we summarize the notation used in this paper here:
• CT14HERA2mD, CT14HERA2mY, and CT14HERA2mJ are the base sets as described above, to be used by ePump.
• The letter "e" followed by a data set name indicates that the PDFs are obtained from ePump by adding the given data set as "new" data to the base set. For example, in Sec. III A the PDFs CT14mJeCDF are obtained from ePump by adding the CDF inclusive jet data to the base set CT14HERA2mJ.
• The letters "eAll" indicate that PDFs are obtained from ePump by adding back all of data that was excluded in the base set as "new" data. Thus, these sets are the ePump approximation to be compared with the full CT14HERA2 PDF set.
• The suffixes ".54" or ".52" (as for example, in CT14HERA2.54) are used to indicate that the error bands are obtained with 54 or 52 eigen-PDFs, respectively, rather than with the full 56 eigen-PDFs.
Finally, we note that we always show symmetric error bands in this paper. As described in Ref. [1], the symmetric Hessian error bands are invariant under a change of the eigen-PDF basis (unlike the asymmetric errors) and are therefore more reliable when assessing the impact of new data on the PDF errors with ePump.

A. CT14HERA2 excluding non-HERA DIS data: CT14HERA2mD

There are 3287 data points in total in the CT14HERA2 fit [7,9]. Among these, the DIS experiments contribute 2381 data points, of which 1120 data points are from the precision HERA Run I+II combined neutral current and charged current data. If we were to remove all of the DIS data from the CT14HERA2 fit, only 906 data points would be left for the reduced (non-DIS) global fit, with the 2381 DIS data points to be added back by ePump as "new" data. In this instance, we should not expect ePump to reproduce the full CT14HERA2 fit results well, since there are more data points (2381) in the "new" data than in the "old" data (906). As a consequence, in many regions of PDF parameter space the "new" DIS data constrain the PDFs much more than the "old" non-DIS data. Moreover, a global fit without any DIS data does not converge well unless some of the fitting parameters are fixed. But since parameters that should be constrained by the DIS data would then already be fixed, the update obtained by adding the DIS data using ePump would not fully reflect the impact of the new data, and the comparison with CT14HERA2 becomes less meaningful.
Therefore, in the present analysis we choose to keep the HERA Run I+II combined data in the original base fit, allowing us to use the full 27 free parameters. These data provide important information on the decomposition of parton flavors inside the proton, and therefore provide sufficient constraints on the PDFs for a reasonable base set to update with ePump. Our base fit, CT14HERA2mD, is then obtained from a global fit to a total of 2026 data points, which include all non-DIS data and the HERA I+II combined data, and exclude all other DIS data. The remaining DIS data contains 1261 data points, which will be taken as "new" data to update the CT14HERA2mD PDFs by ePump. The updated PDFs, named CT14mDeAll, can then be compared to the CT14HERA2 PDFs.
In Fig. 1, we compare the ePump-updated PDFs (CT14mDeAll) to the base PDFs (CT14HERA2mD) and the true global-fit PDFs (CT14HERA2). It can be seen that the update with ePump yields very similar results as the true global fit. Given the quadratic and linear approximations in ePump, and the number of "new" data points (1261) compared to the number of "old" data points (2026), the results are extremely satisfactory. Moreover, these well-approximated updated PDFs were calculated in just a few seconds of CPU time.
So, prior to a full global fit, one can quickly obtain a first look at the impact of the new data using ePump. As expected, the DIS data provide important information on the u and d PDFs, whose error bands shrink by almost one half with the inclusion of the non-HERA DIS data. It is also evident that the u-quark PDF is constrained more than the d-quark PDF. This is easily understood from the fact that the electric charge of the u quark is twice that of the d quark, so that it contributes more to the cross section in low-energy DIS neutral current processes.

B. CT14HERA2 excluding Drell-Yan data: CT14HERA2mY

Here we perform a similar study for the Drell-Yan data in CT14HERA2. The global fit to all the DIS and jet data, excluding the Drell-Yan data, is named CT14HERA2mY; it contains 2786 data points. It is worth mentioning that the global fit for CT14HERA2mY with all 27 parameters is not very well converged, for the same reason explained in Sec. II A.
Fortunately, we only need to fix one parameter to obtain a good fit. We are thus left with enough (26) free parameters to test ePump, and the results are still meaningful, as we will see in the following. The ePump-updated PDFs, obtained by adding back all the Drell-Yan data (501 data points) to the CT14HERA2mY fit, are shown in Fig. 2, together with CT14HERA2 (after removing the two extreme g-PDF sets). Again, we see that ePump yields a result very similar to the true global fit CT14HERA2, with only a small difference for x less than about 0.4, which is negligible compared to the size of the error band. In the large-x region, the PDFs are small and there is little experimental data to constrain them, so they are determined by analytic extrapolation and depend strongly on the nonperturbative parametrization forms assumed at the PDF initial scale (which is 1.3 GeV in the CT14HERA2 fit). Therefore, we are not concerned by the differences in the best-fit PDFs at x greater than about 0.4, which are nevertheless still well within the error bands. The updated s-PDF is shown in Fig. 3, where one finds a dramatic difference in the error bands: the error band of the true global fit, CT14HERA2, is significantly larger than that of the ePump update. Thus, we find that when there is strong tension between the new and old data sets, it will not be revealed by an enlargement of the ePump-updated PDF error bands, in contrast to a true global fit. We shall discuss other methods to explore possible tension between different data sets with ePump later.
C. CT14HERA2 excluding all jet data: CT14HERA2mJ

CT14HERA2 contains four sets of inclusive jet production data: CDF [10] and DØ [11] at the Tevatron Run-2, and ATLAS 7 TeV [12] and CMS 7 TeV [13] at the LHC. We denote the global fit to all the CT14HERA2 data, minus the four jet data sets, as CT14HERA2mJ, which contains 2882 data points. (CT14HERA2mJ retains all 27 free parameters; no parameter needs to be fixed.) The ePump-updated PDFs, obtained by adding back the four jet data sets as "new" data, which contain 405 data points, are designated as CT14mJeAll. As shown in Fig. 4, the jet data mainly constrain the g-PDF, with little effect on the u and d PDFs, cf. Fig. 5. The agreement between the ePump-updated PDFs and the CT14HERA2 global fit is quite satisfactory. The g-PDF is modified, and its error band is increasingly reduced as x grows from 0.01 to 0.3. At large x values, the difference in the best-fit PDFs is not significant due to the large size of the error band. Again, the somewhat larger CT14HERA2 error band of the d-PDF for x in the range from 0.1 to 0.4, as compared to CT14mJeAll, indicates some tension caused by adding the jet data to the rest of the CT14HERA2 data in the global fit, which ePump is unable to see.
The Higgs boson cross section is strongly dependent on the g-PDF, so an interesting question to ask is: what is the impact of the jet data included in the CT14HERA2 fit on the prediction of the Higgs boson production cross section σ(gg → h) at the LHC? As explained in Section II.D of Ref. [1], ePump can update not only the PDFs but also physical observables, within a few seconds of CPU time. Table I compares the ePump-updated prediction for σ(gg → h) and its PDF uncertainty with those of the full CT14HERA2 fit; the agreement is close, though not exact. This difference may be due to the linear and quadratic approximations in ePump, or it could be due to the effect of the new data on the Tier-2 penalty, which is only treated on average in ePump. Nevertheless, one can use ePump to quickly estimate the impact of "new" data on the updated PDFs and physical observables. For instance, in this case we can conclude from the ePump update that including the jet data in the fit leads to a more precise prediction of σ(gg → h), with its uncertainty reduced by about 20%.
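To illustrate how an observable can be updated directly, suppose a quadratic-plus-linear update of the kind described in Sec. I has produced a best-fit shift z and an updated Hessian M, both expressed in the original eigenvector basis. The following schematic sketch (ours, with hypothetical names; not the ePump implementation) then updates the central prediction and the symmetric Hessian error of any observable from its gradient alone, without regenerating PDFs:

```python
import numpy as np

def observable_update(g, M, z):
    """Update an observable's central value and symmetric Hessian error (sketch).

    g : (n,) observable gradient, g_i = (X(S_i^+) - X(S_i^-)) / 2,
        computed once from the original error PDF sets
    M : (n, n) updated Hessian in the original eigenvector basis
    z : (n,) shift of the best fit produced by the new data
    """
    shift = g @ z                                  # change in the central prediction
    err_old = np.sqrt(g @ g)                       # symmetric master formula, old fit
    err_new = np.sqrt(g @ np.linalg.solve(M, g))   # same formula with updated Hessian
    return shift, err_old, err_new
```

In this picture, the quoted "uncertainty reduced by about 20%" corresponds to `err_new / err_old ≈ 0.8` for the σ(gg → h) gradient.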

III. IMPACT OF INDIVIDUAL CT14HERA2 DATA SETS ON PDFS
ePump can be used to quickly assess the impact of individual data sets on constraining the PDFs in a global analysis. In this section, we demonstrate this by using ePump to assess the data sets used in the CT14HERA2 global analysis.

A. The impact of jet data in the CT14HERA2 fit

As already noted in Sec. II C, the jet data mainly constrain the g-PDF and have little effect on the other flavors. From Fig. 4, we see that the jet data prefer a larger g-PDF at x = 10−2 ∼ 10−1 and a smaller g-PDF at x = 0.2 ∼ 0.4. The error band is reduced by a fairly large amount, by about 1/4 to 1/3, in the range x = 10−2 ∼ 0.2.
In order to see the impact of the individual jet data sets in the CT14HERA2 fit, we use ePump to add each jet data set individually to CT14HERA2mJ. The results are shown in Fig. 6, with CT14mJeAll shown in the same graph for comparison. It can be seen that the four jet data sets produce the same qualitative effects on the g-PDF, but quantitatively the CMS 7 TeV jet data [13] yields the result most similar to CT14mJeAll. It increases the g-PDF slightly at small x, with maximum pull upward around x ∼ 0.1, but pulls it downward sharply above x ∼ 0.2. While all of the jet data sets reduce the error band, the CMS 7 TeV jet data reduces it the most and is the closest to the all-jet result; the others reduce the errors by a distinctly smaller amount. From this we conclude that the CMS jet data has the dominant impact on the g-PDF among all the jet data included in CT14HERA2. It is worth noting that in the range x = 0.1 ∼ 0.2, the CDF Run-2 inclusive jet data set leads to a harder g-PDF than the others, while the DØ Run-2, ATLAS 7 TeV and CMS 7 TeV jet data yield similar results for the g-PDF.

B. The impact of Drell-Yan data in the CT14HERA2 fit

In contrast, the Drell-Yan data pull the u and d PDFs in opposite directions, which indicates that the fitting program is directly modifying the differences between them. Namely, it is (u − d) and (ū − d̄), or rather the ratios d/u and d̄/ū, that are directly probed by the Drell-Yan data. A reasonable conjecture is that this is due to the W± charge asymmetry data measured at the Tevatron and the LHC. To check this, we use ePump to add each Drell-Yan data set individually to CT14HERA2mY, and compare with CT14mYeAll, which we have already shown to be very close to CT14HERA2. We find that although most data sets give a similar trend, only the CMS 7 TeV µ asymmetry data [14], the CMS 7 TeV electron asymmetry data [15], the ATLAS 7 TeV W Z data [6] and the DØ Run-2 µ asymmetry data [16] have an appreciable impact. This result is as expected, since most of them are lepton charge asymmetry data.
We can use ePump to add all four of these charge asymmetry data sets at once to CT14HERA2mY; the resulting set, CT14mYeAsy, is obtained by adding the CMS 7 TeV µ asymmetry data, the CMS 7 TeV electron asymmetry data, the ATLAS 7 TeV W Z data and the DØ Run-2 µ asymmetry data to CT14HERA2mY using ePump, with the PDF ratios taken over the best fit of CT14HERA2mY. (Recall that when the updated best fit moves far from the original one, the approximations used by ePump become unreliable [1].)
Another important Drell-Yan data set is the E866 data [17], which measure the ratio of Drell-Yan production in proton-deuteron and proton-hydrogen collisions, σ(pd)/2σ(pp).
These data impose important constraints on the PDF ratios d̄/ū and d_v/u_v at larger values of x. Using ePump to update the CT14HERA2mY PDFs by taking the E866 data as "new" data, we obtain the CT14mYeE866 PDFs, which are compared to the CT14HERA2 PDFs.

C. The impact of DIS data in the CT14HERA2 fit

The DIS data sets provide important information on the u and d PDFs, even when the HERA Run I+II combined data have already been included, as shown in Fig. 1. In contrast to the Drell-Yan case, the DIS data pull the u and d PDFs in the same direction. This feature also holds for the ū and d̄ PDFs, as shown in Fig. 13. The implication is that the precision DIS data are more sensitive to the sum (or, rather, the weighted sum) than to the difference or ratio of the d and u PDFs (or the d̄ and ū PDFs). This is because most of the precision DIS data measure the F2 structure function. Only a few experiments provided precision measurements of the F3 structure function, which probes the difference between u and d, or ū and d̄. In Fig. 14, we show the impact of the DIS data (excluding HERA I+II) on d/u and d̄/ū in the CT14HERA2 fit, which is seen to be relatively small.
The impact of DIS data on the g-PDF is shown in Fig. 15. It is interesting to note that the DIS data prefer a harder gluon in the x > 0.2 region. This is opposite to the effect of the jet data, which prefer a softer gluon at x > 0.2, cf. Fig. 4. The tension between these two kinds of data on the g-PDF can be seen by noting that the error band of the true global fit CT14HERA2 is wider than that obtained by the ePump updating, cf. Fig. 15.
DIS data are also expected to constrain the strange quark (s) PDF. For most of the DIS data, however, the contribution of the s-PDF is much smaller than that of the u and d PDFs. The exception is the dimuon (νµµ SIDIS) data, for which the strange quark gives the dominant contribution. Thus, it is interesting to see how much the dimuon data alone can constrain the s-PDF among all the DIS data. This can be quickly investigated with ePump. In Fig. 16, we compare the s-PDF updated by including only the dimuon data to the one with the full DIS data included. Although the dimuon data sets are not the only ones responsible for determining the s-PDF, they essentially fully constrain the s-PDF over a wide x region, from 10−4 to 10−1, for both the central PDF and its error band. Furthermore, we will see that the effects come mostly from the NuTeV dimuon data [18], and only a little from the CCFR dimuon data [19].
As before, we can use the ePump updating code to investigate the effect of each individual DIS data set on the CT14HERA2 PDFs. We find that the bulk of the DIS data contributions comes from the CCFR F_2^p [24], xF_3^p [25], CDHSW F_2^p and F_3^p [23], NuTeV ν̄µµ SIDIS, and NuTeV νµµ SIDIS [18] data sets. As discussed above, the NuTeV dimuon data are almost solely responsible for the constraints on the s-PDF, with the stronger constraint coming from the NuTeV ν̄µµ SIDIS data. The CCFR and CDHSW F_2^p data produce the biggest changes to u and d, and together with the CCFR and CDHSW F_3^p data, they give the biggest change to ū and d̄. However, the error band on d/u is not reduced until we also add in the NMC F_2^d/F_2^p data [22]. Although the other DIS data do not have as large an impact as the above-mentioned seven data sets, they are still responsible for some "fine structure" of the PDFs. For example, the BCDMS F_2^p [20] and F_2^d [21] data measured the F2 structure functions of protons and deuterons, and cover the large-x region up to x ≈ 0.8. Hence, these two data sets constrain the u_v, d_v and g-PDFs in the large-x region. Due to limited space, we shall not show all the corresponding plots in this paper, but will instead post them on the website of the ePump project [37]. For completeness, we summarize our findings in Tables II, III and IV, where we list the most prominent effects of each data set in CT14HERA2. We note that such a study with ePump must start from a base set of global-fit PDFs, and the effects of the data listed in the tables refer to their "net" impact when added, one at a time, to the particular base PDF set. For example, for the DIS data in Table II, the base PDF set is CT14HERA2mD, which includes the Drell-Yan and jet data and the HERA Run I+II data [8]. Thus it may happen that the effects of some DIS data are similar to those of the HERA Run I+II data, with the result that an individual data set, such as the H1 σ_r^b data [26], appears to have little or no impact on the updated central PDFs.

FIG. 14: Same as Fig. 1, but for the d/u and d̄/ū PDF ratios.
Finally, before leaving this section, we would like to investigate the impact of the CDHSW F2 and F3 data in the CT14HERA2 fit. It has long been argued that these data sets were not analyzed properly and therefore should not be used in a global fit [38].

TABLE III: Same as Table II, but for the experimental data sets on Drell-Yan processes. The base PDF set for this study is CT14HERA2mY; the effects refer to the "net" impact when each individual data set is added, one at a time, to CT14HERA2mY.

IV. USING EPUMP TO STUDY THE IMPACT OF NEW DATA
In the previous sections, we have validated ePump against the CT14HERA2 global fit by updating the PDFs with subsets of the CT14HERA2 data sets. We have also used ePump to investigate the impact of individual data sets on the CT14HERA2 PDFs. In this section, we use ePump to study the potential impact of new LHC data on improving the CT14HERA2 PDFs. An example was already given in Ref. [1], where we analyzed the impact of the CMS inclusive jet production data at √S = 8 TeV [39]. Here, we consider two more examples of new LHC data: the LHC 8 TeV tt differential cross section data, and the more precise ATLAS 7 TeV W and Z production data.

ID   Experimental data set                           Most prominent effects
504  CDF Run-2 inclusive jet production [10]         g-PDF at x: 0.02 ∼ 0.5
514  DØ Run-2 inclusive jet production [11]          g-PDF at x: 0.02 ∼ 0.5
535  ATLAS 7 TeV 35 pb−1 incl. jet production [12]   g-PDF at x: 0.02 ∼ 0.5
538  CMS 7 TeV 5 fb−1 incl. jet production [13]      g-PDF at x: 0.02 ∼ 0.5

A. LHC tt differential cross section data

We shall consider eight tt data sets presented by the CMS [40] and ATLAS [41] collaborations, as listed in Table V. They are the absolute and normalized one-dimensional differential cross sections in the transverse momentum (p_T^t) and rapidity (y_t) of the top quark, and in the invariant mass (m_tt) and rapidity (y_tt) of the tt pair. The dominant production of tt pairs at the LHC is through the gluon-gluon fusion process. Thus, tt data can potentially constrain the g-PDF, especially at large values of x, due to the large tt invariant mass. We also display in the third column of Table V the values of d0 obtained when each data set is used, one at a time, to update CT14HERA2. The g-PDF, updated with the data set that has the largest effect, is shown in Fig. 18. One can see that the updated best-fit g-PDF decreases slightly at x > 0.2, with a slightly reduced error band at x ∼ 0.3, as compared to the CT14HERA2 PDFs. Hence, this data set prefers a softer g-PDF in the large-x region. The corresponding values of d0 obtained when updating CT14HERA2mJ instead are also listed in Table V. Comparing with the values in the third column, obtained from updating CT14HERA2, we find that the new tt data sets have a much larger effect in the absence of the jet data. In this case, the CMS 8 TeV normalized dσ/σdy_tt and ATLAS 8 TeV absolute dσ/d|y_tt| data have the largest impact; the corresponding updated g-PDFs are shown in Fig. 19. It can be seen that the y_tt distributions measured by CMS and ATLAS have comparable effects, and they modify the g-PDF similarly to the jet data. However, the tt data have less power to reduce the uncertainties of the g-PDF, especially in the x range 0.1 ∼ 0.2.
This is consistent with our finding that, in the presence of the jet data, the new tt data sets have little effect on the PDFs: the tt data produce the same change in the central g-PDF, but provide less constraining power on the error band. The reason can be traced to the simple fact that there are far fewer tt data points than jet data points, due to the smaller production cross section; thus, the statistical power of the tt data is smaller.
We can test this interpretation using ePump by increasing the weight of the tt data in the ePump updating. A weight larger than 1 is equivalent to having more tt data points with the same experimental uncertainties or, alternatively, to reducing the experimental uncertainties by a factor of the square root of the weight. Of course, increasing the weight is not exactly the same as increasing the luminosity, since it does not change the central values of the data, which presumably would fluctuate within the original experimental uncertainties. Nevertheless, one can get some estimate of the potential impact of the tt data as the integrated luminosity is increased. To compare with the effect of the jet data, we multiply the contribution of the new tt data set to χ2 by a weight equal to the ratio of the number of jet data points to the number of individual tt data points. We saw in Sec. III A that the CMS 7 TeV jet data [13], with 133 data points, has the dominant effect among all the jet data in CT14HERA2, so we multiply by the weights 133/10 = 13 for the CMS and 133/5 = 26 for the ATLAS y_tt distributions, respectively. The g-PDFs, obtained by updating the CT14HERA2mJ fit with the weighted y_tt distributions using ePump, are shown in Fig. 20. The general shapes of the updated g-PDFs are similar to that obtained by including all four jet data sets in the CT14HERA2 fit. However, the error band of the g-PDF is not reduced as much as in the CT14HERA2 fit for x > 0.01. Hence, we conclude that the jet data will probably impose a stronger constraint on the g-PDF than the tt data, even with more integrated luminosity collected at a higher center-of-mass energy of the LHC.
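The equivalence between weighting a data set's χ2 contribution and shrinking its uncertainties can be made explicit. In this illustrative sketch (the function name is ours), multiplying the χ2 by a weight w gives exactly the same value as dividing the covariance matrix, i.e., every squared uncertainty, by w:

```python
import numpy as np

def chi2(D, X, cov, weight=1.0):
    """Weighted chi^2 contribution of one data set (illustrative sketch).

    D, X : (N,) data central values and theory predictions
    cov  : (N, N) experimental covariance matrix
    """
    r = X - D
    return weight * (r @ np.linalg.solve(cov, r))
```

For example, `chi2(D, X, cov, 13.0)` equals `chi2(D, X, cov / 13.0)`, i.e., weighting by 13 is the same as scaling every uncertainty down by √13.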
Before leaving this section, we comment on the impact of the ATLAS 8 TeV dσ/dp_T^t and dσ/dm_tt data. Given the small values of d0 in Table V, we expect little change in the best-fit PDFs when only these data are used to update either the CT14HERA2 or the CT14HERA2mJ PDFs. This can happen when the theory prediction is in good agreement with the data even before updating. Fig. 21 shows that this is indeed the case for CT14HERA2, and similar results were found for CT14HERA2mJ. This is also demonstrated by the small χ2 per data point for these two data sets in Table VI. Therefore, the best-fit g-PDF, updated with the p_T^t or m_tt distribution, does not need to move far from its original position to fit the data well. Another feature observed in Table VI is that the impact of the tt data on the g-PDF is consistent with that of the jet data included in the CT14HERA2 fit, in that the χ2/N values updated from CT14HERA2mJ are larger than those from CT14HERA2.

B. ATLAS 7 TeV W Z data
After observing the constraints of new LHC jet data [1] and tt data on the g-PDFs, we would also like to see how new LHC Drell-Yan data could modify the quark PDFs. The low-luminosity (35 pb^-1) ATLAS 7 TeV W± and Z cross section data [6] were included in the CT14HERA2 fit. Since then, ATLAS has published more precise ATLAS 7 TeV W Z data with an integrated luminosity of 4.6 fb^-1 [42]. Here, we study the impact of these more precise data on further constraining the CT14HERA2 PDFs. Strictly speaking, we should first remove the old ATLAS 7 TeV W Z data from the CT14HERA2 global fit and then add the new ATLAS 7 TeV W Z data with ePump, so as not to double count the ATLAS 7 TeV W Z data contributions. However, since the two data sets are consistent and the new one has about 100 times the integrated luminosity of the old one, the impact of the double counting should be negligible. Therefore, we shall simply add the new ATLAS 7 TeV W Z data to update the CT14HERA2 PDFs using ePump. Fig. 22 shows the updated PDFs. One can see that the new ATLAS 7 TeV W and Z data have a sizable impact on the quark PDFs and their uncertainties, particularly for x ranging from 10^-4 to a few times 10^-2. On the one hand, this is understandable, because these data are very precise, with uncertainties below the percent level. On the other hand, such a large difference between the updated PDFs and the original CT14HERA2 PDFs calls for further investigation.
We note that the new ATLAS 7 TeV W and Z data, denoted as "ATL7ZW" data from now on, with a total of 34 data points, cannot be fit well. Its χ² per data point after the ePump update is found to be around 2.7, which is much larger than that found for the full CT14HERA2 global fit (about 1.25) with a total of 3287 data points. Let us consider the two measures, d_0 and d̄_0, introduced in Ref. [1] to assess the quality of the fit given by ePump.
Recall that these two measures quantify the length of the shift in the parameter space of the best-fit PDFs; when they are large, the shift of the best-fit PDFs found in a true global fit may well be larger than that given by the ePump program.
With that said, we find from the ePump update that adding the ATLAS 7 TeV W Z data to the CT14HERA2 fit would decrease the u and d quark PDFs and increase the s quark PDF. In addition to these single-value criteria, one can also compare the data and theory predictions point by point to reveal more details about the quality of the fit, as shown in Fig. 23.
First, we find that there is an overall shift of all the raw data points. This means that the correlated systematic errors, weighted by their corresponding nuisance parameters, play an important role in the fitting. Fig. 24 shows the distributions of the nuisance parameters, before and after updating with ePump. The solid curve in the figure shows a standard normal distribution, with mean 0 and standard deviation 1. It shows that some nuisance parameters are large before the updating. Given the large difference between data and theory for CT14HERA2 in Fig. 23, we conclude that the ATLAS 7 TeV W and Z data are not described well by the CT14HERA2 PDFs, so we expect this data set to have a large impact when updating the CT14HERA2 PDFs. Second, one can see that the ATLAS 7 TeV W and Z data are more precise than the theory predictions, even with the PDF-induced uncertainty included, and that even after the ePump update these precise data still cannot be described well by the theory.
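The role of the nuisance parameters can be sketched as follows. For a χ² of the form χ² = Σ_i [(D_i − T_i − Σ_a λ_a β_ia)/s_i]² + Σ_a λ_a², the optimal nuisance parameters λ_a solve a linear system, and subtracting the fitted systematic shifts from the raw data produces the coherent displacement visible in Fig. 24-style comparisons. The sketch below (hypothetical numbers, one fully correlated systematic source β_ia; this is a minimal illustration, not the ePump implementation) makes the mechanics explicit:

```python
import numpy as np

def fit_nuisance(D, T, s, beta):
    """Minimize chi^2 = sum_i ((D_i - T_i - sum_a lam_a*beta_ia)/s_i)^2 + sum_a lam_a^2
    over the nuisance parameters lam_a; the stationarity condition is linear."""
    A = np.eye(beta.shape[1]) + (beta / s[:, None] ** 2).T @ beta
    b = beta.T @ ((D - T) / s ** 2)
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
T = np.ones(8)                            # hypothetical theory predictions
beta = np.full((8, 1), 0.03)              # one fully correlated 3% systematic
D = T + 0.03 + rng.normal(0, 0.01, 8)     # data shifted coherently upward
s = np.full(8, 0.01)                      # uncorrelated statistical errors

lam = fit_nuisance(D, T, s, beta)
shifted = D - beta @ lam                  # systematically shifted data
# lam[0] should come out near 1: the coherent offset is absorbed by
# pulling the systematic by about one sigma, shifting all points together.
```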
This, together with the large contributions from the nuisance parameters, leads to the large χ² for this data set. Given the above discussion, we might expect some tension between the new ATLAS 7 TeV W and Z data and the old data sets included in the CT14HERA2 fit. To examine this, we increase the weight of the ATLAS 7 TeV W Z data while updating the CT14HERA2 PDFs using the ePump program; note that a weight of zero corresponds to the CT14HERA2 fit. We can simultaneously obtain the updated predictions for all of the other CT14HERA2 data sets by including them in ePump as new data, but with zero weight. In this fashion we can see how the fit to the original data sets changes as the new data is added, in order to look for possible tensions. Increasing the weight of the ATLAS W Z data forces ePump to fit these data better; however, if some of the original CT14HERA2 data sets are in tension with the W Z data, they will be fitted worse as the weight of the W Z data increases. As discussed in Ref. [43], the goodness of fit to an individual data set can be quantified by the variable "spartyness" S_n, an equivalent Gaussian variable.
A well-fitted data set should have S_n between −1 and 1. An S_n smaller than −1 means the data set is fitted "too" well, while an S_n larger than 1 indicates a poor fit.
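For reference, the mapping from χ² to the equivalent Gaussian variable can be written, in a common approximate form used in the CT analyses (cf. Ref. [43]), as S_n = √(2χ²) − √(2N−1), where N is the number of data points. A quick numerical illustration, taking the ATL7ZW values quoted above as input:

```python
import math

def S_n(chi2: float, npts: int) -> float:
    """Approximate equivalent Gaussian variable for a chi^2 with npts points."""
    return math.sqrt(2.0 * chi2) - math.sqrt(2.0 * npts - 1.0)

# chi^2/N ~ 1 maps to S_n near 0 (a good fit) ...
print(round(S_n(34.0, 34), 2))        # prints 0.06
# ... while chi^2/N ~ 2.7 for 34 points lies far outside [-1, 1]:
print(round(S_n(2.7 * 34.0, 34), 2))  # prints 5.36
```

This is why a χ² per point of 2.7 for only 34 points already signals a strong incompatibility, even though the same χ²/N for a handful of points would be unremarkable.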
We find that most of the data sets in CT14HERA2 do not show appreciable tension with the ATLAS W Z data. However, some data sets do exhibit tension, as shown in Fig. 25, which displays the change of the spartyness S_n for these affected data sets as the weight of the ATLAS W Z data is increased from 0 to 10. Some of these data sets were not well fitted before (weight = 0) and become worse as the weight is increased, e.g., the CDF Run-2 Z rapidity data. Others were well fitted before, but become poorly fitted after the weight is increased, the most significant ones being the NuTeV ν̄μμ SIDIS and E866 σ_pd/(2σ_pp) data sets, and the CMS 7 TeV µ and electron asymmetry data.
As discussed in Sec. III C, the s-PDF is mainly constrained by the (anti-)neutrino DIS charged-current di-muon data, cf. Fig. 16, and the NuTeV ν̄μμ SIDIS data impose the strongest constraint on the s-PDF among those four data sets. Fig. 26 shows the ePump-updated s-PDF, together with the updated d̄/ū PDF ratio, with weights of 1 and 10 on the ATLAS W Z data. We find that the ATLAS W Z data prefer a larger value of the d̄/ū ratio for x around 10^-3 to 10^-1. This is to be compared with our conclusion in Sec. III that the E866 σ_pd/(2σ_pp) data set is crucial for constraining d̄/ū and d_v/u_v at x around 10^-2 to 0.2, cf. Fig. 12. Therefore, increasing the weight of the ATLAS 7 TeV W Z data works against the fit of the E866 data and leads to the tension. This can also be seen in the comparison between data and theory before and after the ATLAS 7 TeV W Z data set is included, see Fig. 29, where the theory predictions deviate from the data at large rapidity when the weight of the ATLAS 7 TeV W Z data is increased from 1 to 10.
For the CMS asymmetry data, the tension with the ATLAS W Z data can be demonstrated in the same way, by comparing theory with data point by point as the weight of the ATLAS 7 TeV W Z data is increased. In Fig. 30, the comparisons are shown for the CMS µ and electron asymmetry data. It is apparent that as the weight of the ATLAS 7 TeV W Z data is increased from 1 to 10, the theory predictions for both the CMS µ and electron asymmetries shift upward relative to the data for almost all of the data points. Given the precision of the CMS data, this leads to a large χ², which is reflected in the rapid increase of S_n in Fig. 25.

V. SUMMARY AND PROSPECTS
A fast and efficient tool for estimating the impact of new data on the PDFs is essential in this high-precision era of the LHC. In this paper we have tested such a tool, ePump, for both its effectiveness and its validity.
We have validated ePump in three trials, where we started with a base PDF set obtained from a global fit with some subset of the CT14HERA2 data removed. We used ePump to update these base PDFs with the missing data sets, and then compared with the CT14HERA2 PDFs. In all three trials (updating with DIS, Drell-Yan, or jet data sets) the ePump results are very close to the CT14HERA2 global-fit results. This is important, because the goal is to approximate the full global fit as closely as possible. Of course, there are some differences, but they are either small compared to the error bands, or occur in the very small or large x regions, where the PDFs depend strongly on the parametrization forms. Another case where the ePump approximation breaks down is when there are strong tensions between the new and old data. As we have seen, the global fit may increase the error bands in that case, but this will never happen with ePump. Again, we emphasize that ePump is not meant to replace the global fit. However, even in situations with tensions between new and old data, it still gives qualitatively correct results, and therefore provides a useful tool for judging the impact of new data. In addition to updating PDFs, ePump can also update observables at the same time, without recalculating the theory predictions. An example of this use of ePump was given for the predictions of σ(gg → h).
A big advantage of ePump is that it runs very fast. This was exploited to study the impact of the different data sets in the CT14HERA2 fit. Summaries of the impact of each of the data sets were given in Tables II, III, and IV. The impact of each data set is strongest for some particular flavors and in its relevant region of x. But it also depends on the precision of the data and its agreement with the current PDFs. Therefore, even two data sets that are sensitive to the same kinematic range and flavor content do not necessarily have the same effect on the PDFs. One remarkable finding is that among the 33 data sets in CT14HERA2, only 1 jet, 5 Drell-Yan, and 8 DIS data sets have the dominant effects.
Just by including these data sets, we can reproduce the bulk of the CT14HERA2 fit.
The other data sets are only responsible for some fine structures of the PDFs.
It is remarkable that we can fit thousands of data points with only 27 or 28 parameters.
This success demonstrates the effectiveness of the QCD-improved parton model. In such an era of precision, reducing the uncertainties of the PDFs has become an important and indispensable task. So the natural question is: "What kinds of observables can reduce the PDF uncertainties?" The purpose of ePump is to help answer this question. The "new" data to be investigated by ePump can be new experimental data, or it can be simulated pseudodata whose impact one is interested in. An example of the second scenario was presented in Ref. [5], where ePump was used to show that, with increased precision and an optimal choice of kinematic variables, high-invariant-mass Drell-Yan processes can greatly reduce the PDF uncertainties. In this paper, we examined the impact of the latest tt data and W and Z data at the LHC on the CT14HERA2 PDFs. We found that the tt data have the potential to reduce the g-PDF uncertainties given increased luminosity, and that the high-precision W and Z data can provide strong constraints on the quark PDFs. Of course, these results will be refined quantitatively by a full global fit, but ePump can quickly assess the qualitative impact. Similar studies can also be done for other processes.
We expect ePump to play an important role in the study of PDFs, and to assist in the understanding and reduction of theoretical errors in the current era of the high-luminosity LHC. The complete ePump package, together with detailed instructions for installing and file formatting, and additional output files relevant to this study, can be found at the website http://hep.pa.msu.edu/epump/.
Before closing this section, we would like to make two additional remarks. First, we note that the comparisons made in this paper between the ePump and global-fit analyses hold at any given fixed order of the theory calculations, either next-to-leading order or next-to-next-to-leading order (NNLO). Second, we note that the default xFitter profiling analysis [4] can be reproduced by ePump by setting the global tolerance to 1. However, as discussed in the Appendix, setting the tolerance to 1 will greatly overestimate the impact of a given new data set when updating the existing PDFs in the CT PDF global analysis framework.
The parameters z_i have been chosen here so that z_i = 0 (for all i) for the best-fit PDF set f_0, while z_i = ±δ_ij for the 2N eigenvector PDFs f_{±j}. In this way, each of the eigenvector PDF sets corresponds to Δχ²_old = T², where T is an overall global tolerance parameter. If the various data sets were all internally consistent and satisfied Gaussian statistics, then one should use T = 1 at the 68% C.L. or T = 1.645 at the 90% C.L. However, due to inconsistencies between the various data sets, as well as uncertainties arising from the initial choice of PDF parametrization forms, the CTEQ-TEA group has historically chosen a larger value of T = 10 at the 90% C.L.
Another variation in defining the Hessian PDF errors is the imposition of dynamical tolerances, which have been used in recent CT [7,9] and MMHT [44] PDF sets. The idea here is that if the constraints on a given PDF eigenvector direction come dominantly from a single data set (or several self-consistent sets), then the overall global tolerance produces too large a PDF error. By incorporating separate constraints from individual experiments, one determines that a Δχ²_old = (T_i^±)² < T² in the particular eigenvector direction should correspond to the given C.L. Keeping Eq. (A1) unchanged, this implies that the Hessian eigenvector PDFs now must correspond to z_i = ±(T_j^±/T) δ_ij. (It can easily be seen that in the presence of dynamical tolerances, the global parameter T scales out of all calculated observables.) We emphasize here that, although the different choices of tolerance parameters (including whether global or dynamical), as well as the choice of 68% or 90% C.L., give different results and/or interpretations for the PDF errors, they all give a self-consistent description of χ²_old around the global minimum. The next step in updating the PDFs in the Hessian approach is to add the contribution of the new data set (or sets) to χ², yielding

χ²_new(z) = χ²_old(z) + w Σ_{α,β=1..N_X} [X_α(z) − X^E_α] C⁻¹_{αβ} [X_β(z) − X^E_β],   (A2)

where N_X is the number of new data points, X^E_α are the experimental data values, X_α(z) are the theoretical predictions, and C⁻¹_{αβ} is the experimental inverse covariance matrix. We have also included a weight factor w that may be assigned to the new set of data, which by default is set to 1. To linear order in the PDF parameters (assuming that a global tolerance is used), we can express the theoretical predictions as

X_α(z) ≈ X_α(0) + (1/2) Σ_i (X^{+i}_α − X^{−i}_α) z_i,   (A3)

where X^{±i}_α is the theoretical prediction of X_α calculated with the error PDFs f_{±i}. We emphasize that Eq. (A3) is valid only when z_i = ±δ_ij corresponds to the error PDFs f_{±j}.
At this stage, χ²_new can be minimized to obtain the new values of the best-fit parameters, which can be used to obtain the updated best-fit PDFs. The generalization of Eq. (A3) to dynamical tolerances, as well as the extension of this equation to include diagonal quadratic terms, are given in Ref. [1] and are implemented in ePump. In addition, ePump produces an updated set of error PDFs, under the same tolerance and confidence-level assumptions as the original error PDFs.
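With the quadratic χ²_old and the linearized theory of Eq. (A3), the minimization reduces to solving a small linear system in the eigenvector coordinates z. The following sketch is a schematic re-implementation under those stated assumptions (global tolerance, linear approximation, hypothetical input numbers); it is not the ePump code itself:

```python
import numpy as np

def best_fit_shift(X0, Xp, Xm, XE, cov, T=10.0, w=1.0):
    """Minimize chi2_new(z) = T^2 |z|^2 + w (X(z)-XE)^T C^-1 (X(z)-XE)
    with the linearized theory X(z) = X0 + z @ G, where G_ia = (Xp_ia - Xm_ia)/2.

    X0: (Nx,) central predictions; Xp, Xm: (Neig, Nx) error-PDF predictions;
    XE: (Nx,) data; cov: (Nx, Nx) experimental covariance matrix.
    """
    G = 0.5 * (Xp - Xm)                      # dX_alpha/dz_i
    Cinv = np.linalg.inv(cov)
    A = T**2 * np.eye(G.shape[0]) + w * G @ Cinv @ G.T
    b = -w * G @ Cinv @ (X0 - XE)
    return np.linalg.solve(A, b)             # best-fit z

# Tiny hypothetical example: 2 eigenvector pairs, 3 data points.
X0 = np.array([1.0, 1.1, 0.9])
Xp = np.array([[1.05, 1.12, 0.93], [1.01, 1.15, 0.88]])
Xm = np.array([[0.95, 1.08, 0.87], [0.99, 1.05, 0.92]])
XE = np.array([1.02, 1.18, 0.95])
cov = np.diag([0.02, 0.03, 0.02]) ** 2

z = best_fit_shift(X0, Xp, Xm, XE, cov)
Xnew = X0 + z @ (0.5 * (Xp - Xm))            # updated central predictions
```

The same linear-algebra step is what allows updated observable predictions to be obtained directly from the X^{±i}_α tables, without re-running the theory calculation.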
Although these results were given previously in Ref. [1], our purpose in restating them here is to make it clear that the global tolerance T, or the use of dynamical tolerances, in the updating of the PDFs cannot be chosen freely, but rather is determined by the original Hessian error PDF sets. In particular, if the error PDFs were determined using a global tolerance of T = 10, then the use of Eq. (A3) is only consistent if the value T = 10 is used in Eq. (A2). It is straightforward to show that setting the tolerance to T = 1 in Eq. (A2) while using such error PDFs is equivalent to weighting the new data by a factor w = 100 in the update.
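This equivalence is easy to verify in a one-dimensional toy version of Eq. (A2), with one eigenvector direction and one new data point (all numbers below are hypothetical):

```python
import numpy as np

def z_min(T, w, x0=1.0, xE=1.3, g=0.2, s=0.1):
    """Analytic minimizer of chi2(z) = T^2 z^2 + w (x0 + g*z - xE)^2 / s^2,
    i.e. one eigenvector direction and a single new data point."""
    return -w * g * (x0 - xE) / s**2 / (T**2 + w * g**2 / s**2)

# Setting T = 1 while using error PDFs built with T = 10 gives the same
# best fit as keeping T = 10 and overweighting the new data by w = 100:
print(np.isclose(z_min(T=1.0, w=1.0), z_min(T=10.0, w=100.0)))  # prints True
```

The reason is simply that rescaling T → T/10 multiplies the old-data term in χ² by 1/100, which is the same as multiplying the new-data term by 100.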
We exemplify this by performing the following exercise. As demonstrated in Sec. III C, the dimuon data are almost entirely responsible for constraining the s-PDF. Therefore, we remove the dimuon data from the CT14HERA2 data sets and perform a new global fit, named CT14HERA2mDimu. Then we use ePump to add back the dimuon data. If everything works perfectly, the ePump-updated s-PDF should agree with CT14HERA2. Based on the present discussion, we should perform the update using dynamical tolerances, but we shall also try other choices of the tolerance to see their effect.
We first compare the update using dynamical tolerances with that using a global tolerance of T² = 100. The results are shown in Fig. 31. In this case we find that these two updates give very similar predictions and reproduce CT14HERA2 very well. On checking the dynamical tolerance values for CT14HERA2mDimu, we discovered that the eigenvector directions that are most sensitive to the dimuon data have (T_i^±)² ≈ 90, which explains why the update is not much different when using a global T² = 100. [Fig. 31 caption: In the upper two plots, dynamical tolerances were used (labeled CT14mDimu(dyn.tol.)); in the lower two plots, dynamical tolerances were turned off and T² = 100 was assigned (labeled CT14mDimu(T=10)). These two results are very similar and both reproduce CT14HERA2 with good agreement. Left panels: PDF ratios to the best fit of the base CT14HERA2mDimu. Right panels: error bands relative to their own best fits.]
Next we display the results using squared global tolerances of T² = 1 and T² = 1.645² ≈ 2.706. These correspond to the naive use of Gaussian statistics at the 68% and 90% C.L., respectively, but are inconsistent with the error PDFs used in this update. As explained above, using these small values of T² with the given error PDFs is equivalent to overweighting the data by a large factor. The s-PDFs after these updates are shown in Fig. 32.
Interestingly, the agreement of the best-fit updates with CT14HERA2 is not too bad (though not as good as when using dynamical tolerances). This can be understood from the fact that the strange quark PDF is mostly determined by the dimuon data, so overweighting these data does not shift the central value by much. However, one can see that the s-PDF error bands for these two values of the tolerance are much smaller than that of CT14HERA2, so that using T² = 1 or T² = 2.706 would greatly overestimate the effect of these data on reducing the PDF errors. [Fig. 32 caption: In the upper two plots, T² = 1 was used (labeled CT14mDimu(T=1)), while in the lower two plots T² = 1.645² ≈ 2.706 was assigned (labeled CT14mDimu(T=1.645)). Dynamical tolerances were turned off in both cases. Neither result reproduces CT14HERA2; the error bands come out too small. Left panels: PDF ratios to the best fit of the base CT14HERA2mDimu. Right panels: error bands relative to their own best fits.]
In conclusion, in order to best reproduce the CT14HERA2 global fit, one should use dynamical tolerances in ePump. Setting the tolerance to 1 will greatly overestimate the impact of a given new data set when updating the existing PDFs in the CT PDF global analysis framework. This conclusion also holds when using the MMHT2014 [44] and PDF4LHC15 [45] PDFs in a profiling analysis to study the impact of new (pseudo-)data on updating the existing PDFs.