On the determination of uncertainties in parton densities

We review various methods used to estimate uncertainties in quantum correlation functions, such as parton distribution functions (PDFs). Using a toy model of a PDF, we compare the uncertainty estimates yielded by the traditional Hessian and data resampling methods, as well as from explicitly Bayesian analyses using nested sampling or hybrid Markov chain Monte Carlo techniques. We investigate how uncertainty bands derived from neural network approaches depend on details of the network training, and how they compare to the uncertainties obtained from more traditional methods with a specific underlying parametrization. Our results show that utilizing a neural network on a simplified example of PDF data has the potential to inflate uncertainties, in part due to the cross validation procedure that is generally used to avoid overfitting data.


I. INTRODUCTION
The partonic structure of hadrons plays a fundamental role in elementary particle and nuclear physics. Understanding the spin structure of the nucleon and the hadronization of partons, interpreting experimental data according to the Standard Model (SM) of fundamental interactions, precisely measuring SM parameters and searching for signals of physics beyond the SM, are only some of the phenomenological problems that rely on knowledge of parton distribution functions (PDFs) [1][2][3][4][5] and fragmentation functions (FFs) [6]. Because these functions cannot yet be computed sufficiently accurately from quantum chromodynamics (QCD), both PDFs and FFs are determined by analysing experimental data from a wide range of hard-scattering processes within the framework of collinear factorization [7].
Here, physical observables are factorized into hard-scattering partonic cross sections, calculable perturbatively from QCD, and nonperturbative distribution functions, whose evolution with the momentum scale probed in the measured process can nonetheless be calculated perturbatively [8].
Considerable progress has been made in recent years through several parallel efforts to improve our knowledge of PDFs, both experimentally and theoretically (for simplicity in the following we will refer only to PDFs, but the discussion applies equally to FFs). The inclusion of an increasingly vast and precise dataset and the growth of the theoretical accuracy of the QCD analyses have been accompanied by an increasing sophistication of the methodological procedures used to represent the uncertainties on the PDFs [9]. These currently include the Hessian [10], Lagrange multiplier [11] and data resampling (DR) methods. The uncertainty estimation method most commonly adopted in the literature has been the Hessian, which has been used by many collaborations in their determination of unpolarized PDFs (see Refs. [2,12,13] for a review) and by the DSS group in their determination of the FFs of light hadrons [14,15]. The Lagrange multiplier method was used by the DSSV group to determine the polarized proton PDFs [16,17], while the DR method was used by the NNPDF collaboration [18] in their determination of the unpolarized and polarized proton PDFs (for recent work, see, e.g., Refs. [19,20]) and of the parton to light hadron FFs [21]. It was also employed by the JAM collaboration in their determinations of the polarized proton PDFs [22] and the pion and kaon FFs [23], and for the first simultaneous fit of these two [24], as well as more recently in the JAM global QCD analyses of unpolarized proton [25,26] and pion [27,28] PDFs. Although the NNPDF and JAM approaches share the same method of quantifying uncertainties, they differ in their parametrization of the PDFs. While the JAM collaboration uses a traditional polynomial functional form for the dependence of the PDFs on the quark momentum fraction x, the NNPDF collaboration implements a similar basic parametric form supplemented by a series of trained neural network weights.
We should note that the above methods all attempt to find approximations to the Bayesian posterior. An alternative approach is to calculate the posterior exactly using Bayes' theorem. This can then be used to obtain the expectation value and standard deviation of the set of PDF parameters, and of any derived quantity. A popular posterior mapping technique is nested sampling (NS) [29], a Monte Carlo method targeted at the efficient calculation of the Bayesian evidence, which is the average likelihood of a model over its probability space. This method produces posterior inferences as an advantageous byproduct, and contemporary implementations of NS algorithms, such as MultiNest [30] and PolyChord [31,32], are now extensively used for parameter estimation from posterior samples in cosmology (see, e.g. Ref. [33]). They have also very recently been used in global QCD analyses of the quark transversity distributions [34] and of the pion distributions [35] by the JAM collaboration.
A popular alternative to nested sampling is the Hamiltonian (or hybrid) Markov chain Monte Carlo (HMC) technique [36,37], based on the Metropolis-Hastings algorithm [38,39], which allows for an efficient sampling of posterior inferences, though without the calculation of the evidence which is typically not required in PDF fits. These methods are implemented in xFitter [40], the open-source package for performing global PDF fits, which has been shown to lead to competitive results [41].
It is evident that a reliable and faithful representation of PDF uncertainties is especially critical in applications such as searches for signals of physics beyond the SM, where subpercent precision is often sought. As the above survey suggests, it is difficult to compare uncertainties between groups using different methodologies in a meaningful and quantitative way [42][43][44]. This is especially challenging when there exist tensions between data sets, or when there are strong departures from the Gaussian approximation to probability distributions of parameters.
In this paper, we review and compare existing methods for estimating uncertainties in global QCD analyses of PDFs. Specifically, we use a computationally inexpensive "toy" model of a PDF fit to compare the uncertainty estimates that arise from the DR, Hessian, NS and HMC methods, and explore whether they produce the same results. In each case, we assume that the functional form of the PDF being fitted is the same as that used to generate the toy data, so as not to introduce any inflated uncertainty arising from inaccuracies in the representation of the fitted functional form.
Furthermore, we consider the case where the known functional form in the PDF fit is replaced by a neural network, and investigate how the uncertainty estimates that result from DR depend on the cross validation procedure used in the training of the neural network. We also compare the uncertainty estimates from the neural network approach to the other methods in a similar vein to a "closure test" [45], to establish how often the true values of our toy model fall within the uncertainty bands predicted by each method across many fits.
In Sec. II we begin by reviewing the Hessian, DR, NS and HMC approaches to estimating PDF uncertainties. Section III describes the toy model, which is designed to reproduce the salient features of a real PDF fit, but with a significantly reduced computation time. We then perform a systematic comparison of the results obtained by applying the Hessian, DR, NS and HMC techniques to the toy model analysis. The study of the neural network approach is performed in Sec. IV, where we in particular explore the effects of partitioning and cross validation in the training of neural networks on PDF uncertainties. Finally, in Sec. V we present our conclusions and discuss future applications of this work.

II. FORMALISM
In the framework of Bayesian inference [46][47][48], determining a set of functions [f] from a set of measurements amounts to determining a probability measure on the space of such functions. The problem, which would otherwise be ill-defined, is usually projected from the infinite-dimensional space of functions into a finite-dimensional space of parameters by choosing a suitable parametrization for the set of functions [f]. Such a procedure of course requires some assumptions. Some of these, such as a certain degree of smoothness or particular behaviors in certain regions of the functional space, may be physically motivated; nevertheless, they should not bias the result nor affect its statistical interpretation.
Calling a = {a_1, . . . , a_{n_par}} the set of n_par parameters and m = {m_1, . . . , m_{n_dat}} the set of n_dat measurements, the expectation value and variance of an observable O are defined in the Bayesian framework by

E[O] = ∫ d^{n_par}a O(a) p(a|m),    V[O] = ∫ d^{n_par}a [O(a) − E[O]]² p(a|m),    (1)

where p(a|m) is the conditional probability of the set of parameters a, given the set of measurements m. The problem of determining the probability measure in the space of functions is then reduced to the problem of determining the conditional probability p(a|m).
One can readily formulate the problem in terms of Bayes' theorem, which can be expressed as

p(a|m) = L(a; m) p(a) / Z,    (2)

where p(a) is the prior, p(a|m) is the posterior, p(m|a) ≡ L(a; m) is the likelihood, and Z = ∫ d^{n_par}a L(a; m) p(a) is the evidence. The prior is a subjective quantity, and results of a Bayesian analysis are frequently found to be highly dependent on the choice of this function. For problems where the data are sufficiently constraining (as expected in the case of PDF fits), this is less of an issue, and the posterior distribution is mostly determined by the likelihood function. For linear problems, flat priors can be useful: p(a) is associated with a uniform distribution, normalized to unit volume, p(a) = 1; Eq. (2) then implies that

p(a|m) = L(a; m) / Z.    (3)

The likelihood L(a; m) is seen as a mathematical function of the parameters a, given the data m. For n_dat pairs of independent measurements m = {x_i, y_i}, i = 1, . . . , n_dat, a common choice is the Gaussian form

L(a; m) = N exp[−χ²(a; m)/2],    (4)

where N is an appropriate normalization factor, and the χ² (proportional to the logarithm of the likelihood) is defined as

χ²(a; m) = Σ_{ij} [m_i − T_i(a)] (C_dat⁻¹)_{ij} [m_j − T_j(a)],    (5)

where T(a) represents the predictions of the model for the underlying law given some set of parameters a, and C_dat defines the covariance matrix in the space of the data.
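As a concrete illustration, the Gaussian likelihood and its χ² can be sketched in a few lines of Python; the function names and the use of a dense covariance matrix are our choices, not the paper's:

```python
import numpy as np

def chi2(theory, data, cov):
    """chi^2 = (m - T)^T C^{-1} (m - T) for data m, predictions T, covariance C."""
    r = data - theory
    return float(r @ np.linalg.solve(cov, r))

def log_likelihood(theory, data, cov):
    """Logarithm of the Gaussian likelihood, dropping the constant normalization."""
    return -0.5 * chi2(theory, data, cov)
```

Using `np.linalg.solve` rather than explicitly inverting the covariance matrix is numerically preferable when the covariance is ill-conditioned.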
Several popular methods that have been developed to determine the posterior probability density p(a|m) have been found to be useful in contemporary QCD analyses of PDFs. In the following, we will review the Markov chain Monte Carlo method including the Hamiltonian variant, as well as the nested sampling method. We will then review the Hessian and data resampling methods, which aim to approximate the Bayesian posterior.

A. Markov chain Monte Carlo
Monte Carlo algorithms are a set of computational tools which rely on repeated random sampling. Firmly rooted in Bayesian statistics [46][47][48], Monte Carlo algorithms consist of using the available data to generate samples of the posterior p(a|m) at a discrete set of points. The expectation value and variance of any observable O can then be expressed as

E[O] ≈ (1/n) Σ_{k=1}^{n} O(a_k),    V[O] ≈ (1/n) Σ_{k=1}^{n} [O(a_k) − E[O]]²,    (6)

where the parameters a_k are distributed according to the posterior p(a|m), and n is the number of parameter sets sampled according to a given algorithm. The prototypical Markov chain Monte Carlo (MCMC) algorithm is the Metropolis-Hastings algorithm [38,39]. This algorithm operates on the principle that candidate iterations of the MCMC are accepted or rejected according to a criterion chosen such that the chain asymptotically samples the target distribution. The Metropolis-Hastings algorithm proceeds as follows. At each Monte Carlo time t − 1, the next state a_t is chosen by sampling a candidate point a′ from a proposal distribution π(a′|a_{t−1}). The candidate point is then accepted with the probability

α(a_{t−1}, a′) = min[1, (p(a′|m) π(a_{t−1}|a′)) / (p(a_{t−1}|m) π(a′|a_{t−1}))],    (7)

and the Metropolis-Hastings transition kernel, for a_t ≠ a_{t−1}, is thus

K(a_{t−1}, a_t) = π(a_t|a_{t−1}) α(a_{t−1}, a_t).    (8)

If the new set of parameters a′ is accepted, the next state of the chain becomes a_t = a′. If it is rejected, the chain does not move and the point at the time t is identical to the point at the previous time t − 1: a_t = a_{t−1}.
A special case of the Metropolis-Hastings algorithm is the random walk Metropolis, for which the proposal distribution is chosen to be symmetric, π(a′|a_{t−1}) = π(|a_{t−1} − a′|). The acceptance probability then reduces to α(a_{t−1}, a′) = min[1, p(a′|m)/p(a_{t−1}|m)]. The efficiency of a Markov chain Monte Carlo algorithm can be optimized by choosing the proposal distribution to be as close as possible to the target distribution or, alternatively, in the case of a random walk, by carefully setting the size of the scale parameter [41,49]. Optimization becomes increasingly delicate if the number of parameters in a is large, as it is in a real fit of PDFs.
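A minimal random-walk Metropolis sampler along these lines might look as follows (a sketch with an illustrative Gaussian proposal; the step size and chain length are arbitrary choices):

```python
import numpy as np

def metropolis(log_post, a0, n_steps, step=0.5, rng=None):
    """Random-walk Metropolis: symmetric Gaussian proposal, acceptance
    probability alpha = min(1, p(a')/p(a_t)) evaluated in log space."""
    rng = rng or np.random.default_rng(0)
    chain = [np.asarray(a0, float)]
    lp = log_post(chain[0])
    for _ in range(n_steps):
        prop = chain[-1] + step * rng.standard_normal(chain[-1].shape)
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:   # accept
            chain.append(prop)
            lp = lp_prop
        else:                                     # reject: chain stays put
            chain.append(chain[-1])
    return np.array(chain)
```

Sampling a one-dimensional standard normal with this routine reproduces its mean and width to within a few percent; in a realistic PDF fit the proposal scale would need the careful tuning noted above.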

B. Hybrid Markov chain Monte Carlo
Hamiltonian (or hybrid) Markov chain Monte Carlo algorithms [36] were originally developed in lattice field theory. These algorithms produce candidate proposals for the Metropolis-Hastings algorithm [38,39] (see Ref. [50] for an extensive review) and can be used to sample the target probability density p(a|m). Optimization problems can be circumvented in HMC algorithms by introducing for each set of parameters a a set of conjugate momenta b, and associating a Hamiltonian H(a, b) = ½ bᵀ M⁻¹ b + U(a) to this joint state of position a and momentum b, where M is a mass matrix, generally taken to be diagonal, and U(a) an arbitrary potential energy, chosen as U(a) = −ln p(a|m) so that the marginal distribution of a reproduces the target density. This allows one to define a joint distribution

p(a, b) = (1/N) exp[−H(a, b)],    (9)

where N is the normalization constant. Candidate proposals are generated by numerically integrating Hamilton's equations of motion; in any discrete-step integration the Hamiltonian is not conserved, and the accept/reject corrections driven by this numerical noise are therefore essential to guarantee sampling of the correct distribution. Appropriate methods can efficiently approximate the solution to the underlying system of differential equations. Among these, the leap-frog method (see Sect. 2.3 in Ref. [50] for details) proved to be particularly convenient: one has to choose only the step size and the number of integration steps as free parameters, which are conventionally tuned to obtain an acceptance rate of about 60%.
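The leap-frog update and the accompanying accept/reject step can be sketched as follows (a unit mass matrix is assumed, and `eps` and `n_steps` play the role of the two free parameters mentioned above):

```python
import numpy as np

def leapfrog(grad_U, a, b, eps, n_steps):
    """Leap-frog integration of Hamilton's equations for H = b^T b / 2 + U(a)."""
    b = b - 0.5 * eps * grad_U(a)          # initial half-step in momentum
    for _ in range(n_steps - 1):
        a = a + eps * b
        b = b - eps * grad_U(a)
    a = a + eps * b
    b = b - 0.5 * eps * grad_U(a)          # final half-step in momentum
    return a, b

def hmc_step(U, grad_U, a, eps=0.1, n_steps=20, rng=None):
    """One HMC update: draw momenta, integrate, then Metropolis accept/reject
    on the change in the Hamiltonian caused by the discretization."""
    rng = rng or np.random.default_rng()
    b0 = rng.standard_normal(a.shape)
    a1, b1 = leapfrog(grad_U, a.copy(), b0.copy(), eps, n_steps)
    dH = (U(a1) + 0.5 * b1 @ b1) - (U(a) + 0.5 * b0 @ b0)
    return a1 if np.log(rng.random()) < -dH else a
```

For an exactly integrable Hamiltonian dH would vanish and every proposal would be accepted; in practice the accept/reject step corrects the leap-frog discretization error.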

C. Nested sampling
Nested sampling [29] allows for an efficient evaluation of the Bayesian evidence Z in Eq. (2), and also produces a discrete set of posterior inferences. The expectation value and variance of any observable O can be expressed as

E[O] ≈ Σ_j p_j O(a_j),    (10a)
V[O] ≈ Σ_j p_j [O(a_j) − E[O]]²,    (10b)

where a_j are the sampled points and p_j the corresponding posterior weights, determined below. The key ingredient in NS is to define a prior volume X, such that

X(λ) = ∫_{L(a)>λ} p(a) d^{n_par}a,    (11)

where the integral extends over the region(s) of parameter space contained within the iso-likelihood contour L(a) = λ. The multi-dimensional evidence integral in Eq. (2) can then be transformed into a simpler one-dimensional integral,

Z = ∫₀¹ L(X) dX,    (12)

where L(X), the inverse of Eq. (11), is a monotonically decreasing function of X. Therefore, if one can evaluate the likelihood L_i ≡ L(X_i) over a discrete sequence of decreasing values 0 < X_{n−n_act} < · · · < X_i < · · · < X_1 < X_0 = 1, the evidence can be computed numerically using, e.g., standard quadrature methods, as a weighted sum

Z ≈ Σ_i w_i L_i.    (13)

For instance, for the simple trapezium rule, the weights would be given by

w_i = (X_{i−1} − X_{i+1}) / 2.    (14)

The summation in Eq. (13) is performed with an appropriate iterative algorithm as follows. At the starting iteration (i = 0), a flat prior p(a) is assumed so that its volume is X_0 = 1, from which n_act active or live samples are drawn. These samples are then ordered according to their increasing likelihood value. The sample corresponding to the smallest likelihood L_0 is removed from the active set, hence becoming inactive. It is then replaced by a sample drawn from the prior, requiring that its likelihood value is L > L_0.
The corresponding volume delimited by the iso-likelihood contour L > L_0 is described, at the first iteration, by a random variable X_1 = t_1 X_0, where t_1 is distributed according to the probability for the largest of n_act samples drawn uniformly from the interval [0, 1], Pr(t) = n_act t^{n_act − 1}. The procedure is repeated for each subsequent iteration i until the entire prior volume has been traversed through nested likelihood shells: the sample associated with the lowest likelihood L_i in the active set is removed, then it is replaced with a new sample drawn from the prior with L > L_i, and finally the corresponding prior volume is reduced to

X_{i+1} = t_{i+1} X_i.    (15)

The algorithm stops when the evidence Z in Eq. (13) is determined to some specified precision.
To be precise, an upper limit is set from the remaining active points by assuming that the largest evidence contribution that can be made by the residual portion of the posterior is ∆Z_i = L_max X_i, where L_max is the maximum likelihood in the set of active points at the last iteration.
The algorithm stops when this quantity no longer changes the final evidence estimate, i.e., ∆Z_i < ε, with ε arbitrarily small. Finally, an increment from the set of n_act points is added to the evidence estimate, Eq. (13), where w_{n−n_act+j} = X_{n−n_act}/n_act for all j. Once the evidence Z is found, the posterior can be determined using the full sequence of active and inactive points generated in the nested sampling process as

p_j = L_j w_j / Z.

The posterior weights p_j can then be used to compute the expectation value and variance in Eq. (10b).
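A toy implementation of the iteration just described, for a one-dimensional problem with a uniform prior on [0, 1], might read as follows. This is a sketch, not a production sampler: the prior-volume shrinkage uses the expected value ln X_i = −i/n_act, new live points are drawn by plain rejection sampling (viable only in low dimensions; MultiNest and PolyChord use far more sophisticated strategies), and simple rectangle weights replace the trapezium rule.

```python
import numpy as np

def nested_sampling(loglike, n_act=300, tol=1e-3, rng=None):
    """Toy nested-sampling evidence estimate, uniform prior on [0, 1]."""
    rng = rng or np.random.default_rng(0)
    live = rng.random(n_act)                      # initial live points
    live_L = np.array([loglike(a) for a in live])
    Z, X_prev, i = 0.0, 1.0, 0
    while True:
        i += 1
        k = np.argmin(live_L)                     # lowest-likelihood live point
        L_min = live_L[k]
        X = np.exp(-i / n_act)                    # expected prior-volume shrinkage
        Z += np.exp(L_min) * (X_prev - X)         # rectangle-rule weight
        X_prev = X
        while True:                               # rejection-sample a replacement
            a = rng.random()
            if loglike(a) > L_min:
                break
        live[k], live_L[k] = a, loglike(a)
        if np.exp(live_L.max()) * X < tol * Z:    # residual contribution is small
            break
    return Z + X * np.exp(live_L).mean()          # final live-point increment
```

For a Gaussian likelihood of width 0.1 centered at 0.5, the analytic evidence is ≈ 0.2507; the sketch reproduces it to within the statistical precision (roughly a relative √(1/n_act) spread) typical of NS.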

D. The Hessian method
The Hessian method was first developed in Refs. [10,51], and essentially attempts to approximate the posterior distribution as a multi-variate Gaussian distribution in parameter space. To do this, one first finds the set of parameters a_0 corresponding to the maximum a posteriori (MAP) estimate, which can be obtained by minimizing the χ² function assuming flat priors. In a neighborhood of a_0, the χ²(a) function can be expanded up to the quadratic term in its parameters,

χ²(a) ≈ χ²(a_0) + Σ_{kl} H_{kl} ∆a_k ∆a_l,    (16)

and the posterior approximated by a multi-dimensional Gaussian function,

p(a|m) ≈ (1/N) exp(−½ Σ_{kl} H_{kl} ∆a_k ∆a_l).    (17)

In the last term, ∆a = a − a_0, the Hessian matrix elements are given by

H_{kl} = ½ ∂²χ²/∂a_k ∂a_l |_{a=a_0},

and exp[−½ χ²(a_0)] has been absorbed into the normalization of the posterior. The approximated posterior can be reparametrized in terms of the eigendirections of the Hessian matrix via

a(t) = a_0 + Σ_k t_k e_k / √w_k,    (18)

where e_k and w_k are the orthonormal eigenvectors and eigenvalues of the Hessian matrix, respectively. The parametrization (18) defines a linear change of variables for the posterior, p(a|m) → p(t|m), with t_k parametrizing straight paths in the parameter space that radiate from a_0 and run parallel to the corresponding eigenvector. In the new space, the approximated posterior explicitly factorizes along each eigendirection,

p(t|m) ≈ Π_k (2π)^{−1/2} exp(−t_k²/2),    (19)

and allows one to estimate the expectation value of an observable in a simple way,

E[O] ≈ O(a_0).    (20)

The observable's variance can, in turn, be calculated as

V[O] = ∫ d^{n_par}t [O(a(t)) − E[O]]² p(t|m)
     ≈ Σ_k (∂O/∂t_k)²|_{t=0} ∫ dt_k t_k² p(t_k|m)
     = Σ_k (∂O/∂t_k)²|_{t=0} T_k²,    (21)

where in the second line O is expanded to first order in t, and T_k in the third line is the tolerance factor, defined by

T_k² = ∫ dt_k t_k² p(t_k|m).    (22)

Whenever the posterior is approximately Gaussian, one expects T_k ≈ 1. Introducing a finite step ξ_k in each eigendirection, the variance (21) can then be calculated numerically as

V[O] ≈ Σ_k { [O(a(ξ_k ê_k)) − O(a(−ξ_k ê_k))] / (2 ξ_k) }² T_k²,    (23)

where ê_k is the unit vector along the k-th eigendirection. The step size ξ_k can be chosen arbitrarily, as long as it is sufficiently small to guarantee the linear approximation in t. However, it is convenient to set ξ_k = T_k in order to write the variance as

V[O] ≈ ¼ Σ_k [O(a(T_k ê_k)) − O(a(−T_k ê_k))]².    (24)

Note that the expressions (20) and (24) rely on (i) the quadratic approximation (16) of the χ², (ii) the factorization of the posterior along the eigendirections, and (iii) the linear approximation of O(a(t)) in t.
In practice, however, the Gaussian approximation for the posterior is not always valid in all eigendirections, and one of the common symptoms is that the value of T_k deviates from unity for eigendirections with very small eigenvalues. Such a situation can arise because of a lack of experimental constraints, even when the data set being fitted is statistically consistent. To address this issue, we propose to evaluate Eq. (22) using the unapproximated posterior as

T_k² = (1/Z_k) ∫ dt_k t_k² p(a(t_k ê_k)|m),    (25)

where the normalization Z_k = ∫ dt_k p(a(t_k ê_k)|m) can be efficiently estimated numerically since it only requires a one-dimensional integration over the variable t_k. As long as the posterior remains approximately factorized for t_k values of O(T_k), using the tolerances (25) instead of assuming them all equal to 1 provides a faithful representation of an observable's variance, even when the Gaussian approximation for the posterior breaks down.
In actual global QCD analyses of PDFs, one also needs to accommodate statistical inconsistencies among data sets, which cannot be accounted for by the χ²-based likelihood (4). Indeed, such a likelihood is multiplicative with respect to each data set, L = Π_set L_set, and does not properly account for a scenario in which two or more data sets are mutually incompatible. To alleviate this problem without adopting a different likelihood, a procedure known as the tolerance criterion has been advocated [14,15,52,53], where the tolerance factors in Eq. (22) are inflated by demanding that a given fraction of the fitted data falls within the variance of the corresponding observable. Values T_k ∼ 5 − 10 [2] are typically needed to account for 68% of the data in PDF global analyses.
The tolerances can also be adjusted by requiring that the χ² increases above the minimum by a specified ∆χ² value, namely by solving χ²(a_0 + T_k e_k/√w_k; m) = χ²(a_0; m) + ∆χ² [54]. With ∆χ² = 1, one should obtain tolerances comparable to Eq. (25), and a comparison to the tolerance criterion is possible by setting ∆χ² = T_k². In our analysis we will perform fits to a statistically consistent set of mock data, and therefore will not need to invoke a tolerance criterion.
Nonetheless, the tolerance will need to be explicitly calculated according to Eq. (25) to account for genuine deviations from Gaussian behavior and to obtain accurate variance estimates.
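For a statistically consistent fit with all T_k = 1, the master formula (24) reduces to a short numerical routine; the following sketch (with names of our choosing) diagonalizes the Hessian and propagates each eigendirection through the observable:

```python
import numpy as np

def hessian_uncertainty(hess, a0, observable):
    """Master-formula uncertainty with unit tolerances:
    V[O] = (1/4) sum_k [O(a0 + e_k/sqrt(w_k)) - O(a0 - e_k/sqrt(w_k))]^2."""
    w, e = np.linalg.eigh(hess)              # eigenvalues w_k, eigenvectors e[:, k]
    var = 0.0
    for k in range(len(w)):
        shift = e[:, k] / np.sqrt(w[k])      # displacement along one eigendirection
        var += 0.25 * (observable(a0 + shift) - observable(a0 - shift)) ** 2
    return np.sqrt(var)
```

For a χ² that is exactly quadratic, e.g. χ² = Σ_i (a_i − c_i)²/σ_i², the Hessian is diag(1/σ_i²) and the routine returns the familiar Gaussian error propagation for any linear observable.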

E. Data resampling
The data resampling method for PDFs was originally pioneered by the NNPDF collaboration some twenty years ago [18] and has been adopted more recently by the JAM collaboration [22][23][24]. Data resampling allows one to approximate the Bayesian posterior distribution through the use of frequentist statistics. In contrast to the methods discussed so far, frequentist methods do not make use of Bayes' theorem at all. For a frequentist, there exists a definite true set of parameters a_true, and the job of frequentist parameter inference is to obtain best estimates of these unknown true values. The frequentist probability of an observable taking on a certain value is simply equal to the number of times that value is observed out of a number of repeatable trials. As such, the only information the frequentist has access to is the likelihood L(a; m). It is common to choose the estimate of the true parameter values to be the set of parameters that maximizes the likelihood -- the so-called maximum likelihood estimate (MLE).
Various algorithms can be used to perform this maximization; however, the main point of interest here is how the uncertainty would be characterized from a frequentist perspective.
In the Bayesian approach one would have access to the posterior distribution, and could therefore generate credible intervals which describe the degree of belief about a parameter taking on a certain value. The frequentist approach must instead construct confidence intervals, in which the true observables are expected to fall as a proportion of a total number of trials.
One might wonder why one would bother to introduce alternatives to the Bayesian methods described previously. The answer is that Bayesian techniques tend to require many evaluations of the likelihood, which can become computationally prohibitive in a realistic global fit. The data resampling method instead generates an ensemble of pseudodata replicas by shifting the original data points within their quoted uncertainties, and performs a maximum likelihood fit to each replica; the spread of the resulting parameter sets then provides the uncertainty estimate. Both the JAM and NNPDF approaches can be cast in the form x^α (1 − x)^β P(x), where in JAM analyses P(x) is a low order polynomial, while in NNPDF analyses the function x^α (1 − x)^β acts as a "preprocessing factor" designed to speed up the minimization and is weighted by P(x) → a neural network.
Importantly, the distribution of the frequentist maximum likelihood estimators can be shown to coincide with the Bayesian posterior distribution given certain assumptions (namely, flat priors and a Gaussian likelihood), which implies that E Bayes ≈ E freq and V Bayes ≈ V freq . One such demonstration was performed in Ref. [55], where it was found that distributions in the replicas obtained using the NNPDF4.0 PDFs approximately reproduced the expected Bayesian posterior distribution. As such, it is not necessary to embrace the fully-fledged frequentist methodology in order to accept the data resampling approach. One can argue that the data resampling method is a trick used to calculate the posterior more efficiently, albeit with some caveats. Nevertheless, the data resampling approach can at most provide an approximation to the Bayesian posterior, and so should not be identified fundamentally as a Bayesian technique.
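Schematically, data resampling amounts to refitting noise-shifted replicas of the data and reading uncertainties from the spread of the fitted parameters. A minimal sketch, with illustrative settings and uncorrelated Gaussian noise, is:

```python
import numpy as np
from scipy.optimize import least_squares

def replica_fit(x, y, yerr, model, p0, n_rep=200, rng=None):
    """DR sketch: shift the data within its uncertainties, refit each replica,
    and estimate central values and uncertainties from the parameter ensemble."""
    rng = rng or np.random.default_rng(0)
    params = []
    for _ in range(n_rep):
        y_rep = y + yerr * rng.standard_normal(y.shape)   # one pseudodata replica
        res = least_squares(lambda p: (model(x, p) - y_rep) / yerr, p0)
        params.append(res.x)
    params = np.array(params)
    return params.mean(axis=0), params.std(axis=0)
```

For a linear model with Gaussian errors this ensemble spread coincides with the standard least-squares parameter covariance, which is the sense in which DR reproduces the Bayesian posterior under the assumptions discussed above.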

III. COMPARING METHODOLOGIES
In the following we perform a systematic comparison of the uncertainties that result from applying each of the methods discussed in Sec. II. Our investigation proceeds by fitting a set of pseudodata that has been generated according to a known underlying law and statistical noise. We discuss the generation of the pseudodata in Sec. III A, and the results of our fits in Sec. III B.

A. Construction of toy data
For simplicity, we construct a set of two toy "quark" PDFs with momentum fraction x dependence parametrized by a basic functional form,

f_i(x) = x^{α_i} (1 − x)^{β_i},  i = 1, 2.    (26)

These functions could, for example, represent the u and d quark PDFs in the proton, with free parameters α_i and β_i. In this case, we can assign true values to the free parameters (which we choose to be α_1 = 0.5, β_1 = 2.5, α_2 = 0.1, β_2 = 3.0), resulting in the functional form illustrated in Fig. 1. We can then model toy "cross sections" σ_j using a linear combination of the two PDFs,

σ_j(x) = Σ_{i=1,2} c_{ji} f_i(x),  j = 1, 2,    (27)

with fixed mixing coefficients c_{ji}.
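The toy setup above can be coded directly. The true parameter values below are those quoted in the text, while the mixing coefficients `c` are an illustrative assumption of ours, chosen only so that both PDFs contribute to both cross sections:

```python
import numpy as np

# True parameters of the two toy "quark" PDFs, as quoted in the text.
TRUE = dict(alpha1=0.5, beta1=2.5, alpha2=0.1, beta2=3.0)

def pdf(x, alpha, beta):
    """Basic functional form f(x) = x^alpha * (1 - x)^beta."""
    return x ** alpha * (1 - x) ** beta

def toy_cross_sections(x, c=((2.0, 1.0), (1.0, 2.0))):
    """Toy cross sections as linear combinations of the two PDFs;
    the mixing matrix c is a hypothetical choice for illustration."""
    f1 = pdf(x, TRUE["alpha1"], TRUE["beta1"])
    f2 = pdf(x, TRUE["alpha2"], TRUE["beta2"])
    return c[0][0] * f1 + c[0][1] * f2, c[1][0] * f1 + c[1][1] * f2
```

Pseudodata can then be generated by evaluating these cross sections on a grid in x and adding Gaussian noise of the desired size.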

B. Fit results
We perform a fit to a single dataset using each of the methods described in Sec. II, namely the DR, Hessian, NS and HMC methods. For all the methods considered, bounds on the parameters were provided as α_1, α_2 ∈ [−1, 1] and β_1, β_2 ∈ [0, 5], and all algorithms started at a random location in this parameter space. For the DR method, the SciPy least_squares optimizer was used to minimize the χ² for each data shuffling step, generating 10^5 replicas.
For the Hessian method, the single χ² minimization required to find the MAP estimator was also performed using the SciPy least_squares optimizer. The χ² itself is given by the sum of squared residuals across both the σ_1 and σ_2 cross sections,

χ² = Σ_{i=1}^{2} Σ_{j=1}^{n_data} [ (σ^{data}_{ij} − σ^{model}_{ij}) / δσ_{ij} ]²,    (29)

where i is the index of the two cross sections, j is the index of the data points summing to a total number of points n_data in the sample, σ^{data}_{ij} are the generated data values with uncertainties δσ_{ij}, and σ^{model}_{ij} are the corresponding model predictions. The results of this section will also serve as a baseline against which to make a comparison with a neural network approach. In particular, we will explore the effect of much greater flexibility in the parametric form, and the impact of partitioning and cross validation as a stopping criterion for the underlying iterative minimization algorithm.
As we discuss in the following, the resulting PDF uncertainty estimates will be at stark variance with those found in this section.
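A single χ² minimization of the kind used for the MAP estimate can be sketched with SciPy's least_squares; the data generation, error size, and starting point below are illustrative, and only one toy cross section is fitted:

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
x = np.linspace(0.05, 0.95, 50)
truth = (0.5, 2.5)                      # true (alpha, beta) from the toy model
err = 0.01                              # illustrative pointwise uncertainty

# pseudodata: true law plus Gaussian noise
data = x ** truth[0] * (1 - x) ** truth[1] + err * rng.standard_normal(x.size)

def residuals(p):
    """Weighted residuals whose sum of squares is the chi^2."""
    return (x ** p[0] * (1 - x) ** p[1] - data) / err

# bounds match those quoted in the text: alpha in [-1, 1], beta in [0, 5]
fit = least_squares(residuals, x0=(0.0, 1.0), bounds=([-1, 0], [1, 5]))
```

With these settings the minimizer recovers the true exponents to within a few percent, as expected for statistically consistent pseudodata.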

IV. NEURAL NETWORKS, PARTITIONING AND CROSS VALIDATION
The aim of this section is to explore the effect of partitioning and cross validation on PDF uncertainties in the training of neural networks (NNs). In Sec. IV A we will describe the role that cross validation plays in fitting data with NNs, before exploring the dependence of predicted uncertainties on the partition fraction in such a framework. In Sec. IV B we will then compare the NN approach to the methods covered in Sec. III in terms of potential inflation of uncertainties.

A. Neural networks with cross validation
Cross validation is often an essential component of the training procedure for NNs if one wants to obtain robust results that generalize to new data sets beyond those used in the training. NN models typically have hundreds of parameters that can produce extremely complicated functional forms. The primary metric of an NN is the loss function, which generally depends in some way on the difference between the predictions of the model and the true values of the function that it is trying to reproduce. Training a network involves finding the values for the weight and bias parameters of the network that minimize the loss function. The danger associated with such a setup is that, given a certain set of data and enough time, an NN will simply end up reproducing the data points with which it is presented. This is referred to as overfitting, and reflects the fact that the NN model is no longer describing the underlying law that we ultimately want to learn and use to make predictions.
For our NN fits we make use of the TensorFlow platform via the Keras interface. The loss function is defined in the same way as the χ² function in Eq. (29), except that σ^{model}_{ij} is replaced with the value of the cross section predicted by the NN. Our NN architecture consists of two hidden layers with 10 neurons each, with tanh activation functions and a linear output layer, similar to that used in the NNPDF4.0 analysis [58]. For illustrative purposes, we allow the NN training to run for a long time (10^5 epochs) in order to deliberately overfit the data as a first example.
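For readers who prefer a self-contained sketch, the architecture just described can be written out in plain NumPy (the actual fits use TensorFlow/Keras; the training data, loss, and learning rate below are illustrative stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [1, 10, 10, 1]                  # input, two hidden layers of 10, output
W = [rng.standard_normal((m, n)) / np.sqrt(m) for m, n in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """Forward pass: tanh hidden layers, linear output layer."""
    h, acts = x, [x]
    for k in range(len(W)):
        z = h @ W[k] + b[k]
        h = z if k == len(W) - 1 else np.tanh(z)
        acts.append(h)
    return h, acts

def train_step(x, y, lr=0.05):
    """One full-batch gradient-descent step on a mean-squared loss."""
    yhat, acts = forward(x)
    delta = 2 * (yhat - y) / len(x)     # gradient of the loss w.r.t. the output
    for k in reversed(range(len(W))):
        grad_W, grad_b = acts[k].T @ delta, delta.sum(axis=0)
        if k > 0:                       # backpropagate through the tanh layers
            delta = (delta @ W[k].T) * (1 - acts[k] ** 2)
        W[k] -= lr * grad_W
        b[k] -= lr * grad_b
    return float(np.mean((yhat - y) ** 2))

# fit a smooth toy curve, standing in for the sigma_1 data
x = np.linspace(0, 1, 20)[:, None]
y = x ** 2
losses = [train_step(x, y) for _ in range(10000)]
```

Run long enough on noisy data, exactly this kind of flexible model will drive the training loss toward zero, which is the overfitting behavior discussed next.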
In this section we only perform NN fits to predict the cross section rather than inferring the PDFs themselves. Our goal is to home in on the behavior of uncertainties purely due to the introduction of NNs instead of a specific underlying parametrization, and in order to study this it will be sufficiently informative to limit our analysis to the level of the cross section. We also dispense with the preprocessing factor x^α (1 − x)^β at this stage, although we will demonstrate in Sec. IV B that including it has no significant effect on our results.
An example of overfitting our toy σ_1 data using an NN with the above settings can be seen in Fig. 3. Here, the shape predicted by the NN does not resemble the true functional form at all, showing drastic and unrealistic variations in order to ensure that the data points match the prediction as closely as possible. Clearly this is a result we wish to avoid when applying an NN to our PDF problem. To do so, we employ cross validation: the data are randomly partitioned into a training set containing a fraction f of the points and a validation set containing the remainder; only the training set enters the minimization of the loss function, while the validation loss is monitored, and the training is stopped once the validation loss ceases to improve. In this way we ensure that the best fit is found, while also avoiding overfitting the training data.
As explained in Sec. II E, repeated fitting of resampled data is necessary in order to construct uncertainty bands. By comparing the predicted σ 1 distribution both with and without cross validation, as in Fig. 5 (for this exercise f = 0.4), the importance of the cross validation procedure becomes clear. Without using cross validation, the NN model simply reproduces the error bars of the data points given, and predicts an oscillatory shape that follows the data. The NN with cross validation produces an overall smooth distribution that approximates much more closely the underlying law that we are trying to learn.
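The partitioning and stopping logic can be abstracted into a short sketch, independent of the model being trained (the partition fraction f and the patience window are illustrative knobs, and the function names are ours):

```python
import numpy as np

def train_val_split(n, f, rng=None):
    """Randomly assign a fraction f of the n points to the training set."""
    rng = rng or np.random.default_rng(0)
    idx = rng.permutation(n)
    n_tr = int(f * n)
    return idx[:n_tr], idx[n_tr:]

def fit_with_early_stopping(step_fn, val_loss_fn, max_epochs=10000, patience=100):
    """Generic early-stopping loop: advance the optimizer with step_fn and
    stop once the validation loss has not improved for `patience` epochs."""
    state, best_state, best, best_epoch = None, None, np.inf, 0
    for epoch in range(max_epochs):
        state = step_fn(state)                # one optimization step
        v = val_loss_fn(state)                # monitor the validation loss
        if v < best:
            best, best_state, best_epoch = v, state, epoch
        if epoch - best_epoch >= patience:    # no recent improvement: stop
            break
    return best_state, best
```

The loop returns the state at the validation minimum rather than the final state, which is what prevents the fit from following the training data into the overfitting regime.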
One of the few aspects of the cross validation procedure which one can tune is the partition fraction f. We find that the width of the predicted uncertainty bands depends appreciably on the choice of f, and we can identify no uniquely preferred value; nor can we conclude that below f = 0.5 an increased dataset will remove any potential artificial inflation of uncertainty due to the choice of f.

B. Comparing NN methods to parametric methods
To further comprehend the differences between the NN methodology and that of traditional parametric modelling, it is worth comparing the results of fits to our toy data for all of the methods discussed so far. At this point we also include the preprocessing factor x^α (1 − x)^β in the NN fits. This is usually added with the aim of speeding up the minimization procedure. However, it is worth testing whether its inclusion could be playing a greater role in helping the NN to obtain a shape that is more consistent with the underlying law. We apply this factor in the same way as the NNPDF collaboration; specifically, we parametrize the PDFs as

f_i(x) = x^{α_i} (1 − x)^{β_i} NN_i(x),    (30)

where NN_i(x) represents the output of the neural network. We then define effective asymptotic exponents as

α_eff(x) = ∂ ln f_i(x) / ∂ ln x,    β_eff(x) = ∂ ln f_i(x) / ∂ ln(1 − x),    (31)

which reduce to α_i and β_i in the x → 0 and x → 1 limits, respectively, for an NN_i(x) that is smooth at the endpoints. A full fit for one instance of our toy σ_1 data with 50 points is shown in Fig. 8. Here, we compare the distributions and ratio to the true values for the NN (f = 0.6) with and without preprocessing, as well as parametric DR (recall that parametric DR was shown to produce the same results as the Hessian, HMC and NS). Both NN methods produce predicted uncertainty bands that are much wider than that of parametric DR, particularly at large x. The fact that there is little difference between the NN predictions with and without preprocessing suggests that preprocessing does not play a significant role in this example.
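Effective exponents of this kind can be evaluated numerically from any fitted form; a sketch using finite differences of the logarithm (step size and function names are our choices) is:

```python
import numpy as np

def effective_exponents(f, x, eps=1e-6):
    """Numerical log-derivatives: alpha_eff = dln f / dln x and
    beta_eff = dln f / dln(1 - x), evaluated at a point x."""
    lf = lambda t: np.log(f(t))
    alpha = (lf(x * (1 + eps)) - lf(x)) / np.log(1 + eps)
    beta = (lf(1 - (1 - x) * (1 + eps)) - lf(x)) / np.log(1 + eps)
    return alpha, beta
```

Applied to the pure form x^α (1 − x)^β, the routine returns α_eff → α as x → 0 and β_eff → β as x → 1, which is the consistency check used when comparing fitted and true endpoint behavior.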
Such a demonstration only becomes meaningful in the context of a full replica fit for many initial data sets rather than just one. By counting how often a given method predicts an uncertainty band that contains the true underlying law (as inspired by closure tests), we can determine whether the predicted uncertainties are statistically valid. Since the uncertainty bands represent the variance of the cross section around its expected value, we do not expect them to always contain the true values. Rather, a reasonable method will produce uncertainty bands that contain the true values around 68% of the time across a large number of initial data sets. In Fig. 9 we show, for each method, the percentage of uncertainty bands across all data sets that contain the true values, as a function of x. We plot the NN method with f = 0.2 and f = 0.6 to again demonstrate the importance of the partition fraction, and also show the results for the NN with preprocessing. We find that the NN predicted uncertainty bands contain the true value too often, particularly at high x values. This effect is exacerbated by reducing the partition fraction. By contrast, the parametric DR method consistently hovers around the 68% value that would be expected from accurate confidence intervals.
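The coverage counting itself is straightforward; the following self-contained sketch applies it to a trivial estimator (the sample mean of Gaussian pseudodata), for which the 1σ band should contain the truth about 68% of the time:

```python
import numpy as np

rng = np.random.default_rng(0)
truth, sigma, n_pts, n_sets = 1.0, 0.2, 25, 2000
hits = 0
for _ in range(n_sets):
    data = truth + sigma * rng.standard_normal(n_pts)   # one pseudodata set
    est = data.mean()
    err = sigma / np.sqrt(n_pts)                        # 1-sigma error on the mean
    hits += (est - err <= truth <= est + err)           # does the band cover truth?
coverage = hits / n_sets
```

A method whose bands cover the truth much more than 68% of the time, as found for the NN fits, is overestimating its uncertainties; one covering much less is underestimating them.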
The inflated uncertainty observed is not entirely due to the usage of neural networks in fitting the toy data. Another contributing factor to this result is the presence of the cross validation procedure itself. In Fig. 9 we also include a cross validation method in the context of parametric DR, dividing the data into a training and validation set with f = 0.6 and cutting off the χ 2 minimization when there is no further improvement for the validation set.
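A minimal sketch of this DR+CV stopping rule, again on a stand-in linear model with hypothetical settings (learning rate, patience threshold): partition the data with f = 0.6, descend on the training χ², and stop once the validation χ² shows no further improvement.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.1, 0.9, 50)
sigma = 0.1
y = 2.0 * x + 1.0 + rng.normal(0, sigma, x.size)  # one pseudodata replica

# partition with fraction f = 0.6 for training, rest for validation
idx = rng.permutation(x.size)
n_tr = int(0.6 * x.size)
tr, va = idx[:n_tr], idx[n_tr:]

def chi2(p, sel):
    return np.sum(((y[sel] - (p[0] * x[sel] + p[1])) / sigma) ** 2)

p = np.array([0.0, 0.0])
lr, best_va, patience, stall = 1e-3, np.inf, 20, 0
for step in range(20000):
    # analytic gradient of the training chi2
    r = y[tr] - (p[0] * x[tr] + p[1])
    grad = np.array([np.sum(-2 * x[tr] * r), np.sum(-2 * r)]) / sigma**2
    p -= lr * grad / len(tr)
    v = chi2(p, va)
    stall = 0 if v < best_va - 1e-6 else stall + 1
    best_va = min(best_va, v)
    if stall >= patience:   # no further validation improvement: stop
        break
# p ends near the true parameters (2.0, 1.0), but where exactly the descent
# halts is set by the validation set, not by the training chi2 minimum
```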
Comparing DR with and without cross validation, we see a substantial increase in the width of the uncertainty bands when cross validation is used, rising to a mean value similar to that of the NNs at the same partition fraction.
These results imply that, for our specific simplified example of PDF data, the uncertainty bands predicted by an NN fit do not realize the 1σ confidence interval that one would naively expect. This may be due in part to a misrepresentation of the underlying law, but also a result of the cross validation procedure that forms an integral part of any NN fit to experimental data.
One may ask how it is that NN fits, as well as partitioning and cross validation applied to simple parametric DR, can so dramatically change the variance of the fitted observable.
From our perspective, the algorithms deployed in either case effectively lead to a change in the likelihood function. In the case of cross validation applied to DR, this can be understood heuristically by viewing the validation χ2 as a potential that stochastically pulls the parameters away from the true minimum, thereby increasing the variance of any calculated observable, but in a uniform way. This is illustrated by the very similar shapes of the DR and DR with CV results in Fig. 9, where one goes from a ∼ 68% confidence level to a ∼ 90% one when using partitioning and cross validation, as if the likelihood were rescaled by the adopted stopping criterion. NN methods not only appear to rescale the likelihood, but they do so to different degrees across the x range, resulting in a behavior that is not immediately amenable to a heuristic analysis.
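The rescaling interpretation can be made quantitative for a Gaussian observable: a band that covers the truth ∼ 90% of the time is what a 1σ band inflated by a factor of ≈ 1.645 would produce, which is equivalent to dividing the χ2 by ≈ 1.645² ≈ 2.7. A one-line check, using the fact that a ±kσ interval has coverage erf(k/√2):

```python
import math

def coverage(scale):
    """Coverage of a +/- scale*sigma interval for a Gaussian observable."""
    return math.erf(scale / math.sqrt(2))

c1 = coverage(1.0)      # nominal 1-sigma band: ~0.683
c2 = coverage(1.645)    # band inflated by ~1.645, i.e. chi2 divided by ~2.7: ~0.90
```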

V. CONCLUSIONS AND OUTLOOK
The need for reliable uncertainty quantification in global QCD analysis motivated this study to perform a systematic comparison of different methods for estimating uncertainties on PDFs and similar correlation functions. The test laboratory employed here, in the form of the toy PDF model, although exceedingly simple, allowed us to rigorously test and verify the generation of uncertainties in a controlled context, in which the underlying physical law is known. We were thus able to demonstrate that methods which use a parametrization that matches the underlying law lead to the same uncertainty estimates, regardless of whether the uncertainties are determined using the traditional Hessian or data resampling methods, or explicitly Bayesian techniques such as nested sampling or HMC. While the result may be intuitively expected, it is by no means trivial to verify in practice that all the approaches are indeed exploring the same likelihood function.
We then focused on the impact that partitioning and cross validation have on uncertainty quantification. We found that these added elements effectively lead to a rescaling, and even a deformation of the likelihood function, which also depends on the partition fraction. Our results suggest that this is inherent in the methodology rather than in the details of the model or the data set analyzed.
Comparing the parametric methods with NN approaches, we indeed find that NNs produce wider uncertainty bands than expected, whether or not one uses a preprocessing factor, and that the dependence on the partition fraction is not reduced as one increases the number of data points to be fitted. The enhanced uncertainty is found to be due, at least in part, to the use of cross validation. While DR+CV exhibits this behavior in a uniform way in observable space, heuristically due to the additional influence that the validation data set has on the fitting of the training data set, the NN's inflation of the cross section uncertainty depends also on x and becomes very large as x → 1. Whether this non-uniform uncertainty inflation is due to the highly flexible PDF parametrizations provided by the NNs or to other elements in the analysis algorithm remains an interesting question for future investigation.
Overall, given that the data set being analyzed by the NN and parametric methods is the same, we can only conclude that NN methods (as well as parametric methods supplemented by cross validation) algorithmically deform the nominal likelihood utilized in the analysis in ways that are not necessarily under control. In view of these results, it is natural to ask in what sense PDF fits based on NN methods are compatible with those based on Hessian + tolerance methods, since the likelihood functions that they utilize are technically different.
From our perspective, the tolerance criterion is effectively a change in the likelihood function, which can also be cast as a rescaling of the experimental uncertainties to remove potential tensions present among the data. NN-based analyses posit instead that the use of a tolerance criterion is not necessary [18], but the roughly equivalent size of the uncertainties they obtain may simply be coincidental, arising from the likelihood deformation and resulting uncertainty inflation that we already observed when analyzing compatible data sets in our toy model.
We thus echo the concern expressed in [44] that a meta-analysis, such as PDF4LHC [43], that combines existing PDFs from different groups may not only obscure the fundamental connection between experimental data and theory, but also obfuscate the true meaning of the uncertainties, since these seem to ultimately originate from a different choice of the likelihood function.
Finally, we add a note of caution about the generality of our results: our conclusions should be qualified by the fact that we are using a rather simplistic model. While this has been necessary in order to keep the analysis tractable and allow unambiguous conclusions to be drawn from the methodology comparisons, there may be new features present in higher dimensional analyses of more complex data sets, such as in real global QCD analyses.
We leave discussion of this issue to a future publication [59].