Simultaneous Monte Carlo analysis of parton densities and fragmentation functions

We perform a comprehensive new Monte Carlo analysis of high-energy lepton-lepton, lepton-hadron and hadron-hadron scattering data to simultaneously determine parton distribution functions (PDFs) in the proton and parton to hadron fragmentation functions (FFs). The analysis includes all available semi-inclusive deep-inelastic scattering and single-inclusive $e^+ e^-$ annihilation data for pions, kaons and unidentified charged hadrons, which allows the flavor dependence of the fragmentation functions to be constrained. Employing a new multi-step fitting strategy and more flexible parametrizations for both PDFs and FFs, we assess the impact of different data sets on sea quark densities, and confirm the previously observed suppression of the strange quark distribution. The new fit, which we refer to as"JAM20-SIDIS", will allow for improved studies of universality of parton correlation functions, including transverse momentum dependent (TMD) distributions, across a wide variety of process, and the matching of collinear to TMD factorization descriptions.


I. INTRODUCTION
The standard parton correlation functions of QCD, such as collinear parton distribution functions (PDFs) and fragmentation functions (FFs), are being utilized in an increasingly diverse range of phenomenological applications. Beyond their traditional role in predicting new high energy phenomena, they also enter frequently into the study of more complex and extended objects like transverse momentum dependent (TMD) PDFs and FFs and generalized parton distributions (GPDs), where they are needed to understand the transition between different factorization regions. Both TMDs and GPDs are central to the study of the nonperturbative parton structure of hadrons, and understanding how they encapsulate their longitudinal and transverse features will be critical to current experimental programs at Jefferson Lab and elsewhere, as well as to the future Electron-Ion Collider. These considerations provide one of the main motivations for the study of collinear PDFs and FFs in this paper.
The great value of PDFs and FFs extracted from global QCD data analysis lies with their predictive power, or "universality". However, the translation from experimental data to quark and gluon operator structures is a challenging inverse problem. It is not possible to exactly constrain parton correlations from data alone since this connection involves nontrivial convolution integrals in a factorization formalism (whose accuracy itself is difficult to quantify in any given instance), and because of the limited quantity of available data.
The complexity of the inverse problem is also magnified by the number of flavor degrees of freedom involved.
Nevertheless, assessing and maximizing the universality of collinear PDFs and FFs is crucial given the increasingly broad scenarios where they are used. A major focus in the current effort by the Jefferson Lab Angular Momentum (JAM) Collaboration is therefore to both test and broaden the predictive power of parton correlation functions. This is achieved through a Bayesian inference procedure in which PDFs and FFs are extracted simultaneously, and the uncertainty quantification associated with particular parametrizations of parton correlation functions is given in terms of a Bayesian posterior distribution. To test universality, the system of equations relating observables to parton correlation functions must of course exceed the total number of correlation functions involved -a minimum requirement is that the parton correlation functions be overconstrained by the data in the fit.
Of course, realizing this in practical analyses requires that all parton correlation functions be truly fitted simultaneously. This is a major numerical and technological challenge, and traditionally PDFs and FFs have thus been extracted in separate procedures. However, simultaneous fits can be achieved with the Bayesian Monte Carlo approach, and have been implemented recently in the JAM17 [1] analysis of helicity PDFs, and in the JAM19 [2] analysis of unpolarized PDFs and FFs. The same basic methodology was also applied in the three-dimensional JAM3D20 [3] study, in the first combined analysis of TMD observables that satisfies the overconstraining criterion.
In this paper, we extend the previous work by performing the first simultaneous and overconstrained fit of unpolarized PDFs and FFs that utilizes both charged hadron production in semi-inclusive deep-inelastic scattering (SIDIS) and single-inclusive e + e − annihilation (SIA). This is partly motivated by a number of recent observations associated with the study of TMD PDFs. For example, significant tension has recently been found between fits performed with standard sets of PDFs and FFs and fixed order perturbative QCD calculations in processes including SIDIS [4,5], Drell-Yan (DY) [6], and SIA into wide-angle hadron pairs [7]. A number of suggested solutions and explanations have been proposed to account for this, including a possible need for power suppressed corrections [8] at the moderate scales of most SIDIS experiments. However, more tests of the limits of applicability of standard collinear factorization are needed before it is possible to draw firm conclusions. Given that the majority of data used to constrain collinear correlation functions (both PDFs and FFs) are either highly inclusive or exist are at very high scales, or both, it is perhaps not surprising that tension arises when these are evolved downward and used to make predictions at lower scales and for highly differential observables. Indeed, there have been few tests that Q 2 -scaling, a hallmark of the collinear perturbative regime, actually holds to a reasonable approximation in SIDIS measurements at moderate Q 2 . Our hope is that the new combined fit, which we refer to as "JAM20-SIDIS", will help to shed light on this and similar issues in the future.
In Sec. II we begin the discussion by summarizing the methodology used in our simultaneous Monte Carlo analysis, including the parametrizations used for the distributions and the multi-step Bayesian inference algorithm. Details of the data sets included in the fit are summarized in Sec. III, while in Sec. IV we discuss the criteria for universality and how these are met in this analysis. A detailed discussion of the numerical results is given in Sec. V, where we present the fitted PDFs and FFs, as well as detailed comparisons of data to theory. Finally, in Sec. VI we summarize our conclusions and discuss the implications of our analysis. Some formulas for SIDIS cross sections and structure functions are collected in Appendix A.

II. THEORETICAL FRAMEWORK
In this section we give an overview of the theoretical framework on which our analysis is based, including the observables to be fitted, the parametrizations used for the PDFs and FFs, details of the perturbative QCD setup, and Bayesian inference strategy employed.

A. Observables and factorization
In this analysis we work in standard collinear factorization [9][10][11], in which QCD cross sections are separated into perturbatively calculable partonic hard factors convoluted with nonperturbative PDFs and/or FFs. We perform calculations of all observables consistently to order α s in the QCD coupling. Details of the basic theoretical setups for the inclusive DIS, inclusive Drell-Yan lepton-pair production and SIA reactions are provided in the literature [10,12], and will not be repeated here. However, since SIDIS is a comparatively novel addition to global QCD analyses, we review it in more detail in Appendix A.
The processes considered in the present analysis can be summarized as follows: Drell-Yan lepton-pair production where h ± represent charged pions, kaons, or unidentified hadrons, and the nucleon N (or N 1,2 ) in the initial state can be either a proton or a neutron (in practice, deuteron).
Within the framework of collinear factorization, the cross sections for each of these processes can be written schematically as convolutions of hard functions and the nonperturbative par-ton distribution and fragmentation functions, where the symbols ⊗ represent the convolution integrals in longitudinal momentum fractions of the hard scattering functions H ij and the PDFs f i and FFs D h j for parton flavors i, j (see the Appendix). In each process, Q represents the hard scale given by the photon virtuality, Q hadron masses, which allows the observables to be factorized into the short-distance perturbative and long-distance nonperturbative parts.
For the inclusive DIS and SIDIS processes, is the usual Bjorken scaling variable, while for the DY process the analogous scaling variables are defined as where p 1 and p 2 denote the incoming hadron momenta, with the Feynman scaling variable given by In the DY center of mass frame, and in the limit of negligible hadron masses ( Q), the virtual photon rapidity can be written in terms of x 1 and x 2 as For the processes involving fragmentation to a hadron h in the final state, we have for SIDIS in Eq. (2), while for SIA in Eq. (4).

B. Perturbative QCD and numerical setups
For our numerical analysis we make use of Mellin space techniques to enable fast evaluations of observables needed for the Bayesian analysis. In particular, we solve the DGLAP evolution equations analytically in Mellin space [13], which allows one to effectively render high-dimensional momentum space convolutions from process-specific factorization theorems, along with the integrals in the DGLAP equations, in the form of lower-dimensional inverse Mellin transforms. For example, for the inclusive DIS observables one can write where N here is the conjugate variable to x Bj , f j (N, µ 0 ) is the Mellin moment of the PDF and H DIS i (N, µ) is the corresponding moment of the partonic DIS cross section. The analytic solution for the DGLAP evolution is entirely encoded in the evolution matrix U S i,j that evolves the moments f j (N, µ 0 ) of the PDFs from a given input scale µ 0 to the relevant DIS hard scale µ = Q. A similar expression can be written for the SIA cross section, where M is the Mellin conjugate variable for z h , D h j (M, µ 0 ) is the moment of the FF, and H SIA i is the moment of the partonic SIA cross section. The superscripts S and T in the evolution matrix distinguish between the spacelike and timelike evolution for the PDFs and FFs, respectively, which are encoded in the corresponding DGLAP splitting kernels.
The same procedure can be extended for the case of SIDIS, which gives For the case of the Drell-Yan process, a special treatment is required since the Mellin moments for the partonic cross sections are not known. For this we employ the strategy developed by Stratmann and Vogelsang [14], where by the Mellin moments are numerically pre-calculated and used as lookup tables during the analysis. The resulting expression can be written schematically as  [14,[16][17][18].

C. Parametrization of nonperturbative functions
For the nonperturbative parton distribution and fragmentation functions we use standard parametrizations that have been utilized in the literature. Namely, for the dependence on the parton momentum fraction x of the PDF f (x) we use the template function where a = {M, α, β, γ, δ} is a vector containing the shape parameters (α, β, γ, and δ) and a normalization coefficient (M) to be fitted. The integral in the denominator ensures that the value of the normalization coefficient M is equal to the second moment (x-weighted integral) of the function T (x; a). For fitting the PDFs, we assume isospin symmetry to relate the PDFs in the neutron, f i/n (x), to those in the proton, f i/p (x) ≡ f i (x), switching the u ↔ d and p ↔ n labels for the light quark flavors, and taking the PDFs for other flavors equal for the proton and neutron.
In practice, we parametrize the valence u and d quark distributions, u v ≡ f u − fū and d v ≡ f d −fd, directly using the template function (Eq. (16)). The gluon distribution, g ≡ f g , is also directly parametrized per Eq. (16). For the sea quark and antiquark distributions, we use five functions parametrized as in Eq. (16). These are a flavor symmetric sea function (S) that dominates at very low x and flavor specific functions (q 0 (q 0 )) for the s,ū,d, ands that take into account the possible nonperturbative origin of the sea. The distributions for s,ū,d, ands are constructed from these according to: q(q) ≡ f q(q) = S + q 0 (q 0 ). Note that and s −s distributions, whose lowest moments are required to be 2, 1, and zero, respectively.
The normalization for the gluon PDF is determined using the momentum sum rule. With these constraints, there is a total of 28 free parameters for the PDFs.
For the z dependence of FFs, the functional form follows a similar template, where again the integral in the denominator ensures that the coefficient M corresponds to the second moment (z-weighted integral) of the function. In addition to the fragmentation to pions and kaons studied in earlier JAM analyses of SIA and SIDIS data [2,19], here we consider also the inclusive production of unidentified charged hadrons, h ± . Accounting for unidentified hadrons can be implemented in two ways. First, the hadron FFs can be fit independently from those for pions and kaons, as preferred by the NNPDF Collaboration [20].
Alternatively, one can take advantage of existing knowledge of specified hadron FFs and add a fitted residual correction to their sum. Such an approach was adopted by de Florian, Sassot, and Stratmann (DSS) [21], for example, in which a residual correction was fitted to the sum of previously obtained pion, kaon, and proton fragmentation functions.
In our analysis we follow the latter approach, but include only the pion and kaon FFs, so that the residual term D res + i parametrizes the difference between the total hadron FF D h + i and the D π + i and D K + i functions, To reduce the total number of residual FFs being fit, we assume SU(3) flavor symmetry for light quarks and antiquarks, where D res + q and D res + q are parametrized per the template (Eq. (17)). To allow for differentiation between the residual FFs for light quarks and antiquarks, we leave M and β for D res + q as free parameters, but fix α, γ, and δ to be the same as for D res + q . This achieves a similar constraint on the parameters as the condition used by DSS [21], 2D res + q = (1 − z) β D res + q+q . For the pion FFs, D π + i , we reduce the number of fitted functions by grouping the light quarks into "favored" (valence) and "unfavored" (non-valence) flavors, where D π + fav and D π + unf are parametrized as in Eq. (17). For the parameters of the kaon FFs, D K + i , we equate the "unfavored" flavors, parametrized per Eq. (17). Finally, the gluon FFs D h + g for h = π, K, res are also parametrized according to Eq. (17). We use charge conjugation symmetry to relate FFs for opposite charges by where h = π, K, res. This results in 5 fitted functions for pions and residual hadrons, and 6 for kaons.
At this point, there are 17 shape parameters and 5 normalization parameters for residual hadrons, 20 shape parameters, and 5 normalization parameters for pions, and 24 shape parameters and 6 normalization parameters for kaons. The number of shape parameters is reduced further because throughout the fitting procedure, the parameters γ and δ for the gluon, charm, and bottom FFs are fixed at zero. In the end there are 16 free parameters to be fitted for residual charged hadron FFs, 19 free FF pion parameters, and 24 free parameters for the kaon FFs. Together with the 28 PDF parameters, we have a total of 87 free parameters for the fitted functions. In addition, there are also 42 free parameters associated with normalization of various data sets, making for a total of 129 free parameters to be fitted in the analysis.

D. Bayesian inference
Our methodology for extracting nonperturbative PDFs and FFs is based on the general premise of Bayesian inference. Namely, we use Bayes' theorem to define a multivariate probability distribution P for the shape parameters characterizing the PDFs and FFs (the posterior) at a given input scale µ 0 , where L is a standard Gaussian likelihood function, with the χ 2 function defined by Here, d i,e is the value of the i-th data point for the experimental dataset e, with T i,e the theoretical prediction for the data point; α i,e is the uncorrelated systematic and statistical uncertainty for each data point added in quadrature; β k i,e is the k-th source of point-to-point correlated systematic uncertainties for the i-th bin of dataset e, with r k e the related weight; and N e and δN e are the normalization and normalization uncertainty for each data set, respectively. In Eq. (23), π(a) is the prior distribution for the set of parameters a, which is used as input for a given fit to the data.
In principle, given the Bayesian posterior distribution, one can estimate confidence regions for a generic observable O (such as a PDF or a function of PDFs or FFs) by integrating over an d-dimensional parameter space,

III. DATA SETS
The data sets used in the present analysis include the primary electromagnetic processes that traditionally have been used in global QCD analyses, namely, inclusive DIS, Drell-Yan lepton-pair production (which constrain PDFs), SIA (which constrains FFs), and SIDIS (which constrains both PDFs and FFs). The inclusive DIS data are measurements of the F 2 (x Bj , Q 2 ) structure function performed by the BCDMS [22,23] and New Muon Collaborations [24,25] at CERN, and from experiments at SLAC [26], as well as from reduced electron and positron cross sections from the H1 and ZEUS Collaborations [27] at DESY. These include both proton [22,24,26] and deuteron [23,25,26] targets, and with both neutral and charged current probes [27]. For the kinematics we implement cuts of W 2 > 10 GeV 2 and x Bj , in order to select DIS data that can be fitted within leading power factorization.
For Drell-Yan lepton-pair production data we use differential cross section measurements   [28][29][30] at Fermilab, which include proton scattering from proton and deuteron targets. We include data in the range Q 2 > 36 GeV 2 .
Excluding lower Q 2 data is recommended by Ref. [31], which demonstrated that inclusion of the lower Q 2 data results in deteriorated prediction quality with no reduction in uncertainty when compared with fits to DIS data alone.
Identification of heavy quark flavors for some of the SIA datasets is achieved through measurement of the total energy and momentum in secondary vertices. The flavor tagged cross sections for a specific flavor q = c or b are particularly sensitive to the D h q , D h q and D h g fragmentation functions into the observed hadron h. In general, however, care needs to be taken with the precise method for separating primary quark flavors, and there are ongoing discussions regarding the optimal approach to this. For more in-depth discussion see, for example, Ref. [19].
Finally, the critical addition in this work compared with the previous JAM19 analysis [2] is the inclusion of unidentified charged hadron data, along with charged pions and kaons, in the SIDIS off deuterium targets from the COMPASS Collaboration [32,33] at CERN. Since the SIDIS data dσ h ± SIDIS /dQ 2 dx Bj dz h are differential in x Bj and z h , they combine information on both PDFs and FFs, which appear in the description of SIA, Drell-Yan, and DIS data.
Furthermore, as illustrated in Fig. 2, the SIDIS data have significant overlap in x Bj and z h with the x Bj and x F range of inclusive DIS and Drell-Yan data, respectively, and the z h range of SIA data, so that the combined analysis constitutes a genuine test of their universality.
For the COMPASS SIDIS data we use the same kinematic cuts on W 2 and Q 2 as for inclusive DIS, and restrict the fragmentation variable to 0.2 < z h < 0.8 in order to exclude data from the target fragmentation region and avoid large-z threshold corrections.

IV. ASSESSING UNIVERSALITY
Before proceeding to the results of our numerical analysis, we briefly discuss the criteria for universality of the PDFs and FFs and how these are implemented in our analysis. Extracting parton correlation functions, and using the extractions to test models of parton structure, is a nontrivial inverse problem, the detailed examination of which is beyond the scope of the present paper. However, a claim that the success of a fit is a measure of the predictive power of the PDFs and FFs requires a number of basic minimal conditions to be met: 1. The system of unknown correlation functions must be over-constrained, by which we mean that the constraints on unknown correlations imposed by data (or other theoretical constraints such as sum rules) must be greater than the total number of functions involved.
2. Each unknown correlation function must appear at least twice within the set of factorization formulas relating the correlation functions to physical observables.
3. There must be reasonable kinematical overlap between the observables so that correlation functions can be compared within similar ranges of parton momentum fractions.
Using isospin invariance to relate the PDFs in the proton to those in the neutron, we have seven independent PDFs: f u , f d , f s , fū, fd, fs and f g , with PDFs for heavy flavors generated perturbatively. For the FFs, there are five functions for π + production: D π + u , D π + u , D π + c , D π + b and D π + g , assuming that for equal u and d quark masses we can equate D π + d = D π + u . Charge symmetry allows all the FFs for π − production to be related to those for π + production.
For K + production, there are six independent FFs: and D K + g , where we differentiate between the u ands functions. Again, using charge symmetry the where n u = 2, n d = 1 and n s = 0, and the momentum sum rule, Note that in Section II C these constraints were specifically used to fix the values of the normalization parameters for several fitted functions. However, for the purpose of assessing universality, they are simply counted as additional independent equations which include and thus constrain the PDFs.
The data sets discussed in Sec. III also constrain the light quark and gluon PDFs since they appear in expressions for multiple independent observables. Counting these and also the four sum rules (28) and (29), there is a total of 18 relations between the light quark PDFs. The heavy quarks appear in an even greater number of observables. The light quark fragmentation functions appear in at least one SIA observable and, because of charge conjugation invariance, in 2 SIDIS observables, and similarly for the kaon and charged hadron fragmentation functions.
For a robust stress-test of universality, there should be reasonable overlap of the ranges in parton momentum fraction for both the PDFs and the FFs. An indication for how well this is achieved in the current fit can be be gleaned from the kinematical coverage plots shown in In summary, our analysis does indeed fulfill the basic criterion for qualifying as a test of universality, and retaining predictive power for the PDFs and FFs more generally. Note, however, that the momentum sum rule for FFs has not been imposed in the analysis. Instead, this will be used as a consistency check for the final fit in Sec. V.

V. NUMERICAL ANALYSIS
In this section we present the results of our simultaneous Monte Carlo analysis of PDFs and FFs. We begin with a survey of the fitted cross sections for the various global datasets used in this study, focusing especially on the quality of agreement with the SIDIS and SIA data on π ± and K ± , as well as unidentified h ± production. We then present our final fitted PDFs and FFs, and discuss the vital role played by the SIDIS and SIA datasets in particular in constraining the strange quark distribution in the proton.

A. Data and theory agreement
To assess the agreement of the fitted results with the various datasets, in Fig. 3 we show the reduced χ 2 for each individual experiment, which is defined by Here, the expectation value E[...], as defined in Eq. (27a), represents the mean theory, including optimized multiplicative and additive corrections to match the data, with N the total number of data points. In Fig. 3 we show the mean and standard deviation of the Monte Carlo residuals for each experiment e, where the residual per data point is defined as For the inclusive DIS, Drell-Yan and SIDIS datasets we find excellent overall agreement between data and theory, with χ 2 red values close to 1. The χ 2 red for the SIA datasets are slightly higher, but nonetheless the overall fit is very good, giving a total reduced χ 2 red = 1.26 for almost 5000 data points. The values of χ 2 red for each type of dataset and for each specific hadron in the final state are summarized in Table I, along with the number of data points for each dataset.
The residuals profile for the DIS, Drell-Yan and SIDIS datasets is well centered around zero, with variances ∼ 1, indicating an average Gaussian behavior of their associated likelihood function. The variance for the SIDIS h − data from COMPASS, however, is found to be up to ≈ 50% below unity, suggesting a deviation from a Gaussian likelihood. This may production with the COMPASS data [32,33] in various bins of x Bj and y (offset by a factor 2 i ).  Fig. 4, but for K ± COMPASS SIDIS data [32,33].

FIG. 5. As in
be due to the fact that these data are dominated by systematic uncertainties, which is also reflected by the relatively small reduced χ 2 red values, especially for the COMPASS h − data relative to the rest of DIS and SIDIS data sets.
A more detailed comparison with the COMPASS SIDIS is made in Figs. 4, 5 and 6,  FIG. 6. As in Fig. 4, but for unidentified hadron h ± COMPASS SIDIS data [32,33].
where we show the z h dependence of the π ± , K ± and h ± multiplicities, respectively, which are defined as ratios of SIDIS to inclusive DIS cross sections at the same x Bj and Q 2 , The agreement between theory and the experimental z h spectrum is quite remarkable, given that it spans some 2 orders of magnitude, which suggests that at these kinematics a leading power perturbative QCD factorization at next-to-leading order provides sufficient accuracy to describe the data. Interestingly, the differences between the multiplicities for positively and negatively charged hadron species increase with x Bj , especially for kaons, and in the valence region these can differ by an order of magnitude for low values of Q 2 . Such differences can enhance our ability to extract flavor dependent effects in nonperturbative PDFs and parton to kaon FFs from the data. The new data set included for the first time in the present JAM analysis, namely the unidentified charged hadron data shown in Fig. 6, are also well described by our nonperturbative ansatz for the corresponding FFs. In contrast to the excellent agreement with the z h dependence of the data in Figs. 4-6, we note that analysis of the same data differential in the hadron transverse momentum using existing PDFs and FFs within TMD factorization results in poor agreement between predictions and data [4,5], indicating that further work is needed to understand the SIDIS transverse momentum spectra.
For the SIA data sets, there is a somewhat wider spread in the data versus theory comparisons, as seen in Figs. 7, 8 and 9 for the π ± , K ± and unidentified charged hadron h ± final states, respectively. Generally, the π ± data have the best agreement among the SIA datasets, with a reduced χ 2 red = 1.09, followed by the hadron data with χ 2 red = 1.15, and lastly the kaon data, which have an overall reduced χ 2 red = 1.37. For about 3/4 of the ≈ 40 SIA datasets, we find very good agreement with the global fit, with χ 2 red ≈ 1 or below. For the remaining datasets that have larger χ 2 red values, to better understand the reasons for some of the tensions between data and theory we discuss in the following some individual cases ranked by the reduced χ 2 red values.
Starting with the datasets that have the largest χ 2 red values, namely, χ 2 red 3, we identify the OPAL (π ± and c → K ± ), TPC (K ± ), SLD (π ± and c → K ± ), DELPHI (K ± ), and TASSO (π ± and h ± at 35 GeV) datasets. For the inclusive OPAL (π ± ) data, we observe in Fig. 7 that for z h < 0.5 the data are indeed in tension with the corresponding inclusive ALEPH and SLD results, and the overall trend of the data/theory ratio suggests a possible FIG. 9. As in Fig. 7, but for SIA unidentified charged hadron h ± production.
normalization issue with this dataset. This can also be said for the DELPHI (K ± ) which appears to have some tension with the corresponding inclusive OPAL and ALEPH results.
Similarly, from Fig. 8 we find that the TPC (K ± ) spectrum lies below the theory, suggesting again a normalization problem with these data. The situation for the TASSO (π ± ) data is less clear, as only the Q = 14 GeV dataset seems to give a bad fit, while data at other energies can be described fairly well. This again hints at a problem with the overall normalization for this dataset. The same behavior appears also in the TASSO (h ± ) data in Fig. 9, where both the Q = 35 and 45 GeV datasets are above the theoretical cross sections. The case of SLD and OPAL (c → K ± ) data in Fig. 8 shows a clear overestimation of the z h spectra. While one can argue that this problem could be a reflection of the need for a more sophisticated heavy quark treatment in our theory, the description of b-tagged data from SLD, DELPHI and OPAL is relatively good, so that an explanation in terms of a normalization uncertainty in the SLD and OPAL (c → K ± ) data may be more relevant.
For SIA datasets that have smaller, but still large, χ 2 red values, 2 χ 2 red 3, we identify the b-tagged TPC (b → π ± ), OPAL (b → π ± ), and TASSO (h ± at 44 GeV). For the case of the TPC (b → π ± ) data, we see from Fig. 7 that for the largest z h bin the theory overestimates the data. On the other hand, good agreement is found for the SLD (b → π ± ) data at the same kinematics. It is possible that at the smaller Q values of TPC relative to SLD, the range in z h where leading power factorization is applicable is narrower, in particular for the b-tagged data. The z h dependence of the OPAL (b → π ± ) data appear to be clearly different from the theory, even within the large uncertainties. We note here that the OPAL data are presented as truncated moments as a function of the lower limit of the integration, z min h , and the inclusion of the very high z h bins may be problematic for the validity of factorization theorems at z h → 1. Lastly, as with TASSO (h ± at 35 GeV), the somewhat large χ 2 red values for the 44 GeV data is likely attributable to a problem with overall normalization.
For datasets that have χ 2 red 2, we consider the agreement to be generally acceptable.
Finally, we note that most of the large χ 2 red values found in this analysis were absent in the previous JAM Monte Carlo analysis of fragmentation functions [19]. The main reason is the restriction of the SIA datasets here to the range 0.2 < z h < 0.8, chosen to coincide with the range over which the SIDIS data in this work are able to be described within collinear factorization. For the LEP data in particular there are many data points at z h < 0.2 which can be well fitted within the current framework, and which would reduce the overall χ 2 red . A careful point by point comparison of the individual χ 2 red values for the various datasets indeed confirms that similar discrepancies also occurred in Ref. [19]. However, for consistency in our joint analysis of PDFs and FFs, we restrict the kinematic range to the region where both SIA and SIDIS can be simultaneously described. The same choice for the z h range was made in the recent JAM19 analysis, which required SIDIS data to be restricted to z h 0.2 to ensure separation of the target and current fragmentation regions.

B. Parton distributions and fragmentation functions
The proton PDFs from our simultaneous fit are displayed in Fig. 10 at a scale µ 2 = 10 GeV 2 , where we focus on the kinematic region of parton momentum fractions x 0.01 that is constrained by the SIDIS data. For comparison, we contrast our results with other next-to-leading order PDF parametrizations, namely, from the CJ15 [52] and NNPDF3.1 [53] global analyses. NNPDF3.1 fit the uncertainties are smaller because of their inclusion of the neutrino DIS data, which we do not include in our analysis because of unknown nuclear corrections in neutrino scattering [54][55][56]. Our light antiquark asymmetryd −ū is also compatible with the other groups, but again with a larger uncertainty, which may be related to the absence of collider W and lepton asymmetry data in our fit. Finally, for the gluon distribution, the magnitude and uncertainties are very similar across all the analyses, even though our fit does not include jet production data from hadron colliders. This reflects the fact that the HERA DIS data, which are included here, provide strong constraints on the shape of the gluon PDF via scaling violations.
For the parton to hadron FFs, we show in Fig. 11 the z dependence of the FFs at a scale µ 2 = 100 GeV 2 for the positively charged π + , K + and unidentified hadrons h + , as well as for the residual hadrons δh + , defined as the difference between h + and the sum of π + and K + (so that the total is given by h + = π + + K + + δh + ). For most of the flavors we find that the quark → π + fragmentation dominates, as expected from the pion being the lightest hadron in the QCD spectrum. Exceptions to this are fors → K + and c → K + at intermediate z values, and for b quark fragmentation into residual hadrons δh + .
For gluon fragmentation, pion production dominates for z up to ∼ 0.5 − 0.6, above which kaon fragmentation becomes as sizeable as the pion. This is consistent with the findings of previous FF analyses [19,57], which observed that the production of heavier particles such as kaons requires larger momentum fractions from the fragmenting gluon compared to the production of lighter particles.
The production of hadrons heavier than kaons, as indicated in Fig. 11  charm quarks into pions and kaons, but a rather different pattern for the fragmentation of bottom quarks. Some of this difference can be explained by the flavor-changing properties of u-type quarks decaying into d-type quarks. While the charm quark can decay into strange quarks and hence enhance K + production, the same does not occur for bottom quarks, which suppresses kaon production relative to pion production due to the mass difference.
Interestingly, the production of other species of charged hadrons is much larger for b quarks than for c quarks, which may be understood from the greater phase space available for b quarks to decay into heavier hadrons to which charm quarks cannot transition.
In Fig. 12 we present truncated moments for each flavor i and final state hadron h, where we take the lower limits on the z integration z min = 0.2 to restrict the moment to the region of SIDIS kinematics. The truncated moment indicate how energetic is the production different a hadron species h relative to the parent parton i. In general, we find that the production of hadron species heavier than pions and kaons is typically produced with lower energies, which is consistent with the physical picture whereby more energy is required to produce heavier hadrons than lighter hadrons.
As expected, the favored fragmentation ofd quarks is predominantly into highly energetic pions, while for the antistranges the production rate of energetic kaons is slightly higher than that of pions. The unfavored fragmentation of d, s andū quarks follows a similar pattern, with the lightest (pion) state produced at the highest energies followed by kaons and other heavier charged hadrons. An exception to this behavior is for charm and bottom quark fragmentation: for c quarks kaons are produced with energies comparable to those of pions, while for b quarks kaon production is suppressed with heavier mass hadrons produced at similar energies as pions.
Interestingly, the production of hadrons from gluons follows the same pattern as for uquark fragmentation. While the latter can be explained in terms of mass differences between the produced hadron species, the fact that u quarks and gluons give a similar average energy profile across hadron species is intriguing. On perturbative grounds one can argue that gluon fragmentation is enhanced because of the C A = 3 factor in the the gluon splitting function, P gg , relative to quark splitting functions, P qq and P gq , which are proportional to C F = 4/3. The absence of direct constraints on the gluon FF beyond scaling violations, however, anything drawing more than speculative conclusions at present.
We conclude the discussion of our numerical results by focusing on the correlation between the strange to nonstrange PDF ratio R s and the strange to kaon fragmentation function D K + s . In Fig. 13 we show R s and thes → K + FF, with individual Monte Carlo samples color coded by the scaled χ 2 red intensity (with darker replicas indicating higher likelihoods) computed for the specific cases of SIA (K ± ) and SIDIS (K + , K − ) datasets. The SIA datasets have a clear preference for a smaller R s and enhanced D K + s , as was found in the previous JAM19 analysis [2]. Interestingly, the SIDIS (K + , K − ) data, which have smaller χ 2 red , have a slight tendency to favor solutions with a larger R s and smaller D K + s , however, this preference is much weaker than the preference of the SIA data for smaller R s values.
We also note that in the current analysis we have extended the flexibility of the PDF and FF parametrizations, which allowed us to obtain a more uniform Monte Carlo distribution of R s compared JAM19, where a more restricted parametrization gave rise to multiple solutions. Our new analysis confirms that the most probable solutions found in JAM19 did not result from parametrization bias, and corroborates the need for a suppressed strange quark PDF in the proton in order to simultaneously describe both the SIA and SIDIS datasets within leading power QCD factorization.

VI. CONCLUSION
In this paper we have presented the results of a simultaneous Monte Carlo analysis of The analysis -referred to as "JAM20-SIDIS" -represents the most comprehensive determination of parton to hadron (π ± , K ± , h ± ) FFs fitted concurrently with spin-averaged parton distributions, broadening the test of universality of parton correlation functions to more observables. The more thorough exploration of the parameter space and reduced χ 2 red values for each of the ≈ 70 datasets fitted in this study confirmed the previous finding [2] that the combination of SIA and SIDIS datasets have a strong preference for a smaller strange to nonstrange PDF ratio, R s , correlated with an enhanced D K + s FF. As further tests of this scenario, we plan in future to extend the experimental datasets to include weak-boson and jet production in hadronic collisions, from both Tevatron and LHC data, as well as to relax the W 2 cuts for inclusive DIS to incorporate more fixed-target DIS data at high x Bj values [58].
An important application of the current results will be in benchmark calculations of transverse momentum dependent cross sections, and in particular for the small transverse momentum region where the transition from collinear factorization to TMD factorization is expected to set in. One motivation for the present project was to assess the possible role of limitations in collinear PDF and FF fits in explaining discrepancies between theory and data in the range of intermediate and large transverse momentum across a number of transversely differential processes [4][5][6][7]. For this, a truly simultaneous analysis of parton distribution and fragmentation functions across the standard set of electromagnetic processes, integrated over all transverse momentum, is necessary. The general success of the collinear fits for transverse momentum integrated cross sections that we have examined here, and their evident predictive power, suggests that factors unique to the transverse momentum differential treatment are responsible for the tension with data. We plan to address this also in future work.  In this appendix we summarize the basic cross section and structure function formulas relevant for our analysis of the semi-inclusive leptoproduction of a hadron h (with fourmomentum p h ) in the deep-inelastic scattering of a lepton (momentum l) from a nucleon N (momentum p) via the exchange of a virtual photon (momentum q), where the final state hadron is integrated over all transverse momentum. The formal setup follows standard methods described in Refs. [59][60][61][62], for example, and we closely follow the specific techniques in Ref. [63] utilizing the double Mellin moment method from Ref. [14].
The spin-averaged cross section is parametrized in terms of the semi-inclusive structure which are functions of the Bjorken scaling x Bj , the hadron fragmentation scaling variable z h , and the four-momentum transfer squared Q 2 , defined in the standard way as with y = q · p/l · p the inelasticity. Our conventions for the semi-inclusive structure function definitions are directly related to the conventions for the inclusive DIS structure functions, with F T = 2xF 1 and F L = F 2 − 2xF 1 the transverse and longitudinal structure functions, respectively. (Note that other conventions use instead F 1 → 2F 1 and F L → F L /x [64].) In the current fragmentation region, the factorization formulas for the semi-inclusive structure functions, expanded to order α s in the hard part, are given by where f i and D h j label the PDF of flavor i in the proton and parton j → hadron h FF, respectively. The functions C 1 ij (C L ij ) are the lowest-order hard scattering coefficient functions for the F 1 (F L ) structure functions, and the symbol ⊗ denotes the convolution integral over longitudinal momentum fractions, [A ⊗ B] (x) ≡ 1 x (dz/z) A(z)B(x/z). Explicit expressions for C 1 ij (C L ij ) are given, for example, in the appendices of Ref. [63].
In all of the expressions above, kinematical corrections from a nonzero target and final state hadron mass are neglected. To focus attention on the current fragmentation region, in our analysis we impose the kinematic cuts z h > 0.2 and W 2 > 10 GeV 2 , and for the hard scale we choose Q 2 > m 2 c . Therefore, the kinematical corrections may not be entirely negligible [65] at the energies of some of the experiments, although for now we set aside a fuller account of their effect to a future dedicated analysis of mass corrections in SIDIS.