Multi-Parton Interactions in pp collisions from Machine Learning-based regression

In this work, we propose a strategy to construct an event classifier sensitive to Multi-Parton Interactions (MPI) using Machine Learning-based regression. The study is conducted using TMVA and the event generator PYTHIA 8.244. The regression is performed with Boosted Decision Trees (BDT). Event properties like forward charged-particle multiplicity, transverse spherocity and the average transverse momentum ($p_{\rm T}$) are used for training. The kinematic cuts are defined in accordance with the ALICE detector capabilities. Charged-particle production in events with large number of MPI (${\rm N}_{\rm mpi}$) is normalized to that obtained in minimum bias pp collisions. After the normalization to the corresponding $\langle {\rm N}_{\rm mpi} \rangle$, the ratios as a function of $p_{\rm T}$ exhibit a bump at $p_{\rm T} \approx 3$ GeV/$c$; and for higher $p_{\rm T}$ ($>8$ GeV/$c$), the ratios are independent of ${\rm N}_{\rm mpi}$. While the size of the bump increases with increasing ${\rm N}_{\rm mpi}$, the behavior at high $p_{\rm T}$ is expected from the"binary scaling"(parton-parton interactions), which holds given the absence of any parton-energy loss mechanism in PYTHIA. The effects are also observed when particle production is studied as a function of the target variable (${\rm N}_{\rm mpi}^{\rm reg}$). Therefore, its implementation on the high-multiplicity pp data would provide valuable information to understand the heavy ion-like effects discovered in small systems. Regarding the application of the trained BDT on the existing pp data, we report that for events with at least one primary charged-particle within $|\eta|<1$ (${\rm INEL}>0$), the average number of MPI in pp collisions at $\sqrt{s}=5.02$ and 13 TeV are 3.76$\pm1.01$ and 4.65$\pm1.01$, respectively.


I. INTRODUCTION
The goal of the heavy-ion program is to understand the behavior of Quantum Chromo-Dynamics (QCD) at high temperatures and densities. Results at the Large Hadron Collider (LHC) confirmed the formation of a new form of matter characterized by deconfinement, which is compatible with the theoretically predicted Quark-Gluon Plasma (QGP) [1]. The main conclusions arose from comparisons of heavy-ion data with reference data, such as minimum-bias pp and p-A collisions, where no signatures of jet quenching were observed. Surprisingly, the multiplicity-dependent analysis of the pp data at √ s = 7 TeV from the LHC, unveiled very similar azimuthal anisotropies as in heavy-ion collisions [2]. The analysis was further extended to lower and higher energies [3], as well as for other systems such as p-Pb collisions at √ s NN = 5.02 TeV [4,5]. Moreover, reports on the enhancement of (multi-)strange hadrons in pp and p-Pb collisions [6,7], as well as the mass ordering in the hadron p T spectra [8,9] suggest that collective phenomena are present at the LHC energies even in small systems. Naturally, it is suggested that the new phenomena could have the same origin as in heavy-ion collisions, namely, the hydrodynamic response of the produced * antonio.ortiz@nucleares.unam.mx † sushanta.tripathy@cern.ch medium to the initial shape of the interaction region in the transverse plane [10]. However, the main concern relies on the applicability of hydrodynamics to small non-equilibrium systems. On the other hand, from the initial state perspective the azimuthal anisotropy is due to the presence of initial state correlations in the nuclear wave functions of the incoming nuclei [11]. The main concern is whether azimuthal anisotropies established during the initial stages of the collision can survive subsequent final state interactions [12]. Another approach relies on partonic and hadronic transport models, for example AMPT [13]. This model qualitatively, and sometimes quantitatively, describes small system flow signals for various collision systems and energies. The big issue is that in contrast to fluid dynamic simulation, its applicability relies on a sufficiently large mean free path, which is hard to reconcile with the idea of the strongly coupled hydrodynamic system. Other alternative microscopic descriptions, like the one provided by PYTHIA [14], which use string models including interactions between strings [15] along with an initial state provided by a smooth distribution of Multi-Parton Interactions (MPI), which can also reproduce several features of data. Results within the string percolation framework have also been reported [16]. Moreover, recently HER-WIG 7 [17], which incorporates a different hadronization scheme, has significantly improved the description of hadron-to-pion ratios as a function of charged-particle multiplicity [18]. lem [19]. From the experimental side, one challenge for pp collisions is the strong correlation between multiplicity (sensitive to low-p T particles) and hard physics (high-p T particles) [20,21]. It has been shown that the correlation is reduced if the event multiplicity is determined in a pseudorapidity region far from where the observable of interest is measured. However, an additional treatment of the unwanted particle correlations (originated e.g. from jets) has to be implemented in data analysis. Having an eventactivity estimator with minor selection bias could help to improve the comparison of pp collisions with larger systems like those created in p-A and A-A collisions. To illustrate the efforts in this direction, particle production as a function of the number of MPI (N mpi ) unveils interesting collective-like effects in PYTHIA 8 simulations with color reconnection [22]. This motivates the introduction of different multiplicity estimators to increase the sensitivity to MPI. For instance, the relative transverse activity classifier was recently proposed to study the hadronization in events with an extreme underlying event [23]. However, this requires a cut on the transverse momentum of the leading particle, which biases the sample towards hard processes in a non trivial way [24,25]. In this paper, we propose the use of a Machine Learningbased regression to build an event classifier aimed at reducing the selection biases and increasing its sensitivity to MPI. Different tests were performed including extreme cases in the simulations like switching off MPI or allowing the independent fragmentation through switching off color reconnection. Based on this approach, we estimate N mpi using the existing so-called INEL > 0 AL-ICE data [20].
The paper is organised as follows: section 2 describes the multivariate analysis, where the input variables and the model used for the study are discussed. Results are presented in section 3, and finally section 4 contains a summary and outlook.

II. MULTIVARIATE MPI-ACTIVITY ESTIMATION
Our approach relies on a multivariate regression technique based on Boosted Decision Trees (BDT) with gradient boosting training. This is done using the Toolkit for Multivariate Analysis (TMVA) framework which provides a ROOT-integrated machine learning environment for the processing and parallel evaluation of multivariate classification and regression technique [26]. In particular, the construction of an event classifier sensitive to the MPI activity (N mpi ) can be considered as a regression problem where, given a set of input variables tries to minimize the loss function. Such a function describes how the model is predictive with respect to the training data. For the regression problem, TMVA implements the Huber loss function [27].
The training for the MPI-activity estimation is performed on simulated samples of pp collisions at 13 TeV. PYTHIA 8.244 [14] event generator (tune Monash 2013 [28]) is used in our studies. Two samples are employed to check the performance of the method for the estimation of the average N mpi both in MB and high N mpi events. The first sample yields a flat N mpi distribution and the second one is that obtained from MB events. The goal of the analysis is to estimate the number of MPI, therefore N mpi is the target variable in our analysis. The MVA uses several input variables, which are chosen given their correlation with N mpi . Another important factor related to the choice of the variables relies on how well PYTHIA 8.244 describes such features of data. We choose PYTHIA 8.244 Monash 2013 tune as it has been tuned to describe many features of LHC data. In particular, it describes correlations like average p T as a function of mid-rapidity multiplicity, rather well [29]. Given that the tune only consideres observables with unidentified primary-charged particles, only quantities derived from unidentified charged particles are used in the present analysis to train the BDT, which are listed below: • Transverse spherocity: this quantity allows one to know whether a dijet-like structure is present in the event [30]. It is defined for a unit vectorn s which minimizes the ratio: where the sum runs over all primary charged particles with p T > 0.15 GeV/c and within |η| < 0.8.
In agreement with ALICE requirements [20], only events with more than two particles are selected.
As outlined in Ref. [20], spherocity has some important features: -The vector products are linear in particle momenta, therefore spherocity is a collinear safe quantity in pQCD. -The lower limit of spherocity (S 0 → 0) corresponds to event topologies where all transverse momentum vectors are (anti)parallel or the sum of the p T is dominated by a single track.
-The upper limit of spherocity (S 0 → 1) corresponds to event topologies where transverse momentum vectors are "isotropically" distributed. S 0 = 1 can only be reached in the limit of an infinite amount of particles.
• Average transverse momentum: the first moment of the charged-particle transverse momentum spectrum and its correlation with the charged particle multiplicity, encodes information about the underlying particle production mechanism. In particular, in PYTHIA the rise of the average p T with the event multiplicity can only be explained if collective-like effects are included in the simulations (color reconnection). Therefore, this quantity is sensitive to the hadronization mechanism.
• Forward multiplicity: it is determined within the pseudorapidity regions 2.8 < η < 5.1 and −3.7 < η < −1.7, which matches the intervals covered by the ALICE VZERO detector. This has been used by the experiment in order to reduce the autocorrelations, which may affect the spectral shape of the transverse momentum distribution.
The method was trained using simulations at the highest center-of-mass energy achieved by the LHC during run II (13 TeV). Different conditions were varied to estimate a systematic uncertainty. Namely, different sets of input variables were also used (e.g. average p T and mid-pseudorapidity multiplicity), as well as simulations with different N mpi distribution. The trained BDT were applied to simulations at lower energies, and also to simulations which does not include MPI. To check the robustness of the trained BDT against collective-like effects in small systems [22], simulations without color reconnection were also used. The variations of the target value with respect to the real number of MPI was assigned as systematic uncertainty. Figure 1 illustrates the performance of the regression for minimum-bias simulations at lower energies ( √ s = 0.9, 2.76, 5.02, 7, and 13 TeV), the boxes around the points corresponds to the sigma of the N reg mpi − N mpi dis- tribution. Within uncertainties, the method reproduces the energy dependence of N mpi . The performance of the method where color reconnection was not included in the simulations is also shown. Within uncertainties, N reg mpi is independent of color reconnection, suggesting that the method is robust against hadronization models. Moreover, in events where MPI were not activated, the method gives an activity which within uncertainties is below 3.

III. RESULTS
The implementation of the trained BDT to select events with large N mpi also exhibit encouraging results. Figure 2 shows the behavior of particle production as a function of N mpi (left), N reg mpi (middle) and charged-particle multiplicity at mid-pseudorapidity, dN ch /dη (right), in pp collisions at √ s = 2.76 TeV. The results are qualitatively similar at other energies including 13 TeV. Here, the results for 2.76 TeV are shown to illustrate that albeit the BDT was trained for 13 TeV, its discrimination power holds even for lower energies. The study includes the proton-to-pion ratio as a function of p T , as well as a quantity called R pp , which is motivated by the nuclear modification factor used to quantify parton energy loss effects in heavy-ion collisions [31]. R pp is defined as follows: where, N mpi π is the charged-pion production in events with N mpi . Similarly, the pion production in MB events is given by N MB π with average N mpi represented by N mpi, MB . Given the requirement for spherocity calculation, the MB sample corresponds to events with more than two primary charged particles within |η| < 0.8 and p T > 0.15 GeV/c, however, the conclusion remains the same for the most inclusive sample. For the event selection based on N reg mpi and dN ch /dη, their corresponding average values ( N reg mpi and dN ch /dη , respectively) are used in Eq. 2 instead N mpi .
Regarding the analysis as a function of N mpi (top left panel), while R pp is N mpi independent and close to unity [32] at high p T (p T > 8 GeV/c), R pp develops a bump at intermediate p T (1-8 GeV/c) with increasing N mpi . The former effect is consistent with a binary parton-parton scaling which holds given the absence of any parton-energy loss mechanism in PYTHIA. Regarding the behavior at intermediate p T , the bump is attributed to color reconnection [22,24], which mimics collective effects. Albeit the effect is rather large (≈ 40%), it is worth mentioning that, given the limitations of the multiplicity estimators used in the experiments [20], the bump has not been observed in pp data [18]. This is illustrated in the top right-hand-side panel, which shows the effects of autocorrelations in events selected using the mid-pseudorapidity estimator. We want to highlight the fact that using a regression, one can reduce the selection bias and increase the sensitivity to N mpi . The top middle panel of Fig. 2 Primary charged pion Rpp as a function of pT (top) and proton-to-pion ratio as a function of pT (bottom). Results are presented for different event classes based on the actual number of multi-parton interactions (left), the target variable (middle), and mid-pseudorapidity charged-particle multiplicity (right). Results from simulations including color reconnection are shown with full markers, while the case where color reconnection is switched off is displayed with empty markers. The boxes around one indicate the estimated uncertainty associated to event selection.
which qualitatively (and sometimes quantitatively) recovers the main characteristics of the N mpi dependence. The plot also includes the uncertainty related to event selection, which is shown as boxes around one. It has been derived from the average deviation of N mpi with respect to N reg mpi , it is around 30% at N reg mpi = 4 and it is reduced to 17% at higher N reg mpi = 15. It is worth noticing that the effects discussed above are larger than such uncertainties. The implementation of this event selection in pp and p-Pb LHC data would definitely provide valuable information on the production mechanisms, as well as it will help in understanding the similarities with larger systems like those created in A-A collisions in a better way.
We point out that the size of the bump in R pp is hadron mass dependent, whose behavior resembles the features of the R p−−Pb for identified hadrons [9]. To illustrate the hadron mass dependence as a function of the event activity, the bottom panel of Fig. 2 shows the proton-to-pion ratio as a function of p T for the event classes described above. As reported in Ref. [22], the particle ratio gets depleted (enhanced) at low (intermediate) p T with increasing N mpi . A similar feature is also observed when the analysis is performed as a function of N reg mpi . In simulations without color reconnection, the particle ratios are independent of N mpi and N reg mpi within a few percents.
The effect is not observed when the analysis is performed as a function of the charged-particle multiplicity. Last but not least, using the existing ALICE data on p T spectra as a function of mid-pseudorapidity multiplicity estimator, the average number of MPI was estimated using the trained BDT. Figure 3 shows the number of MPI values obtained from regression along with PYTHIA 8.244 calculations. In our approach, the average number of MPI in (INEL > 0) pp collisions at √ s = 5.02 and 13 TeV are 3.76±1.01 and 4.65±1.01, respectively. The INEL > 0 class defined by ALICE corresponds to pp collisions with at least one primary charged particle within |η| < 1.
In summary, we propose the use of multivariate techniques in order to build more robust event classifiers for the better understanding of the similarities observed in different collision systems. The proposed event classifier can be used to test the MPI model in bigger systems like p-A and A-A collisions. Also, it can be used to refine the jet quenching searches in small systems.

IV. CONCLUSIONS
In this work, we have proposed a new way to analyse the pp data using Machine Learning-based regres-sion methods. We have shown that using input variables like charged-particle multiplicity, average transverse momentum and transverse spherocity, one can estimate the number of partonic interactions (N mpi ). The target variable N reg mpi was used to build R pp , which is analogous to the nuclear modification factor used in A-A collisions to study the parton-energy loss effects. Within uncertainties, this quantity is independent of N reg mpi and close to unity at high p T (> 8 GeV/c). Moreover, at intermediate p T (1-8 GeV/c) a bump is observed in events with large event activity. The effect is attributed to multi-parton interactions and color reconnection, and has not been observed in data. Regarding the available ALICE data on p T spectra as a function of multiplicity, the trained methods were applied to such data. In our approach, the average number of MPI in (INEL > 0) pp collisions at √ s = 5.02 and 13 TeV are 3.76±1.01 and 4.65±1.01, respectively.

ACKNOWLEDGMENTS
We acknowledge the technical support of Luciano Diaz and Eduardo Murrieta for the maintenance and operation of the computing farm at ICN-UNAM. Support for this work has been received from CONACyT under the Grant No. A1-S-22917. S. T. acknowledges the postdoctoral fellowship of DGAPA UNAM.  [20].