Towards a full solution of relativistic Boltzmann equation for quark-gluon matter on GPUs

We have developed a numerical framework for a full solution of the relativistic Boltzmann equations for the quark-gluon matter using the multiple Graphics Processing Units (GPUs) on distributed clusters. Including all the $2 \to 2$ scattering processes of 3-flavor quarks and gluons, we compute the time evolution of distribution functions in both coordinate and momentum spaces for the cases of pure gluons, quarks and the mixture of quarks and gluons. By introducing a symmetrical sampling method on GPUs which ensures the particle number conservation, our framework is able to perform the space-time evolution of quark-gluon system towards thermal equilibrium with high performance. We also observe that the gluons naturally accumulate in the soft region at the early time, which may indicate the gluon condensation.


I. INTRODUCTION
Relativistic Boltzmann equation (BE), an effective theory of many-body systems, is a profound and widely used tool to study the properties of the systems out of equilibrium or in thermal equilibrium. Recently, BE is often applied to study the problem of early thermalization, which remains to be one of the "greatest unsolved problems" [1] in relativistic heavy-ion collisions, which collide two accelerated nuclei to create a hot and dense deconfined nuclear matter, named quark-gluon plasma (QGP). The space-time evolution of QGP has been well described by relativistic hydrodynamics simulations. The success of hydrodynamical models on soft hadron production and collective flows provides strong evidence for the rapid thermalization of the quark-gluon system to create a strongly-interacting QGP [2][3][4][5][6].
The time scale expected for thermalization is estimated to be less than 1fm/c [4] or even shorter than 0.25fm/c [7,8] in nucleus-nucleus collisions at Relativistic Heavy-Ion Collider (RHIC) and the Large Hadron Collider (LHC). However, it remains to be a puzzle how an over-occupied gluonic system with weak coupling can reach thermal equilibrium within such a short time scale [5].

A. Background
The study of the systems under the framework of BE with suitable initial conditions, also referred to as the kinetic approach, is a well-established method for probing the real-time quark and gluon dynamics in the dilute regime at weakly coupling limit [9][10][11]. However, a full solution of the relativistic BE involving all parton species, e.g., u, d, s quarks, their antiparticles, and gluons, is still challenging both analytically and numerically due to the complexity of the collision integral, higher dimensions and computing resources.
The typical initial condition for relativistic heavy-ion collisions [12] is an overpopulated gluonic state named Color Glass Condensate (CGC) [13][14][15][16][17][18][19][20], which is formed in the dynamical balance between the splitting and fusion of gluons in the small-x region. The occupation number of small-x gluons is of order 1/α s [21]. The gluon number grows until the gluon size is larger than 1/Q s [22], with Q s being the saturation scale. After the CGC state, the glasma, a state of color electromagnetic fields, may be formed [22][23][24][25][26][27]. Currently, how the glasma transits to a thermalized QGP in a short time scale is not well understood yet, which is often referred to as the early thermalization puzzle.
There have been various studies focusing on the evolution of the glasma stage. For example, the "bottom-up scenario" [28] estimates a thermalization time of order τ th ∼ α −13/5 s /Q s , which is unfortunately too large compared to the time scale required by the hydrodynamics. To reconcile this discrepancy, many other possible mechanisms have been considered.
One interesting mechanism is the so-called plasma instability [29][30][31][32][33][34][35]. It originates from the anisotropic momentum distribution in the plasma and may drastically speed up the process of glasma equilibration [36][37][38]. However, some studies also imply that the plasma instability may not play a significant role at the early stage [39,40], and a scaling solution may also be required [41]. Another mechanism is the Bose-Einstein condensation of gluons. It has been suggested that the gluon condensation at the early stage [42,43] may accelerate the thermalization process [44][45][46][47][48]. The influence of fermions and masses in forming condensation has also been discussed in Refs. [49][50][51][52][53]. However, it is argued that the inelastic scattering processes may strongly hinder the effect of gluon condensation [54,55]. In fact, the role of inelastic scatterings in the thermalization is still not quite clear so far. It has been suggested long time ago that the inelastic processes might be essential for thermalization [56]. Some works by solving BE with the test particle method including 2 → 3 processes has obtained the results close to the "bottom-up" scenario [57]. Another simulation from the Boltzmann approach of multiparton scattering (BAMPS), a package for solving BE using test particle method, suggests that the bremsstrahlung from 2 → 3 processes increases the efficiency of thermalization [58]. Later studies from BAMPS imply that the inverse processes, i.e. the 3 → 2 processes, inhibit the 2 → 3 processes. With both processes included, the time scale of thermal equilibration from the BAMPS is of order α −2 s ln(α s ) −2 Q −1 s [59]. It has also been argued that that the inclusion of all next-to-leading order processes may make the equilibration considerably faster than the simple 2 → 3 processes [60].

B. Motivation
Some of the above studies require solving relativistic BE numerically. Historically, a full numerical solution of the non-relativistic BE has always been a challenge due to its high dimensions and the intrinsic physical properties [61][62][63]. Even in today's petascale clusters, BE still presents a substantial computational challenge [63]. In the real application of non-relativistic BE, one usually needs very dense spatial grids to describe complicated effects related to pressure, temperature, and turbulence, etc.
The main difference between the relativistic and non-relativistic BE lies in the collision term and the coupled equations. In the relativistic BE, the collision integrals are usually much more complicated than those in the non-relativistic case. For example, in our relativistic BE, there are seven particle species with complicated scattering matrix, while in the non-relativistic case, one usually deals with single species of particles in a gas or liquid. Despite the heavy workload of the collision term, our relativistic BE also contains coupled equations of seven species, which need to be solved simultaneously. Furthermore, the distributions for fermions should not exceed unity due to Pauli exclusion principle. This limitation requires that the time step of the evolution be sufficiently small, which in turn drastically slows down the speed of the code.
Though complicated, several numerical tools based on the parton cascade model [64,65] have been developed in the market, e.g., BAMPS [66] and ZPC [67,68]. Another way is the lattice Boltzmann approach [69][70][71], which has been utilized as a fast lattice Boltzmann solver for relativistic hydrodynamics with relaxation time approximation [62,72,73]. In addition, the straightforward implementations of effective kinetic theory (e.g. see Ref. [9] for the theoretical framework) for pure gluons [74,75] and quark-gluon systems [51,52,76] also provide us physical insights for pre-thermalization. While these models and approaches have succeeded in describing the non-equilibrium evolution of quark-gluon matter with a set of parameters given by the physical consideration for simplicity, a full solution of the BE is still demanded for a comprehensive understanding of the thermalization puzzle.

C. Our numerical framework
In this study, we develop a numerical framework for the full numerical solutions of the relativistic BE with the help of the state-of-art GPUs. GPU, known for its high clock rate, high instruction per cycle [77], and multiple cores, makes more and more contributions to computational physics nowadays [78]. Some calculations, which are extremely difficult in the old-time, are now within the scope [79][80][81][82][83][84][85][86][87]. Some physical phenomena could be understood by the simulations via GPUs [88]. With the new GPU techniques, various attempts have been done to tackle the BE from different aspects for non-relativistic cases [63,[89][90][91][92][93]. The GPU techniques also motivate us to develop the framework for solving the relativistic BE in the context of relativistic heavy-ion collisions, despite a series of methodological challenges that have no counterparts in the non-relativistic realm. As a first step, we only consider the 2 → 2 scattering processes in the current work.
The difficulties in solving the relativistic BE originate from two aspects. First, the collision terms are high dimensional integrals. In this work, we use the package ZMCintegral 5.0 [94,95] to perform these high dimensional collisional integrals. ZMCintegral is an opensource Python package developed by some of the authors in this work. The second difficulty is the issue of particle number non-conservation due to the discrete Monte Carlo (MC) integration, see also Refs. [96,97]. To achieve a strict particle number conservation in the CPU framework will usually cost lots of computing time. In our work, we propose a "symmetrical sampling" method for the collisional integrals via GPUs. With the help of new features of GPUs, named CUDA atomic operations [98,99], we can achieve the strict particle number conservation with acceptable computing time.
Our numerical framework provides a full solution of the BE with complete 2 → 2 scattering processes. The program is developed with the combination of Python library Numba [100] and Ray [101], which enable the manipulation of GPU devices on distributed clusters.
We will test the code in many aspects, for instance, the stability of collisional integrals, the particle number conservation, and the total energy conservation. We will also show the time evolution of the distribution functions in both coordinate and momentum spaces.
The structure of this paper is as follows. In Sec. II, we briefly review the ordinary BE for thermal quarks and gluons. Next we introduce our numerical framework in Sec. III.
We first discuss how to get stable numerical results in collision integral in subsec. III A and then introduce the method to keep the particle numbers conserved in subsec III B.
We will present our numerical results in Sec. IV. In subsec. IV A and IV B, we test the stability of the collisional integrals and the particle number conservation. We discuss the total energy conservation in subsec. IV C. With some physical initial conditions, we show the time evolution of the system in both coordinate and momentum spaces in subsec. IV D and IV E, respectively. The summary of our paper will be presented in Sec V.
Throughout this work, we choose the metric g µν = diag{+, −, −, −} and the space and momentum four vectors as x µ = (t, x) and p µ = (E p , p). For a momentum k µ a , we choose µ = (t, x, y, z) for the space-time index and a = 1, 2, 3 for u, d, s flavors.

II. BOLTZMANN EQUATION FOR QUARK GLUON MATTER
In this section, we will briefly review the ordinary BE for the thermal quarks and gluons in the leading-log order. More details can be found in our previous systematic studies in Ref. [10,102].
The relativistic BE, which is an effective theory for relativistic many-body systems, describes the evolution of system in the phase space. The general expression for the BE reads, where f p (t, x, p) is the distribution function and C[f ] is called the collision term. The ∂x/∂t and ∂p/∂t are the effective velocity and effective force for the particles, respectively. One can derive the ∂x/∂t and ∂p/∂t from the equation of motion of the action for a single particle. In our study, the action in the classical level, i.e. up to the order of 0 , is S = dt(p · dx dt − E p ), with E p being the particle's energy. For simplicity, we neglect the particle's physical mass, but the non-vanishing thermal mass m(x) depends on the space-time in general. Accordingly, the particle's energy E p (x) ≡ p 2 + m 2 (x) depends on the space-time.
For thermal quark-gluon matter, the BE has the following general structure [9,10,103]: where f a p (t, x, p) denotes the color and spin averaged distribution function for particle a, and a = q,q, g stands for quarks, anti-quarks and gluons. E a p (x) = p 2 + m 2 a (x) and C a are the energy and collision term for particle a, respectively. −∇ x E a p is an effective force, which comes from the equation of motion of ∂p/∂t [9,10,103].
In the present work, we only consider the 2 → 2 scatterings. The collision term for a quark of flavor a can be obtained, where N q = 2 × 3 = 6 is the quark helicity and color degeneracy factor and the factor 1/2 is included when the initial state is composed of two identical particles. For a gluon, the collision term reads where N g = 2 × 8 = 16 is the gluon helicity and color degeneracy factor.
The collision term for 2 → 2 scatterings, a(k 1 ) + b(k 2 ) → c(k 3 ) + d(p), has the following general expression, where and M ab→cd is the matrix element in which all colors and helicities of the initial and final states are summed over. We summarize all 2 → 2 scattering matrix elements in Table I in the Appendix V A. In our numerical calculation, the tree-level matrix elements for all 2 ↔ 2 scattering processes are set as the default configuration. We also provide an application programming interface (API) for users to define their own matrix elements for some specific purposes.
When the system reaches the global thermal equilibrium, the distribution functions should satisfy the ordinary Bose-Einstein or Fermi-Dirac distributions, where T is the temperature and µ a is the chemical potential for particle a.
The thermal masses of gluon and quark (anti-quark) are usually written as [10,103], where N f is taken to be 3 a we only consider q i = u, d, s quarks and their anti-particles. Note that, when adding the contribution of 1 → 2 scattering, the thermal mass can be introduced in a systematic way, and the invariant momentum differential piece d 3 p/[(2π) 3 E p ] could be replaced by d 3 p/[(2π) 3 |p|] in all momentum integrals, see e.g. in [9,51,52]. In the first time step of evolution, since we have no prior information of particle masses, we will use E p = |p| to perform the calculation in Eq. (8) and (9). In later time steps, we will use the normal E a p (x) = p 2 + m 2 a (x). This iteration approach is reasonable since the difference |E p − p| from non-zero masses is of higher order [10].
In our calculations, we need to check the total particle number conservation. The particle number is defined as: Note that, in general, the total number for each type of particles (e.g. gluons, quarks and anti-quarks) is not conserved due to the strong interaction. However, since we only consider 2 → 2 scatterings in this work, the total particle number N g + is conserved. We will check and confirm the total particle number conservation at each time step of our numerical simulations.
Since the thermal masses of particles depend on the space and time, the ordinary kinetic energy-momentum tensor, is not conserved [103]. Instead, we have: where S ν ex (x) is a source term due to the mass variations.

III. NUMERICAL FRAMEWORK FOR SOLVING BOLTZMANN EQUATION
In order to keep the total particle number conserved, we first rewrite the BE in Eq. (2) as follows, where we have used the following identity, The form of Eq. (13) ensures the conservation of total particle numbers when periodical boundary conditions are applied in phase space. Then using central difference, we can express the left-hand side of Eq. (13) into discrete form as follows, A. The δ-function and the collision term Now we look at the collision term in the right-hand side of Eq. (2) or Eq. (13). Usually, one can integrate over the momentum d 3 k i with the δ-function, which can reduce the number of integral variables. Here, we have two choices: either expressing k 2 by k 3 + p − k 1 or expressing k 3 by k 1 + k 2 − p. These two choices can lead to different numerical behaviors. In this work, we choose the first choice which will make our numerical integrations more stable than the second one (as we will show later).
With the help of the δ-function, we can integrate over d 3 k 2 and obtain, where we have used with There are two roots for k 1z from the equation Substituting Eq. (17) into Eq. (5), we obtain the collision term, which consists of a 5dimensional integration and may be calculated numerically by using the direct MC method on GPU. With the help of the packages Ray and Numba, we can solve the Boltzmann equation (2) for all 2 ↔ 2 scattering processes on the distributed GPU clusters.
Before we present our numerical results, we would like to discuss a little more about the difference between integrating over k 2 or k 3 when we use the δ-function. The difference comes from solving k 1z from the equation In our first choice in Eq.
(17), k 1z is a function of E 3 + E p . Since E 3 + E p is always positive for all k 3 and p, the integration of the collision term in Eq. (2) is stable. If we integral out k 3 in the δ-function, then k 1z will become a function of E 2 − E p , which could flip its sign when we change k 2 and p. Therefore, the integral in Eq. (2) using the second setup is not as stable as using the first setup. We will discuss more on the stability of collision term in Sec. IV A.

B. Particle number conservation and symmetrical sampling
Since we use the direct MC method to compute the collision integral, the total particle numbers are not strictly conserved in each time step due to the randomness of MC sampling.
Such non-conservation of particle numbers can accumulate with time and may affect the result at later time steps. To ensure a strict particle number conservation, we introduce a method named "symmetrical sampling" on GPUs. Here we use the process of gluon scattering as an example to illustrate the basic idea of our "symmetrical sampling" method. To calculate the collision term C gg→gg (x, p) in Eq. (5), we need to sample a series values of the integration variables (k 1x , k 1y , k 3x , k 3y , k 3z ). The collision term can be written as, where V domain is the volume of the integration domain, and the kernelc p denotes,  Figure 1. Illustration of the "symmetrical sampling" method. The samplec g p (sample s i ) will be used by four integrations. This symmetrical reuse of samples leads to a conserved particle number. We call this trick of using samplec g p (sample s i ) for all four momentum grids as "symmetrical sampling".
The sample points have been quadrupled.
Let us consider one specific samplec g p (sample s i ) in Eq. (20) (also see Fig. 1). In the usual MC sampling, one will only add the contribution ofc g p (sample s i ) to C k 1 +k 2 →k 3 +p (x, p). This sample will not influence the values of C k 3 +p→k 1 +k 2 (x, k 1 ), C k 3 +p→k 1 +k 2 (x, k 2 ) and C k 1 +k 2 →k 3 +p (x, k 3 ), for which one will compute their correspondingc g k 1 ,c g k 2 andc g k 3 separately. Due to such independence of C gg→gg (x, p) at each grid, one cannot achieve a strict particle number conservation in the MC approach.
To fix the issue of the particle number non-conservation, we reuse the value ofc g p (sample s i ). Actually, for a given set of p, k 1 , k 2 , k 3 satisfying k 1 + k 2 = k 3 + p, the kernelsc g p ,c g k 1 ,c g k 2 andc g k 3 in different collisional integrals are related to each other due to the symmetry in the scattering amplitude |M ab↔cd | 2 for given p, k 1 , k 2 , k 3 and k 1 +k 2 =k 3 +p .
The above relation can be easily seen from Eq. (21). If we switch particle 1 and p in Eq. (21), we only change the sign of the term f a Therefore, in our program, we will also use the value ofc g p (sample s i ) for C k 3 +p→k 1 +k 2 (x, k 1 ), C k 3 +p→k 1 +k 2 (x, k 2 ) and C k 1 +k 2 →k 3 +p (x, k 3 ) as well. We call this trick, to use the samplec g p (sample s i ) for all four momentum grids, the "symmetrical sampling" method. This means that the sample points have been quadrupled.
With our "symmetrical sampling" trick, we can avoid the errors from the related samples in the collisional integrals at different momentum grids. Accordingly, we can obtain a strict particle number conservation. In principle, one can apply this method in the direct MC sampling based on CPU approaches. However, it usually takes lots of computing time and is hard to implement. Fortunately, with the help of the feature in GPUs, named CUDA atomic operation [98,99], the extra time for implementing symmetrical sampling is almost negligible.
As explained by the official documents [104], "Atomic operations are operations which are performed without interference from any other threads. Atomic operations are often used to prevent race conditions, which are common problems in multithreaded applications." In our case, each value of the arrayc g p , whose element represents a specificc g p value at momenta p, is saved in the global GPU memory. During the process of parallel evaluations, at the same momenta p, we obtain "simultaneously" many values forc g p from different threads [we will also obtain the value ofc g p fromc g k 3 , −c g k 2 , −c g k 1 as discussed in Eq. (22)]. Since the accumulation of these values can only be performed sequentially, when one value ofc g p in a GPU thread is calculated and being accumulated to this global memory arrayc g p , all the other threads do not have the access ofc g p at p. These processes in GPUs refer to CUDA "atomic operation". Compared with parallel CPU manipulation, CUDA is extremely fast at performing this atomic operation, which enables a fast "symmetrical sampling". Similar strategy has also been used in CPUs implementation to ensure particle number conservation, e.g. see Ref. [51,52,[74][75][76].
We note that while the condition k 1 + k 2 = k 3 + p ensures that the integration always preserves energy and momentum conservation in all microscopic processes, the total energy as computed via Eq. (11) might not be strictly conserved due to the discrete grids. To explain the possible non-conservation of the total energy, we can take a close look at Eq. (13). By integrating both sides of Eq. (13) over d 3 xd 3 p/(2π) 3 and using the definition of total particle number in Eq. (10), we obtain the time variation of particle number where a = g, q(q). Our symmetrical sampling method in the collisional integral C[f ] guarantees the time reversal symmetry in all microscopic processes, e.g. in Eq. (22). Such time exactly zero numerically [105,106], which ensures the strict particle number conservation.
Similarly, by integrating Eq. (13) over d 3 xd 3 p/(2π) 3 with the multiplication of p 0 on both sides and using the definition of energy-momentum tensor in Eq. (11), we get the time variation of total energy, where S µ ex is the source term in Eq. (12). For simplicity, let us assume the mass is constant, then the source term vanishes. Although the time reversal symmetry still holds, the errors could come from the discrete grids for p 0 . Therefore, eventually, the integral is not strictly zero numerically. On the other hand, from the above analysis, we could expect that the errors for the non-conservation of total energy will decrease if we increase the number of grids. We will address this point in more details in Sec. IV C.

MENTUM SPACE
In this section, we will first test several aspects of our program, and then show the time evolution of the distribution in both coordinate and momentum spaces. The stability of the collision integrals will be tested in Sec. IV A. The check of particle number conservation will be presented in Sec. IV B, and our results show that the particle number is strictly conserved. The total energy conservation is checked in Sec. IV C. It is found that even though the total energy is not strictly conserved, the numerical errors will decrease very fast with increasing the number of the grid. Finally, we will present the time evolution of the systems in both coordinate and momentum spaces for pure gluons, pure quarks, and the mixture of quarks and gluons in Sec. IV D and IV E. As is expected, the system tends to  be homogenous in coordinate space and become thermalized in momentum space. We also find indications of gluon condensation in the soft region.
A. Test of the stability of the collision integrals As mentioned in Sec. III, there are two ways to integrate over the δ-function in the collision term. Different approaches of handling the δ-function can affect the numerical stability of the collision term. Here we use gluon-gluon scatterings to illustrate such difference.
In Fig. 2, we show the results from two approaches using the same initial distribution for gluons. Each data point represents the numbers of the gluons associated with the distribution function f g (t 0 + dt, x, p) where t 0 = 0fm and dt = 0.01fm. The blue triangle points, labelled as "more stable", stand for particle numbers obtained by using k 2 = k 3 + p − k 1 . The green circle points, labelled as "less stable", denote the results by using k 3 = k 1 + k 2 − p. We can the i-th independent evaluation total particle numbers see that the green circle points spread over a relatively larger area than the blue triangle ones, which means that the errors in the "less stable" method are relatively larger than the "more stable" ones. Therefore, using k 2 = k 3 + p − k 1 to integrate over dk 2 is a more stable method, consistent with our previous argument in Sec. III.

B. Test of particle number conservation
In Fig. 3 Figure 4. Number of particles of different species as a function of evolution time for the cases with and without "symmetrical sampling". The parameters for f g/q(q) (t, x, p) are the same as in Fig. 2 and 3. Panel (a) plots the particle numbers for both fermions and gluons using the symmetrical sampling. Panel (b) compares the total particle number as a function of evolution time with and without using the symmetrical sampling, as shown by the blue and green lines, respectively. On one Nvidia Tesla V100 card, the entire evaluations take 6456 and 8198 seconds for the cases without and with the symmetrical sampling.
symmetrical sampling". In the figures, the blue triangle and green circle points stand for the cases with and without symmetrical sampling, respectively. In the upper panel, we show the gluon numbers for cases of pure gluon scattering and quark-gluon scattering in Fig. 3 (a) and Fig. 3 (b), respectively. We can see that for pure gluon case, the gluon number is conserved when using our symmetrical sampling method. In Fig. 3 (b), since the gluons can be converted to quarks and anti-quarks, there are some variations of the gluon numbers. However, as shown in Fig. 3 (c), the total number of particles, including quarks, anti-quarks and gluons is conserved for 2 → 2 scatterings when using our symmetrical sampling method.
In Fig. 4, we show the numbers of gluons and quarks (anti-quarks) as a function of the evolution time. Since we initialize the system with all gluons, we find that gluons tend to convert into quarks and anti-quarks during the evolution, see in Fig. 4 (a). After some time, both gluon and quark numbers tend to achieve equilibrium. While the individual parton numbers are changing with time, the total particle number is strictly conserved during the evolution when our symmetrical sampling method is employed. Without the symmetrical sampling, the conservation of the total particle number can be violated by a small amount (about 0.03%) as shown in Fig. 4 (b).

C. Test of energy conservation
In implementing the MC integration of the collision term, we can use the symmetrical sampling method to ensure the strict particle number conservation. However, the conservation of total particle number does not guarantee the strict conservation of total energy numerically due to the discrete grids in the numerical calculation. When we calculate the total energy with Eq. (11), the smooth p 0 is approximated by discrete grids. This discretization will affect the numerical evaluation of the total energy. Such effect can be seen in Fig.   5, where a constant gluon mass m 2 g = 0.5GeV 2 is used. To quantify the violation of energy conservation, here we introduce the following quantity δE relative , In Fig. 5, we plot δE relative as a function of the number of momentum grids. In this test, we choose the initial gluon distribution as a step function, f g,intial (p) = 0.5 × θ(1 − |p|/Q s )| with Q s = 1.5GeV. We find that the deviation δE relative decreases fast as a function of the number of the grids. From the figure, we can see that when the number of grids in momentum space is taken as 30, δE relative ∼ 0.015, i.e. the fluctuation of total energy computated from Eq.
(11) is about 1.5% at t = 1fm/c. Such small deviation is mainly caused by the discretization of momentum. Of course, the total energy is still conserved physically for this case since we have used the condition k 1 + k 2 = k 3 + p in the computation of the collision term.
Now we consider the case of dynamical mass as computed by Eq. (8,9). For simplicity, we focus on the systems of pure gluons or u quarks that are homogenous in coordinate space.
Then Eq. (12) reduces to, Here the term ∂ 0 T 00 kin (x) is evaluated directly via [T 00 (t + dt) − T 00 (t − dt)]/(2dt). In Fig. 6, we show the time variation of the kinetic energy ∂ 0 T 00 kin and the zero component of source term S 0 ex . We can see that except for the first few steps, two terms (∂ 0 T 00 kin and S 0 ex ) are almost identical. For gluons as shown by 6 (a), the value of ∂ 0 T 00 kin (x) oscillates drastically for the first few time steps, but after t = 20fm, it slightly fluctuates around the zero value, which indicates the reach of nearly thermal equilibration. This result demonstrates that the change of the total energy comes from the source term.
The above consistency check in Fig. 6 means that our results satisfy Eq. (26) automatically. But this does not mean that the total energy is physically conserved since the masses are dynamically changing with space and time. In principle, for a single flavor case, one can rewrite the right-hand side of Eq. (12) as a total derivative term, then define a modified conserved total energy-momentum tensor as follows [103], where the definition of squared mass in Eq. (8,9) is used to derive the second term. For a multiple flavor case, one usually cannot get a modified conserved energy-momentum tensor.
For the case of dynamical mass, there may be another source of numerical error originating from the definition of mass in Eq. (8,9), apart from the discrete grids. To obtain the squared mass, we need to integrate over p with E p = p 2 + m 2 , which depends on the dynamical mass. At the first step of the time evolution, we use |p| instead of E p to derive the dynamic mass. This tiny difference may also be a source of numerical errors. Despite the above mentioned numerical errors, the conservation of the total energy computed via Eq. (11) can be archived up to 99.8% in Fig. 6.

D. Evolution of distribution functions and dynamical masses in coordinate space
When the spatial part of the BE is taken into account, the evolution of the phase space distributions becomes more complicated. Since the dynamical masses now also depend on the spatial grids as in Eq. (8) and (9), all differential terms in Eq. (13) contribute to the evolution. The initial conditions for the distribution function must be physical, e.g.
the fermion distribution functions should be smaller than unity, and all phase distributions should be positive definite. Here for simplicity, we set the initial gluon distribution as follows: where f g;max and f g;min are two parameters, which stand for the the maximum and minimum values of the distribution. 2|p x;max | is the size of the momentum box in the x direction, and 2|x x;max | is the size of the spatial box in the x direction. One can also choose other types of initial conditions, and the main results will be similar. The distribution function f g (x, p) in Eq. (28) linearly increases in the x direction and linearly decreases in the p x direction. For simplicity, we set the initial quark distribution to be zero, f q(q);initial = 0.    Figure 7. The gluon and u-quark distributions in spatial x and y directions at different times.
The phase space gird is taken as [n x , n y , n z , n px , n py , n pz ] = [10, 10, 1, 10, 10, 1], with n being the number of grids. We have also chosen the coupling constant α s = 0.3, phase space box is of size 3 and dt = 0.0005fm. The initial gluon distribution is given by Eq. (28) with |p x;max | = 2GeV, |x x;max | = 3fm, f g;max = 0.2 and f g;min = 0.1. At each x or y, we plot f g or f u which is averaged in (y, z, p) or (x, z, p), respectively. The calculation takes 3312 seconds on one Nvidia Tesla V100 card.
In Fig. 7, we show the gluon and u-quark distributions as a function of the spatial directions x and y at different evolution times. Initially, the gluon distribution is linear in spatial x and u quark distribution is vanishing. As the time evolves, the gluons tend to convert to quarks, similar to Fig. 4. In the end, the distributions of both u quarks and gluons become approximately uniform in all spatial directions (Here we show the distribution functions in the x and y-directions). Other fermions have a similar pattern as well.
In Fig. 8, we show the dynamical masses for gluons and u quarks in the spatial x and y-directions at different evolution times. Initially, both masses squared m 2 g and m 2 u increase linearly with x. Then as time evolves, they tend to become homogenous in spatial x direction.
The distributions for other flavors of quarks and anti-quarks are similar.

E. Evolution of distribution functions in momentum space and gluon condensation
Now we show the evolution of systems in momentum space. For simplicity, we neglect the spatial dependence and set the systems to be homogenous in coordinate space. As a first attempt, we investigate the time evolution of pure gluon or pure quark systems in momentum space. The initial gluon distribution is chosen to be a step function, where f 0 and Q s are parameters. This initial condition is to mimic the distribution from Color Glass Condensation (CGC) [22], with Q s being the saturation scale. The evolution of the distribution function for a pure gluon system is shown in Fig. 9 (a) and (c). Note that for gluon evolution, the time dt is set very small (in our case dt = 0.00005fm) to ensure that the distribution functions are positive. Similarly for a pure quark system, we choose the initial quark distribution function as, and Q s = 1.5GeV. For pure gluons, the time step is taken as dt = 0.00005fm and for pure quarks dt = 0.001fm. On one Nvidia Tesla V100 card, the evaluation from 0 fm to 50 fm takes 60 hours for pure gluon case and 2 hours for pure fermions case from 0 fm to 20 fm. respectively. The parameters are chosen to be the same as in Fig. 9.
The time evolution of the pure u quark distribution function is shown in Fig. 9 (b) and (d).
For pure quark system, the time step dt can be relative larger (in our case dt = 0.001fm).
From Fig. 9, we find that the thermalization of gluons is quite different from that of quarks. The gluons will first accumulate in the soft region, where the energy is smaller than 1.0GeV. This phenomenon may indicate the gluon condensation. While for quarks, we have not observed such phenomenon. The gluon condensation may be of importance to the pre-thermalization of quark-gluon plasma created in the heavy-ion collisions. We will investigate such issue in our future studies based on our program. It is noted that at the thermal equilibrium, the gluon chemical potential is negative for pure gluon system while the quark chemical potential is positive for pure quark system in Fig. 9.
In Fig. 10, we show the evolution of the dynamical masses squared for gluons and quarks.
For a pure gluon system, the gluon's thermal mass decreases with time and eventually becomes stable as the system reaches thermal equilibrium. For a pure quark system, the quark's thermal mass oscillates with time, but the averaged value tends to be constant.
We now extend our simulation to a system composed of both quarks and gluons. The initial condition for gluons is set the same in Eq. (30), and for quarks we choose f q (p) = 0.
In Fig. 11, we show the distribution functions of gluons and u quarks as a function of parton energy at different evolution times. For gluons, the distribution function in the soft region ( 1.0GeV) increases very fast at the beginning and then decreases to the thermal equilibration.
The accumulation of gluons in the soft region implies possible gluon condensation. For quarks, the distribution function reaches the Fermi-Dirac distribution gradually. We also notice that the chemical potentials for gluons and quarks are both negative in the thermal equilibrium. Compared with the result in Fig. 9 (b) and (d), the negative chemical potential for quarks might come from different initial condition and their interaction with gluons.
In Fig. 12, we show the dynamical masses squared m 2 g,u and the particle numbers as a function of the evolution time. It is interesting that there is no oscillation behavior for m 2 u here, as opposed to the results in Fig. 10. Instead, m 2 u decreases with time and then reaches to a constant, similarly to m 2 g . In Fig. 12 (b), we also confirm that the total particle number is strictly conserved. The gluons convert to quarks in a very short time ( 10fm), and then both gluon and quark numbers tend to be constant.    Figure 12. Time evolution of masses squared m 2 g,u and the particle numbers for quark-gluon systems. The initial condition and parameters are take as the same in Fig. 11.

V. CONCLUSION AND DISCUSSION
In this work, we have developed a new numerical framework for obtaining the full solutions of relativistic BE on GPUs. Our main equation, i.e., the complete relativistic BE, is of form Eq. (2). We have considered the thermal systems of 3 flavor quarks, anti-quarks and gluons.
For simplicity, we only consider 2 → 2 scattering processes, in which the total particle numbers are conserved. Since the quarks and gluons have dynamical masses in Eqs. (8,9), there is an external force in Eq. (2). Also the kinetic energy-momentum tensor in Eq. (12) is not conserved due to the external force.
To solve BE numerically, we first rewrite the main equation as Eq. (13) with its discrete form in Eq. (15). There are two ways to handle the δ-function in the collisional integral, either integrating out k 2 or integrating out k 3 . In this work, we have chosen the first choice in Eq. (17), which is more stable than the second one as shown in Fig. 2. Next we introduce the "symmetrical sampling" method to ensure the conservation of the total particle number.
We have also investigated the energy conservation which is not strictly conserved numerically due to the discrete grids. However, the numerical errors will decrease very fast if we increase the numbers of the grids, and the conservation of the total energy can be achieved up to 99.8% in our calculation.
We have studied the time evolution of the distribution functions in both coordinate and momentum spaces. Fig. 7 has shown the gluon and u-quark distributions as a function of spatial direction x at different evolution time, given the initial condition in Eqs. (28,29).
It is found that the distributions of both u quarks and gluons become homogeneous in the coordinate space at a later time, implying the reach of thermal equilibrium. Fig. 9 and Fig.   11 show the evolution of the distribution functions in the momentum space for pure gluons, pure quarks and quark-gluon mixtures. It is interesting that the thermalization process of gluons is different from that of quarks. The fermions reach the thermal distribution smoothly, while the distribution functions of gluons at an early time increase very fast in the soft region and then decrease to thermal distributions.
In summary, we have provided a full numerical solution to the BE with complete 2 → 2 scattering processes with high computing performance. Our framework may serve as a basis to study the pre-thermalization stage in heavy-ion collisions in the future. Currently, our program can use the grid sizes up to n x , n y , n z = 20 and n px , n py , n pz = 40. Very large grid sizes such as n x , n y , n z , n px , n py , n pz ≥ 50 are still challenging, even with the help of GPU clusters. We will continue to improve our framework and algorithms along this direction.
In the future, we will include 2 → 3 processes and study the interplay between elastic and inelastic scatterings. We may also include the external electromagnetic fields to study the quantum transport phenomena under the strong electromagnetic fields.

APPENDIX
A. Matrix elements squared for 2 → 2 scattering ab → cd M a(k 1 )b(k 2 )→c(k 3 )d(p) 2 q 1 q 2 → q 1 q 2 q 1 q 2 →q 1 q 2 q 1q2 → q 1q2 q 1q2 →q 1q2 tu s 2 Table I. Matrix elements squared for all 2 → 2 parton scattering processes in QCD. The helicities and colors of all initial and final state particles are summed over. q 1 (q 1 ) and q 2 (q 2 ) represent quarks