Three-dimensional cascaded lattice Boltzmann method: improved implementation and consistent forcing scheme

Cascaded or central-moment-based lattice Boltzmann method (CLBM) proposed in [Geier \textit{et al.}, Phys. Rev. E \textbf{63}, 066705 (2006)] possesses very good numerical stability. However, two constraints exist in three-dimensional (3D) CLBM simulations. Firstly, the conventional implementation for 3D CLBM involves cumbersome operations and requires much higher computational cost compared to the single-relaxation-time (SRT) LBM. Secondly, it is a challenge to accurately incorporate a general force field into the 3D CLBM. In this paper, we present an improved method to implement CLBM in 3D. The main strategy is to adopt a simplified central moment set, and carry out the central-moment-based collision operator based on a general multi-relaxation-time (GMRT) framework. Next, the recently proposed consistent forcing scheme in CLBM [L. Fei and K. H. Luo, Phys. Rev. E \textbf{96}, 053307 (2017)] is extended to incorporate a general force field into 3D CLBM. Compared with the recently developed non-orthogonal CLBM [A. D. Rosis, Phys. Rev. E \textbf{95}, 013310 (2017)], our implementation is proved to reduce the computational cost significantly. The inconsistency of adopting the discrete equilibrium distribution functions (EDFs) in the non-orthogonal CLBM is revealed and discussed. The 3D CLBM developed here in conjunction with the consistent forcing scheme is verified through numerical simulations of several canonical force-driven flows, highlighting very good properties in terms of accuracy, convergence and consistency with the nonslip rule. Finally, the techniques developed here for 3D CLBM can be applied to make the implementation and execution of 3D MRT-LBM much more efficient.


I. INTRODUCTION
As a mesoscopic numerical method based on the kinetic theory, the lattice Boltzmann method (LBM) has gained remarkable success for the simulation of complex fluid flows and beyond during the last three decades or so [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15]. Different from the conventional computational fluid dynamics (CFD) methods where the macroscopic governing equations are solved numerically, LBM solves a discrete Boltzmann equation which is designed to reproduce the Navier-Stokes (N-S) equations in the macroscopic limit. In the LBM simulation, the fluid is usually represented by populations of fictitious particles colliding locally and streaming to adjacent nodes along the links of a specified lattice. The scale-bridging nature of LBM allows its natural incorporation of microscopic and/or mesoscopic physics, while the highly efficient collision-streaming algorithm makes it affordable computationally [10].
In the practical implementation, the simplest approach to represent particles colliding process is to relax all the distribution functions (DFs) to their local equilibria at an identical rate, which is known as the single-relaxation-time (SRT) model [16]. While the SRT-LBM is successful in many fluid flows, it may encounter numerical instability for flows with relatively low viscosities [17,18], as well as inaccuracy in implementing the boundary conditions [19,20]. Compared with the SRT model, the multiple-relaxation-time (MRT) model, where the collision step is carried out in the raw moment space, is able to enhance the stability of LBM by carefully separating the time scales among the kinetic modes [17,18,21]. In addition, the MRT-LBM can also improve the numerical accuracy for non-slip boundary conditions by adopting a so-called magic parameter [19,20]. In 2006, a cascaded collision operator was proposed by Geier et al. [22], where the collision step is implemented in the central moment space. To get the higher-order post-collision central moments, the lower-order ones are needed, which implies a from the lowest order to the highest orderi.e., cascaded operation procedure. The cascaded lattice Boltzmann method (CLBM) increases the numerical stability significantly, which is also essentially due to the removal of the ghost modes. Besides, relaxation in a co-moving frame of reference, i.e., in terms of central moments, allows a natural setting to achieve better Galilean invariance, compared with in the frame at rest, for a specified discrete velocity set [22]. More comparisons and discussions between relaxations in the raw moment and central moment spaces can be found in [22][23][24][25][26].
More recently, many studies have been carried out to improve the cascaded collision model and apply the CLBM to practical applications. In [27], Asinari argued that CLBM essentially consists in using a generalized local equilibrium in the frame at rest. In addition, a multiphase CLBM has been developed to simulate multiphase flows by Lycett-Brown and Luo [28,29]. They further extended the model with an improved forcing scheme and achieved binary collision simulations at high parameters [30]. Moreover, a thermal cascaded LBM (TCLBM) has been proposed by the present authors to simulate low-Mach compressible thermal flows [31], and several different CLBMs have been developed later for incompressible thermal flows [32][33][34][35][36]. Finally, CLBM has also been extended to simulate shallow water equations [37], moving boundary problems [38], as well as stationary flows with a preconditioning method [39].
Although the CLBM possesses very good numerical stability and has achieved success for a series of complex flows, two critical problems still exist in the three-dimensional (3D) simulations. Firstly, the practical implementation for 3D CLBM in the original work [22] involves a lot of cumbersome notations and the computational cost is much higher than the SRT-LBM. Even if some efforts have been made [24,29,40], it is still quite difficult to handle the expressions compared with the SRT-LBM. For example, a naive implementation of the most recent non-orthogonal CLBM [40] needs a CPU time that is 13 to 14 times larger than that by the SRT-LBM. Secondly, an accurate and easy-toimplement forcing scheme is needed to incorporate an external or internal force field into the 3D CLBM. In 2009, Premnath et al. proposed a forcing scheme for CLBM by method of central moments [23], which was then extended to 3D model in [24]. In [28][29][30], the SRT-style forcing scheme was used in the CLBM. Besides, an alternative forcing scheme based on a discrete equilibrium has been developed by De Rosis [41]. However, as analyzed by the present authors [42], the forcing schemes in [23,28,41] will lead to some inconsistences for flows with a general force field. Based on a general multi-relaxation-time (GMRT) framework, a consistent forcing scheme in CLBM was proposed in [42]. The consistent forcing scheme can degrade into the widely used forcing schemes in the MRT-LBM [43] and SRT-LBM [44] under certain conditions and shows great advantages over previous forcing schemes in terms of accuracy, isotropy and consistency with the nonslip rule. Therefore, the present paper aims to simplify the implementation and reduce the computational cost for 3D CLBM. In the meantime, the consistent forcing scheme is extended to incorporate a general force field for 3D flows.
The rest of this paper is organized as follows. Section II presents the simplified implementation and the consistent forcing scheme for 3D CLBM. Numerical verifications are carried out in Sec. III. Finally, concluding remarks are given in Sec. IV.

II. SIMPLIFIED 3D CLBM WITH CONSISTENT FORCING SCHEME
The simplified implementation is based on the GMRT framework and adopts a new central moment set. Firstly, the GMRT framework with the consistent forcing scheme is introduced briefly. Then, the choices of the central moment set, central moment equilibria, and forcing terms in the central moment space are given in detail.

A. GMRT framework
In the present work, we focus on the standard D3Q27 discrete velocity model (DVM). However, it should be noted that the procedures shown in this work are not limited to the specified DVM, and can be extended to other DVMs readily. For example, a D3Q19 CLBM can be directly extracted from the D3Q27 CLBM (see in the Appendix A). The lattice speed c = ∆x/∆t = 1 and the lattice sound speed c s = 1/ √ 3 are adopted, in which ∆x and ∆t are the lattice spacing and time step. The discrete velocities e i = [|e ix , |e iy , |e iz ] are defined as (1) where i = 0, 1, ..., 26, |· denotes a 27-dimensional column vector, and the superscript ⊤ denotes the transposition.
We first define the raw and central moments of the discrete distribution functions (DFs) f i , where m, n and p are integers, and u x , u y and u z are velocity components in x, y and z directions, respectively. The equilibrium values k eq mnp andk eq mnp are defined analogously by replacing f i with the discrete equilibrium distribution functions (EDFs) f eq i . To construct the central-moment-based collision operator, a raw moment set |T i and the corresponding central moment set T i are needed, where the elements in |T i and T i are combinations of k mnp andk mnp in the ascending order of (m + n + p), respectively. According to Eq. (2), the transformation from the discrete DFs to their raw moments can be performed through a transformation matrix M, and the shift from the raw moments to central moments can be performed through a shift matrix N, The explicit forms for M and N depend on the raw moment set and the corresponding central moment set, which will be discussed in the next subsection.
In the implementation of CLBM, the collision step is firstly executed in the central moment space. To be consistent with the central-moment-based collision operator, an external or internal force field F = [F x , F y , F z ] should be added by means of central moments [23]. By using a transformation method to incorporate forcing terms with a second-order trapezoidal scheme, the explicit form of collision step can be written as [42], where I is a unit matrix, S is the relaxation matrix, and C i and R i are forcing terms in central moment space and discrete velocity space, respectively. According to the assumption by He et al. [45], R i can be given by Due to the definitions of the transformation and shift matrices, both of them are invertible (explicit expressions for M −1 and N −1 can be easily obtained by software like MATLAB). The post-collision discrete DFs can be reconstructed by In the streaming step, the post-collision discrete DFs in space x stream to their neighbors (x + e i ∆t) along the characteristic lines as usual [10,42], Then, the fluid density and velocity, ρ and u = [u x , u y , u z ], are updated by, From the above, it can be shown that when the shift matrix N is a unit matrix the CLBM degrades into an MRT-LBM on the specified raw moment set |T i , and when all the relaxation parameters in the matrix S are equal to one another the CLBM degrades into an SRT-LBM. Thus we proclaim the above framework as a GMRT framework [42]. It can be also shown that the above forcing scheme can degrade into the MRT version and SRT version of the widely used forcing scheme by Guo et al. [44] under certain conditions. Thus it is named as a consistent scheme [42].
To implement the collision step in Eq. (5), the equilibria and forcing terms in central moment space, T eq i and |C i , should be specified. In [40], the discrete EDFs [16], f eq D,i = ω i ρ[1 + (e i · u)/c 2 s + (e i · u) 2 /2c 4 s − u 2 /2c 2 s ], are adopted. However, we argue three points against this choice: (1) it results in a lot of velocity terms in T eq i and |C i , which is inconsistent with the physics of central moments; (2) it destroys the Galilean invariance for the offdiagonal elements of the third-order raw moments which are preserved naturally in the original CLBM by Geier et al.
The corresponding discrete EDF is in fact a generalized local equilibrium [23,27,31]. In addition, when the viscosity is quite small, i.e., ν ∼ O(10 −7 ), the higher than third order central moment equilibria can be also modified according to the factorized method [24,46]. Substituting the generalized local equilibrium into Eq. (6), we can get We now summarize the computational algorithm of the above proposed implementation and forcing scheme in 3D CLBM: Step 1. Compute central moments T i using the definition in Eq. (2). It should be noted that this method is quite basic and can be optimized by separating the transformation and shift sub-steps according to Eq. (4).
Step 3. Reconstruct the post-collision raw moments according to |T * i =N −1 T * i , and reconstruct the post-collision DFs according to |f * i = M −1 |T * i .
Step 4. Perform the streaming step and update the hydrodynamic variables according to Eqs. (8) and (9). Advance the time step and return to Step 1.
It is known that the most computationally demanding part in 3D CLBM is the reconstruction step of the postcollision DFs [22,29,40]. In the present work, the computational cost can be reduced significantly due to the facts that: firstly, the reconstruction step is divided into two sub-steps; secondly, the simplified central moment set is used such that the elements in M −1 and N −1 are much simplified.
In addition, the present method can also be used to simplify the 3D MRT-LBM. For example, it can be found that the non-orthogonal transformation matrix M in Eq. (14) and its inverse matrix M −1 have 343 and 216 non-zero elements, respectively. In the orthogonal raw moment set by Premnath et al. [24], which has been used to construct the D3Q27 MRT-LBM in [47], both the M and M −1 have 416 non-zero elements. Therefore, compared with the D3Q27 MRT-LBM in [47], an MRT-LBM based on the present raw moment set can simplify the implementation and reduce the computational cost. As shown in Appendix B, the simulation results for 3D Lid-driven cavity flow by the non-orthogonal MRT-LBM are in good agreement with the benchmark solutions by Ku et al. [48], which implies that the non-orthogonal MRT-LBM can retain the numerical accuracy when simplifying the implementation.

III. NUMERICAL SIMULATIONS
In this section, several benchmark problems are conducted to verify the proposed 3D CLBM. In the simulation, the relaxation rates for the conserved central moments, s 0 and s 1 are set to be 1.0. Unless otherwise specified, the tunable relaxation parameters for high-order central moments are also set to be 1.0. The standard half-way bounce-back boundary scheme is used for wall boundaries.  Fig. 2, while the original imposed viscosity is ν = 0.05. As is shown, the simulated viscosity by CLBM C is independent of the reference velocity (or M a) and always agrees well with the imposed value. For the other two methods, the simulated viscosities decrease with the increase of the reference velocity. For example, the relative errors at M a = 0.3 are 0.08%, 8.92%, and 8.91% for CLBM C , CLBM D , and SRT-LBM, respectively.
The above results confirm our argument in Sec. II B that using the discrete EDFs in CLBM as in [40] destroys the Galilean invariance (GI). It should be noted that there are two aspects to the issue of GI, one of which is related to the choice of the collision model and the other pertains to the choice of the discrete velocity model. For the standard lattice, the CLBM C only preserves the GI naturally for the off-diagonal elements of the third-order raw moments. To restore the complete GI, additional correction terms [49] or more symmetrical lattice [50] are needed. We now verify the efficiency of the present simplified CLBM. We consider the same problem under different mesh sizes, i.e., 5 × 101 × 5, 5 × 201 × 5, and 5 × 401 × 5, and measure the CPU time required for 10000 iterations by the present CLBM and SRT-LBM. For each case, the CPU time is the average value after removing the minimum and maximum in nine runs. The code is developed based on C++, and runs on a laptop with Intel (R) Core (TM) i7-6500U CPU @ 2.5 GHz and RAM 8.00GB. The implementation is basic for both CLBM and SRT-LBM, without resorting to any optimization strategies, such as the preconditioning method [39] and the LAPACK library [40]. As shown in TABLE I, the CPU time for the present 3D CLBM is around 2.15 times of the one by SRT-LBM. Compared with the method in [40], where the computational cost overhead ratio is 13 to 14, the present implementation shows significant reduction in the computational cost.

B. Steady Poiseuille flow
The second problem considered is a steady Poiseuille flow driven by a constant body force F = [F x , 0, 0]. Thus the flow direction is along the positive direction of the x axial. The analytical solution is u a = [u x , 0, 0], where u x (z) = u 0 (1 − z 2 /L 2 ) and L is the half-height of the channel. The peak velocity is u 0 =F x L 2 /2ν, by which the Reynolds number is defined as Re = 2u 0 L/ν. Due to the simple flow configuration, only 5 nodes are used to cover the length and width, and periodic boundary conditions are used in these two directions. Firstly, we choose the kinematic viscosity ν = 0.1, and 20 nodes are employed to cover the channel height (2L = 20∆x). The velocity profiles at a series of Reynolds numbers, Re = [10,20,30,40], are plotted in Fig. 3. It can be seen in Fig. 3 that the simulation results are in very good agreement with the analytical solutions. As analyzed by previous studies [19,43], when the relaxation rate for the energy flux is set to be s 3 = (16 − 8s 2 )/(8 − s 2 ), no numerical slip occurs in the steady Poiseuille flow for the MRT-LBM. It is shown recently in [42] that among the existing forcing schemes in CLBM, only the consistent forcing scheme given in [42] preserves the nonslip rule. Here we choose two cases with ν = 0.1 and 0.2 (s 2 = 5/4 and 20/11), and measure the global relative errors at different s 3 . The global relative error is defined as where the summation operator is over all grid nodes. As shown in Fig. 4 when s 3 reaches the target value, E 2 reduces significantly to a very small value, which confirms the consistent nonslip boundary condition in the present 3D CLBM.
We now proceed to test the validity of the modification method in Eq. (13) to separately relax the stress differences and trace of the pressure tensor. As analyzed in [28], in the recovered Navier-Stokes equations by CLBM, the shear stress is expressed as τ = ρν[∇u + (∇u) T ] + ρ(ξ − 2ν/3)(∇ · u)I. As is known, the divergence-free condition cannot be accurately satisfied in LBM, thus the last term in τ is an additional error for incompressible N-S equations and can be removed only when ξ = 2ν/3 (s 2b = s 2 ). As shown in Fig. 5, we choose ν = 0.1 coupled with the nonslip rule and measure E 2 at different ξ, and when ξ = 2ν/3, E 2 reaches the smallest value. In addition, a linear relation between |ξ − 2ν/3| and E 2 can be also seen in Fig. 5. Thus the validity of the modification method in Eq. (13) is confirmed.

C. Steady flow through a square duct
In this subsection, we consider the developed flow through a square duct where the flow field is variable in both the coordinate directions normal to the direction of the driving force F = [F x , 0, 0] [24]. The flow has an analytical solution [51], where a is the duct half-width, and −a ≤ y, z ≤ a. In the simulation, we set F x = 2 × 10 −4 , ν = 0.2, and a = 16∆x. The half-way bounce-back boundary condition is used at the walls, thus the grid lines are located at y = [−15.5∆x, ..., −0.5∆x, 0.5∆x, ..., 15.5∆x] and z = [−15.5∆x, ..., −0.5∆x, 0.5∆x, ..., 15.5∆x]. Only 5 nodes are used to cover the x direction, along which the periodic boundary condition is adopted. The surface contour of the computed velocity field is shown in Fig. 6. It is seen that the present CLBM can reproduce the velocity distribution for steady flow through the square duct, and the simulation result is in qualitative agreement with the analytical solution.
To be quantitative, the simulation results for the velocity profiles at z = [0.5∆x, 7.5∆x, 12.5∆x] are compared with the analytical solutions in Fig. 7. It can be seen that the numerical results are in good agreement with the analytical solutions at different locations. In Sec. III B, we have found that the present CLBM holds the consistent nonslip boundary condition with the MRT-CLBM. It should be noted that the derivation of the nonslip rule s 3 = (16 − 8s 2 )/(8 − s 2 ) is based on the approximation that the velocity field varies in only one coordinate direction [43]. For the steady flow through a square duct, the velocity field varies in both y and z directions, thus the nonslip rule may be not applicable to this problem. To verify our argument, we define the local relative error over the z cross-section, where u ax and u nx denote the analytical and numerical velocities, respectively. Here we measure the change of E(z) with s 3 at z = [0.5∆x, 7.5∆x, 12.5∆x]. As shown in Fig. 8, when s 3 reaches the target value 1.33077, E(z) reduces to the smallest value for z = 0.5∆x and z = 7.5∆x. However, the minimum point has a slight deviation from the target value for the case z = 12.5∆x. The reason for the deviation is that the z = 12.5∆x cross-section is near to the wall boundary, where the flow contains 2D feature and the unidirectional approximation is destroyed. Although our argument is presented for the present CLBM, it should also apply to the MRT-LBM in general. As analyzed by Luo et al. [52] based on the 2D Lid-driven cavity flow simulation, the non-slip condition cannot be accurately satisfied for the complex flows, but usually a very good approximation can be obtained when using the non-slip rule.

D. Taylor-Green vortex flow
As a final example, we want to test the present CLBM with the consistent forcing scheme for an unsteady flow where the force field depends on both time and space. The considered problem is the Taylor-Green vortex flow [53], which has an analytical solution where u 0 is the amplitude of the imposed velocity field, and k 1 and k 2 denote the wave numbers along the x and y directions. The body force is given by In the simulation, the computational domain is defined in 0 ≤ x, y ≤ 2π, and covered by L×L grid points, while only 5 nodes are adopted along the z direction. Thus the wave numbers are k 1 = k 2 = 2π/L. To eliminate the compressibility effect, u 0 is set to be 0.005. Firstly, we choose L = 32∆x and ν = 0.0032, and the corresponding Reynolds number Re = u 0 L/ν is 50. The numerical results of the horizontal velocity profile in the x = L/2 cross-section and the vertical velocity profile in the y = L/2 cross-section are plotted in Fig. 9, from which it can be seen that the numerical results are in good agreement with the analytical solutions at different non-dimensional time, T * = ν(k 2 1 + k 2 2 )t/ ln 2. Then simulations characterized by a series of grid resolutions are carried out, L/∆x = [16,32,64,128], while the kinematic viscosity is obtained through ν = u 0 L/Re. The log-log plot of the global relative error E 2 at T * = 4.0 as a function of the grid spacing is presented in Fig. 10, and the fit slope of the numerical results is 2.0345. This demonstrates that the present CLBM with the consistent forcing scheme has second-order accuracy in space.

IV. CONCLUSIONS
In this work, we present an efficient 3D CLBM formulation with a consistent forcing scheme for a general force field. In the method, the most computationally demanding reconstruction step in 3D CLBM is divided into two sub-steps based on the GMRT framework. A very simple central moment set is adopted to construct the cascaded operator such that the tedious combined terms in the shift matrix N are completely eliminated. To match the separate relaxations for certain second-order moments, the previously used diagonal relaxation matrix is modified to be a block-diagonal matrix.
Our proposed method is very efficient and easy to implement. Through the simulation of the decay of a shear wave, it is confirmed the present method can significantly reduce the computational cost compared with the recently proposed non-orthogonal CLBM [40]. The inconsistency of adopting the discrete EDF in the non-orthogonal CLBM [40] is revealed and discussed. Then the consistent forcing scheme is verified by simulating several benchmark forcedriven flows, highlighting very good properties of the developed methodology in terms of accuracy, convergence and consistency with the nonslip rule.
The implementation method developed is quite intelligible and general. Although the D3Q27 lattice is adopted for the derivation, the corresponding implementation in other lattices can be readily obtained. The derivation method puts the 3D MRT-LBM and CLBM into a unified general framework. Thus the developed methodology is not only applicable to CLBM but can also be adopted to simplify the 3D MRT-LBM, as demonstrated in Appendix B. The central moment set can be extracted from Eq. (12) T i = [k 000 ,k 100 ,k 010 ,k 001 ,k 110 ,k 101 ,k 011 ,k 200 ,k 020 ,k 020 ,k 120 ,k 102 ,k 210 ,k 201 ,k 012 ,k 021 ,k 220 ,k 202 ,k 022 , ] ⊤ . (A2) The block-diagonal matrix is given by (A3) The transformation matrix M can be written explicitly according to Eqs. (4) and (A2) 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Thus the equilibria and forcing terms in the central moment space can be given as In the D3Q27 non-orthogonal MRT-LBM, the raw moment equilibria are obtained by |T eq i = M f eq D,i . Although the force field is not considered in this case, the forcing terms in raw moment space can be directly obtained by |G eq i = M |R G,i , where R G,i = ω i [(e i − u)/c 2 s + (e i · u)e i /c 4 s ]F refer to the forcing scheme by Guo et al. [44]. Analogously, a D3Q19 non-orthogonal MRT-LBM can be constructed using the transformation matrix M in Eq. (A4). We now use the D3Q27 non-orthogonal MRT-LBM to simulate the 3D Lid-driven cavity. The flow is confined in a cubic box L × L × L and driven by a top lid at y = L with constant velocity U = 0.1. The Reynolds number Re = U L/ν = 400 is considered in the simulation, and the length of the cubic box is set to be L = 64∆x. From Fig. 11, it can be seen that a pair of vortices are located near the bottom of the x = 0.5L plane, which is consistent with the results reported in [48]. In addition, the velocity profiles by the non-orthogonal MRT-LBM are compared with the benchmark solutions [48] in Fig. 12. It can be seen the present simulation results are in good agreement with the benchmark solutions. Furthermore, it is found that the non-orthogonal MRT-LBM requires approximately 25% less computational time than the orthogonal MRT model in [47].