Professional Documents
Culture Documents
MBlock
MBlock
†panguang601@163.com
1. Introduction
In the past decades, the Lattice Boltzmann Method (LBM) has been developed
as an alternative numerical method for simulating fluid flows [Aidun and Clausen
(2010)]. LBM is based on the statistical physics and derived from the Boltzmann
equation. A direct connection between the lattice Boltzmann equation and Navier–
Stokes equations has been established under the nearly incompressible condition
[Aidun and Clausen (2010)]. The spatial locality of LBM makes it more suitable
for parallel computing compared to the traditional computational fluid dynamic
methods [Mohamad (2011)].
The Graphics Processing Unit (GPU) is designed to process large graphics data
sets for rendering tasks. It has exceeded the computation speed of PC-based Central
† Corresponding author.
1840002-1
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
Processing Unit (CPU) by more than one order of magnitude while being available
at a relatively lower cost. Another advantage for GPU application is that Compute
Unified Device Architecture (CUDA) provided by NVIDIA, reduces the develop-
ment threshold of GPU programming greatly. Due to the inherent parallelism of
LBM, a significant speedup of GPU-based computation of LBM has been reported
in the literatures. Fan et al. [2004] implemented the LBM simulations on a clus-
ter of GPUs with Message Passing Interface (MPI). Tölke and Krafczyk [2008]
implemented a three-dimensional LBM and achieved near teraflop computing on a
single workstation. Zhou et al. [2012] proposed an efficient GPU implementation of
flows with curved boundaries, leading to nearly an 18-fold increase in speed. Tubbs
and Tsai [2011] implemented LBM for solving the shallow water equations and the
by UNIVERSITY OF WESTERN AUSTRALIA on 05/16/17. For personal use only.
advection dispersion equation on GPU, and the results indicated the promise of the
GPU-accelerated LBM for modeling mass transport phenomena in shallow water
Int. J. Comput. Methods Downloaded from www.worldscientific.com
flows. GPU has shown tremendous potential to accelerate LBM computation owing
to the explicit nature of LBM.
The standard LBM is usually employed on uniform grids, which makes the
evolution explicit and the algorithm simple, but at the same time could make the
computational effort increase dramatically on the road to a high mesh resolution.
To overcome the uniform grid restriction, a Multi-Block Lattice Boltzmann Method
(MB-LBM) has be designed and applied over the flow area where a relatively high
resolution is required. Unlike the standard LBM, in the refined domain the MB-LBM
evolves in a smaller time step, which can save the computational cost significantly.
As an effective tool of grid refinements in LBM, the multi-block technique has
been investigated in recent years. In 1998, Filippova and Hänel [1998] introduced a
local second-order grid refinement scheme for the lattice-BGK model and provided
the theoretical foundation for multi-block techniques. In 2000, Lin and Lai [2000]
designed a composite block-structured scheme by placing the fine grid blocks in
certain areas for the mesh refinement. In 2002, Yu et al. [2002] proposed a multi-
block method, in which the fine block is partially overlapped at the interfacial
lattices, increasing the model efficiency greatly. The model of Yu et al. [2002] has
been successfully applied to various areas. Yu and Girimaji [2006] extended this
model to 3D turbulence simulations. Peng et al. [2006] applied it in the immersed
boundary-lattice Boltzmann method with multi-relaxation-time. Liu et al. [2010]
validated the MB-LBM coupled with the large eddy simulation model in transient
shallow water flows simulation. Farhat and Lee [2010] and Farhat et al. [2010]
extended the single-phase MB-LBM to the multiphase Gunstensen model, in which
the grid is free to migrate with the suspended phase, and they also presented a three-
dimensional migrating multi-block model. However, to the authors’ knowledge there
are few detailed discussions on the application of the MB-LBM on GPU. In this
paper, we will devote to developing an efficient and straightforward algorithm for
the implementation of MB-LBM on GPU and validating it in terms of accuracy and
efficiency.
1840002-2
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
The paper is organized as follows. In Sec. 2, the LBM and the multi-block scheme
are introduced. In Sec. 3, some details of the application of the MB-IBM on GPU
are discussed. In Sec. 4, the parallel MB-LBM algorithm is validated by the lid
driven flow and the flow around a circular cylinder, and the performance on GPU
is assessed. Section 5 is the conclusion.
moments of the distribution functions, one can obtain the macroscopic variables of
the fluid, such as density and velocity. The lattice Bhatnagar–Gross–Krook (BGK)
Int. J. Comput. Methods Downloaded from www.worldscientific.com
model, also known as single-relation-time model, is the most popular LBM due to
its simplicity and efficiency [Aidun and Clausen (2010)]. Therefore, the lattice BGK
model is used in the present work. The distribution function evolves as the following
equation:
1
fα (x + eα δt, t + δt) = fα (x, t) − [fα (x, t) − fαeq (x, t)], (1)
τ
where fα is the particle distribution function representing the probability of particles
at the position x and time t with discrete particle velocity eα ; δt is the time step;
τ is the single-relaxation-time, related to the kinematic viscosity ν, τ = 3ν + 0.5;
with the two-dimensional nine-velocity discrete velocity model (D2Q9) [Mohamad
(2011)] shown in Fig. 1, e is expressed as
0 1 0 −1 0 1 −1 −1 1
e= . (2)
0 0 1 0 −1 1 1 −1 −1
fαeq is the equilibrium distribution function, obtained from the Taylor series expan-
sion of the Maxwell–Boltzmann distribution function with the velocity up to second
6 2 5
0
3 1
7 4 8
1840002-3
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
8
by UNIVERSITY OF WESTERN AUSTRALIA on 05/16/17. For personal use only.
ρu = eα f α . (4b)
α=0
Int. J. Comput. Methods Downloaded from www.worldscientific.com
1840002-4
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
M N
A B
by UNIVERSITY OF WESTERN AUSTRALIA on 05/16/17. For personal use only.
Int. J. Comput. Methods Downloaded from www.worldscientific.com
of the fine block, the information of these nodes can be obtained from the interior
nodes of the fine block. For the boundary nodes of the fine block, things are similar.
The communication of boundary information between different blocks is executed
after the collision, where the connections between f˜αc and f˜αf are written as:
τc − 1 ˜f
f˜αc = fαeq,f + mf (f − fαeq,f ), (9)
τf − 1 α
τf − 1
f˜αf = fαeq,c + (f˜c − fαeq,c ). (10)
mf (τc − 1) α
As for the nodes on the line MN marked with solid circles, their distribution func-
tions can be obtained through a spatial interpolation based on the information of
the open nodes.
To eliminate the possibility of a spatial asymmetry caused by interpolations,
the one-dimensional symmetric cubic spline fitting is used to calculate the unknown
nodes on the boundary of the fine block [Yu et al. (2002)], which is done by
According to the spatial continuity of f˜ and f˜ (the first derivative of f˜), the coef-
ficients (ai , bi , ci , di ) in Eq. (11) are computed as follows:
Mi−1 Mi
ai = bi =
6hi 6hi
, (12)
f˜i−1 Mi−1 hi f˜i M i hi
ci = − di = −
hi 6 hi 6
1840002-5
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
So the boundary post-collision distribution function for the nth evolution of the
fine block can be expressed as
Int. J. Comput. Methods Downloaded from www.worldscientific.com
f n n f n n
˜
fi (t) = 0.5 ˜
− 1 fi (t−1 ) − −1 + 1 f˜if (t0 )
mf mf mf mf
n n
+ 0.5 + 1 f˜if (t1 ), (15)
mf mf
n
where the present time is t = t0 + mf .
i>=N
end
1840002-6
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
The flowchart of the computational sequence for the MB-LBM in the two-block
system is shown in Fig. 3.
In this paper, considering the simplicity of the momentum-exchange method
[Mei et al. (1999)], it is used to calculate the force exerted onto the obstacle. In order
to differentiate the nodes in the computational domain, different kinds of node types
are employed to denote the fluid node, the boundary node of the computational
domain, the boundary node of blocks and the solid node. If particles at the solid
node xb (i, j) move to a fluid node along the direction α in the next step, values
(i, j, α) should be stored in an array to index the corresponding fα (xb ). With ᾱ
being the opposite direction of α, the force can be calculated by
1
by UNIVERSITY OF WESTERN AUSTRALIA on 05/16/17. For personal use only.
3. GPU Implementation
In this work, the simulation is carried out on a CPU (Intel Xeon(R) W3550,
3.07GHz) and a NVIDIA GPU (GeForce GTX 980 Ti). And the C language code
is developed based on CUDA.
In the CUDA architecture, the CPU is considered to be the host, while the
GPU is considered to be the device. The code is split up into the CPU and GPU
part, where the latter is called kernel, compiled by NVIDIA C-Compiler (NVCC).
When a kernel function is launched with the required parameters, the number of
blocks in a grid and the number of threads in each block (256 in this paper), it is
executed by these threads on the device. In one block, each thread is indexed by
a thread identification. Threads from different blocks cannot communicate, while
threads from the same block are independent, and can communicate via the shared
memory and have synchronized execution. A kernel is executed in the grid of thread
blocks which are indexed by the block identification. The grid terminates when all
threads of a kernel complete their execution, and the execution continues on the
host until another kernel is launched.
The memory access pattern of a kernel has a great influence on the implementa-
tion performance. The registers are trace buffer on GPU, and can be accessed with
nearly no time delay, yet the capacity is quite small. Hence, it should be avoided
to use excessive local variables in a kernel. The global memory is the largest device
memory in GPU, while it is much slower than the registers. In this work, as each
node requires nearly 200 bytes of memory for double precision computation, most
of the data will be stored in the global memory. Besides, there is a shared mem-
ory for each multiprocessor, allowing communication between threads, and it can be
accessed as fast as the registers. The constant memory, which is instantly accessible,
is used to store the constants that are read-only and accessed frequently.
The LBM code is highly parallelizable for its spatial locality [Mohamad (2011)].
In the collision step, the distribution functions of a certain node do not need to
communicate with its neighbors as shown in Eq. (5a). Nevertheless, in the streaming
1840002-7
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
step due to the communication with the surrounding nodes as indicated in Eq. (5b),
the spatial locality is slightly impaired, causing a misaligned write. Considering the
fact that the misaligned read is faster than the misaligned write [Obrecht et al.
(2011)], to reduce the impact to the efficiency of GPU, the streaming is modified to
Since there are always the same data types of variables recorded for each node, a
struct body, including pointers to node type, position, density, velocity, distribution
Int. J. Comput. Methods Downloaded from www.worldscientific.com
1840002-8
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
that in parallel computing the spatial interpolation is much more challenging than
the temporal interpolation, both in computational complexity and memory size for
Int. J. Comput. Methods Downloaded from www.worldscientific.com
storing the related data. Thus, it is better to place the largest mi on the finest level.
As mentioned above, the arrangement 1-2-3 is always better than 1-3-2.
In this work, to reduce the data copy between the host and the device, all the
procedures, except data output, are executed on the GPU directly. At the same
time, due to the fact that the atom function atomicAdd() in the CUDA toolkit can
only be used for Integer and Long values, the parallel reduction is used to calculate
the force in Eq. (16) after loading the position and direction.
Note that although the analysis in this paper is based on the two-dimensional
MB-LBM model [Yu et al. (2002)], it can be extended to the three-dimensional
model [Yu and Girimaji (2006)] without restrictions. As according to the statement
of Yu and Girimaji [2006], the three-dimensional MB-LBM is a further develop-
ment of the two-dimensional model. In the three-dimensional model, instead of the
one-dimensional cubic spline fitting, a two-dimensional fitting is employed to inter-
polate the information of the nodes on the boundary interface of the fine block. The
function cusparseDgtsv() can still be used to assist the cubic spline fitting.
1840002-9
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
(a) (b)
finest block is located in the corner regions. In Fig. 5(a), there are two levels of
blocks and four separate blocks in the calculation: block 1 and block 2 belong to
the first level, while block 3 and block 4 belong to the second level. The diagram in
Fig. 5(b) contains three levels and seven blocks: block 1 belongs to the first level,
block 2 and block 3 belong to the second level, and block 4 to block 7 belong to the
third level.
1840002-10
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
Fig. 6. (Continued)
In the present work, we set the simulation region to 128–128 in the lattice space
of unrefined grid. The initial density and velocity are set to unity and zero, respec-
tively. The upper wall moves with velocity U = 0.1, which is also the characteristic
velocity. The Reynolds number is defined by Re = UL/v, here L = 128. The moving
half-way bounce-back scheme is adopted for all the boundaries.
The dimensionless locations of the centers of the primary vortex, the lower left
vortex and the lower right vortex of the present work in Fig. 6 and of Ghia et al.
[1982] and Vanka [1986] are listed in Table 1. In Table 1, all the results show a
1840002-11
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
Table 1. Comparison of the vortex centers with previous literatures [Ghia et al. (1982); Vanka
(1986)].
100 1-2 (Fig. 5(a)) (0.6142, 0.7402) (0.0354, 0.0394) (0.9370, 0.0669)
Present (0.6172, 0.7344) (0.0313, 0.0391) (0.9453, 0.0625)
Ghia et al. [1982]
1000 1-4 (Fig. 5(a)) (0.5276, 0.5669) (0.0866, 0.0787) (0.8504, 0.1181)
Present (0.5313, 0.5625) (0.0859, 0.0781) (0.8594, 0.1094)
Ghia et al. [1982]
2000 1-2-2 (Fig. 5(b)) (0.5238, 0.5555) (0.0873, 0.1032) (0.8413, 0.0992)
Present 1-2-4 (Fig. 5(b)) (0.5238, 0.5555) (0.0873, 0.1032) (0.8413, 0.0992)
by UNIVERSITY OF WESTERN AUSTRALIA on 05/16/17. For personal use only.
Vanka [1986] 1-4-2 (Fig. 5(b)) (0.5238, 0.5555) (0.0873, 0.1032) (0.8413, 0.0992)
(0.5250, 0.5500) (0.0875, 0.1063) (0.8375, 0.0938)
Int. J. Comput. Methods Downloaded from www.worldscientific.com
good agreement with the previous works. And for Re = 2000, blocks with different
arrangements provide identical results up to four digits at most.
1840002-12
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
2FL aD
CL = ρU 2 D , and St = U , where FL is the lift force, FD is the drag force, and a
1840002-13
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
10 8
Author CD CL St
CPU GPU
Case Arrangement LUPS Time MLUPS Time MLUPS Acceleration
ratio
updated per step (LUPS), million lattices updated per second (MLUPS), and the
acceleration ratio of GPU to CPU. To be more specific, due to the data types used
in each node are the same, LUPS also represents the amount of data used in one
step, where a large MLUPS means a higher data processing speed.
It can be seen from Table 3 that the ratio of acceleration varies greatly, and the
performance on GPU is even better than that on CPU with the amount of data
increasing. The results in Table 3 also confirm that the arrangement of computa-
tional domain has a great impact on the performance of GPU. In the cases of 2
and 3, the resolutions of upper corners are identical, but the performance of case 2
is much better. Hence, it is not recommended to employ more levels for the same
1840002-14
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
5. Conclusion
In the present paper, we propose a straightforward two-dimensional MB-LBM par-
allel algorithm on single GPU. The accuracy of the GPU-based algorithm is exam-
ined by applying the code to simulate the benchmark cases, including the lid-driven
cavity flow and the flow past a circular cylinder, in two-dimensional geometries.
The satisfactory results are obtained in comparison with the previous literatures.
by UNIVERSITY OF WESTERN AUSTRALIA on 05/16/17. For personal use only.
Although the performance on GPU is consistently much better than that on CPU,
there are still some factors affecting the efficiency. To explore the influence, we con-
Int. J. Comput. Methods Downloaded from www.worldscientific.com
duct several simulations that dealing with different amounts of data, and with the
arrangements of different numbers of blocks and levels. It turns out that the accel-
eration ratio, compared with the performance on CPU, becomes even higher with
the amount of data increasing. Additionally, the arrangement of the computational
domain has a significant effect on the efficiency. Finally, the largest acceleration ratio
of 39.24 is achieved, despite it could still have a further improvement in dealing with
massive data.
Acknowledgments
This work is supported by the National Natural Science Foundation of China
(11502210, 51279165).
References
Aidun, C. K. and Clausen, J. R. [2010] “Lattice-Boltzmann method for complex flows,”
Annu. Rev. Fluid Mech. 42, 439–472.
Fan, Z., Qiu, F., Kaufman, A. and Yoakum-Stover, S. [2004] “GPU cluster for high per-
formance computing,” Proc. 2004 ACM/IEEE Conf. Supercomputing 2004, p. 47.
Farhat, H., Choi, W. and Lee, J. S. [2010] “Migrating multi-block lattice Boltzmann model
for immiscible mixtures: 3D algorithm development and validation,” Comput. Fluids
39, 1284–1295.
Farhat, H. and Lee, J. S. [2010] “Fundamentals of migrating multi-block lattice Boltzmann
model for immiscible mixtures in 2D geometries,” Int. J. Multiphase Flow 36, 769–779.
Filippova, O. and Hänel, D. [1998] “Grid refinement for lattice-BGK models,” J. Comput.
Phys. 147, 219–228.
Ghia, U., Ghia, K. N. and Shin, C. T. [1982] “High-resolutions for incompressible flow
using the Navier-Stokes equations and a multigrid method,” J. Comput. Phys. 48,
387–411.
Lin, C. L. and Lai, Y. G. [2000] “Lattice Boltzmann method on composite grids,” Phys.
Rev. E 62, 2219–2225.
Liu, H., Zhou, J. G. and Burrows, R. [2010] “Lattice Boltzmann simulations of the transient
shallow water flows,” Adv. Water Resour. 33, 387–396.
1840002-15
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
Mei, R., Yu, D., Shyy, W. and Luo, S. L. [1999] “Force evaluation in the lattice Boltzmann
method involving curved geometry,” Physical Review E 65, 041203.
Mohamad, A. A. [2011] Lattice Boltzmann Method: Fundamentals and Engineering Appli-
cations with Computer Codes (Springer-Verlag, London, UK).
Obrecht, C., Kuznik, F., Tourancheau, B. and Roux, J. J. [2011] “A new approach to the
lattice Boltzmann method for graphics processing units,” Comput. Math. Appl. 61,
3628–3638.
Peng, Y., Shu, C., Chew, Y. T., Niu, X. D. and Lu, X. Y. [2006] “Application of multi-
block approach in the immersed boundary–lattice Boltzmann method for viscous fluid
flows,” J. Comput. Phys. 218, 460–478.
Silva, A. L. E., Silveira-Neto, A. and Damasceno, J. J. R. [2003] “Numerical simulation of
two-dimensional flows over a circular cylinder using the immersed boundary method,”
J. Comput. Phys. 189, 351–370.
by UNIVERSITY OF WESTERN AUSTRALIA on 05/16/17. For personal use only.
Tölke, J. and Krafczyk, M. [2008] “TeraFLOP computing on a desktop PC with GPUs for
3D CFD,” Int. J. Comput. Fluid Dyn. 22, 443–456.
Int. J. Comput. Methods Downloaded from www.worldscientific.com
Tubbs, K. R. and Tsai, F. T. C. [2011] “GPU accelerated lattice Boltzmann model for
shallow water flow and mass transport,” Int. J. Numer. Methods Eng. 86, 316–334.
Vanka, S. P. [1986] “Block-implicit multigrid solution of Navier-Stokes equations in prim-
itive variables,” J. Comput. Phys. 65, 138–158.
Xu, S. and Wang, Z. J. [2006] “An immersed interface method for simulating the interaction
of a fluid with moving boundaries,” J. Comput. Phys. 216, 454–493.
Yu, D. and Girimaji, S. S. [2006] “Multi-block lattice Boltzmann method: extension to 3D
and validation in turbulence,” Physica A 362, 118–124.
Yu, D., Mei, R. and Shyy, W. [2002] “A multi-block lattice Boltzmann method for viscous
fluid flows,” Int. J. Numer. Methods Fluids 39, 99–120.
Zhou, H., Mo, G., Wu, F., Zhao, J., Rui, M. and Cen, K. [2012] “GPU implementation of
lattice Boltzmann method for flows with curved boundaries,” Comput. Methods Appl.
Mech. Eng. 225, 65–73.
1840002-16