MBlock

2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
International Journal of Computational Methods

Vol. 15, No. 2 (2018) 1840002 (16 pages)
c World Scientific Publishing Company
DOI: 10.1142/S0219876218400029
ICCM2016: The Implementation of Two-Dimensional

Multi-Block Lattice Boltzmann Method on GPU
Ya Zhang∗ , Guang Pan† and Qiaogao Huang

School of Marine Science and Technology
Northwestern Polytechnical University
by UNIVERSITY OF WESTERN AUSTRALIA on 05/16/17. For personal use only.
Xi’an 710072, China

∗zhangya9741@163.com
Int. J. Comput. Methods Downloaded from www.worldscientific.com
†panguang601@163.com
Received 9 October 2016

Revised 23 March 2017
Accepted 29 March 2017
Published 15 May 2017
A straightforward implementation of the Multi-Block Lattice Boltzmann Method (MB-

LBM) on a Graphics Processing Unit (GPU) is presented to accelerate the simulation of
fluid flows in two-dimensional geometries. The algorithm is measured in terms of both
accuracy and efficiency with the benchmark cases of the lid-driven cavity flow and the
flow past a circular cylinder, and satisfactory results are obtained. The results show
that the performance on GPU becomes even better with the amount of data increas-
ing. Moreover, the arrangement of the computational domain has a significant effect on
the efficiency. These results demonstrate the great potential of GPU on the MB-LBM,
especially when dealing with massive data.
Keywords: Multi-block lattice Boltzmann method; graphics processing unit; ratio of

acceleration.
1. Introduction
In the past decades, the Lattice Boltzmann Method (LBM) has been developed
as an alternative numerical method for simulating fluid flows [Aidun and Clausen
(2010)]. LBM is based on the statistical physics and derived from the Boltzmann
equation. A direct connection between the lattice Boltzmann equation and Navier–
Stokes equations has been established under the nearly incompressible condition
[Aidun and Clausen (2010)]. The spatial locality of LBM makes it more suitable
for parallel computing compared to the traditional computational fluid dynamic
methods [Mohamad (2011)].
The Graphics Processing Unit (GPU) is designed to process large graphics data
sets for rendering tasks. It has exceeded the computation speed of PC-based Central
† Corresponding author.
1840002-1
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
Y. Zhang, G. Pan & Q. Huang
Processing Unit (CPU) by more than one order of magnitude while being available
at a relatively lower cost. Another advantage for GPU application is that Compute
Unified Device Architecture (CUDA) provided by NVIDIA, reduces the develop-
ment threshold of GPU programming greatly. Due to the inherent parallelism of
LBM, a significant speedup of GPU-based computation of LBM has been reported
in the literatures. Fan et al. [2004] implemented the LBM simulations on a clus-
ter of GPUs with Message Passing Interface (MPI). Tölke and Krafczyk [2008]
implemented a three-dimensional LBM and achieved near teraflop computing on a
single workstation. Zhou et al. [2012] proposed an efficient GPU implementation of
flows with curved boundaries, leading to nearly an 18-fold increase in speed. Tubbs
and Tsai [2011] implemented LBM for solving the shallow water equations and the
advection dispersion equation on GPU, and the results indicated the promise of the
GPU-accelerated LBM for modeling mass transport phenomena in shallow water
flows. GPU has shown tremendous potential to accelerate LBM computation owing
to the explicit nature of LBM.
The standard LBM is usually employed on uniform grids, which makes the
evolution explicit and the algorithm simple, but at the same time could make the
computational effort increase dramatically on the road to a high mesh resolution.
To overcome the uniform grid restriction, a Multi-Block Lattice Boltzmann Method
(MB-LBM) has be designed and applied over the flow area where a relatively high
resolution is required. Unlike the standard LBM, in the refined domain the MB-LBM
evolves in a smaller time step, which can save the computational cost significantly.
As an effective tool of grid refinements in LBM, the multi-block technique has
been investigated in recent years. In 1998, Filippova and Hänel [1998] introduced a
local second-order grid refinement scheme for the lattice-BGK model and provided
the theoretical foundation for multi-block techniques. In 2000, Lin and Lai [2000]
designed a composite block-structured scheme by placing the fine grid blocks in
certain areas for the mesh refinement. In 2002, Yu et al. [2002] proposed a multi-
block method, in which the fine block is partially overlapped at the interfacial
lattices, increasing the model efficiency greatly. The model of Yu et al. [2002] has
been successfully applied to various areas. Yu and Girimaji [2006] extended this
model to 3D turbulence simulations. Peng et al. [2006] applied it in the immersed
boundary-lattice Boltzmann method with multi-relaxation-time. Liu et al. [2010]
validated the MB-LBM coupled with the large eddy simulation model in transient
shallow water flows simulation. Farhat and Lee [2010] and Farhat et al. [2010]
extended the single-phase MB-LBM to the multiphase Gunstensen model, in which
the grid is free to migrate with the suspended phase, and they also presented a three-
dimensional migrating multi-block model. However, to the authors’ knowledge there
are few detailed discussions on the application of the MB-LBM on GPU. In this
paper, we will devote to developing an efficient and straightforward algorithm for
the implementation of MB-LBM on GPU and validating it in terms of accuracy and
efficiency.
1840002-2
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
Two-Dimensional Multi-Block Lattice Boltzmann Method on GPU
The paper is organized as follows. In Sec. 2, the LBM and the multi-block scheme
are introduced. In Sec. 3, some details of the application of the MB-IBM on GPU
are discussed. In Sec. 4, the parallel MB-LBM algorithm is validated by the lid
driven flow and the flow around a circular cylinder, and the performance on GPU
is assessed. Section 5 is the conclusion.
2. Multi-Block Lattice Boltzmann Method

In the standard LBM, the fluid is described as imaginative particles, which stream
along a uniform lattice grid and collide with each other at the lattice nodes. The
distribution of particles is modeled with a set of distribution functions. From the
moments of the distribution functions, one can obtain the macroscopic variables of
the fluid, such as density and velocity. The lattice Bhatnagar–Gross–Krook (BGK)
model, also known as single-relation-time model, is the most popular LBM due to
its simplicity and efficiency [Aidun and Clausen (2010)]. Therefore, the lattice BGK
model is used in the present work. The distribution function evolves as the following
equation:
1
fα (x + eα δt, t + δt) = fα (x, t) − [fα (x, t) − fαeq (x, t)], (1)
τ
where fα is the particle distribution function representing the probability of particles
at the position x and time t with discrete particle velocity eα ; δt is the time step;
τ is the single-relaxation-time, related to the kinematic viscosity ν, τ = 3ν + 0.5;
with the two-dimensional nine-velocity discrete velocity model (D2Q9) [Mohamad
(2011)] shown in Fig. 1, e is expressed as

0 1 0 −1 0 1 −1 −1 1
e= . (2)
0 0 1 0 −1 1 1 −1 −1
fαeq is the equilibrium distribution function, obtained from the Taylor series expan-
sion of the Maxwell–Boltzmann distribution function with the velocity up to second
6 2 5
0
3 1
7 4 8
Fig. 1. Lattice pattern: D2Q9.
1840002-3
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
order. It is expressed as follows:

eq eα · u (eα · u)2 u·u
fα = ρwα 1 + + − , (3)
c2s 2c4s 2c2s
where wα is the weighting coefficient,√ valued by w0 = 4/9, w1−4 = 1/9 and w5−8 =
1/36; the sound speed is cs = 1/ 3; ρ and u are the macroscopic density and
velocity, respectively, and can be obtained from the distribution function:
8

ρ= fα , (4a)
α=0
8

ρu = eα f α . (4b)
α=0
The collision and streaming of particles are expressed as:

1
Collision: f˜α (x, t) = fα (x, t) − [fα (x, t) − fαeq (x, t)], (5a)
τ
Streaming: fα (x + eα δt, t + δt) = f˜α (x, t), (5b)
where f˜α (x, t) is the post-collision particle distribution function.
According to the Chapman–Enskog expansion analysis [Aidun and Clausen
(2010)], the macroscopic governing equations of the lattice Boltzmann equation
are the weakly incompressible Navier–Stokes equations, including the mass and
momentum conservation equations
∂ρ
+ ∇ · (ρu) = 0, (6a)
∂t
∂ρu
+ ∇ · (ρuu) = −∇p + ρν∇ · [ρ(∇u + (∇u)T )]. (6b)
∂t
This paper uses the multi-block method proposed by Yu et al. [2002], which
satisfies the continuity of mass, momentum and stresses across the interface. To
illustrate the basic idea, a two-block system consisting of a coarse block and a fine
block is sketched in Fig. 2.
The ratio of the lattice space between the coarse block and the fine block is
defined as:
m = δxc /δxf = mc mf , (7)
where the subscripts c and f refer to the coarse block and the fine block, respectively;
δxc and δxf are the lattice space, mc = 1 and mf = δxc /δxf are the lattice space
parameters. To maintain a viscosity consistency across blocks, the relaxation time
τf and τc should satisfy the following equation:
τf = 0.5 + mf (τc − 0.5). (8)
In Fig. 2, the line AB and MN are the boundary of the coarse block and the fine
block, respectively. Since the boundary nodes on the line AB are in the interior
1840002-4
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
M N
A B
Fig. 2. Interfaces structure between two blocks.
of the fine block, the information of these nodes can be obtained from the interior
nodes of the fine block. For the boundary nodes of the fine block, things are similar.
The communication of boundary information between different blocks is executed
after the collision, where the connections between f˜αc and f˜αf are written as:
τc − 1 ˜f
f˜αc = fαeq,f + mf (f − fαeq,f ), (9)
τf − 1 α
τf − 1
f˜αf = fαeq,c + (f˜c − fαeq,c ). (10)
mf (τc − 1) α
As for the nodes on the line MN marked with solid circles, their distribution func-
tions can be obtained through a spatial interpolation based on the information of
the open nodes.
To eliminate the possibility of a spatial asymmetry caused by interpolations,
the one-dimensional symmetric cubic spline fitting is used to calculate the unknown
nodes on the boundary of the fine block [Yu et al. (2002)], which is done by
f˜(x) = ai (xi − x)3 + bi (x − xi−1 )3 + ci (xi − x) + di (x − xi−1 )

. (11)
xi−1 ≤ x ≤ xi+1
According to the spatial continuity of f˜ and f˜ (the first derivative of f˜), the coef-
ficients (ai , bi , ci , di ) in Eq. (11) are computed as follows:
Mi−1 Mi
ai = bi =
6hi 6hi
, (12)
f˜i−1 Mi−1 hi f˜i M i hi
ci = − di = −
hi 6 hi 6
1840002-5
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
where Mi is the second derivative of f˜i , following the equation

0.5Mi−1 + 2Mi + 0.5Mi+1 = 3(2fi − fi−1 − fi+1 ). (13)
The natural spline end condition is stipulated with M0 = Mn = 0.
The three-point Lagrangian scheme is used in the temporal interpolation of the
post-collision distribution function of the boundary nodes [Yu et al. (2002)]:
 
1
1
 t − tk 
f˜if (t) = f˜if (tk ) 

. (14)
tk − tk 
k=−1 k =−1
k=k
So the boundary post-collision distribution function for the nth evolution of the
fine block can be expressed as

f n n f n n
˜
fi (t) = 0.5 ˜
− 1 fi (t−1 ) − −1 + 1 f˜if (t0 )
mf mf mf mf

n n
+ 0.5 + 1 f˜if (t1 ), (15)
mf mf
n
where the present time is t = t0 + mf .
Set initial values in all blocks
Stream in coarse blocks
Calculate macroscopic variables and

collision in coarse blocks
Spatial interpolation for boundary points in fine

blocks, and store them for temporal interpolation
Stream in fine blocks

Temporal interpolation to obtain values on
boundaries of fine blocks
Stream in fine blocks

Transfer boundary information in fine blocks to coarse blocks;

Transfer the spatial interpolation values to fine blocks boundaries
i>=N
cudaMemcpy to host, and output
end
Fig. 3. The flowchart of the computational sequence for MB-LBM.
1840002-6
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
The flowchart of the computational sequence for the MB-LBM in the two-block
system is shown in Fig. 3.
In this paper, considering the simplicity of the momentum-exchange method
[Mei et al. (1999)], it is used to calculate the force exerted onto the obstacle. In order
to differentiate the nodes in the computational domain, different kinds of node types
are employed to denote the fluid node, the boundary node of the computational
domain, the boundary node of blocks and the solid node. If particles at the solid
node xb (i, j) move to a fluid node along the direction α in the next step, values
(i, j, α) should be stored in an array to index the corresponding fα (xb ). With ᾱ
being the opposite direction of α, the force can be calculated by
1
F = [eα fα (xb ) − eᾱ fᾱ (xb + eα δt)]. (16)

m
All (i,j,α)
3. GPU Implementation
In this work, the simulation is carried out on a CPU (Intel Xeon(R) W3550,
3.07GHz) and a NVIDIA GPU (GeForce GTX 980 Ti). And the C language code
is developed based on CUDA.
In the CUDA architecture, the CPU is considered to be the host, while the
GPU is considered to be the device. The code is split up into the CPU and GPU
part, where the latter is called kernel, compiled by NVIDIA C-Compiler (NVCC).
When a kernel function is launched with the required parameters, the number of
blocks in a grid and the number of threads in each block (256 in this paper), it is
executed by these threads on the device. In one block, each thread is indexed by
a thread identification. Threads from different blocks cannot communicate, while
threads from the same block are independent, and can communicate via the shared
memory and have synchronized execution. A kernel is executed in the grid of thread
blocks which are indexed by the block identification. The grid terminates when all
threads of a kernel complete their execution, and the execution continues on the
host until another kernel is launched.
The memory access pattern of a kernel has a great influence on the implementa-
tion performance. The registers are trace buffer on GPU, and can be accessed with
nearly no time delay, yet the capacity is quite small. Hence, it should be avoided
to use excessive local variables in a kernel. The global memory is the largest device
memory in GPU, while it is much slower than the registers. In this work, as each
node requires nearly 200 bytes of memory for double precision computation, most
of the data will be stored in the global memory. Besides, there is a shared mem-
ory for each multiprocessor, allowing communication between threads, and it can be
accessed as fast as the registers. The constant memory, which is instantly accessible,
is used to store the constants that are read-only and accessed frequently.
The LBM code is highly parallelizable for its spatial locality [Mohamad (2011)].
In the collision step, the distribution functions of a certain node do not need to
communicate with its neighbors as shown in Eq. (5a). Nevertheless, in the streaming
1840002-7
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
step due to the communication with the surrounding nodes as indicated in Eq. (5b),
the spatial locality is slightly impaired, causing a misaligned write. Considering the
fact that the misaligned read is faster than the misaligned write [Obrecht et al.
(2011)], to reduce the impact to the efficiency of GPU, the streaming is modified to
fα (x, t + δt) = f˜α (x − eα δt, t), (17)
where the misaligned write is replaced with the misaligned read.

For systems containing multi-level blocks, according to the flowchart in Fig. 3,
the computation can be expressed as the recursive function in Fig. 4, where the
total level of the refinement is represented by LEVEL. For the simple two level
refinement as shown in Fig. 2, LEVEL = 2 is set before the simulation.
Since there are always the same data types of variables recorded for each node, a
struct body, including pointers to node type, position, density, velocity, distribution
functions and post-collision distribution functions, is created to store variable infor-

mation. With these pointers, the memory in the host and the device is allocated
dynamically before the evolution.
In the stage of the spatial interpolation, it is needed to obtain Mi in Eqs. (12)
and (13). In serial processing, the tridiagonal matrix in Eq. (13) is solved with
the Thomas algorithm. However, this method is almost unfeasible in a paral-
lel computing. The cuSPARSE library presented by NVIDIA contains a set of
basic linear algebra subroutines to handle sparse matrices in parallel mode. The
function cusparseDgtsv() is employed in this paper. It is used in the form of
Fig. 4. Program of the recursive function for MB-LBM.
1840002-8
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
cusparseDgtsv(cusparseHandle t handle, int m, int n, const double *dl, double *d,

double *du, double *B, int ldb), where handle is the handle to the cuSPARSE library
context; m is the size of the linear system (must be larger than or equal to three);
n is the columns of matrix B, which means Mi for different variables can be solved
in a single call; array dl, d, du contain the lower, the main, the upper diagonal of
the tridiagonal linear system, respectively; B is the right-hand-side array, ldb is the
leading dimension of B. The solution will be written in array B before the function
completes.
Note that in this work the arrangement of levels is expressed in the form of m1 -
. . .-mi -. . .-mn from the coarsest to the finest, where m1 is always 1, mi is the ratio of
the lattice space of level i to that of level i−1, n refers to the finest level. It is obvious
that in parallel computing the spatial interpolation is much more challenging than
the temporal interpolation, both in computational complexity and memory size for
storing the related data. Thus, it is better to place the largest mi on the finest level.
As mentioned above, the arrangement 1-2-3 is always better than 1-3-2.
In this work, to reduce the data copy between the host and the device, all the
procedures, except data output, are executed on the GPU directly. At the same
time, due to the fact that the atom function atomicAdd() in the CUDA toolkit can
only be used for Integer and Long values, the parallel reduction is used to calculate
the force in Eq. (16) after loading the position and direction.
Note that although the analysis in this paper is based on the two-dimensional
MB-LBM model [Yu et al. (2002)], it can be extended to the three-dimensional
model [Yu and Girimaji (2006)] without restrictions. As according to the statement
of Yu and Girimaji [2006], the three-dimensional MB-LBM is a further develop-
ment of the two-dimensional model. In the three-dimensional model, instead of the
one-dimensional cubic spline fitting, a two-dimensional fitting is employed to inter-
polate the information of the nodes on the boundary interface of the fine block. The
function cusparseDgtsv() can still be used to assist the cubic spline fitting.
4. Test Cases and Discussion

Lid-driven cavity flow
As a classic benchmark case, the lid-driven cavity flow has been extensively used
to measure the accuracy of a numerical method. And there are plenty of experi-
mental and numerical results available [Ghia et al. (1982); Vanka (1986)] to verify
the present results. With a moving upper wall, a primary vortex will be formed in the
middle of the cavity, and two small vortexes will appear at the lower corners. As the
upper corners are singularity points, a higher resolution near the upper corners will
contribute to increase the stability of the simulation. To explore the influence of the
block arrangement on the computational efficiency, the simulations are carried out
on two different multi-block computational domains, as shown in Fig. 5.
In the arrangements shown in Fig. 5, the finest blocks are placed on the areas
of singularity points or with sharp gradients before the simulations. In Fig. 5, the
1840002-9
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002

(a) (b)
Fig. 5. Arrangements of blocks for the lid driven cavity flow.
finest block is located in the corner regions. In Fig. 5(a), there are two levels of
blocks and four separate blocks in the calculation: block 1 and block 2 belong to
the first level, while block 3 and block 4 belong to the second level. The diagram in
Fig. 5(b) contains three levels and seven blocks: block 1 belongs to the first level,
block 2 and block 3 belong to the second level, and block 4 to block 7 belong to the
third level.
(a) Re = 100 (1-2) (b) Re = 1000 (1-4)
Fig. 6. Streamlines for the lid driven flow.
1840002-10
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002

(c) Re = 2000 (1-2-2) (d) Re = 2000 (1-2-4)
(e) Re = 2000 (1-4-2)
Fig. 6. (Continued)
In the present work, we set the simulation region to 128–128 in the lattice space
of unrefined grid. The initial density and velocity are set to unity and zero, respec-
tively. The upper wall moves with velocity U = 0.1, which is also the characteristic
velocity. The Reynolds number is defined by Re = UL/v, here L = 128. The moving
half-way bounce-back scheme is adopted for all the boundaries.
The dimensionless locations of the centers of the primary vortex, the lower left
vortex and the lower right vortex of the present work in Fig. 6 and of Ghia et al.
[1982] and Vanka [1986] are listed in Table 1. In Table 1, all the results show a
1840002-11
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
Table 1. Comparison of the vortex centers with previous literatures [Ghia et al. (1982); Vanka
(1986)].
Re Arrangement Primary vortex Lower left vortex Lower right vortex
100 1-2 (Fig. 5(a)) (0.6142, 0.7402) (0.0354, 0.0394) (0.9370, 0.0669)
Present (0.6172, 0.7344) (0.0313, 0.0391) (0.9453, 0.0625)
Ghia et al. [1982]
1000 1-4 (Fig. 5(a)) (0.5276, 0.5669) (0.0866, 0.0787) (0.8504, 0.1181)
Present (0.5313, 0.5625) (0.0859, 0.0781) (0.8594, 0.1094)
Ghia et al. [1982]
2000 1-2-2 (Fig. 5(b)) (0.5238, 0.5555) (0.0873, 0.1032) (0.8413, 0.0992)
Present 1-2-4 (Fig. 5(b)) (0.5238, 0.5555) (0.0873, 0.1032) (0.8413, 0.0992)
Vanka [1986] 1-4-2 (Fig. 5(b)) (0.5238, 0.5555) (0.0873, 0.1032) (0.8413, 0.0992)
(0.5250, 0.5500) (0.0875, 0.1063) (0.8375, 0.0938)
good agreement with the previous works. And for Re = 2000, blocks with different
arrangements provide identical results up to four digits at most.
Flow past a circular cylinder

The flow past a stationary cylinder is also characterized by the Reynolds number.
Here the characteristic length is the diameter of the cylinder, and the characteristic
velocity is the far field streaming velocity. For Re = 100, the flow is unsteady and
periodic. There will be some vortexes shedding alternatively at the wake of the
cylinder, and the velocity gradient behind the cylinder is larger than elsewhere.
Therefore, the flow is employed to assess the efficiency of the algorithm for dealing
with more levels and blocks.
The arrangement of the computational domain is shown in Fig. 7. There are
four levels of blocks in the simulation. Block 1 to block 4 belong to the first level;
block 5 and block 6 belong to level two; block 7 to block 9 belong to level 3; block
10 belongs to level 4, the finest level. The ratio of the lattice space between adjacent
levels is 1-2-2-2.
In this calculation, the cylinder diameter D is set to 6 in the lattice space of
unrefined grid. The length and width of the simulation region are 320 and 128,
respectively. The center of the cylinder is at (64, 64), which is located in the finest
block, as shown in Fig. 7. The moving half-way bounce-back scheme is applied on
the top and bottom boundaries. The standard bounce back scheme is used on the
cylinder surface. The velocity and the pressure boundary conditions of Zou and He
are imposed on the inlet and the outlet boundaries, respectively, where the far field
velocity is set to U0 = 0.1. The initial density is unity. The kinematic viscosity for
the first level grid is computed by Re = 100, based on the far field velocity and the
diameter of the cylinder, v = U0 D/Re.
In this case, the drag coefficient, the lift coefficient and the Strouhal number are
2FD
computed to verify the results, which are, respectively, defined by CD = ρU 2 D and
1840002-12
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002

Fig. 7. Arrangement of blocks for the flow past a circular cylinder.
Fig. 8. Velocity contour for the flow past a circular cylinder.
2FL aD
CL = ρU 2 D , and St = U , where FL is the lift force, FD is the drag force, and a
is the frequency of vortex-shedding, obtained by processing FL with Fast Fourier

Transform.
Figure 8 depicts the velocity contour for the flow past a circular cylinder. The
instantaneous vorticity contour of vortex shedding is plotted in Fig. 9. It can be
clearly seen that the vorticity is rather smooth across the block interface. It indi-
cates that the multi-block scheme works well for the unsteady flow. The numerical
results in Table 2 also show that the present results agree well with the data in the
literatures.
The performance of MB-LBM code on GPU

To assess the performance of MB-LBM on CPU and GPU, in Table 3 we list the
following parameters, the time for 104 steps (in second), the number of lattices
1840002-13
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
10 8
Fig. 9. Vorticity contour for the flow past a circular cylinder.

Table 2. Comparison of results at Re = 100

with previous literatures [Zhou et al. (2012);

Silva et al. (2003); Xu and Wang (2006)]
Author CD CL St
Zhou et al. [2012] 1.428 0.315 0.172

Silva et al. [2003] 1.39 — 0.16
Xu and Wang [2006] 1.423 0.34 0.171
Present 1.381 0.304 0.168
Table 3. Performance of CPU and GPU for 104 steps.
CPU GPU
Case Arrangement LUPS Time MLUPS Time MLUPS Acceleration
ratio
1 1–2 (Fig. 5(a)) 31,267 205.66 1.52 64.52 4.85 3.19

2 1–4 (Fig. 5(a)) 145,691 1,083.28 1.34 86.63 16.82 12.50
3 1–2–2 (Fig. 5(b)) 300,688 1,894.23 1.59 245.59 12.24 7.71
4 1–2–4 (Fig. 5(b)) 2,090,912 19,543.09 1.07 498.00 42.00 39.24
5 1–4–2 (Fig. 5(b)) 2,326,104 21,736.04 1.07 659.12 35.29 33.00
6 1–2–2–2 (Fig. 7) 790,392 5,835.10 1.35 578.71 13.66 10.08
updated per step (LUPS), million lattices updated per second (MLUPS), and the
acceleration ratio of GPU to CPU. To be more specific, due to the data types used
in each node are the same, LUPS also represents the amount of data used in one
step, where a large MLUPS means a higher data processing speed.
It can be seen from Table 3 that the ratio of acceleration varies greatly, and the
performance on GPU is even better than that on CPU with the amount of data
increasing. The results in Table 3 also confirm that the arrangement of computa-
tional domain has a great impact on the performance of GPU. In the cases of 2
and 3, the resolutions of upper corners are identical, but the performance of case 2
is much better. Hence, it is not recommended to employ more levels for the same
1840002-14
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
resolution. In addition, considering the time consumed by the spatial interpolation

in MB-LBM and the performance of case 4 and case 5, it is better to place the
largest ratio of the lattice space between adjacent levels mi on the finest level.
5. Conclusion
In the present paper, we propose a straightforward two-dimensional MB-LBM par-
allel algorithm on single GPU. The accuracy of the GPU-based algorithm is exam-
ined by applying the code to simulate the benchmark cases, including the lid-driven
cavity flow and the flow past a circular cylinder, in two-dimensional geometries.
The satisfactory results are obtained in comparison with the previous literatures.
Although the performance on GPU is consistently much better than that on CPU,
there are still some factors affecting the efficiency. To explore the influence, we con-
duct several simulations that dealing with different amounts of data, and with the
arrangements of different numbers of blocks and levels. It turns out that the accel-
eration ratio, compared with the performance on CPU, becomes even higher with
the amount of data increasing. Additionally, the arrangement of the computational
domain has a significant effect on the efficiency. Finally, the largest acceleration ratio
of 39.24 is achieved, despite it could still have a further improvement in dealing with
massive data.
Acknowledgments
This work is supported by the National Natural Science Foundation of China
(11502210, 51279165).
References
Aidun, C. K. and Clausen, J. R. [2010] “Lattice-Boltzmann method for complex flows,”
Annu. Rev. Fluid Mech. 42, 439–472.
Fan, Z., Qiu, F., Kaufman, A. and Yoakum-Stover, S. [2004] “GPU cluster for high per-
formance computing,” Proc. 2004 ACM/IEEE Conf. Supercomputing 2004, p. 47.
Farhat, H., Choi, W. and Lee, J. S. [2010] “Migrating multi-block lattice Boltzmann model
for immiscible mixtures: 3D algorithm development and validation,” Comput. Fluids
39, 1284–1295.
Farhat, H. and Lee, J. S. [2010] “Fundamentals of migrating multi-block lattice Boltzmann
model for immiscible mixtures in 2D geometries,” Int. J. Multiphase Flow 36, 769–779.
Filippova, O. and Hänel, D. [1998] “Grid refinement for lattice-BGK models,” J. Comput.
Phys. 147, 219–228.
Ghia, U., Ghia, K. N. and Shin, C. T. [1982] “High-resolutions for incompressible flow
using the Navier-Stokes equations and a multigrid method,” J. Comput. Phys. 48,
387–411.
Lin, C. L. and Lai, Y. G. [2000] “Lattice Boltzmann method on composite grids,” Phys.
Rev. E 62, 2219–2225.
Liu, H., Zhou, J. G. and Burrows, R. [2010] “Lattice Boltzmann simulations of the transient
shallow water flows,” Adv. Water Resour. 33, 387–396.
1840002-15
2nd Reading
May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002
Mei, R., Yu, D., Shyy, W. and Luo, S. L. [1999] “Force evaluation in the lattice Boltzmann
method involving curved geometry,” Physical Review E 65, 041203.
Mohamad, A. A. [2011] Lattice Boltzmann Method: Fundamentals and Engineering Appli-
cations with Computer Codes (Springer-Verlag, London, UK).
Obrecht, C., Kuznik, F., Tourancheau, B. and Roux, J. J. [2011] “A new approach to the
lattice Boltzmann method for graphics processing units,” Comput. Math. Appl. 61,
3628–3638.
Peng, Y., Shu, C., Chew, Y. T., Niu, X. D. and Lu, X. Y. [2006] “Application of multi-
block approach in the immersed boundary–lattice Boltzmann method for viscous fluid
flows,” J. Comput. Phys. 218, 460–478.
Silva, A. L. E., Silveira-Neto, A. and Damasceno, J. J. R. [2003] “Numerical simulation of
two-dimensional flows over a circular cylinder using the immersed boundary method,”
J. Comput. Phys. 189, 351–370.
Tölke, J. and Krafczyk, M. [2008] “TeraFLOP computing on a desktop PC with GPUs for
3D CFD,” Int. J. Comput. Fluid Dyn. 22, 443–456.
Tubbs, K. R. and Tsai, F. T. C. [2011] “GPU accelerated lattice Boltzmann model for
shallow water flow and mass transport,” Int. J. Numer. Methods Eng. 86, 316–334.
Vanka, S. P. [1986] “Block-implicit multigrid solution of Navier-Stokes equations in prim-
itive variables,” J. Comput. Phys. 65, 138–158.
Xu, S. and Wang, Z. J. [2006] “An immersed interface method for simulating the interaction
of a fluid with moving boundaries,” J. Comput. Phys. 216, 454–493.
Yu, D. and Girimaji, S. S. [2006] “Multi-block lattice Boltzmann method: extension to 3D
and validation in turbulence,” Physica A 362, 118–124.
Yu, D., Mei, R. and Shyy, W. [2002] “A multi-block lattice Boltzmann method for viscous
fluid flows,” Int. J. Numer. Methods Fluids 39, 99–120.
Zhou, H., Mo, G., Wu, F., Zhao, J., Rui, M. and Cen, K. [2012] “GPU implementation of
lattice Boltzmann method for flows with curved boundaries,” Comput. Methods Appl.
Mech. Eng. 225, 65–73.
1840002-16

MBlock

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

MBlock

Uploaded by

Copyright:

Available Formats

2nd Reading

May 12, 2017 9:29 WSPC/0219-8762 196-IJCM 1840002

International Journal of Computational Methods

ICCM2016: The Implementation of Two-Dimensional

Ya Zhang∗ , Guang Pan† and Qiaogao Huang

Xi’an 710072, China

Received 9 October 2016

A straightforward implementation of the Multi-Block Lattice Boltzmann Method (MB-

Keywords: Multi-block lattice Boltzmann method; graphics processing unit; ratio of

Y. Zhang, G. Pan & Q. Huang

Two-Dimensional Multi-Block Lattice Boltzmann Method on GPU

2. Multi-Block Lattice Boltzmann Method

Fig. 1. Lattice pattern: D2Q9.

Y. Zhang, G. Pan & Q. Huang

order. It is expressed as follows:

The collision and streaming of particles are expressed as:

Two-Dimensional Multi-Block Lattice Boltzmann Method on GPU

Fig. 2. Interfaces structure between two blocks.

f˜(x) = ai (xi − x)3 + bi (x − xi−1 )3 + ci (xi − x) + di (x − xi−1 )

Y. Zhang, G. Pan & Q. Huang

where Mi is the second derivative of f˜i , following the equation

Set initial values in all blocks

Stream in coarse blocks

Calculate macroscopic variables and

Spatial interpolation for boundary points in fine

Stream in fine blocks

Calculate macroscopic variables and

Stream in fine blocks

Calculate macroscopic variables and

Transfer boundary information in fine blocks to coarse blocks;

cudaMemcpy to host, and output

Fig. 3. The ﬂowchart of the computational sequence for MB-LBM.

Two-Dimensional Multi-Block Lattice Boltzmann Method on GPU

F = [eα fα (xb ) − eᾱ fᾱ (xb + eα δt)]. (16)

Y. Zhang, G. Pan & Q. Huang

fα (x, t + δt) = f˜α (x − eα δt, t), (17)

where the misaligned write is replaced with the misaligned read.

functions and post-collision distribution functions, is created to store variable infor-

Fig. 4. Program of the recursive function for MB-LBM.

Two-Dimensional Multi-Block Lattice Boltzmann Method on GPU

cusparseDgtsv(cusparseHandle t handle, int m, int n, const double *dl, double *d,

4. Test Cases and Discussion

Y. Zhang, G. Pan & Q. Huang

Fig. 5. Arrangements of blocks for the lid driven cavity ﬂow.

(a) Re = 100 (1-2) (b) Re = 1000 (1-4)

Fig. 6. Streamlines for the lid driven ﬂow.

Two-Dimensional Multi-Block Lattice Boltzmann Method on GPU

(c) Re = 2000 (1-2-2) (d) Re = 2000 (1-2-4)

(e) Re = 2000 (1-4-2)

Y. Zhang, G. Pan & Q. Huang

Re Arrangement Primary vortex Lower left vortex Lower right vortex

Flow past a circular cylinder

Two-Dimensional Multi-Block Lattice Boltzmann Method on GPU

Fig. 7. Arrangement of blocks for the ﬂow past a circular cylinder.

Fig. 8. Velocity contour for the ﬂow past a circular cylinder.

is the frequency of vortex-shedding, obtained by processing FL with Fast Fourier

The performance of MB-LBM code on GPU

Y. Zhang, G. Pan & Q. Huang

Fig. 9. Vorticity contour for the ﬂow past a circular cylinder.

Table 2. Comparison of results at Re = 100

with previous literatures [Zhou et al. (2012);

Zhou et al. [2012] 1.428 0.315 0.172

Table 3. Performance of CPU and GPU for 104 steps.

cusparseDgtsv(cusparseHandle t handle, int m, int n, const double dl, double d,