PII: S0045-7930(18)30043-4
DOI: 10.1016/j.compfluid.2018.01.033
Reference: CAF 3716
Please cite this article as: Fu-Sheng Hsu, Keh-Chin Chang, Matthew Smith, Multi-Block Adaptive Mesh
Refinement (AMR) for a lattice Boltzmann solver using GPUs, Computers and Fluids (2018), doi:
10.1016/j.compfluid.2018.01.033
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT
Highlights
• The parallel efficiency of the block AMR implementation using GPU devices is investigated.
• The influence block-based AMR plays upon the flow field and parallel implementation using GPU devices is investigated.
a Department of Aeronautics and Astronautics, National Cheng Kung University, 701, Taiwan
b Department of Mechanical Engineering, National Cheng Kung University, 701, Taiwan
Abstract
An Adaptive Mesh Refinement (AMR) technique for a lattice Boltzmann solver [1] using GPU acceleration is developed based around a multi-block structured uniform mesh code [2, 3]. AMR is obtained by deploying multiple levels of mesh blocks, with varying resolution employed within each block, for increased resolution of flows in regions requiring higher accuracy. A simple three-dimensional benchmark is employed for confirming the accuracy and parallel performance of the developed solver.
1. Introduction
solving the Navier-Stokes equations. One of many reasons is that the LBM can be easily parallelized due to its high degree of locality. Unlike traditional numerical schemes that discretize the macroscopic equations, the LBM is based on a simplified kinetic theory of gases to simulate the flow field.

∗ Corresponding author. Email address: msmith@mail.ncku.edu.tw (Matthew Smith)
In the lattice Boltzmann method, the governing equation is the lattice Boltzmann equation (LBE) modeled with a collision operator, written as:

f_i(x + c_i δt, t + δt) − f_i(x, t) = Ω_i    (1)

where c_i is the discrete velocity and Ω_i is the collision operator. The most popular model in LBM is the LBE with the BGK collision operator,

Ω_i = −(1/τ) [f_i(x, t) − f_i^eq(x, t)]    (2)

which is commonly written as:

f_i(x + c_i δt, t + δt) = f_i(x, t) − (1/τ) [f_i(x, t) − f_i^eq(x, t)]    (3)

where τ is the single relaxation time and f_i^eq is the equilibrium distribution function:

f_i^eq = w_i ρ [1 + (c_i · u)/c_s² + (c_i · u)²/(2 c_s⁴) − (u · u)/(2 c_s²)]    (4)

where w_i is the weighting function for the discrete velocity and c_s is the sound speed, which is model dependent. The DnQm (n dimensional, m velocity) models proposed by Qian [4] are the most representative models. The common models are D2Q9, D3Q15, D3Q19, and D3Q27, with the model used in this research being D3Q15 (Figure 1), which gives:

c_s = 1/√3    (5)

c_i = c × { (0, 0, 0), i = 1;  (±1, 0, 0), (0, ±1, 0), (0, 0, ±1), i = 2, 3, …, 7;  (±1, ±1, ±1), i = 8, 9, …, 15 }    (6)
Figure 1: D3Q15 model
w_i = { 2/9, i = 1;  1/9, i = 2, 3, …, 7;  1/72, i = 8, 9, …, 15 }    (7)
c = δx/δt in equation (6) is the lattice speed, where δx is the lattice spacing and δt is the time step. The macroscopic properties, such as fluid density and velocity, can be obtained from the discrete distribution function:
ρ = Σ_i f_i    (8)

ρu = Σ_i c_i f_i    (9)
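As a quick sanity check, the D3Q15 velocity set and weights of equations (5)-(7), together with the moments of equations (8) and (9), can be verified numerically. The following is an illustrative NumPy sketch of ours, not code from the paper:

```python
import numpy as np

# D3Q15 velocity set, eq. (6), with lattice speed c = 1:
# one rest velocity, six face neighbours, eight corner neighbours.
c = np.array(
    [[0, 0, 0]]
    + [[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1]]
    + [[sx, sy, sz] for sx in (1, -1) for sy in (1, -1) for sz in (1, -1)]
)
# D3Q15 weights, eq. (7): 2/9 (rest), 1/9 (faces), 1/72 (corners).
w = np.array([2 / 9] + [1 / 9] * 6 + [1 / 72] * 8)
cs2 = 1 / 3  # c_s^2, from eq. (5)

# Lattice isotropy checks: the weights sum to one, the first moment
# vanishes, and the second moment reproduces c_s^2 * delta_ab.
assert np.isclose(w.sum(), 1.0)
assert np.allclose(np.einsum("i,ia->a", w, c), 0.0)
assert np.allclose(np.einsum("i,ia,ib->ab", w, c, c), cs2 * np.eye(3))

# Moments of an equilibrium state, eqs. (4), (8) and (9)
rho, u = 1.2, np.array([0.05, -0.02, 0.01])
cu = c @ u
feq = w * rho * (1 + cu / cs2 + cu**2 / (2 * cs2**2) - (u @ u) / (2 * cs2))
assert np.isclose(feq.sum(), rho)        # density, eq. (8)
assert np.allclose(c.T @ feq, rho * u)   # momentum, eq. (9)
```

These isotropy conditions are exactly what makes the summations (8) and (9) recover ρ and ρu from the discrete populations.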
Even though the LBE is solved in the process, the Navier-Stokes equations can be recovered in the macroscopic limit through a Chapman-Enskog expansion. The LBE is commonly solved in two steps, collision and streaming:
Collision:

f_i*(x, t + δt) = f_i(x, t) − (1/τ) [f_i(x, t) − f_i^eq(x, t)]    (10)

Streaming:

f_i(x + c_i δt, t + δt) = f_i*(x, t + δt)    (11)

where f_i* represents the post-collision state and the single relaxation time τ can be defined as

τ = ν/(c_s² δt) + 1/2    (12)

where ν is the kinematic viscosity. Note that the collision step is a purely local procedure within each lattice. For the streaming step, even though particles move from one lattice to another, the m discrete velocities can be handled separately.
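The two steps can be sketched end-to-end on a small periodic lattice. The following is a minimal NumPy illustration of eqs. (10)-(12) for D3Q15, not the authors' GPU implementation; all function names are ours:

```python
import numpy as np

# D3Q15 set (c = 1) and weights, as in eqs. (6)-(7)
c = np.array(
    [[0, 0, 0]]
    + [[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1]]
    + [[sx, sy, sz] for sx in (1, -1) for sy in (1, -1) for sz in (1, -1)]
)
w = np.array([2 / 9] + [1 / 9] * 6 + [1 / 72] * 8)
cs2 = 1 / 3

def feq(rho, u):
    """Equilibrium distributions, eq. (4); rho is (NX,NY,NZ), u is (NX,NY,NZ,3)."""
    cu = np.einsum("ia,xyza->ixyz", c, u)
    uu = np.einsum("xyza,xyza->xyz", u, u)
    return w[:, None, None, None] * rho * (
        1 + cu / cs2 + cu**2 / (2 * cs2**2) - uu / (2 * cs2)
    )

def collide_and_stream(f, tau):
    rho = f.sum(axis=0)                                     # eq. (8)
    u = np.einsum("ixyz,ia->xyza", f, c) / rho[..., None]   # eq. (9)
    fstar = f - (f - feq(rho, u)) / tau                     # collision, eq. (10)
    # streaming, eq. (11): shift each population along its own velocity
    return np.stack([np.roll(fstar[i], c[i], axis=(0, 1, 2)) for i in range(15)])

# demo: density bump on an 8^3 periodic lattice, started at equilibrium
nu, dt = 0.05, 1.0
tau = nu / (cs2 * dt) + 0.5            # eq. (12)
rho0 = np.ones((8, 8, 8)); rho0[4, 4, 4] += 0.1
f = feq(rho0, np.zeros((8, 8, 8, 3)))
for _ in range(10):
    f = collide_and_stream(f, tau)
assert np.isclose(f.sum(), rho0.sum())  # collision and streaming conserve mass
```

The locality noted above is visible here: the collision touches only on-site data, while the streaming is a pure memory shift per discrete velocity, which is what maps well to GPU hardware.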
3. Multi-Block Mesh Refinement

The computational domain is first decomposed into many smaller uniform mesh blocks. The mesh refinement is then obtained by deploying multiple levels of mesh blocks, each with varying resolution employed, for increased resolution of flows in regions requiring higher accuracy.
Figure 2: Adjacent mesh blocks with two different mesh levels, 0 and 1.
Based on this idea, within each mesh block the problem reduces to a uniform mesh problem that can be solved with a conventional LBM using GPU [3] with little modification. However, there is no guarantee that two adjacent mesh blocks have the same mesh level (as shown in Figure 2). Hence, special treatment must be applied at these adjacent mesh blocks. There are previous investigations regarding the data exchange at such boundaries; some common techniques are finite difference LBM [6], finite volume LBM [7], interpolation supplemented LBM [8], and locally embedded uniform grid techniques [9]. In this research, a generic, mass-conserving, locally embedded technique proposed by Rohde [1] is implemented. Their technique is reported to require neither spatial interpolation nor rescaling of the non-equilibrium distribution: the former leads to some difficulty in mass conservation, while the latter restricts the collision operator (i.e., to a single relaxation operator only). The detailed methodology of the scheme can be found in [1].
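The block organisation described above can be sketched as follows. This is an illustrative data structure of ours, not the authors' implementation: each block is a small uniform mesh carrying its own refinement level, and any face shared by blocks of different levels is flagged for the special coarse-fine treatment:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class MeshBlock:
    index: tuple   # (bx, by, bz) position in the block grid
    level: int     # refinement level n; lattice spacing halves per level

def coarse_fine_interfaces(levels):
    """List pairs of adjacent block indices whose mesh levels differ.

    `levels` is an integer array of shape (MBX, MBY, MBZ).
    """
    pairs = []
    nbx, nby, nbz = levels.shape
    for bx in range(nbx):
        for by in range(nby):
            for bz in range(nbz):
                for dx, dy, dz in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
                    nb = (bx + dx, by + dy, bz + dz)
                    if (nb[0] < nbx and nb[1] < nby and nb[2] < nbz
                            and levels[bx, by, bz] != levels[nb]):
                        pairs.append(((bx, by, bz), nb))
    return pairs

# e.g. a 4 x 2 x 2 block grid with two refined (level 1) blocks
levels = np.zeros((4, 2, 2), dtype=int)
levels[1, 0, :] = 1
blocks = [MeshBlock(idx, int(levels[idx])) for idx in np.ndindex(*levels.shape)]
interfaces = coarse_fine_interfaces(levels)
# each of the two refined blocks touches three level-0 neighbours here
assert len(interfaces) == 6
```

Interior block faces between equal-level blocks only need a plain halo exchange; only the flagged pairs require the mass-conserving coarse-fine exchange of [1].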
4. Benchmark Problem

The flow field of a three-dimensional flow over a square cylinder (0.2 × 0.2 × 1) was chosen to test the developed parallelized multi-block mesh refinement code.
ED
1
CE
-0.5
AC
A'
-1
-1
0
1
Periodic in z direction 2
x 3
4 -0.5
0
5 0.5 z
6
Periodic boundary conditions are applied at the boundaries (z = ±1) in the z direction.
A series of test cases, as listed in Table 1 and Table 2, was performed. Table 1 shows the test cases for the uniform mesh code using a single CPU or GPU, where (NX, NY, NZ) are the numbers of lattices in the x, y, z directions, respectively. On the other hand, Table 2 shows the test cases for the mesh refinement code using GPU. Here (MBX, MBY, MBZ) are the numbers of mesh blocks in the x, y, z directions, and (NX, NY, NZ) are now the numbers of lattices within each mesh block. Note that the mesh refinement code can be used to simulate a uniform mesh as well (case 1 to case 4 in Table 2) by setting the mesh block level equal throughout the computational domain. The mesh refinement (cases 5 and 6 in Table 2) was employed around the cylinder (as shown with the blue dashed line in Figure 3).
Table 1: Test cases for uniform mesh using CPU/GPU.

Case          | 1             | 2 (= 3)        | 4
(NX, NY, NZ)  | (192, 64, 32) | (384, 128, 64) | (768, 256, 128)
Total mesh    | 393,216       | 3,145,728      | 25,165,824
Table 2: Test cases for the mesh refinement code using GPU.

Case              | 1           | 2             | 3
(MBX, MBY, MBZ)   | (24, 8, 4)  | (48, 16, 8)   | (24, 8, 4)
2^n (NX, NY, NZ)  | (8, 8, 8)   | (8, 8, 8)     | 2 (8, 8, 8)
Mesh level, n     | 0           | 0             | 1
Total mesh        | 393,216     | 3,145,728     | 3,145,728

Case              | 4           | 5             | 6
(MBX, MBY, MBZ)   | (48, 16, 8) | (24, 8, 4)    | (48, 16, 8)
2^n (NX, NY, NZ)  | 2 (8, 8, 8) | 2^n (8, 8, 8) | 2^n (8, 8, 8)
Mesh level, n     | 1           | n = 1 if 0.5 < x < 2 and −0.5 < y < 0.5; n = 0 elsewhere
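The refinement rule for cases 5 and 6 in Table 2 can be written directly as a predicate on the block position; a one-function sketch (the function name is ours):

```python
def mesh_level(x, y):
    """Mesh level n for cases 5 and 6: level 1 inside the box around
    the cylinder and its near wake, level 0 elsewhere."""
    return 1 if (0.5 < x < 2.0) and (-0.5 < y < 0.5) else 0

assert mesh_level(1.0, 0.0) == 1   # inside the refined region
assert mesh_level(3.0, 0.0) == 0   # far downstream: coarse level 0
assert mesh_level(1.0, 0.8) == 0   # off to the side: coarse level 0
```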
5. Results
the beginning of the simulation, whereas CPU2 uses a similar algorithm to the GPU codes [3], which moves the particles in one direction at a time in order. The other two are parallel codes using a single GPU device, which employ (i) a uniform Cartesian grid over the entire domain, which we refer to as GPU uniform, and (ii) an Adaptive Mesh Refinement (AMR) approach, which we refer to as the AMR result. GPU uniform adopts the method proposed by Kuznik et al. in [3], while the AMR method is the multi-block mesh refinement code developed in this research based on [1, 3].
The velocity components and the computational performance results are presented in this section. Before moving on to the mesh refinement, the GPU codes need to be verified against the CPU codes. To examine the developed GPU codes, the simulation results of case 2 are presented in Figure 4 and Figure 5. Figure 4 shows the velocity in the x direction, u, along the A-A′ axis (see Figure 3).
Figure 6: Comparison of x velocity, u, along A-A′ (G2). Figure 7: Comparison of y velocity, v, along A-A′ (G2).
contains cases 2, 4, and 6. Note that only the results of G2 are presented here. From Figure 6, one can see that the resulting u velocity leans towards the result of the finer mesh (case 4) around the cylinder. As for the v velocity (Figure 7), there are small (relative to the mesh size) discontinuous jumps at x = 0.5 or x = 2 due to the difference in mesh sizes.
The time needed for each code to run the test cases is also considered in this research. To evaluate the performance of the parallel codes, the speed up ratio is defined as

S = T_serial / T_parallel

where T_serial is the time needed by the serial codes, CPU1 and CPU2, and T_parallel is the time needed by the GPU parallel codes, GPU uniform and AMR.
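For concreteness, the speed up ratio is simply the ratio of wall-clock times; the timings used below are purely hypothetical, not the measurements of the paper:

```python
def speed_up(t_serial, t_parallel):
    """Speed up ratio S = T_serial / T_parallel (both in seconds)."""
    return t_serial / t_parallel

# hypothetical timings for illustration only
t_cpu, t_gpu = 300.0, 2.5
assert speed_up(t_cpu, t_gpu) == 120.0
```

As the discussion below shows, the value of S depends strongly on which serial code supplies T_serial.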
Figure 8 shows the computation time and speed up ratio for the codes developed
in this research. It is clear that the algorithm being used in the parallel codes is
not suitable for serial computation (CPU2 ). Therefore, a better, faster algorithm
Figure 8: Computation time and speed up ratio. *Note that the serial codes were compiled with the Intel C compiler version 17.0.4 with -O2 optimization, while the GPU codes were compiled with the GNU compiler version 5.4.0 and the NVIDIA CUDA compiler version 8.0 with -O2 optimization.
for serial computation (CPU1) is also used to compare the speed up ratio (T_CPU2 ≈ 3.0 T_CPU1). The maximum speed up ratio reported against CPU1 is approximately 56 for the AMR code (case 3) and about 34 for GPU uniform (case 3), while the maximum speed up ratio against CPU2 is about 170 for the AMR code and 107 for GPU uniform. The kernel times are presented for the GPU uniform (Figure 9) and AMR (Figure 10) codes. There are 4 and 9 kernels in the GPU uniform (shown as yellow process boxes in Figure 11) and the AMR (Figure 12) codes, respectively.
The LBM kernel in the GPU uniform code (LBM_gpu-uniform), which takes care of the collisions and the streaming in the x, y, z directions, is equivalent to (LBM + Z1 + Y1)_AMR or (LBMf + Z2 + Y2)_AMR in the AMR code. The computation time needed for LBM_gpu-uniform is about 50% more than (LBM + Z1 + Y1 + LBMf + Z2 + Y2)_AMR in most cases. The exchange kernel in the GPU uniform code (exchange_gpu-uniform), which is responsible for the data exchange in the x direction, is equivalent to (X1)_AMR or (X2)_AMR in the AMR code [3]. It can be seen that exchange_gpu-uniform takes less time than (Xi)_AMR; this is because the
[Figures 9 and 10 show the per-kernel computation times on logarithmic axes. GPU uniform kernels: LBM, exchange, Copy old to new, uvwr. AMR kernels: LBM, Z1, Y1, X1, LBMf, Z2, Y2, X2, uvwr.]

Figure 9: Kernel time for GPU Uniform. Figure 10: Kernel time for AMR.
whole computation domain is discretized into smaller mesh blocks; thus, there are more data to be moved to the correct cells. Note that in the GPU uniform code there is a kernel that copies the new distribution functions to the old ones for the next time iteration, which is avoided in the AMR code through the second set of (LBM + Z + Y) kernels. This copy kernel takes around 12% of the computation time in the simulation, which makes the GPU uniform code run slower than the AMR code on a uniform mesh.
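The saving from dropping the copy kernel amounts to double buffering: instead of copying the new distributions back over the old ones every step, two buffers simply swap roles each iteration. A minimal NumPy illustration of the idea (ours, not the paper's kernels; `update` is a stand-in for one LBM step):

```python
import numpy as np

def update(src):
    """Stand-in for one LBM update that reads src and returns the new state."""
    return np.roll(src, 1, axis=0)

state0 = np.arange(8.0)

# Variant 1: one buffer plus an explicit copy each iteration
# (the 'copy old to new' kernel of the GPU uniform code)
buf = state0.copy()
for _ in range(4):
    new = update(buf)
    buf[:] = new              # the extra copy, paid every time step

# Variant 2: two buffers that swap roles -- the copy disappears
ping, pong = state0.copy(), np.empty_like(state0)
for _ in range(4):
    pong[:] = update(ping)
    ping, pong = pong, ping   # swap references, no data copied

assert np.array_equal(buf, ping)  # identical results, one fewer pass over memory
```

On a GPU, where the copy kernel is a full read and write of the distribution arrays, removing it saves exactly the kind of memory traffic the 12% figure above reflects.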
M
In AMR test cases, case 2 and 3 have the same amount of cells due to the
proper combination of mesh blocks number and levels, however as one can see
ED
in Figure 8, case 2 is about 50% slower than case 3. To understand this, one
can start with Figure 10. The biggest differences lies in the data exchange in x,
y, z directions. Case 2 is reported to take almost 4 times longer than case 3 to
PT
instance, level 1 mesh). For case 2, all meshes are level 0, hence Z2 , Y2 , X2 are
essentially redundant. Another factor that could slow down the simulation for
case 2 is believed to be the memory layout.
6. Conclusions
This research focuses on two parts: the first part is the implementation of the multi-block mesh refinement scheme, while the second part is the parallel implementation using a single GPU and the investigation of its parallel performance.

Figure 11: GPU Uniform flowchart. Figure 12: AMR flowchart.
The simulation with mesh refinement requires about 25% to 30% of the time that is needed for the simulations with the finer uniform mesh, hence justifying the use of an AMR technique. For the parallel performance, the parallel codes have been compared with the serial codes. This research shows that, due to architecture differences between the CPU and GPU, optimal codes for each implementation vary, and care must be taken when computing the parallel speed up. The parallel performance should not be judged solely on the computation time needed by the serial computation, as the serial algorithm plays an important role in the analysis; doing so can give misleading results. It is worth noting that the current AMR code shows better performance than the GPU uniform code for the same amount of mesh with a proper setup. The developed AMR code reaches an approximately 56 times speed up when compared to the faster serial code, CPU1.
References

[1] M. Rohde, D. Kandhai, J. J. Derksen, H. E. A. Van den Akker, A generic, mass conservative local grid refinement technique for lattice-Boltzmann schemes, International Journal for Numerical Methods in Fluids 51 (4) (2006) 439–468.

[2] J. Tölke, Implementation of a lattice Boltzmann kernel using the compute unified device architecture developed by NVIDIA, Computing and Visualization in Science 13 (1) (2008) 29.

[3] F. Kuznik, C. Obrecht, G. Rusaouen, J.-J. Roux, LBM based flow simulation using GPU computing processor, Computers and Mathematics with Applications 59 (7) (2010) 2380–2392, Mesoscopic Methods in Engineering and Science.

[4] Y. H. Qian, D. d'Humières, P. Lallemand, Lattice BGK models for Navier-Stokes equation, Europhysics Letters 17 (6) (1992) 479–484.

[5] S. Chen, G. D. Doolen, Lattice Boltzmann method for fluid flows, Annual Review of Fluid Mechanics 30 (1) (1998) 329–364.
426 – 448.