PII: S0045-7930(18)30043-4
DOI: 10.1016/j.compfluid.2018.01.033
Reference: CAF 3716
Please cite this article as: Fu-Sheng Hsu, Keh-Chin Chang, Matthew Smith, Multi-Block Adaptive Mesh
Refinement (AMR) for a lattice Boltzmann solver using GPUs, Computers and Fluids (2018), doi:
10.1016/j.compfluid.2018.01.033
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT
Highlights
• The parallel efficiency of the block AMR implementation using GPU devices is investigated.
• The influence block-based AMR plays upon the flow field and parallel implementation using GPU devices is investigated.
a Department of Aeronautics and Astronautics, National Cheng Kung University, 701, Taiwan
b Department of Mechanical Engineering, National Cheng Kung University, 701, Taiwan
Abstract
An Adaptive Mesh Refinement (AMR) technique for a lattice Boltzmann solver [1] using GPU acceleration is developed based around a multi-block structured uniform mesh code [2, 3]. AMR is obtained by deploying multiple levels of mesh blocks, with varying resolution employed within each block, for increased resolution of flows in regions requiring higher accuracy. A simple three-dimensional benchmark is employed for confirming the accuracy and parallel performance of the developed solver.
1. Introduction
solving the Navier-Stokes equations. One of many reasons is that the LBM can be easily parallelized due to its high degree of locality. Unlike traditional numerical schemes that discretize the macroscopic equations, the LBM is based on a simplified kinetic theory of gases to simulate the flow field.

∗ Corresponding author. Email address: msmith@mail.ncku.edu.tw (Matthew Smith)
In the lattice Boltzmann method, the governing equation is the lattice Boltzmann equation (LBE) modeled with a collision operator, written as:

f_i(x + c_i δt, t + δt) − f_i(x, t) = Ω_i    (1)

where c_i is the discrete velocity and Ω_i is the collision operator. The most popular model in LBM is the LBE with the BGK collision operator,

Ω_i = −(1/τ) [f_i(x, t) − f_i^eq(x, t)]    (2)

which is commonly written as:

f_i(x + c_i δt, t + δt) = f_i(x, t) − (1/τ) [f_i(x, t) − f_i^eq(x, t)]    (3)

where τ is the single relaxation time and f_i^eq is the equilibrium distribution function:

f_i^eq = w_i ρ [1 + (c_i · u)/c_s² + (c_i · u)²/(2 c_s⁴) − (u · u)/(2 c_s²)]    (4)

where w_i is the weighting function for the discrete velocity and c_s is the sound speed, which is model dependent. The DnQm (n dimensional, m velocity) models proposed by Qian [4] are the most representative models. The common models are D2Q9, D3Q15, D3Q19, and D3Q27, with the model used in this research being D3Q15 (Figure 1), which gives:

c_s = 1/√3    (5)

c_i = c × { (0, 0, 0), i = 1;  (±1, 0, 0), (0, ±1, 0), (0, 0, ±1), i = 2, 3, …, 7;  (±1, ±1, ±1), i = 8, 9, …, 15 }    (6)
Figure 1: D3Q15 model
w_i = { 2/9, i = 1;  1/9, i = 2, 3, …, 7;  1/72, i = 8, 9, …, 15 }    (7)
c = δx/δt in equation (6) is the lattice speed, where δx is the lattice spacing and δt is the time step. The macroscopic properties, such as fluid density and velocity, can be obtained from the discrete distribution function:
ρ = Σ_i f_i    (8)

ρu = Σ_i c_i f_i    (9)
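As a quick sanity check, the D3Q15 velocity set and weights of equations (5)-(7), together with the moments of equations (8) and (9), can be verified numerically. The following is an illustrative NumPy sketch of ours, not code from the paper:

```python
import numpy as np

# D3Q15 velocity set, eq. (6), with lattice speed c = 1:
# one rest velocity, six face neighbours, eight corner neighbours.
c = np.array(
    [[0, 0, 0]]
    + [[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1]]
    + [[sx, sy, sz] for sx in (1, -1) for sy in (1, -1) for sz in (1, -1)]
)
# D3Q15 weights, eq. (7): 2/9 (rest), 1/9 (faces), 1/72 (corners).
w = np.array([2 / 9] + [1 / 9] * 6 + [1 / 72] * 8)
cs2 = 1 / 3  # c_s^2, from eq. (5)

# Lattice isotropy checks: the weights sum to one, the first moment
# vanishes, and the second moment reproduces c_s^2 * delta_ab.
assert np.isclose(w.sum(), 1.0)
assert np.allclose(np.einsum("i,ia->a", w, c), 0.0)
assert np.allclose(np.einsum("i,ia,ib->ab", w, c, c), cs2 * np.eye(3))

# Moments of an equilibrium state, eqs. (4), (8) and (9)
rho, u = 1.2, np.array([0.05, -0.02, 0.01])
cu = c @ u
feq = w * rho * (1 + cu / cs2 + cu**2 / (2 * cs2**2) - (u @ u) / (2 * cs2))
assert np.isclose(feq.sum(), rho)        # density, eq. (8)
assert np.allclose(c.T @ feq, rho * u)   # momentum, eq. (9)
```

These isotropy conditions are exactly what makes the summations (8) and (9) recover ρ and ρu from the discrete populations.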
Even though the LBE is solved in the process, the Navier-Stokes equations can be recovered in the macroscopic limit through a Chapman-Enskog expansion. The LBE is commonly solved in two steps, collision and streaming:
Collision:

f_i*(x, t + δt) = f_i(x, t) − (1/τ) [f_i(x, t) − f_i^eq(x, t)]    (10)

Streaming:

f_i(x + c_i δt, t + δt) = f_i*(x, t + δt)    (11)

where f_i* represents the post-collision state and the single relaxation time τ can be defined as

τ = ν/(c_s² δt) + 1/2    (12)

where ν is the kinematic viscosity. Note that the collision step is a purely local procedure within each lattice. For the streaming step, even though particles move from one lattice to another, the m discrete velocities can be handled separately.
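The two steps can be sketched end-to-end on a small periodic lattice. The following is a minimal NumPy illustration of eqs. (10)-(12) for D3Q15, not the authors' GPU implementation; all function names are ours:

```python
import numpy as np

# D3Q15 set (c = 1) and weights, as in eqs. (6)-(7)
c = np.array(
    [[0, 0, 0]]
    + [[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1]]
    + [[sx, sy, sz] for sx in (1, -1) for sy in (1, -1) for sz in (1, -1)]
)
w = np.array([2 / 9] + [1 / 9] * 6 + [1 / 72] * 8)
cs2 = 1 / 3

def feq(rho, u):
    """Equilibrium distributions, eq. (4); rho is (NX,NY,NZ), u is (NX,NY,NZ,3)."""
    cu = np.einsum("ia,xyza->ixyz", c, u)
    uu = np.einsum("xyza,xyza->xyz", u, u)
    return w[:, None, None, None] * rho * (
        1 + cu / cs2 + cu**2 / (2 * cs2**2) - uu / (2 * cs2)
    )

def collide_and_stream(f, tau):
    rho = f.sum(axis=0)                                     # eq. (8)
    u = np.einsum("ixyz,ia->xyza", f, c) / rho[..., None]   # eq. (9)
    fstar = f - (f - feq(rho, u)) / tau                     # collision, eq. (10)
    # streaming, eq. (11): shift each population along its own velocity
    return np.stack([np.roll(fstar[i], c[i], axis=(0, 1, 2)) for i in range(15)])

# demo: density bump on an 8^3 periodic lattice, started at equilibrium
nu, dt = 0.05, 1.0
tau = nu / (cs2 * dt) + 0.5            # eq. (12)
rho0 = np.ones((8, 8, 8)); rho0[4, 4, 4] += 0.1
f = feq(rho0, np.zeros((8, 8, 8, 3)))
for _ in range(10):
    f = collide_and_stream(f, tau)
assert np.isclose(f.sum(), rho0.sum())  # collision and streaming conserve mass
```

The locality noted above is visible here: the collision touches only on-site data, while the streaming is a pure memory shift per discrete velocity, which is what maps well to GPU hardware.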
3. Multi-Block Mesh Refinement

The computational domain is first decomposed into many smaller uniform mesh blocks. The mesh refinement is then obtained by deploying multiple levels of mesh blocks, each with varying resolution employed, for increased resolution of flows in regions requiring higher accuracy.
Figure 2: Adjacent mesh blocks with two different mesh levels, 0 and 1.
Based on this idea, within each mesh block the problem reduces to a uniform mesh problem that can be solved with a conventional LBM using GPU [3] with little modification. However, there is no guarantee that two adjacent mesh blocks have the same mesh level (as shown in Figure 2). Hence, special treatment must be applied at these adjacent mesh blocks. There are previous investigations regarding the data exchange at such boundaries; some common techniques are finite difference LBM [6], finite volume LBM [7], interpolation supplemented LBM [8], and locally embedded uniform grid techniques [9]. In this research, a generic, mass-conserving, locally embedded technique proposed by Rohde [1] is implemented. Their technique is reported to require neither spatial interpolation nor rescaling of the non-equilibrium distribution: the former leads to some difficulty in mass conservation, while the latter restricts the collision operator (i.e., to a single relaxation operator only). The detailed methodology of the scheme can be found in [1].
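The block organisation described above can be sketched as follows. This is an illustrative data structure of ours, not the authors' implementation: each block is a small uniform mesh carrying its own refinement level, and any face shared by blocks of different levels is flagged for the special coarse-fine treatment:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class MeshBlock:
    index: tuple   # (bx, by, bz) position in the block grid
    level: int     # refinement level n; lattice spacing halves per level

def coarse_fine_interfaces(levels):
    """List pairs of adjacent block indices whose mesh levels differ.

    `levels` is an integer array of shape (MBX, MBY, MBZ).
    """
    pairs = []
    nbx, nby, nbz = levels.shape
    for bx in range(nbx):
        for by in range(nby):
            for bz in range(nbz):
                for dx, dy, dz in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
                    nb = (bx + dx, by + dy, bz + dz)
                    if (nb[0] < nbx and nb[1] < nby and nb[2] < nbz
                            and levels[bx, by, bz] != levels[nb]):
                        pairs.append(((bx, by, bz), nb))
    return pairs

# e.g. a 4 x 2 x 2 block grid with two refined (level 1) blocks
levels = np.zeros((4, 2, 2), dtype=int)
levels[1, 0, :] = 1
blocks = [MeshBlock(idx, int(levels[idx])) for idx in np.ndindex(*levels.shape)]
interfaces = coarse_fine_interfaces(levels)
# each of the two refined blocks touches three level-0 neighbours here
assert len(interfaces) == 6
```

Interior block faces between equal-level blocks only need a plain halo exchange; only the flagged pairs require the mass-conserving coarse-fine exchange of [1].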
4. Benchmark Problem

The flow field of a three-dimensional flow over a square cylinder (0.2 × 0.2 × 1) was chosen to test the developed parallelized multi-block mesh refinement code.
ED
1
CE
-0.5
AC
A'
-1
-1
0
1
Periodic in z direction 2
x 3
4 -0.5
0
5 0.5 z
6
Periodic boundary conditions are applied at the boundaries (z = ±1) in the z direction.
A series of test cases, as listed in Table 1 and Table 2, was performed. Table 1 shows the test cases for the uniform mesh code using a single CPU or GPU, where (NX, NY, NZ) are the numbers of lattices in the x, y, z directions, respectively. On the other hand, Table 2 shows the test cases for the mesh refinement code using GPU. Here (MBX, MBY, MBZ) are the numbers of mesh blocks in the x, y, z directions, and (NX, NY, NZ) are now the numbers of lattices within each mesh block. Note that the mesh refinement code can be used to simulate a uniform mesh as well (case 1 to case 4 in Table 2) by setting the mesh block level equal throughout the computational domain. The mesh refinement (cases 5 and 6 in Table 2) was employed around the cylinder (as shown with the blue dashed line in Figure 3).
Table 1: Test cases for uniform mesh using CPU/GPU.

Case          | 1             | 2 (= 3)        | 4
(NX, NY, NZ)  | (192, 64, 32) | (384, 128, 64) | (768, 256, 128)
Total mesh    | 393,216       | 3,145,728      | 25,165,824
Table 2: Test cases for the mesh refinement code using GPU.

Case              | 1           | 2             | 3
(MBX, MBY, MBZ)   | (24, 8, 4)  | (48, 16, 8)   | (24, 8, 4)
2^n (NX, NY, NZ)  | (8, 8, 8)   | (8, 8, 8)     | 2 (8, 8, 8)
Mesh level, n     | 0           | 0             | 1
Total mesh        | 393,216     | 3,145,728     | 3,145,728

Case              | 4           | 5             | 6
(MBX, MBY, MBZ)   | (48, 16, 8) | (24, 8, 4)    | (48, 16, 8)
2^n (NX, NY, NZ)  | 2 (8, 8, 8) | 2^n (8, 8, 8) | 2^n (8, 8, 8)
Mesh level, n     | 1           | n = 1 if 0.5 < x < 2 and −0.5 < y < 0.5; n = 0 elsewhere
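The refinement rule for cases 5 and 6 in Table 2 can be written directly as a predicate on the block position; a one-function sketch (the function name is ours):

```python
def mesh_level(x, y):
    """Mesh level n for cases 5 and 6: level 1 inside the box around
    the cylinder and its near wake, level 0 elsewhere."""
    return 1 if (0.5 < x < 2.0) and (-0.5 < y < 0.5) else 0

assert mesh_level(1.0, 0.0) == 1   # inside the refined region
assert mesh_level(3.0, 0.0) == 0   # far downstream: coarse level 0
assert mesh_level(1.0, 0.8) == 0   # off to the side: coarse level 0
```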
5. Results
the beginning of the simulation, whereas CPU2 uses a similar algorithm to the GPU codes [3], which moves the particles in one direction at a time in order. The other two are parallel codes using a single GPU device, which employ (i) a uniform Cartesian grid over the entire domain, which we refer to as GPU uniform, and (ii) an Adaptive Mesh Refinement (AMR) approach, which we refer to as the AMR result. GPU uniform adopts the method proposed by Kuznik et al. in [3], while the AMR method is the multi-block mesh refinement code developed in this research based on [1, 3].
The velocity components and the computational performance results are presented in this section. Before moving on to the mesh refinement, the GPU codes need to be verified against the CPU codes. To examine the developed GPU codes, the simulation results of case 2 are presented in Figure 4 and Figure 5. Figure 4 shows the velocity in the x direction, u, along the A-A′ axis (see Figure 3).
Figure 6: Comparison of x velocity, u, along A-A′ (G2). Figure 7: Comparison of y velocity, v, along A-A′ (G2).
contains cases 2, 4, and 6. Note that only the results of G2 are presented here. From Figure 6, one can see that the resulting u velocity leans towards the result of the finer mesh (case 4) around the cylinder. As for the v velocity (Figure 7), there are small (relative to the mesh size) discontinuous jumps at x = 0.5 or x = 2 due to the difference in mesh sizes.
The time needed for each code to run the test cases is also considered in this research. To evaluate the performance of the parallel codes, the speed up ratio is defined as

S = T_serial / T_parallel

where T_serial is the time needed by the serial codes, CPU1 and CPU2, and T_parallel is the time needed by the GPU parallel codes, GPU uniform and AMR.
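For concreteness, the speed up ratio is simply the ratio of wall-clock times; the timings used below are purely hypothetical, not the measurements of the paper:

```python
def speed_up(t_serial, t_parallel):
    """Speed up ratio S = T_serial / T_parallel (both in seconds)."""
    return t_serial / t_parallel

# hypothetical timings for illustration only
t_cpu, t_gpu = 300.0, 2.5
assert speed_up(t_cpu, t_gpu) == 120.0
```

As the discussion below shows, the value of S depends strongly on which serial code supplies T_serial.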
Figure 8 shows the computation time and speed up ratio for the codes developed
in this research. It is clear that the algorithm being used in the parallel codes is
not suitable for serial computation (CPU2 ). Therefore, a better, faster algorithm
Figure 8: Computation time and speed up ratio. *Note that the serial codes were compiled with the Intel C compiler version 17.0.4 with -O2 optimization, while the GPU codes were compiled with the GNU compiler version 5.4.0 and the NVIDIA CUDA compiler version 8.0 with -O2 optimization.
for serial computation (CPU1) is also used to compare the speed up ratio (T_CPU2 ≈ 3.0 T_CPU1). The maximum speed up ratio reported against CPU1 is approximately 56 for the AMR code (case 3) and about 34 for GPU uniform (case 3), while the maximum speed up ratio against CPU2 is about 170 for the AMR code and 107 for GPU uniform. The kernel times are presented for the GPU uniform (Figure 9) and AMR (Figure 10) codes. There are 4 and 9 kernels in the GPU uniform (shown as yellow process boxes in Figure 11) and the AMR (Figure 12) codes, respectively.
The LBM kernel in the GPU uniform code (LBM_gpu-uniform), which takes care of the collisions and the streaming in the x, y, z directions, is equivalent to (LBM + Z1 + Y1)_AMR or (LBMf + Z2 + Y2)_AMR in the AMR code. The computation time needed for LBM_gpu-uniform is about 50% more than (LBM + Z1 + Y1 + LBMf + Z2 + Y2)_AMR in most cases. The exchange kernel in the GPU uniform code (exchange_gpu-uniform), which is responsible for the data exchange in the x direction, is equivalent to (X1)_AMR or (X2)_AMR in the AMR code [3]. It can be seen that exchange_gpu-uniform takes less time than (Xi)_AMR; this is because the
[Figures 9 and 10 show the per-kernel computation times on logarithmic axes. GPU uniform kernels: LBM, exchange, Copy old to new, uvwr. AMR kernels: LBM, Z1, Y1, X1, LBMf, Z2, Y2, X2, uvwr.]

Figure 9: Kernel time for GPU Uniform. Figure 10: Kernel time for AMR.
whole computation domain is discretized into smaller mesh blocks; thus, there are more data to be moved to the correct cells. Note that in the GPU uniform code there is a kernel that copies the new distribution functions to the old ones for the next time iteration, which is avoided in the AMR code through the second set of (LBM + Z + Y) kernels. This copy kernel takes around 12% of the computation time in the simulation, which makes the GPU uniform code run slower than the AMR code on a uniform mesh.
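The saving from dropping the copy kernel amounts to double buffering: instead of copying the new distributions back over the old ones every step, two buffers simply swap roles each iteration. A minimal NumPy illustration of the idea (ours, not the paper's kernels; `update` is a stand-in for one LBM step):

```python
import numpy as np

def update(src):
    """Stand-in for one LBM update that reads src and returns the new state."""
    return np.roll(src, 1, axis=0)

state0 = np.arange(8.0)

# Variant 1: one buffer plus an explicit copy each iteration
# (the 'copy old to new' kernel of the GPU uniform code)
buf = state0.copy()
for _ in range(4):
    new = update(buf)
    buf[:] = new              # the extra copy, paid every time step

# Variant 2: two buffers that swap roles -- the copy disappears
ping, pong = state0.copy(), np.empty_like(state0)
for _ in range(4):
    pong[:] = update(ping)
    ping, pong = pong, ping   # swap references, no data copied

assert np.array_equal(buf, ping)  # identical results, one fewer pass over memory
```

On a GPU, where the copy kernel is a full read and write of the distribution arrays, removing it saves exactly the kind of memory traffic the 12% figure above reflects.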
M
In AMR test cases, case 2 and 3 have the same amount of cells due to the
proper combination of mesh blocks number and levels, however as one can see
ED
in Figure 8, case 2 is about 50% slower than case 3. To understand this, one
can start with Figure 10. The biggest differences lies in the data exchange in x,
y, z directions. Case 2 is reported to take almost 4 times longer than case 3 to
PT
instance, level 1 mesh). For case 2, all meshes are level 0, hence Z2 , Y2 , X2 are
essentially redundant. Another factor that could slow down the simulation for
case 2 is believed to be the memory layout.
6. Conclusions
This research focuses on two parts: the first part is the implementation of the multi-block mesh refinement scheme, while the second part is the parallel implementation using a single GPU and the investigation of its parallel performance.

Figure 11: GPU Uniform flowchart. Figure 12: AMR flowchart.
The simulation with mesh refinement requires about 25% to 30% of the time that is needed for the simulations with the finer uniform mesh, hence justifying the use of an AMR technique. For the parallel performance, the parallel codes have been compared with the serial codes. This research shows that, due to architecture differences between the CPU and GPU, optimal codes for each implementation vary, and care must be taken when computing the parallel speed up. The parallel performance should not be judged solely on the computation time needed by the serial computation, as the serial algorithm plays an important role in the analysis; doing so can give misleading results. It is worth noting that the current AMR code shows better performance than the GPU uniform code for the same amount of mesh with a proper setup. The developed AMR code reaches an approximately 56 times speed up when compared to the faster serial code, CPU1.
References

[1] M. Rohde, D. Kandhai, J. J. Derksen, H. E. A. Van den Akker, A generic, mass conservative local grid refinement technique for lattice-Boltzmann schemes, International Journal for Numerical Methods in Fluids 51 (4) (2006) 439–468.

[2] J. Tölke, Implementation of a lattice Boltzmann kernel using the compute unified device architecture developed by NVIDIA, Computing and Visualization in Science 13 (1) (2008) 29.

[3] F. Kuznik, C. Obrecht, G. Rusaouen, J.-J. Roux, LBM based flow simulation using GPU computing processor, Computers and Mathematics with Applications 59 (7) (2010) 2380–2392, Mesoscopic Methods in Engineering and Science.

[4] Y. H. Qian, D. d'Humières, P. Lallemand, Lattice BGK models for Navier-Stokes equation, Europhysics Letters 17 (6) (1992) 479–484.

[5] S. Chen, G. D. Doolen, Lattice Boltzmann method for fluid flows, Annual Review of Fluid Mechanics 30 (1) (1998) 329–364.
426 – 448.