Acta Astronautica 159 (2019) 319–330
Acceleration of a 2D unsteady Euler solver with GPU on nested Cartesian grid

Feng Wei a,b, Liang Jin a,b,∗, Jun Liu a,b, Feng Ding a,b, Xinping Zheng c

a College of Aerospace Science and Engineering, National University of Defense Technology, Changsha, Hunan 410073, People's Republic of China
b Science and Technology on Scramjet Laboratory, National University of Defense Technology, Changsha, Hunan 410073, People's Republic of China
c College of Computer, National University of Defense Technology, Changsha, Hunan 410073, People's Republic of China

ABSTRACT

Graphics processing unit (GPU) parallel computation is used to accelerate a solver that computes the 2D unsteady compressible Euler equations discretized with the finite-volume method on a nested Cartesian grid. In this solver, a second-order accurate upwind scheme is adopted, with explicit time-stepping by the third-order total-variation-diminishing Runge–Kutta method. An improved parallel strategy is implemented, and for a test case on a three-level nested Cartesian grid, speedup ratios of 9.98–14.04 are achieved at different grid sizes. Further, through a numerical experiment, the relation between kernel performance and the execution configuration at different grid sizes is examined by monitoring and analyzing performance indicators. Approaches to improve kernel performance are also explored.

Keywords: Unsteady; GPU parallel computation; Euler equation; Nested Cartesian grid; Execution configuration

1. Introduction

As an important research field of fluid dynamics, the study of unsteady flows involves many problems, such as stage separation [1,2] and aeroelastic analysis [3–5]. With the development of computational fluid dynamics (CFD), more and more people have begun to simulate unsteady problems by numerical computation [6–8]. However, the computational cost of solving unsteady problems is large and, especially for unsteady problems with moving boundaries, the grid needs to be readjusted after each time step since the motion of the boundary changes the physical space of the flow field. As a result, long running times are unavoidable when simulating these unsteady problems.

The Cartesian grid method is a feasible and efficient approach for flow simulations with complex stationary and moving boundaries. Based on the immersed boundary method (IBM), first proposed by Peskin [9], various flow problems can be solved on a fixed, non-body-fitted Cartesian grid. IBMs can be divided into two categories, namely the diffused interface (DI) method [9–11] and the sharp interface (SI) method [12–15]. The DI method has been widely used in low Reynolds number flow simulations, especially in biological fluid dynamics, but it is difficult to apply to high Reynolds number flows due to the compromised accuracy at the interface. In aerospace engineering, most simulations involve high Reynolds number flows, which can be handled by the SI method. The ghost cell method (GCM), a classic approach of the SI method, has drawn great attention from many researchers in recent years [15–19]. In the GCM, body cells that have at least one neighbor in the fluid are defined as "ghost cells", and boundary conditions are enforced through these ghost cells. The advantages of the GCM on a Cartesian grid lie in fast grid generation and low memory usage; more importantly, the method is easy to implement since there is no need to change the governing equations.

In recent years, graphics processing unit (GPU)-based general-purpose computation technology has developed greatly, and the GPU has gained much attention for its high floating-point throughput compared with central processing units (CPUs) [20]. For problems with large-scale data that can be processed in a parallel manner, a GPU, which has many cores supporting large-scale multithreaded parallel computing, is the best choice. In 2007, an easy-to-use programming interface called the Compute Unified Device Architecture (CUDA) was proposed by NVIDIA [20]. CUDA helps researchers program GPUs without having to learn complex shader languages and allows GPU code to be written in regular C. As a result, this low-cost and high-efficiency technology immediately received considerable attention from CFD researchers. Brandvik and Pullan [21] took the lead in using CUDA to accelerate an Euler solver, obtaining speedups of 29 times for the 2D solver and 16 times for the 3D code, which demonstrates that the GPU is well suited to CFD.


∗ Corresponding author: College of Aerospace Science and Engineering, National University of Defense Technology, Changsha, Hunan 410073, People's Republic of China.
E-mail address: snowblade@sina.com (L. Jin).

https://doi.org/10.1016/j.actaastro.2019.03.020
Received 26 November 2018; Received in revised form 25 February 2019; Accepted 6 March 2019
Available online 09 March 2019
0094-5765/ © 2019 Published by Elsevier Ltd on behalf of IAA.

Nomenclature

ρ    Density, kg/m3
γ    Specific heat ratio
a    Speed of sound, m/s
t    Time, s
u    x velocity, m/s
v    y velocity, m/s
E    Total specific energy, J
U    Vector of conservative variables
F    Vector of inviscid flux at x coordinate
G    Vector of inviscid flux at y coordinate
Fc   Convective flux of AUSM+
M    Mach number of AUSM+
P    Pressure flux of AUSM+
ω    Angular frequency, °/s
α    Angle of attack, °
α0   Average angle of attack, °
αm   Maximum amplitude angle of attack, °
Cp   Pressure coefficient
Δt   Global time step, s
Δti  Local time step of cell i, s

In the work of Elsen et al. [22], a GPU was used to accelerate the simulation of a hypersonic vehicle in cruise at Mach 5, and a multigrid scheme was used to accelerate the solution. A speedup ratio of more than 20 was demonstrated for this complex geometry. In a bid to further increase the calculation speed, researchers began to accelerate CFD codes on multi-GPU platforms. Julien et al. [23] first used multiple GPUs to accelerate a Navier-Stokes solver, and a speedup ratio of 21 was achieved on a dual-GPU platform. In the work of Phillips et al. [24] and Jacobsen et al. [25], larger GPU clusters were used and the GPU code was hundreds of times faster than the CPU code. Xu et al. [26] accelerated a high-order CFD code for a 3D multiblock structured grid on the TianHe-1A supercomputer and successfully performed CPU-GPU collaborative high-order accurate simulations with large-scale complex grids, which showed the powerful capability of GPU-based general-purpose computation technology in aerospace engineering.

For the performance of a GPU program, not only the number of GPUs but also the GPU parallel methods play important roles. For the time-discretization scheme, many researchers, such as Brandvik et al. [21] and Elsen et al. [22], adopted the explicit Runge-Kutta method for its high data parallelism. Implicit methods, such as the lower-upper symmetric Gauss-Seidel (LU-SGS), are ineffective on a GPU because of their strong data dependency. Special approaches must be adopted to eliminate the data dependency in the LU-SGS, and the related work can be found in Refs. [27–29]. The data-parallel lower-upper relaxation (DP-LUR) method, an implicit method with high data parallelism, is suitable for running on the GPU [30]. In the specific GPU implementation, Komatsu et al. [31] as well as Yue et al. [32] divided the computation grid into two categories, red points and black points, to solve the pressure Poisson equation in parallel on the GPU. Leskinen et al. [33] added auxiliary arrays to store the results computed by the threads, which prevented different threads from writing over each other's results. For the unstructured grid, which is at risk of highly non-coalesced memory access, they achieved coalesced access by copying the original data to a new array according to the coalescing requirements, like Corrigan et al. [34]. However, the performance was poor since the data replication cost too much time. Franco et al. [35] used CUDA to accelerate a solver based on a fourth-order accurate finite difference scheme, and a speedup ratio of 90 was achieved. In their work, data that needed to be read repeatedly from global memory were put into shared memory, and different execution configurations were tested, which indicated about a 20% performance difference between block sizes of 128 and 512.

As mentioned above, research on GPU-acceleration technology has achieved great progress and has been applied to many practical flow simulations [36–39]. However, most of the investigations focus on flows with stationary boundaries on structured/unstructured grids or moving boundaries on dynamic unstructured grids [40,41]. For the GPU implementation on the nested Cartesian grid, there has been a lack of relevant research.

In the present study, the acceleration of an unsteady solver on a nested Cartesian grid, which can simulate 2D unsteady inviscid compressible flow with both stationary and moving boundaries, is implemented by using a GPU. Through an improved parallel strategy, the speed of code execution is greatly improved. In most of the research on accelerating a solver with a GPU, kernel performance with different execution configurations may be tested [35], but there is no further analysis. Therefore, in this paper, a numerical experiment is performed by testing a kernel with different execution configurations at different grid sizes. During the test, performance indicators of the kernel are monitored in order to find the impact of changing the execution configuration. Through the numerical experiment, the optimal execution configuration is obtained, and the relation between the performance indicators and the execution configuration at different grid sizes is investigated. Moreover, we also discuss approaches to further improve kernel performance.

2. Flow solver

2.1. Governing equations

The cell-centered finite-volume method proposed by Jameson [42] is used to solve the inviscid flow on a nested Cartesian grid, and the compressible Euler equations in 2D are given as

Ut = −F(U)x − G(U)y,  x = (x, y) ∈ Ω(t),  0 < t < T   (1)

where the solution vector U and the inviscid fluxes F(U) in the x direction and G(U) in the y direction are given by

U = [ρ, ρu, ρv, ρE]^T   (2)

F(U) = [ρu, ρu² + p, ρuv, u(E + p)]^T   (3)

G(U) = [ρv, ρuv, ρv² + p, v(E + p)]^T   (4)

The equation of state is as follows

E = p/(γ − 1) + (1/2) ρ (u² + v²)   (5)
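As a concrete illustration of Eqs. (2)–(5), the sketch below shows one possible per-cell storage of the conservative variables and the recovery of the pressure from the equation of state. The struct layout and names are our own assumptions for illustration, not the authors' data structures; the fourth component is interpreted as the total energy per unit volume, consistent with Eq. (5).

    // Conservative state of one cell, following Eqs. (2) and (5).
    // Here E is taken as the total energy per unit volume (assumption), so that
    // p = (gamma - 1) * (E - 0.5 * rho * (u^2 + v^2)).
    struct ConsVars {
        double rho;   // density
        double rhou;  // x-momentum
        double rhov;  // y-momentum
        double E;     // total energy per unit volume
    };

    __host__ __device__ inline double pressure(const ConsVars& q, double gamma)
    {
        double u = q.rhou / q.rho;
        double v = q.rhov / q.rho;
        // Invert Eq. (5): E = p/(gamma - 1) + 0.5 * rho * (u^2 + v^2)
        return (gamma - 1.0) * (q.E - 0.5 * q.rho * (u * u + v * v));
    }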


Fig. 2. Schematic illustration of the interpolation method.
Fig. 4. Comparison of the computational overheads among the parts of the Runge–Kutta loop.

2.2. Numerical approach

The AUSM+ scheme [43] is employed for the spatial discretization, and the following uses the x direction as an example to illustrate the scheme. The convective term F(U) in Eq. (3) is split into

F(U) = Fc + P   (6)

with Fc = uU = MaU and P = (0, p, 0, 0)^T. The fluxes at the interface are defined as

Fn = Mn an Un   (7)

Pn = (0, pn, 0, 0)^T   (8)

with Mn and pn being split as

Mn = ML+ + MR−   (9)

ML/R = uL/R / an   (10)

ML/R± = (1/2)(M ± |M|)  if |M| ≥ 1;   ML/R± = Mβ±  if |M| < 1   (11)

Mβ± = ±(1/4)(M ± 1)² ± β(M² − 1)²   (12)

and

pn = p+ pL + p− pR   (13)

p± = (1/2)(1 ± sign(M))  if |M| ≥ 1;   p± = pα±  if |M| < 1   (14)

pα± = (1/4)(M ± 1)²(2 ∓ M) ± α(M² − 1)²   (15)
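Purely as an illustration of the splittings written in Eqs. (11)–(15), a minimal set of device functions is sketched below; the function names are ours and this is not the paper's implementation.

    // Split Mach numbers, Eqs. (11)-(12).
    __device__ inline double mach_plus(double M, double beta)
    {
        if (fabs(M) >= 1.0) return 0.5 * (M + fabs(M));
        double q = M * M - 1.0;
        return 0.25 * (M + 1.0) * (M + 1.0) + beta * q * q;
    }
    __device__ inline double mach_minus(double M, double beta)
    {
        if (fabs(M) >= 1.0) return 0.5 * (M - fabs(M));
        double q = M * M - 1.0;
        return -0.25 * (M - 1.0) * (M - 1.0) - beta * q * q;
    }

    // Split pressures, Eqs. (14)-(15).
    __device__ inline double p_plus(double M, double alpha)
    {
        double s = (M >= 0.0) ? 1.0 : -1.0;
        if (fabs(M) >= 1.0) return 0.5 * (1.0 + s);
        double q = M * M - 1.0;
        return 0.25 * (M + 1.0) * (M + 1.0) * (2.0 - M) + alpha * q * q;
    }
    __device__ inline double p_minus(double M, double alpha)
    {
        double s = (M >= 0.0) ? 1.0 : -1.0;
        if (fabs(M) >= 1.0) return 0.5 * (1.0 - s);
        double q = M * M - 1.0;
        return 0.25 * (M - 1.0) * (M - 1.0) * (2.0 + M) - alpha * q * q;
    }

    // Interface quantities then follow Eqs. (9), (13) and (16):
    // Mn = mach_plus(ML, beta) + mach_minus(MR, beta);
    // pn = p_plus(ML, alpha) * pL + p_minus(MR, alpha) * pR;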

Fig. 3. The flowchart of the flow solver.


Fig. 5. GPU implementation for part (2) in the Runge–Kutta loop.
Fig. 6. GPU implementation for parts (4)–(6) in the Runge–Kutta loop.

The interface speed of sound an is given by

an = (1/2)(aL + aR)   (16)

(·)L represents the variable on the left (lower) side of the interface, and (·)R represents the variable on the right (upper) side of the interface. In Eq. (12) and Eq. (15), α and β are set to 3/16 and 1/8, respectively [43]. Fluxes are reconstructed with a second-order accurate reconstruction scheme at the cell interfaces according to the MUSCL scheme [44]. The MINMOD limiter is used to limit the gradient and prevent the reconstruction from oscillating. For the time discretization, considering the high precision and natural parallelism of explicit methods, the explicit third-order total-variation-diminishing (TVD) Runge-Kutta method [45] is adopted:

U^(1) = U^n + Δt R(U^n)   (17)

U^(2) = (3/4) U^n + (1/4) U^(1) + (1/4) Δt R(U^(1))   (18)

U^(n+1) = (1/3) U^n + (2/3) U^(2) + (2/3) Δt R(U^(2))   (19)

with R(·) being the right-hand-side term in Eq. (1).
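The three stages of Eqs. (17)–(19) share the form U_out = c0 U^n + c1 U_stage + c2 Δt R(U_stage), which maps naturally onto one GPU thread per cell. The kernel below is a minimal sketch under that assumption; the array layout, names and launch details are illustrative, not the authors' code.

    // One generic TVD-RK3 stage, cf. Eqs. (17)-(19): u_out = c0*u_n + c1*u_stage + c2*dt*res.
    // res[] is assumed to already hold R(U_stage) for every cell and variable.
    __global__ void rk3_stage(const double* __restrict__ u_n,
                              const double* __restrict__ u_stage,
                              const double* __restrict__ res,
                              double* u_out,
                              double c0, double c1, double c2,
                              double dt, int n_cells, int n_vars)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per cell
        if (i >= n_cells) return;
        for (int k = 0; k < n_vars; ++k) {
            int idx = k * n_cells + i;                   // structure-of-arrays layout (assumed)
            u_out[idx] = c0 * u_n[idx] + c1 * u_stage[idx] + c2 * dt * res[idx];
        }
    }

    // Stage coefficients: (1, 0, 1) for Eq. (17) (with u_stage = u_n),
    // (3/4, 1/4, 1/4) for Eq. (18), and (1/3, 2/3, 2/3) for Eq. (19).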
From t^n to t^(n+1), the local time step of cell i is given by Eq. (20), and the minimum of the local time steps is used as the global time step (see Eq. (21)). Moreover, for solving unsteady flows with moving boundaries on a fixed Cartesian grid, the time step is also limited by Eq. (22) due to changes in cell types [46].


Δti = CFL · hi / ((|ui^n| + ai) + (|vi^n| + ai))   (20)

Δt = min(Δti)   (21)

Δt ≤ hi / max(|ui^n|, |vi^n|)   (22)

with Δxi = Δyi = hi being the cell size of cell i.

Since the Cartesian grid is non-body-fitted and the boundary is immersed in the background grid, we use the ghost cell method developed by Forrer [16] to enforce boundary conditions. The scheme is given by Eqs. (23)–(25), and the schematic diagram is shown in Fig. 1. In Eq. (25), n denotes the normal vector of the boundary. For "freshly cleared cells", the properties are obtained from the flow field cells in the vicinity [12]. As Fig. 2 shows, an interpolation method is used to ensure computational precision during the data exchanges between grids of different levels.

Fig. 1. Schematic illustration of Forrer's ghost cell method.

PA = PW + (d/h)(PW − PC)   (23)

ρA = ρW + (d/h)(ρW − ρC)   (24)

vA = vB − 2(vB · n) n   (25)
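A minimal sketch of how Eqs. (23)–(25) can be evaluated for one ghost cell is given below; the structure, the naming of the interpolation points and the approximation of the mirror state are our assumptions for illustration, not the authors' code.

    // Fill one ghost cell A from mirrored flow-field data, following Eqs. (23)-(25).
    // d is the distance from the ghost-cell centre to the wall, h the cell size;
    // W and C are the interpolation points of Fig. 1 (names assumed).
    struct Primitive { double rho, u, v, p; };

    __device__ inline Primitive forrer_ghost(const Primitive& W, const Primitive& C,
                                             double d, double h,
                                             double nx, double ny)   // unit wall normal
    {
        Primitive A;
        A.p   = W.p   + (d / h) * (W.p   - C.p);    // Eq. (23)
        A.rho = W.rho + (d / h) * (W.rho - C.rho);  // Eq. (24)
        // Eq. (25): reflect the velocity of the mirror point B about the wall,
        // vA = vB - 2 (vB . n) n; here the mirror state B is approximated by W (assumption).
        double vbn = W.u * nx + W.v * ny;
        A.u = W.u - 2.0 * vbn * nx;
        A.v = W.v - 2.0 * vbn * ny;
        return A;
    }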

3. GPU implementation

3.1. Parallel methodology

When parallelizing a program with a GPU, an elaborate design is indispensable, or it will lead to poor performance. It is necessary to know the typical architecture of a GPU first, which can be seen in Refs. [20,39], as well as the execution bottlenecks of the code. As Fig. 3 shows, there are three modules in the flow solver: preprocessing, a Runge–Kutta loop, and postprocessing. Through tests, we find that the second module, i.e., the Runge–Kutta loop, costs much more time than the others. Moreover, containing most of the interpolations and complex formulas, part (2) and parts (4)–(6) take up most of this time, which is demonstrated in Fig. 4. Note that the computational overhead of a GPU program comes from two parts, computation and data transfer, so it is not worthwhile to implement code with a small amount of calculation on the GPU, because it may take more time to transfer the data. Therefore, focusing on part (2) and parts (4)–(6) is more meaningful and beneficial. The parallel feasibility analysis and the corresponding parallel strategy for these parts are introduced as follows.

Part (2) of the Runge–Kutta loop includes the judgment and assignment of flow field cells, body cells, freshly cleared cells, and ghost cells. In this solver, the ray method is used to judge the types of all cells except the ghost cells and, to improve the robustness of the code, two rays are used. For each ray generated from the center of a cell, the number of intersections with the boundary is computed, which determines whether the cell is inside or outside the body. The ray method can obviously be executed in parallel, since no other cells need to be considered when judging the type of a cell. For the ghost cells, the judgment cannot be executed until all cells of the other types have been judged. Therefore, two kernels, i.e., CUDA functions that run on a GPU, are required. The assignment of a cell can be executed immediately after its type is determined. As Fig. 5 shows, the first kernel judges and assigns values to the flow field cells, the body cells, and the freshly cleared cells. The second kernel judges the ghost cells and assigns values to them.

For the handling of parts (4)–(6) in the Runge–Kutta loop, our approach is inspired by the work of Leskinen et al. [33]. In their scheme, an extra kernel was adopted to deal with the results stored in an auxiliary array, which avoided different threads writing over each other's results. However, the operations on the auxiliary array cost more than 40% of the computation time. In addition, the memory footprint of the array is large, which reduces the maximum size of the grid that can be processed on a GPU. In our scheme, only one kernel, which we call "kernel-RK", is used, and the auxiliary array takes up only half of the memory space. As Fig. 6 shows, all the numerical fluxes of the corresponding cell are computed, summed, and then used to update the cell continuously on each thread. In this way, there is no need to expand and sum auxiliary arrays, but some extra, time-consuming replicated computation on each thread is added. Through tests, we find that the running time of the replicated computation is about an order of magnitude less than that of the operations on the auxiliary arrays, demonstrating that our approach has better overall performance. Although the auxiliary arrays still take up some memory resources, for a nested Cartesian grid we can transfer one level of the grid to the GPU for calculation at a time, in case the memory space of the GPU cannot accommodate the whole grid.
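The per-thread flux summation of kernel-RK described above can be sketched as follows: each thread owns one cell, recomputes the numerical fluxes through its four faces and accumulates them locally, so no auxiliary flux array and no conflicting writes are needed. The flux routine, data layout and names below are placeholders, not the paper's kernel.

    // Placeholder for the AUSM+ face flux of Section 2.2: a real implementation would
    // evaluate the flux through one face of cell (i, j) from MUSCL-reconstructed states.
    __device__ void ausm_flux(const double* u, int i, int j, int face,
                              int nx, int ny, double* f)
    {
        for (int k = 0; k < 4; ++k) f[k] = 0.0;   // stub only
    }

    // Sketch of the "kernel-RK" idea: one thread per cell sums that cell's own face fluxes.
    // Faces shared by two cells are evaluated twice (once per neighbouring thread), trading
    // a little redundant arithmetic for the removal of the auxiliary flux array.
    __global__ void kernel_rk(const double* __restrict__ u, double* residual,
                              int nx, int ny, double h)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i >= nx || j >= ny) return;

        double sum[4] = {0.0, 0.0, 0.0, 0.0};
        double f[4];
        for (int face = 0; face < 4; ++face) {        // east, west, north, south
            ausm_flux(u, i, j, face, nx, ny, f);
            for (int k = 0; k < 4; ++k) sum[k] += f[k];
        }

        for (int k = 0; k < 4; ++k)
            residual[(k * ny + j) * nx + i] = -sum[k] / h;   // R(U), assumed normalisation
    }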
3.2. Validation

To investigate the accuracy of the flow solver, two cases were chosen for validation: the so-called Schardin's problem (Case 1) and the supersonic flow over an oscillating NACA0006 airfoil (Case 2). Case 1 is an unsteady problem with a stationary boundary, and Case 2 is an unsteady problem with a moving boundary. An i7-7700HQ CPU at 2.80 GHz and an NVIDIA GeForce GTX 1050 GPU with 4 GB of global memory are used. The relevant parameters of the GTX 1050 GPU can be viewed in the CUDA toolkit provided by NVIDIA. The programming interface is CUDA version 9.1, and the connection between the CPU and the GPU is a PCI Express 16X bus.

Fig. 7. Geometry for Schardin's problem.

In Case 1, the problem begins with a planar shock impinging on a finite wedge [47]. The initial geometry, shown in Fig. 7, and the computational parameters are the same as in Chang et al. [48]. The size of the computation grid is 568 × 400 (cell size h = 0.25 mm), and only one level is used. Forrer's ghost cell method (FGCM) is used to enforce the reflection boundary condition, and the CFL number is taken as 0.8. Fig. 8 shows the density fields computed by Chang et al. and by our flow solver, as well as the interferograms. In Ref. [48], a body-fitted quadrilateral grid was adopted and second-order accuracy in space and time was maintained. For the flow behind the wedge, Fig. 9 illustrates the density distribution along the symmetry plane computed by Chang et al. and by our flow solver, as well as the corresponding experimental result, which shows that our solver can achieve the same accuracy as the body-fitted grid.

In Case 2, the angle-of-attack motion of the NACA0006 airfoil is given by

α(t) = α0 + αm sin(ωt)   (26)

with the reduced angular frequency k = ωl/2V∞, l being the airfoil chord length, and V∞ being the far-field flow speed.


Fig. 8. Comparison of the density field with Chang's result as well as the experimental result.

The geometry used by Wang [49] is adopted, and the schematic diagram is illustrated in Fig. 10(a). The reduced angular frequency k is 0.208, the pressure and density are set based on the standard atmosphere condition, and more details can be found in Ref. [49] as well as [50]. As Fig. 10(b) and (c) show, a three-level nested Cartesian grid is used in this test, and the boundary condition is enforced by the FGCM. The CFL number is taken as 0.8, and the problem is solved at different grid sizes. The pressure coefficients on the airfoil surface at different times in the third period are measured to verify the accuracy of our flow solver. Since the Cartesian grid is non-body-fitted, a sufficient number of cells is needed to accurately approximate the boundary, especially near a sharp corner such as the wing trailing edge. As Fig. 11 shows, compared with [49], there is a certain deviation of the solution when the grid size in each level is 200 × 200 (cell size h = 0.006 in the deepest level), while the solution is almost identical with 400 × 400 (cell size h = 0.003 in the deepest level) and 600 × 600 (cell size h = 0.002 in the deepest level). When the grid size in each level is 400 × 400, the Mach number contours at different times in the third period are illustrated in Fig. 12.

3.3. Speedup ratio


The speedup ratio, i.e., the ratio of the running time of the code on the CPU to that on the GPU, is used to measure the GPU's parallel efficiency. In the following tests (including the tests in Section 4), we use Case 2 as the test case to discuss the performance of the flow solver with the GPU, and the computational overheads of advancing 1000 time steps are recorded to compute the speedup ratios. Speedup ratios of 9.98–14.04 are achieved on the nested Cartesian grid with up to 1200 × 1200 cells in each level, as shown in Table 1. The results show that the speedup ratio increases as the grid size grows, but the rate of growth slows down. As mentioned above, the GPU running time is composed of two segments, namely computing and data transfer, so parts with a small amount of calculation are not parallelized. In the test, parts such as the time-step computation are therefore not computed in parallel at a small grid size, and this remains the case as the scale of the grid gradually increases, which accounts for the slowing growth rate of the speedup ratio. In conclusion, it is necessary to weigh the running time of computing against that of data transfer at different grid sizes.

Fig. 9. Comparison of the density distribution along the symmetry plane with Chang's result as well as the experimental result.
Fig. 10. Geometry for the pitching NACA0006 airfoil and the computational grid.

4. Numerical experiment

Furthermore, for Case 2, we performed a numerical experiment by testing the kernel-RK mentioned in Section 3.1. The computational overheads of running the kernel 100 times are used as benchmarks to evaluate the performance of the GPU code. Two performance indicators, "achieved occupancy" and "global load throughput", are monitored and analyzed. The achieved occupancy is the ratio of the average number of active warps per cycle to the maximum number of warps supported by a streaming multiprocessor (SM), and the global load throughput is determined by the number of memory load operations per cycle of the SM and reflects the efficiency of memory reading. Both are important to the performance of a kernel.
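The per-kernel overheads quoted in this section can be obtained, for example, with CUDA event timing around 100 launches, as sketched below; the stub kernel, grid size and launch configuration are placeholders (the occupancy and throughput indicators themselves come from NVIDIA's profiling tools, not from this timer).

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void kernel_rk_stub() { /* stand-in for the real kernel-RK of Section 3.1 */ }

    int main()
    {
        dim3 block(64, 1);                       // execution configuration under test
        dim3 grid((400 * 400 + 63) / 64);        // e.g. one thread per cell on a 400 x 400 level

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        for (int it = 0; it < 100; ++it)
            kernel_rk_stub<<<grid, block>>>();
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
        std::printf("100 launches took %.6f s\n", ms * 1.0e-3f);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return 0;
    }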


4.1. Execution configuration

The organization of threads, namely blocks per grid and threads per block, is called the execution configuration, and it greatly affects kernel performance. Note that the innermost dimension of each thread block should be kept an integer multiple of the warp size when adjusting the execution configuration, because the global load throughput decreases significantly when the innermost dimension of a thread block is less than the warp size. For the test case with 400 × 400 cells in each level, Fig. 13 shows the time of running the kernel 100 times and the corresponding achieved occupancy and global load throughput when the execution configuration is 1024 × 1, 512 × 1, 256 × 1, 128 × 1, 64 × 1, and 32 × 1.

Fig. 11. Comparison of the pressure coefficients on the airfoil surface at different times between our result and Wang's.

The results indicate that, for the test kernel, as the number of thread blocks increases, the achieved occupancy and global load throughput initially rise and the running time of the kernel shortens accordingly. However, we find that when the number of thread blocks exceeds a certain threshold, the achieved occupancy and global load throughput decrease and the computation slows down. Moreover, the difference in running time between the execution configurations 1024 × 1 and 64 × 1 is about 50%. Therefore, it is of great significance to find the optimal execution configuration to attain the best performance. For thread blocks with the same number of threads, different block shapes also affect the kernel performance. Table 2 lists the running time and indicators of running the kernel 100 times with different configurations at 128 threads per block.
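For reference, the configurations of Table 2 all contain 128 threads per block but shape them differently; the sketch below shows one way a kernel and its launch can be written so that only the block shape changes between the tested cases. The kernel name and the dummy workload are illustrative.

    #include <cuda_runtime.h>

    // Flat cell index for an arbitrary block shape: this kernel works unchanged for
    // 128 x 1, 64 x 2 and 32 x 4 blocks because it linearises threadIdx first.
    __global__ void demo_kernel(double* data, int n_cells)
    {
        int tid_in_block = threadIdx.y * blockDim.x + threadIdx.x;
        int block_size   = blockDim.x * blockDim.y;
        int i = blockIdx.x * block_size + tid_in_block;
        if (i < n_cells) data[i] *= 2.0;    // placeholder work
    }

    // Launch with any of the shapes of Table 2; the grid covers n_cells threads in all cases.
    void launch_with_shape(double* d_data, int n_cells, dim3 block)  // (128,1), (64,2) or (32,4)
    {
        int block_size = block.x * block.y;
        int blocks = (n_cells + block_size - 1) / block_size;
        demo_kernel<<<blocks, block>>>(d_data, n_cells);
    }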


Fig. 12. Mach number contours at different times in the third period.

Table 1
Running time on the CPU and the GPU at different grid sizes, and the speedup ratio.

Grid size        Time on CPU (s)   Time on GPU (s)   Speedup ratio
400 × 400          7975.20            799.04             9.98
600 × 600         18057.33           1497.29            12.06
800 × 800         32730.12           2496.20            13.11
1000 × 1000       51653.56           3707.28            13.93
1200 × 1200       75698.91           5392.02            14.04

The results indicate that the two indicators affect the kernel performance together, but neither of them is dominant, so further testing is needed to find the best execution configuration. We tested the kernel at all the rational execution configurations and found that the optimal execution configuration is 64 × 1. In this case, the time of running the kernel 100 times is 1.918518 s, with an achieved occupancy of 0.492 and a global load throughput of 131.70 GB/s.

Fig. 13. Computational overhead of running the kernel 100 times with different execution configurations, with a grid size of 400 × 400 in each level.

4.2. Grid size

The above analysis is based on a grid size of 400 × 400 in each level. In a bid to verify whether the obtained regularity applies to cases at different grid sizes, similar tests were carried out when the grid sizes are 600 × 600, 800 × 800, and 1000 × 1000 in each level.


Table 2
Time and indicators for different execution configurations when running the kernel 100 times at 128 threads per block.

Execution configuration   Time (s)    Throughput (GB/s)   Occupancy
128 × 1                   1.970143       129.30             0.493
64 × 2                    1.961981       127.56             0.515
32 × 4                    1.973480       124.31             0.540

Table 3
Time and indicators for different execution configurations when running the kernel 100 times at 64 threads per block at different grid sizes.

Grid size      Execution configuration   Time (s)    Throughput (GB/s)   Occupancy
400 × 400      64 × 1                    1.918415       131.70             0.492
               32 × 2                    1.938072       127.63             0.537
600 × 600      64 × 1                    4.238120       132.45             0.480
               32 × 2                    4.372084       127.69             0.532
800 × 800      64 × 1                    7.632374       130.60             0.493
               32 × 2                    7.911793       126.05             0.536
1000 × 1000    64 × 1                    12.22783       124.75             0.521
               32 × 2                    12.02101       127.55             0.530

Fig. 14. Computational overhead of running the kernel 100 times with different execution configurations, with a grid size of 600 × 600 in each level.
Fig. 15. Computational overhead of running the kernel 100 times with different execution configurations, with a grid size of 800 × 800 in each level.
Fig. 16. Computational overhead of running the kernel 100 times with different execution configurations, with a grid size of 1000 × 1000 in each level.

Figs. 14–16 illustrate the time of running the test kernel 100 times at different grid sizes with execution configurations of 1024 × 1, 512 × 1, 256 × 1, 128 × 1, 64 × 1, and 32 × 1. The regularity mentioned above is basically the same at different grid sizes, and the kernel reaches the best overall performance with 64 threads per block in all cases. By further adjusting the execution configuration at 64 threads per block, it is found that, when the grid size is 600 × 600 or 800 × 800 in each level, the execution configuration with the best overall performance is 64 × 1, and, when the grid size is 1000 × 1000, the best overall performance is obtained with the execution configuration 32 × 2. The achieved occupancy and global load throughput of the different execution configurations at 64 threads per block with different grid sizes are shown in Table 3. The results show that, although the achieved occupancy and global load throughput affect the kernel performance together, the performance of the kernel is more sensitive to the global load throughput. Moreover, as the size of the grid increases, the impact of the execution configuration gradually diminishes.

4.3. Discussion

In the tests, both the achieved occupancy and the global load throughput are always at low levels. For the achieved occupancy, the most likely reason is that the test kernel occupies too many registers. Registers are resources divided among the active warps on the SM, so using fewer registers means more resident thread blocks on the SM and hence higher occupancy. To reduce the register utilization, some variables of the test kernel were stored in local memory; the number of occupied registers is thereby reduced from 52 to 32, and the occupancy increases to approximately 0.8. However, the running time becomes longer by approximately 8%. We analyzed this phenomenon and attribute it to the fact that the access speed of local memory is much slower than that of registers, leading to lower efficiency and throughput. Therefore, we think that the best way to reduce the usage of registers is to optimize the algorithm and the programming as far as possible, whereas this is impractical for some mature, large algorithms and programs.
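Besides manually moving variables to local memory, register pressure is often capped through the compiler, for instance with the __launch_bounds__ qualifier or the nvcc option -maxrregcount; the trade-off is the same, since the compiler spills the excess registers to local memory. A sketch with illustrative numbers:

    // Ask the compiler to limit register usage so that more blocks can reside on an SM.
    // 128 threads per block and 4 resident blocks are illustrative values; if the kernel
    // cannot fit the implied register budget, the excess is spilled to local memory,
    // which is exactly the trade-off discussed above.
    __global__ void __launch_bounds__(128, 4) kernel_rk_capped()
    {
        // kernel body unchanged
    }

    // Alternatively, the whole file can be compiled with, e.g.:  nvcc -maxrregcount=32 solver.cu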
For the global load throughput, one of the reasons is that the kernel does not use all the variables defined in a cell, which reduces resource utilization. Therefore, we can transfer only the useful memory resources to the GPU. The specific implementation is depicted in Fig. 17: some data members of the data structure of a cell are repacked into a new data structure according to the requirements of the kernel, and, before the kernel is launched, only the new data structure of each cell is transferred to the GPU. In this way, the global load throughput of the test kernel reaches about 200 GB/s, and the performance is obviously improved.
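A minimal sketch of the repacking idea of Fig. 17 is given below: only the members actually read by the kernel are gathered into a compact structure on the host before the host-to-device copy. The struct contents and names are assumptions for illustration, not the solver's real data layout.

    #include <vector>
    #include <cuda_runtime.h>

    // Full per-cell record kept on the host (only some members are needed by kernel-RK).
    struct CellFull {
        double cons[4];     // conservative variables
        double h;           // cell size
        int    type;        // flow field / body / ghost / freshly cleared
        double aux[8];      // members the kernel never touches (assumed)
    };

    // Compact record actually shipped to the GPU.
    struct CellPacked {
        double cons[4];
        double h;
        int    type;
    };

    void upload_packed(const std::vector<CellFull>& cells, CellPacked** d_out)
    {
        std::vector<CellPacked> packed(cells.size());
        for (size_t i = 0; i < cells.size(); ++i) {
            for (int k = 0; k < 4; ++k) packed[i].cons[k] = cells[i].cons[k];
            packed[i].h    = cells[i].h;
            packed[i].type = cells[i].type;
        }
        size_t bytes = packed.size() * sizeof(CellPacked);
        cudaMalloc(reinterpret_cast<void**>(d_out), bytes);
        cudaMemcpy(*d_out, packed.data(), bytes, cudaMemcpyHostToDevice);
    }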


Fig. 17. The approach of transferring data to the GPU.

5. Conclusions

In this study, GPU parallel computation on a nested Cartesian grid was implemented to accelerate simulations of unsteady problems with stationary and moving boundaries. According to the parallel feasibility analysis of the code, an improved parallel strategy was implemented. The results show that the GPU effectively improves the computational speed, with a speedup ratio of 9.98–14.04. In addition, taking the kernel of parts (4)–(6) in the Runge–Kutta loop as an example, further research on the execution configuration was carried out by monitoring and analyzing two performance indicators, the achieved occupancy and the global load throughput. The following conclusions are drawn from the numerical experiment.

● With the increase of the number of thread blocks, both the achieved occupancy and the throughput improve, whereas, when the number exceeds a certain threshold, the performance of the kernel decreases. When each block contains 64 threads, the kernel has the best overall performance. Of course, different kernels have different optimal execution configurations, which require similar testing to determine.
● The above conclusion still holds when the grid size is changed. Moreover, we found that the impact of the execution configuration is gradually reduced as the grid size increases.
● Although the occupancy and the throughput affect the kernel performance together, the throughput has a higher impact. To improve the throughput, we discarded the unused data and built a new data structure containing only the data that needed to be transferred to the GPU. In this simple way, the throughput can be significantly improved.

In the future, we are going to improve the computation speed of our solver by optimizing memory usage and using multiple GPUs. Based on a more efficient computing platform, we expect to obtain aerodynamic data as soon as possible for aerodynamic shape optimization.

Acknowledgments

The authors would like to express their gratitude for the financial support provided by the Fund of Innovation, Shanghai Aerospace Science and Technology (No. SAST201419). The authors are also grateful to the reviewers for their extremely constructive comments.

Appendix A. Supplementary data

Supplementary data related to this article can be found at https://doi.org/10.1016/j.actaastro.2019.03.020.

References

[1] Y. Li, B. Reimann, T. Eggers, Numerical investigations on the aerodynamics of SHEFEX-III launcher, Acta Astronaut. 97 (2014) 99–108.
[2] Y. Li, B. Reimann, T. Eggers, Coupled simulation of CFD-flight-mechanics with a two-species-gas-model for the hot rocket staging, Acta Astronaut. 128 (2016) 44–61.
[3] C. Huang, W. Liu, G. Yang, Numerical studies of static aeroelastic effects on grid fin aerodynamic performances, Chin. J. Aeronaut. 30 (4) (2017) 1300–1314.
[4] Y. Chai, Z. Song, F. Li, Investigations on the influences of elastic foundations on the aerothermoelastic flutter and thermal buckling properties of lattice sandwich panels in supersonic airflow, Acta Astronaut. 140 (2017) 176–189.
[5] M. Winter, F.M. Heckmeier, C. Breitsamter, CFD-based aeroelastic reduced-order modeling robust to structural parameter variations, Aero. Sci. Technol. 67 (2017) 13–30.
[6] X. Jiao, J. Chang, Z. Wang, D. Yu, Numerical study on hypersonic nozzle-inlet starting characteristics in a shock tunnel, Acta Astronaut. 130 (2017) 167–179.
[7] N. Liu, Z. Wang, M. Sun, H. Wang, B. Wang, Numerical simulation of liquid droplet breakup in supersonic flows, Acta Astronaut. 145 (2018) 116–130.
[8] H. Zhao, W. Liu, J. Ding, Y. Sun, X. Li, Y. Liu, Numerical study on separation shock characteristics of pyrotechnic separation nuts, Acta Astronaut. 151 (2018) 893–903.
[9] C.S. Peskin, Numerical analysis of blood flow in the heart, J. Comput. Phys. 25 (3) (1977) 220–252.
[10] D.M. Anderson, G.B. McFadden, A.A. Wheeler, Diffuse interface methods in fluid mechanics, Annu. Rev. Fluid Mech. 30 (1998) 139–165.
[11] J.K. Patel, G. Natarajan, Diffuse interface immersed boundary method for multi-fluid flows with arbitrarily moving rigid bodies, J. Comput. Phys. 360 (2018) 202–228.
[12] H.S. Udaykumar, R. Mittal, P. Rampunggoon, A. Khanna, A sharp interface Cartesian grid method for simulating flows with complex moving boundaries, J. Comput. Phys. 174 (1) (2001) 345–380.
[13] R. Mittal, H. Dong, M. Bozkurttas, F.M. Najjar, A. Vargas, A. Loebbecke, A versatile sharp interface immersed boundary method for incompressible flows with complex boundaries, J. Comput. Phys. 227 (10) (2008) 4825–4852.
[14] D. Angelidis, S. Chawdhary, F. Sotiropoulos, Unstructured Cartesian refinement with sharp interface immersed boundary method for 3D unsteady incompressible flows, J. Comput. Phys. 325 (2016) 272–300.
[15] R.V. Maitri, S. Das, J.A.M. Kuipers, J.T. Padding, E.A.J.F. Peters, An improved ghost-cell sharp interface immersed boundary method with direct forcing for particle laden flows, Comput. Fluids 175 (2018) 111–128.
[16] H. Forrer, M. Berger, Flow simulations on Cartesian grids involving complex moving geometries, Int. Ser. Numer. Math. 129 (1999) 315–324.
[17] Y.H. Tseng, J.H. Ferziger, A ghost-cell immersed boundary method for flow in complex geometry, J. Comput. Phys. 192 (2) (2003) 593–623.
[18] C. Liu, C. Hu, An immersed boundary solver for inviscid compressible flows, Int. J. Numer. Methods Fluids 85 (2017) 619–640.
[19] J. Xin, F. Shi, Q. Jin, C. Lin, A radial basis function based ghost cell method with improved mass conservation for complex moving boundary flows, Comput. Fluids 176 (2018) 210–225.
[20] Nvidia Corporation, CUDA C Programming Guide v10.0, 2018.
[21] T. Brandvik, G. Pullan, Acceleration of a 3D Euler solver using commodity graphics hardware, 46th AIAA Aerospace Sciences Meeting and Exhibit, 2008 (Reno, Nevada).
[22] E. Elsen, P. LeGresley, E. Darve, Large calculation of the flow over a hypersonic vehicle using a GPU, J. Comput. Phys. 227 (2008) 10148–10161.
[23] T. Julien, S. Inanc, CUDA implementation of a Navier-Stokes solver on multi-GPU desktop platforms for incompressible flows, 47th AIAA Aerospace Sciences Meeting Including the New Horizons Forum and Aerospace Exposition, 2009 (Orlando, Florida).
[24] E.H. Phillips, Y. Zhang, R.L. Davis, J.D. Owens, Rapid aerodynamic performance prediction on a cluster of graphics processing units, 47th AIAA Aerospace Sciences Meeting Including the New Horizons Forum and Aerospace Exposition, 2009 (Orlando, Florida).
[25] D.A. Jacobsen, I. Senocak, Multi-level parallelism for incompressible flow computations on GPU clusters, Parallel Comput. 39 (1) (2013) 1–20.
[26] C. Xu, X. Deng, L. Zhang, J. Fang, G. Wang, Y. Jiang, W. Cao, Y. Che, Y. Wang, Z. Wang, W. Liu, X. Cheng, Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the TianHe-1A supercomputer, J. Comput. Phys. 278 (2014) 275–297.
[27] I. Tanno, K. Morinishi, N. Satofuka, Y. Watanabe, Calculation by artificial compressibility method and virtual flux method on GPU, Comput. Fluids 45 (1) (2011) 162–167.
[28] R. Löhner, A. Corrigan, Semi-automatic porting of a general fortran CFD code to GPUs: the difficult modules, 20th AIAA Computational Fluid Dynamics Conference, 2011 (Honolulu, Hawaii).
[29] J. Zhang, C. Sha, Y. Wu, J. Wan, L. Zhou, Y. Ren, H. Si, Y. Yin, Y. Jing, The novel implicit LU-SGS parallel iterative method based on the diffusion equation of a nuclear reactor on a GPU cluster, Comput. Phys. Commun. 211 (2017) 16–22.
[30] L. Fu, Z. Gao, K. Xu, F. Xu, A multi-block viscous flow solver based on GPU parallel methodology, Comput. Fluids 95 (2014) 19–39.
[31] K. Komatsu, T. Soga, R. Egawa, H. Takizawa, H. Kobayashi, S. Takahashi, D. Sasaki, K. Nakahashi, Parallel processing of the building-cube method on a GPU platform, Comput. Fluids 45 (1) (2011) 122–128.
[32] Y. Xiang, B. Yu, Q. Yuan, D. Sun, GPU acceleration of CFD algorithm: HSMAC and SIMPLE, Procedia Comput. Sci. 208 (2017) 1982–1989.
[33] J. Leskinen, J. Periaux, Distributed evolutionary optimization using Nash games and GPUs – applications to CFD design problems, Comput. Fluids 80 (2013) 190–201.
[34] A. Corrigan, F. Camelli, R. Löhner, J. Wallin, Running unstructured grid-based CFD solvers on modern graphics hardware, 19th AIAA Computational Fluid Dynamics, 2009 (San Antonio, Texas).
[35] E.E. Franco, H.M. Barrera, S. Laín, 2D lid-driven cavity flow simulation using GPU-CUDA with a high-order finite difference scheme, J. Braz. Soc. Mech. Sci. Eng. 37 (4) (2015) 1329–1338.
[36] N. Jain, J.D. Baeder, Aerodynamic characteristics of SC1095 airfoil using hybrid RANS-LES methods implemented into a GPU accelerated Navier-Stokes solver, 22nd AIAA Computational Fluid Dynamics Conference, 2015 (Dallas, TX).
[37] V.N. Emelyanov, A.G. Karpenko, A.S. Kozelkov, I.V. Teterina, K.N. Volkov, A.V. Yalozo, Analysis of impact of general-purpose graphics processor units in supersonic flow modeling, Acta Astronaut. 135 (2017) 198–207.
[38] S. Shu, N. Yang, GPU-accelerated large eddy simulation of stirred tanks, Chem. Eng. Sci. 181 (2018) 132–145.
[39] J. Zhang, Z. Ma, H. Chen, C. Cao, A GPU-accelerated implicit meshless method for compressible flows, J. Comput. Phys. 360 (2018) 39–56.
[40] W. Ma, Z. Lu, J. Zhang, GPU parallelization of unstructured/hybrid grid ALE multigrid unsteady solver for moving body problems, Comput. Fluids 110 (2015) 122–135.
[41] D. Chandar, J. Sitaraman, D. Mavriplis, GPU parallelization of an unstructured overset grid incompressible Navier-Stokes solver for moving bodies, 50th AIAA Aerospace Sciences Meeting Including the New Horizons Forum and Aerospace Exposition, 2012 (Nashville, Tennessee).
[42] A. Jameson, W. Schmidt, E. Turkel, Numerical solution of the Euler equations by finite volume methods using Runge-Kutta time stepping schemes, 14th Fluid and Plasma Dynamics Conference, 1981 (Palo Alto, CA, USA).
[43] M.S. Liou, A sequel to AUSM: AUSM+, J. Comput. Phys. 129 (2) (1996) 364–382.
[44] B.V. Leer, Towards the ultimate conservative difference scheme. II. Monotonicity and conservation combined in a second-order scheme, J. Comput. Phys. 14 (4) (1974) 361–370.
[45] C.W. Shu, S. Osher, Efficient implementation of essentially non-oscillatory shock-capturing schemes, II, J. Comput. Phys. 83 (1) (1989) 32–78.
[46] S. Tan, C. Shu, A high order moving boundary treatment for compressible inviscid flows, J. Comput. Phys. 230 (15) (2011) 6023–6036.
[47] H. Schardin, High frequency cinematography in the shock tube, J. Photogr. Sci. 5 (1957) 19–26.
[48] S.M. Chang, K.S. Chang, On the shock–vortex interaction in Schardin's problem, Shock Waves 10 (2000) 333–343.
[49] J.L. Wang, Unsteady Aerodynamic Calculation Based on Unstructured Moving Mesh, Master's Dissertation, Northwestern Polytechnical University, Xi'an, China, 2005.
[50] B. Zhang, T. Yang, Z. Feng, Q. Zhang, J. Ge, L. Huo, Application strategy and improvement of unstructured dynamic grid method based on elasticity analogy (in Chinese), J. Aerosp. Power 32 (3) (2017) 648–656.
