Acceleration of a 2D unsteady Euler solver with GPU on nested Cartesian grid
Acta Astronautica
Keywords: Unsteady; GPU parallel computation; Euler equation; Nested Cartesian grid; Execution configuration

Abstract: Graphics processing unit (GPU) parallel computation is used to accelerate a solver that computes the 2D unsteady compressible Euler equations discretized with the finite-volume method on a nested Cartesian grid. In this solver, a second-order accurate upwind scheme is adopted, with explicit time stepping by the third-order total-variation-diminishing Runge–Kutta method. An improved parallel strategy is implemented, and for a test case on a three-level nested Cartesian grid, speedup ratios of 9.98–14.04 are achieved at different grid sizes. Further, through a numerical experiment, the relation between kernel performance and the execution configuration at different grid sizes is examined by monitoring and analyzing performance indicators. Approaches to improve kernel performance are also explored.
1. Introduction

As an important research field of fluid dynamics, the study of unsteady flows involves many problems, such as stage separation [1,2] and aeroelastic analysis [3–5]. With the development of computational fluid dynamics (CFD), more and more people have begun to simulate unsteady problems by numerical computation [6–8]. However, the computational cost of solving unsteady problems is large and, especially for unsteady problems with moving boundaries, the grid needs to be readjusted after each time step, since the motion of the boundary changes the physical space of the flow field. As a result, much running time is unavoidable when simulating these unsteady problems.

The Cartesian grid method is a feasible and efficient approach for flow simulations with complex stationary and moving boundaries. Based on the immersed boundary method (IBM), first proposed by Peskin [9], various flow problems can be solved on a fixed, non-body-fitted Cartesian grid. At present, IBMs can be divided into two categories, namely the diffused interface (DI) method [9–11] and the sharp interface (SI) method [12–15]. The DI method has been widely used in low Reynolds number flow simulations, especially in biological fluid dynamics, but it is difficult to apply in high Reynolds number flow simulations due to the compromised accuracy at the interface. In aerospace engineering, most simulations involve high Reynolds number flows, which can be handled by the SI method.

The ghost cell method (GCM), a classic approach of the SI method, has drawn great attention from many researchers in recent years [15–19]. In the GCM, body cells that have at least one neighbor in the fluid are defined as "ghost cells", and boundary conditions are enforced through these ghost cells. The advantages of the GCM on a Cartesian grid lie in fast grid generation as well as low memory usage; more importantly, this method is easy to implement since there is no need to change the governing equation.

In recent years, graphics processing unit (GPU)-based general-purpose computation technology has developed greatly, and the GPU has gained much attention for its high floating-point throughput compared with central processing units (CPUs) [20]. For problems with large-scale data that can be processed in a parallel manner, a GPU, which has many cores supporting large-scale multithreaded parallel computing, is the best choice. In 2007, an easy-to-use programming interface called the Compute Unified Device Architecture (CUDA) was proposed by NVIDIA [20]. CUDA helps researchers program GPUs without having to learn complex shader languages and allows GPU code to be written in regular C. As a result, this low-cost and high-efficiency technology immediately received considerable attention from CFD researchers. Brandvik and Pullan [21] took the lead in using CUDA technology to accelerate an Euler solver, and speedup ratios of 29 for the 2D solver and 16 for the 3D code were obtained, which demonstrates that the GPU is well suited to CFD. In the work of
∗ Corresponding author. College of Aerospace Science and Engineering, National University of Defense Technology, Changsha, Hunan 410073, People's Republic of China.
E-mail address: snowblade@sina.com (L. Jin).
https://doi.org/10.1016/j.actaastro.2019.03.020
Received 26 November 2018; Received in revised form 25 February 2019; Accepted 6 March 2019
Available online 09 March 2019
0094-5765/ © 2019 Published by Elsevier Ltd on behalf of IAA.
F. Wei, et al. Acta Astronautica 159 (2019) 319–330
Elsen et al. [22], a GPU was used to accelerate the simulation of a hypersonic vehicle in cruise at Mach 5, and a multigrid scheme was used to accelerate the solution. A speedup ratio of more than 20 was demonstrated for this complex geometry. To further increase the calculation speed, researchers began to accelerate CFD codes on multi-GPU platforms. Julien et al. [23] first used multiple GPUs to accelerate a Navier-Stokes solver, and a speedup ratio of 21 was achieved on a dual-GPU platform. In the work of Phillips et al. [24] and Jacobsen et al. [25], larger GPU clusters were employed and the GPU code was hundreds of times faster than the CPU code. Xu et al. [26] accelerated a high-order CFD code for 3D multiblock structured grids on the TianHe-1A supercomputer and successfully performed CPU-GPU collaborative high-order accurate simulations with large-scale complex grids, which showed the powerful capability of GPU-based general-purpose computation technology in aerospace engineering.

For the performance of a GPU program, not only the number of GPUs but also the GPU parallel methods play important roles. For the time-discretization scheme, in the work of many researchers, such as Brandvik et al. [21] and Elsen et al. [22], the explicit Runge-Kutta method was adopted for its high data parallelism. Implicit methods, such as the lower-upper symmetric Gauss-Seidel (LU-SGS) method, are ineffective on a GPU because of their strong data dependency; special approaches are needed to eliminate the data dependency in the LU-SGS, and related work can be found in Refs. [27–29]. The data-parallel lower-upper relaxation (DP-LUR) method, an implicit method with high data parallelism, is suitable for running on the GPU [30]. In specific GPU implementations, Komatsu et al. [31] as well as Yue et al. [32] divided the computational grid into two categories, red points and black points, to solve the pressure Poisson equation in parallel on the GPU. Leskinen et al. [33] added auxiliary arrays to store the results computed by the threads, which prevented different threads from writing over each other's results. For the unstructured grid, which is at risk of highly non-coalesced access, they achieved coalesced memory access by copying the original data to a new array according to the coalescing requirements, like Corrigan et al. [34]; however, the performance was poor since the data replication cost too much time. Franco et al. [35] used CUDA to accelerate a solver based on a fourth-order finite difference scheme, and a speedup ratio of 90 was achieved. In their work, data that needed to be read repeatedly from global memory were put into shared memory, and different execution configurations were tested, which indicated about a 20% performance difference between block sizes of 128 and 512.

As mentioned above, research on GPU-acceleration technology has achieved great progress and has been applied to many practical flow simulations [36–39]. However, most investigations focus on flows with stationary boundaries on structured/unstructured grids or moving boundaries on dynamic unstructured grids [40,41]. For the GPU implementation on the nested Cartesian grid, there has been a lack of relevant research.

In the present study, the acceleration of an unsteady solver on a nested Cartesian grid, which can simulate 2D unsteady inviscid compressible flow with both stationary and moving boundaries, is implemented by using a GPU. Through an improved parallel strategy, the speed of code execution is greatly improved. In most research on accelerating a solver with a GPU, kernel performance with different execution configurations may be tested [35], but there is no further analysis. Therefore, in this paper, a numerical experiment is performed by testing a kernel with different execution configurations at different grid sizes. During the test, performance indicators of the kernel are monitored in order to find the impact of changing the execution configuration. Through the numerical experiment, the optimal execution configuration is obtained, and the relation between the performance indicators and the execution configuration at different grid sizes is investigated. Moreover, we also discuss approaches to further improve kernel performance.

2. Flow solver

2.1. Governing equations

The cell-centered finite-volume method proposed by Jameson [42] is used to solve the inviscid flow on a nested Cartesian grid, and the compressible Euler equations in 2D are given as

U_t = −F(U)_x − G(U)_y,  x = (x, y) ∈ Ω(t),  0 < t < T    (1)

where the solution vector U, the x-direction inviscid flux F(U), and the y-direction inviscid flux G(U) are given by

U = [ρ, ρu, ρv, ρE]^T    (2)

F(U) = [ρu, ρu² + p, ρuv, u(E + p)]^T    (3)

G(U) = [ρv, ρuv, ρv² + p, v(E + p)]^T    (4)

The equation of state is as follows:

E = p/(γ − 1) + (1/2)ρ(u² + v²)    (5)

Fig. 1. Schematic illustration of Forrer's ghost cell method.
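As a minimal illustration of Eqs. (2)–(5), the fluxes can be evaluated from the conserved variables as sketched below. This is an illustrative Python version (the solver itself runs as CUDA C kernels); γ = 1.4 is assumed, as the text does not state its value here, and E is taken as the total energy per unit volume of Eq. (5).

```python
GAMMA = 1.4  # ratio of specific heats (assumed; not stated in the text)

def fluxes(U):
    """Inviscid fluxes F(U) and G(U) of Eqs. (3)-(4) from the conserved
    variables U = [rho, rho*u, rho*v, E], with the pressure recovered by
    inverting the equation of state (5)."""
    rho, rho_u, rho_v, E = U
    u, v = rho_u / rho, rho_v / rho
    p = (GAMMA - 1.0) * (E - 0.5 * rho * (u * u + v * v))  # from Eq. (5)
    F = [rho * u, rho * u * u + p, rho * u * v, u * (E + p)]
    G = [rho * v, rho * u * v, rho * v * v + p, v * (E + p)]
    return F, G
```

For quiescent air (ρ = 1, u = v = 0, p = 1), Eq. (5) gives E = 2.5, and the only nonzero flux entries are the pressure terms in the momentum components.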
a^n = (1/2)(a_L + a_R)    (16)

(·)_L represents the variable on the left (lower) side of the interface, and (·)_R represents the variable on the right (upper) side of the interface. In Eq. (12) and Eq. (15), α and β are set to 3/16 and 1/8, respectively [43]. Fluxes are reconstructed at the cell interfaces by a second-order reconstruction according to the MUSCL scheme [44], and the MINMOD limiter is used to limit the gradients to prevent the reconstruction from oscillating. In the time discretization, considering the high precision and natural parallelism of the explicit method, the explicit third-order total-variation-diminishing (TVD) Runge-Kutta method [45] is adopted, and the scheme is given by

U^(1) = U^n + ΔtR(U^n)    (17)

U^(2) = (3/4)U^n + (1/4)U^(1) + (1/4)ΔtR(U^(1))    (18)

U^(n+1) = (1/3)U^n + (2/3)U^(2) + (2/3)ΔtR(U^(2))    (19)

with R(·) being the right-hand-side term in Eq. (1). From t^n to t^(n+1), the local time step of cell i is given by Eq. (20), and the minimum of the local time steps is used as the global time step (see Eq. (21)). Moreover, for solving unsteady flows with moving boundaries on a fixed Cartesian grid, the time step is limited by Eq. (22) due to changes in cell types [46].
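The three stages of Eqs. (17)–(19) can be written as a single step routine. The sketch below is illustrative Python rather than the solver's CUDA C; `R` is a caller-supplied function returning the right-hand side of Eq. (1) for each cell.

```python
def tvd_rk3_step(U, dt, R):
    """One step of the third-order TVD Runge-Kutta scheme, Eqs. (17)-(19).
    U is a list of conserved values per cell; R(U) returns the
    right-hand-side term of Eq. (1) for each cell."""
    U1 = [u + dt * r for u, r in zip(U, R(U))]                       # Eq. (17)
    U2 = [0.75 * u + 0.25 * u1 + 0.25 * dt * r
          for u, u1, r in zip(U, U1, R(U1))]                         # Eq. (18)
    return [u / 3.0 + 2.0 * u2 / 3.0 + 2.0 * dt * r / 3.0
            for u, u2, r in zip(U, U2, R(U2))]                       # Eq. (19)
```

Because every cell is updated independently from data of the previous stage, each stage maps directly onto one thread per cell on the GPU.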
code, two rays are used. For each ray generated from the center of a cell, the number of intersections with the boundary is computed, which is used to judge whether the cell is inside or outside the body. The ray method can obviously be executed in parallel, since there is no need to consider other cells when judging the type of a cell. For the ghost cells, however, the judgment cannot be executed until all the cells of the other types have been judged. Therefore, two kernels (CUDA functions that run on a GPU) are required. The assignment of cells can be executed immediately after the type of each cell is determined. As Fig. 5 shows, the first kernel judges and assigns values to the flow field cells, the body cells, and the freshly cleared cells; the second kernel judges the ghost cells and assigns values to them.
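The per-cell intersection test can be sketched as follows. This is a hedged Python illustration that assumes the body boundary is available as a closed polygon of vertices (a representation of our choosing, not stated in the text) and casts a single ray in the +x direction; the solver uses two rays, as noted above, to guard against degenerate intersections.

```python
def inside_body(x, y, poly):
    """Even-odd ray test: cast a ray from the cell centre (x, y) in the +x
    direction and count crossings with the closed polygon 'poly' (a list of
    (x, y) vertices). An odd count means the centre lies inside the body.
    Each cell is tested independently of all others, which is why the
    classification parallelises trivially, one GPU thread per cell."""
    inside = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):                       # edge straddles the ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:                            # crossing to the right
                inside = not inside
    return inside
```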
Δt_i = CFL · h_i / [(|u_i^n| + a_i) + (|v_i^n| + a_i)]    (20)

Δt = min(Δt_i)    (21)

Δt ≤ h_i / max(|u_i^n|, |v_i^n|)    (22)

with Δx_i = Δy_i = h_i being the cell size of cell i.

Since the Cartesian grid is non-body-fitted and the boundary is immersed in the background grid, we use the ghost cell method developed by Forrer [16] to enforce boundary conditions. The scheme is given by Eqs. (23)–(25), and the schematic diagram is shown in Fig. 1. In Eq. (25), n denotes the normal vector of the boundary. For "freshly cleared cells", the properties are obtained from the flow field cells in the vicinity [12]. As Fig. 2 shows, an interpolation method is used to ensure computational precision during the data exchanges between grids of different levels.

Fig. 7. Geometry for the Schardin's problem.

For the handling of parts (4–6) in the Runge–Kutta loop, our approach is inspired by the work of Leskinen et al. [33]. In their scheme, an extra kernel was adopted to deal with the results stored in an auxiliary array, preventing different threads from writing over each other's results. However, the operations on the auxiliary array cost more than 40% of the time during computation. In addition, the memory footprint of the array is too large, which reduces the maximum size of the grid that can be processed on a GPU. In our scheme, only one kernel, which we call "kernel-RK", is utilized, and the auxiliary array takes up only half of the memory space. As Fig. 6 shows, all the numerical fluxes of the corresponding cell are computed, summed, and then updated continuously on each thread. In this way, there is no need to expand and sum auxiliary arrays, but extra time for the replicated computation on each thread is added. Through tests, we find that the running time of the replicated computation is about an order of magnitude less than that of the operations on the auxiliary arrays, demonstrating that our approach has better overall performance. Although the auxiliary arrays still take up some memory resources, for a nested Cartesian grid we can transfer one level of the grid to the GPU for calculation at a time, in case the memory space of the GPU cannot accommodate the whole grid.
P_A = P_W + (d/h)(P_W − P_C)    (23)

3.2. Validation
Fig. 8. Comparison of the density field with Chang's result as well as the experimental result.

The geometry used by Wang [49] is adopted, and the schematic diagram is illustrated in Fig. 10(a). The reduced angular frequency k is 0.208, the pressure and density are set based on the standard atmosphere condition, and more details can be found in Ref. [49] as well as [50]. As Fig. 10(b) and (c) show, the three-level nested Cartesian grid is used in this test, and the boundary condition is enforced by the FGCM. The CFL number is taken as 0.8, and the problem is solved at different grid sizes. The pressure coefficients on the airfoil surface at different times in the third period are measured to verify the accuracy of our flow solver. Since the Cartesian grid is non-body-fitted, a sufficient number of cells is needed to accurately approximate the boundary, especially at a sharp corner such as the wing trailing edge. As Fig. 11 shows, compared with [49], there is a certain deviation of the solution when the grid size in each level is 200 × 200 (cell size h = 0.006 in the deepest level), while the solution is almost identical with 400 × 400 (h = 0.003 in the deepest level) and 600 × 600 (h = 0.002 in the deepest level). When the grid size in each level is 400 × 400, the Mach number contours at different times in the third period are illustrated in Fig. 12.

3.3. Speedup ratio

The speedup ratio, the ratio of the running time of the code on the CPU to that on the GPU,
is used to measure the GPU's parallel efficiency. In the following tests (including the tests in Section 4), Case 2 is used as the test case to discuss the performance of the flow solver with the GPU, and the computational overheads of advancing 1000 time steps are recorded to compute the speedup ratios. Speedup ratios of 9.98–14.04 are achieved on the nested Cartesian grid with up to 1200 × 1200 cells in each level, as shown in Table 1. The results show that the speedup ratio increases as the grid size is enlarged, but the rate of growth slows down. As mentioned above, the GPU running time is composed of two segments, namely computing and data transfer, and no parallelization is required for small amounts of calculation. However, in the test, parts such as computing the time step are not computed in parallel at a small grid size, and this remains the case as the scale of the grid gradually increases, which accounts for the slowing growth rate of the speedup ratio. In conclusion, it is necessary to measure the running times of computing and data transfer at different grid sizes.

Fig. 10. Geometry for the pitching NACA0006 airfoil and the computational grid.

4. Numerical experiment

4.1. Execution configuration

The organization form of threads, namely blocks per grid and threads per block, is called the execution configuration, which greatly affects kernel performance. Note that the innermost dimension of each thread block should be an integer multiple of the warp size when adjusting the
Fig. 11. Comparison of the pressure coefficients on the airfoil surface at different times between our result and Wang's.

execution configurations, to avoid a significant decrease in the global load throughput when the innermost dimension of a thread block is less than the warp size. For the test case with 400 × 400 cells in each level, Fig. 13 shows the time of running the kernel 100 times and the corresponding achieved occupancy and global load throughput when the execution configuration is 1024 × 1, 512 × 1, 256 × 1, 128 × 1, 64 × 1, and 32 × 1.

The results indicate that, for the test kernel, as the number of thread blocks increases at first, the achieved occupancy and the global load throughput rise, and the running time of the kernel shortens as well. However, we find that when the number of thread blocks reaches a certain threshold, the achieved occupancy and the global load throughput decrease, and the computational speed slows down. Moreover, the difference in running time between the execution configurations of 1024 × 1 and 64 × 1 is about 50%. Therefore, it is of great significance to find the optimal execution configuration to attain the best performance. For thread blocks with the same number of threads, different configurations also impact the kernel performance. Table 2 shows the running time and indicators of running the kernel 100 times with different configurations at 128 threads/block. The
Fig. 12. Mach number contours at different times in the third period.
Table 1
Running time on CPU and GPU at different grid sizes, as well as the speedup ratio.
Grid size | Time on CPU (s) | Time on GPU (s) | Speedup ratio
result indicates that the two indicators affect the kernel performance together, but neither of them is dominant, so further testing is needed to find the best execution configuration. We test the kernel at all the rational execution configurations and find that the optimal one is 64 × 1. In this case, the time of running the kernel 100 times is 1.918518 s, with the achieved occupancy being 0.492 and the global load throughput being 131.70 GB/s.

Fig. 13. Computational overhead of running the kernel 100 times with different execution configurations with a grid size of 400 × 400 in each level.
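The choice of configuration can be sketched as a small helper (illustrative Python; the names are ours, not the paper's): given the number of threads a kernel must cover and a candidate block shape, it derives the blocks-per-grid count by ceiling division and rejects blocks whose innermost dimension is not a multiple of the warp size, per the constraint noted above.

```python
WARP_SIZE = 32  # warp size of current NVIDIA GPUs

def launch_config(n_threads, block=(64, 1)):
    """Derive blocks-per-grid for a kernel needing one thread per cell.
    'block' is the (inner, outer) thread-block shape; the inner dimension
    must be a warp multiple so that global loads stay coalesced."""
    inner, outer = block
    if inner % WARP_SIZE != 0:
        raise ValueError("innermost block dimension must be a warp multiple")
    per_block = inner * outer
    blocks = -(-n_threads // per_block)  # ceiling division
    return blocks, block
```

This only fixes the grid size for a given block shape; which warp-multiple shape is fastest (here, 64 × 1 for most grid sizes) still has to be found by measurement, as the experiment above does.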
Table 2
Time and indicators for different execution configurations to run the kernel 100 times at 128 threads/block.
Execution configuration | Time (s) | Throughput (GB/s) | Occupancy

Fig. 16. Computational overhead of running the kernel 100 times with different execution configurations with a grid size of 1000 × 1000 in each level.
Table 3
Time and indicators for different execution configurations to run the kernel 100 times at 64 threads/block at different grid sizes.
Grid size | Execution configuration | Time (s) | Throughput (GB/s) | Occupancy
different grid sizes with execution configurations of 1024 × 1, 512 × 1, 256 × 1, 128 × 1, 64 × 1, and 32 × 1. The pattern mentioned above is basically the same at different grid sizes, and the kernel reaches the best overall performance with 64 threads/block in all cases. By further adjusting the execution configuration at 64 threads/block, it is found that, when the grid size is 600 × 600 or 800 × 800 in each level, the execution configuration with the best overall performance is 64 × 1, and, when the grid size is 1000 × 1000, the best overall performance is obtained with 32 × 2. The achieved occupancy and global load throughput of different execution configurations at 64 threads/block with different grid sizes are shown in Table 3. The results show that although the achieved occupancy and the global load throughput affect the kernel performance together, the performance of the kernel is more sensitive to the global load throughput. Moreover, as the size of the grid increases, the impact of the execution configuration diminishes gradually.

Fig. 15. Computational overhead of running the kernel 100 times with different execution configurations with a grid size of 800 × 800 in each level.

4.3. Discussion

In the test, both the achieved occupancy and the global load throughput are always at low levels. For the achieved occupancy, the most likely reason is that the test kernel occupies too many registers. Registers are resources divided among the active warps in the SM, so using fewer registers means more resident thread blocks in the SM as well as higher occupancy. To reduce the utilization of registers, some variables of the test kernel were stored in local memory, which reduced the number of occupied registers from 52 to 32 and increased the occupancy to approximately 0.8. However, the running time became longer by approximately 8%. We analyzed this phenomenon and believe the reason is that the access speed of local memory is much slower than that of registers, leading to lower efficiency and throughput. Therefore, we think that the best way to reduce the usage of registers is to optimize the algorithm and the programming as far as possible, although this is impractical for some mature large algorithms and programs.

For the global load throughput, one of the reasons is that the kernel does not use all the variables defined in a cell, which reduces resource utilization. Therefore, we can transfer only the useful memory resources to the GPU. The specific implementation is depicted in Fig. 17: some data members of the cell data structure are reconstructed into a new data structure according to the requirements of the kernel, and, before the kernel is launched, only the new data structure of each cell is transferred to the GPU. In this way, the global load throughput of the test kernel reaches about 200 GB/s, and the performance is obviously improved.

5. Conclusions
In this study, GPU parallel computation on a nested Cartesian grid
(2011) 162–167.
[28] R. Löhner, A. Corrigan, Semi-automatic porting of a general Fortran CFD code to GPUs: the difficult modules, 20th AIAA Computational Fluid Dynamics Conference, 2011 (Honolulu, Hawaii).
[29] J. Zhang, C. Sha, Y. Wu, J. Wan, L. Zhou, Y. Ren, H. Si, Y. Yin, Y. Jing, The novel implicit LU-SGS parallel iterative method based on the diffusion equation of a nuclear reactor on a GPU cluster, Comput. Phys. Commun. 211 (2017) 16–22.
[30] L. Fu, Z. Gao, K. Xu, F. Xu, A multi-block viscous flow solver based on GPU parallel methodology, Comput. Fluids 95 (2014) 19–39.
[31] K. Komatsu, T. Soga, R. Egawa, H. Takizawa, H. Kobayashi, S. Takahashi, D. Sasaki, K. Nakahashi, Parallel processing of the building-cube method on a GPU platform, Comput. Fluids 45 (1) (2011) 122–128.
[32] Y. Xiang, B. Yu, Q. Yuan, D. Sun, GPU acceleration of CFD algorithm: HSMAC and SIMPLE, Procedia Comput. Sci. 208 (2017) 1982–1989.
[33] J. Leskinen, J. Periaux, Distributed evolutionary optimization using Nash games and GPUs – applications to CFD design problems, Comput. Fluids 80 (2013) 190–201.
[34] A. Corrigan, F. Camelli, R. Löhner, J. Wallin, Running unstructured grid-based CFD solvers on modern graphics hardware, 19th AIAA Computational Fluid Dynamics, 2009 (San Antonio, Texas).
[35] E.E. Franco, H.M. Barrera, S. Laín, 2D lid-driven cavity flow simulation using GPU-CUDA with a high-order finite difference scheme, J. Braz. Soc. Mech. Sci. Eng. 37 (4) (2015) 1329–1338.
[36] N. Jain, J.D. Baeder, Aerodynamic characteristics of SC1095 airfoil using hybrid RANS-LES methods implemented into a GPU accelerated Navier-Stokes solver, 22nd AIAA Computational Fluid Dynamics Conference, 2015 (Dallas, TX).
[37] V.N. Emelyanov, A.G. Karpenko, A.S. Kozelkov, I.V. Teterina, K.N. Volkov, A.V. Yalozo, Analysis of impact of general-purpose graphics processor units in supersonic flow modeling, Acta Astronaut. 135 (2017) 198–207.
[38] S. Shu, N. Yang, GPU-accelerated large eddy simulation of stirred tanks, Chem. Eng. Sci. 181 (2018) 132–145.
[39] J. Zhang, Z. Ma, H. Chen, C. Cao, A GPU-accelerated implicit meshless method for compressible flows, J. Comput. Phys. 360 (2018) 39–56.
[40] W. Ma, Z. Lu, J. Zhang, GPU parallelization of unstructured/hybrid grid ALE multigrid unsteady solver for moving body problems, Comput. Fluids 110 (2015) 122–135.
[41] D. Chandar, J. Sitaraman, D. Mavriplis, GPU parallelization of an unstructured overset grid incompressible Navier-Stokes solver for moving bodies, 50th AIAA Aerospace Sciences Meeting Including the New Horizons Forum and Aerospace Exposition, 2012 (Nashville, Tennessee).
[42] A. Jameson, W. Schmidt, E. Turkel, Numerical solution of the Euler equations by finite volume methods using Runge-Kutta time stepping schemes, 14th Fluid and Plasma Dynamics Conference, 1981 (Palo Alto, CA, USA).
[43] M.S. Liou, A sequel to AUSM: AUSM+, J. Comput. Phys. 129 (2) (1996) 364–382.
[44] B. van Leer, Towards the ultimate conservative difference scheme. II. Monotonicity and conservation combined in a second-order scheme, J. Comput. Phys. 14 (4) (1974) 361–370.
[45] C.W. Shu, S. Osher, Efficient implementation of essentially non-oscillatory shock-capturing schemes, II, J. Comput. Phys. 83 (1) (1989) 32–78.
[46] S. Tan, C. Shu, A high order moving boundary treatment for compressible inviscid flows, J. Comput. Phys. 230 (15) (2011) 6023–6036.
[47] H. Schardin, High frequency cinematography in the shock tube, J. Photogr. Sci. 5 (1957) 19–26.
[48] S.M. Chang, K.S. Chang, On the shock–vortex interaction in Schardin's problem, Shock Waves 10 (2000) 333–343.
[49] J.L. Wang, Unsteady Aerodynamic Calculation Based on Unstructured Moving Mesh, Master's Dissertation, Northwestern Polytechnical University, Xi'an, China, 2005.
[50] B. Zhang, T. Yang, Z. Feng, Q. Zhang, J. Ge, L. Huo, Application strategy and improvement of unstructured dynamic grid method based on elasticity analogy (in Chinese), J. Aerosp. Power 32 (3) (2017) 648–656.