
Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 178 (2020) 47–54
www.elsevier.com/locate/procedia

9th International Young Scientist Conference on Computational Science (YSC 2020)
Parallel implementation of a numerical method for solving a three-dimensional transport equation for a mesoscale meteorological model

D V Leshchinskiy a,b,*, A V Starchenko a, E A Danilkin a, S A Prohanov a

a National Research Tomsk State University, 36 Lenin Ave., Tomsk, Tomsk Oblast, 634050, Russia
b Regional Scientific and Educational Mathematical Center, Tomsk State University, 36 Lenin Ave., Tomsk, Tomsk Oblast, 634050, Russia

Abstract

A parallel algorithm for the numerical solution of a generalized unsteady three-dimensional advection-diffusion equation for a mesoscale meteorological model is considered. The efficiency of parallel implementations of the solver with Message Passing Interface (MPI), Open Multi-Processing (OpenMP), and NVidia Compute Unified Device Architecture (CUDA) technologies was compared. It was shown that on a mesh of 256 × 256 × 32 nodes the speedup of the program reaches 18 for the OpenMP and MPI implementations and 38 for CUDA.
© 2020 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the 9th International Young Scientist Conference on Computational Science.
Keywords: Numerical method; parallel computing; transport equation; OpenMP; MPI; CUDA

1. Introduction
Development of computational technologies and computer engineering has made mathematical modeling a widely applied instrument for solving problems in ecology and air quality. Inhomogeneous unsteady three-dimensional advection-diffusion equations are the basis of the majority of the mathematical models of continuum mechanics and hydrometeorology. These equations describe changes in density, velocity components, temperature, concentration, and turbulence. Finite difference, finite volume, and finite element methods are applied to solve transport equations. All of these methods reduce the initial differential equation to a system of algebraic equations of large dimension. Modern multi-core and multi-processor computational systems with co-processors or graphic processor units (GPUs) can significantly speed up the solution of this system [1].

* Corresponding author. Tel.: +7-952-182-9559.
E-mail address: 360flip182@gmail.com
1877-0509 © 2020 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the 9th International Young Scientist Conference on Computational Science.
doi: 10.1016/j.procs.2020.11.006

The variety of architectures of high-performance computing systems gives rise to a variety of instruments for building parallel programs. Nowadays MPI, OpenMP, and OpenACC/CUDA parallel programming technologies are the most widely used in scientific computing. The existence of several parallelization technologies motivates scientists to investigate the efficiency of applying different technologies and their combinations to particular cases.
There are several works comparing the efficiency of pure MPI parallelization with that of hybrid parallelization using MPI+OpenMP or MPI+CUDA. The Message-Passing Interface standard MPI is usually used to organize interaction between nodes of a supercomputer. Within a node a program is parallelized with the OpenMP API for shared-memory systems, or a co-processor or a GPU is used. Pure MPI parallelization often works faster than the MPI+OpenMP approach [2, 3].
The aim of this work was to carry out experiments and choose the technology (MPI, OpenMP, or CUDA) that gives the fastest code on a shared-memory system with a GPU for solving the transport equation. Characteristics of the computing system used: two Intel(R) Xeon(R) Silver 4214 CPUs @ 2.20 GHz; 192 GB RAM; and two NVidia RTX 2080 Ti GPUs.
The results will be used to create a parallel version of the TSUNM3 high-resolution mesoscale meteorological model, which is being developed at Tomsk State University (TSU) for forecasting hazardous weather events and air quality over the city. The mesoscale meteorological model TSUNM3 (Tomsk State University Nonhydrostatic Mesoscale Meteorology Model) predicts the values of wind velocity components, temperature, and humidity in the atmospheric boundary layer on 50 vertical layers (up to 10 km from the surface) for an area of 200 × 200 km with a nested area of 50 × 50 km (mesh step of 1 km with the city of Tomsk in the centre). The mathematical formulation of the TSUNM3 model includes 11 unsteady inhomogeneous three-dimensional advection-diffusion equations (equations for the 3 components of the velocity vector, temperature, turbulent energy, humidity, raindrops, snowflakes, cloud moisture, ice crystals, and ice pellets) and an algebraic closure model, which is computed independently along vertical coordinate lines.

2. Mathematical formulation and a numerical method

Transport of a gaseous inert admixture from a point source in an idealized cuboid of 50 × 50 × 0.6 km was considered. The source of the admixture was placed in the center of the area at some height above the ground. It was assumed that the soil does not absorb the admixture and that the admixture can be transported out of the area.

2.1. Mathematical model

The transport of the admixture can be described with a generalized unsteady three-dimensional advection-diffusion equation:

     
$$\frac{\partial \rho\Phi}{\partial t} + \frac{\partial \rho U\Phi}{\partial x} + \frac{\partial \rho V\Phi}{\partial y} + \frac{\partial \rho W\Phi}{\partial z} = \frac{\partial}{\partial x}\!\left(K_{xy}\frac{\partial\Phi}{\partial x}\right) + \frac{\partial}{\partial y}\!\left(K_{xy}\frac{\partial\Phi}{\partial y}\right) + \frac{\partial}{\partial z}\!\left(K_{z}\frac{\partial\Phi}{\partial z}\right) + S_{\Phi}; \qquad (1)$$

where t, x, y, z are the time and spatial coordinates (z is the upward axis); U, V, W are the components of the velocity vector (the velocity field is solenoidal); ρ is the density; K_xy, K_z are the turbulent exchange coefficients; S_Φ is the source term.
The Neumann boundary condition was set at all boundaries (a derivative normal to the boundary was set to zero):

$$\frac{\partial \Phi}{\partial n} = 0. \qquad (2)$$

It was considered that there is no admixture in the area at the initial moment:

$$\Phi(0, x, y, z) = 0. \qquad (3)$$

2.2. Numerical method

A Cartesian computational mesh, uniform in the horizontal dimensions and refined towards the Earth's surface, was used for discretization. Differential problem (1)-(3) was approximated with the finite volume method. Each term of equation (1) was integrated over each finite volume to obtain the discrete equation:

$$
\begin{aligned}
&\frac{\rho^{n+1}\Phi^{n+1}_{i,j,k} - \rho^{n}\Phi^{n}_{i,j,k}}{\tau}\;+\\
&+\frac{3}{2}\int_{V_{i,j,k}}\!\left[\frac{\partial \rho U\Phi}{\partial x} + \frac{\partial \rho V\Phi}{\partial y} + \frac{\partial \rho W\Phi}{\partial z} - \frac{\partial}{\partial x}\!\left(K_{xy}\frac{\partial\Phi}{\partial x}\right) - \frac{\partial}{\partial y}\!\left(K_{xy}\frac{\partial\Phi}{\partial y}\right) - S_{\Phi}\right]^{n} dx\,dy\,dz\;-\\
&-\frac{1}{2}\int_{V_{i,j,k}}\!\left[\frac{\partial \rho U\Phi}{\partial x} + \frac{\partial \rho V\Phi}{\partial y} + \frac{\partial \rho W\Phi}{\partial z} - \frac{\partial}{\partial x}\!\left(K_{xy}\frac{\partial\Phi}{\partial x}\right) - \frac{\partial}{\partial y}\!\left(K_{xy}\frac{\partial\Phi}{\partial y}\right) - S_{\Phi}\right]^{n-1} dx\,dy\,dz\;-\\
&-\frac{1}{2}\int_{V_{i,j,k}}\!\left[\frac{\partial}{\partial z}\!\left(K_{z}\frac{\partial\Phi}{\partial z}\right)\right]^{n+1} dx\,dy\,dz\;-\;\frac{1}{2}\int_{V_{i,j,k}}\!\left[\frac{\partial}{\partial z}\!\left(K_{z}\frac{\partial\Phi}{\partial z}\right)\right]^{n} dx\,dy\,dz = 0
\end{aligned}
\qquad (4)
$$

Here n is the time step number and V_{i,j,k} is the finite volume. Here and below the subscript i, j, k shows that the variable is determined in the center of a finite volume.
The implicit Crank-Nicolson method for approximating diffusion in the vertical direction and the explicit Adams-Bashforth method for the other terms were used to ensure second order accuracy in time and space. Implicit approximation of the vertical diffusion transport term relaxes the restriction on the time step. A system of linear algebraic equations with a tridiagonal matrix (5) is the result of the integration.

$$a_{i,j,k}\,\Phi^{n+1}_{i,j,k-1} - b_{i,j,k}\,\Phi^{n+1}_{i,j,k} + c_{i,j,k}\,\Phi^{n+1}_{i,j,k+1} = d_{i,j,k};\qquad i = 1,\ldots,N_x;\; j = 1,\ldots,N_y;\; k = 1,\ldots,N_z;\; n = 1, 2, 3, \ldots \qquad (5)$$

Here a_{i,j,k}, b_{i,j,k}, c_{i,j,k}, d_{i,j,k} are the coefficients defined explicitly by numerical integration of equation (4). The temporal approximation applied allows using the efficient tridiagonal matrix algorithm (TDMA) in the vertical direction. The value of the time step τ was chosen from the stability condition for the finite-difference scheme. Convective terms of the transport equation were approximated with the monotonic linear upwind scheme (MLU) of Van Leer [4].

3. Implementation of the numerical algorithm

The developed numerical algorithm was implemented in the C++ programming language. The computational core of the algorithm consists of loops on i and j that go over all the grid nodes in the XY-plane. For each node (i, j) in the XY-plane the numerical solution of equation (5) was found in the vertical direction with the TDMA (Fig. 1).
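The core listing itself appears only as an image (Fig. 1) in the published paper, so the following is a minimal sketch of what such a computational core might look like. The flattened arrays a, b, c, d for the coefficients of (5), the array Phi for the unknown, and the assumption that the boundary rows have a = 0 at the bottom and c = 0 at the top are all illustrative, not the authors' code.

```cpp
#include <cstddef>
#include <vector>

// Sketch of the computational core: for every (i, j) column the tridiagonal
// system (5), a*Phi[k-1] - b*Phi[k] + c*Phi[k+1] = d, is solved along the
// vertical with the Thomas algorithm (TDMA). Names and layout are illustrative.
void solve_columns(int Nx, int Ny, int Nz,
                   const std::vector<double>& a, const std::vector<double>& b,
                   const std::vector<double>& c, const std::vector<double>& d,
                   std::vector<double>& Phi)
{
    auto idx = [=](int i, int j, int k) {
        return (static_cast<std::size_t>(k) * Ny + j) * Nx + i;
    };
    std::vector<double> p(Nz), q(Nz);                  // TDMA sweep coefficients

    for (int j = 0; j < Ny; ++j)
        for (int i = 0; i < Nx; ++i) {
            // forward sweep (a is assumed zero in the bottom row)
            p[0] = c[idx(i, j, 0)] / b[idx(i, j, 0)];
            q[0] = -d[idx(i, j, 0)] / b[idx(i, j, 0)];
            for (int k = 1; k < Nz; ++k) {
                const double denom = b[idx(i, j, k)] - a[idx(i, j, k)] * p[k - 1];
                p[k] = c[idx(i, j, k)] / denom;
                q[k] = (a[idx(i, j, k)] * q[k - 1] - d[idx(i, j, k)]) / denom;
            }
            // back substitution (c is assumed zero in the top row, so p[Nz-1] = 0)
            Phi[idx(i, j, Nz - 1)] = q[Nz - 1];
            for (int k = Nz - 2; k >= 0; --k)
                Phi[idx(i, j, k)] = p[k] * Phi[idx(i, j, k + 1)] + q[k];
        }
}
```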

The runtime of the program was estimated for the case of admixture transport from a constant-emission point source in the area of 50 (L) × 50 (W) × 0.6 (H) km. The source was located at 0.007L km in the X-direction, 0.007L km in the Y-direction, and 0.0101H km in the Z-direction from the center of the domain. Velocity components were considered constant: U = 1 m/s, V = 1 m/s, and W = 0 m/s. Diffusion coefficients were also considered constant: K_xy = K_z = 100 m²/s. The time step τ was set to 6 s and the modeling was finished at 5000τ. The mesh of 256 × 256 × 32 nodes was used in the computations. The average runtime of the sequential program was 644.5 s. The accuracy of the numerical solution was controlled against the analytical solution of the equation considered [5].

Fig. 1. Simplified representation of the algorithm.

4. Parallel implementation of the numerical algorithm

After the sequential program was verified and tested, it was parallelized with OpenMP, MPI, and CUDA technologies.

4.1. OpenMP

The OpenMP API is a tool for developing parallel programs for shared-memory computing systems. It consists of compiler directives, library routines, and environment variables for creating multithreaded programs. OpenMP is one of the simplest parallelization technologies to implement. Parallelization with OpenMP is done by introducing compiler directives into a sequential program. A program parallelized with OpenMP may also be run sequentially, because a compiler without OpenMP support ignores the OpenMP directives. To run the sequential code in multithreaded mode it is sufficient to introduce a directive that parallelizes the loops on i and j:
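The directive listing itself is not preserved in this text; a hedged sketch of the idea, reusing the loop structure from the previous listing, is:

```cpp
// Sketch only: one OpenMP directive on the outer loop is enough, since the
// column solves for different (i, j) are independent. The loop body is the
// per-column TDMA from the previous listing (abbreviated here).
#pragma omp parallel for
for (int j = 0; j < Ny; ++j) {
    std::vector<double> p(Nz), q(Nz);   // work arrays stay private to each thread
    for (int i = 0; i < Nx; ++i) {
        // ... forward sweep and back substitution for column (i, j) ...
    }
}
```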

During the calculations, it was found that parallelizing both nested loops gives the same results as parallelizing only the outer loop. Using the SCHEDULE clause with the STATIC, DYNAMIC, and GUIDED parameters did not provide any acceleration compared to the default parallelization.
The OpenMP program was run on different numbers of threads, from 1 to 48. Table 1 shows the speedup of the program and indicates that using all threads available on the server speeds up the program by 17.3 times.

Table 1. Speedup of the parallel programs on different numbers of OpenMP threads/MPI processes.

Threads/processes    1     2     4     8     16    24    32    48
OpenMP               1.0   2.0   3.8   7.1   13.5  16.5  14.6  17.3
MPI                  1.0   2.0   3.9   7.4   13.7  17.6  13.8  18.1

Deviation from linear acceleration appeared when more than 16 threads were used (Figure 2). This is due to the increase in memory load: the memory bandwidth available per active core decreases as the number of processor cores involved increases.

Fig. 2. Speedup of the OpenMP and MPI program.

The computing system has 24 computing cores. Hyper-threading makes it possible to run the program on 48 processes. The decrease in acceleration when using 32 threads is due to the uneven load of the computational cores when hyper-threading is enabled.

4.2. MPI

MPI is a message-passing library specification for distributed-memory systems. At the same time, MPI functions can also be implemented on a shared-memory system, which is why the efficiency of MPI and OpenMP parallel programs can be compared on the same computational system.
As opposed to OpenMP, MPI implies explicit distribution of computations among processes and explicit organization of data exchange; therefore, programming with MPI is more demanding.
In this work, two-dimensional domain decomposition was chosen as the main parallelization strategy with pure MPI. MPI functions for creating a two-dimensional Cartesian virtual topology, creating derived datatypes, and organizing effective communication between processes were used.

The values of the discrete function were sent to adjacent subdomains using the non-blocking point-to-point functions MPI_Isend() and MPI_Irecv().
Two-dimensional domain decomposition was chosen due to its better strong-scaling capabilities compared to a one-dimensional variant. Three-dimensional decomposition is not suitable for the case considered because the TDMA is difficult to parallelize.
The two-dimensional domain decomposition strategy consists in distributing among processes the subdomains together with all mesh values associated with the nodes belonging to each subdomain. The function MPI_Cart_create() was used to create a two-dimensional Cartesian grid of processes to distribute the subdomains. After the decomposition the limits for the loop counters i and j became i = 2, ..., Nx0 + 1 and j = 2, ..., Ny1 + 1, where Nx0 = Nx/dims[0] and Ny1 = Ny/dims[1]; dims[0] and dims[1] are the numbers of processes in the X and Y dimensions respectively, dims[0] * dims[1] is the total number of processes involved, and Nx0 and Ny1 are the numbers of nodes of the horizontal mesh treated by each process.
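The paper names MPI_Cart_create() but does not reproduce the call; a minimal sketch under assumed variable names might look as follows:

```cpp
#include <mpi.h>

// Sketch: build the 2D Cartesian process grid and derive the local subdomain size.
void create_process_grid(int Nx, int Ny, MPI_Comm& cart_comm, int& Nx0, int& Ny1)
{
    int nprocs, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int dims[2] = {0, 0}, periods[2] = {0, 0}, coords[2];
    MPI_Dims_create(nprocs, 2, dims);          // factor nprocs into dims[0] x dims[1]
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart_comm);
    MPI_Cart_coords(cart_comm, rank, 2, coords);

    Nx0 = Nx / dims[0];                        // horizontal mesh nodes per process in X
    Ny1 = Ny / dims[1];                        // horizontal mesh nodes per process in Y
}
```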
As the finite-difference scheme in the horizontal plane uses a nine-point stencil, the values of the grid function from the adjacent process are required to calculate the next time step values at the near-boundary nodes of each subdomain. Therefore, ghost cells were created on each process to store data from the adjacent processes [7].
Consider the domain decomposition in the X-direction. Ghost cells with numbers 0 and 1 and with numbers Nx0 + 2 and Nx0 + 3 were created to the left and to the right of the inner cells in each subdomain. Cells with numbers from 2 to Nx0 + 1 were inner. When all the computations for a time step were done, data was sent to the ghost cells. At first, every process but the zeroth sent data from the cells with numbers 2 and 3 to the process on the left and received data into the cells with numbers Nx0 + 2 and Nx0 + 3 from the process on the right. In Figure 3 these sends are marked with green arrows. The sends in the opposite direction were done similarly (brown arrows in Figure 3).
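A hedged sketch of this X-direction exchange is given below; slab_t (an MPI derived datatype describing the two boundary YZ-slices) and the index() offset helper are assumptions for illustration, not the paper's code. Since MPI_Cart_shift() returns MPI_PROC_NULL at the domain edges, the edge processes automatically skip the corresponding transfers.

```cpp
// Sketch of the ghost-cell update in the X direction (cf. Fig. 3).
void exchange_x(double* Phi, int Nx0, MPI_Comm cart_comm, MPI_Datatype slab_t)
{
    int left, right;
    MPI_Cart_shift(cart_comm, 0, 1, &left, &right);   // MPI_PROC_NULL at the edges

    MPI_Request req[4];
    // inner slices i = 2, 3 go to the left neighbour; ghosts Nx0+2, Nx0+3 come from the right
    MPI_Isend(&Phi[index(2, 0, 0)],       1, slab_t, left,  0, cart_comm, &req[0]);
    MPI_Irecv(&Phi[index(Nx0 + 2, 0, 0)], 1, slab_t, right, 0, cart_comm, &req[1]);
    // inner slices i = Nx0, Nx0+1 go to the right neighbour; ghosts 0, 1 come from the left
    MPI_Isend(&Phi[index(Nx0, 0, 0)],     1, slab_t, right, 1, cart_comm, &req[2]);
    MPI_Irecv(&Phi[index(0, 0, 0)],       1, slab_t, left,  1, cart_comm, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}
```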

Fig. 3. Creating ghost-cells and data exchange between processes.

The speedup of the MPI program depending on the number of processes involved is shown in Table 1 and Figure 2. Comparison of the OpenMP and MPI programs showed that, in general, on the shared-memory computational system used the speedups of the programs are almost equal when from 1 to 16 cores are used. The speedup of the MPI program is larger when more than 16 processes/threads are used. It is important to note that parallelizing with MPI is significantly more difficult.

4.3. CUDA

CUDA is a parallel computing platform developed by NVidia Corporation to perform parallel computing on Graphics Processing Units (GPUs). The architecture of GPUs is significantly different from the architecture of general-purpose CPUs and is specialized for highly parallel computation. That is, GPUs are designed to run a large number of parallel threads at once.
There are three main terms in CUDA terminology: device (GPU), host, and kernel. The device is a video adapter that acts as a "coprocessor" of the central processing unit (CPU). The host refers to the central processing unit that controls the execution of the program, starts tasks on the device, allocates memory on the GPU, and also allows data to be moved between the CPU and GPU. That is, the host and the device have their own separate memory. A kernel is a parallel part of the algorithm that the host runs on the device.
The CUDA algorithm for solving the problem consisted of the following stages:

1. Allocation of the required amount of memory by the host on the device (GPU). Memory was allocated for the three main three-dimensional arrays C, C0, C00. To allocate memory on the device, smart pointers from modern C++ were used, which allows avoiding basic memory-management problems, such as memory leaks, and guarantees that a pointer is deleted only once. An example of the implementation of such an allocation template, together with the custom deleter used to clean up memory with std::unique_ptr, is sketched below:
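The original listings are not reproduced in this text; a minimal sketch of an allocation template with a custom deleter (assumed names, not the authors' code) is:

```cpp
#include <cstddef>
#include <memory>
#include <cuda_runtime.h>

// Custom deleter: std::unique_ptr calls cudaFree exactly once, which prevents
// leaks and double frees of device memory.
struct CudaDeleter {
    void operator()(void* p) const { cudaFree(p); }
};

template <typename T>
using device_ptr = std::unique_ptr<T[], CudaDeleter>;

// Allocation template: allocates n elements of type T on the device.
template <typename T>
device_ptr<T> device_alloc(std::size_t n)
{
    T* raw = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&raw), n * sizeof(T));
    return device_ptr<T>(raw);
}

// usage, e.g. for the three main arrays on the 256 x 256 x 32 mesh:
//   auto C   = device_alloc<float>(256 * 256 * 32);
//   auto C0  = device_alloc<float>(256 * 256 * 32);
//   auto C00 = device_alloc<float>(256 * 256 * 32);
```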

2. Copying array data from the host memory to the device memory. The initial data contained in the three 3D arrays was copied using the cudaMemcpy() function, for example:
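A minimal sketch of such a copy; the pointer names, the single-precision type, and the helper function are assumptions, only the mesh size is taken from the text above:

```cpp
// Sketch of stage 2: copy the initial fields from the host to the device
// buffers allocated in stage 1.
void upload_initial_data(float* d_C, float* d_C0, float* d_C00,
                         const float* h_C, const float* h_C0, const float* h_C00)
{
    const std::size_t bytes = std::size_t(256) * 256 * 32 * sizeof(float);  // 256 x 256 x 32 mesh
    cudaMemcpy(d_C,   h_C,   bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C0,  h_C0,  bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C00, h_C00, bytes, cudaMemcpyHostToDevice);
}
```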
3. Launching the main computational kernel from the host for each time step. The kernel function consisted of lines 4 to 22 of the sequential algorithm illustration in Figure 1.

Here, the kernel function Solve is a template, which allows passing the dimensions of the problem Nx, Ny, Nz to the function directly as compile-time values, and also allows choosing single or double precision calculations via the real type parameter, which can be defined as an alias for either float or double:
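The actual definitions are not reproduced here; a hedged sketch of the real alias, the template kernel, and the 256 × 256 thread launch described in stage 4 below (kernel body and launch geometry are assumptions) is:

```cpp
using real = float;                 // single precision; alternatively: using real = double;

// Template kernel: the mesh dimensions are compile-time constants, so index
// arithmetic and loop bounds can be resolved at compile time.
template <int NX, int NY, int NZ>
__global__ void Solve(real* C, const real* C0, const real* C00)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per (i, j) column
    const int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= NX || j >= NY) return;
    for (int k = 0; k < NZ; ++k) {
        // ... assemble the coefficients and run the TDMA sweep along the vertical ...
    }
}

// Host side: launched once per time step with 256 x 256 threads in total.
void time_step(real* d_C, const real* d_C0, const real* d_C00)
{
    const dim3 block(16, 16);
    const dim3 grid(256 / block.x, 256 / block.y);
    Solve<256, 256, 32><<<grid, block>>>(d_C, d_C0, d_C00);
    cudaDeviceSynchronize();        // host waits for the kernel to finish
}
```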

Throughout the code, the use of constexpr constructs was also maximized in order to pre-compute all constants used, reduce the number of registers involved, and avoid unnecessary memory accesses.
4. Execution of the computational kernel on the device. Parallelization of the algorithm was based on the principle of two-dimensional (2D) data decomposition. The number of threads launched was 256 × 256, i.e. each parallel thread performed independent computations for the (i, j)-vertical column of 32 values. The computational kernel was executed for each time step, with synchronization between the host and the device.
5. Copying the obtained results from the device memory to the host. Copying was carried out using the cudaMemcpy() function; the correct completion of the computational kernel was checked beforehand.
Using the CUDA parallel programming technology to implement the algorithm under consideration makes it possible to obtain a solution to the problem in 16.6 seconds. For comparison, 36.7 seconds is the best result obtained when using all the CPU resources of the server with the other parallel programming technologies considered.

5. Conclusions

The results of computing the solution of a nonstationary 3D problem of admixture transport using a semi-implicit difference scheme on a structured grid with more than 2 million cells showed that the speedup and efficiency

of an OpenMP program are practically not inferior to those of an MPI program. However, parallel implementation of programs using the OpenMP technology is simpler; therefore, when solving computational problems on systems with shared memory, it is preferable to use the OpenMP parallelization technology.
Using the CUDA parallel programming technology on an NVidia RTX 2080 Ti allows obtaining a solution in 16.6 seconds. Using the CUDA technology is advantageous for processing a large number of parallel threads. If a computational algorithm is effectively parallelized using the OpenMP technology, then the same algorithm can likely be effectively parallelized using the CUDA technology.
Using the parallel programming technologies MPI, OpenMP, and CUDA makes it possible to speed up the process of obtaining a solution by about 18 times (MPI and OpenMP) and 38 times (CUDA).

Acknowledgements

The numerical method was developed and implemented using the MPI, OpenACC, and CUDA libraries by A.V. Starchenko, S.A. Prokhanov, and E.A. Danilkin. This work was supported by the Russian Science Foundation under research project No. 19-71-20042.
Parallelization of the numerical method using OpenMP was carried out by D.V. Leshchinskiy. This work was supported by the Ministry of Science and Higher Education of Russia (agreement No. 075-02-2020-1479/1).

References

[1] Ridwan R., Kistijantoro A.I., Kudsy M., Gunawan D. (2015) "Performance evaluation of hybrid parallel computing for WRF model with CUDA and OpenMP." 3rd International Conference on Information and Communication Technology (ICoICT), Nusa Dua: 425-430.
[2] Bermous I., Steinle P. (2015) "Efficient performance of the Met Office Unified Model v8.2 on Intel Xeon partially used nodes." Geoscientific Model Development 8: 769.
[3] Voronin K.V. (2014) "A numerical study of an MPI/OpenMP implementation based on asynchronous threads for a three-dimensional splitting scheme in heat transfer problems." Journal of Applied and Industrial Mathematics 8(3): 436.
[4] Van Leer B. (1974) "Towards the ultimate conservative difference scheme: II. Monotonicity and conservation combined in a second order scheme." Journal of Computational Physics 14: 361.
[5] Marchuk G.I. (1982) Mathematical Modeling for Problems of Environment. Moscow: Science Publishing, 319 p.
[6] Danilkin E.A., Starchenko A.V. (2010) "High Performance Computation for Large Eddy Simulation." Lecture Notes in Computer Science 6083: 163.
[7] Starchenko A.V., Danilkin E.A., Semenova A., Bart A.A. (2016) "Parallel Algorithms for a 3D Photochemical Model of Pollutant Transport in the Atmosphere." Communications in Computer and Information Science 687: 158. DOI: 10.1007/978-3-319-55669-7.
