
Computers and Structures 152 (2015) 113–124


Acceleration of integration in parametric integral equations system using CUDA

Andrzej Kużelewski ⇑, Eugeniusz Zieniuk, Marta Kapturczak
Faculty of Mathematics and Informatics, University of Bialystok, Ciolkowskiego 1M, 15-245 Bialystok, Poland

Article history:
Received 29 September 2014
Accepted 14 February 2015
Available online 7 March 2015

Keywords: PIES; GPU; CUDA; Multithreaded computing; Boundary value problems

Abstract: Methods of solving boundary value problems are expected to achieve highly accurate results in the shortest possible computation time. In previous papers the authors solved boundary value problems with high accuracy using the parametric integral equations system (PIES); however, the computation time was unsatisfactory. The most time-consuming part of the method is the calculation of integrals. In this paper the authors present an approach to accelerating numerical integration in PIES using nVidia CUDA. The speed of integration increased up to 250 times, whereas the high accuracy of the solutions was maintained. The examples included in this paper concern the solving of three-dimensional elasticity problems.

© 2015 Elsevier Ltd. All rights reserved.

⇑ Corresponding author. E-mail address: akuzel@ii.uwb.edu.pl (A. Kużelewski).

http://dx.doi.org/10.1016/j.compstruc.2015.02.019
0045-7949/© 2015 Elsevier Ltd. All rights reserved.

1. Introduction

Recently, a rapid development of techniques for programming graphics processing units (GPUs) for numerical calculations in general-purpose GPU applications has been observed. Owing to the great potential for improving computing performance, scientists have already investigated applications of this technology in applied mathematics [1], hydrodynamics [2], physics [3], biology and medicine [4,5], seismics [6], flows [7,8], and other fields. Many authors have applied GPUs to numerical modelling and the solving of boundary value problems using the boundary element method (BEM) [9,10], the finite element method (FEM) [11,12], the finite difference method (FDM) [13] or meshless methods [14]. The interest of scientists is related to the modern architecture of GPUs (multiprocessor and multithreaded), very fast floating-point arithmetic units, the use of high-speed memory and, nowadays, the ease of programming. At the end of 2006, nVidia produced the first series of GPUs that employed a new parallel computing platform and programming model called Compute Unified Device Architecture (CUDA) [15]. It was a breakthrough in the programming and feasibility of general-purpose applications executed on GPUs. The CUDA C programming language is an extension of C/C++ and simplifies the writing of non-graphical programs compared to graphics-oriented languages such as Cg, GLSL or HLSL.

For many years the authors of this paper have worked on the development and application of the parametric integral equations system (PIES) to solve boundary value problems. PIES has already been used to solve problems modelled by different partial differential equations, such as Laplace's [16,17], Helmholtz [18] or Navier–Lamé [19,20] equations, in 2D and 3D domains. PIES is an analytical modification of the boundary integral equation (BIE). Its remarkable advantage is the direct inclusion of the shape of the boundary of a considered problem in the mathematical formalism [21]. The shape of the boundary in PIES is defined by particular functions widely used in computer graphics: curves (Bézier, B-spline, and others) in the case of 2D problems, and surface patches (Coons, Bézier, and others) in 3D ones. These functions are used instead of the contour integral, as in the case of BIE. Therefore, in practice we need only a small number of control points to define the shape of the boundary. This way of modelling is definitely much easier in comparison to other methods such as the boundary element method (BEM) [22] or the finite element method (FEM) [23]. Moreover, the accuracy of the solutions can be efficiently improved without interfering with the modelling of the shape of the boundary.

The former studies focused on the accuracy and efficiency of PIES. The results were compared with those obtained by other numerical algorithms (FEM or BEM) as well as by analytical methods. These studies confirmed the high effectiveness and accuracy of PIES in solving 2D [16,19] and 3D [17,20,21] boundary value problems. However, the problem of the growing computation time of the method drew the authors' attention: the greater the amount of input data, the longer the algorithm runs. This was noticed for some more complex problems solved by PIES, e.g. those modelled by the Navier–Lamé equations, or when the dimensionality of the problem increases from 2D to 3D. PIES uses the strategy of global numerical

integration on entire surface patches. This integration requires the use of a large number of weight coefficients in the cubature instead of splitting the patches into smaller elements. The number of these coefficients has a significant influence on the run time of the PIES application.

In this paper, the authors examine the application of modern parallel computing solutions to increase the efficiency of integration in PIES. The included examples were solved for the three-dimensional Navier–Lamé equations using a CUDA-enabled nVidia GPU.

2. PIES for three-dimensional equations

PIES for three-dimensional equations was obtained as the result of an analytical modification of BIE. Detailed studies of the methodology of the modification are presented in [16,19] for 2-D steady-state problems, in [24] for 2-D transient problems, and in [17,18] for 3-D problems. Generalization of this methodology to 3-D elasticity problems with a smooth boundary modelled by the Navier–Lamé equations results in the following formula of PIES [20]:

$$0.5\,\mathbf{u}_l(v_1,w_1)=\sum_{j=1}^{n}\int_{v_{j-1}}^{v_j}\!\!\int_{w_{j-1}}^{w_j}\Big\{\mathbf{U}^{*}_{lj}(v_1,w_1,v,w)\,\mathbf{p}_j(v,w)-\mathbf{P}^{*}_{lj}(v_1,w_1,v,w)\,\mathbf{u}_j(v,w)\Big\}J_j(v,w)\,dv\,dw, \qquad (1)$$

where $v_{l-1}\le v_1\le v_l$, $w_{l-1}\le w_1\le w_l$, $v_{j-1}\le v\le v_j$, $w_{j-1}\le w\le w_j$, $\{l,j\}=1,2,\ldots,n$, $n$ is the number of parametric patches that create the shape of the boundary in 3D, whilst the function $J_j(v,w)$ is the Jacobian.

The integrands $\mathbf{U}^{*}_{lj}(v_1,w_1,v,w)$ and $\mathbf{P}^{*}_{lj}(v_1,w_1,v,w)$ in (1) are presented in the following matrix form:

$$\mathbf{U}^{*}_{lj}(v_1,w_1,v,w)=\frac{1}{16\pi(1-\nu)\mu\,\eta}\begin{bmatrix}U_{11}&U_{12}&U_{13}\\U_{21}&U_{22}&U_{23}\\U_{31}&U_{32}&U_{33}\end{bmatrix},\qquad \mu=\frac{E}{2(1+\nu)}, \qquad (2)$$

$$\mathbf{P}^{*}_{lj}(v_1,w_1,v,w)=\frac{1}{8\pi(1-\nu)\eta^{2}}\begin{bmatrix}P_{11}&P_{12}&P_{13}\\P_{21}&P_{22}&P_{23}\\P_{31}&P_{32}&P_{33}\end{bmatrix}. \qquad (3)$$

The individual elements of matrix (2) in an explicit form are presented as follows:

$$U_{11}=(3-4\nu)+\frac{\eta_1^2}{\eta^2},\quad U_{12}=\frac{\eta_1\eta_2}{\eta^2},\quad U_{13}=\frac{\eta_1\eta_3}{\eta^2},\quad U_{21}=\frac{\eta_2\eta_1}{\eta^2},\quad U_{22}=(3-4\nu)+\frac{\eta_2^2}{\eta^2},$$
$$U_{23}=\frac{\eta_2\eta_3}{\eta^2},\quad U_{31}=\frac{\eta_3\eta_1}{\eta^2},\quad U_{32}=\frac{\eta_3\eta_2}{\eta^2},\quad U_{33}=(3-4\nu)+\frac{\eta_3^2}{\eta^2},$$

while in matrix (3):

$$P_{11}=\Big\{(1-2\nu)+3\frac{\eta_1^2}{\eta^2}\Big\}\frac{\partial\eta}{\partial n},\qquad P_{12}=3\frac{\eta_1\eta_2}{\eta^2}\frac{\partial\eta}{\partial n}-(1-2\nu)\frac{\eta_1 n_2-\eta_2 n_1}{\eta},$$
$$P_{13}=3\frac{\eta_1\eta_3}{\eta^2}\frac{\partial\eta}{\partial n}-(1-2\nu)\frac{\eta_1 n_3-\eta_3 n_1}{\eta},\qquad P_{21}=3\frac{\eta_2\eta_1}{\eta^2}\frac{\partial\eta}{\partial n}-(1-2\nu)\frac{\eta_2 n_1-\eta_1 n_2}{\eta},$$
$$P_{22}=\Big\{(1-2\nu)+3\frac{\eta_2^2}{\eta^2}\Big\}\frac{\partial\eta}{\partial n},\qquad P_{23}=3\frac{\eta_2\eta_3}{\eta^2}\frac{\partial\eta}{\partial n}-(1-2\nu)\frac{\eta_2 n_3-\eta_3 n_2}{\eta},$$
$$P_{31}=3\frac{\eta_3\eta_1}{\eta^2}\frac{\partial\eta}{\partial n}-(1-2\nu)\frac{\eta_3 n_1-\eta_1 n_3}{\eta},\qquad P_{32}=3\frac{\eta_3\eta_2}{\eta^2}\frac{\partial\eta}{\partial n}-(1-2\nu)\frac{\eta_3 n_2-\eta_2 n_3}{\eta},$$
$$P_{33}=\Big\{(1-2\nu)+3\frac{\eta_3^2}{\eta^2}\Big\}\frac{\partial\eta}{\partial n},$$

wherein $n_1\equiv n_1(v,w)$, $n_2\equiv n_2(v,w)$, $n_3\equiv n_3(v,w)$ are the components of $\mathbf{n}_j$ – the normal vector to the surface $j$. The parameters $\nu$ and $E$ are material constants: Poisson's ratio and Young's modulus, respectively.

The integrands (2) and (3) include in their mathematical formalism the shape of a closed boundary. It is created using appropriate relationships between the patches ($\{l,j\}=1,2,\ldots,n$), which are defined in the Cartesian coordinate system as follows:

$$\eta_1=P_j^{(1)}(v,w)-P_l^{(1)}(v_1,w_1),\quad \eta_2=P_j^{(2)}(v,w)-P_l^{(2)}(v_1,w_1),\quad \eta_3=P_j^{(3)}(v,w)-P_l^{(3)}(v_1,w_1),\quad \eta=\sqrt{\eta_1^2+\eta_2^2+\eta_3^2}, \qquad (4)$$

where $P_j^{(1)},P_j^{(2)},P_j^{(3)}$ are the scalar components of the vector surface

$$\mathbf{P}_j(v,w)=\big[P_j^{(1)}(v,w),\,P_j^{(2)}(v,w),\,P_j^{(3)}(v,w)\big]^{T},$$

which depends on $v,w$.

In 3D problems modelled using PIES, the vector functions $\mathbf{P}_j(v,w)$ have the form of parametric surface patches widely applied in computer graphics. The main advantage of the presented approach is the analytical inclusion of the boundary directly in PIES, which results in a lack of discretization. In traditional BIE the boundary is defined by a boundary (contour) integral; therefore, discretization of the boundary into elements is required, as in the case of BEM. The advantages of including the boundary directly in the mathematical equations (PIES) were widely presented in [16,19,24] for 2-D and in [17,20] for 3-D problems.

2.1. Approximation of the boundary functions over the surface patches

The application of PIES to solving 2D and 3D problems allows the elimination of the discretization of the boundary, as well as of the boundary functions. In the previous studies, the boundary functions had the form of approximating series with Chebyshev polynomials as base functions [20]. Unlike the problems modelled by Laplace's equation, the ones described by the Navier–Lamé equations require boundary functions in a vector form. The approximating series represent the scalar components of the vectors of displacements $\mathbf{u}_j(v,w)$ and stresses $\mathbf{p}_j(v,w)$ for each surface patch $j$, and they are presented as follows:

$$\mathbf{u}_j(v,w)=\sum_{p=0}^{N-1}\sum_{r=0}^{M-1}\mathbf{u}_j^{(pr)}L_j^{(p)}(v)L_j^{(r)}(w), \qquad (5)$$

$$\mathbf{p}_j(v,w)=\sum_{p=0}^{N-1}\sum_{r=0}^{M-1}\mathbf{p}_j^{(pr)}L_j^{(p)}(v)L_j^{(r)}(w), \qquad (6)$$

where $\mathbf{u}_j^{(pr)}$, $\mathbf{p}_j^{(pr)}$ are unknown coefficients at the collocation points, whereas the base functions $L_j^{(p)}(v)$, $L_j^{(r)}(w)$ have the form of Lagrange polynomials:
$$L_j^{(p)}(v)=\frac{(v-v_0)(v-v_1)\cdots(v-v_{p-1})(v-v_{p+1})\cdots(v-v_{N-1})}{(v_p-v_0)(v_p-v_1)\cdots(v_p-v_{p-1})(v_p-v_{p+1})\cdots(v_p-v_{N-1})},$$
$$L_j^{(r)}(w)=\frac{(w-w_0)(w-w_1)\cdots(w-w_{r-1})(w-w_{r+1})\cdots(w-w_{M-1})}{(w_r-w_0)(w_r-w_1)\cdots(w_r-w_{r-1})(w_r-w_{r+1})\cdots(w_r-w_{M-1})}, \qquad (7)$$

where $N$ is the number of terms in the approximating series (5) and (6) in the direction of coordinate axis $v$, while $M$ is the number of terms in the approximating series (5) and (6) in the direction of coordinate axis $w$ on the surface patch $j$.

After substituting (5) and (6) into (1), the following expression is obtained:

$$0.5\sum_{p=0}^{N-1}\sum_{r=0}^{M-1}\mathbf{u}_l^{(pr)}L_l^{(p)}(v_1)L_l^{(r)}(w_1)=\sum_{j=1}^{n}\sum_{p=0}^{N-1}\sum_{r=0}^{M-1}\Bigg\{\mathbf{p}_j^{(pr)}\int_{v_{j-1}}^{v_j}\!\!\int_{w_{j-1}}^{w_j}\mathbf{U}^{*}_{lj}(v_1,w_1,v,w)L_j^{(p)}(v)L_j^{(r)}(w)J_j(v,w)\,dv\,dw$$
$$-\,\mathbf{u}_j^{(pr)}\int_{v_{j-1}}^{v_j}\!\!\int_{w_{j-1}}^{w_j}\mathbf{P}^{*}_{lj}(v_1,w_1,v,w)L_j^{(p)}(v)L_j^{(r)}(w)J_j(v,w)\,dv\,dw\Bigg\}, \qquad (8)$$

where $l=1,2,\ldots,n$.

2.2. Numerical solution of PIES

Similarly to the previous research [17,20,21], the pseudospectral method [25] was applied to the numerical solving of PIES. An algebraic version of PIES is obtained by applying collocation points ($\bar n_j = M \times N$) in the domain of the individual patches defining the boundary to (8). It can be presented in the matrix form:

$$[\mathbf{H}]\{\mathbf{u}\}=[\mathbf{G}]\{\mathbf{p}\}. \qquad (9)$$

The submatrices $[h_{lj}]$ and $[g_{lj}]$ of the matrices $\mathbf{H}$ and $\mathbf{G}$ are calculated using the following expressions:

$$h_{lj}=0.5\,\delta_{lj}\sum_{p=0}^{N-1}\sum_{r=0}^{M-1}L_l^{(p)}\big(v_1^{(t)}\big)L_l^{(r)}\big(w_1^{(t)}\big)+\sum_{p=0}^{N-1}\sum_{r=0}^{M-1}\int_{v_{j-1}}^{v_j}\!\!\int_{w_{j-1}}^{w_j}\mathbf{P}^{*}_{lj}\big(v_1^{(t)},w_1^{(t)},v,w\big)L_j^{(p)}(v)L_j^{(r)}(w)J_j(v,w)\,dv\,dw, \qquad (10)$$

$$g_{lj}=\sum_{p=0}^{N-1}\sum_{r=0}^{M-1}\int_{v_{j-1}}^{v_j}\!\!\int_{w_{j-1}}^{w_j}\mathbf{U}^{*}_{lj}\big(v_1^{(t)},w_1^{(t)},v,w\big)L_j^{(p)}(v)L_j^{(r)}(w)J_j(v,w)\,dv\,dw, \qquad (11)$$

where $\{l,j\}=1,2,\ldots,n$, $n$ is the number of surface patches, $\big(v_1^{(t)},w_1^{(t)}\big)$ are the coordinates of the $t$th collocation point, $t=1,2,\ldots,K$, $K=\sum_{j=1}^{n}\bar n_j$ is the number of collocation points, and $v_{j-1}<v_1^{(t)}<v_j$, $w_{j-1}<w_1^{(t)}<w_j$.

The integrands $\mathbf{U}^{*}_{lj}\big(v_1^{(t)},w_1^{(t)},v,w\big)$ and $\mathbf{P}^{*}_{lj}\big(v_1^{(t)},w_1^{(t)},v,w\big)$ are given as (2) and (3), respectively.

Finally, the algebraic version of PIES (9) is transformed into a system of algebraic equations $[\mathbf{A}]\{\mathbf{x}\}=\{\mathbf{b}\}$. To solve the system, Gaussian elimination with partial pivoting and iterative refinement is used. The solutions on the boundary are obtained after solving the system.

The way of creating and solving the serial version of PIES is shown in Algorithm 1. The algorithm proceeds as follows. The integrals from (10) and (11) are calculated using Gauss–Legendre cubature (the singular_integration() and regular_integration() procedures are described in Section 2.3). Next, the submatrices [g_lj] and [h_lj] are found and inserted into G and H. All computations are performed repeatedly between the respective patches l and j (for loops in lines 1 and 2). The next step is the transformation of PIES (9) into a system of algebraic equations (line 12) and solving the system (line 13). Finally, the solutions on the boundary are obtained (line 14).

Algorithm 1. Creating and solving PIES
Require:
  n – the number of parametric patches that create the boundary
1: for 1 ≤ l ≤ n do
2:   for 1 ≤ j ≤ n do
3:     // start of calculation of [g_lj], [h_lj] from Eqs. (10) and (11)
4:     if l == j then
5:       singular_integration()
6:     else
7:       regular_integration()
8:     end if
9:     insert submatrix [g_lj] into G and [h_lj] into H
10:   end for
11: end for
12: transform [H]{u} = [G]{p} into [A]{x} = {b}
13: solve [A]{x} = {b}
14: solutions on the boundary {x} found

2.3. Numerical calculation of integrals

The numerical solving of PIES requires the calculation of the surface integrals included in the mathematical formalism of PIES. In the present study, we applied Gauss–Legendre cubature (similarly to BEM [26]):

$$\int_{-1}^{1}\int_{-1}^{1}f(\xi_1,\xi_2)\,d\xi_1\,d\xi_2\simeq\sum_{g=1}^{G_g}\sum_{h=1}^{G_h}f(\xi_g,\xi_h)\,\omega_g\,\omega_h, \qquad (12)$$

where $\xi_g,\xi_h$ are the cubature nodes, $\omega_g,\omega_h$ are the cubature weight coefficients, whilst $G_g\times G_h$ is the number of cubature nodes.

Surface patches in PIES are defined in the local coordinate system $(v,w)$ [21]. Hence, all points in the domain of the Gauss–Legendre cubature ($-1\le\xi_1\le 1$, $-1\le\xi_2\le 1$) are mapped into the domain of the $(v,w)$ coordinate system ($0\le v\le 1$, $0\le w\le 1$), as presented in Fig. 1.

During the calculation of the integrals in PIES, the collocation points and cubature nodes may lie within different ($l \ne j$) or the same ($l = j$) surface. In the case $l \ne j$, regular integrals are calculated; therefore, the Gauss–Legendre cubature (12) is directly applied in PIES. In the case $l = j$ the collocation points become singular and the calculation of singular integrals is required.

In the case of BEM, the following strategies for the calculation of singular integrals were proposed [27]:
- determining the value of the integrals in an analytical way,
- use of cubatures dedicated to singular integrals,
- regularization of the integrals,
- calculation of the integrals in polar coordinates,
- isolation of the singular point by subdividing the area of integration into smaller components and integration using cubatures for regular integrals.
Fig. 1. Scheme of mapping the Gauss–Legendre cubature domain into a surface patch in PIES: (a) domain of the Gauss–Legendre cubature, (b) domain of the local coordinate system (v, w), and (c) an example of a surface patch.

Similar strategies may be applied in PIES. In the paper [21] the authors used the isolation-of-singular-point strategy to calculate the integrals in PIES. The domain in the local coordinate system (v, w) was divided into four parts with respect to the singular point (presented in Fig. 2). Then the cubature (12) was applied separately in all parts of the domain.

The way of calculating regular integrals in PIES is shown in Algorithm 2. The algorithm proceeds as follows. The integrals from (10) and (11) are calculated using Gauss–Legendre cubature (lines 3 to 10). In order to find the integrands (2) and (3), the variables in lines 5 and 7 are calculated for each cubature node (in line 7 also for each collocation point). After the calculation of the sums in (10) and (11) (for loops in lines 1 and 2), the submatrices [g_lj] and [h_lj] are found.

Algorithm 2. Procedure regular_integration()
Require:
  N – the number of terms in the approximating series (5) and (6) in the direction of coordinate axis v
  M – the number of terms in the approximating series (5) and (6) in the direction of coordinate axis w
  K – the number of collocation points
  Gv – the number of cubature nodes in the direction of coordinate axis v
  Gw – the number of cubature nodes in the direction of coordinate axis w
1: for 0 ≤ p ≤ N−1 do
2:   for 0 ≤ r ≤ M−1 do
3:     for 1 ≤ kv ≤ Gv do
4:       for 1 ≤ kw ≤ Gw do
5:         compute: P_j^(1)(v_(kv), w_(kw)), P_j^(2)(v_(kv), w_(kw)), P_j^(3)(v_(kv), w_(kw)), J_j(v_(kv), w_(kw)), L_j^(p)(v_(kv)), L_j^(r)(w_(kw))
6:         for 1 ≤ t ≤ K do
7:           compute: P_l^(1)(v_1^(t), w_1^(t)), P_l^(2)(v_1^(t), w_1^(t)), P_l^(3)(v_1^(t), w_1^(t)), η1, η2, η3, η, U11, ..., U33, P11, ..., P33, U*_lj(v_1^(t), w_1^(t), v_(kv), w_(kw)), P*_lj(v_1^(t), w_1^(t), v_(kv), w_(kw))
8:         end for
9:       end for
10:     end for
11:   end for
12: end for
13: [g_lj] and [h_lj] found

Singular integrals are calculated using the procedure of regular integration (presented in Algorithm 3). This procedure is preceded by mapping the appropriate part A, B, C or D of the surface patch (presented in Fig. 2) into the (v, w) coordinate system (line 4). Next, the submatrices [g_lj]_i and [h_lj]_i for the ith part of the surface patch are calculated (line 6). Finally, the submatrices [g_lj] and [h_lj] are obtained as the result of addition (lines 7 and 8).

Algorithm 3. Procedure singular_integration()
1: [g_lj] = [0]
2: [h_lj] = [0]
3: for 0 ≤ i ≤ 3 do
4:   transform the ith part of the surface into the (v, w) coordinate system
5:   regular_integration()
6:   [g_lj]_i and [h_lj]_i found
7:   [g_lj] = [g_lj] + [g_lj]_i
8:   [h_lj] = [h_lj] + [h_lj]_i
9: end for
10: [g_lj] and [h_lj] found

2.4. Solutions in the domain

After solving PIES, only the solutions on the boundary are obtained. They are represented by the approximating series (5) or (6). In order to obtain solutions in the domain, an analytical modification of the integral identity from BIE is required. The new integral identity is formulated in the same way as for two-dimensional problems [16,19]. It uses the solutions on the boundary p_j(v, w) and u_j(v, w). The identity takes the following form [20]:

$$\mathbf{u}(\mathbf{x})=\sum_{j=1}^{n}\int_{v_{j-1}}^{v_j}\!\!\int_{w_{j-1}}^{w_j}\Big\{\widehat{\mathbf{U}}_j(\mathbf{x},v,w)\,\mathbf{p}_j(v,w)-\widehat{\mathbf{P}}_j(\mathbf{x},v,w)\,\mathbf{u}_j(v,w)\Big\}J_j(v,w)\,dv\,dw. \qquad (13)$$

The integrands in identity (13) are presented in the following form:

$$\widehat{\mathbf{U}}_j(\mathbf{x},v,w)=\frac{1}{16\pi(1-\nu)\mu\,r}\begin{bmatrix}\widehat U_{11}&\widehat U_{12}&\widehat U_{13}\\\widehat U_{21}&\widehat U_{22}&\widehat U_{23}\\\widehat U_{31}&\widehat U_{32}&\widehat U_{33}\end{bmatrix},\qquad \mu=\frac{E}{2(1+\nu)}, \qquad (14)$$

$$\widehat{\mathbf{P}}_j(\mathbf{x},v,w)=\frac{1}{8\pi(1-\nu)r^{2}}\begin{bmatrix}\widehat P_{11}&\widehat P_{12}&\widehat P_{13}\\\widehat P_{21}&\widehat P_{22}&\widehat P_{23}\\\widehat P_{31}&\widehat P_{32}&\widehat P_{33}\end{bmatrix}. \qquad (15)$$

The individual elements of matrix (14) in the explicit form are presented as follows:
Fig. 2. Isolation of the singular point (marked by the dot at the intersection of the lines) for a rectangular surface by subdivision into four parts.

$$\widehat U_{11}=(3-4\nu)+\frac{r_1^2}{r^2},\quad \widehat U_{12}=\frac{r_1 r_2}{r^2},\quad \widehat U_{13}=\frac{r_1 r_3}{r^2},\quad \widehat U_{21}=\frac{r_2 r_1}{r^2},\quad \widehat U_{22}=(3-4\nu)+\frac{r_2^2}{r^2},$$
$$\widehat U_{23}=\frac{r_2 r_3}{r^2},\quad \widehat U_{31}=\frac{r_3 r_1}{r^2},\quad \widehat U_{32}=\frac{r_3 r_2}{r^2},\quad \widehat U_{33}=(3-4\nu)+\frac{r_3^2}{r^2},$$

while in matrix (15):

$$\widehat P_{11}=\Big\{(1-2\nu)+3\frac{r_1^2}{r^2}\Big\}\frac{\partial r}{\partial n},\qquad \widehat P_{12}=3\frac{r_1 r_2}{r^2}\frac{\partial r}{\partial n}-(1-2\nu)\frac{r_1 n_2-r_2 n_1}{r},$$
$$\widehat P_{13}=3\frac{r_1 r_3}{r^2}\frac{\partial r}{\partial n}-(1-2\nu)\frac{r_1 n_3-r_3 n_1}{r},\qquad \widehat P_{21}=3\frac{r_2 r_1}{r^2}\frac{\partial r}{\partial n}-(1-2\nu)\frac{r_2 n_1-r_1 n_2}{r},$$
$$\widehat P_{22}=\Big\{(1-2\nu)+3\frac{r_2^2}{r^2}\Big\}\frac{\partial r}{\partial n},\qquad \widehat P_{23}=3\frac{r_2 r_3}{r^2}\frac{\partial r}{\partial n}-(1-2\nu)\frac{r_2 n_3-r_3 n_2}{r},$$
$$\widehat P_{31}=3\frac{r_3 r_1}{r^2}\frac{\partial r}{\partial n}-(1-2\nu)\frac{r_3 n_1-r_1 n_3}{r},\qquad \widehat P_{32}=3\frac{r_3 r_2}{r^2}\frac{\partial r}{\partial n}-(1-2\nu)\frac{r_3 n_2-r_2 n_3}{r},$$
$$\widehat P_{33}=\Big\{(1-2\nu)+3\frac{r_3^2}{r^2}\Big\}\frac{\partial r}{\partial n},$$

where

$$r_1=P_j^{(1)}(v,w)-x_1,\quad r_2=P_j^{(2)}(v,w)-x_2,\quad r_3=P_j^{(3)}(v,w)-x_3,\quad r=\sqrt{r_1^2+r_2^2+r_3^2}. \qquad (16)$$

The integrands (14) and (15) in identity (13) are very similar to the functions (2) and (3). The main difference is that the functions (14) and (15) additionally require the coordinates of the points $\mathbf{x}\equiv\{x_1,x_2,x_3\}$ in the domain where the solutions are searched.

The way of finding solutions in the domain in PIES is shown in Algorithm 4. The algorithm proceeds as follows. The integrals from (13) are calculated using the Gauss–Legendre cubature (12) (lines 3 to 10). In order to find the integrands (14) and (15), the variables in line 5 are calculated for each cubature node. After the calculation of the sum in (12) (for loop in line 2), the solution at a particular searching point in the domain is found. All computations are performed repeatedly for all searching points (for loop in line 1). Finally, the solutions in the domain are obtained (line 15).

Algorithm 4. Finding solutions in the domain
Require:
  n – the number of parametric patches that create the boundary
  N – the number of points where solutions are searched
  K – the number of collocation points
  Gv – the number of cubature nodes in the direction of coordinate axis v
  Gw – the number of cubature nodes in the direction of coordinate axis w
  x – solutions on the boundary
1: for 1 ≤ l ≤ N do
2:   for 1 ≤ j ≤ n do
3:     for 1 ≤ kv ≤ Gv do
4:       for 1 ≤ kw ≤ Gw do
5:         compute: P_j^(1)(v_(kv), w_(kw)), P_j^(2)(v_(kv), w_(kw)), P_j^(3)(v_(kv), w_(kw)), J_j(v_(kv), w_(kw)), L_j^(p)(v_(kv)), L_j^(r)(w_(kw)), r1, r2, r3, r, Û11, ..., Û33, P̂11, ..., P̂33
6:         for 1 ≤ t ≤ K do
7:           Û_j(x, v_(kv), w_(kw)), P̂_j(x, v_(kv), w_(kw))
8:         end for
9:       end for
10:     end for
11:   end for
12:   result u_l at the lth solution point is found
13:   insert u_l into u
14: end for
15: solutions u at all points in the domain are found
Similar problems and advantages to the ones found during the creating and solving of PIES (Algorithm 1) occur here. The calculation of the integrands (14) and (15) in expression (13) might be performed concurrently as well.

3. Numerical implementation of integration on GPU

The architecture of graphics processing units (GPUs) significantly differs from that of central processing units (CPUs). A GPU is composed of multiple floating-point units (FPUs) and arithmetic and logic units (ALUs). This is connected with the nature of the performed calculations – the same operations are executed in parallel on large amounts of data. Using a lot of pixels, texels or vertices is typical in graphics applications; therefore GPUs are classed as SIMT (single instruction, multiple thread) devices.

CUDA-enabled nVidia GPUs are multithreaded, parallel, many-core processors. They are composed of a set of multithreaded streaming multiprocessors (SMs). Each multiprocessor contains a number of general-purpose streaming processor cores (SPs), a multithreaded instruction unit and on-chip shared memory. It also comprises a number of 32-bit registers, a texture cache and a read-only constant cache. SPs can compute one arithmetic multiply or add operation per clock cycle; in the case of a multiply-and-add operation this is equivalent to two operations per clock cycle. The execution of other instructions requires a greater number of clock cycles (e.g. a square root is computed in 10 clock cycles).

The GPU is directly connected to a read–write off-chip device memory (DRAM). It can store a large amount of data depending on the applied hardware (hundreds of megabytes to a few gigabytes). The CPU exchanges data with the GPU via the host and device memory. The main disadvantage of device memory is its latency: an SM requires about 400–600 clock cycles to access DRAM, whilst an SP takes only 4 clock cycles to access registers or shared memory. More detailed studies are presented in [28].

The GPU works in close connection with the CPU – it operates as an additional processor attached to the CPU. The GPU performs operations assigned to it by the application, which must start running on the CPU. Only the part of the original program which requires parallelization should be recoded. The serial part of a CUDA application code runs on the host (CPU) and the parallel part on the CUDA device (GPU). The set of all functions performed on the host is called the host program, while the functions performed on the device are called kernels. The host program is responsible for the initiation and the transfer of data to/from device memory. During the execution of a program the host calls the kernels as well. Data flow in CUDA is divided into four basic steps: 1. initiation of the program (host), 2. copying data from host to device, 3. calculations on the GPU, 4. copying data from device to host.

The PIES algorithm is not well-suited to GPU implementation in a direct form. The most important problem is connected with loading the needed data for each variable in the integration procedures. In the case of a GPU implementation, this entails the necessity of continuous data exchange between the host and device (global) memory. In this case the global memory accesses are not coalesced, that is, the memory accesses are not a single transaction. Also, a part of the computations is repeatedly executed in some iterations.

Expressions (10) and (11) indicate that the number of elements in the matrices H and G is strictly connected with the number of patches defining the shape of the boundary and the number of terms in the approximating series (5) and (6). The elements of the matrices are obtained as a result of numerical integration; therefore an efficient process of integration is very important.

All integrals in PIES are numerically calculated using the same Gauss–Legendre cubature with different values of the integrands. This perfectly fits the SIMT paradigm – the same instructions during the calculation of the cubature can be processed concurrently on a large amount of data. Hence, the calculation of the integrands in expressions (10) and (11) is made for all collocation points on the boundary and it may be performed concurrently. The integration has become definitely the most time-consuming operation; therefore in this study we decided to accelerate the calculation of the integrals using the GPU.

The numerical implementation of PIES in the CUDA C language was named GPU-accelerated PIES. To address the challenges, we propose to perform the set-up stage directly on the GPU (shown in Algorithm 5).

Algorithm 5. Set-up stage of GPU-accelerated PIES
Require:
  send to GPU memory:
  vc, wc – coordinates of all control points (matrix form)
  v1, w1 – coordinates of all collocation points (matrix form)
  n – the number of parametric patches that create the boundary
  N – the number of terms in the approximating series (5) and (6) in the direction of coordinate axis v
  M – the number of terms in the approximating series (5) and (6) in the direction of coordinate axis w
  K – the number of collocation points
  Gv – the number of cubature nodes in the direction of coordinate axis v
  Gw – the number of cubature nodes in the direction of coordinate axis w
1: for 1 ≤ j ≤ n do
2:   computeL<<<blks, thrs>>>(vc, wc, n, Gv, Gw, N, M, L_j^(p), L_j^(r))
     // data for regular integration
3:   computeP<<<blks, thrs>>>(vc, wc, n, Gv, Gw, K, P_j^(1), P_j^(2), P_j^(3))
4:   compute_n<<<blks, thrs>>>(vc, wc, n, Gv, Gw, n1, n2, n3, n)
     // data for singular integration
5:   computePs<<<blks, thrs>>>(vc, wc, n, Gv, Gw, K, Ps_j^(1), Ps_j^(2), Ps_j^(3))
6:   compute_ns<<<blks, thrs>>>(vc, wc, n, Gv, Gw, ns1, ns2, ns3, ns)
7: end for
8: for 1 ≤ l ≤ K do
9:   computePl<<<blks, thrs>>>(v1, w1, n, Gv, Gw, K, P_l^(1), P_l^(2), P_l^(3))
10: end for

In the set-up stage the appropriate input data are sent to the GPU. Next, kernels compute matrices of the particular variables (the output matrices are the last arguments of each kernel call). In the case of singular integration, the calculations take into account the applied strategy of isolation of the singular point, which entails a greater size of the matrices. All matrices are calculated once. Later, they are used directly in creating and solving GPU-accelerated PIES (Algorithm 6) and during the finding of solutions in the domain (Algorithm 7). The elements of each matrix are calculated concurrently. The variables blks and thrs denote, respectively, the number of blocks and threads. The meaning of a thread on the GPU is not the same as on the CPU: a GPU thread is a basic element of the data to be processed. Threads should be run in groups of at least 32 (known as a warp) for the best performance.
The total number of threads is numbered in the thousands. A block in CUDA contains 64–512 threads. Blocks allow for the direct manipulation of a number of threads, which makes programming the GPU easier.

The way of creating and solving the GPU-accelerated version of PIES is shown in Algorithm 6.

Algorithm 6. Creating and solving GPU-accelerated PIES
Require:
  n – the number of parametric patches that create the boundary
  K – the number of collocation points
  Gv – the number of cubature nodes in the direction of coordinate axis v
  Gw – the number of cubature nodes in the direction of coordinate axis w
  P_j^(1), P_j^(2), P_j^(3), Ps_j^(1), Ps_j^(2), Ps_j^(3), P_l^(1), P_l^(2), P_l^(3), n1, n2, n3, n, ns1, ns2, ns3, ns, L_j^(p), L_j^(r)
1: for 1 ≤ l ≤ n do
2:   for 1 ≤ j ≤ n do
3:     if l == j then
4:       compute_g<<<blks, thrs>>>(Ps_j^(1), Ps_j^(2), Ps_j^(3), P_l^(1), P_l^(2), P_l^(3), Gv, Gw, K, η1, η2, η3, η)
5:       compute_glj<<<blks, thrs>>>(η1, η2, η3, η, ns1, ns2, ns3, ns, L_j^(p), L_j^(r), Gv, Gw, K, g_lj)
6:       compute_hlj<<<blks, thrs>>>(η1, η2, η3, η, ns1, ns2, ns3, ns, L_j^(p), L_j^(r), Gv, Gw, K, h_lj)
7:     else
8:       compute_g<<<blks, thrs>>>(P_j^(1), P_j^(2), P_j^(3), P_l^(1), P_l^(2), P_l^(3), Gv, Gw, K, η1, η2, η3, η)
9:       compute_glj<<<blks, thrs>>>(η1, η2, η3, η, n1, n2, n3, n, L_j^(p), L_j^(r), Gv, Gw, K, g_lj)
10:      compute_hlj<<<blks, thrs>>>(η1, η2, η3, η, n1, n2, n3, n, L_j^(p), L_j^(r), Gv, Gw, K, h_lj)
11:     end if
12:     insert [g_lj] into G and [h_lj] into H
13:   end for
14: end for
15: copy G and H to CPU memory
16: transform [H]{u} = [G]{p} into [A]{x} = {b}
17: solve [A]{x} = {b}
18: solutions on the boundary {x} are found

The algorithm proceeds as follows. All relationships (4) between the appropriate patches are computed concurrently by the kernels in lines 4 and 8. The next kernels (lines 5, 6, 9 and 10) compute the regular and singular integrals and insert the results into the submatrices [g_lj] and [h_lj]. All the computations are performed repeatedly between the respective patches l and j (for loops in lines 1 and 2). The next part of the algorithm is executed on the CPU, due to the weak susceptibility of Gaussian elimination to parallelization. The matrices G and H are copied to CPU memory. The transformation of PIES (9) into [A]{x} = {b} and the solving of the system proceed in the same way as in the serial version of PIES. Finally, the solutions on the boundary {x} are obtained.

To find solutions in the domain, the vector x is sent to GPU memory and Algorithm 7 is executed.

Algorithm 7. Finding solutions in the domain using GPU-accelerated PIES
Require:
  n – the number of parametric patches that create the boundary
  N – the number of points where solutions are searched
  K – the number of collocation points
  Gv – the number of cubature nodes in the direction of coordinate axis v
  Gw – the number of cubature nodes in the direction of coordinate axis w
  P_j^(1), P_j^(2), P_j^(3), J_j, n1, n2, n3, n, L_j^(p), L_j^(r)
  x – solutions on the boundary
  p^(1), p^(2), p^(3) – coordinates of the points where solutions are searched
1: for 1 ≤ l ≤ N do
2:   for 1 ≤ j ≤ n do
3:     compute_r<<<blks, thrs>>>(P_j^(1), P_j^(2), P_j^(3), p^(1), p^(2), p^(3), Gv, Gw, K, r1, r2, r3, r)
4:     compute_u<<<blks, thrs>>>(r1, r2, r3, r, n1, n2, n3, n, L_j^(p), L_j^(r), x, Gv, Gw, K, u_l)
5:   end for
6:   insert u_l into u
7: end for
8: copy u to CPU memory
9: all solutions u in the domain are found

All relationships between the appropriate patch and the points of solutions are computed concurrently by the kernel in line 3. The next kernel (line 4) computes the integrals from (13). The kernels are performed repeatedly for all patches (for loop in line 2), and the solution u_l at the lth point in the domain is obtained. All computations are performed repeatedly for the points where solutions are searched (for loop in line 1). Finally, the obtained solutions in the domain u are copied to CPU memory and written to an output file.

4. Numerical examples

An nVidia GeForce GTX 460 working at 1.55 GHz with 1 GB of 256-bit GDDR5 memory (7 streaming multiprocessors, each composed of 48 CUDA cores; peak performance 1045.6 GFlops) and two Intel Xeon E5507 CPUs (4 cores, 4 threads, 2.26 GHz, 4 MB cache memory, peak performance 36.26 GFlops) were used during the tests. The GPU supports both single-precision and double-precision floating-point operations.

The serial version of the PIES program was compiled using g++ 4.4.3 with standard settings, whilst GPU-accelerated PIES was compiled by the nvcc V0.2.1221 CUDA 5.0 release, also with standard settings. The numerical tests were carried out on a 64-bit Ubuntu Linux operating system (kernel 2.6.37).

From the efficiency point of view, Gauss–Legendre cubature with 32 or 64 weight coefficients in each direction of a surface patch was applied in the presented examples. The number of coefficients is connected with the warp size.

4.1. GPU-accelerated PIES compared with the serial version of PIES

The testing example concerned the shape of the boundary shown in Fig. 3a. The problem was modelled by the Navier–Lamé equations. The boundary conditions were analytical functions obtained from an exact solution of the following equations:
120 A. Kuz_ elewski et al. / Computers and Structures 152 (2015) 113–124

(a) (b)

Fig. 3. (a) The shape of considered problem and (b) modelling of the shape of boundary using corner and control points (not shown for the figure clarity).

u1 ¼ 0:5  ð2x1 þ x2 þ x3 Þ; CUDA (32x32 cubature


weight coefficients)
u2 ¼ 0:5  ðx1 þ 2x2 þ x3 Þ; ð17Þ serial (32x32 cubature
Computation time [ms]

weight coefficients)
u3 ¼ 0:5  ðx1 þ x2 þ 2x3 Þ: 104 CUDA (64x64 cubature
weight coefficients)
The values of material constants selected for the calculation serial (64x64 cubature
weight coefficients)
were Young’s modulus E ¼ 1 MPa and Poisson’s ratio m ¼ 0:3.
103
Comparative tests were carried out for the problem with different
number of all collocation points, but the same number of colloca-
tion points in domain of each patch. We considered the possibility
of using 16, 25 or 36 collocation points on each patch in the tests. 102
Influence of the number of cubature weight coefficients on the
results was considered, as well.
The shape of the boundary was modelled by 6 rectangular bicu- 16 25 36
bic Bézier surfaces (curved parts of the boundary) and 10 flat rect- Number of collocation points on patch
angular bilinear Coons surfaces (presented in Fig. 3b). A complete
Fig. 5. Computation time of singular integrals in GPU-accelerated PIES and the
declaration of the boundary defined in PIES by 16 surfaces required serial version of PIES.
80 corner and control points. Some of them are shown in Fig. 3b.
On each surface patch we have defined the same number of collo-
cation points (16, 25 or 36) and finally have solved the system of
110
768–1728 algebraic equations. 32x32 cubature
weight coefficients
The results obtained by GPU-accelerated PIES were compared 105 64x64 cubature
with ones from the serial version of PIES program. Double-preci- weight coefficients

sion floating-point operations were applied in both versions of 100


PIES programs. However, we also performed single-precision GPU
Speedup

95
integration to check how it influenced accuracy and calculation
90

85
104 CUDA (32x32 cubature
weight coefficients)
serial (32x32 cubature 80
Computation time [ms]

weight coefficients)
CUDA (64x64 cubature 75
weight coefficients) 16 25 36
serial (64x64 cubature
103 weight coefficients) Number of collocation points on patch

Fig. 6. Speedup of calculation of regular integrals in GPU-accelerated PIES.

102 time of GPU-accelerated PIES. It can be useful to know whether


application of single-precision GPU integration makes sense from
efficiency and accuracy point of view.
16 25 36
Number of collocation points on patch
4.1.1. Comparison of integration performance
Fig. 4. Computation time of regular integrals in GPU-accelerated PIES and the serial Main study concerned computation time of integration in dou-
version of PIES. ble-precision GPU-accelerated PIES and the serial version of PIES
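The quantity being timed here is tensor-product Gauss–Legendre cubature over a surface patch. A minimal host-side C++ sketch of such a rule is given below; it is illustrative only (the node computation uses the standard Newton iteration on the Legendre polynomial) and assumes nothing about the authors' implementation beyond the 32×32/64×64 layouts named above.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

// Gauss–Legendre nodes and weights on [-1, 1], computed by Newton
// iteration on the Legendre polynomial P_n (three-term recurrence).
void gauss_legendre(int n, std::vector<double>& x, std::vector<double>& w) {
    const double pi = std::acos(-1.0);
    x.assign(n, 0.0);
    w.assign(n, 0.0);
    for (int i = 0; i < n; ++i) {
        double t = std::cos(pi * (i + 0.75) / (n + 0.5));  // initial guess
        double dp = 1.0;
        for (int it = 0; it < 100; ++it) {
            double p0 = 1.0, p1 = t;                       // P_0(t), P_1(t)
            for (int k = 2; k <= n; ++k) {                 // recurrence up to P_n
                double p2 = ((2.0 * k - 1.0) * t * p1 - (k - 1.0) * p0) / k;
                p0 = p1;
                p1 = p2;
            }
            dp = n * (t * p1 - p0) / (t * t - 1.0);        // P_n'(t)
            double dt = p1 / dp;                           // Newton step
            t -= dt;
            if (std::fabs(dt) < 1e-15) break;
        }
        x[i] = t;
        w[i] = 2.0 / ((1.0 - t * t) * dp * dp);
    }
}

// Tensor-product cubature of f over [-1, 1] x [-1, 1] with n x n points,
// mirroring the 32x32 / 64x64 weight layouts used in the experiments.
double cubature2d(int n, const std::function<double(double, double)>& f) {
    std::vector<double> x, w;
    gauss_legendre(n, x, w);
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            sum += w[i] * w[j] * f(x[i], x[j]);
    return sum;
}
```

In the GPU version each of the n×n integrand evaluations is independent, which is what makes this stage map so naturally onto one CUDA thread per cubature point.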
The test was carried out for a different number of collocation points, keeping the same number on each surface patch (16, 25 or 36). Additionally, a different number (32×32 or 64×64) of Gauss–Legendre cubature weight coefficients was considered. The computation time was calculated as the average time of computing all the individual submatrices [h_lj] (14) and [g_lj] (15).

Regular integration in double-precision GPU-accelerated PIES was about 80–110 times faster than in the serial one (see Fig. 6). The speedup of singular integration was even better and achieved 200–255 (presented in Fig. 7). Single-precision GPU-accelerated PIES was faster, as we expected. The singular integration speedup achieved up to 302 times, whereas the regular one reached up to 131 times. Additionally, it was shown that in the case of the greater number of weight coefficients (64×64) the speedup of calculations was significantly greater.

Fig. 7. Speedup of calculation of singular integrals in GPU-accelerated PIES.

The presented results concerned the computation time of the integration procedures only. Therefore, if we want to assess the speedup of GPU-accelerated PIES, we should also consider the time of data transfer between the host and GPU global memory, as well as the computation time of the serial parts of the program. A comparison of the calculation times of GPU-accelerated PIES and the serial one is presented in Table 1. The speedup is worse than in the case of the integration procedures; however, the entire double-precision GPU-accelerated PIES is more than 50 times faster than the serial one. In the case of single-precision GPU-accelerated PIES the speedup was slightly greater and achieved 62.

The serial part of GPU-accelerated PIES, that is, the creating and solving of the system of algebraic equations, seems to be the main bottleneck of the algorithm. It takes almost 40% of the total time of computations.

Table 1
Time of calculations in GPU-accelerated PIES and the serial version of PIES (for 36 collocation points and 64×64 weight coefficients); calculation time in [s].

Procedure                                                   Serial version   GPU-accelerated PIES
                                                            of PIES          Double-prec.   Single-prec.
Initializing program and calculation of elements
  of matrices G and H                                       2383.34          20.39          16.49
Creating and solving of the system Ax = b                   24.34            24.34          20.41
Calculation of results in the domain                        829.84           18.37          15.01
Entire program                                              3237.52          63.10          51.91

Fig. 8. (a) The shape of considered problem, (b) modelling of the shape of boundary using corner points, and (c) applied boundary conditions.

4.1.2. Accuracy of GPU-accelerated PIES

Two additional tests confirmed the accuracy of GPU-accelerated PIES. The first one concerned verification of the accuracy of integration. Relative error norms between the values of cubatures obtained by GPU-accelerated PIES and the serial one were computed. The relative error norm L2 was computed using the following formula:

||e||_L2 = sqrt( (1/K) · Σ_{k=1}^{K} (ū_k − u_k)² ) · 100%,   k = 1, 2, 3,          (18)

where ū_k – values of cubatures obtained using GPU-accelerated PIES, u_k – values of cubatures obtained using the serial version of PIES, K – the number of all computed cubatures.

The test was carried out for 16, 25 or 36 collocation points in the domain of each surface patch and 32 or 64 weight coefficients of Gauss–Legendre cubature in each direction of a patch. Regular and singular integrals were considered separately. The worst value of the L2 error norm (18) in the case of double-precision GPU-accelerated PIES was 0.242% in singular integration and 0.153% in regular integration, whereas in single-precision it was 0.257% and 0.155%, respectively.
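Formula (18) amounts to only a few lines of code; the routine below is an illustrative C++ sketch of the comparison (not the authors' code), taking the GPU-computed values and the serial reference values as vectors.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Relative error norm L2 from Eq. (18), returned in percent:
// ||e|| = sqrt( (1/K) * sum_k (u_gpu[k] - u_ref[k])^2 ) * 100%,
// where K is the number of compared values.
double l2_error_norm(const std::vector<double>& u_gpu,
                     const std::vector<double>& u_ref) {
    assert(u_gpu.size() == u_ref.size() && !u_gpu.empty());
    double sum = 0.0;
    for (std::size_t k = 0; k < u_gpu.size(); ++k) {
        double d = u_gpu[k] - u_ref[k];
        sum += d * d;
    }
    return std::sqrt(sum / static_cast<double>(u_gpu.size())) * 100.0;
}
```

For example, GPU values {1.01, 0.99} against reference values {1.0, 1.0} give a norm of exactly 1%.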
The application of CUDA did not introduce numerical errors with a significant influence on the accuracy of the obtained results, despite the differences in the hardware handling of floating-point operations between the GPU and the CPU. It should be noted that growing the number of weight coefficients from 32×32 to 64×64 did not significantly influence the accuracy of integration.

The second test concerned verification of the accuracy of the solutions obtained by GPU-accelerated PIES. Relative error norms (18) of the displacements were computed for each coordinate between the solutions obtained by GPU-accelerated PIES (or the serial one) and the values of the analytical functions (17). The test was carried out for K = 140 solutions obtained in the domain (the internal points were uniformly distributed inside the domain). The worst error norm in the case of double-precision GPU-accelerated PIES was 0.106%, whereas in single-precision it was 0.257%. The reliability and accuracy of the results obtained by the serial version of the PIES program have been verified in previous research [17,18,20,21].

4.2. GPU-accelerated PIES vs BEM (BEASY program)

The testing example concerned a problem described by the Navier–Lamé equations. The shape of the boundary is shown in Fig. 8a. The considered element was firmly fixed at the bottom and subjected to a uniform normal load p = 1 MPa acting vertically downward on the upper part (Fig. 8c). The values of the material constants selected for the calculation were Young's modulus E = 1 MPa and Poisson's ratio ν = 0.3. As in the previous example, comparative tests were carried out for the problems with a different number of all collocation points, but the same number of collocation points in the domain of each patch. We used 16, 25 or 36 collocation points on each patch and considered the influence of the number of cubature weight coefficients on the results as well.

Similarly to the previous example, the results of integration in GPU-accelerated PIES were compared with the ones from the serial version of the PIES program. However, the accuracy of the results of the problem obtained using GPU-accelerated PIES was compared with the ones from the commercial application of BEM – the BEASY program [29].

The shape of the boundary was modelled by 18 flat rectangular bilinear Coons surfaces (presented in Fig. 8b). A complete declaration of the boundary defined in PIES required 20 corner points. Some of them are shown in Fig. 8b. On each surface patch we defined the same number of collocation points (from 16 to 36) and finally solved a system of 864–1944 algebraic equations.

Fig. 9. Computation time of regular integrals in GPU-accelerated PIES and the serial version of PIES.

Fig. 10. Computation time of singular integrals in GPU-accelerated PIES and the serial version of PIES.

Fig. 11. Speedup of calculation of regular integrals in GPU-accelerated PIES.

Fig. 12. Speedup of calculation of singular integrals in GPU-accelerated PIES.

Table 2
Time of computation of BEASY and GPU-accelerated PIES (in [s]).

Procedure                                                   BEASY   GPU-accelerated PIES
                                                                    Double-prec.   Single-prec.
Initializing program and calculation of elements
  of matrices G and H                                       97      21.44          16.24
Creating and solving of the system Ax = b                   78      33.96          24.22
Calculation of results in the domain                        53      42.06          30.86
Entire program                                              228     97.46          71.32
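The headline whole-program figures can be recomputed directly from the timing tables; the small illustrative sketch below does so for the totals in Table 1 (the function names are ours, not the authors').

```cpp
#include <cassert>

// Speedup of an accelerated run relative to a serial run.
double speedup(double t_serial, double t_accelerated) {
    return t_serial / t_accelerated;
}

// Share of one phase in the total running time.
double time_share(double t_phase, double t_total) {
    return t_phase / t_total;
}
```

With the Table 1 totals, speedup(3237.52, 63.10) ≈ 51.3 reproduces the "more than 50 times faster" claim, and time_share(24.34, 63.10) ≈ 0.39 reproduces the "almost 40%" spent in the CPU-side solve.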
4.2.1. Comparison of computing performance

As in the former example, the study concerned the computation time of integration in double-precision GPU-accelerated PIES and the serial version of the PIES program. The results of regular integration are presented in Fig. 9, whereas the singular one in Fig. 10.

The tests were carried out for a different number of collocation points, keeping the same number on each surface patch (16, 25 or 36). A different number (32×32 or 64×64) of Gauss–Legendre cubature weight coefficients was considered. The computation time was calculated as the average time of computing all the individual submatrices [h_lj] (14) and [g_lj] (15).

Regular integration in double-precision GPU-accelerated PIES was about 76–108 times faster than in the serial one (presented in Fig. 11). The speedup of singular integration was a bit better and achieved 89–117 (presented in Fig. 12). Single-precision GPU-accelerated PIES was faster, as we expected. The singular integration speedup achieved up to 162 times, whereas the regular one reached up to 136 times.

We have measured the computation time of GPU-accelerated PIES, as well as of BEASY. However, a direct comparison between the programs is problematic. BEASY uses the full potential of the multi-core CPU – it is a parallelly written program and on our hardware uses both CPUs (all 8 cores). The serial version of the PIES program runs only on one core. Similarly, part of GPU-accelerated PIES (the host program) runs on one core as well. The computation times of GPU-accelerated PIES and BEASY are shown in Table 2.

Double-precision GPU-accelerated PIES ran about 2.5 times faster than BEASY, whereas the speedup of the single-precision one was 3.2. However, it should be noted that BEASY is a commercial application developed for many years.

4.2.2. Accuracy of GPU-accelerated PIES

As in the former example, relative error norms (18) between the values of cubatures obtained by GPU-accelerated PIES and the serial one were computed in order to verify the accuracy of integration. The tests were carried out for 16, 25 or 36 collocation points in the domain of each surface patch and 32 or 64 weight coefficients of Gauss–Legendre cubature in each direction of a patch. The worst value of the L2 error norm (18) in the case of double-precision GPU-accelerated PIES was 0.264% in singular integration and 0.061% in regular integration, whereas in single-precision it was 0.297% and 0.123%, respectively.

The second test concerned verification of the accuracy of the solutions obtained by GPU-accelerated PIES. Relative error norms (18) of the displacements were computed for each coordinate between the solutions obtained by GPU-accelerated PIES (or the serial one) and the results of solving the problem by BEM. The tests were carried out using the commercial application of BEM – the BEASY program. Therefore, the description of the variables in (18) is as follows: ū_k – solutions obtained using GPU-accelerated PIES (or the serial version of PIES), u_k – solutions obtained by BEASY. The accuracy of solutions obtained by BEM depends on the number of elements in the mesh. The results obtained using BEASY are treated as a reference for comparison with PIES. Therefore, we re-solved the problem with a different number of boundary elements in an iterative process, in order to obtain the most reliable results. Relative error norms (18) were computed for the displacements u1, u2, u3 between the solutions obtained with an increased number of elements in step k and the ones with a smaller number of elements in step k−1. The results are shown in Fig. 13.

Fig. 13. Relative error norms of the results obtained by BEASY with different size of mesh.

Fig. 13 proved that the value of the relative error norm between the meshes with 1032 and 928 elements is almost the same as between the ones with 928 and 778 elements. Hence, the mesh with 1032 elements (4960 nodes) was applied and a system of 14,070 algebraic equations was solved in BEASY. The test was carried out for K = 295 solutions obtained in the domain (the internal points were uniformly distributed inside the domain). The worst error norm in the case of double-precision GPU-accelerated PIES was 2.238%, whereas in single-precision it was 2.571%. The results obtained by both PIES methods are close to the ones obtained by BEASY. GPU-accelerated PIES obtained the most accurate results in the case of the highest number of cubature weight coefficients.

5. Conclusions

The paper presents the possibility of applying the new parallel computing platform and programming model CUDA to accelerate the numerical calculation of integrals in PIES. The testing examples are related to solving elasticity problems modelled by the 3D Navier–Lamé equations.

The numerical examples show a significant reduction of the computing time of numerical integration in GPU-accelerated PIES. Parallel regular integrals were computed about 76–110 times faster than serial ones.
In the case of singular integrals, the speedup was a bit greater and achieved 89–225. It depends on the number of weight coefficients in the applied cubature. We noted almost no difference between the accuracy of integration in GPU-accelerated PIES and the serial one, although the computations performed using the GPU were significantly faster. Therefore, the use of GPU-accelerated PIES is absolutely appropriate.

We should note that the accuracy of solutions in the case of single-precision GPU-accelerated PIES is slightly lower than in the double-precision one. However, the single-precision version is only about 20% faster than the double-precision one. This is very important from the practical point of view: numerical integration using CUDA may be successfully applied even in the case of single-precision floating-point applications.

We can note that a growing number of cubature weight coefficients did not significantly influence the accuracy of solutions in the first example, where the solutions of the Navier–Lamé problem obtained using PIES were compared to analytical ones. A growing number of coefficients results in slightly higher accuracy of solutions; however, these calculations require more time.

The results of solving the second example by GPU-accelerated PIES and BEASY are close to each other. It should be noted that in the case of GPU-accelerated PIES a significantly smaller system of algebraic equations was solved – 1944 equations compared to 14,070 in BEASY. The calculations were about 2.5–3.2 times faster than in BEASY, as well. However, the serial part of GPU-accelerated PIES, that is, the procedure of solving the system of algebraic equations, is the main bottleneck of the algorithm. It takes almost 40% of the total time of computations. Our future research should focus on the parallelization of this part of the algorithm.

This paper is one of the first attempts to use CUDA in 3D boundary value problems solved by PIES. The presented technique of acceleration of calculations could be extended to problems modelled by various equations and solved using PIES. However, the most important is the acceleration of integration, which directly influences the reduction of the duration of the PIES program.

Acknowledgments

The scientific work is founded by resources for science in the years 2010–2013 as a research project.

References

[1] Sharma G, Agarwala A, Bhattacharya B. A fast parallel Gauss Jordan algorithm for matrix inversion using CUDA. Comput Struct 2013;128:31–7.
[2] Altomare C, Crespo AJC, Rogers BD, Dominguez JM, Gironella X, Gomez-Gesteira M. Numerical modelling of armour block sea breakwater with smoothed particle hydrodynamics. Comput Struct 2014;130:34–45.
[3] Preis T, Virnau P, Schneider JJ. GPU accelerated Monte Carlo simulation of the 2D and 3D Ising model. J Comput Phys 2009;228(12):4468–77.
[4] Jeon Y, Jung E, Min H, Chung E-Y, Yoon S. GPU-based acceleration of an RNA tertiary structure prediction algorithm. Comput Biol Med 2013;43(8):1011–22.
[5] Townson RW, Jia X, Tian Z, Graves YJ, Zavgorodni S, Jiang SB. GPU-based Monte Carlo radiotherapy dose calculation using phase-space sources. Phys Med Biol 2013;58(12):4341–56.
[6] Li B, Liu G-F, Liu H. A method of using GPU to accelerate seismic pre-stack time migration. Chinese J Geophys 2009;52(1):242–9.
[7] Obrecht C, Kuznik F, Tourancheau B, Roux J-J. Multi-GPU implementation of the lattice Boltzmann method. Comput Math Appl 2013;65(2):252–61.
[8] Zheng JW, An XH, Huang MS. GPU-based parallel algorithm for particle contact detection and its application in self-compacting concrete flow simulations. Comput Struct 2012;112:193–204.
[9] D'Azevedo EF, Fata SN. On the effective implementation of a boundary element code on graphics processing units using an out-of-core LU algorithm. Eng Anal Bound Elem 2012;36(8):1246–55.
[10] Wang Y, Wang Q, Wang G, Huang Y, Wang S. An adaptive dual-information FMBEM for 3D elasticity and its GPU implementation. Eng Anal Bound Elem 2013;37(2):236–49.
[11] Fu Z, Lewis TJ, Kirby RM, Whitaker RT. Architecting the finite element method pipeline for the GPU. J Comput Appl Math 2014;257:195–211.
[12] Georgescu S, Chow P, Okuda H. GPU acceleration for FEM-based structural analysis. Arch Comput Methods Eng 2013;20(2):111–21.
[13] Richter C, Schops S, Clemens M. GPU acceleration of finite difference schemes used in coupled electromagnetic/thermal field simulation. IEEE Trans Magn 2013;49(5):1649–52.
[14] Karatarakis A, Metsis P, Papadrakakis M. GPU-acceleration of stiffness matrix calculation and efficient initialization of EFG meshless methods. Comput Methods Appl Mech Eng 2013;258:63–80.
[15] Nvidia Corporation. NVidia CUDA C programming guide, version 5.0; 2013.
[16] Zieniuk E. Modelling and effective modification of smooth boundary geometry in boundary problems using B-spline curves. Eng Comput 2007;23(1):39–48.
[17] Zieniuk E, Szerszeń K. Triangular Bézier surface patches in modelling shape of boundary geometry for potential problems in 3D. Eng Comput 2013;29(4):517–27.
[18] Zieniuk E, Szerszeń K. Triangular Bézier patches in modelling smooth boundary surface in exterior Helmholtz problems solved by PIES. Arch Acoust 2009;34(1):51–61.
[19] Zieniuk E, Bołtuć A. Non-element method of solving 2D boundary problems defined on polygonal domains modeled by Navier equation. Int J Solids Struct 2006;43(25–26):7939–7958.
[20] Zieniuk E, Bołtuć A, Szerszeń K. Modeling complex homogeneous regions using surface patches and reliability verification for Navier–Lamé boundary problems. In: Proceedings of WORLDCOMP'12: the 2012 international conference on scientific computing; 2012. p. 166–72.
[21] Zieniuk E. Computational method PIES for solving boundary value problems. Polish Scientific Publishers PWN; 2013 [in Polish].
[22] Brebbia CA, Telles JCF, Wrobel LC. Boundary element techniques, theory and applications in engineering. Springer-Verlag; 1984.
[23] Zienkiewicz OC. The finite element method. McGraw-Hill; 1977.
[24] Zieniuk E, Sawicki D, Bołtuć A. Parametric integral equations systems in 2D transient heat conduction analysis. Int J Heat Mass Transfer 2014;78:571–87.
[25] Gottlieb D, Orszag SA. Numerical analysis of spectral methods: theory and applications. SIAM; 1977.
[26] Becker AA. The boundary element method in engineering: a complete course. McGraw-Hill Book Company; 1992.
[27] Hall WS. Integration methods for singular boundary element integrands. In: Brebbia CA, editor. Boundary elements X, vol. 1. Springer-Verlag; 1988. p. 219–36.
[28] Volkov V, Demmel J. LU, QR and Cholesky factorizations using vector capabilities of GPUs. LAPACK working note 202; 2008. <http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-49.html>.
[29] BEASY user guide. Version 10.0R15; 2009.
