
Mesh Partitioning Techniques and Domain Decomposition Methods (Ed. F. Magoules),
Saxe-Coburg, Stirling, Scotland, 2007, pp. 119-142.

Basics of the Domain Decomposition
Method for Finite Element Analysis

G.P. Nikishkov
University of Aizu, Aizu-Wakamatsu City, Fukushima, Japan

Abstract

An introduction to the domain decomposition method for parallel finite element anal-
ysis is presented. The domain decomposition method reduces the solution of one
large problem to the solution of several smaller problems. Algorithms of domain
partitioning with compute load balancing as well as direct and iterative solution
techniques are considered.

Keywords: Finite element method, domain decomposition, partitioning, parallel.

1 Introduction
Various applications of the domain decomposition method (DDM) have a long his-
tory in computational science. The DDM in the form of substructuring was used in
finite element analysis soon after introduction of the finite element method in engi-
neering practice [1]. The reason for employing the substructuring technique was the
small memory of computers. To solve large-scale problems, a structure (domain) was
divided into substructures (subdomains) that fit into computer memory.
Computer memory grows but demand for solution of large-scale problems is al-
ways ahead of computer capabilities. Large-scale scientific and engineering simu-
lations performed with the finite element method often require very long computing
time. While limited progress can be made with improvement of numerical algorithms,
a radical time reduction can be reached with multiprocessor computations. In or-
der to perform finite element analysis on a parallel computer, computation should be
distributed across processors.
Element-wise operations, such as calculation of element matrices, are easy to par-
allelize. It is more difficult to transform the solution of the global equation system
into a parallel procedure. Simple distribution of arithmetic operations across processors
leads to fine-grain parallelism with intensive data communication between processors.
Such parallel computations are usually inefficient.
A coarse-grain parallel finite element algorithm can be based on the DDM [2].
In the DDM, the finite element domain is divided into subdomains along element
boundaries. Subdomain operations are carried out by separate processors without any
data flow among them. Then interprocessor data communications and computations
are necessary to establish proper subdomain connections. With load balancing, the
DDM can be an efficient computational procedure for parallel finite element analysis.
The necessary first stage of the DDM finite element procedure is domain partition-
ing into the specified number of subdomains, which is usually equal to the number
of processors. Both direct and iterative solution methods are used in the finite ele-
ment programs. Attractive features of the direct solution methods in comparison to
iterative algorithms are simplicity, possibility of predicting the computing time, and
the absence of convergence problems. For relatively small problems, direct solution
methods require less time than iterative methods; for large problems, iterative methods
are more efficient.
In this chapter, an introduction to the domain decomposition method for finite
element analysis is presented. First, domain partitioning is described. Then we con-
sider the DDM with a direct LDU equation solver. After derivation of the general
computational procedure, attention is paid to domain partitioning with compute load balancing.
Then the DDM algorithm with iterative solution of the equation system is discussed.
Results of some parallel finite element applications illustrate efficiency of the DDM
with direct and iterative solution algorithms.

2 Domain partitioning
A domain partitioning algorithm should produce a subdivision that minimizes the total
computing time on a multiprocessor computer. Thus the computations should
be balanced among the processors.
In order to minimize the total computing time for the finite element analysis, the
following objectives should be fulfilled: minimization of the number of interface
nodes, which determines the size of the interface equation system (when subdomain
condensation is used) and the amount of data communication; compute load balancing
by assigning different numbers of elements to subdomains.
The quality of domain partitioning can considerably affect computing time of paral-
lel finite element analysis [3]. Numerous algorithms for domain partitioning have been
reported in the literature. Some authors consider only partitioning itself [4, 5, 6, 7].
Publications [8, 9, 10] also address the problem of compute load balancing.
Graph methods are widely used for domain partitioning, usually in the form of
recursive graph bisection (RGB) [5]. The finite element mesh is represented by a
dual diagonal graph. Elements of the mesh compose a graph vertex set. Vertices
are connected by an edge if corresponding finite elements have one or more common
nodes. An example of the dual diagonal graph is shown in Fig. 1.

Figure 1: Example of the dual diagonal graph.

The RGB algorithm
recursively bisects a graph into two subgraphs. During bisection, graph diameter is es-
timated, and graph vertices are separated into two groups according to their distances
from end nodes. In many cases the RGB algorithm produces far from optimal sub-
domains with “fuzzy” interface boundaries since only distance information for graph
vertices is employed.
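As an illustration, a minimal sketch of building such a dual graph from an element connectivity array is given below. The array names, the fixed array sizes and the all-pairs search are assumptions made for this example only; a practical code builds node-to-element lists first and uses dynamic adjacency storage.

#include <string.h>

#define MAXELEM 1000       /* assumed limit for this sketch            */
#define NODESPERELEM 4     /* four-node elements, as in the examples   */

/* elem[e][i] holds the i-th node of element e (assumed input).
   adj[e1][e2] is set to 1 if elements e1 and e2 share at least one
   node, i.e. if the corresponding dual graph vertices are connected. */
void build_dual_graph(int nelem, const int elem[][NODESPERELEM],
                      char adj[][MAXELEM])
{
    int e1, e2, i, j;

    for (e1 = 0; e1 < nelem; e1++)
        memset(adj[e1], 0, (size_t)nelem);

    for (e1 = 0; e1 < nelem; e1++)
        for (e2 = e1 + 1; e2 < nelem; e2++)
            for (i = 0; i < NODESPERELEM; i++)
                for (j = 0; j < NODESPERELEM; j++)
                    if (elem[e1][i] == elem[e2][j]) {
                        adj[e1][e2] = adj[e2][e1] = 1;  /* dual graph edge */
                        i = j = NODESPERELEM;           /* leave both loops */
                    }
}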
Here we present the recursive graph labeling (RGL) algorithm for domain parti-
tioning [11]. The RGL algorithm is based on the graph labeling scheme for matrix
profile reduction [12]. Both global information (distances from the end vertex) and
local information (degree of a current vertex) are used for labeling graph vertices. The
algorithm allows partitioning the domain into subdomains with unequal numbers of
elements as necessary for load balancing. The RGL algorithm produces subdomains
with smooth boundaries. This leads to fewer interface nodes and reduced data com-
munication between subdomains.
The partitioning process consists of: formation of a distance structure for the graph
representing the finite element domain; labeling of graph vertices; and graph division
into subgraphs related to subdomains.

2.1 Graph distance structure


The graph distance structure is an ordering of the graph vertices according to their
distances from the specified vertex s, for example, the lower right vertex in Fig. 1.
The distance structure can be represented as a level structure L(s) = {l_0, l_1, ..., l_h},
where level l_i consists of vertices which have distance i from s. The depth of the
graph is equal to h, and the width of the graph w is equal to the maximum number of
vertices over all levels. The diameter of a graph is the maximum distance between all
vertex pairs. The degree of a vertex is the number of edges that connect it to neighbor
vertices.
In order to determine the graph diameter, vertex s, which has the smallest degree, is
selected as a starting vertex. The graph distance structure L(s) for vertex s is compiled.
For the last level l_{h(s)} a list of vertices Q, containing one vertex of each degree, is
generated. For each vertex i in Q the distance structure L(Q_i) is built. The graph
diameter is assigned the maximum depth from all h(Q_i). The starting vertex s and the
end vertex e are vertices at opposite ends of the diameter.
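A minimal sketch of building the level structure L(s) by breadth-first search is given below; the compressed adjacency arrays xadj/adjncy, the fixed queue size and the function name are assumptions of the sketch.

#define MAXVERT 100000   /* assumed limit for this sketch */

/* Breadth-first search from vertex s over a graph stored in compressed
   adjacency form: neighbors of vertex v are adjncy[xadj[v]..xadj[v+1]-1].
   On return level[v] is the distance of v from s, and the function value
   is the depth h of the level structure L(s). */
int level_structure(int nvert, const int *xadj, const int *adjncy,
                    int s, int *level)
{
    static int queue[MAXVERT];
    int head = 0, tail = 0, v, k, depth = 0;

    for (v = 0; v < nvert; v++) level[v] = -1;  /* -1 = not reached yet */
    level[s] = 0;
    queue[tail++] = s;

    while (head < tail) {
        v = queue[head++];
        if (level[v] > depth) depth = level[v];
        for (k = xadj[v]; k < xadj[v + 1]; k++)
            if (level[adjncy[k]] < 0) {
                level[adjncy[k]] = level[v] + 1;
                queue[tail++] = adjncy[k];
            }
    }
    return depth;
}

Repeating this search from a vertex of minimum degree, and then from the vertices listed in Q, gives the depths h(Q_i) used above to estimate the diameter.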

2.2 Vertex labeling

The graph labeling algorithm is based on the vertex priority, which is a combination
of the vertex degree and its distance from the end vertex

p = W_1 h − W_2 (d + 1),                                        (1)

where W_1 and W_2 are weights; h is the distance from the end vertex; and d is the
vertex degree. The priority of each vertex is updated during labeling. Vertices which
are labeled are excluded from further consideration. Vertices which are adjacent to
labeled vertices are called active. Vertices adjacent to an active vertex, but not active
themselves, are called preactive. All the other vertices are inactive.
The vertex labeling algorithm can be presented as follows:
Form the distance structure beginning at the end vertex, L(e)
Compute initial priorities for all vertices: p_i = W_1 h_i − W_2 (d_i + 1)
Put the starting vertex into the queue with preactive status
do while queue is not empty
    Select from the queue vertex i with maximum priority p_i
    Label vertex i with the next available number
    Delete vertex i from the queue
    if vertex i is preactive then
        do for all vertices j adjacent to vertex i
            Increment priority p_j by W_2
            if vertex j is inactive then put it into the queue with preactive status
        end do
    end if
    do for all vertices j adjacent to vertex i
        if vertex j is preactive then
            Assign vertex j an active status
            Increment priority p_j by W_2
            do for all vertices k adjacent to vertex j
                Increment priority p_k by W_2
                if vertex k is inactive then
                    put it into the queue with preactive status
            end do
        end if
    end do
end while
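A possible sketch of this labeling loop in C is given below. The status codes, the compressed adjacency arrays xadj/adjncy and the linear scan for the maximum priority are assumptions made for brevity; a production implementation follows the priority-queue organization of Sloan's code [12].

#include <stdlib.h>

enum { INACTIVE, PREACTIVE, ACTIVE, LABELED };

/* One labeling pass of the RGL scheme: h[] holds distances from the end
   vertex, deg[] vertex degrees, label[] receives the new numbers 0..nvert-1.
   The graph is assumed connected; W1, W2 are the weights of equation (1). */
void label_vertices(int nvert, const int *xadj, const int *adjncy,
                    const int *h, const int *deg, int start,
                    int W1, int W2, int *label)
{
    int *p  = malloc(nvert * sizeof *p);       /* vertex priorities   */
    int *st = malloc(nvert * sizeof *st);      /* vertex status codes */
    int v, i, j, k, m, n;

    for (v = 0; v < nvert; v++) {
        p[v] = W1 * h[v] - W2 * (deg[v] + 1);  /* equation (1) */
        st[v] = INACTIVE;
    }
    st[start] = PREACTIVE;

    for (n = 0; n < nvert; n++) {
        /* select the queued (preactive or active) vertex of maximum priority */
        for (i = -1, v = 0; v < nvert; v++)
            if (st[v] == PREACTIVE || st[v] == ACTIVE)
                if (i < 0 || p[v] > p[i]) i = v;

        if (st[i] == PREACTIVE)                /* first touch: update neighbors */
            for (k = xadj[i]; k < xadj[i + 1]; k++) {
                j = adjncy[k];
                p[j] += W2;
                if (st[j] == INACTIVE) st[j] = PREACTIVE;
            }
        label[i] = n;                          /* assign next available number */
        st[i] = LABELED;

        for (k = xadj[i]; k < xadj[i + 1]; k++) {
            j = adjncy[k];
            if (st[j] != PREACTIVE) continue;
            st[j] = ACTIVE;
            p[j] += W2;
            for (m = xadj[j]; m < xadj[j + 1]; m++) {
                p[adjncy[m]] += W2;
                if (st[adjncy[m]] == INACTIVE) st[adjncy[m]] = PREACTIVE;
            }
        }
    }
    free(p); free(st);
}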

2.3 Division into subdomains
Suppose that the finite element domain should be partitioned into s subdomains with
equal numbers of elements. The integer s is represented as a product of prime factors
s = s_1 · s_2 · ... · s_q and the graph labeling algorithm is applied q times as shown below:
Represent the number of subdomains as s = s_1 · s_2 · ... · s_q
Current number of subdomains N = 1
do i = 1, q
    do j = 1, N
        Partition subdomain j into s_i subdomains using graph labeling
    end do
    Update the current number of subdomains N = N · s_i
end do

Partitioning of a subdomain consisting of E elements into s_i new equal subdomains
includes the following steps. Using labeling, elements are sorted according to their
priorities. The first E/s_i elements are assigned to the first new subdomain; the next
E/s_i elements are assigned to the second subdomain; and so on. It is not difficult to
partition the subdomain into unequal new subdomains with specified numbers of
elements, as sketched below.
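Assuming the subdomain elements have already been sorted by labeling priority into an array order[], the division step might look as follows (all names are illustrative; target[] carries the specified, possibly unequal, subdomain sizes).

/* Assign E elements, already sorted by labeling priority in order[],
   to si new subdomains; sub[e] receives the new subdomain number of
   element order[e].  target[] holds the desired number of elements in
   each new subdomain (E/si for equal splitting). */
void split_subdomain(int E, int si, const int *order,
                     const int *target, int *sub)
{
    int e = 0, s, t;

    for (s = 0; s < si; s++)
        for (t = 0; t < target[s] && e < E; t++, e++)
            sub[order[e]] = s;

    while (e < E)                /* remainder from rounding goes to the last part */
        sub[order[e++]] = si - 1;
}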

3 Domain decomposition method with a direct solver


The DDM with subdomain condensation is called the Schur complement method
[1, 2]. The finite element domain is divided into subdomains. For each subdomain,
elimination of inner nodes is performed. The condensed subdomain matrices are as-
sembled into an interface equation system, whose solution produces interface displace-
ments. Finally, inner displacements and other results like strains and stresses are com-
puted.

3.1 DDM algorithm


After division of the computational domain into a number of subdomains, a finite
element equation system can be assembled for each subdomain:

[k]{u} = {f }, (2)

where [k] is a subdomain stiffness matrix, {u} is a subdomain displacement vector,
and {f} is a subdomain load vector.
The subdomain nodes are grouped into interior nodes, designated by the subscript
i, and interface boundary nodes, designated by the subscript b. If the interior nodes
are numbered first and the interface boundary nodes are numbered last, then the sub-
domain equation system can be written in the following matrix form:
[ k_ii  k_ib ] { u_i }   { f_i }
[ k_bi  k_bb ] { u_b } = { f_b }.                               (3)

Matrices [k_ii] and [k_bb] correspond to interior and interface (boundary) nodes, respec-
tively. Matrix [k_ib] reflects the interaction between the interior and boundary nodes.
A condensation of the subdomain equation system is made by eliminating unknowns
related to interior nodes:
[k̄_bb]{u_b} = {f̄_b},
[k̄_bb] = [k_bb] − [k_bi][k_ii]^{-1}[k_ib],                      (4)
{f̄_b} = {f_b} − [k_bi][k_ii]^{-1}{f_i}.

Subdomain condensed stiffness matrices [k̄_bb] and subdomain surface load vectors
{f̄_b} are assembled into the interface equation system

[K]{U } = {F }. (5)

Solution of the interface system gives unknown interface displacements {U}. The in-
terface displacements are disassembled and used for the determination of the interior
displacements for each subdomain:

[k_ii]{u_i} = {f_i} − [k_ib]{u_b}.                              (6)

Let us adopt the LDU method for subdomain condensation. The DDM numerical
procedure with the LDU subdomain condensation has three computational phases –
(a) Subdomain assembly and condensation:
[k] = Σ_el [k_el],   {f} = Σ_el {f_el},
[k_ii] = [L][D][U] = [U]^T [D][U],
[k̄_ib] = [U]^{-T} [k_ib],                                       (7)
[k̄_bb] = [k_bb] − [k̄_ib]^T [D]^{-1} [k̄_ib],
{f̄_i} = [U]^{-T} {f_i},
{f̄_b} = {f_b} − [k̄_ib]^T [D]^{-1} {f̄_i}.

(b) Interface assembly and solution:


[K] = Σ_s [k̄_bb],   {F} = Σ_s {f̄_b},                           (8)
{U} = [K]^{-1} {F}.

(c) Determination of interior displacements:

{f̃_i} = {f̄_i} − [k̄_ib]{u_b},
{u_i} = [U]^{-1} [D]^{-1} {f̃_i}.                                (9)
Different subdomains are assigned to different processors for parallel computing.
Subdomain stiffness matrices and subdomain load vectors are assembled from element
stiffness matrices and element load vectors. The subdomain condensation procedure
(7) consists of LDU factorization of the matrix [k_ii] with the interior degrees of free-
dom and matrix-matrix and matrix-vector multiplications. Subdomain operations of
assembly and condensation take the most time and can be done in parallel without any
data communication. All parallel tasks should be synchronized at the beginning of the
solution of the interface equation system, thus the compute load for the subdomain
assembly and condensation should be balanced among processors.
Optimized node enumeration inside each subdomain can substantially decrease the
operation count for the condensation of the subdomain matrices. The graph labeling
algorithm of Section 2 with necessary modifications can be used for subdomain node
renumbering. The dual diagonal graph is generated for nodes. If a subdomain has a
part of the boundary which does not belong to the interface, then the starting node is
placed on that part of the boundary.
Solution of the interface equation system can be performed with direct or iterative
methods. If a direct method is used, then factorization of the interface equation sys-
tem can be done in parallel [13] using cyclic distribution of matrix columns across
processors. Each column after modifications is broadcast to all processors. Then each
parallel task uses the received column for modification of columns belonging to this
task. The forward solve and backsubstitution phase for the interface system is not
well suited for parallelization because of its very low computation to communication
ratio. Since the backsolve takes a tiny fraction of total computing time, it is possible
to perform this operation in a serial manner. Finite element operations that follow
the determination of displacements (usually stress calculations) are element-oriented
and not difficult to parallelize. The distribution of elements across processors during
the stress calculations can differ from subdomain partition used during the previous
computational phases.

3.2 Compute load for subdomains


Analyzing the DDM computational procedure, it is possible to conclude that it is im-
portant to balance the compute load for subdomain assembly and condensation.
Operation count estimates for this computational phase are presented below [11, 14].
The operation count for the assembly of the subdomain equation system is mainly
determined by element stiffness calculations, and can be estimated as

c = c^el e,                                                     (10)

where c^el is the operation count for computing the element stiffness matrix and e is
the number of elements in the subdomain. For estimation of the number of arithmetic
operations for element stiffness matrix calculation, it is possible to use the computing
time for element stiffness calculation and the average computer Mflops rate.
Fig. 2 presents a simplified structure of the subdomain matrix composed of the sym-
metric banded part [k_ii], the symmetric part of the interaction matrix [k_ib] with
triangular shape, and the symmetric part of the matrix [k_bb] corresponding to interface
nodes. The matrix structure in Fig. 2 corresponds to a subdomain with a topologically
regular mesh and regular node enumeration.

Figure 2: Structure of the subdomain matrix: block [k_ii] of order n with halfbandwidth h,
interaction block [k_ib] with m_0 nonzero columns, and interface block [k_bb] of order m.

The inner loop of the LDU factorization of matrix [k_ii] can be described as follows:

do j = 2, n
    do i = 1, h − 1
        do k = 1, i − 1
            Multiply-add
        end do
    end do
end do
There are two arithmetic operations of multiplication and addition in the inner k-loop.
For the first h columns the operation count is calculated in the same way as for the
fully populated matrix, and for the remaining (n − h) columns in the same way as for
the pure banded matrix:
c_1 = h^2 (n − (2/3) h).                                        (11)

Modification of the interaction matrix [k_ib] is a multiple forward reduction for its
columns:

do k = 1, m_0
    do i = 1, n − (n/m_0) k
        do j = 1, h
            Multiply-add
        end do
    end do
end do
The operation count can be estimated as:

c_2 = (n − h) h m_0.                                            (12)

The fill factor f for the interaction matrix [k_ib] is calculated as

f = (1/(mn)) Σ_{i=1}^{m} h_i,                                   (13)

where h_i is the height of the ith column. For our simplified structure of the matrix
[k_ib] with m_0 = m/2, the size m_0 is equal to

m_0 = 2mf

and the operation count c_2 is expressed as:

c_2 = 2(n − h) m h f.                                           (14)

Calculation of the symmetric part of the condensed stiffness matrix [k̄_bb] is a multi-
plication of the transpose of the profile matrix [k̄_ib] by the matrix itself:

do j = 1, m
    do i = 1, j
        do k = 1, n − (n/m_0) max(i, j)
            Multiply-add
        end do
    end do
end do

The operation count can be estimated by computing an integral

c_3 = ∫_0^{m_0} ((1/2) n m_0 − (1/2)(n/m_0) x^2) dx = (1/3) n m_0^2 = (4/3) n m^2 f^2,   (15)

where x is a column number in the matrix [k_ib].


Summing operation counts of (10), (11), (14) and (15) yields the total compute load
for assembly and condensation of the subdomain stiffness matrix:

C = c^el e + h^2 (n − (2/3) h) + 2(n − h) m h f + (4/3) n m^2 f^2.   (16)
Partitioning of the finite element domain into subdomains with an equal number
of elements does not lead to interprocessor load balancing since the quantities n, m,
h, and f are different in different subdomains because of the subdomain position and
subdomain node enumeration.
In actual calculations, the structure of the subdomain stiffness matrix is more com-
plicated than the structure shown in Fig. 2. The operation count estimate (16) can be
used for irregular subdomains provided that the halfbandwidth h is replaced by its root
mean square value:

h = √( (1/n) Σ_{i=1}^{n} h_i^2 ),                               (17)

where h_i is the height of the ith column of the matrix [k_ii].
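The estimates (16) and (17) translate directly into a small helper routine, sketched below with illustrative names.

#include <math.h>

/* Operation count estimate (16) for assembly and condensation of one
   subdomain.  e - number of elements, c_el - operations per element
   stiffness matrix, n - interior equations, m - interface equations,
   h - root mean square halfbandwidth (17), f - fill factor (13). */
double subdomain_ops(double e, double c_el, double n, double m,
                     double h, double f)
{
    return c_el * e
         + h * h * (n - 2.0 / 3.0 * h)     /* LDU factorization, eq. (11) */
         + 2.0 * (n - h) * m * h * f       /* forward reduction, eq. (14) */
         + 4.0 / 3.0 * n * m * m * f * f;  /* [k_bb] update,     eq. (15) */
}

/* Root mean square halfbandwidth (17) from column heights hi[0..n-1]. */
double rms_halfbandwidth(int n, const double *hi)
{
    double s = 0.0;
    int i;
    for (i = 0; i < n; i++) s += hi[i] * hi[i];
    return sqrt(s / n);
}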

3.3 Subdomain load balancing
The first phase of the DDM algorithm is assembly and condensation of the subdomain
stiffness matrices. Since the computation time for the subdomain assembly and con-
densation is the largest fraction of the total solution time and the end of this phase is the
synchronization point, the compute load of the subdomain assembly and condensa-
tion should be balanced among processors. The load balancing may be achieved by
partitioning the domain into subdomains with unequal numbers of elements [11].
For subdomains having varying numbers of elements and similar shape, the quantities
of equation (16) are approximately proportional to:

n ∼ e;   m ∼ √e;   h ∼ √e;   f ≈ const.

The operation count for the subdomain can be expressed through the number of ele-
ments

C = c^el e + c^eq e^2,                                          (18)

where c^eq is the operation count related to subdomain condensation. The value of the
coefficient c^eq is determined by

c^eq = (h^2 n + 2 n m h f + (4/3) n m^2 f^2) / e_0^2            (19)

for some subdomain with known values of n, h, m, and f and number of elements e_0.
A nonlinear equation system for the load balancing problem can be written as:

C_1 − C_2 = 0
C_2 − C_3 = 0
...                                                             (20)
C_{s−1} − C_s = 0
Σ_i e_i − E = 0,

or
c^el e_i + c^eq_i e_i^2 − c^el e_{i+1} − c^eq_{i+1} e_{i+1}^2 = 0,   i = 1...s − 1
                                                                (21)
Σ_i e_i − E = 0,
where s is the number of subdomains, e_i is the number of elements in the ith subdo-
main, and E is the number of elements in the domain.
The nonlinear equation system (21),

F_i(e_1, ..., e_s) = 0,   i = 1...s,                            (22)

can be solved by the Newton-Raphson iterative procedure:

{e}^(0) = {E/s, E/s, ...}
{Δe}^(i) = −([J]^(i−1))^{-1} {F}^(i−1)                          (23)
{e}^(i) = {e}^(i−1) + {Δe}^(i).

Here (i) is the iteration number and the coefficients of the matrix [J] are equal to
J_ij = ∂F_i/∂e_j:

J_i1, ..., J_{i,i−1} = 0
J_ii = c^el + 2 c^eq_i e_i
J_{i,i+1} = −c^el − 2 c^eq_{i+1} e_{i+1}                        (24)
J_{i,i+2}, ..., J_is = 0
J_si = 1.
Solution of the nonlinear load balancing system (21) predicts the new distribution
of elements among subdomains. The algorithm of partitioning with load balancing is
described by the following pseudo-code:
Represent the number of subdomains s as a product of prime factors
Specify equal numbers of elements in the subdomains: {e}^(0) = {E/s, E/s, ...}
while load imbalance ≥ specified value
    Partition the domain into s subdomains {e}^(i−1) using the recursive graph labeling method
    Optimize subdomain node enumeration with minimization of h and f
    Calculate {e}^(i) by solving the algebraic problem (21) of element distribution among subdomains
end while
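A sketch of the element-redistribution step (the Newton-Raphson solution of (21) with Jacobian (24)) is given below. For brevity a single coefficient c^eq is assumed for all subdomains, the small s-by-s Jacobian is solved by Gaussian elimination without pivoting, and the names and convergence threshold are illustrative.

#include <stdlib.h>
#include <math.h>

/* Newton-Raphson iteration (23) for the load balancing system (21).
   e[0..s-1] enters with the current element counts (e.g. E/s each) and
   returns the balanced counts; E is the total number of elements. */
void balance_elements(int s, double cel, double ceq, double E, double *e)
{
    double *F  = malloc(s * sizeof *F);
    double *de = malloc(s * sizeof *de);
    double *J  = malloc((size_t)s * s * sizeof *J);   /* Jacobian, row-major */
    int i, j, k, it, done;

    for (it = 0; it < 20; it++) {
        /* residuals (21) and Jacobian (24) */
        for (i = 0; i < s * s; i++) J[i] = 0.0;
        for (i = 0; i < s - 1; i++) {
            F[i] = cel * e[i] + ceq * e[i] * e[i]
                 - cel * e[i + 1] - ceq * e[i + 1] * e[i + 1];
            J[i * s + i]     =  cel + 2.0 * ceq * e[i];
            J[i * s + i + 1] = -cel - 2.0 * ceq * e[i + 1];
        }
        F[s - 1] = -E;
        for (j = 0; j < s; j++) { F[s - 1] += e[j]; J[(s - 1) * s + j] = 1.0; }

        /* solve J*de = -F by Gaussian elimination (s is small) */
        for (k = 0; k < s - 1; k++)
            for (i = k + 1; i < s; i++) {
                double a = J[i * s + k] / J[k * s + k];
                for (j = k; j < s; j++) J[i * s + j] -= a * J[k * s + j];
                F[i] -= a * F[k];
            }
        for (i = s - 1; i >= 0; i--) {
            de[i] = -F[i];
            for (j = i + 1; j < s; j++) de[i] -= J[i * s + j] * de[j];
            de[i] /= J[i * s + i];
        }

        done = 1;
        for (i = 0; i < s; i++) {
            e[i] += de[i];
            if (fabs(de[i]) > 0.5) done = 0;   /* correction above half an element */
        }
        if (done) break;
    }
    free(F); free(de); free(J);
}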

Usually it is sufficient to perform 1–3 iterations in order to achieve acceptable load
balancing across subdomains.

3.4 Examples
The above algorithm of domain partitioning with load balancing has been imple-
mented as a C routine. Although it is not possible to prove the convergence of the
iterative load balancing procedure, one can intuitively feel that the iterative procedure
converges if the subdomains undergo small changes in shape and position between
successive iterations. To provide similarity of subdomains during load balancing iter-
ations, the following procedure was introduced into the algorithm. During the initial
partitioning, graph diameter ends for all subdomains are stored. Later, each diam-
eter end is selected as an element that is located at the subdomain boundary and has the
minimum distance from the diameter end stored during the first partitioning. In or-
der to “sharpen” subdomain boundaries, a large weight W2 for vertex degrees is used
(W2 À W1 ) in the recursive graph labeling algorithm. Degrees for all interior ele-
ments are set to the same value, which is equal to the degree of the interior element in
the regular mesh.
Here, examples of partitioning both regular and irregular meshes are presented.
Partitions are evaluated in terms of parallel efficiency for subdomain assembly and
condensation computational phase. The parallel efficiency is calculated as

E_p = C(sequential, equal subdomains) / [s · C(maximum among subdomains)],

where s is the number of subdomains. Operation counts C both for partitions into
subdomains with equal number of elements and for optimized partitions are compared

to the operation count of the sequential algorithm for the partition with equal subdo-
mains. Because of this, values of the parallel efficiency larger than 1.0 for optimized
subdomains are possible. Two-dimensional four-node elements with c^el = 0.00738·10^6
are used in the examples.

Figure 3: Eight-subdomain primary partition into equal (a) and optimized (b) subdo-
mains.
Figures 3 and 4 show topologically general meshes, selected to demonstrate per-
formance of the proposed algorithm. The primary partition of the domain with two
holes into 8 equal subdomains is presented in Fig. 3a. The optimized partition of this
mesh is shown in Fig. 3b. Figures 4a and 4b illustrate the primary and optimized par-
titions of the domain with one hole into 12 subdomains. Partition into subdomains
with equal numbers of elements leads to the parallel efficiency in the range 0.7–0.8
if the direct LDU algorithm is used for subdomain condensation. The algorithm with
compute load balancing radically improves the parallel efficiency of the assembly and
condensation phase and provides parallel efficiency close to 1.0. Just 2–3 iterations
are necessary to reach balanced mesh partitions. The ratio of maximum and minimum

numbers of elements in optimized subdomains is in the range 1.4–1.5.

Figure 4: 12-subdomain primary partition into subdomains with equal number of ele-
ments (a) and optimized subdomains (b).
Dependence of parallel efficiency of the assembly and condensation on the prob-
lem size is demonstrated for nearly square regular meshes of quadrilateral elements.
An example of an optimized partition of a mesh consisting of 6006 two-dimensional
elements into 16 subdomains is shown in Fig. 5. Just one iteration changed the parallel
efficiency from 0.65 to 1.06.
Parallel efficiency values of equal and optimized 8- and 16-subdomain partitions for
meshes containing from 1056 to 10201 nodes are plotted in Figures 6 and 7. The 16-
subdomain partitions are characterized by lower values of efficiency after partitioning
into equal subdomains. Optimization with 1–2 iterations in most cases yields values
of parallel efficiency that are close to 1.0 and even larger than 1.0 for 16-subdomain
partitions.
The domain decomposition method with the direct LDU solver was used for devel-
opment of a parallel version of an industrial sheet metal forming program [15].

4 Domain decomposition method with an iterative solver

The main problem of direct methods on parallel computers is their poor performance
for systems with large numbers of processors (the scalability problem). For large
finite element problems (and large number of processors), iterative methods are more
efficient than direct ones. Various iterative methods for solution of large systems of
equations are discussed in monographs [16, 17]. In many practical applications, the
preconditioned conjugate gradient (PCG) method is used because of its simplicity and
efficiency.
A simple data distribution scheme for the PCG method is a row-wise distribution
of the global stiffness matrix [18], in which each processor holds a contiguous block of
rows. Distribution of the vector arrays corresponds to the row distribution of the matrix
in a component-wise manner. Such partitioning may be simple but it can lead to long
boundaries between parts of mesh assigned to processors. A more efficient approach is
based on partitioning the finite element mesh into subdomains using graph partitioning
schemes and processing subdomain matrices and vectors on different processors with
necessary data communication.
Here an efficient implementation of the parallel PCG method with nonoverlapping
domain decomposition for solution of three-dimensional finite element problems [19]
is considered. An algorithm for domain partitioning and algorithms for matrix-vector
and vector-vector multiplications for partitioned arrays are presented in the next sub-
section. Then a parallel procedure of the PCG method for solution of decomposed
finite element problems is described. Performing computations for interior and inter-
face data separately allows overlapped communication and computation.

Figure 5: 16-subdomain optimized partition of a regular mesh.

Figure 6: Parallel efficiency of subdomain assembly and condensation for equal and
optimized 8-subdomain partitions.

Figure 7: Parallel efficiency of subdomain assembly and condensation for equal and
optimized 16-subdomain partitions.

4.1 Problem partitioning


In order to implement a parallel solution of the finite element problem, both matrices
and vectors of the finite element model should be divided into parts and distributed
across processor nodes. The choice of partitioning is tightly related to data commu-
nication between subdomains. Since data communication is usually a critical issue
in parallel finite element analysis, we consider matrix-vector and vector-vector opera-
tions for partitioned arrays.
For nonoverlapping domain decomposition, subdomain boundaries coincide with
finite element boundaries and each finite element belongs to one subdomain only. A
simple domain divided into four subdomains is shown in Fig. 8.
Matrices and vectors can be stored on processor nodes in an accumulated form or
in a distributed form. An accumulated matrix or vector contains full entries for both
interior and interface nodes. The term ‘distributed’ means that a subdomain matrix or
vector contains entries assembled from contributions of elements belonging to this
subdomain. Entries of a distributed matrix or vector contain full values for the inte-
rior nodes and only partial values for the interface nodes. It is possible to demonstrate
that in general matrix-vector products cannot be calculated with accumulated arrays
[20]. Because of this, global matrices (global stiffness matrix and preconditioning
matrix) are stored in a distributed form. The distributed subdomain stiffness matrix
is obtained automatically by assembling element stiffness matrices for elements be-
longing to this subdomain. Both distributed and accumulated storage forms should
be used for vectors. The reason for this becomes clear by considering computation of
vector-vector and matrix-vector products employed inside the iteration procedure of
the PCG method.

Figure 8: Domain divided into four subdomains. For the left upper subdomain, the
following node groups are shown – i: interior nodes; b: interface (boundary) nodes;
eb: external interface nodes. Arrows show data communication for transformation of
a vector from distributed to accumulated form.

Consider computation of an inner product for vectors a and b:

α = a^T b = Σ_i a_i b_i,                                        (25)

where the subscript i denotes the ith entry of the vector. When vectors a and b are dis-
tributed across processors as a_p and b_p (p is a processor index) then the inner product
is computed as:

α = Σ_p (a^T b)_p = Σ_p Σ_i (a_i b_i)_p.                        (26)

If both vectors are stored in accumulated form then the quantities a_i b_i for the interface
nodes are counted several times and the result is incorrect. The correct result can be
obtained if the intermediate products a_i b_i for interface nodes are divided by the
multiplicity factor of each node. The multiplicity factor of a node reflects how many
times the node appears in all subdomains.
A simple approach to correctly perform the inner product is just to have one vector
in accumulated form and the other vector in distributed form. Direct check of equation
(26) shows that the result will be correct. It is easy to demonstrate that a matrix-
vector product should be performed as multiplication of a distributed matrix by an
accumulated vector. The result of such an operation is a distributed vector.
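A sketch of equation (26) for one distributed and one accumulated vector is shown below (illustrative names; the MPI reduction sums the local partial products over all processors).

#include <mpi.h>

/* Inner product of a distributed vector a and an accumulated vector b_acc
   (both of local length n on this processor).  The partial interface values
   of the distributed vector sum over subdomains to the full values, so adding
   the local partial products over all processors gives the exact result. */
double inner_product(int n, const double *a, const double *b_acc,
                     MPI_Comm comm)
{
    double local = 0.0, global;
    int i;

    for (i = 0; i < n; i++) local += a[i] * b_acc[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}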
During PCG iterations it is necessary to transform vectors from distributed form to
accumulated form. This can be done using interprocessor communication of data, as
illustrated in Fig. 8. To transform a vector from distributed form a to accumulated
form ā, each processor performs the following steps:

Disassemble boundary entries a_b into array segments corresponding to external
subdomain boundaries;
Send disassembled a_b to neighboring subdomains and receive disassembled exter-
nal interface entries a_eb from neighboring subdomains;
Assemble external interface entries a_eb into the distributed vector a: ā = a + a_eb.
The result is the accumulated vector ā.
Send-receive operations are indicated by arrows in Fig. 8 for the left upper subdomain.
Element connectivities are used in assembly and disassembly procedures, which are
standard operations in the computational procedure of the finite element method. Posi-
tion of an element entry in a global vector is determined by the corresponding element
connectivity number.
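A sketch of this transformation with MPI is given below. It assumes the interface communication pattern (neighbor ranks, counts and local index maps of the shared nodes, with the same ordering agreed on both sides) has been prepared from the element connectivities beforehand; all names and the fixed neighbor limit are illustrative.

#include <mpi.h>

#define MAXNEIB 64   /* assumed upper bound on neighboring subdomains */

/* Transform a distributed vector v into accumulated form following the
   three steps above.  For each of nneib neighboring subdomains, neib[j]
   is its processor rank, cnt[j] the number of shared interface entries,
   and map[j][k] the local index of the k-th shared entry.  sbuf/rbuf are
   scratch buffers of sufficient size. */
void accumulate(double *v, int nneib, const int *neib, const int *cnt,
                int **map, double **sbuf, double **rbuf, MPI_Comm comm)
{
    MPI_Request req[2 * MAXNEIB];
    int j, k, nreq = 0;

    for (j = 0; j < nneib; j++) {
        for (k = 0; k < cnt[j]; k++)        /* disassemble boundary entries */
            sbuf[j][k] = v[map[j][k]];
        MPI_Isend(sbuf[j], cnt[j], MPI_DOUBLE, neib[j], 99, comm, &req[nreq++]);
        MPI_Irecv(rbuf[j], cnt[j], MPI_DOUBLE, neib[j], 99, comm, &req[nreq++]);
    }
    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);

    for (j = 0; j < nneib; j++)             /* assemble external interface entries */
        for (k = 0; k < cnt[j]; k++)
            v[map[j][k]] += rbuf[j][k];
}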
If the finite element domain is composed of same-type elements, then decomposi-
tion into subdomains with an equal number of elements provides compute load bal-
ancing among processors. For minimization of interprocessor data communication
it is desirable to produce subdomains with a minimal interface boundary. The RGL
algorithm introduced in Section 2 is quite suitable for domain partitioning when a
PCG solver is used. The interface boundary provided by the RGL algorithm usually
contains fewer interface nodes than that produced by the RGB algorithm. It is also
worth noting that the RGL algorithm allows multisections of the current subdomain
beyond simple bisection. This means that the total number of produced subdomains
can be a product of arbitrary prime factors, not just a power of two.

4.2 Algorithm for the parallel PCG method


A global finite element equation system
Ku = f (27)
has a sparse symmetric positive definite matrix K, which relates a load vector f and
an unknown displacement vector u. It is possible to improve the properties of the
equation system and the convergence rate of an iterative method by preconditioning,
i.e. by multiplying both sides of equation (27) by a matrix M^{-1}, which in some sense
is an approximation of K^{-1}:

M^{-1} K u = M^{-1} f.                                          (28)
The simplest form of preconditioning is diagonal preconditioning, in which matrix M
contains only diagonal entries of matrix K.
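In the serial setting the application of such a preconditioner is just an element-wise division, as sketched below with illustrative names; in the parallel algorithms of the following subsections this operation is applied to the subdomain vectors, in algorithm (30) separately for the interior and interface entries.

/* Diagonal (Jacobi) preconditioning step w = M^{-1} r in the serial
   setting: diagK holds the diagonal entries of the stiffness matrix K. */
void precondition(int n, const double *diagK, const double *r, double *w)
{
    int i;
    for (i = 0; i < n; i++)
        w[i] = r[i] / diagK[i];        /* element-wise division */
}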

4.2.1 Parallel PCG algorithm

The iteration procedure of the PCG algorithm contains two matrix-vector products
accompanied by calculation of several inner products. For partitioned data, results of
matrix-vector products are distributed vectors. Parallel implementation of the PCG
algorithm requires interprocessor communication of boundary data to perform inner
product calculation since distributed vectors should be transformed into an accumu-
lated form. Reduction operations are necessary for computing scalar quantities.

Using the procedure presented earlier for distributed-accumulated vector transfor-
mation, a parallel implementation of the preconditioned conjugate gradient method
can be presented as follows:

ū_0 = 0
r_0 = f
Send r_0^b, receive r_0^eb, r̄_0 = r_0 + r_0^eb
do i = 0, 1, ...
    w_i = M^{-1} r̄_i
    γ_i = r̄_i^T w_i
    Reduce γ_i
    Send w_i^b, receive w_i^eb, w̄_i = w_i + w_i^eb
    if i = 0  p̄_i = w̄_i                                         (29)
    else  p̄_i = w̄_i + (γ_i/γ_{i−1}) p̄_{i−1}
    w_i = K p̄_i
    β_i = p̄_i^T w_i
    Reduce β_i
    Send w_i^b, receive w_i^eb, w̄_i = w_i + w_i^eb
    ū_i = ū_{i−1} + (γ_i/β_i) p̄_i
    r̄_i = r̄_{i−1} − (γ_i/β_i) w̄_i
    if γ_i/γ_0 < ε exit
end do

Here i is the iteration number; K is the equation system matrix (global stiffness ma-
trix); f is the right-hand side (external load); u is the unknown displacement vector;
M is the preconditioning matrix; r is a residual vector; w and p are working vectors;
and ε is a specified error tolerance. Accumulated vectors are marked by a bar: for
example, w is the distributed vector and w̄ is the same vector in accumulated form.
Superscripts b and eb denote entries for the interface (boundary) nodes and the external
interface nodes of a subdomain (Fig. 8).
The result of multiplying a distributed matrix by an accumulated vector is a distributed
vector. In order to transform the distributed vector into its accumulated form, interface
data is communicated between neighboring subdomains and received interface data is
assembled into the distributed vector.
Two communications of boundary data between neighboring subdomains are re-
quired inside each iteration cycle after calculation of vector w. Two reduction opera-
tions are necessary for obtaining the total values of the scalars γ and β from their
partial sums located on the processor nodes.

4.2.2 Efficient data communication in PCG algorithm

To increase the efficiency of the parallel PCG algorithm, it is possible to use nonblock-
ing communications for interface nodes and to overlap communication with compu-
tation. An approach to economizing computing time during the PCG iteration procedure
is as follows:
Start the communication;
Do some computation with data independent of the communicated array;
Wait for completion of communication;
Continue computation.
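With MPI this pattern maps onto nonblocking point-to-point calls, as sketched below for a single neighbor; compute_interior and compute_interface stand for application routines and, like the other names, are placeholders of the sketch.

#include <mpi.h>

void compute_interior(void);                    /* placeholder application routines */
void compute_interface(const double *rbuf, int nb);

/* Skeleton of the overlap pattern above for one exchange of nb interface
   values with a single neighboring subdomain (illustrative names). */
void exchange_overlap(double *sbuf, double *rbuf, int nb, int neighbor,
                      MPI_Comm comm)
{
    MPI_Request req[2];

    /* start the communication */
    MPI_Isend(sbuf, nb, MPI_DOUBLE, neighbor, 0, comm, &req[0]);
    MPI_Irecv(rbuf, nb, MPI_DOUBLE, neighbor, 0, comm, &req[1]);

    /* computation with data independent of the communicated arrays */
    compute_interior();

    /* wait for completion of communication */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

    /* continue computation with the received interface data */
    compute_interface(rbuf, nb);
}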
A parallel PCG algorithm with efficient communications can be presented as follows:

ū_0 = 0
r_0 = f
Send r_0^b and receive r_0^eb, r̄_0 = r_0 + r_0^eb
do i = 0, 1, ...
    w_i^b = M^{-1} r̄_i^b
    Start send w_i^b and receive w_i^eb
    w_i^i = M^{-1} r̄_i^i
    if i > 0  ū_{i−1} = ū_{i−2} + (γ_{i−1}/β_{i−1}) p̄_{i−1}
    γ_i = r̄_i^T w_i
    Reduce γ_i
    Wait for receive w_i^eb, w̄_i = w_i + w_i^eb                 (30)
    if i = 0  p̄_i = w̄_i
    else  p̄_i = w̄_i + (γ_i/γ_{i−1}) p̄_{i−1}
    w_i^b = K p̄_i^b
    Start send w_i^b and receive w_i^eb
    β_i = p̄_i^T w_i
    Reduce β_i
    w_i^i = K p̄_i^i
    Wait for receive w_i^eb, w̄_i = w_i + w_i^eb
    r̄_i = r̄_{i−1} − (γ_i/β_i) w̄_i
    if γ_i/γ_0 < ε { ū_i = ū_{i−1} + (γ_i/β_i) p̄_i; exit }
end do

Since interior and interface nodes are separated, communication of the interface data
w^b can be overlapped with computation for the interior nodes w^i. At
first, the interface entries are disassembled and send and receive operations are started
for the interface data. Then matrix and vector calculations for the interior nodes are
performed. After waiting for completion of the receive operation, the interface data
should be assembled into the subdomain vector. Finally, other calculations, which
involve the interface data, can be continued. Another possibility to increase algorithm
efficiency is to overlap nonblocking communication with the solution update ū_i =
ū_{i−1} + (γ_i/β_i) p̄_i. Since the solution vector ū is not used in any other computation
inside the iteration cycle, its update can be done at any time.

Figure 9: Fixed size speedup of the parallel PCG algorithm for a mesh of 4096 20-
node elements.

4.3 Examples
Sequential and parallel finite element routines based on the domain decomposition
method with the PCG equation solver have been developed in the C programming
language. The programs were run on an IBM SP2 computer, using MPI (the Message
Passing Interface) to execute the parallel code.
Fixed size speedup was investigated for a three-dimensional mesh consisting of
4096 20-node hexahedral elements (16×16×16) and 56355 degrees of freedom. Four
divisions into subdomains were used: 4 subdomains = 2×2×1; 8 subdomains =
2×2×2; 16 subdomains = 4×2×2; 32 subdomains = 4×4×2.
Performance results of the PCG method with overlapped communication and
computation (30) for the problem of 4096 3D 20-node elements are plotted in Fig. 9.
Parallel efficiency varies from 0.92 for 4-subdomain partitioning to 0.63 for 32-subdo-
main partitioning. The lower parallel efficiency for larger numbers of processors is
related to the small size of the problem solved on each processor node. For example, just 128
elements belong to each processor in the case of 32 subdomain partitioning.
Individual meshes with element numbers ranging from 1000 to 48000 were used to
determine scaled speedup for the parallel PCG algorithm using up to 48 processors.
In addition to the four above-mentioned partitions, a partition into 48 subdomains
(4×4×3) was generated. Scaled speedup values corresponding to 1000 quadratic 20-
node elements per processor are presented in Fig. 10. The results on scaled parallel
speedup are quite satisfactory. The parallel efficiency equals 0.86 for the 48-subdomain
partitioning.

Figure 10: Scaled speedup of the parallel PCG algorithm for meshes with 1000 20-node
elements per processor.

5 Conclusion
The domain decomposition method is a useful technique for partitioning a finite el-
ement problem into smaller subproblems, which can be executed on separate pro-
cessor nodes. Efficiency of parallel computations depends on the domain partitioning.
A domain partitioning algorithm should produce subdomains with a minimal number of
interface nodes and with compute load balancing across processors. When a direct solu-
tion method is applied, partitioning of a domain into subdomains with equal numbers
of elements leads to significant load imbalance among processors. In this chapter, the
recursive graph labeling algorithm is used for the distribution of elements among sub-
domains. Compute load balancing is achieved by solution of an algebraic problem to
find subdomains requiring the same operation count. The partitioning algorithm is able to
produce balanced subdomain divisions in 1-3 iterations.
An algorithm of the preconditioned conjugate gradient method based on the domain
decomposition with nonoverlapping subdomains is presented. The algorithm formu-
lation contains local vectors in distributed and accumulated forms. Division of the
computational domain consisting of same-type elements into subdomains with equal
numbers of elements provides compute load balancing across processor nodes. The
efficient implementation of the parallel PCG algorithm uses nonblocking communica-
tions for interface nodes and overlapping of communication with computation.

References
[1] J.S. Przemieniecki, “Matrix structural analysis of substructures”, AIAA Journal,
1, 138-147, 1963.
[2] I. Babuška and H.C. Elman, “Some aspects of parallel implementation of the
finite-element method on message passing architectures”, Journal of Computa-
tional and Applied Mathematics, 27, 157-187, 1989.
[3] K. Schloegel, G. Karypis and V. Kumar, “Graph partitioning for high per-
formance scientific simulations”, CRPC Parallel Computing Handbook, Morgan
Kaufmann, 2001.
[4] C. Farhat, “A simple and efficient automatic FEM domain decomposer”, Com-
puters and Structures, 28, 579-602, 1988.
[5] H.D. Simon, “Partitioning of unstructured problems for parallel processing”,
Computing Systems in Engineering, 2, 135-148, 1991.
[6] Y.F. Hu and R.J. Blake, “Numerical experiences with partitioning of unstructured
meshes”, Parallel Computing, 20, 815-829, 1994.
[7] S. Gupta and M.R. Ramirez, “A mapping algorithm for domain decomposition in
massively parallel finite element analysis”, Computing Systems in Engineering, 6,
111-150, 1995.
[8] C. Farhat, N. Maman and G.W. Brown, “Mesh partitioning for implicit compu-
tations via iterative domain decomposition: impact and optimization of the subdo-
main aspect ratio”, International Journal for Numerical Methods in Engineering,
38, 989-1000, 1995.
[9] D. Vanderstraeten and R. Keunings, “Optimized partitioning of unstructured finite
element meshes”, International Journal for Numerical Methods in Engineering, 38,
433-450, 1995.
[10] C.H. Walshaw, M. Cross and M.G. Everett, “A localized algorithm for optimiz-
ing unstructured mesh partitions”, International Journal for Supercomputer Appli-
cations, 9, 280-295, 1995.
[11] G.P. Nikishkov, A. Makinouchi, G. Yagawa and S. Yoshimura, “An algorithm
for domain partitioning with load balancing”, Engineering Computations, 16, 120-
135, 1999.
[12] S.W. Sloan, “A FORTRAN program for profile and wavefront reduction”, Inter-
national Journal for Numerical Methods in Engineering, 28, 2651-2679, 1989.

[13] C. Farhat and E. Wilson, “A parallel active column equation solver”, Computers
and Structures, 28, 289-304, 1988.

[14] G.P. Nikishkov, A. Makinouchi, G. Yagawa and S. Yoshimura, “Performance
study of the domain decomposition method with direct equation solver for parallel
finite element analysis”, Computational Mechanics, 19, 84-93, 1996.

[15] G.P. Nikishkov, M. Kawka, A. Makinouchi, G. Yagawa and S. Yoshimura, “Port-
ing an industrial sheet metal forming code to a distributed memory parallel com-
puter”, Computers and Structures, 67, 439-449, 1998.

[16] Y. Saad, Iterative methods for sparse linear systems, PWS Publishing, Boston,
1996, 447 pp.

[17] O. Axelsson, Iterative solution methods, Cambridge University Press, 1996, 654
pp.

[18] A. Basermann, B. Reichel and C. Schelthoff, “Preconditioned CG methods for
sparse matrices on massively parallel machines”, Parallel Computing, 23, 381-
398, 1997.

[19] G.P. Nikishkov and A. Makinouchi, “Parallel implementation of the PCG
method with nonoverlapping finite element domain decomposition”, Parallel and
Distributed Computing Systems, ISCA 12th Int. Conf., Fort Lauderdale, FL, USA,
Aug. 18-20, 1999 (Ed. S. Olariu and J. Wu), ISCA, 540-545, 1999.

[20] G. Haase, “New matrix-by-vector multiplications based on nonoverlapping do-
main decomposition data distribution”, Lecture Notes in Computer Science, 1300,
726-733, 1997.
