
INTERNATIONAL JOURNAL FOR NUMERICAL METHODS IN ENGINEERING

Int. J. Numer. Meth. Engng (2011)


Published online in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/nme.3179

A new parallel sparse direct solver: Presentation and numerical experiments in large-scale structural mechanics parallel computing

I. Guèye1, ∗, † , S. El Arem2 , F. Feyel1 , F.-X. Roux1 and G. Cailletaud2


1 ONERA—Centre de CHÂTILLON, 29 avenue de la Division Leclerc, 92322 CHÂTILLON Cedex, France
2 Centre des Matériaux P. M. FOURT, MINES ParisTech—UMR CNRS 7633, B.P. 87 91003 EVRY Cedex, France

SUMMARY
The main purpose of this work is to present a new parallel direct solver: the Dissection solver. It is based on an LU factorization of the sparse matrix of the linear system and automatically detects and properly handles the zero-energy modes, which are important when dealing with DDM. A performance evaluation and comparisons with other direct solvers (MUMPS, DSCPACK) are also given for both sequential and parallel computations. Results of numerical experiments with a two-level parallelization of large-scale structural analysis problems are also presented: FETI is used for the global problem parallelization and Dissection for the local multithreading. In this framework, the largest problem we have solved is an elastic solid composed of 400 subdomains running on 400 computation nodes (3200 cores) and containing about 165 million dof. The computation of one single iteration consumes less than 20 min of CPU time. Several comparisons to MUMPS are given for the numerical computation of large-scale linear systems on a massively parallel cluster: the performance and weaknesses of this new solver are highlighted. Copyright © 2011 John Wiley & Sons, Ltd.

Received 3 October 2010; Revised 4 January 2011; Accepted 2 February 2011

KEY WORDS: DDM; linear sparse direct solver; finite element method; FETI; nested dissection; structural
mechanics

1. INTRODUCTION

Within the framework of the Finite Element Method (FEM) for the numerical modeling of structural mechanics problems, one is mostly concerned with solving the linear systems arising from the discretization of Partial Differential Equations (PDE), which represents the single largest part of the overall computing cost (CPU time and memory requirements) in three-dimensional implicit simulations. Moreover, this discretization often leads to systems of equations that are difficult to solve since, in the modern design and simulation of industrial structures, it has become routinely necessary to take into account non-linearities of various origins, strong heterogeneities, tortuous geometries, and complex boundary conditions. The resulting linear systems are large and ill-conditioned and can have millions of unknowns, especially when the aim is to model complex industrial structures at real scale.
The resort to Domain Decomposition Methods (DDM) is becoming automatic because of the flexibility they offer for solving both linear and non-linear large systems of equations by dividing the structure into many substructures (subdomains) and using multiple processing elements simultaneously. These methods are based on quite simple and intuitive ideas: a large problem is reduced to a collection of smaller problems that are computationally easier to solve than the undecomposed problem,

∗ Correspondence to: I. Guèye, ONERA—Centre de CHÂTILLON, 29, avenue de la Division Leclerc, 92322
CHÂTILLON Cedex, France.
† E-mail: ibrahima.gueye@onera.fr




most or all of which can be solved independently and concurrently. Thus, divide and conquer methods are based on splitting the physical domain of the PDE into smaller subdomains forming a partition of the original domain, and by design they are suited for implementation on parallel computer architectures. These methods have shown great flexibility in treating complex geometries and heterogeneities in PDE, even on serial computers. Parallelism has been employed for many years, mainly in high-performance computing, but interest in it has grown lately owing to the physical constraints preventing further frequency scaling. Increasingly, parallel processing is seen as the only cost-effective approach to the fast solution of computationally large and data-intensive problems.
Although divide and conquer methods, such as the substructuring method in structural engineering, predate DDM, interest in DDM for PDE was spawned only after the development of parallel computer architectures.
This decomposition may enter at the continuous level, where different physical models may
be used in different regions, or at the discretization level, where it may be convenient to employ
different approximation methods in different regions, or in the solution of the algebraic systems
arising from the approximation of the PDE. These three aspects are very often interconnected in
practice [1]. For more extensive and broader treatments of DDM, with a strong emphasis on the algebraic and mathematical aspects, the reader is referred to the monographs [1, 2] and surveys [3–6].
Based on a dual approach to introduce the continuity conditions at the interface between
subdomains, Finite Element Tearing and Interconnecting (FETI) is the most commonly used
non-overlapping DDM [4, 5, 7, 8]. It is a dual Schur complement method where preconditioned
conjugate gradient iterations are applied to find the interface forces satisfying the interface displace-
ment compatibility. FETI is a robust and suitable method for structural mechanics problems. With
the Balancing DDM [3, 9], FETI is among the first non-overlapping DDM that have demonstrated numerical scalability with respect to both the mesh and subdomain sizes. However, studies carried out on large numerical tests showed that its effectiveness decreases beyond a few hundred subdomains. Thus, if we aim to split a large-scale model into a reasonable number of subdomains, the local systems to be solved also become huge. Moreover, with the evolution of microprocessor technology, we are witnessing a rapid development and spread of new multi-core architectures (in terms of number of processors, data exchange bandwidth between processors, parallel library efficiency, and processor clock frequencies), which is deeply linked to the growing importance of DDM in intensive scientific computation. The essential interest and strength of multi-core solutions is to enable the simultaneous execution of threads on the various cores [10].
The main objective of the current work is to present a new parallel direct solver for large sparse linear systems: the Dissection solver [11]. This solver is based on an LU factorization of the sparse matrix. It also automatically detects and properly handles the zero-energy modes, which are important when dealing with DDM. Performance evaluations and comparisons with other direct solvers (MUMPS, DSCPACK, ...) are also given for both sequential and parallel computations.
We also present the latest results of the parallel computations carried out at the Centre des Matériaux, MINES ParisTech, to measure the performance obtained using the ZéBuLoN Finite Element Analysis (FEA) code on the JADE cluster of CINES‡ (23 040 cores, 267.88 TFlops theoretical peak performance, 237.80 TFlops maximal LINPACK performance, 18th in the TOP500 ranking of June 2010). Many FE numerical experiments have been carried out using
a two-level parallelization: FETI solver for the global problem and Dissection solver for the local
problem. First, a speed-up study is carried out to determine the optimal machine configuration for a given FE problem size (number of subdomains, number of degrees of freedom (dof) per subdomain, number of processes per machine). A parametric study is then presented to show how many processes should be run on a single computation node (with two processors of four cores each and 32 GB of shared RAM) for an optimized two-level parallelization. In this study, an optimal configuration is to be found: there is an advantage in splitting the problem into a large number of subdomains (each subdomain becomes smaller), but the number of processes running on the same computation node then increases. The splitting

‡ Centre Informatique National de l’Enseignement Supérieur, Montpellier, France.


advantage is thus limited by the progressive disappearance of the local parallelism, and also by the loss of efficiency of the FETI method when the size of the elementary problem becomes too small. For a structural problem of a given size (≈ 3.3 million unknowns in this study), the influence of two parameters is discussed: the number of subdomains (FETI effects) and the number of computation nodes used (multithreading effects).
Finally, we present the results of the scalability study performed to identify the size of the largest problem that can be solved on the JADE cluster. The global problem is a train-like succession of cubes (subdomains). Each subdomain is treated on a separate machine, taking advantage of the maximum possible multithreading. Two cases are considered, depending on the presence or absence of zero-energy modes. The largest problem we have solved is an elastic solid composed of 400 subdomains running on 400 computation nodes and containing 164 754 603 dof. The computation of one single iteration consumes around 20 min of CPU time.

2. PARALLEL DIRECT SOLVER FOR LARGE SPARSE LINEAR SYSTEMS

Solving sparse linear systems $Mx = b$ by direct methods is often based on the Gaussian elimination method. Rather than handling the system directly, the matrix $M = (M_{ij})$ is factorized via an LU decomposition, where $L = (L_{ij})$ is a lower triangular matrix and $U = (U_{ij})$ is an upper triangular matrix. This factorization relies mainly on three steps (a short illustrative sketch is given after the list):
• an analysis step, which computes a reordering and a symbolic factorization of the matrix. The reordering techniques are used to minimize the fill-in of the matrix during the numerical factorization and to exhibit as many independent calculations as possible. This step produces an ordering (permutation) of the matrix and an elimination tree. The elimination tree is then used to carry out the subsequent steps.
• a numerical factorization step, which determines the lower and upper triangular factors of the matrix according to the elimination tree previously produced.
• a solution step, in which the numerical solution of the system is obtained by solving the lower and upper triangular systems resulting from the factorization (forward elimination and backward substitution). For linear problems, this step accounts for nearly the whole of the time consumed by the direct method.
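The separation between these three steps can be illustrated with a few lines of SciPy; the sketch below uses the SuperLU wrapper purely as a stand-in for a generic sparse direct solver (it is not the Dissection solver, and the COLAMD ordering stands in for the nested dissection ordering described later).

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# small sparse test matrix (1-D Laplacian-like stencil) and right-hand side
n = 1000
A = sp.diags([-1.0, 2.5, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

# analysis + numerical factorization: fill-reducing ordering and LU factors
lu = spla.splu(A, permc_spec="COLAMD")

# solution step: forward elimination and backward substitution; the factor
# object can be reused for any number of right-hand sides at small extra cost
x = lu.solve(b)
print(np.linalg.norm(A @ x - b))
```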
In recent years, many parallel sparse direct solvers, such as DSCPACK [12] and MUMPS [13],
have been developed and have proved to be robust, reliable, and efficient for a wide range of
practical problems.
In this section, we present a new parallel direct solver based on an LU factorization of the sparse matrix of the linear system. The matrix can be symmetric or non-symmetric. The ability of a solver to detect singularities is essential in DDM, both for the operator itself and when automatically building optimal preconditioners. When factorizing singular linear systems, a dedicated strategy is used to handle zero-energy modes automatically and properly.

2.1. Serial implementation


2.1.1. The ordering strategy. The ordering step is based on a nested dissection method [14, 15]. This technique allows more parallelism in the factorization than orderings based on minimum degree techniques [16] and often produces better results.
The nested dissection approach is based on a recursive bisection of the graph of the matrix M
to be factorized. A first bisection is performed by selecting a set of vertices forming a separator.
This separator is then removed from the original graph, which generates a partition into two
disconnected subgraphs. The separator is chosen so that its size is as small as possible and that
the obtained subgraphs have equivalent sizes. Each of these subgraphs is then bisected recursively
following the same principle until the size of the generated substructures is sufficiently small. The
bisection is managed by METIS [17], which is primarily a tool for partitioning graphs or meshes.

Copyright 䉷 2011 John Wiley & Sons, Ltd. Int. J. Numer. Meth. Engng (2011)
DOI: 10.1002/nme
I. GUÈYE ET AL.


Figure 1. Example of a sparse linear system: (a) structure of matrix M and (b) graph or mesh nodes.

Figure 2. Recursive substructuring by bisection.

It aims to split a given graph into subgraphs of similar sizes while minimizing the separator sizes in order to reduce the required communication.
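As an illustration of the recursion (and not of the authors' implementation), the following Python sketch computes a nested dissection ordering of a graph stored as an adjacency dictionary; a naive BFS-based split stands in for the METIS bisection, so the separators it finds will generally differ from those of Figure 2.

```python
from collections import deque

def bfs_order(nodes, adj):
    """Visit the vertices of the induced subgraph in BFS order (all components)."""
    nodeset, order, seen = set(nodes), [], set()
    for s in nodes:
        if s in seen:
            continue
        seen.add(s)
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w in nodeset and w not in seen:
                    seen.add(w)
                    queue.append(w)
    return order

def bisect(nodes, adj):
    """Naive stand-in for METIS: split the subgraph into two halves by BFS order,
    then move the boundary vertices of the first half into a vertex separator."""
    order = bfs_order(nodes, adj)
    half = len(order) // 2
    a, b = set(order[:half]), set(order[half:])
    sep = {v for v in a if any(w in b for w in adj[v])}
    return a - sep, b, sep

def nested_dissection(nodes, adj, min_size=2):
    """Elimination order: subdomain unknowns first, their separator last."""
    nodes = list(nodes)
    if len(nodes) <= min_size:
        return nodes
    a, b, sep = bisect(nodes, adj)
    return (nested_dissection(a, adj, min_size)
            + nested_dissection(b, adj, min_size)
            + sorted(sep))

# 3 x 5 grid graph numbered row by row, consistent with the 15-node example
# of Figure 1(b)
adj = {i: set() for i in range(15)}
for i in range(15):
    r, c = divmod(i, 5)
    if c < 4:
        adj[i].add(i + 1); adj[i + 1].add(i)
    if r < 2:
        adj[i].add(i + 5); adj[i + 5].add(i)
print(nested_dissection(range(15), adj))
```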
To illustrate the principle of the nested dissection technique, we consider a small sparse linear system whose matrix M has the structure shown in Figure 1(a). The graph (or mesh nodes) representing the matrix M is given in Figure 1(b). In the following, $\Lambda_j^l$ denotes the $j$th separator of level $l$, the root of the recursion being the highest level.
In this example, we first perform a bisection by selecting the set of unknowns {2, 7, 12}. With these nodes, we form a separator $\Lambda_1^2$. This separator is removed from the graph, which generates a partition into two disconnected subgraphs; these are subsequently split following the same technique. We obtain the separators $\Lambda_1^1$ and $\Lambda_2^1$, formed by the vertices {5, 6} and {8, 9}, respectively. Finally, we generate the substructures $I_i$ with the sets of unknowns {0, 1}, {10, 11}, {3, 4}, and {13, 14} (Figure 2).
The nested dissection method is an easy-to-parallelize algorithm based on the old algorithmic paradigm of 'divide and conquer'. It generates balanced supernodal elimination trees whose supernodes are sets of unknowns (see Figure 3). These trees reflect the dependencies between unknowns during the elimination and therefore make the parallel numerical factorization easier.
Using this supernodal elimination tree, the system matrix resulting from the original problem can be written as shown in Figure 4.
The block matrices $M_{I_iI_i}$ and $M_{\Lambda_j^l\Lambda_j^l}$ ($l>0$) on the diagonal correspond to the unknowns of the same level. It should be noted that each diagonal block $M_{I_iI_i}$ has a sparse tridiagonal structure, whereas each diagonal block $M_{\Lambda_j^l\Lambda_j^l}$ at level $l$ is a full matrix. All the off-diagonal sub-matrices $M_{I_i\Lambda_j^l}$ and $M_{\Lambda_j^lI_i}$ are sparse and correspond to the connections between the unknowns in the subdomain $I_i$


Figure 3. Supernodal elimination tree.

Figure 4. Reordered matrix M.

and those belonging to the separator $\Lambda_j^l$. The off-diagonal blocks $M_{\Lambda_i^l\Lambda_j^m}$ ($l \neq m$) are full and represent the connections between the unknowns in the separator $\Lambda_i^l$ and those in the separator $\Lambda_j^m$.
The overall structure of the matrix M is illustrated in Figure 4, and the resulting linear system can be written as
$$ Mx = \begin{bmatrix} M_{II} & M_{IL_1} & M_{IL_2} \\ M_{L_1I} & M_{L_1L_1} & M_{L_1L_2} \\ M_{L_2I} & M_{L_2L_1} & M_{L_2L_2} \end{bmatrix} \begin{bmatrix} x_I \\ x_{L_1} \\ x_{L_2} \end{bmatrix} = \begin{bmatrix} b_I \\ b_{L_1} \\ b_{L_2} \end{bmatrix} = b \qquad (1) $$

where the sub-matrices $M_{II}$ and $M_{L_kL_k}$ are block-diagonal matrices formed, respectively, by all the blocks $M_{I_iI_i}$ and $M_{\Lambda_j^k\Lambda_j^k}$. All the off-diagonal sub-matrices $M_{IL_k}$, $M_{L_kI}$, $M_{L_kL_l}$, and $M_{L_lL_k}$ are formed, respectively, by the blocks $M_{I_i\Lambda_j^k}$, $M_{\Lambda_j^kI_i}$, $M_{\Lambda_j^k\Lambda_m^l}$, and $M_{\Lambda_m^l\Lambda_j^k}$.

2.1.2. The block factorization. The solution of the linear system of equations M x = b can be
obtained in the following way: first, the matrix M (Equation (1)) is factorized and then the solution
is found by forward and backward substitutions. This two-step procedure is more appropriate for
problems involving matrices with constant coefficients. Before performing this factorization, we


carefully scale the original matrix M, so that general types of linear systems can be handled. The factorization consists of applying the LU decomposition to the matrix as follows:
$$ M = LU = \begin{bmatrix} M_{II} & 0 & 0 \\ M_{L_1I} & S_{L_1L_1} & 0 \\ M_{L_2I} & S_{L_2L_1} & S_{L_2L_2} \end{bmatrix} \begin{bmatrix} I & M_{II}^{-1}M_{IL_1} & M_{II}^{-1}M_{IL_2} \\ 0 & I & S_{L_1L_1}^{-1}S_{L_1L_2} \\ 0 & 0 & I \end{bmatrix} \qquad (2) $$

where $I$ is the identity matrix and $S_{L_1L_1}$, $S_{L_1L_2}$, $S_{L_2L_1}$, and $S_{L_2L_2}$ are Schur complements defined by

$$\begin{aligned} S_{L_1L_1} &= M_{L_1L_1} - M_{L_1I}M_{II}^{-1}M_{IL_1} \\ S_{L_1L_2} &= M_{L_1L_2} - M_{L_1I}M_{II}^{-1}M_{IL_2} \\ S_{L_2L_1} &= M_{L_2L_1} - M_{L_2I}M_{II}^{-1}M_{IL_1} \\ S_{L_2L_2} &= \bar{S}_{L_2L_2} - S_{L_2L_1}S_{L_1L_1}^{-1}S_{L_1L_2} \quad\text{with}\quad \bar{S}_{L_2L_2} = M_{L_2L_2} - M_{L_2I}M_{II}^{-1}M_{IL_2}. \end{aligned} \qquad (3)$$

Note that the computation of the Schur complements in Equation (3) will introduce fill-in in the block diagonal matrices $M_{L_1L_1}$ and $M_{L_2L_2}$, as well as in $M_{L_1L_2}$ and $M_{L_2L_1}$. Referring to Figure 4 and the first equation of (3), the matrix $S_{L_1L_1}$ is obtained by computing
$$ S_{L_1L_1} = \begin{bmatrix} S_{\Lambda_1^1\Lambda_1^1} & 0 \\ 0 & S_{\Lambda_2^1\Lambda_2^1} \end{bmatrix} \qquad (4) $$

where
$$\begin{aligned} S_{\Lambda_1^1\Lambda_1^1} &= M_{\Lambda_1^1\Lambda_1^1} - M_{\Lambda_1^1I_1}M_{I_1I_1}^{-1}M_{I_1\Lambda_1^1} - M_{\Lambda_1^1I_2}M_{I_2I_2}^{-1}M_{I_2\Lambda_1^1} \\ S_{\Lambda_2^1\Lambda_2^1} &= M_{\Lambda_2^1\Lambda_2^1} - M_{\Lambda_2^1I_3}M_{I_3I_3}^{-1}M_{I_3\Lambda_2^1} - M_{\Lambda_2^1I_4}M_{I_4I_4}^{-1}M_{I_4\Lambda_2^1} \end{aligned} \qquad (5)$$

Likewise, the Schur complements $S_{L_1L_2}$, $S_{L_2L_1}$, and $\bar{S}_{L_2L_2}$ are obtained by computing the blocks $S_{\Lambda_j^1\Lambda_1^2}$ and $S_{\Lambda_1^2\Lambda_j^1}$ using
$$\begin{aligned} S_{\Lambda_1^1\Lambda_1^2} &= M_{\Lambda_1^1\Lambda_1^2} - M_{\Lambda_1^1I_1}M_{I_1I_1}^{-1}M_{I_1\Lambda_1^2} - M_{\Lambda_1^1I_2}M_{I_2I_2}^{-1}M_{I_2\Lambda_1^2} \\ S_{\Lambda_2^1\Lambda_1^2} &= M_{\Lambda_2^1\Lambda_1^2} - M_{\Lambda_2^1I_3}M_{I_3I_3}^{-1}M_{I_3\Lambda_1^2} - M_{\Lambda_2^1I_4}M_{I_4I_4}^{-1}M_{I_4\Lambda_1^2} \end{aligned} \qquad (6)$$
$$\begin{aligned} S_{\Lambda_1^2\Lambda_1^1} &= M_{\Lambda_1^2\Lambda_1^1} - M_{\Lambda_1^2I_1}M_{I_1I_1}^{-1}M_{I_1\Lambda_1^1} - M_{\Lambda_1^2I_2}M_{I_2I_2}^{-1}M_{I_2\Lambda_1^1} \\ S_{\Lambda_1^2\Lambda_2^1} &= M_{\Lambda_1^2\Lambda_2^1} - M_{\Lambda_1^2I_3}M_{I_3I_3}^{-1}M_{I_3\Lambda_2^1} - M_{\Lambda_1^2I_4}M_{I_4I_4}^{-1}M_{I_4\Lambda_2^1} \end{aligned} \qquad (7)$$
and
$$ \bar{S}_{L_2L_2} = M_{\Lambda_1^2\Lambda_1^2} - \sum_{i=1}^{4} M_{\Lambda_1^2I_i}M_{I_iI_i}^{-1}M_{I_i\Lambda_1^2}. \qquad (8) $$

In all the operations above, multiplications with the inverse of a matrix block $M_{I_iI_i}$ are replaced by two solution steps using the Crout factorization $M_{I_iI_i} = L_{I_iI_i}D_{I_iI_i}U_{I_iI_i}$ (or $L_{I_iI_i}D_{I_iI_i}L_{I_iI_i}^T$ in the symmetric case). The matrix products with $M_{\Lambda_j^lI_i}$ are performed as sparse matrix operations.
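As a concrete (dense and purely illustrative) prototype of this block elimination, the NumPy sketch below forms the Schur complements of Equation (3) for an arbitrary partition into I, L1, and L2 unknowns and checks that the block factors of Equation (2) reproduce M; the real solver of course works on sparse blocks and replaces the explicit inverses by Crout solves.

```python
import numpy as np

rng = np.random.default_rng(0)
nI, n1, n2 = 12, 6, 4                              # sizes of the I, L1 and L2 blocks
n = nI + n1 + n2
M = rng.standard_normal((n, n)) + n * np.eye(n)    # well-conditioned test matrix

I_, L1, L2 = slice(0, nI), slice(nI, nI + n1), slice(nI + n1, n)
MII_inv = np.linalg.inv(M[I_, I_])                 # stands for the local Crout solves

# Schur complements of Equation (3)
S11 = M[L1, L1] - M[L1, I_] @ MII_inv @ M[I_, L1]
S12 = M[L1, L2] - M[L1, I_] @ MII_inv @ M[I_, L2]
S21 = M[L2, L1] - M[L2, I_] @ MII_inv @ M[I_, L1]
S22bar = M[L2, L2] - M[L2, I_] @ MII_inv @ M[I_, L2]
S22 = S22bar - S21 @ np.linalg.inv(S11) @ S12

# block lower/upper factors of Equation (2)
L = np.block([[M[I_, I_],          np.zeros((nI, n1)),   np.zeros((nI, n2))],
              [M[L1, I_],          S11,                  np.zeros((n1, n2))],
              [M[L2, I_],          S21,                  S22               ]])
U = np.block([[np.eye(nI),         MII_inv @ M[I_, L1],  MII_inv @ M[I_, L2]],
              [np.zeros((n1, nI)), np.eye(n1),           np.linalg.inv(S11) @ S12],
              [np.zeros((n2, nI)), np.zeros((n2, n1)),   np.eye(n2)]])
print(np.allclose(L @ U, M))                       # True: the block factorization is exact
```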

2.1.3. The forward and backward substitutions. The solution of the system $Mx = b$ can be determined by solving the triangular systems $Ly = b$ (forward elimination) and $Ux = y$ (backward substitution). Here $L$ and $U$ are obtained from the block factorization of the sparse matrix $M$ (Equation (2)).


More explicitly, the solution x is obtained by successively solving the following systems of equations:
• during the forward step:
$$\begin{aligned} M_{II}\,y_I &= b_I, \\ S_{L_1L_1}\,y_{L_1} &= b_{L_1} - M_{L_1I}\,y_I, \\ S_{L_2L_2}\,x_{L_2} &= b_{L_2} - M_{L_2I}\,y_I - S_{L_2L_1}\,y_{L_1}, \end{aligned} \qquad (9)$$
• during the backward step:
$$\begin{aligned} S_{L_1L_1}\,x_{L_1} &= S_{L_1L_1}\,y_{L_1} - S_{L_1L_2}\,x_{L_2}, \\ M_{II}\,x_I &= M_{II}\,y_I - M_{IL_1}\,x_{L_1} - M_{IL_2}\,x_{L_2}. \end{aligned} \qquad (10)$$
The first equation in (9) and the second equation in (10) can be readily solved in parallel since $M_{II}$ is a block diagonal matrix. Since the full matrix $S_{L_2L_2}$ is distributed to all processors, the third equation in (9) is solved on every processor. The solutions of the equations involving the block matrix $S_{L_1L_1}$ are obtained by solving an interface problem on the separators.
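The forward and backward sweeps of Equations (9)–(10) can be prototyped in the same way; the self-contained NumPy sketch below rebuilds a small partitioned test matrix, performs the two sweeps on its Schur complement blocks, and verifies the result against a dense solve (again an illustration only, with arbitrary block sizes).

```python
import numpy as np

rng = np.random.default_rng(1)
nI, n1, n2 = 12, 6, 4
n = nI + n1 + n2
M = rng.standard_normal((n, n)) + n * np.eye(n)
b = rng.standard_normal(n)

I_, L1, L2 = slice(0, nI), slice(nI, nI + n1), slice(nI + n1, n)
MII_inv = np.linalg.inv(M[I_, I_])
S11 = M[L1, L1] - M[L1, I_] @ MII_inv @ M[I_, L1]
S12 = M[L1, L2] - M[L1, I_] @ MII_inv @ M[I_, L2]
S21 = M[L2, L1] - M[L2, I_] @ MII_inv @ M[I_, L1]
S22 = (M[L2, L2] - M[L2, I_] @ MII_inv @ M[I_, L2]) - S21 @ np.linalg.inv(S11) @ S12

# forward step, Equation (9)
yI  = np.linalg.solve(M[I_, I_], b[I_])
yL1 = np.linalg.solve(S11, b[L1] - M[L1, I_] @ yI)
xL2 = np.linalg.solve(S22, b[L2] - M[L2, I_] @ yI - S21 @ yL1)

# backward step, Equation (10)
xL1 = np.linalg.solve(S11, S11 @ yL1 - S12 @ xL2)
xI  = np.linalg.solve(M[I_, I_], M[I_, I_] @ yI - M[I_, L1] @ xL1 - M[I_, L2] @ xL2)

x = np.concatenate([xI, xL1, xL2])
print(np.allclose(x, np.linalg.solve(M, b)))   # True
```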

2.1.4. Taking into account singular linear systems. Among all the available sequential and parallel
direct solvers, only a few can automatically and properly handle zero-energy modes associated
with singular linear systems. These singularities have either physical or geometrical origins, or
could appear when splitting the initial domain into substructures. We aim to implement a parallel
direct solver that takes into account these systems.
The approach used to compute the zero-energy modes consists of first detecting the local singularities during the block factorization of M. We start the search process on the matrix $M_{II}$ associated with the substructures. Then we progress up to the root by applying the same process to the blocks $S_{L_kL_k}$ at each level $k>0$. At each level, we check whether near-zero pivots appear when performing the $LDU$ factorization of a block $M_{I_iI_i}$ or $S_{\Lambda_j^l\Lambda_j^l}$. If so, we defer the treatment of the equations in question to the end of the factorization of M.
At the end of the factorization, we obtain a list of near-zero pivots, which are candidates to be zero-energy modes of the global matrix M. To compute the zero-energy modes, we first condense the global system onto these local singularities to obtain a small Schur complement $S_s$. Then, we perform a Gaussian elimination with full pivoting on $S_s$ and check whether zero pivots are found. If no zero pivot is found, then the matrix M is non-singular. If we find a number e of zero pivots, then these pivots correspond to the actual zero-energy modes, and a basis N of the null space is built from the e corresponding rows and columns.
Finally, we write the general solution of the sparse linear system $Mx = b$ in the form $x = M^+ b + N\alpha$, where $\alpha \in \mathbb{R}^e$ is a vector of e arbitrary entries and $M^+ b$ is a particular solution of the linear system.
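A toy dense version of this detection-and-condensation strategy is sketched below for a free-free spring chain, which has exactly one zero-energy mode; it is only meant to illustrate the idea, and for brevity the rank test on the small condensed Schur complement uses an SVD instead of the full-pivot Gaussian elimination used in the actual solver.

```python
import numpy as np

def zero_energy_modes(M, tol=1e-10):
    """Detect near-zero pivots, condense onto them and return a null-space basis N."""
    n = M.shape[0]
    scale = np.abs(np.diag(M)).max()
    kept, deferred = [], []
    for j in range(n):
        # pivot of unknown j after eliminating the unknowns kept so far
        if kept:
            K = np.ix_(kept, kept)
            pivot = M[j, j] - M[j, kept] @ np.linalg.solve(M[K], M[kept, j])
        else:
            pivot = M[j, j]
        (deferred if abs(pivot) < tol * scale else kept).append(j)
    if not deferred:
        return np.zeros((n, 0))                      # M is non-singular
    R, S = np.ix_(kept, kept), np.ix_(deferred, deferred)
    # small Schur complement condensed on the suspicious equations
    Ss = M[S] - M[np.ix_(deferred, kept)] @ np.linalg.solve(M[R], M[np.ix_(kept, deferred)])
    # rank test on Ss (SVD here; the solver uses full-pivot Gaussian elimination)
    _, sv, Vt = np.linalg.svd(Ss)
    null_s = Vt[sv < tol * scale].T                  # e null-space vectors of Ss
    e = null_s.shape[1]
    N = np.zeros((n, e))
    N[deferred] = null_s
    N[kept] = -np.linalg.solve(M[R], M[np.ix_(kept, deferred)] @ null_s)
    return N

# free-free chain of unit springs: stiffness matrix with one zero-energy mode
n = 6
K = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
K[0, 0] = K[-1, -1] = 1.0
N = zero_energy_modes(K)
print(N.shape[1], np.linalg.norm(K @ N))   # 1 zero-energy mode, K @ N ~ 0
```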

2.2. Implementation of the parallel solver


In this section, the full algorithm for factorizing the matrix of the problem stated in Equation (1) is described. This factorization step is most profitable when considering, for example, time-dependent simulations with a constant coefficient matrix. The cost of the factorization, which can be a few orders of magnitude higher than that of one solution step, is then easily amortized as the number of solution steps increases.
Referring to the decomposed matrix shown in Equation (2), the factorization phase can be stated
as follows:
1. Local factorization: for $i = 1, \ldots, n_{\mathrm{doms}}$:
   (a) factorize $M_{I_iI_i} = L_{I_iI_i}D_{I_iI_i}U_{I_iI_i}$,
   (b) compute the local contributions $M_{\Lambda_j^lI_i}M_{I_iI_i}^{-1}M_{I_i\Lambda_k^m}$.


2. Static condensation: for $l = 1, \ldots, L_r - 1$:
   (a) compute and invert the Schur complements $S_{\Lambda_j^l\Lambda_j^l}$, using Equation (5),
   (b) compute the Schur complements $S_{\Lambda_j^l\Lambda_k^m}$ and $S_{\Lambda_k^m\Lambda_j^l}$, using Equations (6)–(7).
3. Compute the Schur complement $\bar{S}_{L_rL_r}$, using Equation (8).
4. Compute and invert the Schur complement $S_{\Lambda_1^{L_r}\Lambda_1^{L_r}}$.

All these steps can be performed in parallel. We also alternate, step by step, between POSIX threads and OpenMP, since they are compatible and can consequently be combined easily. The strategy we have adopted at each step is as follows (a short sketch is given after the list):
• if the number of supernodes is less than the number of available cores, then we use BLAS-3 matrix multiplication routines [18] parallelized with OpenMP,
• otherwise, POSIX threads are created.
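A Python analogue of this per-level scheduling policy is sketched below, purely as an illustration (the actual solver combines OpenMP BLAS-3 kernels and POSIX threads in its own code); it assumes NumPy is linked against a multithreaded BLAS, whose kernels release the GIL so that worker threads genuinely run concurrently.

```python
import os
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def process_level(tasks, ncores=None):
    """tasks: list of zero-argument callables, one per supernode of the current level."""
    ncores = ncores or os.cpu_count()
    if len(tasks) < ncores:
        # few large supernodes: run them one after another and let the
        # multithreaded BLAS parallelize each dense kernel internally
        return [task() for task in tasks]
    # many small supernodes: one task per worker thread; NumPy's BLAS calls
    # release the GIL, so the dense kernels run concurrently
    with ThreadPoolExecutor(max_workers=ncores) as pool:
        return list(pool.map(lambda task: task(), tasks))

# toy usage: one dense Schur-complement-like update per "supernode"
blocks = [np.random.default_rng(i).standard_normal((300, 300)) for i in range(16)]
results = process_level([(lambda B=B: B @ B.T) for B in blocks])
```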

2.3. Performance evaluation


The parallel direct solver described above has been implemented in the ZéBuLoN FEA code. To compare its performance with that of other direct solvers, simulations of 3-D linear elasticity problems have been performed on a bi-processor Intel Quad-Core Xeon X5460 64-bit machine (8 cores in total), with 32 GB of memory and a 3.16 GHz clock frequency.

2.3.1. The sequential performance. Figures 5 and 6 provide a comparison of the CPU execution times of Dissection (the solver implemented here), DSCPACK, and MUMPS. Both DSCPACK and MUMPS are based on a multifrontal approach. The main weakness of DSCPACK is that it does not handle singular systems, whereas MUMPS is not very robust for linear systems arising from very heterogeneous floating substructures. Sparse Direct and Frontal are two additional direct solvers implemented in ZéBuLoN that are able to detect zero-energy modes automatically.
We can observe in Figure 5 that the performance of the older solvers (Sparse Direct and Frontal) breaks down when the dimension of the linear system exceeds a few tens of thousands of dof, whereas the results obtained with the Dissection solver are more interesting. In Figure 6, we observe that the performance of DSCPACK is slightly better than that of Dissection. However, we should not neglect the fact that DSCPACK cannot handle the zero-energy modes of floating substructures. Accordingly, Dissection could become a more profitable solver for FETI methods than the other solvers available in the ZéBuLoN FEA code.

Figure 5. First phase of comparison of CPU execution time (CPU execution time (s) versus number of degrees of freedom: Dissection, Sparse Direct, Frontal).


Figure 6. Second phase of comparison of CPU execution time (CPU execution time (s) versus number of degrees of freedom: DSCPACK, MUMPS, Dissection).

Figure 7. Speed-up in multithreading (speed-up versus number of threads: hybrid version (Dissection), DSCPACK solver, MUMPS solver).

2.3.2. The multithreaded performance. In this case, we consider the solution of a linear elasticity problem with 206 763 dof. The multithreaded performance of Dissection is compared to that of DSCPACK and MUMPS, with the BLAS library optimized with OpenMP selected.
The results in Figure 7 show that the difference in performance between Dissection and the other solvers observed in sequential computations (using one single thread) is reduced. In part, this shows that the proposed parallelization strategy behaves well. The maximum performance gain achieved is about 2.2 on four cores.

3. APPLICATION ON LARGE-SCALE STRUCTURAL ANALYSIS PROBLEMS

After the implementation and validation of the new solver, we sought to determine the conditions of its optimal use. Many possibilities are indeed available for a problem of a given size. We


can change the number of subdomains, use more or less multithreading, load the computing nodes more or less heavily, etc.
In what follows, two cases are considered, since by solving a problem of size N using p processors, we aim either to:
• reduce the wall time (WT) by increasing p. We then expect a quasi-linear reduction of the CPU time, and we speak of strong scalability or the speed-up property
$$ S_p(N) = \frac{\text{sequential time}}{\text{time on } p \text{ processors}} = \frac{T_1(N)}{T_p(N)} \qquad (11) $$
The parallel efficiency is given by
$$ E_p(N) = \frac{S_p(N)}{p} \qquad (12) $$
• or increase the problem size by increasing p. This is the (weak) scalability property, which describes how the solution time varies with the number of processors for a fixed problem size per processor (scale-up)
$$ C_p(N) = \frac{\text{sequential time}}{\text{time on } p \text{ processors for a problem of size } pN} = \frac{T_1(N)}{T_p(pN)} \qquad (13) $$

All the large-scale numerical experiments presented in what follows have been performed on the JADE cluster of CINES. JADE is a parallel scalar supercomputer of 23 040 cores distributed over 2 880 computing nodes (SGI Altix ICE 8200EX). Each node is a bi-processor computer (two quad-core Xeon processors at 3.0 GHz) with 4 GB of RAM per core. The network fabric is a dual-plane InfiniBand (IB 4x DDR) network. With its maximal LINPACK performance of 237.80 TFlops, JADE appeared in 18th position within the TOP500 ranking of June 2010.

3.1. Speed-up performance


In this section, we consider the numerical solution of a global problem of constant size: 3 371 544 dof. The structure is fixed at one end (x = 0) and subjected to simple tension at its free end (x = 1). The material behavior is linear elastic, so that the computation involves a single load increment of one single iteration. Several computations have been performed with $N_s$ subdomains of equal size running on $N_c$ computing nodes (each node has 8 cores). FETI is used as the global solver and Dissection as the local solver. The optimized distribution of the computing tasks is given in Table I: the multithreading level is equal to 8, 4, 2, or 1 when the number of computation nodes $N_c$ equals $N_s$, $N_s/2$, $N_s/4$, or $N_s/8$, respectively.
Table II gives the FETI solver time $T_p$ and the total wall time $T_w$ for all the considered configurations. We can note, when reading Table II horizontally, that, for a given machine configuration (a given

Table I. Computing tasks distribution (multithreading).

                         Ns
Nc        8     16     32     64    128    256
8         8      4      2      1
16               8      4      2      1
32                      8      4      2      1
64                             8      4      2
128                                   8      4
256                                          8


Table II. Solver time Tp (s) and total wall time Tw (s) for different configurations.

                                   Ns
Nc            8          16         32         64        128        256
8       1186|1299    367|438    179|242    172|202
16                   252|311    132|189     98|160      54|93
32                               87|125     80|118     33|114     24|79
64                                           55|83      27|57     20|81
128                                                    22|115     18|81
256                                                               14|243

Figure 8. Evolution of Tp and Tw with Ns (time (s) versus subdomain number Ns: FETI time Tp, total wall time Tw).

number of available processors), $T_p$ decreases with the number of subdomains. This observation (FETI effect) remains valid even for a splitting into 256 subdomains, where the best $T_p$ is obtained (14 s). The effect remains visible even when the multithreading effect is completely removed (eight subdomains per computing node: $N_s = 8N_c$). However, a saturation of this performance can be observed when the number of subdomains (and hence the interface problem size) becomes large: for example, with $N_c = 128$, $T_p$ only goes from 22 to 18 s, whereas $N_s$ has been doubled (from 128 to 256). Figure 8 shows that the wall time $T_w$ starts increasing when the number of subdomains exceeds 64. In these cases, the problem loading time becomes important and even exceeds the solver time. For example, with $N_s = 256$, we have $T_p = 14$ s, whereas the time spent loading the problem data is 162 s.
The multithreading effect is visible when reading Table II vertically. For a given number of subdomains $N_s$, the best time is always obtained with one subdomain per computing node (multithreading level of 8). For example, with a splitting into 64 subdomains, $T_p$ decreases from 172 to 55 s when the multithreading level goes from 1 (minimum) to 8 (maximum). Table III summarizes the main speed-up results. It appears that the configuration leading to the optimal parallel performance (342%) corresponds to a splitting into 32 subdomains distributed over 32 machines. In this case, 256 cores are exploited to achieve 87 s of solver time.


Table III. Speed-up results for Ns = Nc (one subdomain per computing node): solver time Tp (s), speed-up Sp, and parallel efficiency Ep.

Nc                    8        16        32        64       128       256
dof/subdomain   421 443   215 523   110 211    56 355    29 427    15 363
Niter                44        49        62        70        67        68
Tp (s)             1186       252        87        55        22        14
Sp                    8     37.62     109.5     171.5     437.2     693.5
Ep                    1     2.351     3.422     2.681     3.415     2.709

Figure 9. Train-like stacking of subdomains with possible zero-energy movements (subdomains 1 to N along the x axis, from x = 0 to x = N).

Table IV. Scalability measurements: the number of computing nodes equals the number of subdomains. Zero-energy modes are present.

                                            Solver time Tp (s)    Total wall time Tw (s)
Number of subdomains Ns   dof (millions)   Dissection    MUMPS    Dissection    MUMPS
2                               0.83             954       442          1114      561
4                               1.66               —       463             —      582
10                              4.13               —       465             —      589
20                              8.25            1001         —          1170        —
50                             20.60            1004       471          1190      614
100                            41.19            1013       474          1243      664
200                            82.38            1024         —          1401        —
300                           123.56            1082       483          1650      987
400                           164.75            1148         —          2018        —

3.2. Scalability measurement


To perform the measurement of this property, we have considered a modular problem composed of identical unit cubes of 421 443 dof each. The problem mesh is obtained by stacking N unit cubes (N subdomains) one behind the other in a train-like shape (Figure 9).
The structure is fixed at one end (x = 0) and subjected to simple tension at its free end (x = N), as shown in Figure 9. Thus, the problem involves (N − 2) floating subdomains. The material behavior is linear elastic, so that the calculation involves a single load increment of one iteration. Each subdomain is assigned to one machine and, consequently, the maximum multithreading level is used (which equals 8).
A tension test is performed on meshes of increasing size, going from 2 to 400 subdomains. As shown in Table IV and Figure 10, the CPU time of the FETI solver Tp remains below 20 min and remarkably constant when the problem size is increased.
In Table IV, the increase of the wall time Tw shows the growing influence of the FEA code parts (mesh loading, for example) that have not yet been parallelized. This has little effect on the non-linear calculations that we would perform (fatigue life, damage, and crack propagation simulations


Figure 10. Evolution of Tp (s) and Tw (s) with Ns: Dissection vs MUMPS (FETI time Tp and total wall time Tw for each solver).

Table V. Solver time Tp (s): effect of floating subdomain detection.

Ns                          50     100     300
Subdomain 1: Dissection    449     451     464
Subdomain 1: MUMPS         433     438     445
Subdomain 2: Dissection    979     993     989
Subdomain 2: MUMPS         449     451     463

in industrial structures), because they would involve tens of loading cycles (hundreds of increments) once the data loading has been carried out, instead of the single increment of the current study. The largest numerical simulation performed in this study, with 400 subdomains, involves a structure of about 165 million dof.
It can be observed from Table IV that MUMPS is more than twice as fast as Dissection. With MUMPS as the local solver and 300 subdomains, Tp is only 483 s, whereas it is 1082 s when using the Dissection solver.
To understand this difference, we have examined meticulously the time needed by each solver (Dissection and MUMPS) per subdomain and have found that:
• in the absence of zero-energy modes (subdomain number 1 in this case study), MUMPS and Dissection require almost the same CPU time, within a maximum difference of 4.3% (Table V);
• however, for subdomains with possible zero-energy modes (subdomain 2, for example, which has six possible zero-energy modes), the CPU time is more than doubled with Dissection, whereas MUMPS keeps almost the same CPU time in both cases.
Additional development efforts are, therefore, required to reduce the CPU time needed by Dissection to solve singular linear systems where zero-energy modes have to be handled.
We have also performed additional numerical experiments (Table VI). In this case, the problem considered is similar to the previous one but with a smaller subdomain size. Four subdomains are assigned to each computation node, but the total number of dof per machine has been kept equal to 421 443. Thus, the multithreading level is 2 (one subdomain per two cores). We can easily note that, by splitting the global mesh of the problem into smaller subdomains, we have divided the CPU time by up to 4 while using the same number of processors.


Table VI. Dissection and MUMPS solver time Tp (s) with multithreading level 2.

Ns                              16      40      80     120     400
Total dof number (millions)   1.66    4.13    8.25   12.36   41.19
Tp: Dissection                 178     251     283     298     300
Tp: MUMPS                      143     207     208     214     220
Difference (%)               24.47   21.26   36.06   39.25   36.36

For example, for the case of 41.19 million dof, the solver CPU time is divided by 1013/300 = 3.38 with Dissection, and by 474/220 = 2.15 when using MUMPS as the local solver.


Also, we can note from the results presented in Table VI that the difference in CPU time between
Dissection and MUMPS has been reduced to less than 40%.

4. CONCLUSION

A new parallel direct solver has been presented in this paper: the Dissection solver. Based on an LU factorization of the sparse matrix of the linear system, Dissection automatically detects and properly handles the zero-energy modes, which is important when dealing with DDM. A performance evaluation and comparisons with other direct solvers (MUMPS, DSCPACK, Sparse Direct, Frontal) have also been given for both sequential and multithreaded computations. Results of the two-level parallelization of large-scale structural analysis problems have also been presented.
Many numerical experiments have been carried out and have shown that Dissection is more efficient than the older direct solvers (Sparse Direct and Frontal) implemented in ZéBuLoN. We have also noted that DSCPACK can be slightly more efficient than Dissection. However, given that DSCPACK is not able to automatically handle the problem of floating subdomains, we think that Dissection could become more profitable for FETI in the ZéBuLoN FEA code.
Some comparisons with MUMPS have also been given for large-scale linear systems arising from FE simulations. In this framework, the largest problem we have solved is an elastic solid composed of 400 subdomains running on 400 computation nodes and containing 164 754 603 dof. The computation of one single iteration consumes less than 20 min of CPU time.
We have noted that MUMPS remains more than twice as fast as Dissection. These numerical experiments have been performed on the CINES cluster JADE. We think that the main weakness of Dissection is the amount of CPU time it requires to handle the zero-energy modes. Additional work is, therefore, needed to improve this aspect of the Dissection solver. However, since MUMPS is not very robust for linear systems arising from very heterogeneous floating substructures, we think that Dissection could be a good alternative as a local solver for parallel computing with FETI algorithms. Indeed, we plan to perform numerical simulations of fatigue life involving tens of loading cycles (hundreds of loading increments), and of damage and crack propagation in industrial structures, which often consist of several parts made of highly heterogeneous materials.

REFERENCES
1. Toselli A, Widlund OB. Domain Decomposition Methods: Algorithms and Theory. Springer Series in Computational
Mathematics, vol. 34. Springer, 2005.
2. Mathew T. Domain Decomposition Methods for the Numerical Solution of Partial Differential Equations. Springer
Series in Computational Mathematics, vol. 61. Springer, 2008.
3. Le Tallec P. Domain decomposition methods in computational mechanics. Computational Mechanics Advances
1994; 1(2):121–220.
4. Farhat C, Roux F-X. Implicit parallel processing in structural mechanics. Computational Mechanics Advances
1994; 2(1):1–124.
5. Gosselet P, Rey C. Non-overlapping domain decomposition methods in structural mechanics. Archives of
Computational Methods in Engineering 2006; 13(4):515–572.
6. Chan T, Mathew T. Domain decomposition algorithms. Acta Numerica 1994; 3:61–143.


7. Farhat C, Roux F-X. A method of finite element tearing and interconnecting and its parallel solution algorithm.
International Journal for Numerical Methods in Engineering 1991; 32(6):1205–1227.
8. Farhat C, Pierson K, Lesoinne M. The second generation FETI methods and their application to the parallel
solution of large-scale linear and geometrically non-linear structural analysis problems. Computer Methods in
Applied Mechanics and Engineering 2000; 184(2–4):333–374.
9. Mandel J. Balancing domain decomposition. Communications in Numerical Methods in Engineering 1993;
9(3):233–241.
10. Guèye I. Résolution des grands systèmes linéaires issus de la méthode des éléments finis sur des calculateurs
massivement parallèles (in French). Ph.D. Thesis, MINES ParisTech, December 2009.
11. Guèye I, Juvigny X, Feyel F, Roux F-X, Cailletaud G. A parallel algorithm for direct solution of large sparse
linear systems, well suitable to domain decomposition methods. European Journal of Computational Mechanics
2009; 18(7–8):589–605.
12. Raghavan P. DSCPACK home page, 2001. Available from: http://www.cse.psu.edu/~raghavan/Dscpack.
13. Amestoy PR, Duff IS, L’Excellent J-Y. Multifrontal parallel distributed symmetric and unsymmetric solvers.
Computer Methods in Applied Mechanics and Engineering 2000; 184:501–520.
14. George A. Nested dissection of a regular finite element mesh. SIAM Journal on Numerical Analysis 1973;
10(2):345–363.
15. George A, Liu JW-H. Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall: Englewood
Cliffs, NJ, 1981.
16. Tinney WF, Walker JW. Direct solutions of sparse network equations by optimally ordered triangular factorization.
Proceedings of the IEEE 1967; 55(11):1801–1809.
17. Karypis G, Kumar V. METIS: unstructured graph partitioning and sparse matrix ordering system, 1995. Available from: http://www-users.cs.umn.edu/~karypis/metis.
18. Dongarra JJ, Duff IS, Du Croz J, Hammarling S. A set of level 3 basic linear algebra subprograms. ACM
Transactions on Mathematical Software 1990; 16:1–17.

