
Experience of developing sparse matrix algorithms and
software for sustainability

X. Sherry Li
xsli@lbl.gov
Lawrence Berkeley National Laboratory

The 8th Annual

Scientific Software Days Conference
Austin, Texas
April 27–28, 2017
Sparse factorizations
• Core function for indefinite, ill-conditioned, nonsymmetric systems (e.g.,
  those from multiphysics, multiscale simulations)
• Direct solver, “inexact” direct solver, preconditioner
• Usage scenarios
  – Stand-alone solver: good for multiple right-hand sides
  – Preconditioning Krylov solvers
  – Coarse-grid solver in multigrid (e.g., Hypre)
  – In nonlinear solvers (e.g., SUNDIALS)
  – Solving interior eigenvalue problems
  – …
→ Bottom of the solvers toolchain; can be packaged as a “black box”
• Algorithms with lower complexity
• Tracking architecture features

4/27/17 1
Sparse matrix: lots of zeros
Applications: fluid dynamics, structural mechanics, chemical process simulation,
circuit simulation, electromagnetic fields, magneto-hydrodynamics,
seismic imaging, economic modeling, optimization, data analysis,
statistics, …
Example: A of dimension 10^6, with 10–100 nonzeros per row
Matlab: >> spy(A)
[Figures: sparsity patterns of Boeing/msc00726 (structural eng.) and
Mallya/lhr01 (chemical eng.)]

Use a compressed scheme to store the nonzeros; operations involve scatter/gather.

Gaussian Elimination (GE)
Solving a system of linear equations Ax = b

First step of GE:

    A = [ α  wᵀ ]  =  [ 1    0 ] · [ α  wᵀ ]
        [ v  B  ]     [ v/α  I ]   [ 0  C  ]

    C = B − v·wᵀ/α

Repeat GE on C. The result is an LU factorization, A = LU:
– L lower triangular with unit diagonal, U upper triangular

Then x is obtained by solving two triangular systems with L and U.

Sparse factorization
• Store A explicitly … many sparse compressed formats
• “Fill-in” … new nonzeros in L & U
• Graph algorithms: directed/undirected graphs, bipartite graphs,
  paths, elimination trees, depth-first search, heuristics for NP-hard
  problems, cliques, graph partitioning, …
• Unfriendly to high-performance, parallel computing: irregular dependency
[Figures: supernodal DAG and multifrontal tree views of the dependency
structure among the L and U blocks]
SuperLU direct solver: SW aspects
www.crd.lbl.gov/~xiaoye/SuperLU
First release 1999, serial and multithreaded with Pthreads.
Later: OpenMP, MPI distributed, MPI + OpenMP + CUDA.
Single developer to many developers: svn, testing code.

Benefits of IDEAS project:
• svn → github: improved distributed contributions
• Build-test productivity:
  – Editing a platform-dependent “make.inc” → CMake/CTest
  – Both modes co-exist
• xSDK interfaces, compatibility between different solvers
  – Namespaces allow three versions to be used simultaneously
  – Easier to manage dependencies (ParMETIS, machine-dependent
    files), platform-specific versions (_MT, _DIST, GPU), and
    correctness.

SuperLU numerical testing
Regression tests aim to provide coverage of all routines by testing
all functions of the user-callable routines.

    BERR = max_i |r|_i / (|A|·|x| + |b|)_i     (componentwise backward error, r = b − Ax)

    FERR = ||x − x_true||_∞ / ||x||_∞          (forward error)

Malloc/free balance check
Debugging mode: SUPERLU_MALLOC / SUPERLU_FREE map to wrappers that pad each
allocation with a 64-byte header recording its size, so the running total
malloc_total can be checked for balance:

    void *superlu_malloc(size_t size)
    {
        char *buf = (char *) malloc(size + 64);
        ((size_t *) buf)[0] = size;   /* record size in the header */
        malloc_total += size;
        return (void *) (buf + 64);   /* hand out the payload area */
    }

    void superlu_free(void *addr)
    {
        char *p = ((char *) addr) - 64;  /* step back to the header */
        size_t n = ((size_t *) p)[0];
        malloc_total -= n;
        free(p);
    }

Roscoe A. Bartlett, Anshu Dubey, Xiaoye S. Li, J. David Moulton, James M.
Willenbring, and Ulrike Meier Yang, “Testing in Scientific Software: Impacts
on Research Credibility, Development Productivity, Maturation, and
Sustainability,” chapter in Software Engineering for Science, Jeffrey Carver,
Neil P. Chue Hong, and George K. Thiruvathukal (editors), CRC Press,
October 20, 2016.

SuperLU distributed factorization
• O(N²) flops, O(N^{4/3}) memory for typical 3D problems.
[Figure: matrix blocks mapped to a 2×3 block-cyclic process grid
(processes 0–5)]
• Per-rank Schur complement update; loop through N steps (Gaussian
  elimination):

    FOR ( k = 1, N ) {
        1) Gather sparse blocks A(:, k) and A(k, :) into dense work[]
        2) Call dense GEMM on work[]
        3) Scatter work[] into remaining sparse blocks
    }

• Graph at step k+1 differs from step k.
• Panel factorization is on the critical path.
• Use a look-ahead window to pipeline computations of different iterations.

Developers: Sherry Li, Jim Demmel, John Gilbert, Laura Grigori, Piyush Sao,
Meiyue Shao, Ichitaro Yamazaki.

SuperLU deploying GPU accelerator

§ Overlap two types of computations: irregular ones on the CPU, regular
  data-parallel ones on the GPU.
§ On 100-node GPU clusters: 2.7x faster, 2–5x memory savings.
§ Current work: 3D algorithm to reduce the critical path of panel
  factorizations.
§ CPU / GPU concurrent execution requires:
  • Separate GPU/CPU programs
  • Transferring data between GPU and CPU
[Figure: timeline of concurrent CPU/GPU execution within the
factorization loop]

SuperLU optimization on Intel Xeon Phi
Replacing many small, independent single-threaded MKL DGEMM calls with larger
ones that expose a large amount of parallelism: 10–15% faster.
Challenges: non-uniform block sizes, many small blocks.

Factorization time (seconds): 3 test matrices, mixing MPI and OpenMP

                 1 node = 64 cores   8 nodes = 512 cores   32 nodes = 2048 cores
MPI, Threads     64p,1t   32p,2t     256p,2t   128p,4t     512p,4t   256p,8t
nlpkkt80         --       66.7       35.2      27.5        24.2      25.7
Ga19As19H42      129.0    130.7      28.3      25.6        15.6      16.8
RM07R            84.6     83.6       25.0      21.0        19.8      16.9

• 72 cores, self-hosted
• 4 threads per core (SIMT)
• 2 512-bit vector units per core (SIMD)

Examples in EXAMPLE/
§ pddrive.c: Solve one linear system.
§ pddrive1.c: Solve systems with the same A but different right-hand
  sides at different times.
  § Reuse the factored form of A.
§ pddrive2.c: Solve systems with the same pattern as A.
  § Reuse the sparsity ordering.
§ pddrive3.c: Solve systems with the same sparsity pattern and
  similar values.
  § Reuse the sparsity ordering and symbolic factorization.
§ pddrive4.c: Divide the processes into two subgroups (two grids)
  such that each subgroup solves a linear system independently of
  the other.
[Figure: 12 processes (0–11) split into two process grids]
Domain decomposition, Schur complement
(PDSLin: http://portal.nersc.gov/project/sparse/pdslin/)

1. Graph-partition into subdomains, so that A11 is block diagonal:

    [ A11  A12 ] [ x1 ]   [ b1 ]           [ D1             E1  ]
    [ A21  A22 ] [ x2 ] = [ b2 ],  where   [     D2         E2  ]
                                     A =   [        ...     ... ]
                                           [            Dk  Ek  ]
                                           [ F1  F2 ... Fk  A22 ]

2. Schur complement, with A11 = L11·U11:

    S = A22 − A21 A11^-1 A12
      = A22 − (U11^-T A21^T)^T (L11^-1 A12)
      = A22 − W·G

   S couples the interface (separator) variables; no need to form it explicitly.

3. Hybrid solution method:

    (1) x2 = S^-1 (b2 − A21 A11^-1 b1)   ← iterative solver
    (2) x1 = A11^-1 (b1 − A12 x2)        ← direct solver

Hierarchical parallelism
• Multiple processors per subdomain, e.g., one subdomain on a 2×3
  process grid (SuperLU_DIST).
[Figure: subdomains D1–D3 with interface blocks E1–E3 assigned to
process groups P(0:5), P(6:11), P(12:17); interface blocks F1–F4 and
A22 spread over P(0:5), P(6:11), P(12:17), P(18:23)]
• Constant number of subdomains, Schur size, and convergence rate,
  regardless of core count.
• Need only a modest level of parallelism from the direct solver.

PDSLin configurable as Hybrid, Iterative, or Direct

Default (Hybrid):
• Subdomains: LU
• Schur: Krylov

Iterative options:
(1) num_doms = 0; Schur = A: Krylov
(2) FGMRES inner-outer; subdomains: ILU; Schur: Krylov

Direct options:
(1) Subdomains: LU; Schur: LU, drop_tol = 0.0
(2) num_doms = 1; subdomain: LU
PDSLin in Omega3P: accelerator cavity design

PIP2 cryomodule consisting of 8 cavities.

Computation results:
§ 2.3M elements
§ Second-order finite elements (p = 2)
  – 14M DOFs, 590M nonzeros
  – Using MUMPS with 400 nodes, 800 cores, solution time: 6:48 min
  – Solution time on Edison using 100 nodes, 2,400 cores: 6:20 min
STRUMPACK “inexact” direct solver
portal.nersc.gov/project/sparse/strumpack/
• In addition to structural sparsity, further apply data sparsity with
  low-rank compression.
• O(N log N) flops, O(N) memory for 3D elliptic PDEs.
• Diagonal blocks (“near field”) kept exact; off-diagonal blocks
  (“far field”) approximated via low-rank compression U·B·Vᵀ, in a
  hierarchically semiseparable (HSS) form, e.g. at two levels:

    A ≈ [  D1           U1 B1 V2ᵀ                  U3 B3 V6ᵀ  ]
        [  U2 B2 V1ᵀ    D2                                    ]
        [  U6 B6 V3ᵀ                 D4            U4 B4 V5ᵀ  ]
        [                            U5 B5 V4ᵀ     D5         ]

• Use nested bases + randomized sampling to achieve linear scaling.
• Efficient for many PDEs, boundary element methods, integral equations,
  machine learning, and structured matrices such as Toeplitz and Cauchy
  matrices.
• C++, hybrid MPI + OpenMP; Fortran interface provided.
• Real & complex datatypes, single & double precision (via templates),
  and 64-bit indexing.
• Input interfaces:
  – dense, sparse, or matrix-free (only a matvec is needed);
  – user-supplied cluster tree & block partition.

Developers: Pieter Ghysels, Chris Gorman, Sherry Li, Francois-Henry Rouet

HSS approximation error vs. drop tolerance
Randomized sampling to reveal the rank:
1. Pick a random matrix Ω of size n×(k+p), with k the target rank and p
   a small oversampling parameter, e.g., 10.
2. Compute the sample matrix S = AΩ.
3. Compute Q = ON-basis(S) via rank-revealing QR.

→ Adaptive sampling is essential for robustness.

[Figure: relative error ||A − A_HSS||_F / ||A||_F (left, from 1 down to
1e-14) and maximum HSS rank (right, roughly 60–240) versus drop tolerance,
for sampling sizes d = 16, 64, 256, 1024]
STRUMPACK: parallelism and performance
3 types of tree parallelism (shared-memory OpenMP task parallelism):
• Elimination tree (node of etree)
• HSS tree (node of HSS tree)
• BLAS tree (node of dense-kernels tree)

[Figure: reorder, factor, and solve time on Intel Ivy Bridge, normalized
to the multifrontal (MF) solver, comparing MKL PARDISO, MF, and MF+HSS
on atmosmodd, Geo_1438, nlpkkt80, tdr190k, torso3, Transport, A22,
Serena, and spe10-aniso; PARDISO runs out of memory on some problems]
Distributed memory timing breakdown: Serena (N = 1.3M, nnz = 64M)
[Figure: symbolic factorization, numerical factorization, and
preconditioner-application time versus MPI process count (1–1152 cores)]

Solving Toeplitz systems in linear time
[Figure: compression + factorization + solve time on a quantum chemistry
Toeplitz matrix (64 MPI processes, Intel Ivy Bridge), n = 16000 to 370000;
HSS-FFT scales as n log² n, HSS-GEMM as n², LU as n³]
Summary
• Explore new algorithms that require lower arithmetic complexity, less
  memory/communication, and faster convergence.
  – Higher-fidelity simulations require higher resolution and faster
    turnaround time.
• Refactor existing codes and implement new codes for next-generation
  machines (exascale in 6–7 years).
  – Lightweight cores, less memory per core, hybrid nodes with a high
    degree of parallelism.
