s li development experience


X. Sherry Li

xsli@lbl.gov

Lawrence Berkeley National Laboratory

Scientific Software Days Conference

Austin, Texas

April 27–28, 2017

Sparse factorizations

Core function for indefinite, ill-conditioned, nonsymmetric systems (e.g., those from multiphysics, multiscale simulations)

Direct solver, “inexact” direct solver, preconditioner

Usage scenarios

Stand-alone solver

Good for multiple right-hand sides

Precondition Krylov solvers

Coarse-grid solver in multigrid (e.g., Hypre)

In nonlinear solver (e.g., SUNDIALS)

Solving interior eigenvalue problems

….

→ Bottom of the solvers toolchain. Can be packaged as a “black box”

Advances

Algorithms with lower complexity

Tracking architecture features


Sparse matrix: lots of zeros

fluid dynamics, structural mechanics, chemical process simulation,

circuit simulation, electromagnetic fields, magneto-hydrodynamics,

seismic-imaging, economic modeling, optimization, data analysis,

statistics, . . .

Example: A of dimension 10^6, with 10 to 100 nonzeros per row

Matlab: > spy(A)

Boeing/msc00726 (structural eng.) Mallya/lhr01 (chemical eng.)


Gaussian Elimination (GE)

Solving a system of linear equations Ax = b

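First step of GE

$$
A = \begin{pmatrix} \alpha & w^T \\ v & B \end{pmatrix}
  = \begin{pmatrix} 1 & 0 \\ v/\alpha & I \end{pmatrix}
    \cdot \begin{pmatrix} \alpha & w^T \\ 0 & C \end{pmatrix},
\qquad
C = B - \frac{v \cdot w^T}{\alpha}
$$

Repeat GE on C

Result in LU factorization (A = LU)

– L lower triangular with unit diagonal, U upper triangular

Then x is obtained from two triangular solves, with L and U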
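The elimination step above can be sketched for a small dense matrix. This is a minimal illustration, assuming row-major storage and no pivoting; the function names `lu_nopivot` and `lu_solve` are hypothetical and are not SuperLU routines, which work on sparse storage and do pivot.

```c
#include <assert.h>
#include <stddef.h>

/* In-place LU factorization without pivoting of a dense n-by-n matrix
 * stored row-major in a[]. On return the strict lower triangle holds L
 * (its unit diagonal is implied) and the upper triangle holds U.
 * Returns 0 on success, -1 if a zero pivot is met. */
int lu_nopivot(size_t n, double *a)
{
    assert(a != NULL);
    for (size_t k = 0; k < n; ++k) {
        double alpha = a[k * n + k];              /* pivot */
        if (alpha == 0.0)
            return -1;
        for (size_t i = k + 1; i < n; ++i) {
            a[i * n + k] /= alpha;                /* column of L: v / alpha */
            for (size_t j = k + 1; j < n; ++j)    /* C = B - v w^T / alpha */
                a[i * n + j] -= a[i * n + k] * a[k * n + j];
        }
    }
    return 0;
}

/* Solve L U x = b with the factors from lu_nopivot; b is overwritten by x. */
void lu_solve(size_t n, const double *a, double *b)
{
    for (size_t i = 0; i < n; ++i)                /* forward solve: L y = b */
        for (size_t j = 0; j < i; ++j)
            b[i] -= a[i * n + j] * b[j];
    for (size_t i = n; i-- > 0; ) {               /* backward solve: U x = y */
        for (size_t j = i + 1; j < n; ++j)
            b[i] -= a[i * n + j] * b[j];
        b[i] /= a[i * n + i];
    }
}
```

Once `lu_nopivot` has run, `lu_solve` can be called repeatedly with different right-hand sides, which is why a factorization pays off for multiple right-hand sides.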


Sparse factorization

Store A explicitly … many sparse compressed formats

“Fill-in” . . . new nonzeros in L & U

Graph algorithms: directed/undirected graphs, bipartite graphs,

paths, elimination trees, depth-first search, heuristics for NP-hard

problems, cliques, graph partitioning, . . .

Unfriendly to high performance, parallel computing

Irregular memory access, indirect addressing, strong task/data

dependency
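To make the storage and access-pattern remarks concrete, here is a small sketch of one common compressed format, compressed sparse column (CSC), with a matrix-vector product. The struct and field names are illustrative assumptions, not taken from SuperLU or any other library.

```c
#include <assert.h>
#include <stddef.h>

/* Compressed sparse column storage: the entries of column j sit at
 * positions colptr[j] .. colptr[j+1]-1 of rowind[] and nzval[]. */
typedef struct {
    size_t n;               /* matrix dimension (square) */
    const size_t *colptr;   /* n + 1 column start offsets */
    const size_t *rowind;   /* row index of each stored entry */
    const double *nzval;    /* value of each stored entry */
} csc_matrix;

/* y = A * x for a CSC matrix. */
void csc_matvec(const csc_matrix *A, const double *x, double *y)
{
    assert(A != NULL);
    for (size_t i = 0; i < A->n; ++i)
        y[i] = 0.0;
    for (size_t j = 0; j < A->n; ++j)
        for (size_t p = A->colptr[j]; p < A->colptr[j + 1]; ++p)
            y[A->rowind[p]] += A->nzval[p] * x[j];  /* indirect addressing */
}
```

The write to `y[A->rowind[p]]` is the indirect, data-dependent memory access that the slide names as an obstacle to high performance.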

[Figure: supernodal DAG and multifrontal tree, with L and U blocks labeled]

SuperLU direct solver: SW aspects

www.crd.lbl.gov/~xiaoye/SuperLU

First release 1999, serial and MT with Pthreads.

Later: OpenMP, MPI distributed, MPI + OpenMP + CUDA

Single developer to many developers: svn, testing code

svn → GitHub: improved distributed contributions

Build-test productivity:

Edit platform-dependent “make.inc” → CMake/CTest

Both modes co-exist

xSDK interfaces, compatibility between different solvers

Namespace allows the three versions to be used simultaneously

Easier to manage dependencies (ParMetis, machine-dependent files), platform-specific versions (_MT, _DIST, GPU), and correctness.


SuperLU numerical testing

Regression test aims to provide coverage of all routines by testing

all functions of the user-callable routines.

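$$
\mathrm{BERR} = \max_i \frac{|r|_i}{(|A|\,|x| + |b|)_i}, \qquad r = b - Ax
$$

$$
\mathrm{FERR} = \frac{\| x_{\mathrm{true}} - x \|_\infty}{\| x \|_\infty}
$$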
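The two metrics can be evaluated directly for a small dense system. The sketch below assumes row-major dense storage and the standard componentwise definitions (residual r = b − Ax for BERR, infinity norms for FERR); the function names are hypothetical, and skipping rows with a zero denominator is a choice made here, not a documented SuperLU rule.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Componentwise backward error of x for the dense n-by-n system A x = b:
 * max over i of |b - A x|_i / (|A| |x| + |b|)_i. */
double backward_error(size_t n, const double *a, const double *x,
                      const double *b)
{
    double berr = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double r = b[i];               /* residual component r_i */
        double denom = fabs(b[i]);     /* (|A| |x| + |b|)_i */
        for (size_t j = 0; j < n; ++j) {
            r -= a[i * n + j] * x[j];
            denom += fabs(a[i * n + j]) * fabs(x[j]);
        }
        if (denom > 0.0 && fabs(r) / denom > berr)
            berr = fabs(r) / denom;
    }
    return berr;
}

/* Forward error ||x_true - x||_inf / ||x||_inf; x must be nonzero. */
double forward_error(size_t n, const double *x_true, const double *x)
{
    double num = 0.0, den = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (fabs(x_true[i] - x[i]) > num)
            num = fabs(x_true[i] - x[i]);
        if (fabs(x[i]) > den)
            den = fabs(x[i]);
    }
    assert(den > 0.0);
    return num / den;
}
```

A regression test would compare both values against thresholds tied to machine precision; FERR needs a known true solution, so it only applies to manufactured test problems.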


Malloc/free balance check

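Debugging mode SUPERLU_MALLOC / SUPERLU_FREE

SUPERLU_MALLOC (size):

    {
        char *buf;
        buf = (char *) malloc (size + 64);   /* 64 extra bytes in front hold the size */
        ((size_t *) buf)[0] = size;          /* record the request size in the header */
        malloc_total += size;                /* bytes currently outstanding */
        return (void *) (buf + 64);          /* user data starts after the header */
    }

SUPERLU_FREE (addr):

    {
        char *p = ((char *) addr) - 64;      /* step back to the header */
        size_t n = ((size_t *) p)[0];        /* size recorded at allocation */
        malloc_total -= n;
        free (p);
    }

malloc_total returns to zero once every allocation has been freed.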
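A self-contained version of the same balance check is sketched below. The names `dbg_malloc`, `dbg_free`, `dbg_outstanding`, and `HEADER` are hypothetical stand-ins; the sketch follows the slide's 64-byte header idea and is not the SuperLU source.

```c
#include <assert.h>
#include <stdlib.h>

#define HEADER 64                   /* bytes reserved in front of each block */

static size_t malloc_total = 0;     /* bytes currently outstanding */

/* Allocate size bytes and remember the size in a hidden header. */
void *dbg_malloc(size_t size)
{
    char *buf = malloc(size + HEADER);
    if (buf == NULL)
        return NULL;
    *(size_t *) buf = size;
    malloc_total += size;
    return buf + HEADER;
}

/* Free a block from dbg_malloc and subtract its recorded size. */
void dbg_free(void *addr)
{
    if (addr == NULL)
        return;
    char *p = (char *) addr - HEADER;
    size_t n = *(size_t *) p;
    assert(malloc_total >= n);      /* more freed than allocated is a bug */
    malloc_total -= n;
    free(p);
}

/* Zero means every allocation has been matched by a free. */
size_t dbg_outstanding(void)
{
    return malloc_total;
}
```

Checking `dbg_outstanding()` at the end of a test run flags leaks without an external tool. The 64-byte header is larger than a `size_t`, which keeps the returned pointer aligned as `malloc` aligned it.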


Roscoe A. Bartlett, Anshu Dubey, Xiaoye S. Li, J. David Moulton, James M. Willenbring, and Ulrike Meier Yang (2016), “Testing in Scientific Software: Impacts on Research Credibility, Development Productivity, Maturation, and Sustainability,” chapter in “Software Engineering for Science”, Jeffrey Carver, Neil P. Chue Hong, George K. Thiruvathukal (editors), October 20, 2016, CRC Press.

SuperLU distributed factorization

• O(N^2) flops, O(N^{4/3}) memory for typical 3D problems.

[Figure: matrix blocks mapped cyclically onto a 2×3 process grid (ranks 0 to 5), with the look-ahead window marked]

Per-rank Schur complement update

Loop through N steps (Gaussian Elimination):

FOR ( k = 1, N ) {

    1) Gather sparse blocks A(:, k) and A(k,:) into dense work[]

    2) Call dense GEMM on work[]

    3) Scatter work[] into remaining sparse blocks

}

• Graph at step k+1 differs from step k

• Panel factorization on critical path

• Look-ahead window overlaps communication with computations of different iterations.
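Steps 2 and 3 of the loop can be sketched as follows, assuming step 1 has already packed the nonzero rows of A(:, k) and the nonzero columns of A(k, :) into dense panels. The name `schur_update`, its argument layout, and the dense array standing in for the remaining sparse blocks are all illustrative assumptions; a real code would call an optimized BLAS GEMM in place of the triple loop.

```c
#include <assert.h>
#include <stddef.h>

/* One Schur complement update of a trailing matrix c[] (row-major,
 * leading dimension ldc) that stands in for the remaining sparse blocks.
 *   lpanel: nl-by-kb dense copy of the nonzero rows of A(:, k)
 *   upanel: kb-by-nu dense copy of the nonzero columns of A(k, :)
 *   lrow, ucol: global row / column index of each packed row / column
 *   work: nl-by-nu scratch buffer */
void schur_update(size_t nl, size_t nu, size_t kb,
                  const size_t *lrow, const double *lpanel,
                  const size_t *ucol, const double *upanel,
                  double *work, size_t ldc, double *c)
{
    assert(lrow != NULL && ucol != NULL && work != NULL);

    /* Step 2: dense GEMM, work = lpanel * upanel. */
    for (size_t i = 0; i < nl; ++i)
        for (size_t j = 0; j < nu; ++j) {
            double sum = 0.0;
            for (size_t t = 0; t < kb; ++t)
                sum += lpanel[i * kb + t] * upanel[t * nu + j];
            work[i * nu + j] = sum;
        }

    /* Step 3: scatter, subtracting work from the indexed target entries. */
    for (size_t i = 0; i < nl; ++i)
        for (size_t j = 0; j < nu; ++j)
            c[lrow[i] * ldc + ucol[j]] -= work[i * nu + j];
}
```

Only the GEMM is regular, compute-bound work; the scatter is driven by index arrays.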

Developers: Sherry Li, Jim Demmel, John Gilbert, Laura Grigori, Piush Sao, Meiyue Shao, Ichitaro Yamazaki.

SuperLU deploying GPU accelerator

data-parallel ones on GPU.

• 100-node GPU clusters: 2.7x faster, 2-5x memory savings.

• Current work: 3D algorithm to reduce critical path of panel factorizations.

[Figure: CPU / GPU concurrent execution]

• Transfer data between GPU/CPU

SuperLU optimization on Intel Xeon Phi

Replacing small independent single-threaded MKL DGEMMs by large

multithreaded MKL DGEMMs: 15-20% faster.

Using nested parallel for and tasking avoids load imbalance and increases

amount of parallelism: 10-15% faster.

Challenges: non-uniform block size, many small blocks.

Factorization time: 3 test matrices, mixing MPI and OpenMP

                 1 node = 64 cores     8 nodes = 512 cores     32 nodes = 2048 cores
MPI, Threads     64p, 1t    32p, 2t    256p, 2t    128p, 4t    512p, 4t    256p, 8t
nlpkkt80         --         66.7       35.2        27.5        24.2        25.7
Ga19As19H42      129.0      130.7      28.3        25.6        15.6        16.8

• 72 cores, self-hosted

• 4 threads per core (SIMT)

• 2 512-bit vector units per core (SIMD)

Examples in EXAMPLE/

• pddrive.c: Solve one linear system

• pddrive1.c: Solve the systems with same A but different right-hand sides at different times

    – Reuse the factored form of A

• pddrive2.c: Solve the systems with the same pattern as A

    – Reuse the sparsity ordering

• pddrive3.c: Solve the systems with the same sparsity pattern and similar values

    – Reuse the sparsity ordering and symbolic factorization

• pddrive4.c: Divide the processes into two subgroups (two grids) such that each subgroup solves a linear system independently from the other.

[Figure: processes 0 to 11 divided into two process grids]

Domain decomposition, Schur-complement

(PDSLin : http://portal.nersc.gov/project/sparse/pdslin/)

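1. Partition: A11 is block diagonal with subdomain blocks D_1, …, D_k

$$
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
= \begin{pmatrix} b_1 \\ b_2 \end{pmatrix},
\qquad
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
= \begin{pmatrix}
D_1 &     &        &     & E_1 \\
    & D_2 &        &     & E_2 \\
    &     & \ddots &     & \vdots \\
    &     &        & D_k & E_k \\
F_1 & F_2 & \cdots & F_k & A_{22}
\end{pmatrix}
$$

2. Schur complement

$$
S = A_{22} - A_{21} A_{11}^{-1} A_{12}
  = A_{22} - (U_{11}^{-T} A_{21}^{T})^{T} (L_{11}^{-1} A_{12})
  = A_{22} - W \cdot G,
\qquad \text{where } A_{11} = L_{11} U_{11}
$$

S = interface (separator) variables, no need to form explicitly

(1) $x_2 = S^{-1} (b_2 - A_{21} A_{11}^{-1} b_1)$ ← iterative solver

(2) $x_1 = A_{11}^{-1} (b_1 - A_{12} x_2)$ ← direct solver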
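The two-stage solve can be traced on a toy problem in which every subdomain block D_i, E_i, F_i is a single number, so the matrix is an arrowhead. This is only a sketch under that assumption: the function name `schur_solve` is hypothetical, and the Schur system is solved by a plain division here, whereas PDSLin uses an iterative solver for step (1).

```c
#include <assert.h>
#include <stddef.h>

/* Solve the bordered system
 *     d[i] * x1[i] + e[i] * x2            = b1[i],   i = 0 .. k-1
 *     sum_i f[i] * x1[i] + a22 * x2       = b2
 * through the Schur complement S = a22 - sum_i f[i] * e[i] / d[i].
 * Returns 0 on success, -1 if some d[i] or S is zero. */
int schur_solve(size_t k, const double *d, const double *e, const double *f,
                double a22, const double *b1, double b2,
                double *x1, double *x2)
{
    assert(x1 != NULL && x2 != NULL);
    double s = a22;      /* accumulates S = A22 - A21 A11^{-1} A12 */
    double rhs = b2;     /* accumulates b2 - A21 A11^{-1} b1 */

    for (size_t i = 0; i < k; ++i) {
        if (d[i] == 0.0)
            return -1;
        s -= f[i] * e[i] / d[i];
        rhs -= f[i] * b1[i] / d[i];
    }
    if (s == 0.0)
        return -1;

    *x2 = rhs / s;                              /* (1) interface unknown */
    for (size_t i = 0; i < k; ++i)              /* (2) subdomain unknowns */
        x1[i] = (b1[i] - e[i] * *x2) / d[i];
    return 0;
}
```

Each pass through the loops touches one subdomain only, which is what lets the subdomain work proceed independently before and after the interface solve.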


Hierarchical parallelism

Multiple processors per subdomain

one subdomain with 2x3 procs (e.g. SuperLU_DIST)

[Figure: block-bordered matrix with subdomain D1 on processes P(0:5), D2 on P(6:11), and D3 on P(12:17), border blocks E1 to E3 and F1 to F4, and interface block A22]

Advantages:

Constant #subdomains, Schur size, and convergence rate, regardless

of core count.

Need only modest level of parallelism from direct solver.


PDSLin configurable as Hybrid, Iterative, or Direct

Default (Hybrid)

    Subdomain: LU
    Schur: Krylov

Options (Iterative)

    (1) num_doms = 0
        Schur = A: Krylov
    (2) FGMRES Inner-Outer:
        Subdomain: ILU
        Schur: Krylov

Options (Direct)

    (1) Subdomain: LU
        Schur: LU
        drop_tol = 0.0
    (2) num_doms = 1
        Subdomain: LU

[Figure: block-bordered matrix with blocks D1 … Dk, E1 … Ek, F1 … Fk, and A22]

PDSLin in Omega3P: accelerator cavity design

Computation results

• 2.3M elements

• Second order finite element (p = 2)

- 14M DOFs, 590M non-zeroes.

- Using MUMPS with 400 nodes, 800 cores, solution time: 6:48 min.

- Solution time on Edison using 100 nodes, 2400 cores: 6:20 min.

STRUMPACK “inexact” direct solver

portal.nersc.gov/project/sparse/strumpack/

• In addition to structural sparsity, further apply data-sparsity with low-rank compression.

• O(N logN) flops, O(N) memory for 3D elliptic PDEs.

• Hierarchically semi-separable (HSS) form with nested bases:

$$
A \approx
\begin{pmatrix}
\begin{pmatrix} D_1 & U_1 B_1 V_2^T \\ U_2 B_2 V_1^T & D_2 \end{pmatrix} & U_3 B_3 V_6^T \\
U_6 B_6 V_3^T & \begin{pmatrix} D_4 & U_4 B_4 V_5^T \\ U_5 B_5 V_4^T & D_5 \end{pmatrix}
\end{pmatrix},
\qquad
U_3^{\mathrm{big}} = \begin{pmatrix} U_1 & 0 \\ 0 & U_2 \end{pmatrix} U_3
$$

• Off-diagonal blocks (“far field”) approximated via low-rank compression.

• Efficient for many PDEs, BEM methods, integral equations, machine learning, and structured matrices such as Toeplitz, Cauchy matrices.

• C++, hybrid MPI + OpenMP; provide Fortran interface.

• Real & complex datatypes, single & double precision (via template), and 64-bit indexing.

• Input interface:

• dense, sparse, or matrix-free (only matvec needed).

• user-supplied cluster tree & block partition.


HSS approximation error vs. drop tolerance

Randomized sampling to reveal rank

1 Pick random matrix Ω of size n × (k+p), with k the target rank and p small, e.g. 10

2 Sample matrix S = A Ω, with slight oversampling p

3 Compute Q = ON-basis(S) via rank-revealing QR

[Figure: two panels against drop tolerance from 1 down to 1e-14, with curves for d = 16, 64, 256, 1024. Left: relative error ||A − A_HSS||_F / ||A||_F. Right: maximum HSS rank.]

STRUMPACK: parallelism and performance

3 types of tree parallelism:

• Elimination tree (node of etree)

• HSS tree (node of HSS tree)

• BLAS tree (node of dense kernels tree)

Shared Memory OpenMP Task Parallelism: Intel Ivy Bridge, compared to MKL Pardiso

[Figure: bar chart of reorder time, factor time, and solve time for PARDISO, MF, and MF+HSS on atmosmodd, Geo_1438, nlpkkt80, tdr190k, torso3, Transport, A22, Serena, and spe10-aniso; one case is marked out of memory]

Solving Toeplitz systems in linear time: Quantum Chemistry Toeplitz matrix, 64 MPI, compression+factorization+solve, Intel Ivy Bridge

[Figure: runtime (s) against n from 16000 to 370000 for HSS-FFT, HSS-GEMM, and LU, with reference curves n log^2 n, n^2, and n^3]

Distributed memory timing breakdown: numerical fact, symbolic fact, apply prec

[Figure: time (s) against cores (MPI processes): 1, 4, 24, 192, 1152]

Summary

Explore new algorithms that require lower arithmetic complexity,

lower memory/communication, faster convergence rate

Higher-fidelity simulations require higher resolution, faster turn-around

time

generation machines (exascale in 6-7 years)

Light-weight core, less memory per core, hybrid node with high degree

of parallelism
