Architectures on Linear Algebra and Numerical Libraries

Jack Dongarra
Innovative Computing Laboratory
University of Tennessee
High Performance Computers
♦ ~20 years ago
  ¾ 1x10^6 Floating Point Ops/sec (Mflop/s)
  ¾ Scalar based
♦ ~10 years ago
  ¾ 1x10^9 Floating Point Ops/sec (Gflop/s)
♦ Today
  ¾ 1x10^12 Floating Point Ops/sec (Tflop/s)
  ¾ Highly parallel, distributed processing, message passing, network based
  ¾ Data decomposition, communication/computation
♦ ~10 years away
  ¾ 1x10^15 Floating Point Ops/sec (Pflop/s)
  ¾ Many more levels of memory hierarchy; combination of grids & HPC
  ¾ More adaptive, latency and bandwidth aware, fault tolerant, extended precision, attention to SMP nodes
TOP500
[Chart: Linpack benchmark performance, Rate vs. Size]
♦ Updated twice a year:
  ¾ SC'xy in the States in November
  ¾ Meeting in Mannheim, Germany in June
♦ All data available from www.top500.org
Fastest Computer Over Time
[Chart: fastest computer by year, 1990-2000, in GFlop/s — Cray Y-MP (8), Fujitsu VP-2600, TMC CM-2 (2048), NEC SX-3 (4), TMC CM-5 (1024), Fujitsu VPP-500 (140), Intel Paragon (6788), Hitachi CP-PACS (2040)]
In 1980, a computation that took a full year to complete can now be done in ~13 minutes!
Fastest Computer Over Time
[Chart: fastest computer by year, 1990-2000, up to 5000 GFlop/s — the earlier systems plus Intel ASCI Red (9152), SGI ASCI Blue Mountain, ASCI Blue Pacific SST (5808), Intel ASCI Red Xeon (9632), ASCI White Pacific (7424)]
Notes: [60 Gflop/s - 400 Mflop/s] [4.9 Tflop/s - 55 Gflop/s], Schwab #15, 1/2 each year, 209 systems > 100 Gflop/s, faster than Moore's law.
Performance Development
[Chart: TOP500 performance, 1993-2009, log scale in GFlop/s — Sum, N=1, and N=500 trend lines; ASCI systems and the Earth Simulator near the top, "My Laptop" near the N=500 line; 1 TFlop/s and 1 PFlop/s levels marked]
Extrapolating the trends: entry at 1 Tflop/s by 2005 and 1 Pflop/s by 2010.
Architectures
[Chart: TOP500 architecture classes, Nov 1993 - Nov 2000, count 0-500 — Single Processor, SMP, MPP, and cluster/constellation classes; systems noted include Y-MP C90, SX3, VP500, SP2, ASCI Red, Sun HPC]
Current breakdown: 112 constellations, 28 clusters, 343 MPP, 17 SMP.
Chip Technology
[Chart: TOP500 processor families over time, 1993-2000, count 0-500 — Intel, IBM, SUN, MIPS, Alpha, HP, Inmos Transputer, Other]
High-Performance Computing Directions:
Beowulf-class PC Clusters

Definition:
♦ COTS PC Nodes
  ¾ Pentium, Alpha, PowerPC, SMP
♦ COTS LAN/SAN Interconnect
  ¾ Ethernet, Myrinet, Giganet, ATM
♦ Open Source Unix
  ¾ Linux, BSD

Advantages:
♦ Best price-performance
♦ Low entry-level cost
♦ Just-in-place configuration
♦ Vendor invulnerable
♦ Scalable
♦ Rapid technology tracking (2X/1.5 yr)
[Chart: Processor-Memory Performance Gap, 1980-2000, log scale — the gap grows 50% per year; DRAM improves only 9%/yr (2X/10 yrs)]
Optimizing Computation and Memory Use
♦ Computational optimizations
♦ Memory hierarchy optimizations
♦ Self-Adapting Numerical Software (SANS)
[Chart: matrix multiply (DGEMM) performance in MFLOPS across architectures — Vendor BLAS vs. ATLAS BLAS vs. F77 BLAS; machines include AMD Athlon, DEC ev5/ev6, HP PA-RISC, IBM Power2/Power3/PowerPC, Pentium Pro/II/III, SGI R10000/R12000, Sun UltraSparc2]
♦ ATLAS is faster than all other portable BLAS implementations, and it is comparable with machine-specific libraries provided by the vendor.
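The kind of optimization ATLAS automates can be illustrated with cache blocking. A minimal sketch of the idea, not ATLAS itself: tile the three GEMM loops so that small blocks of A, B, and C stay cache-resident while they are reused.

```python
def matmul_blocked(A, B, nb=2):
    """C = A*B with nb x nb cache blocking (illustrative sketch, not ATLAS).

    The ii/kk/jj loops walk over blocks; the inner i/k/j loops multiply one
    block, so each operand block is reused many times while it sits in cache.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):              # block row of C
        for kk in range(0, n, nb):          # block column of A / block row of B
            for jj in range(0, n, nb):      # block column of C
                for i in range(ii, min(ii + nb, n)):
                    for k in range(kk, min(kk + nb, n)):
                        aik = A[i][k]       # register-resident scalar
                        for j in range(jj, min(jj + nb, n)):
                            C[i][j] += aik * B[k][j]
    return C
```

ATLAS goes further by timing many block sizes, unrollings, and register tilings at install time and keeping the fastest variant for the machine.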
Intel PIII 933 MHz
MKL 5.0 vs ATLAS 3.2.0 using Windows 2000
[Chart: MFLOPS vs. matrix order (50-300) — Vendor BLAS, ATLAS BLAS, Reference BLAS]

Matrix-Vector Multiply (DGEMV)
[Chart: DGEMV performance in MFLOPS across architectures — F77, Vendor, and ATLAS versions, Trans and NoTrans variants; machines include DEC ev5/ev6, HP PA-RISC, IBM Power2/Power3, Pentium Pro/II/III, SGI R10000/R12000, Sun UltraSparc2]
LU Factorization, Recursive, w/ATLAS
[Chart: recursive LU performance in MFLOPS across architectures — Vendor BLAS, ATLAS BLAS, F77 BLAS; machines include AMD Athlon, DEC ev5/ev6, HP PA-RISC, IBM Power2/Power3/PPC, Pentium Pro/II/III, SGI R10000/R12000, Sun UltraSparc2]
♦ PHiPAC
  ¾ Portable High Performance ANSI C
  ¾ www.icsi.berkeley.edu/~bilmes/phipac, initial automatic GEMM generation project
♦ FFTW, Fastest Fourier Transform in the West
  ¾ www.fftw.org
♦ UHFFT
  ¾ Tuning parallel FFT algorithms
  ¾ rodin.cs.uh.edu/~mirkovic/fft/parfft.htm
♦ SPIRAL
Machine-Assisted Application
♦ Communication libraries
Work in Progress:

  Message size   Algorithm    Segment size
  8              binomial     8
  16             binomial     16
  32             binary       32
  64             binomial     64
  128            binomial     128
  256            binomial     256
  512            binomial     512
  1K             sequential   1K
  2K             binary       2K
  4K             binary       2K
  8K             binary       2K
  16K            binary       4K
  32K            binary       4K
  64K            ring         4K
  128K           ring         4K
  256K           ring         4K
  512K           ring         4K
  1M             binary       4K
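A runtime can turn a tuning table like the one above into a simple dispatch rule: look up the message size, get back the measured-best algorithm and segment size. A minimal sketch (`pick_broadcast` and the table layout are hypothetical names, not an actual library API):

```python
# Tuning table: (max message size in bytes, algorithm, segment size in bytes).
# Entries mirror the measured table above.
BCAST_TABLE = [
    (8, "binomial", 8), (16, "binomial", 16), (32, "binary", 32),
    (64, "binomial", 64), (128, "binomial", 128), (256, "binomial", 256),
    (512, "binomial", 512), (1 << 10, "sequential", 1 << 10),
    (2 << 10, "binary", 2 << 10), (4 << 10, "binary", 2 << 10),
    (8 << 10, "binary", 2 << 10), (16 << 10, "binary", 4 << 10),
    (32 << 10, "binary", 4 << 10), (64 << 10, "ring", 4 << 10),
    (128 << 10, "ring", 4 << 10), (256 << 10, "ring", 4 << 10),
    (512 << 10, "ring", 4 << 10), (1 << 20, "binary", 4 << 10),
]

def pick_broadcast(msg_bytes):
    """Return (algorithm, segment size) for a broadcast of msg_bytes bytes."""
    for limit, algo, seg in BCAST_TABLE:
        if msg_bytes <= limit:
            return algo, seg
    return "binary", 4 << 10     # fall back to the largest-message entry
```

The table itself would be filled in by benchmarking each algorithm/segment pair on the target machine, which is exactly the work-in-progress being described.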
Reformulating/Rearranging/Reuse

  A_new = A - u*y^T - w*v^T
  y_new = A^T*u
  w_new = A_new*v

♦ Fetch each entry of A once
♦ Restructure and combine operations
♦ Results in a speedup of > 30%
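The single-pass idea behind these bullets can be sketched in plain Python: the rank-2 update of A, the transposed matrix-vector product, and the matrix-vector product are fused into one loop nest so every entry of A is loaded exactly once. The fusion is the point, not the particular kernel; note the sketch assumes y_new and w_new are both taken against the updated matrix.

```python
def fused_update(A, u, y, v, w):
    """One pass over A computing, simultaneously:
         A_new = A - u*y^T - w*v^T   (rank-2 update, in place)
         y_new = A_new^T * u
         w_new = A_new * v
    Each entry of A is read and written exactly once."""
    m, n = len(A), len(A[0])
    y_new = [0.0] * n
    w_new = [0.0] * m
    for i in range(m):
        Ai, ui, wi = A[i], u[i], w[i]
        acc = 0.0
        for j in range(n):
            aij = Ai[j] - ui * y[j] - wi * v[j]   # updated entry, touched once
            Ai[j] = aij
            y_new[j] += aij * ui                  # contribution to A_new^T * u
            acc += aij * v[j]                     # contribution to A_new * v
        w_new[i] = acc
    return y_new, w_new
```

Done as three separate BLAS calls, A would stream through memory three times; the fused loop does the same arithmetic with one third of the memory traffic, which is where the >30% speedup comes from.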
CG Variants by Dynamic Selection at Run Time
♦ Variants combine inner products to reduce the communication bottleneck at the expense of more scalar ops.
♦ Same number of iterations; no advantage on a sequential processor.
♦ With a large number of processors and a high-latency network there may be an advantage.
♦ Improvements can range from 15% to 50% depending on size.
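One such variant is the Chronopoulos/Gear formulation of CG, in which both inner products of a step are computed from the same residual, so a parallel implementation needs a single global reduction per iteration instead of two. (An illustrative choice; the slide does not name the specific variants.)

```python
def cg_single_reduction(A, b, iters=50):
    """CG in Chronopoulos/Gear form: (r,r) and (r,Ar) are computed together,
    so a parallel version needs one allreduce per iteration, not two."""
    n = len(b)
    mv = lambda M, x: [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
    dot = lambda x, y: sum(xi * yi for xi, yi in zip(x, y))

    x = [0.0] * n
    r = b[:]                                     # r = b - A*x with x = 0
    w = mv(A, r)                                 # w = A*r
    gamma, delta = dot(r, r), dot(r, w)
    alpha, p, s = gamma / delta, r[:], w[:]      # s tracks A*p by recurrence
    for _ in range(iters):
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * si for ri, si in zip(r, s)]
        w = mv(A, r)
        gamma_new, delta = dot(r, r), dot(r, w)  # the single fused reduction
        if gamma_new < 1e-30:
            break                                # converged
        beta = gamma_new / gamma
        alpha = gamma_new / (delta - beta * gamma_new / alpha)
        p = [ri + beta * pi for ri, pi in zip(r, p)]   # p = r + beta*p
        s = [wi + beta * si for wi, si in zip(w, s)]   # keeps s = A*p
        gamma = gamma_new
    return x
```

The extra scalar recurrences (the alpha update) are the "more scalar ops" the slide mentions; mathematically the iterates match standard CG, so the iteration count is unchanged.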
Gaussian Elimination
[Diagrams: the same elimination organized three ways]
♦ Standard way: subtract a multiple of a row.
♦ LINPACK: apply the sequence to a column — a2 = L^-1*a2, then a3 = a3 - a1*a2.
♦ LAPACK (blocked, block size nb): apply the sequence to nb columns, then apply those nb transformations to the rest of the matrix (the A21, A12, A22 blocks around the factored L).
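The LAPACK organization above can be sketched directly: factor an nb-column panel, apply its triangular solves to the trailing columns, then update the rest of the matrix with a matrix-matrix product. A minimal sketch, assuming no pivoting is needed:

```python
def lu_blocked(A, nb=2):
    """Right-looking blocked LU without pivoting (LAPACK-style organization).
    Overwrites A with L (unit lower, below diagonal) and U (upper)."""
    n = len(A)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # 1. Factor the nb-column panel with unblocked elimination.
        for j in range(k, k + kb):
            for i in range(j + 1, n):
                A[i][j] /= A[j][j]
                for c in range(j + 1, k + kb):
                    A[i][c] -= A[i][j] * A[j][c]
        # 2. Apply the panel's nb transformations to the rest of the matrix:
        #    U12 = L11^-1 * A12 (triangular solve), then A22 -= L21 * U12.
        for j in range(k + kb, n):
            for r in range(k, k + kb):          # forward substitution
                for i in range(r + 1, k + kb):
                    A[i][j] -= A[i][r] * A[r][j]
            for i in range(k + kb, n):          # rank-kb trailing update
                for r in range(k, k + kb):
                    A[i][j] -= A[i][r] * A[r][j]
    return A
```

The payoff is that almost all the arithmetic lands in the trailing update, which is matrix-matrix work (xGEMM) and therefore runs near peak on cached machines.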
Recursive Factorizations
[Chart: LU factorization on a 1 GHz AMD Athlon (~$1100 system) — MFlop/s vs. matrix order (500-5000); Recursive LU vs. LAPACK, dual-processor and uniprocessor curves]
ATLAS DGEMM & Recursive DGETRF, Pentium III 933 MHz
[Chart: SGEMM using SSE — MFlop/s vs. matrix order (100-1000)]
[Chart: DGEMM — MFlop/s vs. matrix order]
[Chart: S, D, C, and Z GEMM (S & C use SSE) — MFlop/s vs. matrix order (100-1000)]
SuperLU - High Performance Sparse Solvers
♦ SuperLU; X. Li and J. Demmel
  ¾ Solve sparse linear system Ax = b using Gaussian elimination.
  ¾ Versions for modern architectures:
  ¾ SuperLU: sequential machines
  ¾ SuperLU_MT: shared-memory machines
  ¾ SuperLU_DIST: distributed-memory parallel machines
Recursive Factorization Applied to Sparse Direct Methods
Victor Eijkhout, Piotr Luszczek & JD
[Figure: layout of the sparse recursive matrix]
1. Symbolic factorization
2. Search for blocks that contain non-zeros
3. Conversion to sparse recursive storage
4. Search for embedded blocks
5. Numerical factorization
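Step 2 above amounts to scanning a blocked partition of the matrix and keeping only the blocks that hold at least one non-zero. A toy sketch of that scan (the real code works on sparse storage, not a dense array):

```python
def nonzero_blocks(A, nb):
    """Return (block-row, block-col) indices of the nb x nb blocks of A
    that contain at least one non-zero entry."""
    n = len(A)
    blocks = []
    for bi in range(0, n, nb):
        for bj in range(0, n, nb):
            if any(A[i][j] != 0
                   for i in range(bi, min(bi + nb, n))
                   for j in range(bj, min(bj + nb, n))):
                blocks.append((bi // nb, bj // nb))
    return blocks
```

Only the surviving blocks are converted to the recursive storage of step 3, so fully zero blocks cost neither memory nor flops in the numerical factorization.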
Dense Recursive Factorization
♦ The algorithm:

function rlu(A)
begin
  rlu(A11);                    recursive call
  A21 ← A21 · U^-1(A11);       xTRSM() on upper triangular submatrix
  A12 ← L^-1(A11) · A12;       xTRSM() on lower triangular submatrix
  A22 ← A22 − A21 · A12;       xGEMM()
  rlu(A22);                    recursive call
end.

♦ No partial pivoting
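A runnable version of rlu(), splitting at the midpoint, with pure-Python stand-ins for the xTRSM and xGEMM calls (no pivoting, as the slide notes):

```python
def rlu(A, lo=0, hi=None):
    """Recursive LU without pivoting; overwrites A[lo:hi][lo:hi] with the
    L and U factors (L unit lower, below the diagonal; U on and above it)."""
    if hi is None:
        hi = len(A)
    if hi - lo == 1:
        return
    mid = (lo + hi) // 2
    rlu(A, lo, mid)                              # rlu(A11)
    # A21 <- A21 * U^-1(A11)   (triangular solve by columns, xTRSM)
    for j in range(lo, mid):
        for i in range(mid, hi):
            A[i][j] = (A[i][j]
                       - sum(A[i][k] * A[k][j] for k in range(lo, j))) / A[j][j]
    # A12 <- L^-1(A11) * A12   (forward substitution, unit diagonal, xTRSM)
    for j in range(mid, hi):
        for i in range(lo, mid):
            A[i][j] -= sum(A[i][k] * A[k][j] for k in range(lo, i))
    # A22 <- A22 - A21 * A12   (Schur complement update, xGEMM)
    for i in range(mid, hi):
        for j in range(mid, hi):
            A[i][j] -= sum(A[i][k] * A[k][j] for k in range(lo, mid))
    rlu(A, mid, hi)                              # rlu(A22)
```

Because the recursion keeps halving the problem, most of the flops end up in the xGEMM step on ever-larger blocks, which is what makes the recursive formulation cache-oblivious.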
Recursive Storage Conversion Steps
[Figures: matrix divided into 2x2 blocks; matrix with explicit 0's and fill-in]
Breakdown of Time Across Phases for the Recursive Sparse Factorization
[Stacked-bar chart: percentage of time in symbolic factorization, block conversion, recursive conversion, embedded blocking, and numerical factorization for the test matrices af23560, ex11, goodwin, jpwh_991, mcfe, memplus, olafu, orsreg_1, psmigr_1, raefsky_3, raefsky_4, saylr4, sherman3, sherman5, wang3]
SETI@home
♦ Use thousands of Internet-connected PCs to help in the search for extraterrestrial intelligence.
♦ Uses data collected with the Arecibo Radio Telescope, in Puerto Rico.
♦ When their computer is idle or being wasted, this software will download data for analysis.
♦ The results of this analysis are sent back.
♦ Largest distributed computation project in existence
  ¾ ~400,000 machines
♦ Today many companies …
[Diagram: taxonomy of concurrent systems — distributed vs. massively parallel, heterogeneous vs. homogeneous; examples include SETI@home, networks of workstations, Grid computing]
♦ "Napster on steroids."
NetSolve
Network Enabled Server
♦ NetSolve is an example of a grid based hardware/software server.
♦ Ease-of-use paramount
♦ Based on an RPC model but with …
[Diagram: clients (Matlab, Mathematica, C, Fortran, Java, Excel) send a request Op(C, A, B) to the NetSolve agent(s); the agent consults its schedule database and picks a server (S2!); the client ships A and B to the chosen server among S1-S4 and receives the answer C]
No knowledge of the grid required; RPC-like.
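The RPC-style flow in the diagram can be sketched with toy in-process stand-ins; `netsolve`, `agent_pick`, and the schedule table here are hypothetical names for illustration, not the actual NetSolve API:

```python
# Toy sketch of the NetSolve-style flow: client -> agent -> server -> client.
SERVERS = {
    "S1": {"load": 0.9}, "S2": {"load": 0.1},   # the agent's schedule database
    "S3": {"load": 0.5}, "S4": {"load": 0.7},
}

def agent_pick(op):
    """Agent: choose the least-loaded server for the requested operation."""
    return min(SERVERS, key=lambda s: SERVERS[s]["load"])

def server_execute(op, A, B):
    """Server: run the requested operation on the shipped operands."""
    if op == "add":
        return [a + b for a, b in zip(A, B)]
    raise ValueError("unknown operation: " + op)

def netsolve(op, A, B):
    """Client stub: looks like a local call; the grid is hidden behind it."""
    server = agent_pick(op)           # 1. ask the agent where to run
    return server_execute(op, A, B)   # 2. ship A, B to that server, get C back
```

From the client's point of view the call is indistinguishable from a local library routine, which is the "no knowledge of the grid required" property.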
Basic Usage Scenarios
♦ Grid based numerical library routines
Futures for Linear Algebra Numerical Algorithms and Software
♦ Numerical software will be adaptive, and determinism in numerical computing will be gone.
  ¾ After all, it's not reasonable to ask for exactness in numerical computations.
  ¾ Auditability of the computation, reproducibility at a cost
♦ Importance of floating point arithmetic will be undiminished.
  ¾ 16, 32, 64, 128 bits and beyond.
♦ Reproducibility, fault tolerance, and auditability
♦ Adaptivity is a key so applications can function appropriately
Contributors to These Ideas
♦ Top500
  ¾ Erich Strohmaier, LBL
  ¾ Hans Meuer, Mannheim U
♦ Linear Algebra
  ¾ Victor Eijkhout, UTK
  ¾ Piotr Luszczek, UTK
♦ For additional information see:
  http://developer.intel.com/software/products/mkl/mkllicense51_lnx.htm