Architectures on Linear Algebra and Numerical Libraries

Jack Dongarra
Innovative Computing Laboratory
University of Tennessee
High Performance Computers
♦ ~20 years ago
  ¾ 1x10^6 Floating Point Ops/sec (Mflop/s)
  ¾ Scalar based
♦ ~10 years ago
  ¾ 1x10^9 Floating Point Ops/sec (Gflop/s)
♦ Today
  ¾ 1x10^12 Floating Point Ops/sec (Tflop/s)
  ¾ Highly parallel, distributed processing, message passing, network based
  ¾ Data decomposition, communication/computation
♦ ~10 years away
  ¾ 1x10^15 Floating Point Ops/sec (Pflop/s)
  ¾ Many more levels of memory hierarchy; combination of grids & HPC
  ¾ More adaptive, latency and bandwidth aware, fault tolerant, extended precision, attention to SMP nodes
TOP500
[Chart: Linpack benchmark performance, Rate vs. Size]
♦ Updated twice a year:
  ¾ SC'xy in the States in November
  ¾ Meeting in Mannheim, Germany in June
♦ All data available from www.top500.org
Fastest Computer Over Time
[Chart: fastest computer by year, 1990-2000, in GFlop/s — Cray Y-MP (8), Fujitsu VP-2600, TMC CM-2 (2048), NEC SX-3 (4), TMC CM-5 (1024), Fujitsu VPP-500 (140), Intel Paragon (6788), Hitachi CP-PACS (2040)]
In 1980, a computation that took a full year to complete can now be done in ~13 minutes!
Fastest Computer Over Time
[Chart: fastest computer by year, 1990-2000, up to 5000 GFlop/s — the earlier systems plus Intel ASCI Red (9152), SGI ASCI Blue Mountain, ASCI Blue Pacific SST (5808), Intel ASCI Red Xeon (9632), ASCI White Pacific (7424)]
Notes: [60 Gflop/s - 400 Mflop/s] [4.9 Tflop/s - 55 Gflop/s], Schwab #15, 1/2 each year, 209 systems > 100 Gflop/s, faster than Moore's law.
Performance Development
[Chart: TOP500 performance, 1993-2009, log scale in GFlop/s — Sum, N=1, and N=500 trend lines; ASCI systems and the Earth Simulator near the top, "My Laptop" near the N=500 line; 1 TFlop/s and 1 PFlop/s levels marked]
Extrapolating the trends: entry at 1 Tflop/s by 2005 and 1 Pflop/s by 2010.
Architectures
[Chart: TOP500 architecture classes, Nov 1993 - Nov 2000, count 0-500 — Single Processor, SMP, MPP, and cluster/constellation classes; systems noted include Y-MP C90, SX3, VP500, SP2, ASCI Red, Sun HPC]
Current breakdown: 112 constellations, 28 clusters, 343 MPP, 17 SMP.
Chip Technology
[Chart: TOP500 processor families over time, 1993-2000, count 0-500 — Intel, IBM, SUN, MIPS, Alpha, HP, Inmos Transputer, Other]
High-Performance Computing Directions:
Beowulf-class PC Clusters

Definition:
♦ COTS PC Nodes
  ¾ Pentium, Alpha, PowerPC, SMP
♦ COTS LAN/SAN Interconnect
  ¾ Ethernet, Myrinet, Giganet, ATM
♦ Open Source Unix
  ¾ Linux, BSD

Advantages:
♦ Best price-performance
♦ Low entry-level cost
♦ Just-in-place configuration
♦ Vendor invulnerable
♦ Scalable
♦ Rapid technology tracking (2X/1.5 yr)
[Chart: Processor-Memory Performance Gap, 1980-2000, log scale — the gap grows 50% per year; DRAM improves only 9%/yr (2X/10 yrs)]
Optimizing Computation and Memory Use
♦ Computational optimizations
♦ Memory hierarchy optimizations
♦ Self-Adapting Numerical Software (SANS)
[Chart: matrix multiply (DGEMM) performance in MFLOPS across architectures — Vendor BLAS vs. ATLAS BLAS vs. F77 BLAS; machines include AMD Athlon, DEC ev5/ev6, HP PA-RISC, IBM Power2/Power3/PowerPC, Pentium Pro/II/III, SGI R10000/R12000, Sun UltraSparc2]
♦ ATLAS is faster than all other portable BLAS implementations, and it is comparable with machine-specific libraries provided by the vendor.
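The kind of optimization ATLAS automates can be illustrated with cache blocking. A minimal sketch of the idea, not ATLAS itself: tile the three GEMM loops so that small blocks of A, B, and C stay cache-resident while they are reused.

```python
def matmul_blocked(A, B, nb=2):
    """C = A*B with nb x nb cache blocking (illustrative sketch, not ATLAS).

    The ii/kk/jj loops walk over blocks; the inner i/k/j loops multiply one
    block, so each operand block is reused many times while it sits in cache.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):              # block row of C
        for kk in range(0, n, nb):          # block column of A / block row of B
            for jj in range(0, n, nb):      # block column of C
                for i in range(ii, min(ii + nb, n)):
                    for k in range(kk, min(kk + nb, n)):
                        aik = A[i][k]       # register-resident scalar
                        for j in range(jj, min(jj + nb, n)):
                            C[i][j] += aik * B[k][j]
    return C
```

ATLAS goes further by timing many block sizes, unrollings, and register tilings at install time and keeping the fastest variant for the machine.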
Intel PIII 933 MHz
MKL 5.0 vs ATLAS 3.2.0 using Windows 2000
[Chart: MFLOPS vs. matrix order (50-300) — Vendor BLAS, ATLAS BLAS, Reference BLAS]

Matrix-Vector Multiply (DGEMV)
[Chart: DGEMV performance in MFLOPS across architectures — F77, Vendor, and ATLAS versions, Trans and NoTrans variants; machines include DEC ev5/ev6, HP PA-RISC, IBM Power2/Power3, Pentium Pro/II/III, SGI R10000/R12000, Sun UltraSparc2]
LU Factorization, Recursive, w/ATLAS
[Chart: recursive LU performance in MFLOPS across architectures — Vendor BLAS, ATLAS BLAS, F77 BLAS; machines include AMD Athlon, DEC ev5/ev6, HP PA-RISC, IBM Power2/Power3/PPC, Pentium Pro/II/III, SGI R10000/R12000, Sun UltraSparc2]
♦ PHiPAC
  ¾ Portable High Performance ANSI C
  ¾ www.icsi.berkeley.edu/~bilmes/phipac, initial automatic GEMM generation project
♦ FFTW, Fastest Fourier Transform in the West
  ¾ www.fftw.org
♦ UHFFT
  ¾ Tuning parallel FFT algorithms
  ¾ rodin.cs.uh.edu/~mirkovic/fft/parfft.htm
♦ SPIRAL
Machine-Assisted Application
♦ Communication libraries
Work in Progress:

  Message size   Algorithm    Segment size
  8              binomial     8
  16             binomial     16
  32             binary       32
  64             binomial     64
  128            binomial     128
  256            binomial     256
  512            binomial     512
  1K             sequential   1K
  2K             binary       2K
  4K             binary       2K
  8K             binary       2K
  16K            binary       4K
  32K            binary       4K
  64K            ring         4K
  128K           ring         4K
  256K           ring         4K
  512K           ring         4K
  1M             binary       4K
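A runtime can turn a tuning table like the one above into a simple dispatch rule: look up the message size, get back the measured-best algorithm and segment size. A minimal sketch (`pick_broadcast` and the table layout are hypothetical names, not an actual library API):

```python
# Tuning table: (max message size in bytes, algorithm, segment size in bytes).
# Entries mirror the measured table above.
BCAST_TABLE = [
    (8, "binomial", 8), (16, "binomial", 16), (32, "binary", 32),
    (64, "binomial", 64), (128, "binomial", 128), (256, "binomial", 256),
    (512, "binomial", 512), (1 << 10, "sequential", 1 << 10),
    (2 << 10, "binary", 2 << 10), (4 << 10, "binary", 2 << 10),
    (8 << 10, "binary", 2 << 10), (16 << 10, "binary", 4 << 10),
    (32 << 10, "binary", 4 << 10), (64 << 10, "ring", 4 << 10),
    (128 << 10, "ring", 4 << 10), (256 << 10, "ring", 4 << 10),
    (512 << 10, "ring", 4 << 10), (1 << 20, "binary", 4 << 10),
]

def pick_broadcast(msg_bytes):
    """Return (algorithm, segment size) for a broadcast of msg_bytes bytes."""
    for limit, algo, seg in BCAST_TABLE:
        if msg_bytes <= limit:
            return algo, seg
    return "binary", 4 << 10     # fall back to the largest-message entry
```

The table itself would be filled in by benchmarking each algorithm/segment pair on the target machine, which is exactly the work-in-progress being described.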
Reformulating/Rearranging/Reuse

  A_new = A - u*y^T - w*v^T
  y_new = A^T*u
  w_new = A_new*v

♦ Fetch each entry of A once
♦ Restructure and combine operations
♦ Results in a speedup of > 30%
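The single-pass idea behind these bullets can be sketched in plain Python: the rank-2 update of A, the transposed matrix-vector product, and the matrix-vector product are fused into one loop nest so every entry of A is loaded exactly once. The fusion is the point, not the particular kernel; note the sketch assumes y_new and w_new are both taken against the updated matrix.

```python
def fused_update(A, u, y, v, w):
    """One pass over A computing, simultaneously:
         A_new = A - u*y^T - w*v^T   (rank-2 update, in place)
         y_new = A_new^T * u
         w_new = A_new * v
    Each entry of A is read and written exactly once."""
    m, n = len(A), len(A[0])
    y_new = [0.0] * n
    w_new = [0.0] * m
    for i in range(m):
        Ai, ui, wi = A[i], u[i], w[i]
        acc = 0.0
        for j in range(n):
            aij = Ai[j] - ui * y[j] - wi * v[j]   # updated entry, touched once
            Ai[j] = aij
            y_new[j] += aij * ui                  # contribution to A_new^T * u
            acc += aij * v[j]                     # contribution to A_new * v
        w_new[i] = acc
    return y_new, w_new
```

Done as three separate BLAS calls, A would stream through memory three times; the fused loop does the same arithmetic with one third of the memory traffic, which is where the >30% speedup comes from.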
CG Variants by Dynamic Selection at Run Time
♦ Variants combine inner products to reduce the communication bottleneck at the expense of more scalar ops.
♦ Same number of iterations; no advantage on a sequential processor.
♦ With a large number of processors and a high-latency network there may be an advantage.
♦ Improvements can range from 15% to 50% depending on size.
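One such variant is the Chronopoulos/Gear formulation of CG, in which both inner products of a step are computed from the same residual, so a parallel implementation needs a single global reduction per iteration instead of two. (An illustrative choice; the slide does not name the specific variants.)

```python
def cg_single_reduction(A, b, iters=50):
    """CG in Chronopoulos/Gear form: (r,r) and (r,Ar) are computed together,
    so a parallel version needs one allreduce per iteration, not two."""
    n = len(b)
    mv = lambda M, x: [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
    dot = lambda x, y: sum(xi * yi for xi, yi in zip(x, y))

    x = [0.0] * n
    r = b[:]                                     # r = b - A*x with x = 0
    w = mv(A, r)                                 # w = A*r
    gamma, delta = dot(r, r), dot(r, w)
    alpha, p, s = gamma / delta, r[:], w[:]      # s tracks A*p by recurrence
    for _ in range(iters):
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * si for ri, si in zip(r, s)]
        w = mv(A, r)
        gamma_new, delta = dot(r, r), dot(r, w)  # the single fused reduction
        if gamma_new < 1e-30:
            break                                # converged
        beta = gamma_new / gamma
        alpha = gamma_new / (delta - beta * gamma_new / alpha)
        p = [ri + beta * pi for ri, pi in zip(r, p)]   # p = r + beta*p
        s = [wi + beta * si for wi, si in zip(w, s)]   # keeps s = A*p
        gamma = gamma_new
    return x
```

The extra scalar recurrences (the alpha update) are the "more scalar ops" the slide mentions; mathematically the iterates match standard CG, so the iteration count is unchanged.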
Gaussian Elimination
[Diagrams: the same elimination organized three ways]
♦ Standard way: subtract a multiple of a row.
♦ LINPACK: apply the sequence to a column — a2 = L^-1*a2, then a3 = a3 - a1*a2.
♦ LAPACK (blocked, block size nb): apply the sequence to nb columns, then apply those nb transformations to the rest of the matrix (the A21, A12, A22 blocks around the factored L).
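The LAPACK organization above can be sketched directly: factor an nb-column panel, apply its triangular solves to the trailing columns, then update the rest of the matrix with a matrix-matrix product. A minimal sketch, assuming no pivoting is needed:

```python
def lu_blocked(A, nb=2):
    """Right-looking blocked LU without pivoting (LAPACK-style organization).
    Overwrites A with L (unit lower, below diagonal) and U (upper)."""
    n = len(A)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # 1. Factor the nb-column panel with unblocked elimination.
        for j in range(k, k + kb):
            for i in range(j + 1, n):
                A[i][j] /= A[j][j]
                for c in range(j + 1, k + kb):
                    A[i][c] -= A[i][j] * A[j][c]
        # 2. Apply the panel's nb transformations to the rest of the matrix:
        #    U12 = L11^-1 * A12 (triangular solve), then A22 -= L21 * U12.
        for j in range(k + kb, n):
            for r in range(k, k + kb):          # forward substitution
                for i in range(r + 1, k + kb):
                    A[i][j] -= A[i][r] * A[r][j]
            for i in range(k + kb, n):          # rank-kb trailing update
                for r in range(k, k + kb):
                    A[i][j] -= A[i][r] * A[r][j]
    return A
```

The payoff is that almost all the arithmetic lands in the trailing update, which is matrix-matrix work (xGEMM) and therefore runs near peak on cached machines.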
Recursive Factorizations
[Chart: LU factorization on a 1 GHz AMD Athlon (~$1100 system) — MFlop/s vs. matrix order (500-5000); Recursive LU vs. LAPACK, dual-processor and uniprocessor curves]
ATLAS DGEMM & Recursive DGETRF, Pentium III 933 MHz
[Chart: SGEMM using SSE — MFlop/s vs. matrix order (100-1000)]
[Chart: DGEMM — MFlop/s vs. matrix order]
[Chart: S, D, C, and Z GEMM (S & C use SSE) — MFlop/s vs. matrix order (100-1000)]
SuperLU - High Performance Sparse Solvers
♦ SuperLU; X. Li and J. Demmel
  ¾ Solve sparse linear system Ax = b using Gaussian elimination.
  ¾ Versions for modern architectures:
  ¾ SuperLU: sequential machines
  ¾ SuperLU_MT: shared-memory machines
  ¾ SuperLU_DIST: distributed-memory parallel machines
Recursive Factorization Applied to Sparse Direct Methods
Victor Eijkhout, Piotr Luszczek & JD
[Figure: layout of the sparse recursive matrix]
1. Symbolic factorization
2. Search for blocks that contain non-zeros
3. Conversion to sparse recursive storage
4. Search for embedded blocks
5. Numerical factorization
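Step 2 above amounts to scanning a blocked partition of the matrix and keeping only the blocks that hold at least one non-zero. A toy sketch of that scan (the real code works on sparse storage, not a dense array):

```python
def nonzero_blocks(A, nb):
    """Return (block-row, block-col) indices of the nb x nb blocks of A
    that contain at least one non-zero entry."""
    n = len(A)
    blocks = []
    for bi in range(0, n, nb):
        for bj in range(0, n, nb):
            if any(A[i][j] != 0
                   for i in range(bi, min(bi + nb, n))
                   for j in range(bj, min(bj + nb, n))):
                blocks.append((bi // nb, bj // nb))
    return blocks
```

Only the surviving blocks are converted to the recursive storage of step 3, so fully zero blocks cost neither memory nor flops in the numerical factorization.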
Dense Recursive Factorization
♦ The algorithm:

function rlu(A)
begin
  rlu(A11);                    recursive call
  A21 ← A21 · U^-1(A11);       xTRSM() on upper triangular submatrix
  A12 ← L^-1(A11) · A12;       xTRSM() on lower triangular submatrix
  A22 ← A22 − A21 · A12;       xGEMM()
  rlu(A22);                    recursive call
end.

♦ No partial pivoting
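A runnable version of rlu(), splitting at the midpoint, with pure-Python stand-ins for the xTRSM and xGEMM calls (no pivoting, as the slide notes):

```python
def rlu(A, lo=0, hi=None):
    """Recursive LU without pivoting; overwrites A[lo:hi][lo:hi] with the
    L and U factors (L unit lower, below the diagonal; U on and above it)."""
    if hi is None:
        hi = len(A)
    if hi - lo == 1:
        return
    mid = (lo + hi) // 2
    rlu(A, lo, mid)                              # rlu(A11)
    # A21 <- A21 * U^-1(A11)   (triangular solve by columns, xTRSM)
    for j in range(lo, mid):
        for i in range(mid, hi):
            A[i][j] = (A[i][j]
                       - sum(A[i][k] * A[k][j] for k in range(lo, j))) / A[j][j]
    # A12 <- L^-1(A11) * A12   (forward substitution, unit diagonal, xTRSM)
    for j in range(mid, hi):
        for i in range(lo, mid):
            A[i][j] -= sum(A[i][k] * A[k][j] for k in range(lo, i))
    # A22 <- A22 - A21 * A12   (Schur complement update, xGEMM)
    for i in range(mid, hi):
        for j in range(mid, hi):
            A[i][j] -= sum(A[i][k] * A[k][j] for k in range(lo, mid))
    rlu(A, mid, hi)                              # rlu(A22)
```

Because the recursion keeps halving the problem, most of the flops end up in the xGEMM step on ever-larger blocks, which is what makes the recursive formulation cache-oblivious.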
Recursive Storage Conversion Steps
[Figures: matrix divided into 2x2 blocks; matrix with explicit 0's and fill-in]
Breakdown of Time Across Phases for the Recursive Sparse Factorization
[Stacked-bar chart: percentage of time in symbolic factorization, block conversion, recursive conversion, embedded blocking, and numerical factorization for the test matrices af23560, ex11, goodwin, jpwh_991, mcfe, memplus, olafu, orsreg_1, psmigr_1, raefsky_3, raefsky_4, saylr4, sherman3, sherman5, wang3]
SETI@home
♦ Use thousands of Internet-connected PCs to help in the search for extraterrestrial intelligence.
♦ Uses data collected with the Arecibo Radio Telescope, in Puerto Rico.
♦ When their computer is idle or being wasted, this software will download data for analysis.
♦ The results of this analysis are sent back.
♦ Largest distributed computation project in existence
  ¾ ~400,000 machines
♦ Today many companies …
[Diagram: taxonomy of concurrent systems — distributed vs. massively parallel, heterogeneous vs. homogeneous; examples include SETI@home, networks of workstations, Grid computing]
♦ "Napster on steroids."
NetSolve
Network Enabled Server
♦ NetSolve is an example of a grid based hardware/software server.
♦ Ease-of-use paramount
♦ Based on an RPC model but with …
[Diagram: clients (Matlab, Mathematica, C, Fortran, Java, Excel) send a request Op(C, A, B) to the NetSolve agent(s); the agent consults its schedule database and picks a server (S2!); the client ships A and B to the chosen server among S1-S4 and receives the answer C]
No knowledge of the grid required; RPC-like.
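The RPC-style flow in the diagram can be sketched with toy in-process stand-ins; `netsolve`, `agent_pick`, and the schedule table here are hypothetical names for illustration, not the actual NetSolve API:

```python
# Toy sketch of the NetSolve-style flow: client -> agent -> server -> client.
SERVERS = {
    "S1": {"load": 0.9}, "S2": {"load": 0.1},   # the agent's schedule database
    "S3": {"load": 0.5}, "S4": {"load": 0.7},
}

def agent_pick(op):
    """Agent: choose the least-loaded server for the requested operation."""
    return min(SERVERS, key=lambda s: SERVERS[s]["load"])

def server_execute(op, A, B):
    """Server: run the requested operation on the shipped operands."""
    if op == "add":
        return [a + b for a, b in zip(A, B)]
    raise ValueError("unknown operation: " + op)

def netsolve(op, A, B):
    """Client stub: looks like a local call; the grid is hidden behind it."""
    server = agent_pick(op)           # 1. ask the agent where to run
    return server_execute(op, A, B)   # 2. ship A, B to that server, get C back
```

From the client's point of view the call is indistinguishable from a local library routine, which is the "no knowledge of the grid required" property.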
Basic Usage Scenarios
♦ Grid based numerical library routines
Futures for Linear Algebra Numerical Algorithms and Software
♦ Numerical software will be adaptive, and determinism in numerical computing will be gone.
  ¾ After all, it's not reasonable to ask for exactness in numerical computations.
  ¾ Auditability of the computation, reproducibility at a cost
♦ Importance of floating point arithmetic will be undiminished.
  ¾ 16, 32, 64, 128 bits and beyond.
♦ Reproducibility, fault tolerance, and auditability
♦ Adaptivity is a key so applications can function appropriately
Contributors to These Ideas
♦ Top500
  ¾ Erich Strohmaier, LBL
  ¾ Hans Meuer, Mannheim U
♦ Linear Algebra
  ¾ Victor Eijkhout, UTK
  ¾ Piotr Luszczek, UTK
♦ For additional information see:
  http://developer.intel.com/software/products/mkl/mkllicense51_lnx.htm