
The Impact of Computer Architectures on Linear Algebra and Numerical Libraries

Jack Dongarra
Innovative Computing Laboratory
University of Tennessee
High Performance Computers

♦ ~ 20 years ago
¾ 1x10^6 Floating Point Ops/sec (Mflop/s)
¾ Scalar based
♦ ~ 10 years ago
¾ 1x10^9 Floating Point Ops/sec (Gflop/s)
¾ Vector & shared memory computing, bandwidth aware
¾ Block partitioned, latency tolerant
♦ ~ Today
¾ 1x10^12 Floating Point Ops/sec (Tflop/s)
¾ Highly parallel, distributed processing, message passing, network based
¾ Data decomposition, communication/computation
♦ ~ 10 years away
¾ 1x10^15 Floating Point Ops/sec (Pflop/s)
¾ Many more levels of memory hierarchy, combination of grids & HPC
¾ More adaptive, latency tolerant and bandwidth aware, fault tolerant, extended precision, attention to SMP nodes
TOP500

- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP benchmark: Ax=b, dense problem (TPP performance, rate vs. size)
- Updated twice a year:
  SC'xy in the States in November
  Meeting in Mannheim, Germany in June
- All data available from www.top500.org
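
As a back-of-the-envelope illustration of the yardstick (a sketch, not the benchmark code itself): the reported rate is the operation count of the dense LU solve, about 2/3·n^3 + 2·n^2 flops, divided by wall-clock time. The n and timing below are made-up values.

    #include <stdio.h>

    /* Illustrative only: how a LINPACK-style rate is derived.
       n and seconds are hypothetical inputs, not benchmark output. */
    int main(void)
    {
        double n = 1000.0, seconds = 0.30;       /* hypothetical run  */
        double flops = 2.0/3.0*n*n*n + 2.0*n*n;  /* LU solve op count */
        printf("%.1f Mflop/s\n", flops / seconds / 1e6);
        return 0;
    }
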
Fastest Computer Over Time

[Chart: peak performance 1990-2000, 0-50 GFlop/s. Machines marked: Cray Y-MP (8), Fujitsu VP-2600, TMC CM-2 (2048).]

In 1980 a computation that took 1 full year to complete can now be done in ~9 hours!
Fastest Computer Over Time

[Chart: peak performance 1990-2000, 0-500 GFlop/s. Machines marked: Cray Y-MP (8), Fujitsu VP-2600, TMC CM-2 (2048), NEC SX-3 (4), TMC CM-5 (1024), Fujitsu VPP-500 (140), Intel Paragon (6788), Hitachi CP-PACS (2040).]

In 1980 a computation that took 1 full year to complete can now be done in ~13 minutes!
Fastest Computer Over Time

[Chart: peak performance 1990-2000, 0-5000 GFlop/s. Machines marked: Cray Y-MP (8), Fujitsu VP-2600, TMC CM-2 (2048), NEC SX-3 (4), TMC CM-5 (1024), Fujitsu VPP-500 (140), Intel Paragon (6788), Hitachi CP-PACS (2040), Intel ASCI Red (9152), SGI ASCI Blue Mountain (5040), ASCI Blue Pacific SST (5808), Intel ASCI Red Xeon (9632), IBM ASCI White Pacific (7424).]

In 1980 a computation that took 1 full year to complete can today be done in ~90 seconds!
Top 10 Machines (Nov 2000)

Rank | Company   | Machine                            | Procs | Gflop/s | Place                                                       | Country | Year
1    | IBM       | ASCI White                         | 8192  | 4938    | Lawrence Livermore National Laboratory, Livermore           | USA     | 2000
2    | Intel     | ASCI Red                           | 9632  | 2380    | Sandia National Labs, Albuquerque                           | USA     | 1999
3    | IBM       | ASCI Blue-Pacific SST, IBM SP 604e | 5808  | 2144    | Lawrence Livermore National Laboratory, Livermore           | USA     | 1999
4    | SGI       | ASCI Blue Mountain                 | 6144  | 1608    | Los Alamos National Laboratory, Los Alamos                  | USA     | 1998
5    | IBM       | SP Power3 375 MHz                  | 1336  | 1417    | Naval Oceanographic Office (NAVOCEANO)                      | USA     | 2000
6    | IBM       | SP Power3 375 MHz                  | 1104  | 1179    | National Center for Environmental Prediction                | USA     | 2000
7    | Hitachi   | SR8000-F1/112                      | 112   | 1035    | Leibniz Rechenzentrum Muenchen                              | Germany | 2000
8    | IBM       | SP Power3 375 MHz, 8 way           | 1152  | 929     | UCSD/San Diego Supercomputer Center                         | USA     | 2000
9    | Hitachi   | SR8000-F1/100                      | 100   | 917     | High Energy Accelerator Research Organization /KEK, Tsukuba | Japan   | 2000
10   | Cray Inc. | T3E1200                            | 1084  | 892     | Government                                                  | USA     | 1998
Performance Development

[Chart: TOP500 performance, Jun 1993 - Nov 2000, log scale from 100 Mflop/s to 100 Tflop/s. SUM reaches 88.1 TF/s; N=1 reaches 4.9 TF/s; N=500 reaches 55.1 GF/s. N=1 systems marked: Fujitsu 'NWT' NAL, Intel XP/S140 Sandia, Hitachi/Tsukuba CP-PACS/2048, Intel ASCI Red Sandia, IBM ASCI White. N=500 systems marked: Cray Y-MP M94/4, Cray Y-MP C94/364, SNI VP200EX, SGI POWER CHALLENGE, Sun Ultra HPC, Sun HPC 10000, IBM 604e (69 proc); sites include Uni Dresden, KFA Jülich, 'EPA' USA, GOODYEAR, Merrill Lynch, News International, A&P.]

Notes: range grew from [60 Gflop/s - 400 Mflop/s] to [4.9 Tflop/s - 55 Gflop/s]; Schwab at #15; 1/2 each year; 209 systems > 100 Gflop/s; faster than Moore's law.
Performance Development

[Chart: extrapolated TOP500 performance, 1993-2009, log scale 0.1 GFlop/s to 1 PFlop/s, with Sum, N=1, and N=500 trend lines; markers for ASCI, the Earth Simulator, and 'My Laptop'; 1 TFlop/s and 1 PFlop/s thresholds shown.]

Extrapolation: 1 Tflop/s entry level by 2005 and 1 Pflop/s by 2010.

Architectures

[Chart: TOP500 architecture mix, 1993-2000, counts 0-500: Single Processor, SMP, SIMD, MPP, Constellation, Cluster (NOW). Systems marked: CM2 (SIMD); VP500, Y-MP C90, SX3, Sun HPC (SMP); Paragon, CM5, T3D, T3E, SP2, ASCI Red (MPP); cluster of Sun HPC (Cluster - NOW).]

Nov 2000 breakdown: 112 constellations, 28 clusters, 343 MPPs, 17 SMPs.
Chip Technology

[Chart: TOP500 processor families, Nov 1993 - Nov 2000, counts 0-500: Intel, IBM, SUN, MIPS, Alpha, HP, Inmos Transputer, Other.]
High-Performance Computing Directions:
Beowulf-class PC Clusters

Definition:
♦ COTS PC Nodes
¾ Pentium, Alpha, PowerPC, SMP
♦ COTS LAN/SAN Interconnect
¾ Ethernet, Myrinet, Giganet, ATM
♦ Open Source Unix
¾ Linux, BSD
♦ Message Passing Computing
¾ MPI, PVM
¾ HPF

Advantages:
♦ Best price-performance
♦ Low entry-level cost
♦ Just-in-place configuration
♦ Vendor invulnerable
♦ Scalable
♦ Rapid technology tracking

Enabled by PC hardware, networks and operating systems achieving the capabilities of scientific workstations at a fraction of the cost, and by the availability of industry-standard message passing libraries. However, much more of a contact sport.
Where Does the Performance Go? or
Why Should I Care About the Memory Hierarchy?

Processor-DRAM Memory Gap (latency)

[Chart: relative performance, 1980-2000, log scale 1-1000. CPU ("Moore's Law"): µProc speed grows 60%/yr (2X/1.5yr). DRAM: grows 9%/yr (2X/10 yrs). The processor-memory performance gap grows 50%/year.]
Optimizing Computation and Memory Use

♦ Computational optimizations
¾ Theoretical peak: (# fpus) * (flops/cycle) * MHz
¾ PIII: (1 fpu) * (1 flop/cycle) * (850 MHz) = 850 MFLOP/s
¾ Athlon: (2 fpus) * (1 flop/cycle) * (600 MHz) = 1200 MFLOP/s
¾ Power3: (2 fpus) * (2 flops/cycle) * (375 MHz) = 1500 MFLOP/s
♦ Operations like:
¾ α = xᵀy: 2 operands (16 bytes) needed for 2 flops; at 850 Mflop/s this requires 1700 MW/s of bandwidth
¾ y = αx + y: 3 operands (24 bytes) needed for 2 flops; at 850 Mflop/s this requires 2550 MW/s of bandwidth
♦ Memory optimization
¾ Theoretical peak: (bus width) * (bus speed)
¾ PIII: (32 bits) * (133 MHz) = 532 MB/s = 66.5 MW/s
¾ Athlon: (64 bits) * (133 MHz) = 1064 MB/s = 133 MW/s
¾ Power3: (128 bits) * (100 MHz) = 1600 MB/s = 200 MW/s
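
To make the operand counts concrete, here are plain-C stand-ins for the two Level 1 operations above (sketches, not the tuned BLAS): both perform 2 flops per element but touch 2-3 memory words per element, which is why they run at bus speed rather than at peak.

    /* Plain-C stand-ins for the Level 1 BLAS operations above.
       Both do 2 flops per element but need 2-3 memory operands,
       so they are bandwidth bound on all three machines listed. */
    double ddot(int n, const double *x, const double *y)
    {
        double a = 0.0;
        for (int i = 0; i < n; i++)
            a += x[i] * y[i];         /* 2 flops, 2 loads per i */
        return a;
    }

    void daxpy(int n, double alpha, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] += alpha * x[i];     /* 2 flops, 2 loads + 1 store per i */
    }
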

Memory Hierarchy

♦ By taking advantage of the principle of locality:
¾ Present the user with as much memory as is available in the cheapest technology.
¾ Provide access at the speed offered by the fastest technology.

[Diagram: processor (control, datapath, registers) backed by on-chip cache (SRAM), Level 2 and 3 cache, main memory (DRAM), distributed memory and remote cluster memory, secondary storage (disk), and tertiary storage (disk/tape).]

Level:         Registers   On-Chip Cache (SRAM)   Main Memory (DRAM)   Secondary Storage (Disk)   Tertiary Storage (Disk/Tape)
Speed (ns):    1s          10s                    100s                 10,000,000s (10s ms)       10,000,000,000s (10s sec)
Size (bytes):  100s        Ks                     Ms                   Gs                         Ts
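
A minimal sketch of what exploiting locality means in code, assuming a tunable tile size NB (a placeholder, not a recommended value): a blocked traversal touches each cache-resident tile completely before moving on, instead of striding across whole rows.

    #define NB 64   /* placeholder block size; would be tuned per machine */

    /* Transpose with blocking: each NB x NB tile is processed while
       it is resident in cache, instead of striding across the array. */
    void transpose_blocked(int n, const double *a, double *b)
    {
        for (int ii = 0; ii < n; ii += NB)
            for (int jj = 0; jj < n; jj += NB)
                for (int i = ii; i < ii + NB && i < n; i++)
                    for (int j = jj; j < jj + NB && j < n; j++)
                        b[j * n + i] = a[i * n + j];
    }
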
Self-Adapting Numerical Software (SANS)

♦ Today's processors can achieve high performance, but this requires extensive machine-specific hand tuning.
♦ Operations like the BLAS require many man-hours per platform
• Software lags far behind hardware introduction
• Only done if financial incentive is there
♦ Hardware, compilers, and software have a large design space w/ many parameters
¾ Blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules.
¾ Complicated interactions with the increasingly sophisticated micro-architectures of new microprocessors.
♦ Need for quick/dynamic deployment of optimized routines.
♦ ATLAS - Automatically Tuned Linear Algebra Software
Software Generation Strategy

♦ Level 1 cache multiply optimizes for:
¾ TLB access
¾ L1 cache reuse
¾ FP unit usage
¾ Memory fetch
¾ Register reuse
¾ Loop overhead minimization
♦ Code is iteratively generated & timed until the optimal case is found. We try:
¾ Differing NBs
¾ Breaking false dependencies
¾ M, N and K loop unrolling
♦ Takes about 30 minutes to run.
♦ Designed for RISC arch
¾ Super scalar
¾ Needs reasonable C compiler
♦ "New" model of high performance programming where critical code is machine generated using parameter optimization.
♦ Today ATLAS is in use by Matlab, Mathematica, Octave, Maple, Debian, Scyld Beowulf, SuSE, …
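
A toy version of the generate-and-time loop might look like the sketch below; the candidate block sizes and the simple blocked kernel are stand-ins for the code variants ATLAS actually emits, and a real harness would flush caches and repeat each timing.

    #include <stdio.h>
    #include <time.h>

    #define N 256
    static double A[N*N], B[N*N], C[N*N];

    /* Candidate kernel: blocked matrix multiply with block size nb
       (each candidate nb divides N, so no edge cases here). */
    static void mm_blocked(int nb)
    {
        for (int ii = 0; ii < N; ii += nb)
          for (int jj = 0; jj < N; jj += nb)
            for (int kk = 0; kk < N; kk += nb)
              for (int i = ii; i < ii + nb; i++)
                for (int j = jj; j < jj + nb; j++) {
                    double s = C[i*N+j];
                    for (int k = kk; k < kk + nb; k++)
                        s += A[i*N+k] * B[k*N+j];
                    C[i*N+j] = s;
                }
    }

    int main(void)
    {
        int cand[] = {16, 32, 64, 128}, best = 0;
        double best_t = 1e30;
        for (int c = 0; c < 4; c++) {          /* time each candidate */
            clock_t t0 = clock();
            mm_blocked(cand[c]);
            double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
            if (t < best_t) { best_t = t; best = cand[c]; }
        }
        printf("best NB = %d\n", best);        /* keep the fastest   */
        return 0;
    }
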
ATLAS (DGEMM n = 500)

[Bar chart: MFLOPS (0-900) for Vendor BLAS, ATLAS BLAS, and F77 BLAS across architectures including AMD Athlon-600, DEC ev56-533, DEC ev6-500, HP 9000/735/135, IBM Power2-160, IBM Power3-200, Pentium Pro-200, Pentium II-266, Pentium III-933, SGI R10000ip28-200, SGI R12000ip30-270, and Sun UltraSparc2-200.]

♦ ATLAS is faster than all other portable BLAS implementations and it is comparable with machine-specific libraries provided by the vendor.
Intel PIII 933 MHz
MKL 5.0 vs ATLAS 3.2.0 using Windows 2000

[Bar chart: MFLOPS (0-800) for Vendor BLAS, ATLAS BLAS, and Reference BLAS on the Level 3 BLAS routines DGEMM, DSYMM, DSYRK, DSYR2K, DTRMM, DTRSM.]

♦ ATLAS is faster than all other portable BLAS implementations and it is comparable with machine-specific libraries provided by the vendor.
Matrix Vector Multiply DGEMV

[Bar chart: MFLOPS (0-300) for F77, Vendor, and ATLAS DGEMV, in NoTrans and Trans variants, across the same architectures: AMD Athlon-600, DEC ev56-533, DEC ev6-500, HP 9000/735/135, IBM Power2-160, IBM Power3-200, Pentium Pro-200, Pentium II-266, Pentium III-933, SGI R10000ip28-200, SGI R12000ip30-270, Sun UltraSparc2-200.]
LU Factorization, Recursive, w/ATLAS

[Bar chart: MFLOPS (0-700) for Vendor BLAS, ATLAS BLAS, and F77 BLAS on the recursive LU factorization across the same set of architectures (AMD Athlon-600, DEC ev56-533, DEC ev6-500, HP 9000/735/135, IBM Power2-160, IBM Power3-200, Pentium Pro-200, Pentium II-266, Pentium III-933, SGI R10000ip28-200, SGI R12000ip30-270, Sun UltraSparc2-200).]

♦ ATLAS is faster than all other portable BLAS implementations and it is comparable with machine-specific libraries provided by the vendor.
Related Tuning Projects

♦ PHiPAC
¾ Portable High Performance ANSI C
¾ www.icsi.berkeley.edu/~bilmes/phipac
¾ Initial automatic GEMM generation project
♦ FFTW - Fastest Fourier Transform in the West
¾ www.fftw.org
♦ UHFFT
¾ Tuning parallel FFT algorithms
¾ rodin.cs.uh.edu/~mirkovic/fft/parfft.htm
♦ SPIRAL
¾ Signal Processing Algorithms Implementation Research for Adaptable Libraries
¾ Maps DSP algorithms to architectures
♦ Sparsity
¾ Sparse-matrix-vector and sparse-matrix-matrix multiplication
¾ www.cs.berkeley.edu/~ejim/publication/
¾ Tunes code to the sparsity structure of the matrix; more later in this tutorial
ATLAS Matrix Multiply
(64 & 32 bit floating point results)

[Line chart: Mflop/s (0-4500) vs matrix size (100-1000) for Intel P4 1.5 GHz with SSE (32-bit floating point), Intel P4 1.5 GHz, AMD Athlon 1 GHz, and Intel IA64 666 MHz.]
Machine-Assisted Application Development and Adaptation

♦ Communication libraries
¾ Optimize for the specifics of one's configuration.
♦ Algorithm layout and implementation
¾ Look at the different ways to express the implementation
Work in Progress:
ATLAS-like Approach Applied to Broadcast
(PII 8-way cluster with 100 Mb/s switched network)

[Diagram: broadcast topologies tried from the root: sequential, binary, binomial, ring.]

Message Size (bytes)   Optimal algorithm   Buffer Size (bytes)
8                      binomial            8
16                     binomial            16
32                     binary              32
64                     binomial            64
128                    binomial            128
256                    binomial            256
512                    binomial            512
1K                     sequential          1K
2K                     binary              2K
4K                     binary              2K
8K                     binary              2K
16K                    binary              4K
32K                    binary              4K
64K                    ring                4K
128K                   ring                4K
256K                   ring                4K
512K                   ring                4K
1M                     binary              4K
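
A hedged sketch of how such a tuned table could be consulted at run time: the cutoffs mirror the measurements above, while the enum of topologies and the selection function are hypothetical, not an actual library interface.

    #include <stddef.h>

    typedef enum { SEQUENTIAL, BINARY, BINOMIAL, RING } bcast_alg_t;

    /* Pick the broadcast algorithm and buffer (segment) size from
       the table measured on the PII cluster above; the thresholds
       interpolate the measured points.  A real library would load
       them from its tuning run. */
    static bcast_alg_t choose_bcast(size_t bytes, size_t *buf)
    {
        if (bytes <= 512)  { *buf = bytes;             /* one segment */
            return bytes == 32 ? BINARY : BINOMIAL; }
        if (bytes <= 1024)   { *buf = 1024; return SEQUENTIAL; }
        if (bytes <= 8192)   { *buf = 2048; return BINARY; }
        if (bytes <= 32768)  { *buf = 4096; return BINARY; }
        if (bytes <= 524288) { *buf = 4096; return RING; }
        *buf = 4096; return BINARY;                    /* 1M case     */
    }
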
Reformulating/Rearranging/Reuse

♦ Example is the reduction to narrow band form for the SVD:

A_new = A - u yᵀ - w vᵀ
y_new = Aᵀ u
w_new = A_new v

♦ Fetch each entry of A once
♦ Restructure and combine operations
♦ Results in a speedup of > 30%
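
A sketch of the fusion, written against the update as reconstructed above (so the exact vectors are an assumption; fused_update and the row-major layout are illustrative): one sweep applies the rank-2 update and accumulates both matrix-vector products, fetching each entry of A exactly once.

    /* Fused two-sided update: forms Anew = A - u*y^T - w*v^T in place
       and accumulates ynew = A^T u (pre-update A) and wnew = Anew*v in
       the same sweep, so each entry of A is fetched exactly once. */
    void fused_update(int n, double *A,              /* n x n, row-major */
                      const double *u, const double *y,
                      const double *v, const double *w,
                      double *ynew, double *wnew)
    {
        for (int i = 0; i < n; i++) ynew[i] = wnew[i] = 0.0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double aold = A[i*n + j];
                double anew = aold - u[i]*y[j] - w[i]*v[j];
                A[i*n + j] = anew;        /* updated entry, written once */
                ynew[j] += aold * u[i];   /* ynew = A^T u                */
                wnew[i] += anew * v[j];   /* wnew = Anew v               */
            }
        }
    }
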
CG Variants by Dynamic Selection at Run Time

♦ Variants combine inner products to reduce the communication bottleneck, at the expense of more scalar ops.
♦ Same number of iterations, no advantage on a sequential processor
♦ With a large number of processors and a high-latency network there may be an advantage.
♦ Improvements can range from 15% to 50% depending on size.
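
A minimal sketch of the communication rearrangement (generic, not Dongarra's specific CG variant; merged_dots is an illustrative name): two inner products that standard CG reduces separately are packed into a single MPI_Allreduce, so the per-iteration latency cost is paid once.

    #include <mpi.h>

    /* Combine two partial inner products into a single allreduce.
       Standard CG does two separate reductions per iteration; the
       rearranged variants pay extra scalar work to merge them. */
    void merged_dots(int nloc, const double *r, const double *z,
                     const double *p, const double *q,
                     double *rz, double *pq, MPI_Comm comm)
    {
        double loc[2] = {0.0, 0.0}, glob[2];
        for (int i = 0; i < nloc; i++) {
            loc[0] += r[i] * z[i];
            loc[1] += p[i] * q[i];
        }
        MPI_Allreduce(loc, glob, 2, MPI_DOUBLE, MPI_SUM, comm);
        *rz = glob[0];
        *pq = glob[1];
    }
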
Gaussian Elimination

[Diagram: three ways to organize the elimination.]
Standard way: subtract a multiple of a row.
LINPACK: apply the sequence of eliminations to a column.
LAPACK: apply the sequence to a panel of nb columns (a2 = L⁻¹ a2), then apply the panel to the rest of the matrix (a3 = a3 - a1*a2).

Gaussian Elimination via a Recursive Algorithm
F. Gustavson and S. Toledo

LU Algorithm:
1: Split matrix into two rectangles (m x n/2); if only 1 column, scale by reciprocal of pivot & return
2: Apply LU algorithm to the left part
3: Apply transformations to right part (triangular solve A12 = L⁻¹A12 and matrix multiplication A22 = A22 - A21*A12)
4: Apply LU algorithm to right part

[Diagram: block matrix with L and A12 on top, A21 and A22 below.]

Most of the work is in the matrix multiply; it operates on matrices of size n/2, n/4, n/8, …
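
A compact C sketch of the recursion, assuming column-major storage and no pivoting (the real algorithm pivots); the two inner loop nests stand in for the dtrsm and dgemm calls a production code would make.

    #include <stdio.h>

    /* Sketch of the recursive LU (no pivoting, column-major, leading
       dimension lda); L and U overwrite A. */
    static void rlu(int m, int n, double *A, int lda)
    {
        if (n == 1) {                         /* one column: scale by   */
            for (int i = 1; i < m; i++)       /* reciprocal of pivot    */
                A[i] /= A[0];
            return;
        }
        int n1 = n / 2, n2 = n - n1;
        double *R = A + (size_t)n1 * lda;     /* right half of panel    */

        rlu(m, n1, A, lda);                   /* factor left part       */

        for (int j = 0; j < n2; j++)          /* A12 <- L11^{-1} A12    */
            for (int k = 0; k < n1; k++)      /* (the dtrsm step)       */
                for (int i = k + 1; i < n1; i++)
                    R[i + j*lda] -= A[i + k*lda] * R[k + j*lda];

        for (int j = 0; j < n2; j++)          /* A22 <- A22 - A21*A12   */
            for (int k = 0; k < n1; k++)      /* (the dgemm step)       */
                for (int i = n1; i < m; i++)
                    R[i + j*lda] -= A[i + k*lda] * R[k + j*lda];

        rlu(m - n1, n2, R + n1, lda);         /* factor right part      */
    }

    int main(void)
    {
        double A[4] = {4, 2,    /* column-major 2x2: [[4, 6], [2, 4]] */
                       6, 4};
        rlu(2, 2, A, 2);
        printf("L21=%g U11=%g U12=%g U22=%g\n", A[1], A[0], A[2], A[3]);
        return 0;
    }
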

Recursive Factorizations

♦ Just as accurate as the conventional method
♦ Same number of operations
♦ Automatic variable blocking
¾ Level 1 and 3 BLAS only!
♦ Extreme clarity and simplicity of expression
♦ Highly efficient
♦ The recursive formulation is just a rearrangement of the point-wise LINPACK algorithm
♦ The standard error analysis applies (assuming the matrix operations are computed the "conventional" way).
ATLAS DGEMM & Recursive DGETRF
Pentium III 550 MHz Dual Processor; AMD Athlon 1 GHz (~$1100 system)
LU Factorization

[Charts: Mflop/s vs matrix order (500-5000), comparing Recursive LU against LAPACK, for dual-processor and uniprocessor runs.]
ATLAS SGEMM & Recursive DGETRF
Pentium III 933 MHz (Windows 2000); AMD Athlon 1 GHz (~$1100 system), using SSE

[Charts: Mflop/s (0-2000) vs matrix order (100-1000).]
ATLAS DGEMM & Recursive DGETRF
Pentium III 933 MHz (Windows 2000); AMD Athlon 1 GHz (~$1100 system)

[Charts: Mflop/s vs matrix order (100-1000).]
ATLAS S, D, C, and Z GEMM & Recursive DGETRF
Pentium III 933 MHz (Windows 2000); AMD Athlon 1 GHz (~$1100 system); S & C use SSE

[Charts: Mflop/s (0-2000) vs matrix order (100-1000).]
SuperLU - High Performance Sparse Solvers

♦ SuperLU; X. Li and J. Demmel
¾ Solves sparse linear systems Ax = b using Gaussian elimination.
¾ Efficient and portable implementation on modern architectures:
¾ Sequential SuperLU: PCs and workstations; achieved up to 40% of peak megaflop rate
¾ SuperLU_MT: shared-memory parallel machines; achieved up to 10-fold speedup
¾ SuperLU_DIST: distributed-memory parallel machines; achieved up to 100-fold speedup
¾ Supports real and complex matrices, fill-reducing orderings, equilibration, numerical pivoting, condition estimation, iterative refinement, and error bounds.
♦ Enabled scientific discovery
¾ First solution to quantum scattering of 3 charged particles. [Rescigno, Baertschy, Isaacs & McCurdy, Science, 24 Dec 1999]
¾ SuperLU solved complex unsymmetric systems of order up to 1.79 million, on the ASCI Blue Pacific computer at LLNL.
Recursive Factorization Applied to Sparse Direct Methods
Victor Eijkhout, Piotr Luszczek & JD

[Figure: layout of a sparse recursive matrix.]

1. Symbolic factorization
2. Search for blocks that contain non-zeros
3. Conversion to sparse recursive storage
4. Search for embedded blocks
5. Numerical factorization
Dense recursive factorization

♦ The algorithm:
function rlu(A)
begin
  rlu(A11);                    recursive call
  A21 ← A21 · U⁻¹(A11);        xTRSM() on upper triangular submatrix
  A12 ← L⁻¹(A11) · A12;        xTRSM() on lower triangular submatrix
  A22 ← A22 - A21 · A12;       xGEMM()
  rlu(A22);                    recursive call
end.

♦ Replace xTRSM and xGEMM with sparse implementations that are themselves recursive
Sparse Recursive Factorization Algorithm

♦ Solutions - continued
¾ Fast sparse xGEMM() is a two-level algorithm:
¾ Recursive operation on sparse data structures
¾ Dense xGEMM() call when recursion reaches a single block
♦ Uses Reverse Cuthill-McKee ordering, causing fill-in around the band
♦ No partial pivoting
¾ Use iterative improvement, or
¾ Pivot only within blocks
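
A skeleton of the two-level idea under stated assumptions: the quadtree node type is a placeholder for the actual sparse recursive storage, and the block structure is assumed fixed in advance by the symbolic factorization, so zero blocks are simply skipped.

    #define NB 40   /* leaf block size; placeholder value */

    /* Quadtree of blocks: NULL child = structurally zero block;
       a leaf holds a dense NB x NB block.  The tree is assumed
       consistent (all three operands reach leaves together). */
    typedef struct node {
        struct node *q[2][2];   /* four children, NULL while zero     */
        double *dense;          /* leaf: dense NB x NB block, or NULL */
    } node;

    /* Dense kernel at the bottom level: c -= a*b on NB x NB blocks. */
    static void dense_gemm_leaf(double *c, const double *a, const double *b)
    {
        for (int i = 0; i < NB; i++)
            for (int j = 0; j < NB; j++)
                for (int k = 0; k < NB; k++)
                    c[i*NB+j] -= a[i*NB+k] * b[k*NB+j];
    }

    /* Two-level sparse GEMM: C -= A*B, recursing on quadrants and
       skipping block pairs that are entirely zero. */
    void sparse_rgemm(node *C, const node *A, const node *B)
    {
        if (!A || !B || !C) return;           /* zero block: no work   */
        if (C->dense) {                       /* recursion hit a block */
            dense_gemm_leaf(C->dense, A->dense, B->dense);
            return;
        }
        for (int i = 0; i < 2; i++)           /* C_ij -= sum_k A_ik*B_kj */
            for (int j = 0; j < 2; j++)
                for (int k = 0; k < 2; k++)
                    sparse_rgemm(C->q[i][j], A->q[i][k], B->q[k][j]);
    }
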
Recursive storage conversion steps

[Figures: matrix divided into 2x2 blocks; matrix with explicit 0's and fill-in; recursive algorithm division lines.]

• - original nonzero value
0 - zero value introduced due to blocking
x - zero value introduced due to fill-in
Breakdown of Time Across Phases
For the Recursive Sparse Factorization

[Stacked bar chart: percentage of time (0-100%) spent in symbolic factorization, block conversion, recursive conversion, embedded blocking, and numerical factorization for different test matrices: af23560, ex11, goodwin, jpwh_991, mcfe, memplus, olafu, orsreg_1, psmigr_1, raefsky_3, raefsky_4, saylr4, sherman3, sherman5, wang3.]
SETI@home

♦ Use thousands of Internet-connected PCs to help in the search for extraterrestrial intelligence.
♦ Uses data collected with the Arecibo Radio Telescope, in Puerto Rico.
♦ When their computer is idle or being wasted this software will download a 300 kilobyte chunk of data for analysis.
♦ The results of this analysis are sent back to the SETI team, combined with those of thousands of other participants.
♦ Largest distributed computation project in existence
¾ ~400,000 machines
¾ Averaging 26 Tflop/s
♦ Today many companies are trying this for profit.
Distributed and Parallel Systems

[Diagram: a spectrum from distributed systems, heterogeneous (e.g. SETI@home, Entropia, Condor, Grid-based computing, networks of workstations) to massively parallel systems, homogeneous (e.g. Beowulf clusters, clusters with special interconnect, parallel distributed-memory Tflop/s machines).]

Distributed systems:
♦ Gather (unused) resources
♦ Steal cycles
♦ System SW manages resources
♦ System SW adds value
♦ 10% - 20% overhead is OK
♦ Resources drive applications
♦ Time to completion is not critical
♦ Time-shared

Massively parallel systems:
♦ Bounded set of resources
♦ Apps grow to consume all cycles
♦ Application manages resources
♦ System SW gets in the way
♦ 5% overhead is maximum
♦ Apps drive purchase of equipment
♦ Real-time constraints
♦ Space-shared
The Grid

♦ To treat CPU cycles and software like commodities. Napster on steroids.
♦ Enable the coordinated use of geographically distributed resources – in the absence of central control and existing trust relationships.
♦ Computing power is produced much like utilities such as power and water are produced for consumers.
♦ Users will have access to "power" on demand.
NetSolve
Network Enabled Server

♦ NetSolve is an example of a grid-based hardware/software server.
♦ Ease-of-use paramount
♦ Based on an RPC model but with …
¾ resource discovery, dynamic problem solving capabilities, load balancing, fault tolerance, asynchronicity, security, …
♦ Other examples are NEOS from Argonne and NINF from Japan.
♦ The idea is to use a resource, not to tie together geographically distributed resources for a single application.
NetSolve: The Big Picture

[Diagram: a client (Matlab, Mathematica, C, Fortran, Java, Excel) sends a request Op(C, A, B) to the NetSolve agent(s), which consult a schedule database and pick a server ("S2!"). The client ships A and B to the chosen server among S1-S4 and receives the answer C.]

No knowledge of the grid required; RPC-like.
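
The client-side flavor of this is a blocking, RPC-like call. The sketch below is purely illustrative: netsolve_call and its signature are hypothetical stand-ins, not the real NetSolve API, and the "server" is a local stub so the example is self-contained.

    #include <stdio.h>

    /* Hypothetical stand-in for an agent-mediated RPC call: the real
       NetSolve client would contact an agent, which picks a server,
       ships the operands, and returns the answer.  Here the "server"
       is a local stub so the sketch compiles and runs on its own. */
    static int netsolve_call(const char *problem, int n,
                             const double *A, const double *B, double *C)
    {
        printf("agent: scheduling \"%s\" on server S2\n", problem);
        for (int i = 0; i < n; i++)        /* stand-in for Op(C, A, B) */
            C[i] = A[i] + B[i];
        return 0;                          /* 0 = success              */
    }

    int main(void)
    {
        double A[3] = {1, 2, 3}, B[3] = {4, 5, 6}, C[3];
        netsolve_call("Op", 3, A, B, C);   /* blocking, RPC-like       */
        printf("C[0] = %g\n", C[0]);
        return 0;
    }
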
Basic Usage Scenarios

♦ Grid based numerical library routines
¾ User doesn't have to have the software library on their machine: LAPACK, SuperLU, ScaLAPACK, PETSc, AZTEC, ARPACK
♦ Task farming applications
¾ "Pleasantly parallel" execution
¾ e.g. parameter studies
♦ Remote application execution
¾ Complete applications with user specifying input parameters and receiving output
♦ "Blue Collar" Grid Based Computing
¾ Does not require deep knowledge of network programming
¾ Level of expressiveness right for many users
¾ User can set things up, no "su" required
¾ In use today, up to 200 servers in 9 countries
Futures for Linear Algebra Numerical Algorithms and Software

♦ Numerical software will be adaptive, exploratory, and intelligent
♦ Determinism in numerical computing will be gone.
¾ After all, it's not reasonable to ask for exactness in numerical computations.
¾ Auditability of the computation, reproducibility at a cost
♦ Importance of floating point arithmetic will be undiminished.
¾ 16, 32, 64, 128 bits and beyond.
♦ Reproducibility, fault tolerance, and auditability
♦ Adaptivity is a key so applications can function appropriately
Contributors to These Ideas

♦ Top500
¾ Erich Strohmaier, LBL
¾ Hans Meuer, Mannheim U
♦ Linear Algebra
¾ Victor Eijkhout, UTK
¾ Piotr Luszczek, UTK
¾ Antoine Petitet, UTK
¾ Clint Whaley, UTK
♦ NetSolve
¾ Dorian Arnold, UTK
¾ Susan Blackford, UTK
¾ Henri Casanova, UCSD
¾ Michelle Miller, UTK
¾ Sathish Vadhiyar, UTK

For additional information see…
www.netlib.org/netsolve/
www.netlib.org/atlas/
www.cs.utk.edu/~dongarra/

Many opportunities within the group at Tennessee
Intel® Math Kernel Library 5.1 for IA32 and
Itanium™ Processor-based Linux applications Beta
License Agreement PRE-RELEASE
♦ The Materials are pre-release code, which may not be fully
functional and which Intel may substantially modify in producing
any final version. Intel can provide no assurance that it will
ever produce or make generally available a final version. You
agree to maintain as confidential all information relating to your
use of the Materials and not to disclose to any third party any
benchmarks, performance results, or other information relating
to the Materials comprising the pre-release.

♦ See:
http://developer.intel.com/software/products/mkl/mkllicense51_lnx.htm

