You are on page 1of 28

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/220781801

LAPACK: A Portable Linear Algebra Library for High-Performance Computers

Conference Paper · January 1990


Source: DBLP

CITATIONS READS
382 1,045

10 authors, including:

Jack Dongarra Christian Bischof


University of Tennessee Technische Universität Darmstadt
1,772 PUBLICATIONS   68,458 CITATIONS    360 PUBLICATIONS   6,371 CITATIONS   

SEE PROFILE SEE PROFILE

Some of the authors of this publication are also working on these related projects:

Performance evaluation of HPC systems View project

Software Factory 4.0 View project

All content following this page was uploaded by Christian Bischof on 02 June 2014.

The user has requested enhancement of the downloaded file.


LAPACK: A Portable L ine a r Al ge b r a L i b r a r y

fo r Hi g h-Pe r f o r ma n ce Co mput e r s

E. Anderson, Z. Bai, C. Bi s cho f, J. De mme l, J. Do nga r r a , J. DuCr o z,

A. Gr e e nba um, S. Ha mma r l i ng , A. Mc Ke nne y, D. So r e ns e n

Abstract
The goal of the LAPACK project is to desi gn and i mpl e-
ment a portabl e l i near al gebra l i brary f or eci ent use on a va-
ri ety of hi gh- perf ormance computers. The l i brary i s based on
the wi del y used LINPACKand EI SPACKpackages f or sol vi ng
l i near equati ons, ei genval ue probl ems, and l i near l east- squares
probl ems, but extends thei r f uncti onal i ty i n a number of ways.
The maj or methodol ogy f or maki ng the al gori thms run f aster
i s to restructure themto perf ormbl ock matri x operati ons (e. g. ,
matri x- matri x mul ti pl i cati on) i n thei r i nner l oops. These bl ock
operati ons may be opti mi zed to expl oi t the memory hi erarchy
of a speci c archi tecture. The LAPACKproj ect i s al so worki ng
on new di vi de- and- conquer al gori thms f or certai n ei genprob-
l ems, and new al gori thms that yi el d hi gher rel ati ve accuracy
f or a vari ety of l i near al gebra probl ems.

1 Introducti on
The Uni versi ty of Tennessee, the Courant I nsti tute f or Mathemati cal
Sci ences, the Numeri cal Al gori thms Group Ltd. , Ri ce Uni versi ty, Ar-
gonne Nati onal Laboratory, and Oak Ri dge Nati onal Laboratory are
devel opi ng a transportabl e l i near al gebra l i brary i n Fortran 77. The
l i brary i s i ntended to provi de a uni f ormset of subrouti nes to sol ve
the most common l i near al gebra probl ems and to run eci entl y on a
wi de range of hi gh- perf ormance computers.
The LAPACK l i brary (shorthand f or Li near Al gebra Package)
wi l l provi de routi nes f or sol vi ng systems of si mul taneous l i near equa-
ti ons, l east- squares sol uti ons of overdetermi ned systems of equati ons,
and ei genval ue probl ems. The associ ated matri x f actori zati ons (LU,

1
Chol esky, QR, SVD, Schur, general i zed Schur) wi l l al so be provi ded,
as wi l l rel ated computati ons such as updati ng or reorderi ng of the
f actori zati ons and condi ti on numbers (or esti mates thereof ). Dense
and banded matri ces wi l l be provi ded f or, but not general sparse ma-
tri ces. I n al l areas, si mi l ar f uncti onal i ty wi l l be provi ded f or real and
compl ex matri ces.
The newl i brary wi l l be based on the successf ul EI SPACK[ 32, 25]
and LI NPACK [ 14] l i brari es, i ntegrati ng the two sets of al gori thms
i nto a uni ed, systemati c l i brary. A great deal of e ort has al so
been expended to i ncorporate desi gn methodol ogi es and al gori thms
that make the LAPACK codes more appropri ate f or today' s hi gh-
perf ormance archi tectures. The LI NPACKand EI SPACKcodes were
wri tten i n a f ashi on that, f or the most part, i gnored the cost of data
movement. Most of today' s hi gh- perf ormance machi nes, however, i n-
corporate a memory hi erarchy [ 16, 27, 34] to hel p di sgui se the di er-
ence i n speed of memory accesses and vectori zed oati ng- poi nt oper-
ati ons. As a resul t, codes must be caref ul about reusi ng data i n order
not to run at memory speed i nstead of oati ng- poi nt speed. LAPACK
codes have been caref ul l y restructured to reuse as much data as pos-
si bl e i n order to reduce the cost of data movement. Further i mprove-
ments are the i ncorporati on of new and i mproved al gori thms f or the
sol uti on of ei genval ue probl ems [ 10, 19] . LAPACKi s al so desi gned to
expl oi t the paral l el processi ng capabi l i ti es of many hi gh- perf ormance
machi nes, especi al l y shared memory machi nes.
LAPACK i s desi gned to be eci ent and transportabl e across a
wi de range of computi ng envi ronments, wi th speci al emphasi s on mod-
ern hi gh- perf ormance computers. Whi l e we do not expect LAPACK
codes to be opti mal f or al l archi tectures, we anti ci pate hi gh perf or-
mance over a wi de range of machi nes. By rel yi ng on the Basi c Li near
Al gebra Subprograms (BLAS) [ 21, 15, 29] the codes can be \tuned"
to a gi ven archi tecture by eci ent|and, i n al l l i kel i hood, machi ne-
dependent|impl ementati ons of these kernel s. Machi ne- speci c opti -
mi zati ons are l i mi ted to those kernel s, and the user i nterf ace i s uni f orm
across machi nes. W e shal l al so di stri bute test and ti mi ng routi nes to
veri f y the i nstal l ati on of the LAPACKcodes on a parti cul ar archi tec-
ture and to al l owf or easy compari son wi th exi sti ng sof tware.
W e ai m f or the new l i brary to be easi l y avai l abl e. Net l i b [ 17]
has proven to be an extremel y conveni ent mechani smf or di stri buti ng
sof tware by el ectroni c mai l , and we shal l make LAPACK avai l abl e

2
through netl i b, f or users who want copi es of sel ected routi nes. W e
shal l al so make arrangements f or the compl ete package to be di s-
tri buted on magneti c tape, f or a nomi nal cost onl y. Fi nal l y we shal l
encourage vendors to i ncorporate LAPACKi n thei r own mathemati -
cal l i brari es.
The rest of thi s paper i s outl i ned as f ol l ows. Secti on 2 descri bes the
BLAS and expl ai ns why thei r use can speed up al gori thms. Secti on
3 descri bes bl ock al gori thms and shows i n some detai l howto reorga-
ni ze Gaussi an el i mi nati on and QR f actori zati on. Secti on 4 contai ns
benchmark resul ts f or l i near equati on sol vi ng and QR f actori zati on
on a vari ety of machi nes. Secti on 5 outl i nes our general approach to
achi evi ng hi gh accuracy; i n general , we have repl aced absol ute error
bounds (ei ther on the backward or f orward error) wi th rel ati ve er-
ror bounds, whi ch better respect the sparsi ty and scal i ng structure
of the ori gi nal probl ems. Secti on 6 di scusses a new bi di agonal si ngu-
l ar val ue decomposi ti on (SVD) whi ch computes si ngul ar val ues and
vectors much more accuratel y than previ ousl y thought possi bl e, and
Secti on 7 descri bes howthe Jacobi method (wi th a modi ed stoppi ng
cri teri on) i s uni f orml y more accurate than QR- based al gori thms f or
the symmetri c posi ti ve de ni te ei genval ue probl emand SVD. Secti on
8 revi ews the target machi nes f or whi ch LAPACKi s desi gned to run
most eci entl y. Secti on 9 di scusses the overal l structure of the LA-
PACKl i brary and expl ai ns the choi ce of programmi ng l anguage and
styl e. Fi nal l y, Secti on 10 outl i nes f uture pl ans to extend the l i brary,
i ncl udi ng the chal l enges f aced i n adapti ng the codes to di stri buted-
memory machi nes.

2 Basi c Li near Al gebra Subrouti nes


The ori gi nal set of Basi c Li near Al gebra Subrouti nes [ 29] , known as
the BLAS, were rst proposed i n 1973. Af ter some re nement of
the proposal , they were used i n the constructi on of LI NPACK. These
BLAS di d operati ons onl y on vectors of data, such as a dot product
or a saxpy (addi ng a scal ar mul ti pl e of one vector to another). W e
ref er to these vector- vector operati ons as Level 1 BLAS. The Level
1 BLAS permi t eci ent i mpl ementati on on scal ar machi nes, but the
granul ari ty i s too l owf or e ecti ve use on most vector or paral l el ma-
chi nes.

3
More recentl y, hi gher l evel BLAS have been speci ed that per-
f ormoperati ons of hi gher granul ari ty and so o er more opportuni ty
f or opti mi zati on on di erent archi tectures. The Level 2 BLAS [ 21]
perf orm matri x- vector operati ons such as matri x- vector mul ti pl i ca-
ti on and rank- one updates. The Level 3 BLAS [ 15] perf ormmatri x-
matri x operati ons such as matri x- matri x mul ti pl i cati on, sol vi ng tri -
angul ar systems wi th mul ti pl e ri ght hand si des, and rank- k matri x
updates.
To appreci ate why these Level 2 and Level 3 BLAS wi th l arger
granul ari ty o er better opportuni ti es f or eci ency, one must under-
stand memory hi erarchi es. Al l machi nes (not j ust supercomputers)
have a hi erarchy of memory l evel s|f or exampl e, wi th regi sters at the
top, f ol l owed by cache, mai n memory, and nal l y di sk storage at the
bottom. Toward the top of the hi erarchy, memory i s smal l er, more
expensi ve, and f aster. Si nce operati ons such as mul ti pl i cati on and
addi ti on must be done at the top l evel , data has to move up through
the vari ous l evel s to the top to be processed, and then down agai n to
be stored. The resul t i s that data at hi gher l evel s i s avai l abl e onl y
af ter some del ay and may not be avai l abl e at a rate f ast enough to
f eed the ari thmeti c uni ts. Cl earl y, an al gori thmthat mi ni mi zes the
memory trac i n the hi erarchy wi l l run f aster.
One way to measure the amount of thi s memory trac i s the rati o
of ops ( oati ng poi nt operati ons) to memory ref erences i n an al go-
ri thm. The l arger thi s rati o, the l onger a pi ece of data may be kept at
the top of the hi erarchy on average. Let us use thi s measure to com-
pare the three operati ons of saxpy (Level 1 BLAS), matri x- vector mul -
ti pl i cati on (Level 2 BLAS), and matri x- matri x mul ti pl i cati on (Level
3 BLAS), where al l vector and matri ces are of di mensi on n . Si mpl e
counti ng yi el ds the rati os 2= 3 f or saxpy, 2 f or matri x- vector mul ti pl i -
cati on, and n = 2 f or matri x- matri x mul ti pl i cati on. The l arge rati o f or
matri x- matri x mul ti pl y represents a surf ace- to- vol ume e ect, doi ng
O ( n3 ) operati ons on O (2n) data. Hence, matri x- matri x mul ti pl i ca-
ti on o ers greater opportuni ty f or expl oi ti ng the memory hi erarchy
than the l ower- l evel BLAS routi nes. Tabl e 1 i l l ustrates thi s f act wi th
some benchmark resul ts.

4
Tabl e 1: Speed of the BLAS on vari ous archi tectures
(al l val ues are i n M ops)
Al l i ant FX/8 I BM3090/VF CRAY2S
(8 processors) (1 processor) (1 processor)
Another
Peak Speed 94 108 488
Level 1 BLAS 14 26 121
Level 2 BLAS 26 60 350
Level 3 BLAS 43 80 437

advantage of Level 3 BLAS i s that they o er greater scope f or expl oi t-


i ng paral l el processors on shared memory machi nes.
Fortran i mpl ementati ons of al l the BLAS are avai l abl e; to get the
f ul l bene t, however, the BLAS shoul d be opti mi zed f or each archi -
tecture. W e encourage the computer manuf acturers to perf ormthese
opti mi zati ons; the data i n Tabl e 1 are f or such opti mi zed i mpl emen-
tati ons. W e al so expect that the LAPACK proj ect wi l l reveal the
need f or a f ewaddi ti onal basi c routi nes whose perf ormance may need
to be opti mi zed f or di erent archi tectures and may be regarded as
extensi ons to the current set of BLAS (e. g. , appl yi ng a sequence of
pl ane rotati ons to a matri x).

3 Bl ock Al gori thms


To expl oi t the Level 3 BLAS, one usual l y must express the al gori thm
i n terms of operati ons on submatri ces, or \bl ocks, " as compared to
vector- or scal ar- ori ented operati ons [ 22, 24] . W e have devel oped
such bl ock al gori thms f or LU f actori zati on, Chol esky f actori zati on,
Bunch- Kauf man f actori zati on of a symmetri c i nde ni te matri x, QR
f actori zati on (wi th and wi thout pi voti ng), the nonsymmetri c ei gen-
probl em(reducti on to Hessenberg f ormand QR i terati on), the sym-
metri c ei genprobl em(reducti on to tri di agonal f ormand reducti on of a
symmetri c- de ni te general i zed probl emto standard f orm), and SVD
(reducti on to bi di agonal f orm). See [ 18] f or detai l s of some of these
al gori thms. W ork i s conti nui ng on bl ock al gori thms f or general i zed
ei genprobl ems. I n thi s secti on we i l l ustrate two bl ocked al gori thms:
LUf actori zati on and QRf actori zati on.

5
3. 1 LU Fact or i zat i on

Fi rst we show how to organi ze the al gori thm mai nl y i n terms of


matri x- vector (Level 2) operati ons. Then we deri ve the bl ock al go-
ri thm, usi ng matri x- matri x (Level 3) operati ons.
We want to f actori ze an n by n nonsi ngul ar matri x A as the
product P L U of a permutati on matri x P (representi ng the rowi nter-
changes), a uni t l ower tri angul ar matri x L , and an upper tri angul ar
matri x U .
Assumi ng f or si mpl i ci ty that P =I , we wri te the f actori zati on i n
parti ti oned f ormas:
0 1 0 10 1
A11 A12 A13 L11 U11 U12 U13
B C B
@ A21 A22 A23 A = @ L21 L22 C
AB@ U22 U23 C
A
A31 A32 A33 L31 L32 L33 U33
Let the numbers of rows and col umns i n the bl ocks of the parti -
ti oned matri ces be j 0 1,b nand n 0 j 0 nb +1.
To deri ve an al gori thmusi ng Level 2 BLAS, we assume nb = 1.
Suppose that the rst j 0 1 col umns of L and U (that i11s ,LL21 , L31
and U 11 ) have al ready been computed. To showhowthe j - th col umn
may be computed, we equate components of thi s col umn as f ol l ows:
1. A12 =L 11 U12 . Hence the vector U12 can be obtai ned by sol vi ng
a l ower tri angul ar systemof equati ons (a Level 2 BLAS opera-
ti on).
2. A22 0 L21 U12 =L 21 U12 and
A32 0 L31U12 =L 31U12 .
Fi rst we update A22 and A 32 (treati ng thema s a si ngl e vector)
by computi ng
! ! !
A22 A22 0 L21 U12
A32 A32 L31

whi ch i s a rectangul ar matri x- vector mul ti pl i cati on (another


Level 2 BLAS routi ne). At thi s poi nt we can i ncorporate a
rowi nterchange so that the si ngl e el ement22A(cal l i t ) i s the
l argest el ement i n 2A2 and A 32 . Si nce L i s uni t l ower tri angul ar,
L22 =1 and U 22 = . Fi nal l y L32 = 01 A32 (a Level 1 BLAS
operati on).

6
Thus each col umn of L and U can be computed i n turn, by per-
f ormi ng steps 1 and 2 i n a l oop wi th j runni ng f rom1 to n . Most of
the work i n thi s l oop can be perf ormed by cal l s to two Level 2 BLAS
routi nes.
To deri ve the bl ock f ormof the al gori thm, we assume thatb n> 1.
As bef ore, suppose that 1L1 , L21 , L31 and U 11 have al ready been
computed. The same approach as bef ore yi el ds the f ol l owi ng method
f or computi ng the next bl ock of bncol umns of L and U .

1. A12 = L 11 U12 : the matri x U12 can be obtai ned by sol vi ng a


n ght hand si des (a
l ower tri angul ar systemof equati ons wi thb ri
Level 3 BLAS operati on).
2. A22 0 L21 U12 =L 22 U22 and
A32 0 L31U12 =L 32U22 .
Fi rst we update A22 and A 32 by a rectangul ar matri x- matri x
mul ti pl i cati on (another Level 3 BLAS operati on). Then we f ac-
tori ze the updated matri x as
! !
A22 =P 0
L22 U22
A32 L32

usi ng the Level 2 BLAS f ormof the al gori thmdescri bed earl i er.
The bl ocks A22 and A 32 are treated together i n order to al l ow
the necessary row i nterchanges (represented by the matri 0x. P
These rowi nterchanges must then be appl i ed to the rest of the
matri x.
Thi s i s j ust one way i n whi ch LUf actori zati on may be organi zed
i n terms of Level 2 or Level 3 BLAS operati ons. There are i n f act at
l east three ways, i nvol vi ng di erent patterns of data movement and
BLAS operati ons. Thei r rel ati ve perf ormance i s machi ne- dependent,
but our experi ence to date has shown that the Level 3 BLAS f orms
do not vary much i n perf ormance (there are greater di erences i n
perf ormance between the Level 2 BLAS f orms).
Tabl e 1a shows the speed i n mega ops on one processor of a Cray
X- MP of :
 the LI NPACKrouti ne SGEFA

7
 our LAPACKrouti ne SGETRF, cal l i ng assembl y l anguage BLAS
 a hi ghl y opti mi zed assembl y- l anguage al gori thm provi ded by
Cray
Tabl e 1a: LUf actori zati on on a Cray X- MP
n LI NPACK LAPACK Cray
25 18 18 65
100 55 105 163
500 129 203 213
Thus, f or l arge matri ces, the Level 3 BLAS code ran at over 95%
of the speed of the hi ghl y opti mi zed assembl y- l anguage code. Thi s
resul t shows that we can achi eve 95%eci ency f romthe machi ne,
usi ng portabl e Fortran code, wi th machi ne- speci c codi ng con ned to
the BLAS.

3. 2 QR Fact orizat i on

As another exampl e of a bl ock al gori thm, we consi der an al gori thm


f or computi ng the QRf actori zati on
A =QR
of a dense matri x. Here an m 2 n (w. l . o. g. m  n ) matri x A i s
decomposed i nto an orthogonal m 2 m matri x Q and an upper tri an-
gul ar m 2 n matri x R . Thi s decomposi ti on i s one of the basi c tool s
of numeri cal l i near al gebra; f or appl i cati ons, see [ 26] . The tradi ti onal
al gori thmf or computi ng the QR f actori zati on [ 26, p. 148] empl oys a
sequence of Househol der reducti ons
H =I 0 2u uT ; k u2 k=1: (1)
Appl i cati on of H to a gi ven matri x A i nvol ves a matri x- vector mul -
ti pl i cati on z TAu and a rank- one update A A 0 2u Tz : Each of
these Level 2 BLAS operati ons requi res O (2 )n oati ng- poi nt opera-
ti ons and uses O ( 2n) data.
To arri ve at a bl ock f ormul ati on of the Househol der QRal gori thm,
we must be abl e to express a seri es of Househol der reducti ons i n a con-
veni ent cl osed f orm. Bi schof and Van Loan [ 7] expressed the product
Q =H 1 1 1 1b H
8
of a seri es of m 2 m Househol der matri ces (1) i n the so- cal l ed WY
represent at i on
Q =I +W Y T (2)
where W and Y are m 2 b matri ces. Schrei ber and Van Loan [ 30]
re ned thi s representati on by expressi ng W =Y T where T i s a b 2 b
upper tri angul ar matri x. Schrei ber and Van Loan cal l ed the resul ti ng
representati on
Q =I +Y T YT (3)
the compact WY represent at i on si nce i t requi res onl y about hal f as
much storage as the ori gi nal WYrepresentati on (2) i n the typi cal case
where m  b . Compared to the tradi ti onal Househol der al gori thm,
2
the accumul ati on of T requi res O (2mb
) extra ops and b2 extra words
f or storage. Si nce typi cal l y m  b , thi s i s a l ow- order termi n the
overal l al gori thmi c compl exi ty. The advantage of the compact WY
representati on i s that the computati on of A T Q A nowi nvol ves two
matri x- matri x mul ti pl i cati ons
Z ATY T (4)
and a rank- b update
A A +Y Z T (5)
i nstead of a seri es of b matri x- vector mul ti pl i cati ons and rank- one
updates.
We can nowexpress the bl ock Househol der QRal gori thmi n terms
of the pri mi ti ves gencwy (generate compact WY f actor) and ap-
pcwy (appl y compact WYf actor):
[Y ; T ] g e n cwy ( A )
returns the compact WYf actors T and Y such that
!
A =( I +Y T Y ) T R :
0
The pri mi ti ve gencwy rst computes the QRf actori zati on of A usi ng
the tradi ti onal Househol der QR al gori thmand then accumul ates T .
Next,
A a p p c w y ( Y ; T ; A)

9
perf orms the updates (4) and (5). Fi gure 1 shows the bl ock House-
hol der al gori thmusi ng the compact WY representati on. Here A i s
parti ti oned as an M 2 N bl ock matri x, and f or si mpl i ci ty we assume
that al l bl ocks are of the same si ze b 2 b , so m =Mb and n =N b . W e
use the notati on A ( i ; j ) to ref er to bl ock entry ( i ; j ) and A ( i : j ; k : l )
to ref er to the submatri x of A consi sti ng of bl ock row entri es i to j
and bl ock col umn entri es k to l .

f or i =1 t o N do
[ Y ; T ] gencwy ( A ( i : M; i ))
A ( i : M; i : N ) appcwy ( Y ; T ; A ( i : M; i : N ))
end f or

Fi gure 1: The Bl ock Househol der QRFactori zati on Al gori thm

Thi s al gori thmi l l ustrates some i mportant f eatures of bl ock al go-


ri thm. For one, the bl ock al gori thmmay requi re more oati ng poi nt
operati ons than i ts unbl ocked counterpart. W e i nvest more work i n
accumul ati ng a bl ock transf ormati on, but thi s i s more than made up
f or by the appl i cati on of the transf ormati on, whi ch wi l l run at cl ose
to opti mumspeed si nce i t i s not sl owed down by excessi ve data move-
ment overhead. Thi s reasoni ng i s true up to the poi nt where addi ng
more col umns to a bl ock transf ormati on wi l l not resul t i n a f aster
update.
Thi s rel ates to the subtl e i ssue of parti ti oni ng a gi ven matri x i nto
bl ocks. The bl ock parti ti oni ng resul ti ng i n the f astest executi on of
the code (the \opti mal " bl ock parti ti oni ng) i s probl em- dependent (we
can use l arger bl ocks f or l arger matri ces), but i t al so depends on the
archi tecture of a gi ven machi ne. Furthermore, on mul ti processor ma-
chi nes, possi bl y con i cti ng i ssues of i ndi vi dual processor perf ormance
and overal l l oad bal anci ng must be reconci l ed. Adi scussi on of these
i ssues and a suggesti on f or a methodol ogy to overcome thi s probl em
can be f ound i n [ 6] . Determi ni ng opti mal , or near opti mal , bl ock si zes
f or di erent envi ronments i s a maj or research topi c f or the LAPACK
proj ect.
The al gori thmi n Fi gure 1 constructs a bl ock transf ormati on and
then i mmedi atel y appl i es i t to the remai ni ng sub- matri x. We shal l cal l
thi s a \ri ght - l ooki ng al gori t hm". Noti ce that at each of the N stages

10
we are updati ng a submatri x of si ze ( m 0 ( i 0 1) b +1) 2 ( n 0 ( i 0 1) b +1).
We can f urther reduce the amount of data movement by the f ol l owi ng
al gori thm:

for i =1 to N do
f or j =1 t o i 0 1 do
A ( j : M; i ) appcwyj ; ( TYj ; A ( j : M; i ))
end f or
[ Yi ; Ti] gencwy ( A ( i : M; i ))
end f or

Fi gure 2: Lef t Looki ng Bl ock Househol der QR Factori zati on Al go-


ri thm
At each stage of thi s al gori thm we are onl y modi f yi ng a m 2
b matri x. We shal l cal l thi s al gori thmthe \l ef t - l ooki ng al gori t hm".
Compared to the al gori thmi n gure 1, the pattern of wri tes i s more
l ocal i zed, and thi s may resul t i n a substanti al reducti on i n the number
of wri tes to l ower l evel s i n the memory hi erarchy. Thi s i s parti cul arl y
i mportant i n shared- memory mul ti processors where cache consi stency
i s guaranteed by the use of \wri te- through" caches [ 27, 33] . On those
archi tectures read accesses to cached data can be sati s ed i n one cycl e,
but wri te accesses are i mmedi atel y ushed to memory. As a resul t,
wri te accesses are much sl ower than read accesses.

4 Benchmarks
The rst prel i mi nary versi on of LAPACK sof tware was rel eased to
test si tes i n Apri l 1989. Thi s sof tware i ncl uded routi nes f or general ,
posi ti ve de ni te, and symmetri c i nde ni te systems and f or QR de-
composi ti on wi thout pi voti ng.
I n Tabl es 2- 4 we present resul ts f or whi ch most or al l of the BLAS
were opti mi zed f or the parti cul ar archi tecture. SGETRF i s the LA-
PACK routi ne f or tri angul ar f actori zati on of a general matri x wi th
parti al pi voti ng, SPOTRF perf orms Chol esky f actori zati on of a posi -
ti ve de ni te symmetri c matri x, and SGEQRF does QR f actori zati on
wi thout pi voti ng. Al so shown are SGEMV (matri x- vector mul ti pl y)

11
and SGEMM(matri x- matri x mul ti pl y), si nce these are the \speed
l i mi ts" f or the al gori thms wri tten i n terms of the Level 2 BLAS and
Level 3 BLAS, respecti vel y. Al l codes were run i n si ngl e preci si on (32
bi ts on the Convex and 64 bi ts on the Cray). Al l resul ts are i n M ops.

Tabl e 2: LAPACKon a Convex C210


Routi ne Matri x Di mensi on
32 64 128 256 512
SGEMV 34 43 47 47 47
SGEMM 38 44 47 47 47
SGETRF 6 12 21 30 36
SPOTRF 8 20 33 40 44
SGEQRF 12 21 27 33 38

Tabl e 2 gi ves the resul ts f or the Convex C210, wi th an al gori thm


bl ock si ze of bn=1. Si nce matri x- vector and matri x- matri x mul ti pl y
are equal l y f ast on thi s machi ne i n thei r current i mpl ementati ons,
nothi ng i s gai ned by goi ng to the Level 3 BLAS.

Tabl e 3: LAPACKon a CRAYY- MP


1 Processor
Routi ne Matri x Di mensi on
32 64 128 256 512 1024
SGETRF 40 108 195 260 290 304
SPOTRF 34 95 188 259 289 301
SGEQRF 54 139 225 275 294 301

Tabl e 4: LAPACKon a CRAYY- MP


8 Processors
Routi ne Matri x Di mensi on
32 64 128 256 512 1024
SGETRF 32 90 205 375 1039 1974
SPOTRF 29 84 273 779 1592 2115
SGEQRF 50 133 328 807 1476 1937

12
Tabl es 3 and 4 gi ve resul ts f or the CRAYY- MP f or 1 and 8 pro-
cessors, respecti vel y. Here n f or SGETRF and SPOTRF and
b =64
nb =16 f or SGEQRF. The maxi mumspeed of a si ngl e processor of a
CRAYY- MPi s 333 M ops. Thus, we see that f or l arge- enough matri x
di mensi ons, the si ngl e- processor code runs at at 90% eci ency. When
al l 8 processors are used, the code attai ns 73%to 80%eci ency.

5 Target Machi nes


The LAPACKl i brary wi l l be desi gned pri mari l y to perf ormeci entl y
on machi nes wi th a modest number of processors (say, 1- 100), each
havi ng a powerf ul vector- processi ng capabi l i ty. These machi nes i n-
cl ude al l of the most powerf ul computers currentl y avai l abl e and i n
use f or general - purpose sci enti c computi ng: CRAY- 2, CRAYX- MP,
CRAY Y- MP, Fuj i tsu VP, I BM3090/VF, NEC SX, Hi tachi S- 820,
Al l i ant FX/80, Convex C- 1, Convex C- 2, Stardent, Sequent Symme-
try, Encore Mul ti max, and BBN Butter y. On conventi onal seri al
machi nes, the perf ormance of the l i brary i s expected to be at l east
as good as that of the current LI NPACKand EI SPACKcodes. Thus
the l i brary wi l l be sui tabl e across the whol e range of machi nes f rom
personal computers to supercomputers to experi mental archi tectures.
W e do not cl ai mthat the strategy of usi ng Level 2 or Level 3
BLAS wi l l necessari l y attai n opti mal perf ormance on al l these ma-
chi nes; i ndeed, some al gori thms can be structured i n several di erent
ways, al l cal l i ng Level 3 BLAS, but wi th di erent perf ormance charac-
teri sti cs. I n such cases we shal l choose the structure that provi des the
best \average" perf ormance over the range of target machi nes. Cur-
rentl y we are l i mi ti ng machi ne- dependent opti mi zati ons to the BLAS
to retai n portabi l i ty across archi tectures. W e encourage vendors to
provi de i mpl ementati ons of the BLAS kernel s that are opti mi zed f or
thei r parti cul ar archi tectures. Whi l e users wi l l be f ree to devel op thei r
own versi ons of the LAPACKcodes, we bel i eve that the possi bl e per-
f ormance gai n wi l l be l i mi ted on the more conventi onal archi tectures.
On the more experi mental archi tectures (i n parti cul ar, di stri buted-
memory machi nes), the restri cti on of opti mi zati on to the BLAS mi ght
be too l i mi ti ng. I n parti cul ar, i t mi ght be advantageous to i ntro-
duce paral l el i smat the top- l evel of the al gori thmi nstead of i nsi de the
BLAS. To ai d users i n experi menti ng on thei r parti cul ar archi tecture,

13
the LAPACKcodes have been caref ul l y desi gned i n a modul ar f ash-
i on and wi th the obj ecti ve of mi ni mi zi ng data movement. Si nce data
movement i s the key i ssue i n di stri buted- memory as wel l as shared-
memory machi nes, the LAPACKcodes shoul d be easi l y \tunabl e" to
more experi mental archi tectures.
Several of the al gori thms we i ntend to i mpl ement [ 19] wi l l re-
qui re more than l oop- based paral l el i sm. These al gori thms wi l l rel y
upon the si mpl i ed SCHEDULEmechani sm[ 20] to i nvoke paral l el i sm.
These i deas mi ght al so be used to express top- l evel paral l el i smi n a
portabl e f ashi on. W e are al so cl osel y f ol l owi ng the acti vi ti es of the
Paral l el Computi ng Forum[ 23] whi ch has been f ormed by computer
vendors, sof tware devel opers, nati onal l aboratori es, and uni versi ti es
to exchange techni cal i nf ormati on and to document agreements on
constructs f or programmi ng paral l el appl i cati ons f or shared- memory
mul ti processors.

6 Hi gh- Accuracy Li near Al gebra Al gori thms


So f ar we have concentrated on the speed of the al gori thms i n LA-
PACK. But a secondary obj ecti ve of the LAPACKproj ect i s to de-
vel op new or i mproved al gori thms that provi de i ncreased accuracy,
speci cal l y near- opti mumrel ati ve accuracy (i n a sense to be de ned).
To present our di scussi on, we shal l need some notati on.
We l et H denote the probl emf or whi ch we seek a sol uti on; we
denote the sol uti on by f ( H ). For exampl e, f ( H ) may denote the
ei genval ues, ei genvectors, si ngul ar val ues, or si ngul ar vectors of the
matri x H . I f H denotes the pai r ( A; b), then f ( H ) may denote the
sol uti on of the l i near systemAx =b , perhaps i n a l east- squares sense
i f A i s si ngul ar or not square. I n general , f ( H ) cannot be computed
exactl y and hence i s approxi mated by an al gori thmwhose output we
denote f^( H ). W e al so l et " denote the machi ne preci si on.
Anal yzi ng the accuracy of an al gori thmf^ f or f consi sts of two
parts. Fi rst, we use pert urbat i on t heory, where we bound the di er-
ence f ( H + H ) 0 f ( H ) i n terms of  H . Thi s part depends onl y
on f and not the al gori thmthat approxi mates i t. Second, we use
error anal ysi s , whi ch attempts to show that the computed sol uti on
f^( H ) i s cl ose to f ( H + H ) f or some bounded  H . Showi ng that
f^( H ) = f ( H + H ) f or some bounded  H i s cal l ed backward error

14
anal ysi s , but i s by no means the onl y way to proceed.
There i s a great deal of choi ce i n the measures we choose to bound
errors and measure di stances. I n conventi onal error anal ysi s as i n-
troduced by Wil ki nson, we bound k f ( H +  H ) 0 f ( H ) k i n terms
of k  H k , and show f^( H ) = f ( H + H ) where k  H k  O ( " ) k H k .
Here, k 1 k denotes a norm, l i ke the one- norm or Frobeni us norm.
Typi cal l y one proves a f ormul a of the f ormk f ( H + H ) 0 f ( H ) k 
 ( f ; H ) 1 k  H k +O (2k), where
H k  ( H ) i s cal l ed the condi t i on num-
ber of H wi t h respect t o f . I n thi s f ormul ati on, i t i s easy to see that
 ( f ; H ) i s si mpl y the normof the gradi ent of f at H : k rf ( H ) k ; other
scal i ngs are possi bl e. Thus, combi ni ng the perturbati on theory and
error anal ysi s, one can wri te
k f ( H + H ) 0 f ( H ) k  O ( " )  ( f ; H ) 1 k2H) k +O ( "
The drawback of thi s approach i s that i t does not respect the
structure of the ori gi nal data. I n parti cul ar, i f the ori gi nal data i s
sparse or graded (l arge i n some entri es, smal l i n others), boundi ng
 H onl y by normcan gi ve very pessi mi sti c resul ts. Atri vi al exampl e
i s sol vi ng a di agonal system of equati ons. Each component of the
sol uti on i s computed to f ul l accuracy by a si ngl e di vi de operati on,
but the conventi onal condi ti on number i s the rati o of the l argest to
smal l est di agonal entri es and may be arbi trari l y l arge.
I nstead of boundi ng  H by i ts normk  H k , one may i nstead use the
measure r eHl (  H )  maxi j j  iHjj = ji jH
j , the l argest rel ati ve change i n
any entry (we use the notati on r He lto i ndi cate the dependence on
H ). Thi s measure respects sparsi ty, si ncei jmust H be zero i f Hi j i s
zero, and al so gradi ng, si nce every entry i s perturbed by an amount
smal l compared to i ts magni tude. For exampl e, i n the case of di agonal
l i near equati on sol vi ng, one can easi l y see that a perturbati on  H of
si ze r H e (l  H ) i n the matri x can onl y change the sol uti on rel ati vel y
by r eHl (  H ) i n each component, and that the al gori thmi s backward
stabl e wi th r He (l H )  " . Thus, the new perturbati on theory and
error anal ysi s wi th respect to Hr (e lH ) accuratel y predi ct that each
component of the sol uti on i s computed to f ul l rel ati ve accuracy.
W e have successf ul l y devel oped new perturbati on theory, al go-
ri thms, and error anal ysi s f or the measure Hr(e lH ) f or much of nu-
meri cal l i near al gebra. W e cannot al ways guarantee to sol ve probl ems
as though we had a smal l r eHl (  H ), but the al gori thms can i n al l

15
cases moni tor thei r accuracy and produce usef ul error bounds. The
al gori thms are usual l y smal l vari ati ons on conventi onal al gori thms,
perhaps wi th a sl i ghtl y di erent stoppi ng cri teri on, al though the bi di -
agonal SVDal gori thmhas a qui te new component. I n al l cases the
al gori thms run approxi matel y as f ast as thei r conventi onal counter-
parts, someti mes a l i ttl e sl ower and someti mes a l i ttl e f aster. Si nce
they are based on the conventi onal al gori thms, al l the techni ques us-
i ng the Level 3 BLAS appl y to them.
Thi s approach has been appl i ed to l i near equati on sol vi ng [ 2] , l i n-
ear l east- squares probl ems [ 3] , the bi di agonal SVD[ 11, 9] , the tri di -
agonal symmetri c ei genprobl em[ 28, 4] , the dense symmetri c posi ti ve
de ni te ei genprobl em[ 12] , and the dense de ni te general i zed ei gen-
probl em[ 4, 12] . W e have si mi l ar but sl i ghtl y weaker resul ts f or the
dense SVDand general i zed SVD[ 12] . These al gori thms ei ther wi l l be
i ncl uded di rectl y i n LAPACKor can be easi l y constructed by usi ng
LAPACKsubrouti nes as \bui l di ng bl ocks. "
I n the next two secti ons, we shal l descri be these new al gori thms
f or the bi di agonal SVDand dense symmetri c posi ti ve de ni te ei gen-
probl em, respecti vel y. The LAPACK routi ne SBDSQR i mpl ements
the newbi di agonal SVDal gori thm; the work on the symmetri c ei gen-
probl emi s a more recent devel opment, and sof tware has not yet been
prepared f or i ncl usi on i n LAPACK.

7 The Bi di agonal SVD


Abi di agonal matri x i s one that i s nonzero onl y on the mai n di agonal
and rst superdi agonal . The probl emof computi ng i ts SVD ari ses
both as the nal stage of computi ng the SVD of a general matri x
[ 14] and i n the symmetri c posi ti ve de ni te tri di agonal ei genprobl em
[ 4] . W e begi n by revi ewi ng the conventi onal perturbati on theory,
al gori thm, and error anal ysi s and then di scuss the new approach.
See [ 11, 9] f or detai l s and proof s.
The conventi onal perturbati on theory i s the same f or the SVDof
a bi di agonal matri x as f or a dense matri x. Suppose B i s the n by
n bi di agonal matri x and  B a perturbati on. Let i, 
ui and v i be
the si ngul ar val ues, uni t l ef t si ngul ar vectors, and uni t ri ght si ngul ar
vectors of B , respecti vel y, where we assum1e  1 1 1 n .  Let i0 ,
u0i and v i0 be the correspondi ng val ues f or 0 B= B + B . Then the

16
f ol l owi ng resul ts are wel l known [ 26, 8] :
j i 0 i0 j  k  B2 k
max(si n  ( ui ; u0i ) ; si n i ;( vi0v)) 
k  B2k (6)
g a ip0 k  B2k
where the absol ut e gap between si ngul ar val ues i s
g a ip mi n j j 0 i j (7)
j 6 =i

Here si n  ( x ; y ) denotes the si ne of the acute angl e between the vectors


x and y .
The conventi onal al gori thmf or the bi di agonal SVD i s QR i ter-
ati on, whi ch can be shown to be backward stabl e: i t computes the
exact SVDof B + B where k  B2k=O ( " ) k B2k. Thus, l arge si ngul ar
val ues (those near n =k B k2) are computed wi th hi gh rel ati ve accu-
racy, but smal l ones (j  n ) may be total l y i naccurate i f n"  j .
j   n,
Si mi l arl y, i f there are at l east two ti ny si ngul ar vali ; ues
then both g a ipand g a jp are smal l and the error bound on the si ngul ar
vectors i s l arge.
I n contrast, the new perturbati on theory uses the f ol l owi ng mea-
sure:
  (2n 0 1) 1 maxj l ogBi j + Bi j j
ij Bi j
whi ch i s approxi matel y (2n 0 1) rBe( l B ) when both are smal l . One
can prove [ 11, 9]
j i 0 i0 j  e 0 1
 i

 r2 e l g(1a+0p )
1=2
max(si n  ( ui ; u0i) ; si n i ;( vi0v)) (8)
i

where the rel at i ve gap between si ngul ar val ues i s

r e l g iap mi n
j j 0 ij (9)
j 6 =i j + i
One can easi l y see that bounds (8) are at l east about as strong as
thei r conventi onal counterparts (6). To see how much stronger they
may be, consi der maki ng rel ati ve perturbati ons of si ze01010i n a 3

17
by 3 bi di agonal matri x wi th si ngul ar val ues1 = 1,  2 = 2 1 10020 ,
and  3 =10 020 . Note that g a3p=g a p2 =10 020 and that r e l g3 a=p
r e l g2 a=p1= 3. Si nce the normof thi s perturbati on i s about01010,
we may appl y (6) to get the absol ute error bound 10 010  3 f or 3;
and, si nce 10  g a p3, no error bound f or the si ngul ar vectors at
010

al l . Appl yi ng (8), we get a rel ati ve error bound of about 50101 10 in


3 . Thus, we have at l east 9 deci mal di gi ts of accuracy 3i ,n whereas

(6) predi cts changes 10 20 ti mes l arger. Appl yi ng (8) agai n, we get an

error bound of about 2: 1 1 0910i n the di recti on of the si ngul ar vectors,


whereas (6) provi des no error bound at al l . The same resul ts hol d f or
2 and i ts si ngul ar vectors.
I n summary, absol ut e uncertai nti es i n the entri es of a general
matri x yi el d absol ut e error bounds on i ts si ngul ar val ues, and error
bounds dependi ng on the absol ut e gap f or i ts si ngul ar vectors. I n
contrast, rel at i ve uncertai nti es i n the entri es of a bi di agonal matri x
yi el d rel at i ve error bounds on i ts si ngul ar val ues, and error bounds
dependi ng on the rel at i ve gap f or i ts si ngul ar vectors.
W e have devel oped a new bi di agonal SVD al gori thmthat com-
putes si ngul ar val ues and si ngul ar vectors wi thi n the accuracy bounds
of thi s perturbati on theory. The al gori thms are not backward stabl e
i n the usual sense, as i ndeed they cannot be, si nce setti ng an o di -
agonal entry of B to zero (convergence) i s a l arge rel ati ve change i n
an entry of B . Nonethel ess, we can show that the combi ned e ects
of roundo and the stoppi ng cri teri on permi t al l si ngul ar val ues to be
computed wi th rel ati ve accuracy O ( " ) and si ngul ar vectors u vi
i and
to be computed wi th accuracy O ( " ) = r e lig. a p
The al gori thmi s a hybri d of the conventi onal , shi f ted QR al go-
ri thmand a new, stabl e i mpl ementati on of QRwi th a zero shi f t [ 11] .
There i s al so a newstoppi ng cri teri on (cri teri on f or setti ng o di agonal
entri es of B to zero). The zero- shi f t QR al gori thmhas the property
that i t computes each entry of the transf ormed matri x to hi gh rel a-
ti ve accuracy. I t i s used on de ated submatri ces of B whose condi -
ti on number  max = mi n i s so l arge that the round o i ntroduced by the
standard shi f ted QR, " 1max , woul d make unacceptabl y l arge changes
i n the computed mi n . Standard shi f ted QR i s used on submatri ces
where  max = mi n i s smal l .
I n numeri cal tests, thi s new al gori thmwas approxi matel y as f ast
and occasi onal l y much f aster than i ts conventi onal counterpart. Zero-
shi f t QR i s onl y l i nearl y convergent, whereas shi f t QR i s cubi cal l y

18
convergent; neverthel ess, no speed i s l ost, because i t i s appl i ed onl y
when  min =max , i ts convergence f actor, i s qui te smal l . Thus, con-
vergence i s guaranteed to be f ast, even though l i near. See [ 11] f or
detai l ed resul ts of numeri cal tests.

8 Jacobi's Method f or the Symmetri c Posi -


ti ve De ni te Ei genprobl em
We begi n by revi ewi ng the conventi onal perturbati on theory, al go-
ri thms and error anal ysi s f or the symmetri c posi ti ve de ni te ei gen-
probl em, and then di scuss the newapproach. See [ 12] f or detai l s and
proof s.
The conventi onal perturbati on theory, al gori thms and error anal -
ysi s work equal l y wel l f or al l symmetri c matri ces. Let H be an n by
n symmetri c matri x and  H a perturbati on. Leti and v i denote
the ei genval ues and uni t ei genvectors of H , respecti vel y, where we as-
sume  1  1 1 1 n .  Let 0i and v i0 be the correspondi ng val ues f or
H 0 =H + H . Then the f ol l owi ng resul ts are wel l known [ 26, 8] :
j i 0 0ij  k  H2 k
si n  (iv; i0v) 
k  H2k (10)
g a ip0 k  H2k
where the absol ut e gap between ei genval ues i s
g a ip mi n j j 0 i j (11)
j 6 =i

As i n the l ast secti on, si n  ( x; y ) denotes the si ne of the acute angl e


between the vectors x and y .
Any of the conventi onal al gori thms f or the symmetri c ei genprob-
l em, i ncl udi ng QR i terati on, Jacobi , bi secti on and i nverse i terati on,
are known to be backward stabl e: they compute the exact ei gende-
composi ti on of H + H , where k  H 2 k
= O ( ") k H k2. Thus, j ust as
wi th the conventi onal SVD al gori thm, l arge ei genval ues (those near
n =k H k2) are computed wi th hi gh rel ati ve accuracy, but smal l ones
( j  n ) may be total l y i naccurate i f n"  j . Si mi l arl y, i f there
are at l east two ti ny ei genval uesi ; j   n , then both g ai pand g a jp
are smal l and the error bound on the ei genvectors i s l arge.

19
I n contrast, the new perturbati on theory uses the measure  =
relH (  H ) i ntroduced i n secti on 6, but appl i es onl y to posi ti ve de ni te
matri ces. To present the resul ts, we rewri te H as H = DAD, where
1=2 1=2
D = di ag ( H11 ; : : : ;nn H) i s di agonal and A has uni t di agonal . I t
i s known that A ' s condi ti on number ( A )  kA2k A01 k2 sati s es
 ( A )  mi nD^  (DH
^ D ^ ), where the mi ni mumi s over al l nonsi ngul ar
di agonal matri cesD ^ [ 31] . I n other words, A i s nearl y the best di agonal
scal i ng of H . Then we may show[ 12]
ji 0 0ij  n 1  1  ( A )
i
( n 0 1)1=2 1  1  ( A ) 2
si n ( vi ; i0v)  + O(  ) (12)
r e l gap i

where the rel at i ve gap between ei genval ues i s

r e l giap mi n
j j 0 ij (13)
j 6=i ( j 1 i )1=2
There i s al so a nonasymptoti c versi on of the ei genvector bound, but
we omi t i t f or si mpl i ci ty of presentati on.
Nowwe contrast the new wi th the ol d bounds. From(10) we get
the conventi onal rel ati ve error bound
j i 0 0ij  n 1  1  ( H )
i
to contrast wi th (12). To see howmuch stronger (12) may be, consi der
the symmetri c posi ti ve de ni te matri x H =DAD where
2 40 29 19 3 2 3
10 10 10 1 : 1 : 1
H = 64 1029 1020 109 75 ; A =64 : 1 1 : 751
1019 109 1 : 1 : 1 1
and D = di ag (10 20 ; 1010; 1). Here  ( H )  4010and  ( A )  1: 33.
Thus,  rel ati ve perturbati ons i n the matri x entri es onl y cause 4
rel ati ve perturbati ons i n the ei genval ues accordi ng to the new theo-
rem, and 3 1 1040 1  rel ati ve perturbati ons accordi ng to the conven-
ti onal theorem. Al so, the absol ute gaps f or the ei genval ues of H are
g a 1p;2; 3  10020 ; 10020 ; 1, whereas the rel ati ve gaps r e1l; 2g; 3aare
p
al l approxi matel y 10 1 v
10 . Thus the newtheory predi cts errors i n and

v2 of norm2 1 10010  , whereas the ol d theory predi cts errors of 20 10


.

20
W e can showthat Jacobi ' s method (wi th a modi ed stoppi ng cri -
teri on) wi l l attai n these new error bounds, but i n general QR wi l l
not. For the exampl e i n the l ast paragraph, QRcomputes two out of
the three ei genval ues as negati ve, whereas H i s posi ti ve de ni te. I n
contrast, Jacobi computes al l the ei genval ues to nearl y f ul l machi ne
preci si on. I n f act f or thi s exampl e we can showJacobi computes al l
components of al l ei genvectors to nearl y f ul l rel ati ve accuracy, even
though they vary by 16 orders of magni tude; agai n QRwi l l not even
get the si gns of many smal l components correct.
Al l these resul ts may be extended to the dense SVD, except that
the noti on of perturbati on used i s that each col umn of H must be
smal l i n normcompared to the correspondi ng col umn of H .

9 LAPACK Structure, Language, and Styl e


9. 1 Cont ent s

I n thi s subsecti on, we outl i ne the pl anned contents of the LAPACK


l i brary; see [ 5] f or detai l s.
For sol vi ng l i near equat i ons , LAPACKwi l l perf ormtri angul ar f ac-
tori zati on, sol uti on vi a f orward and back substi tuti on, i terati ve re ne-
ment [ 10, 2] and equi l i brati on. I t wi l l handl e general matri ces (dense
onl y), posi ti ve de ni te matri ces (dense or banded), symmetri c i ndef -
i ni te matri ces (dense or banded), and tri angul ar matri ces (dense or
banded). I n addi ti on, symmetri c and tri angul ar matri ces may be
stored i n a packed f ormat. I t wi l l al so update tri angul ar f actori za-
ti ons.
For sol vi ng l east squares , LAPACKwi l l do QRdecomposi ti on wi th
and wi thout pi voti ng, as wel l as l east squares sol uti ons usi ng thi s de-
composi ti on. I t wi l l al so compute the general i zed QR decomposi ti on
of two matri ces (thi s i s needed f or the general i zed SVD bel ow) and
update the QRdecomposi ti on.
For sol vi ng ei genprobl ems , LAPACKwi l l compute ei genval ues, the
Schur f orm, and ei genvectors f or general matri ces or matri x penci l s
(i . e. , the general i zed ei gensystemof A 0 B ). For symmetri c matri ces
(dense or banded) i t wi l l compute ei genval ues and ei genvectors. I t
wi l l al so handl e the general i zed symmetri c ei genprobl emA 0  B f or
A and B symmetri c (dense or banded) and B posi ti ve de ni te. For
general matri ces (dense or banded) and tri angul ar matri ces, i t wi l l

21
compute the SVD; and f or pai rs of matri ces, i t wi l l compute the gen-
eral i zed SVD. I t wi l l al so do rank- 1 updates of the bi di agonal SVD
and tri di agonal symmetri c ei genprobl em.
LAPACKwi l l i ncl ude condi ti on esti mators f or l i near systems, ei gen-
val ues, ei genvectors, and i nvari ant subspaces. I t wi l l sol ve the Syl vester
equati on and general i zed Syl vester equati on and can reorder the ei gen-
val ues al ong the di agonal of a matri x i n Schur f orm.
The nonsymmetri c ei genrouti nes wi l l be based on the QR al go-
ri thm. I nverse i terati on wi l l be avai l abl e i f onl y a f ew ei genvectors
are desi red. For the symmetri c tri di agonal ei genprobl em(the nal
phase of the dense symmetri c ei genprobl em) al gori thms based on QR,
di vi de- and- conquer [ 13] , and bi secti on wi l l be i ncl uded; these al go-
ri thms have di erent attai nabl e accuraci es, requi re di erent amounts
of storage, and run at di erent speeds on di erent archi tectures and
dependi ng on whether some or al l ei genval ues and ei genvectors are
desi red. I n other words, no si ngl e al gori thmi s best i n al l cases [ 10] .
The same i s true f or the bi di agonal SVD, whi ch i s the nal phase
of the general SVD. W e al so pl an to i ncl ude Jacobi ' s method f or the
dense symmetri c ei genprobl emand SVDi n a l ater rel ease, f or reasons
gi ven earl i er.

9.2 Language and Style

The sof tware i s bei ng devel oped i n portabl e Fortran 77, wi th ex-
tensi ons to the standard onl y where necessary. Si ngl e- and doubl e-
preci si on versi ons wi l l be prepared; conversi on between di erent pre-
ci si ons wi l l be perf ormed automati cal l y by sof tware tool s f romTool -
pack/1.
Routi nes f or compl ex matri ces wi l l use the COMPLEX data type
(l i ke LI NPACK, but unl i ke EI SPACK); hence the avai l abi l i ty of a
doubl e- preci si on compl ex (COMPLEX*16 or DOUBLE COMPLEX)
data type wi l l be assumed as an extensi on to Fortran 77. Routi nes
f or real and compl ex matri ces wi l l be wri tten to mai ntai n a cl ose
correspondence between the two, however, i n some al gori thms (e. g. ,
unsymmetri c ei genval ue probl ems) the correspondence wi l l necessari l y
be weaker.
We real i ze that Fortran 90 i s l i kel y to have a number of f eatures
that woul d i mprove the desi gn and codi ng of the l i brary. I n parti cu-
l ar, i ts bui l t- i n array f eatures woul d repl ace some of the BLAS kernel s

22
and resul t i n cl eaner and easi er- to- read code. Another usef ul f eature
i s the dynami c al l ocati on of workspace. Al most al l bl ock routi nes
need work space, wi th the opti mal amount of storage determi ned de-
pendi ng on the probl emparameters at run- ti me. Currentl y, the user
wi l l pass a work array i n the argument l i st i n the hope that i t wi l l be
bi g enough; i f i t i s not, the bl ock si ze used may be l ess than opti mal .
I n Fortran 90, work space coul d be al l ocated dynami cal l y at run- ti me
i n a f ashi on that i s transparent to the user. Thi s woul d si gni cantl y
shorten cal l i ng sequences and avoi d some common programmi ng mi s-
takes resul ti ng f rompassi ng too l i ttl e work space.
I n a f uture proj ect we pl an to devel op a Fortran 90 versi on of
LAPACK, and al so a C versi on, usi ng tool s f or automati c transf or-
mati on as f ar as possi bl e. However, f or the current phase of i ni ti al
devel opment and testi ng, Fortran 77 was the onl y reasonabl e choi ce.

10 Future Work
Our rst prel i mi nary sof tware rel ease i n Apri l 1989 di stri buted codes
f or l i near equati on sol vi ng and QR decomposi ti on to over 65 test
si tes. Our second prel i mi nary rel ease occured i n Apri l 1990 and
i ncl udes sof tware f or i terati ve re nement, the nonsymmetri c ei gen-
probl em, symmetri c ei genprobl em, and SVD. Banded ei genprobl ems,
general i zed ei genprobl ems and the general i zed SVD, condi ti on esti -
mati on f or the ei genprobl em, and updates of vari ous decomposi ti ons
wi l l f ol l owi n 1990. W
e are worki ng toward the rst publ i c rel ease of
LAPACKat the begi nni ng of 1991.
For the l onger term, we have i denti ed a number of research
di recti ons. Fi rst, we are i nterested i n extendi ng our approach to
di stri buted- memory machi nes. These are more chal l engi ng than the
shared- memory machi nes we have been worki ng on, because of the
addi ti onal cost of communi cati on between di erent processors and
memori es. Second, we woul d l i ke to systemati cal l y devel op parame-
teri zed sof tware that i s both portabl e and eci ent. I n secti on 3 we
i denti ed the bl ock si zeb nas one such parameter. Thi rd, we wi sh
to i denti f y f eatures of computer archi tectures that ei ther hel p or hi n-
der producti on of good numeri cal sof tware. Two exampl es of hel pf ul
f eatures are the abi l i ty to access rows and col umns of matri ces wi th
si mi l ar speeds, and f ri endl y error recovery such as the over ow ag

23
i n I EEE standard oati ng poi nt ari thmeti c [ 1] . Fi nal l y, we wi sh to
provi de perf ormance eval uati on tool s f or new archi tectures.

Acknowledgments
The work i s supported by NSF grant grant ASC- 8715728 and by
DARPAgrant F49620- 87- C0065.

References
[ 1] ANSI /I EEE, New York. I EEE St andard f or Bi nary Fl oat i ng
Poi nt Ari t hmet i c , Std 754- 1985 edi ti on, 1985.
[ 2] M. Ari ol i , J. Demmel , and I . S. Du . Sol vi ng sparse l i near sys-
tems wi twh sparse backward error. SI AMJ. Mat ri x Anal . Appl . ,
10(2): 165{190, Apri l 1989.
[ 3] M. Ari ol i , I . S. Du , and P. P. M. de Ri jk. On the augmented
systemapproach to sparse l east- squares probl ems. Num. Mat h. ,
55: 667{684, 1989.
[ 4] Jesse Barl owand James Demmel . Computi ng accurate ei gensys-
tems of scal ed di agonal l y domi nant matri ces. Computer Sci ence
Dept. Techni cal Report 421, Courant I nsti tute, New York, NY,
December 1988. (to appear i n SI AMJ. Num. Anal . , al so LA-
PACKW orki ng Note #7).
[ 5] Chri s Bi schof , James Demmel , Jack Dongarra, Jeremy Du Croz,
Anne Greenbaum, Sven Hammarl i ng, and Danny Sorensen. La-
pack provi si onal contents. Mathemati cs and Computer Sci ence
Di vi si on Report ANL- 88- 38, Argonne Nati onal Laboratory, Ar-
gonne, I L, September 1988. (LAPACKW orki ng Note #5).
[ 6] Chri sti an H. Bi schof . Adapti ve bl ocki ng i n the QRf actori zati on.
The Journal of Supercomput i ng , 3(3): 193{208, 1989.
[ 7] Chri sti an H. Bi schof and Charl es F. Van Loan. The WYrepre-
sentati on f or products of Househol der matri ces. SI AMJournal
on St at and Sci Compt , 8: s2{s13, 1987.

24
[ 8] Chandl er Davi s and W. Kahan. The rotati on of ei genvectors by
a perturbati on i i i . SI AMJournal on Numeri cal Anal ysi s , 7: 248{
263, 1970.
[ 9] P. Dei f t, J. Demmel , L. - C. Li , and C. Tomei . The bi di agonal si n-
gul ar val ues decomposi ti on and Hami l toni an mechani cs. Com-
puter Sci ence Dept. Techni cal Report 458, Courant I nsti tute,
NewYork, NY, Jul y 1989. (submi tted to SI AMJ. Num. Anal . ).
[ 10] James Demmel , Jeremy Du Croz, Sven Hammarl i ng, and Danny
Sorensen. Gui des f or the desi gn of symmetri c ei genrouti nes, SVD,
i terati ve re nement and condi ti on esti mati on. Mathemati cs and
Computer Sci ence Di vi si on Report ANL/MCS- TM- 111, Argonne
Nati onal Laboratory, Argonne, I L, February 1988. (LAPACK
W orki ng Note #4).
[ 11] James Demmel and W. Kahan. Accurate si ngul ar val ues of bi di -
agonal matri ces. Computer Sci ence Dept. Techni cal Report 326,
Courant I nsti tute, New York, NY, March 1988. (to appear i n
SI AMJ. Sci . Stat. Comp. ; al so LAPACKW
orki ng Note #3).
[ 12] James Demmel and K. Vesel i c . Jacobi 's method i s most accurate.
Computer sci ence dept. techni cal report, Courant I nsti tute, New
York, NY, September 1989.
[ 13] J. Dongarra and D. Sorensen. Af ul l y paral l el al gori thmf or the
symmetri c ei genprobl em. SI AMJ. Sci . St at . Comp. , 8(2): 139{
154, March 1987.
[ 14] J. J. Dongarra, J. R. Bunch, C. B. Mol er, and G. W. Stewart.
LI NPACKUsers' Gui de . SI AMPress, 1979.
[ 15] Jack Dongarra, Jeremy Du Croz, I ai n Du , and Sven Hammar-
l i ng. Aset of l evel 3 basi c l i near al gebra subprograms. Techni -
cal Report MCS- P1- 0888, Argonne Nati onal Laboratory, August
1988, To appear ACMTOMS March 1990.
[ 16] Jack Dongarra and I ai n S. Du . Advanced computer archi tec-
tures. Techni cal Report CS- 89- 90, Uni versi ty of Tennessee, 1989.
[ 17] Jack Dongarra and Eri c Grosse. Di stri buti on of mathemati -
cal sof tware by el ectroni c mai l . Communi cat i ons of t he ACM,
30(5): 403{407, 1987.

25
[ 18] Jack Dongarra, Sven Hammarl i ng, and Danny Sorensen. Bl ock
reducti on of matri ces to condensed f orms f or ei genval ue compu-
tati ons. Journal of Comput at i onal and Appl i ed Mat hemat i cs , 27,
1989.
[ 19] Jack Dongarra and Danny Sorensen. A f ul l y paral l el al gori thm
f or the symmetri c ei genval ue probl em. SI AMJournal on St at
and Sci Compt , 8(2): s139{s154, 1987.
[ 20] Jack Dongarra and Danny Sorensen. Aportabl e envi ronment f or
devel opi ng paral l el programs. Paral l el Comput i ng , 5(1&2): 175{
186, 1987.
[ 21] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarl i ng, and
Ri chard J. Hanson. An extended set of Fortran basi c l i near al ge-
bra subprograms. ACMTransact i ons on Mat hemat i cal Sof t ware ,
14(1): 1{17, 1988.
[ 22] Jack J. Dongarra and Danny C. Sorensen. Li near al gebra on
hi gh- perf ormance computers. I n Udo Schendel , edi tor, Hi gh-
Perf ormance Comput ers 85, pages 3{32, Amsterdam, 1986.
North- Hol l and.
[ 23] The Paral l el Computi ng Forum. PCF Fort ran: Language De -
ni t i on. Kuck and Associ ates, Champai gn, I L 61820, 1988.
[ 24] K. Gal l i van, R. Pl emmons, and A. Sameh. Paral l el al gori thms
f or dense l i near al gebra computati ons. SI AMRevi ew, 32: 54{135,
1990.
[ 25] B. Garbow, J. Boyl e, J. Dongarra, and C. Mol er. Mat ri x Ei gen-
syst emRout i nes { EI SPACKGui de Ext ensi on, vol ume 51 of Lec-
t ure Not es i n Comput er Sci ence. Spri nger Verl ag, New York,
1977.
[ 26] Gene H. Gol ub and Charl es F. Van Loan. Mat ri x Comput a-
t i ons . The Johns Hopki ns Press, Bal ti more, Maryl and, 2nd edi -
ti on, 1989.
[ 27] Kai Hwang and Faye A. Bri ggs. Comput er Archi t ect ure and Par-
al l el Processi ng . McGraw- Hi l l , NewYork, 1984.

26
[28] W. Kahan. Accurate eigenvalues of a symmetri c tri di agonal ma-
tri x. Computer Sci ence Dept. Techni cal Report CS41, Stanf ord
Uni versi ty, Stanf ord, CA, Jul y 1966 (revi sed June 1968).
[ 29] C. Lawson, R. Hanson, D. Ki ncai d, and F. Krogh. Basi c l i near
al gebra subprograms f or Fortran usage. ACM Transactions on
Mat hemat i cal Soft ware , 5:308{323, 1979.
[ 30] Robert Schrei ber and Charl es Van Loan. A storage eci ent
WY representati on f or products of Househol der transf ormati ons.
SIAMJ. Sci St at Comp. , 10(1): 53{57, 1989.
[ 31] A. Van Der Sl ui s. Condi ti on numbers and equi l i brati on of ma-
tri ces. Numeri sche Mat hemat i k, 14: 14{23, 1969.
[ 32] B. Smi th, J. Boyl e, J. Dongarra, B. Garbow, Y. Ikebe, V. Kl ema,
and C. B. Mol er. Mat ri x Ei gensyst em Rout i nes { EI SPACK
Gui de . Spri nger- Verl ag, second edi ti on, 1976.
[ 33] Per Stenstrom. Reduci ng contenti on i n shared- memory mul ti -
processors. I EEE Comput er , 21(11): 26{37, 1988.
[ 34] Harol d Stone. Hi gh-Perf ormance Comput er Archi t ect ure.
Addi son- W
esl ey, Readi ng, Massachussetts, 1987.

27

View publication stats

You might also like