Professional Documents
Culture Documents
net/publication/220781801
CITATIONS READS
382 1,045
10 authors, including:
Some of the authors of this publication are also working on these related projects:
All content following this page was uploaded by Christian Bischof on 02 June 2014.
fo r Hi g h-Pe r f o r ma n ce Co mput e r s
Abstract
The goal of the LAPACK project is to desi gn and i mpl e-
ment a portabl e l i near al gebra l i brary f or eci ent use on a va-
ri ety of hi gh- perf ormance computers. The l i brary i s based on
the wi del y used LINPACKand EI SPACKpackages f or sol vi ng
l i near equati ons, ei genval ue probl ems, and l i near l east- squares
probl ems, but extends thei r f uncti onal i ty i n a number of ways.
The maj or methodol ogy f or maki ng the al gori thms run f aster
i s to restructure themto perf ormbl ock matri x operati ons (e. g. ,
matri x- matri x mul ti pl i cati on) i n thei r i nner l oops. These bl ock
operati ons may be opti mi zed to expl oi t the memory hi erarchy
of a speci c archi tecture. The LAPACKproj ect i s al so worki ng
on new di vi de- and- conquer al gori thms f or certai n ei genprob-
l ems, and new al gori thms that yi el d hi gher rel ati ve accuracy
f or a vari ety of l i near al gebra probl ems.
1 Introducti on
The Uni versi ty of Tennessee, the Courant I nsti tute f or Mathemati cal
Sci ences, the Numeri cal Al gori thms Group Ltd. , Ri ce Uni versi ty, Ar-
gonne Nati onal Laboratory, and Oak Ri dge Nati onal Laboratory are
devel opi ng a transportabl e l i near al gebra l i brary i n Fortran 77. The
l i brary i s i ntended to provi de a uni f ormset of subrouti nes to sol ve
the most common l i near al gebra probl ems and to run eci entl y on a
wi de range of hi gh- perf ormance computers.
The LAPACK l i brary (shorthand f or Li near Al gebra Package)
wi l l provi de routi nes f or sol vi ng systems of si mul taneous l i near equa-
ti ons, l east- squares sol uti ons of overdetermi ned systems of equati ons,
and ei genval ue probl ems. The associ ated matri x f actori zati ons (LU,
1
Chol esky, QR, SVD, Schur, general i zed Schur) wi l l al so be provi ded,
as wi l l rel ated computati ons such as updati ng or reorderi ng of the
f actori zati ons and condi ti on numbers (or esti mates thereof ). Dense
and banded matri ces wi l l be provi ded f or, but not general sparse ma-
tri ces. I n al l areas, si mi l ar f uncti onal i ty wi l l be provi ded f or real and
compl ex matri ces.
The newl i brary wi l l be based on the successf ul EI SPACK[ 32, 25]
and LI NPACK [ 14] l i brari es, i ntegrati ng the two sets of al gori thms
i nto a uni ed, systemati c l i brary. A great deal of eort has al so
been expended to i ncorporate desi gn methodol ogi es and al gori thms
that make the LAPACK codes more appropri ate f or today' s hi gh-
perf ormance archi tectures. The LI NPACKand EI SPACKcodes were
wri tten i n a f ashi on that, f or the most part, i gnored the cost of data
movement. Most of today' s hi gh- perf ormance machi nes, however, i n-
corporate a memory hi erarchy [ 16, 27, 34] to hel p di sgui se the di er-
ence i n speed of memory accesses and vectori zed
oati ng- poi nt oper-
ati ons. As a resul t, codes must be caref ul about reusi ng data i n order
not to run at memory speed i nstead of
oati ng- poi nt speed. LAPACK
codes have been caref ul l y restructured to reuse as much data as pos-
si bl e i n order to reduce the cost of data movement. Further i mprove-
ments are the i ncorporati on of new and i mproved al gori thms f or the
sol uti on of ei genval ue probl ems [ 10, 19] . LAPACKi s al so desi gned to
expl oi t the paral l el processi ng capabi l i ti es of many hi gh- perf ormance
machi nes, especi al l y shared memory machi nes.
LAPACK i s desi gned to be eci ent and transportabl e across a
wi de range of computi ng envi ronments, wi th speci al emphasi s on mod-
ern hi gh- perf ormance computers. Whi l e we do not expect LAPACK
codes to be opti mal f or al l archi tectures, we anti ci pate hi gh perf or-
mance over a wi de range of machi nes. By rel yi ng on the Basi c Li near
Al gebra Subprograms (BLAS) [ 21, 15, 29] the codes can be \tuned"
to a gi ven archi tecture by eci ent|and, i n al l l i kel i hood, machi ne-
dependent|impl ementati ons of these kernel s. Machi ne- speci c opti -
mi zati ons are l i mi ted to those kernel s, and the user i nterf ace i s uni f orm
across machi nes. W e shal l al so di stri bute test and ti mi ng routi nes to
veri f y the i nstal l ati on of the LAPACKcodes on a parti cul ar archi tec-
ture and to al l owf or easy compari son wi th exi sti ng sof tware.
W e ai m f or the new l i brary to be easi l y avai l abl e. Net l i b [ 17]
has proven to be an extremel y conveni ent mechani smf or di stri buti ng
sof tware by el ectroni c mai l , and we shal l make LAPACK avai l abl e
2
through netl i b, f or users who want copi es of sel ected routi nes. W e
shal l al so make arrangements f or the compl ete package to be di s-
tri buted on magneti c tape, f or a nomi nal cost onl y. Fi nal l y we shal l
encourage vendors to i ncorporate LAPACKi n thei r own mathemati -
cal l i brari es.
The rest of thi s paper i s outl i ned as f ol l ows. Secti on 2 descri bes the
BLAS and expl ai ns why thei r use can speed up al gori thms. Secti on
3 descri bes bl ock al gori thms and shows i n some detai l howto reorga-
ni ze Gaussi an el i mi nati on and QR f actori zati on. Secti on 4 contai ns
benchmark resul ts f or l i near equati on sol vi ng and QR f actori zati on
on a vari ety of machi nes. Secti on 5 outl i nes our general approach to
achi evi ng hi gh accuracy; i n general , we have repl aced absol ute error
bounds (ei ther on the backward or f orward error) wi th rel ati ve er-
ror bounds, whi ch better respect the sparsi ty and scal i ng structure
of the ori gi nal probl ems. Secti on 6 di scusses a new bi di agonal si ngu-
l ar val ue decomposi ti on (SVD) whi ch computes si ngul ar val ues and
vectors much more accuratel y than previ ousl y thought possi bl e, and
Secti on 7 descri bes howthe Jacobi method (wi th a modi ed stoppi ng
cri teri on) i s uni f orml y more accurate than QR- based al gori thms f or
the symmetri c posi ti ve deni te ei genval ue probl emand SVD. Secti on
8 revi ews the target machi nes f or whi ch LAPACKi s desi gned to run
most eci entl y. Secti on 9 di scusses the overal l structure of the LA-
PACKl i brary and expl ai ns the choi ce of programmi ng l anguage and
styl e. Fi nal l y, Secti on 10 outl i nes f uture pl ans to extend the l i brary,
i ncl udi ng the chal l enges f aced i n adapti ng the codes to di stri buted-
memory machi nes.
3
More recentl y, hi gher l evel BLAS have been speci ed that per-
f ormoperati ons of hi gher granul ari ty and so oer more opportuni ty
f or opti mi zati on on di erent archi tectures. The Level 2 BLAS [ 21]
perf orm matri x- vector operati ons such as matri x- vector mul ti pl i ca-
ti on and rank- one updates. The Level 3 BLAS [ 15] perf ormmatri x-
matri x operati ons such as matri x- matri x mul ti pl i cati on, sol vi ng tri -
angul ar systems wi th mul ti pl e ri ght hand si des, and rank- k matri x
updates.
To appreci ate why these Level 2 and Level 3 BLAS wi th l arger
granul ari ty oer better opportuni ti es f or eci ency, one must under-
stand memory hi erarchi es. Al l machi nes (not j ust supercomputers)
have a hi erarchy of memory l evel s|f or exampl e, wi th regi sters at the
top, f ol l owed by cache, mai n memory, and nal l y di sk storage at the
bottom. Toward the top of the hi erarchy, memory i s smal l er, more
expensi ve, and f aster. Si nce operati ons such as mul ti pl i cati on and
addi ti on must be done at the top l evel , data has to move up through
the vari ous l evel s to the top to be processed, and then down agai n to
be stored. The resul t i s that data at hi gher l evel s i s avai l abl e onl y
af ter some del ay and may not be avai l abl e at a rate f ast enough to
f eed the ari thmeti c uni ts. Cl earl y, an al gori thmthat mi ni mi zes the
memory trac i n the hi erarchy wi l l run f aster.
One way to measure the amount of thi s memory trac i s the rati o
of
ops (
oati ng poi nt operati ons) to memory ref erences i n an al go-
ri thm. The l arger thi s rati o, the l onger a pi ece of data may be kept at
the top of the hi erarchy on average. Let us use thi s measure to com-
pare the three operati ons of saxpy (Level 1 BLAS), matri x- vector mul -
ti pl i cati on (Level 2 BLAS), and matri x- matri x mul ti pl i cati on (Level
3 BLAS), where al l vector and matri ces are of di mensi on n . Si mpl e
counti ng yi el ds the rati os 2= 3 f or saxpy, 2 f or matri x- vector mul ti pl i -
cati on, and n = 2 f or matri x- matri x mul ti pl i cati on. The l arge rati o f or
matri x- matri x mul ti pl y represents a surf ace- to- vol ume eect, doi ng
O ( n3 ) operati ons on O (2n) data. Hence, matri x- matri x mul ti pl i ca-
ti on oers greater opportuni ty f or expl oi ti ng the memory hi erarchy
than the l ower- l evel BLAS routi nes. Tabl e 1 i l l ustrates thi s f act wi th
some benchmark resul ts.
4
Tabl e 1: Speed of the BLAS on vari ous archi tectures
(al l val ues are i n M
ops)
Al l i ant FX/8 I BM3090/VF CRAY2S
(8 processors) (1 processor) (1 processor)
Another
Peak Speed 94 108 488
Level 1 BLAS 14 26 121
Level 2 BLAS 26 60 350
Level 3 BLAS 43 80 437
5
3. 1 LU Fact or i zat i on
6
Thus each col umn of L and U can be computed i n turn, by per-
f ormi ng steps 1 and 2 i n a l oop wi th j runni ng f rom1 to n . Most of
the work i n thi s l oop can be perf ormed by cal l s to two Level 2 BLAS
routi nes.
To deri ve the bl ock f ormof the al gori thm, we assume thatb n> 1.
As bef ore, suppose that 1L1 , L21 , L31 and U 11 have al ready been
computed. The same approach as bef ore yi el ds the f ol l owi ng method
f or computi ng the next bl ock of bncol umns of L and U .
usi ng the Level 2 BLAS f ormof the al gori thmdescri bed earl i er.
The bl ocks A22 and A 32 are treated together i n order to al l ow
the necessary row i nterchanges (represented by the matri 0x. P
These rowi nterchanges must then be appl i ed to the rest of the
matri x.
Thi s i s j ust one way i n whi ch LUf actori zati on may be organi zed
i n terms of Level 2 or Level 3 BLAS operati ons. There are i n f act at
l east three ways, i nvol vi ng di erent patterns of data movement and
BLAS operati ons. Thei r rel ati ve perf ormance i s machi ne- dependent,
but our experi ence to date has shown that the Level 3 BLAS f orms
do not vary much i n perf ormance (there are greater di erences i n
perf ormance between the Level 2 BLAS f orms).
Tabl e 1a shows the speed i n mega
ops on one processor of a Cray
X- MP of :
the LI NPACKrouti ne SGEFA
7
our LAPACKrouti ne SGETRF, cal l i ng assembl y l anguage BLAS
a hi ghl y opti mi zed assembl y- l anguage al gori thm provi ded by
Cray
Tabl e 1a: LUf actori zati on on a Cray X- MP
n LI NPACK LAPACK Cray
25 18 18 65
100 55 105 163
500 129 203 213
Thus, f or l arge matri ces, the Level 3 BLAS code ran at over 95%
of the speed of the hi ghl y opti mi zed assembl y- l anguage code. Thi s
resul t shows that we can achi eve 95%eci ency f romthe machi ne,
usi ng portabl e Fortran code, wi th machi ne- speci c codi ng conned to
the BLAS.
3. 2 QR Fact orizat i on
9
perf orms the updates (4) and (5). Fi gure 1 shows the bl ock House-
hol der al gori thmusi ng the compact WY representati on. Here A i s
parti ti oned as an M 2 N bl ock matri x, and f or si mpl i ci ty we assume
that al l bl ocks are of the same si ze b 2 b , so m =Mb and n =N b . W e
use the notati on A ( i ; j ) to ref er to bl ock entry ( i ; j ) and A ( i : j ; k : l )
to ref er to the submatri x of A consi sti ng of bl ock row entri es i to j
and bl ock col umn entri es k to l .
f or i =1 t o N do
[ Y ; T ] gencwy ( A ( i : M; i ))
A ( i : M; i : N ) appcwy ( Y ; T ; A ( i : M; i : N ))
end f or
10
we are updati ng a submatri x of si ze ( m 0 ( i 0 1) b +1) 2 ( n 0 ( i 0 1) b +1).
We can f urther reduce the amount of data movement by the f ol l owi ng
al gori thm:
for i =1 to N do
f or j =1 t o i 0 1 do
A ( j : M; i ) appcwyj ; ( TYj ; A ( j : M; i ))
end f or
[ Yi ; Ti] gencwy ( A ( i : M; i ))
end f or
4 Benchmarks
The rst prel i mi nary versi on of LAPACK sof tware was rel eased to
test si tes i n Apri l 1989. Thi s sof tware i ncl uded routi nes f or general ,
posi ti ve deni te, and symmetri c i ndeni te systems and f or QR de-
composi ti on wi thout pi voti ng.
I n Tabl es 2- 4 we present resul ts f or whi ch most or al l of the BLAS
were opti mi zed f or the parti cul ar archi tecture. SGETRF i s the LA-
PACK routi ne f or tri angul ar f actori zati on of a general matri x wi th
parti al pi voti ng, SPOTRF perf orms Chol esky f actori zati on of a posi -
ti ve deni te symmetri c matri x, and SGEQRF does QR f actori zati on
wi thout pi voti ng. Al so shown are SGEMV (matri x- vector mul ti pl y)
11
and SGEMM(matri x- matri x mul ti pl y), si nce these are the \speed
l i mi ts" f or the al gori thms wri tten i n terms of the Level 2 BLAS and
Level 3 BLAS, respecti vel y. Al l codes were run i n si ngl e preci si on (32
bi ts on the Convex and 64 bi ts on the Cray). Al l resul ts are i n M
ops.
12
Tabl es 3 and 4 gi ve resul ts f or the CRAYY- MP f or 1 and 8 pro-
cessors, respecti vel y. Here n f or SGETRF and SPOTRF and
b =64
nb =16 f or SGEQRF. The maxi mumspeed of a si ngl e processor of a
CRAYY- MPi s 333 M
ops. Thus, we see that f or l arge- enough matri x
di mensi ons, the si ngl e- processor code runs at at 90% eci ency. When
al l 8 processors are used, the code attai ns 73%to 80%eci ency.
13
the LAPACKcodes have been caref ul l y desi gned i n a modul ar f ash-
i on and wi th the obj ecti ve of mi ni mi zi ng data movement. Si nce data
movement i s the key i ssue i n di stri buted- memory as wel l as shared-
memory machi nes, the LAPACKcodes shoul d be easi l y \tunabl e" to
more experi mental archi tectures.
Several of the al gori thms we i ntend to i mpl ement [ 19] wi l l re-
qui re more than l oop- based paral l el i sm. These al gori thms wi l l rel y
upon the si mpl i ed SCHEDULEmechani sm[ 20] to i nvoke paral l el i sm.
These i deas mi ght al so be used to express top- l evel paral l el i smi n a
portabl e f ashi on. W e are al so cl osel y f ol l owi ng the acti vi ti es of the
Paral l el Computi ng Forum[ 23] whi ch has been f ormed by computer
vendors, sof tware devel opers, nati onal l aboratori es, and uni versi ti es
to exchange techni cal i nf ormati on and to document agreements on
constructs f or programmi ng paral l el appl i cati ons f or shared- memory
mul ti processors.
14
anal ysi s , but i s by no means the onl y way to proceed.
There i s a great deal of choi ce i n the measures we choose to bound
errors and measure di stances. I n conventi onal error anal ysi s as i n-
troduced by Wil ki nson, we bound k f ( H + H ) 0 f ( H ) k i n terms
of k H k , and show f^( H ) = f ( H + H ) where k H k O ( " ) k H k .
Here, k 1 k denotes a norm, l i ke the one- norm or Frobeni us norm.
Typi cal l y one proves a f ormul a of the f ormk f ( H + H ) 0 f ( H ) k
( f ; H ) 1 k H k +O (2k), where
H k ( H ) i s cal l ed the condi t i on num-
ber of H wi t h respect t o f . I n thi s f ormul ati on, i t i s easy to see that
( f ; H ) i s si mpl y the normof the gradi ent of f at H : k rf ( H ) k ; other
scal i ngs are possi bl e. Thus, combi ni ng the perturbati on theory and
error anal ysi s, one can wri te
k f ( H + H ) 0 f ( H ) k O ( " ) ( f ; H ) 1 k2H) k +O ( "
The drawback of thi s approach i s that i t does not respect the
structure of the ori gi nal data. I n parti cul ar, i f the ori gi nal data i s
sparse or graded (l arge i n some entri es, smal l i n others), boundi ng
H onl y by normcan gi ve very pessi mi sti c resul ts. Atri vi al exampl e
i s sol vi ng a di agonal system of equati ons. Each component of the
sol uti on i s computed to f ul l accuracy by a si ngl e di vi de operati on,
but the conventi onal condi ti on number i s the rati o of the l argest to
smal l est di agonal entri es and may be arbi trari l y l arge.
I nstead of boundi ng H by i ts normk H k , one may i nstead use the
measure r eHl ( H ) maxi j j iHjj = ji jH
j , the l argest rel ati ve change i n
any entry (we use the notati on r He lto i ndi cate the dependence on
H ). Thi s measure respects sparsi ty, si ncei jmust H be zero i f Hi j i s
zero, and al so gradi ng, si nce every entry i s perturbed by an amount
smal l compared to i ts magni tude. For exampl e, i n the case of di agonal
l i near equati on sol vi ng, one can easi l y see that a perturbati on H of
si ze r H e (l H ) i n the matri x can onl y change the sol uti on rel ati vel y
by r eHl ( H ) i n each component, and that the al gori thmi s backward
stabl e wi th r He (l H ) " . Thus, the new perturbati on theory and
error anal ysi s wi th respect to Hr (e lH ) accuratel y predi ct that each
component of the sol uti on i s computed to f ul l rel ati ve accuracy.
W e have successf ul l y devel oped new perturbati on theory, al go-
ri thms, and error anal ysi s f or the measure Hr(e lH ) f or much of nu-
meri cal l i near al gebra. W e cannot al ways guarantee to sol ve probl ems
as though we had a smal l r eHl ( H ), but the al gori thms can i n al l
15
cases moni tor thei r accuracy and produce usef ul error bounds. The
al gori thms are usual l y smal l vari ati ons on conventi onal al gori thms,
perhaps wi th a sl i ghtl y di erent stoppi ng cri teri on, al though the bi di -
agonal SVDal gori thmhas a qui te new component. I n al l cases the
al gori thms run approxi matel y as f ast as thei r conventi onal counter-
parts, someti mes a l i ttl e sl ower and someti mes a l i ttl e f aster. Si nce
they are based on the conventi onal al gori thms, al l the techni ques us-
i ng the Level 3 BLAS appl y to them.
Thi s approach has been appl i ed to l i near equati on sol vi ng [ 2] , l i n-
ear l east- squares probl ems [ 3] , the bi di agonal SVD[ 11, 9] , the tri di -
agonal symmetri c ei genprobl em[ 28, 4] , the dense symmetri c posi ti ve
deni te ei genprobl em[ 12] , and the dense deni te general i zed ei gen-
probl em[ 4, 12] . W e have si mi l ar but sl i ghtl y weaker resul ts f or the
dense SVDand general i zed SVD[ 12] . These al gori thms ei ther wi l l be
i ncl uded di rectl y i n LAPACKor can be easi l y constructed by usi ng
LAPACKsubrouti nes as \bui l di ng bl ocks. "
I n the next two secti ons, we shal l descri be these new al gori thms
f or the bi di agonal SVDand dense symmetri c posi ti ve deni te ei gen-
probl em, respecti vel y. The LAPACK routi ne SBDSQR i mpl ements
the newbi di agonal SVDal gori thm; the work on the symmetri c ei gen-
probl emi s a more recent devel opment, and sof tware has not yet been
prepared f or i ncl usi on i n LAPACK.
16
f ol l owi ng resul ts are wel l known [ 26, 8] :
j i 0 i0 j k B2 k
max(si n ( ui ; u0i ) ; si n i ;( vi0v))
k B2k (6)
g a ip0 k B2k
where the absol ut e gap between si ngul ar val ues i s
g a ip mi n j j 0 i j (7)
j 6 =i
r2 e l g(1a+0p )
1=2
max(si n ( ui ; u0i) ; si n i ;( vi0v)) (8)
i
r e l g iap mi n
j j 0 ij (9)
j 6 =i j + i
One can easi l y see that bounds (8) are at l east about as strong as
thei r conventi onal counterparts (6). To see how much stronger they
may be, consi der maki ng rel ati ve perturbati ons of si ze01010i n a 3
17
by 3 bi di agonal matri x wi th si ngul ar val ues1 = 1, 2 = 2 1 10020 ,
and 3 =10 020 . Note that g a3p=g a p2 =10 020 and that r e l g3 a=p
r e l g2 a=p1= 3. Si nce the normof thi s perturbati on i s about01010,
we may appl y (6) to get the absol ute error bound 10 010 3 f or 3;
and, si nce 10 g a p3, no error bound f or the si ngul ar vectors at
010
18
convergent; neverthel ess, no speed i s l ost, because i t i s appl i ed onl y
when min =max , i ts convergence f actor, i s qui te smal l . Thus, con-
vergence i s guaranteed to be f ast, even though l i near. See [ 11] f or
detai l ed resul ts of numeri cal tests.
19
I n contrast, the new perturbati on theory uses the measure =
relH ( H ) i ntroduced i n secti on 6, but appl i es onl y to posi ti ve deni te
matri ces. To present the resul ts, we rewri te H as H = DAD, where
1=2 1=2
D = di ag ( H11 ; : : : ;nn H) i s di agonal and A has uni t di agonal . I t
i s known that A ' s condi ti on number ( A ) kA2k A01 k2 sati ses
( A ) mi nD^ (DH
^ D ^ ), where the mi ni mumi s over al l nonsi ngul ar
di agonal matri cesD ^ [ 31] . I n other words, A i s nearl y the best di agonal
scal i ng of H . Then we may show[ 12]
ji 0 0ij n 1 1 ( A )
i
( n 0 1)1=2 1 1 ( A ) 2
si n ( vi ; i0v) + O( ) (12)
r e l gap i
r e l giap mi n
j j 0 ij (13)
j 6=i ( j 1 i )1=2
There i s al so a nonasymptoti c versi on of the ei genvector bound, but
we omi t i t f or si mpl i ci ty of presentati on.
Nowwe contrast the new wi th the ol d bounds. From(10) we get
the conventi onal rel ati ve error bound
j i 0 0ij n 1 1 ( H )
i
to contrast wi th (12). To see howmuch stronger (12) may be, consi der
the symmetri c posi ti ve deni te matri x H =DAD where
2 40 29 19 3 2 3
10 10 10 1 : 1 : 1
H = 64 1029 1020 109 75 ; A =64 : 1 1 : 751
1019 109 1 : 1 : 1 1
and D = di ag (10 20 ; 1010; 1). Here ( H ) 4010and ( A ) 1: 33.
Thus, rel ati ve perturbati ons i n the matri x entri es onl y cause 4
rel ati ve perturbati ons i n the ei genval ues accordi ng to the new theo-
rem, and 3 1 1040 1 rel ati ve perturbati ons accordi ng to the conven-
ti onal theorem. Al so, the absol ute gaps f or the ei genval ues of H are
g a 1p;2; 3 10020 ; 10020 ; 1, whereas the rel ati ve gaps r e1l; 2g; 3aare
p
al l approxi matel y 10 1 v
10 . Thus the newtheory predi cts errors i n and
20
W e can showthat Jacobi ' s method (wi th a modi ed stoppi ng cri -
teri on) wi l l attai n these new error bounds, but i n general QR wi l l
not. For the exampl e i n the l ast paragraph, QRcomputes two out of
the three ei genval ues as negati ve, whereas H i s posi ti ve deni te. I n
contrast, Jacobi computes al l the ei genval ues to nearl y f ul l machi ne
preci si on. I n f act f or thi s exampl e we can showJacobi computes al l
components of al l ei genvectors to nearl y f ul l rel ati ve accuracy, even
though they vary by 16 orders of magni tude; agai n QRwi l l not even
get the si gns of many smal l components correct.
Al l these resul ts may be extended to the dense SVD, except that
the noti on of perturbati on used i s that each col umn of H must be
smal l i n normcompared to the correspondi ng col umn of H .
21
compute the SVD; and f or pai rs of matri ces, i t wi l l compute the gen-
eral i zed SVD. I t wi l l al so do rank- 1 updates of the bi di agonal SVD
and tri di agonal symmetri c ei genprobl em.
LAPACKwi l l i ncl ude condi ti on esti mators f or l i near systems, ei gen-
val ues, ei genvectors, and i nvari ant subspaces. I t wi l l sol ve the Syl vester
equati on and general i zed Syl vester equati on and can reorder the ei gen-
val ues al ong the di agonal of a matri x i n Schur f orm.
The nonsymmetri c ei genrouti nes wi l l be based on the QR al go-
ri thm. I nverse i terati on wi l l be avai l abl e i f onl y a f ew ei genvectors
are desi red. For the symmetri c tri di agonal ei genprobl em(the nal
phase of the dense symmetri c ei genprobl em) al gori thms based on QR,
di vi de- and- conquer [ 13] , and bi secti on wi l l be i ncl uded; these al go-
ri thms have di erent attai nabl e accuraci es, requi re di erent amounts
of storage, and run at di erent speeds on di erent archi tectures and
dependi ng on whether some or al l ei genval ues and ei genvectors are
desi red. I n other words, no si ngl e al gori thmi s best i n al l cases [ 10] .
The same i s true f or the bi di agonal SVD, whi ch i s the nal phase
of the general SVD. W e al so pl an to i ncl ude Jacobi ' s method f or the
dense symmetri c ei genprobl emand SVDi n a l ater rel ease, f or reasons
gi ven earl i er.
The sof tware i s bei ng devel oped i n portabl e Fortran 77, wi th ex-
tensi ons to the standard onl y where necessary. Si ngl e- and doubl e-
preci si on versi ons wi l l be prepared; conversi on between di erent pre-
ci si ons wi l l be perf ormed automati cal l y by sof tware tool s f romTool -
pack/1.
Routi nes f or compl ex matri ces wi l l use the COMPLEX data type
(l i ke LI NPACK, but unl i ke EI SPACK); hence the avai l abi l i ty of a
doubl e- preci si on compl ex (COMPLEX*16 or DOUBLE COMPLEX)
data type wi l l be assumed as an extensi on to Fortran 77. Routi nes
f or real and compl ex matri ces wi l l be wri tten to mai ntai n a cl ose
correspondence between the two, however, i n some al gori thms (e. g. ,
unsymmetri c ei genval ue probl ems) the correspondence wi l l necessari l y
be weaker.
We real i ze that Fortran 90 i s l i kel y to have a number of f eatures
that woul d i mprove the desi gn and codi ng of the l i brary. I n parti cu-
l ar, i ts bui l t- i n array f eatures woul d repl ace some of the BLAS kernel s
22
and resul t i n cl eaner and easi er- to- read code. Another usef ul f eature
i s the dynami c al l ocati on of workspace. Al most al l bl ock routi nes
need work space, wi th the opti mal amount of storage determi ned de-
pendi ng on the probl emparameters at run- ti me. Currentl y, the user
wi l l pass a work array i n the argument l i st i n the hope that i t wi l l be
bi g enough; i f i t i s not, the bl ock si ze used may be l ess than opti mal .
I n Fortran 90, work space coul d be al l ocated dynami cal l y at run- ti me
i n a f ashi on that i s transparent to the user. Thi s woul d si gni cantl y
shorten cal l i ng sequences and avoi d some common programmi ng mi s-
takes resul ti ng f rompassi ng too l i ttl e work space.
I n a f uture proj ect we pl an to devel op a Fortran 90 versi on of
LAPACK, and al so a C versi on, usi ng tool s f or automati c transf or-
mati on as f ar as possi bl e. However, f or the current phase of i ni ti al
devel opment and testi ng, Fortran 77 was the onl y reasonabl e choi ce.
10 Future Work
Our rst prel i mi nary sof tware rel ease i n Apri l 1989 di stri buted codes
f or l i near equati on sol vi ng and QR decomposi ti on to over 65 test
si tes. Our second prel i mi nary rel ease occured i n Apri l 1990 and
i ncl udes sof tware f or i terati ve renement, the nonsymmetri c ei gen-
probl em, symmetri c ei genprobl em, and SVD. Banded ei genprobl ems,
general i zed ei genprobl ems and the general i zed SVD, condi ti on esti -
mati on f or the ei genprobl em, and updates of vari ous decomposi ti ons
wi l l f ol l owi n 1990. W
e are worki ng toward the rst publ i c rel ease of
LAPACKat the begi nni ng of 1991.
For the l onger term, we have i denti ed a number of research
di recti ons. Fi rst, we are i nterested i n extendi ng our approach to
di stri buted- memory machi nes. These are more chal l engi ng than the
shared- memory machi nes we have been worki ng on, because of the
addi ti onal cost of communi cati on between di erent processors and
memori es. Second, we woul d l i ke to systemati cal l y devel op parame-
teri zed sof tware that i s both portabl e and eci ent. I n secti on 3 we
i denti ed the bl ock si zeb nas one such parameter. Thi rd, we wi sh
to i denti f y f eatures of computer archi tectures that ei ther hel p or hi n-
der producti on of good numeri cal sof tware. Two exampl es of hel pf ul
f eatures are the abi l i ty to access rows and col umns of matri ces wi th
si mi l ar speeds, and f ri endl y error recovery such as the over
ow
ag
23
i n I EEE standard
oati ng poi nt ari thmeti c [ 1] . Fi nal l y, we wi sh to
provi de perf ormance eval uati on tool s f or new archi tectures.
Acknowledgments
The work i s supported by NSF grant grant ASC- 8715728 and by
DARPAgrant F49620- 87- C0065.
References
[ 1] ANSI /I EEE, New York. I EEE St andard f or Bi nary Fl oat i ng
Poi nt Ari t hmet i c , Std 754- 1985 edi ti on, 1985.
[ 2] M. Ari ol i , J. Demmel , and I . S. Du. Sol vi ng sparse l i near sys-
tems wi twh sparse backward error. SI AMJ. Mat ri x Anal . Appl . ,
10(2): 165{190, Apri l 1989.
[ 3] M. Ari ol i , I . S. Du, and P. P. M. de Ri jk. On the augmented
systemapproach to sparse l east- squares probl ems. Num. Mat h. ,
55: 667{684, 1989.
[ 4] Jesse Barl owand James Demmel . Computi ng accurate ei gensys-
tems of scal ed di agonal l y domi nant matri ces. Computer Sci ence
Dept. Techni cal Report 421, Courant I nsti tute, New York, NY,
December 1988. (to appear i n SI AMJ. Num. Anal . , al so LA-
PACKW orki ng Note #7).
[ 5] Chri s Bi schof , James Demmel , Jack Dongarra, Jeremy Du Croz,
Anne Greenbaum, Sven Hammarl i ng, and Danny Sorensen. La-
pack provi si onal contents. Mathemati cs and Computer Sci ence
Di vi si on Report ANL- 88- 38, Argonne Nati onal Laboratory, Ar-
gonne, I L, September 1988. (LAPACKW orki ng Note #5).
[ 6] Chri sti an H. Bi schof . Adapti ve bl ocki ng i n the QRf actori zati on.
The Journal of Supercomput i ng , 3(3): 193{208, 1989.
[ 7] Chri sti an H. Bi schof and Charl es F. Van Loan. The WYrepre-
sentati on f or products of Househol der matri ces. SI AMJournal
on St at and Sci Compt , 8: s2{s13, 1987.
24
[ 8] Chandl er Davi s and W. Kahan. The rotati on of ei genvectors by
a perturbati on i i i . SI AMJournal on Numeri cal Anal ysi s , 7: 248{
263, 1970.
[ 9] P. Dei f t, J. Demmel , L. - C. Li , and C. Tomei . The bi di agonal si n-
gul ar val ues decomposi ti on and Hami l toni an mechani cs. Com-
puter Sci ence Dept. Techni cal Report 458, Courant I nsti tute,
NewYork, NY, Jul y 1989. (submi tted to SI AMJ. Num. Anal . ).
[ 10] James Demmel , Jeremy Du Croz, Sven Hammarl i ng, and Danny
Sorensen. Gui des f or the desi gn of symmetri c ei genrouti nes, SVD,
i terati ve renement and condi ti on esti mati on. Mathemati cs and
Computer Sci ence Di vi si on Report ANL/MCS- TM- 111, Argonne
Nati onal Laboratory, Argonne, I L, February 1988. (LAPACK
W orki ng Note #4).
[ 11] James Demmel and W. Kahan. Accurate si ngul ar val ues of bi di -
agonal matri ces. Computer Sci ence Dept. Techni cal Report 326,
Courant I nsti tute, New York, NY, March 1988. (to appear i n
SI AMJ. Sci . Stat. Comp. ; al so LAPACKW
orki ng Note #3).
[ 12] James Demmel and K. Vesel i c . Jacobi 's method i s most accurate.
Computer sci ence dept. techni cal report, Courant I nsti tute, New
York, NY, September 1989.
[ 13] J. Dongarra and D. Sorensen. Af ul l y paral l el al gori thmf or the
symmetri c ei genprobl em. SI AMJ. Sci . St at . Comp. , 8(2): 139{
154, March 1987.
[ 14] J. J. Dongarra, J. R. Bunch, C. B. Mol er, and G. W. Stewart.
LI NPACKUsers' Gui de . SI AMPress, 1979.
[ 15] Jack Dongarra, Jeremy Du Croz, I ai n Du, and Sven Hammar-
l i ng. Aset of l evel 3 basi c l i near al gebra subprograms. Techni -
cal Report MCS- P1- 0888, Argonne Nati onal Laboratory, August
1988, To appear ACMTOMS March 1990.
[ 16] Jack Dongarra and I ai n S. Du. Advanced computer archi tec-
tures. Techni cal Report CS- 89- 90, Uni versi ty of Tennessee, 1989.
[ 17] Jack Dongarra and Eri c Grosse. Di stri buti on of mathemati -
cal sof tware by el ectroni c mai l . Communi cat i ons of t he ACM,
30(5): 403{407, 1987.
25
[ 18] Jack Dongarra, Sven Hammarl i ng, and Danny Sorensen. Bl ock
reducti on of matri ces to condensed f orms f or ei genval ue compu-
tati ons. Journal of Comput at i onal and Appl i ed Mat hemat i cs , 27,
1989.
[ 19] Jack Dongarra and Danny Sorensen. A f ul l y paral l el al gori thm
f or the symmetri c ei genval ue probl em. SI AMJournal on St at
and Sci Compt , 8(2): s139{s154, 1987.
[ 20] Jack Dongarra and Danny Sorensen. Aportabl e envi ronment f or
devel opi ng paral l el programs. Paral l el Comput i ng , 5(1&2): 175{
186, 1987.
[ 21] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarl i ng, and
Ri chard J. Hanson. An extended set of Fortran basi c l i near al ge-
bra subprograms. ACMTransact i ons on Mat hemat i cal Sof t ware ,
14(1): 1{17, 1988.
[ 22] Jack J. Dongarra and Danny C. Sorensen. Li near al gebra on
hi gh- perf ormance computers. I n Udo Schendel , edi tor, Hi gh-
Perf ormance Comput ers 85, pages 3{32, Amsterdam, 1986.
North- Hol l and.
[ 23] The Paral l el Computi ng Forum. PCF Fort ran: Language De-
ni t i on. Kuck and Associ ates, Champai gn, I L 61820, 1988.
[ 24] K. Gal l i van, R. Pl emmons, and A. Sameh. Paral l el al gori thms
f or dense l i near al gebra computati ons. SI AMRevi ew, 32: 54{135,
1990.
[ 25] B. Garbow, J. Boyl e, J. Dongarra, and C. Mol er. Mat ri x Ei gen-
syst emRout i nes { EI SPACKGui de Ext ensi on, vol ume 51 of Lec-
t ure Not es i n Comput er Sci ence. Spri nger Verl ag, New York,
1977.
[ 26] Gene H. Gol ub and Charl es F. Van Loan. Mat ri x Comput a-
t i ons . The Johns Hopki ns Press, Bal ti more, Maryl and, 2nd edi -
ti on, 1989.
[ 27] Kai Hwang and Faye A. Bri ggs. Comput er Archi t ect ure and Par-
al l el Processi ng . McGraw- Hi l l , NewYork, 1984.
26
[28] W. Kahan. Accurate eigenvalues of a symmetri c tri di agonal ma-
tri x. Computer Sci ence Dept. Techni cal Report CS41, Stanf ord
Uni versi ty, Stanf ord, CA, Jul y 1966 (revi sed June 1968).
[ 29] C. Lawson, R. Hanson, D. Ki ncai d, and F. Krogh. Basi c l i near
al gebra subprograms f or Fortran usage. ACM Transactions on
Mat hemat i cal Soft ware , 5:308{323, 1979.
[ 30] Robert Schrei ber and Charl es Van Loan. A storage eci ent
WY representati on f or products of Househol der transf ormati ons.
SIAMJ. Sci St at Comp. , 10(1): 53{57, 1989.
[ 31] A. Van Der Sl ui s. Condi ti on numbers and equi l i brati on of ma-
tri ces. Numeri sche Mat hemat i k, 14: 14{23, 1969.
[ 32] B. Smi th, J. Boyl e, J. Dongarra, B. Garbow, Y. Ikebe, V. Kl ema,
and C. B. Mol er. Mat ri x Ei gensyst em Rout i nes { EI SPACK
Gui de . Spri nger- Verl ag, second edi ti on, 1976.
[ 33] Per Stenstrom. Reduci ng contenti on i n shared- memory mul ti -
processors. I EEE Comput er , 21(11): 26{37, 1988.
[ 34] Harol d Stone. Hi gh-Perf ormance Comput er Archi t ect ure.
Addi son- W
esl ey, Readi ng, Massachussetts, 1987.
27