GENE H. GOLUB · CHARLES F. VAN LOAN

MATRIX COMPUTATIONS

Third Edition

Johns Hopkins Studies in the Mathematical Sciences
in association with the Department of Mathematical Sciences, The Johns Hopkins University

Gene H. Golub, Department of Computer Science, Stanford University
Charles F. Van Loan, Department of Computer Science, Cornell University

The Johns Hopkins University Press, Baltimore and London

© 1983, 1989, 1996 The Johns Hopkins University Press
All rights reserved. Published 1996
Printed in the United States of America on acid-free paper
05 04 03 02 01 00 99 98 97    5 4 3 2

First edition 1983
Second edition 1989
Third edition 1996

The Johns Hopkins University Press, 2715 North Charles Street, Baltimore, Maryland 21218-4319
The Johns Hopkins Press Ltd., London

Library of Congress Cataloging-in-Publication Data will be found at the end of this book.

ISBN 0-8018-5414-8 (pbk.)

DEDICATED TO
ALSTON S. HOUSEHOLDER AND JAMES H. WILKINSON

Contents

Preface to the Third Edition
Software
Selected References

1  Matrix Multiplication Problems
   1.1  Basic Algorithms and Notation
   1.2  Exploiting Structure
   1.3  Block Matrices and Algorithms
   1.4  Vectorization and Re-Use Issues

2  Matrix Analysis
   2.1  Basic Ideas from Linear Algebra
   2.2  Vector Norms
   2.3  Matrix Norms
   2.4  Finite Precision Matrix Computations
   2.5  Orthogonality and the SVD
   2.6  Projections and the CS Decomposition
   2.7  The Sensitivity of Square Linear Systems

3  General Linear Systems
   3.1  Triangular Systems
   3.2  The LU Factorization
   3.3  Roundoff Analysis of Gaussian Elimination
   3.4  Pivoting
   3.5  Improving and Estimating Accuracy

4  Special Linear Systems
   4.1  The LDM^T and LDL^T Factorizations
   4.2  Positive Definite Systems
   4.3  Banded Systems
   4.4  Symmetric Indefinite Systems
   4.5  Block Systems
   4.6  Vandermonde Systems and the FFT
   4.7  Toeplitz and Related Systems

5  Orthogonalization and Least Squares
   5.1  Householder and Givens Matrices
   5.2  The QR Factorization
   5.3  The Full Rank LS Problem
   5.4  Other Orthogonal Factorizations
   5.5  The Rank Deficient LS Problem
   5.6  Weighting and Iterative Improvement
   5.7  Square and Underdetermined Systems

6  Parallel Matrix Computations
   6.1  Basic Concepts
   6.2  Matrix Multiplication
   6.3  Factorizations

7  The Unsymmetric Eigenvalue Problem
   7.1  Properties and Decompositions
   7.2  Perturbation Theory
   7.3  Power Iterations
   7.4  The Hessenberg and Real Schur Forms
   7.5  The Practical QR Algorithm
   7.6  Invariant Subspace Computations
   7.7  The QZ Method for Ax = λBx

8  The Symmetric Eigenvalue Problem
   8.1  Properties and Decompositions
   8.2  Power Iterations
   8.3  The Symmetric QR Algorithm
   8.4  Jacobi Methods
   8.5  Tridiagonal Methods
   8.6  Computing the SVD
   8.7  Some Generalized Eigenvalue Problems

9  Lanczos Methods
   9.1  Derivation and Convergence Properties
   9.2  Practical Lanczos Procedures
   9.3  Applications to Ax = b and Least Squares
   9.4  Arnoldi and Unsymmetric Lanczos

10  Iterative Methods for Linear Systems
   10.1  The Standard Iterations
   10.2  The Conjugate Gradient Method
   10.3  Preconditioned Conjugate Gradients
   10.4  Other Krylov Subspace Methods
11  Functions of Matrices
   11.1  Eigenvalue Methods
   11.2  Approximation Methods
   11.3  The Matrix Exponential

12  Special Topics
   12.1  Constrained Least Squares
   12.2  Subset Selection Using the SVD
   12.3  Total Least Squares
   12.4  Computing Subspaces with the SVD
   12.5  Updating Matrix Factorizations
   12.6  Modified/Structured Eigenproblems

Bibliography
Index

Preface to the Third Edition

The field of matrix computations continues to grow and mature. In the Third Edition we have added over 300 new references and 100 new problems. The LINPACK and EISPACK citations have been replaced with appropriate pointers to LAPACK with key codes tabulated at the beginning of appropriate chapters.

In the First Edition and Second Edition we identified a small number of global references: Wilkinson (1965), Forsythe and Moler (1967), Stewart (1973), Lawson and Hanson (1974), and Parlett (1980). These volumes are as important as ever to the research landscape, but there are some magnificent new textbooks and monographs on the scene. See The Literature section that follows.

We continue as before with the practice of giving references at the end of each section and a master bibliography at the end of the book.

The earlier editions suffered from a large number of typographical errors and we are obliged to the dozens of readers who have brought these to our attention. Many corrections and clarifications have been made.

Here are some specific highlights of the new edition. Chapter 1 (Matrix Multiplication Problems) and Chapter 6 (Parallel Matrix Computations) have been completely rewritten with less formality. We think that this facilitates the building of intuition for high performance computing and draws a better line between algorithm and implementation on the printed page.

In Chapter 2 (Matrix Analysis) we expanded the treatment of the CS decomposition and included a proof. The overview of floating point arithmetic has been brought up to date. In Chapter 4 (Special Linear Systems) we embellished the Toeplitz section with connections to circulant matrices and the fast Fourier transform. A subsection on equilibrium systems has been included in our treatment of indefinite systems.

A more accurate rendition of the modified Gram-Schmidt process is offered in Chapter 5 (Orthogonalization and Least Squares). Chapter 8 (The Symmetric Eigenproblem) has been extensively rewritten and rearranged so as to minimize its dependence upon Chapter 7 (The Unsymmetric Eigenproblem). Indeed, the coupling between these two chapters is now so minimal that it is possible to read either one first.

In Chapter 9 (Lanczos Methods) we have expanded the discussion of the unsymmetric Lanczos process and the Arnoldi iteration. The "unsymmetric component" of Chapter 10 (Iterative Methods for Linear Systems) has likewise been broadened with a whole new section devoted to various Krylov space methods designed to handle the sparse unsymmetric linear system problem.

In §12.5 (Updating Orthogonal Decompositions) we included a new subsection on ULV updating. Toeplitz matrix eigenproblems and orthogonal matrix eigenproblems are discussed in §12.6.

Both of us look forward to continuing the dialog with our readers.
As we said in the Preface to the Second Edition, "It has been a pleasure to deal with such an interested and friendly readership."

Many individuals made valuable Third Edition suggestions, but Greg Ammar, Mike Heath, Nick Trefethen, and Steve Vavasis deserve special thanks. Finally, we would like to acknowledge the support of Cindy Robinson at Cornell. A dedicated assistant makes a big difference.

Software

LAPACK

Many of the algorithms in this book are implemented in the software package LAPACK:

E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen (1995). LAPACK Users' Guide, Release 2.0, 2nd ed., SIAM Publications, Philadelphia.

Pointers to some of the more important routines in this package are given at the beginning of selected chapters:

   Chapter 1.  Level-1, Level-2, Level-3 BLAS
   Chapter 3.  General Linear Systems
   Chapter 4.  Positive Definite and Band Systems
   Chapter 5.  Orthogonalization and Least Squares Problems
   Chapter 7.  The Unsymmetric Eigenvalue Problem
   Chapter 8.  The Symmetric Eigenvalue Problem

Our LAPACK references are spare in detail but rich enough to "get you started." Thus, when we say that _TRSV can be used to solve a triangular system Ax = b, we leave it to you to discover through the LAPACK manual that A can be either upper or lower triangular and that the transposed system A^T x = b can be handled as well. Moreover, the underscore is a placeholder whose mission is to designate type (single, double, complex, etc.).

LAPACK stands on the shoulders of two other packages that are milestones in the history of software development. EISPACK was developed in the early 1970s and is dedicated to solving symmetric, unsymmetric, and generalized eigenproblems:

B.T. Smith, J.M. Boyle, Y. Ikebe, V.C. Klema, and C.B. Moler (1970). Matrix Eigensystem Routines: EISPACK Guide, 2nd ed., Lecture Notes in Computer Science, Volume 6, Springer-Verlag, New York.

B.S. Garbow, J.M. Boyle, J.J. Dongarra, and C.B. Moler (1972). Matrix Eigensystem Routines: EISPACK Guide Extension, Lecture Notes in Computer Science, Volume 51, Springer-Verlag, New York.

LINPACK was developed in the late 1970s for linear equations and least squares problems.

EISPACK and LINPACK have their roots in a sequence of papers that feature Algol implementations of some of the key matrix factorizations. These papers are collected in

J.H. Wilkinson and C. Reinsch, eds. (1971). Handbook for Automatic Computation, Vol. 2, Linear Algebra, Springer-Verlag, New York.

NETLIB

A wide range of software including LAPACK, EISPACK, and LINPACK is available electronically via Netlib:

   World Wide Web:  http://www.netlib.org/index.html
   Anonymous ftp:   ftp://ftp.netlib.org

Via email, send the one-line message "send index" to netlib@ornl.gov to get started.

MATLAB

Complementing LAPACK and defining a very popular matrix computation environment is MATLAB:

MATLAB User's Guide, The MathWorks Inc., Natick, Massachusetts.

M. Marcus (1993). Matrices and MATLAB: A Tutorial, Prentice Hall, Upper Saddle River, NJ.

R. Pratap (1995). Getting Started with MATLAB, Saunders College Publishing, Fort Worth, TX.

Many of the problems in Matrix Computations are best posed to students as MATLAB problems. We make extensive use of MATLAB notation in the presentation of algorithms.

Selected References

Each section in the book concludes with an annotated list of references. A master bibliography is given at the end of the text.
Useful books that collectively cover the field are cited below. Chapter titles are included if appropriate, but do not infer too much from the level of detail because one author's chapter may be another's subsection. The citations are classified as follows:

   Pre-1970 Classics. Early volumes that set the stage.
   Introductory (General). Suitable for the undergraduate classroom.
   Advanced (General). Best for practitioners and graduate students.
   Analytical. For the supporting mathematics.
   Linear Equation Problems. Ax = b.
   Linear Fitting Problems. Ax ≈ b.
   Eigenvalue Problems. Ax = λx.
   High Performance. Parallel/vector issues.
   Edited Volumes. Useful, thematic collections.

Within each group the entries are specified in chronological order.

Pre-1970 Classics

V.N. Faddeeva (1959). Computational Methods of Linear Algebra, Dover, New York.
  Basic Material from Linear Algebra. Systems of Linear Equations. The Proper Numbers and Proper Vectors of a Matrix.

E. Bodewig (1959). Matrix Calculus, North Holland, Amsterdam.
  Matrix Calculus. Direct Methods for Linear Equations. Indirect Methods for Linear Equations. Inversion of Matrices. Geodetic Matrices. Eigenproblems.

R.S. Varga (1962). Matrix Iterative Analysis, Prentice-Hall, Englewood Cliffs, NJ.

J.H. Wilkinson (1963). Rounding Errors in Algebraic Processes, Prentice-Hall, Englewood Cliffs, NJ.
  The Fundamental Arithmetic Operations. Computations Involving Polynomials. Matrix Computations.

A.S. Householder (1964). Theory of Matrices in Numerical Analysis, Blaisdell, New York. Reprinted in 1974 by Dover, New York.
  Some Basic Identities and Inequalities. Norms, Bounds, and Convergence. Localization Theorems and Other Inequalities. The Solution of Linear Systems: Methods of Successive Approximation. Direct Methods of Inversion. Proper Values and Vectors: Normalization and Reduction of the Matrix. Proper Values and Vectors: Successive Approximation.

L. Fox (1964). An Introduction to Numerical Linear Algebra, Oxford University Press, Oxford, England.
  Introduction. Matrix Algebra. Elimination Methods of Gauss, Jordan, and Aitken. Compact Elimination Methods of Doolittle, Crout, Banachiewicz, and Cholesky. Orthogonalization Methods. Condition, Accuracy, and Precision. Comparison of Methods, Measure of Work. Iterative and Gradient Methods. Iterative Methods for Latent Roots and Vectors. Transformation Methods for Latent Roots and Vectors. Notes on Error Analysis for Latent Roots and Vectors.

J.H. Wilkinson (1965). The Algebraic Eigenvalue Problem, Clarendon Press, Oxford, England.
  Theoretical Background. Perturbation Theory. Error Analysis. Solution of Linear Algebraic Equations. Hermitian Matrices. Reduction of a General Matrix to Condensed Form. Eigenvalues of Matrices of Condensed Forms. The LR and QR Algorithms. Iterative Methods.

G.E. Forsythe and C. Moler (1967). Computer Solution of Linear Algebraic Systems, Prentice-Hall, Englewood Cliffs, NJ.
  Reader's Background and Purpose of Book. Vector and Matrix Norms. Diagonal Form of a Matrix Under Orthogonal Equivalence. Proof of the Diagonal Form Theorem. Types of Computational Problems in Linear Algebra. Types of Matrices Encountered in Practical Problems. Sources of Computational Problems of Linear Algebra. Condition of a Linear System. Gaussian Elimination and LU Decomposition. Need for Interchanging Rows. Scaling Equations and Unknowns. The Crout and Doolittle Variants. Iterative Improvement. Computing the Determinant. Nearly Singular Matrices. Algol 60 Program. Fortran, Extended Algol, and PL/I Programs.
  Matrix Inversion. An Example: Hilbert Matrix. Floating Point Round-Off Analysis. Rounding Error in Gaussian Elimination. Convergence of Iterative Improvement. Positive Definite Matrices; Band Matrices. Iterative Methods for Solving Linear Systems. Nonlinear Systems of Equations.

Introductory (General)

A.R. Gourlay and G.A. Watson (1973). Computational Methods for Matrix Eigenproblems, John Wiley & Sons, New York.
  Introduction. Background Theory. Reductions and Transformations. Methods for the Dominant Eigenvalue. Methods for the Subdominant Eigenvalue. Inverse Iteration. Jacobi's Methods. Givens' and Householder's Methods. Eigensystem of a Symmetric Tridiagonal Matrix. The LR and QR Algorithms. Extensions of Jacobi's Method. Extensions of Givens' and Householder's Methods. The QR Algorithm for Hessenberg Matrices. Generalized Eigenvalue Problems. Available Implementations.

G.W. Stewart (1973). Introduction to Matrix Computations, Academic Press, New York.
  Preliminaries. Practicalities. The Direct Solution of Linear Systems. Norms, Limits, and Condition Numbers. The Linear Least Squares Problem. Eigenvalues and Eigenvectors. The QR Algorithm.

R.J. Goult, R.F. Hoskins, J.A. Milner, and M.J. Pratt (1974). Computational Methods in Linear Algebra, John Wiley and Sons, New York.
  Eigenvalues and Eigenvectors. Error Analysis. The Solution of Linear Equations by Elimination and Decomposition Methods. The Solution of Linear Systems of Equations by Iterative Methods. Errors in the Solution Sets of Equations. Computation of Eigenvalues and Eigenvectors. Errors in Eigenvalues and Eigenvectors. Appendix: A Survey of Essential Results from Linear Algebra.

T.F. Coleman and C.F. Van Loan (1988). Handbook for Matrix Computations, SIAM Publications, Philadelphia, PA.
  Fortran 77. The Basic Linear Algebra Subprograms. Linpack. MATLAB.

W.W. Hager (1988). Applied Numerical Linear Algebra, Prentice-Hall, Englewood Cliffs, NJ.
  Introduction. Elimination Schemes. Conditioning. Nonlinear Systems. Least Squares. Eigenproblems. Iterative Methods.

P.G. Ciarlet (1989). Introduction to Numerical Linear Algebra and Optimization, Cambridge University Press.
  A Summary of Results on Matrices. General Results in the Numerical Analysis of Matrices. Sources of Problems in the Numerical Analysis of Matrices. Direct Methods for the Solution of Linear Systems. Iterative Methods for the Solution of Linear Systems. Methods for the Calculation of Eigenvalues and Eigenvectors. A Review of Differential Calculus. Some Applications. General Results on Optimization. Some Algorithms. Introduction to Nonlinear Programming. Linear Programming.

D.S. Watkins (1991). Fundamentals of Matrix Computations, John Wiley and Sons, New York.
  Gaussian Elimination and Its Variants. Sensitivity of Linear Systems; Effects of Roundoff Errors.

P. Gill, W. Murray, and M.H. Wright (1991). Numerical Linear Algebra and Optimization, Vol. 1, Addison-Wesley, Reading, MA.
  Introduction. Linear Algebra Background. Computation and Condition. Linear Equations. Compatible Systems. Linear Least Squares. Linear Constraints I: Linear Programming. The Simplex Method.

A. Jennings and J.J. McKeowen (1992). Matrix Computation (2nd ed), John Wiley and Sons, New York.
  Basic Algebraic and Numerical Concepts. Some Matrix Problems. Computer Implementation. Elimination Methods for Linear Equations. Sparse Matrix Elimination. Some Matrix Eigenvalue Problems. Transformation Methods for Eigenvalue Problems. Sturm Sequence Methods.
  Vector Iterative Methods for Partial Eigensolution. Orthogonalization and Re-Solution Techniques for Linear Equations. Iterative Methods for Linear Equations. Non-linear Equations. Parallel and Vector Computing.

B.N. Datta (1995). Numerical Linear Algebra and Applications, Brooks/Cole Publishing Company, Pacific Grove, California.
  Review of Required Linear Algebra Concepts. Floating Point Numbers and Errors in Computations. Stability of Algorithms and Conditioning of Problems. Numerically Effective Algorithms and Mathematical Software. Some Useful Transformations in Numerical Linear Algebra and Their Applications. Numerical Matrix Eigenvalue Problems. The Generalized Eigenvalue Problem. The Singular Value Decomposition. A Taste of Roundoff Error Analysis.

M.T. Heath (1997). Scientific Computing: An Introductory Survey, McGraw-Hill, New York.
  Scientific Computing. Systems of Linear Equations. Linear Least Squares. Eigenvalues and Singular Values. Nonlinear Equations. Optimization. Interpolation. Numerical Integration and Differentiation. Initial Value Problems for ODEs. Boundary Value Problems for ODEs. Partial Differential Equations. Fast Fourier Transform. Random Numbers and Simulation.

C.F. Van Loan (1997). Introduction to Scientific Computing: A Matrix-Vector Approach Using Matlab, Prentice Hall, Upper Saddle River, NJ.
  Power Tools of the Trade. Polynomial Interpolation. Piecewise Polynomial Interpolation. Numerical Integration. Matrix Computations. Linear Systems. The QR and Cholesky Factorizations. Nonlinear Equations and Optimization. The Initial Value Problem.

Advanced (General)

N.J. Higham (1996). Accuracy and Stability of Numerical Algorithms, SIAM Publications, Philadelphia, PA.
  Principles of Finite Precision Computation. Floating Point Arithmetic. Basics. Summation. Polynomials. Norms. Perturbation Theory for Linear Systems. Triangular Systems. LU Factorization and Linear Equations. Cholesky Factorization. Iterative Refinement. Block LU Factorization. Matrix Inversion. Condition Number Estimation. The Sylvester Equation. Stationary Iterative Methods. Matrix Powers. QR Factorization. The Least Squares Problem. Underdetermined Systems. Vandermonde Systems. Fast Matrix Multiplication. The Fast Fourier Transform and Applications. Automatic Error Analysis. Software Issues in Floating Point Arithmetic. A Gallery of Test Matrices.

J.W. Demmel (1996). Numerical Linear Algebra, SIAM Publications, Philadelphia, PA.
  Introduction. Linear Equation Solving. Linear Least Squares Problems. Nonsymmetric Eigenvalue Problems. The Symmetric Eigenproblem and Singular Value Decomposition. Iterative Methods for Linear Systems and Eigenvalue Problems. Iterative Algorithms for Eigenvalue Problems.

L.N. Trefethen and D. Bau III (1997). Numerical Linear Algebra, SIAM Publications, Philadelphia, PA.
  Matrix-Vector Multiplication. Orthogonal Vectors and Matrices. Norms. The Singular Value Decomposition. More on the SVD. Projectors. QR Factorization. Gram-Schmidt Orthogonalization. Hessenberg/Tridiagonal Form. Rayleigh Quotient, Inverse Iteration. QR Algorithm Without Shifts. QR Algorithm With Shifts. Other Eigenvalue Algorithms. Computing the SVD. Overview of Iterative Methods. The Arnoldi Iteration. How Arnoldi Locates Eigenvalues. GMRES. The Lanczos Iteration. Orthogonal Polynomials and Gauss Quadrature. Conjugate Gradients. Biorthogonalization Methods. Preconditioning. The Definition of Numerical Analysis.

Analytical

F.R. Gantmacher (1959). The Theory of Matrices Vol. 1, Chelsea, New York.
  Matrices and Operations on Matrices. The Algorithm of Gauss and Some of Its Applications.
F.R. Gantmacher (1959). The Theory of Matrices Vol. 2, Chelsea, New York.
  Complex Symmetric, Skew-Symmetric, and Orthogonal Matrices. Singular Pencils of Matrices. Matrices with Nonnegative Elements. Application of the Theory of Matrices to the Investigation of Systems of Linear Differential Equations. The Problem of Routh-Hurwitz and Related Questions.

A. Berman and R.J. Plemmons (1979). Nonnegative Matrices in the Mathematical Sciences, Academic Press, New York. Reprinted with additions in 1994 by SIAM Publications, Philadelphia, PA.
  Matrices Which Leave a Cone Invariant. Nonnegative Matrices. Semigroups of Nonnegative Matrices. Symmetric Nonnegative Matrices. Generalized Inverse-Positivity. M-Matrices. Iterative Methods for Linear Systems. Finite Markov Chains. Input-Output Analysis in Economics. The Linear Complementarity Problem.

G.W. Stewart and J. Sun (1990). Matrix Perturbation Theory, Academic Press, San Diego.
  Preliminaries. Norms and Metrics. Linear Systems and Least Squares Problems. The Perturbation of Eigenvalues. Invariant Subspaces. Generalized Eigenvalue Problems.

R. Horn and C. Johnson (1985). Matrix Analysis, Cambridge University Press, New York.
  Review and Miscellanea. Eigenvalues, Eigenvectors, and Similarity. Unitary Equivalence and Normal Matrices. Canonical Forms. Hermitian and Symmetric Matrices. Norms for Vectors and Matrices. Location and Perturbation of Eigenvalues. Positive Definite Matrices.

R. Horn and C. Johnson (1991). Topics in Matrix Analysis, Cambridge University Press, New York.
  The Field of Values. Stable Matrices and Inertia. Singular Value Inequalities. Matrix Equations and the Kronecker Product. The Hadamard Product. Matrices and Functions.

Linear Equation Problems

D.M. Young (1971). Iterative Solution of Large Linear Systems, Academic Press, New York.
  Introduction. Matrix Preliminaries. Linear Stationary Iterative Methods. Convergence of the Basic Iterative Methods. Eigenvalues of the SOR Method for Consistently Ordered Matrices. Determination of the Optimum Relaxation Parameter. Norms of the SOR Method. The Modified SOR Method: Fixed Parameters. Nonstationary Linear Iterative Methods. The Modified SOR Method: Variable Parameters. Semi-iterative Methods. Extensions of the SOR Theory; Stieltjes Matrices. Generalized Consistently Ordered Matrices. Group Iterative Methods. Symmetric SOR Method and Related Methods. Second Degree Methods. Alternating Direction Implicit Methods. Selection of an Iterative Method.

L.A. Hageman and D.M. Young (1981). Applied Iterative Methods, Academic Press, New York.
  Background on Linear Algebra and Related Topics. Background on Basic Iterative Methods. Polynomial Acceleration. Chebyshev Acceleration. An Adaptive Chebyshev Procedure Using Special Norms. Adaptive Chebyshev Acceleration. Conjugate Gradient Acceleration. Special Methods for Red/Black Partitionings. Adaptive Procedures for the Successive Overrelaxation Method. The Use of Iterative Methods in the Solution of Partial Differential Equations. Case Studies. The Nonsymmetrizable Case.

A. George and J.W-H. Liu (1981). Computer Solution of Large Sparse Positive Definite Systems, Prentice-Hall Inc., Englewood Cliffs, New Jersey.
  Introduction. Fundamentals. Some Graph Theory Notation and Its Use in the Study of Sparse Symmetric Matrices. Band and Envelope Methods. General Sparse Methods. Quotient Tree Methods for Finite Element and Finite Difference Problems. One-Way Dissection Methods for Finite Element Problems. Nested Dissection Methods. Numerical Experiments.
S. Pissanetsky (1984). Sparse Matrix Technology, Academic Press, New York.
  Fundamentals. Linear Algebraic Equations. Numerical Errors in Gaussian Elimination. Ordering for Gauss Elimination: Symmetric Matrices. Ordering for Gauss Elimination: General Matrices. Sparse Eigenanalysis. Sparse Matrix Algebra. Connectivity and Nodal Assembly. General Purpose Algorithms.

I.S. Duff, A.M. Erisman, and J.K. Reid (1986). Direct Methods for Sparse Matrices, Oxford University Press, New York.
  Introduction. Sparse Matrices: Storage Schemes and Simple Operations. Gaussian Elimination for Dense Matrices: The Algebraic Problem. Gaussian Elimination for Dense Matrices: Numerical Considerations. Gaussian Elimination for Sparse Matrices: An Introduction. Reduction to Block Triangular Form. Local Pivotal Strategies for Sparse Matrices. Ordering Sparse Matrices to Special Forms. Implementing Gaussian Elimination: Analyse with Numerical Values. Implementing Gaussian Elimination with Symbolic Analyse. Partitioning, Matrix Modification, and Tearing. Other Sparsity-Oriented Issues.

R. Barrett, M. Berry, T.F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst (1993). Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, SIAM Publications, Philadelphia, PA.
  Introduction. Why Use Templates? What Methods are Covered? Iterative Methods. Stationary Methods. Nonstationary Iterative Methods. Survey of Recent Krylov Methods. Jacobi, Incomplete, SSOR, and Polynomial Preconditioners. Complex Systems. Stopping Criteria. Data Structures. Parallelism. The Lanczos Connection. Block Iterative Methods. Reduced System Preconditioning. Domain Decomposition Methods. Multigrid Methods. Row Projection Methods.

W. Hackbusch (1994). Iterative Solution of Large Sparse Systems of Equations, Springer-Verlag, New York.
  Introduction. Recapitulation of Linear Algebra. Iterative Methods. Methods of Jacobi and Gauss-Seidel and SOR Iteration in the Positive Definite Case. Analysis in the 2-Cyclic Case. Analysis for M-Matrices. Semi-Iterative Methods. Transformations, Secondary Iterations, Incomplete Triangular Decompositions. Conjugate Gradient Methods. Multi-Grid Methods. Domain Decomposition Methods.

O. Axelsson (1994). Iterative Solution Methods, Cambridge University Press.
  Direct Solution Methods. Theory of Matrix Eigenvalues. Positive Definite Matrices; Eigenvalue Problems. Reducible and Irreducible Matrices. Conjugate Gradient and Lanczos-Type Methods. Generalized Conjugate Gradient Methods. The Rate of Convergence of the Conjugate Gradient Method.

Y. Saad (1996). Iterative Methods for Sparse Linear Systems, PWS Publishing Co., Boston.
  Background in Linear Algebra. Discretization of PDEs. Sparse Matrices. Basic Iterative Methods. Projection Methods. Krylov Subspace Methods - Part I. Krylov Subspace Methods - Part II. Methods Related to the Normal Equations. Preconditioned Iterations. Preconditioning Techniques. Parallel Implementations. Parallel Preconditioners. Domain Decomposition Methods.

Linear Fitting Problems

C.L. Lawson and R.J. Hanson (1974). Solving Least Squares Problems, Prentice-Hall, Englewood Cliffs, NJ. Reprinted with a detailed "new developments" appendix in 1996 by SIAM Publications, Philadelphia, PA.
  Introduction. Analysis of the Least Squares Problem. Orthogonal Decomposition by Certain Elementary Orthogonal Transformations. Computing the Solution for the Overdetermined or Exactly Determined Full Rank Problem. Computation of the Covariance Matrix of the Solution Parameters. Computing the Solution for the Underdetermined Full Rank Problem.
  Computing the Solution for Problem LS with Possibly Deficient Pseudorank. Analysis of Computing Errors for Householder Transformations. Analysis of Computing Errors for the Problem LS. Analysis of Computing Errors for the Problem LS Using Mixed Precision Arithmetic. Computation of the Singular Value Decomposition and the Solution of Problem LS. Other Methods for Least Squares Problems. Linear Least Squares with Linear Equality Constraints Using a Basis of the Null Space. Linear Least Squares with Linear Equality Constraints by Direct Elimination. Linear Least Squares with Linear Equality Constraints by Weighting. Linear Least Squares with Linear Inequality Constraints.

R.W. Farebrother (1987). Linear Least Squares Computations, Marcel Dekker, New York.
  Canonical Expressions for the Least Squares Estimators and Test Statistics. Traditional Expressions for the Least Squares Updating Formulas and Test Statistics. Least Squares Estimation Subject to Linear Constraints.

S. Van Huffel and J. Vandewalle (1991). The Total Least Squares Problem: Computational Aspects and Analysis, SIAM Publications, Philadelphia, PA.
  Introduction. Basic Principles of the Total Least Squares Problem. Extensions of the Basic Total Least Squares Problem. Direct Speed Improvement of the Total Least Squares Computations. Iterative Speed Improvement for Solving Slowly Varying Total Least Squares Problems. Algebraic Connections Between Total Least Squares and Least Squares Problems. Sensitivity Analysis of Total Least Squares and Least Squares Problems in the Presence of Errors in All Data. Statistical Properties of the Total Least Squares Problem.

Å. Björck (1996). Numerical Methods for Least Squares Problems, SIAM Publications, Philadelphia, PA.
  Mathematical and Statistical Properties of Least Squares Solutions. Basic Numerical Methods. Modified Least Squares Problems. Generalized Least Squares Problems. Constrained Least Squares Problems. Direct Methods for Sparse Least Squares Problems. Iterative Methods for Least Squares Problems. Least Squares with Special Bases. Nonlinear Least Squares Problems.

Eigenvalue Problems

B.N. Parlett (1980). The Symmetric Eigenvalue Problem, Prentice-Hall, Englewood Cliffs, NJ.
  Basic Facts about Self-Adjoint Matrices. Tasks, Obstacles, and Aids. Counting Eigenvalues. Subspace Iteration. The General Linear Eigenvalue Problem.

J. Cullum and R.A. Willoughby (1985a). Lanczos Algorithms for Large Symmetric Eigenvalue Computations, Vol. I Theory, Birkhäuser, Boston.
  Preliminaries: Notation and Definitions. Real Symmetric Problems. Lanczos Procedures, Real Symmetric Problems. Tridiagonal Matrices. Lanczos Procedures with No Reorthogonalization for Symmetric Problems. Real Rectangular Matrices. Nondefective Complex Symmetric Matrices. Block Lanczos Procedures, Real Symmetric Matrices.

J. Cullum and R.A. Willoughby (1985b). Lanczos Algorithms for Large Symmetric Eigenvalue Computations, Vol. II Programs, Birkhäuser, Boston.
  Lanczos Procedures. Real Symmetric Matrices. Hermitian Matrices. Factored Inverses of Real Symmetric Matrices. Real Symmetric Generalized Problems. Real Rectangular Problems. Nondefective Complex Symmetric Matrices. Real Symmetric Matrices, Block Lanczos Code. Factored Inverses, Real Symmetric Matrices, Block Lanczos Code.

Y. Saad (1992). Numerical Methods for Large Eigenvalue Problems: Theory and Algorithms, John Wiley and Sons, New York.
  Background in Matrix Theory and Linear Algebra. Perturbation Theory and Error Analysis. The Tools of Spectral Approximation. Subspace Iteration. Krylov Subspace Methods.
  Acceleration Techniques and Hybrid Methods. Preconditioning Techniques. Non-Standard Eigenvalue Problems. Origins of Matrix Eigenvalue Problems.

F. Chatelin (1993). Eigenvalues of Matrices, John Wiley and Sons, New York.
  Supplements from Linear Algebra. Elements of Spectral Theory. Why Compute Eigenvalues? Error Analysis. Foundations of Methods for Computing Eigenvalues. Numerical Methods for Large Matrices. Chebyshev's Iterative Methods.

High Performance

W. Schönauer (1987). Scientific Computing on Vector Computers, North Holland, Amsterdam.
  Introduction. The First Commercially Significant Vector Computer. The Arithmetic Performance of the First Commercially Significant Vector Computer. Hockney's n1/2 and Timing Formulae. Fortran and Autovectorization. Behavior of Programs. Some Basic Algorithms. Recurrences. Matrix Operations. Systems of Linear Equations with Full Matrices. Tridiagonal Linear Systems. The Iterative Solution of Linear Equations. Special Applications. The Fujitsu VPs and Other Japanese Vector Computers. The Cray-2. The IBM VF and Other Vector Processors. The Convex C1.

R.W. Hockney and C.R. Jesshope (1988). Parallel Computers 2, Adam Hilger, Bristol and Philadelphia.
  Introduction. Pipelined Computers. Processor Arrays. Parallel Languages. Parallel Algorithms. Future Developments.

J.J. Modi (1988). Parallel Algorithms and Matrix Computation, Oxford University Press, Oxford.
  General Principles of Parallel Computing. Parallel Techniques and Algorithms. Parallel Sorting Algorithms. Solution of a System of Linear Algebraic Equations. The Symmetric Eigenvalue Problem: Jacobi's Method. QR Factorization. Singular Value Decomposition and Related Problems.

J. Ortega (1988). Introduction to Parallel and Vector Solution of Linear Systems, Plenum Press, New York.
  Introduction. Direct Methods for Linear Equations. Iterative Methods for Linear Equations.

J. Dongarra, I. Duff, D. Sorensen, and H. van der Vorst (1990). Solving Linear Systems on Vector and Shared Memory Computers, SIAM Publications, Philadelphia, PA.
  Vector and Parallel Processing. Overview of Current High-Performance Computers. Implementation Details and Overhead. Performance Analysis, Modeling, and Measurement. Building Blocks in Linear Algebra. Direct Solution of Sparse Linear Systems. Iterative Solution of Sparse Linear Systems.

Y. Robert (1990). The Impact of Vector and Parallel Architectures on the Gaussian Elimination Algorithm, Halsted Press, New York.
  Introduction. Vector and Parallel Architectures. Vector Multiprocessor Computing. Hypercube Computing. Systolic Computing. Task Graph Scheduling. Analysis of Distributed Algorithms. Design Methodologies.

G.H. Golub and J.M. Ortega (1993). Scientific Computing: An Introduction with Parallel Computing, Academic Press, Boston.
  The World of Scientific Computing. Linear Algebra. Parallel and Vector Computing. Polynomial Approximation. Continuous Problems Solved Discretely. Direct Solution of Linear Equations. Parallel Direct Methods. Iterative Methods. Conjugate Gradient-Type Methods.

Edited Volumes

D.J. Rose and R.A. Willoughby, eds. (1972). Sparse Matrices and Their Applications, Plenum Press, New York.

J.R. Bunch and D.J. Rose, eds. (1976). Sparse Matrix Computations, Academic Press, New York.

I.S. Duff and G.W. Stewart, eds. (1979). Sparse Matrix Proceedings, 1978, SIAM Publications, Philadelphia, PA.

I.S. Duff, ed. (1981). Sparse Matrices and Their Uses, Academic Press, New York.
Å. Björck, R.J. Plemmons, and H. Schneider, eds. (1981). Large-Scale Matrix Problems, North-Holland, New York.

G. Rodrigue, ed. (1982). Parallel Computations, Academic Press, New York.

B. Kågström and A. Ruhe, eds. (1983). Matrix Pencils, Proc. Pite Havsbad, 1982, Lecture Notes in Mathematics 973, Springer-Verlag, New York and Berlin.

J. Cullum and R.A. Willoughby, eds. (1986). Large Scale Eigenvalue Problems, North-Holland, Amsterdam.

A. Wouk, ed. (1986). New Computing Environments: Parallel, Vector, and Systolic, SIAM Publications, Philadelphia, PA.

M.T. Heath, ed. (1986). Proceedings of First SIAM Conference on Hypercube Multiprocessors, SIAM Publications, Philadelphia, PA.

M.T. Heath, ed. (1987). Hypercube Multiprocessors, SIAM Publications, Philadelphia, PA.

G. Fox, ed. (1988). The Third Conference on Hypercube Concurrent Computers and Applications, Vol. II - Applications, ACM Press, New York.

M.H. Schultz, ed. (1988). Numerical Algorithms for Modern Parallel Computer Architectures, IMA Volumes in Mathematics and Its Applications, Number 13, Springer-Verlag, Berlin.

E.F. Deprettere, ed. (1988). SVD and Signal Processing, Elsevier, Amsterdam.

B.N. Datta, C.R. Johnson, M.A. Kaashoek, R. Plemmons, and E.D. Sontag, eds. (1988). Linear Algebra in Signals, Systems, and Control, SIAM Publications, Philadelphia, PA.

J. Dongarra, I. Duff, P. Gaffney, and S. McKee, eds. (1989). Vector and Parallel Computing, Ellis Horwood, Chichester, England.

O. Axelsson, ed. (1989). "Preconditioned Conjugate Gradient Methods," BIT 29.

K. Gallivan, M. Heath, E. Ng, J. Ortega, B. Peyton, R. Plemmons, C. Romine, A. Sameh, and R. Voigt (1990). Parallel Algorithms for Matrix Computations, SIAM Publications, Philadelphia, PA.

G.H. Golub and P. Van Dooren, eds. (1991). Numerical Linear Algebra, Digital Signal Processing, and Parallel Algorithms, Springer-Verlag, Berlin.

R. Vaccaro, ed. (1991). SVD and Signal Processing II: Algorithms, Analysis, and Applications, Elsevier, Amsterdam.

R. Beauwens and P. de Groen, eds. (1992). Iterative Methods in Linear Algebra, Elsevier (North-Holland), Amsterdam.

R.J. Plemmons and C.D. Meyer, eds. (1993). Linear Algebra, Markov Chains, and Queueing Models, Springer-Verlag, New York.

M.S. Moonen, G.H. Golub, and B.L.R. de Moor, eds. (1993). Linear Algebra for Large Scale and Real-Time Applications, Kluwer, Dordrecht, The Netherlands.

J.D. Brown, M.T. Chu, D.C. Ellison, and R.J. Plemmons, eds. (1994). Proceedings of the Cornelius Lanczos International Centenary Conference, SIAM Publications, Philadelphia, PA.

R.V. Patel, A.J. Laub, and P.M. Van Dooren, eds. (1994). Numerical Linear Algebra Techniques for Systems and Control, IEEE Press, Piscataway, New Jersey.

J. Lewis, ed. (1994). Proceedings of the Fifth SIAM Conference on Applied Linear Algebra, SIAM Publications, Philadelphia, PA.

A. Bojanczyk and G. Cybenko, eds. (1995). Linear Algebra for Signal Processing, IMA Volumes in Mathematics and Its Applications, Springer-Verlag, New York.

M. Moonen and B. De Moor, eds. (1995). SVD and Signal Processing III: Algorithms, Analysis, and Applications, Elsevier, Amsterdam.

Chapter 1
Matrix Multiplication Problems

§1.1  Basic Algorithms and Notation
§1.2  Exploiting Structure
§1.3  Block Matrices and Algorithms
§1.4  Vectorization and Re-Use Issues

The proper study of matrix computations begins with the study of the matrix-matrix multiplication problem. Although this problem is simple mathematically, it is very rich from the computational point of view.
We begin in §1.1 by looking at the several ways that the matrix multiplication problem can be organized. The "language" of partitioned matrices is established and used to characterize several linear algebraic "levels" of computation.

If a matrix has structure, then it is usually possible to exploit it. For example, a symmetric matrix can be stored in half the space as a general matrix. A matrix-vector product that involves a matrix with many zero entries may require much less time to execute than a full matrix times a vector. These matters are discussed in §1.2.

In §1.3 block matrix notation is established. A block matrix is a matrix with matrix entries. This concept is very important from the standpoint of both theory and practice. On the theoretical side, block matrix notation allows us to prove important matrix factorizations very succinctly. These factorizations are the cornerstone of numerical linear algebra. From the computational point of view, block algorithms are important because they are rich in matrix multiplication, the operation of choice for many new high performance computer architectures.

These new architectures require the algorithm designer to pay as much attention to memory traffic as to the actual amount of arithmetic. This aspect of scientific computation is illustrated in §1.4 where the critical issues of vector pipeline computing are discussed: stride, vector length, the number of vector loads and stores, and the level of vector re-use.

Before You Begin

It is important to be familiar with the MATLAB language. See the texts by Pratap (1995) and Van Loan (1996). A richer introduction to high performance matrix computations is given in Dongarra, Duff, Sorensen, and van der Vorst (1991). This chapter's LAPACK connections include

LAPACK: Some General Operations
   _SCAL    x = ax                             Vector scale
   _DOT     s = x^T y                          Dot product
   _AXPY    y = ax + y                         Saxpy
   _GEMV    y = αAx + βy                       Matrix-vector multiplication
   _GER     A = αxy^T + A                      Rank-1 update
   _GEMM    C = αAB + βC                       Matrix multiplication

LAPACK: Some Symmetric Operations
   _SYMV    y = αAx + βy                       Matrix-vector multiplication
   _SPMV    y = αAx + βy                       Matrix-vector multiplication (packed)
   _SYR     A = αxx^T + A                      Rank-1 update
   _SYR2    A = αxy^T + αyx^T + A              Rank-2 update
   _SYRK    C = αAA^T + βC                     Rank-k update
   _SYR2K   C = αAB^T + αBA^T + βC             Rank-2k update
   _SYMM    C = αAB + βC or αBA + βC           Symmetric/General product

LAPACK: Some Band/Triangular Operations
   _GBMV    y = αAx + βy                       General band
   _SBMV    y = αAx + βy                       Symmetric band
   _TRMV    x = Ax                             Triangular
   _TPMV    x = Ax                             Triangular packed
   _TRMM    B = αAB (or αBA)                   Triangular/General product

1.1 Basic Algorithms and Notation

Matrix computations are built upon a hierarchy of linear algebraic operations. Dot products involve the scalar operations of addition and multiplication. Matrix-vector multiplication is made up of dot products. Matrix-matrix multiplication amounts to a collection of matrix-vector products. All of these operations can be described in algorithmic form or in the language of linear algebra. Our primary objective in this section is to show how these two styles of expression complement each other. Along the way we pick up notation and acquaint the reader with the kind of thinking that underpins the matrix computation area.
The discussion revolves around the matrix multiplication problem, a computation that can be organized in several ways.

1.1.1 Matrix Notation

Let R denote the set of real numbers. We denote the vector space of all m-by-n real matrices by R^{m×n}:

   A ∈ R^{m×n}  <=>  A = (a_ij),  a_ij ∈ R,  i = 1:m, j = 1:n.

If a capital letter is used to denote a matrix (e.g., A, B, Δ), then the corresponding lower case letter with subscript ij refers to the (i,j) entry (e.g., a_ij, b_ij, δ_ij). As appropriate, we also use the notation [A]_ij and A(i,j) to designate the matrix elements.

1.1.2 Matrix Operations

Basic matrix operations include transposition (R^{m×n} → R^{n×m}),

   C = A^T  =>  c_ij = a_ji,

addition (R^{m×n} × R^{m×n} → R^{m×n}),

   C = A + B  =>  c_ij = a_ij + b_ij,

scalar-matrix multiplication (R × R^{m×n} → R^{m×n}),

   C = αA  =>  c_ij = α a_ij,

and matrix-matrix multiplication (R^{m×p} × R^{p×n} → R^{m×n}),

   C = AB  =>  c_ij = Σ_{k=1}^{p} a_ik b_kj.

These are the building blocks of matrix computations.

1.1.3 Vector Notation

Let R^n denote the vector space of real n-vectors:

   x ∈ R^n  <=>  x = [x_1; ...; x_n],  x_i ∈ R.

We refer to x_i as the ith component of x. Depending upon context, the alternative notations [x]_i and x(i) are sometimes used.

Notice that we are identifying R^n with R^{n×1} and so the members of R^n are column vectors. On the other hand, the elements of R^{1×n} are row vectors:

   x ∈ R^{1×n}  <=>  x = (x_1, ..., x_n).

If x is a column vector, then y = x^T is a row vector.

1.1.4 Vector Operations

Assume a ∈ R, x ∈ R^n, and y ∈ R^n. Basic vector operations include scalar-vector multiplication,

   z = ax  =>  z_i = a x_i,

vector addition,

   z = x + y  =>  z_i = x_i + y_i,

the dot product (or inner product),

   c = x^T y  =>  c = Σ_{i=1}^{n} x_i y_i,

and vector multiply (or the Hadamard product),

   z = x .* y  =>  z_i = x_i y_i.

Another very important operation which we write in "update form" is the saxpy:

   y = ax + y  =>  y_i = a x_i + y_i.

Here, the symbol "=" is being used to denote assignment, not mathematical equality. The vector y is being updated. The name "saxpy" is used in LAPACK, a software package that implements many of the algorithms in this book. One can think of "saxpy" as a mnemonic for "scalar a x plus y."

1.1.5 The Computation of Dot Products and Saxpys

We have chosen to express algorithms in a stylized version of the MATLAB language. MATLAB is a powerful interactive system that is ideal for matrix computation work. We gradually introduce our stylized MATLAB notation in this chapter beginning with an algorithm for computing dot products.

Algorithm 1.1.1 (Dot Product) If x, y ∈ R^n, then this algorithm computes their dot product c = x^T y.

   c = 0
   for i = 1:n
       c = c + x(i)y(i)
   end

The dot product of two n-vectors involves n multiplications and n additions. It is an "O(n)" operation, meaning that the amount of work is linear in the dimension. The saxpy computation is also an O(n) operation, but it returns a vector instead of a scalar.

Algorithm 1.1.2 (Saxpy) If x, y ∈ R^n and a ∈ R, then this algorithm overwrites y with ax + y.

   for i = 1:n
       y(i) = ax(i) + y(i)
   end

It must be stressed that the algorithms in this book are encapsulations of critical computational ideas and not "production codes."
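The stylized loops above carry over almost verbatim into executable MATLAB. The following fragment is a minimal sketch of that translation (the function names dotprod and saxpy are ours, each would live in its own M-file); in practice one would simply write c = x'*y and y = a*x + y and let the built-in vectorized operations do the work.

   function c = dotprod(x,y)
   % DOTPROD  Dot product c = x'*y of two n-vectors (cf. Algorithm 1.1.1).
   n = length(x);
   c = 0;
   for i = 1:n
       c = c + x(i)*y(i);
   end

   function y = saxpy(a,x,y)
   % SAXPY  Overwrite y with a*x + y (cf. Algorithm 1.1.2).
   n = length(x);
   for i = 1:n
       y(i) = a*x(i) + y(i);
   end

For example, c = dotprod(x,y) and y = saxpy(2,x,y) agree with x'*y and 2*x + y up to roundoff.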
1.1.6 Matrix-Vector Multiplication and the Gaxpy

Suppose A ∈ R^{m×n} and that we wish to compute the update

   y = Ax + y

where x ∈ R^n and y ∈ R^m are given. This generalized saxpy operation is referred to as a gaxpy. A standard way that this computation proceeds is to update the components one at a time:

   y_i = Σ_{j=1}^{n} a_ij x_j + y_i,   i = 1:m.

This gives the following algorithm.

Algorithm 1.1.3 (Gaxpy: Row Version) If A ∈ R^{m×n}, x ∈ R^n, and y ∈ R^m, then this algorithm overwrites y with Ax + y.

   for i = 1:m
       for j = 1:n
           y(i) = A(i,j)x(j) + y(i)
       end
   end

An alternative algorithm results if we regard Ax as a linear combination of A's columns, e.g.,

   [1 2; 3 4; 5 6][7; 8] = 7[1; 3; 5] + 8[2; 4; 6] = [23; 53; 83].

Algorithm 1.1.4 (Gaxpy: Column Version) If A ∈ R^{m×n}, x ∈ R^n, and y ∈ R^m, then this algorithm overwrites y with Ax + y.

   for j = 1:n
       for i = 1:m
           y(i) = A(i,j)x(j) + y(i)
       end
   end

Note that the inner loop in either gaxpy algorithm carries out a saxpy operation. The column version was derived by rethinking what matrix-vector multiplication "means" at the vector level, but it could also have been obtained simply by interchanging the order of the loops in the row version. In matrix computations, it is important to relate loop interchanges to the underlying linear algebra.

1.1.7 Partitioning a Matrix into Rows and Columns

Algorithms 1.1.3 and 1.1.4 access the data in A by row and by column respectively. To highlight these orientations more clearly we introduce the language of partitioned matrices. From the row point of view, a matrix is a stack of row vectors:

   A ∈ R^{m×n}  <=>  A = [r_1^T; ...; r_m^T],   r_k ∈ R^n.      (1.1.1)

This is called a row partition of A. Thus, if we row partition

   [1 2; 3 4; 5 6],

then we are choosing to think of A as a collection of rows with

   r_1^T = [1 2],  r_2^T = [3 4],  and  r_3^T = [5 6].

With the row partitioning (1.1.1), Algorithm 1.1.3 can be expressed as follows:

   for i = 1:m
       y(i) = r_i^T x + y(i)
   end

Alternatively, a matrix is a collection of column vectors:

   A ∈ R^{m×n}  <=>  A = [c_1, ..., c_n],   c_k ∈ R^m.      (1.1.2)

We refer to this as a column partition of A. In the 3-by-2 example above, we thus would set c_1 and c_2 to be the first and second columns of A respectively:

   c_1 = [1; 3; 5],   c_2 = [2; 4; 6].

With (1.1.2) we see that Algorithm 1.1.4 is a saxpy procedure that accesses A by columns:

   for j = 1:n
       y = x_j c_j + y
   end

In this context appreciate y as a running vector sum that undergoes repeated saxpy updates.

1.1.8 The Colon Notation

A handy way to specify a column or row of a matrix is with the "colon" notation. If A ∈ R^{m×n}, then A(k,:) designates the kth row, i.e.,

   A(k,:) = [a_k1, ..., a_kn].

The kth column is specified by

   A(:,k) = [a_1k; ...; a_mk].

With these conventions we can rewrite Algorithms 1.1.3 and 1.1.4 as

   for i = 1:m
       y(i) = A(i,:)x + y(i)
   end

and

   for j = 1:n
       y = x(j)A(:,j) + y
   end

respectively. With the colon notation we are able to suppress iteration details. This frees us to think at the vector level and focus on larger computational issues.

1.1.9 The Outer Product Update

As a preliminary application of the colon notation, we use it to understand the outer product update

   A = A + xy^T,   A ∈ R^{m×n}, x ∈ R^m, y ∈ R^n.

The outer product operation xy^T "looks funny" but is perfectly legal, e.g.,

   [1; 2; 3][4 5] = [4 5; 8 10; 12 15].

This is because xy^T is the product of two "skinny" matrices and the number of columns in the left matrix x equals the number of rows in the right matrix y^T. The entries in the outer product update are prescribed by

   for i = 1:m
       for j = 1:n
           a_ij = a_ij + x_i y_j
       end
   end

The mission of the j loop is to add a multiple of y^T to the ith row of A, i.e.,

   for i = 1:m
       A(i,:) = A(i,:) + x(i)y^T
   end

On the other hand, if we make the i-loop the inner loop, then its task is to add a multiple of x to the jth column of A:

   for j = 1:n
       A(:,j) = A(:,j) + y(j)x
   end

Note that both outer product algorithms amount to a set of saxpy updates.
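Here is a short MATLAB sketch of the row-oriented outer product update just described; the function name outerupdate is ours, and swapping the loop for "for j = 1:n, A(:,j) = A(:,j) + y(j)*x, end" gives the column-oriented version. Both agree with the one-line statement A = A + x*y'.

   function A = outerupdate(A,x,y)
   % OUTERUPDATE  Rank-1 update A = A + x*y' via row-oriented saxpys.
   [m,n] = size(A);
   for i = 1:m
       % Add x(i) times the row vector y' to the i-th row of A.
       A(i,:) = A(i,:) + x(i)*y';
   end

A quick check: with A = zeros(3,2), x = [1;2;3], and y = [4;5], outerupdate(A,x,y) returns [4 5; 8 10; 12 15], matching the small example above.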
1.1.10 Matrix-Matrix Multiplication

Consider the 2-by-2 matrix-matrix multiplication AB. In the dot product formulation each entry is computed as a dot product:

   [1 2; 3 4][5 6; 7 8] = [1·5+2·7  1·6+2·8; 3·5+4·7  3·6+4·8].

In the saxpy version each column in the product is regarded as a linear combination of columns of A:

   [1 2; 3 4][5 6; 7 8] = [5[1;3] + 7[2;4],  6[1;3] + 8[2;4]].

Finally, in the outer product version, the result is regarded as a sum of outer products:

   [1 2; 3 4][5 6; 7 8] = [1;3][5 6] + [2;4][7 8].

Although equivalent mathematically, it turns out that these versions of matrix multiplication can have very different levels of performance because of their memory traffic properties. This matter is pursued in §1.4. For now, it is worth detailing the various approaches because it gives us a chance to review notation and to practice thinking at different linear algebraic levels.

1.1.11 Scalar-Level Specifications

To fix the discussion we focus on the following matrix multiplication update:

   C = AB + C,   A ∈ R^{m×p}, B ∈ R^{p×n}, C ∈ R^{m×n}.

The starting point is the familiar triply-nested loop algorithm:

Algorithm 1.1.5 (Matrix Multiplication: ijk Variant) If A ∈ R^{m×p}, B ∈ R^{p×n}, and C ∈ R^{m×n} are given, then this algorithm overwrites C with AB + C.

   for i = 1:m
       for j = 1:n
           for k = 1:p
               C(i,j) = A(i,k)B(k,j) + C(i,j)
           end
       end
   end

This is the "ijk variant" because we identify the rows of C (and A) with i, the columns of C (and B) with j, and the summation index with k. We consider the update C = AB + C instead of just C = AB for two reasons. We do not have to bother with C = 0 initializations, and updates of the form C = AB + C arise more frequently in practice.

The three loops in the matrix multiplication update can be arbitrarily ordered giving 3! = 6 variations. Thus,

   for j = 1:n
       for k = 1:p
           for i = 1:m
               C(i,j) = A(i,k)B(k,j) + C(i,j)
           end
       end
   end

is the jki variant. Each of the six possibilities (ijk, jik, ikj, jki, kij, kji) features an inner loop operation (dot product or saxpy) and has its own pattern of data flow. For example, in the ijk variant, the inner loop oversees a dot product that requires access to a row of A and a column of B. The jki variant involves a saxpy that requires access to a column of C and a column of A. These attributes are summarized in Table 1.1.1 along with an interpretation of what is going on when the middle and inner loop are considered together. Each variant involves the same amount of floating point arithmetic, but accesses the A, B, and C data differently.

   Loop    Inner    Middle                   Inner Loop
   Order   Loop     Loop                     Data Access
   ijk     dot      vector x matrix          A by row, B by column
   jik     dot      matrix x vector          A by row, B by column
   ikj     saxpy    row gaxpy                B by row, C by row
   jki     saxpy    column gaxpy             A by column, C by column
   kij     saxpy    row outer product        B by row, C by row
   kji     saxpy    column outer product     A by column, C by column

   TABLE 1.1.1. Matrix Multiplication: Loop Orderings and Properties

1.1.12 A Dot Product Formulation

The usual matrix multiplication procedure regards AB as an array of dot products to be computed one at a time in left-to-right, top-to-bottom order. This is the idea behind Algorithm 1.1.5. Using the colon notation we can highlight this dot-product formulation:

Algorithm 1.1.6 (Matrix Multiplication: Dot Product Version) If A ∈ R^{m×p}, B ∈ R^{p×n}, and C ∈ R^{m×n} are given, then this algorithm overwrites C with AB + C.

   for i = 1:m
       for j = 1:n
           C(i,j) = A(i,:)B(:,j) + C(i,j)
       end
   end

In the language of partitioned matrices, if

   A = [a_1^T; ...; a_m^T],  a_i ∈ R^p,

and

   B = [b_1, ..., b_n],  b_j ∈ R^p,

then Algorithm 1.1.6 has this interpretation:

   for i = 1:m
       for j = 1:n
           c_ij = a_i^T b_j + c_ij
       end
   end

Note that the "mission" of the j-loop is to compute the ith row of the update. To emphasize this we could write

   for i = 1:m
       c_i^T = a_i^T B + c_i^T
   end

where

   C = [c_1^T; ...; c_m^T]

is a row partitioning of C. To say the same thing with the colon notation we write

   for i = 1:m
       C(i,:) = A(i,:)B + C(i,:)
   end

Either way we see that the inner two loops of the ijk variant define a row-oriented gaxpy operation.
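The following short MATLAB script (ours, purely illustrative, with arbitrary dimensions) exercises two of the variants in Table 1.1.1 and checks them against the built-in product. The ijk body is exactly the colon-notation statement of Algorithm 1.1.6, and the jki body is the column saxpy that reappears in Algorithm 1.1.7 below.

   % Compare the ijk (dot product) and jki (column saxpy) variants with A*B + C.
   m = 5; p = 4; n = 3;
   A = rand(m,p); B = rand(p,n); C0 = rand(m,n);
   C1 = C0; C2 = C0;
   for i = 1:m                      % ijk: each C1(i,j) is a dot product
       for j = 1:n
           C1(i,j) = A(i,:)*B(:,j) + C1(i,j);
       end
   end
   for j = 1:n                      % jki: each pass is a column saxpy
       for k = 1:p
           C2(:,j) = A(:,k)*B(k,j) + C2(:,j);
       end
   end
   disp([norm(C1 - (A*B+C0)), norm(C2 - (A*B+C0))])   % both at roundoff level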
1.1.13 A Saxpy Formulation

Suppose A and C are column-partitioned as follows:

   A = [a_1, ..., a_p],  a_k ∈ R^m,
   C = [c_1, ..., c_n],  c_j ∈ R^m.

By comparing jth columns in C = AB + C we see that

   c_j = Σ_{k=1}^{p} a_k b_kj + c_j,   j = 1:n.

These vector sums can be put together with a sequence of saxpy updates.

Algorithm 1.1.7 (Matrix Multiplication: Saxpy Version) If the matrices A ∈ R^{m×p}, B ∈ R^{p×n}, and C ∈ R^{m×n} are given, then this algorithm overwrites C with AB + C.

   for j = 1:n
       for k = 1:p
           C(:,j) = A(:,k)B(k,j) + C(:,j)
       end
   end

Note that the k-loop oversees a gaxpy operation:

   for j = 1:n
       C(:,j) = A B(:,j) + C(:,j)
   end

1.1.14 An Outer Product Formulation

Consider the kji variant of Algorithm 1.1.5:

   for k = 1:p
       for j = 1:n
           for i = 1:m
               C(i,j) = A(i,k)B(k,j) + C(i,j)
           end
       end
   end

The inner two loops oversee the outer product update

   C = a_k b_k^T + C

where

   A = [a_1, ..., a_p]  and  B = [b_1^T; ...; b_p^T]      (1.1.3)

with a_k ∈ R^m and b_k ∈ R^n. We therefore obtain

Algorithm 1.1.8 (Matrix Multiplication: Outer Product Version) If A ∈ R^{m×p}, B ∈ R^{p×n}, and C ∈ R^{m×n} are given, then this algorithm overwrites C with AB + C.

   for k = 1:p
       C = A(:,k)B(k,:) + C
   end

This implementation revolves around the fact that AB is the sum of p outer products.

1.1.15 The Notion of "Level"

The dot product and saxpy operations are examples of "level-1" operations. Level-1 operations involve an amount of data and an amount of arithmetic that is linear in the dimension of the operation. An m-by-n outer product update or gaxpy operation involves a quadratic amount of data (O(mn)) and a quadratic amount of work (O(mn)). They are examples of "level-2" operations.

The matrix update C = AB + C is a "level-3" operation. Level-3 operations involve a quadratic amount of data and a cubic amount of work. If A, B, and C are n-by-n matrices, then C = AB + C involves O(n^2) matrix entries and O(n^3) arithmetic operations.

The design of matrix algorithms that are rich in high-level linear algebra operations is a recurring theme in the book. For example, a high performance linear equation solver may require a level-3 organization of Gaussian elimination. This requires some algorithmic rethinking because that method is usually specified in level-1 terms, e.g., "multiply row 1 by a scalar and add the result to row 2."

1.1.16 A Note on Matrix Equations

In striving to understand matrix multiplication via outer products, we essentially established the matrix equation

   AB = Σ_{k=1}^{p} a_k b_k^T,

where the a_k and b_k are defined by the partitionings in (1.1.3).

Numerous matrix equations are developed in subsequent chapters. Sometimes they are established algorithmically like the above outer product expansion and other times they are proved at the ij-component level. As an example of the latter, we prove an important result that characterizes transposes of products.

Theorem 1.1.1 If A ∈ R^{m×p} and B ∈ R^{p×n}, then (AB)^T = B^T A^T.

Proof. If C = (AB)^T, then

   c_ij = [(AB)^T]_ij = [AB]_ji = Σ_{k=1}^{p} a_jk b_ki.

On the other hand, if D = B^T A^T, then

   d_ij = [B^T A^T]_ij = Σ_{k=1}^{p} [B^T]_ik [A^T]_kj = Σ_{k=1}^{p} b_ki a_jk.

Since c_ij = d_ij for all i and j, it follows that C = D.

Scalar-level proofs such as this one are usually not very insightful. However, they are sometimes the only way to proceed.
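A few lines of MATLAB confirm both the outer product expansion and Theorem 1.1.1 numerically; this is just an illustrative check with arbitrary dimensions, not part of the text's development.

   % Verify AB = sum_k A(:,k)*B(k,:) and (AB)' = B'*A' on random data.
   m = 4; p = 3; n = 5;
   A = rand(m,p); B = rand(p,n);
   C = zeros(m,n);
   for k = 1:p
       C = A(:,k)*B(k,:) + C;              % Algorithm 1.1.8: sum of p outer products
   end
   err_outer     = norm(C - A*B)           % should be at roundoff level
   err_transpose = norm((A*B)' - B'*A')    % Theorem 1.1.1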
The dot product of complex n-vectors x and y is prescribed by sazty= oan Finally, if A= B+iC € ©"**, then we designate the real and imaginary parts of A by Re(A) = B and Im(A) = C respectively. Probleme Piatt Suppose A € FO" and 2 € RY are given. Give a saxpy algorithm for computing ‘he fim column of M = (A= 241)---(A—2el). 1.1. BAstc ALconrras AND NotaTION 15 PL.L.2 In the conventional 2by-2 matrix multiplication C = AB, there are eight multiplications 3x61, 1,632, o24633, 093612, arabat, arab, aaa and ozabaa. Male ‘table thot indicates the order that these multiplications are porformad forthe jh, si, ‘ij, th, jo, ane jt matrix multiply algorithm. PLLA Give an algorithm for computing C= (2y7)* where 2 and y ae n-vectort PA.1.4 Specify an slgrithi for computing (XY7)* where X,Y € FO*? 1.1.5 Formulate an outer product algorithm for the update. C= ABT + where AGRO, Be RO, and CER, 1.1.8 Suppone we have real n-by-n matrices C, D, and F. Show how to compute real reby-n matrices A nod B with just three roaln-by-n mateix multiplications vo that (A418) = (C4 {DE + KF). Hint: Compa W = (C + DUE - F) Notes and References for Sec. 1.1 1k cust be stremed that the development of quality software from aay of our “sami- formal” algorithmic promentations is long and arduous task. Even the implementation of the evehL 2, and 3 BLAS require care: CL, Laweon, RJ. Hanson, , D.R. Kincaid, and P:T. Krogh (1979). “Basie Linear Algebra Subprogram for PORTRAN Usage,” ACM ‘Trans. Math. Soft 5, 308-323, CL, Lavon, RJ. Hanson, D.R. Kincadd, and FT. Krogh (1979). “Algorithm 539, ‘asic Linear Algebra Subpeograma for FORTRAN Usage,” ACM Trans: Math. Sof 5, 30395, 4.5. Dongarra, J. Du Cros, 8, Hammaring, and RJ, Hangon (1988). “An Extended Sat of Fortran Basi Linear Algebra Subprograms,” ACM Trans, Math Soft. 14, 1-17. J.J. Doone, J. Du Cros, 8. Hammarling, snd RJ. Hanson (1988). “Algorithms 656 An. Extended Set of Fortran Basic Linear Algebra Subprogram: Model Implementation and Test Programa” ACM Trans. Math Soft. 14, 18-32. 41.4. Dongarra, J. Du Gros, 15. Duff, and 5.J. Hammacling (1990). “A Set of Lavel 3 Basic Linear Algebra Subprograms” ACM Trans. Math. Soft. 16 1-17. 43.4. Dongazra, J. Du Cros, 18. Duff, and S.J. Hammarling (1990). "Algneithm 679. A ‘Set of Laval 3 Basic Linear Algeira Subprogram Model implementation aod Test Programa,” ACM Trans, Math Soft 16, 1-28 (Other BLAS references inciude B. Kigerim, P. Ling, and C. Van Loan (1991). ‘High-Performance Lewe-3 BLAS: Sample Routines for Doubla Precision Real Dats” in High Performance Computing 1, M. Duraod and F. El Dabagh (eds), North-Holland, 250-281. 2B, Kigetrém, P. Ling, and C. Van Loea (1006). “GEMOM-Dasnd Lave.3 BLAS: High- Performance Model Implementations and Performance Evaluation Beachmatl,” ia Paral Proremnmg end Apptatons,F,Pitsen nod Ls Fame (es), 150 Pr, i For an appreciation ofthe mbtietion amocintad with woftware development we recommend JLR Rice (1081). Metre Computations and Mathematicel Software, Academie Prem, New York. tnd « browse through the LAPACK manual 16 CHAPTER 1. MATRIX MULTIPLICATION PROBLEMS 1.2 Exploiting Structure ‘The efficiency of a given matrix algorithm depends on many things. Most ‘obvious and what we trest in this section is the amount of required arith- metic and storage. We continue to use matrix-vector and matrix-matrix ‘multiplication as a vehicle for introducing the key ideas. As examples of exploitable structure we have chosen the properties of bandedness and sym- metry. 
Band metrices have many zero entries and so it is no surprise that band matrix manipulation allows for many arithmetic and storage short- cuts. Arithmetic complexity and data structures are discussed in this con- text. ‘Symmetric matrices provide another set of examples that can be used to illustrate structure exploitation, Symmetric inear systems and eigenvalue problems have a very prominent role to play in matrix computations and 0 it is important to be familiar with their manipulation, 1.2.1 Band Matrices and the x-0 Notation ‘We say that A ¢ IR™" has lower bandwidth p if ais = 0 whenever i > j +p. and upper bandwidth q if j > i+q implies ay =O. Here is an example of an 8-by-5 matrix that has lower bandwidth 1 and upper bandwidth 2: ‘The x's designates arbitrary nonzero entries. This notation is handy to Indicate the zero-nonzero structure of a matrix and we use it extensively, Band structures that occur frequently are tabulated in Table 1.2.1. 1.2.2 Diagonal Matrix Manipulation Matrices with upper and lower bandwidth zero are diagonal. If D ¢ ™" is diagonal, then D=ding(y, If D is diagonal and A is a matrix, then DA is a row scaling of A and AD is 0 colurnn scaling of A. d,), q=minfmn} ed dy = dy 1.2. Exptormne Srrucrure Type cof Matrix Giagonal upper triangular lower triangular | m—-1 tridiagonal 1 ‘upper bidiagonal 0 lower bidiagonal 1 1 upper Hessenberg lower Hessenberg | _m. ‘TABLE 1.2.1. Band Terminology for m-by-n Matrices 1.2.8 Triangular Matrix Multiplication ‘To introduce band matrix “thinking” we look at the matrix multiplication problem C = AB when A and B are both n-by-n and upper triangular ‘The 3-by-3 case is illuminating: aub: anbia+ aiabe2 ayibys + arabay + ayabas c= 0 ambra azabaa + aasbsy 0 o assbss It suggests that the product is upper triangular and that its upper trian- gular entries are the result of abbreviated inner products. Indeed, since ‘Oubyy =O whenever k < ior j j, then digng = Adiag(i + nk k(k-1)/2) (20) (123) Some notation simplifies the discussion of bow to use this data structure in ‘a matrix-vector multiplication. If AE R™", then let D(A, k) € R"*" designate the kth diagonal of A 1s follows: sith, I & Using thin data structure write & rmaicix-vector multiply function that computes Raz) aod Im() from Rafe) and T(z) so that ¢ = Az PLL2.6 Suppoon X ¢ FO*P and AG FOX", with A symmetric and stored by diagonal. ‘Give an algorithm that computen ¥ = XTAX and storm the result by diagonal Use separate arrays for A and Y. PL2.7 Suppose a RY ia given and that AG RO** han the property that ayy = ‘4-s}41- Give am algorithm that overwrites y with Az-+y where 2,9 € RP are given, PL28 Suppose a¢ RP is given and that A R** has the property that oxy = 44(:¢5-1) mod np44- Gi ae algorithm that overwrite y with Ax +y where 2,y € RO we given, 1.2.9 Develop 2 compact storeby-diagonal acheme for unsymunetric band matrices and write the comerpanding guepy algorithm. 1.2.10 Suppose p and qaren-vectom and that A = (oi) le defined by 3 = a34 = Pty for L$ ¢ AarByp + Cas iN. i If we organize a matrix multiplication procedure around this summation, then we obtain 6 block analog of Algorithm 1.1.5: fora=1:N i= (a- ttt hae for f= 1:N ja G-we+upe (1.33) for y=1:N k=(y-Metiat a oh, i, k)B(E,3) + C3) end LN, B= end Note that if € = 1, then and 7 = k and we revert to Algorithm 11s. ‘To obtain a block saxpy matrix multiply, we write C= AB +C as Bu Bw [a.. 
Ler}at ote | bom oF Jpvnnce Bu + Bw vthere dg, Ca RO, and Bag € RO, From this we obtain 30 CHAPTER 1, Marrux MULTIPLICATION PROBLEMS: for B= 1: f= (0-104 tpt for a= 1:N (a-1)e + Liat (134) Oli 3) = AG IBA) + C9) end end ‘This is the block version of Algorithm 1.1.7. ‘A block outer product scheme results if we work with the blocking A (Ate A »-(2] BE where A,B, €R™*, From Lemma 1.3.2 we have c= Saat +e and 80 for y=: ke (y-Nerint BEE, +O (135) end This is the block version of Algorithm 1.1.8. 1.3.6 Complex Matrix Multiplication Consider the complex matrix multiplication update CL iy = (As + iAa)(Bs + iBe) + (C1 + Cr) where all the matrices are real and # imaginary parts we find CL = AB ~ ABs + Cr Cr = ABr+ AaB + C2 - Comparing the real and and this can be expressed as follows: [a]-[4 “2 ][2]+(2] 1.3. BLOCK MATRICES AND ALGORITHMS a ‘This suggests how real matrix software might be applied to solve complex matrix problems. The only snag is that the explicit formation of requires the “double storage” of the matrices Ay and Ay. 1.3.7 A Divide and Conquer Matrix Multiplication ‘We conclude this section with a completely different spproach to the matrix- matrix multiplication problem. The starting point in the discussion is the ‘2-by-2 block matrix multiplication [& ¢ Ce a |= [a An | [ Bu | Cu An An || Bn Ba where each block is square. In the ordinary algorithm, Cy = AaBiy + AcgBay. There ore 8 multiplies and 4 adds. Strassen (1963) has shown how to compute C with just 7 multiplies and 18 adds: PL = (Au+An)(Bu + Bn) Ppoo= (Ant An)Bu B Au (Br2 ~ Bra) Py An(Bu — Bi) P= (Ant Au)Ba Pe = (An ~ Au)(Bu + B12) Py (Ara — An)(Bn + Bra) Cu = Ath Bath Cu = Path Cu = P+ Cn = Pi+R-Pt Pe “These equations are easily confirmed by substitution. Suppose n= 2m so that the blocks are m-by-m. Counting adds and multiplies in the compu- tation C = AB we find that conventional matrix multiplication involves (2m)? raultiplies and (2m)5 - (2mm)? adds. In contrast, if Stressen’s al- ‘gorithm is applied with conventional multiplication at the block level, then ‘Tm? multiplies and 7m? + Lim? adds are required. If m >> 1, then the Strassen method involves about 7/8ths the arithmetic of the fully conven tional algorithm. Now recognize that we can recur on the Strassen idea. In particular, we can apply the Strassen algorithm to each of the half-sized block multiplica- tions associated with the A. Thus, ifthe original A and B are n-by-n and 1 = 2, then we can repeatedly apply the Strassen multiplication algorithm. At the bottom “level,” the blocks are 1-by-1. Of course, there is no need to 32 CHAPTER 1. MaTRIx MULTIPLICATION PROBLEMS recur down to the n= 1 level. When the block size gets sufficiently small, ( < ryin), it may be sensible to use conventional matrix multiplication when finding the P, . Here is the overall procedure: Algorithm 1.3.1 (Strassen Multiplication) Suppose n = 2" and that ACR" and Be R™™. If tin = 2 with d 1, then it sufices to count multiplications ‘as the number of additions is roughly the same. If we just count the mul- tiplicatione, then it cuffices to examine the deepest level of the recursion ‘as that is where all the multiplications occur. In strass there are q ~ d subdivisions and thus, 7°~¢ conventional matrix-matrix multiplications to perform. These multiplications have size tmin and thus strass involves about s = (24)°7t~* multiplications compared to ¢ = (2*)°, the number of multiplications in the conventional approach. 
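The recursive idea behind strass is compact enough to state in executable form. The MATLAB sketch below is ours, not a production code: it assumes n is an exact power of two, the crossover nmin below which conventional multiplication takes over is left to the caller, and the seven products are exactly the P1,...,P7 defined above.

    function C = strass_mm(A, B, nmin)
    % Strassen multiplication C = A*B for n-by-n matrices with n a power of two.
    % Recursion stops once the current blocks are nmin-by-nmin or smaller.
    n = size(A,1);
    if n <= nmin
        C = A*B;                              % conventional multiplication at the bottom level
    else
        m = n/2; i1 = 1:m; i2 = m+1:n;
        A11 = A(i1,i1); A12 = A(i1,i2); A21 = A(i2,i1); A22 = A(i2,i2);
        B11 = B(i1,i1); B12 = B(i1,i2); B21 = B(i2,i1); B22 = B(i2,i2);
        P1 = strass_mm(A11+A22, B11+B22, nmin);
        P2 = strass_mm(A21+A22, B11,     nmin);
        P3 = strass_mm(A11,     B12-B22, nmin);
        P4 = strass_mm(A22,     B21-B11, nmin);
        P5 = strass_mm(A11+A12, B22,     nmin);
        P6 = strass_mm(A21-A11, B11+B12, nmin);
        P7 = strass_mm(A12-A22, B21+B22, nmin);
        C  = [ P1+P4-P5+P7 , P3+P5 ; P2+P4 , P1+P3-P2+P6 ];
    end

A typical call is C = strass_mm(A,B,64) with A and B of order 2^q; the result agrees with A*B to roundoff, although the error bound for Strassen's method is somewhat weaker than for conventional multiplication (see §2.4.10).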
Notice that 1G @™ 1.3, BLOCK MATRICES AND ALGORITHMS 33 Ifd=0, ie., we recur on down to the I-by-1 level, then ) ce Ta ntl neg 3 ‘Thus, asymptotically, the number of multiplications in the Strassen proce- dure is O(n?40"). However, the number of additions (relative to the number cof multiplications) becomes significant 08 rimin gets small. Example 1.8.1 Ifn = 1024 and tain ~ 64, then strane involves (7/8)!9-8 m6 the asthmatic of the conventional algorithm. Probleme 1.3.1. Generalize (1.3.3) a0 that it can handle the variable block-eiza problem covered. by Theorem 1.33. : 3.9.2 Generalize (1.24) and (1.3.5) eo thet they can handle the variable Blockine 2.3.3 Adapt stramn ap that it can handle square matrix multiplication of any order. Hint Ifthe "current" A haa odd dimension, appand a zero row and column, iad Pret ba hee la blocking of the matrix A, then ae 13.5 Spon even and cin te flowing function from RO to Re 2 Ja) = ttamFatny © Senta (0) Show that i 2,9 € RP thea oa Py = Sener + vader +91) = 12) ~ fo) (8) Now consider the mty-n matrs makiptiation @ = AB. Ghre an algoithmn for By} Ae ‘and ote from Lemma 13.2 that Now analyze each Ay By with the help of Lama 13.1. Notes and References for Sec. 1.3 For quite some time fast methods for matrix multiplication have atractad « ft of at- tention within computer science, See 8. Winograd (1968), “A New Algorithm for Inner Produc,” BBE Trans. Comp. C-17, 6st, YY. Straseen (1969). “Gaussian Elimination is Not Optimal,” Numer. Math. 15, 354-356, YV. Pan (1984), “How Can We Speed Up Matrix Multiplicatio?,” SIAM Repiew 26, so4i6, “Maay of thase methods have dubious practical value. However, withthe publication of D. Bailey (1068). “Extra High Spend Matrix Multiplication on the Cray.2," SIAM J. Sei and Stat. Comp. 9, 603-607. {ie clar that the blankatdieminnal of those fast procedures ia unwiae, The “etability” (of the Straaen algorithm ix diced in $24.10. Seo aleo [NJ Higham (1990). “Exploiting Fast Matrix Multiplication withix the Level 3 BLAS” ACH Trona, Mosh. Soft 16, 352-968. ‘CC. Douglas, M. Heroux, G, Slshman, and RLM. Smith (1994), “CEMMW: A Portable Level 3 BLAS Winograd Variant of Struaen's Matris-Mutrie Multiply Algorithm.” I Comput, Phys. 110, 1-10. 1.4 Vectorization and Re-Use Issues ‘The matrix manipulations discussed in this book are mostly built upon dot products and saxpy operations. Vector pipeline computers are able to perform vector operations such ax these very fast because of special hardware that is able to exploit the fact that a vector operation is a very regular sequence of scalar operations. Whether or not high performance is extracted from such a computer depends upon the length of the vector operands and a number of other factors that pertain to the movement of data such as vector stride, the number of vector loads and stores, and the level of data re-use. Our goal is to build a useful awareness of these issues, We are not trying to build e comprebensive model of vector pipeline 1.4, VectoRmaTion AnD Re-Use Issuzs 35 ‘computing that might be used to predict performance. We simply want to Identify the kind of thinking that goes into the design of an effective vector pipeline code. We do not mention any particular machine. The literature is filled with case studies. 1.4.1 Pipelining Arithmetic Operations ‘The primary reason why vector computers are fast has to do with pipelin- ing. The concept of pipelining is best understood by making an analogy to scsembly line production. 
Suppose the sssembly of an individual automo- bile requires one minute at each of sixty workstations along an assembly line. the line is well staffed and able to initiste the assembly of a new car ‘every minute, then 1000 cars can be produced from seratch in about 1000 + 60 = 1060 minutes. For a work order of this size the line hos an effective “vector speed” of 1000/1060 automobiles per minute, On the other hand, if the assembly line is understalfed and 8 new assembly can be initiated just once an hour, then 1000 hours are required to produce 1000 cars. In this case the line has an effective “scalar speed” of 1/60th automobile per minute, So itis with a pipelined vector operation such as the vector add z = <-+y. The sealar operations = = xj + y are the cars. The aumber of elements is the size of the work order. If the start-to-finish tive requited for each is 7, then & pipelined, length vector add could be completed in time much less than nr. This gives vector speed. Without the pipelining, the ‘vector computation would proceed at s.scalar rate and would approximately require time nr for completion. Let us see how a sequence of fosting point operetions can be pipelined. Floating point operations usually require several cycles to complete. For example, 3 dcycle addition of two scalars x and y may proceed as in Fic.1.4.1. To visualize the operation, continue with the above metaphor :—[ Xa : Spe] tet] Horas == Fig. 14.1 A 3-Cycle Adder ‘and think of the addition unit as an assembly line with throe “work sta- tions”. The input scslars x and y proceed along the assembly line spending ‘one cycle at each of three stations. The sum = emerges after three cycles. « cuarsen 1, Mannix Mesmucamon Pontos sa ite, hak mae a Boe: . Fig. 1.4.2 Pipelined Addition Note that when o single, “iree standing” addition is performed, only one of the three stations is active during the computation. ‘Now consider a vector addition z =+y. With pipelining, the x and y vectors are streamed through the addition unit. Once the pipeline is filed and steady state reached, a 2 is produced every cycle. In Fic.1.4.2 we depict what the pipeline might look like once this steady state is achieved. In this case, vector speed is about three times scalar speed because the time for on individual add is three cycles. 1.4.2 Vector Operations ‘A vector pipeline computer comes with a repertoire of wector instructions, such as vector add, vector multiply, vector scale, dot product, and saxpy. ‘We assume for clarity that these operations take place in vector registers. ‘Vectors travel between the registers and memory by means of vector load and vector store instructions. ‘An important attribute of a vector processor is the length of its vector registers which we designate by v,. A length-n vector operation must be broken down into subvector operations of length v, or less. Here is how such ‘2 partitioning might be managed in the case of a vector addition z = =+y whore = and y are n-vectors: first =1 while first Aaj =; eR} If {01,---.dn} is independent and 6 € span{ai,...,0a}, then b is a unique linear combination of the a;. If $1, ..+4Se are subspaces of FO, then their sum is the subspace defined by $= (ay $aa +++ +04: a, € Si, f= Lk }. Sis gald to be a direct sum if each v € S bas a unique representation sso ay with a € Si. In this case we write $ = $1 ®---@ S,. The intersection of the S; is also a subspace, 5 = 31S Se. 
+1044} is @ mazimal linearly independent subset of span{a linearly independent subsot of {a1,...52q}- If fojyy.+-s044} is maximal, On} = span{aiye--0i,) and {aiyoo-sdhy) 18 a basis ay} . If SC IR™ is a subspace, then it is possible to find independent basic vectors ‘ay € 5 such that S = span{ay,...,ax} . All bases for a subspace S have the same number of elements. This number is the dimension and is denoted by dim(S) 2.1.2 Range, Null Space, and Rank ‘There aze two important subspaces ossociated with an m-by-n matrix A. ‘The range of A is defined by ran(A) = {y¢R™: y = Az for some 2 € RY}, and the mull space of A is defined by null(A) = {z €R" : Az =0}. +4] is @ columa partitioning, then ran(A) = span{ar,...saq} - ‘The rank of a matrix A is defined by rank(A) = dim (ran(A)). Tt ean be shown thot rank(A) = rank(AT). We say that A ¢ Re™*" is rank deficient if rank(A) < min{m,n). If A€ R™*, then im(null()) + ronk(A) KA=[a, 50 CHAPTER 2. MATRIX ANALYSIS 2.1.3 Matrix Inverse ‘The n-by-n identity matriz Iq is defined by the column partitioning Ine len en] where e4 is the kth “canonical” vector: ee =(Qye-50 1, 0, ‘The canonical vectors arise frequently in matrix analysis and if their di- mension is ever ambiguous, we use superscripts, ie., ef") € IR". If A and X are in FE" and satisfy AX = 1, then X is the inverse of A and is denoted by A™!, If A~' exists, then A is said to be nonsingular. Otherwise, we say A is singular. ‘Several matrix inverse properties have an important role to play in mae trix computations. The inverse of a product is the reverse product of the (AB) = BA", 11) ‘The transpose of the inverse is the inverse of the transpose: (AYP a (Attar (212) ‘The identity Be At BB AA (2.13) shows how the inverse changes if the matrix changes. ‘The Sherman-Morrison- Woodbury formula gives convenient expres- sion for the inverse of (A+UV") where 4 ¢ RO*™ and U and V are n-by-k: (AF UVT) AAMT VATU) VTA! (2.1.4) ‘A rank & correction to a matrix results in a rank k correction of the inverse. Ja (2.1.4) we assume that both A and (I+ V?A~'U) are nonsingular. ‘Any of these facts can be verified by just showing that the “proposed” inverse does the job. For example, here is how to confirm (2.1.3): B(A*—~ B-4B — AYA“) = BA (B - A)A™ 2.14 The Determinant If A = (0) € R™!, then ita determinant is given by det() = a. The determinant of A ¢ R"*" is defined in terms of order n — 1 determinanta: det(A) = De 1)? *ayydet(Ay)) 2.1. BAsic IpeAs FROM LINEAR ALGEBRA 51 Here, A1y is an (n ~ 1)-by-(n— 1) matrix obtained by deleting the frst row ‘and jth column of A. Useful properties of the determinant include det(A)det(B) ABER" det(AB) = det(A) = det(A) Aer" det(cA) = det A) ceRAcR™ det(A) #0 4 Ais nonsingular Ae RO" 2.1.5 Differentiation Suppooe a is a scalar ood that A(a) is an m-by-n matrix with entries a(c2). If ay(a) is a differentiable finetion of a for all and j, then by A(a) we mean the matrix a Ha) = Zale) = (Lasla)) = @yla)- ‘The diferentiation of a parameterized matrix tums out to be a handy way to examine the sensitivity of various matrix problems. Probleme 2.1.1 Show that if Ae FO" bas rank p, then therw existe an X€R™? and 9 Y¥ @ RO*? wach that A = XY, where rank(X) = raak(Y) op. 2.1.2 Suppose A(a) € Fad B(a) € FE** are matrices whose entree are difer- cotiable fanctions of the scalar @. Show Emorocen = [Zac] 210+ 40 [2 aa] P2.1.3 Suppome A(a) ¢ FO%™ haa enti that ar diferntiabe functions of the salar ‘a. Amuming Aa) almye noaingula, obow $ acer] = -A0o-* [Zara] Acad. 
2.14 Suppose A GFE, BG HE and that g(e) = JT A 27% Show that the aden of § i gion by VAs) = HAF + Ale 8 P2.1.5 Amume that both A and +97 are soosagulr whare A € FO** sad, € RE Show that f= saves (A 4-wo?o = then 1 alo soles perturbed ight bead ide problem ofthe form Az= b +e. Give an xpremi fr aia tem of A, 3, and Notes and Referanons for Sec. 2:1 ‘There ae many introductory near lgebra vets, Among them, the flowing are par- ‘Heulary ef PR Halmos (1858). Finite Dimensional Vector Spaces, od od, Van Nostrand Reinhold, Prinoron, 52 CHAPTER 2. MATRIX ANALYSIS 8.3, Loon (1980). Linaor Algebra ith Applications. Macmillan, New York. Strang (1989), Intreduction to Linear Alger, Waleley-Casbegs Prem, Wellley D, Lay (1994). Linear Algebra and fis Applications, Addioon- Wesley, Reading, MA. © Meyer (1907. Course Appin Lineer Agere, SIAM Pubatons,Phiadephi PAL More advanced treatments include Gantmacher (1050), Hora nad Johnson (1985, 2991), aod AS, Honma (104), The Theory of Mabe Numeral Analy, Gas (Bate ‘dell, Boston. M, Marcus and BL. Mine (1964). A Survey of Matrie Theory ond Matric Inorualiten Allyn tnd Bacon, Boston. LN. Fraaldin (1968). Matris Theory Prentice Hall, Englewood Cif, NJ. R. Baliman (1970). introduction to Matrez Analyes, Second Editon, McGraw-Hill, New York. P. Lancaster and M. Tiamenetaky (1085). ‘The Theory of Matrices, Sesond dition JMC Oncega (1087), Matrie Theory: A Second Course, Plus Pres, New York. 2.2 Vector Norms ‘Norms serve the same purpose on vector spaces that absolute value docs fon the real line: they furnish a measure of distance. More precisely, R” ‘together with a norm on R® defines a metric space. Therefore, we have the familiar notions of neighborhood, open sets, convergence, and continuity ‘when working with vectors and vector-valued functions, 2.2.1 Definitions A vector norm on IR" is a function /-R" —~ R that satisfies the following properties: f(z) 20 zeER', (f(z) =0iffz=0) f(z+y) < f(z) + fy) ye R” faz) = |alf(z) aeRzeR" We denote such a function with a double ber notation: f(2) = J. Sub- ‘scripts on the double bar are used to distinguish between various norms. ‘A useful class of vector norms are the p-norms defined by Hod, = (eu? ++ leal)# pan (223) ‘Of these the 1, 2, and oo norms are the most important: Veh, = false lea lz = (lei? +--+ Jeni)? = (Pa)! Wells = max lel 2.2. Vector Nonms 53 ‘A unit vector with respect to the norm l| || is a vector x that satisfies Iz}=1. 2.2.2 Some Vector Norm Properties A classic result concerning p-norms is the Holder inequality: By || we say that tobe = 2-24] isthe absolute error in . If #0, then 2-21) ot TT prescribes the relative error in 4. Relative error in the oo-norm can be ‘translated into o statement about the number of correct significant digits in @. In particular, if Hele a 1 then the largest component of has approximately p correct significant digits Bxample 2.21 I= (170 06674)7 and w (1.235 .06128)7, ben 12 #lln/I = las 0043 = 10-*. Nota than 4y has about three signcant digits that are correct while aly coe significant digit in 23 ia comect. 54 CHAPTER 2. MATRIX ANALYSIS 2.2.4 Convergence We say that 0 sequence {2(*)} of n-vectors converges to = if fm 2 zi =0. ‘Note that becuse of (2.2.4), convergence in the a-norm implies convergence {in the A-norm and vice versa. Problems P22. Show that i € RO, then lity oii, = 2 by P22. Prove the Cauchy-Schwarts inequality (223) by comakdering the inegsality OS (ax + by)" (az + by) for suitable scalars a and 6. 
PB.23 Verify tat + Hy I Nyy 804 [Nag ate Yetor norms 2.24 Verify (22.5}(227). Whe is euality achieved in each em? 2.25 Show that in RY, 20) + if ad only 20 24 fr B= 2.2.6. Show that aay vector norm on R* is waiforly continuous by veritying the inequality | lel ~ vil ll Alla Bll. For the most part we work with norms that satisfy (2.3.4), ‘The p-norms have the important property that for every A € R"™" and ZER" we have | Az |, < |] All,] 2[,. More generally, for any vector norm ff «lq 08 Re" and’ - jon R™ we have | Azlly < Algal? la where A fag is « matrix norm defined by Arig Tle” (2.3.5) HAllas = Wesay that | lug is subordinate tothe vector norm fg and | «By. Since the set {z €'R": |x, = 1} is compact and | = [1 i continuous, © follows that Walaa jaa LAr ly = Lae" ly (238) for some z* ¢ IR having unit a-norm. 2.3.2 Some Matrix Norm Properties ‘The Frobenius and p-norms (especially p = 1, 2, co) satisfy certain ineqtal- ities that are frequently used in the analysis of matrix computations. For AER™® we have UA: $f Allp < Vala ean max lou] < Ala S Vein max layl (2.38) VAh = me Seu (239) Vly = ae Lieut 230) Fal Ale SUAb < VIALS sn) JRiAh sib s vila (32) 2.3. Marmix Noms 37 HAER™", 1 1, a contradiction. ‘Thus, fF is nonsingular. To obtain’ an expression for its inverse consider the identity . (E*) (+P) = 1-FH, = Siace | FI, <1 i flows that im F* = 0 because 1F*Y, < WIS Thus, (vde)e-n al w It follows thot (I'-F)-* = Jim 7 F*. From this itis easy to show that is 1. IF il, ta-ry s Died = & Note that |(I-F)--Iil, < [Fllp/(- Il F'ill,) a8 @ consequence of the lemma. ‘Thus, if ¢ <1, then O(6) perturbations in I induce O(¢) perturbations in the inverse. We next extend this result to general matrices. ‘Thoorem 2.3.4 If A is nonsingular and r= AE J, <1, then A+ E is nonsingular and (A+ 2)" AI, WEL WAT —2)- Proof. Since A is nonsingular A +E = A(I ~ F) where F = —A-'E. Since | Fil, = r < Lit follows from Lemma 2.3.3 that IF is nonsingular and {(2- F)-" fl, < 1/(1-r). Now (A+ £)-! = (I= F)"1A! and so taser, s Abe 2.4. Foire PRECISION MaTREx COMPUTATIONS 59 Equation (2.1.3) seys that (A+ E)~! — An! = —A~18(A + B)~! and oo by taking norms we find VAs et- A, < PAM EE NAS, 193 < MULE g Problems P2.3.1 Show | ABI, SHAI,II Bil, where 1

M or 0 < |a op bl < m respectively. ‘The handling of these and other exceptions is hardware/system dependent. 2.4.3 Cancellation Another important aspect of finite precision arithmetic is the phenomenon of catastrophic cancellation. Roughly speaking, this term refers to the ex ‘treme loss of correct significant digits when small numbers are additively computed from large numbers. A well-known example taken from Forsythe, Malcolm and Moles (1977, pp. 14-16) i the computation of e* via Tay- Jor series with a > 0. The roundoff error associated with this method is ‘There are important camps of machines whose additive floating point operations satiety 7020) = (1+ e)0-© (1+ 49)0 whee [elle] bj =lay,i=im jain BSA > by Sayi ‘With this notation we see that (2.4.6) has the form m, j= in If(A) ~ A} < ulAl. A relation such as this csa be easily tuned into a norm inequality, eg., HAA) = Al}, S all All, However, when quantifying the rounding errors in a matrix manipulation, the absolute value notation can be lot more Informative because it, provides a comment on esch (i,j) entry. 2.4.5 Roundoff in Dot Products We begin our study of finite precision matrix computations by considering ‘the rounding errors that result in the standard dot product algorithm: s=0 for k= In saotnn (24.7) end Here, x and y are n-by-1 floating point vectors. In trying to quantify the rounding errors in this algorithm, we are immediately confronted with notational problem: the distinction be- ‘tween computed abd exact quantities. When the underlying computations are clear, we shall use the /l(-) operator to signify computed quantities. 2.4. Favre Precision MATRIX COMPUTATIONS 63 Thus, fi(2"y) denotes the computed output of (2.4.7). Let us bound Uste"9) = 37). It . y=Tl (Eas) , i then #1 = 21ys(1 +61) with [64] 0 and B20, then conventional matrix multiplication produces a product € that ‘has small componentwise relative error: [-C| ¢ mula} [Bl +O(u?) = aujo] +0(u") This follows from (2.4.18). Because we cannot say the same for the Strassen. approach, we conclude that Algorithm 1.3.1 is not attractive for certain nonnegative matrix multiplicstion problems if relatively accurate ij are required. Extrapolating from this discussion we reach two fairly obvious but im- portant conclusions: ‘© Different methods for computing the same quantity can produce sub- stantially diferent results. '* Whether or not on algorithm produces satisfactory results depends upon the type of problem solved and the goals of the ser. ‘These observations are clarified in subsequent chapters and are intimately related to the concepts of algorithin stability and problem condition. Probleme Pad Show thet if (24.7) la applied with y = 2, then fi(2"2) = 272(1 +a) where al ¢ na +0(02), Pada Prov (243). PAA9 Show thet if B ¢ HE™M® with m > m, thon 151 a < VA Ef This eeu is ‘met when deriving norm bounds fom absolute value bounds. 244 Amume the existence of « aqoare rot function stiying f1(/2) = VE() +6) with [el (0? + ww). But o? =A} =|] Ar fi}, and 0 we must have w = 0. An obvious induction argument completes the proof of the theorem, © ‘The a; are the singular values of A and the vectors uw; and 1 are the ith left singular vector and the ith right singular vector respectively. It 2.5. ORTHOGONALITY AND THE SVD. 
n is easy to verify by comparing columns in the equations AV = UE and ATU = VSP that Ay = 014 Ay = om It is convenient to have the following notation for designating singular val- } i= 1min(mn} (A) = the ith largest singular value of A, Omes(A) = the largest singular value of A, Cmin(A) = the smallest singular value of A. ‘The singular values of @ matrix A are precisely the lengths of the semi-axes of the hyperellipsoid E defined by B= { Az: [la =1}. Bxample 2.5.1 (8 Ble -(S ME ay ‘The SVD reveals s great deal about the structure of a metrix. If the SVD of A is given by Theorem 2.5.2, and we define r by OB DOe> orgy soy =O, theo rank(A) = or (2.5.3) aull(A) = spam{eegty.--+¥a} (254) rand) = epan(yeste} 5 (283) and we have the SVD expansion A= Dowsl (250) st ‘Various 2-norm and Frobenius notm properties have connections to the SVD. If Ae R™", then WAIL = oft--403 pa minimn} (25.7) Ab =m (2.5.8) in "Azle oo (m2n) (259) 2 CHAPTER 2, MATRIX ANALYSIS: 2.5.4 The Thin SVD If A= ULV? € R™ is the SVD of A and m > n, then A=UiEVT where sooty RE" and Ey = E(t, Ln) = diag(oy,...,09) € RO We refer to this much-used, trimmed down version of the SVD as the thin sup. 2.5.5 Rank Deficiency and the SVD One of the most valuable aspects of the SVD is that it enables us to deal sensibly with the concept of matrix rank. Numerous theorems in linear ‘algebra have the form “if such-and-ouch # matrix has full rank, then such- ‘and-such a property holds.” While neat and sesthetic, results of this favor do not help us address the numerical difficulties frequently encountered in situations where near rank deficiency prevails. Rounding errors and fuzzy data moke rank determination a nontrivial exercise. Indeed, for some small ‘ce may be interested in the crank of a matrix which we define by rank(A.e) = min rank) Thus, if A is obtained in a laboratory with each a; correct to within +001, then it might make sense to look at raak(A,.001). Along the same lines, if Ais an m-by-n floating point matrix then itis reasonable to regard A ss numerically rank deficient if rank(A,¢) < min{m,n} with e= ull A [2. Numerical rank deficiency and e-rank are nicely characterized in terms of the SVD because the singular values indicate how near a given matrix is to a matrix of lower rank. ‘Theorem 2.5.3 Let the SVD of A R™" be given by Theorem 2.5.2. If eer =rank(A) and £ Ay= Dovw?, (25.10) min A-Blha = A Avia = ons. (2.5.11) ait [A- Bla = WA~ dul (say 2.5. ORTHOGONALITY AND THE SVD a ) it Follows that rank( Ax) 109) and so | A — Ax {la = Ost ‘Now suppose rank(B) = k for some B éR™™". It follows that we can find orthonormal vectors z1;.--,2q-n 90 oull(B) = span{2t,...,2n-1} - ‘A dimension argument shows that wwe have kat WA-BUE > M(A~ Ble =WAZHE = DoT e)? > ods completing the proof of the theorem. © ‘Theorem 2.5.3 says that the smallest singular value of A is the 2-norm distance of A to the set of all rank-deficient matrices. Tt also follows that the set of full rank matrices in IR™*" is both open and dense. Finally, ifre = rank(A,), then a1 Do Borg > CD organ Bo Boy p= minfmn). ‘We have more to say about the numerical rank issue in §5.5 and §12.2. 2.5.6 Unitary Matrices ‘Over the complex field the unitary matrices correspond to the orthogonal ‘matrices. In particular, Q € ©" is unitaryif Q7Q = QQ" = Iq. Unitary matrices preserve 2-norm. The SVD of a complex matrix involves unitary matrices. If A€ (™™, then there exist unitary matrices U € C™™™ and Vee" guch that UM AV = ding(o1,.-.,09) ER" — p= minfm,n} where oy 2 042 ...2 0,20. 
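Returning to the real case, Theorem 2.5.3 and the epsilon-rank idea of §2.5.5 are easy to explore numerically. In the following MATLAB sketch the test matrix, the truncation index k, and the tolerance are arbitrary choices of ours; the point is that the rank-k truncation A_k built from the SVD satisfies || A - A_k ||_2 = sigma_(k+1).

    % Numerical rank and best rank-k approximation via the SVD (cf. Theorem 2.5.3)
    A = hilb(8);                         % a famously ill-conditioned test matrix
    [U,S,V] = svd(A);
    s = diag(S);                         % singular values, largest first

    tol = length(A)*eps*s(1);            % a tolerance of the form epsilon = u*||A||_2, roughly
    r_eps = sum(s > tol);                % the epsilon-rank of A

    k  = 3;
    Ak = U(:,1:k)*S(1:k,1:k)*V(:,1:k)';  % A_k = s_1*u_1*v_1' + ... + s_k*u_k*v_k'
    disp([r_eps, norm(A - Ak), s(k+1)])  % the last two numbers agree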
Problecse 2.5.1 Show that if ts ral and ST =~5, then J ~ $ a nonsingular and the mate (C= S)-*(F% 5) orthogonal. Thin in known 0a the Capley ronaform of 5. ™ CHAPTER 2. MATRIX ANALYSIS; 2.6.2 Stow that angular orthogooal matric ia dagpal. 2.63. Show that FQ = Qi +402 is unitary with Qi, € FE, then the nbn palmate z-(& & ix othogona P25. Brabah properties (25.)(259) 25.8 Prove that emee(A) = mae “ay veR™z6R™ Trl P25 Foc the 2-2 matix A = [ 9 rnin) tata unetion of, ¥, ad = P2S.7 Show that any matrix ia RO ithe limi ofa mquese of fil enk mates. (2.5.8 Show thet if A € R™** has rank n, then || A(A74)-'A? Jja = 1. P2.5.9 What isthe neat asi manixto A= [4 | inthe Robeaus nom? 5 ] deems at an 2.5.10 Show that if A € R™*® then || Ally $ /Fank(A) EA lay thereby sharpening aan. Notes and Referonces for Sec. 2.5 Forsythe and Moler (1967) offer » good account of the SVD's roe in tho analysis of the ‘Ax = problem. Their proof of the decomposition a more traditional than ours is that it maton ue ofthe eigeavale theory for eymetic matrices. Historical SVD relerencer inelade E. Boltrami (1873). ‘Sulla Punsioni Bilinear,” Gionale i Mathematiche 11, 98-106. C. Beart ane G. Young (1999). *A Principal Axia Transformation for Noo- Hermitian Matrices” Bull. Amer. Math Soe. 46, 118-21 G.W. Stewart (1989). "On the. Early History of the Singular Value Decomposition,” ‘SIAM Review $5, 351-566, (One ofthe most significant developments in sclemifc computation hasbeen the increased ‘se of the SVD in application area that roqure the inteligens handling of matrix rank. ‘The range of applications i impremive. One of the mort interesting is CB, Moler and D. Morrison (1988). “Singular Value Analysis of Cryptogramng,” Amer. ‘Math. Monthy 90, 78-27 For geoeraliantions of the SVD to infinite dimensional Uber space, see LC. Gohberg and M.G. Krein (1960). introduction to the Theory of Linear Now-Self ‘Adjoint Operators, Amer. Math. Soc, Providence, Rl . Smithiee (1070). Inter Squations, Cambridge University Press, Cambridge. Reducing the rank of o matrix as in Theorem 25.3 when the perturbing matrix i com ‘erined in dicuamed i IW, Demme (1987). “The smallest perturbation ofa submatri< which lowers the rank ‘and constrained total least squares problemas, STAM J. emer. Anal. #f, 199-206. 2.6. PROJECTIONS AND THE CS DECOMPOSITION % GH, Golub, A. Hotiman, and G.W. Stewart (1088). "A Ganeralisation ofthe Eelare- ‘Young Mirsky Approsimation Theorem.” Lin. Alp. and fle Applic 68/80, 317-328, GA. Wotaon (1988). “The Smalls Perturbatioa of « Submmatrox which Lowers the Raa of the Matrix” IMA J. Numer, Anal. 8, 295-304. 2.6 Projections and the CS Decomposition If the object of a computation is to compute a matrix or a vector, thea norms are useful for assessing the sccuracy of the answer or for measuring progress during an iteration. Ifthe object of a computation is to compute a subspece, then to make similar comments we need to be able to quantify the distance between two subspaces. Orthogonal projections are critica in this regard. After the elementary concepts are established we discuss the CS decomposition. This is an SVD-like decomposition that is handy when having to compare & pair of subspaces, We begin with the notion of an ‘orthogonal projection. 2.6.1 Orthogonal Projections Let SC IR" be a subspace. P ¢ R°*" is the orthogonal projection onto Sif ran(P) = $, P! = P, and P? = P. 
From this definition itis ensy to show that if x €R", then Pz € S$ and (I — P)z¢ $+ If Py and Py are each orthogonal projections, then for any z € R° we have NCP ~ Paz = (Piz) ~ Pade + (Paz) "UE - Pde If ran(P,) = ran(P,) = S, then the right-hand side of this expression is zero showing that the orthogonal projection for a subspace is unique. If the columns of V = [v4,...,1% ] are an orthonormal basis for a subspace S, then it is easy to show that P = VV" i the unique orthogonal projection onto S. Note that if v €R°, then P = w7/u7v is the orthogonal projection onto $ = span{v}. 2.6.2 SVD-Related Projections ‘There are several important orthogonal projections associated with the sin- gular value decomposition. Suppose A = UEVT e R™*" is the SVD of A and thot r = rank(A). If we have the J and V parttionings u=(Uu 6) vel% %] rom=r roner VVE projection on to null()* = ran(AT) V,VP_ = projection on to null(A) U,UT = projection on to ran(A) U-0F = projection on to ran(A)* = null A?) 76 CHAPTER 2. MATRIX ANALYSIS 2.6.3 Distance Between Subspaces ‘The one-to-one correspondence between subspaces and orthogonal projec- tions enables us to devise a notion of distance between subspaces. Suppose ‘S, and Sz are subspaces of Rand that dim(S;) = dim(S;). We define the distance between these two spaces by dist(Si, Ss) = 1 Pi- Palla (2.6.1) where P, is the orthogonal projection onto Sj. The distance between a pair of subspaces can be characterized in terms of the blocks of a certain ‘orthogonal matrix. ‘Theorem 2.6.1 Suppose we=lm We] Z=(% hl Bonk kon-k are n-by-n orthogonal matrices. If Sy = ran(W1) and Sp = ran(Zi), the dist(Si,52) = |WYZe la = 127M ll. Proof. dist(S,52) AWAWT ~ 227 I, = WW™(WAWE - 2.28)2 fy = [fees "JL. Note that the matrices WZ and WY Z; are subinstrices of the orthogonal matrix Qu Qa] 2 | WE 2 o=[8: 2] = [WE whe |-#7 ‘Our goal is to show that || Qa fly = laeh ‘Since Q is orthogons! it follows from } = [SE] @ LehQuelh + 1Qax2i} for all unit 2-norm x € RY. Thus, that Qari = max zi min H Vnih = ymax Bnei pit, Hue = 1G min( Qs)? 2.6, PROJECTIONS AND THE CS DzcoMPostrion n Anslogously, by working with Q7 (which is also orthogonal) it is possible to show that ; 1 Qf 1B = 1 - ermal). and therefore Qa = 2 ~omn(Qas)? ‘Thus, Qa lly = 1x2 tO Note that if 5, and $2 are subspaces in IR" with the same dimension, then 0 < dist(S1, 52) < 1. ‘The distance is zero if S, = Sz and one if S:\St # {0}. ‘A more refined analysia of the blocks of the Q matrix above sheds more light on the difference between a pair of subspaces, This requires a special SVD-like decomposition for orthogonal matrices. 2.6.4 The CS Decomposition ‘The blocks of an orthogonal metrix partitioned into 2-by-2 form have highly related SVDs. This is the gist of the CS decomposition. We prove a very useful special case first. ‘Theorem 2.6.2 (The CS Decomposition (Thin Version)) Consider the rs 2-[%] str 2m nia 3 i ef eer ey ee enist orthogonal matrices U, € R™*™, Uz €R™*™, and Vi € R™™ such (2 ay [&le-[8] C= dingloon(Oy), S = ding(sin(®),.. eR", Qa eR /con(n)), sin(@n)), Och che Proof. Since || Qui lla $ II @ lla = 1, the singular values of Qi, are all in he interval (0,1). Let UFQuV, = C = dingler,... 60) = [% S] mice tnt 8 CHAPTER 2. MarRix ANALYSIS be the SVD of @; where we assume ea a> Bs Dey 20. ‘To complete the proof of the theorem we must construct the orthogonal matrix Ua if am = 1% He) ton 7 hoo nu os fa (5 2 Tale-[§ 3] * WW Since the columns of this matrix have unit 2-nona, W; = 0. 
The columas of Wz are nonzero and mutually orthogonal because WY Wa = Inne — ETE = dlag(t — 2,1,...,1-4) JIG for k = I:n, then the columns of Z = Wa ding(h ses, +05 1/8n) are orthonormal. By Theorem 2.5.1 there exists an orthogonal matrix U, € R™*™ with Up(:,t + In) = Z. It is easy to verify that UF Qav; = dlag(s4,---44n) = 5. Since +s = 1 for k ~ lin, it follow that these quantities are the required cosines and snes, © Using the same sort of techniques it is possible to prove the following more general version of the decomposition: ‘Theorem 2.6.3 (CS Decomposition (General Version)) if otée] is a 2-by-2 (arbitrary) partitioning of an n-by-n orthogonal matriz, then there exist orthogonal o- Fete] = bite] is nonsingular. If 35 such that vTov = 2.6. PROJECTIONS AND THE CS DECOMPOSITION nm where C = diag(eys..-s69) and S = diag(ar,. matrices with 0< c4 8 <1. Proof. See Peige and Saunders (1981) for details, We have suppressed the dimensions of the zero submatrices, some of which may be empty. ‘The eavential message of the decomposition is that the SVDs ofthe Qi, are highly related. Example 2.6.1 The matric 07876 o3eoT 04077 =o. 147) are square diagonal oss 0.2198 mz 0287601817 Qe =01301 00502 05805 0162 0.895 The angles amociated with the cosines and sines turn out to be very im Portant in a number of applications. See §12.4. Problems 2.6.1 Show chat i Pw an ortogonal projection, then Q-= I 2P i orchogool. 72.6.2 Whet ar te iagular value of an orthogooal projection? 2.63. Suppom Si = man{s} and 5: = span(y), where © and y are unit -n0na vector in R2. Working ony with the definition of dit), stow that dat( Si, $) = Vi (=F oF verifying that th datnce between Ss and S ual the ina of the angle erween and. Notes and References for Sec. 2.8 ‘The following papers diacunt various aspects of the CS decomposition: . Davia aad W. Kaba (1970). “The Rotation of Eigeuvectort by » Perturbation I,” SIAM J. Num. Anal 7, 1-48, G.W. Stawart (1977). "On the Perturbation of Preudo-Lrversn, Projections and Linear eaot Squares Problems.” SIAM Review 19, 634-682. (CLC. Paige and M. Saunders (1081). “Toward a Generalized Singular Value Decomspoui- on," SIAM J. Num. Anak 18, 308-405, (COC: Paige and M. Wat (1984). “History aad Gaveraity of tbe C8 Decomposition,” Lin. ‘Alp. and Its Applic. 208/200, 303-328. 80 CHAPTER 2. MATRIX ANALYSIS ‘Sea $8.7 for soxme computational details Foca deoper geometrical understanding ofthe CS decomposition and the notion of datance between subspaom, wo TTA, Aris, A. Edelman, and 8, Smith (1008). “Conjugate Gradimt and Newton's ‘Method! on the Graaarian end Stiefel Maaifokin” to appear in STAM J. Matrie Ancl. Ap 2.7 The Sensitivity of Square Systems ‘We now use some of the tools developed in previous sections to analyze the linear system problem Az = 5 where A € R'™" is nonsingular andé ¢ R". ur aim is to examine how perturbations in A and 8 affect the solution 2. A much more detailed treatment may be found in Higham (1996). 2.7.1 An SVD Analysis It A= Soa? = uEvt is the SVD of A, then ‘This expansion shows that small changes in A or b can induce relatively large changes in z if on is small. Tt sbould come as no surprise that the magnitude of om should have 1 bearing on the sensitivity of the Az = b problem when we recall from ‘Theorem 2.5.3 that op isthe distance from A to the set of singular matrices. ‘As the matrix of coefficients approsches this set, itis intuitively clear that ‘the solution z should be increasingly sensitive to perturbations. 
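The expansion above makes the sensitivity phenomenon easy to exhibit. In the following sketch the matrix and the perturbation are contrived for illustration; a small sigma_n turns a harmless change in b into a large change in x.

    % Sensitivity of Ax = b when sigma_min(A) is small
    A = [1 0; 0 1e-8];                   % sigma_min(A) = 1e-8
    b = [1; 1e-8];                       % exact solution x = [1; 1]
    x = A\b;

    db = [0; 1e-8];                      % a perturbation of b of relative size about 1e-8
    x_pert = A\(b + db);

    disp(norm(db)/norm(b))               % about 1e-8
    disp(norm(x_pert - x)/norm(x))       % about 0.7, an enormous amplification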
2.7.2 Condition A precise measure of linear system sensitivity can be obtained by consider- ing the parameterized aymam (A+ePla() =b+ef 20) =z where F € R°*" and f € R”. If A is nonsingular, then it is clear that 2(¢) is differentiable in a neighborhood of zero. Moreover, #(0) = A-*(f-Fz) and thus, the Taylor series expansion for z(c) has the form 20) = 2 + €2(0) 4012). 2.7, Tue SeNsITIviTy OF SQUARE SYSTEMS aL Using ony vector norm and consistent matrix norm we obtain t-te jepatn{tanen} +o. ara et Vel For square matrices A define the condition number n(A) by (A) = AN RATE (2.7.3) with the convention that x(A) = 00 for singular A. Using the inequality OY < [All hel it follows from (2.7.2) that EG) Tap < Alea + m+ oe) (2.7.4) ster ua Wel represent the relative errors in A and 6, respectively. Thus, the relative error in z can be x(A} times the relative error in A and 6. In this sense, the ret suber Sa) ques the santa othe Aa 8 probles Note that x(-) depends on the underlying norm and subscripts are used secondly e+ we HEl ele ea el Tay aad = lel ou(A) on(A) ‘Thus, the 2-norm condition of a matrix A messures the elongation of the hhyperellipsoid {Az Iz lla = 1}. ‘We mention two other characterizations of the condition number. For pnorm condition numbers, we have (A) = 1 A lial] A“ I (2.7.5) (2.78) a (A) asa. ‘This reoult may be found in Keban (1966) and shows that (A) measures the relative p-norm distance from A to the set of singular metrices. For any norm, we also have n(A)= im up (AAAI AY (2.2.7) eo waaay ‘This imposing result merely says thatthe condition number isa normalized Frechet derivative of the map A A-'. Further details may be found in Rice (19666). Recall that we were initially le to n(A) through diferenti- sion. 82 Cuapren 2. Maran ANALYSIS ‘If x(A) is large, then A is anid to be an ill-conditioned matrix. Note that this is a norm-dependent property?, However, any two condition numbers ‘o(-) and ma(-) on IRO*” are equivalent in that constants ¢; and ¢; can be found for which cutalA) S Ka(A) S cate(A) AER. For example, oa R'™" we have dla) ia mi(A) < neq(A) 1” Leta) $ mld) S matt) ar) Seta) < wold) above. For any of the p-norms, we have (A) 2 1. Matrices with small con- dition numbers are said to be well-conditioned . In the 2-norm, orthogonal ‘matrices are perfectly conditioned in that n2(Q) = 1 if @ ia orthogonal. 2.7.3 Determinants and Nearness to Singularity It is natural to consider bow well determinant size measures ill-conditioning. If det(A) = 0 is equivalent to singularity, i det(A) ~ 0 equivalent to near singularity? Unfortunately, there is lttle correlation between det(A) and the condition of Ax = b. For example, the matrix 3, defined by bate at Oe al By = , | ere (279) OO nd thas determinant 1, but oo(By) = n2"1, On the other hand, a very well conditioned matrix can have a very small determinant. For example, ding(10-4,...,10-1) € Re" although det(D,) = 10-*. satisfies rg( Dn 2.7.4 A Rigorous Norm Bound Recall that the derivation of (2.7.4) was valuable because it highlighted the connection between «(A) and the rate of change of 2(«) at « = 0. However, "It also depend upon the definition of “large.” The matter in purrued in 52.5 2.7. THE SENSITIVITY OP SQUARE SYSTEMS 83 it is a Ltt unsatisfying becouse it is contingent on ¢ being “small enough” ‘and because it sheds no light on the size of the O(c?) term. In this and the next subsection we develop some additional Az = b perturbation theorems ‘that are completely rigorous. 
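Before doing so, note that the point of §2.7.3 (determinant size says little about conditioning) is easy to check in MATLAB. The order n = 20 below is an arbitrary choice; B_n is the matrix of (2.7.9).

    % det(B_n) = 1 for every n, yet kappa_inf(B_n) = n*2^(n-1) grows violently
    n  = 20;
    Bn = eye(n) - triu(ones(n),1);       % the matrix B_n of (2.7.9)
    disp(det(Bn))                        % exactly 1
    disp(cond(Bn,inf))                   % about 1.05e7 for n = 20

    Dn = 0.1*eye(n);                     % diag(10^-1,...,10^-1)
    disp(det(Dn))                        % 10^(-n), a tiny determinant
    disp(cond(Dn))                       % yet kappa_2(D_n) = 1: perfectly conditioned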
‘We first cotoblish o useful lemma that indicates in terms of x(4) when we can expect a perturbed system to be nonsingular, Lemma 2.7.1 Suppose Az=b AER™", OZER” (A+ Ady = 5445 AACR, AbERT with || OAl| Sel] Al andj Ab|| Selo. en(A) =r <1, then A+ dA is nonsingular and Uivil 2 ler Tel © T= Proof. Since | A“AA|| < eA“! Al] = 7 < 1 it follows from ‘Theorem 2.3.4 that (A+ AA) is nonsingular. Using Lemma 2.3.3 ond the equality (I+ A™'AA)y = 2+ A“1Ab we find Wl < N+ 4-aay (Ded ef Am) L & > (Islet) Qeh+rizi) 2 (iz|+e] A“? ot) Since {| {| = Arf < || All] zl] it follows that 1 Wis 4 We are now set to establish a rigorous Ax = 6 perturbation bound. ‘Theorem 2.7.2 If the conditions of Lemma 2.7.1 hold, then dy-=i) me Ter $ r=7"4) (2740) Proof. Since yore = AMAb ~ AB Ay (27:1) wehave fy—zi] < ef A* Il] bl] + Amt AMM yl and so, w ex(ay ell tz (ragien tA ex(A) (+}4) = oy a w oy CHAPTER 2. MATRIX ANALYSIS Example 2.7.1 The Az =) problem (3 te ][3 +L] hha solution 2 = (1, 1)F and condition rae(A) = 108. IAb=(10-*, 0)", A= 0, Be At aay 848, cheng (he 0-8) 2)? nd te aay (20) ye Late tp w < Mable a) = 10th = 1 Tale © fol6 ‘Thus, che upper bound ia (2.7.10) can be arom oyerentimate ofthe eror induced by the persusbation. Ou the other band, if Ab= (0, 10°*)", AA = 0, and (A+A.Aly = bea, hen this inequality sys IS caxiotiot “Thus, thre ace perturbation for Which te bound in (2.710) semen axa. 2.7.5 Some Rigorous Componentwise Bounds ‘We conclude this section by showing that a more refined perturbation the- ‘ory is possible if componentwise perturbation bounds are in effect and if ‘we make use of the absolute value notation. ‘Theorem 2.7.8 Suppose A= AcR™, 04¢bER™ (AtAdly = b+Ab AAERM™™, AbERT ‘and that |AAJ < eA} and [Ab] < el. If éao(A) {is nonsingular and y= 2 oo 2 lleo Proof. Since j AA lle $ el Ao 804 |] 6 lo < el blo the conditions of Lemma 2.7.1 are satisfied in the infinity norm. This implies that A+ AA fs nonsingular and Iyllo . L+r T Welle * I=r Now using (2.7.11) we find ly—al < |ATHIAH + JA“ Al Tot S ATH Bl + Am Allyl 0 such that (A+ AA) = b+ Ab [AA n) to the Znorm condition of the matrionr oa[e x o in a] sod c= Notes and References for Sec. 2.7 ‘The condition concept ia thoroughly investigated in 4. Rice (1966). “A Theory of Condition” SIAM J. Num. Anal $, 287-310 ‘W. Kahan (1966). "Numerical Linear Algebra,” Canadian Math. Bull 9, 757-801. References for componentwise perurbation theory include 86 (CHAPTER 2, MATRIX ANALYSIS 'W. Outtll and W. Prager (1964). “Compatibility of Approximate Solutions of Linear ‘Equations with Given Esror Bounds lor Coeficiente and Right Hand Sides,” Numer. Math. 6, 405-409. LE. Cope aad BLW. Rust (1970). "Bounds on solutions of nystems with accurate dat” ‘SIAM J. Nar. Anok. 16, 950-63. RD. Shoe (1979), “Scaling for numerical stability in Gaussian Elimination” J. ACM 26, 404-828, LW, Demme (1992), "Phe Componeatwige Distance to the Nearest Singular Matrix,” ‘SIAM J. Motris Anal. Appl. 13, 10-19. DJ, Higham and 8 J. Higham (1992). “Componeatwiie Perturbation Theory for Linear ‘Syrtem with Maltiple Right-Hand Sides,” Lan Alg. and fts Applic. 174, 111-129, NJ. Higham (1994). °A Survey of Componenti Perturbation Theory in Numerical Linear Algebra,” ia Mathematics of Computation 1949-1988: A Half Century of Computational Mathematica, W. Gautsch (ed), Volume 48 of Proceedings of Sym- ona sn Applied Mathematics, American Mathematical Society, Providence, Rhode fend. ‘8. Chandrasaren and 1.C-P. Ipoea (1998). 
"On the Sensitivity of Solution Components in Linear Syetemw of Equations,” SIAM J. Matris Anal Appl 16, 99-112. ‘The reciprocal of the condition number measures how near 8 given Az = b problem ia to singularity. The importance of knowing bow nea given problem isto a dificult or fnsoluble problem bas come to be appreciated in many computational settings. See ‘A. Laub(1905). “Numerical Linear Algebra Aspects of Control Design Computations,” IBEE Trans, Auto. Cont. AC-S0, 9-108. 4. L, Baslow (1086). *On the Samallor: Positive Singulat Value of an Af-Matrix with Applications to Ergodic Matkov Chal,” SIAM J. Alp. and Disc. Struct. 7, 414 a EW, Demme! (1967). “On the Distance to the Nearot I-Powod Problem,” Numer: ‘Mach. 51, 281-229. LW, Demme (1988). “The Probability that a Numerical Analysis Problem is Difficult,” ‘Math. Comp. 50, 49-420. 1N4J. Higham (1989). “Matrox Nearnens Problems aad Applications,” in Appiiations of ‘Matrix Theory, M.J.C. Gover and 8. Baroets (eds), Oxford University Prem, Oxford UK 1-27. Chapter 3 General Linear Systems §3.1 Triangular Systems §3.2 The LU Factorization §3.3 Roundoff Analysis of Gaussian Elimination §3.4 Pivoting §3.5 Improving and Estimating Accuracy ‘The problem of solving a linear system Az = b is central in scientific computation. In this chapter we focus on the method of Gaussian elimi- notion, the algorithm of choice when A is square, dense, and unstructured. When A does not fall into this category, then the algorithms of Chapters 4, 5, and 10 are of interest. Some parallel Az = b colvers are discussed in Chapter 6. ‘We motivate the method of Gaussian elimination in §3.1 by discussing the ease with which triangular systems can be solved. The conversion of 1 general system to triangular form via Gauss transformations is then pro- sented in §3.2 where the “language” of matrix factorizations is introduced. Unfortunately, the derived method behaves very poorly on 8 nontrivial class of problems. Our error analysis in §3.3 pinpoints the diffcalty and moti- vates $3.4, where the concept of pivoting is introduced. In the final section ‘we comment upon the important practical issues associated with scaling, iterative improvement, and condition estimation. Before You Begin Chapter 1, §§2.1-2.5, and §2.7 are assumed. Complementary references include Forsythe and Moler (1967), Stewart (1973), Hager (1988), Watkins 88 Cuarrer 3. Genenat Linzar SYSTEMS (1991), Ciarlet (1992), Datta (1995), Highamn (1996), Trefethen and Bau (1996), and Demmel (1996). Some MaTLARfunctions important to this chapter are lu, cond, rcond, and the “backslash” operator “\". LAPACK connections include solutions with error bounds ‘with condition entinate Solve AX = B, ATX = BANK = B via PA = LU a Equitbration 3.1 Triangular Systems ‘Traditional factorization methods for linear systems involve the conversion of the given square system to. triangular system that has the same solution, ‘This section is about the solution of triangular systems. Waeeaee 3.1.1 Forward Substitution Consider the following 2-by-2 lower triangular system: 4: 0 ][n] fh fa ta} | con If falas #0, then the unknowns can be determined sequentially: a= hfs t= (bo~fnt1)/tn. ‘This is the 2-by-2 version of an algorithm known as forward substitution. ‘The general procedure is obtained by solving the ith equation in Lz = forse a ( - ¥en) / 3.1. ‘TRIANGULAR SySTEMS 89 If this is evaluated for i = 1:n, then 0 comploto specification of zis obtained. 
Note that at the ith stage the dot produet of L(j,1:i ~1) and e(1:i~ 1) i required. Since & only is involved in the formula for 24, the former may be overwritten by the latter: Algorithm 3.1.1 (Forward Substitution: Row Version) If L.¢ R™ 1s lower triangular and b € RR, then this slgorithum overwrites 5 with the solution to Lz = 6, L is assumed to be nonsingular, (1) = O(1)/E(1, 1) for i= 2m 2(4) = (6G) — LG Asi — 1)o0L4 — 1))/E64,3) end ‘This algorithm requires n? flops. Note that L is accemsed by row. ‘The computed solution 2 satisfies: (E+F)E = 6 [F| < nals) + O(W) Ga) For a proof, see Higham (1996). It says that the computed solution exactly satisfies a slightly perturbed system. Moreover, each entry in the perturbing ‘matrix F is small relative to the corresponding element of L. 3.1.2 Back Substitution ‘The analogous algorithm for upper triangular systems Uz = 6 is called back-substitution. The recipe for 2, is preseribed by = (Ee) and once again 6 can be overwritten by 2;. Algorithm 9.1.2 (Back Substitution: Row Version) If U ¢ RO" is upper triangular ond 6 € R°, then the folowing algorithm overwrites & ‘with the solution to Uz = 6. U'is assumed to be nonsingular, Hn) = H(n)/U (nn) fori=n- 1-11 H) = (OG) — UU 4 + Ln) + 1:0))/0G,4) end ‘This algorithm requires n? flops and accesses U by row. The computed solution 2 obtained by the algorithm can be shown to eatiefy (+r =o IF < nul|+0(u*), (8.1.2) % CHAPTER 3. GENERAL LINEAR SYSTEMS 3.1.3 Column Oriented Versions Column oriented versions of the above procedures can be obtained by re- versing loop orders. To understand what this means from the algebraic point of view, consider forward substitution. Once x; is resolved, it can >be removed from equations 2 through n and we proceed with the reduced system L{2:n, 2:n)z(2:n) = 6(2:n)—2(1)E(2:n, 1). We thea compute x2 and remove it from equations 3 through n, etc. ‘Thus, if this approach is applied (35 2]/2]- [2 wwe find 2; = 3 and then deal with the 2-by-2 system [3 8][2]=[2]-9[2] = [8] Here is the complete procedure with overwriting. Algorithm 3.1.3 (Forward Substitution: Columa Version) IfL ¢ R'™" is lower triangular and b€ RR", then this algorithm overwrites ® with the solution to Lz =. L is assumed to be nonsingular. for j=in=1 664) = G/L.) OG + Lin) = Dj + In) ~ BY)LG + 1:0, 7) end Hn) = Bln) /L (nn) It is also possible to obtain a column-oriented saxpy procedure for back- substitution. Algorithm 3.1.4 (Back Substitution: Column Version) IfU € R™" is upper triangular and b € R", then this algorithm overwrites 6 with the solution to Uz = b. U is assumed to be nonsingular. 2-12 86) = GG.) (ls = 1) = Hy ~ 3) BG) ~ 1,4) end a1) = 6q)/00,1) [Note that the dominant operation in both Algorithms 3.1.3 and 3.1.4 is the saxpy operation. The roundoff behavior of these saxpy implementations is essentially the same as for the dot product versions. ‘The accuracy of a computed solution to a triangular system is often surprisingly good. See Higham (1996). for j 3.1, TRIANGULAR SvsTEMS a 3.1.4 Multiple Right Hand Sides Consider the problem of computing a solution X ¢ R™* to LX = B where Le Re*” is lower triangular and Be RO™*. This is the multiple right hand side forward substitution problem. We show thet such » problem ccan be solved by a block algorithm that is rich in matrix multiplication assuming that q and n are large coough. This turns out to be important in subsequent sections where various block fectorization schemes are discussed. 
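Before developing the block algorithm, it is worth recording the scalar procedures in executable form. The following sketch transcribes Algorithms 3.1.1 and 3.1.2 into MATLAB; the test matrices are our own (the shift n*eye(n) merely keeps the diagonals comfortably away from zero), and the backslash solves are used only as a check.

    % Row-oriented forward and back substitution (Algorithms 3.1.1 and 3.1.2)
    n = 6;
    L = tril(randn(n)) + n*eye(n);       % nonsingular lower triangular test matrix
    U = triu(randn(n)) + n*eye(n);       % nonsingular upper triangular test matrix
    b = randn(n,1);

    x = b;                               % forward substitution: overwrite a copy of b
    x(1) = x(1)/L(1,1);
    for i = 2:n
        x(i) = (x(i) - L(i,1:i-1)*x(1:i-1))/L(i,i);
    end

    y = b;                               % back substitution
    y(n) = y(n)/U(n,n);
    for i = n-1:-1:1
        y(i) = (y(i) - U(i,i+1:n)*y(i+1:n))/U(i,i);
    end

    disp(norm(x - L\b))                  % both should be at roundoff level
    disp(norm(y - U\b))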
‘We mention that slthough we are considering here just the lower triangular problem, everything we say applies to the upper triangular case as well. ‘To develop a block forward substitution algorithm we partition the equar ton LX = B os follows: Lo | ox a cr) In Ia 0 Lin Ene Assume that the diagonal Menesan Paralleling the development of Algorithm 3.1.3, we solve the system Lu1X; = By for X, and then remove /X; from block equations 2 oe N: In 0 By— In Lon yy By— Lau X La bws «> Law By-LmXy Coming i tte vay we ca he tng oe spy vd elimi sation scheme: for j=1:N (4) Notice that the i-loop oversees a single block saxpy update of the form Bi Bysr sais (Z}-TEH EE)» By By ing For this to be handled as a matrix multiplication in » given architec ture it is clear that the blocking in (3.1.3) must. give sufficiently “big” X;. Let us assume thot this is the case if each X, has at least r rows. plished if NV = ceil(n/r) and X1,...,Xw—1 €R™* and 9 Cuapren 3. General Lingar SysTEMs 3.1.5 The Level-3 Fraction It is handy to adopt a measure that quantifies the amount of matrix multi- plication in a given algorithm. To this end we define the level-3 fraction of an algorithm to be the fraction of flops that occur in the context of matrix multiplication. We call such flops level-3 flops, Let us determine the level-3 fraction for (3.1.4) with the simplifying assumption that n = rN. (The seme conclusions hold with the unequal ‘blocking described above.) Because there are N applications of r-by-r forward elimination (the level-2 portion of the computation) and n? flops overall, the level-3 fraction is approximately given by Nr? 1 loa thoy ‘Thus, for large N almost all flops are level-3 flops and it makes sense to choose NV as large as possible subject to the constraint thet the underlying architecture can achieve a high level of performance when processing block saxpy's of width at least r= n/N. 3.1.6 Non-square Triangular System Solving ‘The problem of solving nonsquare, m-by-n triangular systems deserves some mention. Consider first the lower triangular case when m > 1, i Inj, - [% IneR™™ — eR" Jn }* = IneR™™ beRe Asoume thet Zi is lower triangular, and nonsingular. If we apply forward elimination to 1,2 = by then z solves the system provided Lai(Lj;'61) = ba, Otherwise, there is no solution to the overall system. In auch a case Jeost squares minimization may be appropriste. See Chapter 5 Now consider the lower triangular system Lx = 6 when the number of columns n excceds the number of rows ra. In this case apply forward substitution to the square system L(I:m, 1:ma)x(1:m, :m) = b and preseribe ‘an arbitrary value for 2(m + 1:n). See §5.7 for additional comments on systems that have more unknowns than equations. ‘The handling of nonsquare upper triangular systems is similar. Details are left to the reader. 3.1.7 Unit Triangular Systems A undt triangular matrix is a triangular matrix with ones on the diagonal. Many of the triangular matrix computations that follow have this added bit of structure. It clearly poses no difficulty in the above procedures. 3.1. TRIANGULAR SYSTEMS 93 3.1.8 The Algebra of Triangular Matrices For future reference we list a few properties about products and inverses of triangular and unit trlangular matrices. 4 The inverse of an upper (lower) triangular matrix is upper (lower) ‘triangular. ‘+ The product of two upper (lower) triangular matrices is upper (lower) triangular. 
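Before listing a few algebraic facts about triangular matrices, here is the block solve (3.1.4) in executable form. The sketch is ours and, for simplicity, assumes n is an exact multiple of the block size r; the backslash on the diagonal block stands in for the level-2 forward elimination, and the matrix-matrix product inside the loop is the level-3 work discussed in §3.1.5.

    % Block forward substitution for LX = B (cf. (3.1.4)); n = r*N assumed
    n = 12; q = 3; r = 4; N = n/r;
    L = tril(randn(n)) + n*eye(n);       % nonsingular lower triangular test matrix
    B = randn(n,q); B0 = B;              % keep a copy of B for the final check
    X = zeros(n,q);

    for j = 1:N
        rows = (j-1)*r+1 : j*r;
        X(rows,:) = L(rows,rows)\B(rows,:);        % level-2: solve with the diagonal block
        if j < N
            rest = j*r+1 : n;
            B(rest,:) = B(rest,:) - L(rest,rows)*X(rows,:);   % level-3: block saxpy update
        end
    end

    disp(norm(X - L\B0))                 % should be at roundoff level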
«The inverse of a unit upper (lower) triangular matrix is unit upper (ower) triangular, ‘+ The product of two unit upper (lower) triangular matrices is unit upper (lower) triangular. Problems PS.1.1 Give an algorithm for computing s nonzero 2 € RY much that Us = 0 where Ue FO** upper triangular with tpn = O and wo tnmicant #O- 3.1.2 Discuss how the determinant of a aquare triangular matrix could be computed ‘with iinimum risk of overton and undertow. 3.1.8 Rewrite Algorithm 3.1.4 given thet U is stored by column in a leagth n(n 1)/2 3.1.4 Welte a detailed version of (3.1.4). Do not amume that divides 1. Sen S27 € FO” we ape tpir nd ht(SF =A) = 8 no Eiaie Sela Gh opty abe soaps ae tas ce wea st SF rcne Ot ope. es ‘| é s=[$ Z]m[o a] % Sm tte oni) Te Tk = Benga eR serait ams Coo een snd we Trt i en solves (8474 — Ae = by. Obverve that 24 and wy = Tez cach require O(n) ope. PS.1-7 Seppone the matric Ry,...,.ly € FO" are all upper triangular, Give an (Olgm) algorithm for molviog the ayatem (Ry Rp~ A) = bamruming thatthe mazrix of coeficknts is nowsingulas, Hiat. Generalize the aoution tothe previous problem. Notes and Rafarancne for Sec. 3.1 ‘The accuracy of triangular syste solvers is analyzed in (NJ. Higham (1980). “The Accuracy of Solutions to Tiangular Syrems,” SIAM J. Num. ‘Anal. #6, 1252-1265, 4 (CHAPTER 3. GENERAL LINEAR SYSTEMS 3.2. The LU Factorization ‘As we have just soon, triangular systems are “easy” to solve. The idea behind Gaussian elimination is to convert a given system Az = b to an ‘equivalent triangular system. The conversion is achieved by taking appro- priate linear combinations of the equations. For example, in the system Szjtin, = 9 6ytte = 4 if we multiply the first equation by 2 and subtract it from the second we obtain Setie2 = 9 ~tr = -14 ‘This ia n = 2 Gaussian elimination. Our objective in this section is to give 8 complete specification of this central procedure and to describe what it does in the language of matrix factorizations. ‘This means showing that the algorithm computes a unit lower triangular matrix L and an upper triangular matrix Uso that A = LU, e. 35 1ojfs s ez} - [2a}lo -3j- ‘The solution to the original Ax = b problem is then found by a two step triangular solve process: ty=b Urey => Ara LUr= =b ‘The LU factorization is a “high-level” algebraic description of Gaussian elimination. Expressing the outcome of a matrix algorithm in the “ln guage” of matrix factorizations is a worthwhile sctivity. It facilitates gen- eralization and highlights connections between algorithms that may appear very different at the ecalar level. 3.2.1 Gauss Transformations ‘To obtain a factorization description of Gaussian elimination we need a matrix description of the zeroing process. At the n= 2 level if x3 # 0 and oN 1-02] More generally, suppose x € R® with zy #0. If Fe (QenOsreieit) n= Eo iektin (ecadomtirt) taken 3.2. Tue LU Facrorizarion 95 and we define Mz=1—r, 24) then oyf a of] = |_| a Mur = 0 tee | | 8 oe ae ° Tn general, a matrix of the form My =I- ref € R™" is a Gauss trans- Jormation if the first k components of r € R° are zero. Such a matrix is unit lower triangular. The components of (k + I:n) are called multipliers. ‘The vector r is called the Gauss vector. 3.2.2 Applying Gauss Transformations ‘Multiplication by a Gauss transformation is particularly simple. 1fC ¢ R™* and My =1—ref is 6 Gauss transform, then ae = U=re)C = C= 16o) = c-70(8,9. is an outer product update. 
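The outer-product character of this update is easy to see in code. The sketch below (our own naming, 0-based indexing, purely illustrative) forms the Gauss vector tau for M_k = I - tau*e_k^T and applies M_k to a matrix C as C - tau*C(k,:).

import numpy as np

def gauss_vector(x, k):
    # Multipliers tau for the Gauss transform M_k = I - tau*e_k^T that
    # zeros x(k+1:n); assumes the pivot x[k] is nonzero (0-based k).
    tau = np.zeros(x.shape[0])
    tau[k + 1:] = x[k + 1:] / x[k]
    return tau

def apply_gauss_transform(tau, k, C):
    # M_k C = C - tau * C(k,:), an outer-product (rank-1) update.
    return C - np.outer(tau, C[k, :])

x = np.array([0.0, 2.0, 8.0, 6.0])
tau = gauss_vector(x, 1)                                          # pivot x[1] = 2
print(apply_gauss_transform(tau, 1, x.reshape(-1, 1)).ravel())    # [0. 2. 0. 0.]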
Since 7(1:k) = 0 only C(k + 1:n,2) is affected and the update C= M,C can be computed row-by-row as follaws: fori=ktin Cb) = C6) — CUE, ond ‘This computation requires 2(n — 1}r flops. Brxample 8.2.1 aad c=[25 8 a6 0 3.2.3 Roundoff Properties of Gauss Transforms If is the computed version of an exact Gauss vector 7, then it is easy to verify that forte lelsuir 96 CHAPTER 3. GENERAL LINEAR SYSTEMS If + is used in o Gauss transform update and fi((I ~ #¢f)C) denotes the computed result, then fU(I-#ef)C) = (-7ef)C+ 8, where JBL < 3u(ICl + IrllOCk 2) + O(0). Clearly, if 7 has large components, then the errors in the update may be Jarge in comparison to |C|. For this reason, care must be exercised when Gauss transformations are employed, a matter that is pursued in §2.4. 3.2.4 Upper Triangularizing Assume that A IR™". Gauss transformations Mi,...,Ma—1 can usually be found such that My-1-++MzMyA = U is upper triangular. To oe this we first look at the n = 3 case. Suppose ie then 14 7 MA={0 -3 -6 0-6 -1 Likewise, 1 00 1467 M=/|0 10/ + aMa(Mid)=|0 -3 -6 0-21 oo. Extrapolating from this example observe that during the kth step “© We are confronted with a matrix AM“) = Myay---MA that is upper triangular in columns 1 to k—1. ‘+ The multiplier in My are based on A—1)(k 4 1:n,A). In particular, we need aff") + 0 to proceed Noting that complete upper triangularization is achieved after n— 1 steps wwe therefore obtain 3.2. THE LU FAcToRizaTion 7 kel while (A(k,&) #0) & (k 3.2.8 Solving a Linear System Once A has been factored via Algorithm 3.2.1, then Z and Vare represented jn the array A. We can then solve the system Az = b via the triangular systems Ly = b and Uz = y by using the methods of §3.1. Example 8.2.2 If Algorithm 32.1 is applied to I a TEETTE 4a] -[}44] Hd = OLD, then y = (1)-1,0)7 aohee Ly = band = = (-1/3,1/2,0)7 solves ue=y. 3.2.9 Other Versions ‘Gaussian elimination, like matrix multiplication, is a triple-loop procedure that can be arranged in several ways. Algorithm 3.2.1 corresponds to the “hij” version of Gaussian elimination if we compute the outer product: update row-by-row: for k=Ln=1 AE + I:n, k) = A(k-+ Ln, B)/A(K,&) for isk+iin for j=k+in Alia) = Ali) ~ AGRA) end end ond ‘There are five other versions: yi, thy, ijk, jik, and jh. The last of these results in an implementation that features a sequence of gaxpy’s and for- ward eliminations. In this formulation, the Gauss transformations are not 100 Onaprer 3. GENerat Linear SysTEMS immediately applied to A as they aro in the outer product version. Instead, their application is delayed. The original A(;,j) is untouched until step 3. ‘At that point in the algorithm A(,j) is overwritten by Mj-1--- MiAG3) The jth Gauss transformation is then computed, ‘To be precise, suppose 1 j and AG, j) is overwritten with U(i,j) fj 24. d=1 while A n case is illustrated by eile a 12 poyfi 2s 46 0-3 -6 depicts the m n case we modify Algorithm 3.2.1 as follows: for k= in rows =k + 1m A(rows, k) = A(rows,)/A(k, ) ifk j we have B,(1:j- 1,15 ~1) = 1-1. It follows that each Ma is 8 Gauss transform with Gauss vector = Eyy-+-Eysir®). 0 ‘As a consequence of the theorem, itis easy to see how to change Algorithm 3.4.1 40 that upon completion, A(i,j) houses L(i,j) for all i > 3. We merely apply each E to all the previously computed Gauss vectors. This is accomplished by changing the line “A(k,k:n) > A(u,k:n)” in Algorithm BA to “Ak, I:n) = AU Lin)” Example 8.4.2 The octorzation PA = LU ofthe matrix in Bvample 3.4.1 ia given by oo1){[s 7 wo 1 0 o}fs % -2 pool? 
4 2/=;12 10fjo s w orojle w a} [is - ijlo 0 6 3.4.5 ‘The Gaxpy Version In §3.2 we developed outer product and gaxpy schemes for computing the LU factorization. Having just incorporated pivoting in the outer product version, itis natural to do the same with the gaxpy approach. Recall from (3.2.5) the general structure of the gaxpy LU process: Lal er) for j= in j=. ‘uGjin) = An.) else Solve L({Iij ~ 1 Lj ~ jz = A(Lj ~ 1,4) for and set U(13j ~ 1,3) = 2 u(jin) = Aljin,3) — LG 13 ~ 1) ifi1 UG 15 =) > L(y —1) end UG,5) =») end In this implementation, we emerge with the factorization PA = LU where P = By.y:-- Ey where Ey is obtained by interchanging rows k and p(k) of the n-by-n identity. As with Algorithm 3.4.1, this procedure requires 2n°/3 flope and O(n?) comparisons. 3.4.6 Error Analysis We now examine the stability that is obtained with partial pivoting. This ‘Fequires an accounting of the rounding errors that are sustained during elimination and during the triangular system solving, Bearing in mind ‘that there are no rounding errors associated with permutation, it is not hard to show using Theorem 3.3.2 that the computed solution ¢ satisfies (A+ B)E = b where II s mu(3]Al + SPTLICI) + fw). (943) Here we are assuming that P, £, and U are the computed analogs of P, E, and U as by the above algorithms. Pivoting implies that the elements of L are bounded by one. Thus fj £ joo } 0 othecwion thea A han an LU fectorization with [fy] $1 and tian = 29-1 3.4.7 Block Gaussian Elimination Gaussian Elimination with partial pivoting can be organized so that it is rich in level-3 operations. We detail » block outer product procedure but block gaxpy and block dot product formulations are also possible. See Dayde and Duff (1988). ‘Assume A € IR" and for clarity that n= rv. Partition A as follows: Aun An a> [a2 3.4. Prvorina ny ‘The first step in the block reduction is typical and proceeds as fallows: ‘* Use scalar Gaussian elimination with partial pivoting (eg. 9 rectan- {gular version of Algorithm 3.4.1) to compute permutation P; € R°**, unit lower triangular Ii, € J" and upper triangular Uy; € IR'™" 50 a(] = [2]. ‘© Apply the P, across the rest of A: (]-9(% ‘= Solve the lower triangular multiple right hand side problem Lute = Aw Perform the level-3 update A= dn-lWn. With these computations we obtain the factorization bu Ou Us PAal oe nels al 0 an ‘The process is then repeated on the first r columns of A, In general, during step k (1 < & < N ~ 1) of the block algorithm we apply scalar Gaussian elimination to a matrix of size (n ~ (k— I)r)-by-r. ‘An r-by-(n — kr) multiple right hand side system is solved and a level 3 Update of size (n — kr}-by-(n ~ kr) ia performed. The level 3 fraction for the overall proceae is approximately given by 1 ~ 3/(2N). Thus, for large N the procedure is rich in matréx multiplication. 3.4.8 Complete Pivoting ‘Another pivot strategy called complete pivoting has the property that the associated growth factor bound is considerably smaller than 2*-1. Recall that in partial pivoting, the kth pivot is determined by scanning the current subcoluma A(k:n, k). In complete pivoting, the largest entry in the cur- rent submatrix A(k:n, kn) is permuted into the (K,) position. Thus, we ‘compute the upper triangularizasion My-1Ey-1:*-MiE\AFi -~ Fant =U with the property thet in step & we are confronted with the matrix ACCU My Bee MU ELAPY Feat us CHapTen 3. 
GENERAL LINEAR SYSTEMS ‘and determine interchange permutations E and Fy, such thet ‘We have the analog of Theorem 3.4.1 ‘Theorem 3.4.2 If Gaussian elimination with complete pivoting is used to compute the upper triangularization My Ennis Mi EVAR + Fao = 0 (4.7) then PAQ = LU where P = Enai-- Ey, Q= Fy--- Fans and L is a unit lower triangular matriz with [,5| <1. The kth column of L below the diagonal is a permuted version of the kth Gauss vector. In particular, if My = 1-7 eT then Lk-+ Link) = g(k-+ 1) where 9 = Eni ---Enyit) Proof. The proof is similar to the proof of Theorem 3.4.1. Details are left to the reader. Here is Gaustian elimination with complete pivoting in detail: Algorithm 3.4.2 (Gaussian Elimination with Complete Pivoting) ‘This algorithm computes the complete pivoting factorization PAQ = LU where Lis unit lower triangular ond U is upper triangular, P= Ea1--- Ey and Q = Fi-++ Fy are products of interchange permutations. A(I:k,) i overwritten by U(I:k,k), = Ln. A(k + Len, k) is overwritten by L(k + Im,k),k = Ln 1. Ey interchanges rows and p(k). Fy interchanges colurnns & and 9(F). for k=1:n=1 * Determine p with k - Fp satisfy fal} < P.M (3.48) ‘The upper bound is a rather slow-growing function of k. This fact coupled with vast empirical evidence suggesting that pis always modesty sized (e.g, ‘p= 10) permit us to conclude that Goussian elimination with complete ‘pivoting is stable. The method solves & nearby linear system (A+ E)z = b ‘exactly in the sonse of (3.3.1). However, there appears to be 0 practical Justification for choosing complete pivoting over partial pivoting except in cases where rank determination is an issue, xample 3.4.8 If Gaussian elimination with complete pivoting a applied to the prob- gy 19 ][ a]. [3m im 20 | vmaposns-n tual ra(To) e-[t a). #= (38 tS]. ¢ and # = 1.00, 1.00]? Compare with Examples 3.3.1 and 3.43, 3.4.10 The Avoidance of Pivoting For certain clases of matrices it is not neosssary to pivot. It is important ‘to identify such classes because pivoting usually degrades performance. To 120 CHAPTER 3. GENERAL LINEAR SYSTEMS ilustrate the kind of analysis required to prove that pivoting can be safely avoided, we consider the case of diagonally dominant matrices. We say that Ae RO" is strictly diagonally dominant if lel > Slagle m The following theorem shows how this property can ensure a nice, no- pivoting LU factorization. Theorem 3.4.3 If A” is strictly diagonally dominant, then A has an LU factorization and \lu| <1. In other words, if Algorithm 3.4.1 is oppled, then P =I. Proof, Partition A as follows a wt a- [0%] where a is I-by-1 and note that after one step of the outer product LU process we have the factorization a ut 1 ojft 0 aut vc via }[0 C-w7/a}lo f |° ‘The theorem follows by induction on n if we can show that the transpose of B= C vw" /axis strictly diagonally dominant. This is because we may ‘then assume that 3 has an LU factorization B = LU; and that implies 4=[o a) [3%] = But the proof that 8 is strictly diagonally dominant is straight forward. From the definitions we have et et at You Teg saal < Shey + tS ot @ ms Ey a (het ~ toy + algal ~ les~ 84] = foyl.0 ”“ w 34. Prvotina in 3.4.11 Some Applications We conclude with some examples that illustrate how to think in terms of matrix factorizations when confronted with various linear equation situa- tions. ‘Suppose A is nonsingular and n-by-n and that B is n-by-p. Consider the . the multiple right hand side problem of finding X (n-by-p) 90 AX = By i problem. 
If X = (21,...,2p] and B = [by thea Compute PA = LU. for k= 1p Solve Ly = Phy (3.4.9) Solve Ura = y end Note that A is factored Just once, If B = Jy then we emerge with a computed A! , ‘As another example of getting the LU factorization “outside the loop,” suppose we want to solve the linear system Atz = b where Ac RO", b€ R®, and k is a postive integer. One approach is to compute C = At and then solve Cz = 8. However, the matrix multiplications can be avoided altogether: Compute PA = LU for j= Lk Overwrite b with the solution to Ly=P8. (3.4.10) Overwrite 6 with the solution to Uz = 6. end As a final example we show how to avoid the pitfall of explicit inverse computation. Suppose we are given AG R"*", de R®, and ce R" and ‘that we want to compute s = c" A~'d. One approach is to compute X = A a suggested above and then compute s =<" Xd. A more economical procedure is to compute PA = LU and then solve the triangular systems Ly = Pd and Uz = y. It follows that s = c"s. The point of this example is to stress that when a matrix inverse is encountered in a formals, we must think in terms of solving equations rather than in terms of explicit inverse formation. Probleme 9.4.1 Let A= LU be the LU factorization of m-by-n A with [tj} $1. Let oF and wT Sushi sw of aad pec. Waly te pan fad Deal 122 Chapren 3. GENERAL Linear Systems ‘and ue it to ahow that IU fae S 2°" Allan « (Hint: Take nora and use induction.) 3.42 Show that if PAQ = LU is obtained vie Gaumian elimination with complete Pivottng, then no clement af (im) ia large in abeotuee valve than S43 Soppose A.C FE** hen an LU factorization and that Land U are known, Give ‘a algorichm which can compata tha (i) entry of A“ in appresimately(0=3)?+(n—i)? ope. 3.4.4. Suppose isthe computed averse obtained via (8.4.9). Give an upper bound for [AR ~ Ip. 3.4.5. Prove Theorem 342. P3.4.6 Extend Algoritim 3.43 90 thas it can factor an arbitrary rectangular matrix, 'PS.4.7 Write detailed verion of the block eliminasion algorithm outlined in $8.47 ‘Notes and References for See. 3.4 ‘Am Algol version of Algorithm 3.4.1 a given in HJ. Bowdir, RLS. Martin, G. Petory and J.Hl Wilkiawon (1966), “Solution of Real ‘and Complex Systems of Linear Equations.” Numer. Moth 8, 217-34. See also Witkinsom and Reiaach (1971, $9-110). “The conjactare tat [| Hy oof d flee ‘The idea behind their estimator is to choose d so that the solution y is large in norm and then set = WA lleall ¥ Hoo/t d lleo- ‘The succeas of this method hinges on how close the ratio fy Hoo/II tlw it to its maximum value fA“? Io. ‘Consider the case when A = T'ia upper triangular. The relation between 4 and y is completely specified by the following column version of back substitution: p(L:n) =0 3.5. IMPRoviNa aN Estimaria AccURACY 129 for k= m—1:1 ‘Choose dk). y(k) = (ie) ~ w(k))/T Uk, k) (3.5.6) (ck —1) = p(k = 1) + (A)T(E 1,4) end Normally, we use this algorithm to solve a given triangular system Ty = d. Now, however, we are free to pick the right-hand side d subject to the “constraint? that y is large relative to d. ‘One way to encourage growth ia y is to choose d(k) from the set (-1,41} s0 as to maximize y(h). If p(é) 2 0, then set d(k) = -1. If p(k) <0, then set d{k) = +1. In other words, (3.5.6) is invoked with d(k) = -sign(p(k)). Since d is then’ vector of the form d(Qin) = (l,...4£1)", we obtain the estimator fs = IT lool y loo- A more reliable estimator results if d(k) < {—1,+1} is chosen so a5 to encourage growth both in y(k) and the updated running sum given by p(k ~ 1,k) + 7(L:k ~ 1,b)y(R). 
In particular, at step k we compute Mlky* = (1 ~ R))/T (OR) Iy()*] + I] p(Lsk — 1) + Tsk — 1,8) (A) th fk)” = (—1 ~ pk) /T IR, &) 5(K)~ = [y(R)ML + Uh p(k = 1) + TCR ~ 1,4)" Th, st and set w(a)* if a(k)* > a(k)- (kh) = { u(k)~ if a(k)* < afk) This gives Algorithm 3.5.1 (Condition Estimator) Let T ¢ Fe*" be a nonsin- ‘gular upper triangular matrix. This algorithm computes unit oo-norm y tnd a scalar 6 90 [ Ty loo = 1/h Too and 8% to(T) Hin) =0 for k= n:— 1d 1 = pik) /TUK, k) —1 ~ p(k))/T(b, ) pL 1) + P(GK ~ 1, Buk) 2k = 1, Ky) 130 CHAPTER 3. Ganenat LInzaR SYSTEMS. Af |y(R)*L + MC)? th 2 WRT +H CR)m i(k) = y(b)* vl pick — 1) = ptk)* else uk) = yf)" pick 1) = a(h) end ond m= Hy loll T los v=vilule ‘The algorithm involves several times the work of ordinary back substitution. ‘We are now in @ position to describe a procedure for estimating the condition of e square nonsingular matrix A whose PA = LU factorization wwe know: '» Apply the lower triangular version of Algorithm 3.5.1 to UT and ob- tain a large norm solution to UT y = d. '» Solve the triangular systems L7r = y, Lw = Pr, and Uz = w. Re =H A Noll #Hoo/ IF Hoos Note that z loc Sl) AW oll” ln. The method is based on several heuris- tics. First, if A is ll-conditioned and PA = LU, then itis usually the case ‘that U is correspondingly ill-conditioned. The lower triangle L tends to be fairly well-conditioned. Thus, it is more profitable to apply the condition estimator to U than to L. The vector r, because it solves ATPTr = d, ‘tends to be rich in the direction of the left singular vector essociated with Omin(A). Righthand sides with this property render large solutions to the problem Az =r. In practice, it is found that the condition estimation technique that we have outlined produces good order-of- magnitude estimates of the actual condition number. Probleme P3.5.1. Show by example that there may be more than ooe way to equilibrate a matric. P8.6.2 Using 9 = 10,¢ = 2 arthmetic, ive uosypayo[t (3 S)f2)-(0] ‘wing Gaatsan elimination with pasta pivoting. Do one etep of erative insprovement ting t= 4 arthowticto compute the reaidual (Do wot forget to round the computed Cendual to we dig) 3.5.8 Suppone P(A+ 5) = LO, where Piss permtation, Lis lowee triangular with (Gs 1, and O in upper trnngulac. Show that du(A) 2 [A lo/(U Ele +) where 3.5. IMPROVING AND ESTIMATING ACCURACY 131 4 = min [él Conclude that if» amall pivot is encountered when Gaussian elininesion ‘wih pivoting ia applied to A, then A ia Uhconditionnd. The converse is not true. (Let 3.5.4 (Kahan 1965) The system Az = b where a ee 2(1-+ 19-20) Am | -1 107% 7 | y= | ton 1 191 19-10 0 thas solution = = (10-10 —11)F. (a) Show that if (A + B)y = b and [61 < 10°*14I, then (24 $ 10-*[z|, That ia axall relative changes in A's entrion do not induce Large changes io even though w(A) = 10'°. (b) Define D = diag(l0~*, 108,10"). Show ‘eee(DAD) £5. (c) Explain what is going on in term of Theorem 27/3 3.6.8 Consider the matrix: 10 M o1 -M oM oo 0 1 (What ontimate of roo(T7) Sa produced when (35.6) is applied with d(k) = ‘What estimate does Algorithm 3.5.1 produce? What it the 17ue Roa(T)? 3.5.6 What does Algorithm 35.1 produce when appli to the matric By given ic (rar signip(¥))? Notes and References for See. 3.5 “The following papers are concerned with the sealing of Az b problems: FL. Bauer (1965). “Optimally Sealed Matsican,” Nemer. Math. 5, 73-87 PLA. Businger (1968). “Matrices Which Can be Optimally Sealed,” Numer. Math. 18, Me-48. ‘A. var der Sluis (1969). 
“Condition Numbers and Baguilibration Matricen™ Numer. Math, 14, 16-23, A. van der Sluis (1970). “Condition, Equilbation, and Pivoting in Linear Algebrale Syatema” Numer. Math. 15, 1-36. ©, McCarthy and G, Strang (1873). “Optimal Conditioning of Matskes” STAM J. Nam. Anal. 10, 370-88. 'T, Feaner aod G. Loizou (1914). “Some New Bounds on the Condition Numbers of ‘Optimally Scaled Matrces,"J. ACM #1, S14-24. ‘GH. Golub and 42M. Vara (1978). “On a Characterization ofthe Bast La-Scaling of = Matrix” SIAM J. Mum. Anal 12, 472-79, R. Skeei (1979). “Scaling for Numerical Stability in Gaumian Elimination,” J. ACM. 26, 494-528, . Skeel (1981). “EMfect of Equlibration on Residual Ske for Partial Pivoting,” SIAM Nam. Anal 18, 0-85. Part of the difiulty im walling enncerne the selection of « norm ia which to maagure fort. An interesting discussion ofthis Frequently overlooked point appear in (W, Kahan (1966), “Numerical Linear Algebra,” Canadian Math. Bull. 9, TST-801. For a rgurour analysis of Iterative improvement and related matters, soe M, Jankowski sod M. Weaniakowski (1977). “Itartive Refinement Implias Numerical Stability” BIT 17, 303-311. 132 (CHAPTER 3. GENERAL LINEAR SYSTEMS CB. Moler (1967). “Iterative Refioement in Floating Polnt” J. ACH 14, 816-71. RCD, Skool (1980). "Rarative Refinement implies Numerical Stability for Gans lim. Into,” Math. Comp. 38, 817-832 G.W. Stewart (1961). “On the Implicit Deflation of Nearly Singular Systems of Linear ‘Bauationn” STAM J, Sei and Slat. Comp. 2, 196-140. ‘The condition estimstor that we described is given in AX, Cling, CB. Moler, GW. Stewart, and J.H. Wilkinson (1979). “An Batimate for the Condition Number of « Matz.” STAM J. Neon. Anal. 16, 38-75 (Other references concerned with the condition estimation problem include C.G. Broyden (1973). “Some Condition Number Bounds for the Gaussian Elimination Procem," J. Inst, Math Applic. 18, 273-86. FF. Lemeire (1973). “Bounds for Condition Numbers of Tyiangular Value of a Matrix,” Lin Alg. and Tis Applic. 11, 1-2. IS. Varga (1976). *On Diagonal Dominance Arguments for Bounding | AW! ao,” Lin. “Alp. and Its Applic. 14, 211-17. G.W. Stewart (1960). “The Eficieat Generation of Random Orthogonal Matrices with ‘a Application to Condition Eatimators” SIAM J. Num. Anak 17, 42-9. D.P. O'Leary (1980). “Eatimating Matrix Condition Numbers," SIAM J. Sek Stat Gomp. 1, 205-2. RG. Grimes and J.G. Lewis (1961). “Condition Number Estimation for Spare Matti ce" SIAM J. Sci. and Stat. Comp. 2, 364-88 ‘AK. Cline, A.R. Gons, and C. Van Loan (1982). "Generalizing the LINPACK Condition ‘stimator,” in Numerical Analysis ed, JP. Henuart, Lecture Notes in Mathematics no. 909, Springe-Verlag, New Yor. AK. Cline and FLK. Rew (1989). “A Set of Gounter example to Three Condition ‘Number Estimstoss," S1AM J. Set. and Stat. Comp. 4, 02-6. W. Hager (1984). "Condition Eatimates.” SIAM J. Sct and Stat. Cora. 5, 311-316. [NJ Higham (1987). "A Survey of Condition Number Satimation fr Triangular Mate “con.” SIAM Review 29, 575-596. [N.J. Higham (1986). “Fortran Codes fr Eatiating the One-norm of « Real or Complex ‘Matrix, with Applications to Condition Eatimation,” AGM Trans. Math. Soft. 14, 381-296. [NJ Higham (1988). “FORTRAN Codes for Estimating the One-Norm of & Real or ‘Complex Matrix with Applications to Condition Eatimation (Algorithm 674)," ACM Trans. Math. Soft. 14, 81-396. (C.H. Bischof (1900). “Incremental Condition Katimation,” SIAM J. Matris Anal. Appl 11, 644-850, (CH. Blbehof (1990). 
“Incremental Condition Extimation for Spars Matrices.” STAM J. Matris Anal. Appl 11, 312-322 G. Auchnuty (1091). “A Posterirt Error Betimates for Linear Equations," Numer. Math 61, 1-6 NJ. Higham (1988). Optimisation by Direct Search in Matrix Computations.” SIAM J. Metre: Anal Appl. 14, 317-333 DJ. Higham (1995). "Condition Numbers and ‘Their Condition Numbers” Lin. Alp. ‘ond Mis Applic. #14, 195-213. Chapter 4 Special Linear Systems §4.1 The LDMT and LDLT Factorizations $4.2 Positive Definite Systems §4.3 Banded Systems §4.4 Symmetric Indefinite Systems 54.5 Block Systems §4.6 Vandermonde Systems and the FFT §4.7 Toeplitz and Related Systems It is @ basic tenet of numerical analysis that structure should be ex ploited whenever solving a problem. In numerical linear algebra, this trans- lates into an expectation thst algorithms for general matrix problems can be streamlined in the presence of such properties as symmetry, definiteness, and sparsity. This is the central theme of the current chapter, where oUF principal aim is to devise special algorithms for computing special variants of the LU factorization. We begin by pointing out the connection between the triangular fac- tors L and U when A is symmetric. This is achioved by examining the LDM® factorization in §4.1. We then tura our attention to the important case when A ip both symmetric and positive definite, deriving the stsble ‘Cholesky factorization in §4.2. Unsyzametric positive definite systems are also investigated in this section. In §4.3, banded versions of Gaussian climi- ‘ation and other factorization methods are discussed. We then examine the interesting situation when A is symmetric but indefinite. Our treatment of this problem in §44 highlights the numerical analyst's ambivalence towards pivoting. We love pivoting for the stability it induces but despise it for the 138 1M CuapTeR 4. SPECIAL LINEAR SySTEMS structure that it can destroy. Fortunately, there is a happy resolution to this conflict in the symmetric indefinite problem. ‘Any block banded matrix is also banded and so the methods of §4.3 are applicable. Yet, there are occasions when it pays not to adopt this point of ‘view. Te illustrate this we consider the important case of block tridiagonal systems in §4.5. Other block systems are discussed as well. Im the final two sections we examine some very interesting O(n?) algo- rithms that can be used to solve Vandermonde and Toeplitz systems. Before You Begin Chapter 1, §§2.1-2.5, and §2.7, and Chapter 3 are sssumed. Within this chapter there are the following dependencies: 545 1 M1o+ M2 — fa + Me 1 ue — 547 Complementary references include George and Liu (1981), Gill, Murray, and Wright (1991), Higham (1996), Trefethen and Bau (1996), and Demmel (1996). Some MATLAB functions important to this chapter: chol, tril, triu, vander, toeplitz, fft. 
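For readers who work in Python rather than MATLAB, rough NumPy/SciPy counterparts of the functions just listed are sketched below; note that numpy.linalg.cholesky returns a lower triangular factor, whereas MATLAB's chol returns an upper triangular factor by default.

import numpy as np
from scipy.linalg import toeplitz

A = np.array([[4.0, 1.0], [1.0, 3.0]])
G = np.linalg.cholesky(A)                        # chol: A = G G^T, G lower triangular
L = np.tril(A)                                   # tril: lower triangular part
U = np.triu(A)                                   # triu: upper triangular part
V = np.vander(np.array([1.0, 2.0, 3.0]))         # vander: Vandermonde matrix
T = toeplitz([1.0, 2.0, 3.0])                    # toeplitz: Toeplitz matrix from its first column
y = np.fft.fft(np.array([1.0, 0.0, 0.0, 0.0]))   # fft: discrete Fourier transform
print(np.allclose(G @ G.T, A))                   # True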
LAPACK connections include TAPAGK: General Band Matrices Salve AK = (Condition axtimator Improve AX = By ATX = B, AMX = B polation with error boande Solve AX = B, ATX = B, AMX = B with condition timate PA=IU Solve AX = B, ATX = B, AMX = B via PA = LU quilbeasion ggaeage LAPACK: General Tridiagonal Matrices Salve aX = 5 Condition etimator Improve AX = By ATX = B, AM Solve AX = 8, ATX = B, Al, Pan LU Solve AX = B, ATX = B, AMX = B via PA = WV aay solution with exror bounds ‘with condition entimate Solve AX = B with condition eetimate vis A= Gor Hun 4.1, TRE LDMT anp LDLT Factorizarions 135 TAPAGK: Banded Symmetric Positive Definite Pas] Save AX = Bh acon | Condition extimate via A = GGT panr3 | Improve AX = B solitons with eer bounds PBSVE | Salve AX = 5 with condition animate Pore Pers a=Gat Solve AX = B vis A= GOT LAPACK: Tridiagonal Symmetric Positive Definite Salve AX = BF Condition etimate vin A= LDLT Improve AX = B solutions with eror bounds Solve AX = B with condition timate A= LDL Solve AX = B via A= LDLT FEEFEE LAPACK; Full Symmetric Indefinite TERN] Sohe AX= 5 ‘sveaw | Condition eximate via PAP? cot “srars | Improve AX = B solutions with error bounds Tavsvi_| Soke AX = B with condition ertimate Tse | PaPT = cost Terms | Sow AX = 8 via PAPT = LDLT ism [at LAPACK: Triangular Banded Matrices TTB] Condition eatimate Trans | Improve AX = By ATX = B solutions with error bounds Troms | solve AX = 8, ATX = B 4.1 The LDM? and LDL? Factorizations We want to develop a structure-expioiting method for solving symmetric ‘Az = problems. To do this we establish a variant of the LU factorization in which A is factored into a three-matrix product LDMT where D is diagonal and L and M sre unit lower triangular. Once this factorization is obtained, the solution to Az = b may be found in O(n?) ops by solving Ly = b (forward elimination), Dz = y, and MTz = = (back substitution). ‘The reason for developing the DM" factorization Is to set the stage for the symmetric case for if A= AT thea L = M and the work associated ‘with the factorization is half of that required by Gaussian elimination. The issue of pivoting is taken up in subsequent sections. 4.1.1 The LDMT Factorization ur first result connects the LDMT factorizetion with the LU factorization. 136 CHAPTER 4. SPECIAL LINEAR SYSTEMS ‘Theorem 4.1.1 if all the leading principal submatrices of A.& RO" are nonsingular, then there exist unique unit lower triangular matrices L and M and a unique diagonal matriz D = diag(41,...,d,) such that A= LDM?. Proof. By Theorem 3.2.1 we know that A has an LU factorization A = LU. Set D = ding(d;,...,dn) with dy = ws for i = Im. Notice that D is non- singular and that M? = D-'V is unit upper triangular. Thus, A= LU = LD(D-'U) = LDMT. Uniqueness follows from the uniqueness of the LU foctorization as described in Theorem 3.2.1. 0 ‘The proof shows that the LDM™ factorization can be found by using Gaus- sian elimination to compute A = LU and then determining D and Mf from the equation U = DM". However, an interesting alternative algorithm can be derived by computing L, D, and M directly. ‘Assume that we know the first j ~1 columns of L, diagonal entries ddy,.-.ydj-1 Of D, and the first j—1'rows of M for some j with 1<3 } , with dy if i = j, and with my ifi j and with d, if for { Compute v(1:). } fori=3j~1 vi) = AG AGA) end 0G) = AG, §) — AG.1j ~ 1) { Store d{j) and compute L(j + 1:n,3). 
} AG, 3) =o) AG + 10,3) = (AG +n, j) = AG +n Ly o(:j ~ 1))/oG) end ‘This algorithm requires n*/3 flops, about half the number of flope involved in Gaussian elimination. Tn the next section, we show that if A is both symmetzic and positive definite, then Algorithm 4.1.2 not only runs to completion, but is extremely stable. If A ls symmetric but not positive definite, then pivoting may be necessary and the methods of §4.4 are relevant. Example 41.2 10 2 30 1 oe o}fmoojfi 23 ms wl=/2 10 soljo.s 30 80 im san orjloor ‘ad 90 if Algorithm 4.1.2 i applied, A is oreewritten 1 Problems P4.LL1 Show that the LOMT ‘actormsion of « nonsingular A is unique iit exits. P42 Modify Algrthm 411 00 that & computes «lactation ofthe form PA = EDM, where Land M are both anit lower tnngslr, D i diagonal, ad P ie 2 Permutation that is choven [dl <1 4.1.8 Suppose the mby-m eymmetric matrte A = (ay) is stored ina vector ¢ 08 follows © (24,890: 420,-2-.dng--v qq) Rewrite Algorithm 4.1.2 with A stored inthis faion. Gat a tuck indeing outside the inner oops as pombe 140 Cnapren 4. SPECIAL Livgar SySTEMS 4.1.4 Rewrite Algorithm 4.1.2 for A sored by diagonal. See §1.28. Notes and Reference for Sec. 4.1 ‘Atgocthan 41.1 is related to the methods of Crout and Doolittle in thet outer product ‘updates are svokded. Soe Chapter 4 of Pox (1064) or Stewart (1073,131-140). An Algol procedure may be found in 113, Bomdler, RS. Mattia, G. Peters, and 141, Wilkinson (1066), “Solution of Real and ‘Complex Systems of Lineer Equations,” Numer. Math. 8, 217-294, See also GE. Formytbe (1960). *Crout with Pivoting.” Comm. ACM 3, 507-06. ‘WM. MeKeoman (1962). “Crout with Equilbraton aod Iteration,” Comm. ACM 5, ‘350-55. Justo algorthrns can bo tailored to exploit structure, oo can eror analyain and pertur- ‘bation theory: 1M. Atioli, J, Dema, sd I. Duff (1980). “Solving Sparse Linear Sysama with Sparse Backward Error,” STAM J. Matrix Anck Appl. 10, 165-100. JLR Bunch, J.W. Dempel, snd C.F. Van Loan (1099). “The Strong Stability of Algo- rithine for Solving Symmetric Linear Systems,” STAM J. Mains Anal. Appl. 10 o499. ‘A. Barrlund (1991). “Perturbation Bounds for the LDL? and Li Decompositions,” BIT 31, 358-383, DJ, Higham and N-J. Higham (1962). "Beckwacd Error and Condition of Structured Linear Syetems,” SIAM J. Matrix Anal. Appt. 13, 162-173. 4.2 Positive Definite Systems A matrix A € IR™*" is positive definite ifz7 Az > 0 for all nonzero z ¢ RR. Positive definite systems constitute one of the most important classes of special Az = 6 problems. Consider the 2-by-2 symmetric case. If anf on on on 7 is positive definite then z => alér = ay>0 z > az > 0 z = 441 + 2ai2 +02 > 0 2 = ay, ~ 2a +ax > 0. The last two equations imply |ara| < (a1: +22)/2. From these resulta we see that the largest entry in A is on the diagonal and that itis positive. This turns out to be true in general. A symmetric positive definite matrix has 3 “weighty” disgonal. The mass on the diagonal is not blatantly obvious 4.2. Posrmive Dernrs Systems ur 1s in the cose of diagonal dominance but it has the same effect in that it precludes the need for pivoting. See §3.4.10. ‘We begin with a few comments about the property of positive definite- ‘ess and what it implies in the unsymmetric case with respect to pivoting. ‘We then focus on the efficent organization of the Cholesky procedure which ccan be used to anfely factor a symmetric positive definite A. Gaxpy, outer product, and block versions are developed. ‘The section concludes with a few comments about the somidefinite case. 
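Before specializing to the positive definite case, it may help to see the symmetric recurrence of §4.1 in executable form. The sketch below (our own naming, 0-based indexing, no pivoting) follows the pattern of Algorithm 4.1.2 and assumes that every leading principal submatrix is nonsingular; it is not intended for indefinite or ill-conditioned problems.

import numpy as np

def ldl_no_pivot(A):
    # Compute unit lower triangular L and diagonal d with A = L diag(d) L^T,
    # in the spirit of Algorithm 4.1.2 (symmetric A, no pivoting).
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.eye(n)
    d = np.zeros(n)
    for j in range(n):
        v = L[j, :j] * d[:j]                  # v(i) = L(j,i) d(i) for i < j
        d[j] = A[j, j] - L[j, :j] @ v
        L[j + 1:, j] = (A[j + 1:, j] - L[j + 1:, :j] @ v) / d[j]
    return L, d

A = np.array([[10.0, 20.0, 30.0],
              [20.0, 45.0, 80.0],
              [30.0, 80.0, 171.0]])
L, d = ldl_no_pivot(A)
print(d)                                      # [10.  5.  1.]
print(np.allclose(L @ np.diag(d) @ L.T, A))   # True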
4.2.1 Positive Definiteness Suppose A € I" is positive definite. It is obvious that a positive definite matrix is nonsingular for otherwise we could find nonzero z 30 27 Az = 0. However, much more is implied by the positivity of the quadratic form 27 Az as the following results show. ‘Theorem 4.2.1 If A RO" is positive definite ond X ¢ RO" has rank kk, then B= XTAX € RO** is also positive definite. Proof, If x ¢Ré sotisfes 0 > 27Bx = (Xz)"4(X2) then Xz = 0. But since X has full colura rank, this implies that z= 0.0 Corollary 4.2.2 If A is positive definite then all its principal submatrices are positive definite. In particular, all the diagonal entries are positive. Proof. If v € RP is an integer vector with 1 < v1 <--- 0 then the matrix am [oe Te [ove [5 cemn | [3 4] is positive definite. But if m/e > 1, then pivoting s recommended. ‘The following result suggests when to expect element growth in the LDMT factorization of a positive definite matrix. ‘Theorem 4.2.4 Let A ¢ IR™*" be positive definite and set T = (A+A™)/2 ond $= (A~A?)/2. A= LDM™, then UILIDIMT Ip < m (UT Ha + STS fa) (42.1) Proof. See Golub and Van Loan (1979). 0 ‘The theorem suggests when it is safe not to pivot. Assume that the com- puted factors L, D, and Af satisfy: HEIDI eS eff [EIDIMT| Hes (4.2.2) ‘where cis a constant of modest size. It follows from (4.2.1) and the analysis in §3.3 that if these factors are used to compute a solution to Az = 6, then the computed solution 2 satisfies (A + E)# = 6 with Elly < wGnf A lip + Sen? (\\T lla +1 STS fa) + O(u?). (4.2.3) It is easy to show that || T lla < || A 2, and so it follows that if 2 ee (424) is not too large then itis safe not to pivot. In other words, the norm of the skew part 5 has to be modest relative to the condition of the symmetric pert T. Sometimes it is possible to estimate 1 in an application. This is trivially the case when A is syinmnetric for then = 0. 4.2.3 Symmetric Positive Definite Systems ‘When we apply the above results to a symmetric positive definite system wwe know that the factorization A= LDL" exists and moreover is stable to compute. However, in this situation another factorization is available. 4.2. Posrriva Dernnre Systems 143 ‘Theorem 4.2.5 (Cholesky Factorization ) If A € IR™*" is symmetric positive definite, then there exists a unique lower triangular G € R™" with positive diagonal entries such that A= GG". Proof. From Theorem 4.1.2, there exists s unit lower triangular L and a iagonal D = diag(d;,...,d,) such that A = LDL?. Since the ds are pos itive, the matrix G = L diag(/di,..., Wd) is real lower triangular with positive diegonal entries. It also satisies A = GG. Uniqueness follows from the uniqueness of the LDL" factorization, 0 ‘The factorization A = GGT is known as the Cholesky factorization and G is referred to as the Cholesky triangle. Note that if we compute the Cholesky factorization and solve the triangular systems Gy = 8 and Gz = y, then b= Gy =G(GTz) = (GG")z = Az. Our proof of the Cholesky factorization in Theorem 4.2.5 is constructive. However, more effective methods for computing the Cholesky triangle ean be derived by manipulating the equation A = GGT. This can be done in several ways as we show in the next few subsections. [2 2)-(4 302 10 t-[-4 af? 4] is positive definite, 4.2.4 Gaxpy Cholesky ‘We first derive an implementation of Cholesky that is rich in the gaxpy ‘operation. If we compare jth columns in the equation A = GGT then we obtain AGI = DEG HEA). m ‘This says that GGNGE, Pa = AGI) - DGG NGEA) =v. 
(425) Fa If we kaow the fist j ~ 1 columns of G, then v is computable. It follows by equating components in (4.2.5) that Gin, 5) = whim) Ve). ‘This is a scaled gaxpy operation and so we obtain the following gaxpy-based method for computing the Cholesky factorizstion: M4 CHAPTER 4. SPECIAL LINEAR SYSTEMS for j=in Gin) = AGin,d) for k=1:j-1 (Gin) — GU,A)Gn,#) GUen,3) = vGi:n)/ OG) end It is possible to arrange the computations so that G overwrites the lower triangle of A. Algorithm 4.2.1 (Cholesky: Gaxpy Version) Given a symmetric positive definite A € IR", the following slgorithm computes 1 lower tri- angular G € I*" such that A = GGT. For all ¢ > j, G{t,j) overwrites Ali). for j ppt Algin,g) = AGimd) ~ AGEny 1g — AG IG - end ad AGiin, 3) = AG=n, )/ VAG) ‘This algorithm requires n?/3 ops. 4.2.5 Outer Product Cholesky An alternative Cholesky procedure based on outer product (rank-1) updates can be derived from the partitioning -fe7]_[ 8 o]f1 0 8/8 ° VB Inn = wT a Int a(S a] = [oe ei][S omrme|[t Oe). (4.26) Here, 6 = VG and we know that a > 0 because A is positive definite. Note that B — vo™ /a is positive definite because it is a principal submatrix of XTAX where 1 i] 0 Int If we have the Cholesky factorization GGT = B~vv7 fa, then from (4.2.6) it follows that A = GGT with e-[4o a]: ‘Thus, the Chouesky factorization can be obtained through the repeated application of (4.2.6), much in the the style of hji Gaussian elimination. | 4.2. Posrrve Dermrre SysteMs 145 Algorithm 4.2.2 (Cholesky: Outer product Version) Given a sym- ‘metric positive definite A ¢ R'*", the following algorithm computes a lower triangular G ¢ R™" such that A = GGT. For all i > j, G(i,) overwrites Alii). for k= in Alk,k) = VATE E) Ak + Len, k) = A(k + Lon, k)/ACk, k) for jak+in ACGen,3j) = AGin, 3) — AGin AGB) end end ‘This algorithm involves n2/3 flops. Note that the j-loop computes the lower triangular part of the outer product update Alle + Linke lin) = Ale + Lim, k + Ion) — A(k + Len h)A(k + Ion, 8)? Recalling our discussion in §1.4.8 about gaxpy versus outer product up- dates, itis easy to show that Algorithm 4.2.1 involves fewer vector touches than Algorithm 4.2.2 by a factor of two. 4.2.6 Block Dot Product Cholesky Suppose A € F™*" is symmetric positive definite. Regard A = (Ay) and its Cholesky factor G = (Gi;) as N-by-N block matrices with square diagonal blocks. By equating (j,3) blocks in the equation A = GGT with i 2 j it follows that i Ay = Gah. it Defining jet S = Ay-SGuGh, mi we soe that GyyG7, = $ if = j and that GyGT, = 3 if § > j. Properly sequenced, these equations can be arranged to compute all the Gis: Algorithm 4.2.3 (Cholesky: Block Dot Product Version) Given a symmetric positive definite A € IR", the following algorithm computes a lower triangular G € FO" such that A= GGT. The lower triangular part of A is overwritten by the lower triangular part of G. A is regarded as an N-by-N block matrix with square diagonal blocks. us. Caprer 4. SPECIAL LINEAR SYSTEMS for j=: for i= j:N # S = Ay - Gah Eo if Compute Cholesky factorization S$ = G,,G7,. else Solve GyGF, = S for Gy end Overwrite Ay with Gi. end end The overall process involves n°/3 flops like the other Cholesky procedures that we have developed. ‘The procedure is rich in matrix multiplication assuming a suitable blocking of the matrix A. 
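A compact NumPy rendering of the gaxpy and outer-product variants may help fix the two data access patterns. The sketch below uses our own function names and 0-based indexing, performs no positive definiteness checking, and is meant only to mirror Algorithms 4.2.1 and 4.2.2, not to replace a library routine.

import numpy as np

def cholesky_gaxpy(A):
    # Gaxpy Cholesky: column j is formed from the previously computed
    # columns and then scaled (cf. Algorithm 4.2.1).
    G = np.tril(np.asarray(A, dtype=float))
    n = G.shape[0]
    for j in range(n):
        G[j:, j] -= G[j:, :j] @ G[j, :j]
        G[j:, j] /= np.sqrt(G[j, j])
    return G

def cholesky_outer(A):
    # Outer-product Cholesky: scale a column, then apply a symmetric
    # rank-1 update to the trailing lower triangle (cf. Algorithm 4.2.2).
    G = np.tril(np.asarray(A, dtype=float))
    n = G.shape[0]
    for k in range(n):
        G[k, k] = np.sqrt(G[k, k])
        G[k + 1:, k] /= G[k, k]
        G[k + 1:, k + 1:] -= np.tril(np.outer(G[k + 1:, k], G[k + 1:, k]))
    return G

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B @ B.T + 5.0 * np.eye(5)                  # symmetric positive definite test matrix
Gref = np.linalg.cholesky(A)
print(np.allclose(cholesky_gaxpy(A), Gref), np.allclose(cholesky_outer(A), Gref))

The gaxpy version touches only column j at step j, while the outer-product version sweeps the entire trailing submatrix; this is the factor-of-two difference in vector touches noted above.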
For example, ifn = rN and each Ay, is r-by-r, then the level-3 fraction is approximately 1 — (1/'N?), ‘Algorithm 4.2.3 is incomplete inthe sense that we have not specified how the products GuxGse are formed or how the r-by-r Cholesky factorizations S = GyGF, are computed. ‘These important details would have to be worked out carefully in order to extract high performance. ‘Another block procedure can be derived from the gaxpy Cholesky algo- rithm, After r steps of Algorithm 4.2.1 we know the matrices Gin € R™" and Gar € RO" in Au An] [Gn O][% O][Gy 0]? An 4a] = [Ga tne JL 0 A} | Gn dnor We then perform r more steps of gaxpy Cholesky not on A but on the reduced matrix A = An—GzGf, which we explicitly form exploiting symmetry. Continuing in this way we obtain a block Cholesky algorithm whose kth step involves r gaxpy Cholesky steps on a matrix of order n ~ (k— 1)r followed a level-3 computation having order n ~ kr. The level-3 fraction is approximately equal to 1 - 3/(2V) ifn wr. 4.2.7 Stability of the Cholesky Process In exact arithmetic, we know that a symmetric positive definite matrix hhas a Cholesky foctorization. Conversely, if the Cholesky process runs to completion with strictly positive square roots, then A is positive definite. ‘Thus, to find out if.» matrix A is postive definite, we merely try to compute ite Cholesky factorization using any of the methods given above. ‘The situation in the context of roundoff error is more interesting. The ‘sumerical stability of the Cholesicy algorithm roughly follows from the in- 4.2. Posrrive Derinrre SysTEMs ar equality ‘This chows thet the entries in the Cholesky triangle are nicoly bounded. ‘The same conclusion can be reached from the equation |G = A la. ‘The roundoff errors associated with the Cholesky factorization have been extensively studied in a classical paper by Wilkinson (1968). Using ‘the results in this paper, it can be shown thet if is the computed solution to Az = 6, obtained via any of our Cholesky procedures then # solves the perturbed system (A+ E)¢ = b where || Elly < cqull Alla and cq is a small constant depending upon n. Moreover, Wilkinson shows that if gquna(A) < 1 where gais another small constant, then the Cholesky process runs to completion, ie, no equare roots of negative numbers erise. Bxamplo 4.2.2 If Algorithm 4.22 ia applied tothe postive defnio matrix 100 15 ot Aq | is 23 oF 1 OL 100 and 9 = 10, ¢ = 2, rounded arithmetic used, then 31 = 10, fo = 1.5, fx = 001 and. ‘Jaa = 0.00. The algorithm thes breaks down tying to compute gua 4.2.8 The Semidefinite Case A matrix is said to be positive semidefinite if 27Ax > 0 for all vectors 2. Symmetric positive semidefinite (spe) matrices are important and we briefly discuss some Cholesky-like manipulations that can be used to solve various sps problems, Results about the disgonal entries in an spe matrix are needed first. ‘Theorem 4.2.6 If A ¢ R™" is symmetric positive semidefinite, then las] (aie +ay5)/2 (4.2.7) lagl << Yona (G3) (4.2.8) max Jay) max ai (429) m=O AG) =O, ACG) (42:10) Proof. If'z =o, +¢y then 0< 27 Az = ay + ayy + Davy while 2 =e ~ €5 implies 0 27 Az = ai +aj; — 2a4j. Inequality (4.2.7) follows from these two results. Equation (4.2.9) is am easy consequence of (4.2.7). ‘To prove (4.2.8) assume without loss of generality that i = 1 and j = 2 and consider the inequality os [F] [ee 22 ][F] = one samae tom 148 CHAPTER 4. SPECIAL LINEAR SySTEMS which holds since A(1:2, 1:2) is also semidefinite. 
This is a quadratic equa- tion in » and for the inequality to hold, the discriminant 4a2, ~ daxroz2 must be negative. Implication (4.2.10) follows from (4.2.8), O ‘Consider what happens when outer product Cholesky is applied to an sps matrix. If zero A(k,k) is encountered then from (4.2.10) A(e:n,k) is zero and there is “nothing to do” and we obtain for k = iin if A(k,k) > 0 Alksk) = ACER) A(R + Ln, k) = A(k + 1m, k)/A(K) for j=k +n AGin,3) = AGin.3) ~ AG: AIAG, E) (42.11) end end ‘Thus, a simple change makes Algorithm 4.2.2 applicable to the semidefinite case. However, in practice rounding errors preclude the generation of exact zeros and it may be preferable to incorporate pivoting. 42.9 Symmetric Pivoting ‘To preserve eymmetry in a symmetric A we only consider date reorderings of the form PAPT where P is a permutation. Row permutations (A — PA) fo column permutations (A+ AP) alone destroy symmetry. An update of ‘the form Av PAPT is called a symmetric permutation of A. Note that such an operation does ‘not move off-diagonal elements to the disgonsl. The disgonal of PAPT is a reordering of the diagonal of A. ‘Suppose at the beginning of the Ath step in (4.2.11) we symmetrically permute the largest diagonal entry of A(kn, Arn) into the lead postion If that largest diagonal entry is zero then A(k:n,k:n) = 0 by virtue of (4.2.10). In this way we can compute the factorization PAPT = GGT where Ge R™(*-" ig lower triangular. Algorithm 4.2.4 Suppose A ¢ IR™*” is symmetric positive semidefinite ‘and that rank(A) =r. The following algorithm computes a permutation P, tthe index r, and on n-by-r lower triangular matrix G euch that PAPT = GGT. The lower triangular part of A(; Lr) is overwritten by the lower triangular part of G. P= P;--- Py where Fy ia the identity with rows k and piv(k) interchanged. 4.2. Posrmive Dernure SysTems 149 r=0 for k=1n Find ¢ (k 0 whenever 2 #0. (a) Show thes A -8 e=[3 4] iseymupettic and positive defite,(b) Formulate an algorithm for solving (A+iB)(= +9) = (b+ ic), where 6 cx, ad y are is RO, It should involve 8n3/3 Sopa. How much storage i required 4.2.2 Suppose A 6 ** i symmetric and postive definite. Give an algorithm for computing an upper trisnguler muatrix R € FO*® such that A= RAT. P4.2.3 Let A € FO*™ be positive dfnite and vet T= (AY+AT)/2and 5 = (A AT)/2. (6) Show thet Av! fa ST! Ia and eT AMHe < a7 T~1z forall x € RE. (6) Show that A= ZDM?, then dy 2 9/17 for k= An 4.2.4 Find 82-by-2 rol matrix A withthe property that 27 Az > 0 forall real orero ‘-vectors but which ie ot postive definite when regarded ot « member of O77, P42 Suppone A € FC" has « pontive diagonal Show that i both A nad AT are strictly dlagooally dominant, then A ia postive definite, P4.2.6 Show that the fuscton f(z) = (27 Az)/2ia« vector norm on Re if and oaly if A ia positive deft 42.7 Modify Algorithm 421 a0 that if the aquare root of « negative number is encountered, then the algocthin feds w unit vactor = so 27 Ax <0 and terminates, P42.8 The numerical range W(A) of a complex matrix A is defined to be the set W(A) = (2% Az: ¥2 = 1), Show that 0 W(A), then A has an LU factorization. P4.2.9 Formulate an m 4 +9 and lower bondundth p if ayy = 0 whenever #> J +p. Substantial economies can be realized when solving banded systems because the triangular factors in LU, GGT, LDMT, etc., are also banded. Before proceeding the reader ia advised to review §1.2 where several aspects of band matrix manipulation are discussed. 
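As a concrete preview of these economies, the sketch below solves a tridiagonal system with SciPy's banded solver, which expects the matrix in the compact diagonal-ordered band storage of §1.2; the numerical values are arbitrary.

import numpy as np
from scipy.linalg import solve_banded

n = 6
main = 4.0 * np.ones(n)            # main diagonal
off = -1.0 * np.ones(n - 1)        # sub- and superdiagonal (p = q = 1)
A = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)

# Diagonal-ordered band storage, a (p+q+1)-by-n array: row 0 holds the
# superdiagonal, row 1 the main diagonal, row 2 the subdiagonal.
ab = np.zeros((3, n))
ab[0, 1:] = off
ab[1, :] = main
ab[2, :-1] = off

b = np.arange(1.0, n + 1)
x = solve_banded((1, 1), ab, b)    # O(n) work for fixed bandwidth, versus O(n^3) dense
print(np.allclose(A @ x, b))       # True

For symmetric positive definite band matrices, scipy.linalg.solveh_banded plays a role similar to the band Cholesky solver developed later in this section.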
4.3.1 Band LU Factorization Our first rosult shows that if A is banded and A= LU then £(U) inherits the lower (upper) bandwidth of A. ‘Theorem 4.3.1 Suppose A ¢ R""" has an LU factorization A= LU. If A has upper bandwidth ¢ and lower banduridth p, then U has upper bandwidth q and L has lower bandwidth p. Proof, The proof is by induction on n. From (3.2.6) we have the factor- tastion a-(2% 1 10 a ut v BI [ofa Ina |] 0 Bw a}]0 Jn It is clear that B— vw far has upper bandwidth ¢ and lower bandwidth p because only the first q components of w and the first p components of » are nonzero. Let LyU; be the LU factorization of this matrix. Using the induetion hypothesis and the sparsity of w and v, it follows that t= [een] mt = (5 3] have the desired bandwidth properties and satisfy A = LU.O ‘The specialisetion of Gausoian elimination to banded matrices having an LLU factorization is straightforward. Algorithm 4.3.1 (Band Gaussian Elimination: Outer Product Ver~ sion) Given A ¢ IR" with upper bandwidth ¢ and lower bandwidth p, the following algorithm computes the factorization A = LU, sanuming it exists. A(j,j) is overwritten by L(é,3) if > j and by U(é,3) otherwise. 4.3, Banpe Systems 153 for k = 1: for 1 + Lamin(k + p,n) AGisk) = AG,E)/ACER) end for j = k + Limin(k +q,n) for i= k+ Lmin(k + p,n) Alii) = AG.3) - AGR)ALK,3) end ond end If n> p and n > q then this algorithm involves about 2npg flops. Band versions of Algorithm 4.1.1 (LDM®) and all the Cholesky procedures also ‘ois, but we leave their formulation to the exercises. 4.3.2 Band ‘Triangular System Solving ‘Analogous savings can also be made when solving bended triangular sys- tems. Algorithm 4.3.2 (Band Forward Substitution: Column Version) Let Le R°*" be a unit lower triangular matrix having lower bandwidth p. Given b € IR”, the following algorithm overwrites b with the solution to Ix=b. for j= 1:n for i= j+1:min(j +p,n) (4) = 8) — DGG) end end fm > p then this algorithm requires about 2np fops. Algorithm 4.3.3 (Band Back-Substitution: Column Version) Let U eR" be a nonsingular upper triangular matrix having upper band- width ¢. Given & € IR", the following algorithm overwrites 6 with the solu- tion to Uz = 6. for j=n: %@) S00. 1ax(1, 5 9): (8) = bf) - 0G, 5)6G) end end ‘f'n > then this algorithm requires about 2ng ope. 154 CHAPTER 4. SPECIAL LINEAR SYSTEMS 4.3.3 Band Gaussian Elimination with Pivoting Gausaian elimination with partial pivoting can also be specialized to exploit band structure in A. If, however, PA = LU, then the band properties of L ‘and V are not quite so simple. For example, if A is tridiagonal and the first ‘ovo rows are interchanged at the very first step of the algorithm, then ta is nonzero. Consequently, row interchanges expand bondwidth. Precisely hhow the band enlarges is the subject of the following theorem. ‘Theorem 4.3.2 Suppose A € RO" is nonsingular and has upper and lover bandwidths q and p, respectively. If Gaussian elimination with partial piv- oting is used to compute Gauss transformations My =a jam and permutations P,,...,Pyat such thot Mya1Paoi+-MyPA =U is up per triangular, then Uf has upper banduridth p+q and al) = 0 whenever Sj ori>j tp. Proof. Let PA = LV be the factorization computed by Gaussian elimi ‘ation with partial pivoting and recall that P = Py» Pi. Write PT = [eagess-seng where (81, + 9n) 18 permutation of {1,2,...n}. Ifa, > i+p then it follows that the leading t-by-t principal submatrix of PA is singular, since (PA)y = a,,3 for j = 1:s, ~ p~1 and 4 -~p—1 > i. 
This implies that U and A are singular, a contradiction. Thus, s, 3, G(i,j) overwrites A(i, 3). for j=in for smax(L,j —p)j-1 ‘A = min(k + p.m) AGA 3) = AGA) ~ AGAAG AA) end A= min(j +p,n) eng MOND AGA DAT Ifn > p then this algorithm requires about n(p* + 3p) flops and n square roots, Of course, in a serious implementation an appropriate data structure for A should be used. For example, if we just store the nonzero lower triangular part, thea a (p + 1)-by-n array would suffice. (See §1.2.8) four band Cholesky procedure is coupled with appropriate band trian- guler solve routines then approximately np? + 7np + 2n flops and n square roots are required to solve Az = b. For small pit follows that the square roots represent a significant portion of the computation and it is prefer- able to use the LDL approach. Indeed, a careful Sop count of the stepe A=LDIT, Ly=b, Dz =y, and Lz = z reveals that np? +8np-+n flops ‘and no square roots are needed. 4.3.6 Tridiagonal System Solving As a sample narrow band LDL? solution procedure, we look at the case of symmetric positive definite tridiagonal systems. Setting 4.3. BANDED SYSTEMS 187 ‘and D = diag(ds,...,d,) we deduce from the equation A = LDZ? that: iden tie da teh adi = de + Oe tk kt ‘Thus, the d; and ¢, can be resolved as follows: d=an for k een ake t/de is di = hk ~ C410 et end ‘To obtain the solution to Az = b we solve Ly = b, Dz = y, and Lr = ‘With overwriting we obtain Algorithm 4.3.6 (Symmetric, Tridiagonal, Positive Definite Sys- tem Solver) Given on n-by-n symmetric, tridiagonal, positive definite matrix A and 6 € IR", the following algorithm overwrites 6 with the solu- tion to Az = 5, Tt is assumed that the disgonal of is stored in d(1:n) and the superdiagonal in e(1:n ~ 1). n ‘e(k — 1); ef — 1) = t/d(k — 1); dk) = d(k) ~ te(k ~ 1) (ki) = ek = 16 = 1) Hn) = O(n) /aln) for k=n~ aa 5) = BOR /aCR) ~ 01 +1) end ‘This algorithm requires 8n flops. 4.3.7 Vectorization Issues ‘The tridiagonal example brings up a sore point: narrow band problems and vector/pipeline architectures do not mix well. ‘The narrow band implies short vectors. However, it is sometimes the cave that large, independent sets of such problems must be solved at the same time. Let us look at how such 8 computation should be arranged in light of the issues raised in §1 For simplicity, assume that we must solve te n-by-n unit lower bidisg~ onal systems AM) 8) kei 158 CHAPTER 4. SPECIAL LINEAR SYSTEMS and that m > n. Suppooe we have arrays E(I:n ~1,1:m) and B(Lin, Lim) ‘with the property that E(I:n 1,k) houses the subdiagonal of A) and ‘B(i:n,k) houses the Ath right hand side B®. We can overwrite BO) with, the solution 20) as follows: for k= 1m for i=2n BG) = BG,k) - BG 1, BG 1,8) end ‘The problem with this algorithm, which sequentially solves each bidiagonsl system in tur, is that the inner loop does not vectorize. This is because of the dependence of B(i,) on B(i~1,k). If we interchange the & and i loops we get for i= 2m for k= Lm B(i,k) = B(i,k) ~ BG —1,4)BG 1,8) (43) end end Now the inner loop vectorizen well aa it involves « vector multiply and a vector add. Unfortunately, (4.3.1) is not a unit stride procedure. However, this problem is easily rectified if we store the subdiagonais and right-hand- sides by row. That is, we use the arrays E(1:m, I:n—1) and B(1:m, 1:n—~1) and store the subdiagonal of A® in E(k, 1:n ~ 1) and 09)” in B(k, Lin). 
‘The computation (4.3.1) then transforms to fori=2n for k= iim Blk, i) = Bek) ~ B(ki— B(k,i—1) end illustrating once again the effect of data structure on performance. 4.3.8 Band Matrix Data Structures ‘The above algorithims are written as if the matrix A is conventionally stored in an n-by-n array. In practice, a band linear equation wolver would be or- gonized nround a dato structure thet takes advantage of the many zeroes in A. Recall from §1.2.6 that if A hes lower bandwidth p and upper band- ‘width ¢ it can be represented in a (p +¢+ I)-byen array A.band where band entry aj is stored in A.bond(i— j++, ). {n this arrangement, the nonzero portion of 4’s jth column is housed in the jth column of A.band. Another possible band matrix data structure that we discussed in §1.2.8 4.3. BANDED SYSTEMS 189 involves storing A by diagonal in a 1-dimensional array A.diag. Regardless ‘of the data structure adopted, the design of « matrix computation with a ‘band storage arrangement requires care in order to minimize subscripting overheads. Problems 4.8.1. Derive banded LDM? procedure similar to Algorithm 43.1. 43.2. Show how the output of Algorithm 4.3.4 can bo used to solve the upper Hae veers system Hz = 6. 43.3. Give an algorithm for solving an vneymsetri tidiagonal artem Az = that ‘em Gauaian elinination with partial pivoting. It ahould require only four n-vectors of Aoating point storage forthe factorization, P44 Por Ce RO define the profile indices m(C.i) = min(jiay # 0}, where {= Irn. Show that if A= GG" ls the Cholesky factorization of A, than m(A,i) = 1m(G,) for S= tin. (Wo may that G has the sume profile as A.) 4.3.8 Suppose A ¢RO*® is eymmoetric positive definite with profile indices my = ‘m(A,2) where ¢= Ion. Amurme that A ia gored ins one-dimensional array v an follows: 2 (etsy onymaee 1022, 63,mgy 2-189 mqs oun). Write an algoritb that overwrites v with the corresponding entries of the Cholesky factor G and then uses this factorization to solve Az =b. How aay Bope are required? 43.6 For C € ROX" deine (C4) = max{jicy # 0}. Suppone that A.€ RO" baa an LU factorization A= LU and that: mt) < m(A2) << man) WA € AR Eo S wAn) ‘Show that m(dyi) = m(yé) and piA,i) = p{U,) for im Lim, Recall the dafition of ‘m(Ai) fom PAB 4.3.7 Develop « guapy version of Algorithea 43.1 4.3.8 Develop» uit etre, yotartsbie algorithm fo solving the symmetric positive definite tridiagonal ayetems AU*z() = 60, Aamame thatthe dagonala, superdiagoaole, ght and lene sored by row la rare D, 5, end Band that 8) a erwin wah 2 4.3.9 Develop a version of Algcithsn 4.1 ia which A i tored by diagonal. 4.3.10 Give an example ofa $-ty-3 symmetric poalive definite matrix whose tridiag- ‘onal part ls not positive define. 43.11 Consider the Az = b problem where a oa -1 0 ot 2 "This Lind of matrix arises in boundary value problema with periodic boundary conditions. (4) Show A insingalar. (6) Give conditions that b rout satiny for thar 1 ext a solution {ind specify ua algorithm for solving i€. (c). Asmume that n ia even und coceider the ™ i 1 Paleeameie 160 (Ouaprer 4. SpeciaL Linsan SysTems where ent the kth column of fy, Describe the tranaformad system PTAP(PT =} = PTS tnd show bow to solve it. Asrume that thers lea olution and ignore pivoting Notes and References for Sec. 4.3 “The literature concerned with banded syrtens is immense, Some representative papers inelude RS. Martin and J.H. Wilkinson (1968). “Symmetric Decomposition of Pouitive Definite ‘Band Matrices,” Numer. Math. 7, 88-61. FS. Martin and J.H. Wilkinson (1967). 
“Solution of Symmetric and Unrymmetric Band Bquations and the Calculation of Eigenvaives of Band Matrices,” Numer. Math. 9, 279-30. ELL, Allgower (1973). “Bact Invern of Cartain Band Matrices,” Numer. Math, 21, ‘784 Z. Bobte (1975). “Bounds for Rounding Errors in the Gaumtan Elimination for Band ‘Syreema” J. fast. Math. Appic. 16, 135-42. 1S. Duff (1977). "A Survey of Sparse Matrix Research,” Proc. IBEB 65, 500-538. NJ. Higham (1900). “Bounding the Ervor in Geumian Elimination for Thidiagonal ‘Syutemm” SIAM J. Matrix Anal. Appl. 11, 521-590. A topic of considerable interet in the area of banded matrices deals with methods for reducing the width of the band. Soe B, Cuthil (1972), ‘Several Stratogion for Reducing the Bandwidth of Matrices,” in ‘Sporse Matrices and Their Applications of, DJ Nowe and RLA. Willoughby, Pleaum rum, New York. NE, Gibbs, W.G. Poole, Jr, and P.K, Stockmeyer (1978). “An Algorithm for Reducing the Bandwidth and Profle of « Sparse Matrix” SIAM J. Num. Anal. 19, 238-60. N.B, Gibbe, W.C. Poole, de, asd PIC. Stockmeyer (1076). “A Comperiaoa of Several Bandwidth and Profile Reduction Algorithms,” ACM Trans. Math. Soft 2 322-20. ‘As we mentioned, tridiagonal systems arise with particular frequency. Thus, it a not surprising that « great deal of attention has been focuimd on special methods for this ‘lam of banded problems, (C. Facher aad R.A, Usman (1960). “Propartin of Some Trdiagooal Matciom and Their ‘Application to Boundary Value Probleas,” SIAM J. Num. Anal 6, 17-42 DJ, Rowe (1965). “An Algorithm fo Solving 8 Special Clas of Tidlagoaal Systema of Lieer Equations” Comm. ACM 18, 24-96, HLS. Stone (1973). “An Biicent Purlle Algorithm for the Solution of « Tidiagonal “Linear System of Equations," J. ACM 20, 27-33. MA. Malcolm and J. Palmar (1974), A Past Method for Solving « Chum of Tidlngonal ‘Sywtems of Lineat Equations,” Comm. ACM 17, 14-17. J. Lamblotte and R.G. Voigt (1973). “The Solution of Trdiagonal Linear Systane of ‘the CDCSTAR 100 Computer,” ACM Trans. Math Soft 1, 308-2 HS. Stone (1975). “Parallel Trsdingonal Equation Solver,” ACM Trans. Math. Soft, “389-307, 1. Kerabaw(1962), “Solution of Single Tidlagonul Linoar Systems and Vectorsation of ‘the ICCG Algorithm on the Cray-1,” In G. Rederigue (ed), Parallel Computstion, Academic Pres, NY, 1982 NJ. Higham (1086). “Eifisient Algeithae for computing the condition number of & ‘widingoeal mate,” SIAM J. Sct and Stat. Comp. 7, 180-168. (Chapter 4 of George and Lie (1981) contains a nice survey of baad methods for positive defiaite stems. 44. SyMmernic InpernaTe SYSTEMS 11 4.4 Symmetric Indefinite Systems A symmetric matrix whose quadretic form 27 Az takes on both positive and negative values is called indefinite. Although an indefinite A may have an LDLT factorization, the entries in the factors can have arbitrary magnitude: Fara rad | caer | Of course, any of the pivot strategies in §3.4 could be invoked. However, they destroy symmetry and with it, the chance for a “Cholesky speed” Indefinite system solver. Symmetric pivoting, Le., data reshuffings of the form A— PAPT, must be used as we discutsed in §4.2.9. Unfortunately, symmetric pivoting does not always stabilize the LDL™ computation. If ¢: and ¢z are small then regardless of P, the matrix anol t]e has small diagonal entries and large numbers surface in the factorization. With symmetric pivoting, the pivots are always selected from the disgonal ‘and trouble results if these numbers are small relative to what must be zeroed off the diagonal. Thus, LDL? 
with symmetric pivoting cannot be recommended as a reliable approsch to symmetric indefinite system solving. Tt seems that the challeage is to involve the off-diagonal sete io the pivoting process while at the same time maintaining symmets In ts section we dicuss two maye to do this, The fist method is due to Aasen(1971) and it computes the factorization rit (44a) where L = (C4) is unit lower triangular and T is tridiagonal. P is a permu- tation chosen such that tjj| <1. In contrast, the diagonal pivoting method due to Bunch and Parlett (1971) computes a permutation P such that PAPT = LDLT (4.42) PApT where D is a direct sum of 1-by-1 and 2-by-2 pivot blocks. Again, P is chosen so that the entries in the unit lower triangular L satisfy |e] <1. Both factorizations involve n3/3 flops and once computed, can be used to solve Az = b with O(n?) work: PAP? = ITI? Lz=PhTw= Ly =wz=Py + Arab PAPT = LDLT,Li= Pb,Dw=2,L7y=wz=Py » Az=b ‘The only thing “new” to discuss in these solution procedures are the Tw = = and Dw =z systems, 162 CHapTeR 4. SPECIAL LINEAR SysTEMS Tn Ansen's method, the symmetric indefinite tridiagonal system Tw = = i solved in O(n) time using band Gaussian elimination with pivoting, Note ‘that there ia no serious price to pay for the disregard of symmetry at this level since the overall process is O(n®). In the diagonal pivoting approsch, the Dw = z system amounta to a set of I-by-1 and 2-by-2 symmetric indefinite systems. ‘The 2-by-2 problems can be handled via Gaussian elimination with pivoting. Again, there is n0 ‘harm in disregarding symmetry during this O(n) phase of the calculation. ‘Thus, the central issue in this section is the efficent computation of the factorizations (4.4.1) and (4.4.2). 4.4.1 The Parlett-Reid Algorithm Parlett and Reid (1970) show how to compute (4.4.1) using Gauss trans- forms. Their algorithm is sufficiently illustrated by displaying the k = 2 step for the case n = 5, At the beginning of this step the matrix A has ‘been transformed to am A tro AQ = MPAPTMT = | 0 vs OM Ow where P; is e permutation chosen so that the entries in the Gauss trans- formation M, are bounded by unity in modulus. Scanning the vector (vs v4 vp)7 for its largest entry, we now determine a 3-by-3 permutation FP such that % % Pl w| = | % => [asl = maax{|dsh [Bal 51}. % ts If this maximal element is zero, we set M, = P; = I and proceed to the next step. Otherwise, we set P; = diag(Ia, Fy) and Mz =I ~ ale with a) = (0 0 0 Safty t/t)” 0 a x xxx o xxx Eo ‘and observe that a a A® = MaP,AC)PEMT = | 0 0 ° In general, the process continues for n 2 steps leaving us with a tridiagonal matrix T= AC = (MyaPaas cosgE xxx gO xx xoo os MyPi)A(Mya2Praa + MiP. 4.4. Symmetric INDEFINITE SYSTEMS 163 Tt can be shown that (4.4.1) holds with P= Py_a-+ Py and = (MyoaPaca- MPAPPY?. Analysis of Z reveals that its first column is ¢, and thet its subdiagonal entries in column k with k > 1 are “made up” of the multipliers in My. ‘The efficient implementation of the Parlett-Reid method requires care when computing the update AO = My PAOD PE) ME. (443) ‘To see what is involved with 6 minimum of notation, suppose B= B™ has order n~ k and that we wish to form: By = (I-wef)B(I— wef) where we RY and c is the first column of Jy. Such « calculstion is at the heart of (4.4.3). If we sot bu tw, us Bey then the lower half of the symmetric matrix By = B- wu? ~ ww™ can be formed in 2(n — &)? flops. 
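The identity behind this update is easy to check directly. The following NumPy fragment (our own sketch, not the book's code) forms B1 both as (I - w e1^T) B (I - e1 w^T) and as the symmetric rank-two correction B - w u^T - u w^T with u = B e1 - (b11/2) w, and confirms that the two agree; a serious implementation would of course touch only the lower half.

    import numpy as np

    def symmetric_rank2_update(B, w):
        # B symmetric; returns B1 = (I - w e1^T) B (I - e1 w^T) written as the
        # symmetric rank-two correction  B - w u^T - u w^T,  u = B e1 - (b11/2) w.
        u = B[:, 0] - 0.5 * B[0, 0] * w
        return B - np.outer(w, u) - np.outer(u, w)

    rng = np.random.default_rng(1)
    n = 6
    B = rng.standard_normal((n, n)); B = B + B.T
    w = rng.standard_normal(n)
    e1 = np.eye(n)[:, 0]
    M = np.eye(n) - np.outer(w, e1)
    assert np.allclose(symmetric_rank2_update(B, w), M @ B @ M.T)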
Summing this quantity as k ranges from 1 to n—2 indicates that the Parlett-Reid procedure requires 2n’ /3 flops— twice what we would like eal 4 fhe Pai tn ped 3 Leveeese) Fa~(0, 0, 2/8, 1/3, eT A= (nauel Me eg aE ook PAPE TH where P= {e1, €3, ¢4, e2], o 8 o 3 0 6 304 3/3 0 a(S bi) me [ER Bi? 0 2/3 1/2 1 oo 0 42 A a 4.4.2 The Method of Aasen An n3/3 approsch to computing (4.4.1) due to Aasen (1971) can be derived by reconsidering some of the computations in the Parlett-Reid spproach. 164 ‘CuarTer 4. SPECIAL LINEAR SYSTEMS ‘We need a notation for the tridiagonal T: on A oe 0 Boog” BO Bye 0 | But oe For clarity, me tesaporarily ignore pivoting and assume that the factoriza- tion A= LTLT exista where £ is unit lower triangular with £(:,1) = er, ‘Aasen’s method is organized as follows: for j=in Compute A(I:j) where h = Te, Compute a{3). ifj ano a=1 ‘Choose Pi so Jera| = a1. ‘Choose P, 20 ler ond = Ho. It is easy to verity from (4.4.10) that if s = 1 then fal < Q+a7)o (44.1) while s = 2 implies ful (44.12) 4.4, Sysueraic INDEFnOTE SysTEMS 169 By equating (1 +a71)?, the growth factor associated with two s = 1 steps, and (3~a)/(1—a), the corresponding s = 2 factor, Bunch and Parlett con- clade that a= (1+ VI7)/8 is optimum from the standpoint of minimizing the bound on clement growth, ‘The reductions outlined above are then repeated on the n ~ s order symmetric matrix 4. A simple induction argument establishes that the factorization (4.4.2) exists and that n°/3 flops are required if the work associated with pivot determination is ignored. 44.5 Stability and Efficiency Diagonal pivoting with the above strategy is shown by Bunch (1971) to be ‘as stable as Gaussian elimination with complete pivoting. Unfortunately, the overall process requires between n°/12 and n?/6 comparisons, since 1p iavolves a two-dimensional search at ench stage ofthe reduction. The actual number of comparisons depends on the total aumber of 2-by-2 pivots but in general the Bunch-Parlett method for computing (4.4.2) is considerably slower than the technique of Aasen. See Barwell and George(1976). ‘This ia not the case with the diagonal pivoting method of Bunch and Kaufman (1977). In their scheme, itis only necessary to scan two columns at ench stage of the reduction. The strategy is fully ilustrated by consid. fering the very first step in the reduction: = (1+ ¥T7)/8; A= lat] = max{lari|,---s lanl} A>0 if Jans|> 0d sek Pal alse = lage if olay] > a? =1A st olseif jae] > a0 1 and choove Pt 90 (PF APi)11 = Gre. eke = and choose Py 50 (PPAF) = arp. ond end end Overall, the Bunch-Kaufman algorithm requires n2/S flops, O(n?) compas isons, and, like all the methods of this section, n?/2 storage. 170 Capra 4. SPECIAL LINEAR SYSTEMS Example 4.4.2 If tbe Bunch-Kaufman algorithm is applied to 1 2 A=] ao m1 ‘hen in the fee stap 2 = 20, = 3, ¢ = 30, and p= 2. The permutation P = [3 ex €1] is applied giving 1 90 20 pap =| 30 1 10 2 10 t A.2by-2 pivot is them wed to produce the reduction 1 0 o]fs @ o 1 0 0 pat=] 0 1 off 1 0 o 1 0 ans oes 1} [0 9 1170 | | sus 50s 1 4.4.6 A Note on Equilibrium Systems A very important class of symmetric indefinite matrices have the form ¢ B)a 4 [& 0 ] ’ (4.4.13) np where C is symmetric positive definite and B has fall column rank. These conditions ensure that A is nonsingular. Of course, the methods of this section apply to A. However, they do not exploit its structure because the pivot strategies “wipe out” the zero (2.2) block. 
On the other hand, here is a tempting approach that does exploit A's block structure: (2) Compute the Cholesky factorization of C, C= GGT. (b) Solve GK = B for K ¢ RO*?, (©) Compute the Cholesky fatorsation of KTK = BTC"1B, HHT = K. an[l@&ol[e *& AT Hilo -at]> In principle, this triangular factorization can be used to solve the equilib- nen owen fe elpelips Le s]li]-U}- (4414) 4.4. Symmetric InperiviTe SysTEMs im However, itis clear by onsidering steps (b) and (c) above that the accuracy of the computed solution depends upon {C) and this quantity may be much greater than x(A). The situation has been carefully analyzed and various structure-exploiting algorithms have been proposed. A brief review of the literature is given at the end of the section. But before wo close itis interesting to consider a special case of (4.4.14) that clarifies what it means for an algorithm to be stable and illustrates how perturbation analysis can structure the search for better metbods, In several important spplications, 9 = 0, C is diagonal, and the solution subvector y is of primary importance. A manipulation of (4.4.14) shows that this vector is specified by y= (B7C~'B)'BTC-'f. (4.4.18) Looking at this we are again led to believe that x(C) should have a bearing, ‘on the accuracy of the computed y. However, it can be shown that W(BT7C“'By1BTC* | < vp (4.416) where the upper bound vn is independent of C, a result that (correctly) suggests that y is not sensitive to perturbations in C. A stable method for computing this vector should respect this, meaning that the accuracy of the computed y should be independent of C. Vavasis (1994) has developed ‘a method with this property. It involves the careful assembly of a matrix Ve R™ =P) whose columns are a basis for the nullspace of B7C-1. The reby-n linear system teniz]er is then solved implying f = By +Vq. Thus, B7C~1/ = BTC-'By and (44.15) holds. Probleme 4.4.1 Show that if all the L-by-1 and 27-2 principal submatrices of an n-by-n sgrmunetric matrix A are singular, thea A i 20. 4.4.2 Show that no 2-by-2 pivte can aria ia the Bunch-Kautman algorithm if A i positive definite. 4.4.3 Arrange Algorithm 4.41 90 that only the lower trangular portion of A ie ‘eforenced and 20 that a(j) overwrites AQ, 4) for j= L:n, 2) overwrites AG + 1,3) for $= n~ 1, and L(,§) overwrites A(,j =) for j= 2m Vand f= 5+ in. Pé.ted Suppose A € RO" is tonsingular, rymmetrc, and etrictly diagonally dominant. Give an algorithm that computes the factortation mar = [5 a) a where RG REE and M€ ROMA X(— 8 are lower triangular and nonsingular and Ti permutation im (CHAPTER 4. SPECIAL LINEAR SYSTEMS PEAS Show that if An Ag ]s A= lan -d4n] > 7 Pp termes Ay snd Am pnt da an LDU haere wn ‘Spey oe 2 by ic Dy FO" an Dy € 0°? bp dnl Pacts Pros (U0) ad (04 PO Shortt “(TC-"0)- ode) St A ae Ai gy (43) be P4.48 The point of thin problem io consider 8 mec cas of (4.4.15). Deine the M(a) =(87C~'B)""aTC“? whee Cm(mtanel) o> -1 and ¢y = In(;,k). (Note that C is just the identity with o added to tho (ke, t) entry.) ‘mut that Be FO"? bas rk pad show fant Mla) = (87) 7 cr (are) where w = (iy ~ BIBT )'BTey. Show thet if [wy = or wl, = 1, then TAM(@) fy = Y/emin(). Show that 10 < [ley < 1, ther aa} / omit) ‘Thus, | M(a) ll han an cvindependant upper bound, mot < ame ‘Notes and References for Sec. 4.4 "The basic reference for computing (44.1) xe 3.0. Augen (1971). "On the Reduction of « Symmetric Matrix to Tridiagonal Form.” BIT 11, 233-42. BLN, Parlet and JK. Raid (1970). 
*On the Solution of a Symtem of Linear Equations ‘Whose Matrix is Symmetric but aot DefntaB/T 10, 38-87. ‘The diagonal pivoting literature incladee JUR Bunch and BLN, Parlett (1971). “Direct Methods for Solving Symmetric Indefinite ‘Symtema of Linear Equations.” SIAM J. Num. Anal, 8, 39-55. ‘ER Bunch (1971). “Analysis ofthe Diagonal Pivcting Method,” SIAM J. Num. Anal. 1, 636-880. 4R. Bunch (1974), “Partial Pivoting Stratagin fr Symmetric Matrices” STAM J. Nua. ‘Anal. 15, 521-528. LR. Bunch, L. Kaufoan, and BLN. Parlat (1076). “Decomposition of « Symmetric Masri,” Numer. Math $7, 05-100. J.R. Bunch und L, Kaufman (1977). “Some Stobla Mothoda fr Calculating Inertia and Solving Symmetric Linear Syetamna,” Math, Comp. 31, 162-78 1.8. Daf, NUM. Gould, JK. Rei J-A. Soott, and IC Taroer (1901). “The Factorization ‘of Sparse Indefinite Matrions” IMA J. Mom. Anal. 11, 181-204, 4.4, Syvmernic InermaTe SysTEMS 173 ‘M:T. Jonenand M.L, Patrick (1963). “Bunch Keulman Pactoriantion for Real Symmetric Tadefinite Banded Matrices,” STAM J. Matrix Anal. Appl 14, 853-859. Because “future” columns most be scanned in the pivoting procam, its awkward (but pombe) to obtain a gaxpy-nich diagonal proting algorithm. On the other hand, Anoan's ‘aethed ie naturally ich in guxpy's Block versio ofboth procedares are posible. LA- ‘PACK ume the dingooal proting method. Varcus performance ames are diacumed in ‘V. Barwall and J.A. George (1078). “A Comparison of Algorithme for Solving Symmetric Indabinite Systema of Linear Bquationy.” ACM Trans. Math. Soft. £, 242-51. [MIT Jones and ML. Patrick (1994). "Factoring Symmetric IndeGnite Matrices on High- Performance Architectures,” SIAM J. Mats Anal. Appl 15, 273-288. Another idan for a cheap pivoting stracegy utilizes eor bounds based on more Uberal interchange criteria, an iden borrowed from some work dope inthe area of sparse elimi ration rechoda. See R. Plecher (1976), “Factorsing Symmetric Indefinite Matrioa,” Lin. Alp. ond ite Appl. 14, 251-72. Before ung any symmetric As = 6 solver, it may be advisable to equilibrate A. An (Ofna) aigortia fr sccomplibing this tak i given in JER. Bunch (1971), “Bquiliteation of Symmetric Matricn is the Max-Norm,"J. ACM 18, 966-72. ‘Analogues of the symmetie indefinite solver that we have presented exist for skew. rymmetric rystema Sew LR. Busch (1982). “A Note on the Stable Decompeition of Skew Symmetric Matrices,” ‘Math. Comp. 188, 475-480. ‘The equilibrium yster Iterature it acattered among tbe several application areas where 18 ban an important role to play. Nice overviews with pointers to ths literature iclude . Strang (1988). “A Framework for Equilibrium Bquatioas,” STAM Review 90, 289-291, ‘S.A. Vavesis (1094) “Stable Numerical Algorithms for Equilbeium Syatema,” STAM J. ‘Matriz Anal. Appl. 15, 1108-1131 ‘Other pepers include C.C. Paige (1979). “Past Numerically Stabe ‘Squares Problema” STAM J. Num. Anal. 16, 165-11 A. Blea aad 18, Du (000, “A Diet Matod forthe Soation of Spam Lice Lact Squares Probleme,” Lin. Alg. end Is Apphe. 94, 43-87. A. Bibeck (1992), “Pivoting and Stabity inthe Augmented System Method,” Procead- singe ofthe 14th Dunder Conference, D.F. Crifitha and G.A. Wataon (od), Longnan Scientific ane Technical, Bex, UK. PD, Hough and 9.A, Vavada (1996). “Complete Ortbogoasi Decomposition for Weighted Laast Squares” SIAM J. Mairtz Anak Appl, to spear. 
Some of thasa papers make use of the QR factortation and other last squares ideas ‘that are discuned in the part chapter and 121 Probleme with structore abound in matrix computations and perturbation theory thas a kay role 10 play in the march for etable ficient algorithms. For equilbeiam sy tena hee ae several els ke (415) tat end the motets for Generalized Linear Least 1% CHAPTER 4. SPECIAL LINEAR SYSTEMS ‘A. Poragren (1095). ‘On Linear Least Squares Problews with Diagoaally Dominant ‘Weight Matrices,” Technical Repoct TRITA-MAT-1005-052, Department of Matbe- matics, Royal Insitute of Technology, 5-100 44, Stockholm, Sweden. ‘aed the included references. A diseasion of (44.15) may be found in G.W. Stewart (1989). "On Scaled Projections and Paoudoioversen” Lin. Alg. ond Its ‘Applic. 118, 189-188, DP. O'Leary (1960), “On Bounds for Sealed Projections and Peetdoinverses” Lin. Alp ond I Apphe. 138, 115-117. MJ. Todd (1990). “A Dantsig-Wolle-lke Vorint of Karmarter’s Inerio-Point Linear Programming Algorthis." Operations Research $8, 1006-1038. 4.5 Block Systems In many application aress the matrices that arise have exploitable block structure. AS a case study we have chosen to discuss block tridiagonal systems of the form DA 0 |x 4 Ey Dr * i n be Bo ple]: (45) i we Fe |] E : Oo Ee Da | Lae bn Here we assume that all blocks are q-by-q and that the 2; and by are in IRS. In this section we discuss both a block LU approach to this problem as ‘well as a divide and conquer scheme known as cyclic reduction. Kronecker product systems are briefly mentioned. 4.5.1 Block Tridiagonal LU Factorization ‘We begin by considering a block LU factorization for the matrix in (4.5.1). Define the block tridiagonal matrices At by Dm 0 Ey Da A= not, ken. (452) 4.5. BLock SysTEMs 175, ‘Comparing blocks in r 0 anh 0 At i 0 UW * i . (453) ° Inr FL 0 oo ‘we formally obtain the following algorithm for the L; and U;: = Ei for Lin (45.4) fi = De Le Finn ond ‘The procedure is defined so long as the U; are nonsingular. This is assured, for example, if the matrices Ay,..., 4, are nonsingular. Having computed the factorization (4.5.3), the vector in (4.5.1) can bbe obtained via block forward and back substitution: mab fori=2n = 5 — Demin end (45.5) Solve Unzn = Un for Zn. for f2n—1:—1:1 Solve Vis = yi ~ Fitigs for 2: ond ‘To carry out both (4.5.4) aod (4.5.5), each U, must be factored since linear systems involving these’ submatrices are solved. This could be done using Gaussian elimination with pivoting. However, this does not guarantee the stability of the overall process. ‘To see this just consider the case when the block size q is unity. 4.5.2 Block Diagonal Dominance In order to obtain satisfactory bounds on the Ly ond U; it is necessary to make additional assumptions about the underlying block matrix. For ‘example, if for i = I:n we have the block diagonal dominance relations HOF WR +h Bh) <1 By =F sO (4.5.6) 176 CHAPTER 4. SPECIAL LINEAR SYSTEMS then the factorization (4.5.3) exists and it is pomible to chow that the Ly ‘and U; satisfy the inequalities Wath <1 (452) With < PAnls (sa) 4.5.3 Block Versus Band Solving ‘At this point it is reasonable to ask why we do not simply regard the matrix ‘A in (4.5.1) 08 0 gn-by-qn matrix having scalar entries and bandwidth 2g -1. Band Gaussian elimination as described in §4.3 could be applied. 
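Before weighing the block and band alternatives, here is a hedged NumPy sketch of the block approach, i.e. of (4.5.4) followed by (4.5.5) (our own transcription; the block lists D, E, F and the routine name are ours, and each U_i is handled here by a dense solve rather than a reusable factorization). Whether one should instead simply hand the qn-by-qn matrix to a band solver is the question taken up next.

    import numpy as np

    def block_tridiag_solve(D, E, F, b):
        # D[0..N-1]: diagonal blocks; E[0..N-2]: subdiagonal blocks (E[i-1] sits in
        # block row i); F[0..N-2]: superdiagonal blocks; b[0..N-1]: right-hand sides.
        N = len(D)
        U = [None] * N
        L = [None] * (N - 1)
        U[0] = D[0].copy()
        for i in range(1, N):
            # L_{i-1} U_{i-1} = E_i  and  U_i = D_i - L_{i-1} F_{i-1}
            L[i - 1] = np.linalg.solve(U[i - 1].T, E[i - 1].T).T
            U[i] = D[i] - L[i - 1] @ F[i - 1]
        y = [None] * N                      # block forward substitution
        y[0] = b[0].copy()
        for i in range(1, N):
            y[i] = b[i] - L[i - 1] @ y[i - 1]
        x = [None] * N                      # block back substitution
        x[N - 1] = np.linalg.solve(U[N - 1], y[N - 1])
        for i in range(N - 2, -1, -1):
            x[i] = np.linalg.solve(U[i], y[i] - F[i] @ x[i + 1])
        return x

    # small check against a dense solve
    rng = np.random.default_rng(2)
    N, q = 4, 3
    D = [rng.standard_normal((q, q)) + 4 * np.eye(q) for _ in range(N)]
    E = [0.3 * rng.standard_normal((q, q)) for _ in range(N - 1)]
    F = [0.3 * rng.standard_normal((q, q)) for _ in range(N - 1)]
    b = [rng.standard_normal(q) for _ in range(N)]
    A = np.zeros((N * q, N * q))
    for i in range(N):
        A[i*q:(i+1)*q, i*q:(i+1)*q] = D[i]
        if i > 0:
            A[i*q:(i+1)*q, (i-1)*q:i*q] = E[i - 1]
            A[(i-1)*q:i*q, i*q:(i+1)*q] = F[i - 1]
    x = np.concatenate(block_tridiag_solve(D, E, F, b))
    assert np.allclose(A @ x, np.concatenate(b))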
‘The effectiveness of this course of action depends on such things as the ‘dimensions of the blocks and the sparsity patterns within each block, en Sera DA)[a]_ fe Baltal-[g] wm where D, and D, are diagonal and F, and £) are tridiagonal. Assume that each of these blocks is n-by-n ond that it is “safe” to solve (4.5.9) via (4.5.3) and (4.5.5). Note that Uw Dy {(dingonal) t= But (tridiagonal) Uz. = Di-InFi __(pentadingonal) nah v= b~Ex(Dz'n1) Um = Din = ww Fire ‘Consequently, some very simple n-by-n calculations with the original banded blocks renders the solution. On the other haod, the naive application of band Gaussian elimination to the system (4.5.9) would entail a great deal of unnecessary work and storage as the system has bandwidth n + 1. However, we mention that by permuting the rows and columns of the system via the permutation P= [etsenpty€ar-+-€ms lan) (4.5.10) 4.5, BuooK Systems 17 wwe find (in the n= 5 case) that PAPT = eocooeKxoxx eocoocexxx eoccooxxxex eoxoxxxcee xexxxoocce exxxexcocoe xxx ecoooce xx oxcoocee ‘This matrix has upper and lower bandwidth equal to three and go a very ressonable solution procedure results by applying band Gaussian elimine- tion to this permuted version of A. ‘The subject of bandwidth-reducing permutetions is important. See George and Liu (1981, Chapter 4). We also refer to the reader to Varah (1972) and George (1974) for further details concerning the solution of block (tidiagonal systems. 4.5.4 Block Cyclic Reduction We next describe the method of block cyclic reduction that can be used to solve some important special instances of the block tridiagonal system (45.1). For simplicity, we assume that A hes the form DF 0 < Rem (45.1) F FD where F and D are ¢-by-q matrices that satisfy DF = FD. We also assume that n = 2 — 1. These conditions hold in certain important applications ‘such a5 the discretization of Poisson's equation on a rectangle, In that situation, 4a 0 (45.2) 178 CHAPTER 4. SPECIAL LINEAR SYSTEMS and F = ~Iy. The integer n is determined by tho size of the mesh and can often be chosen to be of the form n = 24 ~ 1. (Sweet (1977) shows how to proceed when the dimension is not of this form.) ‘The basic iden behind cyclic reduction is to halve the dimension of the problem on hand repeatedly until we are left with a single q-by- system for the unknown subvector 2.-1. This system ia then solved by standard means. The previously eliminated 2; are found by a back-substitution process. ‘The general procedure is adequately motivated by considering the case nat: Day + Pay Po + Daz + For Pr + Dry + Fry Pry + Diu + Fos Fay + Dry + Fre Frs + Dag + Far Fre + Der (45.13) For i = 2, 4, and 6 we multiply equations i—1, i, and i+1 by F, —D, and F, respectively, and add the resulting equations to obtain (QF? Dijzy + Fay = F(t +03) ~ Fiz, + (2F?- Dixy + Frag = Flbs +55) - Fizg + (QF? D3)zq = Flbg + br) — ‘Thus, with this tactic we have removed the odd-indexed 2 and are left with reduced block tridiagonal system of the form Dry + Fay = PU + Dry + FY zg = FOxn + DMzy nue vrrsres onan ao ‘where DW = 2F? - D? and FO) = F? commute. Applying the same elim- ination strategy as above, we multiply these three equations respectively by FU), D0), and FO. When these transformed equations are added together, wo obtain the single equation (267 - DO) ag = FO (10 +409) - DO? which we write as DA xq = 6), ‘This completes the cyclic reduction. We now solve this (small) ¢-by-9 sys tem for x. 
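The recovery of the remaining x_i by back-substitution is described next; to see the two phases side by side, here is a hedged NumPy sketch of the whole reduce-and-recover cycle for the scalar case q = 1 (our own illustration with our own index conventions, verified against a dense solve; it is not the production form of the method, which needs the care about stability noted later in the section).

    import numpy as np

    def cyclic_reduction_scalar(D, F, b):
        # Solve the n-by-n tridiagonal system with constant diagonal D and constant
        # off-diagonal F (n = 2^k - 1) by cyclic reduction.  bb[i] holds b_i at the
        # deepest level that touches index i; the padding entries of bb and x are zero.
        b = np.asarray(b, dtype=float)
        n = b.size
        k = int(round(np.log2(n + 1)))
        assert 2**k - 1 == n
        bb = np.zeros(n + 2)
        bb[1:n + 1] = b
        Dlev, Flev = [float(D)], [float(F)]
        for p in range(1, k):                          # reduction phase
            h = 2**(p - 1)
            Dp, Fp = Dlev[-1], Flev[-1]
            for i in range(2 * h, n + 1, 2 * h):
                bb[i] = Fp * (bb[i - h] + bb[i + h]) - Dp * bb[i]
            Dlev.append(2 * Fp**2 - Dp**2)
            Flev.append(Fp**2)
        x = np.zeros(n + 2)
        x[2**(k - 1)] = bb[2**(k - 1)] / Dlev[k - 1]   # the final 1-by-1 system
        for p in range(k - 2, -1, -1):                 # back-substitution phase
            h = 2**p
            for i in range(h, n + 1, 2 * h):
                x[i] = (bb[i] - Flev[p] * (x[i - h] + x[i + h])) / Dlev[p]
        return x[1:n + 1]

    n = 15
    D, F = 4.0, -1.0
    A = D * np.eye(n) + F * (np.eye(n, k=1) + np.eye(n, k=-1))
    b = np.arange(1.0, n + 1)
    assert np.allclose(A @ cyclic_reduction_scalar(D, F, b), b)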
The vectors 22 and z¢ are then found by solving the systems DMs, = —Pe, Dx = PME 4.5, Block Systems 179 Finally, we use the first, third, fifth, and seventh equations in (4.5.13) to compute 21, 23, 7, and =r, respectively. For general n of the form n = 2*—1 we set D =D, PO = F, HO = and compute: for p= Iik—-1 FO) = [FOP De) = 2F) — [D-H =? for j= 12-71 (45.14) 92 = FO (a + Brera) — DEO end end ‘The 2; are then computed as follows: Solve DA-Nzq.-1 = for zp for p=k-2:~10 (45.15) = 8 sy, = FO (2aje + 214y-2y) end Solve Dz asap = ¢ for Zaj-14r end ead ‘The amount of work required to perform these recursions depends grestly upon the sparsity of the D®) and F©), In the worse case when theso matrices are fall, the overall Sop count has order log(n)q?. Care must be ‘exercised in order to ensure stability during the reduction. For further details, see Buneman (1969), Bxample 4.5.1 Suppowe q = 1, D = (4), and F = (1) in (4.14) and that we wiah to sive in “1 1 4-1 0 0 0 14-1 0 Oo O-1 4-1 0 o 0-1 4-1 ° ° BSSeoee Oo 1 4 = 0 0 0-1 180 CHAPTER 4. SPECIAL LINEAR SYSTEMS ‘By exocuting (4.5.18) we obtain the relucnd syetee: [4 2Ng)-(2] = {oe} = [2 ]f-me] pat 00 sn dint (8 ane SIE ST ane STE BID BIS aes er Cyclic reduction is an example ofa divide and conquer algorithm. Other divide and conquer procedures are discussed in §1.3.8 and §8.6. 4.5.5 Kronecker Product Systems If Be R™" and Ce RP, then their Kronecker product is given by bul be bin bu ba A= 8eC= bei ba = an Thus, A is an m-by-n block matrix whose (i,j) block is biyC. Kronecker ‘products arise in conjunction with various mesh discretizations and through- ‘out signal processing. Some of the more important properties that the Kronecker product satisfies include (Aeay(ced) = =Ac@BD (45.18) (A@By = AT @aT (45.17) (esy = AteBt (45.18) ‘where it is assumed that all the factor operations are defined. Related to the Kronecker product is the “vec operation: XG XeR™™ 6 wlX)=| 2 [eR™ Xn) ‘Thus, the vec of « matrix amounts to 0 “stacking” of its columns. It can be shown that Y=CXBT @ veel) =(B@C)vee(X). (45.19) 4.5. Buock Svsrems 1 It follows that solving a Kronecker product system, (Becz=4 is equivalent to solving the matrix equstion CXB™ = D for X where = vee(X) and d= vee(D). This has efficiency ramifications, To illustrate, suppose B,C € R%" are syinmetric positive definite. If A= BOC is treated a8 general matrix and factored in order to solve for z, thea O(n) flops are required since B @C € Fe'*"", On the other hand, the solution approach |. Compute the Cholesky factorizations B= GGT and C = HHT. 1 2. Solve BZ = D? for 2 using G. 3. Solve OX = 2" for X using H. 4, 2 = vee(X). involves O(n*) flops. Note that BeC =GG" @HHT =(GeH\GeHt is the Cholesky factorization of B@ C because the Kronecker product of a pair of lower triangular metrices is lower triangular. Thus, the above four- step solution approach is a structureexploiting, Cholesky method applied wo B@C. ‘We mention that if B is sparse, then B@C has the same sparsity at the block level. For example if H ix tridiagonal, then @C is block tridiagonal. Probleme PA.5:1 Show that o block diagonally dominant matrix is nonsingula. LSA Verify that (45.6) knplin (45.7) and (453). 4.5.3 Suppom block eye redvetion la applied with D given by (4.5.12) and F =~. ‘What ean you sxy about the band structure of the matrices FOV and DU? that arin? 4.54 Suppom A€R** ia nonsingular and that we have solutions to the lise ‘ontems Arab and Ay = 9 where ,g € R° are gives. 
Show how to solve the eyetem (# 2]0]-(31 00) fa whew 09 6 Bde dh ms of ie A Sinner Tr try ogg fe bch eon nope! ee CA ipcas pe tb cnn sn tf a tr ers P45.5 Verily (4.5.16)-(45.19). 48.1 Sh tw acon SVD of BOC to th SVDuot Bad. Pas 14,0. sndc we tion hens nena hat 4@8)06 = A8(88C) 182 CHAPTER 4. SPECIAL LINEAR SYSTEMS tnd wo wo Just write A@ B @ C for this matrix. Sbow how to vole the liner systese (A@B@C)x = smuming that A, B, and C are nytametie positive definite, Notes and References for See. 4.5 ‘The following papers provide insight into the various nuances of block matrix computae one 3M, Varah (1972). “On tha Solution of Block-Tiingooal Systema Ariging fom Cartan Finite Difference Equations.” Math. Comp. 26, 80-68. 1A, George (1574). *On Block Elimination for Sparan Linear Systema,” SEAM J. Num. “Anal 11, 585-603. 1 Pourer (1084), “Stnircann Matrices and Synterns,” SIAM Review 26, 1-71. (M.L. Merrasn (1985). "On the Factoriation of Blode Tridiagonala With Storage Con- ‘rainta.” SIAM J. Sci and Stat Comp. 6, 182-192. ‘he propery of ou digosl domnacs end its vviou imlations i the central D.G. Feingold and RS. Varga (1962). “Block Diagoaully Dominant Matrices and Gen- ‘zallsations of the Gershoria Circle Theorem,"Pecific J Math. 12, 141-50. ‘Early mathods that involve the idea of eyelic reduction are dencibed ia RW. Hockney (1965). “A Past Direct Solution of Poimos's Equation Using Fourier ‘Analysis, "J. ACM 18, 98-113, BLL. Busbes, G.H. Golub, and C.W. Nilson (1070). “On Direct Methods for Solving Poimon's Equations” SIAM J. Num. Anal. 7, 62796. ‘The accumulation of the right-hand side must be done with great car, for otherwise there would be a significant lom of accuracy. A stable way of doing this is described in ©. Bunemaa (1960). “A Compact Non-tterative Polaon Salve,” Report 294, Stanford ‘University Tnaitute for Plasma Research, Stanford, Calforaia. Other tnerature concerned with cyelic reduction includes B.W. Dorr (1970). “The Direct Solution of the Discrete Polson Equation on « Rectan- te” STAM Review 18, 248-63. BLL. Busbeo, F.W. Dorr, J.A. Georg, and G.H. Golub (1971). “The Dicect Solution of the Discrete Potton Equation on Irregular Regions," SIAM J. Numa. Anal 8, 722-38. FLW. Dorr (1973). “The Direct Solution of the Discrete Pola Equation in O(n?) Operations," SIAM Review 15, 412-416. P. Conca aod GH. Golub (1973). “Uno of Fast Direct Methods fo the Eiiieat Nu- eral Solution of Noneparble litt: Buatiban” STAM J. Nem Anal. 10, 1103-20. BLL. Busbes and F.W. Docr (1974). “The Direct Solution ofthe Biharmonic Equation ‘on Rectangelar Regions and the Poison Equation on Uregular Ragina."STAM J. Man Anal, 11, 733-63. ‘D. Heller (1070). “Some Aspecta ofthe Cyelic Reiuction Algorithm fr Block Tidingonal Linear System" S7AM J. Num. Anal 13, 424-06. ‘Various generalizations and extensions to cyclic reduction have boa proposed: PIN, Swarstrauber and FLA. Sweat (1973). “The Direct Solution ofthe Discrete Poisson Equation on a Disk” SIAM J. Num. Anal. 10, 900-907. 4.6. VANDERMONDE SYSTEMS AND THE FFT 183 R.A. Swost (1074), “A Generalised Cyclic Reduction Algorithm” STAM J. tem. Anal 11, 306-20. MA. Diamond and D.LLV. Ferreira (1978). "On a Cyclic Reduction Method for the ‘Solution of Poimoa's Equation,” STAM 4. Hum. Anal. 13, 4-70. HLA. Sweet (1977). “A Cyellc Reduction Algorithm for Solving Block Tiidiagonal Sys temas of Arbitrary Dimension.” SIAM J. Num. Anal. 1f, TO6-20. PN, Swarrrauber and FL Sweet (1989). “Vector and Parallel Method forthe Direct Solution of Poimon'a Equation,” J. Comp. 
Appl. Math. £7, 241-265. ‘S. Bopdall and W, Gander (1004). “Cycle Radoction for Special Tridiagonal Systems.” STAM J. Matriz Anal. Appl. 15, 321-330. Fox certain matrices thet aise in conjunction wth elliptic partial difeeatial equations, lock elimination ‘to rather nataral operations oa the underlying mesh. A damical example of this the method of need disection deecribed in |. George (1973). "Nested Disection of « Regular Finite Blemext Mesh,” STAM J. Mune Anal. 10, 45-63, ‘Wo also mention the folowing general survey: LR, Busch (1976). “Block Methods for Solving Sparoe Linear Systoms,” in Sparse “Metra Computations 1.R. Bunch axd DJ. Rose (eds), Acedemic Pros, New York. Bordered linear systema as presented in Pa.S.4 ae dicued in W. Coverta and J.D. Pryee (1990), “Block Blimisation with One Iterative Refinement Solves Border Linear Systems Accurately.” BIT $0, 450-507. W. Govaerts (1991). “Stable Solvers and Block Elimination Tor Bordered Systems," SIAM J. Matris Anal Appl 18, 460-453. \W. Gowserta and J-D. Pryce (1993). "Mixed Block Blimination for Linene Systems with Wider Border," 1MA'J. Num, Anal. £3, 161-180. Kroveckar product references include H.C. Andrews and J. Kane (1970). “Kronecker Matrices, Computer Implementation, ‘and Generallaed Spectra” J. Assoc. Comput. Mach. 17, 260-268. (C.de Boor (1979). “Eficient Computer Manipulation of Tenaor Products," ACM Trans. Math. Soft. 5, 173-182. ‘A. Graham (1981). Kronecker Products and Matriz Calculus with Applications, Ellis Horwood Lid., Chichester, England, HV, Henderson and $.R. Seare (1981), “The Voe-Permitation Matrix, The Vee Opers- tr, and Kroner Products: A Raview,” Linear and Mulssncar Algebra 8, 271-288. PA, Rapala and 8, Mitra (189). “Kronecker Products, Unitary Matrices, and Signal Processing Applications,” SIAM Review 31, 586-613. 4.6 Vandermonde Systems and the FFT Supposez(Oin) € RP*. A matrix V € RO*(°+) of the form 184 (Chapter 4. SPECIAL LingaR SYSTEMS Is said to be a Vandermonde matriz. In this section, we show how the systems V7a = f = f(0in) and Vz = b = Orn) can be solved in O(n?) flops, ‘The discrete Fourier trensform is brieSy introduced. This special and ‘extremely important Vandermonde eystem has aa recursive block structure and can be solved in O(nlogn) flops. In this section, vectors and matrices ‘are subscripted from 0. 4.6.1 Polynomial Interpolation: VT Vandermonde systems arise in many appraximation and interpolation prob- Jems. Indeed, the key to obtaining a fast Vandermonde solver isto recognize that solving V7a = f is equivalent to polynomial interpolation. ‘This fol- lows because if V7a = f and Piz) = ajo (4.6.1) j= then pled) = fi for i= On. Recall that if the 2, are distinct then there is » unique polynomial of degree n that interpolates (0, fo),-.-. (tm, fa)- Consequently, V is noa- singular as long os the 2 are distinct. We ssoume this throughout the section. ‘The frst stop in computing the a, of (4.61) isto calculate the Newzon representation of the interpolating polynomial p w= Sa (fle-») . (4.6.2) ‘The constants cy are divided differences and may be determined as follows: ef0:n) = f(0:n) for k=O —1 for i= (463) = (4 — ena) (ai ~ Zine-a) end ond ‘See Conte and de Boor (1980, chapter 2). ‘The next task is to generate a(0:n) from (0:n). Define the polynomials Po(z) by the iteration Pe(2) = cx + (2 — 2e)Pa4i (2) 4.8. VANDERMONDE SySTEMS AND THE FET 185, and observe that po(z) = p(z). 
Writing mlz) = af" + oft), 2+--- altar and equsting Uke powers of = in the equation Py = ch +(2—4)pu11 gives the following recursion for the coefficients al": al”) = em ~ mali? = o can be calculated as follows: (484) =a, 40441 end Combining this iteration with (4.6.3) renders the following algorithm: Algorithm 4.6.1 Given z(0:n) ¢ R"** with distinct entries and f = ‘f(0:n) € K++, the following algorithm overwrites f with the solution a = ‘a(0:n) to the Vandermonde system V(z0)....20)"a = f. for k n= 1 for i=ni—L:k+1 ag 1097 FO = 16= 1) ~266~ =D) end for k= n= 1-10 for i= kn =1 16) = (8) 14 + Yate) end end ‘This algorithm requires 5n?/2 flops. Example 4.8.1 Suppose Algorithm 4.6.1 ia used to solve [2212-13 186 (CuapTeR 4. SPECIAL LINEAR SYSTEMS ‘The first f-loop computes the Newton representation of p=): P(e) = 10-4 16(2 ~ 1) +82 = 19 ~ 2) + (= = We - 2}le— 9). ‘Teo second nloop computes a = [4323]F from {10 16 81/7. 46.2 The System Vz = b Now consider the system Vz = b. To derive an efficient algorithim for this problem, we describe what Algorithm 4.6.1 does in matrix-vector language. Define the lower bidiagonal matrix L4(a) € RO"*)*+0 by Ee(a) = and the diagonal matrix Dy by Dy = ding( 1-1 2e¢1 — 20)--1F 0 — wa With these definitions it is easy to verify from (4.6.3) that if f = f(0m) and c = ¢{(0:n) is the vector of divided differences thea c= UT f where U is the upper triangular matrix defined by UF a Dehli (l) +++ Dol). Similarly, from (4.6.4) we have antl, ‘where L is the unit lower triangular matrix defined by: Fm Lg(t0)7 +++ Ln—1(n—1)™. Thus, a = L7UTf where V-T = LTUT. In other words, Algorithm 4.6.1 solves V7a = f by tacitly computing the “UL” factorization of V-. ‘Consequently, the solution to the system Vz = b is given by z= V"'b = UL) = (Lol)? DG" + Lqna(1)”Dg2s) (Em a(Fa-1) ++ Lolz0)®) 4.6, VANDERMONDE SySTEMS AND THE FFT 187 ‘This observation gives rise to the following algorithm: Algorithm 4.6.2 Given z(0:n) € R'*" with distinct entries and 6 ‘H{0:n) € e+, the following algorithm overwrites b with the solution x = 2(0:n) to the Vandermonde system V(zp,...,24)z = b. 268) = (8) — 20046 — 1) end end for k=n—3:-10 for i= b+ lin Hi) = WA) /(o(0) ~ =k) end fori=kn-1 i) = Mi) - i +1) end end ‘This algorithm requires 5n?/2 flops. Example 6.6.2 Suppose Aigorithm 46.2 is used to solve [ i (2) [3] aaa alfa] [a 14g ella 3 reaalls s a ° santana | “| = | ~% = . "The sero boop then calculation ° 3 Lot)? Dg'Li (1)? Dy Lal)® Dy" 4 = [3] % 1 4.6.3 Stability Algorithms 4.6.1 and 4.6.2 are discussed and analyzed in Bjérck and Pereyra (1970). Their experience is that these algorithms frequently produce wur- Drisingly accurate solutions, even when V is ill-conditioned. They alto show how to update the solution when a new coordinate pair (tn+i, fu+t) 188, Cuaprer 4. SPECIAL LINEAR SYSTEMS is added to the set of points to be interpolated, and how to solve confluent Vandermonde systems, ie. systems involving matrices like 1 1 4 4 ARR. AALS A V = V(se.s1.21,03) = 3 4.6.4 The Fast Fourier Transform ‘The discrete Fourier transform (DFT) matrix of order n is defined by Fastin) fe=ott Wn = exp(—2ni/n) = cos(2n/n) — i-sin(2x/n), ‘The parameter win is an nth root of unity because uf} = 1. Ia the n = 4 case, wy = ~i and ,oaid i-l i lola dost nt 4]. wt A= he. SEL If 2€C%, then its DFT is the vector Fyz. The DFT bas an extremely important role to play throughout applied mathematics and engineering, If n is highly composite, then it is possible to carry out the DFT in many fewer than the O(n?) 
flops required by conventional matrix-vector multiplication. To illustrate this we set n = 2 and proceed to develop ‘the radis-2 fast Fourier transform (FFT). The starting point is to look at on even-order DFT matrix when we permute its columns so that the ‘evpn-indexed columns come fret. Consider the case n = 8 Noting that OP = uti 088 we have Da ria lw wut wh lw wt ot wt anlt a wn 1 of w Ph oF 1 wut wt 1 ot Put wo 4.6. VANDERMONDE SYSTEMS AND THE FFT 189 I we define the index vector ¢ = [02 46135 7, then ‘The lines through the matrix are there to belp us think of Fa(:e) a8 a 2-by-2 matrix with 4-by-4 blocks, Noting thet u? = uf = wg we see that aooer= [Ra where = coke econ ef,co Secce It follows that if 2 in an 8-vector, then Fur = Fleja() = & uf} [0 | =MPCY [RC % ~ Grete] Feeten] “Thus, by simple scalings we can obtain the &-point DFT y = Faz from the Apoint DPTs yr = Fez(0:2:7) and yp = Fax(1:27): WO8) = yr+deye WAT) = yr —deyp. and “.*" indicates vector multiplication. In general, if n = 2m, then y = Fux is givea by ylOm—2) = ytdeye umn =1) = ye-deye 190 Cuapren 4. SPECIAL LinzAR SYSTEMS where a= wr = ve = For n = 2! we can recur op this process until n = 1 for which Fyz = x: function y = FFT(z,n) ifn=l yer else m= n/% w= erm vr = FFT (z(0:2:n),m); yp = FFT (2(1:2:n),m) d= (wu JP ote [222] end ‘This is a member of the fast Fourier transform family of algorithms. It has @ nonrecursive implementation that is best presented in terms of a factorization of Fy. Indeed, it can be shown that Fy = A:---APy where Ap= hOB, L=2,r=n/L with [42 Mp = ding(wy,...,u!™ Bem [Te Bt) and Daya atte ‘The matrix Py is called the bit reversal permutation, the description of which we omit, (Recall the definition of the Kronecker product “@” from ‘Note that with this factorization, y = Fy can be computed as te n/L (465) I @ Buz ‘The matrices Ay = (Z-@B1) have 2 nonzeros per row and it is this sparsity thet makes it possible to implement the DFT in O(nlogn) fope. In fact, ‘careful implementation involves Sn logg n flops. ‘The DFT matrix has the property that -1F,. (4.6.6) 4.6. VANDERMONDE SYSTEMS AND THE FFT 191 ‘That is, the inverse of Fy is obtained by conjugating its entries and scaling by n. A fast inverse DFT can be obtained from a (forward) FFT merely by replacing all root-of-unity references with their complex conjugate and scaling by n at the end. ‘The value of the DFT is that many “hard problems” sre made simple by transforming into Fourier epace (via F,). The sought-after solution is then obtained by transforming the Fourier space solution into original coordinates (via Fz"). Problems 461. Show that iV = V(zo,-..,q)s thea dev) = [] s-sp. neo iee (Gautachi 197Sa) Verity the following inequality fo the m = 1 case above: [quality cenit f the 2; ar all on te ana ray i the comple plane 748. Suppone w= [tne wfynnns ul!" | where = 2. Using colon notation, cxprest bored. ot*] 0 a mabvctor of wy where r= 2, 9 = Lt PRGA Prove (456). 48.5 Expand the operation s = (F@ Bz)s in (46.5) into a double loop and court the numberof ope reid by your implementation. (Ignore the details of 2 = Paz. 740.8 Suppesen = 3m and examine @ = [FG,08n—1) FA(LEN~ 1) FyC.23M—1)] ss by-3 block matic, looking for sealed copie of Fn, Bad oo what 0% find, develop recurve radi} FFT analogous othe radix? mplerentation i tbe vet. Notes and References for See. 4. (Our discussion of Vandermonde linear systems is drawn from the papers [A Bjeck and V. Pereyra (1970) “Solution of Vandermonde Syatems of Equations.” Math Comp. 24, 803-08. ‘A. 
Bjorek and. Elving (1973). “Algorithm for Confuent Vandermonde Systems.” Numer. Math. 21, 130-37. ‘The divided diffrence computations we ciacumed are detailed in chapter of SD. Conte and C. de Boor (1990). Blementary Numerical Analysis: An Algorithmic “Approach, rd wd., McGraw-Hill, New York. 192 CHAPTER 4, SPECIAL LINEAR SYSTEMS ‘The latter rference includes an Algol procedure, Error analyees of Vandermonde rater solvers include [NA Highasn (1987). *Error Analysis of the Biérck-Pereyra. Algorithme for Solving ‘Vandermonde Sywtams,” Numer. Math. 50, 81-632. NJ, Higham (19884). "Past Solution of Vandermonde Systems Invelviag Orthogonsl Polyaomial” IMA J. Nam. Anak 4, 473-486. NJ. Higham (1900), “Stability Analysis of Algorithras for Solving Confluent Vandermonde likn Systema,” STAM J. Matis Anal. Appl 1, 23-4) ‘S.C, Bartele and DJ. Higham (1002). “The Structured Sensitivity of Vandermonde Like ‘Systane,” Numer. Math. 68, 17-34 JM, Varah (1983). “Errors and Pervurbation i Vandermonde Systems,” [MA J. Num. ‘Anal. 13, 1-12. Intererting theoretical remita concerning the conditio of Vandermonde aystoms may be Sound ia \W. Gautochi (1975a). “Norm Batimates for Inversn of Vandermonde Maticen,” Numer. Mack 23, 337-47. ‘W. Gautoehi (19750). “Optimally Conditioned Vandermonde Matrices,” Numer. Math. 4, 1-1. “The basic algorithms presented can be extended to cover confluent Vandermonde sys tema, block Vandermonde systems, and Vandermonde systema that are based on other polynomial bases G. Galimberti and V. Pereyra (1970). "Numerical Diferetiation aad the Solution of Multidimensional Vandermonde Syatems." Math Comp. £4, 87-4, G. Galimberti and V. Pereyra (1971). “Solving Confuent Vandermonde Systana of Hermitian Type.” Nemer. Math 18, 4-60. H. Van de Vel (2977). "Numerical Trstment of « Generalized Vandermonde systems of Byuatione,” Lin. Alp. and lls Applic. 17, 149-74. G.AL Golub and W-P Tang (1961). "The Block Decomposition of « Vandermonde Matrix ‘and Its Application, BIT 21, 505-17. 1D. Calvetti and L, Reichel (1902). “A Chsbychev- Vandermonde Solve,” Lin. Als. ond ‘ha Applic. 178, 219-229. 1D. Calvetti aad L. Reichel (1903). "Fast Inversion of Vandermonde-Like Matrices Ire volving Orthogonal Polyzomials” BIT 33, 473-484, 4H. Lu (1904). “Past Solution of Confluent Vandermonde Linear Systema” SIAM J. Matrés Anal Appl 16, 1277-1289, H. Lu (1996). “Solution of Vandarroonde- ike Syateme and Confivent Vandermonde like Systema" STAM J. Matrix Anal. Appl 17, 127-138. ‘The FFT Uerature i very extensive and wattered. For an overview of the ares couched la Kronecker product notation, se CF, Van Loan (1992). Computational Prumevorks for the Past Fourier Transform, SIAM Publications, Philadelphia, PA. ‘The point of view n this vet is that diferent FFTs correspond 10 diferent factoriaations of the DFT matrix. Thane are spare fectocimtions in that the factors Rave very few ‘onseroe per row. 4.7, Togpuira AND RELATED SYSTEMS 193 4.7 Toeplitz and Related Systems Matrices whose entries aze constant along each diagonal arise in many ap- plications and are called Toeplitz matrices. Formally, T € F*" is Toeplitz UF there exist scalars rmgty.s+sPoy2--yFamt such that ogy = 75-5 for all) tnd j. Thus, ron om oma 3176 pelt m nml_l[a siz Ta te fo om 0431 res tea te 70 9043 is Toeplitz, ‘Toeplits matrices belong to the larger class of persymmetric matrices. We say that B ¢ Re" is persymmetric iit symmetric about its northeast- southwest diagonal, 1c. bij = be-j4iniet for all j and j. 
This is equivalent to requiring B = EBTE where E = [en,...,¢1] = Fal: 1:1) is the eby-n exchange matréz, ic, coon _|oo10 F-lordo 1000 It is easy to verify that (a) Toeplitz matrices are persymmetric and (b) the inverse of a nonsingular Toeplitz matrix is peroymmetric. In this section we show how the careful exploitation of (b) enables us to solve Toeplitz systems with O(n) flope. The discussion focuses on the important case when Tis also symmetric and positive definite. Unsymmetric Toeplitz systems and connections with circulant matrices and the discrete Fourier transform are briefly discussed. 4.7.1 Three Problems Asoume that we have scalars r),...,7_ such that for k= 1m the matrices Loot os ren teed nok Tht Thea met haa tare positive definite. (There i no loss of generality in normalizing the diagonal.) Thre algorithms are described ia this wtlon 194 CHAPTER 4. SPECIAL LINEAR SvSTEMS fris-esetal? «© Levinson’s algorithm for the general righthand side problem Ty = b. ‘© ‘Trench’s algorithm for computing B = Tz. In deriving these methods, we denote the k-by-k exchange matrix by Ex, ive, Be = Ja: ~ 1:1). ‘© Durbin’s algorithm for the Yule- Walker problem Tr 4.7.2 Solving the Yule-Walker Equations We begin by presenting Durbin’s algorithm for the Yule-Walker equations ‘which arise in conjunction with certain linear prediction problems. Suppose for some k that satisfies 1 Sat, 7 7 bac + 5 (ay = Uanjtas) ‘This indicates that although B is not persymmetric, we can readily compute ‘an element b,, from its reflection across the northenst-southwest axis. Cou- pling this with the fact thet A“? is persymmetric enables us to determine B from ita “edges” to its “interior.” Because the order of operations is rather cumbersome to describe, we preview the formal specifcstion of the algorithm pictorially. To this end, ‘soume that we know the last column and row of T!: wud T= weeee weeee weeee weeeee weeese Here u and k denote the unknown and the known entries respectively, and n= 6. Alternately exploiting the persymmetry of T;;! and the recursion (4.7.5), we can compute B, the leading (n ~ 1)-by-(n— 1) block of Ty}, as follows: arp 1 d es meerem miadem nieeem weeeen penne meeeem wreeem wrtaem aeaeee arene 4.7. Toertirz AND RELATED SYSTEMS 199 RERRRE bee RRE kRERERKER RARER wom |e uuk klarp|E ku k wR Tw lk kuuwk kl} [kek RRR kRERERE kkRRRR ReRREE kkeRRE keEERR keERKE pene fk RRR KR we [RR RRR E kkeERRE kKEERER Of course, when computing a matrix that is both symmetric and persym- metric, such as T', it is only necessary to compute the “upper wedge” of the matrix—eg, With this last observation, we are ready to present the overall algoritiim, Algorithm 4.7.3 (Trench) Given real numbers 1 = ry,71,---,7n Such that T = (ry-y)) € RO*" is positive deinit, the following algorithm com- putes B = T;!. Only those 6, for which i n and b€ R™. The most relisble solution pro- cedures for this problem involve the reduction of A to various canonical forms vie orthogonal transformations, Householder reflections and Givens rotations are central to this process and we begin the chapter with a discus- sion of these important transformations. In §6.2 we discuss the computation of the factorization A = QR where Q is orthogonal and 2 is upper trian- gular. This amounts to finding an orthonormal basis for ran(A). ‘The QR. factorization can be used to solve the full rank leest squares problem os we show in §5.3. The technique is compared with the method of normal equa- tions after a perturbation theory is developed. 
In §5.4 and §5.5 we consider ‘methods for handling the difficult situation when A is rank deficient (or nearly 90). QR with columa pivoting and the SVD are featured. In §5.6 we discuss several steps that can be taken to improve the quality of a computed 4.7. ‘ToEpurrz aNp RELATED SysTEMs 205 M, Stewart and P. Vas Doocen (1906). “Stability arson inthe Factorination of Strvc- tured Matrices," STAM J. Matriz Anal. Appl 18, to appeat. ‘The original referenos forthe thre algorithms dencribed in thin section are: 4, Durbia (1960). “The Fitting of Time Serie Model” Rew. Inst int Stat. £8 233-43. 1. Levingoa (1947). “Tae Weiner RMS Error Criterion in Fiter Design and Prediction,” J Math. Phys. 26, 261-78. WE. Trench (1964). “An Algorithm for the lnvenion of Finite Tooplits Matrices,” J SIAM 12, 15-22 [A more detalled description of the noneyzometric Trach algoithan i given in 5. Zohar (1960). “Toeplits Mate Inversion: The Algorithm of W.P. Treach,” J. ACM 16, 692-401. Fast Tooplts system solving hoe attracted en enormous smoust of attention and a samn- pling of iterating elgorithesic kde may be found in G. Ammar and WB. Gragg (1988). “Superfast Solution of Real Poultive Definite ‘Toeplts Syatemmn” SIAM J. Matris Anak Appl 8, 61-70. ‘TF. Chan snd P. Henman (1992). “A Look- Ahead Levinson Algorithm for Indefinite ‘Tooplits Syatema.” STAM J. Matréz Anal. Appl 13 490-806. D.R. Sweet (1998). “The Use of Piveting to Imprive the Numerical Performance of “Algorithme for Toeplits Matticen,” STAM J. Mair Anal Appl. 14, 468-493 'T-Kallath and J. Chun (1994). “Genaralizad Structure or Block-Toeplits, Teel Block, and Toop Drved Mato” SIAM J Mater Ana Aral 15, ‘T Kallath and A.H. Sayed (1998), “Dieplacemeat Structure: ‘Theory and Applications.” SIAM Review 37, 257-388, Important Toeplits matrix applications are dlacumed is 4. Makhoul (1978). "Linear Prediction: A Tutocial Review,” Proc. IEEE 63(4), S61-80. 4. Masta un A: Gry (08) Linear Pratcton of Spr Springs Vero, Bein eed AL, Oppenheim (1978). Applications of Digital Signal Processing , Prestice Hall, Ea- ‘Bewood Clif Hankel matric are constact soos their antidiagoonls and arise in erveral important G. Heng aod P, Jankowski (1990). Parallel and Superfast Algorithms for Hankal Systems of Equations” Numer. Math. 58, 100-171. RW, Preund and B. Zha (1998). “A Look-Abaad Algorithm for the Soltion of General Hankel Syren” Murer. Math, 64, 285-522 ‘The DFT/Toplits/cireulant connection is dierumed in CF. Van Loan (1962). Computational Pramesorks for the Fast Fourter ‘Transform, SIAM Publeations, Philadephia, PA. Chapter 5 Orthogonalization and Least Squares §§.1 Householder and Givens Matrices 45.2 The QR Factorization §5.3 The Full Rank LS Problem §5.4 Other Orthogonal Factorizations §5.5 The Rank Deficient LS Problem §5.6 Weighting and Iterative Improvement §5.7 Square and Underdetermined Systems ‘This chapter is primarily concerned with the least squares solution of overdetermined systems of equations, Le. the minimization of | Az — bl, where A € R™" with m > n and b€ R™. The most relisble solution pro- cedures for this problem involve the reduction of A to various canonical forms vie orthogonal transformations, Householder reflections and Givens rotations are central to this process and we begin the chapter with a discus- sion of these important transformations. In §6.2 we discuss the computation of the factorization A = QR where Q is orthogonal and 2 is upper trian- gular. This amounts to finding an orthonormal basis for ran(A). ‘The QR. 
factorization can be used to solve the full rank leest squares problem os we show in §5.3. The technique is compared with the method of normal equa- tions after a perturbation theory is developed. In §5.4 and §5.5 we consider ‘methods for handling the difficult situation when A is rank deficient (or nearly 90). QR with columa pivoting and the SVD are featured. In §5.6 we discuss several steps that can be taken to improve the quality of a computed 207 CHAPTER 5. ORTHOGONALIZATION AND LEAST SQUARES least squares solution. Some remarks about square and underdetermined systems are offered in $5.7 Before You Begin ‘Chapters 1, 2, and 3 and §§4.1-43 are assumed. Within this chapter there are the following dependencies: $5.6 t $l — $2 3 + HA = HS 1 $57 ‘Complementary references include Lawson and Hanson (1974), Farebrother (1987), and Bjérck (1996). See also Stewart (1973), , Hager (1988), Stewart ‘and Sun (1990), Watkins (1991), Gill, Murray, and Weight (1991), Highaza (1996), Trefethen and Bau (1996), and Demme! (1996), Some MaTLAs Functions important to this chapter are qr, svd, pinv, orth, rank, and the “backslash” operator “\.” LAPACK connections include LAPACK: Householder/Givens Tools erates » Houseboler mairix Housshalder times matrix ‘Small n Hounsbolder time matrix Block Householder tenes matrix Computes 1 ~ VV! block rfector repemeneation ‘Generates » plane rotation Gonerater« vector of plane rotations Applic a vector of plane rotations to a vector pair ‘Applies roetion vequence to matrix Real rotation times coropex veetor Pair ‘Complex ration (¢ real) times complex veetor pair Complex rotation (x real) ten complet vector Bait GARbEEEGEEEE ‘LAPACK: Orthogonal Factorizations Ta=Qk— ‘4n=QR @ (fecored Goren) tte matrix (eal case} Q (factored form) tres matrix (complex case) Generates Q (cea came) Genera Q ae) = RQ = (upper eiaorls) ‘A= Qk = (orthogonal) lower triangular) ‘4A= 12 = (ower trangulas)(orbogonal) ‘A= RQ where A is upper trapeidal A= ULVT anand i ii i i in | 208 CHAPTER 5. ORTROGONALIZATION AND LEAST SQUARES ‘ceuss | SVD solution to min | AX = B lp .cE18% | Complete orthogonal decomposition solution 10 min | AX ~ Bly. snatrie to reduce coadiion 5.1 Householder and Givens Matrices Recall that Q € IR" is orthogonal if Q™Q = QQT = In. Orthogonal ‘matrices have an important role to play in least squares and eigenvalue computations, In this section we introduce the key players in this game: Householder reflections and Givens rotations. 5.1.1 A 2-by-2 Preview It is instructive to examine the geometry associated with rotations and reflections at the n = 2 level. A 2-by-2 orthogonal matrix Q is a rotation if it has the form g=[_ 2 ao —sin(@) con(@) Ify =QPz, then y is obtained by rotating x counterclockwise through an angle 0. ‘A 2-by-2 orthogonal matrix Q is a reflection if it has the form cos(8) —_sin(4) ‘sin(@) —cos(@) J Ify = QT 2 = Qr, then y is obtained by reflecting the vector 2 across the line defined by - -cos(9/2) sme {073} }. Reflections and rotations are computationally attractive because they are easily constructed end because they can be used to introduce zeros in a vector by properly choosing the rotation angle or the reflection plane. ‘Example 5.1.1 Suppose z= (1, V3/7. If we set aa [ rm cen) [op a2 wane) coor) | = | vie “aye 2, 0], Thus, 8 rotation of -60° zeros the second component of x. If o= (25) 2] - [42 23] Qe n(30P)—cou(30") ya ~via then QT = (2, 07. Thun by reflecting 2 scroa the 30° Une we can zero its meond ‘component. 
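Example 5.1.1 is easy to reproduce numerically. In the NumPy fragment below (our own sketch, not the text's), c and s are read off from the components of x rather than from the angle; the rotation and the reflection are built in the forms displayed above and both map x to [2, 0]^T.

    import numpy as np

    x = np.array([1.0, np.sqrt(3.0)])
    r = np.linalg.norm(x)                  # = 2

    # rotation Q = [[c, s], [-s, c]]; y = Q^T x rotates x through -60 degrees
    c, s = x[0] / r, -x[1] / r
    Q = np.array([[c, s], [-s, c]])
    print(Q.T @ x)                         # approximately [2, 0]

    # reflection across the 30-degree line, in the form [[c, s], [s, -c]]
    c, s = x[0] / r, x[1] / r              # cos(60 deg), sin(60 deg)
    P = np.array([[c, s], [s, -c]])
    print(P @ x)                           # approximately [2, 0]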
5.1, HoUusEHOLDER AND Grvens Marnicss 209 5.1.2 Householder Reflections Let v € IR" be nonzero. An n-by-n matrix P of the form Pel- Font 61.1) is called a Householder reflection. (Synonyms: Householder matrix, House- holder transformation.) The vector v is called a Householder vector. If a vector 2 is multiplied by P, then itis reflected in the hyperplane span(v}. It is easy to verify that Householder matrices are syminetric and orthogonal. Householder reflections are similar in two ways to Gauss transforma tions, which we introduced in §3.2.1. They ore rank-1 modifications of the identity and they can be used to zero selected components of « vector. In particular, suppose we are given 0 = € IR” and want Pz to be a multiple of et = In and Pz € span(e:} imply v € span{z,¢)}. Setting v = = + acy gives, wz=aTztan and vl y = 272+ lar +a, and therefore 7, 2. ata oa Pra (\-2ppaearat)? er In onder for the cosficent of tobe zero, we set a = +f 2a for then wh vazelzlhe > Pr= (1-2) 2=F12he. 612) It is this simple determination of v that makes the Householder relection ‘80 useful. Example 5.1.2 Ifs=(3, 1,5, 1]? and v=(9, 1, 5, 1]7, then m0 8 9 3 A$ % -5 w “1 8 has the property that Pz =[—6, 0, 0, 0, 7. 210 CHAPTER 5. ORTHOGONALIZATION AND LEAST SQUARES 5.1.3 Computing the Householder Vector ‘There are a number of important practical details emociated with the deter- mination of a Householder matrix, L., the determination of a Householder vector. One concerns the choice of sign in the definition of v in (5.1.2). Setting nea-ieih ‘haa the nice property that Pz is # positive multiple of ¢,. But this recipe is dangerous if zis close to a positive multiple of e, because severe cancelation would occur. However, the formula Lely = BozH Lett te) 2° Flaite suggested by Parlett (1971) does not suffer from this defect in the 2 > 0 case. In practice, it is handy to normalize the Householder vector 20 that ‘v(1) = 1. This permits the storage of u(2:n) where the zeros have been introduced in z, i.e., 2(2:n). We refer to v(2:n) ag the essential part of the Householder vector. Recalling that = 2/v"v and letting length() specify vector dimension, we obtain the following encapsulation: a= Algorithm 5.1.1 (Householder Vector) Given z € R®, this function computes v ¢ R® with v(1) = 1 and 8 €R such that P = Iq ~ Bev" is orthogonal and Pz = 2 [l,¢1 function: [v,A] = house(z) n= longth(z) on er v=[.d [ scm ife=0 pao p= Valero if (2) <=0 wl) = 20) ~ else 4 aes) +8) on B= 2v(t)?/(o + ¥(1)4) v= jet) end else ‘This algorithm involves about 3n flops and renders s computed Householder ‘matrix that is orthogonal to machine precision, a concept discussed below. 5.1, HOUSEHOLDER AND GIveNs MaTRICEs au A production version of Algorithm §.1.1 may involve a preliminary scaling of the # vector (2 + 2/2) to avoid overflow. 5.1.4 Applying Householder Matrices Its critical to exploit structure when applying a Householder reflection to matrix. If A € IR" and P= I~ Bu ¢ R™", then PA= (I - fw?) A=A-weT where w= AAT». Likewise, if P=1— fou? eR", then AP = A(I~ Bw?) = Awe where w= Av. Thus, an m-by-n Householder update involves » matrix vector multiplication and an outer product update. It requires 4mmn flops. Failure to recognize this and to treat P as a general matrix increases work by an order of magnitude. Householder updates never entail the explicit formation of the Householder matriz. Both of the above Householder updates can be implemented in a way that exploits the fact that v(1) = 1. 
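A hedged NumPy transcription of Algorithm 5.1.1 and of the structured update PA = A - (beta A^T v) v^T follows (our own sketch; the helper names are ours, and the sign convention of the algorithm makes Px a positive multiple of e1, so applying it to the vector of Example 5.1.2 gives [6, 0, 0, 0]^T rather than the [-6, 0, 0, 0]^T obtained there with the opposite choice of v). Note that house returns v with v(1) = 1, the normalization just introduced.

    import numpy as np

    def house(x):
        # Householder vector: returns (v, beta) with v[0] = 1 such that
        # (I - beta*v*v^T) x is a multiple of e_1; Parlett's formula is used to
        # avoid cancellation when x is close to a positive multiple of e_1.
        x = np.asarray(x, dtype=float)
        sigma = x[1:] @ x[1:]
        v = np.empty_like(x)
        v[0] = 1.0
        v[1:] = x[1:]
        if sigma == 0:
            beta = 0.0
        else:
            mu = np.sqrt(x[0]**2 + sigma)
            v0 = x[0] - mu if x[0] <= 0 else -sigma / (x[0] + mu)
            beta = 2 * v0**2 / (sigma + v0**2)
            v[1:] /= v0
        return v, beta

    def apply_house_left(A, v, beta):
        # P A = A - v (beta * A^T v)^T : one matrix-vector product plus a rank-one
        # update (about 4mn flops); P is never formed explicitly.
        w = beta * (A.T @ v)
        return A - np.outer(v, w)

    x = np.array([3.0, 1.0, 5.0, 1.0])         # the vector of Example 5.1.2
    v, beta = house(x)
    print(x - beta * (v @ x) * v)              # approximately [6, 0, 0, 0]

    A = np.arange(12.0).reshape(4, 3)
    P = np.eye(4) - beta * np.outer(v, v)
    assert np.allclose(apply_house_left(A, v, beta), P @ A)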
This feature can be important in the computation of PA when m is small and in the computation of AP when nis small. ‘As an example of a Householder matrix update, suppose we want to overwrite Ae RE" (m > n) with B = Q7A where @ is an orthogonal matrix chosen so that B(j + I:m, j) = 0 for ome j that satisfies 1 fal r= alt = 1/VTFAE cm or else r= bo; c= 1/VTFTe ead end ‘This algorithin requires 5 flops and a single square root. Note that it does sot compute 8 and so it does aot involve inverse trigonometric functions. Example 5.1.4 Ife = [1 2, 3, 47, cou) = 1/V5, and sin(@) = -2/V5, then 602, 4,0)2 = {1, VB, 3, OF. 5.1.9 Applying Givens Rotations It is critical that the simple structure of a Givens rotation matrix be ex: ploited when it is involved in a matrix multiplication. Suppose A¢ R™", ¢ = cos(6), and s = sin(@). If Gli,k,6) ¢R™™, then the update A — Gli, ky 8)" A effects just two rows of A, es)" agian [_$ £] atiaba and requires just 6n flops: for 71 = Ali) 2 = Aly) A(L,j) = en ~ 572 A(2,j) san tery end Likewise, if G(i,£,8) € IR™™, then the update A + AG{i,k,6) effect just ‘two columns of 4, acctex = acstead [ $2] and requires just: 6m flope: 5.1, HOUSEHOLDER AND GIVENS MATRICES aur for j = im n= AGA) m= AGA) AG,4) = en - 9m AG.E) = sn +0 end 5.1.10 Roundoff Properties ‘The numerical properties of Givens rotations are as favorable as those for Householder reflections. In particular, it can be shown that the computed é and 5 in givens satisfy = l+6) € = Ofu) § = s(l+e) & = Olu) If @ and 4 are subsequently used in » Givens update, then the computed update is the exact update of a nearby matrix: ANGGEOTA) = GG AOA+E) [Els eu Alla JMAG(, k, 8) (A+ E)GG, kB) Ella = ull Alta. A detailed error analysis of Givens rotations may be found in Wilkinson (1985, pp. 131-39). 5.1.11 Representing Products of Givens Rotations Suppove Q = Gy---Gy inn product of Given rotations. As we have seen in connection with Householder reflections, it is more economical to keep the orthogonal matrix Q in factored form than to compute explicitly the prod- ‘uct of the rotations. Using a technique demonstrated by Stewart (1976), it is possible to do this in a very compact way. The idea is to associate a single floating point number p with each rotation. Specifically, if z-[5 J etter ae then we define the scalar p by ife=0 p=1 elseif |s| < |e] p= sign(e)s/2 19) else p= 2sign(s)/e end 218 CHAPTER 5. ORTHOGONALIZATION AND LEAST SQUARES Essentially, this amounts to storing 3/2 if the sine is smaller and 2/c if the conine is smaller. With this encoding, it is possible to reconstruct +7 as follows: ifp=1 c=O sat elseif [p| <1 s= 2 c= Vin (8.1.10) else. c= 2% s=vIne ‘That —Z may be generated is usually of no consequence for if Z zeros a articular matrix entry, so does -Z. The reason for essentially storing the smaller of ¢ and s is that the formula vT— 2? renders poor results if = is reer unity. More details may be found in Stewart (1976). Of course, to “reconstruct” G(i, k,6) we need i and & in addition to the associated p. ‘This usually poses no difficulty as we discuss in §5.2.3. 5.1.12 Error Propagation We offer some remarks about the propagetion of roundoff error in algo- rithms that involve sequences of Houscholder/Givens updates. To be pre- cise, suppose A = do ¢ RM" is given and that matrices Ay,...,4p = are generated via the formula Aes MQrAriZ) k= 14 ‘Assume that the above Householder and Givens algorithms are used for both the generation and application of the Q, and Zi . 
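For concreteness, the following MATLAB sketch (ours, not the book's code; the names givens_sketch and row_rot are made up) implements the construction of §5.1.8 and the two-row update of §5.1.9. In MATLAB these would live in separate files or as local functions. The roundoff discussion continues below.

    function [c, s] = givens_sketch(a, b)
    % c = cos(theta), s = sin(theta) such that  [c s; -s c]'*[a; b] = [r; 0].
    % The ratio is formed with the larger of |a|, |b| to avoid overflow,
    % and no inverse trigonometric function is needed.
      if b == 0
        c = 1;  s = 0;
      elseif abs(b) > abs(a)
        tau = -a/b;  s = 1/sqrt(1 + tau^2);  c = s*tau;
      else
        tau = -b/a;  c = 1/sqrt(1 + tau^2);  s = c*tau;
      end
    end

    function A = row_rot(A, i, k, c, s)
    % Apply G(i,k,theta)' to A, touching only rows i and k (about 6n flops).
      for j = 1:size(A,2)
        t1 = A(i,j);  t2 = A(k,j);
        A(i,j) = c*t1 - s*t2;
        A(k,j) = s*t1 + c*t2;
      end
    end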
Let Q and 2, be the orthogonal matrices that would be produced ia the abyence of roundod, Tt can be shown that B= (Qp-QNAF ENE Z), en) where || E'||2 < culj A fjz and cis constant that depends mildly on n,m, and p. In plain English, B is an exact orthogonal update of a matrix near to A. 5.1.13 Fast Givens Transformations ‘The ability to introduce zeros in a selective fashion makes Givens rotations an important zeroing tool in certain structured problems. This has led to the development of “fast Givens” procedures. The fast Givens idea amounts to a clever representation of Q when @ is the product of Givens rotations. 5.1, HOUSEHOLDER AND Gvens MaTRices ag In particular, Q is represented by a matrix pair (M, D) where M™M = D = ding(di) and each d; is positive. The matrices Q, M, and D are connected through the formula Q = MD? = Méing(1//4). Note that (MD-!/2)7(MD-¥/2) = D-¥2DD-¥2 = I and so the ma trix MD-"? ig orthogonal. Moreover, if F is an n-by-n matrix with FPDF = Daew diagonal, then MyMnew = Drew Where Mnew = MF. ‘Thus, it is possible to update the fast Givens representation (MD) to ob- tain (Mnews Dare). For this ides to be of practical interest, we must show how to give F zeroing capabilities subject to the constraint that it “keeps” D diogonsl. ‘The details are best explained at the 2-by-2 level. Let 2 = [xy x2] and D = ding(di,da) be given and assume that dy and dy are positive. Define = [% a] (6.1.12) so cat te = [ES] = ape, =| atthe dab + dao MrDM, = [ ae, ee | 21/22, and fy = —arde/dy, then Mis = [7r” | Ia #0, a d(1+n) arom = [4° gang | where 91 = apy = (@a/4s)(21/29)?- iy, if we asrume x) 70 and define Mz by My = [ av ] (5.1.3) where ap = —a/21 and fy = ~(d/da}on, then age = [207] 4+) 0 MIDMs = { 0” ann) | = Po 220 CHAPTER 5. ORTHOGONALIZATION AND LeasT SQUARES where 92 = ~aabr = (4i/da)(¢2/21)*. It is easy to show that for either i is orthogonsl and that it is designed 20 that the second component of TD“) a ero, (J tay actually be a tection and chs tis hall correct to use the popular term “fast Givens.”) Notice that the 7 satisfy 2172 = 1. Thus, we con always select M; in the above so that the “growth factor” (I + 7s) is bounded by 2. Matrices of the form At La mn [hb an [3 2] that satisfy —1 < a,B; <0 are 2-by-2 fast Givens transformations, Notice ‘that premultiplication by a fast Givens transformation involves half the number of multiplies as premultiplication by an “ordinary” Givens trens- formation. Also, the zeroing is carried out without an explicit square root. In the n-by-n case, everything “scales up” as with ordinary Givens ro- tations. The “type 1” transformations have the form Diaypi 0F 2, the matrix. Pliska,8) = | pod (6.14) re en) Oe lee i FkaB) = ft (6.1.15) pete ole ” Od k Encapaulating all this we obtain Algorithm 5.1.4 Given 2 € R? and positive d € R?, the following al- gorithm computes a 2-by-2 fast Givens transformation M such that the 5.1, HOUSEHOLDER AND GIVENS MATRICES 22 second component of MTz is zero and MTDM = Dy is diagonal where D = ding(di,dz). If type = 1 then M has the form (5.1.12) while if type ~ 2 then M hss the form (5.1.19). The disgonsl elements of D; overwrite d. function: [a, 6, type] = fast.givens(z,d) if 2(2) #0 a = ~2(1)/z(2); 6 = ~ad(2)/d(1); 7 = -ab ifys1 type =1 r= {1}; d(1) = (1 + -7)d(2); d(2) = (1+)r type a= 1/a; B= 1/6; y= 1/y (1) = (1 + -y)d(1); d(2) = (1 + y)d(2) ond else type = 2 oa ‘The application of fast Givens transformations is analogous to that for ordinary Givens transformations. 
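The following MATLAB sketch (ours; fast_givens_sketch is a made-up name, and unlike the text's function it returns the updated d explicitly) mirrors Algorithm 5.1.4 and checks the two defining properties at the 2-by-2 level.

    function [alpha, beta, type, d] = fast_givens_sketch(x, d)
    % 2-by-2 fast Givens transformation (cf. Algorithm 5.1.4).  With
    %   M = [beta 1; 1 alpha]  (type 1)  or  M = [1 alpha; beta 1]  (type 2),
    % the second component of M'*x is zero and M'*diag(d_old)*M = diag(d_new).
      if x(2) ~= 0
        alpha = -x(1)/x(2);  beta = -alpha*d(2)/d(1);  gamma = -alpha*beta;
        if gamma <= 1
          type = 1;
          tau = d(1);  d(1) = (1 + gamma)*d(2);  d(2) = (1 + gamma)*tau;
        else
          type = 2;
          alpha = 1/alpha;  beta = 1/beta;  gamma = 1/gamma;
          d(1) = (1 + gamma)*d(1);  d(2) = (1 + gamma)*d(2);
        end
      else
        type = 2;  alpha = 0;  beta = 0;
      end
    end

    % Example use:
    %   x = [3; 4];  d0 = [1; 2];
    %   [alpha, beta, type, d] = fast_givens_sketch(x, d0);   % type = 2 here
    %   M = [1 alpha; beta 1];
    %   M'*x            % second component is zero
    %   M'*diag(d0)*M   % equals diag(d); no square roots were used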
Even with the appropriate type of trans formation used, the growth factor 1 +-y may still be as large as two, Thus, 2 growth can occur in the entries of D and M after # updates. This means ‘that the diagonal D must be monitored during a fast Givens procedure to avoid overflow. See Anda and Park (1994) for how to do this efficiently. Nevertheless, element growth in M and D is controlled because at all, times we have M D-¥/? orthogonal. The roundoff properties ofa fast givens procedure are what we would expect of a Givens matrix technique. For ex- ample, if we computed Q = fl(MD-"/?) where M and D are the computed ‘M and D, then Q is orthogonal to working precision: | Q7Q — I I =u. Probleme PS.1e1 Execute house with #=(1, 7, 2,3, -1/7. PS.12 Let = and y be nonaero vectors in RO. Cle algorithm for determining « Householder matrix P uch that Pris a muiple of PS.1.3 Suppose 2 € O° and that 21 = [ze!” with OER Amume s #0 and Seine u = r+ el zfaer Show that P= I — Dav /ullu ia unitary and that Pom ez hae 5.1.4 Use Houmbokder matrices to show that det (I +2y?) = 1+ 27y where = and y re rea nevectors. 5.15 Sopp = ¢ Cie an lg or detaining wry sof te [4 2] cen esurer Q ‘such that the second component of QM ie sero. 220 OnAPTER 5. ORTHOGONALIZATION AND LEAST SQUARES P5.1.8 Suppose 2 aod y are unit vectors lo RY. Give a. algorithm using Givens ‘raaaiormations which computes an orthogonal Q wuch that Q7 = y. PBA Determine c= cox) sad # = sa(®) such that [ST Ta}-[2] 5.1.8 Suppose that Q = 1+ YTYT ia orthogonal where ¥ € FO and 7'¢ WS ia ‘upper triangular. Show that if Qy = QP where P ~ 1-2" /sTy is a Householder tain, then Qy cam be expremed ia the frm Qy = [+¥:Ts¥Z where Vy € ROOD and Ty © RYH*U+) iy upper triangular 5.1.9 Give » deta implemestation of Algocithm 5.12 with the asrumption that sWiG+ Ln), the emential par ofthe he 3th Flousobokler vector, instored in AC} 3) Since ¥ ia efetively represent in A, your procedure oed only vt up the W matrix P8110 Show that if 5 is sarw-aymmeric (S7 = —5), then Q = (14+ S\F—5)-Y ia erthogoasl (Q ia called the Copley ianaform of 5.) Coowtroct a rank-2 S vo thas if = ia vector then Qs i sro except in the Ae component. 5.1.11 Suppose P € RO" antaien | PP — fy y= «<1 Show thet al the singular value of P ae in th interval (Le 1-4¢ and that POV? fy ¢ where P= ULV? {the SVD of P. 5.1.23 Suppow A € RP%?, Under what conditions isthe closet rotation to A closer than the closest eelection to A? Notes and References for Sec. 5.1 Hoowsebolder matrices are named after A.S. Householder, who popularised their use in numerical aalyaia. However the properties of thee matrices have been known for quite some time. See LW, Tummbull and A.C. Aitken (1961). An iniroduction to the Theory of Canonical Matrices, Dover Publications, New York, pp. 102-5. (Other refreacas concerned with Houssbolder transformations include AR. Gouriny (1970), “Generalization of Elementary Hermitian Mettice,.” Comp. J 13, AUL-12. BN, Parl (107), “Anas of Aleta or Reectionn a Bictor,” STAM Reviews 'N.XK, Tao (1975). “A Note on Implementing the Householder Transformations” STAM 4. Nom. Anal. 12, 53-58, B. Danlay (1978). “On the Choice of igus for Househokler Matricen” J. Comp. Appl. Moth. 8, 87-89. JIM. Cuppen (1984), On Updating Triangular Products of Houmbolder Matrices,” ‘Numer. Math. 45, 408-410. 1. Keufinan (1987). “The Generalized Houmbolder Transformation and Sparse Matri- om Lin, Alp, and les Apple, 90, 221-234. 
A detailed error analysis of Houneboldertranaformations in given in Lawson ead Hasson (ats, 83-40). ‘The basic references for block Householder representations and the nasocinted com tations include CAH, Bischof and C. Van Loan (1887). “The WY Representation for Products of Hous holder Matrices” STAM J. Sei. and Stat. Comp. 8, l-a13 ‘Tae QR FactoRizaTion 223 [R. Schieiber and BLN, Parltt (1987). “Block Rafectors: Theory and Computation” STAM J. Numer. Anal. 25, 180-206. 'BN, Parlet and RL Schrerber (1988), “Block Reflectors: Theory and Computation” SIAM J. Nom. Anal. 25, 189-208. RS, Schraber and C. Van Loan (1980). *A Storage ficient WY Reprematetion for Products of Housebolder Transformations,” STAM J. Sci ond Stat. Comp. 10, sos, C. Puglia (1902), “Modication of the Houseboiter Method Besed oa the Compact WY Reprowmaiation,” STAM J. Sci ond Stat. Comp. 13, 723-726. X. Sun and GH. Bischof (1965). “A Basie-Kernel Representation of Orthogonal Matsi- ‘con STAM J, Matrix Anal Appl. 16, 1184-1196. Givens rotations, named after W. Given, ae also refered to a8 Jacobi rotations. Jacobi devised a rymmetrc eigenvalue sgorithmn based on thee trunsiormations in 1848. Soe B84 The Given rotation storage scheme dlocumed in the text i detailed ln G.W. Stewart (1978). “The Economical Storage of Plane Rotation,” Numer, Math, 25, 137-38, Fst Givens tromaformations are alao referred to as “oquare-rot-frd” Givens tranafor- ‘mations. (Recal that a square root must ordiaarily be computed during the formation of Glvwna transformation) There are soveral ways fast Givens calculations can be ar- ranged. See M. Gentleman (1973). "Least Squares Computations by Givens Transformations without Square Roots," J. frat, Moth. Appl 12, 320-36. CF, Van Loan (1973). "Ganeraizad Singular Valuat With Algorithm and Applice- tions,” PD. thesis, Uaiversty of Michigan, Apa Arbor. S. Hammucing (1974). *A Note oa Modifications to the Givens Plane Rotation,” J Inat. Math. Appl 19, 215-18. 11, Wainwon (1017) “Some Recent Advances in Numerical Linear Algebra,” in The ‘Slate of the Art in Numerical Anciyns, od. D.AH, Jecobn, Academic Press, New ‘York, pp. 1-63. AA. Ande and H. Park (1904). “Fost Plane Rotations with Dynamic Sealing,” SIAM J. Moteiz Anak Appl 15, 162-374, 5.2. The QR Factorization ‘We now show bow Householder and Givens transformations can be used to compute various fectorizationa, beginning with the QR factorization. The QR factorization of an m-by-n matrix A is given by A=QR where @ € IR" is orthogonal and R € R™™" is upper triangular. In this section we assume m >n. We will ses that if A hos full column rank, then the first n columns of @ form an orthonormal basis for ran(A). ‘Thus, calculation of the QR fectorization is one way to compute an orthonormal bess fora set of vectors. This computation can be arranged in several ways. We give methods based on Householder, block Householder, Givens, and fast Givens transformations. The Gram-Schmidt orthogonalizetion process and s numerically more stable variant called modified Gram-Schmidt are also discussed. 224 CHAPTER 5. ORTHOGONALIZATION AND LEAST SQUARES 5.2.1 Householder QR ‘Wo begin with a QR factorization method that utilizes Householder trans- formations. The essence of the algorithm can be conveyed by a small ex- ‘ample. 
Suppose m = 6, n= 5, and assume that Householder matrices Hf and 2 have been computed so that Ilha = je ccecox eccoxx g Beeexx Concentrating on the highlighted entries, trix Hy € RO such that determine a Householder ma As - e ocex It Hs = ding(, ), then Haga = ecoxxx XK KX K x After n such stepe we obtain an upper triangular Hy Hf,.1--- HyA = Rond so by setting Q-= Hy---H, we obtain A =QR. Algorithm 5.2.1 (Householder QR) Given A ¢ R™** with m > 1 the following algorithm finds Housebolder matrices Hy,..../q such that if Q = Hy-+- Hq, then Q7A = Ris upper triangular. ‘The upper trisngular art of A is overwritten by the upper triangular part of R and components J+ Lam of the jth Householder vector are stored in A(j + 1:m,j),j n, the fol lowing algorithm overwrites A with Q7A = R, where A is upper trisngular and Q is orthogonal. ~1jt1 ivens(A(i — 1,4), A(t, 3) AG 1:i,pn) = [ a Ja = i, jin) end end ‘This algorithm requires 3n3(m —n/3) Aops. Note that we could use (5.1.9) to encode (c,) in a single number p which could then be stored in the zeroed entry A(i,3). An operation such as z+ QT z could then be implemented by using (6.1.10), taking care to reconstruct the rotations in the proper order. (Other sequences of rotations can be used to upper triangularize A. For example, if we replace the for statements in Algorithm 5.2.2 with for i= for j 2 -min(i ~ 1,n} then the enon in are introduced row-by-20W. “Another peaineter in a Givens QR procedure covers the planes of rotation that are involved in the zeroing of each a,,, For example, instead of rotating rows i~1 and i to zero ay as in Algorithm 5.2.2, we could use rows j and i: for j=1:n fort=me=1541 [ey] = givens(AG, j), AC, 5)) J assum 5.2.4 Hessenberg QR via Givens As an example of how Givens rotations can be wsed in structured problems, ‘we show how they can be employed to compute the QR factorization of an ‘upper Hessenberg matrix. A small example lustrates the general ides. 228 CHAPTER 5. ORTHOGONALIZATION AND LEAST SQUARES Suppose n = 6 and thot after two steps we have computed xx Kx x x ox xxx x (2,3,0)"G0,2.0)7A =| 3 9 x x x * 00 0x x x oo00xx ‘We then compute G(3, 4,85) to zero the current (4,3) entry thereby obtain- ing xxx x Ox x x x x GG, 4,65)7G(2,3,04)"O(1,2,0)7A = | 9 9 3 x x ™ 0 00x x x oo000xx Overall we have Algorithm 5.2.3 (Hessenberg QR) If A © R™" is upper Hessenberg, then the following algorithm overwrites A with Q7A = R where Q is or- thogonal and R is upper triangular. Q ++ Gna i8 a product of Givens rotations where G; has the form G, = G(j,j + 1,63). for j=Ln-1 [es] =givens(AG,3), AG +1,3)) $f] 409+ 19m agseram=[ nd ‘This algorithm requires about 3n? flops. 5.2.5 Fast Givens QR ‘We can use the fast Givens transformations described in {5.1.18 to compute an (M, D) representation of Q. In particular, if Mf is nonsingular and D is diagonal such thot MA = T is upper triangular and =Dis diagonal, then Q = MD-1/? is orthogonal and Q7A = DAT = Ris upper triangular. Analogous to the Givens QR procedure we have: Algorithm 5.2.4 (Fast Givens QR) Given Ac R"*” with m 2m, the following algorithm computes nonsingular M ¢ IR"*" and positive d(1:m) ‘such that M7 A = T is upper triangular, and M7 M = ding(ds,...,4m)- A is overwritten by T. Note: A = (MD-¥/2)(D¥/27) is « QR factorization of A. 5.2, Tw QR FACTORIZATION 209 for i= im ai) 1 for j= Lin for t= m= 1+ [a, 8, type] = fast.givens(A(i —1:i, j),d(i — 1:i)) Hape=t r Ali = Li,jin) = [2 i ] AGi=15,32) else agarisim=[5 P| again end end This algorithm requires 2n?(m —n/3) flops. 
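If the orthogonal factor is wanted explicitly, the (M, D) representation converts to an ordinary QR factorization as in the following sketch (ours, written under the assumption that M has been accumulated along with d and the triangular factor T; Algorithm 5.2.4 itself stores only d and overwrites A with T):

    % From the fast Givens representation to ordinary QR factors:
    %   M'*A = T (upper triangular),   M'*M = diag(d),
    % so Q = M*diag(1./sqrt(d)) is orthogonal, R = diag(1./sqrt(d))*T is
    % upper triangular, and A = Q*R.
    Q = M*diag(1./sqrt(d));
    R = diag(1./sqrt(d))*T;
    % norm(Q'*Q - eye(size(Q))) and norm(A - Q*R) should both be O(u)*norm(A).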
As we mentioned in the pre- vious section, itis necessary to guard against overfiow in fast Givens algo- rithms such a8 the above. This means that M, D, and A must be periodi- cally scaled if their entries become large. If the QR factorization of a narrow band matrix is required, then the fast Givens approach is attractive because it involves no square roots. (We found LDL? preferable to Cholesky in the narrow band case for the same reason; see §4.3.6.) In particular, if A €R™*" has upper bandwidth g and lower bondwidth p, then QT = R has upper bandwidth p +q. In this case Givens QR requires about O(np(p+q)) flops and O(np) square roots. ‘Thus, the square roots are a significant portion of the overall computation ingen. 5.2.6 Properties of the QR Factorization ‘The above algorithms “prove” that the QR factorization exists. Now we relate the columns of Q to ren(A) and ran(4)* ond examine the uniqueness ‘question. ‘Theorem 5.2.1 If A = QR is a QR factorization of « full column rank AER™™ and A= [a1,.+-50n) and Q = [@is-+-sm] are column partic ‘tionings, then span{ar,....a4} = span{a,....ge} & Jn particalar, if Qy = Q(lem, tn) and Qy = Q(L:m,n+ 1am) then ran(A) = ran(Q.) ran(A)* = ran(Q2) and A= Qi Rj with Ry = R(t, 1:n). 230 CHAPTER 5. ORTHOGONALIZATION AND LEAST SQUARES Proof. Comparing kth columns in A= QR we conclude that * ay = Doren € span(gns.--sax) - (22) ‘Thus, span{ar,....de} & span{dr,....dk}- However, since rank(A) = 1 it follows that span(o,,...,04} has dimension k and so must equal span(qi.-..9e} The rest of the theorem follows trivially. ‘The matrices Q1 = Q(1:m,1:n) and Qa = Q(1am,n + 1m) can be easily computed from a factored form representation of Q. If A = QRis a QR factorization of A € R™*" and m 2 n, then we refer to A = Q(,1:n)R(1in, lin) as the thin QR factorization. The next result addresses the uniqueness issue for the thin QR factorization ‘Theorem 5.2.2 Suppose A€ R™*" has full column rank The thin QR factorization A=QR, is unique where Q1 ER" has orthonormal colwnns and Ry i8 upper t- angular with positive diagonal entries. Moreover, Ry = G? where G is the lower triangular Cholesky factor of ATA. Proof, Since ATA = (94R)"(QiR) = RR, wo aoe that G = RF i he ‘Cholesky factor of ATA. This factor is unique by Theorem 4.2.5. Since Q1 = AR;! it follows that Q, is also unique. O How are Qz and Ry affected by perturbations in A? To answer this question we need to extend the notion of condition to rectangular matrices. ‘Recall from §2.7.3 that the 2-norm condition of a square nonsingular matrix is the ratio of the largest and smallest singular values. For rectangular ‘matrices with full column rank we continue with this definition: Smas(A) rnin)” If the columns of A are nearly dependent, then 2(A) is large. Stewart (1993) has shown that O(¢) relative error in A induces O(ers(A)) relative error in R and Qi. AGR™ rank(A) =n <> ma(A) = 5.2.7 Classical Gram-Schmidt We now discuss two alternative methods that can bo used to compute the thin QR factorization A = QiR directly. If rank(4) =n, then equation (6.22) can be solved for gx: n= («=e fn 5.2. Tue QR FACTORIZATION 231 ‘Thus, we can think of qx as a unit 2-norm vector in the direction of ’ n= ay Sorat where to ensure #4 € span{di,..-,9k-1}+ we choose Te = Go, i= Lk- 1. 
‘This leads to the classical Gram-Schmidt (CGS) algorithm for computing A=QR R(L1) =| AG) ty 6-3) = AG 1)/RG1) for k= 2:n Rk - 1k) = Q(Ls = TAQ, k) 2s A(Lim,k) ~ Q(.im, Ick — 1)R(Lsk ~ 1,4) (523) GRR) = zh Q(tzm,k) = 2/R(k,k) In the kth step of CGS, the kth columns of both Q and R are generated. 5.2.8 Modified Gram-Schmidt Unfortunately, the CGS method has very poor numerical properties in that, there is typically a severe loss of orthogonality among the computed 4. Interestingly, a rearrangement of the calculation, known as modified Gram- Schmidt (MGS), yields a much sounder computational procedure. In the kth step of MGS, the kth column of @ (denoted by qu) and the kth row of R (denoted by r) are determined. ‘To derive the MGS method, define the matrix A @ ROE by st . A-Yoarl = oar? = (0.4). (624) aS Te follows that it Mate B| Int theo rae = fellas Qe = 2/Feg ond (racet---Thn) = gf B. We then compute the outer product AMM = B — a (resist '*-tin) and proceed to the next step. This completely describes the kth step of MGS. Algorithm 5.2.8 (Modified Gram-Schmidt) Given Ae R™" with rank(A) = n, the following algorithm computes the fectorizatioa A = Qi where Q; € 1" has orthonormal columns and Ry € RO" is upper ti- angular. 232 CHAPTER 5. ORTHOGONALIZATION AND LEAST SQUARES for k= 13 Rik, k) = || AQ, &) lla Q(1zm, k) = A(L:m, k)/R(k,K) for jaktin R(k.3) = Q(1:m,k)P Altem, 3) (iim, j) = A(.im,j) — QCM, KY Ck 3) ond end ‘This algorithm requires 2mn? flops. It ia not possible to overwrite A with ‘both Q; and Ri. Typically, the MGS computation is arranged so that A is ‘overwritten by Qj and the matrix 2 is stored in a separate array. 5.2.9 Work and Accuracy If one is interested in computing an orthonormal basis for ran(A), then the Householder approsch requires 2mn? — 2n?/3 flops to get @ in fac- ‘tored form and another 2mn? — 2n?/3 flops to get the first. columns of Q. (This requires “paying attention” to just the first n columns of Q in (6.15).) Therefore, for the problem of finding an orthonormal basis for ran(A), MGS is about twice as efficient as Householder orthogonalization. ‘However, Bjérck (1967) has shown that MGS produces a computed Qr = aiy.--+dn} that satisfies Or = 1 + Eucs | Baas lla = wma(A) whereas the corresponding result for the Householder approach is of the form OO =1+ Ey WEvhau. ‘Thus, if orthonormality is critical, then MGS should be used to compute orthonormal bases only when the vectors to be orthogonalized are fairly independent. . ‘We also mention that the computed triangular factor ® produced by MGS catisfies || A — QF} ~ ulj A || and that there exists a Q with perfectly ‘orthonormal columns such that |A— Qf || = ull Al). See Higham (1996, p.379). ‘Example 6.2.1. If modiied Gram-Schmidt ia applied to loa A=| wo 06 ofA) 141? 0 10 with 6igt deconal arithmetic, than 1.00000 0 (ael = | 001 —70707 0 707100 5.2, Tee QR FacToRWaTion 233 5.2.10 A Note on Complex QR Most of the algorithms that we present in this book have complex ver- sions that are fairly straight forward to derive from their real counterparts. (This is not to say that everything is easy and obvious at the implementa- ton level.) As an illustration we outline what a complex Householder QR {factorization algorithm looks like. Starting at the level of on individual Householder transformation, sup- pose Ox ze O* and that = = re where r,8¢R Mv =z tell z tae, and P = J, — Gov", 9 = 2/oMv, then Px = Fez les. (See PS.1.3.) ‘The sign can be determined to maximize || [lz for the sake of stability. 
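A minimal MATLAB sketch of this complex Householder vector may be helpful (our illustration; the function name and the omission of scaling safeguards are ours):

    function [v, beta] = house_complex_sketch(x)
    % Complex Householder vector: with x(1) = r*exp(i*theta), set
    %   v = x + exp(i*theta)*norm(x)*e_1,   beta = 2/(v'*v),
    % so that P = I - beta*v*v' is unitary and P*x = -exp(i*theta)*norm(x)*e_1.
    % The "+" sign maximizes norm(v) and so avoids cancellation.
      if x(1) ~= 0
        phase = x(1)/abs(x(1));       % exp(i*theta)
      else
        phase = 1;
      end
      v = x;
      v(1) = v(1) + phase*norm(x);
      beta = 2/(v'*v);                % v' is the conjugate transpose
    end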
‘Tho upper triangularization of A ¢ R™*, m > n, proceeds as in Algo- rithm 5.2.1. In step j we zero the subdiagonal portion of A(j:m, ) for j= 1m = AGimj) v=zse4ljz lize: where 21 = re. B= 2/v!/v AGM, jin) = (Im—j4r — Boo Amn, j2m) end ‘The reduction involves 8n?(m — n/3) real flops, four times the number required to execute Algorithm 5.2.1. If Q = P, ++ Py isthe product of the ‘Housebolder transformations, then Q is unitary and Q7A = Re R™* is complex and upper triangular. Problema 5.2.1. Adapt the Hoursbolder QR algoithm a0 that it can efficiently handle the cam ‘when A © FE*" hn lower bandwidth p and upper bandwidth g 5.2.2 Adapt the Housbolder QR algorithm ao that it computes the fectoriztion ‘Aw QL where L's lower triangular and @ i orthogonal. Assume that A i aquare. This involve rewritiog the Houbolder vector function ¢ = house() 0 that (I~2ov" /o v)= In aero everywhere but ita bottom component. 5.2.3. Adapt the Givens QR factorization algorithm o tht the woos are introduced by lagonal. That i, the entries are zero in the order (rm, 1), (m=1y 1), (m2) (m= 2,1), (m= 1,2), (m3) ee, the ease when A ia n-by-n and tidingonal superdiagoaal of A are sored in (1:n-~ 1), a(n), (in = 1) respectively. Design your Algorithm so that these vector are overwritten by the noasero portioa of 7. 5.2.5 Suppose L¢ FE" with m > m le lower triangular. Show bow Houbolder ‘matrices AH can be tsed to determine & lower triangular La € FE™" 9 that mote = [4] 234 CHAPTER 5. ORTHOGONALIZATION AND LeasT SQUARES ‘lint: The cond stop in the 6-by-3 can involves finding Ha m that mle Rw a-(o kare ‘and A haw full coluran rank, then min Az -b 13 = a2 - (07a/l vita)”. 3.2.7 Suppowe A € FO** and D = ding(dh,....dq) € RU. Show how to construct tn orthogonal Q such that Q7-4 ~ DQ? = fis upper triangular. Do aot worry aboot ‘ificiency thi is just an exercise ia QR manipulation. 5.2.8 Sbow how to compute the QR factorization of the product A= Ay. ArAy without explicitly ‘Ay together. Hist: Te the p = miso [G] mie 5.2.9 Suppage Ae FO*™ and let B be the permutation obtained by reversing the ‘order of the rows i Jn. (This is just tho exchange matrix of $47) (a) Show that if Re RO™" is upper tangular, then L = ERE is lower triangular (0) Show how to compare an orthogonal Q€ FO" sud a lower taagular Le FO" so that A= QL ‘amuining the availablity of procedure for computing ehe QR facterzaton. 5.2.10 MGS applied to A € RO" a pumercally equivalent tothe fr step in House- Iolder QR applied to On a where On is the nbycn aero matrix. Verily that this etatement ia true after the Sst ‘ap ofeach method ix completed. PO.2.11 Reverse the loop orders in Algorithm 8.28 (MGS QR) so that 2 ls computed ‘olume-by-columa, 5.2.12 Develop « complax versian of the Givens QH factorization. Reler to PSL. ‘where complex Givens rotations are the theme. Ist posible to orgaize the calelations vo that the diagonal slementa of Fare sonaegati? Notes and Raferences for Sec. 5-2 "The iden of using Householder transformations to solve the LS problem was proposed i AS. Houssholer (1958). "Unitary THangularnation of a Nomymmetric Matrx,” J. “ACM. 5, 3042, ‘The practical detala were worked out P. Businger aod G.H. Golub (1965). “Linear Lenst Square by Houmbolder Treamatioen” Namer. Math 7, 0-7 Sw uke Wikinen sod Fach asTiaAt-18) G.H, Golub (1065). “Numerical Methods for Soiving Linear Lougt Squares Problems,” ‘Mamer. Math. 7, 208-18. 5.2, Tae QR FACTORIZATION 235 ‘The basic references oo QR via Givens rotations inctude W. Givens (1958). 
“Computation of Plane Unitary Rotations Transforming & Gaoeral Matitc to Triangular Form,” STAM J. App. Math. 6, 8-0. M, Gentleman (1979). “Error Analysis of QR Decompositions by Givens Tranaforma- Alon” Lin. Alp. and Is Appl 10, 189-87. For a disrumion of how the QR factorlestion can be ured to solve aumerous probes in ‘tatintcal computation, se GH, Golub (1960), “Matrix Decompositions and Statistical Computation,” in Statistical Computation , of. RC, Milton and 1A. Nelder, Acedemic Proas, New York, pp. 368-97. ‘The behavior ofthe @ and R factors when A ie perturbed is discussed in G.W. Stewart (1977). *Pervubation Bounds for the QR Factortation of « Matrix,” SIAM T. Nam. Anak 14, 500-18. 1H. Zhe (1969). “A Componeatwin Perturbation Analysis of the QR Decomposition,” SIAM J. Metres Anak. Appl 4, 124-1931 G.W. Stewart (1993). “On the Perturbation of LU Chola, and QR Factorzatioas,” SIAM J. Metrix Anak Appl 14, 1141-1348. ‘A. Barluad (1994). “Perturbation Bounds for the Generali QR Factodaation” Lin. ‘lg. and lis Applic. 207, 251-271. 4-6. Sun (1988). "Oa Perturbation Bounds for the QR Factoczation." Lin. Alg, ond 2a Applic. £15, 8-112. ‘The main raul ia that the change in Q and Fare bounded by the condition ofA times the relative change ia A. Organising the computation go that the extria in Q dapend continuously on the entries in A io scum is. ‘TE. Coleman and D.C, Soreaes (1984). “A Note on the Computation of an Orthonar- ‘el Basis for tho Nall Space of a Matix,” Mathematiea! Programming £9, 234-242. References forthe GramSchmidt procem include include ‘LR. Rice (1966), “Experiments on Grasm-Schmidt Orthogoualization,” Math Comp. 120, 325-28. [A Blérek (1967). “Solving Linear Least Squares Problems ty Gram Schmidt Orthogo- pallation,” BIT 7, I= NN. Abdelmalek (1971), "Roundoff Error Anaiysa for Gram-Schinidt Method and Solution of Linear Lent Squares Problecn,” BIT 11, 345-88. 4. Danie, W.B. Gragg, L-Kattnan, and GW. Stewart (1978). “Reoethogoaalization nd Stable Algorithme for Updating the Gram-Schmict QR Peeworiation,” Math. Comp. 30, 772-798. ‘A. Rube (1983). "Numerical Aspects of Gram-Schmidt Orthogpaalization of Vectors,” Lim Alp ond Its Apple. $2/58, 591-601. W. Jalby and B. Philippe (1901). “Stablity Analysis und Improvement of the Block ‘GramSchmidt Algorithm,” SIAM J. Sct. Stot. Comp. 12, 1058-1073. A. Blecck and C.C. Paige (1002). “Lom and Recapture of Orthognnality inthe Modi ‘Gram-Schmidt Algorithm,” SIAM J. Meiria Anal. Appl. 13, 176-190. [A Bjbeck (1094), “Nameris of Cram-Sehimidt Orthogonaliaation” Lin Alp. and Its Apple. 197/198, 207-316. "The QR factorization of w structured matrix is ually structured itl. Soe AW, Bojaneryk, RLP. Brent, and FR de Hoog (1968). “QR Factorization of Toeplta Matticen” Nemer. Math: 49, 81-04. 236 CHAPTER 5. ORTHOGONALIZATION AND LEAST SQUARES 8. Qiao(1986). 53, 351-360 C.J. Demeure (1960). “Fast QR Factorisstion of Vandermonde Matrican” Lin, Als. ‘and Ite Applic. 188/129/124, 105-194. 1. Reichel (1991). “Fast QR Decomposition of Vandermonde Like Matrices and Polyno- ‘ial Least Squares Apprasdinatioa," STAM J. Mairex Anak Appl. 12, $52-564. D.R. Sweet (1991). "Fare Block Tooplits Orthogonalisation,” Numer. Math. 58, 61> om. “HybeidAlgocithm for Fast Tooplits Orthogoasliaation,” Numer. Math. Vasious high-performance lau pertaining to the QR factorsation are disused in B. Mattingly, C. Meyer, and J. Ortaga (1989). “Orthogonal Reduction on Vector Com- pitern STAM J. Sc. and Stat. Comp. 10, 372-38. P.A. Knight (1995). 
"Fart Rectangular Matrix Multiplication and the QR Decompou!- hoo," Lin. Aig. and Its Applic. 221, 62-8. 5.3 The Full Rank LS Problem Consider the problem of finding a vector 2 € RR" such thet Az = b where the data matrix A € IR"*" and the observation vector b € IR™ are given and m>n, When there are more equations than unknowns, we say that the system Az = is overdetermined. Ususlly an overdetermined system has ro exact solution since must be an element of ran(A), e proper subspace of R™. This suggests that we strive to minimize || Az — 5), for some suitable choice of p. Different norms render different optimum solutions. For exam- ple, iA = (1, 1, 1]? and b= (by, ba, bs]? with by 2 by & by 2 O, then it can be verified thot ps1 + tm = b P 2 > roe (br + bz + b3)/3 P= > am = (hth) ‘Minimization in the 1-norm and oo -norm is complicated by the fact that the function f(z) = || Ar—6|), is not differentiable for these values of p. However, much progress hai been made in this area, and there are several good techniques available for 1-norm and oo-norm minimization. Soe Coleman and Li (1992), Li (1993), and Zhang (1993). ‘In contrast to general p-norm minimization, the least squares (LS) prob- Jem in | Arby (631) is more tractable for two reasons: * d(x) =} Ar —b/lf is a differentiable function of z and so the min- immizers af $ satisfy the gradient equation Vg(z)= 0. This turns out to be an easily constructed symmetric linear system which is positive definite if A fas full column rank. 5.3. Tae FULL Rank LS Propiem 237 + The 2-norm is preserved under orthogonal transformation. This means that we can sock an Q such that the equivalent problem of minimizing j (Q™A)z ~ (Q7b) I is “easy” to solve. In this section we pursue these two solution approsches for the case when ‘A has fall column rank. Methods based on normal equations and the QR fectorization are detailed and compared. 5.3.1 Implications of Full Rank Suppose x €R", 2 €R", and a € R and consider the equality Ale +ez)—b]f = | Av— off +2027AT(Ax — b) +a] Az where A€ IR" and 6€ IR". If z solves the LS problem (5.3.1) then we must have A7(Ax - 6) = 0. Otherwise, if 2 = ~AT(Az —b) and we make a small enough, then we obtain the contradictory inequality WAtz+az)~b]l, <[|Az—6]),. We may alo conclude that if = and +az are LS minimizers, then 2 € aull(A). ‘Thus, if A has full column rank, then there is a unique LS solution z;.5 and it solves the symmetric positive definite linear system AT Azys = ATH ‘These are called the normal equations. Since Vg(z) = AT(Az — 6) where O(z) = fi] Ar —4()} , we see that solving the normal equations is tants- mount to solving the gradient equation Vg = 0, We cal rus = b~ Azzs the minimum residial and we use the notation pus = Aris ~bll, to denote its size, Note that if prs is small, then we can “predict” b with the columns of A. So far we have been ascuming that A ¢IR"™" has full column rank. ‘This assumption is dropped in §5.5. However, even if renk(A) = n, then ‘we can expect trouble in the above procedures if A is neatly rank deficient. When assessing the quality of a computed LS solution 2, there are ‘two important issues to bear in mind: ¢ How close is £1 to 219? ¢ How small is Fs =b— Afzs compared to rig = b~ Az; 238 CHAPTER 5. ORTHOGONALIZATION AND LEAST SQUARES ‘The relative importance of these two criteria varies from application to application. In any case it is important to understand how zs and rs ‘ave affected by perturbations in A and 6. 
Our intuition tells us that if ‘the columns of A are nearly dependent, then these quantities may be quite Example 5.3.2. Suppose wie Pelt fl-Eb=E): sc hat sg and £1g minimise | Az ~0 hy and (A + 6A)2 ~ (0+ 69) | respectively Um rag and fz be the coreponding mionm reeials- Then ovo es [3] ese [a | suse [E]saus Since na(A)= 108 we have duessh x niet = may ttle = 10! 10-* Meus taste romans? < (Area = at -10-* Th ‘The example suggests that the sensitivity of x1.s depends upon m2(A)?. At the end of this section we develop a perturbation theory for the LS problem and the w2(A)? factor will return. 5.3.2. The Method of Normal Equations ‘The most widely used method for solving the full rank LS problem is the method of normal equations. Algorithm 5.3.1 (Normal Equations) Given A ¢ R™*" with the prop- erty that rank(A) = n and 5 €R", this algorithm computes the solution 15 to the LS problem min | Az ~ bi}, where be R™. ‘Compute the lower triangular portion of C= A™ A. = ATs ‘Compute the Cholesky factorization C = GGT. Soive Gy = d and GTzys = y. This algorithm requires (m+-n/3)n? fops. The normal equation approach is convenient because it relies on standard algorithms: Cholesky factorize- tion, metrix-matrix multiplicstion, and matrix-vector multiplication. ‘The compression of the m-by-n data matrix A into the (typically) much smaller n-by-n cross-product matrix C is attractive. 5.3, Tue Fou, Rank LS Prosuem 239 Let us consider the accuracy of the computed normal equations solution zs. For clarity, assume that no roundoff errors occur during the formation of C = ATA and d= A". (On many computers inner products are accu- ‘aulsted in double precision and oo this is not a terribly unfair assumption.) I follows from what we know about the roundoff properties of the Cholesky factorization (cf. §4.2.7) that (ATA+ Ble = 7, where ff El ul AF fall Ally © ull A7A | and thus we can expect Hassle ms wmx(ATA) = wn(A)? (532) In other words, the accuracy of the computed normal equations solution depends on the square of the condition. This seems to be consistent with Example 5.3.1 but more refined comments follow in §5.3.9. Example 5.3.2 Rt should be noted that the formation of AT A can renlt in a severe own of information. then (A) = 14-10%, 27g = [1 1]7, and pus = 0. If the normal equations method is eee een et nts <= matay=[1 1] bch dnp Oa he oer nd gnats mk te tg = [2.000001 , 0/7 and fl #25 ~ 2s fa/lzzs by = wxa(4)?. 5.3.3 LS Solution Via QR Factorization Let Ac IR™™ with m > n_ and 6 € IR” be given and suppose that an ‘orthogonal matrix Q € Fe" has been computed such that, QAaR= [4] min (6.2.2) is upper triangular. If a= [i] ate Az O02 = (9? Az —Q7612 = Riz—cllk + HAIR 240, CHAPTER 5. ORTHOGONALIZATION AND LeAsT SQUARES for any z ¢ IR". Clearly, if rank(A) = rank(R;) = n, then zzs is defined by the upper triangular system Ryzzs = ¢. Note that pus =hd ly. We conclude that the full rank LS problem can be readily solved once we have computed the QR factorization of A. Details depend on the exact QR procedure. If Householder matrices are used and Q” is applied in factored form to b, then we obtain Algorithm 5.3.2 (Householder LS Solution) If A€R™" has full column rank and 6 € RP, then the following algorithm computes a vector us €R" uch thet | Azz ~ 6 ly is minitoum, Use Algorithm 5.2.1 to overwrite A with its QR factorization, ‘This method for solving the full rani LS problem requires 2n?(m — n/3) flops. 
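To make the comparison concrete, here is a small MATLAB sketch (ours) of the two solution paths using built-in functions; the text's algorithms keep Q in factored form rather than forming it explicitly, but the arithmetic is the same.

    % Solving min ||A*x - b||_2 with rank(A) = n, two ways.
    m = 8;  n = 3;
    A = randn(m,n);  b = randn(m,1);

    % Normal equations (Algorithm 5.3.1): Cholesky factor of C = A'*A.
    G = chol(A'*A, 'lower');
    x_ne = G'\(G\(A'*b));

    % Householder QR (Algorithm 5.3.2): thin QR, then back substitution.
    [Q, R] = qr(A, 0);
    x_qr = R\(Q'*b);

    % For well-conditioned A the two answers agree to about machine precision;
    % as kappa_2(A) grows, the normal equations solution deteriorates first,
    % with an error that behaves like u*kappa_2(A)^2.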
The O(mn) flops associated with the updating of 8 and the O(n#) flops associated with the back substitution are not significant compared to the work required to factor A. It can be shown that the computed Zz solves in| (A + 5A)z ~ (b+ 65) lly (5.3.4) where WéAllp < (6m — In + A1)null A fp + O(u?) (6.35) and 116i], < (6m —3n-+ 40}nul] bf), + O(w). (5.36) ‘These inequalities are established in Laweon and Hanson (1974, p.908) and show that #19 satisfies a “nearby” LS problem. (We cannot address the relative error in Zs without an LS perturbation theory, to be discussed shortly.) We mention that similar results hold if Givens QR is used. 5.3.4 Breakdown in Near-Rank Deficient Case Like the method of normal equations, the Householder method for solving ‘the LS problem breaks down in the back substitution phase if rank(A) wee [Jaks then W4e-68 = 1D-uTias 9 = fo ([ $]2-[]) ff for any + € R", Cleatiy, zz is obtained by solving the nonsingular upper ‘triangular eystem Sy ‘The computed solution #zs obtained in this fashion can be sbown to solve a nearby LS problem in the sense of (5.3.4)-(5.3.6). This may soem. 242 CHAPTER 5. ORTHOGONALIZATION AND LEAST SQUARES surprising since large numbers can arise during the calculation. An entry in the scaling matrix D con double in magnitude after a single fast Givens update. However, langeness in D must be exactly compensated for by large- ness in M, since D-/1M is orthogonal at all stages of the computation, It is this phenomenon that enables oae to push through a favorable error analysis, 5.3.7 The Sensitivity of the LS Problem We now develop a perturbation theory that astists in the comparison of the normal equations and QR approaches to the LS problem. The theorem below examines how the LS solution and its residual are affected by changes in A and 6, In so doing, the condition of the LS problem is ideatiGed. ‘Two easily established facts are required in the analysis: Alig (ATA) lly = aA) (5.3.7) HAIR WATAY IL, = 626A)? These equations can be verified using the SVD. ‘Theorem 5.3.1 Suppose z, r, 3, ond? satisfy YAz—b], = min r=b-Az I{A + 6A) ~ (+88) |. = min F = (b+ 6b) ~(A + 5A}e here A and 5A are in R™” with m > n and 0 #6 and 6 ore in R™. If - WéAl, 1642). onlA) ‘ wae ea I < axa) and PLS sn) = Ge A where ps = [| Azzs Bila, then ok Se (a + cancepeatay*} +0) (63.8) 00(8) < €(1+262(4)) min(l»m—n) + Of). (5.3.9) 5.3. ‘Tae Fuu, Rank LS Paopuem 43, Proof, Let E and f be defined by B = 5A/¢ and f = b/c, By hypothesis 184], < oq(A) and so by Theorem 2.5.2 we have rank(A +tE) = n for allt € [0,¢). It follows that the solution z(t) to (A+tET(A+tE)2(t) = (A+tE)P(+ tf) (6.3.10) is continuously differentiable for all t € (0,¢). Since z = 2(0) and # = x(e), we have Baz + c€(0) + O(2). ‘The assumptions # 0 and sin(@) # 1 ensure that z is nonzero and so We-th | 14M , gga Ti ish +O): (6.3.11) In order to bound £(0) fj, we differentiate (5.3.10) and set ¢ = 0 in the result. This given ET Ax + ATEx+ AT Ad(0) = ATS +57 O50) = (ATAY“LAT(y = Be) + (ATAYET. (63.12) By subetituting this reult into (53.11), taking norms, and using the easily verified inequalities | f fy < 1b, and | Bl, n and does not require as much storage. ‘© QR approaches are applicable to a wider class of matrices because ‘the Cholesky process applied to ATA breaks down “before” the back substitution process on Q7A = R. ‘At the very minimum, this discussion should convince you how dificult it ‘can be to choose the “right” algorithm! Problems PS.3.1 Amume ATA = AT, (ATA4 F)S = AT, and 2] Flla < 09(4)?. 
Show that pm bo Az and # ob Ae, then Pr AUATA$ FP and bork saaca ttle, PE32 Assume that ATAz = ATO and that ATAL = ATO+ J where BStly S cul APB Ty and A be al oka rank. Show hat Lethe aya hv Tai Se Tato P38 Let Ac RE" with m > mand y € RM and define A = [A yf € FOXIMY), Show that 21(4) > 0x(A) sod eny1(A) & on(A). Thus the condition grow if «column (a edded to « matrix PSB Lat AG FTA (m Dn}, we RO, and deine [a] ‘Show that on(B) 2 on(A) sod ox(B) < VTATE +HreHd Thus, the contin of 8 ‘matric say aca or decrease if w row id odd, 5.3.6 (Cline 1972) Suppom that A ¢ F™™® has rank wand that Gouin aliniaation ‘ith partial pivoting i dv compote the factotation PA = Li, whare L € FOO i ‘ait lower triangular, U € Fo"" ie uppar triangular, and P € RO" is permaation plata how the desompontin a P&S cau be we to Bd a actor 5 © Re such thet || Le— Poly is minima: Show that f Us = ton | As by a tinimaim. Show ‘ha tia metbod of anving the £8 probe io more fila than Qi bom ‘he flop poiat of view whenever m © Sn/3 5.3.8 The matrix C = (ATA)! where rak(A) =, aien fa many eatatica spl cations and le known asthe worance-coversance rats Amun tat the factorization 246 CHAPTER 5. ORTHOGONALIZATION AND LEAST SQUARES A= QRia wmailabla. (a) Sbow C= (ATR). (b) Give an algorithm for computing the ingoaal Of thas requires 72/3 flope, (c) Show that at py = [ A+ Cmat Foo ee[S SE] oe ccm n [ONC EM whace C: = (S75)-. (4) Using (c), give an algorithm thst overwrites the upper tri- Angular portion of R with the upper trangular portion of C- Your algorithm should raquire 20/3 tops. 5.8.7 Suppose A CFE" is symmutric and that y = §— Ar where r,b,z € RO and 2 is ponsero.. Show bow to compute « symmetric E € RO*™ with minizal Frobeaius form 90 that (A+ E)z = b Hint. Use the QR factorsstion of [2, r] and note that Exar > (QTEQ)|G?2)=Q"r. 5.3.8 Show bow to compute the nearest circulant matic to a given Toeplts matrix. Meagure distance with the Probeniua norm. Notes and References for Sec. 5.3 (Our restriction to lat quaren approximation is nota vote aginst minimization in other sorms, There are occasions when #t is advisable to minimize | Az bl for p= 1 and ‘e. Some algorithins for doing this are decribed in ALK, Cline (19764). *A Descent Method for the Uniform Solution to Overdetermined ‘Systeme of Equations.” SYAM J. Num. Anal. 19, 208-200 RH Bartels, AR. Cons, and C. Charalambous (1078), “On Cline’s Direct Method for Solving Overdetermined Linear Syatamn ia the Log Sense,” SIAM J. Narn. Anal 15, 255-70, ‘TP. Coleman and Y. Li (1992). *A Globally sod Quadraticaly Convergent Affine Scaling Method for Linear Ly Problems.” Mathematical Programming, 56, Series A, 199-202 ¥. Li (1993), 3, 609-629, YY. Zhang (1968). “A PrimalDual Interior Point Approach for Computing the L1 and Lea Solutions of Overdetermined Linear Systema” J. Optimisation Theory and Ap- plieations 77, 325-34. ‘The use of Geum transformations to solve the LS problem has attractod soma attention becouse they are chesper to use than Housebelder or Givens mairioas See G. Peters and J. Wilkinaoa (1970). “The Least Squares Problem and Parudo-Iaveray,” Comp. J. 13, 20-16. A.K. Cline (1973), “An Elimination Method for the Selution of Linar Last Squares Problems,” SIAM J. Num. Anal. 10, 289-60. RA, Plemmons (1974). “Linear Least Squares by Elimination and MGS" J. Assoc. Gomp. Mach. 81, 581-85 [mportant analyse of the LS problem and vatious solution approaches include (GH. Golub and LH. Wilkinaon (1966). "Note on the Iterative Refinement of Lesst Squares Solution” Numer. Math. 
9, 130-48 ‘A. van der Stas (1OTS). "Stabilty ofthe Solution of Linear Lanst Squares Problem,” Mumer. Math. 23, 21-54. YY. Saad (1986). "On the Condition Number of Some Gram Matrices Ariing from Least ‘Square Approximation in the Complex Plane,” Numer. Math. 48, 397-348. A. Bjécck (1987). “Stability Analyain of the Method of Seminormal Equations” Lin. Alp. and its Applic. $6/80, 31-18 5.3. THE FULL RANK LS PROBLEM 27 41. Gluchowala and A. Smoktunowics (1990) “Solving the Linear Least Squares Problem ‘with Very High Relative Accurecy.” Computing 46, 45-354, A. Birk (1991), “Component-wise Perturbation Azalysn and Bvor Bounds for Linear Loan Square Solutions” BIT 31, 238-244. A. Bick and C.C. Paige (1982). “Loa and Recapture of Orthogonality in the Modified GGramSchuni Algorthin,” SIAM J. Matrix Anal Appl 19, 176-190. B, Waldéa, R. Karle, J, Sun (1998). *Optimal Beckward Perturbation Bounds for the Linear Least Squares Problem,” Numerical Lin. lg, widh Applic. % 271-286, "The “seminormal” equations are given by AF Rz = AT where A = QR. ta the abowe paper it is shown that by oiving the eminormal equations an acceptable LS solution i» ‘obtained if one step of xd precision iterative improvement i performed, ‘An Algol implementation ofthe MGS method fr solving the LS problem appears in FLL, Baver (1965), “Elimination with Weighted Row Cormbination for Solving Lim- ‘car Equations and Least Squares Problems,” Memer. Math. 7, 398-52. See also ‘Witkingon and Reissch (1071, 119-33) ‘Least squares problems often have special structure which, of course, shouldbe exploited. M.G. Cox (1981). “The Least Squares Solution of Overdeterniaed Linear Bqustions having Baad or Augmented Band Structure,” [MA J. Mum. Anak 1, -2. G. Cybeako (1984). “The Numerical Stability of the Lattice Algorithn for Lest Squares Linear Prediction Problems” BIT 24, 441-485. P.G. Hansen and H. Genmar (1993). “Past Decomposition of Raak- Deficient ‘Toeplite Matsien” Momerical Algorithms 4, 151-166. ‘The use of Householder matrices to olve sparse LS problema requires careful attention te avo exceeive Blin. ‘EK. Reld (1967). “A Note on the Least Squares Solution of « Band Syatern of Linear Baquations by Housdzolder Reductions,” Comp. J. 1 188-89. LS. Dalf and 3-K, Reid (1976). "A Comparian of Some Methods fr the Solution of ‘Sparse Over Determined Systems of Linear Equation,” J. nut Math Applic. 17, 267-80 PLE. Gill and W. Murray (1076). “The Orthogonal Factorization of « Large Sparse “Matrix” in Sparse Matrix Computations od. 3.8. Bunch and D.J. Rove, Academie -Preas New York, pp. 177-200. 1 Kautaua (1979). “Application of Dense Householder Trunsormatioas to « Sparse Matrin” ACM Tron Math. Soft. 5, 442-81 Although the computation of the QR factorization is more ecient with Houssholdor ‘efloctions, there ae sore settings where the Givens approech is ndvantaguous. Por ‘ample, if Ais eparm, then the careful application of Gives rotations can minimise fit-in. 1.8, Duff (1974), “Pivot Selection and Row Ordering in Givens Reduction on Sparse Maiziow” Computing 13, 235-18. LA. George and MT. Heath (1980). “Solution of Sparse Linear Lost Squares Problems Using Givens Rotations.” Lin. Alp. and its Apphc. 24, 69-83. 248 CHAPTER 5. ORTHOGONALIZATION AND LEAST SQUARES 5.4 Other Orthogonal Factorizations If A is rank deficient, then the QR factorization need not give a basis for ran(A). This problem can be corrected by computing the QR factorization of 8 column-permuted version of A, ie., All = QR whore II is « permuta- tion. 
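In MATLAB the column-pivoted factorization is available directly from qr with three output arguments; the following sketch (ours; the tolerance tol is an illustrative choice, not one prescribed by the text) uses it to estimate rank and to extract an orthonormal basis for ran(A).

    % Column-pivoted QR of a rank-deficient matrix (illustration only).
    A = randn(6,4);
    A(:,4) = A(:,1) + A(:,2);            % force rank(A) = 3
    [Q, R, P] = qr(A);                   % A*P = Q*R, |R(1,1)| >= |R(2,2)| >= ...
    tol = max(size(A))*eps*abs(R(1,1));  % a hypothetical rank tolerance
    r = sum(abs(diag(R)) > tol);         % estimated rank (3 here)
    Q1 = Q(:, 1:r);                      % orthonormal basis for ran(A)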
‘The “data” in A can be compressed further if we permit right multipli- cation by a general orthogonal matrix Z: QrAZ = ‘There are interesting choices for Q and Z and these, together with the column pivoted QR factorization, are discussed in this section. 5.4.1 Rank Deficiency: QR with Column Pivoting If Ac R™" and rank(A) k be the smallest index such that 1 1 APP ty = ae (1a? I 1a} (5.43) Note that fk—1 = rank(A), then this maximum is zero and we are finished. ‘Otherwise, let TI, be the n-by-n identity with columns p and k interchanged and determine a Householder matrix H, such that if R®) = HyR*-"T,, then R (i+ 1:m, k) = 0. In other words, IT, moves the largest coluran in ‘REY to the lead positon and Hy zeroes all ofits subdiagonal components. ‘The column norms do not have to be recomputed at each stage if we exploit the property om [2], ‘which holds for any orthogonal matrix Q € IR", This reduces the overhead, associated with column pivoting from O(mn?) flops to O(mn) flops because ‘we can get the new columa norms by updating the old column norms, ¢.., FOU = p29 HB. Combining all of the abave we obtain the following algorithm established by Businger and Golub (1965): = dwiR = bel - Algorithm 5.4.1 (Householder QR. With Column Pivoting) Given AeR™" with m > n, the following algorithm computes r = rani(A) and the factorization (5.4.1) with Q = Hy---H, and I =Th---Ml,, The upper triangular part of A is overwritten by the upper triangular part of R and components j + I:m of the jth Householder vector are stored in AG + lim, 3). The permutation II is encoded in an integer vector piv. In particular, TI; is the identity with rows j and piv(j) interchanged. 250 CHAPTER 5. ORTHOGONALIZATION AND LEAST SQUARES for j= 1m (9) = Aim, 3) ACL, 1) end 7 = 0; 7 = max{e{1),...y6{n)} Find smaller k with 1< k' 0 orth piv(r) = k; A(Lam,r) ++ A(Lim, k); ofr) + ofk) {v4 ] = house(A(r:m,7)) A(rsn, rin) (In = Bov™ )A(rim, rn) Ar +1:m,r) = om —r +1) fori=r+in oft) = eft) ~ Alri? end itr n. We next show how to compute orthogonal Up (m-by-mn) and Vi (n-by-n) such that Ca rr) Ob f 0 URAVa = | 0 6.4) ° Up = Us-+-Up and Vp = Vi--+Vqu2 can each be determined as a product of Householder matrices: xx x x x x Kx xX x x ox x x xx x x/B%lo x x x/% xx x x ox x x x xX x x Ox x x xx 00 x x 00 ox x x Ox x x ox xx|{4%looxx|% ax x x 00x x ax x x 00x x x x00 xx 00 xx 00 ox x 0 ox xo ox x 0 ooxx|/&looxx|/4loox x oo x x oo00 x 000 x 00x x oo00x ooo00 252 CHAPTER 5. ORTHOGONALIZATION AND LEAST SQUARES In general, Uy introduces zeros into the kth column, while Vj, zeros the appropriate entries in row k. Overall we have: Algorithm 5.4.2 (Householder Bidiagonalization) Given A ¢ R™* with m 2 n, the following algorithm overwrites A with UAVs = B where Bis upper bidiagonal and Ug = U---Up and Vp = Vi--Vaoa. The cesoential part of U;'s Householder vector is stored in A(j + 1:m, 3) and the essential part of V;'s Householder vector is stored in A(,j + 2:n). for j=lin {v6} = house(A(:m.3)) AG, j = Bev? AG, jin) = house(A(j,j + 1:n)7) AG, j + Ln) = AG:m, j + 1:n)(Iny — Bev) AG, § + 2in) = v(2:n = §)? end end ‘This algorithm requires 4mn? — 4n?/3 flops. Such a technique is used in Golub and Kahan (1965), where bidiagonalization is frst described. If the matrices Up and Vg are explicitly desired, then they can be accumulated in 4mtn - 4n?/3 and 4n3/3 flops, reepectively. The bidiagonalization of A is related to the tridiagonalization of A? A. See §8.2.1. 
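An explicit (and deliberately inefficient) MATLAB sketch of the bidiagonalization may help fix the pattern of left and right reflections. It accumulates U_B and V_B as full matrices instead of keeping them in factored form, and the helper house_reflector uses a simplified sign choice; both simplifications are ours, not the text's.

    function [U, B, V] = bidiag_sketch(A)
    % Householder bidiagonalization (cf. Algorithm 5.4.2): U'*A*V = B
    % with B upper bidiagonal; written for clarity, not efficiency.
      [m, n] = size(A);
      U = eye(m);  V = eye(n);  B = A;
      for j = 1:n
        % Left reflection: zero B(j+1:m, j).
        P = house_reflector(B(j:m, j));
        B(j:m, :) = P*B(j:m, :);
        U(:, j:m) = U(:, j:m)*P;
        if j <= n-2
          % Right reflection: zero B(j, j+2:n).
          P = house_reflector(B(j, j+1:n)');
          B(:, j+1:n) = B(:, j+1:n)*P;
          V(:, j+1:n) = V(:, j+1:n)*P;
        end
      end
    end

    function P = house_reflector(x)
    % P = I - (2/(v'*v))*v*v' maps x to a multiple of e_1
    % (cancellation-free sign choice).
      sigma = norm(x);
      if sigma == 0
        P = eye(length(x));
        return
      end
      v = x;
      if x(1) >= 0
        v(1) = v(1) + sigma;
      else
        v(1) = v(1) - sigma;
      end
      P = eye(length(x)) - (2/(v'*v))*(v*v');
    end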
Example 6.4.2 If Algorithm 5.42 is applied to 1203 45 6 A=] 7 3 9] wou 2 then Yo three sigiflcant digits wo obtain me me 0 100 099 a0 B=] 8 4-8 | ve =| 000 ~ cor 105 gg 8 00 Lies ser -o76 =m a2 = Sao 451 — 238 “aor = 540 Los 70 = 487 Smo “312 (set 07 Oo 5.4.4 R-Bidiagonalization AA faster method of bidingonalising when m °>-n reaulte if we upper trian- gularize A first before applying Algorithm 5.4.2. In particular, suppose we 5.4, OTRER ORTHOGONAL FACTORIZATIONS 253 compute an orthogonal Q ¢ R™" such that 40 [ ® ? a=[ ms ] is upper triangular. We then bidiagonalize the square matrix Ri, URRVa = Bs. Here Up and Vp are n-by-n orthogonal and By is n-by-n upper bidiagonal. 16 Up = Qing (Va Jman) then ray ef] y av =[ Bs J. is a bidiagonalization of A. ‘The idea of computing the bidisgonalization in this manner is mentioned in Lawson and Hanson (1974, p.139) and more fully analyzed in Chan (19828), We refer to this method as R-bidiagonalization. By comparing its flop count (2mn?-+2nt) with that for Algorithm 5.4.2 (rmn?—4n3/3) we see that it involves fewer computations (approximately) whenever m > Sn/3. 5.4.5 The SVD and its Computation Once the bidiagonalization of A has been achieved, the next step in the Golub-Reinsch SVD algorithm is to zero the superdiagonal elements in B. ‘This is an iterative process and is accomplished by an algorithm due to Golub and Kahan (1965). Unfortunately, we must defer our discussion of this iteration until §8.6 as it requires an understanding of the symmetric eigenvalue problem. Sulfice it to say here that it computes orthogonal matrices Up and Ve auch that UEBVe = E = disg(or,...,on) € R™. By defining U = UpUr and V = VaVe we see that UTAV = E is the SVD of A. The flop counts augociated with this portion of the algorithm depend upon “bow much” of the SVD is required. For example, when solving the LS problem, U7 need never be explicitly formed but uerely applied to b 9 it is developed. In other applications, only the matrix U; = U(:,1in) is required. Altogether there are six possibilities and the total amount of ‘work required by the SVD algorithm in each case is summarized in the table below. Because of the two possible bidiagonalization schemes, there are two columns of flop counta. If the bidiagonalization is achieved via Algorithm 5.4.2, the Golub-Reinsch (1970) SVD algorithm results, while if ‘Rbidiagonalization is invoked we obtain the R-SVD algorithm detailed in ‘Chan (19828). By comparing the entries in this table (which are meant only aa approximate estimates of work), we conclude that the R-SVD approach is more efficent unless m ®n, 254 CHAPTER 5. ORTHOGONALIZATION AND LEAST SQUARES. Required | GolubReinsch SVD | SVD B | 4mn? — 4nt/3 ‘mn + On? BV 4mm? +8n? | 2mn? + 11n? BU admin —8mn? | Amn + 130? EU; Mn? — an? | 6ma? + 1ant EUV | dm?n+8mn?+9n? | dm?n + 2203 EUV | ldmn?+8n? | 6mn? + 20n? Problems PB.4.1 Suppoee A FE" with m (ofa? i rt where a = Vz. Clearly, ifx solves the LS problem, then a = (u?6/a,) for i= Lr. If we set a(r + 1:n) = 0, then the resulting x clearly has minimal 2norm, 5.5.4 The Pseudo-Inverse Note that if we define the matrix A* €R™" by At =VETUT where Bag (2 a then sus = Atb and prs = ||(T- AA*)b lla. AX is referred to as the peeudo-inverse of A. 
It is the unique minimal Frobenius norm solution to the problem =.0, ) reat min |AX—Imlle- (5.5.3) xian | Ir ) If ramk(A) = n, then A+ = (ATA)-1A7, while if m =n = rank(A), then At = A7!, Typically, AY is defined to be the unique matrix X € RO" ‘that satisfies the four Moore-Penrose conditions: @ AXA = A (i) (ANT = ax (i) XAX = X (iv) (XA)P = XA 258 CHAPTER . ORTHOGONALIZATION AND LEAST SQUARES ‘These conditions amount to the requirement that AA* and A* A be orthog- ‘onal projections onto ran(A) and ran(47), respectively. Indeed, AA* = UsUT whese U, = U(i:m,) and AtA = VV where Vi = Vn, Is). 5.5.5 Some Sensitivity Issues In §5.3 we examined the sensitivity of the full rank LS problem. ‘The be- havior of zz in this situation is summarized in Theorem 5.3.1. If we drop the full rank assumptions then zs is not even « continuous function of the data and small changes in A and 6 can induce arbitrarily large changes in 213 = A*b. The easiest way to see this is to consider the behavior of the pooude inverse. If A and 6A are in R™", then Wedin (1973) and Stewart (1975) show that [(A+6A)* ~ AF lp S264 f peax {lf At IE , (A+ 6A)* EY. ‘This inequality is a generalization of Theorem 2.3.4 in which perturbations in the matrix inverse are bounded. However, unlike the square nonsingular ‘ase, the upper bound does not necessarily tend to zero as 5A tends to zero. 1 10 00 A=|00 ond 6A=/0 € 00 oo arn [3 3 8] at areore[t 2 8] then and [At -(A+6A)* lla = 1/e. The numerical determination of an LS ‘minimizer in the presence of such discontinuities is 8 major challenge. 5.5.6 QR with Column Pivoting and Basic Solutions Suppose A ¢ RO" has rank r. QR with column pivoting (Algorithm 5.4.1) producee the factorization AI = QR where = [fa Ra] 6 a [ES] ale, rae Given this reduction, the LS problem can be readily solved. Indeed, for any 2 € R° we have WAz~12 = 1(QTAN(Ts) - (97) 13 = [Ruy- (Ras) +1el8, 5.5. Tue RANK Derictenr LS Propuem 259 [2] at. mt or [4] ate ‘Thus, if z is an LS minimizer, then we must have san[ FH Rad J where If z is set to zero in thie expression, then we obtain the basic solution 1 so=o[ Me]. Notice that zp has at most + nonzero components and so Azp involves a subset of A’s columns. ‘The basic solution is not the minimal 2-norm solution ualese the sub- matrix Ry is zero since Ry Ria Iasi = Jeon a4 zeR | 2 654) Indeed, this characterization of || zz lz can be used to show ilzels “Ral < Ji 0 RG Ra - 555 < Tee S Viti RG Rall 653) See Golub and Pereyra (1976) for details. 5.5.7 Numerical Rank Determination with AIl = QR. If Algorithm 5.4.1 is used to compute zp, then care must be exercised in the determination of rank(A). In order to appreciate the difficulty of this, suppose [ i) ae ] k Ag RAT --- Th) = RO) = , o ap] mk 2 is the matrix computed after k steps of the algorithm have been executed im floating point. Suppose rank(A) Because of roundoff error, Hi) will not be exactly zero. However, if RG) is suitably small in norm then it ‘is reasonable to terminate the reduction and declare A to have rank k. A typical terminetion criteria might be VAD Isat Al (656) 260 CHAPTER 5. ORTHOGONALIZATION AND LEAST SQUARES for some small machine-dependent parameter 1. In view of the roundoff properties associated with Householder matrix computation (cf. 
§5.1.12), wwe know that AM) is the exact R fector of o matrix A+ Ey, where W Esl sell Alls On) Using Theorem 2.5.2 we have oea(A + Be) = onx(R) < 1D Ma Since ox+1(A) < ou4i(A + Es) +] Ee lait follows that aasa(A) < (6 +e Ally In other words, a relative perturbation of O(c; +¢2) in A can yield a rank-k matrix. With this termination criterion, we conclude that QR with column pivoting “discovers” raak degeneracy ifn the course of the reduction AM?) is small for some k 0 (See Lamon aad Hanson (1074, p21)) These matrices sre unaltered by Algorithm 84.1 and thus | 2 fa > 3”~! fork = I:n~1. This inequality ‘implies (for example) that the matrix Tico(.2} bas no particulary mall trailing prieipal submatrix since «1.13. However, can be shown that em = O(107*) 5.5.8 Numerical Rank and the SVD ‘We now focus our attention on the ability of the SVD to handle rank- deficiency in the presence of roundoff. Recall that if A = UEVT is the SVD of A, then ts= 5s ah, (85.7) where r = rank(A), Denote the computed versions of U, V, and B= diag(o:) by U,V, and 5 = ding(4,). Assume that both sequences of singular 5.5. THE Rank Dericient LS Prosuem 261 values range from largest to smallest. For a reasonably implemented SVD algorithm it can be shown that =W+aU WW=in AUS (658) VaZ+dv 27Z=1, JAVib 0 and & convention that A has “numerical rank” ? if the d, estisty az 28, > 62 G41 Do Don The tolerance 6 should be consistent with the machine precision, e.g. 6 = Lu A lle: However, if the general level of relative error in the dato is larger than u, then 5 should be correspondingly bigger, e., 6 = 10°] A loo if the entries in A are correct to two digits. IF is eccepted as the numerical rank then we can regard 262 CHAPTER 5. ORTHOGONALIZATION AND LEAST SQUARES ‘as an approximation to zs. Sinoe || z+ [2 © 1/or < 1/6 then § may also be chosen with the intention of producing an appraximate LS solution with suitably small norm. In §12.1, we discuss more sophisticated methods for doing this, If 6» > 6, then we have reason to be comfortable with z+ because A can then be unambiguously regarded as a rank(As) matrix (modulo 6). (On the other hand, {@1,...,6n} might not clearly split into subsets ‘of small and large singular values, making the determination of # by this means somewhat arbitrary. This leads to more complicated methods for ‘estimating rank which we now discuss in the context of the LS problem. For example, suppose r =n, and assume for the moment tbat AA = 0 in (55.10). Thus 0; = & fori = I:n, Denote the ith columns of the matrices U, W, V, and Z by wy 11, % and 2, respectively. Subtracting ts from ys and taking norms we obtain + W2y - 22s lla < % a Poe (fiule | ye (ey. a From (5.5.8) and (5.5.9) it is easy to verify that {1 (oF )= ~ (uf Ovi Iz S 21 + eel ble (55.11) and therefore é Bf wrb\? te-ashsZau+odon + | o (BY). ‘he paraeter # can be determined an that ineger which minimizes the upper bound. Notice that the first term in the bound increases with 7, while the second decreases. On occasions when minimizing the residual is more important than ac- curacy inthe soliton, ww can determine # on the boa of how cle We surmise || b — Azy ||, is to the true minimum. Paralleling the above analy- ‘sig, it can be shown that 1 b~ Aze lla —fb— Aris lla < (n ‘Vb ila + ell bla a +far0). ‘Again # could be chosen to minimize the upper bound. See Varah (1973) for practical details and also the LAPACK manual. 5.5.9 Some Comparisons ‘Aa we mentioned, when solving the LS problem via the SVD, only £ and V have to be computed. 
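As a concrete counterpart to the SVD route, the basic solution of §5.5.6 can be sketched with a pivoted QR factorization. The code below uses Python with SciPy; the function name, the rank tolerance, and the choice of library are ours and are meant only as an illustration of the technique:

    import numpy as np
    from scipy.linalg import qr, solve_triangular

    def basic_ls_solution(A, b, tol=None):
        # QR with column pivoting: A[:, piv] = Q @ R, with the diagonal of R
        # nonincreasing in magnitude.  The basic solution uses only the first
        # r pivoted columns, so it has at most r nonzero components.
        Q, R, piv = qr(A, pivoting=True)
        if tol is None:
            tol = max(A.shape) * np.finfo(float).eps * abs(R[0, 0])
        r = int(np.sum(np.abs(np.diag(R)) > tol))     # estimated numerical rank
        z = np.zeros(A.shape[1])
        z[:r] = solve_triangular(R[:r, :r], (Q.T @ b)[:r])
        x_B = np.zeros(A.shape[1])
        x_B[piv] = z                                  # undo the column permutation
        return x_B, r

As noted above, x_B is generally not the minimum 2-norm solution unless the submatrix R12 is zero.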
The following table compares the efficiency of this approach with the other algorithms that we have presented, 5.5. ‘Tae RANK Dericrenr LS PRosLEM 263, Flop Count mnt + 03/3 2mn? = 2n9/3 mn? Smnt — n? Amant — 4? /2 mn? + 209 mn? + 8n9 2mn? + tin? Problems 5.5.1 Show that if ae (Fo) ale where r= rank(A) and Tie nonsingular, then Too} + x= (7 oe sates AXA = A and (AX)? = (AX). In thio com, wo ay that X laa (1.3) preudo- foerseof A. Show that foe general A, 2p = Xb where X jaw (1,3) panudonvere of A. 5.5.2 Define B(A) ¢ ROM by B(A) = (ATA ANAHAF, where A > 0. Show He = NBO)~ AY ta = Saye Fa ronx(A) tno therefor that B(A) —> A* as A= 0. PB.5.3 Consider the rank deficient LS problem min WER S])fy]_fe alls C)- Ch a whee RE FEM", SE ROAM", y ER, aod x CREF Anne hat pp nae Jnr and soatingsle. Show how to ota the minimus sora sltion to this prob by computing an appropriate QR fictaraain without pvoing ad hea wlving forthe sorcopcnte y and 3. 6.64 Show tabi Ay» A and Af — A*, then thre ean integer ach tha rank(A,) ia constant for all k > ko. (5.5.5 Show that if A ¢ F™*" has rank n, then s0 dom A+ E if we have the inequality TE IRI At, <1. Notes and References for Sec. 5.5 ‘The pov inverse literature ia vast, ax evidenced by the 1,775 references in 264 CHAPTER 5, ORTHOGONALIZATION AND LEAST SQUARES M.2, Neahed (1976). Generaied inverses ond Applications, Academic Press, New York. ‘The differentiation ofthe pevuo-laverm is farther discumed in (G.I, Lawpon and BJ. Hanaon (1960). “Extensions and Applications of the Householder “Algorithm for Solving Linear Least Squares Problems,” Math Comp. 23, 7e7-812. G.H. Golub and V. Pereyre (1973). “The Oiferentiation of Paado-Inverwt and Nonlin ‘cur Lenet Squares Problems Whome Variables Separate,” SIAM J. Num. Anal. 10, 43-32, Sarvey trestineats of LS perturbation theory may be found in Lawwoa and Hanson (1974), Stewart and Sun (1001), Bjécek (1906), and PA. Wed (1073). *Pertusbation Theory for Proudo-laverses” SIT" 13, 217-82. GW. Stewart (1077), “On the Perturbation of Pooudo Inverse, Projections, and Linear Least Squares,” SIAM Reciew 19, 634-62. ‘Even for full rank problems, column pivoting seers to produce more accurate vlutions. ‘The error analysis in the following paper attempt to explain why. LS, Jennings and M.R. Osborne (1074), *A Direct Error Analysi for Least Squares,” Muoner. Math. 28, 922-92. ‘Various other aspects rank deficiency ar discumed in 1M, Vacah (1973). “On the Numerical Solution of D-Conditionad Linear Syetema with ‘Applications to Il-Posed Problem,” SIAM J. Nuon. Ansl. 10, 257-67. ‘G.W! Stewart (1964). "Rank Degeneracy,” SIAM J. Seu and Stat Comp. 5, 408-418. PC. Hansen (1987). "The Truncated SVD aa » Method for Regularisation,” BIT 27, 234-353. GW. Stewart (1987). “Collearty and Least Squaren Ragrension," Statistical Science +, 68-100. We ve more toy on te ube i $121 ad $122 5.6 Weighting and Iterative Improvement ‘The concepts of scaling and iterative improvement were introduced in the Chapter 3 context of square linear systems. Generalizations of these idess ‘that are applicable to the least squares problem are now offered. 5.6.1 Column Weighting ‘Suppose G ¢ RR™" is nonsingular. A solution to the LS problem min [Az—bl, AER™", bER™ (66.1) ‘can be obtained by finding the minimum 2-norm solution yzs to sin || (AG)y ~ ly (5.62) ‘and then setting #q = Gyzs. If rank(A) = n, then 2¢ = rs. Otherwise, za is the minimum G-norm solution to (5.6.1), where the G-norm is defined by Hla =I Ih 5.6. 
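In code, column weighting amounts to a simple change of variables. A minimal NumPy sketch follows (the function name is ours; np.linalg.lstsq returns the minimum 2-norm LS solution, which is what (5.6.2) requires):

    import numpy as np

    def column_weighted_ls(A, b, G):
        # Solve min || (A G) y - b ||_2 for the minimum 2-norm y, then x_G = G y.
        # If rank(A) = n this reproduces x_LS; otherwise x_G is the minimum
        # G-norm solution described above.
        y, _, _, _ = np.linalg.lstsq(A @ G, b, rcond=None)
        return G @ y

    # Example of a diagonal weighting: scale each column of A to unit 2-norm,
    # G = np.diag(1.0 / np.linalg.norm(A, axis=0))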
WEIGHTING AND ITERATIVE IMPROVEMENT 265, ‘The choice of G is important. Sometimes its selection can be based on A priori knowledge of the uncertainties in A. On other occasions, it may be desirable to normalize the columns of A by setting @ = Go = disg(/| AG,1) llas---,1/l) AC.) lla) - ‘Van der Sluis (1969) has shown that with this choice, x2(4G) is eppraxi- ‘mately minimized. Since the computed accuracy of yz depends on x2(AG), a case can be made for setting G = Go. ‘We remark that column weighting affects singular values. Consequently, ‘scheme for determining numerical rank may not return the same estimates when applied to A and AG. See Stewart (1984b), 5.6.2 Row Weighting Let D = ding(di,...,dy) be nonsingular and consider the weighted least squares problem minimize || D(Az-6)lz AE R™", be R™. (5.6.3) Assume rank(A) = n and that zp solves (5.6.3). It follows that the solution ‘zys to (5.6.1) satisfies tp ~ tus = (ATD*A)“1AT(D — N(b— Aas). (6.6.4) ‘This shows that row weighting in the LS problem afects the solution. (An important exception occurs when b € ran(A) for then xp = z1s.) ‘One way of determining D is to let dy be some measure of the un- certainty in 64, ¢.g., the reciprocal of the standard deviation in b.. The tendency is for rx = e[(b— Azp) to be small whenever dy ia large. The precise effect of dy on rg can be clarified as follows. Define D8) = ding(d,.--sdhaty de VIFB ditty. --sdm) where § > -1. If 2(6) ininimizes | D(6)(Az — 5) lla and r(6) is the k-th ‘component of b— Az(), then it can be shown that = rh (= Toaapepatat DEA TAT,” (05) ‘This explicit expression shows that r,(6) is a monotone decressing function ‘of 5. Of course, bow ri changes when all the weights are varied is much ‘more complicated. Example 5.6.1 Suppose “(HE 266 OnapTeR 5. ORTHOGONALIZATION AND LEAST SQUARES UD = fa then ap = [-1, 8517 aod = b~ Asp = [4 4-1, 217, On the othr hand, if D = dlag( 1000, 1, 1, 1) then we have sp © [—L43, 1.217 and Fe b~ Ay = [ 000428 ~ 571428 — 14853 259714)7, 5.6.3 Generalized Least Squares In many estimation problems, the vector of observations b is related to x through the equation b= Artw (6.66) where the noise vector w has zero mean and a symmetric positive deft nite variance-covariance matrix o7W. Assume that W ia known and that W = BBT for some B < R™™. ‘The matrix B might be given or it might be W's Cholesky triangle. In order that all the equations in (5.6.6) con- tribute equally to the detorminstion of z, statisticians frequently solve the LS problem min{| B-"(Az - 6) Ia 6.6.7) ‘An obvious computational approach to this problem is to form A= B-'A and 5 = 8-1 and then apply any of our previous techniques to minimize Az ~Bi2. Unfortunately, z will be poorly determined by such a proce- dure if B is il-conditioned. ‘A much more stable way of solving (6.6.7) using orthogonal transforma- tons has been suggested by Paige (1979, 19798). It is besed on the idea ‘that (5.6.7) is equivalent to the generafized lenst squares problem, min oy. 6.88) or ‘Notice that this problem is defined even if A and B are rank deficient. Although Paige’s technique can be applied when this in the case, we shall describe it under the assumption that both these matrices have full rank. ‘The first step is to compute the QR factorization of A: ra = {® = wee[S] entg An orthogonal matrix Z ¢ R™*"" is then determined so that Qtez=(0 $s | Z=(% Bh} ‘where Sis upper triangular. With the use of these orthogonal matrices the constraint in (5.6.8) transforms to [S)-[t)- [es 12) 5.6. 
WEIGHTING AND ITERATIVE IMPROVEMENT 287 Notice that the “bottom half" of this equation determines v, Su=Qtb Za, (6.6.9) ‘hile the “top half” prescribes 2: Ruz = QTb— (QIBLZT + QT BLA v= QTb-QTBZau. (6.6.10) ‘The attractiveness of this method is that all potential ill-conditioning is concentrated in triangular systems (5.6.9) and (5.5.10). Moreover, Paige (19790) has shown that the above procedure is numerically stable, eome- thing that is not true of any method that explicitly forms B~'A, 5.6.4 Iterative Improvement ‘A technique for refining an approximate LS solution has been analyzed by ‘BjGrek (1967, 1968). It is based on the idea that if dle] - then b— Az tp = min. This follows because r+ Ax = band A7r = Oimply 7b. ‘The above augmented system is nonsingular if rank(A) = By casting the LS problem in the form of a square linear systetn, the iterative improvement schame (3.5.5) can be applied: [ : ] AcR™™", DER" (5.6.11) 7 =; 2 20 (2 ]- [a] -Lé af] [fyi )- [2] [tem] = [8] + [25] end ‘The residuals / and g'*) must be computed in higher precision and an orginal copy of A must be around for this purpose. 268 CHAPTER 5. ORTROGONALIZATION AND LEAST SQUARES If the QR factorization of A is available, then the solution of the aug- ‘mented system is resdily obtained. In particular, if A= QR and Ri = (Ln, 1:n), thea a system of the form Lolz] -[7] (441-8 a hj on 7, A) on or= [fl mtn = [A] mtn ‘Thus, p and z can be determined by solving the triangular systems RTA = 9 tod Ris = fi —h and seting p= 9 |; Asuning nt ie tore in factored form, each iteration requires Bran — 2n? flops. ‘The key to the iteration’s success is that both the LS residual and so- lution are updated—not just the solution. Bjéeck (1968) shows that if ‘ma(A) ~ 6 and digit, @-boce arithmetic is used, then 2") has apprexi- mately k(¢ ~ q) correct base 2 digits, provided the residuals are computed in double precision. Notice that it is xo(A), not 2A)’, that appears in this heuristic, Probleme ‘transforms to PSA Verity (564). 5.8.2 Let A.€ RE" have fll rank ad delice the dlagonal mattx Am ding Lecod 04 Bde oe Ae oo) for 6 > ~1. Denote tho LS elution to min (Ax ~ 8) [aby =(6) nd tarsal by 116) b= Az(0). (0) Show MATA) uk HO = ( - Ae) (©) Latingr4(8) and forthe hth compooent of (6), show ra(0) “0 = Trapaatanare (6) Uae () 19 verity (8.65), 5.0.3 Show how the SVD can be wed to sive the generalized LS problem when the matrices A and B in (5.6.8) are rank deficient. 5.6. WEIGHTING AND ITERATIVE IMPROVEMENT 269 PB.OA Lat Ae RO have rank n aod for a > 0 define M(o) A a oO]: cnn = mela. fey} od determine the value of cr tha mininizen 13(0(a)). PS.0.5 Another iterative improvereat method for LS problema is the following: Show that Az) (double precision) 14s) — 700g, = min 24 22 40) end (4) Anmuming that the QR factorization of ia avaiable, how maay, Sopa per iteration te cenuiced? (b) Show thar the above iteration revue by ering gl) = 0 in the ter tive improvement achere gives in $5.64. Notes and References for See. 5.6 Row and columa waighting in the LS problem is diacumed in Laweon and Hanson (SLS, Pp 180-88), The various eects of scaling are dlaamed in ‘A. van der Sluia (3960). “Condition Numbers and Equilbration of Matrica,” Numer. Math 14, 16-23. G.W. Stewart (1964). "Op the Asymptotic Behavior of Scaled Singular Value and QR Decompasitions,” Math Comp. 43, 483-490 ‘The theoretical and computational aspects ofthe generalized Jagt squares problem ap- pear in 5. Kouroubtis and C.C. Paige (1961). 
“A Constrained Least Squoree Approach to the (General Gausw-Martow Lina Modal,” J. Amer. Stat Assoc. 75, 620-25. C.C. Paige 1070s). “Compute Solution and Perturbation Analysis of Generailaad Lanst ‘Squares Probleme.” Math. Comp. 39, 171-84. C.c. Paige (10700). "Fast Numerically Stable Computations for Generalised Linear Lost Squares Problens,” SIAM J. Nem. Anal 6, 16-71. C.C. Paige (1085). ~The General Limit Modal andthe Generallaad Singular Valve Decomponition, Lin. Alp and Its Apple. 70, 80-294. ‘Rerative improvemont inthe lst equares contert i dacumed in (GLH. Golub and JL. Wihionn (1966). “Note on tertive Refinement of Lent Sqres “Solutions.” Numer. Math. 9, 130-48. A. Bjirck and G.HE Golub (1967), “Tteretive Reoement of Linear Least Squares Sol. tous by Householder Trassiormation,” BIT 7, 32-37. A aes coon “Teeative Reinament of Linear Least Squares Soletions L* BIT 7, A. Biro (968), “Tertve Raiment of Liser Lest Squares Soltions 1,7 B0T 8 +20. A. Bjbrck (1987), “Stability Analysis of the Method of Seminormal Equations for Linear ‘Leant Squares Problema,” Linser Al. and Its Applic. 88/80, 31-4. 270 CHAPTER 5. ORTHOGONALIZATION AND LEAST SQUARES 5.7 Square and Underdetermined Systems ‘The orthogonalization methods developed in this chapter can be applied to square systems and also to systems in which there are fewer equations than ‘unknowns. In this brief section we discuss some of the various possibilities. 5.7.1 Using QR and SVD to Solve Square Systems ‘The least squares solvers based on the QR factorization ond the SVD can be used to solve square linear systems: just set_m =n. However, from the flop point of view, Gaussian elimination is the cheapest way to solve ‘square linear system as shown in the following table which assumes that the right hand side is available at the time of factorization: Method Flops Goussian Elimination 203 /3 Housebolder Orthogonalizstion | 4n¥/3 Modified Gram-Schmidt ant Bidiagonalization 8n9/3 Singular Value Decomposition | 12n? ‘Nevertheless, there are three reasons why orthogonalization methods might be considered: ‘© The flop counts tend to exaggerate the Gaussian elimination advan ‘tage. When memory traffic and vectorization overheads are consid- ered, the QR approach is comparable in efficiency. ‘© The orthogonalization methods have guaranteed stability; there is no “growth factor” to worry about as in Gaussian elimination, ‘+ In cases of ill-conditioning, the orthogonal methods give an added ‘measure of reliability. QR with condition estimation ia very depend- able and, of course, SVD is unsurpassed when it comes to producing a meaningful solution to a nearly singular system. Wo are not expressing a strong preference for orthogonalization methods but merely suggesting viable alternatives to Gauesian elimination, ‘We also mention that the SVD entry in Table 5.7.1 assumes the avail- ability of bat the time of decomposition. Otherwise, 20n® flops are required because it then becomes necessary to accumulate the U matrix. If the QR factorization is used to solve Az = 6, then we ordinarily have to carry out a back substitution: Rz = Q7b, However, this can be avoided by “preprocessing” b. Suppose H is « Householder matrix such 5.7. SQUARE AND UNDERDETERMINED SYSTEMS am. that Hb = eq where en is the lest column of Jy. If wo compute the QR factorization of (HTA)", then A= H7R™QT and the system transforms to Bly = fey where y = Q?z. Since RT is lower triangular, y = (8/ran)en and 50 = Zain. 
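For a square system, the QR approach amounts to one factorization followed by a back substitution. The sketch below (Python with SciPy; the random test matrix is purely illustrative) compares it with Gaussian elimination:

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve, qr, solve_triangular

    def qr_solve(A, b):
        # Householder QR of A, then back substitution with R x = Q^T b.
        Q, R = qr(A)
        return solve_triangular(R, Q.T @ b)

    rng = np.random.default_rng(0)
    A = rng.standard_normal((6, 6))
    b = rng.standard_normal(6)
    x_ge = lu_solve(lu_factor(A), b)     # Gaussian elimination, about 2n^3/3 flops
    x_qr = qr_solve(A, b)                # Householder QR, about 4n^3/3 flops
    print(np.linalg.norm(x_ge - x_qr))   # the two solutions agree to roundoff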
5.7.2 Underdetermined Systems We say that a linear system Az AcR™", beR” (ray is underdetermined whenever m 5.7.3. Show how triangular system solving can be avoided when using the QR factor lantioa to sive an underdetarmined system. PS.T.4 Suppose &.2 € IP are given. Consider the following problems: ma CHAPTER 5. ORTHOGONALIZATION AND LEAST SQUARES (0) Find an unsymmetric Tooplits matits T #0 Ts = b. (b) Pind « symmetric Toopits matrix T so Ts = 5. (©) Flod circulant matsie C a0 Cs =. Powe each problem in the form Ap =b where A ia & mutrix made up of entries from = sod ple the vector of sought-after parameters. Notes and Raferencas for Sec. 8.7 Interesting aspects concerning singular stam are discumed in ‘TP. Chan (1984). “Deisted Decomposition Solutions of Neatly Singular Systems,” SIAM J. Num. Anak 21, T38-T34. G.H. Golub and C.D. Meyer (1986). “Using the QR Factoraation and Group Inversion ‘to Compute, Diferentiain, and aatimata the Sensitivity of Stationary Probabilition for Markov Chaing” SIAM J. Alp. and Dis. Methods, 7, 73-281 Papers on undardetermined systems include RLE. Clie and RJ. Plemmoas (1076). *LzSolutions to Underdetarmined Linear Sye- ‘erat,” SIAM Heview 18, 92-106. 1M. Afioli and A, Larate (1685). "Error Analysia ofan Algorithm for Solving an Under- determined System,” Numer. Math. 46, 255-263, LW, Demme and NJ. Higham (1998), “Improved Error Bounds fr Underdeterained ‘Syrtom Solvers” SHAM J. Matrez Anak Appl 14, 1-16 ‘The QR factorization ean of courme be wed to solve linear systems. See 1N.J, Higham (1991), “erative Refoerent Eahances the Stabiity of QR Pactorization ‘Methods tor Solving Linear Equationn” BIT 31, 447-368. Chapter 6 Parallel Matrix Computations §6.1 Basic Concepts §6.2 Matrix Multiplication §6.3. Factorizations ‘The parallel matrix computation area has been the focus of intense research. Although much of the work is machine/system dependent, a ‘number of basic strategies have emerged. Our aim is to present these along. ‘with a picture of what it is like to “think parallel” during the design of a matrix computation, ‘The distributed and shared memory paradigms are considered. We use ‘matrix-vector multiplication to introduce the aotion of a node program in §61. Load balancing, speed-up, and synchronization are also discuased. In §6.2 matrix-matrix multiplication is used to show the effect of blocking. on granularity and to convey the spirit of two-dimensional data flow. Two parallel implementations of the Cholesky factorization are given in $6.3. Before You Begin Chapter 1, §4.1, and §4.2 are assumed. Within this chapter there are tthe following dependencies: 61 + $62 + g63 Complementary references include the books by Schénauer (1987), Hock- ney and Jesshope (1988), Modi (1988), Ortega (1988), Dongarra, Duff, 76 CHAPTER 6. PARALLEL MATRIX COMPUTATIONS Sorensen, and van der Vorst (1991), and Golub and Ortega (1993) and the excellent review papers by Heller (1978), Ortega and Voight (1985), Galli- ven, Plemmons, and Sameh (1990), and Demmel, Hesth, and van der Vorst (1983). 6.1 Basic Concepts In this section we introduce the distributed and shared memory paradigms using the gaxpy operation sutdz, AERO 2,426 R" (6.2.1) as on example. In proctice, there is a fuzzy line between these two styles of parallel computing and typically a blend of our comments apply to any particular machine. 6.1.1 Distributed Memory Systems Ta a distributed memory multiprocensor each processor has a local mem ory and executes its own node program. 
The program can alter values in the executing processor's local memory and can send data in the form of messages to the other processors in the network. ‘The interconnection of the processors defines the network topology and one simple example that is good enough for our introduction is the ring. See FIGURE 6.1.1. Other Proc(t) |_| Proe(2) LH Proc(3) |_| Proe(4) Ficurg 6.1.1 A Four-Processor Ring important interconnection schemes include the mesh and torus (for their clove correspondence with two-dimensional arrays), the hypercube (for its generality and optimality), and the tree (for its handling of divide and conquer procedures). See Ortegs and Voigt (1985) for a discussion of the possibilities. Our immediate goa! isto develop a ring algorithm for (6.1.1). ‘Matrix multiplication on 3 torus is discussed in §6.2. Each procemor has an identification number. The ith processor is deo- fgnated by Proc(), We say that Proc() is a neighbor of Proc(y) if there is a direct physical connection betwoea them. Thus, in a p-processor ring, Proc(p— 1) and Proc(1) are neighbors of Proc(p). 6.1. Basic Concerrs aw Important factors in the design of an effective distributed memory al- gorithm inclide (8) the number of processors and the capacity of the local memories, (b) how the processors are interconnected, (c) the speed of com- putation relative to the speed of interprocemor communication, and (4) whether or not 8 node is able to compute and communicste at the same time. 6.1.2 Communication ‘To describe the sending and receiving of messages we adopt a simple nota- tion: sena( (matria} , {id of the receiving processor} ) recv( {matriz} , {id of the sending processor} ) Scalars and vectors are matrices and therefore mensages. In our model, if Proc(u) executes the instruction semd(Vise,), then copy of the local matric Viec is sent to Proc() and the execution of Proe(1)’s node program resumes immediately. It is legal for a processor to send a message to itself. ‘To emphasiza that a matrix is stored in a local memory we use the subscript ea? If Proc(y) executes the instruction recv(Viee,2), then the execution of its node program is suspended until a message is received from Proc(A). ‘Once received, the message is placed in a local matrix Urse and Proc(1s) resumes execution of its node program. ‘Although the syntax and semantics of our send/receive notation is od- ‘equate for our purposes, it does suppress a number of important details: ‘+ Message assembly overhead. In practice, there may be » penslty ‘amociated with the transmission of a matrix whose entries are not contiguous in the sender’s local memory. We ignore this detail (© Message tagging. Messages need not arrive in the order they are sent, and a system of memage tegging is neceasary vo that the receiver is ‘ot “confused.” We ignore this detail by assuming that messages do arrive in the order that they are sent. '» Momage interpretation overhead. In practice a message is a bit string, ‘and a beeder must be provided that indicates to the receiver the dimensioos of the matrix and the format of the floating point words that are used to represent its entries. Going from mensage to stored matrix takes time, but it isan overhead that we donot try to quantify. ‘These simplifications enable us to focus on high-level algorithmic ideas. But ft should be remembered that the success of » particular implementation may hinge upon the control of these hidden overheads. 278 CHAPTER 6. 
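The send/recv notation can be modeled in a few lines of ordinary Python, which may help in reading the node programs that follow. This purely serial sketch is entirely our own construction (a real implementation would use a message-passing library such as MPI, and because the model is serial both the sender and receiver ids appear explicitly); each ordered processor pair gets a first-in, first-out channel, so messages arrive in the order sent:

    from collections import deque

    p = 4
    channel = {(src, dst): deque()
               for src in range(1, p + 1) for dst in range(1, p + 1)}

    def send(msg, src, dst):
        # Non-blocking: deposit a copy and let the sender continue at once.
        channel[(src, dst)].append(msg)

    def recv(src, dst):
        # Blocking in a real system; in this serial model the message
        # must already be queued.
        return channel[(src, dst)].popleft()

    def right(mu): return mu % p + 1         # ring neighbors: Proc(p) wraps
    def left(mu):  return (mu - 2) % p + 1   # around to Proc(1) and vice versa

    # Circulate a token one position around the ring.
    token = {mu: mu for mu in range(1, p + 1)}
    for mu in range(1, p + 1):
        send(token[mu], mu, right(mu))
    for mu in range(1, p + 1):
        token[mu] = recv(left(mu), mu)       # Proc(mu) now holds the token that
                                             # was owned by its left neighbor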
PARALLEL MATRIX COMPUTATIONS 6.1.3 Some Distributed Data Structures Before we can specify our fit distributed memory algorithm, we must consider the matter of data layout. How ate the participating matrices and vectors distributed around the network? Suppone 2 € IR” is to be distributed among the local memories of « p- processor network. Asume for the moment that n= rp. Two “canonical” ‘approaches to this problem are store-by-row and store-by-column. In store-by-column we regard the vector z as an r-by-p matrix, + 21+ @-Drn) ], Boxe = [2(14) 2(¢+1:2r) and store each column in a processor, Le, 2(1 + (~I)riur) € Proc(y). (Ia this context “€” means “is stored in.”)' Note thst each processor houses «contiguous portion of =. Tn the store-by-row scheme we regard x as a pby-r matrix Zyxr = [ (hsp) e(p+12p) -- 2((r—1)p+hn)], and store each row in » processor, ie., 2(p:p:n) € Proc). Store-by-row is sometimes referred to as the wrap method of distributing a vector because the components of = can be thought of as cards in a deck that are “dealt” to the procansors in wrap-around fashion. En is not an exact multiple of p, then theas ideas go through with minor modification. Consider store-by-columa with n = 14 and p = at 7 i, (er 222924 | 25262729 | Ze FoF | Zi2 73714) Pret) Pro) Prec) Pratt) In general, if n = pr + with 0 1. This assumes no idle waiting. If k= 1, then no communication is required and T(1) = 2n2/R. It follows that the efficiency “1+ Toa) Improves with Increasing n and degradates with increasing p or R. In Practice, benchmarking is the only dependable way to axsens eficiency. A concept related to efficiency is speed-up. We say that parallel algo- tithm for a particular problem achieves speed-up $ if $= Taal Trr “where Tyar is the time required for execution of the parallel program and Tug is the time required by one procemor when the best uniprocessor pro- ‘cedure is used. For some problems, the fastest sequential algorithm doos ‘not parallelize and to two distinct algorithms are involved in the speed-up snscammeat. We awation thet Wess simple meerures are not particularly iuminating i eyotems ‘where the nodes are able to overlap computation and communication. gE 282 CHAPTER 6. PARALLEL MaTRix COMPUTATIONS 6.1.7 The Challenge of Load Balancing If we apply Algorithm 6.1.1 to a motrix A € IR" that is lower triangular, then approximately half of the Bops associated with the Yioc updates are ‘unnecessary because half of the Ajj in (6.1.2) are zero. In particular, in the uth processor, Atse(31+ (7 —I)r:rr) is zero if r > pw. Thus, if we guard the thee update as follows, ifrsa Vee = Wee + Aes} + (4 = 1)P=7F) Soe end then the overal] number af flops is halved. This solves the superfluous flops problem but it creates a load imbelance problem. Proc(j:) oversees about yr?/2 flops, an increasing function of the procesor id ys. Consider the following r = p = 3 example: a a@ 0 of0 0 of0 0 o]fa u a aa 0/00 0j000]| x w 4 a@aalo 0 aloo of] x 3 ™ BB B[A 0 ajo 0 0| |x ue zw |=[8 8 6/8 8 O}o 0 O/ | ay] + | ws % BB 8\8 B B\O 0 0)) a Ye a 17 17 7 117 0 0| [ar ve a aralr ratty Of] we = aararralr yr rdla we Here, Proc(1) handles the a part, Proc(2) handles the 8 part, and Proc(3) bbandles the + part. However, if processors 1, 2, and 3 compute (21, 24,21), (za) 25,28), and (2s, 26,74), respectively, then approximate load balancing results: | ‘The amount of arithmetic still increases with 1, but the effect is not no- ticeable if n> p. ‘The development of the general algorithm requires some index manip- tlation. 
Assume that Proc(ys) is initialized with Aige = A(uspen,;) and leo lees E eek oo 2eoleca seneasire 1 Qeepuepeue weeloekoe aeelmoine Qeokeokwe aackaokoco arwolockoo Yeaeos vaealeenp se veules 6.1. Basic Concerts 283, Vioe = y(upen), and assume that the contiguous z-subvectors circulate as before. If at some stage Zio. contains z(1 + (r ~I)r:rr), then the update Yee = He + Aine] + (7 = Diett ioe implements w(vepin) = y(uepen) + Aluipin, + (7 — Ure )a(2 + (7 = Ir). ‘To exploit the triangular structure of A in the Yee computation, we express the gaxpy as a double loop: Moe(@) = Moelor) + Atel + (7 = 1)r)t0e(B) end end ‘The Atge reference refers to A(u-+(a—1)p,8+(r—1)r) which is zero unless the column index is less than or equal to the row index. Abbreviating the inner loop range with this in mind we obtain Algorithm 6.1.2 Suppose A ¢ R™", x € R" and y ¢ R” are given and that z= y+ Az. Assume that n = rp and thet A is lower triangular. If ‘each processor in a p-processor ring executes the following node program, then upon completion Proc() houses 2(uipin) in oe. Assume the following local memory initialization: p, 4 (the node id), left and right (the neighbor fe's), my Arse = AUP), Ye = YUE), and Ziee = 2+ (a ~ rar) r=n/p for t= 1p send(zice,right) recv(2izc, left) unt itr> p and the communication parameters ag and Jz factors are small enough. Another possible mitigating factor is that Algo- rithm 6.1.3 manipulstes length n vectors whereas Algorithm 6.1.1 works 6.1, Basic Concerts 2s vith length n/p vectors. If the nodes are capable of vector arithmetic, then the longer vectors may raise the level of performance. ‘This brief comparison of Algorithms 6.1.1 and 6.1.3 reminds us once ‘gain that different implementations of the same computation can have very diferent performance characteristics. 6.1.9 Shared Memory Systems We now discuss the gaxpy problem for a shared memory multiprocessor. In this environment each processor has access to # common, global memory as depicted in Figure 6.1.2. Communication between processors is achieved Proc(1) |] Proc(2)} | Proe(3)} | Proo(4) Global Memory Figure 6.1.2 A Four-Processor Shared Memory System by reading and writing to global variables that reaide in the global memory. Bach processor executes its own local program and has its own local memory. Data flows to and from the global memory during execution. All the concerns thet attend distributed memory computation are with us in modified form. The overell procedure should be load balanced and the computations should be arranged so that the individual processors have to wait as little as pomsible for something useful to compute. The traffic between the global and local memories must be managed carefully, because the extent of such data transfers is typically o significant overbead. (It corresponds to interprocessor communication in the distributed memory ‘setting and to data motion up and down @ memory hierarchy 8 discussed in $1.45.) ‘The asture of the physical connection between the procemors ‘and the shared memory is very important and can effect algorithinic devel- ‘opment. However, for simplicity we regard this aspect of the system as 8 black box as shown in Figure 6.1.2. 286 CHAPTER 6. PARALLEL MATROX COMPUTATIONS 6.1.10 A Shared Memory Gaxpy Consider the following partitioning of the n-by-n gaxpy problem z = y+Az: [Z)- Le Here we assume that n= rp and that A, €R™", y, ER’, and z, € R’. ‘We use the following algorithm to introduce the basic ideas and notations. 
(6.14) Algorithm 6.1.4 Suppose A¢ R™", rR", and y € R” reside in a global memory accessible to p processors. If n= rp and each processor executes the following algorithm, then upon completion, y is overwritten by z=y+ Az. Assume the following initializations in each local memory: , u (the node id), and n. r=n/p row = 1+ (u~ ter ioe = A(row, j) ‘Wise = Vise + BiaeHtoe(I) ylrow) = vtec ‘We assume that copy of this program resides in each processor. Float- ing point variables that are local to an individual processor have 8 “loc” subscript. Data is transferred to and from the global memory during the execution of Algorithm 6.1.4. ‘There are two global memory reads before the loop (Ztae = = and yize = y(row)), one resd each time through the loop (Giec = (row, j)), and one write after the loop (y(row) = Ye) Only one processor writes to a given global memory location in y, and so there is mo need to synchronize the participating processors. Each has ‘a completely independent part of the overall gaxpy operation and does not hhave to monitor the progress of the other processors. The computation is statically scheduled because the partitioning of work is determined before execution. If A is lower triangular, then steps have to be taken to preserve the load balancing in Algorithm 6.1.4. As we discovered in §6.1.7, the wrap mapping i a vehicle for doing this. Assigning Proc(j1) the computation of spn) = vse) + Alain.) etively partitions the n! Dope among the p processors. 6.1, Basic Concerrs 287 6.1.11 Memory Traffic Overhead Tt is important to recognize that overall performance depends strongly on the overheads associated with the reads and writes to the global memory. If such a data transfer involves m flsting point numbers, then we model the transfer time by r(m) = a, + Bim. (6.1.5) ‘The parameter ay represents a start-up overhead and fis the reciprocal transfer rate. We modelled interprocessor communication inthe distributed environment exactly the same way. (See (6.1.3).) ‘Accounting for all the shared memory reada and writes in Algorithm 6.1.4 we see that each procesor spends time wt Te (ntda+ Tb communicating with global memory. ‘We organized the computation so that: one column of A(row.:) is read at a time from shared memory. /f the local memory is large enough, then the loop in Algorithm 6.1.4 can be replaced with Atse = Alrow,:) Woe = Woe + Atectioe ‘This changes the communication overhead to - oo Tr 3a, + By ? ‘8 significant improvement if the start-up parameter a, is large. 6.1.12 Barrier Synchronization ‘Let us consider the shared memory version of Algorithm 6.1.4 in which ‘the gaxpy is column oriented. Assume n = rp and col = 1+ (— I)ryr. A reasonable ides is to use a global array W(1:n,1:p) to house the prod- ucts AG, col)z(eol) produced by each proceseor, and then have some chosen processor (say Proc(1)) add its columns: Aine = Alc); Hee = 2(C) wae = Aiatoei WC) = tae ifpai Ye = 9 for j= 1p ties = W(3) Viwe = Moc + Woe end Y= ee end 288 CHAPTER 6, PARALLEL Matatx ComPuTaTIONS However, this strategy is seriously fawed because there is no guarantee that W (Ln, 1:p) is fully initialized when Proc(1) begins the summstion process. ‘What we need is a synchronization construct thst can delay the Proc(1) summation until all the processors have computed and stored their contri- butions in the W array. 
For this purpose many shared memory systems support some version of the barrier construct which wo introduce in the following algorithm: Algorithm 6.1.5 Suppose Ac RO™", 2 ¢R", and y € R" reside in a global memory accemible to p processors. If n = rp ond each processor executes the following algorithm, then upon completion y is overwritten by y+ Az. Assume the following initializations in each local memory: p, 4 (the node id), and n. r= n/p, col = 1+ (u— Irie; Alec = Al:,00!); ziee = 2(cal) Wee = Atacee WO, i) = ioe, barrier ifu=1 Yoo = ¥ for j Woe = W(3) Vise = Ye + Woe Y= Moc “To understand the barrier, it i convenient to regard a proceasr as either blocked or free. A processor is blocked and suspends execution when it ‘executes the barrier. After the pth processor is blocked, all che processors return to the “frvo state” aod resume execution. ‘Think of the barrier as treacherous stream to be traversed by all p proceuors. For safety, they all congregate on the beak before attempting to cross. When the last member of the party arrives, they ford the stream in unison and resume their individual treks. In Algorithm 6.1.5, the processors are blocked after computing their portion of the matrix-vector product. We cannot predict the order in whlch ‘these blockings occur, but once the ast processor reaches the barrier, they are all lensed and Proc(1) ean carry out the vector summation. 6.1.13 Dynamic Scheduling Instead of having one processor in charge of the vector summation, it is ‘tempting to have each procemor add ite contribution directly to the global variable y. For Proc(s), this means executing the following: Bastc Concerrs 289 r= nfp; col = 1+ (a — Trae; Atos = Al: col); Ziae = 2(col) teat tee Bet tas Y= te However, a problem concerns the read-update-write triplet Moe = Vi Hoc = Hise + Whoci Y= Hoe Indeed, if more than one processor is executing this code fragment at the same time, then there may be a loss of information. Consider the following sequence: Proe(1) reads y Proc(2) reads y Proc(1) writes y Proc(2) writes y ‘The contribution of Proe(1) is lost because Proc(1) and Proc(2) obtain the same version of y. As a result, the effect of the Proc(1) write is eresed by the Proe(2) write ‘To prevent this kind of thing from hsppening most shared memory systems support the idea of a critical section. These are special, isolated portions of a node program that require a “key” to enter. Throughout the system, there is only one key and so the net effect is that only ove processor can be executing in a critical section at any given time. Algorithm 6.1.6 Suppose Ac RO", 2 ¢ RY, and y ER" reside in a global memory accemible to p processors. If n = pr and each processor executes the following algorithm, then upon completion, y is overwritten by y+ Az. Ansume the following initializations in esch local memory: p, 1 (the node id), and n, Fm n/p, col = 14+ (= Irn; Ate = AC 00!) ae = 2(co) Wee = Aisne begin critical section We = 9 ae = oe Woe Y= tee end critical section ‘This use of the critical section concept controls the update af y in a way ‘that ensures correctness. The algorithm is dynamically scheduled because the order in which the summations occur is determined as the computation ‘unfolds. Dynamic scheduling is very important in problems with irregular structure. 290 (CHAPTER 6. PARALLEL MATRIX COMPUTATIONS Problems PO.L.A Modify Algorithm 6.1.1 90 that it can handle arbicrary n. 
PO.1.2 Modify Algorithm 6.1.2 0 that i effchently handles the upper triangular case, PO.L.S (a) Modify Algrithos 6.13 and 6.1.4 00 that they overwrite y with s = y+. A™= {or a piven positive intogar m that ie avilable to each processor. (b) Modify Algorithms 6.1.3 and © 14 oo that y is overwriti by x= y + AT As. P.1.4 Modify Algorithm 6.130 that vpon completion, the leal array Ain in Proc) ‘houses the pth block column of A+ zy? 8.1.8 Modify Algorithm 6.1.4 0 that (a) A la overwritten by the ovtar product update ‘Asay? (0) 2 in overwritten with A2z,(c) yi overwritten by a unit 2norm vector in the direction of y+ Az, and (4) it ficiently handles the case when A is lower triangular. Notes and Raferences for Sec. 6.2 General references on parallel computations that include several chapters on matrix ‘computations inclade GLC. Por, M. A. Johasoa, G. A. Lyzanga, S. W. Otto, J. K Salmon and D. W. Wallar (1988)" Soining Problems on Concurrent Processors, Volume 1. Prentice Hall, Eo- ‘Blewood Clif, NJ. 1D, P. Bertsekan aod J. N. Teil (1089). Parallel end Distributed Computation: Mamerieal Methods Prentice Hall, Baghewood Clif, N3. ‘8, Lakahmivarahea aod $. K, Dball (1980). Analy and Derign of Parallel Algorithms: “Arithmetic end Matria Problems, McGraw Hill, New York. ‘TL. Freeman end C, Philips (1002). Parallel Numerical Algoritimas Preatice Hall, New York oT, Leighton (1902). Introduction to Parallel Algorithms and Architectures, Morgan. ‘Kaufmann, San Mateo, CA. G. G. Fox, RD. Wiliams, and P. C. Messina (1994). Paral Gomputing Works, ‘Morgan Keufroann, San Francico. V. Kumar, A. Grama, A. Gupta aad G. Karypis (1994). Intreduction to Perailel Com- using: Design ond Analyels of Algoritima, Beajemin/Cummings, Reading, MA. EP. Van de Vela (1954). Concurrent Scientific Computing Speingr-Varlag, New York. M. Comard and D. Trystram (1995). Parallel Algorithm and Archileceret, Interna ‘onal Thomaon Computer Prem, New York. Here ace soma general references that are mors specific to parallel matrix computations , Fadenva and D. Fedeny (1977). “Parallel Computations in Linear Algebra.” Kier. neticn 6, 2-40. . Healer (1978). *A Survey of Paralieh Algorithme in Numerical Liner Algnbre,” SIAM Review 20, 140-777. JIM, Orsegs and RG. Voigs (1985) “Solution of Partial Differential Equations on Vector ‘aod Parallel Computers,” SIAM Review #7, 149-240. 1.4. Dongarra and D.C, Sorensen (1986). “Ligoar Algebra on High Performance Coor ‘putern” Appl. Math. and Comp. 20, ST-85. KA. Gallivan, RJ. Plenoons, and A.H, Sazoeh (1990). “Parallel Algorithins for Dense ‘Linear Algebra Computations,” SIAM Review $2, 5-135. J.W. Demme, M.T. Heath, and HLA. van der Vorss (1903) “Parallel Numerical Linear ‘Aagebra,” in Acta Numerin 1999, Cambridge University Pree. See also 6.1. Basic Concepts 291 BLN, Datta (1069). “Parallel ad Large-Scale Matrix Computations in Cootol: Some Tens.” Lin. Aig, ond Its Applic. 121, 248-204 ‘A. Edelman (1993). “Large Dense Numerical Linear Algebra in 1995: The Parallel Computing Iaflvence,” Int J. Supercompuser Appl. 7, 113-128. Manegiog and modeling communication in « distributed memory exvirooment ia an im: portant, dificult problem. See L. Adamn and T. Crockett (1984). “Modeling Algorithm Execution Time on Processor Arrayy,” Compuler 17, 38-43 1. Gannon and. Van Rosendale (1984). “Oa the Impact of Communication Complesity on tbe Design of Parallel Nurorical Algorithms,” [BEE Trans. Comp. C29, 1180" “Hypercube Multiprocessors” J. Parallel and Distributed Computeng, No. 4, 138-172 YY. Saad and M. 
Sebults (1980). “Data Communication in Hypercubes,” J. Disk Parallel Comp. 6, 135-138. Y. Sead'and MoH. Schults (1989). “Data Commurication in Parallel Architectures,” J Dist. Parallel Comp, 11, 131-150, For soapubots of basic linear algebra computation on « distributed memory system, soe ©. McBryan and EP. van de Velde (1987). “Hypercube Algorithms acd Implemenca- ons.” SIAM J. Set and Stat. Comp. 8, 221-087, 8.1, Johamoa and C.-T: Ho (1988). "Matrix Transportion on Boolean w-cube Configured ‘Ensemble Architectures,” SIAM J. Mamra Anal Appl. 9, 415-494, ‘T, Dehn, M. Biermann, K. Giebermann, and V. Speriing (1998). “Structured Sparse Matrix Vecwor Multiplication on Massively Pazalel SIMD Archieectures,” Paral Computing 21, 1867-1804 4J- Choi J.J. Dongarra, and D.W. Walker (1985), “Parallel Matrix Transpase Algorithms ‘on Distributed Memory Concurrent Computer,” Parallel Computzn 21, 1387-1408. 1, Colombet, Ph. Michalon, and D. Trystram (1996). “Parlial Matrx-Vector Prodoct ‘on Rings with « Minimum of Communication,” Porllel Compuisny 22, 289-310, ‘The implementation of « parallel algorthes is uually very challenging. Iti important to bave compilers and related tools that are able to bandle tbe devas. See DP. O'Leary and G.W. Stewart (1986). “Aasgament and Schedaling in Parallel Mastix Factorization,” Lin Alp ond Ila Applic. 77, 275-800. 4. Dongarra aad D.C. Sorenseo (1987). "A Portable Baviroament for Developiog Parallel rograsng,” Peraiel Computing 5, 75-138. K, Coupolly, J.J. Dongarra, D. Sorensen, acd J. Pasterroa (1968). “Programming, Methodology and Performasce lanves fot Advanced Computer Archivecturen” Par allel Composing &, 41-58. FP. Jecobwoa, B. Kagstrom, and M. Rannar (1992). “Algorithm Development for Dis- tribuved Memory Multicompurers Using Conlab,” Scientific Progommeny, 1.185 208. ©. Ancomr, F. Coeibo, F. Iigoin, aod R. Kero (1988). “A Linear Algebra Framework for Static HPF Coda Distribution," Procndings of the 40s Workshop on Comers for Parallel Computers, Delf, The Netietianda. Bou, 1. Kodukula, V. Kotlyar, K. Plagali aod P. Stodghill (1999). “Solving Aigement sig Blementary Linear Algebra,” In Proceedings of the 7th International Workshop com Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science 892, Springer-Verlag, New York, 48-60. 3M Well (180). sgh Performance Compt for Poa Computer, Ao Wey, ion MA. 292 CHAPTER 6. PARALLEL MaTnix COMPUTATIONS 6.2 Matrix Multiplication In this section we develop two parallel algorithms for matrix-matrix multi- plication. A shared memory implementation is used to illustrate the effect of blocking on granularity and load balancing. A torus implementation is designed to convey the spirit of two-dimensional data flow. 6.2.1 A Block Gaxpy Procedure Suppose A, B,C € R™*" with B upper triangular and consider the compu tation of the matrix multiply update D=C+AB (6.21) on a shared memory computer with p processors. Assume that n = rkp ‘and partition the update [Dayeees Dip] = [CinevesChe] + [AtsesosAep ][Bts--- Bey] (6.22) where each block column has width + = n/(kp). If By B= | FF], Byer, ° thea : Dy = 0, 4AB, = 0, + DArBry. (623) ‘The number of flops required to compute Dia given by ant Be)? ‘This is on increasing function of j because B is upper triangular. As we discovered in the previous section, the wrap mapping is the way to solve load imbalance problems that result from ‘matrix structure. This suggests that we assign Proc(s) the task of computing Dj for j = u:pckp. 
2ns?j & Algorithm 6.2.1 Suppose A, B, and C are n-by-n matrices that reside in a global memory accessible to p processors. Assume that B is upper triangular and 1 = rkp. If each processor executes the following algorittua, 6.2, Marnix MULTIPLICATION 293 then upon completion C is overwritten by D = C + AB. Assume the following initializations in each local memory: n, r,k, p and (the node id). for j= wpkp {Compute D;.} Bioe = B(1:f7,1 + (9 — rsdn) Gee = O(,1+ G = Yyrsr) forr=13 col = 1+4(r—1)rsrr Aloe = Al!) Cina = Chae + AeeBie( oa) end O(,1+ (GF - Irir) = Clee end {et us examine the degree of load balancing asa function ofthe parameter for rea, the ner of oe eed ny ant Fu) ae sw = (XP) Be. ‘The quotient F(p)/F(1) is a measure of load balancing from the lop point of view. Since Fo) _ +R 9/2 _, , 22-1) FU) ~ E+ Rp? 2+ Rp we soe that arithmetic balance improves with increasing k. A similar anal- ysis shows that the communication overheads are well balanced as h in- ‘creases. (On the other hand, the total number of global memory reads and writes associated with Algorithm 6.2.1 increases with the square of k. Ifthe start- up parameter a, in (6.1.5) is large, then performance can dograde with Increased k. ‘The optimum choice for k given these two opposing forces is system dependent. IF communication is fast, then smaller tasks can be supported without penalty and this makes it easier to achieve load balancing. A mul tiprocessor with this attribute eupports fine-grained parallelism. However, if granularity is too fine in a system with high-performance nodes, then it may be impomible for the node programs to perform st level2 or leval-3 speeds simply because there just is not enough local Linear algebra. Again, benchmarking is the only way to clarify these issues. 6.2.2 Torus ‘A torus is a two-dimensional processor array in which each row and col- ‘umn is 6 ring. Soe FIGURE 6.2.1. A Procemor id in this context is an 294 (Cmaprer 6, PARALLEL MATRIX COMPUTATIONS ‘ordered pair and each processor haa four neighbors. In the displayed exam- 3.2) (4.} I Pree(4.3) Figure 6.2.1 A Fourby-Four Torus ple, Proc(1,3) has west neighbor Proc(1,2), east neighbor Proc(1,4), south neighbor Proc(2,3), and north neighbor Proc(4,3). ‘To show what itis like to organize a toroidal matrix computation, we develop an algorithm for the matrix multipliestion D = C + AB where A,B,C RO". Assume that the torus is p.-by-p and that n = rp. Regard A = (Ay), B = (By), and C = (Gy) 8 pi-by-m block matrices swith r-by-r blocks. Assume that Proc({, ) contains Ayj, By, and Ciy and ‘that its mission isto overwrite Cy with Dig = Cry + D> Au Bry. i We develop the general algorithm from the py = 3 case, displaying the torus Im cellular form as follows: 6.2. Maraix MULTIPLICATION 295, Proc(1,1) | Proc(t,2) | Proc(1,3) Proc(2,1) | Proe(2,2) | Proc(2,3) Proc(3,t) | Proe(3,2) | Proc(3.3) ‘Let us focus attention on Proc(1,1) and the calculation of Du =Cut AuBu + AuBa + AnBa. Suppose the six inputs that define this block dot product are positioned ‘within the torus as follows: Au Bulan + | Aw Bn Bu (Pay no attention to the “dots.” They are later replaced by various Ay and By). ‘Our plan is to “ratchet® the first block row of A and the fir. block column of B through Proe(1,1) in coordinated fashion. ‘The paire Ay: and By,, Aya and Bn, and Ays and Bs, moet, are multiplied, and added {nto @ running sum array Cie Az Bui Ax > [An - Bul so - | Choe = Ciae + AraBai Bu Ais Bu | Au Ay ml. 
|: Pal + Choe = Ciee + ArsBai Au Bu} dn - | As + Ba} es Cree = Ciee + An Bri Bul. = 296 CHAPTER 6. PARALLEL MATRIX COMPUTATIONS ‘Thus, after three steps, the local array Ciec in Proe(1,1) houses Di. We have organized the flow of data so that the Ay; migrate westwards and the Bis migrate northwards through the torus. It is thus appareat that Proc(1,1) must execute a node program of the form: Fecv(Bize, South) Clee = Cree + Atoe Bioe end ‘The send-recv-send-recy sequence for send (Aiges vest) Feev( Ac, 20st) send (Bier, north) recv(Bioe, south) Cice = Ctoe + Aloe Btoe end tlso works. However, this induces unnecessary delays into the process be- cause the H submatrix is not sent until the new A submatrix arrives. We next consider the activity in Proc(1,2), Proc(1,3), Proc(2,1), and Proc(3,1). At this point in the development, these processors merely help circulate blocks 13, Aia, and Ais and By1, Bay, and Bas, respectively. If Bez, Buz, and Brg flowed through Proc(1,2) during these stepe, then Diy = Cia + AisBnt AnBi + Aub could be formed. Likewise, Proc(1,3) could compute Dis = Cis + AuBis + AB + AisBas if Bys, Bas, and Bas ore ovailable during ¢ = the torus as follows . ‘To this ond we initialize Au Bul dn Ba| An Bas Bul + Bal + Ba Bu| > Bul: Ba With northward flow of the By we get 6.2. MATRx MULTIPLICATION 207 Ay Bn | A An Bis 2 1s Ba | Au Bu| > Bal: Bo Bulj > Bul - Bs Ay Ba | Au Bala Ba Ba Ba 5x + Bal > Bal» Bs Au Bu | Az Ba} ar Bas Bu| > Bul» Bas t=3 Bay Bul Bs ‘Thus, if B is mapped onto the torus in a “staggered start” fashion, we can arrange for the first row of processors to compute the first row of C. If we stagger the second and third rows of A in e similar fashion, then we can arrange for all nine processors to perform a multiply-add at exch step. In particular, if we set 4u Bul 4a Bal As Ba An Bu| As Bal An Bis 43 Bu | An Bu| Aa Bo then with westward flow of the Ai; snd northward flow of the Ba, we obtain 298 CHAPTER 6. PARALLEL MaTux COMPUTATIONS: Ais Ba | Au Ba | An Ba An Bul An Bn | An Bs t=2 Asa Bu | Ass Ba} An Bis Aun Bu | Av Aa Bs Bn An Bu | An Bs Ax Ba} Ax Bu| és Ba An Bus From this example we are ready to specify the general algorithm. We assume that at the start, Proc(i, 3) houses Ay, Bj, and C,j. To obtain the necessary staggering of the A data, we note thet in processor row i the Ay should be circulated wertward 4— 1 positions. Likewise, in the jth column of processors, the Bi; should be circulated northward j- 1 positions. This sives the following algorithm: Algorithm 6.2.2 Suppose A €R™", Be R™", and C € RO" are given and that D = C+ AB. If each procassor in 8 pi-by-py torus executes the following algorithm and n = pir, then upon completion Proc(ss,) houses Dya in local variable Ciec- Assume the following local memory initializations: pi, (1, ) (the node id), north, east, south, and west, (the four neighbor id), row = 1+ (— Irsur, cot = 1+ (A—I)rsAry Aloe = A(row, col), Biee = B(row,col), and Ciee = C(row, col). {Stagger the Ay, and Bu. } for k= 1y-1 send(Aice, west); FOCV(Aise, ast) end for k=1:A-1 ‘send(Bizc, north); recv(Bize, south) ond for k= 1:71 Choc = Cree + AtceBioe send(Aige, west) send (Bice, north) recv(Atec, east) Fecv(Bise, south) 6.2, MaTRIx MULTIPLICATION 299 {Unstagger the Ayy and Bu.} for k= Ip—1 send ( Ajec, cast); reev( Ajo, west) end for k=1:A-1 send(Bine s0uth); reCv( Bic, north) end It ig not hard to show that the computstion-to-communication ratio for this algorithm goes to zero as n/p, increases. 
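The staggered data flow of Algorithm 6.2.2 is easy to check numerically with a serial simulation. In the NumPy sketch below (our own construction; there is no actual message passing, indices are zero-based rather than one-based, and the helper names are illustrative), the multiply-add performed by Proc(i,j) at step t uses the blocks A(i,k) and B(k,j) with k = (i + j + t) mod p1, which is exactly the pairing that the staggering and the westward/northward shifts produce:

    import numpy as np

    def torus_matmul_update(C, A, B, p1):
        # Simulate D = C + A*B on a p1-by-p1 torus of r-by-r blocks.
        n = A.shape[0]
        r = n // p1
        blk = lambda M, i, j: M[i*r:(i+1)*r, j*r:(j+1)*r]
        D = C.copy()
        for t in range(p1):
            for i in range(p1):
                for j in range(p1):
                    k = (i + j + t) % p1      # block pair held by Proc(i,j) at step t
                    blk(D, i, j)[:] += blk(A, i, k) @ blk(B, k, j)
        return D

    rng = np.random.default_rng(1)
    A = rng.standard_normal((6, 6)); B = rng.standard_normal((6, 6))
    C = rng.standard_normal((6, 6))
    print(np.allclose(torus_matmul_update(C, A, B, p1=3), C + A @ B))   # True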
Problems 6.2.1 Develop a ring implementation for Algorithm 62.1. 0.2.2. An upper teiangular matrix can be overwritten with Ste aquare without any Additional vorkapace. Write « dynamically scheduled, shared-memory procedure for doing thie Notes and References for Sec. 6.2 Matrix computations on 2-dimensional arrays are discussed in HLT, Kung (1962). “Why Symolic Architecture,” Computer 15, 37-46 DP, O'Leary and G.W. Stewart (1985). “Date Flow Algorithm for Parallel Matrix “Computations,” Comm. ACH 2, 84-858. B. Hendricon and D. Womble (1994). “The Torus Wrap Mapping for Dense Matrix (Calculations on Massively Parallel Gomputers.” SIAM J. Se Comput. 15, 1201- 1s. ‘Various ampocta of parallel matrix mahiplication are discussed in LE, Cannon (1969). A Cellular Computer to Implement the Ketman Filter Algorthry, PhD. Thesis, Montana Stace Univerity K.H, Chang and S. Sabai (1087). “VLSI Systems for Band Matric Muiplication,” Peralel Computing 4, 230-258. G. Fox, S.W. Orto, and A.J, Hey (1987), “Matcix Algorthina on a Hypercube : Matrix Maitipication," Paraliel Computing 4, 17-31. 4. Bermzen (1080). “Communication Eficient Matrix Multiplication on Hyparcuben,” Parallel Computing 18, $35-342. HL. Jngadiah and Kalla (1080). *A Pamily of New Biicent Arrays for Matrix Multipieation.” [ABE Trans. Comput. 38, 149-188. P. Biorrtad, F. Manne, TSarevit, and M. Vajteric (2992). “Emicant Matrix Multiple ‘ation on SIMD Computers” SIAM J. Matriz Anal. Appl. 13, 386-40 K. Mathur aod 5.1, Johnman (1994). “Multiplication of Matrioes of Arbitrary Shape on ‘» Data Parallel Computer” Parailel Computing 20, 919-662. R, Mathias (1906). “Tho Instability of Parallel Prefix Matrix Multiplication,” STAM J. Sek. Comp. 16, 956-973. 300 ‘CHAPTER 6, PARALLEL MATROX COMPUTATIONS 6.3 Factorizations In this section we present 8 pair of parallel Cholesky factorizations. To illustrate what a distributed memory factorization looks lle, we implement the gexpy Cholesky algorithm on 8 ring. A shared memory implementation of outer product Cholesky is also detailed. 6.3.1 A Ring Cholesky Let us see how the Cholesky factorization procedure can be distributed on ‘a ting of p processors. The starting point ia the equation et Gu, WG (um, ) = Alun, ) — SGU, A)Gw:n, §) = v(uen) ‘This equation is obtained by equating the jth column in the n-by-n equa- tion A= GG. Once the vector v(y:n) is found then G(y:n, u) is a simple sealing: Glusn, pw) = v(wen)/Vo{u). For clarity, we first assume that n = p and that Proc(y) initially houses A(im, 2). ‘Upon completion, each processor overwrites its A-column with the corresponding G-column. For Proc() this process involves j.~1 saxpy updates of the form Alun, w) — Alen, w) — Gu, Gun, 3) followed by a square root and a scaling, The general structure of Proc()’s node program is therefore as follows: for j= 1-1 Receive # G-column from the let neighbor. If necessary, send a copy of the received G-columa to the right neighbor. Update A(uin,u) « end Generate Glyn, ) and, if necessary, send it to the Tight neighbor. ‘Thus Proc(1) immediately computes G(L:n, 1) = A(1:n,1)//A(G 1) and sends it to Proc(2). As soon as Proc(2) receives this column it can generate G(2:n, 2) and pass it to Proc(3) ete... With this pipelining arrangement we can amert that once @ processor computes its G-column, it can quit. It ‘luo follows that cach processor receives G-columnna in ascending order, .., Glin, 1), Gln, 2), ete. Based on these observations we have 6.9. FACTORIZATIONS 301 jai while

1, then all of the processors are busy most of the time. Let us examine the details of a wrap-distributed Cholesky procedure. Each procemor maintains « pair of counters. The counter j is the in dex of the next G-column to be received by Proc(j). A processor also needs to know the index of the next G-column that it is to produce. Note ‘that if col = yspin, then Proc(js) is responsible for G(:,col) and that L = length(col) is the number of the G-columns that it must compute. We use q to indicate the status of G-columa production. At any instant, 302 CHAPTER 6. PARALLEL MaTaix COMPUTATIONS, coll) ia the index of the next G-column to be produced. Algorithm 6.3.1 Suppose A R™" is symmetric and positive defi- nite and that A = GG" is its Cholesky factorization. If each node in a p-processor ring executes the following program, then upon completion Proc() houses G(kin, k) for k = yipin in a local array Atze(Iin, L) where ZL = length(ool) and col = yspin. In particular, G(eoi(q)-n, col(g)) is boused im Atse(ool(q):n, q) for q = 1:L. Assume the following local memory imitislizations: p, y (the node id), left and right (the neighbor id’s), n, and Aiae = ACHR) f=; q=1; ool = p:pn; L = length(eo!) while gL if j = eol(a) { Form G(i:n, 7) } Alae(5:s9) = Atoe (4%, 9)/-V Ricca Wien 4g md Aisa) right) f=j+l {Update local columns. } for k=q+h:L r= col(k) Aige(t4R) = Ato) ~ Atel )Atae(F2 9) a=qtl Tecv(diec(9:n) left) Compute a, the id of the processor that generated the received G-column. Compute §, the index of Proc(right)’s final column. ifright fan j p. It is not hard to show that Proc(u) performs t FU) = Ae e+ (Epis e09) ~ E & ops. Each processor receives and sends just about every G-column. Us- ing our communication overhead model (6.1.3), we see that the time each processor spends communicating is given by my = ag + Ba(n - j)) % Doan + fan? . 5 If we aasume thet computation proceeds at R flops per second, then the computation/communication ratio for Algorithm 63.1 is approximately siven by (n/p)(1/3RA,). Thus, communication overieads diminish in im- ortance as n/p grows. 6.3.2 A Shared Memory Cholesky Next we consider a shared memory Implementation of the outer product ‘Cholesky algorithin: for k= in Akin, k) = ACen, k)/ AUER) for j=k+ in AG=n, 3) = AG=n, 3) — AGin, BAG, 4) end end 304 CHAPTER 6. PARALLEL MATRIX COMPUTATIONS ‘The j-loop oversees an outer product update. The n ~ k saxpy operations that male up its body are independent and easly parallelized. The scaling ‘A(kcn, k) can be eactied out by a single processor with no threst to loed balancing. Algorithm 6.3.2 Suppose A ¢ IR™" is o symmetric positive definite matrix stored in s shared memory accessible to p procemors. If each pro- cessor executes the following algorithm, then upon completion the lower triangular part of A is overwritten with its Cholesky factor. Assume the following initializationa in each local memory: n, p and ys (the node id). for in ify=1 agin) = (ken) tal) = tren) / Veal) Alf) = taken) ond barrier ‘Uee(k + Len) = A(k + 1:n, b) for j= (k+u):pn RecK5in) = AGm3) Woe(3:m) = Wieel§) ~ Yoel j)ee(3:n) AG:2,3) = ween) end barrier end ‘The scaling before the j-loop represents very little work compared to the outer product update and so it is reasonable to assign that portion of the computation to a single processor. Notice that two barrier statements are required. TThe first ensures that a processor does not begin working on the ‘kth outer product update until the éth column of G is made available by Proc(1). 
Problems

P6.3.1 It is possible to formulate a block version of Algorithm 6.3.1. Suppose n = rN. For k = 1:N we (a) have Proc(1) generate G(:, 1+(k−1)r:kr) and (b) have all processors participate in the rank-r update of the trailing submatrix A(kr+1:n, kr+1:n). See §4.2.6. The coarser granularity may improve performance if the individual processors like level-3 operations.

P6.3.2 Develop a shared memory QR factorization patterned after Algorithm 6.3.2. Proc(1) should generate the Householder vectors and all processors should share in the ensuing Householder update.

Notes and References for Sec. 6.3

General comments on distributed memory factorization procedures may be found in

G.A. Geist and M.T. Heath (1986). "Matrix Factorization on a Hypercube," in M.T. Heath (ed) (1986), Proceedings of First SIAM Conference on Hypercube Multiprocessors, SIAM Publications, Philadelphia, PA.
I.C.F. Ipsen, Y. Saad, and M. Schultz (1986). "Dense Linear Systems on a Ring of Processors," Lin. Alg. and Its Applic. 77, 205-239.
D.P. O'Leary and G.W. Stewart (1986). "Assignment and Scheduling in Parallel Matrix Factorization," Lin. Alg. and Its Applic. 77, 275-300.
R.S. Schreiber (1988). "Block Algorithms for Parallel Machines," in Numerical Algorithms for Modern Parallel Computer Architectures, M.H. Schultz (ed), IMA Volumes in Mathematics and Its Applications, Number 13, Springer-Verlag, Berlin, 197-207.
S.L. Johnsson and W. Lichtenstein (1993). "Block Cyclic Dense Linear Algebra," SIAM J. Sci. Comp. 14, 1257-1286.

Papers specifically concerned with LU, Cholesky, and QR include

R.N. Kapur and J.C. Browne (1984). "Techniques for Solving Block Tridiagonal Systems on Reconfigurable Array Computers," SIAM J. Sci. and Stat. Comp. 5, 701-719.
G.J. Davis (1986). "Column LU Pivoting on a Hypercube Multiprocessor," SIAM J. Alg. and Disc. Methods 7, 538-550.
J.M. Delosme and I.C.F. Ipsen (1986). "Parallel Solution of Symmetric Positive Definite Systems with Hyperbolic Rotations," Lin. Alg. and Its Applic. 77, 75-112.
A. Pothen, S. Jha, and U. Vemulapati (1987). "Orthogonal Factorization on a Distributed Memory Multiprocessor," in Hypercube Multiprocessors, ed. M.T. Heath, SIAM Press, 1987.
C.H. Bischof (1988). "QR Factorization Algorithms for Coarse Grain Distributed Systems," PhD Thesis, Dept. of Computer Science, Cornell University, Ithaca, NY.
G.A. Geist and C.H. Romine (1988). "LU Factorization Algorithms on Distributed Memory Multiprocessor Architectures," SIAM J. Sci. and Stat. Comp. 9, 639-649.
J.M. Ortega and C.H. Romine (1988). "The ijk Forms of Factorization Methods II: Parallel Systems," Parallel Computing 7, 149-162.
M. Marrakchi and Y. Robert (1989). "Optimal Algorithms for Gaussian Elimination on an MIMD Computer," Parallel Computing 12, 183-194.

Parallel triangular system solving is discussed in

R. Montoye and D. Lawrie (1982). "A Practical Algorithm for the Solution of Triangular Systems on a Parallel Processing System," IEEE Trans. Comp. C-31, 1076-1082.
D.J. Evans and R. Dunbar (1983). "The Parallel Solution of Triangular Systems of Equations," IEEE Trans. Comp. C-32, 201-204.
C.H. Romine and J.M. Ortega (1988). "Parallel Solution of Triangular Systems of Equations," Parallel Computing 6, 109-114.
M.T. Heath and C.H. Romine (1988). "Parallel Solution of Triangular Systems on Distributed Memory Multiprocessors," SIAM J. Sci. and Stat. Comp. 9, 558-588.
G. Li and T. Coleman (1988). "A Parallel Triangular Solver for a Distributed-Memory Multiprocessor," SIAM J. Sci.
and Stat. Comp. 9, 485-502.
S.C. Eisenstat, M.T. Heath, C.S. Henkel, and C.H. Romine (1988). "Modified Cyclic Algorithms for Solving Triangular Systems on Distributed Memory Multiprocessors," SIAM J. Sci. and Stat. Comp. 9, 589-600.
N.J. Higham (1995). "Stability of Parallel Triangular System Solvers," SIAM J. Sci. Comp. 16, 400-413.

Papers on the parallel computation of the LU and Cholesky factorizations include

R.P. Brent and F.T. Luk (1982). "Computing the Cholesky Factorization Using a Systolic Architecture," Proc. 6th Australian Computer Science Conf., 295-302.
D.P. O'Leary and G.W. Stewart (1985). "Data Flow Algorithms for Parallel Matrix Computations," Comm. of the ACM 28, 840-853.
J.M. Delosme and I.C.F. Ipsen (1986). "Parallel Solution of Symmetric Positive Definite Systems with Hyperbolic Rotations," Lin. Alg. and Its Applic. 77, 75-112.
R.E. Funderlic and A. Geist (1986). "Torus Data Flow for Parallel Computation of Missized Matrix Problems," Lin. Alg. and Its Applic. 77, 149-163.
M. Cosnard, M. Marrakchi, and Y. Robert (1988). "Parallel Gaussian Elimination on an MIMD Computer," Parallel Computing 6, 275-296.

Parallel methods for banded and sparse systems include

S.L. Johnsson (1985). "Solving Narrow Banded Systems on Ensemble Architectures," ACM Trans. Math. Soft. 11, 271-288.
S.L. Johnsson (1984). "Band Matrix System Solvers on Ensemble Architectures," in Supercomputers: Algorithms, Architectures, and Scientific Computation, eds. F.A. Matsen and T. Tajima, University of Texas Press, Austin, TX, 196-216.
S.L. Johnsson (1987). "Solving Tridiagonal Systems on Ensemble Architectures," SIAM J. Sci. and Stat. Comp. 8, 354-392.
U. Meier (1985). "A Parallel Partition Method for Solving Banded Systems of Linear Equations," Parallel Computing 2, 33-43.
H. van der Vorst (1987). "Large Tridiagonal and Block Tridiagonal Linear Systems on Vector and Parallel Computers," Parallel Computing 5, 45-54.
R. Bevilacqua, B. Codenotti, and F. Romani (1988). "Parallel Solution of Block Tridiagonal Linear Systems," Lin. Alg. and Its Applic. 104, 39-57.
E. Gallopoulos and Y. Saad (1989). "A Parallel Block Cyclic Reduction Algorithm for the Fast Solution of Elliptic Equations," Parallel Computing 10, 143-160.
J.M. Conroy (1989). "A Note on the Parallel Cholesky Factorization of Wide Banded Matrices," Parallel Computing 10, 239-246.
M. Hegland (1991). "On the Parallel Solution of Tridiagonal Systems by Wrap-Around Partitioning and Incomplete LU Factorization," Numer. Math. 59, 453-472.
M.T. Heath, E. Ng, and B.W. Peyton (1991). "Parallel Algorithms for Sparse Linear Systems," SIAM Review 33, 420-460.
V. Mehrmann (1993). "Divide and Conquer Methods for Block Tridiagonal Systems," Parallel Computing 19, 257-279.
P. Raghavan (1995). "Distributed Sparse Gaussian Elimination and Orthogonal Factorization," SIAM J. Sci. Comp. 16, 1462-1477.

Parallel QR factorization procedures are of interest in real-time signal processing. Details may be found in

W.M. Gentleman and H.T. Kung (1981). "Matrix Triangularization by Systolic Arrays," SPIE Proceedings, Vol. 298, 19-26.
D.E. Heller and I.C.F. Ipsen (1983). "Systolic Networks for Orthogonal Decompositions," SIAM J. Sci. and Stat. Comp. 4, 261-269.
M. Cosnard, J.M. Muller, and Y. Robert (1986). "Parallel QR Decomposition of a Rectangular Matrix," Numer. Math. 48, 239-250.
L. Eldén and R. Schreiber (1986). "An Application of Systolic Arrays to Linear Discrete Ill-Posed Problems," SIAM J. Sci. and Stat. Comp. 7, 892-903.
F.T. Luk (1986). "A Rotation Method for Computing the QR Factorization," SIAM J.
Sci. and Stat. Comp. 7, 452-459.
J.J. Modi and M.R.B. Clarke (1984). "An Alternative Givens Ordering," Numer. Math. 43, 83-90.
P. Amodio and L. Brugnano (1995). "The Parallel QR Factorization Algorithm for Tridiagonal Linear Systems," Parallel Computing 21, 1097-1110.

Parallel factorization methods for shared memory machines are discussed in

S. Chen, D. Kuck, and A. Sameh (1978). "Practical Parallel Band Triangular System Solvers," ACM Trans. Math. Soft. 4, 270-277.
A. Sameh and D. Kuck (1978). "On Stable Parallel Linear System Solvers," J. Assoc. Comp. Mach. 25, 81-91.
P. Swarztrauber (1979). "A Parallel Algorithm for Solving General Tridiagonal Equations," Math. Comp. 33, 185-199.
S. Chen, J. Dongarra, and C. Hsiung (1984). "Multiprocessing Linear Algebra Algorithms on the Cray X-MP-2: Experiences with Small Granularity," J. Parallel and Distributed Computing 1, 22-31.
J.J. Dongarra and A.H. Sameh (1984). "On Some Parallel Banded System Solvers," Parallel Computing 1, 223-235.
J.J. Dongarra and R.E. Hiromoto (1984). "A Collection of Parallel Linear Equation Routines for the Denelcor HEP," Parallel Computing 1, 133-142.
J. Dongarra and T. Hewitt (1986). "Implementing Dense Linear Algebra Algorithms Using Multitasking on the Cray X-MP-4 (or Approaching the Gigaflop)," SIAM J. Sci. and Stat. Comp. 7, 347-350.
J.J. Dongarra, A. Sameh, and D. Sorensen (1986). "Implementation of Some Concurrent Algorithms for Matrix Factorization," Parallel Computing 3, 25-34.
A. George, M.T. Heath, and J. Liu (1986). "Parallel Cholesky Factorization on a Shared Memory Multiprocessor," Lin. Alg. and Its Applic. 77, 165-187.
J.J. Dongarra and D.C. Sorensen (1987). "Linear Algebra on High Performance Computers," Appl. Math. and Comp. 20, 57-88.
K. Dackland, E. Elmroth, and B. Kågström (1992). "Parallel Block Factorizations on the Shared Memory Multiprocessor IBM 3090 VF/600J," International J. Supercomputer Applications 6, 69-97.

Chapter 7

The Unsymmetric Eigenvalue Problem

§7.1 Properties and Decompositions
§7.2 Perturbation Theory
§7.3 Power Iterations
§7.4 The Hessenberg and Real Schur Forms
§7.5 The Practical QR Algorithm
§7.6 Invariant Subspace Computations
§7.7 The QZ Method for Ax = λBx

Having discussed linear equations and least squares, we now direct our attention to the third major problem area in matrix computations, the algebraic eigenvalue problem. The unsymmetric problem is considered in this chapter and the more agreeable symmetric case in the next.

Our first task is to present the decompositions of Schur and Jordan along with the basic properties of eigenvalues and invariant subspaces. The contrasting behavior of these two decompositions sets the stage for §7.2, in which we investigate how the eigenvalues and invariant subspaces of a matrix are affected by perturbation. Condition numbers are developed that permit estimation of the errors that can be expected to arise because of roundoff.

The key algorithm of the chapter is the justly famous QR algorithm. This procedure is the most complex algorithm presented in this book, and its development is spread over three sections. We derive the basic QR iteration in §7.3 as a natural generalization of the simple power method. The next two sections are concerned with making this iteration practical.
