Nov. 2019

Motivation

Almost everything in Scientific Computing relies on Numerical Linear Algebra! We often reformulate problems as Ax = b, e.g.:
> Interpolation (Vandermonde matrix) and linear least-squares (normal equations) are naturally expressed as linear systems
> Gauss-Newton/Levenberg-Marquardt involve approximating a nonlinear problem by a sequence of linear systems

Similar themes will arise in the remaining Units (Numerical Calculus, Optimization, Eigenvalue problems)

Motivation

The goal of this Unit is to cover:
> key linear algebra concepts that underpin Scientific Computing
> algorithms for solving Ax = b in a stable and efficient manner
> algorithms for computing factorizations of A that are useful in many practical contexts (LU, QR)

First, we discuss some practical cases where Ax = b arises directly in the mathematical modeling of physical systems

Example: Electric Circuits

Ohm's Law: The voltage drop due to a current i through a resistor R is V = iR

Kirchhoff's Law: The net voltage drop in a closed loop is zero

[Figure: circuit with voltage sources V1, V2 and resistors R1 to R6 arranged in three loops]

Example: Electric Circuits

Let i_j denote the current in "loop j". Then we obtain the linear system:

$$\begin{bmatrix} R_1+R_3+R_4 & R_3 & R_4 \\ R_3 & R_2+R_3+R_5 & -R_5 \\ R_4 & -R_5 & R_4+R_5+R_6 \end{bmatrix}\begin{bmatrix} i_1 \\ i_2 \\ i_3 \end{bmatrix} = \begin{bmatrix} V_1 \\ V_2 \\ 0 \end{bmatrix}$$

Circuit simulators solve large linear systems of this type

Example: Structural Analysis

It is common in structural analysis to use a linear relationship between force and displacement, Hooke's Law

The simplest case is the Hookean spring law F = kx,
> k: spring constant (stiffness)
> F: applied load
> x: spring extension

Example: Structural Analysis

This relationship can be generalized to structural systems in 2D and 3D, which yields a linear system of the form Kx = F
> K ∈ R^{n×n}: "stiffness matrix"
> F ∈ R^n: "load vector"
> x ∈ R^n: "displacement vector"

Example: Structural Analysis

Solving the linear system yields the displacement (x), hence we can simulate structural deflection under applied loads (F)

Kx = F

[Figure: unloaded structure vs. loaded structure]

Example: Structural Analysis

It is common engineering practice to use Hooke's Law to simulate complex structures, which leads to large linear systems

(From SAP2000, structural analysis software)

Example: Economics

Leontief was awarded the Nobel Prize in Economics in 1973 for developing a linear input/output model for the production/consumption of goods

Consider an economy in which n goods are produced and consumed
> A ∈ R^{n×n}: a_ij represents the amount of good j required to produce 1 unit of good i
> x ∈ R^n: x_i is the number of units of good i produced
> d ∈ R^n: d_i is the consumer demand for good i

In general a_ii = 0, and A may or may not be sparse

Example: Economics

The total amount of x_i produced is given by the sum of consumer demand (d_i) and the amount of x_i required to produce each x_j:

$$x_i = \underbrace{a_{i1}x_1 + a_{i2}x_2 + \cdots + a_{in}x_n}_{\text{production of other goods}} + d_i$$

Hence x = Ax + d or, equivalently, (I - A)x = d

Solve for x to determine the required amount of production of each good

If we consider many goods (e.g. an entire economy), then we get a large linear system

Summary

Matrix computations arise all over the place! Numerical Linear Algebra algorithms provide us with a toolbox for performing these computations in an efficient and stable manner

In most cases we can use these tools as black boxes, but it's important to understand what the linear algebra black boxes do:
> Pick the right algorithm for a given situation (e.g. exploit structure in a problem: symmetry, bandedness, etc.)
> Understand how and when the black box can fail

Preliminaries

In this chapter we will focus on linear systems Ax = b for A ∈ R^{n×n} and b, x ∈ R^n

Recall that it is often helpful to think of matrix multiplication as a linear combination of the columns of A, where the x_j are the weights

That is, we have $b = Ax = \sum_{j=1}^n x_j a_{(:,j)}$, where $a_{(:,j)} \in \mathbb{R}^n$ is the jth column of A and the x_j are scalars

Preliminaries

This can be displayed schematically as

$$b = \begin{bmatrix} a_{(:,1)} \,\Big|\, a_{(:,2)} \,\Big|\, \cdots \,\Big|\, a_{(:,n)} \end{bmatrix}\begin{bmatrix} x_1 \\ \vdots \\ x_n\end{bmatrix} = x_1 a_{(:,1)} + x_2 a_{(:,2)} + \cdots + x_n a_{(:,n)}$$

Preliminaries

We therefore interpret Ax = b as: "x is the vector of coefficients of the linear expansion of b in the basis of columns of A"

Often this is a more helpful point of view than the conventional interpretation of "dot-product of matrix row with vector"

e.g. from the "linear combination of columns" view we immediately see that Ax = b has a solution if b ∈ span{a_(:,1), a_(:,2), ..., a_(:,n)} (this holds even if A isn't square)

Let us write image(A) = span{a_(:,1), a_(:,2), ..., a_(:,n)}

Preliminaries

Existence and Uniqueness: A solution x ∈ R^n exists if b ∈ image(A)

If a solution x exists and the set {a_(:,1), a_(:,2), ..., a_(:,n)} is linearly independent, then x is unique¹

If a solution x exists and ∃z ≠ 0 such that Az = 0, then also A(x + γz) = b for any γ ∈ R, hence there are infinitely many solutions

If b ∉ image(A) then Ax = b has no solution

¹Linear independence of the columns of A is equivalent to Az = 0 ⟹ z = 0

Preliminaries

The inverse map A^{-1}: R^n → R^n is well-defined if and only if Ax = b has a unique solution for all b ∈ R^n

A unique matrix A^{-1} ∈ R^{n×n} such that AA^{-1} = A^{-1}A = I exists if any of the following equivalent conditions are satisfied:
> det(A) ≠ 0
> rank(A) = n
> For any z ≠ 0, Az ≠ 0 (the null space of A is {0})

A is non-singular if A^{-1} exists, and then x = A^{-1}b ∈ R^n

A is singular if A^{-1} does not exist

Norms

A norm ‖·‖ : V → R is a function on a vector space V that satisfies
> ‖x‖ ≥ 0 and ‖x‖ = 0 ⟹ x = 0
> ‖γx‖ = |γ|‖x‖, for γ ∈ R
> ‖x + y‖ ≤ ‖x‖ + ‖y‖

Norms

Also, the triangle inequality implies another helpful inequality: the "reverse triangle inequality", | ‖x‖ − ‖y‖ | ≤ ‖x − y‖

Proof: Let a = y, b = x − y; then

‖x‖ = ‖a + b‖ ≤ ‖a‖ + ‖b‖ = ‖y‖ + ‖x − y‖ ⟹ ‖x‖ − ‖y‖ ≤ ‖x − y‖

Repeat with a = x, b = y − x to show | ‖x‖ − ‖y‖ | ≤ ‖x − y‖

Vector Norms

Let's now introduce some common norms on R^n

The most common norm is the Euclidean norm (or 2-norm), a special case of the p-norm for any p ≥ 1:

$$\|x\|_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}$$

Also, the limiting case as p → ∞ is the ∞-norm:

$$\|x\|_\infty = \max_{1\le i\le n} |x_i|$$

Vector Norms

[Figure: plot of ‖x‖_p against p for a sample vector with ‖x‖_∞ = 2.35; as p increases, the p-norm approaches the ∞-norm, which picks out the largest entry in x]

Vector Norms

We generally use whichever norm is most convenient/appropriate for a given problem, e.g. the 2-norm for least-squares analysis

Different norms give different (but related) measures of size

In particular, an important mathematical fact is: All norms on a finite-dimensional space (such as R^n) are equivalent

Vector Norms

That is, let ‖·‖_a and ‖·‖_b be two norms on a finite-dimensional space V; then ∃ c_1, c_2 > 0 such that for any x ∈ V

$$c_1\|x\|_a \le \|x\|_b \le c_2\|x\|_a$$

(Also, from the above we have $\tfrac{1}{c_2}\|x\|_b \le \|x\|_a \le \tfrac{1}{c_1}\|x\|_b$)

Hence if we can derive an inequality in an arbitrary norm on V, it applies (after appropriate scaling) in any other norm too

Vector Norms

In some cases we can explicitly calculate values for c_1, c_2:

e.g. ‖x‖_2 ≤ ‖x‖_1 ≤ √n ‖x‖_2, since

$$\|x\|_2^2 = \sum_{i=1}^n |x_i|^2 \le \left(\sum_{i=1}^n |x_i|\right)^2 = \|x\|_1^2 \implies \|x\|_2 \le \|x\|_1$$

[e.g. consider |a|² + |b|² ≤ |a|² + |b|² + 2|a||b| = (|a| + |b|)²]

$$\|x\|_1 = \sum_{i=1}^n 1\cdot|x_i| \le \left(\sum_{i=1}^n 1^2\right)^{1/2}\left(\sum_{i=1}^n |x_i|^2\right)^{1/2} = \sqrt{n}\,\|x\|_2$$

[We used the Cauchy-Schwarz inequality in R^n: $\sum_{i=1}^n |a_i b_i| \le \left(\sum_{i=1}^n a_i^2\right)^{1/2}\left(\sum_{i=1}^n b_i^2\right)^{1/2}$]
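As a quick illustration, these inequalities can be checked numerically; the following is a minimal NumPy sketch (the test vector is arbitrary):

import numpy as np

# Check the norm relationships above for an arbitrary vector
x = np.array([2.35, -1.0, 0.5, 1.7])
n = len(x)

norm1 = np.linalg.norm(x, 1)
norm2 = np.linalg.norm(x, 2)

# Equivalence constants for the 1- and 2-norms: ||x||_2 <= ||x||_1 <= sqrt(n) ||x||_2
print(norm2 <= norm1 <= np.sqrt(n) * norm2)          # True

# The p-norm approaches the infinity-norm (largest entry in magnitude) as p grows
for p in [1, 2, 4, 8, 16, 64]:
    print(p, np.linalg.norm(x, p))
print("inf", np.linalg.norm(x, np.inf))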
Vector Norms

Different norms give different measurements of size

[Figure: the "unit circle" {x ∈ R²: ‖x‖_p = 1} in three different norms, p = 1, 2, ∞]

Matrix Norms

There are many ways to define norms on matrices

For example, the Frobenius norm is defined as:

$$\|A\|_F = \left(\sum_{i=1}^m \sum_{j=1}^n |a_{ij}|^2\right)^{1/2}$$

(If we think of A as a vector in R^{mn}, then the Frobenius norm is equivalent to the vector 2-norm of A)

Matrix Norms

Usually the matrix norms induced by vector norms are most useful, e.g.:

$$\|A\|_p = \max_{x\ne 0} \frac{\|Ax\|_p}{\|x\|_p} = \max_{\|x\|_p = 1} \|Ax\|_p$$

This definition implies the useful property ‖Ax‖_p ≤ ‖A‖_p ‖x‖_p, since

$$\|Ax\|_p = \frac{\|Ax\|_p}{\|x\|_p}\|x\|_p \le \left(\max_{v\ne 0}\frac{\|Av\|_p}{\|v\|_p}\right)\|x\|_p = \|A\|_p\|x\|_p$$

Matrix Norms

The 1-norm and ∞-norm can be calculated straightforwardly:

$$\|A\|_1 = \max_{1\le j\le n}\|a_{(:,j)}\|_1 \quad\text{(max column sum)}$$
$$\|A\|_\infty = \max_{1\le i\le m}\|a_{(i,:)}\|_1 \quad\text{(max row sum)}$$

We will see how to compute the matrix 2-norm later in this Unit

Condition Number

The condition number of A ∈ R^{n×n} is defined as

κ(A) = ‖A‖ ‖A^{-1}‖

The value of κ(A) depends on which norm we use

Both Python and Matlab can calculate the condition number for different norms

If A is singular then by convention κ(A) = ∞

The Residual

Recall that the residual r(x) = b − Ax was crucial in least-squares problems

It is also crucial in assessing the accuracy of a proposed solution (x̂) to a square linear system Ax = b

Key point: The residual r(x̂) is always computable, whereas in general the error Δx = x − x̂ isn't

The Residual

We have that ‖Δx‖ = ‖x − x̂‖ = 0 if and only if ‖r(x̂)‖ = 0

However, a small residual doesn't necessarily imply a small ‖Δx‖

Observe that

$$\|\Delta x\| = \|x - \hat{x}\| = \|A^{-1}(b - A\hat{x})\| = \|A^{-1}r(\hat{x})\| \le \|A^{-1}\|\|r(\hat{x})\|$$

Hence

$$\frac{\|\Delta x\|}{\|\hat{x}\|} \le \frac{\|A^{-1}\|\|r(\hat{x})\|}{\|\hat{x}\|} = \kappa(A)\frac{\|r(\hat{x})\|}{\|A\|\|\hat{x}\|}$$

The reason Δx and r(x̂) are related is that the residual measures the "input perturbation" in Ax = b

To see this, let's consider Ax = b to be a map b ∈ R^n → x ∈ R^n

The Residual

Then we can consider x̂ to be the exact solution for some perturbed input b̂ = b + Δb, i.e. Ax̂ = b̂

The residual associated with x̂ is r(x̂) = b − Ax̂ = b − b̂ = −Δb, i.e. ‖r(x̂)‖ = ‖Δb‖

In general, a numerically stable algorithm gives us the exact solution to a slightly perturbed problem, i.e. a small residual²

This is a reasonable expectation for a stable algorithm: rounding error doesn't accumulate, so the effective input perturbation is small

²More precisely, this is called a "backward stable algorithm"

The Residual

Consider a 2 × 2 example to clearly demonstrate the difference between residual and error:

$$Ax = \begin{bmatrix} 0.913 & 0.659 \\ 0.457 & 0.330 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0.254 \\ 0.127 \end{bmatrix} = b$$

The exact solution is given by x = [1, −1]^T

Suppose we compute two different approximate solutions (e.g. using two different algorithms):

$$\hat{x}_{(1)} = \begin{bmatrix} -0.0827 \\ 0.5 \end{bmatrix}, \qquad \hat{x}_{(2)} = \begin{bmatrix} 0.999 \\ -1.001 \end{bmatrix}$$

The Residual

Then,

$$\|r(\hat{x}_{(1)})\|_1 = 2.1\times 10^{-4}, \qquad \|r(\hat{x}_{(2)})\|_1 = 2.4\times 10^{-3}$$

but

$$\|x - \hat{x}_{(1)}\|_1 = 2.58, \qquad \|x - \hat{x}_{(2)}\|_1 = 0.002$$

In this case, x̂_(2) is the better solution, but it has the larger residual!

This is possible here because κ(A) ≈ 1.25 × 10^4 is quite large

(i.e. rel. error ≲ 1.25 × 10^4 × rel. residual)
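The numbers in this example are easy to reproduce; here is a minimal NumPy sketch (the approximate solutions are the ones quoted above):

import numpy as np

# Reproduce the 2x2 residual-versus-error example above
A = np.array([[0.913, 0.659],
              [0.457, 0.330]])
b = np.array([0.254, 0.127])
x_exact = np.array([1.0, -1.0])

x_hat1 = np.array([-0.0827, 0.5])
x_hat2 = np.array([0.999, -1.001])

for name, x_hat in [("x_hat1", x_hat1), ("x_hat2", x_hat2)]:
    r = b - A @ x_hat                  # the residual is always computable
    err = x_exact - x_hat              # the error requires knowing the exact solution
    print(name, np.linalg.norm(r, 1), np.linalg.norm(err, 1))

# The large condition number is what allows a small residual to hide a large error
print(np.linalg.cond(A, 2))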
LU Factorization

Solving Ax = b

A familiar idea for solving Ax = b is to use Gaussian elimination to transform Ax = b to a triangular system

What is a triangular system?
> Upper triangular matrix U ∈ R^{n×n}: if i > j then u_ij = 0
> Lower triangular matrix L ∈ R^{n×n}: if i < j then ℓ_ij = 0

Solving Ax = b

Triangular systems are easy to solve. For an upper triangular system Ux = b we can use back substitution:

$$x_n = b_n/u_{nn}, \qquad x_{n-1} = (b_{n-1} - u_{n-1,n}x_n)/u_{n-1,n-1}, \qquad \ldots, \qquad x_j = \Big(b_j - \sum_{k=j+1}^{n}u_{jk}x_k\Big)\Big/u_{jj}$$

Solving Ax = b

Similarly, we can use forward substitution for a lower triangular system Lx = b:

$$x_1 = b_1/\ell_{11}, \qquad x_2 = (b_2 - \ell_{21}x_1)/\ell_{22}, \qquad \ldots, \qquad x_j = \Big(b_j - \sum_{k=1}^{j-1}\ell_{jk}x_k\Big)\Big/\ell_{jj}$$

Solving Ax = b

Back and forward substitution can be implemented with doubly nested for-loops

The computational work is dominated by evaluating the sum $\sum_{k=1}^{j-1}\ell_{jk}x_k$, j = 1, ..., n

We have j − 1 additions and multiplications in this loop for each j = 1, ..., n, i.e. 2(j − 1) operations for each j

Hence the total number of floating point operations in back or forward substitution is asymptotic to 2 · n(n + 1)/2 ∼ n²

Solving Ax = b

Here "∼" refers to asymptotic behavior, e.g.

$$f(n) \sim g(n) \iff \lim_{n\to\infty}\frac{f(n)}{g(n)} = 1$$

We often also use "big-O" notation, e.g. for remainder terms in Taylor expansions:

f(x) = O(g(x)) if there exist M > 0 and x_0 ∈ R such that |f(x)| ≤ M|g(x)| for all x > x_0

In the present context we prefer "∼" since it indicates the correct scaling of the leading-order term

e.g. let f(n) = n²/4 + n; then f(n) = O(n²), whereas f(n) ∼ n²/4

Solving Ax = b

So transforming Ax = b to a triangular system is a sensible goal, but how do we achieve it?

Observation: If we premultiply Ax = b by a nonsingular matrix M then the new system MAx = Mb has the same solution

Hence we want to devise a sequence of matrices M_1, M_2, ..., M_{n−1} such that MA = M_{n−1}···M_1A = U is upper triangular

This process is Gaussian elimination, and gives the transformed system Ux = Mb

LU Factorization

We will show shortly that it turns out that if MA = U, then we have that L = M^{-1} is lower triangular

Therefore we obtain A = LU: a product of lower and upper triangular matrices

This is the LU factorization of A

LU Factorization

LU factorization is the most common way of solving linear systems!

Ax = b ⟺ LUx = b

Let y = Ux; then Ly = b: solve for y via forward substitution³

Then solve Ux = y via back substitution

³y is the transformed right-hand side vector (i.e. Mb from earlier) that we are familiar with from Gaussian elimination

LU Factorization

Next question: How should we determine M_1, M_2, ..., M_{n−1}?

We need to be able to annihilate selected entries of A below the diagonal in order to obtain an upper-triangular matrix

To do this, we use "elementary elimination matrices"

Let L_j denote the jth elimination matrix (we use "L_j" rather than "M_j" from now on as the elimination matrices are lower triangular)

LU Factorization

Let X (= L_{j−1}L_{j−2}···L_1A) denote the matrix at the start of step j, and let x_(:,j) ∈ R^n denote column j of X

Then we define L_j such that

$$L_j x_{(:,j)} =
\begin{bmatrix}
1 & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & & \vdots \\
0 & \cdots & 1 & 0 & \cdots & 0 \\
0 & \cdots & -x_{j+1,j}/x_{jj} & 1 & \cdots & 0 \\
\vdots & & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & -x_{nj}/x_{jj} & 0 & \cdots & 1
\end{bmatrix}
\begin{bmatrix} x_{1j} \\ \vdots \\ x_{jj} \\ x_{j+1,j} \\ \vdots \\ x_{nj} \end{bmatrix}
=
\begin{bmatrix} x_{1j} \\ \vdots \\ x_{jj} \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

LU Factorization

To simplify notation, we let ℓ_ij = x_ij/x_jj in order to obtain

$$L_j =
\begin{bmatrix}
1 & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & & \vdots \\
0 & \cdots & 1 & 0 & \cdots & 0 \\
0 & \cdots & -\ell_{j+1,j} & 1 & \cdots & 0 \\
\vdots & & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & -\ell_{nj} & 0 & \cdots & 1
\end{bmatrix}$$

LU Factorization

Using elementary elimination matrices we can reduce A to upper triangular form, one column at a time

Schematically, for a 4 × 4 matrix, we have

$$\underbrace{\begin{bmatrix} \times & \times & \times & \times \\ \times & \times & \times & \times \\ \times & \times & \times & \times \\ \times & \times & \times & \times \end{bmatrix}}_{A}
\;\xrightarrow{L_1}\;
\underbrace{\begin{bmatrix} \times & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & \times & \times & \times \end{bmatrix}}_{L_1 A}
\;\xrightarrow{L_2}\;
\underbrace{\begin{bmatrix} \times & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & 0 & \times & \times \\ 0 & 0 & \times & \times \end{bmatrix}}_{L_2 L_1 A}$$

Key point: L_k does not affect columns 1, 2, ..., k − 1 of L_{k−1}L_{k−2}···L_1A

LU Factorization

After n − 1 steps, we obtain the upper triangular matrix

$$U = L_{n-1}\cdots L_2 L_1 A = \begin{bmatrix} \times & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & 0 & \times & \times \\ 0 & 0 & 0 & \times \end{bmatrix}$$
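As an illustration, the elimination matrices L_j can be formed and applied explicitly; the following is a minimal NumPy sketch (the 4 × 4 test matrix is arbitrary, chosen diagonally dominant so that the pivots x_jj stay nonzero):

import numpy as np

# Build the elementary elimination matrix L_j that zeros the entries of
# column j of X below the diagonal, following the definition above
def elimination_matrix(X, j):
    n = X.shape[0]
    Lj = np.eye(n)
    Lj[j+1:, j] = -X[j+1:, j] / X[j, j]    # the multipliers -l_ij = -x_ij / x_jj
    return Lj

np.random.seed(0)
A = np.random.random((4, 4)) + 4 * np.eye(4)   # diagonally dominant test matrix

X = A.copy()
for j in range(3):                # n - 1 = 3 elimination steps
    Lj = elimination_matrix(X, j)
    X = Lj @ X                    # columns 1, ..., j are unaffected by L_j

# X is now upper triangular, up to rounding error below the diagonal
print(np.round(X, 12))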
LU Factorization

Finally, we wish to form the factorization A = LU, hence we need

$$L = (L_{n-1}\cdots L_2 L_1)^{-1} = L_1^{-1}L_2^{-1}\cdots L_{n-1}^{-1}$$

This turns out to be surprisingly simple due to two strokes of luck!

First stroke of luck: L_j^{-1} is obtained simply by negating the subdiagonal entries of L_j:

$$L_j = \begin{bmatrix} 1 & & & & \\ & \ddots & & & \\ & & 1 & & \\ & & -\ell_{j+1,j} & 1 & \\ & & \vdots & & \ddots \\ & & -\ell_{nj} & & & 1\end{bmatrix}, \qquad
L_j^{-1} = \begin{bmatrix} 1 & & & & \\ & \ddots & & & \\ & & 1 & & \\ & & \ell_{j+1,j} & 1 & \\ & & \vdots & & \ddots \\ & & \ell_{nj} & & & 1\end{bmatrix}$$

LU Factorization

Explanation: Let ℓ_j = [0, ..., 0, ℓ_{j+1,j}, ..., ℓ_{nj}]^T, so that L_j = I − ℓ_j e_j^T

Now consider L_j(I + ℓ_j e_j^T):

$$L_j(I + \ell_j e_j^T) = (I - \ell_j e_j^T)(I + \ell_j e_j^T) = I - \ell_j e_j^T \ell_j e_j^T = I - \ell_j (e_j^T \ell_j) e_j^T$$

Also, (e_j^T ℓ_j) = 0 (why?), so that L_j(I + ℓ_j e_j^T) = I

By the same argument (I + ℓ_j e_j^T)L_j = I, and hence L_j^{-1} = I + ℓ_j e_j^T

LU Factorization

Next we want to form the matrix L = L_1^{-1}L_2^{-1}···L_{n−1}^{-1}

Note that we have

$$L_j^{-1}L_{j+1}^{-1} = (I + \ell_j e_j^T)(I + \ell_{j+1}e_{j+1}^T) = I + \ell_j e_j^T + \ell_{j+1}e_{j+1}^T + \ell_j(e_j^T\ell_{j+1})e_{j+1}^T = I + \ell_j e_j^T + \ell_{j+1}e_{j+1}^T$$

Interestingly, this convenient result doesn't hold for L_{j+1}^{-1}L_j^{-1}; why?

LU Factorization

Similarly,

$$L_j^{-1}L_{j+1}^{-1}L_{j+2}^{-1} = (I + \ell_j e_j^T + \ell_{j+1}e_{j+1}^T)(I + \ell_{j+2}e_{j+2}^T) = I + \ell_j e_j^T + \ell_{j+1}e_{j+1}^T + \ell_{j+2}e_{j+2}^T$$

That is, to compute the product L_1^{-1}L_2^{-1}···L_{n−1}^{-1}, we simply collect the subdiagonals for j = 1, 2, ..., n − 1

LU Factorization

Hence, second stroke of luck:

$$L = L_1^{-1}L_2^{-1}\cdots L_{n-1}^{-1} = \begin{bmatrix} 1 & & & & \\ \ell_{21} & 1 & & & \\ \ell_{31} & \ell_{32} & 1 & & \\ \vdots & \vdots & \ddots & \ddots & \\ \ell_{n1} & \ell_{n2} & \cdots & \ell_{n,n-1} & 1 \end{bmatrix}$$

LU Factorization

Therefore, the basic LU factorization algorithm is

1: U = A, L = I
2: for j = 1 : n − 1 do
3:   for i = j + 1 : n do
4:     ℓ_ij = u_ij / u_jj
5:     for k = j : n do
6:       u_ik = u_ik − ℓ_ij u_jk
7:     end for
8:   end for
9: end for

Note that the entries of U are updated each iteration, so at the start of step j, U = L_{j−1}L_{j−2}···L_1A

Here line 4 comes straight from the definition ℓ_ij = u_ij/u_jj

LU Factorization

Line 6 accounts for the effect of L_j on columns k = j, j + 1, ..., n of U

For k = j : n we have

$$\begin{bmatrix}
1 & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & & \vdots \\
0 & \cdots & 1 & 0 & \cdots & 0 \\
0 & \cdots & -\ell_{j+1,j} & 1 & \cdots & 0 \\
\vdots & & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & -\ell_{nj} & 0 & \cdots & 1
\end{bmatrix}
\begin{bmatrix} u_{1k} \\ \vdots \\ u_{jk} \\ u_{j+1,k} \\ \vdots \\ u_{nk} \end{bmatrix}
=
\begin{bmatrix} u_{1k} \\ \vdots \\ u_{jk} \\ u_{j+1,k} - \ell_{j+1,j}u_{jk} \\ \vdots \\ u_{nk} - \ell_{nj}u_{jk} \end{bmatrix}$$

The vector on the right is the updated kth column of U, which is computed in line 6

LU Factorization

LU factorization involves a triply-nested for-loop, hence O(n³) calculations

Careful operation counting shows LU factorization requires ∼ (1/3)n³ additions and ∼ (1/3)n³ multiplications, ∼ (2/3)n³ operations in total

Solving a linear system using LU

Hence to solve Ax = b, we perform the following three steps:

Step 1: Factorize A into L and U: ∼ (2/3)n³
Step 2: Solve Ly = b by forward substitution: ∼ n²
Step 3: Solve Ux = y by back substitution: ∼ n²

Total work is dominated by Step 1, ∼ (2/3)n³

Solving a linear system using LU

An alternative approach would be to compute A^{-1} explicitly and evaluate x = A^{-1}b, but this is a bad idea!

Question: How would we compute A^{-1}?

Solving a linear system using LU

Answer: Let a^inv_(:,k) denote the kth column of A^{-1}; then a^inv_(:,k) must satisfy

A a^inv_(:,k) = e_k

Therefore to compute A^{-1}, we first LU factorize A, then back/forward substitute for the right-hand side vector e_k, for k = 1, 2, ..., n

The n back/forward substitutions alone require ∼ 2n³ operations, inefficient!

A rule of thumb in Numerical Linear Algebra: it is almost always a bad idea to compute A^{-1} explicitly

Solving a linear system using LU

Another case where LU factorization is very helpful is if we want to solve Ax = b_i for several different right-hand sides b_i, i = 1, ..., k

We incur the ∼ (2/3)n³ cost only once, and then each subsequent forward/back substitution costs only ∼ 2n²

Makes a huge difference if n is large!
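The three steps above can be written out directly; here is a minimal NumPy sketch of LU factorization (without pivoting, see the discussion that follows) together with the forward and back substitutions (the test matrix is an arbitrary diagonally dominant matrix so that no pivoting is needed):

import numpy as np

# Step 1: LU factorization without pivoting, transcribed from the pseudocode above
def lu_no_pivot(A):
    n = A.shape[0]
    U = A.astype(float).copy()
    L = np.eye(n)
    for j in range(n - 1):
        for i in range(j + 1, n):
            L[i, j] = U[i, j] / U[j, j]        # assumes the pivot u_jj is nonzero
            U[i, j:] -= L[i, j] * U[j, j:]
    return L, U

# Step 2: forward substitution for Ly = b
def forward_sub(L, b):
    y = np.zeros_like(b, dtype=float)
    for j in range(len(b)):
        y[j] = (b[j] - L[j, :j] @ y[:j]) / L[j, j]
    return y

# Step 3: back substitution for Ux = y
def back_sub(U, y):
    x = np.zeros_like(y, dtype=float)
    for j in reversed(range(len(y))):
        x[j] = (y[j] - U[j, j+1:] @ x[j+1:]) / U[j, j]
    return x

np.random.seed(1)
A = np.random.random((5, 5)) + 5 * np.eye(5)   # diagonally dominant test matrix
b = np.random.random(5)
L, U = lu_no_pivot(A)
x = back_sub(U, forward_sub(L, b))
print(np.allclose(A @ x, b), np.allclose(x, np.linalg.solve(A, b)))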
Stability of Gaussian Elimination

There is a problem with the LU algorithm presented above. Consider the matrix

$$A = \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix}$$

A is nonsingular and well-conditioned (κ(A) ≈ 2.62), but LU factorization fails at the first step (division by zero)

Stability of Gaussian Elimination

LU factorization doesn't fail for

$$A = \begin{bmatrix} 10^{-20} & 1 \\ 1 & 1 \end{bmatrix}$$

but we get

$$L = \begin{bmatrix} 1 & 0 \\ 10^{20} & 1 \end{bmatrix}, \qquad U = \begin{bmatrix} 10^{-20} & 1 \\ 0 & 1 - 10^{20} \end{bmatrix}$$

Stability of Gaussian Elimination

Let's suppose that −10^{20} ∈ F (a floating point number) and that round(1 − 10^{20}) = −10^{20}

Then in finite precision arithmetic we get

$$\tilde{L} = \begin{bmatrix} 1 & 0 \\ 10^{20} & 1 \end{bmatrix}, \qquad \tilde{U} = \begin{bmatrix} 10^{-20} & 1 \\ 0 & -10^{20} \end{bmatrix}$$

Stability of Gaussian Elimination

Hence due to rounding error we obtain

$$\tilde{L}\tilde{U} = \begin{bmatrix} 10^{-20} & 1 \\ 1 & 0 \end{bmatrix}$$

which is not close to

$$A = \begin{bmatrix} 10^{-20} & 1 \\ 1 & 1 \end{bmatrix}$$

Then, for example, let b = [3, 3]^T
> Using this LU factorization, we get x̂ = [3, 3]^T
> The true answer is x = [0, 3]^T

Hence a large relative error (rel. err. = 1) even though the problem is well-conditioned

Stability of Gaussian Elimination

In this example, standard Gaussian elimination yields a large residual

Or equivalently, it yields the exact solution to a problem corresponding to a large input perturbation: Δb = [0, 3]^T

Hence an unstable algorithm! In this case the cause of the large error in x is numerical instability, not ill-conditioning

To stabilize Gaussian elimination, we need to permute rows, i.e. perform pivoting

Pivoting

Recall the Gaussian elimination process:

[Schematic: at step j the diagonal entry x_jj is used as the pivot to zero out the entries below it in column j]

But we could just as easily do:

[Schematic: any entry x_ij with i ≥ j in column j can be used as the pivot to zero out the other entries in the column]

Partial Pivoting

The entry x_jj is called the pivot, and flexibility in choosing the pivot is essential; otherwise we can't deal with, e.g.,

$$A = \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix}$$

From a numerical stability point of view, it is crucial to choose the pivot to be the largest entry in column j: "partial pivoting"⁴

This ensures that each ℓ_ij entry, which acts as a multiplier in the LU factorization process, satisfies |ℓ_ij| ≤ 1

⁴Full pivoting refers to searching through columns j : n for the largest entry; this is more expensive and of only marginal benefit to stability in practice

Partial Pivoting

To maintain the triangular LU structure, we permute rows by premultiplying by permutation matrices

[Schematic: pivot selection, then row interchange, then elimination]

In this case

$$P_1 = \begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 1\\ 0 & 0 & 1 & 0\\ 0 & 1 & 0 & 0\end{bmatrix}$$

and each P_j is obtained by swapping two rows of I

Partial Pivoting

Therefore, with partial pivoting we obtain

$$L_{n-1}P_{n-1}\cdots L_2P_2L_1P_1A = U$$

It can be shown that this is equivalent to

PA = LU

where⁵ P = P_{n−1}···P_2P_1

Theorem: Gaussian elimination with partial pivoting produces nonsingular factors L and U if and only if A is nonsingular.

⁵The L matrix here is lower triangular, but not the same as the L in the non-pivoting case: we have to account for the row swaps

Partial Pivoting

Pseudocode for LU factorization with partial pivoting (the pivoting steps, lines 3 to 6, are the ones added to the basic algorithm):

1: U = A, L = I, P = I
2: for j = 1 : n − 1 do
3:   Select i (≥ j) that maximizes |u_ij|
4:   Interchange rows of U: u_(j, j:n) ↔ u_(i, j:n)
5:   Interchange rows of L: ℓ_(j, 1:j−1) ↔ ℓ_(i, 1:j−1)
6:   Interchange rows of P: p_(j, :) ↔ p_(i, :)
7:   for i = j + 1 : n do
8:     ℓ_ij = u_ij / u_jj
9:     for k = j : n do
10:      u_ik = u_ik − ℓ_ij u_jk
11:    end for
12:  end for
13: end for

Again this requires ∼ (2/3)n³ floating point operations

Partial Pivoting: Solve Ax = b

To solve Ax = b using the factorization PA = LU:
> Multiply through by P to obtain PAx = LUx = Pb
> Solve Ly = Pb using forward substitution
> Then solve Ux = y using back substitution

Partial Pivoting in Python

Python's scipy.linalg.lu function can do LU factorization with pivoting:

>>> import numpy as np
>>> import scipy.linalg
>>> a = np.random.random((4,4))
>>> (p, l, u) = scipy.linalg.lu(a)
>>> p
[4 x 4 permutation matrix]
>>> l
[4 x 4 unit lower triangular matrix]
>>> u
[4 x 4 upper triangular matrix]
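The 2 × 2 example above can be used to see the effect of pivoting directly; here is a minimal NumPy/SciPy sketch (the elimination step is written out by hand for the no-pivoting case):

import numpy as np
import scipy.linalg

# The 2x2 example above: a tiny pivot wrecks elimination without pivoting
A = np.array([[1e-20, 1.0],
              [1.0,   1.0]])

# One elimination step without pivoting, written out explicitly
l21 = A[1, 0] / A[0, 0]                           # huge multiplier, about 1e20
L = np.array([[1.0, 0.0], [l21, 1.0]])
U = np.array([[A[0, 0], A[0, 1]],
              [0.0, A[1, 1] - l21 * A[0, 1]]])    # 1 - 1e20 rounds to -1e20

print(L @ U)   # the (2,2) entry is 0 instead of 1: L U is not close to A

# With partial pivoting (scipy.linalg.lu) the multipliers satisfy |l_ij| <= 1
# and the product P L U reproduces A to machine precision
P, Lp, Up = scipy.linalg.lu(A)
print(P @ Lp @ Up)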
Stability of Gaussian Elimination

Numerical stability of Gaussian elimination has been an important research topic since the 1940s

Major figure in this field: James H. Wilkinson (English numerical analyst, 1919-1986)

He showed that for Ax = b with A ∈ R^{n×n}:
> Gaussian elimination without partial pivoting is numerically unstable (as we've already seen)
> Gaussian elimination with partial pivoting satisfies

$$\frac{\|r\|}{\|A\|\|\hat{x}\|} \lesssim 2^{n-1}\,\epsilon_{\text{mach}}$$

Stability of Gaussian Elimination

That is, pathological cases exist where the relative residual, ‖r‖/(‖A‖‖x̂‖), grows exponentially with n due to rounding error

The worst-case behavior of Gaussian elimination with partial pivoting is explosive instability, but such pathological cases are extremely rare!

In over 50 years of Scientific Computation, instability has only been encountered due to deliberate construction of pathological cases

In practice, Gaussian elimination is stable in the sense that it produces a small relative residual

Stability of Gaussian Elimination

In practice, we typically obtain

$$\frac{\|r\|}{\|A\|\|\hat{x}\|} \lesssim n\,\epsilon_{\text{mach}}$$

i.e. it grows only linearly with n, and is scaled by ε_mach

Combining this result with our inequality

$$\frac{\|\Delta x\|}{\|\hat{x}\|} \le \kappa(A)\frac{\|r(\hat{x})\|}{\|A\|\|\hat{x}\|}$$

implies that in practice Gaussian elimination gives small error for well-conditioned problems!

Cholesky Factorization

Cholesky factorization

Suppose that A ∈ R^{n×n} is an "SPD" matrix, i.e.:
> Symmetric: A^T = A
> Positive Definite: for any v ≠ 0, v^T A v > 0

Then the LU factorization of A can be arranged so that U = L^T, i.e. A = LL^T (but in this case L may not have 1s on the diagonal)

Consider the 2 × 2 case:

$$\begin{bmatrix} a_{11} & a_{21}\\ a_{21} & a_{22}\end{bmatrix} = \begin{bmatrix} \ell_{11} & 0\\ \ell_{21} & \ell_{22}\end{bmatrix}\begin{bmatrix} \ell_{11} & \ell_{21}\\ 0 & \ell_{22}\end{bmatrix}$$

Equating entries gives

$$\ell_{11} = \sqrt{a_{11}}, \qquad \ell_{21} = a_{21}/\ell_{11}, \qquad \ell_{22} = \sqrt{a_{22} - \ell_{21}^2}$$

Cholesky factorization

This approach of equating entries can be used to derive the Cholesky factorization for the general n × n case:

1: L = A
2: for j = 1 : n do
3:   ℓ_jj = √(ℓ_jj)
4:   for i = j + 1 : n do
5:     ℓ_ij = ℓ_ij / ℓ_jj
6:   end for
7:   for k = j + 1 : n do
8:     for i = k : n do
9:       ℓ_ik = ℓ_ik − ℓ_ij ℓ_kj
10:    end for
11:  end for
12: end for

Cholesky factorization

Notes on Cholesky factorization:
> For an SPD matrix A, Cholesky factorization is numerically stable and does not require any pivoting
> Operation count: ∼ (1/3)n³ operations in total, i.e. about half as many as Gaussian elimination
> We only need to store L, hence it uses less memory than LU
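Here is a minimal NumPy transcription of the Cholesky pseudocode above (the SPD test matrix is arbitrary):

import numpy as np

# Cholesky factorization: the factor L is accumulated in the lower triangle
# of a working copy of A, following the pseudocode above
def cholesky(A):
    n = A.shape[0]
    L = A.astype(float).copy()
    for j in range(n):
        L[j, j] = np.sqrt(L[j, j])
        L[j+1:, j] /= L[j, j]
        for k in range(j + 1, n):
            L[k:, k] -= L[k:, j] * L[k, j]
    return np.tril(L)        # discard the untouched upper triangle

np.random.seed(2)
B = np.random.random((4, 4))
A = B @ B.T + 4 * np.eye(4)                  # SPD by construction
L = cholesky(A)
print(np.allclose(L @ L.T, A))               # A = L L^T
print(np.allclose(L, np.linalg.cholesky(A))) # agrees with NumPy's Cholesky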
Sparse Matrices

Sparse Matrices

In applications, we often encounter sparse matrices

A prime example is in the discretization of partial differential equations (covered in the next section)

"Sparse matrix" is not precisely defined; roughly speaking, it is a matrix that is "mostly zeros"

From a computational point of view it is advantageous to store only the non-zero entries

The set of non-zero entries of a sparse matrix is referred to as its sparsity pattern

Sparse Matrices

[Figure: a small sparse matrix of 0s and 1s, together with its sparsity pattern stored as the list of (row, column) index pairs of its non-zero entries]

Sparse Matrices

From a mathematical point of view, sparse matrices are no different from dense matrices

But from a Sci. Comp. perspective, sparse matrices require different data structures and algorithms for computational efficiency

e.g., we can apply LU or Cholesky to a sparse A, but "new" non-zeros (i.e. outside the sparsity pattern of A) are introduced in the factors

These new non-zero entries are called "fill-in"; many methods exist for reducing fill-in by permuting the rows and columns of A

QR Factorization

QR Factorization

A square matrix Q ∈ R^{n×n} is called orthogonal if its columns and rows are orthonormal vectors

Equivalently, Q^T Q = QQ^T = I

Orthogonal matrices preserve the Euclidean norm of a vector, i.e.

$$\|Qv\|_2^2 = v^T Q^T Q v = v^T v = \|v\|_2^2$$

Hence, geometrically, we picture orthogonal matrices as reflection or rotation operators

Orthogonal matrices are very important in scientific computing: norm-preservation implies no amplification of numerical error!

QR Factorization

A matrix A ∈ R^{m×n}, m ≥ n, can be factorized into

A = QR

where
> Q ∈ R^{m×m} is orthogonal
> $R = \begin{bmatrix}\hat{R}\\ 0\end{bmatrix} \in \mathbb{R}^{m\times n}$
> R̂ ∈ R^{n×n} is upper-triangular

QR is very good for solving overdetermined linear least-squares problems, Ax ≃ b⁶

⁶QR can also be used to solve a square system Ax = b, but it requires ∼ 2× as many operations as Gaussian elimination, hence it is not the standard choice

QR Factorization

To see why, consider the 2-norm of the least-squares residual:

$$\|r(x)\|_2^2 = \|b - Ax\|_2^2 = \left\|b - Q\begin{bmatrix}\hat{R}\\ 0\end{bmatrix}x\right\|_2^2 = \left\|Q^T\!\left(b - Q\begin{bmatrix}\hat{R}\\ 0\end{bmatrix}x\right)\right\|_2^2 = \left\|Q^T b - \begin{bmatrix}\hat{R}\\ 0\end{bmatrix}x\right\|_2^2$$

(We used the fact that ‖Q^T z‖_2 = ‖z‖_2 in the second step)

QR Factorization

Then, let Q^T b = [c_1; c_2] where c_1 ∈ R^n, c_2 ∈ R^{m−n}, so that

$$\|r(x)\|_2^2 = \|c_1 - \hat{R}x\|_2^2 + \|c_2\|_2^2$$

Question: Based on this expression, how do we minimize ‖r(x)‖_2?

QR Factorization

Answer: We can't influence the second term, ‖c_2‖_2², since it doesn't contain x

Hence we minimize ‖r(x)‖_2² by making the first term zero

That is, we solve the n × n triangular system R̂x = c_1; this is what Python does in its lstsq function for solving least squares

Also, this tells us that min_{x ∈ R^n} ‖r(x)‖_2 = ‖c_2‖_2

QR Factorization

Recall that solving linear least-squares via the normal equations requires solving a system with the matrix A^T A

But using the normal equations directly is problematic since

cond(A^T A) = cond(A)²

(this is a consequence of the SVD, which we'll cover soon)

The QR approach avoids this condition-number-squaring effect and is much more numerically stable!

QR Factorization

How do we compute the QR factorization? There are three main methods
> Gram-Schmidt orthogonalization
> Householder triangularization
> Givens rotations

We will cover Gram-Schmidt and Givens rotations in class

Gram-Schmidt Orthogonalization

Suppose A ∈ R^{m×n}, m ≥ n

One way to picture the QR factorization is to construct a sequence of orthonormal vectors q_1, q_2, ... such that

span{q_1, q_2, ..., q_j} = span{a_(:,1), a_(:,2), ..., a_(:,j)},  j = 1, ..., n

We seek coefficients r_ij such that

a_(:,1) = r_11 q_1,
a_(:,2) = r_12 q_1 + r_22 q_2,
  ...
a_(:,n) = r_1n q_1 + r_2n q_2 + ··· + r_nn q_n

This can be done via the Gram-Schmidt process, as we'll discuss shortly

Gram-Schmidt Orthogonalization

In matrix form we have:

$$\begin{bmatrix} a_{(:,1)} \,\Big|\, a_{(:,2)} \,\Big|\, \cdots \,\Big|\, a_{(:,n)}\end{bmatrix} = \begin{bmatrix} q_1 \,\Big|\, q_2 \,\Big|\, \cdots \,\Big|\, q_n\end{bmatrix}\begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1n}\\ & r_{22} & & \vdots\\ & & \ddots & \vdots\\ & & & r_{nn}\end{bmatrix}$$

This gives A = Q̂R̂ for Q̂ ∈ R^{m×n}, R̂ ∈ R^{n×n}

This is called the reduced QR factorization of A, which is slightly different from the definition we gave earlier

Note that for m > n, Q̂^T Q̂ = I, but Q̂Q̂^T ≠ I (the latter is why the full QR is sometimes nice)

Full vs Reduced QR Factorization

The full QR factorization (defined earlier)

A = QR

is obtained by appending m − n arbitrary orthonormal columns to Q̂ to make it an m × m orthogonal matrix

We also need to append rows of zeros to R̂ to "silence" the last m − n columns of Q, to obtain

$$R = \begin{bmatrix}\hat{R}\\ 0\end{bmatrix}$$

Full vs Reduced QR Factorization

[Figure: schematic comparison of the full QR factorization (Q is m × m, R is m × n) and the reduced QR factorization (Q̂ is m × n, R̂ is n × n)]

Full vs Reduced QR Factorization

Exercise: Show that the linear least-squares solution is given by R̂x = Q̂^T b by plugging A = Q̂R̂ into the normal equations

This is equivalent to the least-squares result we showed earlier using the full QR factorization, since c_1 = Q̂^T b
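This is exactly how a least-squares problem can be solved with the reduced QR factorization in practice; here is a minimal NumPy sketch (the overdetermined test problem is random):

import numpy as np

# Solve an overdetermined least-squares problem Ax ~ b via the reduced QR
# factorization: solve the n x n triangular system R x = Q^T b
np.random.seed(3)
m, n = 10, 3
A = np.random.random((m, n))
b = np.random.random(m)

Q, R = np.linalg.qr(A)                 # reduced QR: Q is m x n, R is n x n
c1 = Q.T @ b
x_qr = np.linalg.solve(R, c1)          # R is triangular; a generic solver is fine here

x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x_qr, x_ls))         # both give the least-squares solution

# The minimal residual norm equals ||c2||_2 = ||b - Q Q^T b||_2
print(np.linalg.norm(b - A @ x_qr), np.linalg.norm(b - Q @ c1))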
Full versus Reduced QR Factorization

In Python, numpy.linalg.qr gives the reduced QR factorization by default:

>>> import numpy as np
>>> a = np.random.random((5,3))
>>> (q, r) = np.linalg.qr(a)
>>> q
[5 x 3 matrix with orthonormal columns]
>>> r
[3 x 3 upper triangular matrix]

Full versus Reduced QR Factorization

In Python, supplying the mode='complete' option gives the complete QR factorization:

>>> import numpy as np
>>> a = np.random.random((5,3))
>>> (q, r) = np.linalg.qr(a, mode='complete')
>>> q
[5 x 5 orthogonal matrix]
>>> r
[5 x 3 matrix: upper triangular on top, two rows of zeros below]

Gram-Schmidt Orthogonalization

Returning to the Gram-Schmidt process, how do we compute the q_i, i = 1, ..., n?

In the jth step, find a unit vector q_j ∈ span{a_(:,1), a_(:,2), ..., a_(:,j)} that is orthogonal to span{q_1, q_2, ..., q_{j−1}}

We set

$$v_j = a_{(:,j)} - (q_1^T a_{(:,j)})q_1 - \cdots - (q_{j-1}^T a_{(:,j)})q_{j-1},$$

and then q_j = v_j/‖v_j‖_2 satisfies our requirements

We can now determine the required values of the r_ij

Gram-Schmidt Orthogonalization

We then write our set of equations for the q_j as

$$q_1 = \frac{a_{(:,1)}}{r_{11}}, \qquad q_2 = \frac{a_{(:,2)} - r_{12}q_1}{r_{22}}, \qquad \ldots, \qquad q_n = \frac{a_{(:,n)} - \sum_{i=1}^{n-1} r_{in}q_i}{r_{nn}}$$

Then from the definition of q_j, we see that

$$r_{ij} = q_i^T a_{(:,j)} \ (i \ne j), \qquad |r_{jj}| = \Big\|a_{(:,j)} - \sum_{i=1}^{j-1} r_{ij}q_i\Big\|_2$$

The sign of r_jj is not determined uniquely, e.g. we could choose r_jj > 0 for each j

Classical Gram-Schmidt Process

The Gram-Schmidt algorithm we have described is provided in the pseudocode below:

1: for j = 1 : n do
2:   v_j = a_(:,j)
3:   for i = 1 : j − 1 do
4:     r_ij = q_i^T a_(:,j)
5:     v_j = v_j − r_ij q_i
6:   end for
7:   r_jj = ‖v_j‖_2
8:   q_j = v_j / r_jj
9: end for

This is referred to as the classical Gram-Schmidt (CGS) method

Gram-Schmidt Orthogonalization

The only way the Gram-Schmidt process can fail is if r_jj = ‖v_j‖_2 = 0 for some j

This can only happen if a_(:,j) = Σ_{i=1}^{j−1} r_ij q_i for some j, i.e. if a_(:,j) ∈ span{q_1, q_2, ..., q_{j−1}} = span{a_(:,1), a_(:,2), ..., a_(:,j−1)}

This means that the columns of A are linearly dependent

Therefore: Gram-Schmidt fails ⟹ cols. of A linearly dependent

Gram-Schmidt Orthogonalization

Equivalently, by the contrapositive: cols. of A linearly independent ⟹ Gram-Schmidt succeeds

Theorem: Every A ∈ R^{m×n} (m ≥ n) of full rank has a unique reduced QR factorization A = Q̂R̂ with r_jj > 0

The only non-uniqueness in the Gram-Schmidt process was in the sign of r_jj, hence Q̂R̂ is unique if r_jj > 0

Gram-Schmidt Orthogonalization

Theorem: Every A ∈ R^{m×n} (m ≥ n) has a full QR factorization.

Case 1: A has full rank
> We compute the reduced QR factorization from above
> To make Q square we pad Q̂ with m − n arbitrary orthonormal columns
> We also pad R̂ with m − n rows of zeros to get R

Case 2: A doesn't have full rank
> At some point in computing the reduced QR factorization, we encounter ‖v_j‖_2 = 0
> At this point we pick an arbitrary q_j orthogonal to span{q_1, q_2, ..., q_{j−1}} and then proceed as in Case 1
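Before moving on, here is a minimal NumPy transcription of the classical Gram-Schmidt pseudocode above (the full-rank test matrix is random):

import numpy as np

# Classical Gram-Schmidt, producing the reduced QR factorization of a
# full-rank A with m >= n, following the pseudocode above
def classical_gram_schmidt(A):
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]    # r_ij = q_i^T a_(:,j)
            v -= R[i, j] * Q[:, i]
        R[j, j] = np.linalg.norm(v)        # zero here means the columns are dependent
        Q[:, j] = v / R[j, j]
    return Q, R

np.random.seed(4)
A = np.random.random((5, 3))
Q, R = classical_gram_schmidt(A)
print(np.allclose(Q @ R, A))               # A = QR
print(np.allclose(Q.T @ Q, np.eye(3)))     # orthonormal columns, up to rounding error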
Modified Gram-Schmidt Process

The classical Gram-Schmidt process is numerically unstable! (It is sensitive to rounding error: the orthogonality of the q_j degrades)

The algorithm can be reformulated to give the modified Gram-Schmidt process, which is numerically more robust

Key idea: when each new q_i is computed, orthogonalize each remaining column of A against it

Modified Gram-Schmidt Process

Modified Gram-Schmidt (MGS):

1: for i = 1 : n do
2:   v_i = a_(:,i)
3: end for
4: for i = 1 : n do
5:   r_ii = ‖v_i‖_2
6:   q_i = v_i / r_ii
7:   for j = i + 1 : n do
8:     r_ij = q_i^T v_j
9:     v_j = v_j − r_ij q_i
10:  end for
11: end for

Modified Gram-Schmidt Process

Key difference between MGS and CGS:
> In CGS we compute the orthogonalization coefficients r_ij with respect to the "raw" vector a_(:,j)
> In MGS we remove the components of a_(:,j) in span{q_1, q_2, ..., q_{i−1}} before computing r_ij

This makes no difference mathematically: in exact arithmetic the components in span{q_1, q_2, ..., q_{j−1}} are annihilated by q_j^T

But in practice it reduces the degradation of orthogonality of the q_j ⟹ superior numerical stability of MGS over CGS

Operation Count

Work in MGS is dominated by lines 8 and 9, the innermost loop:

r_ij = q_i^T v_j
v_j = v_j − r_ij q_i

The first line requires m multiplications and m − 1 additions; the second line requires m multiplications and m subtractions

Hence ∼ 4m operations per single inner iteration

Hence the total number of operations is asymptotic to

$$\sum_{i=1}^{n}\sum_{j=i+1}^{n} 4m = 4m\sum_{i=1}^{n}(n-i) \sim 2mn^2$$

Alternative QR computation methods

The QR factorization can also be computed using Householder triangularization and Givens rotations.

Both methods take the approach of applying a sequence of orthogonal matrices Q_1, Q_2, Q_3, ... to the matrix that successively remove terms below the diagonal (similar to the method employed by the LU factorization)

We will discuss Givens rotations.

A Givens rotation

For i < j and an angle θ, the elements of the m × m Givens rotation matrix G(i, j, θ) are

g_ii = c,  g_jj = c,  g_ij = s,  g_ji = −s,
g_kk = 1 for k ≠ i, j,
g_kl = 0 otherwise,

where c = cos θ and s = sin θ.

A Givens rotation

Hence the matrix has the form

$$G(i,j,\theta) = \begin{bmatrix} 1 & \cdots & 0 & \cdots & 0 & \cdots & 0\\ \vdots & \ddots & \vdots & & \vdots & & \vdots\\ 0 & \cdots & c & \cdots & s & \cdots & 0\\ \vdots & & \vdots & \ddots & \vdots & & \vdots\\ 0 & \cdots & -s & \cdots & c & \cdots & 0\\ \vdots & & \vdots & & \vdots & \ddots & \vdots\\ 0 & \cdots & 0 & \cdots & 0 & \cdots & 1\end{bmatrix}$$

It applies a rotation within the space spanned by the ith and jth coordinates

Effect of a Givens rotation

Consider an m × n rectangular matrix A where m ≥ n

Suppose that a_1 and a_2 are in the ith and jth positions in a particular column of A. Assume at least one of a_1, a_2 is non-zero.

Restricting to just the ith and jth dimensions, a Givens rotation G(i, j, θ) for a particular angle θ can be applied so that

$$\begin{bmatrix} c & s\\ -s & c\end{bmatrix}\begin{bmatrix} a_1\\ a_2\end{bmatrix} = \begin{bmatrix} \alpha\\ 0\end{bmatrix},$$

where α is non-zero, and the jth component is eliminated

Stable computation

α is given by √(a_1² + a_2²). We could compute

$$c = a_1/\sqrt{a_1^2 + a_2^2}, \qquad s = a_2/\sqrt{a_1^2 + a_2^2}$$

but this is susceptible to underflow/overflow if α is very small.

A better procedure is as follows:
> if |a_1| > |a_2|, set t = tan θ = a_2/a_1, and hence c = 1/√(1 + t²), s = ct.
> if |a_2| ≥ |a_1|, set τ = cot θ = a_1/a_2, and hence s = 1/√(1 + τ²), c = sτ.

Givens rotation algorithm

To perform the Givens procedure on a dense m × n rectangular matrix A where m ≥ n, the following algorithm can be used:

1: R = A, Q = I
2: for k = 1 : n do
3:   for j = m down to k + 1 do
4:     Construct G = G(k, j, θ) to eliminate the (j, k) entry of R
5:     R = GR
6:     Q = QG^T
7:   end for
8: end for
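Here is a minimal NumPy sketch of this Givens QR procedure (forming each full m × m rotation matrix G explicitly is wasteful, since only two rows change, but it keeps the sketch close to the pseudocode; the test matrix is random):

import numpy as np

# QR via Givens rotations, following the algorithm above: work up each column,
# using a rotation of rows (k, j) to eliminate the (j, k) entry of R
def givens_qr(A):
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for k in range(n):
        for j in range(m - 1, k, -1):          # j = m down to k + 1
            a1, a2 = R[k, k], R[j, k]
            if a2 == 0.0:
                continue
            # Stable computation of c and s, as described above
            if abs(a1) > abs(a2):
                t = a2 / a1
                c = 1.0 / np.sqrt(1.0 + t * t)
                s = c * t
            else:
                tau = a1 / a2
                s = 1.0 / np.sqrt(1.0 + tau * tau)
                c = s * tau
            G = np.eye(m)
            G[k, k] = c; G[j, j] = c
            G[k, j] = s; G[j, k] = -s
            R = G @ R                          # zeros out R[j, k]
            Q = Q @ G.T
    return Q, R

np.random.seed(5)
A = np.random.random((5, 3))
Q, R = givens_qr(A)
print(np.allclose(Q @ R, A), np.allclose(R, np.triu(R), atol=1e-12))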
Givens rotation advantages

In general, for dense matrices, Givens rotations are not as efficient as the other two approaches (Gram-Schmidt and Householder)

However, they are advantageous for sparse matrices, since non-zero entries can be eliminated one-by-one

They are also amenable to parallelization. Consider the 6 × 6 matrix:

$$\begin{bmatrix}
\times & \times & \times & \times & \times & \times\\
5 & \times & \times & \times & \times & \times\\
4 & 6 & \times & \times & \times & \times\\
3 & 5 & 7 & \times & \times & \times\\
2 & 4 & 6 & 8 & \times & \times\\
1 & 3 & 5 & 7 & 9 & \times
\end{bmatrix}$$

The numbers represent the steps at which a particular matrix entry can be eliminated. e.g. on step 3, elements (4,1) and (6,2) can be eliminated concurrently using G(3, 4, θ_1) and G(5, 6, θ_2), respectively, since these two matrices operate on different rows.

Singular Value Decomposition

The Singular Value Decomposition (SVD) is a very useful matrix factorization

Motivation for SVD: the image of the unit sphere, S, under any m × n matrix is a hyperellipse

A hyperellipse is obtained by stretching the unit sphere in R^m by factors σ_1, ..., σ_m in orthogonal directions u_1, ..., u_m

Singular Value Decomposition

For A ∈ R^{2×2}, we have

[Figure: the unit circle S, with orthonormal vectors v_1, v_2, is mapped by A to an ellipse AS with principal semiaxes σ_1 u_1 and σ_2 u_2]

Singular Value Decomposition

Based on this picture, we make some definitions:
> Singular values: σ_1, σ_2, ..., σ_n ≥ 0 (we typically assume σ_1 ≥ σ_2 ≥ ...)
> Left singular vectors: {u_1, u_2, ..., u_n}, unit vectors in the directions of the principal semiaxes of AS
> Right singular vectors: {v_1, v_2, ..., v_n}, preimages of the u_i so that Av_i = σ_i u_i, i = 1, ..., n

(The names "left" and "right" come from the formula for the SVD below)

Singular Value Decomposition

The key equation above is that Av_i = σ_i u_i, i = 1, ..., n

Writing this out in matrix form we get

$$A\begin{bmatrix} v_1 \,\Big|\, v_2 \,\Big|\, \cdots \,\Big|\, v_n\end{bmatrix} = \begin{bmatrix} u_1 \,\Big|\, u_2 \,\Big|\, \cdots \,\Big|\, u_n\end{bmatrix}\begin{bmatrix}\sigma_1 & & &\\ & \sigma_2 & &\\ & & \ddots &\\ & & & \sigma_n\end{bmatrix}$$

Or more compactly: AV = ÛΣ̂

Singular Value Decomposition

Here
> Σ̂ ∈ R^{n×n} is diagonal with non-negative, real entries
> Û ∈ R^{m×n} with orthonormal columns
> V ∈ R^{n×n} with orthonormal columns

Therefore V is an orthogonal matrix (V^T V = VV^T = I), so that we have the reduced SVD for A ∈ R^{m×n}:

A = ÛΣ̂V^T

Singular Value Decomposition

Just as with QR, we can pad the columns of Û with m − n arbitrary orthogonal vectors to obtain U ∈ R^{m×m}

We then need to "silence" these arbitrary columns by adding rows of zeros to Σ̂ to obtain Σ

This gives the full SVD for A ∈ R^{m×n}:

A = UΣV^T

Full vs Reduced SVD

[Figure: schematic comparison of the full SVD (U is m × m, Σ is m × n) and the reduced SVD (Û is m × n, Σ̂ is n × n)]

Singular Value Decomposition

Theorem: Every matrix A ∈ R^{m×n} has a full singular value decomposition. Furthermore:
> The σ_j are uniquely determined
> If A is square and the σ_j are distinct, the {u_j} and {v_j} are uniquely determined up to sign

Singular Value Decomposition

This theorem justifies the statement that the image of the unit sphere under any m × n matrix is a hyperellipse

Consider A = UΣV^T (full SVD) applied to the unit sphere, S, in R^n:
1. The orthogonal map V^T preserves S
2. Σ stretches S into a hyperellipse aligned with the canonical axes e_j
3. U rotates or reflects the hyperellipse without changing its shape

SVD in Python

Python's numpy.linalg.svd function computes the full SVD of a matrix:

>>> import numpy as np
>>> a = np.random.random((4,2))
>>> (u, s, v) = np.linalg.svd(a)
>>> u
[4 x 4 orthogonal matrix]
>>> s
array([ 1.56149162,  0.24419604])
>>> v
[2 x 2 orthogonal matrix]
SVD in Python

The full_matrices=0 option computes the reduced SVD:

>>> import numpy as np
>>> a = np.random.random((4,2))
>>> (u, s, v) = np.linalg.svd(a, full_matrices=0)
>>> u
[4 x 2 matrix with orthonormal columns]
>>> s
array([ 1.56149162,  0.24419604])
>>> v
[2 x 2 orthogonal matrix]

Matrix Properties via the SVD

• The rank of A is r, the number of nonzero singular values⁷

Proof: In the full SVD A = UΣV^T, U and V^T have full rank, hence it follows from linear algebra that rank(A) = rank(Σ)

• image(A) = span{u_1, ..., u_r} and null(A) = span{v_{r+1}, ..., v_n}

Proof: This follows from A = UΣV^T and

image(Σ) = span{e_1, ..., e_r} ⊆ R^m
null(Σ) = span{e_{r+1}, ..., e_n} ⊆ R^n

⁷This also gives us a good way to define rank in finite precision: the number of singular values larger than some (small) tolerance

Matrix Properties via the SVD

• ‖A‖_2 = σ_1

Proof: Recall that ‖A‖_2 = max_{‖v‖_2=1} ‖Av‖_2. Geometrically, we see that ‖Av‖_2 is maximized if v = v_1, giving Av = σ_1 u_1.

• The singular values of A are the square roots of the eigenvalues of A^T A or AA^T

Proof: (Analogous for AA^T)

$$A^T A = (U\Sigma V^T)^T(U\Sigma V^T) = V\Sigma U^T U\Sigma V^T = V(\Sigma^T\Sigma)V^T,$$

hence (A^T A)V = V(Σ^T Σ), or (A^T A)v_(:,j) = σ_j² v_(:,j)

Matrix Properties via the SVD

The pseudoinverse, A⁺, can be defined more generally in terms of the SVD

Define the pseudoinverse of a scalar σ to be 1/σ if σ ≠ 0 and zero otherwise

Define the pseudoinverse of a (possibly rectangular) diagonal matrix by transposing the matrix and taking the pseudoinverse of each entry

The pseudoinverse of A ∈ R^{m×n} is then defined as

A⁺ = VΣ⁺U^T

A⁺ exists for any matrix A, and it captures our definitions of the pseudoinverse from previously

Matrix Properties via the SVD

We generalize the condition number to rectangular matrices via the definition κ(A) = ‖A‖‖A⁺‖

We can use the SVD to compute the 2-norm condition number:
> ‖A‖_2 = σ_max
> The largest singular value of A⁺ is 1/σ_min, so that ‖A⁺‖_2 = 1/σ_min

Hence κ(A) = σ_max/σ_min

Matrix Properties via the SVD

These results indicate the importance of the SVD, both theoretically and as a computational tool

Algorithms for calculating the SVD are an important topic in Numerical Linear Algebra, but outside the scope of this course

Computing the SVD requires ∼ 4mn² − (4/3)n³ operations

Low-Rank Approximation via the SVD

One of the most useful properties of the SVD is that it allows us to obtain an optimal low-rank approximation to A

We can recast the SVD as

$$A = \sum_{j=1}^{r}\sigma_j u_j v_j^T$$

This follows from writing Σ as a sum of r matrices, Σ_j = diag(0, ..., 0, σ_j, 0, ..., 0)

Each u_j v_j^T is a rank-one matrix: each column is a scaled version of u_j

Low-Rank Approximation via the SVD

Theorem: For any 0 ≤ k ≤ r, define $A_k = \sum_{j=1}^{k}\sigma_j u_j v_j^T$; then

$$\|A - A_k\|_2 = \inf_{B\in\mathbb{R}^{m\times n},\ \mathrm{rank}(B)\le k}\|A - B\|_2 = \sigma_{k+1}$$

The error in A_k is given by the first omitted singular value

Low-Rank Approximation via the SVD

A similar result holds in the Frobenius norm:

$$\|A - A_k\|_F = \inf_{B\in\mathbb{R}^{m\times n},\ \mathrm{rank}(B)\le k}\|A - B\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}$$

Low-Rank Approximation via the SVD

These theorems indicate that the SVD is an effective way to compress data encapsulated by a matrix!

If the singular values of A decay rapidly, we can approximate A with a few rank-one matrices (we only need to store σ_j, u_j, v_j for j = 1, ..., k)

Example: Image compression via the SVD
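As a stand-in for an actual image, the idea can be demonstrated on a smooth synthetic matrix whose singular values decay quickly; here is a minimal NumPy sketch (the test matrix, sizes and ranks are arbitrary):

import numpy as np

# Low-rank approximation via the SVD, using a smooth synthetic "image"
x = np.linspace(0.0, 1.0, 200)
y = np.linspace(0.0, 1.0, 150)
A = np.exp(-np.outer(y, x)) + 0.5 * np.sin(3 * np.pi * np.outer(y, x))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

m, n = A.shape
for k in [1, 2, 5, 10]:
    Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # A_k = sum of the first k rank-one terms
    print(k, np.linalg.norm(A - Ak, 2), s[k])      # ||A - A_k||_2 equals sigma_{k+1}

# Storage for the rank-k approximation: k*(1 + m + n) numbers instead of m*n
print("full:", m * n, "rank 10:", 10 * (1 + m + n))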
