Newton's Method

Constructing fixed-point iterations can require some ingenuity: we need to rewrite f(x) = 0 in a form x = g(x), with appropriate properties on g.

To obtain a more generally applicable iterative method, let us consider the fixed-point iteration

    x_{k+1} = x_k - λ(x_k) f(x_k),   k = 0, 1, 2, ...

corresponding to g(x) = x - λ(x) f(x), for some function λ.

A fixed point α of g yields a solution to f(α) = 0 (except possibly when λ(α) = 0), which is what we are trying to achieve!

Recall that the asymptotic convergence rate is dictated by |g'(α)|, so we would like to have |g'(α)| = 0 to get superlinear convergence.

Suppose (as stated above) that f(α) = 0; then

    g'(α) = 1 - λ'(α) f(α) - λ(α) f'(α) = 1 - λ(α) f'(α)

Hence to satisfy g'(α) = 0 we choose λ(x) = 1/f'(x) to get Newton's method:

    x_{k+1} = x_k - f(x_k)/f'(x_k),   k = 0, 1, 2, ...

Based on fixed-point iteration theory, Newton's method is convergent since |g'(α)| = 0 < 1. However, we need a different argument to understand the superlinear convergence rate properly. To do this, we Taylor expand f about x_k and evaluate at α:

    0 = f(α) = f(x_k) + (α - x_k) f'(x_k) + ((α - x_k)²/2) f''(θ_k)

for some θ_k ∈ (α, x_k).

Dividing through by f'(x_k) and using x_{k+1} = x_k - f(x_k)/f'(x_k) gives

    x_{k+1} - α = (f''(θ_k) / (2 f'(x_k))) (x_k - α)²

Hence, roughly speaking, the error at iteration k+1 is the square of the error at iteration k. This is referred to as quadratic convergence, which is very rapid!

Key point: once again we need to be sufficiently close to α to get quadratic convergence (the result relied on a Taylor expansion near α).

Secant Method

An alternative to Newton's method is to approximate f'(x_k) using the finite difference

    f'(x_k) ≈ (f(x_k) - f(x_{k-1})) / (x_k - x_{k-1})

Substituting this into the Newton iteration leads to the secant method

    x_{k+1} = x_k - f(x_k) (x_k - x_{k-1}) / (f(x_k) - f(x_{k-1})),   k = 1, 2, 3, ...

The main advantages of the secant method are:
> it does not require us to determine f'(x) analytically
> it requires only one extra function evaluation, f(x_k), per iteration (Newton's method also requires f'(x_k))

As one might expect, the secant method converges faster than a fixed-point iteration, but slower than Newton's method. In fact, it can be shown that for the secant method

    lim_{k→∞} |x_{k+1} - α| / |x_k - α|^q = μ

where μ is a positive constant and q ≈ 1.6.
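To make the 1D iteration concrete, here is a minimal Python sketch of Newton's method as derived above; the test problem f(x) = x² - 2 is my own illustration, not from the slides.

    import math

    def newton(f, fprime, x0, tol=1e-12, max_iter=50):
        """Newton's method: x_{k+1} = x_k - f(x_k)/f'(x_k)."""
        x = x0
        for _ in range(max_iter):
            fx = f(x)
            if abs(fx) < tol:          # converged
                break
            x = x - fx / fprime(x)     # Newton step
        return x

    # Solve x^2 - 2 = 0; the root is sqrt(2)
    print(newton(lambda x: x*x - 2, lambda x: 2*x, x0=1.0), math.sqrt(2))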
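And a corresponding sketch of the secant method, which needs two starting values but no derivative:

    def secant(f, x0, x1, tol=1e-12, max_iter=50):
        """Secant method: Newton with f'(x_k) replaced by a finite difference."""
        for _ in range(max_iter):
            f0, f1 = f(x0), f(x1)
            if abs(f1) < tol:          # converged
                break
            x0, x1 = x1, x1 - f1 * (x1 - x0) / (f1 - f0)
        return x1

    print(secant(lambda x: x*x - 2, x0=1.0, x1=2.0))   # ~1.414213562...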
Multivariate Case

Systems of Nonlinear Equations

We now consider fixed-point iterations and Newton's method for systems of nonlinear equations.

We suppose that F : R^n → R^n, n > 1, and we seek a root α ∈ R^n such that F(α) = 0. In component form, this is equivalent to

    F_1(α) = 0
    F_2(α) = 0
    ...
    F_n(α) = 0

Fixed-Point Iteration

For a fixed-point iteration, we again seek to rewrite F(x) = 0 as x = G(x) to obtain

    x_{k+1} = G(x_k)

The convergence proof is the same as in the scalar case, if we replace |·| with ||·||: if ||G(x) - G(y)|| ≤ L ||x - y||, then ||x_k - α|| ≤ L^k ||x_0 - α||. Hence, as before, if G is a contraction the iteration will converge to a fixed point α.

Recall that we define the Jacobian matrix J_G ∈ R^{n×n} to be (J_G)_{ij} = ∂G_i/∂x_j. If ||J_G(α)||_∞ < 1, then there is some neighborhood of α for which the fixed-point iteration converges to α. The proof of this is a natural extension of the corresponding scalar result.

Once again, we can employ a fixed-point iteration to solve F(x) = 0, e.g. consider

    x_1² + x_2² - 1 = 0
    5x_1² + 21x_2² - 9 = 0

This can be rearranged to

    x_1 = sqrt(1 - x_2²),   x_2 = sqrt((9 - 5x_1²)/21)

Hence, we define

    G_1(x_1, x_2) = sqrt(1 - x_2²),   G_2(x_1, x_2) = sqrt((9 - 5x_1²)/21)

This yields a convergent iterative method.
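A minimal Python sketch of this fixed-point iteration (the starting guess is my own choice, not from the slides):

    import numpy as np

    def G(x):
        """Fixed-point map for x1^2 + x2^2 = 1, 5x1^2 + 21x2^2 = 9."""
        return np.array([np.sqrt(1 - x[1]**2),
                         np.sqrt((9 - 5 * x[0]**2) / 21)])

    x = np.array([0.5, 0.5])   # starting guess (assumed)
    for k in range(30):
        x = G(x)
    print(x)                   # converges to approximately [0.8660, 0.5]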
Newton's Method

As in the one-dimensional case, Newton's method is generally more useful than a standard fixed-point iteration. The natural generalization of Newton's method is

    x_{k+1} = x_k - J_F(x_k)^{-1} F(x_k),   k = 0, 1, 2, ...

Note that to put Newton's method in the standard form for a linear system, we write

    J_F(x_k) Δx_k = -F(x_k),   k = 0, 1, 2, ...

where Δx_k = x_{k+1} - x_k.

Once again, if x_0 is sufficiently close to α, then Newton's method converges quadratically; we sketch the proof below. This result again relies on Taylor's theorem, hence we first consider how to generalize the familiar one-dimensional Taylor's theorem to R^n. First, we consider the case F : R^n → R.

Multivariate Taylor Theorem

Let g(s) = F(x + sδ); then the one-dimensional Taylor theorem yields

    g(1) = g(0) + Σ_{ℓ=1}^k g^{(ℓ)}(0)/ℓ! + g^{(k+1)}(η)/(k+1)!,   η ∈ (0, 1)

Also, we have g(0) = F(x), g(1) = F(x + δ), and

    g'(0)  = (∂F(x)/∂x_1) δ_1 + ... + (∂F(x)/∂x_n) δ_n
    g''(0) = Σ_{i=1}^n Σ_{j=1}^n (∂²F(x)/∂x_i∂x_j) δ_i δ_j,   etc.

Hence, we have

    F(x + δ) = F(x) + Σ_{ℓ=1}^k (1/ℓ!) [ (δ_1 ∂/∂x_1 + ... + δ_n ∂/∂x_n)^ℓ F ](x) + E_k

where

    E_k = (1/(k+1)!) [ (δ_1 ∂/∂x_1 + ... + δ_n ∂/∂x_n)^{k+1} F ](x + ηδ),   η ∈ (0, 1)

Let A_{k+1} be an upper bound on the absolute values of all derivatives of F of order k+1; then

    |E_k| ≤ (1/(k+1)!) n^{k+1} A_{k+1} ||δ||_∞^{k+1}

where the factor n^{k+1} follows from the fact that there are n^{k+1} terms in the sum (i.e. there are n^{k+1} derivatives of order k+1).

We shall only need an expansion up to first-order terms for the analysis of Newton's method. From our expression above, we can write the first-order Taylor expansion succinctly as

    F(x + δ) = F(x) + ∇F(x)ᵀδ + E_1

For F : R^n → R^n, a Taylor expansion follows by developing a Taylor expansion for each component F_i, hence

    F_i(x + δ) = F_i(x) + ∇F_i(x)ᵀδ + E_{1,i}

so that for F : R^n → R^n we have

    F(x + δ) = F(x) + J_F(x)δ + E_F

where

    ||E_F||_∞ ≤ max_{1≤i≤n} |E_{1,i}| ≤ (n²/2) ( max_{1≤i,j,ℓ≤n} |∂²F_i/∂x_j∂x_ℓ| ) ||δ||_∞²

Newton's Method

We now return to Newton's method. We have

    0 = F(α) = F(x_k) + J_F(x_k)[α - x_k] + E_F

so that

    x_k - α = J_F(x_k)^{-1} F(x_k) + J_F(x_k)^{-1} E_F

Also, the Newton iteration itself can be rewritten as

    J_F(x_k)[x_{k+1} - α] = J_F(x_k)[x_k - α] - F(x_k)

Hence, we obtain

    x_{k+1} - α = J_F(x_k)^{-1} E_F

so that ||x_{k+1} - α||_∞ ≤ const · ||x_k - α||_∞², i.e. quadratic convergence!

Example: Newton's method for the two-point Gauss quadrature rule. Recall the system of equations

    F_1(x_1, x_2, w_1, w_2) = w_1 + w_2 - 2 = 0
    F_2(x_1, x_2, w_1, w_2) = w_1 x_1 + w_2 x_2 = 0
    F_3(x_1, x_2, w_1, w_2) = w_1 x_1² + w_2 x_2² - 2/3 = 0
    F_4(x_1, x_2, w_1, w_2) = w_1 x_1³ + w_2 x_2³ = 0

We can solve this in Python using our own implementation of Newton's method. To do this, we require the Jacobian of this system:

    J_F(x_1, x_2, w_1, w_2) = [ 0           0           1     1    ]
                              [ w_1         w_2         x_1   x_2  ]
                              [ 2w_1 x_1    2w_2 x_2    x_1²  x_2² ]
                              [ 3w_1 x_1²   3w_2 x_2²   x_1³  x_2³ ]

Alternatively, we can use Python's built-in fsolve function (from scipy.optimize). Note that fsolve computes a finite difference approximation to the Jacobian by default, or we can pass in an analytical Jacobian if we want. Matlab has an equivalent fsolve function.

With either approach and with starting guess x_0 = [-1, 1, 1, 1], we get

    x = [-0.577350269189626, 0.577350269189626, 1.000000000000000, 1.000000000000000]
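A sketch of the fsolve approach (the function name is my own; the slides only report the output):

    import numpy as np
    from scipy.optimize import fsolve

    def F(z):
        """Two-point Gauss quadrature conditions; z = (x1, x2, w1, w2)."""
        x1, x2, w1, w2 = z
        return [w1 + w2 - 2,
                w1*x1 + w2*x2,
                w1*x1**2 + w2*x2**2 - 2/3,
                w1*x1**3 + w2*x2**3]

    z = fsolve(F, [-1, 1, 1, 1])   # finite-difference Jacobian by default
    print(z)   # [-0.57735027  0.57735027  1.  1.], i.e. x = -+1/sqrt(3), w = 1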
Conditions for Optimality

Existence of Global Minimum

In order to guarantee existence and uniqueness of a global minimum we need to make assumptions about the objective function, e.g. if f is continuous on a closed and bounded set S ⊂ R^n then it has a global minimum in S. (A set is closed if it contains its own boundary.)

In one dimension, this says f achieves a minimum on the interval [a, b] ⊂ R. In general f does not achieve a minimum on the open interval (a, b), e.g. consider f(x) = x. (Though inf_{x∈(a,b)} f(x), the largest lower bound of f on (a, b), is well-defined.)

Another helpful concept for existence of a global minimum is coercivity. A continuous function f on an unbounded set S ⊂ R^n is coercive if

    lim_{||x||→∞} f(x) = +∞

That is, f(x) must be large whenever ||x|| is large.

If f is coercive on a closed, unbounded set S, then f has a global minimum in S.

Proof: From the definition of coercivity, for any M ∈ R there exists r > 0 such that f(x) ≥ M for all x ∈ S with ||x|| ≥ r. Suppose that 0 ∈ S, and set M = f(0). Let Y = {x ∈ S : ||x|| ≥ r}, so that f(x) ≥ f(0) for all x ∈ Y. And we already know that f achieves a minimum (which must be at most f(0)) on the closed, bounded set {x ∈ S : ||x|| ≤ r}; that minimum is therefore a global minimum of f on S. □

Examples:
> f(x) = x³ is not coercive on R (f → -∞ for x → -∞)
> f(x, y) = x² + y² is coercive on R² (global minimum at (0, 0))
> f(x) = e^x is not coercive on R (f → 0 for x → -∞)

Convexity

An important concept for uniqueness is convexity. A set S ⊂ R^n is convex if it contains the line segment between any two of its points; that is, S is convex if for any x, y ∈ S we have

    {θx + (1 - θ)y : θ ∈ [0, 1]} ⊂ S

Similarly, we define convexity of a function f : S ⊂ R^n → R. f is convex if its graph along any line segment in S is on or below the chord connecting the function values, i.e. f is convex if for any x, y ∈ S and any θ ∈ (0, 1) we have

    f(θx + (1 - θ)y) ≤ θ f(x) + (1 - θ) f(y)

Also, if

    f(θx + (1 - θ)y) < θ f(x) + (1 - θ) f(y)

then f is strictly convex.

(Plots omitted: a strictly convex function, a non-convex function, and a convex but not strictly convex function.)

If f is a convex function on a convex set S, then any local minimum of f must be a global minimum. (A global minimum is defined as a point z such that f(z) ≤ f(x) for all x ∈ S. Note that a global minimum may not be unique, e.g. if f(x) = -cos x then 0 and 2π are both global minima.)

Proof: Suppose x is a local minimum, i.e. f(x) ≤ f(y) for y ∈ B(x, ε), where B(x, ε) = {y ∈ S : ||y - x|| ≤ ε}. Suppose that x is not a global minimum, i.e. that there exists w ∈ S such that f(w) < f(x). (Then we will show that this gives a contradiction.)

For θ ∈ (0, 1] we have f(θw + (1 - θ)x) ≤ θ f(w) + (1 - θ) f(x). Let σ ∈ (0, 1] be sufficiently small so that

    z = σw + (1 - σ)x ∈ B(x, ε)

Then

    f(z) ≤ σ f(w) + (1 - σ) f(x) < σ f(x) + (1 - σ) f(x) = f(x)

i.e. f(z) < f(x), which contradicts that x is a local minimum! Hence we cannot have w ∈ S such that f(w) < f(x). □

Note that convexity does not guarantee uniqueness of the global minimum, e.g. a convex function can clearly have a "horizontal" section (see the plots described above). If f is a strictly convex function on a convex set S, then a local minimum of f is the unique global minimum. Optimization of convex functions over convex sets is called convex optimization, which is an important subfield of optimization.

Optimality Conditions

We have discussed existence and uniqueness of minima, but haven't considered how to find a minimum. The familiar optimization idea from calculus in one dimension is: set the derivative to zero, then check the sign of the second derivative. This can be generalized to R^n.

If f : R^n → R is differentiable, then the gradient vector ∇f : R^n → R^n is

    ∇f(x) = [∂f(x)/∂x_1, ..., ∂f(x)/∂x_n]ᵀ

The importance of the gradient is that ∇f points "uphill", i.e. towards points with larger values than f(x), and similarly -∇f points "downhill".

This follows from Taylor's theorem for f : R^n → R. Recall that

    f(x + δ) = f(x) + ∇f(x)ᵀδ + H.O.T.

Let δ = -ε∇f(x) for ε > 0 and suppose that ∇f(x) ≠ 0; then

    f(x - ε∇f(x)) ≈ f(x) - ε ∇f(x)ᵀ∇f(x) < f(x)

Also, we see from the Cauchy-Schwarz inequality that -∇f(x) is the steepest descent direction.

Similarly, we see that a necessary condition for a local minimum at x* ∈ S is that ∇f(x*) = 0. In this case there is no "downhill direction" at x*. The condition ∇f(x*) = 0 is called a first-order necessary condition for optimality, since it only involves first derivatives.

Any x* ∈ S that satisfies the first-order optimality condition is called a critical point of f. But of course a critical point can be a local minimum, local maximum, or saddle point. (Recall that a saddle point is where some directions are "downhill" and others are "uphill", e.g. (x, y) = (0, 0) for f(x, y) = x² - y².)

As in the one-dimensional case, we can look to second derivatives to classify critical points. If f : R^n → R is twice differentiable, then the Hessian is the matrix-valued function H_f : R^n → R^{n×n} with entries

    (H_f(x))_{ij} = ∂²f(x) / ∂x_i ∂x_j

The Hessian is the Jacobian matrix of the gradient ∇f : R^n → R^n. If the second partial derivatives of f are continuous, then ∂²f/∂x_i∂x_j = ∂²f/∂x_j∂x_i, and H_f is symmetric.

Suppose we have found a critical point x*, so that ∇f(x*) = 0. From Taylor's theorem, for δ ∈ R^n we have

    f(x* + δ) = f(x*) + ∇f(x*)ᵀδ + (1/2) δᵀ H_f(x* + ηδ) δ
              = f(x*) + (1/2) δᵀ H_f(x* + ηδ) δ

for some η ∈ (0, 1).

Recall positive definiteness: A is positive definite if xᵀAx > 0 for all x ≠ 0. Suppose H_f(x*) is positive definite. Then (by continuity) H_f(x* + ηδ) is also positive definite for ||δ|| sufficiently small, so that δᵀ H_f(x* + ηδ) δ > 0. Hence we have f(x* + δ) > f(x*) for ||δ|| sufficiently small, i.e. x* is a local minimum.

Hence, in general, positive definiteness of H_f at a critical point x* is a second-order sufficient condition for a local minimum.

A matrix can also be negative definite (xᵀAx < 0 for all x ≠ 0) or indefinite (there exist x, y such that xᵀAx < 0 < yᵀAy). Then we can classify critical points as follows:
> H_f(x*) positive definite ⟹ x* is a local minimum
> H_f(x*) negative definite ⟹ x* is a local maximum
> H_f(x*) indefinite ⟹ x* is a saddle point

Also, positive definiteness of the Hessian is closely related to convexity of f. If H_f(x) is positive definite, then f is convex on some convex neighborhood of x; if H_f(x) is positive definite for all x ∈ S, where S is a convex set, then f is convex on S.

Question: How do we test for positive definiteness?

Answer: A is positive (resp. negative) definite if and only if all eigenvalues of A are positive (resp. negative). (This is related to the Rayleigh quotient; see Unit V.) Also, a matrix with both positive and negative eigenvalues is indefinite. Hence we can compute all the eigenvalues of A and check their signs.

Example

Consider

    f(x) = 2x_1³ + 3x_1² + 12x_1x_2 + 3x_2² - 6x_2 + 6

Then

    ∇f(x) = [ 6x_1² + 6x_1 + 12x_2 ]
            [ 12x_1 + 6x_2 - 6     ]

We set ∇f(x) = 0 to find the critical points [1, -1]ᵀ and [2, -3]ᵀ. (In general, solving ∇f(x) = 0 requires an iterative method.)

The Hessian is

    H_f(x) = [ 12x_1 + 6   12 ]
             [ 12          6  ]

and hence

    H_f(1, -1) = [ 18  12 ],  which has eigenvalues ≈ 25.4, -1.4
                 [ 12   6 ]

    H_f(2, -3) = [ 30  12 ],  which has eigenvalues ≈ 35.0, 1.0
                 [ 12   6 ]

Hence [2, -3]ᵀ is a local minimum, whereas [1, -1]ᵀ is a saddle point.
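A quick numerical check of this classification with numpy (my own sketch, not from the slides):

    import numpy as np

    def hessian(x1):
        """Hessian of f(x) = 2x1^3 + 3x1^2 + 12x1x2 + 3x2^2 - 6x2 + 6."""
        return np.array([[12*x1 + 6, 12],
                         [12,         6]])

    for point in [(1, -1), (2, -3)]:
        eigs = np.linalg.eigvalsh(hessian(point[0]))  # symmetric, so eigvalsh
        print(point, eigs)
    # (1, -1): eigenvalues ~ [-1.4, 25.4] -> indefinite    -> saddle point
    # (2, -3): eigenvalues ~ [ 1.0, 35.0] -> pos. definite -> local minimum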
Optimality Conditions: Equality Constrained Case

So far we have ignored constraints. Let us now consider equality constrained optimization:

    min_{x∈R^n} f(x)  subject to  g(x) = 0

where f : R^n → R and g : R^n → R^m, i.e. with m ≤ n equality constraints. We now have a set of constraint gradient vectors ∇g_i, i = 1, ..., m, and the feasible set is S = {x ∈ R^n : g_i(x) = 0, i = 1, ..., m}.

Any "tangent direction" at x ∈ S must be orthogonal to all the gradient vectors {∇g_i(x), i = 1, ..., m} to remain in S. Let

    T(x) = {v ∈ R^n : ∇g_i(x)ᵀv = 0, i = 1, ..., m}

denote the orthogonal complement of span{∇g_i(x), i = 1, ..., m}. Then, for δ ∈ T(x) and ε > 0, εδ is a step in a "tangent direction" of S at x.

Since we have

    f(x* + εδ) = f(x*) + ε∇f(x*)ᵀδ + H.O.T.

it follows that for a stationary point we need ∇f(x*)ᵀδ = 0 for all δ ∈ T(x*).

Hence, we require that at a stationary point x* ∈ S we have

    ∇f(x*) ∈ span{∇g_i(x*), i = 1, ..., m}

This can be written succinctly as a linear system

    ∇f(x*) = -(J_g(x*))ᵀ λ*

for some λ* ∈ R^m (the sign of λ* is just a convention), where (J_g(x*))ᵀ ∈ R^{n×m}. This follows because the columns of (J_g(x*))ᵀ are the vectors {∇g_i(x*), i = 1, ..., m}.

We can write equality constrained optimization problems more succinctly by introducing the Lagrangian function L : R^{n+m} → R,

    L(x, λ) = f(x) + λᵀg(x) = f(x) + λ_1 g_1(x) + ... + λ_m g_m(x)

Then we have

    ∂L(x, λ)/∂x_i = ∂f(x)/∂x_i + Σ_{k=1}^m λ_k ∂g_k(x)/∂x_i,   ∂L(x, λ)/∂λ_i = g_i(x)

Hence

    ∇L(x, λ) = [ ∇_x L(x, λ) ]  =  [ ∇f(x) + J_g(x)ᵀλ ]
               [ ∇_λ L(x, λ) ]     [ g(x)             ]

so that the first-order necessary condition for optimality for the constrained problem can be written as a nonlinear system (n + m variables, n + m equations):

    ∇L(x, λ) = [ ∇f(x) + J_g(x)ᵀλ ] = 0
               [ g(x)             ]

As another example of equality constrained optimization, recall our underdetermined linear least squares problem:

    min_b f(b)  subject to  g(b) = 0

where f(b) = bᵀb, g(b) = Ab - y, and A ∈ R^{m×n} with m < n.

Newton's method can also be applied directly to unconstrained optimization, solving ∇f(x) = 0 via H_f(x_k)s_k = -∇f(x_k), x_{k+1} = x_k + s_k, with eventual quadratic convergence when started sufficiently close to a minimum.

Quasi-Newton Methods

Newton's method is effective for optimization, but it can be unreliable, expensive, and complicated:
> Unreliable: only converges when sufficiently close to a minimum
> Expensive: the Hessian H_f is dense in general, hence very expensive if n is large
> Complicated: it can be impractical or laborious to derive the Hessian

Hence there has been much interest in so-called quasi-Newton methods, which do not require the Hessian.

The general form of quasi-Newton methods is

    x_{k+1} = x_k - α_k B_k^{-1} ∇f(x_k)

where α_k is a line search parameter and B_k is some approximation to the Hessian. Quasi-Newton methods generally lose the quadratic convergence of Newton's method, but often superlinear convergence is achieved. We now consider some specific quasi-Newton methods.

BFGS

The Broyden-Fletcher-Goldfarb-Shanno (BFGS) method is one of the most popular quasi-Newton methods:

    1: choose initial guess x_0
    2: choose B_0, initial Hessian guess, e.g. B_0 = I
    3: for k = 0, 1, 2, ... do
    4:     solve B_k s_k = -∇f(x_k)
    5:     x_{k+1} = x_k + s_k
    6:     y_k = ∇f(x_{k+1}) - ∇f(x_k)
    7:     B_{k+1} = B_k + ΔB_k
    8: end for

where

    ΔB_k = (y_k y_kᵀ)/(y_kᵀ s_k) - (B_k s_k s_kᵀ B_k)/(s_kᵀ B_k s_k)

An actual implementation of BFGS stores and updates the inverse Hessian approximation, to avoid solving a linear system at each step:

    1: choose initial guess x_0
    2: choose H_0, initial inverse Hessian guess, e.g. H_0 = I
    3: for k = 0, 1, 2, ... do
    4:     s_k = -H_k ∇f(x_k)
    5:     x_{k+1} = x_k + s_k
    6:     y_k = ∇f(x_{k+1}) - ∇f(x_k)
    7:     H_{k+1} = (I - ρ_k s_k y_kᵀ) H_k (I - ρ_k y_k s_kᵀ) + ρ_k s_k s_kᵀ
    8: end for

where ρ_k = 1/(y_kᵀ s_k).

BFGS is implemented as the fmin_bfgs function in scipy.optimize. Also, BFGS (with a trust region) is implemented in Matlab's fminunc function, e.g.

    x0 = [5;5];
    options = optimset('GradObj','on');
    [x,fval,exitflag,output] = ...
        fminunc(@himmelblau_function,x0,options);
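For comparison, a minimal scipy sketch minimizing the Himmelblau function with BFGS (using the modern minimize interface; the starting guess mirrors the Matlab snippet above):

    from scipy.optimize import minimize

    def himmelblau(x):
        """Himmelblau's function: four local minima, all with f = 0."""
        return (x[0]**2 + x[1] - 11)**2 + (x[0] + x[1]**2 - 7)**2

    res = minimize(himmelblau, x0=[5.0, 5.0], method='BFGS')
    print(res.x, res.fun)   # typically converges to the minimum near (3, 2)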
Conjugate Gradient Method

The conjugate gradient (CG) method is another alternative to Newton's method that does not require the Hessian:

    1: choose initial guess x_0
    2: g_0 = ∇f(x_0)
    3: s_0 = -g_0
    4: for k = 0, 1, 2, ... do
    5:     choose η_k to minimize f(x_k + η_k s_k)
    6:     x_{k+1} = x_k + η_k s_k
    7:     g_{k+1} = ∇f(x_{k+1})
    8:     β_{k+1} = (g_{k+1}ᵀ g_{k+1}) / (g_kᵀ g_k)
    9:     s_{k+1} = -g_{k+1} + β_{k+1} s_k
    10: end for

Constrained Optimization

Equality Constrained Optimization

We now consider equality constrained minimization:

    min_{x∈R^n} f(x)  subject to  g(x) = 0

where f : R^n → R and g : R^n → R^m. With the Lagrangian L(x, λ) = f(x) + λᵀg(x), we recall from before that the first-order necessary condition for optimality is

    ∇L(x, λ) = [ ∇f(x) + J_g(x)ᵀλ ] = 0
               [ g(x)             ]

Once again, this is a nonlinear system of equations that can be solved via Newton's method.

Sequential Quadratic Programming

To derive the Jacobian of this system, we write

    ∇L(x, λ) = [ ∇f(x) + Σ_{k=1}^m λ_k ∇g_k(x) ] ∈ R^{n+m}
               [ g(x)                          ]

Then we need to differentiate with respect to x ∈ R^n and λ ∈ R^m.

For i = 1, ..., n, we have

    (∇L(x, λ))_i = ∂f(x)/∂x_i + Σ_{k=1}^m λ_k ∂g_k(x)/∂x_i

Differentiating with respect to x_j, for i, j = 1, ..., n, gives

    ∂(∇L(x, λ))_i/∂x_j = ∂²f(x)/∂x_i∂x_j + Σ_{k=1}^m λ_k ∂²g_k(x)/∂x_i∂x_j

Hence the top-left n × n block of the Jacobian of ∇L(x, λ) is

    B(x, λ) = H_f(x) + Σ_{k=1}^m λ_k H_{g_k}(x) ∈ R^{n×n}

Differentiating (∇L(x, λ))_i with respect to λ_j, for i = 1, ..., n, j = 1, ..., m, gives

    ∂(∇L(x, λ))_i/∂λ_j = ∂g_j(x)/∂x_i

Hence the top-right n × m block of the Jacobian of ∇L(x, λ) is

    J_g(x)ᵀ ∈ R^{n×m}

For i = n+1, ..., n+m, we have (∇L(x, λ))_i = g_{i-n}(x). Differentiating with respect to x_j, for j = 1, ..., n, gives

    ∂(∇L(x, λ))_i/∂x_j = ∂g_{i-n}(x)/∂x_j

Hence the bottom-left m × n block of the Jacobian of ∇L(x, λ) is J_g(x) ∈ R^{m×n}, and the final m × m bottom-right block is just zero (differentiation of g_i(x) with respect to λ_j).

Hence, we have derived the following Jacobian matrix for ∇L(x, λ):

    [ B(x, λ)   J_g(x)ᵀ ] ∈ R^{(m+n)×(m+n)}
    [ J_g(x)    0       ]

Note the 2 × 2 block structure of this matrix; matrices with this structure are often called KKT matrices (after Karush, Kuhn, and Tucker, who did seminal work on nonlinear optimization).

Therefore, Newton's method for ∇L(x, λ) = 0 is

    [ B(x_k, λ_k)   J_g(x_k)ᵀ ] [ s_k ]  =  - [ ∇f(x_k) + J_g(x_k)ᵀλ_k ]
    [ J_g(x_k)      0         ] [ δ_k ]       [ g(x_k)                 ]

for k = 0, 1, 2, ..., where (s_k, δ_k) ∈ R^{n+m} is the k-th Newton step.

Now, consider the following constrained minimization problem, where (x_k, λ_k) is our Newton iterate at step k:

    min_s { (1/2) sᵀ B(x_k, λ_k) s + sᵀ(∇f(x_k) + J_g(x_k)ᵀλ_k) }
    subject to  J_g(x_k)s + g(x_k) = 0

The objective function is quadratic in s (here x_k, λ_k are constants). This minimization problem has Lagrangian

    L_k(s, δ) = (1/2) sᵀ B(x_k, λ_k) s + sᵀ(∇f(x_k) + J_g(x_k)ᵀλ_k) - δᵀ(J_g(x_k)s + g(x_k))

Then solving ∇L_k(s, δ) = 0 (i.e. the first-order necessary conditions) gives a linear system, which is the same as the k-th Newton step above. Hence at each step of Newton's method, we exactly solve a minimization problem with a quadratic objective function and linear constraints. An optimization problem of this type is called a quadratic program. This motivates the name for applying Newton's method to ∇L(x, λ) = 0: Sequential Quadratic Programming (SQP).

SQP is an important method, and there are many issues to be considered to obtain an efficient and reliable implementation:
> efficient solution of the linear systems at each Newton iteration (the block structure of the KKT matrix can be exploited)
> quasi-Newton approximations to the Hessian (as in the unconstrained case)
> trust region, line search, etc. to improve robustness
> treatment of constraints (equality and inequality) during the iterative process
> selection of a good starting guess for λ
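As an illustration of the Newton/KKT step (my own sketch; the problem min x_1 + x_2 subject to x_1² + x_2² = 2, with solution x* = (-1, -1), λ* = 1/2, is not from the slides):

    import numpy as np

    def newton_kkt_step(x, lam):
        """One SQP/Newton step for min x1 + x2 s.t. x1^2 + x2^2 - 2 = 0."""
        grad_f = np.array([1.0, 1.0])
        Jg = np.array([[2*x[0], 2*x[1]]])       # 1 x 2 constraint Jacobian
        B = 2*lam*np.eye(2)                     # H_f + lam*H_g = 0 + 2*lam*I
        K = np.block([[B, Jg.T],
                      [Jg, np.zeros((1, 1))]])  # KKT matrix
        rhs = -np.concatenate([grad_f + Jg.T @ np.array([lam]),
                               [x[0]**2 + x[1]**2 - 2]])
        step = np.linalg.solve(K, rhs)
        return x + step[:2], lam + step[2]

    x, lam = np.array([-2.0, -0.5]), 1.0        # starting guess (assumed)
    for k in range(10):
        x, lam = newton_kkt_step(x, lam)
    print(x, lam)                               # -> [-1. -1.] 0.5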
Penalty Methods

Another computational strategy for constrained optimization is to employ penalty methods. These convert a constrained problem into an unconstrained problem. Key idea: introduce a new objective function which is a weighted sum of the objective function and the constraint.

Given the minimization problem

    min f(x)  subject to  g(x) = 0    (*)

we can consider the related unconstrained problem

    min φ_ρ(x) = f(x) + (ρ/2) g(x)ᵀg(x)    (**)

Let x* and x*_ρ denote the solutions of (*) and (**), respectively. Under appropriate conditions, it can be shown that

    lim_{ρ→∞} x*_ρ = x*

In practice, we can solve the unconstrained problem for a large value of ρ to get a good approximation of x*. Another strategy is to solve for a sequence of penalty parameters ρ_k, where x*_{ρ_k} serves as a starting guess for x*_{ρ_{k+1}}.

Note that the major drawback of penalty methods is that a large factor ρ will increase the condition number of the Hessian H_{φ_ρ}. On the other hand, penalty methods can be convenient, primarily due to their simplicity.
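A small sketch of the continuation strategy (an increasing sequence of penalty parameters, warm-starting each solve) on the same illustrative problem used above, min x_1 + x_2 subject to x_1² + x_2² = 2:

    import numpy as np
    from scipy.optimize import minimize

    def phi(x, rho):
        """Penalty objective: f(x) + (rho/2) * g(x)^2."""
        f = x[0] + x[1]
        g = x[0]**2 + x[1]**2 - 2
        return f + 0.5 * rho * g**2

    x = np.array([-2.0, -0.5])                # starting guess (assumed)
    for rho in [1.0, 10.0, 100.0, 1000.0]:    # increasing penalty parameters
        x = minimize(phi, x, args=(rho,)).x   # warm start from previous x
    print(x)                                  # -> approximately [-1, -1]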
Linear Programming

As we mentioned earlier, the optimization problem

    min f(x)  subject to  g(x) = 0 and h(x) ≤ 0

with f, g, h affine, is called a linear program. The feasible region is a convex polyhedron (a solid with flat sides and straight edges). Since the objective function maps out a hyperplane, its global minimum must occur at a vertex of the feasible region. This can be seen most easily with a picture in R².

The standard approach for solving linear programs is conceptually simple: examine a sequence of the vertices to find the minimum. This is called the simplex method. Despite its conceptual simplicity, it is non-trivial to develop an efficient implementation of this algorithm, and we will not discuss the implementation details of the simplex method here.

In the worst case, the computational work required for the simplex method grows exponentially with the size of the problem. But this worst-case behavior is extremely rare; in practice simplex is very efficient (computational work typically grows linearly). Newer methods, called interior point methods, have been developed that have polynomial worst-case complexity.
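A minimal scipy sketch of solving a small linear program (the problem data is my own illustration):

    from scipy.optimize import linprog

    # min -x1 - 2*x2  subject to  x1 + x2 <= 4,  x1 <= 2,  x1, x2 >= 0
    res = linprog(c=[-1, -2],
                  A_ub=[[1, 1], [1, 0]],
                  b_ub=[4, 2],
                  bounds=[(0, None), (0, None)])
    print(res.x, res.fun)   # optimum at the vertex (0, 4), objective -8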