Numerical Solutions of Nonlinear Systems of Equations (I)

- Optimization is the foundation of machine learning.
- Setup: global search for an optimal strategy:

    \min_{x \in \mathbb{R}^n} f(x).

- Approach: an optimal strategy satisfies the nonlinear equation

    \nabla_x f(x) = 0 \in \mathbb{R}^n.

- Solving such equations is the focus of Ch. 10; practical problems are far harder.


Numerical Solutions of Nonlinear Systems of Equations (II)
- Ex: Support Vector Machine (SVM) for classification:

    \min_{w \in \mathbb{R}^n,\, b \in \mathbb{R}} \; \frac{1}{m} \sum_{i=1}^{m} \max\left(0,\; 1 - y_i \left(w^T x_i - b\right)\right) + \lambda \|w\|_2^2.

- Labels: y_i = 1 for blue, y_i = -1 for red.
- Its solution requires more than nonlinear equations.
§10.1 Fixed Points for Functions of Several Variables
A system of nonlinear equations has the form

    f_1(x_1, x_2, \dots, x_n) = 0,
    f_2(x_1, x_2, \dots, x_n) = 0,
        \vdots
    f_n(x_1, x_2, \dots, x_n) = 0.

In vector form,

    F(x) = 0, \quad \text{where } x = (x_1, x_2, \dots, x_n)^T.
Ex: Nonlinear equations

    3x_1 - \cos(x_2 x_3) = \frac{1}{2},
    x_1^2 - 81(x_2 + 0.1)^2 + \sin x_3 = -1.06,
    e^{-x_1 x_2} + 20 x_3 = -\frac{10\pi - 3}{3}.

In vector form, with x = (x_1, x_2, x_3)^T,

    F(x) \overset{\text{def}}{=} \begin{pmatrix} 3x_1 - \cos(x_2 x_3) - \frac{1}{2} \\ x_1^2 - 81(x_2 + 0.1)^2 + \sin x_3 + 1.06 \\ e^{-x_1 x_2} + 20 x_3 + \frac{10\pi - 3}{3} \end{pmatrix} = 0.
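Transcribed directly, the example system can be evaluated numerically; a minimal sketch in Python/NumPy (the function name F is just the slides' notation):

```python
import numpy as np

def F(x):
    """The example system F(x) = 0 from the slides, x = (x1, x2, x3)."""
    x1, x2, x3 = x
    return np.array([
        3*x1 - np.cos(x2*x3) - 0.5,
        x1**2 - 81*(x2 + 0.1)**2 + np.sin(x3) + 1.06,
        np.exp(-x1*x2) + 20*x3 + (10*np.pi - 3)/3,
    ])

# One exact root is (1/2, 0, -pi/6); F there is zero up to rounding.
residual = F(np.array([0.5, 0.0, -np.pi/6]))
```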
Solution Methods

- Fixed Point Method
- Newton's Method
- Quasi-Newton Methods
- Steepest Descent Methods
Definitions

Limit: Let f be defined on a set D ⊂ \mathbb{R}^n, mapping into \mathbb{R}. Then \lim_{x \to x_0} f(x) = L if, given any number \epsilon > 0, a number \delta > 0 exists with |f(x) - L| < \epsilon whenever x ∈ D and 0 < \|x - x_0\| < \delta.

Continuous: Function f is continuous at x_0 ∈ D if \lim_{x \to x_0} f(x) exists and equals f(x_0).

Continuous: Function f is continuous on D, or f ∈ C(D), if f is continuous at every point of D.

Continuous: F(x) \overset{\text{def}}{=} (f_1(x), \dots, f_n(x))^T ∈ \mathbb{R}^n is continuous on D, or F ∈ C(D), if f_j(x) ∈ C(D) for j = 1, \dots, n.
Fixed Point in \mathbb{R}^n

Def: A function G(x): D ⊂ \mathbb{R}^n \to \mathbb{R}^n has a fixed point at p ∈ D if G(p) = p.

Thm: Let D = \{ (x_1, \dots, x_n)^T \mid \alpha_j \le x_j \le \beta_j,\ j = 1, \dots, n \}. Suppose G(x) ∈ C(D) with the property that G(x) ∈ D whenever x ∈ D. Then G has a fixed point in D.
- Ex: nonlinear equations

    F(x) \overset{\text{def}}{=} \begin{pmatrix} 3x_1 - \cos(x_2 x_3) - \frac{1}{2} \\ x_1^2 - 81(x_2 + 0.1)^2 + \sin x_3 + 1.06 \\ e^{-x_1 x_2} + 20 x_3 + \frac{10\pi - 3}{3} \end{pmatrix} = 0.

- FPI 1: x = G_1(x):

    x_1 = \frac{1}{3} \cos(x_2 x_3) + \frac{1}{6},
    x_2 = \frac{1}{9} \sqrt{x_1^2 + \sin x_3 + 1.06} - 0.1,
    x_3 = -\frac{1}{20} \left( e^{-x_1 x_2} + \frac{10\pi - 3}{3} \right).

- FPI: x^{(k+1)} = G_1\left(x^{(k)}\right), k = 0, 1, \dots, with x^{(0)} = (0, 0, 0)^T.
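As a sketch (the tolerance and iteration cap are my choices, not from the slides), FPI 1 can be run directly:

```python
import numpy as np

def G1(x):
    """Fixed-point map G1 for the example system."""
    x1, x2, x3 = x
    return np.array([
        np.cos(x2*x3)/3 + 1/6,
        np.sqrt(x1**2 + np.sin(x3) + 1.06)/9 - 0.1,
        -(np.exp(-x1*x2) + (10*np.pi - 3)/3)/20,
    ])

x = np.zeros(3)                       # x^(0) = (0, 0, 0)^T
for k in range(100):
    x_new = G1(x)
    if np.linalg.norm(x_new - x, np.inf) < 1e-12:
        x = x_new
        break
    x = x_new
# The iterates approach the fixed point (1/2, 0, -pi/6).
```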
- Ex: nonlinear equations

    F(x) \overset{\text{def}}{=} \begin{pmatrix} 3x_1 - \cos(x_2 x_3) - \frac{1}{2} \\ x_1^2 - 81(x_2 + 0.1)^2 + \sin x_3 + 1.06 \\ e^{-x_1 x_2} + 20 x_3 + \frac{10\pi - 3}{3} \end{pmatrix} = 0.

- FPI 2: x = G_2(x):

    x_1 = \frac{1}{3} \cos(x_2 x_3) + \frac{1}{6},
    x_2 = -\frac{1}{9} \sqrt{x_1^2 + \sin x_3 + 1.06} - 0.1,
    x_3 = -\frac{1}{20} \left( e^{-x_1 x_2} + \frac{10\pi - 3}{3} \right).

- FPI: x^{(k+1)} = G_2\left(x^{(k)}\right), k = 0, 1, \dots, with x^{(0)} = (0, 0, 0)^T.
Let D = \{ (x_1, \dots, x_n)^T \mid \alpha_j \le x_j \le \beta_j,\ j = 1, \dots, n \}. Suppose G(x) ∈ C(D) with the property that G(x) ∈ D whenever x ∈ D.

Thm: Let J(x) \overset{\text{def}}{=} \left( \frac{\partial g_i(x)}{\partial x_j} \right). Assume that there exists a constant \kappa < 1 so that \|J(x)\|_\infty \le \kappa for all x ∈ D. Then the FPI

    x^{(k+1)} = G\left(x^{(k)}\right), \quad k = 0, 1, \dots

with x^{(0)} ∈ D converges to the unique fixed point p ∈ D, and

    \left\| x^{(k)} - p \right\|_\infty \le \frac{\kappa^k}{1 - \kappa} \left\| x^{(0)} - p \right\|_\infty.

Thm: If J(p) = 0 and \left| \frac{\partial^2 g_i(x)}{\partial x_j \partial x_t} \right| \le M for all 1 \le i, j, t \le n, then for sufficiently large k,

    \left\| x^{(k)} - p \right\|_\infty \le \frac{n^2 M}{2} \left\| x^{(k-1)} - p \right\|_\infty^2.
§10.2 Newton's Method: one-dimensional case review
To solve f(x) = 0, consider a fixed point function built from some function \phi(x):

    g(x) = x - \phi(x) f(x).

- Let p be a root of f(x): f(p) = 0.
- Then p is a fixed point of g(x): p = g(p).
- At the fixed point p: g'(p) = 1 - \phi(p) f'(p).
- For quadratic convergence, choose \phi(x) so that g'(p) = 0, i.e. \phi(p) = \frac{1}{f'(p)}.
- Newton's method:

    x^{(k+1)} = g\left(x^{(k)}\right), \quad k = 0, 1, \dots, \quad \text{with } g(x) = x - \frac{f(x)}{f'(x)}.
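The one-dimensional iteration is a few lines of code; a minimal sketch (the example function f(x) = x² − 2 is my choice for illustration):

```python
def newton_1d(f, fprime, x0, tol=1e-12, max_iter=50):
    """One-dimensional Newton iteration: x_{k+1} = x_k - f(x_k)/f'(x_k)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:      # stop when the Newton step is tiny
            break
    return x

# Illustrative example (not from the slides): f(x) = x^2 - 2, root sqrt(2).
root = newton_1d(lambda x: x*x - 2.0, lambda x: 2.0*x, x0=1.0)
```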
§10.2 Newton's Method: n-dimensional case
To solve F(x) = 0 ∈ \mathbb{R}^n, consider a fixed point function built from some matrix function A(x) ∈ \mathbb{R}^{n \times n}:

    G(x) = x - A(x)^{-1} F(x).

- Let p be a root of F(x): F(p) = 0.
- Then p is a fixed point of G(x): p = G(p).
- At the fixed point p: J_x(G(p)) = I - A(p)^{-1} J_x(F(p)).
- For quadratic convergence, choose A(x) so that J_x(G(p)) = 0, i.e. A(p) = J_x(F(p)).
- Newton's method:

    x^{(k+1)} = G\left(x^{(k)}\right), \quad k = 0, 1, \dots, \quad \text{with } G(x) = x - J_x^{-1}(F(x))\, F(x).


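The n-dimensional update can be sketched generically; solving the linear system J s = −F instead of forming J⁻¹ explicitly is standard practice. The demo system below is my own illustration, not from the slides:

```python
import numpy as np

def newton_nd(F, J, x0, tol=1e-12, max_iter=50):
    """Newton's method for F(x) = 0: solve J(x_k) s = -F(x_k), then x_{k+1} = x_k + s."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        s = np.linalg.solve(J(x), -F(x))   # avoids forming J^{-1} explicitly
        x = x + s
        if np.linalg.norm(s, np.inf) < tol:
            break
    return x

# Demo system (illustrative): x1^2 + x2^2 = 1 and x1 = x2, root (1/sqrt(2), 1/sqrt(2)).
root = newton_nd(lambda x: np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]]),
                 lambda x: np.array([[2*x[0], 2*x[1]], [1.0, -1.0]]),
                 [1.0, 0.0])
```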
Ex: Nonlinear equations with x = (x_1, x_2, x_3)^T,

    F(x) \overset{\text{def}}{=} \begin{pmatrix} 3x_1 - \cos(x_2 x_3) - \frac{1}{2} \\ x_1^2 - 81(x_2 + 0.1)^2 + \sin x_3 + 1.06 \\ e^{-x_1 x_2} + 20 x_3 + \frac{10\pi - 3}{3} \end{pmatrix} = 0.

The Jacobian matrix has the analytic form

    J_x(F(x)) = \begin{pmatrix} 3 & x_3 \sin(x_2 x_3) & x_2 \sin(x_2 x_3) \\ 2 x_1 & -162 (x_2 + 0.1) & \cos x_3 \\ -x_2 e^{-x_1 x_2} & -x_1 e^{-x_1 x_2} & 20 \end{pmatrix}.

Newton's method with x^{(0)} = (0.1, 0.1, -0.1)^T.


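Putting the pieces together for this example (the iteration cap is my choice), Newton's method with the analytic Jacobian converges rapidly from this starting point:

```python
import numpy as np

def F(x):
    x1, x2, x3 = x
    return np.array([
        3*x1 - np.cos(x2*x3) - 0.5,
        x1**2 - 81*(x2 + 0.1)**2 + np.sin(x3) + 1.06,
        np.exp(-x1*x2) + 20*x3 + (10*np.pi - 3)/3,
    ])

def J(x):
    """Analytic Jacobian from the slides."""
    x1, x2, x3 = x
    return np.array([
        [3.0,                  x3*np.sin(x2*x3),    x2*np.sin(x2*x3)],
        [2*x1,                 -162*(x2 + 0.1),     np.cos(x3)],
        [-x2*np.exp(-x1*x2),   -x1*np.exp(-x1*x2),  20.0],
    ])

x = np.array([0.1, 0.1, -0.1])        # x^(0) from the slides
for _ in range(20):
    x = x + np.linalg.solve(J(x), -F(x))
# x is now (1/2, 0, -pi/6) to machine precision.
```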
§10.2 Quasi-Newton Method: Broyden Method

Motivation: Newton's method costs too much for large n.
Quasi-Newton methods: a poor man's alternative.

- Newton's method:

    x^{(k+1)} = x^{(k)} - A\left(x^{(k)}\right)^{-1} F\left(x^{(k)}\right), \quad \text{where } A(x) = \left( \frac{\partial f_i(x)}{\partial x_j} \right).

- Problem I with Newton: requires n^2 partial derivatives.
- Problem II with Newton: needs to factorize A\left(x^{(k)}\right) for each k.

Goal: something cheaper, trading slower convergence for less overall computation.

Broyden Method: Motivation

- Broyden Method:

    x^{(k+1)} = x^{(k)} - A_k^{-1} F\left(x^{(k)}\right), \quad \text{where } A_k ∈ \mathbb{R}^{n \times n}.

- Desired Property I: A_k "mimics" A\left(x^{(k)}\right) in some sense.
- Desired Property II: A_{k+1}^{-1} is "easy" to compute from A_k^{-1}.
Broyden Method: Derivation

- Assume for some step k ≥ 0, we have available x^{(k)} ∈ \mathbb{R}^n and A_k ∈ \mathbb{R}^{n \times n}. By the Broyden method:

    x^{(k+1)} = x^{(k)} - A_k^{-1} F\left(x^{(k)}\right).

- Secant equation:

    F\left(x^{(k+1)}\right) - F\left(x^{(k)}\right) ≈ A\left(x^{(k)}\right) \left( x^{(k+1)} - x^{(k)} \right).

- Broyden's ideas:
  - Approximate secant equation: A_{k+1} s_{k+1} = y_{k+1}, where

        y_{k+1} \overset{\text{def}}{=} F\left(x^{(k+1)}\right) - F\left(x^{(k)}\right), \quad s_{k+1} \overset{\text{def}}{=} x^{(k+1)} - x^{(k)}.

  - Choose a special A_{k+1} that does not differ much from A_k:

        A_{k+1} = A_k + \frac{y_{k+1} - A_k s_{k+1}}{\|s_{k+1}\|_2^2}\, s_{k+1}^T.
Broyden Method

- Initialization: given x^{(0)} ∈ \mathbb{R}^n, choose A_0 ∈ \mathbb{R}^{n \times n}.
- For k = 0, 1, \dots:

    x^{(k+1)} = x^{(k)} - A_k^{-1} F\left(x^{(k)}\right),
    A_{k+1} = A_k + \frac{y_{k+1} - A_k s_{k+1}}{\|s_{k+1}\|_2^2}\, s_{k+1}^T, \quad \text{with}
    y_{k+1} = F\left(x^{(k+1)}\right) - F\left(x^{(k)}\right), \quad s_{k+1} = x^{(k+1)} - x^{(k)}.

Practical details:

- A_k^{-1} may not exist. (The Broyden method then fails.)
- Does an LU factorization of A_k help when we compute the LU factorization of A_{k+1}? (Yes, but that is another story.)
- If A_k^{-1} does exist and is available,

    A_{k+1}^{-1} = A_k^{-1} + \frac{s_{k+1} - A_k^{-1} y_{k+1}}{s_{k+1}^T A_k^{-1} y_{k+1}}\, s_{k+1}^T A_k^{-1}.
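The rank-one inverse update above makes each iteration cost O(n²) once A₀⁻¹ is formed. A minimal sketch (the demo system, tolerance, and iteration cap are my choices for illustration):

```python
import numpy as np

def broyden(F, x0, A0, tol=1e-10, max_iter=100):
    """Broyden's method, maintaining B = A_k^{-1} via the rank-one inverse update."""
    x = np.asarray(x0, dtype=float)
    B = np.linalg.inv(A0)              # B approximates A_0^{-1}
    Fx = F(x)
    for _ in range(max_iter):
        s = -B @ Fx                    # step: x_{k+1} = x_k - A_k^{-1} F(x_k)
        x = x + s
        if np.linalg.norm(s, np.inf) < tol:
            break
        Fx_new = F(x)
        y = Fx_new - Fx
        By = B @ y
        # Inverse update: B <- B + (s - B y) s^T B / (s^T B y)
        B = B + np.outer(s - By, s @ B) / (s @ By)
        Fx = Fx_new
    return x

# Demo (illustrative system, not from the slides): x1^2 + x2^2 = 1, x1 = x2.
Fd = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])
A0 = np.array([[2.0, 0.0], [1.0, -1.0]])   # exact Jacobian at x0 = (1, 0)
root = broyden(Fd, [1.0, 0.0], A0)
```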
Ex: Nonlinear equations with x = (x_1, x_2, x_3)^T,

    F(x) \overset{\text{def}}{=} \begin{pmatrix} 3x_1 - \cos(x_2 x_3) - \frac{1}{2} \\ x_1^2 - 81(x_2 + 0.1)^2 + \sin x_3 + 1.06 \\ e^{-x_1 x_2} + 20 x_3 + \frac{10\pi - 3}{3} \end{pmatrix} = 0.

The Jacobian matrix has the analytic form

    J_x(F(x)) = \begin{pmatrix} 3 & x_3 \sin(x_2 x_3) & x_2 \sin(x_2 x_3) \\ 2 x_1 & -162 (x_2 + 0.1) & \cos x_3 \\ -x_2 e^{-x_1 x_2} & -x_1 e^{-x_1 x_2} & 20 \end{pmatrix}.

Quasi-Newton method with x^{(0)} = (0.1, 0.1, -0.1)^T, A_0 = J_x\left(F\left(x^{(0)}\right)\right).
§10.4 Steepest Descent Techniques (I)
A system of nonlinear equations has the form

    f_1(x_1, x_2, \dots, x_n) = 0,
    f_2(x_1, x_2, \dots, x_n) = 0,
        \vdots
    f_n(x_1, x_2, \dots, x_n) = 0.

In vector form, with x = (x_1, x_2, \dots, x_n)^T:

    F(x) \overset{\text{def}}{=} \begin{pmatrix} f_1(x_1, x_2, \dots, x_n) \\ f_2(x_1, x_2, \dots, x_n) \\ \vdots \\ f_n(x_1, x_2, \dots, x_n) \end{pmatrix} = 0.

Def: g(x) \overset{\text{def}}{=} F(x)^T F(x) = \sum_{j=1}^{n} f_j^2(x_1, x_2, \dots, x_n). Then

    F(x) = 0 \iff \min_{x \in \mathbb{R}^n} g(x) = 0.
Steepest Descent Techniques (II)

g(x) \overset{\text{def}}{=} F(x)^T F(x) = \sum_{j=1}^{n} f_j^2(x_1, x_2, \dots, x_n). Then

    F(x) = 0 \iff \min_{x \in \mathbb{R}^n} g(x) = 0.

Algorithm 1: Generic Descent Algorithm

    Evaluate g(x) at an initial approximation vector x^{(0)} ∈ \mathbb{R}^n. Set k = 0.
    while not yet converged do
        Determine a direction d^{(k)} from x^{(k)} that results in a decrease in the value of g(x) (a descent direction).
        Move an appropriate amount α (step-size) in this direction:
            x^{(k+1)} = x^{(k)} + α d^{(k)}.
        k = k + 1
    end while
Gradient and descent directions

Gradient: \nabla_x g(x) = \left( \frac{\partial g(x)}{\partial x_1}, \dots, \frac{\partial g(x)}{\partial x_n} \right)^T.

Directional derivative:

    D_v g(x) \overset{\text{def}}{=} \lim_{h \to 0} \frac{g(x + h v) - g(x)}{h} = v^T \nabla_x g(x).

Steepest descent: for any \|v\|_2 = 1 and tiny α > 0:

    g(x - α v) = g(x) - α\, v^T \nabla_x g(x) + O(α^2)
               ≥ g(x) - α\, \|\nabla_x g(x)\|_2 + O(α^2).

g(x - α v) decreases asymptotically the most with v = \frac{\nabla_x g(x)}{\|\nabla_x g(x)\|_2}.
Steepest descent:

    g(x - α \nabla_x g(x)) ≈ g(x) - α\, \|\nabla_x g(x)\|_2^2.


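Since g(x) = ‖F(x)‖₂², its gradient is ∇g = 2 J(F)ᵀ F, and the descent idea above can be sketched with a crude halving line search (the demo system, tolerances, and the halving rule are my simplifications, not the slides' step-size algorithms):

```python
import numpy as np

def steepest_descent(F, J, x0, max_iter=2000, tol=1e-8):
    """Minimize g(x) = ||F(x)||_2^2 along d = -grad g = -2 J(x)^T F(x),
    halving the step until g decreases (a crude line search)."""
    x = np.asarray(x0, dtype=float)
    g = lambda z: float(F(z) @ F(z))
    for _ in range(max_iter):
        d = -2.0 * J(x).T @ F(x)           # steepest-descent direction
        if np.linalg.norm(d, np.inf) < tol:
            break
        alpha = 1.0
        while g(x + alpha * d) >= g(x):    # ensure a reduction in g
            alpha *= 0.5
            if alpha < 1e-16:
                break
        x = x + alpha * d
    return x

# Demo (illustrative linear system, not from the slides): 2*x1 = 2, x2 = 1.
Fd = lambda x: np.array([2*x[0] - 2.0, x[1] - 1.0])
Jd = lambda x: np.array([[2.0, 0.0], [0.0, 1.0]])
xmin = steepest_descent(Fd, Jd, [0.0, 0.0])
```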
Steepest Descent Algorithm for solving F(x) = 0 (I)

Algorithm 2: Steepest Descent Algorithm

    Evaluate g(x) \overset{\text{def}}{=} \|F(x)\|_2^2 at initial vector x^{(0)} ∈ \mathbb{R}^n. Set k = 0.
    while not yet converged do
        Set d^{(k)} = -\nabla_x g\left(x^{(k)}\right) = -2 \left( J\left(F\left(x^{(k)}\right)\right) \right)^T F\left(x^{(k)}\right).
        Move an appropriate amount α (step-size) in this direction:
            x^{(k+1)} = x^{(k)} + α d^{(k)}.
        k = k + 1
    end while

Algorithm 2 allows many different choices for α^{(k)}:

- α should be "cheap" to compute.
- α should ensure "sufficient" reduction in g(x).
Step-size Selection

- Ensure reduction in g(x): Algorithm 3 will halt with an α_3 > 0 so that g\left(x^{(k)} + α_3 d^{(k)}\right) < g\left(x^{(k)}\right).

Algorithm 3: g(x) Reduction

    Set α_3 = 1.
    while g\left(x^{(k)} + α_3 d^{(k)}\right) ≥ g\left(x^{(k)}\right) do
        Set α_3 = \frac{α_3}{2}.
    end while

- Ensure sufficient reduction in g(x) with α^{(k)}:

Algorithm 4: g(x) Sufficient Reduction

    Set α_2 = α_3 / 2. Compute coefficients h_0, h_1, h_2 so that P(α) = h_0 + h_1 α + h_2 α (α - α_2) interpolates g\left(x^{(k)} + α d^{(k)}\right) at α = 0, α_2, α_3. Set α_0 = \frac{1}{2} \left( α_2 - h_1 / h_2 \right).
    α^{(k)} = \operatorname{argmin}_{α ∈ \{α_0, α_2, α_3\}}\; g\left(x^{(k)} + α d^{(k)}\right).
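Algorithms 3-4 can be sketched for a one-dimensional slice φ(α) = g(x⁽ᵏ⁾ + α d⁽ᵏ⁾); h₀, h₁, h₂ are Newton divided differences, and the underflow guard and non-convex fallback are my safeguards, not part of the slides:

```python
def choose_step(phi):
    """Pick a step-size for phi(a) = g(x_k + a*d_k), following Algorithms 3-4:
    halve a3 until phi decreases, then minimize a quadratic interpolant."""
    a3 = 1.0
    while phi(a3) >= phi(0.0):           # Algorithm 3: guarantee a reduction
        a3 *= 0.5
        if a3 < 1e-16:                   # safeguard (my addition)
            return a3
    a2 = a3 / 2.0
    # Newton divided differences for P(a) = h0 + h1*a + h2*a*(a - a2)
    h0 = phi(0.0)
    h1 = (phi(a2) - h0) / a2
    h2 = ((phi(a3) - phi(a2)) / (a3 - a2) - h1) / a3
    if h2 <= 0:                          # interpolant not convex: fall back (my addition)
        return a3
    a0 = 0.5 * (a2 - h1 / h2)            # critical point of P
    return min([a0, a2, a3], key=phi)    # Algorithm 4: best of the candidates

# For an exactly quadratic phi the interpolant is exact, so a0 is the true minimizer.
alpha = choose_step(lambda a: (a - 0.3)**2 + 1.0)
```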
Steepest Descent Algorithm for solving F(x) = 0 (II)

Algorithm 5: Steepest Descent Algorithm

    Evaluate g(x) \overset{\text{def}}{=} \|F(x)\|_2^2 at initial vector x^{(0)} ∈ \mathbb{R}^n. Set k = 0.
    while not yet converged do
        Set d^{(k)} = -\nabla_x g\left(x^{(k)}\right) = -2 \left( J\left(F\left(x^{(k)}\right)\right) \right)^T F\left(x^{(k)}\right),
            d^{(k)} = d^{(k)} / \left\| d^{(k)} \right\|_2.  (bad move)
        Compute α^{(k)} with Algorithms 3 and 4.
            x^{(k+1)} = x^{(k)} + α^{(k)} d^{(k)}.
        k = k + 1
    end while
Ex: Nonlinear equations with x = (x_1, x_2, x_3)^T,

    F(x) \overset{\text{def}}{=} \begin{pmatrix} 3x_1 - \cos(x_2 x_3) - \frac{1}{2} \\ x_1^2 - 81(x_2 + 0.1)^2 + \sin x_3 + 1.06 \\ e^{-x_1 x_2} + 20 x_3 + \frac{10\pi - 3}{3} \end{pmatrix} = 0.

The Jacobian matrix has the analytic form

    J_x(F(x)) = \begin{pmatrix} 3 & x_3 \sin(x_2 x_3) & x_2 \sin(x_2 x_3) \\ 2 x_1 & -162 (x_2 + 0.1) & \cos x_3 \\ -x_2 e^{-x_1 x_2} & -x_1 e^{-x_1 x_2} & 20 \end{pmatrix}.

- Newton's method with x^{(0)} = (0, 0, 0)^T
[Figure: Newton iteration errors. Computed solution ≈ (5.0000e-01, -7.35e-18, -5.2360e-01)^T, i.e. (1/2, 0, -π/6); the errors fall to about 1e-15 within roughly 18 iterations.]
- Newton's method with x^{(0)} = -(2, 2, 2)^T

[Figure: Newton iteration errors. Computed solution ≈ (0.49814468, -0.1996059, -0.52882598)^T, a second root; the errors fall to about 1e-15 within roughly 18 iterations.]

- Steepest Descent with x^{(0)} = (0, 0, 0)^T

[Figure: Steepest descent iteration errors, decreasing from about 1e0 to about 1e-8 over roughly 500 iterations.]

- Step-sizes

[Figure: Steepest descent step-sizes over the same 500 iterations, ranging between about 1e0 and 1e-8.]


- Steepest Descent with x^{(0)} = -(10, 10, 10)^T (Newton's method diverges from this starting point.)

[Figure: Steepest descent iteration errors, decreasing from about 1e1 to about 1e-7 over roughly 1200 iterations.]

- Step-sizes

[Figure: Steepest descent step-sizes over the same 1200 iterations, ranging between about 1e0 and 1e-8.]
