Introduction To Non-Linear Optimization: Ross A. Lippert


Introduction to non-linear optimization

Ross A. Lippert
D. E. Shaw Research

February 25, 2008

R. A. Lippert

Non-linear optimization

Optimization problems

Problem: let f : R^n → (−∞, ∞],

    find  min_{x ∈ R^n} f(x)
    find  x* s.t. f(x*) = min_{x ∈ R^n} f(x)

Quite general, but some cases, like f convex, are fairly solvable.
Today's problem: how about f : R^n → R, smooth?

    find  x* s.t. ∇f(x*) = 0

We have a reasonable shot at this if f is twice differentiable.


Two pillars of smooth multivariate optimization

    nD optimization  ←  linear solve / quadratic optimization
                     ←  1D optimization

The simplest example we can get

Quadratic optimization: f(x) = c − x^t b + ½ x^t A x.

    very common (actually universal, more later)

Finding ∇f(x) = 0:

    ∇f(x) = −b + Ax = 0
        x = A^{-1} b

A has to be invertible (really, b in the range of A).

Is this all we need?

Max, min, saddle, or what?

Require that A be positive definite. Why?

[Figure: contour plots of quadratics illustrating a minimum, a saddle, and a
maximum, depending on the definiteness of A; only the positive definite case
has a unique minimum.]

Universality of linear algebra in optimization

    f(x) = c − x^t b + ½ x^t A x

Linear solve: x = A^{-1} b.
Even for non-linear problems: if the optimal x* is near our x,

    f(x*) ≈ f(x) + (x* − x)^t ∇f(x) + ½ (x* − x)^t ∇²f(x) (x* − x) + ···
       x* ≈ x − (∇²f(x))^{-1} ∇f(x)

Optimization ⇔ linear solve.

Linear solve

    x = A^{-1} b

But really we just want to solve

    Ax = b

Don't form A^{-1} if you can avoid it.
(Don't form A if you can avoid that!)
For a general A, there are three important special cases:

    diagonal: A = [a_1 0 0; 0 a_2 0; 0 0 a_3], thus x_i = b_i / a_i

    orthogonal: A^t A = I, thus A^{-1} = A^t and x = A^t b

    triangular: A = [a_11 0 0; a_21 a_22 0; a_31 a_32 a_33],
                x_i = (b_i − Σ_{j<i} a_ij x_j) / a_ii
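The triangular case can be sketched as a forward substitution in a few lines of pure Python (the function name and example are illustrative, not from the slides):

```python
def forward_substitute(A, b):
    """Solve A x = b for lower-triangular A: x_i = (b_i - sum_{j<i} a_ij x_j) / a_ii."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        partial = sum(A[i][j] * x[j] for j in range(i))
        x[i] = (b[i] - partial) / A[i][i]
    return x

# lower-triangular example: the solution is [1, 2, 1]
x = forward_substitute([[2.0, 0.0, 0.0], [1.0, 3.0, 0.0], [4.0, 5.0, 3.0]],
                       [2.0, 7.0, 17.0])
```

Upper-triangular systems work the same way, running the loop from the last row up (back substitution).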

Direct methods

A is symmetric positive definite.
Cholesky factorization:

    A = L L^t,

where L is lower triangular. So x = L^{-t} L^{-1} b by

    L z = b:    z_i = (b_i − Σ_{j<i} L_ij z_j) / L_ii
    L^t x = z:  x_i = (z_i − Σ_{j>i} L_ji x_j) / L_ii
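A minimal sketch of the whole pipeline — factor A = L L^t, then the forward and backward substitutions above (pure Python for illustration; in practice you would call a library routine):

```python
import math

def cholesky(A):
    """Factor symmetric positive definite A as L L^t, with L lower triangular."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_spd(A, b):
    """Solve A x = b for SPD A via A = L L^t: forward then backward substitution."""
    n = len(b)
    L = cholesky(A)
    z = [0.0] * n
    for i in range(n):                     # L z = b
        z[i] = (b[i] - sum(L[i][j] * z[j] for j in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):           # L^t x = z
        x[i] = (z[i] - sum(L[j][i] * x[j] for j in range(i + 1, n))) / L[i][i]
    return x

# SPD example: 4x + 2y = 8, 2x + 3y = 7 has solution (1.25, 1.5)
x = solve_spd([[4.0, 2.0], [2.0, 3.0]], [8.0, 7.0])
```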

Direct methods

A is symmetric positive definite.

Eigenvalue factorization:

    A = Q D Q^t,

where Q is orthogonal and D is diagonal. Then

    x = Q (D^{-1} (Q^t b)).

More expensive than Cholesky.

Direct methods are usually quite expensive (O(n³) work).

Iterative method basics

What's an iterative method?

Definition (informal)
An iterative method is an algorithm A which takes what you have, x_i, and
gives you a new x_{i+1} which is less bad, such that x_1, x_2, x_3, ...
converges to some x with badness = 0.

A notion of badness could come from

    distance from x_i to our problem's solution
    value of some objective function above its minimum
    size of the gradient at x_i

e.g. if x is supposed to satisfy Ax = b, we could take ||b − Ax|| to be the
measure of badness.

Iterative method considerations

How expensive is one x_i → x_{i+1} step?
How quickly does the badness B_i decrease per step?
A thousand and one years of experience yields two cases:

    B_i ∝ β^i       for some β ∈ (0, 1)      (linear)
    B_i ∝ β^(γ^i)   for β ∈ (0, 1), γ > 1    (superlinear)

[Figure: two linear-scale plots of badness vs. iteration, one for each
convergence rate.]

Can you tell the difference?

Convergence

[Figure: the same badness-vs-iteration data on semilog axes; linear
convergence appears as a straight line, superlinear convergence curves
downward.]

Now can you tell the difference?

When evaluating an iterative method against a manufacturer's claims, be sure
to do semilog plots.

Iterative methods

Motivation: directly optimize f(x) = c − x^t b + ½ x^t A x.
Gradient descent:

    1. Search direction: r_i = −∇f = b − A x_i
    2. Search step: x_{i+1} = x_i + α_i r_i
    3. Pick α: α_i = (r_i^t r_i) / (r_i^t A r_i) minimizes f(x_i + α r_i)

    f(x_i + α r_i) = c − x_i^t b + ½ x_i^t A x_i + α r_i^t (A x_i − b) + ½ α² r_i^t A r_i
                   = f(x_i) − α r_i^t r_i + ½ α² r_i^t A r_i

(Cost of a step = 1 A-multiply.)
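The loop above can be sketched directly (pure Python; for clarity this version recomputes the residual with a second A-multiply, whereas the one-multiply variant of the slide would update r_{i+1} = r_i − α A r_i instead):

```python
def grad_descent_quadratic(A, b, x0, steps):
    """Minimize f(x) = c - x^t b + 0.5 x^t A x by gradient descent
    with exact line search: alpha_i = (r^t r)/(r^t A r)."""
    n = len(b)
    x = list(x0)
    for _ in range(steps):
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        r = [b[i] - Ax[i] for i in range(n)]       # r_i = b - A x_i
        rr = sum(ri * ri for ri in r)
        if rr == 0.0:
            break
        Ar = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]
        alpha = rr / sum(r[i] * Ar[i] for i in range(n))
        x = [x[i] + alpha * r[i] for i in range(n)]
    return x

# A x = b with A = [[4,1],[1,3]], b = [1,2] has solution (1/11, 7/11)
x = grad_descent_quadratic([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], [0.0, 0.0], 100)
```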

Iterative methods

Optimize f(x) = c − x^t b + ½ x^t A x.
Conjugate gradient descent:

    1. Search direction: d_i = r_i + β_i d_{i−1}, with r_i = b − A x_i.
       Pick β_i = −(d_{i−1}^t A r_i) / (d_{i−1}^t A d_{i−1}); this ensures
       d_{i−1}^t A d_i = 0.
    2. Search step: x_{i+1} = x_i + α_i d_i.
       Pick α_i = (d_i^t r_i) / (d_i^t A d_i): minimizes f(x_i + α d_i)

    f(x_i + α d_i) = c − x_i^t b + ½ x_i^t A x_i − α d_i^t r_i + ½ α² d_i^t A d_i

(also means that r_{i+1}^t d_i = 0)

Avoid the extra A-multiply: using α_{i−1} A d_{i−1} = r_{i−1} − r_i,

    β_i = −((r_{i−1} − r_i)^t r_i) / ((r_{i−1} − r_i)^t d_{i−1})
        = ((r_i − r_{i−1})^t r_i) / (r_{i−1}^t r_{i−1})
        = (r_i^t r_i) / (r_{i−1}^t r_{i−1})
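These recurrences condense into the standard linear-CG loop, one A-multiply per iteration (a minimal pure-Python sketch):

```python
def conjugate_gradient(A, b, x0, tol=1e-12, max_iter=100):
    """Linear CG: d_i = r_i + beta_i d_{i-1}, beta_i = r_i^t r_i / r_{i-1}^t r_{i-1}."""
    n = len(b)
    x = list(x0)
    r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    d = list(r)
    rr = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rr < tol:
            break
        Ad = [sum(A[i][j] * d[j] for j in range(n)) for i in range(n)]
        alpha = rr / sum(d[i] * Ad[i] for i in range(n))
        x = [x[i] + alpha * d[i] for i in range(n)]
        r = [r[i] - alpha * Ad[i] for i in range(n)]   # one A-multiply reused
        rr_new = sum(ri * ri for ri in r)
        beta = rr_new / rr
        d = [r[i] + beta * d[i] for i in range(n)]
        rr = rr_new
    return x

# same 2x2 SPD system as before; CG finishes in at most n = 2 steps
x = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], [0.0, 0.0])
```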

A cute result

Conjugate gradient descent:

    1. r_i = b − A x_i
    2. Search direction: d_i = r_i + β_i d_{i−1} (β s.t. d_i^t A d_{i−1} = 0)
    3. Search step: x_{i+1} = x_i + α_i d_i (α minimizes).

Cute result (not that useful in practice):

Theorem (sub-optimality of CG)
(Assuming x_0 = 0) at the end of step k, the solution x_k is the optimal
linear combination of b, Ab, A²b, ..., A^k b for minimizing c − b^t x + ½ x^t A x.

(computer arithmetic errors make this less than perfect)

Very little extra effort. Much better convergence.

Slow convergence: conditioning

The eccentricity of the quadratic is a big factor in convergence.

[Figure: contours of a highly eccentric (ill-conditioned) quadratic.]

Convergence and eccentricity

    κ = max eig(A) / min eig(A)

For gradient descent,

    ||r_i|| ∝ ((κ − 1) / (κ + 1))^i

For CG,

    ||r_i|| ∝ ((√κ − 1) / (√κ + 1))^i

Useless CG fact: in exact arithmetic, r_i = 0 when i > n (A is n × n).

The truth about descent methods

Very slow unless κ can be controlled.
How do we control κ?

    Ax = b  →  (P A P^t) y = P b,    x = P^t y

where P is a pre-conditioner you pick.

How to make κ(P A P^t) small?

    perfect answer: P = L^{-1} where L L^t = A (Cholesky factorization)
    imperfect answer: P ≈ L^{-1}

Variations on the theme of incomplete factorization:

    P^{-1} = D^{1/2} where D = diag(a_11, ..., a_nn)
    more generally, incomplete Cholesky decomposition
    some easy nearby solution or simple approximate A
    (requiring domain knowledge)

Class project?

One idea for a preconditioner is a block-diagonal matrix

    P^{-1} = [L_11 0 0; 0 L_22 0; 0 0 L_33]

where L_ii^t L_ii = A_ii, a diagonal block of A.
In what sense does good clustering give good preconditioners?

End of solvers: there are a few other iterative solvers out there I haven't
discussed.

Second pillar: 1D optimization

1D optimization gives important insights into non-linearity.

    min_{s ∈ R} f(s),    f continuous.

A derivative-free option:
A bracket is (a, b, c) s.t. a < b < c and f(a) > f(b) < f(c); then f has a
local min for a < x < c.

Golden search: based on picking a < b′ < b < c, either (a, b′, b) or
(b′, b, c) is a new bracket... continue.
Linearly convergent, e_i ∝ G^{−i}, G the golden ratio.
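Golden search can be sketched as a bracket-shrinking loop (pure Python; the function name and example are illustrative):

```python
import math

def golden_section(f, a, c, tol=1e-8):
    """Shrink a bracket (a, c) around a minimum of a unimodal f.
    Each pass shrinks the bracket by 1/G ~ 0.618: linear convergence."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0
    b = c - invphi * (c - a)
    bp = a + invphi * (c - a)
    fb, fbp = f(b), f(bp)
    while c - a > tol:
        if fb < fbp:              # (a, b, bp) is the new bracket
            c, bp, fbp = bp, b, fb
            b = c - invphi * (c - a)
            fb = f(b)
        else:                     # (b, bp, c) is the new bracket
            a, b, fb = b, bp, fbp
            bp = a + invphi * (c - a)
            fbp = f(bp)
    return 0.5 * (a + c)

s = golden_section(lambda x: (x - 2.0) ** 2, 0.0, 5.0)   # minimizer is 2
```

One function evaluation per iteration: the golden ratio is chosen precisely so one of the old interior points is reused as an interior point of the new bracket.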

1D optimization

Fundamentally limited accuracy of derivative-free argmin: near the minimum
f is flat (f′ = 0), so the argmin can only be located to about the square
root of the precision of the function values.

Derivative-based methods, f′(s) = 0, for an accurate argmin:

    bracketed: (a, b) s.t. f′(a), f′(b) of opposite sign
        1. bisection (linearly convergent)
        2. modified regula falsi & Brent's method (superlinear)

    unbracketed:
        1. secant method (superlinear)
        2. Newton's method (superlinear; requires another derivative)
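Bisection on f′ is the simplest of the bracketed options; a sketch (helper name and example are mine, not from the slides):

```python
def bisect_root(g, a, b, tol=1e-10):
    """Find a root of g in [a, b], assuming g(a) and g(b) have opposite signs.
    Applied to g = f', this locates a stationary point of f.
    The bracket halves each step: linear convergence."""
    ga = g(a)
    if ga == 0.0:
        return a
    while b - a > tol:
        m = 0.5 * (a + b)
        gm = g(m)
        if gm == 0.0:
            return m
        if (ga < 0.0) != (gm < 0.0):   # sign change in [a, m]
            b = m
        else:
            a, ga = m, gm
    return 0.5 * (a + b)

# f(x) = (x - 2)^2, so f'(x) = 2(x - 2); stationary point at x = 2
root = bisect_root(lambda x: 2.0 * (x - 2.0), 0.0, 5.0)
```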

From quadratic to non-linear optimizations

What can happen when far from the optimum?

    −∇f(x) always points in a direction of decrease
    ∇²f(x) may not be positive definite

For convex problems ∇²f is always positive semi-definite, and for strictly
convex problems it is positive definite.

What do we want?

    find a convex neighborhood of x* (be robust against mistakes)
    apply a quadratic approximation (do a linear solve)

Fact: for every non-linear optimization algorithm, there is an f which fools it.

Naïve Newton's method

Newton's method: finding x s.t. ∇f(x) = 0

      Δx_i = −(∇²f(x_i))^{-1} ∇f(x_i)
    x_{i+1} = x_i + Δx_i

Asymptotic convergence, e_i = x_i − x*:

     ∇f(x_i) = ∇²f(x*) e_i + O(||e_i||²)
    ∇²f(x_i) = ∇²f(x*) + O(||e_i||)
     e_{i+1} = e_i − (∇²f_i)^{-1} ∇f_i = O(||e_i||²)

It squares the error at every step (exactly eliminating the linear error).
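The error-squaring is easy to see numerically. A sketch on a strictly convex 1D example of my choosing, f(x) = eˣ − 2x with minimizer x* = ln 2:

```python
import math

def newton_min(fp, fpp, x, steps):
    """Newton's method for f'(x) = 0: x <- x - f'(x)/f''(x)."""
    history = [x]
    for _ in range(steps):
        x = x - fp(x) / fpp(x)
        history.append(x)
    return history

# f(x) = exp(x) - 2x:  f'(x) = exp(x) - 2,  f''(x) = exp(x)
hist = newton_min(lambda x: math.exp(x) - 2.0, math.exp, 1.0, 5)
errs = [abs(x - math.log(2.0)) for x in hist]
# each error is roughly the square of the previous one
```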

Naïve Newton's method

Sources of trouble:

    if ∇²f(x_i) is not posdef, Δx_i = x_{i+1} − x_i might be in an
    increasing direction.

    if ∇²f(x_i) is posdef, (∇f(x_i))^t Δx_i < 0, so Δx_i is a direction of
    decrease (but could overshoot).

    even if f is convex, f(x_{i+1}) ≤ f(x_i) is not assured
    (e.g. f(x) = 1 + e^{−x} + log(1 + e^x) starting from x = 2).

    if all goes well, superlinear convergence!

1D example of Newton trouble

1D example of trouble: f(x) = x⁴ − 2x² + 12x

[Figure: plot of f on roughly −2 ≤ x ≤ 1.5.]

    Has one local minimum
    Is not convex (note the concavity near x = 0)

1D example of Newton trouble

Derivative of trouble: f′(x) = 4x³ − 4x + 12

[Figure: plot of f′ on roughly −2 ≤ x ≤ 1.5.]

The negative-f″ region around x = 0 repels the iterates:

    0 → 3 → 1.96154 → 1.14718 → 0.00658 → 3.00039 → 1.96182
      → 1.14743 → 0.00726 → 3.00047 → 1.96188 → 1.14749 → ···
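The cycle is easy to reproduce by iterating x ← x − f′(x)/f″(x) for this f:

```python
def newton_step(x):
    """One Newton step for f(x) = x^4 - 2x^2 + 12x."""
    fp = 4.0 * x ** 3 - 4.0 * x + 12.0    # f'(x)
    fpp = 12.0 * x ** 2 - 4.0             # f''(x), negative for |x| < 1/sqrt(3)
    return x - fp / fpp

xs = [0.0]
for _ in range(6):
    xs.append(newton_step(xs[-1]))
# xs cycles: 0, 3, 1.96154, 1.14718, 0.00658, 3.00039, ...
```

Each time an iterate lands near 0, the negative f″ there flips the step sign and throws it back out to x ≈ 3.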

Non-linear Newton

Try to enforce f(x_{i+1}) ≤ f(x_i):

      Δx_i = −(λI + ∇²f(x_i))^{-1} ∇f(x_i)
    x_{i+1} = x_i + α_i Δx_i

Set λ ≥ 0 to keep Δx_i in a direction of decrease (many heuristics).
Pick α_i > 0 such that f(x_i + α_i Δx_i) ≤ f(x_i). If Δx_i is a direction of
decrease, some such α_i exists.

    1D-minimization: do a 1D optimization problem,

        min_{α_i ∈ (0, ᾱ]} f(x_i + α_i Δx_i)

    Armijo search: use the rule α_i = σ β^n for the first n such that

        f(x_i + s Δx_i) ≤ f(x_i) + γ s (Δx_i)^t ∇f(x_i),    s = σ β^n

    with σ, β, γ fixed (e.g. σ = 2, β = γ = ½).
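The Armijo rule can be sketched as a backtracking loop (function name is mine; the default constants σ = 2, β = γ = ½ follow the slide):

```python
def armijo_alpha(f, x, g, dx, sigma=2.0, beta=0.5, gamma=0.5, max_halvings=50):
    """Backtracking (Armijo) line search along a descent direction dx.
    Returns alpha = sigma * beta^n for the first n with
        f(x + alpha*dx) <= f(x) + gamma * alpha * (g . dx),
    where g = grad f(x) and g . dx < 0."""
    gdx = sum(gi * di for gi, di in zip(g, dx))
    assert gdx < 0.0, "dx must be a descent direction"
    fx = f(x)
    alpha = sigma
    for _ in range(max_halvings):
        trial = [xi + alpha * di for xi, di in zip(x, dx)]
        if f(trial) <= fx + gamma * alpha * gdx:
            return alpha
        alpha *= beta
    return alpha

# f(x) = x^2 at x = 1, steepest-descent direction dx = -grad = [-2]
alpha = armijo_alpha(lambda v: v[0] ** 2, [1.0], [2.0], [-2.0])
```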

Line searching

1D-minimization looks like less of a hack than Armijo. But for Newton,
asymptotic convergence is not strongly affected, and function evaluations
can be expensive:

    far from x*, their only value is ensuring decrease
    near x*, the methods will return α_i ≈ 1.

If you have a Newton step, accurate line-searching adds little value.

Practicality

Direct (non-iterative, non-structured) solves are expensive!
∇²f information is often expensive!

Iterative methods

Gradient descent:

    1. Search direction: r_i = −∇f(x_i)
    2. Search step: x_{i+1} = x_i + α_i r_i
    3. Pick α (depends on what's cheap):
        1. linearized: α_i = (r_i^t r_i) / (r_i^t (∇²f) r_i)
        2. minimization: min_α f(x_i + α r_i) (danger: low quality)
        3. zero-finding: r_i^t ∇f(x_i + α r_i) = 0

Iterative methods

Conjugate gradient descent:

    1. Search direction: d_i = r_i + β_i d_{i−1}, with r_i = −∇f(x_i).
       Pick β_i without ∇²f:

           β_i = ((r_i − r_{i−1})^t r_i) / (r_{i−1}^t r_{i−1})    (Polak-Ribière)

       can also use

           β_i = (r_i^t r_i) / (r_{i−1}^t r_{i−1})                (Fletcher-Reeves)

    2. Search step: x_{i+1} = x_i + α_i d_i
        1. linearized: α_i = (r_i^t d_i) / (d_i^t (∇²f) d_i)
        2. 1D minimization: min_α f(x_i + α d_i) (danger: low quality)
        3. zero-finding: d_i^t ∇f(x_i + α d_i) = 0
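Putting the pieces together: a sketch of Polak-Ribière non-linear CG, with a simple Armijo backtracking standing in for the zero-finding search (the restart rule and constants are my choices, not from the slides):

```python
def nonlinear_cg(f, grad, x0, iters=400):
    """Polak-Ribiere non-linear CG with backtracking line search.
    d_i = r_i + beta_i d_{i-1}, r_i = -grad f(x_i),
    beta_i = (r_i - r_{i-1}).r_i / (r_{i-1}.r_{i-1}), clipped at 0 (restart)."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    x = list(x0)
    r = [-g for g in grad(x)]
    d = list(r)
    for _ in range(iters):
        rr = dot(r, r)
        if rr < 1e-20:
            break
        if dot(r, d) <= 0.0:          # d not a descent direction: restart
            d = list(r)
        fx, slope, alpha = f(x), -dot(r, d), 1.0
        while True:                   # Armijo backtracking along d
            trial = [xi + alpha * di for xi, di in zip(x, d)]
            if f(trial) <= fx + 1e-4 * alpha * slope or alpha < 1e-14:
                break
            alpha *= 0.5
        x = trial
        r_new = [-g for g in grad(x)]
        beta = max(0.0, (dot(r_new, r_new) - dot(r_new, r)) / rr)
        d = [rn + beta * di for rn, di in zip(r_new, d)]
        r = r_new
    return x

# eccentric quadratic with minimizer (1, -2)
xmin = nonlinear_cg(lambda v: (v[0] - 1.0) ** 2 + 10.0 * (v[1] + 2.0) ** 2,
                    lambda v: [2.0 * (v[0] - 1.0), 20.0 * (v[1] + 2.0)],
                    [0.0, 0.0])
```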

Don't forget the truth about iterative methods

To get good convergence you must precondition! B ≈ (∇²f(x*))^{-1}

With pre-conditioner P:

    1. Search direction: d_i = r_i + β_i d_{i−1}, with r_i = −P^t ∇f(x_i).
    2. Pick β_i = ((r_i − r_{i−1})^t r_i) / (r_{i−1}^t r_{i−1})    (Polak-Ribière)
    3. Search step: x_{i+1} = x_i + α_i d_i
    4. zero-finding: d_i^t ∇f(x_i + α d_i) = 0

With B = P P^t as a change of metric:

    1. Search direction: d_i = r_i + β_i d_{i−1}, with r_i = −∇f(x_i).
    2. Pick β_i = ((r_i − r_{i−1})^t B r_i) / (r_{i−1}^t B r_{i−1})
    3. Search step: x_{i+1} = x_i + α_i d_i
    4. zero-finding: d_i^t B ∇f(x_i + α d_i) = 0

What else?

Remember this cute property?

Theorem (sub-optimality of CG)
(Assuming x_0 = 0) at the end of step k, the solution x_k is the optimal
linear combination of b, Ab, A²b, ..., A^k b for minimizing c − b^t x + ½ x^t A x.

In a sense, CG learns about A from the history of b − A x_i.

Noting,

    1. computer arithmetic errors ruin this nice property quickly
    2. non-linearity ruins this property quickly

Quasi-Newton

Quasi-Newton has much popularity/hype. What if we approximate (∇²f(x*))^{-1}
from the data we have?

    ∇²f(x*) (x_i − x_{k+1}) ≈ ∇f(x_i) − ∇f(x_{k+1})
    x_i − x_{k+1} ≈ (∇²f(x*))^{-1} (∇f(x_i) − ∇f(x_{k+1}))

over some fixed, finite history.

Data: y_i = ∇f(x_i) − ∇f(x_{k+1}), s_i = x_i − x_{k+1}, with 1 ≤ i ≤ k.
Problem: find a symmetric positive definite H_k s.t.

    H_k y_i = s_i

Multiple solutions, but BFGS works best in most situations.

BFGS update

    H_k = (I − (s_k y_k^t)/(y_k^t s_k)) H_{k−1} (I − (y_k s_k^t)/(y_k^t s_k))
          + (s_k s_k^t)/(y_k^t s_k)

Lemma
The BFGS update minimizes ||H^{-1} − H_{k−1}^{-1}||_F² over H such that
H y_k = s_k.

Forming H_k is not necessary; e.g. H_k v can be recursively computed.
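That recursive application of H_k is the well-known two-loop recursion of limited-memory BFGS; a pure-Python sketch (the scaled-identity choice of H_0 is a common convention, not from the slides):

```python
def two_loop(v, pairs):
    """Apply the implicit BFGS inverse Hessian H_k to v without forming H_k.
    pairs is a non-empty history [(s_1, y_1), ..., (s_k, y_k)] of steps and
    gradient differences, oldest first."""
    dot = lambda u, w: sum(a * b for a, b in zip(u, w))
    q = list(v)
    alphas = []
    for s, y in reversed(pairs):              # first loop: newest to oldest
        rho = 1.0 / dot(y, s)
        a = rho * dot(s, q)
        alphas.append(a)
        q = [qi - a * yi for qi, yi in zip(q, y)]
    s, y = pairs[-1]
    gamma = dot(s, y) / dot(y, y)             # H_0 = gamma * I
    r = [gamma * qi for qi in q]
    for (s, y), a in zip(pairs, reversed(alphas)):   # second loop: oldest first
        rho = 1.0 / dot(y, s)
        b = rho * dot(y, r)
        r = [ri + (a - b) * si for ri, si in zip(r, s)]
    return r

# with a single (s, y) pair, the secant condition H y = s holds exactly
s1, y1 = [1.0, 0.0], [2.0, 1.0]
Hy = two_loop(y1, [(s1, y1)])
```

Cost is O(nk) per application for k stored pairs, versus O(n²) for an explicit H_k.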

Quasi-Newton

Typically keep about 5 data points in the history.

    initialize: set H_0 = I, r_0 = −∇f(x_0), d_0 = r_0, go to 3
    1. Compute r_k = −∇f(x_k), y_k = r_{k−1} − r_k
    2. Compute d_k = H_k r_k
    3. Search step: x_{k+1} = x_k + α_k d_k (line-search)

Asymptotically identical to CG (with α_i = (r_i^t d_i) / (d_i^t (∇²f) d_i)).

Armijo line searching has good theoretical properties. Typically used.
Quasi-Newton ideas generalize beyond optimization (e.g. fixed-point
iterations).

Summary

All multi-variate optimizations relate to posdef linear solves.
Simple iterative methods require pre-conditioning to be effective in high
dimensions.
Line searching strategies are highly variable.
Timing and storage of f, ∇f, ∇²f are all critical in selecting your method.

    f         | ∇f   | concerns | method
    fast      | fast | 2,5      | quasi-N (zero-search)
    fast      | fast | 5        | CG (zero-search)
    fast      | slow | 1,2,3    | derivative-free methods
    fast      | slow | 2,5      | quasi-N (min-search)
    fast      | slow | 3,5      | CG (min-search)
    fast/slow | slow | 2,4,5    | quasi-N with Armijo
    fast/slow | slow | 4,5      | CG (linearized α)

    1 = time, 2 = space, 3 = accuracy,
    4 = robust vs. nonlinearity, 5 = precondition

Don't take this table too seriously...
