Introduction To Non-Linear Optimization: Ross A. Lippert


Introduction to non-linear optimization

Ross A. Lippert
D. E. Shaw Research

February 25, 2008

R. A. Lippert

Non-linear optimization

Optimization problems

Problem: let f : R^n → (−∞, ∞],

    find  min_{x ∈ R^n} f(x)
    find  x* s.t. f(x*) = min_{x ∈ R^n} f(x)

Quite general, but some cases, like f convex, are fairly solvable.
Today's problem: how about f : R^n → R, smooth?

    find  x* s.t. ∇f(x*) = 0

We have a reasonable shot at this if f is twice differentiable.


Two pillars of smooth multivariate optimization

    nD optimization  ←  linear solve / quadratic optimization
                     ←  1D optimization

The simplest example we can get

Quadratic optimization: f(x) = c − x^t b + ½ x^t A x.

    very common (actually universal, more later)

Finding ∇f(x) = 0:

    ∇f(x) = −b + Ax = 0
        x = A^{-1} b

A has to be invertible (really, b in the range of A).

Is this all we need?

Max, min, saddle, or what?

Require that A be positive definite. Why?

[Figure: contour plots of quadratics illustrating a minimum, a saddle, and a
maximum, depending on the definiteness of A; only the positive definite case
has a unique minimum.]

Universality of linear algebra in optimization

    f(x) = c − x^t b + ½ x^t A x

Linear solve: x = A^{-1} b.
Even for non-linear problems: if the optimal x* is near our x,

    f(x*) ≈ f(x) + (x* − x)^t ∇f(x) + ½ (x* − x)^t ∇²f(x) (x* − x) + ···
       x* ≈ x − (∇²f(x))^{-1} ∇f(x)

Optimization ⇔ linear solve.

Linear solve

    x = A^{-1} b

But really we just want to solve

    Ax = b

Don't form A^{-1} if you can avoid it.
(Don't form A if you can avoid that!)
For a general A, there are three important special cases:

    diagonal: A = [a_1 0 0; 0 a_2 0; 0 0 a_3], thus x_i = b_i / a_i

    orthogonal: A^t A = I, thus A^{-1} = A^t and x = A^t b

    triangular: A = [a_11 0 0; a_21 a_22 0; a_31 a_32 a_33],
                x_i = (b_i − Σ_{j<i} a_ij x_j) / a_ii
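The triangular case can be sketched as a forward substitution in a few lines of pure Python (the function name and example are illustrative, not from the slides):

```python
def forward_substitute(A, b):
    """Solve A x = b for lower-triangular A: x_i = (b_i - sum_{j<i} a_ij x_j) / a_ii."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        partial = sum(A[i][j] * x[j] for j in range(i))
        x[i] = (b[i] - partial) / A[i][i]
    return x

# lower-triangular example: the solution is [1, 2, 1]
x = forward_substitute([[2.0, 0.0, 0.0], [1.0, 3.0, 0.0], [4.0, 5.0, 3.0]],
                       [2.0, 7.0, 17.0])
```

Upper-triangular systems work the same way, running the loop from the last row up (back substitution).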

Direct methods

A is symmetric positive definite.
Cholesky factorization:

    A = L L^t,

where L is lower triangular. So x = L^{-t} L^{-1} b by

    L z = b:    z_i = (b_i − Σ_{j<i} L_ij z_j) / L_ii
    L^t x = z:  x_i = (z_i − Σ_{j>i} L_ji x_j) / L_ii
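A minimal sketch of the whole pipeline — factor A = L L^t, then the forward and backward substitutions above (pure Python for illustration; in practice you would call a library routine):

```python
import math

def cholesky(A):
    """Factor symmetric positive definite A as L L^t, with L lower triangular."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_spd(A, b):
    """Solve A x = b for SPD A via A = L L^t: forward then backward substitution."""
    n = len(b)
    L = cholesky(A)
    z = [0.0] * n
    for i in range(n):                     # L z = b
        z[i] = (b[i] - sum(L[i][j] * z[j] for j in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):           # L^t x = z
        x[i] = (z[i] - sum(L[j][i] * x[j] for j in range(i + 1, n))) / L[i][i]
    return x

# SPD example: 4x + 2y = 8, 2x + 3y = 7 has solution (1.25, 1.5)
x = solve_spd([[4.0, 2.0], [2.0, 3.0]], [8.0, 7.0])
```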

Direct methods

A is symmetric positive definite.

Eigenvalue factorization:

    A = Q D Q^t,

where Q is orthogonal and D is diagonal. Then

    x = Q (D^{-1} (Q^t b)).

More expensive than Cholesky.

Direct methods are usually quite expensive (O(n³) work).

Iterative method basics

What's an iterative method?

Definition (informal)
An iterative method is an algorithm A which takes what you have, x_i, and
gives you a new x_{i+1} which is less bad, such that x_1, x_2, x_3, ...
converges to some x with badness = 0.

A notion of badness could come from

    distance from x_i to our problem's solution
    value of some objective function above its minimum
    size of the gradient at x_i

e.g. if x is supposed to satisfy Ax = b, we could take ||b − Ax|| to be the
measure of badness.

Iterative method considerations

How expensive is one x_i → x_{i+1} step?
How quickly does the badness B_i decrease per step?
A thousand and one years of experience yields two cases:

    B_i ∝ β^i       for some β ∈ (0, 1)      (linear)
    B_i ∝ β^(γ^i)   for β ∈ (0, 1), γ > 1    (superlinear)

[Figure: two linear-scale plots of badness vs. iteration, one for each
convergence rate.]

Can you tell the difference?

Convergence

[Figure: the same badness-vs-iteration data on semilog axes; linear
convergence appears as a straight line, superlinear convergence curves
downward.]

Now can you tell the difference?

When evaluating an iterative method against a manufacturer's claims, be sure
to do semilog plots.

Iterative methods

Motivation: directly optimize f(x) = c − x^t b + ½ x^t A x.
Gradient descent:

    1. Search direction: r_i = −∇f = b − A x_i
    2. Search step: x_{i+1} = x_i + α_i r_i
    3. Pick α: α_i = (r_i^t r_i) / (r_i^t A r_i) minimizes f(x_i + α r_i)

    f(x_i + α r_i) = c − x_i^t b + ½ x_i^t A x_i + α r_i^t (A x_i − b) + ½ α² r_i^t A r_i
                   = f(x_i) − α r_i^t r_i + ½ α² r_i^t A r_i

(Cost of a step = 1 A-multiply.)
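The loop above can be sketched directly (pure Python; for clarity this version recomputes the residual with a second A-multiply, whereas the one-multiply variant of the slide would update r_{i+1} = r_i − α A r_i instead):

```python
def grad_descent_quadratic(A, b, x0, steps):
    """Minimize f(x) = c - x^t b + 0.5 x^t A x by gradient descent
    with exact line search: alpha_i = (r^t r)/(r^t A r)."""
    n = len(b)
    x = list(x0)
    for _ in range(steps):
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        r = [b[i] - Ax[i] for i in range(n)]       # r_i = b - A x_i
        rr = sum(ri * ri for ri in r)
        if rr == 0.0:
            break
        Ar = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]
        alpha = rr / sum(r[i] * Ar[i] for i in range(n))
        x = [x[i] + alpha * r[i] for i in range(n)]
    return x

# A x = b with A = [[4,1],[1,3]], b = [1,2] has solution (1/11, 7/11)
x = grad_descent_quadratic([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], [0.0, 0.0], 100)
```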

Iterative methods

Optimize f(x) = c − x^t b + ½ x^t A x.
Conjugate gradient descent:

    1. Search direction: d_i = r_i + β_i d_{i−1}, with r_i = b − A x_i.
       Pick β_i = −(d_{i−1}^t A r_i) / (d_{i−1}^t A d_{i−1}); this ensures
       d_{i−1}^t A d_i = 0.
    2. Search step: x_{i+1} = x_i + α_i d_i.
       Pick α_i = (d_i^t r_i) / (d_i^t A d_i): minimizes f(x_i + α d_i)

    f(x_i + α d_i) = c − x_i^t b + ½ x_i^t A x_i − α d_i^t r_i + ½ α² d_i^t A d_i

(also means that r_{i+1}^t d_i = 0)

Avoid the extra A-multiply: using α_{i−1} A d_{i−1} = r_{i−1} − r_i,

    β_i = −((r_{i−1} − r_i)^t r_i) / ((r_{i−1} − r_i)^t d_{i−1})
        = ((r_i − r_{i−1})^t r_i) / (r_{i−1}^t r_{i−1})
        = (r_i^t r_i) / (r_{i−1}^t r_{i−1})
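These recurrences condense into the standard linear-CG loop, one A-multiply per iteration (a minimal pure-Python sketch):

```python
def conjugate_gradient(A, b, x0, tol=1e-12, max_iter=100):
    """Linear CG: d_i = r_i + beta_i d_{i-1}, beta_i = r_i^t r_i / r_{i-1}^t r_{i-1}."""
    n = len(b)
    x = list(x0)
    r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    d = list(r)
    rr = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rr < tol:
            break
        Ad = [sum(A[i][j] * d[j] for j in range(n)) for i in range(n)]
        alpha = rr / sum(d[i] * Ad[i] for i in range(n))
        x = [x[i] + alpha * d[i] for i in range(n)]
        r = [r[i] - alpha * Ad[i] for i in range(n)]   # one A-multiply reused
        rr_new = sum(ri * ri for ri in r)
        beta = rr_new / rr
        d = [r[i] + beta * d[i] for i in range(n)]
        rr = rr_new
    return x

# same 2x2 SPD system as before; CG finishes in at most n = 2 steps
x = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], [0.0, 0.0])
```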

A cute result

Conjugate gradient descent:

    1. r_i = b − A x_i
    2. Search direction: d_i = r_i + β_i d_{i−1} (β s.t. d_i^t A d_{i−1} = 0)
    3. Search step: x_{i+1} = x_i + α_i d_i (α minimizes).

Cute result (not that useful in practice):

Theorem (sub-optimality of CG)
(Assuming x_0 = 0) at the end of step k, the solution x_k is the optimal
linear combination of b, Ab, A²b, ..., A^k b for minimizing c − b^t x + ½ x^t A x.

(computer arithmetic errors make this less than perfect)

Very little extra effort. Much better convergence.

Slow convergence: conditioning

The eccentricity of the quadratic is a big factor in convergence.

[Figure: contours of a highly eccentric (ill-conditioned) quadratic.]

Convergence and eccentricity

    κ = max eig(A) / min eig(A)

For gradient descent,

    ||r_i|| ∝ ((κ − 1) / (κ + 1))^i

For CG,

    ||r_i|| ∝ ((√κ − 1) / (√κ + 1))^i

Useless CG fact: in exact arithmetic, r_i = 0 when i > n (A is n × n).

The truth about descent methods

Very slow unless κ can be controlled.
How do we control κ?

    Ax = b  →  (P A P^t) y = P b,    x = P^t y

where P is a pre-conditioner you pick.

How to make κ(P A P^t) small?

    perfect answer: P = L^{-1} where L L^t = A (Cholesky factorization)
    imperfect answer: P ≈ L^{-1}

Variations on the theme of incomplete factorization:

    P^{-1} = D^{1/2} where D = diag(a_11, ..., a_nn)
    more generally, incomplete Cholesky decomposition
    some easy nearby solution or simple approximate A
    (requiring domain knowledge)

Class project?

One idea for a preconditioner is a block-diagonal matrix

    P^{-1} = [L_11 0 0; 0 L_22 0; 0 0 L_33]

where L_ii^t L_ii = A_ii, a diagonal block of A.
In what sense does good clustering give good preconditioners?

End of solvers: there are a few other iterative solvers out there I haven't
discussed.

Second pillar: 1D optimization

1D optimization gives important insights into non-linearity.

    min_{s ∈ R} f(s),    f continuous.

A derivative-free option:
A bracket is (a, b, c) s.t. a < b < c and f(a) > f(b) < f(c); then f has a
local min for a < x < c.

Golden search: based on picking a < b′ < b < c, either (a, b′, b) or
(b′, b, c) is a new bracket... continue.
Linearly convergent, e_i ∝ G^{−i}, G the golden ratio.
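Golden search can be sketched as a bracket-shrinking loop (pure Python; the function name and example are illustrative):

```python
import math

def golden_section(f, a, c, tol=1e-8):
    """Shrink a bracket (a, c) around a minimum of a unimodal f.
    Each pass shrinks the bracket by 1/G ~ 0.618: linear convergence."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0
    b = c - invphi * (c - a)
    bp = a + invphi * (c - a)
    fb, fbp = f(b), f(bp)
    while c - a > tol:
        if fb < fbp:              # (a, b, bp) is the new bracket
            c, bp, fbp = bp, b, fb
            b = c - invphi * (c - a)
            fb = f(b)
        else:                     # (b, bp, c) is the new bracket
            a, b, fb = b, bp, fbp
            bp = a + invphi * (c - a)
            fbp = f(bp)
    return 0.5 * (a + c)

s = golden_section(lambda x: (x - 2.0) ** 2, 0.0, 5.0)   # minimizer is 2
```

One function evaluation per iteration: the golden ratio is chosen precisely so one of the old interior points is reused as an interior point of the new bracket.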

1D optimization

Fundamentally limited accuracy of derivative-free argmin: near the minimum
f is flat (f′ = 0), so the argmin can only be located to about the square
root of the precision of the function values.

Derivative-based methods, f′(s) = 0, for an accurate argmin:

    bracketed: (a, b) s.t. f′(a), f′(b) of opposite sign
        1. bisection (linearly convergent)
        2. modified regula falsi & Brent's method (superlinear)

    unbracketed:
        1. secant method (superlinear)
        2. Newton's method (superlinear; requires another derivative)
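Bisection on f′ is the simplest of the bracketed options; a sketch (helper name and example are mine, not from the slides):

```python
def bisect_root(g, a, b, tol=1e-10):
    """Find a root of g in [a, b], assuming g(a) and g(b) have opposite signs.
    Applied to g = f', this locates a stationary point of f.
    The bracket halves each step: linear convergence."""
    ga = g(a)
    if ga == 0.0:
        return a
    while b - a > tol:
        m = 0.5 * (a + b)
        gm = g(m)
        if gm == 0.0:
            return m
        if (ga < 0.0) != (gm < 0.0):   # sign change in [a, m]
            b = m
        else:
            a, ga = m, gm
    return 0.5 * (a + b)

# f(x) = (x - 2)^2, so f'(x) = 2(x - 2); stationary point at x = 2
root = bisect_root(lambda x: 2.0 * (x - 2.0), 0.0, 5.0)
```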

From quadratic to non-linear optimizations

What can happen when far from the optimum?

    −∇f(x) always points in a direction of decrease
    ∇²f(x) may not be positive definite

For convex problems ∇²f is always positive semi-definite, and for strictly
convex problems it is positive definite.

What do we want?

    find a convex neighborhood of x* (be robust against mistakes)
    apply a quadratic approximation (do a linear solve)

Fact: for every non-linear optimization algorithm, there is an f which fools it.

Naïve Newton's method

Newton's method: finding x s.t. ∇f(x) = 0

      Δx_i = −(∇²f(x_i))^{-1} ∇f(x_i)
    x_{i+1} = x_i + Δx_i

Asymptotic convergence, e_i = x_i − x*:

     ∇f(x_i) = ∇²f(x*) e_i + O(||e_i||²)
    ∇²f(x_i) = ∇²f(x*) + O(||e_i||)
     e_{i+1} = e_i − (∇²f_i)^{-1} ∇f_i = O(||e_i||²)

It squares the error at every step (exactly eliminating the linear error).
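The error-squaring is easy to see numerically. A sketch on a strictly convex 1D example of my choosing, f(x) = eˣ − 2x with minimizer x* = ln 2:

```python
import math

def newton_min(fp, fpp, x, steps):
    """Newton's method for f'(x) = 0: x <- x - f'(x)/f''(x)."""
    history = [x]
    for _ in range(steps):
        x = x - fp(x) / fpp(x)
        history.append(x)
    return history

# f(x) = exp(x) - 2x:  f'(x) = exp(x) - 2,  f''(x) = exp(x)
hist = newton_min(lambda x: math.exp(x) - 2.0, math.exp, 1.0, 5)
errs = [abs(x - math.log(2.0)) for x in hist]
# each error is roughly the square of the previous one
```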

Naïve Newton's method

Sources of trouble:

    if ∇²f(x_i) is not posdef, Δx_i = x_{i+1} − x_i might be in an
    increasing direction.

    if ∇²f(x_i) is posdef, (∇f(x_i))^t Δx_i < 0, so Δx_i is a direction of
    decrease (but could overshoot).

    even if f is convex, f(x_{i+1}) ≤ f(x_i) is not assured
    (e.g. f(x) = 1 + e^{−x} + log(1 + e^x) starting from x = 2).

    if all goes well, superlinear convergence!

1D example of Newton trouble

1D example of trouble: f(x) = x⁴ − 2x² + 12x

[Figure: plot of f on roughly −2 ≤ x ≤ 1.5.]

    Has one local minimum
    Is not convex (note the concavity near x = 0)

1D example of Newton trouble

Derivative of trouble: f′(x) = 4x³ − 4x + 12

[Figure: plot of f′ on roughly −2 ≤ x ≤ 1.5.]

The negative-f″ region around x = 0 repels the iterates:

    0 → 3 → 1.96154 → 1.14718 → 0.00658 → 3.00039 → 1.96182
      → 1.14743 → 0.00726 → 3.00047 → 1.96188 → 1.14749 → ···
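The cycle is easy to reproduce by iterating x ← x − f′(x)/f″(x) for this f:

```python
def newton_step(x):
    """One Newton step for f(x) = x^4 - 2x^2 + 12x."""
    fp = 4.0 * x ** 3 - 4.0 * x + 12.0    # f'(x)
    fpp = 12.0 * x ** 2 - 4.0             # f''(x), negative for |x| < 1/sqrt(3)
    return x - fp / fpp

xs = [0.0]
for _ in range(6):
    xs.append(newton_step(xs[-1]))
# xs cycles: 0, 3, 1.96154, 1.14718, 0.00658, 3.00039, ...
```

Each time an iterate lands near 0, the negative f″ there flips the step sign and throws it back out to x ≈ 3.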

Non-linear Newton

Try to enforce f(x_{i+1}) ≤ f(x_i):

      Δx_i = −(λI + ∇²f(x_i))^{-1} ∇f(x_i)
    x_{i+1} = x_i + α_i Δx_i

Set λ ≥ 0 to keep Δx_i in a direction of decrease (many heuristics).
Pick α_i > 0 such that f(x_i + α_i Δx_i) ≤ f(x_i). If Δx_i is a direction of
decrease, some such α_i exists.

    1D-minimization: do a 1D optimization problem,

        min_{α_i ∈ (0, ᾱ]} f(x_i + α_i Δx_i)

    Armijo search: use the rule α_i = σ β^n for the first n such that

        f(x_i + s Δx_i) ≤ f(x_i) + γ s (Δx_i)^t ∇f(x_i),    s = σ β^n

    with σ, β, γ fixed (e.g. σ = 2, β = γ = ½).
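The Armijo rule can be sketched as a backtracking loop (function name is mine; the default constants σ = 2, β = γ = ½ follow the slide):

```python
def armijo_alpha(f, x, g, dx, sigma=2.0, beta=0.5, gamma=0.5, max_halvings=50):
    """Backtracking (Armijo) line search along a descent direction dx.
    Returns alpha = sigma * beta^n for the first n with
        f(x + alpha*dx) <= f(x) + gamma * alpha * (g . dx),
    where g = grad f(x) and g . dx < 0."""
    gdx = sum(gi * di for gi, di in zip(g, dx))
    assert gdx < 0.0, "dx must be a descent direction"
    fx = f(x)
    alpha = sigma
    for _ in range(max_halvings):
        trial = [xi + alpha * di for xi, di in zip(x, dx)]
        if f(trial) <= fx + gamma * alpha * gdx:
            return alpha
        alpha *= beta
    return alpha

# f(x) = x^2 at x = 1, steepest-descent direction dx = -grad = [-2]
alpha = armijo_alpha(lambda v: v[0] ** 2, [1.0], [2.0], [-2.0])
```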

Line searching

1D-minimization looks like less of a hack than Armijo. But for Newton,
asymptotic convergence is not strongly affected, and function evaluations
can be expensive:

    far from x*, their only value is ensuring decrease
    near x*, the methods will return α_i ≈ 1.

If you have a Newton step, accurate line-searching adds little value.

Practicality

Direct (non-iterative, non-structured) solves are expensive!
∇²f information is often expensive!

Iterative methods

Gradient descent:

    1. Search direction: r_i = −∇f(x_i)
    2. Search step: x_{i+1} = x_i + α_i r_i
    3. Pick α (depends on what's cheap):
        1. linearized: α_i = (r_i^t r_i) / (r_i^t (∇²f) r_i)
        2. minimization: min_α f(x_i + α r_i) (danger: low quality)
        3. zero-finding: r_i^t ∇f(x_i + α r_i) = 0

Iterative methods

Conjugate gradient descent:

    1. Search direction: d_i = r_i + β_i d_{i−1}, with r_i = −∇f(x_i).
       Pick β_i without ∇²f:

           β_i = ((r_i − r_{i−1})^t r_i) / (r_{i−1}^t r_{i−1})    (Polak-Ribière)

       can also use

           β_i = (r_i^t r_i) / (r_{i−1}^t r_{i−1})                (Fletcher-Reeves)

    2. Search step: x_{i+1} = x_i + α_i d_i
        1. linearized: α_i = (r_i^t d_i) / (d_i^t (∇²f) d_i)
        2. 1D minimization: min_α f(x_i + α d_i) (danger: low quality)
        3. zero-finding: d_i^t ∇f(x_i + α d_i) = 0
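Putting the pieces together: a sketch of Polak-Ribière non-linear CG, with a simple Armijo backtracking standing in for the zero-finding search (the restart rule and constants are my choices, not from the slides):

```python
def nonlinear_cg(f, grad, x0, iters=400):
    """Polak-Ribiere non-linear CG with backtracking line search.
    d_i = r_i + beta_i d_{i-1}, r_i = -grad f(x_i),
    beta_i = (r_i - r_{i-1}).r_i / (r_{i-1}.r_{i-1}), clipped at 0 (restart)."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    x = list(x0)
    r = [-g for g in grad(x)]
    d = list(r)
    for _ in range(iters):
        rr = dot(r, r)
        if rr < 1e-20:
            break
        if dot(r, d) <= 0.0:          # d not a descent direction: restart
            d = list(r)
        fx, slope, alpha = f(x), -dot(r, d), 1.0
        while True:                   # Armijo backtracking along d
            trial = [xi + alpha * di for xi, di in zip(x, d)]
            if f(trial) <= fx + 1e-4 * alpha * slope or alpha < 1e-14:
                break
            alpha *= 0.5
        x = trial
        r_new = [-g for g in grad(x)]
        beta = max(0.0, (dot(r_new, r_new) - dot(r_new, r)) / rr)
        d = [rn + beta * di for rn, di in zip(r_new, d)]
        r = r_new
    return x

# eccentric quadratic with minimizer (1, -2)
xmin = nonlinear_cg(lambda v: (v[0] - 1.0) ** 2 + 10.0 * (v[1] + 2.0) ** 2,
                    lambda v: [2.0 * (v[0] - 1.0), 20.0 * (v[1] + 2.0)],
                    [0.0, 0.0])
```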

Don't forget the truth about iterative methods

To get good convergence you must precondition! B ≈ (∇²f(x*))^{-1}

With pre-conditioner P:

    1. Search direction: d_i = r_i + β_i d_{i−1}, with r_i = −P^t ∇f(x_i).
    2. Pick β_i = ((r_i − r_{i−1})^t r_i) / (r_{i−1}^t r_{i−1})    (Polak-Ribière)
    3. Search step: x_{i+1} = x_i + α_i d_i
    4. zero-finding: d_i^t ∇f(x_i + α d_i) = 0

With B = P P^t as a change of metric:

    1. Search direction: d_i = r_i + β_i d_{i−1}, with r_i = −∇f(x_i).
    2. Pick β_i = ((r_i − r_{i−1})^t B r_i) / (r_{i−1}^t B r_{i−1})
    3. Search step: x_{i+1} = x_i + α_i d_i
    4. zero-finding: d_i^t B ∇f(x_i + α d_i) = 0

What else?

Remember this cute property?

Theorem (sub-optimality of CG)
(Assuming x_0 = 0) at the end of step k, the solution x_k is the optimal
linear combination of b, Ab, A²b, ..., A^k b for minimizing c − b^t x + ½ x^t A x.

In a sense, CG learns about A from the history of b − A x_i.

Noting,

    1. computer arithmetic errors ruin this nice property quickly
    2. non-linearity ruins this property quickly

Quasi-Newton

Quasi-Newton has much popularity/hype. What if we approximate (∇²f(x*))^{-1}
from the data we have?

    ∇²f(x*) (x_i − x_{k+1}) ≈ ∇f(x_i) − ∇f(x_{k+1})
    x_i − x_{k+1} ≈ (∇²f(x*))^{-1} (∇f(x_i) − ∇f(x_{k+1}))

over some fixed, finite history.

Data: y_i = ∇f(x_i) − ∇f(x_{k+1}), s_i = x_i − x_{k+1}, with 1 ≤ i ≤ k.
Problem: find a symmetric positive definite H_k s.t.

    H_k y_i = s_i

Multiple solutions, but BFGS works best in most situations.

BFGS update

    H_k = (I − (s_k y_k^t)/(y_k^t s_k)) H_{k−1} (I − (y_k s_k^t)/(y_k^t s_k))
          + (s_k s_k^t)/(y_k^t s_k)

Lemma
The BFGS update minimizes ||H^{-1} − H_{k−1}^{-1}||_F² over H such that
H y_k = s_k.

Forming H_k is not necessary; e.g. H_k v can be recursively computed.
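That recursive application of H_k is the well-known two-loop recursion of limited-memory BFGS; a pure-Python sketch (the scaled-identity choice of H_0 is a common convention, not from the slides):

```python
def two_loop(v, pairs):
    """Apply the implicit BFGS inverse Hessian H_k to v without forming H_k.
    pairs is a non-empty history [(s_1, y_1), ..., (s_k, y_k)] of steps and
    gradient differences, oldest first."""
    dot = lambda u, w: sum(a * b for a, b in zip(u, w))
    q = list(v)
    alphas = []
    for s, y in reversed(pairs):              # first loop: newest to oldest
        rho = 1.0 / dot(y, s)
        a = rho * dot(s, q)
        alphas.append(a)
        q = [qi - a * yi for qi, yi in zip(q, y)]
    s, y = pairs[-1]
    gamma = dot(s, y) / dot(y, y)             # H_0 = gamma * I
    r = [gamma * qi for qi in q]
    for (s, y), a in zip(pairs, reversed(alphas)):   # second loop: oldest first
        rho = 1.0 / dot(y, s)
        b = rho * dot(y, r)
        r = [ri + (a - b) * si for ri, si in zip(r, s)]
    return r

# with a single (s, y) pair, the secant condition H y = s holds exactly
s1, y1 = [1.0, 0.0], [2.0, 1.0]
Hy = two_loop(y1, [(s1, y1)])
```

Cost is O(nk) per application for k stored pairs, versus O(n²) for an explicit H_k.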

Quasi-Newton

Typically keep about 5 data points in the history.

    initialize: set H_0 = I, r_0 = −∇f(x_0), d_0 = r_0, go to 3
    1. Compute r_k = −∇f(x_k), y_k = r_{k−1} − r_k
    2. Compute d_k = H_k r_k
    3. Search step: x_{k+1} = x_k + α_k d_k (line-search)

Asymptotically identical to CG (with α_i = (r_i^t d_i) / (d_i^t (∇²f) d_i)).

Armijo line searching has good theoretical properties. Typically used.
Quasi-Newton ideas generalize beyond optimization (e.g. fixed-point
iterations).

Summary

All multi-variate optimizations relate to posdef linear solves.
Simple iterative methods require pre-conditioning to be effective in high
dimensions.
Line searching strategies are highly variable.
Timing and storage of f, ∇f, ∇²f are all critical in selecting your method.

    f         | ∇f   | concerns | method
    fast      | fast | 2,5      | quasi-N (zero-search)
    fast      | fast | 5        | CG (zero-search)
    fast      | slow | 1,2,3    | derivative-free methods
    fast      | slow | 2,5      | quasi-N (min-search)
    fast      | slow | 3,5      | CG (min-search)
    fast/slow | slow | 2,4,5    | quasi-N with Armijo
    fast/slow | slow | 4,5      | CG (linearized α)

    1 = time, 2 = space, 3 = accuracy,
    4 = robust vs. nonlinearity, 5 = precondition

Don't take this table too seriously...
