You are on page 1of 37

Lecture 2

3B1B Optimization

Michaelmas 2014

Newtons method
Line search

Quasi-Newton methods
Least-Squares and Gauss-Newton methods
Downhill simplex (amoeba) algorithm

A. Zisserman

Optimization for General Functions

f (x, y) = exp(x)(4x2 + 2y 2 + 4xy + 2x + 1)
5

-1
-4

-3.5

-3

-2.5

-2

-1.5

-1

-0.5

0.5

Apply methods developed using quadratic Taylor series expansion

Rosenbrocks function
2
2
2
f (x, y) = 100(y x ) + (1 x)
Rosenbrock function
3
2.5
2
1.5
1
0.5
0
-0.5
-1
-2

-1

Minimum is at [1, 1]

Steepest descent
The 1D line minimization must be performed using one of the earlier
methods (usually cubic polynomial interpolation)
Steepest Descent

Steepest Descent

3
2.5
0.85

2
0.8

1.5
1

0.75

0.5
0

0.7

-0.5
0.65

-1
-2

-1

-0.95

-0.9

-0.85

-0.8

The zig-zag behaviour is clear in the zoomed view (100 iterations)

The algorithm crawls down the valley

-0.75

Performance issues for optimization algorithms

1. Number of iterations required

2. Cost per iteration

3. Memory footprint

4. Region of convergence

Recall from lecture 1: Newtons method in 1D

Fit a quadratic approximation to f (x) using both first and second derivatives at x.

Expand f (x) locally using a Taylor series.

x2 00
f (x) + h.o.t
f (x + x) = f (x) + xf (x) +
2
0

f 0(x)
x = 00
f (x)

Update x

f 0(xn)
xn+1 = xn 00
f (xn)

Recall from lecture 1: Taylor expansion in 2D

A function may be approximated locally by its Taylor series expansion
f (x0 + x) f (x0) +

f f
,
x y

x
y

2f
1
2

+ (x, y) x
2f
2
xy

The expansion to second order is a quadratic function

1
f (x0 + x) = a + g>x + x>H x
2

!
2f
x
xy

2f
y
y 2

+ h.o.t

Newtons method in ND
Expand f (x) by its Taylor series about the point xn
1 >
f (xn + x) f (xn) + g>
n x + x H n x
2
where the gradient is the vector
"

f
f
gn = f (xn) =
,...,
x1
xN

#>

H n = H (xn) =

2f
x2
1

..

...
...

2f
x1xN

2f
x1xN

f
x2
N

For a minimum we require that f (x) = 0, and so

f (x) = gn + H nx = 0

with solution x = H1
n gn. This gives the iterative update

xn+1 = xn H1
n gn

Assume that H is positive definite (all eigenvalues greater than zero)

xn+1 = xn + x
= xn H1
n gn
If f (x) is quadratic, then the solution is found in one step.
The method has quadratic convergence (as in the 1D case).
The solution x = H1
n gn is guaranteed to be a downhill direction
(provided that H is positive definite)
For numerical reasons the inverse is not actually computed, instead
x is computed as the solution of H x = gn.
Rather than jump straight to xn H1
n gn , it is better to perform a
line search which ensures global convergence

xn+1 = xn nH1
n gn
If H = I then this reduces to steepest descent.

Newtons method - example

Newton method with line search

Newton method with line search

2.5

2.5

1.5

1.5

0.5

0.5

-0.5

-0.5

-1
-2

-1

0
gradient < 1e-3 after 15 iterations

-1
-2

-1.5

-1

-0.5

0.5

1.5

The algorithm converges in only 15 iterations compared to hundreds

for steepest descent.
However, the method requires computing the Hessian matrix at each
iteration this is not always feasible

Quasi-Newton methods
If the problem size is large and the Hessian matrix is dense then it
may be infeasible/inconvenient to compute it directly.

Quasi-Newton methods avoid this problem by keeping a rolling

estimate of H (x), updated at each iteration using new gradient
information.

Common schemes are due to Broyden, Fletcher, Goldfarb and Shanno

(BFGS), and also Davidson, Fletcher and Powell (DFP).

e.g. in 1D

First derivatives
f f0
f f1
h
h
and f 0 (x0 ) = 0
f 0(x0 + ) = 1
2
h
2
h
second derivative

f(x)

f0f1
f1 f0

f 2f0 + f1
h
= 1
f 00(x0) = h
2

h
x-1

h
x0

x+1

For H n+1 build an approximation from H n, gn, gn+1, xn, xn+1

Quasi-Newton: BFGS
Set H 0 = I.
Update according to

qnq>
(H nsn) (H nsn)>
n
H n+1 = H n + >
qn sn
s>
n H n sn
where

sn = xn+1 xn
qn = gn+1 gn
The matrix itself is not stored, but rather represented compactly by
a few stored vectors.
The estimate H n+1 is used to form a local quadratic approximation
as before.

Example
Rosenbrock function
3
2.5
2
1.5
1
0.5
0
-0.5
-1
-2

-1

The method converges in 25 iterations, compared to 15 for the full-Newton

method
In Matlab the optimization function fminunc uses a BFGS quasi-Newton
method for medium scale optimization problems

Matlab fminunc
>> f='100*(x(2)-x(1)^2)^2+(1-x(1))^2';
Choose options for BFGS quasi-Newton
>> OPTIONS=optimset('LargeScale','off', 'HessUpdate','bfgs' );
Start point
>> x = [-1.9; 2];
This produces
x = 0.9998, 0.9996

fval = 3.4306e-008

Non-linear least squares

It is very common in applications for a cost function f (x) to be the
sum of a large number of squared residuals

f (x) =

M
X

ri2

i=1

If each residual depends non-linearly on the parameters x then the

minimization of f (x) is a non-linear least squares problem.

Linear least squares reminder

The goal is to fit a smooth curve to measured
data points {si, ti} by minimizing the cost
f (x) =

i=1

ri2 =

i=1

(y(si, x) ti)2

target value
s

For example, the regression function y(si, x)

might be polynomial
y(s, x) = x0 + x1s + x2s2 + . . .
In this case the function is linear in the parameter x and there is a closed form solution. In
general there will not be a closed form solution
for non-linear functions y(s, x).

Non-linear least squares example: aligning a 3D model to an image

Cost function

Transformation parameters:
3D rotation matrix R

translation 3-vector T = (Tx, Ty , Tz )>

Image generation:
rotate and translate 3D model by R and T
project to generate image IR,T(x, y)

f (x ) =

M
X

i=1

ri2 = krk2

r1
x1
J(x) =
..
r
M
x1

...
...

r1
xN

rM
xN

assume
M>N

Consider
X
X 2
r
ri =
2ri i
xk i
xk
i

Hence
f (x) = 2J>r

J>

X ri
2 X 2
ri = 2
ri
xl xk i
xl i
xk

2ri
= 2
+2
ri
xk xl
i xk xl
i
X ri ri

Hence
2 J> J

H ( x) =

M
X

riR i

i=1

J>

Ri

Note that the second-order term in the Hessian H (x) is multiplied by

the residuals ri.
In most problems, the residuals will typically be small.
Also, at the minimum, the residuals will typically be distributed with
mean = 0.

For these reasons, the second-order term is often ignored, giving the
Gauss-Newton approximation to the Hessian :

H (x) = 2J>J
Hence, explicit computation of the full Hessian can again be avoided.

Example Gauss-Newton
The minimization of the Rosenbrock function
f (x, y) = 100(y x2)2 + (1 x)2
can be written as a least-squares problem with residual vector

r=

r1
x
J(x) = r
2
x

"

10(y x2)
(1 x)

!
r1
20x 10
y =
r2
1
0
y

H(x) =

2f
x2
2f
xy

2f

xy
=
2f
y 2

"

400x

200

2J>J = 2

"

20x 1
10
0

#"

20x 10
1
0

"

800x2 + 2 400x
400x

200

Summary: Gauss-Newton optimization

For a cost function f (x) that is a sum of squared residuals
f (x) =

ri2

i=1

The Hessian can be approximated as

H (x) = 2J>J
and the gradient is given by
f (x) = 2J>r
So, the Newton update step

xn+1 = xn + x
= xn H1
n gn
computed as H x = gn, becomes

J>J x = J>r
These are called the normal equations.

>J
xn+1 = xn nH1
g
with
H
(
x
)
=
2
J
n
n n
n n
Gauss-Newton method with line search

Gauss-Newton method with line search

2.5

2.5

1.5

1.5

0.5

0.5

-0.5

-0.5

-1
-2

-1

0
gradient < 1e-3 after 14 iterations

-1
-2

-1.5

-1

-0.5

0.5

minimization with the Gauss-Newton approximation with line search

takes only 14 iterations

1.5

Comparison
Gauss-Newton

Newton
Newton method with line search

2.5

2.5

1.5

1.5

0.5

0.5

-0.5

-0.5

-1
-2

-1

-1
-2

-1

requires computing Hessian

(i.e. n^2 second derivatives)

approximates Hessian by
Jacobian product

algorithm

The downhill simplex (amoeba) algorithm

Due to Nelder and Mead (1965)
A direct method: only uses function evaluations (no derivatives)
a simplex is the polytope in N dimensions with N+1 vertices, e.g.
2D: triangle
3D: tetrahedron
basic idea: move by reflections, expansions or contractions

start

reflect

expand

contract

One iteration of the simplex algorithm

Reorder the points so that f (xn+1 ) > f (x2) > f (x1) (i.e. xn+1 is the worst point).
Generate a trial point xr by reflection

xr =
x + (
x xn+1)

P
where
x=
i xi /(N + 1) is the centroid and > 0. Compute f (xr ), and there
are then 3 possibilities:
1. f (x1) < f (xr ) < f (xn ) (i.e. xr is neither the new best or worst point), replace
xn+1 by xr .

2. f (xr ) < f (x1 ) (i.e. xr is the new best point), then assume direction of reflection
is good and generate a new point by expansion

xe = xr + (xr
x)

where > 0. If f (xe) < f (xr ) then replace xn+1 by xe, otherwise, the expansion
has failed, replace xn+1 by xr .
3. f (xr ) > f (xn) then assume the polytope is too large and generate a new point
by contraction

xc =
x + (xn+1
x)
where (0 < < 1) is the contraction coecient. If f (xc) < f (xn+1 ) then the
contraction has succeeded and replace xn+1 by xc, otherwise contract again.
Standard values are = 1, = 1, = 0.5.

Example

Path of best vertex

Downhill Simplex

3
2.5
2
1.5
1
0.5
0
-0.5
-1
-2

-1

2.7
2.6
2.5
2.4
2.3
2.2
2.1
2
1.9
1.8
-2

-1.8

-1.6

-1.4

detail

-1.2

-1

Matlab fminsearch
with 200 iterations

0
-1

-1

contractions

0
-1

0
-1
-1

XX

YY

3
2

YY

3
2
1

YY

0
-1
0

1
XX

3
1
-1
-1

XX

XX

YY

-1

0
-1
-1

XX

reflections

YY

3
2

-1

YY

3
2
-1

YY

3
2
1
-1

YY

Example 2: contraction about a minimum

-1

XX

1
XX

-1

XX

Summary
no derivatives required
deals well with noise in the cost function
is able to crawl out of some local minima (though, of course, can still get stuck)

Matlab fminsearch