
# Lecture 2

3B1B Optimization

Michaelmas 2014

Newton's method
Line search

Quasi-Newton methods
Least-Squares and Gauss-Newton methods
Downhill simplex (amoeba) algorithm

A. Zisserman

## Optimization for General Functions

$$f(x, y) = e^{x}\,(4x^2 + 2y^2 + 4xy + 2x + 1)$$

[Surface and contour plots of f(x, y)]

## Apply methods developed using quadratic Taylor series expansion

Rosenbrock's function

$$f(x, y) = 100(y - x^2)^2 + (1 - x)^2$$

[Contour plot of the Rosenbrock function]

Minimum is at $[1, 1]$.

## Steepest descent

The 1D line minimization must be performed using one of the earlier methods (usually cubic polynomial interpolation).

[Contour plots: steepest descent iterates on the Rosenbrock function, full view and zoomed view]

The zig-zag behaviour is clear in the zoomed view (100 iterations). The algorithm crawls down the valley.
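To make the method concrete, here is a minimal MATLAB sketch of steepest descent on the Rosenbrock function. It assumes the analytic gradient written below and uses a crude backtracking line search purely for brevity; the experiments on these slides use cubic-interpolation line minimization instead.

```matlab
% Steepest descent on the Rosenbrock function (illustrative sketch).
f = @(x) 100*(x(2) - x(1)^2)^2 + (1 - x(1))^2;
g = @(x) [-400*x(1)*(x(2) - x(1)^2) - 2*(1 - x(1));
           200*(x(2) - x(1)^2)];

x = [-1.9; 2];                              % start point (illustrative)
for n = 1:100
    d = -g(x);                              % descent direction: negative gradient
    alpha = 1;                              % crude backtracking line search
    while f(x + alpha*d) > f(x) && alpha > 1e-12
        alpha = alpha/2;
    end
    x = x + alpha*d;                        % take the step
end
```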

## Performance issues for optimization algorithms

1. Number of iterations required
2. Cost per iteration
3. Memory footprint
4. Region of convergence

## Recall from lecture 1: Newton's method in 1D

Fit a quadratic approximation to f(x) using both first and second derivatives at x. Expand f(x) locally using a Taylor series:

$$f(x + \delta x) = f(x) + \delta x\, f'(x) + \frac{\delta x^2}{2}\, f''(x) + \text{h.o.t.}$$

Setting the derivative of this quadratic approximation to zero gives the step

$$\delta x = -\frac{f'(x)}{f''(x)}$$

Update x:

$$x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}$$
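A minimal MATLAB sketch of this 1D iteration; the test function, derivatives, start point and tolerance are illustrative choices, not part of the lecture material.

```matlab
% 1D Newton's method: x_{n+1} = x_n - f'(x_n)/f''(x_n)
f   = @(x) x.^4 - 3*x.^2 + x;      % example function (illustrative)
df  = @(x) 4*x.^3 - 6*x + 1;       % first derivative
ddf = @(x) 12*x.^2 - 6;            % second derivative

x = 2;                             % start point (illustrative)
for n = 1:20
    x = x - df(x)/ddf(x);          % step from the local quadratic model
    if abs(df(x)) < 1e-8           % stop when the gradient is (near) zero
        break
    end
end
```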

## Recall from lecture 1: Taylor expansion in 2D

A function may be approximated locally by its Taylor series expansion:

$$f(\mathbf{x}_0 + \delta\mathbf{x}) \approx f(\mathbf{x}_0)
+ \left(\frac{\partial f}{\partial x},\, \frac{\partial f}{\partial y}\right)
\begin{pmatrix}\delta x\\ \delta y\end{pmatrix}
+ \frac{1}{2}\,(\delta x,\, \delta y)
\begin{pmatrix}
\dfrac{\partial^2 f}{\partial x^2} & \dfrac{\partial^2 f}{\partial x\,\partial y}\\[4pt]
\dfrac{\partial^2 f}{\partial x\,\partial y} & \dfrac{\partial^2 f}{\partial y^2}
\end{pmatrix}
\begin{pmatrix}\delta x\\ \delta y\end{pmatrix}
+ \text{h.o.t.}$$

The expansion to second order is a quadratic function:

$$f(\mathbf{x}_0 + \delta\mathbf{x}) = a + \mathbf{g}^\top \delta\mathbf{x} + \frac{1}{2}\,\delta\mathbf{x}^\top \mathbf{H}\,\delta\mathbf{x}$$

## Newton's method in ND

Expand f(x) by its Taylor series about the point $\mathbf{x}_n$:

$$f(\mathbf{x}_n + \delta\mathbf{x}) \approx f(\mathbf{x}_n) + \mathbf{g}_n^\top \delta\mathbf{x} + \frac{1}{2}\,\delta\mathbf{x}^\top \mathbf{H}_n\,\delta\mathbf{x}$$

where the gradient is the vector

$$\mathbf{g}_n = \nabla f(\mathbf{x}_n) = \left[\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_N}\right]^\top$$

and the Hessian is the matrix

$$\mathbf{H}_n = \mathbf{H}(\mathbf{x}_n) =
\begin{bmatrix}
\dfrac{\partial^2 f}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f}{\partial x_1\,\partial x_N}\\
\vdots & \ddots & \vdots\\
\dfrac{\partial^2 f}{\partial x_1\,\partial x_N} & \cdots & \dfrac{\partial^2 f}{\partial x_N^2}
\end{bmatrix}$$

For a minimum we require that $\nabla f(\mathbf{x}) = \mathbf{0}$, and so

$$\nabla f(\mathbf{x}) = \mathbf{g}_n + \mathbf{H}_n\,\delta\mathbf{x} = \mathbf{0}$$

with solution $\delta\mathbf{x} = -\mathbf{H}_n^{-1}\mathbf{g}_n$. This gives the iterative update

$$\mathbf{x}_{n+1} = \mathbf{x}_n - \mathbf{H}_n^{-1}\mathbf{g}_n$$

Assume that H is positive definite (all eigenvalues greater than zero). Then:

If f(x) is quadratic, the solution is found in one step.

The method has quadratic convergence (as in the 1D case).

The step $\delta\mathbf{x} = -\mathbf{H}_n^{-1}\mathbf{g}_n$ is guaranteed to be a downhill direction (provided that H is positive definite).

For numerical reasons the inverse is not actually computed; instead $\delta\mathbf{x}$ is computed as the solution of $\mathbf{H}\,\delta\mathbf{x} = -\mathbf{g}_n$.

Rather than jump straight to $\mathbf{x}_n - \mathbf{H}_n^{-1}\mathbf{g}_n$, it is better to perform a line search, which ensures global convergence:

$$\mathbf{x}_{n+1} = \mathbf{x}_n - \alpha_n\,\mathbf{H}_n^{-1}\mathbf{g}_n$$

If H = I then this reduces to steepest descent.
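A minimal MATLAB sketch of this damped Newton iteration on the Rosenbrock function, assuming the analytic gradient and Hessian written below; a simple backtracking line search stands in for the cubic-interpolation line minimization used on the slides.

```matlab
% Damped Newton's method on the Rosenbrock function (illustrative sketch).
f = @(x) 100*(x(2) - x(1)^2)^2 + (1 - x(1))^2;
g = @(x) [-400*x(1)*(x(2) - x(1)^2) - 2*(1 - x(1));
           200*(x(2) - x(1)^2)];
H = @(x) [1200*x(1)^2 - 400*x(2) + 2, -400*x(1);
          -400*x(1),                   200];

x = [-1.9; 2];                          % start point (illustrative)
for n = 1:100
    gn = g(x);
    if norm(gn) < 1e-3, break, end      % stopping criterion as on the slides
    dx = H(x) \ (-gn);                  % solve H*dx = -g rather than inverting H
    alpha = 1;                          % simple backtracking line search
    while f(x + alpha*dx) > f(x) && alpha > 1e-10
        alpha = alpha/2;
    end
    x = x + alpha*dx;
end
```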

## Newton's method - example

Newton method with line search on the Rosenbrock function.

[Contour plots of the iterates: full view and zoomed view; gradient < 1e-3 after 15 iterations]

The algorithm converges in only 15 iterations, compared to hundreds for steepest descent. However, the method requires computing the Hessian matrix at each iteration, and this is not always feasible.

## Quasi-Newton methods

If the problem size is large and the Hessian matrix is dense, then it may be infeasible/inconvenient to compute it directly.

Quasi-Newton methods avoid this problem by keeping a rolling estimate of H(x), updated at each iteration using new gradient information.

Common schemes are due to Broyden, Fletcher, Goldfarb and Shanno (BFGS), and also Davidon, Fletcher and Powell (DFP).

e.g. in 1D, using the function values $f_{-1} = f(x_0 - h)$, $f_0 = f(x_0)$, $f_1 = f(x_0 + h)$:

First derivatives:

$$f'\!\left(x_0 + \tfrac{h}{2}\right) = \frac{f_1 - f_0}{h}
\qquad\text{and}\qquad
f'\!\left(x_0 - \tfrac{h}{2}\right) = \frac{f_0 - f_{-1}}{h}$$

Second derivative:

$$f''(x_0) = \frac{f'\!\left(x_0 + \tfrac{h}{2}\right) - f'\!\left(x_0 - \tfrac{h}{2}\right)}{h}
= \frac{f_{-1} - 2 f_0 + f_1}{h^2}$$

[Figure: f(x) sampled at $x_{-1}$, $x_0$, $x_{+1}$, with spacing $h$]
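A quick numerical check of these finite-difference formulas in MATLAB; the test function, evaluation point and step size are arbitrary illustrative choices.

```matlab
% Finite-difference first and second derivatives, as in the formulas above.
f  = @(x) exp(x).*sin(x);          % illustrative test function
x0 = 0.5;  h = 1e-4;

fm = f(x0 - h);  f0 = f(x0);  fp = f(x0 + h);
d_right = (fp - f0)/h;             % approximates f'(x0 + h/2)
d_left  = (f0 - fm)/h;             % approximates f'(x0 - h/2)
d2      = (d_right - d_left)/h;    % approximates f''(x0) = (f_{-1} - 2 f_0 + f_1)/h^2
```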

For $\mathbf{H}_{n+1}$, build an approximation from $\mathbf{H}_n$, $\mathbf{g}_n$, $\mathbf{g}_{n+1}$, $\mathbf{x}_n$, $\mathbf{x}_{n+1}$.

## Quasi-Newton: BFGS

Set $\mathbf{H}_0 = \mathbf{I}$. Update according to

$$\mathbf{H}_{n+1} = \mathbf{H}_n
+ \frac{\mathbf{q}_n \mathbf{q}_n^\top}{\mathbf{q}_n^\top \mathbf{s}_n}
- \frac{(\mathbf{H}_n \mathbf{s}_n)(\mathbf{H}_n \mathbf{s}_n)^\top}{\mathbf{s}_n^\top \mathbf{H}_n \mathbf{s}_n}$$

where

$$\mathbf{s}_n = \mathbf{x}_{n+1} - \mathbf{x}_n, \qquad \mathbf{q}_n = \mathbf{g}_{n+1} - \mathbf{g}_n$$

The matrix itself is not stored, but rather represented compactly by a few stored vectors.

The estimate $\mathbf{H}_{n+1}$ is used to form a local quadratic approximation as before.
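A minimal MATLAB sketch of one such update, forming the matrix explicitly for clarity (practical implementations keep only a few vectors or update a factorisation); the numeric values in the usage lines are made up purely for illustration.

```matlab
% One BFGS update of the Hessian estimate, following the formula above.
% s = x_{n+1} - x_n and q = g_{n+1} - g_n are column vectors.
bfgs_update = @(H, s, q) H + (q*q')/(q'*s) - (H*s)*(H*s)'/(s'*(H*s));

% tiny usage example with made-up values (illustrative only)
H = eye(2);                      % initial estimate H_0 = I
s = [0.1; -0.2];                 % step taken
q = [0.3;  0.1];                 % change in gradient
H = bfgs_update(H, s, q);        % updated estimate H_1
```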

## Example

[Contour plot: BFGS quasi-Newton iterates on the Rosenbrock function]

The method converges in 25 iterations, compared to 15 for the full Newton method.

In Matlab, the optimization function fminunc uses a BFGS quasi-Newton method for medium scale optimization problems.

## Matlab fminunc

Define the cost function:
>> f = '100*(x(2)-x(1)^2)^2+(1-x(1))^2';
Choose options for BFGS quasi-Newton:
>> OPTIONS = optimset('LargeScale','off', 'HessUpdate','bfgs');
Start point:
>> x = [-1.9; 2];
Run the optimizer:
>> [x, fval] = fminunc(f, x, OPTIONS);
This produces
x = 0.9998, 0.9996
fval = 3.4306e-008

## Non-linear least squares

It is very common in applications for a cost function f(x) to be the sum of a large number of squared residuals:

$$f(\mathbf{x}) = \sum_{i=1}^{M} r_i^2$$

If each residual depends non-linearly on the parameters x, then the minimization of f(x) is a non-linear least squares problem.

## Linear least squares reminder

The goal is to fit a smooth curve to measured data points $\{s_i, t_i\}$ by minimizing the cost

$$f(\mathbf{x}) = \sum_{i=1}^{M} r_i^2 = \sum_{i=1}^{M} \left(y(s_i, \mathbf{x}) - t_i\right)^2$$

where $t_i$ is the target value at sample $s_i$.

For example, the regression function $y(s, \mathbf{x})$ might be a polynomial:

$$y(s, \mathbf{x}) = x_0 + x_1 s + x_2 s^2 + \ldots$$

In this case the function is linear in the parameters x and there is a closed form solution. In general there will not be a closed form solution for non-linear functions y(s, x).
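To illustrate the linear case, a minimal MATLAB sketch fitting a quadratic $y(s, \mathbf{x}) = x_0 + x_1 s + x_2 s^2$ by linear least squares; the data are synthetic and purely illustrative.

```matlab
% Linear least squares: fit a quadratic to data points (s_i, t_i).
s = linspace(0, 1, 20)';                     % sample positions (synthetic)
t = 1 + 2*s - 3*s.^2 + 0.05*randn(size(s));  % noisy target values (synthetic)

A = [ones(size(s)), s, s.^2];                % design matrix for x0 + x1*s + x2*s^2
x = A \ t;                                   % closed-form least-squares solution
```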

## Non-linear least squares example: aligning a 3D model to an image

Transformation parameters:
- 3D rotation matrix R
- translation 3-vector $\mathbf{T} = (T_x, T_y, T_z)^\top$

Image generation:
- rotate and translate the 3D model by R and T
- project to generate the image $I_{R,T}(x, y)$

Cost function:

$$f(\mathbf{x}) = \sum_{i=1}^{M} r_i^2 = \|\mathbf{r}\|^2$$

The Jacobian matrix of the residual vector is

$$\mathbf{J}(\mathbf{x}) =
\begin{bmatrix}
\dfrac{\partial r_1}{\partial x_1} & \cdots & \dfrac{\partial r_1}{\partial x_N}\\
\vdots & \ddots & \vdots\\
\dfrac{\partial r_M}{\partial x_1} & \cdots & \dfrac{\partial r_M}{\partial x_N}
\end{bmatrix}
\qquad\text{(assume } M > N\text{)}$$

Consider the gradient of the cost:

$$\frac{\partial}{\partial x_k} \sum_i r_i^2 = \sum_i 2\, r_i\, \frac{\partial r_i}{\partial x_k}$$

Hence

$$\nabla f(\mathbf{x}) = 2\,\mathbf{J}^\top \mathbf{r}$$

For the second derivatives:

$$\frac{\partial^2}{\partial x_l\,\partial x_k} \sum_i r_i^2
= 2 \sum_i \frac{\partial r_i}{\partial x_l}\,\frac{\partial r_i}{\partial x_k}
+ 2 \sum_i r_i\, \frac{\partial^2 r_i}{\partial x_k\,\partial x_l}$$

Hence

$$\mathbf{H}(\mathbf{x}) = 2\,\mathbf{J}^\top \mathbf{J} + 2 \sum_{i=1}^{M} r_i\, \mathbf{R}_i$$

where $\mathbf{R}_i$ is the matrix of second derivatives of the residual $r_i$.

Note that the second-order term in the Hessian H(x) is multiplied by the residuals $r_i$.

In most problems, the residuals will typically be small. Also, at the minimum, the residuals will typically be distributed with mean = 0.

For these reasons, the second-order term is often ignored, giving the Gauss-Newton approximation to the Hessian:

$$\mathbf{H}(\mathbf{x}) = 2\,\mathbf{J}^\top \mathbf{J}$$

Hence, explicit computation of the full Hessian can again be avoided.

## Example: Gauss-Newton

The minimization of the Rosenbrock function

$$f(x, y) = 100(y - x^2)^2 + (1 - x)^2$$

can be written as a least-squares problem with residual vector

$$\mathbf{r} = \begin{pmatrix} 10(y - x^2)\\ 1 - x \end{pmatrix}$$

with Jacobian

$$\mathbf{J}(\mathbf{x}) =
\begin{pmatrix}
\dfrac{\partial r_1}{\partial x} & \dfrac{\partial r_1}{\partial y}\\[4pt]
\dfrac{\partial r_2}{\partial x} & \dfrac{\partial r_2}{\partial y}
\end{pmatrix}
=
\begin{pmatrix}
-20x & 10\\
-1 & 0
\end{pmatrix}$$

The exact Hessian is

$$\mathbf{H}(\mathbf{x}) =
\begin{pmatrix}
\dfrac{\partial^2 f}{\partial x^2} & \dfrac{\partial^2 f}{\partial x\,\partial y}\\[4pt]
\dfrac{\partial^2 f}{\partial x\,\partial y} & \dfrac{\partial^2 f}{\partial y^2}
\end{pmatrix}
=
\begin{pmatrix}
1200x^2 - 400y + 2 & -400x\\
-400x & 200
\end{pmatrix}$$

and the Gauss-Newton approximation is

$$2\,\mathbf{J}^\top \mathbf{J} = 2
\begin{pmatrix}
-20x & -1\\
10 & 0
\end{pmatrix}
\begin{pmatrix}
-20x & 10\\
-1 & 0
\end{pmatrix}
=
\begin{pmatrix}
800x^2 + 2 & -400x\\
-400x & 200
\end{pmatrix}$$

## Summary: Gauss-Newton optimization

For a cost function f(x) that is a sum of squared residuals

$$f(\mathbf{x}) = \sum_{i=1}^{M} r_i^2$$

the Hessian can be approximated as

$$\mathbf{H}(\mathbf{x}) = 2\,\mathbf{J}^\top \mathbf{J}$$

and the gradient is given by

$$\nabla f(\mathbf{x}) = 2\,\mathbf{J}^\top \mathbf{r}$$

So the Newton update step

$$\mathbf{x}_{n+1} = \mathbf{x}_n + \delta\mathbf{x} = \mathbf{x}_n - \mathbf{H}_n^{-1}\mathbf{g}_n,$$

computed as $\mathbf{H}\,\delta\mathbf{x} = -\mathbf{g}_n$, becomes

$$\mathbf{J}^\top \mathbf{J}\,\delta\mathbf{x} = -\mathbf{J}^\top \mathbf{r}$$

These are called the normal equations.

With a line search, the update is

$$\mathbf{x}_{n+1} = \mathbf{x}_n - \alpha_n\,\mathbf{H}_n^{-1}\mathbf{g}_n
\qquad\text{with}\qquad
\mathbf{H}(\mathbf{x}_n) = 2\,\mathbf{J}_n^\top \mathbf{J}_n$$
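A minimal MATLAB sketch of this Gauss-Newton iteration for the Rosenbrock residuals defined in the earlier example, solving the normal equations at each step; the backtracking line search is an illustrative simplification of the cubic-interpolation line search used in the slides' experiments.

```matlab
% Gauss-Newton on the Rosenbrock function written as residuals
% r1 = 10*(y - x^2), r2 = 1 - x, so that f = r'*r.
r = @(x) [10*(x(2) - x(1)^2); 1 - x(1)];
J = @(x) [-20*x(1), 10; -1, 0];
f = @(x) sum(r(x).^2);

x = [-1.9; 2];                            % start point (illustrative)
for n = 1:100
    rn = r(x);  Jn = J(x);
    gn = 2*Jn'*rn;                        % gradient 2*J'*r
    if norm(gn) < 1e-3, break, end        % stopping criterion as on the slides
    dx = (Jn'*Jn) \ (-Jn'*rn);            % normal equations: J'*J*dx = -J'*r
    alpha = 1;                            % simple backtracking line search
    while f(x + alpha*dx) > f(x) && alpha > 1e-10
        alpha = alpha/2;
    end
    x = x + alpha*dx;
end
```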
## Gauss-Newton method with line search

[Contour plots of the iterates: full view and zoomed view; gradient < 1e-3 after 14 iterations]

Minimization with the Gauss-Newton approximation and line search takes only 14 iterations.

## Comparison

[Contour plots: Newton and Gauss-Newton iterates with line search on the Rosenbrock function]

| Newton | Gauss-Newton |
| --- | --- |
| requires computing the Hessian (i.e. $N^2$ second derivatives) | approximates the Hessian by a Jacobian product |

## The downhill simplex (amoeba) algorithm

Due to Nelder and Mead (1965).

A direct method: only uses function evaluations (no derivatives).

A simplex is the polytope in N dimensions with N+1 vertices, e.g.
- 2D: triangle
- 3D: tetrahedron

Basic idea: move by reflections, expansions or contractions.

[Figure: simplex moves - start, reflect, expand, contract]

## One iteration of the simplex algorithm

Reorder the points so that $f(\mathbf{x}_{n+1}) > \ldots > f(\mathbf{x}_2) > f(\mathbf{x}_1)$ (i.e. $\mathbf{x}_{n+1}$ is the worst point).

Generate a trial point $\mathbf{x}_r$ by reflection:

$$\mathbf{x}_r = \bar{\mathbf{x}} + \alpha(\bar{\mathbf{x}} - \mathbf{x}_{n+1})$$

where $\bar{\mathbf{x}} = \sum_i \mathbf{x}_i/(N+1)$ is the centroid and $\alpha > 0$. Compute $f(\mathbf{x}_r)$; there are then 3 possibilities:

1. $f(\mathbf{x}_1) < f(\mathbf{x}_r) < f(\mathbf{x}_n)$ (i.e. $\mathbf{x}_r$ is neither the new best nor worst point): replace $\mathbf{x}_{n+1}$ by $\mathbf{x}_r$.

2. $f(\mathbf{x}_r) < f(\mathbf{x}_1)$ (i.e. $\mathbf{x}_r$ is the new best point): assume the direction of reflection is good and generate a new point by expansion,
$$\mathbf{x}_e = \mathbf{x}_r + \beta(\mathbf{x}_r - \bar{\mathbf{x}})$$
where $\beta > 0$. If $f(\mathbf{x}_e) < f(\mathbf{x}_r)$ then replace $\mathbf{x}_{n+1}$ by $\mathbf{x}_e$; otherwise the expansion has failed, so replace $\mathbf{x}_{n+1}$ by $\mathbf{x}_r$.

3. $f(\mathbf{x}_r) > f(\mathbf{x}_n)$: assume the polytope is too large and generate a new point by contraction,
$$\mathbf{x}_c = \bar{\mathbf{x}} + \gamma(\mathbf{x}_{n+1} - \bar{\mathbf{x}})$$
where $\gamma$ ($0 < \gamma < 1$) is the contraction coefficient. If $f(\mathbf{x}_c) < f(\mathbf{x}_{n+1})$ then the contraction has succeeded, so replace $\mathbf{x}_{n+1}$ by $\mathbf{x}_c$; otherwise contract again.

Standard values are $\alpha = 1$, $\beta = 1$, $\gamma = 0.5$.
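A minimal MATLAB sketch of one such iteration. The function name simplex_iteration, the cost handle fun and the vertex matrix X are assumptions for this sketch, and the repeated-contraction/shrink bookkeeping of the full Nelder-Mead algorithm is omitted for brevity.

```matlab
function X = simplex_iteration(fun, X, alpha, beta, gamma)
% One simplified simplex iteration (reflection / expansion / contraction).
% fun : cost function handle taking a row vector; X : (N+1) x N vertex matrix.
    fvals = zeros(size(X,1), 1);
    for i = 1:size(X,1)
        fvals(i) = fun(X(i,:));
    end
    [fvals, order] = sort(fvals);            % ascending: best first, worst last
    X = X(order, :);
    worst = size(X, 1);                      % index of the worst vertex
    xbar = mean(X, 1);                       % centroid (all vertices, as on the slide)

    xr = xbar + alpha*(xbar - X(worst,:));   % reflection
    fr = fun(xr);
    if fr < fvals(1)                         % new best point: try an expansion
        xe = xr + beta*(xr - xbar);
        if fun(xe) < fr
            X(worst,:) = xe;
        else
            X(worst,:) = xr;                 % expansion failed: keep reflected point
        end
    elseif fr < fvals(worst - 1)             % neither best nor worst: accept reflection
        X(worst,:) = xr;
    else                                     % still the worst: contract towards centroid
        xc = xbar + gamma*(X(worst,:) - xbar);
        if fun(xc) < fvals(worst)
            X(worst,:) = xc;                 % (repeated contraction omitted here)
        end
    end
end
```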

## Example

Path of the best vertex (Matlab fminsearch with 200 iterations).

[Contour plots: downhill simplex on the Rosenbrock function, full view and zoomed detail]

## Example 2: contraction about a minimum

[Surface plot snapshots of the simplex: reflections, then contractions about the minimum]

## Summary

- no derivatives required
- deals well with noise in the cost function
- is able to crawl out of some local minima (though, of course, can still get stuck)

In Matlab, the downhill simplex method is provided by the function fminsearch.
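For completeness, a minimal usage sketch of fminsearch on the Rosenbrock function; the option values and start point are illustrative choices.

```matlab
% Downhill simplex (Nelder-Mead) via fminsearch on the Rosenbrock function.
f = @(x) 100*(x(2) - x(1)^2)^2 + (1 - x(1))^2;
opts = optimset('MaxIter', 200, 'Display', 'iter');   % illustrative options
[x, fval] = fminsearch(f, [-1.9; 2], opts);
```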