You are on page 1of 38

Nonlinear Optimization Benny Yakir

These notes are based on ???. No originality is claimed.

1
Contents
1 The General Optimization Problem 4
2 Basic MATLAB 4
2.1 Starting and quitting MATLAB . . . . . . . . . . . . . . . . . 5
2.2 Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Graphics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 Scripts and functions . . . . . . . . . . . . . . . . . . . . . . . 7
2.5 Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.6 More about functions . . . . . . . . . . . . . . . . . . . . . . 9
2.7 Homework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3 Basic properties of solutions and algorithms 9
3.1 Necessary conditions for a local optimum . . . . . . . . . . . 9
3.2 Convex (and concave) functions . . . . . . . . . . . . . . . . 11
3.3 Global convergence of decent algorithms . . . . . . . . . . . 11
3.4 Homework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4 Basic descent methods 13
4.1 Fibonacci and Golden Section Search . . . . . . . . . . . . . 13
4.2 Newtons method . . . . . . . . . . . . . . . . . . . . . . . . 13
4.3 Applying line-search methods . . . . . . . . . . . . . . . . . . 14
4.4 Homework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.5 Quadratic interpolation . . . . . . . . . . . . . . . . . . . . . 15
4.6 Cubic t . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.7 Homework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5 The method of steepest decent 17
5.1 The quadratic case . . . . . . . . . . . . . . . . . . . . . . . . 17
5.2 Applying the method in Matlab . . . . . . . . . . . . . . . . . 18
5.3 Homework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
6 Newton and quasi-Newton methods 20
6.1 Newtons method . . . . . . . . . . . . . . . . . . . . . . . . 20
6.2 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
6.3 The Davidon-Fletcher-Powell (DFP) method . . . . . . . . . 21
6.4 The Broyden-Flecher-Goldfarb-Shanno (BFGS) method . . . 22
6.5 Homework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2
7 Constrained Minimization Conditions 22
7.1 Necessary conditions (equality constraints) . . . . . . . . . . 23
7.2 Necessary conditions (inequality constraints) . . . . . . . . . 24
7.3 Sucient conditions . . . . . . . . . . . . . . . . . . . . . . . 25
7.4 Sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
7.5 Homework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
8 Lagrange methods 27
8.1 Quadratic programming . . . . . . . . . . . . . . . . . . . . . 27
8.1.1 Equality constraints . . . . . . . . . . . . . . . . . . . 28
8.1.2 Inequality constraints . . . . . . . . . . . . . . . . . . 28
8.2 Homework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
9 Sequential Quadratic Programming 29
9.1 Newtons Method . . . . . . . . . . . . . . . . . . . . . . . . 29
9.2 Structured Methods . . . . . . . . . . . . . . . . . . . . . . . 29
9.3 Merit function . . . . . . . . . . . . . . . . . . . . . . . . . . 30
9.4 Enlargement of the feasible region . . . . . . . . . . . . . . . 30
9.5 The HanPowell method . . . . . . . . . . . . . . . . . . . . 31
9.6 Constrained minimization in Matlab . . . . . . . . . . . . . . 31
9.7 Homework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
10 Penalty and Barrier Methods 33
10.1 Penalty method . . . . . . . . . . . . . . . . . . . . . . . . . . 33
10.2 Barrier method . . . . . . . . . . . . . . . . . . . . . . . . . . 34
10.3 A nal project . . . . . . . . . . . . . . . . . . . . . . . . . . 35
10.4 An exercise in Matlab . . . . . . . . . . . . . . . . . . . . . . 35
11 Stochastic Approximation 37
3
Nonlinear Optimization Benny Yakir
1 The General Optimization Problem
The general optimization problem has the form:
min
xR
n
f(x)
subject to:
g
i
(x) = 0 i = 1, . . . , m
e
g
i
(x) 0 i = m
e
+ 1, . . . , m
x
l
x x
u
In particular, if m = 0, the problem is called an unconstrained optimization
problem. In this course we intend to introduce and investigate algorithms for
solving this problem. We will concentrate, in general, in algorithms which
are used by the Optimization toolbox of MATLAB.
We intend to cover the following chapters:
1. Basic MATLAB.
2. Basic properties of solutions and algorithms.
3. Basic descent methods.
4. Quasi-Newton methods.
5. Least squares optimization.
6. Sequential Quadratic Programming.
7. The REDUCE algorithm.
8. Stochastic approximation.
2 Basic MATLAB
The name MATLAB stands for matrix laboratory. It is an interactive system
for technical computing whose basic data element is an array that does not
require dimensioning.
4
2.1 Starting and quitting MATLAB
Starting: double-click on the MATLAB icon.
Quitting: File/Exit MATLAB or write quit in the command line.
Help: Help/Help Window, click on the ? icon or type help.
2.2 Matrices
MATLAB is case sensitive. Memory is allocated automatically.
>> A = [16 3 2 13; 5 10 11 8; 9 6 7 12; 4 15 14 1]
A =
16 3 2 13
5 10 11 8
9 6 7 12
4 15 14 1
>> sum(A)
ns =
34 34 34 34
>> sum(A)
ns =
34 34 34 34
>> sum(diag(A))
ns =
34
>> A(1,4) + A(2,3) + A(3,2) + A(4,1);
>> sum(A(1:4,4));
>> sum(A(:,end));
>> A(~isprime(A)) = 0
A =
0 3 2 13
5 0 11 0
0 0 7 0
0 0 0 0
>> sum(1:16)/4;
>> pi:-pi/4:0
ns =
3.1416 2.3562 1.5708 0.7854 0
>> B = [fix(10*rand(1,5));randn(1,5)]
5
B =
4.0000 9.0000 9.0000 4.0000 8.0000
0.1139 1.0668 0.0593 -0.0956 -0.8323
>> B(2:2:10)=[]
B =
0 3 8 0 1
>> s = 1 -1/2 + 1/3 - 1/4 + 1/5 - 1/6 + 1/7 ...
-1/8 + 1/9 -1/10
s =
0.6456
>> A*A
ns =
378 212 206 360
212 370 368 206
206 368 370 212
360 206 212 378
>> det(A)
ns =
0
>> eig(A)
ns =
34.0000
8.0000
0.0000
-8.0000
>> (A/34)^5
ns =
0.2507 0.2495 0.2494 0.2504
0.2497 0.2501 0.2502 0.2500
0.2500 0.2498 0.2499 0.2503
0.2496 0.2506 0.2505 0.2493
>> A.*A
ns =
256 15 18 52
15 100 66 120
18 66 49 168
52 120 168 1
>> n= (0:3);
>> pows = [n n.^2 2.^n]
pows =
6
0 0 1
1 1 2
2 4 4
3 9 8
2.3 Graphics
>> t = 0:pi/100:2*pi;
>> y = sin(t);
>> plot(t,y)
>> y2 = sin(t-0.25);
>> y3 = sin(t-0.5);
>> plot(t,y,t,y2,t,y3)
>> [x,y,z]=peaks;
>> contour(x,y,z,20,k)
>> hold on
>> pcolor(x,y,z)
>> hold off
>> [x,y] = meshgrid(-8:.5:8);
>> R = sqrt(x.^2 + y.^2) + eps;
>> Z = sin(R)./R;
>> mesh(x,y,Z)
2.4 Scripts and functions
M-les are text les containing MATLAB code. A text editor can be ac-
cessed via File/New/M File. M-les end with .m prex. Functions are
M-les that can accept input argument and return output arguments. Vari-
ables, in general, are local. MATLAB provides many functions. You can
nd a function name with the function lookfor.
>> lookfor inverse
INVHILB Inverse Hilbert matrix.
...
INV Matrix inverse.
...
INVHESS Inverse of an upper Hessenberg matrix.
7
Or you can write a function in an M-le:
function h = falling(t)
global GRAVITY
h = 1/2*GRAVITY*t.^2;
save it and run it from MATLAB:
>> global GRAVITY
>> GRAVITY = 32;
>> y = falling((0:.1:5));
>> falling(0:5)
ans =
0 16 64 144 256 400
2.5 Files
The MATLAB environment includes a set of variables built up during the
session the Workplace and disk les containing programs and data
that persist between sessions. Variables can be saved in MAT-les.
>> save B A
>> A = 0
A =
0
>> A
A =
16 3 2 13
5 10 11 8
9 6 7 12
4 15 14 1
>> save mydata A -ascii
>> dir
. libut.dll msvcirt.dll
b.mat lmgr325a.dll msvcrt.dll
bccengmatopts.bat matfopts.bat mwoles05.dll
bccopts.bat matlab.exe mydata
cmex.bat medit.exe patchmat.exe
....
8
2.6 More about functions
To obtain the most speed its important to vectorize the computations.
Write the M-le logtab1.m:
function t = logtab1(n) x=0.01; for k=1:n
y(k) = log10(x);
x = x+0.01;
end t=y;
and the M-le logtab2.m:
function t = logtab2(n) x = 0.01:0.01:(n*0.01); t = log10(x);
Then run in Matlab:
>> logtab1(1000);
>> logtab2(1000);
2.7 Homework
1. Let f(x) = ax
2
2bx + c. Under which conditions does f has a
minimum? What is the minimizing x?
2. Let f(x) = x

Ax2b

x+c, with A an nn matrix, b and c n-vectors.

Under which conditions does f has a minimum? a unique minimum?
What is the minimizing x?
3. Write a MATLAB function that nds the location and value of the
minimum of a quadratic function.
4. Plot, using MATLAB, a contour plot of the function f with A =
[1 3; 1 2], b = [5 2]

and c = [1 3]

. Mark, on the plot, the location

of the minimum.
3 Basic properties of solutions and algorithms
3.1 Necessary conditions for a local optimum
Assume that the function f is dened over R.
9
Denition: A point x

is said to be a relative minimum point or a local

minimum point of f if there is an > 0 such that f(x

x such that xx

then
x

is said to be a strict relative minimum point.

Denition: A point x

f(x

then x

is said to be a strict global minimum point.

In practice, the algorithms we will consider in the better part of this
course converge to a local minimum. We may indicate in the future how the
global minimum can be attained.
Denition: Given x , a vector d is a feasible direction at x if there
exists an > 0 such that x + d for all 0 .
Theorem 1 (First-order necessary conditions.) Let f C
1
. If x

is a
relative minimum, then for any d which is feasible at x

, we have

f(x

d
0.
Corollary 1 If x

is a relative minimum and if x

0
then

f(x

) = 0.
Theorem 2 (Second-order necessary conditions.) Let f C
2
. Let x

, if

f(x

d = 0
then d

f(x

)d 0.
Corollary 2 If x

is a relative minimum and if x

0
then

f(x

d = 0
and d

f(x

)d 0 for all d.
Theorem 3 (Second-order sucient conditions.) Let f C
2
. As-
sume x

0
. If

f(x

) = 0 and

f(x

) is positive denite then x

is a
strict relative minimum.
10
3.2 Convex (and concave) functions
Denition: A function f on a convex set is said to be convex if
f(x
1
+ (1 )x
2
) f(x
1
) + (1 )f(x
2
),
for all x
1
, x
2
, 0 1.
Theorem 4 If f
1
and f
2
are convex and a > 0 then f
1
+ f
2
and af
1
are
also convex.
Theorem 5 Let f be convex. Then
c
= {x : f(x) c} is a convex
set.
Corollary 3 If f
1
, . . . , f
m
are convex then the set of points that simultane-
ously satisfy
f
1
(x) c
1
, . . . , f
n
(x) c
n
is convex.
3.3 Global convergence of decent algorithms
The algorithms we consider are iterative descent algorithms.
Denition: An algorithm A is a mapping that assigns, to each point, a
subset of the space.
Iterative algorithm: The specic sequence is constructed by choosing a
point in the subset and iterating the process. The algorithm generates
a series of points. Each point being calculated on the basis of the points
preceding it.
Descent algorithm: As each new point is generated, the corresponding
value of some function decreases in value. Hence, there is a continuous
function Z such that if A is the algorithm and is the solution set then
1. If x and y A(x), then Z(y) < Z(x).
2. If x and y A(x), then Z(y) Z(x).
11
Denition: An algorithm is said to be globally convergent if, for any start-
ing point, it generates a sequence that converges to a solution.
Denition: A point-to-set map A is said to be closed at x if
1. x
k
x and
2. y
k
y, y
k
A(x
k
), imply
3. y A(x).
The map A is closed if it is closed at each point of the space.
Theorem 6 If A is a decent iterative algorithm which is closed outside of
the solution set and if the sequence of points is contained in a compact set
then any converging subsequence converges to a solution.
3.4 Homework
1. To approximate the function g over the interval [0, 1] by a polynomial p
of degree n (or less), we use the criterion
f(a) =
_
1
0
[g(x) p(x)]
2
dx,
where a R
n+1
are the coecients of p. Find the equations satised by
the optimal solution.
2. (a) Using rst-order necessary conditions, nd the minimum of the function
f(x, y, z) = 2x
2
+ xy + y
2
+ yz + z
2
6x 7y 8z + 9.
(b) Varify the point is a relative minimum by checking the second-order
conditions.
3. Let {f
i
: i I} be a collection of convex functions denned over a convex
set . Show that the function g = sup
iI
f is convex on the region where
it is nite.
4. Dene the point-to-set mapping on R
n
by
A(x) = {y : y

x b},
where b is a xed constant. Is A closed?
12
4 Basic descent methods
We consider now algorithms for locating a local minimum in the optimiza-
tion problem with no constraints. All methods have in common the basic
structure: in each iteration a direction d
n
is chosen from the current location
x
n
. The next location, x
n+1
, is the minimum of the function along the line
that passes through x
n
in the direction d
n
. Before discussing the dierent
approaches for choosing directions, we will deal with the problem of nding
the minimum of a function of one variable line search.
4.1 Fibonacci and Golden Section Search
These approaches assume only that the function is unimodal. Hence, if
the interval is divided by the points x
0
< x
1
< < x
N
< x
N+1
and we
nd that, among these points, x
k
minimizes the function then the over-all
minimum is in the interval (x
k1
, x
k+1
).
The Fibonacci sequence (F
n
= F
n1
+ F
n2
, F
0
= F
1
= 1) is the basis
for choosing altogether N points sequentially such that the x
k+1
x
k1
is
minimized. The length of the nal interval is (x
N+1
x
0
)/F
N
.
The solution of the Fibonacci equation is F
N
= A
N
1
+ B
N
2
, where

1
=
1 +

5
2
= 1/0.618,
2
=
1

5
2
.
It follows that F
N1
/F
N
0.618 and the rate of convergence of this line
search approach is linear.
4.2 Newtons method
The best known method of line search is Newtons method. Assume not
only that the function is continuous but also that it is smooth. Given the
rst and second derivatives of the function at x
n
, one can write the Taylor
expansion:
f(x) q(x) = f(x
n
) + f

(x
n
)(x x
n
) + f

(x
n
)(x x
n
)
2
/2.
The minimum of q(x) is attained at
x
n+1
= x
n

(x
n
)
f

(x
n
)
.
13
(Note that this approach can be generalized to the problem of nding the
zeros of the function g(x) = q

(x).)
Theorem 7 Let the function g have a continuous second derivative and
let x

) = 0 and g

(x

) = 0. Then the Newton method

converges with an order of convergence of at least two, provided that x
0
is
suciently close to x

.
4.3 Applying line-search methods
In order for Matlab to be able to read/write les in disk D you should use
the command
>> cd d:
Now you can write the function:
function y = humps(x)
y = 1./((x-0.3).^2 + 0.01)+ 1./((x - 0.9).^2 + 0.04) -6;
in the M-le humps.m in directory D.
>> fplot(humps, [-5 5])
>> grid on
>> fplot(humps, [-5 5 -10 25])
>> grid on
>> fplot([2*sin(x+3), humps(x)], [-5 5])
>> fmin(humps,0.3,1)
ans =
0.6370
>> fmin(humps,0.3,1,1)
Func evals x f(x) Procedure
1 0.567376 12.9098 initial
2 0.732624 13.7746 golden
3 0.465248 25.1714 golden
4 0.644416 11.2693 parabolic
5 0.6413 11.2583 parabolic
6 0.637618 11.2529 parabolic
7 0.636985 11.2528 parabolic
8 0.637019 11.2528 parabolic
9 0.637052 11.2528 parabolic
ans =
0.6370
14
4.4 Homework
1. Consider the iterative process
x
n+1
=
1
2
_
x
n
+
a
x
n
_
,
where a > 0. Assuming the process converges, to what does it converge?
What is the order of convergence.
2. Find the minimum of the function -humps. Use dierent ranges.
3. (a) Given f(x
n
), f

(x
n
) and f

(x
n1
), show that
q(x) = f(x) + f

(x
n
)(x x
n
) +
f

(x
n1
) f

(x
n
)
x
n1
x
n

x x
n
)
2
2
,
has the same derivatives as f at x
n
and x
n1
and is equal to f at x
n
.
(b) Construct a line search algorithm based on this quadratic t.
Assume we are given x
1
< x
2
< x
3
and the values of f(x
i
), i = 1, 2, 3, which
satisfy
f(x
2
) < f(x
1
) and f(x
2
) < f(x
3
).
The quadratic passing through these points is given by
q(x) =
3

i=1
f(x
i
)

j=i
(x x
j
)

j=i
(x
i
x
j
)
.
The minimum of this function is attained at the point
x
4
=
1
2

23
f(x
1
) +
31
f(x
2
) +
12
f(x
3
)

23
f(x
1
) +
31
f(x
2
) +
12
f(x
3
)
,
with
ij
= x
2
i
x
2
j
and
ij
= x
i
x
j
. An algorithm A : R
3
R
3
can
be dened by such a pattern. If we start from an initial 3-points pattern
x = (x
1
, x
2
, x
3
) the algorithm A can be constructed in such a way that
A(x) has the same pattern. The algorithm is continuous, hence closed. It
is descents with respect to the function Z(x) = f(x
1
) + f(x
2
) + f(x
3
). If
follows that the algorithm converges to the solution set = {x

: f

(x

i
) =
0, i = 1, 2, 3.}. It can be shown that the order of convergence to the solution
is (approximately) 1.3.
15
4.6 Cubic t
Given x
1
and x
2
, together with f

(x
1
), f

(x
1
), f

(x
2
) and f

(x
2
), one can
consider a cubic polynom of the form
q(x) = a
0
+ a
1
x + a
2
x
2
+ a
3
x
3
.
The local minimum is determined by the solution of the equation
q

(x) = a
1
+ 2a
2
x + 3a
3
x
2
= 0,
which satises
q

(x) = 2a
2
x + 6a
3
x > 0.
It follows that the appropriate interpolation is given by
x
3
= x
2
(x
2
x
1
)
f

(x
2
) +
2

1
f

(x
2
) f

(x
1
) + 2
2
,
where

1
= f

(x
1
) + f

(x
2
) 3
f(x
1
) f(x
2
)
x
1
x
2

2
= (
2
1
f

(x
1
)f

(x
2
))
1/2
.
The order of convergence of this algorithm is 2.
4.7 Homework
1. What conditions on the values and derivatives at two points guarantee
that a cubic t will have a minimum between the two points? Use the
answer to develop a search scheme that is globally convergent for unimodal
functions.
2. Suppose the continuous real-valued function f satises
min
0x
f(x) < f(0).
Starting at any x > 0 show that, through a series of halving and doubling
of x and evaluation of the corresponding f(x)s, a three-point pattern can
be determined.
16
3. Consider the function
f(x, y) = e
x
(4x
2
+ 2y
2
+ 4xy + 2y + 1).
Use the function fmin to plot the function
g(y) = min
x
f(x, y).
5 The method of steepest decent
The method of steepest decent is a method of searching for the minimum of
a function of many variables f. In in each iteration of this algorithm a line
search is performed in the direction of the steepest decent of the function at
the current location. In other words,
x
n+1
= x
n

n

f(x
n
),
where
n
is the nonnegative scalar that minimizes f(x
n

f(x
n
)). It can
be shown that relative to the solution set {x

:

f(x

) = 0}, the algorithm is

decending and closed, thus converging.
5.1 The quadratic case
Assume
f(x) =
1
2
x

Qx bx

b =
1
2
(x x

Q(x x

)
1
2
x

Qx

,
were Q a positive denite and symmetric matrix and x

= Q
1
b is the
minimizer of f. Note that in this case

f(x) = Qx b. and
f(x
n

f(x
n
)) =
1
2
(x
n

f(x
n
))

Q(x
n

f(x
n
)) (x
n

f(x
n
))

b,
which is minimized at

n
=

f(x
n
)

f(x
n
)

f(x
n
)

Q

f(x
n
)
.
It follows that
1
2
(x
n+1
x

Q(x
n+1
x

) =
_
1
(

f(x
n
)

f(x
n
))
2

f(x
n
)

Q

f(x
n
)

f(x
n
)

Q
1
f(x
n
)
_

1
2
(x
n
x

Q(x
n
x

).
17
Theorem 8 (Kantorovich inequality) Let Q be a positive denite and
symmetric matrix and let 0 < a =
1

2

n
= A be the eigenval-
ues. Then
(x

x)
2
(x

Qx)(x

Q
1
x)

4aA
(a + A)
2
.
Theorem 9 For the quadratic case
1
2
(x
n+1
x

Q(x
n+1
x

)
_
Aa
A + a
_
2
1
2
(x
n
x

Q(x
n
x

).
5.2 Applying the method in Matlab
Write the M-le fun.m:
function f=fun(x)
f = exp(x(1))*(4*x(1)^2+2*x(2)^2+4*x(1)*x(2)+2*x(2)+1);
In the Matlab session:
>> x=[-1,1];
>> x=fminu(fun,x)
x =
0.5000 -1.0000
>> fun(x)
ans =
1.3029e-010
>> x=[-1,1];
>> options(6)=2;
>> x = fminu(fun,x,options)
x =
0.5000 -1.0000
Write the M-le fun1.m:
function f = fun1(x)
f = 100*(x(2)-x(1)^2)^2 + (1 - x(1))^2;
and in the Matlab session:
18
>> x=[-1,1];
>> options(1)=1;
>> options(6)=0;
>> x = fminu(fun1,x,options)
f-COUNT FUNCTION STEP-SIZE GRAD/SD
4 4 0.500001 -16
9 3.56611e-009 0.500001 0.0208
14 7.36496e-013 0.000915682 -3.1e-006
21 1.93583e-013 9.12584e-005 -1.13e-006
24 1.55454e-013 4.56292e-005 -7.16e-007
Optimization Terminated Successfully
Search direction less than 2*options(2)
Gradient in the search direction less than 2*options(3)
NUMBER OF FUNCTION EVALUATIONS=24
x =
1.0000 1.0000
>> x=[-1,1];
>> options(6)=2;
>> x = fminu(fun1,x,options)
f-COUNT FUNCTION STEP-SIZE GRAD/SD
4 4 0.500001 -16
9 3.56611e-009 0.500001 0.0208
15 1.11008e-012 0.000519178 -4.82e-006
Warning: Matrix is close to singular or badly scaled.
Results may be inaccurate. RCOND = 1.931503e-017.
> In c:\matlab\toolbox\optim\cubici2.m at line 10
In c:\matlab\toolbox\optim\searchq.m at line 54
In c:\matlab\toolbox\optim\fminu.m at line 257
....
192 4.56701e-013 -4.52912e-006 -1.02e-006
195 4.5539e-013 2.26456e-006 -4.03e-007
198 4.55537e-013 -1.13228e-006 -1.02e-006
201 4.55336e-013 5.66141e-007 -4.03e-007
Maximum number of function evaluations exceeded;
increase options(14).
x =
1.0000 1.0000
19
5.3 Homework
1. Investigate the function
f(x, y) = 100(y x
2
)
2
+ (1 x)
2
.
Why doesnt the steepest decent algorithm converge?
6 Newton and quasi-Newton methods
6.1 Newtons method
Based on the Taylor expansion
f(x
n
) f(x
n
) +

f(x
n
)

(x x
n
) +
1
2
(x x
n
)

f(x
n
)(x x
n
)
one can derive, just as for the line-search problem, Newtons method:
x
n+1
= x
n
(

f(x
n
))
1

f(x
n
).
Under the regularity conditions it can be shown that local rate of conver-
gence of this method is 2. A modication of this approach is to set
x
n+1
= x
n

n
(

f(x
n
))
1

f(x
n
),
where
n
minimizes the function f(x
n
(

f(x
n
))
1

f(x
n
)).
6.2 Extensions
Consider the approach of choosing
x
n+1
= x
n

n
S
n

f(x
n
),
where S
n
is some symmetric and positive-denite matrix and
n
is the non-
negative scalar that minimizes f(x
n
S
n

f(x
n
)). It can be shown that
since S
n
is positive-denite the algorithm is descending.
Assume
f(x) =
1
2
x

Qx bx

b =
1
2
(x x

Q(x x

)
1
2
x

Qx

,
20
were Q a positive denite and symmetric matrix and x

= Q
1
b is the
minimizer of f. Note that in this case

f(x) = Qx b. and
f(x
n
S
n

f(x
n
)) =
1
2
(x
n
S
n

f(x
n
))

Q(x
n
S
n

f(x
n
))(x
n
S
n

f(x
n
))

b,
which is minimized at

n
=

f(x
n
)

S
n

f(x
n
)

f(x
n
)

S
n
QS
n

f(x
n
)
.
It follows that
1
2
(x
n+1
x

Q(x
n+1
x

) =
_
1
(

f(x
n
)

S
n

f(x
n
))
2

f(x
n
)

S
n
QS
n

f(x
n
)

f(x
n
)

Q
1
f(x
n
)
_

1
2
(x
n
x

Q(x
n
x

).
Theorem 10 For the quadratic case
1
2
(x
n+1
x

Q(x
n+1
x

)
_
B
n
b
n
B
n
+ b
n
_
2
1
2
(x
n
x

Q(x
n
x

),
where B
n
and b
n
are the largest and smallest eigenvalues of SQ.
6.3 The Davidon-Fletcher-Powell (DFP) method
This is a rank-two correction procedure. The algorithm starts with some
positive-denite algorithm S
0
, initial point x
0
:
1. Minimizes f(x
n
)S
n

f(x
n
) to obtain x
n+1
,
k
x = x
n+1
x
n
=
n
S
n

f(x
n
),

f(x
n+1
) and
n

f =

f(x
n+1
)

f(x
n
).
2. Set
S
n+1
= S
n
+

k
x

k
x

k
x

k

f

S
n

k

f
k

f

S
n

k

f

S
n

k

f
.
3. Go to 1.
It follows, since
k
x

f(x
n+1
) = 0, that if S
n
is positive denite then so
is S
n+1
.
21
6.4 The Broyden-Flecher-Goldfarb-Shanno (BFGS) method
In this method the Hessian is approximated. This is a rank-two correction
procedure as well. The algorithm starts with some positive-denite algo-
rithm H
0
, initial point x
0
:
1. Minimizes f(x
n
) H
1
n

f(x
n
) to obtain x
n+1
,
k
x = x
n+1
x
n
=

n
H
1
n

f(x
n
),

f(x
n+1
) and
n

f =

f(x
n+1
)

f(x
n
).
2. Set
H
n+1
= H
n
+

k

f

k

f

k

f

k
x

H
n

k
x
k
x

H
n

k
x

H
n

k
x
.
3. Go to 1.
6.5 Homework
1. Use the function fminu with options(6)=0 (BFGS), options(6)=1 (DFP)
and options(6)=2 (steepest descent) to compare the performance of the
algorithms. Apply the function to minimize f(x) = x

Qx, here Q is a
diagonal matrix. Use dierent ratios between the smallest and the largest
eigenvalues dierent dimensions.
2. Investigate the rate of convergence of the algorithm
x
n+1
= x
n
[I + (

f(x
n
))
1
]

f(x
n
).
What is the rate if is larger than the smallest eigenvalue of (

f(x

))
1
?
3. Use the formula
[A +ba

]
1
= A
1

A
1
ab

A
1
1 +b

A
1
a
,
in order to get a direct updating formula for the inverse of H
n
in the
BFGS method.
7 Constrained Minimization Conditions
22
The general (constrained) optimization problem has the form:
min
xR
d
f(x)
subject to:
g
i
(x) = 0 i = 1, . . . , m
e
g
i
(x) 0 i = m
e
+ 1, . . . , m
x
l
x x
u
The rst m
e
constraints are called equality constraints and the last mm
e
constraints are the inequality constraints.
7.1 Necessary conditions (equality constraints)
We assume rst that m
e
= m all constraints are equality constraints.
Let x

be a solution of the optimization problem. Let g = (g

1
, . . . , g
m
).
Note that g is a (non-linear) transformation from R
d
into R
m
. The set
{x R
n
: g(x) = 0} is a surface in R
n
. This surface is approximated near
x

by x

+ M, where
M = {y : g(x

y = 0}.
In order for this approximation to hold, x

should be a regular point of the

constraint, i.e. ( g
1
(x

), . . . , g
m
(x

)) should be linearly independent.

Theorem 11 (Lagrange multipliers) Let x

be a local extremum point

of f subject to the constraint g = 0. Assume that x

is a regular point of
these constraints. Then there is a R
m
such that

f(x

) + g(x

) =

f(x

) +
m

j=1

j
g
j
(x

) = 0.
Given , one can consider the Lagrangian:
l(x, ) = f(x) +g(x).
The necessary conditions can be formulated as

l = 0. The matrix of partial
second derivatives of l (with respect to x) at x

is

l
x
(x

) =

f(x

) + g(x

) =

f(x

) +
m

j=1
g
j
(x

)
j
23
We say that this matrix is positive semidenite over M if x

l
x
(x

)x 0, for
all x M.
Theorem 12 (Second-order condition) Let x

be a local extremum point

of f subject to the constraint g = 0. Assume that x

is a regular point of
these constraints, and let R
m
be such that

f(x

) + g(x

) =

f(x

) +
m

j=1

j
g
j
(x

) = 0.
Then the matrix

l
x
(x

) is positive semidenite over M.

We now consider the case where m
e
< m. Let x

be a solution of
the constrained optimization problem. A constraint g
j
is active at x

if
g
j
(x

) = 0 and it is inactive if g
j
(x

) < 0. Note that all equality constraints

are active. Denote by J the set of all active constraints.
7.2 Necessary conditions (inequality constraints)
For the consideration of necessary conditions when inequality constraints
are present the denition of a regular point should be extended. We say
now that x

is regular if { g
j
(x

) : j J} are linearly independent.

Theorem 13 (Kuhn-Tucker Conditions) Let x

be a local extremum
point of f subject to the constraint g
j
(x) = 0, 1 j m
e
and g
j
(x) 0,
m
e
+ 1 j m. Assume that x

is a regular point of these constraints.

Then there is a R
m
such that
j
0, for all j > m
e
, and

f(x

) +
m

j=1

j
g
j
(x

) = 0
m

j=me+1

j
g
j
(x

) = 0.
Let

l
x
(x

) =

f(x

) +
m

j=1

j
g
j
(x

).
24
Theorem 14 (Second-order condition) Let x

be a local extremum point

of f subject to the constraint g
j
(x) = 0, 1 j m
e
and g
j
(x) 0,
m
e
+1 j m. Assume that x

let R
m
be such that
j
0, for all j > m
e
, and

f(x

) +
m

j=1

j
g
j
(x

) = 0.
Then the matrix

l
x
(x

) is positive semidenite on the tangent subspace of

the active constraints.
7.3 Sucient conditions
Sucient conditions are based on second-order conditions:
Theorem 15 (Equality constraints) Suppose there is a point x

satisfy-
ing g(x

) = 0, and a R
m
such that

f(x

) +
m

j=1

j
g
j
(x

) = 0.
Suppose also that the matrix

l
x
(x

) is positive denite on M. Then x

is a
strict local minimum for the constrained optimization problem.
Theorem 16 (Inequality constraints) Suppose there is a point x

that
satises the constraints. A sucient condition for x

to be a strict local
minimum for the constrained optimization problem is the existence of a
R
m
such that
j
0, for m
e
< j m, and

f(x

) +
m

j=1

j
g
j
(x

) = 0 (1)
m

j=me+1

j
g
j
(x

) = 0, (2)
and the Hessian matrix

l
x
(x

) is positive on the subspace

M

= {y : g
j
(x

y = 0, j J}
where J = {j : g
j
(x

) = 0,
j
> 0}
25
7.4 Sensitivity
The Lagrange multipliers can be interpreted as the price of incremental
change in the constraints. Consider the class of problems:
minimize f(x)
subject to g(x) = c.
For each c, assume the existence of a solution point x

(c). Under appropriate

regularity conditions the function x

(c) is well behaved with x

(0) = x

.
Theorem 17 (Sensitivity Theorem) Let f, g C
2
and consider the fam-
ily of problem dened above. Suppose that for c = 0 there is a local solution
x

that is a regular point and that, together with its associated Lagrange mul-
tiplier vector , satises the second-order sucient conditions for a strict
local minimum. Then for every c in a neighborhood of 0 there is x

(c),
continuous in c, such that x

(0) = x

, x

(c) is a local minimum of the

constrained problem indexed by c, and

f(x

(c)) |
c=0
= .
7.5 Homework
1. Read the help le on the function fminu. Investigate the eect of sup-
plying the gradients with the parameter grad on the performance of the
procedure. Compare, in particular the functions bilinear and fun1.
2. Consider the constraints x
1
0, x
2
0 and x
2
x
1
1)
2
0. Show that
(1, 0) is feasible but not regular.
3. Find the rectangle of given perimeter that has greatest area by solving
the rst-order necessary conditions. Verify that the second-order sucient
conditions are satised.
4. Three types of items are to be stored. Item A costs one dollar, item B costs
two dollars and item C costs 4 dollars. The demand for the three items
are independent and uniformly distributed in the range [0, 3000]. How
many of each type should be stored if the total budget is 4,000 dollars?
26
5. Let A be an n m matrix of rank m and let L be an n n matrix that
is symmetric and positive-denite on the subspace M = {y : Ay = 0}.
Show that the (n + m) (n + m) matrix
_
L A

A 0
_
is non-singular.
6. Consider the quadratic program
minimize x

Qx 2b

x
subject to Ax = c.
Prove that x

is a local minimum point if and only if it is a global minimum

point.
7. Maximize 14x x
2
+ 6y y
2
+ 7 subject to x + y 2, x + 2y 3.
8 Lagrange methods
The Lagrange methods for dealing with constrained optimization problem
are based on solving the Lagrange rst-order necessary conditions. In par-
ticular, for solving the problem with equality constraints only:
minimize f(x)
subject to g(x) = 0,
the algorithms look for solutions of the problem:

f(x) +
m

j=1

j
g
j
(x) = 0
g(x) = 0,
An important special case is when the target function f is quadratic and
the constraints are linear:
minimize (1/2)x

Qx +x

c
subject to a

i
x = b
i
, 1 i m
e
a

i
x b
i
, m
e
+ 1 i m
27
with Q a symmetric matrix.
8.1.1 Equality constraints
In the particular case where m
e
= m the above becomes
minimize (1/2)x

Qx +x

c
subject to Ax = b.
and the Lagrange necessary conditions become
Qx + A

+c = 0
Ax b = 0.
This system is nonsingular if Q is positive denite on the subspace M =
{x : Ax = 0}. and the solution becomes:
x = Q
1
A

(AQ
1
A

)
1
[AQ
1
c +b] Q
1
c
= (AQ
1
A

)
1
[AQ
1
c +b].
8.1.2 Inequality constraints
In the general quadratic programming problem the method of active set is
used. A working set of constraints W
n
is updated in each iteration. The set
W
n
contains all constraints that are suspected to satisfy an equality relation
at the solution point. In particular, it contains the equality constraints. An
algorithm for solving the general quadratic problem is:
1. Start with a feasible point x
0
and a working set W
0
. Set n = 0
2. Solve the quadratic problem
minimize (1/2)d

Qd + (c + Qx
n
)

d
subject to a

i
d = 0, i W
n
.
If d

n
= 0 go to 4.
3. Set x
n+1
=
n
d

n
, where

n
= min
a

i
d

n
>0
_
1,
b
i
a

i
x
n
a

i
d

n
_
.
If
n
< 1, adjoin the minimizing index above to W
n
to form W
n+1
. Set
n = n + 1 and return to step 2.
28
4. Compute the Lagrange multiplier in step 3 and let
n
= min{
i
: i
W
n
, i > m
e
}. If
n
0, stop; x
n
is a solution. Otherwise, drop
n
from
W
n
to form W
n+1
8.2 Homework
2. Investigate the properties of the function for quadratic programming.
9 Sequential Quadratic Programming
Let us go back to the general problem
minimize f(x)
subject to g
i
(x) = 0, i = 1, . . . , m
e
g
i
(x) 0, i = m
e
+ 1, . . . , m.
The SQP method solves this problem by solving a sequence of QP problems
where the Lagrangian function l is approximated by a quadratic function
and the constraints are approximated by a linear hyper-space.
9.1 Newtons Method
Consider the case of equality constraints only. At each iteration the problem
minimize (1/2)d

l
x
(x
n
,
n
)d +

l(x
n
,
n
)

d
subject to g
i
(x
n
)

d + g
i
(x
n
) = 0, i = 1, . . . , m
is solved. It can be shown that the rate of convergence of this algorithm
is 2 (at least) if the starting point (x
0
,
0
) is close enough to the solution
(x

). A disadvantage of this approach is the need to compute Hessian.

9.2 Structured Methods
29
These methods are modications of the basic Newton method, with approx-
imations replacing Hessian. One can rewrite the solution to the Newton step
in the form
_
x
n+1

n+1
_
=
_
x
n

n
_

_

l
n
g

n
g
n
0
_
1
_

l
n
g
n
_
.
Instead, one can use the formula
_
x
n+1

n+1
_
=
_
x
n

n
_

n
_
H
n
g

n
g
n
0
_
1
_

l
n
g
n
_
,
with
n
and H
n
properly chosen.
9.3 Merit function
In order to choose the
n
and to assure that the algorithm will converge
a merit function is associated with the problem such that a solution of
the constrained problem is a (local) minimum of the merit function. The
algorithm should be descending with respect to the merit function.
Consider, for example, the problem with inequality constraints only:
minimize f(x)
subject to g
i
(x) 0, i = 1, . . . , m.
The absolute-value merit function is given by
Z(x) = f(x) + c
m

i=1
g
i
(x)
+
.
The parameter is chosen by minimizing the merit function in the direction
chosen by the algorithm.
Theorem 18 If H is positive-denite and if c > max
1im

i
then the
algorithm is descending with respect to the absolute-value merit function.
9.4 Enlargement of the feasible region
Consider, again, the problem with inequality constraints only:
minimize f(x)
subject to g
i
(x) 0, i = 1, . . . , m.
30
and its solution with a structural SQP algorithm.
Assume that at the current iteration x
n
= x and H
n
= H. Then one
wants to consider the QP problem:
minimize (1/2)d

Hd +

f(x)
subject to g
i
(x)

d + g(x) 0, i = 1, . . . , m.
However, it is possible that this problem is infeasible at the point x. Hence,
the the original method breaks down. However, one can consider instead
the problem
minimize (1/2)d

Hd +

f(x) + c
m

i=1

i
subject to g
i
(x)

d + g(x)
i
, i = 1, . . . , m.

i
0, i = 1, . . . , m,
which is always feasible.
Theorem 19 If H is positive-denite and if c > max
1im

i
then the
algorithm is descending with respect to the absolute-value merit function.
9.5 The HanPowell method
The HanPowell method is what is used by Matlab for SQP. It is a Quasi-
Newton method, where H
n
is updated using the BFGS approach:
H
n+1
= H
n
+
(l)(l)

(x)

(l)

H
n
(x)(x)

H
n
(x)

H
n
(x)
,
where
x = x
n+1
x
n
, l = l(x
n+1
,
n+1
) l(x
n
,
n
).
It can be shown that H
n+1
is positive-denite if H
n
is and if (x)

(l) > 0.
9.6 Constrained minimization in Matlab
Write the M-le fun2.m:
function [f,g]=fun2(x)
f = exp(x(1))*(4*x(1)^2+2*x(2)^2+4*x(1)*x(2)+2*x(2)+1);
g(1,1) = 1.5 + x(1)*x(2) - x(1) - x(2);
g(2,1) = -x(1)*x(2) - 10;
31
and run the Matlab session:
>> x0 = [-1,1];
>> x = constr(fun2, x0)
x =
0.1956 1.0910
>> x = constr(fun2, x0)
x =
-9.5474 1.0474
>> [f,g] = fun2(x)
f =
0.0236
g =
1.0e-015 *
-0.8882
0
>> options = [];
>> vlb = [0,0];
>> vlu = [];
>> x = constr(fun2, x0, options, vlb, vlu)
x =
0 1.5000
>> [f,g] = fun2(x)
f =
8.5000
g =
0
-10
Write the M-le grodf2.m:
function [df,dg]=grudf2(x)
f = exp(x(1))*(4*x(1)^2+2*x(2)^2+4*x(1)*x(2)+2*x(2)+1);
df = [f + exp(x(1))*(8*x(1) + 4*x(2)),
exp(x(1))*(4*x(1) + 4*x(2) + 2)];
dg = [x(2) - 1, -x(2);
x(1) - 1, -x(1)];
and run the session:
>> vlb = [];
>> x = constr(fun2, x0, options, vlb, vlu, grudf2)
32
x =
-9.5474 1.0474
9.7 Homework
1. Let H be a positive-denite matrix and assume that throughout some
compact set the quadratic programing has a unique solution, such that
the Lagrange multipliers are not larger than c. Let {x
n
: n 0} is a
sequence generated by the recursion bx
n+1
= x
n
+
n
d
n
, where d is the
direction found by solving the QP centered at x
n
and with H xed and
n
is determined by minimizatin of the function Z. Show that any limit point
of {bx
n
} satises the rst order necessary conditions for the constrained
minimization problem.
2. Extend the result in 1 for the case where H = H
n
changes but yet x
2

H
n
x cx
2
for some 0 < < c < and for all x and n.
10 Penalty and Barrier Methods
The basic approach in these methods is to solve a sequence of unconstrained
problems. The solutions of these problems converge to the solution of the
original problem.
10.1 Penalty method
Consider the problem
minimize f(x)
subject to g
i
(x) 0, i = 1, . . . , m.
Choose a continuous penalty function which is zero inside the feasible set
and positive outside of it. For example,
P(x) = (1/2)
m

i=1
max{0, g
i
(x)}
2
.
Minimize, for each c, the problem
q(c, x) = f(x) + cP(x).
33
When c is increased it expected that the solution x

(c) converges to x

.
Lemma 1 Let c
n+1
> c
n
, then
q(c
n
, x

n
) q(c
n+1
, x

n+1
) (3)
P(x

n
) P(x

n+1
) (4)
f(x

n
) f(x

n+1
). (5)
Lemma 2 Let x

be the solution of the original problem. Then for each n

f(x

) q(c
n
, x

n
) f(x

n
).
Theorem 20 Let {x
n
} be a sequence generated by the penalty method. Then,
any limit point of the sequence is a solution to the original problem.
10.2 Barrier method
Consider again the problem
minimize f(x)
subject to g
i
(x) 0, i = 1, . . . , m,
and assume that the feasible set is the closure of its interior. A barrier
function is a continuous and positive function over the feasible set which
goes to as x approaches the boundary. For example,
B(x) =
m

i=1
1
g
i
(x)
.
Minimize, for each c, the problem
minimize r(c, x) = f(x) +
1
c
B(x)
subject to g
i
(x) 0, i = 1, . . . , m.
Note that a method of unconstrained minimization can be used because
the solution is in the interior of the feasible set. When c is increased it is
expected that the solution x

(c) converges to x

.
Theorem 21 Let {x
n
} be a sequence generated by the barrier method. Then,
any limit point of the sequence is a solution to the original problem.
34
10.3 A nal project
Choose an algorithm for minimization of a non-linear programming problem.
Program the steps of the algorithms using Matlab. Apply your program to a
specic example and compare the results to those of the appropriate built-in
function. Submit the program + plots and outputs that demonstrate the
steps of the algorithm.
10.4 An exercise in Matlab
Use the M-les fun1.m, const1.m and Z.m:
function f = fun1(x)
f = 100*(x(2)-x(1)^2)^2 + (1 - x(1))^2;
function g = cons1(x)
g = x(1)^2 + x(2)^2 - 1.5;
function a = Z(t,x,d)
y = x + t*d;
if cons1(y) >= 0
a = fun1(y) + 2*cons1(y);
else
a = fun1(y);
end
and write the M-le myconstr.m:
function x = myconstr(x0,H,n)
x = x0;
for i = 1:n
c = grudfun1(x);
A=grudcons1(x);
b=-cons1(x);
d = qp(H,c,A,b);
a = fmin(Z,0,2,[],x,d);
x = x + a*d;
end
function f = fun1(x)
35
f = 100*(x(2)-x(1)^2)^2 + (1 - x(1))^2;
function g = cons1(x)
g = x(1)^2 + x(2)^2 - 1.5;
function df = grudfun1(x)
df1 = -400*(x(2) - x(1)^2)*x(1) + 2*x(1) - 2;
df2 = 200*(x(2) - x(1)^2);
df = [df1;df2];
function dg=grudcons1(x);
dg = [2*x(1), 2*x(2)];
Run the Matlab session:
>> x0 = [0;0];
>> constr(f = fun1(x); g = cons1(x);,x0)
ans =
0.9072
0.8227
>> H=[1,0;0,1];
>> myconstr(x0,H,1000)
ans =
0.9131
0.8329
>> x0 = [0.9;0.8];
>> myconstr(x0,H,10)
ans =
0.9072
0.8228
>> x =constr(f = fun1(x); g = cons1(x);,x0);
>> (-400*(x(2) - x(1)^2)*x(1) + 2*x(1) - 2)/(2*x(1))
ans =
-0.0387
>> 200*(x(2) - x(1)^2)/(2*x(2))
ans =
-0.0386
>> 100*(grudcons1(x+[0.01;0])-grudcons1(x))
ans =
2.0000 0
>> 100*(grudcons1(x+[0;0.01])-grudcons1(x))
36
ans =
0 2.0000
ans =
>>
671.5124 -364.8935
ans =
-362.8935 200.0000
>> H = [672,-364;-364,200] + 0.386*[2,0;0,2];
>> x0 = [0.7;0.5];
>> xx = x0;
>> for n=1:12
x0 = myconstr(x0,H,1);
xx=[xx,x0];
end
>>[x1,x2]=meshgrid(-1.5:0.1:1.5);
>> t = 0:pi/20:2*pi;
>> contour(x1,x2,fun12(x1,x2),30)
>> hold on
>> plot(sqrt(1.5)*exp(i*t))
>> plot(xx(1,:),xx(2,:),*-)
11 Stochastic Approximation
Let h be a monotone function from the interval [a, b] to R. Let be the
(unique) solution of the equation h(x) = 0. Assume that h

() > 0. Un-
fortunately, for a given point x, one cannot evaluate h(x) but we observe
Y = h(x) +, where is a random variable with E = 0 and E
2
=
2
< ,
independent of how x is selected. The Robbins-Monroe scheme involves an
innite sequence of positive constants {c
n
: n = 0, 1, 2, . . .} which satisfy the
conditions

n=0
c
n
= ,

n=0
c
2
n
< . (6)
The recursion is given by
x
n
= x
n1
c
n1
Y
n1
, a < x
0
< b. (7)
37
Theorem 22 For the recursion dened in (7), with constants c
n
satisfy-
ing (6), it can be shown that E(x
n
)
2

n
0.
Theorem 23 For the recursion dened in (7), with constants c
n
satisfy-
ing (6), it can be shown that n
1/2
(x
n
) converges to a zero-mean normal
random variable.
38