
IMM

OPTIMIZATION WITH CONSTRAINTS

2nd Edition, March 2004

K. Madsen, H.B. Nielsen, O. Tingleff

Informatics and Mathematical Modelling
Technical University of Denmark

CONTENTS

1. Introduction
   1.1. Smoothness and Descent Directions
   1.2. Convexity
2. Local, Constrained Minimizers
   2.1. The Lagrangian Function
   2.2. First Order Condition, Necessary Condition
   2.3. Second Order Conditions
3. Quadratic Optimization
   3.1. Basic Quadratic Optimization
   3.2. General Quadratic Optimization
   3.3. Implementation Aspects
   3.4. Sequential Quadratic Optimization
4. Penalty and SQO Methods
   4.1. The Augmented Lagrangian Method
   4.2. The Lagrange-Newton Method
Appendix
References
Index
1. INTRODUCTION

In this booklet we shall discuss numerical methods for constrained optimization problems. The description supplements the description of unconstrained optimization in Frandsen et al (1998). We consider a real function of a vector variable which is constrained to satisfy certain conditions, specifically a set of equality constraints and a set of inequality constraints.

Definition 1.1. Feasible region. A point x ∈ IRⁿ is feasible if it satisfies the equality constraints
   ci(x) = 0,  i = 1, ..., r,   r ≥ 0,
and the inequality constraints
   ci(x) ≥ 0,  i = r+1, ..., m,   m ≥ r,
where the ci : IRⁿ → IR are given. The set of feasible points is denoted by P and called the feasible region.

Notice that if r = 0, then we have no equality constraints, and if r = m we have no inequality constraints.

A constrained minimizer gives a minimal value of the function while satisfying all constraints, ie

Definition 1.2. Global constrained minimizer. Find
   x⁺ = argmin_{x∈P} f(x),
where f : IRⁿ → IR and P is given in Definition 1.1.

Here, f is the so-called objective function or cost function.

Example 1.1. In IR¹ consider the objective function f(x) = (x−1)². The unconstrained minimizer is x⁺_u = 1 with f(x⁺_u) = 0. We shall look at the effect of some simple constraints.

1° With the constraint x ≥ 0 (r = 0, m = 1, c1(x) = x) we also find the constrained minimizer x⁺ = 1.

2° With the constraint x−2 ≥ 0 the feasible region is the interval P = [2, ∞[, and x⁺ = 2 with f(x⁺) = 1.

3° The inequality constraints given by c1(x) = x−2 and c2(x) = 3−x lead to P = [2, 3] and x⁺ = 2.

4° If we have the equality constraint 3−x = 0, then the feasible region consists of one point only, P = {3}, and this point will be the minimizer.

5° Finally, 3−x ≥ 0, x−4 ≥ 0 illustrates that P may be empty, in which case the constrained optimization problem has no solution.

In many constrained problems the solution is at the border of the feasible region (as in cases 2°–4° in Example 1.1). Thus a very important special case is the set of points in P which satisfy some of the inequality constraints to the limit, ie with equality. At such a point z ∈ P the corresponding constraints are said to be active. For practical reasons a constraint which is not satisfied at z is also called active at z.

Definition 1.3. Active constraints. A constraint ck(x) ≥ 0 is said to be
   active at z ∈ IRⁿ if ck(z) ≤ 0,
   inactive at z ∈ IRⁿ if ck(z) > 0.
The active set at z, A(z), is the set of indices of equality constraints and active inequality constraints:
   A(z) = {1, ..., r} ∪ Ã(z),
where Ã(z) = {j ∈ {r+1, ..., m} | cj(z) ≤ 0}.

Thus, an inequality constraint which is inactive at z has no influence on the optimization problem in a neighbourhood of z.
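As a small computational companion to Definition 1.3, here is a minimal Python sketch of the active set A(z); the function name and the tolerance-free comparisons are ours, not part of the booklet.

   # Sketch of Definition 1.3 (0-based indices; names and comparisons are ours).
   def active_set(c, r, z):
       """c: list of constraint functions; the first r are equality constraints.
       Returns the active set A(z) as a set of indices."""
       return set(range(r)) | {j for j in range(r, len(c)) if c[j](z) <= 0}

   # Case 3 of Example 1.1: c1(x) = x - 2 >= 0, c2(x) = 3 - x >= 0, r = 0.
   print(active_set([lambda x: x - 2, lambda x: 3 - x], 0, 2.0))  # {0}: only c1 active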


Example 1.2. In case 3° of Example 1.1 the constraint c1 is active and c2 is inactive at the solution x⁺. Here the active set is
   A(x⁺) = Ã(x⁺) = {1}.

As in unconstrained optimization a global, constrained minimizer (Definition 1.2) can only be computed under special circumstances, like for instance convexity of some of the functions. In some cases (including some non-convex problems) methods of interval analysis can be applied to find a global, constrained minimizer (see for instance Caprani et al (2002)).

In this booklet, however, we shall only discuss methods that determine a local constrained minimizer. Such a method provides a function value which is minimal inside a feasible neighbourhood, determined by ε (ε > 0):

Definition 1.4. Local constrained minimizer. Find x* ∈ P so that
   f(x*) ≤ f(x) for all x ∈ P with ‖x − x*‖ < ε.

Since the feasible region P is a closed set we have the following result concerning constrained optimization.

Theorem 1.5. Assume that
   0) the feasible region P is not empty,
   1) ci (i = 1, ..., m) are continuous for all x ∈ P,
   2) f is continuous for all x ∈ P,
   3) P is bounded (∃ C ∈ IR : ‖x‖ ≤ C for all x ∈ P).
Then there exists (at least) one global, constrained minimizer.

If both the cost function f and all constraint functions ci are linear in x, then we have a so-called linear optimization problem. The solution of such problems is treated in Nielsen (1999). In another important special case all constraints are linear, and f is a quadratic polynomial; this is called a quadratic optimization problem, see Chapter 3.

We conclude this introduction with two sections on important properties of the functions involved in our problems.

1.1. Smoothness and Descent Directions

In this booklet we assume that the cost function satisfies the following Taylor smoothness condition,
   f(x+h) = f(x) + hᵀg + ½ hᵀH h + O(‖h‖³),   (1.6a)
where g is the gradient,
   g ≡ f′(x) ≡ [∂f/∂x1 (x), ..., ∂f/∂xn (x)]ᵀ,   (1.6b)
and H is the Hessian matrix,
   H ≡ f″(x) ≡ [∂²f/(∂xi ∂xj) (x)].   (1.6c)

Furthermore we assume that the feasible region P has a piecewise smooth boundary. Specifically we require the constraint functions to satisfy the following Taylor smoothness condition,
   ci(x+h) = ci(x) + hᵀai + ½ hᵀAi h + O(‖h‖³)   (1.7a)
for i = 1, ..., m. Here ai and Ai represent the gradient and the Hessian matrix respectively,
   ai = ci′(x),   Ai = ci″(x).   (1.7b)

Notice that even when (1.7) is true, the boundary of P may contain points, curves, surfaces (and other subspaces), where the boundary is not smooth, eg points where more than one inequality constraint is active.

Example 1.3. We consider a two-dimensional problem with two inequality constraints, c1(x) ≥ 0 and c2(x) ≥ 0.
[Figure 1.1: Two inequality constraints in IR². The infeasible side is hatched. In this and the following figures c1 means the set {x | c1(x) = 0}, etc.]

In Figure 1.1 you see two curves with the points where c1 and c2, respectively, are active, see Definition 1.3. The infeasible side of each curve is indicated by hatching. The resulting boundary of P (shown with a thick line) is not smooth at the point where both constraints are active. You can also see that at this point the tangents of the two "active curves" form an angle which is less than (or equal to) π, when measured inside the feasible region. This is a general property.

Next, we consider a three-dimensional problem with two inequality constraints. In Figure 1.2 you see the active surfaces c1(x) = 0 and c2(x) = 0. As in the 2-dimensional case we have marked the actual boundary of the feasible region by a thick line and indicated the infeasible side of each constraint by hatching. It is seen that the intersection curve is a kink line in the boundary surface. It is also seen that the angle between the intersecting constraint surfaces is less than (or equal to) π, when measured inside P.

[Figure 1.2: Two inequality constraints in IR³.]

The methods we present in this booklet are in essence descent methods, ie iterative methods where we move from the present position x in a direction h that provides a smaller value of the cost function. We must satisfy the descent condition
   f(x+h) < f(x).   (1.8)

In Frandsen et al (1999) we have shown that the direction giving the fastest local descent rate is the Steepest Descent direction
   hsd = −f′(x).   (1.9)

In the same reference we also showed that the hyperplane
   H(x) = {x+u | uᵀf′(x) = 0}   (1.10)
divides the space IRⁿ into a "descent" (or "downhill") half space and an "uphill" half space.

A descent direction h is characterized by having a positive projection onto the steepest descent direction,
   hᵀhsd > 0  ⟺  hᵀf′(x) < 0.   (1.11)

Now consider the constraint functions. The equality constraints ci(x) = 0 (i = 1, ..., r) and the boundary curves corresponding to the active inequality constraints, ci(x) ≥ 0 satisfied with "=", can be considered as level curves or contours (n = 2), respectively level surfaces or contour surfaces (n > 2), for these functions. We truncate the Taylor series (1.7) to
   ci(x+h) = ci(x) + hᵀai + O(‖h‖²)   (i = 1, ..., m).
From this it can be seen that the direction ai (= ci′(x)) is orthogonal to any tangent to the contour at position x, ie ai is a normal to the constraint curve (surface) at the position.
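Both facts, the descent characterization (1.11) and ai being a normal to the constraint contour, are easy to check numerically. A minimal sketch (the example functions and the test point are ours):

   import numpy as np

   f_grad = lambda x: np.array([2*(x[0] - 1), 2*x[1]])    # gradient of f(x) = (x1-1)^2 + x2^2
   c1_grad = lambda x: np.array([-2*x[0], -2*x[1]])       # gradient of c1(x) = 1 - x1^2 - x2^2

   x = np.array([0.0, 1.0])            # a point on the contour c1(x) = 0
   h = np.array([1.0, 0.0])
   print(h @ f_grad(x) < 0)            # True: h satisfies (1.11), a descent direction
   t = np.array([1.0, 0.0])            # tangent of the circle c1 = 0 at x
   print(abs(c1_grad(x) @ t) < 1e-12)  # True: a1 is orthogonal to the tangent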
Example 1.4. We continue the 2- and 3-dimensional considerations from Example 1.3, see Figure 1.3. At the position x, c1 is an active and c2 is an inactive constraint, ie c1(x) = 0 and c2(x) > 0. The opposite is true at the position y, c2(y) = 0 and c1(y) > 0. At the two positions we have indicated the gradients of the active constraints. They point into the interior of P, the feasible region.

[Figure 1.3: The gradients of the constraint functions in IR² and IR³ point into the feasible region.]

At this early stage we want to emphasize that in a neighbourhood of a point at the boundary of P, properties of inactive constraints have no influence.

1.2. Convexity

The last phenomenon to be described in this introduction is convexity. It is essential for a theorem on uniqueness of a constrained global minimizer and also for some special methods.

A set D (for instance the feasible region P) is convex if the line segment between two arbitrary points in the set is contained in the set:

Definition 1.12. Convexity of a set. The set D ⊆ IRⁿ is convex if the following holds for arbitrary x, y ∈ D, θ ∈ [0, 1] and xθ ≡ θx + (1−θ)y:
   xθ ∈ D.

We also use the term convexity about functions:

Definition 1.13. Convexity of a function. Assume that D ⊆ IRⁿ is convex. The function f is convex on D if the following holds for arbitrary x, y ∈ D, θ ∈ [0, 1] and xθ ≡ θx + (1−θ)y:
   f(xθ) ≤ θf(x) + (1−θ)f(y).
f is strictly convex on D if
   f(xθ) < θf(x) + (1−θ)f(y).

Definition 1.14. Concavity of a function. Assume that D ⊆ IRⁿ is convex. The function f is concave/strictly concave on D if −f is convex/strictly convex on D.

In Figure 1.4 we show a strictly convex function between two points x, y ∈ D. The definition says that f(xθ), with xθ ≡ θx + (1−θ)y, is below the secant between the points (0, f(y)) and (1, f(x)), and this holds for all choices of x and y in D.

[Figure 1.4: A strictly convex function.]

Definition 1.15. Convexity at a point. The function f is convex at x ∈ D if there exists ε > 0 such that for arbitrary y ∈ D with ‖x−y‖ < ε, θ ∈ [0, 1] and xθ ≡ θx + (1−θ)y:
   f(xθ) ≤ θf(x) + (1−θ)f(y).
f is strictly convex at x ∈ D if
   f(xθ) < θf(x) + (1−θ)f(y).

It is easy to prove the following results:


Theorem 1.16. If D ⊆ IRⁿ is convex and f is twice differentiable on D, then
   1° f is convex on D ⟺ f″(x) is positive semidefinite for all x ∈ D,
   2° f is strictly convex on D if f″(x) is positive definite for all x ∈ D.

Theorem 1.17. First sufficient condition. If P is bounded and convex and if f is convex on P, then f has a unique global minimizer in P.

Theorem 1.18. If f is twice differentiable at x ∈ D, then
   1° f is convex at x ∈ D ⟺ f″(x) is positive semidefinite,
   2° f is strictly convex at x ∈ D if f″(x) is positive definite.

We finish this section with two interesting observations about the feasible domain P.

1) Let ci be an equality constraint. Take two arbitrary feasible points x and y: ci(x) = ci(y) = 0. All points xθ on the line between x and y must also be feasible (cf Definition 1.12),
   ci(θx + (1−θ)y) = 0 for all θ ∈ [0, 1].
Thus ci must be linear, and we obtain the surprising result: If P is convex, then all equality constraints are linear. On the other hand the set of points satisfying a linear equality constraint must be convex. Therefore the feasible domain of an equality constrained problem (ie r = m) is convex if and only if all constraints are linear.

2) Let ci be an inequality constraint. Assume that ci is concave on IRⁿ. If ci(x) ≥ 0, ci(y) ≥ 0 and θ ∈ [0, 1], then Definition 1.14 implies that
   ci(θx + (1−θ)y) ≥ θ ci(x) + (1−θ) ci(y) ≥ 0.
Thus, the set of points where ci is satisfied is a convex set. This means that the feasible domain P is convex if all equality constraints are linear and all inequality constraints are concave.
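Theorem 1.16 gives a practical test: for a quadratic f(x) = ½ xᵀH x the Hessian is the constant matrix H, so convexity can be read off from the eigenvalues of H. A minimal sketch (the example data are ours):

   import numpy as np

   H = np.array([[2.0, 0.5], [0.5, 1.0]])
   f = lambda x: 0.5 * x @ H @ x
   print(np.all(np.linalg.eigvalsh(H) >= 0))     # True: H is PSD, so f is convex

   # Spot-check the inequality of Definition 1.13 at random points:
   x, y, th = np.random.randn(2), np.random.randn(2), 0.3
   print(f(th*x + (1-th)*y) <= th*f(x) + (1-th)*f(y) + 1e-12)   # True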
2. LOCAL, CONSTRAINED MINIMIZERS

We shall progress gradually when we introduce the different aspects and complications in the conditions for local, constrained minimizers (see Definition 1.4). We make the assumption that the feasible region P is not empty and that the cost function f and the constraint functions ci (i = 1, ..., m) are smooth enough for the Taylor expansions (1.6) and (1.7) to hold.

First, we consider some special cases.

0) No equality constraints, no active inequality constraints

In Figure 2.1 we indicate the current position x and the "downhill" halfspace (cf (1.9) and (1.10)) in the two-dimensional case. The "uphill" side of the dividing hyperplane H is hatched.

[Figure 2.1: Steepest descent direction and "downhill" halfspace.]

In order to get a lower cost value we should move in a direction h in the unhatched halfspace of descent directions. If the step is not too long then the constraints of the problem are of no consequence.

1) One equality constraint (no inequality constraints) in IR²

In this case the feasible region, P = {x | c1(x) = 0}, is the curve shown in Figure 2.2. At a position x we show hsd, the halfspace of descent directions, and the constraint gradient, a1 = c1′(x), see (1.7b). It is obvious that if x "slides to the left" along the constraint curve, "pulled by the force hsd", we shall get lower cost values and still remain feasible. Thus x cannot be a local, constrained minimizer.

[Figure 2.2: One equality constraint in IR², c1(x) = 0. At x we have a feasible descent direction; at xs there is none.]

At the position xs the vectors hsd^(s) = −f′(xs) and a1^(s) = c1′(xs) are proportional, ie they satisfy a relation of the form
   hsd^(s) = −λ a1^(s)  ⟺  f′(xs) = λ c1′(xs),   (2.1)
where λ is a scalar. This point may be a local, constrained minimizer. We say that xs is a constrained stationary point.

In conclusion: Any local, constrained minimizer must satisfy the equation in (2.1) with λ ∈ IR.

2) Two active inequality constraints (no equality constraints) in IR²

Figure 2.5 illustrates this case. At position x both constraints are active. The pulling force hsd shown indicates that the entire feasible region is on the ascent side of the dividing plane H (defined in (1.10) and indicated in Figure 2.5 by a dashed line). In this case, x is a local, constrained minimizer.
[Figure 2.5: Two inequality constraints in IR², c1(x) ≥ 0 and c2(x) ≥ 0. At the intersection point, hsd points out of the feasible region.]

Imagine that you turn hsd around the point x (ie, you change the cost function f). As soon as the dividing plane intersects with the active part of one of the borders, a feasible descent direction appears. The limiting cases are when hsd is opposite to either a1 or a2. The position xs is said to be a constrained stationary point if hsd^(s) is inside the angle formed by −a1 and −a2, or
   hsd^(s) = −λ1 a1^(s) − λ2 a2^(s)  with λ1, λ2 ≥ 0.
This is equivalent to
   f′(xs) = λ1 c1′(xs) + λ2 c2′(xs)  with λ1, λ2 ≥ 0.   (2.2)

3) Strongly and weakly active constraints (in IR²)

In Figure 2.6 we show one inequality constraint and the contours of a quadratic function together with xu, its unconstrained minimizer. In the first case, xu is also a constrained minimizer, because the constraint c1 is inactive. In the last case, xu is not feasible and we have a constrained minimizer x*, with f′(x*) ≠ 0. We say that the constraint is strongly active. In the middle case, c1 is active at xu. Here we say that the constraint is weakly active, corresponding to λ1 = 0 in (2.1) and (2.2) (because f′(xu) = 0).

[Figure 2.6: Contours of a quadratic function in IR²; one constraint, c1(x) ≥ 0. Left panel: c1 is inactive; middle panel: c1 is weakly active; right panel: c1 is strongly active.]

In other words: if a constraint is weakly active, we can discard it without changing the optimizer. This remark is valid both for inequality and equality constraints.

2.1. The Lagrangian Function

The introductory section of this chapter indicated that there is an important relationship between g*, the gradient of the cost function, and ai* (i = 1, ..., m), the gradients of the constraint functions, all evaluated at a local minimizer. This has led to the introduction of Lagrange's Function:

Definition 2.3. Lagrange's Function. Given the objective function f and the constraints ci, i = 1, ..., m, Lagrange's function is defined by
   L(x, λ) = f(x) − Σ_{i=1}^m λi ci(x).
The scalars {λi} are the Lagrangian multipliers.

The gradient of L with respect to x is denoted Lx′, and we see that
   Lx′(x, λ) = f′(x) − Σ_{i=1}^m λi ci′(x).   (2.4)
By comparison with the formulae in the introduction to this chapter we see that in all cases the necessary condition for a local, constrained minimizer could be expressed in the form Lx′(xs, λ) = 0.

For an unconstrained optimization problem you may recall that the necessary conditions and the sufficient condition for a minimizer involve the gradient f′(x*) and the Hessian matrix f″(x*) of the cost function, see Theorems 1.1, 1.2 and 1.5 in Frandsen et al (1999). In the next sections you will see that the corresponding results for constrained optimization involve the gradient and the Hessian matrix (with respect to x) of the Lagrangian function.

2.2. First Order Condition, Necessary Condition

First order conditions on local minimizers only consider first order partial derivatives of the cost function and the constraint functions. With this restriction we can only formulate the necessary conditions; the sufficient conditions also include second derivatives.

Our presentation follows Fletcher (1993), and we refer to this book for the formal proofs, which are not always straightforward. The strategy is as follows:

(1) Choose an arbitrary, feasible point.
(2) Determine a step which leads from this point to a neighbouring point, which is feasible and has a lower cost value.
(3) Detect circumstances which make the above impossible.
(4) Prove that only the above circumstances can lead to failure in step (2).

First, we formulate the so-called first order Karush-Kuhn-Tucker conditions (KKT conditions for short):

Theorem 2.5. First order necessary conditions (KKT conditions). Assume that
   a) x* is a local constrained minimizer of f (see Definition 1.4),
   b) either b1) all active constraints ci are linear,
      or b2) the gradients ai* = ci′(x*) for all active constraints are linearly independent.
Then there exist Lagrangian multipliers {λi*}_{i=1}^m (see Definition 2.3) such that
   1° Lx′(x*, λ*) = 0,
   2° λi* ≥ 0, i = r+1, ..., m,
   3° λi* ci(x*) = 0, i = 1, ..., m.

The formulation is very compact, and we therefore give some clarifying remarks:

1° This was exemplified in connection with (2.4).

2° λi* ≥ 0 for all inequality constraints was exemplified in (2.2), and in Appendix A we give a formal proof.

3° For an equality constraint, ci(x*) = 0, and λi* can have any sign. For an active inequality constraint, ci(x*) = 0, and λi* ≥ 0. For an inactive inequality constraint, ci(x*) > 0, so we must have λi* = 0, confirming the observation in Example 1.4 that these constraints have no influence on the constrained minimizer.
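As a concrete illustration, the KKT conditions can be verified numerically; the sketch below (names ours) checks 1°–3° for case 2° of Example 1.1, where f(x) = (x−1)², c1(x) = x−2 ≥ 0 and x⁺ = 2.

   f_grad = lambda x: 2*(x - 1)           # f(x) = (x-1)^2
   c1 = lambda x: x - 2                   # inequality constraint c1(x) >= 0
   c1_grad = lambda x: 1.0
   xstar = 2.0                            # the constrained minimizer
   lam = f_grad(xstar) / c1_grad(xstar)   # from 1: f'(x*) = lam * c1'(x*)
   print(lam >= 0)                        # 2: lam = 2 >= 0
   print(abs(lam * c1(xstar)) < 1e-12)    # 3: complementarity lam * c1(x*) = 0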
In analogy with unconstrained optimization we can introduce the following

Corollary 2.6. Constrained stationary point.
   xs is feasible and (xs, λs) satisfy 1°–3° in Theorem 2.5
   ⟺ xs is a constrained stationary point.

2.3. Second Order Conditions

The following example demonstrates that not only the curvature of the cost function but also the curvatures of the constraint functions are involved in the conditions for constrained minimizers.

Example 2.1. This example in IR² with one equality constraint (r = m = 1) is due to Fiacco and McCormick (1968). The cost function and constraint are
   f(x) = ½((x1 − 1)² + x2²),   c1(x) = −x1 + βx2².
We consider this problem for three different values of the parameter β, see Figure 2.7.

[Figure 2.7: Contours of f and the constraint −x1 + βx2² = 0 for β = 0, β = 1/4 and β = 1.]

In all the cases, xs = 0 is a constrained stationary point, see Corollary 2.6:
   L(x, λ) = ½((x1 − 1)² + x2²) − λ(−x1 + βx2²),
   Lx′(x, λ) = [x1 − 1, x2]ᵀ − λ[−1, 2βx2]ᵀ;   Lx′(0, λ) = [−1 + λ, 0]ᵀ.
Thus, (xs, λs) = (0, 1) satisfy 1° in Theorem 2.5, and 2°–3° are automatically satisfied when the problem has equality constraints only. Notice that f is strictly convex in IR².

For β = 0 the feasible region is the x2-axis. This together with the contours of f(x) near the origin tells us that we have a local, constrained minimizer, x* = 0.

With β = 1/4 the stationary point xs = 0 is also a local, constrained minimizer, x* = 0. This can be seen by correlating the feasible parabola with the contours of f around 0.

Finally, for β = 1 we get the rather surprising result that xs = 0 is a local, constrained maximizer. Inspecting the feasible parabola and the contours carefully, you will discover that two local constrained minimizers have appeared around x = [0.5, ±0.7]ᵀ.

In Frandsen et al (2004) we derived the second order conditions for unconstrained minimizers. The derivation was based on the Taylor series (1.6) for f(x*+h), and led to conditions on the definiteness of the Hessian matrix Hu = f″(xu), where xu is the unconstrained minimizer.

The above example indicates that we also have to take into account the curvature of the active constraints, Ai* = ci″(x*) for i ∈ A(x*).

The second order condition takes care of the situation where we move along the edge of P from a stationary point x. Such a direction is called a feasible active direction:

Definition 2.7. Feasible active direction. Let x ∈ P. The nonzero vector h is a feasible active direction if
   hᵀci′(x) = 0
for all active constraints.
Now we use the Taylor series to study the variation of the Lagrangian function. Suppose we are at a constrained stationary point xs, and in the variation we keep λ = λs from Corollary 2.6. From xs we move in a feasible active direction h,
   L(xs+h, λs) = L(xs, λs) + hᵀLx′(xs, λs) + ½ hᵀLxx″(xs, λs)h + O(‖h‖³)
              = L(xs, λs) + ½ hᵀLxx″(xs, λs)h + O(‖h‖³),   (2.8)
since (xs, λs) satisfies 1° in Theorem 2.5. The fact that xs is a constrained stationary point implies that also 3° of Theorem 2.5 is satisfied, so that L(xs, λs) = f(xs). Since h is a feasible active direction, we obtain (again using 3° of Theorem 2.5)
   L(xs+h, λs) = f(xs+h) − Σ_{i=1}^m λi^(s) ci(xs+h)
              ≈ f(xs+h) − Σ_{i=1}^m λi^(s) (ci(xs) + hᵀci′(xs))
              = f(xs+h),   (2.9)
and inserting this in (2.8) we get (for small values of ‖h‖)
   f(xs+h) ≈ f(xs) + ½ hᵀWs h,   (2.10a)
where the matrix Ws is given by
   Ws = Lxx″(xs, λs) = f″(xs) − Σ_{i=1}^m λi^(s) ci″(xs).   (2.10b)

This leads to the sufficient condition that the stationary point xs is a local, constrained minimizer if hᵀWs h > 0 for any feasible active direction h. Since this condition is also necessary we can formulate the following two second order conditions:

Theorem 2.11. Second order necessary condition. Assume that
   a) x* is a local constrained minimizer for f,
   b) as b) in Theorem 2.5,
   c) all the active constraints are strongly active.
Then there exist Lagrangian multipliers {λi*}_{i=1}^m (see Definition 2.3) such that
   1° Lx′(x*, λ*) = 0,
   2° λi* ≥ 0, i = r+1, ..., m,
   3° λi* > 0 if ci is active, i = r+1, ..., m,
   4° λi* ci(x*) = 0, i = 1, ..., m,
   5° hᵀW* h ≥ 0 for any feasible active direction h.
Here, W* = Lxx″(x*, λ*).

Theorem 2.12. Second order sufficient condition. Assume that
   a) xs is a local constrained stationary point (see Corollary 2.6),
   b) as b) in Theorem 2.5,
   c) as c) in Theorem 2.11,
   d) hᵀWs h > 0 for any feasible active direction h, where Ws = Lxx″(xs, λs).
Then xs is a local constrained minimizer.

For the proofs we refer to Fletcher (1993). There, you may also find a treatment of the cases where the gradients of the active constraints are not linearly independent, and where some constraints are weakly active.
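The test in d) can be organized numerically by spanning the feasible active directions with a null-space basis of the active constraint gradients and checking the definiteness of the projected Ws. A minimal sketch, reproducing for β = 1/4 the computation that Example 2.2 below carries out by hand (tolerances and names are ours):

   import numpy as np

   beta, lam_s = 0.25, 1.0
   Ws = np.array([[1.0, 0.0], [0.0, 1.0 - 2*beta*lam_s]])   # (2.10b) for Example 2.1
   a1 = np.array([[-1.0, 0.0]])      # c1'(x_s) at x_s = 0, as a 1 x 2 matrix

   # Feasible active directions span the null space of the active gradients:
   _, s, Vt = np.linalg.svd(a1)
   N = Vt[(s > 1e-12).sum():].T      # here N = [0, 1]^T
   print(np.linalg.eigvalsh(N.T @ Ws @ N))   # [0.5] > 0: Theorem 2.12 applies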
Example 2.2. Continuing from Example 2.1 we find
   Lxx″(x, λ) = [1 0; 0 1] − λ [0 0; 0 2β].
At the stationary point xs = 0 we found λs = 1. Further, from Figure 2.7 and Definition 2.7 we see that h = [0, h2]ᵀ is the only feasible active direction, and we get
   hᵀLxx″(xs, λs)h = (1 − 2β)h2².
This is positive if β < ½, and Theorem 2.12 shows that in this case xs = 0 is a local, constrained minimizer.

If β > ½, then hᵀLxx″(xs, λs)h < 0, contradicting 5° in Theorem 2.11; therefore xs = 0 cannot be a local, constrained minimizer when β > ½.

The limiting case β = ½ is not covered by the theorems. In order to investigate it, higher order terms are needed in the Taylor expansion for L(xs+h, λs).

Finally, we give the following theorem, whose proof can be found on page 10 of Madsen (1995):

Theorem 2.13. Third sufficient condition. Assume that
   a) xs is a local constrained stationary point (see Corollary 2.6),
   b) all active constraints are linear,
   c) hᵀWs h > 0 for any feasible active direction h ≠ 0.
Then xs is a local constrained minimizer.

3. QUADRATIC OPTIMIZATION

We now start to introduce solution methods for different classes of optimization problems with constraints. The fundamental class has linear cost functions and also linear constraints. This class is called linear optimization problems and is covered in Nielsen (1999).

The next class has a quadratic cost function and all the constraints are linear. We call it

Definition 3.1. The quadratic optimization problem (QO). Find
   x* = argmin_{x∈P} {q(x)},
where
   q(x) = ½ xᵀH x + gᵀx,
   P = {x ∈ IRⁿ | aiᵀx = bi, i = 1, ..., r;  aiᵀx ≥ bi, i = r+1, ..., m}.

The matrix H ∈ IRⁿˣⁿ and the vectors g, a1, ..., am ∈ IRⁿ are given. The associated Lagrange function is
   L(x, λ) = ½ xᵀH x + gᵀx − Σ_{i=1}^m λi (aiᵀx − bi),   (3.2a)
with the first and second order derivatives
   Lx′(x, λ) = Hx + g − Σ_{i=1}^m λi ai,   Lxx″(x, λ) = H.   (3.2b)

Throughout this chapter we have the assumptions


Assumption 3.3. H is symmetric and positive definite.

(See Fletcher (1993) for methods for the cases where this simplifying assumption is not satisfied.) Under Assumption 3.3 the problem is strictly convex (Theorem 1.16). This ensures that q(x) → +∞ when ‖x‖ → ∞, irrespective of the direction. Thus we need not require the feasible region P to be bounded. All the constraint functions are linear and this makes P convex. Thus, in this case Theorem 1.17 leads to

Corollary 3.4. Under Assumption 3.3 the problem QO of Definition 3.1 has a unique solution.

As in Chapter 2 we shall progress gradually with the different complications of the methods, ending the chapter with a method for non-linear optimization using iterations where each step solves a quadratic optimization problem, gradually approaching the properties of the non-linear cost function and constraints.

Example 3.1. In Figure 3.1 you see the contours of a positive definite quadratic in IR². If there are no constraints on the minimizer, we get the unconstrained minimizer, indicated by xu in the figure.

[Figure 3.1: Contours of a quadratic in IR² and its unconstrained minimizer xu.]

The solution of the unconstrained quadratic optimization problem corresponding to Definition 3.1 is found from the necessary condition q′(xu) = 0, which is the following linear system of equations,
   H xu = −g.   (3.5)
The solution is unique according to our assumptions.

3.1. Basic Quadratic Optimization

The basic quadratic optimization problem is the special case of Problem QO (Definition 3.1) with only equality constraints, ie m = r. We state it in the form¹⁾

Definition 3.6. Basic quadratic optimization problem (BQO). Find
   x* = argmin_{x∈P} {q(x)},
where
   q(x) = ½ xᵀH x + gᵀx,   P = {x ∈ IRⁿ | Aᵀx = b}.
The matrix A ∈ IRⁿˣᵐ has the columns A:,j = aj, and bj is the jth element in b ∈ IRᵐ.

The solution can be found directly, namely by solving the linear system of equations which expresses the necessary condition that the Lagrange function L is stationary at the solution with respect to both of its vector variables x and λ:
   Lx′(x, λ) = 0 :  Hx + g − Aλ = 0,
   Lλ′(x, λ) = 0 :  Aᵀx − b = 0.   (3.7)

The first equation is the KKT condition, and the second expresses that the constraints are satisfied at the solution. This linear system of equations has the dimension (n+r)×(n+r), with r = m. Thus the solution requires O((n+m)³) operations. We return to the solution of (3.7) in Section 3.3.

1) In other presentations you may find the constraint equation formulated as Ãx = b with Ã = Aᵀ. Hopefully this will not lead to confusion.
3.2. General Quadratic Optimization

In the general case we have both equality constraints and inequality constraints in Problem 3.1, and we must use an iterative method to solve the problem. If we knew which constraints are active at the solution x* we could set up a linear system like (3.7) and find the solution directly. Thus the problem can be formulated as that of finding the active set A(x*).

We present a so-called active set method. Each iterate x is found via an active set A (corresponding to the constraints that should be satisfied with "=", cf Definition 1.3). Ignoring the inactive constraints we consider the basic quadratic optimization problem with the equality constraints given by A:

Definition 3.8. Current BQO problem (CBQO(A)). Find
   xeq = argmin_{x∈P} {q(x)},
where
   q(x) = ½ xᵀH x + gᵀx,   P = {x ∈ IRⁿ | Aᵀx = b}.
The matrix A ∈ IRⁿˣᵖ has the columns A:,j = aj, j ∈ A, and b ∈ IRᵖ has the corresponding values of bj. p is the number of elements in A.

We shall refer to CBQO(A) as a function (subprogram) that returns (xeq, λeq), the minimizer and the corresponding set of Lagrange multipliers for the active set A. Similar to the BQO they are found as the solution to the following linear system of dimension (n+p)×(n+p):
   Hx + g − Aλ = 0,
   Aᵀx − b = 0.   (3.9)

In the iteration for solving Problem 3.1 all iterates are feasible. This means that we have a feasible x and an active set A at the beginning of each iteration. Now the CBQO (Definition 3.8) is solved. If xeq violates some constraint (ie some of the ignored inequality constraints), then the next iterate is that feasible point on the line from x to xeq which is closest to xeq, and the new inequality constraint(s) becoming active is added to A.

If, on the other hand, xeq is feasible, then we are finished (ie x* = xeq) provided that all Lagrange multipliers corresponding to Ã are non-negative (Theorem 2.13). If there is one or more λj < 0 for j ∈ Ã (ie for active inequality constraints), then one of the corresponding indices is dropped from A before the next iteration.

Before formally defining the strategy in Algorithm 3.10 we illustrate it through a simple example.

Example 3.2. We take a geometric view of a problem in IR² with 3 inequality constraints. In Figure 3.2 we give the contours of the cost function and the border lines for the inequalities. The infeasible side is hatched.

[Figure 3.2: Contours of a quadratic optimization problem in IR² with 3 inequality constraints, aiᵀx ≥ bi, i = 1, 2, 3.]

The starting point x = x0 is feasible, and we define A ≡ A(x0) = {1, 2}, while the third constraint is inactive. The "pulling force" hsd (= −q′(x0)) shows that we should leave inequality no. 1. This corresponds to the fact that λ1 < 0. Thus the next active set is A = {2}. The solution to the corresponding system (3.9) is x = x1. This is feasible, but the "pulling force" tells us that we should
loosen the only remaining constraint (corresponding to λ2 < 0). Thus, the next CBQO step will lead to xu, the unconstrained minimizer, which is infeasible: It satisfies constraints 1 and 2, but not 3. The next iterate, x = x2, is found as the intersection between the line from x1 to xu and the bordering line for a3ᵀx ≥ b3. Finally, a CBQO step from x2 with A = {3} gives x = x3. This is feasible, and by checking the contours of the cost function we see that we have come to the solution, x* = x3. Algebraically we see this from the fact that λ3 > 0.

The strategy from this example is generalized in Algorithm 3.10.

Algorithm 3.10. General quadratic optimization

   begin
      x := x0                                                     {1°}
      A := A(x)                                                   {2°}
      stop := false
      repeat
         (xeq, λeq) := CBQO(A)                                    {cf Definition 3.8 and (3.9)}
         if xeq is infeasible
            x := best feasible point on the line from x to xeq    {3°}
            Update A                                              {3°}
         else
            x := xeq
            L := {j ∈ Ã | λj < 0}
            if L is empty
               stop := true                                       {4°}
            else
               Remove an element of L from A                      {5°}
      until stop
   end

We have the following remarks:

1° The initial point x0 must be feasible. How to find such a point is discussed in Section 3.3.

2° A holds the indices of the current active constraints, cf Definition 1.3.

3° The vector xeq satisfies the current active constraints, but some of the inequality constraints that were ignored, j ∈ {r+1, ..., m} \ Ã, may be violated at xeq. Let V denote the set of indices of constraints violated at xeq,
   V = {j ∈ {r+1, ..., m} | ajᵀxeq < bj}.
We shall choose the best feasible point on the line between x and xeq,
   x̃ = x + t(xeq − x),  0 < t < 1.   (3.11a)
The value of t which makes constraint no. j active is given by ajᵀx̃ = bj, which is equivalent to
   tj = (bj − ajᵀx) / (ajᵀ(xeq − x)).   (3.11b)
Since x is feasible and xeq is optimal in CBQO, and since the objective function is convex, the best feasible point on the line is the feasible point closest to xeq. This corresponds to
   k = argmin_{j∈V} tj.   (3.11c)
The new x is found by (3.11a) with t = tk, and the index k is added to A. If the minimum in (3.11c) is taken by several values of j then all of these are added to A.

4° Since xeq is feasible and the Lagrange multipliers corresponding to inequality constraints are nonnegative, xeq = x* solves the problem according to Theorem 2.13.

5° If an active inequality constraint at the CBQO solution has a negative Lagrange multiplier, then we can reduce the cost function by loosening this constraint.
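The following numpy sketch puts Algorithm 3.10 and the remarks together (assuming H positive definite and a feasible x0; the tolerances and names are ours, and CBQO is solved naively from (3.9)):

   import numpy as np

   def cbqo(H, g, A, b, act):
       """Solve the CBQO (3.9) for the active set `act` (a list of indices)."""
       Aa, ba, p = A[:, act], b[act], len(act)
       K = np.block([[H, -Aa], [-Aa.T, np.zeros((p, p))]])
       z = np.linalg.solve(K, -np.concatenate([g, ba]))
       return z[:len(g)], z[len(g):]

   def qo_active_set(H, g, A, b, r, x, tol=1e-10):
       """Algorithm 3.10: A[:, i]'x = b[i] for i < r, A[:, i]'x >= b[i] for i >= r."""
       m = A.shape[1]
       act = [i for i in range(m) if i < r or A[:, i] @ x <= b[i] + tol]
       while True:
           xeq, lam = cbqo(H, g, A, b, act)
           V = [j for j in range(m) if j >= r and j not in act
                and A[:, j] @ xeq < b[j] - tol]
           if V:                              # step to the best feasible point (3.11)
               t = {j: (b[j] - A[:, j] @ x) / (A[:, j] @ (xeq - x)) for j in V}
               k = min(t, key=t.get)
               x = x + t[k] * (xeq - x)
               act.append(k)                  # remark 3
           else:
               x = xeq
               neg = [i for i, j in enumerate(act) if j >= r and lam[i] < -tol]
               if not neg:
                   return x, act              # remark 4: solution found
               act.pop(neg[0])                # remark 5: drop one constraint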
Finite termination. For each choice A of currently active constraints CBQO has a unique minimizer xeq. Each time an element of L is removed from A (see remark 5°) we have x = xeq and there is a strict decrease in the objective function: q(xnew) < q(x). Since each new iterate satisfies q(xnew) ≤ q(x), the strict decrease can only take place a finite number of times because of the finite number of possible active sets A. Therefore we only drop a constraint a finite number of times, and thus cycling cannot take place: The algorithm must stop after a finite number of iterations.

3.3. Implementation Aspects

To start Algorithm 3.10 we need a feasible starting point x0. This is simple if m ≤ n (the number of constraints is at most equal to the number of unknowns): We just solve
   Aᵀx = b,   (3.12a)
with A ∈ IRⁿˣᵐ having the columns ai, i = 1, ..., m. If m < n, then this system is underdetermined, and the solution has (at least) n−m free parameters. For any choice of these the vector x is feasible; all the constraints are active.

If m > n, we cannot expect to find an x with all inequality constraints active. Instead, we can use the formulation
   Aᵀx − s = b  with s ≥ 0,   (3.12b)
and si = 0 for the equality constraints. The problem of finding an x that satisfies (3.12b) is similar to getting a feasible starting point for the SIMPLEX method in Linear Optimization, see Section 4.4 in Nielsen (1999).

The most expensive part of the process is the solution of the CBQO at each iteration. The simplest approach would be to start from scratch for each new A. Then the accumulated cost of the computations involved in the solutions of (3.9) would be O((n+m)³) floating point operations per call of CBQO. If the constraint gradients are linearly independent then the number of equality and active constraints cannot exceed n, and thus the work load is O(n³) floating point operations per call of CBQO.

Considerable savings are possible when we note that each new A is obtained from the previous one either by deleting a column or by adding one or more new columns. First, we note that the matrix H is used in all iteration steps. It should be factorized once and for all, eg by Cholesky's method, cf Appendix A in Frandsen et al (2004),
   H = CCᵀ,
where C is lower triangular. This requires O(n³) operations, and after this, each "H⁻¹w" will then require O(n²) operations.

The first equation in (3.9) can be reformulated to
   x = H⁻¹(Aλ − g),   (3.13a)
and when we insert this in the second equation in (3.9), we get
   (AᵀH⁻¹A)λ = b + AᵀH⁻¹g.   (3.13b)
Next, we can reformulate (3.13b) to
   Gλ = b + Aᵀd  with  G = (C⁻¹A)ᵀ(C⁻¹A),  d = H⁻¹g.
This system is solved via the Cholesky factorization of the p×p matrix G (p being the current number of active constraints). When A changes by adding or deleting a column, it is possible to update this factorization in O(n·p) operations, and the cost of each iteration step reduces to O(n²) operations. For more details see pp 18–19 in Madsen (1995).
There are alternative methods for solving the system (3.9). Gill and Murray (1974) suggest to use the QR factorization²⁾ of the active constraint matrix,
   A = Q [R; 0] = [QR QN] [R; 0] = QR R,   (3.14)
where Q is orthogonal and R is upper triangular. As indicated, we can split Q into QR ∈ IRⁿˣᵖ and QN ∈ IRⁿˣ⁽ⁿ⁻ᵖ⁾. The orthogonality of Q implies that
   QRᵀQR = Ip×p,   QRᵀQN = 0p×(n−p),   (3.15)
where the indices on I and 0 give the dimensions of the matrix. The columns of Q form an orthonormal basis of IRⁿ, and we can express x in the form
   x = QR u + QN v,   u ∈ IRᵖ, v ∈ IRⁿ⁻ᵖ.   (3.16)

Inserting (3.14) – (3.16) in the second equation of (3.9) we get
   RᵀQRᵀ(QR u + QN v) = Rᵀu = b.
This lower triangular system is solved by forward substitution. To find v in (3.16) we multiply the first equation of (3.9) by QNᵀ and get
   QNᵀH(QR u + QN v) + QNᵀg − QNᵀQR Rλ = 0,
and by use of the second identity in (3.15) this leads to
   QNᵀHQN v = −QNᵀ(HQR u + g).   (3.17)
The (n−p)×(n−p) matrix M = QNᵀHQN is symmetric and positive definite, and (3.17) can be solved via Cholesky factorization of M. Finally, λ can be computed from the first equation of (3.9):
   QRᵀAλ = Rλ = QRᵀ(Hx + g).   (3.18)
We used (3.14) and (3.15) in the first reformulation, and x is given by (3.16). The system (3.18) is solved by back substitution.

There are efficient methods for updating the QR factorization of A when this matrix is changed because an index is added to or removed from the active set, see eg Section 12.5 in Golub and Van Loan (1996). This method for solving the system (3.9) is advantageous if p is large, p ≳ ½n.

If the problem is large and sparse, ie most of the elements in H and A are zero, then both the above approaches tend to give matrices that are considerably less sparse. In such cases it is recommended to solve (3.9) via the so-called augmented system,
   [H −A; −Aᵀ 0] [x; λ] = −[g; b].   (3.19)
Because of Assumption 3.3 the matrix is symmetric. It is not positive definite, however³⁾, but there are efficient methods for solving such systems, where the sparsity is preserved better, without spoiling numerical stability. It is also possible to handle the updating aspects efficiently; see eg Duff (1993).

2) See eg Chapter 2 in Madsen and Nielsen (2002) or Section 5.2 in Golub and Van Loan (1996).

3) A necessary condition for a symmetric matrix to be positive definite is that all the diagonal elements are strictly positive.

3.4. Sequential Quadratic Optimization

A number of efficient methods for non-linear optimization originate from sequential quadratic optimization. These methods are iterative methods where each iteration step includes the solution of a general quadratic optimization problem.

First, we consider problems with equality constraints only:
   x* = argmin_{x∈P} f(x),   P = {x ∈ IRⁿ | c(x) = 0}.   (3.20)
Here, c is the vector function c : IRⁿ → IRʳ, whose ith component is the ith constraint function ci.

The corresponding Lagrange's function (Definition 2.3) is
   L(x, λ) = f(x) − λᵀc(x),   (3.21a)
with the gradient
   L′(x, λ) = [Lx′(x, λ); Lλ′(x, λ)] = [f′(x) − Jcᵀλ; −c(x)],   (3.21b)
where Jc is the Jacobian matrix of c,
   (Jc)ij = ∂ci/∂xj (x)  ⟺  Jc = [c1′(x) ··· cr′(x)]ᵀ.   (3.21c)
At a stationary point xs with corresponding λs we have L′(xs, λs) = 0, which implies that c(xs) = 0 (the constraints are satisfied) and
   f′(xs) − Jcᵀλs = 0,
which we recognize as a part of the KKT conditions (Theorem 2.5).

Thus we can reformulate problem (3.20) to a non-linear system of equations: Find (x*, λ*) such that
   L′(x, λ) = 0.
We can use Newton-Raphson's method to solve this problem. In each iteration step with current iterate (x, λ), we find the next iterate as (x+h, λ+η), with the step determined by
   L″(x, λ) [h; η] = −L′(x, λ),
where L″ is the total Hessian,
   L″ = [Lxx″ Lxλ″; Lλx″ Lλλ″] = [W −Jcᵀ; −Jc 0]
with
   W = Lxx″(x, λ) = f″(x) − Σ_{i=1}^r λi ci″(x).

One Newton-Raphson step is
   [W −Jcᵀ; −Jc 0] [h; η] = −[f′(x) − Jcᵀλ; −c(x)],
   x := x + h;  λ := λ + η,
and by elimination of η we obtain
   [W −Jcᵀ; −Jc 0] [h; λ] = −[f′(x); −c(x)],
   x := x + h.   (3.22)

What has this got to do with Quadratic Optimization? Quite a lot! Compare (3.22) with (3.19). Since (3.19) gives the solution (x, λ) to the CBQO (Definition 3.8), it follows that (3.22) gives the solution h and the corresponding Lagrange multiplier vector λ to the following problem:
   Find h = argmin_{h∈Plin} {q(h)},   (3.23a)
where
   q(h) = ½ hᵀW h + f′(x)ᵀh,
   Plin = {h ∈ IRⁿ | Jc h + c(x) = 0}.   (3.23b)

Adding a constant to q makes no difference to the solution vector. If we furthermore insert the value of W then (3.23b) becomes
   q(h) = ½ hᵀLxx″(x, λ) h + f′(x)ᵀh + f(x),
   Plin = {h ∈ IRⁿ | Jc h + c(x) = 0}.   (3.23c)

By comparison with the Taylor expansions (1.6) and (1.7) we see that if λ = 0 then q(h) is a second order approximation to f(x+h), and Jc h + c(x) is a first order approximation to c(x+h). In other words, (3.23) represents a local QO (ie Quadratic Optimization) approximation to (3.20), except for the fact that f″(x) is replaced by Lxx″(x, λ). However, using Lxx″(x, λ) provides faster convergence than using a quadratic approximation to f, which follows from this argument: It is shown above that solving (3.23) and subsequently letting
   x := x + h
in the final stages of an iterative method for solving (3.20) corresponds to applying the Newton-Raphson method to find a stationary point of the Lagrange function L. Under the usual regularity assumptions this provides quadratic final convergence to the solution of (3.20). Using f″(x) instead of Lxx″(x, λ) in the QO approximation would perturb the Newton-Raphson matrix (except for λ = 0, where Lxx″(x, λ) = f″(x)). Thus the quadratic convergence would be prevented.

q(δ) = 2 + 2δ1 + 2δ2 + 12 δ12 + 12 δ12


Find h = argminh∈Plin {q(h)} = 1
2
(δ1 + 2)2 + 1
2
(δ2 + 2)2 ,
>
q(h) = 12 h> Lxx
00
(x, λ) h + f 0 (x) h + f (x) , l(δ) = −1 + 2δ1 − δ2 .
(3.25)
Plin = {h ∈ IRn | cj (x) + cj0 (x)> h = 0 , j = 1, . . . , r The level curves of q are concentric circles centered at δ = [−2, −2]> , and the
solution δ = h is the point, where one of these circles touches the line l(δ) = 0,
cj (x) + cj0 (x)> h ≥ 0 , j = r+1, . . . , m} see Figure 3.3a. The solution is h = [ −0.8, −2.6 ]> .
Using the line search to be described in Section 4.2 the next approximation is
00
In applications the demand for second derivatives (in Lxx (x, λ)) can be an x := x + αh = [0.620, −0.236]> . This leads to the next quadratic approxima-
obstacle, and we may have to use approximations to these. Another problem tion (3.27) with
with the method is that the quadratic model is good only for small values of · ¸ · ¸
1.239 0.943 −0.044
khk. Therefore, when the current x is far away from the solution, it may be f 0 (x) = , W= ,
−0.472 −0.044 2.014
a good idea to retain the direction of h but reduce its length. In Section 4.2 l(δ) = −0.380 + 1.238δ1 − δ2
00
we present a method where Lxx (x, λ) is approximated by BFGS updating, 00
where W is an updated approximation to Lxx (x, λ) (see Section 4.2). The con-
and where a line search is incorporated in order to make the convergence
tours of q are concentric ellipses centered at −W−1 f 0 (x) = [−1.305, 0.206]>
robust also far from the solution.
(the unconstrained minimizer of q, cf (3.5)).

Example 3.3. Consider the problem Figure 3.3 shows the contours of q (full line) and f (dashed line) through the
points given by δ = 0 and δ = h. In the second case we see that in the region
f (x) = x21 + x22 , P = {x ∈ IR2 | x21 − x2 − 1 = 0} (3.26) of interest q is a much better approximation to f than in the first case. Notice
the difference in scaling and that each plot has the origin at the current x. At the
The cost function is a quadratic in x, but the constraint c1 (x) = x21 −x2 −1 is not point x := x + αh ' [ 0.702, −0.508 ]> we get c1 (x) ' 1.3·10−4 . This value
linear, so this is not a quadratic optimization problem. is too small to be seen in Figure 3.3b.
In Example 4.5 we solve this problem via a series of approximations of the form
>
f (x+δ) ' q(δ) ≡ 12 δ> W δ + f 0 (x) δ + f (x),
>
c(x+δ) ' l(δ) ≡ c01 (x) δ + c1 (x) ,
00
where q is the function of (3.23) with Lxx (x, λ) replaced by an approximation
W. This leads to the following subproblem,

Find h = argminδ ∈Plin {q(δ)}


(3.27)
Plin = {x ∈ IR2 | l(δ) = 0} .

Let the first approximation for solving (3.26) correspond to x = [1, 1]> and
W = I. Then
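The first subproblem can be checked in a few lines; the sketch below solves the equality constrained QP via its KKT system (cf (3.7)) and reproduces h = [−0.8, −2.6]ᵀ:

   import numpy as np

   # First subproblem of Example 3.3: W = I, f'(x) = [2, 2] at x = [1, 1],
   # linearized constraint l(delta) = -1 + 2*delta1 - delta2 = 0.
   W, fg = np.eye(2), np.array([2.0, 2.0])
   a, c1 = np.array([[2.0], [-1.0]]), np.array([-1.0])
   K = np.block([[W, -a], [-a.T, np.zeros((1, 1))]])
   z = np.linalg.solve(K, np.concatenate([-fg, c1]))
   print(z[:2])    # [-0.8, -2.6], as in the text; z[2] is the multiplier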
[Figure 3.3: Approximating quadratics and the constraint. Dotted line: f and c. Full line: q and d.]

4. PENALTY AND SQO METHODS

There are several strategies on which to base methods for general constrained optimization. The first is called sequential linear optimization: in each iteration step we solve a linear optimization problem where both the cost function and the constraint functions are approximated linearly. This strategy may be useful eg in large scale problems.

The next strategy is sequential quadratic optimization (SQO). We introduced this in Section 3.4, and in Section 4.2 we shall complete the description, including features that make it practical and robust.

The third strategy could be called sequential unconstrained optimization (SUO). In each iteration step we solve an unconstrained optimization problem, with the cost function modified to induce or force the next iterate to be feasible. The modification consists in adding a penalty term to the cost function. The penalty term is zero if we are in the feasible region, and positive if we are outside it. The following examples are due to Fletcher (1993).

Example 4.1. Find
   argmin_{x∈P} f(x),   P = {x ∈ IR² | c1(x) = 0},
where
   f(x) = −x1 − x2,   c1(x) = 1 − x1² − x2².
It is easy to see that the solution is x* = (1/√2)[1, 1]ᵀ.

We penalize infeasible vectors by using the following function
   ϕ(x, σ) = f(x) + ½ σ (c1(x))²,
where σ is a positive parameter. The penalty is zero if x is in P, and positive
otherwise. In the case σ = 0 we have an unconstrained and unbounded problem: When the components of x tend to infinity, f(x) tends to −∞. In Figure 4.1 we see the contours of ϕ(x, 1), ϕ(x, 10) and ϕ(x, 100). For σ > 0 we have a minimizer xσ, and the figures indicate the desired convergence: xσ → x* for σ → ∞.

[Figure 4.1: Contours and minimizer of ϕ(x, σ) for σ = 1, 10, 100. x* and xσ are marked by * and o, respectively.]

The figure indicates a very serious problem connected with SUO. As σ → ∞, the valley around xσ becomes longer and narrower, making trouble for the method used to find this unconstrained minimizer. Another way of expressing this is that the unconstrained problems become increasingly ill-conditioned.

Example 4.2. Consider the same problem as before, except that now c1 is an inequality constraint: Find
   argmin_{x∈P} f(x),   P = {x ∈ IR² | c1(x) ≥ 0},
where f and c1 are given in Example 4.1. The feasible region is the interior of the unit circle, and again the solution is x* = (1/√2)[1, 1]ᵀ.

The penalty term should reflect that all x for which c1(x) ≥ 0 are permissible, and we can use
   ϕ(x, σ) = f(x) + ½ σ (min{c1(x), 0})²,   σ ≥ 0.
In Figure 4.2 we see the contours of ϕ(x, σ) and their minimizers xσ for the same σ-values as in Example 4.1. All the xσ are infeasible and seem to converge to the solution.

[Figure 4.2: Contours and minimizer of ϕ(x, σ) for σ = 1, 10, 100. x* and xσ are marked by * and o, respectively.]

We still have the long narrow valleys and ill-conditioned problems when σ is large. With inequality constraints there is an extra difficulty with this penalty function: Inside the feasible region the functions f and ϕ have the same values and derivatives, while this is not the case in the infeasible region. On the border of P (where the solution is situated) there is a discontinuity in the second derivative of ϕ(x, σ), and this disturbs line searches and descent directions which are based on interpolation, thus adding to the problems caused by the narrow valley.
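A minimal computational sketch of Example 4.1 (the optimizer choice and starting point are ours): minimizing ϕ(x, σ) for increasing σ and watching xσ approach x* = (1/√2)[1, 1]ᵀ ≈ [0.7071, 0.7071]ᵀ.

   import numpy as np
   from scipy.optimize import minimize

   f = lambda x: -x[0] - x[1]
   c1 = lambda x: 1 - x[0]**2 - x[1]**2
   x = np.zeros(2)
   for sigma in [1.0, 10.0, 100.0]:
       phi = lambda x, s=sigma: f(x) + 0.5 * s * c1(x)**2   # the penalty function
       x = minimize(phi, x).x          # warm start from the previous x_sigma
       print(sigma, x)                 # x_sigma -> [0.7071, 0.7071] as sigma grows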
It is characteristic for penalty methods, as indicated in the examples, that (normally) all the iterates are infeasible with respect to (some of) the inequality constraints. Therefore they are also called exterior point methods.

In some cases the objective function is undefined in (part of) the infeasible region. Then the use of exterior point methods becomes impossible. This has led to the class of barrier methods that force all the iterates to be feasible. To contrast them with penalty function methods they are called interior point methods (IPM).

The most widely used IPMs are based on the logarithmic barrier function. We can illustrate it with a problem with one inequality constraint only,
   x⁺ = argmin_{x∈P} f(x),   P = {x ∈ IRⁿ | c1(x) ≥ 0}.
The corresponding barrier function is¹⁾
   ϕ(x, µ) = f(x) − µ log c1(x),
with the barrier parameter µ > 0. The logarithm is defined only for x strictly inside P (we confine ourselves to working with real numbers), and since log c1(x) → −∞ for c1(x) → 0, we see that ϕ(x, µ) → +∞ for x approaching the border of P. However, when µ → 0, the minimizer xµ of ϕ(x, µ) can approach a point at the border.

1) "log" is the natural (or Naperian) logarithm.

Methods based on barrier functions share some of the disadvantages of the penalty function methods: As we approach the solution, the intermediate results xµ are minimizers situated at the bottom of valleys that are narrow, ie xµ is the solution of an ill-conditioned (unconstrained) problem.

As indicated, barrier methods are useful in problems where infeasible x vectors must not occur, but apart from this they may also be efficient in large scale problems. In linear optimization a number of very efficient versions have been developed during the 1990s, see eg Chapter 3 in Nielsen (1999).

We end this introduction by returning to the penalty functions used in Examples 4.1 and 4.2 and taking a look at the curvatures of the penalty function near the solution x* and xσ, the unconstrained minimizer of ϕ(x, σ). Consider one inequality constraint as in Example 4.2, and assume that the constraint is strongly active at the solution: f′(x*) ≠ 0. This shows that
   ϕx′(x*, σ) ≠ 0,
independent of σ, while the unconstrained minimizer xσ satisfies
   ϕx′(xσ, σ) = 0.
When σ → ∞, xσ → x*, but the difference in the gradients of ϕ (at x* and xσ) remains constant, and thus the curvature of ϕ goes to infinity. This discrepancy is eliminated in the following method, which was first introduced by Powell (1969).

4.1. The Augmented Lagrangian Method

At first we consider the special case where only equality constraints are present:
   P = {x ∈ IRⁿ | c(x) = 0},
c being the vector function c : IRⁿ → IRʳ, whose ith component is the ith constraint function ci. At the end of this section we generalize the formulation to include inequality constraints as well.

We have the following Lagrangian function (Definition 2.3),
   L(x, λ) = f(x) − λᵀc(x),
and introduce a penalty term as indicated at the beginning of this chapter. Thus consider the following augmented Lagrangian function²⁾
   ϕ(x, λ, σ) = f(x) − λᵀc(x) + ½ σ c(x)ᵀc(x).   (4.1)

Notice that the discrepancy mentioned above has been relaxed: If λ = λ*, then the first order conditions in Corollary 2.6 and the fact that c(x*) = 0 imply that x* is a stationary point of ϕ:
   ϕx′(x*, λ*, σ) = 0.
Furthermore, Fletcher has shown the existence of a finite number σ̂ with the property that if σ > σ̂, then x* is an unconstrained local minimizer of ϕ(x, λ*, σ), ie if
   xλ,σ = argmin_{x∈IRⁿ} ϕ(x, λ, σ),   (4.2)
then³⁾
   xλ*,σ = x*  for all σ > σ̂.   (4.3)

2) Remember that λᵀc(x) = Σ_{i=1}^m λi ci(x) and c(x)ᵀc(x) = Σ_{i=1}^m (ci(x))².

3) In case of several local minimizers, "argmin_{x∈IRⁿ}" is interpreted as the local unconstrained minimizer in the valley around x*.

This means that the penalty parameter σ does not have to go to infinity. If σ is sufficiently large and if we insert λ* (the vector of Lagrangian multipliers at the solution x*), then the unconstrained minimizer of the augmented Lagrangian function solves the constrained problem. Thus the problem of finding x* has been reduced, or rather changed, to that of finding λ*.

We shall describe a method that uses the augmented Lagrangian function to find the solution. The idea is to use the penalty term to get close to the
43 4. P ENALTY AND SQO M ETHODS 4.1. Augmented Lagrangian Method 44

solution x∗ , and then let the Lagrangian term provide the final convergence From the current λ we seek a step η such that λ+η ' λ∗ . In order to get a
by letting λ approach λ∗ . A rough sketch of the algorithm is guideline on how to choose η we look at the Taylor expansion for ψ,
Choose initial values for λ, σ ψ(λ+η) = ψ(λ) + η> ψ 0 (λ) + 12 η> ψ 00 (λ) η + O(kηk3 )
repeat −1 >
Compute xλ,σ (4.4) = ψ(λ) − η> c − 12 η> Jc (ϕxx
00
) Jc η + O(kηk3 ) , (4.7)
Update λ and σ
where c = c(xλ ), Jc = Jc (xλ ) is the Jacobian matrix defined in (3.21c), and
until stopping criteria satisfied 00 00
ϕxx = ϕxx (xλ , λ, σfix ). A proof of these expressions for the first and sec-
The computation of xλ,σ (for fixed λ and σ) is an unconstrained optimiza- ond derivatives of ψ can be found in Fletcher (1993). This expansion shows
tion problem, which we deal with later. First, we concentrate on ideas for that
updating (λ, σ) in such a way that σ stays limited and λ→λ∗ . η = −α c(xλ ) , α>0
In the first iteration steps we keep λ constant (eg λ = 0) and let σ increase. is a step in the steepest ascent direction. Another way to get this, and at
This should lead us close to x∗ as described for penalty methods at the start the same time providing a value for α, goes as follows: The vector xλ is a
of this chapter. minimizer for ϕ. Therefore ϕ0x (xλ , λ, σfix ) = 0, implying that
Next, we would like to keep σ fixed, σ = σfix , and vary λ. Then f 0 (xλ ) − Jc (xλ )[λ − σfix c(xλ )] = 0 .
xλ = argminx∈IRn ϕ(x, λ, σfix ) Combining this with the KKT condition (Theorem 2.5),
and f 0 (x∗ ) − Jc (x∗ )λ∗ = 0 ,
ψ(λ) = ϕ(xλ , λ, σfix ) = minx∈IRn ϕ(x, λ, σfix ) and the assumption that xλ ' x∗ , we find
b. Since
are functions of λ alone. Assume σfix > σ
λ∗ ' λ − σfix c(xλ ) . (4.8)

1 ψ(λ) is the minimal value of ϕ,
The right-hand side can be used for updating λ. Fletcher (1993) shows that
2◦ the definition (4.1) combined with c(x∗ ) = 0 shows that under certain regularity assumptions (4.8) provides linear convergence4) .
ϕ(x∗ , λ, σ) = f (x∗ ) for any (λ, σ), Faster convergence is obtained by applying Newton’s method to the nonlin-
ear problem ψ 0 (λ) = 0,
3◦ (4.3) implies xλ∗ = x∗ , λ∗ ' λ + η , where ψ 00 (λ)η = −ψ 0 (λ) .
it follows that for any λ Notice, that this is equivalent to finding η as a stationary point for the
quadratic model obtained by dropping the error term O(kηk3 ) in (4.7). A
ψ(λ) ≤ ϕ(x∗ , λ, σfix ) = ϕ(x∗ , λ∗ , σfix ) = ψ(λ∗ ) . (4.5) formula for ψ 00 (λ) is also given in (4.7). Inserting this we obtain
Thus the Lagrangian multipliers at the solution is a local maximizer for ψ,
4) This means that in the limit we have kλ ∗ ∗
new − λ k ≤ κkλ − λ k, where
λ∗ = argmaxλ ψ(λ) . (4.6) λnew = λ − σfix c(xλ ) and 0 < κ < 1.
45 4. P ENALTY AND SQO M ETHODS 4.1. Augmented Lagrangian Method 46

2◦ K is meant to measure, how well the constraints are satisfied, and is


λ∗ ' λ − [ψ 00 (λ)]−1 ψ 0 (λ) used in the stopping criterion. A better measure (which can not be used
00 −1 > −1
= λ − [Jc (ϕxx ) Jc ] c(xλ ) . (4.9) as long as λ = 0) is to take K = maxi |λi ci (x)|.
If the last expression of (4.9) is used for updating λ then quadratic conver- x is the minimizer of an unconstrained optimization problem, to be
gence is obtained under certain regularity conditions, see Fletcher (1993).
solved eg by one of the iterative methods given in Frandsen et al (2004).
Notice that if a Quasi-Newton method is used in the unconstrained opti-
mization for finding xλ then an estimate of the inverse Hessian (ϕxx 00 −1
) is We assume that it can exploit “warm starts” (since after the first few
available. iteration steps the new x = xλ,σ will be close to the previous one).
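As a hedged sketch (not the booklet's code), the two updates might be written as follows; H stands for ϕ″xx at xλ, or for the Quasi-Newton approximation of it just mentioned.

    import numpy as np

    # Multiplier updates: steepest ascent (4.8) and Newton (4.9).
    # c_x = c(x_lambda), J = Jc(x_lambda), H ~ phi''_xx(x_lambda).

    def update_steepest_ascent(lam, sigma_fix, c_x):
        # (4.8): lambda* ~ lambda - sigma_fix * c(x_lambda)
        return lam - sigma_fix * c_x

    def update_newton(lam, c_x, J, H):
        # (4.9): lambda* ~ lambda - [Jc H^-1 Jc^T]^-1 c(x_lambda)
        JHinvJT = J @ np.linalg.solve(H, J.T)
        return lam - np.linalg.solve(JHinvJT, c_x)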
Now we can present a specific example of an implementation of the algorithm outlined in (4.4). The details could of course be chosen in many other ways.

Algorithm 4.10. Augmented Lagrangian method (Equality constraints only).

    begin
        k := 0;  x := x0;  λ := λ0;  σ := σ0              {1◦}
        Kprev := ‖c(x)‖∞                                  {2◦}
        repeat
            k := k+1
            x := argmin_x ϕ(x, λ, σ);  K := ‖c(x)‖∞       {2◦}
            if K ≤ ¼ Kprev
                λ := Update(x, λ, σ)                      {3◦}
                Kprev := K
            else
                σ := 10∗σ                                 {4◦}
        until K < ε or k > kmax
    end

We have the following remarks:

1◦ As mentioned earlier it is natural to start with the pure penalty method, ie we let λ0 = 0. σ0 must be a positive number, one might eg start with σ0 = 1. x0 is an initial estimate of the solution provided by the user.

2◦ K is meant to measure how well the constraints are satisfied, and is used in the stopping criterion. A better measure (which cannot be used as long as λ = 0) is to take K = maxᵢ |λi ci(x)|. x is the minimizer of an unconstrained optimization problem, to be solved eg by one of the iterative methods given in Frandsen et al (2004). We assume that it can exploit "warm starts" (since after the first few iteration steps the new x = xλ,σ will be close to the previous one).

3◦ If K was reduced by 75%, then λ is updated by means of (4.8) or (4.9). Otherwise . . .

4◦ . . . we assume that x is too far from x∗ and increase the penalty factor σ.
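Below is a rough Python model of Algorithm 4.10, reusing phi, grad_phi, c and update_steepest_ascent from the sketches above and delegating the inner problem to SciPy's BFGS; the warm start comes for free since each solve begins at the previous x. The tolerances are user choices.

    import numpy as np
    from scipy.optimize import minimize

    # Sketch of Algorithm 4.10 (equality constraints only).

    def aug_lagrangian(x0, lam0, sigma0=1.0, eps=1e-8, kmax=100):
        x = np.asarray(x0, float)
        lam = np.asarray(lam0, float)
        sigma = sigma0
        K_prev = np.linalg.norm(c(x), np.inf)
        for k in range(kmax):
            res = minimize(phi, x, args=(lam, sigma), jac=grad_phi,
                           method='BFGS')      # warm start from previous x
            x = res.x
            K = np.linalg.norm(c(x), np.inf)
            if K < eps:
                break
            if K <= 0.25 * K_prev:             # reduced by 75%: update lambda
                lam = update_steepest_ascent(lam, sigma, c(x))
                K_prev = K
            else:                              # too far from x*: raise penalty
                sigma *= 10.0
        return x, lam

Called with x0 = [1, 1], λ0 = [0] and σ0 = 2, this mirrors the run in Example 4.3 below; note, though, that ϕ(x, 0, 20) has two minimizers, so which one the inner solver finds depends on where it happens to be started.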

Example 4.3. We illustrate Algorithm 4.10 with the following simple problem, with n = 2 and r = m = 1:

    minimize  f(x) = x1² + x2²
    with the constraint  c1(x) = 0 ,    c1(x) = x1² − x2 − 1 .

For hand calculation the following expressions are useful:

    f′(x) = [2x1, 2x2]ᵀ ,    Jc(x) = [2x1  −1] ,

    ϕ(x, λ, σ) = (x1² + x2²) − λ·(x1² − x2 − 1) + ½σ·(x1² − x2 − 1)² ,

    ϕ′x(x, λ, σ) = [ 2x1(1 − λ + σ(x1² − x2 − 1))
                     2x2 + λ − σ(x1² − x2 − 1) ] ,

    ϕ″xx(x, λ, σ) = [ 2(1 − λ − σ(x2 + 1 − 3x1²))    −2σx1
                      −2σx1                          2 + σ ] .

We shall follow the iterations from the starting point x0 = [1, 1]ᵀ, λ0 = 0, σ0 = 2. We find Kprev = |c1(x0)| = 1.

First step: The augmented Lagrangian function is

    ϕ(x, 0, 2) = (x1² + x2²) − 0·(x1² − x2 − 1) + 1·(x1² − x2 − 1)² ,

whose contours are shown below together with the minimizer, x = [0, −0.5]ᵀ.

    [Figure 4.3: Contours and minimizer of ϕ(x, 0, 2).
    The constraint c1(x) = 0 is dashed.]

We get K = |0² + 0.5 − 1| = 0.5. Thus K was reduced by less than 75%, and therefore we enter the else branch of Algorithm 4.10. The values for the next iteration step are λ = 0, σ = 20.

Second step: The augmented Lagrangian function is

    ϕ(x, 0, 20) = (x1² + x2²) − 0·(x1² − x2 − 1) + 10·(x1² − x2 − 1)² .

There are two minimizers, and we assume that we find the minimizer with positive x1: x = [√0.45, −0.50]ᵀ, c1(x) = −0.05, thus K = 0.05. This makes us enter the if branch: we will update the Lagrange factor. The steepest ascent method gives

    λ := 0 − 20·(−0.05) = 1 ,

and this is also the result from the Newton method. The details are left as an exercise.

Third step:

    ϕ(x, 1, 20) = (x1² + x2²) − 1·(x1² − x2 − 1) + 10·(x1² − x2 − 1)² .

The minimizer is x = [√0.5, −0.50]ᵀ ≈ [0.70711, −0.50]ᵀ with K = 0, so the algorithm stops. It has found the exact solution of the problem, x∗ = x, and the corresponding Lagrangian multiplier is λ∗ = 1, ie it is equal to the current λ-value. This exemplifies the comments on (4.3).

Below we give the contours of the augmented Lagrangian functions for steps two and three.

    [Figure 4.4: Contours and minimizers of ϕ(x, 0, 20) and ϕ(x, 1, 20), respectively.
    The constraint c1(x) = 0 is dashed.]

We now turn to the general case, where we have equality as well as inequality constraints,

    P = {x ∈ IRⁿ | ci(x) = 0, i = 1, . . . , r ;
                   ci(x) ≥ 0, i = r+1, . . . , m} .

4.1.1. An easy solution. A straightforward way to solve this problem would be to use the method just described: Let a penalty method bring us to the neighbourhood of a solution, and then simply consider the active or near active constraints as equality constraints. Discard the rest of the constraints (still keeping an eye on them, though, to observe whether they remain inactive), and use one of the two methods for updating the vector of Lagrange multipliers λ.

The augmented Lagrangian function could be the following:

    ϕ̃(x, λ, σ) = f(x) − λᵀd(x) + ½σ d(x)ᵀd(x) ,

where d(x) is defined as follows:

    di(x) = ci(x)  if i ∈ Aδ(x) ,
            0      otherwise.
Here, we have defined the approximate active set Aδ(x) by

    Aδ(x) = {1, . . . , r} ∪ { i | i > r and ci(x) ≤ δ } ,    (4.11)

where δ is a small positive number. Initially we could keep λ = 0 and increase σ until the approximate active set seems to have stabilized (eg by being constant for two consecutive iterations). As long as Aδ(x) remains constant we update λ using (4.8) or (4.9) (discarding inactive constraints and assuming that the active inequality constraints are numbered first). Otherwise λ is set to 0 and σ is increased.

The algorithm might be outlined as follows:

Algorithm 4.12. Augmented Lagrangian method (General problem, easy solution).

    begin
        λ := 0;  σ := σ0
        repeat
            x := argmin_x ϕ̃(x, λ, σ)
            if stable active set Aδ(x)
                λ := Update(x, λ, σ)
            else
                λ := 0;  σ := 10∗σ
        until STOP
    end

Many alternatives for defining the active set could be considered. It might, eg, depend on the values of |ci(x)|, i = 1, . . . , m. One disadvantage about this type of definition is that a threshold value, like δ, must be provided by the user. This might be avoided by a technique like the one in Algorithm 4.10 (and the following Algorithm 4.20).
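A small sketch of (4.11) and the corresponding modified constraint vector; indices are 0-based in the code, and r, δ are user-supplied problem data.

    import numpy as np

    # Approximate active set (4.11) and the "easy solution" d(x):
    # equality constraints are always active; inequality constraint i
    # is (approximately) active when c_i(x) <= delta.

    def active_set(c_x, r, delta):
        return [i for i in range(len(c_x)) if i < r or c_x[i] <= delta]

    def d_easy(c_x, r, delta):
        A = set(active_set(c_x, r, delta))
        return np.array([c_x[i] if i in A else 0.0
                         for i in range(len(c_x))])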
4.1.2. A better solution. We change the inequality constraints (i = r+1, . . . , m) into equality constraints by introducing so-called slack variables zi:

    cr+i(x) ≥ 0  ⇔  cr+i(x) − zi = 0 ,  zi ≥ 0 ,    i = 1, . . . , m−r .    (4.13)

Notice that we have extended the number of variables, and still have inequality constraints. These are simple, however, and – as we shall see – the slack variables can be eliminated.

Consider the augmented Lagrangian function corresponding to the m equality constraints,

    ϕ(x, z, λ, σ) = f(x) − Σ_{i=1}^r λi ci(x) + ½σ Σ_{i=1}^r ci(x)²
                    − Σ_{i=r+1}^m λi (ci(x) − z_{i−r})
                    + ½σ Σ_{i=r+1}^m (ci(x) − z_{i−r})² .    (4.14)

For fixed λ and σ we wish to find xλ,σ and zλ,σ that minimize ϕ under the constraint zλ,σ ≥ 0. xλ,σ minimizes the original problem provided that σ is sufficiently large and λ is the vector of Lagrange multipliers at the solution. At the minimizer (xλ,σ, zλ,σ) either zi = 0 (the constraint zi ≥ 0 is active) or ∂ϕ/∂zi = 0. Now, from (4.14) we see that

    ∂ϕ/∂z_{i−r} = λi − σ(ci(x) − z_{i−r}) ,

and equating this with zero we get z_{i−r} = ci(x) − λi/σ. Thus the relevant values for the slack variables are

    z_{i−r} = max{0, ci(x) − λi/σ} ,    i = r+1, . . . , m .

Inserting this in (4.14) will make z disappear, and we obtain

    ϕ(x, λ, σ) = f(x) − λᵀd(x) + ½σ d(x)ᵀd(x) ,    (4.15a)

where d(x) holds the modified equality constraint functions given by

    di(x) = ci(x)  if i ≤ r or ci(x) ≤ λi/σ ,
            λi/σ   otherwise.    (4.15b)

Thus the augmented Lagrangian function for the generally constrained problem is very similar to (4.1).
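The elimination result (4.15b) is a one-liner per constraint; a hedged sketch, again with 0-based indices and r the number of equality constraints:

    import numpy as np

    # Modified constraint functions (4.15b): equality constraints
    # (i < r) are kept, while an inequality constraint is replaced by
    # lam_i/sigma once it is inactive, ie c_i(x) > lam_i/sigma.

    def d_general(c_x, lam, sigma, r):
        d = np.array(c_x, dtype=float)
        for i in range(r, len(c_x)):
            if c_x[i] > lam[i] / sigma:        # second line of (4.15b)
                d[i] = lam[i] / sigma
        return d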
Letting

    ψ(λ) = min_{x∈IRⁿ} ϕ(x, λ, σfix) = ϕ(xλ, λ, σfix) ,    (4.16)

the inequality ψ(λ) ≤ ψ(λ∗) corresponding to (4.5) can easily be shown valid. Thus λ∗ maximizes ψ, so ψ′(λ∗) = 0.

The steepest ascent iteration, corresponding to (4.8), is

    λsa = λ − σfix d(xλ) .    (4.17)

The Newton iteration for solving ψ′(λ∗) = 0, corresponding to (4.9), is

    λnew = λ + η ,    where ψ″(λ)η = −ψ′(λ) .    (4.18)

Here the first and second order derivatives of ψ are (see Fletcher (1993))

    ψ′(λ) = −d(xλ) ,
    ψ″(λ) = − [ G   0
                0   (1/σ)I ]    with G = J̃c (ϕ″xx)⁻¹ J̃cᵀ .    (4.19)

In J̃c we only consider the active constraints (first line in (4.15b)), and we assume that these are numbered first. Thus G is an s×s matrix (where s is the number of active constraints), and I is the unit matrix in IR^(m−s).

Notice that if constraint number i is inactive at (x, λ) (last line in (4.15b)), then the value of ηi in (4.18) is −λi. Thus the ith component of λnew will be 0, which is consistent with remark 3◦ on Theorem 2.5.

The algorithm is given below. Essentially it is identical with Algorithm 4.10.

Algorithm 4.20. Augmented Lagrangian method (General case).

    begin
        k := 0;  x := x0;  λ := λ0;  σ := σ0              {1◦}
        Kprev := ‖d(x)‖∞                                  {2◦}
        repeat
            k := k+1
            x := argmin_x ϕ(x, λ, σ);  K := ‖d(x)‖∞       {2◦}
            if K ≤ ¼ Kprev
                λ := Update(x, λ, σ)                      {3◦}
                Kprev := max(K, Kprev)
            else
                σ := 10∗σ
        until K < ε or k > kmax
    end

We have the following remarks:

1◦ As remark 1◦ to Algorithm 4.10.

2◦ As remark 2◦ to Algorithm 4.10, except for K: For active constraints |di(x)| is the deviation of ci(x) from zero. For an inactive constraint, |di(x)| = |λi/σ|, which becomes 0 when λ is updated. If this constraint is also inactive at the solution, then λ∗i = 0, see remark 3◦ on Theorem 2.5; thus also in this case the value |di| is relevant for the stopping criterion.

3◦ The updating of λ can be made by the steepest ascent formula (4.17), which is efficient initially, or by Newton's method (4.18), which provides quadratic final convergence (under the usual regularity conditions). If a Quasi-Newton method is used to find x, then an approximate ϕ″ (or (ϕ″)⁻¹) is available and can be used in (4.19). In this case we do not obtain quadratic but superlinear convergence, which is almost as good.

Algorithm 4.20 has proved to be robust and quite efficient in practice. Typically the solution is found after 3 – 10 runs through the repeat loop. In Example 4.6 we report results of some test runs with the algorithm.
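Since (4.15a) has exactly the shape of (4.1), the driver sketched for Algorithm 4.10 carries over nearly unchanged; under the same assumptions as earlier, only the constraint measure and the function to be minimized change:

    import numpy as np

    # Hedged sketch: with d_general replacing c, the loop given for
    # Algorithm 4.10 becomes a rough model of Algorithm 4.20.

    def K_measure(x, lam, sigma, r):
        return np.linalg.norm(d_general(c(x), lam, sigma, r), np.inf)

    def phi_general(x, lam, sigma, r):          # (4.15a)
        d = d_general(c(x), lam, sigma, r)
        return f(x) - lam @ d + 0.5 * sigma * (d @ d)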
4.2. The Lagrange-Newton Method

In Section 3.4 we formulated the problem of finding a local constrained minimizer of f as a problem of finding a stationary point of the associated Lagrangian function. Applying Newton's method to this, we saw that each step was equivalent to a quadratic optimization problem (QO). The next Newton step gives rise to a new QO, and therefore the names "Lagrange-Newton method" and "sequential quadratic optimization" are more or less synonymous. The shorter name "SQP" is also used because a quadratic optimization problem is called a "quadratic program" in older literature.

The ideas date back to the 1960s, but the first efficient implementations were developed by Han (1976) and Powell (1977). Currently it is considered the most efficient method (except for problems with extremely simple function evaluations, as all the problems in the examples of this booklet).

We now complete the description of the method from Section 3.4 and introduce features that improve the global performance of the method. This includes soft line search with a special type of penalty function. We conclude the description with an update method for the Hessian matrix. This actually makes the method a Quasi-Newton method (see Chapter 5 in Frandsen et al (2004)) with good final (superlinear) convergence, without having to implement second derivatives, which would be needed with a true Newton's method (giving quadratic final convergence).

Summarizing (and slightly modifying) the description from Section 3.4, we can state the SQP method in algorithmic form:

    Choose x0;  x := x0
    repeat
        h := argmin_{δ∈P̃} q(δ)
        Find step parameter α
        x := x + αh
    until stopping criteria satisfied    (4.21)

Here q is a quadratic model of the cost function in the neighbourhood of x,

    f(x+δ) ≈ q(δ) = f(x) + δᵀf′(x) + ½δᵀW(x)δ ,    (4.22a)

and P̃ is the feasible region

    P̃ = {δ ∈ IRⁿ | di(δ) = 0, i = 1, . . . , r ;
                    di(δ) ≥ 0, i = r+1, . . . , m} ,    (4.22b)

corresponding to a linear model of the constraints,

    c(x+δ) ≈ d(δ) = c(x) + Jc(x)δ ,    (4.22c)

where (Jc)ij = ∂ci/∂xj, cf (3.21d).

We shall discuss the choice of the step parameter α in (4.21) and of the matrix W(x) in (4.22a). First, however, let us consider the consequences of the linearization (4.22c) of the constraint functions.
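For the equality-constrained case, the QO in (4.21) can be solved directly from its KKT system. The sketch below is an assumption-laden stand-in for the booklet's Algorithm 3.10, which is not reproduced here and which also handles inequality constraints:

    import numpy as np

    # One Lagrange-Newton (SQP) step for equality constraints:
    # minimize the quadratic model (4.22a) subject to the linearized
    # constraints (4.22c) by solving
    #     [ W  -Jc^T ] [h]   [-f'(x)]
    #     [ Jc   0   ] [l] = [-c(x) ]

    def sqp_step(grad_f_x, W, c_x, J):
        n, r = W.shape[0], len(c_x)
        KKT = np.block([[W, -J.T],
                        [J, np.zeros((r, r))]])
        rhs = np.concatenate([-grad_f_x, -c_x])
        sol = np.linalg.solve(KKT, rhs)
        return sol[:n], sol[n:]            # step h, multipliers lambda

As a check: with W = I, f′(x) = [2, 2]ᵀ, c(x) = [−1] and Jc(x) = [2 −1], the data of the first model problem in Example 4.5 below, this yields h = [−0.8, −2.6]ᵀ and λ = 0.6.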
Example 4.4. We consider a problem in IR² with one inequality constraint only, c1(x) ≥ 0. Figure 4.5 shows the border curve for the feasible region.

    [Figure 4.5: Border curve of the feasible region, c1(x) = 0.
    The infeasible side is hatched.]

We want to study the variation of the function c1(x) around this border curve. Figure 4.6a shows the surface y = c1(x), and in Figure 4.6b we have added the tangent plane to this constraint surface at a point (x, c1(x)), where c1(x) > 0. The tangent plane corresponds to the linear approximation

    c1(x+h) ≈ κ(x+h) = c1(x) + hᵀc1′(x) .

    [Figure 4.6: Variation of the constraint function c1(x) near the border curve
    c1(x) = 0 and the tangent plane at the point (x, c1(x)), marked by a circle.]

We assume that c1 is strictly concave (so that the feasible region is convex, cf Theorem 1.18). Then the tangent plane at any point is above the constraint surface (except at the point of osculation), and a consequence is that the line κ(x+h) = 0 is completely outside the feasible region. This is illustrated in Figure 4.6b for x ∈ P, and it is easily seen that also if x is infeasible (c1(x) < 0), then the line κ(x+h) = 0 is completely outside P.

The properties described above are valid in general, except for cases where the concave function has a local maximizer between x and the border curve.

4.2.1. Choice of step length α. The solution h to the quadratic optimization problem in (4.21) satisfies all the linearized constraints, and, as shown in the previous example, this may cause x+h to be infeasible with respect to the true constraints. Also, if h comes out too large, then the quadratic model may be a poor approximation to the true variation of the cost function. Therefore we make a line search similar to the soft line search described in Section 2.5 of Frandsen et al (2004). The function considered in the line search is a so-called "exact penalty function"⁵)

    π(y, µ) = f(y) + Σ_{i=1}^r µi |ci(y)| + Σ_{i=r+1}^m µi |min{0, ci(y)}| ,
    with µi ≥ |λi| .    (4.23)

⁵) This penalty function is exact in the sense that the solution x∗ of our problem minimizes π(y, µ) for any µ with µ ≥ 0.

The penalty factors are chosen as µ = |λ| in the first iteration step, while some conservatism is recommended in later steps,

    µi := max{|λi|, ½(µi + |λi|)} .    (4.24)

This is specially important for constraints that are active in one iteration step and inactive in the next.

Powell has shown that the function

    π(α) = π(x+αh, µ)    for α ≥ 0, h and µ fixed ,

has π′(0) < 0, so that a line search can lead to a point x+αh which is "better" in terms of this measure. Also, even for moderate penalty, the minimizer is exactly feasible.

The disadvantage of an exact penalty function is that it is not differentiable; π(α) has kinks at the points where a constraint ci(x+αh) passes the value of zero; see Figure 4.7 below.

We also need a piecewise linear approximation to π(α),

    π(α) ≈ ψ(α) = f(x) + αhᵀf′(x)
                  + Σ_{i=1}^r µi |ci(x) + αhᵀci′(x)|
                  + Σ_{i=r+1}^m µi |min{0, ci(x) + αhᵀci′(x)}| .

An example of π(α) and ψ(α) is shown in Figure 4.7. Notice that ψ(α) is convex, and it also has kinks, situated differently from the kinks of π.

    [Figure 4.7: Line search function π(α) and its linear approximation ψ(α).
    The dashed line is y = π(0) + 0.1∆α.]

Similar to a "normal" soft line search, we accept a value of α such that
the point (α, π(α)) is below the dashed line indicated in Figure 4.7. The slope of this line is 10% of the slope of the chord between (0, ψ(0)) and (1, ψ(1)), ie

    ∆ = ψ(1) − ψ(0)
      = hᵀf′(x) − Σ_{i=1}^r µi |ci(x)| − Σ_{i=r+1}^m µi |min{0, ci(x)}| .    (4.25)

In this expression we have used the fact that ψ(1) = f(x) + hᵀf′(x), since the other terms are zero for h ∈ P̃. Note that h is downhill for f, and therefore ∆ is guaranteed to be negative.

In each step of the line search algorithm we use a second order polynomial P(t) to approximate π(t) on the interval [0, α]. The coefficients are determined so that P(0) = π(0), P′(0) = ∆, P(α) = π(α):

    P(t) = π(0) + ∆t + (π(α) − π(0) − ∆α) t²/α² .

If the coefficient to t² is positive, then this polynomial has a minimizer β, determined by P′(β) = 0, or

    β = −∆α² / ( 2(π(α) − π(0) − ∆α) ) .    (4.26)

Now we can formulate the line search algorithm:

Algorithm 4.27. Penalty Line Search.

    begin
        α := 1;  Compute ∆ by (4.25)
        while π(α) ≥ π(0) + 0.1∆α
            Compute β by (4.26)
            α := min{0.9α, max{β, 0.1α}}
    end

The expression for the new α ensures that the algorithm does not get stuck at the current value and, on the other hand, does not go to zero too fast. The algorithm has been validated by experience.
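In code, Algorithm 4.27 might look as follows; pi_fun and Delta are assumed given, and the iteration cap max_steps is a safeguard not present in the booklet's version:

    # Sketch of Algorithm 4.27. pi_fun(alpha) evaluates the exact
    # penalty function (4.23) along the search direction, ie
    # pi(x + alpha*h, mu); Delta is the slope (4.25).

    def penalty_line_search(pi_fun, Delta, max_steps=20):
        alpha = 1.0
        pi0 = pi_fun(0.0)
        for _ in range(max_steps):
            pi_a = pi_fun(alpha)
            if pi_a < pi0 + 0.1 * Delta * alpha:   # acceptance condition
                break
            denom = 2.0 * (pi_a - pi0 - Delta * alpha)
            # minimizer (4.26) of the parabola, if the parabola is convex
            beta = -Delta * alpha**2 / denom if denom > 0 else 0.1 * alpha
            alpha = min(0.9 * alpha, max(beta, 0.1 * alpha))
        return alpha

With the numbers from Example 4.5 below (π(0) = 2.6, ∆ = −7.4, π(1) = 2.984) the first parabola gives β ≈ 0.4753, and this value is accepted in the next test.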
4.2.2. Choice of W in (4.22). By comparison with the Taylor expansion (1.6), an obvious choice is W(x) = f″(x). However, the goal is to find a minimizer for the Lagrangian function L(x, λ), and the description in Section 3.4 shows that a more appropriate choice is

    W(x) = L″xx(x, λ) = f″(x) − Σ λi ci″(x) .

We know from Theorem 2.11 that at the solution (x∗, λ∗) the Hessian matrix satisfies hᵀL″xx(x∗, λ∗)h ≥ 0 for all feasible directions. This does not imply that L″xx(x∗, λ∗) is positive definite, but it contributes to the theoretical motivation for the following strategy, which has proven successful: Start with a positive definite W(x0), eg W(x0) = I. In each iteration step update W so that it stays positive definite, thus giving a well-defined descent direction. The use of an updating strategy has the further benefit that we do not have to supply second derivatives of the cost function f and the constraint functions {ci}.

A good updating strategy is the BFGS method discussed in Section 5.10 of Frandsen et al (2004). Given the current W = W(x) and the next iterate xnew = x+αh, the change in the gradient of Lagrange's function (with respect to x) is

    y = L′x(xnew, λ) − L′x(x, λ)
      = f′(xnew) − f′(x) − (Jc(xnew) − Jc(x))ᵀλ .    (4.28a)

We check the so-called curvature condition,

    yᵀ(xnew − x) > 0 .    (4.28b)

If this is satisfied, then W is "positive definite with respect to the step direction h", and so is Wnew found by the BFGS formula,

    Wnew = W + (1/(αhᵀy)) yyᵀ − (1/(hᵀu)) uuᵀ ,    where u = Wh .    (4.28c)

If the curvature condition is not satisfied, then we let Wnew = W.
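A direct transcription of (4.28), hedged as usual:

    import numpy as np

    # BFGS update (4.28): y is the change in the Lagrangian gradient,
    # and W is left unchanged if the curvature condition fails.

    def bfgs_update(W, h, alpha, y):
        s = alpha * h                          # the step x_new - x
        if y @ s <= 0:                         # curvature condition (4.28b)
            return W
        u = W @ h
        return (W + np.outer(y, y) / (alpha * (h @ y))
                  - np.outer(u, u) / (h @ u))  # formula (4.28c)

With W = I and the α, h, y from the first step of Example 4.5 below, this reproduces Wnew ≈ [0.943 −0.044; −0.044 2.014].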
4.2.3. Stopping criterion. We use the following measure for the goodness of the approximate solution obtained as x = xprev + αh,

    η(x, λ) = |q(αh) − f(x)| + Σ_{i∈B} λi |ci(x)| + Σ_{i∈J} |min{0, ci(x)}| .    (4.29)

As in Chapter 3, B is the set of equality and active inequality constraints, and J is the set of inactive inequality constraints. The first term measures the quality of the approximating quadratic (4.22a), and the other terms measure how well the constraints are satisfied.

4.2.4. Summary. The algorithm can be summarized as follows. The parameters ε and kmax must be set by the user.

Algorithm 4.30. Lagrange–Newton Method.

    begin
        x := x0;  W := W0;  µ := 0;  k := 0
        repeat
            k := k+1
            Find (h, λ) by Algorithm 3.10
            Update µ by (4.24)
            Find α by Algorithm 4.27
            xprev := x;  x := x + αh
            Update W by (4.28)
        until η(x, λ) < ε or k > kmax
    end
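Gathering the pieces, a skeleton of Algorithm 4.30 for the equality-constrained case could look like this; sqp_step, penalty_line_search, bfgs_update and the problem functions are the sketches given earlier, and using |λi| in the η-term is a small safeguard of ours relative to (4.29):

    import numpy as np

    # Skeleton of Algorithm 4.30 (equality constraints only), so
    # sqp_step can stand in for Algorithm 3.10.

    def lagrange_newton(x0, eps=1e-8, kmax=50):
        x = np.asarray(x0, float)
        W = np.eye(len(x))                      # W0 = I, positive definite
        mu = np.zeros(len(c(x)))
        for k in range(kmax):
            g, cx, J = grad_f(x), c(x), jac_c(x)
            h, lam = sqp_step(g, W, cx, J)      # stand-in for Algorithm 3.10
            mu = np.maximum(np.abs(lam), 0.5 * (mu + np.abs(lam)))  # (4.24)

            def pi_fun(a):                      # exact penalty (4.23) along h
                xa = x + a * h
                return f(xa) + mu @ np.abs(c(xa))

            Delta = h @ g - mu @ np.abs(cx)     # (4.25), equalities only
            alpha = penalty_line_search(pi_fun, Delta)
            x_new = x + alpha * h
            q_val = f(x) + alpha * (h @ g) + 0.5 * alpha**2 * (h @ W @ h)
            y = grad_f(x_new) - g - (jac_c(x_new) - J).T @ lam      # (4.28a)
            W = bfgs_update(W, h, alpha, y)
            x = x_new
            eta = abs(q_val - f(x)) + np.abs(lam) @ np.abs(c(x))    # (4.29)
            if eta < eps:
                break
        return x, lam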
Example 4.5. We shall use the algorithm on the same problem as in Example 4.3,

    minimize  f(x) = x1² + x2²
    with the constraint  c1(x) = 0 ,    c1(x) = x1² − x2 − 1 ,

and with the same starting point, x0 = [1, 1]ᵀ. Further, we choose W0 = I. We shall need the following expressions:

    f′(x) = [2x1, 2x2]ᵀ ,    Jc(x) = [2x1  −1] ,
    q(δ) = f(x) + 2(x1δ1 + x2δ2) + ½(w11δ1² + 2w12δ1δ2 + w22δ2²) ,
    d1(δ) = c1(x) + 2x1δ1 − δ2 .

The first model problem is

    minimize  q(δ) = 2 + 2δ1 + 2δ2 + 0.5δ1² + 0.5δ2²
    subject to  d1(δ) = −1 + 2δ1 − δ2 = 0 .

This was discussed in Example 3.3, where we found the minimizer h = [−0.8, −2.6]ᵀ. The corresponding Lagrange multiplier is λ = 0.6, and this is also used as the first value for the penalty parameter µ. Figure 4.8 shows

    π(α) = (1−.8α)² + (1−2.6α)² + .6 |(1−.8α)² − (1−2.6α) − 1| .

    [Figure 4.8: Penalty function π(α) and linear approximation ψ(α).]

The linear approximation is

    ψ(α) = 2 + α [−.8  −2.6][2; 2] + .6 |−1 + α [−.8  −2.6][2; −1]|
         = 2 − 6.8α + .6 |−1 + α| = 2.6 − 7.4α    for 0 ≤ α ≤ 1 .

We see that ∆ = −7.4 and

    π(1) = 2.984 > 2.6 + .1∆·1 = 1.86 .

We need to reduce α, and use (4.26) to find⁶)

    β = 7.4 / ( 2(2.984 − 2.6 + 7.4) ) = 0.475334 = α ,
    π(α) = 0.667740 < 2.6 + .1∆α ≈ 2.248 .

Thus the line search is finished, and the next iterate is

    xnew = x + αh = [0.619733, −0.235868]ᵀ .

⁶) The computation in this example was performed with machine accuracy εM = 5·10⁻¹⁴, but results are shown with at most 6 digits.

To update W we use (4.28) and find

    y = [−0.304214, −2.471737]ᵀ ,    yᵀh ≈ 6.67 > 0 ,
    Wnew ≈ [ 0.943   −0.044
             −0.044   2.014 ] .

The error estimate computed by (4.29) is η(x, λ) = 0.23, and the true error⁷) is ‖xnew − x∗‖∞ = 0.26.

⁷) According to Example 4.3, x∗ = [√0.5, −0.5]ᵀ.

The second model problem is

    minimize  q(δ) = 0.440 + 1.239δ1 − 0.472δ2 + 0.471δ1² − 0.044δ1δ2 + 1.007δ2²
    subject to  d1(δ) = −0.380 + 1.239δ1 − δ2 = 0 .

According to Example 3.3 the solution is

    (h, λ) = ( [0.070550, −0.292618]ᵀ , 1.064025 ) ,

and the penalty function with µ = λ shows that α = 1. We get

    xnew = [0.690283, −0.528487]ᵀ ,
    η(xnew, λ) ≈ 0.094 ,    ‖xnew − x∗‖∞ ≈ 0.028 ,
    Wnew ≈ [ 0.908   0.250
             0.250   2.060 ] .

The results from the next iteration steps are

    k    xkᵀ                        λk          η(xk, λk)    ‖xk − x∗‖∞
    3    [0.701733, −0.507702]     1.011291    7.0·10⁻⁴     7.7·10⁻⁴
    4    [0.707111, −0.500023]     1.000498    5.9·10⁻⁵     2.3·10⁻⁵
    5    [0.707107, −0.500000]     0.999990    8.2·10⁻¹⁰    4.6·10⁻⁸

If we use ε = 10⁻⁸ in Algorithm 4.30, we are finished. The errors {‖xk − x∗‖∞} exhibit superlinear convergence.

Example 4.6. In the table below we give some test results from Powell (1977). The size of each problem is given in the first two columns. The next column gives the number of elements in B(x∗), and the number in parenthesis tells how many of these are linear. The last two columns give the number of iterations and the number of function calls needed to solve the problem to a desired accuracy of 10⁻⁵. For comparison we also give (in parenthesis) the corresponding numbers for the augmented Lagrangian algorithm 4.20.

    n     m     #B(x∗)     Iterations    Fct. calls
    3     7     1 (1)      5 (4)         7 (30)
    5     3     3 (0)      6 (5)         7 (37)
    5     15    4 (4)      4 (4)         6 (39)
    5     16    5 (3)      2 (5)         3 (64)
    15    20    11 (3)     16 (3)        17 (149)

Each function call involves one evaluation of f(x) and f′(x).

For these examples the Lagrange–Newton method is clearly superior when the number of function evaluations is used as a measure. However, the work load per function evaluation may be much higher for the Lagrange–Newton method, since it involves many QP problems. This is especially important when the number of variables and/or constraints is high.

In conclusion, we recommend the Lagrange-Newton method (SQP) when function evaluations are expensive, and the Augmented Lagrangian method when function evaluations are cheap and we have many variables and constraints.
APPENDIX

A. Karush–Kuhn–Tucker Theorem

We shall prove property 2◦ in Theorem 2.5. Without loss of generality we assume that the active inequality constraints are numbered first:

    Equality constraints:            ci(x) = 0 ,  i = 1, . . . , r
    Active inequality constraints:   ci(x) = 0 ,  i = r+1, . . . , p
    Inactive constraints:            ci(x) > 0 ,  i = p+1, . . . , m .

The comments on 3◦ in the theorem and the definition of the Lagrange function imply that 1◦ has the form

    f′(x) = Σ_{i=1}^p λi ai    with ai = ci′(x) .    (A.1)

We shall prove that if one (or more) of the {λi}_{i=r+1}^p is negative, then x is not a local, constrained minimizer:

For the sake of simplicity, assume that the gradients {ai}_{i=1}^p are linearly independent and that λp < 0. Then we can decompose ap into v, its orthogonal projection on the subspace spanned by {ai}_{i=1}^{p−1}, and h, which is orthogonal to this subspace,

    ap = v + h    with aiᵀh = 0 for i = 1, . . . , p−1 .

For small values of ‖αh‖ we use the Taylor series (1.7) for the constraint functions to see that

    ci(x+αh) ≈ ci(x) + αhᵀai = 0      for i = 1, . . . , p−1 ,
                             = αhᵀh   for i = p .

This shows that for α > 0 and sufficiently small, x+αh is feasible. Further, from the Taylor series (1.6) for the cost function and (A.1) we get

    f(x+αh) ≈ f(x) + αhᵀf′(x)
            = f(x) + αhᵀ( Σ_{i=1}^p λi ai )
            = f(x) + αλp hᵀh ,

showing that f(x+αh) < f(x) for α > 0, since λp < 0.

Thus we have shown that at a local, constrained minimizer all the Lagrange multipliers for inequality constraints are nonnegative.
REFERENCES

1. C.G. Broyden (1967): Quasi-Newton Methods and their Application to Function Minimization, Math. Comp. 21, 368–381.

2. O. Caprani, K. Madsen and H.B. Nielsen (2002): Interval Analysis, IMM, DTU. Available as http://www.imm.dtu.dk/courses/02611/IA.pdf

3. I.S. Duff (1993): The Solution of Augmented Systems, Report RAL-93-084, Rutherford Appleton Laboratory.

4. A.V. Fiacco and G.P. McCormick (1968): Nonlinear Programming, Wiley.

5. R. Fletcher (1993): Practical Methods of Optimization, Wiley.

6. P.E. Frandsen, K. Jonasson, H.B. Nielsen and O. Tingleff (2004): Unconstrained Optimization, IMM, DTU. Available as http://www.imm.dtu.dk/courses/02611/uncon.pdf

7. P.E. Gill and W. Murray (1974): Newton Type Methods for Linearly Constrained Optimization, in P.E. Gill and W. Murray (eds): "Numerical Methods for Constrained Optimization", Academic Press.

8. G. Golub and C.F. Van Loan (1996): Matrix Computations, 3rd edition, Johns Hopkins University Press.

9. S.P. Han (1976): Superlinearly Convergent Variable Metric Algorithms for General Nonlinear Programming Problems, Math. Progr. 11, 263–282.

10. H.W. Kuhn and A.W. Tucker (1951): Nonlinear Programming, in J. Neyman (ed): "Proceedings of the 2nd Berkeley Symposium on Mathematical Statistics and Probability", pp 481–493, University of California Press.

11. K. Madsen (1995): Optimering under bibetingelser (Optimization with Constraints; in Danish), IMM, DTU.

12. K. Madsen and H.B. Nielsen (2002): Supplementary Notes for 02611 Optimization and Data Fitting, IMM, DTU. Available as http://www.imm.dtu.dk/courses/02611/SN.pdf

13. H.B. Nielsen (1999): Algorithms for Linear Optimization, IMM, DTU. Available as http://www.imm.dtu.dk/courses/02611/ALO.pdf

14. M.J.D. Powell (1969): A Method for Nonlinear Constraints in Minimization Problems, in R. Fletcher (ed): "Optimization", Academic Press.

15. M.J.D. Powell (1977): A Fast Algorithm for Nonlinearly Constrained Optimization Calculations, in "Numerical Analysis, Dundee 1977", Lecture Notes in Mathematics 630, Springer.
INDEX

active constraint, 2, 6
– set, 2, 18, 25, 31, 59
augmented Lagrange function, 42, 46
– – method, 45, 52, 62
– system, 31
back substitution, 31
basic QO, 24
BFGS updating, 35, 58
Caprani et al, 3
CBQO, 25, 29
Cholesky factorization, 30
computational cost, 29
concave function, 8, 55
constrained minimizer, 3
– stationary point, 13, 17
constraint surface, 54
contours, 23, 26, 36f, 39, 47
convex function, 8, 28
– set, 7, 10
cost function, 1, 15, 35, 38, 53
current BQO, 25
curvature condition, 58
descent condition, 6
– direction, 6, 58
Duff, 32
equality constraint, 1, 9, 12, 45
exact penalty function, 55
exterior point method, 40
factorization, 30
feasible active direction, 18
– descent direction, 13
– region, 1, 5, 10, 26, 34, 54
– starting point, 29
Fiacco, 17
Fletcher, 15, 20, 23, 38, 44, 51
forward substitution, 31
Frandsen et al, 6, 15, 18, 30, 46, 53, 55, 58
general quadratic optimization, 27
Gill, 30
Golub, 30f
gradient, 4, 14, 32, 58
Han, 53
Hessian matrix, 4, 15, 33, 58
ill-conditioned, 39
inactive constraint, 2, 59
inequality constraint, 1, 26, 39
interior point methods, 40
interval analysis, 3
Jacobian matrix, 32, 44, 46
Karush, 15, 63
kink, 56
KKT conditions, 15, 24, 33, 44
Kuhn, 15, 63
Lagrange's function, 14, 22, 32, 42, 46, 53, 58
– multiplier, 14, 16, 20, 27, 60, 64
Lagrange-Newton method, 53, 62
level curves, 23, 26, 36f, 39, 47
line search, 55, 57, 61
linear constraint, 9
– convergence, 44
– model, 54
– optimization, 3, 29
logarithmic barrier function, 40
Madsen, 21, 30
McCormick, 17
model, 53f
– problem, 60
Murray, 30
necessary condition, 16, 20
Newton, 33, 44, 51ff, 59
Nielsen, 3, 29f, 41
objective function, 1, 15
orthogonal matrix, 30
orthonormal basis of IRⁿ, 31
penalty function, 55, 60
– line search, 57
– method, 43
positive definite, 9, 23, 32, 58
Powell, 41, 53, 56, 62
projection, 63
QO, 22, 34
QR factorization, 30
quadratic convergence, 34, 52, 53
– model, 53
– optimization, 3, 22
– program, 53
Quasi-Newton method, 45, 52f
Raphson, 33
robust convergence, 35
sequential linear optimization, 38
– quadratic optimization, 38, 53
SIMPLEX, 29
slack variables, 49
soft line search, 56
sparse matrix, 31
SQO, 38
SQP, 53, 59, 62
stationary point, 32
steepest ascent, 44, 51
– descent, 6, 11
strongly active, 13, 20, 41
sufficient condition, 9, 20f
SUO, 38
superlinear convergence, 52f, 62
Taylor expansion, 4, 6, 11, 34, 44, 58, 63
total Hessian, 33
triangular matrix, 30
Tucker, 15, 63
unconstrained minimizer, 2, 13, 23f, 42
underdetermined system, 29
updating, 30ff, 35, 58, 61
Van Loan, 30f
violated constraint, 28
warm start, 46
weakly active constraint, 13, 20
