
MATH412

Optimization
Giorgio Consigli
giorgio.consigli@ku.ac.ae
Fall Term a.y. 2022-2023

Khalifa University
Overview
The course is structured in four main parts:
• Part I Mathematical Review: matrix theory, convexity, multivariable
calculus.
• Part II Unconstrained optimization: Gradient descent, Newton methods,
Quasi-Newton methods.
• Part III Linear programming: Simplex method, Interior point method.
• Part IV Nonlinear constrained optimization: Lagrange method,
Karush-Kuhn-Tucker conditions, second order conditions, convex
optimization, algorithms, multi-objective optimization.
Material
This document only traces the main topics; many exercises and
applications will be handed out separately and posted on our BB course page,
where you will be able to collect:
• These notes, which will evolve over the course, always including previous
sections.
• Exercise questions and, separately, their solutions.
• Simple Excel or MATLAB applications to explain some methods and results.
The main reference textbook for the course is the 4th edition of An Introduction to
Optimization by Edwin K. P. Chong and Stanislaw H. Zak, Wiley 2013.
Part I.a: Linear algebra
Through several exercises we wish to summarize the main relevant
concepts from linear algebra needed in our course, specifically:
• linear independence
• vector spaces and subspaces
• transformations, linear systems and matrix definiteness.
We denote with x ∈ R^n an n-dimensional vector with elements
{x_i}, i = 1, 2, ..., n. We use capital letters for matrices: A ∈ R^{m,n} is a
linear mapping from R^n to R^m, whose rank is denoted by rk(A) ≤ min(m, n).
A subspace, say W ⊆ V, is often defined and must itself satisfy
the conditions for a vector space. In particular it includes the null vector 0
of all 0's.

Exercises 1
Inner product operator and inequalities
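Let v and w be two vectors in R^n. Their inner product satisfies:
$\langle v, w \rangle = v^T w = \sum_i v_i w_i = \cos(\theta)\,\|v\|\,\|w\|$, where θ ∈ [0, π]
and ||.|| is the Euclidean norm. Thus, for v, w ≠ 0,
$\cos(\theta) = \frac{\langle v, w \rangle}{\|v\|\,\|w\|}$ and $\theta = \arccos\left(\frac{\langle v, w \rangle}{\|v\|\,\|w\|}\right)$.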
For two vectors to be orthogonal you need the inner product to be 0.
→Schwarz inequality, Triangle inequality

Exercises 2
Linear systems
We define the set of linear equations Ax = b with A ∈ R^{m,n}, a vector of
unknowns x ∈ R^n and a vector of right-hand side coefficients b ∈ R^m; for
b = 0 we have a homogeneous system.
→Rouche-Capelli, Gaussian elimination, Cramer
→Transformations
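Consider the system Ax = λx. We look for the set of λ's for which
this equality holds for some x ≠ 0, i.e. for which (A − λI)x = 0 admits a
nontrivial solution. This requires:

det(A − λI) = 0

We call such λ's the eigenvalues of A and the corresponding x's the associated eigenvectors.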

Exercises 3
Diagonalization
Key results in Rn :
• $\det(A) = \prod_{i=1}^{n} \lambda_i$ and $\mathrm{tr}(A) := \sum_i a_{ii} = \sum_i \lambda_i$
• A N.S.C. for A ∈ R^{n,n} to have n l.i. eigenvectors v1, v2, ..., vn
associated with the eigenvalues λ1, ..., λn is that
$V^{-1} A V = \Lambda =: \mathrm{diag}(\lambda_i)$, with V = [v1, v2, ..., vn].
• Spectral theorem: Let A ∈ R^{n,n} be symmetric and λ1, ..., λn be its
eigenvalues, then there exists an orthogonal matrix U such that
$U^{-1} A U = \Lambda =: \mathrm{diag}(\lambda_i)$. We also have as implications that:
• A matrix A ∈ R^{n,n} admits a diagonal representation if there exists a
matrix V s.t. $A = V D V^{-1}$ where D is diagonal.
• All eigenvalues of a symmetric R-valued matrix are real.
• Any symmetric R-valued matrix of order n has n mutually
orthogonal eigenvectors.
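The results above can be checked numerically on a small example. The sketch below is only an illustration: the matrix `A` is a hypothetical choice, and NumPy's `eigh` routine (designed for symmetric matrices) is assumed as the eigen-solver.

```python
import numpy as np

# Hypothetical symmetric matrix, chosen only for illustration.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh returns real eigenvalues (ascending) and orthonormal
# eigenvectors stored as the columns of U.
eigvals, U = np.linalg.eigh(A)
Lam = np.diag(eigvals)

# U is orthogonal, so U^{-1} = U^T and U^T A U = Lambda.
assert np.allclose(U.T @ U, np.eye(2))
assert np.allclose(U.T @ A @ U, Lam)

# det(A) is the product and tr(A) the sum of the eigenvalues.
assert np.isclose(np.linalg.det(A), np.prod(eigvals))
assert np.isclose(np.trace(A), np.sum(eigvals))
```

For this matrix the eigenvalues are 1 and 3, so det(A) = 3 and tr(A) = 4.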

Exercises 4
Orthogonal decomposition
Let V be a vector space and U, W two subspaces. The two spaces are
orthogonal if their elements (vectors) are orthogonal. We define the
orthogonal complement U ⊥ := {x|xT u = 0, ∀u ∈ U }.
A subspace and its orthogonal complement together span the whole space, due to the following
relationship. Take V to be a subspace of R^n, then we have:
R^n = V ⊕ V⊥ =⇒ dim(V) + dim(V⊥) = n. The same relationship
holds for the subspaces of V.
Every vector x ∈ R^n admits a unique orthogonal decomposition
x = x1 + x2, with x1 ∈ V, x2 ∈ V⊥.
Projections
Consider a matrix A ∈ R^{m,n}.
Let R(A) := {Ax | x ∈ R^n} define the range or image of matrix A and
N(A) := {x ∈ R^n | Ax = 0} the null space of A.
We have the following results:
• Let A be a given matrix, then R(A)⊥ = N(A^T) and
N(A)⊥ = R(A^T).
• A matrix P is an orthogonal projector onto the subspace
V = R(P) iff P² = P = P^T. For any x ∈ V, Px = x.
The first result establishes a relationship between two
relevant subspaces; the second gives a Necessary and Sufficient
Condition (N.S.C.) for orthogonal projection.
Orthogonal projections and Gram-Schmidt method
Let V be a vector space and U a subspace endowed with orthonormal
basis u(1) , ..., u(n) , then for any v ∈ V its orthogonal projection v⊥
onto U is given by
v⊥ =< v, u(1) > u(1) + < v, u(2) > u(2) + ...+ < v, u(n) > u(n) .
From this representation we can derive the popular Gram-Schmidt
method, which allows us to derive an associated orthonormal basis from any
given basis of a vector space. In particular let x(1), x(2), ..., x(n) be a
basis in V ⊆ R^n. We want to construct a new basis of orthonormal
vectors v(1), ..., v(n). Then:
$v^{(1)} = \frac{x^{(1)}}{\|x^{(1)}\|}, \qquad v^{(2)} = \frac{x^{(2)} - \langle x^{(2)}, v^{(1)} \rangle v^{(1)}}{\|x^{(2)} - \langle x^{(2)}, v^{(1)} \rangle v^{(1)}\|}$
and in general for k = 1, 2, ..., n:
$v^{(k)} = \frac{x^{(k)} - \langle x^{(k)}, v^{(1)} \rangle v^{(1)} - \langle x^{(k)}, v^{(2)} \rangle v^{(2)} - \dots - \langle x^{(k)}, v^{(k-1)} \rangle v^{(k-1)}}{\|x^{(k)} - \langle x^{(k)}, v^{(1)} \rangle v^{(1)} - \langle x^{(k)}, v^{(2)} \rangle v^{(2)} - \dots - \langle x^{(k)}, v^{(k-1)} \rangle v^{(k-1)}\|}$
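The Gram-Schmidt recursion can be sketched in a few lines. This is a minimal illustration, assuming the basis vectors are stored as the columns of a NumPy array and are linearly independent; the function name `gram_schmidt` and the test basis `X` are hypothetical.

```python
import numpy as np

def gram_schmidt(X):
    """Orthonormalize the columns of X (assumed linearly independent)."""
    X = np.asarray(X, dtype=float)
    V = np.zeros_like(X)
    for k in range(X.shape[1]):
        # Subtract from x^(k) its projections on v^(1), ..., v^(k-1).
        w = X[:, k].copy()
        for j in range(k):
            w -= (X[:, k] @ V[:, j]) * V[:, j]
        # Normalize what is left.
        V[:, k] = w / np.linalg.norm(w)
    return V

# Hypothetical basis of R^3 (columns), for illustration only.
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
V = gram_schmidt(X)
```

The columns of `V` should then satisfy V^T V = I, and the first column is x(1) rescaled to unit length.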

Exercises 5
Definiteness
We say that f : R^n → R is a quadratic form if f(x) = x^T · Q · x with
Q an (n, n) real matrix, assumed to be symmetric. The quadratic form
(and the matrix Q) is said to be:
• positive definite (p.d.) if x^T · Q · x > 0 for all x ≠ 0
• positive semi definite (p.s.d.) if x^T · Q · x ≥ 0 for any x
• negative definite (n.d.) if x^T · Q · x < 0 for all x ≠ 0
• negative semi definite (n.s.d.) if x^T · Q · x ≤ 0 for any x
• indefinite if neither (p.s.d.) nor (n.s.d.).
We will see later on that the sign of a matrix is of primary relevance when
studying the behaviour of a function in R^n and to determine its convexity.
Definiteness: a few results
We give two relevant results which help to determine the sign of a
matrix directly.
Result 1: Q is p.d. iff all its eigenvalues are positive, n.d. iff all are
negative, p.s.d. iff all are nonnegative and n.s.d. iff all are nonpositive. The
matrix is indefinite if it has both positive and negative eigenvalues.
Result 2 (see Sylvester criterion): Let A be symmetric of order n. Then
it is p.d. iff all its leading (north-west) principal minors are positive:
Det(a11) > 0, Det(A22) > 0, ..., Det(Akk) > 0, ..., Det(A) > 0. It is n.d. iff
Det(a11) < 0, Det(A22) > 0, ..., (−1)^k Det(Akk) > 0, ..., (−1)^n Det(A) > 0. To assess
p.s.d. and n.s.d. it is necessary to introduce all principal minors, whose
sign will determine the condition in case one of the leading minors is zero.
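Both results can be tried out numerically. The sketch below is illustrative only: the helper names `classify_by_eigenvalues` and `leading_minors`, the tolerance `tol`, and the two test matrices are assumptions made for this example.

```python
import numpy as np

def classify_by_eigenvalues(Q, tol=1e-12):
    """Result 1: sign of a symmetric matrix from its eigenvalues."""
    lam = np.linalg.eigvalsh(Q)  # real eigenvalues of a symmetric matrix
    if np.all(lam > tol):
        return "p.d."
    if np.all(lam < -tol):
        return "n.d."
    if np.all(lam >= -tol):
        return "p.s.d."
    if np.all(lam <= tol):
        return "n.s.d."
    return "indefinite"

def leading_minors(Q):
    """Result 2: determinants of the north-west sub-matrices A_kk."""
    n = Q.shape[0]
    return [np.linalg.det(Q[:k, :k]) for k in range(1, n + 1)]

# Hypothetical examples: eigenvalues (1, 3) and (-1, 3) respectively.
Q_pd = np.array([[2.0, -1.0], [-1.0, 2.0]])
Q_ind = np.array([[1.0, 2.0], [2.0, 1.0]])

label_pd = classify_by_eigenvalues(Q_pd)
label_ind = classify_by_eigenvalues(Q_ind)
minors_pd = leading_minors(Q_pd)    # all positive for a p.d. matrix
minors_ind = leading_minors(Q_ind)  # second minor is negative
```

Note that the leading minors alone settle only the p.d. and n.d. cases, which is why the eigenvalue test is used for the full classification.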
Exercises 6
Part I.b Geometry
Let x, y be vectors in Rn and z be a vector lying in the segment joining
x and y: z = αx + (1 − α)y for α ∈ [0, 1].
Let u1 , u2 , ..., un , elements of u, and v be in R. The set
H := {x ∈ Rn |uT · x = v} defines a hyperplane of dimension n − 1.
For n = 2 : u1 x1 + u2 x2 = v is a straight line; for
n = 3 : u1 x1 + u2 x2 + u3 x3 = v defines a plane (a subspace only for
v = 0). We define positive and negative half-spaces in Rn as:
H+ := {x|uT x ≥ v} H− := {x|uT x ≤ v}
The vector u ∈ R^n is said to be normal to H in the following
sense. Let a be in H, so that u^T · a = v; then for any x ∈ H
(u^T · x − v) − (u^T · a − v) = u^T · (x − a) = 0, i.e. u ⊥ (x − a). Then:

H+ := {x | u^T · (x − a) ≥ 0} H− := {x | u^T · (x − a) ≤ 0}
Convex sets
Let A ∈ Rm,n , b ∈ Rm , a linear variety in Rn is a set
{x ∈ Rn |A · x = b}.
• The empty set, a single point, a line, a subspace, a hyperplane, a
linear variety, a half space, Rn are convex sets in their k − dim
spaces for k = 0, 1, 2, 3, .....
• Assume a convex subset Θ ⊆ R^n, then:
• βΘ := {x | x = βv, v ∈ Θ} is convex
• If Θ1, Θ2 are convex, then
Θ1 + Θ2 := {x | x = v1 + v2, vi ∈ Θi, i = 1, 2} is convex
• V = ∩_{i=1,2,..,n} Θi: the intersection of convex sets is convex.
• The set Bε(x) := {y ∈ R^n | ||y − x|| < ε} defines a
neighbourhood of x; it is convex. A set S is open if it contains a
neighbourhood of every point in the set. It is closed if it contains its
boundary. If S is contained in a ball of finite radius it is bounded; if closed
and bounded it is said to be compact.
Functions
A set that can be expressed as the intersection of a finite number of
half-spaces is a polyhedron, when bounded and non-empty this is a
polytope.
Consider two disjoint convex sets S and T: then there exists a
u ≠ 0 such that u^T x ≤ u^T y for every x ∈ S and y ∈ T. Such a vector defines a
separating hyperplane for the elements of the two sets.
Let S ⊆ R^n be convex, f : S → R is:
• convex if for any x, x′ ∈ S and α ∈ (0, 1):
f(αx + (1 − α)x′) ≤ αf(x) + (1 − α)f(x′)
• concave if for any x, x′ ∈ S and α ∈ (0, 1):
f(αx + (1 − α)x′) ≥ αf(x) + (1 − α)f(x′) (thus −f is convex)
• strictly convex: if x ≠ x′ and LHS < RHS.
• strictly concave: if x ≠ x′ and LHS > RHS (again −f is strictly
convex)

Exercises 7
Part I.c Calculus
A few useful results on limits of vector sequences in R^n can be
summarised:
• The sequence {x(k)}, k = 1, 2, ..., converges to x* if for every ε > 0 there
exists a K s.t. ||x(k) − x*|| < ε for all k ≥ K. If convergent, the limit x* is
unique and {x(k)} is bounded. Any bounded sequence contains a convergent
subsequence.
• Let f : R^n → R^m, x0 ∈ R^n. Suppose there exists an
f* s.t. lim_{k→∞} f(x(k)) = f* for any sequence {x(k)} converging to x0; then
we write lim_{x→x0} f(x) = f*, and f is
continuous at x0 iff lim_{x→x0} f(x) = f(x0).
• The above criterion applies to matrices. More importantly: let
A ∈ R^{n,n}. Then lim_{k→∞} A^k = 0 iff the eigenvalues of A
satisfy |λi(A)| < 1, i = 1, 2, ..., n.
More on norms
Several norm functions can be summarized for vectors and matrices.
They play an important role in multivariable calculus and convergence.
Let's summarize the most important ones (in finite dimensional spaces), for A ∈ R^{m,n}:
• 1-norm: $\|A\|_1 = \max_j \sum_{i=1,\dots,m} |a_{ij}|$, which for a single column (n = 1) defines the
||.||_1 vector norm
• Frobenius norm: $\|A\|_F := \left[\sum_{i=1,\dots,m} \sum_{j=1,\dots,n} |a_{ij}|^2\right]^{1/2} = \left[\mathrm{tr}(A \cdot A^T)\right]^{1/2}$, which
for n = 1 is the Euclidean norm
• Infinity norm: $\|A\|_\infty = \max_{i=1,\dots,m} \sum_j |a_{ij}|$, which for n = 1 is the infinity vector norm
• p-norm: $\|A\|_p := \left[\sum_{i=1,\dots,m} \sum_{j=1,\dots,n} |a_{ij}|^p\right]^{1/p}$ for p ≥ 1; again
this norm translates easily into the generic vector p-norm for n = 1.
All the norms must comply with the properties of non-negativity,
homogeneity, triangular inequality. When no index is shown we refer to
Euclidean norm.
Differentiability
• A function A : Rn → Rm is affine if there exists a linear fn
L : Rn → Rm and y ∈ Rm s.t. A(x) = L(x) + y for every
x ∈ Rn .
• Let f : Rn → Rm and x0 ∈ Rn . We define
A(x) = L(x − x0 ) + f(x0 ).
• We require for x ∈ Ω ⊂ R^n that $\lim_{x \to x_0} \frac{\|f(x) - A(x)\|}{\|x - x_0\|} = 0$. Then
the function f is differentiable at x0.
• In R² (f : R → R): A(x) = ax + b = a(x − x0) + f(x0) and
$f'(x_0) = a = \lim_{x \to x_0} \frac{f(x) - f(x_0)}{x - x_0}$, so that A(x) = f(x0) + f′(x0)(x − x0).
• For x_j = x0 + t e_j we define the partial derivatives
$\frac{\partial f}{\partial x_j}(x_0) = \lim_{t \to 0} \frac{f(x_j) - f(x_0)}{t} = L e_j$, where e_j, j = 1, 2, ..., n is the
natural basis in R^n and L is the derivative matrix.
Differentiability
• For f : R^n → R^m we have the Jacobian matrix Df(x0) with (i, j)th
element $\frac{\partial f_i}{\partial x_j}(x_0)$.
• For f : R^n → R we define the gradient (column) vector ∇f(x0) = Df(x0)^T,
whose (i)th row-element is $\frac{\partial f}{\partial x_i}(x_0)$.
• If the gradient is differentiable we define the Hessian matrix H(x)
with generic element $\frac{\partial^2 f}{\partial x_i \partial x_j}(x)$.
• A function f : Ω → R^m, with Ω ⊂ R^n, is continuously differentiable
on Ω if it is differentiable and Df : Ω → R^{m,n} is continuous: f ∈ C¹.
If partial derivatives of order p are continuous, f ∈ C^p.
• The matrix H(x) of f : R^n → R is symmetric if f is twice
continuously differentiable at x. If second order partial derivatives are not
continuous there is no guarantee of symmetry.

Exercises 8
Level sets
We define the level set of f : R^n → R at level c as the set S := {x | f(x) = c}, for
any c ∈ R in the range of the function.
For varying levels c ∈ {c1, c2, ...} we identify different level curves
that can be studied in dimension (n − 1).
Consider an arbitrary c and a curve g : R → R^n lying on the level set, with g(t0) = x0
and Dg(t0) = v ≠ 0, so that v is tangent to the level set at x0.
Consider the composite function h(t) = f(g(t)) = c with f : R^n → R;
we apply the chain rule:
h′(t0) = Df(g(t0)) · Dg(t0) = Df(x0) · v. But since h(t) = f(g(t)) = c is constant,
h′(t0) = 0, thus Df(x0) · v = 0.
The tangent vector at x0 and the gradient are orthogonal and this is
true for any point on the level set.
As we define different level curves we can study the directions on the
surface. The gradients provide the direction of maximal rate of increase
over the surface in Rn .
Exercises 9
Taylor expansion in Rn
We now have everything we need to extend Taylor expansion to Rn .
Let f : R^n → R be in C². Then
$f(x) = f(x_0) + Df(x_0) \cdot (x - x_0) + \frac{1}{2!}(x - x_0)^T D^2 f(x_0)(x - x_0) + o(\|x - x_0\|^2)$
where o(||x − x0||²) denotes a remainder term r(x) such that $\lim_{x \to x_0} \frac{|r(x)|}{\|x - x_0\|^2} = 0$.
We indicate with Df (x0 ) the transpose of the gradient evaluated in x0
and with D 2 f (x0 ) the Hessian: you see immediately the implications of
having an n-dim hyperplane as second term of the expansion and a
quadratic form as third.
Exercises 10
Part II.a: Unconstrained and set-constrained
optimization
We deal with the problem: minimize f(x) subject to x ∈ Ω ⊆ R^n, with
f : R^n → R. We have
• x* is a local minimizer if there exists an ε > 0 s.t. f(x) ≥ f(x*) for
any x ≠ x* in Ω with ||x − x*|| < ε.
• x* is a global minimizer if f(x) ≥ f(x*) for any x ≠ x* in Ω.
We define a strict local or global minimizer in case of strict
inequality. Under the feasibility condition x ∈ Ω ⊂ R^n we refer to a
set-constrained problem, otherwise (Ω = R^n) to an unconstrained problem. In the first
case a vector d ≠ 0 is a feasible direction at x ∈ Ω if there exists an α0 > 0 such that
x + αd ∈ Ω for any α ∈ [0, α0].
First-order necessary conditions: FONC
Theorem: Let Ω ⊂ R^n, f ∈ C¹, f : Ω → R. If x* is a local minimizer
of f over Ω, then for any feasible direction d at x* we have

d^T ∇f(x*) ≥ 0.

We can also write ∂f/∂d (x*) ≥ 0. Proof by Taylor expansion in R^n.
The above can be read as follows:
• ∂f/∂d (x*) ≥ 0 for all feasible directions d: whatever feasible direction we
take, to first order we do not go down the surface.
• Interior of Ω: If x* lies in the interior of Ω the FONC for a local
minimizer is just ∇f(x*) = 0
Second-order necessary conditions: SONC
Theorem: Let Ω ⊂ R^n, f ∈ C², f : Ω → R, x* a local minimizer of f
over Ω, and d a feasible direction at x*. Then if d^T ∇f(x*) = 0 we
have
d^T · F(x*) · d ≥ 0
where F is the Hessian of f.

Interior of Ω: If x* is a local minimizer in the interior of Ω, then ∇f(x*) = 0 (FONC)
and F(x*) is p.s.d., i.e. for all d ∈ R^n: d^T F(x*) d ≥ 0.

Proof by contradiction: assume a feasible d with d^T ∇f(x*) = 0 but
d^T F(x*) d < 0. Set φ(α) = f(x(α)) = f(x* + αd) and apply the
Taylor expansion to the second order: for small α > 0 we would obtain
f(x* + αd) < f(x*). We therefore need the quadratic form to
be p.s.d.
Second-order sufficient conditions: SOSC
Theorem, interior case: Let Ω ⊂ R^n, f ∈ C², f : Ω → R, x* in the
interior of Ω. Suppose:
1. ∇f(x*) = 0
2. F(x*) > 0 (i.e. F(x*) is p.d.)
Then x* is a strict local minimizer of f.

Remark: notice that positive definiteness is required for the sufficient
condition; p.s.d. alone is not enough to guarantee an optimum.

Exercises 11
1-d search methods
Assume now that f : X ⊂ R → R is a unimodal function. We consider
three iterative approaches to identify the minimum of a continuous
function: find
x* = argmin_{x∈R} f(x)
• Fibonacci search method: it is based on the celebrated Fibonacci
numbers and relies only on function evaluation steps.
• First-order methods: in this generic approach (with several specific
versions) the problem is solved relying on first derivatives.
• Newton’s method implements instead an optimum search based
on first and second derivatives. We’ll see that this has then been
extended to Rn in more recent times.
Fibonacci method
This is an algorithm inspired by the ancient Greek golden section method. It is
globally convergent and employs a range reduction for function evaluation
based on $\rho_k = 1 - \frac{F_{N-k+1}}{F_{N-k+2}}$, for k = 1, ..., N, where F_k are Fibonacci numbers
satisfying F_{k+1} = F_k + F_{k−1} with F_{−1} = 0 and F_0 = F_1 = 1.
1. Initial conditions: the initial range [a0, b0] and the number of evaluations
N, chosen such that $\frac{1 + 2\epsilon}{F_{N+1}} \le \frac{\text{final range}}{\text{initial range}} = c$, leading to $F_{N+1} \ge \frac{1 + 2\epsilon}{c}$.
2. Iteration 1: say N = 5, then 1 − ρ1 = F_5/F_6 = 8/13, a1 = a0 + ρ1(b0 − a0),
b1 = a0 + (1 − ρ1)(b0 − a0). Evaluate f(a1), f(b1); assume
f(a1) < f(b1), then the range is reduced to [a0, b1].
3. ... iteration k
4. Last (fifth) iteration: 1 − ρ5 = F_1/F_2 = 1/2, so the two evaluation points would
coincide. The new point is therefore placed using ρ5 − ε: a5 = a + (ρ5 − ε)(b − a), with
[a, b] the current range, while the other point b5 is inherited from the previous
iteration. Then evaluate f(a5), f(b5) and take the minimum, STOP.
Develop example.
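A possible example can be sketched as follows. This is a simplified illustration rather than a reference implementation: for readability it re-evaluates both interior points at every iteration (an efficient version reuses one evaluation per step), and the function name `fibonacci_search`, the test function and the value ε = 0.05 are assumptions.

```python
def fibonacci_search(f, a, b, N, eps=0.05):
    """Reduce the range [a, b] of a unimodal f in N Fibonacci steps."""
    # F[i] stores F_i, with F_0 = F_1 = 1, up to F_{N+1}.
    F = [1, 1]
    while len(F) < N + 2:
        F.append(F[-1] + F[-2])
    for k in range(1, N + 1):
        rho = 1.0 - F[N - k + 1] / F[N - k + 2]
        if k == N:
            # rho_N = 1/2 would make the two points coincide.
            rho -= eps
        x1 = a + rho * (b - a)
        x2 = a + (1.0 - rho) * (b - a)
        if f(x1) < f(x2):
            b = x2   # the minimizer lies in [a, x2]
        else:
            a = x1   # the minimizer lies in [x1, b]
    return a, b

# Hypothetical unimodal test function with minimizer x* = 1.
f = lambda x: (x - 1.0) ** 2
a_N, b_N = fibonacci_search(f, 0.0, 3.0, N=10)
```

With N = 10 we have F_11 = 144, so the final range has length 3(1 + 2ε)/144, about 0.023, and still contains the minimizer.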
First order algorithms
Assume now f : R → R, f ∈ C¹. Adopt the following algorithm (based
on FONC):
1. Find all stationary points of f(x) by solving df/dx = 0
2. Evaluate f at each such point
3. Evaluate the limits of f(x) as x → +∞ and x → −∞
4. Select the least of the values of f in steps 2 and 3. This is
min_x f(x).
In this simple algorithm, we are just introducing a function evaluation
step over a finite number of selected stationary points, making
sure that the function is bounded below (step 3).
The algorithm identifies a minimum in the interior of
the function domain if the function is bounded below. An issue may arise in
the presence of saddle points.
1-d Newton method
Let f : R → R, f ∈ C², and consider now an iterative procedure. Given a value
x(k) we assume that f(x(k)), f′(x(k)), f″(x(k)) are defined: we then introduce
a quadratic approximation
$q(x) = f(x^{(k)}) + f'(x^{(k)})(x - x^{(k)}) + \frac{1}{2} f''(x^{(k)})(x - x^{(k)})^2$
Instead of minimizing f we minimize q at each iteration. Choose x(0), then
• 0 = q′(x) = f′(x(k)) + f″(x(k))(x − x(k)) is now the FONC
• Letting x = x(k+1) we have $x^{(k+1)} = x^{(k)} - \frac{f'(x^{(k)})}{f''(x^{(k)})}$. This is the Newton step.
• Termination when |x(k+1) − x(k)| < ε and/or |f(x(k+1)) − f(x(k))| < δ,
with say ε, δ ∈ {10^−6, 10^−9, 10^−12}.
• It can also be seen as an iterative method to solve g(x) = 0: let
g(x) = f′(x), then g(x(k)) = −(x(k+1) − x(k)) g′(x(k)).

The method is accurate and fast as long as the second derivative is positive;
otherwise it may not converge to a minimizer.
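The Newton iteration can be sketched in a few lines. The function name `newton_1d`, the tolerance and the test function f(x) = x²/2 − sin(x) are assumptions made only for this illustration; the derivatives are supplied by hand.

```python
import math

def newton_1d(df, d2f, x0, eps=1e-9, max_iter=50):
    """Iterate x <- x - f'(x)/f''(x) until the step is smaller than eps."""
    x = x0
    for _ in range(max_iter):
        x_new = x - df(x) / d2f(x)   # Newton step
        if abs(x_new - x) < eps:
            return x_new
        x = x_new
    return x

# Hypothetical example: f(x) = x^2/2 - sin(x).
df = lambda x: x - math.cos(x)       # f'(x)
d2f = lambda x: 1.0 + math.sin(x)    # f''(x), positive near the solution
x_star = newton_1d(df, d2f, x0=0.5)
```

At the returned point f′(x*) = 0, i.e. x* = cos(x*), and f″(x*) > 0, so the stationary point is a local minimizer.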

Exercises 12
Part II.b: Gradient method
Consider f : R^n → R, f ∈ C¹: we saw that at any x0 with f(x0) = c, the tangent
direction tg(x0) to the level set of c satisfies tg(x0) ⊥ ∇f(x0).
The rate of increase of f along a direction d with ||d|| = 1 is given by
⟨∇f(x), d⟩. With d = ∇f(x)/||∇f(x)|| we have ⟨∇f(x), ∇f(x)/||∇f(x)||⟩ = ||∇f(x)||, which is the
max rate of increase in R^n at x: the gradient is the direction of max rate of
increase and its negative, −∇f(x), the direction of max rate of decrease.
From a very general perspective, gradient methods rely on the following steps,
where we assume that the step length is fixed at α:
1. Set x(0), compute f(x(0)), ∇f(x(0))
2. x(1) = x(0) − α∇f(x(0)), f(x(1)), for ∇f(x(0)) ≠ 0
3. ... iteration k: x(k) = x(k−1) − α∇f(x(k−1)), f(x(k)) for
∇f(x(k−1)) ≠ 0
4. If ||x(k) − x(k−1)|| < 10^−6, or ||∇f(x(k)) − ∇f(x(k−1))|| < 10^−5, stop;
o.w. go to 3.
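The fixed-step scheme above can be sketched as follows. The function name `gradient_descent`, the step α = 0.1 and the quadratic test function are assumptions for illustration; only the first stopping test (on the distance between iterates) is implemented.

```python
import numpy as np

def gradient_descent(grad, x0, alpha, tol=1e-6, max_iter=10_000):
    """Fixed-step gradient method: x <- x - alpha * grad(x)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) == 0.0:          # stationary point reached
            return x, k
        x_new = x - alpha * g
        if np.linalg.norm(x_new - x) < tol:   # step 4 stopping test
            return x_new, k + 1
        x = x_new
    return x, max_iter

# Hypothetical test function f(x) = (x1 - 1)^2 + 2 (x2 + 0.5)^2,
# whose gradient is written by hand below.
grad = lambda x: np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 0.5)])
x_min, n_iter = gradient_descent(grad, x0=[0.0, 0.0], alpha=0.1)
```

A fixed step that is too large makes the iteration diverge, which is one motivation for choosing the step size at every iteration.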
Steepest descent
Consider now steepest descent, probably the most popular among gradient
method. Now at every iteration the step size αk is updated. We analyse this
method for generic nonlinear problems and then specifically for quadratic
programs.
The algorithm is conceived to determine the maximum rate of decrease of
f (xk ) at every iteration k: we want
αk = argminα≥0 φk (α) := f (x(k) − α∇f (x(k) )).
Let's focus on the step size αk and a possible derivation based on gradient
evaluation. At α = αk:
$\frac{d\phi_k(\alpha)}{d\alpha} = 0 = \nabla f(x^{(k)} - \alpha \nabla f(x^{(k)}))^T \cdot (-\nabla f(x^{(k)})) = -\nabla f(x^{(k+1)})^T \nabla f(x^{(k)})$

So in general we wish to select a step size for which the gradient of the
function at the next iteration is orthogonal to the gradient evaluated at the
current iteration.
Properties
Two relevant results:
1. If {x(k)}, k = 0, 1, ..., is a steepest descent sequence for f : R^n → R then
for each k: (x(k+1) − x(k)) ⊥ (x(k+2) − x(k+1)).
2. If {x(k)}, k = 0, 1, ..., is the steepest descent sequence for f : R^n → R and
if ∇f(x(k)) ≠ 0 then f(x(k+1)) < f(x(k)).
Termination is based on tolerances, according to
• absolute criteria: |f(x(k+1)) − f(x(k))| < 10^−6 and/or
||∇f(x(k))|| < 10^−6, or
• relative criteria: $\frac{|f(x^{(k+1)}) - f(x^{(k)})|}{\max(1, |f(x^{(k)})|)} < \epsilon$ and $\frac{\|x^{(k+1)} - x^{(k)}\|}{\max(1, \|x^{(k)}\|)} < \epsilon$.
Steepest descent for quadratic programs
Consider $f(x) = \frac{1}{2} x^T Q x - b^T x$ with Q p.d. and symmetric. We apply
steepest descent to this quadratic program (QP). Let g(x) = ∇f(x). We have
g(x) = Qx − b and the Hessian F(x) = Q. In this case the algorithm
generates a stepsize αk that can be characterized mathematically. We need
φ′k(α) = 0, with g(k) = g(x(k)):
$\phi_k'(\alpha) = (x^{(k)} - \alpha g^{(k)})^T Q (-g^{(k)}) - b^T (-g^{(k)}) = 0$
For this it is sufficient to have $\alpha\, g^{(k)T} Q\, g^{(k)} = (x^{(k)T} Q - b^T) g^{(k)} = g^{(k)T} g^{(k)}$, and we derive
$\alpha_k = \frac{g^{(k)T} g^{(k)}}{g^{(k)T} Q g^{(k)}}$. In the QP case the steepest descent step becomes:

$x^{(k+1)} = x^{(k)} - \frac{g^{(k)T} g^{(k)}}{g^{(k)T} Q g^{(k)}}\, g^{(k)}$

with g(k) = ∇f(x(k)) = Qx(k) − b
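The closed-form step size makes steepest descent for a QP very compact. The sketch below is illustrative: the function name `steepest_descent_qp`, the tolerance and the data `Q`, `b` are assumptions, and the gradient norm is used as stopping test.

```python
import numpy as np

def steepest_descent_qp(Q, b, x0, tol=1e-10, max_iter=1000):
    """Steepest descent for f(x) = 0.5 x^T Q x - b^T x, Q symmetric p.d."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = Q @ x - b                      # gradient g^(k)
        if np.linalg.norm(g) < tol:
            return x, k
        alpha = (g @ g) / (g @ Q @ g)      # exact step size alpha_k
        x = x - alpha * g
    return x, max_iter

# Hypothetical data: Q is symmetric p.d. (leading minors 4 and 11).
Q = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
x_star, iters = steepest_descent_qp(Q, b, np.zeros(2))
```

The iterates should approach the solution of Qx = b, which for these data is x* = (1/11, 7/11).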

Exercises 13
Convergence analysis
Convergence analysis focuses on:
• its characterization as global or local convergence,
• results characterizing the convergence to an optimum x(k) → x* so that
f(x*) is the associated optimal value of f,
• the speed of convergence to that value.
As for the first point, an algorithm is said to be globally convergent when from
any x(0) it will surely reach a point x* where the FONC are satisfied. When this
requires the starting point to be close to x*, i.e. in a neighbourhood of x*,
then we speak of local convergence. We focus for simplicity on the QP case
with $f(x) = \frac{1}{2} x^T Q x$, with Q p.d. and symmetric.
Convergence analysis
In the case of steepest descent two results help to characterize its convergence.

1. (Kantorovich inequality) Let Q be a symmetric p.d. matrix, then for any
x ∈ R^n, x ≠ 0, we have $\frac{(x^T x)^2}{(x^T Q x)(x^T Q^{-1} x)} \ge \frac{4 \lambda_M \lambda_m}{(\lambda_M + \lambda_m)^2}$.

2. This inequality leads to the relevant convergence result: the steepest
descent algorithm generates a sequence
x(0), x(1), ... converging to x* = 0 and
$\|x^{(k+1)} - x^*\| \le \left(\frac{\lambda_M}{\lambda_m}\right)^{1/2} \frac{(\lambda_M - \lambda_m)}{(\lambda_M + \lambda_m)} \|x^{(k)} - x^*\|$, in which λM, λm are
respectively the maximum and minimum eigenvalues of Q.
Notice that Q in the quadratic form is actually the Hessian matrix: the result
establishes that the speed of convergence of gradient methods in general
depends on the ratio r = λM/λm. The higher r is, the slower the convergence,
while for r = 1 we have convergence in 1 step!
Q factors
More generally consider a sequence x(k), k = 1, 2, ... such that
x(k) → x* ∈ R^n, let e_k = ||x(k) − x*|| (then e_k → 0, under any norm) and for
every p ∈ [1, ∞) define the Q-factors of x(k) under the norm || · || in R^n as follows:
1. Qp = 0 if x(k) = x* for all but finitely many k
2. $Q_p = \limsup_{k \to \infty} \frac{e_{k+1}}{e_k^p}$ if x(k) ≠ x* for all but finitely many k
3. Qp = ∞ otherwise.
The following result holds: let Qp be the Q-factors of x(k), then one of the
following is true: (a) Qp = 0 for any p ∈ [1, ∞); (b) Qp = ∞ for any
p ∈ [1, ∞); (c) there is a p0 ∈ [1, ∞) such that Qp = 0 for p ∈ [1, p0) and
Qp = ∞ for p ∈ (p0, ∞). We complete this section with a few additional
definitions.
Convergence analysis, p = 1, 2
Consider the case p = 1. We have:
• If Q1 = 0, thus $\lim_{k \to \infty} \frac{e_{k+1}}{e_k} = 0$, we say that the speed of convergence
is Q-superlinear,
• If 0 < Q1 < 1 in a given norm, we say that in the given norm the speed
of convergence is Q-linear,
• For Q1 ≥ 1 in the given norm, we say that x(k) has speed of
convergence Q-sublinear.
The distinction generalizes to p = 2: we can distinguish accordingly between
Q-superquadratic, Q-quadratic and Q-subquadratic speed of convergence.
The steepest descent algorithm is at least Q-linearly convergent under the
norm ||x||_Q = (x^T · Q · x)^{1/2} since
$\|x^{(k+1)} - x^*\|_Q \le \left(\frac{\lambda_M - \lambda_m}{\lambda_M + \lambda_m}\right) \cdot \|x^{(k)} - x^*\|_Q$
Part II.c: Newton and Quasi-Newton methods
Consider now Newton's method in R^n. The extension is quite straightforward, for
f : R^n → R, with f ∈ C²:
1. Let again g(k) = ∇f(x(k)) and assume the approximation of f with a
quadratic function
$q(x) = f(x^{(k)}) + (x - x^{(k)})^T g^{(k)} + \frac{1}{2}(x - x^{(k)})^T F(x^{(k)})(x - x^{(k)})$
2. From FONC: 0 = ∇q(x) = g(k) + F(x(k)) · (x − x(k)), thus

$x^{(k+1)} = x^{(k)} - F(x^{(k)})^{-1} g^{(k)}$

is the Newton step in R^n.
The Newton step is conveniently decomposed in two steps: (i) solve
F(x(k)) · d(k) = −g(k) and (ii) set x(k+1) = x(k) + d(k). Positive definiteness of
the Hessian at every iteration is key to the convergence (in a minimization
problem).
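The two-step decomposition can be sketched as below: the linear system in (i) is solved directly, so the Hessian is never inverted explicitly. The function name `newton`, the tolerance and the test function f(x) = exp(x1) − x1 + x2² are assumptions for illustration; its Hessian is p.d. everywhere.

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's method in R^n with the step split as (i) and (ii)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x, k
        d = np.linalg.solve(hess(x), -g)   # (i)  F(x) d = -g
        x = x + d                          # (ii) x <- x + d
    return x, max_iter

# Hypothetical test function f(x) = exp(x1) - x1 + x2^2, minimizer (0, 0).
grad = lambda x: np.array([np.exp(x[0]) - 1.0, 2.0 * x[1]])
hess = lambda x: np.array([[np.exp(x[0]), 0.0],
                           [0.0, 2.0]])
x_min, n_iter = newton(grad, hess, x0=[1.0, 1.0])

# For a QP (gradient Qx - b, Hessian Q) a single step reaches Q^{-1} b.
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x_qp, _ = newton(lambda x: Q @ x - b, lambda x: Q, x0=[5.0, -3.0], max_iter=1)
```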
Newton in Rn
We can characterize the convergence properties of this algorithm. In the
(QP) case we have $f(x) = \frac{1}{2} x^T Q x - x^T b$, thus g(x) = ∇f(x) = Qx − b
and F(x) = Q, which is symmetric and assumed to be invertible. Then in 1
single step:
$x^{(1)} = x^{(0)} - F(x^{(0)})^{-1} g^{(0)} = x^{(0)} - Q^{-1}\left[Q x^{(0)} - b\right] = Q^{-1} b = x^*$

In the general case, there are two relevant results we can quote:
1. Let f ∈ C², x* ∈ R^n such that ∇f(x*) = 0 and F(x*) invertible. Then
for any x(0) sufficiently close to x* the Newton step is well defined for all
k and converges to x* with an order of convergence at least 2 (quadratic
convergence).
2. Let x(k) be the sequence generated by Newton's method to minimize f(x).
If F(x(k)) > 0 and g(k) = ∇f(x(k)) ≠ 0, then
d(k) = −F(x(k))^{−1} g(k) = x(k+1) − x(k) is a descent direction for f and
there exists an α0 such that for any α ∈ (0, α0)
f(x(k) + αd(k)) < f(x(k)).
Exercises 14
Quasi-Newton methods
Quasi-Newton methods (QNM) are motivated by the numerical problems
associated with the direction d(k) = −F(x(k))^{−1} · g(k) and the associated
Hessian inversion. We look for an approximation Hk of the Hessian inverse
and a step αk leading to the iterations: x(k+1) = x(k) − αk Hk g(k) for
k = 0, 1, 2, ... until termination.
The definition of Hk is simple in the case of (QP) problems, in which the
Hessian is independent of the iterations, F(x) = Q for any x(k). Then
g(k+1) − g(k) = Q(x(k+1) − x(k)), or ∆g(k) = Q∆x(k) and

$Q^{-1} \Delta g^{(i)} = \Delta x^{(i)}, \quad 0 \le i \le k.$

We may then require in the general case that H_{k+1} ∆g(i) = ∆x(i) for all
0 ≤ i ≤ k.
Rank 1 algorithm

1. k = 0: select x(0) and a real symmetric p.d. matrix H0.
2. If g(k) = 0 stop, else d(k) = −Hk · g(k).
3. Compute αk = argmin_{α≥0} f(x(k) + αd(k)), x(k+1) = x(k) + αk d(k)
4. Derive ∆x(k) = αk d(k), ∆g(k) = g(k+1) − g(k),
$H_{k+1} = H_k + \frac{(\Delta x^{(k)} - H_k \Delta g^{(k)}) (\Delta x^{(k)} - H_k \Delta g^{(k)})^T}{\Delta g^{(k)T} (\Delta x^{(k)} - H_k \Delta g^{(k)})}$
5. k = k + 1, go to 2.
The algorithm is based on satisfying at every iteration the system
H_{k+1} ∆g(k) = ∆x(k). It turns out that, in the QP case, this is sufficient to have
H_{k+1} ∆g(i) = ∆x(i) for i = 0, 1, ..., k.
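The rank 1 algorithm can be sketched on a QP, where the line search in step 3 has the closed form α = −g^T d / (d^T Q d). The function name `rank_one_qp`, the safeguard threshold `1e-12` (used to skip the update when its denominator vanishes) and the test data are assumptions of this illustration, not part of the algorithm as stated.

```python
import numpy as np

def rank_one_qp(Q, b, x0, H0, tol=1e-10, max_iter=50):
    """Rank-one quasi-Newton method on f(x) = 0.5 x^T Q x - b^T x."""
    x = np.asarray(x0, dtype=float)
    H = np.asarray(H0, dtype=float)
    g = Q @ x - b
    for k in range(max_iter):
        if np.linalg.norm(g) < tol:           # step 2
            break
        d = -H @ g
        alpha = -(g @ d) / (d @ Q @ d)        # step 3, exact for a quadratic
        dx = alpha * d
        x = x + dx
        g_new = Q @ x - b
        dg = g_new - g                        # step 4
        u = dx - H @ dg
        denom = dg @ u
        if abs(denom) > 1e-12:                # safeguard: skip ill-defined update
            H = H + np.outer(u, u) / denom
        g = g_new
    return x, H

# Hypothetical data: f(x) = x1^2 + 0.5 x2^2, i.e. Q = diag(2, 1), b = 0.
Q = np.diag([2.0, 1.0])
b = np.zeros(2)
x_min, H_final = rank_one_qp(Q, b, x0=[1.0, 2.0], H0=np.eye(2))
```

On this example the method reaches x* = 0 in two steps and the final H coincides with Q^{-1}. In general the rank 1 update does not guarantee that H stays p.d., which motivates the updates that follow.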
DFP Hessian update
The DFP algorithm is the first one extending the rank 1 methods, in the early
60's.
1. k = 0: select x(0) and a real symmetric p.d. matrix H0.
2. If g(k) = 0 stop, else d(k) = −Hk · g(k).
3. Compute αk = argmin_{α≥0} f(x(k) + αd(k)), x(k+1) = x(k) + αk d(k)
4. Derive ∆x(k) = αk d(k), ∆g(k) = g(k+1) − g(k),
$H_{k+1} = H_k + \frac{\Delta x^{(k)} \Delta x^{(k)T}}{\Delta x^{(k)T} \Delta g^{(k)}} - \frac{[H_k \Delta g^{(k)}][H_k \Delta g^{(k)}]^T}{\Delta g^{(k)T} H_k \Delta g^{(k)}}$
5. k = k + 1, go to 2.
The DFP algorithm is indeed a QNM: when applied to a QP problem we have
H_{k+1} ∆g(i) = ∆x(i) for all i = 0, 1, ..., k
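The DFP iteration can be sketched on a QP, again using the closed-form exact line search α = −g^T d / (d^T Q d), which is valid only for quadratics. The function name `dfp_qp` and the data `Q`, `b`, `H0 = I` are assumptions made for this illustration.

```python
import numpy as np

def dfp_qp(Q, b, x0, H0, tol=1e-10, max_iter=50):
    """DFP quasi-Newton method on f(x) = 0.5 x^T Q x - b^T x."""
    x = np.asarray(x0, dtype=float)
    H = np.asarray(H0, dtype=float)
    g = Q @ x - b
    for k in range(max_iter):
        if np.linalg.norm(g) < tol:           # step 2
            break
        d = -H @ g
        alpha = -(g @ d) / (d @ Q @ d)        # step 3, exact for a quadratic
        dx = alpha * d
        x = x + dx
        g_new = Q @ x - b
        dg = g_new - g                        # step 4
        Hdg = H @ dg
        H = H + np.outer(dx, dx) / (dx @ dg) - np.outer(Hdg, Hdg) / (dg @ Hdg)
        g = g_new
    return x, H

# Hypothetical data with Q symmetric p.d.
Q = np.array([[4.0, 2.0],
              [2.0, 2.0]])
b = np.array([-1.0, 1.0])
x_min, H_final = dfp_qp(Q, b, x0=[0.0, 0.0], H0=np.eye(2))
```

For these data the method terminates after n = 2 steps at x* = Q^{-1} b = (−1, 1.5), and the final H equals Q^{-1}.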
BFGS update
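From the DFP update we have $H_{k+1} = H_k + \frac{\Delta x^{(k)} \Delta x^{(k)T}}{\Delta x^{(k)T} \Delta g^{(k)}} - \frac{[H_k \Delta g^{(k)}][H_k \Delta g^{(k)}]^T}{\Delta g^{(k)T} H_k \Delta g^{(k)}}$.
The BFGS update relies on the direct approximation of the Hessian through the matrix
Bk at the k-th iteration, with:
$B_{k+1} = B_k + \frac{\Delta g^{(k)} \Delta g^{(k)T}}{\Delta g^{(k)T} \Delta x^{(k)}} - \frac{[B_k \Delta x^{(k)}][B_k \Delta x^{(k)}]^T}{\Delta x^{(k)T} B_k \Delta x^{(k)}}$

In the algorithm the descent direction is then $d^{(k)} = -B_k^{-1} g^{(k)}$ and the Hessian
inverse update with $H_k = B_k^{-1}$ becomes:
$H_{k+1}^{BFGS} = H_k + \left(1 + \frac{\Delta g^{(k)T} H_k \Delta g^{(k)}}{\Delta g^{(k)T} \Delta x^{(k)}}\right) \frac{\Delta x^{(k)} \Delta x^{(k)T}}{\Delta x^{(k)T} \Delta g^{(k)}} - \frac{H_k \Delta g^{(k)} \Delta x^{(k)T} + (H_k \Delta g^{(k)} \Delta x^{(k)T})^T}{\Delta g^{(k)T} \Delta x^{(k)}}$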
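The two BFGS formulas can be checked against each other numerically. In the sketch below the function names `bfgs_update_B` and `bfgs_update_H` and the vectors `dx`, `dg` are hypothetical; the inverse update is written with a minus sign in front of the last (symmetric) term, which is the form that satisfies the quasi-Newton condition H_{k+1} ∆g(k) = ∆x(k).

```python
import numpy as np

def bfgs_update_B(B, dx, dg):
    """BFGS update of the Hessian approximation B_k."""
    Bdx = B @ dx
    return B + np.outer(dg, dg) / (dg @ dx) - np.outer(Bdx, Bdx) / (dx @ Bdx)

def bfgs_update_H(H, dx, dg):
    """BFGS update of the inverse Hessian approximation H_k = B_k^{-1}."""
    s = dg @ dx
    Hdg = H @ dg
    M = np.outer(Hdg, dx)
    return H + (1.0 + (dg @ Hdg) / s) * np.outer(dx, dx) / s - (M + M.T) / s

# Hypothetical data with dg^T dx > 0, starting from B_0 = H_0 = I.
dx = np.array([1.0, 0.5])
dg = np.array([2.0, 0.3])
B1 = bfgs_update_B(np.eye(2), dx, dg)
H1 = bfgs_update_H(np.eye(2), dx, dg)
```

The updated matrices should satisfy B1 ∆x = ∆g, H1 ∆g = ∆x and H1 = B1^{-1}.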

Exercises 15
Least square approximation
Let A ∈ R^{m,n}, b ∈ R^m, m ≥ n, rk(A) = n, b ∉ R(A), thus Ax = b is
inconsistent. We address the problem: which x_o minimizes
||Ax − b||², so that for any other x ∈ R^n the squared
difference in norm would be greater? Such x_o is referred to as the least
square solution (l.s.s.) to Ax = b.
Theorem: The unique vector x_o that minimizes ||Ax − b||² is the
solution to the system A^T · A · x = A^T · b, thus x_o = (A^T · A)^{−1} · A^T · b.
Proof: assume x_o = (A^T · A)^{−1} · A^T · b. Then
||Ax − b||² = ||A(x − x_o) + (Ax_o − b)||² =
[A(x − x_o) + (Ax_o − b)]^T · [A(x − x_o) + (Ax_o − b)] =
||A(x − x_o)||² + ||Ax_o − b||² + 2(A(x − x_o))^T · (Ax_o − b)
By construction the last product is 0, since A^T(Ax_o − b) = A^T A x_o − A^T b = 0, so we have
||Ax − b||² = ||A(x − x_o)||² + ||Ax_o − b||² with the first term on the
RHS strictly positive for x ≠ x_o when rk(A) = n, which completes the
proof.
Least square approximation
Alternatively, we can derive the minimum squared error by explicitly
applying the optimality conditions from the gradient. We have
f(x) = ||Ax − b||² = (Ax − b)^T · (Ax − b), from which we derive:
∇f(x) = 2A^T Ax − 2A^T b = 0 with solution x_o = (A^T A)^{−1} A^T b.
Rationale: consider A ∈ R^{m,n}, m ≥ n. The column vectors of A span
R(A), an n-dim subspace of R^m. For Ax = b to be solved we need
b ∈ R(A), which is equivalent to having Ax = b consistent (it admits a
solution). Which vector h ∈ R(A) is the closest to b? It is the orthogonal
projection of b onto the subspace R(A). Then we can write:
h ∈ R(A) | (h − b) ⊥ R(A), h = Ax_o = A(A^T · A)^{−1} A^T b
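A small numerical illustration of the normal equations and of the projection interpretation follows. The data `A` and `b` are hypothetical, and the normal equations are solved with `numpy.linalg.solve` rather than by forming the inverse explicitly.

```python
import numpy as np

# Hypothetical over-determined system: m = 3 equations, n = 2 unknowns.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 0.0, 2.0])

# Least square solution from the normal equations A^T A x = A^T b.
x_o = np.linalg.solve(A.T @ A, A.T @ b)

# h = A x_o is the orthogonal projection of b onto R(A):
# the residual b - h is orthogonal to every column of A.
h = A @ x_o
residual = b - h

# The matrix P = A (A^T A)^{-1} A^T is an orthogonal projector onto R(A).
P = A @ np.linalg.solve(A.T @ A, A.T)
```

For these data x_o = (0.5, 0.5) and the residual is (0.5, −1, 0.5), which is orthogonal to both columns of A.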

Exercises 16
Part III: Linear programming
Consider the problem of minimizing a linear cost function subject to linear
constraints with c ∈ R^n, A ∈ R^{m,n}, b ∈ R^m and nonnegative decision vector x ∈ R^n, x ≥ 0:

min_{x≥0} c^T x s.t. A · x = b

The above is a linear program (LP) due to the linearity of both the objective
function and the constraints. Any x satisfying the constraints is said to be
feasible: among the feasible points we seek the one that leads to a minimal cost, the
optimal solution of the problem.
The following are stylized school applications of LP:
• manufacturing problem: maximize overall production time under technical and
operational constraints, with x to represent the time allocated to each
production line.
• transportation problem: minimize transportation costs to dispatch goods and
materials across several origin-destination pairs.
• newsvendor problem: decide how many journals to buy daily to maximize
profit under demand uncertainty.
Standard form LP
In the presence of inequality constraints, an LP is taken back to standard form
(equality constraints and min) by introducing surplus or slack variables. With no
changes in the objective, we have:
• surplus vector: Ax ≥ b becomes Ax − I_m y = b, y ≥ 0
• slack vector: Ax ≤ b becomes Ax + I_m y = b, y ≥ 0
Basic solution: let in the general case A ∈ R^{m,n} with m < n and rk(A) = m.
We wish to define the sub-matrix B ∈ R^{m,m} of A whose columns are l.i., so
that (possibly after reordering the columns) A = [B|D]. We have x_B = B^{−1} b and x = [x_B^T, 0^T]^T is the basic
solution to the problem. B is referred to as the basis of the LP.
If some of the basic variables in x_B are 0, we talk of a degenerate
basic solution.
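A basic solution can be computed directly from its definition. In the sketch below the helper name `basic_solution`, the data `A`, `b` (two slack columns already included) and the enumeration of all column pairs are assumptions made for illustration.

```python
import numpy as np
from itertools import combinations

def basic_solution(A, b, basis):
    """Basic solution for the given list of basic column indices."""
    B = A[:, basis]                     # m x m sub-matrix, assumed non-singular
    x = np.zeros(A.shape[1])
    x[basis] = np.linalg.solve(B, b)    # x_B = B^{-1} b, nonbasic variables are 0
    return x

# Hypothetical standard-form data: m = 2 constraints, n = 4 variables.
A = np.array([[1.0, 1.0, 1.0, 0.0],
              [1.0, 3.0, 0.0, 1.0]])
b = np.array([4.0, 6.0])

x_basic = basic_solution(A, b, [0, 1])

# Enumerate all bases and keep the basic feasible solutions (x >= 0).
bfs = []
for cols in combinations(range(4), 2):
    cols = list(cols)
    if abs(np.linalg.det(A[:, cols])) > 1e-12:
        x = basic_solution(A, b, cols)
        if np.all(x >= -1e-12):
            bfs.append(x)
```

With the basis formed by the first two columns the basic solution is x = (3, 1, 0, 0), which is feasible and nondegenerate.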
Fundamental theorem of LP
Consider a linear program in standard form. The fundamental theorem of LP
establishes that:
• If there exists a feasible solution, then there exists a basic feasible
solution.
• If there exists an optimal feasible solution, then there exists a basic
optimal feasible solution.
The following theorem relates basic solutions to extreme points of the
feasibility set:
Th: Let Ω be the convex set consisting of all feasible solutions, that is all x
s.t. Ax = b, x ≥ 0 with A ∈ R(m,n) , m < n. Then x is an extreme point of Ω
iff x is a basic feasible solution to Ax = b, x ≥ 0.
The search of an optimal solution can then be limited to extreme points.
Part III.a: Simplex
The introduction of slack variables in an LP with Ax ≤ b leads to a simple
updating rule based on elementary matrices E1, E2, ..., Et and the definition of
the augmented coefficient matrix A = [B, D], where basic and nonbasic
variables are considered. The following result sets the grounds for the simplex
method.
Th.: Let A ∈ R^{n,n}. Then A is non-singular iff there exist elementary matrices
E1, E2, ..., Et such that Et · ... · E2 · E1 · A = I, in which case
Et · ... · E2 · E1 = A^{−1}.
Several implications, with E := Et · ... · E2 · E1:
1. x* = A^{−1} b = (Et · ... · E2 · E1) b = E b = b̃
2. [A, b] → [I, D, E b] where I = E · B ∈ R^{m,m}, D ∈ R^{m,n−m} now denotes the transformed nonbasic block, E b = b̃
Simplex
Let x solve Ax = b. Then x = [xB , xD ] and xB + DxD = b̃ → xB = b̃ − DxD ,
so that $x^* = [\tilde{b}^T, 0^T]^T$ is a solution of the system. We consider a system in
canonical form based on the augmented matrix.
Consider the following notation: $[I, D, \tilde{b}] \to [I_m, Y, y_0]$ where $Y \in R^{m,n-m}$
and $y_0 \in R^m$. A basic feasible solution (b.f.s) will be associated with the
identity matrix.
The key idea of the simplex method is to recursively update the basis columns
in the coefficient matrix so as to reach the minimum of the objective function.
The above updating requires the so-called pivot selection: one coefficient in
the given augmented matrix that will allow the move from a basis to a
different one.
Basis updating in simplex method
Consider the augmented system [I, Y ] x = y0 associated with a problem in
standard form, where we wish to minimize z = cT x:

\[
\begin{aligned}
x_1 + y_{1,m+1} x_{m+1} + \dots + y_{1,n} x_n &= y_{10} \\
x_2 + y_{2,m+1} x_{m+1} + \dots + y_{2,n} x_n &= y_{20} \\
&\;\;\vdots \\
x_m + y_{m,m+1} x_{m+1} + \dots + y_{m,n} x_n &= y_{m0}
\end{aligned}
\tag{1}
\]

This system has basic solution: $x = (y_{10}, \dots, y_{m0}, 0, \dots, 0)^T = (x_B^T, 0^T)^T$.
Then b = y10 a1 + y20 a2 + ... + ym0 am in the current basis and the cost
function is z = cBT xB = c1 y10 + ... + cm ym0 . We also have that any nonbasic
vector can be written as aq = y1q a1 + y2q a2 + ... + ymq am .
To evaluate the impact on the objective of a different basis, assume the qth
vector to enter then zq = c1 y1q + ... + cm ymq . For any i we denote with
ri = ci − zi the reduced cost coefficients, and through ri we can evaluate the
induced change of the cost function.
The algorithm
Theorem: A b.f.s is optimal iff the corresponding reduced cost coefficients are all
nonnegative.
The simplex algorithm:
1. Form a canonical augmented matrix corresponding to the initial b.f.s.,
2. compute rj associated to nonbasic var’s
3. If rj ≥ 0 for all j: STOP, the b.f.s. is optimal. Else
4. select q with least rq
5. If no yiq > 0 STOP: the problem is unbounded. Else $p = \arg\min_i \{ y_{i0} / y_{iq} \mid y_{iq} > 0 \}$
6. Update the augmented matrix with pivot the (p, q) element. Go to 2.
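The six steps can be sketched as follows. This is a minimal illustration on a hypothetical problem, assuming the augmented matrix is already in canonical form with respect to the given basis; it uses exact fractions and has no anti-cycling rule, so it is not a robust implementation. The name `simplex` and the data are introduced here for illustration only.

```python
from fractions import Fraction


def simplex(T, c, basis):
    """T: m x (n+1) canonical augmented matrix [.. | y0]; basis[i] is the basic column of row i."""
    T = [[Fraction(v) for v in row] for row in T]
    basis = list(basis)
    m, n = len(T), len(c)
    while True:
        # Step 2: reduced costs r_j = c_j - z_j, with z_j = sum_i c_basis[i] * y_ij
        r = [c[j] - sum(c[basis[i]] * T[i][j] for i in range(m)) for j in range(n)]
        # Step 3: optimality test
        if all(rj >= 0 for rj in r):
            x = [Fraction(0)] * n
            for i, j in enumerate(basis):
                x[j] = T[i][n]
            return x, sum(cj * xj for cj, xj in zip(c, x))
        # Step 4: entering column q with the least reduced cost
        q = min(range(n), key=lambda j: r[j])
        # Step 5: ratio test over the rows with y_iq > 0
        rows = [i for i in range(m) if T[i][q] > 0]
        if not rows:
            raise ValueError("problem is unbounded")
        p = min(rows, key=lambda i: T[i][n] / T[i][q])
        # Step 6: pivot on the (p, q) element
        T[p] = [v / T[p][q] for v in T[p]]
        for i in range(m):
            if i != p and T[i][q] != 0:
                f = T[i][q]
                T[i] = [a - f * t for a, t in zip(T[i], T[p])]
        basis[p] = q


# Hypothetical LP: min -2 x1 - 5 x2 s.t. x1 <= 4, x2 <= 6, x1 + x2 <= 8, x >= 0,
# written in standard form with slack variables x3, x4, x5 as the initial basis.
T0 = [[1, 0, 1, 0, 0, 4],
      [0, 1, 0, 1, 0, 6],
      [1, 1, 0, 0, 1, 8]]
x_opt, z_opt = simplex(T0, c=[-2, -5, 0, 0, 0], basis=[2, 3, 4])
print(x_opt, z_opt)
```

On this data the method visits three bases (the slack basis, then x2 enters, then x1 enters) before all reduced costs become non-negative.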

Exercises 17
Simplex in matrix form – the Tableau
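The basis updating and final linear cost function evaluation can be effectively
interpreted as follows. Consider the block matrices:

\[
\begin{bmatrix} A & b \\ c^T & 0 \end{bmatrix}
=
\begin{bmatrix} B & D & b \\ c_B^T & c_D^T & 0 \end{bmatrix}
\tag{2}
\]

Through elementary row operations we derive the Tableau in final form (RHS
of (4)):
[I]
\[
\begin{bmatrix} B^{-1} & 0 \\ 0^T & 1 \end{bmatrix}
\begin{bmatrix} B & D & b \\ c_B^T & c_D^T & 0 \end{bmatrix}
=
\begin{bmatrix} I_m & B^{-1} D & B^{-1} b \\ c_B^T & c_D^T & 0 \end{bmatrix}
\tag{3}
\]
[II]
\[
\begin{bmatrix} I_m & 0 \\ -c_B^T & 1 \end{bmatrix}
\begin{bmatrix} I_m & B^{-1} D & B^{-1} b \\ c_B^T & c_D^T & 0 \end{bmatrix}
=
\begin{bmatrix} I_m & B^{-1} D & B^{-1} b \\ 0^T & c_D^T - c_B^T B^{-1} D & -c_B^T B^{-1} b \end{bmatrix}
\tag{4}
\]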
Revised simplex
The last set of matrices in (4) shows the basic solution of the problem
$x_B = B^{-1} b$ and the reduced cost vector $r_D^T = c_D^T - c_B^T B^{-1} D$. Let
$\lambda^T = c_B^T B^{-1}$, then we can write $r_D^T = c_D^T - \lambda^T D$: indeed the vector λ is the
dual vector of the DLP, which we will consider shortly.
The revised simplex develops from observing that all we need to perform
simplex iterations are the current basis with associated coefficient matrix and
the reduced cost vector. The updating of the Tableau and the simplex
iterations are then based on $[B^{-1}, y_0, y_q]$. We have:
• If all reduced costs are non-negative, the optimal solution has been
reached.
• If a negative reduced cost is found so that no yiq > 0 then the problem
is unbounded.
• O.w. iterate until all reduced costs are non negative.
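A minimal sketch of the quantities used by one revised simplex iteration, on hypothetical data: λ is obtained from $B^T \lambda = c_B$, the reduced costs from $r_j = c_j - \lambda^T a_j$, and only the entering column $y_q = B^{-1} a_q$ is computed. Solving linear systems with a small helper `solve` instead of storing $B^{-1}$ is a choice made for this illustration.

```python
from fractions import Fraction


def solve(M, rhs):
    """Solve M z = rhs by Gauss-Jordan elimination (M assumed non-singular)."""
    n = len(M)
    T = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = next(r for r in range(col, n) if T[r][col] != 0)
        T[col], T[piv] = T[piv], T[col]
        T[col] = [v / T[col][col] for v in T[col]]
        for r in range(n):
            if r != col and T[r][col] != 0:
                f = T[r][col]
                T[r] = [a - f * p for a, p in zip(T[r], T[col])]
    return [T[r][n] for r in range(n)]


# Hypothetical standard-form data and current basis (columns x3, x2, x5)
A = [[1, 0, 1, 0, 0], [0, 1, 0, 1, 0], [1, 1, 0, 0, 1]]
b = [4, 6, 8]
c = [-2, -5, 0, 0, 0]
basis = [2, 1, 4]
m = len(A)


def column(j):
    return [A[i][j] for i in range(m)]


B = [[A[i][j] for j in basis] for i in range(m)]
Bt = [[B[i][k] for i in range(m)] for k in range(m)]

lam = solve(Bt, [c[j] for j in basis])      # lambda^T = c_B^T B^{-1}
y0 = solve(B, b)                            # current basic solution x_B
nonbasic = [j for j in range(len(c)) if j not in basis]
r = {j: c[j] - sum(l * a for l, a in zip(lam, column(j))) for j in nonbasic}
q = min(r, key=r.get)                       # entering column (least reduced cost)
yq = solve(B, column(q))                    # only this column of Y is needed
print(lam, y0, r, q, yq)
```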
Part III.b: Duality
We introduce duality for LPs and then this will represent a key topic also for
nonlinear programs. Consider the linear program
minx≥0 cT x s.t. A · x ≥ b,
where c ∈ Rn , A ∈ Rm,n , b ∈ Rm and x ∈ Rn . We refer to this formulation
as LP in primal form. The corresponding dual problem is defined by
maxλ≥0 λT b s.t. λT · A ≤ cT
in which λ ∈ Rm is referred to as the dual vector. The two problems define
the so-called symmetric form of duality.
We define the asymmetric form of duality by deriving the dual problem
associated with a primal in standard form:

Primal                      Dual
minimize   cT x             maximize   λT b
subject to Ax = b           subject to λT A ≤ cT           (5)
           x ≥ 0                       λ unrestricted
Duality
The possibility of an unrestricted (free) λ arises from the representation of the
feasibility region of the primal in standard form as Ax ≥ b and −Ax ≥ −b:
then we introduce non negative dual vectors u and v for b and −b and define
λ = u − v.
The dual as an LP problem in the symmetric form reads:

maxλ λ1 b1 + λ2 b2 + ... + λm bm
s.t. λ1 a11 + λ2 a21 + ... + λm am1 ≤ c1
λ1 a12 + λ2 a22 + ... + λm am2 ≤ c2
...
λ1 a1n + λ2 a2n + ... + λm amn ≤ cn

with λ ≥ 0. When in asymmetric form (the primal is in standard form) then λ
is free, that is the single elements λi are unrestricted in sign.
Primal-dual rationale
Example 1: Diet problem.
Primal: n types of food: find the most economical diet and at the same time meet
or exceed nutritional requirements, with bi amount of nutrient i, aij amount of i-th
nutrient per unit of j-th food and cj unit cost of food j = 1, .., n. xj to be
determined (units of food j in the diet).
Dual: λi is now the i-th nutrient (pills) unit price and the objective λT b represents
now the total revenue of a health food store. Then λ1 a1i + λ2 a2i + ... + λm ami
represents the cost of buying pills to create the equivalent of food i, which must not
exceed the unit price of such food, leading to the weak inequality.
Example 2: Energy generation.
Primal: m production lines employing n factors at unit cost ci , i = 1, 2, .., n under
the constraint of consuming no more than bj of energy type j, with aij representing
technical energy-intensity coefficients associated with the i-th factor and energy type
j.
Dual: λj is now the j-th energy unit price and the objective λT b represents now the
total revenue of the energy producer. Then λ1 a1i + λ2 a2i + ... + λm ami ≤ ci will be
the energy producer budget constraint not to affect unit costs ci in that industry
sector.
P-D results
We have several relevant results:
1 Weak duality: Let x and λ be feasible solutions of the primal and dual
LP problems, respectively. Then cT x ≥ λT b.
2 P-D optimality: Let x0 and λ0 be feasible solutions of the primal and
dual LP problems, respectively. If $c^T x_0 = \lambda_0^T b$ then x0 and λ0 are
optimal for their respective problems.
3 Duality theorem: If the primal has an optimal solution so does the dual
and the optimal values of their obj functions coincide.
4 Complementary slackness: The feasible solutions x and λ to PLP and
DLP are optimal if and only if: (1) (cT − λT A)x = 0 and (2)
λT (Ax − b) = 0.
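These results can be checked numerically on a small primal-dual pair in symmetric form. The data below are hypothetical and the helper names are introduced for this sketch only; the optimal pair was obtained by inspecting the vertices of the two feasible regions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    return [dot(row, x) for row in A]


def vecmat(lam, A):
    # lam^T A, one entry per column of A
    return [sum(lam[i] * A[i][j] for i in range(len(A))) for j in range(len(A[0]))]


# Hypothetical primal: min 3 x1 + 2 x2  s.t.  x1 + x2 >= 4, x1 + 3 x2 >= 6, x >= 0
A = [[1, 1], [1, 3]]
b = [4, 6]
c = [3, 2]


def primal_feasible(x):
    return all(v >= 0 for v in x) and all(l >= r for l, r in zip(matvec(A, x), b))


def dual_feasible(lam):
    return all(v >= 0 for v in lam) and all(l <= r for l, r in zip(vecmat(lam, A), c))


def complementary(x, lam):
    cs1 = dot([cj - aj for cj, aj in zip(c, vecmat(lam, A))], x)   # (c^T - lam^T A) x
    cs2 = dot(lam, [l - r for l, r in zip(matvec(A, x), b)])       # lam^T (A x - b)
    return cs1 == 0 and cs2 == 0


x_feas, lam_feas = [3, 1], [1, 0]   # feasible but not optimal
x_star, lam_star = [0, 4], [2, 0]   # optimal pair

print(dot(c, x_feas), ">=", dot(lam_feas, b))    # weak duality
print(dot(c, x_star), "==", dot(lam_star, b))    # equal objective values
print(complementary(x_feas, lam_feas), complementary(x_star, lam_star))
```

Weak duality holds for any feasible pair, while complementary slackness holds only for the optimal one.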

Exercises 18
Part III.c: Interior point methods
The rationale of Interior Point Methods (IPM) and key difference with simplex
method is that in the former the search for optimality starts in the strict
interior of the feasible region and iteration by iteration seeks a direction within
the interior to the optimum. Consider the following LP in so-called Karmarkar
(1984) canonical form (KcnclLP):

minx∈Rn {cT x| x ∈ Ω ∩ ∆}
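Where Ω = N (A) = {x ∈ Rn |Ax = 0}, ∆ = {x ∈ Rn |eT x = 1, x ≥ 0}. ∆ is
a simplex whose center is $a_0 = e/n = [1/n, \dots, 1/n]^T$. The feasible region can
be specified as
\[
\Omega \cap \Delta = \left\{ x \in R^n \;\middle|\; \begin{bmatrix} A \\ e^T \end{bmatrix} x = \begin{bmatrix} 0 \\ 1 \end{bmatrix},\; x \ge 0 \right\}
\]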

Result: Any LP problem can be expressed in KcnclLP.


K restricted problem
A problem in canonical form is called K restricted problem when the following
assumptions are satisfied:
1. The center a0 of the simplex ∆ is a feasible point.
2. The minimum of cT x over Ω ∩ ∆ is 0.
3. The matrix $\begin{bmatrix} A \\ e^T \end{bmatrix}$ of dim (m + 1, n) has rank m + 1.
4. Given a termination parameter q > 0 the problem is solved (termination
criterion) when a feasible point x satisfies $\frac{c^T x}{c^T a_0} \le 2^{-q}$.
K algorithm solves canonical problems in restricted form, formulated as
follows:
\[
\min_{y \in R^{n+1}} \{ c_0^T y \mid A_0 y = 0,\; e^T y = 1,\; y \ge 0 \}
\]
Where:
\[
A_0 = [A \;\; -b], \quad c_0 = [c^T \;\; 0]^T, \quad
y_i = \frac{x_i}{x_1 + \dots + x_n + 1},\; i = 1, \dots, n, \quad
y_{n+1} = \frac{1}{x_1 + \dots + x_n + 1}.
\]
K artificial problem: a P/D interpretation of KcnclLP
From LP duality we know that the (PLP) problem minx≥0 {cT x|Ax ≥ b} has
the same solution as the (DLP) maxλ≥0 {λT b|λT A ≤ cT }. We combine
them to define the set of conditions for optimality:
cT x − bT λ = 0
Ax ≥ b
AT λ ≤ c
x≥0 λ≥0
from which, by including slack and surplus vectors v, u and a set of vector
values (x0 , λ0 , u0 , v0 ) ≥ 0 we define the Karmarkar artificial problem
(KartLP)
minimize z
subject to cT x − bT λ + (−cT x0 + bT λ0 ) · z = 0
Ax − v + (b − Ax0 + v0 ) · z = b                    (6)
AT λ + u + (c − AT λ0 − u0 ) · z = c
x, λ, u, v, z ≥ 0
The algorithm
KartLP leads to the set of linear equations on x, λ, u, v for optimality:
cT x − bT λ = 0, Ax − v = b, AT λ + u = c, (x, λ, u, v) ≥ 0.
This problem and set of conditions are fully equivalent to the original KcnclLP
in restricted form and the conditions are used to follow a central path to
optimality.
Steps:
1. Initialize: k = 0, $x^{(0)} = a_0 = e/n$, set q
2. set $x^{(k+1)} = \Psi(x^{(k)})$
3. if $\frac{c^T x^{(k)}}{c^T x^{(0)}} \le 2^{-q}$ STOP, else
4. k = k + 1 and return to 2.
We focus on the Ψ update, which employs the descent direction within the
feasible region towards the optimum at 0 through a set of orthogonal
projections.
The update Ψ
Consider k = 1:
• $x^{(1)} = x^{(0)} + \alpha d^{(0)}$, α ∈ (0, 1).
• Let −c be the direction of max rate of decrease of the obj: select $d^{(0)} = -r \hat{c}^{(0)}$ as
the direction of the orthogonal projection of −c onto N (B0 ), with
\[
B_0 = \begin{bmatrix} A \\ e^T \end{bmatrix}, \quad
\hat{c}^{(0)} = \frac{P_0 c}{\|P_0 c\|}, \quad
r = \frac{1}{\sqrt{n(n-1)}}.
\]
• The projection P0 at the first iteration is $P_0 = I_n - B_0^T (B_0 B_0^T)^{-1} B_0$.
As k = 2, .. the projection is updated relying on $D_k = \mathrm{diag}\{x_i^{(k)}\}$, i = 1, .., n,
\[
B_k = \begin{bmatrix} A D_k \\ e^T \end{bmatrix}, \quad
P_k = I_n - B_k^T (B_k B_k^T)^{-1} B_k, \quad
\hat{c}^{(k)} = \frac{P_k D_k c}{\|P_k D_k c\|}, \quad
d^{(k)} = -r \hat{c}^{(k)}.
\]
The resulting iterate $x^{(k+1)} = x^{(k)} - \alpha r \hat{c}^{(k)}$ is derived and the termination
criterion verified, resulting in termination or a new iteration.
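The iteration can be sketched in plain Python as below. Two points are assumptions that go beyond the abbreviated formulas above: the step is taken from the center a0 in the transformed coordinates, and the result is mapped back with $D_k$ and renormalized so that the iterate stays on the simplex (the usual projective form of the update). The example problem (min x3 subject to x1 − x2 = 0 on the simplex) is hypothetical and chosen so that the K restricted assumptions hold.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(M, rhs):
    """Gauss-Jordan elimination with partial pivoting (floating point)."""
    n = len(M)
    T = [list(row) + [v] for row, v in zip(M, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(T[r][col]))
        T[col], T[piv] = T[piv], T[col]
        for r in range(n):
            if r != col:
                f = T[r][col] / T[col][col]
                T[r] = [a - f * p for a, p in zip(T[r], T[col])]
    return [T[i][n] / T[i][i] for i in range(n)]


def karmarkar(A, c, alpha=0.5, q=20, max_iter=200):
    m, n = len(A), len(c)
    r = 1.0 / math.sqrt(n * (n - 1))
    a0 = [1.0 / n] * n
    x = list(a0)
    c0 = dot(c, a0)
    for _ in range(max_iter):
        if dot(c, x) / c0 <= 2.0 ** (-q):      # termination criterion
            break
        # B_k = [A D_k; e^T] with D_k = diag(x)
        B = [[A[i][j] * x[j] for j in range(n)] for i in range(m)] + [[1.0] * n]
        Dc = [x[j] * c[j] for j in range(n)]
        # P_k D_k c = D_k c - B_k^T (B_k B_k^T)^{-1} B_k D_k c
        w = solve([[dot(bi, bj) for bj in B] for bi in B], [dot(bi, Dc) for bi in B])
        p = [Dc[j] - sum(w[i] * B[i][j] for i in range(m + 1)) for j in range(n)]
        norm = math.sqrt(dot(p, p))
        if norm == 0.0:
            break
        # step from the center in transformed coordinates (assumed form)
        x_bar = [a0[j] - alpha * r * p[j] / norm for j in range(n)]
        # map back with D_k and renormalize onto the simplex
        y = [x[j] * x_bar[j] for j in range(n)]
        s = sum(y)
        x = [v / s for v in y]
    return x


x_opt = karmarkar(A=[[1.0, -1.0, 0.0]], c=[0.0, 0.0, 1.0])
print(x_opt)
```

On this example the iterates stay strictly inside the simplex and approach the optimal point (1/2, 1/2, 0), where the objective is 0.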

Exercises 19
Part IV: Constrained nonlinear programming
We consider minimization (maximization) problems specified as follows:

\[
\min_x f(x) \quad \text{s.t.} \quad h(x) = 0, \quad g(x) \le 0
\]

where: x ∈ Rn , f : Rn → R, h : Rn → Rm , g : Rn → Rp .
This will be further qualified, but in the absence of additional details f is
assumed continuously differentiable once, f ∈ C 1 , or where needed twice, f ∈ C 2 .
The constraints are also assumed continuous and differentiable and
h = (h1 (x) h2 (x) ...hm (x))T defines m equality constraints, each one
continuous and differentiable. While g = (g1 (x) g2 (x) ...gp (x))T defines p
inequality constraints, each one continuous and differentiable.
The Jacobian matrix associated with h is defined as an (m, n) matrix with
rows the gradients ∇hi (x) transpose, for i = 1, 2, .., m.
We focus initially on nonlinear problems with equality constraints.
Part IV.a: NLN optimization with equality constraints
Prior to the explanation of the optimization methods to be adopted in this
context, we need several definitions to frame properly the topic. We define:
1. A point x∗ for which hi (x∗ ) = 0 for all i = 1, .., m is said to be a regular
point of the constraints if the gradient vectors
∇h1 (x∗ ), ∇h2 (x∗ ), ..., ∇hm (x∗ ) are l.i.
2. A surface in Rn is the set S = {x ∈ Rn |h1 (x) = 0, ..., hm (x) = 0}.
3. A curve C on S is the set of points
{x(t) ∈ S, t ∈ (a, b)}, with x : (a, b) → S a continuous function. This
curve C is differentiable if for all t there exists
$x'(t) = \frac{dx(t)}{dt} = (x_1'(t), x_2'(t), \dots, x_n'(t))^T$. It is twice differentiable if there
exists $x''(t) = \frac{d^2 x(t)}{dt^2} = (x_1''(t), x_2''(t), \dots, x_n''(t))^T$.
Tangent and Normal spaces
We also need the following definitions:
1. The tangent space at x∗ on S = {x ∈ Rn |h(x) = 0} is the set
T (x∗ ) = {y|Dh(x∗ ) · y = 0}, thus this is the Null space of the
differential N (Dh(x∗ )).
2. We distinguish T (x∗ ) from the tangent plane at x∗ , namely
TP(x∗ ) = T (x∗ ) + x∗ .
3. The normal space at x∗ on the surface S is the set
N(x∗ ) = {x ∈ Rn |x = Dh(x∗ )T · z, z ∈ Rm }, or range R(Dh(x∗ )T ). It is
the subspace of Rn spanned by the gradients:
$x = \sum_j z_j \nabla h_j(x^*)$, zj ∈ R, j = 1, .., m.
4. As before we can define the normal plane
NP(x∗ ) = N(x∗ ) + x∗ = {x + x∗ , x ∈ N(x∗ )}
Theorem: Let x∗ ∈ S be a regular point and T (x∗ ) the tangent space at x∗ .
Then y ∈ T (x∗ ) iff there exists a differentiable curve in S passing through
x∗ with derivative y at x∗ .
Lagrange method
We first analyse this method in R3 with f : R2 → R and h : R2 → R is the
constraint function.
Lagrange theorem (n = 2, m = 1): Let x∗ be a minimizer of f (x1 , x2 ) subject
to h(x1 , x2 ) = 0. Then ∇f (x∗ ) and ∇h(x∗ ) are parallel: if ∇h(x∗ ) ≠ 0 then
there exists a λ∗ ∈ R s.t. ∇f (x∗ ) + λ∗ ∇h(x∗ ) = 0.
• The theorem establishes a necessary but not sufficient condition for a
constrained optimum.
• The Lagrangean L(x1 , x2 , λ) transforms the constrained problem into an
unconstrained problem and the search of the optimum x∗ relies on the
solution of the system $\partial L/\partial x_1 = 0$, $\partial L/\partial x_2 = 0$, $\partial L/\partial \lambda = 0$.
• The so-called Lagrange conditions include ∇f (x∗ ) + λ∗ ∇h(x∗ ) = 0 and
h(x∗ ) = 0.
• The key intuition is that at the optimum x∗ the gradients of f and h
must be linearly dependent (they lie along the same direction).
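A short worked example may help fix ideas. The problem below is an illustration chosen here, not one of the course exercises: minimize $f = x_1^2 + x_2^2$ subject to $h = x_1 + x_2 - 1 = 0$.

```latex
\begin{aligned}
L(x_1, x_2, \lambda) &= x_1^2 + x_2^2 + \lambda (x_1 + x_2 - 1) \\
\frac{\partial L}{\partial x_1} &= 2 x_1 + \lambda = 0, \qquad
\frac{\partial L}{\partial x_2} = 2 x_2 + \lambda = 0, \qquad
\frac{\partial L}{\partial \lambda} = x_1 + x_2 - 1 = 0 \\
\Rightarrow \quad x_1^* &= x_2^* = \tfrac{1}{2}, \qquad \lambda^* = -1 \\
\nabla f(x^*) + \lambda^* \nabla h(x^*) &=
\begin{bmatrix} 1 \\ 1 \end{bmatrix} - \begin{bmatrix} 1 \\ 1 \end{bmatrix} = 0
\end{aligned}
```

At $x^*$ the two gradients are parallel, both along the direction $(1, 1)^T$, as the theorem requires.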
Lagrange method in Rn
We extend Lagrange theorem to Rn :
Lagrange theorem: Let x∗ be a minimum (or a maximum) of f : Rn → R
subject to h(x) = 0, h : Rn → Rm , with x∗ a regular point. Then there exists
a λ∗ ∈ Rm for which $Df(x^*) + \lambda^{*T} Dh(x^*) = 0^T$.
• We have from the theorem that a necessary condition for an extremum is
∇f (x∗ ) = −Dh(x∗ )T · λ∗ . So ∇f (x∗ ) ∈ N(x∗ ) = T (x∗ )⊥
• The condition for a local minimizer (max) is DL(x∗ , λ∗ ) = 0T where
L(x, λ) = f (x) + λT h(x) is the Lagrangean, a mapping from
Rn × Rm → R.
• Notice that Df (x∗ ) = ∇f (x∗ )T and Dh(x∗ ) is an (m, n) matrix (the
Jacobian).
• Denote with Dx L(x, λ) and Dλ L(x, λ) the derivatives of the Lagrangean
with respect to x and λ respectively. Then Lagrange theorem implies
solving the system of n + m equations Dx L(x, λ) = 0T and
Dλ L(x, λ) = 0T .
Second order conditions
Consider again the Lagrangean in case of m constraint functions:
L(x, λ) = f (x) + λ1 h1 (x) + λ2 h2 (x) + ... + λm hm (x). We denote with
L(x, λ) the associated Hessian matrix.
We define the Lagrangean Hessian matrix as:

L(x, λ) = F(x) + λ1 H1 (x) + λ2 H2 (x) + ... + λm Hm (x)

where F, Hk are the Hessian matrices evaluated at x of the objective function
and of each constraint function. We use the compact notation
$L(x, \lambda) = F(x) + \sum_k \lambda_k H_k(x) = F(x) + [\lambda H(x)]$. We have:
Theorem (SONC): Let x∗ be a local minimizer of f : Rn → R with h(x) = 0,
h : Rn → Rm , m ≤ n, f , h ∈ C 2 and x∗ regular. Then there exists a λ∗ ∈ Rm
such that:
1. $Df(x^*) + \lambda^{*T} Dh(x^*) = 0^T$ and
2. For all y ∈ T (x∗ ), yT L(x∗ , λ∗ )y ≥ 0.
Second order sufficient conditions
Theorem (SOSC): Let f , h ∈ C 2 and suppose there exist x∗ ∈ Rn and λ∗ ∈ Rm
such that:
1. $Df(x^*) + \lambda^{*T} Dh(x^*) = 0^T$ and
2. For all y ∈ T (x∗ ), y ≠ 0, yT L(x∗ , λ∗ )y > 0
Then x∗ is a strict local minimizer of f subject to h(x) = 0.
Remarks:
• Df (x∗ ) is (1,n) namely the gradient of f transpose. Dh(x∗ ) is the
Jacobian matrix which is (m, n) by definition and the Lagrangean vector
λ is (m, 1).
• The tangency space T (x∗ ) at x∗ includes all vectors y such that
Dh(x∗ ) · y = 0.
• It is interesting to compare the SOSC for the constrained case with those
of the unconstrained case. In this latter case we have: (1) ∇f (x∗ ) = 0
and (2) the Hessian F (x∗ ) > 0, strictly positive definite.
SOSC with gradient evaluation
We present without proof a set of sufficient conditions for optimality, which
are often more practical to verify. Dh(x) is the Jacobian matrix.
Theorem (SOSC): Let L(x, λ) = f (x) + λT h(x), with x∗ regular and
f : Rn → R, h : Rn → Rm , m < n. Assume first order conditions satisfied and
L(x, λ) ∈ C 2 with Hessian matrix Lx of order n. Then if the augmented
matrix
\[
\tilde{L} = \begin{bmatrix} 0_{m,m} & Dh(x^*) \\ Dh(x^*)^T & L_x(x^*, \lambda^*) \end{bmatrix}
\]
has the last n − m N-W principal minors of order i = 2m + 1, ..., n + m with
sign $(-1)^m$, then x∗ is a local minimum. If instead they have alternating sign,
starting with $(-1)^{m+1}$, then x∗ is a local maximum.
Remarks:
• It is not sufficient to establish the p.d. or p.s.d. for a minimum,
respectively n.d. or n.s.d. for a maximum by looking at the Hessian
Lx (x ∗ , λ∗ ) only.
• This result requires the evaluation of the determinants of the N-W
minors of the augmented Hessian and the assessment of their sign. We
evaluate the same minors starting with i = 2m + 1.
SOSC with gradient evaluation
Remarks ctd
• Let m = 1 and n = 2: thus we have an augmented matrix (3,3) and we
start evaluating the minor of order 2 · 1 + 1 = 3, the matrix itself. Then
if the sign of the determinant is (−1)1 (m = 1, negative) we have a
local minimum, if (−1)2 (positive) we have a local max.
• Let m = 1 and n = 3: now again just one constraint but 3 unknowns
and the augmented Hessian will be (4, 4): we need to evaluate the N-W
minors of order 3 (2m + 1) and 4 (n + m): if both negative (sign (−1)1 )
then we have a local min, if instead they alternate with first sign positive
((−1)2 ) and second negative, we have a local max. In the opposite case
we can’t say and need another criterion.
• Let m = 2 and n = 3: augmented matrix of order 5 and first N-W minor
to evaluate of order i = 2m + 1 = 5 the entire matrix. In this case if
sign positive ((−1)m=2 ) we have a local min and if negative a local max.
If 0, other criterion.
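The sign test in the remarks can be sketched as follows. The helper names (`det`, `classify`) and the two test problems are assumptions made for this illustration: the first is min x1² + x2² on x1 + x2 = 1, the second is max x1·x2 on x1 + x2 = 2 (stationary point (1, 1) with λ = −1).

```python
from fractions import Fraction


def det(M):
    """Determinant by exact Gaussian elimination."""
    M = [[Fraction(v) for v in row] for row in M]
    n, d = len(M), Fraction(1)
    for col in range(n):
        piv = next((r for r in range(col, n) if M[r][col] != 0), None)
        if piv is None:
            return Fraction(0)
        if piv != col:
            M[col], M[piv] = M[piv], M[col]
            d = -d
        d *= M[col][col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    return d


def classify(Dh, Lx):
    """Dh: (m, n) Jacobian at x*, Lx: (n, n) Hessian of the Lagrangean at (x*, lambda*)."""
    m, n = len(Dh), len(Lx)
    # augmented matrix [[0, Dh], [Dh^T, Lx]]
    top = [[0] * m + list(row) for row in Dh]
    bottom = [[Dh[i][j] for i in range(m)] + list(Lx[j]) for j in range(n)]
    H = top + bottom
    # N-W principal minors of order 2m+1, ..., n+m
    minors = [det([row[:i] for row in H[:i]]) for i in range(2 * m + 1, n + m + 1)]
    if all(d * (-1) ** m > 0 for d in minors):
        return "local minimum"
    if all(d * (-1) ** (m + 1 + k) > 0 for k, d in enumerate(minors)):
        return "local maximum"
    return "inconclusive"


case_min = classify(Dh=[[1, 1]], Lx=[[2, 0], [0, 2]])
case_max = classify(Dh=[[1, 1]], Lx=[[0, 1], [1, 0]])
print(case_min, case_max)
```

For m = 1 and n = 2 only the determinant of the full (3, 3) matrix is evaluated: it equals −4 in the first case and 2 in the second.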
Linearly constrained quadratic program
Consider the problem: $\min_x \frac{1}{2} x^T Q x$ s.t. Ax − b = 0 with
x ∈ Rn , Q > 0, A ∈ Rm,n , b ∈ Rm , m < n, rk(A) = m.
We apply the Lagrangean method to solve this problem. Define the
Lagrangean by $L(x, \lambda) = \frac{1}{2} x^T Q x + \lambda^T (b - Ax)$ with optimality condition
Dx L(x∗ , λ∗ ) = 0T :
\[
x^{*T} Q - \lambda^{*T} A = 0^T
\]
which is solved by $x^* = Q^{-1} A^T \lambda^*$.
For Dλ L(x∗ , λ∗ ) = 0T , we have
\[
A x^* = A Q^{-1} A^T \lambda^* = b
\]
then $\lambda^* = (A Q^{-1} A^T)^{-1} b$, from which by substituting in the equation for x∗
we find the constrained problem solution: $x^* = Q^{-1} A^T (A Q^{-1} A^T)^{-1} b$.
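The closed-form solution can be evaluated without forming explicit inverses, by solving linear systems instead. The sketch below uses exact fractions; the helper `solve`, the function name `constrained_qp` and the data are assumptions introduced for illustration.

```python
from fractions import Fraction


def solve(M, rhs):
    """Solve M z = rhs by Gauss-Jordan elimination (M assumed non-singular)."""
    n = len(M)
    T = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = next(r for r in range(col, n) if T[r][col] != 0)
        T[col], T[piv] = T[piv], T[col]
        T[col] = [v / T[col][col] for v in T[col]]
        for r in range(n):
            if r != col and T[r][col] != 0:
                f = T[r][col]
                T[r] = [a - f * p for a, p in zip(T[r], T[col])]
    return [T[r][n] for r in range(n)]


def constrained_qp(Q, A, b):
    """Return (x*, lambda*) for min 1/2 x^T Q x s.t. A x = b, with Q > 0 and rk(A) = m."""
    m, n = len(A), len(Q)
    Z = [solve(Q, A[i]) for i in range(m)]          # Z[i] = Q^{-1} a_i, i-th column of Q^{-1} A^T
    M = [[sum(A[i][k] * Z[j][k] for k in range(n)) for j in range(m)]
         for i in range(m)]                          # A Q^{-1} A^T
    lam = solve(M, b)                                # lambda* = (A Q^{-1} A^T)^{-1} b
    x = [sum(lam[j] * Z[j][k] for j in range(m)) for k in range(n)]  # x* = Q^{-1} A^T lambda*
    return x, lam


# Hypothetical data: min x1^2 + 1/2 x2^2  s.t.  x1 + x2 = 3
Q = [[2, 0], [0, 1]]
A = [[1, 1]]
b = [3]
x_star, lam_star = constrained_qp(Q, A, b)
print(x_star, lam_star)
```

One can verify by hand that $Q x^* = A^T \lambda^*$ and $A x^* = b$ hold for the returned pair.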

Exercises 20
Part IV.b: NLN optimization with inequality constraints
Let’s go back to the nonlinearly constrained problem introduced at the
beginning of this section with both equality and inequality constraints:

\[
\min_x f(x) \quad \text{s.t.} \quad h(x) = 0, \quad g(x) \le 0
\]

where: x ∈ Rn , f : Rn → R, h : Rn → Rm , g : Rn → Rp .
We generalize the concept of a regular point in this new setting, by
distinguishing active versus inactive constraints in case of inequality.
Let x∗ satisfy h(x∗ ) = 0 and g(x∗ ) ≤ 0: we denote with
J(x∗ ) := {j|gj (x∗ ) = 0}.
Then x∗ is a regular point if ∇hi (x∗ ) and ∇gj (x∗ ) are l.i. for all i and
j ∈ J(x∗ ).
We specify the Lagrange function in this case as:
L(x, λ, µ) = f (x) + λT h(x) + µT g(x), expecting λ to be free and µ non
negative.
Karush-Kuhn-Tucker (KKT) conditions
The KKT theorem provides FONC for a local optimum.
Karush-Kuhn-Tucker theorem: Let f , g, h ∈ C 1 and x∗ be a regular point
which minimizes f under h(x) = 0 and g(x) ≤ 0. Then there exist a
λ∗ ∈ Rm and a µ∗ ∈ Rp such that:
1. $\mu^* \ge 0$
2. $Df(x^*) + \lambda^{*T} Dh(x^*) + \mu^{*T} Dg(x^*) = 0^T$
3. $\mu^{*T} g(x^*) = 0$.
λ∗ is referred to as the vector of Lagrange multipliers, µ∗ as the vector of
KKT multipliers.
Remarks:
• The KKT multipliers are 0 for non binding constraints and non-negative
for active constraints at the optimum.
• The optimality condition 2 then implies, given x∗ regular, that the
gradient of f is a l.c. of the gradients of the m equality and p inequality
constraints, with weights given by the multipliers.
KKT conditions
We refer to 1., 2., 3., and h(x∗ ) = 0, g(x∗ ) ≤ 0 all together as KKT
necessary conditions for a local optimum under equality and inequality
constraints.
• If we are maximizing, rather than minimizing the KKT conditions do not
change but we are now considering a Lagrange function specified as
L(x, λ, µ) = f (x) − λT h(x) − µT g(x)
• If KKT multipliers are associated with the constraints g(x∗ ) ≥ 0 instead
of ≤ then the last condition needs to change: g(x∗ ) ≥ 0 and the KKT
multipliers are non-positive µ∗ ≤ 0.
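The KKT necessary conditions for a minimization problem with constraints g(x) ≤ 0 can be checked numerically at a candidate point. The sketch below treats a hypothetical problem with inequality constraints only: min (x1 − 2)² + (x2 − 1)² subject to x1 + x2 − 2 ≤ 0 and −x1 ≤ 0. The function names and the tolerance are assumptions made here.

```python
def f_grad(x):
    # gradient of f(x) = (x1 - 2)^2 + (x2 - 1)^2
    return [2 * (x[0] - 2), 2 * (x[1] - 1)]


def g(x):
    # inequality constraints g(x) <= 0
    return [x[0] + x[1] - 2, -x[0]]


def g_jac(x):
    # rows are the gradients of g1 and g2 (constant for linear constraints)
    return [[1, 1], [-1, 0]]


def is_kkt_point(x, mu, tol=1e-9):
    gx, J, df = g(x), g_jac(x), f_grad(x)
    feasible = all(v <= tol for v in gx)                        # g(x) <= 0
    nonneg = all(m >= -tol for m in mu)                         # mu >= 0
    stationary = all(
        abs(df[k] + sum(mu[j] * J[j][k] for j in range(len(mu)))) <= tol
        for k in range(len(x)))                                 # Df + mu^T Dg = 0^T
    complementary = abs(sum(m * v for m, v in zip(mu, gx))) <= tol   # mu^T g(x) = 0
    return feasible and nonneg and stationary and complementary


candidate = is_kkt_point([1.5, 0.5], [1.0, 0.0])   # first constraint active
origin = is_kkt_point([0.0, 0.0], [0.0, 0.0])      # stationarity fails here
print(candidate, origin)
```

At (1.5, 0.5) the first constraint is binding with multiplier 1, the second is inactive with multiplier 0, in line with the remark on non-binding constraints.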
Second order conditions
Let’s now consider second order necessary and sufficient conditions for an
optimum in Rn under equality and inequality constraints.
Theorem (SONC): Let x∗ be a local minimizer of f : Rn → R subject to
h(x) = 0, g(x) ≤ 0, h : Rn → Rm , m ≤ n, g : Rn → Rp , f , h, g ∈ C 2 and x∗
regular. Then there exists a λ∗ ∈ Rm and a µ∗ ∈ Rp such that:
1. $\mu^* \ge 0$, $Df(x^*) + \lambda^{*T} Dh(x^*) + \mu^{*T} Dg(x^*) = 0^T$, $\mu^{*T} g(x^*) = 0$ and
2. For all y ∈ T (x∗ ), yT L(x∗ , λ∗ , µ∗ )y ≥ 0.
We have the first-order stationary conditions and the second order p.s.d.
condition for a local minimum, now restricted to the tangent space
T (x ∗ ) ⊂ Rn . Namely:
• T (x∗ ) := {y ∈ Rn |Dh(x∗ )y = 0, Dgj (x∗ )y = 0, j ∈ J(x∗ )} and
• $L(x^*, \lambda^*, \mu^*) = F(x^*) + \sum_k \lambda_k^* H_k(x^*) + \sum_{j \in J(x^*)} \mu_j^* G_j(x^*)$ where
F, Hk , Gj are the Hessians of the objective and of each equality and inequality
constraint function.
Second order sufficient conditions
Theorem (SOSC): Let f , h, g ∈ C 2 and suppose there exist a feasible point x∗ ∈ Rn , λ∗ ∈ Rm and
µ∗ ∈ Rp such that:
1. $\mu^* \ge 0$, $Df(x^*) + \lambda^{*T} Dh(x^*) + \mu^{*T} Dg(x^*) = 0^T$ and $\mu^{*T} g(x^*) = 0$
2. For all y ∈ T̃ (x∗ , µ∗ ), y ≠ 0 we have yT L(x∗ , λ∗ , µ∗ )y > 0.
Then x∗ is a strict local minimizer of f subject to h(x) = 0 and g(x) ≤ 0.
We need to clarify T̃ (x∗ , µ∗ ):
• T̃ (x∗ , µ∗ ) := {y ∈ Rn |Dh(x∗ )y = 0, Dgj (x∗ )y = 0, j ∈ J̃(x∗ , µ∗ )} and
J̃(x∗ , µ∗ ) := {j|gj (x∗ ) = 0, µ∗j > 0}.
We need y ∈ N (Dh(x∗ )) and y ∈ N (Dgj (x∗ )) only for the active constraints
with µ∗j > 0. Since J̃(x∗ , µ∗ ) ⊆ J(x∗ ), T (x∗ ) is a subspace of T̃ (x∗ , µ∗ ).

Exercises 21
Part IV.c: Convex optimization
Convex optimization problems are of particular relevance due to their unique
properties and relevant application domains. We introduce few definitions.
• The graph of f : Ω → R, Ω ⊂ Rn is the set of points in Ω × R ⊂ Rn+1
given by $\{[x^T, f(x)]^T : x \in \Omega\}$.
• The epigraph is denoted by $\mathrm{epi}(f) = \{[x^T, \beta]^T : x \in \Omega, \beta \in R, \beta \ge f(x)\}$.
• A function f : Ω → R is convex on Ω if its epigraph is convex. If f is
convex on Ω then Ω is convex.
• Jensen inequality in Rn is a N.S.C. for convexity (with strict inequality, for strict convexity).
• f , f1 , f2 convex imply, for α ≥ 0, αf convex and f1 + f2 convex.
• f : Ω → R, f ∈ C 1 defined on an open convex set Ω ⊂ Rn : then f is
convex iff for all x, y ∈ Ω: f (y) ≥ f (x) + Df (x)(y − x).
Convexity: FOC are sufficient
Here come the key results:
• f : Ω → R, f ∈ C 2 defined on an open convex set Ω ⊂ Rn . Then f is
convex iff for each x ∈ Ω the Hessian matrix of f at x is positive
semi-definite. This can be easily extended to closed convex sets.
• Let f : Ω → R be a convex function f ∈ C 1 defined on a convex set Ω.
Suppose x∗ ∈ Ω is such that for all x ∈ Ω, x ≠ x∗ : Df (x∗ )(x − x∗ ) ≥ 0,
then x∗ is a global minimizer of f over Ω. Under the same assumptions, the
same holds if for any feasible direction d at x∗ , dT ∇f (x∗ ) ≥ 0. FONC
are sufficient for global optimality if the problem is convex.
• Similarly, let f ∈ C 1 be convex over a convex set
Ω = {x ∈ Rn |h(x) = 0} with h : Rn → Rm , h ∈ C 1 : then x∗ ∈ Ω is a global
minimizer if there exists a λ∗ ∈ Rm such that $Df(x^*) + \lambda^{*T} Dh(x^*) = 0^T$. (Lagrange theorem)
• Global optimality extends naturally to KKT conditions.
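The first-order characterization of convexity, f (y) ≥ f (x) + Df (x)(y − x), can be probed numerically on a grid of points. The sketch below is only an illustration with two hypothetical functions: a finite sample can refute convexity when the gap turns negative, but it cannot prove convexity.

```python
from itertools import product


def first_order_gap(f, grad, pts):
    """Smallest value of f(y) - f(x) - Df(x)(y - x) over all pairs (x, y) in pts."""
    return min(
        f(y) - f(x) - sum(gk * (yk - xk) for gk, yk, xk in zip(grad(x), y, x))
        for x in pts for y in pts)


# integer grid on [-2, 2] x [-2, 2], so all arithmetic is exact
grid = list(product(range(-2, 3), repeat=2))

# convex quadratic: Hessian [[2, 1], [1, 2]] is positive definite
gap_convex = first_order_gap(
    lambda x: x[0] ** 2 + x[0] * x[1] + x[1] ** 2,
    lambda x: [2 * x[0] + x[1], x[0] + 2 * x[1]],
    grid)

# saddle: Hessian [[2, 0], [0, -2]] is indefinite
gap_saddle = first_order_gap(
    lambda x: x[0] ** 2 - x[1] ** 2,
    lambda x: [2 * x[0], -2 * x[1]],
    grid)

print(gap_convex, gap_saddle)
```

For the convex quadratic the gap never drops below 0 (it is attained at x = y), while for the saddle some pair violates the inequality.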
Examples
We have gone through several examples and problems during the course to be
classified as convex programs. Notice that concavity of f leads to convexity of
−f and all the above results apply.
• Spaces Rn for n = 1, ..., are all convex domains. Half spaces are also
convex. Intersections of convex sets are convex, while unions in general are not.
• Quadratic forms xT Qx are convex on Ω ⊂ Rn iff for any x, y ∈ Ω,
(x − y)T Q(x − y) ≥ 0: when this holds on Ω = {x|Ax = b}, minx {xT Qx| Ax = b} is a convex program.
• For f : R → R you need convexity to apply 1d search methods. If f ∈ C 1
and f ′(x) is non-decreasing over the domain (e.g. increasing from negative to positive), then f is
convex.
• In Rn if the feasible region is convex and f (x) is convex then the
problem is convex. Again here you can study convexity through the
partial derivatives $\partial f / \partial x_j$ and the directional derivatives < ∇f (x), d >.
• Saddle points are the primary causes of lack of convexity.
MATH 412 OPTIMIZATION

END

Thanks to you all! Giorgio
