
Data-Driven Optimization with Machine Learning Applications
Chapter 3: Basics of mathematical optimization

Dr. D.Sc. (habilitus) Abebe Geletu W. Selassie


E-mail: abebe.geletu@aims.ac.rw

German Research Chair


AIMS Rwanda
3.1. Theoretical foundations of unconstrained
optimization
3.1.1. Motivation

Generally, training supervised machine learning models corresponds to solving optimization problems.
Supervised learning is done through the minimization of some error (loss) function.
Regression and artificial neural network models commonly lead to unconstrained optimization problems.
The basic ingredients of training supervised learning
through optimization methods are
a relevant dataset
an appropriate loss (objective) function
optimality criteria
optimization algorithms
model testing and validation
3.2.2. Unconstrained optimization problems

(UNLP)   min_w F(w)                     (1)
         subject to:                    (2)
         w ∈ R^n,                       (3)

where
F(w) - objective (loss) function
w - decision (optimization) variable (model parameter)
Here, the UNLP is used to refer to the unconstrained optimization problem.

The problem UNLP is called a convex optimization problem if F(w) is a convex function; otherwise, it is a non-convex optimization problem.
3.2.2. Unconstrained optimization problems...
Examples of convex and non-convex optimization models in
machine learning.
Linear regression models (ridge regression, LASSO,
Elastic-Net) are convex unconstrained optimization problems
F(w) = (1/2) Σ_{j=1}^{N} [ y^(j) − f(x^(j), w) ]² + R(w)

where f(x, w) is a linear function w.r.t. w and R(w) is a convex regularization function
Logistic regression leads to a convex unconstrained
optimization problem
Neural network models commonly lead to non-convex
optimization problems
3.3. Global and Local Minimum Point

Definition (Global and Local Minimum Point)


Global minimum point: A point w ∗ is a global
minimum point of a function F if

F (w ) ≥ F (w ∗ ), for any w ∈ Rn .

Local minimum point: A point w ∗ is a local minimum


point of a function F if

F(w) ≥ F(w∗), for any w sufficiently near to w∗;

i.e., for any w in a neighborhood of w∗.

Note that any global minimum point is also a local minimum point.


3.3. Global and Local Minimum Point...

Example 1: The convex optimization problem

(UNLP)   min_{w1,w2} { F(w) = (1/2)[(w1 − 2)² + w2²] − 5 }

has w∗ = (2, 0)ᵀ as a (global) minimum point (see figure).


In particular, the minimum value is equal to −5 = F (w ∗ ) ≤
F (w ) for all w ∈ R2 .
3.3. Global and Local Minimum Point...

Example 2: The non-convex optimization problem

(UNLP)   min_{w1,w2} { F(w) = w1 w2 exp(−w1² − w2²) }

has two local minima, at the points (−√2/2, √2/2) and (√2/2, −√2/2).
3.3. Global and Local Minimum Point...
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm

X = np.linspace(-2, 2, 100)
Y = X.copy()
X, Y = np.meshgrid(X, Y)
Z = X*Y*np.exp(-X*X - Y*Y)
fig = plt.figure()
ax = fig.add_subplot(111)
# Reversed Greys colourmap for filled contours
cpf = ax.contourf(X, Y, Z, 20, cmap=cm.Greys_r)
# Colour the contour lines and labels: one colour where the contour fill
# is dark (Z < 0) and black where it is light (Z >= 0)
colours = ['r' if level < 0 else 'k' for level in cpf.levels]
cp = ax.contour(X, Y, Z, 20, colors=colours)
ax.clabel(cp, fontsize=12, colors=colours)
plt.show()

Figure: Contour plot for F(w) = w1 w2 exp(−w1² − w2²).
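As a numerical complement to the contour plot, the following short script (a sketch; scipy is assumed to be available) runs a local optimizer from two different starting points and recovers the two local minima stated above.

import numpy as np
from scipy.optimize import minimize

def F(w):
    return w[0]*w[1]*np.exp(-w[0]**2 - w[1]**2)

# Run a local solver from two different initial guesses
for w0 in ([-1.0, 1.0], [1.0, -1.0]):
    res = minimize(F, w0)
    print(res.x, F(res.x))   # approx. (-0.7071, 0.7071) and (0.7071, -0.7071)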
3.3. Global and Local Minimum Point...

Theorem (global optimality of convex problems)


1 Any local minimum point of a convex function is a global
minimum point; i.e., for a convex function a local minimum
point is a global minimum point.
2 If a strictly convex function has a minimum point, then this
minimum point is unique. That is, a strictly convex function
has at most one minimum point.
Proof (by contradiction):
Let w∗ be a local minimum point of a convex function F(w). Assume that w∗ is not a global minimum point. Then there is a point
w̄ ∈ R^n such that

F(w̄) < F(w∗).

Now, we can find a sufficiently small λ ∈ [0, 1] so that w∗ + λ(w̄ − w∗) is in the neighborhood of (i.e., near to) w∗. Hence, by convexity of F and since F(w̄) < F(w∗),

F(w∗) ≤ F(w∗ + λ(w̄ − w∗)) = F((1 − λ)w∗ + λw̄) ≤ (1 − λ)F(w∗) + λF(w̄) < (1 − λ)F(w∗) + λF(w∗) = F(w∗).

This implies that F(w∗) < F(w∗), which is a contradiction. Hence, the assumption is false and w∗ is a global minimum point.
3.4. Optimality Criteria

Q: (i) How do we know that a given point is a local minimizer for a loss
function? (ii) How do we find a minimum point?

Note that:
w∗ being a local minimum point of F(w) means that there is a
neighborhood N(w∗) such that

F(w∗) ≤ F(w), for any w ∈ N(w∗).

Outside the neighborhood N(w∗), there may be points w with F(w) < F(w∗);
i.e., w∗ need not be a global minimum point.
3.4. Optimality Criteria ...
In other words, whichever direction d we move in from w∗ to a new point
w in N(w∗); i.e., w = w∗ + αd ∈ N(w∗) for α ∈ [0, 1], we have

F (w ∗ ) ≤ F (w ∗ + αd).

Definition (Descent direction)


A direction d ∈ R^m is called a descent direction for a function F at a point
w if
F(w + αd) < F(w), for some α ∈ [0, 1].

Optimality criteria: If w ∗ is a local minimum point of the function F ,


then there is no descent direction d of F at w ∗ .
3.4. Optimality Criteria ...

However, checking that w∗ is a local minimum point by showing that
there is no descent direction at w∗ is not trivial. Hence, we need better
optimality criteria!

For a differentiable function F, d being a descent direction of F at w is
equivalent to

∇F(w)ᵀd < 0.

This comes from the first-order Taylor approximation of F at w:
F(w + αd) ≈ F(w) + α∇F(w)ᵀd, combined with the requirement F(w + αd) < F(w).

However, if w∗ is a local minimum point of the function F, there is no
descent direction of F at w∗. That is,
∇F(w∗)ᵀd ≥ 0
for any direction d ∈ R^m. In particular, for d = −∇F(w∗), we have
−∇F(w∗)ᵀ∇F(w∗) ≥ 0, which implies that
−‖∇F(w∗)‖² ≥ 0; i.e., ‖∇F(w∗)‖² ≤ 0.
This implies that ‖∇F(w∗)‖² = 0. Consequently, ∇F(w∗) = 0.
3.4. Optimality Criteria ...
First-Order Necessary Optimality Criteria for UNLP

Let F be a differentiable function. If w ∗ is a local minimum


point, then
∇F (w ∗ ) = 0.

The first-order necessary optimality criterion says that
at a local minimum point w∗, the surface of F has a
horizontal tangent plane
a point w with ∇F(w) ≠ 0 is not a local
minimum point
However, in general, a point ŵ satisfying the equation
∇F(ŵ) = 0 is not automatically a
local minimum point of F.
3.4. Optimality Criteria ...

Example
For the function F(w) = w1 w2 exp(−w1² − w2²), the gradient ∇F(w) vanishes at
(0, 0) (a saddle point) and at (√2/2, √2/2) and (−√2/2, −√2/2) (local maxima),
but none of these points is a local minimum point.

Second-Order Sufficient Optimality Criteria for UNLP

Let F be a twice differentiable function and w∗ ∈ R^m. If

(i) ∇F(w∗) = 0, and
(ii) the Hessian matrix H(w∗) of F at w∗ is positive definite,
then w∗ is a local minimum point of F.
3.4. Optimality Criteria ...Sufficient Condition

Example

Find local minimum point(s) for the problem

(UNLP)   min_{w ∈ R²} { F(w) = (1/3)w1³ + w1w2² − 4w1w2 + 1 }.

First set the gradient to zero:

∇F(w) = ( w1² + w2² − 4w2 ,  2w1(w2 − 2) )ᵀ = ( 0, 0 )ᵀ.

Hence, w1² + w2² − 4w2 = 0 and 2w1(w2 − 2) = 0.


3.4. Optimality Criteria ...Sufficient Condition

From the equation 2w1(w2 − 2) = 0, we have w1 = 0 or w2 = 2.

Hence, from the equation w1² + w2² − 4w2 = 0, we have
if w1 = 0, then 0² + w2² − 4w2 = 0. We obtain (0, 0), (0, 4)
as candidates
if w2 = 2, then w1² + 4 − 4 × 2 = 0, i.e., w1² = 4. We obtain here
(2, 2), (−2, 2).
Thus, the points (0, 0), (0, 4), (2, 2), (−2, 2) are stationary
(candidate) points for a local minimum.
Next, we use the Hessian to identify the true local minimum points.

H(w) = [ 2w1       2w2 − 4 ]
       [ 2w2 − 4   2w1     ]

This matrix is positive definite only at the point (2, 2).
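As a quick numerical cross-check (a sketch; numpy assumed), one can evaluate the Hessian at the four stationary points and inspect its eigenvalues:

import numpy as np

def hessian(w1, w2):
    # Hessian of F(w) = (1/3)w1^3 + w1*w2^2 - 4*w1*w2 + 1
    return np.array([[2*w1, 2*w2 - 4],
                     [2*w2 - 4, 2*w1]])

for w in [(0, 0), (0, 4), (2, 2), (-2, 2)]:
    eigs = np.linalg.eigvalsh(hessian(*w))
    print(w, eigs, "positive definite" if np.all(eigs > 0) else "not positive definite")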


3.4. Optimality Criteria ...Sufficient Condition...
First-Order Sufficient Optimality Criteria for convex UNLP

Let F be a convex differentiable function. A point w∗ is a global minimum
point if and only if
∇F(w∗) = 0.

For a convex differentiable function F, it is usually enough to solve the system
of (possibly nonlinear) equations
∇F(w) = 0
to obtain a minimum point of F.
Example: To solve the convex optimization problem

(UNLP)   min_{w=(w1,w2)} { F(w) = (1/2)[(w1 − 2)² + w2²] − 5 },

solve

∇F(w) = ( w1 − 2, w2 )ᵀ = 0,

which yields [w1∗, w2∗] = [2, 0].


3.4. Optimality Criteria ... Summary
Important summary

a non-convex function can have several local minimum


points
- thus, it can be difficult to select the best among several
local minimum points
- a machine learning model designed based on a local
minimum point may not be reliable
for a convex function, local minimum points are global
- a convex function can have several global minimum
points
- Example: F (w ) = ReLU(w ) has several global minima
a strictly convex function has at most one global
minimum point
- therefore, it is preferable to train machine learning
models using a strictly convex loss function
3.4. Optimality Criteria ... Summary

Important summary (....)

To find a (local or global) minimum point of a
differentiable function F(w), follow these steps:
S1. Solve the system of equations ∇F(w) = 0 to find
stationary points.
S2. Evaluate the Hessian H(w) at the points from S1 to
identify local minimum points.
A vector d ∈ Rm is called a descent direction for a
function F at a point w if ∇F (w )> d < 0 . Descent
directions are central in optimization algorithms.
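To illustrate how descent directions are used, here is a minimal gradient-descent sketch (the step size and stopping tolerance are illustrative choices, not prescribed in these slides), applied to the convex example F(w) = (1/2)[(w1 − 2)² + w2²] − 5:

import numpy as np

def grad_F(w):
    # Gradient of F(w) = 0.5*((w1 - 2)**2 + w2**2) - 5
    return np.array([w[0] - 2.0, w[1]])

w = np.array([10.0, -7.0])     # arbitrary starting point
alpha = 0.1                    # fixed step size (illustrative)
for _ in range(1000):
    d = -grad_F(w)             # d = -grad F(w) is a descent direction
    if np.linalg.norm(d) < 1e-8:
        break
    w = w + alpha*d
print(w)                       # approaches the minimizer (2, 0)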
3.5. Training Regression Models
A. Ridge Regression

min_{w ∈ R^(m+1)} { F(w) = (1/2)‖y − Aw‖² + (γ/2)‖w‖₂² },

The function F(w) = (1/2)‖y − Aw‖² + (γ/2)‖w‖₂² is convex.

First-order (sufficient) optimality criteria:

∇F(w) = −Aᵀ(y − Aw) + γw = 0.

This is equivalent to

[AᵀA + γI_{m+1}] w = Aᵀy.

Here, for a given γ > 0, the matrix D := A> A + γIm+1 and the
vector b = A> y are known, since they are defined through the
dataset.
Next, solve the system of linear equations Dw = b to determine
w.
3.5. Optimality ... Linear Regression Models...
Example: Given the average monthly temperature in Germany from May 2019
to May 2020 (From:https://www.statista.com/statistics/982472/average-monthly-temperature-germany/ )
Month  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec  Jan  Feb  Mar  Apr  May
Temp.  10.9 19.8 18.9 19   14.1 10.9 5.2  3.7  3.3  5.3  5.3  10.5 11.9
find a reasonable prediction for the average temperature in June 2020.
Solution:
1. Identify the type of function class to fit to the data.
For this, visualize the data.

Clearly, a function f(x, w) which is linear w.r.t. x (a straight line) will not be a good fit.
2. For instance, we may fit a polynomial
f(x, w) = w5x⁵ + w4x⁴ + w3x³ + w2x² + w1x + w0.
3.5. Optimality ... Linear Regression Models...
3. Dataset {(xj , yj ) | j = 1, . . . , 13}

{(1, 10.9), (2, 19.8), (3, 18.9), (4, 19), (5, 14.1), (6, 10.9), (7, 5.2), (8, 3.7), (9, 3.3), (10, 5.3), (11, 5.3), (12, 10.5), (13, 11.9)}

4. The ridge regression problem

min_{w ∈ R⁶} { F(w) = (1/2) Σ_{j=1}^{13} [ yj − f(xj, w) ]² + (γ/2)‖w‖₂² },

where x = [1, 2, . . . , 13]> and


y = [10.9, 19.8, 18.9, 19, 14.1, 10.9, 5.2, 3.7, 3.3, 5.3, 5.3, 10.5, 11.9]> .
Using matrix notation,

min_{w ∈ R⁶} { F(w) = (1/2)‖y − Aw‖² + (γ/2)‖w‖₂² },

where

       [ x1⁵    x1⁴    x1³    x1²    x1    1 ]
       [ x2⁵    x2⁴    x2³    x2²    x2    1 ]
A  =   [  ⋮      ⋮      ⋮      ⋮      ⋮    ⋮ ]   ∈ R^(13×6).
       [ x13⁵   x13⁴   x13³   x13²   x13   1 ]
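In numpy, this design matrix can be built directly; a small sketch (np.vander with decreasing powers matches the column ordering above):

import numpy as np

x = np.arange(1, 14)     # months 1, ..., 13
A = np.vander(x, 6)      # columns x^5, x^4, x^3, x^2, x, 1
print(A.shape)           # (13, 6)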
3.5. Optimality ... Linear Regression Models

    [      1      1     1    1   1  1 ]
    [     32     16     8    4   2  1 ]
    [    243     81    27    9   3  1 ]
    [   1024    256    64   16   4  1 ]
    [   3125    625   125   25   5  1 ]
    [   7776   1296   216   36   6  1 ]
A = [  16807   2401   343   49   7  1 ]
    [  32768   4096   512   64   8  1 ]
    [  59049   6561   729   81   9  1 ]
    [ 100000  10000  1000  100  10  1 ]
    [ 161051  14641  1331  121  11  1 ]
    [ 248832  20736  1728  144  12  1 ]
    [ 371293  28561  2197  169  13  1 ]
The Hessian matrix of F,

H(w) = AᵀA + γI,

is positive definite, where I is the 6 × 6 identity matrix.


3.5. ... Linear Regression Models...
import numpy as np
import matplotlib.pyplot as plt

A = np.array([[1,1,1,1,1,1], [32,16,8,4,2,1], [243,81,27,9,3,1], [1024,256,64,16,4,1],
              [3125,625,125,25,5,1], [7776,1296,216,36,6,1], [16807,2401,343,49,7,1],
              [32768,4096,512,64,8,1], [59049,6561,729,81,9,1], [100000,10000,1000,100,10,1],
              [161051,14641,1331,121,11,1], [248832,20736,1728,144,12,1], [371293,28561,2197,169,13,1]])
At = A.transpose()
AtA = np.matmul(At, A)
I = np.identity(6)
gamma = 0.9
D = np.add(AtA, gamma*I)
y = np.array([10.9, 19.8, 18.9, 19, 14.1, 10.9, 5.2, 3.7, 3.3, 5.3, 5.3, 10.5, 11.9])
yt = y.transpose()
b = np.matmul(At, yt)
weights = np.linalg.solve(D, b)
print(weights)
# Plot the polynomial
t = np.linspace(1.0, 13.0)
w = weights
myPoly = []
for i in range(len(t)):
    myPoly.append(w[0]*t[i]**5 + w[1]*t[i]**4 + w[2]*t[i]**3 + w[3]*t[i]**2 + w[4]*t[i] + w[5])
plt.plot(t, myPoly, 'r-')
plt.xlabel('Months: May 2019 to May 2020')
plt.ylabel('Monthly average temperature')
plt.show()
# Letting June 2020 be the 14th month
average4June2020 = w[0]*14**5 + w[1]*14**4 + w[2]*14**3 + w[3]*14**2 + w[4]*14 + w[5]
print(average4June2020)
3.5. ... Linear Regression Models...

Run the code given in the previous slide to generate the following plot.

Figure: Fitting a 5th-order polynomial using ridge regression


3.5. ... Linear Regression Models...

The 5th-order polynomial fits the data better than a
linear function. You may try lower or higher degree
polynomials, or a different class of functions.
γ = 0.9 is selected by trial and error. Test a different γ.
The code (in the program above) determines the
predicted average temperature for June 2020.
average4June2020=w[0]*14**5 + w[1]*14**4 + w[2]*14**3+w[3]*14**2+w[4]*14 + w[5]
print(average4June2020)
3.5. ... Linear Regression Models...

The LASSO and the Elastic-net regression models


LASSO

min_{w ∈ R⁶} { F(w) = (1/2) Σ_{j=1}^{13} |yj − f(xj, w)|² + γ‖w‖₁ },

Elastic-net

min_{w ∈ R^(m+1)} { F(w) = (1/2) Σ_{j=1}^{13} |yj − f(xj, w)|² + γ2‖w‖₂² + γ1‖w‖₁ },

Both involve the ℓ1-regularization term ‖w‖₁, which is not
differentiable.
However, similar first-order optimality criteria can be formulated
using the concept of the sub-gradient, or by smoothing ‖w‖₁.
You may use
sklearn.linear_model.Lasso
sklearn.linear_model.ElasticNet
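For instance, a sketch fitting the same temperature data with scikit-learn (the polynomial degree, the values of alpha and l1_ratio, and max_iter below are illustrative choices; in practice the columns of A should also be scaled):

import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

x = np.arange(1, 14)
y = np.array([10.9, 19.8, 18.9, 19, 14.1, 10.9, 5.2, 3.7, 3.3, 5.3, 5.3, 10.5, 11.9])
A = np.vander(x, 6)                        # polynomial features x^5, ..., 1

lasso = Lasso(alpha=0.9, max_iter=100000).fit(A, y)
enet = ElasticNet(alpha=0.9, l1_ratio=0.5, max_iter=100000).fit(A, y)

# Predicted average temperature for June 2020 (month 14)
x_new = np.vander(np.array([14]), 6)
print(lasso.predict(x_new), enet.predict(x_new))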
3.2. Theoretical Foundation of Constrained Optimization
3.2.1. Constrained Optimization Problems

The standard form of a constrained optimization problem

(NLO)   min_{x ∈ R^n} f(x)                     (4)
        subject to:
        hi(x) = 0,  i = 1, . . . , m;          (5)
        gj(x) ≤ 0,  j = 1, . . . , q;          (6)

where f, hi, gj : R^n → R, i = 1, . . . , m; j = 1, . . . , q, are functions
which are at least once continuously differentiable.

f (x) is the objective function.


hi (x) = 0, i = 1, . . . , m - equality constraints
gj (x) ≤ 0, j = 1, . . . , q - inequality constraints
S = {x ∈ Rn | hi (x) = 0, i = 1, . . . , m; gj (x) ≤ 0, j =
1, . . . , q} - feasible set of (NLO).
3.2.1. Constrained Optimization Problems
Example

Suppose you are given a rectangular steel plate of width 10 m and


length 30 m

Objective: To construct a water tanker with maximum capacity


3.2.1. Constrained Optimization Problems
Example

Problem: What are the dimensions of an open-top water tanker with
maximum volume that can be constructed from the rectangular steel
plate of dimensions 10 m × 30 m?

(NLO)   max_{x ∈ R³} { V(x) = x1 x2 x3 }           (7)
        subject to:
        x1x2 + 2x2x3 + 2x1x3 = 300                  (8)
        x1 ≥ 0, x2 ≥ 0, x3 ≥ 0.                     (9)
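A numerical sketch for this problem (scipy's SLSQP solver is assumed; the starting point is an arbitrary guess):

import numpy as np
from scipy.optimize import minimize

# Maximize V = x1*x2*x3  <=>  minimize -V
obj = lambda x: -x[0]*x[1]*x[2]
cons = [{'type': 'eq', 'fun': lambda x: x[0]*x[1] + 2*x[1]*x[2] + 2*x[0]*x[2] - 300}]
bnds = [(0, None)]*3

res = minimize(obj, x0=[5.0, 5.0, 5.0], method='SLSQP', bounds=bnds, constraints=cons)
print(res.x, -res.fun)   # should approach dimensions (10, 10, 5) with volume 500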
3.2.1. Constrained Optimization Problems - Linear
Programming

The standard form of a linear programming (optimization) problem is

(LP)   min_{x ∈ R^n} f(x) = cᵀx + b                 (10)
       subject to:
       Ax = a;                                      (11)
       Bx ≤ b;                                      (12)

where
f(x) = cᵀx + b - linear objective function
Ax = a - linear equality constraints
Bx ≤ b - linear inequality constraints
S = {x ∈ R^n | Ax = a; Bx ≤ b} - feasible set of (LP).
3.2.1. Constrained Optimization Problems - Quadratic
Programming

The standard form of a quadratic programming (optimization) problem is

(QP)   min_{x ∈ R^n} f(x) = (1/2)xᵀQx + qᵀx          (13)
       subject to:
       Ax = a;                                       (14)
       Bx ≤ b;                                       (15)

where
f(x) = (1/2)xᵀQx + qᵀx - quadratic objective function
Ax = a - linear equality constraints
Bx ≤ b - linear inequality constraints
S = {x ∈ R^n | Ax = a; Bx ≤ b} - feasible set of (QP).
3.2.1. .... Constrained Optimization ...
Feasible Set
A point x ∈ Rn is called a feasible point of the NLP
if hi (x) = 0, i = 1, 2, . . . , p and
gj (x) ≤ 0, j = 1, 2, . . . , m.
Represent the set of all feasible points of the NLP by

S := {x ∈ Rn | hi (x) = 0, i = 1, . . . , p; gj (x) ≤ 0, j = 1, . . . , m} .

The set S is called the feasible set of the NLP.

Any point that lies outside the feasible set is infeasible (not
admissible) to the optimization problem.
Infeasible points are usually not considered during the
optimization process.
3.2.1. ... Constrained Optimization ...
Example 1:

(NLP1)   min_x { (1/2)x1² + x1x2² }
         s.t.
         x1x2² − 1 = 0,
         −x1² + x2 ≤ 0,
         x2 ≥ 0.
In this example
there is one equality constraint h1(x) = x1x2² − 1 and
two inequality constraints g1(x) = −x1² + x2 ≤ 0 and
g2(x) = −x2 ≤ 0.
Observe that x = (1, 1)ᵀ is a feasible point, while (0, 0)ᵀ is not
feasible; i.e., x = (0, 0)ᵀ does not belong to the feasible set
S = { x ∈ R² | x1x2² − 1 = 0, −x1² + x2 ≤ 0, x2 ≥ 0 }.
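A tiny sketch of such a feasibility check in Python (the tolerance is illustrative):

def is_feasible(x, tol=1e-9):
    x1, x2 = x
    h1 = x1*x2**2 - 1        # equality constraint
    g1 = -x1**2 + x2         # inequality constraints (must be <= 0)
    g2 = -x2
    return abs(h1) <= tol and g1 <= tol and g2 <= tol

print(is_feasible((1.0, 1.0)))   # True
print(is_feasible((0.0, 0.0)))   # False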
3.2.1. Introduction to Constrained Optimization ...
Example 2:

(NLP2)   min_x f(x) = 4(x1² + x2) + 50x3 − 10(x1 − x3)
         subject to:
         g1(x) = x1 − x3 ≥ 0
         g2(x) = x1 ≥ 200
         g3(x) = x2 ≥ 400
         x3 ≥ 0.
Note that:
all constraints are inequality constraints
The feasible set is
S = { x ∈ R³ | −x1 + x3 ≤ 0, x1 ≥ 200, x2 ≥ 400, x3 ≥ 0 }

Observe that the point x = (100, 400, 0) is infeasible, while
x = (200, 400, 0) is feasible.
3.2.1. ... Constrained Optimization ...
It is convenient to represent the constraints in the compact
form
h(x) = 0,  g(x) ≤ 0
using the vector representations
h(x) = ( h1(x), h2(x), . . . , hp(x) )ᵀ  and  g(x) = ( g1(x), g2(x), . . . , gm(x) )ᵀ.

Optimal solution (minimum point)


A point x ∗ ∈ Rn is an optimal solution of the constrained optimization
problem NLP if
(i) x ∗ is a feasible point of NLP; that is, x ∗ ∈ S.
(ii) f (x) ≥ f (x∗ ) for all x ∈ S.
3.1. Introduction to Constrained Optimization ...

• For NLP1, the point x ∗ = (1, 1)> is an optimal solution.

In general, it is not trivial to find an optimal solution of a
constrained optimization problem.

Questions
Q1: How do we verify that a point x ∈ R^n is an optimal solution to
the NLP? (We need optimality criteria.)
Q2: What methods are available to solve a constrained nonlinear
optimization problem?
( Methods for constrained optimization. )
3.2.1. ... Constrained Optimization ...

Definition (Active Constraints)


Let x be a feasible point of NLP. An inequality constraint
gi(x) ≤ 0 is an active constraint at x if

gi (x) = 0.

The set A(x) := {i ∈ {1, 2, . . . , m} | gi (x) = 0} is the index


set of active constraints at x.

For Example 1 above, the constraint g1(x) = −x1² + x2 is
active at x = (1, 1)ᵀ, but g2(x) = −x2 is not active. Hence,
A(x) = {1}.
For Example 2, the constraints g1(x) ≤ 0 and g3(x) ≤ 0 are
active at the point x = (300, 400, 300)ᵀ. Hence, we have
A(x) = {1, 3}.
3.2. Optimality Criteria for Constrained Optimization

Descent direction
A vector d is a descent direction to the objective function f at
the point x if
f (x + d) ≤ f (x).
A movement from the point x in the direction of the vector d reduces the value of the function f .

Every vector d with the property dᵀ∇f(x) < 0 is a descent
direction at x.
To verify this, use the first-order Taylor approximation: f(x + d) ≈ f(x) + dᵀ∇f(x),
which implies that

f(x + d) − f(x) ≈ dᵀ∇f(x) < 0 ⇒ f(x + d) − f(x) ≤ 0.

Hence, f(x + d) ≤ f(x) and d is a descent direction.

Note that: if d is a descent direction, the vector d̃ = αd, for α > 0, is also a
descent direction.
3.3. Optimality Criteria for Constrained Optimization
Feasible Direction
Let x be a feasible point (i.e., x ∈ S) and d a vector in R^n. If
(i) hi (x + d) = 0, i = 1, . . . , p and
(ii) gj (x + d) ≤ 0, j = 1, . . . , m,
then d is a feasible direction of NLP at the point x.
• Let x be a feasible point. If a vector d satisfies
dᵀ∇hi(x) = 0, i = 1, . . . , p  and  dᵀ∇gj(x) < 0, j ∈ A(x),

then d̃ = αd is a feasible direction at x for any sufficiently small α > 0.

To verify this, we use the first-order Taylor approximation:

i = 1, . . . , p :  hi(x + αd) ≈ hi(x) + α dᵀ∇hi(x) = 0 + 0,  so hi(x + d̃) = 0;

j ∈ A(x) :  gj(x + αd) ≈ gj(x) + α dᵀ∇gj(x) = 0 + α·(negative term),  so gj(x + d̃) ≤ 0;

j ∉ A(x) :  gj(x + αd) ≈ gj(x) + α dᵀ∇gj(x) with gj(x) < 0 (non-active constraint),
            so gj(x + d̃) ≤ 0 holds for 0 < α ≤ −gj(x)/(dᵀ∇gj(x)) whenever dᵀ∇gj(x) > 0.
3.3. Optimality Criteria ...

First-Order Optimality Criteria


If x∗ is an optimal solution of NLP, then there is no vector
d ∈ R^n which is both a descent direction and a feasible
direction at x∗.
That is, the system of inequalities

dᵀ∇f(x∗) < 0;                                                       (16)
dᵀ∇hi(x∗) = 0, i = 1, . . . , p;   dᵀ∇gj(x∗) < 0, j ∈ A(x∗)         (17)

has no solution. Equivalently,

[−∇f(x∗)]ᵀd > 0,
[∇hi(x∗)]ᵀd = 0, i = 1, . . . , p;   [∇gj(x∗)]ᵀd < 0, j ∈ A(x∗)

has no solution vector d.


3.3. Optimality Criteria ...

Theorem of Farkas (Theorem of Alternatives)


Given a set of vectors c, ai, bj ∈ R^n, i = 1, . . . , p; j = 1, . . . , m̃.
Then one and only one of the following systems has a solution:

System I:  cᵀd > 0;  aiᵀd = 0, i = 1, . . . , p;  bjᵀd < 0, j = 1, . . . , m̃.

System II: There are µ ∈ R^m̃ with µ > 0 and λ ∈ R^p such that

           c = Σ_{i=1}^{p} λi ai + Σ_{j=1}^{m̃} µj bj.
3.3. Optimality Criteria ...
Let now c = −∇f(x∗), ai = ∇hi(x∗), i = 1, . . . , p, and
bj = ∇gj(x∗), j ∈ A(x∗), with m̃ = #A(x∗). If x∗ is an
optimal solution to NLP, then System I has no solution, and
hence System II above has a solution.
Thus, x∗ being an optimal solution of NLP implies that there
are vectors λ ∈ R^p, λᵀ = (λ1, λ2, . . . , λp), and
µ ∈ R^m̃, µᵀ = (µ1, µ2, . . . , µm̃) > 0, so that

−∇f(x∗) = Σ_{i=1}^{p} λi ∇hi(x∗) + Σ_{j ∈ A(x∗)} µj ∇gj(x∗).

Furthermore, letting µj = 0 for j ∈ {1, . . . , m} \ A(x∗), we
can write

−∇f(x∗) = Σ_{i=1}^{p} λi ∇hi(x∗) + Σ_{j=1}^{m} µj ∇gj(x∗).
3.3. Optimality Criteria ...KKT Conditions

The Karush-Kuhn-Tucker (KKT) Optimality Criteria


If x∗ is a minimum point of NLP, then there are λ ∈ R^p and
µ ∈ R^m, µ ≥ 0, so that the following conditions are satisfied:

∇f(x∗) + Σ_{i=1}^{p} λi ∇hi(x∗) + Σ_{j=1}^{m} µj ∇gj(x∗) = 0     (optimality)
h(x∗) = 0,  g(x∗) ≤ 0                                            (feasibility)
µ ≥ 0                                                            (nonnegative multipliers)
µj gj(x∗) = 0, j = 1, . . . , m                                   (complementarity)
3.2. Optimality Criteria ...KKT Conditions...

Lagrange Function
The function

L(x, λ, µ) = f(x) + Σ_{i=1}^{p} λi hi(x) + Σ_{j=1}^{m} µj gj(x)

is called the Lagrange function of NLP.

Example: Solve the optimization problem

(NLP3)   min_x f(x) = x1² − x2²
         s.t.
         x1 + 2x2 + 1 = 0
         x1 − x2 ≤ 3.
3.2. Optimality Criteria ...KKT Conditions
Solution:
Lagrange function
L(x, λ, µ) = (x1² − x2²) + λ(x1 + 2x2 + 1) + µ(x1 − x2 − 3).

Optimality conditions

∂L/∂x1 = 0 ⇒ 2x1 + λ + µ = 0 ⇒ x1 = −(1/2)(λ + µ)          (18)
∂L/∂x2 = 0 ⇒ −2x2 + 2λ − µ = 0 ⇒ x2 = (1/2)(2λ − µ)        (19)

Feasibility (using x1 and x2 from equations (18) and (19), resp.)

h(x) = 0 ⇒ x1 + 2x2 + 1 = 0 ⇒ −(1/2)(λ + µ) + (2λ − µ) + 1 = 0
        ⇒ λ = µ − 2/3                                       (20)
3.2. Optimality Criteria ...KKT Conditions

Complementarity

µ g(x) = 0 ⇒ µ(x1 − x2 − 3) = 0 ⇒ µ[ −(1/2)(λ + µ) − (1/2)(2λ − µ) − 3 ] = 0
           ⇒ µ[ −(1/2)((µ − 2/3) + µ) − (1/2)(2(µ − 2/3) − µ) − 3 ] = 0
           ⇒ µ( −(3/2)µ − 2 ) = 0 ⇒ µ = 0 or µ = −4/3.

However, µ = −4/3 < 0 is not allowed. Hence, µ∗ = 0 is the only
remaining possibility. With this we obtain, using equation (20), that

λ∗ = µ∗ − 2/3 = −2/3.
3.2. Optimality Criteria ...KKT Conditions
Now using µ∗ = 0 and λ∗ = −2/3 we obtain

x1∗ = −(1/2)(λ∗ + µ∗) = −(1/2)(−2/3 + 0) = 1/3               (21)
x2∗ = (1/2)(2λ∗ − µ∗) = (1/2)(2 × (−2/3) − 0) = −2/3.        (22)

Consequently, the point x∗ = (1/3, −2/3)ᵀ is the only candidate for
a local minimum.

Note that the inequality constraint g(x) = x1 − x2 − 3 ≤ 0 is
not active at the point x∗.

• In general, the KKT conditions are only necessary optimality
conditions.
3.2. Optimality Criteria ...Sufficient Conditions
Sufficient optimality conditions
Suppose the functions f, hi, i = 1, . . . , p; gj, j = 1, . . . , m are twice
differentiable and x∗ is a feasible point of NLP. If there are
Lagrange multipliers λ∗ and µ∗ ≥ 0 such that:
(i) the KKT conditions are satisfied for (x∗, λ∗, µ∗); and
(ii) the Hessian matrix of the Lagrange function

∇xx L(x∗, λ∗, µ∗) = ∇²f(x∗) + Σ_{i=1}^{p} λi∗ ∇²hi(x∗) + Σ_{j=1}^{m} µj∗ ∇²gj(x∗)

is positive definite (i.e., dᵀ∇xx L(x∗, λ∗, µ∗)d > 0) for all d ≠ 0 from
the subspace

V = { d ∈ R^n | dᵀ∇hi(x∗) = 0, i = 1, . . . , p;  dᵀ∇gj(x∗) = 0 for j ∈ A(x∗) with µj∗ > 0 },

then x∗ is an optimal solution of the NLP.


3.2. Optimality Criteria ...KKT Conditions
For the example problem NLP3, at the stationary point
x∗ = (1/3, −2/3)ᵀ, we have g(x∗) = x1∗ − x2∗ − 3 < 0. That is,
g(x) ≤ 0 is inactive at x∗.
Hence, we have the subspace
V = {d ∈ R² | dᵀ∇h(x∗) = 0} = {d ∈ R² | (d1, d2)(1, 2)ᵀ = 0} = {d ∈ R² | d1 = −2d2}.
Hessian of the Lagrange function:

∇xx L(x∗, λ∗, µ∗) = [ 2   0 ]
                    [ 0  −2 ]

For d = (d1, d2)ᵀ ∈ V with d ≠ 0 (note that if d1 ≠ 0, then
d2 ≠ 0, and conversely) we have

dᵀ ∇xx L(x∗, λ∗, µ∗) d = 2d1² − 2d2² = 2(−2d2)² − 2d2² = 6d2² > 0.

Therefore, x∗ = (1/3, −2/3)ᵀ is an optimal solution of NLP3.
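As a numerical cross-check (a sketch; scipy assumed), one can hand NLP3 to a constrained solver and compare with the analytic solution x∗ = (1/3, −2/3)ᵀ:

import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0]**2 - x[1]**2
cons = [{'type': 'eq',   'fun': lambda x: x[0] + 2*x[1] + 1},    # x1 + 2x2 + 1 = 0
        {'type': 'ineq', 'fun': lambda x: 3 - (x[0] - x[1])}]    # x1 - x2 <= 3

res = minimize(f, x0=[0.0, 0.0], method='SLSQP', constraints=cons)
print(res.x)   # approx. [ 0.3333, -0.6667 ]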

3.2. Sufficiency of KKT for convex problems
If in the optimization problem

(NLP)   min_x f(x)
        s.t.
        hi(x) = 0, i = 1, 2, . . . , p;
        gj(x) ≤ 0, j = 1, 2, . . . , m,

f(x) is a convex function, each function hi(x) is a linear (affine) function, and each
gj(x) is a convex function, then
the feasible set

S = {x ∈ R^n | hi(x) = 0, i = 1, 2, . . . , p; gj(x) ≤ 0, j = 1, 2, . . . , m}

is a convex set;
the Lagrange function

L(x, λ, µ) = f(x) + Σ_{i=1}^{p} λi hi(x) + Σ_{j=1}^{m} µj gj(x)

is convex with respect to x (for µ ≥ 0); i.e., its Hessian with respect to x is positive semidefinite.


Hence, the satisfaction of the KKT conditions at x∗ is sufficient for x∗ to be a
minimum point of NLP.
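To make this concrete, here is a small sketch (the matrices Q, A and vectors q, a are made-up illustrative data) that solves an equality-constrained convex QP, min (1/2)xᵀQx + qᵀx s.t. Ax = a, by solving its KKT conditions, which in this case form a linear system in (x, λ):

import numpy as np

# Illustrative convex QP data: Q positive definite, one linear equality constraint
Q = np.array([[2.0, 0.0], [0.0, 4.0]])
q = np.array([-2.0, -8.0])
A = np.array([[1.0, 1.0]])
a = np.array([4.0])

# KKT conditions: Q x + q + A^T lambda = 0 and A x = a  =>  one linear system
K = np.block([[Q, A.T], [A, np.zeros((1, 1))]])
rhs = np.concatenate([-q, a])
sol = np.linalg.solve(K, rhs)
x, lam = sol[:2], sol[2:]
print(x, lam)   # x approx. [1.667, 2.333], lambda approx. [-1.333]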
3.2. Optimality Criteria ...KKT Conditions
Example:
A two-bar truss consists of:
- bars of lengths L and L/cos(α)
- the cross-sectional areas of the bars are A1 and A2
- the material density of the bars is ρ, with Young's modulus E
- a force F is applied vertically at the intersection of the bars
and causes a displacement D
- the angle between the bars is α = 30°
3.2. Optimality Criteria ...KKT Conditions
Objective: To determine the cross-sectional areas A1 and A2 of the bars that minimize
the total weight

W(A1, A2) = ρL( (2/√3)A1 + A2 )

of the truss under stress constraints on both bars, |σi| ≤ σ0, i = 1, 2, and the
displacement constraint D ≤ D0 = σ0L/E.

Optimization problem:

(NLP)   min_{A1,A2} W(A1, A2) = ρL( (2/√3)A1 + A2 )              (23)
        subject to
        displacement constraint:  F( 8/(√3 A1) + 3/A2 ) ≤ σ0      (24)
        stress constraint bar 1:  −σ0 ≤ 2F/A1 ≤ σ0                (25)
        stress constraint bar 2:  −σ0 ≤ √3 F/A2 ≤ σ0              (26)
        A1 ≥ 0, A2 ≥ 0.                                           (27)
3.2. Optimality Criteria ...KKT Conditions
In standard form:

(NLP)   min_{A1,A2} W(A1, A2) = ρL( (2/√3)A1 + A2 )              (28)
        subject to:  F( 8/(√3 A1) + 3/A2 ) − σ0 ≤ 0               (29)
                     2F/A1 − σ0 ≤ 0                               (30)
                     −2F/A1 − σ0 ≤ 0                              (31)
                     √3 F/A2 − σ0 ≤ 0                             (32)
                     −√3 F/A2 − σ0 ≤ 0                            (33)
                     A1 ≥ 0, A2 ≥ 0.                              (34)
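A numerical sketch of this design problem (the values of F, σ0, ρ and L below are purely hypothetical, chosen only to make the script runnable; the constraints follow the standard form above, and the lower stress bounds (31) and (33) hold automatically for positive A1, A2 and F):

import numpy as np
from scipy.optimize import minimize

# Hypothetical data (not from the slides); units are arbitrary but consistent
F_load, sigma0, rho, L = 10.0, 100.0, 1.0, 1.0

W = lambda A: rho*L*((2/np.sqrt(3))*A[0] + A[1])     # total weight
# scipy 'ineq' means fun(A) >= 0, so each constraint g(A) <= 0 is written as -g(A) >= 0
cons = [
    {'type': 'ineq', 'fun': lambda A: sigma0 - F_load*(8/(np.sqrt(3)*A[0]) + 3/A[1])},
    {'type': 'ineq', 'fun': lambda A: sigma0 - 2*F_load/A[0]},
    {'type': 'ineq', 'fun': lambda A: sigma0 - np.sqrt(3)*F_load/A[1]},
]
bnds = [(1e-6, None), (1e-6, None)]                  # A1 > 0, A2 > 0

res = minimize(W, x0=[1.0, 1.0], method='SLSQP', bounds=bnds, constraints=cons)
print(res.x, res.fun)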

Major difficulties in Optimization


In general, in optimization problems, major difficulties come from inequality
constraints.
