
Data-Driven Optimization with Machine Learning Applications
Chapter 3: Basics of mathematical optimization

Dr. D.Sc. (habilitus) Abebe Geletu W. Selassie


E-mail: abebe.geletu@aims.ac.rw

German Research Chair


AIMS Rwanda
3.1. Theoretical foundations of unconstrained
optimization
3.1.1. Motivation

Generally, training supervised machine learning models corresponds to solving optimization problems.
Supervised learning is done through the minimization of some error (loss) function.
Regression and artificial neural network models commonly lead to unconstrained optimization problems.
The basic ingredients of training supervised learning
through optimization methods are
a relevant dataset
an appropriate loss (objective) function
optimality criteria
optimization algorithms
model testing and validation
3.2.2. Unconstrained optimization problems

(UNLP)   min_w F(w)                     (1)
         subject to:                    (2)
         w ∈ R^n,                       (3)

where
F(w) - objective (loss) function
w - decision (optimization) variable (model parameter)
Here, the UNLP is used to refer to the unconstrained optimization problem.

The problem UNLP is called a convex optimization problem if F(w) is a convex function; otherwise, it is a non-convex optimization problem.
3.2.2. Unconstrained optimization problems...
Examples of convex and non-convex optimization models in
machine learning.
Linear regression models (ridge regression, LASSO,
Elastic-Net) are convex unconstrained optimization problems
F(w) = (1/2) Σ_{j=1}^{N} [ y^(j) − f(x^(j), w) ]² + R(w)

where f(x, w) is a linear function w.r.t. w and R(w) is a convex regularization function
Logistic regression leads to a convex unconstrained
optimization problem
Neural network models commonly lead to non-convex
optimization problems
3.3. Global and Local Minimum Point

Definition (Global and Local Minimum Point)


Global minimum point: A point w ∗ is a global
minimum point of a function F if

F (w ) ≥ F (w ∗ ), for any w ∈ Rn .

Local minimum point: A point w ∗ is a local minimum


point of a function F if

F(w) ≥ F(w∗), for any w sufficiently near to w∗;

i.e., for any w in a neighborhood of w∗.

Note that any global minimum point is also a local minimum point.


3.3. Global and Local Minimum Point...

Example 1: The convex optimization problem

(UNLP)   min_{w1,w2} { F(w) = (1/2)[(w1 − 2)² + w2²] − 5 }

has w∗ = (2, 0)ᵀ as a (global) minimum point (see figure).


In particular, the minimum value is equal to −5 = F (w ∗ ) ≤
F (w ) for all w ∈ R2 .
3.3. Global and Local Minimum Point...

Example 2: The non-convex optimization problem

(UNLP)   min_{w1,w2} { F(w) = w1 w2 exp(−w1² − w2²) }

has two local minima, at the points (−√2/2, √2/2) and (√2/2, −√2/2).
3.3. Global and Local Minimum Point...
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm

X = np.linspace(-2, 2, 100)
Y = X.copy()
X, Y = np.meshgrid(X, Y)
Z = X*Y*np.exp(-X*X - Y*Y)
fig = plt.figure()
ax = fig.add_subplot(111)
# Reversed Greys colourmap for filled contours
cpf = ax.contourf(X, Y, Z, 20, cmap=cm.Greys_r)
# Colour the contour lines and labels: one colour where the contour fill
# is dark (Z < 0) and black where it is light (Z >= 0)
colours = ['r' if level < 0 else 'k' for level in cpf.levels]
cp = ax.contour(X, Y, Z, 20, colors=colours)
ax.clabel(cp, fontsize=12, colors=colours)
plt.show()

Figure: Contour plot for F(w) = w1 w2 exp(−w1² − w2²).
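As a numerical complement to the contour plot, the following short script (a sketch; scipy is assumed to be available) runs a local optimizer from two different starting points and recovers the two local minima stated above.

import numpy as np
from scipy.optimize import minimize

def F(w):
    return w[0]*w[1]*np.exp(-w[0]**2 - w[1]**2)

# Run a local solver from two different initial guesses
for w0 in ([-1.0, 1.0], [1.0, -1.0]):
    res = minimize(F, w0)
    print(res.x, F(res.x))   # approx. (-0.7071, 0.7071) and (0.7071, -0.7071)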
3.3. Global and Local Minimum Point...

Theorem (global optimality of convex problems)


1 Any local minimum point of a convex function is a global
minimum point; i.e., for a convex function a local minimum
point is a global minimum point.
2 If a strictly convex function has a minimum point, then this
minimum point is unique. That is, a strictly convex function
has at most one minimum point.
Proof (by contradiction):
Let w∗ be a local minimum point of a convex function F(w). Assume that w∗ is not a global minimum point. Then there is a point
w̄ ∈ R^n such that

F(w̄) < F(w∗).

Now, we can find a sufficiently small λ ∈ [0, 1] so that w∗ + λ(w̄ − w∗) is in the neighborhood of (i.e., near to) w∗. Hence, by convexity of F and since F(w̄) < F(w∗),

F(w∗) ≤ F(w∗ + λ(w̄ − w∗)) = F((1 − λ)w∗ + λw̄) ≤ (1 − λ)F(w∗) + λF(w̄) < (1 − λ)F(w∗) + λF(w∗) = F(w∗).

This implies that F(w∗) < F(w∗), which is a contradiction. Hence, the assumption is false and w∗ is a global minimum point.
3.4. Optimality Criteria

Q: (i) How do we know that a given point is a local minimizer for a loss
function? (ii) How do we find a minimum point?

Note that:
w∗ being a local minimum point of F(w) means that there is a
neighborhood N(w∗) such that

F(w∗) ≤ F(w), for any w ∈ N(w∗).

Outside the neighborhood N(w∗), there may be points w with F(w) < F(w∗);
i.e., w∗ need not be a global minimum point.
3.4. Optimality Criteria ...
In other words, whichever direction d we move in from w∗ to a new point
w in N(w∗); i.e., w = w∗ + αd ∈ N(w∗) for α ∈ [0, 1], we have

F (w ∗ ) ≤ F (w ∗ + αd).

Definition (Descent direction)


A direction d ∈ R^m is called a descent direction for a function F at a point
w if
F(w + αd) < F(w), for some α ∈ [0, 1].

Optimality criteria: If w ∗ is a local minimum point of the function F ,


then there is no descent direction d of F at w ∗ .
3.4. Optimality Criteria ...

However, checking that w∗ is a local minimum point by showing that
there is no descent direction at w∗ is not trivial. Hence, we need better
optimality criteria!

For a differentiable function F, d being a descent direction of F at w is
equivalent to

∇F(w)ᵀd < 0.

This comes from the first-order Taylor approximation of F at w:
F(w + αd) ≈ F(w) + α∇F(w)ᵀd, combined with the requirement F(w + αd) < F(w).

However, if w∗ is a local minimum point of the function F, there is no
descent direction of F at w∗. That is,
∇F(w∗)ᵀd ≥ 0
for any direction d ∈ R^m. In particular, for d = −∇F(w∗), we have
−∇F(w∗)ᵀ∇F(w∗) ≥ 0, which implies that
−‖∇F(w∗)‖² ≥ 0; i.e., ‖∇F(w∗)‖² ≤ 0.
This implies that ‖∇F(w∗)‖² = 0. Consequently, ∇F(w∗) = 0.
3.4. Optimality Criteria ...
First-Order Necessary Optimality Criteria for UNLP

Let F be a differentiable function. If w ∗ is a local minimum


point, then
∇F (w ∗ ) = 0.

The first-order necessary optimality criterion says that
at a local minimum point w∗, the surface of F has a
horizontal tangent plane
a point w with ∇F(w) ≠ 0 is not a local
minimum point
However, in general, a point ŵ satisfying the equation
∇F(ŵ) = 0 is not automatically a
local minimum point of F.
3.4. Optimality Criteria ...

Example
For the function F(w) = w1 w2 exp(−w1² − w2²), the gradient ∇F(w) vanishes at
(0, 0) (a saddle point) and at (√2/2, √2/2) and (−√2/2, −√2/2) (local maxima),
but none of these points is a local minimum point.

Second-Order Sufficient Optimality Criteria for UNLP

Let F be a twice differentiable function and w∗ ∈ R^m. If

(i) ∇F(w∗) = 0, and
(ii) the Hessian matrix H(w∗) of F at w∗ is positive definite,
then w∗ is a local minimum point of F.
3.4. Optimality Criteria ...Sufficient Condition

Example

Find local minimum point(s) for the problem

(UNLP)   min_{w ∈ R²} { F(w) = (1/3)w1³ + w1w2² − 4w1w2 + 1 }.

First set the gradient to zero:

∇F(w) = ( w1² + w2² − 4w2 ,  2w1(w2 − 2) )ᵀ = ( 0, 0 )ᵀ.

Hence, w1² + w2² − 4w2 = 0 and 2w1(w2 − 2) = 0.


3.4. Optimality Criteria ...Sufficient Condition

From the equation 2w1(w2 − 2) = 0, we have w1 = 0 or w2 = 2.

Hence, from the equation w1² + w2² − 4w2 = 0, we have
if w1 = 0, then 0² + w2² − 4w2 = 0. We obtain (0, 0), (0, 4)
as candidates
if w2 = 2, then w1² + 4 − 4 × 2 = 0, i.e., w1² = 4. We obtain here
(2, 2), (−2, 2).
Thus, the points (0, 0), (0, 4), (2, 2), (−2, 2) are stationary
(candidate) points for a local minimum.
Next, we use the Hessian to identify the true local minimum points.

H(w) = [ 2w1       2w2 − 4 ]
       [ 2w2 − 4   2w1     ]

This matrix is positive definite only at the point (2, 2).
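As a quick numerical cross-check (a sketch; numpy assumed), one can evaluate the Hessian at the four stationary points and inspect its eigenvalues:

import numpy as np

def hessian(w1, w2):
    # Hessian of F(w) = (1/3)w1^3 + w1*w2^2 - 4*w1*w2 + 1
    return np.array([[2*w1, 2*w2 - 4],
                     [2*w2 - 4, 2*w1]])

for w in [(0, 0), (0, 4), (2, 2), (-2, 2)]:
    eigs = np.linalg.eigvalsh(hessian(*w))
    print(w, eigs, "positive definite" if np.all(eigs > 0) else "not positive definite")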


3.4. Optimality Criteria ...Sufficient Condition...
First-Order Sufficient Optimality Criteria for convex UNLP

Let F be a convex differentiable function. A point w∗ is a global minimum
point if and only if
∇F(w∗) = 0.

For a convex differentiable function F, it is usually enough to solve the system
of (possibly nonlinear) equations
∇F(w) = 0
to obtain a minimum point of F.
Example: To solve the convex optimization problem

(UNLP)   min_{w=(w1,w2)} { F(w) = (1/2)[(w1 − 2)² + w2²] − 5 },

solve

∇F(w) = ( w1 − 2, w2 )ᵀ = 0,

which yields [w1∗, w2∗] = [2, 0].


3.4. Optimality Criteria ... Summary
Important summary

a non-convex function can have several local minimum


points
- thus, it can be difficult to select the best among several
local minimum points
- a machine learning model designed based on a local
minimum point may not be reliable
for a convex function, local minimum points are global
- a convex function can have several global minimum
points
- Example: F (w ) = ReLU(w ) has several global minima
a strictly convex function has at most one global
minimum point
- therefore, it is preferable to train machine learning
models using a strictly convex loss function
3.4. Optimality Criteria ... Summary

Important summary (....)

To find a (local or global) minimum point of a
differentiable function F(w), follow these steps:
S1. Solve the system of equations ∇F(w) = 0 to find
stationary points.
S2. Evaluate the Hessian H(w) at the points from S1 to
identify local minimum points.
A vector d ∈ Rm is called a descent direction for a
function F at a point w if ∇F (w )> d < 0 . Descent
directions are central in optimization algorithms.
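To illustrate how descent directions are used, here is a minimal gradient-descent sketch (the step size and stopping tolerance are illustrative choices, not prescribed in these slides), applied to the convex example F(w) = (1/2)[(w1 − 2)² + w2²] − 5:

import numpy as np

def grad_F(w):
    # Gradient of F(w) = 0.5*((w1 - 2)**2 + w2**2) - 5
    return np.array([w[0] - 2.0, w[1]])

w = np.array([10.0, -7.0])     # arbitrary starting point
alpha = 0.1                    # fixed step size (illustrative)
for _ in range(1000):
    d = -grad_F(w)             # d = -grad F(w) is a descent direction
    if np.linalg.norm(d) < 1e-8:
        break
    w = w + alpha*d
print(w)                       # approaches the minimizer (2, 0)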
3.5. Training Regression Models
A. Ridge Regression

min_{w ∈ R^(m+1)} { F(w) = (1/2)‖y − Aw‖² + (γ/2)‖w‖₂² },

The function F(w) = (1/2)‖y − Aw‖² + (γ/2)‖w‖₂² is convex.

First-order (sufficient) optimality criteria:

∇F(w) = −Aᵀ(y − Aw) + γw = 0.

This is equivalent to

[AᵀA + γI_{m+1}] w = Aᵀy.

Here, for a given γ > 0, the matrix D := A> A + γIm+1 and the
vector b = A> y are known, since they are defined through the
dataset.
Next, solve the system of linear equations Dw = b to determine
w.
3.5. Optimality ... Linear Regression Models...
Example: Given the average monthly temperature in Germany from May 2019
to May 2020 (From:https://www.statista.com/statistics/982472/average-monthly-temperature-germany/ )
Month  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec  Jan  Feb  Mar  Apr  May
Temp.  10.9 19.8 18.9 19   14.1 10.9 5.2  3.7  3.3  5.3  5.3  10.5 11.9
find a reasonable prediction for the average temperature in June 2020.
Solution:
1. Identify the type of function class to fit to the data.
For this, visualize the data.

Clearly, a function f(x, w) which is linear w.r.t. x (a straight line) will not be a good fit.
2. For instance, we may fit a polynomial
f(x, w) = w5x⁵ + w4x⁴ + w3x³ + w2x² + w1x + w0.
3.5. Optimality ... Linear Regression Models...
3. Dataset {(xj , yj ) | j = 1, . . . , 13}

{(1, 10.9), (2, 19.8), (3, 18.9), (4, 19), (5, 14.1), (6, 10.9), (7, 5.2), (8, 3.7), (9, 3.3), (10, 5.3), (11, 5.3), (12, 10.5), (13, 11.9)}

4. The ridge regression problem

min_{w ∈ R⁶} { F(w) = (1/2) Σ_{j=1}^{13} [ yj − f(xj, w) ]² + (γ/2)‖w‖₂² },

where x = [1, 2, . . . , 13]> and


y = [10.9, 19.8, 18.9, 19, 14.1, 10.9, 5.2, 3.7, 3.3, 5.3, 5.3, 10.5, 11.9]> .
Using matrix notation,

min_{w ∈ R⁶} { F(w) = (1/2)‖y − Aw‖² + (γ/2)‖w‖₂² },

where

       [ x1⁵    x1⁴    x1³    x1²    x1    1 ]
       [ x2⁵    x2⁴    x2³    x2²    x2    1 ]
A  =   [  ⋮      ⋮      ⋮      ⋮      ⋮    ⋮ ]   ∈ R^(13×6).
       [ x13⁵   x13⁴   x13³   x13²   x13   1 ]
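In numpy, this design matrix can be built directly; a small sketch (np.vander with decreasing powers matches the column ordering above):

import numpy as np

x = np.arange(1, 14)     # months 1, ..., 13
A = np.vander(x, 6)      # columns x^5, x^4, x^3, x^2, x, 1
print(A.shape)           # (13, 6)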
3.5. Optimality ... Linear Regression Models

    [      1      1     1    1   1  1 ]
    [     32     16     8    4   2  1 ]
    [    243     81    27    9   3  1 ]
    [   1024    256    64   16   4  1 ]
    [   3125    625   125   25   5  1 ]
    [   7776   1296   216   36   6  1 ]
A = [  16807   2401   343   49   7  1 ]
    [  32768   4096   512   64   8  1 ]
    [  59049   6561   729   81   9  1 ]
    [ 100000  10000  1000  100  10  1 ]
    [ 161051  14641  1331  121  11  1 ]
    [ 248832  20736  1728  144  12  1 ]
    [ 371293  28561  2197  169  13  1 ]
The Hessian matrix of F,

H(w) = AᵀA + γI,

is positive definite, where I is the 6 × 6 identity matrix.


3.5. ... Linear Regression Models...
import numpy as np
import matplotlib.pyplot as plt

A = np.array([[1,1,1,1,1,1], [32,16,8,4,2,1], [243,81,27,9,3,1], [1024,256,64,16,4,1],
              [3125,625,125,25,5,1], [7776,1296,216,36,6,1], [16807,2401,343,49,7,1],
              [32768,4096,512,64,8,1], [59049,6561,729,81,9,1], [100000,10000,1000,100,10,1],
              [161051,14641,1331,121,11,1], [248832,20736,1728,144,12,1], [371293,28561,2197,169,13,1]])
At = A.transpose()
AtA = np.matmul(At, A)
I = np.identity(6)
gamma = 0.9
D = np.add(AtA, gamma*I)
y = np.array([10.9, 19.8, 18.9, 19, 14.1, 10.9, 5.2, 3.7, 3.3, 5.3, 5.3, 10.5, 11.9])
yt = y.transpose()
b = np.matmul(At, yt)
weights = np.linalg.solve(D, b)
print(weights)
# Plot the polynomial
t = np.linspace(1.0, 13.0)
w = weights
myPoly = []
for i in range(len(t)):
    myPoly.append(w[0]*t[i]**5 + w[1]*t[i]**4 + w[2]*t[i]**3 + w[3]*t[i]**2 + w[4]*t[i] + w[5])
plt.plot(t, myPoly, 'r-')
plt.xlabel('Months: May 2019 to May 2020')
plt.ylabel('Monthly average temperature')
plt.show()
# Letting June 2020 be the 14th month
average4June2020 = w[0]*14**5 + w[1]*14**4 + w[2]*14**3 + w[3]*14**2 + w[4]*14 + w[5]
print(average4June2020)
3.5. ... Linear Regression Models...

Run the code given in the previous slide to generate the following plot.

Figure: Fitting a 5th-order polynomial using ridge regression


3.5. ... Linear Regression Models...

The 5th-order polynomial fits the data better than a
linear function. You may try lower or higher degree
polynomials, or a different class of functions.
γ = 0.9 is selected by trial and error. Test a different γ.
The code (in the program above) determines the
predicted average temperature for June 2020.
average4June2020=w[0]*14**5 + w[1]*14**4 + w[2]*14**3+w[3]*14**2+w[4]*14 + w[5]
print(average4June2020)
3.5. ... Linear Regression Models...

The LASSO and the Elastic-net regression models


LASSO

min_{w ∈ R⁶} { F(w) = (1/2) Σ_{j=1}^{13} |yj − f(xj, w)|² + γ‖w‖₁ },

Elastic-net

min_{w ∈ R^(m+1)} { F(w) = (1/2) Σ_{j=1}^{13} |yj − f(xj, w)|² + γ2‖w‖₂² + γ1‖w‖₁ },

Both involve the ℓ1-regularization term ‖w‖₁, which is not
differentiable.
However, similar first-order optimality criteria can be formulated
using the concept of the sub-gradient, or by smoothing ‖w‖₁.
You may use
sklearn.linear_model.Lasso
sklearn.linear_model.ElasticNet
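For instance, a sketch fitting the same temperature data with scikit-learn (the polynomial degree, the values of alpha and l1_ratio, and max_iter below are illustrative choices; in practice the columns of A should also be scaled):

import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

x = np.arange(1, 14)
y = np.array([10.9, 19.8, 18.9, 19, 14.1, 10.9, 5.2, 3.7, 3.3, 5.3, 5.3, 10.5, 11.9])
A = np.vander(x, 6)                        # polynomial features x^5, ..., 1

lasso = Lasso(alpha=0.9, max_iter=100000).fit(A, y)
enet = ElasticNet(alpha=0.9, l1_ratio=0.5, max_iter=100000).fit(A, y)

# Predicted average temperature for June 2020 (month 14)
x_new = np.vander(np.array([14]), 6)
print(lasso.predict(x_new), enet.predict(x_new))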
3.2. Theoretical Foundation of Constrained Optimization
3.2.1. Constrained Optimization Problems

The standard form of a constrained optimization problem

(NLO)   min_{x ∈ R^n} f(x)                     (4)
        subject to:
        hi(x) = 0,  i = 1, . . . , m;          (5)
        gj(x) ≤ 0,  j = 1, . . . , q;          (6)

where f, hi, gj : R^n → R, i = 1, . . . , m; j = 1, . . . , q, are functions
which are at least once continuously differentiable.

f (x) is the objective function.


hi (x) = 0, i = 1, . . . , m - equality constraints
gj (x) ≤ 0, j = 1, . . . , q - inequality constraints
S = {x ∈ Rn | hi (x) = 0, i = 1, . . . , m; gj (x) ≤ 0, j =
1, . . . , q} - feasible set of (NLO).
3.2.1. Constrained Optimization Problems
Example

Suppose you are given a rectangular steel plate of width 10 m and


length 30 m

Objective: To construct a water tanker with maximum capacity


3.2.1. Constrained Optimization Problems
Example

Problem: What are the dimensions of an open-top water tanker with
maximum volume that can be constructed from the rectangular steel
plate of dimensions 10 m × 30 m?

(NLO)   max_{x ∈ R³} { V(x) = x1 x2 x3 }           (7)
        subject to:
        x1x2 + 2x2x3 + 2x1x3 = 300                  (8)
        x1 ≥ 0, x2 ≥ 0, x3 ≥ 0.                     (9)
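A numerical sketch for this problem (scipy's SLSQP solver is assumed; the starting point is an arbitrary guess):

import numpy as np
from scipy.optimize import minimize

# Maximize V = x1*x2*x3  <=>  minimize -V
obj = lambda x: -x[0]*x[1]*x[2]
cons = [{'type': 'eq', 'fun': lambda x: x[0]*x[1] + 2*x[1]*x[2] + 2*x[0]*x[2] - 300}]
bnds = [(0, None)]*3

res = minimize(obj, x0=[5.0, 5.0, 5.0], method='SLSQP', bounds=bnds, constraints=cons)
print(res.x, -res.fun)   # should approach dimensions (10, 10, 5) with volume 500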
3.2.1. Constrained Optimization Problems - Linear
Programming

The standard form of a linear programming (optimization) problem is

(LP)   min_{x ∈ R^n} f(x) = cᵀx + b                 (10)
       subject to:
       Ax = a;                                      (11)
       Bx ≤ b;                                      (12)

where
f(x) = cᵀx + b - linear objective function
Ax = a - linear equality constraints
Bx ≤ b - linear inequality constraints
S = {x ∈ R^n | Ax = a; Bx ≤ b} - feasible set of (LP).
3.2.1. Constrained Optimization Problems - Quadratic
Programming

The standard form of a quadratic programming (optimization) problem is

(QP)   min_{x ∈ R^n} f(x) = (1/2)xᵀQx + qᵀx          (13)
       subject to:
       Ax = a;                                       (14)
       Bx ≤ b;                                       (15)

where
f(x) = (1/2)xᵀQx + qᵀx - quadratic objective function
Ax = a - linear equality constraints
Bx ≤ b - linear inequality constraints
S = {x ∈ R^n | Ax = a; Bx ≤ b} - feasible set of (QP).
3.2.1. .... Constrained Optimization ...
Feasible Set
A point x ∈ Rn is called a feasible point of the NLP
if hi (x) = 0, i = 1, 2, . . . , p and
gj (x) ≤ 0, j = 1, 2, . . . , m.
Represent the set of all feasible points of the NLP by

S := {x ∈ Rn | hi (x) = 0, i = 1, . . . , p; gj (x) ≤ 0, j = 1, . . . , m} .

The set S is called the feasible set of the NLP.

Any point that lies outside the feasible set is infeasible (not
admissible) to the optimization problem.
Infeasible points are usually not considered during the
optimization process.
3.2.1. ... Constrained Optimization ...
Example 1:

(NLP1)   min_x { (1/2)x1² + x1x2² }
         s.t.
         x1x2² − 1 = 0,
         −x1² + x2 ≤ 0,
         x2 ≥ 0.
In this example
there is one equality constraint h1(x) = x1x2² − 1 and
two inequality constraints g1(x) = −x1² + x2 ≤ 0 and
g2(x) = −x2 ≤ 0.
Observe that x = (1, 1)ᵀ is a feasible point, while (0, 0)ᵀ is not
feasible; i.e., x = (0, 0)ᵀ does not belong to the feasible set
S = { x ∈ R² | x1x2² − 1 = 0, −x1² + x2 ≤ 0, x2 ≥ 0 }.
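A tiny sketch of such a feasibility check in Python (the tolerance is illustrative):

def is_feasible(x, tol=1e-9):
    x1, x2 = x
    h1 = x1*x2**2 - 1        # equality constraint
    g1 = -x1**2 + x2         # inequality constraints (must be <= 0)
    g2 = -x2
    return abs(h1) <= tol and g1 <= tol and g2 <= tol

print(is_feasible((1.0, 1.0)))   # True
print(is_feasible((0.0, 0.0)))   # False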
3.2.1. Introduction to Constrained Optimization ...
Example 2:

(NLP2)   min_x f(x) = 4(x1² + x2) + 50x3 − 10(x1 − x3)
         subject to:
         g1(x) = x1 − x3 ≥ 0
         g2(x) = x1 ≥ 200
         g3(x) = x2 ≥ 400
         x3 ≥ 0.
Note that:
all constraints are inequality constraints
The feasible set is
S = { x ∈ R³ | −x1 + x3 ≤ 0, x1 ≥ 200, x2 ≥ 400, x3 ≥ 0 }

Observe that the point x = (100, 400, 0) is infeasible, while
x = (200, 400, 0) is feasible.
3.2.1. ... Constrained Optimization ...
It is convenient to represent the constraints in the compact
form
h(x) = 0,  g(x) ≤ 0
using the vector representations
h(x) = ( h1(x), h2(x), . . . , hp(x) )ᵀ  and  g(x) = ( g1(x), g2(x), . . . , gm(x) )ᵀ.

Optimal solution (minimum point)


A point x ∗ ∈ Rn is an optimal solution of the constrained optimization
problem NLP if
(i) x ∗ is a feasible point of NLP; that is, x ∗ ∈ S.
(ii) f (x) ≥ f (x∗ ) for all x ∈ S.
3.1. Introduction to Constrained Optimization ...

• For NLP1, the point x ∗ = (1, 1)> is an optimal solution.

In general, it is not trivial to find an optimal solution of a
constrained optimization problem.

Questions
Q1: How do we verify that a point x ∈ R^n is an optimal solution to
the NLP? (We need optimality criteria.)
Q2: What methods are available to solve a constrained nonlinear
optimization problem?
( Methods for constrained optimization. )
3.2.1. ... Constrained Optimization ...

Definition (Active Constraints)


Let x be a feasible point of NLP. An inequality constraint
gi(x) ≤ 0 is an active constraint at x if

gi (x) = 0.

The set A(x) := {i ∈ {1, 2, . . . , m} | gi (x) = 0} is the index


set of active constraints at x.

For Example 1 above, the constraint g1(x) = −x1² + x2 is
active at x = (1, 1)ᵀ, but g2(x) = −x2 is not active. Hence,
A(x) = {1}.
For Example 2, the constraints g1(x) ≤ 0 and g3(x) ≤ 0 are
active at the point x = (300, 400, 300)ᵀ. Hence, we have
A(x) = {1, 3}.
3.2. Optimality Criteria for Constrained Optimization

Descent direction
A vector d is a descent direction to the objective function f at
the point x if
f (x + d) ≤ f (x).
A movement from the point x in the direction of the vector d reduces the value of the function f .

Every vector d with the property dᵀ∇f(x) < 0 is a descent
direction at x.
To verify this, use the first-order Taylor approximation: f(x + d) ≈ f(x) + dᵀ∇f(x),
which implies that

f(x + d) − f(x) ≈ dᵀ∇f(x) < 0 ⇒ f(x + d) − f(x) ≤ 0.

Hence, f(x + d) ≤ f(x) and d is a descent direction.

Note that: if d is a descent direction, the vector d̃ = αd, for α > 0, is also a
descent direction.
3.3. Optimality Criteria for Constrained Optimization
Feasible Direction
Let x be a feasible point (i.e., x ∈ S) and d a vector in R^n. If
(i) hi (x + d) = 0, i = 1, . . . , p and
(ii) gj (x + d) ≤ 0, j = 1, . . . , m,
then d is a feasible direction of NLP at the point x.
• Let x be a feasible point. If a vector d satisfies
dᵀ∇hi(x) = 0, i = 1, . . . , p  and  dᵀ∇gj(x) < 0, j ∈ A(x),

then d̃ = αd is a feasible direction at x for any sufficiently small α > 0.

To verify this, we use the first-order Taylor approximation:

i = 1, . . . , p :  hi(x + αd) ≈ hi(x) + α dᵀ∇hi(x) = 0 + 0,  so hi(x + d̃) = 0;

j ∈ A(x) :  gj(x + αd) ≈ gj(x) + α dᵀ∇gj(x) = 0 + α·(negative term),  so gj(x + d̃) ≤ 0;

j ∉ A(x) :  gj(x + αd) ≈ gj(x) + α dᵀ∇gj(x) with gj(x) < 0 (non-active constraint),
            so gj(x + d̃) ≤ 0 holds for 0 < α ≤ −gj(x)/(dᵀ∇gj(x)) whenever dᵀ∇gj(x) > 0.
3.3. Optimality Criteria ...

First-Order Optimality Criteria


If x∗ is an optimal solution of NLP, then there is no vector
d ∈ R^n which is both a descent direction and a feasible
direction at x∗.
That is, the system of inequalities

dᵀ∇f(x∗) < 0;                                                       (16)
dᵀ∇hi(x∗) = 0, i = 1, . . . , p;   dᵀ∇gj(x∗) < 0, j ∈ A(x∗)         (17)

has no solution. Equivalently,

[−∇f(x∗)]ᵀd > 0,
[∇hi(x∗)]ᵀd = 0, i = 1, . . . , p;   [∇gj(x∗)]ᵀd < 0, j ∈ A(x∗)

has no solution vector d.


3.3. Optimality Criteria ...

Theorem of Farkas (Theorem of Alternatives)


Given a set of vectors c, ai, bj ∈ R^n, i = 1, . . . , p; j = 1, . . . , m̃.
Then one and only one of the following systems has a solution:

System I:  cᵀd > 0;  aiᵀd = 0, i = 1, . . . , p;  bjᵀd < 0, j = 1, . . . , m̃.

System II: There are µ ∈ R^m̃ with µ > 0 and λ ∈ R^p such that

           c = Σ_{i=1}^{p} λi ai + Σ_{j=1}^{m̃} µj bj.
3.3. Optimality Criteria ...
Let now c = −∇f(x∗), ai = ∇hi(x∗), i = 1, . . . , p, and
bj = ∇gj(x∗), j ∈ A(x∗), with m̃ = #A(x∗). If x∗ is an
optimal solution to NLP, then System I has no solution, and
hence System II above has a solution.
Thus, x∗ being an optimal solution of NLP implies that there
are vectors λ ∈ R^p, λᵀ = (λ1, λ2, . . . , λp), and
µ ∈ R^m̃, µᵀ = (µ1, µ2, . . . , µm̃) > 0, so that

−∇f(x∗) = Σ_{i=1}^{p} λi ∇hi(x∗) + Σ_{j ∈ A(x∗)} µj ∇gj(x∗).

Furthermore, letting µj = 0 for j ∈ {1, . . . , m} \ A(x∗), we
can write

−∇f(x∗) = Σ_{i=1}^{p} λi ∇hi(x∗) + Σ_{j=1}^{m} µj ∇gj(x∗).
3.3. Optimality Criteria ...KKT Conditions

The Karush-Kuhn-Tucker (KKT) Optimality Criteria


If x∗ is a minimum point of NLP, then there are λ ∈ R^p and
µ ∈ R^m, µ ≥ 0, so that the following conditions are satisfied:

∇f(x∗) + Σ_{i=1}^{p} λi ∇hi(x∗) + Σ_{j=1}^{m} µj ∇gj(x∗) = 0     (optimality)
h(x∗) = 0,  g(x∗) ≤ 0                                            (feasibility)
µ ≥ 0                                                            (nonnegative multipliers)
µj gj(x∗) = 0, j = 1, . . . , m                                   (complementarity)
3.2. Optimality Criteria ...KKT Conditions...

Lagrange Function
The function

L(x, λ, µ) = f(x) + Σ_{i=1}^{p} λi hi(x) + Σ_{j=1}^{m} µj gj(x)

is called the Lagrange function of NLP.

Example: Solve the optimization problem

(NLP3)   min_x f(x) = x1² − x2²
         s.t.
         x1 + 2x2 + 1 = 0
         x1 − x2 ≤ 3.
3.2. Optimality Criteria ...KKT Conditions
Solution:
Lagrange function
L(x, λ, µ) = (x1² − x2²) + λ(x1 + 2x2 + 1) + µ(x1 − x2 − 3).

Optimality conditions

∂L/∂x1 = 0 ⇒ 2x1 + λ + µ = 0 ⇒ x1 = −(1/2)(λ + µ)          (18)
∂L/∂x2 = 0 ⇒ −2x2 + 2λ − µ = 0 ⇒ x2 = (1/2)(2λ − µ)        (19)

Feasibility (using x1 and x2 from equations (18) and (19), resp.)

h(x) = 0 ⇒ x1 + 2x2 + 1 = 0 ⇒ −(1/2)(λ + µ) + (2λ − µ) + 1 = 0
        ⇒ λ = µ − 2/3                                       (20)
3.2. Optimality Criteria ...KKT Conditions

Complementarity

µ g(x) = 0 ⇒ µ(x1 − x2 − 3) = 0 ⇒ µ[ −(1/2)(λ + µ) − (1/2)(2λ − µ) − 3 ] = 0
           ⇒ µ[ −(1/2)((µ − 2/3) + µ) − (1/2)(2(µ − 2/3) − µ) − 3 ] = 0
           ⇒ µ( −(3/2)µ − 2 ) = 0 ⇒ µ = 0 or µ = −4/3.

However, µ = −4/3 < 0 is not allowed. Hence, µ∗ = 0 is the only
remaining possibility. With this we obtain, using equation (20), that

λ∗ = µ∗ − 2/3 = −2/3.
3.2. Optimality Criteria ...KKT Conditions
Now using µ∗ = 0 and λ∗ = −2/3 we obtain

x1∗ = −(1/2)(λ∗ + µ∗) = −(1/2)(−2/3 + 0) = 1/3               (21)
x2∗ = (1/2)(2λ∗ − µ∗) = (1/2)(2 × (−2/3) − 0) = −2/3.        (22)

Consequently, the point x∗ = (1/3, −2/3)ᵀ is the only candidate for
a local minimum.

Note that the inequality constraint g(x) = x1 − x2 − 3 ≤ 0 is
not active at the point x∗.

• In general, the KKT conditions are only necessary optimality
conditions.
3.2. Optimality Criteria ...Sufficient Conditions
Sufficient optimality conditions
Suppose the functions f, hi, i = 1, . . . , p; gj, j = 1, . . . , m are twice
differentiable and x∗ is a feasible point of NLP. If there are
Lagrange multipliers λ∗ and µ∗ ≥ 0 such that:
(i) the KKT conditions are satisfied for (x∗, λ∗, µ∗); and
(ii) the Hessian matrix of the Lagrange function

∇xx L(x∗, λ∗, µ∗) = ∇²f(x∗) + Σ_{i=1}^{p} λi∗ ∇²hi(x∗) + Σ_{j=1}^{m} µj∗ ∇²gj(x∗)

is positive definite (i.e., dᵀ∇xx L(x∗, λ∗, µ∗)d > 0) for all d ≠ 0 from
the subspace

V = { d ∈ R^n | dᵀ∇hi(x∗) = 0, i = 1, . . . , p;  dᵀ∇gj(x∗) = 0 for j ∈ A(x∗) with µj∗ > 0 },

then x∗ is an optimal solution of the NLP.


3.2. Optimality Criteria ...KKT Conditions
For the example problem NLP3, at the stationary point
x∗ = (1/3, −2/3)ᵀ, we have g(x∗) = x1∗ − x2∗ − 3 < 0. That is,
g(x) ≤ 0 is inactive at x∗.
Hence, we have the subspace
V = {d ∈ R² | dᵀ∇h(x∗) = 0} = {d ∈ R² | (d1, d2)(1, 2)ᵀ = 0} = {d ∈ R² | d1 = −2d2}.
Hessian of the Lagrange function:

∇xx L(x∗, λ∗, µ∗) = [ 2   0 ]
                    [ 0  −2 ]

For d = (d1, d2)ᵀ ∈ V with d ≠ 0 (note that if d1 ≠ 0, then
d2 ≠ 0, and conversely) we have

dᵀ ∇xx L(x∗, λ∗, µ∗) d = 2d1² − 2d2² = 2(−2d2)² − 2d2² = 6d2² > 0.

Therefore, x∗ = (1/3, −2/3)ᵀ is an optimal solution of NLP3.
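As a numerical cross-check (a sketch; scipy assumed), one can hand NLP3 to a constrained solver and compare with the analytic solution x∗ = (1/3, −2/3)ᵀ:

import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0]**2 - x[1]**2
cons = [{'type': 'eq',   'fun': lambda x: x[0] + 2*x[1] + 1},    # x1 + 2x2 + 1 = 0
        {'type': 'ineq', 'fun': lambda x: 3 - (x[0] - x[1])}]    # x1 - x2 <= 3

res = minimize(f, x0=[0.0, 0.0], method='SLSQP', constraints=cons)
print(res.x)   # approx. [ 0.3333, -0.6667 ]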

3.2. Sufficiency of KKT for convex problems
If in the optimization problem

(NLP)   min_x f(x)
        s.t.
        hi(x) = 0, i = 1, 2, . . . , p;
        gj(x) ≤ 0, j = 1, 2, . . . , m,

f(x) is a convex function, each function hi(x) is a linear (affine) function, and each
gj(x) is a convex function, then
the feasible set

S = {x ∈ R^n | hi(x) = 0, i = 1, 2, . . . , p; gj(x) ≤ 0, j = 1, 2, . . . , m}

is a convex set;
the Lagrange function

L(x, λ, µ) = f(x) + Σ_{i=1}^{p} λi hi(x) + Σ_{j=1}^{m} µj gj(x)

is convex with respect to x (for µ ≥ 0); i.e., its Hessian with respect to x is positive semidefinite.


Hence, the satisfaction of the KKT conditions at x∗ is sufficient for x∗ to be a
minimum point of NLP.
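To make this concrete, here is a small sketch (the matrices Q, A and vectors q, a are made-up illustrative data) that solves an equality-constrained convex QP, min (1/2)xᵀQx + qᵀx s.t. Ax = a, by solving its KKT conditions, which in this case form a linear system in (x, λ):

import numpy as np

# Illustrative convex QP data: Q positive definite, one linear equality constraint
Q = np.array([[2.0, 0.0], [0.0, 4.0]])
q = np.array([-2.0, -8.0])
A = np.array([[1.0, 1.0]])
a = np.array([4.0])

# KKT conditions: Q x + q + A^T lambda = 0 and A x = a  =>  one linear system
K = np.block([[Q, A.T], [A, np.zeros((1, 1))]])
rhs = np.concatenate([-q, a])
sol = np.linalg.solve(K, rhs)
x, lam = sol[:2], sol[2:]
print(x, lam)   # x approx. [1.667, 2.333], lambda approx. [-1.333]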
3.2. Optimality Criteria ...KKT Conditions
Example:
A two-bar truss consists of:
- bars of lengths L and L/cos(α)
- the cross-sectional areas of the bars are A1 and A2
- the material density of the bars is ρ, with Young's modulus E
- a force F is applied vertically at the intersection of the bars
and causes a displacement D
- the angle between the bars is α = 30°
3.2. Optimality Criteria ...KKT Conditions
Objective: To determine the cross-sectional areas A1 and A2 of the bars that minimize
the total weight

W(A1, A2) = ρL( (2/√3)A1 + A2 )

of the truss under stress constraints on both bars, |σi| ≤ σ0, i = 1, 2, and the
displacement constraint D ≤ D0 = σ0L/E.

Optimization problem:

(NLP)   min_{A1,A2} W(A1, A2) = ρL( (2/√3)A1 + A2 )              (23)
        subject to
        displacement constraint:  F( 8/(√3 A1) + 3/A2 ) ≤ σ0      (24)
        stress constraint bar 1:  −σ0 ≤ 2F/A1 ≤ σ0                (25)
        stress constraint bar 2:  −σ0 ≤ √3 F/A2 ≤ σ0              (26)
        A1 ≥ 0, A2 ≥ 0.                                           (27)
3.2. Optimality Criteria ...KKT Conditions
In standard form:

(NLP)   min_{A1,A2} W(A1, A2) = ρL( (2/√3)A1 + A2 )              (28)
        subject to:  F( 8/(√3 A1) + 3/A2 ) − σ0 ≤ 0               (29)
                     2F/A1 − σ0 ≤ 0                               (30)
                     −2F/A1 − σ0 ≤ 0                              (31)
                     √3 F/A2 − σ0 ≤ 0                             (32)
                     −√3 F/A2 − σ0 ≤ 0                            (33)
                     A1 ≥ 0, A2 ≥ 0.                              (34)
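A numerical sketch of this design problem (the values of F, σ0, ρ and L below are purely hypothetical, chosen only to make the script runnable; the constraints follow the standard form above, and the lower stress bounds (31) and (33) hold automatically for positive A1, A2 and F):

import numpy as np
from scipy.optimize import minimize

# Hypothetical data (not from the slides); units are arbitrary but consistent
F_load, sigma0, rho, L = 10.0, 100.0, 1.0, 1.0

W = lambda A: rho*L*((2/np.sqrt(3))*A[0] + A[1])     # total weight
# scipy 'ineq' means fun(A) >= 0, so each constraint g(A) <= 0 is written as -g(A) >= 0
cons = [
    {'type': 'ineq', 'fun': lambda A: sigma0 - F_load*(8/(np.sqrt(3)*A[0]) + 3/A[1])},
    {'type': 'ineq', 'fun': lambda A: sigma0 - 2*F_load/A[0]},
    {'type': 'ineq', 'fun': lambda A: sigma0 - np.sqrt(3)*F_load/A[1]},
]
bnds = [(1e-6, None), (1e-6, None)]                  # A1 > 0, A2 > 0

res = minimize(W, x0=[1.0, 1.0], method='SLSQP', bounds=bnds, constraints=cons)
print(res.x, res.fun)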

Major difficulties in Optimization


In general, in optimization problems, major difficulties come from inequality
constraints.
