
Data science for Engineers

Optimization for Data Science

Unconstrained nonlinear optimization


Constrained nonlinear optimization
Connections to data science

Optimization for Data Science 1


Data science for Engineers

Three pillars of data science

[Diagram: DATA SCIENCE rests on three pillars: LINEAR ALGEBRA, OPTIMIZATION and STATISTICS.]

Optimization for Data Science 2


Data science for Engineers

Fundamentals of optimization

What is optimization?

"An optimization problem consists of maximizing or minimizing a real function by
systematically choosing input values from within an allowed set and computing the
value of the function."*

*http://en.wikipedia.org/wiki/Mathematical_optimization

Optimization for Data Science 3


Data science for Engineers

What is optimization?
… the use of specific methods to determine the "best" solution to a problem
◦ Find the best functional representation for data
◦ Find the best hyperplane to classify data

Optimization for Data Science 4


Data science for Engineers

Why optimization for machine learning


 (Almost) all machine learning (ML) algorithms can be viewed as solutions to
optimization problems
◦ Even in cases where the original machine learning technique has a basis derived
from other fields
 A basic understanding of optimization approaches helps us
◦ Understand the working of the ML algorithm more deeply
◦ Rationalize the workings of the algorithm
◦ And (maybe!) develop new algorithms ourselves

Optimization for Data Science 5


Data science for Engineers

Components of an optimization problem

Objective function
◦ We look at minimization problems
Decision variables
Constraints

Optimization for Data Science 6


Data science for Engineers

Types of optimization problems


 Depending on the type of objective function, constraints
and decision variables
◦ Linear programming problem
◦ Nonlinear programming problem
 Convex vs Non-convex
◦ Integer programming problem (linear and nonlinear)
◦ Mixed integer linear programming problem
◦ Mixed integer nonlinear programming problem

Optimization for Data Science 7


Nonlinear Optimization

UNCONSTRAINED CASE

8
Data science for Engineers

Univariate Optimization – Local and Global Optimum


Univariate optimization

    min_x f(x),   x ∈ ℝ

Decision variable: x ∈ ℝ      Objective function: f(x)

[Figure: plot of f(x) showing a local minimum at x₁* and the global minimum at x₂*;
the minimizer is x* and the corresponding minimum value of the objective is f*.]
Optimization for Data Science 9
Data science for Engineers

Univariate Optimization – Conditions for Local Optimum


Univariate optimization

    min_x f(x),   x ∈ ℝ

Approximate f(x) as a quadratic function using a Taylor series at a point xᵏ:

    f(x) ≈ f(xᵏ) + (1/1!) f′(xᵏ)(x − xᵏ) + (1/2!) f″(xᵏ)(x − xᵏ)²

When xᵏ = x*:

    f(x) ≈ f(x*) + (1/1!) f′(x*)(x − x*) + (1/2!) f″(x*)(x − x*)²

Since x* is a minimizer, f′(x*) = 0, which leaves

    f(x) − f(x*) ≈ (1/2!) f″(x*)(x − x*)²

The left-hand side is positive and (x − x*)² is always positive, so f″(x*) has to be positive.
Optimization for Data Science 10
Data science for Engineers

Univariate Optimization – Summary


Univariate optimization

    min_x f(x),   x ∈ ℝ

Necessary and sufficient conditions for x* to be the minimizer of the function f(x):

First order necessary condition: f′(x*) = 0

Second order sufficiency condition: f″(x*) > 0

Optimization for Data Science 11


Data science for Engineers

Univariate Optimization – Numerical Example


    min_x f(x),   f(x) = 3x⁴ − 4x³ − 12x² + 3

First order condition:
    f′(x) = 12x³ − 12x² − 24x = 0
          = 12x(x² − x − 2) = 0
          = 12x(x + 1)(x − 2) = 0
    ⇒ x = 0, x = −1, x = 2

Second order condition:
    f″(x) = 36x² − 24x − 24
    f″(0) = −24 < 0
    f″(−1) = 36 > 0
    f″(2) = 72 > 0

    f(−1) = −2        f(2) = −29

x* = −1 is a local minimizer of f(x);  x* = 2 is the global minimizer of f(x).
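A quick numerical check of this example (a sketch added here, not from the original slides, assuming NumPy is available): find the stationary points as the roots of f′(x) and evaluate f″ at each of them.

    # Python sketch: verify the stationary points of f(x) = 3x^4 - 4x^3 - 12x^2 + 3
    import numpy as np

    f = np.poly1d([3, -4, -12, 0, 3])   # coefficients from x^4 down to the constant
    f1 = f.deriv()                      # f'(x)  = 12x^3 - 12x^2 - 24x
    f2 = f1.deriv()                     # f''(x) = 36x^2 - 24x - 24

    for x in sorted(np.roots(f1.coeffs).real):
        print(f"x = {x:5.2f}   f(x) = {f(x):7.2f}   f''(x) = {f2(x):6.1f}")
    # x = -1: f'' = 36 > 0 (local minimum);  x = 0: f'' = -24 < 0 (local maximum);
    # x =  2: f'' = 72 > 0 (global minimum, f(2) = -29)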


Optimization for Data Science 12
Nonlinear Optimization

UNCONSTRAINED
MULTIVARIATE OPTIMIZATION

1
Data science for Engineers

Multivariate optimization – Contour plots


Multivariate optimization

    z = f(x₁, x₂, …, xₙ)

Example:  z = x₁² + x₂²

[Figure: surface plot and contour plot of z = x₁² + x₂²; the contours are circles,
with increasing values of z moving outward from the origin.]

The minimum value of the function is at (0, 0).


Optimization for data science 2
Data science for Engineers

Multivariate optimization – Local and global optimum


Multivariate optimization

Rastrigin function:

    f(x₁, x₂) = 20 + Σᵢ₌₁² [xᵢ² − 10 cos(2πxᵢ)]

[Figure: contour plot of the Rastrigin function.]

Global minimum at (0, 0)


http://en.wikipedia.org/wiki/Rastrigin_function
Optimization for data science 3
Data science for Engineers

Multivariate optimization – Key ideas


Multivariate optimization

    z = f(x₁, x₂, …, xₙ)        Example:  z = x₁² + x₂²

Gradient:

    ∇f = [∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ]ᵀ

Hessian:

    ∇²f = [ ∂²f/∂x₁²      ∂²f/∂x₁∂x₂    …    ∂²f/∂x₁∂xₙ
            ∂²f/∂x₂∂x₁    ∂²f/∂x₂²      …    ∂²f/∂x₂∂xₙ
            …
            ∂²f/∂xₙ∂x₁    ∂²f/∂xₙ∂x₂    …    ∂²f/∂xₙ²    ]

For z = x₁² + x₂² at the point (3, 2), the gradient, normalized to unit length, is

    [3/√13, 2/√13]ᵀ,    and the negative gradient direction is  [−3/√13, −2/√13]ᵀ

➢ Gradient of a function at a point is orthogonal to the contours


➢ Gradient points in the direction of greatest increase of the function
➢ Negative gradient points in the direction of the greatest decrease of the function
➢ Hessian is a symmetric matrix
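As a quick numerical illustration of the gradient and Hessian above (a sketch added here, assuming NumPy; the helper names grad and hessian are illustrative, not from the slides), central finite differences recover both quantities for z = x₁² + x₂² at the point (3, 2):

    # Python sketch: finite-difference gradient and Hessian of z = x1^2 + x2^2
    import numpy as np

    def f(x):
        return x[0]**2 + x[1]**2

    def grad(f, x, h=1e-5):
        g = np.zeros_like(x)
        for i in range(len(x)):
            e = np.zeros_like(x)
            e[i] = h
            g[i] = (f(x + e) - f(x - e)) / (2 * h)   # central difference
        return g

    def hessian(f, x, h=1e-4):
        n = len(x)
        H = np.zeros((n, n))
        for i in range(n):
            e = np.zeros(n)
            e[i] = h
            H[:, i] = (grad(f, x + e) - grad(f, x - e)) / (2 * h)
        return H

    x0 = np.array([3.0, 2.0])
    g = grad(f, x0)
    print(g)                       # ~[6, 4], the gradient at (3, 2)
    print(g / np.linalg.norm(g))   # ~[3/sqrt(13), 2/sqrt(13)], the unit ascent direction
    print(hessian(f, x0))          # ~[[2, 0], [0, 2]], symmetric as expected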

Optimization for data science 4


Data science for Engineers

Multivariate optimization – Conditions for local optimum


Multivariate optimization

Approximate f(x̄) as a quadratic using a Taylor series at a point x̄ᵏ:

    f(x̄) ≈ f(x̄ᵏ) + [∇f(x̄ᵏ)]ᵀ(x̄ − x̄ᵏ) + ½ (x̄ − x̄ᵏ)ᵀ ∇²f(x̄ᵏ)(x̄ − x̄ᵏ)

At x̄ᵏ = x̄* (the minimizer of f(x̄)), ∇f(x̄*) = 0, so

    f(x̄) ≈ f(x̄*) + ½ (x̄ − x̄*)ᵀ ∇²f(x̄*)(x̄ − x̄*)

    f(x̄) − f(x̄*) ≈ ½ (x̄ − x̄*)ᵀ ∇²f(x̄*)(x̄ − x̄*)

The left-hand side is positive, so the quadratic form on the right has to be positive.
Optimization for data science 5
Data science for Engineers

Multivariate optimization – Summary of conditions


Multivariate optimization

    (x̄ − x̄*)ᵀ ∇²f(x̄*)(x̄ − x̄*) > 0

    vᵀ ∇²f(x̄*) v > 0  for all v ≠ 0     (condition for the Hessian to be positive definite)

The Hessian matrix is said to be positive definite at a point if all the eigenvalues of
the Hessian matrix at that point are positive.
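A one-line check of this condition with NumPy (a sketch; the matrix used is the Hessian from the numerical example that follows):

    # Python sketch: positive definiteness via eigenvalues
    import numpy as np

    H = np.array([[8.0, -1.0],
                  [-1.0, 4.0]])
    eigvals = np.linalg.eigvalsh(H)          # eigenvalues of the symmetric matrix H
    print(eigvals, np.all(eigvals > 0))      # both eigenvalues positive -> positive definite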

Optimization for data science 6


Data science for Engineers

Overall Summary – Univariate and multivariate local optimum conditions


Univariate:   min_x f(x),   x ∈ ℝ

    Necessary condition for x* to be the minimizer:  f′(x*) = 0
    Sufficient condition:  f″(x*) > 0

Multivariate:   min_x̄ f(x̄),   x̄ ∈ ℝⁿ

    Necessary condition for x̄* to be the minimizer:  ∇f(x̄*) = 0
    Sufficient condition:  ∇²f(x̄*) has to be positive definite
Optimization for data science 7
Data science for Engineers

Multivariate optimization – Numerical example


Multivariate optimization

    min_{x₁,x₂}  x₁ + 2x₂ + 4x₁² − x₁x₂ + 2x₂²

First order condition:

    ∇f = [∂f/∂x₁, ∂f/∂x₂]ᵀ = [1 + 8x₁ − x₂,  2 − x₁ + 4x₂]ᵀ = 0

    Solving:  x₁* = −0.19,  x₂* = −0.54

Second order condition:

    ∇²f = [ ∂²f/∂x₁²     ∂²f/∂x₁∂x₂
            ∂²f/∂x₂∂x₁   ∂²f/∂x₂²   ]  =  [  8   −1
                                             −1    4 ]

    Eigenvalues:  λ₁ = 3.76,  λ₂ = 8.23  (both positive, so the Hessian is positive
    definite and (x₁*, x₂*) is a minimizer)
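A short NumPy verification of this example (a sketch, not from the slides): the first order conditions form a linear system, and the eigenvalues of the Hessian confirm that the stationary point is a minimizer.

    # Python sketch: solve grad f = 0 and check the Hessian
    import numpy as np

    # 1 + 8*x1 - x2 = 0 and 2 - x1 + 4*x2 = 0  ->  A @ x = b
    A = np.array([[8.0, -1.0],
                  [-1.0, 4.0]])
    b = np.array([-1.0, -2.0])
    x_star = np.linalg.solve(A, b)
    print(x_star)                    # approx. [-0.194, -0.548]

    H = A                            # Hessian of this quadratic objective
    print(np.linalg.eigvalsh(H))     # approx. [3.76, 8.24]; both > 0, so x_star is a minimizer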
Optimization for data science 8
Nonlinear Optimization

UNCONSTRAINED
MULTIVARIATE OPTIMIZATION

1
Data science for Engineers

Unconstrained multivariate optimization - Directional search

 Aim is to reach the bottom-most region
 Directions of descent
 Steepest descent
 Sometimes we might even want to climb the mountain for better prospects of getting
down further
Optimization for data science 2
Data science for Engineers

Unconstrained multivariate optimization - Descent direction and movement


 Iterative update (multivariate optimization):

    x̄ᵏ⁺¹ = x̄ᵏ + αᵏ s̄ᵏ

    x̄ᵏ: starting point      αᵏ: step length      s̄ᵏ: search direction

 Each iteration = search direction computation + step length computation
  (the step length computation is a univariate optimization)
 In ML techniques, this is called the learning rule
 In neural networks
◦ Back-propagation algorithm
◦ The same gradient descent, with application of the chain rule
 In clustering
◦ Minimization of a Euclidean distance norm
Optimization for data science 3
Data science for Engineers

Steepest descent and optimum step size


 Minimize f(x₁, x₂, …, xₙ) = f(x̄)
 Steepest descent
◦ At iteration k the starting point is x̄ᵏ
◦ Search direction: s̄ᵏ = negative of the gradient of f(x̄) = −∇f(x̄ᵏ)
◦ New point: x̄ᵏ⁺¹ = x̄ᵏ + αᵏ s̄ᵏ, where αᵏ is the value of α for which
  f(x̄ᵏ + α s̄ᵏ) = f(α) is a minimum (a univariate minimization; a sketch follows below)

[Figure: one steepest-descent step from x̄ᵏ to x̄ᵏ⁺¹ on the contour plot.]
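For a quadratic objective f(x̄) = ½ x̄ᵀQx̄ + cᵀx̄, the univariate minimization for αᵏ has a closed form, α = (gᵀg)/(gᵀQg) with g = ∇f(x̄ᵏ). The sketch below (illustrative, assuming NumPy; Q and c correspond to the quadratic used in the numerical example that follows) performs a few steepest-descent steps with this exact line search.

    # Python sketch: steepest descent with an exact line search on a quadratic
    import numpy as np

    def steepest_descent_step(Q, c, x):
        g = Q @ x + c                  # gradient of f at x
        s = -g                         # steepest-descent search direction
        alpha = (g @ g) / (g @ Q @ g)  # step length from the exact univariate minimization
        return x + alpha * s

    # f(X) = 4*x1^2 + 3*x1*x2 + 2.5*x2^2 - 5.5*x1 - 4*x2
    Q = np.array([[8.0, 3.0],
                  [3.0, 5.0]])
    c = np.array([-5.5, -4.0])

    x = np.array([2.0, 2.0])
    for k in range(5):
        x = steepest_descent_step(Q, c, x)
    print(x)    # approaches the optimum (0.5, 0.5)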

Optimization for data science 4


Numerical Example

GRADIENT (STEEPEST)
DESCENT (OR) LEARNING RULE

1
Data science for Engineers

    f(X) = 4x₁² + 3x₁x₂ + 2.5x₂² − 5.5x₁ − 4x₂

[Figure: surface plot of f(X) over x₁, x₂ ∈ [−5, 5].]
Optimization for data science 2
Data science for Engineers

    ∇f(X) = [8x₁ + 3x₂ − 5.5,  3x₁ + 5x₂ − 4]ᵀ

Constant objective function contour plots:
    f(X) = 4x₁² + 3x₁x₂ + 2.5x₂² − 5.5x₁ − 4x₂ = K    (quadratic in this case - ellipses)

Learning parameter α = 0.135
Initial guess:  X₀ = [2, 2]ᵀ,   f(X₀) = 19

Gradient descent (or) learning rule in ML

Step 1:  X₁ = X₀ − α ∇f(X₀)

    X₁ = [2, 2]ᵀ − 0.135 [8x₀,₁ + 3x₀,₂ − 5.5,  3x₀,₁ + 5x₀,₂ − 4]ᵀ
       = [2, 2]ᵀ − 0.135 [8(2) + 3(2) − 5.5,  3(2) + 5(2) − 4]ᵀ
       = [−0.2275, 0.3800]ᵀ,    f(X₁) = 0.0399

[Figure: contour plot showing the move from X₀ to X₁.]
Optimization for data science 3
Data science for Engineers

 -0.2275
First iteration ( X 1 ) =  
4
 0.3800 
3

X0
Step 2: X 2 = X 1 −  f ' ( X 1 )
2

1
 -0.2275 8 x1,1 + 3 x1,2 − 5.5 X2
X2 =   − 0.135  3 x + 5 x − 4 

X2
X1
 0.3800   1,1 1,2  0

 -0.2275 8 ( -0.2275 ) + 3 ( 0.3800 ) − 5.5 -1


X2 =   − 0.135  
 0.3800   3 ( -0.2275 ) + 5 ( 0.3800 ) − 4  -2

 0.6068
X2 =   f ( X 2 ) = − 2.0841 -3
 0.7556  -2 -1 0 1 2 3
X1

Optimization for data science 4


Data science for Engineers

 0.6068
Second iteration ( X 2 ) =   2
 0.7556 
1.5

Step 3: X 3 = X 2 −  f ' ( X 2 ) 1
X2
 0.6068 8 x2,1 + 3 x2,2 − 5.5

X2
0.5 X3
X3 =   − 0.135  3 x + 5 x − 4 
X1

 0.7556   2,1 2,2  0

 0.6068 8 ( 0.6068 ) + 3 ( 0.7556 ) − 5.5


X3 =   − 0.135   -0.5
 0.7556   3 ( 0.6068 ) + 5 ( 0.7556 ) − 4 
-1
 0.3879 
X3 =   f ( X 3 ) = − 2.3342
 0.5398 -1 -0.5 0 0.5 1 1.5
X1

Optimization for data science 5


Data science for Engineers

 0.3879 
Third iteration ( X3 ) =  
 0.5398
1.2

1
Step 4: X 4 = X 3 −  f ' ( X 3 )
0.8 X2

 0.3879  8 x3,1 + 3 x3,2 − 5.5 0.6


X4 =   − 0.135  3x + 5x − 4  X3
X4

X2
 0.5398   3,1 3,2  0.4
X1

 0.3879  8 ( 0.3879 ) + 3 ( 0.5398 ) − 5.5 0.2


X4 =   − 0.135  3 0.3879 + 5 0.5398 − 4 
 0.5398   ( ) ( )  0

-0.2
 0.4928
X4 =   f ( X 4 ) = − 2.3675 -0.4
 0.5583
-0.4 -0.2 0 0.2 0.4 0.6 0.8 1
X1
 0.5
Optimal solution ( X opti ) =   f ( X opti ) = − 2.3750
 0.5 
Gradient is zero at the optimum point
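The iterations above can be reproduced with a few lines of NumPy (a sketch, assuming the same fixed learning parameter α = 0.135 and starting point X₀ = (2, 2)):

    # Python sketch: gradient descent (learning rule) with a fixed step size
    import numpy as np

    def f(x):
        return 4*x[0]**2 + 3*x[0]*x[1] + 2.5*x[1]**2 - 5.5*x[0] - 4*x[1]

    def grad_f(x):
        return np.array([8*x[0] + 3*x[1] - 5.5,
                         3*x[0] + 5*x[1] - 4.0])

    alpha = 0.135
    x = np.array([2.0, 2.0])
    for k in range(1, 5):
        x = x - alpha * grad_f(x)              # X_k = X_{k-1} - alpha * grad f(X_{k-1})
        print(k, np.round(x, 4), round(f(x), 4))
    # 1 [-0.2275  0.38  ] 0.0399
    # 2 [ 0.6068  0.7556] -2.0841
    # 3 [ 0.3879  0.5398] -2.3342
    # 4 [ 0.4928  0.5583] -2.3675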

Optimization for data science 6


MULTIVARIATE OPTIMIZATION
WITH EQUALITY CONSTRAINTS

1
Data science for Engineers

Fundamentals of optimization
Multivariate optimization with constraints
    min_{x₁,x₂}  2x₁² + 4x₂²
    st  3x₁ + 2x₂ = 12

[Figure: contour plot of the objective with the line 3x₁ + 2x₂ = 12. All points on this
line represent the feasible region. The unconstrained minimum is not the same as the
constrained minimum.]
Optimization for data science 2
Data science for Engineers

Fundamentals of optimization
Multivariate optimization with constraints
    min_{x₁,x₂}  2x₁² + 4x₂²
    st  3x₁ + 2x₂ ≤ 12

[Figure: contour plot with the line 3x₁ + 2x₂ = 12; the feasible region is the half-plane
3x₁ + 2x₂ ≤ 12, which contains the origin. The unconstrained minimum is the same as the
constrained minimum.]
Optimization for data science 3
Data science for Engineers

Fundamentals of optimization
Multivariate optimization with constraints
    min_{x₁,x₂}  2x₁² + 4x₂²
    st  3x₁ + 2x₂ ≥ 12

[Figure: contour plot with the line 3x₁ + 2x₂ = 12; the feasible region is the half-plane
3x₁ + 2x₂ ≥ 12, which does not contain the unconstrained minimum.]

Where does the optimum of the constrained function lie?
Optimization for data science 4
Data science for Engineers

Fundamentals of optimization
Multivariate optimization with equality constraints

At the optimum (one equality constraint case):

    −∇f(x̄*) = λ* ∇h(x̄*)

In higher dimensions, and when there is more than one equality constraint:

    −∇f(x̄*) = Σᵢ₌₁ˡ λᵢ* ∇hᵢ(x̄*)

The gradient of the objective lies in the space spanned by the gradients of the
constraints (the normals to the constraint surfaces).

Optimization for data science 5


Data science for Engineers

Fundamentals of optimization
Multivariate optimization
    min_{x₁,x₂}  2x₁² + 4x₂²
    st  3x₁ + 2x₂ − 12 = 0

First order condition:

    L(x₁, x₂, λ) = 2x₁² + 4x₂² − λ(3x₁ + 2x₂ − 12)

    ∇L = 0:
    ∂L/∂x₁ = 4x₁ − 3λ = 0
    ∂L/∂x₂ = 8x₂ − 2λ = 0
    ∂L/∂λ  = −(3x₁ + 2x₂ − 12) = 0

    Solving:  x₁* = 3.27,  x₂* = 1.09,  λ* = 4.36
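Because the stationarity conditions above are linear in (x₁, x₂, λ), they can be solved directly; a small NumPy sketch (illustrative only):

    # Python sketch: solve the equality-constrained first order conditions
    import numpy as np

    # 4*x1 - 3*lam = 0,  8*x2 - 2*lam = 0,  3*x1 + 2*x2 = 12
    A = np.array([[4.0, 0.0, -3.0],
                  [0.0, 8.0, -2.0],
                  [3.0, 2.0,  0.0]])
    b = np.array([0.0, 0.0, 12.0])
    x1, x2, lam = np.linalg.solve(A, b)
    print(x1, x2, lam)    # approx. 3.27, 1.09, 4.36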


Optimization for data science 6


MULTIVARIATE OPTIMIZATION WITH
INEQUALITY CONSTRAINTS

1
Data science for Engineers

Fundamentals of optimization
Multivariate optimization with constraints
    min_{x₁,x₂}  2x₁² + 4x₂²
    st  3x₁ + 2x₂ ≤ 12

[Figure: contour plot with the line 3x₁ + 2x₂ = 12; the feasible region is the half-plane
3x₁ + 2x₂ ≤ 12, which contains the origin. The unconstrained minimum is the same as the
constrained minimum.]
Optimization for data science 2
Data science for Engineers

Fundamentals of optimization
Multivariate optimization with constraints
    min_{x₁,x₂}  2x₁² + 4x₂²
    st  3x₁ + 2x₂ ≥ 12

[Figure: contour plot with the line 3x₁ + 2x₂ = 12; the feasible region is the half-plane
3x₁ + 2x₂ ≥ 12, which does not contain the unconstrained minimum.]

Where does the optimum of the constrained function lie?
Optimization for data science 3
Data science for Engineers

General formulation
Multivariate optimization

    min_x̄  f(x̄)
    st
    hᵢ(x̄) = 0,   i = 1, …, m
    gⱼ(x̄) ≤ 0,   j = 1, 2, …, l

Necessary condition for x̄* to be the minimizer:

    The KKT conditions have to be satisfied

Sufficient condition:

    ∇²L(x̄*) has to be positive definite

Optimization for data science 4


Data science for Engineers

Summary – KKT conditions


Multivariate optimization
When both equality and inequality constraints are present, at the optimum we have the
KKT (Karush-Kuhn-Tucker) conditions.

Gradient of the "Lagrangian function" at x̄* is zero:

    ∇f(x̄*) + Σᵢ₌₁ᵐ λᵢ* ∇hᵢ(x̄*) + Σⱼ₌₁ˡ μⱼ* ∇gⱼ(x̄*) = 0

    where  L(x̄, λ, μ) = f(x̄) + Σᵢ₌₁ᵐ λᵢ hᵢ(x̄) + Σⱼ₌₁ˡ μⱼ gⱼ(x̄)

    hᵢ(x̄*) = 0,  i = 1, …, m        Ensures that the optimum satisfies the equality constraints
    λᵢ* ∈ ℝ,  i = 1, …, m
    gⱼ(x̄*) ≤ 0,  j = 1, …, l        Ensures that the optimum is in the feasible region
    μⱼ* gⱼ(x̄*) = 0                   Complementary slackness
    μⱼ* ≥ 0,  j = 1, …, l           No possibility of improvement near the active constraints

Optimization for data science 5


Data science for Engineers

Summary – KKT conditions


Multivariate optimization

➢ In general it is more difficult to use the KKT conditions to solve for the optimum of an
inequality-constrained problem (than for a problem with equality constraints only), because
we do not know a priori which constraints are active at the optimum.

➢ This makes it a combinatorial problem.

➢ KKT conditions are used to verify that a point we have reached is a candidate optimal solution.

➢ Given a point, it is easy to check which constraints are binding (see the sketch below).
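As a small illustration (an assumed setup using the quadratic program introduced on the following slides, with the constraints written as gⱼ(x) ≤ 0), the sketch below evaluates the constraints at a candidate point and reports which of them are binding:

    # Python sketch: check which constraints are binding at a given point
    import numpy as np

    def g(x):
        return np.array([3*x[0] + 2*x[1] - 12,    # (a)  3*x1 + 2*x2 <= 12
                         10 - 2*x[0] - 5*x[1],    # (b)  2*x1 + 5*x2 >= 10
                         x[0] - 1])               # (c)  x1 <= 1

    x = np.array([1.0, 1.6])
    vals = g(x)
    print(vals)                    # approx. [-5.8  0.  0.]
    print(np.isclose(vals, 0.0))   # [False  True  True] -> constraints (b) and (c) are binding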

Optimization for data science 6


Data science for Engineers

Fundamentals of optimization
Multivariate optimization-quadratic programming

    min_{x₁,x₂}  2x₁² + 4x₂²
    st
    3x₁ + 2x₂ ≤ 12
    2x₁ + 5x₂ ≥ 10
    x₁ ≤ 1

[Figure: the feasible region bounded by the three constraints, with the line x₁ = 1 marked.]
Optimization for data science 7


Data science for Engineers

Fundamentals of optimization
Multivariate optimization-quadratic programming
    min_{x₁,x₂}  2x₁² + 4x₂²
    st
    3x₁ + 2x₂ ≤ 12    … (a)
    2x₁ + 5x₂ ≥ 10    … (b)
    x₁ ≤ 1            … (c)

 Lagrangian

    L(x₁, x₂, μ₁, μ₂, μ₃) = 2x₁² + 4x₂² + μ₁(3x₁ + 2x₂ − 12) + μ₂(10 − 2x₁ − 5x₂) + μ₃(x₁ − 1)

 First order KKT conditions

    4x₁ + 3μ₁ − 2μ₂ + μ₃ = 0
    8x₂ + 2μ₁ − 5μ₂ = 0
    μ₁(3x₁ + 2x₂ − 12) = 0
    μ₂(10 − 2x₁ − 5x₂) = 0
    μ₃(x₁ − 1) = 0
    μᵢ ≥ 0

Optimization for data science 8


Data science for Engineers

Fundamentals of optimization
Multivariate optimization-quadratic programming
Constraints marked A = active, I = inactive.

Sl. | (a)  (b)  (c) | Solution (x, μ)                             | Possible optimum? | Remark
----+---------------+---------------------------------------------+-------------------+---------------------------------------------
 1  |  A    A    A  | Infeasible                                  |        N          | Equations do not have a valid solution
 2  |  A    A    I  | x = (3.6364, 0.5455), μ = (−5.2, −1.45, 0)  |        N          | x₁ ≤ 1 is not satisfied; μ₁ < 0, μ₂ < 0
 3  |  A    I    A  | x = (1, 4.5), μ = (−18, 0, 50)              |        N          | μ₁ < 0
 4  |  I    A    A  | x = (1, 1.6), μ = (0, 2.56, 1.12)           |        Y          | All constraints and KKT conditions satisfied
 5  |  A    I    I  | x = (3.27, 1.09), μ = (−4.36, 0, 0)         |        N          | x₁ ≤ 1 is not satisfied
 6  |  I    A    I  | x = (1.21, 1.51), μ = (0, 2.45, 0)          |        N          | x₁ ≤ 1 is not satisfied
 7  |  I    I    A  | x = (1, 0), μ = (0, 0, −4)                  |        N          | 2x₁ + 5x₂ ≥ 10 is not satisfied
 8  |  I    I    I  | x = (0, 0), μ = (0, 0, 0)                   |        N          | 2x₁ + 5x₂ ≥ 10 is not satisfied
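The enumeration in the table can be reproduced with a short NumPy sketch (illustrative; it assumes the constraint directions stated above and solves the KKT equations for every active/inactive combination, keeping only combinations that are feasible with μ ≥ 0):

    # Python sketch: active-set enumeration for the quadratic program
    import numpy as np
    from itertools import product

    # objective f = 2*x1^2 + 4*x2^2; constraints as g_j(x) = G[j] @ x - r[j] <= 0
    G = np.array([[3.0, 2.0],     # g1 = 3*x1 + 2*x2 - 12
                  [-2.0, -5.0],   # g2 = 10 - 2*x1 - 5*x2
                  [1.0, 0.0]])    # g3 = x1 - 1
    r = np.array([12.0, -10.0, 1.0])

    for active in product([False, True], repeat=3):
        idx = [j for j in range(3) if active[j]]
        n = 2 + len(idx)                    # unknowns: x1, x2 and one mu per active constraint
        A = np.zeros((n, n))
        b = np.zeros(n)
        A[0, 0], A[1, 1] = 4.0, 8.0         # stationarity: grad f + sum mu_j * grad g_j = 0
        for k, j in enumerate(idx):
            A[0:2, 2 + k] = G[j]
            A[2 + k, 0:2] = G[j]            # active constraint holds with equality
            b[2 + k] = r[j]
        try:
            sol = np.linalg.solve(A, b)
        except np.linalg.LinAlgError:
            continue                        # singular system: no valid solution for this combination
        x, mu = sol[:2], sol[2:]
        if np.all(G @ x - r <= 1e-9) and np.all(mu >= -1e-9):
            print("candidate optimum:", np.round(x, 2), "mu:", np.round(mu, 2))
    # prints only case 4: candidate optimum: [1.  1.6] mu: [2.56 1.12]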

Optimization for data science 9


Data science for Engineers

Fundamentals of optimization
Solution for each case

[Figure: contour plot marking the candidate solution from each case in the table above;
the actual optimum is the case 4 point, x = (1, 1.6).]

Optimization for data science 10
