
Data science for Engineers

Optimization for Data Science

Unconstrained nonlinear optimization


Constrained nonlinear optimization
Connections to data science

Optimization for Data Science 1


Data science for Engineers

Three pillars of data science

[Diagram: DATA SCIENCE rests on three pillars: LINEAR ALGEBRA, OPTIMIZATION and STATISTICS.]

Optimization for Data Science 2


Data science for Engineers

Fundamentals of optimization

What is optimization?

"An optimization problem consists of maximizing or minimizing a real function by
systematically choosing input values from within an allowed set and computing the
value of the function."*

*http://en.wikipedia.org/wiki/Mathematical_optimization

Optimization for Data Science 3


Data science for Engineers

What is optimization?
… the use of specific methods to determine the "best" solution to a problem
◦ Find the best functional representation for data
◦ Find the best hyperplane to classify data

Optimization for Data Science 4


Data science for Engineers

Why optimization for machine learning


 (Almost) all machine learning (ML) algorithms can be viewed as solutions to
optimization problems
◦ Even in cases where the original machine learning technique has a basis derived
from other fields
 A basic understanding of optimization approaches helps us
◦ Understand the working of the ML algorithm more deeply
◦ Rationalize the workings of the algorithm
◦ And (maybe!) develop new algorithms ourselves

Optimization for Data Science 5


Data science for Engineers

Components of an optimization problem

Objective function
◦ We look at minimization problems
Decision variables
Constraints

Optimization for Data Science 6


Data science for Engineers

Types of optimization problems


 Depending on the type of objective function, constraints
and decision variables
◦ Linear programming problem
◦ Nonlinear programming problem
 Convex vs Non-convex
◦ Integer programming problem (linear and nonlinear)
◦ Mixed integer linear programming problem
◦ Mixed integer nonlinear programming problem

Optimization for Data Science 7


Nonlinear Optimization

UNCONSTRAINED CASE

8
Data science for Engineers

Univariate Optimization – Local and Global Optimum


Univariate optimization

    min_x f(x),   x ∈ ℝ

Decision variable: x ∈ ℝ      Objective function: f(x)

[Figure: plot of f(x) showing a local minimum at x₁* and the global minimum at x₂*;
the minimizer is x* and the corresponding minimum value of the objective is f*.]
Optimization for Data Science 9
Data science for Engineers

Univariate Optimization – Conditions for Local Optimum


Univariate optimization

    min_x f(x),   x ∈ ℝ

Approximate f(x) as a quadratic function using a Taylor series at a point xᵏ:

    f(x) ≈ f(xᵏ) + (1/1!) f′(xᵏ)(x − xᵏ) + (1/2!) f″(xᵏ)(x − xᵏ)²

When xᵏ = x*:

    f(x) ≈ f(x*) + (1/1!) f′(x*)(x − x*) + (1/2!) f″(x*)(x − x*)²

Since x* is a minimizer, f′(x*) = 0, which leaves

    f(x) − f(x*) ≈ (1/2!) f″(x*)(x − x*)²

The left-hand side is positive and (x − x*)² is always positive, so f″(x*) has to be positive.
Optimization for Data Science 10
Data science for Engineers

Univariate Optimization – Summary


Univariate optimization

    min_x f(x),   x ∈ ℝ

Necessary and sufficient conditions for x* to be the minimizer of the function f(x):

First order necessary condition: f′(x*) = 0

Second order sufficiency condition: f″(x*) > 0

Optimization for Data Science 11


Data science for Engineers

Univariate Optimization – Numerical Example


    min_x f(x),   f(x) = 3x⁴ − 4x³ − 12x² + 3

First order condition:
    f′(x) = 12x³ − 12x² − 24x = 0
          = 12x(x² − x − 2) = 0
          = 12x(x + 1)(x − 2) = 0
    ⇒ x = 0, x = −1, x = 2

Second order condition:
    f″(x) = 36x² − 24x − 24
    f″(0) = −24 < 0
    f″(−1) = 36 > 0
    f″(2) = 72 > 0

    f(−1) = −2        f(2) = −29

x* = −1 is a local minimizer of f(x);  x* = 2 is the global minimizer of f(x).
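A quick numerical check of this example (a sketch added here, not from the original slides, assuming NumPy is available): find the stationary points as the roots of f′(x) and evaluate f″ at each of them.

    # Python sketch: verify the stationary points of f(x) = 3x^4 - 4x^3 - 12x^2 + 3
    import numpy as np

    f = np.poly1d([3, -4, -12, 0, 3])   # coefficients from x^4 down to the constant
    f1 = f.deriv()                      # f'(x)  = 12x^3 - 12x^2 - 24x
    f2 = f1.deriv()                     # f''(x) = 36x^2 - 24x - 24

    for x in sorted(np.roots(f1.coeffs).real):
        print(f"x = {x:5.2f}   f(x) = {f(x):7.2f}   f''(x) = {f2(x):6.1f}")
    # x = -1: f'' = 36 > 0 (local minimum);  x = 0: f'' = -24 < 0 (local maximum);
    # x =  2: f'' = 72 > 0 (global minimum, f(2) = -29)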


Optimization for Data Science 12
Nonlinear Optimization

UNCONSTRAINED
MULTIVARIATE OPTIMIZATION

1
Data science for Engineers

Multivariate optimization – Contour plots


Multivariate optimization

    z = f(x₁, x₂, …, xₙ)

Example:  z = x₁² + x₂²

[Figure: surface plot and contour plot of z = x₁² + x₂²; the contours are circles,
with increasing values of z moving outward from the origin.]

The minimum value of the function is at (0, 0).


Optimization for data science 2
Data science for Engineers

Multivariate optimization – Local and global optimum


Multivariate optimization

Rastrigin function:

    f(x₁, x₂) = 20 + Σᵢ₌₁² [xᵢ² − 10 cos(2πxᵢ)]

[Figure: contour plot of the Rastrigin function.]

Global minimum at (0, 0)


http://en.wikipedia.org/wiki/Rastrigin_function
Optimization for data science 3
Data science for Engineers

Multivariate optimization – Key ideas


Multivariate optimization

    z = f(x₁, x₂, …, xₙ)        Example:  z = x₁² + x₂²

Gradient:

    ∇f = [∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ]ᵀ

Hessian:

    ∇²f = [ ∂²f/∂x₁²      ∂²f/∂x₁∂x₂    …    ∂²f/∂x₁∂xₙ
            ∂²f/∂x₂∂x₁    ∂²f/∂x₂²      …    ∂²f/∂x₂∂xₙ
            …
            ∂²f/∂xₙ∂x₁    ∂²f/∂xₙ∂x₂    …    ∂²f/∂xₙ²    ]

For z = x₁² + x₂² at the point (3, 2), the gradient, normalized to unit length, is

    [3/√13, 2/√13]ᵀ,    and the negative gradient direction is  [−3/√13, −2/√13]ᵀ

➢ Gradient of a function at a point is orthogonal to the contours


➢ Gradient points in the direction of greatest increase of the function
➢ Negative gradient points in the direction of the greatest decrease of the function
➢ Hessian is a symmetric matrix
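As a quick numerical illustration of the gradient and Hessian above (a sketch added here, assuming NumPy; the helper names grad and hessian are illustrative, not from the slides), central finite differences recover both quantities for z = x₁² + x₂² at the point (3, 2):

    # Python sketch: finite-difference gradient and Hessian of z = x1^2 + x2^2
    import numpy as np

    def f(x):
        return x[0]**2 + x[1]**2

    def grad(f, x, h=1e-5):
        g = np.zeros_like(x)
        for i in range(len(x)):
            e = np.zeros_like(x)
            e[i] = h
            g[i] = (f(x + e) - f(x - e)) / (2 * h)   # central difference
        return g

    def hessian(f, x, h=1e-4):
        n = len(x)
        H = np.zeros((n, n))
        for i in range(n):
            e = np.zeros(n)
            e[i] = h
            H[:, i] = (grad(f, x + e) - grad(f, x - e)) / (2 * h)
        return H

    x0 = np.array([3.0, 2.0])
    g = grad(f, x0)
    print(g)                       # ~[6, 4], the gradient at (3, 2)
    print(g / np.linalg.norm(g))   # ~[3/sqrt(13), 2/sqrt(13)], the unit ascent direction
    print(hessian(f, x0))          # ~[[2, 0], [0, 2]], symmetric as expected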

Optimization for data science 4


Data science for Engineers

Multivariate optimization – Conditions for local optimum


Multivariate optimization

Approximate f(x̄) as a quadratic using a Taylor series at a point x̄ᵏ:

    f(x̄) ≈ f(x̄ᵏ) + [∇f(x̄ᵏ)]ᵀ(x̄ − x̄ᵏ) + ½ (x̄ − x̄ᵏ)ᵀ ∇²f(x̄ᵏ)(x̄ − x̄ᵏ)

At x̄ᵏ = x̄* (the minimizer of f(x̄)), ∇f(x̄*) = 0, so

    f(x̄) ≈ f(x̄*) + ½ (x̄ − x̄*)ᵀ ∇²f(x̄*)(x̄ − x̄*)

    f(x̄) − f(x̄*) ≈ ½ (x̄ − x̄*)ᵀ ∇²f(x̄*)(x̄ − x̄*)

The left-hand side is positive, so the quadratic form on the right has to be positive.
Optimization for data science 5
Data science for Engineers

Multivariate optimization – Summary of conditions


Multivariate optimization

    (x̄ − x̄*)ᵀ ∇²f(x̄*)(x̄ − x̄*) > 0

    vᵀ ∇²f(x̄*) v > 0  for all v ≠ 0     (condition for the Hessian to be positive definite)

The Hessian matrix is said to be positive definite at a point if all the eigenvalues of
the Hessian matrix at that point are positive.
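A one-line check of this condition with NumPy (a sketch; the matrix used is the Hessian from the numerical example that follows):

    # Python sketch: positive definiteness via eigenvalues
    import numpy as np

    H = np.array([[8.0, -1.0],
                  [-1.0, 4.0]])
    eigvals = np.linalg.eigvalsh(H)          # eigenvalues of the symmetric matrix H
    print(eigvals, np.all(eigvals > 0))      # both eigenvalues positive -> positive definite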

Optimization for data science 6


Data science for Engineers

Overall Summary – Univariate and multivariate local optimum conditions


Univariate:   min_x f(x),   x ∈ ℝ

    Necessary condition for x* to be the minimizer:  f′(x*) = 0
    Sufficient condition:  f″(x*) > 0

Multivariate:   min_x̄ f(x̄),   x̄ ∈ ℝⁿ

    Necessary condition for x̄* to be the minimizer:  ∇f(x̄*) = 0
    Sufficient condition:  ∇²f(x̄*) has to be positive definite
Optimization for data science 7
Data science for Engineers

Multivariate optimization – Numerical example


Multivariate optimization

    min_{x₁,x₂}  x₁ + 2x₂ + 4x₁² − x₁x₂ + 2x₂²

First order condition:

    ∇f = [∂f/∂x₁, ∂f/∂x₂]ᵀ = [1 + 8x₁ − x₂,  2 − x₁ + 4x₂]ᵀ = 0

    Solving:  x₁* = −0.19,  x₂* = −0.54

Second order condition:

    ∇²f = [ ∂²f/∂x₁²     ∂²f/∂x₁∂x₂
            ∂²f/∂x₂∂x₁   ∂²f/∂x₂²   ]  =  [  8   −1
                                             −1    4 ]

    Eigenvalues:  λ₁ = 3.76,  λ₂ = 8.23  (both positive, so the Hessian is positive
    definite and (x₁*, x₂*) is a minimizer)
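A short NumPy verification of this example (a sketch, not from the slides): the first order conditions form a linear system, and the eigenvalues of the Hessian confirm that the stationary point is a minimizer.

    # Python sketch: solve grad f = 0 and check the Hessian
    import numpy as np

    # 1 + 8*x1 - x2 = 0 and 2 - x1 + 4*x2 = 0  ->  A @ x = b
    A = np.array([[8.0, -1.0],
                  [-1.0, 4.0]])
    b = np.array([-1.0, -2.0])
    x_star = np.linalg.solve(A, b)
    print(x_star)                    # approx. [-0.194, -0.548]

    H = A                            # Hessian of this quadratic objective
    print(np.linalg.eigvalsh(H))     # approx. [3.76, 8.24]; both > 0, so x_star is a minimizer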
Optimization for data science 8
Nonlinear Optimization

UNCONSTRAINED
MULTIVARIATE OPTIMIZATION

1
Data science for Engineers

Unconstrained multivariate optimization - Directional search

 Aim is to reach the bottom-most region
 Directions of descent
 Steepest descent
 Sometimes we might even want to climb the mountain for better prospects of getting
down further
Optimization for data science 2
Data science for Engineers

Unconstrained multivariate optimization - Descent direction and movement


 Iterative update (multivariate optimization):

    x̄ᵏ⁺¹ = x̄ᵏ + αᵏ s̄ᵏ

    x̄ᵏ: starting point      αᵏ: step length      s̄ᵏ: search direction

 Each iteration = search direction computation + step length computation
  (the step length computation is a univariate optimization)
 In ML techniques, this is called the learning rule
 In neural networks
◦ Back-propagation algorithm
◦ The same gradient descent, with application of the chain rule
 In clustering
◦ Minimization of a Euclidean distance norm
Optimization for data science 3
Data science for Engineers

Steepest descent and optimum step size


 Minimize f(x₁, x₂, …, xₙ) = f(x̄)
 Steepest descent
◦ At iteration k the starting point is x̄ᵏ
◦ Search direction: s̄ᵏ = negative of the gradient of f(x̄) = −∇f(x̄ᵏ)
◦ New point: x̄ᵏ⁺¹ = x̄ᵏ + αᵏ s̄ᵏ, where αᵏ is the value of α for which
  f(x̄ᵏ + α s̄ᵏ) = f(α) is a minimum (a univariate minimization; a sketch follows below)

[Figure: one steepest-descent step from x̄ᵏ to x̄ᵏ⁺¹ on the contour plot.]
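For a quadratic objective f(x̄) = ½ x̄ᵀQx̄ + cᵀx̄, the univariate minimization for αᵏ has a closed form, α = (gᵀg)/(gᵀQg) with g = ∇f(x̄ᵏ). The sketch below (illustrative, assuming NumPy; Q and c correspond to the quadratic used in the numerical example that follows) performs a few steepest-descent steps with this exact line search.

    # Python sketch: steepest descent with an exact line search on a quadratic
    import numpy as np

    def steepest_descent_step(Q, c, x):
        g = Q @ x + c                  # gradient of f at x
        s = -g                         # steepest-descent search direction
        alpha = (g @ g) / (g @ Q @ g)  # step length from the exact univariate minimization
        return x + alpha * s

    # f(X) = 4*x1^2 + 3*x1*x2 + 2.5*x2^2 - 5.5*x1 - 4*x2
    Q = np.array([[8.0, 3.0],
                  [3.0, 5.0]])
    c = np.array([-5.5, -4.0])

    x = np.array([2.0, 2.0])
    for k in range(5):
        x = steepest_descent_step(Q, c, x)
    print(x)    # approaches the optimum (0.5, 0.5)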

Optimization for data science 4


Numerical Example

GRADIENT (STEEPEST)
DESCENT (OR) LEARNING RULE

1
Data science for Engineers

    f(X) = 4x₁² + 3x₁x₂ + 2.5x₂² − 5.5x₁ − 4x₂

[Figure: surface plot of f(X) over x₁, x₂ ∈ [−5, 5].]
Optimization for data science 2
Data science for Engineers

    ∇f(X) = [8x₁ + 3x₂ − 5.5,  3x₁ + 5x₂ − 4]ᵀ

Constant objective function contour plots:
    f(X) = 4x₁² + 3x₁x₂ + 2.5x₂² − 5.5x₁ − 4x₂ = K    (quadratic in this case - ellipses)

Learning parameter α = 0.135
Initial guess:  X₀ = [2, 2]ᵀ,   f(X₀) = 19

Gradient descent (or) learning rule in ML

Step 1:  X₁ = X₀ − α ∇f(X₀)

    X₁ = [2, 2]ᵀ − 0.135 [8x₀,₁ + 3x₀,₂ − 5.5,  3x₀,₁ + 5x₀,₂ − 4]ᵀ
       = [2, 2]ᵀ − 0.135 [8(2) + 3(2) − 5.5,  3(2) + 5(2) − 4]ᵀ
       = [−0.2275, 0.3800]ᵀ,    f(X₁) = 0.0399

[Figure: contour plot showing the move from X₀ to X₁.]
Optimization for data science 3
Data science for Engineers

 -0.2275
First iteration ( X 1 ) =  
4
 0.3800 
3

X0
Step 2: X 2 = X 1 −  f ' ( X 1 )
2

1
 -0.2275 8 x1,1 + 3 x1,2 − 5.5 X2
X2 =   − 0.135  3 x + 5 x − 4 

X2
X1
 0.3800   1,1 1,2  0

 -0.2275 8 ( -0.2275 ) + 3 ( 0.3800 ) − 5.5 -1


X2 =   − 0.135  
 0.3800   3 ( -0.2275 ) + 5 ( 0.3800 ) − 4  -2

 0.6068
X2 =   f ( X 2 ) = − 2.0841 -3
 0.7556  -2 -1 0 1 2 3
X1

Optimization for data science 4


Data science for Engineers

 0.6068
Second iteration ( X 2 ) =   2
 0.7556 
1.5

Step 3: X 3 = X 2 −  f ' ( X 2 ) 1
X2
 0.6068 8 x2,1 + 3 x2,2 − 5.5

X2
0.5 X3
X3 =   − 0.135  3 x + 5 x − 4 
X1

 0.7556   2,1 2,2  0

 0.6068 8 ( 0.6068 ) + 3 ( 0.7556 ) − 5.5


X3 =   − 0.135   -0.5
 0.7556   3 ( 0.6068 ) + 5 ( 0.7556 ) − 4 
-1
 0.3879 
X3 =   f ( X 3 ) = − 2.3342
 0.5398 -1 -0.5 0 0.5 1 1.5
X1

Optimization for data science 5


Data science for Engineers

 0.3879 
Third iteration ( X3 ) =  
 0.5398
1.2

1
Step 4: X 4 = X 3 −  f ' ( X 3 )
0.8 X2

 0.3879  8 x3,1 + 3 x3,2 − 5.5 0.6


X4 =   − 0.135  3x + 5x − 4  X3
X4

X2
 0.5398   3,1 3,2  0.4
X1

 0.3879  8 ( 0.3879 ) + 3 ( 0.5398 ) − 5.5 0.2


X4 =   − 0.135  3 0.3879 + 5 0.5398 − 4 
 0.5398   ( ) ( )  0

-0.2
 0.4928
X4 =   f ( X 4 ) = − 2.3675 -0.4
 0.5583
-0.4 -0.2 0 0.2 0.4 0.6 0.8 1
X1
 0.5
Optimal solution ( X opti ) =   f ( X opti ) = − 2.3750
 0.5 
Gradient is zero at the optimum point
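The iterations above can be reproduced with a few lines of NumPy (a sketch, assuming the same fixed learning parameter α = 0.135 and starting point X₀ = (2, 2)):

    # Python sketch: gradient descent (learning rule) with a fixed step size
    import numpy as np

    def f(x):
        return 4*x[0]**2 + 3*x[0]*x[1] + 2.5*x[1]**2 - 5.5*x[0] - 4*x[1]

    def grad_f(x):
        return np.array([8*x[0] + 3*x[1] - 5.5,
                         3*x[0] + 5*x[1] - 4.0])

    alpha = 0.135
    x = np.array([2.0, 2.0])
    for k in range(1, 5):
        x = x - alpha * grad_f(x)              # X_k = X_{k-1} - alpha * grad f(X_{k-1})
        print(k, np.round(x, 4), round(f(x), 4))
    # 1 [-0.2275  0.38  ] 0.0399
    # 2 [ 0.6068  0.7556] -2.0841
    # 3 [ 0.3879  0.5398] -2.3342
    # 4 [ 0.4928  0.5583] -2.3675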

Optimization for data science 6


MULTIVARIATE OPTIMIZATION
WITH EQUALITY CONSTRAINTS

1
Data science for Engineers

Fundamentals of optimization
Multivariate optimization with constraints
    min_{x₁,x₂}  2x₁² + 4x₂²
    st  3x₁ + 2x₂ = 12

[Figure: contour plot of the objective with the line 3x₁ + 2x₂ = 12. All points on this
line represent the feasible region. The unconstrained minimum is not the same as the
constrained minimum.]
Optimization for data science 2
Data science for Engineers

Fundamentals of optimization
Multivariate optimization with constraints
    min_{x₁,x₂}  2x₁² + 4x₂²
    st  3x₁ + 2x₂ ≤ 12

[Figure: contour plot with the line 3x₁ + 2x₂ = 12; the feasible region is the half-plane
3x₁ + 2x₂ ≤ 12, which contains the origin. The unconstrained minimum is the same as the
constrained minimum.]
Optimization for data science 3
Data science for Engineers

Fundamentals of optimization
Multivariate optimization with constraints
    min_{x₁,x₂}  2x₁² + 4x₂²
    st  3x₁ + 2x₂ ≥ 12

[Figure: contour plot with the line 3x₁ + 2x₂ = 12; the feasible region is the half-plane
3x₁ + 2x₂ ≥ 12, which does not contain the unconstrained minimum.]

Where does the optimum of the constrained function lie?
Optimization for data science 4
Data science for Engineers

Fundamentals of optimization
Multivariate optimization with equality constraints

At the optimum (one equality constraint case):

    −∇f(x̄*) = λ* ∇h(x̄*)

In higher dimensions, and when there is more than one equality constraint:

    −∇f(x̄*) = Σᵢ₌₁ˡ λᵢ* ∇hᵢ(x̄*)

The gradient of the objective lies in the space spanned by the gradients of the
constraints (the normals to the constraint surfaces).

Optimization for data science 5


Data science for Engineers

Fundamentals of optimization
Multivariate optimization
    min_{x₁,x₂}  2x₁² + 4x₂²
    st  3x₁ + 2x₂ − 12 = 0

First order condition:

    L(x₁, x₂, λ) = 2x₁² + 4x₂² − λ(3x₁ + 2x₂ − 12)

    ∇L = 0:
    ∂L/∂x₁ = 4x₁ − 3λ = 0
    ∂L/∂x₂ = 8x₂ − 2λ = 0
    ∂L/∂λ  = −(3x₁ + 2x₂ − 12) = 0

    Solving:  x₁* = 3.27,  x₂* = 1.09,  λ* = 4.36
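Because the stationarity conditions above are linear in (x₁, x₂, λ), they can be solved directly; a small NumPy sketch (illustrative only):

    # Python sketch: solve the equality-constrained first order conditions
    import numpy as np

    # 4*x1 - 3*lam = 0,  8*x2 - 2*lam = 0,  3*x1 + 2*x2 = 12
    A = np.array([[4.0, 0.0, -3.0],
                  [0.0, 8.0, -2.0],
                  [3.0, 2.0,  0.0]])
    b = np.array([0.0, 0.0, 12.0])
    x1, x2, lam = np.linalg.solve(A, b)
    print(x1, x2, lam)    # approx. 3.27, 1.09, 4.36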


Optimization for data science 6


MULTIVARIATE OPTIMIZATION WITH
INEQUALITY CONSTRAINTS

1
Data science for Engineers

Fundamentals of optimization
Multivariate optimization with constraints
    min_{x₁,x₂}  2x₁² + 4x₂²
    st  3x₁ + 2x₂ ≤ 12

[Figure: contour plot with the line 3x₁ + 2x₂ = 12; the feasible region is the half-plane
3x₁ + 2x₂ ≤ 12, which contains the origin. The unconstrained minimum is the same as the
constrained minimum.]
Optimization for data science 2
Data science for Engineers

Fundamentals of optimization
Multivariate optimization with constraints
    min_{x₁,x₂}  2x₁² + 4x₂²
    st  3x₁ + 2x₂ ≥ 12

[Figure: contour plot with the line 3x₁ + 2x₂ = 12; the feasible region is the half-plane
3x₁ + 2x₂ ≥ 12, which does not contain the unconstrained minimum.]

Where does the optimum of the constrained function lie?
Optimization for data science 3
Data science for Engineers

General formulation
Multivariate optimization

    min_x̄  f(x̄)
    st
    hᵢ(x̄) = 0,   i = 1, …, m
    gⱼ(x̄) ≤ 0,   j = 1, 2, …, l

Necessary condition for x̄* to be the minimizer:

    The KKT conditions have to be satisfied

Sufficient condition:

    ∇²L(x̄*) has to be positive definite

Optimization for data science 4


Data science for Engineers

Summary – KKT conditions


Multivariate optimization
When both equality and inequality constraints are present, at the optimum we have the
KKT (Karush-Kuhn-Tucker) conditions.

Gradient of the "Lagrangian function" at x̄* is zero:

    ∇f(x̄*) + Σᵢ₌₁ᵐ λᵢ* ∇hᵢ(x̄*) + Σⱼ₌₁ˡ μⱼ* ∇gⱼ(x̄*) = 0

    where  L(x̄, λ, μ) = f(x̄) + Σᵢ₌₁ᵐ λᵢ hᵢ(x̄) + Σⱼ₌₁ˡ μⱼ gⱼ(x̄)

    hᵢ(x̄*) = 0,  i = 1, …, m        Ensures that the optimum satisfies the equality constraints
    λᵢ* ∈ ℝ,  i = 1, …, m
    gⱼ(x̄*) ≤ 0,  j = 1, …, l        Ensures that the optimum is in the feasible region
    μⱼ* gⱼ(x̄*) = 0                   Complementary slackness
    μⱼ* ≥ 0,  j = 1, …, l           No possibility of improvement near the active constraints

Optimization for data science 5


Data science for Engineers

Summary – KKT conditions


Multivariate optimization

➢ In general it is more difficult to use the KKT conditions to solve for the optimum of an
inequality-constrained problem (than for a problem with equality constraints only), because
we do not know a priori which constraints are active at the optimum.

➢ This makes it a combinatorial problem.

➢ KKT conditions are used to verify that a point we have reached is a candidate optimal solution.

➢ Given a point, it is easy to check which constraints are binding (see the sketch below).
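As a small illustration (an assumed setup using the quadratic program introduced on the following slides, with the constraints written as gⱼ(x) ≤ 0), the sketch below evaluates the constraints at a candidate point and reports which of them are binding:

    # Python sketch: check which constraints are binding at a given point
    import numpy as np

    def g(x):
        return np.array([3*x[0] + 2*x[1] - 12,    # (a)  3*x1 + 2*x2 <= 12
                         10 - 2*x[0] - 5*x[1],    # (b)  2*x1 + 5*x2 >= 10
                         x[0] - 1])               # (c)  x1 <= 1

    x = np.array([1.0, 1.6])
    vals = g(x)
    print(vals)                    # approx. [-5.8  0.  0.]
    print(np.isclose(vals, 0.0))   # [False  True  True] -> constraints (b) and (c) are binding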

Optimization for data science 6


Data science for Engineers

Fundamentals of optimization
Multivariate optimization-quadratic programming

    min_{x₁,x₂}  2x₁² + 4x₂²
    st
    3x₁ + 2x₂ ≤ 12
    2x₁ + 5x₂ ≥ 10
    x₁ ≤ 1

[Figure: the feasible region bounded by the three constraints, with the line x₁ = 1 marked.]
Optimization for data science 7


Data science for Engineers

Fundamentals of optimization
Multivariate optimization-quadratic programming
    min_{x₁,x₂}  2x₁² + 4x₂²
    st
    3x₁ + 2x₂ ≤ 12    … (a)
    2x₁ + 5x₂ ≥ 10    … (b)
    x₁ ≤ 1            … (c)

 Lagrangian

    L(x₁, x₂, μ₁, μ₂, μ₃) = 2x₁² + 4x₂² + μ₁(3x₁ + 2x₂ − 12) + μ₂(10 − 2x₁ − 5x₂) + μ₃(x₁ − 1)

 First order KKT conditions

    4x₁ + 3μ₁ − 2μ₂ + μ₃ = 0
    8x₂ + 2μ₁ − 5μ₂ = 0
    μ₁(3x₁ + 2x₂ − 12) = 0
    μ₂(10 − 2x₁ − 5x₂) = 0
    μ₃(x₁ − 1) = 0
    μᵢ ≥ 0

Optimization for data science 8


Data science for Engineers

Fundamentals of optimization
Multivariate optimization-quadratic programming
Constraints marked A = active, I = inactive.

Sl. | (a)  (b)  (c) | Solution (x, μ)                             | Possible optimum? | Remark
----+---------------+---------------------------------------------+-------------------+---------------------------------------------
 1  |  A    A    A  | Infeasible                                  |        N          | Equations do not have a valid solution
 2  |  A    A    I  | x = (3.6364, 0.5455), μ = (−5.2, −1.45, 0)  |        N          | x₁ ≤ 1 is not satisfied; μ₁ < 0, μ₂ < 0
 3  |  A    I    A  | x = (1, 4.5), μ = (−18, 0, 50)              |        N          | μ₁ < 0
 4  |  I    A    A  | x = (1, 1.6), μ = (0, 2.56, 1.12)           |        Y          | All constraints and KKT conditions satisfied
 5  |  A    I    I  | x = (3.27, 1.09), μ = (−4.36, 0, 0)         |        N          | x₁ ≤ 1 is not satisfied
 6  |  I    A    I  | x = (1.21, 1.51), μ = (0, 2.45, 0)          |        N          | x₁ ≤ 1 is not satisfied
 7  |  I    I    A  | x = (1, 0), μ = (0, 0, −4)                  |        N          | 2x₁ + 5x₂ ≥ 10 is not satisfied
 8  |  I    I    I  | x = (0, 0), μ = (0, 0, 0)                   |        N          | 2x₁ + 5x₂ ≥ 10 is not satisfied
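The enumeration in the table can be reproduced with a short NumPy sketch (illustrative; it assumes the constraint directions stated above and solves the KKT equations for every active/inactive combination, keeping only combinations that are feasible with μ ≥ 0):

    # Python sketch: active-set enumeration for the quadratic program
    import numpy as np
    from itertools import product

    # objective f = 2*x1^2 + 4*x2^2; constraints as g_j(x) = G[j] @ x - r[j] <= 0
    G = np.array([[3.0, 2.0],     # g1 = 3*x1 + 2*x2 - 12
                  [-2.0, -5.0],   # g2 = 10 - 2*x1 - 5*x2
                  [1.0, 0.0]])    # g3 = x1 - 1
    r = np.array([12.0, -10.0, 1.0])

    for active in product([False, True], repeat=3):
        idx = [j for j in range(3) if active[j]]
        n = 2 + len(idx)                    # unknowns: x1, x2 and one mu per active constraint
        A = np.zeros((n, n))
        b = np.zeros(n)
        A[0, 0], A[1, 1] = 4.0, 8.0         # stationarity: grad f + sum mu_j * grad g_j = 0
        for k, j in enumerate(idx):
            A[0:2, 2 + k] = G[j]
            A[2 + k, 0:2] = G[j]            # active constraint holds with equality
            b[2 + k] = r[j]
        try:
            sol = np.linalg.solve(A, b)
        except np.linalg.LinAlgError:
            continue                        # singular system: no valid solution for this combination
        x, mu = sol[:2], sol[2:]
        if np.all(G @ x - r <= 1e-9) and np.all(mu >= -1e-9):
            print("candidate optimum:", np.round(x, 2), "mu:", np.round(mu, 2))
    # prints only case 4: candidate optimum: [1.  1.6] mu: [2.56 1.12]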

Optimization for data science 9


Data science for Engineers

Fundamentals of optimization
Solution for each case

[Figure: contour plot marking the candidate solution from each case in the table above;
the actual optimum is the case 4 point, x = (1, 1.6).]

Optimization for data science 10
