Topics on High-Dimensional Data Analytics
Optimization Methods
Kamran Paynabar, Ph.D.
Associate Professor
School of Industrial and Systems Engineering
Introduction
Learning Objectives
• Define optimization
• Identify applications of optimization in
analytics
• Differentiate between types of optimization problems
• Define convex optimization
What is Optimization?
• Most statistical and machine learning models are (or can be presented) in the form of an optimization problem. For example:
1. Maximum Likelihood Estimation (see the sketch after this list)
2. Regularization/penalization
5. Matrix completion
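As a sketch of the first item (not from the slides), maximum likelihood estimation can be posed as minimizing a negative log-likelihood. The data y and the (mu, log sigma) parameterization below are assumptions added for illustration:

% MLE as an optimization problem: fit a normal distribution to data y by
% minimizing the negative log-likelihood (additive constants dropped).
y = normrnd(3, 2, 100, 1);                  % assumed sample from N(3, 2^2)
% theta = [mu; log(sigma)]; the log-parameterization keeps sigma positive
negloglik = @(theta) sum( theta(2) + (y - theta(1)).^2 ./ (2*exp(2*theta(2))) );
est = fminsearch(negloglik, [0; 0]);        % minimize with a derivative-free method
mu_hat = est(1), sigma_hat = exp(est(2))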
Types of Optimization
• Continuous vs. Discrete
• Convex vs. non-convex optimization
• Combinatorial optimization and (mixed) Integer programming
Convex Optimization
• A convex optimization problem is one in which the objective function is convex and the feasible set is convex. A function f is convex if f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) for all x, y and all λ ∈ [0, 1].
• For convex problems, every local minimum is a global minimum, which makes them considerably easier to solve than non-convex problems.
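As a quick illustration, a twice-differentiable function is convex exactly when its Hessian is positive semidefinite, which can be checked numerically. The quadratic below is the same f used in the gradient-descent example later in this module:

% Convexity check for a quadratic objective: f is convex iff its Hessian is
% positive semidefinite (all eigenvalues nonnegative).
H = [8 -4; -4 4];    % Hessian of f(x) = 4*x1^2 + 2*x2^2 - 4*x1*x2
eig(H)               % both eigenvalues are positive (about 1.53 and 10.47), so f is strictly convex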
Learning Objectives
• Understand first-order methods
• Use gradient descent
• Use accelerated algorithms
• Use stochastic gradient descent
Gradient Descent
• Assume f is a continuous and twice-differentiable function and we would like to solve min_x f(x).
• An intuitive approach to this problem is to start from an initial point and iteratively move in the direction that decreases the value of f, i.e., the negative gradient:
  x^(k) = x^(k−1) − t ∇f(x^(k−1)), where t is the step size.
Gradient Descent
% Gradient descent with a fixed step size t = 0.01 on
% f(x) = 4*x1^2 + 2*x2^2 - 4*x1*x2 (converges in about 224 iterations)
x = normrnd(0,1,2,1);                       % random starting point
k = 1;
tol = 0.0001;
grad = [8*x(1)-4*x(2); -4*x(1)+4*x(2)];     % gradient of f at x
while grad'*grad > tol
    x = x - 0.01*grad;
    grad = [8*x(1)-4*x(2); -4*x(1)+4*x(2)];
    k = k + 1;
end

% Gradient descent with a decaying step size, starting at t = 0.1 and shrinking
% by a factor of 0.9999 each iteration (converges in about 31 iterations,
% e.g., x = [0.0030; 0.0049] with f = 2.5678e-05)
x = normrnd(0,1,2,1);
k = 1;
tol = 0.0001;
t = 0.1;
grad = [8*x(1)-4*x(2); -4*x(1)+4*x(2)];
f = 4*x(1)^2 + 2*x(2)^2 - 4*x(1)*x(2);      % keep track of the objective value
while grad'*grad > tol
    x = x - t*grad;
    t = t*0.9999;
    grad = [8*x(1)-4*x(2); -4*x(1)+4*x(2)];
    f = [f; 4*x(1)^2 + 2*x(2)^2 - 4*x(1)*x(2)];
    k = k + 1;
end
Convergence Rate of GD
• Convergence Rate: how quickly the sequence obtained by an algorithm
converges to the desired point. For example,
  f(x^(k)) − f(x*) ≤ ε_k,
  where the rate at which ε_k shrinks with k measures the speed of convergence.
• The convergence rate for GD is O(1/k), which is sublinear.
• Accelerated Gradient Descent (AGD) adds a momentum (extrapolation) step,
  y^(k) = x^(k) + ((k − 1)/(k + 2)) (x^(k) − x^(k−1)),
  and takes the gradient step at y^(k) instead of x^(k) (a sketch follows below).
• The convergence rate for AGD is O(1/k²), which is optimal for first-order methods.
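A minimal MATLAB sketch of the momentum update above, applied to the quadratic from the gradient-descent example; the step size t = 0.05 is an assumption chosen below 1/L, where L is the largest eigenvalue of the Hessian:

% Accelerated (Nesterov-style) gradient descent on f(x) = 4*x1^2 + 2*x2^2 - 4*x1*x2
gradf = @(x) [8*x(1)-4*x(2); -4*x(1)+4*x(2)];   % gradient of f
x = normrnd(0,1,2,1);  xprev = x;
t = 0.05;  tol = 0.0001;  k = 1;
g = gradf(x);
while g'*g > tol
    y = x + ((k-1)/(k+2))*(x - xprev);   % momentum (extrapolation) step
    xprev = x;
    x = y - t*gradf(y);                  % gradient step taken at the extrapolated point
    g = gradf(x);
    k = k + 1;
end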
Stochastic Gradient Descent
• Suppose the objective is a sum over n observations, F(x) = Σᵢ fᵢ(x). The GD algorithm calculates the gradient of all n functions at every iteration, which is quite expensive for large n. Stochastic Gradient Descent (SGD), however, randomly selects one function (observation) at a time to update the estimate of the minimizer of F.
% SGD: estimate the mean of y by minimizing F(x) = sum_i (y(i) - x)^2.
% The data y, sample size n, and number of passes K below are assumptions
% added so the snippet runs on its own.
n = 1000;                                 % assumed number of observations
K = 50;                                   % assumed number of passes over the data
y = normrnd(5,1,n,1);                     % assumed data with true mean 5
x = normrnd(0,1,1,1);                     % random starting point
k = 1;
while k <= K
    for i = 1:n
        s = randsample(n,1);              % pick one observation at random
        x = x - 0.0001*(-2*(y(s)-x));     % step along the gradient of (y(s)-x)^2
    end
    k = k + 1;
end
Learning Objectives
• Understand second-order methods
• Use the Newton method
• Use the Gauss-Newton method
• Use the quasi-Newton (BFGS) algorithm
Second-Order Methods
• Assume f is a continuous and twice-differentiable function and we would like to solve min_x f(x).
• ∇f(x) = ∂f(x)/∂x is the gradient vector.
• ∇²f(x) is the Hessian matrix: the n × n matrix whose (i, j) entry is ∂²f(x)/(∂x_i ∂x_j).
• Second-order Taylor expansion around the current iterate x^(k):
  f(x) ≈ f(x^(k)) + ∇f(x^(k))ᵀ(x − x^(k)) + (1/2)(x − x^(k))ᵀ ∇²f(x^(k))(x − x^(k)).
  Second-order methods pick the next iterate by minimizing this quadratic approximation.
Newton Method
• To find a root of a function, i.e., to solve f(x) = 0, the Newton method iterates
  x^(k) = x^(k−1) − f(x^(k−1)) / f′(x^(k−1)).
• To minimize f, the same idea is applied to the equation ∇f(x) = 0: solve ∇²f(x^(k−1)) Ω = −∇f(x^(k−1)) for the step Ω and set
  x^(k) = x^(k−1) + αΩ, starting with α = 1.
  While f(x^(k)) > f(x^(k−1)), decrease the step size α and recompute x^(k).
  Return x^(k) once convergence is reached.
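A minimal MATLAB sketch of the Newton update for minimization, applied to the quadratic from the gradient-descent example (for a quadratic the Hessian is constant and a single Newton step reaches the minimizer, so the loop exits almost immediately):

% Newton's method for minimizing f(x) = 4*x1^2 + 2*x2^2 - 4*x1*x2
gradf = @(x) [8*x(1)-4*x(2); -4*x(1)+4*x(2)];   % gradient of f
H = [8 -4; -4 4];                                % Hessian of f (constant for a quadratic)
x = normrnd(0,1,2,1);                            % random starting point
tol = 0.0001;
g = gradf(x);
while g'*g > tol
    x = x - H \ g;     % Newton step: x <- x - inv(Hessian)*gradient
    g = gradf(x);
end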
Gauss-Newton Method
• If the function f(x) is a sum of squares (a quadratic form in g(x)), the Gauss-Newton (GN) method can be used as an approximation to the Newton method.
• GN does not require computing the Hessian matrix, which can sometimes be difficult.
• Assume f(x) is a continuous function with
  f(x) = g(x)ᵀ g(x),
  where g(x) = [g₁(x), …, g_d(x)]ᵀ ∈ ℝᵈ.
• Define J_g(x) as the d × n Jacobian matrix whose (i, j) entry is ∂gᵢ(x)/∂xⱼ.
• The gradient and Hessian of f(x) are
  ∇f(x) = 2 J_g(x)ᵀ g(x),
  ∇²f(x) ≈ 2 J_g(x)ᵀ J_g(x).
Gauss-Newton Algorithm (Adaptive Step Size)
Initialize x^(0), f(x), ∇f(x), ∇²f(x), step size α = 1, damping factor λ = 10⁻¹⁰, and tolerance ε.
Repeat until convergence (‖Ω‖∞ < ε)
    Solve (J_g(x)ᵀ J_g(x) + λI) Ω = −J_g(x)ᵀ g(x) for Ω
        (the damping term λ makes the parabola "steeper" around the current x)
    k = k + 1
    x^(k) = x^(k−1) + αΩ
    While f(x^(k)) > f(x^(k−1))
        α = 0.1 α   (decrease the step size)
        x^(k) = x^(k−1) + αΩ
    end
    α = α / 0.5   (increase the step size)
end
Return x^(k)
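As an illustration of this algorithm (not from the slides), a minimal MATLAB sketch fits the one-parameter model y ≈ exp(b·t) by nonlinear least squares with the step size fixed at α = 1 for brevity; the data t, y and the true value b = 0.5 are assumptions:

% Gauss-Newton sketch: minimize f(b) = g(b)'*g(b) with residuals g(b) = y - exp(b*t)
t = (0:0.1:2)';                              % assumed design points
y = exp(0.5*t) + normrnd(0, 0.05, size(t));  % assumed noisy observations
b = 0;  lambda = 1e-10;  epsilon = 1e-8;     % starting value, damping, tolerance
Omega = Inf;
while abs(Omega) >= epsilon
    g = y - exp(b*t);                        % residual vector g(b)
    J = -t .* exp(b*t);                      % Jacobian of g with respect to b
    Omega = -(J'*J + lambda) \ (J'*g);       % damped Gauss-Newton step
    b = b + Omega;                           % full step (alpha = 1)
end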
Quasi-Newton Methods
• Quasi-Newton methods are used when the Hessian matrix cannot be computed analytically. Instead, it is approximated numerically from the data and gradients of previous iterations, i.e., {x^(i), ∇f(x^(i))}, i = 1, …, k.
• For the 1D case, using the data from iterations 1 and 2, we have
  ∇²f(x) ≈ (∇f(x^(2)) − ∇f(x^(1))) / (x^(2) − x^(1)).
• For the nD case,
  ∇²f(x) ≈ [(∇f(x^(2)) − ∇f(x^(1))) (∇f(x^(2)) − ∇f(x^(1)))ᵀ] / [(∇f(x^(2)) − ∇f(x^(1)))ᵀ (x^(2) − x^(1))],
  [∇²f(x)]⁻¹ ≈ [(x^(2) − x^(1)) (x^(2) − x^(1))ᵀ] / [(x^(2) − x^(1))ᵀ (∇f(x^(2)) − ∇f(x^(1)))].
Broyden-Fletcher-Goldfarb-Shanno (BFGS)
Initialize x^(0), f(x), ∇f(x), step size α = 1, (H^(0))⁻¹ = I, and tolerance ε.
Repeat until convergence (‖Ω‖∞ < ε)
    Compute Ω = −(H^(k))⁻¹ ∇f(x^(k))
    Solve min_α f(x + αΩ) using a line search method
    Ω = αΩ
    x^(k+1) = x^(k) + Ω
    y = ∇f(x^(k+1)) − ∇f(x^(k))
    Update (H^(k+1))⁻¹ = (I − Ωyᵀ/(Ωᵀy)) (H^(k))⁻¹ (I − yΩᵀ/(Ωᵀy)) + ΩΩᵀ/(Ωᵀy)
    k = k + 1
end
Return x^(k)
Example: min f(x1, x2) = e^(x1−1) + e^(−x2+1) + (x1 − x2)²

∇f(x1, x2) = [ e^(x1−1) + 2(x1 − x2) ;  −e^(−x2+1) − 2(x1 − x2) ]
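A minimal MATLAB sketch of BFGS applied to this example; the starting point and the simple backtracking rule used in place of an exact line search are assumptions:

% BFGS for f(x1,x2) = exp(x1-1) + exp(-x2+1) + (x1-x2)^2
f     = @(x) exp(x(1)-1) + exp(-x(2)+1) + (x(1)-x(2))^2;
gradf = @(x) [ exp(x(1)-1) + 2*(x(1)-x(2));
              -exp(-x(2)+1) - 2*(x(1)-x(2)) ];
x = [0; 0];  Hinv = eye(2);  epsilon = 1e-6;      % assumed starting point
g = gradf(x);
while norm(g, Inf) >= epsilon
    Omega = -Hinv*g;                    % search direction
    alpha = 1;                          % backtracking: shrink alpha until f decreases
    while f(x + alpha*Omega) > f(x)
        alpha = 0.5*alpha;
    end
    Omega = alpha*Omega;
    xnew  = x + Omega;
    y     = gradf(xnew) - g;
    rho   = 1/(y'*Omega);
    Hinv  = (eye(2) - rho*(Omega*y'))*Hinv*(eye(2) - rho*(y*Omega')) + rho*(Omega*Omega');
    x = xnew;  g = gradf(x);
end
% x converges to the minimizer, roughly (0.80, 1.20)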