
ECS171: Machine Learning

Lecture 4: Optimization (LFD 3.3, SGD)

Cho-Jui Hsieh
UC Davis

Jan 22, 2018


Gradient descent
Optimization

Goal: find the minimizer of a function

min_w f (w)

For now we assume f is twice differentiable


Machine learning algorithm: find the hypothesis that minimizes Ein
Convex vs Nonconvex

Convex function:
∇f (w*) = 0 ⇔ w* is a global minimum
A function is convex if ∇²f (w) is positive semi-definite for all w
Examples: linear regression, logistic regression, ...
Non-convex function:
∇f (w*) = 0 ⇔ w* is a global minimum, local minimum, or saddle point
Most algorithms only guarantee convergence to a point with gradient = 0
Examples: neural networks, ...
Gradient Descent

Gradient descent: repeatedly do

w^{t+1} ← w^t − α∇f (w^t)

α > 0 is the step size
Generates the sequence w^1, w^2, ..., converging to a point where ∇f (w) = 0
Step size too large ⇒ divergence; too small ⇒ slow convergence
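As a concrete sketch (illustrative, not from the lecture), the update rule in a few lines of Python on the toy function f(w) = ½‖w‖², whose gradient is simply w:

```python
import numpy as np

# Gradient descent on the toy function f(w) = 0.5 * ||w||^2.
# Its gradient is w and its unique minimizer is w = 0.
# Function names and constants here are illustrative choices.

def grad_f(w):
    return w  # gradient of 0.5 * ||w||^2

def gradient_descent(w0, alpha, num_iters):
    w = np.array(w0, dtype=float)
    for _ in range(num_iters):
        w = w - alpha * grad_f(w)  # w^{t+1} <- w^t - alpha * grad f(w^t)
    return w

w_final = gradient_descent([5.0, -3.0], alpha=0.1, num_iters=200)
print(w_final)  # approaches the minimizer [0, 0]
```

With α = 0.1, each step multiplies w by 0.9, so the iterates shrink geometrically toward 0; a step size above 2 would make them diverge on this function.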
Why gradient descent?

Reason I: the negative gradient is the "steepest direction" to decrease the objective function locally
Reason II: successive approximation view
At each iteration, form an approximation of f (·) around w^t:

f (w^t + d) ≈ g(d) := f (w^t) + ∇f (w^t)^T d + (1/(2α))‖d‖²

Update the solution by w^{t+1} ← w^t + d*, where d* = argmin_d g(d)
∇g(d*) = 0 ⇒ ∇f (w^t) + (1/α) d* = 0 ⇒ d* = −α∇f (w^t)

d* will decrease f (·) if α (the step size) is sufficiently small


Illustration of gradient descent

Form a quadratic approximation

f (w^t + d) ≈ g(d) = f (w^t) + ∇f (w^t)^T d + (1/(2α))‖d‖²
Illustration of gradient descent

Minimize g(d):

∇g(d*) = 0 ⇒ ∇f (w^t) + (1/α) d* = 0 ⇒ d* = −α∇f (w^t)
Illustration of gradient descent

Update
w^{t+1} = w^t + d* = w^t − α∇f (w^t)
Illustration of gradient descent

Form another quadratic approximation

f (w^{t+1} + d) ≈ g(d) = f (w^{t+1}) + ∇f (w^{t+1})^T d + (1/(2α))‖d‖²

d* = −α∇f (w^{t+1})
Illustration of gradient descent

Update
w^{t+2} = w^{t+1} + d* = w^{t+1} − α∇f (w^{t+1})
When will it diverge?

Can diverge (f (w^t) < f (w^{t+1})) if g is not an upper bound of f

When will it converge?

Always converges (f (w^t) > f (w^{t+1})) when g is an upper bound of f

Convergence

Let L be the Lipschitz constant of the gradient

(∇²f (x) ⪯ LI for all x)

Theorem: gradient descent converges if α < 1/L
In practice, we do not know L ...
need to tune the step size when running gradient descent
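A small numerical check (illustrative, not from the lecture) of the step-size condition on the 1-D quadratic f(w) = ½Lw², whose second derivative is exactly L:

```python
# Step-size condition on f(w) = 0.5 * L * w^2, whose second derivative is L
# (so L is the Lipschitz constant of the gradient). The update multiplies w
# by (1 - alpha * L), which contracts iff |1 - alpha * L| < 1, i.e. alpha < 2/L;
# the theorem's alpha < 1/L is a sufficient safe choice.

L = 4.0

def run_gd(w, alpha, iters=100):
    for _ in range(iters):
        w = w - alpha * (L * w)  # gradient of 0.5 * L * w^2 is L * w
    return w

w_small = run_gd(1.0, alpha=0.2)  # alpha < 1/L = 0.25: converges to 0
w_large = run_gd(1.0, alpha=0.6)  # alpha > 2/L = 0.5: diverges

print(abs(w_small), abs(w_large))
```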
Applying to Logistic Regression

Gradient descent for logistic regression:

Initialize the weights w_0
For t = 1, 2, ...
Compute the gradient

∇Ein = −(1/N) Σ_{n=1}^N y_n x_n / (1 + e^{y_n w^T x_n})

Update the weights: w ← w − η∇Ein

Return the final weights w

When to stop?
Fixed number of iterations, or
Stop when ‖∇Ein‖ < ε
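The loop above can be sketched in Python; the synthetic data, step size, and stopping tolerance below are illustrative choices, not from the lecture:

```python
import numpy as np

# Batch gradient descent for logistic regression, following the update
# w <- w - eta * grad Ein, with grad Ein = -(1/N) sum_n y_n x_n / (1 + exp(y_n w^T x_n)).

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w_true = np.array([2.0, -1.0])        # hypothetical ground-truth direction
X = X[np.abs(X @ w_true) > 0.3]       # keep points away from the boundary
y = np.where(X @ w_true > 0, 1.0, -1.0)   # labels in {-1, +1}

def grad_Ein(w):
    coef = -y / (1.0 + np.exp(y * (X @ w)))
    return (coef[:, None] * X).mean(axis=0)

w = np.zeros(2)
eta = 1.0
for t in range(500):
    g = grad_Ein(w)
    if np.linalg.norm(g) < 1e-4:      # stop when the gradient is small
        break
    w = w - eta * g

train_acc = np.mean(np.sign(X @ w) == y)
print(train_acc)
```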
Stochastic Gradient descent
Large-scale Problems

Machine learning: usually minimizing the in-sample loss (training loss)

min_w { (1/N) Σ_{n=1}^N ℓ(w^T x_n, y_n) } := Ein(w)   (linear model)

min_w { (1/N) Σ_{n=1}^N ℓ(h_w(x_n), y_n) } := Ein(w)   (general hypothesis)

ℓ: loss function (e.g., ℓ(a, b) = (a − b)²)

Gradient descent:
w ← w − η∇Ein(w), where computing ∇Ein(w) is the main computation

In general, Ein(w) = (1/N) Σ_{n=1}^N f_n(w),
where each f_n(w) only depends on (x_n, y_n)
Stochastic gradient

Gradient:

∇Ein(w) = (1/N) Σ_{n=1}^N ∇f_n(w)

Each gradient computation needs to go through all training samples

slow when there are millions of samples
Faster way to compute an "approximate gradient"?
Use stochastic sampling:
Sample a small subset B ⊆ {1, ..., N}
Estimate the gradient:

∇Ein(w) ≈ (1/|B|) Σ_{n∈B} ∇f_n(w)

|B|: batch size
Stochastic gradient descent

Stochastic Gradient Descent (SGD)

Input: training data {x_n, y_n}_{n=1}^N
Initialize w (zero or random)
For t = 1, 2, ...
Sample a small batch B ⊆ {1, ..., N}
Update the parameters:

w ← w − η_t (1/|B|) Σ_{n∈B} ∇f_n(w)

Extreme case: |B| = 1 ⇒ sample one training example at a time
Logistic Regression by SGD

Logistic regression:

min_w (1/N) Σ_{n=1}^N log(1 + e^{−y_n w^T x_n}),   where f_n(w) := log(1 + e^{−y_n w^T x_n})

SGD for Logistic Regression

Input: training data {x_n, y_n}_{n=1}^N
Initialize w (zero or random)
For t = 1, 2, ...
Sample a batch B ⊆ {1, ..., N}
Update the parameters:

w ← w − η_t (1/|B|) Σ_{n∈B} (−y_n x_n) / (1 + e^{y_n w^T x_n}),   where each summand is ∇f_n(w)
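A mini-batch version of this update can be sketched as follows; the data, batch size, and step-size schedule are illustrative choices, not from the lecture:

```python
import numpy as np

# Mini-batch SGD for logistic regression, matching the update
# w <- w - eta_t * (1/|B|) sum_{n in B} (-y_n x_n) / (1 + exp(y_n w^T x_n)).

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
w_true = rng.normal(size=3)           # hypothetical ground-truth direction
X = X[np.abs(X @ w_true) > 0.3]       # keep points with some margin
y = np.where(X @ w_true > 0, 1.0, -1.0)
N = len(y)

w = np.zeros(3)
batch_size = 10
for t in range(1, 2001):
    B = rng.choice(N, size=batch_size, replace=False)  # sample a batch
    coef = -y[B] / (1.0 + np.exp(y[B] * (X[B] @ w)))
    grad = (coef[:, None] * X[B]).mean(axis=0)         # (1/|B|) sum of grad f_n
    eta_t = 0.5 / np.sqrt(t)                           # decaying step size
    w = w - eta_t * grad

sgd_acc = np.mean(np.sign(X @ w) == y)
print(sgd_acc)
```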
Why does SGD work?

The stochastic gradient is an unbiased estimator of the full gradient:

E[(1/|B|) Σ_{n∈B} ∇f_n(w)] = (1/N) Σ_{n=1}^N ∇f_n(w) = ∇Ein(w)

Each iteration updates by

gradient + zero-mean noise

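This unbiasedness can be checked numerically; the least-squares losses and all names below are illustrative:

```python
import numpy as np

# Numerical check that the mini-batch gradient is unbiased, using
# least-squares losses f_n(w) = 0.5 * (w^T x_n - y_n)^2 with
# gradient (w^T x_n - y_n) * x_n.

rng = np.random.default_rng(2)
N, d = 100, 4
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
w = rng.normal(size=d)

per_sample_grads = (X @ w - y)[:, None] * X   # row n is grad f_n(w)
full_grad = per_sample_grads.mean(axis=0)     # grad Ein(w)

# Average many independent mini-batch estimates; by the law of large numbers
# the average approaches the full gradient, reflecting E[batch grad] = grad Ein.
batch_size, trials = 5, 5000
avg_estimate = np.zeros(d)
for _ in range(trials):
    B = rng.choice(N, size=batch_size, replace=False)
    avg_estimate += per_sample_grads[B].mean(axis=0)
avg_estimate /= trials

print(np.linalg.norm(avg_estimate - full_grad))  # small
```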
Stochastic gradient descent

In gradient descent, η (the step size) is a fixed constant

Can we use a fixed step size for SGD?
SGD with a fixed step size cannot converge to global/local minimizers
If w* is the minimizer, then ∇f (w*) = (1/N) Σ_{n=1}^N ∇f_n(w*) = 0,

but (1/|B|) Σ_{n∈B} ∇f_n(w*) ≠ 0 if B is a strict subset

(Even if we reach the minimizer, SGD will move away from it)


Stochastic gradient descent: step size

To make SGD converge:

the step size should decrease to 0:
η_t → 0
usually with a polynomial rate: η_t ≈ t^{−a} for some constant a > 0
Stochastic gradient descent vs gradient descent

Stochastic gradient descent:

Pros:
cheaper computation per iteration
faster convergence in the beginning
Cons:
less stable, slower final convergence
harder to tune the step size

(Figure from https://medium.com/@ImadPhd/gradient-descent-algorithm-and-its-variants-10f652806a3)
Revisit: Perceptron Learning Algorithm

Given classification data {x_n, y_n}_{n=1}^N
Learn a linear model:

min_w (1/N) Σ_{n=1}^N ℓ(w^T x_n, y_n)

Consider the loss:
ℓ(w^T x_n, y_n) = max(0, −y_n w^T x_n)

What's the gradient?
Revisit: Perceptron Learning Algorithm

ℓ(w^T x_n, y_n) = max(0, −y_n w^T x_n)

Consider two cases:
Case I: y_n w^T x_n > 0 (prediction correct)
ℓ(w^T x_n, y_n) = 0
∂/∂w ℓ(w^T x_n, y_n) = 0
Case II: y_n w^T x_n < 0 (prediction wrong)
ℓ(w^T x_n, y_n) = −y_n w^T x_n
∂/∂w ℓ(w^T x_n, y_n) = −y_n x_n

SGD update rule: sample an index n, then

w^{t+1} ← w^t                  if y_n w^T x_n ≥ 0 (prediction correct)
w^{t+1} ← w^t + η_t y_n x_n    if y_n w^T x_n < 0 (prediction wrong)

Equivalent to the Perceptron Learning Algorithm when η_t = 1
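Running this update with η_t = 1 on separable toy data reproduces PLA; the data and names below are illustrative:

```python
import numpy as np

# SGD update with eta_t = 1, i.e. the Perceptron Learning Algorithm,
# on toy linearly separable data.

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
w_star = np.array([1.0, 2.0])             # hypothetical separating direction
X = X[np.abs(X @ w_star) > 0.5]           # enforce a margin so PLA converges fast
y = np.where(X @ w_star > 0, 1.0, -1.0)

w = np.zeros(2)
for epoch in range(300):
    mistakes = 0
    for n in rng.permutation(len(y)):
        if y[n] * (X[n] @ w) <= 0:        # prediction wrong (or on the boundary)
            w = w + y[n] * X[n]           # w <- w + eta * y_n x_n with eta = 1
            mistakes += 1
    if mistakes == 0:                     # a full clean pass: converged
        break

pla_acc = np.mean(np.sign(X @ w) == y)
print(pla_acc)  # 1.0 on separable data
```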


Conclusions

Gradient descent
Stochastic gradient descent
Next class: LFD 2

Questions?
