
ECS171: Machine Learning

Lecture 4: Optimization (LFD 3.3, SGD)

Cho-Jui Hsieh
UC Davis

Jan 22, 2018


Gradient descent
Optimization

Goal: find the minimizer of a function

min_w f (w)

For now we assume f is twice differentiable


Machine learning algorithm: find the hypothesis that minimizes Ein
Convex vs Nonconvex

Convex function:
∇f (w*) = 0 ⇔ w* is a global minimum
A function is convex if ∇²f (w) is positive semi-definite for all w
Examples: linear regression, logistic regression, ...
Non-convex function:
∇f (w*) = 0 ⇔ w* is a global minimum, local minimum, or saddle point
Most algorithms only guarantee convergence to a point with gradient = 0
Examples: neural networks, ...
Gradient Descent

Gradient descent: repeatedly do

w^{t+1} ← w^t − α∇f (w^t)

α > 0 is the step size
Generates the sequence w^1, w^2, ..., converging to a point where ∇f (w) = 0
Step size too large ⇒ divergence; too small ⇒ slow convergence
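As a concrete sketch (illustrative, not from the lecture), the update rule in a few lines of Python on the toy function f(w) = ½‖w‖², whose gradient is simply w:

```python
import numpy as np

# Gradient descent on the toy function f(w) = 0.5 * ||w||^2.
# Its gradient is w and its unique minimizer is w = 0.
# Function names and constants here are illustrative choices.

def grad_f(w):
    return w  # gradient of 0.5 * ||w||^2

def gradient_descent(w0, alpha, num_iters):
    w = np.array(w0, dtype=float)
    for _ in range(num_iters):
        w = w - alpha * grad_f(w)  # w^{t+1} <- w^t - alpha * grad f(w^t)
    return w

w_final = gradient_descent([5.0, -3.0], alpha=0.1, num_iters=200)
print(w_final)  # approaches the minimizer [0, 0]
```

With α = 0.1, each step multiplies w by 0.9, so the iterates shrink geometrically toward 0; a step size above 2 would make them diverge on this function.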
Why gradient descent?

Reason I: the negative gradient is the "steepest direction" to decrease the objective function locally
Reason II: successive approximation view
At each iteration, form an approximation of f (·) around w^t:

f (w^t + d) ≈ g(d) := f (w^t) + ∇f (w^t)^T d + (1/(2α))‖d‖²

Update the solution by w^{t+1} ← w^t + d*, where d* = argmin_d g(d)
∇g(d*) = 0 ⇒ ∇f (w^t) + (1/α) d* = 0 ⇒ d* = −α∇f (w^t)

d* will decrease f (·) if α (the step size) is sufficiently small


Illustration of gradient descent

Form a quadratic approximation

f (w^t + d) ≈ g(d) = f (w^t) + ∇f (w^t)^T d + (1/(2α))‖d‖²
Illustration of gradient descent

Minimize g(d):

∇g(d*) = 0 ⇒ ∇f (w^t) + (1/α) d* = 0 ⇒ d* = −α∇f (w^t)
Illustration of gradient descent

Update
w^{t+1} = w^t + d* = w^t − α∇f (w^t)
Illustration of gradient descent

Form another quadratic approximation

f (w^{t+1} + d) ≈ g(d) = f (w^{t+1}) + ∇f (w^{t+1})^T d + (1/(2α))‖d‖²

d* = −α∇f (w^{t+1})
Illustration of gradient descent

Update
w^{t+2} = w^{t+1} + d* = w^{t+1} − α∇f (w^{t+1})
When will it diverge?

Can diverge (f (w^t) < f (w^{t+1})) if g is not an upper bound of f

When will it converge?

Always converges (f (w^t) > f (w^{t+1})) when g is an upper bound of f

Convergence

Let L be the Lipschitz constant of the gradient

(∇²f (x) ⪯ LI for all x)

Theorem: gradient descent converges if α < 1/L
In practice, we do not know L ...
need to tune the step size when running gradient descent
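A small numerical check (illustrative, not from the lecture) of the step-size condition on the 1-D quadratic f(w) = ½Lw², whose second derivative is exactly L:

```python
# Step-size condition on f(w) = 0.5 * L * w^2, whose second derivative is L
# (so L is the Lipschitz constant of the gradient). The update multiplies w
# by (1 - alpha * L), which contracts iff |1 - alpha * L| < 1, i.e. alpha < 2/L;
# the theorem's alpha < 1/L is a sufficient safe choice.

L = 4.0

def run_gd(w, alpha, iters=100):
    for _ in range(iters):
        w = w - alpha * (L * w)  # gradient of 0.5 * L * w^2 is L * w
    return w

w_small = run_gd(1.0, alpha=0.2)  # alpha < 1/L = 0.25: converges to 0
w_large = run_gd(1.0, alpha=0.6)  # alpha > 2/L = 0.5: diverges

print(abs(w_small), abs(w_large))
```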
Applying to Logistic Regression

Gradient descent for logistic regression:

Initialize the weights w_0
For t = 1, 2, ...
Compute the gradient

∇Ein = −(1/N) Σ_{n=1}^N y_n x_n / (1 + e^{y_n w^T x_n})

Update the weights: w ← w − η∇Ein

Return the final weights w

When to stop?
Fixed number of iterations, or
Stop when ‖∇Ein‖ < ε
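The loop above can be sketched in Python; the synthetic data, step size, and stopping tolerance below are illustrative choices, not from the lecture:

```python
import numpy as np

# Batch gradient descent for logistic regression, following the update
# w <- w - eta * grad Ein, with grad Ein = -(1/N) sum_n y_n x_n / (1 + exp(y_n w^T x_n)).

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w_true = np.array([2.0, -1.0])        # hypothetical ground-truth direction
X = X[np.abs(X @ w_true) > 0.3]       # keep points away from the boundary
y = np.where(X @ w_true > 0, 1.0, -1.0)   # labels in {-1, +1}

def grad_Ein(w):
    coef = -y / (1.0 + np.exp(y * (X @ w)))
    return (coef[:, None] * X).mean(axis=0)

w = np.zeros(2)
eta = 1.0
for t in range(500):
    g = grad_Ein(w)
    if np.linalg.norm(g) < 1e-4:      # stop when the gradient is small
        break
    w = w - eta * g

train_acc = np.mean(np.sign(X @ w) == y)
print(train_acc)
```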
Stochastic Gradient descent
Large-scale Problems

Machine learning: usually minimizing the in-sample loss (training loss)

min_w { (1/N) Σ_{n=1}^N ℓ(w^T x_n, y_n) } := Ein(w)   (linear model)

min_w { (1/N) Σ_{n=1}^N ℓ(h_w(x_n), y_n) } := Ein(w)   (general hypothesis)

ℓ: loss function (e.g., ℓ(a, b) = (a − b)²)

Gradient descent:
w ← w − η∇Ein(w), where computing ∇Ein(w) is the main computation

In general, Ein(w) = (1/N) Σ_{n=1}^N f_n(w),
where each f_n(w) only depends on (x_n, y_n)
Stochastic gradient

Gradient:

∇Ein(w) = (1/N) Σ_{n=1}^N ∇f_n(w)

Each gradient computation needs to go through all training samples

slow when there are millions of samples
Faster way to compute an "approximate gradient"?
Use stochastic sampling:
Sample a small subset B ⊆ {1, ..., N}
Estimate the gradient:

∇Ein(w) ≈ (1/|B|) Σ_{n∈B} ∇f_n(w)

|B|: batch size
Stochastic gradient descent

Stochastic Gradient Descent (SGD)

Input: training data {x_n, y_n}_{n=1}^N
Initialize w (zero or random)
For t = 1, 2, ...
Sample a small batch B ⊆ {1, ..., N}
Update the parameters:

w ← w − η_t (1/|B|) Σ_{n∈B} ∇f_n(w)

Extreme case: |B| = 1 ⇒ sample one training example at a time
Logistic Regression by SGD

Logistic regression:

min_w (1/N) Σ_{n=1}^N log(1 + e^{−y_n w^T x_n}),   where f_n(w) := log(1 + e^{−y_n w^T x_n})

SGD for Logistic Regression

Input: training data {x_n, y_n}_{n=1}^N
Initialize w (zero or random)
For t = 1, 2, ...
Sample a batch B ⊆ {1, ..., N}
Update the parameters:

w ← w − η_t (1/|B|) Σ_{n∈B} (−y_n x_n) / (1 + e^{y_n w^T x_n}),   where each summand is ∇f_n(w)
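A mini-batch version of this update can be sketched as follows; the data, batch size, and step-size schedule are illustrative choices, not from the lecture:

```python
import numpy as np

# Mini-batch SGD for logistic regression, matching the update
# w <- w - eta_t * (1/|B|) sum_{n in B} (-y_n x_n) / (1 + exp(y_n w^T x_n)).

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
w_true = rng.normal(size=3)           # hypothetical ground-truth direction
X = X[np.abs(X @ w_true) > 0.3]       # keep points with some margin
y = np.where(X @ w_true > 0, 1.0, -1.0)
N = len(y)

w = np.zeros(3)
batch_size = 10
for t in range(1, 2001):
    B = rng.choice(N, size=batch_size, replace=False)  # sample a batch
    coef = -y[B] / (1.0 + np.exp(y[B] * (X[B] @ w)))
    grad = (coef[:, None] * X[B]).mean(axis=0)         # (1/|B|) sum of grad f_n
    eta_t = 0.5 / np.sqrt(t)                           # decaying step size
    w = w - eta_t * grad

sgd_acc = np.mean(np.sign(X @ w) == y)
print(sgd_acc)
```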
Why does SGD work?

The stochastic gradient is an unbiased estimator of the full gradient:

E[(1/|B|) Σ_{n∈B} ∇f_n(w)] = (1/N) Σ_{n=1}^N ∇f_n(w) = ∇Ein(w)

Each iteration updates by

gradient + zero-mean noise

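This unbiasedness can be checked numerically; the least-squares losses and all names below are illustrative:

```python
import numpy as np

# Numerical check that the mini-batch gradient is unbiased, using
# least-squares losses f_n(w) = 0.5 * (w^T x_n - y_n)^2 with
# gradient (w^T x_n - y_n) * x_n.

rng = np.random.default_rng(2)
N, d = 100, 4
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
w = rng.normal(size=d)

per_sample_grads = (X @ w - y)[:, None] * X   # row n is grad f_n(w)
full_grad = per_sample_grads.mean(axis=0)     # grad Ein(w)

# Average many independent mini-batch estimates; by the law of large numbers
# the average approaches the full gradient, reflecting E[batch grad] = grad Ein.
batch_size, trials = 5, 5000
avg_estimate = np.zeros(d)
for _ in range(trials):
    B = rng.choice(N, size=batch_size, replace=False)
    avg_estimate += per_sample_grads[B].mean(axis=0)
avg_estimate /= trials

print(np.linalg.norm(avg_estimate - full_grad))  # small
```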
Stochastic gradient descent

In gradient descent, η (the step size) is a fixed constant

Can we use a fixed step size for SGD?
SGD with a fixed step size cannot converge to global/local minimizers
If w* is the minimizer, then ∇f (w*) = (1/N) Σ_{n=1}^N ∇f_n(w*) = 0,

but (1/|B|) Σ_{n∈B} ∇f_n(w*) ≠ 0 if B is a strict subset

(Even if we reach the minimizer, SGD will move away from it)


Stochastic gradient descent: step size

To make SGD converge:

the step size should decrease to 0:
η_t → 0
usually with a polynomial rate: η_t ≈ t^{−a} for some constant a > 0
Stochastic gradient descent vs gradient descent

Stochastic gradient descent:

Pros:
cheaper computation per iteration
faster convergence in the beginning
Cons:
less stable, slower final convergence
harder to tune the step size

(Figure from https://medium.com/@ImadPhd/gradient-descent-algorithm-and-its-variants-10f652806a3)
Revisit: Perceptron Learning Algorithm

Given classification data {x_n, y_n}_{n=1}^N
Learn a linear model:

min_w (1/N) Σ_{n=1}^N ℓ(w^T x_n, y_n)

Consider the loss:
ℓ(w^T x_n, y_n) = max(0, −y_n w^T x_n)

What's the gradient?
Revisit: Perceptron Learning Algorithm

ℓ(w^T x_n, y_n) = max(0, −y_n w^T x_n)

Consider two cases:
Case I: y_n w^T x_n > 0 (prediction correct)
ℓ(w^T x_n, y_n) = 0
∂/∂w ℓ(w^T x_n, y_n) = 0
Case II: y_n w^T x_n < 0 (prediction wrong)
ℓ(w^T x_n, y_n) = −y_n w^T x_n
∂/∂w ℓ(w^T x_n, y_n) = −y_n x_n

SGD update rule: sample an index n, then

w^{t+1} ← w^t                  if y_n w^T x_n ≥ 0 (prediction correct)
w^{t+1} ← w^t + η_t y_n x_n    if y_n w^T x_n < 0 (prediction wrong)

Equivalent to the Perceptron Learning Algorithm when η_t = 1
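Running this update with η_t = 1 on separable toy data reproduces PLA; the data and names below are illustrative:

```python
import numpy as np

# SGD update with eta_t = 1, i.e. the Perceptron Learning Algorithm,
# on toy linearly separable data.

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
w_star = np.array([1.0, 2.0])             # hypothetical separating direction
X = X[np.abs(X @ w_star) > 0.5]           # enforce a margin so PLA converges fast
y = np.where(X @ w_star > 0, 1.0, -1.0)

w = np.zeros(2)
for epoch in range(300):
    mistakes = 0
    for n in rng.permutation(len(y)):
        if y[n] * (X[n] @ w) <= 0:        # prediction wrong (or on the boundary)
            w = w + y[n] * X[n]           # w <- w + eta * y_n x_n with eta = 1
            mistakes += 1
    if mistakes == 0:                     # a full clean pass: converged
        break

pla_acc = np.mean(np.sign(X @ w) == y)
print(pla_acc)  # 1.0 on separable data
```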


Conclusions

Gradient descent
Stochastic gradient descent
Next class: LFD 2

Questions?
