
Adam: A Method for Stochastic Optimization

Diederik P. Kingma and Jimmy Lei Ba

Nadav Cohen

The Hebrew University of Jerusalem


Advanced Seminar in Deep Learning (#67679)

October 18, 2015



Optimization in Deep Learning

Outline

1 Optimization in Deep Learning

2 Adaptive Moment Estimation (Adam)

3 Convergence Analysis

4 Relations to Existing Algorithms

5 Experiments

6 AdaMax

7 Conclusion



Optimization in Deep Learning

Loss Minimization

Loss minimization problem:


\[
\min_W \; L(W) := \frac{1}{m} \sum_{i=1}^{m} \ell(W; x_i, y_i) + \lambda\, r(W)
\]

{(x_i, y_i)}_{i=1}^m – training instances (x_i) and corresponding labels (y_i)

W – network parameters to learn


ℓ(W; x_i, y_i) – loss of the network parameterized by W w.r.t. (x_i, y_i)

r(W) – regularization function (e.g. ‖W‖₂²)
λ > 0 – regularization weight

Optimization in Deep Learning

Large-Scale −→ First-Order Stochastic Methods

Large-scale setting:

Many network parameters (e.g. dim(W) ∼ 10⁸)
=⇒ computing the Hessian (second-order derivatives) is expensive

Many training examples (e.g. m ∼ 10⁶)
=⇒ computing the full objective at every iteration is expensive

Optimization methods must be:

First-order – update based on objective value and gradient only


Stochastic – update based on a subset of training examples:
\[
L_t(W) := \frac{1}{b} \sum_{j=1}^{b} \ell(W; x_{i_j}, y_{i_j}) + \lambda\, r(W)
\]

{(x_{i_j}, y_{i_j})}_{j=1}^b – random mini-batch chosen at iteration t
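
As an illustration, here is a minimal NumPy sketch of the mini-batch objective L_t(W), assuming (purely for the example) a linear model with squared loss and L2 regularization; the model, loss and data are illustrative choices, not fixed by the slides:

import numpy as np

def minibatch_loss(W, X, y, batch_size, lam, rng):
    # L_t(W): average loss on a random mini-batch plus lambda * r(W)
    idx = rng.choice(len(X), size=batch_size, replace=False)  # mini-batch indices
    residuals = X[idx] @ W - y[idx]
    data_loss = np.mean(residuals ** 2)      # (1/b) * sum_j loss(W; x_ij, y_ij)
    return data_loss + lam * np.sum(W ** 2)  # r(W) = ||W||_2^2

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 20)), rng.normal(size=1000)  # toy data
print(minibatch_loss(np.zeros(20), X, y, batch_size=32, lam=1e-4, rng=rng))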


Optimization in Deep Learning

Stochastic Gradient Descent (SGD)

Update rule:

\[
V_t = \mu V_{t-1} - \alpha \nabla L_t(W_{t-1})
\]
\[
W_t = W_{t-1} + V_t
\]

α > 0 – learning rate (typical choices: 0.01, 0.1)

µ ∈ [0, 1) – momentum (typical choices: 0.9, 0.95, 0.99)

Momentum smooths updates, enhancing stability and speed.
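
A minimal NumPy sketch of this update rule; grad_fn stands in for ∇L_t on the current mini-batch (an assumed helper, not defined in the slides):

import numpy as np

def sgd_momentum_step(W, V, grad_fn, alpha=0.01, mu=0.9):
    # V_t = mu * V_{t-1} - alpha * grad,  W_t = W_{t-1} + V_t
    V = mu * V - alpha * grad_fn(W)
    W = W + V
    return W, V

# toy usage on L(W) = 0.5 * ||W||^2, whose gradient is W itself
W, V = np.ones(3), np.zeros(3)
for _ in range(200):
    W, V = sgd_momentum_step(W, V, grad_fn=lambda w: w)
print(W)  # approaches the minimizer at 0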



Optimization in Deep Learning

Nesterov’s Accelerated Gradient (NAG)

Update rule:

\[
V_t = \mu V_{t-1} - \alpha \nabla L_t(W_{t-1} + \mu V_{t-1})
\]
\[
W_t = W_{t-1} + V_t
\]

The only difference from SGD is the partial update (+µV_{t−1}) in the gradient computation. May increase stability and speed in ill-conditioned problems.¹

¹ See “On the Importance of Initialization and Momentum in Deep Learning” by Sutskever et al.
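
The corresponding sketch for NAG differs only in where the gradient is evaluated (grad_fn is again an assumed helper):

def nag_step(W, V, grad_fn, alpha=0.01, mu=0.9):
    # gradient is taken at the partially updated point W + mu * V
    V = mu * V - alpha * grad_fn(W + mu * V)
    W = W + V
    return W, V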
Optimization in Deep Learning

Adaptive Gradient (AdaGrad)

Update rule:
\[
W_t = W_{t-1} - \alpha\, \frac{\nabla L_t(W_{t-1})}{\sqrt{\sum_{t'=1}^{t} \nabla L_{t'}(W_{t'-1})^2}}
\]

Learning rate adapted per coordinate:


Highly varying coordinate −→ suppress
Rarely varying coordinate −→ enhance

Disadvantage in non-stationary settings:


All gradients (recent and old) weighted equally
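
A minimal sketch of AdaGrad's per-coordinate scaling; the small constant eps is an assumed implementation detail for numerical safety (it is not on the slide):

import numpy as np

def adagrad_step(W, G, grad_fn, alpha=0.01, eps=1e-8):
    g = grad_fn(W)
    G = G + g ** 2                          # lifetime sum of squared gradients, per coordinate
    W = W - alpha * g / (np.sqrt(G) + eps)  # coordinates with large accumulated gradients get smaller effective steps
    return W, G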

Optimization in Deep Learning

Root Mean Square Propagation (RMSProp)

Update rule:

\[
R_t = \gamma R_{t-1} + (1 - \gamma)\, \nabla L_t(W_{t-1})^2
\]
\[
W_t = W_{t-1} - \alpha\, \frac{\nabla L_t(W_{t-1})}{\sqrt{R_t}}
\]

Similar to AdaGrad, but with an exponential moving average controlled by γ ∈ [0, 1) (smaller γ =⇒ more emphasis on recent gradients).

May be combined with NAG:

\[
R_t = \gamma R_{t-1} + (1 - \gamma)\, \nabla L_t(W_{t-1} + \mu V_{t-1})^2
\]
\[
V_t = \mu V_{t-1} - \frac{\alpha}{\sqrt{R_t}}\, \nabla L_t(W_{t-1} + \mu V_{t-1})
\]
\[
W_t = W_{t-1} + V_t
\]
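
A sketch of plain RMSProp as written above (the NAG combination is omitted for brevity; eps is again an assumed numerical-safety constant):

import numpy as np

def rmsprop_step(W, R, grad_fn, alpha=0.001, gamma=0.9, eps=1e-8):
    g = grad_fn(W)
    R = gamma * R + (1 - gamma) * g ** 2    # exponential moving average of squared gradients
    W = W - alpha * g / (np.sqrt(R) + eps)
    return W, R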



Adaptive Moment Estimation (Adam)

Outline

1 Optimization in Deep Learning

2 Adaptive Moment Estimation (Adam)

3 Convergence Analysis

4 Relations to Existing Algorithms

5 Experiments

6 AdaMax

7 Conclusion



Adaptive Moment Estimation (Adam)

Rationale

Motivation
Combine the advantages of:

AdaGrad – works well with sparse gradients

RMSProp – works well in non-stationary settings

Idea
Maintain exponential moving averages of the gradient and its square

Update proportional to (average gradient) / √(average squared gradient)



Adaptive Moment Estimation (Adam)

Algorithm
Adam
M0 = 0, R0 = 0 (Initialization)
For t = 1, . . . , T :
Mt = β1 Mt−1 + (1 − β1 )∇Lt (Wt−1 ) (1st moment estimate)
Rt = β2 Rt−1 + (1 − β2 )∇Lt (Wt−1 )2 (2nd moment estimate)
M̂t = Mt / (1 − (β1 )t ) (1st moment bias correction)
R̂t = Rt / (1 − (β2 )t ) (2nd moment bias correction)
Wt = Wt−1 − α M̂t / √(R̂t + ε) (Update)
Return WT

Hyper-parameters:
α > 0 – learning rate (typical choice: 0.001)
β1 ∈ [0, 1) – 1st moment decay rate (typical choice: 0.9)
β2 ∈ [0, 1) – 2nd moment decay rate (typical choice: 0.999)
ε > 0 – numerical stability term (typical choice: 10⁻⁸)
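
A minimal NumPy sketch of the algorithm above, keeping ε inside the square root as written on this slide (the paper's pseudocode adds ε outside the root; the difference is negligible in practice). grad_fn(W, t) is an assumed helper returning the mini-batch gradient at iteration t:

import numpy as np

def adam(grad_fn, W, num_steps, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    M = np.zeros_like(W)  # 1st moment estimate
    R = np.zeros_like(W)  # 2nd moment estimate
    for t in range(1, num_steps + 1):
        g = grad_fn(W, t)
        M = beta1 * M + (1 - beta1) * g
        R = beta2 * R + (1 - beta2) * g ** 2
        M_hat = M / (1 - beta1 ** t)   # bias correction
        R_hat = R / (1 - beta2 ** t)
        W = W - alpha * M_hat / np.sqrt(R_hat + eps)
    return W

# toy usage: minimize L(W) = 0.5 * ||W||^2
print(adam(lambda w, t: w, W=np.ones(3), num_steps=3000))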
Adaptive Moment Estimation (Adam)

Parameter Updates

Adam's step at iteration t (assuming ε = 0):
\[
\Delta_t = -\alpha\, \frac{\hat{M}_t}{\sqrt{\hat{R}_t}}
\]

Properties:
Scale-invariance:

\[
L(W) \to c \cdot L(W)\ \ (c > 0) \;\Longrightarrow\; \hat{M}_t \to c \cdot \hat{M}_t \,\wedge\, \hat{R}_t \to c^2 \cdot \hat{R}_t
\]
=⇒ ∆_t does not change

Bounded norm:
\[
\|\Delta_t\|_\infty \le
\begin{cases}
\alpha \cdot (1 - \beta_1)/\sqrt{1 - \beta_2}, & (1 - \beta_1) > \sqrt{1 - \beta_2} \\
\alpha, & \text{otherwise}
\end{cases}
\]
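
A quick numerical sanity check of the scale-invariance property above (a sketch, not from the slides): feeding the moment estimates a gradient sequence scaled by c > 0 leaves the step unchanged when ε = 0.

import numpy as np

def adam_delta(grads, alpha=0.001, beta1=0.9, beta2=0.999):
    # Adam's step after seeing the gradient sequence `grads`, with eps = 0
    M = R = 0.0
    for t, g in enumerate(grads, start=1):
        M = beta1 * M + (1 - beta1) * g
        R = beta2 * R + (1 - beta2) * g ** 2
    return -alpha * (M / (1 - beta1 ** t)) / np.sqrt(R / (1 - beta2 ** t))

grads = np.random.default_rng(0).normal(size=10)
print(np.isclose(adam_delta(grads), adam_delta(5.0 * grads)))  # True: step unchanged under scaling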



Adaptive Moment Estimation (Adam)

Bias Correction

Taking into account the initialization M0 = 0, we have:

\[
M_t = \beta_1 M_{t-1} + (1 - \beta_1)\nabla L_t(W_{t-1})
    = \sum_{i=1}^{t} (1 - \beta_1)(\beta_1)^{t-i} \cdot \nabla L_i(W_{i-1})
\]
Since \(\sum_{i=1}^{t} (1 - \beta_1)(\beta_1)^{t-i} = 1 - (\beta_1)^t\), to obtain an unbiased estimate we divide by \(1 - (\beta_1)^t\):
\[
\hat{M}_t = M_t \,/\, \big(1 - (\beta_1)^t\big)
\]


An analogous argument yields the bias correction for R_t.
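
A tiny numerical illustration (a sketch, not from the slides): for a constant gradient g, the raw average satisfies M_t = (1 − (β1)^t)·g, so dividing by 1 − (β1)^t recovers g exactly at every step.

beta1, g = 0.9, 2.5
M = 0.0
for t in range(1, 11):
    M = beta1 * M + (1 - beta1) * g
    # raw M is biased towards 0 early on; the corrected value equals g
    print(t, round(M, 4), round(M / (1 - beta1 ** t), 4))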



Convergence Analysis

Outline

1 Optimization in Deep Learning

2 Adaptive Moment Estimation (Adam)

3 Convergence Analysis

4 Relations to Existing Algorithms

5 Experiments

6 AdaMax

7 Conclusion

Convergence Analysis

Convergence in Online Convex Regime

Regret at iteration T:
\[
R(T) := \sum_{t=1}^{T} \big[ L_t(W_t) - L_t(W^*) \big]
\]
where:
\[
W^* := \operatorname*{argmin}_{W} \sum_{t=1}^{T} L_t(W)
\]

In the convex regime, Adam attains a regret bound comparable to the best known:


Theorem
If all batch objectives Lt (W ) are convex and have bounded gradients, and
all points Wt generated by Adam are within bounded distance from each
other, then for every T ∈ N:

\[
\frac{R(T)}{T} = O\!\left(\frac{1}{\sqrt{T}}\right)
\]
Relations to Existing Algorithms

Outline

1 Optimization in Deep Learning

2 Adaptive Moment Estimation (Adam)

3 Convergence Analysis

4 Relations to Existing Algorithms

5 Experiments

6 AdaMax

7 Conclusion



Relations to Existing Algorithms

Relation to SGD

Setting β2 = 1, Adam’s update rule may be written as:

\[
M_t = \mu M_{t-1} - \eta\, \nabla L_t(W_{t-1})
\]
\[
W_t = W_{t-1} + M_t
\]

where:
\[
\mu := \frac{\beta_1\,(1 - (\beta_1)^{t-1})}{1 - (\beta_1)^t},
\qquad
\eta := \frac{\alpha\,(1 - \beta_1)}{(1 - (\beta_1)^t)\,\sqrt{\varepsilon}}
\]

Conclusion
Disabling 2nd moment estimation (β2 = 1) reduces Adam to SGD with:

Learning rate descending towards α(1 − β1)/√ε
Momentum ascending towards β1
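
A sketch of the algebra behind this reduction (keeping ε inside the square root, as in the update rule above): with β2 = 1 the second-moment estimate stays at R_t = 0, so the denominator of Adam's update is √ε; writing the previous step as W_{t−1} − W_{t−2} = −α M_{t−1}/((1 − (β1)^{t−1})√ε) and substituting the first-moment recursion M_t = β1 M_{t−1} + (1 − β1)∇L_t(W_{t−1}) gives:

\[
W_t - W_{t-1}
= -\frac{\alpha\, M_t}{(1 - (\beta_1)^t)\sqrt{\varepsilon}}
= \underbrace{\frac{\beta_1\,(1 - (\beta_1)^{t-1})}{1 - (\beta_1)^t}}_{\mu}\,(W_{t-1} - W_{t-2})
\;-\; \underbrace{\frac{\alpha\,(1 - \beta_1)}{(1 - (\beta_1)^t)\sqrt{\varepsilon}}}_{\eta}\, \nabla L_t(W_{t-1})
\]

which is exactly the SGD-with-momentum recursion stated above.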



Relations to Existing Algorithms

Relation to AdaGrad

Setting β1 = 0 and β2 → 1⁻ (and assuming ε = 0), Adam's update rule may be written as:

\[
W_t = W_{t-1} - \alpha\, \frac{\nabla L_t(W_{t-1})}{t^{-1/2}\,\sqrt{\sum_{i=1}^{t} \nabla L_i(W_{i-1})^2}}
\]

Conclusion
In the limit β2 → 1⁻, with β1 = 0 and an annealed learning rate α_t = α · t^{−1/2}, Adam reduces to AdaGrad.
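
A sketch of where the t^{−1/2} factor comes from: as β2 → 1⁻ the bias-corrected second-moment estimate tends to the plain average of the squared gradients,

\[
\hat{R}_t
= \frac{\sum_{i=1}^{t} (1 - \beta_2)(\beta_2)^{t-i}\, \nabla L_i(W_{i-1})^2}{1 - (\beta_2)^t}
\;\xrightarrow[\beta_2 \to 1^-]{}\;
\frac{1}{t} \sum_{i=1}^{t} \nabla L_i(W_{i-1})^2 ,
\]

so \(\sqrt{\hat{R}_t} \to t^{-1/2}\sqrt{\sum_{i=1}^{t}\nabla L_i(W_{i-1})^2}\); annealing the learning rate as α · t^{−1/2} cancels this factor and leaves exactly AdaGrad's update.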



Relations to Existing Algorithms

Relation to RMSProp

RMSProp with momentum is the method most closely related to Adam.

Main differences:

RMSProp rescales the gradient and then applies momentum; Adam first applies momentum (moving average) and then rescales.

RMSProp lacks bias correction, often leading to large step sizes in the early stages of a run (especially when the second-moment decay rate is close to 1).
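
Side by side, in the notation of the earlier slides (a restatement using plain momentum rather than the NAG variant):

\[
\text{RMSProp + momentum:}\quad V_t = \mu V_{t-1} - \alpha\, \frac{\nabla L_t(W_{t-1})}{\sqrt{R_t}}
\qquad\quad
\text{Adam:}\quad \Delta_t = -\alpha\, \frac{\hat{M}_t}{\sqrt{\hat{R}_t}}
\]

i.e. RMSProp rescales each gradient before averaging it into the velocity, while Adam averages the raw gradients into M_t and rescales the average.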



Experiments

Outline

1 Optimization in Deep Learning

2 Adaptive Moment Estimation (Adam)

3 Convergence Analysis

4 Relations to Existing Algorithms

5 Experiments

6 AdaMax

7 Conclusion



Experiments

Logistic Regression on MNIST

L2-regularized logistic regression applied directly to image pixels:



Experiments

Logistic Regression on IMDB


Dropout regularized logistic regression applied to sparse Bag-of-Words
features:



Experiments

Multi-Layer Neural Networks (Fully-Connected) on MNIST

2 hidden layers, 1000 units each, ReLU activation, dropout regularization:



Experiments

Convolutional Neural Networks on CIFAR-10


Network architecture:
conv-5x5@64 → pool-3x3(stride-2) → conv-5x5@64 → pool-3x3(stride-2) →
conv-5x5@128 → pool-3x3(stride-2) → dense@1000 → dense@10:



Experiments

Bias Correction Term on Variational Auto-Encoder

Training a variational auto-encoder (single hidden layer network) with (red) and without (green) bias correction, for different values of β1, β2, α:



AdaMax

Outline

1 Optimization in Deep Learning

2 Adaptive Moment Estimation (Adam)

3 Convergence Analysis

4 Relations to Existing Algorithms

5 Experiments

6 AdaMax

7 Conclusion



AdaMax
Lp Generalization

Adam may be generalized by replacing the gradient L2 norm with an Lp norm:


Adam – Lp Generalization
M0 = 0, R0 = 0 (Initialization)
For t = 1, . . . , T :
Mt = β1 Mt−1 + (1 − β1 )∇Lt (Wt−1 ) (1st moment estimate)
Rt = (β2)^p Rt−1 + (1 − (β2)^p) |∇Lt (Wt−1)|^p (p-th moment estimate)
M̂t = Mt / (1 − (β1)^t) (1st moment bias correction)
R̂t = Rt / (1 − (β2)^{pt}) (p-th moment bias correction)
Wt = Wt−1 − α M̂t / (R̂t + ε)^{1/p} (Update)
Return WT

(β2 re-parameterized as (β2 )p for convenience)



AdaMax

p → ∞ =⇒ AdaMax

When p → ∞, ‖·‖_p → max{·} and we get:

AdaMax
M0 = 0, U0 = 0 (Initialization)
For t = 1, . . . , T :
Mt = β1 Mt−1 + (1 − β1 )∇Lt (Wt−1 ) (1st moment estimate)
Ut = max {β2 Ut−1 , |∇Lt (Wt−1 )|} (“∞” moment estimate)
M̂t = Mt / (1 − (β1 )t ) (1st moment bias correction)
Wt = Wt−1 − α M̂t / Ut (Update)
Return WT

In AdaMax, the step size is always bounded by α:
\[
\|\Delta_t\|_\infty \le \alpha
\]
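
A sketch of where the max-recursion comes from (unrolling R_t from the Lp slide with R_0 = 0 and taking the p-th root):

\[
(R_t)^{1/p}
= \Big( (1 - (\beta_2)^p) \sum_{i=1}^{t} (\beta_2)^{p(t-i)}\, |\nabla L_i(W_{i-1})|^p \Big)^{1/p}
\;\xrightarrow[p \to \infty]{}\;
\max_{i \le t}\, (\beta_2)^{t-i}\, |\nabla L_i(W_{i-1})| = U_t ,
\]

and this limit satisfies the recursion U_t = max(β2 · U_{t−1}, |∇L_t(W_{t−1})|) used above; no bias correction is needed, since the max is not pulled towards the zero initialization.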



Outline

1 Optimization in Deep Learning

2 Adaptive Moment Estimation (Adam)

3 Convergence Analysis

4 Relations to Existing Algorithms

5 Experiments

6 AdaMax

7 Conclusion

Conclusion

Adam:
Efficient first-order stochastic optimization method
Combines the advantages of:
AdaGrad – works well with sparse gradients
RMSProp – deals with non-stationary objectives

Parameter updates:
Have bounded norm
Are scale-invariant

Widely used in the deep learning community (e.g. at Google DeepMind)



Outline

1 Optimization in Deep Learning

2 Adaptive Moment Estimation (Adam)

3 Convergence Analysis

4 Relations to Existing Algorithms

5 Experiments

6 AdaMax

7 Conclusion



Thank You
