
Adam: A Method for Stochastic Optimization

Diederik P. Kingma and Jimmy Lei Ba

Nadav Cohen

The Hebrew University of Jerusalem


Advanced Seminar in Deep Learning (#67679)

October 18, 2015



Optimization in Deep Learning

Outline

1 Optimization in Deep Learning

2 Adaptive Moment Estimation (Adam)

3 Convergence Analysis

4 Relations to Existing Algorithms

5 Experiments

6 AdaMax

7 Conclusion



Optimization in Deep Learning

Loss Minimization

Loss minimization problem:


\[
\min_W \; L(W) := \frac{1}{m} \sum_{i=1}^{m} \ell(W; x_i, y_i) + \lambda\, r(W)
\]

{(x_i, y_i)}_{i=1}^m – training instances (x_i) and corresponding labels (y_i)

W – network parameters to learn


ℓ(W; x_i, y_i) – loss of the network parameterized by W w.r.t. (x_i, y_i)

r(W) – regularization function (e.g. ‖W‖₂²)
λ > 0 – regularization weight

Optimization in Deep Learning

Large-Scale −→ First-Order Stochastic Methods

Large-scale setting:

Many network parameters (e.g. dim(W) ∼ 10⁸)
=⇒ computing the Hessian (second-order derivatives) is expensive

Many training examples (e.g. m ∼ 10⁶)
=⇒ computing the full objective at every iteration is expensive

Optimization methods must be:

First-order – update based on objective value and gradient only


Stochastic – update based on a subset of training examples:
\[
L_t(W) := \frac{1}{b} \sum_{j=1}^{b} \ell(W; x_{i_j}, y_{i_j}) + \lambda\, r(W)
\]

{(x_{i_j}, y_{i_j})}_{j=1}^b – random mini-batch chosen at iteration t
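
As an illustration, here is a minimal NumPy sketch of the mini-batch objective L_t(W), assuming (purely for the example) a linear model with squared loss and L2 regularization; the model, loss and data are illustrative choices, not fixed by the slides:

import numpy as np

def minibatch_loss(W, X, y, batch_size, lam, rng):
    # L_t(W): average loss on a random mini-batch plus lambda * r(W)
    idx = rng.choice(len(X), size=batch_size, replace=False)  # mini-batch indices
    residuals = X[idx] @ W - y[idx]
    data_loss = np.mean(residuals ** 2)      # (1/b) * sum_j loss(W; x_ij, y_ij)
    return data_loss + lam * np.sum(W ** 2)  # r(W) = ||W||_2^2

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 20)), rng.normal(size=1000)  # toy data
print(minibatch_loss(np.zeros(20), X, y, batch_size=32, lam=1e-4, rng=rng))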


Optimization in Deep Learning

Stochastic Gradient Descent (SGD)

Update rule:

\[
V_t = \mu V_{t-1} - \alpha \nabla L_t(W_{t-1})
\]
\[
W_t = W_{t-1} + V_t
\]

α > 0 – learning rate (typical choices: 0.01, 0.1)

µ ∈ [0, 1) – momentum (typical choices: 0.9, 0.95, 0.99)

Momentum smooths updates, enhancing stability and speed.
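
A minimal NumPy sketch of this update rule; grad_fn stands in for ∇L_t on the current mini-batch (an assumed helper, not defined in the slides):

import numpy as np

def sgd_momentum_step(W, V, grad_fn, alpha=0.01, mu=0.9):
    # V_t = mu * V_{t-1} - alpha * grad,  W_t = W_{t-1} + V_t
    V = mu * V - alpha * grad_fn(W)
    W = W + V
    return W, V

# toy usage on L(W) = 0.5 * ||W||^2, whose gradient is W itself
W, V = np.ones(3), np.zeros(3)
for _ in range(200):
    W, V = sgd_momentum_step(W, V, grad_fn=lambda w: w)
print(W)  # approaches the minimizer at 0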



Optimization in Deep Learning

Nesterov’s Accelerated Gradient (NAG)

Update rule:

\[
V_t = \mu V_{t-1} - \alpha \nabla L_t(W_{t-1} + \mu V_{t-1})
\]
\[
W_t = W_{t-1} + V_t
\]

The only difference from SGD is the partial update (+µV_{t−1}) in the gradient computation. May increase stability and speed in ill-conditioned problems.¹

¹ See “On the Importance of Initialization and Momentum in Deep Learning” by Sutskever et al.
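
The corresponding sketch for NAG differs only in where the gradient is evaluated (grad_fn is again an assumed helper):

def nag_step(W, V, grad_fn, alpha=0.01, mu=0.9):
    # gradient is taken at the partially updated point W + mu * V
    V = mu * V - alpha * grad_fn(W + mu * V)
    W = W + V
    return W, V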
Optimization in Deep Learning

Adaptive Gradient (AdaGrad)

Update rule:
\[
W_t = W_{t-1} - \alpha\, \frac{\nabla L_t(W_{t-1})}{\sqrt{\sum_{t'=1}^{t} \nabla L_{t'}(W_{t'-1})^2}}
\]

Learning rate adapted per coordinate:


Highly varying coordinate −→ suppress
Rarely varying coordinate −→ enhance

Disadvantage in non-stationary settings:


All gradients (recent and old) weighted equally
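
A minimal sketch of AdaGrad's per-coordinate scaling; the small constant eps is an assumed implementation detail for numerical safety (it is not on the slide):

import numpy as np

def adagrad_step(W, G, grad_fn, alpha=0.01, eps=1e-8):
    g = grad_fn(W)
    G = G + g ** 2                          # lifetime sum of squared gradients, per coordinate
    W = W - alpha * g / (np.sqrt(G) + eps)  # coordinates with large accumulated gradients get smaller effective steps
    return W, G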

Optimization in Deep Learning

Root Mean Square Propagation (RMSProp)

Update rule:

\[
R_t = \gamma R_{t-1} + (1 - \gamma)\, \nabla L_t(W_{t-1})^2
\]
\[
W_t = W_{t-1} - \alpha\, \frac{\nabla L_t(W_{t-1})}{\sqrt{R_t}}
\]

Similar to AdaGrad, but with an exponential moving average controlled by γ ∈ [0, 1) (smaller γ =⇒ more emphasis on recent gradients).

May be combined with NAG:

\[
R_t = \gamma R_{t-1} + (1 - \gamma)\, \nabla L_t(W_{t-1} + \mu V_{t-1})^2
\]
\[
V_t = \mu V_{t-1} - \frac{\alpha}{\sqrt{R_t}}\, \nabla L_t(W_{t-1} + \mu V_{t-1})
\]
\[
W_t = W_{t-1} + V_t
\]
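
A sketch of plain RMSProp as written above (the NAG combination is omitted for brevity; eps is again an assumed numerical-safety constant):

import numpy as np

def rmsprop_step(W, R, grad_fn, alpha=0.001, gamma=0.9, eps=1e-8):
    g = grad_fn(W)
    R = gamma * R + (1 - gamma) * g ** 2    # exponential moving average of squared gradients
    W = W - alpha * g / (np.sqrt(R) + eps)
    return W, R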



Adaptive Moment Estimation (Adam)

Outline

1 Optimization in Deep Learning

2 Adaptive Moment Estimation (Adam)

3 Convergence Analysis

4 Relations to Existing Algorithms

5 Experiments

6 AdaMax

7 Conclusion



Adaptive Moment Estimation (Adam)

Rationale

Motivation
Combine the advantages of:

AdaGrad – works well with sparse gradients

RMSProp – works well in non-stationary settings

Idea
Maintain exponential moving averages of the gradient and its square

Update proportional to (average gradient) / √(average squared gradient)



Adaptive Moment Estimation (Adam)

Algorithm
Adam
M0 = 0, R0 = 0 (Initialization)
For t = 1, . . . , T :
Mt = β1 Mt−1 + (1 − β1 )∇Lt (Wt−1 ) (1st moment estimate)
Rt = β2 Rt−1 + (1 − β2 )∇Lt (Wt−1 )2 (2nd moment estimate)
M̂t = Mt / (1 − (β1 )t ) (1st moment bias correction)
R̂t = Rt / (1 − (β2 )t ) (2nd moment bias correction)
Wt = Wt−1 − α M̂t / √(R̂t + ε) (Update)
Return WT

Hyper-parameters:
α > 0 – learning rate (typical choice: 0.001)
β1 ∈ [0, 1) – 1st moment decay rate (typical choice: 0.9)
β2 ∈ [0, 1) – 2nd moment decay rate (typical choice: 0.999)
ε > 0 – numerical stability term (typical choice: 10⁻⁸)
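
A minimal NumPy sketch of the algorithm above, keeping ε inside the square root as written on this slide (the paper's pseudocode adds ε outside the root; the difference is negligible in practice). grad_fn(W, t) is an assumed helper returning the mini-batch gradient at iteration t:

import numpy as np

def adam(grad_fn, W, num_steps, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    M = np.zeros_like(W)  # 1st moment estimate
    R = np.zeros_like(W)  # 2nd moment estimate
    for t in range(1, num_steps + 1):
        g = grad_fn(W, t)
        M = beta1 * M + (1 - beta1) * g
        R = beta2 * R + (1 - beta2) * g ** 2
        M_hat = M / (1 - beta1 ** t)   # bias correction
        R_hat = R / (1 - beta2 ** t)
        W = W - alpha * M_hat / np.sqrt(R_hat + eps)
    return W

# toy usage: minimize L(W) = 0.5 * ||W||^2
print(adam(lambda w, t: w, W=np.ones(3), num_steps=3000))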
Adaptive Moment Estimation (Adam)

Parameter Updates

Adam's step at iteration t (assuming ε = 0):
\[
\Delta_t = -\alpha\, \frac{\hat{M}_t}{\sqrt{\hat{R}_t}}
\]

Properties:
Scale-invariance:

\[
L(W) \to c \cdot L(W)\ \ (c > 0) \;\Longrightarrow\; \hat{M}_t \to c \cdot \hat{M}_t \,\wedge\, \hat{R}_t \to c^2 \cdot \hat{R}_t
\]
=⇒ ∆_t does not change

Bounded norm:
\[
\|\Delta_t\|_\infty \le
\begin{cases}
\alpha \cdot (1 - \beta_1)/\sqrt{1 - \beta_2}, & (1 - \beta_1) > \sqrt{1 - \beta_2} \\
\alpha, & \text{otherwise}
\end{cases}
\]
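
A quick numerical sanity check of the scale-invariance property above (a sketch, not from the slides): feeding the moment estimates a gradient sequence scaled by c > 0 leaves the step unchanged when ε = 0.

import numpy as np

def adam_delta(grads, alpha=0.001, beta1=0.9, beta2=0.999):
    # Adam's step after seeing the gradient sequence `grads`, with eps = 0
    M = R = 0.0
    for t, g in enumerate(grads, start=1):
        M = beta1 * M + (1 - beta1) * g
        R = beta2 * R + (1 - beta2) * g ** 2
    return -alpha * (M / (1 - beta1 ** t)) / np.sqrt(R / (1 - beta2 ** t))

grads = np.random.default_rng(0).normal(size=10)
print(np.isclose(adam_delta(grads), adam_delta(5.0 * grads)))  # True: step unchanged under scaling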



Adaptive Moment Estimation (Adam)

Bias Correction

Taking into account the initialization M0 = 0, we have:

\[
M_t = \beta_1 M_{t-1} + (1 - \beta_1)\nabla L_t(W_{t-1})
    = \sum_{i=1}^{t} (1 - \beta_1)(\beta_1)^{t-i} \cdot \nabla L_i(W_{i-1})
\]
Since \(\sum_{i=1}^{t} (1 - \beta_1)(\beta_1)^{t-i} = 1 - (\beta_1)^t\), to obtain an unbiased estimate we divide by \(1 - (\beta_1)^t\):
\[
\hat{M}_t = M_t \,/\, \big(1 - (\beta_1)^t\big)
\]


An analogous argument yields the bias correction for R_t.
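
A tiny numerical illustration (a sketch, not from the slides): for a constant gradient g, the raw average satisfies M_t = (1 − (β1)^t)·g, so dividing by 1 − (β1)^t recovers g exactly at every step.

beta1, g = 0.9, 2.5
M = 0.0
for t in range(1, 11):
    M = beta1 * M + (1 - beta1) * g
    # raw M is biased towards 0 early on; the corrected value equals g
    print(t, round(M, 4), round(M / (1 - beta1 ** t), 4))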



Convergence Analysis

Outline

1 Optimization in Deep Learning

2 Adaptive Moment Estimation (Adam)

3 Convergence Analysis

4 Relations to Existing Algorithms

5 Experiments

6 AdaMax

7 Conclusion

Convergence Analysis

Convergence in Online Convex Regime

Regret at iteration T:
\[
R(T) := \sum_{t=1}^{T} \big[ L_t(W_t) - L_t(W^*) \big]
\]
where:
\[
W^* := \operatorname*{argmin}_{W} \sum_{t=1}^{T} L_t(W)
\]

In the convex regime, Adam attains a regret bound comparable to the best known:


Theorem
If all batch objectives Lt (W ) are convex and have bounded gradients, and
all points Wt generated by Adam are within bounded distance from each
other, then for every T ∈ N:

\[
\frac{R(T)}{T} = O\!\left(\frac{1}{\sqrt{T}}\right)
\]
Relations to Existing Algorithms

Outline

1 Optimization in Deep Learning

2 Adaptive Moment Estimation (Adam)

3 Convergence Analysis

4 Relations to Existing Algorithms

5 Experiments

6 AdaMax

7 Conclusion



Relations to Existing Algorithms

Relation to SGD

Setting β2 = 1, Adam’s update rule may be written as:

\[
M_t = \mu M_{t-1} - \eta\, \nabla L_t(W_{t-1})
\]
\[
W_t = W_{t-1} + M_t
\]

where:
\[
\mu := \frac{\beta_1\,(1 - (\beta_1)^{t-1})}{1 - (\beta_1)^t},
\qquad
\eta := \frac{\alpha\,(1 - \beta_1)}{(1 - (\beta_1)^t)\,\sqrt{\varepsilon}}
\]

Conclusion
Disabling 2nd moment estimation (β2 = 1) reduces Adam to SGD with:

Learning rate descending towards α(1 − β1)/√ε
Momentum ascending towards β1
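
A sketch of the algebra behind this reduction (keeping ε inside the square root, as in the update rule above): with β2 = 1 the second-moment estimate stays at R_t = 0, so the denominator of Adam's update is √ε; writing the previous step as W_{t−1} − W_{t−2} = −α M_{t−1}/((1 − (β1)^{t−1})√ε) and substituting the first-moment recursion M_t = β1 M_{t−1} + (1 − β1)∇L_t(W_{t−1}) gives:

\[
W_t - W_{t-1}
= -\frac{\alpha\, M_t}{(1 - (\beta_1)^t)\sqrt{\varepsilon}}
= \underbrace{\frac{\beta_1\,(1 - (\beta_1)^{t-1})}{1 - (\beta_1)^t}}_{\mu}\,(W_{t-1} - W_{t-2})
\;-\; \underbrace{\frac{\alpha\,(1 - \beta_1)}{(1 - (\beta_1)^t)\sqrt{\varepsilon}}}_{\eta}\, \nabla L_t(W_{t-1})
\]

which is exactly the SGD-with-momentum recursion stated above.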



Relations to Existing Algorithms

Relation to AdaGrad

Setting β1 = 0 and β2 → 1⁻ (and assuming ε = 0), Adam's update rule may be written as:

\[
W_t = W_{t-1} - \alpha\, \frac{\nabla L_t(W_{t-1})}{t^{-1/2}\,\sqrt{\sum_{i=1}^{t} \nabla L_i(W_{i-1})^2}}
\]

Conclusion
In the limit β2 → 1⁻, with β1 = 0 and an annealed learning rate α_t = α · t^{−1/2}, Adam reduces to AdaGrad.
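
A sketch of where the t^{−1/2} factor comes from: as β2 → 1⁻ the bias-corrected second-moment estimate tends to the plain average of the squared gradients,

\[
\hat{R}_t
= \frac{\sum_{i=1}^{t} (1 - \beta_2)(\beta_2)^{t-i}\, \nabla L_i(W_{i-1})^2}{1 - (\beta_2)^t}
\;\xrightarrow[\beta_2 \to 1^-]{}\;
\frac{1}{t} \sum_{i=1}^{t} \nabla L_i(W_{i-1})^2 ,
\]

so \(\sqrt{\hat{R}_t} \to t^{-1/2}\sqrt{\sum_{i=1}^{t}\nabla L_i(W_{i-1})^2}\); annealing the learning rate as α · t^{−1/2} cancels this factor and leaves exactly AdaGrad's update.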



Relations to Existing Algorithms

Relation to RMSProp

RMSProp with momentum is the method most closely related to Adam.

Main differences:

RMSProp rescales the gradient and then applies momentum; Adam first applies momentum (moving average) and then rescales.

RMSProp lacks bias correction, often leading to large step sizes in the early stages of a run (especially when the second-moment decay rate is close to 1).
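
Side by side, in the notation of the earlier slides (a restatement using plain momentum rather than the NAG variant):

\[
\text{RMSProp + momentum:}\quad V_t = \mu V_{t-1} - \alpha\, \frac{\nabla L_t(W_{t-1})}{\sqrt{R_t}}
\qquad\quad
\text{Adam:}\quad \Delta_t = -\alpha\, \frac{\hat{M}_t}{\sqrt{\hat{R}_t}}
\]

i.e. RMSProp rescales each gradient before averaging it into the velocity, while Adam averages the raw gradients into M_t and rescales the average.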



Experiments

Outline

1 Optimization in Deep Learning

2 Adaptive Moment Estimation (Adam)

3 Convergence Analysis

4 Relations to Existing Algorithms

5 Experiments

6 AdaMax

7 Conclusion



Experiments

Logistic Regression on MNIST

L2-regularized logistic regression applied directly to image pixels:



Experiments

Logistic Regression on IMDB


Dropout regularized logistic regression applied to sparse Bag-of-Words
features:



Experiments

Multi-Layer Neural Networks (Fully-Connected) on MNIST

2 hidden layers, 1000 units each, ReLU activation, dropout regularization:



Experiments

Convolutional Neural Networks on CIFAR-10


Network architecture:
conv-5x5@64 → pool-3x3(stride-2) → conv-5x5@64 → pool-3x3(stride-2) →
conv-5x5@128 → pool-3x3(stride-2) → dense@1000 → dense@10:



Experiments

Bias Correction Term on Variational Auto-Encoder

Training a variational auto-encoder (single hidden layer network) with (red) and without (green) bias correction, for different values of β1, β2, α:



AdaMax

Outline

1 Optimization in Deep Learning

2 Adaptive Moment Estimation (Adam)

3 Convergence Analysis

4 Relations to Existing Algorithms

5 Experiments

6 AdaMax

7 Conclusion



AdaMax
Lp Generalization

Adam may be generalized by replacing the gradient L2 norm with an Lp norm:


Adam – Lp Generalization
M0 = 0, R0 = 0 (Initialization)
For t = 1, . . . , T :
Mt = β1 Mt−1 + (1 − β1 )∇Lt (Wt−1 ) (1st moment estimate)
Rt = (β2)^p Rt−1 + (1 − (β2)^p) |∇Lt (Wt−1)|^p (p-th moment estimate)
M̂t = Mt / (1 − (β1)^t) (1st moment bias correction)
R̂t = Rt / (1 − (β2)^{pt}) (p-th moment bias correction)
Wt = Wt−1 − α M̂t / (R̂t + ε)^{1/p} (Update)
Return WT

(β2 re-parameterized as (β2 )p for convenience)



AdaMax

p → ∞ =⇒ AdaMax

When p → ∞, ‖·‖_p → max{·} and we get:

AdaMax
M0 = 0, U0 = 0 (Initialization)
For t = 1, . . . , T :
Mt = β1 Mt−1 + (1 − β1 )∇Lt (Wt−1 ) (1st moment estimate)
Ut = max {β2 Ut−1 , |∇Lt (Wt−1 )|} (“∞” moment estimate)
M̂t = Mt / (1 − (β1 )t ) (1st moment bias correction)
Wt = Wt−1 − α M̂t / Ut (Update)
Return WT

In AdaMax, the step size is always bounded by α:
\[
\|\Delta_t\|_\infty \le \alpha
\]
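
A sketch of where the max-recursion comes from (unrolling R_t from the Lp slide with R_0 = 0 and taking the p-th root):

\[
(R_t)^{1/p}
= \Big( (1 - (\beta_2)^p) \sum_{i=1}^{t} (\beta_2)^{p(t-i)}\, |\nabla L_i(W_{i-1})|^p \Big)^{1/p}
\;\xrightarrow[p \to \infty]{}\;
\max_{i \le t}\, (\beta_2)^{t-i}\, |\nabla L_i(W_{i-1})| = U_t ,
\]

and this limit satisfies the recursion U_t = max(β2 · U_{t−1}, |∇L_t(W_{t−1})|) used above; no bias correction is needed, since the max is not pulled towards the zero initialization.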



Outline

1 Optimization in Deep Learning

2 Adaptive Moment Estimation (Adam)

3 Convergence Analysis

4 Relations to Existing Algorithms

5 Experiments

6 AdaMax

7 Conclusion

Conclusion

Adam:
Efficient first-order stochastic optimization method
Combines the advantages of:
AdaGrad – works well with sparse gradients
RMSProp – deals with non-stationary objectives

Parameter updates:
Have bounded norm
Are scale-invariant

Widely used in the deep learning community (e.g. at Google DeepMind)



Outline

1 Optimization in Deep Learning

2 Adaptive Moment Estimation (Adam)

3 Convergence Analysis

4 Relations to Existing Algorithms

5 Experiments

6 AdaMax

7 Conclusion



Thank You
