Adam (Kingma & Ba)
Nadav Cohen
October 18, 2015
Outline
1 Optimization in Deep Learning
2 Adaptive Moment Estimation (Adam)
3 Convergence Analysis
4 Relations to Existing Algorithms
5 Experiments
6 AdaMax
7 Conclusion
Loss Minimization
{(xi, yi)}, i = 1, …, m – training instances (xi) and corresponding labels (yi)
Large-scale setting:
Update rule:
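The update rule itself did not survive extraction; the footnote that follows points to SGD with momentum, so here is a plausible sketch under that assumption (the names sgd_momentum_step, mu, and alpha are illustrative, not taken from the slide):

    import numpy as np

    def sgd_momentum_step(W, V, grad, alpha=0.01, mu=0.9):
        """One SGD-with-momentum step (assumed form, not from the slide).

        W    - current parameters (NumPy array)
        V    - velocity buffer, initialized to zeros
        grad - stochastic gradient of the loss at W
        """
        V = mu * V - alpha * grad   # decaying accumulation of past gradients
        W = W + V                   # move along the velocity direction
        return W, V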
¹ See "On the Importance of Initialization and Momentum in Deep Learning" by Sutskever et al.
Optimization in Deep Learning
Update rule:
Wt = Wt−1 − α ∇Lt (Wt−1) / √( Σt′=1..t ∇Lt′ (Wt′−1)² )
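A minimal NumPy sketch of this per-coordinate update; the running accumulator sum_sq and the small eps added to avoid division by zero are implementation details assumed here, not stated on the slide:

    import numpy as np

    def adagrad_step(W, sum_sq, grad, alpha=0.01, eps=1e-8):
        """AdaGrad-style update: divide each coordinate by the root of its
        accumulated squared gradients over all past steps."""
        sum_sq = sum_sq + grad ** 2                      # per-coordinate squared-gradient sum
        W = W - alpha * grad / (np.sqrt(sum_sq) + eps)   # large past gradients -> smaller steps
        return W, sum_sq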
Rationale
Motivation
Combine the advantages of:
AdaGrad – works well with sparse gradients
RMSProp – deals with non-stationary objectives
Idea
Maintain exponential moving averages of gradient and its square
Update proportional to: average gradient / √(average squared gradient)
Algorithm
Adam
M0 = 0, R0 = 0 (Initialization)
For t = 1, . . . , T :
Mt = β1 Mt−1 + (1 − β1 )∇Lt (Wt−1 ) (1st moment estimate)
Rt = β2 Rt−1 + (1 − β2 )∇Lt (Wt−1 )2 (2nd moment estimate)
M̂t = Mt / (1 − (β1 )t ) (1st moment bias correction)
R̂t = Rt / (1 − (β2 )t ) (2nd moment bias correction)
Wt = Wt−1 − α M̂t / (√R̂t + ε) (Update)
Return WT
Hyper-parameters:
α > 0 – learning rate (typical choice: 0.001)
β1 ∈ [0, 1) – 1st moment decay rate (typical choice: 0.9)
β2 ∈ [0, 1) – 2nd moment decay rate (typical choice: 0.999)
ε > 0 – numerical term (typical choice: 10^−8)
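A direct NumPy transcription of the loop above; grad_fn is a hypothetical gradient oracle standing in for ∇Lt, and W is assumed to be a NumPy parameter array:

    import numpy as np

    def adam(W, grad_fn, T, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        """Run T Adam iterations starting from parameters W.

        grad_fn(t, W) should return the stochastic gradient of L_t at W."""
        M = np.zeros_like(W)                       # 1st moment estimate, M_0 = 0
        R = np.zeros_like(W)                       # 2nd moment estimate, R_0 = 0
        for t in range(1, T + 1):
            g = grad_fn(t, W)
            M = beta1 * M + (1 - beta1) * g        # exponential average of gradients
            R = beta2 * R + (1 - beta2) * g ** 2   # exponential average of squared gradients
            M_hat = M / (1 - beta1 ** t)           # bias correction (1st moment)
            R_hat = R / (1 - beta2 ** t)           # bias correction (2nd moment)
            W = W - alpha * M_hat / (np.sqrt(R_hat) + eps)
        return W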
Adaptive Moment Estimation (Adam)
Parameter Updates
∆t = −α M̂t / √R̂t
Properties:
Scale-invariance: rescaling the gradients by a constant c rescales M̂t by c and √R̂t by c, so ∆t is unchanged
Bounded norm:
‖∆t‖∞ ≤ α (1 − β1) / √(1 − β2)   if (1 − β1) > √(1 − β2)
‖∆t‖∞ ≤ α                        otherwise
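A quick numerical check of the scale-invariance claim; the gradient history g is made up for illustration, and ε is omitted so the comparison is exact:

    import numpy as np

    def adam_delta(grads, alpha=0.001, beta1=0.9, beta2=0.999):
        """Adam update Delta_t after seeing the gradient sequence grads (eps omitted)."""
        M = R = 0.0
        for t, g in enumerate(grads, start=1):
            M = beta1 * M + (1 - beta1) * g
            R = beta2 * R + (1 - beta2) * g ** 2
        M_hat = M / (1 - beta1 ** t)
        R_hat = R / (1 - beta2 ** t)
        return -alpha * M_hat / np.sqrt(R_hat)

    g = np.array([0.3, -1.2, 0.7, 2.0])
    print(adam_delta(g), adam_delta(10.0 * g))   # identical: the rescaling factor cancels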
Bias Correction
Since M0 = 0 and R0 = 0, the raw averages Mt and Rt are biased towards zero during the first iterations; dividing by (1 − (β1)^t) and (1 − (β2)^t) removes this bias.
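A small illustration of the effect, assuming a constant gradient equal to 1 (this toy setting is not from the slides):

    beta1 = 0.9
    M = 0.0
    for t in range(1, 6):
        M = beta1 * M + (1 - beta1) * 1.0      # constant gradient of 1
        print(t, round(M, 4), round(M / (1 - beta1 ** t), 4))
    # raw M_t starts near 0 (0.1, 0.19, ...) even though every gradient equals 1;
    # the bias-corrected estimate M_hat_t is exactly 1 at every step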
Convergence Analysis
Regret at iteration T:
R(T) := Σt=1..T [Lt (Wt) − Lt (W∗)]
where:
W∗ := argminW Σt=1..T Lt (W)
R(T)/T = O(1/√T)
i.e. the average regret vanishes as T → ∞.
Relations to Existing Algorithms
Relation to SGD
∆t = µ ∆t−1 − η ∇Lt (Wt−1)
where:
µ := β1 (1 − (β1)^(t−1)) / (1 − (β1)^t),   η := α (1 − β1) / (1 − (β1)^t)
Conclusion
Disabling 2nd moment estimation (β2 = 1) reduces Adam to SGD with:
Learning rate descending towards α (1 − β1)
Momentum ascending towards β1
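A small numerical check of this rewriting, assuming "disabling the 2nd moment" simply drops the √R̂t factor from the update; the values of α, β1 and the random gradients are illustrative:

    import numpy as np

    # Check: without the 2nd moment term, the Adam step -alpha * M_hat_t
    # equals mu_t * Delta_{t-1} - eta_t * grad_t at every iteration.
    rng = np.random.default_rng(0)
    alpha, beta1 = 0.001, 0.9
    M, Delta = 0.0, 0.0
    for t in range(1, 6):
        g = rng.normal()
        M = beta1 * M + (1 - beta1) * g
        step = -alpha * M / (1 - beta1 ** t)                     # Adam step, 2nd moment dropped
        mu  = beta1 * (1 - beta1 ** (t - 1)) / (1 - beta1 ** t)  # momentum coefficient
        eta = alpha * (1 - beta1) / (1 - beta1 ** t)             # learning rate
        print(np.isclose(step, mu * Delta - eta * g))            # True at every iteration
        Delta = step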
Relation to AdaGrad
Wt = Wt−1 − α ∇Lt (Wt−1) / ( t^(−1/2) √( Σi=1..t ∇Li (Wi−1)² ) )
Conclusion
In the limit β2 → 1−, with β1 = 0, Adam reduces to AdaGrad with annealing learning rate α · t^(−1/2).
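A small numerical check of this limit, using β1 = 0 and β2 = 1 − 10⁻⁸ as a stand-in for β2 → 1⁻; ε is omitted and the gradients are random illustrative values:

    import numpy as np

    rng = np.random.default_rng(1)
    alpha, beta2 = 0.01, 1 - 1e-8
    R, sum_sq = 0.0, 0.0
    for t in range(1, 6):
        g = rng.normal()
        R = beta2 * R + (1 - beta2) * g ** 2
        R_hat = R / (1 - beta2 ** t)
        adam_step    = alpha * t ** -0.5 * g / np.sqrt(R_hat)    # Adam with annealed rate
        sum_sq      += g ** 2
        adagrad_step = alpha * g / np.sqrt(sum_sq)               # plain AdaGrad step
        print(np.isclose(adam_step, adagrad_step))               # True (up to the beta2 -> 1 limit)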
Relation to RMSProp
Main differences:
RMSProp lacks the bias-correction terms, so its 2nd moment estimate is biased towards zero early in training
RMSProp with momentum applies momentum to the rescaled gradient, whereas Adam maintains a moving average of the gradient itself (1st moment estimate)
Replacing the squared gradient (ℓ2 norm) in the 2nd moment estimate by an ℓp norm and letting p → ∞ yields AdaMax:
AdaMax
M0 = 0, U0 = 0 (Initialization)
For t = 1, . . . , T :
Mt = β1 Mt−1 + (1 − β1 )∇Lt (Wt−1 ) (1st moment estimate)
Ut = max {β2 Ut−1 , |∇Lt (Wt−1 )|} (“∞” moment estimate)
M̂t = Mt / (1 − (β1 )t ) (1st moment bias correction)
Wt = Wt−1 − α M̂t / Ut (Update)
Return WT
Bounded norm: ‖∆t‖∞ ≤ α
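A NumPy sketch of the loop above; grad_fn is again a hypothetical gradient oracle, the hyper-parameter defaults are borrowed from the Adam slide, and the small eps guarding against division by zero is an added assumption, not part of the slide:

    import numpy as np

    def adamax(W, grad_fn, T, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        """Run T AdaMax iterations; grad_fn(t, W) returns the stochastic gradient of L_t at W."""
        M = np.zeros_like(W)                        # 1st moment estimate, M_0 = 0
        U = np.zeros_like(W)                        # infinity-norm estimate, U_0 = 0
        for t in range(1, T + 1):
            g = grad_fn(t, W)
            M = beta1 * M + (1 - beta1) * g         # exponential average of gradients
            U = np.maximum(beta2 * U, np.abs(g))    # exponentially weighted infinity norm
            M_hat = M / (1 - beta1 ** t)            # bias correction (1st moment only)
            W = W - alpha * M_hat / (U + eps)       # per-coordinate step (cf. the bound above)
        return W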
Conclusion
Adam:
Efficient first-order stochastic optimization method
Combines the advantages of:
AdaGrad – works well with sparse gradients
RMSProp – deals with non-stationary objectives
Parameter updates:
Have bounded norm
Are scale-invariant