Stochastic Gradient Descent

Compute gradient estimate
𝐠 ← (1/𝑚) ∇𝛉 Σᵢ 𝐿(𝑓(𝐱⁽ⁱ⁾; 𝛉), 𝐲⁽ⁱ⁾)
Apply update
𝛉 ← 𝛉 − 𝜖𝐠
Uses randomly drawn samples or random mini-batches instead of the complete data at each update, and depends only on the local gradient.
Note: reduced dependency on the full dataset at each step
Challenge: non-convex problem, slow convergence
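
A minimal NumPy sketch of this update loop, assuming a toy quadratic loss whose gradient is simply 𝛉 as a stand-in for the minibatch gradient estimate; the learning rate is an illustrative choice, not a value from the notes.

import numpy as np

def minibatch_grad(theta):
    # Stand-in for (1/m) * sum_i grad L(f(x_i; theta), y_i); here L(theta) = 0.5*||theta||^2.
    return theta

theta = np.array([1.0, -2.0])   # parameters
eps = 0.1                       # learning rate (illustrative)
for step in range(100):
    g = minibatch_grad(theta)   # compute gradient estimate
    theta = theta - eps * g     # apply update: theta <- theta - eps*g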

Momentum
Compute gradient estimate
𝐠 ← (1/𝑚) ∇𝛉 Σᵢ 𝐿(𝑓(𝐱⁽ⁱ⁾; 𝛉), 𝐲⁽ⁱ⁾)
Compute velocity update
𝐯 ← 𝛼𝐯 − 𝜖𝐠
Apply update
𝛉 ← 𝛉 + 𝐯
Faster near minima, avoiding the slow convergence of plain SGD.
Note: faster convergence, reduced oscillation
Challenge: blindly follows the slope
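
The same toy setup, extended with a velocity term that accumulates an exponentially decaying sum of past gradients; hyperparameter values are illustrative.

import numpy as np

def minibatch_grad(theta):
    return theta                    # toy stand-in for the minibatch gradient

theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)            # velocity, initialized to zero
eps, alpha = 0.1, 0.9               # learning rate and momentum coefficient (illustrative)
for step in range(100):
    g = minibatch_grad(theta)       # compute gradient estimate
    v = alpha * v - eps * g         # compute velocity update
    theta = theta + v               # apply update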

Nesterov momentum
Compute interim update
𝛉̃ ← 𝛉 + 𝛼𝐯
Compute gradient (at the interim point)
𝐠 ← (1/𝑚) ∇𝛉̃ Σᵢ 𝐿(𝑓(𝐱⁽ⁱ⁾; 𝛉̃), 𝐲⁽ⁱ⁾)
Compute velocity update
𝐯 ← 𝛼𝐯 − 𝜖𝐠
Apply update
𝛉 ← 𝛉 + 𝐯
Note: faster convergence, knows where it is going
Challenge: not adaptive
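
A sketch of the look-ahead variant under the same toy-gradient assumption; the only change from standard momentum is that the gradient is evaluated at the interim point 𝛉 + 𝛼𝐯.

import numpy as np

def minibatch_grad(theta):
    return theta                        # toy stand-in for the minibatch gradient

theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)
eps, alpha = 0.1, 0.9                   # illustrative values
for step in range(100):
    theta_interim = theta + alpha * v   # compute interim update
    g = minibatch_grad(theta_interim)   # gradient evaluated at the interim point
    v = alpha * v - eps * g             # compute velocity update
    theta = theta + v                   # apply update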

AdaGrad
Compute gradient estimate
𝐠 ← (1/𝑚) ∇𝛉 Σᵢ 𝐿(𝑓(𝐱⁽ⁱ⁾; 𝛉), 𝐲⁽ⁱ⁾)
Accumulate squared gradient
𝐫 ← 𝐫 + 𝐠 ⨀ 𝐠
Compute parameter update (division and square root applied element-wise)
𝚫𝛉 ← −(𝜖 / (𝛿 + √𝐫)) ⨀ 𝐠
Apply update
𝛉 ← 𝛉 + 𝚫𝛉
The learning rate is adaptive and slows down near minima.
Note: adaptive
Challenge: the accumulator keeps growing, so the learning rate keeps shrinking
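
A toy sketch of the per-parameter scaling; the accumulator 𝐫 starts at zero and the stabilizer 𝛿 and learning rate are illustrative values.

import numpy as np

def minibatch_grad(theta):
    return theta                        # toy stand-in for the minibatch gradient

theta = np.array([1.0, -2.0])
r = np.zeros_like(theta)                # accumulated squared gradient
eps, delta = 0.1, 1e-7                  # learning rate and small stabilizing constant (illustrative)
for step in range(100):
    g = minibatch_grad(theta)           # compute gradient estimate
    r = r + g * g                       # accumulate squared gradient
    dtheta = -(eps / (delta + np.sqrt(r))) * g   # element-wise scaled update
    theta = theta + dtheta              # apply update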

RMSProp
Compute gradient estimate
𝐠 ← (1/𝑚) ∇𝛉 Σᵢ 𝐿(𝑓(𝐱⁽ⁱ⁾; 𝛉), 𝐲⁽ⁱ⁾)
Accumulate squared gradient (exponentially decaying average)
𝐫 ← 𝜌𝐫 + (1 − 𝜌)𝐠 ⨀ 𝐠
Compute parameter update (division and square root applied element-wise)
𝚫𝛉 ← −(𝜖 / (𝛿 + √𝐫)) ⨀ 𝐠
Apply update
𝛉 ← 𝛉 + 𝚫𝛉
The decaying average discards history from the distant past, so the effective learning rate does not keep shrinking as in AdaGrad.
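
A sketch under the same assumptions; the only change from the AdaGrad sketch is the decay factor 𝜌 on the squared-gradient accumulator.

import numpy as np

def minibatch_grad(theta):
    return theta                        # toy stand-in for the minibatch gradient

theta = np.array([1.0, -2.0])
r = np.zeros_like(theta)
eps, rho, delta = 0.01, 0.9, 1e-6       # illustrative values
for step in range(100):
    g = minibatch_grad(theta)           # compute gradient estimate
    r = rho * r + (1 - rho) * g * g     # decaying average of squared gradient
    dtheta = -(eps / (delta + np.sqrt(r))) * g   # element-wise scaled update
    theta = theta + dtheta              # apply update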

RMSProp with Nesterov momentum
Compute interim update
𝛉̃ ← 𝛉 + 𝛼𝐯
Compute gradient (at the interim point)
𝐠 ← (1/𝑚) ∇𝛉̃ Σᵢ 𝐿(𝑓(𝐱⁽ⁱ⁾; 𝛉̃), 𝐲⁽ⁱ⁾)
Accumulate squared gradient
𝐫 ← 𝜌𝐫 + (1 − 𝜌)𝐠 ⨀ 𝐠
Compute velocity update
𝐯 ← 𝛼𝐯 − (𝜖 / (𝛿 + √𝐫)) ⨀ 𝐠
Apply update
𝛉 ← 𝛉 + 𝐯
Uses two knobs, momentum (𝛼) and the decaying squared-gradient accumulation (𝜌), to adapt learning.
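
A combined sketch, again with a stand-in gradient; it interleaves the interim look-ahead step with the RMSProp scaling inside the velocity update.

import numpy as np

def minibatch_grad(theta):
    return theta                              # toy stand-in for the minibatch gradient

theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)                      # velocity
r = np.zeros_like(theta)                      # squared-gradient accumulator
eps, alpha, rho, delta = 0.01, 0.9, 0.9, 1e-6 # illustrative values
for step in range(100):
    theta_interim = theta + alpha * v         # compute interim update
    g = minibatch_grad(theta_interim)         # gradient at the interim point
    r = rho * r + (1 - rho) * g * g           # accumulate squared gradient
    v = alpha * v - (eps / (delta + np.sqrt(r))) * g   # compute velocity update
    theta = theta + v                         # apply update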

Adam
Compute gradient estimate
𝐠 ← (1/𝑚) ∇𝛉 Σᵢ 𝐿(𝑓(𝐱⁽ⁱ⁾; 𝛉), 𝐲⁽ⁱ⁾)
Increment time step
𝑡 ← 𝑡 + 1
Update biased first moment estimate
𝐬 ← 𝜌₁𝐬 + (1 − 𝜌₁)𝐠
Update biased second moment estimate
𝐫 ← 𝜌₂𝐫 + (1 − 𝜌₂)𝐠 ⨀ 𝐠
Correct bias in first moment
𝐬̂ ← 𝐬 / (1 − 𝜌₁ᵗ)
Correct bias in second moment
𝐫̂ ← 𝐫 / (1 − 𝜌₂ᵗ)
Compute parameter update (division and square root applied element-wise)
𝚫𝛉 ← −𝜖 𝐬̂ / (𝛿 + √𝐫̂)
Apply update
𝛉 ← 𝛉 + 𝚫𝛉
Note: adaptive
Uses the same rule for every step, with no special case for initialization.
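
A sketch of the full loop with the stand-in gradient; the values of 𝜖, 𝜌₁, 𝜌₂ and 𝛿 follow commonly used defaults rather than values prescribed by the notes.

import numpy as np

def minibatch_grad(theta):
    return theta                              # toy stand-in for the minibatch gradient

theta = np.array([1.0, -2.0])
s = np.zeros_like(theta)                      # first moment estimate
r = np.zeros_like(theta)                      # second moment estimate
eps, rho1, rho2, delta = 0.001, 0.9, 0.999, 1e-8   # commonly used defaults
t = 0
for step in range(100):
    g = minibatch_grad(theta)                 # compute gradient estimate
    t += 1                                    # increment time step
    s = rho1 * s + (1 - rho1) * g             # biased first moment
    r = rho2 * r + (1 - rho2) * g * g         # biased second moment
    s_hat = s / (1 - rho1 ** t)               # bias-corrected first moment
    r_hat = r / (1 - rho2 ** t)               # bias-corrected second moment
    dtheta = -eps * s_hat / (delta + np.sqrt(r_hat))   # element-wise update
    theta = theta + dtheta                    # apply update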

Ref. Book: Sections 8.3 & 8.5, Deep Learning. Ian Goodfellow, Yoshua Bengio and Aaron Courville.
