The material here is based on a paper by D. P. Kingma and J. L. Ba, the inventors of the Adam
technique.
The regularized error over all $m$ examples is

$$E(W) = \frac{1}{m} \sum_{i=1}^{m} e_i(W, x_i, y_i) + \lambda r(W)$$

where $\lambda \ge 0$ is user defined. To make this error "stochastic", we select at time $t$ a batch of $b$ random examples out of the $m$ given examples, where typically $b \ll m$. The stochastic error to be minimized is:

$$E_t(W) = \frac{1}{b} \sum_{i=1}^{b} e_i(W, x_i, y_i) + \lambda r(W)$$
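For concreteness, here is a minimal sketch of how this stochastic batch error might be computed; the per-example loss `e`, the regularizer `r`, and all other names are illustrative assumptions, not part of the notes.

```python
import numpy as np

def stochastic_error(W, X, Y, e, r, lam, b, rng):
    """E_t(W): mean per-example loss over a random batch of size b
    (out of m = len(X) examples) plus the regularization term lam * r(W)."""
    idx = rng.choice(len(X), size=b, replace=False)   # b << m random examples
    batch = np.mean([e(W, X[i], Y[i]) for i in idx])  # (1/b) * sum of e_i
    return batch + lam * r(W)
```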
This can be written as the following three-step process that computes $W_t$ from $W_{t-1}$:

1. $D_t = \nabla E_t(W_{t-1})$
2. $C_t = -D_t$
3. $W_t = W_{t-1} + \eta C_t$

where $\eta > 0$ is the learning rate.
The momentum method replaces $C_t = -D_t$ with a combination of the previous direction and the current gradient:

0. $C_0 = 0$
1. for $t = 1, 2, \ldots$
   1.1 $D_t = \nabla E_t(W_{t-1})$
   1.2 $C_t = \alpha C_{t-1} - D_t$
   1.3 $W_t = W_{t-1} + \eta C_t$
where $0 \le \alpha < 1$. To understand what's going on here, consider a single weight $w_t$ from the vector $W_t$, and the corresponding values $c_t$ from $C_t$ and $d_t$ from $D_t$:
$$\begin{aligned}
c_0 &= 0\\
c_1 &= -d_1\\
c_2 &= \alpha c_1 - d_2 = -(\alpha d_1 + d_2)\\
c_3 &= \alpha c_2 - d_3 = -(\alpha^2 d_1 + \alpha d_2 + d_3)\\
&\;\vdots\\
c_k &= -(\alpha^{k-1} d_1 + \alpha^{k-2} d_2 + \cdots + d_k)
\end{aligned}$$
Observe that if the $d_t$ are mostly positive or mostly negative, the contributions to $c_k$ accumulate. For the case where they are all identical, $d_t = d$, summing the geometric series gives (for large $k$):
$$c_k \approx -\frac{1}{1-\alpha}\, d, \qquad w_k \approx w_{k-1} - \frac{\eta}{1-\alpha}\, d$$
Thus, this behaves as an increase of $\eta$ by a factor of $1/(1-\alpha)$. If the gradient sign keeps changing (from + to −) then there is no effective increase in $\eta$.

Typical values: $\eta = 0.1$, $\alpha = 0.9$.
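This momentum update translates directly into code. Below is a minimal NumPy sketch on a toy quadratic objective; the function `grad`, the toy problem, and the variable names are illustrative assumptions. Setting `alpha = 0` recovers plain gradient descent.

```python
import numpy as np

def momentum_descent(grad, w0, eta=0.1, alpha=0.9, steps=100):
    """Gradient descent with momentum:
    C_t = alpha * C_{t-1} - D_t,  W_t = W_{t-1} + eta * C_t."""
    w = np.asarray(w0, dtype=float)
    c = np.zeros_like(w)          # C_0 = 0
    for _ in range(steps):
        d = grad(w)               # D_t = gradient at the current weights
        c = alpha * c - d         # accumulate the momentum term
        w = w + eta * c           # take the step
    return w

# Toy example: minimize E(W) = 0.5 * ||W||^2, whose gradient is W.
print(momentum_descent(lambda w: w, w0=[5.0, -3.0]))  # approaches [0, 0]
```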
The Adam technique is built on running averages. Given a sequence of values $X_1, X_2, \ldots$ and a constant $0 \le \beta < 1$, the running average $\bar{X}_t$ is defined by the recurrence:

$$\bar{X}_t = \beta \bar{X}_{t-1} + (1-\beta) X_t, \qquad \bar{X}_0 = 0$$

Unrolling the recurrence:

$$\bar{X}_1 = \beta \bar{X}_0 + (1-\beta) X_1 = (1-\beta) X_1, \qquad \ldots, \qquad \bar{X}_t = (1-\beta) \sum_{i=1}^{t} \beta^{t-i} X_i$$
If $X_i = X$, a constant, then:

$$\bar{X}_t = X (1-\beta) \, \frac{1-\beta^t}{1-\beta} = (1-\beta^t) X$$

But we want this value to be $X$, and this can be achieved if the value is divided by $(1-\beta^t)$. This suggests the following algorithm for the running average:

$$\bar{X}_0 = 0, \qquad \bar{X}_t = \beta \bar{X}_{t-1} + (1-\beta) X_t, \qquad \hat{X}_t = \frac{\bar{X}_t}{1-\beta^t}$$
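A minimal sketch of this bias-corrected running average (the generator form and all names are illustrative):

```python
def corrected_running_average(xs, beta=0.9):
    """Yields the bias-corrected running average X_hat_t for each X_t:
    X_bar_t = beta * X_bar_{t-1} + (1 - beta) * X_t, divided by (1 - beta^t)."""
    x_bar = 0.0                        # X_bar_0 = 0
    for t, x in enumerate(xs, start=1):
        x_bar = beta * x_bar + (1 - beta) * x
        yield x_bar / (1 - beta ** t)  # bias correction

# For a constant sequence the corrected average recovers the constant:
print(list(corrected_running_average([5.0, 5.0, 5.0])))
# ~[5.0, 5.0, 5.0] (up to floating-point rounding)
```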
Combining this with bias-corrected running averages $\hat{D}_t$ of the gradients and $\hat{R}_t$ of their squares gives the Adam algorithm:

0. $\bar{D}_0 = 0$, $\bar{R}_0 = 0$
1. for $t = 1, 2, \ldots$
   1.1 $D_t = \nabla E_t(W_{t-1})$
   1.2 $\bar{D}_t = \beta_1 \bar{D}_{t-1} + (1-\beta_1) D_t$
   1.3 $\bar{R}_t = \beta_2 \bar{R}_{t-1} + (1-\beta_2) D_t^2$
   1.4 $\hat{D}_t = \bar{D}_t / (1-\beta_1^t)$
   1.5 $\hat{R}_t = \bar{R}_t / (1-\beta_2^t)$
   1.6 $W_t = W_{t-1} - \alpha \, \frac{\hat{D}_t}{\sqrt{\hat{R}_t}}$ (The square $D_t^2$, the square root, and the division are all applied separately to each coordinate.)
To guard against division by a very small number, the last step is typically implemented as:

$$W_t = W_{t-1} - \alpha \, \frac{\hat{D}_t}{\sqrt{\hat{R}_t} + \epsilon}$$
Typical values (as recommended in the Adam paper): $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
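Putting the pieces together, here is a minimal NumPy sketch of this Adam loop using the default constants above; `grad`, the toy objective, and the variable names are illustrative assumptions, not part of the notes.

```python
import numpy as np

def adam(grad, w0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam: bias-corrected running averages of the gradient (D_bar) and of
    its coordinate-wise square (R_bar), then a scaled coordinate-wise step."""
    w = np.asarray(w0, dtype=float)
    d_bar = np.zeros_like(w)  # D_bar_0 = 0
    r_bar = np.zeros_like(w)  # R_bar_0 = 0
    for t in range(1, steps + 1):
        d = grad(w)                                   # D_t
        d_bar = beta1 * d_bar + (1 - beta1) * d       # running average of gradients
        r_bar = beta2 * r_bar + (1 - beta2) * d * d   # running average of squares
        d_hat = d_bar / (1 - beta1 ** t)              # bias corrections
        r_hat = r_bar / (1 - beta2 ** t)
        w = w - alpha * d_hat / (np.sqrt(r_hat) + eps)  # coordinate-wise step
    return w

# Toy example: minimize E(W) = 0.5 * ||W||^2, whose gradient is W.
print(adam(lambda w: w, w0=[5.0, -3.0]))  # moves steadily toward [0, 0]
```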