
Adam: Adaptive Moment Estimation

The material here is based on a paper by D. P. Kingma and J. L. Ba, the inventors of the Adam
technique.

The error to be minimized


Training data is given as the pairs (xi, yi), i = 1, . . . , m, where xi is a feature vector and yi is the desired prediction. Let f be the function being used to predict yi from xi. It depends on weights, specified by the weights vector W. Let e be the error penalty, which also depends on W. For example, e can be defined as: e(W, x, y) = (y − f(x, W))^2. The error penalty for the ith training example is:

ei = e(W, xi, yi)

The error penalty for the entire training data can be defined as: E(W) = (1/m) Σ_{i=1}^{m} e(W, xi, yi). It
is customary to add a regularization term r(W ) to make sure that weights do not become too big.
For example: r(W) = |W|^2. Thus, the error to be minimized is given by:

E(W) = (1/m) Σ_{i=1}^{m} e(W, xi, yi) + λ r(W),

where λ ≥ 0 is user defined. To make this error “stochastic”, we select at time t a batch of b random examples out of the m given examples, where typically b ≪ m. The stochastic error to be minimized is:

Et(W) = (1/b) Σ_{i=1}^{b} e(W, xi, yi) + λ r(W),

where the sum runs over the b examples selected at time t.
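
As a concrete illustration, here is a minimal NumPy sketch of these definitions for the special case of a linear predictor f(x, W) = W · x with the squared-error penalty and r(W) = |W|^2; the linear model and the function names are assumptions made for this example only, not part of the notes.

    import numpy as np

    def f(x, W):
        # Assumed linear predictor f(x, W) = W . x (for illustration only).
        return W @ x

    def e(W, x, y):
        # Squared-error penalty: e(W, x, y) = (y - f(x, W))^2.
        return (y - f(x, W)) ** 2

    def E(W, X, Y, lam):
        # Full-data error: (1/m) * sum_i e(W, x_i, y_i) + lam * |W|^2.
        m = len(X)
        return sum(e(W, X[i], Y[i]) for i in range(m)) / m + lam * np.dot(W, W)

    def E_stochastic(W, X, Y, lam, b, rng):
        # Stochastic error: the same quantity over a random batch of b << m examples.
        idx = rng.choice(len(X), size=b, replace=False)
        return sum(e(W, X[i], Y[i]) for i in idx) / b + lam * np.dot(W, W)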

Stochastic Steepest Descent


The technique of stochastic steepest descent runs many iterations indexed by t. For each t there is
a single weights update that uses steepest descent to minimize the stochastic error Et :

Wt = Wt−1 − ϵ ∇Et(Wt−1),

where ϵ > 0 is the learning rate (step size). This can be written as the following three-step process that computes Wt from Wt−1:

1. Dt = ∇Et (Wt−1 )

2. Ct = −ϵDt

3. Wt = Wt−1 + Ct

where Ct is the change of the weights vector at time t.
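
A minimal sketch of one such step, continuing the linear-model and squared-error assumptions of the earlier sketch (the analytic gradient used below is specific to that assumed model):

    import numpy as np

    def sgd_step(W, X_batch, Y_batch, lam, eps):
        # One stochastic steepest-descent step for the assumed model f(x, W) = W . x
        # with e(W, x, y) = (y - W.x)^2 and r(W) = |W|^2.
        b = len(X_batch)
        D = np.zeros_like(W)
        for x, y in zip(X_batch, Y_batch):
            D += -2.0 * (y - W @ x) * x        # gradient of (y - W.x)^2 with respect to W
        D = D / b + 2.0 * lam * W              # step 1: D_t = grad E_t(W_{t-1})
        C = -eps * D                           # step 2: change of the weights vector
        return W + C                           # step 3: W_t = W_{t-1} + C_t

    # Usage: draw a random batch of b << m examples at each iteration t.
    rng = np.random.default_rng(0)
    X, Y = rng.normal(size=(100, 3)), rng.normal(size=100)
    W = np.zeros(3)
    for t in range(1000):
        idx = rng.choice(len(X), size=10, replace=False)
        W = sgd_step(W, X[idx], Y[idx], lam=0.01, eps=0.1)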


Old Momentum
The intuitive idea behind momentum is that the vector Ct should not be changing too rapidly. This
can be expressed as Ct ≈ Ct−1 . As we see next, an intuitive implementation of this idea creates a
mechanism that automatically adjusts the value of ϵ. Here is the momentum idea:

0. C0 = 0

1. for t = 1, 2, . . .

1.1. Dt = ∇Et (Wt−1 )


1.2. Ct = αCt−1 − ϵDt
1.3. Wt = Wt−1 + Ct

where 0 ≤ α < 1. To understand what is going on here, consider a single weight wt from the vector Wt, and the corresponding values ct from Ct and dt from Dt.

c0 = 0.
c1 = −ϵd1
c2 = αc1 − ϵd2 = −ϵ(αd1 + d2)
c3 = αc2 − ϵd3 = −ϵ(α^2 d1 + αd2 + d3)
ck = −ϵ(α^{k−1} d1 + α^{k−2} d2 + . . . + dk)

Observe that if the dt are mostly positive or mostly negative, the contributions to ck accumulate. For the case where they are all identical, dt = d, the geometric sum 1 + α + α^2 + . . . ≈ 1/(1 − α) gives:

ck ≈ −ϵ · d/(1 − α),    wk ≈ wk−1 − ϵ · d/(1 − α)

Thus, momentum behaves as an increase of ϵ by a factor of 1/(1 − α). If the gradient sign keeps changing (from + to −), the terms cancel and there is no effective increase in ϵ.
Typical values: ϵ = 0.1, α = 0.9.
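
A minimal sketch of this momentum loop; here grad_Et is a placeholder for whatever routine computes the stochastic gradient ∇Et(Wt−1) at time t:

    import numpy as np

    def momentum_descent(grad_Et, W0, eps=0.1, alpha=0.9, num_steps=1000):
        # grad_Et(t, W) should return D_t = grad E_t(W) for the batch drawn at time t.
        W = W0.copy()
        C = np.zeros_like(W0)            # step 0: C_0 = 0
        for t in range(1, num_steps + 1):
            D = grad_Et(t, W)            # step 1.1: D_t = grad E_t(W_{t-1})
            C = alpha * C - eps * D      # step 1.2: the change varies slowly, C_t ~ C_{t-1}
            W = W + C                    # step 1.3: W_t = W_{t-1} + C_t
        return W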

New Momentum technique: ADAM


The idea: similar to old momentum. Increase the value of ϵ if the gradient values have the same sign, and decrease it if the gradient values keep changing signs. How can this be determined? The square of a gradient value always has the same sign. Compute the average gradient: call it D̂. Compute the average square of the gradient: call it R̂. The size comparison should be between D̂ and √R̂.

D̂ large, √R̂ large: accelerate
D̂ large, √R̂ small: impossible
D̂ small, √R̂ large: decelerate
D̂ small, √R̂ small: accelerate

Conclusion: accelerate proportional to D̂/√R̂.
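
A tiny numeric illustration of this ratio (the gradient sequences below are made up for the example, and bias correction is ignored): gradients of a consistent sign give a ratio near ±1, while sign-alternating gradients of the same size give D̂ ≈ 0 and a ratio near 0.

    import numpy as np

    for name, grads in [("same sign", [0.5, 0.4, 0.6, 0.5]),
                        ("alternating", [0.5, -0.5, 0.5, -0.5])]:
        D_hat = np.mean(grads)                    # average gradient
        R_hat = np.mean(np.square(grads))         # average squared gradient
        print(name, D_hat / np.sqrt(R_hat))       # about 0.99 vs. 0.0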
Computing running averages
Here we discuss how to compute approximate running averages. These are averages over previously
encountered values, averaged so that the most recent values are dominant.
Consider computing the running average of Xt. Denote this running average by X̄t. The idea is that X̄t can be calculated as a weighted sum of X̄t−1 and Xt. The update rule is:

X̄t = βX̄t−1 + (1 − β)Xt

This is a difference equation that can be easily solved.

X̄0 = 0

X̄1 = βX̄0 + (1 − β)X1 = (1 − β)X1

X̄2 = βX̄1 + (1 − β)X2 = β(1 − β)X1 + (1 − β)X2

X̄3 = βX̄2 + (1 − β)X3 = β^2(1 − β)X1 + β(1 − β)X2 + (1 − β)X3

X̄4 = βX̄3 + (1 − β)X4 = β^3(1 − β)X1 + β^2(1 − β)X2 + β(1 − β)X3 + (1 − β)X4

X̄t = β^{t−1}(1 − β)X1 + . . . + β^{t−i}(1 − β)Xi + . . . + (1 − β)Xt

X̄t = (1 − β) Σ_{i=1}^{t} β^{t−i} Xi

If Xi = X, a constant, then: X̄t = X(1 − β)(1 − β^t)/(1 − β) = (1 − β^t)X. But we want this value to be X, and this can be achieved if the value is divided by (1 − β^t). This suggests the following algorithm for the running average:

X̄0 = 0,    X̄t = βX̄t−1 + (1 − β)Xt,    X̂t = X̄t / (1 − β^t)
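
A minimal sketch of this bias-corrected running average (the function name is mine):

    def running_average(values, beta=0.9):
        # Returns the bias-corrected running averages X_hat_1, ..., X_hat_t.
        x_bar = 0.0
        result = []
        for t, x in enumerate(values, start=1):
            x_bar = beta * x_bar + (1 - beta) * x    # X_bar_t = beta * X_bar_{t-1} + (1 - beta) * X_t
            result.append(x_bar / (1 - beta ** t))   # divide by (1 - beta^t) to undo the bias toward 0
        return result

    # For a constant sequence the corrected average recovers the constant:
    print(running_average([3.0, 3.0, 3.0, 3.0]))     # approximately [3.0, 3.0, 3.0, 3.0]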

The ADAM algorithm


0. D̄0 = 0, R̄0 = 0, W0 is the initial weights vector.

1. for t = 1, 2, . . .

1.1. Dt = ∇Et(Wt−1).

1.2. D̄t = β1 D̄t−1 + (1 − β1)Dt.

1.3. R̄t = β2 R̄t−1 + (1 − β2)Dt^2. (The vector Dt^2 is created by squaring each coordinate of Dt.)

1.4. D̂t = D̄t / (1 − β1^t)

1.5. R̂t = R̄t / (1 − β2^t)

1.6. Wt = Wt−1 − α D̂t / √R̂t (Both the square root and the division are applied separately to each coordinate.)
To guard against division by a very small number the last step is typically implemented as:

Wt = Wt−1 − α D̂t / (√R̂t + ϵ)
Typical values:

0 < α, e.g., 0.001


0 ≤ β1 < 1, e.g., 0.9
0 ≤ β2 < 1, e.g., 0.999
0 < ϵ, e.g., 10^{−8}
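
A minimal NumPy sketch of the algorithm above; grad_Et is a placeholder for whatever routine computes the stochastic gradient ∇Et(Wt−1), and the default hyperparameters follow the typical values just listed.

    import numpy as np

    def adam(grad_Et, W0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, num_steps=1000):
        # grad_Et(t, W) should return D_t = grad E_t(W) for the batch drawn at time t.
        W = W0.copy()
        D_bar = np.zeros_like(W0)    # running average of the gradient, D_bar_0 = 0
        R_bar = np.zeros_like(W0)    # running average of the squared gradient, R_bar_0 = 0
        for t in range(1, num_steps + 1):
            D = grad_Et(t, W)                                # 1.1 stochastic gradient
            D_bar = beta1 * D_bar + (1 - beta1) * D          # 1.2 average gradient
            R_bar = beta2 * R_bar + (1 - beta2) * D ** 2     # 1.3 average squared gradient
            D_hat = D_bar / (1 - beta1 ** t)                 # 1.4 bias correction
            R_hat = R_bar / (1 - beta2 ** t)                 # 1.5 bias correction
            W = W - alpha * D_hat / (np.sqrt(R_hat) + eps)   # 1.6 guarded update
        return W

    # Example: minimize E(W) = |W - 1|^2; the full gradient 2(W - 1) stands in for a batch gradient.
    W = adam(lambda t, W: 2.0 * (W - 1.0), W0=np.zeros(3), num_steps=5000)
    print(W)    # close to [1, 1, 1]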
