
Does Adam Converge and When?
Introduction
In this presentation, we discuss the (non-)convergence behavior of Adam.

We take up two papers, and briefly review their results.

● On the Convergence of Adam and Beyond:
○ Examines adaptive methods built on exponential moving averages, typically run with β1 = 0.9 and β2 = 0.999 and further fine-tuning of the parameters.
○ Proved that Adam (including RMSProp) does not converge for a large set of hyperparameters.
● RMSprop Converges with Proper Hyperparameter:
○ Shows that RMSProp performs well only beyond a critical threshold, call it β3, i.e., 1 > β2 ≥ β3, with β1 either zero or very small.
○ Proved that RMSProp converges for large enough β2.
Background

Adaptive gradient methods are dominant algorithms for training neural networks. One early member of the class is Adagrad, which scales the gradient updates by square roots of sums of squared past gradients. Since then, this key idea has given birth to many variants of adaptive gradient methods such as RMSProp, Adadelta, Adam, Nadam, and Adabound.

Among all these variants, Adam is one of the most popular methods. It is used in almost every ML domain, such as NLP, GANs, and RL.

Despite Adam's widespread use, one of the papers listed above suggested that Adam could be non-convergent, raising a warning for many Ada-class algorithms. These findings raise an intriguing question: if several Ada-class algorithms, including Adam, fail to converge even for a basic convex function, how come they do so well on harder practical tasks such as deep neural net training? Is this because real-world problems are more likely to be "pleasant," or because the analysis in these studies does not match how Adam is used in practice?
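To make the exponential-moving-average idea concrete, here is a minimal sketch of an Adam-style update on a single scalar parameter. It is an illustrative reimplementation, not the pseudocode of either paper: the function name adam_step is ours, and the defaults simply mirror the commonly used values β1 = 0.9 and β2 = 0.999. Setting β1 = 0 (and dropping the bias correction) essentially recovers RMSProp.

```python
import math

def adam_step(x, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on a scalar parameter x, given gradient g at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * g          # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * g * g      # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    x = x - lr * m_hat / (math.sqrt(v_hat) + eps)
    return x, m, v
```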
To answer these questions, we need to revisit the counterexample presented by S. J. Reddi et al. One of their counterexamples is as follows:

fk(x) = Cx,  for k mod 3 = 1
fk(x) = −x,  otherwise

where x ∈ [−1, 1] and C > 2. For this convex problem, they proved that Adam does not converge to the optimal solution when

β2 ≤ min{ C^(−4/(C−2)), 1 − (9/(2C))^2 },

where β2 is the second-order momentum coefficient of Adam.

This result demonstrates that "Adam with small β2 may diverge," further implying that "large β2 is recommended in practice." Indeed, Adam's PyTorch default value for β2 is 0.999, which is fairly large, and β2 ≥ 0.95 is used in virtually all NLP, GAN, and RL experiments. Based on all of these observations, Adam with a large β2 may be able to converge.
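The drift in this counterexample can be reproduced numerically. The toy script below runs RMSProp (Adam with β1 = 0) on fk with C = 10, a constant step size, and projection back onto [−1, 1]. It is a simplified sketch rather than the paper's exact online-learning setting (which uses a decaying step size), the function names are ours, and the specific β2 values are only illustrative.

```python
import math

C = 10.0  # counterexample constant; must satisfy C > 2

def grad(k):
    # Gradient of f_k(x): C when k mod 3 == 1, otherwise -1
    return C if k % 3 == 1 else -1.0

def run_rmsprop(beta2, steps=30_000, lr=0.01, eps=1e-8):
    x, v = 0.0, 0.0
    for k in range(1, steps + 1):
        g = grad(k)
        v = beta2 * v + (1 - beta2) * g * g      # EMA of squared gradients
        x -= lr * g / (math.sqrt(v) + eps)
        x = max(-1.0, min(1.0, x))               # project back onto [-1, 1]
    return x

# The optimum is x = -1, since the average gradient (C - 2) / 3 is positive.
print("small beta2:", run_rmsprop(beta2=0.01))   # drifts to +1: non-convergence
print("large beta2:", run_rmsprop(beta2=0.999))  # stays near the optimum -1
```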
Now the question arises:

Does Adam provably converge with large β2?


According to one of the papers, Adam (including RMSProp) does not converge for a large set of hyperparameters, while the other demonstrated that RMSProp converges for large enough β2.

The paper on RMSProp proves that large-β2 RMSProp converges without any bounded-gradient assumption. Removing this assumption matters for two reasons. First, in Adam's actual applications, such as deep neural net training, the bounded-gradient condition does not hold. Second, under a bounded-gradient assumption the gradients cannot diverge, even though there is a counterexample showing that the gradient can diverge on certain problems.

At first sight, however, the two results seem to clash. For every β2 ∈ [0, 1) with β1 = 0, Reddi et al. showed that there exists a convex problem for which RMSProp does not converge to the optimum. As a result, (β1, β2) = (0, 0.99) or (0, 0.99999) are hyperparameter combinations that can cause divergence; in fact, no matter how close β2 is to 1, (β1, β2) = (0, β2) can still induce divergence.

So why do Shi et al. assert that "a large enough β2 causes RMSProp to converge"?
The key is the difference between picking β2 before or after picking the problem instance. Shi et al. show that RMSProp converges if β2 is chosen after the problem is specified (therefore β2 may be problem-dependent). This does not conflict with the counterexample of the other paper, which chooses β2 before seeing the problem.

β2 is not the first problem-dependent hyperparameter we have encountered; the step size is a considerably more well-known example.

With the above discussion, we highlight two messages on the choice of β2:
● β2 shall be large enough to ensure convergence;
● the minimal convergence-ensuring β2 is a problem-dependent hyperparameter, rather than a universal one.
Gaps on the Convergence of Adam left by Shi et al.

● Lack of a useful message on β1: Does Adam's practical usage of β1 match the theory's requirement on β1? If not, how large is the gap? This disparity turns out to be non-negligible from a variety of perspectives. There is a significant difference between the lower bound on β2 and the upper bound on β1: the former conveys the idea that β2 should be large enough to ensure good performance, which is consistent with experiments, whereas the latter does not appear to convey any meaningful information.

● Gap with practice: One of the papers requires β1 to be as small as 10^(-7). This is rarely used in practice; 0 or 0.1 is often used instead of such a tiny value, and PyTorch adopts 0.9 as its default.

● Theoretical gap: Convergence can be achieved by switching the order of problem and hyperparameter selection. This switching-order argument, however, only applies to β2 and does not always carry over to β1. It is unclear whether Adam converges for larger β1 when β2 is problem-dependent.
Importance of large β1

Many practitioners are currently stuck with the default setup, with little understanding of how to fine-tune β1 and β2.

Shi et al. offer a straightforward approach to tuning β2 when β1 = 0: start at β2 = 0.8 and tune β2 upward until the best performance is achieved, as sketched below.
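As one concrete reading of this recipe, the sketch below walks β2 upward from 0.8 with β1 fixed at 0 and keeps the best-performing value. Both tune_beta2 and the train_and_evaluate helper are hypothetical: the latter stands in for whatever task-specific training-plus-validation routine you already have, and the candidate grid is only illustrative.

```python
def tune_beta2(train_and_evaluate, candidates=(0.8, 0.9, 0.99, 0.999, 0.9999)):
    """Walk beta2 up from 0.8 (with beta1 = 0) and keep the best validation score."""
    best_beta2, best_score = None, float("-inf")
    for beta2 in candidates:                          # tune beta2 upward
        score = train_and_evaluate(beta1=0.0, beta2=beta2)
        if score > best_score:
            best_beta2, best_score = beta2, score
    return best_beta2, best_score

# Usage (with your own routine):
# best_beta2, best_score = tune_beta2(my_train_and_evaluate)
```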
Nonetheless, there is little guidance on tuning β1. How should you tweak the hyperparameter to make Adam work if it does not handle a task well at the default value of β1 = 0.9? Should you turn it up or down? Is it okay to tune β1 and β2 at the same time? This makes determining the right tuning of β1 (together with β2) difficult.

How difficult could it be to incorporate large β1 into convergence analysis? Momentum carries a significant amount of historical gradient information, which substantially affects the iterate's trajectory. According to the existing proof technique, the resulting error can only be controlled if β1 is close enough to 0. To incorporate large β1 into convergence analysis, one must approach momentum from a new angle.
Conclusion

We briefly examined the results of these two publications in our presentation. Their findings represent significant advances in our understanding of Adam. Meanwhile, they raise many new questions that have yet to be addressed. Compared with its practical success, the current theoretical understanding of Adam still lags behind.
