Does Adam Converge and When?
Introduction
In this presentation, we discuss the (non-)convergence behavior of Adam.

For each β2 ∈ [0,1) with β1 = 0, Reddi et al. showed that there exists a convex problem on which RMSProp does not converge to the optimum. As a result, (β1, β2) = (0, 0.99) and (0, 0.99999) are hyperparameter combinations that can cause divergence; in fact, no matter how close β2 is to 1, (β1, β2) = (0, β2) can still induce divergence.

So why do Shi et al. assert that "a large enough β2 causes RMSProp to converge"?

Notably, Shi et al.'s analysis does not require the bounded gradient assumption. This matters for two reasons. First, in Adam's actual applications, such as deep neural network training, the bounded gradient condition does not hold. Second, the divergence counter-example shows that the gradient can blow up on specific problems, which could not happen under a bounded gradient assumption.
The key is the difference between picking β2 before or after picking the problem instance. Shi et al. show that RMSProp converges if β2 is chosen after the problem is specified (so β2 may be problem-dependent). This does not conflict with the counterexample of Reddi et al., which chooses β2 before seeing the problem.

β2 is not the first problem-dependent hyperparameter we have encountered; the step size is a considerably better-known example.

With the above discussion, we highlight two messages on the choice of β2:
● β2 shall be large enough to ensure convergence;
● the minimal convergence-ensuring β2 is a problem-dependent hyperparameter, rather than a universal one.
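Both messages can be seen in a stylized simulation in the spirit of Reddi et al.'s construction (the spike size C, step size, and period below are illustrative choices, not the papers' exact values): a periodic gradient that is a large +C once every three steps and -1 otherwise, so the average loss on [-1, 1] is minimized at x = -1.

```python
import math

def rmsprop(beta2, C=10.0, lr=0.01, steps=3000, eps=1e-8):
    """RMSProp (beta1 = 0) on a periodic problem over x in [-1, 1]:
    the gradient is +C once every 3 steps and -1 otherwise.
    For C > 2 the average gradient is positive, so the average
    loss is minimized at x = -1."""
    x, v = 0.0, 0.0
    for t in range(steps):
        g = C if t % 3 == 0 else -1.0
        v = beta2 * v + (1 - beta2) * g * g      # second-moment average
        x -= lr * g / (math.sqrt(v) + eps)       # RMSProp update
        x = max(-1.0, min(1.0, x))               # project back onto [-1, 1]
    return x

print(rmsprop(0.0))   # small beta2: the rare large gradient is over-normalized,
                      # and x drifts to +1 (the wrong end)
print(rmsprop(0.95))  # large beta2: x converges to -1, the minimizer
```

With a small β2, each step is normalized by the current gradient's own magnitude, so the rare +C spike contributes no more than the frequent -1 gradients and the iterate drifts the wrong way; a sufficiently large β2 lets the spike dominate the second-moment estimate, restoring convergence on this instance.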
Gaps in the Convergence Theory of Adam Left by Shi et al.
Does Adam's practical usage of β1 match the theory's requirement on β1? If not, how large is the gap? This disparity turns out to be non-negligible from a variety of perspectives.

● Lack of a useful message on β1: There is a significant difference between the lower bound on β2 and the upper bound on β1. The former conveys the idea that β2 should be large enough to ensure convergence, which is consistent with experiments, whereas the latter does not appear to convey any meaningful information.

● Gap with practice: One of the papers requires β1 to be as small as 10^(-7). Such a tiny value is rarely used in practice; 0 or 0.1 is sometimes used instead, and PyTorch adopts 0.9 in its default setting.

● Theoretical gap: Convergence can be achieved by switching the order of problem and hyperparameter selection. This switching-order argument, however, only applies to β2 and does not always apply to β1. It remains unclear whether Adam converges for larger β1 when β2 is problem-dependent.
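For concreteness, here is a minimal scalar sketch of the standard Adam update, showing where β1 (gradient momentum) and β2 (squared-gradient averaging) enter; the defaults mirror PyTorch's (0.9, 0.999):

```python
import math

def adam_step(x, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step on a scalar parameter (sketch; t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g        # first moment: momentum over gradients
    v = beta2 * v + (1 - beta2) * g * g    # second moment: squared-gradient history
    m_hat = m / (1 - beta1 ** t)           # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    return x - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Example: minimizing f(x) = x^2 (gradient 2x) from x = 1.
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * x, m, v, t)
print(x)  # ends near the minimizer 0
```

The first-moment line is exactly where the bounds discussed above bite: the upper bound on β1 restricts how much gradient history `m` may carry, while the lower bound on β2 governs how stable the denominator is.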
Importance of large β1
Many practitioners are stuck with the default setup, with little understanding of how to fine-tune β1 and β2.

Shi et al. offer a straightforward approach to tuning β2 when β1 = 0: start at β2 = 0.8 and tune β2 upward until the best performance is achieved.

Nonetheless, there is not much guidance on tuning β1. If Adam does not handle a task well at the default value of β1 = 0.9, how should you tweak this hyperparameter: up, down, or both? Is it okay to tune β1 and β2 at the same time? This makes determining the right tuning of β1 (together with tuning β2) difficult.

How difficult could it be to incorporate large β1 into convergence analysis? Momentum carries a significant amount of historical gradient information, which substantially affects the iterate's trajectory. Under the existing proof technique, this effect can only be controlled when β1 is close enough to 0. To incorporate large β1 into convergence analysis, one must approach momentum from a new angle.
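The "start at β2 = 0.8 and tune upward" recipe can be sketched as a simple sweep. Here `evaluate` is a hypothetical placeholder for training a model with Adam(betas=(0.0, beta2)) and returning a validation score; the step size and upper cap are illustrative choices, not part of the recipe itself:

```python
def tune_beta2(evaluate, start=0.8, step=0.05, max_beta2=0.9999):
    """Sweep beta2 upward from `start` and keep the best-scoring value.
    `evaluate(beta2)` is a user-supplied (hypothetical) training run
    that returns a validation score (higher is better)."""
    best_beta2, best_score = start, evaluate(start)
    beta2 = start + step
    while beta2 < max_beta2:
        score = evaluate(beta2)
        if score > best_score:
            best_beta2, best_score = beta2, score
        beta2 += step  # a multiplicative schedule toward 1 would also work
    return best_beta2, best_score
```

In practice each `evaluate` call is a full training run, so a coarse grid followed by refinement around the best value keeps the cost manageable.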
Conclusion
In this presentation, we briefly examined the results of these two publications. Their findings represent significant advances in our understanding of Adam. At the same time, they raise many new questions that have yet to be addressed. Compared with its practical success, the current theoretical understanding of Adam still lags behind.