
A High Probability Analysis of Adaptive SGD with Momentum

Xiaoyu Li^1, Francesco Orabona^2

Abstract

Stochastic Gradient Descent (SGD) and its variants are the most used algorithms in machine learning applications. In particular, SGD with adaptive learning rates and momentum is the industry standard to train deep networks. Despite the enormous success of these methods, our theoretical understanding of these variants in the nonconvex setting is not complete, with most of the results only proving convergence in expectation and with strong assumptions on the stochastic gradients. In this paper, we present a high probability analysis for adaptive and momentum algorithms, under weak assumptions on the function, stochastic gradients, and learning rates. We use it to prove for the first time the convergence of the gradients to zero in high probability in the smooth nonconvex setting for Delayed AdaGrad with momentum.

1. Introduction

Despite the incredible popularity of stochastic gradient methods in practical machine learning applications, our theoretical understanding of these methods is still not complete. In particular, adaptive learning rate methods like AdaGrad (Duchi et al., 2011) have been mainly studied in the convex domain, with few analyses in the non-convex domain (Li & Orabona, 2019; Ward et al., 2019). However, even in these latter analyses, the assumptions used are very strong and/or the results limited.

In particular, there are two main problems with the previous analyses of Stochastic Gradient Descent (SGD) and its variants in the nonconvex setting. First, the classic analysis of convergence for SGD in the nonconvex setting is an analysis in expectation. However, expectation bounds do not rule out extremely bad outcomes. As pointed out by Harvey et al. (2019a), it is a misconception that for algorithms with expectation bounds it is enough to pick the best of several independent runs to obtain a high probability guarantee: this can actually be a computationally inefficient procedure. Moreover, in practical applications like deep learning, it is often the case that only one run of the algorithm is used, since the training process may take a long time. Hence, it is essential to obtain high probability bounds, which guarantee the performance of the algorithm on single runs.

Another very common assumption used in most of the previous papers is that of bounded stochastic gradients. This is a rather strong assumption, and it is false even in the deterministic optimization of a convex quadratic function, e.g., $f(x) = x^2$.

In this work, we overcome both these problems. We prove high probability convergence rates only assuming that the noise on the gradients is well-behaved, i.e., subgaussian. In this way, we allow for unbounded gradients and unbounded noise. The weak assumptions, the nonconvex analysis, and the adaptive learning rates make our results particularly challenging to obtain. Indeed, high probability bounds for bounded stochastic gradients are almost trivial to obtain, but of limited applicability. Overall, we believe this paper is the first one to prove such guarantees.

Contributions. In this short paper, we present a high probability analysis of SGD with momentum and adaptive learning rates, with weak assumptions on the function and stochastic gradients. First, in Theorem 1 we prove high probability bounds for the gradients of classic momentum SGD with step size $O(\frac{1}{\sqrt{t}})$ in the nonconvex setting. Then, in Theorem 2 we prove for the first time high probability convergence rates for the gradients of AdaGrad with momentum in the nonconvex setting. In particular, we also show that the high probability bounds are adaptive to the level of noise.

---
^1 Division of System Engineering, Boston University, MA, USA. ^2 Electrical & Computer Engineering, Boston University, MA, USA. Correspondence to: Xiaoyu Li <xiaoyuli@bu.edu>.

Workshop on "Beyond first-order methods in ML systems" at the 37th International Conference on Machine Learning, Vienna, Austria, 2020. Copyright 2020 by the author(s).

2. Related Work

Stochastic momentum methods. Sutskever et al. (2013) discussed the importance of classic momentum methods in deep learning, where they are nowadays widely used in the training of neural networks. On the convergence of stochastic momentum methods, Yang et al. (2016) studied a unified momentum method and provided expectation bounds in
the rate of $O(\frac{1}{\sqrt{T}})$ in both the convex and nonconvex settings. However, the results hold only for Lipschitz functions. Gadat et al. (2018) provided an in-depth description of the stochastic heavy-ball method and, for non-convex functions, showed some almost sure convergence results. Loizou & Richtárik (2017) provided a general analysis for the momentum variants of several classes of stochastic optimization algorithms and proved linear rates of convergence for quadratic and smooth functions. To the best of our knowledge, there are no high probability bounds for nonconvex stochastic momentum methods that do not use strong assumptions.

Nonconvex convergence of adaptive methods. In recent years, a variety of adaptive SGD algorithms have been developed to automatically tune the step size using the past stochastic gradients. The first adaptive algorithm was AdaGrad (Duchi et al., 2011), designed to adapt to sparse gradients. Li & Orabona (2019) and Ward et al. (2019) showed the convergence of variants of AdaGrad with a rate of $O(\ln T/\sqrt{T})$ in the non-convex case. Moreover, Li & Orabona (2019) showed that AdaGrad with non-coordinate-wise learning rates is adaptive to the level of noise. Zou et al. (2019) studied AdaGrad with a unified momentum, and Chen et al. (2019) considered a large family of Adam-like algorithms (Kingma & Ba, 2015), including AdaGrad with momentum. Yet, all of these works only prove bounds in expectation, and most of them use the very strong assumption of bounded stochastic gradients.

High probability bounds. Results on high probability bounds are relatively rare compared to those in expectation, which are easier to obtain. Kakade & Tewari (2009) used Freedman's inequality to prove high probability bounds for an algorithm solving the SVM objective function. For classic SGD, Harvey et al. (2019b) and Harvey et al. (2019a) used a generalized Freedman's inequality to prove bounds in the non-smooth and strongly convex case, while Jain et al. (2019) proved the optimal bound for the last iterate of SGD with high probability. As far as we know, there are currently no high probability bounds for adaptive methods in the nonconvex setting.

3. Problem Set-Up

Notation. We denote vectors and matrices by bold letters. The coordinate $j$ of a vector $x$ is denoted by $x_j$, and by $\nabla f(x)_j$ for the gradient $\nabla f(x)$. To keep the notation concise, all the standard operations $xy$, $x/y$, $x^2$, $1/x$, $x^{1/2}$, and $\max(x, y)$ on vectors $x, y$ are supposed to be element-wise. We denote by $E[\cdot]$ the expectation with respect to the underlying probability space and by $E_t[\cdot]$ the conditional expectation with respect to the past randomness. We use $L_2$ norms.

Assumptions. In this paper we focus on the optimization problem
$$\min_{x \in \mathbb{R}^d} f(x),$$
where $f$ is bounded from below and we denote its infimum by $f^\star$. We do not assume the function to be convex. We consider stochastic optimization algorithms that have access to a noisy estimate of the gradient of $f$. This covers the ubiquitous SGD (Robbins & Monro, 1951), as well as modern variants like AdaGrad. We are interested in studying the convergence of the gradients to zero, because without additional assumptions this is the only thing we can study in the nonconvex setting.

We make the following assumption on the objective function $f(x)$:

(A) $f$ is $M$-smooth, that is, $f$ is differentiable and its gradient is $M$-Lipschitz, i.e., $\|\nabla f(x) - \nabla f(y)\| \le M\|x - y\|$, $\forall x, y \in \mathbb{R}^d$.

Note that (A), for all $x, y \in \mathbb{R}^d$, implies (Nesterov, 2004, Lemma 1.2.3)
$$|f(y) - f(x) - \langle \nabla f(x), y - x \rangle| \le \frac{M}{2}\|y - x\|^2.$$

It is easy to see that this assumption is necessary for the convergence of the gradients to zero. Indeed, without smoothness the norm of the gradients does not go to zero even in the convex case: consider, e.g., the function $f(x) = |x - 1|$.

We assume that we have access to a stochastic first-order black-box oracle that returns a noisy estimate of the gradient of $f$ at any point $x \in \mathbb{R}^d$. That is, we will use the following assumption:

(B1) We receive a vector $g(x, \xi)$ such that $E_\xi[g(x, \xi)] = \nabla f(x)$ for any $x \in \mathbb{R}^d$.

We will also make the following assumption on the variance of the noise.

(B2) (Sub-Gaussian Noise) The stochastic gradient satisfies $E_\xi\left[\exp\left(\|\nabla f(x) - g(x, \xi)\|^2/\sigma^2\right)\right] \le \exp(1)$, $\forall x$.

The condition (B2) has been used by Nemirovski et al. (2009) and Harvey et al. (2019a) to prove high probability convergence guarantees. Intuitively, it implies that the tails of the noise distribution are dominated by the tails of a Gaussian distribution. Note that, by Jensen's inequality, this condition implies a bounded variance of the stochastic gradients.
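Condition (B2) can be made concrete with a small Monte Carlo sketch. The numbers below are illustrative assumptions, not values from the paper: for $d$-dimensional Gaussian noise $N(0, s^2 I)$, the quantity $E[\exp(\|\epsilon\|^2/\sigma^2)]$ has the closed form $(1 - 2s^2/\sigma^2)^{-d/2}$, which stays below $e$ whenever $s$ is small enough relative to $\sigma$.

```python
import math
import random

# Monte Carlo sanity check of assumption (B2) with assumed toy numbers:
# for eps ~ N(0, s^2 I) in R^d, E[exp(||eps||^2 / sigma^2)] equals
# (1 - 2 s^2 / sigma^2)^(-d/2), finite whenever 2 s^2 < sigma^2.
random.seed(0)
d, s, sigma = 4, 0.3, 2.0
n = 200_000
acc = 0.0
for _ in range(n):
    eps = [random.gauss(0.0, s) for _ in range(d)]
    acc += math.exp(sum(e * e for e in eps) / sigma**2)
estimate = acc / n
closed_form = (1.0 - 2.0 * s**2 / sigma**2) ** (-d / 2)
print(estimate, closed_form)  # both well below e, so (B2) holds here
assert estimate <= math.e and closed_form <= math.e
```

Noise with heavier-than-Gaussian tails (e.g., Student's t) would make this expectation infinite, which is exactly what (B2) rules out.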
4. A General Analysis for Algorithms with Momentum

In this section, we consider a generic stochastic optimization algorithm with Polyak's momentum (Polyak, 1964; Qian, 1999; Sutskever et al., 2013), also known as the Heavy-ball algorithm or classic momentum, see Algorithm 1.

Algorithm 1 Algorithms with Momentum
1: Input: $m_0 = 0$, $\{\eta_t\}_{t=1}^T$, $0 < \mu \le 1$, $x_1 \in \mathbb{R}^d$
2: for $t = 1, \dots, T$ do
3:   Get stochastic gradient $g_t = g(x_t, \xi_t)$
4:   $m_t = \mu m_{t-1} + \eta_t g_t$
5:   $x_{t+1} = x_t - m_t$
6: end for

Two forms of momentum, but not equivalent. First, we want to point out that two forms of the Heavy-ball algorithm are possible. The first one is in Algorithm 1, while the second one is
$$m_t = \mu m_{t-1} + g_t, \qquad x_{t+1} = x_t - \eta_t m_t. \tag{1}$$
This second form is used in many practical implementations, see, for example, PyTorch (Paszke et al., 2019). It would seem that there is no reason to prefer one over the other. However, here we argue that the classic form of momentum is the right one if we want to use adaptive learning rates. To see why, let's unroll the updates in both cases. Using the update in Algorithm 1, we have
$$x_{t+1} = x_t - \eta_t g_t - \mu \eta_{t-1} g_{t-1} - \mu^2 \eta_{t-2} g_{t-2} - \cdots,$$
while using the update in (1), we have
$$x_{t+1} = x_t - \eta_t g_t - \mu \eta_t g_{t-1} - \mu^2 \eta_t g_{t-2} - \cdots.$$
In words, in the first case the update is composed of a sum of weighted gradients, each one multiplied by a learning rate decided in the past. On the other hand, in the update (1) the update is composed of a sum of weighted gradients, each one multiplied by the current learning rate. From the analysis point of view, the second update destroys the independence between the past and the future, introducing a dependency that breaks our analysis, unless we introduce very strict conditions on the gradients. On the other hand, the update in Algorithm 1 allows us to carry out the analysis because each learning rate was chosen only with the knowledge of the past. Note that this is a known problem in adaptive algorithms: the lack of independence between past and present is exactly the reason why Adam fails to converge on simple 1d convex problems, see, for example, the discussion in Savarese et al. (2019).

It is interesting to note that these two types of momentum updates are usually considered equivalent. This is indeed true only if the learning rates are not adaptive.

Assumptions on learning rates. Note that in the pseudo-code we do not specify the learning rates $\eta_t \in \mathbb{R}^d$. In fact, our analysis covers the case of generic learning rates, adaptive ones included. We only need the following assumptions on the stepsizes $\eta_t$:

(C1) $\eta_t$ is non-increasing, i.e., $\eta_{t+1} \le \eta_t$, $\forall t$.

(C2) $\eta_t$ is independent of $\xi_t$.

The first assumption is very common (e.g., Duchi et al., 2011; Reddi et al., 2018; Li & Orabona, 2019; Chen et al., 2019; Zhou et al., 2018). Indeed, AdaGrad has non-increasing step sizes by definition. Also, Reddi et al. (2018) have claimed that the main issue behind the divergence of Adam and RMSProp lies in the positive definiteness of $1/\eta_t - 1/\eta_{t-1}$. The need for the second assumption is technical and shared by similar analyses (Li & Orabona, 2019; Savarese et al., 2019). Indeed, Li & Orabona (2019) showed that delayed step sizes can avoid the possible deviation brought by step sizes that include the current noise.

High probability guarantee. Adaptive learning rates, and in general learning rates that are decided using previous gradients, become stochastic variables. This makes the high probability analysis more complex. Hence, we use a new concentration inequality for martingales in which the variance is treated as a random variable, rather than a deterministic quantity. We use this concentration result in the proof of Lemma 2. Our proof, in the Appendix, merges ideas from the related results in Beygelzimer et al. (2011, Theorem 1) and Lan et al. (2012, Lemma 2). A similar result has also been shown by Jin et al. (2019, Lemma 6).

Lemma 1. Assume that $Z_1, Z_2, \dots, Z_T$ is a martingale difference sequence with respect to $\xi_1, \xi_2, \dots, \xi_T$ and $E_t\left[\exp(Z_t^2/\sigma_t^2)\right] \le \exp(1)$ for all $1 \le t \le T$, where $\sigma_t$ is a sequence of random variables with respect to $\xi_1, \xi_2, \dots, \xi_{t-1}$. Then, for any fixed $\lambda > 0$ and $\delta \in (0,1)$, with probability at least $1-\delta$, we have
$$\sum_{t=1}^T Z_t \le \frac{3}{4}\lambda \sum_{t=1}^T \sigma_t^2 + \frac{1}{\lambda}\ln\frac{1}{\delta}.$$

Main result. We can now present our general lemma, which allows us to analyze SGD with momentum with adaptive learning rates. We will then instantiate it for particular examples.
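The distinction between the two momentum forms can be seen numerically. The sketch below uses illustrative, assumed constants on a 1-d quadratic: with a constant learning rate the two updates trace the same iterates, while with a Delayed-AdaGrad-style step size as in (2) they diverge from each other.

```python
# Sketch contrasting the two momentum updates on f(x) = x^2 / 2, where
# grad f(x) = x.  Constants (mu, alpha, beta, steps) are assumptions for
# illustration, not values dictated by the analysis.

def run(form, eta_fn, x0=1.0, mu=0.9, steps=20):
    x, m, past_sq = x0, 0.0, 0.0
    xs = []
    for t in range(1, steps + 1):
        g = x                       # exact gradient of x^2/2 (no noise)
        eta = eta_fn(t, past_sq)
        if form == "alg1":          # Algorithm 1: m_t = mu m_{t-1} + eta_t g_t
            m = mu * m + eta * g
            x = x - m
        else:                       # form (1): m_t = mu m_{t-1} + g_t
            m = mu * m + g
            x = x - eta * m
        past_sq += g * g            # past squared gradients, as in (2)
        xs.append(x)
    return xs

const = lambda t, s: 0.1
delayed = lambda t, s: 0.5 / (1.0 + s) ** 0.5   # delayed step: alpha=0.5, beta=1

gap_const = max(abs(a - b) for a, b in zip(run("alg1", const), run("form1", const)))
gap_delayed = max(abs(a - b) for a, b in zip(run("alg1", delayed), run("form1", delayed)))
print(gap_const, gap_delayed)  # ~0 with constant eta, clearly nonzero with (2)
```

With constant $\eta$ the two recursions are related by $m_t = \eta m'_t$, so the trajectories coincide; once $\eta_t$ depends on past gradients, the unrolled updates above weight each gradient by different learning rates and the iterates separate.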
Lemma 2. Assume (A, B1, B2, C1, C2). Then, for any $\delta \in (0,1)$, with probability at least $1-\delta$, the iterates of Algorithm 1 satisfy
$$\sum_{t=1}^T \langle \eta_t, \nabla f(x_t)^2 \rangle \le \frac{3\|\eta_1\|\sigma^2(1-\mu^T)^2}{(1-\mu)^2}\ln\frac{1}{\delta} + 2(f(x_1) - f^\star) + \frac{M(3-\mu)}{1-\mu}\sum_{t=1}^T \|\eta_t g_t\|^2.$$

Lemma 2 accomplishes the task of upper bounding the inner product $\sum_{t=1}^T \langle \eta_t \nabla f(x_t), m_t \rangle$. Then, it is easy to lower bound the l.h.s. by $\sum_{t=1}^T \langle \eta_T, \nabla f(x_t)^2 \rangle$ using assumption (C1), followed by an upper bound on $\sum_{t=1}^T \|\eta_t g_t\|^2$ based on the setting of $\eta_t$.

4.1. SGD with Momentum with $\frac{1}{\sqrt{t}}$ Learning Rates

As a warm-up, we now use Lemma 2 to prove a high probability convergence guarantee for the simple case of deterministic learning rates $\eta_{t,i} = \frac{c}{\sqrt{t}}$.

Theorem 1. Let $T$ be the number of iterations of Algorithm 1. Assume (A, B1, B2). Set the step size $\eta_t$ as $\eta_{t,i} = \frac{c}{\sqrt{t}}$, $i = 1, \dots, d$, where $c \le \frac{1-\mu}{4M(3-2\mu)}$. Then, for any $\delta \in (0,1)$, with probability at least $1-\delta$, the iterates of Algorithm 1 satisfy
$$\min_{1 \le t \le T}\|\nabla f(x_t)\|^2 \le \frac{4(f(x_1)-f^\star)}{c\sqrt{T}} + \frac{6(1-\mu^T)^2\sigma^2}{(1-\mu)^2\sqrt{T}}\ln\frac{1}{\delta} + \frac{4(3-\mu)cM\sigma^2\ln\frac{2Te}{\delta}\ln T}{(1-\mu)\sqrt{T}}.$$

The proof is in the Appendix.

4.2. AdaGrad with Momentum

We now prove the convergence rate of a variant of AdaGrad in which we use momentum and learning rates that do not contain the current gradient. That is, the step sizes are defined as $\eta_t = (\eta_{t,j})_{j=1,\dots,d}$,
$$\eta_{t,j} = \frac{\alpha}{\sqrt{\beta + \sum_{i=1}^{t-1} g_{i,j}^2}}, \quad j = 1, \dots, d, \tag{2}$$
where $\alpha, \beta > 0$. Removing the current gradient from the learning rate was proposed in Li & Orabona (2019); Savarese et al. (2019). Following the naming style in Savarese et al. (2019), we denote this variant by Delayed AdaGrad.

Obviously, (2) satisfies (C1) and (C2). Hence, we are able to employ Lemma 2 to analyze this variant. Moreover, for Delayed AdaGrad, we upper bound $\sum_{t=1}^T \|\eta_t g_t\|^2$ with the following lemma, whose proof is in the Appendix.

Lemma 3. Assume (A, B1, B2). Let $\eta_t$ be set as in (2), where $\alpha, \beta > 0$. Then, for any $\delta \in (0,1)$, with probability at least $1-\delta$, we have
$$\sum_{t=1}^T \|\eta_t g_t\|^2 \le \frac{4d\alpha^2\sigma^2}{\beta}\ln\frac{2Te}{\delta} + \frac{4\alpha}{\sqrt{\beta}}\sum_{t=1}^T \langle \eta_t, \nabla f(x_t)^2 \rangle + 2\alpha^2 d\ln\left(\sqrt{\beta + \frac{2T\sigma^2}{d}\ln\frac{2Te}{\delta}} + \sqrt{\frac{2}{d}\sum_{t=1}^T \|\nabla f(x_t)\|^2}\right).$$

We now present the convergence guarantee for Delayed AdaGrad with momentum.

Theorem 2 (Delayed AdaGrad with Momentum). Assume (A, B1, B2). Let $\eta_t$ be set as in (2), where $\alpha, \beta > 0$ and $4\alpha \le \frac{\sqrt{\beta}(1-\mu)^2}{2M(1+\mu)}$. Then, for any $\delta \in (0,1)$, with probability at least $1-\delta$, the iterates of Algorithm 1 satisfy
$$\min_{1 \le t \le T}\|\nabla f(x_t)\|^2 \le \frac{1}{T}\max\left(\frac{4C(T)^2}{\alpha^2}, \frac{C(T)}{\alpha}\sqrt{2\beta + 4T\sigma^2\ln\frac{3Te}{\delta}}\right),$$
where $C(T) = O\left(\frac{1}{\alpha} + \frac{d\left(\alpha + \sigma^2\left(\alpha\ln\frac{T}{\delta} + \frac{1}{1-\mu}\ln\frac{1}{\delta}\right)\right)}{1-\mu}\right)$.

Adaptivity to Noise. Observe that when $\sigma = 0$, the convergence rate recovers the $O(\frac{1}{T})$ rate of Gradient Descent with a constant learning rate. On the other hand, in the noisy case, it matches the $O(\frac{\sigma}{\sqrt{T}})$ rate of SGD with the optimal worst-case learning rate $O(\frac{1}{\sigma\sqrt{t}})$. In other words, with a single learning rate, we recover two different optimal convergence rates that would normally require two different learning rates and the knowledge of $\sigma$. This adaptivity of Delayed AdaGrad was already proved in Li & Orabona (2019), but only in expectation and without a momentum term.

Dependency on $\mu$. Observe that the convergence upper bound increases with $\mu \in (0,1)$ and the optimal upper bound is achieved by taking the momentum parameter $\mu = 0$. In words, the algorithms without momentum have the best theoretical results. This is a known caveat for this kind of analysis, and a similar behavior w.r.t. $\mu$ is present, e.g., in Zou et al. (2018, Theorem 1) for algorithms with Polyak's momentum.

5. Conclusion and Future Work

In this work, we present a high probability analysis of adaptive SGD with Polyak's momentum in the nonconvex setting. Without using the common assumptions of bounded gradients or bounded noise, we give the high probability bound
for SGD with Polyak's momentum with step size $O(\frac{1}{\sqrt{t}})$ and for Delayed AdaGrad with momentum. In particular, to the best of our knowledge, this is the first high probability convergence guarantee for adaptive methods.

In the future, we plan to extend our results to more adaptive methods, such as Adam and AMSGrad (Reddi et al., 2018). Moreover, we will explore other forms of momentum, such as the exponential moving average and Nesterov's momentum (Nesterov, 1983).

Acknowledgements

This material is based upon work supported by the National Science Foundation under grant no. 1925930 "Collaborative Research: TRIPODS Institute for Optimization and Learning".

References

Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Schapire, R. Contextual bandit algorithms with supervised learning guarantees. In Proc. of the International Conference on Artificial Intelligence and Statistics, pp. 19–26, 2011.

Chen, X., Liu, S., Sun, R., and Hong, M. On the convergence of a class of Adam-type algorithms for non-convex optimization. In 7th International Conference on Learning Representations, ICLR 2019, 2019.

Duchi, J. C., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

Gadat, S., Panloup, F., and Saadane, S. Stochastic heavy ball. Electronic Journal of Statistics, 12(1):461–529, 2018.

Harvey, N. J., Liaw, C., Plan, Y., and Randhawa, S. Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory, pp. 1579–1613, 2019a.

Harvey, N. J., Liaw, C., and Randhawa, S. Simple and optimal high-probability bounds for strongly-convex stochastic gradient descent. arXiv preprint arXiv:1909.00843, 2019b.

Jain, P., Nagaraj, D., and Netrapalli, P. Making the last iterate of SGD information theoretically optimal. In Beygelzimer, A. and Hsu, D. (eds.), Proc. of the Conference on Learning Theory (COLT), volume 99 of Proceedings of Machine Learning Research, pp. 1752–1755, Phoenix, USA, 25–28 Jun 2019. PMLR. URL http://proceedings.mlr.press/v99/jain19a.html.

Jin, C., Netrapalli, P., Ge, R., Kakade, S. M., and Jordan, M. I. A short note on concentration inequalities for random vectors with subgaussian norm. arXiv preprint arXiv:1902.03736, 2019.

Kakade, S. M. and Tewari, A. On the generalization ability of online strongly convex programming algorithms. In Advances in Neural Information Processing Systems, pp. 801–808, 2009.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

Lan, G., Nemirovski, A., and Shapiro, A. Validation analysis of mirror descent stochastic approximation method. Mathematical Programming, 134(2):425–458, 2012.

Li, X. and Orabona, F. On the convergence of stochastic gradient descent with adaptive stepsizes. In Proc. of the 22nd International Conference on Artificial Intelligence and Statistics, AISTATS, 2019. URL https://arxiv.org/abs/1805.08114.

Loizou, N. and Richtárik, P. Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. arXiv preprint arXiv:1712.09677, 2017.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

Nesterov, Y. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). In Doklady AN SSSR (translated as Soviet Math. Docl.), volume 269, pp. 543–547, 1983.

Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer, 2004.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.

Polyak, B. T. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

Qian, N. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, 1999.
Reddi, S. J., Kale, S., and Kumar, S. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018. URL https://openreview.net/pdf?id=ryQu7f-RZ.

Robbins, H. and Monro, S. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951. URL https://projecteuclid.org/euclid.aoms/1177729586.

Savarese, P., McAllester, D., Babu, S., and Maire, M. Domain-independent dominance of adaptive methods. arXiv preprint arXiv:1912.01823, 2019.

Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pp. 1139–1147, 2013.

Tieleman, T. and Hinton, G. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

Ward, R., Wu, X., and Bottou, L. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization. In International Conference on Machine Learning, pp. 6677–6686, 2019.

Yang, T., Lin, Q., and Li, Z. Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization. arXiv preprint arXiv:1604.03257, 2016.

Zhou, D., Tang, Y., Yang, Z., Cao, Y., and Gu, Q. On the convergence of adaptive gradient methods for nonconvex optimization, 2018. arXiv preprint arXiv:1808.05671.

Zou, F., Shen, L., Jie, Z., Sun, J., and Liu, W. Weighted AdaGrad with unified momentum. arXiv preprint arXiv:1808.03408, 2018.

Zou, F., Shen, L., Jie, Z., Zhang, W., and Liu, W. A sufficient condition for convergences of Adam and RMSProp. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11127–11135, 2019.
A. Appendix

A.1. Details of Section 4

Proof of Lemma 1. Set $\tilde{Z}_t = Z_t/\sigma_t$. By the assumptions on $Z_t$ and $\sigma_t$, we have
$$E_t[\tilde{Z}_t] = \frac{1}{\sigma_t}E_t[Z_t] = 0 \quad \text{and} \quad E_t\left[\exp\left(\tilde{Z}_t^2\right)\right] \le \exp(1).$$
By Jensen's inequality, it follows that for any $c \in [0,1]$,
$$E_t\left[\exp\left(c\tilde{Z}_t^2\right)\right] = E_t\left[\left(\exp\left(\tilde{Z}_t^2\right)\right)^c\right] \le \left(E_t\left[\exp\left(\tilde{Z}_t^2\right)\right]\right)^c \le \exp(c). \tag{3}$$
Also, it can be verified that $\exp(x) \le x + \exp(9x^2/16)$ for all $x$; hence, for $|\kappa| \in [0, 4/3]$, we get
$$E_t\left[\exp\left(\kappa\tilde{Z}_t\right)\right] \le E_t\left[\exp\left(9\kappa^2\tilde{Z}_t^2/16\right)\right] \le \exp\left(9\kappa^2/16\right) \le \exp\left(3\kappa^2/4\right), \tag{4}$$
where in the second inequality we used (3). Besides, $\kappa x \le 3\kappa^2/8 + 2x^2/3$ holds for any $\kappa$ and $x$. Hence, for $|\kappa| \ge 4/3$, we get
$$E_t\left[\exp\left(\kappa\tilde{Z}_t\right)\right] \le \exp\left(3\kappa^2/8\right)E_t\left[\exp\left(2\tilde{Z}_t^2/3\right)\right] \le \exp\left(3\kappa^2/8 + 2/3\right) \le \exp\left(3\kappa^2/4\right), \tag{5}$$
where in the second inequality we used (3). Combining (4) and (5), we get, for all $\kappa$,
$$E_t\left[\exp\left(\kappa\tilde{Z}_t\right)\right] \le \exp\left(3\kappa^2/4\right). \tag{6}$$
Note that the above analysis for (6) still holds when $\kappa$ is a random variable with respect to $\xi_1, \xi_2, \dots, \xi_{t-1}$. So, for $Z_t$, we have $E_t[\exp(\lambda Z_t)] \le \exp\left(3\lambda^2\sigma_t^2/4\right)$ for $\lambda > 0$.

Define the random variables $Y_0 = 1$ and $Y_t = Y_{t-1}\exp\left(\lambda Z_t - 3\lambda^2\sigma_t^2/4\right)$, $1 \le t \le T$. So, we have $E_t[Y_t] = Y_{t-1}\exp\left(-3\lambda^2\sigma_t^2/4\right) \cdot E_t[\exp(\lambda Z_t)] \le Y_{t-1}$. Now, taking the full expectation over all variables $\xi_1, \xi_2, \dots, \xi_T$, we have
$$E[Y_T] \le E[Y_{T-1}] \le \cdots \le E[Y_0] = 1.$$
By Markov's inequality, $P\left(Y_T \ge \frac{1}{\delta}\right) \le \delta$. Since $Y_T = \exp\left(\lambda\sum_{t=1}^T Z_t - \frac{3}{4}\lambda^2\sum_{t=1}^T \sigma_t^2\right)$, we have
$$P\left(Y_T \ge \frac{1}{\delta}\right) = P\left(\lambda\sum_{t=1}^T Z_t - \frac{3}{4}\lambda^2\sum_{t=1}^T \sigma_t^2 \ge \ln\frac{1}{\delta}\right) = P\left(\sum_{t=1}^T Z_t \ge \frac{3}{4}\lambda\sum_{t=1}^T \sigma_t^2 + \frac{1}{\lambda}\ln\frac{1}{\delta}\right) \le \delta,$$
which completes the proof.
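The statement of Lemma 1 can be exercised empirically. The sketch below uses assumed toy parameters: independent Gaussian $Z_t \sim N(0, s_t^2)$ form a martingale difference sequence, and choosing $\sigma_t = 2s_t$ gives $E[\exp(Z_t^2/\sigma_t^2)] = (1 - 1/2)^{-1/2} \le e$, so the lemma's hypothesis holds.

```python
import math
import random

# Monte Carlo sanity check of Lemma 1 with assumed toy parameters: the
# frequency of sum Z_t exceeding (3/4) lam * sum sigma_t^2 + (1/lam) ln(1/delta)
# should be at most delta.
random.seed(1)
T, lam, delta = 50, 0.05, 0.1
s = [0.5 + 0.01 * t for t in range(T)]          # assumed conditional std devs
sigma_sq = [4.0 * st * st for st in s]          # sigma_t = 2 s_t
threshold = 0.75 * lam * sum(sigma_sq) + math.log(1.0 / delta) / lam

trials, bad = 20_000, 0
for _ in range(trials):
    z_sum = sum(random.gauss(0.0, st) for st in s)
    if z_sum > threshold:
        bad += 1
print(bad / trials, "should be <=", delta)
assert bad / trials <= delta
```

As is typical of exponential-moment bounds, the empirical exceedance frequency here is far below $\delta$; the lemma's strength is that it also covers random, data-dependent $\sigma_t$, which this simple check does not stress.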

To prove Lemma 2, we first need the following technical lemma.

Lemma 4. $\forall T \ge 1$, it holds that
$$\sum_{t=1}^T a_t \sum_{i=1}^t b_i = \sum_{t=1}^T b_t \sum_{i=t}^T a_i \quad \text{and} \quad \sum_{t=1}^T a_t \sum_{i=0}^{t-1} b_i = \sum_{t=0}^{T-1} b_t \sum_{i=t+1}^T a_i.$$

Proof. We prove these equalities by induction. When $T = 1$, they obviously hold. Now, for $k < T$, assume that $\sum_{t=1}^k a_t \sum_{i=1}^t b_i = \sum_{t=1}^k b_t \sum_{i=t}^k a_i$. Then, we have
$$\sum_{t=1}^{k+1} a_t \sum_{i=1}^t b_i = \sum_{t=1}^k a_t \sum_{i=1}^t b_i + a_{k+1}\sum_{i=1}^{k+1} b_i = \sum_{t=1}^k b_t \sum_{i=t}^k a_i + a_{k+1}\sum_{i=1}^k b_i + a_{k+1}b_{k+1}$$
$$= \sum_{t=1}^k b_t \sum_{i=t}^{k+1} a_i + a_{k+1}b_{k+1} = \sum_{t=1}^{k+1} b_t \sum_{i=t}^{k+1} a_i.$$
Hence, by induction, the first equality is proved.

Similarly, for the second equality, assume that for $k < T$ we have $\sum_{t=1}^k a_t \sum_{i=0}^{t-1} b_i = \sum_{t=0}^{k-1} b_t \sum_{i=t+1}^k a_i$. Then, we have
$$\sum_{t=1}^{k+1} a_t \sum_{i=0}^{t-1} b_i = \sum_{t=1}^k a_t \sum_{i=0}^{t-1} b_i + a_{k+1}\sum_{i=0}^k b_i = \sum_{t=0}^{k-1} b_t \sum_{i=t+1}^k a_i + a_{k+1}\sum_{i=0}^{k-1} b_i + a_{k+1}b_k$$
$$= \sum_{t=0}^{k-1} b_t \sum_{i=t+1}^{k+1} a_i + a_{k+1}b_k = \sum_{t=0}^{k} b_t \sum_{i=t+1}^{k+1} a_i.$$
By induction, we finish the proof.
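Both summation-exchange identities of Lemma 4 are easy to spot-check numerically with random coefficients (indices written out in 0-based Python):

```python
import random

# Numerical check of the two summation-exchange identities in Lemma 4.
random.seed(2)
T = 7
a = [random.random() for _ in range(T + 1)]   # a[0] unused by identity 1
b = [random.random() for _ in range(T + 1)]

# Identity 1: sum_{t=1}^T a_t sum_{i=1}^t b_i == sum_{t=1}^T b_t sum_{i=t}^T a_i
lhs1 = sum(a[t] * sum(b[1:t + 1]) for t in range(1, T + 1))
rhs1 = sum(b[t] * sum(a[t:T + 1]) for t in range(1, T + 1))

# Identity 2: sum_{t=1}^T a_t sum_{i=0}^{t-1} b_i
#          == sum_{t=0}^{T-1} b_t sum_{i=t+1}^T a_i
lhs2 = sum(a[t] * sum(b[0:t]) for t in range(1, T + 1))
rhs2 = sum(b[t] * sum(a[t + 1:T + 1]) for t in range(0, T))

assert abs(lhs1 - rhs1) < 1e-12 and abs(lhs2 - rhs2) < 1e-12
print("both identities hold")
```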

Proof of Lemma 2. By the smoothness of $f$ and the definition of $x_{t+1}$, we have
$$f(x_{t+1}) - f(x_t) \le -\langle \nabla f(x_t), m_t \rangle + \frac{M}{2}\|m_t\|^2. \tag{7}$$
We now upper bound $-\langle \nabla f(x_t), m_t \rangle$:
$$-\langle \nabla f(x_t), m_t \rangle = -\mu\langle \nabla f(x_t), m_{t-1} \rangle - \langle \nabla f(x_t), \eta_t g_t \rangle$$
$$= -\mu\langle \nabla f(x_{t-1}), m_{t-1} \rangle - \mu\langle \nabla f(x_t) - \nabla f(x_{t-1}), m_{t-1} \rangle - \langle \nabla f(x_t), \eta_t g_t \rangle$$
$$\le -\mu\langle \nabla f(x_{t-1}), m_{t-1} \rangle + \mu\|\nabla f(x_t) - \nabla f(x_{t-1})\|\|m_{t-1}\| - \langle \nabla f(x_t), \eta_t g_t \rangle$$
$$\le -\mu\langle \nabla f(x_{t-1}), m_{t-1} \rangle + \mu M\|m_{t-1}\|^2 - \langle \nabla f(x_t), \eta_t g_t \rangle,$$
where the second inequality is due to the smoothness of $f$. Hence, iterating the inequality, we have
$$-\langle \nabla f(x_t), m_t \rangle \le -\mu^2\langle \nabla f(x_{t-2}), m_{t-2} \rangle + \mu^2 M\|m_{t-2}\|^2 + \mu M\|m_{t-1}\|^2 - \mu\langle \nabla f(x_{t-1}), \eta_{t-1} g_{t-1} \rangle - \langle \nabla f(x_t), \eta_t g_t \rangle$$
$$\le M\sum_{i=1}^{t-1}\mu^{t-i}\|m_i\|^2 - \sum_{i=1}^t \mu^{t-i}\langle \nabla f(x_i), \eta_i g_i \rangle.$$

Thus, denoting $\epsilon_t = g_t - \nabla f(x_t)$ and summing (7) over $t$ from 1 to $T$, we obtain
$$f^\star - f(x_1) \le f(x_{T+1}) - f(x_1) \le M\sum_{t=1}^T\sum_{i=1}^{t-1}\mu^{t-i}\|m_i\|^2 - \sum_{t=1}^T\sum_{i=1}^t\mu^{t-i}\langle \nabla f(x_i), \eta_i g_i \rangle + \sum_{t=1}^T\frac{M}{2}\|m_t\|^2$$
$$\le M\sum_{t=1}^T\sum_{i=1}^{t-1}\mu^{t-i}\|m_i\|^2 - \sum_{t=1}^T\langle \eta_t, \nabla f(x_t)^2 \rangle - \sum_{t=1}^T\sum_{i=1}^t\mu^{t-i}\langle \nabla f(x_i), \eta_i \epsilon_i \rangle + \sum_{t=1}^T\frac{M}{2}\|m_t\|^2.$$

By Lemma 4, we have
$$M\sum_{t=1}^T\sum_{i=1}^{t-1}\mu^{t-i}\|m_i\|^2 \le \frac{M}{1-\mu}\sum_{t=1}^T\|m_t\|^2.$$
Also, by Lemma 4, we have
$$-\sum_{t=1}^T\sum_{i=1}^t\mu^{t-i}\langle \nabla f(x_i), \eta_i \epsilon_i \rangle = -\sum_{t=1}^T\mu^{-t}\langle \nabla f(x_t), \eta_t \epsilon_t \rangle\sum_{i=t}^T\mu^i = -\frac{1}{1-\mu}\sum_{t=1}^T\langle \nabla f(x_t), \eta_t \epsilon_t \rangle(1-\mu^{T-t+1}) =: S_T.$$

We then upper bound $S_T$. Denote $L_t := -\frac{1-\mu^{T-t+1}}{1-\mu}\langle \nabla f(x_t), \eta_t \epsilon_t \rangle$ and $N_t := \frac{(1-\mu^{T-t+1})^2}{(1-\mu)^2}\|\eta_t\nabla f(x_t)\|^2\sigma^2$. Using the assumptions on the noise, for any $1 \le t \le T$, we have
$$E_t\left[\exp\left(\frac{L_t^2}{N_t}\right)\right] \le E_t\left[\exp\left(\frac{\|\eta_t\nabla f(x_t)\|^2\|\epsilon_t\|^2(1-\mu^{T-t+1})^2/(1-\mu)^2}{N_t}\right)\right] = E_t\left[\exp\left(\frac{\|\epsilon_t\|^2}{\sigma^2}\right)\right] \le \exp(1).$$
We can also see that, for any $t$, $E_t[L_t] = -\frac{1-\mu^{T-t+1}}{1-\mu}\sum_{i=1}^d \eta_{t,i}\nabla f(x_t)_i E_t[\epsilon_{t,i}] = 0$. Thus, from Lemma 1, with probability at least $1-\delta$, for any $\lambda > 0$, we have
$$S_T = \sum_{t=1}^T L_t \le \frac{3}{4}\lambda\sum_{t=1}^T N_t + \frac{1}{\lambda}\ln\frac{1}{\delta} \le \frac{3\lambda(1-\mu^T)^2}{4(1-\mu)^2}\sum_{t=1}^T\|\eta_t\nabla f(x_t)\|^2\sigma^2 + \frac{1}{\lambda}\ln\frac{1}{\delta} \le \frac{3\lambda\|\eta_1\|(1-\mu^T)^2}{4(1-\mu)^2}\sum_{t=1}^T\langle \eta_t, \nabla f(x_t)^2 \rangle\sigma^2 + \frac{1}{\lambda}\ln\frac{1}{\delta}.$$

Finally, we upper bound $\sum_{t=1}^T\|m_t\|^2$. From the convexity of $\|\cdot\|^2$, we have
$$\|m_t\|^2 = \left\|\mu m_{t-1} + (1-\mu)\frac{\eta_t g_t}{1-\mu}\right\|^2 \le \mu\|m_{t-1}\|^2 + \frac{1}{1-\mu}\|\eta_t g_t\|^2.$$
Summing over $t$ from 1 to $T$, we have
$$\sum_{t=1}^T\|m_t\|^2 \le \sum_{t=1}^T\mu\|m_{t-1}\|^2 + \frac{1}{1-\mu}\sum_{t=1}^T\|\eta_t g_t\|^2 = \sum_{t=1}^{T-1}\mu\|m_t\|^2 + \frac{1}{1-\mu}\sum_{t=1}^T\|\eta_t g_t\|^2 \le \sum_{t=1}^{T}\mu\|m_t\|^2 + \frac{1}{1-\mu}\sum_{t=1}^T\|\eta_t g_t\|^2,$$
where in the first equality we used $m_0 = 0$. Reordering the terms, we have
$$\sum_{t=1}^T\|m_t\|^2 \le \frac{1}{(1-\mu)^2}\sum_{t=1}^T\|\eta_t g_t\|^2.$$

Combining things together, and taking $\lambda = \frac{2(1-\mu)^2}{3\|\eta_1\|(1-\mu^T)^2\sigma^2}$, with probability at least $1-\delta$, we have
$$f^\star - f(x_1) \le \frac{1}{\lambda}\ln\frac{1}{\delta} + \sum_{t=1}^T\left[\left(\frac{M}{2} + \frac{M}{1-\mu}\right)\|\eta_t g_t\|^2 - \left(1 - \frac{3\lambda\|\eta_1\|(1-\mu^T)^2\sigma^2}{4(1-\mu)^2}\right)\langle \eta_t, \nabla f(x_t)^2 \rangle\right]$$
$$= \frac{3\|\eta_1\|(1-\mu^T)^2\sigma^2}{2(1-\mu)^2}\ln\frac{1}{\delta} + \sum_{t=1}^T\left[\frac{(3-\mu)M}{2(1-\mu)}\|\eta_t g_t\|^2 - \frac{1}{2}\langle \eta_t, \nabla f(x_t)^2 \rangle\right].$$
Rearranging the terms, we get the stated bound.
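The deterministic step used near the end of this proof, $\sum_{t=1}^T \|m_t\|^2 \le \frac{1}{(1-\mu)^2}\sum_{t=1}^T \|\eta_t g_t\|^2$, follows from the momentum recursion alone, so it can be spot-checked with arbitrary placeholder gradients:

```python
import random

# Numerical spot check: with m_0 = 0 and m_t = mu m_{t-1} + eta_t g_t,
# sum_t ||m_t||^2 <= 1/(1-mu)^2 * sum_t ||eta_t g_t||^2.
# The products eta_t * g_t below are random placeholders, not from a
# real objective.
random.seed(3)
mu, T, d = 0.8, 30, 5
m = [0.0] * d
lhs = rhs = 0.0
for t in range(1, T + 1):
    eta_g = [random.gauss(0.0, 1.0) / t**0.5 for _ in range(d)]  # eta_t * g_t
    m = [mu * mi + eg for mi, eg in zip(m, eta_g)]
    lhs += sum(mi * mi for mi in m)
    rhs += sum(eg * eg for eg in eta_g)
print(lhs, "<=", rhs / (1.0 - mu) ** 2)
assert lhs <= rhs / (1.0 - mu) ** 2
```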

A.2. Details of Section 4.1

The proof of Theorem 1 makes use of the following additional lemma on the tail of the sub-gaussian noise.

Lemma 5. Assume (B2). Then, for any $\delta \in (0,1)$, with probability at least $1-\delta$, we have
$$\max_{1\le t\le T}\|g_t - \nabla f(x_t)\|^2 \le \sigma^2\ln\frac{Te}{\delta}.$$

Proof. By Markov's inequality, for any $A > 0$,
$$P\left(\max_{1\le t\le T}\|g_t - \nabla f(x_t)\|^2 > A\right) = P\left(\exp\left(\frac{\max_{1\le t\le T}\|g_t - \nabla f(x_t)\|^2}{\sigma^2}\right) > \exp\left(\frac{A}{\sigma^2}\right)\right)$$
$$\le \exp\left(-\frac{A}{\sigma^2}\right)E\left[\exp\left(\frac{\max_{1\le t\le T}\|g_t - \nabla f(x_t)\|^2}{\sigma^2}\right)\right] = \exp\left(-\frac{A}{\sigma^2}\right)E\left[\max_{1\le t\le T}\exp\left(\frac{\|g_t - \nabla f(x_t)\|^2}{\sigma^2}\right)\right]$$
$$\le \exp\left(-\frac{A}{\sigma^2}\right)\sum_{t=1}^T E\left[\exp\left(\frac{\|\nabla f(x_t) - g_t\|^2}{\sigma^2}\right)\right] \le \exp\left(-\frac{A}{\sigma^2} + 1\right)T.$$
Setting the right-hand side equal to $\delta$ and solving for $A$ gives the stated bound.

Proof of Theorem 1. Using the fact that $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$, we have
$$\sum_{t=1}^T\|\eta_t g_t\|^2 = \sum_{t=1}^T\eta_t^2\|g_t\|^2 \le \sum_{t=1}^T 2\eta_t^2\|\nabla f(x_t)\|^2 + \sum_{t=1}^T 2\eta_t^2\|g_t - \nabla f(x_t)\|^2$$
$$\le \sum_{t=1}^T 2\eta_t^2\|\nabla f(x_t)\|^2 + \max_{1\le t\le T}\|g_t - \nabla f(x_t)\|^2\sum_{t=1}^T 2\eta_t^2.$$
By Lemma 5, Lemma 2, and the union bound, we have that, with probability at least $1-\delta$,
$$\frac{\eta_T}{2}\sum_{t=1}^T\|\nabla f(x_t)\|^2 \le \left(1 - \frac{2M(3-\mu)}{1-\mu}\eta_1\right)\sum_{t=1}^T\eta_t\|\nabla f(x_t)\|^2 \le 2(f(x_1) - f^\star) + \frac{2(3-\mu)c^2 M\sigma^2\ln\frac{2Te}{\delta}\ln T}{1-\mu} + \frac{3c(1-\mu^T)^2\sigma^2}{(1-\mu)^2}\ln\frac{1}{\delta}.$$
Rearranging the terms and lower bounding $\sum_{t=1}^T\|\nabla f(x_t)\|^2$ by $T\cdot\min_{1\le t\le T}\|\nabla f(x_t)\|^2$, we have the stated bound.
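The behavior that Theorem 1 predicts, namely that $\min_t \|\nabla f(x_t)\|^2$ shrinks along a single run of Algorithm 1 with $\eta_t = c/\sqrt{t}$, can be illustrated with a small experiment. All constants here are assumptions chosen for the demo (in particular, $c$ is larger than the theorem's condition requires), and the test function is an arbitrary smooth nonconvex example, not one from the paper:

```python
import math
import random

# Illustrative run of Algorithm 1 with eta_t = c / sqrt(t) on the smooth
# nonconvex function f(x) = x^2 + 3 sin^2(x), with gradient 2x + 3 sin(2x).
# Gaussian noise is added to the gradient to mimic (B1)-(B2).
random.seed(4)

def grad(x):
    return 2.0 * x + 3.0 * math.sin(2.0 * x)

c, mu, x, m = 0.05, 0.5, 2.0, 0.0
best = float("inf")
for t in range(1, 5001):
    g = grad(x) + random.gauss(0.0, 0.5)   # noisy gradient oracle
    eta = c / math.sqrt(t)
    m = mu * m + eta * g                   # Algorithm 1 momentum update
    x = x - m
    best = min(best, grad(x) ** 2)         # track min_t ||grad f(x_t)||^2
print("min_t ||grad f(x_t)||^2 =", best)
assert best < 0.05
```

The quantity tracked is exactly the one bounded in the theorem; only the minimum over iterates, not the last iterate, is guaranteed to be small.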

A.3. Details of Section 4.2

For the proof of Lemma 3, we first need the following technical lemma.

Lemma 6. Let $a_i \ge 0$, $i = 0, \dots, T$, and let $f: [0, +\infty) \to [0, +\infty)$ be a non-increasing function. Then
$$\sum_{t=1}^T a_t f\left(a_0 + \sum_{i=1}^t a_i\right) \le \int_{a_0}^{\sum_{t=0}^T a_t} f(x)\,dx.$$

Proof. Denote $s_t = \sum_{i=0}^t a_i$. Then, we have
$$a_i f(s_i) = \int_{s_{i-1}}^{s_i} f(s_i)\,dx \le \int_{s_{i-1}}^{s_i} f(x)\,dx.$$
Summing over $i = 1, \dots, T$, we have the stated bound.
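For the choice $f(x) = 1/x$ used below to control $\sum_t \|\eta_{t+1} g_t\|^2$, Lemma 6 specializes to $\sum_t a_t/(a_0 + \cdots + a_t) \le \ln(\sum_{t=0}^T a_t) - \ln a_0$, which is easy to check numerically:

```python
import math
import random

# Numerical check of Lemma 6 with f(x) = 1/x (non-increasing on (0, inf)),
# i.e., sum_t a_t / (a_0 + ... + a_t) <= ln(sum_t a_t) - ln(a_0).
random.seed(5)
a = [0.1 + random.random() for _ in range(11)]   # a_0, ..., a_T, all > 0
prefix = a[0]
lhs = 0.0
for t in range(1, len(a)):
    prefix += a[t]
    lhs += a[t] / prefix
rhs = math.log(sum(a)) - math.log(a[0])
print(lhs, "<=", rhs)
assert lhs <= rhs
```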

Proof of Lemma 3. First, we separate $\sum_{t=1}^T\|\eta_t g_t\|^2$ into two terms:
$$\sum_{t=1}^T\|\eta_t g_t\|^2 = \sum_{t=1}^T\|\eta_{t+1} g_t\|^2 + \sum_{t=1}^T\langle \eta_t^2 - \eta_{t+1}^2, g_t^2 \rangle.$$
Then, we proceed:
$$\sum_{t=1}^T\langle \eta_t^2 - \eta_{t+1}^2, g_t^2 \rangle = \sum_{i=1}^d\sum_{t=1}^T(\eta_{t,i}^2 - \eta_{t+1,i}^2)g_{t,i}^2 \le \sum_{i=1}^d\sum_{t=1}^T 2\eta_{t,i}g_{t,i}^2(\eta_{t,i} - \eta_{t+1,i})$$
$$\le 2\sum_{i=1}^d\max_{1\le t\le T}\eta_{t,i}g_{t,i}^2\sum_{t=1}^T(\eta_{t,i} - \eta_{t+1,i}) \le 2\sum_{i=1}^d\eta_{1,i}\max_{1\le t\le T}\eta_{t,i}g_{t,i}^2$$
$$\le 4\sum_{i=1}^d\eta_{1,i}\max_{1\le t\le T}\eta_{t,i}\left(g_{t,i}^2 - \nabla f(x_t)_i^2\right) + 4\sum_{i=1}^d\eta_{1,i}\sum_{t=1}^T\eta_{t,i}\nabla f(x_t)_i^2$$
$$\le 4\sum_{i=1}^d\eta_{1,i}^2\max_{1\le t\le T}|g_{t,i}^2 - \nabla f(x_t)_i^2| + 4\sum_{i=1}^d\eta_{1,i}\sum_{t=1}^T\eta_{t,i}\nabla f(x_t)_i^2$$
$$\le \frac{4d\alpha^2}{\beta}\max_{1\le t\le T}\|g_t - \nabla f(x_t)\|^2 + \frac{4\alpha}{\sqrt{\beta}}\sum_{t=1}^T\langle \eta_t, \nabla f(x_t)^2 \rangle. \tag{8}$$

Using Lemma 5 on (8), for $\delta \in (0,1)$, with probability at least $1-\frac{\delta}{2}$, we have
$$\sum_{t=1}^T\langle \eta_t^2 - \eta_{t+1}^2, g_t^2 \rangle \le \frac{4d\alpha^2\sigma^2}{\beta}\ln\frac{2Te}{\delta} + \frac{4\alpha}{\sqrt{\beta}}\sum_{t=1}^T\langle \eta_t, \nabla f(x_t)^2 \rangle.$$
We now upper bound $\sum_{t=1}^T\|\eta_{t+1}g_t\|^2$:
$$\sum_{t=1}^T\|\eta_{t+1}g_t\|^2 = \sum_{i=1}^d\sum_{t=1}^T\frac{\alpha^2 g_{t,i}^2}{\beta + \sum_{j=1}^t g_{j,i}^2} \le \sum_{i=1}^d\alpha^2\ln\left(\beta + \sum_{t=1}^T g_{t,i}^2\right) \le \alpha^2 d\ln\left(\beta + \frac{1}{d}\sum_{i=1}^d\sum_{t=1}^T g_{t,i}^2\right)$$
$$= 2\alpha^2 d\ln\sqrt{\beta + \frac{1}{d}\sum_{t=1}^T\|g_t\|^2} \le 2\alpha^2 d\ln\left(\sqrt{\beta + \frac{2T}{d}\max_{1\le t\le T}\|g_t - \nabla f(x_t)\|^2} + \sqrt{\frac{2}{d}\sum_{t=1}^T\|\nabla f(x_t)\|^2}\right), \tag{9}$$
where in the first inequality we used Lemma 6 and in the second inequality we used Jensen's inequality. Then, using Lemma 5 on (9), with probability at least $1-\frac{\delta}{2}$, we have
$$\sum_{t=1}^T\|\eta_{t+1}g_t\|^2 \le 2\alpha^2 d\ln\left(\sqrt{\beta + \frac{2T\sigma^2}{d}\ln\frac{2Te}{\delta}} + \sqrt{\frac{2}{d}\sum_{t=1}^T\|\nabla f(x_t)\|^2}\right).$$
Putting things together, we have the stated bound.


Finally, to prove Theorem 2, we need the two following lemmas.

Lemma 7 (Lemma 6 in Li & Orabona (2019)). Let $x \ge 0$, $A, C, D \ge 0$, $B > 0$, and $x^2 \le (A + Bx)(C + D\ln(A + Bx))$. Then,
$$x < 32B^3D^2 + 2BC + 8B^2D\sqrt{C} + A/B.$$

Lemma 8 (Lemma 5 in Li & Orabona (2019)). If $x \ge 0$ and $x \le C\sqrt{A + Bx}$, then $x \le \max\left(2BC^2, C\sqrt{2A}\right)$.
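Lemma 8 is the implicit-inequality step used to extract the final rate below. A brute-force numerical sketch, scanning a grid of candidate $x$ values rather than solving the quadratic in closed form, can check it on random instances:

```python
import math
import random

# Numerical check of Lemma 8: whenever x >= 0 satisfies x <= C sqrt(A + B x),
# it also satisfies x <= max(2 B C^2, C sqrt(2 A)).
random.seed(6)
violations = 0
for _ in range(200):
    A = random.uniform(0.0, 10.0)
    B = random.uniform(0.1, 5.0)
    C = random.uniform(0.1, 5.0)
    bound = max(2.0 * B * C * C, C * math.sqrt(2.0 * A))
    for k in range(1000):
        x = 0.02 * k
        if x <= C * math.sqrt(A + B * x) and x > bound + 1e-9:
            violations += 1
print("violations:", violations)
assert violations == 0
```

The case split behind the lemma is short: if $Bx \le A$ then $x \le C\sqrt{2A}$; otherwise $x \le C\sqrt{2Bx}$, which gives $x \le 2BC^2$.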

Proof of Theorem 2. By Lemma 2 and Lemma 3, for $\delta \in (0,1)$, with probability at least $1-\frac{2}{3}\delta$, we have
$$\left(1 - \frac{4\alpha M(3-\mu)}{\sqrt{\beta}(1-\mu)}\right)\sum_{t=1}^T\langle \eta_t, \nabla f(x_t)^2 \rangle \le 2(f(x_1) - f^\star) + \frac{M(3-\mu)}{1-\mu}\left(K + \frac{4d\alpha^2\sigma^2}{\beta}\ln\frac{3Te}{\delta}\right) + \frac{3\|\eta_1\|\sigma^2(1-\mu^T)^2}{(1-\mu)^2}\ln\frac{3}{\delta},$$
where, for conciseness, $K$ denotes $2\alpha^2 d\ln\left(\sqrt{\beta + \frac{2T\sigma^2}{d}\ln\frac{3Te}{\delta}} + \sqrt{\frac{2}{d}\sum_{t=1}^T\|\nabla f(x_t)\|^2}\right)$.

Rearranging the terms, we have
$$\sum_{t=1}^T\langle \eta_t, \nabla f(x_t)^2 \rangle \le \frac{1}{1 - \frac{4\alpha M(3-\mu)}{\sqrt{\beta}(1-\mu)}}\left[2(f(x_1) - f^\star) + \frac{M(3-\mu)}{1-\mu}\left(K + \frac{4d\alpha^2\sigma^2}{\beta}\ln\frac{3Te}{\delta}\right) + \frac{3\|\eta_1\|\sigma^2(1-\mu^T)^2}{(1-\mu)^2}\ln\frac{3}{\delta}\right]$$
$$\le 4(f(x_1) - f^\star) + \frac{2M(3-\mu)}{1-\mu}\left(K + \frac{4d\alpha^2\sigma^2}{\beta}\ln\frac{3Te}{\delta}\right) + \frac{3\|\eta_1\|\sigma^2(1-\mu^T)^2}{(1-\mu)^2}\ln\frac{3}{\delta} =: C(T),$$
where in the second inequality we used $4\alpha \le \frac{\sqrt{\beta}(1-\mu)}{2M(3-\mu)}$. Also, we have
$$\sum_{t=1}^T\langle \eta_t, \nabla f(x_t)^2 \rangle \ge \sum_{t=1}^T\langle \eta_T, \nabla f(x_t)^2 \rangle = \sum_{i=1}^d\frac{\alpha\sum_{t=1}^T\nabla f(x_t)_i^2}{\sqrt{\beta + \sum_{t=1}^T g_{t,i}^2}}$$
$$\ge \sum_{i=1}^d\frac{\alpha\sum_{t=1}^T\nabla f(x_t)_i^2}{\sqrt{\beta + 2\sum_{t=1}^T\nabla f(x_t)_i^2 + 2\sum_{t=1}^T(g_{t,i} - \nabla f(x_t)_i)^2}}$$
$$\ge \sum_{i=1}^d\frac{\alpha\sum_{t=1}^T\nabla f(x_t)_i^2}{\sqrt{\beta + 2\sum_{i=1}^d\sum_{t=1}^T\nabla f(x_t)_i^2 + 2\sum_{i=1}^d\sum_{t=1}^T(g_{t,i} - \nabla f(x_t)_i)^2}}$$
$$\ge \frac{\alpha\sum_{t=1}^T\|\nabla f(x_t)\|^2}{\sqrt{\beta + 2\sum_{t=1}^T\|\nabla f(x_t)\|^2 + 2T\max_{1\le t\le T}\|g_t - \nabla f(x_t)\|^2}}.$$
By Lemma 5, with probability at least $1-\delta$, we have
$$\sum_{t=1}^T\|\nabla f(x_t)\|^2 \le \frac{C(T)}{\alpha}\sqrt{\beta + 2\sum_{t=1}^T\|\nabla f(x_t)\|^2 + 2T\sigma^2\ln\frac{3Te}{\delta}}. \tag{10}$$
Moreover,
$$\text{RHS of (10)} \le \frac{C(T)}{\alpha}\left(\sqrt{\beta + 2T\sigma^2\ln\frac{3Te}{\delta}} + \sqrt{2\sum_{t=1}^T\|\nabla f(x_t)\|^2}\right)$$
$$\le \left(C + D\ln\left(A + B\sqrt{\sum_{t=1}^T\|\nabla f(x_t)\|^2}\right)\right)\left(A + B\sqrt{\sum_{t=1}^T\|\nabla f(x_t)\|^2}\right), \tag{11}$$
where $A = \sqrt{\beta + 2T\sigma^2\ln\frac{3Te}{\delta}}$, $B = \sqrt{2}$, $C = \frac{4(f(x_1)-f^\star)}{\alpha} + \frac{8M(3-\mu)d\alpha\sigma^2}{\beta(1-\mu)}\ln\frac{3Te}{\delta} + \frac{3d(1-\mu^T)^2\sigma^2}{\beta(1-\mu)^2}\ln\frac{3}{\delta}$, and $D = \frac{4\alpha dM(3-\mu)}{1-\mu}$. Using Lemma 7, we have
$$\sqrt{\sum_{t=1}^T\|\nabla f(x_t)\|^2} \le 32B^3D^2 + 2BC + 8B^2D\sqrt{C} + \frac{A}{B}.$$
We use this upper bound in the logarithmic term of (11). Thus, we have (10) again, this time with
$$C(T) = C + D\ln\left(2A + 32B^4D^2 + 2B^2C + 8B^3D\sqrt{C}\right) = O\left(\frac{1}{\alpha} + \frac{d\left(\alpha + \sigma^2\left(\alpha\ln\frac{T}{\delta} + \frac{1}{1-\mu}\ln\frac{1}{\delta}\right)\right)}{1-\mu}\right).$$
Solving (11) by Lemma 8 and lower bounding $\sum_{t=1}^T\|\nabla f(x_t)\|^2$ by $T\min_{1\le t\le T}\|\nabla f(x_t)\|^2$, we get the stated bound.
