the rate of O(1/√T) in both the convex and nonconvex settings. However, the results hold only for Lipschitz functions. Gadat et al. (2018) provided an in-depth description of the stochastic heavy-ball method. Moreover, for non-convex functions, they showed some almost sure convergence results. Loizou & Richtárik (2017) provided a general analysis for the momentum variants of several classes of stochastic optimization algorithms and proved linear rates of convergence for quadratic and smooth functions. To the best of our knowledge, there are no high probability bounds for nonconvex stochastic momentum methods without using strong assumptions.

Nonconvex convergence of adaptive methods. In recent years, a variety of adaptive SGD algorithms have been developed to automatically tune the step size using the past stochastic gradients. The first adaptive algorithm was AdaGrad (Duchi et al., 2011), designed to adapt to sparse gradients. Li & Orabona (2019) and Ward et al. (2019) showed the convergence of variants of AdaGrad with a rate of O(ln T/√T) in the non-convex case. Moreover, Li & Orabona (2019) showed that AdaGrad with non-coordinate-wise learning rates is adaptive to the level of noise. Zou et al. (2019) studied AdaGrad with a unified momentum and Chen et al. (2019) considered a large family of Adam-like algorithms (Kingma & Ba, 2015), including AdaGrad with momentum. Yet, all of these works prove only bounds in expectation, and most of them use the very strong assumption of bounded stochastic gradients.

High probability bounds. Results on high probability bounds are relatively rare compared to those in expectation, which are easier to obtain. Kakade & Tewari (2009) used Freedman's inequality to prove high probability bounds for an algorithm solving the SVM objective function. For classic SGD, Harvey et al. (2019b) and Harvey et al. (2019a) used a generalized Freedman's inequality to prove bounds in the non-smooth and strongly convex case, while Jain et al. (2019) proved the optimal bound for the last iterate of SGD with high probability. As far as we know, there are currently no high probability bounds for adaptive methods in the nonconvex setting.

3. Problem Set-Up

Notation. We denote vectors and matrices by bold letters. The coordinate j of a vector x is denoted by x_j, and by ∇f(x)_j for the gradient ∇f(x). To keep the notation concise, all standard operations xy, x/y, x², 1/x, x^{1/2} and max(x, y) on vectors x, y are assumed to be element-wise. We denote by E[·] the expectation with respect to the underlying probability space and by E_t[·] the conditional expectation with respect to the past randomness. We use L2 norms.

Assumptions. In this paper we focus on the optimization problem

min_{x ∈ R^d} f(x),

where f is bounded from below and we denote its infimum by f⋆. We do not assume the function to be convex. We consider stochastic optimization algorithms that have access to a noisy estimate of the gradient of f. This covers the ubiquitous SGD (Robbins & Monro, 1951), as well as modern variants such as AdaGrad. We are interested in studying the convergence of the gradients to zero, because without additional assumptions it is the only thing we can study in the nonconvex setting.

We make the following assumption on the objective function f(x):

(A) f is M-smooth, that is, f is differentiable and its gradient is M-Lipschitz, i.e., ‖∇f(x) − ∇f(y)‖ ≤ M‖x − y‖, ∀x, y ∈ R^d.

Note that (A), for all x, y ∈ R^d, implies (Nesterov, 2004, Lemma 1.2.3)

|f(y) − f(x) − ⟨∇f(x), y − x⟩| ≤ (M/2) ‖y − x‖².

It is easy to see that this assumption is necessary for the convergence of the gradients to zero. Indeed, without smoothness the norm of the gradients does not go to zero even in the convex case, e.g., consider the function f(x) = |x − 1|.

We assume that we have access to a stochastic first-order black-box oracle that returns a noisy estimate of the gradient of f at any point x ∈ R^d. That is, we will use the following assumption:

(B1) We receive a vector g(x, ξ) such that E_ξ[g(x, ξ)] = ∇f(x) for any x ∈ R^d.

We will also make the following assumption on the variance of the noise.

(B2) (Sub-Gaussian Noise) The stochastic gradient satisfies E_ξ[exp(‖∇f(x) − g(x, ξ)‖²/σ²)] ≤ exp(1), ∀x.

The condition (B2) has been used by Nemirovski et al. (2009) and Harvey et al. (2019a) to prove high probability convergence guarantees. Intuitively, it implies that the tails of the noise distribution are dominated by the tails of a Gaussian distribution. Note that, by Jensen's inequality, this condition implies a bounded variance of the stochastic gradients.
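The quadratic bound implied by (A) can be sanity-checked numerically. Below is a minimal sketch using an illustrative test function of our choosing, f(x) = Σ_i sin(x_i), whose gradient cos(x) is 1-Lipschitz, so M = 1; the grid of points and the tolerance are also our choices, not part of the paper:

```python
# Check |f(y) - f(x) - <grad f(x), y - x>| <= (M/2)||y - x||^2 for an M-smooth f.
# Illustrative choice: f(x) = sum_i sin(x_i), whose gradient cos(x) is 1-Lipschitz (M = 1).
import math

def f(x):
    return sum(math.sin(xi) for xi in x)

def grad_f(x):
    return [math.cos(xi) for xi in x]

def quadratic_bound_holds(x, y, M=1.0):
    inner = sum(g * (yi - xi) for g, xi, yi in zip(grad_f(x), x, y))
    lhs = abs(f(y) - f(x) - inner)
    rhs = (M / 2) * sum((yi - xi) ** 2 for xi, yi in zip(x, y))
    return lhs <= rhs + 1e-12  # small slack for floating point

# Spot-check the inequality on a small grid of point pairs.
points = [[0.0, 0.0], [1.0, -2.0], [0.5, 3.0], [-1.5, 0.7]]
assert all(quadratic_bound_holds(x, y) for x in points for y in points)
```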
A High Probability Analysis of Adaptive SGD with Momentum
4. A General Analysis for Algorithms with Momentum

In this section, we consider a generic stochastic optimization algorithm with Polyak's momentum (Polyak, 1964; Qian, 1999; Sutskever et al., 2013), also known as the Heavy-ball algorithm or classic momentum, see Algorithm 1.

Algorithm 1 Algorithms with Momentum
1: Input: m_0 = 0, {η_t}_{t=1}^T, 0 ≤ µ < 1, x_1 ∈ R^d
2: for t = 1, . . . , T do
3:   Get stochastic gradient g_t = g(x_t, ξ_t)
4:   m_t = µ m_{t−1} + η_t g_t
5:   x_{t+1} = x_t − m_t
6: end for

Two forms of momentum, but not equivalent. First, we want to point out that two forms of Heavy-ball algorithms are possible. The first one is in Algorithm 1, while the second one is

m_t = µ m_{t−1} + g_t,
x_{t+1} = x_t − η_t m_t.    (1)

This second form is used in many practical implementations, see, for example, PyTorch (Paszke et al., 2019). It would seem that there is no reason to prefer one over the other. However, here we argue that the classic form of momentum is the right one if we want to use adaptive learning rates. To see why, let's unroll the updates in both cases. Using the update in Algorithm 1, we have

x_{t+1} = x_t − η_t g_t − µ η_{t−1} g_{t−1} − µ² η_{t−2} g_{t−2} − . . . ,

while using the update in (1), we have

x_{t+1} = x_t − η_t g_t − µ η_t g_{t−1} − µ² η_t g_{t−2} − . . . .

In words, in the first case the update is composed of a sum of weighted gradients, each one multiplied by a learning rate we decided in the past. On the other hand, in the update (1) the update is composed of a sum of weighted gradients, each one multiplied by the current learning rate. From the analysis point of view, the second update destroys the independence between the past and the future, introducing a dependency that breaks our analysis, unless we introduce very strict conditions on the gradients. On the other hand, the update in Algorithm 1 allows us to carry out the analysis because each learning rate was chosen only with the knowledge of the past. Note that this is a known problem in adaptive algorithms: the lack of independence between past and present is exactly the reason why Adam fails to converge on simple 1d convex problems, see for example the discussion in Savarese et al. (2019).

It is interesting to note that these two types of momentum updates are usually considered equivalent. This is indeed true only if the learning rates are not adaptive.

Assumptions on learning rates. Note that in the pseudo-code we do not specify the learning rates η_t ∈ R^d. In fact, our analysis covers the case of generic learning rates and adaptive ones too. We only need the following assumptions on the stepsizes η_t:

(C1) η_t is non-increasing, i.e., η_{t+1} ≤ η_t, ∀t.

(C2) η_t is independent of ξ_t.

The first assumption is very common (e.g., Duchi et al., 2011; Reddi et al., 2018; Li & Orabona, 2019; Chen et al., 2019; Zhou et al., 2018). Indeed, AdaGrad has non-increasing step sizes by definition. Also, Reddi et al. (2018) have claimed that the main issue behind the divergence of Adam and RMSProp lies in the positive definiteness of 1/η_t − 1/η_{t−1}. The need for the second assumption is technical and shared by similar analyses (Li & Orabona, 2019; Savarese et al., 2019). Indeed, Li & Orabona (2019) showed that delayed step sizes can avoid the possible deviation brought by step sizes that include the current noise.

High probability guarantee. Adaptive learning rates, and in general learning rates that are decided using previous gradients, become stochastic variables. This makes the high probability analysis more complex. Hence, we use a new concentration inequality for martingales in which the variance is treated as a random variable, rather than a deterministic quantity. We use this concentration inequality in the proof of Lemma 2. Our proof, in the Appendix, merges ideas from the related results in Beygelzimer et al. (2011, Theorem 1) and Lan et al. (2012, Lemma 2). A similar result has also been shown by Jin et al. (2019, Lemma 6).

Lemma 1. Assume that Z_1, Z_2, . . . , Z_T is a martingale difference sequence with respect to ξ_1, ξ_2, . . . , ξ_T and E_t[exp(Z_t²/σ_t²)] ≤ exp(1) for all 1 ≤ t ≤ T, where σ_t is a sequence of random variables measurable with respect to ξ_1, ξ_2, . . . , ξ_{t−1}. Then, for any fixed λ > 0 and δ ∈ (0, 1), with probability at least 1 − δ, we have

Σ_{t=1}^T Z_t ≤ (3λ/4) Σ_{t=1}^T σ_t² + (1/λ) ln(1/δ).

Main result. We can now present our general lemma, which allows us to analyze SGD with momentum with adaptive learning rates. We will then instantiate it for particular examples.
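The non-equivalence of the two momentum forms under time-varying step sizes is easy to see numerically. The sketch below (gradients, step sizes, and tolerances are illustrative choices of ours) checks that the two updates produce identical iterates for a constant step size but diverge from each other for a decreasing one:

```python
# Contrast the two heavy-ball updates on a 1-d toy sequence of gradients.

def run_classic(etas, grads, mu):
    # Algorithm 1: m_t = mu*m_{t-1} + eta_t*g_t ; x_{t+1} = x_t - m_t
    x, m = 0.0, 0.0
    for eta, g in zip(etas, grads):
        m = mu * m + eta * g
        x = x - m
    return x

def run_pytorch_style(etas, grads, mu):
    # Form (1): m_t = mu*m_{t-1} + g_t ; x_{t+1} = x_t - eta_t*m_t
    x, m = 0.0, 0.0
    for eta, g in zip(etas, grads):
        m = mu * m + g
        x = x - eta * m
    return x

grads = [1.0, -0.5, 2.0, 0.3]
mu = 0.9
# Constant step size: the two forms coincide.
const = [0.1] * 4
assert abs(run_classic(const, grads, mu) - run_pytorch_style(const, grads, mu)) < 1e-12
# Decreasing (e.g. adaptive-like) step size: they no longer coincide.
decr = [0.1 / (t + 1) ** 0.5 for t in range(4)]
assert abs(run_classic(decr, grads, mu) - run_pytorch_style(decr, grads, mu)) > 1e-6
```

The difference after two steps is exactly µ g_1 (η_1 − η_2), matching the unrolled updates above.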
Lemma 2. Assume (A, B1, B2, C1, C2). Then, for any δ ∈ (0, 1), with probability at least 1 − δ, the iterates of Algorithm 1 satisfy

Σ_{t=1}^T ⟨η_t, ∇f(x_t)²⟩ ≤ (3‖η_1‖σ²(1 − µ^T)²)/((1 − µ)²) ln(1/δ) + 2(f(x_1) − f⋆) + (M(3 − µ)/(1 − µ)) Σ_{t=1}^T ‖η_t g_t‖².

Lemma 3. Assume (A, B1, B2). Let η_t be set as in (2), where α, β > 0. Then, for any δ ∈ (0, 1), with probability at least 1 − δ, we have

Σ_{t=1}^T ‖η_t g_t‖² ≤ (4dα²σ²/β) ln(2Te/δ) + (4α/√β) Σ_{t=1}^T ⟨η_t, ∇f(x_t)²⟩ + 2α² d ln( √(β + (2Tσ²/d) ln(2Te/δ)) + √((2/d) Σ_{t=1}^T ‖∇f(x_t)‖²) ).
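From the proof of Lemma 3, the step size in (2) is the "delayed" coordinate-wise AdaGrad step size η_{t,i} = α/√(β + Σ_{j<t} g_{j,i}²), which only uses past gradients and hence satisfies (C1) and (C2). Below is a minimal sketch of Algorithm 1 with this step size; the quadratic objective, the noiseless oracle, and all constants are illustrative assumptions of ours, not the paper's experiments:

```python
# Delayed AdaGrad with Polyak momentum (a sketch): the running sum of squared
# gradients is updated only AFTER the step, so eta_t depends on g_1, ..., g_{t-1}.
import math

def delayed_adagrad_momentum(grad, x0, T, alpha=0.5, beta=1.0, mu=0.9):
    d = len(x0)
    x = list(x0)
    m = [0.0] * d
    s = [0.0] * d  # sums of squared PAST gradients (delayed)
    for _ in range(T):
        g = grad(x)
        eta = [alpha / math.sqrt(beta + s[i]) for i in range(d)]  # uses s before adding g
        for i in range(d):
            m[i] = mu * m[i] + eta[i] * g[i]
            x[i] = x[i] - m[i]
            s[i] += g[i] ** 2  # update the sum only after the step
    return x

# Minimize f(x) = 0.5*||x||^2 (gradient is x itself), noiseless for simplicity.
x_final = delayed_adagrad_momentum(lambda x: x, [2.0, -3.0], T=200)
assert all(abs(xi) < 0.05 for xi in x_final)
```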
for SGD with Polyak's momentum with step size O(1/√t) and for Delayed AdaGrad with momentum. In particular, to the best of our knowledge, this is the first high probability convergence guarantee for adaptive methods.

In the future, we plan to extend our results to more adaptive methods, such as Adam and AMSGrad (Tieleman & Hinton, 2012). Moreover, we will explore other forms of momentum, such as the exponential moving average and Nesterov's momentum (Nesterov, 1983).

Acknowledgements

This material is based upon work supported by the National Science Foundation under grant no. 1925930 "Collaborative Research: TRIPODS Institute for Optimization and Learning".

References

Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Schapire, R. Contextual bandit algorithms with supervised learning guarantees. In Proc. of the International Conference on Artificial Intelligence and Statistics, pp. 19–26, 2011.

Chen, X., Liu, S., Sun, R., and Hong, M. On the convergence of a class of Adam-type algorithms for non-convex optimization. In 7th International Conference on Learning Representations, ICLR 2019, 2019.

Duchi, J. C., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

Gadat, S., Panloup, F., and Saadane, S. Stochastic heavy ball. Electronic Journal of Statistics, 12(1):461–529, 2018.

Harvey, N. J., Liaw, C., Plan, Y., and Randhawa, S. Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory, pp. 1579–1613, 2019a.

Harvey, N. J., Liaw, C., and Randhawa, S. Simple and optimal high-probability bounds for strongly-convex stochastic gradient descent. arXiv preprint arXiv:1909.00843, 2019b.

Jain, P., Nagaraj, D., and Netrapalli, P. Making the last iterate of SGD information theoretically optimal. In Beygelzimer, A. and Hsu, D. (eds.), Proc. of the Conference on Learning Theory (COLT), volume 99 of Proceedings of Machine Learning Research, pp. 1752–1755, Phoenix, USA, 25–28 Jun 2019. PMLR. URL http://proceedings.mlr.press/v99/jain19a.html.

Jin, C., Netrapalli, P., Ge, R., Kakade, S. M., and Jordan, M. I. A short note on concentration inequalities for random vectors with subgaussian norm. arXiv preprint arXiv:1902.03736, 2019.

Kakade, S. M. and Tewari, A. On the generalization ability of online strongly convex programming algorithms. In Advances in Neural Information Processing Systems, pp. 801–808, 2009.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

Lan, G., Nemirovski, A., and Shapiro, A. Validation analysis of mirror descent stochastic approximation method. Mathematical Programming, 134(2):425–458, 2012.

Li, X. and Orabona, F. On the convergence of stochastic gradient descent with adaptive stepsizes. In Proc. of the 22nd International Conference on Artificial Intelligence and Statistics, AISTATS, 2019. URL https://arxiv.org/abs/1805.08114.

Loizou, N. and Richtárik, P. Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. arXiv preprint arXiv:1712.09677, 2017.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

Nesterov, Y. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). In Doklady AN SSSR (translated as Soviet. Math. Docl.), volume 269, pp. 543–547, 1983.

Nesterov, Y. Introductory lectures on convex optimization: A basic course, volume 87. Springer, 2004.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.

Polyak, B. T. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

Qian, N. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, 1999.

Reddi, S. J., Kale, S., and Kumar, S. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018. URL https://openreview.net/pdf?id=ryQu7f-RZ.

Robbins, H. and Monro, S. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951. URL https://projecteuclid.org/euclid.aoms/1177729586.

Savarese, P., McAllester, D., Babu, S., and Maire, M. Domain-independent dominance of adaptive methods. arXiv preprint arXiv:1912.01823, 2019.

Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pp. 1139–1147, 2013.

Tieleman, T. and Hinton, G. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

Ward, R., Wu, X., and Bottou, L. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization. In International Conference on Machine Learning, pp. 6677–6686, 2019.

Yang, T., Lin, Q., and Li, Z. Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization. arXiv preprint arXiv:1604.03257, 2016.

Zhou, D., Tang, Y., Yang, Z., Cao, Y., and Gu, Q. On the convergence of adaptive gradient methods for nonconvex optimization, 2018. arXiv:1808.05671.

Zou, F., Shen, L., Jie, Z., Sun, J., and Liu, W. Weighted AdaGrad with unified momentum. arXiv preprint arXiv:1808.03408, 2018.

Zou, F., Shen, L., Jie, Z., Zhang, W., and Liu, W. A sufficient condition for convergences of Adam and RMSProp. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11127–11135, 2019.
A. Appendix
A.1. Details of Section 4
Proof of Lemma 1. Set Z̃_t = Z_t/σ_t. By the assumptions on Z_t and σ_t, we have

E_t[Z̃_t] = (1/σ_t) E_t[Z_t] = 0 and E_t[exp(Z̃_t²)] ≤ exp(1).

By Jensen's inequality, it follows that for any c ∈ [0, 1],

E_t[exp(c Z̃_t²)] = E_t[(exp(Z̃_t²))^c] ≤ (E_t[exp(Z̃_t²)])^c ≤ exp(c). (3)

Also, it can be verified that exp(x) ≤ x + exp(9x²/16) for all x, hence for |κ| ∈ [0, 4/3] we get

E_t[exp(κ Z̃_t)] ≤ E_t[exp(9κ² Z̃_t²/16)] ≤ exp(9κ²/16) ≤ exp(3κ²/4), (4)

where in the second inequality we used (3). Besides, κx ≤ 3κ²/8 + 2x²/3 holds for any κ and x. Hence, for |κ| ≥ 4/3, we get

E_t[exp(κ Z̃_t)] ≤ exp(3κ²/8) E_t[exp(2 Z̃_t²/3)] ≤ exp(3κ²/8 + 2/3) ≤ exp(3κ²/4), (5)

where in the second inequality we used (3). Combining (4) and (5), we get, for all κ,

E_t[exp(κ Z̃_t)] ≤ exp(3κ²/4). (6)

Defining Y_T = exp(λ Σ_{t=1}^T Z_t − (3/4)λ² Σ_{t=1}^T σ_t²), inequality (6) and the tower rule imply E[Y_T] ≤ 1, so by Markov's inequality

P(Y_T ≥ 1/δ) = P( λ Σ_{t=1}^T Z_t − (3/4) λ² Σ_{t=1}^T σ_t² ≥ ln(1/δ) ) = P( Σ_{t=1}^T Z_t ≥ (3λ/4) Σ_{t=1}^T σ_t² + (1/λ) ln(1/δ) ) ≤ δ.
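The two elementary scalar inequalities invoked in this proof can be checked numerically on a grid (the grid and the floating-point slack are our choices):

```python
# Check exp(x) <= x + exp(9x^2/16) for all x (equality at x = 0), and the
# AM-GM-type bound k*x <= 3k^2/8 + 2x^2/3, on a grid of values in [-5, 5].
import math

xs = [i / 10 for i in range(-50, 51)]
assert all(math.exp(x) <= x + math.exp(9 * x * x / 16) + 1e-12 for x in xs)
assert all(k * x <= 3 * k * k / 8 + 2 * x * x / 3 + 1e-12 for k in xs for x in xs)
```

The second bound is AM-GM with weights 3/8 and 2/3, tight at x = 3k/4; the first is tight at x = 0.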
Proof. We prove these equalities by induction. When T = 1, they obviously hold. Now, for k < T, assume that Σ_{t=1}^k a_t Σ_{i=1}^t b_i = Σ_{t=1}^k b_t Σ_{i=t}^k a_i. Then, we have

Σ_{t=1}^{k+1} a_t Σ_{i=1}^t b_i = Σ_{t=1}^k a_t Σ_{i=1}^t b_i + a_{k+1} Σ_{i=1}^{k+1} b_i = Σ_{t=1}^k b_t Σ_{i=t}^k a_i + a_{k+1} Σ_{i=1}^k b_i + a_{k+1} b_{k+1}
= Σ_{t=1}^k b_t Σ_{i=t}^{k+1} a_i + a_{k+1} b_{k+1} = Σ_{t=1}^{k+1} b_t Σ_{i=t}^{k+1} a_i.
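The summation-exchange identity proved above is easy to verify numerically on arbitrary sequences (the values below are illustrative):

```python
# Check: sum_{t=1}^T a_t * sum_{i=1}^t b_i == sum_{t=1}^T b_t * sum_{i=t}^T a_i.
a = [1.5, -2.0, 3.25, 0.5, 4.0]
b = [2.0, 1.0, -0.5, 3.0, -1.25]
T = len(a)
lhs = sum(a[t] * sum(b[: t + 1]) for t in range(T))
rhs = sum(b[t] * sum(a[t:]) for t in range(T))
assert abs(lhs - rhs) < 1e-12
```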
By Lemma 4, we have

Σ_{t=1}^T Σ_{i=1}^{t−1} M µ^{t−i} ‖m_i‖² ≤ (M/(1 − µ)) Σ_{t=1}^T ‖m_t‖²,

where in the first equality we used m_0 = 0. Reordering the terms, we have that

Σ_{t=1}^T ‖m_t‖² ≤ (1/(1 − µ)²) Σ_{t=1}^T ‖η_t g_t‖².

Combining things together, and taking λ = 2(1 − µ)²/(3‖η_1‖(1 − µ^T)²σ²), with probability at least 1 − δ, we have

f⋆ − f(x_1) ≤ (1/λ) ln(1/δ) + Σ_{t=1}^T [ (M/2 + M/(1 − µ)) ‖η_t g_t‖² − (1 − (3λ‖η_1‖(1 − µ^T)²σ²)/(4(1 − µ)²)) ⟨η_t, ∇f(x_t)²⟩ ]
= (3‖η_1‖(1 − µ^T)²σ²)/(2(1 − µ)²) ln(1/δ) + Σ_{t=1}^T [ ((3 − µ)M)/(2(1 − µ)) ‖η_t g_t‖² − (1/2) ⟨η_t, ∇f(x_t)²⟩ ].

max_{1≤t≤T} ‖g_t − ∇f(x_t)‖² ≤ σ² ln(Te/δ).
Proof of Theorem 1. With the fact that ‖a + b‖² ≤ 2‖a‖² + 2‖b‖², we have

Σ_{t=1}^T ‖η_t g_t‖² = Σ_{t=1}^T η_t² ‖g_t‖² ≤ Σ_{t=1}^T 2η_t² ‖∇f(x_t)‖² + Σ_{t=1}^T 2η_t² ‖g_t − ∇f(x_t)‖²
≤ Σ_{t=1}^T 2η_t² ‖∇f(x_t)‖² + max_{1≤t≤T} ‖g_t − ∇f(x_t)‖² Σ_{t=1}^T 2η_t².

By Lemma 5, Lemma 2 and the union bound, we have that with probability at least 1 − δ,

(η_T/2) Σ_{t=1}^T ‖∇f(x_t)‖² ≤ (1 − (2M(3 − µ)/(1 − µ)) η_1) Σ_{t=1}^T η_t ‖∇f(x_t)‖²
≤ 2(f(x_1) − f⋆) + (2(3 − µ)c² M σ² ln(2Te/δ) ln T)/(1 − µ) + (3c(1 − µ^T)² σ²)/((1 − µ)²) ln(1/δ).

Rearranging the terms and lower bounding Σ_{t=1}^T ‖∇f(x_t)‖² by T · min_{1≤t≤T} ‖∇f(x_t)‖², we have the stated bound.
Σ_{t=1}^T a_t f(a_0 + Σ_{i=1}^t a_i) ≤ ∫_{a_0}^{Σ_{t=0}^T a_t} f(x) dx.

Proof. Denote s_t = Σ_{i=0}^t a_i. Then, we have

a_i f(s_i) = ∫_{s_{i−1}}^{s_i} f(s_i) dx ≤ ∫_{s_{i−1}}^{s_i} f(x) dx.
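This integral-comparison bound (which holds for nonincreasing f and nonnegative a_t, the setting in which it is used here) can be checked numerically in the special case f(x) = 1/x, which is how it is applied to the AdaGrad step sizes below; the sequence is an arbitrary illustrative choice:

```python
# Check: sum_t a_t / (a_0 + sum_{i<=t} a_i) <= integral_{a_0}^{a_0+sum a} dx/x
#                                            = log((a_0 + sum_t a_t) / a_0).
import math

a0 = 1.0
a = [0.3, 2.0, 0.7, 1.1, 0.05, 4.0]
s = a0
lhs = 0.0
for at in a:
    s += at          # running sum a_0 + a_1 + ... + a_t
    lhs += at / s    # a_t * f(s_t) with f(x) = 1/x
rhs = math.log(s / a0)
assert lhs <= rhs + 1e-12
```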
Proof of Lemma 3. First, we separate Σ_{t=1}^T ‖η_t g_t‖² into two terms:

Σ_{t=1}^T ‖η_t g_t‖² = Σ_{t=1}^T ‖η_{t+1} g_t‖² + Σ_{t=1}^T ⟨η_t² − η_{t+1}², g_t²⟩.
Then, we proceed

Σ_{t=1}^T ⟨η_t² − η_{t+1}², g_t²⟩ = Σ_{i=1}^d Σ_{t=1}^T (η_{t,i}² − η_{t+1,i}²) g_{t,i}²
≤ Σ_{i=1}^d Σ_{t=1}^T 2η_{t,i} g_{t,i}² (η_{t,i} − η_{t+1,i})
≤ 2 Σ_{i=1}^d max_{1≤t≤T} η_{t,i} g_{t,i}² Σ_{t=1}^T (η_{t,i} − η_{t+1,i})
≤ 2 Σ_{i=1}^d η_{1,i} max_{1≤t≤T} η_{t,i} g_{t,i}²
≤ 4 Σ_{i=1}^d η_{1,i} max_{1≤t≤T} η_{t,i} (g_{t,i} − ∇f(x_t)_i)² + 4 Σ_{i=1}^d η_{1,i} Σ_{t=1}^T η_{t,i} ∇f(x_t)_i²
≤ 4 Σ_{i=1}^d η_{1,i}² max_{1≤t≤T} (g_{t,i} − ∇f(x_t)_i)² + 4 Σ_{i=1}^d η_{1,i} Σ_{t=1}^T η_{t,i} ∇f(x_t)_i²
≤ (4dα²/β) max_{1≤t≤T} ‖g_t − ∇f(x_t)‖² + (4α/√β) Σ_{t=1}^T ⟨η_t, ∇f(x_t)²⟩. (8)

Using Lemma 5 on (8), for δ ∈ (0, 1), with probability at least 1 − δ/2, we have

Σ_{t=1}^T ⟨η_t² − η_{t+1}², g_t²⟩ ≤ (4dα²σ²/β) ln(2Te/δ) + (4α/√β) Σ_{t=1}^T ⟨η_t, ∇f(x_t)²⟩.
We now upper bound Σ_{t=1}^T ‖η_{t+1} g_t‖²:

Σ_{t=1}^T ‖η_{t+1} g_t‖² = Σ_{i=1}^d Σ_{t=1}^T α² g_{t,i}² / (β + Σ_{j=1}^t g_{j,i}²)
≤ Σ_{i=1}^d α² ln(β + Σ_{t=1}^T g_{t,i}²)
≤ α² d ln(β + (1/d) Σ_{i=1}^d Σ_{t=1}^T g_{t,i}²)
= 2α² d ln √(β + (1/d) Σ_{t=1}^T ‖g_t‖²)
≤ 2α² d ln( √(β + (2T/d) max_{1≤t≤T} ‖g_t − ∇f(x_t)‖²) + √((2/d) Σ_{t=1}^T ‖∇f(x_t)‖²) ), (9)

where in the first inequality we used Lemma 6 and in the second inequality we used Jensen's inequality. Then, using Lemma 5 on (9), with probability at least 1 − δ/2, we have

Σ_{t=1}^T ‖η_{t+1} g_t‖² ≤ 2α² d ln( √(β + (2Tσ²/d) ln(2Te/δ)) + √((2/d) Σ_{t=1}^T ‖∇f(x_t)‖²) ).
Proof of Theorem 2. By Lemma 2 and Lemma 3, for δ ∈ (0, 1), with probability at least 1 − (3/2)δ, we have

(1 − (4αM(3 − µ))/(√β(1 − µ))) Σ_{t=1}^T ⟨η_t, ∇f(x_t)²⟩
≤ 2(f(x_1) − f⋆) + (M(3 − µ)/(1 − µ)) K + (4dα²σ²/β) ln(3Te/δ) + (3‖η_1‖σ²(1 − µ^T)²/(1 − µ)²) ln(3/δ),

where, for conciseness, K denotes 2α² d ln( √(β + (2Tσ²/d) ln(2Te/δ)) + √((2/d) Σ_{t=1}^T ‖∇f(x_t)‖²) ). Hence,

Σ_{t=1}^T ⟨η_t, ∇f(x_t)²⟩
≤ (1/(1 − (4αM(3 − µ))/(√β(1 − µ)))) [ 2(f(x_1) − f⋆) + (M(3 − µ)/(1 − µ)) K + (4dα²σ²/β) ln(3Te/δ) + (3‖η_1‖σ²(1 − µ^T)²/(1 − µ)²) ln(3/δ) ]
≤ 4(f(x_1) − f⋆) + (2M(3 − µ)/(1 − µ)) K + (8dα²σ²/β) ln(3Te/δ) + (6‖η_1‖σ²(1 − µ^T)²/(1 − µ)²) ln(3/δ) =: C(T),

where in the second inequality we used 4α ≤ √β(1 − µ)/(2M(3 − µ)). Also, we have

Σ_{t=1}^T ⟨η_t, ∇f(x_t)²⟩ ≥ Σ_{t=1}^T ⟨η_T, ∇f(x_t)²⟩
= Σ_{i=1}^d α Σ_{t=1}^T ∇f(x_t)_i² / √(β + Σ_{t=1}^T g_{t,i}²)
≥ Σ_{i=1}^d α Σ_{t=1}^T ∇f(x_t)_i² / √(β + 2 Σ_{t=1}^T ∇f(x_t)_i² + 2 Σ_{t=1}^T (g_{t,i} − ∇f(x_t)_i)²)
≥ Σ_{i=1}^d α Σ_{t=1}^T ∇f(x_t)_i² / √(β + 2 Σ_{i=1}^d Σ_{t=1}^T ∇f(x_t)_i² + 2 Σ_{i=1}^d Σ_{t=1}^T (g_{t,i} − ∇f(x_t)_i)²)
≥ α Σ_{t=1}^T ‖∇f(x_t)‖² / √(β + 2 Σ_{t=1}^T ‖∇f(x_t)‖² + 2T max_{1≤t≤T} ‖g_t − ∇f(x_t)‖²).
We use this upper bound in the logarithmic term of (11). Thus, we have (10) again, this time with

C(T) = C + D ln(2A + 32B⁴D² + 2B²C + 8B³D√C) = O( 1/α + (d(α + σ²)(α ln(T/δ) + ln(1/δ)/(1 − µ)))/(1 − µ) ).

Solving (11) by Lemma 8 and lower bounding Σ_{t=1}^T ‖∇f(x_t)‖² by T min_{1≤t≤T} ‖∇f(x_t)‖², we get the stated bound.