
Adaptivity and Optimality: A Universal Algorithm for Online Convex Optimization

Guanghui Wang, Shiyin Lu, Lijun Zhang


National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
{wanggh,lusy,zhanglj}@lamda.nju.edu.cn

Abstract

In this paper, we study adaptive online convex optimization, and aim to design a universal algorithm that achieves optimal regret bounds for multiple common types of loss functions. Existing universal methods are limited in the sense that they are optimal for only a subclass of loss functions. To address this limitation, we propose a novel online algorithm, named Maler, which enjoys the optimal O(√T), O(d log T), and O(log T) regret bounds for general convex, exponentially concave, and strongly convex functions respectively. The essential idea is to run multiple types of learning algorithms with different learning rates in parallel, and to utilize a meta-algorithm to track the best one on the fly. Empirical results demonstrate the effectiveness of our method.

1 INTRODUCTION

Online convex optimization (OCO) is a well-established paradigm for modeling sequential decision making (Shalev-Shwartz et al., 2012). The protocol of OCO is as follows: in each round t, a learner first chooses an action x_t from a convex set D ⊆ R^d; at the same time, an adversary reveals a loss function f_t : D → R, and consequently the learner suffers a loss f_t(x_t). The goal is to minimize the regret, defined as the difference between the cumulative loss of the learner and that of the best fixed action in hindsight (Hazan et al., 2016):

    R(T) = Σ_{t=1}^T f_t(x_t) − min_{x∈D} Σ_{t=1}^T f_t(x).    (1)

There exist plenty of algorithms for OCO, based on different assumptions on the loss functions. Without any assumptions beyond convexity and Lipschitz continuity, the classic online gradient descent (OGD) with step sizes on the order of O(1/√t) (referred to as convex OGD) guarantees an O(√T) regret bound (Zinkevich, 2003), where T is the time horizon. While this bound has been proved minimax optimal for arbitrary convex functions (Abernethy et al., 2008), tighter bounds are still achievable when the loss functions are known in advance to fall into some easier category. Specifically, for strongly convex functions, OGD with step sizes proportional to O(1/t) (referred to as strongly convex OGD) achieves an O(log T) regret bound (Hazan et al., 2007); for exponentially concave functions, the state-of-the-art algorithm is online Newton step (ONS) (Hazan et al., 2007), which enjoys an O(d log T) regret bound, where d is the dimensionality. This divides OCO into subclasses, relying on users to decide which algorithm to use for their specific setting. Such a requirement is not only a burden to users, but also hinders applications in broad domains where the type of the loss functions is unknown and choosing the right algorithm beforehand is impossible. These issues motivate the development of adaptive algorithms, which aim to guarantee optimal regret bounds for arbitrary convex functions, and to automatically exploit easier functions whenever possible. The seminal work of Hazan et al. (2008) proposes adaptive online gradient descent (AOGD), which attains O(√T) and O(log T) regret bounds for convex and strongly convex functions respectively. However, AOGD requires the curvature information of f_t as input in each round, and fails to provide a logarithmic regret bound for exponentially concave functions. Another milestone is MetaGrad (van Erven and Koolen, 2016), which only requires the gradient information, and achieves O(√(T log log T)) and O(d log T) regret bounds for convex and exponentially concave functions respectively. Although it also implies an O(d log T) regret for strongly convex functions, there still exists a large O(d) gap from the Ω(log T) lower bound (Abernethy et al., 2008).
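As a concrete illustration of the two step-size schedules discussed above (our sketch, not code from the paper), the following minimal Python implementation runs projected OGD over a Euclidean ball; only the step-size function distinguishes the convex and strongly convex variants:

```python
import numpy as np

def project_ball(x, radius):
    # Euclidean projection onto the ball {x : ||x|| <= radius}.
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def ogd(grad, T, d, radius, step):
    # Generic projected online gradient descent; step(t) returns eta_t.
    x = np.zeros(d)
    iterates = []
    for t in range(1, T + 1):
        iterates.append(x.copy())
        x = project_ball(x - step(t) * grad(x, t), radius)
    return iterates

# Convex OGD uses eta_t ~ O(1/sqrt(t)); strongly convex OGD uses eta_t ~ O(1/t).
convex_step = lambda t: 1.0 / np.sqrt(t)
strongly_convex_step = lambda t: 1.0 / t
```

The constants in the step sizes are placeholders; in theory they are tuned with the Lipschitz and strong-convexity parameters.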
Along this line of research, it is therefore natural to ask whether both adaptivity and optimality can be attained simultaneously, or whether there is an inevitable price in regret to be paid for adaptivity; this was also posed as an open question by van Erven and Koolen (2016). In this paper, we give an affirmative answer by developing a novel online method, named Maler, which achieves the optimal regret bounds for all three aforementioned types of loss functions. Inspired by MetaGrad, our method runs multiple expert algorithms in parallel, each with a different learning rate, and combines them with a meta-algorithm that learns the empirically best expert for the OCO problem at hand. However, different from MetaGrad, where all experts are the same type of OCO algorithm (i.e., a variant of ONS), the experts in Maler consist of various types of OCO algorithms (i.e., convex OGD, ONS, and strongly convex OGD). Essentially, the goal of MetaGrad is to learn only the optimal learning rate; in contrast, Maler searches for the best OCO algorithm and the optimal learning rate simultaneously. Theoretical analysis shows that, with O(log T) experts, which is of the same order as that of MetaGrad, Maler achieves O(√T), O(d log T), and O(log T) regret bounds for convex, exponentially concave, and strongly convex functions respectively. Empirical results on both synthetic and real-world datasets demonstrate the advantages of our method.

Notation. Throughout the paper, we use lower-case bold-face letters to denote vectors, lower-case letters to denote scalars, and upper-case letters to denote matrices. We use ‖·‖ to denote the ℓ2-norm. For a positive definite matrix H ∈ R^{d×d}, the weighted ℓ2-norm is denoted by ‖x‖_H² = x^⊤ H x. The H-weighted projection Π_D^H(x) of x onto D is defined as Π_D^H(x) = argmin_{y∈D} ‖y − x‖_H². We denote the gradient of f_t at x_t by g_t, and the best action in hindsight by x* = argmin_{x∈D} Σ_{t=1}^T f_t(x).

2 RELATED WORK

In the literature, there exist various algorithms for OCO, each targeting a specific type of loss functions. For general convex and strongly convex loss functions, the classic OGD with step sizes on the order of O(1/√t) and O(1/t) achieves O(√T) and O(log T) regret bounds, respectively (Zinkevich, 2003; Hazan et al., 2007). For exponentially concave functions, online Newton step (ONS) attains a regret bound of O(d log T) (Hazan et al., 2007). The above bounds are known to be minimax optimal, as matching lower bounds have been established (Abernethy et al., 2008).

To simultaneously deal with multiple types of loss functions, Hazan et al. (2008) propose adaptive online gradient descent (AOGD), which is later extended to proximal settings by Do et al. (2009). Both algorithms achieve O(√T) and O(log T) regret bounds for convex and strongly convex loss functions respectively. Moreover, they have shown superiority over non-adaptive methods in experiments (Do et al., 2009). However, in each round t these algorithms have to be fed with a parameter that depends on the curvature information of f_t(·) at x_t, and they cannot achieve a logarithmic regret bound in exponentially concave cases. To address these limitations, van Erven and Koolen (2016) propose the multiple eta gradient algorithm (MetaGrad), whose basic idea is to run a collection of variants of ONS with different learning rates simultaneously, and to employ a meta-algorithm that adaptively learns the best one based on their empirical performances. They show that the regret of MetaGrad for arbitrary convex functions can be simultaneously bounded by a worst-case bound of O(√(T log log T)) and a data-dependent bound of O(√(V_T^ℓ d log T) + d log T), where V_T^ℓ = Σ_{t=1}^T ((x* − x_t)^⊤ g_t)². In particular, for strongly convex and exponentially concave functions, the data-dependent bound reduces to O(d log T).

The above works, as well as this paper, focus on adapting to different types of loss functions. A related but parallel direction is adapting to structures in the data, such as low rank and sparsity. This line of research includes Adagrad (Duchi et al., 2011), RMSprop (Tieleman and Hinton, 2012), and Adam (Reddi et al., 2018), to name a few. The main idea here is to utilize the gradients observed over time to dynamically adjust the learning rate or the update direction of gradient descent, and the corresponding regret bounds depend on the accumulation of the gradients. For general convex functions, these bounds attain O(√T) in the worst case, and become tighter when the gradients are sparse.

Another different direction considers adapting to changing environments, where more stringent criteria are established to measure the performance of algorithms, such as the dynamic regret (Zinkevich, 2003; Hall and Willett, 2013; Zhang et al., 2017, 2018a), which compares the cumulative loss of the learner against that of any sequence of comparators, and the adaptive regret (Hazan and Seshadhri, 2007; Daniely et al., 2015; Jun et al., 2017; Wang et al., 2018; Zhang et al., 2018b, 2019), which is defined as the maximum regret over any contiguous time interval. In this paper, we mainly focus on the minimization of regret, and it is an interesting question to explore whether our method can be extended to adaptive and dynamic regrets.

3 MALER

In this section, we first state the assumptions made in this paper, then provide our motivation, and finally present the proposed algorithm as well as its theoretical guarantees.
q
3.1 ASSUMPTIONS AND DEFINITIONS

Following previous studies, we introduce some standard assumptions (van Erven and Koolen, 2016) and definitions (Boyd and Vandenberghe, 2004).

Assumption 1. The gradients of all loss functions are bounded by G, i.e., ∀t > 0, max_{x∈D} ‖∇f_t(x)‖ ≤ G.

Assumption 2. The diameter of the action set is bounded by D, i.e., max_{x1,x2∈D} ‖x1 − x2‖ ≤ D.

Definition 1. A function f : D → R is convex if

    f(x1) ≥ f(x2) + ∇f(x2)^⊤(x1 − x2), ∀x1, x2 ∈ D.    (2)

Definition 2. A function f : D → R is λ-strongly convex if ∀x1, x2 ∈ D,

    f(x1) ≥ f(x2) + ∇f(x2)^⊤(x1 − x2) + (λ/2)‖x1 − x2‖².

Definition 3. A function f : D → R is α-exponentially concave (abbreviated to α-exp-concave) if exp(−αf(x)) is concave.

3.2 MOTIVATION

Our algorithm is inspired by MetaGrad. To aid understanding, we first give a brief introduction to the intuition behind that algorithm. Specifically, MetaGrad introduces the following surrogate loss function, parameterized by η ∈ (0, 1/(5DG)]:

    ℓ_t^η(x) = −η(x_t − x)^⊤ g_t + η²(x − x_t)^⊤ g_t g_t^⊤ (x − x_t).    (3)

The first advantage of the above definition is that ℓ_t^η is 1-exp-concave. Therefore, we can apply ONS to ℓ_t^η and obtain the following regret bound with respect to ℓ_t^η:

    Σ_{t=1}^T ℓ_t^η(x_t) − min_{x∈D} Σ_{t=1}^T ℓ_t^η(x) ≤ O(d log T).    (4)

The second advantage is that the regret with respect to the original loss function f_t can be upper bounded in terms of the regret with respect to the surrogate loss ℓ_t^η:

    R(T) ≤ [Σ_{t=1}^T ℓ_t^η(x_t) − min_{x∈D} Σ_{t=1}^T ℓ_t^η(x)] / η + η V_T^ℓ    (5)

where V_T^ℓ = Σ_{t=1}^T ((x_t − x*)^⊤ g_t)². Both advantages jointly (i.e., combining (4) and (5)) lead to a regret bound of O((d log T)/η + η V_T^ℓ). Therefore, had we known the value of V_T^ℓ in advance, we could set η to min{Θ(√(d log T / V_T^ℓ)), 1/(5DG)} and obtain a regret bound of order O(√(d V_T^ℓ log T) + d log T). However, this is impossible since V_T^ℓ depends on the whole learning process. To sidestep this obstacle, MetaGrad maintains multiple instances of ONS in parallel, each of which targets minimizing the regret with respect to the surrogate loss ℓ_t^η for a different η, and employs a meta-algorithm to track the instance with the best η. Theoretical analysis shows that MetaGrad achieves the desired O(√(d V_T^ℓ log T) + d log T) bound.

While the O(√(d V_T^ℓ log T) + d log T) regret bound of MetaGrad reduces to O(d log T) for exp-concave functions, it cannot recover the O(log T) regret bound for strongly convex functions. To address this limitation, we design a new surrogate loss function:

    s_t^η(x) = −η(x_t − x)^⊤ g_t + η² G² ‖x_t − x‖²    (6)

where η ∈ (0, 1/(5DG)]. The main advantage of s_t^η over ℓ_t^η is its strong convexity, which allows us to adopt a strongly convex OGD that takes s_t^η as the objective loss function and attains an O(log T) regret with respect to s_t^η. On the other hand, the "upper-bound" property in (5) is preserved, in the sense that the regret with respect to the original loss f_t can be upper bounded by

    R(T) ≤ [Σ_{t=1}^T s_t^η(x_t) − min_{x∈D} Σ_{t=1}^T s_t^η(x)] / η + η V_T^s

where V_T^s = Σ_{t=1}^T G² ‖x_t − x*‖². Thus, the employed strongly convex OGD enjoys a novel data-dependent O((log T)/η + η V_T^s) regret with respect to f_t, removing the undesirable factor of d. To optimize this bound to O(√(V_T^s log T) + log T), we follow the idea of MetaGrad and run many instances of strongly convex OGD in parallel.

Finally, to obtain the optimal O(√T) regret bound for general convex functions, we also introduce a linear surrogate loss function

    c_t(x) = −η^c (x_t − x)^⊤ g_t + (η^c G D)²    (7)

where η^c = 1/(2GD√T), which only depends on known quantities. It can be proved that if we run a convex OGD with c_t(·) as its input, its regret with respect to the original loss function f_t(·) can be bounded by O(√T).

While the idea of incorporating new types of surrogate loss functions to enhance adaptivity is easy to comprehend, the specific definitions of the two proposed surrogate loss functions in (6) and (7) are more involved. In fact, the proposed functions are carefully designed such that, besides the aforementioned properties, they also satisfy

    exp(−s_t^η(x)) ≤ exp(−ℓ_t^η(x)) ≤ 1 + η(x_t − x)^⊤ g_t

and

    exp(−c_t(x)) ≤ 1 + η^c (x_t − x)^⊤ g_t,

which are critical to keep the regret caused by the meta-algorithm under control, and will be made clear in Section 4.1.
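For concreteness, the three surrogate losses (3), (6), and (7) can be written down directly; below is a minimal NumPy sketch (the function and variable names are ours, not the paper's). Note that (3) simplifies because η²(x − x_t)^⊤ g_t g_t^⊤ (x − x_t) = (η(x_t − x)^⊤ g_t)²:

```python
import numpy as np

def ell(x, x_t, g_t, eta):
    # Exp-concave surrogate (3): -eta*(x_t - x)'g_t + (eta*(x_t - x)'g_t)^2
    inner = (x_t - x) @ g_t
    return -eta * inner + (eta * inner) ** 2

def s(x, x_t, g_t, eta, G):
    # Strongly convex surrogate (6): -eta*(x_t - x)'g_t + eta^2*G^2*||x_t - x||^2
    return -eta * (x_t - x) @ g_t + eta**2 * G**2 * np.sum((x_t - x) ** 2)

def c(x, x_t, g_t, eta_c, G, D):
    # Linear surrogate (7): -eta_c*(x_t - x)'g_t + (eta_c*G*D)^2
    return -eta_c * (x_t - x) @ g_t + (eta_c * G * D) ** 2
```

Under Assumption 1 (‖g_t‖ ≤ G), the Cauchy–Schwarz inequality gives s_t^η(x) ≥ ℓ_t^η(x) pointwise, which is exactly the first inequality displayed above.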
Algorithm 1 Meta-algorithm
1: Input: learning rates η^c, η_1, η_2, . . .; prior weights π_1^c, π_1^{η_1,s}, π_1^{η_2,s}, . . . and π_1^{η_1,ℓ}, π_1^{η_2,ℓ}, . . .
2: for t = 1, . . . , T do
3:   Get the prediction x_t^c from Algorithm 2, and x_t^{η,ℓ}, x_t^{η,s} from Algorithms 3 and 4 for all η
4:   Play
         x_t = [ π_t^c η^c x_t^c + Σ_η ( π_t^{η,s} η x_t^{η,s} + π_t^{η,ℓ} η x_t^{η,ℓ} ) ] / [ π_t^c η^c + Σ_η ( π_t^{η,s} η + π_t^{η,ℓ} η ) ]
5:   Observe the gradient g_t and send it to all experts
6:   Update the weights:
         π_{t+1}^c = π_t^c e^{−c_t(x_t^c)} / Φ_t
         π_{t+1}^{η,s} = π_t^{η,s} e^{−s_t^η(x_t^{η,s})} / Φ_t for all η
         π_{t+1}^{η,ℓ} = π_t^{η,ℓ} e^{−ℓ_t^η(x_t^{η,ℓ})} / Φ_t for all η
     where
         Φ_t = Σ_η ( π_t^{η,s} e^{−s_t^η(x_t^{η,s})} + π_t^{η,ℓ} e^{−ℓ_t^η(x_t^{η,ℓ})} ) + π_t^c e^{−c_t(x_t^c)}
7: end for

Algorithm 2 Convex expert algorithm
1: x_1^c = 0
2: for t = 1, . . . , T do
3:   Send x_t^c to Algorithm 1
4:   Receive the gradient g_t from Algorithm 1
5:   Update
         x_{t+1}^c = Π_D^{I_d}( x_t^c − (D / (η^c G √t)) ∇c_t(x_t^c) )
     where ∇c_t(x_t^c) = η^c g_t
6: end for

Algorithm 3 Exp-concave expert algorithm
1: Input: learning rate η
2: x_1^{η,ℓ} = 0; β = (1/2) min{ 1/(4 G^ℓ D), 1 }, where G^ℓ = 7/(25D); Σ_1 = (1/(β² D²)) I_d
3: for t = 1, . . . , T do
4:   Send x_t^{η,ℓ} to Algorithm 1
5:   Receive the gradient g_t from Algorithm 1
6:   Update
         Σ_{t+1} = Σ_t + ∇ℓ_t^η(x_t^{η,ℓ}) ∇ℓ_t^η(x_t^{η,ℓ})^⊤
         x_{t+1}^{η,ℓ} = Π_D^{Σ_{t+1}}( x_t^{η,ℓ} − (1/β) Σ_{t+1}^{−1} ∇ℓ_t^η(x_t^{η,ℓ}) )
     where ∇ℓ_t^η(x_t^{η,ℓ}) = η g_t + 2η² g_t g_t^⊤ (x_t^{η,ℓ} − x_t)
7: end for

Algorithm 4 Strongly convex expert algorithm
1: Input: learning rate η
2: x_1^{η,s} = 0
3: for t = 1, . . . , T do
4:   Send x_t^{η,s} to Algorithm 1
5:   Receive the gradient g_t from Algorithm 1
6:   Update
         x_{t+1}^{η,s} = Π_D^{I_d}( x_t^{η,s} − (1/(2η² G² t)) ∇s_t^η(x_t^{η,s}) )
     where ∇s_t^η(x_t^{η,s}) = η g_t + 2η² G² (x_t^{η,s} − x_t)
7: end for

3.3 THE ALGORITHM

Our method, named multiple sub-algorithms and learning rates (Maler), has a two-level hierarchical structure: at the lower level, a set of experts run in parallel, each of which is configured with a different learning algorithm (Algorithm 2, 3, or 4) and learning rate. At the higher level, a meta-algorithm (Algorithm 1) is employed to track the best expert based on the empirical performances of the experts.

Meta-Algorithm. Tracking the best expert is a well-studied problem, and our meta-algorithm is built upon the tilted exponentially weighted average (van Erven and Koolen, 2016). The inputs of the meta-algorithm are the learning rates and prior weights of the experts. In each round t, the meta-algorithm first receives the actions of all experts (Step 3), and then combines these actions using a tilted exponentially weighted average (Step 4). The weight of each expert is tilted by its own η, so that experts with larger learning rates are assigned larger weights. After observing the gradient at x_t (Step 5), the meta-algorithm updates the weight of each expert via an exponential weighting scheme (Step 6).

Experts. Experts are themselves non-adaptive algorithms, such as OGD and ONS. In each round t, each expert sends its action to the meta-algorithm, then receives a gradient vector from the meta-algorithm, and finally updates its action based on the received vector. To optimally handle general convex, exp-concave, and strongly convex functions simultaneously, we design three types of experts as follows:

• Convex expert. As discussed in Section 3.2, there is no need to search for the optimal learning rate in the convex case, and thus we only run one convex OGD (Algorithm 2) on the convex surrogate loss function c_t(x) in (7). We denote its action in round t by x_t^c.
Its prior weight π_1^c and learning rate η^c are set to 1/3 and 1/(2GD√T), respectively.

• Exp-concave experts. We keep ⌈(1/2) log T⌉ + 1 exp-concave experts, each of which is a standard ONS (Algorithm 3) running on the exp-concave surrogate loss function ℓ_t^η(·) in (3) with a different η. We denote the output of such an expert in round t by x_t^{η,ℓ}. For expert i = 0, 1, 2, . . . , ⌈(1/2) log T⌉, its learning rate and prior weight are assigned as follows:

    η_i = 2^{−i} / (5DG), and π_1^{η_i,ℓ} = C / (3(i+1)(i+2)),

where C = 1 + 1/(1 + ⌈(1/2) log T⌉) is a normalization parameter.

• Strongly convex experts. We maintain ⌈(1/2) log T⌉ + 1 strongly convex experts. In each round t, every such expert takes the strongly convex surrogate loss s_t^η(·) in (6) (with a different η) as its loss function, and adopts strongly convex OGD (Algorithm 4) to update its action, denoted by x_t^{η,s}. For i = 0, 1, 2, . . . , ⌈(1/2) log T⌉, we configure the i-th strongly convex expert as follows:

    η_i = 2^{−i} / (5DG), and π_1^{η_i,s} = C / (3(i+1)(i+2)).

Computational Complexity. The computational complexity of Maler is dominated by its experts. If we ignore the projection procedure, the running times of Algorithms 2, 3, and 4 are O(d), O(d²), and O(d) per iteration, respectively. Combining this with the number of experts, the total running time of Maler is O(d² log T) per iteration, which is of the same order as that of MetaGrad. When taking the projection into account, we note that it can be computed efficiently for many convex bodies used in practical applications, such as d-dimensional balls, cubes, and simplexes (Hazan et al., 2007). To be more concrete, when the convex body is a d-dimensional ball, the projections in Algorithms 2, 3, and 4 require O(d), O(d³), and O(d) time respectively, and consequently the total computational complexity of Maler is O(d³ log T) per iteration, which is also the same as that of MetaGrad.

3.4 THEORETICAL GUARANTEES

Theorem 1. Suppose Assumptions 1 and 2 hold. Let V_T^s = G² Σ_{t=1}^T ‖x_t − x*‖² and V_T^ℓ = Σ_{t=1}^T ((x_t − x*)^⊤ g_t)². Then the regret of Maler is simultaneously bounded by

    R(T) ≤ (2 ln 3 + 3/2) GD √T = O(√T),    (8)

    R(T) ≤ 3 √( V_T^ℓ ( 2 ln(√3 (⌈(1/2) log₂ T⌉ + 3)) + 10 d log T ) )
         + 10GD ( 2 ln(√3 (⌈(1/2) log₂ T⌉ + 3)) + 10 d log T )
         = O( √(V_T^ℓ d log T) + d log T ),    (9)

and

    R(T) ≤ 3 √( V_T^s ( 2 ln(√3 (⌈(1/2) log₂ T⌉ + 3)) + 1 + log T ) )
         + 10GD ( 2 ln(√3 (⌈(1/2) log₂ T⌉ + 3)) + 1 + log T )
         = O( √(V_T^s log T) + log T ).    (10)

Remark. Theorem 1 implies that, similar to MetaGrad, the regret of Maler can be upper bounded by O(√(V_T^ℓ d log T) + d log T). Hence, the conclusions of MetaGrad under fast-rate conditions such as the Bernstein condition (van Erven et al., 2015) still hold for Maler. Moreover, it shows that Maler also enjoys a new type of data-dependent bound, O(√(V_T^s log T) + log T), and thus may perform better than MetaGrad in some high-dimensional cases where V_T^s ≪ d V_T^ℓ.

Next, based on Theorem 1, we derive the following regret bounds for strongly convex and exp-concave loss functions, respectively.

Corollary 2. Suppose Assumptions 1 and 2 hold. For λ-strongly convex functions, the regret of Maler is upper bounded by

    R(T) ≤ (10GD + 9G²/(2λ)) ( 2 ln(√3 (⌈(1/2) log₂ T⌉ + 3)) + 1 + log T ) = O((1/λ) log T).

For α-exp-concave functions, let β = (1/2) min{α, 1/(4GD)}; then Maler enjoys the following regret bound:

    R(T) ≤ (10GD + 9/(2β)) ( 2 ln(√3 (⌈(1/2) log₂ T⌉ + 3)) + 10 d log T ) = O((d/α) log T).

Remark. Theorem 1 and Corollary 2 indicate that our proposed algorithm achieves the minimax optimal O(√T), O(d log T), and O(log T) regret bounds for convex, exponentially concave, and strongly convex functions respectively. In contrast, the regret bounds of MetaGrad for the three types of loss functions are O(√(T log log T)), O(d log T), and O(d log T) respectively, which are suboptimal for convex and strongly convex functions.
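Before turning to the analysis, the expert grid of Section 3.3 and the tilted weighted average of Algorithm 1 (Step 4) can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: expert actions are treated as black-box inputs, and we read "log T" in the grid size as log₂ T:

```python
import numpy as np

def expert_grid(T, D, G):
    # Learning rates and prior weights for the exp-concave and strongly
    # convex experts: eta_i = 2^{-i}/(5DG), pi_i = C/(3(i+1)(i+2)).
    k = int(np.ceil(0.5 * np.log2(T)))
    C = 1.0 + 1.0 / (1.0 + k)
    etas = np.array([2.0 ** (-i) / (5 * D * G) for i in range(k + 1)])
    priors = np.array([C / (3 * (i + 1) * (i + 2)) for i in range(k + 1)])
    return etas, priors

def combine(xc, xs_list, xl_list, etas, pc, ps, pl, eta_c):
    # Tilted exponentially weighted average (Algorithm 1, Step 4): each
    # expert's weight is multiplied ("tilted") by its own learning rate.
    num = pc * eta_c * xc
    den = pc * eta_c
    for eta, p_s, x_s, p_l, x_l in zip(etas, ps, xs_list, pl, xl_list):
        num = num + p_s * eta * x_s + p_l * eta * x_l
        den = den + p_s * eta + p_l * eta
    return num / den
```

Because the tilting multiplies each weight by its η, experts with larger learning rates receive proportionally more influence, matching the description of Step 4.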
4 REGRET ANALYSIS

The regret of Maler can be decomposed into two components: the regret of the meta-algorithm (the meta regret) and the regrets of the expert algorithms (the expert regret). We first upper bound the two parts separately, and then analyze their combination to prove Theorem 1.

4.1 META REGRET

We define the meta regret as the difference between the cumulative surrogate losses of the actions of the meta-algorithm (i.e., the x_t's) and those of the actions of a specific expert; it measures the learning ability of the meta-algorithm. For the meta regret, we introduce the following lemma.

Lemma 1. For every grid point η, we have

    Σ_{t=1}^T s_t^η(x_t) − Σ_{t=1}^T s_t^η(x_t^{η,s}) ≤ 2 ln(√3 (⌈(1/2) log₂ T⌉ + 3)),    (11)

    Σ_{t=1}^T ℓ_t^η(x_t) − Σ_{t=1}^T ℓ_t^η(x_t^{η,ℓ}) ≤ 2 ln(√3 (⌈(1/2) log₂ T⌉ + 3)),    (12)

and

    Σ_{t=1}^T c_t(x_t) − Σ_{t=1}^T c_t(x_t^c) ≤ ln 3.    (13)

Proof. We first introduce three inequalities. For every grid point η,

    e^{−s_t^η(x_t^{η,s})} = e^{η(x_t − x_t^{η,s})^⊤ g_t − η² G² ‖x_t − x_t^{η,s}‖²}
                        ≤ e^{η(x_t − x_t^{η,s})^⊤ g_t − (η(x_t − x_t^{η,s})^⊤ g_t)²}
                        ≤ 1 + η(x_t − x_t^{η,s})^⊤ g_t    (14)

where the first inequality follows from the Cauchy–Schwarz inequality, and the second inequality is due to e^{x−x²} ≤ 1 + x for any x ≥ −2/3 (van Erven and Koolen, 2016). Applying similar arguments, we have

    e^{−ℓ_t^η(x_t^{η,ℓ})} = e^{η(x_t − x_t^{η,ℓ})^⊤ g_t − (η(x_t − x_t^{η,ℓ})^⊤ g_t)²}
                         ≤ 1 + η(x_t − x_t^{η,ℓ})^⊤ g_t    (15)

and

    e^{−c_t(x_t^c)} = e^{η^c(x_t − x_t^c)^⊤ g_t − (η^c GD)²}
                    ≤ e^{η^c(x_t − x_t^c)^⊤ g_t − (η^c(x_t − x_t^c)^⊤ g_t)²}
                    ≤ 1 + η^c(x_t − x_t^c)^⊤ g_t.    (16)

Note that by the definition of η^c we have η^c(x_t − x_t^c)^⊤ g_t > −1/2, so the inequality e^{x−x²} ≤ 1 + x is applicable.

Now we are ready to prove Lemma 1. Define the potential function

    Φ_T = Σ_η ( π_1^{η,s} e^{−Σ_{t=1}^T s_t^η(x_t^{η,s})} + π_1^{η,ℓ} e^{−Σ_{t=1}^T ℓ_t^η(x_t^{η,ℓ})} ) + π_1^c e^{−Σ_{t=1}^T c_t(x_t^c)}.    (17)

We have

    Φ_{T+1} − Φ_T
    = Σ_η π_1^{η,s} e^{−Σ_{t=1}^T s_t^η(x_t^{η,s})} ( e^{−s_{T+1}^η(x_{T+1}^{η,s})} − 1 )
      + Σ_η π_1^{η,ℓ} e^{−Σ_{t=1}^T ℓ_t^η(x_t^{η,ℓ})} ( e^{−ℓ_{T+1}^η(x_{T+1}^{η,ℓ})} − 1 )
      + π_1^c e^{−Σ_{t=1}^T c_t(x_t^c)} ( e^{−c_{T+1}(x_{T+1}^c)} − 1 )
    ≤ Σ_η π_1^{η,s} e^{−Σ_{t=1}^T s_t^η(x_t^{η,s})} η(x_{T+1} − x_{T+1}^{η,s})^⊤ g_{T+1}
      + Σ_η π_1^{η,ℓ} e^{−Σ_{t=1}^T ℓ_t^η(x_t^{η,ℓ})} η(x_{T+1} − x_{T+1}^{η,ℓ})^⊤ g_{T+1}
      + π_1^c e^{−Σ_{t=1}^T c_t(x_t^c)} η^c(x_{T+1} − x_{T+1}^c)^⊤ g_{T+1}
    = (a_T x_{T+1} − b_T)^⊤ g_{T+1}    (18)

where the inequality is due to (14), (15), and (16), and

    a_T = Σ_η ( π_1^{η,s} e^{−Σ_{t=1}^T s_t^η(x_t^{η,s})} η + π_1^{η,ℓ} e^{−Σ_{t=1}^T ℓ_t^η(x_t^{η,ℓ})} η ) + π_1^c e^{−Σ_{t=1}^T c_t(x_t^c)} η^c

    b_T = Σ_η ( π_1^{η,s} e^{−Σ_{t=1}^T s_t^η(x_t^{η,s})} η x_{T+1}^{η,s} + π_1^{η,ℓ} e^{−Σ_{t=1}^T ℓ_t^η(x_t^{η,ℓ})} η x_{T+1}^{η,ℓ} ) + π_1^c e^{−Σ_{t=1}^T c_t(x_t^c)} η^c x_{T+1}^c.

On the other hand, by the update rule of x_t, we have

    x_{T+1} = [ Σ_η ( π_{T+1}^{η,s} η x_{T+1}^{η,s} + π_{T+1}^{η,ℓ} η x_{T+1}^{η,ℓ} ) + π_{T+1}^c η^c x_{T+1}^c ] / [ Σ_η ( π_{T+1}^{η,s} η + π_{T+1}^{η,ℓ} η ) + π_{T+1}^c η^c ] = b_T / a_T    (19)
where the second equality comes from Step 6 of Algorithm 1, noting that π_{t+1}^c, π_{t+1}^{η,ℓ}, and π_{t+1}^{η,s} share the same denominator. Plugging (19) into (18), we get

    Φ_{T+1} − Φ_T ≤ 0,

which implies that

    1 = Φ_0 ≥ Φ_1 ≥ · · · ≥ Φ_T.    (20)

Note that all terms in the definition of Φ_T in (17) are positive. Combined with (20), this indicates that each of these terms is at most 1. Thus,

    0 ≤ − ln( π_1^{η,s} e^{−Σ_{t=1}^T s_t^η(x_t^{η,s})} ) = Σ_{t=1}^T s_t^η(x_t^{η,s}) + ln(1/π_1^{η,s}),

    0 ≤ − ln( π_1^{η,ℓ} e^{−Σ_{t=1}^T ℓ_t^η(x_t^{η,ℓ})} ) = Σ_{t=1}^T ℓ_t^η(x_t^{η,ℓ}) + ln(1/π_1^{η,ℓ}),

and

    0 ≤ − ln( π_1^c e^{−Σ_{t=1}^T c_t(x_t^c)} ) = Σ_{t=1}^T c_t(x_t^c) + ln(1/π_1^c).

We finish the proof by noticing that s_t^η(x_t) = ℓ_t^η(x_t) = 0 by definition, and that, for every grid point η,

    ln(1/π_1^{η,s}) ≤ ln( 3 (⌈(1/2) log T⌉ + 1)(⌈(1/2) log T⌉ + 2) ) ≤ 2 ln( √3 (⌈(1/2) log₂ T⌉ + 3) ),

    ln(1/π_1^{η,ℓ}) ≤ ln( 3 (⌈(1/2) log T⌉ + 1)(⌈(1/2) log T⌉ + 2) ) ≤ 2 ln( √3 (⌈(1/2) log₂ T⌉ + 3) ),

and ln(1/π_1^c) = ln 3.

4.2 EXPERT REGRET

For the regret of each expert, we have the following lemma, whose proof is postponed to the appendix.

Lemma 2. For every grid point η and any u ∈ D, we have

    Σ_{t=1}^T s_t^η(x_t^{η,s}) − Σ_{t=1}^T s_t^η(u) ≤ 1 + log T,    (21)

    Σ_{t=1}^T ℓ_t^η(x_t^{η,ℓ}) − Σ_{t=1}^T ℓ_t^η(u) ≤ 10 d log T,    (22)

and

    Σ_{t=1}^T c_t(x_t^c) − Σ_{t=1}^T c_t(u) ≤ 3/4.    (23)

4.3 PROOF OF THEOREM 1

In the following, we combine the regret analyses of the meta-algorithm and the expert algorithms to prove Theorem 1.

Proof. To get the O(√T) bound in (8), we upper bound the regret by using the properties of c_t as follows:

    R(T) = Σ_{t=1}^T f_t(x_t) − Σ_{t=1}^T f_t(x*)    [by (1)]
         ≤ Σ_{t=1}^T g_t^⊤(x_t − x*)    [by convexity, (2)]
         = [ Σ_{t=1}^T (−c_t(x*)) + Σ_{t=1}^T (η^c GD)² ] / η^c    [by (7)]
         = [ Σ_{t=1}^T (c_t(x_t) − c_t(x_t^c)) + Σ_{t=1}^T (c_t(x_t^c) − c_t(x*)) ] / η^c
         ≤ (ln 3 + 3/4) / η^c
         = (2 ln 3 + 3/2) GD √T,

where the third equality uses c_t(x_t) = (η^c GD)², and the last inequality follows from (13) and (23).

Next, to achieve the bound in (10), we upper bound R(T) by making use of the properties of s_t^η. For every grid point η, we have

    R(T) = Σ_{t=1}^T f_t(x_t) − Σ_{t=1}^T f_t(x*)    [by (1)]
         ≤ Σ_{t=1}^T g_t^⊤(x_t − x*)    [by convexity, (2)]
         = [ Σ_{t=1}^T (−s_t^η(x*)) ] / η + Σ_{t=1}^T η G² ‖x_t − x*‖²    [by (6)]
         = [ Σ_{t=1}^T (s_t^η(x_t) − s_t^η(x_t^{η,s})) + Σ_{t=1}^T (s_t^η(x_t^{η,s}) − s_t^η(x*)) ] / η + η Σ_{t=1}^T G² ‖x_t − x*‖²
         ≤ [ 2 ln(√3 (⌈(1/2) log₂ T⌉ + 3)) + 1 + log T ] / η + η V_T^s
         = η V_T^s + [ 2 ln(√3 (⌈(1/2) log₂ T⌉ + 3)) + 1 + log T ] / η    (24)

where V_T^s = Σ_{t=1}^T G² ‖x_t − x*‖², the fourth line uses s_t^η(x_t) = 0, and the inequality comes from (11) and (21). Define

    A = 2 ln(√3 (⌈(1/2) log₂ T⌉ + 3)) + 1 + log T ≥ 1.

The optimal η̂ minimizing the right-hand side of (24) is

    η̂ = √(A / V_T^s) ≥ 1 / (5GD √T).    (25)

If η̂ ≤ 1/(5GD), then by construction there exists a grid point η such that η̂ ∈ [η/2, η], and thus

    R(T) ≤ η V_T^s + A/η ≤ 2η̂ V_T^s + A/η̂ = 3 √(V_T^s A).

On the other hand, if η̂ > 1/(5GD), then by (25) we get V_T^s ≤ 25 G² D² A. Thus, for the grid point η_0 = 1/(5GD), we have

    R(T) ≤ 10 GD A.

Overall, we obtain

    R(T) ≤ 3 √(V_T^s A) + 10 GD A.

Finally, we upper bound the regret by using the properties of the exp-concave surrogate loss functions. For every grid point η, we have

    R(T) ≤ Σ_{t=1}^T g_t^⊤(x_t − x*)    [by convexity, (2)]
         = [ Σ_{t=1}^T (−ℓ_t^η(x*)) ] / η + η Σ_{t=1}^T (g_t^⊤(x_t − x*))²    [by (3)]
         = [ Σ_{t=1}^T (ℓ_t^η(x_t) − ℓ_t^η(x_t^{η,ℓ})) + Σ_{t=1}^T (ℓ_t^η(x_t^{η,ℓ}) − ℓ_t^η(x*)) ] / η + η Σ_{t=1}^T ((x_t − x*)^⊤ g_t)²
         ≤ [ 2 ln(√3 (⌈(1/2) log₂ T⌉ + 3)) + 10 d log T ] / η + η V_T^ℓ
         = η V_T^ℓ + [ 2 ln(√3 (⌈(1/2) log₂ T⌉ + 3)) + 10 d log T ] / η

where the second equality uses ℓ_t^η(x_t) = 0, and the last inequality comes from (12) and (22). Define

    B = 2 ln(√3 (⌈(1/2) log₂ T⌉ + 3)) + 10 d log T.

By similar arguments, we get

    R(T) ≤ 3 √(V_T^ℓ B) + 10 GD B.

4.4 PROOF OF COROLLARY 2

Proof. For α-exp-concave functions, we have

    R(T) ≤ Σ_{t=1}^T g_t^⊤(x_t − x*) − (β/2) V_T^ℓ
         ≤ 3 √(V_T^ℓ B) + 10 GD B − (β/2) V_T^ℓ
         ≤ (3γ/2) V_T^ℓ + (10GD + 3/(2γ)) B − (β/2) V_T^ℓ,

where the last inequality is based on √(xy) ≤ (γ/2)x + (1/(2γ))y for all x, y, γ > 0. The result follows by setting γ = β/3.

For λ-strongly convex functions, we have

    R(T) ≤ Σ_{t=1}^T g_t^⊤(x_t − x*) − (λ/2) Σ_{t=1}^T ‖x_t − x*‖²
         ≤ 3 √(V_T^s A) + 10 GD A − (λ/(2G²)) V_T^s
         ≤ (3γ/2) V_T^s + (10GD + 3/(2γ)) A − (λ/(2G²)) V_T^s,

where the last inequality is again based on √(xy) ≤ (γ/2)x + (1/(2γ))y for all x, y, γ > 0, and the result follows from γ = λ/(3G²).

5 EXPERIMENTS

In this section, we present empirical results on different online learning tasks to evaluate the proposed algorithm. We choose MetaGrad as the baseline algorithm.
[Figure 1 appears here, with panels (a) Online regression and (b) Online classification.]
Fig. 1: Empirical results of Maler and MetaGrad for online regression and online classification.

5.1 ONLINE REGRESSION

We consider mini-batch least mean square regression with an ℓ2-regularizer, which is a classic problem belonging to online strongly convex optimization. In each round t, a small batch of training examples {(x_{t,1}, y_{t,1}), . . . , (x_{t,n}, y_{t,n})} arrives, and at the same time, the learner makes a prediction of the unknown parameter w*, denoted by w_t, and suffers a loss, defined as

    f_t(w) = (1/n) Σ_{i=1}^n ( w^⊤ x_{t,i} − y_{t,i} )² + λ ‖w‖².    (26)

We conduct the experiment on a synthetic data set, which is constructed as follows. We sample w* and the feature vectors x_{t,i} uniformly at random from the d-balls of diameter 1 and 10, respectively, and generate y_{t,i} according to a linear model: y_{t,i} = w*^⊤ x_{t,i} + η_t, where the noise η_t is drawn from a normal distribution. We set batch size n = 200, λ = 0.001, d = 50, and T = 200. The regret versus the time horizon is shown in Fig. 1(a). It can be seen that Maler achieves a faster convergence rate than MetaGrad.

5.2 ONLINE CLASSIFICATION

Next, we consider online classification using logistic regression. In each round t, we receive a batch of training examples {(x_{t,1}, y_{t,1}), . . . , (x_{t,n}, y_{t,n})} and choose a linear classifier w_t. After that, we suffer a logistic loss

    f_t(w) = (1/n) Σ_{i=1}^n log( 1 + exp(−y_{t,i} w^⊤ x_{t,i}) ),    (27)

which is exp-concave. We conduct the experiments on the classic real-world data set a9a (Chang and Lin, 2011). We scale all feature vectors to the unit ball, and restrict the decision set D to be a ball of radius 0.5 centered at the origin, so that Assumptions 1 and 2 are satisfied. We set batch size n = 200 and T = 100. The regret versus the time horizon is shown in Fig. 1(b). It can be seen that Maler performs better than MetaGrad. Although the worst-case regret bounds of Maler and MetaGrad for exp-concave losses are of the same order, the experimental results are not surprising, since Maler enjoys a tighter data-dependent regret bound than MetaGrad.

6 CONCLUSION AND FUTURE WORK

In this paper, we propose a universal algorithm for online convex optimization, which achieves the optimal O(√T), O(d log T), and O(log T) regret bounds for general convex, exp-concave, and strongly convex functions respectively, and enjoys a new type of data-dependent bound. The main idea is to consider different types of learning algorithms and learning rates at the same time. Experiments on online regression and online classification problems demonstrate the effectiveness of our method. In the future, we will investigate whether our proposed algorithm can be extended to achieve broader adaptivity in various directions, for example, adapting to changing environments (Hazan and Seshadhri, 2007; Jun et al., 2017) and/or adapting to data structures (Reddi et al., 2018; Wang et al., 2019).

Acknowledgement

This work was partially supported by the NSFC-NRF Joint Research Project (61861146001), YESS (2017QNRC001), and the Zhejiang Provincial Key Laboratory of Service Robot.
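For reproducibility, the synthetic setup of Section 5.1 can be sketched as follows. The random seed and the noise scale are our assumptions, as the paper does not report them; everything else (d, n, T, λ, the ball radii, and the loss (26)) follows the text above:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is our choice, not the paper's
d, n, T, lam = 50, 200, 200, 0.001

def sample_ball(num, dim, radius):
    # Uniform sampling from a ball: random direction times a radius drawn
    # with density proportional to r^{dim-1}.
    v = rng.standard_normal((num, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    r = radius * rng.random(num) ** (1.0 / dim)
    return v * r[:, None]

w_star = sample_ball(1, d, 0.5)[0]  # d-ball of diameter 1
losses = []
for t in range(T):
    X = sample_ball(n, d, 5.0)      # feature vectors, d-ball of diameter 10
    y = X @ w_star + rng.normal(scale=0.1, size=n)  # noise scale assumed
    f = lambda w: np.mean((X @ w - y) ** 2) + lam * w @ w  # loss (26)
    losses.append(f(w_star))
```

An online learner would be evaluated by feeding each f (or its gradient at the current iterate) to the algorithm in round t and accumulating the regret against w*.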
References

Abernethy, J., Bartlett, P. L., Rakhlin, A., and Tewari, A. (2008). Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the 21st Annual Conference on Learning Theory, pages 415–423.

Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:1–27.

Daniely, A., Gonen, A., and Shalev-Shwartz, S. (2015). Strongly adaptive online learning. In Proceedings of the 32nd International Conference on Machine Learning, pages 1405–1411.

Do, C. B., Le, Q. V., and Foo, C.-S. (2009). Proximal regularization for online and batch learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 257–264.

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159.

Hall, E. C. and Willett, R. M. (2013). Dynamical models and tracking regret in online convex programming. In Proceedings of the 30th International Conference on Machine Learning, pages 579–587.

Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69:169–192.

Hazan, E. (2016). Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325.

Hazan, E., Rakhlin, A., and Bartlett, P. L. (2008). Adaptive online gradient descent. In Advances in Neural Information Processing Systems 21, pages 65–72.

Hazan, E. and Seshadhri, C. (2007). Adaptive algorithms for online decision problems. In Electronic Colloquium on Computational Complexity.

Jun, K.-S., Orabona, F., Wright, S., and Willett, R. (2017). Improved strongly adaptive online learning using coin betting. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 943–951.

Reddi, S. J., Kale, S., and Kumar, S. (2018). On the convergence of Adam and beyond. In Proceedings of the 6th International Conference on Learning Representations.

Shalev-Shwartz, S. (2012). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194.

Tieleman, T. and Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, pages 26–31.

van Erven, T., Grünwald, P. D., Mehta, N. A., Reid, M. D., and Williamson, R. C. (2015). Fast rates in statistical and online learning. Journal of Machine Learning Research, 16:1793–1861.

van Erven, T. and Koolen, W. M. (2016). MetaGrad: Multiple learning rates in online learning. In Advances in Neural Information Processing Systems 29, pages 3666–3674.

Wang, G., Lu, S., Tu, W., and Zhang, L. (2019). SAdam: A variant of Adam for strongly convex functions. arXiv preprint arXiv:1905.02957.

Wang, G., Zhao, D., and Zhang, L. (2018). Minimizing adaptive regret with one gradient per iteration. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 2762–2768.

Zhang, L., Liu, T.-Y., and Zhou, Z.-H. (2019). Adaptive regret of convex and smooth functions. In Proceedings of the 36th International Conference on Machine Learning, pages 7414–7423.

Zhang, L., Lu, S., and Zhou, Z.-H. (2018a). Adaptive online learning in dynamic environments. In Advances in Neural Information Processing Systems 31, pages 1330–1340.

Zhang, L., Yang, T., Jin, R., and Zhou, Z.-H. (2018b). Dynamic regret of strongly adaptive methods. In Proceedings of the 35th International Conference on Machine Learning, pages 5877–5886.

Zhang, L., Yang, T., Yi, J., Jin, R., and Zhou, Z.-H. (2017). Improved dynamic regret for non-degenerate functions. In Advances in Neural Information Processing Systems 30, pages 732–741.

Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928–936.
