The second advantage is that the regret with respect to the original loss function f_t can be upper bounded in terms of the regret with respect to the defined surrogate loss function ℓ_t^η:

R(T) \le \frac{\sum_{t=1}^T \ell_t^\eta(x_t) - \min_{x \in \mathcal{D}} \sum_{t=1}^T \ell_t^\eta(x)}{\eta} + \eta V_T^\ell   (5)

where V_T^\ell = \sum_{t=1}^T \left((x_t - x_*)^\top g_t\right)^2. Both advantages jointly (i.e., combining (4) and (5)) lead to a regret bound of O((d log T)/η + ηV_T^ℓ). Therefore, had we known the value of V_T^ℓ in advance, we could set η as \min\{\Theta(\sqrt{d\log T/V_T^\ell}), \frac{1}{5DG}\} and obtain a regret bound

where η^c = \frac{1}{2DG\sqrt{T}}, which only depends on known quantities. It can be proved that if we run a convex OGD with c_t(·) as the input, its regret with respect to the original loss function f_t(·) can be bounded by O(√T).

While the idea of incorporating new types of surrogate loss functions to enhance the adaptivity is easy to comprehend, the specific definitions of the two proposed surrogate loss functions in (6) and (7) are more involved. In fact, the proposed functions are carefully designed such that, besides the aforementioned properties, they also satisfy

\exp(-s_t^\eta(x)) \le \exp(-\ell_t^\eta(x)) \le 1 + \eta (x_t - x)^\top g_t.
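As a quick sanity check, this sandwich property can be verified numerically. The sketch below is not taken from the paper: it assumes the surrogate forms s_t^η(x) = −η(x_t − x)^⊤g_t + η²G²‖x_t − x‖² and ℓ_t^η(x) = −η(x_t − x)^⊤g_t + η²((x_t − x)^⊤g_t)², which are inferred from the gradients used in Algorithms 3 and 4 below, and the constants d, D, and G are placeholder values.

```python
import numpy as np

# Hypothetical problem constants: D bounds the diameter of the decision set,
# G bounds the gradient norm (Assumptions 1 and 2).
d, D, G = 5, 1.0, 1.0
eta = 1.0 / (5 * D * G)  # a grid point, so eta * |(x_t - x)^T g_t| <= 1/5 < 1/2

rng = np.random.default_rng(0)
for _ in range(1000):
    # Draw two points at distance at most D and a gradient with norm at most G.
    x_t = rng.uniform(-D / (2 * np.sqrt(d)), D / (2 * np.sqrt(d)), d)
    x = rng.uniform(-D / (2 * np.sqrt(d)), D / (2 * np.sqrt(d)), d)
    g_t = rng.uniform(-G / np.sqrt(d), G / np.sqrt(d), d)

    lin = (x_t - x) @ g_t                                    # (x_t - x)^T g_t
    s = -eta * lin + eta**2 * G**2 * np.sum((x_t - x) ** 2)  # assumed form of s_t^eta(x)
    ell = -eta * lin + eta**2 * lin**2                       # assumed form of ell_t^eta(x)

    assert np.exp(-s) <= np.exp(-ell) + 1e-12   # s >= ell by Cauchy-Schwarz
    assert np.exp(-ell) <= 1 + eta * lin + 1e-12
print("exp(-s) <= exp(-ell) <= 1 + eta*(x_t - x)^T g_t held on all draws")
```

The second assertion relies on η ≤ 1/(5DG), which keeps η(x_t − x)^⊤g_t above −1/2 so that e^{a−a²} ≤ 1 + a applies.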
Algorithm 1 Meta-algorithm
1: Input: Learning rates η^c, η_1, η_2, ..., prior weights π_1^c, π_1^{η_1,s}, π_1^{η_2,s}, ..., and π_1^{η_1,ℓ}, π_1^{η_2,ℓ}, ...
2: for t = 1, ..., T do
3:   Get predictions x_t^c from Algorithm 2, and x_t^{η,ℓ}, x_t^{η,s} from Algorithms 3 and 4 for all η
4:   Play
       x_t = \frac{\pi_t^c \eta^c x_t^c + \sum_\eta \left(\pi_t^{\eta,s} \eta x_t^{\eta,s} + \pi_t^{\eta,\ell} \eta x_t^{\eta,\ell}\right)}{\pi_t^c \eta^c + \sum_\eta \left(\pi_t^{\eta,s} \eta + \pi_t^{\eta,\ell} \eta\right)}
5:   Observe gradient g_t and send it to all experts
6:   Update weights:
       \pi_{t+1}^c = \frac{\pi_t^c e^{-c_t(x_t^c)}}{\Phi_t}
       \pi_{t+1}^{\eta,s} = \frac{\pi_t^{\eta,s} e^{-s_t^\eta(x_t^{\eta,s})}}{\Phi_t}  for all η
       \pi_{t+1}^{\eta,\ell} = \frac{\pi_t^{\eta,\ell} e^{-\ell_t^\eta(x_t^{\eta,\ell})}}{\Phi_t}  for all η
     where
       \Phi_t = \sum_\eta \left( \pi_t^{\eta,s} e^{-s_t^\eta(x_t^{\eta,s})} + \pi_t^{\eta,\ell} e^{-\ell_t^\eta(x_t^{\eta,\ell})} \right) + \pi_t^c e^{-c_t(x_t^c)}
7: end for

Algorithm 2 Convex expert algorithm
1: x_1^c = 0
2: for t = 1, ..., T do
3:   Send x_t^c to Algorithm 1
4:   Receive gradient g_t from Algorithm 1
5:   Update
       x_{t+1}^c = \Pi_{\mathcal{D}}^{I_d} \left( x_t^c - \frac{D}{\eta^c G \sqrt{t}} \nabla c_t(x_t^c) \right),  where \nabla c_t(x_t^c) = \eta^c g_t
6: end for

Algorithm 3 Exp-concave expert algorithm
1: Input: Learning rate η
2: x_1^{η,ℓ} = 0, β = \frac{1}{2} \min\left\{\frac{1}{4 G^\ell D}, 1\right\}, where G^\ell = \frac{7}{25D}, and Σ_1 = \frac{1}{\beta^2 D^2} I_d
3: for t = 1, ..., T do
4:   Send x_t^{η,ℓ} to Algorithm 1
5:   Receive gradient g_t from Algorithm 1
6:   Update
       \Sigma_{t+1} = \Sigma_t + \nabla \ell_t^\eta(x_t^{\eta,\ell}) \nabla \ell_t^\eta(x_t^{\eta,\ell})^\top
       x_{t+1}^{\eta,\ell} = \Pi_{\mathcal{D}}^{\Sigma_{t+1}} \left( x_t^{\eta,\ell} - \frac{1}{\beta} \Sigma_{t+1}^{-1} \nabla \ell_t^\eta(x_t^{\eta,\ell}) \right)
     where \nabla \ell_t^\eta(x_t^{\eta,\ell}) = \eta g_t + 2\eta^2 g_t g_t^\top (x_t^{\eta,\ell} - x_t)
7: end for

Algorithm 4 Strongly convex expert algorithm
1: Input: Learning rate η
2: x_1^{η,s} = 0
3: for t = 1, ..., T do
4:   Send x_t^{η,s} to Algorithm 1
5:   Receive gradient g_t from Algorithm 1
6:   Update
       x_{t+1}^{\eta,s} = \Pi_{\mathcal{D}}^{I_d} \left( x_t^{\eta,s} - \frac{1}{2\eta^2 G^2 t} \nabla s_t^\eta(x_t^{\eta,s}) \right)
     where \nabla s_t^\eta(x_t^{\eta,s}) = \eta g_t + 2\eta^2 G^2 (x_t^{\eta,s} - x_t)
7: end for
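To make the bookkeeping in Algorithm 1 concrete, here is a minimal Python sketch of one meta step: the learning-rate-weighted average of line 4 and the exponential-weights update of line 6. It is an illustration only, under the notation above; the expert predictions and the surrogate-loss values evaluated at those predictions are assumed to be supplied by the caller, and all variable names are hypothetical.

```python
import numpy as np

def meta_play(pi_c, eta_c, x_c, pi_s, pi_l, etas, x_s, x_l):
    """Line 4 of Algorithm 1: average the experts, each tilted by weight * learning rate."""
    num = pi_c * eta_c * x_c
    den = pi_c * eta_c
    for p_s, p_l, eta, xs, xl in zip(pi_s, pi_l, etas, x_s, x_l):
        num = num + eta * (p_s * xs + p_l * xl)
        den = den + eta * (p_s + p_l)
    return num / den

def meta_update(pi_c, c_loss, pi_s, s_losses, pi_l, l_losses):
    """Line 6 of Algorithm 1: exponential-weights update driven by the surrogate losses."""
    w_c = pi_c * np.exp(-c_loss)
    w_s = np.asarray(pi_s) * np.exp(-np.asarray(s_losses))
    w_l = np.asarray(pi_l) * np.exp(-np.asarray(l_losses))
    phi = w_c + w_s.sum() + w_l.sum()   # the normalizer Phi_t
    return w_c / phi, w_s / phi, w_l / phi
```

In a full implementation, Algorithms 2-4 would supply x_c, x_s, x_l and consume g_t each round, and the projection Π_D would be carried out as described in the complexity discussion below.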
\eta_i = \frac{2^{-i}}{5DG}, \quad \text{and} \quad \pi_1^{\eta_i,\ell} = \frac{C}{3(i+1)(i+2)},

where C = 1 + \frac{1}{1 + \frac{1}{2}\log T} is a normalization parameter.

• Strongly convex experts. We maintain \frac{1}{2}\log T + 1 strongly convex experts. In each round t, every expert takes a strongly convex surrogate loss s_t^η(·) in (6) (with a different η) as the loss function, and adopts strongly convex OGD (Algorithm 4) to update its action, denoted as x_t^{η,s}. For i = 0, 1, 2, ..., \frac{1}{2}\log T, we configure the i-th strongly convex expert as follows:

\eta_i = \frac{2^{-i}}{5DG}, \quad \text{and} \quad \pi_1^{\eta_i,s} = \frac{C}{3(i+1)(i+2)}.

Computational Complexity. The computational complexity of Maler is dominated by its experts. If we ignore the projection procedure, the run times of Algorithms 2, 3 and 4 are O(d), O(d²) and O(d) per iteration respectively. Combining with the number of experts, the total run time of Maler is O(d² log T), which is of the same order as that of MetaGrad. When taking the projection into account, we note that it can be computed efficiently for many convex bodies used in practical applications, such as d-dimensional balls, cubes and simplexes (Hazan et al., 2007). To be more concrete, when the convex body is a d-dimensional ball, projections in Algorithms 2, 3, and 4 require O(d), O(d³), and O(d) time respectively, and consequently the total computational complexity of Maler is O(d³ log T), which is also the same as that of MetaGrad.

3.4 THEORETICAL GUARANTEES

Theorem 1. Suppose Assumptions 1 and 2 hold. Let V_T^s = G^2 \sum_{t=1}^T \|x_t - x_*\|^2 and V_T^\ell = \sum_{t=1}^T \left((x_t - x_*)^\top g_t\right)^2. Then the regret of Maler is simultaneously bounded by

R(T) \le \left( \sqrt{2\ln 3} + \frac{3}{2} \right) GD\sqrt{T} = O(\sqrt{T})   (8)

and

R(T) \le 3\sqrt{V_T^s \left( 2\ln\left(\sqrt{3}\left(\tfrac{1}{2}\log_2 T + 3\right)\right) + 1 + \log T \right)} + 10GD \left( 2\ln\left(\sqrt{3}\left(\tfrac{1}{2}\log_2 T + 3\right)\right) + 1 + \log T \right) = O\left( \sqrt{V_T^s \log T} + \log T \right).   (10)

Remark. Theorem 1 implies that, similar to MetaGrad, Maler can be upper bounded by O(\sqrt{V_T^\ell d\log T} + d\log T). Hence, the conclusions of MetaGrad under some fast rates examples such as the Bernstein condition (van Erven et al., 2015) still hold for Maler. Moreover, it shows that Maler also enjoys a new type of data-dependent bound O(\sqrt{V_T^s \log T} + \log T), and thus may perform better than MetaGrad in some high dimensional cases such that V_T^s \ll d V_T^\ell.

Next, based on Theorem 1, we derive the following regret bounds for strongly convex and exp-concave loss functions, respectively.

Corollary 2. Suppose Assumptions 1 and 2 hold. For λ-strongly convex functions, the regret of Maler is upper bounded by

R(T) \le \left( 10GD + \frac{9G^2}{2\lambda} \right) \left( 2\ln\left(\sqrt{3}\left(\tfrac{1}{2}\log_2 T + 3\right)\right) + 1 + \log T \right) = O\left( \frac{1}{\lambda}\log T \right).

For α-exp-concave functions, let β = \frac{1}{2}\min\left\{\alpha, \frac{1}{4GD}\right\}, and Maler enjoys the following regret bound

R(T) \le \left( 10GD + \frac{9}{2\beta} \right) \left( 2\ln\left(\sqrt{3}\left(\tfrac{1}{2}\log_2 T + 3\right)\right) + 1 + 10d\log T \right) = O\left( d\log T \right).
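For intuition, the strongly convex case of Corollary 2 can be recovered from a bound of the shape of (10) by the standard MetaGrad-style argument. The following is only a sketch, under the assumption that the bound in (10) also applies to the linearized regret \sum_{t=1}^T g_t^\top(x_t - x_*). Write A = 2\ln\left(\sqrt{3}\left(\tfrac{1}{2}\log_2 T + 3\right)\right) + 1 + \log T. By λ-strong convexity, f_t(x_t) - f_t(x_*) \le g_t^\top(x_t - x_*) - \frac{\lambda}{2}\|x_t - x_*\|^2, so

R(T) \le \sum_{t=1}^T g_t^\top(x_t - x_*) - \frac{\lambda}{2G^2} V_T^s \le 3\sqrt{V_T^s A} + 10GD\,A - \frac{\lambda}{2G^2} V_T^s \le 10GD\,A + \max_{v \ge 0}\left( 3\sqrt{vA} - \frac{\lambda}{2G^2} v \right) = \left( 10GD + \frac{9G^2}{2\lambda} \right) A,

which matches the stated bound. The exp-concave case follows the same pattern with the V_T^ℓ-based bound and β in place of λ.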
Remark. Theorem 1 and Corollary 2 indicate that our proposed algorithm achieves the minimax optimal O(√T), O(d log T) and O(log T) regret bounds for convex, exponentially concave and strongly convex functions respectively. In contrast, the regret bounds of MetaGrad for the three types of loss functions are O(\sqrt{T \log\log T}), O(d log T) and O(d log T) respectively, which are suboptimal for convex and strongly convex functions.

4 REGRET ANALYSIS

The regret of Maler can be generally decomposed into two components, i.e., the regret of the meta-algorithm (meta regret) and the regrets of the expert algorithms (expert regret). We first upper bound the two parts separately, and then analyse their composition to prove Theorem 1.

4.1 META REGRET

We define the meta regret as the difference between the cumulative surrogate losses of the actions of the meta-algorithm (i.e., the x_t's) and that of the actions from a specific expert, which measures the learning ability of the meta-algorithm. For the meta regret, we introduce the following lemma.

Lemma 1. For every grid point η, we have

\sum_{t=1}^T s_t^\eta(x_t) - \sum_{t=1}^T s_t^\eta(x_t^{\eta,s}) \le 2\ln\left(\sqrt{3}\left(\tfrac{1}{2}\log_2 T + 3\right)\right),   (11)

\sum_{t=1}^T \ell_t^\eta(x_t) - \sum_{t=1}^T \ell_t^\eta(x_t^{\eta,\ell}) \le 2\ln\left(\sqrt{3}\left(\tfrac{1}{2}\log_2 T + 3\right)\right),   (12)

and

\sum_{t=1}^T c_t(x_t) - \sum_{t=1}^T c_t(x_t^c) \le \ln 3.   (13)

Proof. We firstly introduce three inequalities. For every grid point η,

e^{-s_t^\eta(x_t^{\eta,s})} \overset{(6)}{=} e^{\eta(x_t - x_t^{\eta,s})^\top g_t - \eta^2 G^2 \|x_t - x_t^{\eta,s}\|^2} \le 1 + \eta(x_t - x_t^{\eta,s})^\top g_t,   (14)

e^{-\ell_t^\eta(x_t^{\eta,\ell})} \le 1 + \eta(x_t - x_t^{\eta,\ell})^\top g_t,   (15)

and

e^{-c_t(x_t^c)} \overset{(7)}{=} e^{\eta^c(x_t - x_t^c)^\top g_t - (\eta^c GD)^2} \le e^{\eta^c(x_t - x_t^c)^\top g_t - (\eta^c(x_t - x_t^c)^\top g_t)^2} \le 1 + \eta^c(x_t - x_t^c)^\top g_t.   (16)

Note that by definition of η^c we have η^c(x_t − x_t^c)^⊤ g_t > −\frac{1}{2}.

Now we are ready to prove Lemma 1. Define the potential function

\Phi_T = \sum_\eta \left( \pi_1^{\eta,s} e^{-\sum_{t=1}^T s_t^\eta(x_t^{\eta,s})} + \pi_1^{\eta,\ell} e^{-\sum_{t=1}^T \ell_t^\eta(x_t^{\eta,\ell})} \right) + \pi_1^c e^{-\sum_{t=1}^T c_t(x_t^c)}.   (17)

We have

\Phi_{T+1} - \Phi_T = \sum_\eta \pi_1^{\eta,s} e^{-\sum_{t=1}^T s_t^\eta(x_t^{\eta,s})} \left( e^{-s_{T+1}^\eta(x_{T+1}^{\eta,s})} - 1 \right) + \sum_\eta \pi_1^{\eta,\ell} e^{-\sum_{t=1}^T \ell_t^\eta(x_t^{\eta,\ell})} \left( e^{-\ell_{T+1}^\eta(x_{T+1}^{\eta,\ell})} - 1 \right) + \pi_1^c e^{-\sum_{t=1}^T c_t(x_t^c)} \left( e^{-c_{T+1}(x_{T+1}^c)} - 1 \right)
\le \sum_\eta \pi_1^{\eta,s} e^{-\sum_{t=1}^T s_t^\eta(x_t^{\eta,s})} \eta (x_{T+1} - x_{T+1}^{\eta,s})^\top g_{T+1} + \sum_\eta \pi_1^{\eta,\ell} e^{-\sum_{t=1}^T \ell_t^\eta(x_t^{\eta,\ell})} \eta (x_{T+1} - x_{T+1}^{\eta,\ell})^\top g_{T+1} + \pi_1^c e^{-\sum_{t=1}^T c_t(x_t^c)} \eta^c (x_{T+1} - x_{T+1}^c)^\top g_{T+1}
= (a_T x_{T+1} - b_T)^\top g_{T+1}   (18)

where the inequality is due to (14), (15), and (16),

a_T = \sum_\eta \pi_1^{\eta,s} e^{-\sum_{t=1}^T s_t^\eta(x_t^{\eta,s})} \eta + \pi_1^c e^{-\sum_{t=1}^T c_t(x_t^c)} \eta^c + \sum_\eta \pi_1^{\eta,\ell} e^{-\sum_{t=1}^T \ell_t^\eta(x_t^{\eta,\ell})} \eta

and

b_T = \sum_\eta \pi_1^{\eta,\ell} e^{-\sum_{t=1}^T \ell_t^\eta(x_t^{\eta,\ell})} \eta x_{T+1}^{\eta,\ell} + \sum_\eta \pi_1^{\eta,s} e^{-\sum_{t=1}^T s_t^\eta(x_t^{\eta,s})} \eta x_{T+1}^{\eta,s} + \pi_1^c e^{-\sum_{t=1}^T c_t(x_t^c)} \eta^c x_{T+1}^c.
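The step that presumably closes this part of the argument follows from the play rule of Algorithm 1. After T rounds the weights satisfy π_{T+1}^{η,s} ∝ π_1^{η,s} e^{-\sum_{t=1}^T s_t^\eta(x_t^{\eta,s})} (and analogously for the other experts), with a common proportionality factor, so a_T and b_T are, up to that factor, exactly the denominator and numerator of line 4 of Algorithm 1. Hence

x_{T+1} = \frac{b_T}{a_T} \quad \Longrightarrow \quad \Phi_{T+1} - \Phi_T \le (a_T x_{T+1} - b_T)^\top g_{T+1} = 0,

so the potential is non-increasing over time.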
We finish the proof by noticing that for every grid point η,

\ln\frac{1}{\pi_1^{\eta,s}} \le \ln\left( 3\left(\tfrac{1}{2}\log T + 1\right)\left(\tfrac{1}{2}\log T + 2\right) \right) \le 2\ln\left(\sqrt{3}\left(\tfrac{1}{2}\log_2 T + 3\right)\right),

\ln\frac{1}{\pi_1^{\eta,\ell}} \le \ln\left( 3\left(\tfrac{1}{2}\log T + 1\right)\left(\tfrac{1}{2}\log T + 2\right) \right) \le 2\ln\left(\sqrt{3}\left(\tfrac{1}{2}\log_2 T + 3\right)\right),

and \ln\frac{1}{\pi_1^c} = \ln 3.

4.2 EXPERT REGRET

For the regret of each expert, we have the following lemma. The proof is postponed to the appendix.

where the last inequality follows from (13) and (23). Next, to achieve the regret of (10), we upper bound R(T) by making use of the properties of s_t^η. For every grid point η, we have

R(T) \overset{(1)}{=} \sum_{t=1}^T f_t(x_t) - \sum_{t=1}^T f_t(x_*)
\overset{(2)}{\le} \sum_{t=1}^T g_t^\top (x_t - x_*)
\overset{(6)}{=} \frac{\sum_{t=1}^T \left( -s_t^\eta(x_*) + \eta^2 G^2 \|x_* - x_t\|^2 \right)}{\eta}
= \frac{\sum_{t=1}^T \left( s_t^\eta(x_t) - s_t^\eta(x_t^{\eta,s}) \right) + \sum_{t=1}^T \left( s_t^\eta(x_t^{\eta,s}) - s_t^\eta(x_*) \right)}{\eta} + \eta G^2 \sum_{t=1}^T \|x_* - x_t\|^2
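The remaining steps of this derivation are not shown here. Under the definitions above, a plausible sketch of how it continues is: bound the first sum by the meta regret (11), bound the second sum by the expert-regret lemma of Section 4.2, and note that \eta G^2 \sum_{t=1}^T \|x_* - x_t\|^2 = \eta V_T^s, which gives, for every grid point η,

R(T) \le \frac{2\ln\left(\sqrt{3}\left(\tfrac{1}{2}\log_2 T + 3\right)\right) + O(\log T)}{\eta} + \eta V_T^s.

Choosing the grid point η_i closest to the optimal trade-off (the grid \{\eta_i\} covers the range [\tfrac{1}{5DG\sqrt{T}}, \tfrac{1}{5DG}] up to a factor of 2) then yields the O(\sqrt{V_T^s \log T} + \log T) bound of (10).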
Fig. 1: Empirical results of Maler and MetaGrad for online regression and classification
5.1 ONLINE REGRESSION

We consider mini-batch least mean square regression with an ℓ₂-regularizer, which is a classic problem belonging to online strongly convex optimization. In each round t, a small batch of training examples {(x_{t,1}, y_{t,1}), ..., (x_{t,n}, y_{t,n})} arrives, and at the same time, the learner makes a prediction of the unknown parameter w_*, denoted as w_t, and suffers a loss, defined as

f_t(w) = \frac{1}{n}\sum_{i=1}^n \left( w^\top x_{t,i} - y_{t,i} \right)^2 + \lambda\|w\|^2.   (26)

We conduct the experiment on a synthetic data set, which is constructed as follows. We sample w_* and the feature vectors x_{t,i} uniformly at random from the d-ball of diameter 1 and 10 respectively, and generate y_{t,i} according to a linear model: y_{t,i} = w_*^\top x_{t,i} + \eta_t, where the noise \eta_t is drawn from a normal distribution. We set batch size n = 200, λ = 0.001, d = 50, and T = 200. The regret vs. time horizon is shown in Fig. 1(a). It can be seen that Maler achieves a faster convergence rate than MetaGrad.

5.2 ONLINE CLASSIFICATION

Next, we consider online classification by using logistic regression. In each round t, we receive a batch of training examples {(x_{t,1}, y_{t,1}), ..., (x_{t,n}, y_{t,n})}, and choose a linear classifier w_t. After that, we suffer a logistic loss

f_t(w) = \frac{1}{n}\sum_{i=1}^n \log\left( 1 + \exp\left( -y_{t,i} w^\top x_{t,i} \right) \right)   (27)

which is exp-concave. We conduct the experiments on a classic real-world data set, a9a (Chang and Lin, 2011). We scale all feature vectors to the unit ball, and restrict the decision set D to be a ball of radius 0.5 centered at the origin, so that Assumptions 1 and 2 are satisfied. We set batch size n = 200, and T = 100. The regret vs. time horizon is shown in Fig. 1(b). It can be seen that Maler performs better than MetaGrad. Although the worst-case regret bounds of Maler and MetaGrad for exp-concave loss are of the same order, the experimental results are not surprising since Maler enjoys a tighter data-dependent regret bound than that of MetaGrad.

6 CONCLUSION AND FUTURE WORK

In this paper, we propose a universal algorithm for online convex optimization, which achieves the optimal O(√T), O(d log T) and O(log T) regret bounds for general convex, exp-concave and strongly convex functions respectively, and enjoys a new type of data-dependent bound. The main idea is to consider different types of learning algorithms and learning rates at the same time. Experiments on online regression and online classification problems demonstrate the effectiveness of our method. In the future, we will investigate whether our proposed algorithm can be extended to achieve broader adaptivity in various directions, for example, adapting to changing environments (Hazan and Seshadhri, 2007; Jun et al., 2017) and/or adapting to data structures (Reddi et al., 2018; Wang et al., 2019).

Acknowledgement

This work was partially supported by NSFC-NRF Joint Research Project (61861146001), YESS (2017QNRC001), and Zhejiang Provincial Key Laboratory of Service Robot.
References

Abernethy, J., Bartlett, P. L., Rakhlin, A., and Tewari, A. (2008). Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the 21st Annual Conference on Learning Theory, pages 415-423.

Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:1-27.

Daniely, A., Gonen, A., and Shalev-Shwartz, S. (2015). Strongly adaptive online learning. In Proceedings of the 32nd International Conference on Machine Learning, pages 1405-1411.

Do, C. B., Le, Q. V., and Foo, C.-S. (2009). Proximal regularization for online and batch learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 257-264.

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121-2159.

Hall, E. C. and Willett, R. M. (2013). Dynamical models and tracking regret in online convex programming. In Proceedings of the 30th International Conference on Machine Learning, pages 579-587.

Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69:169-192.

Hazan, E. et al. (2016). Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157-325.

Hazan, E., Rakhlin, A., and Bartlett, P. L. (2008). Adaptive online gradient descent. In Advances in Neural Information Processing Systems 21, pages 65-72.

Hazan, E. and Seshadhri, C. (2007). Adaptive algorithms for online decision problems. In Electronic Colloquium on Computational Complexity.

Jun, K.-S., Orabona, F., Wright, S., and Willett, R. (2017). Improved strongly adaptive online learning using coin betting. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 943-951.

Reddi, S. J., Kale, S., and Kumar, S. (2018). On the convergence of Adam and beyond. In Proceedings of the 6th International Conference on Learning Representations.

Shalev-Shwartz, S. et al. (2012). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194.

Tieleman, T. and Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, pages 26-31.

van Erven, T., Grünwald, P. D., Mehta, N. A., Reid, M. D., and Williamson, R. C. (2015). Fast rates in statistical and online learning. Journal of Machine Learning Research, 16:1793-1861.

van Erven, T. and Koolen, W. M. (2016). MetaGrad: Multiple learning rates in online learning. In Advances in Neural Information Processing Systems 29, pages 3666-3674.

Wang, G., Lu, S., Tu, W., and Zhang, L. (2019). SAdam: A variant of Adam for strongly convex functions. arXiv preprint arXiv:1905.02957.

Wang, G., Zhao, D., and Zhang, L. (2018). Minimizing adaptive regret with one gradient per iteration. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 2762-2768.

Zhang, L., Liu, T.-Y., and Zhou, Z.-H. (2019). Adaptive regret of convex and smooth functions. In Proceedings of the 36th International Conference on Machine Learning, pages 7414-7423.

Zhang, L., Lu, S., and Zhou, Z.-H. (2018a). Adaptive online learning in dynamic environments. In Advances in Neural Information Processing Systems 31, pages 1330-1340.

Zhang, L., Yang, T., Jin, R., and Zhou, Z.-H. (2018b). Dynamic regret of strongly adaptive methods. In Proceedings of the 35th International Conference on Machine Learning, pages 5877-5886.

Zhang, L., Yang, T., Yi, J., Jin, R., and Zhou, Z.-H. (2017). Improved dynamic regret for non-degenerate functions. In Advances in Neural Information Processing Systems 30, pages 732-741.

Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928-936.