
Foundations of Machine Learning

On-Line Learning
Motivation
PAC learning:
• distribution fixed over time (training and test).
• IID assumption.
On-line learning:
• no distributional assumption.
• worst-case analysis (adversarial).
• mixed training and test.
• Performance measure: mistake model, regret.



This Lecture
Prediction with expert advice
Linear classification



General On-Line Setting
For t = 1 to T do
• receive instance x_t ∈ X.
• predict ŷ_t ∈ Y.
• receive label y_t ∈ Y.
• incur loss L(ŷ_t, y_t).
Classification: Y = {0, 1}, L(ŷ, y) = |ŷ − y|.
Regression: Y ⊆ R, L(ŷ, y) = (ŷ − y)².
Objective: minimize total loss Σ_{t=1}^T L(ŷ_t, y_t).



Prediction with Expert Advice
For t = 1 to T do
• receive instance x_t ∈ X and advice y_{t,i} ∈ Y, i ∈ [1, N].
• predict ŷ_t ∈ Y.
• receive label y_t ∈ Y.
• incur loss L(ŷ_t, y_t).
Objective: minimize regret, i.e., the difference between the total
loss incurred and that of the best expert:

Regret(T) = Σ_{t=1}^T L(ŷ_t, y_t) − min_{1≤i≤N} Σ_{t=1}^T L(y_{t,i}, y_t).
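A minimal Python simulation of this protocol with the 0/1 loss (the function and variable names below are illustrative, not from the lecture):

# Simulate prediction with expert advice under the 0/1 loss and compute the regret.
def run_expert_advice(expert_advice, labels, predictor):
    """expert_advice[t][i]: expert i's advice at round t; labels[t]: true label."""
    total_loss = 0.0
    expert_losses = [0.0] * len(expert_advice[0])
    for advice, y in zip(expert_advice, labels):
        y_hat = predictor(advice)              # predict before the label is revealed
        total_loss += float(y_hat != y)        # incur the 0/1 loss
        for i, y_i in enumerate(advice):       # cumulative loss of each expert
            expert_losses[i] += float(y_i != y)
    regret = total_loss - min(expert_losses)   # compare to the best expert in hindsight
    return total_loss, regret

advice = [[0, 1], [1, 1], [0, 0], [1, 0]]      # N = 2 experts, T = 4 rounds
labels = [1, 1, 0, 0]
majority = lambda a: int(sum(a) >= len(a) / 2)
print(run_expert_advice(advice, labels, majority))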



Mistake Bound Model
Definition: the maximum number of mistakes a
learning algorithm L makes to learn c is defined by
M_L(c) = max_{x_1,...,x_T} |mistakes(L, c)|.

Definition: for any concept class C, the maximum
number of mistakes a learning algorithm L makes is
M_L(C) = max_{c∈C} M_L(c).

A mistake bound is a bound M on M_L(C).



Halving Algorithm
see (Mitchell, 1997)

Halving(H)
 1  H_1 ← H
 2  for t ← 1 to T do
 3      Receive(x_t)
 4      ŷ_t ← MajorityVote(H_t, x_t)
 5      Receive(y_t)
 6      if ŷ_t ≠ y_t then
 7          H_{t+1} ← {c ∈ H_t : c(x_t) = y_t}
 8  return H_{T+1}
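A runnable Python sketch of Halving, under the assumption that hypotheses are given as callables and the stream is realizable (the names and the toy threshold class are illustrative):

import math

def halving(hypotheses, stream):
    """hypotheses: callables h(x) -> {0, 1}; stream: iterable of (x, y) pairs."""
    H = list(hypotheses)
    mistakes = 0
    for x, y in stream:
        votes = sum(h(x) for h in H)
        y_hat = int(votes >= len(H) / 2)        # majority vote of the version space
        if y_hat != y:
            mistakes += 1
            H = [h for h in H if h(x) == y]     # every mistake removes at least half of H
    return H, mistakes

# Example: threshold functions on {0, ..., 9}; the labels come from the threshold t = 8.
hyps = [lambda x, t=t: int(x >= t) for t in range(10)]
data = [(x, int(x >= 8)) for x in [5, 7, 9, 2, 8]]
print(halving(hyps, data)[1], "mistakes <= log2|H| =", math.log2(len(hyps)))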



Halving Algorithm - Bound
(Littlestone, 1988)
Theorem: Let H be a finite hypothesis set, then
M_Halving(H) ≤ log₂|H|.
Proof: At each mistake, the hypothesis set is
reduced at least by half.



VC Dimension Lower Bound
(Littlestone, 1988)
Theorem: Let opt(H) be the optimal mistake bound
for H. Then,
VCdim(H) ≤ opt(H) ≤ M_Halving(H) ≤ log₂|H|.
Proof: for a fully shattered set, form a complete
binary tree of the mistakes with height VCdim(H).



Weighted Majority Algorithm
(Littlestone and Warmuth, 1988)
Weighted-Majority(N experts)        (y_t, y_{t,i} ∈ {0, 1}, β ∈ [0, 1))
 1  for i ← 1 to N do
 2      w_{1,i} ← 1
 3  for t ← 1 to T do
 4      Receive(x_t)
 5      ŷ_t ← 1 if Σ_{i: y_{t,i}=1} w_{t,i} ≥ Σ_{i: y_{t,i}=0} w_{t,i}, else 0        (weighted majority vote)
 6      Receive(y_t)
 7      if ŷ_t ≠ y_t then
 8          for i ← 1 to N do
 9              if (y_{t,i} ≠ y_t) then
10                  w_{t+1,i} ← β w_{t,i}
11              else w_{t+1,i} ← w_{t,i}
12  return w_{T+1}
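A Python sketch of Weighted Majority with parameter β ∈ [0, 1) (the helper name and toy data are illustrative):

def weighted_majority(expert_advice, labels, beta=0.5):
    """expert_advice[t][i] in {0,1}; labels[t] in {0,1}; returns (mistakes, weights)."""
    N = len(expert_advice[0])
    w = [1.0] * N
    mistakes = 0
    for advice, y in zip(expert_advice, labels):
        vote_1 = sum(w[i] for i in range(N) if advice[i] == 1)
        vote_0 = sum(w[i] for i in range(N) if advice[i] == 0)
        y_hat = int(vote_1 >= vote_0)              # weighted majority vote
        if y_hat != y:
            mistakes += 1                          # on a mistake, discount the wrong experts
            w = [w[i] * (beta if advice[i] != y else 1.0) for i in range(N)]
    return mistakes, w

advice = [[1, 0, 1], [0, 0, 1], [1, 1, 1], [0, 1, 0]]
labels = [1, 1, 1, 0]
print(weighted_majority(advice, labels))
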
Weighted Majority - Bound
Theorem: Let m_t be the number of mistakes made
by the WM algorithm up to time t and m*_t that of the
best expert. Then, for all t,

m_t ≤ (log N + m*_t log(1/β)) / log(2/(1+β)).

• Thus, m_t ≤ O(log N) + constant × (mistakes of the best expert).

• Realizable case: m_t ≤ O(log N).

• Halving algorithm: β = 0.
Weighted Majority - Proof
Potential: Φ_t = Σ_{i=1}^N w_{t,i}.

Upper bound: after each error,
Φ_{t+1} ≤ [1/2 + (1/2)β] Φ_t = ((1+β)/2) Φ_t.
Thus, Φ_t ≤ ((1+β)/2)^{m_t} N.

Lower bound: for any expert i, Φ_t ≥ w_{t,i} = β^{m_{t,i}}.

Comparison: β^{m*_t} ≤ ((1+β)/2)^{m_t} N implies
m*_t log β ≤ log N + m_t log((1+β)/2)
⟹ m_t log(2/(1+β)) ≤ log N + m*_t log(1/β).



Weighted Majority - Notes
Advantage: remarkable bound requiring no
assumption.
Disadvantage: no deterministic algorithm can
achieve a regret R_T = o(T) with the binary loss.
• better guarantee with randomized WM (a sketch follows below).
• better guarantee for WM with convex losses.
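The randomized variant is only mentioned above; the following is a minimal sketch under one common formulation (follow an expert drawn with probability proportional to its weight and discount every wrong expert by β at each round), not a transcription of the slides. It returns the expected number of mistakes:

def randomized_weighted_majority(expert_advice, labels, beta=0.5):
    N = len(expert_advice[0])
    w = [1.0] * N
    expected_mistakes = 0.0
    for advice, y in zip(expert_advice, labels):
        total = sum(w)
        # the probability of a mistake is the weight fraction of the wrong experts
        expected_mistakes += sum(w[i] for i in range(N) if advice[i] != y) / total
        w = [w[i] * (beta if advice[i] != y else 1.0) for i in range(N)]
    return expected_mistakes

advice = [[1, 0, 1], [0, 0, 1], [1, 1, 1], [0, 1, 0]]
labels = [1, 1, 1, 0]
print(randomized_weighted_majority(advice, labels))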



Exponential Weighted Average
Algorithm:
• weight update: w_{t+1,i} ← w_{t,i} e^{−η L(y_{t,i}, y_t)} = e^{−η L_{t,i}},
  where L_{t,i} is the total loss incurred by expert i up to time t.
• prediction: ŷ_t = ( Σ_{i=1}^N w_{t,i} y_{t,i} ) / ( Σ_{i=1}^N w_{t,i} ).

Theorem: assume that L is convex in its first
argument and takes values in [0, 1]. Then, for any η > 0
and any sequence y_1, ..., y_T ∈ Y, the regret at T satisfies
Regret(T) ≤ (log N)/η + ηT/8.
For η = √(8 log N / T),
Regret(T) ≤ √((T/2) log N).
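A minimal Python sketch of this forecaster; the squared loss and toy data are assumptions of the sketch (any loss convex in its first argument with values in [0, 1] fits the theorem):

import math

def exponential_weighted_average(expert_preds, labels, loss=lambda p, y: (p - y) ** 2):
    N, T = len(expert_preds[0]), len(labels)
    eta = math.sqrt(8.0 * math.log(N) / T)       # the eta of the theorem; needs the horizon T
    cum_loss = [0.0] * N                          # L_{t,i}: total loss of expert i so far
    learner_loss = 0.0
    for preds, y in zip(expert_preds, labels):
        w = [math.exp(-eta * L) for L in cum_loss]
        y_hat = sum(wi * p for wi, p in zip(w, preds)) / sum(w)   # weighted average prediction
        learner_loss += loss(y_hat, y)
        cum_loss = [L + loss(p, y) for L, p in zip(cum_loss, preds)]
    return learner_loss, learner_loss - min(cum_loss)             # (total loss, regret)

preds = [[0.1, 0.9], [0.2, 0.8], [0.8, 0.3], [0.9, 0.1]]
labels = [0.0, 0.0, 1.0, 1.0]
print(exponential_weighted_average(preds, labels))
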
Exponential Weighted Avg - Proof
Potential: Φ_t = log Σ_{i=1}^N w_{t,i}.
Upper bound:
Φ_t − Φ_{t−1} = log [ Σ_{i=1}^N w_{t−1,i} e^{−η L(y_{t,i}, y_t)} / Σ_{i=1}^N w_{t−1,i} ]
             = log E_{w_{t−1}}[ e^{−η L(y_{t,i}, y_t)} ]
             = log ( E_{w_{t−1}}[ exp( −η ( L(y_{t,i}, y_t) − E_{w_{t−1}}[L(y_{t,i}, y_t)] ) ) ] ) − η E_{w_{t−1}}[L(y_{t,i}, y_t)]
             ≤ η²/8 − η E_{w_{t−1}}[L(y_{t,i}, y_t)]        (Hoeffding's ineq.)
             ≤ η²/8 − η L( E_{w_{t−1}}[y_{t,i}], y_t )        (convexity of first arg. of L)
             = −η L(ŷ_t, y_t) + η²/8.



Exponential Weighted Avg - Proof
Upper bound: summing up the inequalities yields
Φ_T − Φ_0 ≤ −η Σ_{t=1}^T L(ŷ_t, y_t) + η²T/8.
Lower bound:
Φ_T − Φ_0 = log Σ_{i=1}^N e^{−η L_{T,i}} − log N ≥ log max_{1≤i≤N} e^{−η L_{T,i}} − log N
          = −η min_{1≤i≤N} L_{T,i} − log N.
Comparison:
−η min_{1≤i≤N} L_{T,i} − log N ≤ −η Σ_{t=1}^T L(ŷ_t, y_t) + η²T/8
⟹ Σ_{t=1}^T L(ŷ_t, y_t) − min_{1≤i≤N} L_{T,i} ≤ (log N)/η + ηT/8.
Exponential Weighted Avg - Notes
Advantage: the bound on the regret per round is of the
form R_T/T = O(√((log N)/T)).
Disadvantage: the choice of η requires knowledge of the
horizon T.



Doubling Trick
Idea: divide time into periods [2^k, 2^{k+1} − 1] of length 2^k
with k = 0, ..., n, T ≥ 2^n − 1, and choose η_k = √((8 log N)/2^k)
in each period.
Theorem: with the same assumptions as before, for
any T, the following holds:

Regret(T) ≤ (√2/(√2 − 1)) √((T/2) log N) + √((log N)/2).
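A Python sketch of the trick: restart the exponentially weighted average forecaster on each period with η_k = √((8 log N)/2^k). The helper ewa_period and the squared loss are assumptions of this sketch:

import math

def ewa_period(expert_preds, labels, eta, loss):
    """Run one period; return (learner loss, per-expert losses) on that period."""
    N = len(expert_preds[0])
    cum, total = [0.0] * N, 0.0
    for preds, y in zip(expert_preds, labels):
        w = [math.exp(-eta * L) for L in cum]
        y_hat = sum(wi * p for wi, p in zip(w, preds)) / sum(w)
        total += loss(y_hat, y)
        cum = [L + loss(p, y) for L, p in zip(cum, preds)]
    return total, cum

def doubling_trick(expert_preds, labels, loss=lambda p, y: (p - y) ** 2):
    N, T = len(expert_preds[0]), len(labels)
    learner, experts = 0.0, [0.0] * N
    k, start = 0, 0
    while start < T:
        length = 2 ** k                                    # period I_k = [2^k, 2^(k+1) - 1]
        eta_k = math.sqrt(8.0 * math.log(N) / length)
        tot, cum = ewa_period(expert_preds[start:start + length],
                              labels[start:start + length], eta_k, loss)
        learner, experts = learner + tot, [a + b for a, b in zip(experts, cum)]
        k, start = k + 1, start + length
    return learner, learner - min(experts)                 # (total loss, regret)

preds = [[0.1, 0.9]] * 8
labels = [0.0] * 8
print(doubling_trick(preds, labels))

Restarting discards the weights accumulated in earlier periods; the √2/(√2 − 1) factor in the theorem is the price paid for these restarts.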



Doubling Trick - Proof
By the previous theorem, for any I_k = [2^k, 2^{k+1} − 1],
L_{I_k} − min_{1≤i≤N} L_{I_k,i} ≤ √((2^k/2) log N).

Thus, L_T = Σ_{k=0}^n L_{I_k} ≤ Σ_{k=0}^n min_{1≤i≤N} L_{I_k,i} + Σ_{k=0}^n √(2^k (log N)/2)
          ≤ min_{1≤i≤N} L_{T,i} + √((log N)/2) Σ_{k=0}^n 2^{k/2},

with
Σ_{k=0}^n 2^{k/2} = (2^{(n+1)/2} − 1)/(√2 − 1) ≤ (√2 √(T+1) − 1)/(√2 − 1)
                 ≤ (√2 (√T + 1) − 1)/(√2 − 1) ≤ √2 √T/(√2 − 1) + 1.



Notes
Doubling trick used in a variety of other contexts
and proofs.
More general method, learning parameter function
of time: η_t = √((8 log N)/t). Constant factor
improvement:

Regret(T) ≤ 2√((T/2) log N) + √((1/8) log N).



This Lecture
Prediction with expert advice
Linear classification



Perceptron Algorithm
(Rosenblatt, 1958)

Perceptron(w_0)
 1  w_1 ← w_0        (typically w_0 = 0)
 2  for t ← 1 to T do
 3      Receive(x_t)
 4      ŷ_t ← sgn(w_t · x_t)
 5      Receive(y_t)
 6      if (ŷ_t ≠ y_t) then
 7          w_{t+1} ← w_t + y_t x_t        (more generally η y_t x_t, η > 0)
 8      else w_{t+1} ← w_t
 9  return w_{T+1}
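A runnable Python sketch of the algorithm above (NumPy and the toy data are assumptions of this sketch):

import numpy as np

def perceptron(X, y, w0=None, eta=1.0):
    """X: (T, N) array of instances; y: length-T array of labels in {-1, +1}."""
    w = np.zeros(X.shape[1]) if w0 is None else np.array(w0, dtype=float)
    updates = 0
    for x_t, y_t in zip(X, y):
        y_hat = 1 if w @ x_t >= 0 else -1        # sgn(w_t . x_t), with sgn(0) taken as +1
        if y_hat != y_t:                          # mistake: add eta * y_t * x_t
            w = w + eta * y_t * x_t
            updates += 1
    return w, updates

X = np.array([[2.0, 1.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
print(perceptron(X, y))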



Separating Hyperplane
Margin and errors
[Figure: separating hyperplanes w·x = 0 with margin ρ and error terms y_i(w·x_i)/‖w‖.]



Perceptron = Stochastic Gradient Descent
Objective function: convex but not differentiable.
F(w) = (1/T) Σ_{t=1}^T max(0, −y_t(w·x_t)) = E_{x∼D̂}[f(w, x)],
with f(w, x) = max(0, −y(w·x)).

Stochastic gradient: for each x_t, the update is
w_{t+1} ← w_t − η∇_w f(w_t, x_t) if differentiable, w_t otherwise,
where η > 0 is a learning rate parameter.
Here:
w_{t+1} ← w_t + η y_t x_t if y_t(w_t·x_t) < 0, w_t otherwise.
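The same update written as a stochastic (sub)gradient step on f(w, x) = max(0, −y(w·x)); a minimal sketch matching the rule above (NumPy and the nonzero starting point are assumptions, the latter only so the toy run actually updates):

import numpy as np

def perceptron_sgd_step(w, x, y, eta=1.0):
    if y * (w @ x) < 0:             # f is differentiable here, with gradient -y x
        return w + eta * y * x      # w - eta * grad_w f(w, x)
    return w                        # zero (sub)gradient: no change

w = np.array([-1.0, 0.5])
for x_t, y_t in [(np.array([2.0, 1.0]), 1), (np.array([-1.0, 2.0]), -1), (np.array([-2.0, -1.0]), -1)]:
    w = perceptron_sgd_step(w, x_t, y_t)
print(w)
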
Perceptron Algorithm - Bound
(Novikoff, 1962)
Theorem: Assume that ‖x_t‖ ≤ R for all t ∈ [1, T] and
that for some ρ > 0 and v ∈ R^N, for all t ∈ [1, T],
y_t(v · x_t)/‖v‖ ≥ ρ.
Then, the number of mistakes made by the
Perceptron algorithm is bounded by R²/ρ².
Proof: Let I be the set of times t at which there is an
update and let M be the total number of updates.



• Summing up the assumption inequalities gives:
M ρ ≤ ( v · Σ_{t∈I} y_t x_t ) / ‖v‖
    = ( v · Σ_{t∈I} (w_{t+1} − w_t) ) / ‖v‖        (definition of updates)
    = ( v · w_{T+1} ) / ‖v‖
    ≤ ‖w_{T+1}‖                                    (Cauchy-Schwarz ineq.)
    = ‖w_{t_m} + y_{t_m} x_{t_m}‖                  (t_m largest t in I)
    = ( ‖w_{t_m}‖² + ‖x_{t_m}‖² + 2 y_{t_m} w_{t_m}·x_{t_m} )^{1/2}        (last term ≤ 0 at an update)
    ≤ ( ‖w_{t_m}‖² + R² )^{1/2}
    ≤ ( M R² )^{1/2} = √M R.                       (applying the same to previous t's in I)
Thus M ρ ≤ √M R, which gives M ≤ R²/ρ².



• Notes:
• bound independent of the dimension and tight.
• convergence can be slow for small margins; it can
be in Ω(2^N).
• among the many variants: the voted perceptron
algorithm. Predict according to
sgn( ( Σ_{t∈I} c_t w_t ) · x ),
where c_t is the number of iterations w_t survives.
• {x_t : t ∈ I} are the support vectors for the
perceptron algorithm.
• non-separable case: does not converge.


Perceptron - Leave-One-Out Analysis
Theorem: Let h_S be the hypothesis returned by the
perceptron algorithm for a sample S = (x_1, ..., x_T)
drawn according to D and let M(S) be the number of
updates defining h_S. Then,
E_{S∼D^m}[R(h_S)] ≤ E_{S∼D^{m+1}}[ min(M(S), R²_{m+1}/ρ²_{m+1}) / (m+1) ].

Proof: Let S ∼ D^{m+1} be a linearly separable sample
and let x ∈ S. If h_{S−{x}} misclassifies x, then x must
be a 'support vector' for h_S (update at x). Thus,
R_loo(perceptron) ≤ M(S)/(m+1).
Perceptron - Non-Separable Bound
(MM and Rostamizadeh, 2013)
Theorem: let I denote the set of rounds at which
the Perceptron algorithm makes an update when
processing x_1, ..., x_T and let M_T = |I|. Then,

M_T ≤ inf_{ρ>0, ‖u‖₂≤1} ( √(L_ρ(u)) + R/ρ )²,

where R = max_{t∈I} ‖x_t‖ and
L_ρ(u) = Σ_{t∈I} (1 − y_t(u·x_t)/ρ)_+.



• Proof: for any t, 1 ≤ (1 − y_t(u·x_t)/ρ)_+ + y_t(u·x_t)/ρ; summing
up these inequalities for t ∈ I yields:
M_T ≤ Σ_{t∈I} (1 − y_t(u·x_t)/ρ)_+ + Σ_{t∈I} y_t(u·x_t)/ρ
    ≤ L_ρ(u) + √(M_T) R/ρ,
by upper-bounding Σ_{t∈I} y_t(u·x_t) as in the proof
for the separable case.
• solving the second-degree inequality
M_T ≤ L_ρ(u) + √(M_T) R/ρ
gives √(M_T) ≤ ( R/ρ + √(R²/ρ² + 4 L_ρ(u)) )/2 ≤ R/ρ + √(L_ρ(u)).



Non-Separable Case - L2 Bound
(Freund and Schapire, 1998; MM and Rostamizadeh, 2013)
Theorem: let I denote the set of rounds at which
the Perceptron algorithm makes an update when
processing x_1, ..., x_T and let M_T = |I|. Then,

M_T ≤ inf_{ρ>0, ‖u‖₂≤1} [ ‖L_ρ(u)‖₂/2 + √( ‖L_ρ(u)‖₂²/4 + √(Σ_{t∈I} ‖x_t‖²)/ρ ) ]².

• when ‖x_t‖ ≤ R for all t ∈ I, this implies

M_T ≤ inf_{ρ>0, ‖u‖₂≤1} ( R/ρ + ‖L_ρ(u)‖₂ )²,

where L_ρ(u) = ( (1 − y_t(u·x_t)/ρ)_+ )_{t∈I}.
• Proof: Reduce the problem to the separable case in higher
dimension. Let l_t = (1 − y_t(u·x_t)/ρ)_+, for t ∈ [1, T].
• Mapping (similar to the trivial mapping): each x_t ∈ R^N is mapped to
x'_t ∈ R^{N+T}, which coincides with x_t on the first N components and has
Δ in the (N+t)-th component and 0 elsewhere; u is mapped to
u' = ( u_1/Z, ..., u_N/Z, ρ y_1 l_1/(Z Δ), ..., ρ y_T l_T/(Z Δ) ),
with Z chosen so that ‖u'‖ = 1, that is Z = √(1 + ρ² ‖L_ρ(u)‖₂²/Δ²).


• Observe that the Perceptron algorithm makes the
same predictions and makes updates at the same
rounds when processing x'_1, ..., x'_T.
• For any t ∈ I,
y_t(u' · x'_t) = y_t( (u·x_t)/Z + Δ ρ y_t l_t/(Z Δ) )
             = y_t(u·x_t)/Z + ρ l_t/Z
             = (1/Z) y_t(u·x_t) + (1/Z) [ρ − y_t(u·x_t)]_+ ≥ ρ/Z.
• Summing up and using the proof in the separable
case yields:
M_T ρ/Z ≤ Σ_{t∈I} y_t(u' · x'_t) ≤ √(Σ_{t∈I} ‖x'_t‖²).
• The inequality can be rewritten as
M_T² ρ² ≤ (r² + M_T Δ²)(1 + ρ²‖L_ρ(u)‖₂²/Δ²)
        = r² + M_T Δ² + r²ρ²‖L_ρ(u)‖₂²/Δ² + M_T ρ²‖L_ρ(u)‖₂²,

where r = √(Σ_{t∈I} ‖x_t‖²).

• Selecting Δ to minimize the bound gives Δ² = ρ r ‖L_ρ(u)‖₂/√(M_T)
and leads to
M_T² ρ² ≤ r² + 2√(M_T) ρ r ‖L_ρ(u)‖₂ + M_T ρ²‖L_ρ(u)‖₂² = ( r + √(M_T) ρ ‖L_ρ(u)‖₂ )².

• Solving the second-degree inequality
M_T − √(M_T) ‖L_ρ(u)‖₂ − r/ρ ≤ 0
yields directly the first statement. The second one
results from replacing r with √(M_T) R.
Dual Perceptron Algorithm
Dual-Perceptron(α_0)
 1  α ← α_0        (typically α_0 = 0)
 2  for t ← 1 to T do
 3      Receive(x_t)
 4      ŷ_t ← sgn( Σ_{s=1}^T α_s y_s (x_s · x_t) )
 5      Receive(y_t)
 6      if (ŷ_t ≠ y_t) then
 7          α_t ← α_t + 1
 8  return α



Kernel Perceptron Algorithm
(Aizerman et al., 1964)

K PDS kernel.
Kernel-Perceptron(α_0)
 1  α ← α_0        (typically α_0 = 0)
 2  for t ← 1 to T do
 3      Receive(x_t)
 4      ŷ_t ← sgn( Σ_{s=1}^T α_s y_s K(x_s, x_t) )
 5      Receive(y_t)
 6      if (ŷ_t ≠ y_t) then
 7          α_t ← α_t + 1
 8  return α
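A Python sketch of the kernel Perceptron (the Gaussian kernel and toy data are illustrative assumptions; with K(x, x') = x·x' this reduces to the dual Perceptron above):

import numpy as np

def kernel_perceptron(X, y, kernel):
    """Keep one counter alpha_s per point; predict with sum_s alpha_s y_s K(x_s, x)."""
    T = len(X)
    alpha = np.zeros(T)
    for t in range(T):
        score = sum(alpha[s] * y[s] * kernel(X[s], X[t]) for s in range(T))
        y_hat = 1 if score >= 0 else -1
        if y_hat != y[t]:
            alpha[t] += 1.0          # on a mistake, add x_t to the expansion
    return alpha

gaussian = lambda a, b, sigma=1.0: np.exp(-np.linalg.norm(a - b) ** 2 / (2 * sigma ** 2))
X = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, -1.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])
print(kernel_perceptron(X, y, gaussian))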



Winnow Algorithm
(Littlestone, 1988)
Winnow(η)
 1  w_1 ← 1/N
 2  for t ← 1 to T do
 3      Receive(x_t)
 4      ŷ_t ← sgn(w_t · x_t)        (y_t ∈ {−1, +1})
 5      Receive(y_t)
 6      if (ŷ_t ≠ y_t) then
 7          Z_t ← Σ_{i=1}^N w_{t,i} exp(η y_t x_{t,i})
 8          for i ← 1 to N do
 9              w_{t+1,i} ← w_{t,i} exp(η y_t x_{t,i}) / Z_t
10      else w_{t+1} ← w_t
11  return w_{T+1}
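A Python sketch of Winnow run on a sparse target, the case favorable to Winnow discussed in the notes below (NumPy and the toy data are assumptions of this sketch):

import numpy as np

def winnow(X, y, eta=0.5):
    """X: (T, N) array with entries in [-1, +1]; y: labels in {-1, +1}."""
    T, N = X.shape
    w = np.full(N, 1.0 / N)
    mistakes = 0
    for x_t, y_t in zip(X, y):
        y_hat = 1 if w @ x_t >= 0 else -1
        if y_hat != y_t:
            mistakes += 1
            w = w * np.exp(eta * y_t * x_t)   # multiplicative update on a mistake
            w /= w.sum()                      # renormalize (division by Z_t)
    return w, mistakes

# The label depends only on the first coordinate (v = e_1).
rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(200, 50))
y = np.sign(X[:, 0])
print(winnow(X, y)[1], "mistakes")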



Notes
Winnow = weighted majority:
• for y_{t,i} = x_{t,i} ∈ {−1, +1}, sgn(w_t · x_t) coincides with
the majority vote.
• multiplying by e^η or e^{−η} the weight of correct or
incorrect experts is equivalent to multiplying
by β = e^{−2η} the weight of incorrect ones.
Relationships with other algorithms: e.g., boosting
and Perceptron (Winnow and Perceptron can be
viewed as special instances of a general family).



Winnow Algorithm - Bound
Theorem: Assume that ‖x_t‖_∞ ≤ R_∞ for all t ∈ [1, T] and
that for some ρ_∞ > 0 and v ∈ R^N, v ≥ 0, for all t ∈ [1, T],
y_t(v · x_t)/‖v‖₁ ≥ ρ_∞.
Then, the number of mistakes made by the
Winnow algorithm is bounded by 2 (R_∞²/ρ_∞²) log N.
Proof: Let I be the set of times t at which there is an
update and let M be the total number of updates.



Notes
Comparison with perceptron bound:
• dual norms: ‖·‖_∞ norm for x_t, ‖·‖₁ norm for v.
• similar bounds with different norms.
• each advantageous in different cases:

Winnow bound favorable when a sparse set of
experts can predict well. For example, if v = e_1
and x_t ∈ {±1}^N, log N vs N.

Perceptron favorable in opposite situation.



Winnow Algorithm - Bound
Potential: Φ_t = Σ_{i=1}^N (v_i/‖v‖₁) log( (v_i/‖v‖₁) / w_{t,i} ).        (relative entropy)

Upper bound: for each t in I,
Φ_{t+1} − Φ_t = Σ_{i=1}^N (v_i/‖v‖₁) log( w_{t,i}/w_{t+1,i} )
             = Σ_{i=1}^N (v_i/‖v‖₁) log( Z_t / exp(η y_t x_{t,i}) )
             = log Z_t − η Σ_{i=1}^N (v_i/‖v‖₁) y_t x_{t,i}
             ≤ log [ Σ_{i=1}^N w_{t,i} exp(η y_t x_{t,i}) ] − ηρ_∞
             = log E_{w_t}[ exp(η y_t x_t) ] − ηρ_∞
             ≤ log [ exp(η²(2R_∞)²/8) ] + η y_t w_t · x_t − ηρ_∞        (Hoeffding)
             ≤ η²R_∞²/2 − ηρ_∞.        (y_t w_t · x_t ≤ 0 at an update)



Winnow Algorithm - Bound
Upper bound: summing up the inequalities yields
Φ_{T+1} − Φ_1 ≤ M (η²R_∞²/2 − ηρ_∞).
Lower bound: note that
Φ_1 = Σ_{i=1}^N (v_i/‖v‖₁) log( (v_i/‖v‖₁)/(1/N) )
    = log N + Σ_{i=1}^N (v_i/‖v‖₁) log(v_i/‖v‖₁) ≤ log N,

and for all t, Φ_t ≥ 0 (property of relative entropy).
Thus, Φ_{T+1} − Φ_1 ≥ 0 − log N = −log N.
Comparison: −log N ≤ M(η²R_∞²/2 − ηρ_∞). For η = ρ_∞/R_∞²,
we obtain M ≤ 2 (R_∞²/ρ_∞²) log N.
Conclusion
On-line learning:
• wide and fast-growing literature.
• many related topics, e.g., game theory, text
compression, convex optimization.
• online to batch bounds and techniques.
• online version of batch algorithms, e.g.,
regression algorithms (see regression lecture).



References
• Aizerman, M. A., Braverman, E. M., & Rozonoer, L. I. (1964). Theoretical foundations of the
potential function method in pattern recognition learning. Automation and Remote Control,
25, 821-837.

• Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile: On the Generalization Ability of On-
Line Learning Algorithms. IEEE Transactions on Information Theory 50(9): 2050-2057. 2004.

• Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge
University Press, 2006.

• Yoav Freund and Robert Schapire. Large margin classification using the perceptron
algorithm. In Proceedings of COLT 1998. ACM Press, 1998.

• Nick Littlestone. From On-Line to Batch Learning. COLT 1989: 269-284.

• Nick Littlestone. Learning Quickly When Irrelevant Attributes Abound: A New Linear-
threshold Algorithm. Machine Learning, 2:285-318, 1988.



References
• Nick Littlestone, Manfred K. Warmuth: The Weighted Majority Algorithm. FOCS 1989:
256-261.

• Tom Mitchell. Machine Learning, McGraw Hill, 1997.

• Mehryar Mohri and Afshin Rostamizadeh. Perceptron Mistake Bounds. arXiv:1305.0208, 2013.

• Novikoff, A. B. (1962). On convergence proofs on perceptrons. Symposium on the
Mathematical Theory of Automata, 12, 615-622. Polytechnic Institute of Brooklyn.

• Rosenblatt, Frank, The Perceptron: A Probabilistic Model for Information Storage and
Organization in the Brain, Cornell Aeronautical Laboratory, Psychological Review, v65, No. 6,
pp. 386-408, 1958.

• Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.



Appendix



SVMs - Leave-One-Out Analysis
(Vapnik, 1995)
Theorem: let h_S be the optimal hyperplane for a
sample S and let N_SV(S) be the number of support
vectors defining h_S. Then,
E_{S∼D^m}[R(h_S)] ≤ E_{S∼D^{m+1}}[ min(N_SV(S), R²_{m+1}/ρ²_{m+1}) / (m+1) ].
Proof: one part proven in lecture 4. The other part is
due to α_i ≥ 1/R²_{m+1} for x_i misclassified by SVMs.



Comparison
Bounds on expected error, not high probability
statements.
Leave-one-out bounds not sufficient to distinguish
SVMs and perceptron algorithm. Note however:
• same maximum margin ρ_{m+1} can be used in both.
• but different radius R_{m+1} of support vectors.
Difference: margin distribution.

