
Foundations of Machine Learning

On-Line Learning
Motivation
PAC learning:
• distribution fixed over time (training and test).
• IID assumption.
On-line learning:
• no distributional assumption.
• worst-case analysis (adversarial).
• mixed training and test.
• Performance measure: mistake model, regret.



This Lecture
Prediction with expert advice
Linear classification



General On-Line Setting
For t = 1 to T do
• receive instance x_t ∈ X.
• predict ŷ_t ∈ Y.
• receive label y_t ∈ Y.
• incur loss L(ŷ_t, y_t).
Classification: Y = {0, 1}, L(ŷ, y) = |ŷ − y|.
Regression: Y ⊆ R, L(ŷ, y) = (ŷ − y)².
Objective: minimize total loss Σ_{t=1}^T L(ŷ_t, y_t).



Prediction with Expert Advice
For t = 1 to T do
• receive instance x_t ∈ X and advice y_{t,i} ∈ Y, i ∈ [1, N].
• predict ŷ_t ∈ Y.
• receive label y_t ∈ Y.
• incur loss L(ŷ_t, y_t).
Objective: minimize regret, i.e., the difference between the total
loss incurred and that of the best expert:

Regret(T) = Σ_{t=1}^T L(ŷ_t, y_t) − min_{1≤i≤N} Σ_{t=1}^T L(y_{t,i}, y_t).
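A minimal Python simulation of this protocol with the 0/1 loss (the function and variable names below are illustrative, not from the lecture):

# Simulate prediction with expert advice under the 0/1 loss and compute the regret.
def run_expert_advice(expert_advice, labels, predictor):
    """expert_advice[t][i]: expert i's advice at round t; labels[t]: true label."""
    total_loss = 0.0
    expert_losses = [0.0] * len(expert_advice[0])
    for advice, y in zip(expert_advice, labels):
        y_hat = predictor(advice)              # predict before the label is revealed
        total_loss += float(y_hat != y)        # incur the 0/1 loss
        for i, y_i in enumerate(advice):       # cumulative loss of each expert
            expert_losses[i] += float(y_i != y)
    regret = total_loss - min(expert_losses)   # compare to the best expert in hindsight
    return total_loss, regret

advice = [[0, 1], [1, 1], [0, 0], [1, 0]]      # N = 2 experts, T = 4 rounds
labels = [1, 1, 0, 0]
majority = lambda a: int(sum(a) >= len(a) / 2)
print(run_expert_advice(advice, labels, majority))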



Mistake Bound Model
Definition: the maximum number of mistakes a
learning algorithm L makes to learn c is defined by
M_L(c) = max_{x_1,...,x_T} |mistakes(L, c)|.

Definition: for any concept class C, the maximum
number of mistakes a learning algorithm L makes is
M_L(C) = max_{c∈C} M_L(c).

A mistake bound is a bound M on M_L(C).



Halving Algorithm
see (Mitchell, 1997)

Halving(H)
 1  H_1 ← H
 2  for t ← 1 to T do
 3      Receive(x_t)
 4      ŷ_t ← MajorityVote(H_t, x_t)
 5      Receive(y_t)
 6      if ŷ_t ≠ y_t then
 7          H_{t+1} ← {c ∈ H_t : c(x_t) = y_t}
 8  return H_{T+1}
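A runnable Python sketch of Halving, under the assumption that hypotheses are given as callables and the stream is realizable (the names and the toy threshold class are illustrative):

import math

def halving(hypotheses, stream):
    """hypotheses: callables h(x) -> {0, 1}; stream: iterable of (x, y) pairs."""
    H = list(hypotheses)
    mistakes = 0
    for x, y in stream:
        votes = sum(h(x) for h in H)
        y_hat = int(votes >= len(H) / 2)        # majority vote of the version space
        if y_hat != y:
            mistakes += 1
            H = [h for h in H if h(x) == y]     # every mistake removes at least half of H
    return H, mistakes

# Example: threshold functions on {0, ..., 9}; the labels come from the threshold t = 8.
hyps = [lambda x, t=t: int(x >= t) for t in range(10)]
data = [(x, int(x >= 8)) for x in [5, 7, 9, 2, 8]]
print(halving(hyps, data)[1], "mistakes <= log2|H| =", math.log2(len(hyps)))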



Halving Algorithm - Bound
(Littlestone, 1988)
Theorem: Let H be a finite hypothesis set, then
M_Halving(H) ≤ log₂|H|.
Proof: At each mistake, the hypothesis set is
reduced at least by half.



VC Dimension Lower Bound
(Littlestone, 1988)
Theorem: Let opt(H) be the optimal mistake bound
for H. Then,
VCdim(H) ≤ opt(H) ≤ M_Halving(H) ≤ log₂|H|.
Proof: for a fully shattered set, form a complete
binary tree of the mistakes with height VCdim(H).



Weighted Majority Algorithm
(Littlestone and Warmuth, 1988)
Weighted-Majority(N experts)        (y_t, y_{t,i} ∈ {0, 1}, β ∈ [0, 1))
 1  for i ← 1 to N do
 2      w_{1,i} ← 1
 3  for t ← 1 to T do
 4      Receive(x_t)
 5      ŷ_t ← 1 if Σ_{i: y_{t,i}=1} w_{t,i} ≥ Σ_{i: y_{t,i}=0} w_{t,i}, else 0        (weighted majority vote)
 6      Receive(y_t)
 7      if ŷ_t ≠ y_t then
 8          for i ← 1 to N do
 9              if (y_{t,i} ≠ y_t) then
10                  w_{t+1,i} ← β w_{t,i}
11              else w_{t+1,i} ← w_{t,i}
12  return w_{T+1}
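A Python sketch of Weighted Majority with parameter β ∈ [0, 1) (the helper name and toy data are illustrative):

def weighted_majority(expert_advice, labels, beta=0.5):
    """expert_advice[t][i] in {0,1}; labels[t] in {0,1}; returns (mistakes, weights)."""
    N = len(expert_advice[0])
    w = [1.0] * N
    mistakes = 0
    for advice, y in zip(expert_advice, labels):
        vote_1 = sum(w[i] for i in range(N) if advice[i] == 1)
        vote_0 = sum(w[i] for i in range(N) if advice[i] == 0)
        y_hat = int(vote_1 >= vote_0)              # weighted majority vote
        if y_hat != y:
            mistakes += 1                          # on a mistake, discount the wrong experts
            w = [w[i] * (beta if advice[i] != y else 1.0) for i in range(N)]
    return mistakes, w

advice = [[1, 0, 1], [0, 0, 1], [1, 1, 1], [0, 1, 0]]
labels = [1, 1, 1, 0]
print(weighted_majority(advice, labels))
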
Weighted Majority - Bound
Theorem: Let m_t be the number of mistakes made
by the WM algorithm up to time t and m*_t that of the
best expert. Then, for all t,

m_t ≤ (log N + m*_t log(1/β)) / log(2/(1+β)).

• Thus, m_t ≤ O(log N) + constant × (mistakes of the best expert).

• Realizable case: m_t ≤ O(log N).

• Halving algorithm: β = 0.
Weighted Majority - Proof
Potential: Φ_t = Σ_{i=1}^N w_{t,i}.

Upper bound: after each error,
Φ_{t+1} ≤ [1/2 + (1/2)β] Φ_t = ((1+β)/2) Φ_t.
Thus, Φ_t ≤ ((1+β)/2)^{m_t} N.

Lower bound: for any expert i, Φ_t ≥ w_{t,i} = β^{m_{t,i}}.

Comparison: β^{m*_t} ≤ ((1+β)/2)^{m_t} N implies
m*_t log β ≤ log N + m_t log((1+β)/2)
⟹ m_t log(2/(1+β)) ≤ log N + m*_t log(1/β).



Weighted Majority - Notes
Advantage: remarkable bound requiring no
assumption.
Disadvantage: no deterministic algorithm can
achieve a regret R_T = o(T) with the binary loss.
• better guarantee with randomized WM (a sketch follows below).
• better guarantee for WM with convex losses.
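The randomized variant is only mentioned above; the following is a minimal sketch under one common formulation (follow an expert drawn with probability proportional to its weight and discount every wrong expert by β at each round), not a transcription of the slides. It returns the expected number of mistakes:

def randomized_weighted_majority(expert_advice, labels, beta=0.5):
    N = len(expert_advice[0])
    w = [1.0] * N
    expected_mistakes = 0.0
    for advice, y in zip(expert_advice, labels):
        total = sum(w)
        # the probability of a mistake is the weight fraction of the wrong experts
        expected_mistakes += sum(w[i] for i in range(N) if advice[i] != y) / total
        w = [w[i] * (beta if advice[i] != y else 1.0) for i in range(N)]
    return expected_mistakes

advice = [[1, 0, 1], [0, 0, 1], [1, 1, 1], [0, 1, 0]]
labels = [1, 1, 1, 0]
print(randomized_weighted_majority(advice, labels))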



Exponential Weighted Average
Algorithm:
• weight update: w_{t+1,i} ← w_{t,i} e^{−η L(y_{t,i}, y_t)} = e^{−η L_{t,i}},
  where L_{t,i} is the total loss incurred by expert i up to time t.
• prediction: ŷ_t = ( Σ_{i=1}^N w_{t,i} y_{t,i} ) / ( Σ_{i=1}^N w_{t,i} ).

Theorem: assume that L is convex in its first
argument and takes values in [0, 1]. Then, for any η > 0
and any sequence y_1, ..., y_T ∈ Y, the regret at T satisfies
Regret(T) ≤ (log N)/η + ηT/8.
For η = √(8 log N / T),
Regret(T) ≤ √((T/2) log N).
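A minimal Python sketch of this forecaster; the squared loss and toy data are assumptions of the sketch (any loss convex in its first argument with values in [0, 1] fits the theorem):

import math

def exponential_weighted_average(expert_preds, labels, loss=lambda p, y: (p - y) ** 2):
    N, T = len(expert_preds[0]), len(labels)
    eta = math.sqrt(8.0 * math.log(N) / T)       # the eta of the theorem; needs the horizon T
    cum_loss = [0.0] * N                          # L_{t,i}: total loss of expert i so far
    learner_loss = 0.0
    for preds, y in zip(expert_preds, labels):
        w = [math.exp(-eta * L) for L in cum_loss]
        y_hat = sum(wi * p for wi, p in zip(w, preds)) / sum(w)   # weighted average prediction
        learner_loss += loss(y_hat, y)
        cum_loss = [L + loss(p, y) for L, p in zip(cum_loss, preds)]
    return learner_loss, learner_loss - min(cum_loss)             # (total loss, regret)

preds = [[0.1, 0.9], [0.2, 0.8], [0.8, 0.3], [0.9, 0.1]]
labels = [0.0, 0.0, 1.0, 1.0]
print(exponential_weighted_average(preds, labels))
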
Exponential Weighted Avg - Proof
Potential: Φ_t = log Σ_{i=1}^N w_{t,i}.
Upper bound:
Φ_t − Φ_{t−1} = log [ Σ_{i=1}^N w_{t−1,i} e^{−η L(y_{t,i}, y_t)} / Σ_{i=1}^N w_{t−1,i} ]
             = log E_{w_{t−1}}[ e^{−η L(y_{t,i}, y_t)} ]
             = log ( E_{w_{t−1}}[ exp( −η ( L(y_{t,i}, y_t) − E_{w_{t−1}}[L(y_{t,i}, y_t)] ) ) ] ) − η E_{w_{t−1}}[L(y_{t,i}, y_t)]
             ≤ η²/8 − η E_{w_{t−1}}[L(y_{t,i}, y_t)]        (Hoeffding's ineq.)
             ≤ η²/8 − η L( E_{w_{t−1}}[y_{t,i}], y_t )        (convexity of first arg. of L)
             = −η L(ŷ_t, y_t) + η²/8.



Exponential Weighted Avg - Proof
Upper bound: summing up the inequalities yields
Φ_T − Φ_0 ≤ −η Σ_{t=1}^T L(ŷ_t, y_t) + η²T/8.
Lower bound:
Φ_T − Φ_0 = log Σ_{i=1}^N e^{−η L_{T,i}} − log N ≥ log max_{1≤i≤N} e^{−η L_{T,i}} − log N
          = −η min_{1≤i≤N} L_{T,i} − log N.
Comparison:
−η min_{1≤i≤N} L_{T,i} − log N ≤ −η Σ_{t=1}^T L(ŷ_t, y_t) + η²T/8
⟹ Σ_{t=1}^T L(ŷ_t, y_t) − min_{1≤i≤N} L_{T,i} ≤ (log N)/η + ηT/8.
Exponential Weighted Avg - Notes
Advantage: the bound on the regret per round is of the
form R_T/T = O(√((log N)/T)).
Disadvantage: the choice of η requires knowledge of the
horizon T.



Doubling Trick
Idea: divide time into periods [2^k, 2^{k+1} − 1] of length 2^k
with k = 0, ..., n, T ≥ 2^n − 1, and choose η_k = √((8 log N)/2^k)
in each period.
Theorem: with the same assumptions as before, for
any T, the following holds:

Regret(T) ≤ (√2/(√2 − 1)) √((T/2) log N) + √((log N)/2).
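A Python sketch of the trick: restart the exponentially weighted average forecaster on each period with η_k = √((8 log N)/2^k). The helper ewa_period and the squared loss are assumptions of this sketch:

import math

def ewa_period(expert_preds, labels, eta, loss):
    """Run one period; return (learner loss, per-expert losses) on that period."""
    N = len(expert_preds[0])
    cum, total = [0.0] * N, 0.0
    for preds, y in zip(expert_preds, labels):
        w = [math.exp(-eta * L) for L in cum]
        y_hat = sum(wi * p for wi, p in zip(w, preds)) / sum(w)
        total += loss(y_hat, y)
        cum = [L + loss(p, y) for L, p in zip(cum, preds)]
    return total, cum

def doubling_trick(expert_preds, labels, loss=lambda p, y: (p - y) ** 2):
    N, T = len(expert_preds[0]), len(labels)
    learner, experts = 0.0, [0.0] * N
    k, start = 0, 0
    while start < T:
        length = 2 ** k                                    # period I_k = [2^k, 2^(k+1) - 1]
        eta_k = math.sqrt(8.0 * math.log(N) / length)
        tot, cum = ewa_period(expert_preds[start:start + length],
                              labels[start:start + length], eta_k, loss)
        learner, experts = learner + tot, [a + b for a, b in zip(experts, cum)]
        k, start = k + 1, start + length
    return learner, learner - min(experts)                 # (total loss, regret)

preds = [[0.1, 0.9]] * 8
labels = [0.0] * 8
print(doubling_trick(preds, labels))

Restarting discards the weights accumulated in earlier periods; the √2/(√2 − 1) factor in the theorem is the price paid for these restarts.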



Doubling Trick - Proof
By the previous theorem, for any I_k = [2^k, 2^{k+1} − 1],
L_{I_k} − min_{1≤i≤N} L_{I_k,i} ≤ √((2^k/2) log N).

Thus, L_T = Σ_{k=0}^n L_{I_k} ≤ Σ_{k=0}^n min_{1≤i≤N} L_{I_k,i} + Σ_{k=0}^n √(2^k (log N)/2)
          ≤ min_{1≤i≤N} L_{T,i} + √((log N)/2) Σ_{k=0}^n 2^{k/2},

with
Σ_{k=0}^n 2^{k/2} = (2^{(n+1)/2} − 1)/(√2 − 1) ≤ (√2 √(T+1) − 1)/(√2 − 1)
                 ≤ (√2 (√T + 1) − 1)/(√2 − 1) ≤ √2 √T/(√2 − 1) + 1.



Notes
Doubling trick used in a variety of other contexts
and proofs.
More general method, learning parameter function
of time: η_t = √((8 log N)/t). Constant factor
improvement:

Regret(T) ≤ 2√((T/2) log N) + √((1/8) log N).



This Lecture
Prediction with expert advice
Linear classification



Perceptron Algorithm
(Rosenblatt, 1958)

Perceptron(w_0)
 1  w_1 ← w_0        (typically w_0 = 0)
 2  for t ← 1 to T do
 3      Receive(x_t)
 4      ŷ_t ← sgn(w_t · x_t)
 5      Receive(y_t)
 6      if (ŷ_t ≠ y_t) then
 7          w_{t+1} ← w_t + y_t x_t        (more generally η y_t x_t, η > 0)
 8      else w_{t+1} ← w_t
 9  return w_{T+1}
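A runnable Python sketch of the algorithm above (NumPy and the toy data are assumptions of this sketch):

import numpy as np

def perceptron(X, y, w0=None, eta=1.0):
    """X: (T, N) array of instances; y: length-T array of labels in {-1, +1}."""
    w = np.zeros(X.shape[1]) if w0 is None else np.array(w0, dtype=float)
    updates = 0
    for x_t, y_t in zip(X, y):
        y_hat = 1 if w @ x_t >= 0 else -1        # sgn(w_t . x_t), with sgn(0) taken as +1
        if y_hat != y_t:                          # mistake: add eta * y_t * x_t
            w = w + eta * y_t * x_t
            updates += 1
    return w, updates

X = np.array([[2.0, 1.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
print(perceptron(X, y))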



Separating Hyperplane
Margin and errors
[Figure: separating hyperplanes w·x = 0 with margin ρ and error terms y_i(w·x_i)/‖w‖.]



Perceptron = Stochastic Gradient Descent
Objective function: convex but not differentiable.
F(w) = (1/T) Σ_{t=1}^T max(0, −y_t(w·x_t)) = E_{x∼D̂}[f(w, x)],
with f(w, x) = max(0, −y(w·x)).

Stochastic gradient: for each x_t, the update is
w_{t+1} ← w_t − η∇_w f(w_t, x_t) if differentiable, w_t otherwise,
where η > 0 is a learning rate parameter.
Here:
w_{t+1} ← w_t + η y_t x_t if y_t(w_t·x_t) < 0, w_t otherwise.
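The same update written as a stochastic (sub)gradient step on f(w, x) = max(0, −y(w·x)); a minimal sketch matching the rule above (NumPy and the nonzero starting point are assumptions, the latter only so the toy run actually updates):

import numpy as np

def perceptron_sgd_step(w, x, y, eta=1.0):
    if y * (w @ x) < 0:             # f is differentiable here, with gradient -y x
        return w + eta * y * x      # w - eta * grad_w f(w, x)
    return w                        # zero (sub)gradient: no change

w = np.array([-1.0, 0.5])
for x_t, y_t in [(np.array([2.0, 1.0]), 1), (np.array([-1.0, 2.0]), -1), (np.array([-2.0, -1.0]), -1)]:
    w = perceptron_sgd_step(w, x_t, y_t)
print(w)
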
Perceptron Algorithm - Bound
(Novikoff, 1962)
Theorem: Assume that ‖x_t‖ ≤ R for all t ∈ [1, T] and
that for some ρ > 0 and v ∈ R^N, for all t ∈ [1, T],
y_t(v · x_t)/‖v‖ ≥ ρ.
Then, the number of mistakes made by the
Perceptron algorithm is bounded by R²/ρ².
Proof: Let I be the set of times t at which there is an
update and let M be the total number of updates.



• Summing up the assumption inequalities gives:
M ρ ≤ ( v · Σ_{t∈I} y_t x_t ) / ‖v‖
    = ( v · Σ_{t∈I} (w_{t+1} − w_t) ) / ‖v‖        (definition of updates)
    = ( v · w_{T+1} ) / ‖v‖
    ≤ ‖w_{T+1}‖                                    (Cauchy-Schwarz ineq.)
    = ‖w_{t_m} + y_{t_m} x_{t_m}‖                  (t_m largest t in I)
    = ( ‖w_{t_m}‖² + ‖x_{t_m}‖² + 2 y_{t_m} w_{t_m}·x_{t_m} )^{1/2}        (last term ≤ 0 at an update)
    ≤ ( ‖w_{t_m}‖² + R² )^{1/2}
    ≤ ( M R² )^{1/2} = √M R.                       (applying the same to previous t's in I)
Thus M ρ ≤ √M R, which gives M ≤ R²/ρ².



• Notes:
• bound independent of the dimension and tight.
• convergence can be slow for small margins; it can
be in Ω(2^N).
• among the many variants: the voted perceptron
algorithm. Predict according to
sgn( ( Σ_{t∈I} c_t w_t ) · x ),
where c_t is the number of iterations w_t survives.
• {x_t : t ∈ I} are the support vectors for the
perceptron algorithm.
• non-separable case: does not converge.


Perceptron - Leave-One-Out Analysis
Theorem: Let h_S be the hypothesis returned by the
perceptron algorithm for a sample S = (x_1, ..., x_T)
drawn according to D and let M(S) be the number of
updates defining h_S. Then,
E_{S∼D^m}[R(h_S)] ≤ E_{S∼D^{m+1}}[ min(M(S), R²_{m+1}/ρ²_{m+1}) / (m+1) ].

Proof: Let S ∼ D^{m+1} be a linearly separable sample
and let x ∈ S. If h_{S−{x}} misclassifies x, then x must
be a 'support vector' for h_S (update at x). Thus,
R_loo(perceptron) ≤ M(S)/(m+1).
Perceptron - Non-Separable Bound
(MM and Rostamizadeh, 2013)
Theorem: let I denote the set of rounds at which
the Perceptron algorithm makes an update when
processing x_1, ..., x_T and let M_T = |I|. Then,

M_T ≤ inf_{ρ>0, ‖u‖₂≤1} ( √(L_ρ(u)) + R/ρ )²,

where R = max_{t∈I} ‖x_t‖ and
L_ρ(u) = Σ_{t∈I} (1 − y_t(u·x_t)/ρ)_+.



• Proof: for any t, 1 ≤ (1 − y_t(u·x_t)/ρ)_+ + y_t(u·x_t)/ρ; summing
up these inequalities for t ∈ I yields:
M_T ≤ Σ_{t∈I} (1 − y_t(u·x_t)/ρ)_+ + Σ_{t∈I} y_t(u·x_t)/ρ
    ≤ L_ρ(u) + √(M_T) R/ρ,
by upper-bounding Σ_{t∈I} y_t(u·x_t) as in the proof
for the separable case.
• solving the second-degree inequality
M_T ≤ L_ρ(u) + √(M_T) R/ρ
gives √(M_T) ≤ ( R/ρ + √(R²/ρ² + 4 L_ρ(u)) )/2 ≤ R/ρ + √(L_ρ(u)).



Non-Separable Case - L2 Bound
(Freund and Schapire, 1998; MM and Rostamizadeh, 2013)
Theorem: let I denote the set of rounds at which
the Perceptron algorithm makes an update when
processing x_1, ..., x_T and let M_T = |I|. Then,

M_T ≤ inf_{ρ>0, ‖u‖₂≤1} [ ‖L_ρ(u)‖₂/2 + √( ‖L_ρ(u)‖₂²/4 + √(Σ_{t∈I} ‖x_t‖²)/ρ ) ]².

• when ‖x_t‖ ≤ R for all t ∈ I, this implies

M_T ≤ inf_{ρ>0, ‖u‖₂≤1} ( R/ρ + ‖L_ρ(u)‖₂ )²,

where L_ρ(u) = ( (1 − y_t(u·x_t)/ρ)_+ )_{t∈I}.
• Proof: Reduce the problem to the separable case in higher
dimension. Let l_t = (1 − y_t(u·x_t)/ρ)_+, for t ∈ [1, T].
• Mapping (similar to the trivial mapping): each x_t ∈ R^N is mapped to
x'_t ∈ R^{N+T}, which coincides with x_t on the first N components and has
Δ in the (N+t)-th component and 0 elsewhere; u is mapped to
u' = ( u_1/Z, ..., u_N/Z, ρ y_1 l_1/(Z Δ), ..., ρ y_T l_T/(Z Δ) ),
with Z chosen so that ‖u'‖ = 1, that is Z = √(1 + ρ² ‖L_ρ(u)‖₂²/Δ²).


• Observe that the Perceptron algorithm makes the
same predictions and makes updates at the same
rounds when processing x'_1, ..., x'_T.
• For any t ∈ I,
y_t(u' · x'_t) = y_t( (u·x_t)/Z + Δ ρ y_t l_t/(Z Δ) )
             = y_t(u·x_t)/Z + ρ l_t/Z
             = (1/Z) y_t(u·x_t) + (1/Z) [ρ − y_t(u·x_t)]_+ ≥ ρ/Z.
• Summing up and using the proof in the separable
case yields:
M_T ρ/Z ≤ Σ_{t∈I} y_t(u' · x'_t) ≤ √(Σ_{t∈I} ‖x'_t‖²).
• The inequality can be rewritten as
M_T² ρ² ≤ (r² + M_T Δ²)(1 + ρ²‖L_ρ(u)‖₂²/Δ²)
        = r² + M_T Δ² + r²ρ²‖L_ρ(u)‖₂²/Δ² + M_T ρ²‖L_ρ(u)‖₂²,

where r = √(Σ_{t∈I} ‖x_t‖²).

• Selecting Δ to minimize the bound gives Δ² = ρ r ‖L_ρ(u)‖₂/√(M_T)
and leads to
M_T² ρ² ≤ r² + 2√(M_T) ρ r ‖L_ρ(u)‖₂ + M_T ρ²‖L_ρ(u)‖₂² = ( r + √(M_T) ρ ‖L_ρ(u)‖₂ )².

• Solving the second-degree inequality
M_T − √(M_T) ‖L_ρ(u)‖₂ − r/ρ ≤ 0
yields directly the first statement. The second one
results from replacing r with √(M_T) R.
Dual Perceptron Algorithm
Dual-Perceptron(α_0)
 1  α ← α_0        (typically α_0 = 0)
 2  for t ← 1 to T do
 3      Receive(x_t)
 4      ŷ_t ← sgn( Σ_{s=1}^T α_s y_s (x_s · x_t) )
 5      Receive(y_t)
 6      if (ŷ_t ≠ y_t) then
 7          α_t ← α_t + 1
 8  return α



Kernel Perceptron Algorithm
(Aizerman et al., 1964)

K PDS kernel.
Kernel-Perceptron(α_0)
 1  α ← α_0        (typically α_0 = 0)
 2  for t ← 1 to T do
 3      Receive(x_t)
 4      ŷ_t ← sgn( Σ_{s=1}^T α_s y_s K(x_s, x_t) )
 5      Receive(y_t)
 6      if (ŷ_t ≠ y_t) then
 7          α_t ← α_t + 1
 8  return α
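A Python sketch of the kernel Perceptron (the Gaussian kernel and toy data are illustrative assumptions; with K(x, x') = x·x' this reduces to the dual Perceptron above):

import numpy as np

def kernel_perceptron(X, y, kernel):
    """Keep one counter alpha_s per point; predict with sum_s alpha_s y_s K(x_s, x)."""
    T = len(X)
    alpha = np.zeros(T)
    for t in range(T):
        score = sum(alpha[s] * y[s] * kernel(X[s], X[t]) for s in range(T))
        y_hat = 1 if score >= 0 else -1
        if y_hat != y[t]:
            alpha[t] += 1.0          # on a mistake, add x_t to the expansion
    return alpha

gaussian = lambda a, b, sigma=1.0: np.exp(-np.linalg.norm(a - b) ** 2 / (2 * sigma ** 2))
X = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, -1.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])
print(kernel_perceptron(X, y, gaussian))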



Winnow Algorithm
(Littlestone, 1988)
Winnow(η)
 1  w_1 ← 1/N
 2  for t ← 1 to T do
 3      Receive(x_t)
 4      ŷ_t ← sgn(w_t · x_t)        (y_t ∈ {−1, +1})
 5      Receive(y_t)
 6      if (ŷ_t ≠ y_t) then
 7          Z_t ← Σ_{i=1}^N w_{t,i} exp(η y_t x_{t,i})
 8          for i ← 1 to N do
 9              w_{t+1,i} ← w_{t,i} exp(η y_t x_{t,i}) / Z_t
10      else w_{t+1} ← w_t
11  return w_{T+1}
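A Python sketch of Winnow run on a sparse target, the case favorable to Winnow discussed in the notes below (NumPy and the toy data are assumptions of this sketch):

import numpy as np

def winnow(X, y, eta=0.5):
    """X: (T, N) array with entries in [-1, +1]; y: labels in {-1, +1}."""
    T, N = X.shape
    w = np.full(N, 1.0 / N)
    mistakes = 0
    for x_t, y_t in zip(X, y):
        y_hat = 1 if w @ x_t >= 0 else -1
        if y_hat != y_t:
            mistakes += 1
            w = w * np.exp(eta * y_t * x_t)   # multiplicative update on a mistake
            w /= w.sum()                      # renormalize (division by Z_t)
    return w, mistakes

# The label depends only on the first coordinate (v = e_1).
rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(200, 50))
y = np.sign(X[:, 0])
print(winnow(X, y)[1], "mistakes")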



Notes
Winnow = weighted majority:
• for y_{t,i} = x_{t,i} ∈ {−1, +1}, sgn(w_t · x_t) coincides with
the majority vote.
• multiplying by e^η or e^{−η} the weight of correct or
incorrect experts is equivalent to multiplying
by β = e^{−2η} the weight of incorrect ones.
Relationships with other algorithms: e.g., boosting
and Perceptron (Winnow and Perceptron can be
viewed as special instances of a general family).



Winnow Algorithm - Bound
Theorem: Assume that ‖x_t‖_∞ ≤ R_∞ for all t ∈ [1, T] and
that for some ρ_∞ > 0 and v ∈ R^N, v ≥ 0, for all t ∈ [1, T],
y_t(v · x_t)/‖v‖₁ ≥ ρ_∞.
Then, the number of mistakes made by the
Winnow algorithm is bounded by 2 (R_∞²/ρ_∞²) log N.
Proof: Let I be the set of times t at which there is an
update and let M be the total number of updates.



Notes
Comparison with perceptron bound:
• dual norms: ‖·‖_∞ norm for x_t, ‖·‖₁ norm for v.
• similar bounds with different norms.
• each advantageous in different cases:

Winnow bound favorable when a sparse set of
experts can predict well. For example, if v = e_1
and x_t ∈ {±1}^N, log N vs N.

Perceptron favorable in opposite situation.



Winnow Algorithm - Bound
Potential: Φ_t = Σ_{i=1}^N (v_i/‖v‖₁) log( (v_i/‖v‖₁) / w_{t,i} ).        (relative entropy)

Upper bound: for each t in I,
Φ_{t+1} − Φ_t = Σ_{i=1}^N (v_i/‖v‖₁) log( w_{t,i}/w_{t+1,i} )
             = Σ_{i=1}^N (v_i/‖v‖₁) log( Z_t / exp(η y_t x_{t,i}) )
             = log Z_t − η Σ_{i=1}^N (v_i/‖v‖₁) y_t x_{t,i}
             ≤ log [ Σ_{i=1}^N w_{t,i} exp(η y_t x_{t,i}) ] − ηρ_∞
             = log E_{w_t}[ exp(η y_t x_t) ] − ηρ_∞
             ≤ log [ exp(η²(2R_∞)²/8) ] + η y_t w_t · x_t − ηρ_∞        (Hoeffding)
             ≤ η²R_∞²/2 − ηρ_∞.        (y_t w_t · x_t ≤ 0 at an update)



Winnow Algorithm - Bound
Upper bound: summing up the inequalities yields
Φ_{T+1} − Φ_1 ≤ M (η²R_∞²/2 − ηρ_∞).
Lower bound: note that
Φ_1 = Σ_{i=1}^N (v_i/‖v‖₁) log( (v_i/‖v‖₁)/(1/N) )
    = log N + Σ_{i=1}^N (v_i/‖v‖₁) log(v_i/‖v‖₁) ≤ log N,

and for all t, Φ_t ≥ 0 (property of relative entropy).
Thus, Φ_{T+1} − Φ_1 ≥ 0 − log N = −log N.
Comparison: −log N ≤ M(η²R_∞²/2 − ηρ_∞). For η = ρ_∞/R_∞²,
we obtain M ≤ 2 (R_∞²/ρ_∞²) log N.
Conclusion
On-line learning:
• wide and fast-growing literature.
• many related topics, e.g., game theory, text
compression, convex optimization.
• online to batch bounds and techniques.
• online version of batch algorithms, e.g.,
regression algorithms (see regression lecture).



References
• Aizerman, M. A., Braverman, E. M., & Rozonoer, L. I. (1964). Theoretical foundations of the
potential function method in pattern recognition learning. Automation and Remote Control,
25, 821-837.

• Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile: On the Generalization Ability of On-
Line Learning Algorithms. IEEE Transactions on Information Theory 50(9): 2050-2057. 2004.

• Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge
University Press, 2006.

• Yoav Freund and Robert Schapire. Large margin classification using the perceptron
algorithm. In Proceedings of COLT 1998. ACM Press, 1998.

• Nick Littlestone. From On-Line to Batch Learning. COLT 1989: 269-284.

• Nick Littlestone. Learning Quickly When Irrelevant Attributes Abound: A New Linear-
threshold Algorithm. Machine Learning, 2:285-318, 1988.



References
• Nick Littlestone, Manfred K. Warmuth: The Weighted Majority Algorithm. FOCS 1989:
256-261.

• Tom Mitchell. Machine Learning, McGraw Hill, 1997.

• Mehryar Mohri and Afshin Rostamizadeh. Perceptron Mistake Bounds. arXiv:1305.0208, 2013.

• Novikoff, A. B. (1962). On convergence proofs on perceptrons. Symposium on the
Mathematical Theory of Automata, 12, 615-622. Polytechnic Institute of Brooklyn.

• Rosenblatt, Frank, The Perceptron: A Probabilistic Model for Information Storage and
Organization in the Brain, Cornell Aeronautical Laboratory, Psychological Review, v65, No. 6,
pp. 386-408, 1958.

• Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.



Appendix



SVMs - Leave-One-Out Analysis
(Vapnik, 1995)
Theorem: let h_S be the optimal hyperplane for a
sample S and let N_SV(S) be the number of support
vectors defining h_S. Then,
E_{S∼D^m}[R(h_S)] ≤ E_{S∼D^{m+1}}[ min(N_SV(S), R²_{m+1}/ρ²_{m+1}) / (m+1) ].
Proof: one part proven in lecture 4. The other part is
due to α_i ≥ 1/R²_{m+1} for x_i misclassified by SVMs.



Comparison
Bounds on expected error, not high probability
statements.
Leave-one-out bounds not sufficient to distinguish
SVMs and perceptron algorithm. Note however:
• same maximum margin ρ_{m+1} can be used in both.
• but different radius R_{m+1} of support vectors.
Difference: margin distribution.

