
Foundations of Machine Learning

Boosting

Mehryar Mohri
Courant Institute and Google Research
mohri@cims.nyu.edu
Weak Learning
(Kearns and Valiant, 1994)
Definition: concept class C is weakly PAC-learnable
if there exists a (weak) learning algorithm $L$ and $\gamma > 0$
such that:

• for all $\delta > 0$, for all $c \in C$ and all distributions $D$,
$$\Pr_{S \sim D}\Big[ R(h_S) \leq \frac{1}{2} - \gamma \Big] \geq 1 - \delta,$$

• for samples $S$ of size $m = \mathrm{poly}(1/\delta)$ for a fixed
polynomial.



Boosting Ideas
Finding simple, relatively accurate base classifiers is
often not hard → weak learner.
Main ideas:

• use weak learner to create a strong learner.


• combine base classifiers returned by weak learner
(ensemble method).
But, how should the base classifiers be combined?



AdaBoost
(Freund and Schapire, 1997)
$H \subseteq \{-1, +1\}^{X}$.

AdaBoost(S = ((x₁, y₁), . . . , (x_m, y_m)))
 1  for i ← 1 to m do
 2      D₁(i) ← 1/m
 3  for t ← 1 to T do
 4      h_t ← base classifier in H with small error ε_t = Pr_{i∼D_t}[h_t(x_i) ≠ y_i]
 5      α_t ← ½ log((1 − ε_t)/ε_t)
 6      Z_t ← 2[ε_t(1 − ε_t)]^{1/2}    ▷ normalization factor
 7      for i ← 1 to m do
 8          D_{t+1}(i) ← D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t
 9      f_t ← Σ_{s=1}^{t} α_s h_s
10  return h = sgn(f_T)
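For concreteness, here is a minimal NumPy sketch of this pseudocode, using exhaustively searched decision stumps as the base classifiers; the stump search, helper names, and synthetic data are illustrative choices, not part of the original slides.

```python
import numpy as np

def best_stump(X, y, D):
    """Exhaustively search for the stump (feature j, threshold, polarity)
    with smallest weighted error under the distribution D."""
    m, n = X.shape
    best, best_err = None, np.inf
    for j in range(n):
        for thr in np.unique(X[:, j]):
            for s in (1.0, -1.0):
                pred = s * np.where(X[:, j] <= thr, 1.0, -1.0)
                err = np.sum(D[pred != y])
                if err < best_err:
                    best, best_err = (j, thr, s), err
    return best

def stump_predict(X, stump):
    j, thr, s = stump
    return s * np.where(X[:, j] <= thr, 1.0, -1.0)

def adaboost(X, y, T=10):
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                         # line 2: D_1(i) = 1/m
    stumps, alphas = [], []
    for t in range(T):
        stump = best_stump(X, y, D)                 # line 4: small weighted error
        pred = stump_predict(X, stump)
        eps = np.clip(np.sum(D[pred != y]), 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - eps) / eps)       # line 5
        w = D * np.exp(-alpha * y * pred)
        Z = w.sum()                                 # line 6: normalization factor (= 2*sqrt(eps*(1-eps)))
        D = w / Z                                   # lines 7-8
        stumps.append(stump)
        alphas.append(alpha)
    def h(Xnew):                                    # line 10: h = sgn(f_T)
        f = sum(a * stump_predict(Xnew, s) for a, s in zip(alphas, stumps))
        return np.sign(f)
    return h

# toy usage on a synthetic dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
h = adaboost(X, y, T=10)
print("training error:", np.mean(h(X) != y))
```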



Notes
Distributions Dt over training sample:
• originally uniform.
• at each round, the weight of a misclassified
example is increased.
• observation: $D_{t+1}(i) = \frac{e^{-y_i f_t(x_i)}}{m \prod_{s=1}^{t} Z_s}$, since
$$D_{t+1}(i) = \frac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t}
= \frac{D_{t-1}(i)\, e^{-\alpha_{t-1} y_i h_{t-1}(x_i)}\, e^{-\alpha_t y_i h_t(x_i)}}{Z_{t-1} Z_t}
= \frac{1}{m}\, \frac{e^{-y_i \sum_{s=1}^{t} \alpha_s h_s(x_i)}}{\prod_{s=1}^{t} Z_s}.$$

Weight assigned to base classifier $h_t$: $\alpha_t$ directly
depends on the accuracy of $h_t$ at round $t$.
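A quick numeric sanity check of this observation (an illustrative sketch in which random ±1 vectors stand in for base classifier predictions): the iteratively computed $D_{T+1}$ coincides with the closed-form expression above.

```python
import numpy as np

rng = np.random.default_rng(1)
m, T = 50, 8
y = rng.choice([-1.0, 1.0], size=m)
H = rng.choice([-1.0, 1.0], size=(T, m))   # H[t, i] stands in for h_t(x_i)

D = np.full(m, 1.0 / m)
alphas, Zs = [], []
for t in range(T):
    eps = np.sum(D[H[t] != y])
    alpha = 0.5 * np.log((1 - eps) / eps)
    w = D * np.exp(-alpha * y * H[t])
    Z = w.sum()                             # normalization factor Z_t
    D = w / Z
    alphas.append(alpha)
    Zs.append(Z)

f = np.dot(alphas, H)                       # f_T(x_i) = sum_t alpha_t h_t(x_i)
closed_form = np.exp(-y * f) / (m * np.prod(Zs))
print(np.allclose(D, closed_form))          # True
```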



Illustration

[Figure: base classifiers selected at rounds t = 1, 2, 3, and the resulting weighted combination α₁h₁ + α₂h₂ + α₃h₃.]


Bound on Empirical Error
(Freund and Schapire, 1997)
Theorem: The empirical error of the classifier
output by AdaBoost verifies:
$$\hat{R}(h) \leq \exp\Big( -2 \sum_{t=1}^{T} \Big( \frac{1}{2} - \epsilon_t \Big)^2 \Big).$$

• If further for all $t \in [1, T]$, $\gamma \leq \big(\tfrac{1}{2} - \epsilon_t\big)$, then
$$\hat{R}(h) \leq \exp(-2\gamma^2 T).$$

• $\gamma$ does not need to be known in advance:
adaptive boosting.



• Proof: Since, as we saw, $D_{t+1}(i) = \frac{e^{-y_i f_t(x_i)}}{m \prod_{s=1}^{t} Z_s}$,
$$\hat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) \leq 0}
\leq \frac{1}{m} \sum_{i=1}^{m} \exp(-y_i f(x_i))
= \frac{1}{m} \sum_{i=1}^{m} \Big[ m \prod_{t=1}^{T} Z_t \Big] D_{T+1}(i)
= \prod_{t=1}^{T} Z_t.$$

• Now, since $Z_t$ is a normalization factor,
$$Z_t = \sum_{i=1}^{m} D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}
= \sum_{i:\, y_i h_t(x_i) \geq 0} D_t(i)\, e^{-\alpha_t} + \sum_{i:\, y_i h_t(x_i) < 0} D_t(i)\, e^{\alpha_t}
= (1 - \epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t}
= (1 - \epsilon_t) \sqrt{\frac{\epsilon_t}{1 - \epsilon_t}} + \epsilon_t \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}}
= 2 \sqrt{\epsilon_t (1 - \epsilon_t)}.$$



• Thus,
$$\prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1 - \epsilon_t)}
= \prod_{t=1}^{T} \sqrt{1 - 4\big(\tfrac{1}{2} - \epsilon_t\big)^2}
\leq \prod_{t=1}^{T} \exp\Big( -2 \big(\tfrac{1}{2} - \epsilon_t\big)^2 \Big)
= \exp\Big( -2 \sum_{t=1}^{T} \big(\tfrac{1}{2} - \epsilon_t\big)^2 \Big).$$

• Notes:

• $\alpha_t$ is the minimizer of $\alpha \mapsto (1 - \epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha}$ (see the numeric check below).

• since $(1 - \epsilon_t)\, e^{-\alpha_t} = \epsilon_t\, e^{\alpha_t}$, at each round, AdaBoost
assigns the same probability mass to correctly
classified and misclassified instances.

• for base classifiers mapping $X$ to $[-1, +1]$, $\alpha_t$ can be
similarly chosen to minimize $Z_t$.
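A small numeric check of the first note, assuming an arbitrary value of $\epsilon_t$ and using SciPy's scalar minimizer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

eps = 0.3                                              # an arbitrary epsilon_t in (0, 1/2)
alpha_closed = 0.5 * np.log((1 - eps) / eps)           # AdaBoost's choice
alpha_num = minimize_scalar(lambda a: (1 - eps) * np.exp(-a) + eps * np.exp(a)).x
print(alpha_num, alpha_closed)                         # both approx 0.4236
# equal mass on correctly and incorrectly classified points after reweighting:
print((1 - eps) * np.exp(-alpha_closed), eps * np.exp(alpha_closed))   # both approx 0.4583
```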



AdaBoost = Coordinate Descent
Objective Function: convex and differentiable.
$$F(\bar{\alpha}) = \frac{1}{m} \sum_{i=1}^{m} e^{-y_i f(x_i)}
= \frac{1}{m} \sum_{i=1}^{m} e^{-y_i \sum_{j=1}^{N} \bar{\alpha}_j h_j(x_i)}.$$

[Figure: the exponential loss x ↦ e^{−x} upper-bounds the zero-one loss.]



• Direction: unit vector $e_k$ with best directional
derivative:
$$F'(\bar{\alpha}_{t-1}, e_k) = \lim_{\eta \to 0} \frac{F(\bar{\alpha}_{t-1} + \eta e_k) - F(\bar{\alpha}_{t-1})}{\eta}.$$

Since $F(\bar{\alpha}_{t-1} + \eta e_k) = \frac{1}{m} \sum_{i=1}^{m} e^{-y_i \sum_{j=1}^{N} \bar{\alpha}_{t-1,j} h_j(x_i) - \eta y_i h_k(x_i)}$,
$$F'(\bar{\alpha}_{t-1}, e_k)
= -\frac{1}{m} \sum_{i=1}^{m} y_i h_k(x_i)\, e^{-y_i \sum_{j=1}^{N} \bar{\alpha}_{t-1,j} h_j(x_i)}
= -\frac{1}{m} \sum_{i=1}^{m} y_i h_k(x_i)\, \bar{D}_t(i)\, \bar{Z}_t$$
$$= -\Big[ \sum_{i=1}^{m} \bar{D}_t(i)\, 1_{y_i h_k(x_i) = +1} - \sum_{i=1}^{m} \bar{D}_t(i)\, 1_{y_i h_k(x_i) = -1} \Big] \frac{\bar{Z}_t}{m}
= -\big[ (1 - \bar{\epsilon}_{t,k}) - \bar{\epsilon}_{t,k} \big] \frac{\bar{Z}_t}{m}
= \big[ 2\bar{\epsilon}_{t,k} - 1 \big] \frac{\bar{Z}_t}{m}.$$

Thus, the direction corresponds to the base classifier with smallest error.



• Step size: $\eta$ chosen to minimize $F(\bar{\alpha}_{t-1} + \eta e_k)$:
$$\frac{dF(\bar{\alpha}_{t-1} + \eta e_k)}{d\eta} = 0
\iff -\sum_{i=1}^{m} y_i h_k(x_i)\, e^{-y_i \sum_{j=1}^{N} \bar{\alpha}_{t-1,j} h_j(x_i)}\, e^{-\eta y_i h_k(x_i)} = 0$$
$$\iff -\sum_{i=1}^{m} y_i h_k(x_i)\, \bar{D}_t(i)\, \bar{Z}_t\, e^{-\eta y_i h_k(x_i)} = 0
\iff -\sum_{i=1}^{m} y_i h_k(x_i)\, \bar{D}_t(i)\, e^{-\eta y_i h_k(x_i)} = 0$$
$$\iff -\big[ (1 - \bar{\epsilon}_{t,k})\, e^{-\eta} - \bar{\epsilon}_{t,k}\, e^{\eta} \big] = 0
\iff \eta = \frac{1}{2} \log \frac{1 - \bar{\epsilon}_{t,k}}{\bar{\epsilon}_{t,k}}.$$

Thus, step size matches base classifier weight of AdaBoost.
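Both conclusions can be checked numerically. In the sketch below, random ±1 vectors stand in for the $h_j$ and the current iterate $\bar{\alpha}_{t-1}$ is arbitrary (both are illustrative assumptions); a line search on $F$ along each coordinate recovers the closed-form step $\frac{1}{2}\log\frac{1-\bar{\epsilon}_{t,k}}{\bar{\epsilon}_{t,k}}$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
m, N = 60, 5
y = rng.choice([-1.0, 1.0], size=m)
H = rng.choice([-1.0, 1.0], size=(N, m))        # H[j, i] stands in for h_j(x_i)
alpha_bar = rng.normal(scale=0.1, size=N)       # current iterate, alpha_{t-1}

F = lambda a: np.mean(np.exp(-y * (a @ H)))     # exponential-loss objective

w = np.exp(-y * (alpha_bar @ H))                # distribution D_t induced by alpha_{t-1}
D = w / w.sum()

for k in range(N):
    eps_k = np.sum(D[H[k] != y])                # weighted error of h_k under D_t
    eta_star = minimize_scalar(lambda eta: F(alpha_bar + eta * np.eye(N)[k])).x
    eta_closed = 0.5 * np.log((1 - eps_k) / eps_k)
    print(k, round(eps_k, 3), round(eta_star, 4), round(eta_closed, 4))  # last two columns agree
```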



Alternative Loss Functions
• boosting loss: $x \mapsto e^{-x}$
• square loss: $x \mapsto (1 - x)^2\, 1_{x \leq 1}$
• logistic loss: $x \mapsto \log_2(1 + e^{-x})$
• hinge loss: $x \mapsto \max(1 - x, 0)$
• zero-one loss: $x \mapsto 1_{x < 0}$

[Figure: plots of these losses as functions of the margin x.]
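For reference, a direct transcription of these losses as functions of the margin x (a small sketch; each surrogate upper-bounds the zero-one loss pointwise):

```python
import numpy as np

x = np.linspace(-2, 2, 401)       # x stands for the margin y f(x)

losses = {
    "boosting (exponential)": np.exp(-x),
    "square":                 (1 - x) ** 2 * (x <= 1),
    "logistic":               np.log2(1 + np.exp(-x)),
    "hinge":                  np.maximum(1 - x, 0),
    "zero-one":               (x < 0).astype(float),
}

for name, values in losses.items():
    # every surrogate above upper-bounds the zero-one loss pointwise
    assert np.all(values >= losses["zero-one"]), name
    print(f"{name:>22}: value at margin 0 is {values[200]:.3f}")
```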



Standard Use in Practice
Base learners: decision trees, quite often just
decision stumps (trees of depth one).
Boosting stumps:
• data in $\mathbb{R}^N$, e.g., N = 2, (height(x), weight(x)).
• associate a stump to each component.
• pre-sort each component: O(N m log m).
• at each round, find best component and threshold (see the sketch below).
• total complexity: O((m log m)N + mN T ).
• stumps not weak learners: think XOR example!
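Below is a sketch of the per-round search over components and thresholds described in the list above: each column is argsorted once, and one cumulative-sum sweep per component finds the best threshold and polarity in O(m). The helper names and synthetic data are illustrative assumptions.

```python
import numpy as np

def best_stump_presorted(X, sort_idx, y, D):
    """One boosting round: find the stump with smallest weighted error in O(m N),
    given columns pre-sorted once in O(N m log m). D must sum to 1.
    A stump is (feature j, threshold theta, polarity s): x -> s if x[j] <= theta else -s."""
    m, N = X.shape
    best, best_err = None, np.inf
    for j in range(N):
        idx = sort_idx[:, j]
        yD = (y * D)[idx]
        err0 = np.sum(D[y == 1])            # error of "always predict -1" (polarity s = +1)
        errs = err0 - np.cumsum(yD)         # error after the threshold passes each sorted point
        k_plus, k_minus = np.argmin(errs), np.argmax(errs)
        for s, k, e in ((1.0, k_plus, errs[k_plus]), (-1.0, k_minus, 1.0 - errs[k_minus])):
            if e < best_err:
                best, best_err = (j, X[idx[k], j], s), e
    return best, best_err

# illustrative usage at the first round (uniform weights)
rng = np.random.default_rng(3)
m, N = 500, 3
X = rng.normal(size=(m, N))
y = np.where(X[:, 1] > 0.2, 1.0, -1.0)       # label determined by one coordinate
D = np.full(m, 1.0 / m)
sort_idx = np.argsort(X, axis=0)             # pre-sorting: done once, reused at every round
(j, theta, s), err = best_stump_presorted(X, sort_idx, y, D)
print(j, round(theta, 3), s, round(err, 3))  # feature 1, threshold near 0.2, polarity -1, error ~0
```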
Overfitting?
Assume that VCdim(H) = d and, for a fixed T, define
$$F_T = \Big\{ \mathrm{sgn}\Big( \sum_{t=1}^{T} \alpha_t h_t + b \Big) : \alpha_t, b \in \mathbb{R},\ h_t \in H \Big\}.$$

$F_T$ can form a very rich family of classifiers. It can
be shown (Freund and Schapire, 1997) that:
$$\mathrm{VCdim}(F_T) \leq 2(d + 1)(T + 1)\log_2((T + 1)e).$$

This suggests that AdaBoost could overfit for large


values of T , and that is in fact observed in some
cases, but in various others it is not!
Empirical Observations
Several empirical observations (not all): AdaBoost
does not seem to overfit; furthermore:

[Figure: training and test error curves of the combined classifier as a function of the number of boosting rounds, and the cumulative margin distribution graph, for the letter dataset with C4.5 decision trees, as reported by (Schapire et al., 1998).]
Rademacher Complexity of Convex Hulls
Theorem: Let H be a set of functions mapping
from X to R. Let the convex hull of H be defined as
$$\mathrm{conv}(H) = \Big\{ \sum_{k=1}^{p} \mu_k h_k : p \geq 1,\ \mu_k \geq 0,\ \sum_{k=1}^{p} \mu_k \leq 1,\ h_k \in H \Big\}.$$

Then, for any sample $S$, $\widehat{\mathfrak{R}}_S(\mathrm{conv}(H)) = \widehat{\mathfrak{R}}_S(H)$.


Proof:
$$\widehat{\mathfrak{R}}_S(\mathrm{conv}(H))
= \frac{1}{m} \mathop{\mathrm{E}}_{\sigma} \Big[ \sup_{h_k \in H,\, \mu \geq 0,\, \|\mu\|_1 \leq 1} \sum_{i=1}^{m} \sigma_i \sum_{k=1}^{p} \mu_k h_k(x_i) \Big]$$
$$= \frac{1}{m} \mathop{\mathrm{E}}_{\sigma} \Big[ \sup_{h_k \in H}\ \sup_{\mu \geq 0,\, \|\mu\|_1 \leq 1} \sum_{k=1}^{p} \mu_k \sum_{i=1}^{m} \sigma_i h_k(x_i) \Big]$$
$$= \frac{1}{m} \mathop{\mathrm{E}}_{\sigma} \Big[ \sup_{h_k \in H}\ \max_{k \in [1,p]} \sum_{i=1}^{m} \sigma_i h_k(x_i) \Big]$$
$$= \frac{1}{m} \mathop{\mathrm{E}}_{\sigma} \Big[ \sup_{h \in H} \sum_{i=1}^{m} \sigma_i h(x_i) \Big] = \widehat{\mathfrak{R}}_S(H).$$
Margin Bound - Ensemble Methods
(Koltchinskii and Panchenko, 2002)

Corollary: Let H be a set of real-valued functions.
Fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$,
the following holds for all $h \in \mathrm{conv}(H)$:
$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \mathfrak{R}_m(H) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}$$
$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \widehat{\mathfrak{R}}_S(H) + 3 \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$

Proof: Direct consequence of the margin bound of
Lecture 4 and $\widehat{\mathfrak{R}}_S(\mathrm{conv}(H)) = \widehat{\mathfrak{R}}_S(H)$.



Margin Bound - Ensemble Methods
(Koltchinskii and Panchenko, 2002); see also (Schapire et al., 1998)

Corollary: Let H be a family of functions taking
values in $\{-1, +1\}$ with VC dimension $d$. Fix $\rho > 0$.
For any $\delta > 0$, with probability at least $1 - \delta$, the
following holds for all $h \in \mathrm{conv}(H)$:
$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \sqrt{\frac{2d \log \frac{em}{d}}{m}} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$

Proof: Follows directly from the previous corollary and the VC
dimension bound on the Rademacher complexity (see
Lecture 3).



Notes
All of these bounds can be generalized to hold
uniformly for all $\rho \in (0, 1)$, at the cost of an additional
term $\sqrt{\frac{\log \log_2 \frac{2}{\rho}}{m}}$ and other minor constant factor
changes (Koltchinskii and Panchenko, 2002).

For AdaBoost, the bound applies to the functions
$$x \mapsto \frac{f(x)}{\|\alpha\|_1} = \frac{\sum_{t=1}^{T} \alpha_t h_t(x)}{\|\alpha\|_1} \in \mathrm{conv}(H).$$

Note that T does not appear in the bound.



Margin Distribution
Theorem: For any $\rho > 0$, the following holds:
$$\Pr\Big[ \frac{y f(x)}{\|\alpha\|_1} \leq \rho \Big] \leq 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{1-\rho} (1 - \epsilon_t)^{1+\rho}}.$$

Proof: Using the identity $D_{T+1}(i) = \frac{e^{-y_i f(x_i)}}{m \prod_{t=1}^{T} Z_t}$,
$$\frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) - \rho \|\alpha\|_1 \leq 0}
\leq \frac{1}{m} \sum_{i=1}^{m} \exp(-y_i f(x_i) + \rho \|\alpha\|_1)$$
$$= \frac{1}{m}\, e^{\rho \|\alpha\|_1} \sum_{i=1}^{m} \Big[ m \prod_{t=1}^{T} Z_t \Big] D_{T+1}(i)$$
$$= e^{\rho \|\alpha\|_1} \prod_{t=1}^{T} Z_t = 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{1-\rho} (1 - \epsilon_t)^{1+\rho}}.$$



Notes
If for all $t \in [1, T]$, $\gamma \leq \big(\tfrac{1}{2} - \epsilon_t\big)$, then the upper bound
can be bounded by
$$\Pr\Big[ \frac{y f(x)}{\|\alpha\|_1} \leq \rho \Big] \leq \Big[ (1 - 2\gamma)^{1-\rho} (1 + 2\gamma)^{1+\rho} \Big]^{T/2}.$$

• For $\rho < \gamma$, $(1 - 2\gamma)^{1-\rho}(1 + 2\gamma)^{1+\rho} < 1$ and the bound
decreases exponentially in $T$.
• For the bound to be convergent: $\rho \gg O(1/\sqrt{m})$,
thus $\gamma \gg O(1/\sqrt{m})$ is roughly the condition on the
edge value.



L1-Geometric Margin
Definition: the L1-margin $\rho_f(x)$ of a linear
function $f = \sum_{t=1}^{T} \alpha_t h_t$ with $\alpha \neq 0$ at a point $x \in X$ is
defined by
$$\rho_f(x) = \frac{|f(x)|}{\|\alpha\|_1} = \frac{\big|\sum_{t=1}^{T} \alpha_t h_t(x)\big|}{\|\alpha\|_1} = \frac{|\alpha \cdot \mathbf{h}(x)|}{\|\alpha\|_1},
\quad \text{with } \mathbf{h}(x) = (h_1(x), \ldots, h_T(x))^\top.$$

• the L1-margin of $f$ over a sample $S = (x_1, \ldots, x_m)$ is
its minimum margin at points in that sample:
$$\rho_f = \min_{i \in [1, m]} \rho_f(x_i) = \min_{i \in [1, m]} \frac{|\alpha \cdot \mathbf{h}(x_i)|}{\|\alpha\|_1}.$$
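A direct transcription of these definitions (the weight vector α and the matrix of base classifier outputs are made up for illustration):

```python
import numpy as np

def l1_margins(alpha, H):
    """Pointwise L1-margins rho_f(x_i) = |alpha . h(x_i)| / ||alpha||_1,
    where H[t, i] = h_t(x_i), and their minimum over the sample."""
    rho_x = np.abs(alpha @ H) / np.sum(np.abs(alpha))
    return rho_x, rho_x.min()

rng = np.random.default_rng(4)
T, m = 4, 10
H = rng.choice([-1.0, 1.0], size=(T, m))   # base classifier outputs on the sample
alpha = np.array([0.9, 0.5, 0.3, 0.1])
rho_x, rho_f = l1_margins(alpha, H)
print(np.round(rho_x, 3))                   # each value lies in [0, 1] since |h_t(x)| <= 1
print("L1-margin over the sample:", rho_f)
```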



SVM vs AdaBoost
                            SVM                                          AdaBoost

features or                 Φ(x) = [Φ₁(x), ..., Φ_N(x)]ᵀ                 h(x) = [h₁(x), ..., h_N(x)]ᵀ
base hypotheses

predictor                   x ↦ w · Φ(x)                                 x ↦ α · h(x)

geom. margin                (w · Φ(x)) / ‖w‖₂ = d₂(Φ(x), hyperpl.)       (α · h(x)) / ‖α‖₁ = d_∞(h(x), hyperpl.)

conf. margin                y (w · Φ(x))                                 y (α · h(x))

regularization              ‖w‖₂                                         ‖α‖₁ (L1-AB)



Maximum-Margin Solutions

[Figure: maximum-margin solutions for the norm ‖·‖₂ (left) and the norm ‖·‖∞ (right).]



But, Does AdaBoost Maximize the Margin?
No: AdaBoost may converge to a margin that is
significantly below the maximum margin (Rudin et al.,
2004) (e.g., 1/3 instead of 3/8)!

Lower bound: AdaBoost can achieve asymptotically
a margin that is at least $\frac{\rho_{\max}}{2}$ if the data is separable
and some conditions on the base learners hold
(Rätsch and Warmuth, 2002).

Several boosting-type margin-maximization


algorithms: but, performance in practice not clear
or not reported.



AdaBoost’s Weak Learning Condition
Definition: the edge of a base classifier $h_t$ for a
distribution $D$ over the training sample is
$$\gamma_t = \frac{1}{2} - \epsilon_t = \frac{1}{2} \sum_{i=1}^{m} y_i h_t(x_i)\, D(i).$$

Condition: there exists $\gamma > 0$ such that for any distribution $D$
over the training sample and any base classifier $h_t$,
$\gamma_t \geq \gamma$.
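A short numeric check of the identity $\gamma_t = \frac{1}{2} - \epsilon_t = \frac{1}{2}\sum_i y_i h_t(x_i) D(i)$, with made-up predictions and weights:

```python
import numpy as np

rng = np.random.default_rng(5)
m = 20
y = rng.choice([-1.0, 1.0], size=m)
h = rng.choice([-1.0, 1.0], size=m)      # predictions h_t(x_i)
D = rng.random(m); D /= D.sum()          # a distribution over the training sample

eps = np.sum(D[h != y])                  # weighted error epsilon_t
edge = 0.5 * np.sum(y * h * D)           # (1/2) sum_i y_i h_t(x_i) D(i)
print(round(edge, 6), round(0.5 - eps, 6))   # identical
```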



Zero-Sum Games
Definition:
• payoff matrix $M = (M_{ij}) \in \mathbb{R}^{m \times n}$.
• m possible actions (pure strategy) for row player.

• n possible actions for column player.


• Mij payoff for row player ( = loss for column
player) when row plays i , column plays j .
Example: rock-paper-scissors.

                rock   paper   scissors
    rock          0     -1        1
    paper         1      0       -1
    scissors     -1      1        0
Mixed Strategies
(von Neumann, 1928)
Definition: player row selects a distribution p over
the rows, player column a distribution q over
columns. The expected payoff for row is
$$\mathop{\mathrm{E}}_{i \sim p,\, j \sim q}[M_{ij}] = \sum_{i=1}^{m} \sum_{j=1}^{n} p_i M_{ij} q_j = \mathbf{p}^\top M \mathbf{q}.$$

von Neumann's minimax theorem:
$$\max_{\mathbf{p}} \min_{\mathbf{q}} \mathbf{p}^\top M \mathbf{q} = \min_{\mathbf{q}} \max_{\mathbf{p}} \mathbf{p}^\top M \mathbf{q}.$$

• equivalent form:
$$\max_{\mathbf{p}} \min_{j \in [1, n]} \mathbf{p}^\top M \mathbf{e}_j = \min_{\mathbf{q}} \max_{i \in [1, m]} \mathbf{e}_i^\top M \mathbf{q}.$$
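The value of the game and an optimal mixed strategy for the row player can be computed with a small linear program: maximize v subject to (Mᵀp)_j ≥ v for every column j, with p a distribution. A sketch using scipy.optimize.linprog on the rock-paper-scissors payoff matrix above (for this game the optimal p is uniform and the value is 0):

```python
import numpy as np
from scipy.optimize import linprog

M = np.array([[ 0, -1,  1],     # rock-paper-scissors payoffs for the row player
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)
m, n = M.shape

# variables x = (p_1, ..., p_m, v); maximize v  <=>  minimize -v
c = np.zeros(m + 1); c[-1] = -1.0
A_ub = np.hstack([-M.T, np.ones((n, 1))])                # v - (M^T p)_j <= 0 for every column j
b_ub = np.zeros(n)
A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])    # sum_i p_i = 1
b_eq = np.array([1.0])
bounds = [(0, None)] * m + [(None, None)]                # p_i >= 0, v unrestricted

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
p, v = res.x[:m], res.x[-1]
print("optimal mixed strategy p:", np.round(p, 3))       # [0.333 0.333 0.333]
print("value of the game:", round(v, 6))                 # 0.0
```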
John von Neumann (1903 - 1957)



AdaBoost and Game Theory
Game:
• Player A: selects point $x_i$, $i \in [1, m]$.
• Player B: selects base learner $h_t$, $t \in [1, T]$.
• Payoff matrix $M \in \{-1, +1\}^{m \times T}$: $M_{it} = y_i h_t(x_i)$.

von Neumann's theorem: assume finite $H$.
$$2\gamma^* = \min_{D} \max_{h \in H} \sum_{i=1}^{m} D(i)\, y_i h(x_i)
= \max_{\alpha} \min_{i \in [1, m]} \frac{y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)}{\|\alpha\|_1} = \rho^*.$$



Consequences
Weak learning condition = non-zero margin.
• thus, possible to search for non-zero margin.
• AdaBoost = (suboptimal) search for
corresponding $\alpha$; achieves at least half of the
maximum margin.

Weak learning = strong condition:
• the condition implies linear separability with
margin $2\gamma > 0$.



Linear Programming Problem
Maximizing the margin:
$$\rho = \max_{\alpha} \min_{i \in [1, m]} \frac{y_i (\alpha \cdot x_i)}{\|\alpha\|_1}.$$

This is equivalent to the following convex
optimization LP problem:
$$\max_{\alpha, \rho} \rho$$
$$\text{subject to: } y_i (\alpha \cdot x_i) \geq \rho,\ i \in [1, m];\quad \|\alpha\|_1 = 1.$$

Note that:
$$\frac{|\alpha \cdot x|}{\|\alpha\|_1} = d_\infty(x, H), \text{ with } H = \{x : \alpha \cdot x = 0\}.$$
Advantages of AdaBoost
Simple: straightforward implementation.
Efficient: complexity O(mN T ) for stumps:
• when N and T are not too large, the algorithm is
quite fast.
Theoretical guarantees: but still many questions.
• AdaBoost not designed to maximize margin.
• regularized versions of AdaBoost.



Outliers
AdaBoost assigns larger weights to harder
examples.
Application:
• Detecting mislabeled examples.
• Dealing with noisy data: regularization based on
the average weight assigned to a point (soft
margin idea for boosting) (Meir and Rätsch, 2003).



Weaker Aspects
Parameters:
• need to determine T , the number of rounds of
boosting: stopping criterion.
• need to determine base learners: risk of
overfitting or low margins.
Noise: severely damages the accuracy of AdaBoost
(Dietterich, 2000).



Other Boosting Algorithms
arc-gv (Breiman, 1996): designed to maximize the
margin, but outperformed by AdaBoost in
experiments (Reyzin and Schapire, 2006).
L1-regularized AdaBoost (Raetsch et al., 2001):
outperforms AdaBoost in experiments (Cortes et al.,
2014).

DeepBoost (Cortes et al., 2014): more favorable


learning guarantees, outperforms both AdaBoost
and L1-regularized AdaBoost in experiments.



References
• Corinna Cortes, Mehryar Mohri, and Umar Syed. Deep boosting. In ICML, pages 262-270,
2014.

• Leo Breiman. Bagging predictors. Machine Learning, 24(2): 123-140, 1996.

• Thomas G. Dietterich. An experimental comparison of three methods for constructing


ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40(2):
139-158, 2000.

• Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning


and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139,
1997.

• G. Lebanon and J. Lafferty. Boosting and maximum likelihood for exponential models. In
NIPS, pages 447–454, 2001.

• Ron Meir and Gunnar Rätsch. An introduction to boosting and leveraging. In Advanced
Lectures on Machine Learning (LNAI2600), 2003.

• J. von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen,


100:295-320, 1928.



References
• Cynthia Rudin, Ingrid Daubechies and Robert E. Schapire. The dynamics of AdaBoost:
Cyclic behavior and convergence of margins. Journal of Machine Learning Research, 5:
1557-1595, 2004.

• Gunnar Rätsch and Manfred K. Warmuth. Maximizing the margin with boosting. In
Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002),
Sydney, Australia, pages 334-350, July 2002.

• Reyzin, Lev and Schapire, Robert E. How boosting the margin can also boost classifier
complexity. In ICML, pages 753-760, 2006.

• Robert E. Schapire. The boosting approach to machine learning: An overview. In D. D.


Denison, M. H. Hansen, C. Holmes, B. Mallick, B. Yu, editors, Nonlinear Estimation and
Classification. Springer, 2003.

• Robert E. Schapire and Yoav Freund. Boosting, Foundations and Algorithms. The MIT Press,
2012.

• Robert E. Schapire, Yoav Freund, Peter Bartlett and Wee Sun Lee. Boosting the margin: A
new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):
1651-1686, 1998.

