
Foundations of Machine Learning

Boosting

Mehryar Mohri
Courant Institute and Google Research
mohri@cims.nyu.edu
Weak Learning
(Kearns and Valiant, 1994)
Definition: concept class C is weakly PAC-learnable
if there exists a (weak) learning algorithm $L$ and $\gamma > 0$
such that:

• for all $\delta > 0$, for all $c \in C$ and all distributions $D$,
$$\Pr_{S \sim D}\Big[ R(h_S) \leq \frac{1}{2} - \gamma \Big] \geq 1 - \delta,$$

• for samples $S$ of size $m = \mathrm{poly}(1/\delta)$ for a fixed
polynomial.



Boosting Ideas
Finding simple, relatively accurate base classifiers is
often not hard → weak learner.
Main ideas:

• use weak learner to create a strong learner.


• combine base classifiers returned by weak learner
(ensemble method).
But, how should the base classifiers be combined?



AdaBoost
(Freund and Schapire, 1997)
$H \subseteq \{-1, +1\}^{X}$.

AdaBoost(S = ((x₁, y₁), . . . , (x_m, y_m)))
 1  for i ← 1 to m do
 2      D₁(i) ← 1/m
 3  for t ← 1 to T do
 4      h_t ← base classifier in H with small error ε_t = Pr_{i∼D_t}[h_t(x_i) ≠ y_i]
 5      α_t ← ½ log((1 − ε_t)/ε_t)
 6      Z_t ← 2[ε_t(1 − ε_t)]^{1/2}    ▷ normalization factor
 7      for i ← 1 to m do
 8          D_{t+1}(i) ← D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t
 9      f_t ← Σ_{s=1}^{t} α_s h_s
10  return h = sgn(f_T)
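For concreteness, here is a minimal NumPy sketch of this pseudocode, using exhaustively searched decision stumps as the base classifiers; the stump search, helper names, and synthetic data are illustrative choices, not part of the original slides.

```python
import numpy as np

def best_stump(X, y, D):
    """Exhaustively search for the stump (feature j, threshold, polarity)
    with smallest weighted error under the distribution D."""
    m, n = X.shape
    best, best_err = None, np.inf
    for j in range(n):
        for thr in np.unique(X[:, j]):
            for s in (1.0, -1.0):
                pred = s * np.where(X[:, j] <= thr, 1.0, -1.0)
                err = np.sum(D[pred != y])
                if err < best_err:
                    best, best_err = (j, thr, s), err
    return best

def stump_predict(X, stump):
    j, thr, s = stump
    return s * np.where(X[:, j] <= thr, 1.0, -1.0)

def adaboost(X, y, T=10):
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                         # line 2: D_1(i) = 1/m
    stumps, alphas = [], []
    for t in range(T):
        stump = best_stump(X, y, D)                 # line 4: small weighted error
        pred = stump_predict(X, stump)
        eps = np.clip(np.sum(D[pred != y]), 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - eps) / eps)       # line 5
        w = D * np.exp(-alpha * y * pred)
        Z = w.sum()                                 # line 6: normalization factor (= 2*sqrt(eps*(1-eps)))
        D = w / Z                                   # lines 7-8
        stumps.append(stump)
        alphas.append(alpha)
    def h(Xnew):                                    # line 10: h = sgn(f_T)
        f = sum(a * stump_predict(Xnew, s) for a, s in zip(alphas, stumps))
        return np.sign(f)
    return h

# toy usage on a synthetic dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
h = adaboost(X, y, T=10)
print("training error:", np.mean(h(X) != y))
```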



Notes
Distributions Dt over training sample:
• originally uniform.
• at each round, the weight of a misclassified
example is increased.
• observation: $D_{t+1}(i) = \frac{e^{-y_i f_t(x_i)}}{m \prod_{s=1}^{t} Z_s}$, since
$$D_{t+1}(i) = \frac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t}
= \frac{D_{t-1}(i)\, e^{-\alpha_{t-1} y_i h_{t-1}(x_i)}\, e^{-\alpha_t y_i h_t(x_i)}}{Z_{t-1} Z_t}
= \frac{1}{m}\, \frac{e^{-y_i \sum_{s=1}^{t} \alpha_s h_s(x_i)}}{\prod_{s=1}^{t} Z_s}.$$

Weight assigned to base classifier $h_t$: $\alpha_t$ directly
depends on the accuracy of $h_t$ at round $t$.
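A quick numeric sanity check of this observation (an illustrative sketch in which random ±1 vectors stand in for base classifier predictions): the iteratively computed $D_{T+1}$ coincides with the closed-form expression above.

```python
import numpy as np

rng = np.random.default_rng(1)
m, T = 50, 8
y = rng.choice([-1.0, 1.0], size=m)
H = rng.choice([-1.0, 1.0], size=(T, m))   # H[t, i] stands in for h_t(x_i)

D = np.full(m, 1.0 / m)
alphas, Zs = [], []
for t in range(T):
    eps = np.sum(D[H[t] != y])
    alpha = 0.5 * np.log((1 - eps) / eps)
    w = D * np.exp(-alpha * y * H[t])
    Z = w.sum()                             # normalization factor Z_t
    D = w / Z
    alphas.append(alpha)
    Zs.append(Z)

f = np.dot(alphas, H)                       # f_T(x_i) = sum_t alpha_t h_t(x_i)
closed_form = np.exp(-y * f) / (m * np.prod(Zs))
print(np.allclose(D, closed_form))          # True
```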



Illustration

[Figure: base classifiers selected at rounds t = 1, 2, 3, and the resulting weighted combination α₁h₁ + α₂h₂ + α₃h₃.]


Bound on Empirical Error
(Freund and Schapire, 1997)
Theorem: The empirical error of the classifier
output by AdaBoost verifies:
$$\hat{R}(h) \leq \exp\Big( -2 \sum_{t=1}^{T} \Big( \frac{1}{2} - \epsilon_t \Big)^2 \Big).$$

• If further for all $t \in [1, T]$, $\gamma \leq \big(\tfrac{1}{2} - \epsilon_t\big)$, then
$$\hat{R}(h) \leq \exp(-2\gamma^2 T).$$

• $\gamma$ does not need to be known in advance:
adaptive boosting.



• Proof: Since, as we saw, $D_{t+1}(i) = \frac{e^{-y_i f_t(x_i)}}{m \prod_{s=1}^{t} Z_s}$,
$$\hat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) \leq 0}
\leq \frac{1}{m} \sum_{i=1}^{m} \exp(-y_i f(x_i))
= \frac{1}{m} \sum_{i=1}^{m} \Big[ m \prod_{t=1}^{T} Z_t \Big] D_{T+1}(i)
= \prod_{t=1}^{T} Z_t.$$

• Now, since $Z_t$ is a normalization factor,
$$Z_t = \sum_{i=1}^{m} D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}
= \sum_{i:\, y_i h_t(x_i) \geq 0} D_t(i)\, e^{-\alpha_t} + \sum_{i:\, y_i h_t(x_i) < 0} D_t(i)\, e^{\alpha_t}
= (1 - \epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t}
= (1 - \epsilon_t) \sqrt{\frac{\epsilon_t}{1 - \epsilon_t}} + \epsilon_t \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}}
= 2 \sqrt{\epsilon_t (1 - \epsilon_t)}.$$



• Thus,
$$\prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1 - \epsilon_t)}
= \prod_{t=1}^{T} \sqrt{1 - 4\big(\tfrac{1}{2} - \epsilon_t\big)^2}
\leq \prod_{t=1}^{T} \exp\Big( -2 \big(\tfrac{1}{2} - \epsilon_t\big)^2 \Big)
= \exp\Big( -2 \sum_{t=1}^{T} \big(\tfrac{1}{2} - \epsilon_t\big)^2 \Big).$$

• Notes:

• $\alpha_t$ is the minimizer of $\alpha \mapsto (1 - \epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha}$ (see the numeric check below).

• since $(1 - \epsilon_t)\, e^{-\alpha_t} = \epsilon_t\, e^{\alpha_t}$, at each round, AdaBoost
assigns the same probability mass to correctly
classified and misclassified instances.

• for base classifiers mapping $X$ to $[-1, +1]$, $\alpha_t$ can be
similarly chosen to minimize $Z_t$.
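A small numeric check of the first note, assuming an arbitrary value of $\epsilon_t$ and using SciPy's scalar minimizer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

eps = 0.3                                              # an arbitrary epsilon_t in (0, 1/2)
alpha_closed = 0.5 * np.log((1 - eps) / eps)           # AdaBoost's choice
alpha_num = minimize_scalar(lambda a: (1 - eps) * np.exp(-a) + eps * np.exp(a)).x
print(alpha_num, alpha_closed)                         # both approx 0.4236
# equal mass on correctly and incorrectly classified points after reweighting:
print((1 - eps) * np.exp(-alpha_closed), eps * np.exp(alpha_closed))   # both approx 0.4583
```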



AdaBoost = Coordinate Descent
Objective Function: convex and differentiable.
$$F(\bar{\alpha}) = \frac{1}{m} \sum_{i=1}^{m} e^{-y_i f(x_i)}
= \frac{1}{m} \sum_{i=1}^{m} e^{-y_i \sum_{j=1}^{N} \bar{\alpha}_j h_j(x_i)}.$$

[Figure: the exponential loss x ↦ e^{−x} upper-bounds the zero-one loss.]



• Direction: unit vector $e_k$ with best directional
derivative:
$$F'(\bar{\alpha}_{t-1}, e_k) = \lim_{\eta \to 0} \frac{F(\bar{\alpha}_{t-1} + \eta e_k) - F(\bar{\alpha}_{t-1})}{\eta}.$$

Since $F(\bar{\alpha}_{t-1} + \eta e_k) = \frac{1}{m} \sum_{i=1}^{m} e^{-y_i \sum_{j=1}^{N} \bar{\alpha}_{t-1,j} h_j(x_i) - \eta y_i h_k(x_i)}$,
$$F'(\bar{\alpha}_{t-1}, e_k)
= -\frac{1}{m} \sum_{i=1}^{m} y_i h_k(x_i)\, e^{-y_i \sum_{j=1}^{N} \bar{\alpha}_{t-1,j} h_j(x_i)}
= -\frac{1}{m} \sum_{i=1}^{m} y_i h_k(x_i)\, \bar{D}_t(i)\, \bar{Z}_t$$
$$= -\Big[ \sum_{i=1}^{m} \bar{D}_t(i)\, 1_{y_i h_k(x_i) = +1} - \sum_{i=1}^{m} \bar{D}_t(i)\, 1_{y_i h_k(x_i) = -1} \Big] \frac{\bar{Z}_t}{m}
= -\big[ (1 - \bar{\epsilon}_{t,k}) - \bar{\epsilon}_{t,k} \big] \frac{\bar{Z}_t}{m}
= \big[ 2\bar{\epsilon}_{t,k} - 1 \big] \frac{\bar{Z}_t}{m}.$$

Thus, the direction corresponds to the base classifier with smallest error.



• Step size: $\eta$ chosen to minimize $F(\bar{\alpha}_{t-1} + \eta e_k)$:
$$\frac{dF(\bar{\alpha}_{t-1} + \eta e_k)}{d\eta} = 0
\iff -\sum_{i=1}^{m} y_i h_k(x_i)\, e^{-y_i \sum_{j=1}^{N} \bar{\alpha}_{t-1,j} h_j(x_i)}\, e^{-\eta y_i h_k(x_i)} = 0$$
$$\iff -\sum_{i=1}^{m} y_i h_k(x_i)\, \bar{D}_t(i)\, \bar{Z}_t\, e^{-\eta y_i h_k(x_i)} = 0
\iff -\sum_{i=1}^{m} y_i h_k(x_i)\, \bar{D}_t(i)\, e^{-\eta y_i h_k(x_i)} = 0$$
$$\iff -\big[ (1 - \bar{\epsilon}_{t,k})\, e^{-\eta} - \bar{\epsilon}_{t,k}\, e^{\eta} \big] = 0
\iff \eta = \frac{1}{2} \log \frac{1 - \bar{\epsilon}_{t,k}}{\bar{\epsilon}_{t,k}}.$$

Thus, step size matches base classifier weight of AdaBoost.
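Both conclusions can be checked numerically. In the sketch below, random ±1 vectors stand in for the $h_j$ and the current iterate $\bar{\alpha}_{t-1}$ is arbitrary (both are illustrative assumptions); a line search on $F$ along each coordinate recovers the closed-form step $\frac{1}{2}\log\frac{1-\bar{\epsilon}_{t,k}}{\bar{\epsilon}_{t,k}}$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
m, N = 60, 5
y = rng.choice([-1.0, 1.0], size=m)
H = rng.choice([-1.0, 1.0], size=(N, m))        # H[j, i] stands in for h_j(x_i)
alpha_bar = rng.normal(scale=0.1, size=N)       # current iterate, alpha_{t-1}

F = lambda a: np.mean(np.exp(-y * (a @ H)))     # exponential-loss objective

w = np.exp(-y * (alpha_bar @ H))                # distribution D_t induced by alpha_{t-1}
D = w / w.sum()

for k in range(N):
    eps_k = np.sum(D[H[k] != y])                # weighted error of h_k under D_t
    eta_star = minimize_scalar(lambda eta: F(alpha_bar + eta * np.eye(N)[k])).x
    eta_closed = 0.5 * np.log((1 - eps_k) / eps_k)
    print(k, round(eps_k, 3), round(eta_star, 4), round(eta_closed, 4))  # last two columns agree
```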



Alternative Loss Functions
• boosting loss: $x \mapsto e^{-x}$
• square loss: $x \mapsto (1 - x)^2\, 1_{x \leq 1}$
• logistic loss: $x \mapsto \log_2(1 + e^{-x})$
• hinge loss: $x \mapsto \max(1 - x, 0)$
• zero-one loss: $x \mapsto 1_{x < 0}$

[Figure: plots of these losses as functions of the margin x.]
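For reference, a direct transcription of these losses as functions of the margin x (a small sketch; each surrogate upper-bounds the zero-one loss pointwise):

```python
import numpy as np

x = np.linspace(-2, 2, 401)       # x stands for the margin y f(x)

losses = {
    "boosting (exponential)": np.exp(-x),
    "square":                 (1 - x) ** 2 * (x <= 1),
    "logistic":               np.log2(1 + np.exp(-x)),
    "hinge":                  np.maximum(1 - x, 0),
    "zero-one":               (x < 0).astype(float),
}

for name, values in losses.items():
    # every surrogate above upper-bounds the zero-one loss pointwise
    assert np.all(values >= losses["zero-one"]), name
    print(f"{name:>22}: value at margin 0 is {values[200]:.3f}")
```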



Standard Use in Practice
Base learners: decision trees, quite often just
decision stumps (trees of depth one).
Boosting stumps:
• data in $\mathbb{R}^N$, e.g., N = 2, (height(x), weight(x)).
• associate a stump to each component.
• pre-sort each component: O(N m log m).
• at each round, find best component and threshold (see the sketch below).
• total complexity: O((m log m)N + mN T ).
• stumps not weak learners: think XOR example!
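Below is a sketch of the per-round search over components and thresholds described in the list above: each column is argsorted once, and one cumulative-sum sweep per component finds the best threshold and polarity in O(m). The helper names and synthetic data are illustrative assumptions.

```python
import numpy as np

def best_stump_presorted(X, sort_idx, y, D):
    """One boosting round: find the stump with smallest weighted error in O(m N),
    given columns pre-sorted once in O(N m log m). D must sum to 1.
    A stump is (feature j, threshold theta, polarity s): x -> s if x[j] <= theta else -s."""
    m, N = X.shape
    best, best_err = None, np.inf
    for j in range(N):
        idx = sort_idx[:, j]
        yD = (y * D)[idx]
        err0 = np.sum(D[y == 1])            # error of "always predict -1" (polarity s = +1)
        errs = err0 - np.cumsum(yD)         # error after the threshold passes each sorted point
        k_plus, k_minus = np.argmin(errs), np.argmax(errs)
        for s, k, e in ((1.0, k_plus, errs[k_plus]), (-1.0, k_minus, 1.0 - errs[k_minus])):
            if e < best_err:
                best, best_err = (j, X[idx[k], j], s), e
    return best, best_err

# illustrative usage at the first round (uniform weights)
rng = np.random.default_rng(3)
m, N = 500, 3
X = rng.normal(size=(m, N))
y = np.where(X[:, 1] > 0.2, 1.0, -1.0)       # label determined by one coordinate
D = np.full(m, 1.0 / m)
sort_idx = np.argsort(X, axis=0)             # pre-sorting: done once, reused at every round
(j, theta, s), err = best_stump_presorted(X, sort_idx, y, D)
print(j, round(theta, 3), s, round(err, 3))  # feature 1, threshold near 0.2, polarity -1, error ~0
```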
Overfitting?
Assume that VCdim(H) = d and, for a fixed T, define
$$F_T = \Big\{ \mathrm{sgn}\Big( \sum_{t=1}^{T} \alpha_t h_t + b \Big) : \alpha_t, b \in \mathbb{R},\ h_t \in H \Big\}.$$

$F_T$ can form a very rich family of classifiers. It can
be shown (Freund and Schapire, 1997) that:
$$\mathrm{VCdim}(F_T) \leq 2(d + 1)(T + 1)\log_2((T + 1)e).$$

This suggests that AdaBoost could overfit for large


values of T , and that is in fact observed in some
cases, but in various others it is not!
Empirical Observations
Several empirical observations (not all): AdaBoost
does not seem to overfit; furthermore:

[Figure: training and test error curves of the combined classifier as a function of the number of boosting rounds, and the cumulative margin distribution graph, for the letter dataset with C4.5 decision trees, as reported by (Schapire et al., 1998).]
Rademacher Complexity of Convex Hulls
Theorem: Let H be a set of functions mapping
from X to R. Let the convex hull of H be defined as
$$\mathrm{conv}(H) = \Big\{ \sum_{k=1}^{p} \mu_k h_k : p \geq 1,\ \mu_k \geq 0,\ \sum_{k=1}^{p} \mu_k \leq 1,\ h_k \in H \Big\}.$$

Then, for any sample $S$, $\widehat{\mathfrak{R}}_S(\mathrm{conv}(H)) = \widehat{\mathfrak{R}}_S(H)$.


Proof:
$$\widehat{\mathfrak{R}}_S(\mathrm{conv}(H))
= \frac{1}{m} \mathop{\mathrm{E}}_{\sigma} \Big[ \sup_{h_k \in H,\, \mu \geq 0,\, \|\mu\|_1 \leq 1} \sum_{i=1}^{m} \sigma_i \sum_{k=1}^{p} \mu_k h_k(x_i) \Big]$$
$$= \frac{1}{m} \mathop{\mathrm{E}}_{\sigma} \Big[ \sup_{h_k \in H}\ \sup_{\mu \geq 0,\, \|\mu\|_1 \leq 1} \sum_{k=1}^{p} \mu_k \sum_{i=1}^{m} \sigma_i h_k(x_i) \Big]$$
$$= \frac{1}{m} \mathop{\mathrm{E}}_{\sigma} \Big[ \sup_{h_k \in H}\ \max_{k \in [1,p]} \sum_{i=1}^{m} \sigma_i h_k(x_i) \Big]$$
$$= \frac{1}{m} \mathop{\mathrm{E}}_{\sigma} \Big[ \sup_{h \in H} \sum_{i=1}^{m} \sigma_i h(x_i) \Big] = \widehat{\mathfrak{R}}_S(H).$$
Margin Bound - Ensemble Methods
(Koltchinskii and Panchenko, 2002)

Corollary: Let H be a set of real-valued functions.
Fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$,
the following holds for all $h \in \mathrm{conv}(H)$:
$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \mathfrak{R}_m(H) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}$$
$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \widehat{\mathfrak{R}}_S(H) + 3 \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$

Proof: Direct consequence of the margin bound of
Lecture 4 and $\widehat{\mathfrak{R}}_S(\mathrm{conv}(H)) = \widehat{\mathfrak{R}}_S(H)$.



Margin Bound - Ensemble Methods
(Koltchinskii and Panchenko, 2002); see also (Schapire et al., 1998)

Corollary: Let H be a family of functions taking
values in $\{-1, +1\}$ with VC dimension $d$. Fix $\rho > 0$.
For any $\delta > 0$, with probability at least $1 - \delta$, the
following holds for all $h \in \mathrm{conv}(H)$:
$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \sqrt{\frac{2d \log \frac{em}{d}}{m}} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$

Proof: Follows directly from the previous corollary and the VC
dimension bound on the Rademacher complexity (see
Lecture 3).



Notes
All of these bounds can be generalized to hold
uniformly for all $\rho \in (0, 1)$, at the cost of an additional
term $\sqrt{\frac{\log \log_2 \frac{2}{\rho}}{m}}$ and other minor constant factor
changes (Koltchinskii and Panchenko, 2002).

For AdaBoost, the bound applies to the functions
$$x \mapsto \frac{f(x)}{\|\alpha\|_1} = \frac{\sum_{t=1}^{T} \alpha_t h_t(x)}{\|\alpha\|_1} \in \mathrm{conv}(H).$$

Note that T does not appear in the bound.



Margin Distribution
Theorem: For any $\rho > 0$, the following holds:
$$\Pr\Big[ \frac{y f(x)}{\|\alpha\|_1} \leq \rho \Big] \leq 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{1-\rho} (1 - \epsilon_t)^{1+\rho}}.$$

Proof: Using the identity $D_{T+1}(i) = \frac{e^{-y_i f(x_i)}}{m \prod_{t=1}^{T} Z_t}$,
$$\frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) - \rho \|\alpha\|_1 \leq 0}
\leq \frac{1}{m} \sum_{i=1}^{m} \exp(-y_i f(x_i) + \rho \|\alpha\|_1)$$
$$= \frac{1}{m}\, e^{\rho \|\alpha\|_1} \sum_{i=1}^{m} \Big[ m \prod_{t=1}^{T} Z_t \Big] D_{T+1}(i)$$
$$= e^{\rho \|\alpha\|_1} \prod_{t=1}^{T} Z_t = 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{1-\rho} (1 - \epsilon_t)^{1+\rho}}.$$



Notes
If for all $t \in [1, T]$, $\gamma \leq \big(\tfrac{1}{2} - \epsilon_t\big)$, then the upper bound
can be bounded by
$$\Pr\Big[ \frac{y f(x)}{\|\alpha\|_1} \leq \rho \Big] \leq \Big[ (1 - 2\gamma)^{1-\rho} (1 + 2\gamma)^{1+\rho} \Big]^{T/2}.$$

• For $\rho < \gamma$, $(1 - 2\gamma)^{1-\rho}(1 + 2\gamma)^{1+\rho} < 1$ and the bound
decreases exponentially in $T$.
• For the bound to be convergent: $\rho \gg O(1/\sqrt{m})$,
thus $\gamma \gg O(1/\sqrt{m})$ is roughly the condition on the
edge value.



L1-Geometric Margin
Definition: the L1-margin $\rho_f(x)$ of a linear
function $f = \sum_{t=1}^{T} \alpha_t h_t$ with $\alpha \neq 0$ at a point $x \in X$ is
defined by
$$\rho_f(x) = \frac{|f(x)|}{\|\alpha\|_1} = \frac{\big|\sum_{t=1}^{T} \alpha_t h_t(x)\big|}{\|\alpha\|_1} = \frac{|\alpha \cdot \mathbf{h}(x)|}{\|\alpha\|_1},
\quad \text{with } \mathbf{h}(x) = (h_1(x), \ldots, h_T(x))^\top.$$

• the L1-margin of $f$ over a sample $S = (x_1, \ldots, x_m)$ is
its minimum margin at points in that sample:
$$\rho_f = \min_{i \in [1, m]} \rho_f(x_i) = \min_{i \in [1, m]} \frac{|\alpha \cdot \mathbf{h}(x_i)|}{\|\alpha\|_1}.$$
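A direct transcription of these definitions (the weight vector α and the matrix of base classifier outputs are made up for illustration):

```python
import numpy as np

def l1_margins(alpha, H):
    """Pointwise L1-margins rho_f(x_i) = |alpha . h(x_i)| / ||alpha||_1,
    where H[t, i] = h_t(x_i), and their minimum over the sample."""
    rho_x = np.abs(alpha @ H) / np.sum(np.abs(alpha))
    return rho_x, rho_x.min()

rng = np.random.default_rng(4)
T, m = 4, 10
H = rng.choice([-1.0, 1.0], size=(T, m))   # base classifier outputs on the sample
alpha = np.array([0.9, 0.5, 0.3, 0.1])
rho_x, rho_f = l1_margins(alpha, H)
print(np.round(rho_x, 3))                   # each value lies in [0, 1] since |h_t(x)| <= 1
print("L1-margin over the sample:", rho_f)
```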



SVM vs AdaBoost
                            SVM                                          AdaBoost

features or                 Φ(x) = [Φ₁(x), ..., Φ_N(x)]ᵀ                 h(x) = [h₁(x), ..., h_N(x)]ᵀ
base hypotheses

predictor                   x ↦ w · Φ(x)                                 x ↦ α · h(x)

geom. margin                (w · Φ(x)) / ‖w‖₂ = d₂(Φ(x), hyperpl.)       (α · h(x)) / ‖α‖₁ = d_∞(h(x), hyperpl.)

conf. margin                y (w · Φ(x))                                 y (α · h(x))

regularization              ‖w‖₂                                         ‖α‖₁ (L1-AB)



Maximum-Margin Solutions

[Figure: maximum-margin solutions for the norm ‖·‖₂ (left) and the norm ‖·‖∞ (right).]



But, Does AdaBoost Maximize the Margin?
No: AdaBoost may converge to a margin that is
significantly below the maximum margin (Rudin et al.,
2004) (e.g., 1/3 instead of 3/8)!

Lower bound: AdaBoost can achieve asymptotically
a margin that is at least $\frac{\rho_{\max}}{2}$ if the data is separable
and some conditions on the base learners hold
(Rätsch and Warmuth, 2002).

Several boosting-type margin-maximization


algorithms: but, performance in practice not clear
or not reported.



AdaBoost’s Weak Learning Condition
Definition: the edge of a base classifier $h_t$ for a
distribution $D$ over the training sample is
$$\gamma_t = \frac{1}{2} - \epsilon_t = \frac{1}{2} \sum_{i=1}^{m} y_i h_t(x_i)\, D(i).$$

Condition: there exists $\gamma > 0$ such that for any distribution $D$
over the training sample and any base classifier $h_t$,
$\gamma_t \geq \gamma$.
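A short numeric check of the identity $\gamma_t = \frac{1}{2} - \epsilon_t = \frac{1}{2}\sum_i y_i h_t(x_i) D(i)$, with made-up predictions and weights:

```python
import numpy as np

rng = np.random.default_rng(5)
m = 20
y = rng.choice([-1.0, 1.0], size=m)
h = rng.choice([-1.0, 1.0], size=m)      # predictions h_t(x_i)
D = rng.random(m); D /= D.sum()          # a distribution over the training sample

eps = np.sum(D[h != y])                  # weighted error epsilon_t
edge = 0.5 * np.sum(y * h * D)           # (1/2) sum_i y_i h_t(x_i) D(i)
print(round(edge, 6), round(0.5 - eps, 6))   # identical
```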



Zero-Sum Games
Definition:
• payoff matrix $M = (M_{ij}) \in \mathbb{R}^{m \times n}$.
• m possible actions (pure strategy) for row player.

• n possible actions for column player.


• Mij payoff for row player ( = loss for column
player) when row plays i , column plays j .
Example: rock-paper-scissors.

                rock   paper   scissors
    rock          0     -1        1
    paper         1      0       -1
    scissors     -1      1        0
Mixed Strategies
(von Neumann, 1928)
Definition: player row selects a distribution p over
the rows, player column a distribution q over
columns. The expected payoff for row is
$$\mathop{\mathrm{E}}_{i \sim p,\, j \sim q}[M_{ij}] = \sum_{i=1}^{m} \sum_{j=1}^{n} p_i M_{ij} q_j = \mathbf{p}^\top M \mathbf{q}.$$

von Neumann's minimax theorem:
$$\max_{\mathbf{p}} \min_{\mathbf{q}} \mathbf{p}^\top M \mathbf{q} = \min_{\mathbf{q}} \max_{\mathbf{p}} \mathbf{p}^\top M \mathbf{q}.$$

• equivalent form:
$$\max_{\mathbf{p}} \min_{j \in [1, n]} \mathbf{p}^\top M \mathbf{e}_j = \min_{\mathbf{q}} \max_{i \in [1, m]} \mathbf{e}_i^\top M \mathbf{q}.$$
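The value of the game and an optimal mixed strategy for the row player can be computed with a small linear program: maximize v subject to (Mᵀp)_j ≥ v for every column j, with p a distribution. A sketch using scipy.optimize.linprog on the rock-paper-scissors payoff matrix above (for this game the optimal p is uniform and the value is 0):

```python
import numpy as np
from scipy.optimize import linprog

M = np.array([[ 0, -1,  1],     # rock-paper-scissors payoffs for the row player
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)
m, n = M.shape

# variables x = (p_1, ..., p_m, v); maximize v  <=>  minimize -v
c = np.zeros(m + 1); c[-1] = -1.0
A_ub = np.hstack([-M.T, np.ones((n, 1))])                # v - (M^T p)_j <= 0 for every column j
b_ub = np.zeros(n)
A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])    # sum_i p_i = 1
b_eq = np.array([1.0])
bounds = [(0, None)] * m + [(None, None)]                # p_i >= 0, v unrestricted

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
p, v = res.x[:m], res.x[-1]
print("optimal mixed strategy p:", np.round(p, 3))       # [0.333 0.333 0.333]
print("value of the game:", round(v, 6))                 # 0.0
```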
John von Neumann (1903 - 1957)



AdaBoost and Game Theory
Game:
• Player A: selects point $x_i$, $i \in [1, m]$.
• Player B: selects base learner $h_t$, $t \in [1, T]$.
• Payoff matrix $M \in \{-1, +1\}^{m \times T}$: $M_{it} = y_i h_t(x_i)$.

von Neumann's theorem: assume finite $H$.
$$2\gamma^* = \min_{D} \max_{h \in H} \sum_{i=1}^{m} D(i)\, y_i h(x_i)
= \max_{\alpha} \min_{i \in [1, m]} \frac{y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)}{\|\alpha\|_1} = \rho^*.$$



Consequences
Weak learning condition = non-zero margin.
• thus, possible to search for non-zero margin.
• AdaBoost = (suboptimal) search for
corresponding $\alpha$; achieves at least half of the
maximum margin.

Weak learning = strong condition:
• the condition implies linear separability with
margin $2\gamma > 0$.



Linear Programming Problem
Maximizing the margin:
$$\rho = \max_{\alpha} \min_{i \in [1, m]} \frac{y_i (\alpha \cdot x_i)}{\|\alpha\|_1}.$$

This is equivalent to the following convex
optimization LP problem:
$$\max_{\alpha, \rho} \rho$$
$$\text{subject to: } y_i (\alpha \cdot x_i) \geq \rho,\ i \in [1, m];\quad \|\alpha\|_1 = 1.$$

Note that:
$$\frac{|\alpha \cdot x|}{\|\alpha\|_1} = d_\infty(x, H), \text{ with } H = \{x : \alpha \cdot x = 0\}.$$
Advantages of AdaBoost
Simple: straightforward implementation.
Efficient: complexity O(mN T ) for stumps:
• when N and T are not too large, the algorithm is
quite fast.
Theoretical guarantees: but still many questions.
• AdaBoost not designed to maximize margin.
• regularized versions of AdaBoost.



Outliers
AdaBoost assigns larger weights to harder
examples.
Application:
• Detecting mislabeled examples.
• Dealing with noisy data: regularization based on
the average weight assigned to a point (soft
margin idea for boosting) (Meir and Rätsch, 2003).



Weaker Aspects
Parameters:
• need to determine T , the number of rounds of
boosting: stopping criterion.
• need to determine base learners: risk of
overfitting or low margins.
Noise: severely damages the accuracy of AdaBoost
(Dietterich, 2000).



Other Boosting Algorithms
arc-gv (Breiman, 1996): designed to maximize the
margin, but outperformed by AdaBoost in
experiments (Reyzin and Schapire, 2006).
L1-regularized AdaBoost (Raetsch et al., 2001):
outperforms AdaBoost in experiments (Cortes et al.,
2014).

DeepBoost (Cortes et al., 2014): more favorable


learning guarantees, outperforms both AdaBoost
and L1-regularized AdaBoost in experiments.



References
• Corinna Cortes, Mehryar Mohri, and Umar Syed. Deep boosting. In ICML, pages 262-270,
2014.

• Leo Breiman. Bagging predictors. Machine Learning, 24(2): 123-140, 1996.

• Thomas G. Dietterich. An experimental comparison of three methods for constructing


ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40(2):
139-158, 2000.

• Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning


and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139,
1997.

• G. Lebanon and J. Lafferty. Boosting and maximum likelihood for exponential models. In
NIPS, pages 447–454, 2001.

• Ron Meir and Gunnar Rätsch. An introduction to boosting and leveraging. In Advanced
Lectures on Machine Learning (LNAI2600), 2003.

• J. von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen,


100:295-320, 1928.



References
• Cynthia Rudin, Ingrid Daubechies and Robert E. Schapire. The dynamics of AdaBoost:
Cyclic behavior and convergence of margins. Journal of Machine Learning Research, 5:
1557-1595, 2004.

• Gunnar Rätsch and Manfred K. Warmuth. Maximizing the margin with boosting. In
Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002),
Sydney, Australia, pages 334-350, July 2002.

• Reyzin, Lev and Schapire, Robert E. How boosting the margin can also boost classifier
complexity. In ICML, pages 753-760, 2006.

• Robert E. Schapire. The boosting approach to machine learning: An overview. In D. D.


Denison, M. H. Hansen, C. Holmes, B. Mallick, B. Yu, editors, Nonlinear Estimation and
Classification. Springer, 2003.

• Robert E. Schapire and Yoav Freund. Boosting, Foundations and Algorithms. The MIT Press,
2012.

• Robert E. Schapire, Yoav Freund, Peter Bartlett and Wee Sun Lee. Boosting the margin: A
new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):
1651-1686, 1998.

