That is, $H$ uses a linear combination of the decisions of each of the $h_i$ hypotheses in the ensemble. The AdaBoost algorithm sequentially chooses $h_i$ from $\mathcal{H}$ and assigns this hypothesis a weight $\alpha_i$. We let $H_t$ be the classifier formed by the first $t$ hypotheses. That is,
$$H_t(x) = \sum_{i=1}^{t} \alpha_i h_i(x)$$
where $H_0(x) := 0$. That is, the empty ensemble will always output 0.
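To make the evaluation of such an ensemble concrete, here is a minimal sketch (representing hypotheses as Python callables returning $\pm 1$; the names and sample values are illustrative assumptions, not from the text):

```python
import numpy as np

# A hypothetical ensemble: pairs (alpha_i, h_i), where each h_i maps x to -1 or +1.
ensemble = [
    (0.8, lambda x: 1 if x[0] > 0.5 else -1),
    (0.3, lambda x: 1 if x[1] > 0.2 else -1),
]

def H_t(x, ensemble):
    """Weighted vote H_t(x) = sum_i alpha_i * h_i(x)."""
    return sum(alpha * h(x) for alpha, h in ensemble)

print(H_t(np.array([0.7, 0.1]), ensemble))  # 0.8 * 1 + 0.3 * (-1) = 0.5
print(H_t(np.array([0.7, 0.1]), []))        # 0, matching H_0(x) := 0
```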
The idea behind the AdaBoost algorithm is that the $t$th hypothesis will correct for the errors that the first $t-1$ hypotheses make on the training set. More specifically, after we select the first $t-1$ hypotheses, we determine which instances in $S$ our $t-1$ hypotheses perform poorly on and make sure that the $t$th hypothesis performs well on these instances.
The pseudocode for AdaBoost is given in Algorithm 1, and a high-level overview of the algorithm follows below:
Algorithm 1 AdaBoost
1: for $i \in \{1, 2, \ldots, n\}$ do
2:     $D_1(i) \leftarrow \frac{1}{n}$
3: end for
4: $H \leftarrow \emptyset$
5: for $t = 1, \ldots, T$ do
6:     $h_t \leftarrow \operatorname*{argmin}_{h \in \mathcal{H}} P_{i \sim D_t}(h(x_i) \neq y_i)$ ▷ find a good hypothesis on the weighted training set
7:     $\epsilon_t \leftarrow P_{i \sim D_t}(h_t(x_i) \neq y_i)$ ▷ compute the hypothesis's error
8:     $\alpha_t \leftarrow \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$ ▷ compute the hypothesis's weight
9:     $H \leftarrow H \cup \{(\alpha_t, h_t)\}$ ▷ add hypothesis to the ensemble
10:    for $i \in \{1, 2, \ldots, n\}$ do ▷ update training set distribution
11:        $D_{t+1}(i) \leftarrow \dfrac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{\sum_{j=1}^{n} D_t(j)\, e^{-\alpha_t y_j h_t(x_j)}}$
12:    end for
13: end for
14: return $H$
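The following is a runnable sketch of Algorithm 1 (the helper names `adaboost` and `predict`, and the use of a finite pool `stump_candidates` standing in for $\mathcal{H}$, are illustrative assumptions, not from the text):

```python
import numpy as np

def adaboost(X, y, T, stump_candidates):
    """AdaBoost sketch. y holds labels in {-1, +1}; each candidate h maps X to
    an array of {-1, +1} predictions. Returns a list of (alpha_t, h_t) pairs."""
    n = len(y)
    D = np.full(n, 1.0 / n)                    # lines 1-3: uniform initial distribution
    ensemble = []                              # line 4: H <- empty set
    for t in range(T):
        # line 6: hypothesis with the smallest weighted 0-1 loss
        errors = [np.sum(D * (h(X) != y)) for h in stump_candidates]
        h_t = stump_candidates[int(np.argmin(errors))]
        eps_t = min(errors)                    # line 7: weighted error
        if eps_t == 0 or eps_t >= 0.5:         # guard: alpha_t undefined or nonpositive
            break
        alpha_t = 0.5 * np.log((1 - eps_t) / eps_t)   # line 8
        ensemble.append((alpha_t, h_t))               # line 9
        # lines 10-12: upweight misclassified instances, then renormalize
        D = D * np.exp(-alpha_t * y * h_t(X))
        D /= D.sum()
    return ensemble

def predict(ensemble, X):
    # sign of H_T(x) = sum_t alpha_t * h_t(x)
    return np.sign(sum(a * h(X) for a, h in ensemble))
```

Any weak learner that can train on a weighted (or resampled) training set can replace the argmin over a finite pool in line 6.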
We call this expected loss $\epsilon_t$ the "weighted loss" because the 0-1 loss is not computed on the instances in the training set directly, but rather on the weighted instances in the training set.
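Spelled out, this expectation is simply the $D_t$-weighted average of the 0-1 loss over the training set:
$$\epsilon_t = \mathbb{E}_{i \sim D_t}\big[\mathbb{1}[h_t(x_i) \neq y_i]\big] = \sum_{i=1}^{n} D_t(i)\, \mathbb{1}[h_t(x_i) \neq y_i]$$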
We will soon explain a theoretical justification for this precise probability assignment, but for now we can gain an intuitive understanding. Note the term $e^{-\alpha_t y_i h_t(x_i)}$. If $h_t(x_i) = y_i$, then $y_i h_t(x_i) = 1$, which means that $e^{-\alpha_t y_i h_t(x_i)} = e^{-\alpha_t}$. If, on the other hand, $h_t(x_i) \neq y_i$, then $y_i h_t(x_i) = -1$, which means that $e^{-\alpha_t y_i h_t(x_i)} = e^{\alpha_t}$. Since $\alpha_t > 0$ whenever $\epsilon_t < \frac{1}{2}$, we see that $e^{-\alpha_t y_i h_t(x_i)}$ is smaller if the hypothesis's prediction agrees with the true value. That is, we assign higher probability to the $i$th instance if $h_t$ was wrong on $x_i$.
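As a worked numeric check of this intuition (with a made-up error rate, not one from the text):

```python
import numpy as np

eps_t = 0.25                                   # hypothetical weighted error
alpha_t = 0.5 * np.log((1 - eps_t) / eps_t)    # ~0.549
print(np.exp(-alpha_t))  # ~0.577: factor on correctly classified instances
print(np.exp(alpha_t))   # ~1.732: factor on misclassified instances
# Ratio e^{alpha_t} / e^{-alpha_t} = (1 - eps_t) / eps_t = 3: before renormalizing,
# a misclassified instance keeps 3x as much mass as a correctly classified one.
```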
Notice that AdaBoost builds the ensemble one hypothesis at a time:
$$H_t = H_{t-1} + \alpha_t h_t$$
where the new $h_t$ and $\alpha_t$ minimize the exponential loss of $H_t$ on the training data. Theorem 1 shows that AdaBoost's choice of $h_t$ minimizes the exponential loss of $H_t$ over the training data. That is,
$$h_t = \operatorname*{argmin}_{h \in \mathcal{H}} L_S(H_{t-1} + Ch)$$
where
$$L_S(H_{t-1} + Ch) := \frac{1}{n} \sum_{i=1}^{n} \ell_{\exp}(H_{t-1} + Ch, x_i, y_i)$$
and $C$ is an arbitrary constant. Theorem 2 shows that once $h_t$ is chosen, AdaBoost's choice of $\alpha_t$ then further minimizes the exponential loss of $H_t$ over the training set. That is,
$$\alpha_t := \operatorname*{argmin}_{\alpha} L_S(H_{t-1} + \alpha h_t).$$
Proof: We proceed by induction on $t$, showing that $D_t$ is the normalized vector of the weights $w_{t,i} := e^{-y_i H_{t-1}(x_i)}$:
$$D_t(i) = \frac{w_{t,i}}{\sum_{j=1}^{n} w_{t,j}} \qquad (4)$$
For the base case, $w_{1,i} = e^{-y_i H_0(x_i)} = e^{0} = 1$ for every $i$, so the right-hand side of (4) equals $\frac{1}{n} = D_1(i)$. Next, we need to prove the inductive step. That is, we prove that
$$D_t(i) = \frac{w_{t,i}}{\sum_{j=1}^{n} w_{t,j}} \implies D_{t+1}(i) = \frac{w_{t+1,i}}{\sum_{j=1}^{n} w_{t+1,j}}.$$
Starting from the distribution update in Algorithm 1,
\begin{align*}
D_{t+1}(i) &= \frac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{\sum_{j=1}^{n} D_t(j)\, e^{-\alpha_t y_j h_t(x_j)}} \\
&= \frac{\frac{w_{t,i}}{\sum_{k=1}^{n} w_{t,k}}\, e^{-\alpha_t y_i h_t(x_i)}}{\sum_{j=1}^{n} \frac{w_{t,j}}{\sum_{k=1}^{n} w_{t,k}}\, e^{-\alpha_t y_j h_t(x_j)}} && \text{by the inductive hypothesis} \\
&= \frac{\frac{e^{-y_i H_{t-1}(x_i)}}{\sum_{k=1}^{n} e^{-y_k H_{t-1}(x_k)}}\, e^{-\alpha_t y_i h_t(x_i)}}{\sum_{j=1}^{n} \frac{e^{-y_j H_{t-1}(x_j)}}{\sum_{k=1}^{n} e^{-y_k H_{t-1}(x_k)}}\, e^{-\alpha_t y_j h_t(x_j)}} && \text{by the fact that } w_{t,i} := e^{-y_i H_{t-1}(x_i)} \\
&= \frac{e^{-y_i H_{t-1}(x_i)}\, e^{-\alpha_t y_i h_t(x_i)}}{\sum_{j=1}^{n} e^{-y_j H_{t-1}(x_j)}\, e^{-\alpha_t y_j h_t(x_j)}} && \text{canceling the common factor } \tfrac{1}{\sum_{k=1}^{n} e^{-y_k H_{t-1}(x_k)}} \\
&= \frac{e^{-y_i H_t(x_i)}}{\sum_{j=1}^{n} e^{-y_j H_t(x_j)}} = \frac{w_{t+1,i}}{\sum_{j=1}^{n} w_{t+1,j}} && \text{since } H_t = H_{t-1} + \alpha_t h_t,
\end{align*}
which completes the induction.
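The invariant (4) is easy to verify numerically: iterating the distribution update of Algorithm 1 must give the same $D_t$ as directly normalizing $w_{t,i} = e^{-y_i H_{t-1}(x_i)}$. A sketch with made-up labels, predictions, and weights (the invariant holds for any choice of the $\alpha_t$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 50, 5
y = rng.choice([-1, 1], size=n)
preds = rng.choice([-1, 1], size=(T, n))   # h_t(x_i) for each round t
alphas = rng.uniform(0.1, 1.0, size=T)     # arbitrary hypothesis weights

D = np.full(n, 1.0 / n)                    # D_1
H = np.zeros(n)                            # H_0(x_i) = 0
for t in range(T):
    D = D * np.exp(-alphas[t] * y * preds[t])
    D /= D.sum()
    H += alphas[t] * preds[t]

w = np.exp(-y * H)                         # w_{T+1,i} = e^{-y_i H_T(x_i)}
print(np.allclose(D, w / w.sum()))         # True: D_{T+1} equals normalized w
```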
For Theorem 2, we minimize $L_S(H_{t-1} + \alpha h_t)$ over $\alpha$ by setting its derivative with respect to $\alpha$ to zero. Writing
$$\epsilon_t := P_{i \sim D_t}(y_i \neq h_t(x_i)) = \frac{\sum_{i : h_t(x_i) \neq y_i} w_{t,i}}{\sum_{i=1}^{n} w_{t,i}},$$
this gives
$$\alpha = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right).$$
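A quick numerical sanity check of this closed form (with made-up margins and predictions, not data from the text): brute-force minimization of $L_S(H_{t-1} + \alpha h_t)$ over a grid of $\alpha$ values should land on $\frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
y = rng.choice([-1, 1], size=n)
H_prev = rng.normal(size=n)            # hypothetical margins H_{t-1}(x_i)
h_t = rng.choice([-1, 1], size=n)      # hypothetical predictions h_t(x_i)

w = np.exp(-y * H_prev)                # w_{t,i}
D = w / w.sum()                        # D_t, by equation (4)
eps_t = np.sum(D * (h_t != y))
alpha_closed = 0.5 * np.log((1 - eps_t) / eps_t)

def L_S(alpha):
    # (1/n) * sum_i exp(-y_i * (H_{t-1}(x_i) + alpha * h_t(x_i)))
    return np.mean(np.exp(-y * (H_prev + alpha * h_t)))

alphas = np.linspace(-2, 2, 40001)     # grid spacing 1e-4
alpha_brute = alphas[np.argmin([L_S(a) for a in alphas])]
print(alpha_closed, alpha_brute)       # agree to within the grid spacing
```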