
CSCE 970 Lecture 6: System Evaluation and Combining Classifiers

Stephen D. Scott
(Adapted partially from Tom Mitchell's slides)

March 13, 2003


Introduction

• Once features generated/selected and classifier built, need to assess its performance on new data

• Assume all data drawn i.i.d. according to some prob. distribution D and try to estimate classifier's prediction error on new data drawn according to D

• If error estimate unacceptable, need to select/generate new features and/or build new classifier

  – Change features used

  – Change size/structure of neural network

  – Change assumptions in Bayesian classifier

  – Choose new learning method, e.g. decision tree

Introduction (cont'd)

• Can't use error on training set to estimate ability to generalize, because it's too optimistic

• So use independent testing set to estimate error

• Can use statistical hypothesis testing techniques to:

  – Give confidence intervals for error estimate

  – Contrast performance of two classifiers (see if the difference in their error estimates is statistically significant)

• Sometimes need to train and test with a small data set

• Will also look at improving a classifier's performance


Outline

• Sample error vs. true error

• Confidence intervals for observed hypothesis error

• Estimators

• Binomial distribution, Normal distribution, Central Limit Theorem

• Paired t tests

• Comparing learning methods

• Combining classifiers to improve performance: Weighted Majority, Bagging, Boosting
Two Definitions of Error

• Denote the learned classifier by the hypothesis h and the target function (that labels examples) by f

• The true error of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D:

  errorD(h) ≡ Pr_{x∼D}[f(x) ≠ h(x)]

• The sample error of h with respect to target function f and data sample S is the proportion of examples h misclassifies:

  errorS(h) ≡ (1/|S|) Σ_{x∈S} δ(f(x) ≠ h(x)),

  where δ(f(x) ≠ h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise

• How well does errorS(h) estimate errorD(h)?


Problems Estimating Error

• Bias: If S is the training set, errorS(h) is optimistically biased:

  bias ≡ E[errorS(h)] − errorD(h)

  For an unbiased estimate, h and S must be chosen independently

• Variance: Even with unbiased S, errorS(h) may still vary from errorD(h)
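As a concrete illustration (not part of the original slides), a minimal Python sketch of the sample-error computation; the target f and hypothesis h below are hypothetical stand-ins:

```python
import numpy as np

def sample_error(h, f, S):
    """errorS(h): fraction of examples x in S with h(x) != f(x)."""
    mistakes = sum(1 for x in S if h(x) != f(x))
    return mistakes / len(S)

# Toy illustration with hypothetical h and f on random feature vectors
rng = np.random.default_rng(0)
S = [rng.normal(size=3) for _ in range(40)]
f = lambda x: int(x[0] + x[1] > 0)   # hypothetical target function
h = lambda x: int(x[0] > 0)          # hypothetical learned hypothesis
print(sample_error(h, f, S))
```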

Estimators

Experiment:

1. Choose sample S of size n according to distribution D

2. Measure errorS(h)

errorS(h) is a random variable (i.e., result of an experiment)

errorS(h) is an unbiased estimator for errorD(h)

Given observed errorS(h), what can we conclude about errorD(h)?


Confidence Intervals

If

• S contains n examples (feature vectors), drawn independently of h and each other

• n ≥ 30

Then

• With approximately 95% probability, errorD(h) lies in the interval

  errorS(h) ± 1.96 √( errorS(h)(1 − errorS(h)) / n )

E.g. hypothesis h misclassifies 12 of the 40 examples in test set S:

  errorS(h) = 12/40 = 0.30

Then with approx. 95% confidence, errorD(h) ∈ [0.158, 0.442]
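A short Python sketch (an illustration, not from the slides) that reproduces the interval above; 1.96 is the two-sided z value for 95% confidence:

```python
import math

def error_confidence_interval(error_s, n, z=1.96):
    """Approximate confidence interval for errorD(h), given the
    observed sample error on an independent test set of size n."""
    half_width = z * math.sqrt(error_s * (1.0 - error_s) / n)
    return error_s - half_width, error_s + half_width

# Example from the slide: 12 mistakes on 40 test examples
print(error_confidence_interval(12 / 40, 40))   # approx (0.158, 0.442)
```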
Confidence Intervals (cont'd)

If

• S contains n examples, drawn independently of h and each other

• n ≥ 30

Then

• With approximately N% probability, errorD(h) lies in the interval

  errorS(h) ± zN √( errorS(h)(1 − errorS(h)) / n )

where

  N%: 50%  68%  80%  90%  95%  98%  99%
  zN: 0.67 1.00 1.28 1.64 1.96 2.33 2.58

Why?


errorS(h) is a Random Variable

Repeatedly run the experiment, each with a different randomly drawn S (each of size n)

Probability of observing r misclassified examples:

[Figure: Binomial distribution for n = 40, p = 0.3]

  P(r) = (n choose r) errorD(h)^r (1 − errorD(h))^(n−r)

I.e., let errorD(h) be the probability of heads for a biased coin; then P(r) = prob. of getting r heads out of n flips

What kind of distribution is this?

Binomial Probability Distribution

[Figure: Binomial distribution for n = 40, p = 0.3]

  P(r) = (n choose r) p^r (1 − p)^(n−r) = [n! / (r!(n − r)!)] p^r (1 − p)^(n−r)

Probability P(r) of r heads in n coin flips, if p = Pr(heads)

• Expected, or mean value of X, E[X], is

  E[X] ≡ Σ_{i=0}^{n} i P(i) = np

• Variance of X is

  Var(X) ≡ E[(X − E[X])^2] = np(1 − p)

• Standard deviation of X, σX, is

  σX ≡ √( E[(X − E[X])^2] ) = √( np(1 − p) )


Approximate Binomial Dist. with Normal

errorS(h) = r/n is binomially distributed, with

• mean µ_errorS(h) = errorD(h) (i.e. unbiased est.)

• standard deviation

  σ_errorS(h) = √( errorD(h)(1 − errorD(h)) / n )

  (i.e. increasing n decreases variance)

Want to compute confidence interval = interval centered at errorD(h) containing N% of the weight under the distribution (difficult for binomial)

Approximate binomial by normal (Gaussian) dist:

• mean µ_errorS(h) = errorD(h)

• standard deviation

  σ_errorS(h) ≈ √( errorS(h)(1 − errorS(h)) / n )
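A small sketch (added here for illustration, using scipy as an assumed dependency) comparing how much probability mass the binomial actually places within µ ± 1.96σ against the normal approximation's nominal 95%, for the running example n = 40, p = 0.3:

```python
import math
from scipy.stats import binom, norm

n, p = 40, 0.3                       # running example: test-set size and errorD(h)
mu = n * p                           # binomial mean of r = number of mistakes
sigma = math.sqrt(n * p * (1 - p))   # binomial standard deviation

lo, hi = mu - 1.96 * sigma, mu + 1.96 * sigma
# Probability mass the binomial puts on integers r in [lo, hi] ...
exact = binom.cdf(math.floor(hi), n, p) - binom.cdf(math.ceil(lo) - 1, n, p)
# ... versus the normal approximation's coverage (0.95 by construction)
approx = norm.cdf(hi, mu, sigma) - norm.cdf(lo, mu, sigma)
print(exact, approx)
```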
Normal Probability Distribution

[Figure: Normal distribution with mean 0, standard deviation 1]

  p(x) = (1 / √(2πσ^2)) exp( −(1/2) ((x − µ)/σ)^2 )

• Defined completely by µ and σ

• The probability that X will fall into the interval (a, b) is given by

  ∫_a^b p(x) dx

• Expected, or mean value of X, E[X], is E[X] = µ

• Variance of X is Var(X) = σ^2

• Standard deviation of X, σX, is σX = σ


Normal Probability Distribution (cont'd)

[Figure: standard normal density with the central 80% region shaded]

80% of area (probability) lies in µ ± 1.28σ

N% of area (probability) lies in µ ± zN σ

  N%: 50%  68%  80%  90%  95%  98%  99%
  zN: 0.67 1.00 1.28 1.64 1.96 2.33 2.58

Can also have one-sided bounds:

[Figure: standard normal density with one tail shaded]

N% of area lies < µ + z′N σ or > µ − z′N σ, where

  z′N = z_{100 − 2(100 − N)}

  N%:  50%  68%  80%  90%  95%  98%  99%
  z′N: 0.0  0.47 0.84 1.28 1.64 2.05 2.33
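For reference, a sketch (added as an illustration, assuming scipy.stats.norm is available) showing how the zN and z′N table entries come from the standard normal inverse CDF:

```python
from scipy.stats import norm

def two_sided_z(conf):
    """zN such that conf of a normal's mass lies within mu +/- zN*sigma."""
    return norm.ppf(0.5 + conf / 2.0)

def one_sided_z(conf):
    """z'N such that conf of the mass lies below mu + z'N*sigma."""
    return norm.ppf(conf)

for n_pct in (0.50, 0.68, 0.80, 0.90, 0.95, 0.98, 0.99):
    print(n_pct, round(two_sided_z(n_pct), 2), round(one_sided_z(n_pct), 2))
```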

Confidence Intervals Revisited

If

• S contains n examples, drawn independently of h and each other

• n ≥ 30

Then

• With approximately 95% probability, errorS(h) lies in the interval

  errorD(h) ± 1.96 √( errorD(h)(1 − errorD(h)) / n )

Equivalently, errorD(h) lies in the interval

  errorS(h) ± 1.96 √( errorD(h)(1 − errorD(h)) / n )

which is approximately

  errorS(h) ± 1.96 √( errorS(h)(1 − errorS(h)) / n )

(One-sided bounds yield upper or lower error bounds)


Central Limit Theorem

How can we justify the approximation?

Consider a set of independent, identically distributed random variables Y1 ... Yn, all governed by an arbitrary probability distribution with mean µ and finite variance σ^2. Define the sample mean

  Ȳ ≡ (1/n) Σ_{i=1}^{n} Yi

Note that Ȳ is itself a random variable, i.e. the result of an experiment (e.g. errorS(h) = r/n)

Central Limit Theorem: As n → ∞, the distribution governing Ȳ approaches a Normal distribution with mean µ and variance σ^2/n

Thus the distribution of errorS(h) is approximately normal for large n, and its expected value is errorD(h)

(Rule of thumb: n ≥ 30 when estimator's distribution is binomial; might need to be larger for other distributions)
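A quick simulation sketch of the CLT statement for errorS(h) = r/n (added for illustration; p = 0.3 and n = 40 are just the running example, and the number of repetitions is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, trials = 0.3, 40, 100_000   # errorD(h), test-set size, repeated experiments

# Each row is one test set of n i.i.d. 0/1 "mistake" indicators;
# its mean is one observation of errorS(h) = r/n
sample_means = rng.binomial(1, p, size=(trials, n)).mean(axis=1)

print(sample_means.mean())        # close to p: errorS(h) is unbiased
print(sample_means.std())         # close to sqrt(p*(1-p)/n), as the CLT predicts
print(np.sqrt(p * (1 - p) / n))
```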
Calculating Confidence Intervals

1. Pick parameter p to estimate

   • errorD(h)

2. Choose an estimator

   • errorS(h)

3. Determine probability distribution that governs the estimator

   • errorS(h) is governed by a binomial distribution, approximated by a normal when n ≥ 30

4. Find interval (L, U) such that N% of the probability mass falls in the interval

   • Could have L = −∞ or U = ∞

   • Use table of zN or z′N values


Difference Between Hypotheses

Test h1 on sample S1, test h2 on S2

1. Pick parameter to estimate

   d ≡ errorD(h1) − errorD(h2)

2. Choose an estimator

   d̂ ≡ errorS1(h1) − errorS2(h2)

   (unbiased)

3. Determine probability distribution that governs the estimator (difference between two normals is also normal; variances add)

   σ_d̂ ≈ √( errorS1(h1)(1 − errorS1(h1)) / n1 + errorS2(h2)(1 − errorS2(h2)) / n2 )

4. Find interval (L, U) such that N% of the prob. mass falls in the interval:

   d̂ ± zN σ_d̂
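A sketch of the interval for d̂ (added for illustration; the two error rates and test-set sizes below are hypothetical):

```python
import math

def error_difference_interval(e1, n1, e2, n2, z=1.96):
    """Approximate confidence interval for d = errorD(h1) - errorD(h2),
    estimated from errors e1, e2 on independent test sets of sizes n1, n2."""
    d_hat = e1 - e2
    sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return d_hat - z * sigma, d_hat + z * sigma

# Hypothetical numbers: h1 errs 0.30 on 40 examples, h2 errs 0.20 on 50
print(error_difference_interval(0.30, 40, 0.20, 50))
```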

Paired t Test to Compare hA, hB

1. Partition the data into k disjoint test sets T1, T2, ..., Tk of equal size, where this size is at least 30

2. For i from 1 to k, do

   δi ← errorTi(hA) − errorTi(hB)

3. Return the value δ̄, where

   δ̄ ≡ (1/k) Σ_{i=1}^{k} δi

N% confidence interval estimate for d:

  δ̄ ± t_{N,k−1} s_δ̄

  s_δ̄ ≡ √( (1 / (k(k − 1))) Σ_{i=1}^{k} (δi − δ̄)^2 )

t_{N,k−1} (Student's t distribution with k − 1 degrees of freedom) plays the role of z; s_δ̄ plays the role of σ

Using the t distribution gives more accurate results here, since the standard deviation is only estimated and the test sets are not independent


Comparing Learning Algorithms LA and LB

What we'd like to estimate:

  E_{S⊂D}[errorD(LA(S)) − errorD(LB(S))]

where L(S) is the hypothesis output by learner L using training set S

I.e., the expected difference in true error between hypotheses output by learners LA and LB, when trained using randomly selected training sets S drawn according to distribution D

But, given limited data D0, what is a good estimator?

• Could partition D0 into training set S0 and testing set T0, and measure

  errorT0(LA(S0)) − errorT0(LB(S0))

• Even better, repeat this many times and average the results (next slide)
Comparing Learning Algorithms LA and LB (cont'd)

1. Partition data D0 into k disjoint test sets T1, T2, ..., Tk of equal size, where this size is at least 30

2. For i from 1 to k, do

   (use Ti for the test set, and the remaining data for training set Si)

   • Si ← {D0 − Ti}
   • hA ← LA(Si)
   • hB ← LB(Si)
   • δi ← errorTi(hA) − errorTi(hB)

3. Return the value δ̄, where

   δ̄ ≡ (1/k) Σ_{i=1}^{k} δi


Comparing Learning Algorithms LA and LB (cont'd)

• Notice we'd like to use the paired t test on δ̄ to obtain a confidence interval

• Not really correct, because the training sets in this algorithm are not independent (they overlap!)

• More correct to view the algorithm as producing an estimate of

  E_{S⊂D0}[errorD(LA(S)) − errorD(LB(S))]

  instead of

  E_{S⊂D}[errorD(LA(S)) − errorD(LB(S))]

• But even this approximation is better than no comparison
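A sketch of the k-fold comparison procedure on the slide above, returning δ̄ and the paired-t interval (added for illustration; learn_a, learn_b, and error are assumed to be supplied by the caller, and scipy is an assumed dependency):

```python
import numpy as np
from scipy.stats import t

def k_fold_paired_comparison(learn_a, learn_b, error, D0, k=10, conf=0.95):
    """Sketch of the k-fold paired comparison of learners LA and LB.
    learn_a, learn_b: map a training set (list of examples) to a hypothesis.
    error: map (hypothesis, test set) to an error rate.
    Returns delta_bar and an approximate conf-level interval for d."""
    indices = np.arange(len(D0))
    deltas = []
    for test_idx in np.array_split(indices, k):
        train_idx = np.setdiff1d(indices, test_idx)
        S_i = [D0[j] for j in train_idx]
        T_i = [D0[j] for j in test_idx]
        deltas.append(error(learn_a(S_i), T_i) - error(learn_b(S_i), T_i))
    deltas = np.array(deltas)
    delta_bar = deltas.mean()
    s_delta = np.sqrt(((deltas - delta_bar) ** 2).sum() / (k * (k - 1)))
    t_val = t.ppf(0.5 + conf / 2.0, df=k - 1)   # two-sided t value, k-1 dof
    return delta_bar, (delta_bar - t_val * s_delta, delta_bar + t_val * s_delta)
```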

Combining Classifiers

• Sometimes a single classifier (e.g. neural network, decision tree) won't perform well, but a weighted combination of them will

• Each classifier (or expert) in the pool has its own weight

• When asked to predict the label for a new example, each expert makes its own prediction, and then the master algorithm combines them using the weights for its own prediction (i.e. the "official" one)

• If the classifiers themselves cannot learn (e.g. heuristics), then the best we can do is to learn a good set of weights

• If we are using a learning algorithm (e.g. NN, dec. tree), then we can rerun the algorithm on different subsamples of the training set and set the classifiers' weights during training


Weighted Majority Algorithm (WM)
[Mitchell, Sec. 7.5.4]

[Figure: a pool of "experts" a1, ..., an with weights w1, ..., wn sends predictions a1(x), ..., an(x) to a master algorithm, which computes q0 = Σ_{ai(x)=0} wi and q1 = Σ_{ai(x)=1} wi, predicts 0 if q0 > q1 and 1 otherwise, and then receives the correct label]
Weighted Majority Algorithm (WM) (cont'd)

ai is the ith prediction algorithm in pool A of algorithms; each algorithm is an arbitrary function from X to {0, 1} or {−1, 1}

wi is the weight the master algorithm associates with ai

β ∈ [0, 1) is a parameter

• ∀ i set wi ← 1

• For each training example (or trial) ⟨x, c(x)⟩

  – Set q0 ← q1 ← 0

  – For each algorithm ai

    ∗ If ai(x) = 0 then q0 ← q0 + wi, else q1 ← q1 + wi

  – If q1 > q0 then predict 1 for c(x), else predict 0 (case for q1 = q0 is arbitrary)

  – For each ai ∈ A

    ∗ If ai(x) ≠ c(x) then wi ← β wi

Setting β = 0 yields the Halving algorithm over A


Weighted Majority Mistake Bound (On-Line Model)

• Let aopt ∈ A be the expert that makes the fewest mistakes on an arbitrary sequence S of examples; let k = its number of mistakes

• Let β = 1/2 and Wt = Σ_{i=1}^{n} wi,t = sum of the weights at trial t (W0 = n)

• On a trial t on which WM makes a mistake, the total weight reduced is

  Wt^mis = Σ_{ai(xt) ≠ c(xt)} wi ≥ Wt/2,

  so

  Wt+1 = Wt − Wt^mis + Wt^mis/2 = Wt − Wt^mis/2 ≤ 3Wt/4

• After seeing all of S, w_{opt,|S|} = (1/2)^k and W_{|S|} ≤ n(3/4)^M, where M = total number of mistakes, yielding

  (1/2)^k ≤ n (3/4)^M,

  so

  M ≤ (k + log2 n) / (−log2(3/4)) ≤ 2.41 (k + log2 n)
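A sketch of the WM loop above (added for illustration; the pool of experts and the trial sequence are assumed to be supplied by the caller, with labels in {0, 1}):

```python
def weighted_majority(pool, trials, beta=0.5):
    """Sketch of the Weighted Majority algorithm.
    pool: list of prediction functions X -> {0, 1};
    trials: iterable of (x, label) pairs, processed online.
    Returns the final weights and the number of mistakes WM made."""
    weights = [1.0] * len(pool)
    mistakes = 0
    for x, label in trials:
        preds = [a(x) for a in pool]
        q0 = sum(w for w, p in zip(weights, preds) if p == 0)
        q1 = sum(w for w, p in zip(weights, preds) if p == 1)
        prediction = 1 if q1 > q0 else 0      # ties broken in favor of 0 here
        if prediction != label:
            mistakes += 1
        # Demote every expert that was wrong on this trial
        weights = [w * beta if p != label else w
                   for w, p in zip(weights, preds)]
    return weights, mistakes
```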

Weighted Majority Mistake Bound (cont'd)

• Thus for any arbitrary sequence of examples, WM is guaranteed to not perform much worse than the best expert in the pool, plus the log of the number of experts

  – Implicitly agnostic

• Other results:

  – Bounds hold for general values of β ∈ [0, 1)

  – Better bounds hold for more sophisticated algorithms, but only better by a constant factor (worst-case lower bound: Ω(k + log n))

  – Get bounds for real-valued labels and predictions

  – Can track a shifting concept, i.e. where the best expert can suddenly change in S; key: don't let any weight get too low relative to the other weights, i.e. don't overcommit


Bagging Classifiers
[Breiman, ML Journal, '96]

Bagging = Bootstrap aggregating

Bootstrap sampling: given a set D containing m training examples:

• Create Di by drawing m examples uniformly at random with replacement from D

• Expect Di to omit ≈ 37% of the examples from D

Bagging:

• Create k bootstrap samples D1, ..., Dk

• Train a classifier on each Di

• Classify a new instance x ∈ X by majority vote of the learned classifiers (equal weights)
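A sketch of bagging as described above (added for illustration; learn is an assumed user-supplied training routine, and labels are taken to be in {0, 1}):

```python
import numpy as np

def bag(learn, D, k=50, seed=0):
    """Sketch of bagging: train one classifier per bootstrap sample of D.
    learn: assumed function mapping a list of examples to a classifier."""
    rng = np.random.default_rng(seed)
    m = len(D)
    classifiers = []
    for _ in range(k):
        idx = rng.integers(0, m, size=m)     # m draws with replacement (bootstrap)
        classifiers.append(learn([D[i] for i in idx]))
    return classifiers

def bagged_predict(classifiers, x):
    """Equal-weight majority vote, assuming each classifier returns 0 or 1."""
    votes = sum(h(x) for h in classifiers)
    return int(2 * votes >= len(classifiers))
```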
Bagging Experiment
[Breiman, ML Journal, '96]

Given a sample S of labeled data, Breiman did the following 100 times and reported the average:

1. Divide S randomly into test set T (10%) and training set D (90%)

2. Learn a decision tree from D and let eS be its error rate on T

3. Do 50 times: create bootstrap set Di and learn a decision tree; let eB be the error of a majority vote of the trees on T

Results

  Data Set        ēS    ēB    Decrease
  waveform        29.0  19.4  33%
  heart           10.0   5.3  47%
  breast cancer    6.0   4.2  30%
  ionosphere      11.2   8.6  23%
  diabetes        23.4  18.8  20%
  glass           32.0  24.9  27%
  soybean         14.5  10.6  27%


Bagging Experiment (cont'd)

Same experiment, but using a nearest neighbor classifier, where the prediction of a new feature vector x's label is that of x's nearest neighbor in the training set, where distance is e.g. Euclidean distance

Results

  Data Set        ēS    ēB    Decrease
  waveform        26.1  26.1  0%
  heart            6.3   6.3  0%
  breast cancer    4.9   4.9  0%
  ionosphere      35.7  35.7  0%
  diabetes        16.4  16.4  0%
  glass           16.4  16.4  0%

What happened?

When Does Bagging Help?

When the learner is unstable, i.e. if a small change in the training set causes a large change in the hypothesis produced

• Decision trees, neural networks

• Not nearest neighbor

Experimentally, bagging can help substantially for unstable learners; it can somewhat degrade results for stable learners


Boosting Classifiers
[Freund & Schapire, ICML '96; many more]

Similar to bagging, but don't always sample uniformly; instead adjust the resampling distribution over D to focus attention on previously misclassified examples

The final classifier weights the learned classifiers, but not uniformly; instead the weight of classifier ht depends on its performance on the data it was trained on

Repeat for t = 1, ..., T:

1. Run the learning algorithm on examples randomly drawn from training set D according to distribution Dt (D1 = uniform)

2. The output of the learner is hypothesis ht : X → {−1, +1}

3. Compute the expected error of ht on examples drawn according to Dt (can compute exactly)

4. Create Dt+1 from Dt by increasing the weight of examples that ht mispredicts

The final classifier is a weighted combination of h1, ..., hT, where ht's weight depends on its error w.r.t. Dt
Boosting (cont'd)

• Preliminaries: D = {(x1, y1), ..., (xm, ym)}, yi ∈ {−1, +1}; Dt(i) = weight of (xi, yi) under Dt

• Initialization: D1(i) = 1/m

• Error Computation: εt = Pr_{Dt}[ht(xi) ≠ yi]
  (easy to do since we know Dt)

• If εt > 1/2 then halt; else:

• Weighting Factor: αt = (1/2) ln( (1 − εt) / εt )
  (grows as εt decreases)

• Update: Dt+1(i) = Dt(i) exp(−αt yi ht(xi)) / Zt, where Zt is a normalization factor
  (increase weight of mispredicted examples, decrease weight of correctly predicted ones)

• Final Hypothesis: H(x) = sign( Σ_{t=1}^{T} αt ht(x) )
  (εt large ⇒ flip ht's prediction strongly)


Boosting Example

[Figure: three rounds of boosting on a 2-D data set, showing the distributions D1, D2, D3 and the learned hypotheses h1, h2, h3, with ε1 = 0.30, α1 = 0.42; ε2 = 0.21, α2 = 0.65; ε3 = 0.14, α3 = 0.92]
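A sketch of the boosting loop above, in the AdaBoost style of the slide (added for illustration; learn is an assumed weak learner that accepts example weights, and X, y are assumed NumPy arrays with labels in {−1, +1}):

```python
import numpy as np

def boost(learn, X, y, T=10):
    """Sketch of the boosting procedure on the slide above.
    learn: assumed weak learner taking (X, y, weights) and returning a
    hypothesis h with h(X) in {-1, +1}; y is an array of {-1, +1} labels."""
    m = len(y)
    D_t = np.full(m, 1.0 / m)                    # D1(i) = 1/m
    hypotheses, alphas = [], []
    for _ in range(T):
        h = learn(X, y, D_t)
        preds = h(X)
        eps = D_t[preds != y].sum()              # error of h w.r.t. D_t
        if eps > 0.5:
            break                                # halt, as on the slide
        eps = max(eps, 1e-12)                    # guard against a perfect hypothesis
        alpha = 0.5 * np.log((1 - eps) / eps)    # grows as eps shrinks
        D_t = D_t * np.exp(-alpha * y * preds)   # up-weight mistakes, down-weight correct
        D_t /= D_t.sum()                         # divide by normalizer Z_t
        hypotheses.append(h)
        alphas.append(alpha)
    return lambda X_new: np.sign(sum(a * h(X_new) for a, h in zip(alphas, hypotheses)))
```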

Boosting Example (cont'd)

[Figure: the final hypothesis H_final = sign(0.42 h1 + 0.65 h2 + 0.92 h3), drawn as the weighted combination of the three decision boundaries]


Boosting Miscellany

• If each εt < 1/2 − γt, the error of H(·) on D drops exponentially in Σ_{t=1}^{T} γt (what does γ correspond to?)

• Can also bound the generalization error of H(·) independent of T

• Also successful empirically on neural network and decision tree learners

  – Empirically, generalization sometimes improves if training continues after H(·)'s error on D drops to 0

  – Contrary to intuition; expect overfitting

  – Related to increasing the combined classifier's margin (confidence in prediction)

Topic summary due in 1 week!
