Chapter 22: Classification Assessment
Data Mining and Machine Learning
Mohammed J. Zaki¹ and Wagner Meira Jr.²
¹ Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
² Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Classification Assessment
A classifier is a model or function M that predicts the class label ŷ for a given
input example x:
ŷ = M(x)
Classification Performance Measures
Let D be the testing set comprising n points in a d-dimensional space, let
{c1 , c2 , . . . , ck } denote the set of k class labels, and let M be a classifier. For
x i ∈ D, let yi denote its true class, and let ŷi = M(x i ) denote its predicted class.
Error Rate: The error rate is the fraction of incorrect predictions for the classifier
over the testing set, defined as

    Error Rate = (1/n) · Σ_{i=1}^{n} I(yi ≠ ŷi)
[Figure: scatter plot of a testing set of n = 30 points over attributes X1 and X2,
with markers indicating the predicted classes.]

    Error Rate = 8/30 = 0.27        Accuracy = 22/30 = 0.73
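These two measures are straightforward to compute from the true and predicted label sequences. A minimal sketch (the labels below are illustrative, not the data from the figure):

```python
def error_rate(y_true, y_pred):
    """Fraction of points whose predicted label disagrees with the true label."""
    n = len(y_true)
    return sum(1 for y, yhat in zip(y_true, y_pred) if y != yhat) / n

def accuracy(y_true, y_pred):
    """Fraction of correct predictions; equals 1 - error rate."""
    return 1.0 - error_rate(y_true, y_pred)

# Illustrative labels (not the actual figure data):
y_true = ["c1", "c1", "c2", "c2", "c3"]
y_pred = ["c1", "c2", "c2", "c2", "c3"]
print(error_rate(y_true, y_pred))  # 0.2
print(accuracy(y_true, y_pred))    # 0.8
```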
Contingency Table–based Measures
Let N denote the k × k contingency table (or confusion matrix) with counts nij,
where 1 ≤ i, j ≤ k. The count nij denotes the number of points with predicted
class ci whose true label is cj. Thus, nii (for 1 ≤ i ≤ k) denotes the number of
cases where the classifier agrees on the true label ci. The remaining counts nij,
with i ≠ j, are cases where the predicted and true labels disagree.
Accuracy/Precision and Coverage/Recall
The class-specific accuracy or precision of the classifier M for class ci is given as the
fraction of correct predictions over all points predicted to be in class ci:

    acci = preci = nii / mi

where mi is the number of points predicted to be in class ci. The overall accuracy or
precision is the weighted average of the class-specific accuracies:

    Accuracy = Precision = Σ_{i=1}^{k} (mi/n) · acci = (1/n) · Σ_{i=1}^{k} nii

The class-specific coverage or recall of M for class ci is the fraction of correct predictions
over all points in class ci:

    coveragei = recalli = nii / ni

The class-specific F-measure tries to balance the precision and recall values, by
computing their harmonic mean for class ci:

    Fi = 2 / (1/preci + 1/recalli) = (2 · preci · recalli) / (preci + recalli) = 2·nii / (ni + mi)

The overall F-measure of the classifier is the mean of the class-specific values:

    F = (1/k) · Σ_{i=1}^{k} Fi
Contingency Table for Iris: Full Bayes Classifier
                             True
Predicted              Iris-setosa (c1)  Iris-versicolor (c2)  Iris-virginica (c3)
Iris-setosa (c1)             10                  0                    0            m1 = 10
Iris-versicolor (c2)          0                  7                    5            m2 = 12
Iris-virginica (c3)           0                  3                    5            m3 =  8
                           n1 = 10            n2 = 10              n3 = 10         n  = 30

The class-specific precision, recall, and F-measure values are:

    prec1 = n11/m1 = 10/10 = 1.0      recall1 = n11/n1 = 10/10 = 1.0    F1 = 2·n11/(n1 + m1) = 20/20 = 1.0
    prec2 = n22/m2 =  7/12 = 0.583    recall2 = n22/n2 =  7/10 = 0.7    F2 = 2·n22/(n2 + m2) = 14/22 = 0.636
    prec3 = n33/m3 =  5/8  = 0.625    recall3 = n33/n3 =  5/10 = 0.5    F3 = 2·n33/(n3 + m3) = 10/18 = 0.556

The overall accuracy and F-measure are:

    Accuracy = (n11 + n22 + n33)/n = (10 + 7 + 5)/30 = 22/30 = 0.733
    F = (1/3)(1.0 + 0.636 + 0.556) = 2.192/3 = 0.731
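The per-class values above can be checked directly from the contingency table. A small sketch (rows of `N` are predicted classes, columns are true classes, as in the slide):

```python
def per_class_measures(N):
    """Given a k x k contingency table N (rows: predicted, cols: true),
    return lists of per-class precision, recall, and F-measure."""
    k = len(N)
    m = [sum(N[i]) for i in range(k)]                            # predicted as class i
    n_col = [sum(N[i][j] for i in range(k)) for j in range(k)]   # truly in class j
    prec = [N[i][i] / m[i] for i in range(k)]
    recall = [N[i][i] / n_col[i] for i in range(k)]
    F = [2 * N[i][i] / (n_col[i] + m[i]) for i in range(k)]      # harmonic mean
    return prec, recall, F

# Contingency table for the full Bayes classifier on Iris (from the slide):
N = [[10, 0, 0],
     [0, 7, 5],
     [0, 3, 5]]
prec, recall, F = per_class_measures(N)
overall_acc = sum(N[i][i] for i in range(3)) / sum(map(sum, N))
print(round(prec[1], 3), round(recall[1], 1), round(F[1], 3))  # 0.583 0.7 0.636
print(round(overall_acc, 3))  # 0.733
```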
Binary Classification: Positive and Negative Class
When there are only k = 2 classes, we call class c1 the positive class and c2 the
negative class. The entries of the resulting 2 × 2 confusion matrix are
True Class
Predicted Class Positive (c1 ) Negative (c2 )
Positive (c1 ) True Positive (TP) False Positive (FP)
Negative (c2 ) False Negative (FN) True Negative (TN)
Binary Classification: Positive and Negative Class
True Positives (TP): The number of points that the classifier correctly
predicts as positive:

    TP = n11 = |{xi | ŷi = yi = c1}|

True Negatives (TN): The number of points that the classifier correctly
predicts as negative:

    TN = n22 = |{xi | ŷi = yi = c2}|

False Positives (FP): The number of points that the classifier predicts as
positive whose true label is negative:

    FP = n12 = |{xi | ŷi = c1 and yi = c2}|

False Negatives (FN): The number of points that the classifier predicts as
negative whose true label is positive:

    FN = n21 = |{xi | ŷi = c2 and yi = c1}|
Binary Classification: Assessment Measures
Error Rate: The error rate for the binary classification case is given as the
fraction of mistakes (or false predictions):

    Error Rate = (FP + FN)/n

Accuracy: The accuracy is the fraction of correct predictions:

    Accuracy = (TP + TN)/n

Class-specific Precision: The precision for the positive and negative classes is:

    precP = TP/(TP + FP) = TP/m1
    precN = TN/(TN + FN) = TN/m2

Sensitivity (True Positive Rate) and Specificity (True Negative Rate):

    TPR = sensitivity = recallP = TP/(TP + FN) = TP/n1
    TNR = specificity = recallN = TN/(FP + TN) = TN/n2
Binary Classification: Assessment Measures
False Negative Rate: The fraction of positive points misclassified as negative:

    FNR = FN/(TP + FN) = FN/n1 = 1 − sensitivity

False Positive Rate: The fraction of negative points misclassified as positive:

    FPR = FP/(FP + TN) = FP/n2 = 1 − specificity
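In code, all of these measures follow from the four confusion-matrix entries. A sketch, using the TP, FN, FP, TN values of the Iris PC example that appears later in this chapter:

```python
def binary_measures(TP, FN, FP, TN):
    """Compute the binary assessment measures from confusion-matrix entries."""
    n = TP + FN + FP + TN
    n1, n2 = TP + FN, FP + TN   # true positive / negative class sizes
    return {
        "error_rate": (FP + FN) / n,
        "accuracy": (TP + TN) / n,
        "precision_P": TP / (TP + FP),
        "precision_N": TN / (TN + FN),
        "sensitivity": TP / n1,   # TPR = recall on the positive class
        "specificity": TN / n2,   # TNR = recall on the negative class
        "FNR": FN / n1,           # 1 - sensitivity
        "FPR": FP / n2,           # 1 - specificity
    }

# Values from the Iris PC data example: TP = 7, FN = 3, FP = 7, TN = 13
m = binary_measures(TP=7, FN=3, FP=7, TN=13)
print(round(m["error_rate"], 3))   # 0.333
print(m["sensitivity"], m["FPR"])  # 0.7 0.35
```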
Iris Principal Components Data: Naive Bayes Classifier
Iris-versicolor (c1 - circles) and other two Irises (c2 - triangles).
[Figure: scatter plot of the Iris principal components data in the (u1, u2) plane.
The estimated naive Bayes parameters for each class are:

    P̂(c1) = 40/120 = 0.33    μ̂1 = (−0.641, −0.204)    Σ̂1 = diag(0.29, 0.18)
    P̂(c2) = 80/120 = 0.67    μ̂2 = (0.27, 0.14)        Σ̂2 = diag(6.14, 0.206)  ]
The mean (in white) and the contour plot of the normal distribution for each class
are shown; the contours are shown for one and two standard deviations along each
axis.
Iris PC Data: Assessment Measures
                      True
Predicted       Positive (c1)   Negative (c2)
Positive (c1)      TP = 7          FP = 7       m1 = 14
Negative (c2)      FN = 3          TN = 13      m2 = 16
                   n1 = 10         n2 = 20      n  = 30

The naive Bayes classifier misclassifies 10 out of the 30 test instances, resulting in an
error rate and accuracy of

    Error Rate = (FP + FN)/n = 10/30 = 0.33        Accuracy = (TP + TN)/n = 20/30 = 0.67
ROC Analysis
Let S(xi) denote the real-valued score for the positive class output by a classifier
M for the point xi. Let the maximum and minimum score thresholds observed on
testing dataset D be as follows:

    ρ_min = min_i {S(xi)}        ρ_max = max_i {S(xi)}

Initially, we classify all points as negative. Both TP and FP are thus initially zero,
as given in the confusion matrix:
True
Predicted Pos Neg
Pos 0 0
Neg FN TN
This results in TPR and FPR rates of zero, which correspond to the point (0, 0) at
the lower left corner in the ROC plot.
ROC Analysis
Next, for each distinct value of ρ in the range [ρ_min, ρ_max], we tabulate the set of
points predicted as positive, that is, those with score at least ρ:

    R(ρ) = {xi ∈ D : S(xi) ≥ ρ}
and we compute the corresponding true and false positive rates, to obtain a new
point in the ROC plot.
Finally, in the last step, we classify all points as positive. Both FN and TN are
thus zero, as per the confusion matrix
True
Predicted Pos Neg
Pos TP FP
Neg 0 0
resulting in TPR and FPR values of 1. This results in the point (1, 1) at the top
right-hand corner in the ROC plot.
ROC Analysis
An ideal classifier corresponds to the top left point (0, 1), that is, the case
FPR = 0 and TPR = 1: the classifier has no false positives and identifies all true
positives (and, as a consequence, it also correctly predicts all the points in the
negative class). This case is shown in the confusion matrix:
True
Predicted Pos Neg
Pos TP 0
Neg 0 TN
A classifier with a curve closer to the ideal case, that is, closer to the upper left
corner, is a better classifier.
Area Under ROC Curve: The area under the ROC curve, abbreviated AUC, can
be used as a measure of classifier performance. The AUC value is essentially the
probability that the classifier will rank a random positive test case higher than a
random negative test instance.
ROC: Different Cases for 2 × 2 Confusion Matrix
Random Classifier
ROC/AUC Algorithm
The ROC/AUC takes as input the testing set D, and the classifier M.
The first step is to predict the score S(x i ) for the positive class (c1 ) for each test
point x i ∈ D. Next, we sort the (S(x i ), yi ) pairs, that is, the score and the true
class pairs, in decreasing order of the scores
Initially, we set the positive score threshold ρ = ∞. We then examine each pair
(S(xi), yi) in sorted order, and for each distinct value of the score, we set
ρ = S(xi) and plot the point

    (FPR, TPR) = (FP/n2, TP/n1)

As each test point is examined, the true and false positive counts are adjusted
based on the true class yi for the test point xi. If yi = c1, we increment the true
positives; otherwise, we increment the false positives.
ROC/AUC Algorithm
The AUC value is computed as each new point is added to the ROC plot. The
algorithm maintains the previous values of the false and true positives, FP prev and
TP prev , for the previous score threshold ρ.
Given the current FP and TP values, we compute the area under the curve
defined by the four points:

    (x1, y1) = (FP_prev/n2, TP_prev/n1)      (x2, y2) = (FP/n2, TP/n1)
    (x1, 0)  = (FP_prev/n2, 0)               (x2, 0)  = (FP/n2, 0)

These four points define a trapezoid whenever x2 > x1 and y2 > y1; otherwise,
they define a rectangle (which may be degenerate, with zero area).

The area under the trapezoid is given as b · h, where b = |x2 − x1| is the length of
the base of the trapezoid and h = (1/2)(y2 + y1) is the average height of the trapezoid.
Algorithm ROC-Curve
ROC-Curve(D, M):
 1  n1 ← |{xi ∈ D | yi = c1}|   // size of positive class
 2  n2 ← |{xi ∈ D | yi = c2}|   // size of negative class
    // classify, score, and sort all test points
 3  L ← sort the set {(S(xi), yi) : xi ∈ D} by decreasing scores
 4  FP ← TP ← 0
 5  FP_prev ← TP_prev ← 0
 6  AUC ← 0
 7  ρ ← ∞
 8  foreach (S(xi), yi) ∈ L do
 9      if ρ > S(xi) then
10          plot point (FP/n2, TP/n1)
11          AUC ← AUC + Trapezoid-Area((FP_prev/n2, TP_prev/n1), (FP/n2, TP/n1))
12          ρ ← S(xi)
13          FP_prev ← FP
14          TP_prev ← TP
15      if yi = c1 then TP ← TP + 1
16      else FP ← FP + 1
17  plot point (FP/n2, TP/n1)
18  AUC ← AUC + Trapezoid-Area((FP_prev/n2, TP_prev/n1), (FP/n2, TP/n1))
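A direct Python transcription of this pseudocode (a sketch; `pos` names the positive-class label, and ties in scores are handled by the distinct-threshold test, as in line 9):

```python
def roc_auc(scores, labels, pos="c1"):
    """ROC points and AUC via the threshold-sweep algorithm sketched above.
    scores[i] is S(x_i), the positive-class score; labels[i] is the true class."""
    n1 = sum(1 for y in labels if y == pos)   # positive class size
    n2 = len(labels) - n1                     # negative class size
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    FP = TP = FP_prev = TP_prev = 0
    auc, points, rho = 0.0, [], float("inf")

    def trapezoid():
        # base times average height, as in the text
        base = (FP - FP_prev) / n2
        height = (TP / n1 + TP_prev / n1) / 2
        return base * height

    for s, y in pairs:
        if rho > s:  # new distinct threshold: emit a point, accumulate area
            points.append((FP / n2, TP / n1))
            auc += trapezoid()
            rho, FP_prev, TP_prev = s, FP, TP
        if y == pos:
            TP += 1
        else:
            FP += 1
    points.append((FP / n2, TP / n1))  # final point (1, 1)
    auc += trapezoid()
    return points, auc

points, auc = roc_auc([0.9, 0.8, 0.7, 0.6], ["c1", "c2", "c1", "c2"])
print(auc)  # 0.75
```

The value 0.75 agrees with the pairwise interpretation of AUC: of the four (positive, negative) pairs, three are ranked correctly.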
Algorithm Trapezoid-Area
Iris PC Data: ROC Analysis
We use the naive Bayes classifier to compute the probability that each test point
belongs to the positive class (c1 ; iris-versicolor).
The score of the classifier for test point x i is therefore S(x i ) = P(c1 |x i ). The
sorted scores (in decreasing order) along with the true class labels are as follows:
S(x i ) 0.93 0.82 0.80 0.77 0.74 0.71 0.69 0.67 0.66 0.61
yi c2 c1 c2 c1 c1 c1 c2 c1 c2 c2
S(x i ) 0.59 0.55 0.55 0.53 0.47 0.30 0.26 0.11 0.04 2.97e-03
yi c2 c2 c1 c1 c1 c1 c1 c2 c2 c2
ROC Plot for Iris PC Data
AUC for naive Bayes is 0.775, whereas the AUC for the random classifier (ROC
plot in grey) is 0.5.

[Figure: ROC plot (True Positive Rate vs. False Positive Rate) for the naive Bayes
classifier on the Iris PC data, with the random classifier's ROC shown in grey.]
ROC Plot and AUC: Trapezoid Region
Consider the following sorted scores, along with the true class, for some testing
dataset with n = 5, n1 = 3 and n2 = 2.
The ROC-Curve algorithm yields the following points, which are added to the ROC
plot along with the running AUC:
ROC Plot and AUC: Trapezoid Region
[Figure: ROC plot (True Positive Rate vs. False Positive Rate) for this example,
highlighting the trapezoid region that contributes to the AUC.]
Classifier Evaluation
K -fold Cross-Validation
Cross-validation divides the dataset D into K equal-sized parts, called folds,
namely D 1 , D 2 , . . ., D K .
Each fold D i is, in turn, treated as the
S testing set, with the remaining folds
comprising the training set D \ D i = j 6=i D j .
After training the model Mi on D \ D i , we assess its performance on the testing
set D i to obtain the i-th estimate θi .
The expected value of the performance measure can then be estimated as

    μ̂θ = E[θ] = (1/K) · Σ_{i=1}^{K} θi
Cross-Validation(K, D):
 1  D ← randomly shuffle D
 2  {D1, D2, . . . , DK} ← partition D into K equal parts
 3  foreach i ∈ [1, K] do
 4      Mi ← train classifier on D \ Di
 5      θi ← assess Mi on Di
 6  μ̂θ = (1/K) · Σ_{i=1}^{K} θi
 7  σ̂θ² = (1/K) · Σ_{i=1}^{K} (θi − μ̂θ)²
 8  return μ̂θ, σ̂θ²
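A sketch of the procedure in Python, with `train` and `assess` passed in as functions (hypothetical placeholders for an actual classifier and performance measure):

```python
import random

def cross_validation(K, D, train, assess):
    """K-fold cross-validation sketch. D is a list of labeled points;
    train(trainset) returns a model; assess(model, testset) returns a
    performance value theta_i (e.g. the error rate)."""
    D = D[:]
    random.shuffle(D)
    folds = [D[i::K] for i in range(K)]   # K roughly equal parts
    thetas = []
    for i in range(K):
        test = folds[i]
        trainset = [x for j, f in enumerate(folds) if j != i for x in f]
        thetas.append(assess(train(trainset), test))
    mu = sum(thetas) / K
    var = sum((t - mu) ** 2 for t in thetas) / K
    return mu, var

# Example with a dummy classifier that always predicts class "a":
D = [(i, "a") for i in range(20)]
mu, var = cross_validation(5, D,
                           train=lambda S: (lambda x: "a"),
                           assess=lambda M, T: sum(M(x) != y for x, y in T) / len(T))
print(mu, var)  # 0.0 0.0
```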
2D Iris Dataset: K -fold Cross-Validation
Consider the 2D Iris dataset with k = 3 classes. We assess the error rate of the
full Bayes classifier via 5-fold cross-validation, obtaining an error rate on each
of the five folds. The mean and variance for the error rate are as follows:

    μ̂θ = 1.167/5 = 0.233        σ̂θ² = 0.00833
Performing ten 5-fold cross-validation runs for the Iris dataset results in the mean
of the expected error rate as 0.232, and the mean of the variance as 0.00521, with
the variance in both these estimates being less than 10−3 .
Bootstrap Resampling
The bootstrap method draws K random samples of size n with replacement from
D. Each sample Di is thus the same size as D, and typically contains repeated points.
The probability that a particular point xj is not selected even after n tries is given
as

    P(xj ∉ Di) = q = (1 − 1/n)^n ≃ e^{−1} = 0.368

which implies that each bootstrap sample contains approximately 63.2% of the
points from D.
The bootstrap samples can be used to evaluate the classifier by training it on each
of samples D i and then using the full input dataset D as the testing set.
However, the estimated mean and variance of θ will be somewhat optimistic owing
to the fairly large overlap between the training and testing datasets (63.2%).
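The limit (1 − 1/n)^n → e^{−1} can be checked numerically:

```python
import math

# Probability that a fixed point is never drawn in n tries with replacement:
# q = (1 - 1/n)^n, which approaches e^{-1} ~ 0.368 as n grows.
for n in (10, 100, 1000):
    q = (1 - 1 / n) ** n
    print(n, round(q, 4))

print(round(math.exp(-1), 4))  # 0.3679
```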
Bootstrap Resampling Algorithm
Bootstrap-Resampling(K, D):
 1  for i ∈ [1, K] do
 2      Di ← sample of size n with replacement from D
 3      Mi ← train classifier on Di
 4      θi ← assess Mi on D
 5  μ̂θ = (1/K) · Σ_{i=1}^{K} θi
 6  σ̂θ² = (1/K) · Σ_{i=1}^{K} (θi − μ̂θ)²
 7  return μ̂θ, σ̂θ²
Iris 2D Data: Bootstrap Resampling using Error Rate
We apply bootstrap sampling to estimate the error rate for the full Bayes
classifier, using K = 50 samples. The sampling distribution of error rates is:
[Figure: histogram of the bootstrap estimates of the error rate; the K = 50
estimates range from about 0.18 to 0.27.]
Confidence Intervals
We would like to derive confidence bounds on how much the estimated mean and
variance may deviate from the true value.
To answer this question we make use of the central limit theorem, which states
that the sum of a large number of independent and identically distributed (IID)
random variables has approximately a normal distribution, regardless of the
distribution of the individual random variables.
Let θ1 , θ2 , . . . , θK be a sequence of IID random variables, representing, for example,
the error rate or some other performance measure over the K -folds in
cross-validation or K bootstrap samples.
Assume that each θi has a finite mean E [θi ] = µ and finite variance var (θi ) = σ 2 .
Let μ̂ denote the sample mean:

    μ̂ = (1/K)(θ1 + θ2 + · · · + θK)
Confidence Intervals
By linearity of expectation, we have

    E[μ̂] = E[(1/K)(θ1 + θ2 + · · · + θK)] = (1/K) · Σ_{i=1}^{K} E[θi] = (1/K)(Kμ) = μ

The variance of μ̂ is given as

    var(μ̂) = (1/K²) · var(θ1 + θ2 + · · · + θK) = (1/K²) · Σ_{i=1}^{K} var(θi) = (1/K²)(Kσ²) = σ²/K

Thus, the standard deviation of μ̂ is given as

    std(μ̂) = √var(μ̂) = σ/√K

We are interested in the distribution of the z-score of μ̂, which is itself a random
variable:

    ZK = (μ̂ − E[μ̂]) / std(μ̂) = (μ̂ − μ) / (σ/√K) = √K · (μ̂ − μ)/σ

ZK specifies the deviation of the estimated mean from the true mean in terms of
its standard deviation.
Confidence Intervals
The central limit theorem states that as the sample size increases, the random
variable ZK converges in distribution to the standard normal distribution (which
has mean 0 and variance 1). That is, as K → ∞, for any x ∈ R, we have

    lim_{K→∞} P(ZK ≤ x) = Φ(x)

where Φ(x) is the cumulative distribution function for the standard normal density
function f(x|0, 1).

Let z_{α/2} denote the z-score value that encompasses α/2 of the probability mass
in the right tail of a standard normal distribution, that is,

    P(ZK ≥ z_{α/2}) = 1 − Φ(z_{α/2}) = α/2

then, because the normal distribution is symmetric about the mean, we have

    P(−z_{α/2} ≤ ZK ≤ z_{α/2}) = 1 − α
Confidence Intervals
Note that

    −z_{α/2} ≤ ZK ≤ z_{α/2}  implies that  μ̂ − z_{α/2}·σ/√K ≤ μ ≤ μ̂ + z_{α/2}·σ/√K

We thus obtain bounds on the value of the true mean μ in terms of the estimated
value μ̂:

    lim_{K→∞} P(μ̂ − z_{α/2}·σ/√K ≤ μ ≤ μ̂ + z_{α/2}·σ/√K) = 1 − α
Confidence Intervals: Unknown Variance
The true variance σ 2 is usually unknown, but we may use the sample variance:
    σ̂² = (1/K) · Σ_{i=1}^{K} (θi − μ̂)²

Then (μ̂ − z_{α/2}·σ̂/√K, μ̂ + z_{α/2}·σ̂/√K) is the 100(1 − α)% confidence interval for μ.
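A sketch of the large-sample interval, using `statistics.NormalDist` from the standard library to obtain z_{α/2}:

```python
from statistics import NormalDist

def normal_ci(thetas, alpha=0.05):
    """Large-sample 100(1-alpha)% confidence interval for the true mean,
    using the sample mean and (biased) sample variance as in the slides."""
    K = len(thetas)
    mu = sum(thetas) / K
    sigma = (sum((t - mu) ** 2 for t in thetas) / K) ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}, e.g. 1.96 for alpha=0.05
    half = z * sigma / K ** 0.5
    return mu - half, mu + half

print(normal_ci([0.2, 0.3]))
```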
2D Iris Data: Confidence Intervals
Consider the 5-fold cross-validation (K = 5) to assess the full Bayes classifier:
    μ̂θ = 0.233    σ̂θ² = 0.00833    σ̂θ = √0.00833 = 0.0913

Let 1 − α = 0.95 be the confidence level and α = 0.05 the significance level.
The standard normal distribution has 95% of the probability density within
z_{α/2} = 1.96 standard deviations from the mean:

    P(μ ∈ (μ̂θ − z_{α/2}·σ̂θ/√K, μ̂θ + z_{α/2}·σ̂θ/√K)) = 0.95

With 95% confidence, the true expected error rate lies in the interval
(0.153, 0.313). If α = 0.01, then z_{α/2} = 2.58 and
z_{α/2}·σ̂θ/√K = 2.58 × 0.0913/√5 = 0.105, and the interval becomes (0.128, 0.338).
Confidence Intervals: Small Sample Size
The confidence interval above applies only when the sample size K → ∞. However, in
practice, for K-fold cross-validation or bootstrap resampling, K is small.

In the small sample case, instead of the normal density, we use the Student's t
distribution to derive the confidence interval.

In particular, we choose the value t_{α/2,K−1} such that α/2 of the probability mass
of the t distribution with K − 1 degrees of freedom lies to its right, that is,

    P(Z*K ≥ t_{α/2,K−1}) = 1 − T_{K−1}(t_{α/2,K−1}) = α/2

where T_{K−1} is the cumulative distribution function for the Student's t distribution
with K − 1 degrees of freedom.

The 100(1 − α)% confidence interval for the true mean μ is thus

    μ̂ − t_{α/2,K−1}·σ̂/√K ≤ μ ≤ μ̂ + t_{α/2,K−1}·σ̂/√K
Note the dependence of the interval on both α and the sample size K .
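A sketch of the small-sample interval. The critical value is passed in directly (2.776 is the table value for α = 0.05 with 4 degrees of freedom); with SciPy available it could instead be computed as `scipy.stats.t.ppf(1 - alpha/2, K - 1)`:

```python
def t_ci(mu, sigma, K, t_crit):
    """100(1-alpha)% t-based confidence interval for the true mean, given the
    sample mean mu, sample standard deviation sigma, sample size K, and the
    critical value t_crit = t_{alpha/2, K-1} (supplied by the caller)."""
    half = t_crit * sigma / K ** 0.5
    return mu - half, mu + half

# Values from the 2D Iris cross-validation example:
lo, hi = t_ci(mu=0.233, sigma=0.0913, K=5, t_crit=2.776)
print(round(lo, 3), round(hi, 3))  # 0.12 0.346
```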
Student’s t Distribution: K Degrees of Freedom
[Figure: density of the Student's t distribution with 1, 4, and 10 degrees of freedom
(t(1), t(4), t(10)), together with the standard normal density f(x|0, 1); with more
degrees of freedom, the t density approaches the standard normal.]
Iris 2D Data: Small Sample Confidence Intervals
Due to the small sample size (K = 5), we can get a better confidence interval by
using the t distribution. For K − 1 = 4 degrees of freedom and 1 − α = 0.95, we get
t_{α/2,K−1} = 2.776. Thus,

    t_{α/2,K−1}·σ̂θ/√K = 2.776 × 0.0913/√5 = 0.113

The 95% confidence interval is therefore (0.233 − 0.113, 0.233 + 0.113) = (0.120, 0.346),
which is much wider than the overly optimistic confidence interval (0.153, 0.313)
obtained for the large sample case.

For 1 − α = 0.99, t_{α/2,K−1} = 4.604 and t_{α/2,K−1}·σ̂θ/√K = 4.604 × 0.0913/√5 = 0.188,
so the 99% confidence interval is (0.045, 0.421). This is also much wider than the
99% confidence interval (0.128, 0.338) obtained for the large sample case.
Comparing Classifiers: Paired t-Test
How can we test for a significant difference in the classification performance of
two alternative classifiers, M A and M B, on a given dataset D?
We can apply K -fold cross-validation (or bootstrap resampling) and tabulate their
performance over each of the K folds, with identical folds for both classifiers.
That is, we perform a paired test, with both classifiers trained and tested on the
same data.
Let θ1A , θ2A , . . . , θKA and θ1B , θ2B , . . . , θKB denote the performance values for MA and
MB , respectively. To determine if the two classifiers have different or similar
performance, define the random variable δi as the difference in their performance
on the ith dataset:
δi = θiA − θiB
The expected difference and the variance estimates are given as:

    μ̂δ = (1/K) · Σ_{i=1}^{K} δi        σ̂δ² = (1/K) · Σ_{i=1}^{K} (δi − μ̂δ)²
Comparing Classifiers: Paired t-Test
The null hypothesis H0 is that the performance of M A and M B is the same. The
alternative hypothesis Ha is that they are not the same, that is:
    H0: μδ = 0        Ha: μδ ≠ 0

Define the z-score random variable for the estimated expected difference as

    Z*δ = √K · (μ̂δ − μδ)/σ̂δ

Z*δ follows a t distribution with K − 1 degrees of freedom. However, under the null
hypothesis we have μδ = 0, and thus

    Z*δ = √K · μ̂δ/σ̂δ ∼ t_{K−1}
2D Iris Dataset: Paired t-Test
We compare, via error rate, the naive Bayes (M A ) with the full Bayes (M B )
classifier via cross-validation using K = 5.
      i      1        2        3        4        5
    θiA    0.233    0.267    0.1      0.4      0.3
    θiB    0.2      0.2      0.167    0.333    0.233
    δi     0.033    0.067   −0.067    0.067    0.067

The estimated mean and variance of the differences are μ̂δ = 0.167/5 = 0.033 and
σ̂δ² ≈ 0.0027 (σ̂δ ≈ 0.052), which give the test statistic

    Z*δ = √5 × 0.033/0.052 ≈ 1.44

Since 1.44 < t_{α/2,K−1} = 2.776 at the 95% confidence level, we accept the null
hypothesis, that is, there is no significant difference between the naive and full
Bayes classifier for this dataset.
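The test statistic can be verified directly from the tabulated per-fold error rates:

```python
def paired_t_statistic(theta_A, theta_B):
    """Paired test statistic Z* = sqrt(K) * mu_delta / sigma_delta for the
    per-fold performance values of two classifiers on identical folds."""
    K = len(theta_A)
    delta = [a - b for a, b in zip(theta_A, theta_B)]
    mu = sum(delta) / K
    sigma = (sum((d - mu) ** 2 for d in delta) / K) ** 0.5
    return K ** 0.5 * mu / sigma

# Per-fold error rates from the slide (naive Bayes vs. full Bayes):
z = paired_t_statistic([0.233, 0.267, 0.1, 0.4, 0.3],
                       [0.2, 0.2, 0.167, 0.333, 0.233])
print(round(z, 2))  # 1.44, below the t critical value 2.776: accept H0
```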
Bias-Variance Decomposition
The zero-one loss assigns a cost of zero if the prediction is correct, and one
otherwise:

    L(y, M(x)) = I(y ≠ M(x))

Another commonly used loss function is the squared loss, defined as

    L(y, M(x)) = (y − M(x))²

where we assume that the classes are discrete valued, and not categorical.
Expected Loss
An ideal or optimal classifier is the one that minimizes the loss function. Because
the true class is not known for a test case x, the goal of learning a classification
model can be cast as minimizing the expected loss:

    E_y[L(y, M(x)) | x] = Σ_y L(y, M(x)) · P(y|x)

where P(y|x) is the conditional probability of class y given test point x, and E_y
denotes that the expectation is taken over the different class values y.

Minimizing the expected zero-one loss corresponds to minimizing the error rate.
Let M(x) = ci; then we have

    E_y[L(y, M(x)) | x] = Σ_y I(y ≠ ci) · P(y|x) = Σ_{y ≠ ci} P(y|x) = 1 − P(ci|x)

The expected loss for the squared loss function offers important insight into the
classification problem because it can be decomposed into bias and variance terms.

Intuitively, the bias of a classifier refers to the systematic deviation of its predicted
decision boundary from the true decision boundary, whereas the variance of a classifier
refers to the deviation among the learned decision boundaries over different training sets.

Because M depends on the training set, given a test point x, we denote its predicted
value as M(x, D). Consider the expected squared loss:

    E_y[L(y, M(x, D)) | x, D] = E_y[(y − M(x, D))² | x, D]
                              = E_y[(y − E_y[y|x])² | x, D] + (M(x, D) − E_y[y|x])²

The first term is simply the variance of y given x, var(y|x). The second term is the
squared error between the predicted value M(x, D) and the expected value E_y[y|x].
Bias and Variance
The squared error depends on the training set. We can eliminate this dependence by
averaging over all possible training sets of size n. The average or expected squared error
for a given test point x over all training sets is then given as

    E_D[(M(x, D) − E_y[y|x])²] = E_D[(M(x, D) − E_D[M(x, D)])²]  +  (E_D[M(x, D)] − E_y[y|x])²
                                        (variance)                         (bias)

The expected squared loss over all test points x and over all training sets D of size n
yields the following decomposition:

    E_{x,D,y}[(y − M(x, D))²] = E_{x,y}[(y − E_y[y|x])²]                (noise)
                              + E_{x,D}[(M(x, D) − E_D[M(x, D)])²]     (average variance)
                              + E_x[(E_D[M(x, D)] − E_y[y|x])²]        (average bias)

The expected squared loss over all test points and training sets can thus be decomposed
into three terms: noise, average bias, and average variance.
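The decomposition can be illustrated with a small simulation (an illustrative setup, not an example from the text): the model predicts the training-set mean of y, the true function is f(x) = x, and we estimate the variance and squared bias of the prediction at a fixed test point over many training sets:

```python
import random

random.seed(0)
f = lambda x: x                       # true function, so E[y|x] = x

def sample_training_set(n=20):
    """Draw a fresh training set: x uniform on [0, 1], y = f(x) + Gaussian noise."""
    xs = [random.uniform(0, 1) for _ in range(n)]
    return [(x, f(x) + random.gauss(0, 0.1)) for x in xs]

x0 = 0.8                              # fixed test point
preds = []                            # M(x0, D) over many training sets D
for _ in range(2000):
    D = sample_training_set()
    mean_y = sum(y for _, y in D) / len(D)   # constant-predictor "model"
    preds.append(mean_y)

expected_pred = sum(preds) / len(preds)      # estimate of E_D[M(x0, D)]
variance = sum((p - expected_pred) ** 2 for p in preds) / len(preds)
bias_sq = (expected_pred - f(x0)) ** 2
print(round(bias_sq, 3), round(variance, 4))
```

The constant predictor centers near 0.5 regardless of the training set, so at x0 = 0.8 it shows large squared bias (about 0.09) and small variance: an oversimple model behaves exactly as the decomposition predicts.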
Bias and Variance
The noise term is the average variance var (y |x) over all test points x. It contributes a
fixed cost to the loss independent of the model, and can thus be ignored when
comparing different classifiers.
The classifier specific loss can then be attributed to the variance and bias terms. Bias
indicates whether the model M is correct or incorrect.
If the decision boundary is nonlinear, and we use a linear classifier, then it is likely to
have high bias. A nonlinear (or a more complex) classifier is more likely to capture the
correct decision boundary, and is thus likely to have a low bias.
The complex classifier is not necessarily better, since we must also consider the
variance term, which measures the inconsistency of the classifier's decisions. A complex
classifier induces a more complex decision boundary and thus may be prone to
overfitting, that is, it may be susceptible to small changes in the training set, which can
result in high variance.
In general, the expected loss can be attributed to high bias or high variance, typically a
trade-off between the two. We prefer a classifier with an acceptable bias and as low a
variance as possible.
Bias-variance Decomposition: SVM Quadratic Kernels
Iris PC Data: Iris-versicolor (c1 -circles) and other two Irises (c2 - triangles).
K = 10 Bootstrap samples, trained via SVMs, varying the regularization constant C from
10−2 to 102 . The decision boundaries over the 10 samples were as follows:
[Figure: the decision boundaries learned over the 10 bootstrap samples, plotted in
the (u1, u2) plane for two settings of C.]
A small value of C emphasizes the margin, whereas a large value of C tries to minimize
the slack terms.
Bias-variance Decomposition: SVM Quadratic Kernels
[Figure: (c) the decision boundaries for C = 100; (d) the bias-variance trade-off,
plotting the expected loss, bias, and variance as C varies from 10^{-2} to 10^{2}.]
Ensemble Classifiers
Bagging
Bagging stands for Bootstrap Aggregation. It is an ensemble classification method
that employs multiple bootstrap samples (with replacement) from the input
training data D to create slightly different training sets D i , i = 1, 2, . . . , K .
Different base classifiers Mi are learned, with Mi trained on D i .
Given any test point x, it is first classified using each of the K base classifiers,
Mi. Let the number of classifiers that predict the class of x as cj be given as

    vj(x) = |{Mi(x) = cj | i = 1, . . . , K}|

The combined classifier predicts the class with the maximum number of votes:

    M^K(x) = arg max_{cj} {vj(x) | j = 1, . . . , k}
Bagging can help reduce the variance, especially if the base classifiers are
unstable, due to the averaging effect of majority voting. It does not, in general,
have much effect on the bias.
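Majority voting itself is only a few lines. A sketch with toy base classifiers standing in for models trained on different bootstrap samples (the thresholds are made up for illustration):

```python
from collections import Counter

def bagging_predict(classifiers, x):
    """Majority vote of the base classifiers M_1, ..., M_K on point x."""
    votes = Counter(M(x) for M in classifiers)
    return votes.most_common(1)[0][0]

# Toy base "classifiers": each was (hypothetically) trained on a different
# bootstrap sample, so they disagree slightly near the decision boundary.
classifiers = [
    lambda x: "c1" if x < 0.48 else "c2",
    lambda x: "c1" if x < 0.52 else "c2",
    lambda x: "c1" if x < 0.50 else "c2",
]
print(bagging_predict(classifiers, 0.49))  # c1 (wins 2 votes to 1)
```

Near the boundary the individual decisions flip, but the vote smooths them out, which is the variance-reduction effect described above.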
Bagging: Combined SVM Classifiers
SVM classifiers are trained on K = 10 bootstrap samples of the Iris PCA dataset
using C = 1.
Bagging: Combined SVM Classifiers
2 2
uT uT uT uT
uT uT uT uT uT uT
1 uT
uT Tu uT uT
1 uT
uT Tu uT
uT Tu bC uT uT uT Tu bC uuTT uT uT
uT uT uT uT uT uT
uT Tu Tu uT Tu Tu uT uT uT uT
bC bC bC
bC Tu uT uT uTuT uT uT uT uT Tu Tu uT TuTu uT uT uT uT
bC bC bC
bC Tu uT uT uT uT uT
uT uT uT Tu Tu Tu bC bC bC bC bC Cb bC uT uT uT uT uT uT uT Tu Tu Tu bC bC bC bC bC Cb bC uT uT uT uT
uT
uT uT uT bC Cb uT uT uT uT uT uT uT Tu uT
uT uT Tu bC Cb uT uT uT uT uT uT uT Tu
uT uT uT uT
0 uT uT uT uT uT Tu uT uT uT uT bC bC bC bC bC Cb bC bC Cb
uT uT uT uT uT
uT uT uTuT uT uT 0 uT uT uT uT uT Tu uT uT uT uT bC bC bC bC bC Cb bC bC Cb
uT uT uT uT uT
uT uT uTuT uT uT
uT Tu bC bC bC bC Cb bC bC bC bC Cb uT Tu uT Tu bC bC bC bC Cb bC bC bC bC Cb uT Tu
uT uT uT uT bC bC bC Cb bC bC Cb uT uT Tu uT uT uT Tu bC bC bC Cb bC bC Cb uT uT Tu
uT uT bC bC Cb bC bC bC uT uT bC bC Cb bC bC bC
Cb uT Cb uT
−1 bC bC −1 bC bC
uT bC uT bC
−2 −2
−3 u 1 −3 u1
−4 −3 −2 −1 0 1 2 3 −4 −3 −2 −1 0 1 2 3
The worst training performance is obtained for K = 3 (in thick gray) and the best for
K = 8 (in thick black).
Random Forest
Random Forest Algorithm
RandomForest(D, K , p, η, π):
1  foreach x i ∈ D do
2      vj (x i ) ← 0, for all j = 1, 2, . . . , k
3  for t ∈ [1, K ] do
4      D t ← sample of size n with replacement from D
5      Mt ← DecisionTree(D t , η, π, p)
6      foreach (x i , yi ) ∈ D \ D t do // out-of-bag votes
7          ŷi ← Mt (x i )
8          if ŷi = cj then vj (x i ) ← vj (x i ) + 1
9  ǫoob = (1/n) · Σi=1..n I(yi ≠ arg maxcj {vj (x i ) | (x i , yi ) ∈ D}) // OOB error
10 return {M1 , M2 , . . . , MK }
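The out-of-bag estimation loop can be sketched as follows. This is only a sketch of the OOB bookkeeping: the decision tree is replaced by a 1-nearest-neighbor stand-in so the example stays self-contained, and all names are ours, not the book's.

```python
import numpy as np

rng = np.random.default_rng(1)

def one_nn(Xtr, ytr):
    # stand-in for DecisionTree(D_t, eta, pi, p) on 1-D inputs
    return lambda Q: ytr[np.argmin(np.abs(Q[:, None] - Xtr[None, :]), axis=1)]

def oob_error(X, y, K=25, n_classes=2):
    n = len(y)
    votes = np.zeros((n, n_classes))              # v_j(x_i) <- 0
    for _ in range(K):
        idx = rng.integers(0, n, size=n)          # D_t: bootstrap of size n
        model = one_nn(X[idx], y[idx])
        oob = np.setdiff1d(np.arange(n), idx)     # points in D \ D_t
        for i, pred in zip(oob, model(X[oob])):
            votes[i, int(pred)] += 1              # out-of-bag vote for class pred
    # epsilon_oob: fraction of points whose OOB majority vote disagrees with y_i
    voted = votes.sum(axis=1) > 0                 # points left out at least once
    return np.mean(votes[voted].argmax(axis=1) != y[voted])
```

Each point is out-of-bag for roughly a third of the bootstrap samples, so every point typically accumulates several votes even for moderate K.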
Random Forest - Out of bag estimation
Iris PC Dataset: Random Forest
The task is to separate Iris-versicolor (c1 - circles) from the other two Irises
(c2 - triangles).
Since d = 2, we pick p = 1 attribute for each split-point evaluation.
We use η = 3 (maximum leaf size) and minimum purity π = 1.0.
We grow K = 5 decision trees on different bootstrap samples.
Iris PC Dataset: Random Forest
Decision boundary is shown in bold.
The training error rate is 2.0%, but OOB error rate is 49.33%, which is overly
pessimistic in this case, since the dataset has only two attributes, and we use only
one attribute to evaluate each split point.
[Figure: random forest (K = 5) decision boundary, shown in bold, on the Iris PC dataset; axes u1 and u2.]
Random Forest: Varying K
We used the full Iris dataset which has four attributes (d = 4), and three classes
(k = 3).
We employed p = √d = 2, η = 3 and π = 1.0.
K     ǫoob      ǫ (training)
1 0.4333 0.0267
2 0.2933 0.0267
3 0.1867 0.0267
4 0.1200 0.0400
5 0.1133 0.0333
6 0.1067 0.0400
7 0.0733 0.0333
8 0.0600 0.0267
9 0.0467 0.0267
10 0.0467 0.0267
We can see that the OOB error decreases as we increase the number of trees.
Boosting
In boosting, the main idea is to carefully select the training samples in each
round so as to boost the performance on hard-to-classify instances.
Starting from an initial training sample D 1 , we train the base classifier M1 , and
obtain its training error rate.
To construct the next sample D 2 , we select the misclassified instances with higher
probability, and after training M2 , we obtain its training error rate.
To construct D 3 , those instances that are hard to classify by M1 or M2 , have a
higher probability of being selected. This process is repeated for K iterations.
Finally, the combined classifier is obtained via weighted voting over the output of
the K base classifiers M1 , M2 , . . . , MK .
Boosting
Boosting is most beneficial when the base classifiers are weak, that is, have an
error rate that is slightly less than that for a random classifier.
The idea is that whereas M1 may not be particularly good on all test instances, by
design M2 may help classify some cases where M1 fails, and M3 may help classify
instances where M1 and M2 fail, and so on. Thus, boosting has more of a bias
reducing effect.
Each of the weak learners is likely to have high bias (it is only slightly better than
random guessing), but the final combined classifier can have much lower bias,
since different weak learners learn to classify instances in different regions of the
input space.
Adaptive Boosting: AdaBoost
AdaBoost repeats the boosting process K times. Let t denote the iteration and
let αt denote the weight for the tth classifier Mt .
Let wit denote the weight for x i , with w t = (w1t , w2t , . . . , wnt )T being the weight
vector over all the points for the tth iteration.
w t is a probability vector, whose elements sum to one. Initially all points have
equal weights, that is,
w 0 = (1/n, 1/n, . . . , 1/n)T = (1/n) · 1
Adaptive Boosting: AdaBoost
The weight for the tth classifier is then set as
αt = ln((1 − ǫt )/ǫt )
The weight for each point x i ∈ D is updated as
wit = wit−1 · exp{αt · I(Mt (x i ) ≠ yi )}
If the predicted class matches the true class, that is, if Mt (x i ) = yi , then the
weight for point x i remains unchanged.
If the point is misclassified, that is, Mt (x i ) ≠ yi , then
wit = wit−1 · exp(αt ) = wit−1 · exp(ln((1 − ǫt )/ǫt )) = wit−1 · (1 − ǫt )/ǫt
Thus, if the error rate ǫt is small, then there is a greater weight increment for x i .
The intuition is that a point that is misclassified by a good classifier (with a low
error rate) should be more likely to be selected for the next training dataset.
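The update can be made concrete with a small worked example (the numbers here are hypothetical, not from the slides): a base classifier with weighted error ǫt = 0.2 gets weight αt = ln(0.8/0.2) = ln 4, and any point it misclassifies has its weight multiplied by 4 before renormalization.

```python
import math

eps_t = 0.2
alpha_t = math.log((1 - eps_t) / eps_t)   # classifier weight: ln(0.8/0.2) = ln 4

n = 30
w_prev = 1 / n                            # previous weight of some point x_i

# correctly classified point: indicator is 0, so the weight is unchanged
w_correct = w_prev * math.exp(alpha_t * 0)

# misclassified point: indicator is 1, weight scaled by (1 - eps_t)/eps_t = 4
w_wrong = w_prev * math.exp(alpha_t * 1)

print(round(alpha_t, 4))                  # 1.3863
print(round(w_wrong / w_prev, 4))         # 4.0
```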
Adaptive Boosting: AdaBoost
For boosting we require that a base classifier has an error rate at least slightly
better than random guessing, that is, ǫt < 0.5. If the error rate ǫt ≥ 0.5, then the
boosting method discards the classifier, and tries another data sample.
Combined Classifier: Given the set of boosted classifiers, M1 , M2 , . . . , MK , along
with their weights α1 , α2 , . . . , αK , the class for a test case x is obtained via
weighted majority voting.
Let vj (x) denote the weighted vote for class cj over the K classifiers, given as
vj (x) = Σt=1..K αt · I(Mt (x) = cj )
Because I(Mt (x) = cj ) is 1 only when Mt (x) = cj , the variable vj (x) simply
obtains the tally for class cj among the K base classifiers, taking into account the
classifier weights. The combined classifier, denoted M K , then predicts the class
for x as follows:
M K (x) = arg maxcj {vj (x) | j = 1, . . . , k}
AdaBoost Algorithm
AdaBoost(K , D):
1  w 0 ← (1/n) · 1 ∈ Rn
2  t ← 1
3  while t ≤ K do
5      D t ← weighted resampling with replacement from D using w t−1
6      Mt ← train classifier on D t
7      ǫt ← Σi=1..n wit−1 · I(Mt (x i ) ≠ yi ) // weighted error rate on D
8      if ǫt = 0 then break
9      else if ǫt < 0.5 then
10         αt = ln((1 − ǫt )/ǫt ) // classifier weight
11         foreach i ∈ [1, n] do // update point weights
12             wit = wit−1 if Mt (x i ) = yi , and wit = wit−1 · (1 − ǫt )/ǫt if Mt (x i ) ≠ yi
14         w t = w t /(1T w t ) // normalize weights
15     t ← t + 1
16 return {M1 , M2 , . . . , MK }
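A minimal AdaBoost sketch following the pseudocode above, under some simplifying assumptions that are ours, not the book's: labels are in {−1, +1}, the base learners are one-dimensional decision stumps, and instead of weighted resampling the stump minimizes the weighted error directly (a common equivalent variant). All function names are illustrative.

```python
import numpy as np

def fit_stump(X, y, w):
    # pick the (feature, threshold, sign) stump with lowest weighted error
    n, d = X.shape
    best = (0, 0.0, 1, np.inf)                    # (feature, threshold, sign, error)
    for j in range(d):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = np.sum(w * (pred != y))
                if err < best[3]:
                    best = (j, thr, sign, err)
    return best

def stump_predict(stump, X):
    j, thr, sign, _ = stump
    return np.where(X[:, j] <= thr, sign, -sign)

def adaboost(X, y, K=10):
    n = len(y)
    w = np.full(n, 1.0 / n)                       # w^0 = (1/n) * 1
    models, alphas = [], []
    for _ in range(K):
        stump = fit_stump(X, y, w)
        eps = stump[3]                            # weighted error rate on D
        if eps == 0:
            models.append(stump); alphas.append(1.0); break
        if eps >= 0.5:
            break                                 # no weak learner better than random
        alpha = np.log((1 - eps) / eps)           # classifier weight
        pred = stump_predict(stump, X)
        w = w * np.exp(alpha * (pred != y))       # boost misclassified points
        w = w / w.sum()                           # normalize to a probability vector
        models.append(stump); alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    # weighted majority vote: sign of sum_t alpha_t * M_t(x)
    votes = sum(a * stump_predict(m, X) for m, a in zip(models, alphas))
    return np.sign(votes)
```

On separable 1-D data a single stump already drives the weighted error to zero, so the loop exits early; on harder data the weight updates force later stumps toward the points the earlier ones got wrong.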
Boosting SVMs: Linear Kernel (C = 1)
The regularization constant is C = 1, and ht denotes the hyperplane learned in
boosting iteration t. No ht discriminates well between the classes on its own.

[Figure: the boosted linear SVM hyperplanes h1, h2, h3, h4 on the Iris PCA dataset; axes u1 and u2.]

Mt     ǫt      αt
h1    0.280   0.944
h2    0.305   0.826
h3    0.174   1.559
h4    0.282   0.935

combined model        M1     M2     M3     M4
training error rate   0.280  0.253  0.073  0.047
Boosting SVMs: Linear Kernel (C = 1)
Five-fold cross-validation using independent testing data.

As the number of base classifiers K increases, both error rates decrease.
However, while the training error essentially goes to 0, the testing error does
not reduce beyond 0.02 (attained at K = 110).

[Figure: training and testing error rates as a function of K, for K from 0 to 200.]
Stacking
Stacking(K , M, C , D):
// Train base classifiers
1 for t ∈ [1, K ] do
2     Mt ← train tth base classifier on D
// Train combiner model C on Z
3 Z ← ∅
4 foreach (x i , yi ) ∈ D do
5     z i ← (M1 (x i ), M2 (x i ), . . . , MK (x i ))T
6     Z ← Z ∪ {(z i , yi )}
7 C ← train combiner classifier on Z
8 return (C , M1 , M2 , . . . , MK )
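The stacking procedure can be sketched generically. In this sketch (names and interfaces are ours) the base models and the combiner are assumed to be objects with fit/predict methods; the meta-dataset Z simply stacks the base-model predictions column-wise.

```python
import numpy as np

def stacking_fit(base_models, combiner, X, y):
    # steps 1-2: train each base classifier M_t on the full training set D
    for m in base_models:
        m.fit(X, y)
    # steps 3-6: build Z, where row i is z_i = (M_1(x_i), ..., M_K(x_i))
    Z = np.column_stack([m.predict(X) for m in base_models])
    # step 7: train the combiner classifier C on Z
    combiner.fit(Z, y)
    return combiner, base_models

def stacking_predict(combiner, base_models, X):
    # map a new point through the base models, then through the combiner
    Z = np.column_stack([m.predict(X) for m in base_models])
    return combiner.predict(Z)
```

Any classifiers exposing this interface (e.g. scikit-learn estimators) can be plugged in for the base models and the combiner.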
Iris 2D PCA - Stacking
Iris 2D PCA - Stacking
[Figure: stacking on the Iris 2D PCA dataset (axis X1). The linear SVM boundary is the line in light gray, the naive Bayes boundary is the ellipse, and the boundary of the stacking classifier is shown as the thicker black lines.]
Data Mining and Machine Learning:
Fundamental Concepts and Algorithms
dataminingbook.info