Evaluation Metrics

Evaluation: the key to success

Issues in evaluation

Note on parameter tuning
• It is important that the test data is not used in any way to create the classifier
• Some learning schemes operate in two stages:
  • Stage 1: build the basic structure
  • Stage 2: optimize parameter settings
• The test data cannot be used for parameter tuning!
• The proper procedure uses three sets: training data, validation data, and test data (see the sketch below)
• Validation data is used to optimize parameters
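To make the three-set procedure concrete, here is a minimal Python sketch; the random data, the decision-tree learner, and the depth grid are illustrative assumptions, not part of the original material:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Placeholder data: 1000 instances, 5 features, binary labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = rng.integers(0, 2, size=1000)

    # Split off the test set first; it is never used during tuning.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    # Split the remainder into training and validation sets.
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.25, random_state=0)

    # Stage 2: optimize a parameter (here, tree depth) on the validation set only.
    best_depth, best_acc = None, -1.0
    for depth in (1, 2, 4, 8):
        model = DecisionTreeClassifier(max_depth=depth).fit(X_train, y_train)
        acc = accuracy_score(y_val, model.predict(X_val))
        if acc > best_acc:
            best_depth, best_acc = depth, acc

    # Only now does the untouched test set estimate performance.
    final = DecisionTreeClassifier(max_depth=best_depth).fit(X_train, y_train)
    print("test accuracy:", accuracy_score(y_test, final.predict(X_test)))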
Making the most of the data

Holdout estimation

Repeated holdout method
Cross-validation
• K-fold cross-validation avoids overlapping test sets
• First step: split the data into k subsets of equal size
• Second step: use each subset in turn for testing, the remainder for training
• This means the learning algorithm is applied to k different training sets
• Often the subsets are stratified before the cross-validation is performed, to yield stratified k-fold cross-validation
• The error estimates are averaged to yield an overall error estimate; the standard deviation is also often computed (see the sketch below)
• Alternatively, predictions and actual target values from the k folds are pooled to compute one estimate
  • Does not yield an estimate of the standard deviation
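A minimal sketch of stratified 10-fold cross-validation, assuming scikit-learn and placeholder random data:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 4))
    y = rng.integers(0, 2, size=300)

    # Stratified split: each fold preserves the overall class proportions.
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    errors = []
    for train_idx, test_idx in skf.split(X, y):
        model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
        acc = accuracy_score(y[test_idx], model.predict(X[test_idx]))
        errors.append(1.0 - acc)

    # Average the k error estimates and report their standard deviation.
    print("mean error:", np.mean(errors), "std dev:", np.std(errors))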
More on cross-validation
Leave-one-out cross-validation
• Leave-one-out: a particular form of k-fold cross-validation:
  • Set the number of folds to the number of training instances
  • I.e., for n training instances, build the classifier n times
• Makes best use of the data
• Involves no random subsampling
• Very computationally expensive (exception: lazy classifiers such as the nearest-neighbor classifier; see the sketch below)
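A minimal leave-one-out sketch, assuming scikit-learn; the 1-nearest-neighbor classifier is used because, as noted above, lazy classifiers keep leave-one-out affordable:

    import numpy as np
    from sklearn.model_selection import LeaveOneOut
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = rng.integers(0, 2, size=100)

    # Build the classifier n times, testing on the single held-out instance.
    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = KNeighborsClassifier(n_neighbors=1).fit(X[train_idx], y[train_idx])
        correct += int(model.predict(X[test_idx])[0] == y[test_idx][0])

    print("leave-one-out accuracy:", correct / len(X))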
Leave-one-out CV and stratification
• Disadvantage of leave-one-out CV: stratification is not possible
• It guarantees a non-stratified sample because there is only one instance in the test set!
• Extreme example: a random dataset split equally into two classes
  • The best inducer predicts the majority class
  • 50% accuracy on fresh data
  • The leave-one-out CV estimate gives 100% error, because the class of the held-out instance is always the minority class in the remaining training data (simulated in the sketch below)
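The extreme example can be simulated in a few lines of Python (a sketch; the 100-instance balanced dataset is an assumption for illustration):

    import numpy as np

    # 50 instances of each class, with no predictive signal at all.
    y = np.array([0] * 50 + [1] * 50)

    wrong = 0
    for i in range(len(y)):
        train = np.delete(y, i)          # hold out instance i
        # Majority-class prediction: the held-out instance's class now has
        # only 49 votes vs. 50, so the prediction is always the other class.
        prediction = np.bincount(train).argmax()
        wrong += int(prediction != y[i])

    print("leave-one-out error rate:", wrong / len(y))   # 1.0, i.e., 100%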
Class Imbalance Problem
• Key challenge:
  – Evaluation measures such as accuracy are not well-suited for imbalanced classes
Confusion Matrix
                          PREDICTED CLASS
                          Class=Yes   Class=No
  ACTUAL    Class=Yes     a           b
  CLASS     Class=No      c           d

  a: TP (true positive)
  b: FN (false negative)
  c: FP (false positive)
  d: TN (true negative)
Accuracy
                          PREDICTED CLASS
                          Class=Yes   Class=No
  ACTUAL    Class=Yes     a (TP)      b (FN)
  CLASS     Class=No      c (FP)      d (TN)

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
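As a hedged sketch, the four counts and the accuracy can be computed directly from predictions (the two label vectors are made-up examples):

    import numpy as np

    y_true = np.array([1, 1, 0, 0, 1, 0, 0, 1, 0, 0])
    y_pred = np.array([1, 0, 0, 0, 1, 1, 0, 1, 0, 0])

    # The four cells of the 2x2 confusion matrix (positive class = 1).
    tp = np.sum((y_true == 1) & (y_pred == 1))   # a
    fn = np.sum((y_true == 1) & (y_pred == 0))   # b
    fp = np.sum((y_true == 0) & (y_pred == 1))   # c
    tn = np.sum((y_true == 0) & (y_pred == 0))   # d

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    print("TP FN FP TN:", tp, fn, fp, tn, "accuracy:", accuracy)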
Problem with Accuracy
• Consider a 2-class problem:
  – Number of Class NO examples = 990
  – Number of Class YES examples = 10
• If a model predicts everything to be class NO, its accuracy is 990/1000 = 99%
  – This is misleading because the trivial model does not detect a single class YES example
  – Detecting the rare class is usually more interesting (e.g., fraud, intrusions, defects)

                          PREDICTED CLASS
                          Class=Yes   Class=No
  ACTUAL    Class=Yes     0           10
  CLASS     Class=No      0           990
Which model is better?
                          PREDICTED
  A                       Class=Yes   Class=No
     ACTUAL   Class=Yes   0           10
              Class=No    0           990
     Accuracy: 99%

                          PREDICTED
  B                       Class=Yes   Class=No
     ACTUAL   Class=Yes   10          0
              Class=No    500         490
     Accuracy: 50%
Which model is better?
                          PREDICTED
  A                       Class=Yes   Class=No
     ACTUAL   Class=Yes   5           5
              Class=No    0           990

                          PREDICTED
  B                       Class=Yes   Class=No
     ACTUAL   Class=Yes   10          0
              Class=No    500         490
Alternative Measures
                          PREDICTED CLASS
                          Class=Yes   Class=No
  ACTUAL    Class=Yes     a           b
  CLASS     Class=No      c           d

Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
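A small Python helper (a sketch; the function name prf is arbitrary) that implements the three formulas and reproduces the first worked example below:

    def prf(a, b, c, d):
        """Precision, recall, and F-measure from the 2x2 confusion
        matrix cells a=TP, b=FN, c=FP, d=TN, as defined above."""
        p = a / (a + c)
        r = a / (a + b)
        f = 2 * r * p / (r + p)
        return p, r, f

    # First worked example below: a=10, b=0, c=10, d=980.
    print(prf(10, 0, 10, 980))   # (0.5, 1.0, 0.666...)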
Alternative Measures

                          PREDICTED CLASS
                          Class=Yes   Class=No
  ACTUAL    Class=Yes     10          0
  CLASS     Class=No      10          980

Precision (p) = 10 / (10 + 10) = 0.5
Recall (r) = 10 / (10 + 0) = 1
F-measure (F) = (2 × 1 × 0.5) / (1 + 0.5) ≈ 0.67
Accuracy = 990 / 1000 = 0.99
A second example:

                          PREDICTED CLASS
                          Class=Yes   Class=No
  ACTUAL    Class=Yes     1           9
  CLASS     Class=No      0           990

Precision (p) = 1 / (1 + 0) = 1
Recall (r) = 1 / (1 + 9) = 0.1
F-measure (F) = (2 × 0.1 × 1) / (1 + 0.1) ≈ 0.18
Accuracy = 991 / 1000 = 0.991
Which of these classifiers is better?
                          PREDICTED CLASS
  A                       Class=Yes   Class=No
     ACTUAL   Class=Yes   40          10
     CLASS    Class=No    10          40

     Precision (p) = 0.8
     Recall (r) = 0.8
     F-measure (F) = 0.8
     Accuracy = 0.8

                          PREDICTED CLASS
  B                       Class=Yes   Class=No
     ACTUAL   Class=Yes   40          10
     CLASS    Class=No    1000        4000

     Precision (p) ≈ 0.04
     Recall (r) = 0.8
     F-measure (F) ≈ 0.07
     Accuracy ≈ 0.8
Measures of Classification Performance
                          PREDICTED CLASS
                          Yes         No
  ACTUAL    Yes           TP          FN
  CLASS     No            FP          TN

TPR = TP / (TP + FN)   (true positive rate, identical to recall)
FPR = FP / (FP + TN)   (false positive rate)

                          PREDICTED CLASS
  A                       Class=Yes   Class=No
     ACTUAL   Class=Yes   10          40
     CLASS    Class=No    10          40

     Precision (p) = 0.5
     TPR = Recall (r) = 0.2
     FPR = 0.2
     F-measure ≈ 0.29

                          PREDICTED CLASS
  B                       Class=Yes   Class=No
     ACTUAL   Class=Yes   25          25
     CLASS    Class=No    25          25

     Precision (p) = 0.5
     TPR = Recall (r) = 0.5
     FPR = 0.5
     F-measure = 0.5
ROC (Receiver Operating Characteristic)
• An ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) as the classification threshold varies
• Notable (TPR, FPR) points:
  – (0, 0): declare everything to be the negative class
  – (1, 1): declare everything to be the positive class
  – (1, 0): ideal
• Diagonal line:
  – Random guessing
  – Below the diagonal line: the prediction is opposite of the true class
ROC Curve Example
[Figure: a probability-estimation decision tree with splits on x1 and x2 (e.g., x2 < 12.63, x1 < 7.24); each leaf is labeled with a predicted probability of the positive class, used as the instance's score]
• A 1-dimensional data set containing 2 classes (positive and negative)
• Any point located at x > t is classified as positive
• At threshold t: TPR = 0.5, FNR = 0.5, FPR = 0.12, TNR = 0.88
How to Construct an ROC curve

   TP    5    4    4    3    3    3    3    2    2    1    0
   FP    5    5    4    4    3    2    1    1    0    0    0
   TN    0    0    1    1    2    3    4    4    5    5    5
   FN    0    1    1    2    2    2    2    3    3    4    5
   TPR   1    0.8  0.8  0.6  0.6  0.6  0.6  0.4  0.4  0.2  0
   FPR   1    1    0.8  0.8  0.6  0.4  0.2  0.2  0    0    0

Each column is one threshold setting on the classifier's score; TPR = TP / (TP + FN), and the FPR row, FPR = FP / (FP + TN), is filled in here from the FP and TN rows.

[Figure: ROC curve traced through the (FPR, TPR) pairs above]
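A minimal Python sketch of the threshold sweep; the scores and labels here are placeholders, not the values behind the table above:

    import numpy as np

    # Classifier scores for 5 positive (1) and 5 negative (0) instances.
    scores = np.array([0.95, 0.93, 0.87, 0.85, 0.84, 0.80, 0.76, 0.53, 0.43, 0.25])
    labels = np.array([1,    1,    0,    1,    0,    1,    0,    0,    1,    0])

    order = np.argsort(-scores)      # sort instances by decreasing score
    labels = labels[order]

    P, N = labels.sum(), (1 - labels).sum()
    tpr, fpr = [0.0], [0.0]          # start above the highest score
    tp = fp = 0
    for lab in labels:               # lower the threshold one instance at a time
        tp += lab
        fp += 1 - lab
        tpr.append(tp / P)
        fpr.append(fp / N)

    print(list(zip(fpr, tpr)))       # the (FPR, TPR) points of the ROC curve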
Using ROC for Model Comparison
• No model consistently outperforms the other (see the sketch below):
  – M1 is better for small FPR
  – M2 is better for large FPR
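A hedged sketch of such a comparison with scikit-learn; the two score vectors stand in for hypothetical models M1 and M2 and are generated synthetically, where real scores would come from each model's decision function:

    import numpy as np
    from sklearn.metrics import roc_curve

    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=500)
    # Placeholder scores for two hypothetical models.
    scores_m1 = y * 0.6 + rng.random(500) * 0.6
    scores_m2 = y * 0.4 + rng.random(500) * 0.8

    # Inspect TPR at a small and at a large FPR, as in the bullets above.
    for name, s in (("M1", scores_m1), ("M2", scores_m2)):
        fpr, tpr, _ = roc_curve(y, s)
        print(name,
              "TPR at FPR<=0.1:", tpr[fpr <= 0.1].max(),
              "TPR at FPR<=0.8:", tpr[fpr <= 0.8].max())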