
Introduction to Data Mining:

Evaluation Metrics

By Habeebah Adamu Kakudi


(PhD)

MAY TO JUNE 2021    CSC4316: DATA MINING


! Reference Textbooks
– Data Mining: Practical Machine Learning Tools and Techniques, by I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal
– Introduction to Data Mining, 2nd Edition, by Tan, Steinbach, Karpatne, Kumar



Credibility: Evaluating what’s been
learned

• Issues: training, testing, tuning


• Predicting performance: confidence limits
• Holdout, cross-validation, bootstrap
• Hyperparameter selection
• Comparing machine learning schemes
• Predicting probabilities
• Cost-sensitive evaluation
• Evaluating numeric prediction
• The minimum description length principle
• Model selection using a validation set

Evaluation: the key to success

• How predictive is the model we have learned?


• Error on the training data is not a good indicator of
performance on future data
• Otherwise 1-NN would be the optimum classifier!
• Simple solution that can be used if a large amount of
(labeled) data is available:
• Split data into training and test set
• However: (labeled) data is usually limited
• More sophisticated techniques need to be used

Issues in evaluation

• Statistical reliability of estimated differences in performance (significance tests)
• Choice of performance measure:
• Number of correct classifications
• Accuracy of probability estimates
• Error in numeric predictions
• Costs assigned to different types of errors
• Many practical applications involve costs

Note on parameter tuning
• It is important that the test data is not used in any way
to create the classifier
• Some learning schemes operate in two stages:
• Stage 1: build the basic structure
• Stage 2: optimize parameter settings
• The test data cannot be used for parameter tuning!
• Proper procedure uses three sets: training data, validation data, and test data (a minimal sketch follows below)
• Validation data is used to optimize parameters
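A minimal sketch of this three-set protocol, assuming Python with scikit-learn; the synthetic dataset, the 60/20/20 split ratios, and the decision-tree learner are illustrative choices, not part of the slides.

# Hypothetical illustration of the training / validation / test protocol
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, random_state=42)

# First carve off the test set; it is never used for tuning.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Split the remainder into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)

# Tune a parameter (here: tree depth) on the validation set only.
best_depth, best_acc = None, -1.0
for depth in (2, 4, 8, None):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# Only after tuning is finished is the test set touched, once.
final = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
final.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, final.predict(X_test)))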

Making the most of the data

• Once evaluation is complete, all the data can be used to build the final classifier
• Generally, the larger the training data the better the
classifier (but returns diminish)
• The larger the test data the more accurate the error
estimate
• Holdout procedure: method of splitting original data into
training and test set
• Dilemma: ideally both training set and test set should be large!

Holdout estimation

• What should we do if we only have a single dataset?


• The holdout method reserves a certain amount for
testing and uses the remainder for training, after
shuffling
• Usually: one third for testing, the rest for training
• Problem: the samples might not be representative
• Example: class might be missing in the test data
• Advanced version uses stratification (see the sketch below)
• Ensures that each class is represented with approximately equal proportions in both subsets
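A minimal holdout sketch with stratification, assuming scikit-learn; the synthetic, skewed dataset is only for illustration, while the one-third test fraction follows the slide.

# Stratified holdout: shuffle, hold out one third for testing
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, shuffle=True, stratify=y, random_state=0)

# Both subsets keep roughly the same class proportions as the full data.
print("full:", Counter(y), "train:", Counter(y_train), "test:", Counter(y_test))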

Repeated holdout method

• Holdout estimate can be made more reliable by repeating the process with different subsamples
• In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
• The error rates on the different iterations are averaged
to yield an overall error rate
• This is called the repeated holdout method (see the sketch below)
• Still not optimum: the different test sets overlap
• Can we prevent overlapping?
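A sketch of repeated holdout, assuming scikit-learn; the ten repetitions and the decision-tree learner are arbitrary choices.

# Repeated holdout: average the error rate over several random splits
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, random_state=1)

errors = []
for seed in range(10):                      # a different subsample each time
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=seed)
    model = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    errors.append(1 - accuracy_score(y_te, model.predict(X_te)))

print("mean error: %.3f +/- %.3f" % (np.mean(errors), np.std(errors)))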

Cross-validation
• K-fold cross-validation avoids overlapping test sets
• First step: split data into k subsets of equal size
• Second step: use each subset in turn for testing, the
remainder for training
• This means the learning algorithm is applied to k
different training sets
• Often the subsets are stratified before the cross-
validation is performed to yield stratified k-fold cross-
validation
• The error estimates are averaged to yield an overall
error estimate; also, standard deviation is often
computed
• Alternatively, predictions and actual target values from the k folds are pooled to compute one estimate
• Pooling does not yield an estimate of the standard deviation
(a stratified k-fold sketch follows below)
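A minimal stratified k-fold sketch, assuming scikit-learn; the learner and the synthetic data are placeholders.

# Stratified ten-fold cross-validation
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=2)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=2)
scores = cross_val_score(DecisionTreeClassifier(random_state=2), X, y, cv=cv)

# One accuracy estimate per fold; report the average error and its spread.
print("error per fold:", np.round(1 - scores, 3))
print("mean error: %.3f (std %.3f)" % (1 - scores.mean(), scores.std()))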
More on cross-validation

• Standard method for evaluation: stratified ten-fold cross-validation
• Why ten?
• Extensive experiments have shown that this is the best
choice to get an accurate estimate
• There is also some theoretical evidence for this
• Stratification reduces the estimate’s variance
• Even better: repeated stratified cross-validation
• E.g., ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance; see the sketch below)
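The same idea with repetition, sketched with scikit-learn's RepeatedStratifiedKFold (10 × 10-fold); learner and data are again placeholders.

# Repeated stratified cross-validation
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=3)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=3)
scores = cross_val_score(DecisionTreeClassifier(random_state=3), X, y, cv=cv)
print("mean accuracy over 100 train/test runs: %.3f" % scores.mean())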

Leave-one-out cross-validation
• Leave-one-out: a particular form of k-fold cross-validation:
• Set number of folds to number of training instances
• I.e., for n training instances, build classifier n times
• Makes best use of the data
• Involves no random subsampling
• Very computationally expensive (exception: lazy classifiers such as the nearest-neighbor classifier; see the sketch below)
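A leave-one-out sketch, assuming scikit-learn; a 1-nearest-neighbor classifier is chosen because lazy learners keep LOO cheap, and the small dataset size is deliberate.

# Leave-one-out cross-validation: n folds for n instances, no random subsampling
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, random_state=4)

# One 0/1 accuracy score per held-out instance; the mean is the LOO estimate.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=LeaveOneOut())
print("LOO accuracy estimate: %.3f" % scores.mean())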

Leave-one-out CV and stratification
• Disadvantage of Leave-one-out CV: stratification is
not possible
• It guarantees a non-stratified sample because there is
only one instance in the test set!
• Extreme example: random dataset split equally
into two classes
• Best inducer predicts majority class
• 50% accuracy on fresh data
• Leave-one-out CV estimate gives 100% error!

Class Imbalance Problem

! Lots of classification problems where the classes are skewed (more records from one class than another)
– Credit card fraud
– Intrusion detection
– Defective products in manufacturing assembly line
– COVID-19 test results on a random sample

! Key Challenge:
– Evaluation measures such as accuracy are not well-suited for imbalanced classes
Confusion Matrix

! Confusion Matrix:

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes        a           b
CLASS      Class=No         c           d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
Accuracy

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes     a (TP)      b (FN)
CLASS      Class=No      c (FP)      d (TN)

! Most widely-used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

(see the worked sketch below)
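A small worked sketch, assuming scikit-learn; the label vectors are made up for illustration, and the cell names (a, b, c, d) follow the slide.

# Confusion-matrix cells and accuracy
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]   # 1 = Class Yes, 0 = Class No
y_pred = [1, 1, 0, 0, 0, 0, 0, 1, 0, 0]

# confusion_matrix returns rows = actual, columns = predicted;
# with labels=[1, 0] the layout matches the slide: [[a, b], [c, d]].
(a, b), (c, d) = confusion_matrix(y_true, y_pred, labels=[1, 0])
print("TP=%d FN=%d FP=%d TN=%d" % (a, b, c, d))
print("accuracy =", (a + d) / (a + b + c + d))
print("check    =", accuracy_score(y_true, y_pred))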
Problem with Accuracy
! Consider a 2-class problem
– Number of Class NO examples = 990
– Number of Class YES examples = 10
! If a model predicts everything to be class NO, accuracy is 990/1000 = 99%
– This is misleading because this trivial model does not detect any class YES example
– Detecting the rare class is usually more interesting (e.g., frauds, intrusions, defects, etc.)

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes        0           10
CLASS      Class=No         0          990
Which model is better?

Model A:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes        0           10
CLASS      Class=No         0          990
Accuracy: 99%

Model B:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes       10            0
CLASS      Class=No       500          490
Accuracy: 50%
Which model is better?

Model A:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes        5            5
CLASS      Class=No         0          990

Model B:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes       10            0
CLASS      Class=No       500          490
Alternative Measures

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes        a           b
CLASS      Class=No         c           d

Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

(see the sketch below)
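A hypothetical helper that evaluates these formulas directly from the confusion-matrix cells; the function name and inputs are illustrative, not from the slides.

# Precision, recall and F-measure from the cells a (TP), b (FN), c (FP)
def precision_recall_f(a, b, c):
    """Cell names as in the slides: a = TP, b = FN, c = FP."""
    p = a / (a + c)               # precision
    r = a / (a + b)               # recall
    f = 2 * a / (2 * a + b + c)   # F-measure = 2rp / (r + p)
    return p, r, f

# The two examples on the next slide:
print(precision_recall_f(a=10, b=0, c=10))   # (0.5, 1.0, ~0.67)
print(precision_recall_f(a=1,  b=9, c=0))    # (1.0, 0.1, ~0.18)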
Alternative Measures

Example 1:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes       10            0
CLASS      Class=No        10          980

Precision (p) = 10 / (10 + 10) = 0.5
Recall (r) = 10 / (10 + 0) = 1
F-measure (F) = (2 * 1 * 0.5) / (1 + 0.5) ≈ 0.67
Accuracy = 990 / 1000 = 0.99

Example 2:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes        1            9
CLASS      Class=No         0          990

Precision (p) = 1 / (1 + 0) = 1
Recall (r) = 1 / (1 + 9) = 0.1
F-measure (F) = (2 * 0.1 * 1) / (1 + 0.1) ≈ 0.18
Accuracy = 991 / 1000 = 0.991
Which of these classifiers is better?

Model A:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes       40           10
CLASS      Class=No        10           40
Precision (p) = 0.8, Recall (r) = 0.8, F-measure (F) = 0.8, Accuracy = 0.8

Model B:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes       40           10
CLASS      Class=No      1000         4000
Precision (p) ≈ 0.04, Recall (r) = 0.8, F-measure (F) ≈ 0.08, Accuracy ≈ 0.8
Measures of Classification Performance

                     PREDICTED CLASS
                     Yes       No
ACTUAL     Yes       TP        FN
CLASS      No        FP        TN

! α (alpha) is the probability that we reject the null hypothesis when it is true. This is a Type I error, or a false positive (FP).

! β (beta) is the probability that we accept the null hypothesis when it is false. This is a Type II error, or a false negative (FN).
Alternative Measures

Model A:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes       40           10
CLASS      Class=No        10           40
Precision (p) = 0.8, TPR = Recall (r) = 0.8, FPR = 0.2, F-measure (F) = 0.8, Accuracy = 0.8, TPR/FPR = 4

Model B:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes       40           10
CLASS      Class=No      1000         4000
Precision (p) = 0.038, TPR = Recall (r) = 0.8, FPR = 0.2, F-measure (F) = 0.07, Accuracy = 0.8, TPR/FPR = 4
Which of these classifiers is better?

Model A:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes       10           40
CLASS      Class=No        10           40
Precision (p) = 0.5, TPR = Recall (r) = 0.2, FPR = 0.2, F-measure = 0.28

Model B:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes       25           25
CLASS      Class=No        25           25
Precision (p) = 0.5, TPR = Recall (r) = 0.5, FPR = 0.5, F-measure = 0.5

Model C:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes       40           10
CLASS      Class=No        40           10
Precision (p) = 0.5, TPR = Recall (r) = 0.8, FPR = 0.8, F-measure = 0.61
ROC (Receiver Operating Characteristic)

! A graphical approach for displaying the trade-off between detection rate and false alarm rate
! Developed in the 1950s for signal detection theory to analyze noisy signals
! ROC curve plots TPR against FPR
– Performance of a model is represented as a point in the ROC curve
ROC Curve

(TPR, FPR):
! (0,0): declare everything to be negative class
! (1,1): declare everything to be positive class
! (1,0): ideal

! Diagonal line:
– Random guessing
– Below diagonal line: prediction is opposite of the true class
ROC (Receiver Operating Characteristic)

! To draw an ROC curve, the classifier must produce continuous-valued output
– Outputs are used to rank test records, from the most likely positive class record to the least likely positive class record
– By using different thresholds on this value, we can create different variations of the classifier with TPR/FPR tradeoffs
! Many classifiers produce only discrete outputs (i.e., the predicted class)
– How to get continuous-valued outputs?
• Decision trees, rule-based classifiers, neural networks, Bayesian classifiers, k-nearest neighbors, SVM (see the sketch below)
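A sketch of one common way to obtain such scores, assuming scikit-learn: most classifiers expose predict_proba (or decision_function), and its positive-class column can be thresholded to trade off TPR against FPR.

# Continuous-valued scores from a decision tree
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)

tree = DecisionTreeClassifier(max_depth=3, random_state=7).fit(X_tr, y_tr)
scores = tree.predict_proba(X_te)[:, 1]   # positive-class fraction in each leaf

# Different thresholds give different classifiers from the same model.
for t in (0.3, 0.5, 0.7):
    print("threshold", t, "-> predicted positives:", int((scores >= t).sum()))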
Example: Decision Trees

[Figure: a decision tree with splits on attributes x1 and x2 (e.g., x2 < 12.63, x1 < 13.29, x2 < 17.35, ...); each leaf outputs a continuous-valued score (0.059, 0.107, 0.727, ...) that can be thresholded to obtain different TPR/FPR tradeoffs]
ROC Curve Example

[Figure: the same decision tree with its leaf scores, and the ROC curve obtained by thresholding those scores]
ROC Curve Example
- 1-dimensional data set containing 2 classes (positive and negative)
- Any point located at x > t is classified as positive

At threshold t:
TPR = 0.5, FNR = 0.5, FPR = 0.12, TNR = 0.88
How to Construct an ROC curve

Instance   Score   True Class
    1       0.95       +
    2       0.93       +
    3       0.87       -
    4       0.85       -
    5       0.85       -
    6       0.85       +
    7       0.76       -
    8       0.53       +
    9       0.43       -
   10       0.25       +

• Use a classifier that produces a continuous-valued score for each instance
• The more likely it is for the instance to be in the + class, the higher the score
• Sort the instances in decreasing order according to the score
• Apply a threshold at each unique value of the score
• Count the number of TP, FP, TN, FN at each threshold
• TPR = TP / (TP + FN)
• FPR = FP / (FP + TN)

(a sketch of this procedure follows below)
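A sketch of this procedure in Python/NumPy, using the ten instances from the table above; scikit-learn's roc_curve is included only as a cross-check.

# Construct ROC points by sorting scores and sweeping thresholds
import numpy as np
from sklearn.metrics import roc_curve

scores = np.array([0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25])
labels = np.array([1, 1, 0, 0, 0, 1, 0, 1, 0, 1])   # 1 = +, 0 = -

P, N = labels.sum(), (1 - labels).sum()
for t in sorted(np.unique(scores), reverse=True):
    pred = scores >= t                       # classify as + when score >= threshold
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    print("t >= %.2f: TPR = %.1f, FPR = %.1f" % (t, tp / P, fp / N))

# Library cross-check (also adds a (0, 0) endpoint for an "accept nothing" threshold).
fpr, tpr, thr = roc_curve(labels, scores, drop_intermediate=False)
print(np.round(tpr, 1), np.round(fpr, 1))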
How to construct an ROC curve

Class              +     -     +     -     -     -     +     -     +     +
Threshold >=    0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP                 5     4     4     3     3     3     3     2     2     1     0
FP                 5     5     4     4     3     2     1     1     0     0     0
TN                 0     0     1     1     2     3     4     4     5     5     5
FN                 0     1     1     2     2     2     2     3     3     4     5
TPR                1   0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2     0
FPR                1     1   0.8   0.8   0.6   0.4   0.2   0.2     0     0     0

[Figure: the resulting ROC curve, FPR on the x-axis and TPR on the y-axis]
Using ROC for Model Comparison

! No model consistently outperforms the other
– M1 is better for small FPR
– M2 is better for large FPR

! Area Under the ROC curve (AUC)
– Ideal: Area = 1
– Random guess: Area = 0.5

[Figure: ROC curves of models M1 and M2]
(see the AUC sketch below)
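A minimal AUC sketch, assuming scikit-learn; the two models (a shallow decision tree and naive Bayes) and the synthetic data are only stand-ins for M1 and M2 in the figure.

# Compare two models by area under the ROC curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=5)

for name, model in [("M1", DecisionTreeClassifier(max_depth=3, random_state=5)),
                    ("M2", GaussianNB())]:
    model.fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]       # P(class = +) per instance
    print(name, "AUC = %.3f" % roc_auc_score(y_te, scores))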
Dealing with Imbalanced Classes - Summary

! Many measures exist, but none of them may be ideal in all situations
– Random classifiers can have high values for many of these measures
– TPR/FPR provides important information but may not be sufficient by itself in many practical scenarios
– Given two classifiers, sometimes you can tell that one of them is strictly better than the other
• C1 is strictly better than C2 if C1 has strictly better TPR and FPR relative to C2 (or the same TPR and better FPR, and vice versa)
– Even if C1 is strictly better than C2, C1's F-value can be worse than C2's if they are evaluated on data sets with different imbalances
– Classifier C1 can be better or worse than C2 depending on the scenario at hand (class imbalance, importance of TP vs FP, cost/time tradeoffs)
Which Classifier is better?

T1:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes       50           50
CLASS      Class=No         1           99
Precision (p) = 0.98, TPR = Recall (r) = 0.5, FPR = 0.01, TPR/FPR = 50, F-measure = 0.66

T2:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes       99            1
CLASS      Class=No        10           90
Precision (p) = 0.9, TPR = Recall (r) = 0.99, FPR = 0.1, TPR/FPR = 9.9, F-measure = 0.94

T3:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes       99            1
CLASS      Class=No         1           99
Precision (p) = 0.99, TPR = Recall (r) = 0.99, FPR = 0.01, TPR/FPR = 99, F-measure = 0.99
Which Classifier is better? Medium Skew case

T1:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes       50           50
CLASS      Class=No        10          990
Precision (p) = 0.83, TPR = Recall (r) = 0.5, FPR = 0.01, TPR/FPR = 50, F-measure = 0.62

T2:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes       99            1
CLASS      Class=No       100          900
Precision (p) = 0.5, TPR = Recall (r) = 0.99, FPR = 0.1, TPR/FPR = 9.9, F-measure = 0.66

T3:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes       99            1
CLASS      Class=No        10          990
Precision (p) = 0.9, TPR = Recall (r) = 0.99, FPR = 0.01, TPR/FPR = 99, F-measure = 0.94
Which Classifier is better? High Skew case

T1:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes       50           50
CLASS      Class=No       100         9900
Precision (p) = 0.33, TPR = Recall (r) = 0.5, FPR = 0.01, TPR/FPR = 50, F-measure = 0.4

T2:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes       99            1
CLASS      Class=No      1000         9000
Precision (p) = 0.09, TPR = Recall (r) = 0.99, FPR = 0.1, TPR/FPR = 9.9, F-measure = 0.165

T3:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes       99            1
CLASS      Class=No       100         9900
Precision (p) = 0.5, TPR = Recall (r) = 0.99, FPR = 0.01, TPR/FPR = 99, F-measure = 0.66
Building Classifiers with Imbalanced Training Set

! Modify the distribution of the training data so that the rare class is well-represented in the training set (see the sketch below)
– Undersample the majority class
– Oversample the rare class
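A sketch of oversampling the rare class with sklearn.utils.resample; the synthetic data and the 1:1 target ratio are illustrative choices (undersampling the majority class is the symmetric operation, and libraries such as imbalanced-learn offer more refined methods).

# Oversample the minority class by sampling it with replacement
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import resample

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=6)

X_min, y_min = X[y == 1], y[y == 1]       # rare class
X_maj, y_maj = X[y == 0], y[y == 0]       # majority class

# Sample the rare class with replacement until the classes are balanced.
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(y_maj), random_state=6)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y_maj, y_min_up])
print("before:", np.bincount(y), "after:", np.bincount(y_bal))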
