
EE2211 Introduction to Machine Learning
Lecture 10

Wang Xinchao
xinchao@nus.edu.sg

!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Course Contents
• Introduction and Preliminaries (Xinchao)
– Introduction
– Data Engineering
– Introduction to Linear Algebra, Probability and Statistics
• Fundamental Machine Learning Algorithms I (Helen)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Thomas)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Xinchao)
– Performance Issues
– K-means Clustering
– Neural Networks
EE2211: Learning Outcome
• I am able to understand the formulation of a machine learning task
– Lecture 1 (feature extraction + classification)
– Lecture 4 to Lecture 9 (regression and classification)
– Lecture 11 and Lecture 12 (clustering and neural network)
• I am able to relate the fundamentals of linear algebra and probability to machine learning
– Lecture 2 (recap of probability and linear algebra)
– Lecture 4 to Lecture 8 (regression and classification)
– Lecture 12 (neural network)
• I am able to prepare the data for supervised learning and unsupervised learning
– Lecture 1 (feature extraction), Page 25 to 30 [For supervised and unsupervised]
– Lecture 2 (data wrangling) [For supervised and unsupervised]
– Lecture 10 (Training/Validation/Test) [For supervised]
– Programming Exercises in tutorials
• I am able to evaluate the performance of a machine learning algorithm
– Lecture 5 to Lecture 9 (evaluate the difference between labels and predictions)
– Lecture 10 (evaluation metrics)
• I am able to implement regression and classification algorithms
– Lecture 5 to Lecture 9

Outline
• Dataset Partition:
– Training/Validation/Testing

• Cross Validation

• Evaluation Metrics
– Evaluating the quality of a trained classifier

Dataset Partition
Training, Validation, and Test

We only have one dataset, yet we want to:
1. Train a model
2. The model may have parameters (e.g., the polynomial order), and we want to know which parameters or model works the best
3. With the selected parameters, we want to see the model's (the algorithm's) performance on new, unseen data

Suppose these data points are all we have, and we want to use them to design the ML model.

[Figure: a toy dataset of human-face and non-human-face images]

Training, Validation, and Test
• In real-world applications,
– We don't have test data, since they are unseen
– Imagine you develop a face-detector app: you don't know whom you will test it on
• In lab practice,
– We divide the dataset into three parts (the test set is hidden from training):

Training set: for training the ML model (the model's weights are trained using the training set)
Validation set: for validation, i.e., choosing the parameter or model
Test set: for testing the "real" performance, i.e., the generalization capability on unseen data

– NEVER touch test data during training!
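A minimal sketch of such a three-way split, assuming scikit-learn is available (the 60/20/20 ratios and the toy arrays are illustrative, not prescribed by the slides):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.randn(100, 5), np.random.randint(0, 2, 100)  # toy data

# Carve out the test set first (20%), then split the remainder
# into training (60%) and validation (20%) sets.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
```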


Training, Validation, and Test

[Figure: the dataset, with the training set highlighted]

Training, Validation, and Test

[Figure: the dataset, with the training set and validation set highlighted]

Training, Validation, and Test

[Figure: the dataset, with the training, validation, and test sets highlighted]

Training, Validation, and Test

Training set: for training the ML model
Validation set: for validation, i.e., choosing the parameter or model
Test set: for testing the "real" performance and generalization

Example:

Assume we have two classifiers, with the following accuracies:

1. Classifier with parameter 1: training accuracy 95%, validation accuracy 88%
2. Classifier with parameter 2: training accuracy 92%, validation accuracy 90%

Which one do we choose for testing?

Classifier 2! The choice depends on the validation accuracy: the validation set is how we check the performance of the different models.

Cross Validation
• In practice, we do k-fold cross validation (a more advanced partitioning)

4-fold cross validation (4 experiments):

Fold 1: Train      | Train      | Train      | Validation
Fold 2: Train      | Train      | Validation | Train
Fold 3: Train      | Validation | Train      | Train
Fold 4: Validation | Train      | Train      | Train

Example: which one to select for test?

                          Acc. on Val Set 1 | Val Set 2 | Val Set 3 | Val Set 4 | Avg. Acc. on Val Set
Classifier with Param 1:        88%         |    89%    |    93%    |    92%    |   90.5%
Classifier with Param 2:        90%         |    88%    |    91%    |    91%    |   90%

Classifier with Param 1, since it has the higher average validation accuracy.
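A minimal sketch of k-fold model selection, assuming scikit-learn; the two candidate parameter settings and the toy data are illustrative:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X = np.random.randn(40, 3)         # toy features
y = (X[:, 0] > 0).astype(int)      # toy labels

kf = KFold(n_splits=4, shuffle=True, random_state=0)
for C in (0.1, 1.0):               # two candidate parameter settings
    model = LogisticRegression(C=C)
    scores = cross_val_score(model, X, y, cv=kf)  # accuracy on each validation fold
    print(f"C={C}: fold accuracies {scores}, average {scores.mean():.3f}")
# Pick the setting with the highest average validation accuracy,
# then evaluate that model once on the held-out test set.
```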

Cross Validation

Other common partitionings:

• 10-fold CV
• 5-fold CV
• 3-fold CV

We may decide on the size of the test set, for example, 15%, 20%, or 30% of the whole dataset, with the rest used for training/validation.

The test set contains the examples that the learning algorithm has never seen before, so if our model performs well on predicting the labels of the examples from the test set, we say that our model generalizes well.

Training, Validation, and Test
• Validation is, however, not always used:
– Validation is used when you need to pick parameters or models
– If you have no models or parameters to compare, you may consider partitioning the data into only training and test sets

Outline
• Dataset Partition:
– Training/Validation/Testing

• Cross Validation

• Evaluation Metrics
– Evaluating the quality of a trained classifier (these metrics can be used in both validation and testing)

Evaluation Metrics
Regression

Mean Square Error, computed over the n test samples (MSE is differentiable, so it is much easier to perform gradient descent on):

MSE = (1/n) Σ_i (y_i − ŷ_i)²

Mean Absolute Error:

MAE = (1/n) Σ_i |y_i − ŷ_i|

where y_i denotes the target output and ŷ_i denotes the predicted output for sample i.
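A minimal sketch of both metrics in NumPy; the toy arrays are illustrative:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # target outputs y_i
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # predicted outputs y_hat_i

mse = np.mean((y_true - y_pred) ** 2)      # Mean Square Error
mae = np.mean(np.abs(y_true - y_pred))     # Mean Absolute Error
print(mse, mae)                            # 0.375 0.5
```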

Evaluation Metrics
Classification

By default:
Class-1: Positive Class
Class-2: Negative Class

Confusion Matrix:

                   Class-1 (predicted)   Class-2 (predicted)
Class-1 (actual)   7 (TP)                7 (FN)
Class-2 (actual)   2 (FP)                25 (TN)

TP: True Positive
FN: False Negative (i.e., Type II Error)
FP: False Positive (i.e., Type I Error)
TN: True Negative

Evaluation Metrics
Classification

                   Class-1 (predicted)   Class-2 (predicted)
Class-1 (actual)   7 (TP)                7 (FN)
Class-2 (actual)   2 (FP)                25 (TN)

• How many samples in the dataset have the real label of Class-1?  7 + 7 = 14
• How many samples are there in total?  7 + 7 + 2 + 25 = 41
• How many samples are correctly classified?  7 + 25 = 32. How many are incorrectly classified?  7 + 2 = 9

Evaluation Metrics
Classification

Confusion Matrix for Binary Classification:

               P (predicted)   N (predicted)
P (actual)     TP              FN
N (actual)     FP              TN

Recall = TP/(TP+FN)
Precision = TP/(TP+FP)   (the fraction of predicted positives that are correct)
Accuracy = (TP+TN)/(TP+TN+FP+FN)
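A minimal sketch computing these three quantities from the counts on the earlier slide (TP = 7, FN = 7, FP = 2, TN = 25):

```python
tp, fn, fp, tn = 7, 7, 2, 25

recall = tp / (tp + fn)                      # 7/14 = 0.5
precision = tp / (tp + fp)                   # 7/9  ~= 0.778
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 32/41 ~= 0.780
print(recall, precision, accuracy)
```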

Evaluation Metrics
Classification

Cost Matrix for Binary Classification:

               P (predicted)   N (predicted)
P (actual)     c_{1,1} · TP    c_{1,2} · FN
N (actual)     c_{2,1} · FP    c_{2,2} · TN

Total cost: c_{1,1}·TP + c_{1,2}·FN + c_{2,1}·FP + c_{2,2}·TN

Main idea: assign different penalties to different entries, with higher penalties for more severe results.

Usually, c_{1,1} and c_{2,2} are set to 0; c_{2,1} and c_{1,2} may or may not be equal.
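A minimal sketch of the total-cost computation; the cost values chosen here are an illustrative assumption, not values from the slides:

```python
# Confusion-matrix counts from a classifier
tp, fn, fp, tn = 7, 7, 2, 25

# Assumed costs: correct decisions cost 0; a false negative is
# penalized more heavily than a false positive here.
c11, c12, c21, c22 = 0, 10, 1, 0

total_cost = c11 * tp + c12 * fn + c21 * fp + c22 * tn
print(total_cost)  # 10*7 + 1*2 = 72
```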
Evaluation Metrics
• Example of a cost matrix
– Assume we would like to develop a self-driving car system
– We have an ML system that detects pedestrians using a camera, by conducting a binary classification
• When it detects a person (positive class), the car should stop
• When no person is detected (negative class), the car keeps going

True Positive (cost c_{1,1}): there is a person, ML detects the person, and the car stops
True Negative (cost c_{2,2}): there is no person, and the car keeps going
False Positive (cost c_{2,1}): there is no person, ML detects a person, and the car stops
False Negative (cost c_{1,2}): there is a person, ML fails to detect the person, and the car keeps going

Question: how should c_{1,2} compare with c_{2,1} (>, <, or =)?

(Image credit: automotiveworld.com)
Evaluation Metrics
• Handling unbalanced data
– Assume we have 1000 samples, of which 10 are positive and 990 are negative

                   Class-1 (predicted)   Class-2 (predicted)
Class-1 (actual)   5 (TP)                5 (FN)
Class-2 (actual)   5 (FP)                985 (TN)

– Accuracy = 990/1000 = 0.99!
– Yet half of the Class-1 samples are classified as Class-2! Accuracy alone cannot really evaluate the outcome well.

In this case, we shall
1) Use a cost matrix, assigning a different cost to each entry
2) Use Precision and Recall, which can better evaluate the outcome! Here Precision = 0.5 and Recall = 0.5
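A minimal sketch reproducing these numbers, assuming scikit-learn's metric helpers:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 10 positive and 990 negative samples (1 = positive class)
y_true = np.array([1] * 10 + [0] * 990)
# Classifier gets 5 of the positives right, and flags 5 negatives as positive
y_pred = np.array([1] * 5 + [0] * 5 + [1] * 5 + [0] * 985)

print(accuracy_score(y_true, y_pred))   # 0.99
print(precision_score(y_true, y_pred))  # 0.5
print(recall_score(y_true, y_pred))     # 0.5
```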

Evaluation Metrics
Classification

               P (predicted)   N (predicted)
P (actual)     TP              FN
N (actual)     FP              TN

(True Positive Rate)  TPR = TP/(TP+FN)   (this is the Recall)
(False Negative Rate) FNR = FN/(TP+FN)
(True Negative Rate)  TNR = TN/(FP+TN)
(False Positive Rate) FPR = FP/(FP+TN)

TPR + FNR = 1 (together they cover 100% of the positive-class data)
TNR + FPR = 1 (together they cover 100% of the negative-class data)

Evaluation Metrics
Classification

The classifier model outputs a predicted value y_pred for each sample; a threshold t on y_pred decides the class (y_pred ≥ t: positive, y_pred < t: negative). Sorting the six samples by y_pred:

Sample   N1    N2    P1    N3   P2   P3
y_pred  -1.1  -0.5  -0.1  0.2  0.6  0.9

(P1, P2, P3 are actual positives; N1, N2, N3 are actual negatives.)

Evaluation Metrics
Classification

Sample   N1    N2    P1    N3   P2   P3
y_pred  -1.1  -0.5  -0.1  0.2  0.6  0.9

With a higher threshold t, fewer samples are predicted positive: FP goes down, but FN may go up.

Evaluation Metrics
Classification

Sample   N1    N2    P1    N3   P2   P3
y_pred  -1.1  -0.5  -0.1  0.2  0.6  0.9

With a lower threshold t, more samples are predicted positive: FN goes down, but FP may go up.
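A minimal sketch sweeping the threshold over these six scores, with the candidate thresholds chosen for illustration:

```python
import numpy as np

scores = np.array([-1.1, -0.5, -0.1, 0.2, 0.6, 0.9])  # y_pred, sorted
labels = np.array([0, 0, 1, 0, 1, 1])                  # N1 N2 P1 N3 P2 P3

for t in (-2.0, 0.0, 0.4, 1.0):                        # candidate thresholds
    pred = (scores >= t).astype(int)
    tp = np.sum((pred == 1) & (labels == 1))
    fp = np.sum((pred == 1) & (labels == 0))
    tpr = tp / np.sum(labels == 1)                     # TP/(TP+FN)
    fpr = fp / np.sum(labels == 0)                     # FP/(FP+TN)
    print(f"t={t:+.1f}: TPR={tpr:.2f}, FPR={fpr:.2f}")
# Sliding t from low to high traces out the points of the ROC curve.
```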

Evaluation Metrics
TP, FP, Recall

Three thresholds on the six-sample example give three confusion matrices. Fill in the missing entries and compute the Recall for each:

             P (pred.)  N (pred.)  |  P (pred.)  N (pred.)  |  P (pred.)  N (pred.)
P (actual)   TP = 2     FN = ?     |  TP = 2     FN = ?     |  TP = 3     FN = ?
N (actual)   FP = 1     TN = 2     |  FP = 0     TN = 3     |  FP = 1     TN = 2

Evaluation Metrics

             P (pred.)  N (pred.)  |  P (pred.)  N (pred.)  |  P (pred.)  N (pred.)
P (actual)   TP = 2     FN = 1     |  TP = 2     FN = 1     |  TP = 3     FN = 0
N (actual)   FP = 1     TN = 2     |  FP = 0     TN = 3     |  FP = 1     TN = 2

Recall = 2/3               Recall = 2/3               Recall = 3/3 = 1

Evaluation Metrics
Classification

[Figure: classifier outputs for the two classes with a threshold]

Evaluation Metrics
Classification

[Figure: the same classifier outputs with a different threshold]

Evaluation Metrics
Classification

Sliding the threshold trades one error rate off against the other.

EER: Equal Error Rate, the operating point where the two error rates (FPR and FNR) are equal.

EER: higher better or lower better?  The lower, the better.

Evaluation Metrics

ROC: widely used.

The Detection Error Trade-off (DET) curve plots FNR against FPR (also called FMR). The Receiver Operating Characteristic (ROC) curve plots TPR against FPR. Since TPR + FNR = 1, one curve can be converted into the other.

AUC: the area under the (ROC) curve.
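A minimal sketch computing ROC points and the AUC, assuming scikit-learn and reusing the six-sample example:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

labels = np.array([0, 0, 1, 0, 1, 1])                  # N1 N2 P1 N3 P2 P3
scores = np.array([-1.1, -0.5, -0.1, 0.2, 0.6, 0.9])   # classifier outputs

fpr, tpr, thresholds = roc_curve(labels, scores)
print(fpr, tpr)                        # ROC points as the threshold slides
print(roc_auc_score(labels, scores))   # area under the ROC curve, 8/9 here
```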

Evaluation Metrics

Area Under the Curve (of the ROC curve)

AUC provides an aggregate measure of performance across all possible classification thresholds. Intuitively, it can be seen as the probability that the model ranks a random positive example more highly than a random negative example, i.e., gives the positive example a higher predicted value.

AUC ranges in value from 0 (100% wrong prediction) to 1 (100% correct prediction). It is classification-threshold-invariant: it does not change as the threshold changes.

Evaluation Metrics

Area Under the Curve (ROC curve)

Consider a set of binary samples indexed by i = 1, …, m⁺ for the positive class and j = 1, …, m⁻ for the negative class.

Let g(x) be a predictor, and let d_ij = g(x_i⁺) − g(x_j⁻) be the difference in the value of the predictor between a positive-class sample x_i⁺ and a negative-class sample x_j⁻.

In the perfect case, every positive sample receives a higher value of g(x) than every negative sample.

Evaluation Metrics

Area Under the Curve (ROC curve)

Consider also a step function given by

s(d) = 1 if d > 0;  0.5 if d = 0;  0 if d < 0

The Area Under the ROC Curve (AUC) can be expressed as

AUC = (1 / (m⁺ m⁻)) Σ_{j=1}^{m⁻} Σ_{i=1}^{m⁺} s(d_ij)

AUC ranges in value from 0 (100% wrong prediction) to 1 (100% correct prediction). The AUC = 1 when all the samples are ranked correctly. It is scale-invariant and classification-threshold-invariant.

[Figure: outputs of a logistic regression model for positive (P) and negative (N) samples, ordered by predicted value from 0.0 to 1.0]

Ref: D. Bamber, The area above the ordinal dominance graph and the area below the receiver operating characteristic graph, J. Math. Psychol. 12 (1975) 387-415.
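A minimal sketch of the pairwise formula above in NumPy, reusing the six-sample scores; the result matches scikit-learn's roc_auc_score on the same data:

```python
import numpy as np

pos = np.array([-0.1, 0.6, 0.9])   # g(x) for the positive samples
neg = np.array([-1.1, -0.5, 0.2])  # g(x) for the negative samples

d = pos[:, None] - neg[None, :]    # all pairwise differences d_ij
s = np.where(d > 0, 1.0, np.where(d == 0, 0.5, 0.0))  # step function s(d)
auc = s.mean()                     # (1/(m+ * m-)) * sum of s(d_ij)
print(auc)                         # 8/9, about 0.889
```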

Evaluation Metrics
Classification

Gini coefficient:

G = A/(A+B)

where, on the ROC plot (TPR vs. FPR), A is the area between the ROC curve and the diagonal, and B is the remaining area above the ROC curve. Since A + B = 0.5 and AUC = A + 0.5:

G = 2A = 2·AUC − 1

Evaluation Metrics
Classification

Confusion Matrix for Multicategory Classification:

              C_1 (predicted)   C_2 (predicted)   …   C_k (predicted)
C_1 (actual)  n_{1,1}           n_{1,2}           …   n_{1,k}
C_2 (actual)  n_{2,1}           n_{2,2}           …   n_{2,k}
⁞             ⁞                 ⁞                     ⁞
C_k (actual)  n_{k,1}           n_{k,2}           …   n_{k,k}

Other Issues
• Computational speed and memory consumption are also important factors
– Especially for mobile or edge devices

• Other factors
– Parallelizability, modularity, maintainability
– Not the focus of this module
