
EE2211 Introduction to Machine Learning
Lecture 10

Wang Xinchao
xinchao@nus.edu.sg

!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Course Contents
• Introduction and Preliminaries (Xinchao)
– Introduction
– Data Engineering
– Introduction to Linear Algebra, Probability and Statistics
• Fundamental Machine Learning Algorithms I (Helen)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Thomas)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Xinchao)
– Performance Issues
– K-means Clustering
– Neural Networks
EE2211: Learning Outcome
• I am able to understand the formulation of a machine learning task
– Lecture 1 (feature extraction + classification)
– Lecture 4 to Lecture 9 (regression and classification)
– Lecture 11 and Lecture 12 (clustering and neural network)
• I am able to relate the fundamentals of linear algebra and probability to machine learning
– Lecture 2 (recap of probability and linear algebra)
– Lecture 4 to Lecture 8 (regression and classification)
– Lecture 12 (neural network)
• I am able to prepare the data for supervised learning and unsupervised learning
– Lecture 1 (feature extraction), Page 25 to 30 [For supervised and unsupervised]
– Lecture 2 (data wrangling) [For supervised and unsupervised]
– Lecture 10 (Training/Validation/Test) [For supervised]
– Programming Exercises in tutorials
• I am able to evaluate the performance of a machine learning algorithm
– Lecture 5 to Lecture 9 (evaluate the difference between labels and predictions)
– Lecture 10 (evaluation metrics)
• I am able to implement regression and classification algorithms
– Lecture 5 to Lecture 9

Outline
• Dataset Partition:
– Training/Validation/Testing

• Cross Validation

• Evaluation Metrics
– Evaluating the quality of a trained classifier

Dataset Partition
Training, Validation, and Test

We only have one dataset, yet we want to:
1. Train a model
2. The model may have parameters (e.g., the polynomial order), and we want to know which parameters or model works the best
3. With the selected parameters, we want to see the model's (the algorithm's) performance on new, unseen data

Suppose these data points are all we have, and we want to use them to design the ML model.

[Figure: a toy dataset of human-face and non-human-face images]

Training, Validation, and Test
• In real-world applications,
– We don't have test data, since they are unseen
– Imagine you develop a face-detector app: you don't know whom you will test it on
• In lab practice,
– We divide the dataset into three parts (the test set is hidden from training):

Training set: for training the ML model (the model's weights are trained using the training set)
Validation set: for validation, i.e., choosing the parameter or model
Test set: for testing the "real" performance, i.e., the generalization capability on unseen data

– NEVER touch test data during training!
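A minimal sketch of such a three-way split, assuming scikit-learn is available (the 60/20/20 ratios and the toy arrays are illustrative, not prescribed by the slides):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.randn(100, 5), np.random.randint(0, 2, 100)  # toy data

# Carve out the test set first (20%), then split the remainder
# into training (60%) and validation (20%) sets.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
```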


Training, Validation, and Test

[Figure: the dataset, with the training set highlighted]

Training, Validation, and Test

[Figure: the dataset, with the training set and validation set highlighted]

Training, Validation, and Test

[Figure: the dataset, with the training, validation, and test sets highlighted]

Training, Validation, and Test

Training set: for training the ML model
Validation set: for validation, i.e., choosing the parameter or model
Test set: for testing the "real" performance and generalization

Example:

Assume we have two classifiers, with the following accuracies:

1. Classifier with parameter 1: training accuracy 95%, validation accuracy 88%
2. Classifier with parameter 2: training accuracy 92%, validation accuracy 90%

Which one do we choose for testing?

Classifier 2! The choice depends on the validation accuracy: the validation set is how we check the performance of the different models.

Cross Validation
• In practice, we do k-fold cross validation (a more advanced partitioning)

4-fold cross validation (4 experiments):

Fold 1: Train      | Train      | Train      | Validation
Fold 2: Train      | Train      | Validation | Train
Fold 3: Train      | Validation | Train      | Train
Fold 4: Validation | Train      | Train      | Train

Example: which one to select for test?

                          Acc. on Val Set 1 | Val Set 2 | Val Set 3 | Val Set 4 | Avg. Acc. on Val Set
Classifier with Param 1:        88%         |    89%    |    93%    |    92%    |   90.5%
Classifier with Param 2:        90%         |    88%    |    91%    |    91%    |   90%

Classifier with Param 1, since it has the higher average validation accuracy.
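A minimal sketch of k-fold model selection, assuming scikit-learn; the two candidate parameter settings and the toy data are illustrative:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X = np.random.randn(40, 3)         # toy features
y = (X[:, 0] > 0).astype(int)      # toy labels

kf = KFold(n_splits=4, shuffle=True, random_state=0)
for C in (0.1, 1.0):               # two candidate parameter settings
    model = LogisticRegression(C=C)
    scores = cross_val_score(model, X, y, cv=kf)  # accuracy on each validation fold
    print(f"C={C}: fold accuracies {scores}, average {scores.mean():.3f}")
# Pick the setting with the highest average validation accuracy,
# then evaluate that model once on the held-out test set.
```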

Cross Validation

Other common partitionings:

• 10-fold CV
• 5-fold CV
• 3-fold CV

We may decide on the size of the test set, for example, 15%, 20%, or 30% of the whole dataset, with the rest used for training/validation.

The test set contains the examples that the learning algorithm has never seen before, so if our model performs well on predicting the labels of the examples from the test set, we say that our model generalizes well.

Training, Validation, and Test
• Validation is, however, not always used:
– Validation is used when you need to pick parameters or models
– If you have no models or parameters to compare, you may consider partitioning the data into only training and test sets

Outline
• Dataset Partition:
– Training/Validation/Testing

• Cross Validation

• Evaluation Metrics
– Evaluating the quality of a trained classifier (these metrics can be used in both validation and testing)

Evaluation Metrics
Regression

Mean Square Error, computed over the n test samples (MSE is differentiable, so it is much easier to perform gradient descent on):

MSE = (1/n) Σ_i (y_i − ŷ_i)²

Mean Absolute Error:

MAE = (1/n) Σ_i |y_i − ŷ_i|

where y_i denotes the target output and ŷ_i denotes the predicted output for sample i.
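A minimal sketch of both metrics in NumPy; the toy arrays are illustrative:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # target outputs y_i
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # predicted outputs y_hat_i

mse = np.mean((y_true - y_pred) ** 2)      # Mean Square Error
mae = np.mean(np.abs(y_true - y_pred))     # Mean Absolute Error
print(mse, mae)                            # 0.375 0.5
```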

Evaluation Metrics
Classification

By default:
Class-1: Positive Class
Class-2: Negative Class

Confusion Matrix:

                   Class-1 (predicted)   Class-2 (predicted)
Class-1 (actual)   7 (TP)                7 (FN)
Class-2 (actual)   2 (FP)                25 (TN)

TP: True Positive
FN: False Negative (i.e., Type II Error)
FP: False Positive (i.e., Type I Error)
TN: True Negative

Evaluation Metrics
Classification

                   Class-1 (predicted)   Class-2 (predicted)
Class-1 (actual)   7 (TP)                7 (FN)
Class-2 (actual)   2 (FP)                25 (TN)

• How many samples in the dataset have the real label of Class-1?  7 + 7 = 14
• How many samples are there in total?  7 + 7 + 2 + 25 = 41
• How many samples are correctly classified?  7 + 25 = 32. How many are incorrectly classified?  7 + 2 = 9

Evaluation Metrics
Classification

Confusion Matrix for Binary Classification:

               P (predicted)   N (predicted)
P (actual)     TP              FN
N (actual)     FP              TN

Recall = TP/(TP+FN)
Precision = TP/(TP+FP)   (the fraction of predicted positives that are correct)
Accuracy = (TP+TN)/(TP+TN+FP+FN)
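A minimal sketch computing these three quantities from the counts on the earlier slide (TP = 7, FN = 7, FP = 2, TN = 25):

```python
tp, fn, fp, tn = 7, 7, 2, 25

recall = tp / (tp + fn)                      # 7/14 = 0.5
precision = tp / (tp + fp)                   # 7/9  ~= 0.778
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 32/41 ~= 0.780
print(recall, precision, accuracy)
```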

Evaluation Metrics
Classification

Cost Matrix for Binary Classification:

               P (predicted)   N (predicted)
P (actual)     c_{1,1} · TP    c_{1,2} · FN
N (actual)     c_{2,1} · FP    c_{2,2} · TN

Total cost: c_{1,1}·TP + c_{1,2}·FN + c_{2,1}·FP + c_{2,2}·TN

Main idea: assign different penalties to different entries, with higher penalties for more severe results.

Usually, c_{1,1} and c_{2,2} are set to 0; c_{2,1} and c_{1,2} may or may not be equal.
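A minimal sketch of the total-cost computation; the cost values chosen here are an illustrative assumption, not values from the slides:

```python
# Confusion-matrix counts from a classifier
tp, fn, fp, tn = 7, 7, 2, 25

# Assumed costs: correct decisions cost 0; a false negative is
# penalized more heavily than a false positive here.
c11, c12, c21, c22 = 0, 10, 1, 0

total_cost = c11 * tp + c12 * fn + c21 * fp + c22 * tn
print(total_cost)  # 10*7 + 1*2 = 72
```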
Evaluation Metrics
• Example of a cost matrix
– Assume we would like to develop a self-driving car system
– We have an ML system that detects pedestrians using a camera, by conducting a binary classification
• When it detects a person (positive class), the car should stop
• When no person is detected (negative class), the car keeps going

True Positive (cost c_{1,1}): there is a person, ML detects the person, and the car stops
True Negative (cost c_{2,2}): there is no person, and the car keeps going
False Positive (cost c_{2,1}): there is no person, ML detects a person, and the car stops
False Negative (cost c_{1,2}): there is a person, ML fails to detect the person, and the car keeps going

Question: how should c_{1,2} compare with c_{2,1} (>, <, or =)?

(Image credit: automotiveworld.com)
Evaluation Metrics
• Handling unbalanced data
– Assume we have 1000 samples, of which 10 are positive and 990 are negative

                   Class-1 (predicted)   Class-2 (predicted)
Class-1 (actual)   5 (TP)                5 (FN)
Class-2 (actual)   5 (FP)                985 (TN)

– Accuracy = 990/1000 = 0.99!
– Yet half of the Class-1 samples are classified as Class-2! Accuracy alone cannot really evaluate the outcome well.

In this case, we shall
1) Use a cost matrix, assigning a different cost to each entry
2) Use Precision and Recall, which can better evaluate the outcome! Here Precision = 0.5 and Recall = 0.5
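A minimal sketch reproducing these numbers, assuming scikit-learn's metric helpers:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 10 positive and 990 negative samples (1 = positive class)
y_true = np.array([1] * 10 + [0] * 990)
# Classifier gets 5 of the positives right, and flags 5 negatives as positive
y_pred = np.array([1] * 5 + [0] * 5 + [1] * 5 + [0] * 985)

print(accuracy_score(y_true, y_pred))   # 0.99
print(precision_score(y_true, y_pred))  # 0.5
print(recall_score(y_true, y_pred))     # 0.5
```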

Evaluation Metrics
Classification

               P (predicted)   N (predicted)
P (actual)     TP              FN
N (actual)     FP              TN

(True Positive Rate)  TPR = TP/(TP+FN)   (this is the Recall)
(False Negative Rate) FNR = FN/(TP+FN)
(True Negative Rate)  TNR = TN/(FP+TN)
(False Positive Rate) FPR = FP/(FP+TN)

TPR + FNR = 1 (together they cover 100% of the positive-class data)
TNR + FPR = 1 (together they cover 100% of the negative-class data)

Evaluation Metrics
Classification

The classifier model outputs a predicted value y_pred for each sample; a threshold t on y_pred decides the class (y_pred ≥ t: positive, y_pred < t: negative). Sorting the six samples by y_pred:

Sample   N1    N2    P1    N3   P2   P3
y_pred  -1.1  -0.5  -0.1  0.2  0.6  0.9

(P1, P2, P3 are actual positives; N1, N2, N3 are actual negatives.)

Evaluation Metrics
Classification

Sample   N1    N2    P1    N3   P2   P3
y_pred  -1.1  -0.5  -0.1  0.2  0.6  0.9

With a higher threshold t, fewer samples are predicted positive: FP goes down, but FN may go up.

Evaluation Metrics
Classification

Sample   N1    N2    P1    N3   P2   P3
y_pred  -1.1  -0.5  -0.1  0.2  0.6  0.9

With a lower threshold t, more samples are predicted positive: FN goes down, but FP may go up.
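A minimal sketch sweeping the threshold over these six scores, with the candidate thresholds chosen for illustration:

```python
import numpy as np

scores = np.array([-1.1, -0.5, -0.1, 0.2, 0.6, 0.9])  # y_pred, sorted
labels = np.array([0, 0, 1, 0, 1, 1])                  # N1 N2 P1 N3 P2 P3

for t in (-2.0, 0.0, 0.4, 1.0):                        # candidate thresholds
    pred = (scores >= t).astype(int)
    tp = np.sum((pred == 1) & (labels == 1))
    fp = np.sum((pred == 1) & (labels == 0))
    tpr = tp / np.sum(labels == 1)                     # TP/(TP+FN)
    fpr = fp / np.sum(labels == 0)                     # FP/(FP+TN)
    print(f"t={t:+.1f}: TPR={tpr:.2f}, FPR={fpr:.2f}")
# Sliding t from low to high traces out the points of the ROC curve.
```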

Evaluation Metrics
TP, FP, Recall

Three thresholds on the six-sample example give three confusion matrices. Fill in the missing entries and compute the Recall for each:

             P (pred.)  N (pred.)  |  P (pred.)  N (pred.)  |  P (pred.)  N (pred.)
P (actual)   TP = 2     FN = ?     |  TP = 2     FN = ?     |  TP = 3     FN = ?
N (actual)   FP = 1     TN = 2     |  FP = 0     TN = 3     |  FP = 1     TN = 2

Evaluation Metrics

             P (pred.)  N (pred.)  |  P (pred.)  N (pred.)  |  P (pred.)  N (pred.)
P (actual)   TP = 2     FN = 1     |  TP = 2     FN = 1     |  TP = 3     FN = 0
N (actual)   FP = 1     TN = 2     |  FP = 0     TN = 3     |  FP = 1     TN = 2

Recall = 2/3               Recall = 2/3               Recall = 3/3 = 1

Evaluation Metrics
Classification

[Figure: classifier outputs for the two classes with a threshold]

Evaluation Metrics
Classification

[Figure: the same classifier outputs with a different threshold]

Evaluation Metrics
Classification

Sliding the threshold trades one error rate off against the other.

EER: Equal Error Rate, the operating point where the two error rates (FPR and FNR) are equal.

EER: higher better or lower better?  The lower, the better.

Evaluation Metrics

ROC: widely used.

The Detection Error Trade-off (DET) curve plots FNR against FPR (also called FMR). The Receiver Operating Characteristic (ROC) curve plots TPR against FPR. Since TPR + FNR = 1, one curve can be converted into the other.

AUC: the area under the (ROC) curve.
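A minimal sketch computing ROC points and the AUC, assuming scikit-learn and reusing the six-sample example:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

labels = np.array([0, 0, 1, 0, 1, 1])                  # N1 N2 P1 N3 P2 P3
scores = np.array([-1.1, -0.5, -0.1, 0.2, 0.6, 0.9])   # classifier outputs

fpr, tpr, thresholds = roc_curve(labels, scores)
print(fpr, tpr)                        # ROC points as the threshold slides
print(roc_auc_score(labels, scores))   # area under the ROC curve, 8/9 here
```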

Evaluation Metrics

Area Under the Curve (of the ROC curve)

AUC provides an aggregate measure of performance across all possible classification thresholds. Intuitively, it can be seen as the probability that the model ranks a random positive example more highly than a random negative example, i.e., gives the positive example a higher predicted value.

AUC ranges in value from 0 (100% wrong prediction) to 1 (100% correct prediction). It is classification-threshold-invariant: it does not change as the threshold changes.

Evaluation Metrics

Area Under the Curve (ROC curve)

Consider a set of binary samples indexed by i = 1, …, m⁺ for the positive class and j = 1, …, m⁻ for the negative class.

Let g(x) be a predictor, and let d_ij = g(x_i⁺) − g(x_j⁻) be the difference in the value of the predictor between a positive-class sample x_i⁺ and a negative-class sample x_j⁻.

In the perfect case, every positive sample receives a higher value of g(x) than every negative sample.

Evaluation Metrics

Area Under the Curve (ROC curve)

Consider also a step function given by

s(d) = 1 if d > 0;  0.5 if d = 0;  0 if d < 0

The Area Under the ROC Curve (AUC) can be expressed as

AUC = (1 / (m⁺ m⁻)) Σ_{j=1}^{m⁻} Σ_{i=1}^{m⁺} s(d_ij)

AUC ranges in value from 0 (100% wrong prediction) to 1 (100% correct prediction). The AUC = 1 when all the samples are ranked correctly. It is scale-invariant and classification-threshold-invariant.

[Figure: outputs of a logistic regression model for positive (P) and negative (N) samples, ordered by predicted value from 0.0 to 1.0]

Ref: D. Bamber, The area above the ordinal dominance graph and the area below the receiver operating characteristic graph, J. Math. Psychol. 12 (1975) 387-415.
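A minimal sketch of the pairwise formula above in NumPy, reusing the six-sample scores; the result matches scikit-learn's roc_auc_score on the same data:

```python
import numpy as np

pos = np.array([-0.1, 0.6, 0.9])   # g(x) for the positive samples
neg = np.array([-1.1, -0.5, 0.2])  # g(x) for the negative samples

d = pos[:, None] - neg[None, :]    # all pairwise differences d_ij
s = np.where(d > 0, 1.0, np.where(d == 0, 0.5, 0.0))  # step function s(d)
auc = s.mean()                     # (1/(m+ * m-)) * sum of s(d_ij)
print(auc)                         # 8/9, about 0.889
```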

Evaluation Metrics
Classification

Gini coefficient:

G = A/(A+B)

where, on the ROC plot (TPR vs. FPR), A is the area between the ROC curve and the diagonal, and B is the remaining area above the ROC curve. Since A + B = 0.5 and AUC = A + 0.5:

G = 2A = 2·AUC − 1

Evaluation Metrics
Classification

Confusion Matrix for Multicategory Classification:

              C_1 (predicted)   C_2 (predicted)   …   C_k (predicted)
C_1 (actual)  n_{1,1}           n_{1,2}           …   n_{1,k}
C_2 (actual)  n_{2,1}           n_{2,2}           …   n_{2,k}
⁞             ⁞                 ⁞                     ⁞
C_k (actual)  n_{k,1}           n_{k,2}           …   n_{k,k}

Other Issues
• Computational speed and memory consumption are also important factors
– Especially for mobile or edge devices

• Other factors
– Parallelizability, modularity, maintainability
– Not the focus of this module
