
Evaluation

These slides were assembled by Byron Boots, with only minor modifications from Eric Eaton’s slides and
grateful acknowledgement to the many others who made their course materials freely available online. Feel
free to reuse or adapt these slides for your own academic purposes, provided that you include proper
attribution.
Robot Image Credit: Viktoriya Sukhanova © 123RF.com
Stages of (Batch) Machine Learning
Given: labeled training data X, Y = {⟨x_i, y_i⟩}, i = 1, ..., n
• Assumes each x_i ~ D(X) with y_i = f_target(x_i)

Train the model:
  model ← classifier.train(X, Y)
  [Diagram: X, Y → learner → model]

Apply the model to new data:
• Given: new unlabeled instance x ~ D(X)
  y_prediction ← model.predict(x)
  [Diagram: x → model → y_prediction]
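To make these two stages concrete, here is a minimal sketch in Python; the scikit-learn decision tree and the iris dataset are illustrative stand-ins, not something prescribed by the slides.

```python
# Minimal sketch of the two batch-ML stages; the decision-tree learner and
# the iris dataset are illustrative stand-ins, not prescribed by the slides.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, Y = load_iris(return_X_y=True)            # labeled training data {<x_i, y_i>}

# Train the model:  model <- classifier.train(X, Y)
model = DecisionTreeClassifier().fit(X, Y)

# Apply the model to new data:  y_prediction <- model.predict(x)
x_new = X[:1]                                # stand-in for a new unlabeled instance
y_prediction = model.predict(x_new)
print(y_prediction)
```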
Classification Metrics

accuracy = (# correct predictions) / (# test instances)

error = 1 − accuracy = (# incorrect predictions) / (# test instances)
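A quick sketch of computing both metrics, assuming y_true holds the test labels and y_pred the model's predictions (the values below are made up for illustration):

```python
# Accuracy and error on a test set; the labels below are hypothetical.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])     # test labels
y_pred = np.array([1, 0, 0, 1, 0, 1])     # model predictions on the test set

accuracy = np.mean(y_pred == y_true)      # # correct predictions / # test instances
error = 1 - accuracy                      # # incorrect predictions / # test instances
print(accuracy, error)                    # 0.833..., 0.166...
```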
Confusion Matrix
• Given a dataset of P positive instances and N negative instances:

                          Predicted Class
                          Yes    No
   Actual Class    Yes    TP     FN
                   No     FP     TN

   accuracy = (TP + TN) / (P + N)

• Imagine using the classifier to identify positive cases (i.e., for information retrieval):

   precision = TP / (TP + FP)          recall = TP / (TP + FN)

   Precision: probability that a randomly selected result is relevant
   Recall: probability that a randomly selected relevant document is retrieved
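A possible sketch of these quantities in Python; the labels are hypothetical, the positive class is taken to be 1, and scikit-learn's confusion_matrix is used only as a convenience:

```python
# Confusion-matrix quantities for a binary problem (positive class = 1);
# the labels below are hypothetical.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]    # actual classes
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]    # predicted classes

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # (TP + TN) / (P + N)
precision = tp / (tp + fp)                    # P(randomly selected result is relevant)
recall    = tp / (tp + fn)                    # P(randomly selected relevant doc is retrieved)
print(tp, fn, fp, tn, precision, recall, accuracy)
```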
Training Data and Test Data
• Training data: data used to build the model
• Test data: new data, not used in the training process

• Training performance is often a poor indicator of generalization performance
– Generalization is what we really care about in ML
– Easy to overfit the training data
– Performance on test data is a good indicator of
generalization performance
– i.e., test accuracy is more important than training accuracy

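One way this is typically done in practice, sketched with scikit-learn; the dataset, learner, and 70/30 split are illustrative assumptions. The unconstrained tree fits the training data almost perfectly, so the held-out test accuracy is the more trustworthy estimate.

```python
# Hold out part of the data as a test set to estimate generalization;
# the dataset, learner, and 70/30 split are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, Y = load_breast_cancer(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3,
                                                    random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, Y_train)

print("training accuracy:", model.score(X_train, Y_train))   # typically ~1.0 (overfit)
print("test accuracy:    ", model.score(X_test, Y_test))     # better generalization estimate
```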
Simple Decision Boundary
[Figure: two-class data in a two-dimensional feature space (Feature 1 vs. Feature 2), with a simple decision boundary separating Decision Region 1 from Decision Region 2]
Slide by Padhraic Smyth, UCIrvine
More Complex Decision Boundary
[Figure: the same two-class data in a two-dimensional feature space (Feature 1 vs. Feature 2), with a more complex decision boundary separating Decision Region 1 from Decision Region 2]
Slide by Padhraic Smyth, UCIrvine
Example: The Overfitting Phenomenon

Slide by Padhraic Smyth, UCIrvine
A Complex Model
Y = high-order polynomial in X

Slide by Padhraic Smyth, UCIrvine
The True (simpler) Model
Y = a X + b + noise

Slide by Padhraic Smyth, UCIrvine
How Overfitting Affects Prediction

[Figure: Predictive Error vs. Model Complexity, showing the Error on Training Data]
Slide by Padhraic Smyth, UCIrvine
How Overfitting Affects Prediction

[Figure: Predictive Error vs. Model Complexity, showing the Error on Training Data and the Error on Test Data]
Slide by Padhraic Smyth, UCIrvine
How Overfitting Affects Prediction

[Figure: Predictive Error vs. Model Complexity, showing the Error on Training Data and the Error on Test Data, with the Underfitting and Overfitting regimes and the Ideal Range for Model Complexity marked]
Slide by Padhraic Smyth, UCIrvine
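The build-up above can be reproduced with a small experiment: generate data from the true model Y = aX + b + noise, then fit a simple and a high-order polynomial and compare training and test error. The constants, sample sizes, and degrees below are illustrative assumptions.

```python
# Fit a simple (degree-1) and a high-order polynomial to data generated from
# the true model Y = a*X + b + noise; all constants here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, 1.0
x_train = rng.uniform(0, 1, 20)
y_train = a * x_train + b + rng.normal(0, 0.3, x_train.size)
x_test = rng.uniform(0, 1, 200)
y_test = a * x_test + b + rng.normal(0, 0.3, x_test.size)

for degree in (1, 9):                       # simple model vs. high-order polynomial
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # The high-order fit typically lowers training error but raises test error.
    print(degree, round(train_err, 3), round(test_err, 3))
```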
Comparing Classifiers
Say we have two classifiers, C1 and C2, and want to
choose the best one to use for future predictions

Can we use training accuracy to choose between them?


• No!
– e.g., C1 = pruned decision tree, C2 = 1-NN
training_accuracy(1-NN) = 100%, but may not be best

Instead, choose based on test accuracy...

Based on Slide by Padhraic Smyth, UCIrvine
Training and Test Data

[Diagram: the Full Data Set is split into Training Data and “Test” Data (validation set)]

Idea: Train each model on the “training data”... and then test each model’s accuracy on the simulated test data, called the “validation set.”
k-Fold Cross-Validation
• Why just choose one particular split of the data?
– In principle, we should do this multiple times since
performance may be different for each split

• k-Fold Cross-Validation (e.g., k=10)


– randomly partition full data set of n instances into
k disjoint subsets (each roughly of size n/k)
– Choose each fold in turn as the test set; train model
on the other folds and evaluate
– Compute statistics over k test performances, or
choose best of the k models
– Can also do “leave-one-out CV” where k = n
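A minimal sketch of 10-fold CV with scikit-learn; the learner and dataset are illustrative assumptions (leave-one-out CV would correspond to setting the number of folds to n):

```python
# 10-fold cross-validation; learner and dataset are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, Y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# Randomly partition the n instances into k = 10 disjoint folds; each fold is
# used once as the test set while the model is trained on the remaining folds.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, Y, cv=cv)     # k test performances

# Compute statistics over the k test performances.
print(scores.mean(), scores.std())
```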
Example 3-Fold CV

[Diagram: the Full Data Set is split into k folds. In the 1st partition the first fold is the Test Data and the remaining folds are the Training Data; in the 2nd partition the second fold is the Test Data; ...; in the kth partition the last fold is the Test Data. Each partition yields a Test Performance, and summary statistics are computed over the k test performances.]
More on Cross-Validation
• Cross-validation generates an approximate estimate
of how well the classifier will do on “unseen” data
– As k → n, the model becomes more accurate
(more training data)
– ...but, CV becomes more computationally expensive
– Choosing k < n is a compromise

• Averaging over different partitions is more robust than just a single train/validate partition of the data

• It is an even better idea to do CV repeatedly!
Multiple Trials of k-Fold CV
1.) Loop for t trials:
    a.) Randomize the full data set (shuffle)
    b.) Perform k-fold CV on the shuffled data
        [Diagram: the shuffled Full Data Set is partitioned into k folds; each of the 1st, 2nd, ..., kth partitions uses a different fold as Test Data and yields a Test Performance]

2.) Compute statistics over the t × k test performances
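A sketch of this multiple-trials procedure using scikit-learn's RepeatedKFold, which reshuffles and re-partitions the data on each repeat; the learner, dataset, and the choice of t = 5, k = 10 are illustrative assumptions.

```python
# t trials of k-fold CV via RepeatedKFold, which reshuffles and re-partitions
# the data on each repeat; learner, dataset, t, and k are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, Y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

t, k = 5, 10
cv = RepeatedKFold(n_splits=k, n_repeats=t, random_state=0)
scores = cross_val_score(model, X, Y, cv=cv)     # t * k test performances

print(scores.shape)                              # (50,)
print(scores.mean(), scores.std())               # statistics over the t x k performances
```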
Comparing Multiple Classifiers
1.) Loop for t trials:
    a.) Randomize the full data set (shuffle)
    b.) Perform k-fold CV, testing each candidate learner on the same training/testing splits
        [Diagram: each of the 1st, 2nd, ..., kth partitions yields a Test Performance for C1 and a Test Performance for C2]

2.) Compute statistics over the t × k test performances
    – Allows us to do paired summary statistics (e.g., paired t-test)
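A sketch of such a paired comparison: both candidate learners are scored on identical CV splits, and a paired t-test is applied to the matched test performances. The two classifiers, the dataset, and the trial counts are illustrative assumptions.

```python
# Compare two classifiers C1 and C2 on identical CV splits and apply a
# paired t-test; the learners, dataset, and trial counts are illustrative.
from scipy.stats import ttest_rel
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, Y = load_breast_cancer(return_X_y=True)
C1 = DecisionTreeClassifier(random_state=0)      # e.g., a decision tree
C2 = KNeighborsClassifier(n_neighbors=1)         # e.g., 1-NN

# A fixed random_state makes the cv object generate the same splits for both
# classifiers, so the two score arrays are paired fold-by-fold.
cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=0)
scores1 = cross_val_score(C1, X, Y, cv=cv)
scores2 = cross_val_score(C2, X, Y, cv=cv)

t_stat, p_value = ttest_rel(scores1, scores2)    # paired t-test over t x k scores
print(scores1.mean(), scores2.mean(), p_value)
```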
Learning Curve
• Shows performance versus the # training examples
– Compute over a single training/testing split
– Then, average across multiple trials of CV

Building Learning Curves
1.) Loop for t trials:
    a.) Randomize the full data set (shuffle)
    b.) Perform k-fold CV, computing a learning curve over each training/testing split
        [Diagram: each of the 1st, 2nd, ..., kth partitions yields a learning Curve for C1 and a Curve for C2]

2.) Compute statistics over the t × k learning curves
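One way to sketch this with scikit-learn's learning_curve helper, which evaluates the model at increasing training-set sizes over a set of CV splits and lets you average the resulting curves; the learner and dataset are illustrative assumptions.

```python
# Learning curve: performance vs. number of training examples, averaged over
# several random training/testing splits; learner and dataset are illustrative.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import ShuffleSplit, learning_curve
from sklearn.tree import DecisionTreeClassifier

X, Y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)   # t random splits
train_sizes, train_scores, test_scores = learning_curve(
    model, X, Y, cv=cv, train_sizes=np.linspace(0.1, 1.0, 5))

# Average each point of the curve across the splits.
for n, tr, te in zip(train_sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(n, round(tr, 3), round(te, 3))
```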
