
Nursen Aydin

Autumn 2020

Analytics in Practice
Model Evaluation
Learning objectives

• Understand the overfitting problem and the role of training and test sets
• Learn about various model quality measures
• Use evaluation techniques to compare models
Stages in CRISP-DM
1 Business Understanding

2 Data Understanding

3 Data Preparation

4 Modelling

5 Evaluation

6 Deployment
Overfitting
• Generalisation capability: how a model performs on unseen data
• Typically, accuracy on test data is somewhat lower than on training data.
• If it drops by a large amount, this suggests that the model overfits the training data.

Overfitting: the model is too strongly tailored to the training data and won’t work well on new data.
Overfitting examples

• Assume a leaf node in a decision tree contains only a single record.
• Are we willing to say that there is a 100% probability that the test records identified by that leaf node will have the same target value?

• All models have the tendency to overfit.


Overfitting examples

• All models have the tendency to overfit.
• There is a trade-off between complexity and overfitting.

[Figure: training error and test error as model complexity increases]
Overfitting in Linear Functions

• Complexity in mathematical models comes from the learned parameters (the weights $w_i$):

$f(x) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3$

• By adding more attributes, we can increase the accuracy on the training data, but our model can overfit (see the sketch below).

$f(x) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + w_5 x_5 + \dots + w_n x_n$
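A minimal sketch of this effect, assuming Python with NumPy and scikit-learn (not part of the original slides): the true relationship is linear, but we keep adding powers of x as extra attributes. The training fit keeps improving while the fit on held-out data stops improving or gets worse.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=2.0, size=200)   # true relation is linear

for n_features in (1, 3, 10, 15):
    # Attribute set: x, x^2, ..., x^n
    X = np.column_stack([x ** p for p in range(1, n_features + 1)])
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    print(n_features,
          round(model.score(X_train, y_train), 3),    # R^2 on training data
          round(model.score(X_test, y_test), 3))      # R^2 on unseen data
```
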
Avoiding overfitting

• Acquire more training data
• Remove any unnecessary inputs (inputs that have no relation to the target, or that have low predictor importance)
• Reduce the complexity of the model (e.g. reduce the number of leaf nodes in a decision tree; see the sketch below)

All models tend to overfit to training data!
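A minimal sketch of the last point, assuming Python with scikit-learn (the dataset is just a convenient built-in example, not from the slides): capping the number of leaf nodes reduces the tree's complexity and narrows the gap between training and test accuracy.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for max_leaves in (None, 8):   # None = fully grown tree
    tree = DecisionTreeClassifier(max_leaf_nodes=max_leaves, random_state=0)
    tree.fit(X_train, y_train)
    print(max_leaves,
          round(tree.score(X_train, y_train), 3),   # training accuracy
          round(tree.score(X_test, y_test), 3))     # test accuracy
```
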


Which model to use: Evaluating a Model

• Try several different types of models on the same problem and then compare the results to find the best model.
• What would you like to achieve by developing a predictive model?
• Evaluation metrics are the key to measuring the performance of your model when applied to a test dataset.
• Examples: accuracy, error rate, confusion matrix, ROC chart

Model evaluation is the assessment of how a model performs on unseen data.
Confusion Matrix

• Gives insight into the types of errors that are being made.
• False positives are also known as type I errors
  • “False alarms”
  • For example, we predict that the customer is going to leave, but actually they do not.
• False negatives are also known as type II errors
  • “Failed to raise the alarm”
  • For example, we predict that the customer will stay with the company, but actually they will leave.
Predicting Churn

                   Actual: Yes                  Actual: No
Predicted: Yes     True Positives (TP) = 10     False Positives (FP) = 5
Predicted: No      False Negatives (FN) = 90    True Negatives (TN) = 395
Confusion matrix

• Accuracy: percentage of correct classifications:

$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{TP + TN}{TP + FN + FP + TN} = \frac{10 + 395}{10 + 90 + 5 + 395} = 0.81$

• Accuracy is not always a useful metric
  • when one outcome is very rare
  • when errors have very different costs

(Predicting Churn confusion matrix, as above: TP = 10, FP = 5, FN = 90, TN = 395)
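A rough sketch of how such a matrix is obtained in practice, assuming Python with scikit-learn (the labels below are made up for illustration, not the churn data from the slides):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_actual    = ["Yes", "Yes", "No", "Yes", "No", "No", "No", "Yes"]   # illustrative only
y_predicted = ["Yes", "No",  "No", "Yes", "Yes", "No", "No", "No"]

# Note: scikit-learn puts the actual class on the rows and the predicted class
# on the columns (the slides show the transpose).
print(confusion_matrix(y_actual, y_predicted, labels=["Yes", "No"]))
print(accuracy_score(y_actual, y_predicted))
```
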
Confusion Matrix

• Precision: What proportion of positive identifications is actually correct?
  - The probability that a prediction of the outcome of interest is accurate.

$\text{Precision} = \frac{TP}{TP + FP} = \frac{10}{10 + 5} = 0.67$

• Recall (true positive rate, sensitivity): What proportion of actual positives is identified correctly?
  - The frequency of being correct on the actual positives.

$\text{Recall} = \frac{TP}{TP + FN} = \frac{10}{10 + 90} = 0.10$

(Predicting Churn confusion matrix, as above: TP = 10, FP = 5, FN = 90, TN = 395)
Confusion Matrix
• False positive rate (false alarm rate): the number of incorrect positive predictions (FP) divided by the total number of negatives (N).

$\text{FPR} = \frac{FP}{TN + FP} = \frac{5}{395 + 5} \approx 0.01$

• Specificity (true negative rate): the proportion of negatives that are correctly identified.

$\text{Specificity} = \frac{TN}{TN + FP} = \frac{395}{395 + 5} \approx 0.99$

(Predicting Churn confusion matrix, as above: TP = 10, FP = 5, FN = 90, TN = 395)
Confusion Matrix

• The F-measure (F1-score) considers both the precision and recall metrics.
• It is generally used when we want a balance between precision and recall.

$\text{F-measure} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = 2 \cdot \frac{0.67 \cdot 0.10}{0.67 + 0.10} = 0.17$

(Predicting Churn confusion matrix, as above: TP = 10, FP = 5, FN = 90, TN = 395)
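All of the metrics above can be reproduced from the four confusion-matrix counts. A minimal sketch in plain Python using the churn numbers from these slides:

```python
TP, FP, FN, TN = 10, 5, 90, 395

accuracy    = (TP + TN) / (TP + FP + FN + TN)                  # 0.81
precision   = TP / (TP + FP)                                   # 0.67
recall      = TP / (TP + FN)                                   # 0.10 (sensitivity / TPR)
fpr         = FP / (FP + TN)                                   # ~0.01 (false alarm rate)
specificity = TN / (TN + FP)                                   # ~0.99
f1          = 2 * precision * recall / (precision + recall)    # ~0.17

print(accuracy, precision, recall, fpr, specificity, f1)
```
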
Making predictions revisited
• Most predictive models don’t give one decision rule. They give a predicted likelihood (score) of belonging to a class.
• A threshold value is used to define which prediction scores are labelled as predicted positive vs. predicted negative.
• By default the threshold is 0.5.

List cases in order of decreasing likelihood of belonging to a class:

Employment Status   Club Card   Shopping in a month   Responder score   Prediction (threshold 0.5)
Student             Yes         2                     0.70              responder
Retired             No          3                     0.67              responder
Retired             Yes         4                     0.48              not responder
Student             No          1                     0.32              not responder
Working             …           …                     …                 …
Making predictions revisited - Example
• If the prediction score is greater than or equal to the threshold value, we predict the target value as “Yes”, otherwise “No”.
• The confusion matrix is different for different threshold values.

Actual target   Prediction score (probability)   Prediction for threshold 0.5
Yes             0.90                             Yes
Yes             0.90                             Yes
No              0.70                             Yes
Yes             0.60                             Yes
No              0.60                             Yes
Yes             0.40                             No
No              0.30                             No
No              0.20                             No

                   Actual: Yes                 Actual: No
Predicted: Yes     True Positives (TP) = 3     False Positives (FP) = 2
Predicted: No      False Negatives (FN) = 1    True Negatives (TN) = 2
Making predictions revisited - Example
• If the prediction score is greater than or equal to the threshold value, we predict the target value as “Yes”, otherwise “No”.
• The confusion matrix is different for different threshold values.

Actual target   Prediction score (probability)   Prediction for threshold 0.7
Yes             0.90                             Yes
Yes             0.90                             Yes
No              0.70                             Yes
Yes             0.60                             No
No              0.60                             No
Yes             0.40                             No
No              0.30                             No
No              0.20                             No

                   Actual: Yes                 Actual: No
Predicted: Yes     True Positives (TP) = 2     False Positives (FP) = 1
Predicted: No      False Negatives (FN) = 2    True Negatives (TN) = 3
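A minimal sketch in plain Python of the two worked examples above: the same prediction scores, labelled with two different thresholds, give two different confusion matrices.

```python
actual = ["Yes", "Yes", "No", "Yes", "No", "Yes", "No", "No"]
scores = [0.90, 0.90, 0.70, 0.60, 0.60, 0.40, 0.30, 0.20]

for threshold in (0.5, 0.7):
    predicted = ["Yes" if s >= threshold else "No" for s in scores]
    tp = sum(a == "Yes" and p == "Yes" for a, p in zip(actual, predicted))
    fp = sum(a == "No"  and p == "Yes" for a, p in zip(actual, predicted))
    fn = sum(a == "Yes" and p == "No"  for a, p in zip(actual, predicted))
    tn = sum(a == "No"  and p == "No"  for a, p in zip(actual, predicted))
    print(threshold, {"TP": tp, "FP": fp, "FN": fn, "TN": tn})
# threshold 0.5 -> TP=3, FP=2, FN=1, TN=2;  threshold 0.7 -> TP=2, FP=1, FN=2, TN=3
```
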
Receiver Operating Characteristic (ROC) Graph

• A ROC graph illustrates the relative trade-offs between true positives (benefits) and false positives (costs).
• A ROC graph plots the true positive rate (TPR) against the false positive rate (FPR) as the threshold is adjusted between 0.0 and 1.0.

$TPR = \frac{TP}{TP + FN}$        $FPR = \frac{FP}{FP + TN}$

                   Actual: Y   Actual: N
Predicted: Y       TP          FP
Predicted: N       FN          TN

[Figure: ROC curve, with False Positive Rate on the x-axis and True Positive Rate on the y-axis]
Receiver Operating Characteristic (ROC) Graph

• A ROC graph illustrates the relative trade-offs between true positives (benefits) and false positives (costs).

$TPR = \frac{TP}{TP + FN}$        $FPR = \frac{FP}{FP + TN}$

[Figure: ROC curves for Model 1, Model 2 and Model 3, plotting True Positive Rate against False Positive Rate]
Area under the ROC curve (AUC)
• A single metric obtained from the ROC chart, useful for comparing models.
• The average true positive rate across all possible false positive rates.
• The number varies between 0.0 (very bad) and 1.0 (perfect).
• A random model would have an AUC of 0.5, so we are looking for models better than 0.5.

[Figure: ROC curve with the area under it shaded, True Positive Rate vs. False Positive Rate]
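A minimal sketch, assuming Python with scikit-learn, of computing ROC points and the AUC from actual labels and prediction scores (reusing the small thresholding example from earlier; a real test set would give a much smoother curve):

```python
from sklearn.metrics import roc_auc_score, roc_curve

actual = [1, 1, 0, 1, 0, 1, 0, 0]                   # 1 = "Yes", 0 = "No"
scores = [0.90, 0.90, 0.70, 0.60, 0.60, 0.40, 0.30, 0.20]

fpr, tpr, thresholds = roc_curve(actual, scores)    # one (FPR, TPR) point per threshold
print(list(zip(thresholds, fpr, tpr)))
print("AUC =", roc_auc_score(actual, scores))
```
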
Cumulative Response (Gain) Chart

• Cumulative response charts evaluate a model’s capability to identify a particular outcome.
• Predictions are ranked by the probability of the outcome of interest (highest probability first).
• Sort the table by the probability of the outcome.

Before sorting:

Employment   Card   Shopping per month   Responder   Prediction score
Student      Yes    2                    No          0.45
Retired      No     3                    Yes         0.77
Retired      Yes    4                    No          0.37
Student      No     1                    Yes         0.48
Working      No     6                    Yes         0.82
Working      Yes    3                    Yes         0.87
Student      No     3                    No          0.52
Retired      Yes    2                    No          0.60
Working      No     1                    No          0.30
Retired      Yes    2                    No          0.57

After sorting by prediction score:

Employment   Card   Shopping per month   Responder   Prediction score
Working      Yes    3                    Yes         0.87
Working      No     6                    Yes         0.82
Retired      No     3                    Yes         0.77
Retired      Yes    2                    No          0.60
Retired      Yes    2                    No          0.57
Student      No     3                    No          0.52
Student      No     1                    Yes         0.48
Student      Yes    2                    No          0.45
Retired      Yes    4                    No          0.37
Working      No     1                    No          0.30
Cumulative Response (Gain) Chart
To find 65% of all customers who will respond, we need to consider the top 30% of customers as ranked by their predicted likelihood to respond to the marketing campaign.
Questions

• Consider the gain plot below. Which model is better?

[Figure: cumulative gain curves for two models]
Lift Chart
• Lift is a measure calculated as the ratio between the results obtained with and without the predictive model.
• The greater the area between the lift curve and the baseline, the better the model.
Lift Chart
• Example: When contacting 10% of customers, using no model we should reach 10% of responders, whereas using the given model we should reach 25% of responders. The lift at 10% is therefore 25% / 10% = 2.5.
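A minimal sketch in plain Python: lift at a given contact depth is the percentage of responders reached by the model divided by the baseline percentage reached by contacting at random.

```python
def lift(pct_responders_found, pct_customers_contacted):
    """Both arguments are percentages, e.g. read off a cumulative gain chart."""
    return pct_responders_found / pct_customers_contacted

# The example above: contacting the top 10% reaches 25% of responders.
print(lift(25, 10))   # 2.5 -> the model is 2.5x better than no model at this depth
```
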
Overfitting in Decision Trees
Training Data:

ID   Income   House   Age     Credit Card   Class
1    High     own     >= 30   yes           Low risk
2    High     own     >= 30   yes           Low risk
3    High     own     >= 30   yes           Low risk
4    High     own     < 30    no            High risk
5    Low      no      < 30    yes           High risk
6    High     no      < 30    no            High risk
7    Low      no      < 30    no            High risk

Decision tree learned from the training data:
  Income = Low  -> Class: High Risk
  Income = High -> check House
      House = No  -> Class: High Risk
      House = Own -> check Credit Card
          Credit Card = No  -> Class: High Risk
          Credit Card = Yes -> Class: Low Risk

Misclassification error rate on the training data: 0%
Overfitting in Decision Trees
Test Data:

ID   Income   House   Age     Credit Card   Class
T1   High     own     >= 30   yes           Low Risk
T2   High     no      >= 30   yes           Low Risk
T3   Low      no      >= 30   no            High Risk
T4   Low      own     < 30    yes           Low Risk

Applying the same tree to the test data:
  Income = Low  -> Class: High Risk
  Income = High -> check House
      House = No  -> Class: High Risk
      House = Own -> check Credit Card
          Credit Card = No  -> Class: High Risk
          Credit Card = Yes -> Class: Low Risk

The tree misclassifies T2 and T4, so the misclassification error rate on the test data is 50%, even though it was 0% on the training data.
Expected Value

• Weighted average of the values of the possible outcomes (the weights are the probabilities of occurrence of each class).
• Used to understand what benefit the predictive model can offer in predicting which customers will be responders versus non-responders.

$EV = p(o_1)\,v(o_1) + p(o_2)\,v(o_2) + p(o_3)\,v(o_3) + \dots$

$o_i$: a decision outcome (e.g. churn or not churn, responder or non-responder)
$p(o_i)$: probability of outcome $i$
$v(o_i)$: value of outcome $i$
Expected Value - example
• Our business aim is to target the likely responders.
• We develop a prediction model to predict whether a consumer belongs to the class of likely responders or not.
• Assume the profit obtained from a product sale is £100.
• The cost of marketing to a customer is £1.
• Should we target customers with a responder probability of 0.10?

Employment   Card   Shopping per month   Responder
Student      Yes    2                    0.20
Working      No     3                    0.02
Retired      Yes    1                    0.15
Student      No     1                    0.10
…            …      …                    …

$EV = p(o_1)\,v(o_1) + (1 - p(o_1))\,v(o_2) = 0.10 \cdot (100 - 1) + 0.90 \cdot (-1) = 9.0$
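A minimal sketch in plain Python of the targeting decision above, under the slide's assumptions (£100 profit per sale, £1 marketing cost): target a customer whenever the expected value of contacting them is positive.

```python
PROFIT_PER_SALE = 100    # £
COST_PER_CONTACT = 1     # £

def expected_value(responder_probability):
    # EV = p * (profit - cost) + (1 - p) * (-cost)
    return (responder_probability * (PROFIT_PER_SALE - COST_PER_CONTACT)
            + (1 - responder_probability) * (-COST_PER_CONTACT))

for p in (0.20, 0.02, 0.15, 0.10):          # responder scores from the table
    ev = expected_value(p)
    print(p, round(ev, 2), "target" if ev > 0 else "skip")
# p = 0.10 gives EV = 9.0, so targeting that customer is still worthwhile.
```
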
Expected Value to Compare Classifiers (Costs and Benefits)
• We can evaluate all of the predictions made by the model.
• By using the Expected Value measure, we can compute the expected profit of a model.

Confusion matrix of a classifier:

                         Actual: Respond Yes   Actual: Respond No
Predicted: Respond Yes   True Positives        False Positives       <- we will contact these customers
Predicted: Respond No    False Negatives       True Negatives
Expected Value to Compare Classifiers (Costs and Benefits)
• Assume that we build a decision tree and a logistic regression model and obtain the following confusion matrices.

Confusion matrix: Decision Tree
                         Actual: Respond Yes   Actual: Respond No
Predicted: Respond Yes   TP = 56               FP = 7
Predicted: Respond No    FN = 5                TN = 42

Confusion matrix: Logistic Regression
                         Actual: Respond Yes   Actual: Respond No
Predicted: Respond Yes   TP = 50               FP = 5
Predicted: Respond No    FN = 3                TN = 52

• Which model is the best when we consider costs and benefits?
Expected Value to Compare Classifiers (Costs and Benefits)
• True Positive (TP): a consumer who is offered the product and actually buys it.
  Profit(TP) = 100 – 1 = £99
  Probability(TP) = TP / Total Instances = 56/110 = 0.51

• False Negative (FN): a consumer who is predicted as a non-responder but who actually would have bought the product.
  Profit(FN) = £0
  Probability(FN) = FN / Total Instances = 5/110 = 0.05

• False Positive (FP): a consumer who is predicted to respond to the marketing campaign but who actually does not respond.
  Profit(FP) = –£1
  Probability(FP) = FP / Total Instances = 7/110 = 0.06

• True Negative (TN): a consumer who is predicted as a non-responder and who actually does not respond to the campaign.
  Profit(TN) = £0
  Probability(TN) = TN / Total Instances = 42/110 = 0.38

(Confusion matrix: Decision Tree, as above: TP = 56, FP = 7, FN = 5, TN = 42)
Expected Value to Compare Classifiers (Costs and Benefits)
• True Positive:  Profit(TP) = £99   Probability(TP) = 0.51
• False Negative: Profit(FN) = £0    Probability(FN) = 0.05
• False Positive: Profit(FP) = –£1   Probability(FP) = 0.06
• True Negative:  Profit(TN) = £0    Probability(TN) = 0.38

Expected value = 99*0.51 + 0*0.05 + (–1)*0.06 + 0*0.38 = 50.43

• If we use the Decision Tree, our expected profit is £50.43 per customer.

(Confusion matrix: Decision Tree, as above: TP = 56, FP = 7, FN = 5, TN = 42)
Expected Value to Compare Classifiers (Costs and Benefits)
• True Positive (TP): a consumer who is offered the product and actually buys it.
  Profit(TP) = 100 – 1 = £99
  Probability(TP) = TP / Total Instances = 50/110 = 0.45

• False Negative (FN): a consumer who is predicted as a non-responder but who actually would have bought the product.
  Profit(FN) = £0
  Probability(FN) = FN / Total Instances = 3/110 = 0.03

• False Positive (FP): a consumer who is predicted to respond to the marketing campaign but who actually does not respond.
  Profit(FP) = –£1
  Probability(FP) = FP / Total Instances = 5/110 = 0.05

• True Negative (TN): a consumer who is predicted as a non-responder and who actually does not respond to the campaign.
  Profit(TN) = £0
  Probability(TN) = TN / Total Instances = 52/110 = 0.47

(Confusion matrix: Logistic Regression, as above: TP = 50, FP = 5, FN = 3, TN = 52)
Expected Value to Compare Classifiers (Costs and Benefits)
• True Positive:  Profit(TP) = £99   Probability(TP) = 0.45
• False Negative: Profit(FN) = £0    Probability(FN) = 0.03
• False Positive: Profit(FP) = –£1   Probability(FP) = 0.05
• True Negative:  Profit(TN) = £0    Probability(TN) = 0.47

Expected value = 99*0.45 + 0*0.03 + (–1)*0.05 + 0*0.47 = 44.50

• If we use Logistic Regression, our expected profit is £44.50 per customer.

(Confusion matrix: Logistic Regression, as above: TP = 50, FP = 5, FN = 3, TN = 52)
Expected Value to Compare Classifiers (Costs and Benefits)
• Comparison of the models (see the sketch below):

Model (Classifier)   Decision Tree   Logistic Regression
Expected Value       £50.43          £44.50
Accuracy             89%             93%

• Logistic Regression has the higher accuracy, but the Decision Tree has the higher expected profit per customer.

(Confusion matrices as above: Decision Tree TP = 56, FP = 7, FN = 5, TN = 42; Logistic Regression TP = 50, FP = 5, FN = 3, TN = 52)
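A minimal sketch in plain Python that reproduces the comparison above from the two confusion matrices, using the slide's cost/benefit assumptions (£100 profit per sale, £1 cost per contact):

```python
PROFIT = {"TP": 100 - 1, "FP": -1, "FN": 0, "TN": 0}   # value of each outcome

def expected_profit(counts):
    total = sum(counts.values())
    # Probabilities are rounded to two decimals, as in the slide calculations.
    return sum(PROFIT[cell] * round(n / total, 2) for cell, n in counts.items())

decision_tree       = {"TP": 56, "FP": 7, "FN": 5, "TN": 42}
logistic_regression = {"TP": 50, "FP": 5, "FN": 3, "TN": 52}

for name, counts in [("Decision Tree", decision_tree),
                     ("Logistic Regression", logistic_regression)]:
    accuracy = (counts["TP"] + counts["TN"]) / sum(counts.values())
    print(f"{name}: expected profit £{expected_profit(counts):.2f}, accuracy {accuracy:.0%}")
# Decision Tree: £50.43, 89%   Logistic Regression: £44.50, 93%
```
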
Summary

• A model may overfit to the training data, in particular if there is little data
• It is crucial that a model is evaluated on data it has not seen before
• The accuracy of a model is not a sufficient indicator of quality
• Alternative measures include precision, recall, AUC and the gain chart
