
Nursen Aydin

Autumn 2020

Analytics in Practice
Model Evaluation
Learning objectives

• Understand the overfitting problem and the role of training and test sets
• Learn about various model quality measures
• Use evaluation techniques to compare models
Stages in CRISP-DM
1 Business Understanding

2 Data Understanding

3 Data Preparation

4 Modelling

5 Evaluation

6 Deployment
Overfitting
• Generalisation capability: how a model performs on unseen data
• Typically, accuracy on test data is somewhat lower than on training data.
• If it drops by a large amount, this suggests that the model overfits the training data.

Overfitting: the model is too strongly tailored to the training data and won’t work well on new data.
Overfitting examples

• Assume a leaf node in a decision tree contains only a single record.
• Are we willing to say that there is a 100% probability that the test records identified by that leaf node will have the same target value?

• All models have the tendency to overfit.


Overfitting examples

• All models have the tendency to overfit.
• There is a trade-off between complexity and overfitting.

[Figure: training error and test error as model complexity increases]
Overfitting in Linear Functions

• Complexity in mathematical models comes from the learned parameters (the weights $w_i$):

$f(x) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3$

• By adding more attributes, we can increase the accuracy on the training data, but our model can overfit (see the sketch below).

$f(x) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + w_5 x_5 + \dots + w_n x_n$
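A minimal sketch of this effect, assuming Python with NumPy and scikit-learn (not part of the original slides): the true relationship is linear, but we keep adding powers of x as extra attributes. The training fit keeps improving while the fit on held-out data stops improving or gets worse.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=2.0, size=200)   # true relation is linear

for n_features in (1, 3, 10, 15):
    # Attribute set: x, x^2, ..., x^n
    X = np.column_stack([x ** p for p in range(1, n_features + 1)])
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    print(n_features,
          round(model.score(X_train, y_train), 3),    # R^2 on training data
          round(model.score(X_test, y_test), 3))      # R^2 on unseen data
```
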
Avoiding overfitting

• Acquire more training data
• Remove any unnecessary inputs (inputs that have no relation to the target, or that have low predictor importance)
• Reduce the complexity of the model (e.g. reduce the number of leaf nodes in a decision tree; see the sketch below)

All models tend to overfit to training data!
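A minimal sketch of the last point, assuming Python with scikit-learn (the dataset is just a convenient built-in example, not from the slides): capping the number of leaf nodes reduces the tree's complexity and narrows the gap between training and test accuracy.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for max_leaves in (None, 8):   # None = fully grown tree
    tree = DecisionTreeClassifier(max_leaf_nodes=max_leaves, random_state=0)
    tree.fit(X_train, y_train)
    print(max_leaves,
          round(tree.score(X_train, y_train), 3),   # training accuracy
          round(tree.score(X_test, y_test), 3))     # test accuracy
```
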


Which model to use: Evaluating a Model

• Try several different types of models on the same problem and then compare the results to find the best model.
• What would you like to achieve by developing a predictive model?
• Evaluation metrics are the key to measuring the performance of your model when applied to a test dataset.
• Examples: accuracy, error rate, confusion matrix, ROC chart

Model evaluation is the assessment of how a model performs on unseen data.
Confusion Matrix

• Gives insight into the types of errors that are being made.
• False positives are also known as type I errors
  • “False alarms”
  • For example, we predict that the customer is going to leave, but actually they do not.
• False negatives are also known as type II errors
  • “Failed to raise the alarm”
  • For example, we predict that the customer will stay with the company, but actually they will leave.
Predicting Churn

                   Actual: Yes                  Actual: No
Predicted: Yes     True Positives (TP) = 10     False Positives (FP) = 5
Predicted: No      False Negatives (FN) = 90    True Negatives (TN) = 395
Confusion matrix

• Accuracy: percentage of correct classifications:

$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{TP + TN}{TP + FN + FP + TN} = \frac{10 + 395}{10 + 90 + 5 + 395} = 0.81$

• Accuracy is not always a useful metric
  • when one outcome is very rare
  • when errors have very different costs

(Predicting Churn confusion matrix, as above: TP = 10, FP = 5, FN = 90, TN = 395)
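A rough sketch of how such a matrix is obtained in practice, assuming Python with scikit-learn (the labels below are made up for illustration, not the churn data from the slides):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_actual    = ["Yes", "Yes", "No", "Yes", "No", "No", "No", "Yes"]   # illustrative only
y_predicted = ["Yes", "No",  "No", "Yes", "Yes", "No", "No", "No"]

# Note: scikit-learn puts the actual class on the rows and the predicted class
# on the columns (the slides show the transpose).
print(confusion_matrix(y_actual, y_predicted, labels=["Yes", "No"]))
print(accuracy_score(y_actual, y_predicted))
```
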
Confusion Matrix

• Precision: What proportion of positive identifications is actually correct?
  - The probability that a prediction of the outcome of interest is accurate.

$\text{Precision} = \frac{TP}{TP + FP} = \frac{10}{10 + 5} = 0.67$

• Recall (true positive rate, sensitivity): What proportion of actual positives is identified correctly?
  - The frequency of being correct on the actual positives.

$\text{Recall} = \frac{TP}{TP + FN} = \frac{10}{10 + 90} = 0.10$

(Predicting Churn confusion matrix, as above: TP = 10, FP = 5, FN = 90, TN = 395)
Confusion Matrix
• False positive rate (false alarm rate): the number of incorrect positive predictions (FP) divided by the total number of negatives (N).

$\text{FPR} = \frac{FP}{TN + FP} = \frac{5}{395 + 5} \approx 0.01$

• Specificity (true negative rate): the proportion of negatives that are correctly identified.

$\text{Specificity} = \frac{TN}{TN + FP} = \frac{395}{395 + 5} \approx 0.99$

(Predicting Churn confusion matrix, as above: TP = 10, FP = 5, FN = 90, TN = 395)
Confusion Matrix

• The F-measure (F1-score) considers both the precision and recall metrics.
• It is generally used when we want a balance between precision and recall.

$\text{F-measure} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = 2 \cdot \frac{0.67 \cdot 0.10}{0.67 + 0.10} = 0.17$

(Predicting Churn confusion matrix, as above: TP = 10, FP = 5, FN = 90, TN = 395)
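All of the metrics above can be reproduced from the four confusion-matrix counts. A minimal sketch in plain Python using the churn numbers from these slides:

```python
TP, FP, FN, TN = 10, 5, 90, 395

accuracy    = (TP + TN) / (TP + FP + FN + TN)                  # 0.81
precision   = TP / (TP + FP)                                   # 0.67
recall      = TP / (TP + FN)                                   # 0.10 (sensitivity / TPR)
fpr         = FP / (FP + TN)                                   # ~0.01 (false alarm rate)
specificity = TN / (TN + FP)                                   # ~0.99
f1          = 2 * precision * recall / (precision + recall)    # ~0.17

print(accuracy, precision, recall, fpr, specificity, f1)
```
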
Making predictions revisited
• Most predictive models don’t give one decision rule. They give a predicted likelihood (score) of belonging to a class.
• A threshold value is used to define which prediction scores are labelled as predicted positive vs. predicted negative.
• By default the threshold is 0.5.

List cases in order of decreasing likelihood of belonging to a class:

Employment Status   Club Card   Shopping in a month   Responder score   Prediction (threshold 0.5)
Student             Yes         2                     0.70              responder
Retired             No          3                     0.67              responder
Retired             Yes         4                     0.48              not responder
Student             No          1                     0.32              not responder
Working             …           …                     …                 …
Making predictions revisited - Example
• If the prediction score is greater than or equal to the threshold value, we predict the target value as “Yes”, otherwise “No”.
• The confusion matrix is different for different threshold values.

Actual target   Prediction score (probability)   Prediction for threshold 0.5
Yes             0.90                             Yes
Yes             0.90                             Yes
No              0.70                             Yes
Yes             0.60                             Yes
No              0.60                             Yes
Yes             0.40                             No
No              0.30                             No
No              0.20                             No

                   Actual: Yes                 Actual: No
Predicted: Yes     True Positives (TP) = 3     False Positives (FP) = 2
Predicted: No      False Negatives (FN) = 1    True Negatives (TN) = 2
Making predictions revisited - Example
• If the prediction score is greater than or equal to the threshold value, we predict the target value as “Yes”, otherwise “No”.
• The confusion matrix is different for different threshold values.

Actual target   Prediction score (probability)   Prediction for threshold 0.7
Yes             0.90                             Yes
Yes             0.90                             Yes
No              0.70                             Yes
Yes             0.60                             No
No              0.60                             No
Yes             0.40                             No
No              0.30                             No
No              0.20                             No

                   Actual: Yes                 Actual: No
Predicted: Yes     True Positives (TP) = 2     False Positives (FP) = 1
Predicted: No      False Negatives (FN) = 2    True Negatives (TN) = 3
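A minimal sketch in plain Python of the two worked examples above: the same prediction scores, labelled with two different thresholds, give two different confusion matrices.

```python
actual = ["Yes", "Yes", "No", "Yes", "No", "Yes", "No", "No"]
scores = [0.90, 0.90, 0.70, 0.60, 0.60, 0.40, 0.30, 0.20]

for threshold in (0.5, 0.7):
    predicted = ["Yes" if s >= threshold else "No" for s in scores]
    tp = sum(a == "Yes" and p == "Yes" for a, p in zip(actual, predicted))
    fp = sum(a == "No"  and p == "Yes" for a, p in zip(actual, predicted))
    fn = sum(a == "Yes" and p == "No"  for a, p in zip(actual, predicted))
    tn = sum(a == "No"  and p == "No"  for a, p in zip(actual, predicted))
    print(threshold, {"TP": tp, "FP": fp, "FN": fn, "TN": tn})
# threshold 0.5 -> TP=3, FP=2, FN=1, TN=2;  threshold 0.7 -> TP=2, FP=1, FN=2, TN=3
```
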
Receiver Operating Characteristic (ROC) Graph

• A ROC graph illustrates the relative trade-offs between true positives (benefits) and false positives (costs).
• A ROC graph plots the true positive rate (TPR) against the false positive rate (FPR) as the threshold is adjusted between 0.0 and 1.0.

$TPR = \frac{TP}{TP + FN}$        $FPR = \frac{FP}{FP + TN}$

                   Actual: Y   Actual: N
Predicted: Y       TP          FP
Predicted: N       FN          TN

[Figure: ROC curve, with False Positive Rate on the x-axis and True Positive Rate on the y-axis]
Receiver Operating Characteristic (ROC) Graph

• A ROC graph illustrates the relative trade-offs between true positives (benefits) and false positives (costs).

$TPR = \frac{TP}{TP + FN}$        $FPR = \frac{FP}{FP + TN}$

[Figure: ROC curves for Model 1, Model 2 and Model 3, plotting True Positive Rate against False Positive Rate]
Area under the ROC curve (AUC)
• A single metric obtained from the ROC chart, useful for comparing models.
• The average true positive rate across all possible false positive rates.
• The number varies between 0.0 (very bad) and 1.0 (perfect).
• A random model would have an AUC of 0.5, so we are looking for models better than 0.5.

[Figure: ROC curve with the area under it shaded, True Positive Rate vs. False Positive Rate]
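A minimal sketch, assuming Python with scikit-learn, of computing ROC points and the AUC from actual labels and prediction scores (reusing the small thresholding example from earlier; a real test set would give a much smoother curve):

```python
from sklearn.metrics import roc_auc_score, roc_curve

actual = [1, 1, 0, 1, 0, 1, 0, 0]                   # 1 = "Yes", 0 = "No"
scores = [0.90, 0.90, 0.70, 0.60, 0.60, 0.40, 0.30, 0.20]

fpr, tpr, thresholds = roc_curve(actual, scores)    # one (FPR, TPR) point per threshold
print(list(zip(thresholds, fpr, tpr)))
print("AUC =", roc_auc_score(actual, scores))
```
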
Cumulative Response (Gain) Chart

• Cumulative response charts evaluate a model’s capability to identify a particular outcome.
• Predictions are ranked by the probability of the outcome of interest (highest probability first).
• Sort the table by the probability of the outcome.

Before sorting:

Employment   Card   Shopping per month   Responder   Prediction score
Student      Yes    2                    No          0.45
Retired      No     3                    Yes         0.77
Retired      Yes    4                    No          0.37
Student      No     1                    Yes         0.48
Working      No     6                    Yes         0.82
Working      Yes    3                    Yes         0.87
Student      No     3                    No          0.52
Retired      Yes    2                    No          0.60
Working      No     1                    No          0.30
Retired      Yes    2                    No          0.57

After sorting by prediction score:

Employment   Card   Shopping per month   Responder   Prediction score
Working      Yes    3                    Yes         0.87
Working      No     6                    Yes         0.82
Retired      No     3                    Yes         0.77
Retired      Yes    2                    No          0.60
Retired      Yes    2                    No          0.57
Student      No     3                    No          0.52
Student      No     1                    Yes         0.48
Student      Yes    2                    No          0.45
Retired      Yes    4                    No          0.37
Working      No     1                    No          0.30
Cumulative Response (Gain) Chart
To find 65% of all customers who will respond, we need to consider the top 30% of customers as ranked by their predicted likelihood to respond to the marketing campaign.
Questions

• Consider the gain plot below. Which model is better?

[Figure: cumulative gain curves for two models]
Lift Chart
• Lift is a measure calculated as the ratio between the results obtained with and without the predictive model.
• The greater the area between the lift curve and the baseline, the better the model.
Lift Chart
• Example: When contacting 10% of customers, using no model we should reach 10% of responders, whereas using the given model we should reach 25% of responders. The lift at 10% is therefore 25% / 10% = 2.5.
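A minimal sketch in plain Python: lift at a given contact depth is the percentage of responders reached by the model divided by the baseline percentage reached by contacting at random.

```python
def lift(pct_responders_found, pct_customers_contacted):
    """Both arguments are percentages, e.g. read off a cumulative gain chart."""
    return pct_responders_found / pct_customers_contacted

# The example above: contacting the top 10% reaches 25% of responders.
print(lift(25, 10))   # 2.5 -> the model is 2.5x better than no model at this depth
```
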
Overfitting in Decision Trees
Training Data:

ID   Income   House   Age     Credit Card   Class
1    High     own     >= 30   yes           Low risk
2    High     own     >= 30   yes           Low risk
3    High     own     >= 30   yes           Low risk
4    High     own     < 30    no            High risk
5    Low      no      < 30    yes           High risk
6    High     no      < 30    no            High risk
7    Low      no      < 30    no            High risk

Decision tree learned from the training data:
  Income = Low  -> Class: High Risk
  Income = High -> check House
      House = No  -> Class: High Risk
      House = Own -> check Credit Card
          Credit Card = No  -> Class: High Risk
          Credit Card = Yes -> Class: Low Risk

Misclassification error rate on the training data: 0%
Overfitting in Decision Trees
Test Data:

ID   Income   House   Age     Credit Card   Class
T1   High     own     >= 30   yes           Low Risk
T2   High     no      >= 30   yes           Low Risk
T3   Low      no      >= 30   no            High Risk
T4   Low      own     < 30    yes           Low Risk

Applying the same tree to the test data:
  Income = Low  -> Class: High Risk
  Income = High -> check House
      House = No  -> Class: High Risk
      House = Own -> check Credit Card
          Credit Card = No  -> Class: High Risk
          Credit Card = Yes -> Class: Low Risk

The tree misclassifies T2 and T4, so the misclassification error rate on the test data is 50%, even though it was 0% on the training data.
Expected Value

• Weighted average of the values of the possible outcomes (the weights are the probabilities of occurrence of each class).
• Used to understand what benefit the predictive model can offer in predicting which customers will be responders versus non-responders.

$EV = p(o_1)\,v(o_1) + p(o_2)\,v(o_2) + p(o_3)\,v(o_3) + \dots$

$o_i$: a decision outcome (e.g. churn or not churn, responder or non-responder)
$p(o_i)$: probability of outcome $i$
$v(o_i)$: value of outcome $i$
Expected Value - example
• Our business aim is to target the likely responders.
• We develop a prediction model to predict whether a consumer belongs to the class of likely responders or not.
• Assume the profit obtained from a product sale is £100.
• The cost of marketing to a customer is £1.
• Should we target customers with a responder probability of 0.10?

Employment   Card   Shopping per month   Responder
Student      Yes    2                    0.20
Working      No     3                    0.02
Retired      Yes    1                    0.15
Student      No     1                    0.10
…            …      …                    …

$EV = p(o_1)\,v(o_1) + (1 - p(o_1))\,v(o_2) = 0.10 \cdot (100 - 1) + 0.90 \cdot (-1) = 9.0$
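A minimal sketch in plain Python of the targeting decision above, under the slide's assumptions (£100 profit per sale, £1 marketing cost): target a customer whenever the expected value of contacting them is positive.

```python
PROFIT_PER_SALE = 100    # £
COST_PER_CONTACT = 1     # £

def expected_value(responder_probability):
    # EV = p * (profit - cost) + (1 - p) * (-cost)
    return (responder_probability * (PROFIT_PER_SALE - COST_PER_CONTACT)
            + (1 - responder_probability) * (-COST_PER_CONTACT))

for p in (0.20, 0.02, 0.15, 0.10):          # responder scores from the table
    ev = expected_value(p)
    print(p, round(ev, 2), "target" if ev > 0 else "skip")
# p = 0.10 gives EV = 9.0, so targeting that customer is still worthwhile.
```
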
Expected Value to Compare Classifiers (Costs and Benefits)
• We can evaluate all of the predictions made by the model.
• By using the Expected Value measure, we can compute the expected profit of a model.

Confusion matrix of a classifier:

                         Actual: Respond Yes   Actual: Respond No
Predicted: Respond Yes   True Positives        False Positives       <- we will contact these customers
Predicted: Respond No    False Negatives       True Negatives
Expected Value to Compare Classifiers (Costs and Benefits)
• Assume that we build a decision tree and a logistic regression model and obtain the following confusion matrices.

Confusion matrix: Decision Tree
                         Actual: Respond Yes   Actual: Respond No
Predicted: Respond Yes   TP = 56               FP = 7
Predicted: Respond No    FN = 5                TN = 42

Confusion matrix: Logistic Regression
                         Actual: Respond Yes   Actual: Respond No
Predicted: Respond Yes   TP = 50               FP = 5
Predicted: Respond No    FN = 3                TN = 52

• Which model is the best when we consider costs and benefits?
Expected Value to Compare Classifiers (Costs and Benefits)
• True Positive (TP): a consumer who is offered the product and actually buys it.
  Profit(TP) = 100 – 1 = £99
  Probability(TP) = TP / Total Instances = 56/110 = 0.51

• False Negative (FN): a consumer who is predicted as a non-responder but who actually would have bought the product.
  Profit(FN) = £0
  Probability(FN) = FN / Total Instances = 5/110 = 0.05

• False Positive (FP): a consumer who is predicted to respond to the marketing campaign but who actually does not respond.
  Profit(FP) = –£1
  Probability(FP) = FP / Total Instances = 7/110 = 0.06

• True Negative (TN): a consumer who is predicted as a non-responder and who actually does not respond to the campaign.
  Profit(TN) = £0
  Probability(TN) = TN / Total Instances = 42/110 = 0.38

(Confusion matrix: Decision Tree, as above: TP = 56, FP = 7, FN = 5, TN = 42)
Expected Value to Compare Classifiers (Costs and Benefits)
• True Positive:  Profit(TP) = £99   Probability(TP) = 0.51
• False Negative: Profit(FN) = £0    Probability(FN) = 0.05
• False Positive: Profit(FP) = –£1   Probability(FP) = 0.06
• True Negative:  Profit(TN) = £0    Probability(TN) = 0.38

Expected value = 99*0.51 + 0*0.05 + (–1)*0.06 + 0*0.38 = 50.43

• If we use the Decision Tree, our expected profit is £50.43 per customer.

(Confusion matrix: Decision Tree, as above: TP = 56, FP = 7, FN = 5, TN = 42)
Expected Value to Compare Classifiers (Costs and Benefits)
• True Positive (TP): a consumer who is offered the product and actually buys it.
  Profit(TP) = 100 – 1 = £99
  Probability(TP) = TP / Total Instances = 50/110 = 0.45

• False Negative (FN): a consumer who is predicted as a non-responder but who actually would have bought the product.
  Profit(FN) = £0
  Probability(FN) = FN / Total Instances = 3/110 = 0.03

• False Positive (FP): a consumer who is predicted to respond to the marketing campaign but who actually does not respond.
  Profit(FP) = –£1
  Probability(FP) = FP / Total Instances = 5/110 = 0.05

• True Negative (TN): a consumer who is predicted as a non-responder and who actually does not respond to the campaign.
  Profit(TN) = £0
  Probability(TN) = TN / Total Instances = 52/110 = 0.47

(Confusion matrix: Logistic Regression, as above: TP = 50, FP = 5, FN = 3, TN = 52)
Expected Value to Compare Classifiers (Costs and Benefits)
• True Positive:  Profit(TP) = £99   Probability(TP) = 0.45
• False Negative: Profit(FN) = £0    Probability(FN) = 0.03
• False Positive: Profit(FP) = –£1   Probability(FP) = 0.05
• True Negative:  Profit(TN) = £0    Probability(TN) = 0.47

Expected value = 99*0.45 + 0*0.03 + (–1)*0.05 + 0*0.47 = 44.50

• If we use Logistic Regression, our expected profit is £44.50 per customer.

(Confusion matrix: Logistic Regression, as above: TP = 50, FP = 5, FN = 3, TN = 52)
Expected Value to Compare Classifiers (Costs and Benefits)
• Comparison of the models (see the sketch below):

Model (Classifier)   Decision Tree   Logistic Regression
Expected Value       £50.43          £44.50
Accuracy             89%             93%

• Logistic Regression has the higher accuracy, but the Decision Tree has the higher expected profit per customer.

(Confusion matrices as above: Decision Tree TP = 56, FP = 7, FN = 5, TN = 42; Logistic Regression TP = 50, FP = 5, FN = 3, TN = 52)
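A minimal sketch in plain Python that reproduces the comparison above from the two confusion matrices, using the slide's cost/benefit assumptions (£100 profit per sale, £1 cost per contact):

```python
PROFIT = {"TP": 100 - 1, "FP": -1, "FN": 0, "TN": 0}   # value of each outcome

def expected_profit(counts):
    total = sum(counts.values())
    # Probabilities are rounded to two decimals, as in the slide calculations.
    return sum(PROFIT[cell] * round(n / total, 2) for cell, n in counts.items())

decision_tree       = {"TP": 56, "FP": 7, "FN": 5, "TN": 42}
logistic_regression = {"TP": 50, "FP": 5, "FN": 3, "TN": 52}

for name, counts in [("Decision Tree", decision_tree),
                     ("Logistic Regression", logistic_regression)]:
    accuracy = (counts["TP"] + counts["TN"]) / sum(counts.values())
    print(f"{name}: expected profit £{expected_profit(counts):.2f}, accuracy {accuracy:.0%}")
# Decision Tree: £50.43, 89%   Logistic Regression: £44.50, 93%
```
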
Summary

• A model may overfit to the training data, in particular if there is little data
• It is crucial that a model is evaluated on data it has not seen before
• The accuracy of a model is not a sufficient indicator of quality
• Alternative measures include precision, recall, AUC and the gain chart
