PART 2
Looking at Different Performance Evaluation Metrics
Confusion Matrix
• A confusion matrix is a matrix that lays out the performance of a learning algorithm.
• A confusion matrix is an N × N matrix, where N is the number of classes being predicted. For the problem at hand, we have N = 2, and hence we get a 2 × 2 matrix.
• The confusion matrix is simply a square matrix that reports the counts of the True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) predictions of a classifier, as shown in the following figure:
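The four counts can be tallied directly from paired labels. A minimal sketch, using made-up labels where 1 is the positive class:

```python
# Counting TP, TN, FP, FN for a binary classifier.
# The labels below are hypothetical illustrative data.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

print(tp, tn, fp, fn)  # 3 3 1 1
```

Every metric in the rest of this part (accuracy, precision, recall, specificity, F1) is a ratio built from these four counts.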
What and Why - Confusion Matrix
• It is a performance measurement for machine learning classification problems where the output can be two or more classes.
• It is a table with 4 different combinations of predicted and actual values.
• It is extremely useful for measuring Recall, Precision, Specificity, Accuracy and, most importantly, the AUC-ROC curve.
What is AUC - ROC Curve?
The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve, and AUC represents the degree or measure of separability: it tells us how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. By analogy, the higher the AUC, the better the model is at distinguishing between patients with the disease and patients without it.
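AUC has an equivalent probabilistic reading: the chance that a randomly chosen positive gets a higher score than a randomly chosen negative. A hedged sketch computing it directly from that definition, with invented scores and labels:

```python
# AUC as P(score of a random positive > score of a random negative),
# counting ties as half. Labels and scores are made-up examples.
def auc(y_true, scores):
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1 for p in pos for n in neg if p > n)
    ties = sum(1 for p in pos for n in neg if p == n)
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

print(auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # 1.0 -- perfect separation
print(auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))  # 0.75 -- partial separation
```

An AUC of 0.5 would mean the scores separate the classes no better than chance.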
Error & Accuracy
• Prediction error (ERR) and accuracy (ACC) provide general information about how many samples are misclassified.
• Error is the sum of all false predictions divided by the total number of predictions:
  ERR = (FP + FN) / (TP + TN + FP + FN)
• Accuracy is the sum of correct predictions divided by the total number of predictions:
  ACC = (TP + TN) / (TP + TN + FP + FN)
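The two formulas above, evaluated with hypothetical counts (note they always sum to 1):

```python
# ERR and ACC from the four confusion-matrix counts (hypothetical values).
tp, tn, fp, fn = 45, 43, 5, 7
total = tp + tn + fp + fn

err = (fp + fn) / total  # sum of all false predictions / total predictions
acc = (tp + tn) / total  # sum of correct predictions / total predictions

print(err, acc)  # 0.12 0.88
```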
True Positive & False Negative
• The True positive rate (TPR) and False positive rate (FPR) are
performance metrics that are especially useful for imbalanced class
problems:
We describe predicted values as Positive and Negative and actual values as True
and False.
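These rates help with imbalance because each is normalised within its own actual class. A small sketch with hypothetical counts:

```python
# TPR and FPR from hypothetical counts. Each rate is computed within
# its own actual class, so class imbalance does not distort it.
tp, fn = 30, 10  # the 40 actual positives
fp, tn = 5, 55   # the 60 actual negatives

tpr = tp / (tp + fn)  # also called recall or sensitivity
fpr = fp / (fp + tn)

print(tpr, round(fpr, 3))  # 0.75 0.083
```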
Examples
* Remember, the class is YES (pregnant) = positive, or NOT (pregnant) = negative.
True Positive:
Interpretation: You predicted positive and it's true.
You predicted that a woman is pregnant and she actually is.
True Negative:
Interpretation: You predicted negative and it's true.
You predicted that a man is not pregnant and he actually is not.
False Positive (Type 1 Error):
Interpretation: You predicted positive and it's false.
You predicted that a man is pregnant but he actually is not.
False Negative (Type 2 Error):
Interpretation: You predicted negative and it's false.
You predicted that a woman is not pregnant but she actually is.
• In practice, a combination of PRE and REC is often used: the so-called F1-score, a single score that represents both precision (P) and recall (R). It uses the harmonic mean in place of the arithmetic mean:
  F1 = 2 × P × R / (P + R)
How to Calculate Confusion Matrix for a 2-class classification
problem?
Accuracy
How to Calculate Confusion Matrix
The accuracy for the problem at hand comes out to be 88%. As you can see from the above two tables, the positive predictive value is high, but the negative predictive value is quite low. The same holds for sensitivity and specificity. This is primarily driven by the threshold value we have chosen: if we decrease our threshold value, the two pairs of starkly different numbers will come closer. For example, a pharmaceutical company will be more concerned with minimising wrong positive diagnoses, and hence will care more about high specificity. On the other hand, an attrition model will be more concerned with sensitivity. Confusion matrices are generally used only with class-output models.
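The threshold effect described above can be sketched with made-up predicted probabilities: lowering the threshold raises sensitivity but eventually costs specificity.

```python
# Hypothetical true labels and predicted probabilities for 8 cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
probs  = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]

def sens_spec(threshold):
    """Sensitivity and specificity when classifying at a given threshold."""
    y_pred = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

print(sens_spec(0.5))   # stricter threshold: misses a positive
print(sens_spec(0.25))  # looser threshold: sensitivity up, specificity down
```

Sweeping the threshold across all values and plotting (TPR, FPR) pairs is exactly what produces the ROC curve.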
When to minimise what?
• Some guidelines on what to use:
1. Minimising False Negatives:
e.g. In detecting cancerous patients, we want to correctly classify all the cancerous patients. Classifying a person as having cancer when they actually do NOT might be okay, as it is less dangerous than NOT identifying/capturing a cancerous patient.
2. Minimising False Positives:
e.g. Let's say that you are expecting an important email. Suppose the model classifies that important email you are desperately waiting for as spam (a case of false positive); this would be BAD. So in the case of spam email classification, minimising false positives is more important than minimising false negatives.
Accuracy
• Accuracy in classification problems is the number of correct predictions made by the model over all predictions made.
• The numerator is our correct predictions: true positives and true negatives.
• The denominator is all predictions made by the algorithm, right as well as wrong:
  Accuracy = (TP + TN) / (TP + TN + FP + FN)
When to use Accuracy:
Accuracy is a good measure when the target variable classes in the data are nearly balanced.
Ex: 60% of the classes in our fruit image data are apples and 40% are oranges. For a model which predicts whether a new image is an apple or an orange, being correct 97% of the time is a very good measure in this example.
• It is clear that recall gives us information about a classifier's performance with respect to false negatives (how many did we miss), while precision gives us information about its performance with respect to false positives (how many of our catches were wrong).
• Precision is about being precise. So even if we managed to capture only one cancer case, and we captured it correctly, then we are 100% precise.
• Recall is not so much about capturing cases correctly but more about capturing all cases that have "cancer" with the answer "cancer". So if we simply label every case as "cancer", we have 100% recall.
• So basically, if we want to focus more on minimising false negatives, we would want our recall to be as close to 100% as possible without precision being too bad; and if we want to focus on minimising false positives, then our focus should be to make precision as close to 100% as possible.
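A small sketch of the trade-off above, using a made-up cancer screen that flags 10 of 100 cases and catches 4 of the 5 true cancers:

```python
# Hypothetical screening result: 10 flagged cases, 5 true cancers.
tp = 4  # cancers correctly flagged
fp = 6  # healthy people wrongly flagged
fn = 1  # the cancer case we missed

precision = tp / (tp + fp)  # of the flagged cases, how many were real
recall    = tp / (tp + fn)  # of the real cases, how many we caught

print(precision, recall)  # 0.4 0.8
```

Flagging more cases would push recall toward 1.0 while dragging precision down, and vice versa, which is why the next sections combine them into a single score.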
Specificity
• Specificity is a measure that tells us what proportion of patients that did NOT have cancer were predicted by the model as non-cancerous. The actual negatives (people actually NOT having cancer) are FP and TN, and the people diagnosed by us as not having cancer are TN. (Note: FP is included because the person did NOT actually have cancer even though the model predicted otherwise.) Specificity is the counterpart of recall for the negative class:
  Specificity = TN / (TN + FP)
Ex: In our cancer example with 100 people, 5 people actually have cancer. Let's say that the model predicts every case as cancer. So our denominator (false positives plus true negatives) is 95, and the numerator, people not having cancer whom the model also predicted as not having cancer, is 0 (since we predicted every case as cancer). So in this example, we can say that the specificity of such a model is 0%.
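The worked example above in two lines of arithmetic:

```python
# 100 people, 5 with cancer, and a model that predicts "cancer" for everyone.
tn = 0   # nobody is predicted non-cancerous
fp = 95  # all 95 healthy people are wrongly flagged

specificity = tn / (tn + fp)
print(specificity)  # 0.0
```

The same degenerate model scores 100% recall, which is exactly why recall and specificity need to be read together.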
F1 Score
• We don’t really want to carry both Precision and Recall in our pockets every
time we make a model for solving a classification problem. So it’s best if we
can get a single score that kind of represents both Precision(P) and
Recall(R).
• One way to do that is simply taking their arithmetic mean. i.e (P + R) / 2
where P is Precision and R is Recall. But that’s pretty bad in some situations.
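Here is one such situation, sketched with numbers that assume 5 positives in 100 cases and a degenerate "predict everything positive" model: precision is tiny, recall is perfect, and only the harmonic mean flags the problem.

```python
# Precision and recall for a model that predicts every case positive,
# assuming 5 of 100 cases are actually positive (hypothetical numbers).
p, r = 0.05, 1.0

arithmetic = (p + r) / 2      # simple average
f1 = 2 * p * r / (p + r)      # harmonic mean, i.e. the F1-score

print(arithmetic)    # 0.525 -- looks deceptively decent
print(round(f1, 3))  # 0.095 -- punishes the lopsided pair
```

The harmonic mean is always pulled toward the smaller of the two values, so F1 stays low unless both precision and recall are reasonable.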
• Classification accuracy is what we usually mean when we use the term accuracy. It is the ratio of the number of correct predictions to the total number of input samples.
• It works well only if there are equal numbers of samples belonging to each class.
• For example, consider that 98% of the samples in our training set are of class A and 2% are of class B. Then our model can easily get 98% training accuracy by simply predicting that every training sample belongs to class A.
• When the same model is tested on a test set with 60% samples of class A and 40% samples of class B, the test accuracy drops to 60%. Classification accuracy looks great, but gives us a false sense of achieving high accuracy.
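The 98% / 2% scenario above, made concrete with a model that always predicts class A:

```python
# Hypothetical class distributions from the example above.
train_labels = ["A"] * 98 + ["B"] * 2
test_labels  = ["A"] * 60 + ["B"] * 40

def accuracy_always_a(labels):
    """Accuracy of a degenerate model that predicts 'A' for every sample."""
    return sum(1 for y in labels if y == "A") / len(labels)

print(accuracy_always_a(train_labels))  # 0.98 -- looks impressive
print(accuracy_always_a(test_labels))   # 0.6  -- the illusion collapses
```

The model learned nothing about class B, yet its training accuracy was 98%; this is the failure mode that precision, recall, and AUC are designed to expose.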