
Confusion Matrix

for evaluating the KNN model


KNN Algorithm
• Generates a model that can predict, based on supervised learning

• Trained on labelled data

• It is a nearest-neighbour algorithm

• Prediction for new data is done based on feature similarity with the neighbours

• Also called a feature similarity algorithm

• We should choose the right ‘K’….usually sqrt(n)
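
The sqrt(n) rule of thumb above can be sketched as a tiny function. This is an illustrative Python sketch (the document's worked example later uses R); the name `choose_k` and the nudge-to-odd detail are our own additions, the odd K avoiding tied majority votes in binary classification:

```python
import math

def choose_k(n_train):
    """Rule-of-thumb K for KNN: K ~ sqrt(n), nudged to an odd
    number so binary majority votes cannot tie."""
    k = max(1, round(math.sqrt(n_train)))
    return k if k % 2 == 1 else k + 1

print(choose_k(500))  # sqrt(500) ~ 22.4 -> 22 -> 23 (odd)
```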


Examples where KNN is used to predict

• Amazon uses it to recommend books to new customers

• Banks use it to approve loans

• Doctors use it to predict diabetes

• Predicting the risk factor of prostate cancer


Steps for Developing a KNN Algorithm
Step 1: Data collection, followed by importing the data into R

Step 2: Prepare and explore the data

Step 3: Normalize the numeric data

Step 4: Create training and test data sets

Step 5: Train a model on the data

Step 6: Evaluate the model performance….using a Confusion Matrix
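
Steps 3 to 6 above can be sketched end to end. The document's worked example uses R; this is a minimal, hypothetical Python version with made-up toy data, using min-max normalization and a simple majority-vote KNN:

```python
import math
from collections import Counter

# Toy labelled data (hypothetical): [feature1, feature2], class label
data = [([1.0, 200.0], "A"), ([1.2, 180.0], "A"), ([0.9, 220.0], "A"),
        ([5.0, 900.0], "B"), ([5.5, 950.0], "B"), ([4.8, 880.0], "B")]

# Step 3: normalize each numeric feature to [0, 1] (min-max scaling)
cols = list(zip(*(x for x, _ in data)))
lo = [min(c) for c in cols]
hi = [max(c) for c in cols]
def scale(x):
    return [(v - l) / (h - l) for v, l, h in zip(x, lo, hi)]

# Step 4: split into training and test sets
train = [(scale(x), y) for x, y in data[:5]]
test = [(scale(x), y) for x, y in data[5:]]

# Step 5: "training" KNN = storing the data; predict by majority
# vote of the k nearest neighbours (Euclidean distance)
def knn_predict(x, k=3):
    nearest = sorted(train, key=lambda t: math.dist(t[0], x))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

# Step 6: evaluate via a confusion matrix of (predicted, actual) pairs
cm = Counter((knn_predict(x), y) for x, y in test)
print(dict(cm))
```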


What is a Confusion Matrix (CM)

• A useful tool for calibrating the output of a model.....a tool for evaluating the
performance of the model
• Examines all possible outcomes...True Positive (TP), True Negative (TN), False
Positive (FP), False Negative (FN)
• Categorises the predictions against the actual values....it is a 2-dimensional matrix of
predicted values × actual values
• Gives a lot of additional information, in addition to the accuracy of the KNN model
• Has this name because it shows how confused the model is between predicted
outcome values and actual outcome values
• The columns in a CM represent the actual classes and the rows represent the
predicted classes….or vice versa
What does a Confusion Matrix (CM) look like?
What do TP, TN, FP, FN stand for in the Confusion Matrix (CM)?

In all, there can be 4 possible outcomes:

TP: cases where the model correctly predicts the positive outcome (predicted Y, actual Y)
FP: cases where the model incorrectly predicts positive (predicted Y, actual N)
TN: cases where the model correctly predicts the negative outcome (predicted N, actual N)
FN: cases where the model incorrectly predicts negative (predicted N, actual Y)

A confusion matrix can be binary, with two levels, or have more levels.


In the case of the fibre identification problem, there are three levels: Cotton, Silk, Wool for
both actual and predicted values........the CM in this case will be a 3×3 matrix.
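A multi-class CM like this can be tallied directly from the two label vectors. A minimal Python sketch with invented fibre labels (the data below is made up purely for illustration), laying out rows as predicted classes and columns as actual classes, as described above:

```python
from collections import Counter

labels = ["Cotton", "Silk", "Wool"]
actual    = ["Cotton", "Cotton", "Silk", "Silk", "Wool", "Wool", "Cotton", "Silk"]
predicted = ["Cotton", "Silk",   "Silk", "Silk", "Wool", "Cotton", "Cotton", "Silk"]

# Tally (predicted, actual) pairs, then lay them out as a 3x3 matrix:
# rows = predicted classes, columns = actual classes
counts = Counter(zip(predicted, actual))
cm = [[counts[(p, a)] for a in labels] for p in labels]

for row_label, row in zip(labels, cm):
    print(row_label, row)
```

The diagonal holds the correct predictions; every off-diagonal cell shows which pair of classes the model confuses.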
Examples

Confusion Matrix (CM)
For predicting mails as Spam or Non-spam

Example: Confusion Matrix (CM)....
for predicting Cancer or No-Cancer in diagnoses

Example: Confusion Matrix (CM)....
for predicting whether a cancer is benign or malignant

For prediction of prostate cancer among men
based on their medical reports for test 1 and test 2.

• Random sample tests (say, test 1 and test 2) were performed on 500 men.
• Of these, 50 actually have prostate cancer.
• The model predicted 100 total cases of prostate cancer,
• 45 of which actually have prostate cancer.

Confusion Matrix

                     Actual Positive          Actual Negative

Predicted Positive   TP = 45                  FP = 55 (Type 1 error)

Predicted Negative   FN = 5 (Type 2 error)    TN = 395
We can derive additional statistics from the Confusion Matrix

For prediction of prostate cancer among men, based on their medical reports for test 1 and test 2:

• Random sample tests were performed on 500 men.
• Of these, 50 actually have prostate cancer.
• The model predicted 100 total cases of prostate cancer,
• 45 of which actually have prostate cancer.

                     Actual Positive   Actual Negative

Predicted Positive   TP = 45           FP = 55

Predicted Negative   FN = 5            TN = 395

Calculate the statistics using the labelled confusion matrix:

1. Accuracy (all correct / all) = (TP + TN) / (TP + TN + FP + FN)
(45 + 395) / 500 = 440 / 500 = 0.88 or 88% Accuracy

2. Misclassification (all incorrect / all) = (FP + FN) / (TP + TN + FP + FN)
(55 + 5) / 500 = 60 / 500 = 0.12 or 12% Misclassification
You can also just do 1 − Accuracy, so: 1 − 0.88 = 0.12 or 12% Misclassification

3. Precision (true positives / predicted positives) = TP / (TP + FP)
45 / (45 + 55) = 45 / 100 = 0.45 or 45% Precision

4. Sensitivity, aka Recall (true positives / all actual positives) = TP / (TP + FN)
45 / (45 + 5) = 45 / 50 = 0.90 or 90% Sensitivity

5. Specificity (true negatives / all actual negatives) = TN / (TN + FP)
395 / (395 + 55) = 395 / 450 ≈ 0.88 or 88% Specificity
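The five calculations can be checked mechanically from the four counts in the matrix. A small Python sketch (the document's own tooling is R; Python is used here only to verify the arithmetic):

```python
# Counts read off the confusion matrix above
TP, FP, FN, TN = 45, 55, 5, 395
total = TP + TN + FP + FN                 # 500 men in the sample

accuracy          = (TP + TN) / total     # 440 / 500 = 0.88
misclassification = (FP + FN) / total     # 60 / 500  = 0.12
precision         = TP / (TP + FP)        # 45 / 100  = 0.45
sensitivity       = TP / (TP + FN)        # 45 / 50   = 0.90 (recall)
specificity       = TN / (TN + FP)        # 395 / 450 ~ 0.878, i.e. 88% rounded

print(accuracy, misclassification, precision, sensitivity, specificity)
```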
Therefore, the KNN model built for prediction of prostate cancer among
men, based on their medical reports for test 1 and test 2
(hypothetical….data not provided), with outcome “have prostate
cancer” or “do not have prostate cancer”, is 88% accurate.
Interpretation of the Results
• A confusion matrix can help us evaluate the performance of our models; it
provides statistics….Accuracy, Precision, Sensitivity, Specificity, etc….one or more of
these can be used to evaluate the model.

• In the given example (prostate cancer), it can be seen that there is a high
incidence of false positives (55 out of the 45 + 55 = 100 predicted positives). Therefore,
precision, given by TP / (TP + FP)……is just 45%. This means that 55% of our
positive predictions of prostate cancer are false. Our model is thus NOT
precise.
Command in R to execute the Confusion Matrix

confusionMatrix(data = m1, reference = factor(fibre_test_target))

1. We specify the predicted data and the actual (reference) data as the arguments
2. Both datasets should be of factor type….convert them to factors in case they are not

conf_matrix <- confusionMatrix(data = m1, reference = factor(fibre_test_target))

Assign the output of the CM to a variable

print(conf_matrix)

Print the confusion matrix to evaluate the model based on the statistics


R program for predicting the fibre type (cotton, silk, wool) using the KNN model
Result of the confusion matrix in the KNN model

The model’s accuracy is 1 (100%)

You might also like