
K NEAREST NEIGHBORS

KNN
What is KNN?
• K nearest neighbors is a simple algorithm that stores
all available cases and classifies new cases based on a
similarity measure (e.g., distance functions).
• Different names:
• Instance-Based Learning: KNN is often referred to as
instance-based learning or case-based learning (where
each training instance is a case from the problem domain).
• Lazy Learning: No learning of the model is required and all of
the work happens at the time a prediction is requested.
Algorithm
• A case is classified by a majority vote of its neighbors.
• How does KNN work?
• Step-1: Select the number K of neighbors
• Step-2: Calculate the Euclidean distance between the query
point and every training point
• Step-3: Take the K nearest neighbors as per the calculated
Euclidean distances.
• Step-4: Among these K neighbors, count the number of
data points in each category.
• Step-5: Assign the new data point to the category with the
largest number of neighbors (a sketch follows below).
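A minimal from-scratch sketch of these five steps in Python (the points, labels, and query below are illustrative, not taken from any particular dataset):

```python
import math
from collections import Counter

def euclidean(a, b):
    # Step 2: Euclidean distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train_points, train_labels, query, k=3):
    # Step 3: order training points by distance to the query
    by_distance = sorted(zip(train_points, train_labels),
                         key=lambda pair: euclidean(pair[0], query))
    # Step 4: count the categories among the K nearest neighbors
    votes = Counter(label for _, label in by_distance[:k])
    # Step 5: the majority category is the prediction
    return votes.most_common(1)[0][0]

# Illustrative usage with made-up 2-D points
points = [(1, 1), (2, 1), (8, 8), (9, 7)]
labels = ["A", "A", "B", "B"]
print(knn_classify(points, labels, query=(2, 2), k=3))  # -> A
```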
Distance
• It should also be noted that distance measures such as
Euclidean, Manhattan, and Minkowski are only valid for
continuous variables. In the instance of categorical
variables, the Hamming distance must be used.
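A short sketch of these distance functions; the Minkowski form reduces to Manhattan at p = 1 and to Euclidean at p = 2 (the vectors below are illustrative):

```python
def minkowski(a, b, p):
    # Minkowski distance: p = 1 gives Manhattan, p = 2 gives Euclidean
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def hamming(a, b):
    # Hamming distance for categorical vectors: number of mismatched positions
    return sum(x != y for x, y in zip(a, b))

print(minkowski((3, 7), (7, 7), p=2))        # Euclidean -> 4.0
print(minkowski((3, 7), (7, 7), p=1))        # Manhattan -> 4.0
print(hamming(("red", "S"), ("blue", "S")))  # -> 1
```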
Example
• Let’s try to classify the unknown green point by
looking at its k = 3 and k = 5 nearest neighbors
• For k = 3, we see 2 triangles and 1 square; so we
might classify the point as a triangle
• For k = 5, we see 2 triangles and 3 squares; so we
might classify the point as a square
• Typically, we classify by some variant of majority vote,
so use an odd value of k to avoid ties
Numerical Example
Predict the quality of Paper_5, which has Acid Durability
= 3 and Strength = 7, for K = 3 (nearest neighbors).

Sample Paper   Acid Durability   Strength   Quality
Paper_1        7                 7          Bad
Paper_2        7                 4          Bad
Paper_3        3                 4          Good
Paper_4        1                 4          Good
Example
• Step 1: First, fix the parameter K = number of nearest
neighbors (K = 3).
• Step 2: Calculate the distance between the query instance and
all the training samples. Here the query instance is (3, 7), and
the distance is calculated using the Euclidean distance formula.
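Working Step 2 out for each training sample against the query (3, 7):

d(Paper_1) = √((7 − 3)² + (7 − 7)²) = √16 = 4.00
d(Paper_2) = √((7 − 3)² + (4 − 7)²) = √25 = 5.00
d(Paper_3) = √((3 − 3)² + (4 − 7)²) = √9 = 3.00
d(Paper_4) = √((1 − 3)² + (4 − 7)²) = √13 ≈ 3.61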
Example
• Step 3: Sort the distances and determine the nearest neighbors
based on the K-th minimum distance (worked out below).
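Sorting the distances above gives: Paper_3 (3.00, rank 1), Paper_4 (3.61, rank 2), Paper_1 (4.00, rank 3), Paper_2 (5.00, rank 4). With K = 3, the nearest neighbors are Paper_3, Paper_4, and Paper_1.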
Example
• Step 4: Collect the Quality of the nearest neighbors. The
Quality of Paper_2 is not included because the rank of this
paper is greater than 3.
• Step 5: Use the simple majority of the categories of the nearest
neighbors as the prediction for the query instance.
• Hence, 2 Good > 1 Bad, from which the conclusion is that the new sample
Paper_5, which passes the laboratory test with "Acid Durability = 3"
and "Strength = 7", falls in the Good quality category.
Advantages and Disadvantages
• Advantages:
1. KNN requires no training before making predictions; new data
can be added seamlessly without impacting the accuracy of the
algorithm.
2. KNN is very easy to implement.
• Disadvantages:
1. KNN does not work well with large datasets: the cost of
calculating the distance between the new point and every existing
point is high, which degrades performance.
2. Feature scaling (standardization or normalization) is required before
applying KNN to any dataset; otherwise, KNN may generate wrong
predictions (a sketch follows below).
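A minimal sketch of scaling before KNN, assuming scikit-learn; StandardScaler standardizes each feature to zero mean and unit variance so that no single feature dominates the distance (the data below is illustrative):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# The second feature's large scale would otherwise dominate
# the Euclidean distance
X = [[1, 1000], [2, 1100], [8, 5000], [9, 5200]]
y = ["A", "A", "B", "B"]

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)
print(model.predict([[2, 1050]]))  # -> ['A']
```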
MODEL EVALUATION
Confusion Matrix
Model Evaluation and Selection
• Evaluation metrics: How can we measure accuracy? What
other metrics should we consider?
• Use a test set of class-labeled tuples, rather than the
training set, when assessing accuracy
• Cross-validation: a method for estimating a classifier’s
accuracy (covered below)

Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:

Actual class \ Predicted class   C1                     ¬C1
C1                               True Positives (TP)    False Negatives (FN)
¬C1                              False Positives (FP)   True Negatives (TN)

Example of Confusion Matrix:

Actual class \ Predicted class   buy_computer = yes   buy_computer = no   Total
buy_computer = yes               6954                 46                  7000
buy_computer = no                412                  2588                3000
Total                            7366                 2634                10000

• Given m classes, an entry CM(i,j) in a confusion matrix indicates
the number of tuples in class i that were labeled by the classifier as class j
• May have extra rows/columns to provide totals

Classifier Evaluation Metrics: Accuracy, Error
Rate, Sensitivity and Specificity

A\P   C    ¬C
C     TP   FN   P
¬C    FP   TN   N
      P’   N’   All

• Classifier accuracy, or recognition rate: percentage of
test set tuples that are correctly classified
  Accuracy = (TP + TN)/All
• Error rate: 1 − accuracy, or
  Error rate = (FP + FN)/All
• Class Imbalance Problem:
  • One class may be rare, e.g., fraud or HIV-positive
  • Significant majority of the negative class and minority of
    the positive class
  • Sensitivity: True Positive recognition rate; Sensitivity = TP/P
  • Specificity: True Negative recognition rate; Specificity = TN/N
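Applying these formulas to the buy_computer confusion matrix from the previous slide:

```python
TP, FN = 6954, 46   # actual "yes" class: P = 7000
FP, TN = 412, 2588  # actual "no" class:  N = 3000
ALL = TP + FN + FP + TN

print(f"Accuracy    = {(TP + TN) / ALL:.2%}")  # 95.42%
print(f"Error rate  = {(FP + FN) / ALL:.2%}")  # 4.58%
print(f"Sensitivity = {TP / (TP + FN):.2%}")   # 99.34%
print(f"Specificity = {TN / (FP + TN):.2%}")   # 86.27%
```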

Classifier Evaluation Metrics:
Precision and Recall, and F-measures
• Precision: exactness – what % of tuples that the classifier
labeled as positive are actually positive?
  Precision = TP/(TP + FP)
• Recall: completeness – what % of positive tuples did the
classifier label as positive?
  Recall = TP/(TP + FN)
• The perfect score is 1.0
• Inverse relationship between precision & recall
• F measure (F1 or F-score): harmonic mean of precision
and recall,
  F1 = 2 × Precision × Recall / (Precision + Recall)

Classifier Evaluation Metrics: Example

Actual class \ Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes                     90             210           300     30.00 (sensitivity)
cancer = no                      140            9560          9700    98.56 (specificity)
Total                            230            9770          10000   96.40 (accuracy)

• Precision = 90/230 = 39.13%   Recall = 90/300 = 30.00%
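Completing this example with the F-measure, computed from the precision and recall above:

```python
precision = 90 / 230  # 39.13%
recall = 90 / 300     # 30.00%
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.2%}")  # -> F1 = 33.96%
```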

Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
• Holdout method
• The given data is randomly partitioned into two independent
sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
• Random sampling: a variation of holdout
• Repeat holdout k times; accuracy = average of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular; a sketch
follows below)
• Randomly partition the data into k mutually exclusive
subsets, each of approximately equal size
• At the i-th iteration, use D_i as the test set and the others as the
training set
• Leave-one-out: k folds where k = number of tuples, for small-sized
data
• Stratified cross-validation: folds are stratified so that the
class distribution in each fold is approximately the same as that
in the initial data
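A sketch of stratified 10-fold cross-validation, assuming scikit-learn and its bundled iris dataset purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3)

# Stratified folds preserve the overall class distribution in each fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(f"Mean accuracy over 10 folds: {scores.mean():.2%}")
```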
Evaluating Classifier Accuracy: Bootstrap
• Bootstrap
• Works well with small data sets
• Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
• Several bootstrap methods exist; a common one is the .632 bootstrap
• A data set with d tuples is sampled d times, with replacement, resulting
in a training set of d samples. The data tuples that did not make it into
the training set form the test set. About 63.2% of the original data end
up in the bootstrap sample, and the remaining 36.8% form the test set
(since (1 − 1/d)^d ≈ e⁻¹ ≈ 0.368)
• Repeat the sampling procedure k times; the overall accuracy of the
model is then the average over the k runs:
  Acc(M) = (1/k) × Σᵢ₌₁ᵏ (0.632 × Acc(Mᵢ)_test_set + 0.368 × Acc(Mᵢ)_train_set)
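A sketch of one bootstrap round, assuming NumPy; the fraction of distinct tuples drawn comes out near 63.2%, and the never-drawn tuples form the out-of-bag test set:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000  # number of tuples in the data set

# Sample d indices uniformly with replacement -> the bootstrap training set
train_idx = rng.integers(0, d, size=d)
# Tuples never drawn form the out-of-bag test set
test_idx = np.setdiff1d(np.arange(d), train_idx)

print(f"Distinct tuples in training set: {len(set(train_idx)) / d:.1%}")  # ~63.2%
print(f"Out-of-bag test set:             {len(test_idx) / d:.1%}")        # ~36.8%
```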

