KNN
What is KNN?
• K nearest neighbors is a simple algorithm that stores
all available cases and classifies new cases based on a
similarity measure (e.g., distance functions).
• Different names:
• Instance-Based Learning: KNN is often referred to as
instance-based or case-based learning (where each
training instance is a case from the problem domain).
• Lazy Learning: No learning of the model is required and all of
the work happens at the time a prediction is requested.
Algorithm
• A case is classified by a majority vote of its neighbors.
• How does K-NN work?
• Step-1: Select the number K of neighbors
• Step-2: Calculate the Euclidean distance from the new point
to every training point
• Step-3: Take the K nearest neighbors according to the
calculated Euclidean distances
• Step-4: Among these K neighbors, count the number of data
points in each category
• Step-5: Assign the new point to the category with the largest
number of neighbors
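The five steps above can be sketched directly in Python. This is a minimal illustration, not the slides' own code; the toy data and names are assumptions.

```python
# Minimal KNN classifier sketch (data and names are illustrative only).
import math
from collections import Counter

def euclidean(a, b):
    """Step-2: Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training, query, k=3):
    """training: list of (features, label) pairs; returns the majority label."""
    # Step-3: take the k nearest neighbors by distance to the query point.
    neighbors = sorted(training, key=lambda item: euclidean(item[0], query))[:k]
    # Step-4 and Step-5: count labels among the neighbors, pick the majority.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy data: two classes in 2-D space.
data = [((1, 1), "square"), ((1, 2), "square"), ((2, 1), "square"),
        ((5, 5), "triangle"), ((6, 5), "triangle"), ((5, 6), "triangle")]
print(knn_classify(data, (1.5, 1.5), k=3))   # the 3 nearest points are squares
```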
Distance
• It should be noted that common distance measures such as
Euclidean, Manhattan, and Minkowski are only valid for
continuous variables. In the instance of categorical
variables, the Hamming distance must be used.
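The distance functions above can be written in a few lines each; Minkowski generalizes Euclidean (p = 2) and Manhattan (p = 1), and Hamming counts mismatched categorical values. A minimal sketch:

```python
# Distance function sketches for KNN.
import math

def euclidean(a, b):
    """Straight-line distance; valid for continuous variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance; valid for continuous variables."""
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    """Generalization: p = 1 gives Manhattan, p = 2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def hamming(a, b):
    """For categorical variables: number of positions that differ."""
    return sum(x != y for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))             # 5.0
print(manhattan((0, 0), (3, 4)))             # 7
print(hamming(("red", "S"), ("red", "M")))   # 1
```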
Example
• Let’s try to classify the unknown green point by
looking at its k = 3 and k = 5 nearest neighbors
• For k = 3, we see 2 triangles and 1 square; so we
might classify the point as a triangle
• For k = 5, we see 2 triangles and 3 squares; so we
might classify the point as a square
• Typically, we classify by some variant of majority vote,
so use an odd value of k to avoid ties
Numerical Example
Predict the quality of paper_5 having Acid Durability
= 3 and Strength = 7 for K= 3 (Nearest Neighbor).
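The training table for papers 1–4 is not reproduced above, so the sketch below fills in illustrative values purely to show the calculation; only the query point (Acid Durability = 3, Strength = 7) and K = 3 come from the example.

```python
# Worked KNN sketch for the paper-quality example.
# NOTE: the four training papers below are assumed values, not from the slides.
import math
from collections import Counter

training = [
    ((7, 7), "Bad"),    # paper_1 (assumed)
    ((7, 4), "Bad"),    # paper_2 (assumed)
    ((3, 4), "Good"),   # paper_3 (assumed)
    ((1, 4), "Good"),   # paper_4 (assumed)
]
query = (3, 7)          # paper_5: Acid Durability = 3, Strength = 7

# Sort training points by Euclidean distance to paper_5, vote among the top 3.
nearest = sorted(training, key=lambda t: math.dist(t[0], query))
votes = Counter(label for _, label in nearest[:3])   # K = 3
prediction = votes.most_common(1)[0][0]
print(prediction)
```

With these assumed neighbors the two closest papers are "Good" and one is "Bad", so the majority vote predicts "Good".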
Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:
Actual class \ Predicted class   C1                     ¬C1
C1                               True Positives (TP)    False Negatives (FN)
¬C1                              False Positives (FP)   True Negatives (TN)
Classifier Evaluation Metrics: Accuracy, Error
Rate, Sensitivity and Specificity
A\P     C     ¬C    Total
C       TP    FN    P
¬C      FP    TN    N
Total   P'    N'    All

• Accuracy (recognition rate) = (TP + TN) / All
• Error rate (misclassification rate) = (FP + FN) / All
• Sensitivity (true positive rate) = TP / P
• Specificity (true negative rate) = TN / N
• Class Imbalance Problem: one class may be rare, e.g.
fraud or HIV-positive; a significant majority of the tuples
may belong to the negative class, so accuracy alone can
be misleading
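The four metrics fall out directly from the confusion-matrix counts. The counts below are illustrative (a rare positive class) and show why accuracy alone misleads under class imbalance:

```python
# Metrics from confusion-matrix counts (counts are illustrative).
TP, FN, FP, TN = 90, 210, 140, 9560      # rare positive class: P = 300, N = 9700
P, N, ALL = TP + FN, FP + TN, TP + FN + FP + TN

accuracy    = (TP + TN) / ALL            # (90 + 9560) / 10000 = 0.965
error_rate  = (FP + FN) / ALL            # (140 + 210) / 10000 = 0.035
sensitivity = TP / P                     # 90 / 300 = 0.30  (true positive rate)
specificity = TN / N                     # true negative rate

print(accuracy, sensitivity)
```

Accuracy looks excellent (96.5%) even though the classifier finds only 30% of the positive tuples, which is exactly the class-imbalance pitfall noted above.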
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
• Precision: exactness – what % of tuples that the classifier
labeled as positive are actually positive:
Precision = TP / (TP + FP)
• Recall: completeness – what % of positive tuples the
classifier labeled as positive:
Recall = TP / (TP + FN)
• F measure (F1): harmonic mean of precision and recall:
F = 2 × Precision × Recall / (Precision + Recall)
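Using the same illustrative counts as before, precision, recall, and F1 can be computed in a few lines:

```python
# Precision, recall and F1 from confusion-matrix counts (illustrative values).
TP, FN, FP = 90, 210, 140

precision = TP / (TP + FP)    # 90 / 230 ≈ 0.39
recall    = TP / (TP + FN)    # 90 / 300 = 0.30
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))
```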
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
• Holdout method
• Given data is randomly partitioned into two independent
sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
• Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular)
• Randomly partition the data into k mutually exclusive
subsets (folds), each of approximately equal size
• At the i-th iteration, use fold Di as the test set and the
others as the training set
• Leave-one-out: k folds where k = # of tuples; for small-sized
data
• *Stratified cross-validation*: folds are stratified so that
class dist. in each fold is approx. the same as that in the
initial data
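The k-fold procedure above can be sketched without any libraries. The classifier here is a stub passed in as a function, just to show the partition-and-average structure; a real scorer would fit a model on the training indices and score it on the test indices.

```python
# Minimal k-fold cross-validation sketch (classifier is a hypothetical stub).
import random

def k_fold_indices(n, k, seed=0):
    """Randomly partition range(n) into k mutually exclusive folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n, train_and_score, k=10):
    folds = k_fold_indices(n, k)
    scores = []
    for i in range(k):                     # i-th iteration: fold i is the test set
        test = set(folds[i])
        train = [j for j in range(n) if j not in test]
        scores.append(train_and_score(train, sorted(test)))
    return sum(scores) / k                 # accuracy = average over the k folds

# Stub scorer just to show the call shape (always reports accuracy 1.0).
acc = cross_validate(20, lambda train_idx, test_idx: 1.0, k=5)
print(acc)
```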
Evaluating Classifier Accuracy: Bootstrap
• Bootstrap
• Works well with small data sets
• Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
• Several bootstrap methods exist; a common one is the .632 bootstrap
• A data set with d tuples is sampled d times, with replacement, resulting
in a training set of d samples. The data tuples that did not make it into
the training set form the test set. About 63.2% of the original tuples end
up in the bootstrap sample, and the remaining 36.8% form the test set
(since (1 − 1/d)^d ≈ e^(−1) ≈ 0.368)
• Repeat the sampling procedure k times; the overall accuracy of the
model is
Acc(M) = (1/k) × Σ_{i=1..k} [0.632 × Acc(Mi)_test_set + 0.368 × Acc(Mi)_train_set]
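One bootstrap round can be sketched as follows; sampling d indices with replacement and treating the out-of-bag tuples as the test set empirically reproduces the ≈63.2% / ≈36.8% split:

```python
# Sketch of one .632-bootstrap sampling round.
import random

def bootstrap_split(d, seed=0):
    """Draw d indices with replacement; out-of-bag indices form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(d) for _ in range(d)]      # d draws with replacement
    in_bag = set(train)
    test = [i for i in range(d) if i not in in_bag]   # tuples never drawn
    return train, test

train, test = bootstrap_split(10_000)
unique_fraction = len(set(train)) / 10_000
print(unique_fraction)    # ≈ 0.632 for large d
print(len(test) / 10_000) # ≈ 0.368
```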