
Session 15: Classification Problem: The k-Nearest Neighbor (k-NN) Algorithm

Dr. Mahesh K C
k-Nearest Neighbors
• Extensively and commonly used as a data mining tool.
• Used for classification of a categorical outcome or prediction of a continuous outcome.
• The method relies on finding “similar” records in the training data to classify or predict a new record. Similarity is measured with a distance function.
• These “neighbors” are then used to derive a classification or prediction for the new record, by voting (for classification) or averaging (for prediction).
• A highly automated, data-driven method.
• It makes no assumptions about the form of the relationship between the (categorical) response variable and the predictors.
• Instead, the method draws its information from similarities between the predictor values of the records in the data set.
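A minimal from-scratch sketch of this idea in Python (the function and the tiny training set are illustrative): rank the training records by Euclidean distance to the new record, then vote over the k nearest for classification, or average them for prediction.

```python
from collections import Counter
import math

def knn_predict(train, new_record, k=3, classify=True):
    """train: list of (predictor_tuple, outcome) pairs."""
    # The k training records most similar to the new record, by Euclidean distance.
    neighbors = sorted(train, key=lambda rec: math.dist(rec[0], new_record))[:k]
    outcomes = [outcome for _, outcome in neighbors]
    if classify:
        return Counter(outcomes).most_common(1)[0][0]  # majority vote
    return sum(outcomes) / k                           # average for numeric prediction

# Illustrative records: (income, lot size) -> owner / non-owner
train = [((60.0, 18.4), "owner"), ((85.5, 16.8), "owner"), ((52.8, 20.8), "non-owner")]
print(knn_predict(train, (60.0, 20.0), k=3))  # -> owner
```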
The classification task
• Classification is a supervised method in which the categorical target variable includes two or more classes.
• For example, the target variable income_bracket may include the categories “Low”, “Middle”, and “High”.
• Suppose we want to classify a person’s income bracket based on the age, gender, and occupation values of the records contained in a database.
• First, the classification algorithm examines the values of the predictor and target variables in the training set.
• In this way, the algorithm “learns” which values of the predictor variables are associated with which values of the target variable.

The classification task Cont’d
• Now that the data model is built, the algorithm examines new records for which income_bracket is unknown.
• Based on the classifications in the training set, the algorithm classifies the new records.
• For example, a 63-year-old female might be classified in the “High” income bracket.
• In k-NN, the classification of a new, unclassified record is performed by comparing it to the records in the training set it is most similar to.
• The k-Nearest Neighbor algorithm is an example of instance-based learning.

Examples of classification tasks in business
• Determining whether a particular credit card transaction is fraudulent.
• Placing a new student into a particular track with regard to special needs.
• Assessing whether a mortgage application is a good or bad credit risk.
• Diagnosing whether a particular disease is present.
• Identifying whether certain financial or personal behavior indicates a possible terrorist threat.
• Prescribing a drug to a new patient.
• Determining whether a new product will be a success or a failure.

Considerations when using k-NN: Distance between records
• A distance metric is a real-valued function d used to measure the similarity between points x, y, and z, with the following properties:
1. $d(x, y) \ge 0$, and $d(x, y) = 0$ if and only if $x = y$
2. $d(x, y) = d(y, x)$
3. $d(x, z) \le d(x, y) + d(y, z)$
Property 1: Distance is always non-negative.
Property 2: Symmetry (commutativity): the distance from “A to B” equals the distance from “B to A”.
Property 3: The triangle inequality holds: the distance from “A to C” must be less than or equal to the distance from “A to B to C”.

• The Euclidean distance function is commonly used to measure distance:

$$d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$

where $x = (x_1, x_2, \ldots, x_m)$ and $y = (y_1, y_2, \ldots, y_m)$ represent the m attribute values of two records.
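A minimal sketch of this distance in Python (NumPy assumed available; the function name and example vectors are illustrative):

```python
import numpy as np

def euclidean_distance(x, y):
    """Euclidean distance between two m-dimensional attribute vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

# Two illustrative records measured on (income in $000s, lot size in 000s sq. ft.)
print(euclidean_distance([60, 20], [69, 20]))  # -> 9.0
```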


Considerations when using k-NN: Choosing k
What value of k is optimal?
There is no obvious solution. Generally, an odd number is selected to avoid ties.
• Smaller k
Choosing a small value for k may lead the algorithm to overfit the data:
noise or outliers may unduly affect the classification.
• Larger k
Larger values tend to smooth out idiosyncratic or obscure data values in the training set.
If the value becomes too large, however, locally interesting behavior will be overlooked.
• Choosing the appropriate value for k requires balancing these considerations.
• Cross-validation can help determine the value for k, by choosing the value that minimizes the classification error (as sketched below).
• A common rule of thumb is k = sqrt(sample size).
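A sketch of choosing k by cross-validation with scikit-learn (assumed available); the synthetic data set is a placeholder for your own training predictors and labels:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Placeholder training data; substitute your own predictor matrix and labels.
X_train, y_train = make_classification(n_samples=200, n_features=4, random_state=1)

# Score odd values of k up to roughly sqrt(n) by 5-fold cross-validated accuracy.
candidate_ks = range(1, int(np.sqrt(len(X_train))) + 1, 2)
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train, y_train, cv=5).mean()
          for k in candidate_ks}
best_k = max(scores, key=scores.get)  # the k with the highest mean CV accuracy
print(best_k, round(scores[best_k], 3))
```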
Cross-Validation
• Data partitioning is not advisable when the number of records is small, as each partition would have too few records for modelling.
• An alternative is cross-validation: a procedure that partitions the data into k “folds”, i.e., k non-overlapping subsamples.
• A model is fit k times; each time, one of the folds is used as the validation set and the remaining (k − 1) folds are used as the training set.
• Each fold is used exactly once as the validation set, thereby producing a prediction for every observation in the data set.
• The model’s predictions on each of the k validation sets are then combined to evaluate the overall performance of the model (as illustrated below).
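A minimal sketch of the k-fold mechanics using scikit-learn’s KFold (the data set is a placeholder); note that every record is validated exactly once:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=4, random_state=0)  # placeholder data

kf = KFold(n_splits=5, shuffle=True, random_state=0)
predictions = np.empty_like(y)
for train_idx, val_idx in kf.split(X):
    # Fit on the other 4 folds, then predict the held-out fold.
    model = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    predictions[val_idx] = model.predict(X[val_idx])

print("overall cross-validated accuracy:", accuracy_score(y, predictions))
```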

Example
• A riding-mower manufacturer would like to classify families in a city into those likely to purchase a riding mower and those not likely to buy one, based on income ($000s) and lot size (000s ft²). A pilot random sample of 12 owners and 12 non-owners in the city is taken.
• How do we classify a new record with $60,000 income and lot size 20,000 ft²?
• The scatter plot (next slide) shows that, among the households in the training data set, the closest to the new household is record number 9, with $69,000 income and lot size 20,000 ft².
• If we use a 1-NN classifier, we would classify the new household as an owner.
• If we use a 3-NN classifier, the three nearest households are records 9, 14, and 1.
• Two of these neighbors (9 and 1) are owners of riding mowers, while record 14 is a non-owner.
• The majority vote is for “owner”, and hence the new household would be classified as an owner (see the sketch below).
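A toy sketch of this 3-NN majority vote with scikit-learn. Record 9 uses the values given above; the coordinates of records 1 and 14 are hypothetical stand-ins, since the slide reports only their class labels:

```python
from sklearn.neighbors import KNeighborsClassifier

# The three nearest training households: (income $000s, lot size 000s sq. ft.)
X_train = [[69.0, 20.0],   # record 9:  owner (values from the slide)
           [60.0, 18.4],   # record 1:  owner (hypothetical coordinates)
           [52.0, 20.5]]   # record 14: non-owner (hypothetical coordinates)
y_train = ["owner", "owner", "non-owner"]

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(knn.predict([[60.0, 20.0]]))  # majority vote of the 3 neighbors -> ['owner']
```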

Scatter plot: Income vs. Lot Size

[Figure: scatter plot of lot size against income for the 24 training households, with owners and non-owners marked and the new household at ($60,000, 20,000 ft²).]
Judging Classifier Performance
• The need for performance measures arises when there is a wide choice of classifiers and prediction methods.
• A natural choice is the probability of making a misclassification error, i.e., the probability that a record belongs to one class but the model classifies it as a member of a different class.
• Confusion (classification) matrix: a matrix that summarizes the correct and incorrect classifications a classifier produced for a certain data set.
• Accuracy: (sum of the diagonal)/total, i.e., the overall proportion of correct classifications (see the sketch below).
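A sketch of the confusion matrix and the accuracy computation with scikit-learn; the actual and predicted label vectors are illustrative:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Illustrative actual vs. predicted class labels for a validation set.
y_actual    = ["owner", "owner", "non-owner", "owner", "non-owner", "non-owner"]
y_predicted = ["owner", "non-owner", "non-owner", "owner", "non-owner", "owner"]

cm = confusion_matrix(y_actual, y_predicted, labels=["owner", "non-owner"])
print(cm)                     # rows: actual class, columns: predicted class
print(cm.trace() / cm.sum())  # accuracy = sum of diagonal / total
print(accuracy_score(y_actual, y_predicted))  # the same value via the built-in helper
```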

Advantages & Shortcomings of k-NN
• The main advantages of k-NN are its simplicity and its lack of parametric assumptions.
• Given a large enough training data set, the method performs well, especially when each class is characterized by multiple combinations of predictor values.
• Although no time is required to estimate parameters from the training data, the time needed to find the nearest neighbors in a large training set can be prohibitive. This can be mitigated by (see the sketch after this list):
• Reducing the time taken to compute distances by working in a reduced dimension, using dimension reduction methods.
• Using sophisticated data structures such as “search trees” to speed up the identification of nearest neighbors.
• The number of records required in the training set increases exponentially with the number of predictors (the curse of dimensionality).
• k-NN is a “lazy learner”: the time-consuming computation is deferred to the time of prediction.
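A sketch of both mitigations with scikit-learn (the data set is a placeholder): principal components analysis, one common dimension reduction method, applied before the distance computations, and a k-d tree, one kind of search tree, used for the neighbor lookups:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)  # placeholder data

model = make_pipeline(
    PCA(n_components=5),                                       # reduce 30 predictors to 5 dimensions
    KNeighborsClassifier(n_neighbors=7, algorithm="kd_tree"),  # tree-based neighbor search
)
model.fit(X, y)
print(model.predict(X[:3]))
```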


