Dr. Mahesh K C 1
k- Nearest Neighbors
• k-NN is an extensively and commonly used data mining tool.
• It is used for classification of a categorical outcome or prediction of a continuous
outcome.
• The method relies on finding “similar” records in the training data to classify or
predict a new record; similarity is measured with a distance function.
• These “neighbors” are then used to derive a classification or prediction for the
new record by voting (for classification) or averaging (for prediction).
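The voting idea above can be sketched as a minimal k-NN classifier. This is an illustrative implementation in plain Python with made-up toy data, not the textbook's code:

```python
import math
from collections import Counter

def knn_classify(train, new_point, k=3):
    """Classify new_point by majority vote among its k nearest
    training records, using Euclidean distance."""
    # train is a list of (features, label) pairs
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], new_point))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy training data, for illustration only
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((5.0, 5.0), "B"), ((4.8, 5.2), "B")]
print(knn_classify(train, (1.1, 0.9), k=3))  # → A
```

For prediction of a continuous outcome, the `Counter` vote would be replaced by the average of the k neighbors' outcome values.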
The classification task (cont'd)
• Once the model is built, the algorithm examines new records for which
income_bracket is unknown.
• The algorithm classifies these new records according to the classifications
in the training set.
• For example, a 63-year-old female might be classified in the “High”
income bracket.
Examples of classification tasks in business
• Determining whether a particular credit card transaction is fraudulent.
• Placing a new student into a particular track with regard to special
needs.
• Assessing whether a mortgage application is a good or bad credit
risk.
• Diagnosing whether a particular disease is present.
• Identifying whether certain financial or personal behavior
indicates a possible terror threat.
• Prescribing a drug to a new patient.
• Determining whether a new product will be a success or a failure.
Considerations when using k-NN: Distance between records
• A distance metric is a real-valued function d used to measure the similarity between records
x, y, and z, with the properties:
1. d(x, y) ≥ 0, and d(x, y) = 0 if and only if x = y
2. d(x, y) = d(y, x)
3. d(x, z) ≤ d(x, y) + d(y, z)
Property 1: Distance is always non-negative.
Property 2: Symmetry: the distance from “A to B” equals the distance from “B to A”.
Property 3: Triangle inequality: the distance from “A to C” must be less than or equal to the
distance from “A to B” plus the distance from “B to C”.
• Euclidean distance: d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )
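As a sanity check, the three metric properties can be verified numerically for the Euclidean distance. This sketch uses illustrative points chosen for this example:

```python
import math

def euclidean(x, y):
    # d(x, y) = sqrt( sum_i (x_i - y_i)^2 )
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

a, b, c = (0.0, 0.0), (3.0, 4.0), (6.0, 8.0)
print(euclidean(a, b))  # → 5.0

# Property 1: distance is non-negative, zero only for identical points
assert euclidean(a, b) >= 0 and euclidean(a, a) == 0
# Property 2: symmetry
assert euclidean(a, b) == euclidean(b, a)
# Property 3: triangle inequality
assert euclidean(a, c) <= euclidean(a, b) + euclidean(b, c)
```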
k-fold cross-validation
• A model is fit k times; each time, one of the folds is used as the validation
set and the remaining (k − 1) folds are used as the training set.
• Each fold is used exactly once as the validation set, thereby producing a prediction
for every observation in the dataset.
• The model's predictions on the k validation sets are combined in order to
evaluate the overall performance of the model.
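The procedure above can be sketched in plain Python. Here `fit` and `predict` are placeholder callables standing in for any model-fitting routine, not part of a specific library:

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal, contiguous folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(records, labels, fit, predict, k=5):
    """Fit the model k times; each fold serves exactly once as the
    validation set, yielding one out-of-fold prediction per observation."""
    n = len(records)
    predictions = [None] * n
    for fold in k_fold_indices(n, k):
        hold = set(fold)
        train_X = [records[i] for i in range(n) if i not in hold]
        train_y = [labels[i] for i in range(n) if i not in hold]
        model = fit(train_X, train_y)
        for i in fold:
            predictions[i] = predict(model, records[i])
    return predictions
```

The returned list can then be compared against the true labels to estimate overall performance.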
Example
• A riding-mower manufacturer would like to classify families in a city into those likely to
purchase a riding mower and those not likely to buy one, based on income ($000s) and
lot size (000s ft²). A pilot random sample of 12 owners and 12 non-owners
in the city is undertaken.
• How do we classify a new record with $60,000 income and lot size 20,000 ft²?
• The scatter plot shows that among the households in the training data set, the closest to
the new household is record number 9, with $69,000 income and lot size 20,000 ft².
• If we use a 1-NN classifier, we would classify the new household as an owner.
• If we use a 3-NN classifier, the three nearest households are records 9, 14 and 1.
• Two of these neighbors (9 and 1) are owners of riding mowers, and record 14 is a
non-owner.
• The majority vote is for “owner”, and hence the new household would be classified as an
owner.
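The 3-NN majority vote from this example can be reproduced directly. Only the neighbor labels stated above are used; the record numbers are keys for readability:

```python
from collections import Counter

# Labels of the three nearest households from the example:
# record 9 (owner), record 1 (owner), record 14 (non-owner)
neighbors = {9: "owner", 1: "owner", 14: "non-owner"}

vote = Counter(neighbors.values()).most_common(1)[0][0]
print(vote)  # → owner
```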
[Figure: scatter plot of Income vs. Lot size]
Judging Classifier Performance
• The need for performance measures arises when there is a wide choice of
classifiers and predictive methods.
• A natural choice is the probability of making a misclassification error, i.e. the
probability that a record belongs to one class but the model classifies it as a
member of a different class.
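That error probability can be estimated as the fraction of misclassified records. A minimal sketch, with label vectors made up purely for illustration:

```python
def misclassification_rate(actual, predicted):
    """Fraction of records whose predicted class differs from the actual class."""
    errors = sum(a != p for a, p in zip(actual, predicted))
    return errors / len(actual)

# Hypothetical labels, for illustration only
actual    = ["owner", "owner", "non-owner", "non-owner", "owner"]
predicted = ["owner", "non-owner", "non-owner", "owner", "owner"]
print(misclassification_rate(actual, predicted))  # 2 of 5 wrong → 0.4
```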
Advantages & Shortcomings of k-NN
• The main advantage of k-NN is its simplicity and lack of parametric assumptions.
• Given large enough training data, the method performs well, especially
when each class is characterized by multiple combinations of predictor values.
• Although no time is required to estimate parameters from the training data, the
time to find the nearest neighbors in a large training set can be prohibitive. This can
be mitigated by:
  • Reducing the time taken to compute distances by reducing the dimension with
  dimension-reduction methods.
  • Using sophisticated data structures such as “search trees” to speed up
  identification of the nearest neighbors.
• The number of records required in the training set increases exponentially with
the number of predictors (the curse of dimensionality).
• k-NN is a “lazy learner”: the time-consuming computation is deferred to the time of
prediction.
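The curse of dimensionality can be seen numerically: as the number of predictors grows, the contrast between the nearest and farthest points shrinks, so “nearest” becomes less meaningful. A sketch with random points (the sample sizes and seed are arbitrary choices):

```python
import math
import random

def distance_contrast(dim, n=200, seed=0):
    """Ratio of the farthest to the nearest distance from the origin
    for n random points in the unit cube [0, 1]^dim.
    The ratio shrinks toward 1 as dim grows."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n)]
    dists = [math.sqrt(sum(c * c for c in p)) for p in points]
    return max(dists) / min(dists)

for dim in (2, 10, 100):
    print(dim, round(distance_contrast(dim), 2))
```

In low dimension the farthest point is many times farther than the nearest; in high dimension all distances concentrate around a common value.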