
Artificial Intelligence

k - Nearest Neighbors

Pham Viet Cuong


Dept. Control Engineering & Automation, FEEE
Ho Chi Minh City University of Technology
k - Nearest Neighbors
ü A type of supervised ML algorithm
ü Can be used for both classification and regression
ü Lazy learning algorithm

k - Nearest Neighbors
ü Uses ‘feature similarity’ to predict the values of new data points
ü The new data point will be assigned a value based on how closely it
matches the points in the training set

k - Nearest Neighbors
ü The kNN algorithm requires
v An integer k
v A set of labeled examples (training data)
v A metric to measure “closeness”
ü Example 1: Classification
v 2D
v 2 classes
v k=3
v Euclidean distance
v 2 sea bass, 1 salmon → majority vote: sea bass (worked example below)
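A minimal worked version of this example, assuming hypothetical (length, lightness) measurements, since the actual values behind the slide's figure are not available:

```python
import numpy as np
from collections import Counter

# Hypothetical (length, lightness) measurements; these numbers are for
# illustration only and do not come from the slide's figure.
X_train = np.array([[4.0, 2.0], [4.5, 2.2], [5.0, 2.5],    # sea bass
                    [2.0, 6.0], [2.5, 6.5], [3.0, 5.8]])   # salmon
y_train = ['sea bass', 'sea bass', 'sea bass', 'salmon', 'salmon', 'salmon']

x_new = np.array([3.5, 4.0])                   # unlabeled fish
d = np.linalg.norm(X_train - x_new, axis=1)    # Euclidean distance to each point
nearest = np.argsort(d)[:3]                    # k = 3 nearest neighbors
votes = Counter(y_train[i] for i in nearest)   # here: 2 sea bass, 1 salmon
print(votes.most_common(1)[0][0])              # majority vote -> 'sea bass'
```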

k - Nearest Neighbors
ü Example 2: Classification
v 2D
v Three classes
v k=5
v Euclidean distance

k - Nearest Neighbors
ü Example 3: Classification
v Three-class 2D problem
v Non-linearly separable
v k=5
v Euclidean distance

k - Nearest Neighbors
ü Example 4: Classification
v Three-class 2D problem
v Non-linearly separable
v k=5
v Euclidean distance

k - Nearest Neighbors
ü Algorithm
v Step 1: Load training data and test data
v Step 2: Choose k
v Step 3:
Ø Calculate the distance between the test point and every training point
Ø Identify the k nearest neighbors
Ø Use the class labels of the nearest neighbors to determine the class label of the test data (e.g., by majority vote; see the sketch below)
v Step 4: End
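A minimal NumPy sketch of Steps 1–4 (brute-force search, Euclidean distance, majority vote); the function and toy data below are illustrative, not taken from the slides:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify one test point by majority vote among its k nearest neighbors."""
    # Step 3: distance from the test point to every training point
    distances = np.linalg.norm(X_train - x_test, axis=1)
    # Step 3: indices of the k nearest neighbors
    neighbor_idx = np.argsort(distances)[:k]
    # Step 3: majority vote over the neighbors' class labels
    votes = Counter(y_train[i] for i in neighbor_idx)
    return votes.most_common(1)[0][0]

# Steps 1-2: load (toy) training data and choose k
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [6.0, 6.2], [5.8, 6.0]])
y_train = ['A', 'A', 'B', 'B']
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 'A'
```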

k - Nearest Neighbors
ü How to choose k?
v If an infinite number of samples were available, the larger k is, the better
v In practice, the number of samples is finite
v Rule of thumb: k = sqrt(n), where n is the number of examples (example below)
v k = 1 is efficient, but can be sensitive to “noise”
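A small illustration of the rule of thumb; rounding to an odd k (to reduce ties in two-class voting) is a common convention, not something stated on the slide:

```python
import math

n = 100                    # number of training examples
k = round(math.sqrt(n))    # rule of thumb: k = sqrt(n) -> 10
if k % 2 == 0:
    k += 1                 # prefer an odd k to avoid ties with two classes
print(k)                   # -> 11
```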

k - Nearest Neighbors
ü How to choose k?
v Larger k may improve performance, but too large k destroys locality
v Smaller k: higher variance (less stable)
v Larger k: higher bias (less precise)

k - Nearest Neighbors
ü How well does kNN work?
v If we have lots of samples, kNN works well

k - Nearest Neighbors
ü Minkowski distance (formulas below)

ü Manhattan distance

ü Euclidean distance
ü Chebyshev distance
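The formulas themselves did not survive extraction from the slides; the standard definitions, with Manhattan, Euclidean, and Chebyshev as the p = 1, 2, and ∞ cases of the Minkowski distance, are:

```latex
\begin{aligned}
\text{Minkowski:}\quad & d_p(\mathbf{x},\mathbf{y}) = \Big(\textstyle\sum_{i=1}^{n} |x_i - y_i|^{p}\Big)^{1/p} \\
\text{Manhattan } (p=1):\quad & d_1(\mathbf{x},\mathbf{y}) = \textstyle\sum_{i=1}^{n} |x_i - y_i| \\
\text{Euclidean } (p=2):\quad & d_2(\mathbf{x},\mathbf{y}) = \sqrt{\textstyle\sum_{i=1}^{n} (x_i - y_i)^2} \\
\text{Chebyshev } (p\to\infty):\quad & d_\infty(\mathbf{x},\mathbf{y}) = \max_i |x_i - y_i|
\end{aligned}
```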

k - Nearest Neighbors
ü Best distance?

k - Nearest Neighbors
ü Euclidean distance

v Euclidean distance treats each feature as equally important


v However, some features (dimensions) may be much more discriminative than others

k - Nearest Neighbors
ü Euclidean distance

k - Nearest Neighbors
ü Feature normalization
v Linearly scale each feature to zero mean and unit variance (sketch below)
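A minimal sketch of this z-score scaling, with the mean and variance estimated on the training set and applied to any data used for kNN distances (names are illustrative):

```python
import numpy as np

def zscore_fit(X_train):
    """Per-feature mean and standard deviation estimated from the training set."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0            # guard against constant features
    return mu, sigma

def zscore_apply(X, mu, sigma):
    """Linearly scale features to zero mean and unit variance."""
    return (X - mu) / sigma

X_train = np.array([[150.0, 0.2], [180.0, 0.4], [120.0, 0.9]])
mu, sigma = zscore_fit(X_train)
X_train_scaled = zscore_apply(X_train, mu, sigma)   # use these for kNN distances
```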

k - Nearest Neighbors
ü Feature weighting
v Scale each feature by a weight wi reflecting its importance for classification (see the weighted-distance sketch below)
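One way to realize this is a weighted Euclidean distance, where each squared feature difference is scaled by its weight wi; the weights below are hypothetical and would in practice come from feature-importance estimates or validation:

```python
import numpy as np

def weighted_euclidean(x, y, w):
    """Euclidean distance with non-negative per-feature weights w_i."""
    diff = x - y
    return np.sqrt(np.sum(w * diff ** 2))

w = np.array([3.0, 1.0, 0.1])          # hypothetical importance weights
x = np.array([1.0, 2.0, 3.0])
y = np.array([0.0, 2.5, 7.0])
print(weighted_euclidean(x, y, w))     # ~ 2.20
```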

k - Nearest Neighbors
ü Computational complexity
v The basic kNN algorithm stores all training examples
v Classification becomes very expensive for a large number of samples

k - Nearest Neighbors
ü kNN - a lazy learning algorithm
v Defers data processing until it receives a request to classify unlabeled
data
v Replies to a request for information by combining its stored training
data
v Discards the constructed answer and any intermediate results
v Lazy algorithms have lower computational cost than eager algorithms during training, but greater storage requirements and higher computational cost at recall

k - Nearest Neighbors
ü Advantages
v Can be applied to data from any distribution
v Very simple and intuitive
v Good classification if the number of samples is large enough
v Uses local information, which can yield highly adaptive behavior
v Very easy to parallelize
ü Disadvantages
v Choosing k may be tricky
v Test stage is computationally expensive
v Needs a large number of samples for good accuracy
v Large storage requirements
v Highly susceptible to the curse of dimensionality
k - Nearest Neighbors
ü Sources:
v https://www.csd.uwo.ca/courses/CS4442b/L3-ML-knn.pdf
v http://research.cs.tamu.edu/prism/lectures/pr/pr_l8.pdf
v http://web.iitd.ac.in/~bspanda/KNN%20presentation.pdf
v V. B. Surya Prasath et al., “Effects of Distance Measure Choice on KNN Classifier Performance: A Review,” Big Data, vol. 7, doi: 10.1089/big.2018.0175

