
k-Nearest Neighbors
September 11th, 2023

Table of contents

1. The k-NN Algorithm
2. Why/When does k-NN work
3. Curse of dimensionality (i.e., when it can fail)

Announcement

• Please ensure that Anaconda and Jupyter Notebook are installed on your laptop before our class meeting this Wednesday.
• Additionally, remember to bring your laptop with you to class on that day.

Plan

1. The k-NN Algorithm
2. Why/When does k-NN work
3. Curse of dimensionality (i.e., when it can fail)
Introduction

• Assumption: similar inputs have similar outputs.
• Classification rule: for a test input x, assign the most common label amongst its k most similar training inputs.
• Amongst the simplest of all machine learning algorithms. No explicit training or model.
• Can be used both for classification and regression.
• Use x's k-Nearest Neighbors to vote on what x's label should be.
• Is considered a nonparametric model.

The k-NN Algorithm - Formal Definition

• Formal definition of k-NN:
  • Test point: x
  • S_x = set of k nearest neighbors of x, i.e., S_x ⊆ D such that |S_x| = k and
      ∀(x′, y′) ∈ D \ S_x:  d(x, x′) ≥ max_{(x″, y″) ∈ S_x} d(x, x″)
• Define
  • Classification:
      f̂(x) = mode({y′ : (x′, y′) ∈ S_x})    (1)
  • Regression:
      f̂(x) = (1/k) Σ_{(x′, y′) ∈ S_x} y′    (2)

The k-NN Algorithm

Input: a classification training dataset D = {(x_1, y_1), ..., (x_n, y_n)}, a parameter k ∈ N+, and a distance metric d(x, x′) (e.g., ‖x − x′‖_2, the Euclidean distance).

k-NN Algorithm
Store all training data.
For any test point x:
1. Find its top k nearest neighbors (under metric d).
2. Return the most common label among these k neighbors (for regression, return the average value of the k neighbors).

The k-NN Algorithm - Example

3-NN for binary classification using Euclidean distance.
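A minimal NumPy sketch of this procedure (the function and variable names are illustrative, not from the slides):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3, regression=False):
    """Plain k-NN prediction for a single test point x_test."""
    # Euclidean distance from x_test to every stored training point
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # Indices of the k nearest neighbors (the set S_x)
    nn_idx = np.argsort(dists)[:k]
    nn_labels = y_train[nn_idx]
    if regression:
        return nn_labels.mean()                        # average of the k neighbors
    return Counter(nn_labels).most_common(1)[0][0]     # majority vote (mode)

# Toy 3-NN binary classification with Euclidean distance
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
y = np.array([-1, -1, 1, 1, 1])
print(knn_predict(X, y, np.array([0.2, 0.1]), k=3))    # -> -1
```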
The choice of metric

• We believe our metric d captures similarity between examples: examples that are close to each other share similar labels.
• The k-nearest neighbor classifier fundamentally relies on a distance metric.
  → The better that metric reflects label similarity, the better the classifier will be.
• There are many distance metrics or measures we can use to select the k nearest neighbors.
• There is no "best" distance measure, and the choice is highly context- or problem-dependent.
• The most common choice is the Minkowski distance
      d(x, z) = (Σ_{r=1}^{p} |x_r − z_r|^p)^{1/p}

The choice of metric

This Minkowski distance definition is pretty general and contains many well-known distances as special cases. Can you identify the following candidates?

1. p = 1
2. p = 2
3. p → ∞

The choice of metric

This Minkowski distance definition is pretty general and contains many well-known distances as special cases. Can you identify the following candidates?

1. p = 1: Manhattan distance
2. p = 2: Euclidean distance
3. p → ∞: Chebyshev distance

The choice of metric

Remark: The NN classifier is still widely used today, but often with learned metrics. For more information on metric learning, check out the Large Margin Nearest Neighbors (LMNN) algorithm for learning a pseudo-metric (nowadays also known as the triplet loss), or FaceNet for face verification.
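A quick sketch to check these special cases numerically (minkowski here is an ad-hoc helper, not a library function):

```python
import numpy as np

def minkowski(x, z, p):
    """Minkowski distance; p = np.inf gives the limiting (Chebyshev) case."""
    diff = np.abs(x - z)
    if np.isinf(p):
        return diff.max()
    return (diff ** p).sum() ** (1.0 / p)

x, z = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(minkowski(x, z, 1))        # 5.0   -> Manhattan distance
print(minkowski(x, z, 2))        # ~3.61 -> Euclidean distance
print(minkowski(x, z, np.inf))   # 3.0   -> Chebyshev distance
```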
The choice of k

1. What if we set k very large?
   The top k neighbors will include examples that are very far away...
2. What if we set k very small (k = 1)?
   The label has noise, so we easily overfit to that noise.
3. What about the training error when k = 1?
   The training error is 0 (each training point is its own nearest neighbor).
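A small scikit-learn sketch illustrating the effect of k (the synthetic dataset and label-noise rate here are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)
y[rng.uniform(size=200) < 0.1] ^= 1          # flip 10% of labels to simulate label noise

for k in (1, 5, 51):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, clf.score(X, y))                # training accuracy; k = 1 gives 1.0 (error 0)
```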

1-Nearest Neighbors Decision Boundary

• Assuming a Euclidean distance metric, the decision boundary between any two training examples a and b is a straight line.
• If a query point is located on the decision boundary, it is equidistant from both training examples a and b.
• While the decision boundary between a pair of points is a straight line, the decision boundary of the 1-NN model at a global level, considering the whole training set, is a set of connected, convex polyhedra.
• This partitioning of regions of the plane in 2D is also called a "Voronoi diagram".
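A sketch of how such a Voronoi partition could be visualized with SciPy and Matplotlib (the random training set below is illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import Voronoi, voronoi_plot_2d

rng = np.random.default_rng(0)
points = rng.uniform(size=(15, 2))    # 15 random training inputs in [0, 1]^2

# Under Euclidean distance, the 1-NN decision regions are exactly the Voronoi cells
vor = Voronoi(points)
voronoi_plot_2d(vor, show_vertices=False)
plt.title("Voronoi cells = 1-NN regions of the training points")
plt.show()
```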
Plan

1. The k-NN Algorithm
2. Why/When does k-NN work
3. Curse of dimensionality (i.e., when it can fail)

Bayes Optimal Predictor

• Assume our data is collected in an i.i.d. fashion, i.e., (X, Y) ∼ P (say y ∈ {−1, 1}).
• Assume we know P for now.
• Question: what label would you predict?
  Answer: we simply predict the most likely label,
      h_opt(x) = arg max_y P(y | x),
  the Bayes optimal predictor.
Bayes Optimal Predictor

• Assume our data is collected in an i.i.d. fashion, i.e., (X, Y) ∼ P (say y ∈ {−1, 1}).
• Bayes optimal predictor: h_opt(x) = arg max_y P(y | x)
• Example:
      P(+1 | x) = 0.8
      P(−1 | x) = 0.2
      ŷ = h_opt(x) = +1
  Question: What is the probability of h_opt making a mistake on x?
  Answer: ε_BayesOpt = 1 − P(ŷ | x) = 0.2
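The same example as a few lines of Python (purely illustrative; the conditional probabilities are the ones from the slide):

```python
# Known conditional label distribution at the query point x (from the example)
p_y_given_x = {+1: 0.8, -1: 0.2}

y_hat = max(p_y_given_x, key=p_y_given_x.get)   # h_opt(x) = arg max_y P(y|x)
bayes_error = 1.0 - p_y_given_x[y_hat]          # probability that h_opt errs on x
print(y_hat, round(bayes_error, 2))             # -> 1 0.2
```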

Guarantee of k-NN when k = 1 and n → ∞

• Assume x ∈ [−1, 1]^2 and that P(x) has support everywhere, i.e., P(x) > 0 for all x ∈ [−1, 1]^2.
• What does it look like when n → ∞?
  Given a test point x, as n → ∞ its nearest neighbor x_NN is super close, i.e., d(x, x_NN) → 0.
• Cover and Hart: "as n → ∞, the 1-NN prediction error is no more than twice the error of the Bayes optimal classifier", i.e., ε_1NN ≤ 2 · ε_BayesOpt.

Plan

1. The k-NN Algorithm
2. Why/When does k-NN work
3. Curse of dimensionality (i.e., when it can fail)

Curse of Dimensionality Explanation

• In high-dimensional spaces, points drawn from a probability distribution tend to never be close together.
• Example: consider applying a k-NN classifier to data whose inputs are uniformly distributed in the p-dimensional unit cube [0, 1]^p. Consider a test point x ∈ [0, 1]^p and the k = 10 nearest neighbors of that test point.
• Let l be the edge length of the smallest hyper-cube that contains all k nearest neighbors of the test point. Then
      l^p ≈ k/n,   i.e.,   l ≈ (k/n)^{1/p}
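A quick numerical check of l ≈ (k/n)^{1/p} (the particular n used below anticipates the example that follows):

```python
# Edge length l of the smallest hyper-cube containing the k = 10 nearest
# neighbors of a test point, for n = 1000 uniform points in [0, 1]^p.
n, k = 1000, 10
for p in (2, 10, 100, 1000):
    l = (k / n) ** (1.0 / p)
    print(p, round(l, 6))        # 0.1, 0.630957, 0.954993, 0.995405

# For l to be small (say l = 0.1) at p = 100 we would need n = k / l**p,
# i.e., on the order of 10 * 10**100 points -- far more data than could ever exist.
```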
Curse of Dimensionality Explanation

• Example (cont.): if n = 1000, how big is l?

      p        l
      2        0.100000
      10       0.630957
      100      0.954993
      1000     0.995405

• If p ≫ 0, almost the entire space is needed to find the 10 nearest neighbors → this breaks down the k-NN assumption.
• Question: Could we increase the number of data points, n, until the nearest neighbors are truly close to the test point? How many data points would we need for l to become truly small?

The distance between two sampled points increases as p grows

In [0, 1]^p, we uniformly sample two points x, x′ and calculate d(x, x′) = ‖x − x′‖_2. Let's plot the distribution of such distances.

[figure: the distance increases as p → ∞]
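A sketch of that experiment (the sample count and the specific values of p are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Mean Euclidean distance between two points sampled uniformly in [0, 1]^p
for p in (2, 10, 100, 1000):
    x = rng.uniform(size=(5000, p))
    z = rng.uniform(size=(5000, p))
    mean_dist = np.linalg.norm(x - z, axis=1).mean()
    print(p, round(mean_dist, 3))    # grows roughly like sqrt(p/6) as p increases
```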

Data with low dimensional structure

Strengths, Weaknesses

• Strengths
  • k-NN is the simplest ML algorithm (a very good baseline; you should always try it!).
  • It is very easy to understand and often gives reasonable performance without a lot of adjustment.
    → It is a good baseline method to try before considering more advanced techniques.
  • No training involved ("lazy"). New training examples can be added easily.
  • Works well when data is low-dimensional (e.g., can compare against the Bayes optimal).
Strengths, Weaknesses

• Weaknesses
  • Data needs to be pre-processed (e.g., features rescaled, since distances are scale-sensitive).
  • Suffers when data is high-dimensional, because in high-dimensional spaces data points tend to be spread far away from each other.
  • It is expensive and slow: to determine the nearest neighbor of a new point x, we must compute the distance to all n training examples.
