
K-NEAREST NEIGHBOURS (KNN)

K-Nearest Neighbours (KNN) is a popular supervised machine learning algorithm used for classification and regression tasks. It is a non-parametric algorithm, meaning it doesn't make any assumptions about the underlying data distribution.

How Does KNN Work?

• In KNN, the "k" refers to the number of nearest neighbours to consider when making a prediction.
• To classify a new data point, the algorithm calculates the distances between that point and all the other data points in the training set.
• It then selects the "k" nearest neighbours based on these distances and assigns the class label by majority voting (for classification) or calculates the average of the neighbours' values (for regression), as in the sketch below.
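
To make these steps concrete, here is a minimal sketch of KNN classification in plain Python. The function name knn_predict and the list-based data layout are illustrative choices, not taken from any particular library:

from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k):
    # Distance from the query point to every training point (Euclidean).
    distances = [
        (math.dist(query, point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours.
    neighbours = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote among their labels (for regression, average the
    # neighbours' target values instead).
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]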
PRACTICAL EXAMPLE OF KNN

Here's a real-world example of how K-Nearest Neighbours (KNN) can be applied:

Fraud Detection in Credit Card Transactions:

KNN can be used in fraud detection systems to identify potentially fraudulent credit card transactions. Here's how it can be applied:
1. Dataset: We have a dataset of credit card transactions,
where each transaction is labelled as either fraudulent or
non-fraudulent. The dataset contains features such as
transaction amount, location, time, and other relevant
transaction details.

2. Data pre-processing: The dataset is pre-processed by handling missing values, performing feature scaling, and possibly reducing dimensionality using techniques like Principal Component Analysis (PCA).

3. Training: The pre-processed dataset is split into a training set and a test set. The training set is used to train the KNN model.

4. Choosing K: The value of K is chosen based on experimentation or cross-validation to achieve the best performance on held-out validation data.

5. Computing distances: For each test transaction, the distances between its features and the features of all transactions in the training set are calculated using a suitable distance metric (e.g., Euclidean distance).

6. Finding nearest neighbours: The K nearest neighbours
to the test transaction are determined based on the calculated
distances.

7. Classifying the transaction: The majority class label among the K nearest neighbours is assigned to the test transaction. If a majority of the neighbours are fraudulent, the test transaction is classified as fraudulent, and vice versa.

8. Evaluating the model: The performance of the KNN model is evaluated using metrics such as accuracy, precision, recall, and F1 score. The model's ability to correctly identify fraudulent transactions and minimize false positives is crucial.

By using KNN in this fraud detection scenario, the model can leverage the similarities between transactions to identify potentially fraudulent activities. It considers the characteristics of nearby transactions to make predictions, which can be effective in detecting fraudulent patterns that may not be easily captured by other algorithms.
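
The eight steps above map naturally onto a scikit-learn workflow. The sketch below is only an outline: it uses random placeholder data instead of a real transaction dataset, so the numbers it prints are meaningless; the point is the structure of scaling, splitting, choosing K by cross-validation, and evaluating the classifier.

import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Placeholder data: rows are transactions, columns are features such as
# amount, location and time; y is 1 for fraudulent, 0 for non-fraudulent.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Step 3: split into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: feature scaling matters because KNN is distance-based.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Step 4: choose K by cross-validation over a few candidate values.
search = GridSearchCV(pipeline, {"knn__n_neighbors": [3, 5, 7, 9, 11]}, cv=5)
search.fit(X_train, y_train)

# Steps 5-7 (distances, nearest neighbours, majority vote) happen inside predict().
y_pred = search.predict(X_test)

# Step 8: evaluate with accuracy, precision, recall and F1 score.
print("Best K:", search.best_params_["knn__n_neighbors"])
print(classification_report(y_test, y_pred))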
COMPUTE KNN: DISTANCE METRICS

Euclidean distance: This is the most commonly used distance measure, and it is limited to real-valued vectors. Using the formula below, it measures the straight-line distance between the query point and the other point being measured.
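
For two n-dimensional points x and y, the Euclidean distance is

    d(x, y) = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 }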

Manhattan distance: This is another popular distance metric, which measures the sum of the absolute differences between the coordinates of two points. It is also referred to as taxicab distance or city block distance, as it is commonly visualized with a grid, illustrating how one might navigate from one address to another via city streets.
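
In the same notation, the Manhattan distance is

    d(x, y) = \sum_{i=1}^{n} |x_i - y_i|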
Minkowski distance: This distance measure is the generalized form of the Euclidean and Manhattan distance metrics. The parameter p in the formula below allows other distance metrics to be created: Euclidean distance is recovered when p equals two, and Manhattan distance when p equals one.
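
The Minkowski distance with parameter p is

    d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}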

Hamming distance: This technique is typically used with Boolean or string vectors, identifying the points at which the vectors do not match. As a result, it has also been referred to as the overlap metric. It can be represented with the following formula:
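
    d(x, y) = \sum_{i=1}^{n} 1[x_i \neq y_i]

where 1[·] equals 1 when the condition holds and 0 otherwise, so the distance is simply the number of positions at which the two vectors differ. As a quick, self-contained illustration (the example vectors are arbitrary), all four metrics can be computed with NumPy:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))           # sqrt(9 + 4 + 0) ≈ 3.61
manhattan = np.sum(np.abs(x - y))                    # 3 + 2 + 0 = 5
minkowski = np.sum(np.abs(x - y) ** 3) ** (1 / 3)    # Minkowski with p = 3

a = np.array([1, 0, 1, 1])
b = np.array([1, 1, 0, 1])
hamming = np.sum(a != b)                             # vectors differ in 2 positions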
NUMERICAL EXAMPLE

Suppose we have a dataset of students with their exam scores and corresponding pass/fail labels. The exam scores are the features, and the pass/fail labels are the classes we want to predict. Here is a simplified version of the dataset:

Exam 1 Score   Exam 2 Score   Pass/Fail
73             80             Pass
92             88             Pass
89             91             Pass
96             98             Pass
50             45             Fail
71             80             Fail
60             62             Fail
78             68             Fail

Now, let's say we want to predict whether a student
with exam scores of 75 in Exam 1 and 70 in Exam 2 will
pass or fail.

Step 1: Choose the value of k. Let's say we choose k = 3, meaning we will consider the three nearest neighbours of the test instance.

Step 2: Compute the distances. To find the nearest neighbours, we need to calculate the distance between the test instance and all other instances in the dataset. In this example, we'll use Euclidean distance as the distance metric:

For the first instance in the dataset (73, 80):

Distance = sqrt((75-73)^2 + (70-80)^2) = sqrt(4 + 100) = sqrt(104) ≈ 10.20

Similarly, we calculate the distance for the remaining instances:

Exam 1 Score   Exam 2 Score   Pass/Fail   Distance
73             80             Pass        10.20
92             88             Pass        24.76
89             91             Pass        25.24
96             98             Pass        35.00
50             45             Fail        35.36
71             80             Fail        10.77
60             62             Fail        17.00
78             68             Fail        3.61

Step 3: Find the k nearest neighbours. Next, we need to select the k nearest neighbours based on the computed distances. In our case, k = 3, so we choose the three instances with the smallest distances:
Exam 1 Score   Exam 2 Score   Pass/Fail   Distance
78             68             Fail        3.61
73             80             Pass        10.20
71             80             Fail        10.77

Step 4: Determine the majority class. Finally, we determine the majority class among the selected nearest neighbours. In our case, two out of the three nearest neighbours belong to the "Fail" class. Therefore, we predict that the test instance with exam scores (75, 70) will fail.
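
The same prediction can be reproduced in a few lines of Python, which also serves as a check on the distance arithmetic above (math.dist computes the Euclidean distance):

import math
from collections import Counter

# The toy dataset from the table above: (Exam 1, Exam 2) scores and labels.
students = [
    ((73, 80), "Pass"), ((92, 88), "Pass"), ((89, 91), "Pass"), ((96, 98), "Pass"),
    ((50, 45), "Fail"), ((71, 80), "Fail"), ((60, 62), "Fail"), ((78, 68), "Fail"),
]
query = (75, 70)
k = 3

# Step 2: Euclidean distance from the query to every student.
scored = [(math.dist(query, scores), label) for scores, label in students]

# Step 3: the k instances with the smallest distances.
nearest = sorted(scored)[:k]

# Step 4: majority vote among the k nearest labels.
prediction = Counter(label for _, label in nearest).most_common(1)[0][0]
print(nearest)      # approximately [(3.61, 'Fail'), (10.20, 'Pass'), (10.77, 'Fail')]
print(prediction)   # Fail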
ADVANTAGES AND DISADVANTAGES

Advantages of KNN:
• Simplicity: KNN is a simple and intuitive algorithm that is easy to understand and implement.
• No training phase: KNN is a lazy learning algorithm, which means it does not require an explicit training phase. It memorizes the training data and uses it for classification or regression directly.
• Non-parametric: KNN makes no assumptions about the underlying data distribution, making it suitable for both linear and non-linear relationships.

Disadvantages of KNN:

• Computational complexity: KNN can be computationally expensive, especially with large datasets. As the size of the training data increases, the algorithm needs to calculate distances between the query instance and all training instances, which can be time-consuming.
• Memory usage: KNN requires storing the entire training dataset in memory, as it uses all instances for classification or regression. This can become a limitation when dealing with large datasets.
• Determining the optimal value of K: The choice of the number of neighbours, K, can significantly impact the performance of KNN. Selecting an appropriate value for K is crucial, and it requires some experimentation or cross-validation, as in the sketch below.
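
A common way to handle the last point is to score several candidate values of K with cross-validation and keep the best one. The sketch below uses random placeholder data purely to show the mechanics; in practice X and y would be your scaled feature matrix and class labels:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data standing in for a real (scaled) dataset.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = rng.integers(0, 2, size=300)

# Mean 5-fold cross-validation accuracy for each odd K from 1 to 15.
scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in range(1, 16, 2)
}
best_k = max(scores, key=scores.get)
print("Best K:", best_k)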
