
K-NEAREST NEIGHBOURS (KNN)

K-Nearest Neighbours (KNN) is a popular supervised machine learning algorithm used for classification and regression tasks. It is a non-parametric algorithm, meaning it doesn't make any assumptions about the underlying data distribution.

How Does KNN Work?

• In KNN, the "k" refers to the number of nearest neighbours to consider when making a prediction.
• To classify a new data point, the algorithm calculates the distances between that point and all the other data points in the training set.
• It then selects the "k" nearest neighbours based on these distances and assigns the class label by majority voting (for classification) or calculates the average of the neighbours' values (for regression), as in the sketch below.
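
To make these steps concrete, here is a minimal sketch of KNN classification in plain Python. The function name knn_predict and the list-based data layout are illustrative choices, not taken from any particular library:

from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k):
    # Distance from the query point to every training point (Euclidean).
    distances = [
        (math.dist(query, point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours.
    neighbours = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote among their labels (for regression, average the
    # neighbours' target values instead).
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]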
PRACTICAL EXAMPLE OF KNN

Here's a real-world example of how K-Nearest Neighbours (KNN) can be applied:

Fraud Detection in Credit Card Transactions:

KNN can be used in fraud detection systems to identify potentially fraudulent credit card transactions. Here's how it can be applied:
1. Dataset: We have a dataset of credit card transactions,
where each transaction is labelled as either fraudulent or
non-fraudulent. The dataset contains features such as
transaction amount, location, time, and other relevant
transaction details.

2. Data pre-processing: The dataset is pre-processed by handling missing values, performing feature scaling, and possibly reducing dimensionality using techniques like Principal Component Analysis (PCA).

3. Training: The pre-processed dataset is split into a training set and a test set. The training set is used to train the KNN model.

4. Choosing K: The value of K is chosen based on experimentation or cross-validation to achieve the best performance on held-out validation data.

5. Computing distances: For each test transaction, the distances between its features and the features of all transactions in the training set are calculated using a suitable distance metric (e.g., Euclidean distance).

6. Finding nearest neighbours: The K nearest neighbours
to the test transaction are determined based on the calculated
distances.

7. Classifying the transaction: The majority class label among the K nearest neighbours is assigned to the test transaction. If a majority of the neighbours are fraudulent, the test transaction is classified as fraudulent, and vice versa.

8. Evaluating the model: The performance of the KNN model is evaluated using metrics such as accuracy, precision, recall, and F1 score. The model's ability to correctly identify fraudulent transactions and minimize false positives is crucial.

By using KNN in this fraud detection scenario, the model can leverage the similarities between transactions to identify potentially fraudulent activities. It considers the characteristics of nearby transactions to make predictions, which can be effective in detecting fraudulent patterns that may not be easily captured by other algorithms.
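
The eight steps above map naturally onto a scikit-learn workflow. The sketch below is only an outline: it uses random placeholder data instead of a real transaction dataset, so the numbers it prints are meaningless; the point is the structure of scaling, splitting, choosing K by cross-validation, and evaluating the classifier.

import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Placeholder data: rows are transactions, columns are features such as
# amount, location and time; y is 1 for fraudulent, 0 for non-fraudulent.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Step 3: split into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: feature scaling matters because KNN is distance-based.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Step 4: choose K by cross-validation over a few candidate values.
search = GridSearchCV(pipeline, {"knn__n_neighbors": [3, 5, 7, 9, 11]}, cv=5)
search.fit(X_train, y_train)

# Steps 5-7 (distances, nearest neighbours, majority vote) happen inside predict().
y_pred = search.predict(X_test)

# Step 8: evaluate with accuracy, precision, recall and F1 score.
print("Best K:", search.best_params_["knn__n_neighbors"])
print(classification_report(y_test, y_pred))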
COMPUTE KNN: DISTANCE METRICS

Euclidean distance: This is the most commonly used distance measure, and it is limited to real-valued vectors. Using the formula below, it measures the straight-line distance between the query point and the other point being measured.
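
For two n-dimensional points x and y, the Euclidean distance is

    d(x, y) = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 }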

Manhattan distance: This is another popular distance metric, which measures the sum of the absolute differences between the coordinates of two points. It is also referred to as taxicab distance or city block distance, as it is commonly visualized with a grid, illustrating how one might navigate from one address to another via city streets.
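
In the same notation, the Manhattan distance is

    d(x, y) = \sum_{i=1}^{n} |x_i - y_i|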
Minkowski distance: This distance measure is the generalized form of the Euclidean and Manhattan distance metrics. The parameter p in the formula below allows other distance metrics to be created: Euclidean distance is recovered when p equals two, and Manhattan distance when p equals one.
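
The Minkowski distance with parameter p is

    d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}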

Hamming distance: This technique is typically used with Boolean or string vectors, identifying the points at which the vectors do not match. As a result, it has also been referred to as the overlap metric. It can be represented with the following formula:
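
    d(x, y) = \sum_{i=1}^{n} 1[x_i \neq y_i]

where 1[·] equals 1 when the condition holds and 0 otherwise, so the distance is simply the number of positions at which the two vectors differ. As a quick, self-contained illustration (the example vectors are arbitrary), all four metrics can be computed with NumPy:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))           # sqrt(9 + 4 + 0) ≈ 3.61
manhattan = np.sum(np.abs(x - y))                    # 3 + 2 + 0 = 5
minkowski = np.sum(np.abs(x - y) ** 3) ** (1 / 3)    # Minkowski with p = 3

a = np.array([1, 0, 1, 1])
b = np.array([1, 1, 0, 1])
hamming = np.sum(a != b)                             # vectors differ in 2 positions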
NUMERICAL EXAMPLE

Suppose we have a dataset of students with their exam scores and corresponding pass/fail labels. The exam scores are the features, and the pass/fail labels are the classes we want to predict. Here is a simplified version of the dataset:

Exam 1 Score   Exam 2 Score   Pass/Fail
73             80             Pass
92             88             Pass
89             91             Pass
96             98             Pass
50             45             Fail
71             80             Fail
60             62             Fail
78             68             Fail

Now, let's say we want to predict whether a student
with exam scores of 75 in Exam 1 and 70 in Exam 2 will
pass or fail.

Step 1: Choose the value of k. Let's say we choose k = 3, meaning we will consider the three nearest neighbours of the test instance.

Step 2: Compute the distances. To find the nearest neighbours, we need to calculate the distance between the test instance and all other instances in the dataset. In this example, we'll use Euclidean distance as the distance metric:

For the first instance in the dataset (73, 80):

Distance = sqrt((75-73)^2 + (70-80)^2) = sqrt(4 + 100) = sqrt(104) ≈ 10.20

Similarly, we calculate the distance for the remaining instances:

Exam 1 Score   Exam 2 Score   Pass/Fail   Distance
73             80             Pass        10.20
92             88             Pass        24.76
89             91             Pass        25.24
96             98             Pass        35.00
50             45             Fail        35.36
71             80             Fail        10.77
60             62             Fail        17.00
78             68             Fail        3.61

Step 3: Find the k nearest neighbours. Next, we need to select the k nearest neighbours based on the computed distances. In our case, k = 3, so we choose the three instances with the smallest distances:
Exam 1 Score   Exam 2 Score   Pass/Fail   Distance
78             68             Fail        3.61
73             80             Pass        10.20
71             80             Fail        10.77

Step 4: Determine the majority class. Finally, we determine the majority class among the selected nearest neighbours. In our case, two out of the three nearest neighbours belong to the "Fail" class. Therefore, we predict that the test instance with exam scores (75, 70) will fail.
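
The same prediction can be reproduced in a few lines of Python, which also serves as a check on the distance arithmetic above (math.dist computes the Euclidean distance):

import math
from collections import Counter

# The toy dataset from the table above: (Exam 1, Exam 2) scores and labels.
students = [
    ((73, 80), "Pass"), ((92, 88), "Pass"), ((89, 91), "Pass"), ((96, 98), "Pass"),
    ((50, 45), "Fail"), ((71, 80), "Fail"), ((60, 62), "Fail"), ((78, 68), "Fail"),
]
query = (75, 70)
k = 3

# Step 2: Euclidean distance from the query to every student.
scored = [(math.dist(query, scores), label) for scores, label in students]

# Step 3: the k instances with the smallest distances.
nearest = sorted(scored)[:k]

# Step 4: majority vote among the k nearest labels.
prediction = Counter(label for _, label in nearest).most_common(1)[0][0]
print(nearest)      # approximately [(3.61, 'Fail'), (10.20, 'Pass'), (10.77, 'Fail')]
print(prediction)   # Fail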
ADVANTAGES AND DISADVANTAGES

Advantages of KNN:
• Simplicity: KNN is a simple and intuitive algorithm that is easy to understand and implement.
• No training phase: KNN is a lazy learning algorithm, which means it does not require an explicit training phase. It memorizes the training data and uses it for classification or regression directly.
• Non-parametric: KNN makes no assumptions about the underlying data distribution, making it suitable for both linear and non-linear relationships.

Disadvantages of KNN:

• Computational complexity: KNN can be computationally expensive, especially with large datasets. As the size of the training data increases, the algorithm needs to calculate distances between the query instance and all training instances, which can be time-consuming.
• Memory usage: KNN requires storing the entire training dataset in memory, as it uses all instances for classification or regression. This can become a limitation when dealing with large datasets.
• Determining the optimal value of K: The choice of the number of neighbours, K, can significantly impact the performance of KNN. Selecting an appropriate value for K is crucial, and it requires some experimentation or cross-validation, as in the sketch below.
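
A common way to handle the last point is to score several candidate values of K with cross-validation and keep the best one. The sketch below uses random placeholder data purely to show the mechanics; in practice X and y would be your scaled feature matrix and class labels:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data standing in for a real (scaled) dataset.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = rng.integers(0, 2, size=300)

# Mean 5-fold cross-validation accuracy for each odd K from 1 to 15.
scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in range(1, 16, 2)
}
best_k = max(scores, key=scores.get)
print("Best K:", best_k)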
