
EXPERIMENT 6

Aim: Implementation of the K-Nearest Neighbour (KNN) supervised algorithm for classification.

COURSE OUTCOMES

CO3 Identify and implement simple learning strategies using data science and statistics principles.

CO4 Evaluate a machine learning model's performance and apply learning strategies to improve the
performance of supervised and unsupervised learning models.

K-Nearest Neighbour is one of the simplest machine learning algorithms, based on the supervised
learning technique. The K-NN algorithm assumes similarity between the new case/data and the available
cases, and puts the new case into the category that is most similar to the available categories. It stores
all the available data and classifies a new data point based on that similarity, which means that when
new data appears it can easily be assigned to a well-suited category. K-NN can be used for regression as
well as classification, but it is mostly used for classification problems. K-NN is a non-parametric
algorithm, meaning it makes no assumptions about the underlying data. It is also called a lazy learner
algorithm because it does not learn from the training set immediately; instead, it stores the dataset and
performs the actual work at classification time. In other words, the training phase consists only of
storing the dataset, and when new data arrives it is classified into the category most similar to it.
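As a quick illustration of K-NN serving both tasks, here is a minimal sketch with scikit-learn's KNeighborsClassifier and KNeighborsRegressor (the one-feature toy data is invented purely for illustration):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# One-feature toy data, invented purely for illustration
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])              # discrete labels
y_reg = np.array([1.1, 2.0, 3.2, 9.8, 11.1, 12.2])  # continuous targets

# Classification: majority vote among the 3 nearest neighbours
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
print(clf.predict([[2.5]]))  # -> [0]

# Regression: mean of the 3 nearest neighbours' targets
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_reg)
print(reg.predict([[2.5]]))  # -> mean of 1.1, 2.0 and 3.2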

Why do you need to scale your data for the k-NN algorithm?

Suppose a dataset has m examples and n features. One feature dimension has values lying exactly
between 0 and 1, while another varies from -99999 to 99999. Given the formula for Euclidean distance,
the feature with the larger magnitude will dominate the distance calculation, effectively receiving a
higher weight and hurting performance.
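The usual remedy is to standardize the features before computing distances; a minimal sketch using scikit-learn's StandardScaler (the feature ranges below are invented for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on wildly different scales (invented values)
X = np.array([[0.5, 20000.0],
              [0.1, -90000.0],
              [0.9, 55000.0]])

X_scaled = StandardScaler().fit_transform(X)  # each column now has mean 0 and unit variance
print(X_scaled)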

Is Euclidean Distance always the right choice?

Although Euclidean distance is the most common metric used and taught, it is not always the optimal
choice. In fact, it is hard to pick the right metric just by looking at the data, so it is worth trying a set
of them. There are also some special cases: for instance, Hamming distance is used for categorical
variables.
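In scikit-learn, the metric can be swapped via the metric parameter of KNeighborsClassifier; a small sketch of a few supported options:

from sklearn.neighbors import KNeighborsClassifier

# Euclidean is the default; Manhattan is a common alternative worth trying
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
# Hamming distance for binary/categorical encodings (scikit-learn falls back
# to a brute-force neighbour search for this metric)
knn_hamming = KNeighborsClassifier(n_neighbors=5, metric='hamming')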

Why should we not use the KNN algorithm for large datasets?

Here is an overview of the data flow that occurs in the KNN algorithm:

1. Calculate the distances to all vectors in the training set and store them

2. Sort the calculated distances

3. Keep the K nearest vectors

4. Determine the most frequent class among the K nearest vectors

Imagine you have a very large dataset. Not only is storing that much data a bad idea, it is also
computationally costly to keep calculating and sorting all the distances for every query (see the sketch below).
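The four steps above map directly onto code; a minimal from-scratch sketch (NumPy only, written for clarity rather than efficiency):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Step 1: compute the distance from x_new to every stored training vector
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Steps 2 and 3: sort the distances and keep the indices of the K nearest
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the labels of the K nearest vectors
    return Counter(np.asarray(y_train)[nearest]).most_common(1)[0][0]

Every prediction repeats the full distance pass over the stored training set, which is exactly why both the memory and the compute cost grow with dataset size.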

What are the advantages and disadvantages of the K-NN algorithm?

Advantages

It is simple to implement.

It is robust to noisy training data.

It can be more effective when the training data is large.

Disadvantages

The value of K always needs to be determined, which can sometimes be complex.

The computation cost is high because the distance to every training sample must be calculated for each
prediction.
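Since determining K is the main practical difficulty, a common remedy is to search candidate values with cross-validation. A sketch using scikit-learn's GridSearchCV on synthetic data (the candidate K values are arbitrary choices for illustration):

from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic problem just for the demonstration
X_demo, y_demo = make_blobs(n_samples=200, n_features=2, centers=2, random_state=0)

# Odd K values help avoid voting ties in binary problems
param_grid = {'n_neighbors': [1, 3, 5, 7, 9, 11]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_)  # the K value with the best cross-validated accuracy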

What are the steps to implement the K-NN algorithm?

Data pre-processing step

Fitting the K-NN algorithm to the training set

Predicting the test result

Testing the accuracy of the result (creation of a confusion matrix)

Visualizing the test set result.

Implementation of KNN Algorithm in Python

Let’s now get into the implementation of KNN in Python. We’ll go over the steps to help you break the
code down and make better sense of it.

1. Importing the modules


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

2. Creating Dataset

Scikit-learn has a lot of tools for creating synthetic datasets, which are great for testing machine
learning algorithms. We'll use the make_blobs method.

X, y = make_blobs(n_samples=500, n_features=2, centers=4, cluster_std=1.5, random_state=4)

This code generates a dataset of 500 samples separated into four classes, with two features in total.
Using the associated parameters, you can quickly change the number of samples, features, and classes,
as well as the spread (cluster_std) of each cluster.
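A quick shape check confirms what was generated (the expected output follows from the parameters above):

print(X.shape)       # (500, 2): 500 samples, 2 features
print(np.unique(y))  # [0 1 2 3]: four class labels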

3. Visualize the Dataset

plt.style.use('seaborn-v0_8')  # the 'seaborn' style was renamed in newer Matplotlib versions
plt.figure(figsize=(10, 10))
plt.scatter(X[:, 0], X[:, 1], c=y, marker='*', s=100, edgecolors='black')
plt.show()

[Figure: scatter plot of the generated dataset, coloured by class]

4. Splitting Data into Training and Testing Datasets

It is critical to partition a dataset into train and test sets for any supervised machine learning method.
We first train the model and then test it on a separate portion of the dataset; if we didn't separate the
data, we would simply be testing the model on data it has already seen. Using the train_test_split
method, we can easily perform this split.

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


With the train_size and test_size options, we can control how much of the original data is used for the
train and test sets, respectively. The default split is 75% for the train set and 25% for the test set.
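For instance, the defaults can be made explicit; a sketch (the variable names X_tr, X_te, etc. are just illustrative, so the split used above remains untouched):

# Making the default 75/25 split explicit; stratify=y would additionally
# preserve the class proportions in both halves
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
print(X_tr.shape, X_te.shape)  # (375, 2) (125, 2)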

5. KNN Classifier Implementation

Next, we build the kNN classifier objects. We create two classifiers, with k values of 1 and 5, to
demonstrate the relevance of the k value. The models are then trained on the train set. The k value is
chosen with the n_neighbors argument; it does not need to be specified explicitly, because the default
value is 5.

knn5 = KNeighborsClassifier(n_neighbors=5)
knn1 = KNeighborsClassifier(n_neighbors=1)

6. Predictions for the KNN Classifiers

Then we predict the target values for the test set and compare them with the actual values.

knn5.fit(X_train, y_train)
knn1.fit(X_train, y_train)

y_pred_5 = knn5.predict(X_test)
y_pred_1 = knn1.predict(X_test)

7. Prediction Accuracy for Both k Values

from sklearn.metrics import accuracy_score

print("Accuracy with k=5", accuracy_score(y_test, y_pred_5)*100)
print("Accuracy with k=1", accuracy_score(y_test, y_pred_1)*100)

The accuracy for the two values of k comes out as follows:

Accuracy with k=5 93.60000000000001
Accuracy with k=1 90.4
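The implementation plan above also calls for a confusion matrix; a short sketch using scikit-learn's confusion_matrix for the k=5 predictions:

from sklearn.metrics import confusion_matrix

# Rows are the true classes, columns the predicted classes
print(confusion_matrix(y_test, y_pred_5))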


8. Visualize Predictions

Let's plot the predicted values on the test set with k=5 and k=1 to see the influence of the k value.

plt.figure(figsize=(15, 5))

plt.subplot(1, 2, 1)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred_5, marker='*', s=100, edgecolors='black')
plt.title("Predicted values with k=5", fontsize=20)

plt.subplot(1, 2, 2)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred_1, marker='*', s=100, edgecolors='black')
plt.title("Predicted values with k=1", fontsize=20)

plt.show()

[Figure: side-by-side scatter plots of test-set predictions with k=5 and k=1]

Viva Questions

Why is KNN a non-parametric Algorithm?

Why is the KNN Algorithm known as Lazy Learner?

Why is it recommended not to use the KNN Algorithm for large datasets?

How to choose the optimal value of K in the KNN Algorithm?

How can you relate KNN Algorithm to the Bias-Variance tradeoff?
