
Machine Learning 21BEC505

Experiment-4
Objective: Implementation of k-Nearest Neighbour Classification
Task #1
Code:
import pandas as pd
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

data = {
    'x1': [0.8, 1, 1.2, 0.8, 1.2, 4, 3.8, 4.2, 3.8, 4.2, 4.4, 4.4, 3.2, 3.2, 3.8, 3.5, 4, 4],
    'x2': [0.8, 1, 0.8, 1.2, 1.2, 3, 2.8, 2.8, 3.2, 3.2, 2.8, 3.2, 0.4, 0.7, 0.5, 1, 1, 0.7],
    'class': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C'],
}
df = pd.DataFrame(data)
print(df)
print('\n')
sns.scatterplot(data=df, x='x1',y='x2',hue='class',palette=['red','blue','green'])
plt.show()

sc = StandardScaler()   # standardize features to zero mean and unit variance
X = sc.fit_transform(df[['x1', 'x2']])
print(X)
print('\n')
y = df['class']

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=0)

sns.scatterplot(data=pd.DataFrame(X_train, columns=['x1', 'x2']), x='x1', y='x2', hue=y_train, palette=['red', 'purple', 'yellow'])
plt.show()
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

cm = confusion_matrix(y_test, y_pred)   # rows/columns follow sorted label order: A, B, C
print(cm)

X_new = sc.transform([[3, 2], [4.2, 1.8]])   # scale new points with the already-fitted scaler
y_new = knn.predict(X_new)
print(y_new)

sns.scatterplot(data=pd.DataFrame(X_train, columns=['x1', 'x2']), x='x1', y='x2', hue=y_train, palette=['red', 'blue', 'green'])
plt.scatter(X_new[:,0], X_new[:,1], color='purple', marker='x', s=100)
plt.show()
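Note that the scatter plot above shows the new points in standardized units; a minimal sketch to recover their original coordinates from the fitted scaler (names taken from the code above):

X_new_orig = sc.inverse_transform(X_new)
print(X_new_orig)   # approximately [[3.  2. ] [4.2 1.8]]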

Output:

Data Plot

Training Data

New Test Point P1(3,2), P2(4.2,1.8)

Task #2
Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix,accuracy_score
from matplotlib.colors import ListedColormap

df = pd.read_csv(r"E:\Jay\NIRMA\Sem6\ML\Exp4\iris.data", header=None)  # iris.data has no header row
x = df.iloc[:, [0, 1]].values   # features: sepal length, sepal width
y = df.iloc[:, -1].values       # labels: Iris-setosa / Iris-versicolor / Iris-virginica

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 1/3, random_state = 0)

model = KNeighborsClassifier(n_neighbors=3, metric='minkowski', p=2)   # p=2 with Minkowski gives Euclidean distance
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

# Map string labels to integers for coloring the scatter plots
values = []

for label in y_train:
    if label == "Iris-setosa":
        values.append(0)
    elif label == "Iris-versicolor":
        values.append(1)
    else:
        values.append(2)

scatter = plt.scatter(X_train[:, 0], X_train[:, 1], c=values, cmap=ListedColormap(["red", "blue", "green"]))
plt.legend(handles=scatter.legend_elements()[0], labels=["Iris-setosa", "Iris-versicolor", "Iris-virginica"], title="Class")
plt.xlabel("Sepal-length")
plt.ylabel("Sepal-width")
plt.title("Training data")
plt.show()

values = []

for label in y_test:
    if label == "Iris-setosa":
        values.append(0)
    elif label == "Iris-versicolor":
        values.append(1)
    else:
        values.append(2)

scatter = plt.scatter(X_test[:, 0], X_test[:, 1], c=values, cmap=ListedColormap(["red", "blue", "green"]))
plt.legend(handles=scatter.legend_elements()[0], labels=["Iris-setosa", "Iris-versicolor", "Iris-virginica"], title="Class")
plt.xlabel("Sepal-length")
plt.ylabel("Sepal-width")
plt.title("Testing data")
plt.show()

values = []

for label in y_pred:
    if label == "Iris-setosa":
        values.append(0)
    elif label == "Iris-versicolor":
        values.append(1)
    else:
        values.append(2)

scatter = plt.scatter(X_test[:, 0], X_test[:, 1], c=values, cmap=ListedColormap(["red", "blue", "green"]))
plt.legend(handles=scatter.legend_elements()[0], labels=["Iris-setosa", "Iris-versicolor", "Iris-virginica"], title="Class")
plt.xlabel("Sepal-length")
plt.ylabel("Sepal-width")
plt.title("Predicted data")
plt.show()
print(f"Error rate for k = 3 is {1-accuracy}")

model = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
print(f"Error rate for k = 5 is {1-accuracy}")

model = KNeighborsClassifier(n_neighbors=9, metric='minkowski', p=2)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
print(f"Error rate for k = 9 is {1-accuracy}")

model = KNeighborsClassifier(n_neighbors=15, metric='minkowski', p=2)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
print(f"Error rate for k = 15 is {1-accuracy}")
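The four blocks above differ only in the value of k; as a compact alternative, the same sweep can be written as a loop (a sketch using the names already defined above):

for k in [3, 5, 9, 15]:
    model = KNeighborsClassifier(n_neighbors=k, metric='minkowski', p=2)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"Error rate for k = {k} is {1-accuracy}")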

Output:

Post-Lab Exercise:

1. Apply kNN on the same dataset with different features.


Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix,accuracy_score
from matplotlib.colors import ListedColormap

df = pd.read_csv(r"E:\Jay\NIRMA\Sem6\ML\Exp4\iris.data", header=None)  # iris.data has no header row
x = df.iloc[:, [2, 3]].values   # features: petal length, petal width
y = df.iloc[:, -1].values

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 1/3, random_state = 0)


model = KNeighborsClassifier(n_neighbors=3, metric='minkowski', p=2)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test,y_pred)

values = []
for label in y_train:
    if label == "Iris-setosa":
        values.append(0)
    elif label == "Iris-versicolor":
        values.append(1)
    else:
        values.append(2)

scatter = plt.scatter(X_train[:, 0], X_train[:, 1], c=values, cmap=ListedColormap(["red", "blue", "green"]))
plt.legend(handles=scatter.legend_elements()[0], labels=["Iris-setosa", "Iris-versicolor", "Iris-virginica"], title="Class")
plt.xlabel("Petal-length")
plt.ylabel("Petal-width")
plt.title("Training data")
plt.show()

values = []
for label in y_test:
    if label == "Iris-setosa":
        values.append(0)
    elif label == "Iris-versicolor":
        values.append(1)
    else:
        values.append(2)
scatter = plt.scatter(X_test[:, 0], X_test[:, 1], c=values, cmap=ListedColormap(["red", "blue", "green"]))
plt.legend(handles=scatter.legend_elements()[0], labels=["Iris-setosa", "Iris-versicolor", "Iris-virginica"], title="Class")
plt.xlabel("Petal-length")
plt.ylabel("Petal-width")
plt.title("Testing data")
plt.show()

values = []
for label in y_pred:
    if label == "Iris-setosa":
        values.append(0)
    elif label == "Iris-versicolor":
        values.append(1)
    else:
        values.append(2)

scatter = plt.scatter(X_test[:, 0], X_test[:, 1], c=values, cmap=ListedColormap(["red", "blue", "green"]))
plt.legend(handles=scatter.legend_elements()[0], labels=["Iris-setosa", "Iris-versicolor", "Iris-virginica"], title="Class")
plt.xlabel("Petal-length")
plt.ylabel("Petal-width")
plt.title("Predicted data")
plt.show()
print(f"Error rate for k = 3 is {1-accuracy}")

model = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
print(f"Error rate for k = 5 is {1-accuracy}")

model = KNeighborsClassifier(n_neighbors=9, metric='minkowski', p=2)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
print(f"Error rate for k = 9 is {1-accuracy}")

model = KNeighborsClassifier(n_neighbors=15, metric='minkowski', p=2)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
print(f"Error rate for k = 15 is {1-accuracy}")
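To see how the error rate varies with k more generally, one can sweep a range of values and plot the result; a minimal sketch, assuming the splits defined above:

ks = range(1, 26)
errors = []
for k in ks:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    errors.append(1 - accuracy_score(y_test, model.predict(X_test)))

plt.plot(ks, errors, marker='o')
plt.xlabel("k")
plt.ylabel("Error rate")
plt.title("Error rate vs k")
plt.show()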

Output:

Self-Evaluation:

1. What is the significance of precision and recall?

• Precision measures the accuracy of the model's positive predictions. It is defined as the ratio of true positives (instances correctly identified as positive) to the total number of instances the model predicted as positive. A high precision indicates that the model makes very few false positive predictions, i.e., most of the instances it labels positive really are positive.

• Recall measures the completeness of the model's positive predictions. It is defined as the ratio of true positives to the total number of actual positive instances. A high recall indicates that the model identifies most of the positive instances, i.e., it misses very few of them.
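
As a minimal sketch, both metrics can be computed with scikit-learn's precision_score and recall_score (using macro averaging since the iris problem has three classes; y_test and y_pred as produced in Task #2):

from sklearn.metrics import precision_score, recall_score

print("Precision:", precision_score(y_test, y_pred, average='macro'))
print("Recall:", recall_score(y_test, y_pred, average='macro'))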

2. What is a hyperparameter? Which are the hyperparameters for kNN?

• Hyperparameters are parameters that cannot be learned from the training data directly and must be set before training the model. They affect the behavior of the model and can be tuned to optimize its performance on the validation data.

• In the k-Nearest Neighbors (kNN) algorithm, some of the commonly used hyperparameters are listed below (a tuning sketch follows the list):

  • k: the number of nearest neighbors to consider when making predictions. This parameter can significantly affect the model's performance: a small value of k may result in overfitting, while a large value of k may result in underfitting.
  • Distance metric: the metric used to calculate the distance between data points. Commonly used metrics are the Euclidean, Manhattan, and Minkowski distances.
  • Weights: the weights assigned to each nearest neighbor when making predictions. The two common weighting schemes are uniform weights (all neighbors weighted equally) and distance weights (closer neighbors given more weight).
  • Data preprocessing: the preprocessing steps applied to the data before training the model, such as scaling or normalization of the feature values.
  • Algorithm optimization: the algorithm used to find the nearest neighbors, such as brute force or optimized data structures like the KD-tree.
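
A minimal tuning sketch using scikit-learn's GridSearchCV over the kNN hyperparameters named above (the grid values here are illustrative, and X_train, y_train are the splits from Task #2):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {
    'n_neighbors': [3, 5, 9, 15],
    'weights': ['uniform', 'distance'],
    'p': [1, 2],   # 1 = Manhattan, 2 = Euclidean (both in the Minkowski family)
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)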

Conclusion:
From this experiment we conclude that kNN is a supervised technique in which a data point is assigned to a class based on its k nearest neighbours, typically by calculating the Euclidean distance to every training point. There is no single correct value of k, but it is commonly suggested to take k as the square root of n, where n is the number of data values. The method is considered costly because the distance to every training point must be calculated at prediction time. We trained and tested the model using kNN and calculated the accuracy for different values of k; we observed that no particular value of k is mathematically guaranteed to give the best accuracy, so the square-root-of-n heuristic is a reasonable starting point.
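
As a quick sketch of that rule of thumb (rounding to an odd k is a common convention to reduce voting ties; X_train as defined above):

import numpy as np

n = len(X_train)        # number of training samples
k = int(np.sqrt(n))
if k % 2 == 0:
    k += 1              # prefer an odd k to avoid tied votes
print(f"Suggested k for n = {n}: {k}")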
