
Machine Learning 21BEC505

Experiment-4
Objective: Implementation of k-Nearest Neighbour Classification
Task #1
Code:
import pandas as pd
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

data = {
    'x1': [0.8, 1, 1.2, 0.8, 1.2, 4, 3.8, 4.2, 3.8, 4.2, 4.4, 4.4, 3.2, 3.2, 3.8, 3.5, 4, 4],
    'x2': [0.8, 1, 0.8, 1.2, 1.2, 3, 2.8, 2.8, 3.2, 3.2, 2.8, 3.2, 0.4, 0.7, 0.5, 1, 1, 0.7],
    'class': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C'],
}
df = pd.DataFrame(data)
print(df)
print('\n')
sns.scatterplot(data=df, x='x1',y='x2',hue='class',palette=['red','blue','green'])
plt.show()

sc = StandardScaler()   # standardize features to zero mean and unit variance
X = sc.fit_transform(df[['x1', 'x2']])
print(X)
print('\n')
y = df['class']

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=0)

sns.scatterplot(data=pd.DataFrame(X_train, columns=['x1', 'x2']), x='x1', y='x2', hue=y_train, palette=['red', 'purple', 'yellow'])
plt.show()
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

cm = confusion_matrix(y_test, y_pred)   # rows/columns follow sorted label order: A, B, C
print(cm)

X_new = sc.transform([[3, 2], [4.2, 1.8]])   # scale new points with the already-fitted scaler
y_new = knn.predict(X_new)
print(y_new)

sns.scatterplot(data=pd.DataFrame(X_train, columns=['x1', 'x2']), x='x1', y='x2', hue=y_train, palette=['red', 'blue', 'green'])
plt.scatter(X_new[:,0], X_new[:,1], color='purple', marker='x', s=100)
plt.show()
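Note that the scatter plot above shows the new points in standardized units; a minimal sketch to recover their original coordinates from the fitted scaler (names taken from the code above):

X_new_orig = sc.inverse_transform(X_new)
print(X_new_orig)   # approximately [[3.  2. ] [4.2 1.8]]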

Output:

Data Plot

Training Data

New Test Point P1(3,2), P2(4.2,1.8)

Task #2
Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix,accuracy_score
from matplotlib.colors import ListedColormap

df = pd.read_csv(r"E:\Jay\NIRMA\Sem6\ML\Exp4\iris.data", header=None)  # iris.data has no header row
x = df.iloc[:, [0, 1]].values   # features: sepal length, sepal width
y = df.iloc[:, -1].values       # labels: Iris-setosa / Iris-versicolor / Iris-virginica

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 1/3, random_state = 0)

model = KNeighborsClassifier(n_neighbors=3, metric='minkowski', p=2)   # p=2 with Minkowski gives Euclidean distance
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

# Map string labels to integers for coloring the scatter plots
values = []

for label in y_train:
    if label == "Iris-setosa":
        values.append(0)
    elif label == "Iris-versicolor":
        values.append(1)
    else:
        values.append(2)

scatter = plt.scatter(X_train[:, 0], X_train[:, 1], c=values, cmap=ListedColormap(["red", "blue", "green"]))
plt.legend(handles=scatter.legend_elements()[0], labels=["Iris-setosa", "Iris-versicolor", "Iris-virginica"], title="Class")
plt.xlabel("Sepal-length")
plt.ylabel("Sepal-width")
plt.title("Training data")
plt.show()

values = []

for label in y_test:
    if label == "Iris-setosa":
        values.append(0)
    elif label == "Iris-versicolor":
        values.append(1)
    else:
        values.append(2)

scatter = plt.scatter(X_test[:, 0], X_test[:, 1], c=values, cmap=ListedColormap(["red", "blue", "green"]))
plt.legend(handles=scatter.legend_elements()[0], labels=["Iris-setosa", "Iris-versicolor", "Iris-virginica"], title="Class")
plt.xlabel("Sepal-length")
plt.ylabel("Sepal-width")
plt.title("Testing data")
plt.show()

values = []

for label in y_pred:
    if label == "Iris-setosa":
        values.append(0)
    elif label == "Iris-versicolor":
        values.append(1)
    else:
        values.append(2)

scatter = plt.scatter(X_test[:, 0], X_test[:, 1], c=values, cmap=ListedColormap(["red", "blue", "green"]))
plt.legend(handles=scatter.legend_elements()[0], labels=["Iris-setosa", "Iris-versicolor", "Iris-virginica"], title="Class")
plt.xlabel("Sepal-length")
plt.ylabel("Sepal-width")
plt.title("Predicted data")
plt.show()
print(f"Error rate for k = 3 is {1-accuracy}")

model = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
print(f"Error rate for k = 5 is {1-accuracy}")

model = KNeighborsClassifier(n_neighbors=9, metric='minkowski', p=2)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
print(f"Error rate for k = 9 is {1-accuracy}")

model = KNeighborsClassifier(n_neighbors=15, metric='minkowski', p=2)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
print(f"Error rate for k = 15 is {1-accuracy}")
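The four blocks above differ only in the value of k; as a compact alternative, the same sweep can be written as a loop (a sketch using the names already defined above):

for k in [3, 5, 9, 15]:
    model = KNeighborsClassifier(n_neighbors=k, metric='minkowski', p=2)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"Error rate for k = {k} is {1-accuracy}")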

Output:

Post-Lab Exercise:

1. Apply kNN on the same dataset with different features.


Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix,accuracy_score
from matplotlib.colors import ListedColormap

df = pd.read_csv(r"E:\Jay\NIRMA\Sem6\ML\Exp4\iris.data", header=None)  # iris.data has no header row
x = df.iloc[:, [2, 3]].values   # features: petal length, petal width
y = df.iloc[:, -1].values

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 1/3, random_state = 0)


model = KNeighborsClassifier(n_neighbors=3, metric='minkowski', p=2)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test,y_pred)

values = []
for label in y_train:
    if label == "Iris-setosa":
        values.append(0)
    elif label == "Iris-versicolor":
        values.append(1)
    else:
        values.append(2)

scatter = plt.scatter(X_train[:, 0], X_train[:, 1], c=values, cmap=ListedColormap(["red", "blue", "green"]))
plt.legend(handles=scatter.legend_elements()[0], labels=["Iris-setosa", "Iris-versicolor", "Iris-virginica"], title="Class")
plt.xlabel("Petal-length")
plt.ylabel("Petal-width")
plt.title("Training data")
plt.show()

values = []
for label in y_test:
    if label == "Iris-setosa":
        values.append(0)
    elif label == "Iris-versicolor":
        values.append(1)
    else:
        values.append(2)
scatter = plt.scatter(X_test[:, 0], X_test[:, 1], c=values, cmap=ListedColormap(["red", "blue", "green"]))
plt.legend(handles=scatter.legend_elements()[0], labels=["Iris-setosa", "Iris-versicolor", "Iris-virginica"], title="Class")
plt.xlabel("Petal-length")
plt.ylabel("Petal-width")
plt.title("Testing data")
plt.show()

values = []
for label in y_pred:
    if label == "Iris-setosa":
        values.append(0)
    elif label == "Iris-versicolor":
        values.append(1)
    else:
        values.append(2)

scatter = plt.scatter(X_test[:, 0], X_test[:, 1], c=values, cmap=ListedColormap(["red", "blue", "green"]))
plt.legend(handles=scatter.legend_elements()[0], labels=["Iris-setosa", "Iris-versicolor", "Iris-virginica"], title="Class")
plt.xlabel("Petal-length")
plt.ylabel("Petal-width")
plt.title("Predicted data")
plt.show()
print(f"Error rate for k = 3 is {1-accuracy}")

model = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
print(f"Error rate for k = 5 is {1-accuracy}")

model = KNeighborsClassifier(n_neighbors=9, metric='minkowski', p=2)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
print(f"Error rate for k = 9 is {1-accuracy}")

model = KNeighborsClassifier(n_neighbors=15, metric='minkowski', p=2)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
print(f"Error rate for k = 15 is {1-accuracy}")
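To see how the error rate varies with k more generally, one can sweep a range of values and plot the result; a minimal sketch, assuming the splits defined above:

ks = range(1, 26)
errors = []
for k in ks:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    errors.append(1 - accuracy_score(y_test, model.predict(X_test)))

plt.plot(ks, errors, marker='o')
plt.xlabel("k")
plt.ylabel("Error rate")
plt.title("Error rate vs k")
plt.show()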

Output:

Self-Evaluation:

1. What is the significance of precision and recall?

• Precision measures the accuracy of the model's positive predictions. It is defined as the ratio of true positives (instances correctly identified as positive) to the total number of instances the model predicted as positive. A high precision indicates that the model makes very few false positive predictions, i.e., most of the instances it labels positive really are positive.

• Recall measures the completeness of the model's positive predictions. It is defined as the ratio of true positives to the total number of actual positive instances. A high recall indicates that the model identifies most of the positive instances, i.e., it misses very few of them.
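
As a minimal sketch, both metrics can be computed with scikit-learn's precision_score and recall_score (using macro averaging since the iris problem has three classes; y_test and y_pred as produced in Task #2):

from sklearn.metrics import precision_score, recall_score

print("Precision:", precision_score(y_test, y_pred, average='macro'))
print("Recall:", recall_score(y_test, y_pred, average='macro'))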

2. What is a hyperparameter? Which are the hyperparameters for kNN?

• Hyperparameters are parameters that cannot be learned from the training data directly and must be set before training the model. They affect the behavior of the model and can be tuned to optimize its performance on the validation data.

• In the k-Nearest Neighbors (kNN) algorithm, some of the commonly used hyperparameters are listed below (a tuning sketch follows the list):

  • k: the number of nearest neighbors to consider when making predictions. This parameter can significantly affect the model's performance: a small value of k may result in overfitting, while a large value of k may result in underfitting.
  • Distance metric: the metric used to calculate the distance between data points. Commonly used metrics are the Euclidean, Manhattan, and Minkowski distances.
  • Weights: the weights assigned to each nearest neighbor when making predictions. The two common weighting schemes are uniform weights (all neighbors weighted equally) and distance weights (closer neighbors given more weight).
  • Data preprocessing: the preprocessing steps applied to the data before training the model, such as scaling or normalization of the feature values.
  • Algorithm optimization: the algorithm used to find the nearest neighbors, such as brute force or optimized data structures like the KD-tree.
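
A minimal tuning sketch using scikit-learn's GridSearchCV over the kNN hyperparameters named above (the grid values here are illustrative, and X_train, y_train are the splits from Task #2):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {
    'n_neighbors': [3, 5, 9, 15],
    'weights': ['uniform', 'distance'],
    'p': [1, 2],   # 1 = Manhattan, 2 = Euclidean (both in the Minkowski family)
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)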

Conclusion:
From this experiment we conclude that kNN is a supervised technique in which a data point is assigned to a class based on its k nearest neighbours, typically by calculating the Euclidean distance to every training point. There is no single correct value of k, but it is commonly suggested to take k as the square root of n, where n is the number of data values. The method is considered costly because the distance to every training point must be calculated at prediction time. We trained and tested the model using kNN and calculated the accuracy for different values of k; we observed that no particular value of k is mathematically guaranteed to give the best accuracy, so the square-root-of-n heuristic is a reasonable starting point.
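
As a quick sketch of that rule of thumb (rounding to an odd k is a common convention to reduce voting ties; X_train as defined above):

import numpy as np

n = len(X_train)        # number of training samples
k = int(np.sqrt(n))
if k % 2 == 0:
    k += 1              # prefer an odd k to avoid tied votes
print(f"Suggested k for n = {n}: {k}")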
