
Nageswarao Datatechs

KNN CLASSIFICATION

K Nearest Neighbour (KNN) is a simple algorithm that classifies a new data point based on its neighbours. A group of neighbours is selected (this is the k value) and the data point is assigned to the class to which the majority of those neighbours belong.

K represents the number of nearest neighbours selected. Choosing the k value properly gives better accuracy. Choosing the right value of k is called 'parameter tuning'.

Example: We have two classes of data points: squares and triangles. When a new data point arrives, how do we decide whether it belongs to the square class or the triangle class?

When k=3, we consider the 3 nearest neighbours of the new data point. There are 2 squares and 1 triangle, so the new data point belongs to the square class.

When k=7, there are 3 squares and 4 triangles among the nearest neighbours, so the new data point belongs to the triangle class.
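
This behaviour can be reproduced with a small toy sketch in scikit-learn; the coordinates, labels and the (0, 0) query point below are made up purely for illustration.

# toy sketch: the same new point is classified differently for k=3 and k=7
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# made-up points: distances from (0, 0) are arranged so that the 3 nearest
# neighbours are 2 squares + 1 triangle, and the 7 nearest are 3 squares + 4 triangles
X = np.array([[1.0, 0.0], [0.0, 1.2], [-1.5, 0.0], [0.0, -1.8],
              [2.0, 0.0], [0.0, 2.2], [-2.5, 0.0]])
y = np.array(['square', 'square', 'triangle', 'triangle',
              'triangle', 'square', 'triangle'])

new_point = [[0.0, 0.0]]

knn3 = KNeighborsClassifier(n_neighbors=3).fit(X, y)
knn3.predict(new_point) # predicts 'square' (2 squares vs 1 triangle among the 3 nearest)

knn7 = KNeighborsClassifier(n_neighbors=7).fit(X, y)
knn7.predict(new_point) # predicts 'triangle' (3 squares vs 4 triangles among the 7 nearest)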

How to select the k value

1. Generally, we take the k value as the square root of the number of data points (see the sketch after this list).

2. k is taken as an odd number to avoid ties between the two classes.
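
A tiny sketch of this rule of thumb (the sample count of 140 here is just an assumed example):

# rule of thumb: k ~ square root of the number of data points, rounded to an odd number
import math

n_samples = 140                 # assumed example count
k = int(math.sqrt(n_samples))   # sqrt(140) = 11.83... -> 11
if k % 2 == 0:                  # if even, make it odd to avoid ties
    k += 1
k                               # 11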

We use the KNN algorithm when the data is labelled and when the dataset is small.

How the distance between two points is calculated

This is done using either the Euclidean distance formula or the Minkowski distance formula.
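
As a minimal sketch, the Minkowski distance with power parameter p reduces to the Euclidean distance when p = 2 and to the Manhattan distance when p = 1 (the points p1 and p2 below are made-up examples):

# Minkowski distance between two points; p=2 gives the Euclidean distance
def minkowski_distance(a, b, p=2):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

p1 = [1, 2]
p2 = [4, 6]
minkowski_distance(p1, p2, p=2) # 5.0 -> Euclidean distance
minkowski_distance(p1, p2, p=1) # 7.0 -> Manhattan distance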

Problem: Given data about breast cancer patients, find out whether the cancer is benign or malignant.

Dataset: breast-cancer-wisconsin.data

# prediction for breast cancer using KNN algorithm


import pandas as pd

# load the dataset from the plain-text .data file


df = pd.read_csv('F:/knn/breast-cancer-wisconsin.data')
df.head()

# display the column names


df.columns

# we can find '?' marks in the bare_nuclei column (missing values)


# replace the ? mark with -99999 in the df
df.replace('?', -99999, inplace=True)
df

# remove useless data; here it is the id column


# axis=1 drops the label from the columns; axis=0 would drop it from the index
df.drop(['id'], axis=1, inplace=True) # axis='columns' also works
df

# take columns 0 to 8 as x; take the 9th column, i.e. the class column, as y


x = df.iloc[:, :9]
x
y = df.iloc[:, 9] # y is the class label: 2 or 4
y

# split the data


from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state = 0)

import math
n = math.sqrt(len(y_test)) # square root of the test-set size, used as a guide for k
n # 11.832159566199232

# apply KNN model


from sklearn.neighbors import KNeighborsClassifier

# we can supply: n_neighbors=11, metric='euclidean'


model = KNeighborsClassifier() # default: n_neighbors=5, metric='minkowski' with p=2 (i.e. Euclidean)
model.fit(x_train, y_train)

# find accuracy
accuracy = model.score(x_test, y_test)
accuracy # 0.9857142857142858
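
As the comment above notes, the parameters can also be supplied explicitly. A small variant sketch, continuing from the same train/test split (its accuracy may differ slightly from the default model's):

# variant with explicit parameters, as suggested in the earlier comment
model11 = KNeighborsClassifier(n_neighbors=11, metric='euclidean')
model11.fit(x_train, y_train)
model11.score(x_test, y_test) # accuracy of the k=11 model on the same test set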

# predict for the given data


model.predict([[4,2,1,1,1,2,3,2,1]]) # array([2]) --> benign tumor

# predict for two patients. 2 --> benign, 4 --> malignant


model.predict([[4,2,1,1,1,2,3,2,1], [8,10,10,8,7,10,9,7,1]]) # array([2,4])

Task on KNN: Use the KNN model on the Indian diabetes patients database and predict whether a new patient is diabetic (1) or not (0).

Dataset: diabetes.csv

Note: The following columns should not contain zeros: Glucose, BloodPressure, SkinThickness, Insulin and BMI. Replace any 0s in these columns with the respective column's mean value; only then can you use this dataset.
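
A possible starting sketch for this task; it assumes diabetes.csv is in the working directory and that the target column is named 'Outcome' (adjust the path and column name to match your copy of the dataset):

# possible starting sketch for the diabetes task (file path and 'Outcome' column are assumptions)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv('diabetes.csv')

# replace 0s with the column mean in the columns where 0 is not a valid value
for col in ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']:
    df[col] = df[col].replace(0, df[col].mean())

x = df.drop('Outcome', axis=1)
y = df['Outcome'] # 1 -> diabetic, 0 -> not diabetic

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

model = KNeighborsClassifier(n_neighbors=11) # odd k, roughly sqrt(len(y_test))
model.fit(x_train, y_train)
model.score(x_test, y_test)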
