You are on page 1of 20

10/18/23, 2:51 PM

Practical 04

Importing Libraries

In [1]:

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt


import seaborn as sns

sns.set()

import warnings

warnings. filterwarnings(’ ignore


#matplotlib inline

yl

Loading The Dataset

In [2]:

diabetes_data = pd.read_csv('diabetes.csv')

diabetes_data

ML _Practical_04 - Jupyter Notebook

Out[2]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI Pedigree Age Outcome
0 6 148 72 35 0 336 0627 50 1
1 1 85 66 29 0 266 0351 3 0
2 8 183 64 0 0 233 0872 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2288 33 1
763 10 101 76 48 180 32.9 0171 863 0
764 2 122 70 27 0 36.38 0.340 27 0
765 5 121 72 23 112 26.2 0.245 30 0
768 1 126 60 [1] 0 301 0349 47 1
767 1 a3 70 3 0 304 0315 23 0
768 rows x 9 columns
In [3]:
#Print the first 5 rows of the daotaframe.
diabetes_data.head()
out[3]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI Pedigree Age Outcomes
0 6 148 72 35 0 336 0.627 50 1
1 1 85 66 29 0 268 0.351 31 0
2 8 183 64 o 0 233 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2288 33 1

Check Information About Dataset


localhost:8888/notebooks/ML_Practical 04.ipynb

1M
10/18/23, 2:51 PM

In [4]:

diabetes_data.info({verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, @ to 767
Data columns (total 9 columns):

# Column

Pregnancies
Glucose

BloodPressure 768 non-null


SkinThickness 768 non-null

BMI
Pedigree
Age
Outcome
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

In [5]:

}
1
2
3
4 Insulin
5
6
7
8

Non-Null Count
768 non-null
768 non-null

768 non-null
768 non-null
768 non-null
768 non-null
768 non-null

diabetes_data.describe()

float64
float64
inte4
inte4

ML _Practical_04 - Jupyter Notebook

Out[5]:
Pregnancles Glucose BloodPressure SkinThickness Insulin BMI Pedigree Age Outcome
count 768.000000 768.000000 766.000000 768.000000 768.000000 768.000000 768.000000
768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.7990479 31.992578 0.471876
33.240885 0.348958
std 3.369578 31.972618 19.355807 15.952218 115.244002 7.834160 0.331329 11.760232
0476951
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.078000 21.000000
0.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 0.243750 24.000000
0.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.372500 28.000000
0.000000
75% 6.000000 140.250000 80.000000 32,000000 127.250000 36.600000 0.626250 41.000000
1.000000
max 17.000000 199.000000 122.000000 99.000000 B846.000000 67.100000 2.420000
81.000000 1.000000
In [6]:
diabetes_data.describe().T
Out[6]:
count mean std min 25% 50% 75% max
Pregnancies 768.0 3.845052 3.369578 0.000 1.00000 3.0000 6.00000 17.00
Glucose 768.0 120.894531 31.972618 0.000 99.00000 117.0000 140.25000 199.00
BloodPressure 768.0 69.105469 19.355807 0.000 62.00000 72.0000 80.00000 122.00
SkinThickness 768.0 20.536458 15952218 0.000 0.00000 23.0000 32.00000 99.00
Insulin 768.0 79.799479 116.244002 0.000 0.00000 30.5000 127.25000 846.00
BMI 768.0 31.992578 7.884160 0.000 27.30000 32.0000 36.60000 67.10
Pedigree 768.0 0.471876 0.331320 0.078 0.24375 0.3725 0.62625 242
Age 768.0 33.240885 11760232 21.000 24.00000 29.0000 41.00000 81.00
Outcome 768.0 0.348958 0.476951 0.000 0.00000 0.0000 1.00000 1.00

Checking Null Values

In [7]:

diabetes_data_copy = diabetes_data.copy(deep = True)

diabetes_data_copy[['Glucose', 'BloodPressure’, 'SkinThickness', 'Insulin','BMI']]


= diabetes_data_copy[[ 'Glucose', 'BloodPressure’, 'SkinThicl
##t showing the count of Nans
print (diabetes_data_copy.isnull().sum())

Pregnancies

Glucose

BloodPressure
SkinThickness

Insulin

BMI

Pedigree

Age

Outcome

dtype: int64
227
374

localhost:8888/notebooks/ML_Practical 04.ipynb

2111
10/18/23, 2:51 PM ML _Practical_04 - Jupyter Notebook

To fill these Nan values the data distribution needs to be understood

In [8]: |
p = diabetes_data.hist(figsize = (2e,20))

Pregnancies Glucose BloodPressure


250

200
150
150
125 150
100
100
75 100
] : 1 ]
, i 1. . = _N J . | __ mm -——
00 25 50 [1] 25 50 75 100 125 150 20 40 60 80 1

75 100 125 150 175 175 200 0 00 120

8
8

SkinThickness Insulin BMI


500
250
200
400
200
150
300
150
100 200
100
] ) ] i
0 = 0 He , mm HN -—
0 20 40 80 100 0 200 400 600 800 0 10 20 30 40 50 70
Pedigree Age Outcome
300 500
300
250
250 #00,
200
200 300
150
150
200
100 100
100
, | , -_ ,

0.0 05 10 15 20 25 20 30 40 70 80 0.0 0.2 04 06 08 1.0


Aiming to impute nan values for the columns in accordance with their distribution
4 4
In [9]: |

diabetes_data_copy[ 'Glucose'].fillna(diabetes_data_copy[ 'Glucose'].mean(),


inplace = True)
diabetes_data_copy[ 'BlocdPressure'].fillna(diabetes_data_copy[ 'BloodPressure'].me
an(), inplace = True)
diabetes_data_copy[ 'SkinThickness'].fillna(diabetes_data_copy[ 'SkinThickness'].me
dian(), inplace = True)
diabetes_data_copy[ 'Insulin'].fillna(diabetes_data_copy[ 'Insulin'].median(),
inplace = True)
diabetes_data_copy[ 'BMI'].fillna(diabetes_data_copy[ 'BMI'].median(), inplace =
True)

| >

Plotting after Nan removal

localhost:8888/notebooks/ML_Practical 04.ipynb 3M
10/18/23, 2:51 PM

In [10]:

p = diabetes_data_copy.hist(figsize = (20,20))

Pregnancies

,
. bs
00 25 50

75 100 125 150 175

SkinThickness

20 40

Pedigree

200
150
I
, CE =
00 05 1.0 15 20

Identify Shape Of Dataset

In [11]:

25

diabetes_data.shape

out[11]:
(768, 9)

Data Type Analysis

localhost:8888/notebooks/ML_Practical 04.ipynb

BS]

300
0

|
40 60

80

100

ML _Practical_04 - Jupyter Notebook

Glucose

120

Insulin

140

160

180 200

8g

BloodPressure

I's
20 40 80 80 100 120

0.0

20 30

02

40

Outcome

04

06

08

4/11
10/18/23, 2:51 PM ML _Practical_04 - Jupyter Notebook

In [12]:

sns.countplot(y=diabetes_data.dtypes ,data=diabetes_data)
plt.xlabel( "count of each data type")

plt.ylabel(“"data types")

plt.show()

int64

data types

float64

0 1 2 3 4 5 6 7

count of each data type

checking the balance of the data

In [13]:

color_wheel = {1: "#@8392cf",

2: "#7bce43"}

colors = diabetes_data[ "Outcome”].map(lambda x: color_wheel.get(x + 1))


print(diabetes_data.Outcome.value_counts())
p=diabetes_data.Outcome.value_counts().plot(kind="bar")

[] 500
1 268
Name: Outcome, dtype: int64

400

100

Heatmap for unclean data

localhost:8888/notebooks/ML_Practical 04.ipynb

5111
10/18/23, 2:51 PM ML _Practical_04 - Jupyter Notebook

In [14]:
plt.figure(figsize=(12,10))

# on this line I just set the size of figure to 12 by 16.


p=sns.heatmap(diabetes_data.corr(), annot=True,cmap ='RdY1Gn'})

Pregnancies ; ; 0.018

Glucose 0.15 0.057

BloodPressure . i . 0.041

0.075

SkinThickness { 0.057

Insulin

BMI 0.018

0.034

@
4

Pedigree

Age

Outcome

Pregnancies
Glucose
BloodPressure
SkinThickness
Insulin

BMI
Pedigree
Qutcome

Heatmap for clean data

localhost:8888/notebooks/ML_Practical 04.ipynb 6/11


10/18/23, 2:51 PM

In [15]:

plt.figure(figsize=(12,10))
# on this line I just set the size of figure to 12 by 16.
p=sns.heatmap(diabetes_data_copy.corr(), annot=True,cmap ='RdYlGn'}

Pregnancies

Glucose

BloodPressure

SkinThickness

Insulin

BMI

Pedigree

Age

Outcome

0.082

0.082

ML _Practical_04 - Jupyter Notebook

0.045

0.034

[2 © [2] [2] [= — [4] ©


@ In] = 7] = = ©
§ 8 = g 3 a 2 <
3 wn = c 5
2 Io} g 8 == B
[=] 0 = o
I) ° =
aT 8 =
@ [7]
Scaling the data
In [16]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X = pd.DataFrame(sc_X.fit_transform(diabetes_data_copy.drop(["Outcome"],axis =
1),),
columns=[ ‘Pregnancies’, 'Glucose', 'BloodPressure’, ‘'SkinThickness', ‘Insulin’,
'BMI', 'DiabetesPedigreeFunction’, 'Age'])
In [17]:
X.head()
out[17]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
DiabetesPedigresFunction Age
0 0.639947 0.865108 -0.033518 0.670643 -0.181541 0.166619 0.468492 1.425995
1 0.844885 -1.206162 -0.529859 -0.012301 -0.181541 -0.852200 -0.365061 -0.190672
2 1.233880 2.015813 -0.695306 -0.012301 -0.181541 -1.332500 0.604397 -0.105584
3 -0.844885 -1.074852 -0.529859 -0.695245 -0.540642 -0.633881 -0.920763 -1.041549
4 -1.141852 0.503458 -2.680669 0.670643 0.316566 1.549303 5.484909 -0.020496

localhost:8888/notebooks/ML_Practical 04.ipynb

Qutcome

10

-06

-04

7M
10/18/23, 2:51 PM ML _Practical_04 - Jupyter Notebook

In [18]: |

#X = diabetes_data.drop( "Outcome", axis = 1)


y = diabetes_data_copy.Outcome

Splitting the dataset

In [19]: |

from sklearn.model_selection import train_test_split


X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=1/3,random_state=42,
stratify=y)

In [20]: |

from sklearn.neighbors import KNeighborsClassifier


test_scores = []
train_scores = []
for i in range(1,15):
knn = KNeighborsClassifier(i)
knn.fit(X_train,y_train)
train_scores.append{knn.score(X_train,y_train)}
test_scores.append(knn.score(X_test,y_test))

In [21]: |

## score that comes from testing on the same datapoints that were used for training

max_train_score = max(train_scores)

train_scores_ind = [i for i, v in enumerate(train_scores) if v == max_train_score]

print('Max train score {} % and k = {}'.format(max_train_score*100,list{map(lambda


x: x+1, train_scores_ind)}))

Max train score 79.6875 %¥ and k = [1]

In [22]: |

## score that comes from testing on the datapoints that were split in the beginning
to be used for testing solely
max_test_score = max(test_scores)

test_scores_ind = [1 for 1, v in enumerate(test_scores) if v == max_test_score]

print('Max test score {} %¥ and k = {}'.format(max_test_score*100,list(map(lambda


x: x+1, test_scores_ind))))

Max test score 73.4375 ¥ and k = [1]

Result Visualisation

In [23]: |

plt.figure(figsize=(12,5))
p = sns.lineplot(data=(range(1,15),train_scores),marker="'*',label="'Train Score')
p = sns.lineplot(data=(range(1,15),test_scores),marker="0"',label="Test Score’)
14 —.— Train Score ®
— - = Train Score |
y===) be

12

L )
—e— Test Score
10

~o~ Test Score


a —
co sa

The best result is captured at k = 11 hence 11 is used for the final mode

localhost:8888/notebooks/ML_Practical 04.ipynb 8/11


10/18/23, 2:51 PM ML _Practical_04 - Jupyter Notebook

Setup a knn classifier with k neighbor

In [24]:
knn = KNeighborsClassifier(11)

knn.fit(X_train,y_train)
knn.score(X_test,y_test)

out[24]:
0.765625

In [25]:

print(train_scores)

[8.796875]
Model Performance Analysis

confusion Matrix

In [26]:

#import confusion_matrix

from sklearn.metrics import confusion_matrix

#let us get the predictions using the classifier we had fit above
y_pred = knn.predict(X_test)

confusion _matrix(y_test,y_pred)

pd.crosstab(y_test, y pred, rownames=['True'], colnames=['Predicted'],


margins=True)

out[26]:

Predicted c 1 Al
True

0 142 25 167

1 35 54 89

All 177 79 256

localhost:8888/notebooks/ML_Practical 04.ipynb

9/11
10/18/23, 2:51 PM ML _Practical_04 - Jupyter Notebook

In [27]:

y_pred = knn.predict(X_test)
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)

p = sns.heatmap(pd.DataFrame(cnf matrix), annot=True, cmap="Y1GnBu" ,fmt='g’)

plt.title('Confusion matrix’, y=1.1)


plt.ylabel('Actual label’)
plt.xlabel('Predicted label")

out[27]:
Text(.5, 20.049999999999997, 'Predicted label')

Confusion matrix

o 25
@
a
=

i=)
<

a 35 54

[4] 1

Predicted label

Classification Report (Report which includes Precision, Recall and F1-Score)

In [28]:
#import classification report

from sklearn.metrics import classification_report


print(classification_report(y_test,y_pred))

precision recall fl-score support

] 0.80 9.85 2.83 167

1 0.68 9.61 9.64 89

accuracy 8.77 256


macro avg 0.74 8.73 8.73 256
weighted avg 0.76 2.77 8.76 256

In [29]:

from sklearn.metrics import roc_curve


y_pred_proba = knn.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

localhost:8888/notebooks/ML_Practical 04.ipynb
140

120

100

10/11
10/18/23, 2:51 PM ML _Practical_04 - Jupyter Notebook

In [30]:

plt.plot([@,1],[0,1], 'k--"}
plt.plot(fpr,tpr, label="Knn')
plt.xlabel(’fpr'}

plt.ylabel('tpr'})
plt.title('Knn{n_neighbors=11) ROC curve")
plt.show()

Knn{n_neighbors=11) ROC curve

08

06

tpr

0.4

02

0.0

0.0 02 04 0.6 0.8 1.0

Area under ROC curve

In [31]:

from sklearn.metrics import roc_auc_score


roc_auc_score(y_test,y_pred_proba)

Out[31]:

©.8193500639171096

Hyper Parameter optimization

In [32]:

#import GridSearchCV
from sklearn.model_ selection import GridSearchcv

#In case of classifier Like knn the parameter to be tuned is n_neighbors


param_grid = {'n_neighbors':np.arange(1,50)}

knn = KNeighborsClassifier()

knn_cv= GridSearchCv(knn,param_grid,cv=5)

knn_cv.fit(X,y)
print(“"Best Score:
print("Best Parameters:

+ str(knn_cv.best_score_))
" + str(knn_cv.best_params_))
Best Score:9.7721848251252015
Best Parameters: {'n_neighbors': 25}

localhost:8888/notebooks/ML_Practical 04.ipynb

11/11

You might also like