ML Practical 04

10/18/23, 2:51 PM
Practical 04
Importing Libraries
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
sns.set()
import warnings
warnings. filterwarnings(’ ignore

#matplotlib inline
yl
Loading The Dataset
In [2]:
diabetes_data = pd.read_csv('diabetes.csv')
diabetes_data
ML _Practical_04 - Jupyter Notebook
Out[2]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI Pedigree Age Outcome
0 6 148 72 35 0 336 0627 50 1
1 1 85 66 29 0 266 0351 3 0
2 8 183 64 0 0 233 0872 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2288 33 1
763 10 101 76 48 180 32.9 0171 863 0
764 2 122 70 27 0 36.38 0.340 27 0
765 5 121 72 23 112 26.2 0.245 30 0
768 1 126 60 [1] 0 301 0349 47 1
767 1 a3 70 3 0 304 0315 23 0
768 rows x 9 columns
In [3]:
#Print the first 5 rows of the daotaframe.
diabetes_data.head()
out[3]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI Pedigree Age Outcomes
0 6 148 72 35 0 336 0.627 50 1
1 1 85 66 29 0 268 0.351 31 0
2 8 183 64 o 0 233 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2288 33 1
Check Information About Dataset

localhost:8888/notebooks/ML_Practical 04.ipynb
1M
10/18/23, 2:51 PM
In [4]:
diabetes_data.info({verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, @ to 767
Data columns (total 9 columns):
# Column
Pregnancies
Glucose
BloodPressure 768 non-null

SkinThickness 768 non-null
BMI
Pedigree
Age
Outcome
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
In [5]:
}
1
2
3
4 Insulin
5
6
7
8
Non-Null Count
768 non-null
768 non-null
768 non-null
768 non-null
768 non-null
768 non-null
768 non-null
diabetes_data.describe()
float64
float64
inte4
inte4
Out[5]:
Pregnancles Glucose BloodPressure SkinThickness Insulin BMI Pedigree Age Outcome
count 768.000000 768.000000 766.000000 768.000000 768.000000 768.000000 768.000000
768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.7990479 31.992578 0.471876
33.240885 0.348958
std 3.369578 31.972618 19.355807 15.952218 115.244002 7.834160 0.331329 11.760232
0476951
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.078000 21.000000
0.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 0.243750 24.000000
0.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.372500 28.000000
0.000000
75% 6.000000 140.250000 80.000000 32,000000 127.250000 36.600000 0.626250 41.000000
1.000000
max 17.000000 199.000000 122.000000 99.000000 B846.000000 67.100000 2.420000
81.000000 1.000000
In [6]:
diabetes_data.describe().T
Out[6]:
count mean std min 25% 50% 75% max
Pregnancies 768.0 3.845052 3.369578 0.000 1.00000 3.0000 6.00000 17.00
Glucose 768.0 120.894531 31.972618 0.000 99.00000 117.0000 140.25000 199.00
BloodPressure 768.0 69.105469 19.355807 0.000 62.00000 72.0000 80.00000 122.00
SkinThickness 768.0 20.536458 15952218 0.000 0.00000 23.0000 32.00000 99.00
Insulin 768.0 79.799479 116.244002 0.000 0.00000 30.5000 127.25000 846.00
BMI 768.0 31.992578 7.884160 0.000 27.30000 32.0000 36.60000 67.10
Pedigree 768.0 0.471876 0.331320 0.078 0.24375 0.3725 0.62625 242
Age 768.0 33.240885 11760232 21.000 24.00000 29.0000 41.00000 81.00
Outcome 768.0 0.348958 0.476951 0.000 0.00000 0.0000 1.00000 1.00
Checking Null Values
In [7]:
diabetes_data_copy = diabetes_data.copy(deep = True)
diabetes_data_copy[['Glucose', 'BloodPressure’, 'SkinThickness', 'Insulin','BMI']]

= diabetes_data_copy[[ 'Glucose', 'BloodPressure’, 'SkinThicl
##t showing the count of Nans
print (diabetes_data_copy.isnull().sum())
Pregnancies
Glucose
BloodPressure
SkinThickness
Insulin
BMI
Pedigree
Age
Outcome
dtype: int64
227
374
2111
10/18/23, 2:51 PM ML _Practical_04 - Jupyter Notebook
To fill these Nan values the data distribution needs to be understood
In [8]: |
p = diabetes_data.hist(figsize = (2e,20))
Pregnancies Glucose BloodPressure

250
200
150
150
125 150
100
100
75 100
] : 1 ]
, i 1. . = _N J . | __ mm -——
00 25 50 [1] 25 50 75 100 125 150 20 40 60 80 1
75 100 125 150 175 175 200 0 00 120
8
8
SkinThickness Insulin BMI

500
250
200
400
200
150
300
150
100 200
100
] ) ] i
0 = 0 He , mm HN -—
0 20 40 80 100 0 200 400 600 800 0 10 20 30 40 50 70
Pedigree Age Outcome
300 500
300
250
250 #00,
200
200 300
150
150
200
100 100
100
, | , -_ ,
0.0 05 10 15 20 25 20 30 40 70 80 0.0 0.2 04 06 08 1.0

Aiming to impute nan values for the columns in accordance with their distribution
4 4
In [9]: |
diabetes_data_copy[ 'Glucose'].fillna(diabetes_data_copy[ 'Glucose'].mean(),

inplace = True)
diabetes_data_copy[ 'BlocdPressure'].fillna(diabetes_data_copy[ 'BloodPressure'].me
an(), inplace = True)
diabetes_data_copy[ 'SkinThickness'].fillna(diabetes_data_copy[ 'SkinThickness'].me
dian(), inplace = True)
diabetes_data_copy[ 'Insulin'].fillna(diabetes_data_copy[ 'Insulin'].median(),
inplace = True)
diabetes_data_copy[ 'BMI'].fillna(diabetes_data_copy[ 'BMI'].median(), inplace =
True)
| >
Plotting after Nan removal
localhost:8888/notebooks/ML_Practical 04.ipynb 3M
10/18/23, 2:51 PM
In [10]:
p = diabetes_data_copy.hist(figsize = (20,20))
Pregnancies
,
. bs
00 25 50
75 100 125 150 175
SkinThickness
20 40
Pedigree
200
150
I
, CE =
00 05 1.0 15 20
Identify Shape Of Dataset
In [11]:
25
diabetes_data.shape
out[11]:
(768, 9)
Data Type Analysis
BS]
300
0
|
40 60
80
100
Glucose
120
Insulin
140
160
180 200
8g
BloodPressure
I's
20 40 80 80 100 120
0.0
20 30
02
40
Outcome
04
06
08
4/11
In [12]:
sns.countplot(y=diabetes_data.dtypes ,data=diabetes_data)
plt.xlabel( "count of each data type")
plt.ylabel(“"data types")
plt.show()
int64
data types
float64
0 1 2 3 4 5 6 7
count of each data type
checking the balance of the data
In [13]:
color_wheel = {1: "#@8392cf",
2: "#7bce43"}
colors = diabetes_data[ "Outcome”].map(lambda x: color_wheel.get(x + 1))

print(diabetes_data.Outcome.value_counts())
p=diabetes_data.Outcome.value_counts().plot(kind="bar")
[] 500
1 268
Name: Outcome, dtype: int64
400
100
Heatmap for unclean data
5111
In [14]:
plt.figure(figsize=(12,10))
# on this line I just set the size of figure to 12 by 16.

p=sns.heatmap(diabetes_data.corr(), annot=True,cmap ='RdY1Gn'})
Pregnancies ; ; 0.018
Glucose 0.15 0.057
BloodPressure . i . 0.041
0.075
SkinThickness { 0.057
Insulin
BMI 0.018
0.034
@
4
Pedigree
Age
Outcome
Pregnancies
Glucose
BloodPressure
SkinThickness
Insulin
BMI
Pedigree
Qutcome
Heatmap for clean data
localhost:8888/notebooks/ML_Practical 04.ipynb 6/11

10/18/23, 2:51 PM
In [15]:
# on this line I just set the size of figure to 12 by 16.
p=sns.heatmap(diabetes_data_copy.corr(), annot=True,cmap ='RdYlGn'}
Pregnancies
Glucose
BloodPressure
SkinThickness
Insulin
BMI
Pedigree
Age
Outcome
0.082
0.082
0.045
0.034
[2 © [2] [2] [= — [4] ©

@ In] = 7] = = ©
§ 8 = g 3 a 2 <
3 wn = c 5
2 Io} g 8 == B
[=] 0 = o
I) ° =
aT 8 =
@ [7]
Scaling the data
In [16]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X = pd.DataFrame(sc_X.fit_transform(diabetes_data_copy.drop(["Outcome"],axis =
1),),
columns=[ ‘Pregnancies’, 'Glucose', 'BloodPressure’, ‘'SkinThickness', ‘Insulin’,
'BMI', 'DiabetesPedigreeFunction’, 'Age'])
In [17]:
X.head()
out[17]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
DiabetesPedigresFunction Age
0 0.639947 0.865108 -0.033518 0.670643 -0.181541 0.166619 0.468492 1.425995
1 0.844885 -1.206162 -0.529859 -0.012301 -0.181541 -0.852200 -0.365061 -0.190672
2 1.233880 2.015813 -0.695306 -0.012301 -0.181541 -1.332500 0.604397 -0.105584
3 -0.844885 -1.074852 -0.529859 -0.695245 -0.540642 -0.633881 -0.920763 -1.041549
4 -1.141852 0.503458 -2.680669 0.670643 0.316566 1.549303 5.484909 -0.020496
Qutcome
10
-06
-04
7M
In [18]: |
#X = diabetes_data.drop( "Outcome", axis = 1)

y = diabetes_data_copy.Outcome
Splitting the dataset
In [19]: |
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=1/3,random_state=42,
stratify=y)
In [20]: |
from sklearn.neighbors import KNeighborsClassifier

test_scores = []
train_scores = []
for i in range(1,15):
knn = KNeighborsClassifier(i)
knn.fit(X_train,y_train)
train_scores.append{knn.score(X_train,y_train)}
test_scores.append(knn.score(X_test,y_test))
In [21]: |
## score that comes from testing on the same datapoints that were used for training
max_train_score = max(train_scores)
train_scores_ind = [i for i, v in enumerate(train_scores) if v == max_train_score]
print('Max train score {} % and k = {}'.format(max_train_score*100,list{map(lambda

x: x+1, train_scores_ind)}))
Max train score 79.6875 %¥ and k = [1]
In [22]: |
## score that comes from testing on the datapoints that were split in the beginning
to be used for testing solely
max_test_score = max(test_scores)
test_scores_ind = [1 for 1, v in enumerate(test_scores) if v == max_test_score]
print('Max test score {} %¥ and k = {}'.format(max_test_score*100,list(map(lambda

x: x+1, test_scores_ind))))
Max test score 73.4375 ¥ and k = [1]
Result Visualisation
In [23]: |
p = sns.lineplot(data=(range(1,15),train_scores),marker="'*',label="'Train Score')
p = sns.lineplot(data=(range(1,15),test_scores),marker="0"',label="Test Score’)
14 —.— Train Score ®
— - = Train Score |
y===) be
12
L )
—e— Test Score
10
~o~ Test Score

a —
co sa
The best result is captured at k = 11 hence 11 is used for the final mode
localhost:8888/notebooks/ML_Practical 04.ipynb 8/11

Setup a knn classifier with k neighbor
In [24]:
knn = KNeighborsClassifier(11)
knn.fit(X_train,y_train)
knn.score(X_test,y_test)
out[24]:
0.765625
In [25]:
print(train_scores)
[8.796875]
Model Performance Analysis
confusion Matrix
In [26]:
#import confusion_matrix
from sklearn.metrics import confusion_matrix
#let us get the predictions using the classifier we had fit above
y_pred = knn.predict(X_test)
confusion _matrix(y_test,y_pred)
pd.crosstab(y_test, y pred, rownames=['True'], colnames=['Predicted'],

margins=True)
out[26]:
Predicted c 1 Al
True
0 142 25 167
1 35 54 89
All 177 79 256
9/11
In [27]:
y_pred = knn.predict(X_test)
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
p = sns.heatmap(pd.DataFrame(cnf matrix), annot=True, cmap="Y1GnBu" ,fmt='g’)
plt.title('Confusion matrix’, y=1.1)

plt.ylabel('Actual label’)
plt.xlabel('Predicted label")
out[27]:
Text(.5, 20.049999999999997, 'Predicted label')
Confusion matrix
o 25
@
a
=
@©
i=)
<
a 35 54
[4] 1
Predicted label
Classification Report (Report which includes Precision, Recall and F1-Score)
In [28]:
#import classification report
from sklearn.metrics import classification_report

print(classification_report(y_test,y_pred))
precision recall fl-score support
] 0.80 9.85 2.83 167
1 0.68 9.61 9.64 89
accuracy 8.77 256

macro avg 0.74 8.73 8.73 256
weighted avg 0.76 2.77 8.76 256
In [29]:
from sklearn.metrics import roc_curve

y_pred_proba = knn.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
140
120
100
10/11
In [30]:
plt.plot([@,1],[0,1], 'k--"}
plt.plot(fpr,tpr, label="Knn')
plt.xlabel(’fpr'}
plt.ylabel('tpr'})
plt.title('Knn{n_neighbors=11) ROC curve")
plt.show()
Knn{n_neighbors=11) ROC curve
08
06
tpr
0.4
02
0.0
0.0 02 04 0.6 0.8 1.0
Area under ROC curve
In [31]:
from sklearn.metrics import roc_auc_score

roc_auc_score(y_test,y_pred_proba)
Out[31]:
©.8193500639171096
Hyper Parameter optimization
In [32]:
#import GridSearchCV
from sklearn.model_ selection import GridSearchcv
#In case of classifier Like knn the parameter to be tuned is n_neighbors

param_grid = {'n_neighbors':np.arange(1,50)}
knn = KNeighborsClassifier()
knn_cv= GridSearchCv(knn,param_grid,cv=5)
knn_cv.fit(X,y)
print(“"Best Score:
print("Best Parameters:
+ str(knn_cv.best_score_))
" + str(knn_cv.best_params_))
Best Score:9.7721848251252015
Best Parameters: {'n_neighbors': 25}
11/11

ML Practical 04

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

ML Practical 04

Uploaded by

Copyright:

Available Formats

10/18/23, 2:51 PM

import matplotlib.pyplot as plt

warnings. filterwarnings(’ ignore

Loading The Dataset

ML _Practical_04 - Jupyter Notebook

Check Information About Dataset

BloodPressure 768 non-null

ML _Practical_04 - Jupyter Notebook

Checking Null Values

diabetes_data_copy = diabetes_data.copy(deep = True)

diabetes_data_copy[['Glucose', 'BloodPressure’, 'SkinThickness', 'Insulin','BMI']]

To fill these Nan values the data distribution needs to be understood

Pregnancies Glucose BloodPressure

75 100 125 150 175 175 200 0 00 120

SkinThickness Insulin BMI

0.0 05 10 15 20 25 20 30 40 70 80 0.0 0.2 04 06 08 1.0

diabetes_data_copy[ 'Glucose'].fillna(diabetes_data_copy[ 'Glucose'].mean(),

Plotting after Nan removal

75 100 125 150 175

Identify Shape Of Dataset

Data Type Analysis

ML _Practical_04 - Jupyter Notebook

count of each data type

checking the balance of the data

color_wheel = {1: "#@8392cf",

colors = diabetes_data[ "Outcome”].map(lambda x: color_wheel.get(x + 1))

Heatmap for unclean data

# on this line I just set the size of figure to 12 by 16.

Glucose 0.15 0.057

Heatmap for clean data

localhost:8888/notebooks/ML_Practical 04.ipynb 6/11

ML _Practical_04 - Jupyter Notebook

[2 © [2] [2] [= — [4] ©

#X = diabetes_data.drop( "Outcome", axis = 1)

Splitting the dataset

from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier

train_scores_ind = [i for i, v in enumerate(train_scores) if v == max_train_score]

print('Max train score {} % and k = {}'.format(max_train_score*100,list{map(lambda

Max train score 79.6875 %¥ and k = [1]

test_scores_ind = [1 for 1, v in enumerate(test_scores) if v == max_test_score]

print('Max test score {} %¥ and k = {}'.format(max_test_score*100,list(map(lambda

Max test score 73.4375 ¥ and k = [1]

~o~ Test Score

localhost:8888/notebooks/ML_Practical 04.ipynb 8/11

Setup a knn classifier with k neighbor

from sklearn.metrics import confusion_matrix

pd.crosstab(y_test, y pred, rownames=['True'], colnames=['Predicted'],

All 177 79 256

p = sns.heatmap(pd.DataFrame(cnf matrix), annot=True, cmap="Y1GnBu" ,fmt='g’)

plt.title('Confusion matrix’, y=1.1)

Classification Report (Report which includes Precision, Recall and F1-Score)

from sklearn.metrics import classification_report

precision recall fl-score support

] 0.80 9.85 2.83 167

1 0.68 9.61 9.64 89

accuracy 8.77 256

from sklearn.metrics import roc_curve

Knn{n_neighbors=11) ROC curve

0.0 02 04 0.6 0.8 1.0

Area under ROC curve

from sklearn.metrics import roc_auc_score

Hyper Parameter optimization