Introduction
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to diagnostically predict whether a patient has diabetes based on the diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database; in particular, all patients are females at least 21 years old of Pima Indian heritage.
+ Columns
Pregnancies - Number of times pregnant
Glucose - Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure - Diastolic blood pressure (mm Hg)
SkinThickness - Triceps skin fold thickness (mm)
Insulin - 2-hour serum insulin (mu U/ml)
BMI - Body mass index (weight in kg / (height in m)^2)
DiabetesPedigreeFunction - Diabetes pedigree function
Age - Age in years
Outcome - Class variable (0 or 1); 268 of the 768 records are 1, the others are 0
+ Importing required libraries
# importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
+ Loading dataset
df = pd.read_csv("diabetes.csv")  # read the dataset
df.head(6)                        # display the first 6 records
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1
5            5      116             74              0        0  25.6                     0.201   30        0
+ Finding the information about the dataset
df.info()  # summary information about the dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

+ Summary of statistics
In [38]: df.describe().T  # transposed summary of statistics
Out [38]:
                          count        mean         std     min       25%       50%        75%
Pregnancies               768.0    3.845052    3.369578   0.000   1.00000    3.0000    6.00000
Glucose                   768.0  120.894531   31.972618   0.000  99.00000  117.0000  140.25000
BloodPressure             768.0   69.105469   19.355807   0.000  62.00000   72.0000   80.00000
SkinThickness             768.0   20.536458   15.952218   0.000   0.00000   23.0000   32.00000
Insulin                   768.0   79.799479  115.244002   0.000   0.00000   30.5000  127.25000
BMI                       768.0   31.992578    7.884160   0.000  27.30000   32.0000   36.60000
DiabetesPedigreeFunction  768.0    0.471876    0.331329   0.078   0.24375    0.3725    0.62625
Age                       768.0   33.240885   11.760232  21.000  24.00000   29.0000   41.00000
Outcome                   768.0    0.348958    0.476951   0.000   0.00000    0.0000    1.00000
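Note the minimum of 0.000 for Glucose, BloodPressure, SkinThickness, Insulin and BMI: a zero is physiologically implausible for these measurements, so they are implicit missing values even though `isnull()` reports none. A minimal sketch (on a small synthetic frame with illustrative values, not the full dataset) of how such zeros could be surfaced:

```python
import numpy as np
import pandas as pd

# Small synthetic sample with the same column names (values are
# illustrative only, not rows from the real dataset)
sample = pd.DataFrame({
    'Glucose':       [148, 0, 183, 89],
    'BloodPressure': [72, 66, 0, 66],
    'BMI':           [33.6, 26.6, 23.3, 0.0],
})

# Count zeros per column -- each zero here is an implicit missing value
zero_counts = (sample == 0).sum()
print(zero_counts)

# Replace zeros with NaN so isnull()/fillna() can see them
cleaned = sample.replace(0, np.nan)
print(cleaned.isnull().sum())
```

This notebook does not perform this replacement, but it is a common preprocessing step for this dataset.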
+ Finding the null values
In [39]: df.isnull().sum()  # checking for missing values
Out [39]:
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

+ Visualizing the null values
In [40]: sns.heatmap(df.isnull(), cmap='Blues')  # visualize the null values
Out [40]:

[Heatmap figure: df.isnull() across all nine columns, showing no missing cells]

+ Correlation matrix
In [41]: df.corr()
Out [41]:
                          Pregnancies   Glucose  BloodPressure  SkinThickness   Insulin
Pregnancies                  1.000000  0.129459       0.141282      -0.081672 -0.073535
Glucose                      0.129459  1.000000       0.152590       0.057328  0.331357
BloodPressure                0.141282  0.152590       1.000000       0.207371  0.088933
SkinThickness               -0.081672  0.057328       0.207371       1.000000  0.436783
Insulin                     -0.073535  0.331357       0.088933       0.436783  1.000000
BMI                          0.017683  0.221071       0.281805       0.392573  0.197859
DiabetesPedigreeFunction    -0.033523  0.137337       0.041265       0.183928  0.185071
Age                          0.544341  0.263514       0.239528      -0.113970 -0.042163
Outcome                      0.221898  0.466581       0.065068       0.074752  0.130548
(remaining columns truncated in the original output)

+ Visualizing the correlation
In [42]: sns.heatmap(df.corr(), cmap='pink')
Out [42]:
[Heatmap figure: correlation matrix of all nine features, colour scale from 0.0 to 1.0]
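Beyond the heatmap, the correlation of each feature with the target can be ranked directly. A minimal sketch on a small synthetic frame (illustrative values, not the real data):

```python
import pandas as pd

# Tiny synthetic stand-in with a feature/target structure like the
# diabetes data (values are illustrative only)
toy = pd.DataFrame({
    'Glucose': [148, 85, 183, 89, 137, 116],
    'Age':     [50, 31, 32, 21, 33, 30],
    'Outcome': [1, 0, 1, 0, 1, 0],
})

# Correlation of every column with Outcome, strongest first
corr_with_outcome = toy.corr()['Outcome'].sort_values(ascending=False)
print(corr_with_outcome)
```

On the real dataset the same expression, `df.corr()['Outcome'].sort_values(ascending=False)`, ranks Glucose as the feature most correlated with the outcome.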
+ Creating histogram distributions of all features
df.hist(figsize=(18, 10), grid=False, color='#ADD8E6')
plt.suptitle("histogram distribution levels", size=38)

Text(0.5, 0.98, 'histogram distribution levels')
[Figure: histograms of all nine features, titled "histogram distribution levels"]
+ Importing the libraries for prediction
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')

x = df.drop('Outcome', axis=1)
y = df['Outcome']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
In x all the independent variables are stored; in y the target variable (Outcome) is stored. Train-test split is a technique used in machine learning to assess model performance: it divides the dataset into a training set and a testing set, with a test size of 0.2 meaning that 20% of the data is used for testing and 80% for training.
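As a side note, the split above is random on every run. A sketch (on a tiny synthetic frame standing in for df, not the real data) of making the split reproducible with `random_state` and class-balanced with `stratify`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in for df (illustrative values only)
toy = pd.DataFrame({
    'Glucose': np.arange(10),
    'Age': np.arange(21, 31),
    'Outcome': [0, 1] * 5,
})

x = toy.drop('Outcome', axis=1)
y = toy['Outcome']

# random_state makes the split reproducible; stratify=y keeps the 0/1
# ratio the same in the train and test portions
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y)

print(len(x_train), len(x_test))  # 8 2
```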
+ Training the model
model = LogisticRegression()
model.fit(x_train, y_train)

LogisticRegression()

The model is fitted on the training data (x_train and y_train).
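The notebook suppresses warnings above; one common source with this estimator is the default lbfgs solver failing to converge on unscaled features. A hedged sketch of scaling before fitting, using synthetic data from `make_classification` rather than the diabetes features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 8 features, like the diabetes inputs
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Standardizing the features first usually helps the solver converge;
# raising max_iter gives it extra headroom
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)
print(pipe.score(X, y))
```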
prediction = model.predict(x_test)
print(prediction)

[array of 0/1 predictions for the test samples]

The fitted model is then used to predict the Outcome labels for the test set.
accuracy = accuracy_score(y_test, prediction)
print(accuracy)

0.7922077922077922
The model classifies about 79% of the test samples correctly.
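Accuracy alone can be misleading on an imbalanced target (only about 35% of outcomes are 1), so a confusion matrix and per-class report give a fuller picture. A minimal sketch with short hypothetical label vectors standing in for y_test and prediction:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical labels standing in for y_test and the model's predictions
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 0, 1, 1, 0]

# Rows are the actual class, columns the predicted class
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Precision, recall and F1 for each class
print(classification_report(y_true, y_pred))
```

Applied to the real split, `confusion_matrix(y_test, prediction)` would show how many diabetic patients the model misses, which accuracy hides.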