+ Introduction

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to diagnostically predict whether a patient has diabetes, based on the diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database; in particular, all patients are females at least 21 years old of Pima Indian heritage.

+ Columns

- Pregnancies: number of times pregnant
- Glucose: plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: diastolic blood pressure (mm Hg)
- SkinThickness: triceps skin fold thickness (mm)
- Insulin: 2-hour serum insulin (mu U/ml)
- BMI: body mass index (weight in kg / (height in m)^2)
- DiabetesPedigreeFunction: diabetes pedigree function
- Age: age in years
- Outcome: class variable (0 or 1); 268 of the 768 records are 1, the rest are 0

+ Importing required libraries

```python
# importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

+ Loading the dataset

```python
df = pd.read_csv("diabetes.csv")  # read the dataset
df.head(6)                        # show the first 6 records
```

Output (the Age and Outcome columns are cut off in the scrolled display):

```
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction
0            6      148             72             35        0  33.6                     0.627
1            1       85             66             29        0  26.6                     0.351
2            8      183             64              0        0  23.3                     0.672
3            1       89             66             23       94  28.1                     0.167
4            0      137             40             35      168  43.1                     2.288
5            5      116             74              0        0  25.6                     0.201
```

+ Finding information about the dataset

```python
df.info()  # column types and non-null counts
```

Output:

```
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
```
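If `diabetes.csv` is not on disk, the same loading and inspection steps can be tried on a small synthetic frame. This is only a sketch: the `demo` name and all values below are made up for illustration, not taken from the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for diabetes.csv: a handful of rows with the same
# column names, so the inspection steps can run without the file on disk.
demo = pd.DataFrame({
    "Pregnancies": [6, 1, 8],
    "Glucose": [148, 85, 183],
    "BloodPressure": [72, 66, 64],
    "Outcome": [1, 0, 1],
})

print(demo.shape)      # (rows, columns) of the frame
print(demo.dtypes)     # every column here is int64
print(demo.head(2))    # peek at the first two records
```

The real notebook output above follows the same pattern, just with 768 rows and all 9 columns.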
+ Summary of statistics

```python
df.describe().T  # transposed summary of statistics
```

Output (the max column is cut off in the scrolled display):

```
                          count        mean         std     min       25%       50%        75%
Pregnancies               768.0    3.845052    3.369578   0.000   1.00000    3.0000    6.00000
Glucose                   768.0  120.894531   31.972618   0.000  99.00000  117.0000  140.25000
BloodPressure             768.0   69.105469   19.355807   0.000  62.00000   72.0000   80.00000
SkinThickness             768.0   20.536458   15.952218   0.000   0.00000   23.0000   32.00000
Insulin                   768.0   79.799479  115.244002   0.000   0.00000   30.5000  127.25000
BMI                       768.0   31.992578    7.884160   0.000  27.30000   32.0000   36.60000
DiabetesPedigreeFunction  768.0    0.471876    0.331329   0.078   0.24375    0.3725    0.62625
Age                       768.0   33.240885   11.760232  21.000  24.00000   29.0000   41.00000
Outcome                   768.0    0.348958    0.476951   0.000   0.00000    0.0000    1.00000
```

+ Finding the null values

```python
df.isnull().sum()  # checking for missing values
```

Output:

```
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
```

+ Visualizing the null values

```python
sns.heatmap(df.isnull(), cmap='Blues')  # visualize the missing values
```

[Heatmap output: every cell is False, so the plot is one uniform colour, confirming there are no null values in any column.]

+ Correlation matrix

```python
df.corr()
```

[Output: the 9x9 correlation matrix, truncated in the scrolled display. Among the visible values, Glucose-Outcome (0.467), Pregnancies-Age (0.544) and SkinThickness-Insulin (0.437) stand out.]
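As a small illustration of what `df.corr()` measures, the toy frame below (synthetic numbers, not the real dataset) shows how a label driven by one feature produces a clearly positive correlation, much like the Glucose-Outcome value seen above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
glucose = rng.normal(120, 30, size=200)
# Outcome loosely tied to glucose, so the two columns correlate positively
outcome = (glucose + rng.normal(0, 30, size=200) > 130).astype(int)

toy = pd.DataFrame({"Glucose": glucose, "Outcome": outcome})
corr = toy.corr()
print(corr.loc["Glucose", "Outcome"])  # positive; the diagonal is always 1.0
```

`corr()` defaults to the Pearson coefficient, which is what the notebook's heatmap in the next cell visualizes.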
+ Visualizing the correlation

```python
sns.heatmap(df.corr(), cmap='pink')
```

[Heatmap output: the correlation matrix, one row and column per feature.]

+ Histogram distribution of all columns

```python
df.hist(figsize=(18, 10), grid=False, color='#ADD8E6')
plt.suptitle("histogram distribution levels", size=38)
```

Output:

```
Text(0.5, 0.98, 'histogram distribution levels')
```

[A grid of histograms, one per column, showing each feature's distribution.]

+ Importing the libraries for prediction

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')
```

```python
x = df.drop('Outcome', axis=1)
y = df['Outcome']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
```

In x all the independent variables are stored; in y the target variable (Outcome) is stored. Train-test split is a technique used in machine learning to assess model performance: it divides the dataset into a training set and a testing set, and test_size=0.2 means 20% of the data is held out for testing while 80% is used for training.

+ Training the model

```python
model = LogisticRegression()
model.fit(x_train, y_train)
```

Output:

```
LogisticRegression()
```

Fitting x_train and y_train stores the trained classifier in the variable model.

```python
prediction = model.predict(x_test)
print(prediction)
```

[Output: an array of 0/1 predictions, one per test record.]

The predictions on the test set are then compared against the true labels:

```python
accuracy = accuracy_score(y_test, prediction)
print(accuracy)
```

Output:

```
0.7922077922077922
```

The model reaches an accuracy of about 79% on the held-out test set.
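The whole pipeline above can be sketched end to end on synthetic data. Everything below is made up for illustration (random features, a label that depends on the first feature), so the exact accuracy will differ from the 79% obtained on the real dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the diabetes features: 500 samples, 8 columns,
# with a label driven by the first feature so the model has real signal.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 8))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

# Same recipe as the notebook: 80/20 split, fit, predict, score.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(accuracy)  # well above the ~0.5 chance level for this synthetic task
```

Fixing `random_state` in `train_test_split` makes the split, and therefore the reported accuracy, reproducible across runs, which the notebook's own split (no seed) is not.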
