
PROJECT REPORT

Diabetes Prediction Using Machine Learning Algorithms

Abstract:-

Diabetes is a chronic disease characterized by a high level of glucose in the blood. There are two main types, Type 1 and Type 2; a third form, gestational diabetes, develops during pregnancy. The disease can be managed far more effectively when it is detected in its early stages. According to the International Diabetes Federation (IDF), 382 million people are living with diabetes.

To that end, in this project we perform early prediction of diabetes in patients with good accuracy, using a K-Nearest Neighbours (KNN) classifier model.

Problem Statement:-

Using patient records, we will build a machine learning model that accurately predicts whether or not the patients in the dataset have diabetes.

Description of Dataset:-

The dataset is a collection of patient records, each classified as either diabetic or non-diabetic. For this coursework I will use these data and apply a KNN algorithm to classify given patient records into one of the two categories.

The dataset contains 768 records of diabetic and non-diabetic patients, which we will clean and prepare before using them in our KNN predictive model. One common cleaning step for this dataset is sketched below.
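In the Kaggle version of this dataset, zeros in columns such as Glucose, BloodPressure, SkinThickness, Insulin and BMI actually stand in for missing measurements. A minimal cleaning sketch, assuming we choose to replace those zeros with the column median (one reasonable option, not the only one):

import pandas as pd
import numpy as np

data = pd.read_csv("/content/diabetes.csv")

# Columns where a physiologically impossible 0 marks a missing value
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Replace the zeros with NaN, then fill with each column's median
data[zero_as_missing] = data[zero_as_missing].replace(0, np.nan)
data[zero_as_missing] = data[zero_as_missing].fillna(data[zero_as_missing].median())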

The dataset consists of several medical predictor variables and one target variable, Outcome.

• Predictor variables include the number of pregnancies the patient has had and their glucose level…
• blood pressure (mm Hg)
• BMI (weight in kg / (height in m)^2)
• insulin level (μU/ml)
• age (years)

Outcome: (class variable, 0 or 1)

Implementation of Code:-

import pandas as pd # data manipulation
import numpy as np # numerical operations
import matplotlib.pyplot as plt # graphs / basic plotting
import seaborn as sns # advanced data visualization (bar plots, scatter plots)


data = pd.read_csv("/content/diabetes.csv")
data

x = data.drop(['Outcome'], axis = 1)
x.head()

y = data['Outcome']
y

0 1
1 0
2 1
3 0
4 1
..
763 0
764 0
765 0
766 1
767 0
Name: Outcome, Length: 768, dtype: int64

from sklearn.preprocessing import MinMaxScaler # scales the data so that each feature's values fall within the range [0, 1]

scaler = MinMaxScaler()

x = scaler.fit_transform(x)
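To see what the scaler does, min-max scaling maps each value x to (x - min) / (max - min) per feature. A tiny illustration with made-up numbers, not values from the dataset:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

demo = np.array([[50.0], [100.0], [200.0]])  # hypothetical glucose-like values
print(MinMaxScaler().fit_transform(demo))
# [[0.        ]
#  [0.33333333]
#  [1.        ]]
# e.g. 100 -> (100 - 50) / (200 - 50) = 0.333...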


from sklearn.model_selection import train_test_split # splits a dataset into training and testing sets

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3, random_state=1) # 30% of the data for testing, remaining 70% for training
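Since the classes are imbalanced (the Kaggle dataset has 500 non-diabetic versus 268 diabetic records), one optional refinement, not used in the run above, is a stratified split that keeps the same class proportions in both sets:

# Hypothetical variant: stratified split preserving the 0/1 class ratio
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.3, random_state=1, stratify=y)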

from sklearn.neighbors import KNeighborsClassifier # classification based on the nearest neighbours

knn = KNeighborsClassifier(n_neighbors=1)

knn.fit(xtrain, ytrain)

KNeighborsClassifier(n_neighbors=1)
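For intuition, with n_neighbors=1 the classifier simply returns the label of the single closest training point (Euclidean distance by default). A minimal NumPy sketch of that idea, not the library's actual implementation:

import numpy as np

def predict_1nn(xtrain, ytrain, query):
    # Euclidean distance from the query to every training row
    dists = np.sqrt(((xtrain - query) ** 2).sum(axis=1))
    # Return the label of the nearest training point
    return np.asarray(ytrain)[np.argmin(dists)]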

ypred = knn.predict(xtest)

ypred

array([1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1,
0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1,
0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0,
0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0,
1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0])

ytest

285 0
101 0
581 0
352 0
726 0
..
241 0
599 0
650 0
11 1
214 1
Name: Outcome, Length: 231, dtype: int64

from sklearn.metrics import confusion_matrix, classification_report # summarize the predictions (true/false positives and negatives)

print(confusion_matrix(ytest, ypred))
print(classification_report(ytest, ypred))
error_rate = []

for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(xtrain, ytrain)
    pred_i = knn.predict(xtest)
    error_rate.append(np.mean(pred_i != ytest))

plt.figure(figsize=(10, 6))
plt.plot(range(1, 40), error_rate, color='blue', linestyle='--',
         marker='o', markersize=10, markerfacecolor='red')
plt.title('K versus Error rate')
plt.xlabel('K')
plt.ylabel('Error rate')
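This elbow-style search evaluates every k on the same held-out test set, which risks tuning k to that particular split. A sketch of an alternative, assuming one prefers cross-validation on the training set to choose k:

import numpy as np
from sklearn.model_selection import cross_val_score

# Mean 5-fold cross-validated accuracy on the training set for each k
cv_scores = []
for i in range(1, 40):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=i),
                             xtrain, ytrain, cv=5)
    cv_scores.append(scores.mean())

best_k = range(1, 40)[np.argmax(cv_scores)]
print(best_k)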

# the error-rate curve is lowest around k = 11-13; refit with k = 13

knn = KNeighborsClassifier(n_neighbors=13)
knn.fit(xtrain, ytrain)
predictions = knn.predict(xtest)

print(confusion_matrix(ytest, predictions))
print(classification_report(ytest, predictions))

[[119  27]
 [ 40  45]]

              precision    recall  f1-score   support

           0       0.75      0.82      0.78       146
           1       0.62      0.53      0.57        85

    accuracy                           0.71       231
   macro avg       0.69      0.67      0.68       231
weighted avg       0.70      0.71      0.70       231
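As a sanity check, the accuracy can be recomputed directly from the confusion matrix above, since the diagonal holds the correct predictions:

# (true negatives + true positives) / total test samples
print((119 + 45) / (119 + 27 + 40 + 45))  # 0.7099... ≈ 0.71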

# checking the balance of the data by plotting the count of outcomes by their value
color_wheel = {1: "#0392cf", 2: "#1020cf"}
colors = data["Outcome"].map(lambda x: color_wheel.get(x + 1))
print(data.Outcome.value_counts())
p = data.Outcome.value_counts().plot(kind="bar")

p = data.hist(figsize = (20,20))
y_pred = knn.predict(xtest)

from sklearn import metrics

cnf_matrix = metrics.confusion_matrix(ytest, y_pred)

p = sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

from sklearn.metrics import accuracy_score

print(accuracy_score(ytest, y_pred))

0.8008658008658008
Using our own test instance to check whether a person has diabetes or not:
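A minimal sketch of how such a check could look; the feature values below are made up for illustration, and the column order (Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age) is assumed to match the Kaggle dataset. The instance must be scaled with the already-fitted MinMaxScaler before prediction:

import numpy as np

# Hypothetical patient record, same column order as the training features
new_patient = np.array([[2, 140, 70, 25, 100, 32.0, 0.45, 33]])

# Scale with the MinMaxScaler fitted earlier, then predict with the k=13 model
new_patient_scaled = scaler.transform(new_patient)
print(knn.predict(new_patient_scaled))  # 1 = diabetic, 0 = non-diabetic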
REFERENCES:-

The dataset used in this coursework was obtained from Kaggle; the Colab notebook and YouTube video linked below were used as supporting material.

https://colab.research.google.com/drive/1vEet9M4-
0shXSlqTlhFmoh0LTthYtf7M?usp=sharing

https://youtu.be/DzWE7xIlkPM?si=E-mlb1fLtwKsZi9S
