
PROJECT REPORT

Diabetes Prediction Using Machine Learning Algorithms

Abstract:-

Diabetes is a chronic disease characterized by a high level of glucose in the blood. There are two main types, Type 1 and Type 2; a third form, gestational diabetes, develops during pregnancy. The disease can be managed far more effectively when it is detected in its early stages. According to the International Diabetes Federation (IDF), 382 million people are living with diabetes.

To that end, in this project we perform early prediction of diabetes in patients with good accuracy, using a K-Nearest Neighbours (KNN) classifier model.

Problem Statement:-

Using patient records, we will build a machine learning model that accurately predicts whether or not the patients in the dataset have diabetes.

Description of Dataset:-

The dataset is a collection of patient records, each classified as either diabetic or non-diabetic. For this coursework I will use these data and apply a KNN algorithm to classify given patient records into one of the two categories.

The dataset contains 768 records of diabetic and non-diabetic patients, which we will clean and prepare before using them in our KNN predictive model. One common cleaning step for this dataset is sketched below.
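In the Kaggle version of this dataset, zeros in columns such as Glucose, BloodPressure, SkinThickness, Insulin and BMI actually stand in for missing measurements. A minimal cleaning sketch, assuming we choose to replace those zeros with the column median (one reasonable option, not the only one):

import pandas as pd
import numpy as np

data = pd.read_csv("/content/diabetes.csv")

# Columns where a physiologically impossible 0 marks a missing value
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Replace the zeros with NaN, then fill with each column's median
data[zero_as_missing] = data[zero_as_missing].replace(0, np.nan)
data[zero_as_missing] = data[zero_as_missing].fillna(data[zero_as_missing].median())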

The dataset consists of several medical predictor variables and one target variable, Outcome.

• Predictor variables include the number of pregnancies the patient has had and their glucose level…
• blood pressure (mm Hg)
• BMI (weight in kg / (height in m)^2)
• insulin level (μU/ml)
• age (years)

Outcome: (class variable, 0 or 1)

Implementation of Code:-

import pandas as pd # data manipulation
import numpy as np # numerical operations
import matplotlib.pyplot as plt # graphs / basic plotting
import seaborn as sns # advanced data visualization (bar plots, scatter plots)


data = pd.read_csv("/content/diabetes.csv")
data

x = data.drop(['Outcome'], axis = 1)
x.head()

y = data['Outcome']
y

0 1
1 0
2 1
3 0
4 1
..
763 0
764 0
765 0
766 1
767 0
Name: Outcome, Length: 768, dtype: int64

from sklearn.preprocessing import MinMaxScaler # scales the data so that each feature's values fall within the range [0, 1]

scaler = MinMaxScaler()

x = scaler.fit_transform(x)
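To see what the scaler does, min-max scaling maps each value x to (x - min) / (max - min) per feature. A tiny illustration with made-up numbers, not values from the dataset:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

demo = np.array([[50.0], [100.0], [200.0]])  # hypothetical glucose-like values
print(MinMaxScaler().fit_transform(demo))
# [[0.        ]
#  [0.33333333]
#  [1.        ]]
# e.g. 100 -> (100 - 50) / (200 - 50) = 0.333...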


from sklearn.model_selection import train_test_split # splits a dataset into training and testing sets

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3, random_state=1) # 30% of the data for testing, remaining 70% for training
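Since the classes are imbalanced (the Kaggle dataset has 500 non-diabetic versus 268 diabetic records), one optional refinement, not used in the run above, is a stratified split that keeps the same class proportions in both sets:

# Hypothetical variant: stratified split preserving the 0/1 class ratio
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.3, random_state=1, stratify=y)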

from sklearn.neighbors import KNeighborsClassifier # classification based on the nearest neighbours

knn = KNeighborsClassifier(n_neighbors=1)

knn.fit(xtrain, ytrain)

KNeighborsClassifier(n_neighbors=1)
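For intuition, with n_neighbors=1 the classifier simply returns the label of the single closest training point (Euclidean distance by default). A minimal NumPy sketch of that idea, not the library's actual implementation:

import numpy as np

def predict_1nn(xtrain, ytrain, query):
    # Euclidean distance from the query to every training row
    dists = np.sqrt(((xtrain - query) ** 2).sum(axis=1))
    # Return the label of the nearest training point
    return np.asarray(ytrain)[np.argmin(dists)]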

ypred = knn.predict(xtest)

ypred

array([1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1,
0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1,
0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0,
0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0,
1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0])

ytest

285 0
101 0
581 0
352 0
726 0
..
241 0
599 0
650 0
11 1
214 1
Name: Outcome, Length: 231, dtype: int64

from sklearn.metrics import confusion_matrix, classification_report # summarize the predictions (true/false positives and negatives)

print(confusion_matrix(ytest, ypred))
print(classification_report(ytest, ypred))
error_rate = []

for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(xtrain, ytrain)
    pred_i = knn.predict(xtest)
    error_rate.append(np.mean(pred_i != ytest))

plt.figure(figsize=(10, 6))
plt.plot(range(1, 40), error_rate, color='blue', linestyle='--',
         marker='o', markersize=10, markerfacecolor='red')
plt.title('K versus Error rate')
plt.xlabel('K')
plt.ylabel('Error rate')
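This elbow-style search evaluates every k on the same held-out test set, which risks tuning k to that particular split. A sketch of an alternative, assuming one prefers cross-validation on the training set to choose k:

import numpy as np
from sklearn.model_selection import cross_val_score

# Mean 5-fold cross-validated accuracy on the training set for each k
cv_scores = []
for i in range(1, 40):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=i),
                             xtrain, ytrain, cv=5)
    cv_scores.append(scores.mean())

best_k = range(1, 40)[np.argmax(cv_scores)]
print(best_k)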

# the error-rate curve is lowest around k = 11-13; refit with k = 13

knn = KNeighborsClassifier(n_neighbors=13)
knn.fit(xtrain, ytrain)
predictions = knn.predict(xtest)

print(confusion_matrix(ytest, predictions))
print(classification_report(ytest, predictions))

[[119  27]
 [ 40  45]]

              precision    recall  f1-score   support

           0       0.75      0.82      0.78       146
           1       0.62      0.53      0.57        85

    accuracy                           0.71       231
   macro avg       0.69      0.67      0.68       231
weighted avg       0.70      0.71      0.70       231
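As a sanity check, the accuracy can be recomputed directly from the confusion matrix above, since the diagonal holds the correct predictions:

# (true negatives + true positives) / total test samples
print((119 + 45) / (119 + 27 + 40 + 45))  # 0.7099... ≈ 0.71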

# checking the balance of the data by plotting the count of outcomes by their value
color_wheel = {1: "#0392cf", 2: "#1020cf"}
colors = data["Outcome"].map(lambda x: color_wheel.get(x + 1))
print(data.Outcome.value_counts())
p = data.Outcome.value_counts().plot(kind="bar")

p = data.hist(figsize = (20,20))
y_pred = knn.predict(xtest)

from sklearn import metrics

cnf_matrix = metrics.confusion_matrix(ytest, y_pred)

p = sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

from sklearn.metrics import accuracy_score

print(accuracy_score(ytest, y_pred))

0.8008658008658008
Using our own test instance to check whether a person has diabetes or not:
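A minimal sketch of how such a check could look; the feature values below are made up for illustration, and the column order (Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age) is assumed to match the Kaggle dataset. The instance must be scaled with the already-fitted MinMaxScaler before prediction:

import numpy as np

# Hypothetical patient record, same column order as the training features
new_patient = np.array([[2, 140, 70, 25, 100, 32.0, 0.45, 33]])

# Scale with the MinMaxScaler fitted earlier, then predict with the k=13 model
new_patient_scaled = scaler.transform(new_patient)
print(knn.predict(new_patient_scaled))  # 1 = diabetic, 0 = non-diabetic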
REFERENCES:-

The dataset used in this coursework was obtained from Kaggle; the Colab notebook and YouTube video linked below were used as supporting material.

https://colab.research.google.com/drive/1vEet9M4-
0shXSlqTlhFmoh0LTthYtf7M?usp=sharing

https://youtu.be/DzWE7xIlkPM?si=E-mlb1fLtwKsZi9S
