
Diabetes Prediction Using Data Mining

Fahima Afroz Rozy and Fariha Tabassum

A Thesis in the Partial Fulfillment of the Requirements

for the Award of Bachelor of Computer Science and Engineering (BCSE)

Department of Computer Science and Engineering


College of Engineering and Technology
IUBAT – International University of Business Agriculture and Technology

Spring 2020

Diabetes Prediction Using Data Mining

Fahima Afroz Rozy and Fariha Tabassum

A Thesis in the Partial Fulfillment of the Requirements for the Award of Bachelor of
Computer Science and Engineering (BCSE)
The thesis has been examined and approved,

_____________________________
Prof. Dr. Md. Abdul Haque
Chairman and Professor

_____________________________
Prof. Dr. Utpal Kanti Das
Co-supervisor, Coordinator and Professor

_____________________________
Nusrath Tabassum
Supervisor and Lecturer

Department of Computer Science and Engineering


College of Engineering and Technology
IUBAT – International University of Business Agriculture and Technology

Spring 2020

Abstract

Many diseases around the world cause serious health problems, and data mining has become popular in the health industry as a way to study them. This thesis uses data mining to predict diabetes and examines how people in different age groups are affected by diabetes based on their lifestyle. According to the WHO, more than 463 million people are living with diabetes, which is the seventh leading cause of death, and young people are increasingly affected. We discuss three types of diabetes. Diabetes is growing at an alarming rate; although it is incurable, it can be managed with treatment if it is predicted at an early stage. Clinical decisions are often based on doctors' experience rather than on the knowledge hidden in rich databases. The objective of this research is to find new features and factors that can change the prediction of diabetes. Because data mining techniques perform well in predictive analysis, the proposed approach uses them to predict the risk of diabetes. The performance of the algorithms is measured and improved through feature selection and selection of the training set. We used 769 records. We applied several methods and tools, including Decision Tree, Naïve Bayes, Random Tree, Support Vector Machine, KNN, Logistic Regression and the Weka tool, but the Orange tool gave the highest accuracy. This study can be extended to deal with datasets that have multiple classes. The thesis also gives a detailed review of existing data mining methods used for the prediction of diabetes and points to future work on estimating the severity of diabetes. Moreover, the results of this analysis can be used in further research to improve the accuracy of prediction systems.


Letter of Transmittal

The Chairman
Thesis Defense Committee
Department of Computer Science and Engineering
IUBAT–International University of Business Agriculture and Technology
4 Embankment Drive Road, Sector 10, Uttara Model Town
Dhaka 1230, Bangladesh
Subject: Letter of Transmittal.
Dear Sir,
It is a great pleasure for us to submit the result of our hard work on Diabetes Prediction Using Data Mining. We have tried to give our best in preparing this report. While preparing the report, we gained a great deal of practical experience that will help us greatly in our careers. It has enriched our practical knowledge of prediction. We will be glad to explain anything that needs further clarification. We would like to thank you for giving us the opportunity to prepare a report on Diabetes Prediction Using Data Mining. We hope you will appreciate our hard work and excuse any minor errors. Thank you for your cooperation.
Yours sincerely,
_____________ _____________
Fahima Afroz Rozy Fariha Tabassum
ID:17103091 ID:17103092
Student’s Declaration

This thesis, titled “Diabetes Prediction Using Data Mining” and submitted by Fahima Afroz Rozy and Fariha Tabassum, was carried out under the supervision of Nusrath Tabassum, Lecturer, Department of Computer Science and Engineering, IUBAT. Her support helped us complete it successfully.

The complete study is based on a literature survey, a study of periodicals, journals and websites, and the building of a model to prove the concept studied and designed. We further declare that the complete thesis work, including all analysis, hypotheses, inferences and interpretation of data and information, was done by the two of us.

_____________ _____________

Fahima Afroz Rozy Fariha Tabassum

ID 17103091 ID 17103092
Supervisor’s Certification

This is to certify that the work contained in the thesis entitled “Diabetes Prediction Using Data Mining”, submitted by Fahima Afroz Rozy, ID-17103091, and Fariha Tabassum, ID-17103092, has been accepted by the supervisor, Nusrath Tabassum, as satisfactory in partial fulfillment of the requirements for the degree of B.Sc. in Computer Science and Engineering.

This report meets the standard required for submission. To the best of our knowledge, the results summarized in the report are for the B.Sc. degree in Computer Science.

_______________________________
Nusrath Tabassum

Lecturer

Department of Computer Science and Engineering

IUBAT–International University of Business Agriculture and Technology


Acknowledgments

Firstly, we want to thank our supervisor Nusrath Tabassum, whose guidance and support have been essential to our research. She shared many ideas to help us understand what a thesis is, what we should do, and which tools and methodologies to use. Our continuous discussions have been a constant source of insightful ideas, significantly shaping the main contributions of this thesis. Her advice encouraged us to finish this work, and her guidance gave us a clear view of the ways and sources of data collection.

We would like to thank our ART 203 course instructor, Prof. Dr. Abhijit Saha, who made our thesis work much easier to understand. He conducted many vivas and set reports and assignments based on the thesis work, which helped us a lot. Without his valuable effort, this report and research could not have been successful.

We are especially grateful to the Department of Computer Science and Engineering (CSE) of IUBAT – International University of Business Agriculture and Technology for providing all-out support during the thesis work.

Finally, we would like to thank our family members for believing in us and for encouraging us to fulfill our dreams. We also thank our friends who supported us throughout our academic careers.


Table of Contents

Diabetes Prediction Using Data Mining
Abstract
Letter of Transmittal
Student’s Declaration
Supervisor’s Certification
Acknowledgments
List of Tables
Introduction
  1.1 Background
  1.2 Problem Statement
  1.3 Objective of the research
Literature Review
Research Methodology
  3.1 Research Design
  3.2 Dataset Collection
  3.3 Tools
  3.4 Classification Algorithms
Result and Discussion
Conclusion
References

List of Figures

Figure 1: Disorder because of diabetes
Figure 1.1: Number of adults (20-79 years) with diabetes worldwide
Figure 1.2: Bangladesh - Diabetes Prevalence (% of Population Ages 20 To 79)
Figure 3.1.1: Framework for Diabetes Prediction
Figure 3.1.2: Statistical Summary
Figure 3.3.1: A typical workflow in Orange 3
Figure 3.4.1: Example of Random Forest Algorithm
Figure 3.4.5.1: Example of SVM
Figure 3.4.6.1: Example of KNN


List of Tables

Table 1.1: Risk factors % of Diabetes
Table 2.1: Comparison of different classification algorithms based on the accuracy
Table 2.2: Parameters used for prediction
Table 2.3: Attributes of Diabetes Dataset
Table 3.2.1: PIMA Attributes and Description
Table 3.2.2: Rules generated by proposed model
Table 4.1.1: Confusion Matrix
Table 4.2.1.1: Accuracy Measures
Table 4.2.1.2: Classification report of KNN algorithm

Introduction

Data mining is the process of sorting through large data sets to identify patterns and establish relationships that help solve problems through data analysis. Data mining tools allow us to predict outcomes. This information can be used to increase revenues, reduce costs, improve customer relationships, reduce risks and more. Data mining is also known as knowledge discovery, knowledge extraction, data/pattern analysis and information harvesting. Nowadays data mining is being applied successfully in a growing range of areas, such as financial forecasting, healthcare and weather forecasting. It is particularly effective in healthcare, where it supports prognosis and a deeper understanding of medical data. Data mining applications in healthcare include the analysis of health care centers for better health policy-making, the prevention of hospital errors, and the early detection and prevention of diseases. They also help to reduce preventable hospital deaths, give more value for money and save costs.

Researchers are using data mining techniques to predict different types of diseases such as diabetes, stroke, cancer and heart disease. Data mining algorithms are also used to test the accuracy of predictions of diabetic status.

Diabetes is a chronic disease that occurs either when the pancreas does not produce enough insulin or when the body cannot effectively use the insulin it produces. Insulin is a hormone that regulates blood sugar. An imbalance of blood sugar in the body causes many problems, such as damage to the heart, blood vessels, eyes, kidneys and nerves. Diabetes can also be a cause of death. Diabetes is the cause of 2.6% of global blindness.

Figure 1: Disorder because of diabetes

1.1. Background

According to the 2011 National Diabetes Fact Sheet, 8.3% (25 million) of the U.S. population has diabetes, and diabetes is the seventh leading cause of death according to U.S. death certificates. In recent years, diabetes has become one of the major causes of death worldwide. According to the WHO, around 1.6 million people worldwide died due to diabetes in 2016. It is estimated that 463 million people are living with diabetes all over the world, and research projects that the number of diabetes patients will reach 700 million globally by 2045. According to the International Diabetes Federation, there are 7.1 million people with diabetes in Bangladesh and almost an equal number with undetected diabetes. This number could double by 2025.

Figure 1.1: Number of adults (20-79 years) with diabetes worldwide

Bangladesh is a developing country where 75% of the total population lives in rural areas. Type 2 diabetes is increasing rapidly in the country, in both urban and rural areas. According to research covering 1994 to 2013, the prevalence of type 2 diabetes in Bangladesh varied from 4.5% to 35.0%.

Figure 1.2: Bangladesh - Diabetes Prevalence (% of Population Ages 20 To 79)

Prevalence of diabetes in Bangladesh:

Table 1.1: Risk factors % of Diabetes

                      Male     Female   Total
Diabetes              8.6%     7.4%     8.0%
Overweight            14.4%    19.6%    17.0%
Obesity               2.0%     4.6%     3.3%
Physical inactivity   9.2%     41.3%    25.1%

Several data mining techniques are used to predict diabetes, such as Naïve Bayes, decision trees, neural networks, kernel density, automatically defined groups, the bagging algorithm and support vector machines, with different levels of accuracy. Applying data mining to disease diagnosis and treatment is beneficial for patients, especially for patients with diabetes. Researchers have shown that hospitals do not provide the same quality of service to patients with diabetes. Diabetes professionals store significant amounts of patient data. It is important to analyze these datasets to extract useful knowledge.

There are 3 types of diabetes:

• Type 1 DM results from the body's failure to produce insulin. This form was previously referred to as "insulin-dependent diabetes mellitus" (IDDM) or "juvenile diabetes".

  CONTROL TYPE 1 DIABETES

  ◦ Type 1 diabetes requires close monitoring, as blood sugar levels can be quite erratic through the day if left unchecked.
  ◦ People with type 1 diabetes will usually take a combination of long-acting (basal) and short-acting (bolus) insulin.

• Type 2 DM results from insulin resistance, a condition in which cells fail to use insulin properly, sometimes also with an absolute insulin deficiency. This form was previously referred to as "non-insulin-dependent diabetes mellitus" (NIDDM) or "adult-onset diabetes".

  CONTROL TYPE 2 DIABETES

  ◦ With type 2 diabetes, one of the best ways to achieve greater control of your diabetes is through diet.
  ◦ Some foods affect our blood sugar significantly more than others, so picking the diet for type 2 diabetes that works for you can make a big difference to your numbers and your health.

• Gestational diabetes is the third main form and occurs when pregnant women without a previous diagnosis of diabetes develop a high blood glucose level.

There is no cure for diabetes, but it can be treated and controlled. To manage diabetes effectively, we need to take the following steps to keep our blood sugar levels under control:

• Exercise Regularly
• Increase Your Fiber Intake
• Drink Water and Stay Hydrated
• Choose Foods With a Low Glycemic Index
• Control Stress Levels
• Monitor Your Blood Sugar Levels
• Get Enough Quality Sleep

1.2 Problem Statement

Diabetes is growing at an alarming rate. Although it is incurable, it can be managed with treatment if it is predicted at an early stage.

• It also becomes a cause of other illnesses such as blindness, kidney failure, cholesterol problems and heart disease.
• Deaths due to diabetes and high blood glucose are on the rise.
• Prediction of diabetes at an early stage would help patients keep their sugar level under control.

In this thesis we use the Orange tool, decision trees and naïve Bayes to predict the disease, together with parameters such as:

• Fasting Plasma Glucose
• Plasma Glucose Test
• A1C

1.3 Objective of the research

The application of data mining to medical data is a good method of investigation. The data stored in medical databases are growing at an increasingly rapid rate, and it has been widely recognized that medical data analysis can lead to an enhancement of health care.

The following objectives lead to the achievement of the primary objective of this research:

• To identify the best classification model that can help physicians predict the risk of diabetes in a patient using diabetes attributes.
• To recognize and classify patterns in multivariate patient attributes.
• To predict future outcomes based on previous experiences and present conditions.
• To identify the patients at risk, with the aim of increasing the quality of care and reducing the cost of care.
• To build a prediction model using appropriate classification techniques such as naïve Bayes, decision trees and Orange.

Literature Review

The purposes of this study were to predict diabetes among children, adults and older people, and to identify age- and sex-specific thresholds of low strength for detection of risk. In general, predictive algorithms work well when the data are normally distributed or symmetric, but real-world data are mostly not normally distributed. To improve the efficiency of predictive algorithms, variables that are not normally distributed need to be transformed.
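
As a minimal sketch of such a transformation, assuming a right-skewed attribute such as serum insulin, the Python snippet below compares the skewness of a variable before and after a log transform. The sample values and the helper name `skewness` are hypothetical and are not taken from the thesis dataset; a log transform is only one of several possible choices.

```python
import math

def skewness(values):
    """Population skewness: mean cubed deviation divided by the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance)
    return sum((v - mean) ** 3 for v in values) / (n * std ** 3)

# Hypothetical right-skewed measurements (illustrative only, not thesis data).
raw = [15, 20, 25, 30, 40, 60, 100, 300, 846]

# log1p(x) = log(1 + x) compresses large values and is defined at zero.
transformed = [math.log1p(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(transformed)
print(f"skewness before: {skew_raw:.2f}, after log transform: {skew_log:.2f}")
```

A skewness closer to zero after the transform indicates a more symmetric distribution, which is the property the paragraph above asks for.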

In “Prediction of Diabetes Based on Data Mining Techniques”, Madhusmita Rout and Amandeep Kaur used the following algorithms:

• Naive Bayes
• SVM
• Logistic Regression
• Decision Tree

Tools used:

1. K-means
2. R tool, Anaconda
3. WEKA

• They used 769 records.
• The accuracy of Logistic Regression, 82.35%, is the highest.
• They have 9 attributes.
• A larger dataset helps to obtain more accurate values.

Pardha Repalli (Oklahoma State University), in “Prediction on Diabetes Using Data mining”, applies data mining techniques in healthcare to the prediction of diabetes. The author examines in depth the use of classification techniques such as decision trees and regression models, following the Cross Industry Standard methodology, to discover hidden information in the data. Important input variables selected by the decision tree are high_Blood_Pressure, Cholest_last_Check, Adult_Bmi, Last Flu shot, heart attack diagnose and other variables. The average square error of the selected model is 0.043, which is low. The variable with the greatest effect on diabetes is high blood pressure. For people whose high_blood_pressure_diagnosis value is 2, -1, the probability of not being affected by diabetes is 98%, and in the other case the probability of being affected by diabetes is 2%. According to this research, people above 45 years of age are the most affected by diabetes.

K. R. Ananthapadmanaban and G. Parthiban, in “Prediction of Chances - Diabetic Retinopathy using Data Mining Classification Techniques”, tried to make health predictions through automation. This automated system supports and updates the administration of health services, clinical care, medical analysis and training.

In this paper they used:

• A Naive Bayes classifier to develop the expected framework.
• A decision tree to make a ready-made set of explanations.
• Eclipse IDE for designing the graphical user interface.
• Java to organize the different parts of the user interface.
• MySQL as the database on the web server.

In this paper, the authors present the techniques and applications of data mining in medicinal and clinical predictions. They reuse information already existing in different databases to produce new research and results. The framework includes some initial parts, such as login, entering side effects in the system, recommending medications and proposing a nearby specialist. When symptoms occur, the patient needs a specialist's help, but the specialist may not be accessible for some reason. This can be a limitation of this paper.

Deepti Sisodia and Dilip Singh Sisodia, in “Prediction of Diabetes using Classification Algorithms”, discuss different kinds of diabetes-based problems, which are predicted according to the symptoms shown by the patient. They also discuss data mining techniques.

In this paper:

• They used 3 algorithms, SVM, Decision Tree and Naive Bayes, to evaluate the patient's symptoms.
• They discussed techniques such as association rule mining, classification and clustering to analyze the different kinds of heart based problems.
• Using the MAFIA algorithm, they determined the accuracy of their data.
• They take five symptoms from the patient before analyzing.

This system is time-consuming because it searches insignificant branches. They used Excel to check the redundancy of the data. This is a strength of the system, since the previous system did not use this kind of data redundancy checker. The system is capable of ensuring maximum patient satisfaction.

Mukesh Kumari, Dr. Rajan Vohra and Anshul Arora, in “Prediction of Diabetes Using Bayesian Network”, developed an Intelligent Diabetes Disease Prediction System. The authors used various data mining techniques, such as decision trees, Bayesian networks, the Naïve Bayes model in Weka and neural networks, to predict whether or not a patient is suffering from diabetes. The dataset contains 206 records and 9 attributes. According to the experimental results, the Bayesian network correctly classified 205 instances, giving an accuracy of 99.51%, which is high. The Bayesian network is a promising technique for this type of dataset. A decision tree model is also used for the diagnosis of diabetes. Pre-processing is used to improve the quality of the data. A classifier is applied to the modified dataset to construct the Bayesian model. Weka is used for the simulation, and the accuracy of the model is calculated and compared with that of other algorithms. Classification with the Bayesian network shows the best accuracy, 99.51%, and the classification error is about 0.49%.
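
The reported figures can be checked with a line of arithmetic. The Python sketch below assumes only the counts quoted in this review (205 correctly classified instances out of 206 records); the function name `accuracy_percent` is hypothetical.

```python
def accuracy_percent(correct, total):
    """Share of correctly classified instances, as a percentage."""
    return 100.0 * correct / total

acc = accuracy_percent(205, 206)
err = 100.0 - acc  # the single misclassified record out of 206
print(f"accuracy = {acc:.2f}%, error = {err:.2f}%")
# prints: accuracy = 99.51%, error = 0.49%
```

The error is 1/206, roughly 0.485%, which rounds to 0.49%.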

Madhusmita Rout and Amandeep Kaur, in “Prediction of Diabetes Based on Data Mining Techniques”, predict diabetes from a clinical database using Support Vector Machine (SVM), Naive Bayes, decision tree and K-nearest neighbor. The research was conducted with several objectives, such as identifying the various complications that cause diabetes. It develops a Hybrid Genetic Algorithm that computes the best fitness value, which is used to evaluate the accuracy of diabetes prediction from clinical databases. The parameters include Pregnancies, Glucose level, Blood Pressure (mmHg), BMI (Body Mass Index), Skinfold thickness (mm), Insulin value in 2 hrs. (mu U/ml), Diabetes Pedigree function and Age (years). The authors then applied classifiers such as logistic regression, Naive Bayes, decision tree, K-nearest neighbour and support vector machine to the model. The experiment is evaluated using a Python tool, which shows the performance of each individual algorithm. Logistic regression has a higher accuracy than the other classifiers, 82.35%. Naive Bayes, decision tree, KNN and SVM obtained accuracies of 76.62%, 75.97%, 66.23% and 64.28% respectively. However, this comparison is based only on individual performance in terms of accuracy.

Table 2.1- Comparison of different classification algorithms based on the accuracy

Logistic Regression   Naive Bayes   Decision tree   KNN     SVM
82.35                 76.62         75.97           66.23   64.28

The accuracy differs between the proposed hybrid systems depending on the chosen algorithms. Error rates and other factors are also analyzed when selecting an algorithm. More datasets need to be evaluated so that decision making can be more accurate, with very low error rates.

Veena Vijayan and Ravikumar Aswathy (2014), in “Study of Data mining Algorithms for prediction and Diagnosis of Diabetes Mellitus”, study various data mining algorithms for the prediction and diagnosis of diabetes mellitus. According to the authors, several methods are available for diagnosing diabetes based on physical and chemical tests. They took 8 attributes into consideration. The authors surveyed various algorithms and found that the Expectation Maximization algorithm is the simplest, since it can be performed in two steps, but its accuracy is less than 70%, and the authors note that it is not very accurate for higher-dimensional data sets due to imprecision. The authors also used the K-nearest neighbor algorithm, which is one of the simplest classification algorithms and is also called a lazy learning algorithm. Its accuracy comes out to be 73.17% because of certain drawbacks in the algorithm. A more efficient approach used by the authors is the K-means clustering algorithm, but the accuracy of this approach is also only 66-77%. To overcome the disadvantages of the K-means clustering algorithm, the authors combined it with the KNN algorithm to form the Amalgam KNN, which improves the accuracy even for large datasets. The authors observed that the value of K is crucial: as K decreases the accuracy becomes very low, and as K increases the accuracy increases. The authors then used the Adaptive Neuro Fuzzy Inference System (ANFIS) algorithm combined with the adaptive KNN, which leads to an accuracy of 80%.
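
To make the role of K concrete, here is a minimal K-nearest-neighbor sketch in plain Python. It does not reproduce the reviewed paper's experiment; the (glucose, BMI) pairs, the query point and the function name `knn_predict` are all hypothetical, and the features are left unscaled for brevity. It only illustrates that the predicted class can change with K.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (glucose, BMI) pairs; label 1 = diabetic, 0 = non-diabetic.
train = [
    ((85, 22.0), 0), ((90, 24.5), 0), ((100, 26.0), 0), ((110, 27.5), 0),
    ((150, 33.0), 1), ((160, 35.5), 1), ((170, 31.0), 1), ((118, 30.0), 1),
]
query = (115, 28.0)

for k in (1, 3, 5):
    print(f"K={k}: predicted class {knn_predict(train, query, k)}")
# With this toy data, K=1 follows the single closest point (class 1),
# while K=3 and K=5 are outvoted by the surrounding class-0 points.
```

In practice K is usually chosen by comparing validation accuracy over several candidate values rather than fixed in advance.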

Desmond Bala Bisandu, Dorcas Dachollom Datiri, Eva Onokpasa, Godwin Thomas, Musa Maaji Haruna, Aminu Aliyu and Jerry Zachariah Yakubu, in “Diabetes Prediction Using Data Mining Techniques”, design an expert system that predicts diabetes. This research will help to automate the prediction of diabetes even before clinicians arrive. The system was designed using the Java programming language, the Weka tool and MySQL as the back end, with a strategic approach and a Naïve Bayesian classifier used for the front end. It addresses the problems of the existing system by implementing the naïve Bayes classifier. The authors used the following parameters in their research.

Table 2.2 Parameters used for prediction

Serial number   Parameter                        Description                                                            Allowed values
1               Age                              Age of the subject                                                     Discrete integer value
2               Take insulin                     Take any drug or injection that can prevent you from having diabetes   Yes or No
3               Smoke cigarette                  Whether the subject smokes cigarettes                                  Yes or No
4               Age first smoked                 Age at which the subject started smoking                               Discrete integer value
5               Where did you take the survey?   Where the subject takes the survey                                     Home or Office

A total of 155 cleaned, preprocessed records were collected and stored in a database named diabetes. All 155 were used for training the model in the classification phase. During performance testing, a sample of 50 records was drawn from the initial population of 155 as a validation set. The authors checked the accuracy of the Naive Bayes classifier using a confusion matrix. They achieved 90-95% accuracy of correctly classified instances in the classification phase. The naïve Bayes classifier based system is very useful for the diagnosis of diabetes. The system can make good predictions with little error, and this technique could be an important tool for supporting medical doctors in performing expert diagnosis. With this method the efficiency of forecasting was found to be around 95%.
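
As an illustration of how accuracy is read off a confusion matrix, the Python sketch below counts true and false positives and negatives for a small set of labels. The ten labels are hypothetical and are not the 50-record validation sample from the reviewed paper; the function names are likewise assumptions.

```python
def confusion_counts(actual, predicted):
    """Return (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Fraction of all instances that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical validation labels: 1 = diabetic, 0 = non-diabetic.
actual    = [1, 1, 1, 0, 0, 0, 0, 1, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1, 0, 1]

tp, tn, fp, fn = confusion_counts(actual, predicted)
print(f"TP={tp} TN={tn} FP={fp} FN={fn} accuracy={accuracy(tp, tn, fp, fn):.2f}")
# prints: TP=4 TN=4 FP=1 FN=1 accuracy=0.80
```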

Mr. R. Sengamuthu, Mrs. R. Abirami and Mr. D. Karthik, in “Various data mining Techniques analysis to predict diabetes mellitus”, explore the early prediction of diabetes using various data mining techniques. The attributes of the dataset can be classified and clustered using techniques such as Naive Bayes, J48, PLS-LDA, SVM, BLR, MLP, K-NN and Bayesian Network, with the Tanagra, WEKA and MATLAB tools. The dataset comprises 9 attributes and 768 instances.

Table 2.3: Attributes of Diabetes Dataset

Attribute No.   Attribute   Description
1               Plasma      Plasma glucose concentration at 2 hours in an oral glucose tolerance test
2               Pressure    Diastolic blood pressure (mmHg)
3               Skin        Triceps skin fold thickness (mm)
4               Insulin     2-Hour serum insulin (mu U/ml)
5               Pregnancy   Number of times pregnant
6               Mass        Body Mass Index (BMI)
7               Pedigree    Diabetes Pedigree function
8               Age         Age (in years)
9               Class       Class variable (0 or 1)

They used many data mining techniques and obtained various results depending on the technique applied. The highest accuracy in predicting the disease, 99.87%, was obtained using the J48 classifier with the WEKA and MATLAB tools. The performance of the algorithm is calculated using the equations for Total Accuracy and Random Accuracy. The work also creates a user-friendly interface and environment for patients without any requirement for a doctor or hospital staff.
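
The equations for Total Accuracy and Random Accuracy are not spelled out here. One common reading, assumed in the sketch below, is that they are the two ingredients of Cohen's kappa statistic: total accuracy is the observed agreement, random accuracy is the agreement expected by chance, and kappa = (total - random) / (1 - random). The confusion-matrix counts and the function name are hypothetical.

```python
def kappa_from_counts(tp, tn, fp, fn):
    """Return (total accuracy, random accuracy, Cohen's kappa) for a 2x2 confusion matrix."""
    n = tp + tn + fp + fn
    total_accuracy = (tp + tn) / n
    # Chance agreement: product of actual and predicted marginals, summed over both classes.
    random_accuracy = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / (n * n)
    kappa = (total_accuracy - random_accuracy) / (1 - random_accuracy)
    return total_accuracy, random_accuracy, kappa

# Hypothetical counts, not results from the reviewed paper.
total_acc, random_acc, kappa = kappa_from_counts(tp=40, tn=45, fp=5, fn=10)
print(f"total={total_acc:.2f} random={random_acc:.2f} kappa={kappa:.2f}")
# prints: total=0.85 random=0.50 kappa=0.70
```

Kappa is lower than raw accuracy whenever part of the agreement could have occurred by chance, which is why it is often reported alongside accuracy.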

Research Methodology

3.1 Research Design

The framework for research design is given below:

17
Figure 3.1.1: Framework for Diabetes Prediction

 Data Collection: Data collection is the process of gathering and measuring

information on targeted variables in an established system. Here we will collect the

relevant data for our research.

 Data Pre-processing: Data pre-processing is a data mining technique which is used

to transform the raw data in a useful and efficient format. Major steps of data pre-

processing are data cleaning, removing noisy data, data transformation and data

reduction.

Steps Involved in Data Preprocessing:

1. Data Cleaning: The data can have many irrelevant and missing parts. Data cleaning handles these, including missing data and noisy data.

(a) Missing Data: This situation arises when some values are missing from the data. It can be handled in various ways, such as ignoring the affected tuples or filling in the missing values.

(b) Noisy Data: Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc. It can be handled with the binning method, regression or clustering.

2. Data Transformation: This step transforms the data into forms suitable for the mining process. Data transformation methods include normalization, attribute selection, discretization and concept hierarchy generation.

3. Data Reduction: Data mining is used to handle huge amounts of data, and analysis becomes harder as the volume grows. Data reduction techniques address this problem. They aim to increase storage efficiency and reduce data storage and analysis costs. The main approaches to data reduction are data cube aggregation, attribute subset selection, numerosity reduction and dimensionality reduction.
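Two of the steps above, filling in missing values and normalization, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the glucose readings are hypothetical, missing values are marked with None, and mean imputation and min-max scaling are just one possible choice for each step, not necessarily the ones used in this research.

```python
from statistics import mean

def fill_missing_with_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    column_mean = mean(observed)
    return [column_mean if v is None else v for v in column]

def min_max_normalize(column):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical glucose readings; None marks a missing value.
glucose = [148, 85, None, 89, 137]
cleaned = fill_missing_with_mean(glucose)   # data cleaning
scaled = min_max_normalize(cleaned)         # data transformation
```

After both steps every value lies between 0 and 1, which matters for distance-based classifiers such as KNN.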

In our research we have used the Pima Indian Diabetes Dataset. After doing the data

preprocessing, the statistical summary is given below:

Figure 3.1.2: Statistical Summary

• Training Dataset: Training data is the initial set of data from which a program learns how the input variables relate to the result it must produce. We use the PIMA Indian training dataset for our research.

• Test Dataset: A test dataset is independent of the training dataset but follows the same probability distribution as the training dataset.

• Classifier: A classifier uses the training data to learn how the given input variables relate to the class. We use six classification algorithms in our research.

3.2 Dataset Collection

For this research, the PIMA Indian dataset was collected from the UCI Machine Learning Repository. The data were originally collected from the Pima people of America. The National Institute of Diabetes and Digestive and Kidney Diseases of the National Institutes of Health (NIH) originally owned the Pima Indian Diabetes Database (PIDD).

The dataset contains records of 768 patients with nine attributes. Of the conditional attributes, six come from physical examination and the rest from chemical examination. This dataset has already been used by many researchers in experimental work to predict the onset of diabetes mellitus. Data pre-processing is required to obtain structured data. There are eight input attributes, and the last attribute is the output. The goal is to use the first eight variables to predict the value of the ninth.

Table 3.2.1: PIMA Attributes and Description

Serial  Attributes                          Description
1       Pregnancies                         Number of times pregnant
2       Glucose level                       Glucose concentration at 2 hours in an oral glucose tolerance test
3       Blood Pressure (mmHg)               Diastolic blood pressure (mm Hg)
4       BMI (Body Mass Index)               Body mass index (weight in kg / (height in m)^2)
5       Skinfold thickness (mm)             Triceps skin fold thickness (mm)
6       Insulin value in 2 hrs. (mu U/ml)   2-Hour serum insulin (mu U/ml)
7       Diabetes Pedigree function          A function which scores likelihood of diabetes based on family history
8       Age (years)                         Age (years)
9       Outcome                             Positive or Negative

The rules generated by the proposed cascaded model are given below:

Table 3.2.2: Rules generated by proposed model

Serial  Condition                                                           Result
1       If Glucose level=low                                                Negative
2       If Glucose level=medium & Age=low & Pedigree=low                    Negative
3       If Glucose level=medium & Age=low & Pedigree=medium & BP=medium     Negative
4       If Glucose level=medium & Age=low & Pedigree=medium & BP=low        Negative
5       If Glucose level=medium & Age=low & Pedigree=medium & BP=high       Positive
6       If Glucose level=medium & Age=high                                  Positive
7       If Glucose level=medium & Age=low & Pedigree=high                   Positive
8       If Glucose level=medium & Age=medium                                Positive
9       If Glucose level=high                                               Positive
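The nine rules of Table 3.2.2 can be read as a small nested decision procedure. The sketch below encodes them in Python as an illustration only. It assumes the attributes have already been discretized into the levels "low", "medium" and "high"; the cut-off values for that discretization are not given here, so none are assumed. The function name classify_by_rules is hypothetical.

```python
def classify_by_rules(glucose, age=None, pedigree=None, bp=None):
    """Apply the nine rules of Table 3.2.2 to discretized levels.

    Each argument is "low", "medium" or "high". Returns "Positive",
    "Negative", or None when no rule matches the given levels.
    """
    if glucose == "low":                      # rule 1
        return "Negative"
    if glucose == "high":                     # rule 9
        return "Positive"
    if glucose == "medium":
        if age in ("medium", "high"):         # rules 8 and 6
            return "Positive"
        if age == "low":
            if pedigree == "low":             # rule 2
                return "Negative"
            if pedigree == "high":            # rule 7
                return "Positive"
            if pedigree == "medium":
                if bp in ("low", "medium"):   # rules 4 and 3
                    return "Negative"
                if bp == "high":              # rule 5
                    return "Positive"
    return None

example = classify_by_rules("medium", age="low", pedigree="medium", bp="high")
```

Glucose level is tested first because rules 1 and 9 decide the outcome on that attribute alone; the other attributes matter only when the glucose level is medium.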

3.3 Tools

Orange Tool: Orange is an open-source data visualization, machine learning and data

mining toolkit. It features a visual programming front-end for explorative data analysis and

interactive data visualization.

Orange components are called widgets and they range from simple data visualization, subset

selection, and preprocessing, to empirical evaluation of learning algorithms and predictive

modeling.

Orange consists of a canvas interface onto which the user places widgets and creates a data

analysis workflow. Widgets offer basic functionalities such as reading the data, showing

a data table, selecting features, training predictors, comparing learning algorithms,

visualizing data elements, etc.

Figure 3.3.1: A typical workflow in Orange 3

The program provides a platform for experiment selection, recommendation systems, and

predictive modeling and is used in biomedicine, bioinformatics, genomic research, and

teaching. In science, it is used as a platform for testing new machine learning algorithms and

for implementing new techniques in genetics and bioinformatics. In education, it was used

for teaching machine learning and data mining methods to students of biology, biomedicine,

and informatics.

3.4 Classification Algorithms

Classification is the process of identifying the category of a new observation on the basis of a training set of data containing observations whose categories are known.

Following the problem identification in the introduction, we developed a classification model to predict diabetes. The model uses different classifiers, such as Naïve Bayes, Decision Tree and Random Forest. Each algorithm is applied to the model individually to obtain its accuracy.

These algorithms follow a supervised learning approach that can be applied to many types of data. A classifier learns from the input data and then uses what it has learned to classify new data. This technique helps in identifying the class label to which new data belongs. We use the following classification algorithms:

3.4.1 Naïve Bayes Classifier

The Naïve Bayes classifier is a probabilistic statistical classifier based on Bayes' theorem. Its major advantages are its simplicity and its efficiency in handling datasets that contain many attributes.

Studies have compared the different classification techniques developed so far. A classifier assigns an object to one of a predefined set of classes based on the object's descriptive attributes. Naïve Bayes assumes conditional independence, in which the value of each attribute is independent of the others given the class. It calculates the probability of each class for a given record and then outputs the class with the highest probability.

If 'A' is the prior event and 'B' is the dependent event, Bayes' theorem can be written as:

Prob (B given A) = Prob (A and B) / Prob (A) = Prob (A given B) × Prob (B) / Prob (A)

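Limitations of Naïve Bayes:

1) The naive Bayes classifier may need a large number of records to estimate the class probabilities reliably.

2) It assumes that the attributes are conditionally independent given the class, which rarely holds exactly in practice. Unlike instance-based (lazy) learners, it does not store the training samples.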
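To make the probabilistic approach concrete, the following is a minimal sketch of a Naïve Bayes classifier for numeric attributes. It assumes each attribute follows a normal (Gaussian) distribution within each class, which is a common choice but an assumption of this sketch and not something stated in this work. The six training records and the function names are hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate the class prior and per-attribute mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        stats = []
        for values in zip(*members):          # one attribute at a time
            mu = sum(values) / len(values)
            var = sum((v - mu) ** 2 for v in values) / len(values)
            stats.append((mu, var + 1e-9))    # small constant avoids zero variance
        model[label] = (len(members) / len(rows), stats)
    return model

def predict(model, row):
    """Return the class with the highest posterior, computed in log space."""
    def log_posterior(label):
        prior, stats = model[label]
        total = math.log(prior)
        for x, (mu, var) in zip(row, stats):
            # log of the Gaussian density; attributes are treated as independent
            total += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)

# Hypothetical records (glucose, BMI); label 1 = diabetic, 0 = not diabetic.
rows = [(85, 22.0), (90, 24.5), (95, 26.0), (150, 33.0), (160, 35.5), (170, 31.0)]
labels = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(rows, labels)
low_risk = predict(model, (88, 23.0))
high_risk = predict(model, (165, 34.0))
```

The per-attribute terms are simply added in log space, which is exactly where the conditional independence assumption enters the computation.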

3.4.2 Decision Tree Algorithm

A decision tree can be used for classification or regression. It has a tree structure and breaks a large dataset down into smaller and smaller subsets. A decision node can have two or more branches, and a leaf node represents a classification or decision.

In rule-based studies, different trees are built and converted into sets of rules, and these rules are then reduced and filtered. The objective is to analyze how many rules are generated and how many remain after filtering and reduction, and to compare this with the number of rules generated by an association rule approach on the same database. The sets of rules built by decision trees were much smaller than the results of association rules.

Limitation of Decision Tree:

1) Empty branches.

2) Insignificant branches.

3) Overfitting.
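How a decision node breaks a dataset into smaller subsets can be illustrated with a single split on one numeric attribute. The sketch below scores candidate thresholds with the Gini impurity, the criterion used by CART-style trees; the choice of Gini, the function names and the six records are assumptions made for illustration, since the splitting criterion is not specified here.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure subset)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted impurity) for the best split 'value <= threshold'."""
    best = (None, float("inf"))
    # The largest value is skipped because it would leave the right subset empty.
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical glucose readings and outcomes (1 = diabetic).
glucose = [85, 90, 95, 150, 160, 170]
outcome = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(glucose, outcome)
```

A full tree repeats this search on each resulting subset. Letting it continue until every leaf is pure is what produces the insignificant branches and overfitting listed above.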

3.4.3 Random Forest Algorithm

Random forest is a classification algorithm based on many decision trees. It is used to obtain better predictive performance than could be obtained from any single decision tree alone, which is why multiple decision trees are combined. The combined prediction is generally more accurate than that of any individual tree.

Figure 3.4.1: Example of Random Forest Algorithm

In the given image, there are nine tree predictions. Each individual tree in the random forest outputs a class prediction, and the class with the most votes becomes the model's prediction. So, in this case, Predict 1 is the final outcome.
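The voting step can be sketched as follows. The nine votes are hypothetical values chosen so that class 1 wins, and the function name is illustrative.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical votes from nine trees: six predict class 1, three predict class 0.
votes = [1, 0, 1, 1, 0, 1, 1, 0, 1]
final_prediction = majority_vote(votes)
```

Using an odd number of trees avoids ties when there are only two classes.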

3.4.4 Logistic Regression Algorithm

Logistic regression is a supervised learning classification algorithm used to predict the

probability of a target variable. The nature of target or dependent variable is dichotomous,

which means there would be only two possible classes.

In simple words, the dependent variable is binary in nature having data coded as either 1

(stands for success/yes) or 0 (stands for failure/no).

Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of

the simplest ML algorithms that can be used for various classification problems such as spam

detection, Diabetes prediction, cancer detection etc.
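The way the model turns X into P(Y=1) can be sketched with the logistic (sigmoid) function applied to a weighted sum of the inputs. The coefficients and input values below are hypothetical and are not fitted to the PIMA data.

```python
import math

def predict_probability(weights, bias, features):
    """P(Y=1 | X) as the sigmoid of the weighted sum w . x + b."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(weights, bias, features, threshold=0.5):
    """Code the outcome as 1 when the probability reaches the threshold, else 0."""
    return 1 if predict_probability(weights, bias, features) >= threshold else 0

# Hypothetical coefficients for two scaled inputs (glucose, BMI).
weights, bias = [2.0, 1.0], -1.5
p = predict_probability(weights, bias, [0.9, 0.6])
label = predict_class(weights, bias, [0.9, 0.6])
```

The default threshold of 0.5 is a convention; lowering it trades specificity for sensitivity.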

Limitations of Logistic Regression:

1. It assumes a LINEAR relationship between the input variables and the log-odds of the outcome.

2. The fitted coefficients are NOT resistant to outliers.

3. There may be variables other than x which are not studied, yet do influence the response variable.

4. A strong association does NOT imply a cause and effect relationship.

5. Extrapolation beyond the range of the training data is dangerous.

3.4.5 Support Vector Machine

SVM is one of the most popular Supervised Learning algorithms, which is used for

Classification as well as Regression problems. However, primarily, it is used for

Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can

segregate n-dimensional space into classes so that we can easily put the new data point in the

correct category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the diagram below, in which two different categories are classified using a decision boundary or hyperplane:

Figure 3.4.5.1: Example of SVM

Limitations of SVM:

1. SVM algorithm is not suitable for large data sets.

2. SVM does not perform very well when the data set has more noise i.e. target classes

are overlapping.

3. In cases where the number of features for each data point exceeds the number of

training data samples, the SVM will underperform.

4. As the support vector classifier works by placing data points above and below the classifying hyperplane, there is no probabilistic explanation for the classification.
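Once a hyperplane has been found, classifying a new point only requires checking on which side of it the point lies. The sketch below shows that step for a linear SVM. The hyperplane coefficients are hypothetical, and the training procedure that would find them is not shown.

```python
def decision_value(weights, bias, point):
    """Signed score w . x + b; its sign gives the side of the hyperplane."""
    return sum(w * x for w, x in zip(weights, point)) + bias

def svm_predict(weights, bias, point):
    """Return +1 for points on or above the hyperplane and -1 for points below it."""
    return 1 if decision_value(weights, bias, point) >= 0 else -1

# Hypothetical hyperplane x1 + x2 - 3 = 0 in two dimensions.
weights, bias = [1.0, 1.0], -3.0
above = svm_predict(weights, bias, (2.5, 2.0))
below = svm_predict(weights, bias, (0.5, 1.0))
```

The output is a side of the hyperplane, not a probability, which is the point made in limitation 4.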

3.4.6 KNN

The KNN algorithm is a machine learning algorithm that is considered a lazy learning algorithm, with a low training cost and a very simple implementation. It supports classification and regression problems. It stores the entire training dataset and, when making a prediction, queries it to locate the k data points in the training set that are most similar to the data point to be classified. Therefore, there is no model other than the raw training dataset, and the only computation performed is the querying of the training dataset.

When the KNN method is used for regression, the response value is calculated as a weighted average of the responses of the k neighbors, where the weight is inversely proportional to the distance from the input record.

Suppose there are two categories, Category A and Category B, and we have a new data point x1. To decide in which of these categories the data point lies, we can use the K-NN algorithm, which identifies the category or class of a particular data point. Consider the diagram below:

Figure 3.4.6.1: Example of KNN

Limitations of KNN:

1. Accuracy depends on the quality of the data.

2. With large data, the prediction stage might be slow.

3. Sensitive to the scale of the data and irrelevant features.

4. Require high memory – need to store all of the training data.

5. Given that it stores all of the training data, it can be computationally expensive.
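A minimal sketch of KNN classification is given below, using the Category A and Category B example. It assumes Euclidean distance and k = 3; the six training points and the query point x1 are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows."""
    # Every training row is compared with the query: there is no fitted model.
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical two-dimensional points for Category A and Category B.
train_rows = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.5), (7.0, 6.0), (6.5, 7.0)]
train_labels = ["A", "A", "A", "B", "B", "B"]
x1 = (2.0, 2.0)
category = knn_predict(train_rows, train_labels, x1)
```

Because every prediction scans the whole training set, prediction time and memory grow with the size of the data, as noted in limitations 2, 4 and 5.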

Result and Discussion

The dataset has 9 attributes and 768 instances. All patients are females at least 21 years old of Pima Indian heritage. Of the 768 patients in the PID dataset, the classification algorithms used a training set of 576 patients and a testing set of 192 patients.
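A split of this size corresponds to holding out 25% of the records for testing. The sketch below shows one way to produce it; the random shuffle, the seed value and the function name are assumptions made for illustration, and the integers stand in for the patient records.

```python
import random

def train_test_split(records, test_fraction=0.25, seed=42):
    """Shuffle a copy of the records and hold out the last part for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:n_train], shuffled[n_train:]

records = list(range(768))            # placeholder for the 768 patient records
train, test = train_test_split(records)
sizes = (len(train), len(test))
```

Fixing the seed makes the split reproducible, so that all classifiers are compared on the same training and testing records.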

4.1 Performance Measures

To compute performance metrics such as sensitivity, specificity and accuracy, a confusion matrix is obtained from the classification results of these algorithms. A confusion matrix is a matrix representation of the classification results, as shown in Table 4.1.1.

Table 4.1.1: Confusion Matrix

                      Classified as Healthy    Classified as not Healthy
Actual Healthy        TP                       FN
Actual not Healthy    FP                       TN

Accuracy is the percentage of predictions that are correct. Precision is the fraction of correct predictions among the instances predicted to belong to a specific class. Sensitivity is the percentage of positive instances that were predicted as positive, and specificity is the percentage of negative instances that were predicted as negative. These performance criteria for the classifiers in disease detection are evaluated from the confusion matrix as follows.

Accuracy = (TP+TN) / (TP+FP+TN+FN)

Sensitivity = TP / (TP+FN)

Specificity = TN / (FP+TN)

Positive Prediction = TP / (TP+FP)

Negative Prediction = TN / (TN+FN)
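These measures can be computed directly from the four cells of the confusion matrix. In the sketch below, positive and negative prediction are taken as the standard positive and negative predictive values, TP / (TP+FP) and TN / (TN+FN). The counts are hypothetical and are not results of this study.

```python
def performance_measures(tp, fn, fp, tn):
    """Compute the performance measures from the confusion matrix cells."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "positive_prediction": tp / (tp + fp),   # positive predictive value
        "negative_prediction": tn / (tn + fn),   # negative predictive value
    }

# Hypothetical counts for a test set of 192 patients (not results of this study).
measures = performance_measures(tp=110, fn=15, fp=25, tn=42)
```

Positive prediction computed this way is the same quantity as precision, and sensitivity is the same quantity as recall.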

4.2 Results

4.2.1 Accuracy Measures:

The Naive Bayes, SVM, Logistic Regression, Random Forest, KNN and Decision Tree algorithms are used in this research work. Experiments are performed using 10-fold internal cross-validation. Accuracy, F-Measure, Recall and Precision are used to evaluate the classifiers in this work. Table 4.2.1.1 defines these accuracy measures:

Table 4.2.1.1: Accuracy Measures

Measures           Definitions                                                                      Formula
1. Accuracy (A)    Accuracy determines how accurately the algorithm predicts instances.            A = (TP+TN) / (Total no. of samples)
2. Precision (P)   The classifier's correctness is measured by Precision.                          P = TP / (TP+FP)
3. Recall (R)      Recall measures the classifier's completeness or sensitivity.                   R = TP / (TP+FN)
4. F-Measure (F)   F-Measure is the harmonic mean of precision and recall.                         F = 2*(P*R) / (P+R)

Accuracy of each algorithm:

1. Naive Bayes: 71.42%

2. Decision tree: 68.18%

3. Random Forest: 75.97%

4. K Nearest neighbors: 78.57%

5. Support Vector Machine: 73.37%

6. Logistic Regression: 71.42%

We can see KNN algorithm has the highest accuracy which is 78.57%.

Table 4.2.1.2: Classification report of KNN algorithm

Class   Precision   Recall   F-Measure   Accuracy %
0.0     0.81        0.87     0.84        78.57
1.0     0.72        0.63     0.67
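The F-Measure column of this report can be checked against the precision and recall columns using F = 2*(P*R) / (P+R). The short sketch below recomputes it; only the function name is introduced for illustration.

```python
def f_measure(precision, recall):
    """F = 2 * P * R / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall taken from the KNN classification report.
f_class_0 = round(f_measure(0.81, 0.87), 2)
f_class_1 = round(f_measure(0.72, 0.63), 2)
```

Rounded to two decimals, both values agree with the report: 0.84 for class 0.0 and 0.67 for class 1.0.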

The classifier's performance in terms of Accuracy, Precision, F-measure and Recall is listed in Table 4.2.1.2, and its performance on the basis of classified instances is shown in Table 4.2.1.3, where TP denotes True Positive, TN denotes True Negative, FP denotes False Positive and FN denotes False Negative.

Table 4.2.1.2 presents the performance values calculated on the various measures. From the accuracies listed above and Table 4.2.1.2, KNN shows the maximum accuracy. So, the KNN machine learning classifier can predict the chances of diabetes with more accuracy than the other classifiers.

Conclusion

Various data mining techniques and their applications were studied and reviewed. Different algorithms were applied to find the best accuracy for diabetes prediction. In our case, KNN provided the highest accuracy.

In this study, we used the diabetic patient follow-up data. We have combined feature

selection and imbalanced processing techniques. In this work, we offered proof that KNN

algorithm can be successfully used for Diabetes Prediction.

The main aim of this project was to design and implement diabetes prediction using data mining methods and to analyze the performance of those methods, and this has been achieved successfully. The proposed approach uses various classification and ensemble learning methods, namely the Naive Bayes, SVM, KNN, Random Forest, Decision Tree and Logistic Regression classifiers. KNN achieved the highest accuracy, 78.57%. The experimental results can assist health care providers in making early predictions and early decisions to treat diabetes and save lives.

The ability of our model to predict patients with Diabetes using some commonly used

lab results is high with satisfactory sensitivity. These models can be built into an online

computer program to help physicians in predicting patients with future occurrence of diabetes

and providing necessary preventive interventions. The model is developed and validated on

the Bangladeshi population which is more specific and powerful to apply on Bangladeshi

patients than existing models developed. Fasting blood glucose, body mass index, age,

insulin were the most important predictors in these models.


The goal of the work is to reduce the false positive and false negative rates as much

as possible, so as to boost up the precision and recall rates. The KNN is utilized for diabetes

prediction, owing to its faster learning capability. The performance of the work is analyzed

by varying the classifiers and tested against existing techniques. The experimental results

prove the efficacy of the proposed approach and in future, this work is planned to be

extended such that the medical images are processed.

Lately, medical Data mining has gained in interest by the scientific and research

communities. Diabetes is considered as the world's fastest-growing chronic disease. It needs

continuous self-management and control to maintain blood glucose level within the normal

range, in order to prevent complications. Data mining has played an important role in

diabetes research. Data mining would be a valuable asset for diabetes researchers because it

can unearth hidden knowledge from a huge amount of diabetes-related data. We believe that

data mining can significantly help diabetes research and ultimately improve the quality of

health care for diabetes patients. Using data mining to deal with the avalanche of clinical data

collected from patients and generated from the research and management of diabetes is a

valuable asset that can help researchers and clinicians provide better health care for the

patients affected by this modern-society disease. The present study concludes that elderly

diabetes patients should be given an assessment and a treatment plan that is suited to their

needs and lifestyles. Public health awareness of simple measures such as low sugar diet,

exercise, and avoiding obesity should be promoted by health care providers. In this study,

predictions on the effectiveness of different treatment methods for young and old age groups

were elucidated. The preferential orders of treatment were found to be different for the young

and old age groups. Diet control, weight reduction, exercise and smoking cessation are

mutually beneficial to each other for the treatment of diabetes.

Finally, using all six machine learning algorithms, we measured different parameters within the dataset and obtained the best accuracy, 78.57%, with KNN. This work can be extended by adding other algorithms that may give better accuracy than KNN.

References

[1] Ravi Sanakal, Smt. T Jayakumari, May (2014). Prognosis of Diabetes Using Data mining
Approach-Fuzzy C Means Clustering and Support Vector Machine International Journal of
Computer Trends and Technology (IJCTT) (volume 11 number 2) Dorcas Dachollom Datiri.
[2] Kaseda C, Kobayashi M, Yamaguchi M, Yamazaki K (2006). “Prediction of blood
glucose level of type 1 diabetics using response surface methodology and data mining” 44(6)
Med Biol Eng. Compute.
[3] Afshar Aalam, M. N Doja, Sapna Jain .February 25 – 26( 2010) ”K-MEANS
CLUSTERING USING WEKA INTERFACE”, 4th National Conference; INDIACom-2010
Computing For Nation Development, Jiawei Han, Micheline Kamber, Jian Pei, “Data Mining
Concept and Techniques” (Third edition) Mukesh kumari .
[4] Jiawei Han, Jian Pei, Micheline Kamber, (2007). “Data Mining Conceptsand
Techniques”: Database - “patient data base” (Third edition) Mukesh kumari et al (IJCSIT).

[5] I.Parvin Begum, K.Tajudin, V.Karthikeyani, December (2012) Comparative of Data
Mining Classification Algorithm (CDMCA) in Diabetes Disease Prediction: International
Journal of Computer Applications (Volume 60– No.12) K.Tajudin.
[6] G. Parthiban and K. R. Ananthapadmanaban (October 2014) Prediction of Chances
-Diabetic Retinopathy using Data Mining Classification Techniques Indian Journal of
Science and Technology, Vol 7(10), Sopharak.
[7] K.R Lakshmi, S.Premkumar, June (2013) “Utilization of Data mining Techniques for
prediction of Diabetes Disease survivability”, International Journal of Scientific &
Engineering Research, vol.4 Issue 6 S. Sapna.
[8] Dr. A. Tamilarasi and M. Pravin Kumar, January (2012) “Implementation of Genetic
Algorithm in predicting Diabetes”, International Journal of computer science, (vol.9 Issue 1,
No.3) Murat Koklu and Yauz Unal.
[9] Murat Koklu and Manaswini Pradhan, Yauz Unal April (2011) “ predict the onset of
diabetes disease using Artificial Neural Network”, “ International Journal of Computer
Science & Emerging Technologies, (vol.2 Issue 2) Arwa Al-Rofiyee, Maram Al-Nowiser.
