Spring 2020 Diabetes Prediction Using Data Mining
A Thesis in the Partial Fulfillment of the Requirements for the Award of Bachelor of
Computer Science and Engineering (BCSE)
The thesis has been examined and approved,
_____________________________
Prof. Dr. Md. Abdul Haque
Chairman and Professor
_____________________________
Prof. Dr. Utpal Kanti Das
Co-supervisor, Coordinator and Professor
_____________________________
Nusrath Tabassum
Supervisor and Lecturer
Spring 2020
Abstract
Many diseases around the world cause serious health problems, and data mining has
become very popular in the health industry. This paper helps predict diabetes and examines
how people in different age groups are affected by diabetes based on their lifestyle
activities. According to the WHO, more than 463 million people are suffering from diabetes.
Diabetes is the seventh leading cause of death, and its rate among young people is now
rising alarmingly. Diabetes is incurable, but if it can be predicted at an early stage, it can
be managed with treatment. Clinical decisions are often made based on doctors’ experience
rather than on the rich data available. The objective of this research is to find new features
and factors that can change the prediction of diabetes. As data mining techniques have proved
effective in predictive analysis, the proposed approach uses data mining to predict the risk
of diabetes. The performance of the algorithms is also measured and improved using feature
selection and selection of the training set. In this paper we used 768 records. We used
several methods and tools, including the Decision Tree, Naïve Bayes, Random Tree, Support
Vector Machine, KNN and Logistic Regression algorithms and the Weka tool, and found that the
Orange tool gave the highest accuracy. This study can be further extended to deal with
datasets with multiple classes. This paper gives a detailed review of existing data mining
methods used for the prediction of diabetes. It also gives future directions for estimating
the severity of diabetes. Moreover, these data analysis results can be used for further
The Chairman
Thesis Defense Committee
Department of Computer Science and Engineering
IUBAT–International University of Business Agriculture and Technology
4 Embankment Drive Road, Sector 10, Uttara Model Town
Dhaka 1230, Bangladesh
Subject: Letter of Transmittal.
Dear Sir,
It is a great pleasure for us to hand over the result of our hard work on Diabetes
Prediction Using Data Mining. We tried to give our best in preparing this report.
While preparing the report, we gained a great deal of practical experience that will help us
in our careers. It has enriched our practical knowledge of prediction. We
will be glad to provide any further clarification if necessary. We would like to thank
you for giving us the opportunity to prepare a report on Diabetes Prediction Using Data Mining.
We hope you will appreciate our hard work and excuse any minor errors. Thank you for your
cooperation.
Yours sincerely,
_____________ _____________
Fahima Afroz Rozy Fariha Tabassum
ID:17103091 ID:17103092
Student’s Declaration
This thesis paper titled “Diabetes Prediction Using Data Mining”, submitted by the group of
Fahima Afroz Rozy and Fariha Tabassum, has been supervised by Nusrath
Tabassum, Lecturer, Department of Computer Science and Engineering, IUBAT. Her support
enabled us to complete it well.
The complete study is based on a literature survey, a study of periodicals, journals and
websites, and the building of a model to prove the concept studied and designed. We further
declare that the complete thesis work, including all analysis, hypotheses, inferences and
interpretation of data and information, was done by us.
_____________ _____________
ID 17103091 ID 17103092
Supervisor’s Certification
This is to certify that the work contained in the thesis entitled “Diabetes Prediction
Using Data Mining”, submitted by Fahima Afroz Rozy, ID-17103091 and Fariha Tabassum,
ID-17103092 has been accepted by our supervisor Nusrath Tabassum as satisfactory in
partial fulfillment of the requirements for the degree B.Sc. in Computer Science and
Engineering.
This report meets the standard required for submission. To the best of our
knowledge, the results summarized in the report for B.Sc. degree in Computer Science.
_______________________________
Nusrath Tabassum
Lecturer
Firstly, we want to thank our supervisor Nusrath Tabassum, whose guidance and
support have been essential in our research. She shared many ideas to help us understand
what a thesis is, what we should do, and which tools and methodologies we were going to use.
Our continuous discussions have been a constant source of insightful ideas, significantly
shaping the main contributions of this thesis. Her advice encouraged us to finish this work
smoothly. Her guidance gave us a clear vision of the ways of data collection and the
We would like to thank our ART 203 course instructor Prof. Dr. Abhijit Saha, who
made our thesis work much easier to understand. He conducted many vivas and gave reports and
assignments based on the thesis work, which helped us a lot. Without his valuable effort, this
Finally, we would like to thank our family members for believing in us and for
encouraging us to fulfill our dreams. We would also like to end by saying thanks to our
Abstract
Letter of Transmittal
Student’s Declaration
Supervisor’s Certification
Acknowledgments
List of Tables
Introduction
1.1. Background
Literature Review
Research Methodology
3.3 Tools
Conclusion
References
Introduction
Data mining is the process of sorting through large data sets to identify patterns and
establish relationships in order to solve problems through data analysis. Data mining tools
make it possible to predict outcomes. We can use this information to increase revenues,
reduce costs, improve customer relationships, reduce risks and more. Data mining is also known as Knowledge
days data mining is rapidly becoming successful in a wide range of applications, such as
financial forecasting, healthcare and weather forecasting. Currently Data mining
medical data. Data mining applications in healthcare include analysis of health care centers
for better health policy-making, prevention of hospital errors, and early detection and
prevention of diseases. They also include the prevention of avoidable hospital deaths, better
value for money and cost savings.
Researchers are using data mining techniques in the prediction of different types of
diseases such as diabetes, stroke, cancer, and heart disease. Data mining algorithm is used for
Diabetes is a chronic disease that occurs either when the pancreas does not produce
enough insulin or when the body cannot effectively use the insulin it produces. Insulin is a
hormone that regulates blood sugar. Because of an imbalance of blood sugar in the body,
people face many problems such as damage to the heart, blood vessels, eyes, kidneys, and nerves.
Diabetes can also be a cause of death. Diabetes is the cause of 2.6% of global blindness.
Figure 1: Disorder because of diabetes
1.1. Background
According to the 2011 National Diabetes Fact Sheet, 8.3% (25 million) of the U.S. population
has diabetes. Diabetes is the seventh leading cause of death according to U.S. death
certificates. In recent years, diabetes has become one of the major causes of death worldwide.
According to the WHO, around 1.6 million people worldwide died due to diabetes in 2016. It is
estimated that 463 million people are living with diabetes all over the world. The number of diabetes
Figure 1.1: Number of adults (20-79 years) with diabetes worldwide
Bangladesh is a developing country where 75% of the total population lives in rural areas.
urban and rural areas. According to research from 1994 to 2013 the prediction of type 2
Federation 7.1 million people with diabetes in Bangladesh and almost an equal number with
inactivity
Several data mining techniques are used to predict diabetes, such as Naïve
Bayes, Decision Tree, neural network, kernel density, automatically defined groups, bagging
algorithm, and support vector machine, showing different levels of accuracy. Applying
data mining in disease diagnosis and treatment is beneficial for patients, especially for
diabetes patients. Researchers have proven that hospitals do not provide the same
quality of service even though they provide having Diabetes disease. Diabetes disease
Type 1 DM results from the body's failure to produce insulin. This form was
diabetes".
People with type 1 diabetes will usually take a combination of long acting
Type 2 DM results from insulin resistance, a condition in which cells fail to use
insulin properly, sometimes also with an absolute insulin deficiency. This form was
previously referred to as non-insulin-dependent diabetes mellitus (NIDDM) or "adult-
onset diabetes".
Some foods affect our blood sugar significantly more than others
and so picking the diet for type 2 diabetes that works for you can
Gestational diabetes is the third main form and occurs when pregnant women
levels.
Like-
Exercise Regularly
Choose Foods With a Low Glycemic Index
It becomes a cause for other illnesses also like blindness, kidney failure,
The deaths due to diabetes and high blood glucose are on the rise.
Prediction of diabetes at an early stage would help the patients to maintain the
In this paper we use orange method, decision tree, naïve Bayes to predict disease and
A1C
Application of data mining in analyzing the medical data is a good method for
rapid rate. It has been widely recognized that medical data analysis can lead to an
The following are the objectives leading to achievement of the primary objective
mentioned below:
To identify the best classification model which can help the physicians in
conditions.
To identify the patients at risk, with the aim of increasing quality of care and to
Literature Review
The purposes of this study were to predict diabetes among children, adults and old people
and to identify age- and sex-specific thresholds of low strength for detection of risk. In
general, predictive algorithms work well when the data is normally distributed or symmetric,
but in the real world most data is not normally distributed. To improve the efficiency of
predictive algorithms, the variables that are not normally distributed need to be
transformed.
mining” uses the application of techniques for data mining in the Healthcare and the
prediction of Diabetes. The author has deeply examined the use of data mining techniques in
classification such as Decision Tree technique and Regression models, methodology Cross
Industry Standard to discover the hidden information from the data. Important input variables
Flu shot, heart attack diagnosis and other variables. The average squared error for the
selected model is 0.043, which is a low error. The most important variable, with a major
effect on diabetes, is high blood pressure. For people whose high_blood_pressure_diagnosis
value is 2 or -1, the probability of not being affected by diabetes is 98%, and in the other case the
make health prediction by using this automation. This automation system helps and update -
In this paper, the authors present the techniques and applications of data mining in
medicinal and clinical predictions. They have used already existing information in different
databases to rework it into new research and results. This framework includes some initial
parts, like login, entering side effects in the system, recommending medications, and
proposing a nearby specialist. When symptoms occur, the patient needs a specialist's help,
but the specialist may not be accessible for some reason. This can be a limitation of this paper.
problems which is predicted according to the symptoms shown by the patient. They also
In this paper-
• They have used 3 algorithms to evaluate patient's symptoms such as SVM, Decision
• They talked about techniques like association rule mining, classification, clustering to
• By using MAFIA algorithms they find out the accuracy of their data.
This system is time consuming because it searches insignificant branches. They
used Excel to check the redundancy of data. This is a good side of this system, as the
previous system did not use this kind of data redundancy checker. This system is capable of ensuring
Mukesh kumari, Dr. Rajan Vohra, Anshul Arora “Prediction of Diabetes Using
The authors in this paper have used various data mining techniques like decision tree,
Bayesian network, Naïve Bayes model, Weka and neural network for the prediction of
diabetes, that is, whether a patient is suffering from diabetes or not. This paper
uses 206 records and 9 attributes. According to the experimental results, the Bayesian
network correctly classified 205 instances. The accuracy of the Bayesian network is 99.51%,
which is high. The Bayesian network is a promising technique for this type of dataset. Decision tree model for
applied to the modified dataset to construct the Bayesian model. Weka is used to do the
simulation, and the accuracy of the model is calculated and compared with the efficiency of
other algorithms. Classification with the Bayesian network shows the best accuracy, 99.51 percent, and
Mining Techniques” predicts diabetes from a clinical database by using Support
Vector Machine (SVM), Naive Bayes, Decision Tree and K-nearest neighbor. This research
was conducted with various objectives, such as identifying the various complications
that cause diabetes. It develops a Hybrid Genetic Algorithm that computes the best fitness
value, which is used for evaluating the prediction accuracy of diabetes from clinical
databases. The parameters are Pregnancies, Glucose level, Blood Pressure
(mmHg), BMI (Body Mass Index), Skinfold thickness (mm), Insulin value in 2 hrs. (mu
U/ml), Diabetes Pedigree function and Age (years). They then applied classifiers like
logistic regression, Naive Bayes, Decision Tree, K-nearest neighbour and Support Vector
Machine to the model. The experiment is evaluated using a Python tool which shows the
than other classifiers, i.e. 82.35%. The other classifiers, naive Bayes, decision tree, KNN
and SVM, obtained accuracies of 76.62%, 75.97%, 66.23% and 64.28% respectively. But this is
The accuracy differs between the proposed hybrid systems depending on the chosen
algorithms. The error rates and other factors are also analyzed to decide which algorithm
should be selected. More datasets need to be evaluated so the decision making can be more
Algorithms for prediction and Diagnosis of Diabetes Mellitus” studies various data
mining algorithms for the prediction and diagnosis of diabetes mellitus. According to the
author, several methods are available for diagnosing diabetes based on the physical
and chemical tests that are performed. Eight attributes were taken into consideration.
The author surveyed various algorithms and found that the Expectation Maximization
algorithm is the simplest. It can be performed in two steps, but its accuracy is less than
70%, and the author noted that this algorithm is not very accurate
for higher dimensional data sets due to imprecision. The author also used the K-nearest
Neighbor algorithm, which is one of the simplest classification algorithms and is also known
as a lazy learning algorithm. The accuracy of this algorithm comes out to be 73.17% because of
certain drawbacks in the algorithm. A more efficient approach used by the author is the
K-means clustering algorithm, but the accuracy of this approach is also only 66-77%. To
overcome the disadvantages of the K-means clustering algorithm, the author combined it with
the KNN algorithm to form the Amalgam KNN, which improves the accuracy even for large
datasets. The author observed that the value of K is crucial: as the value of K decreases
the accuracy drops sharply, and as the value of K increases the accuracy improves.
The author then uses the Adaptive Neuro Fuzzy Inference System (ANFIS)
algorithm combined with the adaptive KNN, which leads to an accuracy of 80%.
Prediction Using Data Mining Techniques” designs an expert system that predicts
diabetes. This research will help in automating the prediction of diabetes even before
clinicians arrive. The system was designed using the Java programming language, the Weka
tool, and MySQL as the back end, and a strategic approach and a Naïve
Bayesian classifier were used for the front end. It solves the problems of the existing
system by implementing the Naïve Bayes classifier. They used some parameters to justify their research.
2 | Take insulin | Take any drug or injection that can prevent you from having diabetes | Yes or No
3 | Smoke cigarette | Whether the subject smokes cigarettes | Yes or No
4 | Age first smoked | Age at which the subject started smoking | Discrete integer value
5 | Where did you take the survey? | Where the subject takes the survey | Home or Office
A total of 155 cleaned and preprocessed records were collected and stored in a
database named diabetes. All 155 were used for training the model in the classification phase.
During performance testing, a sample of 50 records was drawn from the initial 155 as
a validation set. They checked the accuracy of the Naive Bayes classifier using a confusion
matrix. Finally, they achieved 90-95% accuracy of correctly classified instances in the
classification phase. The Naïve Bayes classifier based system is very useful for the diagnosis
of diabetes. The system can perform good prediction with less error, and this technique could
be an important tool for supporting medical doctors in performing expert diagnosis. In
Techniques analysis to predict diabetes mellitus” explores the early prediction of diabetes
using various data mining techniques. The above attributes can be classified and clustered
using various techniques such as Naive Bayes, J48, PLS-LDA, SVM, BLR, MLP, K-NN and Bayesian
Network, and the Tanagra, WEKA and MATLAB tools. The dataset comprises 9 attributes and 768
instances.
They used many data mining techniques and obtained various results depending on the
technique applied. They obtained the highest accuracy, 99.87%, using the J48 classifier with
the WEKA and MATLAB tools. The performance of the
algorithm is calculated using the equations for Total Accuracy and Random Accuracy.
Thereby creating a user-friendly interface and environment for the patients without any
Research Methodology
Figure 3.1.1: Framework for Diabetes Prediction
to transform the raw data into a useful and efficient format. The major steps of data
pre-processing are data cleaning, removing noisy data, data transformation and data
reduction.
1. Data Cleaning: The data can have many irrelevant and missing parts. To
handle this part, data cleaning is done. It involves handling of missing data, noisy
data etc.
(a) Missing Data: This situation arises when some values are missing from the data. It
can be handled in various ways, such as ignoring the tuples or filling in the missing
values.
(b) Noisy Data: Noisy data is meaningless data that can’t be interpreted by
machines. It can be generated due to faulty data collection, data entry errors etc. It can
appropriate forms suitable for mining process. Data transformation ways are:
huge amount of data. When working with a huge volume of data, analysis becomes
harder. To address this, we use data reduction techniques. They
aim to increase storage efficiency and reduce data storage and analysis costs.
The various steps to data reduction are: Data Cube Aggregation, Attribute
In our research we have used the Pima Indian Diabetes Dataset. After doing the data
Figure 3.1.2: Statistical Summary
understand how to apply technologies to learn and produce advanced results. We will
variables relate to the class. We will use three classification algorithms in our
research.
For this research, the Pima Indian dataset was collected from the UCI Machine Learning
Repository. It was originally collected from the Pima people of America. The National
Institute of Diabetes and Digestive and Kidney Diseases of the National Institutes of Health
Diabetes occurred in the dataset, which contains records of 768 patients with nine attributes.
Of the nine attributes, six come from physical examination and the rest from
chemical examination. This dataset has already been used by many researchers in their
experimental work to predict the onset of diabetes mellitus. Data pre-processing is required
to obtain structured data. There are eight input attributes and the last one is the output.
The goal is to use the first eight variables to predict the value of the ninth variable.
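A minimal sketch of this input/output split, assuming each record is a list of nine values with the outcome stored last. The input names follow the attribute list used in this thesis; the name `Outcome` and the sample record are hypothetical and serve only to illustrate the layout.

```python
# Input names follow the attribute descriptions in this thesis.
# "Outcome" and the sample record below are hypothetical illustrations.
FEATURES = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]
TARGET = "Outcome"


def split_record(record):
    """Split one 9-value record into (8 named inputs, 1 output)."""
    if len(record) != len(FEATURES) + 1:
        raise ValueError("expected 8 inputs followed by 1 output")
    inputs = dict(zip(FEATURES, record[:-1]))
    output = record[-1]
    return inputs, output


sample = [2, 120, 70, 25, 80, 28.5, 0.45, 33, 0]  # hypothetical patient
x, y = split_record(sample)
print(x["Glucose"], y)
```

Keeping the inputs in a named mapping makes it harder to mix up columns when the same record is later passed to different classifiers.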
4 | BMI (Body Mass Index) | Body mass index (weight in kg / (height in m)^2)
6 | Insulin value in 2 hrs. (mu U/ml) | 2-Hour serum insulin (mu U/ml)
The rules generated by the proposed cascaded model are given below:
Pedigree=medium & BP= high
3.3 Tools
Orange Tool: Orange is an open-source data visualization, machine learning and data
mining toolkit. It features a visual programming front-end for explorative data analysis and
Orange components are called widgets and they range from simple data visualization, subset
modeling.
Orange consists of a canvas interface onto which the user places widgets and creates a data
visualizing data elements, etc.
Figure 3.3.1: A typical workflow in Orange 3
The program provides a platform for experiment selection, recommendation systems, and
teaching. In science, it is used as a platform for testing new machine learning algorithms and
for implementing new techniques in genetics and bioinformatics. In education, it was used
for teaching machine learning and data mining methods to students of biology, biomedicine,
and informatics.
3.4 Classification Algorithms
Classification is the process of identifying a new observation category set on the basis of
a classification model that will predict diabetes. In this model, different classifiers will be
used like Naïve Bayes, Decision tree and Random Forest. Each individual algorithm will be
These types of algorithms fall under the supervised learning approach and can be applied to
any type of data. Classification learns from the input data and then uses what it has learned
to classify new data. This technique helps in identifying the class labels into which new data can fit. We
A Naive Bayesian classifier is a probabilistic statistical classifier based on Bayes’ theorem.
The major advantage of using this Naïve Bayesian classifier lies in its simplicity
Studies have been made to compare the different techniques of classification which have
been developed so far. A set of programs that assign a class of predefined set to an object
under construction is based on the descriptive attributes. This classifier assumes conditional
independence, in which each attribute value is independent of the others. It calculates the
probability of each tag for a given text, and then outputs the tag with the highest
probability. This is done by using a
probabilistic approach which computes class probabilities and predicts the most probable classes.
If ‘A’ is referred to as the prior event and ‘B’ as the dependent event, Bayes’ theorem can be given as:
P(A|B) = P(B|A) P(A) / P(B)
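As a small worked illustration of Bayes' theorem, P(A|B) = P(B|A) P(A) / P(B), the sketch below computes a posterior probability from a prior and two likelihoods. The events and all numbers are hypothetical and are not taken from the dataset used in this thesis.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(A|B) using Bayes' theorem."""
    # P(B) via the law of total probability over A and not-A.
    p_b = (likelihood_b_given_a * prior_a
           + likelihood_b_given_not_a * (1.0 - prior_a))
    return likelihood_b_given_a * prior_a / p_b


# Hypothetical events: A = "patient has diabetes", B = "high glucose reading".
p = posterior(prior_a=0.10,
              likelihood_b_given_a=0.80,
              likelihood_b_given_not_a=0.20)
print(round(p, 4))  # 0.3077
```

A Naive Bayes classifier applies the same rule once per class, multiplying the per-attribute likelihoods under the independence assumption, and picks the class with the largest posterior.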
1) The naive Bayes classifier requires a very large number of records to obtain good results.
tree structure. It breaks down a big data set into smaller subsets. A decision node can
Different trees are built and converted into different sets of rules, and these rules are
further reduced and filtered. The main objective is to analyze the number of rules
that are generated and how many rules remain after filtering and
reduction. It also analyzes how many rules are generated by employing the association rule
approach on the same database. The sets of rules built by decision trees were much smaller
Limitation of Decision Tree:
1) Empty branches.
2) Insignificant branches.
3) Over fitting.
algorithm is used to obtain better predictive performance than could be obtained from any
of the constituent algorithms alone. That is why we use multiple decision trees in this case. This
Figure 3.4.1: Example of Random Forest Algorithm
In the given image, there are nine test predictions. Each individual tree in the random forest
spits out a class prediction and the class with the most votes becomes our model’s prediction.
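The voting step described here can be sketched as follows. This shows only how individual tree outputs are combined, not how the trees are trained; the nine votes are hypothetical.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    label, _ = counts.most_common(1)[0]
    return label


# Nine hypothetical tree outputs (1 = diabetic, 0 = not diabetic).
votes = [1, 0, 1, 1, 0, 1, 1, 0, 1]
forest_prediction = majority_vote(votes)
print(forest_prediction)  # 1 (six votes against three)
```

Using an odd number of trees for a two-class problem avoids tied votes.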
In simple words, the dependent variable is binary in nature having data coded as either 1
Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of
the simplest ML algorithms that can be used for various classification problems such as spam
3. There may be variables other than x which are not studied, yet do influence the
response variable.
5. Extrapolation is dangerous.
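Returning to the model itself, a minimal sketch of how a logistic regression model turns inputs X into P(Y=1) is given below. The coefficients are hypothetical placeholders rather than values fitted in this study, and the 0.5 decision threshold is a common default, assumed here.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weights, bias):
    """Return P(Y=1 | x) for a logistic regression model."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)


# Hypothetical coefficients for two scaled inputs (e.g. glucose and BMI).
weights = [1.2, 0.8]
bias = -0.5

p_low = predict_proba([0.0, 0.0], weights, bias)   # sigmoid(-0.5)
p_high = predict_proba([1.0, 1.0], weights, bias)  # sigmoid(1.5)
label_low = 1 if p_low >= 0.5 else 0
label_high = 1 if p_high >= 0.5 else 0
print(label_low, label_high)  # 0 1
```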
SVM is one of the most popular Supervised Learning algorithms, which is used for
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in the
correct category in the future. This best decision boundary is called a hyperplane.
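To illustrate how a hyperplane separates classes, the sketch below classifies points by the sign of w·x + b. The hyperplane is hypothetical and given by hand; finding the best (maximum-margin) hyperplane is the training step of SVM and is not shown here.

```python
def decision_value(x, w, b):
    """Signed score w.x + b; its sign tells the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(x, w, b):
    """Assign class +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(x, w, b) >= 0 else -1


# Hypothetical hyperplane x1 + x2 - 3 = 0 in a 2-dimensional space.
w, b = [1.0, 1.0], -3.0
print(classify([2.5, 2.0], w, b))  # 1  (score 1.5)
print(classify([0.5, 1.0], w, b))  # -1 (score -1.5)
```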
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed Support Vector
Machine. Consider the below diagram in which there are two different categories that are
Limitations of SVM:
2. SVM does not perform very well when the data set has more noise i.e. target classes
are overlapping.
3. In cases where the number of features for each data point exceeds the number of
4. As the support vector classifier works by putting data points, above and below the
3.4.6 KNN
algorithm, with a low computational cost and very simple implementation. It supports
classification and regression problems. When making a prediction, it stores the entire training
dataset and queries it to locate k data points in the training set that are most similar to the
data point to be classified. Therefore, there is no model other than the raw training dataset,
When the KNN method is used for regression, the response value is calculated as a weighted
sum of the responses of all the k neighbors, where the weight is inversely proportional to the
Suppose there are two categories, i.e., Category A and Category B, and we have a new data
point x1, and we need to decide which of these categories it belongs to. To solve this type of
problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the
Figure 3.4.6.1: Example of KNN
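The Category A / Category B example can be sketched in a few lines. This is an illustrative implementation using Euclidean distance and an unweighted majority vote, both assumed here; the points are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbours."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical 2-D points for Category A and Category B.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
          (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["A", "A", "A", "B", "B", "B"]
x1 = (2.0, 2.0)
print(knn_predict(points, labels, x1, k=3))  # A
```

Because the whole training set is scanned for every query, prediction cost grows with the number of stored records.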
Limitations of KNN:
Result and Discussion
The dataset has 9 attributes and 768 instances. In particular, all patients are
females at least 21 years old of Pima Indian heritage. From the 768 patients in the PID
dataset, the classification algorithms used a training set with 576 patients and a testing dataset
To find performance metrics such as sensitivity, specificity and accuracy, a
confusion matrix is obtained based on the classification results from these algorithms.
4.1.1.
 | Classified as Healthy | Classified as not Healthy
Actual Healthy | TP | FN
Actual not Healthy | FP | TN
Accuracy is the percentage of predictions that are correct. Precision is the
measure of accuracy provided that a specific class has been predicted. Sensitivity is the
percentage of positive labeled instances that were predicted as positive. These performance
criteria for the classifiers in disease detection are evaluated from the confusion
matrix as follows.
Sensitivity = TP / (TP+FN)
Specificity = TN / (FP+TN)
4.2 Results
Naive Bayes, SVM, Logistic Regression, Random Forest, KNN and Decision Tree
algorithms are used in this research work. Experiments are performed using 10-fold internal
cross-validation. Accuracy, F-Measure, Recall and Precision measures are used for the
1. Accuracy (A) | Accuracy determines the accuracy of the algorithm in predicting instances | A = (TP+TN) / (Total no. of samples)
2. Precision (P) | Classifier’s correctness/accuracy is measured by Precision | P = TP / (TP+FP)
3. Recall (R) | To measure the classifier’s completeness or sensitivity, Recall is used. | R = TP / (TP+FN)
4. F-Measure | F-Measure is the weighted average of precision and recall. | F = 2*(P*R) / (P+R)
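The measures defined above can be computed directly from the four confusion-matrix counts. The sketch below is illustrative: the counts are hypothetical and are not results from this study, and division by zero (for example when no instance is predicted positive) is not handled.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the evaluation measures from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # also called sensitivity
    specificity = tn / (fp + tn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_measure": f_measure,
    }


# Hypothetical counts for 100 test instances.
m = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```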
Accuracy of each algorithm:
We can see KNN algorithm has the highest accuracy which is 78.57%.
Class | Precision | Recall | F-Measure | Accuracy %
0.0 | 0.81 | 0.87 | 0.84 | 78.57
1.0 | 0.72 | 0.63 | 0.67 |
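As a consistency check on the reported values, the F-Measure can be recomputed from the precision and recall of each class using F = 2*(P*R) / (P+R). The sketch below uses only the figures reported in the table; the class labels 0.0 and 1.0 are taken as they appear there.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-Measure) per class, as reported above.
reported = {
    "0.0": (0.81, 0.87, 0.84),
    "1.0": (0.72, 0.63, 0.67),
}

for cls, (p, r, f_reported) in reported.items():
    f = f_measure(p, r)
    print(cls, round(f, 2), f_reported)
```

Both recomputed values round to the reported F-Measures (0.84 and 0.67).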
Recall values are listed in Table 4.2.1.2 and the classifiers’ performance on the basis of
classified instances is defined in Table 4.2.1.3, where TP denotes True Positive, TN denotes True
classifiers’ performance on the basis of Accuracy, Precision, F-measure and Recall values is
listed in Table 4.2.1.2 and the classifiers’ performance on the basis of classified instances is
calculated on various measures. From Table 4.2.1.2 it can be seen that KNN shows the
maximum accuracy. So, the KNN machine learning classifier can predict the chances of
Conclusion
Various data mining techniques and their applications were studied and reviewed.
Different algorithms were applied to find the best accuracy for diabetes prediction. In our
In this study, we used the diabetic patient follow-up data. We have combined feature
selection and imbalanced processing techniques. In this work, we offered proof that KNN
The main aim of this project was to design and implement diabetes prediction
methods and to analyze the performance of those methods, and it has been achieved successfully.
The proposed approach uses various classification and ensemble learning methods, in which
SVM, KNN, Random Forest, Decision Tree and Logistic Regression classifiers are used.
KNN achieved an accuracy of 78.57%. The experimental results can help health care providers
make early predictions and early decisions to treat diabetes and save human lives.
The ability of our model to predict patients with Diabetes using some commonly used
lab results is high with satisfactory sensitivity. These models can be built into an online
computer program to help physicians in predicting patients with future occurrence of diabetes
and providing necessary preventive interventions. The model is developed and validated on
the Bangladeshi population which is more specific and powerful to apply on Bangladeshi
patients than existing models developed. Fasting blood glucose, body mass index, age,
as possible, so as to boost the precision and recall rates. KNN is used for diabetes
prediction, owing to its faster learning capability. The performance of the work is analyzed
by varying the classifiers and is tested against existing techniques. The experimental results
demonstrate the efficacy of the proposed approach, and in future, this work is planned to be
Lately, medical data mining has gained interest from the scientific and research
continuous self-management and control to maintain blood glucose level within the normal
range, in order to prevent complications. Data mining has played an important role in
diabetes research. Data mining would be a valuable asset for diabetes researchers because it
can unearth hidden knowledge from a huge amount of diabetes-related data. We believe that
data mining can significantly help diabetes research and ultimately improve the quality of
health care for diabetes patients. Using data mining to deal with the avalanche of clinical data
collected from patients and generated from the research and management of diabetes is a
valuable asset that can help researchers and clinicians provide better health care for the
patients affected by this modern-society disease. The present study concludes that elderly
diabetes patients should be given an assessment and a treatment plan that is suited to their
needs and lifestyles. Public awareness of simple measures such as a low-sugar diet,
exercise, and avoiding obesity should be promoted by health care providers. In this study,
predictions on the effectiveness of different treatment methods for young and old age groups
were elucidated. The preferential orders of treatment were found to be different for the young
and old age groups. Diet control, weight reduction, exercise and smoking cessation are
Finally, using all six machine learning algorithms, we measured different
parameters within the dataset and obtained a better accuracy rate with KNN, at
nearly 78.57%. This work can be extended by adding any other algorithm which can give
References
[1] Ravi Sanakal, Smt. T Jayakumari, May (2014). Prognosis of Diabetes Using Data mining
Approach-Fuzzy C Means Clustering and Support Vector Machine International Journal of
Computer Trends and Technology (IJCTT) (volume 11 number 2) Dorcas Dachollom Datiri.
[2] Kaseda C, Kobayashi M, Yamaguchi M, Yamazaki K (2006). “Prediction of blood
glucose level of type 1 diabetics using response surface methodology and data mining” 44(6)
Med Biol Eng. Compute.
[3] Afshar Aalam, M. N. Doja, Sapna Jain, February 25–26 (2010). “K-MEANS
CLUSTERING USING WEKA INTERFACE”, 4th National Conference; INDIACom-2010
Computing For Nation Development, Jiawei Han, Micheline Kamber, Jian Pei, “Data Mining
Concept and Techniques” (Third edition) Mukesh kumari .
[4] Jiawei Han, Jian Pei, Micheline Kamber, (2007). “Data Mining Concepts and
Techniques”: Database - “patient data base” (Third edition) Mukesh kumari et al (IJCSIT).
[5] I.Parvin Begum, K.Tajudin, V.Karthikeyani, December (2012) Comparative of Data
Mining Classification Algorithm (CDMCA) in Diabetes Disease Prediction: International
Journal of Computer Applications (Volume 60– No.12) K.Tajudin.
[6] G. Parthiban and K. R. Ananthapadmanaban (October 2014) Prediction of Chances
-Diabetic Retinopathy using Data Mining Classification Techniques Indian Journal of
Science and Technology, Vol 7(10), Sopharak.
[7] K.R Lakshmi, S.Premkumar, June (2013) “Utilization of Data mining Techniques for
prediction of Diabetes Disease survivability”, International Journal of Scientific &
Engineering Research, vol.4 Issue 6 S. Sapna.
[8] Dr. A. Tamilarasi and M. Pravin Kumar, January (2012) “Implementation of Genetic
Algorithm in predicting Diabetes”, International Journal of computer science, (vol.9 Issue 1,
No.3) Murat Koklu and Yauz Unal.
[9] Murat Koklu and Manaswini Pradhan, Yauz Unal April (2011) “Predict the onset of
diabetes disease using Artificial Neural Network”, International Journal of Computer
Science & Emerging Technologies, (vol.2 Issue 2) Arwa Al-Rofiyee, Maram Al-Nowiser.