Fahima Afroz Rozy and Fariha Tabassum

Diabetes Prediction Using Data Mining
Fahima Afroz Rozy and Fariha Tabassum
A Thesis in the Partial Fulfillment of the Requirements
for the Award of Bachelor of Computer Science and Engineering (BCSE)
Department of Computer Science and Engineering

College of Engineering and Technology
IUBAT – International University of Business Agriculture and Technology
Spring 2020
Diabetes Prediction Using Data Mining
Fahima Afroz Rozy and Fariha Tabassum
A Thesis in the Partial Fulfillment of the Requirements for the Award of Bachelor of
Computer Science and Engineering (BCSE)
The thesis has been examined and approved,
_____________________________
Prof. Dr. Md. Abdul Haque
Chairman and Professor
_____________________________
Prof. Dr. Utpal Kanti Das
Co-supervisor, Coordinator and Professor
_____________________________
Nusrath Tabassum
Supervisor and Lecturer

College of Engineering and Technology
IUBAT – International University of Business Agriculture and Technology
Spring 2020
Abstract
Many diseases around the world cause serious health problems, and data mining has become
very popular in the health industry. This paper helps predict whether people in different age
groups are affected by diabetes based on their lifestyle activities. It is estimated that more
than 463 million people are living with diabetes, and diabetes is the seventh leading cause of
death. Young people are now the most affected by it. We discuss three types of diabetes.
Diabetes is growing at an alarming rate. Although it is incurable, it can be kept under control
with treatment if it is predicted at an early stage. Clinical decisions are often made based on
doctors’ experience rather than on the rich data held in databases. The objective of this
research is to find new features and factors that can change the prediction of diabetes.
Because data mining techniques have proved effective in predictive analysis, the proposed
approach uses data mining to predict the risk of diabetes. The performance of the algorithm
is also measured and improved using feature selection and selection of the training set. In
this paper we used 769 records. We applied several methods and tools, including the Decision
Tree algorithm, the Naïve Bayes algorithm, Random Tree, Support Vector Machine, KNN,
Logistic Regression, and the Weka tool, and found that Orange gave the highest accuracy.
This study can be further extended to deal with datasets with multiple classes. The paper
gives a detailed review of existing data mining methods used for the prediction of diabetes
and a future direction for estimating the severity of diabetes. Moreover, the results of this
data analysis can be used in further research to enhance the accuracy of the prediction system.

Letter of Transmittal
The Chairman
Thesis Defense Committee
IUBAT–International University of Business Agriculture and Technology
4 Embankment Drive Road, Sector 10, Uttara Model Town
Dhaka 1230, Bangladesh
Subject: Letter of Transmittal.
Dear Sir,
It is a great pleasure for us to hand over the result of our hard work on Diabetes
Prediction Using Data Mining. We tried to give our best in preparing this report.
While preparing the report, we gained a great deal of practical experience that will help us
greatly in our careers. It has strengthened our practical knowledge of prediction. We will be
glad to provide further clarification if necessary. We would like to thank you for giving us
the opportunity to prepare a report on Diabetes Prediction Using Data Mining. We hope you
will appreciate our hard work and excuse any minor errors. Thank you for your cooperation.
Yours sincerely,
_____________ _____________
Fahima Afroz Rozy Fariha Tabassum
ID:17103091 ID:17103092
Student’s Declaration
This thesis paper titled “Diabetes Prediction Using Data Mining”, submitted by the group
Fahima Afroz Rozy and Fariha Tabassum, has been supervised by Nusrath Tabassum,
Lecturer, Department of Computer Science and Engineering, IUBAT. Her support enabled us
to complete it successfully.
The complete study is based on a literature survey, a study of periodicals, journals and
websites, and the building of a model to prove the concept studied and designed. We further
declare that the complete thesis work, including all analysis, hypotheses, inferences and
interpretation of data and information, was done by us.
_____________ _____________
Fahima Afroz Rozy Fariha Tabassum
ID 17103091 ID 17103092
Supervisor’s Certification
This is to certify that the work contained in the thesis entitled “Diabetes Prediction
Using Data Mining”, submitted by Fahima Afroz Rozy, ID-17103091, and Fariha Tabassum,
ID-17103092, has been accepted by their supervisor Nusrath Tabassum as satisfactory in
partial fulfillment of the requirements for the degree of B.Sc. in Computer Science and
Engineering.
This report meets the standard required for submission. To the best of our knowledge, the
results summarized in the report are for the B.Sc. degree in Computer Science.
_______________________________
Nusrath Tabassum
Lecturer
IUBAT–International University of Business Agriculture and Technology

Acknowledgments
Firstly, we want to thank our supervisor Nusrath Tabassum, whose guidance and
support have been essential to our research. She shared many ideas to help us understand
what a thesis is, what we should do, and which tools and methodologies we would use.
Our continuous discussions have been a constant source of insightful ideas, significantly
shaping the main contributions of this thesis. Her advice encouraged us to finish this work
smoothly. Her guidance gave us a clear vision of the ways and sources of data collection.
We would like to thank our ART 203 course instructor Prof Dr Abhijit Saha, who
made our thesis work much easier to understand. He conducted many vivas and set reports
and assignments based on the thesis work, which helped us a lot. Without his valuable effort,
this report and research could not have been successful.
We are especially grateful to the Department of Computer Science and Engineering
(CSE) of IUBAT – International University of Business Agriculture and Technology for
providing us all-out support during the thesis work.
Finally, we would like to thank our family members for believing in us and for
encouraging us to fulfill our dreams. We would also like to thank our friends, who supported
us throughout our academic career.

Table of Contents
Diabetes Prediction Using Data Mining
Abstract
Letter of Transmittal
Student’s Declaration
Supervisor’s Certification
Acknowledgments
List of Tables
Introduction
1.1 Background
1.2 Problem Statement
1.3 Objective of the research
Literature Review
Research Methodology
3.1 Research Design
3.2 Dataset Collection
3.3 Tools
3.4 Classification Algorithms
Result and Discussion
Conclusion
References
List of Figures
Figure 1: Disorder because of diabetes
Figure 1.1: Number of adults (20-79 years) with diabetes worldwide
Figure 1.2: Bangladesh - Diabetes Prevalence (% of Population Ages 20 To 79)
Figure 3.1.1: Framework for Diabetes Prediction
Figure 3.1.2: Statistical Summary
Figure 3.3.1: A typical workflow in Orange 3
Figure 3.4.1: Example of Random Forest Algorithm
Figure 3.4.5.1: Example of SVM
Figure 3.4.6.1: Example of KNN

List of Tables
Table 1.1: Risk factors % of Diabetes
Table 2.1: Comparison of different classification algorithms based on the accuracy
Table 2.2: Parameters used for prediction
Table 2.3: Attributes of Diabetes Dataset
Table 3.2.1: PIMA Attributes and Description
Table 3.2.2: Rules generated by proposed model
Table 4.1.1: Confusion Matrix
Table 4.2.1.1: Accuracy Measures
Table 4.2.1.2: Classification report of KNN algorithm
Introduction
Data mining is the process of sorting through large data sets to identify patterns and
establish relationships in order to solve problems through data analysis. Data mining tools
make it possible to predict outcomes. This information can be used to increase revenues,
reduce costs, improve customer relationships, reduce risks, and more. Data mining is also
known as knowledge discovery, knowledge extraction, data/pattern analysis, information
harvesting, etc. Nowadays data mining is being applied successfully in a rapidly growing
range of areas, such as financial forecasting, healthcare, and weather forecasting. Data
mining is currently most effective in healthcare, where it supports prognosis and a deeper
understanding of medical data. Data mining applications in healthcare include the analysis
of health care centers for better health policy-making, the prevention of hospital errors, and
the early detection and prevention of diseases. It also helps reduce preventable hospital
deaths and delivers more value for money and cost savings.
Researchers are using data mining techniques to predict different types of
diseases such as diabetes, stroke, cancer, and heart disease. Data mining algorithms are used
to test the accuracy of predicting diabetic status.
Diabetes is a chronic disease that occurs either when the pancreas does not produce
enough insulin or when the body cannot effectively use the insulin it produces. Insulin is a
hormone that regulates blood sugar. An imbalance of blood sugar in the human body causes
many problems, such as damage to the heart, blood vessels, eyes, kidneys, and nerves.
Diabetes can also be a cause of death. Diabetes is the cause of 2.6% of global blindness.
Figure 1: Disorder because of diabetes
1.1 Background
According to the 2011 National Diabetes Fact Sheet, 8.3% (25 million) of the U.S.
population has diabetes. Diabetes is the seventh leading cause of death according to U.S.
death certificates. In recent years, diabetes has become one of the major causes of death
worldwide. According to the WHO, around 1.6 million people worldwide died due to diabetes
in 2016. It is estimated that 463 million people are living with diabetes all over the world,
and research projects that the number of diabetes patients will reach 700 million globally by
2045. According to the International Diabetes Federation, there are 7.1 million people with
diabetes in Bangladesh and almost an equal number with undetected diabetes. This number
could double by 2025.
Figure 1.1: Number of adults (20-79 years) with diabetes worldwide
Bangladesh is a developing country where 75% of the total population lives in rural areas.
Diabetes is increasing in Bangladesh in both urban and rural areas, and type 2 diabetes in
particular is increasing rapidly. According to research, from 1994 to 2013 the prevalence of
type 2 diabetes in Bangladesh varied from 4.5% to 35.0%.
Figure 1.2: Bangladesh - Diabetes Prevalence (% of Population Ages 20 To 79)
Prevalence of diabetes in Bangladesh:
Table 1.1: Risk factors % of Diabetes
Risk factor            Males     Females   Total
Diabetes               8.6%      7.4%      8.0%
Overweight             14.4%     19.6%     17.0%
Obesity                2.0%      4.6%      3.3%
Physical inactivity    9.2%      41.3%     25.1%
Several data mining techniques are used to predict diabetes, such as Naïve
Bayes, decision tree, neural network, kernel density, automatically defined groups, bagging
algorithm, and support vector machine, which show different levels of accuracy. Applying
data mining to disease diagnosis and treatment is beneficial for patients, especially
diabetes patients. Researchers have shown that hospitals do not provide the same quality of
service even though they treat the same disease. Diabetes professionals store significant
amounts of patient data. It is important to analyze these datasets to extract useful knowledge.
There are 3 types of diabetes:
• Type 1 DM results from the body's failure to produce insulin. This form was
previously referred to as "insulin-dependent diabetes mellitus" (IDDM) or "juvenile
diabetes".
CONTROL TYPE 1 DIABETES
    - Type 1 diabetes requires close monitoring, as blood sugar levels can be
      quite erratic through the day if left unchecked.
    - People with type 1 diabetes will usually take a combination of long-acting
      (basal) and short-acting (bolus) insulin.
• Type 2 DM results from insulin resistance, a condition in which cells fail to use
insulin properly, sometimes also with an absolute insulin deficiency. This form was
previously referred to as non-insulin-dependent diabetes mellitus (NIDDM) or "adult-onset
diabetes".
CONTROL TYPE 2 DIABETES
    - With type 2 diabetes, one of the best ways to achieve greater
      control of your diabetes is through diet.
    - Some foods affect our blood sugar significantly more than others,
      so picking the diet for type 2 diabetes that works for you can
      make a big difference to your numbers and your health.
• Gestational diabetes is the third main form and occurs when pregnant women
without a previous diagnosis of diabetes develop a high blood glucose level.
There is no cure for diabetes, but it can be treated and controlled. To
manage diabetes effectively, we need to take the following steps to manage our blood sugar
levels:
• Exercise Regularly
• Increase Your Fiber Intake
• Drink Water and Stay Hydrated
• Choose Foods With a Low Glycemic Index
• Control Stress Levels
• Monitor Your Blood Sugar Levels
• Get Enough Quality Sleep
1.2 Problem Statement
Diabetes is growing at an alarming rate. Although it is incurable, it can be kept under
control with treatment if it is predicted at an early stage.
• It also becomes a cause of other illnesses such as blindness, kidney failure,
cholesterol problems and heart disease.
• Deaths due to diabetes and high blood glucose are on the rise.
• Prediction of diabetes at an early stage would help patients keep their
sugar level under control.
In this paper we use Orange, decision tree, and naïve Bayes to predict the disease, together
with some parameters such as:
• Fasting Plasma Glucose
• Plasma Glucose Test
• A1C
1.3 Objective of the research
Applying data mining to the analysis of medical data is a good method of
investigation. The data stored in medical databases are growing at an increasingly
rapid rate. It has been widely recognized that medical data analysis can lead to an
enhancement of health care.
The following objectives lead to the achievement of the primary objective:
• To identify the best classification model that can help physicians
predict a patient's risk of diabetes using diabetes-related attributes.
• To recognize and classify patterns in multivariate patient attributes.
• To predict future outcomes based on previous experience and present
conditions.
• To identify patients at risk, with the aim of increasing the quality of care and
reducing the cost of care.
• To build a prediction model using appropriate classification techniques such as
naïve Bayes, decision trees and Orange.
Literature Review
The purposes of this study were to predict diabetes among children, adults, and old
people and to identify age- and sex-specific thresholds of low strength for detection of risk.
In general, predictive algorithms work well when the data are normally distributed or
symmetric, but most real-world data are not normally distributed. To improve the efficiency
of predictive algorithms, the variables that are not normally distributed need to be
transformed.
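The effect of transforming a skewed variable can be illustrated with a minimal Python sketch. It assumes a log transform as the transformation, which is only one common choice and is not specified in the text; the values are made up and `skewness` is a hypothetical helper written for this illustration.

```python
import math

def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n

# Illustrative, made-up right-skewed measurements (not from the thesis data).
raw = [0, 10, 15, 20, 25, 30, 45, 80, 150, 600]

# log1p(x) = log(1 + x) is defined at zero and compresses large values.
transformed = [math.log1p(v) for v in raw]

print("skewness before:", round(skewness(raw), 3))
print("skewness after: ", round(skewness(transformed), 3))
```

A skewness closer to zero after the transform indicates a more symmetric distribution.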
In “Prediction of Diabetes Based on Data Mining Techniques”, Madhusmita Rout and
Amandeep Kaur used the following algorithms:
• Naive Bayes
• SVM
• Logistic Regression
• Decision Tree
Tools used:
1. K-means
2. R tool, Anaconda
3. WEKA
In this paper they used 769 records.
• The accuracy of Logistic Regression, 82.35%, is high.
• They have 9 attributes.
• A larger dataset helps to obtain more accurate values.
In “Prediction on Diabetes Using Data mining”, Pardha Repalli (Oklahoma State
University) applies data mining techniques to healthcare and the prediction of diabetes. The
author examines in depth the use of classification techniques such as decision trees and
regression models, following the Cross-Industry Standard Process for Data Mining
(CRISP-DM) methodology, to discover hidden information in the data. Important input
variables selected by the decision tree are high_Blood_Pressure, Cholest_last_Check,
Adult_Bmi, Last Flu shot, heart attack diagnose and other variables. The average squared
error for the selected model is 0.043, which is low. The most important variable, with a
major effect on diabetes, is high blood pressure. For people whose
high_blood_pressure_diagnosis value is 2 or -1, the probability of not being affected by
diabetes is 98% and the probability of being affected is 2%. According to this research,
people aged above 45 years are the most affected by diabetes.
In “Prediction of Chances - Diabetic Retinopathy using Data Mining Classification
Techniques”, K. R. Ananthapadmanaban and G. Parthiban try to make health predictions
using an automated system. This system supports and updates the administration of health
services, clinical care, medical analysis and training.
In this paper they have used:
• A “Naive Bayes Classifier” to develop the expected framework.
• A “Decision Tree” to produce a ready-made set of explanations.
• Eclipse IDE to design the graphical user interface.
• Java to organize the different parts of the user interface.
• MySQL as the database on the web server.
The authors present the techniques and applications of data mining in medicinal and
clinical predictions. They take existing information from different databases and rework it
into new research and results. The framework includes some initial parts, such as login,
entering side effects into the system, recommending medications, and proposing a nearby
specialist. When symptoms occur, the patient needs the specialist's help, but the specialist
may not be accessible for some reason. This can be a limitation of this paper.
In “Prediction of Diabetes using Classification Algorithms”, Deepti Sisodia and Dilip
Singh Sisodia discuss different kinds of diabetes-based problems, which are predicted
according to the symptoms shown by the patient. They also discuss data mining techniques.
In this paper:
• They used three algorithms, SVM, Decision Tree and Naive Bayes, to evaluate the
patient's symptoms.
• They discussed techniques such as association rule mining, classification, and clustering
to analyze the different kinds of heart-based problems.
• They used MAFIA algorithms to determine the accuracy of their data.
• They took five symptoms from the patient before analyzing.
This system is time-consuming because it searches insignificant branches. They
used Excel to check the redundancy of the data. This is a strength of the system, since the
previous system did not use this kind of data redundancy checker. This system is capable of
ensuring maximum patient satisfaction.
In “Prediction of Diabetes Using Bayesian Network”, Mukesh Kumari, Dr. Rajan Vohra,
and Anshul Arora developed an intelligent diabetes prediction system. The authors used
various data mining techniques, such as decision tree, Bayesian network, Naïve Bayes model,
Weka, and neural network, to predict whether or not a patient is suffering from diabetes.
The study uses 206 records and 9 attributes. According to the experimental results, the
Bayesian network correctly classified 205 instances, giving a high accuracy of 99.51%. The
Bayesian network is a promising technique for this type of dataset. A decision tree model is
also considered for the diagnosis of diabetes. Pre-processing is used to improve the quality
of the data. A classifier is applied to the modified dataset to construct the Bayesian model.
Weka is used for simulation, and the accuracy of the model is calculated and compared with
the efficiency of other algorithms. Classification with the Bayesian network shows the best
accuracy, 99.51%, and the classification error is 0.48%.
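As a quick check of the arithmetic, the reported accuracy follows from the counts quoted above (205 correctly classified instances out of 206 records). The short Python sketch below only reproduces that calculation; the variable names are our own.

```python
# Counts quoted in the reviewed paper: 206 records, 205 classified correctly.
total_records = 206
correctly_classified = 205

misclassified = total_records - correctly_classified
accuracy = correctly_classified / total_records   # fraction of correct predictions
error_rate = misclassified / total_records        # fraction of wrong predictions

print(f"accuracy   = {accuracy * 100:.4f}%")
print(f"error rate = {error_rate * 100:.4f}%")
```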
In “Prediction of Diabetes Based on Data Mining Techniques”, Madhusmita Rout and
Amandeep Kaur predict diabetes from a clinical database using Support Vector Machine
(SVM), Naive Bayes, decision tree, and K-nearest neighbor. The research had several
objectives, such as identifying the various complications that cause diabetes. It develops a
hybrid genetic algorithm that computes the best fitness value, which is used to evaluate the
prediction accuracy of diabetes from clinical databases. The parameters are Pregnancies,
Glucose level, Blood Pressure (mmHg), BMI (Body Mass Index), Skinfold thickness (mm),
Insulin value in 2 hrs. (mu U/ml), Diabetes Pedigree function, and Age (years). The authors
then applied classifiers such as logistic regression, Naive Bayes, decision tree, K-nearest
neighbor, and support vector machine to the model. The experiment is evaluated using a
Python tool that shows the performance of each individual algorithm. Logistic regression has
a higher accuracy than the other classifiers, at 82.35%. Naive Bayes, decision tree, KNN, and
SVM obtained accuracies of 76.62%, 75.97%, 66.23%, and 64.28% respectively. However,
this comparison is based only on individual performance in terms of accuracy.
Table 2.1- Comparison of different classification algorithms based on the accuracy
Algorithm              Accuracy (%)
Logistic Regression    82.35
Naive Bayes            76.62
Decision tree          75.97
KNN                    66.23
SVM                    64.28
The accuracy differs between the proposed hybrid systems depending on the chosen
algorithms. Error rates and other factors are also analyzed when selecting an algorithm.
More datasets need to be evaluated so that decision making can be more accurate, with very
low error rates.
In “Study of Data mining Algorithms for prediction and Diagnosis of Diabetes
Mellitus” (2014), Veena Vijayan and Ravikumar Aswathy study various data mining
algorithms for the prediction and diagnosis of diabetes mellitus. According to the authors,
several methods are available for diagnosing diabetes based on physical and chemical tests.
They took 8 attributes into consideration. The authors surveyed various algorithms and found
that the Expectation Maximization algorithm is the simplest. It can be performed in two
steps, but its accuracy is less than 70%, and the authors note that it is not very accurate for
higher-dimensional data sets due to imprecision. They also used the K-nearest neighbor
algorithm, one of the simplest classification algorithms, also known as a lazy learning
algorithm. Its accuracy comes out to be 73.17% because of certain drawbacks. A more
efficient approach used by the authors is the K-means clustering algorithm, but its accuracy
is also only 66-77%. To overcome the disadvantages of K-means clustering, the authors
combined it with KNN to form the Amalgam KNN, which improves the accuracy even for
large datasets. They observed that the value of K is crucial: as K decreases the accuracy
becomes very low, and as K increases the accuracy increases. They then used the Adaptive
Neuro Fuzzy Inference System (ANFIS) algorithm combined with the adaptive KNN, which
leads to an accuracy of 80%.
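The sensitivity of KNN to the value of K can be seen in a small, self-contained Python sketch. This is an illustration under our own assumptions and not the reviewed authors' implementation: the data points are made up, the distance is plain Euclidean distance on unscaled features, and `knn_predict` is a hypothetical helper.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to the query and keep k of them.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Made-up (glucose, BMI) points; label 1 = diabetic, 0 = non-diabetic.
train = [
    ((85, 22.0), 0), ((90, 24.5), 0), ((100, 26.0), 0), ((105, 23.0), 0),
    ((150, 33.0), 1), ((160, 35.5), 1), ((170, 31.0), 1), ((145, 36.0), 1),
    ((110, 34.0), 1),  # an atypical point lying close to the non-diabetic group
]

query = (108, 30.0)
for k in (1, 3, 5):
    print(f"k={k}: predicted class {knn_predict(train, query, k)}")
```

With K = 1 the prediction depends entirely on the single closest point, so one atypical record can decide the class, while larger values of K average over more neighbors.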
In “Diabetes Prediction Using Data Mining Techniques”, Desmond Bala Bisandu, Dorcas
Dachollom Datiri, Eva Onokpasa, Godwin Thomas, Musa Maaji Haruna, Aminu Aliyu, and
Jerry Zachariah Yakubu design an expert system that predicts diabetes. This research will
help automate the prediction of diabetes even before clinicians arrive. The system was
designed using the Java programming language, the Weka tool, and MySQL as the back end,
while a strategic approach and the naïve Bayes classifier were used for the front end. It
solves the problems of the existing system by implementing the naïve Bayes classifier. The
authors used some parameters to justify their research.
Table 2.2 Parameters used for prediction
Serial number   Parameters                       Description                                   Allowed values
1               Age                              Age of the subject                            Discrete integer value
2               Take insulin                     Take any drug or injection that can           Yes or No
                                                 prevent you from having diabetes
3               Smoke cigarette                  Whether the subject smokes cigarettes         Yes or No
4               Age first smoked                 Age at which the subject first smoked         Discrete integer value
5               Where did you take the survey?   Where the subject takes the survey            Home or Office
A total of 155 cleaned, preprocessed records were collected and stored in a database
named diabetes. All 155 were used for training the model in the classification phase. During
performance testing, a sample of 50 records was drawn from the initial 155 as a validation
set. The authors checked the accuracy of the Naive Bayes classifier using a confusion
matrix. Finally, they achieved 90-95% accuracy of correctly classified instances in the
classification phase. The naïve Bayes classifier based system is very useful for the diagnosis
of diabetes. The system can make good predictions with few errors, and this technique could
be an important tool to support medical doctors in performing expert diagnosis. With this
method, the efficiency of forecasting was found to be around 95%.
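How accuracy is read off a confusion matrix can be sketched in a few lines of Python. The labels below are made up for illustration and are not the reviewed authors' data; `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive and a != positive:
            fp += 1
        elif p != positive and a != positive:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

# Made-up labels: 1 = diabetic, 0 = non-diabetic.
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 1, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / (tp + fp + tn + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found

print("TP, FP, TN, FN =", (tp, fp, tn, fn))
print("accuracy =", accuracy, "precision =", precision, "recall =", recall)
```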
In “Various data mining Techniques analysis to predict diabetes mellitus”, Mr. R.
Sengamuthu, Mrs. R. Abirami, and Mr. D. Karthik explore the early prediction of diabetes
using various data mining techniques. The attributes can be classified and clustered using
various techniques such as Naive Bayes, J48, PLS-LDA, SVM, BLR, MLP, K-NN, and
Bayesian Network, with the Tanagra, WEKA and MATLAB tools. The dataset comprises 9
attributes and 768 instances.
Table 2.3: Attributes of Diabetes Dataset
Attribute No.   Attribute   Description
1               Plasma      Plasma glucose concentration a 2 hours in an oral glucose tolerance test
2               Pressure    Diastolic blood pressure (mmHg)
3               Skin        Triceps skin fold thickness (mm)
4               Insulin     2-Hour serum insulin (mu U/ml)
5               Pregnancy   Number of times pregnant
6               Mass        Body Mass Index (BMI)
7               Pedigree    Diabetes Pedigree function
8               Age         Age (in years)
9               Class       Class variable (0 or 1)
The authors used many data mining techniques and obtained different results depending
on the technique applied. The highest accuracy in predicting the disease, 99.87%, was
obtained with the J48 classifier and the WEKA and MATLAB tools. The performance of the
algorithm is calculated using the equations for Total Accuracy and Random Accuracy.
The approach aims to create a user-friendly interface and environment for patients without
requiring a doctor or hospital staff.
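The equations themselves are not reproduced here. As a hedged sketch, assuming the usual definitions (total accuracy as the observed share of correct predictions, and random accuracy as the agreement expected by chance from the row and column totals of the confusion matrix), the two quantities can be combined into Cohen's kappa. The counts below are made up and `kappa_from_counts` is a hypothetical helper.

```python
def kappa_from_counts(tp, fp, tn, fn):
    """Return (total accuracy, random accuracy, Cohen's kappa) for binary counts."""
    total = tp + fp + tn + fn
    total_accuracy = (tp + tn) / total
    # Chance agreement: product of actual and predicted totals for each class.
    random_accuracy = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / total ** 2
    kappa = (total_accuracy - random_accuracy) / (1 - random_accuracy)
    return total_accuracy, random_accuracy, kappa

# Made-up confusion-matrix counts for illustration only.
total_acc, random_acc, kappa = kappa_from_counts(tp=40, fp=10, tn=45, fn=5)
print("total accuracy :", total_acc)
print("random accuracy:", random_acc)
print("kappa          :", kappa)
```

Kappa is 0 when the classifier does no better than chance and 1 when it agrees perfectly with the true labels.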
Research Methodology
3.1 Research Design
The framework for research design is given below:
Figure 3.1.1: Framework for Diabetes Prediction
• Data Collection: Data collection is the process of gathering and measuring
information on targeted variables in an established system. Here we collect the
relevant data for our research.
• Data Pre-processing: Data pre-processing is a data mining technique used to
transform raw data into a useful and efficient format. Its major steps are data
cleaning, removing noisy data, data transformation and data reduction.
Steps Involved in Data Preprocessing:
1. Data Cleaning: The data can have many irrelevant and missing parts. Data
cleaning handles these by dealing with missing data, noisy data, etc.
(a) Missing Data: This situation arises when some values are missing from the data.
It can be handled in various ways, such as ignoring the tuples or filling in the missing
values.
(b) Noisy Data: Noisy data is meaningless data that cannot be interpreted by
machines. It can be generated by faulty data collection, data entry errors, etc. It can
be handled with the binning method, regression, or clustering.
2. Data Transformation: This step transforms the data into forms suitable for
the mining process. Ways of transforming data include normalization, attribute
selection, discretization, and concept hierarchy generation.
3. Data Reduction: Data mining is used to handle huge amounts of data, and
analysis becomes harder when working with such volumes. To address this, we use
data reduction techniques, which aim to increase storage efficiency and reduce data
storage and analysis costs. The various steps of data reduction are data cube
aggregation, attribute subset selection, numerosity reduction, and dimensionality
reduction.
In our research we have used the Pima Indian Diabetes Dataset. After data
preprocessing, the statistical summary is given below:
Figure 3.1.2: Statistical Summary
• Training Dataset: Training data is the initial set of data from which a program
learns the relationship between the inputs and the results it should produce. We
use the PIMA Indian training dataset for our research.
• Test Dataset: A test dataset is independent of the training dataset but follows
the same probability distribution as the training dataset.
• Classifier: A classifier uses training data to learn how the given input variables
relate to the class. We use six classification algorithms in our research.
3.2 Dataset Collection
For this research, the PIMA Indian dataset was collected from the UCI Machine Learning
Repository. The data was originally collected from the Pima people of America. The National
Institute of Diabetes and Digestive and Kidney Diseases of the National Institutes of Health
(NIH) originally owned the Pima Indian Diabetes Database (PIDD).
The dataset contains records of 768 patients with nine attributes. Of the nine attributes,
six come from physical examination and the rest come from chemical examination. This
dataset has already been used by many researchers in experimental work to predict the
onset of diabetes mellitus. Data pre-processing is required to obtain structured data.
There are eight input attributes and one output attribute. The goal is to use the first
eight variables to predict the value of the ninth.
Table 3.2.1- PIMA Attributes and Description
Serial  Attributes                          Description
1       Pregnancies                         Number of times pregnant
2       Glucose level                       Glucose concentration 2 hours in an oral glucose tolerance test
3       Blood Pressure (mmHg)               Diastolic blood pressure (mm Hg)
4       BMI (Body Mass Index)               Body mass index (weight in kg / (height in m)^2)
5       Skinfold thickness (mm)             Triceps skin fold thickness (mm)
6       Insulin value in 2 hrs. (mu U/ml)   2-Hour serum insulin (mu U/ml)
7       Diabetes Pedigree function          A function which scores likelihood of diabetes based on family history
8       Age (years)                         Age (years)
9       Outcome                             Positive or Negative
The rules generated by the proposed cascaded model are given below:
Table 3.2.2 – Rules generated by proposed model
Serial  Condition                                                          Result
1       If Glucose level=low                                               Negative
2       If Glucose level=medium & Age=low & Pedigree=low                   Negative
3       … Pedigree=medium & BP=medium                                      …
4       … Pedigree=medium & BP=low                                         …
5       If Glucose level=medium & Age=low & Pedigree=medium & BP=high      Positive
6       If Glucose level=medium & Age=high                                 Positive
7       If Glucose level=medium & Age=low & Pedigree=high                  Positive
8       If Glucose level=medium & Age=medium                               Positive
9       If Glucose level=high                                              Positive
3.3 Tools
Orange Tool: Orange is an open-source data visualization, machine learning and data
mining toolkit. It features a visual programming front-end for explorative data analysis and
interactive data visualization.
Orange components are called widgets and they range from simple data visualization, subset
selection, and preprocessing, to empirical evaluation of learning algorithms and predictive
modeling.
Orange consists of a canvas interface onto which the user places widgets and creates a data
analysis workflow. Widgets offer basic functionalities such as reading the data, showing
a data table, selecting features, training predictors, comparing learning algorithms,
visualizing data elements, etc.
Figure 3.3.1: A typical workflow in Orange 3
The program provides a platform for experiment selection, recommendation systems, and
predictive modeling and is used in biomedicine, bioinformatics, genomic research, and
teaching. In science, it is used as a platform for testing new machine learning algorithms and
for implementing new techniques in genetics and bioinformatics. In education, it is used
for teaching machine learning and data mining methods to students of biology, biomedicine,
and informatics.
3.4 Classification Algorithms
Classification is the process of identifying the category of a new observation on the basis
of a training set of data containing observations whose categories are known.
According to the problem identification mentioned in the introduction section, we developed
a classification model that predicts diabetes. In this model, different classifiers are
used, such as Naïve Bayes, Decision Tree and Random Forest. Each algorithm is applied to
the model individually to obtain its accuracy.
These algorithms fall under the supervised learning approach, which can be applied to many
types of data. Classification learns from the input data and then uses what it has learned
to classify new data. This technique helps in identifying the class labels to which new data
belong. We use six classification algorithms:
3.4.1 Naïve Bayes Classifier
The Naive Bayes classifier is a probabilistic statistical classifier based on Bayes'
theorem. Its major advantages are its simplicity and its efficiency in handling datasets
that contain many attributes.
Studies have compared the different classification techniques developed so far. A classifier
assigns an object to one of a predefined set of classes based on its descriptive attributes.
Naive Bayes assumes conditional independence, in which the value of each attribute is
independent of the others given the class. It computes the probability of each class for a
given record and then outputs the class with the highest probability.
If ‘A’ is referred as prior event and ‘B’ as dependent event, Bayes’ theorem can be given as:
Prob (B given A) = Prob (A and B)/Prob(A)
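The relation above, and the way a Naive Bayes classifier uses it, can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, and every probability in the example is made up for demonstration rather than estimated from the PIMA dataset.

```python
def conditional_probability(p_a_and_b, p_a):
    """Prob(B given A) = Prob(A and B) / Prob(A)."""
    if p_a <= 0:
        raise ValueError("Prob(A) must be positive")
    return p_a_and_b / p_a


def naive_bayes_predict(priors, likelihoods, observed):
    """Score each class as prior * product of per-attribute likelihoods
    (the conditional independence assumption) and return the best class."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for attribute, value in observed.items():
            score *= likelihoods[label][attribute][value]
        scores[label] = score
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical numbers: 40% of records have high glucose (A),
# 30% have high glucose and are diabetic (A and B).
p_b_given_a = conditional_probability(0.30, 0.40)  # roughly 0.75

# Hypothetical priors and likelihoods for two attributes.
priors = {"positive": 0.35, "negative": 0.65}
likelihoods = {
    "positive": {"glucose": {"high": 0.7, "low": 0.3}, "age": {"high": 0.6, "low": 0.4}},
    "negative": {"glucose": {"high": 0.2, "low": 0.8}, "age": {"high": 0.3, "low": 0.7}},
}
label, scores = naive_bayes_predict(priors, likelihoods, {"glucose": "high", "age": "high"})
print(p_b_given_a, label, scores)
```

The scores are unnormalized; dividing each by their sum would give class probabilities, but the predicted class is the same either way.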
Limitations of Naïve Bayes:
1) The Naive Bayes classifier needs enough records for each class to estimate its
probabilities reliably.
2) It assumes that the attributes are conditionally independent given the class, which
rarely holds exactly in real data.
3.4.2 Decision Tree Algorithm
A decision tree can be used for classification or regression. It has a tree structure
that breaks a large dataset down into smaller and smaller subsets. A decision node has
two or more branches, and a leaf node represents a classification or decision.
Different trees were built and converted into sets of rules, and these rules were
further reduced and filtered. The main objective was to analyze the number of rules
generated and how many rules remained after filtering and reduction. It was also analyzed
how many rules would be generated by applying an association rule approach to the same
database. The sets of rules built by decision trees were much smaller than those produced
by association rules.
Limitations of Decision Tree:
1) Empty branches.
2) Insignificant branches.
3) Overfitting.
3.4.3 Random Forest Algorithm
Random forest is a classification algorithm based on many decision trees. It is used to
obtain better predictive performance than could be obtained from any of its individual
trees alone, which is why multiple decision trees are combined. The combined prediction is
generally more accurate than that of any individual tree.
Figure 3.4.1: Example of Random Forest Algorithm
In the given image, there are nine predictions. Each individual tree in the random forest
outputs a class prediction, and the class with the most votes becomes the model's prediction.
So, in this case, Predict 1 is the final outcome.
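The voting step can be sketched in a few lines. The nine votes below are an assumption made for illustration (six trees voting 1 and three voting 0); the text above only states that there are nine predictions and that class 1 wins.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Assumed per-tree outputs: six votes for class 1, three for class 0.
votes = [1, 0, 1, 1, 0, 1, 1, 0, 1]
final_prediction = majority_vote(votes)
print(final_prediction)
```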
3.4.4 Logistic Regression Algorithm
Logistic regression is a supervised learning classification algorithm used to predict the
probability of a target variable. The target or dependent variable is dichotomous,
which means there are only two possible classes.
In simple words, the dependent variable is binary, with data coded as either 1
(success/yes) or 0 (failure/no).
Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of
the simplest ML algorithms and can be used for various classification problems such as spam
detection, diabetes prediction and cancer detection.
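How P(Y=1) is obtained as a function of X can be sketched with the logistic (sigmoid) function. The weights, bias and feature values below are hypothetical placeholders, not coefficients fitted in this research, and the 0.5 decision threshold is an assumed default.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(weights, bias, features):
    """P(Y=1 | X) = sigmoid(bias + sum of weight * feature)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict_class(weights, bias, features, threshold=0.5):
    """Code the outcome as 1 (yes) or 0 (no) by thresholding the probability."""
    return 1 if predict_probability(weights, bias, features) >= threshold else 0


# Hypothetical coefficients for two scaled input attributes.
weights = [0.8, 0.5]
bias = -1.0
patient = [1.2, 0.4]

probability = predict_probability(weights, bias, patient)
outcome = predict_class(weights, bias, patient)
print(probability, outcome)
```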
Limitations of Logistic Regression:
1. It assumes a linear relationship between the input variables and the log-odds of the
outcome.
2. The fitted model is not resistant to outliers.
3. There may be variables other than x which are not studied, yet do influence the
response variable.
4. A strong association does not imply a cause-and-effect relationship.
5. Extrapolation beyond the range of the training data is dangerous.
3.4.5 Support Vector Machine
SVM is one of the most popular supervised learning algorithms and is used for
classification as well as regression problems. However, it is primarily used for
classification problems in machine learning.
The goal of the SVM algorithm is to create the best line or decision boundary that
segregates n-dimensional space into classes, so that new data points can easily be placed
in the correct category. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed Support Vector
Machine. Consider the diagram below, in which two different categories are
classified using a decision boundary or hyperplane:
Figure 3.4.5.1: Example of SVM
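Once a hyperplane has been found, classifying a new point reduces to checking which side of the hyperplane it lies on. The sketch below assumes a linear SVM whose hyperplane is already known; the weights and bias are hypothetical values chosen for illustration, and the training step that selects the support vectors is not shown.

```python
def svm_decision(weights, bias, point):
    """Signed score of a point relative to the hyperplane w . x + b = 0."""
    return sum(w * x for w, x in zip(weights, point)) + bias


def svm_classify(weights, bias, point):
    """Return +1 for points on or above the hyperplane and -1 for points below."""
    return 1 if svm_decision(weights, bias, point) >= 0 else -1


# Hypothetical hyperplane in two dimensions: x1 + x2 - 3 = 0.
weights = [1.0, 1.0]
bias = -3.0

print(svm_classify(weights, bias, [1.0, 1.0]))
print(svm_classify(weights, bias, [3.0, 2.0]))
```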
Limitations of SVM:
1. The SVM algorithm is not well suited to large data sets.
2. SVM does not perform very well when the data set is noisy, i.e. when the target
classes overlap.
3. In cases where the number of features for each data point exceeds the number of
training data samples, the SVM will underperform.
4. As the support vector classifier works by placing data points above and below the
classifying hyperplane, there is no probabilistic explanation for the classification.
3.4.6 KNN
The KNN algorithm is a machine learning algorithm that is considered a lazy learning
algorithm, with a low training cost and a very simple implementation. It supports
classification and regression problems. It stores the entire training dataset and, when
making a prediction, queries it to locate the k data points in the training set that are
most similar to the data point to be classified. Therefore, there is no model other than
the raw training dataset, and the only computation performed is the querying of the
training dataset.
When the KNN method is used for regression, the response value is calculated as a weighted
sum of the responses of the k neighbors, where each weight is inversely proportional to the
distance from the input record.
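Both uses of KNN described above can be sketched as follows. Several details are assumptions made for illustration: Euclidean distance as the similarity measure, k=3, a small constant added to each distance to avoid division by zero, and normalization of the weights so that the regression result is a weighted average. The data points are made up.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(training, query, k):
    """Return the k (distance, target) pairs closest to the query point."""
    distances = [(euclidean(features, query), target) for features, target in training]
    return sorted(distances, key=lambda pair: pair[0])[:k]


def knn_classify(training, query, k=3):
    """Predict the most common class among the k nearest neighbors."""
    neighbors = k_nearest(training, query, k)
    return Counter(target for _, target in neighbors).most_common(1)[0][0]


def knn_regress(training, query, k=3, eps=1e-9):
    """Predict a distance-weighted average of the k nearest responses."""
    neighbors = k_nearest(training, query, k)
    weights = [1.0 / (distance + eps) for distance, _ in neighbors]
    weighted_sum = sum(w * target for w, (_, target) in zip(weights, neighbors))
    return weighted_sum / sum(weights)


# Made-up points in two categories, A and B.
labeled = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B"),
]
category = knn_classify(labeled, (2.0, 2.0))

# Made-up one-dimensional regression data.
numeric = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0)]
estimate = knn_regress(numeric, (2.0,))
print(category, estimate)
```

Every prediction scans the whole training set, which is why prediction becomes slow and memory-hungry as the dataset grows.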
Suppose there are two categories, Category A and Category B, and we have a new data
point x1. To decide which of these categories the point belongs to, we can use the K-NN
algorithm. With the help of K-NN, we can easily identify the category or class of a
particular data point. Consider the diagram below:
Figure 3.4.6.1: Example of KNN
Limitations of KNN:
1. Accuracy depends on the quality of the data.
2. With large data, the prediction stage might be slow.
3. Sensitive to the scale of the data and irrelevant features.
4. Requires high memory, since all of the training data must be stored.
5. Because every prediction searches all of the stored training data, it can be
computationally expensive.
Result and Discussion
The dataset has 9 attributes and 768 instances. All patients are females at least 21 years
old of Pima Indian heritage. From the 768 patients in the PID dataset, the classification
algorithms used a training set of 576 patients and a testing set of 192 patients.
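A split of this size corresponds to holding out 25% of the records for testing. The sketch below shows one way such a split could be produced; the shuffling, the fixed seed and the function name are assumptions for illustration, and the integers merely stand in for patient records.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=42):
    """Shuffle the records and hold out a fraction of them for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the 768 patient records.
patients = list(range(768))
train, test = train_test_split(patients)
print(len(train), len(test))
```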
4.1 Performance Measures
To compute performance metrics such as sensitivity, specificity and accuracy, a confusion
matrix is obtained from the classification results of these algorithms. A confusion
matrix is a matrix representation of the classification results, as shown in Table
4.1.1.
Table 4.1.1: Confusion Matrix
                     Classified as Healthy   Classified as not Healthy
Actual Healthy       TP                      FN
Actual not Healthy   FP                      TN
Accuracy is the percentage of predictions that are correct. Precision is the
measure of accuracy provided that a specific class has been predicted. Sensitivity is the
percentage of positive labeled instances that were predicted as positive, and specificity is
the percentage of negative labeled instances that were predicted as negative. These
performance criteria for the classifiers in disease detection are evaluated from the
confusion matrix as follows.
Accuracy = (TP+TN) / (TP+FP+TN+FN)
Sensitivity = TP / (TP+FN)
Specificity = TN / (FP+TN)
Positive Prediction = TP / (TP+FP)
Negative Prediction = TN / (TN+FN)
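These measures can be computed directly from the four cells of the confusion matrix. In the sketch below the counts are hypothetical, and the positive and negative prediction values use the standard definitions TP / (TP+FP) and TN / (TN+FN).

```python
def confusion_metrics(tp, fn, fp, tn):
    """Compute the performance measures from the confusion matrix cells."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "positive_prediction": tp / (tp + fp),
        "negative_prediction": tn / (tn + fn),
    }


# Hypothetical counts for 100 classified instances (not results of this study).
metrics = confusion_metrics(tp=50, fn=10, fp=5, tn=35)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```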
4.2 Results
4.2.1 Accuracy Measures:
Naive Bayes, SVM, Logistic Regression, Random Forest, KNN and Decision Tree
algorithms are used in this research work. Experiments are performed using 10-fold
cross-validation. Accuracy, F-Measure, Recall and Precision are used to evaluate the
classifiers in this work. Table 4.2.1.1 defines these measures:
Table 4.2.1.1: Accuracy Measures
Measures          Definitions                                                                 Formula
1. Accuracy (A)   Accuracy determines the accuracy of the algorithm in predicting instances   A = (TP+TN) / (Total no of samples)
2. Precision (P)  Classifier's correctness/accuracy is measured by Precision                  P = TP / (TP+FP)
3. Recall (R)     To measure the classifier's completeness or sensitivity, Recall is used     R = TP / (TP+FN)
4. F-Measure      F-Measure is the weighted average of precision and recall                   F = 2*(P*R) / (P+R)
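The 10-fold cross-validation mentioned in this section partitions the data into ten folds and uses each fold once for testing while the remaining nine are used for training. The sketch below shows how the index sets could be generated; it is a simplified illustration that assumes contiguous, unshuffled folds and is not the procedure implemented by the tool used in this research.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(768, k=10))
for number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(number, len(train_idx), len(test_idx))
```

The accuracy reported for a classifier is then typically the average of the accuracies measured on the ten test folds.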
Accuracy of each algorithm:
1. Naive Bayes: 71.42%
2. Decision tree: 68.18%
3. Random Forest: 75.97%
4. K Nearest neighbors: 78.57%
5. Support Vector Machine: 73.37%
6. Logistic Regression: 71.42%
We can see that the KNN algorithm has the highest accuracy, 78.57%.
Table 4.2.1.2: Classification report of KNN algorithm
Class   Precision   Recall   F-Measure   Accuracy %
0.0     0.81        0.87     0.84        78.57
1.0     0.72        0.63     0.67
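As a consistency check, the F-Measure column can be recomputed from the precision and recall values in the table using the formula F = 2*(P*R) / (P+R). The short sketch below does this; the function name is chosen here for illustration.

```python
def f_measure(precision, recall):
    """F = 2 * (P * R) / (P + R), defined as 0 when both inputs are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall per class, as reported for KNN.
report = {"0.0": (0.81, 0.87), "1.0": (0.72, 0.63)}
recomputed = {label: round(f_measure(p, r), 2) for label, (p, r) in report.items()}
print(recomputed)  # {'0.0': 0.84, '1.0': 0.67}
```

Both recomputed values agree with the reported F-Measure entries.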
The performance of the KNN classifier in terms of Accuracy, Precision, F-measure and
Recall is listed in Table 4.2.1.2, and classifier performance on the basis of classified
instances is shown in Table 4.2.1.3. Here, TP denotes True Positive, TN denotes True
Negative, FP denotes False Positive and FN denotes False Negative.
Comparing the accuracies listed above, KNN shows the maximum accuracy. So, the KNN machine
learning classifier can predict the chances of diabetes with more accuracy than the other
classifiers considered.
Conclusion
Various data mining techniques and their applications were studied and reviewed.
Different algorithms were applied to find the best accuracy for diabetes prediction. In our
case, KNN provided the highest accuracy.
In this study, we used the diabetic patient follow-up data. We combined feature
selection and imbalanced processing techniques. In this work, we offered evidence that the
KNN algorithm can be successfully used for diabetes prediction.
The main aim of this project was to design and implement diabetes prediction
methods and to analyze their performance, and it has been achieved successfully.
The proposed approach uses various classification and ensemble learning methods, namely the
Naive Bayes, SVM, KNN, Random Forest, Decision Tree and Logistic Regression classifiers.
KNN achieved an accuracy of 78.57%. The experimental results can help health care providers
make early predictions and early decisions to treat diabetes and save lives.
The ability of our model to predict patients with diabetes using some commonly used
lab results is high, with satisfactory sensitivity. These models can be built into an online
computer program to help physicians predict the future occurrence of diabetes in patients
and provide the necessary preventive interventions. The model is developed and validated on
the Bangladeshi population, which makes it more specific and powerful when applied to
Bangladeshi patients than existing models. Fasting blood glucose, body mass index, age and
insulin were the most important predictors in these models.

The goal of the work is to reduce the false positive and false negative rates as much
as possible, so as to boost the precision and recall rates. KNN is utilized for diabetes
prediction owing to its fast learning capability. The performance of the work is analyzed
by varying the classifiers and testing against existing techniques. The experimental results
demonstrate the efficacy of the proposed approach. In future, this work is planned to be
extended so that medical images are also processed.
Lately, medical data mining has gained interest in the scientific and research
communities. Diabetes is considered the world's fastest-growing chronic disease. It needs
continuous self-management and control to maintain blood glucose level within the normal
range, in order to prevent complications. Data mining has played an important role in
diabetes research. Data mining would be a valuable asset for diabetes researchers because it
can unearth hidden knowledge from a huge amount of diabetes-related data. We believe that
data mining can significantly help diabetes research and ultimately improve the quality of
health care for diabetes patients. Using data mining to deal with the avalanche of clinical data
collected from patients and generated from the research and management of diabetes is a
valuable asset that can help researchers and clinicians provide better health care for the
patients affected by this modern-society disease. The present study concludes that elderly
diabetes patients should be given an assessment and a treatment plan that is suited to their
needs and lifestyles. Public health awareness of simple measures such as low sugar diet,
exercise, and avoiding obesity should be promoted by health care providers. In this study,
predictions on the effectiveness of different treatment methods for young and old age groups
were elucidated. The preferential orders of treatment were found to be different for the young
and old age groups. Diet control, weight reduction, exercise and smoking cessation are
mutually beneficial to each other for the treatment of diabetes.
Finally, using all six machine learning algorithms, we measured different
parameters within the dataset and obtained the best accuracy with KNN, at
78.57%. This work can be extended by adding other algorithms that may give
better accuracy than KNN.
