The 7th International Conference for Internet Technology and Secured Transactions (ICITST-2012)

Using Bayes Network for Prediction of Type-2 Diabetes

Yang Guo, Guohua Bai, Yan Hu
School of Computing, Blekinge Institute of Technology, Karlskrona, Sweden
978-1-908320-08/7/$25.00 ©2012 IEEE

Abstract-Diabetes mellitus is a chronic disease and a major public health challenge worldwide. Using data mining methods to help people predict diabetes has become a popular topic. In this paper, a Bayes network is proposed to predict whether patients will develop Type-2 diabetes. The dataset used is the Pima Indians Diabetes Data Set, which contains information on patients with and without Type-2 diabetes. Weka software was used throughout this study. Accurate results have been obtained, which shows that using the proposed Bayes network to predict Type-2 diabetes is effective.

Keywords-Bayes Network; Prediction; Type-2 Diabetes

I. INTRODUCTION
Healthcare information systems tend to capture data in databases for research and analysis in order to assist in making medical decisions. As a result, medical information systems in hospitals and medical institutions become larger and larger, and the process of extracting useful information becomes more difficult. Traditional manual data analysis has become inefficient, and methods for efficient computer-based analysis are needed. To this end, many approaches to computerized data analysis have been considered and examined. Data mining represents a significant advance in the type of analytical tools. It has been shown that the benefits of introducing data mining into medical analysis are increased diagnostic accuracy, reduced costs and savings in human resources [1].

Diabetes mellitus has become a major global public health problem in recent times. According to the International Diabetes Federation, there are currently 246 million diabetic people worldwide, and this number is expected to rise to 380 million by 2025 [2]. Diabetes is a chronic disease in which the body does not produce insulin or does not use it properly. This increases the risk of developing kidney disease, blindness, nerve damage and blood vessel damage, and contributes to heart disease [3]. There are two types of diabetes: one is type-1 diabetes, also called insulin-dependent diabetes, which is usually diagnosed in children and juveniles; the other is type-2 diabetes, which is often diagnosed in middle-aged to elderly people. Patients with type-2 diabetes do not require insulin treatment to remain alive, although up to 20% are treated with insulin to control blood glucose levels. It has been shown that 80% of type-2 diabetes complications can be prevented or delayed by early identification of people at risk [4]. Thus, it is important to develop medical diagnostic decision support systems that can aid middle-aged to elderly people in the self-diagnostic process at home.

The data set used in this paper is taken from the UCI Machine Learning Repository [5]. The original owner of this dataset is the National Institute of Diabetes and Digestive and Kidney Diseases. The instances were selected as follows: all patients are females at least 21 years old of Pima Indian heritage.

II. RELATED WORK
Much research has been conducted in the field of prediction of Type-2 diabetes. In [6], the authors constructed an artificial neural network model for the diagnosis of diabetes. They used certain combinations of preprocessing techniques to handle the missing values and compared the accuracy of the model for each technique; however, the method of handling missing values presented in this paper was not employed in that study. The authors of [7] constructed association rules for the classification of type-2 diabetic patients. They generated 10 association rules to identify whether or not a patient goes on to develop diabetes.

Several machine learning algorithms have been proposed in this context and have been used successfully in some areas. Bayesian networks are powerful tools for knowledge representation and inference under uncertainty, and using Bayes networks as classifiers has been shown to be effective in some domains [8]. In this study, we use a naïve Bayes network to build a decision-making system with which middle-aged to elderly people can perform self-prediction of type-2 diabetes at home.

III. DATA PREPROCESSING
Most of the data sets used in data mining were not necessarily gathered with a specific goal in mind. Some of them may contain errors, outliers or missing values. In order to use those data sets in the data mining process, the data needs to undergo preprocessing, using data cleaning, discretization and data transformation [9]. It has been estimated that data preparation alone accounts for 60% of all the time and effort expended in the entire data mining process [10].

The dataset used in this study is "The Pima Indians Diabetes Dataset". There are 768 instances in this dataset, and all instances have 8 input attributes (X1 to X8) and 1 output attribute (Y). TABLE I shows the attributes of this dataset.

TABLE I. THE ATTRIBUTES OF THE DATASET

Attribute No.  Attribute  Description                                                         Type
X1             PREGNANT   Number of times pregnant                                            Numeric
X2             GTT        Plasma glucose concentration in an oral glucose tolerance test      Numeric
X3             BP         Diastolic blood pressure (mmHg)                                     Numeric
X4             SKIN       Triceps skin fold thickness (mm)                                    Numeric
X5             INSULIN    Serum insulin (μU/ml)                                               Numeric
X6             BMI        Body mass index (kg/m²)                                             Numeric
X7             DPF        Diabetes pedigree function                                          Numeric
X8             AGE        Age of patient (years)                                              Numeric
Y              DIABETES   Diabetes diagnosis result ("tested positive", "tested negative")    Nominal
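
As a hedged illustration of TABLE I, the sketch below holds one instance of the dataset in a small Python data class and parses it from a comma-separated line. The names `PimaRecord` and `parse_record`, the assumed column order (X1 to X8 followed by Y) and the textual class labels are assumptions made for this example, not details given in the paper; the sample line is illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PimaRecord:
    """One instance of the dataset, with fields named after TABLE I."""
    pregnant: float   # X1: number of times pregnant
    gtt: float        # X2: plasma glucose concentration (oral glucose tolerance test)
    bp: float         # X3: diastolic blood pressure (mmHg)
    skin: float       # X4: triceps skin fold thickness (mm)
    insulin: float    # X5: serum insulin
    bmi: float        # X6: body mass index
    dpf: float        # X7: diabetes pedigree function
    age: float        # X8: age of patient (years)
    diabetes: bool    # Y: True for "tested positive", False for "tested negative"


def parse_record(line: str) -> PimaRecord:
    """Parse 'x1,...,x8,label' (assumed column order) into a PimaRecord."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    *inputs, label = fields
    if label not in ("tested positive", "tested negative"):
        raise ValueError(f"unknown class label: {label!r}")
    return PimaRecord(*(float(v) for v in inputs),
                      diabetes=(label == "tested positive"))


# Illustrative line, not taken from the paper.
example = parse_record("2, 120, 70, 20, 80, 32.0, 0.5, 33, tested negative")
print(example.bmi, example.diabetes)
```

Keeping the eight inputs numeric and the output as a two-valued label mirrors the "Numeric" and "Nominal" types listed in the table.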
A. Data Normalization

TABLE II. MEAN AND STANDARD DEVIATION OF THE ATTRIBUTES BEFORE NORMALIZATION

Attribute Number  Mean   Standard Deviation
X1                3.8    3.4
X2                120.9  32.0
X3                69.1   19.4
X4                20.5   16.0
X5                79.8   115.2
X6                32.0   7.9
X7                0.5    0.3
X8                33.2   11.8

In this paper, we use the Min-Max normalization model to transform each attribute's values to a new range, 0 to 1. The formula used to normalize attribute X is as follows:

$$X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \qquad (1)$$

Based on formula (1), the normalization process is performed on the data to overcome this problem and to get a better result, shown in TABLE III.

TABLE III. RESULTS OF THE NORMALIZATION PROCESS PERFORMED ON THE DATA

Attribute Number  Mean   Standard Deviation
X1                0.380  0.340
X2                0.120  0.032
X3                0.691  0.194
X4                0.205  0.160
X5                0.080  0.115
X6                0.320  0.079
X7                0.500  0.300
X8                0.332  0.118

B. Numerical data discretization
All of these attributes are numeric; in order to present the standard conditional probability tables of Bayes belief networks, discrete attributes are needed. First, make each attribute binary according
to high values and low values, and then fit a numerical probability distribution for each node.

Weka is a collection of machine learning algorithms for data mining tasks. It contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization [11]. First, we used Weka's 'weka.filters.Discretize' method to transform the attributes into binary variables. But the result was strange: for most attributes, one side of the split contained only a small percentage (less than 10%) of the samples; this is not useful because over 33% of the samples are positive. The reason is that the Weka filter uses information gain, which often favors highly pure small splits.

As an alternative, we found the median value of each attribute and divided each attribute 50/50 (or as close as we could). Dividing the values into several bins is a very common method for discretization, but usually more than 2 bins are used. After transformation, the OVERWEIGHT, BMI and SKIN counts are close to each other, which is why we can discard SKIN and BMI in this model.

IV. METHODOLOGY

A. Naïve Bayes Basic Principle
The naïve Bayes classifier [12, 13] is a well-known type of classifier, i.e., a program that assigns a class from a predefined set to an object or case under consideration based on the values of descriptive attributes. Such classifiers do so using a probabilistic approach, i.e., they try to compute conditional class probabilities and then predict the most probable class. The basic principle of naïve Bayes is described as follows:

From a training set of patient data, the marginal probabilities of symptoms P(s_i) and diseases P(d_j), and the conditional probabilities of symptoms given each disease P(s_i | d_j), are calculated by counting frequencies in the data. Given a set of symptoms (S = {s_i}) for a patient, the posterior probability for each diagnosis for the patient is calculated as,

$$P(d_j \mid S) = \frac{P(d_j) \prod_{s_i \in S} P(s_i \mid d_j)}{\prod_{s_i \in S} P(s_i)}$$

Since the denominator $\prod_{s_i \in S} P(s_i)$ is common in the computation of posterior probabilities for all diagnoses, it is dropped and a diagnostic score is computed for each diagnosis as,

$$\mathrm{Score}(d_j) = P(d_j) \prod_{s_i \in S} P(s_i \mid d_j)$$

The conditional probability of symptom s_i given disease d_j is,

$$P(s_i \mid d_j) = \frac{f(s_i \cap d_j)}{f(d_j)}$$

where f(d_j) is the number of patients in the dataset with disease d_j and f(s_i ∩ d_j) is the frequency count of patients with both s_i and d_j.

When a Bayesian belief network is applied to the classification problem, one of the most effective classifiers is the so-called naïve Bayesian classifier. When represented as a Bayes network, it has the simple structure shown in Fig. 1. This classifier learns from observed data the conditional probability of each variable S_j given the class label S. Classification is then done by applying Bayes' rule to compute the probability P(S | S_1, ..., S_n) and then predicting the class with the highest posterior probability. This computation is feasible by making the strong assumption that the variables S_j are conditionally independent given the value of the class S.

Figure 1. Naive Bayes Classifier

B. Bayes Network Construction
Based on an investigation of the knowledge about diabetes, we can create the Bayes network structure. First, it is certain that diabetes can be caused by PREGNANT (number of times pregnant), AGE, and DPF (diabetes pedigree function). An interesting part of the dataset is that it has two
measures related to being overweight: SKIN (triceps skin fold thickness) and BMI (body mass index). These measurements do not cause overweight; instead, being overweight causes these measurements to be high. So we can assume that "overweight" is a hidden variable in the network. After further examination, skin fold thickness looked like very poor evidence for diabetes, so we used body mass index as the value of overweight. The GTT (plasma glucose concentration) and the INSULIN (serum insulin) measurements are both tests for diabetes, which means that diabetes causes these.

For BP (blood pressure), there is some debate about whether diabetes is a cause of high blood pressure. In the literature we consulted, no one mentioned blood pressure causing diabetes, so the causal link from diabetes to BP was drawn. However, all sorts of things affect blood pressure, including pregnancy, age, and overweight. To keep the network presentable and just to illustrate a point or two, a link from overweight to blood pressure was drawn. Probably a couple of others should be added.

Based on the analysis above, the Bayes network is built in Fig. 2:

Figure 2. Bayes Network

IV. EXPERIMENTATION RESULTS AND ANALYSIS
Based on the Bayes network in Fig. 2, we use Weka to simulate the dataset and fill in the conditional probability tables. However, with only 768 samples, one should expect some inaccuracy, especially for the diabetes table, which has 32 entries. TABLE IV to TABLE X show all the conditional probability tables and values:

TABLE IV. CONDITIONAL PROBABILITY TABLES AND VALUES
PREGNANT=HIGH
PREGNANT=LOW

TABLE V.
AGE=HIGH  372/768
AGE=LOW   396/768

TABLE VI.
DPF=HIGH  384/768
DPF=LOW   384/768

TABLE VII.
OVERWEIGHT=HIGH
OVERWEIGHT=LOW

TABLE VIII.
               PABO   PABO   PABO   PABO   PABO   PABO   PABO   PABO
Test positive  58/83  27/69  35/74  27/76  11/25  6/20   7/28   11/44
Test negative  25/83  42/69  39/74  49/76  14/25  14/20  21/28  33/44

               PABO   PABO   PABO   PABO   PABO   PABO   PABO   PABO
Test positive  15/23  6/16   11/20  5/11   27/74  5/74   15/55  2/76
Test negative  8/23   10/16  9/20   6/11   47/74  69/74  40/55  74/76

TABLE IX.
           Test positive  Test negative
GTT=HIGH   206/268        182/500
GTT=LOW    62/268         318/500

TABLE X.
              Test positive  Test negative
INSULIN=HIGH  128/268        256/500
INSULIN=LOW   140/268        244/500

In this paper, the leave-one-out method was used to evaluate the proposed naïve Bayes network. It should be noted that some bias has been added because the samples were used to make some of our decisions about how to transform the samples and how to structure the network. But this should give some idea of whether a Bayes network might be useful for this domain.

After gathering all the probabilities from all the samples, we implement leave-one-out by looping over the samples again. For each sample, first the
sample is subtracted from the counts, then the network is used to classify the sample, and finally the sample is added back into the counts.

Using leave-one-out evaluation, the accuracy of the proposed Bayes network and that of Weka's naïve Bayes network are compared. TABLE XI shows the results comparing the proposed Bayes network to naïve Bayes:

TABLE XI. RESULTS COMPARING PROPOSED BAYES NETWORK TO NAIVE BAYES

Method                  Accuracy
Proposed Bayes Network  555/768 = 72.3%
Naïve Bayes Network     549/768 = 71.5%

From TABLE XI, the proposed Bayes network is more accurate than the naïve Bayes network. The proposed Bayes belief network model is promising for this domain.

V. DISCUSSION AND CONCLUSION
The discovery of knowledge from medical databases is important in order to make effective medical diagnoses. The aim of data mining is to extract knowledge from information stored in databases and to generate clear and understandable descriptions of patterns.

This study aimed at the discovery of a Bayes network model for the prediction of type-2 diabetes. The dataset used was the Pima Indian diabetes dataset. Pre-processing was used to improve the quality of the data. The pre-processing techniques applied were attribute identification and selection, data normalization, and numerical discretization. Next, a classifier was applied to the modified dataset to construct the naïve Bayes model. Finally, Weka was used to do the simulation, and the accuracy of the resulting model was 72.3%.

There are some limitations of this study. Firstly, considering the Pima Indian diabetes dataset, there might be other risk factors that the data collection did not consider. According to [14], other important factors include gestational diabetes, family history, metabolic syndrome, smoking, inactive lifestyles, certain dietary patterns, etc. A proper prediction model would need more data gathering to make it more accurate. This can be achieved by collecting diabetes datasets from multiple sources and generating a model from each dataset. Secondly, in this study we only use a Bayes network to predict diabetes. Considering the uncertainty of some diabetes attributes, in future work the fuzzy set method will be introduced to improve the Bayes network's predictions. Also, in order to find the best prediction model, other machine learning methods such as neural networks will be tested to compare the prediction results.

REFERENCES
[1] Marjan Khajehei, Faried Etemady, "Data Mining and Medical Research Studies," cimsim, pp. 119-122, 2010 Second International Conference on Computational Intelligence, Modelling and Simulation, 2010
[2] International Diabetes Federation, Diabetes Atlas, 3rd ed. Brussels, Belgium: International Diabetes Federation, 2007
[3] R. Bellazzi, "Telemedicine and diabetes management: Current challenges and future research directions," J. Diabetes Sci. Technol., vol. 2, no. 1, pp. 98-104, 2008
[4] J. C. Pickup, G. Williams (Eds.), Textbook of Diabetes, Blackwell Science, Oxford
[5] http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes, Irvine, CA: University of California, School of Information and Computer Science
[6] Al Jarullah, A. A., "Decision Tree Discovery for the Diagnosis of Type II Diabetes," International Conference on Innovations in Information Technology (IIT), 2011.
[7] Patil, B. M.; Joshi, R. C.; Toshniwal, D., "Association Rule for Classification of Type-2 Diabetic Patients," Machine Learning and Computing (ICMLC), 2010 Second International Conference on, pp. 330-334, 9-11 Feb. 2010
[8] Friedman N, Linial M, Nachman I, Pe'er D (2000) Using Bayesian networks to analyze expression data. Journal of Computational Biology: a journal of computational molecular cell biology 7: 601-620.
[9] Larose, D. T. (2006) Data Mining Methods and Models, Hoboken: John Wiley & Sons, Inc.
[10] Pyle, D. (1999) Data Preparation for Data Mining, San Francisco: Morgan Kaufmann
[11] http://www.cs.waikato.ac.nz/ml/weka/
[12] P. Langley, W. Iba, and K. Thompson. An Analysis of Bayesian Classifiers. Proc. 10th Nat. Conf. on Artificial Intelligence (AAAI'92, San Jose, CA, USA), 223-228. AAAI Press and MIT Press, Menlo Park and Cambridge, CA, USA 1992
[13] P. Langley and S. Sage. Induction of Selective Bayesian Classifiers. Proc. 10th Conf. on Uncertainty in Artificial Intelligence (UAI'94, Seattle, WA, USA), 399-406. Morgan Kaufmann, San Mateo, CA, USA 1994
[14] Seibel, J. A. (2007) Diabetes Guide, WebMD, http://diabetes.webmd.com/guide/oral-glucose-tolerance-test.
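
The evaluation loop described in the experimentation section (binarize each attribute at its median, collect frequency counts, then for every sample subtract it from the counts, classify it, and add it back) can be sketched in Python as follows. This is a minimal naïve Bayes illustration rather than the authors' network or their Weka setup; the function names, the add-one smoothing and the tiny synthetic data set are assumptions introduced for the example.

```python
import math
from statistics import median


def binarize(rows):
    """Replace each numeric column by 1 (HIGH) if above its median, else 0 (LOW)."""
    cutoffs = [median(col) for col in zip(*rows)]
    return [[int(v > c) for v, c in zip(row, cutoffs)] for row in rows]


def update(counts, x, y, delta):
    """Add (delta=+1) or subtract (delta=-1) one sample from the frequency counts."""
    counts["class"][y] += delta
    for i, v in enumerate(x):
        counts["feat"][y][i][v] += delta


def build_counts(X, Y):
    """Count class frequencies and per-class attribute-value frequencies."""
    n = len(X[0])
    counts = {"class": [0, 0],
              "feat": [[[0, 0] for _ in range(n)] for _ in range(2)]}
    for x, y in zip(X, Y):
        update(counts, x, y, +1)
    return counts


def classify(counts, x):
    """Return the class with the highest score P(d) * prod_i P(s_i | d)."""
    total = sum(counts["class"])
    scores = []
    for y in (0, 1):
        # Add-one smoothing (an assumption) avoids log(0) when a count is empty.
        score = math.log((counts["class"][y] + 1) / (total + 2))
        for i, v in enumerate(x):
            score += math.log((counts["feat"][y][i][v] + 1)
                              / (counts["class"][y] + 2))
        scores.append(score)
    return 0 if scores[0] >= scores[1] else 1


def leave_one_out_accuracy(X, Y):
    """Leave-one-out accuracy computed by editing the counts in place."""
    counts = build_counts(X, Y)
    correct = 0
    for x, y in zip(X, Y):
        update(counts, x, y, -1)              # subtract the sample from the counts
        correct += classify(counts, x) == y   # classify with the remaining counts
        update(counts, x, y, +1)              # add the sample back
    return correct / len(X)


# Synthetic toy data (not the Pima data): two numeric attributes, binary label.
raw = [[1, 90], [2, 95], [1, 100], [3, 110], [6, 150], [7, 160], [8, 170], [9, 180]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
X = binarize(raw)
print(leave_one_out_accuracy(X, labels))
```

Subtracting and re-adding one sample keeps each leave-one-out step proportional to the number of attributes, so the counts never have to be rebuilt from scratch for every held-out sample.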
