by
Ritesh Jain
Master of Science
May 2015
Copyright 2015 by Ritesh Jain
ACKNOWLEDGEMENTS
It is a pleasure to thank the many people who made this thesis a success. I am indebted to
my advisors Dr. Ankur Agarwal and Dr. Ravi Behara for giving me this wonderful
enthusiasm, inspiration and great efforts to explain things clearly and in a simple way
I would like to thank Dr. Vinaya Rao, M. D., Methodist University Hospital Transplant
Institute, Memphis, TN, USA for sharing her expertise and providing valuable guidance
to validate my research.
I would like to thank my committee members Dr. Xingquan Zhu and Dr. Hari Kalva for
their valuable comments, suggestions and input to the thesis. Thanks a lot for your
I would like to thank my parents Mr. Mahesh Jain and Mrs Sumitra Jain for believing in
me. My sincere thanks also go to my brother Mr. Hitesh Jain and my Sister-in-Law Ms.
ABSTRACT
Year: 2015
Chronic diseases are the major cause of mortality around the world, accounting for 7 out
of 10 deaths each year in the United States. Because of their adverse effect on the quality of
life, they have become a major problem globally. The health care costs involved in managing
these diseases are also very high. In this thesis, we focus on two major chronic
diseases, Asthma and Diabetes, which are among the leading causes of mortality around
the globe. The work involves the design and development of a predictive analytics based decision
support system which uses five supervised machine learning algorithms to predict the
occurrence of Asthma and Diabetes. This system helps in controlling the disease well in
advance by selecting its best indicators and providing necessary feedback. Based on
several risk factors such as blood pressure, BMI, age, ethnicity, smoking status etc., the
system would be able to predict the vulnerability of a person to a particular disease,
which helps in taking necessary action to avoid the disease well in advance.
PREDICTIVE MODELING FOR CHRONIC CONDITIONS
1 INTRODUCTION
2 RELATED WORK
2.5 An Evolutionary two-objective genetic algorithm for asthma prediction
Study(PROCAM)
3.5.4 Logistic Regression
6.2.1 Confusion Matrix
REFERENCES
LIST OF TABLES
Table 4.1 Unweighted response rates for NHANES 2011-2012 survey by Age and Gender
Table 4.20 Laboratory variables for diabetes prediction
Table 6.1 Asthma Results in terms of Accuracy, RMSE and ROC area for all the 5 models
Table 6.2 Confusion Matrix of Naive Bayes Classifier for Asthma Prediction
Table 6.3 Confusion Matrix of BayesNet Classifier for Asthma Prediction
Table 6.4 Confusion Matrix of MLP Classifier for Asthma Prediction
Table 6.5 Confusion Matrix of Logistic Classifier for Asthma Prediction
Table 6.6 Confusion Matrix of J48 decision tree Classifier for Asthma Prediction
Table 6.7 ROC curves obtained from WEKA for Asthma
Table 6.8 Diabetes Results in terms of Accuracy, RMSE and ROC area for all the 5 models
Table 6.9 Confusion Matrix of Naive Bayes Classifier for Diabetes Prediction
Table 6.10 Confusion Matrix of BayesNet Classifier for Diabetes Prediction
Table 6.11 Confusion Matrix of MLP Classifier for Diabetes Prediction
Table 6.12 Confusion Matrix of Logistic Classifier for Diabetes Prediction
Table 6.13 Confusion Matrix of J48 Classifier for Diabetes Prediction
LIST OF FIGURES
Figure 5.11 Asthma Calculator Page4
1 INTRODUCTION
1.1 Motivation
A chronic condition is a health condition or disease that is persistent and whose effects are
long lasting. It has a major adverse effect on the quality of life of the affected
individual. Diabetes, Asthma, Cancer, COPD, CKD and Heart Disease are some of
the major chronic conditions the world is facing today. Chronic
diseases are the major cause of mortality, accounting for 7 out of 10 deaths in the US.
According to the Centers for Disease Control and Prevention (CDC), about half of all adults, i.e.
almost 117 million people in the United States, have one or more chronic health conditions.
According to a World Health Organization report [1], chronic diseases accounted for 35 million
of the 58 million deaths in 2005. They are currently the major cause of
Chronic conditions are critical not only because of their high mortality rate but also
because of the costs associated with them. The majority of US economic and health care costs
are attributable to chronic diseases and the
associated health risk behaviors. A CDC survey shows that the total costs of heart disease
and stroke in 2010 were estimated to be $315.4 billion, the costs of diabetes were
estimated to be $245 billion and cancer care costs were estimated to be $157 billion.
As a result of the above factors, there is a need for a system which can help manage
chronic conditions in an individual well before their onset. In this research we have
developed a predictive analytics based clinical decision support system which can help an
individual to better manage chronic conditions. This system investigates the state of
being unwell by focusing on two major chronic diseases. Asthma is caused by
inflammation of the airways, the small tubes called bronchi which carry air
in and out of the lungs [2]. Diabetes is caused by an imbalance in the secretion of
insulin, resulting in a disturbance in the sugar levels of the blood [3]; it also increases
the risk of developing kidney diseases, heart diseases, blindness etc. [3].
million people are suffering from Asthma, the annual economic cost including direct and
indirect cost amount to more than $56 billion annually. CDC survey shows that every day
According to the American Diabetes Association, 29.1 million people in the United States
had diabetes in 2012, of whom only 21 million were diagnosed. According to the
International Diabetes Federation, there are 246 million diabetic people around the world
Further research showed that Asthma and Diabetes are associated with many risk
factors. Many clinical and non-clinical risk factors might lead to these
conditions, such as age, gender, blood pressure, smoking status etc. As a result, we have
developed a predictive system which can help an individual track their likelihood of
Asthma and the extent of Diabetes based on a list of clinical and non-clinical factors. This
system also provides necessary feedback which can reduce the chances of having a
Chronic diseases such as Asthma and Diabetes are a major cause of high mortality and
morbidity rates all around the world; millions of people are diagnosed with one or more
chronic conditions every year. These are among the most common health problems the
world is facing today. The overall economic costs of these diseases are also very
high. Health risk behaviors are the major factors associated with chronic
conditions. They are unhealthy behaviors that can be changed over time if proper
efforts are made to control or manage them. Following are the four
2. Poor Nutrition
4. Excessive smoking
According to CDC surveys, cigarette smoking is responsible for more than 480,000
deaths every year and drinking too much alcohol causes 88,000 deaths each year. The surveys
also show that more than 50% of adults aged 18 years or older did not meet the expected
duration and level of physical activity. Hence because of improper management of these
1.3 Contribution
1. The main objective of the research is to develop a Predictive System that can
2. A web based application to make use of the system in a user friendly manner.
4. To provide necessary feedbacks to improve overall health and reduce the risk of
5. The efforts are also made to perform analysis on the data taken from NHANES
2011-2012 survey.
6. Utilize the Java-based data mining software tool Weka for classification,
1.4 Organization
This thesis is divided into 7 major chapters. Chapter 1 explains the research problem and
conveys to the reader the importance and magnitude of the problem. It also
solve health problems. It provides an overview of the methodology used by these systems
Chapter 3 provides an overview of the research methodology used in order to solve the
Chapter 4 provides a detailed description of NHANES data set, the data set used to train
the prediction models. It also provides an overview of the risk factors chosen to predict
Chapter 5 defines the system architecture in detail. It also gives a detailed description of
Chapter 6 discusses all the results obtained from the models used in the research. It
Chapter 7 discusses the conclusion and possible future work in this field of study
Chapter 1 Introduction
2 RELATED WORK
2.1 Introduction
Many studies have applied appropriate technologies to improve
healthcare and its delivery, using predictive modeling in healthcare systems based on
Artificial Neural Networks. In this chapter we discuss several systems that take a
predictive modeling approach to improving overall health, and several systems that predict
the risk of Asthma and Diabetes based on different risk factors. Here, we provide an
overview of the architecture of the systems, data sets used in order to carry out the
Peter Ghavami of Harborview Medical Centre, Seattle and Kailash Kapur of University
patient's physiological health status using an Artificial Neural Network [6]. The Engine
builds a model based on historical clinical data which is collected from different human
Systems etc. Their system includes PHM, i.e. Prognostics and Health Management, which
provides methods for solving reliability issues; it also permits assessment of the system
under its application scenario in order to determine possible risks and failures.
Figure 2.1 shows the timeline of medical predictions on which the system works:
that provides prediction about a disease, then it goes to marker where the actual causes of
disease are combined to form a detectable level of diseases and at last it goes to the
medical event itself, determining the presence of a disease or its occurrence in the near
future.
2.2.1 System Architecture
Physiological system is represented by i(t), which includes data related to the medical
treatment plan: a set of medications, protocols and procedures suggested
by the physician. The Human Physiological System internally consists of a wide variety
of clinical data, such as lab results and monitored data, represented by X(t). The model
includes a Prognostic Engine which continuously monitors the real-time patient data and
applies mathematical algorithms to develop rules and patterns to make predictions
2) SVM- Support Vector Machine
2.2.3 Data:
The clinical data used by the system to define rules for prediction consists of 468 patient
cases admitted for various physiological treatments. The input data consists of
21 independent variables and 1 dependent variable. The independent variables include data from
different lab results, blood pressure data, heart rate data etc. The dependent
variable represents the clinical outcome, i.e. the absence or presence of a disease.
At the end, results of all models are stored and with the use of an oracle (overseer
The model provides a predictive approach using polynomial neural network with swarm
intelligence. Swarm Intelligence is the human intelligence derived from social interaction
2.3.1 System Architecture:
Figure shows the typical architecture of biologically or socially inspired techniques used
in prediction. The patient’s handheld device includes four major components: DPSO,
CPSO, Polynomial Neural Network and Training systems. The input to the model is the patient’s
physiological parameters such as MMRC scale, BMI, FEV1% and 6-minute walk test.
The prediction model comprises a condensed polynomial neural network; the model then
runs on the data collected and the accuracy of the system is then assessed by statistical
the future based on the patient’s medical history and the histories of other similar patients [8]. It uses
collaborative methods to find the most significant risk factors that can lead to the
disease and generates predictions based on the selected risk factors. An iterative version,
ICARE, was also designed; it uses ensemble concepts to improve the performance of
the system.
2.4.1 Methodology:
In a typical CARE system, the individual on whom the predictions are to be made
is considered the testing patient, and the medical histories of other
patients are considered the training patients. The training patients are
restricted to those who have at least two diseases in common with
the testing patient. Collaborative filtering is then used to generate predictions for the
future disease risks of the patient. In the case of ICARE, i.e. Iterative CARE, this process is
repeated multiple times with different groups of training patients. These results are then
combined to form an ensemble. The results from both CARE and ICARE are then ranked
predict the occurrence of asthma using an Artificial Neural Network and a Genetic
Algorithm [9]. The system predicts asthma risk in children under the age of 5. The Genetic
Algorithm helps in filtering the factors that influence asthma most. In other words, this
algorithm finds the risk factors which make a child more vulnerable to asthma. In a
2.5.1 MLP Pruning by Genetic Algorithm
Patterns were generated to predict asthma with the help of an Artificial Neural Network.
A Multilayer Perceptron, a supervised neural network model, was used to generate these
patterns. A total of 34 prognostic factors were used to predict the disease, ranging from
common symptoms of asthma such as cough, chest pain, runny nose etc. The MLP network
was trained on the data of 112 patients obtained from the Pediatric Department of
Alexandroupolis University Hospital, Greece. The training algorithm used here is the
back propagation algorithm, where the weights are adjusted by back propagating the error
A variety of experiments were performed in MATLAB with the support of the Neural Network
Toolbox for constructing Multilayer Perceptrons and the Genetic Algorithms. The testing
A Genetic Algorithm search was performed to find the most significant risk factors that
can be supplied as input to the Multilayer Perceptron model. The GA search has
two objectives: the first is to minimize the number of prognostic factors
and the second is to enhance the performance of the model based on these
factors.
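
The two objectives can be illustrated with a short sketch. The Python fragment below is only an illustration and is not the implementation used in [9]: the candidate values are made-up placeholders, and `dominates` and `pareto_front` are hypothetical helper names. It assumes each candidate factor subset has already been scored by the number of prognostic factors it keeps and by the error of a model trained on them, and it keeps the candidates that no other candidate beats on both objectives.

```python
from typing import List, Tuple

# Each candidate is (number_of_selected_factors, model_error).
# Both objectives are minimized. The values below are placeholders,
# not results from the cited study.
Candidate = Tuple[int, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and better on one."""
    no_worse = a[0] <= b[0] and a[1] <= b[1]
    strictly_better = a[0] < b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


candidates = [(34, 0.10), (20, 0.10), (12, 0.15), (12, 0.20), (5, 0.40)]
front = pareto_front(candidates)
```

In this toy run, using all 34 factors is discarded because a 20-factor subset reaches the same error, which is the kind of trade-off a two-objective search is meant to expose.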
With the increase in popularity of Cloud Computing, researchers from Bangalore and
Anantapur, India designed and developed a Health Care Monitoring System using a Cloud
Framework [10]. The cloud enables data sharing without geographical limitations. CHMS
collects health data from a variety of sources and publishes them to a cloud based
repository named the Telemedicine Repository (TMR). Once the data is
published to the TMR, the system performs data analysis using services provided by
the cloud and stores the results in the form of health records.
Figure shows the cloud framework of CHMS, it mainly comprises of a data acquisition
Care (EMC) module and a Multi Specialty Hospitals module. Patients are equipped with
data acquisition devices that are capable of collecting patient data such as ECG,
Glucose, Blood Pressure etc. The data is communicated to the TMC over a
communication system such as the internet. The TMC performs data analysis on the received data,
taking into account the patient’s existing data and historic data. The TMC also maintains an
Electronic Health Record (EHR) on the cloud which is accessible to the users at any time.
emergency.
Service (PaaS) layer and the Infrastructure-as-a-Service (IaaS) layer. The SaaS layer lets a
user use the system without dealing with the complexity of the application. The PaaS
layer provides a set of tools to make the system quick and efficient and helps in storing
the EHR of a patient generated by the TMC. The IaaS layer provides virtual datacenter resources such as
servers, networks etc. and provides storage services by virtualizing the resources. It also
helps in making the data available to different users across the globe.
2.7 Modeling Risk Prediction of Diabetes – A Preventive Measure
Bakshi Rohit Prasad et al. [11] proposed a data mining approach for selecting the best
indicators of diabetes and a model to predict diabetes before its onset. It uses a voting
mechanism to select the most suitable classifier model to achieve high accuracy. The
system works in three stages: data pre-processing, class label assignment and construction
of the classifier.
2.7.1 Methodology
The data is taken from the Diabetes dataset in the UCI repository; it contains 9
attributes such as plasma glucose, diastolic BP, BMI, age etc. In the data pre-processing
stage the data is transformed into a form suitable for the subsequent stages. It
uses a k-nearest neighbor approach to deal with missing values, which puts the value
present in the nearest column in terms of Euclidean distance. As a result only 5 attributes
remained to form the dataset. In the next stage a class label is assigned to each patient
record in terms of high risk, medium risk or low risk. It uses a clustering technique to
group the dataset into clusters of high, medium and low risk. The next stage corresponds to
DTC (Decision Tree Classification). Each classifier is trained on the training set
and its accuracy is measured on a test dataset; the vote count of the classifier with the
highest accuracy is incremented by one. The process continues for different test datasets
to find the classifier with the highest vote count, which is then selected for classification
purposes.
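
The voting step can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code of [11]: `select_by_votes` is a hypothetical helper, the accuracies are placeholder numbers, and apart from DTC the classifier names are invented labels because the source text does not preserve them.

```python
from collections import Counter
from typing import Dict, List


def select_by_votes(accuracy_per_test_set: List[Dict[str, float]]) -> str:
    """Return the classifier that wins the most test sets."""
    votes: Counter = Counter()
    for accuracies in accuracy_per_test_set:
        # The most accurate classifier on this test set earns one vote.
        winner = max(accuracies, key=accuracies.get)
        votes[winner] += 1
    return votes.most_common(1)[0][0]


# Placeholder accuracies for three test sets; not taken from [11].
# "classifier_b" and "classifier_c" are hypothetical names.
runs = [
    {"DTC": 0.78, "classifier_b": 0.71, "classifier_c": 0.74},
    {"DTC": 0.73, "classifier_b": 0.75, "classifier_c": 0.70},
    {"DTC": 0.80, "classifier_b": 0.69, "classifier_c": 0.77},
]
chosen = select_by_votes(runs)
```

Counting wins rather than averaging accuracy means a single unusually easy or hard test set cannot decide the outcome on its own.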
2.8 Scoring Scheme based on Prospective Cardiovascular Munster Study(PROCAM)
Gerd Assmann et al [12] proposed a scoring scheme for calculating the risk of acute
Study. The scoring scheme is based on 325 acute coronary events that occurred
within 10 years of follow-up among 5389 middle-aged men who were recruited into the
PROCAM study. Within the 10 years 230 men were lost to follow-up, 218 died, 14 had
suspected coronary death and 46 non-fatal cases occurred. At last, 4493 middle-aged men
To obtain maximum information from the PROCAM study, a risk algorithm based on the
Cox proportional hazards model was constructed. It includes 8 variables which
independently predicted the event risk. The Cox model only allows calculation of relative
risk; hence to convert the relative risk obtained from the Cox model into absolute risk,
To generate the scoring scheme, each risk factor was divided into categories, and
each category was assigned a value obtained from the regression equation
combined with the survival curves and the categories of each risk factor [12]. The
coefficients calculated were then standardized and rounded to obtain the score as
a whole number. The PROCAM algorithm then calculates the risk of a coronary event
associated with each score; the scores are then categorized into very low and very high
PROCAM scores.
2.9 Other Related Works:
A neural network based Structural Health Monitoring System has been proposed [13].
This system uses wireless sensor network where thousands of sensor nodes perform
distributed sensing and collaborative computing for structural health analysis. It uses
system [14] and method for determining and managing an individual portable health
score, this method defines a baseline health score and further adjusting the health score
3 RESEARCH METHODOLOGY
3.1 Introduction
The research design is a framework for predicting the likelihood of Asthma and the extent
of Diabetes in an individual. This chapter first explains the design of the proposed
system and then the concepts of predictive modelling, data mining
and machine learning, including the various machine learning models that have been used to
The primary purpose of this research is to develop a system which can help an individual
keep track of their health with respect to chronic conditions; hence it is important to design the
system so that it is easy to interpret and easy to use. In this study the advantages
of Artificial Neural Networks are harnessed to build a system which is user
friendly. Figure shows the overall design of the system which leads to
Figure 3.1 Proposed System Design
It includes three main components: User, System and Neural Network. Users are
individuals who want to use the services provided by the system. The user is expected to
provide the necessary input to the system; the input consists of a list of parameters including
demographic details, laboratory details, body measures etc. Once the user provides
input, the system uses several neural network models (discussed
later in the chapter) to generate predictions of the likelihood of Asthma and the
extent of Diabetes in a numerical format that is easy to interpret. Once the prediction has
been made, the system suggests necessary feedback in order to manage the disease.
variables and the predictor or independent variables [15]. It helps in predicting the
probability of an outcome when a set of independent variables passes through the model.
3.4 Data Mining
Data Mining is an analytic process for extracting information from a large amount of data. It
is designed to explore data in order to find patterns or relationships between the variables.
It helps in extracting unknown or potentially useful information from data [16]. Data
Mining involves machine learning, artificial intelligence and statistics [17]. The main
goal of data mining is prediction; predictive data mining is the most common data mining
approach and has been used by many studies. The process of data mining is divided into
three main stages. In the first stage, the dataset used to train the model is prepared; this includes
Data Cleaning and Data Pre-processing. The second stage is building the model, which
means identifying patterns based on the prepared dataset. Lastly, in the final stage the
Machine learning is a branch of computer science that consists of algorithms that can
learn from data; it provides a set of methods that can detect patterns in the data and use the
Machine learning is divided into two main types: supervised and unsupervised learning.
Supervised learning is the machine learning technique in which the learning algorithms
make use of labelled data. The main goal is to map a set of input parameters X, which are
also called features or attributes, to an output parameter Y, which is also called the class
label [18]. In this technique the model is trained on the labelled training data and
then generates predictions for unseen cases. In unsupervised learning, the model is
trained on unlabelled data. The main goal of unsupervised learning is to find patterns in
Following diagram shows the supervised learning models used in this research to
[Diagram: Supervised Learning]
Naïve Bayes classifiers are based on Bayes' theorem; they simplify the learning method
by assuming that the features are independent of each other given the class [19]. This
strong assumption is known as the Naïve Bayes Assumption [18]. Let us consider x ∈ X, the
input feature vectors; y ∈ {1,…, c}, the class labels; then the Naïve Bayes Assumption is
given by
$$p(x \mid y = c) = \prod_{i=1}^{D} p(x_i \mid y = c)$$
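
As a rough illustration of this factorization, the Python sketch below multiplies per-feature conditional probabilities to obtain the class-conditional likelihood. It is not the Weka implementation used in this research: the feature names, the probability values and the helper `class_likelihood` are assumptions made up for the example.

```python
from math import prod
from typing import Dict

# Toy per-feature conditional probabilities p(x_i | y = c) for one class c.
# Feature names and values are placeholders, not estimates from NHANES.
cond_prob: Dict[str, Dict[str, float]] = {
    "smoker": {"yes": 0.6, "no": 0.4},
    "bmi_band": {"high": 0.5, "normal": 0.5},
}


def class_likelihood(x: Dict[str, str], table: Dict[str, Dict[str, float]]) -> float:
    """p(x | y = c) under the Naive Bayes assumption: a product over features."""
    return prod(table[feature][value] for feature, value in x.items())


likelihood = class_likelihood({"smoker": "yes", "bmi_band": "high"}, cond_prob)
```

Because each factor depends on a single feature, the table needs one pass over the training data to fill and grows linearly with the number of attributes.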
It is particularly suited to cases where the input comprises a large number of attributes. Naïve
Bayes classifiers are used in several fields such as target marketing, text classification,
Advantages:
2. Naïve Bayes classifiers can be trained quickly in a single scan
Disadvantages:
1. The major disadvantage of Naïve Bayes is the Naïve Bayes Assumption; it can
2. In real world, dependencies exist among the attributes; but these dependencies are
Bayesian Networks are also called Belief Networks; they are probabilistic graphical
models widely used under uncertainty [21]. They provide methods to represent relationships
between the attributes. A network is represented in the form of a directed acyclic graph whose
nodes represent the attributes and whose edges represent the relationships between them. The
Advantages:
interpreted
2. Models are represented in the form of a graph hence which can be interpreted
available
Disadvantages:
2. Because of the acyclic nature of the model, feedback methods cannot be included
in Bayesian Networks.
A Multilayer Perceptron is a feed forward neural network with one or more hidden layers
between input and output. Feed forward means that data flows in one direction,
i.e. from the input nodes towards the output node. This network is trained with a back
propagation learning algorithm. An MLP can distinguish between data that are not
linearly separable. Except for the input nodes, every node applies a non-linear activation
function. The input layer consists of a set of input parameters on which the prediction has to
be based. The hidden layer consists of a set of hidden nodes which help in solving the non-linear
data problem; these nodes convert the input data into a form which can
be used by the output node. Lastly, the output layer consists of an output node with a non-
Figure 3.3 A Multilayer Perceptron Model with three layers
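
To make the forward flow through a three-layer Multilayer Perceptron concrete, the Python sketch below runs a 2-2-1 network on the XOR problem, a standard example of data that is not linearly separable. It is an illustration only: the weights are hand-picked assumptions rather than values learned by back propagation, and `layer` and `predict` are hypothetical helper names.

```python
from math import exp
from typing import List


def sigmoid(z: float) -> float:
    """Non-linear activation applied by hidden and output nodes."""
    return 1.0 / (1.0 + exp(-z))


def layer(inputs: List[float], weights: List[List[float]], biases: List[float]) -> List[float]:
    """One fully connected layer: weighted sum per node, then activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_w, inputs)) + b)
        for node_w, b in zip(weights, biases)
    ]


# Hand-picked weights (an assumption for illustration, not trained values)
# that make a 2-2-1 network approximate XOR.
HIDDEN_W, HIDDEN_B = [[20.0, 20.0], [-20.0, -20.0]], [-10.0, 30.0]
OUTPUT_W, OUTPUT_B = [[20.0, 20.0]], [-30.0]


def predict(x1: float, x2: float) -> float:
    hidden = layer([x1, x2], HIDDEN_W, HIDDEN_B)  # input layer -> hidden layer
    return layer(hidden, OUTPUT_W, OUTPUT_B)[0]   # hidden layer -> output node


outputs = {(a, b): round(predict(a, b)) for a in (0, 1) for b in (0, 1)}
```

The two hidden nodes act roughly as OR and NAND detectors, and the output node combines them; no single linear boundary over the raw inputs could separate the two classes.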
Advantages
Disadvantages
3.5.4 Logistic Regression:
situations where the outcome is categorical. It has become a standard method of analysis
in situations where the outcome variable is discrete, taking two or more possible values
[23]. It provides a reasonable model to describe the relationship between the output
variable and one or more input variables. In most cases, the outcome variable is
dichotomous, i.e. it can take only two values such as yes/no, 0/1 etc.; such logistic
regression models are called Binary Logistic Regression Models. In some cases, the
outcome variable can take more than two values; such models are called Multinomial
Logistic Regression Models.
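
For the binary case, the relationship between the inputs and the outcome can be sketched in a few lines of Python. This is a minimal illustration, not the model fitted in this research: the two inputs, the coefficients and the intercept are placeholder assumptions, and `predict_probability` is a hypothetical helper.

```python
from math import exp
from typing import Sequence


def predict_probability(features: Sequence[float], coefficients: Sequence[float], intercept: float) -> float:
    """Binary logistic model: P(y = 1 | x) = 1 / (1 + exp(-(b0 + b . x)))."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + exp(-linear))


# Placeholder coefficients for two hypothetical inputs (age in decades, BMI / 10).
# They are illustrative assumptions, not values fitted in this research.
p = predict_probability([5.0, 3.0], coefficients=[0.4, 0.5], intercept=-2.5)
label = "yes" if p >= 0.5 else "no"
```

The output is a probability between 0 and 1, and the yes/no label comes from comparing it with a threshold, 0.5 in this sketch.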
Advantages:
Disadvantages:
3.5.5 J48 Decision Tree
Decision tree learning is a learning method which uses trees to represent a predictive
model [24]; the tree consists of leaves that represent class labels and branches that
represent the features or rules that lead to a particular class label. Decision trees are divided into two
categories: classification trees, in which the target variable takes a finite set of values,
and regression trees, in which the target variable takes numerical values. J48,
WEKA's implementation of the C4.5 algorithm, is used to perform decision tree learning. C4.5 is also
known as a statistical classifier. C4.5 generates a decision tree which can be used for
classification.
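
As background that the text above does not spell out, C4.5 chooses the attribute to split on using the gain ratio, i.e. information gain divided by the split information. The Python sketch below computes it for a discrete attribute. It is an illustration under stated assumptions, not J48 itself: the records are toy placeholders and `entropy` and `gain_ratio` are hypothetical helper names.

```python
from collections import Counter
from math import log2
from typing import Dict, List, Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a sequence of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(values: Sequence[str], labels: Sequence[str]) -> float:
    """Information gain of splitting on a discrete attribute, divided by split info."""
    total = len(labels)
    groups: Dict[str, List[str]] = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Weighted entropy of the class labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


# Toy records (placeholders, not NHANES data): smoking status vs. asthma label.
smoker = ["yes", "yes", "no", "no"]
asthma = ["yes", "yes", "no", "no"]
ratio = gain_ratio(smoker, asthma)
```

An attribute that separates the classes perfectly, as in this toy case, scores the maximum ratio, while an attribute unrelated to the class scores zero.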
Advantages:
1. J48 Decision trees can be used for both continuous and discrete attributes.
2. Once the tree is created, it removes unnecessary nodes which do not help in
3. It is easy to implement
Disadvantages:
several conditions
2. If the decision tree consist of too many branches and nodes, the cost and the
4 DATA AND ANALYSIS OF RISK FACTORS
This chapter presents a detailed description of the data used to train the models
proposed in this research. It includes four major components: Data Collection, Variable
The data used in this research to train the models for prediction is collected from National
studies designed to assess the health and nutritional status of adults and children in the
United States”. The survey is conducted by the National Center for Health
Statistics (NCHS), an agency of the United States Federal Statistical System that provides
statistical data to improve the health status of people in America. NCHS is an integral
part of the Centers for Disease Control and Prevention (CDC). The survey is a combination
demographic, diet and other health related questions. Every year about 5,000 persons
located in different states across the country are examined under this survey; it also
keeps track of the changes in their health conditions over time. The data collected from
the survey is used to determine various risk factors for major diseases such as Asthma,
For this research we have used the data collected from NHANES 2011-2012 survey. This
from 30 different locations were selected out of which 9,756 completed the interview
component and 9,338 were examined in order to collect the information related to above
mentioned categories. Below table shows the unweighted response rates for NHANES
Table 4.1 Unweighted response rates for NHANES 2011-2012 survey by Age and
Gender
There are many environmental and socioeconomic factors that may be considered
risk factors for Asthma and Diabetes. This section provides a detailed description of the
risk factors used to predict the likelihood of Asthma and the extent of Diabetes. Data from
Most chronic conditions share common risk factors. While some risk factors such as age,
gender and ethnicity cannot be changed, other behavioural or environmental
risk factors such as alcohol consumption, smoking habits and physical activity can be
changed over time if proper measures are taken. The recognition of such risk factors
4.2.1 Initial Set
In the initial stage we selected the parameters which were relevant to different chronic
conditions; later on we divided the parameters into two sets, one for Asthma Prediction
Table shows the list of parameters selected in the initial stage. It displays the Variable Name,
the unique name given to the parameter in NHANES; the SAS Label, the question
corresponding to the variable; the Data File Name, the name of the file in which the
description of the parameter is stored; and the Doc File, the id of the document file.
Demographic:
Demographic data includes variables that cover the whole society; it helps in putting
people into different categories such as age, gender and race. Table shows the list of
variables that we have included in our research from the demographic section of NHANES.
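Demographics

| Variable Name | SAS Label | Data File Name | Doc File |
|---|---|---|---|
| RIAGENDR | Gender | Demographic Variables and Sample Weights | DEMO_G |
| RIDAGEYR | Age in years at screening | Demographic Variables and Sample Weights | DEMO_G |
| RIDRETH3 | Race/Hispanic origin w/ NH Asian | Demographic Variables and Sample Weights | DEMO_G |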
Examination:
pressure, height, weight, BMI and injuries related questions etc. Table shows the list of
Table 4.3 Variable Considered from Examination
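Examination

| Variable Name | SAS Label | Data File Name | Doc File |
|---|---|---|---|
| BPXSY1 | Systolic: Blood Pres (1st rdg) mm Hg | Blood Pressure | BPX_G |
| BPXDI1 | Diastolic: Blood Pres (1st rdg) mm Hg | Blood Pressure | BPX_G |
| BPXSY2 | Systolic: Blood Pres (2nd rdg) mm Hg | Blood Pressure | BPX_G |
| BPXDI2 | Diastolic: Blood Pres (2nd rdg) mm Hg | Blood Pressure | BPX_G |
| BPXSY3 | Systolic: Blood Pres (3rd rdg) mm Hg | Blood Pressure | BPX_G |
| BPXDI3 | Diastolic: Blood Pres (3rd rdg) mm Hg | Blood Pressure | BPX_G |
| BPXSY4 | Systolic: Blood Pres (4th rdg) mm Hg | Blood Pressure | BPX_G |
| BPXDI4 | Diastolic: Blood Pres (4th rdg) mm Hg | Blood Pressure | BPX_G |
| BMXWT | Weight (kg) | Body Measures | BMX_G |
| BMXHT | Standing Height (cm) | Body Measures | BMX_G |
| BMXBMI | Body Mass Index (kg/m**2) | Body Measures | BMX_G |
| BMXWAIST | Waist Circumference (cm) | Body Measures | BMX_G |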
Laboratory:
It includes variables that correspond to clinical measures. The table shows the list of
Laboratory
Variable Name | SAS Label | Data File Name | Doc File
LBDHDD | Direct HDL-Cholesterol (mg/dL) | HDL-Cholesterol | HDL_G
LBXGLU | Fasting Glucose (mg/dL) | Plasma Fasting Glucose and Insulin | GLU_G
LBXTC | Total Cholesterol (mg/dL) | Total Cholesterol | TCHOL_G
LBXTR | Triglyceride (mg/dL) | Triglycerides and LDL-Cholesterol | TRIGLY_G
LBLDL | LDL-cholesterol (mg/dL) | Triglycerides and LDL-Cholesterol | TRIGLY_G
URXUMS | Albumin, urine (mg/L) | Urinary Albumin and Urinary Creatinine | ALB_CR_G
Questionnaire:
This component consists of a set of questions covering physical activity, alcohol use,
environment-related topics, etc. The table lists the variables selected for this
research.
Questionnaire
Variable Name | SAS Label | Data File Name | Doc File
ALQ101 | Had at least 12 alcohol drinks/1 yr? | Alcohol Use | ALQ_G
ALQ120Q | How often drink alcohol over past 12 months | Alcohol Use | ALQ_G
ALQ130 | Avg # alcoholic drinks/day – past 12 mos | Alcohol Use | ALQ_G
ALQ141Q | # days have 4/5 drinks – past 12 mos | Alcohol Use | ALQ_G
ALQ151 | Ever have 4/5 or more drinks every day? | Alcohol Use | ALQ_G
PAQ605 | Vigorous work activity | Physical Activity | PAQ_G
PAQ620 | Moderate work activity | Physical Activity | PAQ_G
PAQ635 | Walk or bicycle | Physical Activity | PAQ_G
PAQ650 | Vigorous recreational activities | Physical Activity | PAQ_G
PAQ665 | Moderate recreational activities | Physical Activity | PAQ_G
PAD680 | Minutes sedentary activity | Physical Activity | PAQ_G
PAQ710 | Hours watch TV or videos past 30 days | Physical Activity | PAQ_G
PAQ715 | Hours use computer past 30 days | Physical Activity | PAQ_G
SLD010H | How much sleep do you get (hours)? | Sleep Disorders | SLQ_G
SMQ020 | Smoked at least 100 cigarettes in life | Smoking – Cigarette Use | SMQ_G
SMQ030 | Age started smoking cigarettes regularly | Smoking – Cigarette Use | SMQ_G
SMQ040 | Do you now smoke cigarettes? | Smoking – Cigarette Use | SMQ_G
SMQ050Q | How long since quit smoking cigarettes? | Smoking – Cigarette Use | SMQ_G
SMD055 | Age last smoke cigarettes regularly | Smoking – Cigarette Use | SMQ_G
SMD057 | # cigarettes smoked per day when quit | Smoking – Cigarette Use | SMQ_G
SMD641 | # days smoked cigs during past 30 days | Smoking – Cigarette Use | SMQ_G
SMD650 | Avg # cigarettes/day during past 30 days | Smoking – Cigarette Use | SMQ_G
SMD410 | Does anyone smoke inside home? | Smoking – Cigarette Use | SMQ_G
SMD410 | Does anyone smoke inside home? | Smoking Household disorders | SMQFAM_G
SMD415 | Total # of smokers inside home | Smoking Household disorders | SMQFAM_G
SMD415A | Total # cigarette smokers inside home | Smoking Household disorders | SMQFAM_G
SMD430 | Total # cigarettes smoked inside home | Smoking Household disorders | SMQFAM_G
MCQ300a | Close relative had heart attack? | Medical Conditions | MCQ_G
MCQ300b | Close relative had asthma? | Medical Conditions | MCQ_G
MCQ300c | Close relative had diabetes? | Medical Conditions | MCQ_G
OCQ510 | Ever had work exposure to mineral dusts? | Occupation Questionnaire | OCQ_G
OCQ520 | # of years exposed to mineral dusts | Occupation Questionnaire | OCQ_G
OCQ530 | Ever had work exposure to organic dusts? | Occupation Questionnaire | OCQ_G
OCQ540 | # of years exposed to organic dusts | Occupation Questionnaire | OCQ_G
OCQ550 | Ever exposed to exhaust fumes at work? | Occupation Questionnaire | OCQ_G
OCQ560 | # of years exposed to exhaust fumes | Occupation Questionnaire | OCQ_G
OCQ570 | Ever had work exposure to other fumes? | Occupation Questionnaire | OCQ_G
OCQ580 | # of years exposed to other fumes | Occupation Questionnaire | OCQ_G
DIQ010 | Doctor told you have diabetes | Diabetes | DIQ_G
DIQ160 | Ever told you have prediabetes | Diabetes | DIQ_G
MCQ010 | Ever been told you have asthma | Medical Conditions | MCQ_G
In order to build predictive models for Asthma we have considered 40 attributes which
are divided into different categories such as demographics, blood pressure, body
Others.
1. Demographics
Many studies have suggested that demographic details play an important role in
other ethnic groups [25]. Asthma occurs more frequently in boys than in girls during
childhood; in young adults, the prevalence of asthma is the same for males and
females. Females are more likely to have asthma once they cross 40 years of age.
Demographics
Variable | SAS Label | Code or Value | Value Description
RIAGENDR | Gender | 1 | Male
 | | 2 | Female
RIDAGEYR | Age in years at screening | 0 to 79 | Range of Values
 | | 80 | 80 years of age and over
RIDRETH3 | Race/Hispanic origin w/ NH Asian | 1 | Mexican American
 | | 2 | Other Hispanic
 | | 3 | Non-Hispanic White
 | | 4 | Non-Hispanic Black
 | | 6 | Non-Hispanic Asian
 | | 7 | Other Race - Including Multi-Racial
2. Blood Pressure
According to the Asthma and Allergy Foundation of America, most asthma patients are
diagnosed with high blood pressure. NHANES provides blood pressure in four
readings. For our research we have considered the average of all readings while building the
Table 4.7 Blood Pressure variables for asthma prediction
Blood Pressure
Variable | SAS Label | Code or Value | Value Description
BPXSY1 | Systolic: Blood pres (1st rdg) mm Hg | 74 to 238 | Range of Values
BPXDI1 | Diastolic: Blood pres (1st rdg) mm Hg | 0 to 120 | Range of Values
BPXSY2 | Systolic: Blood pres (2nd rdg) mm Hg | 74 to 234 | Range of Values
BPXDI2 | Diastolic: Blood pres (2nd rdg) mm Hg | 0 to 134 | Range of Values
BPXSY3 | Systolic: Blood pres (3rd rdg) mm Hg | 74 to 232 | Range of Values
BPXDI3 | Diastolic: Blood pres (3rd rdg) mm Hg | 0 to 128 | Range of Values
BPXSY4 | Systolic: Blood pres (4th rdg) mm Hg | 78 to 226 | Range of Values
BPXDI4 | Diastolic: Blood pres (4th rdg) mm Hg | 0 to 130 | Range of Values
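The averaging of the four blood pressure readings can be sketched as follows. This is a minimal illustration, not the code used in the thesis: the function name `average_reading`, the sample values, and the choice to skip missing readings are all assumptions, since the text does not say how missing readings were handled.

```python
from statistics import mean
from typing import Optional, Sequence


def average_reading(readings: Sequence[Optional[float]]) -> Optional[float]:
    """Return the mean of the readings that are present.

    Missing readings (None) are skipped; this is an assumption, the
    thesis does not describe its handling of missing readings.
    """
    present = [r for r in readings if r is not None]
    return mean(present) if present else None


# Hypothetical participant whose fourth reading was not taken.
systolic = average_reading([118, 122, 120, None])   # BPXSY1..BPXSY4
diastolic = average_reading([76, 80, 78, None])     # BPXDI1..BPXDI4
print(systolic, diastolic)
```

Averaging collapses the eight raw variables into one systolic and one diastolic feature per person, which keeps a single unusual reading from dominating the input.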
3. Body Measures
[26] suggested that people with a high body mass index are more likely to have asthma
compared to people with a normal body mass index. It has been found that women with high
circumference are more prone to asthma even if they have a normal BMI [27].
Body Measures
Variable | SAS Label | Code or Value | Value Description
BMXBMI | Body Mass Index (kg/m2) | 12.4 to 82.1 | Range of Values
BMXWAIST | Waist Circumference | 38.7 to 176 | Range of Values
4. Physical Activity
Physical activity and exercise play a vital role in a healthy life. Many studies showed
that people with higher physical activity are less likely to have asthma [28]. However, in
certain conditions narrowing of the airways in lungs can also be triggered with highly
Induced Asthma.
Physical Activity
Variable | SAS Label | Code or Value | Value Description
PAQ605 | Vigorous work activity | 1 | Yes
 | | 2 | No
PAQ620 | Moderate work activity | 1 | Yes
 | | 2 | No
PAQ625 | Number of days moderate work | 1 to 7 | Range of Values
PAQ635 | Walk or bicycle | 1 | Yes
 | | 2 | No
PAQ640 | Number of days walk or bicycle | 1 to 7 | Range of Values
PAQ650 | Vigorous recreational activities | 1 | Yes
 | | 2 | No
PAQ665 | Moderate recreational activities | 1 | Yes
 | | 2 | No
PAQ670 | Days moderate recreational activities | 1 to 7 | Range of Values
PAD680 | Minutes sedentary activity | 0 to 1380 | Range of Values
PAQ710 | Hours watch TV or videos past 30 days | 0 | Less than 1 hour
 | | 1 | 1 hour
 | | 2 | 2 hours
 | | 3 | 3 hours
 | | 4 | 4 hours
 | | 5 | 5 hours
 | | 8 | Do not watch TV or Videos
PAQ715 | Hours use computer past 30 days | 0 | Less than 1 hour
 | | 1 | 1 hour
 | | 2 | 2 hours
 | | 3 | 3 hours
 | | 4 | 4 hours
 | | 5 | 5 hours
 | | 8 | Do not watch TV or Videos
5. Alcohol Use
Excessive alcohol intake has long been known to impair the lungs. According to
Alcohol Consumption
Variable | SAS Label | Code or Value | Value Description
ALQ101 | Had at least 12 alcohol drinks/1 yr? | 1 | Yes
 | | 2 | No
ALQ120Q | How often drink alcohol over past 12 mos | 0 to 350 | Range of Values
ALQ130 | Avg # alcoholic drinks/day - past 12 mos | 1 to 82 | Range of Values
ALQ151 | Ever have 4/5 or more drinks every day? | 1 | Yes
 | | 2 | No
6. Smoking
Smoking is a common risk factor for asthma. It is divided into two categories: active
smoking and passive smoking. It has been observed that passive smoking can trigger
symptoms of asthma in individuals suffering from the disease. Many
researchers have suggested that disease control is poorer in patients who smoke
7. Environmental Factors
Asthma. Mineral dust (dust from sand, coal and soil), organic dust (dust from flour,
cotton, animals and plants) and exhaust fumes from engines, machinery, trucks and buses
are found to be major causes of asthma in adults. Environmental factors not only
increase the chances of asthma but also obstruct the disease control process for the
Environmental Factors
Variable | SAS Label | Code or Value | Value Description
OCQ510 | Ever had work exposure to mineral dusts? | 1 | Yes
 | | 2 | No
OCQ520 | # of years exposed to mineral dusts | 0 to 65 | Range of Values
OCQ530 | Ever had work exposure to organic dusts? | 1 | Yes
 | | 2 | No
OCQ550 | Ever exposed to exhaust fumes at work? | 1 | Yes
 | | 2 | No
OCQ570 | Ever had work exposure to other fumes? | 1 | Yes
 | | 2 | No
OCQ580 | # of years exposed to other fumes | 0 to 65 | Range of Values
8. Others
Many psychological and genetic factors are recognized to influence the onset of asthma.
Studies showed that people with asthma feel lonely more often than other
people [31]. Burke W. et al. suggested that the risk of asthma is increased if a positive family
history is found. MCQ010, i.e. “Ever been told you have asthma”, is the class variable we
Table 4.13 Other variables for asthma prediction
Others
Variable | SAS Label | Code or Value | Value Description
MCQ300B | Close Relative had asthma | 1 | Yes
 | | 2 | No
DPQ020 | Feeling Down, depressed or hopeless | 0 | Not at all
 | | 1 | Several Days
 | | 2 | More than half the days
 | | 3 | Nearly every day
MCQ010 | Ever been told you have asthma | 1 | Yes
 | | 2 | No
In order to build models to predict the extent of diabetes we have used data consisting of
33 attributes. The attributes are divided into 8 major categories: Demographics, Blood
Pressure, Body Measures, Physical Activity, Smoking – Cigarette use, Alcohol use,
1. Demographics
Many studies showed that the risk of diabetes increases as a person gets older, especially
after 45 years of age. According to the American Diabetes Association, the risk of diabetes
very high because these populations are more likely to have high blood pressure and high
BMI. In 2012, a CDC survey estimated 86 million prediabetes cases among the population
aged 20 years or older.
Table 4.14 Demographics variables for diabetes prediction
Demographics
Variable | SAS Label | Code or Value | Value Description
RIAGENDR | Gender | 1 | Male
 | | 2 | Female
RIDAGEYR | Age in years at screening | 0 to 79 | Range of Values
 | | 80 | 80 years of age and over
RIDRETH3 | Race/Hispanic origin w/ NH Asian | 1 | Mexican American
 | | 2 | Other Hispanic
 | | 3 | Non-Hispanic White
 | | 4 | Non-Hispanic Black
 | | 6 | Non-Hispanic Asian
 | | 7 | Other Race - Including Multi-Racial
2. Blood Pressure
Hypertension is one of the major factors that can worsen the complications of diabetes.
Most people with diabetes are diagnosed with high blood pressure [32].
Blood Pressure
Variable | SAS Label | Code or Value | Value Description
BPXSY1 | Systolic: Blood pres (1st rdg) mm Hg | 74 to 238 | Range of Values
BPXDI1 | Diastolic: Blood pres (1st rdg) mm Hg | 0 to 120 | Range of Values
BPXSY2 | Systolic: Blood pres (2nd rdg) mm Hg | 74 to 234 | Range of Values
BPXDI2 | Diastolic: Blood pres (2nd rdg) mm Hg | 0 to 134 | Range of Values
BPXSY3 | Systolic: Blood pres (3rd rdg) mm Hg | 74 to 232 | Range of Values
BPXDI3 | Diastolic: Blood pres (3rd rdg) mm Hg | 0 to 128 | Range of Values
BPXSY4 | Systolic: Blood pres (4th rdg) mm Hg | 78 to 226 | Range of Values
BPXDI4 | Diastolic: Blood pres (4th rdg) mm Hg | 0 to 130 | Range of Values
3. Body Measures
Many researchers found that the risk of diabetes increases with BMI [32];
overweight people are more likely to have diabetes compared to their counterparts.
Body Measures
Variable | SAS Label | Code or Value | Value Description
BMXBMI | Body Mass Index (kg/m2) | 12.4 to 82.1 | Range of Values
BMXWAIST | Waist Circumference | 38.7 to 176 | Range of Values
4. Physical Activity
Physical activity helps in controlling blood glucose, HDL cholesterol, blood pressure and
triglycerides, resulting in a lower risk of diabetes. Many risk factors, such as BMI and
waist circumference, are directly related to physical activity, making it one of the major
Physical Activity
Variable | SAS Label | Code or Value | Value Description
PAQ605 | Vigorous work activity | 1 | Yes
 | | 2 | No
PAQ620 | Moderate work activity | 1 | Yes
 | | 2 | No
PAQ625 | Number of days moderate work | 1 to 7 | Range of Values
PAQ635 | Walk or bicycle | 1 | Yes
 | | 2 | No
PAQ650 | Vigorous recreational activities | 1 | Yes
 | | 2 | No
PAQ665 | Moderate recreational activities | 1 | Yes
 | | 2 | No
PAD680 | Minutes sedentary activity | 0 to 1380 | Range of Values
PAQ710 | Hours watch TV or videos past 30 days | 0 | Less than 1 hour
 | | 1 | 1 hour
 | | 2 | 2 hours
 | | 3 | 3 hours
 | | 4 | 4 hours
 | | 5 | 5 hours
 | | 8 | Do not watch TV or Videos
PAQ715 | Hours use computer past 30 days | 0 | Less than 1 hour
 | | 1 | 1 hour
 | | 2 | 2 hours
 | | 3 | 3 hours
 | | 4 | 4 hours
 | | 5 | 5 hours
 | | 8 | Do not watch TV or Videos
5. Smoking – Cigarette Use
Research conducted by Julie C. Will [33] shows that the diabetes rate increases with
smoking. Men who smoked had a 45% higher diabetes rate compared to men who had
never smoked, which makes smoking an important
indicator of diabetes.
6. Alcohol Use
Alcohol consumption has become an important risk factor for diabetes. Many researchers
have found that a moderate intake of alcohol is associated with a reduced risk of diabetes
Alcohol Consumption
Variable | SAS Label | Code or Value | Value Description
ALQ101 | Had at least 12 alcohol drinks/1 yr? | 1 | Yes
 | | 2 | No
ALQ120Q | How often drink alcohol over past 12 mos | 0 to 350 | Range of Values
ALQ130 | Avg # alcoholic drinks/day - past 12 mos | 1 to 82 | Range of Values
ALQ141Q | # days have 4/5 drinks - past 12 mos | 0 to 220 | Range of Values
ALQ151 | Ever have 4/5 or more drinks every day? | 1 | Yes
 | | 2 | No
7. Laboratory
The table lists the clinical variables that can be considered important factors to
Table 4.20 Laboratory variables for diabetes prediction
Laboratory
Variable | SAS Label | Code or Value | Value Description
LBDHDD | Direct HDL-Cholesterol (mg/dL) | 14 to 175 | Range of Values
LBXTR | Triglyceride (mg/dL) | 18 to 1562 | Range of Values
URXUMS | Albumin, urine (mg/L) | 0.21 to 14800 | Range of Values
LBXGLU | Fasting Glucose (mg/dL) | 39 to 382 | Range of Values
LBXIN | Insulin (uU/mL) | 0.14 to 647.5 | Range of Values
8. Others
Prediabetes, in which the blood sugar level is higher than normal but not in the
diabetes range, is an important indicator of diabetes. Genetics also play an important role in
history of the disease have higher chances of developing diabetes compared to other
people. DIQ010, i.e. ‘ever been told you have diabetes’, is the class variable we selected to
Others
Variable | SAS Label | Code or Value | Value Description
MCQ300C | Close Relative had diabetes | 1 | Yes
 | | 2 | No
DIQ160 | Ever told you have prediabetes | 1 | Yes
 | | 2 | No
DIQ010 | Ever been told you have diabetes | 1 | Yes
 | | 2 | No
 | | 3 | Borderline
4.3 Data Pre-Processing
Real-world data is often inconsistent and incomplete, and is likely to contain errors.
In this section we discuss the processing steps taken to prepare the data for better
accuracy. Initially, 9756 records, including infants, children and adolescents, were
selected in the previous stage. Since the main purpose of the research is to develop a
system for adults, we selected the records corresponding to individuals of age 18 or
above; this left us with 5864 records. On further analysis we found that the data set
contained a disproportionately large number of observations with the ‘No’ class value
(MCQ010 in the case of asthma and DIQ010 in the case of diabetes). Hence, to avoid the
problem of overfitting, we deleted records that had the class label ‘No’ and too many
missing values.
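The two filtering steps described above can be sketched as follows. This is only an illustration under stated assumptions: the function `preprocess`, the `MAX_MISSING` threshold and the sample records are hypothetical, since the thesis does not state how many missing values counted as "too many". The class code 2 for ‘No’ follows the NHANES coding listed in the tables of this chapter.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[float]]

MAX_MISSING = 2   # hypothetical threshold; not given in the thesis
CLASS_NO = 2      # NHANES code for 'No' in MCQ010 and DIQ010


def preprocess(records: List[Record], class_var: str) -> List[Record]:
    """Keep adults, then drop 'No' records that have too many missing values."""
    kept = []
    for rec in records:
        age = rec.get("RIDAGEYR")
        if age is None or age < 18:
            continue  # the system is intended for adults only
        missing = sum(1 for key, value in rec.items()
                      if key != class_var and value is None)
        if rec.get(class_var) == CLASS_NO and missing > MAX_MISSING:
            continue  # thin out the over-represented 'No' class
        kept.append(rec)
    return kept


# Hypothetical records, not taken from NHANES.
sample = [
    {"RIDAGEYR": 12, "BMXBMI": 19.0, "BMXWAIST": 60.0, "PAD680": 300, "MCQ010": 2},
    {"RIDAGEYR": 45, "BMXBMI": None, "BMXWAIST": None, "PAD680": None, "MCQ010": 2},
    {"RIDAGEYR": 45, "BMXBMI": None, "BMXWAIST": None, "PAD680": None, "MCQ010": 1},
    {"RIDAGEYR": 30, "BMXBMI": 24.5, "BMXWAIST": 82.0, "PAD680": 240, "MCQ010": 2},
]
adults = preprocess(sample, "MCQ010")
print(len(adults))
```

Incomplete ‘Yes’ records are kept on purpose in this sketch, because only the majority class is being reduced.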
The detailed description of the total number of instances used for Training and Testing
the prediction models for both Asthma and Diabetes is given below:
Asthma:
The training set comprises 1951 instances, including 1135 observations with class label
‘No’ and 816 observations with class label ‘Yes’, whereas the testing set comprises
200 instances, including 143 observations with class label ‘No’ and 57 observations with
class label ‘Yes’.
Diabetes:
The training set comprises 1525 instances, including 780 observations with class label
‘No’, 111 observations with class label ‘Borderline’ and 634 observations with class label
‘Yes’, whereas the testing set comprises 300 instances, including 220 observations with
class label ‘No’, 8 observations with class label ‘Borderline’ and 72 observations with
class label ‘Yes’.
The next step is to convert the processed data into a format that can be used by Weka, the
data mining tool we have used to build models for the prediction. This step transforms the
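As an illustration of the conversion step, the sketch below writes records in ARFF, the plain-text format that Weka reads natively. It assumes that ARFF was the target format, which the text does not state explicitly; the helper `to_arff`, the relation name and the sample rows are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple, Union

# An attribute is (name, "numeric") or (name, [nominal, values]).
Attribute = Tuple[str, Union[str, Sequence[str]]]


def to_arff(relation: str,
            attributes: List[Attribute],
            rows: List[Sequence[Optional[object]]]) -> str:
    """Serialise rows as ARFF text; missing values are written as '?'."""
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append(f"@attribute {name} {kind}")
        else:
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical two-record asthma data set; the class attribute comes last.
arff_text = to_arff(
    "asthma",
    [("RIDAGEYR", "numeric"), ("BMXBMI", "numeric"), ("MCQ010", ["No", "Yes"])],
    [(45, 31.2, "Yes"), (30, None, "No")],
)
print(arff_text)
```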
5 SYSTEM ARCHITECTURE AND IMPLEMENTATION
In this section, we will discuss the overall architecture and implementation of the system
proposed in this thesis. It describes the complete process of converting the input provided
by the user into a predicted numerical value in the form of ‘Likeliness of having Asthma’
and ‘Extent of Diabetes’. It also includes a detailed description of Weka, the data mining
The figure illustrates the operation of the proposed system in generating predictions for
Asthma and Diabetes. The user is equipped with a device that is capable of collecting
data from the user; it can be a desktop, a laptop or any mobile
device. This device collects the data from the user and sends it to the system over the
network; the data is the list of parameters described in the previous chapter.
The system consists of three main blocks: Input Conversion, Neural Network and a
The Input Conversion block collects the data from the device and converts it into a format
that can be used by the neural network models. It creates an instance of the data collected
Figure 5.1 System Architecture for Asthma Prediction
The Neural Network block comprises the 5 prediction models described in Chapter 3; all the
models are trained on the training set described in Chapter 4. Once the models are
trained, the input instance collected from the previous block passes through each model
For Asthma, each model generates a numeric value: 1 when the instance is classified as
‘No’, meaning ‘not likely to have asthma’, and 2 when the instance is classified as ‘Yes’,
For Diabetes, each model generates a numeric value: 1 when the instance is classified as
‘No’, meaning ‘not likely to have diabetes’, 2 when the instance is classified as
‘Borderline’, meaning ‘likely to have borderline diabetes’, and 3 when the instance is
calculates the mean value of the outcomes obtained from all 5 models. For
Asthma, only two values are possible, i.e. 1 or 2, so the mean can only range from 1 to 2;
similarly, for Diabetes, only three values are possible, i.e. 1, 2 or 3, so the mean can
only range from 1 to 3. The second component converts the mean obtained from the previous
block to the required scale: for Asthma, it converts the mean from the 1-2 scale to 0-1, and
for Diabetes, from the 1-3 scale to 0-2. It then multiplies the scaled mean
by 100 to generate the likeliness of having asthma and
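The averaging and rescaling steps described above can be sketched as follows. The function name `risk_score` and the sample model outputs are hypothetical; the arithmetic follows the description in the text (mean of the five class codes, shifted so that the scale starts at 0, then multiplied by 100).

```python
from typing import Sequence


def risk_score(model_outputs: Sequence[int], n_classes: int) -> float:
    """Combine per-model class codes (1..n_classes) into a single score.

    The mean is shifted from the 1..n_classes scale to 0..(n_classes - 1)
    and multiplied by 100, so Asthma scores fall in 0-100 and Diabetes
    scores in 0-200.
    """
    if not model_outputs:
        raise ValueError("at least one model output is required")
    if any(o < 1 or o > n_classes for o in model_outputs):
        raise ValueError("class codes must lie between 1 and n_classes")
    # Shifting each code before averaging is equivalent to shifting the mean.
    return sum(o - 1 for o in model_outputs) * 100 / len(model_outputs)


# Hypothetical outputs of the five models for one user.
asthma_score = risk_score([2, 2, 1, 2, 1], n_classes=2)     # three models say 'Yes'
diabetes_score = risk_score([3, 3, 2, 3, 1], n_classes=3)
print(asthma_score, diabetes_score)   # 60.0 140.0
```

For Asthma the score can be read directly as the percentage of models that voted ‘Yes’.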
5.2 Implementation
To better manage Asthma and Diabetes, it is essential to implement the proposed
system in such a way that it can be used from anywhere; hence we have developed a
web-based application of the system proposed in the previous section. The system is
built with JavaServer Faces (JSF), a Java specification for building component-based user
interfaces for web applications. To store the data for effectively tracking health
over a period of time, we have used a MySQL database. To use the system,
the user has to log in with the email ID and password generated at
the time of registration. Each time a user calculates a score, all
the details are stored in the database, from which they can be retrieved and used whenever
needed; this helps in providing better health care over a period of time.
To build the models described previously and generate the predictions, we have used
Weka, a Java-based data mining tool. Weka trains the models only when the user
calculates the score for the first time; later on, the trained models are used to generate the
5.2.1 Weka
Weka is a workbench for machine learning algorithms [36] written in Java. It helps in
rules. It is open-source software issued under the GNU General Public License. It comes with three
Figure 5.5 ROC curve generated in weka
In this system we have used the Java API provided by Weka; it is a collection of classes and
models in the form of a class which can be integrated into Java code in an object-oriented
5.3 Screenshots
Figure 5.6 Login Page
Figure 5.8 Asthma Calculator Page1
Figure 5.10 Asthma Calculator Page3
Figure 5.12 Results Generated for Asthma
Figure 5.14 Asthma Record Page1
Figure 5.16 Asthma Dashboard
6 RESULTS AND ANALYSIS
This chapter presents the results obtained from Weka for all five models. The results of all
models from Weka are recorded, and an analysis is carried out to compare the prediction
Root Mean Squared Error and Area under ROC. Accuracy is the percentage of
instances that are correctly classified. RMSE is the square root of the average of the
squared errors, i.e. the differences between the actual class and the predicted class. ROC,
of a classifier system; it plots the true positive rate against the false positive rate. The area
under the ROC curve ranges from 0 to 1, where 1 indicates a perfect test and 0.5 indicates a
test that performs no better than chance. The analysis also includes the confusion matrix, a
table layout used to visualize the performance of a model. A typical confusion matrix consists
of rows and columns, where each column represents the number of instances in the predicted class and each row
matrix represents the total number of true positives, false positives, false negatives and
true negatives.
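Confusion matrix layout (rows: actual class; columns: predicted class)
 | Predicted Positive | Predicted Negative
Actual Positive | True Positive | False Negative
Actual Negative | False Positive | True Negative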
To better understand this terminology, consider a scenario in which a test
screens people for asthma. Each person either has asthma or does not have
asthma. The test result can be either positive (meaning the person has asthma) or negative
(meaning the person does not have asthma).
In this case, True Positive means a person with asthma is correctly diagnosed with
asthma; False Positive means a person without asthma is incorrectly diagnosed with
asthma; True Negative means a person without asthma is correctly identified as not having
asthma; and False Negative means a person with asthma is incorrectly identified as not having
asthma.
The true positive rate is also known as Sensitivity and the true negative rate is also known as
Specificity.
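The measures defined above can be computed directly from the four confusion-matrix counts. The sketch below is illustrative only: the function names are hypothetical, and the counts are those of the J48 asthma confusion matrix reported later in this chapter, with ‘Yes’ treated as the positive class. The `rmse` helper follows the definition given in the text and is applied to made-up values.

```python
import math
from typing import Dict, Sequence


def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Square root of the mean squared difference between two sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# J48 asthma matrix: actual No -> (126, 17), actual Yes -> (31, 26).
j48 = binary_metrics(tp=26, fn=31, fp=17, tn=126)
print(j48["accuracy"])                      # 0.76
print(rmse([1, 0, 1, 1], [1, 0, 0, 1]))     # 0.5
```

The accuracy of 0.76 agrees with the 76% quoted for asthma in the conclusion, while the sensitivity of 26/57 shows that fewer than half of the actual ‘Yes’ cases are detected by this model.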
Table 6.1 shows an overview of the results obtained from all 5 models when the
models are trained on 1951 instances. The J48 decision tree achieved the
highest accuracy, but at the class level the Multilayer Perceptron and Logistic
Regression models performed marginally better, with a larger Area under ROC.
Table 6.1 Asthma Results in terms of Accuracy, RMSE and ROC area for all the 5
models
This section displays the confusion matrices obtained from Weka for all 5 models when
all 200 instances are tested against the trained models; it gives the measure of true
Naïve Bayes
Table 6.2 Confusion Matrix of Naive Bayes Classifier for Asthma Prediction
Bayesian Network
BayesNet | Predicted No | Predicted Yes
Actual No | 120 | 23
Actual Yes | 33 | 24
MLP | Predicted No | Predicted Yes
Actual No | 112 | 31
Actual Yes | 28 | 29
Logistic
Logistic | Predicted No | Predicted Yes
Actual No | 129 | 14
Actual Yes | 37 | 20
Table 6.6 Confusion Matrix of J48 decision tree Classifier for Asthma Prediction
J48 | Predicted No | Predicted Yes
Actual No | 126 | 17
Actual Yes | 31 | 26
6.1.2 ROC curves
[Figure: ROC curves for the ‘No’ and ‘Yes’ classes of the Asthma models: Naïve Bayes, Bayesian Network, Multilayer Perceptron, Logistic and J48]
6.2 Diabetes
Table 6.8 shows an overview of the results obtained from all 5 models when the
models are trained on 1525 instances. The Multilayer Perceptron model
achieved the highest accuracy and the lowest root mean squared error. It was also found
that, at the class level, Naïve Bayes and the Multilayer Perceptron have almost the same Area
under ROC.
Table 6.8 Diabetes Results in terms of Accuracy, RMSE and ROC area for all the 5
models
This section displays the confusion matrices obtained from Weka for all 5 models when
all 300 instances are tested against the trained models; it gives the measure of true
Naïve Bayes
Table 6.9 Confusion Matrix of Naive Bayes Classifier for Diabetes Prediction
Naïve Bayes | Predicted No | Predicted Borderline | Predicted Yes
Actual No | 217 | 0 | 3
Actual Borderline | 4 | 2 | 2
Actual Yes | 13 | 12 | 47
Bayesian Network
Bayesian Network | Predicted No | Predicted Borderline | Predicted Yes
Actual No | 189 | 0 | 31
Actual Borderline | 5 | 0 | 3
Actual Yes | 17 | 0 | 55
Logistic
Logistic | Predicted No | Predicted Borderline | Predicted Yes
Actual No | 201 | 0 | 19
Actual Borderline | 5 | 0 | 3
Actual Yes | 16 | 0 | 56
J48
J48 | Predicted No | Predicted Borderline | Predicted Yes
Actual No | 179 | 4 | 37
Actual Borderline | 4 | 0 | 4
Actual Yes | 15 | 1 | 56
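For a three-class matrix, accuracy and per-class recall can be read off as sketched below. The helper names are hypothetical, and the column order (No, Borderline, Yes) is an assumption inferred from the row totals of 220, 8 and 72, which match the test-set class counts given in Chapter 4. The counts are those of the Logistic model above.

```python
from typing import Dict, List

LABELS = ["No", "Borderline", "Yes"]   # assumed row and column order

# Logistic model, rows = actual class, columns = predicted class.
logistic = [
    [201, 0, 19],
    [5, 0, 3],
    [16, 0, 56],
]


def accuracy(matrix: List[List[int]]) -> float:
    """Share of instances on the diagonal, i.e. correctly classified."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall_per_class(matrix: List[List[int]]) -> Dict[str, float]:
    """Fraction of each actual class that the model labelled correctly."""
    return {LABELS[i]: row[i] / sum(row) for i, row in enumerate(matrix)}


acc = accuracy(logistic)
recalls = recall_per_class(logistic)
print(round(acc, 4), recalls)
```

Under this reading, none of the 8 Borderline test instances is recognised by the Logistic model.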
6.2.2 ROC curves:
[Figure: ROC curves for the Diabetes models: Naïve Bayes, Bayesian Network, Multilayer Perceptron, Logistic and J48]
As the above results for Asthma and Diabetes prediction show, the models rank
differently on different performance measures: some models provide better accuracy but
with a smaller Area under ROC, while others achieve a larger Area under ROC but with
a high root mean squared error. Hence the approach described to combine all the 5 models
7 CONCLUSION AND FUTURE WORK
7.1 Conclusion
In this thesis, we have discussed the design and implementation of a predictive analytics
based system to predict the likeliness of having asthma and the extent of diabetes in an
individual. In order to give better results and build a powerful system we used 5 machine
one model with the help of other models. The system is successfully able to generate
predictions based on the data provided by the user. For asthma, the questionnaire consists
questions ranging from demographics to laboratory details. Of the two diseases,
the model for predicting the extent of diabetes (accuracy 84%)
performed better than the model for predicting the likeliness of having
asthma (accuracy 76%). The reason is that clinical data such as Albumin, Fasting Glucose
and Triglycerides were included in the models for Diabetes but not in the models for
Asthma, because there were too many missing values in the asthma records with ‘Yes’ as the class
label. It was also found that, at the class level, the model for Diabetes resulted in a better
Area under ROC. The system developed in this study can be used to develop
Asthma and Diabetes. With the web-based implementation of the proposed system, the
user is able to make use of the system without worrying about geographical
restrictions. The user can also view previously generated results and feedback to better
With the increase in the development of many health management strategies, the research
extension includes:
• For Diabetes prediction there were very few records with Borderline
Diabetes; hence one future direction would be to add more records with
Borderline Diabetes, which would help the learning algorithms improve on
Borderline cases.
• For Asthma prediction, we can include the clinical data for training purposes. It
will improve the overall accuracy of the system because clinical data have a
• The system can be extended to build models for other chronic conditions such as
REFERENCES
[3] G. B. a. Y. H. Yang Guo, "Using Bayes Network for Prediction of Type-2 Diabetes,"
2012.
[11] B. R. P. a. S. Agarwal, "Modeling Risk Prediction of Diabetes - A Preventive
Measure," IEEE, pp. 1-6, 2014.
[12] P. C. a. H. S. Gerd Assman, "Simple Scoring Scheme for Calculating the Risk of
Acute Coronary Events Based on the 10 year Follow-up of the Prospective
Cardiovascular Munster(PROCAM) study," Circulation, pp. 1-8, 2012.
[14] F. Beaule, "Systems and Methods For Determining and Managing an Individual and
Portable Health Score," 2012.
[19] I. Rish, "An empirical study of the naive bayes classifier," 2001.
[21] M. E. Kragt, "A beginners guide to Bayesian Network Modelling for integrated
catchment management," 2009.
[26] C. M. A. G. P. E. M. V. D. M. K. E. P. A. P. K. Papoutsakis, "Associations between
central obesity and asthma in children and adolescents: a case-control study.," 2014.
[29] J. H. Sisson, "Alcohol and Airways Function in Health and Disease," 2007.