
PREDICTIVE MODELING FOR CHRONIC CONDITIONS
by
Ritesh Jain
A Thesis Submitted to the Faculty of
The College of Computer Science and Engineering
in Partial Fulfillment of the requirements for the Degree of
Master of Science
Florida Atlantic University
Boca Raton, Florida
May 2015
Copyright 2015 by Ritesh Jain
ACKNOWLEDGEMENTS
It is a pleasure to thank the many people who made this thesis a success. I am indebted to
my advisors Dr. Ankur Agarwal and Dr. Ravi Behara for giving me this wonderful
opportunity to work under their guidance throughout my Master’s thesis. Their
enthusiasm, inspiration and great efforts to explain things clearly and in a simple way
helped me to achieve my goals in this study.
I would like to thank Dr. Vinaya Rao, M. D., Methodist University Hospital Transplant
Institute, Memphis, TN, USA for sharing her expertise and providing valuable guidance
to validate my research.
I would like to thank my committee members Dr. Xingquan Zhu and Dr. Hari Kalva for
their valuable comments, suggestions and input to the thesis. Thanks a lot for your
patience and time.
I would like to thank my parents Mr. Mahesh Jain and Mrs. Sumitra Jain for believing in
me. My sincere thanks also go to my brother Mr. Hitesh Jain and my sister-in-law Ms.
Priyanka Ved for giving me all the support.
ABSTRACT
Author: Ritesh Jain
Title: Predictive Modeling for Chronic Conditions
Institution: Florida Atlantic University
Thesis Advisor: Dr. Ankur Agarwal
Degree: Master of Science
Year: 2015
Chronic diseases are a major cause of mortality around the world, accounting for 7 out
of 10 deaths each year in the United States. Because of their adverse effect on the quality
of life, they have become a major problem globally. The health care costs involved in
managing these diseases are also very high. In this thesis, we focus on two major chronic
diseases, Asthma and Diabetes, which are among the leading causes of mortality around
the globe. The work involves the design and development of a predictive analytics based
decision support system which uses five supervised machine learning algorithms to predict
the occurrence of Asthma and Diabetes. This system helps in controlling the disease well in
advance by selecting its best indicators and providing necessary feedback. Based on
several risk factors such as blood pressure, BMI, age, ethnicity and smoking status, the
system can predict the vulnerability of a person to a particular disease,
which helps in taking the necessary action to avoid the disease well in advance.
PREDICTIVE MODELING FOR CHRONIC CONDITIONS
LIST OF TABLES
LIST OF FIGURES
1 INTRODUCTION
1.1 Motivation
1.2 Problem Statement
1.3 Contribution
1.4 Organization
2 RELATED WORK
2.1 Introduction
2.2 Prognostics for Patient Health Management
2.2.1 System Architecture
2.2.2 Artificial Neural Network:
2.2.3 Data:
2.3 COPD Prognosis under Biologically Inspired Neural Network
2.3.1 System Architecture:
2.4 Time to CARE
2.4.1 Methodology:
2.5 An Evolutionary two-objective genetic algorithm for asthma prediction
2.5.1 MLP Pruning by Genetic Algorithm
2.6 Cloud Framework for Health Care Monitoring System (CHMS)
2.6.1 Cloud Framework:
2.6.2 System Architecture:
2.7 Modeling Risk Prediction of Diabetes – A Preventive Measure
2.7.1 Methodology
2.8 Scoring Scheme based on Prospective Cardiovascular Munster Study (PROCAM)
2.8.1 Scoring Method:
2.9 Other Related Works:
3 RESEARCH METHODOLOGY
3.1 Introduction
3.2 Research Design
3.3 Predictive Modelling
3.4 Data Mining
3.5 Machine Learning
3.5.1 Naïve Bayes:
3.5.2 Bayesian Network:
3.5.3 Multilayer Perceptron Model
3.5.4 Logistic Regression:
3.5.5 J48 Decision Tree
4 DATA AND ANALYSIS OF RISK FACTORS
4.1 Data Collection:
4.2 Variable Selection
4.2.1 Initial Set
4.2.2 Asthma Risk Factors:
4.2.3 Diabetes Risk Factors
4.3 Data Pre-Processing
4.4 Data Transformation:
5 SYSTEM ARCHITECTURE AND IMPLEMENTATION
5.1 System Architecture
5.2 Implementation
5.2.1 Weka
5.3 Screenshots
6 RESULTS AND ANALYSIS
6.1 Analysis of Results obtained for Asthma
6.1.1 Confusion Matrix:
6.1.2 ROC curves
6.2 Diabetes
6.2.1 Confusion Matrix
6.2.2 ROC curves:
7 CONCLUSION AND FUTURE WORK
7.1 Conclusion
7.2 Future Work
REFERENCES
LIST OF TABLES
Table 4.1 Unweighted response rates for NHANES 2011-2012 survey by Age and Gender
Table 4.2 Variable considered from Demographics
Table 4.3 Variable Considered from Examination
Table 4.4 Variables considered from Laboratory
Table 4.5 Variables considered from Questionnaire
Table 4.6 Demographics variables for asthma prediction
Table 4.7 Blood Pressure variables for asthma prediction
Table 4.8 Body Measure variables for asthma prediction
Table 4.9 Physical Activity variables for asthma prediction
Table 4.10 Alcohol use variable for asthma prediction
Table 4.11 Smoking related variables for asthma prediction
Table 4.12 Environmental Factors for asthma prediction
Table 4.13 Other variables for asthma prediction
Table 4.14 Demographics variables for diabetes prediction
Table 4.15 Blood Pressure variables for diabetes prediction
Table 4.16 Body measure variables for diabetes prediction
Table 4.17 Physical Activity variables for diabetes prediction
Table 4.18 Smoking related variables for diabetes prediction
Table 4.19 Alcohol use variables for diabetes prediction
Table 4.20 Laboratory variables for diabetes prediction
Table 4.21 other variables for diabetes prediction
Table 6.1 Asthma Results in terms of Accuracy, RMSE and ROC area for all the 5 models
Table 6.2 Confusion Matrix of Naive Bayes Classifier for Asthma Prediction
Table 6.3 Confusion Matrix of BayesNet Classifier for Asthma Prediction
Table 6.4 Confusion Matrix of MLP Classifier for Asthma Prediction
Table 6.5 Confusion Matrix of Logistic Classifier for Asthma Prediction
Table 6.6 Confusion Matrix of J48 decision tree Classifier for Asthma Prediction
Table 6.7 ROC curves obtained from WEKA for Asthma
Table 6.8 Diabetes Results in terms of Accuracy, RMSE and ROC area for all the 5 models
Table 6.9 Confusion Matrix of Naive Bayes Classifier for Diabetes Prediction
Table 6.10 Confusion Matrix of BayesNet Classifier for Diabetes Prediction
Table 6.11 Confusion Matrix of MLP Classifier for Diabetes Prediction
Table 6.12 Confusion Matrix of Logistic Classifier for Diabetes Prediction
Table 6.13 Confusion Matrix of J48 Classifier for Diabetes Prediction
Table 6.14 ROC curves for Diabetes Prediction
LIST OF FIGURES
Figure 1.1 Thesis Structure
Figure 2.1 Timeline of Medical Prediction
Figure 2.2 System Architecture of Patient Health Management
Figure 2.3 System Architecture of Biologically inspired techniques
Figure 2.4 Cloud Framework of CHMS
Figure 2.5 Layered Architecture of CHMS
Figure 3.1 Proposed System Design
Figure 3.2 Supervised Learning Models used in the research
Figure 3.3 A Multilayer Perceptron Model with three layers
Figure 4.1 Steps involved in preparing a dataset
Figure 5.1 System Architecture for Asthma Prediction
Figure 5.2 System Architecture for Diabetes Prediction
Figure 5.3 Weka Preprocessing stage
Figure 5.4 Weka classification stage
Figure 5.5 ROC curve generated in weka
Figure 5.6 Login Page
Figure 5.7 Home Page
Figure 5.8 Asthma Calculator Page1
Figure 5.9 Asthma Calculator Page2
Figure 5.10 Asthma Calculator Page3
Figure 5.11 Asthma Calculator Page4
Figure 5.12 Results Generated for Asthma
Figure 5.13 View Asthma History Page
Figure 5.14 Asthma Record Page1
Figure 5.15 Asthma Record Page2
Figure 5.16 Asthma Dashboard
1 INTRODUCTION
1.1 Motivation
A chronic condition is a health condition or disease that is persistent and whose effects are
long lasting. It has a major adverse effect on the quality of life of the individual who is
affected by it. Diabetes, Asthma, Cancer, COPD, CKD and Heart Disease are some of
the major chronic conditions the world is facing today. It has been found that chronic
diseases are the major cause of mortality, accounting for 7 out of 10 deaths in the US.
According to the Centers for Disease Control and Prevention (CDC), about half of all adults,
i.e. almost 117 million people in the United States, have one or more chronic health conditions.
According to a World Health Organization report [1], chronic diseases accounted for
35 million of the 58 million deaths in 2005. They are currently the major cause of
death among adults all over the world.
Chronic conditions are critical not only because of their high mortality rate but also
because of the costs associated with them. The majority of US economic and health care
costs are attributable to chronic diseases and the
associated health risk behaviors. A CDC survey shows that the total costs of heart disease
and stroke in 2010 were estimated to be $315.4 billion; the costs of diabetes were
estimated to be $245 billion and cancer care costs were estimated to be $157 billion.
As a result of the above factors, there is a need for a system which can help manage
chronic conditions in an individual well before their onset. In this research we have
developed a predictive analytics based clinical decision support system which can help an
individual to better manage chronic conditions. This system focuses on two major chronic
diseases: Asthma, which is caused by inflammation of the airways, the small tubes called
bronchi which carry air in and out of the lungs [2], and Diabetes, which is caused by an
imbalance in the secretion of insulin that disturbs the sugar levels of the blood [3].
Diabetes also increases the risk of developing kidney diseases, heart diseases, blindness
etc. [3].
According to the United States Environmental Protection Agency [4], an estimated 25.9
million people suffer from Asthma, and its economic cost, including direct and
indirect costs, amounts to more than $56 billion annually. A CDC survey shows that every day
about 9 people die from Asthma.
According to the American Diabetes Association, 29.1 million people in the United States
had diabetes in 2012, of whom only 21 million had been diagnosed. According to the
International Diabetes Federation, there are 246 million diabetic people around the world
and this number is expected to rise to 380 million by 2025 [3].
Further research showed that Asthma and Diabetes are associated with many clinical and
non-clinical risk factors which might lead to these
conditions, such as age, gender, blood pressure and smoking status. As a result, we have
developed a predictive system which can help individuals track their likeliness of
Asthma and the extent of diabetes based on a list of clinical and non-clinical factors. This
system also provides necessary feedback which can reduce the chances of developing a
chronic condition in the future or help to control an existing one.
1.2 Problem Statement
Chronic diseases such as Asthma and Diabetes are major causes of high mortality and
morbidity rates all around the world; millions of people are diagnosed with one or more
chronic conditions every year. These are among the most common health problems the
world is facing today. The overall economic costs involved in these diseases are also very
high. Health risk behaviors are the major factors associated with chronic
conditions. They are unhealthy behaviors that can be changed over time if proper
efforts are made to control or manage them. The following are the four
major health risk behaviors associated with chronic conditions [5]:
1. Lack of Exercise or Physical Activity
2. Poor Nutrition
3. Drinking too much alcohol
4. Excessive smoking
According to CDC surveys, cigarette smoking is responsible for more than 480,000
deaths every year and drinking too much alcohol causes 88,000 deaths each year. The surveys
also show that more than 50% of adults aged 18 years or older did not meet the expected
duration and level of physical activity. Because these behaviors are not properly managed,
the rate of mortality due to chronic conditions is increasing.
1.3 Contribution
The major contributions of this research work can be outlined as follows:
1. Developing a predictive system that can predict the Likeliness of Asthma and the
Extent of Diabetes in an individual, so that individuals can effectively manage
their chronic health.
2. Providing a web based application to make use of the system in a user friendly manner.
3. Providing a graphical representation of the likeliness of asthma and extent of
diabetes so that individuals can keep track of their health.
4. Providing necessary feedback to improve overall health and reduce the risk of
Asthma and Diabetes.
5. Performing analysis on the data taken from the NHANES
2011-2012 survey.
6. Utilizing the Java-based data mining software tool Weka for classification,
regression, pre-processing, clustering, visualization and association.
7. Incorporating Artificial Neural Network techniques to make the system more
intelligent and its results more reliable.
1.4 Organization
This thesis is divided into 7 major chapters. Chapter 1 explains the research problem and
conveys the importance and magnitude of the problem to the reader. It also
provides an overview of the contributions made by the research.
Chapter 2 presents different perspectives on the predictive analytics approach to
solving health problems. It provides an overview of the methodologies used by these systems.
Chapter 3 provides an overview of the research methodology used to solve the
stated problem. It includes a detailed description of the various machine learning models
used by the system to generate the predictions.
Chapter 4 provides a detailed description of the NHANES data set, which is used to train
the prediction models. It also provides an overview of the risk factors chosen to predict
the Likeliness of Asthma and Extent of Diabetes.
Chapter 5 defines the system architecture in detail. It also gives a detailed description of
Weka, the data mining tool used to build the prediction models.
Chapter 6 discusses all the results obtained from the models used in the research. It
provides various tables and graphs to make the results easy to understand.
Chapter 7 discusses the conclusion and possible future work in this field of study.
Figure 1.1 Thesis Structure (chapters in sequence: 1 Introduction, 2 Related Work,
3 Research Methodology, 4 Data and Analysis of Variables, 5 System Architecture and
Implementation, 6 Analysis & Result, 7 Conclusion and Future Work)
2 RELATED WORK
2.1 Introduction
Many studies have applied predictive modeling, often based on Artificial Neural
Networks, to improve healthcare and its delivery. In this chapter we discuss several
systems that use a predictive modeling approach to improve overall health, including
systems that predict the risk of Asthma and Diabetes based on different risk factors.
For each, we provide an overview of the system architecture, the data sets used to
carry out the research and the results.
2.2 Prognostics for Patient Health Management
Peter Ghavami of Harborview Medical Center, Seattle and Kailash Kapur of the University
of Washington, Seattle designed and developed a prognostics engine to predict
patients' physiological health status using Artificial Neural Networks [6]. The engine
builds models from historical clinical data collected from different human
physiological systems, including the respiratory, circulatory and immune
systems. Their system includes Prognostics and Health Management (PHM), which
provides methods for solving reliability issues; it also permits assessment of the system
under its application scenario in order to determine possible risks and failures.
Figure 2.1 shows the timeline of medical predictions on which the system works:
Figure 2.1 Timeline of Medical Prediction (stages: Risk Factors, Prediction, Marker,
Medical Event)

The timeline begins with risk factors, which lead to predictors, the variables
that provide a prediction about a disease. It then moves to the marker, where the actual
causes of disease combine to reach a detectable level of disease, and finally to the
medical event itself, which determines the presence of a disease or its occurrence in the
near future.
2.2.1 System Architecture
Figure 2.2 System Architecture of Patient Health Management

Figure 2.2 shows the feed-forward and feedback control model. The input to the
physiological system, represented by i(t), includes data related to the medical
treatment plan, which involves a set of medications, protocols and procedures suggested
by the physician. The human physiological system is described internally by a wide variety
of clinical data, such as lab results and monitored data, represented by X(t). The model
includes a prognostic engine which continuously monitors the real-time patient data and
applies mathematical algorithms to develop rules and patterns for making predictions
regarding the presence of a disease or its occurrence in the near future.
2.2.2 Artificial Neural Network:
This system uses four major artificial neural network models:
1) PNN - Probabilistic Neural Network
2) SVM - Support Vector Machine
3) Generalized feed-forward multilayer perceptron model
4) MLP trained with LM (Levenberg-Marquardt algorithm)
The clinical data used by the system to define rules for prediction consists of 468 patient
cases admitted for various physiological treatments. The input data consists of
21 independent variables and 1 dependent variable. The independent variables include data
from different lab results, blood pressure data, heart rate data etc. The dependent
variable represents the clinical outcome, i.e. the absence or presence of a disease.
At the end, the results of all models are stored and an oracle (overseer
program) selects the most accurate model.
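The selection step performed by the oracle can be illustrated with a minimal sketch. It is
not the implementation from [6]: the function name and the accuracy values are hypothetical,
and accuracy on held-out cases is assumed here as the selection criterion.

```python
def select_best_model(results):
    """Return the (name, accuracy) pair with the highest accuracy.

    `results` maps a model name to its accuracy on held-out cases.
    Ties are broken in favor of the model stored first.
    """
    if not results:
        raise ValueError("no model results to compare")
    return max(results.items(), key=lambda item: item[1])


# Hypothetical accuracies, for illustration only (not taken from [6]).
stored_results = {
    "PNN": 0.81,
    "SVM": 0.84,
    "Generalized feed-forward MLP": 0.79,
    "MLP trained with LM": 0.83,
}

best_name, best_accuracy = select_best_model(stored_results)
print(best_name, best_accuracy)
```

Keeping the stored results in a plain mapping means a further model can be added to the
comparison without changing the selection logic.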
2.3 COPD Prognosis under Biologically Inspired Neural Network
Researchers at Easwari Engineering College, Chennai, India proposed a model to
rehabilitate Chronic Obstructive Pulmonary Disease (COPD) patients in real time [7].
The model provides a predictive approach using a polynomial neural network with swarm
intelligence. Swarm intelligence is intelligence derived from the social interaction
of individuals. The model includes two main components, Discrete Particle Swarm
Optimization (DPSO) and Continuous Particle Swarm Optimization (CPSO), and the
classification of the health status of the patient is done by support vectors.
2.3.1 System Architecture:
Figure 2.3 System Architecture of Biologically inspired techniques
Figure 2.3 shows the typical architecture of biologically or socially inspired techniques
used in prediction. The patient's handheld device includes four major components: DPSO,
CPSO, a polynomial neural network and training systems. The input to the model is the
patient's physiological parameters such as the MMRC scale, BMI, FEV1% and the 6-minute
walk test. The prediction model comprises a condensed polynomial neural network; the model
is run on the collected data and the accuracy of the system is then assessed by statistical
analysis as per Gibbs specifications.
2.4 Time to CARE
CARE, the Collaborative Assessment and Recommendation Engine, predicts future disease
risks based on a patient's medical history and the histories of other similar patients [8].
It uses collaborative methods to find the most significant risk factors that can lead to the
disease and generates predictions based on the selected risk factors. An iterative version,
ICARE, was also designed, which uses ensemble concepts to improve the performance of
the system.
2.4.1 Methodology:
In a typical CARE system, the individual for whom predictions are to be made is
treated as the testing patient, and the medical histories of other
patients are treated as training patients. The training patients are
restricted to those who have at least two diseases in common with
the testing patient. Collaborative filtering is then used to generate predictions of the
future disease risks of the testing patient. In ICARE (Iterative CARE), this process is
repeated multiple times with different groups of training patients, and the results are
combined to form an ensemble. The results from both CARE and ICARE are then ranked
by disease from the highest to the lowest risk score.
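The filtering and ranking steps can be sketched as follows. This is a simplified
illustration under stated assumptions rather than the scoring used in [8]: Jaccard
similarity is assumed as the patient similarity measure, each disease is scored by the
similarity-weighted share of training patients who have it, and the function names and
disease labels are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of disease labels."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_disease_risks(test_history, other_histories, min_shared=2):
    """Rank diseases the testing patient does not yet have.

    Only patients sharing at least `min_shared` diseases with the testing
    patient are kept as training patients. Each candidate disease is scored
    by the similarity-weighted share of training patients who have it.
    """
    training = [h for h in other_histories
                if len(h & test_history) >= min_shared]
    total_weight = sum(jaccard(test_history, h) for h in training)
    if total_weight == 0:
        return []
    scores = {}
    for history in training:
        weight = jaccard(test_history, history)
        for disease in history - test_history:
            scores[disease] = scores.get(disease, 0.0) + weight / total_weight
    # Highest risk score first; ties are ordered by name for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical disease labels, for illustration only.
test_patient = {"asthma", "hypertension", "obesity"}
others = [
    {"asthma", "hypertension", "diabetes"},
    {"asthma", "obesity", "diabetes", "sleep apnea"},
    {"hypertension", "arthritis"},  # shares only one disease, so it is excluded
]
ranking = rank_disease_risks(test_patient, others)
print(ranking)
```

An ICARE-style ensemble could call `rank_disease_risks` on several different groups of
training patients and average the resulting scores per disease.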
2.5 An Evolutionary two-objective genetic algorithm for asthma prediction
Researchers at Democritus University of Thrace, Xanthi, Greece proposed a system to
predict the occurrence of asthma using an Artificial Neural Network and a Genetic
Algorithm [9]. The system predicts asthma risks in children under the age of 5. The Genetic
Algorithm helps in filtering the factors that influence asthma the most. In other words, this
algorithm finds the risk factors which make a child more vulnerable to asthma. In a
Genetic Algorithm, a set of candidate solutions to an optimization problem is evolved to
find the most appropriate solution.
2.5.1 MLP Pruning by Genetic Algorithm
Patterns for predicting asthma were generated with the help of an Artificial Neural Network.
A Multilayer Perceptron (MLP), a supervised neural network model, was used to generate these
patterns. A total of 34 prognostic factors were used to predict the disease, ranging from
breathing tests to pharmaceutical therapy and from wheezing episodes to demographics and
common symptoms of asthma such as cough, chest pain and runny nose. The MLP network
was trained on the data of 112 patients obtained from the Pediatric Department of
Alexandroupolis University Hospital, Greece. The training algorithm used here is the
back propagation algorithm, in which the weights are adjusted by back propagating the error,
resulting in a more efficient algorithm.
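For reference, the weight adjustment performed by back propagation is commonly written as a
gradient descent step. The notation below is introduced here for illustration and is not
taken from [9]: E denotes the network error, w_{ij} a connection weight and \eta an assumed
learning rate. The exact variant used in [9] may differ.

```latex
w_{ij} \leftarrow w_{ij} - \eta \, \frac{\partial E}{\partial w_{ij}}
```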
A variety of experiments were performed in MATLAB with the support of the Neural Network
Toolbox for constructing the Multilayer Perceptrons and the Genetic Algorithms. The testing
accuracy of the network was found to be 94.8%.
A Genetic Algorithm search was performed to find the most significant risk factors that
can be supplied as input to the Multilayer Perceptron model. The GA search has
two objectives: the first is to minimize the number of prognostic factors
and the second is to enhance the performance of the model based on these
factors.
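One common way to compare candidates under two objectives is Pareto dominance, sketched
below. This is an illustrative assumption and does not reproduce the genetic algorithm of
[9]: each candidate is reduced to a hypothetical (number of factors, error rate) pair, with
the second objective expressed as an error rate to be minimized.

```python
def dominates(a, b):
    """True if candidate `a` Pareto-dominates candidate `b`.

    Each candidate is a (num_factors, error_rate) pair and both values are
    minimized. `a` dominates `b` when it is no worse on both objectives and
    strictly better on at least one.
    """
    no_worse = a[0] <= b[0] and a[1] <= b[1]
    strictly_better = a[0] < b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(candidates):
    """Return the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical (num_factors, error_rate) pairs, for illustration only.
population = [(34, 0.10), (20, 0.08), (12, 0.08), (10, 0.15), (12, 0.20)]
front = pareto_front(population)
print(front)
```

Candidates on the front represent different trade-offs between using fewer prognostic
factors and achieving a lower error; a genetic algorithm would evolve the population
toward this front over successive generations.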
2.6 Cloud Framework for Health Care Monitoring System (CHMS)
With the increase in popularity of cloud computing, researchers from Bangalore and
Anantapur, India designed and developed a Health Care Monitoring System using a cloud
framework [10]. The cloud enables data sharing without geographical limitations. CHMS
collects health data from a variety of sources and publishes them to a cloud based
repository named the Telemedicine Repository (TMR). Once the data is
published to the TMR, the system performs data analysis using services provided by
the cloud and stores the results in the form of health records.
2.6.1 Cloud Framework:
Figure 2.4 Cloud Framework of CHMS
Figure 2.4 shows the cloud framework of CHMS. It mainly comprises a data acquisition
module, a communication system, a Tele Medicine Center (TMC), an Emergency Health
Care (EMC) module and a Multi Specialty Hospitals module. Patients are equipped with
a data acquisition device capable of collecting patient data such as ECG, glucose and
blood pressure. The data is sent to the TMC through a communication system such as the
internet. The TMC analyzes the received data, taking into account the patient’s existing
data and historic data. The TMC also maintains an Electronic Health Record (EHR) on the
cloud, which is accessible to users at any time. It also includes a web portal that enables
patients and doctors to communicate in an emergency.
2.6.2 System Architecture:
Figure 2.5 Layered Architecture of CHMS
The architecture consists of three layers: a Software-as-a-Service (SaaS) layer, a
Platform-as-a-Service (PaaS) layer and an Infrastructure-as-a-Service (IaaS) layer. The
SaaS layer lets a user work with the system without dealing with the complexity of the
application. The PaaS layer provides a set of tools that make the system quick and
efficient, and it helps in storing the EHR of a patient generated by the TMC. The IaaS layer
provides virtualized data center resources such as servers and networks, offers storage
services by virtualizing those resources, and helps make the data available to different
users across the globe.
2.7 Modeling Risk Prediction of Diabetes – A Preventive Measure
Bakshi Rohit Prasad et al. [11] proposed a data mining approach for selecting the best
indicators of diabetes, together with a model to predict diabetes before its onset. It uses a
voting mechanism to select the most suitable classifier model in order to achieve high
accuracy. The system works in three stages: data pre-processing, class label assignment
and construction of the classifier.
2.7.1 Methodology
The data comes from the Diabetes dataset in the UCI repository; it contains 9 attributes
such as plasma glucose, diastolic BP, BMI and age. In the data pre-processing stage, the
data is transformed into a form suitable for the subsequent stages. A k-nearest neighbor
approach is used to deal with missing values, filling each one with the value present in
the nearest column in terms of Euclidean distance. After pre-processing, only 5 attributes
remained in the dataset. In the next stage, a class label of high risk, medium risk or low
risk is assigned to each patient record, using a clustering technique to group the dataset
into clusters of high, medium and low risk. The final stage builds the classifier to predict
diabetes using four classification techniques: KNN (K-Nearest Neighbor), LDA (Linear
Discriminant Analyzer), Naïve Bayes and DTC (Decision Tree Classification). Each
classifier is trained on the training set and its accuracy is measured on a test set; the vote
count of the classifier with the highest accuracy is incremented by one. The process is
repeated for different test datasets, and the classifier with the highest vote count is then
selected for classification.
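The voting step can be sketched in a few lines. This is an illustration only: the function name `select_by_vote` and the accuracy figures are invented for the example and are not results from [11].

```python
from collections import Counter

def select_by_vote(accuracies_per_split):
    """Give one vote per test set to the most accurate classifier,
    then return the classifier with the most votes and the vote table."""
    votes = Counter()
    for accs in accuracies_per_split:
        winner = max(accs, key=accs.get)   # most accurate on this test set
        votes[winner] += 1
    return votes.most_common(1)[0][0], dict(votes)

# Made-up accuracies for three different test sets (illustrative values only)
splits = [
    {"KNN": 0.71, "LDA": 0.76, "NaiveBayes": 0.74, "DTC": 0.70},
    {"KNN": 0.73, "LDA": 0.75, "NaiveBayes": 0.77, "DTC": 0.72},
    {"KNN": 0.70, "LDA": 0.78, "NaiveBayes": 0.75, "DTC": 0.74},
]
best, votes = select_by_vote(splits)
print(best, votes)
```

Voting across several test sets favours the classifier that wins most often rather than the one with the single best score, which makes the choice less sensitive to one lucky split.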
2.8 Scoring Scheme based on Prospective Cardiovascular Munster Study (PROCAM)
Gerd Assmann et al. [12] proposed a scoring scheme for calculating the risk of acute
coronary events based on the 10-year follow-up of the Prospective Cardiovascular Munster
Study. The scoring scheme is based on 325 acute coronary events that occurred within 10
years of follow-up among 5389 middle-aged men who were recruited into the PROCAM
study. Within the 10 years, 230 men were lost to follow-up, 218 died, 14 had suspected
coronary death and 46 non-fatal cases occurred. In total, 4493 middle-aged men
survived the 10 years of follow-up without any major coronary event.
2.8.1 Scoring Method:
To obtain maximum information from the PROCAM study, a risk algorithm based on the
Cox proportional hazards model was constructed. It includes 8 variables that were
independent predictors of event risk. The Cox model only allows calculation of relative
risk; hence, Kaplan-Meier statistics were used to convert the relative risk obtained from
the Cox model into absolute risk.
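As a hedged illustration of this conversion, the relationship between relative and absolute risk under a proportional hazards model is commonly written as below. This is the standard textbook form and not necessarily the exact formulation used in [12]; the symbols \beta (Cox coefficients), x (an individual's risk factor values), \bar{x} (their population means) and S_0(t) (baseline survival at time t, assumed here to come from the survival curves) are notation introduced for this sketch.

```latex
\[
  \text{relative risk}(x) = \exp\!\big(\beta^{\top}(x - \bar{x})\big),
  \qquad
  \text{absolute risk}(t \mid x) = 1 - S_0(t)^{\exp\left(\beta^{\top}(x - \bar{x})\right)}
\]
```

For an individual with average risk factors the exponent equals 1, so the absolute risk reduces to 1 - S_0(t).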
To generate the scoring scheme, each risk factor was divided into categories, and each
category was assigned a value obtained from a regression equation. This regression was
calculated between the logarithm of global risk, as given by the Cox model combined with
the survival curves, and the categories of each risk factor [12]. The calculated
coefficients were then standardized and rounded to obtain the score as a whole number.
The PROCAM algorithm then calculates the risk of a coronary event associated with each
score, and the scores are then categorized into very low and very high PROCAM scores.
2.9 Other Related Works:
A neural network based Structural Health Monitoring System has been proposed [13].
This system uses a wireless sensor network in which thousands of sensor nodes perform
distributed sensing and collaborative computing for structural health analysis. It uses
several algorithms to predict a particular disease. In 2012, Jean-Francois proposed a
system and method [14] for determining and managing an individual portable health
score; the method defines a baseline health score and then adjusts it based on several
health actions.
3 RESEARCH METHODOLOGY
3.1 Introduction
The research design is a framework for predicting the likeliness of Asthma and the extent
of Diabetes in an individual. This chapter first explains the design of the proposed
system, and then explains the concepts of predictive modelling, data mining and machine
learning, including the machine learning models used to carry out the prediction.
3.2 Research Design
The primary purpose of this research is to develop a system that helps individuals keep
track of their health with respect to chronic conditions; hence it is important that the
system is easy to interpret and easy to use. In this study, the advantages of Artificial
Neural Networks are harnessed to build a user-friendly system. Figure 3.1 shows the
overall design of the system, which leads to the prediction of the likeliness of Asthma and
the extent of Diabetes.
Figure 3.1 Proposed System Design
The design includes three main components: User, System and Neural Network. Users are
individuals who want to use the services provided by the system. The user is expected to
provide the necessary input to the system, which consists of a list of parameters including
demographic details, laboratory details and body measures. Once the user provides the
input, the system uses several neural network models (discussed later in the chapter) to
generate predictions of the likeliness of Asthma and the extent of Diabetes in a numerical
format that is easy to interpret. Once the prediction has been made, the system provides
feedback to help manage the disease.
3.3 Predictive Modelling
Predictive modelling is the name given to a collection of mathematical techniques or
models that find a mathematical relationship between a target (dependent) variable and
the predictor (independent) variables [15]. It helps in predicting the probability of an
outcome when a set of independent variables is passed through the model. Almost all
regression models can be used for prediction purposes.
3.4 Data Mining
Data Mining is an analytic process for extracting information from a large amount of data.
It is designed to explore data in order to find patterns or relationships between variables,
and it helps in extracting previously unknown and potentially useful information from
data [16]. Data Mining involves machine learning, artificial intelligence and statistics
[17]. The main goal of data mining is prediction; predictive data mining is the most
common data mining approach and has been used by many studies. The process of data
mining is divided into three main stages. In the first stage, the dataset used to train the
model is prepared, which includes data cleaning and data pre-processing. The second
stage builds the model, that is, it identifies patterns in the prepared dataset. In the final
stage, the trained model is used to generate predictions.
3.5 Machine Learning
Machine learning is a branch of computer science concerned with algorithms that can
learn from data; it provides a set of methods that detect patterns in data and use those
patterns to generate future predictions [18].
Machine learning is divided into two main types: supervised and unsupervised learning.
Supervised learning is the machine learning technique in which the learning algorithm
makes use of labelled data. The main goal is to map a set of input parameters X, also
called features or attributes, to an output parameter Y, also called the class label [18]. In
this technique the model is trained on labelled training data and then generates
predictions for unseen situations. In unsupervised learning, the model is trained on
unlabelled data. The main goal of unsupervised learning is to find patterns in order to
extract useful information from unlabelled data.
The following diagram shows the supervised learning models used in this research to
generate the predictions for Asthma and Diabetes: Naïve Bayes, Bayesian Network,
Multilayer Perceptron, Logistic Regression and J48.
Figure 3.2 Supervised Learning Models used in the research
3.5.1 Naïve Bayes:
Naïve Bayes classifiers are based on Bayes' theorem; they simplify learning by assuming
that the features are independent of each other given the class [19]. This strong
assumption is known as the Naïve Bayes Assumption [18]. Let x ϵ X be the input feature
vector with D features and y ϵ {1,…, c} the class label; then the Naïve Bayes Assumption
is given by
\[ p(x \mid y = c) = \prod_{i=1}^{D} p(x_i \mid y = c) \]
Naïve Bayes is particularly suited to input sets with a large number of attributes. Naïve
Bayes classifiers are used in several fields such as target marketing, text classification,
credit approval and spam filtering [20].
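The independence assumption makes the classifier simple to implement, as the following minimal sketch for categorical features suggests. It is an illustration rather than the implementation used in this research; the function names, the add-one (Laplace) smoothing and the four toy records are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels):
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)   # (feature index, class) -> value counts
    values = defaultdict(set)            # feature index -> distinct values seen
    for x, y in zip(rows, labels):
        for i, v in enumerate(x):
            feat_counts[(i, y)][v] += 1
            values[i].add(v)
    return class_counts, feat_counts, values

def predict_nb(model, x):
    class_counts, feat_counts, values = model
    n = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for c, nc in class_counts.items():
        lp = math.log(nc / n)                      # log prior p(y = c)
        for i, v in enumerate(x):
            # log p(x_i | y = c) with add-one smoothing; summing the logs
            # corresponds to the product in the Naive Bayes Assumption
            lp += math.log((feat_counts[(i, c)][v] + 1) / (nc + len(values[i])))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Toy records: (smoking status, BMI category); labels are invented
rows = [("smoker", "high"), ("smoker", "normal"),
        ("non-smoker", "normal"), ("non-smoker", "normal")]
labels = ["asthma", "asthma", "no", "no"]
model = train_nb(rows, labels)
print(predict_nb(model, ("smoker", "high")))
```

Training needs only one pass over the data to collect counts, which is why the classifier can be trained in a single scan.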
Advantages:
1. Naïve Bayes is easy to implement
2. Naïve Bayes classifiers can be trained quickly in a single scan
3. Classification process is quick compared to other models
4. It can handle large amounts of discrete data
5. It is not sensitive to irrelevant features
Disadvantages:
1. The major disadvantage of Naïve Bayes is the Naïve Bayes Assumption; it can
result in loss of accuracy
2. In the real world, dependencies exist among the attributes, but these dependencies
are ignored by Naïve Bayes models
3.5.2 Bayesian Network:
Bayesian Networks, also called Belief Networks, are probabilistic graphical models
widely used for reasoning under uncertainty [21]. They provide methods to represent
relationships between attributes. A Bayesian Network is represented as a directed acyclic
graph whose nodes represent the attributes and whose edges represent the relationships
between them. The relationships between the attributes are described by conditional
probability distributions.
Advantages:
1. Since the outputs are expressed as probabilities, they can be easily interpreted
2. Models are represented in the form of a graph, which can be interpreted easily by
people from different disciplines [21].
3. Bayesian Networks can be easily updated when a new knowledge source is
available
Disadvantages:
1. Limited ability to deal with continuous data [22]
2. Because of the acyclic nature of the model, feedback loops cannot be included
in Bayesian Networks
3.5.3 Multilayer Perceptron Model
A Multilayer Perceptron (MLP) is a feed forward neural network with one or more hidden
layers between input and output. Feed forward means that data flows in one direction,
from the input nodes towards the output node. The network is trained with the back
propagation learning algorithm. An MLP can distinguish between classes of data that are
not linearly separable. Except for the input nodes, every node has a non-linear activation
function. The input layer consists of the set of input parameters on which the prediction
is based. The hidden layer consists of a set of hidden nodes that help solve the non-linear
data problem by converting the input data into a form that can be used by the output
node. Lastly, the output layer consists of an output node with a non-linear activation
function that makes the prediction.
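The forward pass through such a network can be sketched as below. This is a minimal illustration under stated assumptions: the sigmoid activation, the function names and the hand-picked weights are choices made for the example, and back propagation training is not shown. The weights are chosen so that the network computes XOR, a classic case of data that is not linearly separable.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    # One fully connected layer: weighted sum plus bias, then the
    # non-linear activation, for every node in the layer.
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    hidden = layer(x, hidden_w, hidden_b)     # input layer -> hidden layer
    return layer(hidden, out_w, out_b)[0]     # hidden layer -> single output node

# Hand-picked (not trained) weights: hidden node 1 acts like OR,
# hidden node 2 like NAND, and the output node like AND.
hidden_w, hidden_b = [[20, 20], [-20, -20]], [-10, 30]
out_w, out_b = [[20, 20]], [-30]

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(mlp_forward([a, b], hidden_w, hidden_b, out_w, out_b)))
```

No single-layer model can separate these four points with one straight line; the hidden layer re-expresses the inputs so that the output node can.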
Figure 3.3 A Multilayer Perceptron Model with three layers
Advantages
1. It helps in solving problems that involve classification of non-linear data
2. MLP models do not make any assumptions regarding the probabilistic
information about the class labels
Disadvantages
1. It requires more memory and processing power.
2. Training takes more time compared to other classifiers
3.5.4 Logistic Regression:
Logistic Regression is a statistical classification model that can be applied in situations
where the outcome is categorical. It has become a standard method of analysis when the
outcome variable is discrete, taking two or more possible values [23]. It provides a
reasonable model to describe the relationship between the output variable and one or
more input variables. In most cases, the outcome variable is dichotomous, i.e. it can take
only two values such as yes/no or 0/1; such models are called Binary Logistic Regression
models. When the outcome variable can take more than two values, the models are called
Multinomial Logistic Regression models.
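For the binary case, the model passes a weighted sum of the inputs through the logistic (sigmoid) function to obtain a probability. The sketch below illustrates this; the function names, coefficients and input values are hypothetical and are not fitted to any data in this research.

```python
import math

def predict_proba(x, coef, intercept):
    """P(outcome = 1 | x) under a binary logistic regression model."""
    z = intercept + sum(b * v for b, v in zip(coef, x))
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(x, coef, intercept, threshold=0.5):
    # Turn the probability into a yes/no (1/0) outcome.
    return 1 if predict_proba(x, coef, intercept) >= threshold else 0

# Hypothetical coefficients for two standardized input variables
coef, intercept = [0.8, 1.5], -0.4
p = predict_proba([0.5, 1.0], coef, intercept)
print(round(p, 3), predict_label([0.5, 1.0], coef, intercept))
```

Because the output is a probability, the decision threshold can be moved away from 0.5 when false negatives and false positives carry different costs.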
Advantages:
1. It does not assume a linear relationship between the independent and dependent variables
2. Multiple explanatory variables can be used.
3. Less prone to over-fitting due to simplicity and low variance.
4. The dependent variable need not be normally distributed
5. It quantifies the strength of association of each explanatory variable with the
outcome, which helps to account for confounding effects.
Disadvantages:
1. It cannot predict continuous outcomes, as logistic regression models a discrete outcome.
2. It requires a large set of data to achieve better results
3.5.5 J48 Decision Tree
Decision tree learning is a learning method that uses trees to represent a predictive
model [24]. The tree consists of leaves that represent class labels and branches that
represent the features or rules leading to a particular class label. Decision trees are
divided into two categories: classification trees, in which the target variable takes a finite
set of values, and regression trees, in which the target variable takes numerical values.
J48, the Weka implementation of the C4.5 algorithm, is used to perform decision tree
learning. It is also known as a statistical classifier. C4.5 generates a decision tree which
can be used for classification or numerical prediction.
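C4.5 chooses the attribute to split on using the gain ratio, that is, information gain normalised by the entropy of the split itself. The sketch below computes it for one categorical attribute; the function names and the toy values are assumptions for illustration, and handling of continuous attributes and pruning is omitted.

```python
import math
from collections import Counter

def entropy(items):
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

def gain_ratio(values, labels):
    """Gain ratio of splitting `labels` by a categorical attribute `values`."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Expected entropy of the class label after the split
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)          # entropy of the partition itself
    return gain / split_info if split_info > 0 else 0.0

labels = ["asthma", "asthma", "none", "none"]
print(gain_ratio(["yes", "yes", "no", "no"], labels))   # attribute separates the classes
print(gain_ratio(["yes", "no", "yes", "no"], labels))   # attribute carries no information
```

Dividing by the split information penalises attributes with many distinct values, which plain information gain would otherwise favour.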
Advantages:
1. J48 Decision trees can be used for both continuous and discrete attributes.
2. Once the tree is created, it removes unnecessary nodes that do not help in
classification, resulting in a simpler tree which can be easily interpreted
3. It is easy to implement
4. Can work on both continuous and categorical values of output variable
Disadvantages:
1. It is not suitable for problems where classification is based on the fulfilment of
several conditions
2. If the decision tree consists of too many branches and nodes, the cost and the
complexity involved are also very high.
4 DATA AND ANALYSIS OF RISK FACTORS
This chapter presents a detailed description of the data used to train the models
proposed in this research. Preparing the dataset involves four major steps: Data
Collection, Variable Selection, Data Pre-processing and Data Transformation.
• Data Collection: describes the source of the data and provides a detailed description of
the NHANES 2011-2012 dataset
• Variable Selection: covers the formulation of the data sets for Asthma and Diabetes
prediction and describes the variables included for Asthma and Diabetes
• Data Pre-processing: covers the elimination of data that are noisy, inconsistent or
include too many missing values
• Data Transformation: the data is finally transformed into a format that can be used by
Weka, the data mining tool
Figure 4.1 Steps involved in preparing a dataset
4.1 Data Collection:
The data used in this research to train the prediction models is collected from the National
Health and Nutrition Examination Survey (NHANES). “NHANES is a program of
studies designed to assess the health and nutritional status of adults and children in the
United States”. The survey is conducted by the National Center for Health
Statistics (NCHS), an agency of the United States Federal Statistical System that provides
statistical data to improve the health status of people in America. NCHS is an integral
part of the Centers for Disease Control and Prevention (CDC). The survey combines
physical examinations and interviews. The physical examination includes medical,
laboratory and physiological tests, whereas the interview includes demographic, dietary
and other health-related questions. Every year about 5000 persons located in different
states across the country are examined under this survey; it also keeps track of the
changes in their health conditions over time. The data collected from the survey is used to
determine various risk factors for major diseases such as Asthma, Diabetes and Kidney
Diseases.
For this research we have used the data collected from the NHANES 2011-2012 survey.
This survey divides the data into 6 categories: Demographics, Dietary, Examination,
Laboratory, Questionnaire and Limited Access. In 2011-2012, 13,431 persons from 30
different locations were selected, of whom 9,756 completed the interview component and
9,338 were examined in order to collect the information related to the above-mentioned
categories. Table 4.1 shows the unweighted response rates for the NHANES 2011-2012
survey by Age and Gender.
Table 4.1 Unweighted response rates for NHANES 2011-2012 survey by Age and Gender
Gender | Age Group | Control Totals | Screened Sample: Unweighted Sample Size | Interviewed Sample: Unweighted Sample Size | Interviewed Sample: Unweighted Response Rate (%) | Examined Sample: Unweighted Sample Size | Examined Sample: Unweighted Response Rate (%)
Total | All Ages | 306,590,681 | 13,431 | 9,756 | 72.6 | 9,338 | 69.5
Total | < 1 year | 3,686,290 | 458 | 392 | 85.6 | 382 | 83.4
Total | 1-5 years | 20,444,290 | 1,463 | 1,203 | 82.2 | 1,135 | 77.6
Total | 6-11 years | 24,614,923 | 1,641 | 1,328 | 80.9 | 1,272 | 77.5
Total | 12-15 years | 16,544,876 | 803 | 658 | 81.9 | 630 | 78.5
Total | 16-19 years | 17,333,412 | 814 | 615 | 75.6 | 600 | 73.7
Total | 20-29 years | 41,927,467 | 1,409 | 994 | 70.5 | 954 | 67.7
Total | 30-39 years | 39,278,265 | 1,301 | 963 | 74.0 | 924 | 71.0
Total | 40-49 years | 42,648,248 | 1,267 | 899 | 71.0 | 875 | 69.1
Total | 50-59 years | 42,270,235 | 1,309 | 913 | 69.7 | 879 | 67.2
Total | 60-69 years | 30,499,549 | 1,385 | 908 | 65.6 | 868 | 62.7
Total | 70-79 years | 16,715,521 | 879 | 520 | 59.2 | 493 | 56.1
Total | 80+ years | 10,627,605 | 702 | 363 | 51.7 | 326 | 46.4
Male | All Ages | 149,632,763 | 6,681 | 4,856 | 72.7 | 4,651 | 69.6
Male | < 1 year | 1,880,565 | 219 | 193 | 88.1 | 188 | 85.8
Male | 1-5 years | 10,432,000 | 732 | 595 | 81.3 | 561 | 76.6
Male | 6-11 years | 12,575,747 | 820 | 678 | 82.7 | 651 | 79.4
Male | 12-15 years | 8,487,172 | 411 | 340 | 82.7 | 331 | 80.5
Male | 16-19 years | 8,846,406 | 400 | 310 | 77.5 | 300 | 75.0
Male | 20-29 years | 20,817,190 | 731 | 510 | 69.8 | 488 | 66.8
Male | 30-39 years | 19,253,439 | 653 | 481 | 73.7 | 459 | 70.3
Male | 40-49 years | 20,860,140 | 617 | 428 | 69.4 | 416 | 67.4
Male | 50-59 years | 20,490,838 | 653 | 435 | 66.6 | 418 | 64.0
Male | 60-69 years | 14,492,515 | 706 | 459 | 65.0 | 443 | 62.7
Male | 70-79 years | 7,514,299 | 426 | 260 | 61.0 | 244 | 57.3
Male | 80+ years | 3,982,452 | 313 | 167 | 53.4 | 152 | 48.6
Female | All Ages | 156,957,918 | 6,750 | 4,900 | 72.6 | 4,687 | 69.4
Female | < 1 year | 1,805,725 | 239 | 199 | 83.3 | 194 | 81.2
Female | 1-5 years | 10,012,290 | 731 | 608 | 83.2 | 574 | 78.5
Female | 6-11 years | 12,039,176 | 821 | 650 | 79.2 | 621 | 75.6
Female | 12-15 years | 8,057,704 | 392 | 318 | 81.1 | 299 | 76.3
Female | 16-19 years | 8,487,006 | 414 | 305 | 73.7 | 300 | 72.5
Female | 20-29 years | 21,110,277 | 678 | 484 | 71.4 | 466 | 68.7
Female | 30-39 years | 20,024,826 | 648 | 482 | 74.4 | 465 | 71.8
Female | 40-49 years | 21,788,108 | 650 | 471 | 72.5 | 459 | 70.6
Female | 50-59 years | 21,779,397 | 656 | 478 | 72.9 | 461 | 70.3
Female | 60-69 years | 16,007,034 | 679 | 449 | 66.1 | 425 | 62.6
Female | 70-79 years | 9,201,222 | 453 | 260 | 57.4 | 249 | 55.0
Female | 80+ years | 6,645,153 | 389 | 196 | 50.4 | 174 | 44.7
4.2 Variable Selection
There are many environmental and socio-economic factors that may be considered risk
factors for Asthma and Diabetes. This section provides a detailed description of the
risk factors used to predict the likeliness of Asthma and extent of Diabetes. Data from
four categories, i.e. Demographics, Examination, Laboratory and Questionnaire, is used
for this research, as Dietary data was not available.
Most chronic conditions share common risk factors. Some risk factors, such as age,
gender and ethnicity, cannot be changed over time; other behavioural or environmental
risk factors, such as alcohol consumption, smoking habits and physical activity, can be
changed over time if proper measures are taken. The recognition of such risk factors
is the basis of this research.
4.2.1 Initial Set
In the initial stage we selected the parameters that were relevant to different chronic
conditions; later we divided the parameters into two sets, one for Asthma prediction
and the other for Diabetes prediction.
Tables 4.2 to 4.5 list the parameters selected in the initial stage. Each table displays the
Variable Name, the unique name given to the parameter in NHANES; the SAS Label, the
question corresponding to the variable; the Data File Name, the name of the file in which
the description of the parameter is stored; and the Doc File, the id of the document file.
Demographic:
Demographic data includes variables that cover the whole society; it helps in putting
people into different categories such as age, gender and race. Table 4.2 shows the list of
variables that we have included for our research from the demographic section of NHANES.
Table 4.2 Variables considered from Demographics
Variable Name | SAS Label | Data File Name | Doc File
RIAGENDR | Gender | Demographic Variables and Sample Weights | DEMO_G
RIDAGEYR | Age in years at screening | Demographic Variables and Sample Weights | DEMO_G
RIDRETH3 | Race/Hispanic origin w/ NH Asian | Demographic Variables and Sample Weights | DEMO_G
Examination:
It includes variables related to the basic physical examination of an individual, such as
blood pressure, height, weight and BMI, as well as injury-related questions. Table 4.3
shows the list of variables selected for this research.
Table 4.3 Variables considered from Examination
Variable Name | SAS Label | Data File Name | Doc File
BPXSY1 | Systolic: Blood Pres (1st rdg) mm Hg | Blood Pressure | BPX_G
BPXDI1 | Diastolic: Blood Pres (1st rdg) mm Hg | Blood Pressure | BPX_G
BPXSY2 | Systolic: Blood Pres (2nd rdg) mm Hg | Blood Pressure | BPX_G
BPXDI2 | Diastolic: Blood Pres (2nd rdg) mm Hg | Blood Pressure | BPX_G
BPXSY3 | Systolic: Blood Pres (3rd rdg) mm Hg | Blood Pressure | BPX_G
BPXDI3 | Diastolic: Blood Pres (3rd rdg) mm Hg | Blood Pressure | BPX_G
BPXSY4 | Systolic: Blood Pres (4th rdg) mm Hg | Blood Pressure | BPX_G
BPXDI4 | Diastolic: Blood Pres (4th rdg) mm Hg | Blood Pressure | BPX_G
BMXWT | Weight (kg) | Body Measures | BMX_G
BMXHT | Standing Height (cm) | Body Measures | BMX_G
BMXBMI | Body Mass Index (kg/m**2) | Body Measures | BMX_G
BMXWAIST | Waist Circumference (cm) | Body Measures | BMX_G
Laboratory:
It includes variables that correspond to clinical measures. Table 4.4 shows the list of
variables considered for this research.
Table 4.4 Variables considered from Laboratory
Variable Name | SAS Label | Data File Name | Doc File
LBDHDD | Direct HDL-Cholesterol (mg/dL) | HDL-Cholesterol | HDL_G
LBXGLU | Fasting Glucose (mg/dL) | Plasma Fasting Glucose and Insulin | GLU_G
LBXTC | Total Cholesterol (mg/dL) | Total Cholesterol | TCHOL_G
LBXTR | Triglyceride (mg/dL) | Triglycerides and LDL-Cholesterol | TRIGLY_G
LBLDL | LDL-cholesterol (mg/dL) | Triglycerides and LDL-Cholesterol | TRIGLY_G
URXUMS | Albumin, urine (mg/L) | Urinary Albumin and Urinary Creatinine | ALB_CR_G
Questionnaire:
This component consists of a set of questions covering physical activity, alcohol use,
environment-related questions and more. Table 4.5 shows the list of variables selected
for this research.
Table 4.5 Variables considered from Questionnaire
Variable Name | SAS Label | Data File Name | Doc File
ALQ101 | Had at least 12 alcohol drinks/1 yr? | Alcohol Use | ALQ_G
ALQ120Q | How often drink alcohol over past 12 months | Alcohol Use | ALQ_G
ALQ130 | Avg # alcoholic drinks/day – past 12 mos | Alcohol Use | ALQ_G
ALQ141Q | # days have 4/5 drinks – past 12 mos | Alcohol Use | ALQ_G
ALQ151 | Ever have 4/5 or more drinks everyday? | Alcohol Use | ALQ_G
PAQ605 | Vigorous work activity | Physical Activity | PAQ_G
PAQ620 | Moderate work activity | Physical Activity | PAQ_G
PAQ635 | Walk or bicycle | Physical Activity | PAQ_G
PAQ650 | Vigorous recreational activities | Physical Activity | PAQ_G
PAQ665 | Moderate recreational activities | Physical Activity | PAQ_G
PAD680 | Minutes sedentary activity | Physical Activity | PAQ_G
PAQ710 | Hours watch TV or videos past 30 days | Physical Activity | PAQ_G
PAQ715 | Hours use computer past 30 days | Physical Activity | PAQ_G
SLD010H | How much sleep do you get (hours)? | Sleep Disorders | SLQ_G
SMQ020 | Smoked at least 100 cigarettes in life | Smoking – Cigarette Use | SMQ_G
SMQ030 | Age started smoking cigarettes regularly | Smoking – Cigarette Use | SMQ_G
SMQ040 | Do you now smoke cigarettes? | Smoking – Cigarette Use | SMQ_G
SMQ050Q | How long since quit smoking cigarettes? | Smoking – Cigarette Use | SMQ_G
SMD055 | Age last smoke cigarettes regularly | Smoking – Cigarette Use | SMQ_G
SMD057 | # cigarettes smoked per day when quit | Smoking – Cigarette Use | SMQ_G
SMD641 | # days smoked cigs during past 30 days | Smoking – Cigarette Use | SMQ_G
SMD650 | Avg # cigarettes/day during past 30 days | Smoking – Cigarette Use | SMQ_G
SMD410 | Does anyone smoke inside home? | Smoking – Cigarette Use | SMQ_G
SMD410 | Does anyone smoke inside home? | Smoking Household disorders | SMQFAM_G
SMD415 | Total # of smokers inside home | Smoking Household disorders | SMQFAM_G
SMD415A | Total # cigarette smokers inside home | Smoking Household disorders | SMQFAM_G
SMD430 | Total # cigarettes smoked inside home | Smoking Household disorders | SMQFAM_G
MCQ300a | Close relative had heart attack? | Medical Conditions | MCQ_G
MCQ300b | Close relative had asthma? | Medical Conditions | MCQ_G
MCQ300c | Close relative had diabetes? | Medical Conditions | MCQ_G
OCQ510 | Ever had work exposure to mineral dusts? | Occupation Questionnaire | OCQ_G
OCQ520 | # of years exposed to mineral dusts | Occupation Questionnaire | OCQ_G
OCQ530 | Ever had work exposure to organic dusts? | Occupation Questionnaire | OCQ_G
OCQ540 | # of years exposed to organic dusts | Occupation Questionnaire | OCQ_G
OCQ550 | Ever exposed to exhaust fumes at work? | Occupation Questionnaire | OCQ_G
OCQ560 | # of years exposed to exhaust fumes | Occupation Questionnaire | OCQ_G
OCQ570 | Ever had work exposure to other fumes? | Occupation Questionnaire | OCQ_G
OCQ580 | # of years exposed to other fumes | Occupation Questionnaire | OCQ_G
DIQ010 | Doctor told you have diabetes | Diabetes | DIQ_G
DIQ160 | Ever told you have prediabetes | Diabetes | DIQ_G
MCQ010 | Ever been told you have asthma | Medical Conditions | MCQ_G
4.2.2 Asthma Risk Factors:
In order to build predictive models for Asthma we have considered 40 attributes, which
are divided into different categories: Demographics, Blood Pressure, Body Measures,
Physical Activity, Alcohol Use, Smoking – Cigarette use, Environment and Others.
1. Demographics
Many studies have suggested that demographic details play an important role in the
prevalence of asthma; Mexican Americans show a lower prevalence of asthma compared
to other ethnic groups [25]. Asthma occurs more frequently in boys than in girls during
childhood; in young adults, the rate of asthma is found to be the same for males and
females. Females are more likely to have asthma once they cross 40 years of age.
Table 4.6 Demographics variables for asthma prediction
Variable | SAS Label | Code or Value | Value Description
RIAGENDR | Gender | 1 | Male
 | | 2 | Female
RIDAGEYR | Age in years at screening | 0 to 79 | Range of Values
 | | 80 | 80 years of age and over
RIDRETH3 | Race/Hispanic origin w/ NH Asian | 1 | Mexican American
 | | 2 | Other Hispanic
 | | 3 | Non-Hispanic White
 | | 4 | Non-Hispanic Black
 | | 6 | Non-Hispanic Asian
 | | 7 | Other Race - Including Multi-Racial
2. Blood Pressure
According to the Asthma and Allergy Foundation of America, most asthma patients are
diagnosed with high blood pressure. NHANES provides the blood pressure details in four
readings. For our research we have used the average of all four readings while building
the models for prediction.
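The averaging step can be sketched as follows. The function name `mean_reading` and the sample readings are hypothetical, and skipping missing readings (represented here as `None`) is an assumption made for the sketch; the text above does not state how missing readings were handled.

```python
def mean_reading(readings):
    """Average the available blood pressure readings for one person.
    Missing readings (None) are skipped; this is an assumed policy."""
    valid = [r for r in readings if r is not None]
    return sum(valid) / len(valid) if valid else None

# Hypothetical systolic readings (BPXSY1..BPXSY4) for one participant
systolic = [122, 118, None, None]
print(mean_reading(systolic))
```

Returning `None` when no reading is available keeps such records visible to the later pre-processing stage instead of silently turning them into zeros.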
Table 4.7 Blood Pressure variables for asthma prediction
Variable | SAS Label | Code or Value | Value Description
BPXSY1 | Systolic: Blood pres (1st rdg) mm Hg | 74 to 238 | Range of Values
BPXDI1 | Diastolic: Blood pres (1st rdg) | |
BPXSY2 | Systolic: Blood pres (2nd rdg) | |
BPXDI2 | Diastolic: Blood pres (2nd rdg) | |
BPXSY3 | Systolic: Blood pres (3rd rdg) | |
BPXDI3 | Diastolic: Blood pres (3rd rdg) | |
BPXSY4 | Systolic: Blood pres (4th rdg) | |
BPXDI4 | Diastolic: Blood pres (4th rdg) | |
3. Body Measures
Obesity is found to be a critical indicator of Asthma. Research by Papoutsakis et al.
[26] suggested that people with a high body mass index are more likely to have asthma
compared to people with a normal body mass index. It has also been found that women
with a high waist circumference are more prone to asthma even if they have a normal BMI [27].
Table 4.8 Body Measure variables for asthma prediction
Variable | SAS Label | Code or Value | Value Description
BMXBMI | Body Mass Index (kg/m2) | 12.4 to 82.1 | Range of Values
BMXWAIST | Waist Circumference | 38.7 to 176 | Range of Values
4. Physical Activity
Physical activity and exercise play a vital role in a healthy life. Many studies have shown
that people with higher physical activity are less likely to have asthma [28]. However, in
certain conditions narrowing of the airways in the lungs can also be triggered by highly
strenuous physical activity or exercise; such cases of asthma are known as
Exercise-Induced Asthma.
Table 4.9 Physical Activity variables for asthma prediction
Variable | SAS Label | Code or Value | Value Description
PAQ605 | Vigorous work activity | 1 | Yes
 | | 2 | No
PAQ620 | Moderate work activity | 1 | Yes
 | | 2 | No
PAQ625 | Number of days moderate work | 1 to 7 | Range of Values
PAQ635 | Walk or bicycle | 1 | Yes
 | | 2 | No
PAQ640 | Number of days walk or bicycle | 1 to 7 | Range of Values
PAQ650 | Vigorous recreational activities | 1 | Yes
 | | 2 | No
PAQ665 | Moderate recreational activities | 1 | Yes
 | | 2 | No
PAQ670 | Days moderate recreational activities | 1 to 7 | Range of Values
PAD680 | Minutes sedentary activity | 0 to 1380 | Range of Values
PAQ710 | Hours watch TV or videos past 30 days | 0 | Less than 1 hour
 | | 1 | 1 hour
 | | 2 | 2 hours
 | | 3 | 3 hours
 | | 4 | 4 hours
 | | 5 | 5 hours
 | | 8 | Do not watch TV or Videos
PAQ715 | Hours use computer past 30 days | 0 | Less than 1 hour
 | | 1 | 1 hour
 | | 2 | 2 hours
 | | 3 | 3 hours
 | | 4 | 4 hours
 | | 5 | 5 hours
 | | 8 | Do not watch TV or Videos
5. Alcohol Use
Excessive alcohol intake has long been known to impair the lungs. According to
Joseph H. Sisson, “Brief exposure to mild concentrations of alcohol may enhance
mucociliary clearance, stimulates bronchodilation and probably attenuates the airway
inflammation and injury observed in asthma” [29].
Table 4.10 Alcohol use variables for asthma prediction
Variable | SAS Label | Code or Value | Value Description
ALQ101 | Had at least 12 alcohol drinks/1 yr? | 1 | Yes
 | | 2 | No
ALQ120Q | How often drink alcohol over past 12 mos | 0 to 350 | Range of Values
ALQ130 | Avg # alcoholic drinks/day - past 12 mos | 1 to 82 | Range of Values
ALQ151 | Ever have 4/5 or more drinks every day? | 1 | Yes
 | | 2 | No
6. Smoking – Cigarette use
Smoking is a common risk factor for asthma. It is divided into two categories: active
smoking and passive smoking. It has been observed that passive smoking can trigger
symptoms of asthma in individuals suffering from the disease. Many researchers have
suggested that disease control is poorer in patients who smoke compared to non-smoking
asthmatic patients [30].
Table 4.11 Smoking related variables for asthma prediction
Variable | SAS Label | Code or Value | Value Description
SMQ020 | Smoked at least 100 cigarettes in life | 1 | Yes
 | | 2 | No
SMD030 | Age started smoking cigarettes regularly | 6 to 72 | Range of Values
SMQ040 | Do you now smoke cigarettes | 1 | Every day
 | | 2 | Some days
 | | 3 | Not at all
SMQ050Q | How long since quit smoking cigarettes | 1 to 193 | Range of Values
SMD055 | Age last smoked cigarettes regularly | 13 to 78 | Range of Values
SMD057 | # cigarettes smoked per day when quit | 2 to 90 | Range of Values
SMD641 | # days smoked cigs during past 30 days | 0 to 30 | Range of Values
SMD650 | Avg # cigarettes/day during past 30 days | 1 to 94 | Range of Values
 | | 95 | 95 or more
SMD410 | Does anyone smoke inside home? | 1 | Yes
 | | 2 | No
7. Environmental Factors
Home environment and work environment play an important role in the prevalence of
Asthma. Mineral dust (dust from sand, coal and soil), organic dust (dust from flour,
cotton, animals and plants) and exhaust fumes from engines, machinery, trucks and buses
are found to be major causes of asthma in adults. Environmental factors not only
increase the chances of asthma but also obstruct the disease control process for
patients with the disease.
Table 4.12 Environmental Factors for asthma prediction
Environmental Factors
Variable   Description                                 Values
OCQ510     Ever had work exposure to mineral dusts?    1 = Yes; 2 = No
OCQ520     # of years exposed to mineral dusts         0 to 65 (Range of Values)
OCQ530     Ever had work exposure to organic dusts?    1 = Yes; 2 = No
OCQ550     Ever exposed to exhaust fumes at work?      1 = Yes; 2 = No
OCQ570     Ever had work exposure to other fumes?      1 = Yes; 2 = No
OCQ580     # of years exposed to other fumes           0 to 65 (Range of Values)
8. Others
Many psychological and genetic factors are recognized to influence the onset of asthma.
Studies showed that people with asthma feel lonely more often than other people [31].
Burke W. et al. suggested that the risk of asthma is increased if a positive family
history is found. MCQ010, i.e. “Ever been told you have asthma”, is the class variable we
used to perform supervised learning.
Table 4.13 Other variables for asthma prediction
Others
Variable   Description                            Values
MCQ300B    Close Relative had asthma              1 = Yes; 2 = No
DPQ020     Feeling Down, depressed or hopeless    0 = Not at all; 1 = Several Days;
                                                  2 = More than half the days; 3 = Nearly everyday
MCQ010     Ever been told you have asthma         1 = Yes; 2 = No
4.2.3 Diabetes Risk Factors
To build models that predict the extent of diabetes, we used data consisting of 33
attributes. The attributes are divided into 8 major categories: Demographics, Blood
Pressure, Body Measures, Physical Activity, Smoking – Cigarette use, Alcohol use,
Laboratory and Others.
1. Demographics
Many studies showed that the risk of diabetes increases as a person gets older, especially
after 45 years of age. According to the American Diabetes Association, the risk of diabetes
in African Americans, Mexican Americans, American Indians and Asian Americans is
very high because these populations are more likely to have high blood pressure and high
BMI. In 2012, a CDC survey estimated 86 million cases of prediabetes among people aged
20 years or older.
Table 4.14 Demographics variables for diabetes prediction
Demographics
Variable   Description                         Values
RIAGENDR   Gender                              1 = Male; 2 = Female
RIDAGEYR   Age in years at screening           0 to 79 (Range of Values);
                                               80 = 80 years of age and over
RIDRETH3   Race/Hispanic origin w/ NH Asian    1 = Mexican American; 2 = Other Hispanic;
                                               3 = Non-Hispanic White; 4 = Non-Hispanic Black;
                                               6 = Non-Hispanic Asian;
                                               7 = Other Race - Including Multi-Racial
2. Blood Pressure
Hypertension is one of the major factors that can worsen the complications of diabetes.
Most people with diabetes are also diagnosed with high blood pressure [32].
Table 4.15 Blood Pressure variables for diabetes prediction
Blood Pressure
Variable   Description (mm Hg)                Values
BPXSY1     Systolic: Blood pres (1st rdg)     74 to 238 (Range of Values)
BPXDI1     Diastolic: Blood pres (1st rdg)
BPXSY2     Systolic: Blood pres (2nd rdg)
BPXDI2     Diastolic: Blood pres (2nd rdg)
BPXSY3     Systolic: Blood pres (3rd rdg)
BPXDI3     Diastolic: Blood pres (3rd rdg)
BPXSY4     Systolic: Blood pres (4th rdg)
BPXDI4     Diastolic: Blood pres (4th rdg)
3. Body Measures
Many researchers found that the risk of diabetes increases with BMI [32]; overweight
people are more likely to have diabetes than their counterparts.
Table 4.16 Body measure variables for diabetes prediction
Body Measures
Variable   Description                 Values
BMXBMI     Body Mass Index (kg/m2)     12.4 to 82.1 (Range of Values)
BMXWAIST   Waist Circumference         38.7 to 176 (Range of Values)
4. Physical Activity
Physical activity helps in controlling blood glucose, HDL cholesterol, blood pressure and
triglycerides, resulting in a lower risk of diabetes. Many other risk factors, such as BMI
and waist circumference, are directly related to physical activity, which makes it one of
the major risk factors involved in many chronic diseases.
Table 4.17 Physical Activity variables for diabetes prediction
Physical Activity
Variable   Description                                Values
PAQ605     Vigorous work activity                     1 = Yes; 2 = No
PAQ620     Moderate work activity                     1 = Yes; 2 = No
PAQ625     Number of days moderate work               1 to 7 (Range of Values)
PAQ635     Walk or bicycle                            1 = Yes; 2 = No
           Vigorous recreational                      1 = Yes
           Moderate recreational                      1 = Yes
PAD680     Minutes sedentary activity                 0 to 1380 (Range of Values)
PAQ710     Hours watch TV or videos past 30 days      0 = Less than 1 hour; 1 = 1 hour; 2 = 2 hours;
                                                      3 = 3 hours; 4 = 4 hours; 5 = 5 hours;
                                                      8 = Do not watch TV or Videos
PAQ715     Hours use computer past 30 days            0 = Less than 1 hour; 1 = 1 hour; 2 = 2 hours;
                                                      3 = 3 hours; 4 = 4 hours; 5 = 5 hours;
                                                      8 = Do not watch TV or Videos
5. Smoking – Cigarette Use
Research conducted by Julie C. Will et al. [33] shows that the diabetes rate increases
with the amount of smoking. It shows that men who smoked had a 45% higher diabetes rate
than men who had never smoked, which makes smoking an important indicator of diabetes.
Table 4.18 Smoking related variables for diabetes prediction
Smoking – Cigarette use
Variable   Description                                 Values
SMQ020     Smoked at least 100 cigarettes in life      1 = Yes; 2 = No
SMD030     Age started smoking cigarettes regularly    6 to 72 (Range of Values)
SMQ040     Do you now smoke cigarettes                 1 = Every day; 2 = Some days; 3 = Not at all;
                                                       95 = 95 or more
SMD410     Does anyone smoke inside home?              1 = Yes; 2 = No
6. Alcohol Use
Alcohol consumption has become an important risk factor for diabetes. Many researchers
have found that moderate intake of alcohol is associated with a reduced risk of diabetes
[34], whereas heavy intake of alcohol increases the risk.
Table 4.19 Alcohol use variables for diabetes prediction
Alcohol Consumption
Variable   Description                                 Values
ALQ101     Had at least 12 alcohol drinks/1 yr?        1 = Yes; 2 = No
ALQ120Q    How often drink alcohol over past 12 mos    0 to 350 (Range of Values)
ALQ130     Avg # alcoholic drinks/day - past 12 mos    1 to 82 (Range of Values)
ALQ141Q    # days have 4/5 drinks - past 12 mos        0 to 220 (Range of Values)
ALQ151     Ever have 4/5 or more drinks every day?     1 = Yes; 2 = No
7. Laboratory
Table 4.20 lists the clinical variables that can be considered important factors in
measuring the risk of diabetes.
Table 4.20 Laboratory variables for diabetes prediction
Laboratory
Variable   Description                       Values
LBDHDD     Direct HDL-Cholesterol (mg/dL)    14 to 175 (Range of Values)
LBXTR      Triglyceride (mg/dL)              18 to 1562 (Range of Values)
URXUMS     Albumin, urine (mg/L)             0.21 to 14800 (Range of Values)
LBXGLU     Fasting Glucose (mg/dL)           39 to 382 (Range of Values)
LBXIN      Insulin (uU/mL)                   0.14 to 647.5 (Range of Values)
8. Others
Prediabetes, a condition in which the blood sugar level is higher than normal but not yet
in the diabetes range, is an important indicator of diabetes. Genetics also plays an
important role in developing diabetes. According to the American Diabetes Association,
people with a family history of the disease have a higher chance of developing diabetes
than other people. DIQ010, i.e. ‘ever been told you have diabetes’, is the class variable
we selected to perform supervised learning.
Table 4.21 Other variables for diabetes prediction
Others
Variable   Description                        Values
MCQ300C    Close Relative had diabetes        1 = Yes; 2 = No
DIQ160     Ever told you have prediabetes     1 = Yes; 2 = No
DIQ010     Ever been told you have diabetes   1 = Yes; 2 = No; 3 = Borderline
4.3 Data Pre-Processing
Real-world data is often inconsistent and incomplete, and is likely to contain errors.
In this section we discuss the processing steps taken to prepare the data for more
accurate modeling. Initially, 9756 records, including infants, children and adolescents,
were considered from NHANES; each observation includes values for the variables
selected in the previous stage. Since the main purpose of the research is to develop a
system for adults, we selected the records of individuals aged 18 or above; this left us
with 5864 records. On further analysis we found that the data set contained far more
observations with the class value ‘No’ (MCQ010 in the case of asthma and DIQ010 in the
case of diabetes) than with any other class value. To avoid overfitting to this majority
class, we deleted records with class label ‘No’ that had too many missing values.
The number of instances used for training and testing the prediction models for both
Asthma and Diabetes is given below:
Asthma:
The training set comprises 1951 instances, including 1135 observations with class label
‘No’ and 816 observations with class label ‘Yes’, whereas the testing set comprises
200 instances, including 143 observations with class label ‘No’ and 57 observations with
class label ‘Yes’.
Diabetes:
The training set comprises 1525 instances, including 780 observations with class label
‘No’, 111 observations with class label ‘Borderline’ and 634 observations with class label
‘Yes’, whereas the testing set comprises 300 instances, including 220 observations with
class label ‘No’, 8 observations with class label ‘Borderline’ and 72 observations with
class label ‘Yes’.
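The age filter and the class counts described in this section can be sketched in a few lines of Python. This is only an illustration and not the code used in the thesis: the toy `records` list and the helper names `select_adults` and `class_distribution` are hypothetical, and only the variable names RIDAGEYR and MCQ010 and the age threshold of 18 are taken from the text.

```python
from collections import Counter

ADULT_AGE = 18  # age threshold stated in the text

def select_adults(records, age_key="RIDAGEYR"):
    """Keep only records whose age at screening is 18 or above."""
    return [r for r in records
            if r.get(age_key) is not None and r[age_key] >= ADULT_AGE]

def class_distribution(records, class_key):
    """Count how many records carry each class label."""
    return Counter(r[class_key] for r in records)

# Hypothetical toy records for illustration; these are not NHANES data.
records = [
    {"RIDAGEYR": 12, "MCQ010": "No"},
    {"RIDAGEYR": 18, "MCQ010": "Yes"},
    {"RIDAGEYR": 45, "MCQ010": "No"},
    {"RIDAGEYR": 67, "MCQ010": "No"},
]
adults = select_adults(records)
distribution = class_distribution(adults, "MCQ010")
```

Inspecting `distribution` before and after removing records is a simple way to see how strongly the ‘No’ class dominates a data set.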
4.4 Data Transformation:
The next step is to convert the processed data into a format that can be used by Weka, the
data mining tool we used to build the prediction models. This step transforms the data
into the Attribute-Relation File Format (.arff), which represents a dataset as a
relation made up of attributes, or columns of data [35].
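As an illustration of the target format, the sketch below writes a small ARFF text with a relation name, attribute declarations and a data section. It is a minimal example under stated assumptions: the helper `to_arff`, the relation name `asthma` and the two sample rows are hypothetical, and only three of the thesis variables are shown. Missing values are written as `?`, which is the ARFF convention.

```python
def to_arff(relation, attributes, rows):
    """Render a dataset as ARFF text.

    attributes: list of (name, kind) pairs, where kind is the string
                "numeric" or a list of nominal labels.
    rows:       list of value lists; None marks a missing value.
    """
    lines = ["@relation %s" % relation, ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append("@attribute %s %s" % (name, kind))
        else:
            lines.append("@attribute %s {%s}" % (name, ",".join(kind)))
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"

# Hypothetical example: age, smoking status and the asthma class variable.
attributes = [
    ("RIDAGEYR", "numeric"),
    ("SMQ020", ["1", "2"]),
    ("MCQ010", ["1", "2"]),
]
rows = [[45, 1, 2], [63, None, 1]]
arff_text = to_arff("asthma", attributes, rows)
```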
5 SYSTEM ARCHITECTURE AND IMPLEMENTATION
In this chapter, we discuss the overall architecture and implementation of the system
proposed in this thesis. We describe the complete process of converting the input provided
by the user into a predicted numerical value in the form of ‘Likeliness of having Asthma’
and ‘Extent of Diabetes’. The chapter also includes a description of Weka, the data mining
tool used by the system to generate predictions.
5.1 System Architecture
Figures 5.1 and 5.2 illustrate how the proposed system generates predictions for
Asthma and Diabetes. The user is equipped with a device that is capable of collecting
data, which can be a desktop, a laptop or any mobile device. This device collects the
data from the user and sends it to the system over the network; the data is the list of
parameters described in the previous chapter. The system consists of three main blocks:
Input Conversion, Neural Network and a Mathematical Computation block.
The Input Conversion block collects the data from the device and converts it into a format
that can be used by the prediction models. It creates an instance of the data collected
and passes it to the Neural Network block for classification.
Figure 5.1 System Architecture for Asthma Prediction
The Neural Network block comprises the 5 prediction models described in Chapter 3; all
the models are trained on the training set described in Chapter 4. Once the models are
trained, the input instance received from the previous block is passed through each model
individually, and each model classifies the instance.
For Asthma, each model generates a numeric value: 1 when the instance is classified as
‘No’, meaning ‘not likely to have asthma’, and 2 when the instance is classified as ‘Yes’,
meaning ‘likely to have asthma’.
For Diabetes, each model generates a numeric value: 1 when the instance is classified as
‘No’, meaning ‘not likely to have diabetes’, 2 when the instance is classified as
‘Borderline’, meaning ‘likely to have borderline diabetes’, and 3 when the instance is
classified as ‘Yes’, meaning ‘likely to have diabetes’.
Figure 5.2 System Architecture for Diabetes Prediction
The Mathematical Computation block comprises two main components. The first component
calculates the mean of the outcomes obtained from all 5 models. For Asthma only two
values are possible, 1 or 2, so the mean can range from 1 to 2; similarly, for Diabetes
only three values are possible, 1, 2 or 3, so the mean can range from 1 to 3. The second
component converts the mean obtained from the first component to the required scale: for
Asthma it converts the mean from the scale 1-2 to 0-1, and for Diabetes it converts the
mean from the scale 1-3 to 0-2. It then multiplies the rescaled mean by 100 to express
the likeliness of having asthma and the extent of diabetes as a percentage.
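The computation performed by this block can be sketched as follows. This is a literal reading of the description above rather than the thesis's actual code; the function name `risk_score` and the sample model outputs are hypothetical. Note that, read literally, the diabetes score spans 0 to 200 because the rescaled mean ranges from 0 to 2.

```python
def risk_score(model_outputs, lowest_code=1):
    """Average the class codes from all models, shift the scale so the
    lowest code maps to 0, and multiply by 100.

    Written as 100 * (sum - n * lowest) / n, which is algebraically the
    same as (mean - lowest) * 100 but avoids floating-point noise.
    """
    n = len(model_outputs)
    return 100 * (sum(model_outputs) - n * lowest_code) / n

# Hypothetical outputs of the five models for one user.
asthma_score = risk_score([2, 2, 1, 2, 1])    # codes: 1 = No, 2 = Yes
diabetes_score = risk_score([3, 3, 2, 3, 1])  # codes: 1 = No, 2 = Borderline, 3 = Yes
```

With these inputs the asthma mean is 1.6, which rescales to 0.6 and gives a score of 60; the diabetes mean is 2.4, which rescales to 1.4 and gives a score of 140.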
5.2 Implementation
To better manage Asthma and Diabetes, the proposed system should be usable from
anywhere; hence we developed a web-based application of the system proposed in the
previous section. The application is built with JSF (JavaServer Faces), a Java
specification for building component-based user interfaces for web applications. To store
the data needed to track health over a period of time, we used a MySQL database. To use
the system, the user logs in with the email ID and password created at the time of
registration. Each time a user calculates a score, all the details are stored in the
database, from which they can be retrieved whenever needed to provide better health care
over time.
To build the models described previously, we used Weka, a Java-based data mining tool.
The models are trained only when a score is calculated for the first time; afterwards the
trained models are reused, which makes the predictions faster.
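The train-once, reuse-later behaviour described above can be illustrated with a small caching wrapper. This is a sketch of the general pattern only, not the thesis's Java implementation: the class `ModelCache` and the stand-in `train_models` function are hypothetical.

```python
class ModelCache:
    """Train the models on first use and reuse them afterwards."""

    def __init__(self, train_fn):
        self._train_fn = train_fn  # function that trains and returns the models
        self._models = None        # filled in on the first call to get()

    def get(self):
        if self._models is None:
            self._models = self._train_fn()
        return self._models

train_calls = 0

def train_models():
    """Stand-in for the expensive training step (hypothetical)."""
    global train_calls
    train_calls += 1
    return ["NaiveBayes", "BayesNet", "MLP", "Logistic", "J48"]

cache = ModelCache(train_models)
first = cache.get()   # triggers training
second = cache.get()  # reuses the trained models
```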
5.2.1 Weka
Weka is a workbench for machine learning algorithms [36] written in Java. It supports
data pre-processing, regression, classification, clustering, visualization and association
rules. It is open source software issued under the GNU General Public License. It offers
three modes of operation: GUI, command line and Java API.
Figures 5.3 to 5.5 show screenshots of the GUI mode:
Figure 5.3 Weka Preprocessing stage
Figure 5.4 Weka classification stage
Figure 5.5 ROC curve generated in weka
In this system we used the Java API provided by Weka, a collection of classes and
interfaces for incorporating machine learning into Java code. It provides each prediction
model in the form of a class that can be integrated into Java code in an object-oriented
manner. Each class offers methods to perform different operations on a model, such as
building the classifier and classifying an instance.
5.3 Screenshots
This section consists of screenshots of the proposed system.
Figure 5.6 Login Page
Figure 5.7 Home Page
Figure 5.8 Asthma Calculator Page1
Figure 5.12 Results Generated for Asthma
Figure 5.13 View Asthma History Page
Figure 5.14 Asthma Record Page1
Figure 5.15 Asthma Record Page2
Figure 5.16 Asthma Dashboard
6 RESULTS AND ANALYSIS
This chapter presents the results obtained from Weka for all five models. The results of
each model are recorded and analyzed to compare the predictive power of the 5 competing
models according to three measures: Accuracy, Root Mean Squared Error (RMSE) and Area
under ROC. Accuracy is the percentage of instances that are correctly classified. RMSE is
the square root of the average of the squared errors, i.e. the differences between the
actual class and the predicted class. ROC, receiver operating characteristic, is a
graphical representation of the performance of a classifier; it plots the true positive
rate against the false positive rate. The area under the ROC curve ranges from 0 to 1: an
area of 1 indicates a perfect test, while an area of 0.5 indicates a test that is no
better than random guessing. The analysis also includes the confusion matrix, which is a
table layout for visualizing the performance of a model. In a typical confusion matrix
each column represents the number of instances in a predicted class and each row
represents the number of instances in an actual class. In predictive analytics, a
confusion matrix reports the total number of true positives, false positives, false
negatives and true negatives.
                        Predicted Class
                        Positive            Negative
Actual     Positive     True Positives      False Negatives
Class      Negative     False Positives     True Negatives
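The accuracy and RMSE measures defined above can be written compactly as follows. The symbols are introduced here for illustration and do not appear in the original text: $TP$, $TN$, $FP$ and $FN$ are the four cells of the confusion matrix, $N$ is the number of test instances, $y_i$ is the actual class value of instance $i$ and $\hat{y}_i$ is the predicted value. This follows the definition given in the text; the exact computation performed by a particular tool may differ in detail.

```latex
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\%
\qquad
\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}
```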
To better understand the terminology, consider a scenario where a test is conducted that
screens people for asthma. Each person either has asthma or does not have asthma. The
test result can be either positive (meaning the person has asthma) or negative
(meaning the person does not have asthma).
In this case, True Positive means a person with asthma is correctly identified as having
asthma; False Positive means a person without asthma is incorrectly identified as having
asthma; True Negative means a person without asthma is correctly identified as not
having asthma; and False Negative means a person with asthma is incorrectly identified as
not having asthma.
The true positive rate is also known as Sensitivity and the true negative rate is also
known as Specificity.
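These quantities follow directly from the four cells of a confusion matrix. The sketch below is illustrative only; the helper name `binary_metrics` is hypothetical, and the numbers are the J48 asthma results from Table 6.6, with ‘Yes’ treated as the positive class.

```python
def binary_metrics(tp, fn, fp, tn):
    """Compute accuracy, sensitivity and specificity from the four
    cells of a two-class confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

# J48 on the asthma test set (Table 6.6), positive class = 'Yes':
# actual Yes -> 31 predicted No, 26 predicted Yes
# actual No  -> 126 predicted No, 17 predicted Yes
j48_asthma = binary_metrics(tp=26, fn=31, fp=17, tn=126)
```

The accuracy works out to 152/200, i.e. 76%, which matches the value reported for J48, while the sensitivity of 26/57 shows that fewer than half of the asthma cases in the test set are detected.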
6.1 Analysis of Results obtained for Asthma
Table 6.1 shows an overview of the results obtained from all 5 models when the models
are trained on 1951 instances. The J48 decision tree achieved the highest accuracy.
However, the Multilayer Perceptron and Logistic Regression models performed marginally
better in terms of Area under ROC.
Table 6.1 Asthma Results in terms of Accuracy, RMSE and ROC area for all the 5 models
Measure                           Bayes Network   Naïve Bayes   Multilayer Perceptron   Logistic Regression   J48
Total Number of Instances         200             200           200                     200                   200
Correctly Classified Instances    144             139           141                     149                   152
Incorrectly Classified Instances  56              61            59                      51                    48
Accuracy                          72.00%          69.50%        70.50%                  74.50%                76%
Root mean squared error           0.4451          0.4685        0.5099                  0.4219                0.4443
ROC area                          0.68            0.664         0.707                   0.711                 0.665
6.1.1 Confusion Matrix:
This section presents the confusion matrices obtained from Weka for all 5 models when
the 200 test instances are evaluated against the trained models. Each matrix gives the
number of true positives, false positives, true negatives and false negatives.
• Naïve Bayes
Table 6.2 Confusion Matrix of Naive Bayes Classifier for Asthma Prediction
Naïve Bayes     Predicted No   Predicted Yes
Actual No       116            27
Actual Yes      34             23
• Bayesian Network
Table 6.3 Confusion Matrix of BayesNet Classifier for Asthma Prediction
BayesNet        Predicted No   Predicted Yes
Actual No       120            23
Actual Yes      33             24
• Multilayer perceptron model
Table 6.4 Confusion Matrix of MLP Classifier for Asthma Prediction
MLP             Predicted No   Predicted Yes
Actual No       112            31
Actual Yes      28             29
• Logistic
Table 6.5 Confusion Matrix of Logistic Classifier for Asthma Prediction
Logistic        Predicted No   Predicted Yes
Actual No       129            14
Actual Yes      37             20
• J48 Decision tree
Table 6.6 Confusion Matrix of J48 decision tree Classifier for Asthma Prediction
J48             Predicted No   Predicted Yes
Actual No       126            17
Actual Yes      31             26
6.1.2 ROC curves
Table 6.7 ROC curves obtained from WEKA for Asthma
[ROC curve plots for the classes No and Yes, one row per model: Naïve Bayes, Bayesian Network, Multilayer Perceptron, Logistic, J48]
6.2 Diabetes
Table 6.8 shows an overview of the results obtained from all 5 models when the models
are trained on 1525 instances. The Multilayer Perceptron model achieved the highest
accuracy and the lowest root mean squared error. At the class level, Naïve Bayes and
Multilayer Perceptron have nearly the same Area under ROC.
Table 6.8 Diabetes Results in terms of Accuracy, RMSE and ROC area for all the 5 models
Measure                           Bayes Network   Naïve Bayes   Multilayer Perceptron   Logistic Regression   J48
Total Number of Instances         300             300           300                     300                   300
Correctly Classified Instances    244             266           282                     257                   235
Incorrectly Classified Instances  56              34            18                      43                    65
Accuracy                          81.30%          88.60%        94.00%                  85.60%                78%
Root mean squared error           0.3008          0.2372        0.1801                  0.2803                0.3354
ROC area                          0.872           0.985         0.993                   0.899                 0.826
6.2.1 Confusion Matrix
This section presents the confusion matrices obtained from Weka for all 5 models when
the 300 test instances are evaluated against the trained models. Each matrix gives the
number of correct and incorrect predictions for every class.
• Naïve Bayes
Table 6.9 Confusion Matrix of Naive Bayes Classifier for Diabetes Prediction
Naïve Bayes             Predicted No   Predicted Borderline   Predicted Yes
Actual No               217            0                      3
Actual Borderline       4              2                      2
Actual Yes              13             12                     47
• Bayesian Network
Table 6.10 Confusion Matrix of BayesNet Classifier for Diabetes Prediction
Bayesian Network        Predicted No   Predicted Borderline   Predicted Yes
Actual No               189            0                      31
Actual Borderline       5              0                      3
Actual Yes              17             0                      55
• Multilayer Perceptron Model
Table 6.11 Confusion Matrix of MLP Classifier for Diabetes Prediction
Multilayer Perceptron   Predicted No   Predicted Borderline   Predicted Yes
Actual No               220            0                      0
Actual Borderline       0              0                      8
Actual Yes              1              9                      62
• Logistic
Table 6.12 Confusion Matrix of Logistic Classifier for Diabetes Prediction
Logistic                Predicted No   Predicted Borderline   Predicted Yes
Actual No               201            0                      19
Actual Borderline       5              0                      3
Actual Yes              16             0                      56
• J48
Table 6.13 Confusion Matrix of J48 Classifier for Diabetes Prediction
J48                     Predicted No   Predicted Borderline   Predicted Yes
Actual No               179            4                      37
Actual Borderline       4              0                      4
Actual Yes              15             1                      56
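For a three-class problem, the same matrices can be summarized by the fraction of each actual class that is correctly recovered. The sketch below is illustrative and uses a hypothetical helper, `per_class_recall`, applied to the Multilayer Perceptron matrix from Table 6.11 (rows are actual classes, columns are predicted classes, in the order No, Borderline, Yes).

```python
def per_class_recall(matrix, labels):
    """For each actual class (row), return the share of its instances
    that were predicted as that same class (diagonal cell / row sum)."""
    recall = {}
    for i, (label, row) in enumerate(zip(labels, matrix)):
        row_total = sum(row)
        recall[label] = row[i] / row_total if row_total else 0.0
    return recall

labels = ["No", "Borderline", "Yes"]
# Multilayer Perceptron on the diabetes test set (Table 6.11).
mlp_matrix = [
    [220, 0, 0],  # actual No
    [0, 0, 8],    # actual Borderline
    [1, 9, 62],   # actual Yes
]
mlp_recall = per_class_recall(mlp_matrix, labels)
mlp_accuracy = sum(mlp_matrix[i][i] for i in range(3)) / sum(map(sum, mlp_matrix))
```

The overall accuracy is 282/300, i.e. 94%, yet none of the 8 Borderline instances is classified as Borderline, which the overall accuracy figure alone does not reveal.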
6.2.2 ROC curves:
Table 6.14 ROC curves for Diabetes Prediction
[ROC curve plots for the classes No, Borderline and Yes, one row per model: Naïve Bayes, Bayesian Network, Multilayer Perceptron, Logistic, J48]
As the above results for Asthma and Diabetes prediction show, the models rank
differently under different performance measures: some models provide better accuracy but
less Area under ROC, while others achieve more Area under ROC but a higher root mean
squared error. Combining all 5 models, as described in Chapter 5, is therefore intended
to offset the weaknesses of one model with the strengths of the others.
7 CONCLUSION AND FUTURE WORK
7.1 Conclusion
In this thesis, we have discussed the design and implementation of a predictive analytics
based system to predict the likeliness of having asthma and the extent of diabetes in an
individual. To give better results and build a more robust system, we used 5 machine
learning models to generate the predictions, which helps offset the disadvantages of
one model with the strengths of the others. The system successfully generates
predictions based on the data provided by the user. For asthma, the questionnaire consists
of a set of 40 questions, and for diabetes, the questionnaire consists of a set of 33
questions ranging from demographics to laboratory details. Of the two, the model for
predicting the extent of diabetes (accuracy 84%) performed better than the model for
predicting the likeliness of having asthma (accuracy 76%). The reason is that clinical
data such as Albumin, Fasting Glucose and Triglycerides were included in the models for
Diabetes but not in the models for Asthma, because the asthma records with ‘Yes’ as the
class label had too many missing values for these variables. At the class level, the
models for Diabetes also achieved a better Area under ROC. The system developed in this
study can be used to develop individual or clinical decision support systems to improve
the management of the chronic diseases Asthma and Diabetes. With the web-based
implementation of the proposed system, the user can use the system regardless of
geographical location. The user can also view previously generated results and feedback
to better manage his/her health.
7.2 Future Work
With the growing number of health management strategies, the research presented here
can be extended in a variety of directions. Some suggested extensions include:
• For Diabetes prediction, there were very few records with Borderline Diabetes;
hence one direction for future work is to add more records with Borderline
Diabetes, which will help the learning algorithms handle Borderline cases better.
• For Asthma prediction, clinical data can be included for training purposes. This is
expected to improve the overall accuracy of the system, because clinical data
have a significant effect on the predictions.
• The system can be extended to build models for other chronic conditions such as
CKD, COPD and Heart Diseases.
REFERENCES
[1] World Health Organization, "Preventing Chronic Diseases - a vital investment," WHO
press, 2005.
[2] E. P. a. A. R. Eleni Chatzimichail, "An evolutionary two objective genetic
algorithm for asthma prediction," in UKSim 15th International Conference on
Computer Modelling and Simulation, 2013.
[3] G. B. a. Y. H. Yang Guo, "Using Bayes Network for Prediction of Type-2 Diabetes,"
2012.
[4] EPA, "Asthma Facts," 2013.
[5] K. L. K. S. S. S. Eaton DK, "Youth risk behavior surveillance," 2009.
[6] P. G. a. K. Kapur, "Artificial Neural Network-enabled Prognostics for Patient Health
Management," IEEE, pp. 1-8, 2012.
[7] A. S. V. a. M. S. Komathy Karuppanan, "COPD Prognosis under Biologically
Inspired Neural Network," in International Conference on Advances in Computing
and Communications, 2012.
[8] N. V. C. N. A. C. a. A. L. B. Darcy A. Davis, "Time to CARE: a collaborative
engine," Data Min Knowl Disc, pp. 388-415, 2009.
[9] E. P. a. A. R. Eleni Chatzimichail, "An evolutionary two-objective genetic algorithm
for asthma prediction," in 15th International Conference on Computer Modelling
and Simulation, 2013.
[10] D. T. K. a. G. R. Dr.B.Eswara Reddy, "An Efficient Cloud Framework for Health
Care Monitoring System," in International Symposium on Cloud and Services
Computing, 2012.
[11] B. R. P. a. S. Agarwal, "Modeling Risk Prediction of Diabetes - A Preventive
Measure," IEEE, pp. 1-6, 2014.
[12] P. C. a. H. S. Gerd Assman, "Simple Scoring Scheme for Calculating the Risk of
Acute Coronary Events Based on the 10 year Follow-up of the Prospective
Cardiovascular Munster (PROCAM) study," Circulation, pp. 1-8, 2012.
[13] J. G. H. Z. T. J. R. B. a. Y. S. X. Xie, "Neural-network based structural health
monitoring with wireless sensor networks," in Natural Computation (ICNC),
Shenyang, 2013.
[14] F. Beaule, "Systems and Methods For Determining and Managing an Individual and
Portable Health Score," 2012.
[15] N. C. S. U. a. R. David A. Dickey, "Introduction to Predictive Modeling with
Examples," SAS Global Forum, 2012.
[16] I. H. W. a. E. Frank, Data Mining: Practical Machine Learning Tools and
Techniques, Diane Cerra, 2005.
[17] M. E. U. F. e. a. Soumen Chakrabarti, "Data Mining Curriculum: A Proposal," 2006.
[18] K. P. Murphy, Machine Learning: A Probabilistic Perspective, 2012.
[19] I. Rish, "An empirical study of the naive bayes classifier," 2001.
[20] S. S. e. a. Ashraf Uddin, "Presentation on Naive Bayes Classification," 2012.
[21] M. E. Kragt, "A beginners guide to Bayesian Network Modelling for integrated
catchment management," 2009.
[22] D. G. a. M. G. Nir Friedman, "Bayesian Network Classifiers," Kluwer Academic
Publishers, 1997.
[23] D. W. H. a. S. Lemeshow, Applied Logistic Regression, 2000.
[24] Q. L. K. R. X. T. Dongsheng Che, "Decision Tree and Ensemble Learning
Algorithms with Their Applications in Bioinformatics," in Software Tools and
Algorithms for Biological Systems, 2011.
[25] G. L. D. E. S. L. S. R. T. a. L. W. W. A. A. Arif, "Prevalence and risk factors of
asthma and wheezing among US adults," European Respiratory Journal, 2003.
[26] C. M. A. G. P. E. M. V. D. M. K. E. P. A. P. K. Papoutsakis, "Associations between
central obesity and asthma in children and adolescents: a case-control study," 2014.
[27] M. L. P. L. H. R. R. J. D. F. G. R. M. L. B. C. A. C. P. R. J Von Behren, "Obesity,
waist size and prevalence of current asthma in the California Teachers study cohort,"
Thorax, 2009.
[28] M. M. J. M. T. D. C. T. M. H. P. Marianne Eijkemans, "Physical Activity and
Asthma: A Systematic Review and Meta-Analysis," Plos One, 2012.
[29] J. H. Sisson, "Alcohol and Airways Function in Health and Disease," 2007.
[30] P. A. H.-T. P. B. C. G. P. B. C. R. M. H. M. a. T. H. S. P. Megan Stapleton,
"Smoking and Asthma," vol. 24, pp. 313-322, 2011.
[31] A. B. L. Y. M. B. a. D. N. Roberto Forero, "Asthma, Health Behaviors, Social
Adjustment, and Psychosomatic Symptoms in Adolescence," vol. 33, pp. 157-164,
1996.
[32] R. H. C. a. S. G. H E Bays, "The relationship of body mass index to diabetes
mellitus, hypertension and dyslipidaemia: comparison of data from two national
surveys," pp. 737-747, 2007.
[33] D. A. G. E. S. F. A. M. a. E. E. C. Julie C Will, "Cigarette smoking and diabetes
mellitus: evidence of a positive association from a large prospective cohort study,"
Oxford, 2000.
[34] M. D. K. S. S. a. K. T. M. Noriyuki Nakanishi, "Alcohol Consumption and Risk for
Development of Impaired Fasting Glucose or Type 2 Diabetes in Middle-Aged
Japanese Men," DOI, vol. 26, pp. 48-54, 2003.
[35] S. R. Garner, "WEKA: The Waikato Environment".
[36] A. D. a. I. H. W. Geoffrey Holmes, "WEKA: A Machine Learning Workbench," in
IEEE, 1994.


PREDICTIVE MODELING FOR CHRONIC CONDITIONS

A Thesis Submitted to the Faculty of

The College of Computer Science and Engineering

in Partial Fulfillment of the requirements for the Degree of

Florida Atlantic University

Boca Raton, Florida

opportunity to work under their guidance throughout my Master’s thesis. Their

helped me to achieve my goals in this study.

patience and time.

Priyanka Ved for giving me all the support.

Author: Ritesh Jain

Title: Predictive Modeling for Chronic Conditions

Institution: Florida Atlantic University

Thesis Advisor: Dr. Ankur Agarwal

Degree: Master of Science

LIST OF TABLES ........................................................................................................ xi

LIST OF FIGURES ..................................................................................................... xiii

1.1 Motivation .........................................................................................................1

1.2 Problem Statement ............................................................................................3

1.3 Contribution ......................................................................................................3

1.4 Organization ......................................................................................................4

2.1 Introduction .......................................................................................................6

2.2 Prognostics for Patient Health Management.......................................................6

2.2.1 System Architecture ...................................................................................8

2.2.2 Artificial Neural Network: ..........................................................................8

2.2.3 Data: ..........................................................................................................9

2.3 COPD Prognosis under Biologically Inspired Neural Network ..........................9

2.3.1 System Architecture: ................................................................................ 10

2.4 Time to CARE................................................................................................. 10

2.4.1 Methodology: ........................................................................................... 11

2.5.1 MLP Pruning by Genetic Algorithm ......................................................... 12

2.6 Cloud Framework for Health Care Monitoring System(CHMS) ....................... 12

2.6.1 Cloud Framework:.................................................................................... 13

2.6.2 System Architecture: ................................................................................ 14

2.7 Modeling Risk Prediction of Diabetes – A Preventive Measure ....................... 15

2.7.1 Methodology ............................................................................................ 15

2.8 Scoring Scheme based on Prospective Cardiovascular Munster

2.8.1 Scoring Method: ....................................................................................... 16

2.9 Other Related Works: ...................................................................................... 17

3 RESEARCH METHODOLOGY ............................................................................ 18

3.1 Introduction ..................................................................................................... 18

3.2 Research Design .............................................................................................. 18

3.3 Predictive Modelling ....................................................................................... 19

3.4 Data Mining .................................................................................................... 20

3.5 Machine Learning............................................................................................ 20

3.5.1 Naïve Bayes ............................................................................................. 21

3.5.2 Bayesian Network .................................................................................... 22

3.5.3 Multilayer Perceptron Model .................................................................... 23

3.5.5 J48 Decision Tree ..................................................................................... 26

4 DATA AND ANALYSIS OF RISK FACTORS ..................................................... 27

4.1 Data Collection ................................................................................................ 27

4.2 Variable Selection ........................................................................................... 30

4.2.1 Initial Set .................................................................................................. 31

4.2.2 Asthma Risk Factors ................................................................................ 34

4.2.3 Diabetes Risk Factors ............................................................................... 41

4.3 Data Pre-Processing......................................................................................... 47

4.4 Data Transformation ........................................................................................ 48

5 SYSTEM ARCHITECTURE AND IMPLEMENTATION ..................................... 49

5.1 System Architecture ........................................................................................ 49

5.2 Implementation ............................................................................................... 52

5.2.1 Weka ........................................................................................................ 52

5.3 Screenshots ..................................................................................................... 54

6 RESULTS AND ANALYSIS ................................................................................. 61

6.1 Analysis of Results obtained for Asthma ......................................................... 62