
PREDICTIVE MODELING FOR CHRONIC CONDITIONS

by

Ritesh Jain

A Thesis Submitted to the Faculty of

The College of Computer Science and Engineering

in Partial Fulfillment of the requirements for the Degree of

Master of Science

Florida Atlantic University

Boca Raton, Florida

May 2015
Copyright 2015 by Ritesh Jain

ACKNOWLEDGEMENTS

It is a pleasure to thank the many people who made this thesis a success. I am indebted to

my advisors Dr. Ankur Agarwal and Dr. Ravi Behara for giving me this wonderful

opportunity to work under their guidance throughout my Master’s thesis. Their

enthusiasm, inspiration and great efforts to explain things clearly and in a simple way

helped me to achieve my goals in this study.

I would like to thank Dr. Vinaya Rao, M. D., Methodist University Hospital Transplant

Institute, Memphis, TN, USA for sharing her expertise and providing valuable guidance

to validate my research.

I would like to thank my committee members Dr. Xingquan Zhu and Dr. Hari Kalva for

their valuable comments, suggestions and input to the thesis. Thanks a lot for your

patience and time.

I would like to thank my parents, Mr. Mahesh Jain and Mrs. Sumitra Jain, for believing in me. My sincere thanks also go to my brother, Mr. Hitesh Jain, and my sister-in-law, Ms. Priyanka Ved, for giving me all their support.

ABSTRACT

Author: Ritesh Jain

Title: Predictive Modeling for Chronic Conditions

Institution: Florida Atlantic University

Thesis Advisor: Dr. Ankur Agarwal

Degree: Master of Science

Year: 2015

Chronic diseases are the major cause of mortality around the world and account for 7 out of 10 deaths each year in the United States. Because of their adverse effect on quality of life, they have become a major global problem, and the health care costs involved in managing them are very high. This thesis focuses on two major chronic diseases, Asthma and Diabetes, which are among the leading causes of mortality around the globe. It involves the design and development of a predictive analytics based decision support system that uses five supervised machine learning algorithms to predict the occurrence of Asthma and Diabetes. The system helps in controlling a disease well in advance by selecting its best indicators and providing necessary feedback. Based on several risk factors such as blood pressure, BMI, age, ethnicity and smoking status, the system predicts the vulnerability of a person to a particular disease, which helps in taking the necessary action to avoid the disease well in advance.

PREDICTIVE MODELING FOR CHRONIC CONDITIONS

LIST OF TABLES
LIST OF FIGURES
1 INTRODUCTION
1.1 Motivation
1.2 Problem Statement
1.3 Contribution
1.4 Organization
2 RELATED WORK
2.1 Introduction
2.2 Prognostics for Patient Health Management
2.2.1 System Architecture
2.2.2 Artificial Neural Network:
2.2.3 Data:
2.3 COPD Prognosis under Biologically Inspired Neural Network
2.3.1 System Architecture:
2.4 Time to CARE
2.4.1 Methodology:
2.5 An Evolutionary two-objective genetic algorithm for asthma prediction
2.5.1 MLP Pruning by Genetic Algorithm
2.6 Cloud Framework for Health Care Monitoring System (CHMS)
2.6.1 Cloud Framework:
2.6.2 System Architecture:
2.7 Modeling Risk Prediction of Diabetes – A Preventive Measure
2.7.1 Methodology
2.8 Scoring Scheme based on Prospective Cardiovascular Munster Study (PROCAM)
2.8.1 Scoring Method:
2.9 Other Related Works:
3 RESEARCH METHODOLOGY
3.1 Introduction
3.2 Research Design
3.3 Predictive Modelling
3.4 Data Mining
3.5 Machine Learning
3.5.1 Naïve Bayes:
3.5.2 Bayesian Network:
3.5.3 Multilayer Perceptron Model
3.5.4 Logistic Regression:
3.5.5 J48 Decision Tree
4 DATA AND ANALYSIS OF RISK FACTORS
4.1 Data Collection:
4.2 Variable Selection
4.2.1 Initial Set
4.2.2 Asthma Risk Factors:
4.2.3 Diabetes Risk Factors
4.3 Data Pre-Processing
4.4 Data Transformation:
5 SYSTEM ARCHITECTURE AND IMPLEMENTATION
5.1 System Architecture
5.2 Implementation
5.2.1 Weka
5.3 Screenshots
6 RESULTS AND ANALYSIS
6.1 Analysis of Results obtained for Asthma
6.1.1 Confusion Matrix:
6.1.2 ROC curves
6.2 Diabetes
6.2.1 Confusion Matrix
6.2.2 ROC curves:
7 CONCLUSION AND FUTURE WORK
7.1 Conclusion
7.2 Future Work
REFERENCES

LIST OF TABLES

Table 4.1 Unweighted response rates for NHANES 2011-2012 survey by Age and Gender
Table 4.2 Variable considered from Demographics
Table 4.3 Variable Considered from Examination
Table 4.4 Variables considered from Laboratory
Table 4.5 Variables considered from Questionnaire
Table 4.6 Demographics variables for asthma prediction
Table 4.7 Blood Pressure variables for asthma prediction
Table 4.8 Body Measure variables for asthma prediction
Table 4.9 Physical Activity variables for asthma prediction
Table 4.10 Alcohol use variable for asthma prediction
Table 4.11 Smoking related variables for asthma prediction
Table 4.12 Environmental Factors for asthma prediction
Table 4.13 Other variables for asthma prediction
Table 4.14 Demographics variables for diabetes prediction
Table 4.15 Blood Pressure variables for diabetes prediction
Table 4.16 Body measure variables for diabetes prediction
Table 4.17 Physical Activity variables for diabetes prediction
Table 4.18 Smoking related variables for diabetes prediction
Table 4.19 Alcohol use variables for diabetes prediction
Table 4.20 Laboratory variables for diabetes prediction
Table 4.21 other variables for diabetes prediction
Table 6.1 Asthma Results in terms of Accuracy, RMSE and ROC area for all the 5 models
Table 6.2 Confusion Matrix of Naive Bayes Classifier for Asthma Prediction
Table 6.3 Confusion Matrix of BayesNet Classifier for Asthma Prediction
Table 6.4 Confusion Matrix of MLP Classifier for Asthma Prediction
Table 6.5 Confusion Matrix of Logistic Classifier for Asthma Prediction
Table 6.6 Confusion Matrix of J48 decision tree Classifier for Asthma Prediction
Table 6.7 ROC curves obtained from WEKA for Asthma
Table 6.8 Diabetes Results in terms of Accuracy, RMSE and ROC area for all the 5 models
Table 6.9 Confusion Matrix of Naive Bayes Classifier for Diabetes Prediction
Table 6.10 Confusion Matrix of BayesNet Classifier for Diabetes Prediction
Table 6.11 Confusion Matrix of MLP Classifier for Diabetes Prediction
Table 6.12 Confusion Matrix of Logistic Classifier for Diabetes Prediction
Table 6.13 Confusion Matrix of J48 Classifier for Diabetes Prediction
Table 6.14 ROC curves for Diabetes Prediction

LIST OF FIGURES

Figure 1.1 Thesis Structure
Figure 2.1 Timeline of Medical Prediction
Figure 2.2 System Architecture of Patient Health Management
Figure 2.3 System Architecture of Biologically inspired techniques
Figure 2.4 Cloud Framework of CHMS
Figure 2.5 Layered Architecture of CHMS
Figure 3.1 Proposed System Design
Figure 3.2 Supervised Learning Models used in the research
Figure 3.3 A Multilayer Perceptron Model with three layers
Figure 4.1 Steps involved in preparing a dataset
Figure 5.1 System Architecture for Asthma Prediction
Figure 5.2 System Architecture for Diabetes Prediction
Figure 5.3 Weka Preprocessing stage
Figure 5.4 Weka classification stage
Figure 5.5 ROC curve generated in weka
Figure 5.6 Login Page
Figure 5.7 Home Page
Figure 5.8 Asthma Calculator Page1
Figure 5.9 Asthma Calculator Page2
Figure 5.10 Asthma Calculator Page3
Figure 5.11 Asthma Calculator Page4
Figure 5.12 Results Generated for Asthma
Figure 5.13 View Asthma History Page
Figure 5.14 Asthma Record Page1
Figure 5.15 Asthma Record Page2
Figure 5.16 Asthma Dashboard

1 INTRODUCTION

1.1 Motivation

A chronic condition is a health condition or disease that is persistent and whose effects are long lasting. It has a major adverse effect on the quality of life of the affected individual. Diabetes, Asthma, Cancer, COPD, CKD and Heart Disease are some of the major chronic conditions the world is facing today. Chronic diseases are the major cause of mortality, accounting for 7 out of 10 deaths in the US. According to the Centers for Disease Control and Prevention (CDC), about half of all adults, i.e. almost 117 million people in the United States, have one or more chronic health conditions. According to a World Health Organization report [1], chronic diseases accounted for 35 million of the 58 million deaths in 2005. They are currently the major cause of death among adults all over the world.

Chronic conditions are critical not only because of their high mortality rate but also because of the costs associated with them. The majority of US economic and health care costs arise from chronic diseases and the associated health risk behaviors. A CDC survey shows that the total costs of heart disease and stroke in 2010 were estimated to be $315.4 billion; the costs of diabetes were estimated to be $245 billion and cancer care costs were estimated to be $157 billion.

As a result of the above factors, there is a need for a system that can help manage chronic conditions in an individual well before their onset. In this research we have developed a predictive analytics based clinical decision support system that can help an individual to better manage chronic conditions. The system investigates the state of being unwell by focusing on two major chronic diseases. Asthma is caused by inflammation of the airways, the small tubes called bronchi that carry air in and out of the lungs [2]. Diabetes is caused by an imbalance in the secretion of insulin, which disturbs the sugar levels of the blood [3]. Diabetes also increases the risk of developing kidney disease, heart disease, blindness, etc. [3].

According to the United States Environmental Protection Agency [4], an estimated 25.9 million people suffer from Asthma, and its annual economic cost, including direct and indirect costs, amounts to more than $56 billion. A CDC survey shows that about 9 people die from Asthma every day.

According to the American Diabetes Association, 29.1 million people in the United States had diabetes in 2012, of whom only 21 million were diagnosed. According to the International Diabetes Federation, there are 246 million diabetic people around the world, and this number is expected to rise to 380 million by 2025 [3].

On further research we found that Asthma and Diabetes are associated with many clinical and non-clinical risk factors, such as age, gender, blood pressure and smoking status. We have therefore developed a predictive system that helps an individual track their likeliness of Asthma and extent of Diabetes based on a list of clinical and non-clinical factors. The system also provides necessary feedback that can reduce the chances of developing a chronic condition in the future or help to control an existing one.

1.2 Problem Statement

Chronic diseases such as Asthma and Diabetes are major causes of mortality and morbidity all around the world; millions of people are diagnosed with one or more chronic conditions every year. These are among the most common health problems the world is facing today, and the overall economic costs involved are also very high. Health risk behaviors are the major factors associated with chronic conditions. They are unhealthy behaviors that can be changed over time if proper efforts are made to control or manage them. The following are the four major health risk behaviors associated with chronic conditions [5]:

1. Lack of Exercise or Physical Activity

2. Poor Nutrition

3. Drinking too much alcohol

4. Excessive smoking

According to CDC surveys, cigarette smoking is responsible for more than 480,000 deaths every year and drinking too much alcohol causes 88,000 deaths each year. The surveys also show that more than 50% of adults aged 18 years or older did not meet the recommended duration and level of physical activity. Because these behaviors are poorly managed, the rate of mortality due to chronic conditions is increasing.

1.3 Contribution

The major contributions of the research work can be outlined as follows:

1. A predictive system that can predict the Likeliness of Asthma and the Extent of Diabetes in an individual, so that they can effectively manage their long-term health. This is the main objective of the research.

2. A web based application that makes the system available in a user friendly manner.

3. A graphical representation of the likeliness of asthma and the extent of diabetes that lets users keep track of their health.

4. Feedback to improve overall health and reduce the risk of Asthma and Diabetes.

5. An analysis of the data taken from the NHANES 2011-2012 survey.

6. Use of Weka, a Java-based data mining software tool, for classification, regression, pre-processing, clustering, visualization and association.

7. Incorporation of Artificial Neural Network techniques to make the system more intelligent and its results more reliable.

1.4 Organization

This thesis is divided into 7 chapters. Chapter 1 explains the research problem and conveys the importance and magnitude of the problem to the reader. It also provides an overview of the contributions made by the research.

Chapter 2 presents different perspectives on the predictive analytics approach to solving health problems. It provides an overview of the methodology used by these systems.

Chapter 3 provides an overview of the research methodology used to solve the stated problem. It includes a detailed description of the machine learning models used by the system to generate predictions.

Chapter 4 provides a detailed description of the NHANES data set, which is used to train the prediction models. It also provides an overview of the risk factors chosen to predict the Likeliness of Asthma and the Extent of Diabetes.

Chapter 5 defines the system architecture in detail. It also gives a detailed description of Weka, the data mining tool used to build the prediction models.

Chapter 6 discusses all the results obtained from the models used in the research. It provides tables and graphs that make the results easy to understand.

Chapter 7 presents the conclusion and possible future work in this field of study.

[Figure 1.1: diagram listing the chapters in order: Chapter 1 Introduction; Chapter 2 Related Work; Chapter 3 Research Methodology; Chapter 4 Data and Analysis of Variables; Chapter 5 System Architecture and Implementation; Chapter 6 Analysis & Result; Chapter 7 Conclusion and Future Work]

Figure 1.1 Thesis Structure

2 RELATED WORK

2.1 Introduction

Many studies have applied predictive modeling, often based on Artificial Neural Networks, to improve healthcare and its delivery. In this chapter we discuss several systems that take a predictive modeling approach to improving overall health, including systems that predict the risk of Asthma and Diabetes based on different risk factors. We provide an overview of the architecture of these systems, the data sets used to carry out the research, and their results.

2.2 Prognostics for Patient Health Management

Peter Ghavami of Harborview Medical Center, Seattle and Kailash Kapur of the University of Washington, Seattle designed and developed a prognostics engine to predict a patient's physiological health status using Artificial Neural Networks [6]. The engine builds models from historical clinical data collected from different human physiological systems, including the respiratory, circulatory and immune systems. Their system applies Prognostics and Health Management (PHM), which provides methods for solving reliability issues; it also permits assessment of a system under its application scenario in order to determine possible risks and failures.

Figure 2.1 shows the timeline of medical predictions on which the system works:

[Figure 2.1: timeline with the stages Risk Factors, Prediction, Marker and Medical Event]

Figure 2.1 Timeline of Medical Prediction


The timeline begins with risk factors, which lead to predictors, i.e. the variables that provide a prediction about a disease. It then moves to the marker, where the actual causes of the disease combine to reach a detectable level of disease, and finally to the medical event itself, which determines the presence of a disease or its occurrence in the near future.

2.2.1 System Architecture

Figure 2.2 System Architecture of Patient Health Management


Figure 2.2 shows the feed-forward and feedback control model. The input to the physiological system is represented by i(t) and includes data related to the medical treatment plan, i.e. the set of medications, protocols and procedures suggested by the physician. The human physiological system internally produces a wide variety of clinical data, such as lab results and monitored data, represented by X(t). The model includes a Prognostic Engine that continuously monitors the real-time patient data and applies mathematical algorithms to develop rules and patterns, which it uses to predict the presence of a disease or its occurrence in the near future.

2.2.2 Artificial Neural Network:

This system uses four major artificial neural network models:

1) PNN - Probabilistic Neural Network

2) SVM - Support Vector Machine

3) Generalized Feed Forward Multilayer Perceptron model

4) MLP trained with the Levenberg-Marquardt (LM) algorithm

2.2.3 Data:

The clinical data used by the system to define rules for prediction consists of 468 patient cases admitted for various physiological treatments. The input data consists of 21 independent variables and 1 dependent variable. The independent variables include lab results, blood pressure data, heart rate data, etc. The dependent variable represents the clinical outcome, i.e. the absence or presence of a disease.

Finally, the results of all models are stored and an oracle (an overseer program) selects the most accurate model.
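The selection step performed by the oracle can be pictured with a minimal sketch. It assumes each model's accuracy has already been measured and stored under the model's name; the function name `select_best_model` and the accuracy values are hypothetical and are not taken from [6].

```python
from typing import Dict


def select_best_model(accuracies: Dict[str, float]) -> str:
    """Return the name of the model with the highest stored accuracy."""
    if not accuracies:
        raise ValueError("no model results to compare")
    # max() over the dict keys, ordered by the accuracy stored for each key
    return max(accuracies, key=accuracies.get)


# Hypothetical accuracies for illustration only (not results reported in [6]).
stored_results = {"PNN": 0.81, "SVM": 0.84, "GFF-MLP": 0.79, "MLP-LM": 0.83}
best = select_best_model(stored_results)
print(best)  # prints SVM
```

In practice the stored metric could be any score computed on held-out cases; accuracy is used here only because it is the criterion named in the text.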

2.3 COPD Prognosis under Biologically Inspired Neural Network

Researchers at Easwari Engineering College, Chennai, India proposed a model to rehabilitate Chronic Obstructive Pulmonary Disease (COPD) patients in real time [7]. The model provides a predictive approach using a polynomial neural network with swarm intelligence. Swarm intelligence is the collective intelligence that emerges from the social interaction of individuals. The model includes two main components, Discrete Particle Swarm Optimization (DPSO) and Continuous Particle Swarm Optimization (CPSO), and the health status of the patient is classified using support vectors.

2.3.1 System Architecture:

Figure 2.3 System Architecture of Biologically inspired techniques

Figure 2.3 shows the typical architecture of the biologically or socially inspired techniques used in prediction. The patient's handheld device includes four major components: DPSO, CPSO, Polynomial Neural Network and Training systems. The input to the model is the patient's physiological parameters, such as MMRC scale, BMI, FEV1% and the 6-minute walk test. The prediction model comprises a condensed polynomial neural network; the model runs on the collected data and the accuracy of the system is then assessed by statistical analysis as per Gibbs specifications.

2.4 Time to CARE

CARE, a Collaborative Assessment and Recommendation Engine, predicts future disease risks based on a patient's medical history and the histories of other similar patients [8]. It uses collaborative methods to find the most significant risk factors that can lead to a disease and generates predictions based on the selected risk factors. An iterative version, ICARE, was also designed; it uses ensemble concepts to improve the performance of the system.

2.4.1 Methodology:

In a typical CARE system, the individual whose disease risks are to be predicted is treated as the testing patient, and the medical histories of other patients are used as training patients. The training patients are restricted to those who have at least two diseases in common with the testing patient. Collaborative filtering is then used to generate predictions of the testing patient's future disease risks. In ICARE, i.e. Iterative CARE, this process is repeated multiple times with different groups of training patients, and the results are combined to form an ensemble. The results from both CARE and ICARE are then ranked by disease from the highest to the lowest risk score.
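The steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of [8]: patient histories are modeled as sets of disease names, similarity between two patients is taken to be Jaccard overlap, and the ICARE ensemble is an average over randomly sampled training groups. The function names, the `sample_frac` parameter and the example data are all hypothetical.

```python
import random
from typing import Dict, List, Set, Tuple


def training_cohort(test_history: Set[str], histories: Dict[str, Set[str]],
                    min_shared: int = 2) -> Dict[str, Set[str]]:
    """Keep only patients sharing at least `min_shared` diseases with the testing patient."""
    return {pid: h for pid, h in histories.items()
            if len(h & test_history) >= min_shared}


def risk_scores(test_history: Set[str], cohort: Dict[str, Set[str]]) -> Dict[str, float]:
    """Similarity-weighted vote for each disease the testing patient does not yet have."""
    scores: Dict[str, float] = {}
    for history in cohort.values():
        union = history | test_history
        # Jaccard similarity is an assumption made for this sketch
        sim = len(history & test_history) / len(union) if union else 0.0
        for disease in history - test_history:
            scores[disease] = scores.get(disease, 0.0) + sim
    return scores


def rank(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Order diseases from highest to lowest risk score (ties broken by name)."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def icare_ranking(test_history: Set[str], histories: Dict[str, Set[str]],
                  rounds: int = 5, sample_frac: float = 0.7,
                  seed: int = 0) -> List[Tuple[str, float]]:
    """Repeat the scoring on different training groups and average the results."""
    rng = random.Random(seed)
    cohort = training_cohort(test_history, histories)
    ids = sorted(cohort)
    if not ids:
        return []
    k = min(len(ids), max(1, int(len(ids) * sample_frac)))
    totals: Dict[str, float] = {}
    for _ in range(rounds):
        group = {pid: cohort[pid] for pid in rng.sample(ids, k)}
        for disease, score in risk_scores(test_history, group).items():
            totals[disease] = totals.get(disease, 0.0) + score / rounds
    return rank(totals)


# Hypothetical histories for illustration only.
histories = {
    "p1": {"asthma", "copd", "diabetes"},
    "p2": {"asthma", "copd", "ckd", "diabetes"},
    "p3": {"asthma", "cancer"},  # shares one disease only, so it is excluded
}
testing_patient = {"asthma", "copd"}
cohort = training_cohort(testing_patient, histories)
print(rank(risk_scores(testing_patient, cohort)))
```

Averaging over several sampled groups is one simple way to form an ensemble; other combination rules would fit the same description.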

2.5 An Evolutionary two-objective genetic algorithm for asthma prediction

Researchers at Democritus University of Thrace, Xanthi, Greece proposed a system to predict the occurrence of asthma using an Artificial Neural Network and a Genetic Algorithm [9]. The system predicts asthma risk in children under the age of 5. The Genetic Algorithm filters the factors that influence asthma the most; in other words, it finds the risk factors that make a child more vulnerable to asthma. In a Genetic Algorithm, a set of candidate solutions to an optimization problem is evolved to find the most appropriate solution.

2.5.1 MLP Pruning by Genetic Algorithm

Patterns were generated to predict asthma with the help of Artificial Neural Network.

Multilayer Perceptron, a supervised neural network model was used to generate these

patterns. A total of 34 prognostics factors were used to predict the disease ranging from

breathing tests to Pharmaceutical therapy, wheezing episodes to demographic and some

common symptoms of asthma such as cough, chest pain, runny nose etc. MLP network

was trained based on the data of 112 patients obtained from the Pediatric Department of

Alexandroupolis University Hospital, Greece. The training algorithm used here is the

back propagation algorithm, where the weights are adjusted by back propagating the error

resulting into a more efficient algorithm.

A variety of experiments were performed in MATLAB with the support Neural Network

Toolbox for constructing Multilayer Perceptrons and the Genetic Algorithms. The testing

accuracy of the network was found to be 94.8%.

A Genetic Algorithm search was performed to find the most significant risk factors to supply as input to the Multilayer Perceptron model. The GA search has two objectives: the first is to minimize the number of prognostic factors, and the second is to enhance the performance of the model based on these factors.
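
The trade-off between the two objectives can be illustrated with a small sketch. It is not the algorithm from [9]; it only shows, under the assumption that each candidate factor subset is summarised by a hypothetical pair (number of factors, classification error), how candidates that are not beaten on both objectives at once can be identified.

```python
from typing import List, Tuple

# (number of prognostic factors used, classification error of the model)
Objectives = Tuple[int, float]


def dominates(a: Objectives, b: Objectives) -> bool:
    """Return True if `a` is at least as good as `b` on both objectives
    (fewer factors, lower error) and strictly better on at least one."""
    no_worse = a[0] <= b[0] and a[1] <= b[1]
    strictly_better = a[0] < b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(candidates: List[Objectives]) -> List[Objectives]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical (factor count, error) pairs for five candidate factor subsets.
candidates = [(34, 0.06), (10, 0.05), (10, 0.09), (5, 0.12), (20, 0.05)]
front = pareto_front(candidates)
print(front)
```

In this toy example the full set of 34 factors is dominated by a 10-factor subset with lower error, which is the kind of pruning the two-objective search aims for.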

2.6 Cloud Framework for Health Care Monitoring System (CHMS)

With the increase in popularity of Cloud Computing, researchers from Bangalore and Anantapur, India designed and developed a Health Care Monitoring System using a Cloud Framework [10]. The cloud enables data sharing without geographical limitations. CHMS collects health data from a variety of sources and publishes it to a cloud-based repository named the Telemedicine Repository (TMR). Once the data is published to the TMR, the system performs data analysis using services provided by the cloud and stores the results in the form of health records.

2.6.1 Cloud Framework:

Figure 2.4 Cloud Framework of CHMS

Figure 2.4 shows the cloud framework of CHMS. It mainly comprises a data acquisition module, a communication system, a Tele Medicine Center (TMC), an Emergency Health Care (EMC) module and a Multi Specialty Hospitals module. Patients are equipped with a data acquisition device capable of collecting patient data such as ECG, glucose and blood pressure. The data is sent to the TMC through a communication system such as the internet. The TMC analyzes the received data, taking into account the patient's existing data and historic data. The TMC also maintains an Electronic Health Record (EHR) on the cloud, which is accessible to users at any time. It also includes a web portal that enables patients and doctors to communicate in an emergency.

2.6.2 System Architecture:

Figure 2.5 Layered Architecture of CHMS

The architecture mainly consists of three layers: a Software-as-a-Service (SaaS) layer, a Platform-as-a-Service (PaaS) layer and an Infrastructure-as-a-Service (IaaS) layer. The SaaS layer lets a user use the system without dealing with the complexity of the application. The PaaS layer provides a set of tools to make the system quick and efficient, and helps in storing the EHR of a patient generated by the TMC. The IaaS layer provides virtual datacenter resources such as servers and networks, and provides storage services by virtualizing those resources. It also helps in making the data available to different users across the globe.

2.7 Modeling Risk Prediction of Diabetes – A Preventive Measure

Bakshi Rohit Prasad et al. [11] proposed a data mining approach for selecting the best indicators of diabetes and a model to predict diabetes before its onset. It uses a voting mechanism to select the most suitable classifier model in order to achieve high accuracy. The system works in three stages: data pre-processing, class label assignment and construction of the classifier.

2.7.1 Methodology

The data is taken from the Diabetes dataset in the UCI repository; it contains 9 attributes such as plasma glucose, diastolic BP, BMI and age. In the data pre-processing stage, the data is transformed into a form suitable for the subsequent stages. A k-nearest neighbor approach is used to deal with missing values, which puts in the value present in the nearest column in terms of Euclidean distance. As a result, only 5 attributes remained in the dataset. In the next stage, a class label of high risk, medium risk or low risk is assigned to each patient record; a clustering technique is used to group the dataset into clusters of high, medium and low risk. The final stage builds the classifier to predict diabetes using four classification techniques: KNN (K-Nearest Neighbor), LDA (Linear Discriminant Analysis), Naïve Bayes and DTC (Decision Tree Classification). Each classifier is trained on the training set and its accuracy is measured on a test dataset; the vote count of the classifier with the highest accuracy is incremented by one. The process is repeated for different test datasets, and the classifier with the highest vote count is then selected for classification.
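
The voting step described above can be sketched as follows. This is a minimal illustration, assuming the accuracy of every classifier on every test dataset has already been measured; the function name and the accuracy values are hypothetical and are not taken from [11].

```python
from collections import Counter
from typing import Dict, List


def select_by_votes(accuracy_per_test_set: Dict[str, List[float]]) -> str:
    """Give one vote per test dataset to the most accurate classifier and
    return the classifier with the most votes.

    Ties are resolved in favour of the classifier listed first
    (an assumption; the source does not describe tie handling).
    """
    votes: Counter = Counter()
    n_test_sets = len(next(iter(accuracy_per_test_set.values())))
    for i in range(n_test_sets):
        best = max(accuracy_per_test_set,
                   key=lambda name: accuracy_per_test_set[name][i])
        votes[best] += 1
    return votes.most_common(1)[0][0]


# Hypothetical accuracies of four classifiers on three test datasets.
accuracies = {
    "KNN": [0.70, 0.72, 0.69],
    "LDA": [0.75, 0.71, 0.74],
    "NaiveBayes": [0.73, 0.70, 0.72],
    "DTC": [0.71, 0.69, 0.70],
}
winner = select_by_votes(accuracies)
print(winner)
```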

2.8 Scoring Scheme based on Prospective Cardiovascular Munster Study (PROCAM)

Gerd Assmann et al. [12] proposed a scoring scheme for calculating the risk of acute coronary events based on the 10-year follow-up of the Prospective Cardiovascular Munster (PROCAM) Study. The scoring scheme is based on 325 acute coronary events that occurred within 10 years of follow-up among 5389 middle-aged men who were recruited into the PROCAM study. Within the 10 years, 230 men were lost to follow-up, 218 died, 14 had suspected coronary death and 46 non-fatal cases occurred. In the end, 4493 middle-aged men survived the 10 years of follow-up without any major coronary event.

2.8.1 Scoring Method:

To obtain maximum information from the PROCAM study, a risk algorithm based on the Cox proportional hazards model was constructed. It includes 8 variables that independently predicted the event risk. The Cox model only allows calculation of relative risk; hence, to convert the relative risk obtained from the Cox model into absolute risk, Kaplan-Meier statistics were used.
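
The conversion from relative to absolute risk can be written in the standard textbook form below. This is a sketch of the general technique, not necessarily the exact computation used in [12]; the symbols $h_0$, $S_0$, $\boldsymbol{\beta}$ and $\mathbf{x}$ are introduced here for illustration.

```latex
% Cox model: hazard of an individual with risk-factor vector x,
% relative to the baseline hazard h_0(t)
h(t \mid \mathbf{x}) = h_0(t)\,\exp\!\left(\boldsymbol{\beta}^{\top}\mathbf{x}\right)

% Absolute risk of an event within time t, given a baseline survival
% function S_0(t) estimated from the follow-up data
\text{risk}(t \mid \mathbf{x}) = 1 - S_0(t)^{\exp\left(\boldsymbol{\beta}^{\top}\mathbf{x}\right)}
```

The factor $\exp(\boldsymbol{\beta}^{\top}\mathbf{x})$ is the relative risk; the baseline survival $S_0(t)$, which has to be estimated from the observed survival data, anchors it to an absolute probability.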

To generate the scoring scheme, each risk factor was divided into categories, and each category was assigned a value obtained from a regression equation relating the logarithm of the global risk (as calculated by the Cox model combined with the survival curves) to the categories of each risk factor [12]. The calculated coefficients were then standardized and rounded to obtain the score as a whole number. The PROCAM algorithm then calculates the risk of a coronary event associated with each score; the scores are then categorized into very low and very high PROCAM scores.

2.9 Other Related Works:

A neural network based Structural Health Monitoring System has been proposed [13]. This system uses a wireless sensor network in which thousands of sensor nodes perform distributed sensing and collaborative computing for structural health analysis. It uses several algorithms to predict a particular disease. In 2012, Jean-Francois proposed a system and method [14] for determining and managing an individual portable health score; the method defines a baseline health score and then adjusts it based on several health actions.

3 RESEARCH METHODOLOGY

3.1 Introduction

The research design is a framework for predicting the likeliness of Asthma and the extent of Diabetes in an individual. This chapter first explains the design of the proposed system and then explains the concepts of predictive modelling, data mining and machine learning, including the machine learning models that have been used to carry out the prediction.

3.2 Research Design

The primary purpose of this research is to develop a system that helps individuals keep track of their health with respect to chronic conditions; hence it is important to design the system so that it is easy to interpret and easy to use. In this study the advantages of Artificial Neural Networks are harnessed to build a user-friendly system. Figure 3.1 shows the overall design of the system, which leads to predictions of the likeliness of Asthma and the extent of Diabetes.

Figure 3.1 Proposed System Design
The design includes three main components: User, System and Neural Network. Users are individuals who want to use the services provided by the system. A user is expected to provide the necessary input to the system; the input consists of a list of parameters including demographic details, laboratory details and body measures. Once the user provides the input, the system makes use of several neural network models (discussed later in the chapter) to generate predictions of the likeliness of Asthma and the extent of Diabetes in a numerical format that is easy to interpret. Once the prediction has been made, the system provides feedback to help manage the disease.

3.3 Predictive Modelling

Predictive modelling is the name given to a collection of mathematical techniques or models that help in finding a mathematical relationship between a target (dependent) variable and the predictor (independent) variables [15]. It helps in predicting the probability of an outcome when a set of independent variables is passed through the model. Almost all regression models can be used for prediction purposes.

3.4 Data Mining

Data Mining is an analytic process for extracting information from a large amount of data. It is designed to explore data in order to find patterns or relationships between the variables. It helps in extracting unknown or potentially useful information from data [16]. Data Mining involves machine learning, artificial intelligence and statistics [17]. The main goal of data mining is prediction; predictive data mining is the most common data mining approach and has been used by many studies. The process of data mining is divided into three main stages. In the first stage, the dataset used to train the model is prepared, which includes data cleaning and data pre-processing. The second stage builds the model, that is, identifies patterns based on the prepared dataset. In the final stage, the trained model is used to generate predictions.

3.5 Machine Learning

Machine learning is a branch of computer science that consists of algorithms that can learn from data. It provides a set of methods that can detect patterns in the data and use those patterns to generate future predictions [18].

Machine learning is divided into two main types: supervised and unsupervised learning. Supervised learning is the machine learning technique in which the learning algorithms make use of labelled data. The main goal is to map a set of input parameters X, also called features or attributes, to an output parameter Y, also called the class label [18]. In this technique the model is trained on the labelled training data and then generates predictions for unseen situations. In unsupervised learning, the model is trained on unlabelled data. The main goal of unsupervised learning is to find patterns in order to extract useful information from unlabelled data.

The following diagram shows the supervised learning models used in this research to generate the predictions for Asthma and Diabetes.

[Diagram: Supervised Learning branches into Naïve Bayes, Bayesian Network, Multilayer Perceptron, Logistic Regression and J48]

Figure 3.2 Supervised Learning Models used in the research


3.5.1 Naïve Bayes:

Naïve Bayes classifiers are based on Bayes' Theorem; they simplify learning by assuming that the features are independent of each other given the class [19]. This strong assumption is known as the Naïve Bayes Assumption [18]. Let x ∈ X be the input feature vector with D features and y ∈ {1, …, C} the class label; then, for each class c, the Naïve Bayes Assumption is given by

$$p(\mathbf{x} \mid y = c) = \prod_{i=1}^{D} p(x_i \mid y = c)$$

It is particularly suited to cases where the input set comprises a large number of attributes. Naïve Bayes classifiers are used in several fields such as target marketing, text classification, credit approval and spam filtering [20].
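
A minimal sketch of how this assumption turns into a classifier is given below. It handles categorical features only and uses add-one smoothing, which is an assumption added here to avoid zero probabilities; the records and feature meanings are hypothetical and not drawn from the thesis data.

```python
from collections import Counter, defaultdict
from typing import Dict, Sequence, Tuple

Record = Tuple[Sequence[str], str]  # (feature values, class label)


def train(records: Sequence[Record]):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(label for _, label in records)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    vocab = defaultdict(set)             # feature index -> values seen
    for features, label in records:
        for i, value in enumerate(features):
            value_counts[(i, label)][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab


def predict_proba(features: Sequence[str], model) -> Dict[str, float]:
    """Posterior over classes: prior times the product of p(x_i | y = c)."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior p(y = c)
        for i, value in enumerate(features):
            # add-one smoothed estimate of p(x_i | y = c)
            score *= (value_counts[(i, label)][value] + 1) / (n + len(vocab[i]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}


# Hypothetical records: (smoker, close relative had asthma) -> label
records = [
    (("yes", "yes"), "asthma"), (("yes", "no"), "asthma"),
    (("no", "yes"), "asthma"), (("no", "no"), "healthy"),
    (("no", "no"), "healthy"), (("yes", "no"), "healthy"),
]
model = train(records)
probs = predict_proba(("yes", "yes"), model)
print(probs)
```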

Advantages:

1. Naïve Bayes is easy to implement

2. Naïve Bayes classifiers can be trained quickly in a single scan

3. Classification is quick compared to other models

4. It can handle large amounts of discrete data

5. It is not sensitive to irrelevant features

Disadvantages:

1. The major disadvantage of Naïve Bayes is the Naïve Bayes Assumption; it can

result in loss of accuracy

2. In the real world, dependencies exist among the attributes, but these dependencies are ignored by Naïve Bayes models

3.5.2 Bayesian Network:

Bayesian Networks, also called Belief Networks, are probabilistic graphical models widely used for reasoning under uncertainty [21]. They provide a way to represent the relationships between attributes. A Bayesian Network is represented as a directed acyclic graph whose nodes represent the attributes and whose edges represent the relationships between them. The relationships between the attributes are described by conditional probability distributions.
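
As a worked illustration of how the graph and the conditional distributions fit together, the joint distribution encoded by a Bayesian Network factorizes over the nodes as shown below. The notation $\mathrm{pa}(x_i)$ for the parents of node $x_i$ in the graph is introduced here and does not appear in the text above.

```latex
% Joint distribution of n attributes in a Bayesian Network:
% each attribute depends only on its parents in the directed acyclic graph
p(x_1, \dots, x_n) = \prod_{i=1}^{n} p\bigl(x_i \mid \mathrm{pa}(x_i)\bigr)
```

Naïve Bayes is the special case in which the class label is the only parent of every feature.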

Advantages:

1. Since the outputs are represented in terms of probability it can be easily

interpreted

2. Models are represented in the form of a graph, which can be interpreted easily by people from different disciplines [21].

3. Bayesian Networks can be easily updated when a new knowledge source is

available
Disadvantages:

1. Limited ability to deal with continuous data [22]

2. Because of the acyclic nature of the model, feedback loops cannot be included in Bayesian Networks

3.5.3 Multilayer Perceptron Model

A Multilayer Perceptron (MLP) is a feed-forward neural network with one or more hidden layers between input and output. Feed-forward means that data flows in one direction, from the input nodes towards the output node. The network is trained with the back propagation learning algorithm. An MLP can distinguish between data that is not linearly separable. Except for the input nodes, every node has a non-linear activation function. The input layer consists of the set of input parameters on which the prediction is based. The hidden layer consists of a set of hidden nodes that help solve the non-linear data problem by converting the input data into a form that can be used by the output node. Finally, the output layer consists of an output node with a non-linear activation function that makes the prediction.

Figure 3.3 A Multilayer Perceptron Model with three layers
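
The forward pass through such a three-layer network can be sketched in a few lines. The weights below are hand-picked for illustration (they are not learned by back propagation and are not from this research); they make the network compute XOR, a classic example of data that is not linearly separable.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Non-linear activation used by the hidden and output nodes."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs: Sequence[float], weights: Sequence[Sequence[float]],
          biases: Sequence[float]) -> List[float]:
    """One fully connected layer: weighted sum per node, then sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(node_w, inputs)) + b)
            for node_w, b in zip(weights, biases)]


# Hand-picked illustrative weights: two hidden nodes, one output node.
HIDDEN_W = [[20.0, 20.0], [-20.0, -20.0]]
HIDDEN_B = [-10.0, 30.0]
OUTPUT_W = [[20.0, 20.0]]
OUTPUT_B = [-30.0]


def predict(x1: float, x2: float) -> float:
    """Feed the inputs forward: input layer -> hidden layer -> output node."""
    hidden = layer([x1, x2], HIDDEN_W, HIDDEN_B)
    return layer(hidden, OUTPUT_W, OUTPUT_B)[0]


outputs = [round(predict(a, b)) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)
```

No single straight line separates the XOR inputs, which is why the hidden layer is needed.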

Advantages

1. It helps in solving problems that involve classifying data that is not linearly separable

2. MLP models do not make any assumptions regarding the probabilistic

information about the class labels

Disadvantages

1. It requires more memory and processing power.

2. Training takes more time compared to other classifiers

3.5.4 Logistic Regression:

Logistic Regression is a statistical classification model that can be applied in situations where the outcome is categorical. It has become a standard method of analysis where the outcome variable is discrete, taking two or more possible values [23]. It provides a reasonable model to describe the relationship between the output variable and one or more input variables. In most cases the outcome variable is dichotomous, i.e. it can take only two values such as yes/no or 0/1; such a model is called a Binary Logistic Regression Model. When the outcome variable can take more than two values, the model is called a Multinomial Logistic Regression Model.
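
For the binary case, the model passes a weighted sum of the inputs through the logistic (sigmoid) function to obtain a probability. The sketch below assumes a fitted model is available; the coefficients, intercept and feature meanings are hypothetical values chosen only for illustration.

```python
import math
from typing import Sequence


def predict_probability(features: Sequence[float],
                        coefficients: Sequence[float],
                        intercept: float) -> float:
    """p(y = 1 | x) = 1 / (1 + exp(-(intercept + sum(coef_i * x_i))))."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability: float, threshold: float = 0.5) -> int:
    """Turn the probability into a yes/no (1/0) outcome."""
    return 1 if probability >= threshold else 0


# Hypothetical coefficients for two inputs: age in years and BMI.
COEFFICIENTS = [0.05, 0.08]
INTERCEPT = -4.0

p = predict_probability([50.0, 30.0], COEFFICIENTS, INTERCEPT)
print(round(p, 3), classify(p))
```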

Advantages:

1. It does not require a linear relationship between the independent and dependent variables

2. Multiple explanatory variables can be used.

3. Less prone to over-fitting due to simplicity and low variance.

4. The dependent variable need not be normally distributed

5. No confounding effects because logistic regression allows quantified values for

strength of association between explanatory variables.

Disadvantages:

1. It cannot predict continuous outcomes, as logistic regression models a discrete outcome variable.

2. It requires a large set of data to achieve better results

3.5.5 J48 Decision Tree

Decision tree learning is a learning method that uses trees to represent a predictive model [24]. The tree consists of leaves that represent class labels and branches that represent the features or rules leading to a particular class label. Decision trees are divided into two categories: classification trees, in which the target variable takes a finite set of values, and regression trees, in which the target variable takes numerical values. J48, Weka's implementation of the C4.5 algorithm, is used to perform decision tree learning. C4.5 is also known as a statistical classifier. It generates a decision tree which can be used for classification or numerical prediction.
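
C4.5 is commonly described as choosing the attribute to split on by its information gain ratio. The sketch below computes that criterion for categorical attributes only; it leaves out the handling of continuous attributes and pruning, and the attribute values and labels are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the value distribution."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())


def gain_ratio(attribute: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Information gain of splitting on `attribute`, divided by split info."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the labels after the split
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(attribute)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical records: which attribute separates the classes better?
labels = ["asthma", "asthma", "healthy", "healthy"]
smoker = ["yes", "yes", "no", "no"]
gender = ["m", "f", "m", "f"]
print(gain_ratio(smoker, labels), gain_ratio(gender, labels))
```

Here the hypothetical `smoker` attribute separates the classes perfectly, so a tree built on this criterion would split on it first.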

Advantages:

1. J48 Decision trees can be used for both continuous and discrete attributes.

2. Once the tree is created, it removes unnecessary nodes which do not help in classification, resulting in a simpler tree which can be easily interpreted

3. It is easy to implement

4. Can work on both continuous and categorical values of output variable

Disadvantages:

1. It is not suitable for the problems where classification is based on fulfilment of

several conditions

2. If the decision tree consists of too many branches and nodes, the cost and complexity involved are also very high.

4 DATA AND ANALYSIS OF RISK FACTORS

This chapter presents a detailed description of the data used to train the models proposed in this research. It includes four major components: Data Collection, Variable Selection, Data Pre-processing and Data Transformation.

• Data Collection: This section describes the source of the data; it provides a detailed description of the NHANES 2011-2012 dataset.

• Variable Selection: It includes the formulation of the data sets for Asthma and Diabetes prediction, and provides a description of the variables included for Asthma and Diabetes.

• Data Pre-processing: It includes elimination of data sets that are noisy, inconsistent or include too many missing values.

• Data Transformation: Data is finally transformed into a format which can be used by Weka, the data mining tool.

Figure 4.1 Steps involved in preparing a dataset
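
The final transformation step targets Weka, whose native input is the ARFF text format. The sketch below shows one way such a file could be produced; it is not the procedure used in this research, the helper `to_arff` is hypothetical, and the two example rows are made-up values.

```python
from typing import List, Optional, Sequence, Tuple, Union

Value = Optional[Union[int, float, str]]
# An attribute is (name, "numeric") or (name, [list of nominal values]).
Attribute = Tuple[str, Union[str, List[str]]]


def to_arff(relation: str, attributes: Sequence[Attribute],
            rows: Sequence[Sequence[Value]]) -> str:
    """Render rows as ARFF text; missing values (None) are written as '?'."""
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if kind == "numeric":
            lines.append(f"@attribute {name} numeric")
        else:
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


# NHANES variable names with made-up example values.
attributes = [
    ("RIDAGEYR", "numeric"),   # age in years
    ("RIAGENDR", ["1", "2"]),  # gender code
    ("MCQ010", ["1", "2"]),    # ever been told you have asthma
]
rows = [(45, "1", "2"), (None, "2", "1")]
arff_text = to_arff("asthma", attributes, rows)
print(arff_text)
```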

4.1 Data Collection:

The data used in this research to train the prediction models is collected from the National Health and Nutrition Examination Survey (NHANES). “NHANES is a program of studies designed to assess the health and nutritional status of adults and children in the United States”. The survey is conducted by the National Center for Health Statistics (NCHS), an agency of the United States Federal Statistical System that provides statistical data to improve the health status of people in America. NCHS is an integral part of the Centers for Disease Control and Prevention (CDC). The survey combines physical examinations and interviews. The physical examination includes medical, laboratory and physiological tests, whereas the interview includes demographic, dietary and other health-related questions. Every year about 5000 persons located in different states across the country are examined under this survey; it also keeps track of the changes in their health conditions over time. The data collected from the survey is used to determine various risk factors for major diseases such as Asthma, Diabetes and Kidney Diseases.

For this research we have used the data collected from the NHANES 2011-2012 survey. This survey divides the data into 6 categories: Demographics, Dietary, Examination, Laboratory, Questionnaire and Limited Access. In 2011-2012, 13,431 persons from 30 different locations were selected, of which 9,756 completed the interview component and 9,338 were examined in order to collect the information related to the above-mentioned categories. Table 4.1 shows the unweighted response rates for the NHANES 2011-2012 survey by age and gender.

Table 4.1 Unweighted response rates for NHANES 2011-2012 survey by Age and Gender

| Gender | Age Group | Control Totals | Screened Sample Size | Interviewed: Unweighted Sample Size | Interviewed: Unweighted Response Rate (%) | Examined: Unweighted Sample Size | Examined: Unweighted Response Rate (%) |
|---|---|---|---|---|---|---|---|
| Total | All Ages | 306,590,681 | 13,431 | 9,756 | 72.6 | 9,338 | 69.5 |
| | < 1 year | 3,686,290 | 458 | 392 | 85.6 | 382 | 83.4 |
| | 1-5 years | 20,444,290 | 1,463 | 1,203 | 82.2 | 1,135 | 77.6 |
| | 6-11 years | 24,614,923 | 1,641 | 1,328 | 80.9 | 1,272 | 77.5 |
| | 12-15 years | 16,544,876 | 803 | 658 | 81.9 | 630 | 78.5 |
| | 16-19 years | 17,333,412 | 814 | 615 | 75.6 | 600 | 73.7 |
| | 20-29 years | 41,927,467 | 1,409 | 994 | 70.5 | 954 | 67.7 |
| | 30-39 years | 39,278,265 | 1,301 | 963 | 74.0 | 924 | 71.0 |
| | 40-49 years | 42,648,248 | 1,267 | 899 | 71.0 | 875 | 69.1 |
| | 50-59 years | 42,270,235 | 1,309 | 913 | 69.7 | 879 | 67.2 |
| | 60-69 years | 30,499,549 | 1,385 | 908 | 65.6 | 868 | 62.7 |
| | 70-79 years | 16,715,521 | 879 | 520 | 59.2 | 493 | 56.1 |
| | 80+ years | 10,627,605 | 702 | 363 | 51.7 | 326 | 46.4 |
| Male | All Ages | 149,632,763 | 6,681 | 4,856 | 72.7 | 4,651 | 69.6 |
| | < 1 year | 1,880,565 | 219 | 193 | 88.1 | 188 | 85.8 |
| | 1-5 years | 10,432,000 | 732 | 595 | 81.3 | 561 | 76.6 |
| | 6-11 years | 12,575,747 | 820 | 678 | 82.7 | 651 | 79.4 |
| | 12-15 years | 8,487,172 | 411 | 340 | 82.7 | 331 | 80.5 |
| | 16-19 years | 8,846,406 | 400 | 310 | 77.5 | 300 | 75.0 |
| | 20-29 years | 20,817,190 | 731 | 510 | 69.8 | 488 | 66.8 |
| | 30-39 years | 19,253,439 | 653 | 481 | 73.7 | 459 | 70.3 |
| | 40-49 years | 20,860,140 | 617 | 428 | 69.4 | 416 | 67.4 |
| | 50-59 years | 20,490,838 | 653 | 435 | 66.6 | 418 | 64.0 |
| | 60-69 years | 14,492,515 | 706 | 459 | 65.0 | 443 | 62.7 |
| | 70-79 years | 7,514,299 | 426 | 260 | 61.0 | 244 | 57.3 |
| | 80+ years | 3,982,452 | 313 | 167 | 53.4 | 152 | 48.6 |
| Female | All Ages | 156,957,918 | 6,750 | 4,900 | 72.6 | 4,687 | 69.4 |
| | < 1 year | 1,805,725 | 239 | 199 | 83.3 | 194 | 81.2 |
| | 1-5 years | 10,012,290 | 731 | 608 | 83.2 | 574 | 78.5 |
| | 6-11 years | 12,039,176 | 821 | 650 | 79.2 | 621 | 75.6 |
| | 12-15 years | 8,057,704 | 392 | 318 | 81.1 | 299 | 76.3 |
| | 16-19 years | 8,487,006 | 414 | 305 | 73.7 | 300 | 72.5 |
| | 20-29 years | 21,110,277 | 678 | 484 | 71.4 | 466 | 68.7 |
| | 30-39 years | 20,024,826 | 648 | 482 | 74.4 | 465 | 71.8 |
| | 40-49 years | 21,788,108 | 650 | 471 | 72.5 | 459 | 70.6 |
| | 50-59 years | 21,779,397 | 656 | 478 | 72.9 | 461 | 70.3 |
| | 60-69 years | 16,007,034 | 679 | 449 | 66.1 | 425 | 62.6 |
| | 70-79 years | 9,201,222 | 453 | 260 | 57.4 | 249 | 55.0 |
| | 80+ years | 6,645,153 | 389 | 196 | 50.4 | 174 | 44.7 |

4.2 Variable Selection

There are many environmental and socio-economic factors that may be considered risk factors for Asthma and Diabetes. This section provides a detailed description of the risk factors used to predict the likeliness of Asthma and the extent of Diabetes. Data from four categories, i.e. Demographics, Examination, Laboratory and Questionnaire, is used for this research, as Dietary data was not available.

Most chronic conditions share common risk factors. Some risk factors, such as age, gender and ethnicity, cannot be changed over time; other behavioural or environmental risk factors, such as alcohol consumption, smoking habits and physical activity, can be changed over time if proper measures are taken. The recognition of such risk factors is the basis of this research.

4.2.1 Initial Set

In the initial stage we selected the parameters that were relevant to different chronic conditions; later we divided the parameters into two sets, one for Asthma prediction and the other for Diabetes prediction.

Tables 4.2 to 4.5 list the parameters selected in the initial stage. Each table displays the Variable Name, the unique name given to the parameter in NHANES; the SAS Label, the question corresponding to the variable; the Data File Name, the name of the file in which the description of the parameter is stored; and the Doc File, the id of the document file.

Demographic:

Demographic data includes variables that cover the whole society; it helps in putting people into different categories such as age, gender and race. Table 4.2 shows the list of variables that we have included in our research from the demographic section of NHANES.

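Table 4.2 Variable considered from Demographics

| Variable Name | SAS Label | Data File Name | Doc File |
|---|---|---|---|
| RIAGENDR | Gender | Demographic Variables and Sample Weights | DEMO_G |
| RIDAGEYR | Age in years at screening | Demographic Variables and Sample Weights | DEMO_G |
| RIDRETH3 | Race/Hispanic origin w/ NH Asian | Demographic Variables and Sample Weights | DEMO_G |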

Examination:

It includes variables related to the basic physical examination of an individual, such as blood pressure, height, weight, BMI and injury-related questions. Table 4.3 shows the list of variables selected for this research.

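Table 4.3 Variable Considered from Examination

| Variable Name | SAS Label | Data File Name | Doc File |
|---|---|---|---|
| BPXSY1 | Systolic: Blood Pres (1st rdg) mm Hg | Blood Pressure | BPX_G |
| BPXDI1 | Diastolic: Blood Pres (1st rdg) mm Hg | Blood Pressure | BPX_G |
| BPXSY2 | Systolic: Blood Pres (2nd rdg) mm Hg | Blood Pressure | BPX_G |
| BPXDI2 | Diastolic: Blood Pres (2nd rdg) mm Hg | Blood Pressure | BPX_G |
| BPXSY3 | Systolic: Blood Pres (3rd rdg) mm Hg | Blood Pressure | BPX_G |
| BPXDI3 | Diastolic: Blood Pres (3rd rdg) mm Hg | Blood Pressure | BPX_G |
| BPXSY4 | Systolic: Blood Pres (4th rdg) mm Hg | Blood Pressure | BPX_G |
| BPXDI4 | Diastolic: Blood Pres (4th rdg) mm Hg | Blood Pressure | BPX_G |
| BMXWT | Weight (kg) | Body Measures | BMX_G |
| BMXHT | Standing Height (cm) | Body Measures | BMX_G |
| BMXBMI | Body Mass Index (kg/m**2) | Body Measures | BMX_G |
| BMXWAIST | Waist Circumference (cm) | Body Measures | BMX_G |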

Laboratory:

It includes variables that correspond to clinical measures. Table 4.4 shows the list of variables considered for this research.

Table 4.4 Variables considered from Laboratory

| Variable Name | SAS Label | Data File Name | Doc File |
|---|---|---|---|
| LBDHDD | Direct HDL-Cholesterol (mg/dL) | HDL-Cholesterol | HDL_G |
| LBXGLU | Fasting Glucose (mg/dL) | Plasma Fasting Glucose and Insulin | GLU_G |
| LBXTC | Total Cholesterol (mg/dL) | Total Cholesterol | TCHOL_G |
| LBXTR | Triglyceride (mg/dL) | Triglycerides and LDL-Cholesterol | TRIGLY_G |
| LBLDL | LDL-cholesterol (mg/dL) | Triglycerides and LDL-Cholesterol | TRIGLY_G |
| URXUMS | Albumin, urine (mg/L) | Urinary Albumin and Urinary Creatinine | ALB_CR_G |

Questionnaire:

This component consists of a set of questions covering physical activity, alcohol use, the environment and related topics. Table 4.5 shows the list of variables selected for this research.

Table 4.5 Variables considered from Questionnaire

| Variable Name | SAS Label | Data File Name | Doc File |
|---|---|---|---|
| ALQ101 | Had at least 12 alcohol drinks/1 yr? | Alcohol Use | ALQ_G |
| ALQ120Q | How often drink alcohol over past 12 months | Alcohol Use | ALQ_G |
| ALQ130 | Avg # alcoholic drinks/day – past 12 mos | Alcohol Use | ALQ_G |
| ALQ141Q | # days have 4/5 drinks – past 12 mos | Alcohol Use | ALQ_G |
| ALQ151 | Ever have 4/5 or more drinks everyday? | Alcohol Use | ALQ_G |
| PAQ605 | Vigorous work activity | Physical Activity | PAQ_G |
| PAQ620 | Moderate work activity | Physical Activity | PAQ_G |
| PAQ635 | Walk or bicycle | Physical Activity | PAQ_G |
| PAQ650 | Vigorous recreational activities | Physical Activity | PAQ_G |
| PAQ665 | Moderate recreational activities | Physical Activity | PAQ_G |
| PAD680 | Minutes sedentary activity | Physical Activity | PAQ_G |
| PAQ710 | Hours watch TV or videos past 30 days | Physical Activity | PAQ_G |
| PAQ715 | Hours use computer past 30 days | Physical Activity | PAQ_G |
| SLD010H | How much sleep do you get (hours)? | Sleep Disorders | SLQ_G |
| SMQ020 | Smoked at least 100 cigarettes in life | Smoking – Cigarette Use | SMQ_G |
| SMQ030 | Age started smoking cigarettes regularly | Smoking – Cigarette Use | SMQ_G |
| SMQ040 | Do you now smoke cigarettes? | Smoking – Cigarette Use | SMQ_G |
| SMQ050Q | How long since quit smoking cigarettes? | Smoking – Cigarette Use | SMQ_G |
| SMD055 | Age last smoked cigarettes regularly | Smoking – Cigarette Use | SMQ_G |
| SMD057 | # cigarettes smoked per day when quit | Smoking – Cigarette Use | SMQ_G |
| SMD641 | # days smoked cigs during past 30 days | Smoking – Cigarette Use | SMQ_G |
| SMD650 | Avg # cigarettes/day during past 30 days | Smoking – Cigarette Use | SMQ_G |
| SMD410 | Does anyone smoke inside home? | Smoking – Cigarette Use | SMQ_G |
| SMD410 | Does anyone smoke inside home? | Smoking Household disorders | SMQFAM_G |
| SMD415 | Total # of smokers inside home | Smoking Household disorders | SMQFAM_G |
| SMD415A | Total # cigarette smokers inside home | Smoking Household disorders | SMQFAM_G |
| SMD430 | Total # cigarettes smoked inside home | Smoking Household disorders | SMQFAM_G |
| MCQ300a | Close relative had heart attack? | Medical Conditions | MCQ_G |
| MCQ300b | Close relative had asthma? | Medical Conditions | MCQ_G |
| MCQ300c | Close relative had diabetes? | Medical Conditions | MCQ_G |
| OCQ510 | Ever had work exposure to mineral dusts? | Occupation Questionnaire | OCQ_G |
| OCQ520 | # of years exposed to mineral dusts | Occupation Questionnaire | OCQ_G |
| OCQ530 | Ever had work exposure to organic dusts? | Occupation Questionnaire | OCQ_G |
| OCQ540 | # of years exposed to organic dusts | Occupation Questionnaire | OCQ_G |
| OCQ550 | Ever exposed to exhaust fumes at work? | Occupation Questionnaire | OCQ_G |
| OCQ560 | # of years exposed to exhaust fumes | Occupation Questionnaire | OCQ_G |
| OCQ570 | Ever had work exposure to other fumes? | Occupation Questionnaire | OCQ_G |
| OCQ580 | # of years exposed to other fumes | Occupation Questionnaire | OCQ_G |
| DIQ010 | Doctor told you have diabetes | Diabetes | DIQ_G |
| DIQ160 | Ever told you have prediabetes | Diabetes | DIQ_G |
| MCQ010 | Ever been told you have asthma | Medical Conditions | MCQ_G |

4.2.2 Asthma Risk Factors:

In order to build predictive models for Asthma we have considered 40 attributes which

are divided into different categories such as demographics, blood pressure, body

measures, Physical Activity, Alcohol Use, Smoking – Cigarette use, Environment,

Others.

1. Demographics

Many studies have suggested that demographic details play an important role in the prevalence of asthma; Mexican Americans show a lower prevalence of asthma compared to other ethnic groups [25]. Asthma is found to occur more frequently in boys than in girls during childhood; in young adults, the rate of asthma is found to be the same for males and females. Females are more likely to have asthma once they cross 40 years of age.

Table 4.6 Demographics variables for asthma prediction

| Variable | SAS Label | Code or Value | Value Description |
|---|---|---|---|
| RIAGENDR | Gender | 1 | Male |
| | | 2 | Female |
| RIDAGEYR | Age in years at screening | 0 to 79 | Range of Values |
| | | 80 | 80 years of age and over |
| RIDRETH3 | Race/Hispanic origin w/ NH Asian | 1 | Mexican American |
| | | 2 | Other Hispanic |
| | | 3 | Non-Hispanic White |
| | | 4 | Non-Hispanic Black |
| | | 6 | Non-Hispanic Asian |
| | | 7 | Other Race - Including Multi-Racial |

2. Blood Pressure

According to the Asthma and Allergy Foundation of America, most asthma patients are diagnosed with high blood pressure. NHANES provides the blood pressure details as four readings. For our research we have used the average of all the readings while building the prediction models.
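
The averaging step can be sketched as follows. How missing readings were treated is not stated above; skipping them is an assumption made for this illustration, and the function name and reading values are hypothetical.

```python
from typing import Optional, Sequence


def mean_reading(readings: Sequence[Optional[float]]) -> Optional[float]:
    """Average the available readings; return None if all are missing.

    Skipping missing readings is an assumption of this sketch.
    """
    present = [r for r in readings if r is not None]
    return sum(present) / len(present) if present else None


# Hypothetical systolic readings (BPXSY1..BPXSY4) with the second one missing.
systolic = mean_reading([120.0, None, 124.0, 122.0])
print(systolic)
```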

Table 4.7 Blood Pressure variables for asthma prediction

| Variable | SAS Label | Code or Value | Value Description |
|---|---|---|---|
| BPXSY1 | Systolic: Blood pres (1st rdg) mm Hg | 74 to 238 | Range of Values |
| BPXDI1 | Diastolic: Blood pres (1st rdg) mm Hg | 0 to 120 | Range of Values |
| BPXSY2 | Systolic: Blood pres (2nd rdg) mm Hg | 74 to 234 | Range of Values |
| BPXDI2 | Diastolic: Blood pres (2nd rdg) mm Hg | 0 to 134 | Range of Values |
| BPXSY3 | Systolic: Blood pres (3rd rdg) mm Hg | 74 to 232 | Range of Values |
| BPXDI3 | Diastolic: Blood pres (3rd rdg) mm Hg | 0 to 128 | Range of Values |
| BPXSY4 | Systolic: Blood pres (4th rdg) mm Hg | 78 to 226 | Range of Values |
| BPXDI4 | Diastolic: Blood pres (4th rdg) mm Hg | 0 to 130 | Range of Values |

3. Body Measures

Obesity is found to be a critical indicator of Asthma. Research by Papoutsakis et al. [26] suggested that people with a high body mass index are more likely to have asthma compared to people with a normal body mass index. It has also been found that women with a high waist circumference are more prone to asthma even if they have a normal BMI [27].

Table 4.8 Body Measure variables for asthma prediction

| Variable | SAS Label | Code or Value | Value Description |
|---|---|---|---|
| BMXBMI | Body Mass Index (kg/m2) | 12.4 to 82.1 | Range of Values |
| BMXWAIST | Waist Circumference | 38.7 to 176 | Range of Values |

4. Physical Activity

Physical activity and exercise play a vital role in a healthy life. Many studies have shown that people with higher physical activity are less likely to have asthma [28]. However, in certain conditions narrowing of the airways in the lungs can also be triggered by highly strenuous physical activity or exercise; such cases are known as Exercise-Induced Asthma.

Table 4.9 Physical Activity variables for asthma prediction

| Variable | SAS Label | Code or Value | Value Description |
|---|---|---|---|
| PAQ605 | Vigorous work activity | 1 | Yes |
| | | 2 | No |
| PAQ620 | Moderate work activity | 1 | Yes |
| | | 2 | No |
| PAQ625 | Number of days moderate work | 1 to 7 | Range of Values |
| PAQ635 | Walk or bicycle | 1 | Yes |
| | | 2 | No |
| PAQ640 | Number of days walk or bicycle | 1 to 7 | Range of Values |
| PAQ650 | Vigorous recreational activities | 1 | Yes |
| | | 2 | No |
| PAQ665 | Moderate recreational activities | 1 | Yes |
| | | 2 | No |
| PAQ670 | Days moderate recreational activities | 1 to 7 | Range of Values |
| PAD680 | Minutes sedentary activity | 0 to 1380 | Range of Values |
| PAQ710 | Hours watch TV or videos past 30 days | 0 | Less than 1 hour |
| | | 1 | 1 hour |
| | | 2 | 2 hours |
| | | 3 | 3 hours |
| | | 4 | 4 hours |
| | | 5 | 5 hours |
| | | 8 | Do not watch TV or Videos |
| PAQ715 | Hours use computer past 30 days | 0 | Less than 1 hour |
| | | 1 | 1 hour |
| | | 2 | 2 hours |
| | | 3 | 3 hours |
| | | 4 | 4 hours |
| | | 5 | 5 hours |
| | | 8 | Do not watch TV or Videos |

5. Alcohol Use

Excessive alcohol intake has long been known to impair the lungs. According to

Joseph H. Sisson, “Brief exposure to mild concentrations of alcohol may enhance

mucociliary clearance, stimulates bronchodilation and probably attenuates the airway

inflammation and injury observed in asthma” [29].

Table 4.10 Alcohol use variable for asthma prediction

| Variable | SAS Label | Code or Value | Value Description |
|---|---|---|---|
| ALQ101 | Had at least 12 alcohol drinks/1 yr? | 1 | Yes |
| | | 2 | No |
| ALQ120Q | How often drink alcohol over past 12 mos | 0 to 350 | Range of Values |
| ALQ130 | Avg # alcoholic drinks/day - past 12 mos | 1 to 82 | Range of Values |
| ALQ151 | Ever have 4/5 or more drinks every day? | 1 | Yes |
| | | 2 | No |

6. Smoking – Cigarette use

Smoking is a common risk factor for asthma. It is divided into two categories: active smoking and passive smoking. It has been observed that passive smoking can trigger symptoms of asthma in individuals suffering from the disease. Many researchers have suggested that disease control is poorer in asthmatic patients who smoke compared to those who do not [30].

Table 4.11 Smoking related variables for asthma prediction

Smoking – Cigarette use
| Variable | SAS Label | Code or Value | Value Description |
| SMQ020 | Smoked at least 100 cigarettes in life | 1 | Yes |
| | | 2 | No |
| SMD030 | Age started smoking cigarettes regularly | 6 to 72 | Range of Values |
| SMQ040 | Do you now smoke cigarettes | 1 | Every day |
| | | 2 | Some days |
| | | 3 | Not at all |
| SMQ050Q | How long since quit smoking cigarettes | 1 to 193 | Range of Values |
| SMD055 | Age last smoked cigarettes regularly | 13 to 78 | Range of Values |
| SMD057 | # cigarettes smoked per day when quit | 2 to 90 | Range of Values |
| SMD641 | # days smoked cigs during past 30 days | 0 to 30 | Range of Values |
| SMD650 | Avg # cigarettes/day during past 30 days | 1 to 94 | Range of Values |
| | | 95 | 95 or more |
| SMD410 | Does anyone smoke inside home? | 1 | Yes |
| | | 2 | No |

7. Environmental Factors

Home and work environments play an important role in the prevalence of asthma. Mineral dust (dust from sand, coal and soil), organic dust (dust from flour, cotton, animals and plants) and exhaust fumes from engines, machinery, trucks and buses are found to be major causes of asthma in adults. Environmental factors not only increase the chances of asthma but also obstruct disease control in patients who already have the disease.

Table 4.12 Environmental Factors for asthma prediction

Environmental Factors
| Variable | SAS Label | Code or Value | Value Description |
| OCQ510 | Ever had work exposure to mineral dusts? | 1 | Yes |
| | | 2 | No |
| OCQ520 | # of years exposed to mineral dusts | 0 to 65 | Range of Values |
| OCQ530 | Ever had work exposure to organic dusts? | 1 | Yes |
| | | 2 | No |
| OCQ550 | Ever exposed to exhaust fumes at work? | 1 | Yes |
| | | 2 | No |
| OCQ570 | Ever had work exposure to other fumes? | 1 | Yes |
| | | 2 | No |
| OCQ580 | # of years exposed to other fumes | 0 to 65 | Range of Values |

8. Others

Many psychological and genetic factors are recognized to influence the onset of asthma. Studies have shown that people with asthma feel lonely more often than other people [31]. Burke W. et al. suggested that the risk of asthma increases when there is a positive family history. MCQ010, i.e. “Ever been told you have asthma”, is the class variable we used for supervised learning.

Table 4.13 Other variables for asthma prediction

Others
| Variable | SAS Label | Code or Value | Value Description |
| MCQ300B | Close Relative had asthma | 1 | Yes |
| | | 2 | No |
| DPQ020 | Feeling down, depressed or hopeless | 0 | Not at all |
| | | 1 | Several days |
| | | 2 | More than half the days |
| | | 3 | Nearly every day |
| MCQ010 | Ever been told you have asthma | 1 | Yes |
| | | 2 | No |

4.2.3 Diabetes Risk Factors

In order to build models to predict the extent of diabetes, we used data consisting of 33 attributes. The attributes are divided into 8 major categories: Demographics, Blood Pressure, Body Measures, Physical Activity, Smoking – Cigarette Use, Alcohol Use, Laboratory and Others.

1. Demographics

Many studies have shown that the risk of diabetes increases as a person gets older, especially after 45 years of age. According to the American Diabetes Association, the risk of diabetes in African Americans, Mexican Americans, American Indians and Asian Americans is very high because these populations are more likely to have high blood pressure and high BMI. In 2012, a CDC survey estimated 86 million prediabetes cases among the population aged 20 years or older.

Table 4.14 Demographics variables for diabetes prediction

Demographics
| Variable | SAS Label | Code or Value | Value Description |
| RIAGENDR | Gender | 1 | Male |
| | | 2 | Female |
| RIDAGEYR | Age in years at screening | 0 to 79 | Range of Values |
| | | 80 | 80 years of age and over |
| RIDRETH3 | Race/Hispanic origin w/ NH Asian | 1 | Mexican American |
| | | 2 | Other Hispanic |
| | | 3 | Non-Hispanic White |
| | | 4 | Non-Hispanic Black |
| | | 6 | Non-Hispanic Asian |
| | | 7 | Other Race - Including Multi-Racial |

2. Blood Pressure

Hypertension is one of the major factors that can worsen the complications of diabetes. Most people with diabetes are also diagnosed with high blood pressure [32].

Table 4.15 Blood Pressure variables for diabetes prediction

Blood Pressure
| Variable | SAS Label | Code or Value | Value Description |
| BPXSY1 | Systolic: Blood pres (1st rdg) mm Hg | 74 to 238 | Range of Values |
| BPXDI1 | Diastolic: Blood pres (1st rdg) mm Hg | 0 to 120 | Range of Values |
| BPXSY2 | Systolic: Blood pres (2nd rdg) mm Hg | 74 to 234 | Range of Values |
| BPXDI2 | Diastolic: Blood pres (2nd rdg) mm Hg | 0 to 134 | Range of Values |
| BPXSY3 | Systolic: Blood pres (3rd rdg) mm Hg | 74 to 232 | Range of Values |
| BPXDI3 | Diastolic: Blood pres (3rd rdg) mm Hg | 0 to 128 | Range of Values |
| BPXSY4 | Systolic: Blood pres (4th rdg) mm Hg | 78 to 226 | Range of Values |
| BPXDI4 | Diastolic: Blood pres (4th rdg) mm Hg | 0 to 130 | Range of Values |

3. Body Measures

Many researchers have found that the risk of diabetes increases with BMI [32]; overweight people are more likely to have diabetes than their counterparts.

Table 4.16 Body measure variables for diabetes prediction

Body Measures
| Variable | SAS Label | Code or Value | Value Description |
| BMXBMI | Body Mass Index (kg/m2) | 12.4 to 82.1 | Range of Values |
| BMXWAIST | Waist Circumference | 38.7 to 176 | Range of Values |

4. Physical Activity

Physical activity helps control blood glucose, HDL cholesterol, blood pressure and triglycerides, resulting in a lower risk of diabetes. Many risk factors, such as BMI and waist circumference, are directly related to physical activity, which makes it one of the major risk factors involved in many chronic diseases.

Table 4.17 Physical Activity variables for diabetes prediction

Physical Activity
| Variable | SAS Label | Code or Value | Value Description |
| PAQ605 | Vigorous work activity | 1 | Yes |
| | | 2 | No |
| PAQ620 | Moderate work activity | 1 | Yes |
| | | 2 | No |
| PAQ625 | Number of days moderate work | 1 to 7 | Range of Values |
| PAQ635 | Walk or bicycle | 1 | Yes |
| | | 2 | No |
| PAQ650 | Vigorous recreational activities | 1 | Yes |
| | | 2 | No |
| PAQ665 | Moderate recreational activities | 1 | Yes |
| | | 2 | No |
| PAD680 | Minutes sedentary activity | 0 to 1380 | Range of Values |
| PAQ710 | Hours watch TV or videos past 30 days | 0 | Less than 1 hour |
| | | 1 | 1 hour |
| | | 2 | 2 hours |
| | | 3 | 3 hours |
| | | 4 | 4 hours |
| | | 5 | 5 hours |
| | | 8 | Do not watch TV or videos |
| PAQ715 | Hours use computer past 30 days | 0 | Less than 1 hour |
| | | 1 | 1 hour |
| | | 2 | 2 hours |
| | | 3 | 3 hours |
| | | 4 | 4 hours |
| | | 5 | 5 hours |
| | | 8 | Do not use a computer |

5. Smoking – Cigarette Use

Research conducted by Julie C. Will [33] shows that the rate of diabetes increases with smoking. Men who smoked had a 45% higher diabetes rate than men who had never smoked, which makes smoking an important indicator of diabetes.

Table 4.18 Smoking related variables for diabetes prediction

Smoking – Cigarette use
| Variable | SAS Label | Code or Value | Value Description |
| SMQ020 | Smoked at least 100 cigarettes in life | 1 | Yes |
| | | 2 | No |
| SMD030 | Age started smoking cigarettes regularly | 6 to 72 | Range of Values |
| SMQ040 | Do you now smoke cigarettes | 1 | Every day |
| | | 2 | Some days |
| | | 3 | Not at all |
| | | 95 | 95 or more |
| SMD410 | Does anyone smoke inside home? | 1 | Yes |
| | | 2 | No |

6. Alcohol Use

Alcohol consumption has become an important risk factor for diabetes. Many researchers have found that moderate alcohol intake is associated with a reduced risk of diabetes [34], whereas heavy alcohol intake increases the risk.

Table 4.19 Alcohol use variables for diabetes prediction

Alcohol Consumption
| Variable | SAS Label | Code or Value | Value Description |
| ALQ101 | Had at least 12 alcohol drinks/1 yr? | 1 | Yes |
| | | 2 | No |
| ALQ120Q | How often drink alcohol over past 12 mos | 0 to 350 | Range of Values |
| ALQ130 | Avg # alcoholic drinks/day - past 12 mos | 1 to 82 | Range of Values |
| ALQ141Q | # days have 4/5 drinks - past 12 mos | 0 to 220 | Range of Values |
| ALQ151 | Ever have 4/5 or more drinks every day? | 1 | Yes |
| | | 2 | No |

7. Laboratory

Table 4.20 lists the clinical variables that can be considered important factors in measuring the risk of diabetes.

Table 4.20 Laboratory variables for diabetes prediction

Laboratory
| Variable | SAS Label | Code or Value | Value Description |
| LBDHDD | Direct HDL-Cholesterol (mg/dL) | 14 to 175 | Range of Values |
| LBXTR | Triglyceride (mg/dL) | 18 to 1562 | Range of Values |
| URXUMS | Albumin, urine (mg/L) | 0.21 to 14800 | Range of Values |
| LBXGLU | Fasting Glucose (mg/dL) | 39 to 382 | Range of Values |
| LBXIN | Insulin (uU/mL) | 0.14 to 647.5 | Range of Values |

8. Others

Prediabetes, in which the blood sugar level is higher than normal but not in the diabetes range, is an important indicator of diabetes. Genetics also plays an important role in developing diabetes. According to the American Diabetes Association, people with a family history of the disease have a higher chance of developing diabetes than other people. DIQ010, i.e. ‘ever been told you have diabetes’, is the class variable we selected for supervised learning.

Table 4.21 Other variables for diabetes prediction

Others
| Variable | SAS Label | Code or Value | Value Description |
| MCQ300C | Close Relative had diabetes | 1 | Yes |
| | | 2 | No |
| DIQ160 | Ever told you have prediabetes | 1 | Yes |
| | | 2 | No |
| DIQ010 | Ever been told you have diabetes | 1 | Yes |
| | | 2 | No |
| | | 3 | Borderline |

4.3 Data Pre-Processing

Real-world data is often inconsistent and incomplete, and is likely to contain errors. In this section we discuss the processing steps taken to prepare the data for better accuracy. Initially, 9756 records, including infants, children and adolescents, were considered from NHANES; each observation includes values for the variables selected in the previous stage. Since the main purpose of the research is to develop a system for adults, we selected the records of individuals aged 18 or above; this left us with 5864 records. On further analysis we found that the data set contained too many observations with the class value ‘No’ (MCQ010 in the case of asthma and DIQ010 in the case of diabetes). Hence, to avoid the problem of overfitting, we deleted records that had the class label ‘No’ and too many missing values.
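As an illustration only, the filtering described above could be sketched in Python as follows. The record layout (one dictionary per respondent), the helper name `preprocess` and the `max_missing` threshold are assumptions made for this sketch; the exact cut-off for "too many missing values" is not stated here, and reading the deletion rule as applying to records that are both labelled ‘No’ (code 2) and sparse is one possible interpretation.

```python
def preprocess(records, class_var, max_missing=5):
    """Keep adult records; drop 'No' records with too many missing values.

    records     -- list of dicts mapping NHANES variable names to values
                   (None marks a missing value; hypothetical layout)
    class_var   -- name of the class variable, e.g. "MCQ010" or "DIQ010"
    max_missing -- hypothetical threshold, not taken from the thesis
    """
    kept = []
    for rec in records:
        age = rec.get("RIDAGEYR")
        if age is None or age < 18:
            continue  # the system targets adults only
        n_missing = sum(1 for value in rec.values() if value is None)
        if rec.get(class_var) == 2 and n_missing > max_missing:
            continue  # code 2 = 'No': prune sparse majority-class records
        kept.append(rec)
    return kept
```

Applying such a filter separately with `class_var="MCQ010"` and `class_var="DIQ010"` would give one reduced data set per disease.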

The number of instances used for training and testing the prediction models for Asthma and Diabetes is given below:

Asthma:

The training set comprises 1951 instances, including 1135 observations with class label ‘No’ and 816 observations with class label ‘Yes’, whereas the testing set comprises 200 instances, including 143 observations with class label ‘No’ and 57 observations with class label ‘Yes’.

Diabetes:

The training set comprises 1525 instances, including 780 observations with class label ‘No’, 111 observations with class label ‘Borderline’ and 634 observations with class label ‘Yes’, whereas the testing set comprises 300 instances, including 220 observations with class label ‘No’, 8 observations with class label ‘Borderline’ and 72 observations with class label ‘Yes’.

4.4 Data Transformation

The next step is to convert the processed data into a format that can be used by Weka, the data mining tool we used to build the prediction models. This step transforms the data into the Attribute-Relation File Format (.arff), which represents a dataset as a relation made up of attributes, or columns of data [35].
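To make the target format concrete, the following Python sketch writes a small ARFF document: a `@relation` line, one `@attribute` line per column (numeric, or nominal with its labels in braces), and a `@data` section in which missing values are written as `?`. The function name `to_arff` and the two-attribute example are assumptions made for illustration; they are not the conversion code used in this work.

```python
def to_arff(relation, attributes, rows):
    """Render a dataset as ARFF text.

    attributes -- list of (name, kind) pairs, where kind is the string
                  "numeric" or a list of nominal labels
    rows       -- list of value lists; None marks a missing value
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        # Nominal attributes list their allowed labels in braces.
        spec = kind if isinstance(kind, str) else "{" + ",".join(kind) + "}"
        lines.append(f"@attribute {name} {spec}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


example = to_arff(
    "asthma",
    [("RIDAGEYR", "numeric"), ("MCQ010", ["Yes", "No"])],
    [[34, "Yes"], [None, "No"]],
)
```

Writing `example` to a file with the `.arff` extension gives a file of the kind Weka loads; attribute names or labels containing spaces would additionally need quoting, which this sketch does not handle.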

5 SYSTEM ARCHITECTURE AND IMPLEMENTATION

In this chapter, we discuss the overall architecture and implementation of the system proposed in this thesis. We describe the complete process of converting the input provided by the user into a predicted numerical value in the form of ‘Likeliness of having Asthma’ and ‘Extent of Diabetes’. The chapter also includes a description of Weka, the data mining tool used by the system to generate predictions.

5.1 System Architecture

Figures 5.1 and 5.2 illustrate how the proposed system generates predictions for Asthma and Diabetes. The user is equipped with a device capable of collecting data from the user; it can be a desktop, a laptop or any mobile device. This device collects the data from the user and sends it to the system over the network; the data is the list of parameters described in the previous chapter. The system consists of three main blocks: Input Conversion, Neural Network and Mathematical Computation.

The Input Conversion block collects the data from the device and converts it into a format that can be used by the neural network models. It creates an instance from the collected data and passes it to the Neural Network block for classification.

Figure 5.1 System Architecture for Asthma Prediction

The Neural Network block comprises the 5 prediction models described in Chapter 3; all the models are trained on the training set described in Chapter 4. Once the models are trained, the input instance from the previous block is passed through each model individually, and each model classifies the instance.

For Asthma, each model generates a numeric value: 1 when the instance is classified as ‘No’, meaning ‘not likely to have asthma’, and 2 when the instance is classified as ‘Yes’, meaning ‘likely to have asthma’.

For Diabetes, each model generates a numeric value: 1 when the instance is classified as ‘No’, meaning ‘not likely to have diabetes’; 2 when the instance is classified as ‘Borderline’, meaning ‘likely to have borderline diabetes’; and 3 when the instance is classified as ‘Yes’, meaning ‘likely to have diabetes’.

Figure 5.2 System Architecture for Diabetes Prediction

The Mathematical Computation block comprises two main components. The first component calculates the mean of the outcomes obtained from all 5 models. Since only two values are possible for Asthma (1 or 2), the mean can range only from 1 to 2; similarly, since only three values are possible for Diabetes (1, 2 or 3), the mean can range only from 1 to 3. The second component converts the mean obtained from the first component to the required scale: for Asthma, it converts the mean from the 1-2 scale to 0-1, and for Diabetes, from the 1-3 scale to 0-2. It then multiplies the rescaled mean by 100 to generate the likeliness of having asthma and the extent of diabetes as a percentage.
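The two components of the Mathematical Computation block can be sketched in a few lines of Python. This is a minimal illustration rather than the deployed Java code, and the function name `ensemble_score` is an assumption. It follows the description literally: average the class codes, shift the scale so that it starts at 0, and multiply by 100. Taken literally, the Diabetes output therefore spans 0 to 200, because the rescaled mean lies between 0 and 2.

```python
def ensemble_score(predictions, n_classes):
    """Combine per-model class codes into a single percentage-style score.

    predictions -- class codes (1..n_classes), one per prediction model
    n_classes   -- 2 for Asthma (No/Yes), 3 for Diabetes (No/Borderline/Yes)
    """
    if not predictions:
        raise ValueError("need at least one model prediction")
    if any(p not in range(1, n_classes + 1) for p in predictions):
        raise ValueError("class code out of range")
    mean = sum(predictions) / len(predictions)  # lies in [1, n_classes]
    rescaled = mean - 1                         # shift to [0, n_classes - 1]
    return rescaled * 100                       # multiply by 100 as described


# Three of five models answer 'Yes' (code 2) for Asthma.
asthma_score = ensemble_score([2, 2, 1, 1, 2], n_classes=2)
```

In the Asthma example the mean is 1.6, so the score is about 60: three of the five models classified the instance as ‘Yes’.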

5.2 Implementation

To better manage Asthma and Diabetes, the proposed system must be usable from anywhere; hence we developed a web-based application of the system proposed in the previous section. The system is built with JSF (JavaServer Faces), a Java specification for building component-based user interfaces for web applications. To store the data and effectively track health over a period of time, we used a MySQL database. To use the system, the user logs in with the email ID and password generated at the time of registration. Each time a user calculates a score, all the details are stored in the database, from which they can be retrieved whenever needed to provide better health care over time. To build the models described previously and generate the predictions, we used Weka, a Java-based data mining tool. Weka trains the models only when the user calculates the score for the first time; afterwards the trained models are reused, which makes the predictions faster.

5.2.1 Weka

Weka is a workbench for machine learning algorithms [36] written in Java. It supports data pre-processing, regression, classification, clustering, visualization and association rules. It is open-source software issued under the GNU General Public License. It offers three modes of operation: GUI, command line and Java API.

Below are a few screenshots of the GUI mode:

Figure 5.3 Weka Preprocessing stage

Figure 5.4 Weka classification stage

Figure 5.5 ROC curve generated in weka

In this system we used the Java API provided by Weka, a collection of classes and interfaces for incorporating machine learning into Java code. It provides each prediction model as a class that can be integrated into Java code in an object-oriented manner. Each class offers methods to perform operations on a model, such as building the classifier and classifying an instance.

5.3 Screenshots

This section consists of screenshots of the proposed system.

Figure 5.6 Login Page

Figure 5.7 Home Page

Figure 5.8 Asthma Calculator Page1

Figure 5.9 Asthma Calculator Page2

Figure 5.10 Asthma Calculator Page3

Figure 5.11 Asthma Calculator Page4

Figure 5.12 Results Generated for Asthma

Figure 5.13 View Asthma History Page

Figure 5.14 Asthma Record Page1

Figure 5.15 Asthma Record Page2

Figure 5.16 Asthma Dashboard

6 RESULTS AND ANALYSIS

This chapter presents the results obtained from Weka for all five models. The results of each model are recorded, and an analysis compares the predictive power of the 5 competing models on three important measures: Accuracy, Root Mean Squared Error (RMSE) and Area under ROC. Accuracy is the percentage of instances that are correctly classified. RMSE is the square root of the average of the squared errors, i.e. the differences between the actual class and the predicted class. ROC (receiver operating characteristic) is a graphical representation of the performance of a classifier; it plots the true positive rate against the false positive rate. The area under the ROC curve ranges from 0 to 1, where 1 indicates a perfect test and 0.5 indicates a test that performs no better than chance. The analysis also includes the confusion matrix, a table layout for visualizing the performance of a model. In a typical confusion matrix, each column represents the instances in a predicted class and each row represents the instances in an actual class. In predictive analytics, a confusion matrix reports the number of true positives, false positives, false negatives and true negatives:

| | Predicted Positive | Predicted Negative |
| Actual Positive | True Positives | False Negatives |
| Actual Negative | False Positives | True Negatives |

To better understand the terminology, consider a test that screens people for asthma. Each person either has asthma or does not. The test result can be either positive (meaning the person has asthma) or negative (meaning the person does not have asthma).

In this case, a True Positive is a person with asthma who is correctly diagnosed with asthma; a False Positive is a person without asthma who is incorrectly diagnosed with asthma; a True Negative is a person without asthma who is correctly identified as not having asthma; and a False Negative is a person with asthma who is incorrectly identified as not having asthma.

The true positive rate is also known as Sensitivity, and the true negative rate is also known as Specificity.
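These definitions translate directly into arithmetic on the four cells of a two-class confusion matrix. The short Python sketch below is illustrative only; the function name `binary_metrics` is an assumption, and Weka reports these quantities itself.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity and specificity from a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,    # share of correct classifications
        "sensitivity": tp / (tp + fn),    # true positive rate
        "specificity": tn / (tn + fp),    # true negative rate
    }


# J48 counts for Asthma from Table 6.6, treating 'Yes' as the positive class.
j48_asthma = binary_metrics(tp=26, fp=17, fn=31, tn=126)
```

With the J48 counts from Table 6.6 the accuracy is 152/200 = 0.76, which agrees with Table 6.1, while the sensitivity is 26/57 and the specificity is 126/143. Accuracy alone therefore hides how often the asthma cases themselves are missed.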

6.1 Analysis of Results obtained for Asthma

Table 6.1 shows an overview of the results obtained from all 5 models when all models are trained on 1951 instances. The J48 decision tree achieved the highest accuracy. At the class level, however, the Multilayer Perceptron and Logistic Regression models performed marginally better, with a larger area under the ROC curve.

Table 6.1 Asthma Results in terms of Accuracy, RMSE and ROC area for all the 5
models

| Measure | Bayes Network | Naïve Bayes | Multilayer Perceptron | Logistic Regression | J48 |
| Total Number of Instances | 200 | 200 | 200 | 200 | 200 |
| Correctly Classified Instances | 144 | 139 | 141 | 149 | 152 |
| Incorrectly Classified Instances | 56 | 61 | 59 | 51 | 48 |
| Accuracy | 72.00% | 69.50% | 70.50% | 74.50% | 76% |
| Root mean squared error | 0.4451 | 0.4685 | 0.5099 | 0.4219 | 0.4443 |
| ROC area | 0.68 | 0.664 | 0.707 | 0.711 | 0.665 |

6.1.1 Confusion Matrix

This section displays the confusion matrices obtained from Weka for all 5 models when all 200 test instances are run against the trained models. Each matrix gives the number of true positives, false positives, true negatives and false negatives.

• Naïve Bayes

Table 6.2 Confusion Matrix of Naive Bayes Classifier for Asthma Prediction

| Naïve Bayes | Predicted No | Predicted Yes |
| Actual No | 116 | 27 |
| Actual Yes | 34 | 23 |

• Bayesian Network

Table 6.3 Confusion Matrix of BayesNet Classifier for Asthma Prediction

| BayesNet | Predicted No | Predicted Yes |
| Actual No | 120 | 23 |
| Actual Yes | 33 | 24 |

• Multilayer Perceptron model

Table 6.4 Confusion Matrix of MLP Classifier for Asthma Prediction

| MLP | Predicted No | Predicted Yes |
| Actual No | 112 | 31 |
| Actual Yes | 28 | 29 |

• Logistic

Table 6.5 Confusion Matrix of Logistic Classifier for Asthma Prediction

| Logistic | Predicted No | Predicted Yes |
| Actual No | 129 | 14 |
| Actual Yes | 37 | 20 |

• J48 Decision tree

Table 6.6 Confusion Matrix of J48 decision tree Classifier for Asthma Prediction

| J48 | Predicted No | Predicted Yes |
| Actual No | 126 | 17 |
| Actual Yes | 31 | 26 |
6.1.2 ROC curves

Table 6.7 ROC curves obtained from WEKA for Asthma

[Grid of ROC curve plots for the classes No and Yes, one row per model: Naïve Bayes, Bayesian Network, Multilayer Perceptron, Logistic, J48]

6.2 Analysis of Results obtained for Diabetes

Table 6.8 shows an overview of the results obtained from all 5 models when all models are trained on 1525 instances. The Multilayer Perceptron model achieved the highest accuracy and the lowest root mean squared error. At the class level, Naïve Bayes and Multilayer Perceptron have almost the same area under the ROC curve.

Table 6.8 Diabetes Results in terms of Accuracy, RMSE and ROC area for all the 5
models

| Measure | Bayes Network | Naïve Bayes | Multilayer Perceptron | Logistic Regression | J48 |
| Total Number of Instances | 300 | 300 | 300 | 300 | 300 |
| Correctly Classified Instances | 244 | 266 | 282 | 257 | 235 |
| Incorrectly Classified Instances | 56 | 34 | 18 | 43 | 65 |
| Accuracy | 81.30% | 88.60% | 94.00% | 85.60% | 78% |
| Root mean squared error | 0.3008 | 0.2372 | 0.1801 | 0.2803 | 0.3354 |
| ROC area | 0.872 | 0.985 | 0.993 | 0.899 | 0.826 |

6.2.1 Confusion Matrix

This section displays the confusion matrices obtained from Weka for all 5 models when all 300 test instances are run against the trained models. Each matrix gives the number of correctly and incorrectly classified instances for each of the three classes.

• Naïve Bayes

Table 6.9 Confusion Matrix of Naive Bayes Classifier for Diabetes Prediction

| Naïve Bayes | Predicted No | Predicted Borderline | Predicted Yes |
| Actual No | 217 | 0 | 3 |
| Actual Borderline | 4 | 2 | 2 |
| Actual Yes | 13 | 12 | 47 |

• Bayesian Network

Table 6.10 Confusion Matrix of BayesNet Classifier for Diabetes Prediction

| Bayesian Network | Predicted No | Predicted Borderline | Predicted Yes |
| Actual No | 189 | 0 | 31 |
| Actual Borderline | 5 | 0 | 3 |
| Actual Yes | 17 | 0 | 55 |

• Multilayer Perceptron Model

Table 6.11 Confusion Matrix of MLP Classifier for Diabetes Prediction

| Multilayer Perceptron | Predicted No | Predicted Borderline | Predicted Yes |
| Actual No | 220 | 0 | 0 |
| Actual Borderline | 0 | 0 | 8 |
| Actual Yes | 1 | 9 | 62 |

• Logistic

Table 6.12 Confusion Matrix of Logistic Classifier for Diabetes Prediction

| Logistic | Predicted No | Predicted Borderline | Predicted Yes |
| Actual No | 201 | 0 | 19 |
| Actual Borderline | 5 | 0 | 3 |
| Actual Yes | 16 | 0 | 56 |

• J48

Table 6.13 Confusion Matrix of J48 Classifier for Diabetes Prediction

| J48 | Predicted No | Predicted Borderline | Predicted Yes |
| Actual No | 179 | 4 | 37 |
| Actual Borderline | 4 | 0 | 4 |
| Actual Yes | 15 | 1 | 56 |
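For the three-class case, overall accuracy and the per-class hit rate can be read off a confusion matrix as sketched below in Python. This is an illustrative helper, not part of the implemented system; the name `multiclass_summary` is an assumption, and rows are taken to be actual classes and columns predicted classes, in the order No, Borderline, Yes.

```python
def multiclass_summary(matrix, labels):
    """Return (accuracy, per-class recall) for a square confusion matrix.

    matrix[i][j] is the number of instances of actual class labels[i]
    that were predicted as class labels[j].
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    recall = {
        label: (matrix[i][i] / sum(matrix[i]) if sum(matrix[i]) else 0.0)
        for i, label in enumerate(labels)
    }
    return correct / total, recall


# Multilayer Perceptron counts for Diabetes from Table 6.11.
mlp_matrix = [[220, 0, 0], [0, 0, 8], [1, 9, 62]]
mlp_accuracy, mlp_recall = multiclass_summary(
    mlp_matrix, ["No", "Borderline", "Yes"]
)
```

For the Multilayer Perceptron counts in Table 6.11 this gives an accuracy of 282/300 = 0.94, matching Table 6.8, but a recall of 0/8 for the Borderline class: none of the eight Borderline test instances is classified as Borderline.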

6.2.2 ROC curves

Table 6.14 ROC curves for Diabetes Prediction

[Grid of ROC curve plots for the classes No, Borderline and Yes, one row per model: Naïve Bayes, Bayesian Network, Multilayer Perceptron, Logistic, J48]

As the above results for Asthma and Diabetes prediction show, the models perform differently on different performance measures: some models provide better accuracy but a smaller area under the ROC curve, while others yield a larger area under the ROC curve but a higher root mean squared error. Hence the approach of combining all 5 models helps overcome the disadvantages of one model through the other models.

7 CONCLUSION AND FUTURE WORK

7.1 Conclusion

In this thesis, we have discussed the design and implementation of a predictive-analytics-based system to predict the likeliness of having asthma and the extent of diabetes in an individual. To give better results and build a powerful system, we used 5 machine learning models to generate the predictions, which helped overcome the disadvantages of one model with the help of the other models. The system successfully generates predictions based on the data provided by the user. For Asthma, the questionnaire consists of a set of 40 questions, and for Diabetes, of a set of 33 questions, ranging from demographics to laboratory details. We found that, with an accuracy of 84%, the model for predicting the extent of diabetes performed better than the model for predicting the likeliness of having asthma (accuracy 76%). The reason is that clinical data such as Albumin, Fasting Glucose and Triglycerides were included in the models for Diabetes but not in the models for Asthma, because there were too many missing values in the asthma records with ‘Yes’ as the class label. We also found that, at the class level, the models for Diabetes resulted in a better area under the ROC curve. The system developed in this study can be used to build individual or clinical decision support systems to improve the management of the chronic diseases Asthma and Diabetes. With the web-based implementation of the proposed system, the user can use the system without geographical restrictions. The user can also view previously generated results and feedback to better manage his/her health.

7.2 Future Work

With the growing development of health management strategies, the research presented here can be extended in a variety of directions. Some suggested extensions include:

• For Diabetes prediction, there were very few records with Borderline Diabetes. One future direction would be to add more Borderline Diabetes records, which would help the system learn Borderline cases better.

• For Asthma prediction, clinical data can be included for training. This should improve the overall accuracy of the system, because clinical data have a significant effect on the predictions.

• The system can be extended to build models for other chronic conditions such as CKD, COPD and heart disease.

REFERENCES

[1] World Health Organization, "Preventing Chronic Diseases - a vital investment," WHO Press, 2005.

[2] E. P. a. A. R. Eleni Chatzimichail, "An evolutionary two objective genetic algorithm for asthma prediction," in UKSim 15th International Conference on Computer Modelling and Simulation, 2013.

[3] G. B. a. Y. H. Yang Guo, "Using Bayes Network for Prediction of Type-2 Diabetes," 2012.

[4] EPA, "Asthma Facts," 2013.

[5] K. L. K. S. S. S. Eaton DK, "Youth risk behavior surveillance," 2009.

[6] P. G. a. K. Kapur, "Artificial Neural Network-enabled Prognostics for Patient Health Management," IEEE, pp. 1-8, 2012.

[7] A. S. V. a. M. S. Komathy Karuppanan, "COPD Prognosis under Biologically Inspired Neural Network," in International Conference on Advances in Computing and Communications, 2012.

[8] N. V. C. N. A. C. a. A. L. B. Darcy A. Davis, "Time to CARE: a collaborative engine," Data Min Knowl Disc, pp. 388-415, 2009.

[9] E. P. a. A. R. Eleni Chatzimichail, "An evolutionary two-objective genetic algorithm for asthma prediction," in 15th International Conference on Computer Modelling and Simulation, 2013.

[10] D. T. K. a. G. R. Dr.B.Eswara Reddy, "An Efficient Cloud Framework for Health Care Monitoring System," in International Symposium on Cloud and Services Computing, 2012.

[11] B. R. P. a. S. Agarwal, "Modeling Risk Prediction of Diabetes - A Preventive Measure," IEEE, pp. 1-6, 2014.

[12] P. C. a. H. S. Gerd Assman, "Simple Scoring Scheme for Calculating the Risk of Acute Coronary Events Based on the 10 year Follow-up of the Prospective Cardiovascular Munster (PROCAM) study," Circulation, pp. 1-8, 2012.

[13] J. G. H. Z. T. J. R. B. a. Y. S. X. Xie, "Neural-network based structural health monitoring with wireless sensor networks," in Natural Computation (ICNC), Shenyang, 2013.

[14] F. Beaule, "Systems and Methods For Determining and Managing an Individual and Portable Health Score," 2012.

[15] N. C. S. U. a. R. David A. Dickey, "Introduction to Predictive Modeling with Examples," SAS Global Forum, 2012.

[16] I. H. W. a. E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Diane Cerra, 2005.

[17] M. E. U. F. e. a. Soumen Chakrabarti, "Data Mining Curriculum: A Proposal," 2006.

[18] K. P. Murphy, Machine Learning: A Probabilistic Perspective, 2012.

[19] I. Rish, "An empirical study of the naive bayes classifier," 2001.

[20] S. S. e. a. Ashraf Uddin, "Presentation on Naive Bayes Classification," 2012.

[21] M. E. Kragt, "A beginners guide to Bayesian Network Modelling for integrated catchment management," 2009.

[22] D. G. a. M. G. Nir Friedman, "Bayesian Network Classifiers," Kluwer Academic Publishers, 1997.

[23] D. W. H. a. S. Lemeshow, Applied Logistic Regression, 2000.

[24] Q. L. K. R. X. T. Dongsheng Che, "Decision Tree and Ensemble Learning Algorithms with Their Applications in Bioinformatics," in Software Tools and Algorithms for Biological Systems, 2011.

[25] G. L. D. E. S. L. S. R. T. a. L. W. W. A. A. Arif, "Prevalence and risk factors of asthma and wheezing among US adults," European Respiratory Journal, 2003.

[26] C. M. A. G. P. E. M. V. D. M. K. E. P. A. P. K. Papoutsakis, "Associations between central obesity and asthma in children and adolescents: a case-control study," 2014.

[27] M. L. P. L. H. R. R. J. D. F. G. R. M. L. B. C. A. C. P. R. J Von Behren, "Obesity, waist size and prevalence of current asthma in the California Teachers study cohort," Thorax, 2009.

[28] M. M. J. M. T. D. C. T. M. H. P. Marianne Eijkemans, "Physical Activity and Asthma: A Systematic Review and Meta-Analysis," Plos One, 2012.

[29] J. H. Sisson, "Alcohol and Airways Function in Health and Disease," 2007.

[30] P. A. H.-T. P. B. C. G. P. B. C. R. M. H. M. a. T. H. S. P. Megan Stapleton, "Smoking and Asthma," vol. 24, pp. 313-322, 2011.

[31] A. B. L. Y. M. B. a. D. N. Roberto Forero, "Asthma, Health Behaviors, Social Adjustment, and Psychosomatic Symptoms in Adolescence," vol. 33, pp. 157-164, 1996.

[32] R. H. C. a. S. G. H E Bays, "The relationship of body mass index to diabetes mellitus, hypertension and dyslipidaemia: comparison of data from two national surveys," pp. 737-747, 2007.

[33] D. A. G. E. S. F. A. M. a. E. E. C. Julie C Will, "Cigarette smoking and diabetes mellitus: evidence of a positive association from a large prospective cohort study," Oxford, 2000.

[34] M. D. K. S. S. a. K. T. M. Noriyuki Nakanishi, "Alcohol Consumption and Risk for Development of Impaired Fasting Glucose or Type 2 Diabetes in Middle-Aged Japanese Men," DOI, vol. 26, pp. 48-54, 2003.

[35] S. R. Garner, "WEKA: The Waikato Environment".

[36] A. D. a. I. H. W. Geoffrey Holmes, "WEKA: A Machine Learning Workbench," in IEEE, 1994.
