
Decision Support Systems 136 (2020) 113339

Contents lists available at ScienceDirect

Decision Support Systems


journal homepage: www.elsevier.com/locate/dss

Missing care: A framework to address the issue of frequent missing values; The case of a clinical decision support system for Parkinson's disease
Saeed Piri
Department of Operations and Business Analytics, Lundquist College of Business, University of Oregon, Eugene, OR 97403, USA

ARTICLE INFO

Keywords:
Electronic health records
Missing values
Clinical decision support systems
Predictive healthcare analytics
Imbalanced data learning
Parkinson's disease

ABSTRACT

In recent decades, the implementation of electronic health record (EHR) systems has been evolving worldwide, leading to the creation of immense data volume in healthcare. Moreover, there has been a call for research studies to enhance personalized medicine and develop clinical decision support systems (CDSS) by analyzing the available EHR data. In EHR data, usually, there are millions of patient records with hundreds of features collected over a long period of time. This enormity of EHR data poses significant challenges, one of which is dealing with many variables with very high degrees of missing values. In this study, the data quality issue of incompleteness in EHR data is discussed, and a framework called 'Missing Care' is introduced to address this issue. Using Missing Care, researchers will be able to select the most important variables at an acceptable missing values degree to develop predictive models with high predictive power. Moreover, Missing Care is applied to analyze a unique, large EHR dataset to develop a CDSS for detecting Parkinson's disease. Parkinson's is a complex disease, and even a specialist's diagnosis is not without error. Besides, there is a lack of access to specialists in more remote areas, and as a result, about half of the patients with Parkinson's disease in the US remain undiagnosed. The developed CDSS can be integrated into EHR systems or utilized as an independent tool by healthcare practitioners who are not necessarily specialists, therefore making up for the limited access to specialized care in remote areas.

1. Introduction

Over the past two decades, the ever-increasing creation of immense data volumes in diverse forms from various sources has inspired and induced industry practitioners and researchers to analyze and exploit the available big data and transform their analysis into competitive advantages and strategic value [1-6]. "Big data analytics", "business analytics", "predictive analytics", and "data-driven decision making" are some of the terms used in the literature for this stream of research. In healthcare, similar to many other fields, such as retail, e-commerce, and media & entertainment, vast amounts of data in different forms have become available owing to process digitalization in the industry [7]. Two prime sources of big data in healthcare are genomic and payer-provider. Genomic data is the data retrieved from the genome and DNA of an organism and is used in bioinformatics. Examples of payer-provider data sources are electronic health records (EHR), insurance records, pharmacy prescriptions, and patient feedback and responses [8].

In the recent decade, the use and implementation of EHR systems have been evolving worldwide, and in the US [3,9]. Jha et al. [10] reported that the adoption of basic or comprehensive EHR rose from 8.7% in 2008 to 11.9% in 2009. By 2014, the use of basic EHR increased to 59%, and in 2017, 96% of all non-federal acute care hospitals possessed certified EHR technology [11]. Many researchers have studied the diffusion of EHR systems in US hospitals and facilitating EHR assimilation in small physician practices [12,13], and also the impact and benefits associated with the adoption of EHR [9,14-18]. As using EHR systems has become ubiquitous, many researchers and clinicians have started using EHR and its data for research purposes [19,20]. A specific characteristic of an EHR system is that it is a comprehensive system, which links multiple patient-level data sources such as demographics, encounters, lab tests, medications, and medical procedures [21]. This comprehensiveness allows for more reliable and robust research that considers many aspects of the healthcare system and the patients in this system. The trove of health data available, coupled with the recent advancements in analytics, has created an ideal opportunity for researchers to conduct analytics research and gain valuable insights that improve decision making in healthcare systems [22].

Personalized medicine provides medical care tailored to the unique physiological and medical history of individuals rather than relying on general population information. Personalized medicine leads to earlier diagnosis, more effective interventions and treatments, and lower costs [23]. Developing clinical decision support systems (CDSS) has been

E-mail address: spiri@uoregon.edu.

https://doi.org/10.1016/j.dss.2020.113339
Received 7 November 2019; Received in revised form 1 June 2020; Accepted 1 June 2020
Available online 12 June 2020
0167-9236/ © 2020 Elsevier B.V. All rights reserved.

outlined as one of the most crucial research directions related to personalized medicine [1,20,24,25]. CDSSs are tools that aid clinicians in making more informed decisions, such as diagnosing various diseases. CDSSs can be integrated into EHR systems and be a part of the clinical workflow and, as a result, help clinicians to stratify patients, diagnose diseases, and identify the best candidates for various treatments. One of the most promising directions in developing CDSSs is data mining and predictive analytics using large-size EHR data [26]. Many editorials and commentaries such as Gupta & Sharda [20], Agarwal et al. [19], Agarwal & Dhar [27], Chen et al. [3], Shmueli & Koppius [28], and Baesens et al. [29] have discussed and noted the significance and value of predictive analytics specifically in healthcare applications. As Shmueli & Koppius [28] noted, when the goal of the research is the predictability of a phenomenon, construct operationalization considerations are trivial. Then, they introduced "assessing predictability of empirical phenomena" as one of the contributions of predictive analytics. Also, Agarwal & Dhar [27] argued that in the healthcare domain, prediction could be even more important than explanation (or causality), because of the proven benefits of earlier diagnosis, intervention, and treatment.

In EHR data, usually, there are hundreds of thousands, and sometimes millions, of patients with many records and features (demographics, laboratory, medications, encounters, and outcomes) collected over a long period of time. Therefore, the enormity and complexity of EHR data pose significant research and practical challenges [3]. One of the most critical challenges is dealing with many variables with very high degrees of missing values [30]. If we consider the laboratory information, there are hundreds of different lab tests in EHR data; however, not every patient takes all of those lab tests. Therefore, for many features (tests), the majority of values are missing (at the level of more than 70% to 90% missing). Baesens et al. [29] mention,

"Access to big data and the tools to perform deep analytics suggests that power now equals information (data) + trust."

Then, they extensively discuss data quality as a critical and essential topic that is frequently ignored in data analytics. Research has shown that while many analytics and machine learning techniques might yield comparable predictive performance, the best way to enhance this performance is to work on the key element of analytics: data [31]. Data completeness is a vital aspect of data quality [32,33]. Thus, in analyzing EHR data, we deal with data quality from the completeness point of view [30]. Although many studies, especially in the fields of statistics and machine learning, have focused on imputing missing values, they only consider variables that have reasonable degrees of completeness (roughly below 50% missing), and variables with very high degrees of missingness (as high as 70 to 90% missing) are dropped before applying imputation methods [30,34,35]. As a result, the challenge of dealing with variables with very high degrees of missingness remains unanswered. To address this challenge, a new framework that can be applied to EHR data and other types of data with the same challenge of having many variables with very high degrees of missing values is proposed. This framework is called Missing Care. Using Missing Care, data analytics researchers will be able to select and keep the most important variables at an acceptable missing values degree to use imputation approaches and develop predictive models with high predictive power.

Moreover, the proposed framework, Missing Care, is applied to develop a CDSS for detecting and screening for Parkinson's disease (PD). PD is a chronic and progressive neurological disorder affecting more than 10 million people worldwide. In the US alone, there are about 500,000 patients diagnosed with PD; however, given many undiagnosed or misdiagnosed cases, it is estimated that there are actually about one million patients with PD in the US. Besides patients themselves, PD affects thousands more spouses, family members, and other caregivers [36]. The fact that the actual number of patients with PD is twice the number of diagnosed patients is a strong indication of an urgent need for a more accessible diagnosis/screening tool for this disease. Hence, developing tools and CDSSs for a more accessible diagnosis of PD is vital. In the US, the total annual direct and indirect costs of PD are about $52 billion [37]. There is no cure for PD yet, but there are treatment options such as medications and surgery. The primary current diagnostic method for PD is based on the subjective opinion of neurologists reviewing patients' movement and speaking [38]. Researchers have discussed the challenge of healthcare access, particularly in remote areas, and have urged the need for innovative solutions that are affordable and easy to implement [39-41]. Because of the limited specialty care access, many patients, especially in remote and rural areas, may remain undiagnosed. The developed CDSS can fill this gap and be a solution for the problem of care access in remote areas and, as a result, alleviate the low diagnosis rate for PD.

In this study, by analyzing a unique, large-size EHR dataset, including demographics and routine lab tests, a CDSS for detecting PD is developed. In the development of this CDSS, an imbalanced dataset is analyzed using the "synthetic informative minority over-sampling" (SIMO) approach [42], which is an over-sampling algorithm for imbalanced datasets. As Von Alan et al. [43] discuss, design science as an essential paradigm of research is creating new and innovative artifacts (such as models and systems) to address real-world and applied problems. Harnessing big data in healthcare, using data mining techniques, is consistent with the design science paradigm and has received a lot of attention in recent years. The developed diagnosis/screening CDSS for PD also belongs to this category of research.

This study's contribution is two-fold. First, to the best of my knowledge, this is the first study that formally discusses the challenge of having many variables with very high degrees of missing values in EHR (and other similar) datasets and addresses it by introducing the Missing Care framework. This framework addresses the issue of data quality as well as trust in big data analytics tools, as discussed by Baesens et al. [29]. Both the quality and trust features are discussed in more detail in the methodology section. Second, from the precision medicine perspective, this study introduces a CDSS for detecting and screening for PD, a neurological disease with a meager diagnosis rate. This CDSS can be utilized as a tool integrated into EHR systems as well as an independent tool used by healthcare practitioners who are not necessarily specialists, therefore making up for the limited access to specialized care in rural and remote areas. The rest of this manuscript is organized as follows. In the next section, the related literature is covered. Next, in the methodology section, the Missing Care framework is introduced. Following that, the data, as well as the data pre-processing steps, are presented. Next, in the experiment results section, the results of the analysis in the case of PD are provided. Following that, a series of robustness checks are presented. Finally, the findings, contributions, and implications of this research are discussed.

2. Literature review

2.1. Predictive modeling

The primary purpose of predictive analytics is to predict the outcome of interest for new cases rather than explaining the causal relationships between features and the outcome [28]. Predictive analytics, machine learning, and data mining have been extensively used in the literature. Many researchers have applied predictive analytics and machine learning in marketing and retail contexts, such as predicting online customers' repeat visits [44], predicting consumers' purchase timing and choice decisions [45], customer churn prediction [46], and marketing resource allocation [47]. Data mining and predictive analytics have also been used in finance-related areas, for instance, financial fraud detection [48], evaluating firms' value [49], and predicting the type of entities in the Bitcoin blockchain [50].

Social networks, recommendation systems, and process events are other fields that have gained interest from researchers. Examples in this
stream are predicting business process events such as early warning systems [51], and predicting the probability that a social entity will adopt a product [52]. Finally, many research works are related to analyzing unstructured data, such as text, reviews, and blogs [53-56]. For a more extensive review of research in data mining, the readers are referred to Trieu [57].

2.2. Healthcare data analytics

Kohli & Tan [22] extensively discuss how researchers can contribute to the healthcare transformation in two semantic areas, integration and analytics, using EHR. Healthcare analytics' primary goal is to predict medical/healthcare outcomes such as diseases, hospital readmissions, and mortality rates, using clinical and non-clinical data [26]. In healthcare analytics, two types of data sources could be used. First, data collected in clinical trials; these types of datasets are collected explicitly for analysis purposes; however, they are usually small-sized and limited. The other source is secondary data in healthcare, such as EHR data. Data analytics researchers typically deal with datasets from the second source, which has its own challenges.

Readmission is defined as being readmitted for the same primary diagnosis within 30 days. Readmissions incur a huge preventable cost to the US healthcare system. Many researchers have developed predictive models to identify and predict patients with a high risk of readmission [34,58-60]. Another group of researchers has studied online healthcare social platforms and used data analytics to investigate and predict patients' behavior and outcomes. Examples are discovering the health outcomes for patients with mental health issues in an online health community [61], predicting the social support in a chronic disease-focused online health community [62], and determining individuals' smoking status [63].

Predicting and detecting adverse health events is another stream of research in healthcare analytics. Lin et al. [26] proposed an approach that can provide multifaceted risk profiling for patients with comorbid conditions. Wang et al. [64] also proposed a framework to predict multiple disease risks. Piri et al. [65] developed a CDSS to detect diabetic retinopathy and proposed an ensemble approach to further improve the CDSS's performance. Zhang & Ram [66] introduced a data-driven framework that integrates multiple machine learning techniques to identify asthma triggers and risk factors. Wang et al. [67] used machine learning to analyze patient-level data and identified patient groups that exhibit significant differences in outcomes of cardiovascular surgical procedures. Ahsen et al. [68] studied breast cancer diagnosis in the presence of human bias. And finally, Hsu [69] proposed an attribute selection method to identify cardiovascular disease risk factors.

Other related healthcare analytics works are on topics such as treatment failures, clinical trials, and patients' pathways in hospitals. To mention a few, Meyer et al. [70] proposed a machine learning approach to improve dynamic decision making. They applied the proposed method to predict treatment failures for type II diabetic patients. Gómez-Vallejo et al. [71] developed a system to diagnose healthcare-associated infections. Researchers have also used predictive modeling to evaluate kidney and heart transplant survival [72,73]. And Somanchi et al. [74] proposed models to predict whether emergency department patients will be admitted as inpatients or will be discharged.

2.3. Parkinson's disease

The current gold standard for diagnosing PD is a clinical evaluation by a specialist. The criteria for diagnosis have been formalized by the UK Parkinson's Disease Society Brain Bank [75]. Due to the disease complexity, even the diagnosis by a specialist using the formal criteria is not perfect, and the accuracy is about 90% [76]. Many researchers have studied the association of PD with potential biomarkers and other characteristics of patients. However, most of these works are based on studying only one single biomarker in very small sample sizes. For instance, Fujimaki et al. [77] showed that the caffeine level in the blood of PD patients is lower compared to the control group. Feigenbaum et al. [78] studied the possibilities of testing tears to find a specific protein that has been shown to be an indication of PD. Arroyo-Gallego et al. [79] studied 25 PD patients and 27 controls to detect PD based on natural typing interaction. Another study analyzed the handwriting of 20 PD patients and 20 controls and performed a discriminant analysis to classify the participants into PD and non-PD [80]. There have been other studies applying machine learning in managing PD, such as using smartphones to monitor PD patients' movements, calculating a score, and sending it to doctors [81]. The significant difference between these types of works and the current study is that this research focuses on the diagnosis challenge in PD, while these studies address disease management for currently diagnosed patients. There is no proven biomarker for PD [82]; therefore, personalized medicine based on big data analytics could be a potential solution to PD diagnosis challenges.

In summary, none of the previous research studies in predictive analytics and healthcare analytics introduced or used any formal procedure to address the challenge of having many variables with very high degrees of missing values, and this study is the first to propose a formal framework that addresses this common challenge in working with EHR. Besides, to the best of my knowledge, this research, for the first time, introduces a CDSS for detecting and screening for PD that does not require any specific equipment or test and can be used by any primary care provider or nurse. This CDSS can be used as a tool integrated into EHR systems or as a standalone personalized medicine tool, especially in remote areas with limited access to specialists. Moreover, while most of the existing studies only use balanced datasets,¹ in the development of this CDSS, advanced imbalanced data learning techniques are employed to simulate the realistic situation of facing an imbalanced distribution of patients with PD and those without PD. Additionally, in this study, an ensemble approach is used to integrate multiple classifiers to achieve the highest possible accuracy in detecting PD.

¹ Balanced datasets are much easier to learn from; however, their performance on real-world imbalanced datasets is not acceptable.

3. Methodology - Missing Care framework

In analyzing datasets similar to EHR data, we face the challenge of having many variables, most of them with very high degrees of missing values (as high as 70 to 90%). There are two extreme and immediate solutions to this issue. One solution is to remove numerous records in order to keep only records with reasonably populated values. Another solution, which is in a way in the opposite direction, is first to remove the variables with very high degrees of missing values, and then remove the records with high missing values and, as a result, end up with a dataset that has reasonable completeness suitable for using imputation approaches. By pursuing the first solution, we will lose numerous records (in the case of EHR, records correspond to patients or encounter data). And by taking the second solution, we will deprive ourselves of many variables (independent variables) that might indeed be strongly associated with the target variable.

One might say various imputation methods can be used to impute the missing values. In fact, there is an extensive literature on missing values imputation methods in the statistics and machine learning fields [83-87]. However, imputation methods are only appropriate when there is reasonable completeness in each variable. Acuna & Rodriguez [88] mentioned: "rates of less than 1% missing data are generally considered trivial, 1-5% are manageable. However, 5-15% requires sophisticated methods to handle, and more than 15% may severely impact any kind of interpretation." Many data analytics tools such as SAS Enterprise Miner automatically (by default) remove variables with more than 50% missing from further analysis and imputation. The reason is that to
impute missing values, we usually need to use the values of other records for the same variable, and when most of the values (or a considerable portion of them) are missing, the imputation is not reliable. For instance, in the data used in this research, the maximum missing value degree was 98%, and many features had missing values around 60% to 90%, so imputing these variables with very high degrees of missingness was not appropriate. It needs to be pointed out that Missing Care is not a replacement for imputation methods. In fact, it is a pre-processing framework that prepares the data for more meaningful use of imputation methods once there is a reasonable degree of completeness in the data. In many fields, especially healthcare, even 50% completeness for a variable is not acceptable.

Baesens et al. [29] extensively discuss data quality and trust in analytics works. Data quality was elaborated on in the introduction section, and here the "trust" part is discussed. The matter of trust in both data and analytics approaches is a critical factor in implementing DSS based on data analytics [29]. This is even more crucial in healthcare; working with physicians and clinicians as a data analyst entails remarkable liability and trust. Even if you use a complete and clean dataset to develop a CDSS, clinicians hardly trust the results, let alone if they learn that you have heavily imputed your data. Therefore, to enhance data-driven decision making in general, and more specifically in healthcare, we need to gain the organizations' leaders' trust, and one of the prerequisites is to limit the level of imputation.

This challenge is addressed by introducing the Missing Care framework. Missing Care begins with the initial dataset D, with p variables and N records. Next, keeping all variables in the data, only records with a reasonable degree of completeness, say δ%,² will be kept, and all other records will be removed. At this stage, we have dataset D1, containing all of the initial variables and only the records with a high degree of completeness (N1 records); therefore, a considerable number of records are removed. The next step is to identify the variables (features or independent variables) that are strongly associated with the target variable. In Section 3.1, a procedure is recommended for computing an importance level for all the variables and identifying the important ones.

² A suggestion for a reasonable degree of completeness is about 85% to 90%; however, it could be different based on the data characteristics.

Then, based on the final variables' importance (xj^imp), the top p* variables out of the initial p variables are identified and selected for further analysis. In the next step, from the initial dataset D, only the selected p* variables are kept, and all other variables are removed. At this point, we have an updated dataset, D2, which includes only a subset of variables (those that are strongly associated with the target variable) and all of the original records. The next step is to remove the records with very high missing values and keep only the records with at least δ% completeness. This dataset (D*) is the final dataset that will be used to develop the predictive models; it has p* variables and N* records, with N* > N1.

3.1. Computing variables' importance

Two data mining and machine learning methods, for both classification and regression problems, are recommended to compute the variables' importance. Here, it needs to be noted that classification problems are the ones with a binary or categorical target variable, for instance, when we want to detect and diagnose a disease or when we want to predict the success or failure of a project. And regression problems are the cases with a numeric, measurable target variable, such as predicting the value of a house or predicting the length of stay for a patient. At the end of this section, two Missing Care frameworks are presented: one for regression problems and one for classification problems. If the problem is classification, logistic regression with l1 regularization (the same regularization that lasso uses) and a random forests classifier [89] are used. And if the problem is regression, lasso and a random forests regressor are used. These methods are recommended for three reasons: first, one is representative of classical statistics (regression), and the other is representative of more advanced machine learning techniques (random forests); second, regressions are highly interpretable, and random forests are highly accurate and capable of handling a large number of variables when there is a relatively small number of records available [89,90]; and third, both methods provide some sort of variable importance.

Both logistic regression with l1 regularization and lasso regression train in a way that reduces the number of features in the predictive model and reduces the chance of over-fitting [91], where the linear regression is in the form of Eq. 1:

y = β0 + β1x1 + ... + βpxp + ε    (1)

Lasso minimizes the following (Eq. 2) to estimate the coefficients:

MSE + α(|β1| + ... + |βp|)    (2)

MSE is the mean squared error, the βj are the coefficients of the p features, and α is the parameter that adjusts the trade-off between accuracy on the training data and the regularization. A higher α means forcing more βj to be zero (removing the corresponding features from the model). Therefore, when lasso (or logistic regression with l1 regularization) is used, during the training process, important variables that are strongly associated with the target variable are kept in the model, and other, less important features will end up having zero coefficients, meaning they will be removed from the model. An importance measure is assigned to each feature included in the model. This importance is based on the R-square reduction (denoted by Rj,red²) after removing the variable from the model, and in the end, all of the importance measures are normalized. The importance of variable j (denoted by Reg_xj^imp) is calculated as follows in Eq. 3:

Reg_xj^imp = Rj,red² / (R1,red² + ... + Rp,red²)    (3)

The same is done for the logistic regression; however, instead of the R-square reduction, the area under the curve (AUC) reduction is calculated. AUC is the area under the ROC (Receiver Operating Characteristic) curve and takes values between 0 and 1. The AUC reduction after removing variable j is denoted by AUCj,red, and variable j's importance in a classification problem is calculated as in Eq. 4:

Reg_xj^imp = AUCj,red / (AUC1,red + ... + AUCp,red)    (4)

It needs to be noted that Rj,red² and AUCj,red for variables that are not included in the model are equal to zero, and as a result, their importance is zero as well.

In random forests models, the variable importance is calculated based on the impurity reduction when a variable is used for splitting [92]. The impurity for regression models is based on the variance at each node and is calculated as shown in Eq. 5 and Eq. 6:

ȳm = (1/Nm) Σ_{i ∈ Nm} yi    (5)

vm = (1/Nm) Σ_{i ∈ Nm} (yi − ȳm)²    (6)

where yi is the label (target variable value) for record i and vm is the variance at node m, with Nm observations. When variable j is used in the split at node m in a tree, the variance reduction (VR) is calculated as in Eq. 7:

VRjm = vm − wleft · vleft_m − wright · vright_m    (7)

where wleft and wright are the proportions of the records in each leaf. Variable j's importance at each tree is calculated based on the
proportion of variance reduction using variable j in splits to the total Framework 1 (continued)
variance reduction in all splits as shown in Eq. 8,

$$VR\_x_j^{imp} = \frac{\sum_{m \,\in\, \text{nodes split using variable } j} VR_{jm}}{\sum_{m \,\in\, \text{all nodes}} VR_m} \tag{8}$$

Next, $VR\_x_j^{imp}$ is normalized using all variables' importance in the tree (Eq. 9),

$$N\_VR\_x_j^{imp} = \frac{VR\_x_j^{imp}}{\sum_{i \in p} VR\_x_i^{imp}} \tag{9}$$

Finally, the importance of variable $j$ in the random forests model, denoted by $RF\_x_j^{imp}$, is computed based on the average variance reduction in all trees generated in the random forests model, as in Eq. 10,

$$RF\_x_j^{imp} = \frac{\sum_{\text{all trees in RF}} N\_VR\_x_j^{imp}}{\text{number of trees in RF}} \tag{10}$$

In classification models, the impurity is the Gini index at each node (Eq. 11),

$$Gini_m = \sum_{k=1}^{K} p_{mk}\,(1 - p_{mk}) \tag{11}$$

where $p_{mk}$ is the proportion of observations belonging to class $k$ at node $m$ (there are $K$ classes in the data). When variable $j$ is used in the split in node $m$ in a tree, the Gini reduction (GR) is calculated as in Eq. 12,

$$GR_{jm} = Gini_m - w_{left}\,Gini_{left\_m} - w_{right}\,Gini_{right\_m} \tag{12}$$

where $w_{left}$ and $w_{right}$ are the proportions of the records in each leaf. Variable $j$'s importance in each tree is calculated as the ratio of the Gini reduction from splits using variable $j$ to the total Gini reduction in all splits (Eq. 13),

$$GR\_x_j^{imp} = \frac{\sum_{m \,\in\, \text{nodes split using variable } j} GR_{jm}}{\sum_{m \,\in\, \text{all nodes}} GR_m} \tag{13}$$

Next, $GR\_x_j^{imp}$ is normalized using all variables' importance in the tree (Eq. 14),

$$N\_GR\_x_j^{imp} = \frac{GR\_x_j^{imp}}{\sum_{i \in p} GR\_x_i^{imp}} \tag{14}$$

Finally, the importance of variable $j$ in the random forests, denoted by $RF\_x_j^{imp}$, is computed based on the average Gini reduction in all trees generated in the random forests model, as in Eq. 15,

$$RF\_x_j^{imp} = \frac{\sum_{\text{all trees in RF}} N\_GR\_x_j^{imp}}{\text{number of trees in RF}} \tag{15}$$

At this stage, the variables' importance in both the regression and random forests models is available. Next, a weight is assigned to each variable importance based on the models' R-square for regression problems and the models' AUC for classification problems, denoted by $wRsq^{Reg}$, $wRsq^{RF}$, $wAUC^{Reg}$, and $wAUC^{RF}$, respectively. The final importance of variable $j$ is calculated as in Eq. 16 and Eq. 17,

$$x_j^{imp} = \frac{wRsq^{Reg}\, Reg\_x_j^{imp} + wRsq^{RF}\, RF\_x_j^{imp}}{wRsq^{Reg} + wRsq^{RF}} \quad \text{for regression problems} \tag{16}$$

$$x_j^{imp} = \frac{wAUC^{Reg}\, Reg\_x_j^{imp} + wAUC^{RF}\, RF\_x_j^{imp}}{wAUC^{Reg} + wAUC^{RF}} \quad \text{for classification problems} \tag{17}$$
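The Gini computations of Eqs. (11) and (12) can be illustrated with a small sketch. This is pure Python; `gini` and `gini_reduction` are illustrative helper names, not code from the paper:

```python
# Sketch of Eqs. (11)-(12): Gini impurity of a node and the Gini
# reduction achieved by splitting it. Labels are arbitrary class ids.

from collections import Counter

def gini(labels):
    """Gini_m = sum over the K classes of p_mk * (1 - p_mk) at node m."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def gini_reduction(parent, left, right):
    """GR_jm = Gini_m - w_left * Gini_left_m - w_right * Gini_right_m,
    where the weights are the proportions of records in each leaf."""
    n = len(parent)
    w_left, w_right = len(left) / n, len(right) / n
    return gini(parent) - w_left * gini(left) - w_right * gini(right)

# A pure split of [1, 1, 0, 0] into [1, 1] and [0, 0] removes all
# impurity: the parent's Gini is 0.5 and both children are pure,
# so the reduction equals 0.5.
```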
Framework 1
Missing Care Framework for regression problems.

Given, p, N, δ, α
1. From D, keep all p variables and only the records that have at least δ% completeness: D1 (p variables & N1 records)
2. Train a lasso regression on D1
3. Using an appropriate value for α, identify the variables with non-zero coefficients, p1
4. Calculate $R_j^{2red}$ for all variables in p1
5. Normalize the values of $R_j^{2red}$ and calculate variable importance, $Reg\_x_j^{imp}$, using
$$Reg\_x_j^{imp} = \frac{R_j^{2red}}{R_1^{2red} + \dots + R_{p_1}^{2red}}$$
6. Train a random forests regressor on D1
7. Calculate the variance reduction, $VR\_x_j^{imp}$, for all variables in each tree
8. Normalize the variance reductions using
$$N\_VR\_x_j^{imp} = \frac{VR\_x_j^{imp}}{\sum_{i \in p} VR\_x_i^{imp}}$$
9. Calculate variable importance, $RF\_x_j^{imp}$, based on the normalized variance reductions for all trees in the random forests using
$$RF\_x_j^{imp} = \frac{\sum_{\text{all trees in RF}} N\_VR\_x_j^{imp}}{\text{number of trees in RF}}$$
10. Calculate the ultimate variables' importance as
$$x_j^{imp} = \frac{wRsq^{Reg}\, Reg\_x_j^{imp} + wRsq^{RF}\, RF\_x_j^{imp}}{wRsq^{Reg} + wRsq^{RF}}$$
11. Identify the top p* variables out of the initial p variables as selected variables based on $x_j^{imp}$
12. From D, keep only the selected p* variables and all records: D2 (p* variables & N records)
13. From D2, keep only the records with at least δ% completeness: D* (p* variables & N* records), (N* > N1)
14. D*: the final dataset with selected variables and records

Framework 2
Missing Care Framework for classification problems.

Given, p, N, δ, α
1. From D, keep all p variables and only the records that have at least δ% completeness: D1 (p variables & N1 records)
2. Train a logistic regression with l1 regularization on D1
3. Using an appropriate value for α, identify the variables with non-zero coefficients, p1
4. Calculate $AUC_j^{red}$ for all variables in p1
5. Normalize the values of $AUC_j^{red}$ and calculate variable importance, $Reg\_x_j^{imp}$, using
$$Reg\_x_j^{imp} = \frac{AUC_j^{red}}{AUC_1^{red} + \dots + AUC_{p_1}^{red}}$$
6. Train a random forests classifier on D1
7. Calculate the Gini reduction, $GR\_x_j^{imp}$, for all variables in each tree
8. Normalize the Gini reductions using
$$N\_GR\_x_j^{imp} = \frac{GR\_x_j^{imp}}{\sum_{i \in p} GR\_x_i^{imp}}$$
9. Calculate $RF\_x_j^{imp}$ based on the normalized Gini reductions for all trees in the random forests using
$$RF\_x_j^{imp} = \frac{\sum_{\text{all trees in RF}} N\_GR\_x_j^{imp}}{\text{number of trees in RF}}$$

(continued on next page)

5
S. Piri Decision Support Systems 136 (2020) 113339

Framework 2 (continued)

10. Calculate the ultimate variables' importance as
$$x_j^{imp} = \frac{wAUC^{Reg}\, Reg\_x_j^{imp} + wAUC^{RF}\, RF\_x_j^{imp}}{wAUC^{Reg} + wAUC^{RF}}$$
11. Identify the top p* variables out of the initial p variables as selected variables based on $x_j^{imp}$
12. From D, keep only the selected p* variables and all records: D2 (p* variables & N records)
13. From D2, keep only the records with at least δ% completeness: D* (p* variables & N* records), (N* > N1)
14. D*: the final dataset with selected variables and records
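The filtering and selection steps of the frameworks (steps 1, 10, 11, and 13) can be sketched compactly. Model training (steps 2–9) is elided here; `reg_imp` and `rf_imp` are hypothetical stand-ins for the $Reg\_x_j^{imp}$ and $RF\_x_j^{imp}$ scores, and missing cells are represented as `None`:

```python
# Sketch of the Missing Care record/variable filtering. The lasso /
# logistic regression and random forests fits that would produce the
# importance scores are elided; toy numbers stand in for them.

def completeness(record):
    """Fraction of non-missing cells in a record."""
    return sum(v is not None for v in record) / len(record)

def filter_records(data, delta):
    """Steps 1 and 13: keep records with at least delta completeness."""
    return [r for r in data if completeness(r) >= delta]

def ultimate_importance(reg_imp, rf_imp, w_reg, w_rf):
    """Step 10 (Eq. 16/17): weighted average of the two importances,
    weighted by the models' R-square (regression) or AUC (classification)."""
    return [(w_reg * a + w_rf * b) / (w_reg + w_rf)
            for a, b in zip(reg_imp, rf_imp)]

def top_variables(importance, p_star):
    """Step 11: indices of the p* most important variables."""
    return sorted(range(len(importance)), key=lambda j: -importance[j])[:p_star]

# Toy data: 3 variables; one record is 2/3 complete, one only 1/3.
data = [[1.0, None, 3.0], [None, None, 2.0]]
kept = filter_records(data, 0.5)            # only the first record survives
imp = ultimate_importance([0.5, 0.3, 0.2], [0.2, 0.5, 0.3],
                          w_reg=0.80, w_rf=0.85)
selected = top_variables(imp, p_star=2)     # indices of the top 2 variables
```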
Formal pseudo-code for Missing Care for both regression and classification problems is shown in Frameworks 1 and 2, and the notation is listed in Table 1.

An important point needs to be noted here. Even though the recommended techniques led to better performance compared to other variable selection methods, such as relative importance in neural networks and stepwise regression, this will not necessarily be the case for all settings and datasets. Therefore, it is best if analysts experiment with various variable selection methods and apply the one that provides the best results for their data.

Table 1
Notations for the Missing Care framework.

D: initial dataset
p: number of variables in the initial data
N: number of records in the initial data
δ: record completeness percentage threshold
α: l1 regularization parameter

Notations for regression problems:
R_j^2red: R-square reduction when variable j is removed from the lasso model
Reg_x_j^imp: variable j's importance in the lasso model
VR_x_j^imp: variance reduction when variable j is used for splits in a tree
N_VR_x_j^imp: normalized variance reduction for variable j in a tree
RF_x_j^imp: variable j's importance in the random forests regressor model
wRsq^Reg: R-square of the lasso regression model
wRsq^RF: R-square of the random forests regressor model

Notations for classification problems:
AUC_j^red: AUC reduction when variable j is removed from the logistic regression model
Reg_x_j^imp: variable j's importance in the logistic regression model
GR_x_j^imp: Gini reduction when variable j is used for splits in a tree
N_GR_x_j^imp: normalized Gini reduction for variable j in a tree
RF_x_j^imp: variable j's importance in the random forests classifier model
wAUC^Reg: AUC of the logistic regression model
wAUC^RF: AUC of the random forests classifier model
x_j^imp: ultimate importance of variable j with regard to the target variable

4. Data

In this section, the data and the pre-processing and preparation steps that were taken before developing the final predictive models are described. This research uses a unique dataset retrieved from the largest relational healthcare data warehouse in the US, Cerner Health Facts®. Health Facts® contains more than two decades of data on 84 million unique patients in 133 million encounters from over 500 health care facilities across the US [93]. The initial data included millions of records and hundreds of variables across various tables that needed to be integrated, cleaned, and pre-processed. Fig. 1 is a simplified diagram of this data, depicting the tables that are included in it and how they are related. In this diagram, the primary and foreign keys that are used to merge the various tables are specified; the primary key in each table is bold and italic, and the foreign key is underlined. The patient table contains demographic information such as age, race, marital status, and gender. The medication table holds a complete set of variables on the medications that patients take; examples are medication name/ID, dosage, ordering physician, order date, unit costs, etc. The diagnosis table is home to the diagnosis information (ICD-9 and ICD-10 codes). The encounter table contains all the variables related to the patients' visits; variables such as admission and discharge dates, admitting physician, admission type, and admission source are in this table. Information about the patients' vital signs, such as blood pressure, temperature, and respiratory rate, along with their respective collection dates and times, is stored in the clinical event table. Finally, the lab procedure table contains all the information about the patients' lab tests; this table includes variables such as lab name, lab completion date, and results. The diagnosis and medication tables are used to label patients in the PD and control groups, and variables in the patient, clinical event, and lab procedure tables are used in developing the models.

Fig. 1. Cerner data diagram.

Data cleaning/pre-processing, especially for EHR data, is a very critical and time-consuming task, and therefore a great deal of time and consideration was dedicated to it. This process also involved consulting with medical professionals to incorporate their expertise. Data retrieval from Health Facts® was based on ICD-9 (International Classification of Diseases) and ICD-10 diagnosis codes. An imbalanced dataset3 is used in this study to have a fair and factual setting, as in the real-world patient population only a small percentage of patients have PD. In fact, a limitation of many healthcare analytics studies is that they consider somewhat balanced data in their analysis. For the PD group, all patients with either ICD-9 or ICD-10 PD diagnosis codes were extracted, and that yielded data of 83,393 unique patients from about 1 million encounters. For the control group, data for a patient group 10 times larger than the PD group were extracted, and this yielded 833,921 unique patients having more than 5 million encounters.

Throughout the process, parts of the data were discarded because not all information in the various tables was available for all patients. The process started with merging the encounter4 and diagnosis data tables, and this led to having data of 442,556 unique patients, out of which 83,393 were in the PD group and the rest in the control group. This data was

3 In an imbalanced dataset, the number of records belonging to one class (called the majority class) outnumbers the number of records belonging to the other class (called the minority class).
4 An encounter is a hospital or clinic visit. The encounter table contains all of the information that is specific to that visit (encounter). The encounter ID is unique within this table. The same patient can have more than one record in this table over time.


from about 3 million encounters for these patients. The imbalanced ratio (the number of patients in the PD group divided by the total number of patients) for this data was about 18.8%. In the next step, the PD group was double-checked against the ICD-9 and ICD-10 diagnosis codes for PD, and 4203 patients were removed from the PD group because they did not have an ICD-9 or ICD-10 diagnosis code for PD. To be as close as possible to the first PD diagnosis, the first encounter with a PD diagnosis was kept, and the other encounters in the PD group were removed. This action did not change the number of patients and only decreased the number of encounters. Next, the lab procedure table was cleaned and merged with the master data. Combining the lab data led to removing 40,943 PD patients and 197,951 patients in the control group because there was no lab data for these groups of patients. At this stage, the imbalanced ratio was 19.2%.

Considering only ICD codes to form the PD and control groups might not be entirely accurate because of potential errors such as data entry. To avoid this issue and ensure that all patients in the PD group have PD and patients in the control group do not have PD, an extra step besides checking for ICD codes was taken. Reviewing the American Parkinson's Disease Association (APDA) website (www.apdaparkinson.org), a list of medications that PD patients could take was formed (see Appendix I). Then, using the Health Facts® medication table that stores the information on all medications patients take, only patients in the PD group that take PD medications were kept, and the others were removed. Additionally, all patients in the control group that take any PD medication were excluded. By conducting this two-stage identification using ICD codes and medications, the veracity of all patients in the PD group having PD and all patients in the control group not having PD was confirmed with a high degree of confidence. At this stage, the data was prepared to develop predictive models. This data included 15,669 unique PD patients and 160,722 unique patients in the control group and had an imbalanced ratio of 8.9%. This data is called Master I. The data selection steps are summarized in Fig. 2. There were many other data cleaning/pre-processing steps that were taken to deal with the various variables and tables in the data. These steps are briefly depicted in Fig. 3.

After having the Master I data in hand, Missing Care was applied to obtain the final data that contains all important variables with a reasonable degree of missing values. Master I included 81 variables, out of which many had very high missing values (as high as 98% missing!). Employing Missing Care, the majority of records with high missing values were removed to keep as many variables as possible with up to 30% missing values. This yielded a dataset with 3705 records and 81 variables that is much smaller than Master I. Then, Missing Care was applied to this data to identify the variables strongly associated with the target variable. After following the Missing Care guidelines, 30 variables were identified as important variables. Then, going back to Master I and keeping only these 30 variables, records with very high missing values were removed to reach a reasonable missing values degree for the selected 30 variables. This resulted in a dataset with 15,000 records, out of which 2000 were PD, and the rest belonged to the control group. This data, called Master II, was used to develop the final predictive models. The missingness in Master II was acceptable (the maximum missing value was 37%) for applying imputation methods. A descriptive analysis of the variables in Master II for both the PD and control groups is available in Appendix II. Not employing Missing Care would lead to losing many variables with a strong association with the target. Without applying Missing Care, only the variables with at most 35% missingness (the same missingness threshold that was used for Master II) are kept, and this results in a dataset with many records (113,759, considerably more than Master II's records). However, this data has only 7 variables, because most of the variables have missingness levels of 60% to 90%. This data is called Master III, and it is used to develop models and compare their performance with the models built on Master II. The data resulting from the Missing Care procedure compared to not using Missing Care is shown in Fig. 4.

5. Experiment results

In this section, the experimental results of the predictive models developed by employing Missing Care and also without employing Missing Care are provided. Logistic regression (LR), linear support vector machine (SVM-L), support vector machine with an RBF kernel (SVM-RBF), neural networks (NN), random forests (RF), and gradient boosting (GB) are used to develop the predictive models. For SVM and NN, which are sensitive to the variables' scale, all variables are transformed to a 0 to 1 scale. The parameters for each model have been tuned to get the best possible results, and 5-fold cross-validation is used to minimize the effect of data partitioning bias. First, the results of the models built without using Missing Care, after over-sampling the data by applying SIMO, are presented. Then, the results of the models developed based on the SIMO-oversampled data after applying Missing Care are presented. Next, the results of the two sets of models are compared, and the effectiveness of the Missing Care framework is shown. Table 2 shows the results of the models when Missing Care is not applied, and Table 3 presents the models after employing Missing Care. As shown, using Missing Care led to a significant improvement in the models' performance (a 5% to 7% AUC increase across the various models). Table 4 depicts the significance of this improvement at the level of 99% confidence. The reason is that Missing Care addresses data quality as a critical factor in analytics. Without using Missing Care, many important variables with high predictive power would be discarded from further analysis, and this would deteriorate the models' performance. The improvement across the various models is depicted in Fig. 5 as well. It needs to be pointed out that after (and during) applying Missing Care, the data still has missing values. However, the degree of missingness at these stages is manageable by applying imputation methods. In this study, the mean and mode were used to impute the missing values for the numeric and categorical variables, respectively. However, any other imputation method can be used.

Among the various machine learning techniques, random forests and gradient boosting had the best performance. The AUCs for these two models after applying Missing Care on the Master II SIMO-oversampled data are 84.3% and 84.63%, respectively. Therefore, for the rest of the analysis, the focus is on these two modeling techniques. As mentioned in the previous sections, imbalanced data is used in this study. There are various remedies for imbalanced data learning problems, one of which is synthetic informative minority over-sampling (SIMO) with its two versions, SIMO and W-SIMO (Weighted SIMO). Another popular over-sampling technique is SMOTE (Synthetic Minority Over-sampling Technique) [94], and a common under-sampling approach is random under-sampling (RUS). To obtain the best results, the performance of the predictive models after using SIMO, W-SIMO, SMOTE, and RUS is evaluated. The results of the models based on the various over-sampling and under-sampling techniques are presented in Table 5 and Fig. 6. In this study, models trained based on over-sampling techniques outperformed models based on RUS. And, among the over-sampling methods, SIMO was the best. The significance of the improvement gained from employing SIMO and W-SIMO compared with SMOTE and RUS is shown in Table 6. As is evident, not all differences are statistically significant. The results presented are the performance of the models on the validation data, which is in the original imbalanced distribution of the initial dataset.

To further improve the accuracy of the models, multiple predictive models were combined using an ensemble approach called the confidence margin ensemble [65]. Ensemble approaches are most beneficial when different models with relatively similar (and good) performance are combined. As the performance of the RF and GB models was much better than that of LR, SVM, and NN, only the RF and GB models were used to develop the confidence margin ensemble models. First, ensemble models were developed using RF and GB built on each of the imbalanced data learning techniques (SIMO, W-SIMO, SMOTE, and RUS). The performance of the confidence margin ensembles is shown in Table 7. In each ensemble, there is a marginal improvement compared to the


Fig. 2. Data selection steps.

individual models, RF and GB; this is observable in Fig. 6.

More models were integrated into the ensembles step by step to create even more accurate models. In Table 8 and Fig. 7, the first column (point) shows the ensemble of the RF and GB models based on SIMO-oversampled data. Next, the RF and GB models based on W-SIMO-oversampled data were added to the ensemble, and there was an improvement in the performance. In the next stage, the RF and GB models built on SMOTE-oversampled data were added; at this stage, there were six models in the ensemble, and there was an improvement from an AUC of 85.06% to 85.19%. Finally, the RF and GB models based on RUS data were added, and the AUC of the ensemble of 8 models was 85.29%. This final ensemble of 8 models is used in the development of the CDSS to diagnose and screen for PD.

6. Robustness checks

To ensure the effectiveness of Missing Care and the validity of the proposed CDSS to detect PD, a series of robustness checks needed to be conducted. First, the effectiveness of Missing Care and the validity of the CDSS were evaluated for datasets with various imbalanced ratios. These datasets were formed by randomly removing portions of the data belonging to the PD group. In this way, datasets with lower PD record ratios were created that were more realistic. Table 9 shows the performance of the CDSS at various imbalanced ratios. In each scenario, the models are developed as the ensemble of RF and GB over SIMO, W-SIMO, SMOTE, and RUS. This table illustrates the AUC values with and without applying Missing Care, the difference between the two scenarios, and the significance of the difference.

Second, to confirm the effectiveness of Missing Care, it was applied to another disease, diabetic retinopathy (DR). DR is the most common eye complication for diabetic patients, and about 28% of diabetic patients experience this complication [95]. The DR data is acquired from the same source (Cerner) as the PD data, and data preparation steps similar to what was done for the PD data are taken to pre-process it. The data for DR has different characteristics (number of rows, variables, and missingness) compared to the data used for PD. Table 10 shows the characteristics of the DR data, and Table 11 depicts the statistically significant improvement achieved by applying Missing Care.

7. Discussion

Many researchers have discussed the effectiveness and significance of information systems (IS) and IT tools in improving the care for patients and also making care provision more efficient [3,12,25]. The widespread adoption of EHR (as an IS/IT tool) in hospitals and clinics has led to the emergence of EHR-based healthcare predictive analytics research that brings about remarkable practical and medical value [3]. While various studies have shown the value of EHR data in analytics, the challenges associated with analyzing these types of data are rarely addressed in the literature. The prerequisite of competent data analytics research is to have high-quality data [29]. One important aspect of data quality is data completeness, and EHR data severely suffers from data incompleteness. In this study, by introducing a new framework called Missing Care, this critical challenge is addressed. Using Missing


Table 2
Models without applying Missing Care (on SIMO-oversampled Master III).

              LR       SVM-L    SVM-RBF  NN       RF       GB
AUC           73.63%   73.08%   73.11%   72.24%   78.02%   78.15%
Sensitivity   66.98%   66.35%   66.39%   66.08%   70.12%   70.37%
Specificity   66.08%   66.22%   66.13%   66.00%   70.15%   70.28%

Table 3
Models after applying Missing Care (on SIMO-oversampled Master II).

              LR       SVM-L    SVM-RBF  NN       RF       GB
AUC           80.63%   79.87%   79.36%   77.63%   84.30%   84.63%
Sensitivity   73.15%   72.98%   72.85%   69.95%   75.80%   76.25%
Specificity   72.78%   72.46%   72.08%   69.83%   75.28%   75.17%

Fig. 3. Data pre-processing steps.

Care, before developing any predictive model, out of the numerous features available in EHR data (many of them having very high missing values), we can identify the features that are highly associated with the target variable, and for the rest of the analysis, those selected variables will be involved in the model building. Without employing Missing Care, many important features with high predictive power might be discarded, only because the majority of their values in the initial data are missing. In the experimental analysis, the improvement in the prediction accuracy of models after using the Missing Care framework is demonstrated. While Missing Care is introduced in the context of EHR, it can be applied to other datasets with EHR characteristics (many variables with high degrees of missing values). It needs to be emphasized that Missing Care does not deny the importance of imputation techniques; rather, it helps preserve variables that could benefit from imputation in later stages of analysis, whereas they could otherwise be discarded at early stages due to very high missingness degrees.

Besides the immense traditional contribution of researchers to healthcare at the level of management and organizations, personalized medicine has received attention in recent years, and prestigious journals have recognized personalized medicine through CDSSs as one of the promising and impactful streams of research [3,20,25,26,96]. CDSSs can enhance decision-making capabilities in care management at the patient level rather than at the general-population level. In line with this recent movement in the healthcare analytics literature, and consistent with the design science research paradigm that focuses on solving practical problems, a CDSS is developed to detect and monitor PD, a highly underdiagnosed neurological disorder. Employing this CDSS, and thereby diagnosing PD earlier, has multiple benefits. It will provide more information to patients about their health status, thus enabling them to be proactive. It also helps clinics and physicians in being prepared for treatment planning. Finally, it improves the specialty care delivered to patients.

The contributions of this study can be summarized in two categories. First, to the best of my knowledge, this is the first study that formally discusses the data quality issue of incompleteness in EHR data and introduces a framework to address this issue. The benefits of applying this framework are demonstrated through empirical experiments

Fig. 4. Using Missing Care vs. not using Missing Care.
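As noted in Section 5, the missingness remaining after Missing Care was handled by mean imputation for numeric variables and mode imputation for categorical ones. A minimal sketch, with `None` marking a missing cell and `impute_column` as an illustrative helper name:

```python
# Mean/mode imputation as described in Section 5: numeric columns are
# filled with the column mean, categorical columns with the most
# frequent observed value.

from collections import Counter

def impute_column(values, numeric):
    observed = [v for v in values if v is not None]
    if numeric:
        fill = sum(observed) / len(observed)           # column mean
    else:
        fill = Counter(observed).most_common(1)[0][0]  # column mode
    return [fill if v is None else v for v in values]

ages = impute_column([70.0, None, 80.0], numeric=True)      # mean is 75.0
sex = impute_column(["F", "F", None, "M"], numeric=False)   # mode is "F"
```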


Table 4
Statistical significance of the effectiveness of Missing Care.

                 LR         SVM-L      SVM-RBF    NN         RF         GB
AUC difference   6.998%***  6.79%***   6.25%***   5.39%***   6.25%***   6.49%***
p-value          4.1E-05    1.81E-06   5.46E-05   4.63E-05   1.36E-04   1.18E-04

*** 99% confidence, ** 95% confidence, * 90% confidence; two-sample t-test, unequal variances.
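The footnotes of Tables 4, 6, 9, and 11 specify two-sample t-tests with unequal variances, i.e., Welch's test. A minimal sketch of the statistic and its Welch–Satterthwaite degrees of freedom (the p-value would then come from the t distribution, e.g., `scipy.stats.ttest_ind(a, b, equal_var=False)`):

```python
# Welch's two-sample t-test statistic (unequal variances), as used for
# the significance footnotes in the tables.

from statistics import mean, variance

def welch_t(a, b):
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

t0, _ = welch_t([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])   # identical samples: t = 0
t1, df1 = welch_t([5.0, 6.0, 7.0], [1.0, 2.0, 3.0])  # clearly shifted: large t
```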

Table 7
Confidence margin ensemble of RF and GB under various over- and under-sampling techniques.

              RUS      SMOTE    W-SIMO   SIMO
AUC           84.62%   84.65%   84.82%   84.97%
Sensitivity   75.70%   75.65%   76.05%   76.45%
Specificity   75.80%   75.57%   75.56%   75.88%
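The exact combination rule of the confidence margin ensemble [65] is not reproduced in this section. Purely as an illustration of the idea, the sketch below trusts, for each record, whichever model's predicted probability is farthest from the 0.5 decision boundary (the largest margin); this rule is an assumption for illustration, not the authors' method:

```python
# Illustrative margin-based combination of per-model probability
# outputs (e.g., RF and GB). NOT the exact rule of [65]: per record,
# the prediction of the most confident model is kept.

def margin_ensemble(prob_lists):
    """prob_lists: one list of predicted P(PD) per model, aligned by record."""
    combined = []
    for probs in zip(*prob_lists):
        best = max(probs, key=lambda p: abs(p - 0.5))  # largest margin wins
        combined.append(best)
    return combined

rf_probs = [0.60, 0.40, 0.90]
gb_probs = [0.95, 0.45, 0.55]
ens = margin_ensemble([rf_probs, gb_probs])
```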

Table 8
Confidence margin ensemble of RF and GB.

Ensemble of   SIMO     SIMO & W-SIMO   …, & SMOTE   …, & RUS
AUC           84.97%   85.06%          85.19%       85.29%
Sensitivity   76.45%   76.55%          76.15%       76.00%
Specificity   75.88%   75.44%          76.04%       76.29%

Fig. 5. AUC for models with and without Missing Care.

Table 5
AUC for models built on Master II under various over- and under-sampling techniques.

          LR       SVM-L    SVM-RBF  NN       RF       GB
SIMO      80.63%   79.87%   79.36%   77.63%   84.30%   84.63%
W-SIMO    80.53%   79.68%   79.20%   77.67%   84.34%   84.45%
SMOTE     80.31%   79.36%   79.29%   77.55%   84.21%   84.23%
RUS       80.08%   81.01%   80.26%   77.24%   84.12%   84.15%
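SMOTE [94] generates synthetic minority records by interpolating between a minority record and one of its minority-class nearest neighbors. A simplified sketch with a deterministic neighbor choice and a fixed interpolation weight instead of random ones (`smote_like` is an illustrative name, not the reference implementation):

```python
# Simplified SMOTE-style over-sampling: each synthetic record lies on
# the segment between a minority record and its nearest minority
# neighbor (Euclidean distance), at interpolation weight lam.

def smote_like(minority, n_new, lam=0.5):
    synthetic = []
    for i in range(n_new):
        x = minority[i % len(minority)]
        # nearest minority-class neighbor of x
        nn = min((m for m in minority if m is not x),
                 key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        synthetic.append([a + lam * (b - a) for a, b in zip(x, nn)])
    return synthetic

# Toy minority (PD) records with two numeric features:
pd_group = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0]]
new = smote_like(pd_group, n_new=1)   # midpoint of [1, 1] and its neighbor
```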

Fig. 7. AUC for the ensemble of RF and GB as more models are added.

Table 9
Difference between models with and without applying Missing Care (various imbalanced ratios).

Imbalanced ratio                      10%       6.50%     4%        1.30%
Missing Care: AUC                     85.19%    85.11%    84.71%    83.80%
Missing Care: # of PD patients        1500      975       600       195
No Missing Care: AUC                  78.69%    78.57%    78.50%    77.09%
No Missing Care: # of PD patients     11,376    7394      4550      1479
Difference (with vs. without)         6.51%***  6.53%***  6.21%***  6.71%***
p-value for difference                0.00016   0.00012   0.00098   0.00478

*** 99% confidence, ** 95% confidence, * 90% confidence; two-sample t-test, unequal variances.

Fig. 6. AUC of RF and GB and their ensembles using various imbalanced data learning techniques.

Table 6
The significance of the difference between SIMO and SMOTE and RUS.

Difference between    RUS       SMOTE
GB model / SIMO       0.48%*    0.40%***
p-value               0.0943    0.0016
RF model / W-SIMO     0.22%**   0.13%
p-value               0.0111    0.2991

*** 99% confidence, ** 95% confidence, * 90% confidence; two-sample t-test, unequal variances.

Table 10
DR data characteristics.

# of variables   # of patients   Avg. missing rate   Max missing rate   Min missing rate
91               451,392         74.1%               97.8%              29.8%


Table 11
Difference between models with and without applying Missing Care for DR.

                   AUC          # of variables   # of patients   # of DR patients
Missing Care       92.83%       30               64,562          9943
No Missing Care    88.54%       9                136,505         22,415
Difference         4.287%***
p-value            1.80571E-05

*** 99% confidence, ** 95% confidence; two-sample t-test, unequal variances.

in various predictive modeling techniques. Second, in this study, a CDSS that can aid clinicians in diagnosing PD is developed. The superiority of this CDSS over the existing diagnostic methods is that it can be applied using only the demographic and lab test information of the patients, and there is no need for more advanced equipment, such as MRI machines, which is scarce in more remote areas.

7.1. Practical implications

The proposed framework, Missing Care, can be used in the development of predictive models based on EHR or other similar datasets. These predictive models can be used to enhance decision making in various contexts, including healthcare. The CDSS developed in this study can be employed in different forms and can be beneficial to all stakeholders in healthcare. It can be integrated into EHR systems and automatically provide the risk of having PD for patients. It can also be used as a standalone tool; then, primary care providers and even nurses can use it as a screening tool. Since this CDSS is very easy to employ, it can fill the gap of specialist and equipment scarcity in rural and remote areas, as well as in many developing countries that are similar to US rural areas with regard to healthcare accessibility. This CDSS is easy to use and accurate at the same time. The ensemble models developed in this study could reach an AUC of 85.3%, which, given the complexity of diagnosing PD, is a very good accuracy for a screening tool that uses simple blood tests. Applying this CDSS can lead to earlier diagnosis for patients, and earlier diagnosis can always allow for more effective treatments and interventions; this could translate to cost savings both for patients themselves and for the whole healthcare system and country.

Another benefit of this CDSS is identifying the risk factors for PD. Among the identified factors, some are known to the medical community, and another group of factors was not mentioned in the medical literature (to the best of my knowledge). These new factors could be potential leads for medical researchers to conduct more controlled clinical studies and evaluate their relationship with PD. Factors identified by the CDSS for which medical studies have shown a connection to PD are age and gender [97]; alanine aminotransferase, aspartate aminotransferase, and glucose [98]; platelet count and lymphocytes [99]; glomerular filtration rate [100]; blood pressure and heart rate [101]; serum sodium and chloride [102]; blood monocytes [103]; and white blood cell count [104]. No medical research was found studying the following factors that were identified by the CDSS: mean corpuscular volume, creatinine serum, specific gravity urine, partial thromboplastin time, blood urea nitrogen, and basophils percent.

7.2. Limitations and future research

This work, similar to other analytics research based on EHR data, has the limitation of the initial labeling of patients as PD or non-PD before training the models. While most studies rely only on ICD diagnosis codes, in this study a reliable measure is taken to mitigate the effect of this limitation through a two-stage initial identification based both on the ICD diagnosis codes in the data and on the medications that patients take. Future research can be conducted by collecting data specific to the purpose of this research and in a more controlled manner, thereby providing causal inference.

Acknowledgment

This work was conducted with data from the Cerner Corporation's Health Facts database of electronic medical records provided by the Oklahoma State University Center for Health Systems Innovation (CHSI). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Cerner Corporation. I would like to acknowledge Dr. Yasamin Vahdati for proofreading the article, and Dr. Delen, Dr. Paiva, and Dr. Miao from CHSI for providing the data.

Appendix A. List of PD medications

Carbidopa-Levodopa
Pramipexole
Ropinirole
Benztropine
Amantadine
Selegiline
Carbidopa/Entacapone/Levodopa
Trihexyphenidyl
Rasagiline
Ropinirole
Rotigotine
Carbidopa
Tolcapone
Levodopa
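The two-stage PD/control labeling described in Section 4 (ICD diagnosis codes plus the medication list above) can be sketched as follows. The patient dictionaries and the `PD_MEDS` subset are toy illustrations, not the Health Facts schema:

```python
# Sketch of the two-stage labeling: a patient stays in the PD group only
# if they have a PD ICD code AND take a PD medication from the Appendix A
# list; control patients taking any PD medication are excluded. PD_MEDS
# is a small subset of the full list, for illustration only.

PD_MEDS = {"Carbidopa-Levodopa", "Pramipexole", "Ropinirole", "Rasagiline"}

def label_patients(patients):
    pd_group, control = [], []
    for p in patients:
        takes_pd_med = bool(PD_MEDS & set(p["meds"]))
        if p["pd_icd"] and takes_pd_med:
            pd_group.append(p["id"])        # confirmed PD
        elif not p["pd_icd"] and not takes_pd_med:
            control.append(p["id"])         # confirmed non-PD
        # all other patients are excluded as ambiguous
    return pd_group, control

patients = [
    {"id": 1, "pd_icd": True, "meds": ["Pramipexole"]},
    {"id": 2, "pd_icd": True, "meds": []},              # excluded
    {"id": 3, "pd_icd": False, "meds": ["Ropinirole"]},  # excluded
    {"id": 4, "pd_icd": False, "meds": ["Metformin"]},
]
groups = label_patients(patients)
```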

Appendix B. Descriptive statistics for the patients in both PD and control groups

Variables PD Patients Control Patients

Mean STD Median Mean STD Median

Alanine Aminotransferase SGPT 21.83 19.46 18.00 32.72 34.40 26.00


Alkaline Phosphatase 85.88 39.72 78.00 93.60 48.24 83.50
Anion Gap 9.01 3.08 9.00 8.99 3.16 9.00
Aspartate Aminotransferase 30.91 33.58 23.00 36.87 59.24 25.00


Basophils Percent 0.51 0.30 0.50 0.55 0.34 0.55


Bilirubin Total 0.70 0.46 0.60 0.74 0.70 0.60
Blood Pressure, Systolic 131.08 14.66 129.89 129.71 16.58 129.89
Blood Urea Nitrogen 21.75 12.83 19.00 18.92 13.33 15.00
Carbon Dioxide 25.94 3.85 26.00 25.83 3.86 25.85
Chloride Serum 104.78 4.92 105.00 104.09 4.71 104.00
Creatinine Serum 1.07 0.62 0.90 1.08 0.71 0.90
Glomerular Filtration Rate 58.04 16.64 60.00 60.70 21.52 60.00
Glucose Serum 113.87 37.40 104.00 115.39 39.72 104.00
Heart Rate 80.54 10.26 81.86 82.07 11.58 81.86
Height 65.53 3.54 65.47 65.46 3.60 65.47
International Normalized Ratio 1.27 0.42 1.19 1.25 0.41 1.15
Lymphocyte Percent 19.91 7.97 21.39 21.62 10.27 21.39
Mean Corpuscular Volume 91.58 5.89 91.80 89.64 6.74 89.90
Mean Platelet Volume 8.7 1.3 8.7 8.73 1.24 8.7
Monocyte Percent 8.12 2.71 7.90 7.87 2.80 7.90
Partial Thromboplastin Time 32.30 8.48 32.00 32.86 8.96 32.79
Platelet Count 219.90 88.24 208.00 231.45 96.67 222.00
Prothrombin Time 14.17 3.92 13.60 14.07 3.98 13.60
Sodium 139.21 3.86 139.00 138.61 3.67 139.00
Specific Gravity Urine 1.016 0.005 1.015 1.015 0.006 1.015
Weight 167.91 36.40 176.02 178.89 42.28 177.42
White Blood Cell Count 8.27 3.37 7.60 8.57 3.70 7.90
Age 77.12 9.22 78.00 59.49 20.06 61.00

Male Female Male Female


Gender 55% 45% 44% 56%
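Group-wise summaries of the kind reported in Appendix B can be reproduced from a raw patient-level extract with a few lines of pandas. The sketch below is only an illustration: the dataframe, its `group` column, and the two lab-value columns are hypothetical stand-ins, not the study's actual data or pipeline.

```python
import pandas as pd

# Hypothetical mini-extract: one row per patient, a cohort label
# ("PD" or "Control") and one column per measured variable.
df = pd.DataFrame({
    "group":      ["PD", "PD", "PD", "Control", "Control", "Control"],
    "sodium":     [139.0, 140.0, 138.0, 138.0, 139.0, 137.0],
    "heart_rate": [80.0, 82.0, 79.0, 83.0, 81.0, 84.0],
})

# Mean, standard deviation, and median per cohort, mirroring the
# three statistics tabulated for each variable in Appendix B.
summary = df.groupby("group").agg(["mean", "std", "median"])
print(summary.round(2))
```

The result is a dataframe with one row per cohort and a two-level column index (variable, statistic), which maps directly onto the table's layout.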

Saeed Piri (spiri@uoregon.edu) is an Assistant Professor at Lundquist College of Business, University of Oregon. His research interests lie in data analytics with a particular focus on healthcare. His research contributes both to data mining methodologies, such as imbalanced data learning, ensemble modeling, and association analysis, and to data analytics applications, such as developing clinical decision support systems and personalized medicine. In addition, Dr. Piri has been studying value-based payment systems in healthcare, online retail platforms, and electronic medical record systems. In his work, he employs advanced machine learning techniques as well as classical empirical methods.