Authorized licensed use limited to: Cornell University Library. Downloaded on September 03,2020 at 10:29:40 UTC from IEEE Xplore. Restrictions apply.
benchmarking the regression models versus the deep learning model for heart failure hospitalizations from the hospital electronic medical records in ICU settings.
In this research, we aim to analyze and compare the prediction abilities of ensemble regression-based machine learning methods against deep neural networks, and to find the optimal prediction method for the heart failure LOS regression prediction problem in ICU-based hospitalizations.
II. METHOD
This section describes the proposed predictive framework for heart failure LOS prediction using ICU electronic medical record data. It also explains the steps towards building the predictive framework and establishing the best-performing predictive model for HF length of stay admissions. The best-performing model is then selected, and the top features are passed to the model tuning stage. Relevant model evaluation metrics were chosen to examine the performance of each model.

Figure 1. Proposed Predictive Framework for HF Length of Stay in ICU Hospital Management Systems.
A. Data Description and Data Preparation

The dataset was constructed from the MIMIC-III dataset [15], an openly available dataset developed by the MIT Lab for Computational Physiology. It comprises de-identified health data covering 61,532 intensive care unit stays, including intensive care unit admissions, demographics, vital signs, laboratory tests, medications, and other clinical variables.

During the data preparation process, we used (process_mimic.py), developed in [16], to mine the cardiovascular inpatients' variables. We merged the mined csv file, which contains Vitals, Laboratory tests, and Demographics, with variables from the MIMIC-III tables ("ADMISSIONS", "PATIENTS", "ICUSTAYS", and "DIAGNOSES_ICD"). Data blending was performed, and all selected variables were combined using merge and join on DataFrames with Pandas in Python into one final table ("HF_MIMIC1_4v"). For missing values, the "Impute Missing Values" technique [17] was applied to replace missing values with values that are meaningful for the heart failure admitted cases in the dataset.
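A minimal sketch of this blending and imputation step with Pandas. The toy frames stand in for the mined csv file and the MIMIC-III tables; column values are illustrative, while the join keys (SUBJECT_ID, HADM_ID) and the ICD-9 428.x heart failure codes follow the MIMIC-III schema.

```python
import pandas as pd

# Toy stand-ins for the mined vitals/labs/demographics file and the
# MIMIC-III ICUSTAYS / DIAGNOSES_ICD tables (real data comes from
# process_mimic.py and the MIMIC-III csv exports).
vitals = pd.DataFrame({"SUBJECT_ID": [1, 2], "HADM_ID": [10, 20],
                       "HeartRate": [92.0, 101.0], "Weight": [80.8, None]})
icustays = pd.DataFrame({"SUBJECT_ID": [1, 2], "HADM_ID": [10, 20],
                         "LOS": [10.77, 8.13]})
diagnoses = pd.DataFrame({"SUBJECT_ID": [1, 2], "HADM_ID": [10, 20],
                          "ICD9_CODE": ["42822", "4280"]})

# Blend the tables on the shared MIMIC-III keys.
hf = (vitals
      .merge(icustays, on=["SUBJECT_ID", "HADM_ID"], how="inner")
      .merge(diagnoses, on=["SUBJECT_ID", "HADM_ID"], how="inner"))

# Keep heart failure admissions only (ICD-9 428.x), then impute
# missing numeric values, here with the column median.
hf = hf[hf["ICD9_CODE"].str.startswith("428")].copy()
num_cols = hf.select_dtypes("number").columns
hf[num_cols] = hf[num_cols].fillna(hf[num_cols].median())
```

The inner joins keep only admissions present in every table, mirroring the "data blending" described above; the median imputation is one concrete choice for the "Impute Missing Values" step.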
B. Data Preprocessing

A Pearson correlation test was implemented to determine the correlation between the independent variables and the output variable. The features (25 independent variables) used as inputs for the candidate models were selected according to their correlation with the dependent variable (LOS). To convert categorical variables, one-hot encoding ("0,1") was used for variables such as Gender, Admit type, and Admit location. This technique is essential to improve the results of the prediction models. A data scaling ("normalization") technique was applied to obtain scaled features from the dataset, which were then passed to the prediction models.
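The preprocessing steps above can be sketched as follows. The column names and the correlation threshold are illustrative, and min-max scaling is one possible reading of the "normalization" step, which the paper does not specify further.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame standing in for the prepared HF table (the real table has
# 25 selected predictors plus the LOS target).
df = pd.DataFrame({"HeartRate": [92, 101, 88, 110],
                   "RespRate": [18, 22, 16, 24],
                   "Gender": ["F", "M", "M", "F"],
                   "LOS": [10.8, 8.1, 5.2, 14.0]})

# Pearson correlation of each numeric predictor with the target LOS;
# keep predictors above an (illustrative) absolute-correlation cutoff.
corr = df.corr(numeric_only=True)["LOS"].drop("LOS")
selected = corr[corr.abs() > 0.2].index.tolist()

# One-hot encode categorical variables into 0/1 indicator columns.
encoded = pd.get_dummies(df[["Gender"]], dtype=int)

# Scale the numeric features before passing them to the models.
X = pd.concat([df[selected], encoded], axis=1)
X_scaled = MinMaxScaler().fit_transform(X)
```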
C. Candidates Models

The choice of candidate models (see Figure 1) reflects the nature of the length of stay data type (scale/continuous) extracted from the MIMIC-III repository. Accordingly, we considered regression-based models and evaluated the Random Forest Regressor [18], Gradient Boosting Regressor [19], and Stacking Regressor [20], and then evaluated the best regressor model against a Deep Neural Network model [21].

D. Candidates Models Evaluation

Model evaluation is important in prediction tasks to examine the performance of each model and report the outperforming model(s). In our study, we used R-squared (R2), or the coefficient of determination, which indicates how much variation of the dependent variable is explained by the independent variable(s) in a regression model. We also used the Mean Absolute Error (MAE, equation 1) to measure the difference between the continuous observed and predicted variables. Note that the closer R2 is to 1.0, the better the prediction model performance. We evaluated the regression models against each other (stage 1), and then assessed the performance of the winning regressor against the DNN model (stage 2).

MAE = ( ∑_{i=1}^{n} |LOS_pred − LOS_obsrv| ) / n    (1)

where LOS_pred is the predicted length of stay, LOS_obsrv is the observed length of stay, and n is the sample size (n = HF patients).

E. Model Fitting and Hyperparameters Optimization

Hyperparameter tuning is vital for the success of the fitted model. Hyperparameters determine the skill of the selected model; unlike parameters learned from data, they are manually configured by the machine learning expert. Grid search parameter tuning and random search parameter tuning [22-23] are common tuning algorithms and are considered the last step before obtaining prediction results. In our study, the choice of model tuning is based on the winning model.

III. RESULTS AND DISCUSSION

We identified all distinct heart failure hospitalizations (1,592) based on ICD-9. Table 2 shows selected variables (Demographic and Vital) from the HF patients' characteristics in the study. Male inpatient HF hospitalizations (52%) slightly outnumbered female inpatient HF hospitalizations (48%). The mean and median patient age are 67.74 and 62 years, respectively (Table 2). The minimum LOS is 0.13 days, and the maximum LOS is 93.94 days. All prediction models were implemented in the Python programming language. The Scikit-learn [24] library was used for building the regression models and Keras [25] for the deep learning model. We divided the HF dataset into training (64%) and testing (34%). We
have compared the regression models with the deep learning model using the appropriate evaluation metrics.
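The two metrics from Section II-D can be computed directly with scikit-learn; the observed and predicted LOS values below are illustrative, not the study's results.

```python
from sklearn.metrics import mean_absolute_error, r2_score

# Illustrative observed vs predicted LOS values (days).
los_obsrv = [10.8, 8.1, 5.2, 14.0, 3.5]
los_pred = [9.9, 8.8, 6.0, 12.5, 4.1]

# R2: share of LOS variance explained by the predictions.
r2 = r2_score(los_obsrv, los_pred)

# MAE: mean absolute gap between predicted and observed LOS,
# i.e. equation (1).
mae = mean_absolute_error(los_obsrv, los_pred)
```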
We considered three regression models (Random Forest Regressor "RFR", Gradient Boosting Regressor "GBR", and the Stacking Regressor). The regressors were built and evaluated, including execution time. The results (Figure 2) showed that the three models (GBR, Stacking Regressor, RFR) achieved relatively close R2 and MAE values (R2: 0.81, 0.81, and 0.80) and (MAE: 2.00, 1.92, and 1.98), respectively. The Stacking Regressor was the slowest in prediction execution speed (11.85 seconds), and GBR was the fastest. GBR and Stacking had the best R2. Model training and evaluation time is vital in real-time settings, such as predicting clinical and medical cases; therefore, GBR was our winning regressor model.
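A sketch of this stage-1 comparison with scikit-learn. The synthetic data, the 64/34 split, and the stacking configuration (base learners and final estimator) are assumptions for illustration; the paper does not state how its Stacking Regressor was composed.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 25-feature HF matrix and LOS target.
X, y = make_regression(n_samples=400, n_features=25, noise=5.0,
                       random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.34,
                                          random_state=0)

models = {
    "RFR": RandomForestRegressor(random_state=0),
    "GBR": GradientBoostingRegressor(random_state=0),
    "Stacking": StackingRegressor(
        estimators=[("rf", RandomForestRegressor(random_state=0)),
                    ("gb", GradientBoostingRegressor(random_state=0))],
        final_estimator=RidgeCV()),
}

# .score() returns R2 on the held-out test set for each regressor.
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te)
          for name, m in models.items()}
```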
Figure 2. Single Regression predictors vs Stacked predictor.

The Deep Neural Networks (DNN) approach was implemented using Keras with a TensorFlow backend, using three dense layers (25 units). We used a "linear" activation and Stochastic Gradient Descent (SGD) as the optimizer. Figure 3 illustrates the DNN prediction results.
Table 1 compares the prediction results of the regression models against the DNN model. We noticed that the regression models showed overall better results compared to the DNN. Generally, DNN performs well on larger and higher-dimensional data due to its automatic feature learning benefit. We therefore decided to fit GBR in the model tuning stage.
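A sketch of such a Keras regressor, assuming three 25-unit dense layers with linear activations and an SGD optimizer; the layer widths, learning rate, and loss beyond what is stated above are assumptions, and the data is synthetic.

```python
import numpy as np
from tensorflow import keras

# Synthetic stand-in for the 25-feature HF matrix and LOS target.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 25)).astype("float32")
y = (X[:, :5].sum(axis=1)
     + rng.normal(scale=0.1, size=256)).astype("float32")

# Three dense layers with linear activation, SGD optimizer; MAE is
# tracked alongside the loss, as in Figure 3.
model = keras.Sequential([
    keras.layers.Input(shape=(25,)),
    keras.layers.Dense(25, activation="linear"),
    keras.layers.Dense(25, activation="linear"),
    keras.layers.Dense(1, activation="linear"),
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01),
              loss="mse", metrics=["mae"])
history = model.fit(X, y, epochs=20, batch_size=32, verbose=0)
```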
Figure 3. Loss vs MAE for the DNN model on training and testing HF sets.

Since the Gradient Boosting Regressor performed well compared to the other regression models as well as the DNN, we refined the model using GridSearchCV from scikit-learn to obtain the best estimators (hyperparameters) for the GBR model. One of the advantages of GBR is that it is robust to the over-fitting problem; hence a larger number of estimators (n_estimators = 200) can result in better performance. The second tuned parameter was max_depth, which limits the depth of the individual regression estimators; its default value is 3, and we set the maximum depth of the individual regression estimators to three (max_depth = 3). The third tuned parameter was the loss function, which we set to ('ls', 'lad'), where 'ls' is least squares regression and 'lad' is least absolute deviation. We fitted GBR on the predictors (Figure 4). After we fitted the GBR with the best hyperparameters and the top features, the model achieved improved results (R2: 0.84±0.07).
Table 1. Regression vs DNN Prediction results

Model                 R2           MAE          Time
RFR                   0.80±0.08    1.98±0.16    0.95 sec
GBR                   0.81±0.07    2.00±0.14    0.85 sec
Stacking Regression   0.81±0.08    1.92±0.15    11.84 sec
DNN Regression        0.77±0.06    2.30±0.18    4.11 sec

IV. LIMITATIONS

In this section, we report some of the limitations associated with our study. First, since the aim was to develop a machine learning predictive length of stay approach for HF patients, we did not consider the other existing medical comorbidities associated with HF inpatients, which may need further exploration in a future study to measure their impact on the LOS or the extended LOS. Also, we did not address post-hospitalization interventions and their effect on heart failure LOS; therefore, a further research study is needed to address the effect of post-hospitalization interventions on LOS. We aim to exploit other deep learning approaches, such as the Long Short-Term Memory "LSTM" algorithm, in future research work.
V. CONCLUSION

This study aimed to develop a machine learning length of stay predictive framework to predict heart failure patients' hospitalizations in intensive care units using historical electronic medical record data. The GBR regressor outperformed all other models in this study (R2 0.81±0.07). Notably, the deep learning model (Deep Learning Regression) did not report better results than the regression models. Hence, it will be
commendable to examine DNN on a big EMR dataset with a large volume of heart failure hospitalizations and compare it with the regression models in a future study. Moreover, in future work, we will validate our proposed predictive LOS framework on a real-world external dataset to evaluate the performance of the proposed model on other data sources. Our contribution provides new insight to healthcare decision-makers in hospital management systems, clinical AI researchers, and clinical practitioners to predict patients' future outcomes, as well as to determine the costly HF hospitalizations using artificial intelligence (AI) approaches. Further, it facilitates the mission of clinical AI research towards a more comprehensive framework to predict the common LOS of heart failure hospitalizations from clinical diagnoses and to improve resource utilization using machine learning advancements.

Figure 4. Best 10 features (predictors) for GBR model. * beats per minute, ** breaths per minute.

Table 2. Patient Characteristics

                    HF (Mean / Median)            Std      Min         Max
Demographics
LOS                 (10.77 / 8.13) days           9.16     0.13 days   93.94 days
Age                 67.74 / 62                    13.17    0           89
Weight              Mean: 80.84                   9.79     34.59       127.59
Gender              Female: 765 (48%) / Male: 827 (52%)
Vital
Temperature (F)     Mean: 97.83                   3.41     33.9        100.48
Heart Rate          Mean: 92.27                   22.19    35.5        161.41
Respiratory Rate    Mean: 18.02                   3.40     8.4         27.8
Pulse Oximetry      Mean: 97.09                   1.60     57.85       99.83

REFERENCES

[1] P. Ponikowski et al., "2016 ESC Guidelines for the diagnosis and treatment of acute and chronic heart failure: The Task Force for the diagnosis and treatment of acute and chronic heart failure of the European Society of Cardiology (ESC). Developed with the special contribution of the Heart Failure Association (HFA) of the ESC," European Journal of Heart Failure, vol. 18, no. 8, pp. 891-975, 2016.
[2] G. Savarese and L. H. Lund, "Global public health burden of heart failure," Cardiac Failure Review, vol. 3, no. 1, p. 7, 2017.
[3] W. Lesyuk, C. Kriza, and P. Kolominsky-Rabas, "Cost-of-illness studies in heart failure: A systematic review 2004-2016," BMC Cardiovascular Disorders, vol. 18, no. 1, p. 74, 2018.
[4] Y.-K. Chan et al., "Current and projected burden of heart failure in the Australian adult population: a substantive but still ill-defined major health issue," BMC Health Services Research, vol. 16, no. 1, p. 501, 2016.
[5] S. L. Jackson, X. Tong, R. J. King, F. Loustalot, Y. Hong, and M. D. Ritchey, "National burden of heart failure events in the United States, 2006 to 2014," Circulation: Heart Failure, vol. 11, no. 12, p. e004873, 2018.
[6] A. Almashrafi, H. Alsabti, M. Mukaddirov, B. Balan, and P. Aylin, "Factors associated with prolonged length of stay following cardiac surgery in a major referral hospital in Oman: a retrospective observational study," BMJ Open, vol. 6, no. 6, p. e010764, 2016.
[7] B. Alsinglawi and O. Mubin, "Predictive Analytics and Deep Learning Techniques in Electronic Medical Records: Recent Advancements and Future Direction," in Workshops of the International Conference on Advanced Information Networking and Applications, 2019: Springer, pp. 907-914.
[8] B. Alsinglawi, O. Mubin, M. Novoa, and O. Darwish, "Benchmarking Predictive Models in Electronic Health Records: Sepsis Length of Stay Prediction," to appear in International Conference on Advanced Information Networking and Applications, 2020: Springer.
[9] H. Omar and M. Guglin, "Longer-than-average length of stay in acute heart failure," Herz, vol. 43, no. 2, pp. 131-139, 2018.
[10] M. S. Durstenfeld, M. D. Saybolt, A. Praestgaard, and S. E. Kimmel, "Physician predictions of length of stay of patients admitted with heart failure," Journal of Hospital Medicine, vol. 11, no. 9, pp. 642-645, 2016.
[11] P.-F. J. Tsai et al., "Length of hospital stay prediction at the admission stage for cardiology patients using artificial neural network," Journal of Healthcare Engineering, vol. 2016, 2016.
[12] S. Barnes, E. Hamrock, M. Toerper, S. Siddiqui, and S. Levin, "Real-time prediction of inpatient length of stay for discharge prioritization," Journal of the American Medical Informatics Association, vol. 23, no. e1, pp. e2-e10, 2015.
[13] L. Turgeman, J. H. May, and R. Sciulli, "Insights from a machine learning model for predicting the hospital Length of Stay (LOS) at the time of admission," Expert Systems with Applications, vol. 78, pp. 376-385, 2017.
[14] N. binti Omar, E. Supriyanto, and R. H. Al-Ashwal, "Personalized Clinical Pathway for Heart Failure Management," in 2018 International Conference on Applied Engineering (ICAE), 2018: IEEE, pp. 1-5.
[15] A. E. Johnson et al., "MIMIC-III, a freely accessible critical care database," Scientific Data, vol. 3, p. 160035, 2016.
[16] D. Gligorijevic et al., "Deep Attention Model for Triage of Emergency Department Patients," in Proceedings of the 2018 SIAM International Conference on Data Mining, 2018: SIAM, pp. 297-305.
[17] J. A. Sterne et al., "Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls," BMJ, vol. 338, p. b2393, 2009.
[18] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18-22, 2002.
[19] scikit-learn, "Gradient Boosting Regressor," https://scikit-learn.org (accessed 15/01/2020).
[20] L. Breiman, "Stacked regressions," Machine Learning, vol. 24, no. 1, pp. 49-64, 1996, doi: 10.1007/bf00117832.
[21] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85-117, 2015.
[22] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, no. Feb, pp. 281-305, 2012.
[23] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, no. Feb, pp. 281-305, 2012.
[24] F. Pedregosa et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, no. Oct, pp. 2825-2830, 2011.
[25] F. Chollet, "Keras," 2015.