PhysioNet 2012 Challenge: Predicting ICU Patient Mortality. Felipe Feijoo, Dongping Du, and Thomas Gebregergis
$$\left(\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^{2}}\right)$$

Kurtosis:

$$k_1=\frac{\dfrac{1}{n}\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^{4}}{\left(\dfrac{1}{n}\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^{2}\right)^{2}}$$

$$\underset{\beta_0,\,\beta_1}{\text{Minimize}}\left\{\sum_{i=1}^{n}\left|X_i-(\beta_0+\beta_1 t_i)\right|\right\}$$

C. Feature Selection

T-test. A t-test was performed on all the candidate features to check whether they are significant in predicting the mortality of ICU patients, assuming that each feature has independent samples for both groups. In total, 248 features were tested to determine whether the population means of the two groups are significantly different. The result of this hypothesis test indicates the relevance of each variable to the outcome, which is 0 for patients who survived and 1 for in-hospital death. The hypotheses of this statistical test are:

$$H_0:\ \mu_{0i}=\mu_{1i} \qquad\qquad H_1:\ \mu_{0i}\neq\mu_{1i}$$

where
$\mu_{0i}$ = population mean of feature i for the group of patients who survive at the end of 48 hours,
$\mu_{1i}$ = population mean of feature i for the group of patients who were dead within 48 hours,
and i = 1, 2, 3, …, 248.

The test statistic is

$$t=\frac{\bar{y}_{1i}-\bar{y}_{0i}}{\sqrt{\dfrac{s_{1i}^{2}}{n_{1i}}+\dfrac{s_{0i}^{2}}{n_{0i}}}}$$

where
$\bar{y}_{0i}$ = sample mean of feature i for the group of patients who survive at the end of 48 hours,
$\bar{y}_{1i}$ = sample mean of feature i for the group of patients who were dead within 48 hours,
$s_{1i}^{2}$ = sample variance of feature i for the group of patients who were dead within 48 hours,
$s_{0i}^{2}$ = sample variance of feature i for the group of patients who survive at the end of 48 hours,
$n_{1i}$ = number of samples for the group of patients who were dead within 48 hours,
$n_{0i}$ = number of samples for the group of patients who survive at the end of 48 hours.

If

$$|t|>t_{0.025,\;n_{1i}+n_{0i}-2},$$

we reject the null hypothesis. This indicates that the means of the two groups are different, which in turn indicates that feature i is a qualified predictor.

In general, a feature with a p-value of less than 0.05 was considered a significant predictor. Thus, out of the 248 features tested using this statistic, 137 features were selected, including the predictor variables Age, Height and Weight.

Correlation Test using Pearson's correlation coefficient. Pearson's correlation test was then performed on the 137 significant features to check for linear relationships among them. The correlation coefficients of the 137 features were calculated and placed in a correlation matrix. We were interested in coefficients with absolute value greater than 0.9; 28 pairs of variables were in this range. To decide which variable of each pair to remove, we looked at the p-values of the t-test used for feature selection: the variable with the smaller p-value was kept as the representative of the highly correlated pair. When developing the logistic regression models (discussed later in the paper), feature selection algorithms were also utilized (forward selection, backward selection and stepwise selection).

D. Outliers and Data Imputation

The deletion or replacement of outlier data points was performed after the features were calculated (we could have removed the outliers first as well). Depending on the feature, the outliers were deleted, or replaced by the median or mean of the column. A chi-squared test for outliers was also utilized to evaluate whether a data point could be classified as an outlier. Two data sets were created. The first data set was developed by replacing the missing values (data imputation) and outliers by the mean value of the column, and the second data set by using the median value of the column (without considering the values of the outliers in either data set). In the models, we only considered features with less than 50% missing values.

E. Model Description

Neural Network Model. The neural network model has been widely used in many fields due to its good performance on complex and nonlinear problems. Most of the data in our case is not normally distributed; hence, a neural network may be a more efficient tool to learn the hidden information in the data. Figure 1 shows the overall block diagram of the neural network structure used in this project. 110 features were left after removing non-significant and highly correlated items. Considering the model complexity and computational efficiency, we chose a three-layer network with $S^{1}=50$, $S^{2}=50$, and $S^{3}=1$ neurons in the respective layers. Since we are estimating the survival probability of patients, the log-sigmoid transfer function is applied at each layer. The output of this neural network is:

$$a^{3}=f^{3}\left(W^{3}f^{2}\left(W^{2}f^{1}\left(W^{1}P+b^{1}\right)+b^{2}\right)+b^{3}\right)$$

where

$$W^{i}=\begin{pmatrix} w^{i}_{1,1} & \cdots & w^{i}_{1,n}\\ \vdots & \ddots & \vdots\\ w^{i}_{S^{i},1} & \cdots & w^{i}_{S^{i},n}\end{pmatrix},\qquad b^{i}=\left(b_{1},\ldots,b_{S^{i}}\right)^{T},$$

i = 1, 2, 3 and n = 110, 50 and 50 accordingly. P is the matrix composed of all the features, and $a^{3}$ is the predicted value. All the model parameters are fitted to give the optimal fit to the in-hospital outcomes.
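As a rough illustration of the three-layer forward pass described above, the following NumPy sketch evaluates a log-sigmoid network with layer sizes 110, 50, 50 and 1. The weights and inputs are random placeholders rather than the fitted parameters of the paper, and the names `logsig`, `forward`, `weights` and `biases` are hypothetical choices made only for this example.

```python
import numpy as np

def logsig(x):
    """Log-sigmoid transfer function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(P, weights, biases):
    """Apply a = logsig(W a + b) layer by layer, starting from the input P."""
    a = P
    for W, b in zip(weights, biases):
        a = logsig(W @ a + b)
    return a

# Placeholder parameters (assumption): random weights, zero biases.
rng = np.random.default_rng(0)
sizes = [110, 50, 50, 1]  # inputs, first layer, second layer, output
weights = [rng.normal(scale=0.1, size=(sizes[k + 1], sizes[k])) for k in range(3)]
biases = [np.zeros((sizes[k + 1], 1)) for k in range(3)]

# Five hypothetical patients, one column of 110 features per patient.
P = rng.normal(size=(110, 5))
a3 = forward(P, weights, biases)  # one predicted value per patient
```

Because the last layer also uses the log-sigmoid function, every entry of `a3` lies strictly between 0 and 1 and can be read as a probability-like score.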
[Figure 1. Neural Network Structure. Inputs p1, p2, …, pn feed the first layer (neurons 1 to 50), followed by the second layer (neurons 1 to 50) and the third layer (a single neuron whose output is the risk).]

Logistic Regression (LN). Logistic regression is a type of regression that is commonly used to model the relation of covariates with a categorical response variable. This model estimates the probability that an outcome belongs to a category (response variable, 0 or 1) based on the predictors $x_i$. In this work, the probability p represents the probability of death of a single patient. The model is described as follows:

$$\log\left(\frac{p}{1-p}\right)=\beta_{0}+\sum_{i=1}\beta_{i}x_{i}$$

(summarized on Table 2). The threshold to classify patients was set to 0.77 (>0.77 classified as death). The model is based on the data set imputed with the mean values of each feature. The data was divided into 80% training and 20% testing. The unofficial Event 1 score on dataset A is 0.46.

Table 2. Features used in logistic regression model

Features           Attributes used
Static Variables   Age, ICUType, MechVent
Mean               HR, Lactate, GCS, Urine, FiO2, HCO3, pH, Weight, PaO2, Na, NIMAP, MAP, DiasABP, K
Median             GCS, Temp, pH, Creatinine, PaCO2, Platelets, NIMAP, NIDiasABP, DiasABP
Kurtosis           DiasABP, FiO2, HCO3, Glucose, WBC, SysABP, Temp, GCS, NISysABP, MAP, K
Skewness           HCO3, HR, PaO2, Creatinine, WBC, Urine, Platelets, pH, GCS, Na, NIDiasABP, PaCO2
Minimum            BUN, GCS, pH, Mg, Temp, PaO2, HCT, Creatinine, Na, DiasABP, PaCO2, HR
Maximum            Temp, NISysABP, FiO2, HCO3, PaO2, Platelets, pH, NIDiasABP, DiasABP, MAP, PaCO2
Beta0              Creatinine, Mg, PaCO2, Platelets, MAP, HCT, Temp, K
Beta1              GCS, PaO2, HCO3, Platelets, NIDiasABP, Na, FiO2, NIMAP, Creatinine, Glucose, BUN, SysABP
Variance           Temp, PaCO2, NIDiasABP, HCO3, PaO2, Platelets, GCS, HR, Na
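To make the logistic model and the 0.77 classification threshold concrete, the sketch below inverts the logit to obtain p and then labels a patient as a death when p exceeds the threshold. The coefficients and the two-feature inputs are invented for illustration only and are not the fitted values of the paper; the function names `death_probability` and `classify` are likewise hypothetical.

```python
import numpy as np

def death_probability(X, beta0, beta):
    """Invert log(p / (1 - p)) = beta0 + X @ beta to recover p."""
    return 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))

def classify(p, threshold=0.77):
    """Label 1 (death) when p is strictly above the threshold, else 0."""
    return (p > threshold).astype(int)

# Made-up coefficients and two hypothetical patients (assumption).
beta0 = -1.0
beta = np.array([0.8, -0.5])
X = np.array([[4.0, 0.0],
              [0.0, 1.0]])

p = death_probability(X, beta0, beta)  # approximately [0.90, 0.18]
labels = classify(p)                   # first patient above 0.77, second below
```

A threshold above 0.5 makes the classifier more conservative about predicting death, so the trade-off between sensitivity and positive predictive value depends directly on this choice.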