
Introduction: Model validation

• Bootstrap method
• Predictive performance
• Use bootstrap and other methods for model validation
• Demonstrate association: evaluate the relationship between an outcome and covariates.
e.g., Association Between Helicopter vs Ground Emergency Medical Services and Survival for Adults With Major Trauma. JAMA. 2012;307(15):1602-1610.
• We are interested in the beta coefficients of the regression model, e.g.:
In the multivariable regression model, for patients transported to level I trauma centers, helicopter transport was associated with improved odds of survival compared with ground transport (odds ratio [OR], 1.16; 95% CI, 1.14-1.17; P < .001).
• Prediction and forecasting:
e.g., Risk Stratification for In-Hospital Mortality in Acutely Decompensated Heart Failure: Classification and Regression Tree Analysis. JAMA. 2005;293(5):572-580.
• Predictive score construction:
e.g., a score (H) is generally based on the results of a regression model: H = (β1 × covariate A) + (β2 × covariate B) + (β3 × covariate C), and so on, where β1, β2, and β3 denote the estimated beta coefficients for covariates A, B, and C, obtained by fitting the regression model for the outcome of interest.
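As a minimal sketch (the coefficient and covariate values below are hypothetical, not taken from the papers above), H is just the linear predictor of the fitted model:

```python
# Minimal sketch of a predictive score H = (b1 x A) + (b2 x B) + (b3 x C)
# using hypothetical fitted coefficients and covariate values.
import numpy as np

beta = np.array([0.8, -0.5, 1.2])   # hypothetical estimates b1, b2, b3
X = np.array([[1, 0, 1],            # covariates A, B, C for patient 1
              [0, 1, 1]])           # covariates A, B, C for patient 2

H = X @ beta                        # one score per patient
print(H)                            # [2.0, 0.7]
```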
• Model validation is applied to regression models built for prediction purposes.
Model validation in general has at least two parts:

1. Model selection:
to choose the best model based on model performance.
2. Model assessment:
to estimate the performance of the final chosen model.
• Here we study various methods for model assessment (how well does the model predict a future outcome?).

Internal validation of predictive models: Efficiency of some procedures for logistic regression analysis
E.W. Steyerberg et al., Journal of Clinical Epidemiology 54 (2001) 774–781
• Bootstrap: random sampling, with replacement, from an original dataset to obtain statistical inferences.
• Bootstrap theory says that the distance between the population mean and the sample mean is similar to the distance between the sample mean and the bootstrap "subsample" mean.
The bootstrap gives, e.g., a 95% CI for:
• Correlation coefficient
• CV = SD/mean
• AUC of ROC
• Median
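For instance, a minimal sketch of a bootstrap percentile 95% CI for the median (on simulated data; the same resampling loop works for a correlation coefficient, CV, or AUC):

```python
# Sketch: bootstrap percentile 95% CI for the median on simulated data;
# the same resampling loop works for a correlation, CV, or AUC.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=200)   # hypothetical skewed sample

boot_medians = np.array([
    np.median(rng.choice(x, size=x.size, replace=True))  # resample with replacement
    for _ in range(2000)                                 # 2000 bootstrap replicates
])
lo, hi = np.percentile(boot_medians, [2.5, 97.5])
print(f"median = {np.median(x):.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```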
• External validation: use training (derivation) data to build the model and test (validation) data to validate the model.
Example: old vs new patients, one dataset vs another.
• Internal validation: use the same dataset for model building and validation.
1. We use regression analysis to construct a predictive model that provides an estimate of patient outcome.
2. The apparent performance of the model on this training set will be better than its performance in another data set, even if the latter test set consists of patients from the same population (this difference is called optimism).
                                Population         Sample            Inference
Regular statistical procedure   Population mean    Sample mean       Standard error
Bootstrap                       Sample mean        Subsample mean    Bootstrap standard deviation
Model validation                Testing data       Training data     Optimism


1. Data: The GUSTO-I data give 30-day mortality in patients with acute myocardial infarction. This data set consists of 40,830 patients, of whom 2,851 (7.0%) had died at 30 days.
• Response (Y): 30-day mortality
• Predictors (X): age > 65 years, high risk (anterior infarct location or previous MI), diabetes, shock, hypotension (systolic blood pressure < 100 mmHg), tachycardia (pulse > 80), relief of chest pain > 1 hr, female gender.
• Produce training and test sets from the GUSTO-I data based on EPV (events per variable).
• Example: EPV = 5, 7% event rate (see the sketch below)
=> training data set: 5 × 8 = 40 deaths out of 571 patients
=> test data set: 2,811 deaths out of 40,259 patients
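A sketch of how such an EPV-based split could be generated (the outcome vector is simulated at a 7% event rate to stand in for GUSTO-I; variable names are ours):

```python
# Sketch: build a training set with EPV * (number of predictors) events,
# keeping the 7% event rate; everyone else goes to the test set.
import numpy as np

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.07, size=40830)     # stand-in outcome, ~7% events as in GUSTO-I

epv, n_predictors = 5, 8
n_events_train = epv * n_predictors       # 5 * 8 = 40 deaths in the training set
n_train = round(n_events_train / 0.07)    # ~571 patients at a 7% event rate

deaths = np.flatnonzero(y == 1)
survivors = np.flatnonzero(y == 0)
train_idx = np.concatenate([
    rng.choice(deaths, n_events_train, replace=False),
    rng.choice(survivors, n_train - n_events_train, replace=False),
])
test_idx = np.setdiff1d(np.arange(y.size), train_idx)
print(len(train_idx), y[train_idx].sum(), len(test_idx), y[test_idx].sum())
```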
Performance measures:
1. Concordance: the c statistic. For binary outcomes, c is identical to the area under the receiver operating characteristic (ROC) curve; c varies between 0.5 and 1.0 for sensible models (the higher the better).
2. The calibration slope is the regression coefficient b in a logistic model with the predictive score as the only covariate: logit(mortality) = a + b × predictive score. Well-calibrated models have a slope of 1, while models providing too extreme predictions have a slope less than 1.
3. The Brier score (or average prediction error) is calculated as Σ(y_i − p_i)² / n, where y_i denotes the observed outcome and p_i the prediction for subject i in the data set of n subjects.
4. D is a scaled version of the model chi-square, which is a function of the log-likelihood.
5. R² as a measure of explained variation.
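A minimal sketch of the first three measures on simulated data (scikit-learn's roc_auc_score gives the c statistic, and statsmodels fits the calibration-slope logistic model):

```python
# Sketch: c statistic, calibration slope, and Brier score for predicted
# probabilities p and observed binary outcomes y (simulated).
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
lp = rng.normal(size=500)                    # predictive score (linear predictor)
p = 1 / (1 + np.exp(-lp))                    # predicted probabilities
y = rng.binomial(1, p)                       # simulated binary outcomes

c_stat = roc_auc_score(y, p)                 # c = area under the ROC curve

# Calibration slope: b in logit(mortality) = a + b * predictive score
calib = sm.Logit(y, sm.add_constant(lp)).fit(disp=0)
slope = calib.params[1]                      # ~1 for a well-calibrated model

brier = np.mean((y - p) ** 2)                # average squared prediction error
print(f"c = {c_stat:.3f}, slope = {slope:.2f}, Brier = {brier:.3f}")
```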
5. A few methods to estimate model performance (Table 1):
1) Split sample: randomly split the training data into two parts: one to develop the model and another to measure its performance. The split is made once and at random.
2) Cross-validation: with split-half cross-validation, the model is developed on one randomly drawn half and tested on the other, and vice versa. The average is taken as the estimate of performance. Other fractions of subjects may be left out (e.g., 10% to test a model developed on the other 90% of the sample). This procedure is repeated 10 times, such that every subject has served once to test the model.
• To improve the stability of the cross-validation, the whole procedure can be repeated several times, taking new random subsamples. The most extreme cross-validation procedure is to leave one subject out at a time, which is equivalent to the jackknife technique. A minimal sketch of 10-fold cross-validation follows below.
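The sketch uses simulated data and scikit-learn's KFold and LogisticRegression, averaging the c statistic over the held-out folds:

```python
# Sketch: 10-fold cross-validation of a logistic model on simulated data,
# averaging the c statistic (AUC) over the 10 held-out folds.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 8))                       # 8 hypothetical predictors
y = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))     # simulated outcome

aucs = []
for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = LogisticRegression().fit(X[train], y[train])   # develop on 90%
    p = model.predict_proba(X[test])[:, 1]                 # predict the held-out 10%
    aucs.append(roc_auc_score(y[test], p))
print(f"cross-validated c statistic: {np.mean(aucs):.3f}")
```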
3) Bootstrapping replicates the process of sample generation from an underlying population by drawing samples, with replacement, from the original data set, of the same size as the original data set. Models may be developed in bootstrap samples and tested in the original sample.
a) Regular bootstrap: the model as estimated in the bootstrap sample is evaluated both in the bootstrap sample and in the original sample. The performance in the bootstrap sample represents an estimate of the apparent performance, and the performance in the original sample represents the test performance. The difference between these performances is an estimate of the optimism in the apparent performance.
• This difference is averaged over the bootstrap replications to obtain a stable estimate of the optimism, which is then used to compute the internally validated performance:
• optimism = average(bootstrap performance − test performance)
• estimated performance = apparent performance − optimism
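Putting the regular-bootstrap recipe together, a minimal sketch on simulated data (the c statistic is the performance measure here; 200 bootstrap replications is an arbitrary choice):

```python
# Sketch: regular-bootstrap optimism correction for the c statistic,
# estimated performance = apparent performance - average optimism.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 8))                       # simulated predictors
y = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))     # simulated outcome

fit = LogisticRegression().fit(X, y)
apparent = roc_auc_score(y, fit.predict_proba(X)[:, 1])    # apparent performance

optimism = []
for _ in range(200):                                # 200 bootstrap samples
    idx = rng.integers(0, len(y), len(y))           # draw with replacement
    bfit = LogisticRegression().fit(X[idx], y[idx]) # model from bootstrap sample
    boot = roc_auc_score(y[idx], bfit.predict_proba(X[idx])[:, 1])  # bootstrap perf.
    test = roc_auc_score(y, bfit.predict_proba(X)[:, 1])            # test on original
    optimism.append(boot - test)

print(f"estimated (optimism-corrected) c: {apparent - np.mean(optimism):.3f}")
```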
6. Performance: see Fig. 2 and Table 2.
7. Conclusion:
1) The split-sample approach tends to produce a larger difference between estimated performance and test performance, unless a very large sample is available.
2) However, with a large sample size (e.g., EPV > 40), optimism is small and the apparent estimates of model performance are attractive because of their stability.
3) Regular bootstrapping provides better estimates of the internal validity of logistic regression models constructed in smaller samples (e.g., EPV ≤ 10).
Summary
• What is the bootstrap?
• Model validation
• Internal vs external model validation
• Optimism in internal validation
• Using bootstrap and other methods to correct optimism
References
1. Efron B, Tibshirani RJ (1993). An Introduction to the Bootstrap. Chapman & Hall/CRC.
2. Steyerberg EW, et al. Internal validation of predictive models: Efficiency of some procedures for logistic regression analysis. Journal of Clinical Epidemiology 54 (2001) 774–781.
