
Journal of Clinical Epidemiology 61 (2008) 1085-1094

REVIEW ARTICLE

Validation, updating and impact of clinical prediction rules: A review


D.B. Toll, K.J.M. Janssen, Y. Vergouwe, K.G.M. Moons*
Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, The Netherlands
Accepted 14 April 2008

Abstract
Objective: To provide an overview of the research steps that need to follow the development of diagnostic or prognostic prediction rules. These steps include validity assessment, updating (if necessary), and impact assessment of clinical prediction rules.
Study Design and Setting: Narrative review covering methodological and empirical prediction studies from primary and secondary care.
Results: In general, three types of validation of previously developed prediction rules can be distinguished: temporal, geographical, and domain validation. In case of poor performance in a validation study, the validation data can be used to update or adjust the previously developed prediction rule to the new circumstances. These updating methods differ in extensiveness, the simplest being an adjustment of the model intercept to the outcome occurrence at hand. Prediction rules, with or without updating, that show good performance in (various) validation studies may subsequently be subjected to an impact study, to demonstrate whether they change physicians' decisions, improve clinically relevant process parameters or patient outcome, or reduce costs. Finally, whether a prediction rule is successfully implemented in clinical practice depends on several potential barriers to the use of the rule.
Conclusion: The development of a diagnostic or prognostic prediction rule is just a first step. We reviewed important aspects of the subsequent steps in prediction research. © 2008 Elsevier Inc. All rights reserved.
Keywords: Prediction rule; Validation; Generalizability; Updating; Impact analysis; Implementation

1. Introduction

Prediction rules or prediction models, often also referred to as decision rules or risk scores, combine multiple predictors, such as patient characteristics, test results, and other disease characteristics, to estimate the probability that a certain outcome is present (diagnosis) in an individual or will occur (prognosis). They intend to aid the physician in making medical decisions and in informing patients. Table 1 shows an example of a prediction rule.

In multivariable prediction research, the literature often distinguishes three phases: (1) development of the prediction rule; (2) external validation of the prediction rule (further referred to as "validation"), that is, testing the rule's accuracy and thus generalizability in data that were not used for the development of the rule, and subsequent updating if validity is disappointing; and (3) studying the clinical impact of a rule on physicians' behavior and patient outcome (Table 2) [1-5]. A fourth phase of prediction research may be the actual implementation in daily practice of prediction rules that have endured the first three phases [4]. A quick Medline search using a suggested search strategy [6] demonstrated that the number of scientific articles discussing prediction rules has more than doubled in the last decade: 6,744 published articles in 1995 compared with 15,662 in 2005. A striking fact is that this mainly includes papers concerning the development of prediction rules. A relatively small number regards the validation of rules, and there are hardly any publications showing whether an implemented rule has impact on physicians' behavior or patient outcome [3,4].

Lack of validation and impact studies is unfortunate, because accurate predictions in the patients that were used to develop a rule, commonly expressed as good calibration (agreement between predicted probabilities and observed outcome frequencies) and good discrimination (ability to distinguish between patients with and without the outcome), are no guarantee for good predictions in new patients, let alone for their use by physicians [1,3,4,7,8]. In fact, most prediction rules show a reduced accuracy when validated in new patients [1,3,4,7,8]. There may be two main reasons for this: (1) the rule was inadequately developed and (2) there were (major) differences between the derivation and validation population.

* Corresponding author. Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, P.O. Box 85500, 3508 GA Utrecht, The Netherlands. Tel.: +31-88-7559368; fax: +31-88-7555485. E-mail address: k.g.m.moons@umcutrecht.nl (K.G.M. Moons).

Many guidelines regarding the development of prediction rules have been published, covering, among other issues, the number of potential predictors in relation to the number of patients, methods for predictor selection, how to assign the weights per predictor, how to shrink the regression coefficients to prevent overfitting, and how to estimate the rule's potential for optimism using so-called internal validation techniques such as bootstrapping [1,2,7-14].

Compared with the literature on the development of prediction rules, the methodology for validation and for studying the impact of prediction rules is underappreciated [1,4,8]. This paper provides a short overview of the types of validation studies, of possible methods to improve or update a previously developed rule in case of disappointing accuracy in a validation study, and of important aspects of impact studies and the implementation of prediction rules. We focus on prediction rules developed by logistic regression analysis, but the issues largely apply to prediction rules developed by other methods, such as Cox proportional hazards analysis or neural networks. The methodology applies to both diagnostic and prognostic prediction rules and is illustrated with examples from diagnostic and prognostic research.

Table 1
Prognostic prediction rule for estimating the probability of neurological sequelae or death after bacterial meningitis [26]

Predictor                        Regression coefficient(a)   Score(b)
Male gender                      1.48                        2
Atypical convulsions             2.27                        3
Body temperature - 35°C(c)       -0.75                       -1
Pathogen
  Streptococcus pneumoniae       3.12                        4
  Neisseria meningitidis         1.48                        2
Intercept/constant               -5.2                        -

Sumscore     Probability
<2.5         0/25 (0%)
2.5-4.5      2/78 (2.6%)
5.0-5.5      6/33 (18.2%)
>5.5         15/34 (44.1%)
Overall      23/170 (13.5%)

The relative weights of the predictors are expressed as regression coefficients and as simplified scores.
(a) The probability for each patient can be calculated as log(Risk of outcome/(1 - Risk of outcome)) = -5.2 + 1.48*(Male gender) + 2.27*(Atypical convulsions) - 0.75*(Body temperature - 35°C) + 3.12*(S. pneumoniae) + 1.48*(N. meningitidis).
(b) The scores were derived by dividing the regression coefficients of the included predictors by the smallest regression coefficient and rounding to the nearest integer. For each patient, a sumscore can be calculated by adding the scores that correspond to the characteristics of the patient. The total sumscore is related to the individual probability as shown in the lower part of the table.
(c) For each degree Celsius body temperature above 35°C, 1 point is subtracted from the sumscore.
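To make the calculation in footnote (a) concrete, here is a minimal sketch in Python (our illustration, not part of the original publication [26]; the intercept of -5.2 is as reconstructed in Table 1). It computes the model-based probability and the simplified sumscore for one hypothetical patient:

```python
import math

# Coefficients of the meningitis rule (Table 1, footnote a); the
# intercept is the value reconstructed in Table 1 above.
INTERCEPT = -5.2
COEFS = {"male": 1.48, "atypical_convulsions": 2.27,
         "temp_above_35": -0.75,  # per degree Celsius above 35
         "s_pneumoniae": 3.12, "n_meningitidis": 1.48}

# Simplified scores: each coefficient divided by the smallest one
# (0.75) and rounded to the nearest integer (footnote b).
SCORES = {"male": 2, "atypical_convulsions": 3, "temp_above_35": -1,
          "s_pneumoniae": 4, "n_meningitidis": 2}

def probability(male, convulsions, body_temp, pneumo, meningo):
    """Probability of neurological sequelae or death (logistic model)."""
    lp = (INTERCEPT
          + COEFS["male"] * male
          + COEFS["atypical_convulsions"] * convulsions
          + COEFS["temp_above_35"] * (body_temp - 35.0)
          + COEFS["s_pneumoniae"] * pneumo
          + COEFS["n_meningitidis"] * meningo)
    return 1.0 / (1.0 + math.exp(-lp))

# Hypothetical boy with atypical convulsions, 39.0 degrees C,
# and pneumococcal meningitis.
p = probability(1, 1, 39.0, 1, 0)
sumscore = (SCORES["male"] + SCORES["atypical_convulsions"]
            + 4 * SCORES["temp_above_35"] + SCORES["s_pneumoniae"])
print(f"probability {p:.2f}, sumscore {sumscore}")
# probability 0.21, sumscore 5 (the 5.0-5.5 stratum: 18.2% in Table 1)
```

Note how the exact formula and the rounded sumscore give consistent but not identical answers, which is the precision/user-friendliness trade-off discussed in Section 5.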
2. Examples of disappointing accuracy of prediction rules

Even when internal validation techniques are applied to correct for overfitting and optimism, the accuracy of prediction rules can be substantially lower in new patients compared with the accuracy found in the patients of the development population. For example, the generalizability of an internally validated prediction rule for diagnosing a serious bacterial infection in children presenting with fever without apparent source was disappointing [15]. In the development study, the area under the receiver operating characteristic curve, after adjustment for optimism, was 0.76 (95% confidence interval: 0.66-0.86). However, when applied to new patients obtained from another hospital in a later period using the same inclusion and exclusion criteria, the area under the receiver operating characteristic curve dropped to 0.57 (95% confidence interval: 0.47-0.67). The authors concluded that this could partly have been caused by flaws in the development of the rule, notably too few patients in relation to the number of predictors tested, but also that internal validation and correction for optimism do not always prevent a decreased accuracy in future patients [15].

Another example of poor generalizability regards the European System for Cardiac Operative Risk Evaluation (EuroSCORE), a prognostic prediction rule that was developed in 128 centers in eight European states to predict 30-day mortality in patients who underwent cardiac surgery [16,17]. Validation studies showed good results in European, North American, and Japanese populations [16,18-23]. Yap et al. [24] tested the generalizability of the EuroSCORE in 8,331 cardiac surgery patients from six Australian institutions and found that predictions were poorly calibrated for Australian patients. According to the authors, the reasons for this finding are unclear and likely to be multifactorial, such as a different health care system, different indications for cardiac surgery, and a different prevalence of comorbid conditions in Australia compared with Europe. Also, the prediction rule could be "out of date" [24,25]. The rule was developed with data of patients who were operated on more than 10 years before the patients in the Australian validation study. The surgical procedure has indeed changed over time, potentially leading to different outcomes [24,25]. This change was not reflected in the other validation studies [16,18-23].

2.1. Common differences between a development and a validation population

As described, disappointing generalizability can be explained by differences between the development and validation population. We may largely identify three possible differences. First, the definitions of predictors and the outcome variable, and the measurement methods, may be different [1,2,8,11]. Prediction rules that contain unclearly defined predictors, or predictors whose measurement or interpretation is liable to subjectivity, are likely to show reduced predictive strength when applied to new patients.

Table 2
Consecutive phases in multivariable prediction research, diagnostic or prognostic

Phase 1. Development: Development of a multivariable (diagnostic or prognostic) prediction rule, including identification of important predictors, assigning the relative weights to each predictor, estimating the rule's predictive accuracy, estimating the rule's potential for optimism using so-called internal validation techniques, and, if necessary, adjusting the rule for overfitting.
Phase 2. Validation and updating: Testing the accuracy of the prediction rule in patients that were not included in the development study. Temporal, geographical, and domain validation can be distinguished. If necessary, the prediction rule can be updated by combining the information captured in the rule (development study) and the data of the new patients (validation study).
Phase 3. Impact: Determining whether a (validated) prediction rule is used by physicians, changes therapeutic decisions, improves clinically relevant process parameters, improves patient outcome, or reduces costs.
Phase 4. Implementation: Actual dissemination of the diagnostic or prognostic prediction rule in daily practice to guide physicians with their patient management.

For example, the diagnostic prediction rule of Table 1 contains the rather objective predictor gender, but also the presence of "atypical convulsions" [26]. The latter may be defined differently by physicians, which may compromise the generalizability of the rule. It is advised to determine the interobserver variability of potential predictors and to include only those predictors in the final prediction rule that show good reliability [2,5]. Improvements in measurement techniques for predictors may also affect the predictive strength of a predictor. For example, the Magnetic Resonance Imaging technique is developing rapidly over time, which results in improved image quality. Consequently, the diagnostic or prognostic information of Magnetic Resonance Imaging probably also improves over time and influences the accuracy of prediction rules that include Magnetic Resonance Imaging information.

Second, the group of patients used for the development of a prediction rule may be different from the group of patients used for validation. This is also called a difference in "case mix" [1,27]. For example, differences in indication for cardiac surgery and differences in comorbidity were considered among the causes of the poor calibration of the prognostic EuroSCORE in Australian patients. Both discrimination and calibration of a rule can be affected by differences in case mix. For example, a validation population may only include elderly patients (e.g., defined as age >65 years), whereas in the development population individuals' age ranged from 18 to 85 years. If age is a predictor in the rule, then discrimination between presence and absence of the outcome is more difficult in the more homogeneous validation population than in the more heterogeneous development population. Further, a validation population may, for example, contain relatively more males than the development population. If male gender increases the probability of the outcome but gender was not included in the rule (missed predictor), then the probabilities predicted by the rule will be underestimated in the validation population (reduced calibration) [27].

Third, validation studies commonly include fewer individuals than development studies. Accordingly, both populations may seem different, which is notably due to random variation [12,28]. The required size of a validation study depends on the hypotheses tested. For prediction rules that predict dichotomous outcomes, it has been suggested that the validation sample should contain at least 100 events and 100 nonevents to detect substantial changes in accuracy (for example, a 0.1 change in the c-statistic) with 80% power [29].
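For readers who want to check discrimination in their own validation sample, the sketch below (our illustration; the toy data are invented and deliberately far below the recommended sample size) computes the c-statistic, that is, the area under the receiver operating characteristic curve, and flags samples smaller than the suggested 100 events and 100 nonevents [29]:

```python
def c_statistic(probs, outcomes):
    """Proportion of (event, nonevent) pairs in which the event patient
    received the higher predicted probability (ties count as 0.5)."""
    events = [p for p, y in zip(probs, outcomes) if y == 1]
    nonevents = [p for p, y in zip(probs, outcomes) if y == 0]
    if len(events) < 100 or len(nonevents) < 100:
        # The suggested minimum to detect a 0.1 change with 80% power [29].
        print("warning: fewer than 100 events or 100 nonevents")
    concordant = sum(1.0 if pe > pn else 0.5 if pe == pn else 0.0
                     for pe in events for pn in nonevents)
    return concordant / (len(events) * len(nonevents))

# Toy validation set: predicted probabilities and observed 0/1 outcomes.
probs = [0.05, 0.10, 0.20, 0.40, 0.70, 0.90]
outcomes = [0, 0, 1, 0, 1, 1]
print(f"c-statistic: {c_statistic(probs, outcomes):.2f}")  # 0.89
```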
2.2. Types of validation studies

It has repeatedly been suggested that a validation study should consist of an adequate sample of "different but related patients" compared with the development study population [1,4,8]. Relatedness is at least defined as "patients suspected of the same disease" for a diagnostic rule, and as "patients at risk of the same event" for a prognostic rule.

In a hierarchy of increasingly stringent validation strategies, we largely distinguish temporal, geographical, and domain validation [1,3,4,8]. In general, the potential for differences between the development and validation population is smallest in a temporal validation study and largest in a domain validation study (Table 3). Consequently, confirmative results in a domain validation study are considered to provide the strongest evidence that the prediction rule can be generalized to new patients, whereas the generalizability of a prediction rule that has shown confirmative results in a temporal validation study may still be restricted.

When various published prediction rules for the same outcome are available from different databases, these rules can be mutually validated in these databases, which sometimes have even been prepared to undertake such a cross-validation. It is also possible to validate the available prediction rules in one comprehensive database including data from various time periods, geographical areas, and even clinical domains [30].

2.2.1. Temporal validation

Temporal validation tests the generalizability of a prediction rule "over time." In a temporal validation study, the prediction rule is typically tested by the same physicians or investigators as in the development study, in the same institution(s), and in similar patients, for example, using the same eligibility criteria, resulting in small variation in case mix (Table 3) [3,4,8]. Hence, this type of validation is usually successful for thoughtfully developed prediction rules.

Table 3
Potential differences between the development and validation population and their influence on generalizability of the prediction rule

Temporal validation: tests the generalizability of a prediction rule "over time" in similar patients as in the development study (same inclusion and exclusion criteria) in the same hospitals or institutions.
Geographical validation: tests the generalizability of a prediction rule in similar patients as in the development study (same inclusion and exclusion criteria) in hospitals or institutions of another geographical area.
Domain validation: tests the generalizability of a prediction rule across different domains, which may contain other patient (sub)groups.

Differences in interpretation of predictors and outcome
  Temporal (-): Often the same physicians as in the development study, who are thus experienced in obtaining the predictors and outcome; differences in interpretation of predictors are unlikely.
  Geographical (+): Other physicians than in the development study, who may define subjective predictors (and the outcome) differently; differences in interpretation of predictors may occur.
  Domain (+): Other physicians than in the development study, who may define subjective predictors (and the outcome) differently; differences in interpretation of predictors may occur.

Differences in measurements used for predictors and outcome
  Temporal (±): Same measurements used, unless measurements have been replaced (time related).
  Geographical (+): Other measurements may be used ("institution dependent"), plus potential for time-related changes in measurements.
  Domain (+): Other measurements may be used ("institution dependent"), plus potential for time-related changes in measurements.

Differences in case mix
  Temporal (-): Patient populations are similar; (random) variation due to a commonly small sample size of the validation population is possible.
  Geographical (±): (Subtle) differences in case mix possible, beyond possible (random) variation due to a commonly small sample size of the validation population.
  Domain (++): Differences in case mix are (very) likely, beyond possible (random) variation due to a commonly small sample size of the validation population.

-: Weak; ±: Possible; +: Probable; ++: Likely.

However, improvements in medical techniques may still affect the predictive accuracy of a prediction rule, as in the example of the EuroSCORE. Confirmative results in (multiple) temporal validation studies indicate that clinicians may cautiously use the prediction rule in their future patients who are similar to the development and validation population. But validation in varied study sites is still necessary before the rule can be implemented in other (geographical) locations or other patient domains (see below) [1,4,8].

2.2.2. Geographical validation

Geographical validation studies typically test the generalizability of a prediction rule in a patient population that is defined similarly to the development population (as is the case in temporal validation studies), though in hospitals or institutions of other geographical areas [1,8]. "Other geographical areas" can be within one country, across similar countries (e.g., western to western or nonwestern to nonwestern countries), and across nonsimilar countries (e.g., western to nonwestern or vice versa). Understandably, the less similar the development and validation locations are, the more potential there is for differences in "interpretation of predictors and outcome," "measurements used," and "case mix," and thus the more potential for disappointing generalizability (Table 3).

Physicians participating in a geographical validation study may be less experienced in using the prediction rule, which may influence the accuracy of the rule through predictors that are sensitive to subjectivity. Further, measurement of predictors and the outcome can be performed with different methods than in the development study, also potentially affecting the rule's accuracy.

Table 4
Updating methods for prediction rules

Method 0: No adjustment (the original prediction rule). Reason for updating: none.
Method 1: Adjustment of the intercept. Reason: difference in outcome incidence.
Method 2: Adjustment of the regression coefficients of the predictors (by one adjustment factor) and of the intercept. Reason: regression coefficients of the original rule are overfitted.
Method 3: Method 2 + extra adjustment of regression coefficients for predictors with a different strength in the validation population compared with the development population. Reason: as in method 2, and the strength (regression coefficient) of one or more predictors may be different in the validation population.
Method 4: Method 2 + stepwise selection of additional predictors. Reason: as in method 2, and one or more potential predictors were not included in the original rule.
Method 5: Re-estimation of all regression coefficients, using the data of the validation population. Reason: the strength of all predictors may be different in the validation population.
Method 6: Method 5 + stepwise selection of additional predictors. Reason: as in method 5, and one or more potential predictors were not included in the original rule.

For example, prediction rules to safely exclude the diagnosis deep vein thrombosis (DVT) contain items from patient history, physical examination, and the result of a D-dimer test [31,32]. Many different D-dimer assays are available, each with different diagnostic accuracy measures. Sensitivities of D-dimer tests for diagnosing DVT may vary between 48% and 100%, and specificities from 5% to 100% [33]. This variation in the accuracy of different D-dimer tests will obviously be reflected in the accuracy of the diagnostic rules that include D-dimer tests. Further, if different cutoff values are used to dichotomize a continuous variable into a positive vs. negative result, the predictive accuracy of the same variable may differ across studies. For instance, if an abnormal D-dimer concentration has been defined as higher than 500 ng/mL in the development sample and as higher than 1,000 ng/mL in the validation sample, the rule will likely perform differently in the validation sample.
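As a small illustration with hypothetical numbers (only the 500 and 1,000 ng/mL cutoffs come from the text; the patient value is invented), the same measured concentration can flip the dichotomized predictor, and hence the rule's output, between the two samples:

```python
def ddimer_positive(concentration_ng_ml, cutoff_ng_ml):
    """Dichotomize a continuous D-dimer concentration at a study cutoff."""
    return concentration_ng_ml > cutoff_ng_ml

# A patient with a D-dimer of 700 ng/mL is "abnormal" under the
# development cutoff (500) but "normal" under the validation cutoff
# (1,000), so any rule that includes the dichotomized test scores
# this patient differently in the two samples.
for cutoff in (500, 1000):
    print(cutoff, ddimer_positive(700, cutoff))
```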
Case-mix differences in a geographical validation study can be subtle. For example, a rule containing "antibiotic use in previous month" as a predictor to estimate the probability of 30-day hospitalization or death from lower respiratory tract infection in elderly patients [34] may show decreased accuracy in another country, not because the predictor was not well defined or liable to subjectivity, but because the indication for prescribing antibiotics, and thus the characteristics of antibiotic receivers, may vary between countries.

Geographical validation studies can be done prospectively, but also retrospectively using previously collected data that contain all predictors and the outcome(s) of interest.

2.2.3. Domain validation

Perhaps the broadest form of validation is to test the generalizability of a prediction rule across different domains, such as patients from a different setting (primary, secondary, or tertiary care), inpatients versus outpatients, patients of different age categories (e.g., adults vs. adolescents or children), of a different gender, and perhaps from a different type of hospital (academic vs. general hospital). Obviously, the case mix of a patient population of a new domain will differ from the development population (Table 3), which is usually reflected in differences in the distribution and ranges of the predictor values.

For example, the case mix of primary care and secondary care patients is often clearly different. Primary care physicians always selectively refer patients to specialists. These referred patients commonly have relatively more severe signs or symptoms or a relatively more advanced disease stage [8,35]. Consequently, secondary care patients commonly have a narrower range of the (more severe) predictor values than primary care patients. To some extent, a secondary care population can be considered as a subdomain of the primary care population. Hence, conversely, validating a prediction rule developed in secondary care in a more heterogeneous primary care population actually concerns the estimation of the rule's ability for extrapolation. Extrapolation of prediction rules developed in secondary care to primary care patients often results in a decreased accuracy [8,35]. For example, it has been shown that a prediction rule for safely excluding the diagnosis DVT developed in secondary care [32,36] showed disappointing accuracy in primary care: 0.9% of the secondary care patients had DVT although the diagnosis was ruled out according to the rule, whereas this proportion increased to 2.9% in primary care patients [37].

We note, however, that certain inclusion criteria in development studies may have been chosen for practical reasons only, and may not compromise the generalization of a prediction rule derived from such studies. For example, the Ottawa ankle rule for safely excluding fractures without additional X-ray testing [38] was developed on patients aged 18 years or older. One could question whether this rule, which does not include age as a predictor, is accurate when applied to children or adolescents. If the relative weights (odds ratios) of the predictors in the rule are independent of age, extrapolation of the rule to adolescents can be as successful as in the development population. A recent review concerning the extrapolation of the Ottawa ankle rule to children or adolescents indeed concluded that "a small" percentage (1.4%) of patients excluded from X-ray evaluation based on the Ottawa ankle rule will actually have a fracture [39], compared with 0% in the development study (where the score threshold was specifically chosen to achieve a 100% negative predictive value).

Like geographical validation, domain validation can be done prospectively, but also retrospectively using previously collected data.

3. Updating prediction rules

When a validation study shows disappointing results, researchers are often tempted to reject the rule and to directly develop a new rule with the data of the validation population only. However, although original prediction rules usually have been developed with large data sets, validation studies are frequently conducted with much smaller patient samples. The redeveloped rules are thus also based on smaller samples. Furthermore, this approach would lead to many prediction rules for the same outcome, obviously creating impractical situations, as physicians have to decide which rule to use. For example, there are over 60 published rules to predict outcome after breast cancer [40]. Moreover, if every new patient sample led to a new prediction rule, prior information that is captured in previous studies and prediction rules would be neglected. This runs counter to the principle that scientific inferences

should be based on data of as many patients as possible. This principle of using prior knowledge from previous studies has been recognized and used in etiologic and intervention research, for example, in the realm of (cumulative) meta-analyses.

A logical alternative to redeveloping prediction rules in each new patient sample is to update existing prediction rules with the data of the new patients in the validation study. As a result, updated rules combine the prior information that is captured in the original rules with the information of the new patients of the validation population [41-44]. Hence, updated rules are adjusted to the characteristics of the new patients, and will likely show improved generalizability.

Several updating methods have been proposed in the literature [41-44]. The methods vary in extensiveness, which is reflected by the number of parameters that is adjusted or re-estimated (Table 4). We briefly describe these methods here and refer to the literature for a more profound description [41-44].

In many situations, as described before, differences in outcome incidence are found between the development data and the validation data. For example, in a primary care setting one can validate a secondary care rule that predicts the presence or absence of DVT. Due to the higher prevalence of DVT in secondary care [35], the calibration of the rule in primary care patients may be poor as a result of systematically too high predicted probabilities. By adjusting only the intercept of the original prediction rule to the patients in the primary care setting, the poor calibration can be improved [43,44]. This method is by far the simplest updating method, as only one parameter of the original rule, that is, the intercept, is adjusted (Table 4, method 1).
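A minimal sketch of this intercept adjustment (Table 4, method 1), assuming the original rule provides a linear predictor lp (its intercept plus weighted predictor values) for each validation patient: only a single correction term alpha is estimated from the validation data by maximum likelihood, here with a few Newton-Raphson steps. Adding alpha to the original intercept shifts all predicted probabilities up or down without changing their ranking.

```python
import math

def update_intercept(lps, outcomes, iterations=25):
    """Method 1: re-estimate only the intercept, keeping the original
    coefficients fixed. lps are the linear predictors of the original
    rule in the validation patients; outcomes are 0/1. Returns the
    correction alpha to add to the original intercept."""
    alpha = 0.0
    for _ in range(iterations):
        # Newton-Raphson on the one-parameter likelihood with offset lp.
        probs = [1.0 / (1.0 + math.exp(-(alpha + lp))) for lp in lps]
        gradient = sum(y - p for y, p in zip(outcomes, probs))
        hessian = -sum(p * (1.0 - p) for p in probs)
        alpha -= gradient / hessian
    return alpha

# Toy example: the rule systematically overestimates risk in the new
# (e.g., primary care) population, so the correction comes out negative.
lps = [-1.0, -0.5, 0.0, 0.5, 1.0, -2.0, -1.5, 0.2]
outcomes = [0, 0, 0, 1, 1, 0, 0, 0]
print(f"intercept correction: {update_intercept(lps, outcomes):.2f}")
```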
Another updating method is called "logistic recalibration" and can be used when the regression coefficients (relative weights that represent the predictive strength) of the predictors in the prediction rule were overfitted in the development study [43,44]. This typically results when too many predictors were considered in a too small dataset [7,13]. In the lower range, the predicted probabilities in the new patients are then usually too low, whereas in the higher range they are too high. When overfitting was not adequately prevented or adjusted for during development of the rule, all regression coefficients can still be adjusted with a single correction factor that is easily estimated from the data of the new patients in the validation set (Table 4, method 2).
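In the same spirit, a sketch of logistic recalibration (Table 4, method 2), again assuming the validation patients' linear predictors lp are available: a calibration intercept alpha and a single slope beta are fitted to the outcomes, and multiplying every original regression coefficient by beta applies the single correction factor the text describes. With overfitted rules, beta typically comes out below 1 [43,44].

```python
import math

def logistic_recalibration(lps, outcomes, iterations=25):
    """Method 2: fit logit(p) = alpha + beta * lp in the validation data
    by Newton-Raphson on the two-parameter log-likelihood."""
    alpha, beta = 0.0, 1.0
    for _ in range(iterations):
        probs = [1.0 / (1.0 + math.exp(-(alpha + beta * lp))) for lp in lps]
        # Gradient of the log-likelihood with respect to (alpha, beta).
        g0 = sum(y - p for y, p in zip(outcomes, probs))
        g1 = sum((y - p) * lp for y, p, lp in zip(outcomes, probs, lps))
        # Observed information (negative Hessian), a 2x2 matrix.
        w = [p * (1.0 - p) for p in probs]
        h00 = sum(w)
        h01 = sum(wi * lp for wi, lp in zip(w, lps))
        h11 = sum(wi * lp * lp for wi, lp in zip(w, lps))
        det = h00 * h11 - h01 * h01
        # Newton step: (alpha, beta) += inverse(H) * gradient.
        alpha += (h11 * g0 - h01 * g1) / det
        beta += (h00 * g1 - h01 * g0) / det
    return alpha, beta

# Toy validation data: linear predictors of the original rule + outcomes.
lps = [-2.0, -1.2, -0.4, 0.1, 0.8, 1.5, 2.2, -0.9, 0.4, 1.1]
outcomes = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]
alpha, beta = logistic_recalibration(lps, outcomes)
print(f"calibration intercept: {alpha:.2f}, slope: {beta:.2f}")
```

Because beta multiplies the whole linear predictor, each original coefficient is effectively shrunk (or expanded) by the same factor, which is why this method repairs calibration but, as noted next, leaves discrimination unchanged.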
Although calibration can indeed be improved by these first two methods, discrimination (the area under the receiver operating characteristic curve) will remain unchanged, as the relative ranking of the predicted probabilities remains the same. To improve the discrimination of a rule in new patients, more rigorous adjustments need to be made to the prediction rule, also called model revisions. We briefly explain four revision methods that can improve both the discrimination and the calibration of a prediction rule when it is validated in new patients.
First, the strength of one or more predictors can be different in the validation population compared with the development population, whereas the relative sizes of the other regression coefficients to each other are correct. The regression coefficients that differ can be re-estimated from the validation data (Table 4, method 3). For example, we discussed a prediction rule with antibiotic use as a predictor (Section 2.2.2). When this rule is applied in a population or setting with a different antibiotic prescription strategy, the strength of this particular predictor may be different, whereas the strength of the other predictors in the rule does not necessarily change.

Also, when potential predictors that may have predictive value were not included in the original rule, one can test whether these have added predictive value in the validation data (Table 4, method 4). For example, when a prediction rule is validated over time and a new test has become available, the new test may have added predictive value in the rule.

Finally, when the previously described updating methods cannot improve the accuracy of the rule, and the strengths of all predictors are expected to be different in the new patients, the intercept and the regression coefficients of all predictors can be re-estimated with the data of the new patients (Table 4, method 5). If necessary, additional predictors can be considered as well (Table 4, method 6). These two methods are the most rigorous updating methods, as the intercept, the regression coefficients, and possibly also additional predictors are all re-estimated from the validation set. Both methods will probably be most applicable to domain validation, as these are typical situations in which the strength of predictors may differ between the two populations. Note that a disadvantage of these rigorous updating methods is that the rule is redeveloped on the data of the validation set only, so that the prior information in the original rule is neglected, as we discussed above.
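Finally, a sketch of the most rigorous updates (Table 4, methods 5 and 6), with the caveat from the text that such refits discard the prior information captured in the original rule. All coefficients are re-estimated from (here simulated) validation data by standard maximum likelihood; appending a column with a new candidate predictor turns method 5 into method 6.

```python
import numpy as np

def refit_rule(X, y, iterations=25):
    """Methods 5-6: re-estimate the intercept and all coefficients from
    the validation data by Newton-Raphson (IRLS) for logistic regression.
    X is an (n, k) array of predictor values; y an (n,) array of 0/1
    outcomes."""
    Xd = np.column_stack([np.ones(len(y)), X])  # prepend intercept column
    beta = np.zeros(Xd.shape[1])
    for _ in range(iterations):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))
        w = p * (1.0 - p)                        # IRLS weights
        hessian = Xd.T @ (Xd * w[:, None])
        gradient = Xd.T @ (y - p)
        beta += np.linalg.solve(hessian, gradient)
    return beta

# Simulated validation data, purely illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                   # three predictors
logit = -1.0 + X @ np.array([0.8, 0.0, 1.2])    # "true" association
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-logit))).astype(float)
print(np.round(refit_rule(X, y), 2))            # intercept + 3 coefficients
```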
With all of the methods described above, the updated rules are adjusted to the circumstances of the validation population. However, we recommend that updated prediction rules, just like newly developed rules, still be tested on their generalizability and impact before they are applied in daily practice. Note that all updating methods require data of the new patients. When these data are not available, but one knows the incidence of the outcome and the mean values of the predictors in the new population, a simple adjustment of the prediction rule is still possible [45,46].

4. Impact analysis

To ascertain whether a validated diagnostic or prognostic prediction rule will actually be used by physicians, will change or direct physicians' decisions, and will improve clinically relevant process parameters (such as number of bed days, length of hospital stay, or time to diagnosis) or patient outcomes, or reduce costs, an impact study or impact analysis should be performed [3,4]. In the ideal design of an

impact study, physicians or care units are randomized to either the index group, which is "exposed" to the use of the prediction rule, or to the control group using "care or clinical judgment as usual": a cluster randomized trial [3,47]. There have been some nice examples of cluster-randomized studies of the impact of prediction rules [48-50]. Randomization of patients instead of physicians, such that a physician randomly uses the prediction rule or applies "usual care," is not advised. Learning effects will lead to a reduced contrast between the two study groups, resulting in a diluted measured impact of the rule. Moreover, randomizing centers (requiring a multicenter study) instead of physicians within a single center may prevent the risk of contamination, that is, exchange of experiences and information by physicians between the two study groups, which also leads to reduced contrast and dilution of the rule's effect. Patients are followed up to determine the impact of the prediction rule on clinically relevant process parameters, patient outcome, and cost-effectiveness. Randomizing physicians to different diagnostic or prognostic strategies, e.g., with and without use of a diagnostic or prognostic prediction rule, can only lead to differences in patient outcome if it leads to actual differences in treatment decisions. Follow-up is not required for studying the influence of a prediction rule on decision-making behavior: a cross-sectional, randomized study then suffices. An alternative design to determine the impact of a prediction rule is a before-after study within the same physicians or care units [51], although temporal changes may compromise the validity of this design.

Although an impact analysis is the method par excellence to study the real effect of a prediction rule in practice, only a limited number of impact analyses have been performed [4]. One of these few studies regarded the impact of a prediction rule aimed at improving the effectiveness of treatment of patients presenting with community-acquired pneumonia to the emergency department, measured by health-related quality of life and the number of bed days per patient [49]. Hospitals were randomly assigned to implementation of the prediction rule or conventional care, using a computer that stratified for type of institution (teaching or community hospital) and average length of stay. Physicians were instructed to use the prediction rule as a guide only; the rule did not supersede clinical judgment. An educational plan was designed to reinforce compliance with the use of the prediction rule. The authors concluded that the prediction rule did not improve the health-related quality of life of the patients but reduced the number of bed days per patient managed. This effect would never have been revealed if no impact study had been conducted.

5. Implementation of prediction rules

The more often a rule has been proven to be accurate in diverse populations, the more likely it is that the prediction rule can be successfully applied in practice [1,4,8]. Yet, there are still reasons why a rule may not be as successful in daily practice.

First, physicians may feel that their often implicit estimation of a particular predicted probability is at least as good as the probability calculated with a prediction rule, and may therefore not use or follow the rule's predictions [3]. Or, the physicians' estimation of a probability may even have been proven to outperform the discrimination of a prediction rule. In a recent review, Sinuff et al. compared the discrimination of physicians' estimations with that of prediction rules in predicting the mortality of critically ill patients considered for intensive care unit (ICU) admission [52]. They concluded that ICU physicians discriminate between survivors and nonsurvivors in the first 24 h of ICU admission more accurately than prediction rules do. It may thus be important to compare physicians' predictions with those of prediction rules, preferably already during the development phase of a prediction rule (obviously requiring a prospective design), but certainly in validation and impact studies. If physicians' predictions of probabilities have been proven to outperform the probabilities provided by a rule, the rule will likely not be used in practice. In contrast, results in favor of the prediction rule can be used to convince physicians to use the rule once it is properly validated.

Second, prediction rules must have face validity; physicians must accept the logic, as well as the science, of the rule. Clinical prediction rules that do not have face validity may not be applied in practice, even when effective [53].

Third, prediction rules may not be used because they are not user-friendly [3,54]. User-friendliness should be taken into account when developing the rule. A variable should only be considered as a potential predictor if obtaining this variable is also feasible in the daily practice of the type of patients under study (not too time-consuming or costly). The user-friendliness of a prediction rule also depends on the way the rule is presented; the original regression formula (e.g., as in the legend of Table 1) is the most exact and accurate form, but may involve cumbersome calculations requiring a calculator or computer. Although sumscores, risk stratification charts, or nomograms may be less precise, they certainly are more user-friendly. With the introduction of electronic patient records, the use of regression formulas in daily practice will become much easier.

Finally, practical barriers may exist to acting on the results of the prediction rule. For instance, when using a diagnostic prediction rule that aims to determine whether subsequent testing is necessary, such as the Ottawa ankle rules, physicians may be concerned about protecting themselves against litigation. Hence, they may still refer their patients for additional testing, whereas at the same time the prediction rule indicates that referral is not necessary [3]. Brehaut et al. [55] conducted a postal survey among 399 randomly selected physicians to examine whether physicians used the Ottawa ankle rules [38] for diagnosing ankle

fractures. Most physicians (90%) reported using the Ottawa ankle rules always or most of the time in appropriate circumstances, whereas only 42% actually based their decisions to order radiography primarily on the rule [55]. The same authors assessed why some prediction rules become widely used and others do not [56], taking the Canadian Cervical Spine Rule [57] as an example. They showed that older physicians and part-time working physicians were less likely to be familiar with the rule. The best predictors of whether a rule would be used in practice were the familiarity acquired during training, the confidence in the usefulness of the rule, and the user-friendliness of the rule [56].

6. Final comments

We have given an overview of types of validation studies, of methods to improve or update a previously developed diagnostic or prognostic prediction rule in case of disappointing accuracy in a validation study, and of aspects of impact studies and the implementation of prediction rules. A validated, and if necessary updated, rule may cautiously be applied to new patients that are similar to the patients in the development and validation populations. However, when the user has reasons to believe that the rule may perform differently in the new patients, data of the new patients should first be collected to test the accuracy, and preferably the impact, of the rule before it is applied in daily clinical practice. Any rule may perform slightly differently in a new patient sample due to sampling variation alone. In that situation, the rule does not need to be updated.

The question remains: when has a rule been sufficiently validated and updated? So far, this particular methodological area of prediction research has not been explored. Future research should address the question of how many validation studies and what type of adjustments are needed before it is justified to implement a prediction rule into clinical practice. Further, the performance of prediction rules may worsen over time, as they can become "out of date" (as mentioned in Section 2). Especially in rapidly developing medical disciplines, physicians should first verify whether the current medical situation is still comparable to the situation during the development of the prediction rule, before applying the rule in their practice. Regular "postmarketing surveillance" to evaluate whether the accuracy of a prediction rule holds over time might be something to consider. Such repeated validations could indicate whether a rule should no longer be used, or should at least be updated to the new circumstances. The question of course arises how often such repeated validation studies should be undertaken, which is a topic for further research.

Another subject of prediction research that may need more focus in the future is the methodology for systematic reviews, and perhaps even meta-analyses, of prediction rules for the same outcome [58,59]. It will be a challenge to define how regression coefficients of prediction rules can be combined, and how to properly address publication bias, as prediction rules with good results are more likely to be published than rules with moderate results. Although the methodology for meta-analyses has been extensively described for etiologic and intervention studies, to our knowledge no research has been conducted on meta-analysis to combine several prediction rules.

Last, the potential gain in predictive accuracy and generalizability of a prediction rule developed on combined datasets with individual patient data from various studies on the same outcome (so-called individual patient data studies) is a research area that needs more attention [60].

Our purpose was to stress the importance of testing the generalizability and impact of prediction rules, and to outline the methods of such research. The relevance and importance of validating and testing the impact of prediction rules on physicians' behavior and patients' outcome has repeatedly been emphasized in the literature. Unfortunately, only a relatively small number of rules are validated, and hardly any study questions whether an implemented rule can change patient outcome. An increased focus on validation and impact studies will likely improve the application of valid prediction rules in daily clinical practice.

Acknowledgments

We gratefully acknowledge the support by The Netherlands Organization for Scientific Research (ZonMw 016.046.360; ZonMw 945-04-009).

References

[1] Altman DG, Royston P. What do we mean by validating a prognostic model? Stat Med 2000;19:453-73.
[2] Laupacis A, Sekar N, Stiell IG. Clinical prediction rules. A review and suggested modifications of methodological standards. JAMA 1997;277:488-94.
[3] McGinn TG, Guyatt GH, Wyer PC, Naylor CD, Stiell IG, Richardson WS. Users' guides to the medical literature: XXII: how to use articles about clinical decision rules. Evidence-Based Medicine Working Group. JAMA 2000;284:79-84.
[4] Reilly BM, Evans AT. Translating clinical research into clinical practice: impact of using prediction rules to make decisions. Ann Intern Med 2006;144:201-9.
[5] Wasson JH, Sox HC, Neff RK, Goldman L. Clinical prediction rules. Applications and methodological standards. N Engl J Med 1985;313:793-9.
[6] Ingui BJ, Rogers MA. Searching for clinical prediction rules in MEDLINE. J Am Med Inform Assoc 2001;8:391-7.
[7] Harrell FE Jr, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med 1996;15:361-87.
[8] Justice AC, Covinsky KE, Berlin JA. Assessing the generalizability of prognostic information. Ann Intern Med 1999;130:515-24.
[9] Copas JB. Regression, prediction and shrinkage. J R Stat Soc B 1983;45:311-54.
[10] Efron B, Gong G. A leisurely look at the bootstrap, the jackknife, and cross-validation. Am Stat 1983;37:36-48.
[11] Harrell FE Jr. Regression modelling strategies with applications to linear models, logistic regression, and survival analysis. New York: Springer; 2001.

[12] Steyerberg EW, Bleeker SE, Moll HA, Grobbee DE, Moons KG. Internal and external validation of predictive models: a simulation study of bias and precision in small samples. J Clin Epidemiol 2003;56:441-7.
[13] Steyerberg EW, Eijkemans MJ, Harrell FE Jr, Habbema JD. Prognostic modelling with logistic regression analysis: a comparison of selection and estimation methods in small data sets. Stat Med 2000;19:1059-79.
[14] Steyerberg EW, Harrell FE Jr, Borsboom GJ, Eijkemans MJ, Vergouwe Y, Habbema JD. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol 2001;54:774-81.
[15] Bleeker SE, Moll HA, Steyerberg EW, Donders AR, Derksen-Lubsen G, Grobbee DE, et al. External validation is necessary in prediction research: a clinical example. J Clin Epidemiol 2003;56:826-32.
[16] Nashef SA, Roques F, Michel P, Gauducheau E, Lemeshow S, Salamon R. European system for cardiac operative risk evaluation (EuroSCORE). Eur J Cardiothorac Surg 1999;16:9-13.
[17] Roques F, Nashef SA, Michel P, Gauducheau E, de Vincentiis C, Baudet E, et al. Risk factors and outcome in European cardiac surgery: analysis of the EuroSCORE multinational database of 19030 patients. Eur J Cardiothorac Surg 1999;15:816-22; discussion 822-3.
[18] Geissler HJ, Holzl P, Marohl S, Kuhn-Regnier F, Mehlhorn U, Sudkamp M, et al. Risk stratification in heart surgery: comparison of six score systems. Eur J Cardiothorac Surg 2000;17:400-6.
[19] Gogbashian A, Sedrakyan A, Treasure T. EuroSCORE: a systematic review of international performance. Eur J Cardiothorac Surg 2004;25:695-700.
[20] Kawachi Y, Nakashima A, Toshima Y, Arinaga K, Kawano H. Risk stratification analysis of operative mortality in heart and thoracic aorta surgery: comparison between Parsonnet and EuroSCORE additive model. Eur J Cardiothorac Surg 2001;20:961-6.
[21] Michel P, Roques F, Nashef SA. Logistic or additive EuroSCORE for high-risk patients? Eur J Cardiothorac Surg 2003;23:684-7; discussion 687.
[22] Nashef SA, Roques F, Hammill BG, Peterson ED, Michel P, Grover FL, et al. Validation of European System for Cardiac Operative Risk Evaluation (EuroSCORE) in North American cardiac surgery. Eur J Cardiothorac Surg 2002;22:101-5.
[23] Nilsson J, Algotsson L, Hoglund P, Luhrs C, Brandt J. Early mortality in coronary bypass surgery: the EuroSCORE versus The Society of Thoracic Surgeons risk algorithm. Ann Thorac Surg 2004;77:1235-9; discussion 1239-40.
[24] Yap CH, Reid C, Yii M, Rowland MA, Mohajeri M, Skillington PD, et al. Validation of the EuroSCORE model in Australia. Eur J Cardiothorac Surg 2006;29:441-6; discussion 446.
[25] Nashef SA. Editorial comment: EuroSCORE and the Japanese aorta. Eur J Cardiothorac Surg 2006;30:582-3.
[26] Oostenbrink R, Moons KG, Derksen-Lubsen G, Grobbee DE, Moll HA. Early prediction of neurological sequelae or death after bacterial meningitis. Acta Paediatr 2002;91:391-8.
[27] Vergouwe Y, Steyerberg EW, Eijkemans MJ, Habbema JD. Validity of prognostic models: when is a model clinically useful? Semin Urol Oncol 2002;20:96-107.
[28] Peek N, Arts DG, Bosman RJ, van der Voort PH, de Keizer NF. External validation of prognostic models for critically ill patients required substantial sample sizes. J Clin Epidemiol 2007;60:491-501.
[29] Vergouwe Y, Steyerberg EW, Eijkemans MJ, Habbema JD. Substantial effective sample sizes were required for external validation studies of predictive logistic regression models. J Clin Epidemiol 2005;58:475-83.
[30] Starmans R, Muris JW, Fijten GH, Schouten HJ, Pop P, Knottnerus JA. The diagnostic value of scoring models for organic and non-organic gastrointestinal disease, including the irritable-bowel syndrome. Med Decis Making 1994;14:208-16.
[31] Oudega R, Moons KG, Hoes AW. Ruling out deep venous thrombosis in primary care. A simple diagnostic algorithm including D-dimer testing. Thromb Haemost 2005;94:200-5.
[32] Wells PS, Anderson DR, Rodger M, Forgie M, Kearon C, Dreyer J, et al. Evaluation of D-dimer in the diagnosis of suspected deep-vein thrombosis. N Engl J Med 2003;349:1227-35.
[33] Di Nisio M, Squizzato A, Rutjes AW, Buller HR, Zwinderman AH, Bossuyt PM. Diagnostic accuracy of D-dimer test for exclusion of venous thromboembolism: a systematic review. J Thromb Haemost 2007;5:296-304.
[34] Bont J, Hak E, Hoes AW, Schipper M, Schellevis FG, Verheij TJ. A prediction rule for elderly primary-care patients with lower respiratory tract infections. Eur Respir J 2007;29:969-75.
[35] Knottnerus JA. Between iatrotropic stimulus and interiatric referral: the domain of primary care research. J Clin Epidemiol 2002;55:1201-6.
[36] Wells PS, Hirsh J, Anderson DR, Lensing AW, Foster G, Kearon C, et al. Accuracy of clinical assessment of deep-vein thrombosis. Lancet 1995;345(8961):1326-30.
[37] Oudega R, Hoes AW, Moons KG. The Wells rule does not adequately rule out deep venous thrombosis in primary care patients. Ann Intern Med 2005;143:100-7.
[38] Stiell IG, Greenberg GH, McKnight RD, Nair RC, McDowell I, Worthington JR. A study to develop clinical decision rules for the use of radiography in acute ankle injuries. Ann Emerg Med 1992;21:384-90.
[39] Myers A, Canty K, Nelson T. Are the Ottawa ankle rules helpful in ruling out the need for x ray examination in children? Arch Dis Child 2005;90:1309-11.
[40] Altman DG. Prognostic models: a methodological framework and review of models for breast cancer. In: Lyman GH, Burstein HJ, editors. Breast cancer: translational therapeutic strategies. New York: Informa Healthcare; 2007.
[41] Hosmer DW, Lemeshow S. Applied logistic regression. New York: John Wiley and Sons, Inc.; 1989.
[42] Ivanov J, Tu JV, Naylor CD. Ready-made, recalibrated, or remodeled? Issues in the use of risk indexes for assessing mortality after coronary artery bypass graft surgery. Circulation 1999;99:2098-104.
[43] Steyerberg EW, Borsboom GJ, van Houwelingen HC, Eijkemans MJ, Habbema JD. Validation and updating of predictive logistic regression models: a study on sample size and shrinkage. Stat Med 2004;23:2567-86.
[44] Janssen KJ, Moons KG, Kalkman CJ, Grobbee DE, Vergouwe Y. Updating methods improved the performance of a clinical prediction model in new patients. J Clin Epidemiol 2008;61:76-86.
[45] D'Agostino RB Sr, Grundy S, Sullivan LM, Wilson P. Validation of the Framingham coronary heart disease prediction scores: results of a multiple ethnic groups investigation. JAMA 2001;286:180-7.
[46] Liu J, Hong Y, D'Agostino RB Sr, Wu Z, Wang W, Sun J, et al. Predictive value for the Chinese population of the Framingham CHD risk assessment tool compared with the Chinese Multi-Provincial Cohort Study. JAMA 2004;291:2591-9.
[47] Campbell MK, Elbourne DR, Altman DG. CONSORT statement: extension to cluster randomised trials. BMJ 2004;328(7441):702-8.
[48] Foy R, Penney GC, Grimshaw JM, Ramsay CR, Walker AE, MacLennan G, et al. A randomised controlled trial of a tailored multifaceted strategy to promote implementation of a clinical guideline on induced abortion care. BJOG 2004;111:726-33.
[49] Marrie TJ, Lau CY, Wheeler SL, Wong CJ, Vandervoort MK, Feagan BG. A controlled trial of a critical pathway for treatment of community-acquired pneumonia. CAPITAL Study Investigators. Community-Acquired Pneumonia Intervention Trial Assessing Levofloxacin. JAMA 2000;283:749-55.
[50] Meyer G, Kopke S, Bender R, Muhlhauser I. Predicting the risk of falling: efficacy of a risk assessment tool compared to nurses' judgement: a cluster-randomised controlled trial [ISRCTN37794278]. BMC Geriatr 2005;5:14.

[51] Stiell I, Wells G, Laupacis A, Brison R, Verbeek R, Vandemheen K, et al. Multicentre trial to introduce the Ottawa ankle rules for use of radiography in acute ankle injuries. Multicentre Ankle Rule Study Group. BMJ 1995;311(7005):594-7.
[52] Sinuff T, Adhikari NK, Cook DJ, Schunemann HJ, Griffith LE, Rocker G, et al. Mortality predictions in the intensive care unit: comparing physicians with scoring systems. Crit Care Med 2006;34:878-85.
[53] Blackmore CC. Clinical prediction rules in trauma imaging: who, how, and why? Radiology 2005;235:371-4.
[54] Braitman LE, Davidoff F. Predicting clinical states in individual patients. Ann Intern Med 1996;125:406-12.
[55] Brehaut JC, Stiell IG, Visentin L, Graham ID. Clinical decision rules "in the real world": how a widely disseminated rule is used in everyday practice. Acad Emerg Med 2005;12:948-56.
[56] Brehaut JC, Stiell IG, Graham ID. Will a new clinical decision rule be widely used? The case of the Canadian C-spine rule. Acad Emerg Med 2006;13:413-20.
[57] Stiell IG, Wells GA, Vandemheen KL, Clement CM, Lesiuk H, De Maio VJ, et al. The Canadian C-spine rule for radiography in alert and stable trauma patients. JAMA 2001;286:1841-8.
[58] Altman DG. Systematic reviews of evaluations of prognostic variables. BMJ 2001;323(7306):224-8.
[59] Riley RD, Abrams KR, Sutton AJ, Lambert PC, Jones DR, Heney D, et al. Reporting of prognostic markers: current problems and development of guidelines for evidence-based practice in the future. Br J Cancer 2003;88:1191-8.
[60] Khan KS, Bachmann LM, ter Riet G. Systematic reviews with individual patient data meta-analysis to evaluate diagnostic tests. Eur J Obstet Gynecol Reprod Biol 2003;108:121-5.
