PERSPECTIVES
Diagnosing Model Diagnostics
MO Karlsson^1 and RM Savic^1
Conclusions from clinical trial results that are derived from model-based analyses rely on the model adequately describing the underlying system. The traditionally used diagnostics intended to provide information about model adequacy have seldom-discussed shortcomings. Without an understanding of the properties of these diagnostics, the development and use of new diagnostics, and additional information pertaining to the diagnostics, there is a risk that adequate models will be rejected and inadequate models accepted. Thus, a diagnosis of the available diagnostics is desirable.
Increasingly, results from clinical trials are reported and interpreted in the form of model-based analyses. In particular, nonlinear mixed-effects or "population" analysis is used for this purpose. Readers of publications with model-based analyses should be able to expect a diagnosis of the performance of the final model(s) in describing data. There are numerous ways of doing this, most of them involving graphics. In this commentary the characteristics—in particular the weaknesses—of the most common diagnostics are discussed. Solutions for providing more informative diagnostics are also suggested.
Typical individual prediction-based diagnostics
This often-used diagnostic is appealing in its simplicity and in that each individual's data are not involved in making the prediction, except as being part of the data defining the population parameters. The most common manner of displaying this diagnostic is as a plot of observations versus population predictions (the latter often denoted "PRED"). A line of identity, and sometimes also a regression line, is included to illustrate how well the observations and predictions agree. This diagnostic may give a useful impression of the extent of variability in the data
^1Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, Sweden. Correspondence:
MO Karlsson (mats.karlsson@farmbio.uu.se)
doi:10.1038/sj.clpt.6100241
that is explained by the structural and covariate components of the model, but as a diagnostic for model adequacy it has fundamental flaws. One of these is that there is no expected pattern to this plot. Figure 1 shows examples of plots for which the "observations" in each case are simulated from the same models and parameter values as those used to generate the predictions. Thus, in each case the plot has the pattern associated with the correct model. Clearly, the expected pattern is situation-dependent, and it will vary with both model and study design. The magnitude of spread around the line of identity will, in addition to model misspecification, depend on the magnitude of unexplained residual variability, unexplained parameter variability, dose range, censoring (such as omission of data below the limit of quantification), and dose adaptation (e.g., titration to suitable response). If one expects an even spread of data around the line of identity, all the (correct) models of Figure 1 are likely to be rejected. A solution, when it is possible to appropriately simulate from the final model and study design (see below for a discussion on simulation), is to create a reference plot that shows the expected pattern for a particular model and study design.^{1,2} This is done by simulating from the final model and then creating the same plot as was created from the observed data, but now using the simulated data and the prediction based on the parameters used in the simulation. If the patterns in the plots for the observed and the simulated data are similar, no model misspecification is evident from this diagnostic. However, as discussed below, simulations are not always possible to perform.

When a regression line is included to illustrate agreement between observations and predictions, it usually does not take into account the heteroscedasticity in the error structure; nor does it take into account that the data come from separate individuals. The latter is generally referred to as naive-pooling analysis, which is known to have poor properties, for example, when data are unbalanced. The regression line is often included with the unmentioned assumption that an adequate model would result in a line superimposed on the line of identity. However, with a nonlinear mixed-effects model we should not expect this. Several factors, in addition to model misspecifications, would make the mean of the observations different from the typical individual predictions. These include censoring and dose adaptation, but the most important factor is that the unexplained parameter variability will enter nonlinearly into the model and produce individual predictions that generally will be expected to have a mean different from the typical individual prediction. The solution is, as above, to obtain the expected regression line through simulations.

CLINICAL PHARMACOLOGY & THERAPEUTICS | VOLUME 82 NUMBER 1 | JULY 2007
Figure 1 Observations versus population predictions when observations are simulated with the same model as is used to calculate population predictions. The black line is a line of identity, the light red line is a linear regression line, and the light blue line, when present, is a loess smooth. Emax, maximum drug effect; EC50, concentration of drug producing 50% of Emax. Details on all simulations are provided in Supplementary Tables 1–3.
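The reference-plot idea described above can be sketched in a few lines: simulate new "observations" from the final model under the same design, then build the same observations-versus-PRED plot from the simulated data. The snippet below is a minimal illustration only; the Emax model and all parameter values are hypothetical, not taken from the article, and `simulate_trial` is a name of our own.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_trial(n_sub=100, doses=(1.0, 3.0, 10.0, 30.0, 100.0)):
    """Simulate an Emax dose-response trial with log-normal between-subject
    variability on EC50 and additive residual error (hypothetical values).
    Returns observations and the typical-individual predictions (PRED)."""
    emax, ec50_pop, omega, sigma = 100.0, 10.0, 0.5, 5.0
    ec50_i = ec50_pop * np.exp(rng.normal(0.0, omega, n_sub))  # individual EC50s
    dose = np.tile(doses, n_sub)
    ec50 = np.repeat(ec50_i, len(doses))
    obs = emax * dose / (ec50 + dose) + rng.normal(0.0, sigma, dose.size)
    pred = emax * dose / (ec50_pop + dose)  # PRED ignores the random effects
    return obs, pred

obs, pred = simulate_trial()          # stands in for the "observed" data
obs_ref, pred_ref = simulate_trial()  # reference set simulated from the same model
# Plotting obs vs pred next to obs_ref vs pred_ref shows the pattern the
# correct model is *expected* to produce under this design.
```

Comparing the two panels, rather than comparing the observed-data panel against an even spread around the line of identity, avoids rejecting a correct model for the design-driven patterns discussed above.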
Diagnostics based on individual parameter estimates
Predictions based on individual parameter estimates resolve some of the problems associated with typical individual-prediction plots. With plots of observations versus predictions based on individual parameter estimates (often denoted "IPRED"), unexplained parameter variability is not confounding the interpretation. However, if this diagnostic is to be informative on model misspecification, the individual data need to be sufficiently informative on the parameters that are estimated in the individual fit. When individual data are sparse in information about one or more parameters, an overfit will occur and even a misspecified model may provide excellent agreement between observations and predictions, because IPRED will shrink toward the actual observation (the "perfect fit" phenomenon). There is a measure, ε-shrinkage, that can be used in identifying and quantifying whether an overfit is taking place.^{3} If no overfit occurs, the distribution of individual weighted residuals (IWRES = (observation – IPRED)/σ, where σ is the error magnitude given by the residual error model) should have a standard deviation of one. ε-shrinkage is calculated as 1 – SD(IWRES) and will thus increase from zero toward one as data become less informative. Figure 2 provides observations versus IPRED plots, at varying degrees of information in data, of a model in which one structural component was misspecified as compared with the simulation model.
Clearly, as data become less informative, so does this diagnostic. Providing information on ε-shrinkage would allow the reader to assess the relevance of the graph. If ε-shrinkage is high, the individual predictions are of no value for evaluating model adequacy and ought to be omitted from any presentation. Already 20–30% ε-shrinkage is sufficiently high in the examples provided in Figure 2 to render this diagnostic essentially without value.

For nonlinear mixed-effects models, individual parameter estimates are regularly obtained as empirical Bayes estimates (EBE; sometimes referred to as POSTHOC parameters). In addition to their use for calculating IPRED, these are often used as diagnostics in their own right, to show, for example, that covariate models are appropriate. However, EBEs of the interindividual random effects, ηs, are biased (shrunk) toward the population mean, 0, whenever the individual data are not rich in information about the parameter. η-shrinkage, estimated as 1 – SD(EBEs)/ω, where ω is the population model estimate of the SD in η, can be used to inform about the relevance of graphs employing EBEs.^{3} As the value of η-shrinkage increases from zero toward one, the value of EBEs as a diagnostic decreases.
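Both shrinkage measures follow directly from their definitions above. A minimal sketch (the function names are ours, and SD is taken with the n denominator):

```python
import numpy as np

def epsilon_shrinkage(obs, ipred, sigma):
    """eps-shrinkage = 1 - SD(IWRES), with IWRES = (obs - IPRED) / sigma."""
    iwres = (np.asarray(obs) - np.asarray(ipred)) / sigma
    return 1.0 - np.std(iwres)

def eta_shrinkage(ebes, omega):
    """eta-shrinkage = 1 - SD(EBEs) / omega, where omega is the
    population-model estimate of the SD of eta."""
    return 1.0 - np.std(np.asarray(ebes)) / omega

# Informative data: SD(IWRES) is close to 1, so eps-shrinkage is near 0.
# Uninformative data: all EBEs shrink toward 0, so eta-shrinkage approaches 1.
```

In practice, reporting these two numbers alongside IPRED- and EBE-based plots lets the reader judge whether those plots carry any diagnostic value.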
Residual-type diagnostics
From the above it should be clear that the usefulness of residuals based on population predictions (RES = observations – PRED) or individual predictions (IWRES) for identifying model misspecification is limited. RES may show trends even when the model is adequate, and IWRES may lack trends even in the presence of model misspecification. A residual commonly used as a diagnostic for model misspecification is the weighted residual (WRES) provided as standard output by programs such as NONMEM (GloboMax, Hanover, MD). WRES does not suffer from the shortcomings of RES or IWRES, but it has another one. It is based on the same first-order (FO) approximation as the FO method—the first estimation method for nonlinear mixed-effects models.^{4} The FO approximation is sometimes too crude and can then lead to WRES indicating model misspecification when there is none. This is illustrated in Figure 3, in which plots of WRES versus the independent variable are provided for situations where the model is correct but the diagnostic indicates otherwise. Recently, the conditional weighted residual (CWRES) based on the first-order conditional estimation (FOCE) method was suggested as a more appropriate diagnostic.^{5} Just as the FOCE method is generally preferred over the FO method, it seems appropriate to prefer CWRES over WRES, although the Michaelis–Menten example in Figure 3 shows that the FOCE approximation used in CWRES sometimes may show limitations too. CWRES can be calculated from analyses regardless
of whether FO or FOCE has been used. An alternative solution to make WRES diagnostics more informative is to create a reference pattern through simulations, as discussed above.
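The common idea behind WRES-type residuals is to decorrelate the residual vector by the model-implied covariance of the observations; for WRES that covariance comes from the FO linearization, and for CWRES from the FOCE linearization. A minimal sketch of the decorrelation step only (our own illustration, not NONMEM's actual implementation):

```python
import numpy as np

def decorrelated_residuals(y, expected, cov):
    """Return L^{-1} (y - E[y]), where cov = L L^T (Cholesky factor).
    If cov is the model-implied covariance of y, the result should be
    approximately N(0, I) when the model is correct."""
    resid = np.asarray(y, float) - np.asarray(expected, float)
    L = np.linalg.cholesky(np.asarray(cov, float))
    return np.linalg.solve(L, resid)

# With an identity covariance, nothing is rescaled:
r = decorrelated_residuals([1.0, 2.0], [0.0, 0.0], np.eye(2))  # -> [1.0, 2.0]
```

Substituting the FO-based expectation and covariance yields WRES-like residuals; substituting the FOCE-based quantities yields CWRES-like residuals, which is why the quality of either diagnostic tracks the quality of the underlying approximation.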
Simulationbased diagnostics
Simulations from the model and the underlying design of the observed data are increasingly used to illustrate model properties. Already a simulation of a single data set for comparison with the real data can be useful as a diagnostic and reveal model misspecification patterns that are not easily diagnosed by other methods.^{1,6} More commonly, however, multiple simulations are made from the model and reference distributions created for features of the observed data. Such diagnostics have become known as "predictive checks." Some predictive checks focus on secondary statistics (e.g., area under the curve, time above a minimum inhibitory concentration, or the number of responders) that can be derived from both the raw data and the simulated data. A drawback is that relevant statistics cannot always be created directly from the raw data, especially if these are sparse. The visual predictive check (VPC) is based on a graphical comparison between the observed data and prediction intervals derived from the simulated data.^{7} A related statistic is the numerical predictive check (NPC), which calculates the fractions of observations outside a certain prediction interval and compares these with the expected values. A recent development in simulation-based diagnostics is the normalized prediction distribution error (NPDE), in which a reference distribution is created for each observation and correlations in residuals within a subject are taken into account.^{8} Although NPDEs have been used only for evaluation on external data, they are likely to be useful also for model evaluation on the data used in the estimation (internal evaluation).

A general drawback of simulation-based diagnostics is that it is not always feasible to generate simulations correctly. Simulation of data requires knowledge of the factors responsible for the realized design. For observational data (as in therapeutic drug monitoring) this is seldom the case. Even for experimentally
designed studies, features such as censored data (dropout, data below the limit of quantification), missing data, adaptive designs, allowance for subjective choices or behavior (dosing times, dose changes), nonadherence, and protocol violations can all cause the results of a simulation to be misleading. The solution in such cases is to develop additional models for the features in question and the relationships, if any, between the parameters of these models and the parameters of the original "primary" model. However, if this additional modeling is at all possible, it will often result in a substantial increase in modeling workload.

Another problem is that simulation-based diagnostics are most intuitive and
informative when heterogeneity in design and model is low. When doses, dosing times, observation times, and/or covariate values vary between subjects, diagnosis becomes less straightforward. Most susceptible to such problems is the VPC. A solution is to stratify simulation-based diagnostics by the important variables. However, with sufficient heterogeneity numerous strata may be required, and for each stratum, diagnostics may become uninformative as the number of graphs increases and the amount of data per graph diminishes.
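The NPC calculation described above reduces to counting observations outside a prediction interval built from the simulated replicates. A sketch under the assumption that the replicates are arranged as an (n_sim × n_obs) matrix; `npc_outside_fraction` is a name of our own:

```python
import numpy as np

def npc_outside_fraction(observed, simulated, pi=0.80):
    """Numerical predictive check: fraction of observations falling outside
    the central `pi` prediction interval computed per observation from
    `simulated` (shape n_sim x n_obs). Expected fraction: 1 - pi."""
    lo = np.quantile(simulated, (1.0 - pi) / 2.0, axis=0)
    hi = np.quantile(simulated, 1.0 - (1.0 - pi) / 2.0, axis=0)
    observed = np.asarray(observed)
    return np.mean((observed < lo) | (observed > hi))

rng = np.random.default_rng(0)
sims = rng.normal(0.0, 1.0, (2000, 500))  # 2000 simulated replicates
obs = rng.normal(0.0, 1.0, 500)           # data consistent with the model
frac = npc_outside_fraction(obs, sims, pi=0.80)  # close to 0.20
```

A fraction clearly above 1 − pi suggests the model's prediction intervals are too narrow (or the model is otherwise misspecified); a fraction clearly below suggests they are too wide.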
Numerical diagnostics
Several types of diagnostics are not usually used graphically. Such numerical diagnostics are of importance for comparisons between models (e.g., the objective function values), to provide information on model robustness (e.g., case-deletion and bootstrap methods), or to detect possible overfit (e.g., the standard errors of the parameters). However, numerical diagnostics are seldom useful for assessing whether a model can adequately describe the observed data in an absolute sense. Furthermore, the biological plausibility of a model lies not only in its structure but also in the reasonableness of the parameter estimates and predictions with respect to prior knowledge about the biological system. It can also be important to ensure that the model provides biologically plausible predictions in unobserved situations, for example, at other exposures, or for unobserved variables of the model.

Figure 2 Observations versus individual predictions for three different structural model misspecifications at varying degrees of information in data, expressed through the ε-shrinkage value (the panels include a one-compartment disposition PK model fitted to data simulated with a two-compartment model, and a first-order absorption PK model fitted to data simulated with a transit compartment absorption model). Emax, maximum drug effect; PK, pharmacokinetics.

Figure 3 Conditional weighted residuals (CWRES) and weighted residuals (WRES) versus independent variable plots when both CWRES and WRES were calculated from the correct models. Emax, maximum drug effect.
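As an illustration of the resampling idea behind the bootstrap methods mentioned among the numerical diagnostics, a minimal nonparametric percentile bootstrap of a scalar estimate is sketched below. This is a toy example: in a mixed-effects analysis the resampling unit would be the subject and the estimator a full model fit, neither of which is shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(data, estimator, n_boot=1000, alpha=0.05):
    """Percentile bootstrap confidence interval for estimator(data):
    resample with replacement, re-estimate, and take the quantiles."""
    data = np.asarray(data)
    stats = np.array([
        estimator(rng.choice(data, size=data.size, replace=True))
        for _ in range(n_boot)
    ])
    return np.quantile(stats, [alpha / 2.0, 1.0 - alpha / 2.0])

sample = rng.normal(10.0, 2.0, 200)     # toy data set
lo, hi = bootstrap_ci(sample, np.mean)  # CI for the sample mean
```

Wide or asymmetric bootstrap intervals flag parameters that are poorly supported by the data, which is the robustness information such diagnostics are meant to convey.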
Discussion
The main purpose of this commentary is to raise awareness of the shortcomings of the commonly used diagnostics: (i) PRED- and WRES-based diagnostics may falsely indicate that the model is inadequate; (ii) IPRED- or EBE-based diagnostics may fail to flag a model misspecification; and (iii) it may not always be possible to generate relevant simulation-based diagnostics. Suggested alternative diagnostics sometimes, but not always, require that more information be provided to the reader. Nonlinear mixed-effects models have several model components (structural model, covariate model, and models for interindividual variability and residual error), and it would be desirable to provide measures of model appropriateness in all these aspects (and for all the variables for which models have been derived). This would result in a considerable number of figures and information. When these figures have to share publication space with illustrations of the data themselves and model implications, suboptimal compromises are often necessary. Hence, data that provide a more extensive demonstration of model adequacy can be seen in Supplementary Tables 1–3 online.
SUPPLEMENTARY MATERIAL is linked to the online version of the paper at http://www.nature.com/cpt
ACKNOWLEDGMENT
We thank Peter Milligan for valuable comments.
CONFLICT OF INTEREST
The authors declared no conflict of interest.
© 2007 ASCPT
1. Karlsson, M.O., Jonsson, E.N., Wiltse, C.G. & Wade, J.R. Assumption testing in population pharmacokinetic models: illustrated with an analysis of moxonidine data from congestive heart failure patients. J. Pharmacokinet. Biopharm. 26, 207–246 (1998).
2. Cox, E.H., Veyrat-Follet, C., Beal, S.L., Fuseau, E., Kenkare, S. & Sheiner, L.B. A population pharmacokinetic-pharmacodynamic analysis of repeated measures time-to-event pharmacodynamic responses: the antiemetic effect of ondansetron. J. Pharmacokinet. Biopharm. 27, 625–644 (1999).
3. Savic, R., Wilkins, J.J. & Karlsson, M.O. (Un)informativeness of empirical Bayes estimate-based diagnostics [abstr T3360]. AAPS J. 8 (S2) (2006).
4. Sheiner, L.B., Rosenberg, B. & Marathe, V.V. Estimation of population characteristics of pharmacokinetic parameters from routine clinical data. J. Pharmacokinet. Biopharm. 5, 445–479 (1977).
5. Hooker, A. & Karlsson, M.O. Conditional weighted residuals: a diagnostic to improve population PK/PD model building and evaluation [abstr W5321]. AAPS Pharm. Sci. 7 (S2) (2005).
6. Girard, P., Blaschke, T.F., Kastrissios, H. & Sheiner, L.B. A Markov mixed effect regression model for drug compliance. Stat. Med. 17, 2313–2333 (1998).
7. Holford, N. VPC, the visual predictive check—superiority to standard diagnostic (Rorschach) plots [abstr 738]. PAGE 14 (2005) <http://www.page-meeting.org/?abstract=738>.
8. Brendel, K., Comets, E., Laffont, C., Laveille, C. & Mentré, F. Metrics for external model evaluation with an application to the population pharmacokinetics of gliclazide. Pharm. Res. 23, 2036–2049 (2006).