doi: 10.1093/bib/bbz016
Advance Access Publication Date: 20 March 2019
Review article
Abstract
Information criteria (ICs) based on penalized likelihood, such as Akaike’s information criterion (AIC), the Bayesian
information criterion (BIC) and sample-size-adjusted versions of them, are widely used for model selection in health and
biological research. However, different criteria sometimes support different models, leading to discussions about which is
the most trustworthy. Some researchers and fields of study habitually use one or the other, often without a clearly stated
justification. They may not realize that the criteria may disagree. Others try to compare models using multiple criteria but
encounter ambiguity when different criteria lead to substantively different answers, leading to questions about which
criterion is best. In this paper we present an alternative perspective on these criteria that can help in interpreting their
practical implications. Specifically, in some cases the comparison of two models using ICs can be viewed as equivalent to a
likelihood ratio test, with the different criteria representing different alpha levels and BIC being a more conservative test
than AIC. This perspective may lead to insights about how to interpret the ICs in more complex situations. For example, AIC
or BIC could be preferable, depending on the relative importance one assigns to sensitivity versus specificity. Understanding
the differences and similarities among the ICs can make it easier to compare their results and to use them to make
informed decisions.
Key words: Akaike information criterion; Bayesian information criterion; latent class analysis; likelihood ratio testing;
model selection
John J. Dziak is a research assistant professor in the Methodology Center at Penn State. He is interested in longitudinal data, mixture models, high-
dimensional data and mental health applications. He has a BS in Psychology from the University of Scranton and a PhD in Statistics from Penn State, with
advisor Runze Li.
Donna L. Coffman is an assistant professor in the Department of Epidemiology and Biostatistics at Temple University. Her research interests include causal
inference, mediation models, longitudinal data and mobile health applications in substance abuse prevention and obesity prevention.
Stephanie T. Lanza is the director of the Edna Bennett Pierce Prevention Research Center, a professor in the Department of Biobehavioral Health and a
principal investigator at the Methodology Center. She is interested in applications of mixture modeling and intensive longitudinal data to substance abuse
prevention.
Runze Li is Eberly Family Chair professor in the Department of Statistics and a principal investigator in the Methodology Center at Penn State. His research
interests include variable screening and selection in genomic and other high-dimensional data, as well as in models for longitudinal data.
Lars S. Jermiin is an honorary professor in the Research School of Biology at the Australian National University and a visiting researcher at the Earth Institute
and School of Biology and Environmental Science, University College Dublin. He is interested in using genome bioinformatics to study phylogenetics and
evolutionary processes, especially insect evolution.
Submitted: 24 October 2018; Received (in revised form): 4 January 2019
© The Author(s) 2019. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
Table 1. Penalty weights An for common ICs

Criterion   An                     Emphasizes        Risk
Non-consistent criteria
AIC         An = 2                 Good prediction   Overfitting
AICc        An = 2n/(n − p − 1)    Good prediction   Overfitting
Consistent criteria
ABIC        An = ln((n + 2)/24)    Depends on n      Depends on n
BIC         An = ln(n)             Parsimony         Underfitting
CAIC        An = ln(n) + 1         Parsimony         Underfitting

Note. AIC = Akaike information criterion. AICc = corrected AIC. ABIC = adjusted Bayesian information criterion. BIC = Bayesian information criterion. CAIC = consistent Akaike information criterion. n = sample size (number of subjects). p = number of free parameters. Other criteria include the DIC (deviance information criterion), which acts as an analog of AIC in certain Bayesian analyses but is more complicated to compute.
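To make the shared structure concrete, the following sketch computes any of these criteria from a fitted model's log-likelihood, assuming the common form IC = −2ℓ + An·p referred to as Expression (1) in the text. The function and argument names here are ours, not from the paper's code.

# Minimal sketch: every criterion in Table 1 is -2*loglik + An*p;
# only the penalty weight An differs.
penalty.weight <- function(criterion, n, p) {
  switch(criterion,
         AIC  = 2,
         AICc = 2*n/(n - p - 1),   # small-sample correction to AIC
         ABIC = log((n + 2)/24),   # sample-size-adjusted BIC
         BIC  = log(n),
         CAIC = log(n) + 1)
}
ic.value <- function(loglik, p, n, criterion) {
  -2*loglik + penalty.weight(criterion, n, p)*p
}
# Example: loglik = -180, p = 5 parameters, n = 100 subjects
sapply(c("AIC","AICc","ABIC","BIC","CAIC"),
       function(cr) ic.value(-180, 5, 100, cr))
# gives, e.g., AIC = 370.00 and BIC = 383.03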
and BIC or CAIC choose a two-class model. In the three-class model, each of the empirically estimated classes corresponds almost perfectly to one of the three species, with very few misclassifications (five of the versicolor were mistakenly classified as virginica). In the two-class model, the versicolor and virginica flowers were lumped together. Agusta and Dowe [28] performed this analysis and concluded that BIC performed poorly on this benchmark dataset. Most biologists would probably agree with this assessment. However, an alternative interpretation might be that BIC was simply being parsimonious, and that flower dimensions alone might not be enough to confidently separate the species. A much more detailed look at clustering the iris data, considering many more possible modeling choices, is found in [29]. However, this simple look is enough to discuss the relevant

intuitively understood as the difference between the estimated and the true distribution. Et(ℓt(y)) will be the same for all models being considered, so KL is minimized by choosing the model with highest Et(ℓ(y)). The ℓ(y) from the fitted model is a biased measure of Et(ℓ(y)), especially if p is large, because a model with many parameters can generally be fine-tuned to appear to fit a small dataset well, even if its structure is such that it cannot generalize to describe the process that generated the data. Intuitively, this means that if there are many parameters, the fit of the model to the originally obtained data (training sample) will seem good regardless of whether the model is correct or not, simply because the model is so flexible. In other words, once a particular dataset is used to estimate the parameters of a model, the fit of the model on that sample is no longer an independent
response. Each approach is intended to correct a fit estimate for the artificial inflation in observed performance caused by fitting a model and evaluating it with the same data, and to find a good balance between bias caused by too restrictive a model and excessive variance caused by a model with too many parameters [36]. These AIC-like criteria do not treat model parsimony as a motivating goal in its own right, but only as a means to reduce unnecessary sampling error caused by having to estimate too many parameters relative to n. Thus, especially for large n, AIC-like criteria emphasize sensitivity more than specificity. However, in many research settings, parsimonious interpretation is of strong interest in its own right. In these settings, another criterion such as BIC, described in the next section, might be more appropriate.

If we assume equal prior probabilities for each model, this simplifies to the 'Bayes factor' (see [51]):

    Bij = Pr(y|Mi) / Pr(y|Mj)    (3)

so that the model with the highest Bayes factor also has the higher posterior probability. Schwarz [3] and Kass and Wasserman [52] showed that, for many kinds of models, Bij can be roughly approximated by exp(−BICi/2 + BICj/2), where BIC equals Expression (1) with An = ln(n). BIC is also called the Schwarz criterion. Note that in a Bayesian analysis, all of the parameters within each of the candidate models have prior distributions representing knowledge or beliefs which the investigators have
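A hedged sketch of this BIC-based approximation to the Bayes factor follows; the function name is ours, and equal prior model probabilities are assumed, as in the text.

# Schwarz approximation: B_ij is roughly exp(-BIC_i/2 + BIC_j/2),
# which is large when model i has the smaller BIC.
approx.bayes.factor <- function(bic.i, bic.j) {
  exp(-0.5*bic.i + 0.5*bic.j)
}
approx.bayes.factor(1000, 1006)  # exp(3), about 20-to-1 in favor of model i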
Table 2. An for common ICs at different sample sizes

n        AIC      ABIC      BIC       CAIC
10       2.0000   -0.6931   2.3026    3.3026
50       2.0000   0.7732    3.9120    4.9120
100      2.0000   1.4469    4.6052    5.6052
500      2.0000   3.0405    6.2146    7.2146
1000     2.0000   3.7317    6.9078    7.9078
5000     2.0000   5.3395    8.5172    9.5172
10000    2.0000   6.0325    9.2103    10.2103
100000   2.0000   8.3349    11.5129   12.5129

Note. An = penalty weighting constant. n = sample size (number of subjects). AIC = Akaike information criterion. ABIC = adjusted Bayesian information criterion. BIC = Bayesian information criterion. CAIC = consistent Akaike information criterion.

sized models and comparing nested models—each provide some insight into this question.

First, when comparing different models of the same size (i.e. number of parameters to be estimated), all ICs of the form (1) will always agree on which model is best. For example, in regression variable subset selection, suppose two models each use five covariates. In this case, any IC will select whichever model has the highest likelihood (the best fit to the observed sample) after estimating the parameters. This is because only the first term in Expression (1) will differ across the candidate models, so An does not matter. Thus, although the ICs differ in theoretical framework, they only disagree when they make different tradeoffs between fit and model size.

Second, for comparing a nested pair of models, different
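The An values in Table 2 can be reproduced directly from the formulas in Table 1, as in the minimal sketch below (object names are ours).

# Penalty weights An for each criterion across sample sizes (cf. Table 2).
n <- c(10, 50, 100, 500, 1000, 5000, 10000, 100000)
an.table <- data.frame(n = n,
                       AIC = 2,                   # constant in n
                       ABIC = log((n + 2)/24),
                       BIC = log(n),
                       CAIC = log(n) + 1)
print(an.table, digits = 5)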
Table 3. Effective α values implied by each IC when comparing nested models, shown separately assuming p1 − p0 = 1 and assuming p1 − p0 = 10. [Table body not recovered.]

Note. n = sample size (number of subjects). AIC = Akaike information criterion. ABIC = adjusted Bayesian information criterion. BIC = Bayesian information criterion. CAIC = consistent Akaike information criterion. p1 = number of free parameters in larger model within pair being compared. p0 = number of free parameters in smaller model.
to An = 1.96² ≈ 4. However, as p1 − p0 becomes larger, all ICs become more conservative, in order to avoid adding many unnecessary parameters unless they are needed. Table 3 shows different effective α values for two values of p1 − p0, obtained using the R [27] code 1-pchisq(q=An*df,df=df,lower.tail=TRUE), where An is the An value and df is p1 − p0. AICc is not shown in the table because its penalty weight depends both on p0 and on p1 in a slightly more complicated way, but it will behave similarly to AIC for large n and modest p0.

asymptotically consistent even though it may be more powerful for a given n (see [1, 75, 68]). Also, since choosing An for a model comparison is closely related to choosing an α level for a significance test, it becomes clear that the universally 'best' IC cannot be defined any more than the 'best' α; there will always be a tradeoff. Thus, debates about whether AIC is generally superior to BIC or vice versa will be fruitless.
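For example, wrapping the quoted pchisq() one-liner in a small function makes it easy to tabulate the implied α for any criterion and sample size (the function name is ours; the chi-squared reference distribution for the nested-model LRT is assumed, as in the text).

# Effective alpha of an IC-based comparison of nested models: the larger
# model is chosen when the LRT statistic exceeds An*(p1 - p0).
effective.alpha <- function(An, df) {
  1 - pchisq(q = An*df, df = df)
}
effective.alpha(An = 2, df = 1)             # AIC: about 0.157
effective.alpha(An = log(100), df = 1)      # BIC, n = 100: about 0.032
effective.alpha(An = log(100) + 1, df = 1)  # CAIC, n = 100: about 0.018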
The utility of each cutoff is determined by the consequences of a false positive or false negative and by one's beliefs about the base rates of positives and negatives. Thus, AIC and BIC could be seen as representing different sets of prior beliefs in a Bayesian sense (see [40, 31]) or, at least, different judgments about the importance of parsimony. Perhaps in some examples a more or less sensitive test (higher or lower An or α) would be more appropriate than in others. For example, although AIC has favorable theoretical properties for choosing the number of parameters needed to approximate the shape of a nonparametric growth curve in general [33], in a particular application with such data Dziak et al. [85] argued that BIC would give more interpretable results. They argued this because the curves in that context were believed likely to have a smooth and simple shape,

ing this item given membership in this class. For numerical outcomes, the means and covariance parameters of the vector of items within each class constitute the class-specific parameters. To fit an LCA model or any of its cousins, an algorithm such as EM [91, 92, 81] is often used to alternately estimate class-specific parameters and predict subjects' class membership given those parameters. The user must specify the number of classes in a model, but the true number of classes is generally unknown [17, 66]. Sometimes one might have a strong theoretical reason to specify the number of classes, but often this must be done using data-driven model selection.

Information criteria for selecting the number of classes
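As an illustration of such data-driven selection, the sketch below uses the poLCA package, one of several R tools for categorical LCA; it is not the package used in this paper's own example, and the data frame mydata and the item names are hypothetical placeholders.

library(poLCA)
# Fit LCA models with 1 to 4 classes and compare them by AIC and BIC.
# poLCA expects categorical items coded as positive integers (1, 2, ...).
f <- cbind(item1, item2, item3, item4) ~ 1
fits <- lapply(1:4, function(k)
  poLCA(f, data = mydata, nclass = k, verbose = FALSE))
data.frame(classes = 1:4,
           AIC = sapply(fits, function(m) m$aic),
           BIC = sapply(fits, function(m) m$bic))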
Other simulations have explored the ability of the ICs to determine the correct number of classes. In [95], AIC had the lowest rate of underfitting but often overfit, while BIC and CAIC practically never overfit but often underfit. AIC3 was in between and did well in general. The danger of underfitting increased when the classes did not have very different response profiles and were therefore easy to mistakenly lump together; in these cases BIC and CAIC almost always underfit. Yang [96] reported that ABIC performed better in general than AIC (whose model selection accuracy never got to 100% regardless of n) or BIC or CAIC (which underfit too often and required large n to be accurate). Fonseca and Cardoso [46] similarly suggested AIC3 as the preferred selection criterion for categorical LCA models. Yang and Yang [47] compared AIC, BIC, AIC3, ABIC and CAIC.

instead of three species, and a larger number might be needed to distinguish three cultivars or subspecies. Thus, the point at which the n becomes 'large' depends on numerous aspects of the simulated scenario [98, 99]. Furthermore, in real data, unlike simulations, the characteristics of the 'true' (data-generating) model are unknown, since the data have been generated by a natural or experimental process rather than a probability model. For this reason it may be more helpful to think about which aspects of performance (e.g. sensitivity or specificity) are most important in a given situation, rather than talking about the nature of a supposed true data-generating model.

If the goal of having a model which contains enough parameters to describe the heterogeneity in the population is more important than the goal of parsimony, or if some classes are
parameter values, and sample size used when generating their simulated data determine the relative performance of the ICs in simulations in a complicated way, limiting the extent to which they can be used to state general rules or advice [98, 99, 107]. If BIC indicates that a model is too small, it may well be too small (or else fit poorly for some other reason). If AIC indicates that a model is too large, it may well be too large for the data to warrant. Beyond this, theory and judgment are needed. If BIC selects the largest and most general model considered, it is worth thinking about whether to expand the space of models considered (since an even more general model might fit even better), and similarly if AIC chooses the most parsimonious. AIC and BIC each have distinct theoretical advantages. However, a researcher may judge that there may be a practical

methodological choices and not pick and choose methods simply as a way of supporting a desired outcome (see [114]). A larger question is whether to use ICs at all. If ICs indeed reduce to LRTs in simple cases, one might wonder why ICs are needed at all, and why researchers cannot simply do LRTs. A possible answer is flexibility. Both AIC and BIC can be used to concurrently compare many models, not all of them nested, rather than just a pair of nested models at a time. They can also be used to weight the estimates obtained from different models for a common quantity of interest. These weighting approaches use either AIC or BIC but not both, because AIC and BIC are essentially treated as different Bayesian priors. While currently we know of no mathematical theoretical framework for explicitly combining both AIC and BIC into a single weighting scheme, a
library(mclust);
library(datasets);
n <- 150;
p <- rep(NA,7);          # number of free parameters for each candidate model
ll <- rep(NA,7);
bic.given <- rep(NA,7);
models <- list();
for (k in 1:7) {
  temp.model <- Mclust(iris[,1:4],G=k,modelNames="VVV");
  p[k] <- temp.model$df;           # number of estimated parameters
  ll[k] <- temp.model$loglik;      # maximized log-likelihood
  bic.given[k] <- temp.model$bic;  # mclust reports BIC as 2*loglik - p*log(n), so higher is better
  models[[k]] <- temp.model;
}
aic.calculated <- -2*ll + 2*p;

Funding

National Institute on Drug Abuse (NIH grant P50 DA039838). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute on Drug Abuse or the National Institutes of Health.

References

1. Claeskens G, Hjort NL. Model Selection and Model Averaging. New York: Cambridge University Press, 2008.
2. Akaike H. Information theory and an extension of the maximum likelihood principle. In: Petrov BN, Csaki F (eds). Second International Symposium on Information Theory. Budapest: Akademiai Kiado, 1973, 267–281.
19. Minin V, Abdo Z, Joyce P, et al. Performance-based selection of likelihood models for phylogeny estimation. Syst Biol 2003;52:674–683.
20. Wang H, Li R, Tsai C. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika 2007;94:553–568.
21. Atkinson AC. A note on the generalized information criterion for choice of a model. Biometrika 1980;67:413–418.
22. Collins LM, Lanza ST. Latent Class and Latent Transition Analysis for the Social, Behavioral, and Health Sciences. Hoboken, NJ: Wiley, 2010.
23. Lin TH, Dayton CM. Model selection information criteria for non-nested latent class models. J Educ Behav Stat 1997;22:249–264.
41. Sugiura N. Further analysis of the data by Akaike's Information Criterion and the finite corrections. Commun Stat Theory Methods 1978;A7:13–26.
42. George EI. The variable selection problem. J Am Stat Assoc 2000;95:1304–1308.
43. Stone M. An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. J R Stat Soc B 1977;39:44–47.
44. Golub GH, Heath M, Wahba G. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 1979;21:215–223.
45. Andrews RL, Currim IS. A comparison of segment retention criteria for finite mixture logit models. J Mark Res 2003;40:235–243.
66. Tein J-Y, Coxe S, Cham H. Statistical power to detect the correct number of classes in latent profile analysis. Struct Equ Modeling 2013;20:640–657.
67. Rao CR, Wu Y. A strongly consistent procedure for model selection in a regression problem. Biometrika 1989;76:369–374.
68. Yang Y. Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika 2005;92:937–950.
69. Zhang P. On the convergence rate of model selection criteria. Commun Stat Theory Methods 1993;22:2765–2775.
70. Pötscher BM. Effects of model selection on inference. Econ Theory 1991;7:163–185.
71. Shibata R. Consistency of model selection and parameter estimation. J Appl Probab 1986;23A:127–141.
90. Lubke GH, Stephens SH, Lessem JM, et al. The CHRNA5/A3/B4 gene cluster and tobacco, alcohol, cannabis, inhalants and other substance use initiation: replication and new findings using mixture analysis. Behav Genet 2012;42:636–646.
91. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B 1977;39:1–38.
92. Gupta MR, Chen Y. Theory and use of the EM algorithm. Found Trends Signal Process 2010;4:223–296.
93. Leroux BG. Consistent estimation of a mixing distribution. Ann Stat 1992;20:1350–1360.
94. Dayton CM. Latent Class Scaling Analysis. Thousand Oaks, CA: Sage, 1998.
111. Johnson JB, Omland KS. Model selection in ecology and evolution. Trends Ecol Evol 2004;19:101–108.
112. Posada D, Crandall KA. Selecting the best-fit model of nucleotide substitution. Syst Biol 2001;50:580–601.
113. Gelman A, Hennig C. Beyond subjective and objective in statistics. J R Stat Soc Ser A 2017;180:967–1033.
114. Simmons JP, Nelson LD, Simonsohn U. False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol Sci 2011;22:1359–1366.
115. Yang Y. Consistency of cross validation for comparing regression procedures. Ann Stat 2007;35:2450–2473.
116. Cross-Disorder Group of the Psychiatric Genomics Consortium. Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis. Lancet 2013;381(9875):1371–1379.
117. Chen J, Chen Z. Extended Bayesian information criterion for model selection with large model spaces. Biometrika 2008;95:759–771.
118. Liao JG, Cavanaugh JE, McMurry TL. Extending AIC to best subset regression. Comput Stat 2018;33:787–806.
119. Mestres AC, Bochkina N, Mayer C. Selection of the regularization parameter in graphical models using network characteristics. J Comput Graph Stat 2018;27:323–333.
120. Pan R, Wang H, Li R. Ultrahigh-dimensional multiclass linear discriminant analysis by pairwise sure independence screening. J Am Stat Assoc 2016;111:169–179.