
Journal of Memory and Language 68 (2013) 255–278

Random effects structure for confirmatory hypothesis testing: Keep it maximal

Dale J. Barr a,*, Roger Levy b, Christoph Scheepers a, Harry J. Tily c

a Institute of Neuroscience and Psychology, University of Glasgow, 58 Hillhead St., Glasgow G12 8QB, United Kingdom
b Department of Linguistics, University of California at San Diego, La Jolla, CA 92093-0108, USA
c Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

Article info

Article history: Received 5 August 2011; revision received 30 October 2012.
Keywords: Linear mixed-effects models; Generalization; Statistics; Monte Carlo simulation.

Abstract

Linear mixed-effects models (LMEMs) have become increasingly prominent in psycholinguistics and related areas. However, many researchers do not seem to appreciate how random effects structures affect the generalizability of an analysis. Here, we argue that researchers using LMEMs for confirmatory hypothesis testing should minimally adhere to the standards that have been in place for many decades. Through theoretical arguments and Monte Carlo simulation, we show that LMEMs generalize best when they include the maximal random effects structure justified by the design. The generalization performance of LMEMs including data-driven random effects structures strongly depends upon modeling criteria and sample size, yielding reasonable results on moderately-sized samples when conservative criteria are used, but with little or no power advantage over maximal models. Finally, random-intercepts-only LMEMs used on within-subjects and/or within-items data from populations where subjects and/or items vary in their sensitivity to experimental manipulations always generalize worse than separate F1 and F2 tests, and in many cases, even worse than F1 alone. Maximal LMEMs should be the 'gold standard' for confirmatory hypothesis testing in psycholinguistics and beyond.

© 2012 Elsevier Inc. All rights reserved.

* Corresponding author. Fax: +44 (0)141 330 4606. E-mail addresses: dale.barr@glasgow.ac.uk (D.J. Barr), rlevy@ucsd.edu (R. Levy), christoph.scheepers@glasgow.ac.uk (C. Scheepers), hjt@mit.edu (H.J. Tily). http://dx.doi.org/10.1016/j.jml.2012.11.001

"I see no real alternative, in most confirmatory studies, to having a single main question—in which a question is specified by ALL of design, collection, monitoring, AND ANALYSIS."

Tukey (1980), "We Need Both Exploratory and Confirmatory" (p. 24, emphasis in original).

Introduction

The notion of independent evidence plays no less important a role in the assessment of scientific hypotheses than it does in everyday reasoning. Consider a pet-food manufacturer determining which of two new gourmet cat-food recipes to bring to market. The manufacturer has every interest in choosing the recipe that the average cat will eat the most of. Thus every day for a month (28 days) their expert, Dr. Nyan, feeds one recipe to a cat in the morning and the other recipe to a cat in the evening, counterbalancing which recipe is fed when and carefully measuring how much was eaten at each meal. At the end of the month Dr. Nyan calculates that recipes 1 and 2 were consumed to the tune of 92.9 ± 5.6 and 107.2 ± 6.1 (means ± SDs) grams per meal respectively. How confident can we be that recipe 2 is the better choice to bring to market? Without further information you might hazard the guess "somewhat confident", considering that one of the first statistical hypothesis tests typically taught, the unpaired t-test, gives p = 0.09 against the null hypothesis that choice of recipe does not matter. But now we tell you that only seven cats participated in this test, one for each day of the week. How does this change your confidence in the superiority of recipe 2?

Let us first take a moment to consider precisely what it is about this new information that might drive us to change our analysis. The unpaired t-test is based on the assumption that all observations are conditionally independent of one another given the true underlying means of the two populations—here, the average amount a cat would consume of each recipe in a single meal. Since no two cats are likely to have identical dietary proclivities, multiple measurements from the same cat would violate this assumption. The correct characterization becomes that all observations are conditionally independent of one another given (a) the true palatability effect of recipe 1 versus recipe 2, together with (b) the dietary proclivities of each cat. This weaker conditional independence is a double-edged sword. On the one hand, it means that we have tested effectively fewer individuals than our 56 raw data points suggest, and this should weaken our confidence in generalizing the superiority of recipe 2 to the entire cat population. On the other hand, the fact that we have made multiple measurements for each cat holds out the prospect of factoring out each cat's idiosyncratic dietary proclivities as part of the analysis, and thereby improving the signal-to-noise ratio for inferences regarding each recipe's overall appeal. How we specify these idiosyncrasies can dramatically affect our conclusions. For example, we know that some cats have higher metabolisms and will tend to eat more at every meal than other cats. But we also know that each creature has its own palate, and even if the recipes were of similar overall quality, a given cat might happen to like one recipe more than the other. Indeed, accounting for idiosyncratic recipe preferences for each cat might lead to even weaker evidence for the superiority of recipe 2.

Situations such as these, where individual observations cluster together via association with a smaller set of entities, are ubiquitous in psycholinguistics and related fields—where the clusters are typically human participants and stimulus materials (i.e., items). Similar clustered-observation situations arise in other sciences, such as agriculture (plots in a field) and sociology (students in classrooms in schools in school-districts); hence accounting for the RANDOM EFFECTS of these entities has been an important part of the workhorse statistical analysis technique, the ANALYSIS OF VARIANCE, under the name MIXED-MODEL ANOVA, since the first half of the 20th century (Fisher, 1925; Scheffé, 1959). In experimental psychology, the prevailing standard for a long time has been to assume that individual participants may have idiosyncratic sensitivities to any experimental manipulation that may have an overall effect, so detecting a "fixed effect" of some manipulation must be done under the assumption of corresponding participant random effects for that manipulation as well. In our pet-food example, if there is a true effect of recipe—that is, if on average a new, previously unstudied cat will eat more of recipe 2 than of recipe 1—it should be detectable above and beyond the noise introduced by cat-specific recipe preferences, provided we have enough data. Technically speaking, the fixed effect is tested against an error term that captures the variability of the effect across individuals.

Standard practices for data-analysis in psycholinguistics and related areas fundamentally changed, however, after Clark (1973). In a nutshell, Clark (1973) argued that linguistic materials, just like experimental participants, have idiosyncrasies that need to be accounted for. Because in a typical psycholinguistic experiment there are multiple observations for the same item (e.g., a given word or sentence), these idiosyncrasies break the conditional independence assumptions underlying mixed-model ANOVA, which treats experimental participant as the only random effect. Clark proposed the quasi-F (F′) and min-F′ statistics as approximations to an F-ratio whose distributional assumptions are satisfied even under what in contemporary parlance is called CROSSED random effects of participant and item (Baayen, Davidson, & Bates, 2008). Clark's paper helped drive the field toward a standard demanding evidence that experimental results generalized beyond the specific linguistic materials used—in other words, the so-called by-subjects F1 mixed-model ANOVA was not enough. There was even a time when reporting of the min-F′ statistic was made a standard for publication in the Journal of Memory and Language. However, acknowledging the widespread belief that min-F′ is unduly conservative (see, e.g., Forster & Dickinson, 1976), significance of min-F′ was never made a requirement for acceptance of a publication. Instead, the 'normal' convention continued to be that a result is considered likely to generalize if it passes p < 0.05 significance in both by-subjects (F1) and by-items (F2) ANOVAs. In the literature this criterion is called F1 × F2 (e.g., Forster & Dickinson, 1976), which in this paper we use to denote the larger (less significant) of the two p values derived from F1 and F2 analyses.

Linear mixed-effects models (LMEMs)

Since Clark (1973), the biggest change in data analysis practices has been the introduction of methods for simultaneously modeling crossed participant and item effects in a single analysis, in what is variously called "hierarchical regression", "multi-level regression", or simply "mixed-effects models" (Baayen, 2008; Baayen et al., 2008; Gelman & Hill, 2007; Goldstein, 1995; Kliegl, 2007; Locker, Hoffman, & Bovaird, 2007; Pinheiro & Bates, 2000; Quené & van den Bergh, 2008; Snijders & Bosker, 1999b).[1] In this paper we refer to models of this class as mixed-effects models; when fixed effects, random effects, and trial-level noise contribute linearly to the dependent variable, and random effects and trial-level error are both normally distributed and independent for differing clusters or trials, it is a linear mixed-effects model (LMEM). The ability of LMEMs to simultaneously handle crossed random effects, in addition to a number of other advantages (such as better handling of categorical data; see Dixon, 2008; Jaeger, 2008), has given them considerable momentum as a candidate to replace ANOVA as the method of choice in psycholinguistics and related areas. But despite the widespread use of LMEMs, there seems to be insufficiently widespread understanding of the role of random effects in such models, and very few standards to guide how random effect structures should be specified for the analysis of a given dataset. Of course, what standards are appropriate or inappropriate depends less upon the actual statistical technique being used, and more upon the goals of the analysis (cf. Tukey, 1980). Ultimately, the random effect structure one uses in an analysis encodes the assumptions that one makes about how sampling units (subjects and items) vary, and the structure of dependency that this variation creates in one's data.

[1] Despite the "mixed-effects models" nomenclature, traditional ANOVA approaches used in psycholinguistics have always used "mixed effects" in the sense of simultaneously estimating both fixed- and random-effects components of such a model. What is new about mixed-effects models is their explicit estimation of the random-effects covariance matrix, which leads to considerably greater flexibility of application, including, as clearly indicated by the title of Baayen et al. (2008), the ability to handle the crossing of two or more types of random effects in a single analysis.

In this paper, our focus is mainly on what assumptions about sampling unit variation are most critical for the use of LMEMs in confirmatory hypothesis testing. By confirmatory hypothesis testing we mean the situation in which the researcher has identified a specific set of theory-critical hypotheses in advance and attempts to measure the evidence for or against them as accurately as possible (Tukey, 1980). Confirmatory analyses should be performed according to a principled plan guided by theoretical considerations, and, to the extent possible, should minimize the influence of the observed data on the decisions that one makes in the analysis (Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012). To simplify our discussion, we will focus primarily on confirmatory analysis of simple data sets involving only a few theoretically-relevant variables. We recognize that in practice, the complexity of one's data may impose constraints on the extent to which one can perform analyses fully guided by theory and not by the data. Researchers who perform laboratory experiments have extensive control over the data collection process, and, as a result, their statistical analyses tend to include only a small set of theoretically relevant variables, because other extraneous factors have been rendered irrelevant through randomization and counterbalancing. This is in contrast to other more complex types of data sets, such as observational corpora or large-scale data sets collected in the laboratory for some other, possibly more general, purpose than the theoretical question at hand. Such datasets may be unbalanced and complex, and include a large number of measurements of many different kinds. Analyzing such datasets appropriately is likely to require more sophisticated statistical techniques than those we discuss in this paper. Furthermore, such analyses may involve data-driven techniques typically used in exploratory data analysis in order to reduce the set of variables to a manageable size. Discussion of such techniques for complex datasets and their proper application is beyond the scope of this paper (but see, e.g., Baayen, 2008; Jaeger, 2010).

Our focus here is on the question: When the goal of a confirmatory analysis is to test hypotheses about one or more critical "fixed effects", what random-effects structure should one use? Based on theoretical analysis and Monte Carlo simulation, we will argue the following:

1. Implicit choices regarding random-effect structures existed for traditional mixed-model ANOVAs just as they exist today for LMEMs.
2. With mixed-model ANOVAs, the standard has been to use what we term "maximal" random-effect structures.
3. Insofar as we as a field think this standard is appropriate for the purpose of confirmatory hypothesis testing, researchers using LMEMs for that purpose should also be using LMEMs with maximal random effects structure.
4. Failure to include maximal random-effect structures in LMEMs (when such random effects are present in the underlying populations) inflates Type I error rates.
5. For designs including within-subjects (or within-items) manipulations, random-intercepts-only LMEMs can have catastrophically high Type I error rates, regardless of how p-values are computed from them (see also Roland, 2009; Jaeger, 2011a; Schielzeth & Forstmeier, 2009).
6. The performance of a data-driven approach to determining random effects (i.e., model selection) depends strongly on the specific algorithm, size of the sample, and criteria used; moreover, the power advantage of this approach over maximal models is typically negligible.
7. In terms of power, maximal models perform surprisingly well even in a "worst case" scenario where they assume random slope variation that is actually not present in the population.
8. Contrary to some warnings in the literature (Pinheiro & Bates, 2000), likelihood-ratio tests for fixed effects in LMEMs show minimal Type I error inflation for psycholinguistic datasets (see Baayen et al., 2008, Footnote 1, for a similar suggestion); also, deriving p-values from Markov chain Monte Carlo (MCMC) sampling does not mitigate the high Type I error rates of random-intercepts-only LMEMs.
9. The F1 × F2 criterion leads to increased Type I error rates the more the effects vary across subjects and items in the underlying populations (see also Clark, 1973; Forster & Dickinson, 1976).
10. Min-F′ is conservative in between-items designs when the item variance is low, and is conservative overall for within-items designs, especially so when the treatment-by-subject and/or treatment-by-item variances are low (see also Forster & Dickinson, 1976); in contrast, maximal LMEMs show no such conservativity.

Further results and discussion are available in an Online Appendix.

Random effects in LMEMs and ANOVA: the same principles apply

The Journal of Feline Gastronomy has just received a submission reporting that the feline palate prefers tuna to liver, and as journal editor you must decide whether to send it out for review. The authors report a highly significant effect of recipe type (p < .0001), stating that they used "a mixed effects model with random effects for cats and recipes". Are you in a position to evaluate the generality
of the findings? Given that LMEMs can implement nearly any of the standard parametric tests—from a one-sample test to a multi-factor mixed-model ANOVA—the answer can only be no. Indeed, whether LMEMs produce valid inferences depends critically on how they are used. What you need to know in addition is the random effects structure of the model, because this is what the assessment of the treatment effects is based on. In other words, you need to know which treatment effects are assumed to vary across which sampled units, and how they are assumed to vary. As we will see, whether one is specifying a random effects structure for LMEMs or choosing an analysis from among the traditional options, the same considerations come into play. So, if you are scrupulous about choosing the "right" statistical technique, then you should be equally scrupulous about using the "right" random effects structure in LMEMs. In fact, knowing how to choose the right test already puts you in a position to specify the correct random effects structure for LMEMs.

In this section, we attempt to distill the implicit standards already in place by walking through a hypothetical example and discussing the various models that could be applied, their underlying assumptions, and how these assumptions relate to more traditional analyses. In this hypothetical "lexical decision" experiment, subjects see strings of letters and have to decide whether or not each string forms an English word, while their response times are measured. Each subject is exposed to two types of words, forming condition A and condition B of the experiment. The words in one condition differ from those in the other condition on some intrinsic categorical dimension (e.g., syntactic class), comprising a word-type manipulation that is within-subjects and between-items. The question is whether reaction times are systematically different between condition A and condition B. For expository purposes, we use a "toy" dataset with hypothetical data from four subjects and four items, yielding two observations per treatment condition per participant. The observed data are plotted alongside predictions from the various models we will be considering in the panels of Fig. 1. Because we are using simulated data, all of the parameters of the population are known, as well as the "true" subject-specific and item-specific effects for the sampled data. In practice, researchers do not know these values and can only estimate them from the data; however, using known values for hypothetical data can aid in understanding how clustering in the population maps onto clustering in the sample.

The general pattern for the observed data points suggests that items of type B (I3 and I4) are responded to faster than items of type A (I1 and I2). This suggests a simple (but clearly inappropriate) model for these data that relates response Y_si for subject s and item i to a baseline level via fixed effect β0 (the intercept), a treatment effect via fixed effect β1 (the slope), and observation-level error e_si with variance σ²:

$$Y_{si} = \beta_0 + \beta_1 X_i + e_{si}, \qquad e_{si} \sim N(0, \sigma^2) \qquad (1)$$
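Model (1)'s generative assumptions can be made concrete with a short simulation. The sketch below is our own illustration, not part of the original analysis, and all parameter values are hypothetical, chosen only for exposition: 16 observations are drawn with no subject or item clustering, and the data are analyzed with an unpaired t-test, the analysis that implicitly assumes this model.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical parameter values, for illustration only:
# grand mean 800 ms, a 40 ms advantage for type B, 50 ms trial-level error SD
beta0, beta1, sigma = 800.0, -40.0, 50.0

# 16 observations: 4 subjects x 4 items; X = 0 for type A items, 1 for type B
X = np.tile([0.0, 0.0, 1.0, 1.0], 4)
Y = beta0 + beta1 * X + rng.normal(0.0, sigma, size=X.size)

# An unpaired t-test treats all 16 observations as mutually independent,
# which is exactly the assumption built into Model (1)
result = stats.ttest_ind(Y[X == 1.0], Y[X == 0.0])
```

Because every observation here really is generated independently, the t-test's independence assumption holds by construction; the point of the models that follow is that real repeated-measures data are not generated this way.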

Fig. 1. Example RT data (open symbols) and model predictions (filled symbols) for a hypothetical lexical decision experiment with two within-subject/
between-item conditions, A (triangles) and B (circles), including four subjects (S1–S4) and four items (I1–I4). Panel (a) illustrates a model with no random
effects, considering only the baseline average RT (response to word type A) and treatment effect; panel (b) adds random subject intercepts to the model;
panel (c) adds by-subject random slopes; and panel (d) illustrates the additional inclusion of by-item random intercepts. Panel (d) represents the maximal
random-effects structure justified for this design; any remaining discrepancies between observed data and model estimates are due to trial-level
measurement error (esi).
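The kind of clustered data illustrated in Fig. 1 can be generated with a short simulation of the maximal random-effects structure for this design (by-subject random intercepts and slopes plus by-item random intercepts). The sketch below is our own illustration; all parameter values are hypothetical, and the negative intercept/slope correlation mimics the pattern described in the text.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population parameters, chosen only for illustration
beta0, beta1 = 800.0, -40.0             # grand mean (ms); type B is 40 ms faster
sigma = 30.0                            # trial-level error SD
tau00, tau11, rho = 100.0, 60.0, -0.6   # by-subject SDs and intercept/slope correlation
omega00 = 40.0                          # by-item intercept SD

n_subj, n_item = 4, 4
X = np.array([0.0, 0.0, 1.0, 1.0])      # items I1, I2 are type A; I3, I4 are type B

# By-subject intercepts and slopes, drawn jointly from a bivariate normal
cov = np.array([[tau00**2,            rho * tau00 * tau11],
                [rho * tau00 * tau11, tau11**2]])
S = rng.multivariate_normal([0.0, 0.0], cov, size=n_subj)  # columns: S0s, S1s

# By-item intercepts
I0 = rng.normal(0.0, omega00, size=n_item)

# One RT per subject-item cell under the maximal model for this design
Y = (beta0 + S[:, [0]] + I0             # subject- and item-specific baselines
     + (beta1 + S[:, [1]]) * X          # subject-specific treatment effect
     + rng.normal(0.0, sigma, size=(n_subj, n_item)))
```

Each row of `Y` is one subject's four responses; re-running with a different seed corresponds to sampling a new set of subjects and items from the same population.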
where X_i is a predictor variable[2] taking on the value of 0 or 1 depending on whether item i is of type A or B respectively, and e_si ∼ N(0, σ²) indicates that the trial-level error is normally distributed with mean 0 and variance σ². In the population, participants respond to items of type B 40 ms faster than items of type A. Under this first model, we assume that each of the 16 observations provides the same evidence for or against the treatment effect regardless of whether or not any other observations have already been taken into account. Performing an unpaired t-test on these data would implicitly assume this (incorrect) generative model.

[2] For expository purposes, we use a treatment coding scheme (0 or 1) for the predictor variable. Alternatively, the models in this section could be expressed in the style more common to traditional ANOVA pedagogy, where fixed and random effects represent deviations from a grand mean. This model can be fit by using "deviation coding" for the predictor variable (−.5 and .5 instead of 0 and 1). For higher-order designs, treatment and deviation coding schemes will lead to different interpretations for lower-order effects (simple effects for treatment coding and main effects for deviation coding).

Model (1) is not a mixed-effects model because we have not defined any sources of clustering in our data; all observations are conditionally independent given a choice of intercept, treatment effect, and noise level. But experience tells us that different subjects are likely to have different overall response latencies, breaking conditional independence between trials for a given subject. We can expand our model to account for this by including a new offset term S_0s, the deviation from β0 for subject s. The expanded model is now

$$Y_{si} = \beta_0 + S_{0s} + \beta_1 X_i + e_{si}, \qquad S_{0s} \sim N(0, \tau_{00}^2), \qquad e_{si} \sim N(0, \sigma^2) \qquad (2)$$

These offsets increase the model's expressivity by allowing predictions for each subject to shift upward or downward by a fixed amount (Fig. 1b). Our use of Latin letters for this term is a reminder that S_0s is a special type of effect which is different from the βs—indeed, we now have a "mixed-effects" model: parameters β0 and β1 are fixed effects (effects that are assumed to be constant from one experiment to another), while the specific composition of subject levels for a given experiment is assumed to be a random subset of the levels in the underlying populations (another instantiation of the same experiment would have a different composition of subjects, and therefore different realizations of the S_0s effects). The S_0s effects are therefore random effects; specifically, they are random intercepts, as they allow the intercept term to vary across subjects. Our primary goal is to produce a model which will generalize to the population from which these subjects are randomly drawn, rather than describing the specific S_0s values for this sample. Therefore, instead of estimating the individual S_0s effects, the model-fitting algorithm estimates the population distribution from which the S_0s effects were drawn. This requires assumptions about this distribution; we follow the common assumption that it is normal, with a mean of 0 and a variance of τ²00; here τ²00 is a random effect parameter, and is denoted by a Greek symbol because, like the βs, it refers to the population and not to the sample.

Note that the variation on the intercepts is not confounded with our effect of primary theoretical interest (β1): for each subject, it moves the means for both conditions up or down by a fixed amount. Accounting for this variation will typically decrease the residual error and thus increase the sensitivity of the test of β1. Fitting Model (2) is thus analogous to analyzing the raw, unaggregated response data using a repeated-measures ANOVA with SS_subjects subtracted from the residual SS_error term. One could see that this analysis is wrong by observing that the denominator degrees of freedom for the F statistic (i.e., corresponding to MS_error) would be greater than the number of subjects (see Online Appendix for further discussion and demonstration).

Although Model (2) is clearly preferable to Model (1), it does not capture all the possible by-subject dependencies in the sample; experience also tells us that subjects often vary not only in their overall response latencies but also in the nature of their response to word type. In the present hypothetical case, Subject 3 shows a total effect of 134 ms, which is 94 ms larger than the average effect in the population of 40 ms. We have multiple observations per combination of subject and word type, so this variability in the population will also create clustering in the sample. The S_0s do not capture this variability because they only allow subjects to vary around β0. What we need in addition are random slopes to allow subjects to vary with respect to β1, our treatment effect. To account for this variation, we introduce a random slope term S_1s with variance τ²11, yielding

$$Y_{si} = \beta_0 + S_{0s} + (\beta_1 + S_{1s}) X_i + e_{si}, \qquad (S_{0s}, S_{1s}) \sim N\!\left(0, \begin{bmatrix} \tau_{00}^2 & \rho\,\tau_{00}\tau_{11} \\ \rho\,\tau_{00}\tau_{11} & \tau_{11}^2 \end{bmatrix}\right), \qquad e_{si} \sim N(0, \sigma^2) \qquad (3)$$

This is now a mixed-effects model with by-subject random intercepts and random slopes. Note that the inclusion of the by-subject random slope causes the predictions for condition B to shift by a fixed amount for each subject (Fig. 1c), improving predictions for words of type B. The slope offset S_1s captures how much Subject s's effect deviates from the population treatment effect β1. Again, we do not want our analysis to commit to particular S_1s effects, and so, rather than estimating these values directly, we estimate τ²11, the by-subject variance in treatment effect. But note that now we have two random effects for each subject s, and these two effects can exhibit a correlation (expressed by ρ). For example, subjects who do not read carefully might not only respond faster than the typical subject (and have a negative S_0s), but might also show less sensitivity to the word type manipulation (and have a more positive S_1s). Indeed, such a negative correlation, where we would have ρ < 0, is suggested in our hypothetical data (Fig. 1): S1 and S3 are slow responders who show clear treatment effects, whereas S2 and S4 are fast responders who are hardly susceptible to the word type manipulation. In the most general case, we should not treat these effects as coming from independent univariate distributions, but instead should treat S_0s and S_1s as being jointly drawn from a
bivariate distribution. As seen in Eq. (3), we follow common assumptions in taking this distribution as bivariate normal with a mean of (0, 0) and three free parameters: τ²00 (random intercept variance), τ²11 (random slope variance), and ρτ00τ11 (the intercept/slope covariance). For further information about random effect variance–covariance structures, see Baayen (2004, 2008), Gelman and Hill (2007), Goldstein (1995), Raudenbush and Bryk (2002), and Snijders and Bosker (1999a).

Both Models (2) and (3) are instances of what is traditionally analyzed using "mixed-model ANOVA" (e.g., Scheffé, 1959, chap. 8). By long-standing convention in our field, however, the classic "by-subjects ANOVA" (and analogously "by-items ANOVA" when items are treated as the random effect) is understood to mean Model (3), the relevant F-statistic for which is F1 = MS_T / MS_TxS, where MS_T is the treatment mean square and MS_TxS is the treatment-by-subject mean square. This convention presumably derives from the widespread recognition that subjects (and items) usually do vary idiosyncratically not only in their global mean responses but also in their sensitivity to the experimental treatment. Moreover, this variance, unlike random intercept variance, can drive differences between condition means. This can be seen by comparing the contributions of random intercepts versus random slopes across panels (b) and (c) in Fig. 1. Therefore, it would seem to be important to control for such variation when testing for a treatment effect.[3]

Although Model (3) accounts for all by-subject random variation, it still has a critical defect. As Clark (1973) noted, the repetition of words across observations is a source of non-independence not accounted for, which would impair generalization of our results to new items. We need to incorporate item variability into the model with the random effects I_0i, yielding

$$Y_{si} = \beta_0 + S_{0s} + I_{0i} + (\beta_1 + S_{1s}) X_i + e_{si}, \qquad (S_{0s}, S_{1s}) \sim N\!\left(0, \begin{bmatrix} \tau_{00}^2 & \rho\,\tau_{00}\tau_{11} \\ \rho\,\tau_{00}\tau_{11} & \tau_{11}^2 \end{bmatrix}\right), \qquad I_{0i} \sim N(0, \omega_{00}^2), \qquad e_{si} \sim N(0, \sigma^2) \qquad (4)$$

This is a mixed-effects model with by-subject random intercepts and slopes and by-item random intercepts. Rather than committing to specific I_0i values, we assume that the I_0i effects are drawn from a normal distribution with a mean of zero and variance ω²00. We also assume that ω²00 is independent from the τ parameters defining the by-subject variance components. Note that the inclusion of by-item random intercepts improves the predictions from the model, with predictions for a given item shifting by a consistent amount across all subjects (Fig. 1d). It is also worth noting that the by-item variance is confounded with our effect of interest, since we have different items in the different conditions, and thus will tend to contribute to any difference we observe between the two condition means.

This analysis has a direct analogue to min-F′, which tests MS_T against a denominator term consisting of the sum of MS_TxS and MS_I, the mean squares for the random treatment-by-subject interaction and the random main effect of items. It is, however, different from the practice of performing F1 × F2 and rejecting the null hypothesis if p < .05 for both Fs. The reason is that MS_T (the numerator for both F1 and F2) reflects not only the treatment effect, but also treatment-by-subject variability (τ²11) as well as by-item variability (ω²00). The denominator of F1 controls for treatment-by-subject variability but not item variability; similarly, the denominator of F2 controls for item variability but not treatment-by-subject variability. Thus, finding that F1 is significant implies that β1 ≠ 0 or ω²00 ≠ 0, or both; likewise, finding that F2 is significant implies that β1 ≠ 0 or τ²11 ≠ 0, or both. Since ω²00 and τ²11 can be nonzero while β1 = 0, F1 × F2 tests will inflate the Type I error rate (Clark, 1973). Thus, in terms of controlling Type I error rate, the mixed-effects modeling approach and the min-F′ approach are, at least theoretically, superior to separate by-subject and by-item tests.

At this point, we might wish to go further and consider other models. For instance, we have considered a by-subject random slope; for consistency, why not also consider a model with a by-item random slope, I_1i? A little reflection reveals that a by-item random slope does not make sense for this design. Words are nested within word types—no word can be both type A and type B—so it is not sensible to ask whether words vary in their sensitivity to word type. No sample from this experiment could possibly give us the information needed to estimate random slope variance and random slope/intercept covariance parameters for such a model. A model like this is unidentifiable for the data it is applied to: there are (infinitely) many different values we could choose for its parameters which would describe the data equally well.[4] Experiments with a within-item manipulation, such as a priming experiment in which target words are held constant across conditions but the prime word is varied, would call for by-item random slopes, but not the current experiment.

The above point also extends to designs where one independent variable is manipulated within- and another variable between-subjects (respectively, items). In the case of between-subject manipulations, the levels of the subject variable are nested within the levels of the experimental treatment variable (i.e., each subject belongs to one and only one of the experimental treatment groups), meaning that subject and treatment cannot interact—a model with a by-subject random slope term would be unidentifiable. In general, within-unit treatments require both the by-unit

[3] Note that in practice, most researchers do not compute MS_TxS on the raw data but rather aggregate their data first so that there is one observation per subject per cell, and then perform an ANOVA (or t-test) on the cell means. This aggregation confounds the random slope variance with residual error and reduces the error degrees of freedom, making it
4
possible to perform a repeated-measures ANOVA. This is an alternative way Technically, by-item random slopes for a between-item design can be
of meeting the assumption of conditional independence, but the aggrega- used to capture heteroscedasticity across conditions, but this is typically a
tion precludes simultaneous generalization over subjects and items (see minor concern in comparison with the issues focused on in this paper (see,
online appendix for further details). e.g., discussion in Gelman & Hill, 2007).
intercepts and slopes in the random effects specification, whereas between-unit treatments require only the by-unit random intercepts.

It is important to note that identifiability is a property of the model given a certain dataset. The model with by-item random slopes is unidentifiable for any possible dataset because it tries to model a source of variation that could not logically exist in the population. However, there are also situations where a model is unidentifiable because there is insufficient data to estimate its parameters. For instance, we might decide it was important to try to estimate variability corresponding to the different ways that subjects might respond to a given word (a subject-by-item random intercept). But to form a cluster in the sample, it is necessary to have more than one observation for a given unit; otherwise, the clustering effect cannot be distinguished from residual error.5 If we only elicit one observation per subject/item combination, we are unable to estimate this source of variability, and the model containing this random effect becomes unidentifiable. Had we used a design with repeated exposures to the same items for a given subject, the same model would be identifiable, and in fact we would need to include that term to avoid violating the conditional independence of our observations given subject and item effects.

5 It can also be difficult to estimate random effects when some of the sampling units (subjects or items) provide few observations in particular cells of the design. See Section 'General Discussion' and Jaeger, Graff, Croft, and Pontillo (2011, Section 3.3) for further discussion of this issue.

This discussion indicates that Model (4) has the maximal random effects structure justified by our experimental design, and we henceforth refer to such models as maximal models. A maximal model should optimize generalization of the findings to new subjects and new items. Models that lack random effects contained in the maximal model, such as Models (1)–(3), are likely to be misspecified—the model specification may not be expressive enough to include the true generative process underlying the data. This type of misspecification is problematic because conditional independence between observations within a given cluster is not achieved. Each source of random variation that is not accounted for will tend to work against us in one of two different ways. On the one hand, unaccounted-for variation that is orthogonal to our effect of interest (e.g., random intercept variation) will tend to reduce power for our tests of that effect; on the other, unaccounted-for variation that is confounded with our effect of interest (e.g., random slope variation) can drive differences between means, and thus will tend to increase the risk of Type I error.

A related model that we have not yet considered but that has become popular in recent practice includes only by-subject and by-item random intercepts:

Ysi = β0 + S0s + I0i + β1Xi + esi,
S0s ~ N(0, τ²00),   (5)
I0i ~ N(0, ω²00),
esi ~ N(0, σ²).

Unlike the other models we have considered up to this point, there is no clear ANOVA analog to a random-intercepts-only LMEM; it is perhaps akin to a modified min-F′ statistic with a denominator error term including MS_I but with MS_T×S replaced by the error term from Model (2) (i.e., with SS_error reduced by SS_subjects). But it would seem inappropriate to use this as a test statistic, given that the numerator MS_T increases as a function not only of the overall treatment effect, but also as a function of random slope variation (τ²11), and the denominator does not control for this variation.

A common misconception is that crossing subjects and items in the intercept term of LMEMs is sufficient for meeting the assumption of conditional independence, and that including random slopes is strictly unnecessary unless it is of theoretical interest to estimate that variability (see e.g., Janssen, 2012; Locker et al., 2007). However, this is problematic given the fact that, as already noted, random slope variation can drive differences between condition means, thus creating a spurious impression of a treatment effect where none might exist. Indeed, some researchers have already warned against using random-intercepts-only models when random slope variation is present (e.g., Baayen, 2008; Jaeger, 2011a; Roland, 2009; Schielzeth & Forstmeier, 2009). However, the performance of these models has not yet been evaluated in the context of simultaneous generalization over subjects and items. Our simulations will provide such an evaluation.

Although the maximal model best captures all the dependencies in the sample, sometimes it becomes necessary for practical reasons to simplify the random effects structure. Fitting LMEMs typically involves maximum likelihood estimation, where an iterative procedure is used to come up with the "best" estimates for the parameters given the data. As the name suggests, it attempts to maximize the likelihood of the data given the structure of the model. Sometimes, however, the estimation procedure will fail to "converge" (i.e., to find a solution) within a reasonable number of iterations. The likelihood of this convergence failure tends to increase with the complexity of the model, especially the random effects structure.

Ideally, simplification of the random effects structure should be done in a principled way. Dropping a random slope is not the only solution, nor is it likely to be the best, given that random slopes tend to account for variance confounded with the fixed effects of theoretical interest. We thus consider two additional mixed-effects models with simplified random effects structure.6 The first of these is almost identical to the maximal model (Model (4)) but without any correlation parameter:

Ysi = β0 + S0s + I0i + (β1 + S1s)Xi + esi,
(S0s, S1s) ~ N(0, [[τ²00, 0], [0, τ²11]]),   (6)
I0i ~ N(0, ω²00),
esi ~ N(0, σ²).

6 Unlike the other models we have considered up to this point, the performance of these two additional models (Models (6) and (7)) will depend to some extent on how the predictor variable X is coded (e.g., treatment or deviation coding, with performance generally better for the latter; see Appendix for further discussion).
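To make the sampling assumptions concrete, the generative process shared by Models (4) and (6) can be sketched in a few lines of stdlib-only Python (the paper's own simulation code is written in R; all names below are our illustrative choices, and setting rho = 0 yields the no-correlation Model (6)):

```python
import math
import random

def simulate_model4(beta0, beta1, tau00, tau11, rho, omega00, sigma2,
                    n_subj=24, n_items=12, seed=0):
    """Draw one dataset from the generative process of Model (4):
    by-subject random intercepts/slopes plus by-item random intercepts.
    The variance arguments are the tau^2 / omega^2 / sigma^2 parameters
    above; rho is the intercept-slope correlation (0 gives Model (6))."""
    rng = random.Random(seed)
    # Correlated (S0s, S1s) pairs via the usual Cholesky construction.
    subjects = []
    for _ in range(n_subj):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        s0 = math.sqrt(tau00) * z1
        s1 = math.sqrt(tau11) * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        subjects.append((s0, s1))
    item_intercepts = [math.sqrt(omega00) * rng.gauss(0, 1)
                       for _ in range(n_items)]
    # Between-item design with treatment-coded X, as in the tutorial models:
    # the first half of the items are condition A (X=0), the rest B (X=1).
    x = [0 if i < n_items // 2 else 1 for i in range(n_items)]
    data = []
    for s, (s0, s1) in enumerate(subjects):
        for i in range(n_items):
            e = math.sqrt(sigma2) * rng.gauss(0, 1)
            y = beta0 + s0 + item_intercepts[i] + (beta1 + s1) * x[i] + e
            data.append((s, i, x[i], y))
    return data
```

Setting every variance parameter to zero collapses the response onto the fixed part β0 + β1·Xi, a convenient check that the deterministic skeleton of the model is wired up correctly.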
Note that the only difference from Model (4) is in the specification of the distribution of (S0s, S1s) pairs. Model (6) is more restrictive than Model (4) in not allowing correlation between the random slope and random intercept; if, for example, subjects with overall faster reaction times also tended to be less sensitive to the experimental manipulation (as in our motivating example for random slopes), Model (6) could not capture that aspect of the data. But it does account for the critical random variances that are confounded with the effect of interest, τ²11 and ω²00.

The next and final model to consider is one that has received almost no discussion in the literature but is nonetheless logically possible: a maximal model that is simplified by removing random intercepts for any within-unit (subject or item) factor. For the current design, this means removing the by-subject random intercept:

Ysi = β0 + I0i + (β1 + S1s)Xi + esi,
S1s ~ N(0, τ²11),   (7)
I0i ~ N(0, ω²00),
esi ~ N(0, σ²).

This model, like random-intercepts-only and no-correlation models, would almost certainly be misspecified for typical psycholinguistic data. However, like the previous model, and unlike the random-intercepts-only model, it captures all the sources of random variation that are confounded with the effect of main theoretical interest.

The mixed-effects models considered in this section are presented in Table 1. We give their expression in the syntax of lmer (Bates, Maechler, & Bolker, 2011), a widely used mixed-effects fitting method for R (R Development Core Team, 2011). To summarize, when specifying random effects, one must be guided by (1) the sources of clustering that exist in the target subject and item populations, and (2) whether this clustering in the population will also exist in the sample. The general principle is that a by-subject (or by-item) random intercept is needed whenever there is more than one observation per subject (or item or subject-item combination), and a random slope is needed for any effect where there is more than one observation for each unique combination of subject and treatment level (or item and treatment level, or subject-item combination and treatment level). Models are unidentifiable when they include random effects that are logically impossible or that cannot be estimated from the data in principle. Models are misspecified when they fail to include random effects that create dependencies in the sample. Subject- or item-related variance that is not accounted for in the sample can work against generalizability in two ways, depending on whether or not it is independent of the hypothesis-critical fixed effect. In the typical case in which fixed-effect slopes are of interest, models without random intercepts will have reduced power, while models without random slopes will exhibit an increased Type I error rate. This suggests that LMEMs with maximal random effects structure have the best potential to produce generalizable results. Although this section has only dealt with a simple single-factor design, these principles extend in a straightforward manner to higher-order designs, which we consider further in Section 'General Discussion'.

Table 1
Summary of models considered and associated lmer syntax.

No.   Model                                       lmer model syntax
(1)   Ysi = β0 + β1Xi + esi                       n/a (not a mixed-effects model)
(2)   Ysi = β0 + S0s + β1Xi + esi                 Y ~ X + (1 | Subject)
(3)   Ysi = β0 + S0s + (β1 + S1s)Xi + esi         Y ~ X + (1 + X | Subject)
(4)   Ysi = β0 + S0s + I0i + (β1 + S1s)Xi + esi   Y ~ X + (1 + X | Subject) + (1 | Item)
(5)   Ysi = β0 + S0s + I0i + β1Xi + esi           Y ~ X + (1 | Subject) + (1 | Item)
(6)a  As (4), but S0s, S1s independent            Y ~ X + (1 | Subject) + (0 + X | Subject) + (1 | Item)
(7)a  Ysi = β0 + I0i + (β1 + S1s)Xi + esi         Y ~ X + (0 + X | Subject) + (1 | Item)

a Performance is sensitive to the coding scheme for variable X (see Online Appendix).

Design-driven versus data-driven random effects specification

As the last section makes evident, in psycholinguistics and related areas, the specification of the structure of random variation is traditionally driven by the experimental design. In contrast to this traditional design-driven approach, a data-driven approach has gained prominence along with the recent introduction of LMEMs. The basic idea behind this approach is to let the data "speak for themselves" as to whether certain random effects should be included in the model or not. That is, on the same data set, one compares the fit of a model with and without certain random effect terms (e.g. Model (4) versus Model (5) in the previous section) using goodness-of-fit criteria that take into account both the accuracy of the model to the data and its complexity. Here, accuracy refers to how much variance is explained by the model and complexity to how many predictors (or parameters) are included in the model. The goal is to find a structure that strikes a compromise between accuracy and complexity, and to use this resulting structure for carrying out hypothesis tests on the fixed effects of interest.

Although LMEMs offer more flexibility in testing random effects, data-driven approaches to random effects structure have long been possible within mixed-model ANOVA (see the online appendix). For example, Clark (1973) considers a suggestion by Winer (1971) that one could test the significance of the treatment-by-subjects interaction at some liberal alpha level (e.g., .25), and, if it is not found to be significant, use the F2 statistic to test one's hypothesis instead of a quasi-F statistic (Clark, 1973, p. 339). In LMEM terms, this is similar to using model comparison to test whether or not to include the by-subject random slope (albeit with LMEMs, this could be done while simultaneously controlling for item variance). But Clark rejected such an approach, finding it unnecessarily risky (see e.g., Clark, 1973, p. 339). Whether they shared Clark's pessimism or not, researchers who have used ANOVA on experimental data have, with rare exception, followed a design-driven rather than a data-driven approach to specifying random effects.
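In LMEM practice, such model comparisons are usually carried out as likelihood-ratio tests between nested fits (in R, for example, by comparing two lmer models with anova()). The test itself is simple enough to sketch in stdlib-only Python, assuming the two maximized log-likelihoods are already in hand; the helper names are ours, and note that with variance parameters constrained to be non-negative the nominal χ² reference distribution tends to be conservative at the boundary:

```python
import math

def chisq_sf(x, df):
    """Upper-tail probability of a chi-square variate, for df in {1, 2}
    (enough for comparisons differing by one variance parameter, or by
    a variance plus a correlation); use a stats library for other df."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    raise ValueError("this sketch only supports df = 1 or 2")

def lr_test(loglik_reduced, loglik_full, df):
    """Likelihood-ratio test for nested models, e.g. an LMEM with vs.
    without a by-subject random slope; returns (statistic, p-value)."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    return stat, chisq_sf(stat, df)
```

For instance, if adding a by-subject random slope and its correlation (two extra parameters) raises the log-likelihood from −1050.0 to −1047.2, lr_test(-1050.0, -1047.2, df=2) gives χ² = 5.6 and p ≈ .06, so the slope would be dropped at a model-selection α of .05 but retained at more liberal levels.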
We believe that researchers using ANOVA have been correct to follow a design-driven approach. Moreover, we believe that a design-driven approach is equally preferable to a data-driven approach for confirmatory analyses using LMEMs. In confirmatory analyses, random effect variance is generally considered a "nuisance" variable rather than a variable of interest; one does not eliminate these variables just because they do not "improve model fit." As stated by Ben Bolker (one of the developers of lme4), "If random effects are part of the experimental design, and if the numerical estimation algorithms do not break down, then one can choose to retain all random effects when estimating and analyzing the fixed effects" (Bolker et al., 2009, p. 134). The random effects are crucial for encoding measurement-dependencies in the design. Put bluntly, if an experimental treatment is manipulated within subjects (with multiple observations per subject-by-condition cell), then there is no way for the analysis procedure to "know" about this unless the fixed effect of that treatment is accompanied with a by-subject random slope in the analysis model. Also, it is important to bear in mind that experimental designs are usually optimized for the detection of fixed effects, and not for the detection of random effects. Data-driven techniques will therefore not only (correctly) reject random effects that do not exist, but also (incorrectly) reject random effects for which there is just insufficient power. This problem is exacerbated for small datasets, since detecting random effects is harder the fewer clusters and observations-per-cluster are present.

A further consideration is that there are no existing criteria to guide researchers in the data-driven determination of random effects structure. This is unsatisfactory because the approach requires many decisions: What α-level should be used? Should α be corrected for the number of random effects being tested? Should one test random effects following a forward or backward algorithm, and how should the tests be ordered? Should intercepts be tested as well as slopes, or left in the model by default? The number of possible random effects structures, and thus the number of decisions to be made, increases with the complexity of the design. As we will show, the particular decision criteria that are used will ultimately affect the generalizability of the test. The absence of any accepted criteria allows researchers to make unprincipled (and possibly self-serving) choices. To be sure, it may be possible to obtain reasonable results using a data-driven approach, if one adheres to conservative criteria. However, even when the modeling criteria are explicitly reported, it is a non-trivial problem to quantify potential increases in anti-conservativity that the procedure has introduced (see Harrell, 2001, chap. 4).

But even if one agrees that, in principle, a design-driven approach is more appropriate than a data-driven approach for confirmatory hypothesis testing, there might be concern that using LMEMs with maximal random effects structure is a recipe for low power, by analogy with min-F′, an earlier solution to the problem of simultaneous generalization. The min-F′ statistic has indeed been shown to be conservative under some circumstances (Forster & Dickinson, 1976), and it is perhaps for this reason that it has not been broadly adopted as a solution to the problem of simultaneous generalization. If maximal LMEMs also turn out to have low power, then perhaps this would justify the extra Type I error risk associated with data-driven approaches. However, the assumption that maximal models are overly conservative should not be taken as a foregone conclusion. Although maximal models are similar in spirit to min-F′, there are radical differences between the estimation procedures for min-F′ and maximal LMEMs. Min-F′ is composed of two separately calculated F statistics, and the by-subjects F does not control for the by-item noise, nor does the by-items F control for the by-subjects noise. In contrast, with maximal LMEMs by-subject and by-item variance is taken into account simultaneously, yielding greater prospects for being a more sensitive test.

Finally, we believe it is important to distinguish between model selection for the purpose of data exploration on the one hand and model selection for the purpose of determining random effects structures (in confirmatory contexts) on the other; we are skeptical about the latter, but do not intend to pass any judgement on the former.

Modeling of random effects in the current psycholinguistic literature

The introduction of LMEMs and their early application to psycholinguistic data by Baayen et al. (2008) has had a major influence on analysis techniques used in peer-reviewed publications. At the time of writing (October 2012), Google Scholar reports 1004 citations to Baayen, Davidson and Bates. In an informal survey of the 150 research articles published in the Journal of Memory and Language since Baayen et al. (from volume 59 issue 4 to volume 64 issue 3) we found that 20 (13%) reported analyses using an LMEM of some kind. However, these papers differ substantially in both the type of models used and the information reported about them. In particular, researchers differed in whether they included random slopes or only random intercepts in their models. Of the 20 JML articles identified, six gave no information about the random effects structure, and a further six specified that they used random intercepts only, without theoretical or empirical justification. A further five papers employed model selection, four forward and only one backward (testing for the inclusion of random effects, but not fixed effects). The final three papers employed a maximal random effects structure including intercept and slope terms where appropriate.

This survey highlights two important points. First, there appears to be no standard for reporting the modeling procedure, and authors vary dramatically in the amount of detail they provide. Second, at least 30% of the papers, and perhaps as many as 60%, do not include random slopes; i.e., they tacitly assume that individual subjects and items are affected by the experimental manipulations in exactly the same way. This is in spite of the recommendations of various experts in peer-reviewed papers and books (Baayen, 2008; Baayen et al., 2008) as well as in the informal literature (Jaeger, 2009, 2011b). Furthermore, none of the LMEM articles in the JML special issue (Baayen et al., 2008; Barr, 2008; Dixon, 2008; Jaeger, 2008; Mirman, Dixon, & Magnuson, 2008; Quené & van den Bergh, 2008) set a
bad example of using random-intercept-only models. As discussed earlier, the use of random-intercept-only models is a departure even from the standard use of ANOVA in psycholinguistics.

The present study

How do current uses of LMEMs compare to more traditional methods such as min-F′ and F1 × F2? The next section of this paper tests a wide variety of commonly used analysis methods for datasets typically collected in psycholinguistic experiments, both in terms of whether resulting significance levels can be trusted—i.e., whether the Type I error rate for a given approach in a given situation is conservative (less than α), nominal (equal to α), or anticonservative (greater than α)—and the power of each method in detecting effects that are actually present in the populations.

Ideally, we would compare the different analysis techniques by applying them to a large selection of real data sets. Unfortunately, in real experiments the true generative process behind the data is unknown, meaning that we cannot tell whether effects in the population exist—or how big those effects are—without relying on one of the analysis techniques we actually want to evaluate. Moreover, even if we knew which effects were real, we would need far more datasets than are readily available to reliably estimate the nominality and power of a given method.

We therefore take an alternative approach of using Monte Carlo methods to generate data from simulated experiments. This allows us to specify the underlying sampling distributions per simulation, and thus to have veridical knowledge of the presence or absence of an effect of interest, as well as all other properties of the experiment (number of subjects, items and trials, and the amount of variability introduced at each level). Such a Monte Carlo procedure is standard for this type of problem (e.g., Baayen et al., 2008; Davenport & Webster, 1973; Forster & Dickinson, 1976; Quené & van den Bergh, 2004; Santa, Miller, & Shaw, 1979; Schielzeth & Forstmeier, 2009; Wickens & Keppel, 1983), and guarantees that as the number of samples increases, the obtained p-value distribution becomes arbitrarily close to the true p-value distribution for datasets generated by the sampling model.

The simulations assume an "ideal-world scenario" in which all the distributional assumptions of the model class (in particular, normal distribution of random effects and trial-level error, and homoscedasticity of trial-level error and between-items random intercept variance) are satisfied. Although the approach leaves open for future research many difficult questions regarding departures of realistic psycholinguistic data from these assumptions, it allows us great flexibility in analyzing the behavior of each analytic method as the population and experimental design vary. We hence proceed to the systematic investigation of traditional ANOVA, min-F′, and several types of LMEMs as datasets vary in many crucial respects, including between- versus within-items designs, different numbers of items, and different random-effect sizes and covariances.

Method

Generating simulated data

For simplicity, all datasets included a continuous response variable and had only a single two-level treatment factor, which was always within subjects, and either within or between items. When it was within, each "subject" was assigned to one of two counterbalancing "presentation" lists, with half of the subjects assigned to each list. We assumed no list effect; that is, the particular configuration of "items" within a list did not have any unique effect over and above the item effects for that list. We also assumed no order effects, nor any effects of practice or fatigue. All experiments had 24 subjects, but we ran simulations with either 12 or 24 items to explore the effect of the number of random-effect clusters on fixed-effects inference.7

7 Having only six items per condition, such as in the 12-item case, is not uncommon in psycholinguistic research, where it is often difficult to come up with larger numbers of suitably controlled items.

Within-item data sets were generated from the following sampling model:

Ysi = β0 + S0s + I0i + (β1 + S1s + I1i)Xsi + esi

with all variables defined as in the tutorial section above, except that we used deviation coding for Xsi (−.5, .5) rather than treatment coding. Random effects S0s and S1s were drawn from a bivariate normal distribution with means μS = ⟨0, 0⟩ and variance–covariance matrix T = [[τ²00, ρS·τ00·τ11], [ρS·τ00·τ11, τ²11]]. Likewise, I0i and I1i were also drawn from a separate bivariate normal distribution with μI = ⟨0, 0⟩ and variance–covariance matrix Ω = [[ω²00, ρI·ω00·ω11], [ρI·ω00·ω11, ω²11]]. The residual errors esi were drawn from a normal distribution with a mean of 0 and variance σ². For between-item designs, the I1i effects (by-item random slopes) were simply ignored and thus did not contribute to the response variable.

We investigated the performance of various analyses over a range of population parameter values (Table 2). To generate each simulated dataset, we first determined the population parameters β0, τ²00, τ²11, ρS, ω²00, ω²11, ρI, and σ² by sampling from uniform distributions with ranges given in Table 2. We then simulated 24 subjects and 12 or 24 items from the corresponding populations, and simulated one observation for each subject/item pair. We also assumed missing data, with up to 5% of observations in a given data set counted as missing (at random). This setting was assumed to reflect normal rates of data loss (due to experimenter error, technical issues, extreme responses, etc.). The Online Appendix presents results for scenarios in which data loss was more substantial and nonhomogeneous.

For tests of Type I error, β1 (the fixed effect of interest) was set to zero. For tests of power, β1 was set to .8, which we found yielded power around 0.5 for the most powerful methods with close-to-nominal Type I error.

We generated 100,000 datasets for testing for each of the eight combinations (effect present/absent, between-/
Table 2 Model selection analyses


Ranges for the population parameters; U(min, max) means the parameter We also considered a wide variety of LMEMs whose
was sampled from a uniform distribution with range [min, max].
random effects structure—specifically, which slopes to in-
Parameter Description Value clude—was determined through model selection. We also
b0 Grand-average intercept U(3, 3) varied the model-selection a level, i.e., the level at which
b1 Grand-average slope 0 (H0 true) or .8 (H1 slopes were tested for inclusion or exclusion, taking on
true) the values .01 and .05 as well as values from .10 to .80 in
s200 By-subject variance of S0s U(0, 3)
steps of .10.
s211 By-subject variance of S1s U(0, 3)
Our model selection techniques tested only random
qS Correlation between (S0s, S1s) U(.8, .8)
pairs slopes for inclusion/exclusion, leaving in by default the
x200 By-item variance of I0i U(0, 3) by-subject and by-item random intercepts (since that
x211 By-item variance of I1i U(0, 3) seems to be the general practice in the literature). For be-
qI Correlation between (I0i, I1i) U(.8, .8) tween-items designs, there was only one slope to be tested
pairs (by-subjects) and thus only one possible model selection
r2 Residual error U(0, 3)
algorithm. In contrast, for within-items designs, where
pmissing Proportion of missing U(.00, .05)
observations there are two slopes to be tested, a large variety of algo-
rithms are possible. We explored the model selection algo-
rithms given in Table 4, which were defined by the
direction of model selection (forward or backward) and
within-item manipulation, 12/24 items). The functions we whether slopes were tested in an arbitrary or principled
used in running the simulations and processing the results are available in the R package simgen, which we have made available in the Online Appendix, along with a number of R scripts using the package. The Online Appendix also contains further information about the additional R packages and functions used for simulating the data and running the analyses.

Analyses

The analyses that we evaluated are summarized in Table 3. Three of these were based on ANOVA (F1, min-F′, and F1 × F2), with individual F-values drawn from mixed-model ANOVA on the unaggregated data (e.g., using MS_treatment/MS_treatment-by-subject) rather than from performing repeated-measures ANOVAs on the (subject and item) means. The analyses also included LMEMs with a variety of random effects structures and test statistics. All LMEMs were fit using the lmer function of the R package lme4, version 0.999375-39 (Bates et al., 2011), using maximum likelihood estimation.

There were four kinds of models with predetermined random effects structures: models with random intercepts but no random slopes, models with within-unit random slopes but no within-unit random intercepts, models with no random correlations (i.e., independent slopes and intercepts), and maximal models.

Table 3
Analyses performed on simulated datasets.

  Analysis                                     Test statistics
  min-F′                                       min-F′
  F1                                           F1
  F1 × F2                                      F1, F2
  Maximal LMEM                                 t, χ²_LR
  LMEM, random intercepts only                 t, χ²_LR, MCMC
  LMEM, no within-unit intercepts (NWI)        t, χ²_LR
  Maximal LMEM, no random correlations (NRC)   t, χ²_LR, MCMC
  Model selection (multiple variants)          t, χ²_LR

Finally, we considered LMEMs in which the inclusion of random slopes was determined by model selection (Table 4). For within-items designs, the algorithms differed in direction (forward vs. backward) as well as in the slope-testing sequence. The forward algorithms began with a random-intercepts-only model and tested the two possible slopes for inclusion in an arbitrary, pre-defined sequence (either by-subjects slope first and by-items slope second, or vice versa; these models are henceforth denoted by "FS" and "FI"). If the p-value from the first test exceeded the model-selection α level for inclusion, the slope was left out of the model and the second slope was never tested; otherwise, the slope was included and the second slope was tested. The backward algorithm ("BS" and "BI" models) was similar, except that it began with a maximal model and tested for the exclusion of slopes rather than for their inclusion.

For these same within-items designs, we also considered forward and backward algorithms in which the sequence of slope testing was principled rather than arbitrary; we call these the "best-path" algorithms because they choose each step through the model space based on which addition or removal of a predictor leads to the best next model. For the forward version, both slopes were tested for inclusion independently against a random-intercepts-only model. If neither test fell below the model-selection α level, then the random-intercepts-only model was retained. Otherwise, the slope with the strongest evidence for inclusion (lowest p-value) was included in the model, and then the second slope was tested for inclusion against this model. The backward best-path algorithm was the same, except that it began with a maximal model and tested slopes for exclusion rather than for inclusion.

Table 4
Model selection algorithms for within-items designs.

  Model  Direction  Order
  FS     Forward    By-subjects slope then by-items
  FI     Forward    By-items slope then by-subjects
  FB     Forward    "Best path" algorithm
  BS     Backward   By-subjects slope then by-items
  BI     Backward   By-items slope then by-subjects
  BB     Backward   "Best path" algorithm
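The forward procedures just described can be sketched in Python (not the authors' R code). The function `slope_pvalue(kept, candidate)` is a hypothetical stand-in for refitting the LMEM with the random slopes in `kept` and returning the p-value for adding `candidate`; the backward algorithms mirror these, starting from the maximal model and testing slopes for exclusion.

```python
def forward_arbitrary(slope_pvalue, order=("subject", "item"), alpha=0.05):
    """FS/FI: test slopes in a fixed, arbitrary order; stop at the first
    failed test, so the remaining slope is never tested."""
    kept = []
    for slope in order:
        if slope_pvalue(kept, slope) < alpha:
            kept.append(slope)
        else:
            break
    return kept

def forward_best_path(slope_pvalue, slopes=("subject", "item"), alpha=0.05):
    """FB: at each step, test all remaining slopes and add the one with the
    strongest evidence for inclusion (lowest p-value)."""
    kept, remaining = [], list(slopes)
    while remaining:
        pvals = {s: slope_pvalue(kept, s) for s in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break
        kept.append(best)
        remaining.remove(best)
    return kept
```

With a stub in which the by-item slope is clearly needed (p = .01) but the by-subject slope is not (p = .2), the subject-first arbitrary-order algorithm stops immediately and never tests the by-item slope, whereas the best-path version recovers it; this illustrates why the testing sequence matters.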
(In principle, one can use best-path algorithms that allow both forwards and backwards moves, but the space of possible models considered here is so small that such an algorithm would be indistinguishable from the forwards- or backwards-only variants.)

Handling nonconvergence and deriving p-values

Nonconverging LMEMs were dealt with by progressively simplifying the random effects structure until convergence was reached. Data from these simpler models contributed to the performance metrics for the more complex models. For example, in testing maximal models, if a particular model did not converge and was simplified down to a random-intercepts-only model, the p-values from that model would contribute to the performance metrics for maximal models. This reflects the assumption that researchers who encountered nonconvergence would not just give up but would consider simpler models. In other words, we are evaluating analysis strategies rather than particular model structures.

In cases of nonconvergence, simplification of the random effects structure proceeded as follows. For between-items designs, the by-subjects random slope was dropped. For within-items designs, statistics from the partially converged model were inspected, and the slope associated with the smaller variance was dropped (see the Online Appendix for justification of this method). In the rare cases (0.002%) in which the random-intercepts-only model did not converge, the analysis was discarded.

There are various ways to obtain p-values from LMEMs, and to our knowledge, there is little agreement on which method to use. Therefore, we considered three methods currently in practice: (1) treating the t-statistic as if it were a z statistic (i.e., using the standard normal distribution as a reference); (2) performing likelihood ratio tests, in which the deviance (−2LL) of a model containing the fixed effect is compared to that of another model without it but that is otherwise identical in random effects structure; and (3) Markov Chain Monte Carlo (MCMC) sampling, using the mcmcsamp() function of lme4 with 10,000 iterations. This is the default number of iterations used in Baayen's pvals.fnc() of the languageR package (Baayen, 2011). This function wraps the function mcmcsamp(), and we used some of its code for processing the output of mcmcsamp(). Although MCMC sampling is the approach recommended by Baayen et al. (2008), it is not implemented in lme4 for models containing random correlation parameters. We therefore used (3) only for random-intercepts-only and no-random-correlation LMEMs.
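Methods (1) and (2) reduce to simple tail probabilities; a minimal sketch in Python (not the authors' R code), using only the standard library:

```python
import math

def p_from_t_as_z(t):
    """Method (1): two-tailed p-value treating the t-statistic as standard
    normal; 2*(1 - Phi(|t|)) equals erfc(|t| / sqrt(2))."""
    return math.erfc(abs(t) / math.sqrt(2.0))

def p_from_lr(deviance_reduced, deviance_full):
    """Method (2): likelihood-ratio test; the deviance difference between
    nested models is referred to chi-square with 1 df, whose survival
    function is erfc(sqrt(x / 2))."""
    lr = deviance_reduced - deviance_full
    return math.erfc(math.sqrt(max(lr, 0.0) / 2.0))
```

For example, p_from_t_as_z(1.96) gives approximately .05, as does a deviance drop of about 3.84, matching the familiar normal and chi-square(1) critical values.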
Performance metrics

The main performance metrics we considered were Type I error rate (the rate of rejection of H0 when it is true) and power (the rate of rejection of H0 when it is false). For all analyses, the α level for testing the fixed-effect slope (β1) was set to .05 (results were also obtained using α = .01 and α = .10, and were qualitatively similar; see the Online Appendix).

It can be misleading to directly compare the power of various approaches that differ in Type I error rate, because the power of anticonservative approaches will be inflated. Therefore, we also calculated Power′, a power rate corrected for anticonservativity. Power′ was derived from the empirical distribution of p-values from the simulation for a given method where the null hypothesis was true. If the p-value at the 5% quantile of this distribution was below the targeted α-level (e.g., .05), then this lower value was used as the cutoff for rejecting the null. To illustrate, note that a method that is nominal (neither anticonservative nor conservative) would yield an empirical distribution of p-values for which very nearly 5% of the simulations would obtain p-values less than .05. Now suppose that, for a given method with a targeted α-level of .05, 5% of the simulation runs under the null hypothesis yielded p-values of .0217 or lower. This clearly indicates that this method is anticonservative, since more than 5% of the simulation runs had p-values less than the targeted α-level of .05. We could correct for this anticonservativity in the power analysis by requiring that a p-value from a given simulation run, to be deemed statistically significant, must be less than .0217 instead of .05. In contrast, if for a given method 5% of the runs under the null hypothesis yielded a value of .0813 or lower, this method would be conservative, and it would be undesirable to 'correct' this as it would artificially make the test seem more powerful than it actually is. Instead, for this case we would simply require that the p-value for a given simulation run be lower than .05.

Because we made minimal assumptions about the relative magnitudes of random variances, it is also of interest to examine the performance of the various approaches as a function of the various parameters that define the space. Given the difficulty of visualizing a multidimensional parameter space, we chose to visually represent performance metrics in terms of two "critical variances", which were those variances that can drive differences between treatment means. As noted above, for between-item designs, this includes the by-item random intercept variance (ω₀₀²) and the by-subject random slope variance (τ₁₁²); for within-items designs, this includes the by-item and by-subject random slope variances (ω₁₁² and τ₁₁²). We modeled Type I error rate and power over these critical variances using local polynomial regression fitting (the loess function in R, which is wrapped by the loessPred function in our simgen package). The span parameter for loess fitting was set to .9; this highlights general trends throughout the parameter space, at the expense of fine-grained detail.

Results and discussion

An ideal statistical analysis method maximizes statistical power while keeping Type I error nominal (at the stated α level). Performance metrics in terms of Type I error, power, and corrected power are given in Table 5 for the between-item design and in Table 6 for the within-item design. The analyses in each table are (approximately) ranked in terms of Type I error, with analyses toward the top of the table showing the best performance, with priority given to better performance on the larger (24-item) dataset.

Only min-F′ was consistently at or below the stated α level. This is not entirely surprising, because the techniques that are available for deriving p-values from LMEMs are known to be somewhat anticonservative (Baayen et al., 2008). For maximal LMEMs, this anticonservativity was
quite minor, within 1–2% of α.⁸ LMEMs with maximal random slopes, but missing either random correlations or within-unit random intercepts, performed nearly as well as "fully" maximal LMEMs, with the exception of the case where p-values were determined by MCMC sampling. In addition, there was slight additional anticonservativity relative to the maximal model for the models missing within-unit random intercepts. This suggests that when maximal LMEMs fail to converge, dropping within-unit random intercepts and dropping random correlations are both viable options for simplifying the random effects structure. It is also worth noting that F1 × F2, which is known to be fundamentally biased (Clark, 1973; Forster & Dickinson, 1976), controlled overall Type I error rate fairly well, almost as well as maximal LMEMs. However, whereas anticonservativity for maximal (and near-maximal) LMEMs decreased as the data set got larger (from 12 to 24 items), for F1 × F2 it actually showed a slight increase.

⁸ This anticonservativity stems from underestimation of the variation between subjects and/or items, as is suggested by generally better performance of the maximal model in the 24- as opposed to 12-item simulations. In the Online Appendix, we show that as additional subjects and items are added, the Type I error rate for LMEMs with random slopes decreases rapidly, while for random-intercepts-only models, it actually increases.

F1 alone was the worst performing method for between-items designs, and also had an unacceptably high error rate for within-items designs. Random-intercepts-only LMEMs were also unacceptably anticonservative for both types of designs, far worse than F1 × F2. In fact, for within-items designs, random-intercepts-only LMEMs were even worse than F1 alone, showing false rejections 40–50% of the time at the .05 level, regardless of whether p-values were derived using the normal approximation to the t-statistic, the likelihood-ratio test, or MCMC sampling. In other words, for within-items designs, one can obtain better generalization by ignoring item variance altogether (F1) than by using an LMEM with only random intercepts for subjects and items.

Fig. 2 presents results from LMEMs where the inclusion of random slopes was determined by model selection. The figure presents the results for the within-items design, where a variety of algorithms were possible. Performance for the between-items design (where there was only a single slope to be tested) was very close to that of maximal LMEMs, and is presented in the Online Appendix.

The figure suggests that the Type I error rate depends more upon the algorithm followed in testing slopes than on the α-level used for the tests. Forward-stepping approaches that tested the two random slopes in an arbitrary sequence performed poorly in terms of Type I error even at relatively high α levels. This was especially the case for the smaller, 12-item sets, where there was less power to detect by-item random slope variance. In contrast, performance was relatively sound for backward models even at relatively low α levels, as well as for the "best path" models regardless of whether the direction was backward or forward. It is notable that these sounder data-driven approaches showed only small gains in power over maximal models (indicated by the dashed line in the background of the figure).

From the point of view of overall Type I error rate, we can rank the analyses for both within- and between-items designs in order of desirability:

1. min-F′, maximal LMEMs, "near-maximal" LMEMs missing within-unit random intercepts or random correlations, and model selection LMEMs using backward selection and/or testing slopes using the "best path" algorithm.
2. F1 × F2.
3. Forward-stepping LMEMs that test slopes in an arbitrary sequence.
4. F1 and random-intercepts-only LMEMs.

It would also seem natural to draw a line separating analyses that have an "acceptable" rate of false rejections (i.e., 1–2) from those with a rate that is intolerably high (i.e., 3–4). However, it is insufficient to consider only the overall Type I error rate, as there may be particular problem areas of the parameter space where even the best analyses perform poorly (such as when particular variance components are very small or large). If these areas are small, they will only moderately affect the overall error rate. This is a problem because we do not know where the actual populations that we study reside in this parameter space; it could be that they inhabit these problem areas. It is therefore also useful to examine the performance metrics as a function of the critical random variance parameters that are confounded with the treatment effect, i.e., τ₁₁² and ω₀₀² for between-items designs and τ₁₁² and ω₁₁² for within-items designs. These are given in the "heatmap" displays of Figs. 3–5.

Viewing Type I error rate as a function of the critical variances, it can be seen that of the models with predetermined random effects structures, only min-F′ maintained the Type I error rate consistently below the α-level throughout the parameter space (Figs. 3 and 4). Min-F′ became increasingly conservative as the relevant random effects got small, replicating Forster and Dickinson (1976). Maximal LMEMs showed no such increasing conservativity, performing well overall, especially for 24-item datasets. The near-maximal LMEMs also performed relatively well, though for within-item datasets, models missing within-unit intercepts became increasingly conservative as slope variances got small. Random-intercepts-only LMEMs degraded extremely rapidly as a function of random slope parameters; even at very low levels of random-slope variability, the Type I error rate was unacceptably high.

In terms of Type I error, the average performance of the widely adopted F1 × F2 approach is comparable to that of maximal LMEMs (slightly less anti-conservative for 12 items, slightly more anti-conservative for 24 items). But in spite of this apparent good performance, the heatmap visualization indicates that the approach is fundamentally unsound. Specifically, F1 × F2 shows both conservative and anticonservative tendencies: as slopes get small, it becomes increasingly conservative, similar to min-F′; as slopes get large, it becomes increasingly anticonservative, similar to random-intercepts-only LMEMs, though to a lesser degree. This increasing anticonservativity reflects the fact that (as noted in the introduction) subject random
Table 5
Performance metrics for between-items design. Power′ = corrected power. Note that corrected power for random-intercepts-only MCMC is not included because the low number of MCMC runs (10,000) combined with the high Type I error rate did not provide sufficient resolution.

                                                Type I        Power         Power′
  Nitems                                        12    24      12    24      12    24

  Type I error at or near α = .05
  min-F′                                        .044  .045    .210  .328    .210  .328
  LMEM, maximal, χ²_LR                          .070  .058    .267  .364    .223  .342
  LMEM, no random correlations, χ²_LR (a)       .069  .057    .267  .363    .223  .343
  LMEM, no within-unit intercepts, χ²_LR (a)    .081  .065    .288  .380    .223  .342
  LMEM, maximal, t                              .086  .065    .300  .382    .222  .343
  LMEM, no random correlations, t               .086  .064    .300  .382    .223  .343
  LMEM, no within-unit intercepts, t (a)        .100  .073    .323  .401    .222  .342
  F1 × F2                                       .063  .077    .252  .403    .224  .337

  Type I error far exceeding α = .05
  LMEM, random intercepts only, χ²_LR           .102  .111    .319  .449    .216  .314
  LMEM, random intercepts only, t               .128  .124    .360  .472    .216  .314
  LMEM, no random correlations, MCMC (a)        .172  .192    .426  .582    n/a   n/a
  LMEM, random intercepts only, MCMC            .173  .211    .428  .601    n/a   n/a
  F1                                            .421  .339    .671  .706    .134  .212

  (a) Performance is sensitive to coding of the predictor (see the Online Appendix); simulations use deviation coding.
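The Power′ values above follow the quantile-based correction described under Performance metrics; a minimal sketch in Python (not the authors' R code), where `null_ps` and `alt_ps` are hypothetical lists of p-values from runs with the effect absent (H0 true) and present, respectively:

```python
def corrected_power(null_ps, alt_ps, alpha=0.05):
    """Power corrected for anticonservativity (Power')."""
    null_sorted = sorted(null_ps)
    # empirical alpha-quantile of the p-value distribution under H0
    q = null_sorted[max(0, int(alpha * len(null_sorted)) - 1)]
    # correct anticonservativity only: never raise the cutoff above alpha
    cutoff = min(alpha, q)
    return sum(p < cutoff for p in alt_ps) / len(alt_ps)
```

An anticonservative method whose null 5% quantile is .0217 is held to that stricter cutoff, while a conservative method (quantile .0813) keeps the nominal .05 cutoff, mirroring the two worked examples in the text.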

Table 6
Performance metrics for within-items design. Note that corrected power for random-intercepts-only MCMC is not included because the low number of MCMC runs (10,000) combined with the high Type I error rate did not provide sufficient resolution.

                                                Type I        Power         Power′
  Nitems                                        12    24      12    24      12    24

  Type I error at or near α = .05
  min-F′                                        .027  .031    .327  .512    .327  .512
  LMEM, maximal, χ²_LR                          .059  .056    .460  .610    .433  .591
  LMEM, no random correlations, χ²_LR (a)       .059  .056    .461  .610    .432  .593
  LMEM, no within-unit intercepts, χ²_LR (a)    .056  .055    .437  .596    .416  .579
  LMEM, maximal, t                              .072  .063    .496  .629    .434  .592
  LMEM, no random correlations, t               .072  .062    .497  .629    .432  .593
  LMEM, no within-unit intercepts, t (a)        .070  .064    .477  .620    .416  .580
  F1 × F2                                       .057  .072    .440  .643    .416  .578

  Type I error far exceeding α = .05
  F1                                            .176  .139    .640  .724    .345  .506
  LMEM, no random correlations, MCMC (a)        .187  .198    .682  .812    n/a   n/a
  LMEM, random intercepts only, MCMC            .415  .483    .844  .933    n/a   n/a
  LMEM, random intercepts only, χ²_LR           .440  .498    .853  .935    .379  .531
  LMEM, random intercepts only, t               .441  .499    .854  .935    .379  .531

  (a) Performance is sensitive to coding of the predictor (see the Online Appendix); simulations use deviation coding.
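For reference, the model labels in Tables 5 and 6 correspond to lme4-style random effects specifications roughly as follows. This mapping is our illustration rather than code from the paper (shown for the within-items design, where both subjects and items receive slopes; Y is the response and C the deviation-coded treatment predictor):

```python
# Hypothetical mapping from model labels to lme4-style formula strings
# (our illustration, assuming a within-items design).
LMER_FORMULAS = {
    "maximal":                   "Y ~ C + (1 + C | subject) + (1 + C | item)",
    "no random correlations":    "Y ~ C + (1 | subject) + (0 + C | subject) + (1 | item) + (0 + C | item)",
    "no within-unit intercepts": "Y ~ C + (0 + C | subject) + (0 + C | item)",
    "random intercepts only":    "Y ~ C + (1 | subject) + (1 | item)",
}
```

The "no random correlations" form keeps intercepts and slopes but estimates them as independent; the "no within-unit intercepts" form keeps slopes only.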

slopes are not accounted for in the F2 analysis, nor are item random slopes accounted for in the F1 analysis (Clark, 1973). The fact that both F1 and F2 analyses have to pass muster keeps this anti-conservativity relatively minimal as long as subject and/or item slope variances are not large, but the anti-conservativity is there nonetheless.

Fig. 5 indicates that despite the overall good performance of some of the data-driven approaches, these models become anticonservative when critical variances are small and nonzero.⁹ This is because as slope variances become small, slopes are less likely to be kept in the model. This anticonservativity varies considerably with the size of the dataset and the algorithm used (and of course should also vary with the α-level, not depicted in the figure). The anticonservativity is present to a much stronger degree for the 12-item than for the 24-item datasets, reflecting the fact that the critical variances are harder to detect with smaller data sets. The forward stepping models that tested slopes in an arbitrary sequence were the most anticonservative by far. The model testing the by-subject slope first performed most poorly when that slope variance was small and the by-item slope variance was large, because in such cases the algorithm would be likely to stop at a random-intercepts-only model, and thus would never test for the inclusion of the by-item slope. By the same principles, the forward model that tested the by-item slope first showed worst performance when the by-item slope variance was small and the by-subject slope variance large.

⁹ Some readers might wonder why the heatmaps for these models show apparent anticonservative behavior at the bottom left corner, where the slopes are zero, since performance at this point should be close to the nominal level. This is an artifact of the smoothing used in generating the heatmaps, which aids in the detection of large trends at the expense of small details. The true underlying pattern is that the anticonservativity ramps up extremely quickly from the point where the slope variances are zero to reach its maximal value, before beginning to gradually taper away (as the variances become more reliably detected).
Fig. 2. Performance of model selection approaches for within-items designs, as a function of selection algorithm and α level for testing slopes. The p-values for all LMEMs in the figure are from likelihood-ratio tests. Top row: 12 items; bottom row: 24 items. BB = backwards, "best path"; BI = backwards, item-slope first; BS = backwards, subject-slope first; FB = forwards, "best path"; FI = forwards, item-slope first; FS = forwards, subject-slope first.

In sum, insofar as one is concerned about drawing conclusions likely to generalize across subjects and items, only min-F′ and maximal LMEMs can be said to be fundamentally sound across the full parameter space that we surveyed. F1-only and random-intercepts-only LMEMs are fundamentally flawed, as are forward stepping models that follow an arbitrary sequence, especially in cases with few observations. All other LMEMs with model selection showed small amounts of anticonservativity when slopes were small, even when the slopes were tested in a principled sequence; however, this level of anticonservativity is probably tolerable for reasonably-sized datasets (so long as there is not an extensive amount of missing data; see the Online Appendix). The widely-used F1 × F2 approach is flawed as well, but may be acceptable in cases where maximal LMEMs are not applicable. The question now is which of these analyses best maximizes power (Tables 5 and 6; Figs. 3 and 4).

Overall, maximal LMEMs showed greater power than min-F′. When corrected for their slight anticonservativity, maximal LMEMs exhibited power that was between 4% and 6% higher for the between-items design than the uncorrected values for min-F′. Although there does not seem to be a large overall power advantage to using maximal LMEMs for between-item designs, the visualization of power in terms of the critical variances (Fig. 3) suggests that the power advantage increases slightly as the critical variances become small. In contrast, maximal LMEMs showed a considerable power advantage for within-item designs, with corrected power levels for α = .05 (.433 and .592) that were 16–32% higher than the uncorrected power values for min-F′ (.327 and .512). This additional power of maximal LMEMs cannot be attributed to the inclusion of cases where slopes were removed due to nonconvergence, since in the worst case (the within-item dataset with 12 items) virtually all simulation runs (99.613%) converged with the maximal structure.

Note that the corrected power for random-intercepts-only LMEMs was actually below that of maximal LMEMs (between-items: .216 and .314 for 12 and 24 items respectively; within-items: .380 and .531). This means that most of the apparent additional power of maximal LMEMs over min-F′ is real, while most of the apparent power of random-intercepts-only LMEMs is, in fact, illusory.

Our results show that it is possible to use a data-driven approach to specifying random effects in a way that minimizes Type I error, especially with "best-path" model selection. However, the power advantage of this approach for continuous data, even when uncorrected for anticonservativity, is very small. In short, data-driven approaches can produce reasonable results, but their very small benefit to power may not be worth the additional uncertainty they introduce as compared to a design-driven approach.

The above analyses suggest that maximal LMEMs are in no way conservative, at least for the analysis of continuous data. To dispel this suspicion entirely it is illustrative to

Fig. 3. Type I error (top two rows) and power (bottom two rows) for between-items design with 24 items, as a function of by-subject random slope variance τ₁₁² and by-item random intercept variance ω₀₀². The p-values for all LMEMs in the figure are from likelihood-ratio tests. All model selection approaches in the figure had α = .05 for slope inclusion. The heatmaps from the 12-item datasets show similar patterns, and are presented in the Online Appendix.

consider the performance of maximal LMEMs in an extreme case for which their power should be at its absolute worst: namely, when the random slope variance is negligible, such that the underlying process is best described by a random-intercepts-only LMEM. Comparing the performance of maximal LMEMs to random-intercepts-only LMEMs in this circumstance can illustrate the highest "cost" that one could possibly incur by using maximal LMEMs in the analysis of continuous data. Maximal LMEMs might perform badly in this situation because they overfit the underlying generative process, loosely akin to assuming too few degrees of freedom for the analysis rather than too many. But they might also perform tolerably well to the extent that the estimation procedure for LMEMs can detect that a random effect parameter is effectively zero. In this situation, it is additionally informative to compare the power of maximal LMEMs not only to random-intercepts-only LMEMs, but also to that of min-F′, since min-F′ is generally regarded as conservative, as well as to that of F1 × F2, since F1 × F2 is generally regarded as sufficiently powerful. If maximal LMEMs perform better than min-F′ and at least as well as F1 × F2, then that would strongly argue against the idea that maximal LMEMs are unduly conservative. To address this, we conducted an additional set of simulations, once again using the data-generating parameters in Table 2, except that we set all random-slope

Fig. 4. Type I error (top three rows) and power (bottom three rows) for within-items design with 24 items, as a function of by-subject random slope variance τ₁₁² and by-item random slope variance ω₁₁². The p-values for all LMEMs in the figure are from likelihood-ratio tests. All model selection approaches in the figure had α = .05 for slope inclusion. The heatmaps from the 12-item datasets show similar patterns, and are presented in the Online Appendix.

variances to 0, so that the model generating the data was a random-intercepts-only LMEM; also, we varied the true fixed effect size continuously from 0 to 0.8.

The results were unambiguous (Fig. 6): even in this worst case scenario for their power, maximal LMEMs consistently showed higher power than min-F′ and even F1 × F2; indeed, for within-items designs, they far outstripped the performance of F1 × F2, an approach whose power is rarely questioned. For between-items designs, maximal LMEMs incurred a negligible cost relative to random-intercepts-only LMEMs, while for within-items designs, there was only a minor cost that diminished as the number of items increased. Overall, the cost/benefit analysis favors maximal LMEMs over other approaches. Note that over the four different sets of simulations in Fig. 6, our maximal LMEM analysis procedure stepped down to a random-intercepts model (due to convergence problems) on less than 3% of runs. Thus, these results indicate good performance of the estimation algorithm when random slope variances are zero.

The case of within-item designs with few items showed the biggest difference in power between random-intercepts-only models and maximal LMEMs. The size of this difference indicates the maximum benefit that could be obtained, in principle, by using a data-driven approach. However, in practice, the ability of a data-driven approach to detect random slope variation diminishes as the dataset gets small. In other words, it is in just this situation that

Fig. 5. Type I error (top two rows) and power (bottom two rows) for data-driven approaches on within-items data, as a function of by-subject random slope variance τ₁₁² and by-item random slope variance ω₁₁². The p-values for all LMEMs in the figure are from likelihood-ratio tests. All approaches in the figure tested random slopes at α = .05.
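The likelihood-ratio tests referred to in the caption compare a model that includes the random slope against an otherwise identical model without it: the test statistic is twice the difference in maximized log-likelihoods, referred to a chi-square distribution. A minimal, self-contained sketch of the p-value computation (in Python rather than the paper's R; the log-likelihood values below are invented for illustration):

```python
import math

def chi2_sf(x, df):
    """Chi-square survival function for df in {1, 2}, which covers tests of
    a single variance (df = 1) or a variance plus a correlation (df = 2)."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 or 2 implemented in this sketch")

def likelihood_ratio_test(loglik_reduced, loglik_full, df):
    """Statistic and p-value for comparing nested models. When the tested
    parameter is a variance, the null value sits on the boundary of the
    parameter space, which makes this p-value conservative."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    return stat, chi2_sf(stat, df)

# Invented log-likelihoods: model without vs. with a by-subject random slope
# (slope variance + intercept-slope correlation = 2 extra parameters).
stat, p = likelihood_ratio_test(-1523.4, -1519.1, df=2)
print(round(stat, 3), round(p, 4))  # → 8.6 0.0136
```

In lme4, the same comparison is what one obtains by fitting both models and calling anova() on them.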

the power for detecting random slope variation is at its worst (see Fig. 4 and the 12-item figures in Appendix); thus, in this case, the payoff in terms of statistical power over the maximal approach does not outweigh the additional risk of anticonservativity.

In sum, our investigation suggests that for confirmatory hypothesis testing, maximal LMEMs yield nearly optimal performance: they were better than all other approaches except min-F′ at maintaining the Type I error rate near the nominal level. Furthermore, unlike F1 × F2 and certain model selection approaches, maximal LMEMs held the error rate relatively close to nominal across the entirety of the parameter space. And once corrected for anticonservativity, no other technique exceeded the power of maximal LMEMs. "Best-path" model selection also kept Type I errors largely at bay, but it is not clear whether it leads to gains in statistical power.

The near-optimal performance of maximal models may be explained as follows. Including the random variances that could potentially be confounded with the effect of interest is critical to controlling the Type I error rate, by ensuring that the assumption of conditional independence is met. Including random variances that are not confounded with that effect (e.g., within-unit intercept variances) is not critical for reducing Type I error, but nonetheless reduces noise and thus increases the sensitivity of the test. Including only those random components that reduce noise but not those that are confounded with the effect of interest will lead to drastically anticonservative behavior, as seen for random-intercepts-only LMEMs, which had the worst Type I error rates overall.

General discussion

Recent years have witnessed a surge in the popularity of LMEMs in psycholinguistics and related fields, and this growing excitement is well deserved, given the great flexibility of LMEMs and their ability to model the generative process underlying one's data. However, there has been insufficient appreciation of how choices about random effect structure impact generalizability, and no accepted standards for the use of LMEMs in confirmatory hypothesis testing are currently in place. We have emphasized that specifying random effects in LMEMs involves essentially the same principles as selecting an analysis technique from the menu of traditional ANOVA-based options. The standard for ANOVA has been to assume that if an effect exists, subjects and/or items will vary in the extent to which they show that effect. This is evident in the fact that researchers using ANOVA have tended to assume the maximal (or near-maximal) random effects structure justified by the design.
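The consequences described here can be reproduced in a toy Monte Carlo simulation. The sketch below (Python with only the standard library, rather than the R/lme4 setup used in the paper; all parameter values are illustrative) contrasts a pooled trial-level test, which, like a random-intercepts-only analysis, ignores by-subject variation in the effect, with a by-subject analysis that accommodates it. Under a null fixed effect but nonzero slope variance, only the latter stays near the nominal .05 rate:

```python
import random

random.seed(1)

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):  # sample variance
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def simulate_null(n_subj=24, n_per_cond=12, slope_sd=1.0, int_sd=1.0, noise_sd=1.0):
    """One dataset with no fixed effect but by-subject variation in the effect."""
    cond_a, cond_b, diffs = [], [], []
    for _ in range(n_subj):
        b0 = random.gauss(0, int_sd)    # by-subject intercept
        b1 = random.gauss(0, slope_sd)  # by-subject effect (mean zero: null is true)
        a = [b0 - b1 / 2 + random.gauss(0, noise_sd) for _ in range(n_per_cond)]
        b = [b0 + b1 / 2 + random.gauss(0, noise_sd) for _ in range(n_per_cond)]
        cond_a += a
        cond_b += b
        diffs.append(mean(b) - mean(a))
    return cond_a, cond_b, diffs

reps, naive_hits, subj_hits = 1000, 0, 0
for _ in range(reps):
    a, b, d = simulate_null()
    # Pooled test: every trial treated as independent (ignores the slope variance).
    t_naive = (mean(b) - mean(a)) / (var(a) / len(a) + var(b) / len(b)) ** 0.5
    # By-subject test on per-subject condition differences.
    t_subj = mean(d) / (var(d) / len(d)) ** 0.5
    naive_hits += abs(t_naive) > 1.96  # approx. two-tailed critical value, df = 574
    subj_hits += abs(t_subj) > 2.07    # two-tailed t critical value, df = 23

print(round(naive_hits / reps, 3), round(subj_hits / reps, 3))  # nominal rate is .05
```

The pooled test's Type I inflation grows with the slope variance and the number of trials per subject, the same pattern the simulations report for random-intercepts-only LMEMs.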
D.J. Barr et al. / Journal of Memory and Language 68 (2013) 255–278 273

[Fig. 6 graphic not reproducible in this extraction: four panels (Between Item and Within Item designs, each at N = 12 and N = 24), plotting power (0.0–1.0) against effect size binned from (.000, .042] to (.758, .800], with curves for Maximal, Random Intercept, min-F′, and F1 × F2.]

Fig. 6. Statistical power for maximal LMEMs, random-intercepts-only LMEMs, min-F′, and F1 × F2 as a function of effect size, when the generative model underlying the data is a random-intercepts-only LMEM.
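Power curves like those in Fig. 6 are produced by simulating many datasets at each effect size and recording how often the null hypothesis is rejected. A stripped-down sketch for one method only (a by-subject t-test, in Python rather than the paper's R; sample sizes and effect values are illustrative, and by-subject intercept variation is omitted because it cancels in within-subject differences):

```python
import random

random.seed(2)

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):  # sample variance
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def by_subject_power(effect, n_subj=24, n_per_cond=12, noise_sd=1.0, reps=500):
    """Rejection rate of a by-subject t-test when the generative model is
    random-intercepts-only (no by-subject variation in the effect)."""
    hits = 0
    for _ in range(reps):
        diffs = []
        for _ in range(n_subj):
            a = [random.gauss(0.0, noise_sd) for _ in range(n_per_cond)]
            b = [random.gauss(effect, noise_sd) for _ in range(n_per_cond)]
            diffs.append(mean(b) - mean(a))
        t = mean(diffs) / (var(diffs) / n_subj) ** 0.5
        hits += abs(t) > 2.07  # two-tailed t critical value, df = 23
    return hits / reps

curve = {e: by_subject_power(e) for e in (0.0, 0.2, 0.4)}
print(curve)
```

The Maximal and Random Intercept curves in the figure would additionally require fitting actual LMEMs at each simulated dataset (e.g., with lme4 in R), which is beyond a standard-library sketch.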

Our survey of various analysis strategies for confirmatory hypothesis testing on data with crossed random effects clearly demonstrates that the strongest contenders in avoiding anti-conservativity are maximal LMEMs and min-F′ (see below for related discussion of data-driven approaches). These were the only approaches that consistently showed nominal or near-nominal Type I error rates throughout the parameter space. Although maximal LMEMs showed some minor anticonservativity, our analysis has uncovered a hidden advantage of maximal LMEMs over ANOVA-based approaches. This advantage was evident in the diverging performance of these approaches as a function of the critical variances (the variances confounded with the effect of interest). As these variances became small, maximal LMEMs showed better retention of their power relative to ANOVA-based approaches, which became increasingly conservative. In the limit, when the generative process did not include any random slope variation, the power of maximal LMEMs substantially outstripped that of ANOVA-based approaches, even F1 × F2.

An apparent reason for this power advantage of maximal LMEMs is their simultaneous accommodation of by-subject and by-item random variation. ANOVA-based approaches are based on two separately calculated statistics, one for subjects and one for items, each of which does not control for the variance due to the other random factor. Specifically, the prominent decrease in power for ANOVA-based approaches when critical random variances are small could be due to the fact that separate F1 × F2 analyses cannot distinguish between random intercept variation on the one hand and residual noise on the other: by-item random intercept variation is conflated with trial-level noise in the F1 analysis, and by-subject random intercept variation is conflated with trial-level noise in the F2 analysis. Maximal LMEMs do not suffer from this problem.

The performance of LMEMs depended strongly on assumptions about random effects. This clearly implies that researchers who wish to use LMEMs need to be more attentive both to how they specify random effects in their models, and to the reporting of their modeling efforts. Throughout this article, we have argued that for confirmatory analyses, a design-driven approach is preferable to a data-driven approach for specifying random effects. By using a maximal model, one adheres to the longstanding (and eminently reasonable) assumption that if an effect exists, subjects and/or items will most likely vary in the extent to which they show that effect, whether or not that variation is actually detectable in the sample. That being said, it seems likely that effect variation across subjects and/or items differs among research areas. Perhaps there are even effects for which such variation is effectively negligible. In such cases, one might argue that a maximal LMEM is "overfitting" the data, which is detrimental to power. However, our simulations (Fig. 6) show that maximal LMEMs would not be unreasonably conservative in such cases, at least for continuous data; indeed, they are far more powerful than even F1 × F2.

Many researchers have used a data-driven approach to determining the random effects structure associated with confirmatory analysis of a fixed effect of particular interest.¹⁰ Our simulations indicate that it is possible to obtain reasonable results with such an approach, if generous criteria for the inclusion of random effects are used. However, it is important to bear in mind that data-driven approaches always imply some sort of tradeoff between Type-I and Type-II error probability, and that it is difficult to precisely quantify the tradeoff that one has taken. It is partly for this reason that the use of data-driven approaches is controversial, even among statistical experts (Bolker et al., 2009; see also Harrell, 2001 for more general concerns regarding model selection). Our results show that data-driven approaches yield varying performance in terms of Type I error depending on the criteria and algorithm used, the size of the dataset, the extent of missing data, and the design of the study (which determines the number of random effect terms that need to be tested). It may be difficult for a reader lacking access to the original data to quantify the resulting Type I/II error tradeoff, even when the inclusion criteria are known. Still, there are situations in which data-driven approaches may be justifiable, such as when the aims are not fully confirmatory, or when one experiences severe convergence problems (see below).

¹⁰ We note here that arguments for data-driven approaches to random-effects structure have been made within the ANOVA literature as well (e.g., Raaijmakers, Schrijnemakers, & Gremmen, 1999), and that these arguments have not won general acceptance within the general psycholinguistics community. Furthermore, part of the appeal of those arguments was the undue conservativity of min-F′; since maximal LMEMs do not suffer from this problem, we take the arguments for data-driven random-effects specification to be correspondingly weaker.

Overall, our analysis suggests that, when specifying random effects for hypothesis testing with LMEMs, researchers have been far too concerned about overfitting the data, and not concerned enough about underfitting the design. In fact, it turns out overfitting the data with a maximal model has only minimal consequences for Type I error and power—at least for the simple designs for typical psycholinguistic datasets considered here—whereas underfitting the design can incur levels of anticonservativity ranging anywhere from minor (best-path model selection) to extremely severe (random-intercepts-only LMEMs) with little real benefit to power. In the extreme, random-intercepts-only models have the worst generalization performance of any approach to date when applied to continuous data with within-subjects and within-item manipulations. This goes to show that, for such designs, crossing of random by-subject and by-item intercepts alone is clearly not enough to ensure proper generalization of experimental treatment effects (a misconception that is unfortunately rather common at present). In psycholinguistics, there are few circumstances in which we can know a priori that a random-intercepts-only model is truly justified (see further below). Of course, if one wishes to emphasize the absence of evidence for a given effect, there could be some value in demonstrating that the effect is statistically insignificant even when a random-intercepts-only model is applied.

Our general argument applies in principle to the analysis of non-normally distributed data (e.g., categorical or count data) because observational dependencies are mostly determined by the design of an experiment and not by the particular type of data being analyzed. However, practical experience with fitting LMEMs to datasets with categorical response variables, in particular with mixed logit models, suggests more difficulty in getting maximal models to converge. There are at least three reasons for this. First, the estimation algorithms for categorical responses differ from the more developed procedures for estimating continuous data. Second, observations on a categorical scale (e.g., whether a response is accurate or inaccurate) typically carry less information about parameters of interest than observations on a continuous scale (e.g., response time). Third, parameter estimation for any logistic model is challenging when the underlying probabilities are close to the boundaries (zero or one), since the inverse logit function is rather flat in this region. So although our arguments and recommendations (given below) still apply in principle, they might need to be modified for noncontinuous cases. For example, whereas we observed for continuous data that cases in which the random effect structure "overfit" the data had minimal impact on performance, this might not be the case for categorical data, and perhaps data-driven strategies would be more justifiable. In short, although we maintain that design-driven principles should govern confirmatory hypothesis testing on any kind of data, we acknowledge that there is a
specification with them to be correspondingly weaker. ing on any kind of data, we acknowledge that there is a

pressing need to evaluate how these principles can be best applied to categorical data.

In our investigation we have only looked at a very simple one-factor design with two levels. However, we see no reason why our results would not generalize to more complex designs. The principles are the same in higher-order designs as they are for simple one-factor designs: any main effect or interaction for which there are multiple observations per subject or item can vary across these units, and, if this dependency is not taken into account, the p-values will be biased against the null hypothesis.¹¹ The main difference is that models with maximal random effects structure would be less likely to converge as the number of within-unit manipulations increases. In the next section, we offer some guidelines for how to cope with nonconverging models.

¹¹ To demonstrate this, we conducted Monte Carlo simulation of a 24-subject, 24-item 2 × 2 within/within experiment with main fixed effects, no fixed interaction, and random by-subject and by-item interactions. When analyzed with random-intercepts-only LMEMs, we found a Type I error rate of .69; with maximal LMEMs the Type I error rate was .06. A complete report of these simulations appears in the Online Appendix.

Producing generalizable results with LMEMs: Best practices

Our theoretical analyses and simulations lead us to the following set of recommended "best practices" for the use of LMEMs in confirmatory hypothesis testing. It is important to be clear that our recommendations may not apply to situations where the overarching goal is not fully confirmatory. Furthermore, we offer these not as the best possible practices—as our understanding of these models is still evolving—but as the best given our current level of understanding.

Identifying the maximal random effects structure

Because the same principles apply for specifying subject and item random effects, to simplify the exposition in this section we will only talk about "by-unit" random effects, where "unit" stands in for the sampling unit under consideration (subjects or items). As we have emphasized throughout this paper, the same considerations come into play when specifying random effects as when choosing from the menu of traditional analyses. So the first question to ask oneself when trying to specify a maximal LMEM is: which factors are within-unit, and which are between? If a factor is between-unit, then a random intercept is usually sufficient. If a factor is within-unit and there are multiple observations per treatment level per unit, then you need a by-unit random slope for that factor. The only exception to this rule is when you only have a single observation for every treatment level of every unit; in this case, the random slope variance would be completely confounded with trial-level error. It follows that a model with a random slope would be unidentifiable, and so a random intercept would be sufficient to meet the conditional independence assumption. For datasets that are unbalanced across the levels of a within-unit factor, such that some units have very few observations or even none at all at certain levels of the factor, one should at least try to estimate a random slope (but see the caveats below about how such a situation may contribute to nonconvergence). Finally, in cases where there is only a single observation for every unit, of course, not even a random intercept is needed (one can just use ordinary regression as implemented in the R functions lm() and glm()).

The same principles apply to higher-order designs involving interactions. In most cases, one should also have by-unit random slopes for any interactions where all factors comprising the interaction are within-unit; if any one factor involved in the interaction is between-unit, then the random slope associated with that interaction cannot be estimated, and is not needed. The exception to this rule, again, is when you have only one observation for every subject in every cell (i.e., unique combination of factor levels). If some of the cells for some of your subjects have only one or zero observations, you should still try to fit a random slope.

Random effects for control predictors

One of the most compelling aspects of mixed-effects models is the ability to include almost any control predictor—by which we mean a property of an experimental trial which may affect the response variable but is not of theoretical interest in a given analysis—desired by the researcher. In principle, including control variables in an analysis can rule out potential confounds and increase statistical power by reducing residual noise. Given the investigations in the present paper, however, the question naturally arises: in order to guard against anti-conservative inference about a predictor X of theoretical interest, do we need by-subject and by-item random effects for all our control predictors C as well? Suppose, after all, that there is no underlying fixed effect of C but there is a random effect of C—could this create anti-conservative inference in the same way as omitting a random effect of X in the analysis could? To put this issue in perspective via an example, Kuperman, Bertram, and Baayen (2010) include a total of eight main effects in an LME analysis of fixation durations in Dutch reading; for the interpretation of each main effect, the other seven may be thought of as serving as controls. Fitting eight random effects, plus correlation terms, would require estimating 72 random effects parameters, 36 by-subject and 36 by-item. One would likely need a huge dataset to be able to estimate all the effects reliably (and one must also not be in any hurry to publish, for even with huge amounts of data such models can take extremely long to converge).

To our knowledge, there is little guidance on this issue in the existing literature, and more thorough research is needed. Based on a limited amount of informal simulation, however (reported in the Online Appendix), we propose the working assumption that it is not essential for one to specify random effects for control predictors to avoid anti-conservative inference, as long as interactions between the control predictors and the factors of interest are not present in the model (or justified by the data). Once again, we emphasize the need for future research on this important issue.

Coping with failures to converge

It is altogether possible and unfortunately common that the estimation procedure for LMEMs will not converge

with the full random-effects specification. In our experience, the likelihood that a model will converge depends on two factors: (1) the extent to which random effects in the model are large, and (2) the extent to which there are sufficient observations to estimate the random effects. Generally, as the sizes of the subject and item samples grow, the likelihood of convergence will increase. Of course, one does not always have the luxury of using many subjects and items. And, although the issue seems not to have been studied systematically, it is our impression that fitting maximal LMEMs is less often successful for categorical data than for continuous data.

It is important, however, to resist the temptation to step back to random-intercepts-only models purely on the grounds that the maximal model does not converge. When the maximal LMEM does not converge, the first step should be to check for possible misspecifications or data problems that might account for the error. It may also help to use standard outlier removal methods and to center or sum-code the predictors. In addition, it may sometimes be effective to increase the maximum number of iterations in the estimation procedure.

Once data and model specification problems have been eliminated, the next step is to ask what simplification of one's model's random-effects structure is the most defensible given the goals of one's analysis. In the common case where one is interested in a minimally anti-conservative evaluation of the strength of evidence for the presence of an effect, our results indicate that keeping the random slope for the predictor of theoretical interest is important: a maximal model with no random correlations or even missing within-unit random intercepts is preferable to one missing the critical random slopes. Our simulations suggest that removing random correlations might be a good strategy, as this model performed similarly to maximal LMEMs.¹² However, when considering this simplification strategy, it is important to first check whether the nonconvergence might be attributable to the presence of a few subjects (or items) with small numbers of observations in particular cells. If this is the case, it might be preferable to remove (or replace) these few subjects (or items) rather than to remove an important random slope from the model. For example, Jaeger et al. (2011, Section 3.3) discuss a case involving categorical data in which strong evidence for random slope variation was present when subjects with few observations were excluded, but not when they were kept in the data set.

¹² We remind the reader that the no-correlation and RS-only models are sensitive to the coding used for the predictor; our theoretical analysis and simulation results indicate that deviation coding is generally preferable to treatment coding; see the Online Appendix.

For more complex designs, of course, the number of possible random-effects structures proliferates. Research is needed to evaluate the various possible strategies that one could follow when one cannot fit a maximal model. Both our theoretical analysis and simulations suggest a general rule of thumb: for whatever fixed effects are of critical interest, the corresponding random effects should be present in that analysis. For a study with multiple fixed effects of theoretical interest, and for which a model including random effects for all these key effects does not converge, separate analyses can be pursued. For example, in a study where confirmatory analysis of fixed effects for X1 and X2 is desired, two separate analyses may be in order—one with (at least) random slopes for X1 to test the evidence for generalization of X1, and another with (at least) random slopes for X2 to test X2. (One would typically still want to include the fixed effects for X1 and X2 in both models, of course, for the established reason that multiple regression reveals more information than two separate regressions with single predictors; and see also the previous section on random effects for control predictors.)

One fallback strategy for coping with severe convergence problems is to use a data-driven approach, building up in a principled way from a very simple model (e.g., a model with all fixed effects but only a single by-subjects random intercept, or even no random effects at all). Our simulations indicate that it is possible to use forward model selection in a way that guards against anticonservativity if, at each step, one tests for potential inclusion all random effects not currently in the model, and includes any that pass at a relatively liberal α-level (e.g., .20; see Method section for further details about the "best path" algorithm). We warn the reader that, with more complex designs, the space of possible random-effects specifications is far richer than typically appreciated, and it is easy to design a model-selection procedure that fails to get to the best model. As just one example, in a "within" 2 × 2 design with factors A and B, it is possible to have a model with a random interaction term but no random main-effect slopes. If one is interested in the fixed-effect interaction but only tests for a random interaction if both main-effect random slopes are already in the model, then if the true generative process underlying the data has a large random interaction but negligible random main-effect slopes, forward model selection will be badly anti-conservative. Thus it is critical to report the modeling strategy in detail so that it can be properly evaluated.

A final recommendation is based on the fact that maximal LMEMs will be more likely to converge when the random effects are large, which is exactly the situation where F1 × F2 is anti-conservative. This points toward a possible practice of trying to fit a maximal LMEM wherever possible, and when it is not, to drop the concept of crossed random effects altogether and perform separate by-subject and by-item LMEMs, similar in logic to F1 × F2, each with appropriate maximal random effect structures. In closing, it remains unresolved which of these strategies for dealing with nonconvergence is ultimately most beneficial, and we hope that future studies will investigate their impact on generalizability more systematically.

Computing p-values

There are a number of ways to compute p-values from LMEMs, none of which is uncontroversially the best. Although Baayen et al. (2008) recommended using Markov chain Monte Carlo (MCMC) simulation, this is not yet possible in lme4 for models with correlation parameters, and our simulations indicate that this method for obtaining p-values is more anticonservative than the other two methods we examined (at least using the current implementation and defaults for MCMC sampling in lme4 and languageR).¹³ Also, it is important to note that MCMC sampling does nothing to mitigate the anticonservativity of random-intercepts-only LMEMs when random slope variation is present.

¹³ MCMC simulations for random-slopes and more complex mixed-effects models can be run with general-purpose graphical models software such as WinBUGS (Lunn, Thomas, Best, & Spiegelhalter, 2000), JAGS (Plummer, 2003), or MCMCglmm (Hadfield, 2010). This approach can be delicate and error-prone, however, and we do not recommend it at this point as a general practice for the field.

For obtaining p-values from analyses of typically-sized psycholinguistic datasets—where the number of observations usually far outnumbers the number of model parameters—our simulations suggest that the likelihood-ratio test is the best approach. To perform such a test, one compares a model containing the fixed effect of interest to a model that is identical in all respects except the fixed effect in question. One should not also remove any random effects associated with the fixed effect when making the comparison. In other words, likelihood-ratio tests of a fixed effect with k levels should have only k − 1 degrees of freedom (e.g., one degree of freedom for the dichotomous single-factor studies in our simulations). We have seen cases where removing the fixed effect causes the comparison model to fail to converge. Under these circumstances, one might alter the comparison model following the procedures described above to attempt to get it to converge, and once convergence is achieved, compare it to an identical model including the fixed effect. Note that our results indicate that the concern voiced by Pinheiro and Bates (2000) regarding the anti-conservativity of likelihood-ratio tests to assess fixed effects in LMEMs is probably not applicable to datasets of the typical size of a psycholinguistic study (see also Baayen et al., 2008, footnote 1).

Reporting results

It is not only important for researchers to understand the importance of using maximal LMEMs, but also for them to articulate their modeling efforts with sufficient detail so that other researchers can understand and replicate the analysis. In our informal survey of papers published in JML, we sometimes found nothing more than a mere statement that researchers used "a mixed effects model with random effects for subjects and items." This could be anything from a random-intercepts-only to a maximal LMEM, and obviously, there is not enough information given to assess the generalizability of the results. One needs to provide sufficient information for the reader to be able to recreate the analyses. One way of satisfying this requirement is to report the variance–covariance matrix, which includes all the information about the random effects, including their estimates. This is useful not only as a check on the random effects structure, but also for future meta-analyses. A simpler option is to mention that one attempted to use a maximal LMEM and, as an added check, also state which factors had random slopes associated with them. If the random effects structure had to be simplified to obtain convergence, this should also be reported, and the simplifications that were made should be justified to the extent possible.

If it is seen as necessary or desirable in a confirmatory analysis to determine the random effects structure using a data-driven approach, certain minimal guidelines should be followed. First, it is critical to report the criteria that have been used, including the α-level for exclusion/inclusion of random slopes and the order in which random slopes were tested. Furthermore, authors should explicitly report the changed assumptions about the generative process underlying the data that result from excluding the random slope (rather than just stating that the slopes did not "improve model fit"), and should do so in non-technical language that non-experts can understand. Readers with only background in ANOVA will not understand that removing the random slope corresponds to pooling error across strata in a mixed-model ANOVA analysis. It is therefore preferable to clearly state the underlying assumption of a constant effect, e.g., "by excluding the random slope for the priming manipulation, we assume that the priming effect is invariant across subjects (or items) in the population."

Concluding remarks

In this paper we have focused largely on confirmatory analyses. We hope this emphasis will not be construed as an endorsement of confirmatory over exploratory approaches. Exploratory analysis is an important part of the cycle of research, without which there would be few ideas to confirm in the first place. We do not wish to discourage people from exploiting all the many new and exciting opportunities for data analysis that LMEMs offer (see Baayen, 2008 for an excellent overview). Indeed, one of the situations in which the exploratory power of LMEMs can be especially valuable is in performing a "post-mortem" analysis on confirmatory studies that yield null or ambiguous results. In such circumstances, one should pay careful attention to the estimated random effects covariance matrices from the fitted model, as they provide a map of one's ignorance. For instance, when a predicted fixed effect fails to reach significance, it is informative to check whether subjects or items have larger random slopes for that effect, and to then use whatever additional data one has on hand (e.g., demographic information) to try to reduce this variability. Such investigation can be extremely useful in planning further studies (or in deciding whether to cut one's losses), though of course such findings should be interpreted with caution, and their post hoc nature should be honestly reported (Wagenmakers et al., 2012).

At a recent workshop on mixed-effects models, a prominent psycholinguist¹⁴ memorably quipped that encouraging psycholinguists to use linear mixed-effects models was like giving shotguns to toddlers. Might the field be better off without complicated mixed-effects modeling, and the potential for misuse it brings? Although we acknowledge this complexity and its attendant problems, we feel that one of the reasons why researchers have been using mixed-effects models inappropriately in confirmatory analyses is due to the misconception that they are something entirely new, a misconception that has prevented seeing the

¹⁴ G.T.M. Altmann.
continued applicability of their previous knowledge about what a generalizable hypothesis test requires. As we hope to have shown, by and large, researchers already know most of what is needed to use LMEMs appropriately. So long as we can continue to adhere to the standards that are already implicit, we therefore should not deny ourselves access to this new addition to the statistical arsenal. After all, when our investigations involve stalking a complex and elusive beast (whether the human mind or the feline palate), we need the most powerful weapons at our disposal.

Acknowledgments

We thank Ian Abramson, Harald Baayen, Klinton Bicknell, Ken Forster, Simon Garrod, Philip Hofmeister, Florian Jaeger, Keith Rayner, and Nathaniel Smith for helpful discussion, feedback, and suggestions. This work has been presented at seminars and workshops in Bielefeld, Edinburgh, York, and San Diego. CS acknowledges support from ESRC (RES-062-23-2009), and RL was partially supported by NSF (IIS-0953870) and NIH (HD065829).

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.jml.2012.11.001.

References

Baayen, R. H. (2004). Statistics in psycholinguistics: A critique of some current gold standards. Mental Lexicon Working Papers, 1, 1–45.
Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to statistics using R. Cambridge: Cambridge University Press.
Baayen, R. H. (2011). languageR: Data sets and functions with "Analyzing linguistic data: A practical introduction to statistics". R package version 1.2.
Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59, 390–412.
Barr, D. J. (2008). Analyzing 'visual world' eyetracking data using multilevel logistic regression. Journal of Memory and Language, 59, 457–474.
Bates, D., Maechler, M., & Bolker, B. (2011). lme4: Linear mixed-effects models using S4 classes. R package version 0.999375-41. http://CRAN.R-Project.org/package=lme4.
Bolker, B. M., Brooks, M. E., Clark, C. J., Geange, S. W., Poulsen, J. R., Stevens, et al. (2009). Generalized linear mixed models: A practical guide for ecology and evolution. Trends in Ecology & Evolution, 24, 127–135.
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior, 12, 335–359.
Davenport, J. M., & Webster, J. T. (1973). A comparison of some approximate F-tests. Technometrics, 15, 779–789.
Dixon, P. (2008). Models of accuracy in repeated-measures designs. Journal of Memory and Language, 59, 447–456.
Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver & Boyd.
Forster, K., & Dickinson, R. (1976). More on the language-as-fixed-effect fallacy: Monte Carlo estimates of error rates for F1, F2, F′, and min F′. Journal of Verbal Learning and Verbal Behavior, 15, 135–142.
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge: Cambridge University Press.
Goldstein, H. (1995). Multilevel statistical models. Kendall's library of statistics (Vol. 3). Arnold.
Hadfield, J. D. (2010). MCMC methods for multi-response generalized linear mixed models: The MCMCglmm R package. Journal of Statistical Software, 33, 1–22.
Harrell, F. E. (2001). Regression modeling strategies: With applications to linear models, logistic regression, and survival analysis. New York: Springer.
Jaeger, T. F. (2008). Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models. Journal of Memory and Language, 59, 434–446.
Jaeger, T. F. (2009). Post to HLP/Jaeger lab blog, 14 May 2009. <http://hlplab.wordpress.com/2009/05/14/random-effect-structure>.
Jaeger, T. F. (2010). Redundancy and reduction: Speakers manage syntactic information density. Cognitive Psychology, 61, 23–62.
Jaeger, T. F. (2011a). Post to the R-LANG mailing list, 20 February 2011. <http://pidgin.ucsd.edu/pipermail/r-lang/2011-February/000225.html>.
Jaeger, T. F. (2011b). Post to HLP/Jaeger lab blog, 25 June 2011. <http://hlplab.wordpress.com/2011/06/25/more-on-random-slopes>.
Jaeger, T. F., Graff, P., Croft, W., & Pontillo, D. (2011). Mixed effect models for genetic and areal dependencies in linguistic typology. Linguistic Typology, 15, 281–320.
Janssen, D. P. (2012). Twice random, once mixed: Applying mixed models to simultaneously analyze random effects of language and participants. Behavior Research Methods, 44, 232–247.
Kliegl, R. (2007). Toward a perceptual-span theory of distributed processing in reading: A reply to Rayner, Pollatsek, Drieghe, Slattery, and Reichle (2007). Journal of Experimental Psychology: General, 136, 530–537.
Kuperman, V., Bertram, R., & Baayen, R. H. (2010). Processing trade-offs in the reading of Dutch derived words. Journal of Memory and Language, 62, 83–97.
Locker, L., Hoffman, L., & Bovaird, J. (2007). On the use of multilevel modeling as an alternative to items analysis in psycholinguistic research. Behavior Research Methods, 39, 723–730.
Lunn, D. J., Thomas, A., Best, N., & Spiegelhalter, D. (2000). WinBUGS – A Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing, 10, 325–337.
Mirman, D., Dixon, J. A., & Magnuson, J. S. (2008). Statistical and computational models of the visual world paradigm: Growth curves and individual differences. Journal of Memory and Language, 59, 475–494.
Pinheiro, J. C., & Bates, D. M. (2000). Mixed-effects models in S and S-PLUS. New York: Springer.
Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd international workshop on distributed statistical computing.
Quené, H., & van den Bergh, H. (2004). On multi-level modeling of data from repeated measures designs: A tutorial. Speech Communication, 43, 103–121.
Quené, H., & van den Bergh, H. (2008). Examples of mixed-effects modeling with crossed random effects and with binomial data. Journal of Memory and Language, 59, 413–425.
R Development Core Team (2011). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0.
Raaijmakers, J. G. W., Schrijnemakers, J. M. C., & Gremmen, F. (1999). How to deal with the language-as-fixed-effect fallacy: Common misconceptions and alternative solutions. Journal of Memory and Language, 41, 416–426.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods. Thousand Oaks, CA: Sage.
Roland, D. (2009). Relative clauses remodeled: The problem with mixed-effect models. Poster presented at the 2009 CUNY sentence processing conference.
Santa, J. L., Miller, J. J., & Shaw, M. L. (1979). Using quasi F to prevent alpha inflation due to stimulus variation. Psychological Bulletin, 86, 37–46.
Scheffé, H. (1959). The analysis of variance. New York: Wiley.
Schielzeth, H., & Forstmeier, W. (2009). Conclusions beyond support: Overconfident estimates in mixed models. Behavioral Ecology, 20, 416–420.
Snijders, T., & Bosker, R. J. (1999a). Multilevel analysis: An introduction to basic and advanced multilevel modeling. Thousand Oaks, CA: Sage.
Snijders, T. A. B., & Bosker, R. J. (1999b). Multilevel analysis: An introduction to basic and advanced multilevel modeling. Sage.
Tukey, J. W. (1980). We need both exploratory and confirmatory. The American Statistician, 34, 23–25.
Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L. J., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632–638.
Wickens, T. D., & Keppel, G. (1983). On the choice of design and of test statistic in the analysis of experiments with sampled materials. Journal of Verbal Learning and Verbal Behavior, 22, 296–309.
Winer, B. J. (1971). Statistical principles in experimental design (2nd ed.). New York: McGraw-Hill.