
The generalizability crisis

Tal Yarkoni*
Department of Psychology, University of Texas at Austin
*Email: tyarkoni@gmail.com

Most theories and hypotheses in psychology are verbal in nature, yet their evaluation overwhelmingly relies on inferential statistical procedures. The validity of the move from qualitative to quantitative analysis depends on the verbal and statistical expressions of a hypothesis being closely aligned—that is, that the two must refer to roughly the same set of hypothetical observations. Here I argue that most inferential statistical tests in psychology fail to meet this basic condition. I demonstrate how foundational assumptions of the "random effects" model used pervasively in psychology impose far stronger constraints on the generalizability of results than most researchers appreciate. Ignoring these constraints dramatically inflates false positive rates and routinely leads researchers to draw sweeping verbal generalizations that lack any meaningful connection to the statistical quantities they are putatively based on. I argue that failure to consider generalizability from a statistical perspective lies at the root of many of psychology's ongoing problems (e.g., the replication crisis), and conclude with a discussion of several potential avenues for improvement.

Keywords: generalization, statistics, psychology, random effects, inference, philosophy of science

Introduction

There's a widely circulated joke about mathematicians that goes as follows¹:

A mathematician, a physicist, and an astronomer were traveling north by train. They had just crossed the border into Scotland, when the astronomer looked out of the window and saw a single black sheep in the middle of a field.

"Scottish sheep are black," he remarked.

"No, my friend," replied the physicist, "Some Scottish sheep are black."

At this point the mathematician looked up from his paper and said: "In Scotland, there exists at least one field, in which there exists at least one sheep, at least one side of which is black."

¹ The provenance of the joke is unknown, but I first encountered it on John Myles White's blog (http://www.johnmyleswhite.com/notebook/2016/10/03/claims-and-evidence-a-joke/), offered in support of much the same point made here.

This anecdote elegantly captures an important insight about the process of scientific induction—i.e., of drawing general conclusions from specific observations: one can do it quickly, or one can do it slowly. The "fast" approach is liberal and incautious; it makes the default assumption that every observation can be safely generalized to other similar-seeming situations until such time as those generalizations are contradicted by new evidence. For example, upon observing that a particular set of subjects rated a particular set of vignettes as more morally objectionable when primed with a particular set of cleanliness-related words than with a particular set of neutral words, one might draw the extremely broad conclusion that "cleanliness reduces the severity of moral judgments" (Schnall, Benton, & Harvey, 2008). If a later experiment conducted with a slightly different procedure fails to replicate the effect (Johnson, Cheung, & Donnellan, 2014), one might then conclude that the original effect still holds under all intended conditions except those directly tested by the latter study. Over time, if the catalog of contradictory results grows sufficiently large, one may come to entertain the notion that perhaps the original phenomenon was not so robust after all.
The "slow" approach is conservative, and adheres to the opposite default: an observed relationship is assumed to hold only in situations identical, or very similar to, the one in which it has already been observed. Upon observing that priming a certain group of subjects with cleanliness-related words increases ratings of moral disgust for a certain set of vignettes, one concludes that there is apparently something about the subjects, stimuli, experimental methods, experimenter, or analysis procedures that causes a change in the outcome measure. However, because it cannot be readily determined which of these many factors, alone or in combination, are responsible for the observed change, it is considered inadvisable to generalize one's conclusions beyond the narrow context at hand. Instead, one might draw the limited conclusion that "priming undergraduate Plymouth students with 40 cleanliness-related words increases 21-point moral disgust ratings for 6 specific moral dilemmas". If one then conducts several similar experiments using different stimuli, but otherwise holding other procedures constant, one might graduate to the somewhat broader conclusion that "priming undergraduates with cleanliness information increases moral disgust ratings for short vignettes". Eventually, if the effect is shown to hold when systematically varying a large number of other experimental factors, one may even earn the right to summarize the results of a few hundred studies by stating that "cleanliness reduces the severity of moral judgments".

I think it's readily apparent that the vast majority of psychological scientists have long operated under a regime of (extremely) fast generalization. One has only to scan the table of contents of our journals to observe paper titles such as "Local Competition Amplifies the Corrosive Effects of Inequality" (Krupp & Cook, 2018), "The Intent to Persuade Transforms Language via Emotionality" (Rocklage, Rucker, & Nordgren, 2018), and "Inspiration Encourages Belief in God" (Critcher & Lee, 2018)². Such claims are typically supported by a small number of experiments conducted under narrow conditions, with virtually no discussion of the presumed boundary conditions of the effect. The assumption appears to be that a phenomenon demonstrated once (or twice, or five times) under very narrow conditions can be expected to hold in general, pending new evidence to the contrary. Statements regarding the expected limits of generalizability of the effect are almost nonexistent (for discussion, see Simons, Shoda, & Lindsay, 2017).

² These titles were all found in the May 2018 issue of Psychological Science.

The question taken up in this paper is whether or not the tendency to generalize psychology findings far beyond the circumstances in which they were originally established is defensible. The case I lay out in the next few sections is that it is not, and that unsupported generalization lies at the root of many of the methodological and sociological challenges currently afflicting psychological science. The implications of this behavior are, in the aggregate, dire. It is not simply a matter of researchers slightly overselling their conclusions here and there; the fundamental problem is that, on close examination, a huge proportion of the verbal claims made in empirical psychology articles turn out to have very little relationship with the statistical quantities they putatively draw their support from. Rectifying this problem is likely to require a pronounced shift in the way many psychologists think about the relationship between qualitative ideas and quantitative inferences. Such a shift would dramatically improve the reliability and validity of psychological inference—even as it brings into full view the enormous practical difficulties associated with making the kinds of sweeping claims that have historically pervaded the field.

Why bother with statistics?

Modern psychology is—at least to superficial appearances—a quantitative discipline. Evaluation of most claims proceeds by computing statistical quantities that are thought to bear some important relationship to the theories or practical applications psychologists care about. This observation may seem obvious, but it's worth noting that things didn't have to turn out this way. Given that the theories and constructs psychologists are interested in usually have qualitative origins, and are almost invariably expressed verbally, a naive observer might well wonder why psychologists bother with numbers at all. Why take the trouble to compute p-values, Bayes Factors, or confidence intervals when evaluating qualitative theoretical claims? Why don't psychologists simply look at the world around them, think deeply for a while, and then state—again in qualitative terms—what they think they have learned?
One standard answer to this question is that quantitative analysis offers important benefits that qualitative analysis does not (e.g., Steckler, McLeroy, Goodman, Bird, & McCormick, 1992)—perhaps most notably, greater objectivity and precision. Two observers can disagree over whether a crowd of people should be considered "big" or "small", but if a careful count establishes that the crowd contains exactly 74 people, then it is at least clear what the facts on the ground are, and any remaining dispute is rendered largely terminological.

Unfortunately, the benefits of quantitation come at a cost: verbally expressed psychological constructs³—things like cognitive dissonance, language acquisition, and working memory capacity—cannot be directly measured with an acceptable level of objectivity and precision. What can be measured objectively and precisely are operationalizations of those constructs—for example, a performance score on a particular digit span task, or the number of English words an infant has learned by age 3. Trading vague verbal assertions for concrete measures and manipulations is what enables researchers to draw precise, objective, quantitative inferences; however, the same move also introduces new points of potential failure, because the validity of the original verbal assertion now depends not only on what happens to be true about the world itself, but also on the degree to which the chosen proxy measures successfully capture the constructs of interest—what psychometricians term construct validity (Cronbach & Meehl, 1955; Guion, 1980; O'Leary-Kelly & J. Vokurka, 1998). When construct validity is low, any conclusions one draws at the operational level run a high risk of failing to generalize to the construct level. If a particular digit span task doesn't accurately capture what researchers mean by working memory, then any observed associations between scores on that task and other measures will tell us little about working memory (though we might still learn something interesting about the operational task itself).

³ I avoid the conventional habit of describing psychological constructs as latent variables, as such language is often taken to imply a realist philosophical attitude towards theoretical entities (e.g., Borsboom, Mellenbergh, & van Heerden, 2003). For present purposes, it's irrelevant whether one thinks psychological constructs objectively exist in some latent or platonic realm, or are merely pragmatic fictions.

From a generalizability standpoint, then, the key question is how closely the verbal and quantitative expressions of one's hypothesis align with each other. When a researcher verbally expresses a particular hypothesis, she is implicitly defining a set of admissible observations containing all of the hypothetical situations in which some measurement could be taken that would inform that hypothesis. If the researcher subsequently asserts that a particular statistical procedure provides a suitable test of the verbal hypothesis, she is making the tacit but critical assumption that the universe of admissible observations implicitly defined by the chosen statistical procedure (in concert with the experimental design, measurement model, etc.) is well aligned with the one implicitly defined by the qualitative hypothesis. Should a discrepancy between the two be discovered, the researcher will then face a choice between (a) working to resolve the discrepancy in some way (i.e., by modifying either the verbal statement of the hypothesis or the quantitative procedure(s) meant to provide an operational parallel); or (b) giving up on the link between the two and accepting that the statistical procedure does not inform the verbal hypothesis in a meaningful way.

The next few sections explore this relationship with respect to the most widely used class of statistical model in psychology—linear regression. The exploration begins with an examination of the random-subjects model—a mainstay of group-level inferences—and then progressively considers additional sources of variability whose existence is implied by most verbal inferences in psychology, but that the standard model fails to appropriately capture. The revealed picture is that an unknown but clearly very large fraction of statistical hypotheses described in psychology studies cannot plausibly be considered reasonable operationalizations of the verbal hypotheses they are meant to inform.
Fixed vs. random effects

We begin our exploration with a scenario that will be familiar to many psychologists. Suppose we administer a cognitive task—say, the color-word Stroop (MacLeod, 1991; Stroop, 1935)—to a group of participants (the reader is free to mentally substitute almost any other experimental psychology task into the example). Each participant is presented with a series of trials, half in a congruent condition and half in an incongruent condition. We are tasked with fitting a statistical model to estimate the canonical Stroop effect—i.e., the increase in reaction time (RT) observed when participants are presented with incongruent color-word information relative to congruent color-word information.

A naive, though almost always inappropriate, model might be the following:

yij = β0 + β1 Xij + eij    (1)
eij ∼ N(0, σ²e)

In this linear regression, yij denotes the i'th subject's response on trial j, Xij indexes the experimental condition (congruent or incongruent) of subject i's j'th trial, β0 is an intercept, β1 is the effect of congruency, and eij captures the errors, which are assumed to be normally distributed.
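To make Eq. (1) concrete, here is a minimal sketch (in Python, using statsmodels; this is not the paper's own simulation code, and all parameter values are illustrative assumptions) that generates Stroop-like data with genuine between-subject variability and then fits the purely fixed-effects specification:

```python
# Hypothetical data-generating process plus the naive fixed-effects fit from Eq. (1).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n_subjects, n_trials = 20, 200  # 100 congruent + 100 incongruent trials per subject

rows = []
for subj in range(n_subjects):
    # Each subject has their own baseline RT and their own Stroop effect, drawn
    # from population distributions (this is the variability that Eq. (1) ignores).
    intercept_i = 700 + rng.normal(0, 50)   # ms
    stroop_i = 80 + rng.normal(0, 30)       # ms
    for trial in range(n_trials):
        incongruent = trial % 2             # Xij: 0 = congruent, 1 = incongruent
        rt = intercept_i + stroop_i * incongruent + rng.normal(0, 100)  # eij
        rows.append({"subject": subj, "incongruent": incongruent, "rt": rt})
data = pd.DataFrame(rows)

# Eq. (1): one intercept and one condition slope, with no notion of a "subject".
fixed_only = smf.ols("rt ~ incongruent", data=data).fit()
print(fixed_only.params["incongruent"], fixed_only.bse["incongruent"])
```

The standard error printed here reflects only trial-level noise; as the next paragraphs explain, it will be far too small if the goal is to generalize to new subjects.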
What is wrong with this model? Well, one rather serious problem is that the model blatantly ignores sources of variance in the data that we know on theoretical grounds must exist. Notably, because the model includes only a single intercept parameter and a single slope parameter across all subjects and trials, it predicts exactly the same reaction time value for all trials in each condition, no matter which subject a given trial is drawn from. Such an assumption is clearly untenable: it's absurd to suppose that the only source of trial-to-trial RT variability within experimental conditions is random error. We know full well that people differ systematically from one another in performance on the Stroop task (and for that matter, on virtually every other cognitive task). Any model that fails to acknowledge this important source of variability is clearly omitting an important feature of the world as we understand it.

From a statistical standpoint, the model's failure to explicitly acknowledge between-subject variability has several deleterious consequences for our Stroop estimate. The most salient one, given psychologists' predilection towards dichotomous conclusions (e.g., whether or not an effect is statistically significant), is that the estimated uncertainty surrounding the parameter estimates of interest will tend to be biased—typically downwards (i.e., in our Stroop example, the standard error of the Stroop effect will usually be underestimated)⁴. The reason is that, lacking any concept of a "person", our model cannot help but assume that any new set of trials—no matter who they come from—must have been generated by exactly the same set of processes that gave rise to the trials the model has previously seen. Consequently, the model cannot adjust the uncertainty around the point estimate to account for variability between subjects, and will usually produce an overly optimistic estimate of its own performance when applied to new subjects whose data-generating process is at least somewhat different from the process that generated the data the model was trained on.

⁴ The precise effect of failing to include random factors depends on a number of considerations, including the amount of variance between versus within the random effects, the covariance with other variables, and the effective sample sizes of different factors. But in most real-world settings, the inclusion of random effects will lead to (often much) larger uncertainty estimates and smaller inferential test statistics.

The deleterious impact of using Model (1) to estimate the Stroop effect when generalization to new subjects is intended is illustrated in Figure 1A. The figure shows the results of a simulation of 20 random Stroop experiments, each with 20 participants and 200 trials per participant (100 in each condition). The true population effect—common to all 20 experiments—is assumed to be large. As expected, fitting the simulated data with the fixed-effects model specification in Eq. (1) produces an unreasonably narrow estimate of the uncertainty surrounding the point estimates—observe that, for any given experiment, most of the estimates from the other experiments are well outside the 95% highest posterior density (HPD) interval. Researchers who attempt to naively generalize the estimates obtained using the fixed-effects model to data from new subjects are thus setting themselves up for an unpleasant surprise.
Figure 1: Consequences of mismatch between model specification and generalization intention. Each row represents a simulated Stroop experiment with n = 20 new subjects randomly drawn from the same global population (the ground truth for all parameters is constant over all experiments). Bars display the estimated Bayesian 95% highest posterior density (HPD) intervals for the (fixed) condition effect of interest in each experiment. Experiments are ordered by the magnitude of the point estimate for visual clarity. (A) The fixed-effects model specification in Eq. (1) does not account for random subject sampling, and consequently underestimates the uncertainty associated with the effect of interest. (B) The random-effects specification in Eq. (2) takes subject sampling into account, and produces appropriately calibrated uncertainty estimates.

How might we adjust Model (1) to account for the additional between-subject variance in the data introduced by the stochastic sampling of individuals from a broader population? One standard approach is to fit a model like the following:

yij = β0 + β1 Xij + u0i + u1i Xij + eij    (2)
u0i ∼ N(0, σ²u0)
u1i ∼ N(0, σ²u1)
eij ∼ N(0, σ²e)

Here, we expand Model (1) to include two new terms: u0 and u1, which respectively reflect a set of intercepts and a set of slopes—one pair of terms per subject⁵. The u parameters are assumed (like the error e) to follow a normal distribution centered at zero, with the size of the variance components (i.e., the variances of the groups of random effects) σ²uk estimated from the data.

⁵ To keep things simple, I ignore the question of how one ought to decide whether or not to include both random slopes and random intercepts (for discussion, see Barr, Levy, Scheepers, & Tily, 2013; Matuschek, Kliegl, Vasishth, Baayen, & Bates, 2017). The goal here is simply to elucidate the distinction between fixed and random effects.
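Continuing the earlier sketch, the same simulated data can be fit with the random-effects specification in Eq. (2). The paper's own analyses are Bayesian (hence the HPD intervals in Figure 1); the frequentist mixed model below is an assumed stand-in that makes the same qualitative point:

```python
# Eq. (2): by-subject random intercepts and random condition slopes,
# fit to the `data` frame simulated in the previous sketch.
import statsmodels.formula.api as smf

mixed = smf.mixedlm("rt ~ incongruent", data=data,
                    groups=data["subject"],
                    re_formula="~incongruent").fit()
print(mixed.params["incongruent"], mixed.bse["incongruent"])
```

The point estimate of the condition effect is similar to the fixed-effects fit, but its standard error is now dominated by the between-subject variance in the Stroop effect, and is accordingly much larger.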
Conventionally, the u parameters in Model (2) are referred to as random (or sometimes, varying or stochastic) effects, as distinct from the fixed effects captured by the β terms. There are several ways to conceptualize the distinction between random and fixed effects (Gelman & Hill, 2006), but since our focus here is on generalizability, we will define them this way: fixed effects are used to model variables that must remain constant in order for the model to preserve its meaning across replication studies; random effects are used to model indicator variables that are assumed to be stochastically sampled from some underlying population and can vary across replications without meaningfully altering the research question. In the context of our Stroop example, we can say that the estimated Stroop effect β1 is a fixed effect, because if we were to run another experiment using a different manipulation (say, a Sternberg memory task), we could no longer reasonably speak of the second experiment being a replication of the first. By contrast, psychologists almost invariably think of experimental subjects as a random factor: we are rarely interested in the particular people we happen to have in a given sample, and it would be deeply problematic if two Stroop experiments that differed only in their use of different subjects (randomly sampled from the same population) had to be treated as if they provided estimates of two conceptually distinct Stroop effects.⁶

⁶ A reasonable argument could be made that since no experimental context is ever exactly the same across two measurement occasions, in a technical sense, no design factor is ever truly fixed. Readers who are sympathetic to such an argument (as I also am) should remember Box's dictum that "all models are false, but some are useful", and are invited to construe the choice between fixed and random effects as a purely pragmatic one that amounts to deciding which of two idealizations better approximates reality.

Note that while the model specified in (2) is a substantial improvement over the one specified in (1) if our goal is to draw inferences over populations of subjects, it is not in any meaningful sense the "correct" model. Model (2) is clearly still an extremely simplistic approximation of the true generative processes underlying Stroop data, and, even within the confines of purely linear models, there are many ways in which we could further elaborate on (2) to account for other potentially important sources of variance (e.g., practice or fatigue effects, stimulus-specific effects, measured individual differences in cognitive ability, etc.). In highlighting the difference between models (1) and (2), I simply wish to draw attention to two important and interrelated points.

First, inferences about model parameters are always tied to a particular model specification. A claim like "there is a statistically significant effect of Stroop condition" is not a claim about the world per se; rather, it is a claim about the degree to which a specific model accurately describes the world under certain theoretical assumptions and measurement conditions. Strictly speaking, a statistically significant effect of Stroop condition in Model (1) tells us only that the data we observe would be unlikely to occur under a null model that considers all trials to be completely exchangeable. By contrast, a statistically significant effect in Model (2) for what nominally appears to be the "same" β1 parameter would have a different (and somewhat stronger) interpretation, as we are now entitled to conclude that the data we observe would be unlikely if there were no effect (on average) at the level of individuals randomly drawn from some population.

Second, the validity of an inference depends not just on the model itself, but also on the analyst's (typically implicit) intentions. As discussed earlier, to support valid inference, a statistical model must adequately represent the universe of observations the analyst intends to implicitly generalize over when drawing qualitative conclusions. In our example above, what makes Model (1) a bad model is not the model specification alone, but the fact that the specification aligns poorly with the universe of observations that researchers typically care about. In typical practice, researchers intend their conclusions to apply to entire populations of subjects, and not just to the specific individuals who happened to walk through the laboratory door when the study was run. Critically, then, it is the mismatch between our generalization intention and the model specification that introduces an inflated risk of inferential error, and not the model specification alone. The reason we model subjects as random effects is not that such a practice is objectively better, but rather, that this specification more closely aligns the meaning of the quantitative inference with the meaning of the qualitative hypothesis we're interested in evaluating (for discussion, see Cornfield & Tukey, 1956).
Beyond random subjects

The discussion in the preceding section may seem superfluous to some readers given that, in practice, psychologists almost universally already model subject as a random factor in their analyses. Importantly, however, there is nothing special about subjects. In principle, what goes for subjects also holds for any other factor of an experimental or observational study whose levels the authors intend to generalize over. The reason that we routinely inject extra uncertainty into our models in order to account for between-subject variability is that we want our conclusions to apply to a broader population of individuals, and not just to the specific people we studied. But the same logic also applies to a large number of other factors that we do not routinely model as random effects—stimuli, experimenters, research sites, and so on. As we shall see, applying the random-effects treatment to such factors has momentous implications for our interpretation of a vast array of published findings in psychology.

The stimulus-as-fixed-effect fallacy

A paradigmatic example of a design factor that psychologists almost universally—and incorrectly—model as a fixed rather than random factor is experimental stimuli. The tendency to ignore stimulus sampling variability has been discussed in the literature for over 50 years (Baayen, Davidson, & Bates, 2008; Clark, 1973; Coleman, 1964; Judd, Westfall, & Kenny, 2012), and was influentially dubbed the fixed-effect fallacy by Clark (1973). Unfortunately, outside of a few domains such as psycholinguistics, it remains rare to see psychologists model stimuli as random effects—despite the fact that most inferences researchers draw are clearly meant to generalize over entire populations of stimuli. The net result is that, strictly speaking, the inferences routinely drawn throughout much of psychology can only be said to apply to a specific—and usually small—set of stimuli. Generalization to the broader class of stimuli like the ones used is not licensed.

It is difficult to overstate how detrimental an impact the stimulus-as-fixed-effect fallacy has had—and continues to have—in psychology. Empirical studies in domains ranging from social psychology to functional MRI have demonstrated that test statistic inflation of up to 300% is not uncommon, and that, under realistic assumptions, false positive rates in many studies could easily exceed 60% (Judd et al., 2012; Westfall, Nichols, & Yarkoni, 2016; Wolsiefer, Westfall, & Judd, 2017). In cases where subject sample sizes are very large, stimulus samples are very small, and stimulus variance is large, the false positive rate theoretically approaches 100%.

The clear implication of such findings is that many literatures within psychology are likely to be populated by studies that have spuriously misattributed statistically significant effects to fixed effects of interest when they should actually be attributed to stochastic variation in uninteresting stimulus properties. Moreover, given that different sets of stimuli are liable to produce effects in opposite directions (e.g., when randomly sampling 20 nouns and 20 verbs, some samples will show a statistically significant noun > verb effect, while others will show the converse), it is not hard to see how one could easily end up with entire literatures full of "mixed results" that seem statistically robust in individual studies, yet cannot be consistently replicated across studies.

Generalizing the generalizability problem

The stimulus-as-fixed-effect fallacy is but one special case of a general tradeoff between precision of estimation and breadth of generalization. Each additional random factor one adds to a model licenses generalization over a corresponding population of potential measurements, expanding the scope of inference beyond only those measurements that were actually obtained. However, adding random factors to one's model also typically increases the uncertainty with which the fixed effects of interest are estimated. The fact that most psychologists have traditionally modeled only subject as a random factor—and have largely ignored the variance introduced by stimulus sampling—is probably best understood as an accident of history (or, more charitably perhaps, of technological limitations, as the software and computing resources required to fit such models were hard to come by until fairly recently).
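Before moving on, a toy simulation may help make the stimulus-as-fixed-effect problem concrete. The sketch below is not taken from any of the studies cited above; the design (two conditions with 20 stimuli each, 50 subjects, and no true condition effect) and all variance components are assumed values chosen purely for illustration:

```python
# When stimuli are sampled but analyzed as fixed, the particular stimulus sample
# shifts every subject's condition difference by the same amount, which a
# subjects-only analysis mistakes for a real effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_subjects, n_stim_per_cond = 50, 20
sigma_stim, sigma_error = 0.3, 1.0   # hypothetical variance components
n_sims, false_positives = 2000, 0

for _ in range(n_sims):
    # Fresh stimulus sample for each simulated study; the condition itself has
    # zero true effect, so every "significant" result is a false positive.
    stim_effects = rng.normal(0, sigma_stim, size=2 * n_stim_per_cond)
    cond = np.repeat([0, 1], n_stim_per_cond)
    # Every subject responds once to every stimulus.
    y = stim_effects[None, :] + rng.normal(0, sigma_error,
                                           size=(n_subjects, 2 * n_stim_per_cond))
    # Stimulus-as-fixed-effect analysis: average over stimuli within condition,
    # then run a paired t-test across subjects only.
    mean_a = y[:, cond == 0].mean(axis=1)
    mean_b = y[:, cond == 1].mean(axis=1)
    t, p = stats.ttest_rel(mean_a, mean_b)
    false_positives += p < 0.05

print(f"False positive rate at alpha = .05: {false_positives / n_sims:.2f}")
```

With these (fairly modest) assumed values the nominal 5% error rate is exceeded several times over; modeling stimulus as a random factor, or averaging over many independent stimulus samples, removes the inflation.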
Unfortunately, just as the generalizability problem doesn't begin and end with subjects, it also doesn't end with subjects and stimuli. Exactly the same considerations apply to all other aspects of one's experimental design or procedure that could in principle be varied without substantively changing the research question. Common design factors that researchers hardly ever vary, yet almost invariably intend to generalize over, include experimental task, between-subject instructional manipulation, research site, experimenter (or, in clinical studies, therapist; e.g., Crits-Christoph & Mintz, 1991), instructions, weather, time of day, lighting conditions in the testing room, and so on and so forth effectively ad infinitum.

Naturally, the degree to which each such factor matters will vary widely across domain and research question. I'm not suggesting that most statistical inferences in psychology are invalidated by researchers' failure to explicitly model what their participants ate for breakfast three days prior to participating in a study. Collectively, however, unmodeled factors almost always contribute substantial variance to the outcome variable. Failing to model such factors appropriately (or at all) means that a researcher will end up either (a) running studies with substantially higher-than-nominal false positive rates, or (b) drawing inferences that technically apply only to very narrow, and usually uninteresting, slices of the universe the researcher claims to be interested in.

Case study: Verbal overshadowing

To illustrate the magnitude of the problem, it may help to consider an example. Alogna and colleagues (2014) conducted a large-scale "registered replication report" (RRR; Simons, Holcombe, & Spellman, 2014) involving 31 sites and over 2,000 participants. The study sought to replicate an influential experiment by Schooler and Engstler-Schooler (1990) in which the original authors showed that participants who were asked to verbally describe the appearance of a perpetrator caught committing a crime on video showed poorer recognition of the perpetrator following a delay than participants assigned to a control task (naming as many countries and capitals as they could). Schooler & Engstler-Schooler (1990) dubbed this the verbal overshadowing effect. In both the original and replication experiments, only a single video, containing a single perpetrator, was presented at encoding, and only a single set of foil items was used at test. Alogna et al. successfully replicated the original result in one of two tested conditions, and concluded that their findings revealed "a robust verbal overshadowing effect" in that condition.
Let us assume for the sake of argument that there is a genuine and robust causal relationship between the manipulation and outcome employed in the Alogna et al. study. I submit that there would still be essentially no support for the authors' assertion that they found a "robust" verbal overshadowing effect, because the experimental design and statistical model used in the study simply cannot support such a generalization. The strict conclusion we are entitled to draw, given the limitations of the experimental design inherited from Schooler and Engstler-Schooler (1990), is that there is at least one particular video containing one particular face that, when followed by one particular lineup of faces, is more difficult for participants to identify if they previously verbally described the appearance of the target face than if they were asked to name countries and capitals. This narrow conclusion does not preclude the possibility that the observed effect is specific to this one particular stimulus, and that many other potential stimuli the authors could have used would have eliminated or even reversed the observed effect. (In later sections, I demonstrate that the latter conclusion is statistically bound to be true given even very conservative background assumptions about the operationalization, and also that one can argue from first principles—i.e., without any data at all—that there must be many stimuli that show a so-called verbal overshadowing effect.)

Of course, stimulus sampling is not the only unmodeled source of variability we need to worry about. We also need to consider any number of other plausible sources of variability: research site, task operationalization (e.g., timing parameters, modality of stimuli or responses), instructions, and so on. On any reasonable interpretation of the construct of verbal overshadowing, the corresponding universe of intended generalization should clearly also include most of the operationalizations that would result from randomly sampling various combinations of these factors (e.g., one would expect it to still count as verbal overshadowing if Alogna et al. had used live actors to enact the crime scene, instead of showing a video)⁷. Once we accept this assumption, however, the critical question researchers should immediately ask themselves is: are there other psychological processes besides verbal overshadowing that could plausibly be influenced by random variation in any of these uninteresting factors, independently of the hypothesized psychological processes of interest? A moment or two of consideration should suffice to convince one that the answer is a resounding yes. It is not hard to think of dozens of explanations unrelated to verbal overshadowing that could explain the causal effect of a given manipulation on a given outcome in any single operationalization⁸.

⁷ That even small differences in such factors can have large impacts on the outcome is clear from the Alogna et al. (2014) study itself: due to an error in the timing of different components of the procedure, Alogna et al. actually conducted two large replication studies. They observed a markedly stronger effect when the experimental task was delayed by 20 minutes than when it immediately followed the video.

⁸ For example, perhaps participants in Alogna et al.'s experimental condition felt greater pressure to produce the correct answer (having previously spent several minutes describing their perceptions), and it was the stress rather than the treatment per se that resulted in poorer performance. Or, perhaps the effect had nothing at all to do with the treatment condition, and instead reflected a poor choice of control condition (say, because naming countries and capitals incidentally activates helpful memory consolidation processes). And so on and so forth. (A skeptic might object that such explanations are not as plausible as the verbal overshadowing account, but this misses the point: safely generalizing the results of the narrow Schooler & Engstler-Schooler (1990) design to the broad construct of verbal overshadowing implies that one can rule out the influence of all other confounds in the aggregate—and reality is not under any obligation to only manifest sparse causal relationships that researchers find intuitive!)

This verbal overshadowing example is by no means unusual. The same concerns apply equally to the broader psychology literature containing tens or hundreds of thousands of studies that routinely adopt similar practices. In most of psychology, it is standard operating procedure for researchers employing just one experimental task, between-subject manipulation, experimenter, testing room, research site, etc., to behave as though an extremely narrow operationalization is an acceptable proxy for a much broader universe of admissible observations. It is instructive—and somewhat fascinating from a sociological perspective—to observe that while no psychometrician worth their salt would ever recommend a default strategy of measuring complex psychological constructs using a single unvalidated item, the vast majority of psychology studies do precisely that with respect to multiple key design factors. The modal approach is to stop at a perfunctory demonstration of face validity—that is, to conclude that if a particular operationalization seems like it has something to do with the construct of interest, then it is an acceptable stand-in for that construct. Any measurement-level findings are then uncritically generalized to the construct level, leading researchers to conclude that they've learned something useful about broader phenomena like verbal overshadowing, working memory, ego depletion, etc., when in fact such sweeping generalizations typically obtain little support from the reported empirical studies.

Unmeasured factors

In an ideal world, generalization failures like those described above could be addressed primarily via statistical procedures—e.g., by adding new random effects to models. In the real world, this strategy is a non-starter: in most studies, the vast majority of factors that researchers intend to implicitly generalize over don't actually observably vary in the data, and therefore can't be accounted for using traditional mixed-effects models. Unfortunately, the fact that one has failed to introduce or measure variation in one or more factors doesn't mean those factors can be safely ignored. Any time one samples design elements into one's study from a broader population of possible candidates, one introduces sampling error that is likely to influence the outcome of the study to some unknown degree.
Suppose we generalize our earlier Model (2) to include all kinds of random design factors that we have no way of directly measuring:

yij = β0 + β1 X1ij + u0ij + u1ij + ... + ukij + eij    (3)
ukij ∼ N(0, σ²uk)
eij ∼ N(0, σ²e)

Here, u0 ... uk are placeholders for all of the variance components that we implicitly consider part of the universe of admissible observations, but that we have no way of measuring or estimating in our study. It should be apparent that our earlier Model (2) is just a special case of (3) where the vast majority of the uk and σ²uk terms are fixed to 0. That is—and this is arguably the most important point in this paper—the conventional "random effects" model (where in actuality only subjects are modeled as random effects) assumes exactly zero effect of site, experimenter, stimuli, task, instructions, and every other factor except subject—even though in the vast majority of cases it's safe to assume that such effects exist and are non-trivial, and even though authors almost invariably start behaving as if their statistical models did in fact account for such effects as soon as they reach the Discussion section.

Estimating the impact

We do not have to take the urgency of the above exhortation on faith. While it's true that we can't directly estimate the population magnitude of variance components that showed no observable variation in our sample, we can still simulate their effects under different assumptions. Doing so allows us to demonstrate empirically that even modest assumptions about the magnitude of unmeasured variance components are sufficient to completely undermine many conventional inferences about fixed effects of interest.

To illustrate, let's return to Alogna et al.'s (2014) verbal overshadowing RRR. Recall that the dataset included data from over 2,000 subjects sampled at 31 different sites, but used exactly the same experimental protocol (including the same single stimulus sequence) at all sites. Since most of the data are publicly available, we can fit a mixed-effects model to try and replicate the reported finding of a "robust verbal overshadowing effect". Both the dataset and the statistical model used here differ somewhat from the ones in Alogna et al. (2014)⁹, but the differences are immaterial for our purposes. As Figure 2 illustrates (top row, labeled σ²unmeasured = 0), we can readily replicate the key finding from Alogna et al. (2014): participants assigned to the experimental condition were more likely to misidentify the perpetrator seen in the original video.

⁹ The model differs in that I fit a single mixed-effects linear probability model with random intercepts and slopes for sites, whereas Alogna et al. first computed the mean difference in response accuracy between conditions for each site, and then performed a random-effects meta-analysis (note that a logistic regression model would be appropriate here given the binary outcome, but I opted for the linear model for the sake of consistency with Alogna et al. and simplicity of presentation). The data differ because (a) some sites' datasets were not publicly available, (b) I made no attempt to adhere closely to the reported preprocessing procedures (e.g., inclusion/exclusion criteria), and (c) I used only the data from the (more successful) second RRR reported in the paper. All data and code used in the analyses reported here are available at https://github.com/tyarkoni/generalizability.

We now ask the following question: how would the key result depicted in the top row of Fig. 2 change if we knew the size of the variance component associated with random stimulus sampling? This question cannot be readily answered using classical inferential procedures (because there's only a single stimulus in the dataset, so the variance component is non-identifiable), but is trivial to address using a Bayesian estimation framework. Specifically, we fit the following model:

yps = β0 + β1 Xps + u0s + u1s Xps + u2 Xps + eps    (4)
u0s ∼ N(0, σ²u0)
u1s ∼ N(0, σ²u1)
u2 ∼ N(0, σ²u2)
eps ∼ N(0, σ²e)
Here, p indexes participants, s indexes sites, Xps indexes the experimental condition assigned to participant p at site s, the β terms encode the fixed intercept and condition slope, and the u terms encode the random effects (site-specific intercepts u0, site-specific slopes u1, and the stimulus effect u2). The novel feature of this model is the inclusion of u2, which would ordinarily reflect the variance in outcome associated with random stimulus sampling, but is constant in our dataset (because there's only a single stimulus).

Unlike the other parameters, we cannot estimate u2 from the data. Instead, we fix its prior during estimation, by setting σ²u2 to a specific value. While the posterior estimate of u2 is then necessarily identical to its prior (because the prior makes no contact with the data), and so is itself of no interest, the inclusion of the prior has the incidental effect of (appropriately) increasing the estimation uncertainty around the fixed effect(s) of interest. Conceptually, one can think of the added prior as a way of quantitatively representing our uncertainty about whether any experimental effect we observe should really be attributed to verbal overshadowing per se, as opposed to irrelevant properties of the specific stimulus we happened to randomly sample into our experiment. By varying the amount of variance injected in this way, we can study the conditions under which the conclusions obtained from the "standard" model (i.e., one that assumes zero effect of stimuli) would or wouldn't hold.
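The analysis code itself accompanies the paper at the repository linked in footnote 9; the fragment below is only a rough sketch of the prior-fixing idea, written in PyMC, with column names, priors, and sampler settings that are my own assumptions rather than the paper's:

```python
# Mirror of Eq. (4) as a linear probability model. `df` is assumed to contain a
# binary `accuracy` outcome, a 0/1 `condition` code, and integer-coded `site`.
import numpy as np
import pymc as pm

def fit_with_unmeasured_variance(df, sigma2_unmeasured):
    site = df["site"].to_numpy()
    x = df["condition"].to_numpy()
    y = df["accuracy"].to_numpy()
    n_sites = int(site.max()) + 1

    with pm.Model():
        b0 = pm.Normal("b0", 0.0, 1.0)                   # fixed intercept
        b1 = pm.Normal("b1", 0.0, 1.0)                   # fixed condition effect
        sd_u0 = pm.HalfNormal("sd_u0", 1.0)
        sd_u1 = pm.HalfNormal("sd_u1", 1.0)
        u0 = pm.Normal("u0", 0.0, sd_u0, shape=n_sites)  # site intercepts
        u1 = pm.Normal("u1", 0.0, sd_u1, shape=n_sites)  # site condition slopes
        # u2 never varies in the data, so its scale cannot be estimated; we fix
        # its prior SD and simply let it propagate extra uncertainty into b1.
        u2 = (pm.Normal("u2", 0.0, float(np.sqrt(sigma2_unmeasured)))
              if sigma2_unmeasured > 0 else 0.0)
        sigma_e = pm.HalfNormal("sigma_e", 1.0)
        mu = b0 + (b1 + u2) * x + u0[site] + u1[site] * x
        pm.Normal("y", mu, sigma_e, observed=y)
        return pm.sample(1000, tune=1000, target_accept=0.9)
```

Refitting with sigma2_unmeasured set to 0, 0.05, and 0.5 reproduces the logic of the comparisons discussed below: the posterior interval for b1 widens as more unmeasured variance is assumed.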
Figure 2: Effects of unmeasured variance components on the putative "verbal overshadowing" effect. Error bars display the estimated Bayesian 95% highest posterior density (HPD) intervals for the experimental effect reported in Alogna et al. (2014). Positive estimates indicate better performance in the control condition than in the experimental condition. Each row represents the estimate from the model specified in Eq. (4), with only the size of σ²unmeasured (corresponding to σ²u2 in Eq. (4)) varying as indicated. This parameter represents the assumed contribution of all variance components that are unmeasured in the experiment, but fall within the universe of intended generalization conceptually. The top row (σ²u2 = 0) can be interpreted as a conventional model analogous to the one reported in Alogna et al. (2014)—i.e., it assumes that no unmeasured sources have any impact on the putative verbal overshadowing effect.

As it turns out, injecting even a small amount of stimulus sampling variance into the model has momentous downstream effects. If we very conservatively set σ²u2 to 0.05, the resulting posterior distribution for the condition effect expands to include negative values within the 95% HPD (Fig. 2). For perspective, 0.05 is considerably lower than the between-site variance estimated from these data (σ²u1 = 0.075)—and it's quite unlikely that there would be less variation between different stimuli at a given site than between different sites for the same stimulus (as reviewed above, in most domains where stimulus effects have been quantitatively estimated, they tend to be large). Thus, even under very conservative assumptions about how much variance might be associated with stimulus sampling, there is little basis for concluding that there is a general verbal overshadowing effect. To draw Alogna et al.'s conclusion that there is a "robust" verbal overshadowing effect, one must effectively equate the construct of verbal overshadowing with almost exactly the operationalization tested by Alogna et al. (and Schooler & Engstler-Schooler before that), down to the same single video.

Of course, stimulus variance isn't the only missing variance component we ought to worry about. As Eq. (3) underscores, many other components are likely to contribute non-negligible variance to outcomes within our universe of intended generalization. We could attempt to list these components individually and rationally estimate their plausible magnitudes if we like, but an alternative route is to invent an omnibus parameter, σ²unmeasured, that subsumes all of the unmeasured variance components we expect to systematically influence the condition estimate β1. Then we can repeat our estimation of the model in Eq. (4) with larger values of σ²u2 (for the sake of convenience, I treat σ²u2 and σ²unmeasured interchangeably, as the difference is only that the latter is larger than the former).
For example, suppose we assume that the hypothet- dom effects is that the latter specification allows us to
ical aggregate influence of all the unmodeled variance generalize to theoretically exchangeable observations
components roughly equals the residual within-site (e.g., new subjects sampled from the same population),
variance estimated in our data. (i.e., σunmeasured
2
= whereas the former does not. In practice, however, the
σeps ≈ 0.5). This is arguably still fairly conservative
2
vast majority of psychologists have no compunction
when one considers that the aggregate σunmeasured
2 about verbally generalizing their results not only to
now includes not only stimulus sampling effects, but previously unseen subjects, but also to all kinds of
also the effects of differences in task operationaliza- other factors that have not explicitly been modeled—
tion, instructions, etc. In effect, we are assuming to new stimuli, experimenters, research sites, and so
that the net contribution of all of the uninteresting on.
factors that vary across the entire universe of observa- Under such circumstances, it’s unclear why any-
tions we consider “verbal overshadowing” is no bigger one should really care about the inferential statistics
than the residual error we observe for this one par- psychologists report in most papers, seeing as those
ticular operationalization. Yet fixing σunmeasured
2
to statistics bear only the most tenuous of connections to
0.5 renders our estimate of the experimental effect authors’ sweeping verbal conclusions. Why take pains
essentially worthless: the 95% HPD interval for the to ensure that subjects are modeled in a way that af-
putative verbal overshadowing effect now spans val- fords generalization beyond the observed sample—as
ues between -0.8 and 0.91—almost the full range of nearly all psychologists reflexively do—while raising
possible values! The upshot is that, even given very no objection whatsoever when researchers freely gen-
conservative background assumptions, the massive eralize their conclusions across all manner of variables
Alogna et al. study—an initiative that drew on the ef- that weren’t explicitly included in the model at all?
forts of dozens of researchers around the world—does Why not simply model all experimental factors, in-
not tell us much about the general phenomenon of cluding subjects, as fixed effects—a procedure that
verbal overshadowing. Under more realistic assump- would, in most circumstances, substantially increase
tions, it tells us essentially nothing. The best we can the probability of producing the sub-.05 p-values psy-
say, if we are feeling optimistic, is that it might tell chologists so dearly crave? Given that we’ve already
us something about one particular operationalization resolved to run roughshod over the relationship be-
of verbal overshadowing10 . tween our verbal theories and their corresponding
The rather disturbing implication of all this is that, quantitative specifications, why should it matter if
in any research area where one expects the aggregate we sacrifice the sole remaining sliver of generality af-
contribution of the missing σu2 terms to be large—i.e., forded by our conventional “random effects” models
anywhere that “contextual sensitivity” (Van Bavel, on the altar of the Biggest Possible Test Statistic?
Mende-Siedlecki, Brady, & Reinero, 2016) is high— It’s hard to think of a better name for this kind of
the inferential statistics generated from models like behavior than what Feynman famously dubbed cargo
(2) will often underestimate the true uncertainty sur- cult science (Feynman, 1974)—an obsessive concern
rounding the parameter estimates to such a degree with the superficial form of a scientific activity rather
as to make an outright mockery of the effort to learn than its substantive empirical and logical content.
something from the data using conventional inferen- Psychologists are trained to believe that their abil-
tial tests. Recall that the nominal reason we care ity to draw meaningful inferences depends to a large
about whether subjects are modeled as fixed or ran- extent on the production of certain statistical quan-
tities (e.g., p-values below .05, BFs above 10, etc.),
10 We should probably be cautious in drawing even this nar-
so they go to great effort to produce such quantities.
row conclusion, however, because the experimental procedure
in question could very well be producing the observed effect due
That these highly contextualized numbers typically
to idiosyncratic and uninteresting properties, and not because have little to do with the broad verbal theories and
it induces verbal overshadowing per se. hypotheses that researchers hold in their heads, and

12
take themselves to be testing, does not seem to trou- that have harnessed the enormous collective efforts of
ble most researchers much. The important thing, it dozens of labs (e.g., Acosta et al., 2016; Alogna et al.,
appears, is that the numbers have the right form. 2014; Cheung et al., 2016; Eerland et al., 2016)—only
to waste that collective energy on direct replications of
a handful of poorly-validated experimental paradigms.
A crisis of replicability or of gen- While there is no denying that large, collaborative
eralizability? efforts could have enormous potential benefits (and
there are currently a number of promising initiatives,
It is worth situating the above concerns within the e.g., the Psychological Science Accelerator (Moshontz
broader ongoing “replication crisis” in psychology and et al., 2018) and ManyBabies Consortium (Bergelson
other sciences (Lilienfeld, 2017; Pashler & Wagenmak- et al., 2017)), realizing these benefits will require
ers, 2012; Shrout & Rodgers, 2018). My perspective a willingness to eschew direct replication in cases
on the replicability crisis broadly accords with other where the experimental design of the to-be-replicated
commentators who have argued that the crisis is real study is fundamentally uninformative. Researchers
and serious, in the sense that there is irrefutable evi- must be willing to look critically at previous studies
dence that questionable research practices (Gelman and flatly reject—on logical and statistical, rather
& Loken, 2013; John, Loewenstein, & Prelec, 2012; than empirical, grounds—assertions that were never
Simmons, Nelson, & Simonsohn, 2011) and strong se- supported by the data in the first place, even under
lection pressures (Francis, 2012; Kühberger, Fritz, & the most charitable methodological assumptions. A
Scherndl, 2014) have led to the publication of a large recognition memory task that uses just one video, one
proportion of spurious or inflated findings that are target face, and one set of foils simply cannot provide
unlikely to replicate (J. Ioannidis, 2008; J. P. A. Ioan- a meaningful test of a broad construct like verbal
nidis, 2005; Yarkoni, 2009). Accordingly, I think the overshadowing, and it does a disservice to the field
ongoing shift towards practices such as preregistration, to direct considerable resources to the replication of
reporting checklists, data sharing, etc. is a welcome such work. The appropriate response to a study like
development that will undoubtedly help improve the Schooler & Engstler-Schooler (1990) is to point out
reproducibility and replicability of psychology find- that the very narrow findings the authors reported did
ings. not—and indeed, could not, no matter how the data
came out—actually support the authors’ sweeping
At the same time, the current focus on reproducibil-
claims. Consequently, the study does not deserve any
ity and replicability risks distracting us from more
follow-up until such time as its authors can provide
important, and logically antecedent, concerns about
more compelling evidence that a phenomenon of any
generalizability. The root problem is that when the
meaningful generality is being observed.
manifestation of a phenomenon is highly variable
across potential measurement contexts, it simply does The same concern applies to many other active sta-
not matter very much whether any single realization tistical and methodological debates. Is it better to use
is replicable or not (cf. Gelman, 2015, 2018). Ongo- a frequentist or a Bayesian framework for hypothesis
ing efforts to ensure the superficial reproducibility testing (Kruschke & Liddell, 2017; Rouder, Speckman,
and replicability of effects—i.e., the ability to obtain Sun, Morey, & Iverson, 2009; Wagenmakers, 2007)?
a similar-looking set of numbers from independent Should we move the conventional threshold for statisti-
studies—are presently driving researchers in psychol- cal significance from .05 to .005 (Benjamin et al., 2018;
ogy and other fields to expend enormous resources on Lakens et al., 2018; McShane, Gal, Gelman, Robert,
studies that are likely to have very little informational & Tackett, 2019)? A lot of ink continues to be spilled
value even in cases where results can be consistently over such issues, yet in any research area where effects
replicated. This is arguably clearest in the case of are highly variable (i.e., in most of psychology), the
large-scale “registered replication reports” (RRRs) net contribution of such methodological and analyti-

13
To be clear, my suggestion is not that researchers should stop caring about methodological or statistical problems that presently limit reproducibility and replicability. Such considerations are undeniably important. My argument, rather, is that these considerations should be reserved for situations where the verbal conclusions drawn by researchers demonstrably bear some non-trivial connection to the reported quantitative analyses. The mere fact that a previous study has had a large influence on the literature is not a sufficient reason to expend additional resources on replication. On the contrary, the recent movement to replicate influential studies using more robust methods risks making the situation worse, because in cases where such efforts superficially "succeed" (in the sense that they obtain a statistical result congruent with the original), researchers then often draw the incorrect conclusion that the new data corroborate the original claim (e.g., Alogna et al., 2014)—when in fact the original claim was never supported by the data in the first place. A more appropriate course of action in cases where there are questions about the internal coherence and/or generalizability of a finding is to first focus a critical eye on the experimental design, measurement approach, and model specification. Only if a careful review suggests that these elements support the claims made by a study's authors should researchers begin to consider conducting a replication.

Where to from here?

A direct implication of the arguments laid out above is that a huge proportion of the quantitative inferences drawn in the published psychology literature are so inductively weak as to be at best questionable and at worst utterly nonsensical. The difficult question I take up now is what we ought to do about this. I suggest three broad and largely disjoint courses of action researchers can pursue that would, in the aggregate, considerably improve the quality of research in psychological science.

Do something else

One perfectly reasonable course of action when faced with the difficulty of extracting meaningful, widely generalizable conclusions from effects that are inherently complex and highly variable is to opt out of the enterprise entirely. There is an unfortunate cultural norm within psychology (and, to be fair, many other fields) to demand that every research contribution end on a wholly positive or "constructive" note. This is an indefensible expectation that I won't bother to indulge. In life, you often can't have what you want, no matter how hard you try. In such cases, I think it's better to recognize the situation for what it is sooner rather than later. The fact that a researcher is able to formulate a question in his or her head that seems sensible (for example, "does ego depletion exist?") doesn't mean that the question really is sensible. Moreover, even when the question is a sensible one to ask (in the sense that it's logically coherent and seems theoretically meaningful), it doesn't automatically follow that it's worth trying to obtain an empirical answer. In many research areas, if generalizability concerns were to be taken seriously, the level of effort required to obtain even minimally informative answers to seemingly interesting questions would likely so far exceed conventional standards that I suspect many academic psychologists would, if they were dispassionate about the matter, simply opt out. I see nothing wrong with such an outcome, and think it is a mistake to view a career in psychology (or any other academic field) as a higher calling of some sort.
Admittedly, the utility of this advice depends on one's career stage, skills, and interests. It should not be terribly surprising if few tenured professors are eager to admit (even to themselves) that they have, as Paul Meehl rather colorfully put it, "achieved some notoriety, tenure, economic security and the like by engaging, to speak bluntly, in a bunch of nothing" (P. E. Meehl, 1990b, p. 230). The situation is more favorable for graduate students and postdocs, who have much less to lose (and potentially much more to gain) by pursuing alternative careers. To be clear, I'm not suggesting that a career in academic psychology isn't a worthwhile pursuit for anyone; for many people, it remains an excellent choice. But I do think all psychologists, and early-career researchers in particular, owe it to themselves to spend some time carefully and dispassionately assessing the probability that the work they do is going to contribute meaningfully—even if only incrementally—to our collective ability either to understand the mind or to practically improve the human condition. There is no shame whatsoever in arriving at a negative answer, and the good news is that, for people who have managed to obtain a PhD (or have the analytical skills to do so), career prospects outside of academia have arguably never been brighter.

Embrace qualitative analysis

A second approach one can take is to keep doing psychological research, but to largely abandon inferential statistical methods in favor of qualitative analysis. This may seem like a radical prescription, but I contend that a good deal of what currently passes for empirical psychology is already best understood as insightful qualitative analysis dressed up as shoddy quantitative science. I submit that if one takes the time to reflect on the way researchers in many fields of psychology currently operationalize their theories, it often becomes clear that at least one of two things is true: either the theory is so vague as to be effectively unfalsifiable, or the central postulates of the theory are so obviously true that there is nothing to be gained by subjecting them to empirical tests—effectively constituting what Smedslund (1991) dubbed "pseudoempirical research".

To see what I mean, let's return to our running example of verbal overshadowing. To judge by the accumulated literature (for reviews, see Meissner & Brigham, 2001; Meissner & Memon, 2002), the question of whether verbal overshadowing is or is not a "real" phenomenon seems to be taken quite seriously by many researchers. Yet it's straightforward to show that some phenomenon like verbal overshadowing must exist given even the most basic, uncontroversial facts about the human mind. Consider the following set of statements:

1. The human mind has a finite capacity to store information.

2. There is noise in the information-encoding process.

3. Different pieces of information will sometimes interfere with one another during decision-making—either because they directly conflict, or because they share common processing bottlenecks.

None of the above statements should be at all controversial, yet the conjunction of the three logically entails that there will be (many) situations in which something we could label verbal overshadowing will predictably occur. Suppose we take the set of all situations in which a person witnesses, and encodes into memory, a crime taking place. In some subset of these cases, that person will later reconsider, and verbally re-encode, the events they observed. Because the encoding process is noisy, and conversion between different modalities is necessarily lossy, some details will be overemphasized, underemphasized, or otherwise distorted. And because different representations of the same event will conflict with one another, it is then guaranteed that there will be situations in which the verbal re-consideration of information at Time 2 will lead a person to incorrectly ignore information they may have correctly encoded at Time 1. We can call this verbal overshadowing if we like, but there is nothing about the core idea that requires any kind of empirical demonstration. So long as it's framed strictly in broad qualitative terms, the "theory" is trivially true; the only way it could be false is if at least one of the three statements listed above is false—which is almost impossible to imagine. (Note, too, that the inverse of the theory is also trivially true: there must be many situations in which lossy re-encoding of information across modalities actually ends up being beneficial.)
To be clear, I am not suggesting that there's no point in engaging in quantitative analysis of broad putative constructs like verbal overshadowing. On the contrary: if the goal is to develop models detailed enough to make useful real-world predictions, quantitative analysis is essential. It would be difficult to make real-world predictions about when, where, and to what extent verbal overshadowing will manifest unless one has systematically studied and modeled the putative phenomenon under a broad range of conditions—including extensive variation of the perceptual stimuli, viewing conditions, rater incentives, timing parameters, and so on and so forth. The point is that taking this quantitative objective seriously would require much larger and more complex datasets, experimental designs, and statistical models than have typically been deployed in most areas of psychology. As such, psychologists intent on working in "soft" domains who are unwilling to learn potentially challenging new modeling skills—or to spend months or years trying to meticulously address "minor" methodological concerns that presently barely rate any mention in papers—may need to accept that their work is, at root, qualitative in nature, and that the inferential statistics so often reported in soft psychology articles primarily serve as a performative ritual intended to convince one's colleagues and/or one's self that something very scientific and important is taking place.

What would a qualitative psychology look like? In many sub-fields, almost nothing would change. The primary difference is that researchers would largely stop using inferential statistics, restricting themselves instead to descriptive statistics and qualitative discussion. Such a policy is not without precedent: in 2014, the journal Basic and Applied Social Psychology banned the reporting of p-values from all submitted manuscripts (Trafimow, 2014; Trafimow & Marks, 2015). Although the move was greeted with derision by many scientists (Woolston, 2015), what is problematic about the BASP policy is, in my view, only that the abolition of inferential statistics was made mandatory. Framed as a strong recommendation that psychologists should avoid reporting inferential statistics that they often do not seem to understand, and that have no clear implications for our understanding of, or interaction with, the world, I think there would be much to like about the policy.

For many psychologists, fully embracing qualitative analysis would provide an explicit license to do what they are already most interested in doing—namely, playing with big ideas, generalizing conclusions far and wide, and moving swiftly from research question to research question. The primary cost would be the reputational one: in a world where most psychology papers are no longer accompanied by scientific-looking inferential statistics, journalists and policy-makers would probably come knocking on our doors less often. I don't deny that this is a non-trivial cost, and I can understand why many researchers would be hesitant to pay such a toll. But such is life. I don't think it requires a terribly large amount of intellectual integrity to appreciate that one shouldn't portray one's self as a serious quantitative scientist unless one is actually willing to do the corresponding work.

Adopt better standards

The previous two suggestions are not a clumsy attempt at dark humor; I am firmly convinced that many academic psychologists would be better off either pursuing different careers, or explicitly acknowledging the fundamentally qualitative nature of their work (I lump myself into the former group perhaps half the time). For the remainder—i.e., those who would like to approach their research from a more quantitatively defensible perspective—there are a number of practices that, if deployed widely, could greatly improve the quality and reliability of psychological inference without necessarily requiring us to give up on the idea that what we're doing here is quantitative science, dammit.
Draw more conservative inferences

Perhaps the most obvious solution to the generalizability problem is for authors to draw much more conservative inferences in their manuscripts—and in particular, to replace the hasty generalizations pervasive in contemporary psychology with slower, more cautious conclusions that hew much more closely to the available data. Concretely, researchers should avoid extrapolating beyond the universe of observations implied by their experimental designs and statistical models. Potentially relevant design factors that are impractical to measure or manipulate, but that conceptual considerations suggest are likely to have non-trivial effects (e.g., effects of stimuli, experimenter, research site, culture, etc.), should be identified and disclosed to the best of authors' ability. Papers should be given titles like "Transient manipulation of self-reported anger influences small hypothetical charitable donations", and not ones like "Hot head, warm heart: Anger increases economic charity". I strongly endorse the recent suggestion by Simons and colleagues that most manuscripts should include a Constraints on Generality statement that explicitly defines the boundaries of the universe of observations the authors believe their findings apply to (Simons et al., 2017).

Correspondingly, when researchers evaluate results reported by others, credit should only be given for what the empirical results of a study actually show—and not for what its authors claim they show. Continually emphasizing the importance of the distinction between verbal constructs and observable measurements would go a long way towards clarifying which existing findings are worth replicating and which are not. If researchers develop a habit of mentally reinterpreting a claim like "we provide evidence of ego depletion" as "we provide evidence that crossing out the letter e slightly decreases response accuracy on a subsequent Stroop task", I suspect that many findings would no longer seem important enough to warrant any kind of follow-up—at least, not until the original authors have conducted considerable additional work to demonstrate the generalizability of the claimed phenomenon.

Take descriptive research more seriously

Traditionally, purely descriptive research—where researchers seek to characterize and explore relationships between measured variables without imputing causal explanations or testing elaborate verbal theories—is looked down on in many areas of psychology. The stigma is unfortunate, and discourages publication of articles making more modest, guarded claims. I suspect it stems to a significant extent from a failure to recognize and internalize just how fragile many psychological phenomena truly are. Acknowledging the value of empirical studies that do nothing more than carefully describe the relationships between a bunch of variables under a wide range of conditions would go some ways towards atoning for our unreasonable obsession with oversimplified causal explanations.

We know that a discipline-wide move from oversimplistic causal explanation to rigorous description is possible, because at least one other field has successfully accomplished it. In genetics, the small-sample candidate gene studies of the 1990s (e.g., Ebstein et al., 1996; Lesch et al., 1996)—which almost invariably produced spurious results (Chabris et al., 2012; Colhoun, McKeigue, & Davey Smith, 2003; Sullivan, 2007), and were motivated by elegant theoretical hypotheses that seem laughably simplistic in hindsight—have all but disappeared in favor of genome-wide association studies (GWAS) involving hundreds of thousands of subjects (Nagel et al., 2018; Savage et al., 2018; Wray et al., 2018). The latter are now considered the gold standard even when they do little more than descriptively identify novel statistical associations between gene variants and behavior. In population genetics, at least, researchers seem to have accepted that the world is causally complicated, and attempting to obtain a reasonable descriptive characterization of some small part of it is a perfectly valid reason to conduct large, expensive empirical studies.

Fit more expansive statistical models

To the degree that authors intend for their conclusions to generalize over populations of stimuli, tasks, experimenters, and other such factors, they should develop a habit of fitting more expansive statistical models. As noted earlier, nearly all statistical analyses of multi-subject data in psychology treat subject as a varying effect. The same treatment should be accorded to other design factors that researchers intend to generalize over and that vary controllably or naturally in one's study. Of course, inclusion of additional random effects is only one of many potential avenues for sensible model expansion (Draper, 1995; Gelman & Shalizi, 2013)11. The good news is that improvements in statistical computing over the past few years have made it substantially easier for researchers to fit arbitrarily complex mixed-effects models within both Bayesian and frequentist frameworks. Models that were once intractable for most researchers to fit due to either mathematical or computational limitations can now often be easily specified and executed on modern laptops using mixed-effects packages (e.g., lmer or MixedModels.jl; Bates, Maechler, Bolker, Walker, & Others, 2014) or probabilistic programming frameworks (e.g., Stan or PyMC; Carpenter et al., 2017; Salvatier, Wiecki, & Fonnesbeck, 2016).

11 In a sense, the very idea of a random effect is just a convenient fiction—effectively a placeholder for a large number of hypothetical fixed variables (or functions thereof) that we presently do not know how to write, or lack the capacity to measure and/or estimate.
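As a concrete (and deliberately simplified) illustration of the kind of model expansion described above, the sketch below specifies a two-condition design with crossed varying effects for subjects and stimuli in PyMC. The priors, the toy data layout, and the variable names are assumptions made purely for this example, not a recommended default specification.

```python
# Minimal sketch (illustrative only): a two-condition design in which both subjects
# and stimuli are treated as varying ("random") effects, written as a PyMC model.
import numpy as np
import pymc as pm  # `import pymc3 as pm` on older installations

# toy data layout: one row per trial (placeholder values)
n_subjects, n_stimuli = 30, 16
rng = np.random.default_rng(1)
subj = np.repeat(np.arange(n_subjects), n_stimuli)
stim = np.tile(np.arange(n_stimuli), n_subjects)
cond = (stim % 2).astype(float)          # condition assigned at the stimulus level
y = rng.normal(0.2 * cond, 1.0)          # placeholder outcome

with pm.Model():
    beta = pm.Normal("beta", mu=0.0, sigma=1.0)        # condition effect of interest
    sd_subj = pm.HalfNormal("sd_subj", sigma=1.0)      # subject variance component
    sd_stim = pm.HalfNormal("sd_stim", sigma=1.0)      # stimulus variance component
    u_subj = pm.Normal("u_subj", mu=0.0, sigma=sd_subj, shape=n_subjects)
    u_stim = pm.Normal("u_stim", mu=0.0, sigma=sd_stim, shape=n_stimuli)
    sigma = pm.HalfNormal("sigma", sigma=1.0)          # trial-level residual noise
    mu = beta * cond + u_subj[subj] + u_stim[stim]
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)
    idata = pm.sample(1000, tune=1000, chains=2)
```

In lme4-style syntax, the same structure corresponds roughly to the formula y ~ condition + (1 | subject) + (1 | stimulus); the substantive point is simply that stimuli, like subjects, are treated as samples from a population one intends to generalize to.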
Design with variation in mind

There is a long tradition in psychology of trying to construct randomized experiments that are as tightly controlled as possible—even at the cost of decreased generalizability. In recent years, there have been increasing calls in many domains of psychology for studies that opt for the opposite side of the precision-generalization tradeoff, and emphasize naturalistic, ecologically valid designs that embrace variability. For example, in neuroimaging, researchers are increasingly fitting sophisticated models to naturalistic stimuli such as coherent narratives or movies (Hamilton & Huth, 2018; Huth, de Heer, Griffiths, Theunissen, & Gallant, 2016; Huth, Nishimoto, Vu, & Gallant, 2012; Spiers & Maguire, 2007). In psycholinguistics, large-scale analyses involving databases of thousands of words and subjects have superseded traditional small-n factorial studies for many applications (Balota, Yap, Hutchison, & Cortese, 2012; Keuleers & Balota, 2015). Even in domains where many effects traditionally display little sensitivity to context, some researchers have advocated for analysis strategies that emphasize variability. For example, Baribault and colleagues (2018) randomly varied 16 different experimental factors in a large multi-site replication (6 sites, 346 subjects, and nearly 5,000 "microexperiments") of a subliminal priming study (Reuss, Kiesel, & Kunde, 2015). The "radical randomization" strategy the authors adopted allowed them to draw much stronger conclusions about the generalizability of the priming effect (or lack thereof) than would have otherwise been possible.

Naturally, variation-enhancing designs come at a cost: they will often demand greater resources than conventional approaches that seek to minimize extraneous variation. But if authors intend for their conclusions to hold independently of variation in uninteresting factors, and to generalize to broad classes of situations, there is no good substitute for studies whose designs make a serious effort to respect and capture the complexity of real-world phenomena. Large-scale, collaborative projects of the kind pioneered in RRRs (Simons et al., 2014) and recent initiatives such as the Psychological Science Accelerator (Moshontz et al., 2018) are arguably the natural venue for such an approach—but to maximize their utility, the substantial resources they command must be used to directly measure and model variability rather than minimizing and ignoring it.

Emphasize variance estimates

An important and underappreciated secondary consequence of the widespread disregard for generalizability is that researchers in many areas of psychology rarely have good data—or even just strong intuitions—about the relative importance of different sources of variance. One way to mitigate this problem is to promote analytical approaches that emphasize the estimation of variance components rather than focusing solely on point estimates. For primary research studies, Generalizability Theory (Brennan, 1992; Cronbach, Rajaratnam, & Gleser, 1963; Shavelson & Webb, 1991) provides a well-developed (and underused) framework for computing and applying such estimates. At the secondary level, meta-analysts could similarly work to quantify the magnitudes of different variance components—either by meta-analyzing reported within-study variance estimates, or by meta-analytically computing between-study variance components for different factors. Such approaches could provide researchers with critically important background estimates of the extent to which a new finding reported in a particular literature should be expected to generalize to different samples of subjects, stimuli, experimenters, research sites, and so on. Notably, such estimates would be valuable irrespective of the presence or absence of a main effect of putative interest. For example, even if the accumulated empirical literature is too feeble to allow us to estimate anything approximating a single overall universe score for ego depletion, it would still be extremely helpful when planning a new study to know roughly how much of the observed variation in the existing pool of studies is due to differences in stimuli, subjects, tasks, etc.
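To give a sense of what the simplest version of such an analysis involves, the sketch below computes a Generalizability-Theory-style variance decomposition for a fully crossed persons-by-items design with one observation per cell. The data are simulated placeholders with assumed variance components; in a real application, y would be an observed score matrix, and designs with additional facets or nesting would require the fuller machinery described by Shavelson and Webb (1991).

```python
# Minimal sketch: variance-component estimates for a crossed persons x items design,
# using the standard two-way random-effects expected-mean-squares identities.
import numpy as np

rng = np.random.default_rng(2)
n_p, n_i = 40, 12
person_fx = rng.normal(0, 0.5, (n_p, 1))   # assumed person variance component
item_fx = rng.normal(0, 0.8, (1, n_i))     # assumed item (stimulus) variance component
y = person_fx + item_fx + rng.normal(0, 1.0, (n_p, n_i))   # residual noise

grand = y.mean()
person_means = y.mean(axis=1, keepdims=True)
item_means = y.mean(axis=0, keepdims=True)

ms_p = n_i * ((person_means - grand) ** 2).sum() / (n_p - 1)
ms_i = n_p * ((item_means - grand) ** 2).sum() / (n_i - 1)
ms_res = ((y - person_means - item_means + grand) ** 2).sum() / ((n_p - 1) * (n_i - 1))

var_residual = ms_res
var_person = max((ms_p - ms_res) / n_i, 0.0)
var_item = max((ms_i - ms_res) / n_p, 0.0)

total = var_person + var_item + var_residual
for name, v in [("persons", var_person), ("items", var_item), ("residual", var_residual)]:
    print(f"{name:>8}: {v:.2f} ({v / total:.0%} of estimated variance)")
```

Even this toy decomposition makes the relevant question explicit: how much of the total variation is attributable to the people one hopes to generalize over, and how much to the particular items one happened to use?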
Make riskier predictions

There is an important sense in which most of the other recommendations made in this section could be obviated simply by making theoretical predictions that assume a high degree of theoretical risk. I have approached the problem of generalizability largely from a statistical perspective, but there is a deep connection between the present concerns and a long tradition of philosophical commentary focusing on the logical relationship (or lack thereof) between theoretical hypotheses and operational or statistical ones. Perhaps the best exposition of such ideas is found in the seminal work of Paul Meehl, who, beginning in the 1960s, argued compellingly that many of the methodological and statistical practices routinely applied by psychologists and other social scientists are logically fallacious (e.g., P. E. Meehl, 1967, 1978, 1990b). Meehl's thinking was extremely nuanced, but a recurring theme in his work is the observation that most hypothesis tests in psychology commit the logical fallacy of affirming the consequent. A theory T makes a prediction P, and when researchers obtain data consistent with P, they then happily conclude that T is corroborated. In reality, the confirmation of P provides no meaningful support for T unless the prediction was relatively specific to T—that is, there are no readily available alternative theories T′1, …, T′k that also predict P. Unfortunately, in most domains of psychology, there are pervasive and typically very plausible competing explanations for almost every finding (Cohen, 2016; Lykken, 1968; P. E. Meehl, 1967, 1986).

The solution to this problem is, in principle, simple: researchers should strive to develop theories that generate risky predictions (P. Meehl, 1997; P. E. Meehl, 1967, 1990a; Popper, 2014)—or, in the terminology popularized by Deborah Mayo, should subject their theories to severe tests (Mayo, 1991, 2018). The canonical way to accomplish this is to derive from one's theory some series of predictions—typically, but not necessarily, quantitative in nature—sufficiently specific to that theory that they are inconsistent with, or at least extremely implausible under, other accounts. As Meehl put it:

    If my meteorological theory successfully predicts that it will rain sometime next April, and that prediction pans out, the scientific community will not be much impressed. If my theory enables me to correctly predict which of 5 days in April it rains, they will be more impressed. And if I predict how many millimeters of rainfall there will be on each of these 5 days, they will begin to take my theory very seriously indeed (P. E. Meehl, 1990a, p. 110).

The ability to generate and corroborate a truly risky prediction strongly implies that a researcher must already have a decent working model (even if only implicitly) of most of the contextual factors that could potentially affect a dependent variable. If a social psychologist were capable of directly deriving from a theory of verbal overshadowing the prediction that target recognition should decrease 1.7% ± 0.04% in condition A relative to condition B in a given experiment, concerns about the generalizability of the theory would dramatically lessen, as there would rarely be a plausible alternative explanation for such precision other than that the theories in question were indeed accurately capturing something important about the way the world works.

In practice, it's clearly wishful thinking to demand this sort of precision in most areas of psychology (potential exceptions include, e.g., parts of psychophysics, mathematical cognitive psychology, and behavioral genetics). The very fact that most of the phenomena psychologists study are enormously complex, and admit a vast array of causal influences in even the most artificially constrained laboratory situations, likely precludes the production of quantitative models with anything close to the level of precision one routinely observes in the natural sciences. This does not mean, however, that vague directional predictions are the best we can expect from psychologists. There are a number of strategies that researchers in such fields could adopt that would still represent at least a modest improvement over the status quo (for discussion, see P. E. Meehl, 1990a). For example, researchers could use equivalence tests (Lakens, 2017); predict specific orderings of discrete observations; test against compound nulls that require the conjunctive rejection of many independent directional predictions; and develop formal mathematical models that posit non-trivial functional forms between the input and output variables (Marewski & Olsson, 2009; Smaldino, 2017). While it is probably unrealistic to expect truly severe tests to become the norm in most fields of psychology, severity is an ideal worth keeping perpetually in mind when designing studies—if only as a natural guard against undue optimism.
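Of the strategies just listed, the equivalence test is probably the easiest to fold into existing workflows. The following sketch implements a basic two-one-sided-tests (TOST) procedure for two independent groups in the spirit of Lakens (2017); the equivalence bounds of ±0.3 raw units and the simulated data are arbitrary illustrative assumptions, and in practice the bounds would have to be justified on substantive grounds.

```python
# Minimal sketch of a two-one-sided-tests (TOST) equivalence test for two
# independent groups. Bounds and data below are illustrative assumptions.
import numpy as np
from scipy import stats

def tost_ind(x, y, low, high):
    """Return (mean difference, TOST p-value = the larger of the two one-sided p-values)."""
    nx, ny = len(x), len(y)
    diff = np.mean(x) - np.mean(y)
    # pooled-variance standard error, as in a standard independent-samples t-test
    sp2 = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    se = np.sqrt(sp2 * (1 / nx + 1 / ny))
    df = nx + ny - 2
    t_low = (diff - low) / se      # H0: effect <= lower bound (hope to reject)
    t_high = (diff - high) / se    # H0: effect >= upper bound (hope to reject)
    p_low = 1 - stats.t.cdf(t_low, df)
    p_high = stats.t.cdf(t_high, df)
    return diff, max(p_low, p_high)

rng = np.random.default_rng(3)
a = rng.normal(0.00, 1.0, 120)    # placeholder data
b = rng.normal(0.05, 1.0, 120)
diff, p_tost = tost_ind(a, b, low=-0.3, high=0.3)
print(f"mean difference = {diff:.3f}, TOST p = {p_tost:.3f}")
# A small TOST p-value indicates the observed difference is statistically
# contained within the (assumed) equivalence bounds in both directions.
```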
Focus on practical predictive utility

An alternative and arguably more pragmatic way to think about the role of prediction in psychology is to focus not on the theoretical risk implied by a prediction, but on its practical utility. Here, the core idea is to view psychological theories or models not so much as statements about how the human mind actually operates, but as convenient approximations that can help us intervene on the world in useful ways (Breiman, 2001; Hofman, Sharma, & Watts, 2017; Shmueli, 2010; Yarkoni & Westfall, 2017). For example, instead of asking the question does verbal overshadowing exist?, we might instead ask: can we train a statistical model that allows us to meaningfully predict people's behaviors in a set of situations that superficially seem to involve verbal overshadowing? The latter framing places emphasis primarily on what a model is able to do for us rather than on its implied theoretical or ontological commitments.

One major advantage of an applied predictive focus is that it naturally draws attention to objective metrics of performance that can be easier to measure and evaluate than the relatively abstract, and often vague, theoretical postulates of psychological theories. A strong emphasis on objective, communal measures of model performance has been a key driver of rapid recent progress in the field of machine learning (Jordan & Mitchell, 2015; LeCun, Bengio, & Hinton, 2015; Russakovsky et al., 2015)—including numerous successes in domains such as object recognition and natural language translation that arguably already fall within the purview of psychology and cognitive science. A focus on applied prediction would also naturally encourage greater use of large samples, as well as of cross-validation techniques that can minimize overfitting and provide alternative ways of assessing generalizability outside of the traditional inferential statistical framework (e.g., Woo, Chang, Lindquist, & Wager, 2017). Admittedly, a large-scale shift towards instrumentalism of this kind would break with a century-long tradition of explanation and theoretical understanding within psychology; however, as I have argued elsewhere (Yarkoni & Westfall, 2017), there are good reasons to believe that psychology would emerge as a healthier, more reliable discipline as a result.
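To illustrate what such an evaluation might look like in practice, here is a minimal cross-validation sketch. It uses scikit-learn, which the text does not mention but which is a standard choice for this kind of analysis; the data, the "site" grouping variable, and the ridge-regression model are all placeholder assumptions. Splitting by site rather than by row asks the more demanding question of whether the model predicts well for data sources it has never seen.

```python
# Minimal sketch: out-of-site predictive performance via grouped cross-validation.
# All data are simulated placeholders; the model and features are illustrative.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(4)
n, n_features, n_sites = 600, 10, 6
X = rng.normal(size=(n, n_features))
site = rng.integers(0, n_sites, size=n)
site_shift = rng.normal(0, 0.5, n_sites)[site]          # unmodeled site-level variation
y = X @ rng.normal(0, 0.3, n_features) + site_shift + rng.normal(0, 1.0, n)

scores = cross_val_score(Ridge(alpha=1.0), X, y,
                         groups=site, cv=GroupKFold(n_splits=n_sites),
                         scoring="r2")
print("out-of-site R^2 per fold:", np.round(scores, 2))
# The spread across folds is itself informative: large fold-to-fold variation
# signals that the model's predictive value does not generalize across sites.
```

The fold-to-fold spread in out-of-sample performance is itself a crude generalizability estimate: if predictive accuracy collapses for held-out sites, no in-sample p-value should convince us that the model captures something general.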
Conclusion

Most contemporary psychologists view the use of inferential statistical tests as an integral part of the discipline's methodology. The ubiquitous reliance on statistical inference is the source of much of the perceived objectivity and rigor of modern psychology—the very thing that, in many people's eyes, makes it a quantitative science. I have argued that, for most research questions in most areas of psychology, this perception is illusory. Closer examination reveals that the inferential statistics reported in psychology articles typically have only a tenuous correspondence to the verbal claims they are intended to support. The overarching conclusion is that many fields of psychology currently operate under a kind of collective self-deception, using a thin sheen of quantitative rigor to mask inferences that remain, at their core, almost entirely qualitative.

Such concerns are not new, of course. Commentators have long pointed out that, viewed dispassionately, an enormous amount of statistical inference in psychology (and, to be fair, other sciences) has a decidedly performative character: rather than improving the quality of scientific inference, the use of ritualistic, universalized tests serves mainly to increase practitioners' subjective confidence in broad verbal assertions that would otherwise be difficult to defend on logical grounds (e.g., Gelman, 2016; Gigerenzer, 2004; Gigerenzer & Marewski, 2015; P. E. Meehl, 1967, 1990b; Tong, 2019). A central point I have tried to emphasize in the present treatment is that such critiques are not, as many psychologists would like to believe, pedantic worries about edge cases that one can safely ignore the vast majority of the time. The problems in question are fundamental, and follow directly from foundational assumptions of our most widely used statistical models. Specifically, the degree of support a statistical analysis lends to a verbal proposition derives not just from some critical number that the analysis does or doesn't pop out (e.g., p < .05), but also (and really, primarily) from the ability of the statistical model to implicitly define a universe that matches the one defined by the verbal proposition.

When the two diverge markedly—as I have argued is the modal situation in psychology—one is left with a difficult choice to make. One possibility is to accept the force of the challenge and adjust one's standard operating procedures accordingly—by moderating one's verbal claims, narrowing the scope of one's research program, focusing on making practically useful predictions, and so on. This path is effort-intensive and incurs a high risk that the results one produces post-remediation will, at least superficially, seem less impressive than the ones that came before. But it is the intellectually honest road, and has the secondary benefit of reducing the probability of making unreasonably broad claims that are unlikely to stand the test of time.

The alternative is to simply brush off these concerns, recommit one's self to the same set of procedures that have led to prior success by at least some measures (papers published, awards received, etc.), and then carry on with business as usual. No additional effort is required here; no new intellectual or occupational risk is assumed. The main cost is that one must live with the knowledge that many of the statistical quantities one routinely reports in one's papers are essentially just an elaborate rhetorical ruse used to mathematize people into believing claims they would otherwise find logically unsound.

I don't pretend to think this is an easy choice. I have little doubt that the vast majority of researchers have good intentions, and genuinely want to do research that helps increase understanding of human behavior and improve the quality of people's lives. I am also sympathetic to objections that it's not fair to expect individual researchers to proactively hold themselves to a higher standard than the surrounding community, knowing full well that a likely cost of doing the right thing is that one's research may become more difficult to pursue, less exciting, and less well received by others. Unfortunately, the world we live in isn't always fair. I don't think anyone should be judged very harshly for finding major course correction too difficult an undertaking after spending years immersed in an intellectual tradition that encourages rampant overgeneralization. But the decision to stay the course should at least be an informed one: researchers who opt to ignore the bad news should recognize that, in the long term, such an attitude hurts the credibility both of their own research program and of the broader profession they have chosen. One is always free to pretend that small p-values obtained from extremely narrow statistical operationalizations can provide an adequate basis for sweeping verbal inferences about complex psychological constructs. But no one else—not one's peers, not one's funders, not the public, and certainly not the long-term scientific record—is obligated to honor the charade.
Acknowledgments

This paper was a labor of pain that took an inexplicably long time to produce. It owes an important debt to Jake Westfall for valuable discussions and comments, and also benefited from conversations with dozens of other people (many of them on Twitter, because it turns out you can say a surprisingly large amount in hundreds of 280-character tweets).

References

Acosta, A., Adams (Jr.), R. B., Albohn, D. N., Allard, E. S., Beek, T., Benning, S. D., … Zwaan, R. A. (2016, November). Registered replication report: Strack, Martin, & Stepper (1988). Perspect. Psychol. Sci., 11(6), 917–928.
Alogna, V. K., Attaya, M. K., Aucoin, P., Bahník, Š., Birch, S., Birt, A. R., … Zwaan, R. A. (2014, September). Registered replication report: Schooler and Engstler-Schooler (1990). Perspect. Psychol. Sci., 9(5), 556–578.
Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. J. Mem. Lang., 59(4), 390–412.
Balota, D. A., Yap, M. J., Hutchison, K. A., & Cortese, M. J. (2012). Megastudies: What do millions (or so) of trials tell us about lexical processing? In J. S. Adelman (Ed.), Visual word recognition volume 1: Models and methods, orthography and phonology. Psychology Press.
Baribault, B., Donkin, C., Little, D. R., Trueblood, J. S., Oravecz, Z., van Ravenzwaaij, D., … Vandekerckhove, J. (2018, March). Metastudies for robust tests of theory. Proc. Natl. Acad. Sci. U. S. A., 115(11), 2607–2612.
Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013, April). Random effects structure for confirmatory hypothesis testing: Keep it maximal. J. Mem. Lang., 68(3).
Bates, D., Maechler, M., Bolker, B., Walker, S., & Others. (2014). lme4: Linear mixed-effects models using Eigen and S4. R package version, 1(7), 1–23.
Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., … Others (2018). Redefine statistical significance. Nature Human Behaviour, 2(1), 6.
Bergelson, E., Bergmann, C., Byers-Heinlein, K., Cristia, A., Cusack, R., Dyck, K., … others (2017). Quantifying sources of variability in infancy research using the infant-directed speech preference.
Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2003, April). The theoretical status of latent variables. Psychol. Rev., 110(2), 203–219.
Breiman, L. (2001). Statistical modeling: The two cultures. Stat. Sci., 16(3), 199–215.
Brennan, R. L. (1992). Generalizability theory. Educational Measurement: Issues and Practice, 11(4), 27–34.
Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., … Riddell, A. (2017). Stan: A probabilistic programming language. J. Stat. Softw., 76(1).
Chabris, C. F., Hebert, B. M., Benjamin, D. J., Beauchamp, J., Cesarini, D., van der Loos, M., … Laibson, D. (2012, September). Most reported genetic associations with general intelligence are probably false positives. Psychol. Sci., 23(11), 1314–1323.
Cheung, I., Campbell, L., LeBel, E. P., Ackerman, R. A., Aykutoğlu, B., Bahník, Š., … Yong, J. C. (2016, September). Registered replication report: Study 1 from Finkel, Rusbult, Kumashiro, & Hannon (2002). Perspect. Psychol. Sci., 11(5), 750–764.
Clark, H. H. (1973, August). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior, 12(4), 335–359.
Cohen, J. (2016). The earth is round (p < .05). In What if there were no significance tests? (pp. 69–82). Routledge.
Coleman, E. B. (1964). Generalizing to a language population. Psychol. Rep., 14(1), 219–226.
Colhoun, H. M., McKeigue, P. M., & Davey Smith, G. (2003, March). Problems of reporting genetic associations with complex outcomes. Lancet, 361(9360), 865–872.
Cornfield, J., & Tukey, J. W. (1956). Average values of mean squares in factorials. The Annals of Mathematical Statistics, 27(4), 907–949.
Critcher, C. R., & Lee, C. J. (2018, May). Feeling is believing: Inspiration encourages belief in God. Psychol. Sci., 29(5), 723–737.
Crits-Christoph, P., & Mintz, J. (1991, February). Implications of therapist effects for the design and analysis of comparative studies of psychotherapies. J. Consult. Clin. Psychol., 59(1), 20–26.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychol. Bull., 52(4), 281–302.
Cronbach, L. J., Rajaratnam, N., & Gleser, G. C. (1963). Theory of generalizability: A liberalization of reliability theory. Br. J. Math. Stat. Psychol., 16(2), 137–163.
Draper, D. (1995). Assessment and propagation of model uncertainty. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 45–70.
Ebstein, R. P., Novick, O., Umansky, R., Priel, B., Osher, Y., Blaine, D., … Belmaker, R. H. (1996). Dopamine D4 receptor (D4DR) exon III polymorphism associated with the human personality trait of novelty seeking. Nat. Genet., 12(1), 78–80.
Eerland, A. S., Magliano, A. M., Zwaan, J. P., Arnal, R. A., Aucoin, J. D., & Crocker, P. (2016). Registered replication report: Hart & Albarracín (2011). Perspect. Psychol. Sci., 11(1), 158–171.
Feynman, R. P. (1974). Cargo cult science. Eng. Sci., 37(7), 10–13.
Francis, G. (2012, December). Publication bias and the failure of replication in experimental psychology. Psychon. Bull. Rev., 19(6), 975–991.
Gelman, A. (2015, February). The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective. J. Manage., 41(2), 632–643.
Gelman, A. (2016). The problems with p-values are not just with p-values. Am. Stat., 70(supplemental material to the ASA statement on p-values and statistical significance), 10.
Gelman, A. (2018, January). The failure of null hypothesis significance testing when studying incremental changes, and what to do about it. Pers. Soc. Psychol. Bull., 44(1), 16–23.
Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis. Downloaded January, 1–17.
Gelman, A., & Shalizi, C. R. (2013). Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66(1), 8–38.
Gigerenzer, G. (2004, November). Mindless statistics. J. Socio Econ., 33(5), 587–606.
Gigerenzer, G., & Marewski, J. N. (2015). Surrogate science: The idol of a universal method for scientific inference. Journal of Management, 41(2), 421–440.
Guion, R. M. (1980, June). On trinitarian doctrines of validity. Prof. Psychol., 11(3), 385–398.
Hamilton, L. S., & Huth, A. G. (2018, July). The revolution will not be controlled: Natural stimuli in speech neuroscience. Language, Cognition and Neuroscience, 1–10.
Hofman, J. M., Sharma, A., & Watts, D. J. (2017, February). Prediction and explanation in social systems. Science, 355(6324), 486–488.
Huth, A. G., de Heer, W. A., Griffiths, T. L., Theunissen, F. E., & Gallant, J. L. (2016, April). Natural speech reveals the semantic maps that tile human cerebral cortex. Nature, 532(7600), 453–458.
Huth, A. G., Nishimoto, S., Vu, A. T., & Gallant, J. L. (2012, December). A continuous semantic space describes the representation of thousands of object and action categories across the human brain. Neuron, 76(6), 1210–1224.
Ioannidis, J. (2008). Why most discovered true associations are inflated. Epidemiology, 19(5), 640–648.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Med., 2(8), e124.
John, L. K., Loewenstein, G., & Prelec, D. (2012, May). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychol. Sci., 23(5), 524–532.
Johnson, D. J., Cheung, F., & Donnellan, M. B. (2014). Does cleanliness influence moral judgments? A direct replication of Schnall, Benton, and Harvey (2008). Soc. Psychol., 45(3), 209.
Jordan, M. I., & Mitchell, T. M. (2015, July). Machine learning: Trends, perspectives, and prospects. Science, 349(6245), 255–260.
Judd, C. M., Westfall, J., & Kenny, D. A. (2012). Treating stimuli as a random factor in social psychology: A new and comprehensive solution to a pervasive but largely ignored problem (Vol. 103, No. 1).
Keuleers, E., & Balota, D. A. (2015). Megastudies, crowdsourcing, and large datasets in psycholinguistics: An overview of recent developments. Q. J. Exp. Psychol., 68(8), 1457–1468.
Krupp, D. B., & Cook, T. R. (2018, May). Local competition amplifies the corrosive effects of inequality. Psychol. Sci., 29(5), 824–833.
Kruschke, J. K., & Liddell, T. M. (2017, February). The Bayesian new statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychon. Bull. Rev.
Kühberger, A., Fritz, A., & Scherndl, T. (2014, September). Publication bias in psychology: A diagnosis based on the correlation between effect size and sample size. PLoS One, 9(9), e105825.
Lakens, D. (2017, May). Equivalence tests: A practical primer for t tests, correlations, and meta-analyses. Soc. Psychol. Personal. Sci., 8(4), 355–362.
Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., … Zwaan, R. A. (2018, February). Justify your alpha. Nature Human Behaviour, 2(3), 168–171.
LeCun, Y., Bengio, Y., & Hinton, G. (2015, May). Deep learning. Nature, 521(7553), 436–444.
Lesch, K. P., Bengel, D., Heils, A., Sabol, S. Z., Greenberg, B. D., Petri, S., … Murphy, D. L. (1996). Association of anxiety-related traits with a polymorphism in the serotonin transporter gene regulatory region. Science, 274(5292), 1527–1531.
Lilienfeld, S. O. (2017, July). Psychology's replication crisis and the grant culture: Righting the ship. Perspect. Psychol. Sci., 12(4), 660–664.
Lykken, D. T. (1968, September). Statistical significance in psychological research. Psychol. Bull., 70(3), 151–159.
MacLeod, C. M. (1991, March). Half a century of research on the Stroop effect: An integrative review. Psychol. Bull., 109(2), 163–203.
Marewski, J. N., & Olsson, H. (2009). Beyond the null ritual: Formal modeling of psychological processes. Zeitschrift für Psychologie/Journal of Psychology, 217(1), 49–60.
Matuschek, H., Kliegl, R., Vasishth, S., Baayen, H., & Bates, D. (2017, June). Balancing type I error and power in linear mixed models. J. Mem. Lang., 94, 305–315.
Mayo, D. G. (1991). Novel evidence and severe tests. Philos. Sci., 58(4), 523–552.
Mayo, D. G. (2018). Statistical inference as severe testing. Cambridge: Cambridge University Press.
McShane, B. B., Gal, D., Gelman, A., Robert, C., & Tackett, J. L. (2019). Abandon statistical significance. The American Statistician, 73(sup1), 235–245.
Meehl, P. (1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 393–425). Erlbaum.
Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philos. Sci., 34(2), 103–115.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. J. Consult. Clin. Psychol., 46(4), 806.
Meehl, P. E. (1986). What social scientists don't understand. Metatheory in social science: Pluralisms and subjectivities, 315.
Meehl, P. E. (1990a, April). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychol. Inq., 1(2), 108–141.
Meehl, P. E. (1990b, February). Why summaries of research on psychological theories are often uninterpretable. Psychol. Rep., 66(1), 195–244.
Meissner, C. A., & Brigham, J. C. (2001). A meta-analysis of the verbal overshadowing effect in face identification. Applied Cognitive Psychology: The Official Journal of the Society for Applied Research in Memory and Cognition, 15(6), 603–616.
Meissner, C. A., & Memon, A. (2002, December). Verbal overshadowing: A special issue exploring theoretical and applied issues. Appl. Cogn. Psychol., 16(8), 869–872.
Moshontz, H., Campbell, L., Ebersole, C. R., IJzerman, H., Urry, H. L., Forscher, P. S., … Chartier, C. R. (2018, July). Psychological Science Accelerator: Advancing psychology through a distributed collaborative network. Advances in Methods and Practices in Psychological Science.
Nagel, M., Jansen, P. R., Stringer, S., Watanabe, K., de Leeuw, C. A., Bryois, J., … Posthuma, D. (2018, July). Meta-analysis of genome-wide association studies for neuroticism in 449,484 individuals identifies novel genetic loci and pathways. Nat. Genet., 50(7), 920–927.
O'Leary-Kelly, S. W., & J. Vokurka, R. (1998). The empirical assessment of construct validity. J. Oper. Manage., 16(4), 387–405.
Pashler, H., & Wagenmakers, E.-J. (2012, November). Editors' introduction to the special section on replicability in psychological science: A crisis of confidence? Perspect. Psychol. Sci., 7(6), 528–530.
Popper, K. (2014). Conjectures and refutations: The growth of scientific knowledge. Routledge.
Reuss, H., Kiesel, A., & Kunde, W. (2015, January). Adjustments of response speed and accuracy to unconscious cues. Cognition, 134, 57–62.
Rocklage, M. D., Rucker, D. D., & Nordgren, L. F. (2018, May). Persuasion, emotion, and language: The intent to persuade transforms language via emotionality. Psychol. Sci., 29(5), 749–760.
Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009, April). Bayesian t tests for accepting and rejecting the null hypothesis. Psychon. Bull. Rev., 16(2), 225–237.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., … Fei-Fei, L. (2015, December). ImageNet large scale visual recognition challenge. Int. J. Comput. Vis., 115(3), 211–252.
Salvatier, J., Wiecki, T. V., & Fonnesbeck, C. (2016, April). Probabilistic programming in Python using PyMC3. PeerJ Comput. Sci., 2, e55.
Savage, J. E., Jansen, P. R., Stringer, S., Watanabe, K., Bryois, J., de Leeuw, C. A., … Posthuma, D. (2018, July). Genome-wide association meta-analysis in 269,867 individuals identifies new genetic and functional links to intelligence. Nat. Genet., 50(7), 912–919.
Schnall, S., Benton, J., & Harvey, S. (2008, December). With a clean conscience: Cleanliness reduces the severity of moral judgments. Psychol. Sci., 19(12), 1219–1222.
Schooler, J. W., & Engstler-Schooler, T. Y. (1990, January). Verbal overshadowing of visual memories: Some things are better left unsaid. Cogn. Psychol., 22(1), 36–71.
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. SAGE.
Shmueli, G. (2010). To explain or to predict? Stat. Sci., 289–310.
Shrout, P. E., & Rodgers, J. L. (2018, January). Psychology, science, and knowledge construction: Broadening perspectives from the replication crisis. Annu. Rev. Psychol., 69, 487–510.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol. Sci., 22(11), 1359–1366.
Simons, D. J., Holcombe, A. O., & Spellman, B. A. (2014, September). An introduction to registered replication reports at Perspectives on Psychological Science. Perspect. Psychol. Sci., 9(5), 552–555.
Simons, D. J., Shoda, Y., & Lindsay, D. S. (2017, November). Constraints on generality (COG): A proposed addition to all empirical papers. Perspect. Psychol. Sci., 12(6), 1123–1128.
Smaldino, P. E. (2017). Models are stupid, and we need more of them. In Computational social psychology (pp. 311–331). Routledge.
Smedslund, J. (1991). The pseudoempirical in psychology and the case for psychologic. Psychological Inquiry, 2(4), 325–338.
Spiers, H. J., & Maguire, E. A. (2007). Decoding human brain activity during real-world experiences. Trends Cogn. Sci., 11(8), 356–365.
Steckler, A., McLeroy, K. R., Goodman, R. M., Bird, S. T., & McCormick, L. (1992). Toward integrating qualitative and quantitative methods: An introduction. Health Educ. Q., 19(1), 1–8.
Stroop, J. R. (1935). Studies of interference in serial verbal reactions. J. Exp. Psychol., 18(6), 643.
Sullivan, P. F. (2007, May). Spurious genetic associations. Biol. Psychiatry, 61(10), 1121–1126.
Tong, C. (2019). Statistical inference enables bad science; statistical thinking enables good science. The American Statistician, 73(sup1), 246–261.
Trafimow, D. (2014, January). Editorial. Basic Appl. Soc. Psych., 36(1), 1–2.
Trafimow, D., & Marks, M. (2015, January). Editorial. Basic Appl. Soc. Psych., 37(1), 1–2.
Van Bavel, J. J., Mende-Siedlecki, P., Brady, W. J., & Reinero, D. A. (2016, June). Contextual sensitivity in scientific reproducibility. Proc. Natl. Acad. Sci. U. S. A., 113(23), 6454–6459.
Wagenmakers, E.-J. (2007, October). A practical solution to the pervasive problems of p values. Psychon. Bull. Rev., 14(5), 779–804.
Westfall, J., Nichols, T. E., & Yarkoni, T. (2016, December). Fixing the stimulus-as-fixed-effect fallacy in task fMRI. Wellcome Open Res, 1, 23.
Wolsiefer, K., Westfall, J., & Judd, C. M. (2017, August). Modeling stimulus variation in three common implicit attitude tasks. Behav. Res. Methods, 49(4), 1193–1209.
Woolston, C. (2015, March). Psychology journal bans P values. Nature News, 519(7541), 9.
Wray, N. R., Ripke, S., Mattheisen, M., Trzaskowski, M., Byrne, E. M., Abdellaoui, A., … Major Depressive Disorder Working Group of the Psychiatric Genomics Consortium (2018, May). Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depression. Nat. Genet., 50(5), 668–681.
Yarkoni, T. (2009, May). Big correlations in little studies: Inflated fMRI correlations reflect low statistical power—Commentary on Vul et al. (2009). Perspect. Psychol. Sci., 4(3), 294–298.
Yarkoni, T., & Westfall, J. (2017, November). Choosing prediction over explanation in psychology: Lessons from machine learning. Perspect. Psychol. Sci., 12(6), 1100–1122.