The Value of Direct Replication

Daniel J. Simons
University of Illinois

Perspectives on Psychological Science, 2014, Vol. 9(1), 76–80
DOI: 10.1177/1745691613514755
© The Author(s) 2013

Corresponding Author: Daniel J. Simons, Department of Psychology, University of Illinois, 603 E. Daniel Street, Champaign, IL 61820; e-mail: dsimons@illinois.edu

Abstract

Reproducibility is the cornerstone of science. If an effect is reliable, any competent researcher should be able to obtain it when using the same procedures with adequate statistical power. Two of the articles in this special section question the value of direct replication by other laboratories. In this commentary, I discuss the problematic implications of some of their assumptions and argue that direct replication by multiple laboratories is the only way to verify the reliability of an effect.

Keywords
direct replication, conceptual replication, reliability, generalizability

Reproducibility is the cornerstone of science. The idea that direct replication undergirds science has a simple premise: If an effect is real and robust, any competent researcher should be able to obtain it when using the same procedures with adequate statistical power. Many effects in psychology replicate reliably across laboratories and even work as classroom demonstrations, but the reproducibility of some prominent effects in psychology has come under fire (see Pashler & Wagenmakers, 2012, for a summary). As a way to address this problem, Perspectives on Psychological Science recently launched a new type of article, the Registered Replication Report, devoted to verifying important psychology findings. These reports compile multiple replications of a single effect, conducted by labs throughout the world who all agree to follow a preregistered and vetted protocol. The end result is not a judgment of whether a single replication attempt succeeded or failed—it is a robust estimate of the size and reliability of the original finding.
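To make that aggregation concrete, here is a minimal sketch in Python (NumPy only) of the kind of random-effects pooling such a multi-lab report can use, following the DerSimonian-Laird method; the per-lab effect sizes and variances are invented for illustration and come from no actual Registered Replication Report.

```python
import numpy as np

# Hypothetical standardized effect sizes (Cohen's d) from five labs
# running the same vetted protocol, with their sampling variances.
effects = np.array([0.42, 0.15, 0.31, 0.05, 0.28])
variances = np.array([0.031, 0.042, 0.026, 0.055, 0.038])

# Fixed-effect weights and Cochran's Q, a measure of between-lab
# heterogeneity in the observed effects.
w = 1.0 / variances
pooled_fixed = np.sum(w * effects) / np.sum(w)
q = np.sum(w * (effects - pooled_fixed) ** 2)
df = len(effects) - 1

# DerSimonian-Laird estimate of tau^2, the between-lab (systematic)
# component of the variability.
c = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (q - df) / c)

# Random-effects weights fold tau^2 into each lab's uncertainty.
w_re = 1.0 / (variances + tau2)
pooled = np.sum(w_re * effects) / np.sum(w_re)
se = np.sqrt(1.0 / np.sum(w_re))

print(f"pooled d = {pooled:.2f}, "
      f"95% CI [{pooled - 1.96 * se:.2f}, {pooled + 1.96 * se:.2f}], "
      f"tau^2 = {tau2:.3f}")
```

The pooled estimate and its interval, rather than any single lab's significance test, characterize the size and reliability of the effect.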
Two of the articles in this special section (Cesario, 2014, this issue; Stroebe & Strack, 2014, this issue) question the value of direct replication by other laboratories.¹ Cesario recognizes the importance of direct replication in principle, but advocates replication by the originating lab as the best way to test the reliability of an effect. He argues that, until theories can specify the contingencies that govern variation in effects, replication failures by other laboratories should be viewed as not just ambiguous, but uninformative. Stroebe and Strack reject direct replication in favor of conceptual replication, arguing that “the true purpose of replications is a (repeated) test of a theoretical hypothesis rather than an assessment of the reliability of a particular experimental procedure” (p. 61).

In this commentary, I challenge these claims and their underlying assumptions by making three related points:

1. Direct replication by other laboratories is the best (and possibly the only) believable evidence for the reliability of an effect;
2. The idea that only the originating lab can meaningfully replicate an effect limits the scope of our findings to the point of being uninteresting and unfalsifiable; and
3. Situational influences and moderators should be verified rather than assumed.

Trust but Verify

All findings represent a combination of some underlying effect (the signal) and sampling error (the noise). The noise can be further decomposed into systematic error (moderators and differences in samples) and unsystematic error (measurement error). Direct replication by multiple laboratories is the only way to isolate the signal from the noise and average across different types of error. The rejection or deferral of direct replication by other labs assumes that unspecified moderators or quirks of sampling are reliably biased against finding the original effect in a new setting.
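A toy simulation makes the logic of this decomposition visible. In the sketch below (Python with NumPy; the effect and error magnitudes are arbitrary, invented values), every lab carries a hidden systematic offset: internal replications inherit the originating lab's offset on every run, whereas independent labs draw fresh offsets that tend to cancel when their results are averaged.

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_EFFECT = 0.40   # the signal
LAB_SD = 0.30        # systematic error: each lab's hidden offset
SAMPLING_SD = 0.10   # unsystematic error in any single experiment

def run_study(lab_offset):
    """One experiment: signal + the lab's systematic error + noise."""
    return TRUE_EFFECT + lab_offset + rng.normal(0.0, SAMPLING_SD)

# The originating lab draws its offset once; every internal
# replication reproduces that same offset along with the effect.
origin_offset = rng.normal(0.0, LAB_SD)
self_reps = [run_study(origin_offset) for _ in range(10)]

# Direct replications elsewhere: each lab draws its own offset.
other_reps = [run_study(rng.normal(0.0, LAB_SD)) for _ in range(10)]

print(f"true effect:                  {TRUE_EFFECT:.2f}")
print(f"mean of 10 self-replications: {np.mean(self_reps):.2f}")
print(f"mean across 10 other labs:    {np.mean(other_reps):.2f}")
```

Across repeated runs, the self-replication mean stays anchored at the true effect plus the originating lab's particular offset, while the multi-lab mean converges on the true effect as labs are added, because the systematic error varies across laboratories and averages out.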
Many effects in psychology are readily obtained across laboratories and samples (e.g., Crump, McDonnell, & Gureckis, 2013; Germine et al., 2012; Roediger, 2012). If a mechanism is real, it should be possible to measure its impact across a wide range of contexts even if systematic errors weaken or strengthen it. Few studies provide compelling, reliable, replicated evidence for situational influences or individual difference moderators that completely swamp what are otherwise robust effects. Fragile or easily manipulated effects, those buffeted about by moderators, are exactly the ones most in need of replication by other laboratories. Otherwise, there is no way to know whether the original measurements were just the result of a fortunate convergence of unknown factors.

The mantra, “trust but verify,” is a hallmark of the scientific method. Cesario is right to advocate self-replication before publication—especially for preliminary or unexpected findings; we should trust our colleagues to publish only those findings that they can reproduce themselves (or those for which effect size estimates are reasonably precise due to the use of large samples). But self-replication is not adequate verification. Those same unidentified moderators that purportedly make effects fragile could also explain why the originating lab found them in the first place, leading to an internal replication that does not isolate and verify the reliability of the underlying effect. Only direct replication by other laboratories can do that, precisely because the noise should vary across laboratories.

Limitations of Scope

The idea that we should assume that undiscovered moderators are responsible for failed replications until proven otherwise is a claim about generalization. Every study in psychology is intended to generalize beyond the tested sample. If a paper reports inferential statistics (analysis of variance, t test), the authors have implicitly assumed generality to similar samples from the same target population (subjects are treated as a random effect). That assumption might later be proven wrong: perhaps the effect was specific to the sample tested that day, perhaps it only worked because testing took place in cubicles, perhaps it applies only to elite college students, or perhaps it relied on experimenter expectations. Those possibilities are what make the claim a scientific one; the assumption that an effect generalizes across situations can be disproven with evidence that it does not.

As Cesario correctly notes, researchers have been overzealous in generalizing from their results, and they should limit the scope of their claims (see also Giner-Sorolla, 2012; Henrich, Heine, & Norenzayan, 2010). But they should not assume complete specificity either. Imagine reviewing a manuscript that concludes “Our finding applies only to the students in our subject pool who were tested in our lab cubicles on a rainy afternoon in October by our RA.” Presumably, a reviewer would reject it because it is inherently unfalsifiable (and uninteresting to anyone who does not care about a case study of those subjects). Other laboratories could not measure the same effect because they could not sample from that population. The originating lab could not verify that conclusion either; even direct replications by the originating laboratory assume some generality across situations and samples.

In contrast, if researchers overreach and incorrectly assume generalization to all of humanity, then at least their claim is falsifiable. That makes it a scientific claim, albeit an unjustified one (Henrich et al., 2010). Ideally, researchers should specify the expected generality of their finding: What population are they trying to sample from when they treat subjects as a random effect in their statistics, and what population are they generalizing to? With that information, other laboratories could sample from the same population. The effect might well prove more or less general than they think, but that is an empirical question requiring direct replication by other laboratories.

The Danger of Assumed Moderation

Although Cesario rightly notes the problem of positing moderators after the fact, his solution is worse. When researchers posit a moderator explanation for discrepant results, they make a testable claim: The effect is reliable, but differences in that moderator explain why one laboratory found the effect and the other did not. They then can conduct a confirmatory study to manipulate that moderator and demonstrate that they can reproduce the effect and make it vanish. In fact, they have a responsibility to do so in order to justify their claim that the effect is robust and varies as a result of moderation. Without such confirmatory evidence, a failed replication does provide some evidence against the reliability of the original effect.

Rather than treating moderation as an empirically tractable problem, Cesario and also Stroebe and Strack make it an a priori assumption: Any failed replication by another laboratory could result from moderation, so it can be dismissed as uninformative about the underlying mechanism. Consider the logical ramifications of that assumption. Any two studies will differ in some respects, even when they are conducted in the same laboratory—no study can be an exact replication. Under this default assumption of moderation, a direct replication provides no evidence for or against the reliability of an effect unless the replicating lab verifies that all potential moderators were equated. That is something not even the original laboratory could do.
Researchers could treat a failed direct replication just like a failed conceptual replication: It provides no challenge to the original finding (Pashler & Harris, 2012).

This a priori assumption places the burden of proof on the replicating laboratory to do something that is empirically impossible because the number of possible moderators is infinite: perhaps the effect depends on the phases of the moon, perhaps it only works at a particular longitude and latitude, perhaps it requires subjects who ate a lot of corn as children. Cesario holds that theories eventually will specify all of the relevant moderators, and direct replication by other laboratories will be useful after they do. But no theory will ever be that complete. Stroebe and Strack effectively dismiss direct replication because it is not exact replication: A replication of the same method with a different sample or in a different time is not dispositive because it might test different theoretical constructs. Critically, this claim undermines not just replication failures, but also replication successes. Any successful replication might just reflect the operation of unspecified moderators rather than verification of an underlying mechanism.

If we accept the idea that we should defer direct replication by other laboratories until theories are adequately complete, then psychology is not a scientific exploration of the mechanisms that affect our behavior. We cannot accumulate evidence for the reliability of any effect. Instead, all findings, both positive and negative, can be attributed to moderators unless proven otherwise. And we can never prove otherwise.

If we reject this default assumption of moderation, as we must, what role should we grant to moderators? Almost all published papers in psychology describe positive results (Fanelli, 2010; Smart, 1964), and few provide confirmatory tests showing that an effect can be completely eliminated or reversed as a function of some moderator (Cesario cites a handful). Those few that do also require verification via direct replication using large samples to show that the moderator effects themselves are reliable. If the main effects have yet to be replicated, should we trust moderator effects that have not been either?

In the absence of published, confirmatory, replicated evidence for the influence of a moderator on an effect, what should we make of claims that such moderators matter? Either (a) those making claims of moderators are remarkably adept at guessing what factors matter in the absence of experimental evidence, or (b) they have unpublished evidence for the effect of a moderator.

If researchers are just skilled guessers, then claims of moderation are only speculative. Without experimental evidence for moderation, what are the odds that the original researchers just happened to stumble upon a sample that had the right levels of self-monitoring, used research assistants who happened to have the right social skills for that effect, tested at the right time of day and the right month of the year, and used the appropriate laboratory arrangement? Given that these moderators presumably have different consequences for different effects, how is it that researchers successfully chose the right settings when they report only one test of an effect? And, what are the odds that researchers who conducted a failed direct replication necessarily flubbed the parameter settings, leading to a negative finding?

If, instead, researchers have unpublished, confirmatory evidence for the importance of a moderator that shows how the effect hinges on the correct settings, then they can report that evidence in their Method section so that others can control for it. Lacking such documentation, other researchers must assume that following the published method will reproduce the published effect (assuming adequate power, of course: Simmons, Nelson, & Simonsohn, 2013; Tversky & Kahneman, 1971)—that is the purpose of a Method section, after all. If researchers have empirical evidence for moderation and relegate it to the file drawer, then, ironically, they have buried the very evidence they need in order to document moderator effects, and they have engaged in a questionable research practice (John, Loewenstein, & Prelec, 2012).
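“Adequate power” is doing real work in that parenthetical, and a replicating lab can check it directly before interpreting a null result. Here is a minimal simulation-based power check (Python with NumPy and SciPy); the effect size and group sizes are arbitrary illustrative values, not figures from any study discussed in this commentary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulated_power(d, n_per_group, n_sims=5000, alpha=0.05):
    """Estimate power of a two-sample t test for true effect size d."""
    hits = 0
    for _ in range(n_sims):
        treatment = rng.normal(d, 1.0, n_per_group)
        control = rng.normal(0.0, 1.0, n_per_group)
        if stats.ttest_ind(treatment, control).pvalue < alpha:
            hits += 1
    return hits / n_sims

# How large must each group be to detect d = 0.4 most of the time?
for n in (25, 50, 100, 150):
    print(f"n = {n:>3} per group -> power ~ {simulated_power(0.4, n):.2f}")
```

With a true standardized effect of 0.4, groups of 25 detect it less than a third of the time, so a failure at that size says little either way; roughly 100 per group are needed before a nonsignificant replication carries real evidential weight.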
Conclusion

Psychology has come under fire for producing unreliable effects (Pashler & Wagenmakers, 2012), and we must confront the questionable research practices that contribute to them (John et al., 2012; Simmons et al., 2011). Stroebe and Strack, however, deny the existence of a problem (“no solid data exist on the prevalence of such research practices,” p. 60), ignoring both circumstantial evidence and the fact that researchers themselves report using such practices (John et al., 2012). Stroebe and Strack even dispute that such practices would be a problem (“the discipline still needs to reach an agreement about the conditions under which they are inacceptable,” p. 60). Although they correctly note that a single failure to replicate should not be treated as definitive evidence against the existence of an effect, they dismiss the value of direct replication by adopting the premise that the purpose of replication is to provide new tests of a theory (i.e., conceptual replication) rather than to determine the reliability of an effect.

Unlike Stroebe and Strack, Cesario acknowledges the legitimacy of the problems and recommends a number of worthwhile changes to improve the reliability of published research: direct replication by the originating laboratory, increased sample sizes, an emphasis on effect sizes, and more limited claims of generality (see Brandt et al., in press, and Asendorpf et al., 2013, for similar suggestions).
He also recognizes the limits of conceptual replication as a way to measure reliability. Unfortunately, his proposal to defer direct replication by other laboratories until theories are suitably complete rejects a foundational principle of science: Direct replication by other scientists is the only way to verify the reliability of an effect.

Accumulated evidence for reliable effects is the lasting legacy of science—theories come and go, changing to account for new evidence and making new predictions. Numerous theories can account for any data and make predictions along the way. The value of direct replication comes from making sure that theories are accounting for the signal rather than the noise. A single failed replication does not prove that an original result was noise, and it will not disprove a theory, but it does add information about the reliability of the original effect. A theory based on unreliable evidence will inevitably be a flawed description of reality, and if we discount direct replication by other researchers, then all findings will stand unchallenged and unchallengeable.

Only with direct replication by multiple laboratories will our theories make useful, testable, and generalizable predictions. We should not defer replication by other labs until we have suitably rich theories, and we cannot rely on tests of different effects (conceptual replication) as a way to assess reliability. We cannot have suitably precise theories without direct replication by other laboratories. Direct replication, ideally by multiple laboratories, is the only way to measure the reliability and generality of an effect across situations. Direct replication is the only way to make sure our theories are accounting for signal and not noise.

Acknowledgments

Thanks to Brent Roberts, Chris Fraley, Rolf Zwaan, Christopher Chabris, and Hal Pashler for their comments on earlier drafts of this commentary. Their feedback was essential, but I take responsibility for the views expressed here. Thanks also to Joe Cesario for an earlier email correspondence about these issues.

Declaration of Conflicting Interests

The author declared no conflicts of interest with respect to the authorship or the publication of this article.

Note

1. The other article in this special section (Klatzky & Creswell, 2014, this issue) focuses less on the value of direct replication. Instead, it extrapolates from a model in sensory integration to develop a theoretical framework to account for the fragility of some priming results. Given that several recent articles, including those in a special issue of Perspectives, have addressed other aspects of the replicability crisis, I have focused this commentary on claims about direct replication.

References

Asendorpf, J. B., Conner, M., de Fruyt, F., de Houwer, J., Denissen, J. J. A., Fiedler, K., . . . Wicherts, J. M. (2013). Recommendations for increasing replicability in psychology. European Journal of Personality, 27, 108–119. doi:10.1002/per.1919
Brandt, M. J., IJzerman, H., Dijksterhuis, A., Farach, F. J., Geller, J., Giner-Sorolla, R., . . . Van ’t Veer, A. (in press). The replication recipe: What makes for a convincing replication? Journal of Experimental Social Psychology. Retrieved from http://dx.doi.org/10.1016/j.jesp.2013.10.005
Cesario, J. (2014). Priming, replication, and the hardest science. Perspectives on Psychological Science, 9, 40–48.
Crump, M. J. C., McDonnell, J. V., & Gureckis, T. M. (2013). Evaluating Amazon’s Mechanical Turk as a tool for experimental behavioral research. PLoS ONE, 8(3), e57410. doi:10.1371/journal.pone.0057410
Fanelli, D. (2010). “Positive” results increase down the hierarchy of the sciences. PLoS ONE, 5(4), e10068. doi:10.1371/journal.pone.0010068
Germine, L., Nakayama, K., Duchaine, B. C., Chabris, C. F., Chatterjee, G., & Wilmer, J. B. (2012). Is the web as good as the lab? Comparable performance from Web and lab in cognitive/perceptual experiments. Psychonomic Bulletin & Review, 19, 847–857.
Giner-Sorolla, R. (2012). Science or art? How aesthetic standards grease the way through the publication bottleneck but undermine science. Perspectives on Psychological Science, 7, 562–571. doi:10.1177/1745691612457576
Henrich, J., Heine, S. J., & Norenzayan, A. (2010). The weirdest people in the world? Behavioral & Brain Sciences, 33, 61–83. Retrieved from http://dx.doi.org/10.1017/S0140525X0999152X
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 524–532.
Klatzky, R. L., & Creswell, D. (2014). An intersensory interaction account of priming effects—and their absence. Perspectives on Psychological Science, 9, 49–58.
Pashler, H., & Harris, C. R. (2012). Is the replicability crisis overblown? Three arguments examined. Perspectives on Psychological Science, 7, 531–536. doi:10.1177/1745691612463401
Pashler, H., & Wagenmakers, E.-J. (2012). Editors’ introduction to the special section on replicability in psychological science: A crisis of confidence? Perspectives on Psychological Science, 7, 528–530. doi:10.1177/1745691612465253
Roediger, H. L. (2012). Psychology’s woes and a partial cure: The value of replication. Observer, 25(2), 9, 27–29.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2013, January). Life after p-hacking. Paper presented at the Society for Personality and Social Psychology annual meeting, New Orleans, LA. Retrieved from http://ssrn.com/abstract=2205186
Smart, R. G. (1964). The importance of negative results in psychological research. Canadian Psychologist, 5a, 225–232. doi:10.1037/h0083036
Stroebe, W., & Strack, F. (2014). The alleged crisis and the illusion of exact replication. Perspectives on Psychological Science, 9, 59–71.
Tversky, A., & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76, 105–110.
