Prep: the Probability of Replicating an Effect

Peter R. Killeen

Arizona State University

Killeen@asu.edu

For: The Encyclopedia of Clinical Psychology

Abstract

Prep gives the probability that an equally powered replication attempt will provide supportive
evidence—an effect of the same sign as the original, or, if preferred, the probability of a
significant effect in replication. Prep is based on a standard Bayesian construct, the posterior
predictive distribution. It may be used in three modes: to evaluate evidence; to inform belief; and to
guide action. In the first case the simple prep is used; in the second, it is augmented with estimates
of realization variance and informed priors; in the third it is embedded in a decision theory. Prep
throws new light on replicability intervals, multiple comparisons, traditional α levels, and
longitudinal studies. As the area under a diagnosticity vs. detectability curve, it constitutes a
criterion-free measure of test quality.

The issue

The foundation of science is the replication of experimental effects. But most statistical analyses
of experiments test, not whether the results are replicable, but whether they are unlikely if there
were truly no effect present. This inverse inference creates many problems of interpretation
which have become increasingly evident to the field. One of the consequences has been an
uneasy relationship between the science of psychology and its practice. The irritant is the
scientific inferential method. Not the method of John Stuart Mill, Michael Faraday, or Charles
Darwin; but that of Ronald Aylmer Fisher, Egon Pearson, and Jerzy Neyman. All were great
scientists or statisticians--Fisher both. But they grappled with scientific problems on different
scales of time, space, and complexity than do clinical psychologists. In all cases their goal was to
epitomize a phenomenon with simple verbal or mathematical descriptions, and then to show that
such a description has legs: That it explains data or predicts outcomes in new situations. But
increasingly it is being realized that results in biopsychosocial research often go lame: Effect sizes
can wither to irrelevance with subsequent replications, and credible authorities claim that “most
published research findings are false”. Highly significant effects can have negligible therapeutic
value. Something in our methodology has failed. Must clinicians now turn away from such toxic
“evidence-based” research, back to clinical intuition?

The historical context

In the physical and biological sciences precise numerical predictions can sometimes be made:
The variable A should take the value a. A may have been a geological age, a vacuum
permittivity, or the deviation of a planetary orbit. The more precise the experiment, the more
difficult it is for errant theories to pass muster. In the behavioral sciences it is rare to be able to
make the prediction A -> a. Our questions are typically not “does my model of the phenomenon
predict the observed numerical outcome?”, but rather “is my candidate causal factor C really
affecting the process?”; “Does childhood trauma increase the risk of adult PTSD?” It then
becomes a test of two candidate models: no effect, A + C ≈ A -> a; or some effect, A + C -> a + c, where we cannot specify c beforehand, but rather prefer that it be large, and in a particular direction. Since typically we also cannot specify the baseline or control level a, we test whether the difference in effects is reliably different from zero: testing experimental (A + C) and control (A) groups, and asking whether a + c ≈ a; that is, does the difference in
outcome between experimental and control groups equal zero: (a + c) - a = 0? Since the
difference will almost always be different from 0 due to random variation, the question evolves
to: Is it sufficiently larger than 0 so that we can have some confidence that the effect is real—that
is, that it will replicate? We want to know whether (a + c) - a is larger than some criterion. How
large should that criterion be?

It was for such situations that Fisher formalized and extended prior work into the analysis of
variance--ANOVA. ANOVA estimates the background levels of variability—error, or noise—
combining the variance within each of the groups studied, and asking whether the variability
between groups--the treatment effect, or signal--sufficiently exceeds that noise. The signal-to-
noise ratio is the F statistic. If the errors are normally distributed and the groups independent,
with no true effect (that is, all are drawn from the same population, so that A + C = A, and thus c
≈ 0) we can say precisely how often the F ratio will exceed the critical value associated with a chosen significance level α (alpha). If our treatment's F exceeds that value, it is believed to be unlikely that the assumption of "no effect" is true.
Because of its elegance, robustness, and refinement over the decades, ANOVA and its variants
are the most popular inferential statistics in psychology. These virtues derive from certain
knowledge of the ideal case, the null hypothesis, with deviations being precisely characterized by
p-values--significance levels.

But there are problems associated with the uncritical use of such null-hypothesis statistical
tests (NHST), ones well known to the experts, and repeated anew to every generation of students
(e.g., Krueger 2001). Among them: One cannot infer from NHST either the truth or falsity of the
null hypothesis; nor can one infer the truth or falsity of the alternative. ANOVA gives the
probability of data assuming the null, not the probability of the null given the data (see, e.g.,
Nickerson 2000). Yet, rejection of the null is de facto the purpose to which the results are
typically put. Even if the null is (illogically) rejected, significance levels do not give a clear
indication of how replicable a result is. It was to provide such a measure that prep, the probability
of replication, was introduced (Killeen 2005a).

The logic of prep

Prep is a probability derived from a Bayesian posterior predictive distribution (ppd). Assume you
have conducted a pilot experiment on a new treatment for alleviating depression, involving 20
control and 20 experimental participants, and found that the means and standard deviations were:
40 (12), 50 (15). The effect size, d, the difference of means divided by the pooled estimate of
standard deviation (13.6), is a respectable 0.74. Your t-test reports p < .05, indicating that this
result is unlikely under the null hypothesis. Is the result replicable? The answer depends on what
you consider a replication to be, and what you are willing to assume about the context of the
experiment. First the general case, and then the particulars.

Most psychologists know that a sampling distribution gives the probability of finding a statistic, such as the effect size d, given the "true" value of the population parameter, δ (delta). Under the null, δ = 0: The sampling distribution, typically a normal or t-distribution, is centered on 0. If the experimental and control groups are the same size, with nE + nC = n, then the variance of the distribution is approximately 4/(n - 4). This is shown in Figure 1. The area to the right of the
initial result, d1 = 0.74, is less than α = .05, so the result qualifies as significant. To generate a
predictive distribution, move the sampling distribution from 0 to its most likely place. Given
knowledge of only your data, that is the obtained effect size, d1 = 0.74. If this was the true effect
size δ, then that shifted sampling distribution would also give the probability of a replication:
The probability that it would be significant is the area under this distribution that lies to the right
of the α cut-off, 0.675, approximately 58%. The probability of a replication returning in the
wrong direction is the area under this curve to the left of 0—which equals the 1-tailed p-value of
the initial study.

Figure 1. The curve centered on 0 is a sampling distribution for effect size, d, under the null
hypothesis. Shifted to the right it gives the predicted distribution of effect sizes in replications, in
case the true effect size, δ, equals the recorded effect size d1. Since we do not know that δ
precisely equals d1, because both the initial and replicate will incur sampling error, the variance
of the distribution is increased—doubled in the case of an equal-powered replication, to create
the posterior predictive distribution (ppd), the intermediate distribution on the right. In the case
of a conceptual rather than strict replication, additional realization variance is added, resulting in
the lowest ppd. In all cases, the area under the curves to the right of the origin gives the
probability of supportive evidence in replication.

If we knew that the true effect size was exactly δ = 0.74, no further experiments would be
necessary. But we do not know what δ is; we can only estimate it from the original results. There
are thus at least two sources of error: the sampling error in the original, and in the ensuing
replication. This leads to a doubling of the variance in prediction, for a systematic replication
attempt of the same power. The ppd is located at the obtained estimate of d, d1, and has twice the
variance of the sampling distribution—8/(n-4). The resulting probability of achieving a
significant effect in replication, the area to the right of 0.675, shrinks to 55%.
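
The arithmetic can be made concrete with a short computational sketch (Python with scipy; the variable names, and the use of a t-based critical value to recover the 0.675 cut-off, are illustrative assumptions rather than part of the original exposition):

```python
from scipy.stats import norm, t

n, d1 = 40, 0.74                       # total sample size and observed effect size
var_d = 4 / (n - 4)                    # approximate sampling variance of d
sd_d = var_d ** 0.5                    # about 0.33

# Two-tailed .05 criterion expressed as an effect size (t with n - 2 df):
d_crit = t.ppf(0.975, n - 2) * sd_d    # about 0.675

# If delta were exactly d1, the probability that an equal-n replication is significant:
p_sig_if_delta_known = 1 - norm.cdf(d_crit, loc=d1, scale=sd_d)        # about 0.58

# Allowing for sampling error in both studies doubles the variance (the ppd):
sd_ppd = (2 * var_d) ** 0.5
p_sig_under_ppd = 1 - norm.cdf(d_crit, loc=d1, scale=sd_ppd)           # about 0.55

print(round(d_crit, 3), round(p_sig_if_delta_known, 2), round(p_sig_under_ppd, 2))
```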

What constitutes evidence of replication?

What if an ensuing replication found an effect size of 0.5? That is below your estimate of 0.74,
and falls short of significance. Is this evidence for or against the original claim? It would
probably be reported as “failure to replicate”. But that is misleading: If those data had been part
of your original study, the increase of n would have more than compensated for the decrease in d,
substantially improving the significance level of the results. The claim was for a causal factor,
and the replication attempt, though not significant, returned evidence that (weakly) supports that
claim. It is straightforward to compute the probability of finding supporting evidence of any
strength in replication. The probability of that-- a positive effect in replication-- is the area under
the ppd to the right of 0. In this case that area is .94, suggesting a very good probability that in
replication the result will not go the wrong way and contradict your original results. This is the
basic version of prep.
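
Continuing the computational sketch above, this basic prep is simply the mass of the ppd lying above zero:

```python
from scipy.stats import norm

n, d1 = 40, 0.74
sd_ppd = (2 * 4 / (n - 4)) ** 0.5               # ppd standard deviation, about 0.47

p_rep = 1 - norm.cdf(0, loc=d1, scale=sd_ppd)   # area of the ppd above zero
print(round(p_rep, 2))                          # about 0.94
```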

What constitutes a replication?

The above assumed that the only source of error was sampling variability. But there are other
sources as well, especially in the most useful case of replication, a conceptual replication
involving a different population of participants and different analytic techniques. Call this "random effects" variability the realization variance, here σ²R. In social science research it is approximately σ²R = 0.08 across a range of contexts. This noise reduces replicability, especially for studies with small effect sizes, by further increasing the spread of the ppd. The median value of σ²R = 0.08 limits all effect sizes less than 0.5 to prep < .90, no matter how many data they are based on. In the case of the above example, it reduces prep from .94 to .88, so that there is about 1 chance in 8 that a conceptual replication will come back in the wrong direction.
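
In the sketch, the same figure is obtained by adding the assumed realization variance of σ²R = 0.08 inside the doubled term (the form of the inflated standard error given in the computation section near the end of this entry):

```python
from scipy.stats import norm

n, d1, var_R = 40, 0.74, 0.08                      # var_R: assumed realization variance
var_err = 4 / (n - 4)                              # sampling variance of d
sd_ppd = (2 * (var_err + var_R)) ** 0.5            # spread for a conceptual replication

p_rep = 1 - norm.cdf(0, loc=d1, scale=sd_ppd)      # about 0.88
print(round(p_rep, 2))
```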

What is the best predictor of replicability?

In the above example all that was known were the results of the experiment: We assumed “flat
priors”--a priori ignorance of the probable effect size. In fact, however, more than that is
typically known, or suspected; the experiment comes from a research tradition in which similar
kinds of effects have been studied. If the experiment had concerned the effect of the activation of
a randomly chosen gene, or of a randomly chosen brain region, on a particular behavior, the prior
distribution would be tightly centered close to 0, and the ppd would move down toward 0. If,
however, the experiment is studying a large effect that had been reported by three other laboratories,
the priors would be centered near their average effect size, and the ppd moved up toward them.
The distance moved depends on the relative weight of evidence in the priors and in the current
data. Exactly how much weight should be given to each is a matter of art and argument. The
answer depends largely on which of the following three questions is on the table:

How should I evaluate this evidence? To avoid capricious and ever-differing evaluations of
replicability of results due to diverse subjective judgments of the weight of more or less relevant
priors, prep was presented for the case of flat, ignorance priors. This downplays precision of
prediction in the service of stability and generality of evaluation; it decouples the evaluation of
new data from the sins, and virtues, of their heritage. It uses only the information in the data at
hand, or that augmented with a standardized estimate of realization variance.

What should I believe? Here priors matter: Limiting judgment to only the data in hand is short-
sighted. If a novel experiment provides evidence for extra-sensory pre-cognition, what you
should believe should be based on the corpus of similar research, updated by the new data. In
this case, it is likely that your priors will dominate what you believe.

What should I do? NHST is of absolutely no value in guiding action, as it gives neither the
probability of the null nor of the alternative, nor can it give the probability of replication, which
is central to planning. Prep is designed to predict replicability, and has been developed into a
decision theory for action (Killeen 2006). Figure 2 displays a ppd and superimposed utility
functions that describe the value, or utility, of various effect sizes. To compute expected value,
integrate the product of the utility function with the probability of each outcome, as given by the
ppd. The utility shown as dashed lines is 0 until effect size exceeds 0, then immediately steps to
1. Its expected value is prep, the area under the curve to the right of 0. Prep has a 1-to-1
relationship with the p-value. Thus, NHST (and prep when σ²R is 0) is intrinsically indifferent to size of effect, giving equal weighting to all positive effect sizes, and none to negative ones.

Figure 2. A candidate utility function is drawn as a power function of effect size (ogive). The
value of an effect increases less than proportionately with its size. The expected utility of a future
course of action is the probability of each particular outcome (the ppd) multiplied by the utility
function—the integral of the product of the two functions. Because traditional significance tests
give no weight to effect size, their implicit utility function is flat (dashed). If drawn as a line at -7
up to the origin, and then at 1 to the right, it sets a threshold for positive utility in replication at
the traditional level α = 0.05. Other exponents for the utility function return other criteria, such
as the Akaike information criterion and the Bayesian information criterion. The ogive gives approximately
equal weight to effect size and to replicability.

If the weight on negative effect sizes were -7, then the expected utility of an effect in replication would be negative for any ppd whose area to the left of the origin exceeded 1/8 (odds of 1 to 7 against a positive effect). This sets a criterion for positive action that is essentially identical to the traditional α = .05 criterion. Conversely, that criterion de facto sets the disutility of a false alarm at seven times the utility of a hit; α = .01 corresponds to a 19-to-1 valuation of the disutility of a false positive relative to the utility of a true positive. This exposition thus rationalizes the α levels traditional in NHST.

Economic valuations are never discontinuous like these step functions; rather they look
more like the ogive, shown in Figure 2, which is a power function of effect size. To raise
expected utility above a threshold for action, such ogives require more accuracy—typically
larger n—when effect sizes are small than does NHST; conversely large effect sizes pass criteria
with smaller values of n—and replicability. Depending on the exponent of the utility function, it
will emulate traditional decision rules based on AIC, BIC, and adjusted coefficient of
determination. In the limits, as the exponent approaches 0 it returns the traditional step function
of NHST, indifferent to effect size; as it approaches 1, only effect size, not replicability, matters.
A power of 1/3 weights them approximately equally. Thus prediction, built upon the ppd, and
modulated by the importance of potential effects, can guide behavior; NHST cannot.
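
A minimal sketch of such an expected-utility calculation, using the ppd of the running example; the step and power utilities below are illustrative forms (in particular, the odd-symmetric extension of the power function to negative effect sizes is an assumption of the sketch, not a specification from the chapter):

```python
import numpy as np
from scipy.stats import norm

n, d1 = 40, 0.74
sd_ppd = (2 * 4 / (n - 4)) ** 0.5
d = np.linspace(-4, 4, 20001)                   # grid of possible replication effect sizes
ppd = norm.pdf(d, loc=d1, scale=sd_ppd)         # posterior predictive density

def expected_utility(utility):
    # Integral of the utility function weighted by the ppd
    return np.trapz(utility * ppd, d)

step = np.where(d > 0, 1.0, -7.0)               # -7 for wrong-sign outcomes, +1 otherwise
ogive = np.sign(d) * np.abs(d) ** (1 / 3)       # power-function utility, exponent 1/3

print(round(expected_utility(np.where(d > 0, 1.0, 0.0)), 2))  # equals prep, about 0.94
print(round(expected_utility(step), 2))         # positive only if P(d < 0) < 1/8
print(round(expected_utility(ogive), 2))
```

With the -7/+1 step utility, expected utility crosses zero exactly where the ppd places one eighth of its mass below the origin, reproducing the correspondence with the traditional α level noted above.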

How reliable are predictions of replicability?

Does positive psychology enhance well-being, or ameliorate depressive symptoms? A recent meta-analysis of positive psychology interventions found a mean effect size of 0.3 for both dependent variables over 74 interventions (Sin and Lyubomirsky 2009). With an average of 58 individuals per condition, and setting σ²R = 0.08, prep is .88. Of the 74 studies, 65 should therefore have found a positive effect; 66 did. Evaluation of other meta-analyses shows similarly high levels of accuracy for prep's predictions.

We may also predict that one of the studies in this ensemble should have gone the wrong way strongly (its prep > 0.85 for a negative effect). What if yours had been one of the 8 studies that
showed no or negative effects? The most extreme negative effect had a prep of a severely
misleading .88 (for negative replicates)! Prep gives an expected, average estimate of replicability
(Cumming 2005); but it, like a p-value, typically has a high associated variance (Killeen 2007). It
is because we cannot say beforehand whether you will be one of the unlucky few that some
experts (e.g., Miller 2009) have disavowed the possibility of predicting replicability in general,
and of individual research results in particular. Those with a more Bayesian perspective are
willing to bet that your results will not be the most woeful of the 74, but rather closer to the
typical. It is your money, to bet or hold; but as a practitioner, you must eventually recommend a
course of action. Whereas reserving judgment is a traditional retreat of the academic, it can be
an unethical one for the practitioner. Prep, used cautiously, provides a guide to action.

What else can be done with the ppd?

Replicability intervals. While more informative than p-values, confidence intervals are
underused and generally poorly understood. Replicability intervals delimit the values within
which a replication is likely to fall. A 50% replicability interval extends approximately one standard error of the statistic to either side of it. These traditional measures of stability of estimation may thus be centered on the statistic, and de facto delimit the values within which replications will fall half the time.
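
Under the normal approximation this is easy to verify: the half-width of a central 50% interval of the ppd is the .75 normal quantile times √2 times the standard error, about 0.95 SE, that is, very nearly one standard error. A two-line check (Python with scipy):

```python
from scipy.stats import norm

half_width_in_SE = norm.ppf(0.75) * 2 ** 0.5    # about 0.95
print(half_width_in_SE)                         # a 50% replicability interval spans roughly +/- 1 SE
```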

Multiple comparisons. If a number of comparisons have been performed, how do we decide if the ensemble of results is replicable? We are appropriately warned against alpha inflation in such circumstances, and similar considerations affect prep. But some inferences are straightforward. If
the tests are independent (as assumed, for example in ANOVA), then the probability of a
replication showing all effects to be in the same direction (or significant, etc.) is simply the
product of the replicabilities of all individual tests. The probability that at least one will again achieve your definition of replication is the complement of the product of the complements of each of the preps. Is there a simple way to recalibrate the replicability of one of k tests, post hoc? If all the tests asked exactly the same question--that is, constituted within-study replications--the probability that all would replicate is the focal prep raised to the kth power. This conservative adjustment is similar in spirit to the Šidák correction, and suitably reins in predictions of replicability for a post-hoc test.
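
These combination rules are simple products; a brief sketch with illustrative (not reported) prep values:

```python
from math import prod

preps = [0.94, 0.88, 0.81]        # hypothetical replicabilities of three independent tests

p_all_same_sign = prod(preps)                      # all effects again in the same direction
p_at_least_one = 1 - prod(1 - p for p in preps)    # at least one meets the replication criterion

k = 3
p_focal_post_hoc = preps[0] ** k  # conservative recalibration of a single post-hoc test
print(round(p_all_same_sign, 2), round(p_at_least_one, 3), round(p_focal_post_hoc, 2))
```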

Model comparison and longitudinal studies. Ashby and O’Brien (2008) have generalized the use
of prep for the situation of multiple trials with a small number of participants, showing how to
evaluate alternate models against different criteria (e.g., AIC, BIC). Their analysis is of special
interest both to psychophysicists, and to clinicians conducting longitudinal studies.

Diagnosticity vs. detectability. Tests can succeed in two ways: they can affirm when the state of
the world is positive (a hit), and they can deny when it is negative (a correct rejection). Likewise
they can fail in two ways: affirm when the state of the world is negative (a false alarm, a Type I
error), and deny when it is positive (a miss, a Type II error). The detectability of a test is its hit
rate; the diagnosticity is its correct rejection rate. Neither alone is an adequate measure of the
quality of a test: Detectability can be perfect if we always affirm, driving the diagnosticity to 0—we can detect 100% of children with ADHD if the test is "Do they move?". A Relative Operating Characteristic, or ROC, gives the hit rate as a function of the false alarm
rate. The location on the curve gives the performance for a particular criterion. If the criterion for
false alarms is set at α = .05, then the ordinate gives the power of the test. But that criterion is
arbitrary. What is needed to evaluate a test is the information it conveys independently of the
particular criterion chosen. The area under the ROC curve does just that: It measures the quality
of the test independently of the criterion for action. Irwin (2009) has shown that this area is
precisely the probability computed by prep: prep thus constitutes a criterion-free measure of the
quality of a diagnostic test.
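
Irwin's equivalence can be checked numerically under the equal-variance normal assumptions of Figure 1; in the sketch below the "noise" distribution is the null sampling distribution and the "signal" distribution is the one shifted to the observed result (the numerical integration, and the particular z value taken from the running example, are illustrative):

```python
import numpy as np
from scipy.stats import norm

z1 = 2.22                               # observed z of the running example (d1 / sd of d)
p_one_tailed = 1 - norm.cdf(z1)

# Equal-variance normal ROC: sweep the criterion across both distributions.
c = np.linspace(-10, 10, 200001)        # decision criteria
far = 1 - norm.cdf(c)                   # false-alarm rate at each criterion
hit = 1 - norm.cdf(c - z1)              # hit rate at each criterion

auc = -np.trapz(hit, far)               # area under the ROC (far decreases as c increases)
prep = norm.cdf(norm.ppf(1 - p_one_tailed) / np.sqrt(2))
print(round(auc, 3), round(prep, 3))    # both about 0.94
```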

Efficacy vs. effectiveness. In the laboratory an intervention may show significant effects of good
size—its efficacy—but in the field its impact—its effectiveness—will vary, and will often
disappoint. There are many possible reasons for this difference, such as differences in the skills
of administering clinicians, the need to accommodate individuals with comorbidities, and so on.
These variables increase realization variance, and thus decrease replicability. Finding that
effectiveness is generally less than efficacy is but another manifestation of realization variance.

How can prep improve the advice given to patients?

What is the probability that a depressed patient will benefit from a positive psychology
intervention? A representative early study found that group therapy was associated with a
significant decrease in Beck Depression Inventory scores for a group of mildly to moderately
depressed young adults. The effects were enduring, with an effect size of 0.6 at 1-year posttest.
Assuming a standard realization variance of 0.08, prep is .81. But that is for an equal-powered
replication. What is the probability that your patient could benefit from this treatment? Here we
are replicating with an n of 1, not 33. Instead of doubling the variance of the original study, we
must add to it the variance of the sampling distribution for n = 1; that is, the standard deviation
of effect size, 1. This returns a prep of 0.71. Thus, there is about a 70% chance that positive
psychotherapy will help your patient for the ensuing year--which, while not great, may be better
than the alternatives. Even when the posterior is based on all of the data in the meta-analysis, n >
4000, it does not change the odds for your patient, as that is here limited by the effect sizes for
these interventions, and your case of 1. You nonetheless have an estimate to offer her, insofar as
it may be in her interest.
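
A sketch of that calculation, assuming a total original n of about 33 (inferred from the comparison "an n of 1, not 33"), realization variance 0.08, and unit sampling variance of the effect size for a single case; with these assumptions the numbers come out close to those in the text:

```python
from scipy.stats import norm

d1, n, var_R = 0.6, 33, 0.08
var_study = 4 / (n - 4) + var_R                 # variance contributed by the original study

# Equal-powered replication: double the study variance.
prep_equal = 1 - norm.cdf(0, loc=d1, scale=(2 * var_study) ** 0.5)     # about 0.82 (text: .81)
# Single patient: add unit sampling variance of d for n = 1 instead.
prep_single = 1 - norm.cdf(0, loc=d1, scale=(var_study + 1.0) ** 0.5)  # about 0.71

print(round(prep_equal, 2), round(prep_single, 2))
```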

Why is prep controversial?

The original exposition contained errors (Doros and Geier 2005), later corrected (Killeen 2005b,
2007). Analyses show that prep is biased when used to predict the coincidence in the sign of the
effects of two future experiments; strongly biased when the null is true (Miller and Schwarz
2011) or when the true effect size is stipulated. But prep is not designed to predict the coincidence
of future experiments. It is designed to predict the replication of known data, based on those data
(Lecoutre and Killeen 2010). If we know, or postulate, that the null is true, then the correct prior
in prep is δ = 0, with variance of 0. No matter what the first effect size, the probability of
replication is ½--and that is precisely what prep will predict. Prep was developed for what
scientists can know, not what statisticians can stipulate; and all that scientists are privy to are
data, not parameters.

Prep is often computed incorrectly. It is easily computed from one-tailed p-values: When σ²R = 0, in Excel it is prep = NORMSDIST(NORMSINV(1-p)/SQRT(2)). In general, compute a p-value with the standard error increased to SE = [2(σ²Err + σ²R)]^(1/2), where σErr is the standard error of the statistic being evaluated. Then the complement of that p-value is the probability of replication. A
common mistake is to use 2-tailed p-values directly, rather than first halving them (Lecoutre,
Lecoutre, and Poitevineau 2010). Another problem is that prep does not dictate an absolute
criterion such as α = .025. For that, one needs to embed the ppd in a full-fledged decision theory.
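
An equivalent recipe in Python (the function name and argument structure are illustrative; norm.cdf and norm.ppf play the roles of NORMSDIST and NORMSINV):

```python
from scipy.stats import norm

def prep_from_p(p_one_tailed, se=None, var_R=0.0):
    """Probability of a same-sign replication from a one-tailed p-value.

    If realization variance var_R is supplied, the statistic's standard
    error se must also be given so the predictive spread can be inflated.
    """
    z = norm.ppf(1 - p_one_tailed)                 # z score of the original result
    if var_R and se is not None:
        effect = z * se                            # recover the raw effect
        se_rep = (2 * (se ** 2 + var_R)) ** 0.5    # SE = [2(sigma_Err^2 + sigma_R^2)]^(1/2)
        return norm.cdf(effect / se_rep)
    return norm.cdf(z / 2 ** 0.5)                  # the sigma_R = 0 shortcut

print(round(prep_from_p(0.025), 2))                # one-tailed p = .025 gives prep of about .92
```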

The vast majority of individuals who assume the null don’t believe it; but they don’t know
what else to assume. Evaluating the probability of replication avoids that dilemma. To evaluate
evidence, use the basic prep, with σ²R set at a standardized value such as 0. To know what to believe, augment the simple version of prep with what you know, whether realization variance or priors. If you have a well-defined alternative hypothesis, use Bayesian analyses.

SEE ALSO: Bayes Theorem; Clinical Decision Making; Clinical Significance; Effect Size;
Efficacy versus Effectiveness; Generalizability; Null Hypothesis Statistical Tests; Power

References

Ashby, F. Gregory, and Jeffrey B. O'Brien. 2008. "The prep statistic as a measure of confidence in
model fitting." Psychonomic Bulletin & Review no. 15:16-27. doi: 10.3758/PBR.15.1.16.
Cumming, Geoff. 2005. "Understanding the average probability of replication: Comment on
Killeen (2005)." Psychological Science no. 16:1002-1004. doi: 10.1111/j.1467-9280.2005.01650.
Doros, Gheorghe, and Andrew B. Geier. 2005. "Comment on "An Alternative to Null-
Hypothesis Significance Tests"." Psychological Science no. 16:1005-1006. doi: 10.1111/j.1467-
9280.2005.01651.x.
Irwin, R. John. 2009. "Equivalence of the statistics for replicability and area under the ROC
curve." British Journal of Mathematical and Statistical Psychology no. 62 (3):485-487. doi:
10.1348/000711008X334760
Killeen, Peter R. 2005a. "An alternative to null hypothesis significance tests." Psychological
Science no. 16:345-353. doi: 10.1111/j.0956-7976.2005.01538.
Killeen, Peter R. 2005b. "Replicability, confidence, and priors." Psychological Science no.
16:1009-1012. doi: 10.1111/j.1467-9280.2005.01653.x.
Killeen, Peter R. 2006. "Beyond statistical inference: a decision theory for science."
Psychonomic Bulletin & Review no. 13:549-562. doi: 10.3758/BF03193962.
Killeen, Peter R. 2007. "Replication statistics." In Best practices in quantitative methods, edited
by J. W. Osborne, 103-124. Thousand Oaks, CA: Sage.
Krueger, Joachim. 2001. "Null hypothesis significance testing: On the survival of a flawed
method." American Psychologist no. 56:16-26. doi:10.1037//0003-066X.56.1.16
Lecoutre, Bruno, and Peter R. Killeen. 2010. "Replication is not coincidence: Reply to Iverson,
Lee, and Wagenmakers (2009)." Psychonomic Bulletin & Review no. 17 (2):263-269. doi: 10.3758/PBR.17.2.263.
Lecoutre, Bruno, Marie-Paule Lecoutre, and Jacques Poitevineau. 2010. "Killeen's probability of
replication and predictive probabilities: How to compute and use them." Psychological Methods
no. 15:158-171. doi: 10.1037/a0015915.
Miller, Jeff. 2009. "What is the probability of replicating a statistically significant effect?"
Psychonomic Bulletin & Review no. 16:617-640. doi: 10.3758/PBR.16.4.617.
Miller, Jeff, and Wolf Schwarz. 2011. "Aggregate and individual replication probability within
an explicit model of the research process." Psychological Methods. doi: 10.1037/a0023347.
Nickerson, Raymond S. 2000. "Null hypothesis significance testing: A review of an old and
continuing controversy." Psychological Methods no. 5:241-301. doi: 10.1037/1082-
989X.5.2.241.
Sin, Nancy L., and Sonja Lyubomirsky. 2009. "Enhancing well-being and alleviating depressive
symptoms with positive psychology interventions: A practice-friendly meta-analysis." Journal of
Clinical Psychology no. 65 (5):467-487. doi: 10.1002/jclp.20593.
Further reading

Cumming, Geoff. 2012. Understanding The New Statistics: Effect Sizes, Confidence Intervals,
and Meta-Analysis. Edited by L. Harlow, Multivariate Applications. New York, NY: Routledge.

Harlow, Lisa Lavoie, Stanley A. Mulaik, and James H. Steiger. 1997. What if there were no
significance tests? Mahwah, NJ: Lawrence Erlbaum Associates.
