
Effect Size

Helena Chmura Kraemer
Stanford University, U.S.A. and University of Pittsburgh, U.S.A.

The Encyclopedia of Clinical Psychology, First Edition. Edited by Robin L. Cautin and Scott O. Lilienfeld.
© 2015 John Wiley & Sons, Inc. Published 2015 by John Wiley & Sons, Inc.
DOI: 10.1002/9781118625392.wbecp048

An effect size is a population parameter (estimated in a sample) designed to indicate the practical or clinical importance of a phenomenon under study. As the limitations of reporting and understanding p values from statistical hypothesis testing become ever more apparent, it is clearer that for any hypothesis one might want to test (i.e., any phenomenon of interest), it would be good practice to define an appropriate population parameter that would indicate how far the actual situation deviates from the null hypothesis: an effect size.

First, an effect size is important in designing any hypothesis-testing study. To be a valid 5% significance test, for example, the probability of rejecting the null hypothesis must be less than 5% for any effect size consistent with the null hypothesis. To have, say, 80% power, one sets a critical value of the effect size that indicates the lower limit of practical/clinical significance, and then demonstrates that the probability of rejecting the null hypothesis is greater than 80% for any effect size greater than that critical value.

Second, an effect size is also important in reporting the results of a hypothesis-testing study. Many a statistically significant result has no practical/clinical importance; that is, the estimated effect size is less than the critical value of that effect size. A statistically insignificant result means either that the study was poorly designed (inadequate sample size, unreliable measures, etc.) or the effect size is well below the critical value. If the former is true, a better design might be considered; if the latter, the hypothesis might not be worth further pursuit. It is the effect size that guides how one interprets results. Hence, in recent years, publication guidelines, as well as reviewers and editors, have begun to protest naked p values, asking for meaningful effect sizes and their confidence intervals.

Third, meta-analysis is based on effect sizes. From each study of a phenomenon, an effect size and a measure of its precision are extracted. These are compared and, if appropriate, combined over the studies to achieve a consensus estimation of the effect size.

Finally, exploration (hypothesis-generation) studies must necessarily place primary importance on the estimation of effect sizes because there are no a priori hypotheses to be tested.

There are no universally accepted criteria for what constitutes a well-chosen effect size. Certainly, the effect size must be interpretable in the context of its use. Many propose that the effect size should be scale free; that it should have null value zero; that it should be invariant under at least linear rescaling of the measures, perhaps preferably under monotonic rescaling. As an illustration, let us consider probably the most common phenomenon: the two-group comparison.

Two-Group Comparisons

The researchers sample n1 subjects from one population and n2 subjects from another. Here Y is a random variable measured on all subjects with possibly different distributions for the two populations.

Cohen’s d

Here the most familiar measure of effect size is Cohen’s d (Cohen, 1988). This is a measure designed to compare two populations, where Y has a normal distribution in both with equal standard deviations. Cohen’s d is defined as the difference between the means in the two populations divided by their common standard deviation. It indicates the amount of overlap between the two standard normal distributions. Cohen’s d is invariant under linear transformation of Y. Whenever the two-sample t test is valid, so is Cohen’s d as an effect size.

There are variations on this theme. Instead of dividing by the estimate of the common standard deviation, one might divide by the standard deviation of one of the groups (e.g., the control group in a randomized clinical trial) or by the square root of the average variance. The magnitude of Cohen’s d can be visualized as the overlap between two standard normal distributions but it is a challenge to visualize what such variations in Cohen’s d represent.

The Mean Difference

Cohen’s d is the standardized mean difference between two groups. Why standardize, particularly if the scale used is a familiar one? The problem is that if the two group means were to differ by 10 units, and the within-group standard deviation were 1, there would be no overlap between the distributions, and no doubt of the clinical significance of such a finding; however, if the within-group standard deviation were 100, the overlap between the distributions would be almost complete, and the clinical significance very doubtful. Thus ignoring the standard deviation leaves the question of clinical/practical significance completely in doubt.

Area under the ROC Curve (AUC)

There are many situations in which the variable Y does not have a normal distribution in one of the groups, or where the variances are unequal. Cohen’s d is quite robust to minor deviations from its underlying assumptions, but using it with three-, four- or five-point scales, or with continua with outliers, or unequal variances, can be quite misleading. The AUC, representing the area under the receiver operating characteristic curve (ROC curve), is one alternative. For present purposes:

AUC = probability (G1 > G2) + 0.5 probability (G1 = G2),

where, for example, “G1 > G2” means that if one compared a randomly selected subject in the first group (G1) with a randomly selected subject in the second group (G2), the subject in the first group has a bigger (or better) response on Y and “G1 = G2” indicates equality of responses. If the normal distribution underlying Cohen’s d holds, then AUC = Φ(d/√2), where Φ() is the cumulative standard normal distribution function and d is either Cohen’s d (using the pooled variance in the denominator) or d using the average of the two groups’ variances in the denominator. The AUC is invariant under all monotonic transformations and can be used with any Y (including binary, three-, four- and five-point scales, and continua with highly skewed or long-tailed distributions). Perhaps its only weakness is that its null value is 0.5, rather than 0, but that is easily corrected by using instead SRD (success rate difference) = 2AUC − 1. The SRD shares all the qualities of AUC, but now the null value is zero, and the two extremes where the two populations do not overlap at all are +1 and −1.

Number Needed to Treat/Take (NNT)

Clinicians, patients and policy makers often have problems interpreting probability points (as in AUC or SRD). The NNT instead reports on number of subjects, a scale sometimes easier to interpret. Suppose one defines a subject as a “success” if that subject had a response (Y) bigger (better) than a randomly chosen subject in the other group. How many subjects would one need to sample from G1 to have one more “success” than if one sampled the same number from G2? Answer: NNT, which is equal to 1/SRD. The NNT is most familiar when Y is success/failure, in which case SRD = p1 − p2, the difference in the probabilities of success in the two groups (hence “success rate difference”). However, ease of interpretation here translates to confusing mathematical properties. For example, NNT is undefined when SRD = 0 (i.e., when there is no difference between G1 and G2). Generally, therefore, for research purposes, SRD is preferred to NNT but one can always translate a SRD to NNT for clear communication purposes.
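
The two-group measures described above can be estimated directly from two samples. The following is a minimal Python sketch and not part of the original entry: the function names and the small data set are hypothetical, chosen only for illustration. It assumes the pooled-variance form of Cohen’s d, estimates AUC by comparing every cross-group pair, and uses the relation AUC = Φ(d/√2), which holds only under the equal-variance normal model.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def cohens_d(g1, g2):
    """Difference in sample means divided by the pooled within-group SD."""
    n1, n2 = len(g1), len(g2)
    pooled_var = ((n1 - 1) * variance(g1) + (n2 - 1) * variance(g2)) / (n1 + n2 - 2)
    return (mean(g1) - mean(g2)) / sqrt(pooled_var)


def auc(g1, g2):
    """Estimate P(G1 > G2) + 0.5 * P(G1 = G2) over all cross-group pairs."""
    score = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in g1 for b in g2)
    return score / (len(g1) * len(g2))


def srd(auc_value):
    """Success rate difference: rescales AUC so that the null value is 0."""
    return 2.0 * auc_value - 1.0


def nnt(srd_value):
    """Number needed to treat, 1/SRD; None when SRD = 0, where NNT is undefined."""
    if srd_value == 0:
        return None
    return 1.0 / srd_value


def auc_from_d(d):
    """AUC implied by d, assuming normal distributions with equal variances."""
    return NormalDist().cdf(d / sqrt(2.0))


# Hypothetical responses for two small groups (illustration only).
group1 = [5.0, 7.0, 8.0, 9.0, 11.0]
group2 = [4.0, 5.0, 6.0, 7.0, 8.0]

d_hat = cohens_d(group1, group2)
auc_hat = auc(group1, group2)
srd_hat = srd(auc_hat)

print("Cohen's d:", round(d_hat, 3))
print("AUC (pairwise):", round(auc_hat, 3))
print("AUC (from d, normal model):", round(auc_from_d(d_hat), 3))
print("SRD:", round(srd_hat, 3))
print("NNT:", nnt(srd_hat))
```

With these illustrative values the pairwise estimate gives AUC = 0.78 and SRD = 0.56, so NNT is about 1.8. Returning None when SRD = 0 is one possible design choice reflecting that NNT is undefined there; in that case a report would fall back on SRD itself.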
Interpreting Effect Size

When Cohen introduced d (Cohen, 1988), he suggested, as a general guideline, that “small”, “medium” and “large” effects corresponded to d = 0.2, 0.5, 0.8 respectively (hence, SRD = 0.1, 0.3, 0.4 or NNT = 9, 4, 2). However, Cohen also clearly warned against reification of this guideline, and indicated that the appropriate level should always be dictated by the specific context. Thus, for example, NNT = 2,500 might be large in comparing subjects with vaccine versus saline injections to prevent polio in a randomized clinical trial, but would be ridiculously small in comparing a drug versus psychotherapy to treat major depression, where NNT = 3 might be a more logical choice. Thus the critical effect size and the interpretation of the magnitudes of the effect size are context bound, guided by what is already known and accepted in the particular context of use.

Beyond Two-Group Comparisons

A discussion like the above for the two-group comparison problem is possible for each type of phenomenon of interest. What is the best choice of effect size to describe the correlation between two random variables, or the reliability of an ordinal measure or a binary measure, or the predictive value of a set of variables to an outcome, and so forth? The principles underlying effect sizes have been known as long as statistical hypothesis-testing has been known but it is only in recent years that these discussions have taken on urgency and importance (Cumming, 2012; Grissom & Kim, 2012).

SEE ALSO: Clinical Significance; Cohen, Jacob (1922–98); Randomized Clinical Trials

References

Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.
Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. New York, NY: Routledge.
Grissom, R. J., & Kim, J. J. (2012). Effect sizes for research: Univariate and multivariate applications. New York, NY: Routledge.

Further Reading

Acion, L., Peterson, J. J., Temple, S., & Arndt, S. (2006). Probabilistic index: An intuitive non-parametric approach to measuring the size of treatment effects. Statistics in Medicine, 25(4), 591–602.
Kraemer, H. C. (2007). Correlation coefficients in medical research: From product moment correlation to the odds ratio. Statistical Methods in Medical Research, 15(6), 525–545.
Kraemer, H. C., & Kupfer, D. J. (2006). Size of treatment effects and their importance to clinical research and practice. Biological Psychiatry, 59(11), 990–996.
McGough, J. J., & Faraone, S. V. (2009). Estimating the size of treatment effects: Moving beyond p values. Psychiatry (Edgmont), 6(10), 21–29.
Preacher, K. J., & Kelley, K. (2011). Effect size measure for mediation models: Quantitative strategies for communicating indirect effects. Psychological Methods, 16(2), 93–115.
Shearer-Underhill, C., & Marker, C. (2010). The use of number needed to treat (NNT) in randomized clinical trials in psychological treatment. Clinical Psychology: Science and Practice, 17, 41–48.
