Cohen (1992) StatisticalPower

Statistical Power Analysis
Author(s): Jacob Cohen
Source: Current Directions in Psychological Science, Vol. 1, No. 3 (Jun., 1992), pp. 98-101
Published by: Sage Publications, Inc. on behalf of Association for Psychological Science
Stable URL: http://www.jstor.org/stable/20182143
98 VOLUME 1, NUMBER 3, JUNE 1992

Statistical Power Analysis

Jacob Cohen

Jacob Cohen, Professor of Psychology at New York University, is the author of Statistical Power Analysis for the Behavioral Sciences (2nd ed., 1988) and co-author with Patricia Cohen of Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences (2nd ed., 1983), both published by Lawrence Erlbaum Associates. Address correspondence to Jacob Cohen, Department of Psychology, New York University, 6 Washington Place, 5th Floor, New York, NY 10003.

The power of a statistical test of a null hypothesis (H₀) is the probability that the H₀ will be rejected when it is false, that is, the probability of obtaining a statistically significant result. Statistical power depends on the significance criterion (α), the sample size (N), and the population effect size (ES).

The importance of power analysis arises from the fact that most empirical research in the social and behavioral sciences proceeds by formulating and testing H₀s that investigators hope to reject as a means of establishing facts about the phenomena under study.

A typical H₀ is that a population product-moment correlation, r, is zero, to be tested at the two-sided .05 level (α₂ = .05). When this H₀ is tested on a sample of N cases randomly drawn from a population in which r indeed does equal zero, researchers risk mistakenly rejecting the H₀ when it is true, a Type I error, whose rate (.05) is controlled by the α criterion. They also risk mistakenly accepting the H₀ as tenable when it is false, a Type II error, whose probability is called β. Power is thus 1 − β, the probability of not accepting the H₀ when it is false, that is, the probability of successfully rejecting the H₀.

The outcome of a statistical test depends on the degree to which the H₀ is false, that is, on the magnitude of the population ES, which in this case is the absolute size of the population r: the larger the r, the greater the likelihood that the H₀ will be rejected. It is also true that the outcome depends on N, a larger sample being more likely to result in rejection of a false H₀ than a smaller one. Thus, at α₂ = .05, for example, if the population r is .30, when N is 40, the power of the standard t test of a sample r turns out to equal .48, whereas when N is 80, power is .78. If the population r is .40, when N is 40, power is .74, but when N is 80, power is .96. Finally, the test outcome depends also on α, the risk of a Type I error. A smaller and therefore more stringent α criterion, say, α₂ = .01, for any given population r and N, would result in smaller power. For example, with population r = .30 and N = 80, power at α₂ = .05 is .78, while power at α₂ = .01 is only .56.¹ Note also that at any given value of α, a two-sided test is more stringent than a one-sided test.

Statistical power analysis exploits the mathematical relationship among these four variables in statistical inference: power, α, N, and ES. The relationship is such that when any three of them are fixed, the fourth is determined. Two forms of power analysis are most useful: One is the determination of the N that is necessary to attain a specified degree of power to detect as significant (at specified α) a hypothesized ES. This form of power analysis is used in research planning. The second is the determination of power to detect a hypothesized ES (for specified N and α), the form used in meta-analytic power reviews of research areas or journals.
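The dependence of power on ES, N, and α can be sketched numerically. The example below is only an approximation under stated assumptions: it uses the Fisher z transformation with a normal reference distribution rather than the book tables cited in Note 1, and the function name `power_correlation` is hypothetical. Under those assumptions it should land within about .01 to .02 of the power values quoted above.

```python
from math import atanh, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power_correlation(rho, n, alpha=0.05, two_sided=True):
    """Approximate power of the test that a population correlation is zero.

    Assumption: Fisher z approximation (not the article's tabled values).
    """
    tail = alpha / 2 if two_sided else alpha
    z_crit = _STD_NORMAL.inv_cdf(1 - tail)
    # Fisher-transformed population r expressed in standard-error units.
    shift = atanh(rho) * sqrt(n - 3)
    power = 1 - _STD_NORMAL.cdf(z_crit - shift)
    if two_sided:
        # Small chance of rejecting in the direction opposite to the effect.
        power += _STD_NORMAL.cdf(-z_crit - shift)
    return power


# The five (r, N, alpha2) combinations discussed in the text.
for rho, n, alpha in [(0.30, 40, 0.05), (0.30, 80, 0.05),
                      (0.40, 40, 0.05), (0.40, 80, 0.05),
                      (0.30, 80, 0.01)]:
    approx = power_correlation(rho, n, alpha)
    print(f"r = {rho:.2f}, N = {n}, alpha2 = {alpha:.2f}: power ~ {approx:.2f}")
```

Fixing any three of the four quantities and solving for the fourth is the same calculation run in a different direction; here α, N, and ES are fixed and power is the unknown.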
Published by Cambridge University Press
EFFECT SIZE

I noted that in testing a sample r, the ES is simply the population r. More generally, in the Neyman-Pearson system of statistical induction,² whence the concept of power is derived, the ES is the discrepancy between the null hypothesis, H₀, and the alternate hypothesis of interest, H₁. For testing a sample r, the H₀ is that the population r is zero, and the H₁ posits a specific nonzero value, for example, .30. Thus, the ES in this example is simply the difference: .30 − .00. Every statistical test has its own ES index, a continuous value that runs from zero, when the H₀ is true. Each ES index is a pure (i.e., scale-free) value that measures, in terms appropriate to it, the discrepancy between the H₀ and the H₁.

For example, the ES index for the difference between independent means in the classical t test is d, the difference between the population means standardized by dividing this difference by the common within-population standard deviation. (The difference is absolute for two-sided tests and is either positive or negative for one-sided tests.) The standardization results in a scale-free measure: d = .25 implies a quarter of a standard deviation difference between the population means, free of the units of measure of the variable in question, whether they are inches, centimeters, or points scored on a psychological test.
As another example, for testing the departure of a population proportion (P) from .50, the ES index is g = P − .50. If an investigator believes that there is a sex difference in the incidence of dyslexia such that boys are at different risk from girls, in a sample of dyslexic children, she would posit as the H₀ that half the sample are of one sex, and as the H₁ that a specified different proportion, say, .60, are of the other. The ES index would then be g = .60 − .50 = .10. Still another example is the analysis of variance test that a set of population means are all equal. The ES index for this test, f, is the standard deviation of these means divided by the common within-population standard deviation of the observations.¹
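The three ES indices just described (d, g, and f) are simple enough to compute directly. The following is a minimal sketch that follows the verbal definitions above; the function names and the numeric inputs (a scale with a within-population standard deviation of 15) are hypothetical and chosen only for illustration.

```python
from statistics import pstdev


def cohen_d(mean_a, mean_b, sd_within, signed=False):
    """Standardized difference between two population means.

    Absolute by default (two-sided tests); signed for one-sided tests.
    """
    diff = (mean_a - mean_b) / sd_within
    return diff if signed else abs(diff)


def cohen_g(proportion):
    """Departure of a population proportion from .50."""
    return abs(proportion - 0.50)


def cohen_f(population_means, sd_within):
    """Standard deviation of the population means over the within-population SD."""
    return pstdev(population_means) / sd_within


# Hypothetical illustration: means 3.75 points apart on a scale with SD 15.
print(round(cohen_d(103.75, 100.0, 15.0), 2))   # the quarter-SD case, d = .25
# The dyslexia example: H1 posits a proportion of .60, so g = .10.
print(round(cohen_g(0.60), 2))
# Three hypothetical population means with the same within-population SD.
print(round(cohen_f([95.0, 100.0, 105.0], 15.0), 2))
```

Because each index divides out the unit of measurement, rescaling the inputs (say, from inches to centimeters) leaves d and f unchanged.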
Investigators in the social sciences find specifying the ES the most difficult aspect of power analysis. This is at least partly due to the relatively low level of consciousness about magnitudes in those disciplines. The conquest of psychological science by Fisherian null hypothesis testing (where the alternative to the H₀ is simply its negation, so that no H₁ is specified) has had the unfortunate effect of emphasizing the magnitudes of p values from significance tests rather than the magnitudes of the psychological phenomena under study.³ A salutary side effect of the study of power analysis is its emphasis on ES. Neither power nor sample size can be determined in the absence of the investigator's readiness to consider just how wrong the null hypothesis is likely to be (i.e., the ES). The decision as to what population ES to posit arises from the investigator's knowledge of the field: the sample ESs found in previous investigations with similar variables, the results of pilot studies (though not reliable when based on small samples), and his or her educated intuition.

Because the ES indices are not generally familiar, I have proposed as conventions, or operational definitions, "small," "medium," and "large" values of each ES index to provide the user with some sense of its scale.¹ It was my intent that medium ES represent an effect of a size likely to be apparent to the naked eye of a careful observer, that small ES be noticeably smaller yet not trivial, and that large ES be the same distance above medium as small is below it. I also made an effort to make these conventions comparable across different statistical tests.

For example, for the test that r = 0, small, medium, and large ESs are, respectively, the population rs .10, .30, and .50. For the test that two population means are equal, the ESs, in the same order, are d = .20, .50, and .80. The .20 ES is exemplified by the mean IQ difference between twins and nontwins (the latter being larger), the .50 ES by the mean IQ difference between clerical and semiskilled workers, and the .80 ES by the mean IQ difference between Ph.D.s and college freshmen. In the analysis of variance test of the H₀ that populations have equal means, the index is f (the standardized standard deviation of the means), and the values for small, medium, and large ESs are, respectively, .10, .25, and .40.¹

Another means of facilitating the understanding of the various ES indices is by transforming them into other measures. For example, many of the ES indices (e.g., d, f, and the ES indices for the difference between proportions and for the degree of association in contingency tables of frequencies) may be translated into correlation coefficients or their squares, which may then be interpreted as proportions of variance. As another example, d may be expressed as various proportions of (non)overlap between normal distributions.¹
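Two such translations of d can be sketched as follows. The formulas are not given in the article itself; they are standard results assumed here: the point-biserial correlation r = d / sqrt(d² + 4), which holds when the two populations are equally numerous, and a nonoverlap measure (often labeled U1) for two normal distributions with equal variance. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def d_to_r(d):
    """Convert d to a correlation, assuming two equally sized populations."""
    return d / sqrt(d * d + 4)


def nonoverlap_u1(d):
    """Proportion of the combined area of two equal-variance normal
    distributions that is not shared, for means d standard deviations apart."""
    u2 = NormalDist().cdf(abs(d) / 2)
    return (2 * u2 - 1) / u2


# The conventional small, medium, and large values of d.
for label, d in [("small", 0.20), ("medium", 0.50), ("large", 0.80)]:
    r = d_to_r(d)
    print(f"{label:6s} d = {d:.2f}: r ~ {r:.3f}, r^2 ~ {r * r:.3f}, "
          f"nonoverlap ~ {nonoverlap_u1(d):.3f}")
```

Read this way, a medium d of .50 corresponds to a correlation of roughly .24, or about 6% of variance, which gives some sense of how modest the conventional values are.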
α, THE SIGNIFICANCE CRITERION

The probability of mistakenly rejecting the H₀, α, represents a research policy: the maximum risk one is prepared to take of making this error. It has become conventional that unless otherwise stated, this risk is set at .05. Smaller and thus more stringent values may be used, for example, when several H₀s are to be tested in order to minimize the risk of making any Type I errors in the investigation (the experimentwise risk). Larger values may be used in exploratory studies. Also, for tests whose ESs may be either positive or negative, α may be defined as two-sided or one-sided. The latter has more power than the former when the sample effect is in the direction posited, but has zero power when the effect is in the opposite direction because the one-sided test logically precludes a contrary finding.
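The trade-off between one-sided and two-sided criteria can be illustrated with a small sketch. It assumes the Fisher z normal approximation for a test of a correlation (an assumption, not the article's tabled method), and it assumes the one-sided H₁ posits a positive r; the function names are hypothetical.

```python
from math import atanh, sqrt
from statistics import NormalDist

N01 = NormalDist()


def power_two_sided(rho, n, alpha):
    """Approximate power when rejection is allowed in either direction."""
    shift = atanh(rho) * sqrt(n - 3)
    z_crit = N01.inv_cdf(1 - alpha / 2)
    return (1 - N01.cdf(z_crit - shift)) + N01.cdf(-z_crit - shift)


def power_one_sided(rho, n, alpha):
    """Approximate power when only large positive sample r values reject H0."""
    shift = atanh(rho) * sqrt(n - 3)
    z_crit = N01.inv_cdf(1 - alpha)
    return 1 - N01.cdf(z_crit - shift)


n, alpha = 40, 0.05
# Effect in the posited direction: the one-sided test is the more powerful.
print(power_two_sided(0.30, n, alpha), power_one_sided(0.30, n, alpha))
# Effect in the opposite direction: the one-sided test almost never rejects.
print(power_two_sided(-0.30, n, alpha), power_one_sided(-0.30, n, alpha))
```

The second line of output shows the cost of the one-sided choice: an effect of the same size in the unposited direction is detected about as often as before by the two-sided test, and essentially never by the one-sided test.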
Copyright © 1992 American Psychological Society

DETERMINING SAMPLE SIZE
In planning research, deciding the sample sizes is crucial. Because research costs are at least approximately linear in the number of subjects, cost-effectiveness demands that this decision be appropriate. When asked in connection with a particular investigation what α and power are desired, a neophyte researcher might suggest α₂ = .01 and some very large value for power, say, .99. Power analysis quickly determines that these specifications necessitate a sample size that is likely beyond the available resources. For example, for a test of the difference between means, if a medium ES (d = .5) exists in the population, these specifications require 194 cases in each of the two samples. Similarly, they require that if population r = .30, a test of the significance of a sample r have 254 cases. For α₂ = .05 and .99 power, the N requirements are, respectively, 148 and 195.
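The sample sizes quoted above can be approximated with closed-form normal-theory formulas. This is a sketch under stated assumptions, not the tabled method of Note 1: it uses n = 2[(z_α + z_power) / d]² per group for two means and N = [(z_α + z_power) / atanh(r)]² + 3 for a correlation. The function names are hypothetical. Because the normal approximation ignores the small-sample behavior of t, the results for means tend to run a case or so below the quoted values.

```python
from math import atanh, ceil
from statistics import NormalDist

N01 = NormalDist()


def n_per_group_for_d(d, alpha2, power):
    """Approximate cases needed in each of two samples to detect d."""
    z_total = N01.inv_cdf(1 - alpha2 / 2) + N01.inv_cdf(power)
    return ceil(2 * (z_total / d) ** 2)


def n_for_r(rho, alpha2, power):
    """Approximate cases needed to detect a population correlation rho."""
    z_total = N01.inv_cdf(1 - alpha2 / 2) + N01.inv_cdf(power)
    return ceil((z_total / atanh(rho)) ** 2 + 3)


# The neophyte's specifications, then the relaxed alpha2 = .05 version.
for alpha2 in (0.01, 0.05):
    print(f"alpha2 = {alpha2:.2f}, power = .99: "
          f"n per group (d = .5) ~ {n_per_group_for_d(0.5, alpha2, 0.99)}, "
          f"N (r = .30) ~ {n_for_r(0.30, alpha2, 0.99)}")
```

Lowering the desired power from .99 toward a more modest value shrinks both requirements sharply, which is the practical point of the paragraph above.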
To determine the necessary sample size, one needs to posit the ES, α, and desired power. I have proposed as a convention that in the absence of any other basis for setting the value for desired power, .80 be used.¹ In scientific research, it is typically more serious to make a false positive claim (Type I error) than a false negative one (Type II error). Because the implicit convention for significance is α = .05, the use of the .80 convention for desired power (hence, β = .20) makes the Type II error 4 times as likely as the Type I error, an arbitrary but reasonable reflection of their relative importance.⁴

A useful aid in determining the necessary sample size is a sample size planning table. To prepare such a table, the investigator selects values or ranges of values for α, ES, and power and then determines the N for each combination. This table provides the basis for a judicious choice of specifications or leads to the useful discovery that the research as conceived is not viable.³

Recall the investigator pursuing the question of a sex difference in the incidence of dyslexia. If in a population of dyslexic children half are boys, there is no sex difference, so H₀ is P = .50. Departure from .50 would render H₀ false. The ES index for this test is g = P − .50, the departure of the proportion from one half. If the investigator's resources are such that she could obtain an N of 90 to 100, and her expectation is a value of g in the range .10 to 1/6, she might compile the sample size planning table shown in Table 1 by looking up various combinations of α₂ and g that would result in Ns within the desired range and noting the resulting power. From this table, she could choose a set of specifications.
Table 1. A sample size planning table

α₂     g      Power    N
.01    1/6    .75      92
.02    .15    .75      98
.02    1/6    .85      98
.05    .10    .50      96
.05    .15    .85      97
.10    .10    .60      90
.10    .15    .90      91
.10    1/6    .95      92
.20    .15    .95      90
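A planning table of this kind can be checked, or extended to other combinations, with a short calculation. The sketch below assumes a normal approximation to the two-sided test that P = .50 (the rejection boundary is set under H₀ and evaluated under H₁, and the negligible opposite tail is ignored); this is an assumption for illustration, not necessarily how the published table was built. The function name is hypothetical, and the rows are the α₂, g, and N values of Table 1.

```python
from math import sqrt
from statistics import NormalDist

N01 = NormalDist()


def power_proportion(g, n, alpha2):
    """Approximate power of the test that P = .50 when in fact P = .50 + g."""
    p = 0.5 + g
    z_crit = N01.inv_cdf(1 - alpha2 / 2)
    # Under H0 the sample proportion has SD .5/sqrt(n); under H1, sqrt(pq/n).
    z = (g * sqrt(n) - z_crit * 0.5) / sqrt(p * (1 - p))
    return N01.cdf(z)


# (alpha2, g, N) for each row of the planning table.
rows = [(0.01, 1 / 6, 92), (0.02, 0.15, 98), (0.02, 1 / 6, 98),
        (0.05, 0.10, 96), (0.05, 0.15, 97), (0.10, 0.10, 90),
        (0.10, 0.15, 91), (0.10, 1 / 6, 92), (0.20, 0.15, 90)]

for alpha2, g, n in rows:
    print(f"alpha2 = {alpha2:.2f}, g = {g:.3f}, N = {n}: "
          f"power ~ {power_proportion(g, n, alpha2):.2f}")
```

Laid out this way, the choice facing the investigator is visible at a glance: with N fixed near 90 to 100, adequate power requires either a larger posited g or a more lenient α₂.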
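DETERMINING POWER

There is a useful role for power analysis in assessing completed research, particularly research in which nonsignificant results were obtained. Given the N employed and α, one needs only to posit the population ES to determine power. The sample ES found, or one or more ES values posited by the assessor, may serve this purpose. It is a common finding that power was poor for plausible ESs, usually because of small N.

In 1962, I reviewed the articles in the 1960 volume of the Journal of Abnormal and Social Psychology from the perspective of power.⁵ I determined power for each statistical test in each article using the N employed at α₂ = .01, .05, and .10 for conventional definitions of small, medium, and large ES. I found, for example, that the median power to detect a medium ES at α₂ = .05 was .46. The many power surveys done in the biosocial sciences since that time have had similar results. For example, a similar review by Sedlmeier and Gigerenzer of the 1984 Journal of Abnormal Psychology⁶ found the median power under the same conditions to be a little worse (.44), and it was lower still (.37) when an experimentwise α criterion was employed. Even worse was the finding that in 11% of the studies, the H₀ was taken as the research hypothesis and nonsignificance taken as confirmation: The median power of these studies to detect a medium ES at α₂ = .05 was .25!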
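Assessing a completed study in this way takes only the N employed, α, and a posited ES. The sketch below does this for a comparison of two independent means, using a normal approximation to the t test (an assumption, so the values are approximate). The function name and the per-group sample sizes are hypothetical, chosen only to show how quickly power falls at small N.

```python
from math import sqrt
from statistics import NormalDist

N01 = NormalDist()


def power_two_means(d, n_per_group, alpha2=0.05):
    """Approximate power to detect d with two equal-sized independent samples."""
    shift = d * sqrt(n_per_group / 2)
    z_crit = N01.inv_cdf(1 - alpha2 / 2)
    return (1 - N01.cdf(z_crit - shift)) + N01.cdf(-z_crit - shift)


# Power against the conventional small, medium, and large values of d
# for three hypothetical completed studies.
for n_per_group in (20, 30, 64):
    powers = [power_two_means(d, n_per_group) for d in (0.20, 0.50, 0.80)]
    print(f"n per group = {n_per_group}: "
          + ", ".join(f"{p:.2f}" for p in powers))
```

A nonsignificant result from a study whose power against a plausible ES was well below .50 says little about whether the H₀ is true, which is why treating nonsignificance as confirmation is so hazardous.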
CONCLUSION

There has been no disagreement among research methodologists about the desirability of power analysis in research planning and assessment, yet progress in application of this method over the last quarter century has been slow. There have, however, been some rays of hope in the past few years. The popularity of meta-analysis has served to emphasize the size of effects and by thus raising the consciousness of behavioral scientists has promoted the cause of power analysis.³ More directly, both graduate and undergraduate statistics textbooks have begun to feature chapter-length treatments of power analysis.⁷ Finally, in addition to the reference works already noted,¹,⁴ there are available computer programs for power analysis and sample size determination.⁸

Acknowledgments. I am, as always, grateful to Patricia Cohen for her helpful comments.
Notes

1. J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. (Erlbaum, Hillsdale, NJ, 1988). This is the source of the system of power analysis described here; the power values and sample sizes of the illustrations derive from this book's tables.
2. J. Neyman and E.S. Pearson, On the use and interpretation of certain test criteria for purposes of statistical inference, Biometrika, 20A, 175-240, 263-294 (1928); J. Neyman and E.S. Pearson, On the problem of the most efficient tests of statistical hypotheses, Transactions of the Royal Society of London Series A, 231, 289-337 (1933).
3. J. Cohen, Things I have learned (so far), American Psychologist, 45, 1304-1312 (1990).
4. For an article-length treatment of sample size determination using the .80 convention and α = .01, .05, and .10, see J. Cohen, A power primer, Psychological Bulletin (in press). A useful alternative treatment is offered in H.C. Kraemer and S. Thiemann, How Many Subjects? Statistical Power Analysis in Research (Sage, Newbury Park, CA, 1987).
5. J. Cohen, The statistical power of abnormal-social psychological research: A review, Journal of Abnormal and Social Psychology, 65, 145-153 (1962).
6. P. Sedlmeier and G. Gigerenzer, Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309-316 (1989).
7. R. Rosenthal and R.L. Rosnow, Essentials of Behavioral Research: Methods and Data Analysis, 2nd ed. (McGraw Hill, New York, 1991); J. Welkowitz, R.B. Ewen, and J. Cohen, Introductory Statistics, 4th ed. (Harcourt Brace Jovanovich, San Diego, 1991).
8. M. Borenstein and J. Cohen, Statistical Power Analysis: A Computer Program (Erlbaum, Hillsdale, NJ, 1988); J. Hintze, Power Analysis and Sample Size (NCSS, Kaysville, UT, 1991). Some 13 programs are reviewed in R. Goldstein, Power and sample size via MS/PC-DOS computers, American Statistician, 43, 253-260 (1989).

Why Can Methods for Comparing Means Have Relatively Low Power, and What Can You Do to Correct the Problem?
Rand R. Wilcox

Rand R. Wilcox is Professor of Psychology at the University of Southern California. Address correspondence to Rand R. Wilcox, Department of Psychology, University of Southern California, Los Angeles, CA 90089-1061; e-mail: rwilcox@wilcox.usc.edu.

Certainly, one of the most common goals in applied research is comparing two or more groups in terms of some measure of location, that is, a quantity intended to represent the "typical" subject or object under study. Of course, the measure of location routinely used is the population mean, μ. If there is no difference between the distributions associated with two or more groups, standard methods for comparing means appear to provide good control over the probability of a Type I error (i.e., concluding the means are different when in fact they are equal). However, if the groups differ in some way, and in fact you should reject the hypothesis that the groups are the same in terms of some measure of location, then using standard methods for comparing means is one of the worst things you could possibly do. In fact, very slight departures from normality can have serious consequences.

In this article, I review the problem that arises in using conventional statistical methods to compare group means and then discuss some solutions. Standard nonparametric methods do not correct the problem, nor do some of the better known improvements for comparing means. There are, however, new methods that can help applied researchers.

WHY IS THERE A PROBLEM?

For comparing means, Student's t test is the most commonly used method, although some researchers might use Welch's¹ method instead. If you whisper "nonnormality," some researchers might respond by using the Mann-Whitney U test, but others might argue that Student's t test is robust to nonnormality. For many years within the statistical literature, it has been known that all three of these methods have serious practical problems, particularly in terms of power. Improved methods have now emerged and are ready to be used in applied work.

When choosing a procedure for comparing groups, it helps to keep three common goals in mind:

1. Control the probability of a Type I error when the distributions are identical.
2. Compute accurate confidence intervals for the difference between two measures of location when the distributions differ.
3. Achieve reasonably high power when the two groups differ in terms of some measure of location.

Goal 1 has received the most attention, especially within the social sciences. In this regard, Student's t test, and its extension to more than two groups, appears to perform very well.² For Goal 2, Student's t test appears to perform reasonably when equal sample sizes are used, but for unequal sample sizes, serious problems arise. In particular, Cressie and Whitford³ described general circumstances under which, no matter how
