
Statistical Power Analysis

Author(s): Jacob Cohen
Source: Current Directions in Psychological Science, Vol. 1, No. 3 (Jun., 1992), pp. 98-101
Published by: Sage Publications, Inc. on behalf of Association for Psychological Science
Stable URL: http://www.jstor.org/stable/20182143

98 VOLUME 1, NUMBER 3, JUNE 1992



Statistical Power Analysis

Jacob Cohen

The power of a statistical test of a null hypothesis (H0) is the probability that the H0 will be rejected when it is false, that is, the probability of obtaining a statistically significant result. Statistical power depends on the significance criterion (α), the sample size (N), and the population effect size (ES).

The importance of power analysis arises from the fact that most empirical research in the social and behavioral sciences proceeds by formulating and testing H0s that the investigators hope to reject as a means of establishing facts about the phenomena under study.

A typical H0 is that a population product-moment correlation, r, is zero, to be tested at the two-sided .05 level (α₂ = .05). When this H0 is tested on a sample of N cases randomly drawn from a population in which r indeed does equal zero, researchers risk mistakenly rejecting the H0 when it is true, a Type I error, whose rate (.05) is controlled by the α criterion. They also risk mistakenly accepting the H0 as tenable when it is false, a Type II error, whose probability is called β. Power is thus 1 - β, the probability of not accepting the H0 when it is false, that is, the probability of successfully rejecting the H0.

The outcome of a statistical test depends on the degree to which the H0 is false, that is, on the magnitude of the population ES, which in this case is the absolute size of the population r: the larger the r, the greater the likelihood that the H0 will be rejected. It is also true that the outcome depends on N, a larger sample being more likely to result in rejection of a false H0 than a smaller one. Thus, at α₂ = .05, for example, if the population r is .30, when N is 40, the power of the standard t test of a sample r turns out to equal .48, whereas when N is 80, power is .78. If the population r is .40, when N is 40, power is .74, but when N is 80, power is .96. Finally, the test outcome depends also on α, the risk of a Type I error. A smaller and therefore more stringent α criterion, say, α₂ = .01, for any given population r and N, would result in smaller power. For example, with population r = .30 and N = 80, power at α₂ = .05 is .78, while power at α₂ = .01 is only .56.¹ Note also that at any given value of α, a two-sided test is more stringent than a one-sided test.

Statistical power analysis exploits the mathematical relationship among these four variables in statistical inference: power, α, N, and ES. The relationship is such that when any three of them are fixed, the fourth is determined. Two forms of power analysis are most useful: One is the determination of the N that is necessary to attain a specified degree of power to detect as significant (at specified α) a hypothesized ES. This form of power analysis is used in research planning. The second is the determination of power to detect a hypothesized ES (for specified N and α), the form used in meta-analytic power reviews of research areas or journals.

Jacob Cohen, Professor of Psychology at New York University, is the author of Statistical Power Analysis for the Behavioral Sciences (2nd ed., 1988) and co-author with Patricia Cohen of Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences (2nd ed., 1983), both published by Lawrence Erlbaum Associates. Address correspondence to Jacob Cohen, Department of Psychology, New York University, 6 Washington Place, 5th Floor, New York, NY 10003.

Published by Cambridge University Press

EFFECT SIZE

I noted that in testing a sample r, the ES is simply the population r. More generally, in the Neyman-Pearson system of statistical induction,² whence the concept of power is derived, the ES is the discrepancy between the null hypothesis, H0, and the alternate hypothesis of interest, H1. For testing a sample r, the H0 is that the population r is zero, and the H1 posits a specific nonzero value, for example, .30. Thus, the ES in this example is simply the difference: .30 - .00. Every statistical test has its own ES index, a continuous value that runs from zero,
CURRENT DIRECTIONS IN PSYCHOLOGICAL SCIENCE 99
when the H0 is true. Each ES index is a pure (i.e., scale-free) value that measures, in terms appropriate to it, the discrepancy between the H0 and the H1. For example, the ES index for the difference between independent means in the classical t test is d, the difference between the population means standardized by dividing this difference by the common within-population standard deviation. (The difference is absolute for two-sided tests and is either positive or negative for one-sided tests.) The standardization results in a scale-free measure: d = .25 implies a quarter of a standard deviation difference between the population means, free of the units of measure of the variable in question, whether they are inches, centimeters, or points scored on a psychological test.

As another example, for testing the departure of a population proportion (P) from .50, the ES index is g = P - .50. If an investigator believes that there is a sex difference in the incidence of dyslexia such that boys are at different risk from girls, in a sample of dyslexic children, she would posit as the H0 that half the sample are of one sex, and as the H1 that a specified different proportion, say, .60, are of the other. The ES index would then be g = .60 - .50 = .10. Still another example is the analysis of variance test that a set of population means are all equal. The ES index for this test, f, is the standard deviation of these means divided by the common within-population standard deviation of the observations.¹

Investigators in the social sciences find specifying the ES the most difficult aspect of power analysis. This is at least partly due to the relatively low level of consciousness about magnitudes in those disciplines. The conquest of psychological science by Fisherian null hypothesis testing (where the alternative to the H0 is simply its negation, so that no H1 is specified) has had the unfortunate effect of emphasizing the magnitudes of p values from significance tests rather than the magnitudes of the psychological phenomena under study.³ A salutary side effect of the study of power analysis is its emphasis on ES. Neither power nor sample size can be determined in the absence of the investigator's readiness to consider just how wrong the null hypothesis is likely to be (i.e., the ES). The decision as to what population ES to posit arises from the investigator's knowledge of the field: the sample ESs found in previous investigations with similar variables, the results of pilot studies (though not reliable when based on small samples), and his or her educated intuition.

Because the ES indices are not generally familiar, I have proposed as conventions, or operational definitions, "small," "medium," and "large" values of each ES index to provide the user with some sense of its scale.¹ It was my intent that medium ES represent an effect of a size likely to be apparent to the naked eye of a careful observer, that small ES be noticeably smaller yet not trivial, and that large ES be the same distance above medium as small is below it. I also made an effort to make these conventions comparable across different statistical tests.

For example, for the test that r = 0, small, medium, and large ESs are, respectively, the population rs .10, .30, and .50. For the test that two population means are equal, the same ESs, in order, are d = .20, .50, and .80. The .20 ES is exemplified by the mean IQ difference between twins and nontwins (the latter being larger), the .50 ES by the mean IQ difference between clerical and semiskilled workers, and the .80 ES by the mean IQ difference between Ph.D.s and college freshmen. In the analysis of variance test of the H0 that populations have equal means, the f index (the standardized standard deviation of the means) is, respectively, .10, .25, and .40 for small, medium, and large ESs.¹

Another means of facilitating the understanding of the various ES indices is by transforming them into other measures. For example, many of the ES indices (e.g., d, f, and the ES indices for the difference between proportions and for the degree of association in contingency tables of frequencies) may be translated into correlation coefficients or their squares, which may then be interpreted as proportions of variance. As another example, d may be expressed as various proportions of (non)overlap between normal distributions.¹

α, THE SIGNIFICANCE CRITERION

The probability of mistakenly rejecting the H0, α, represents a research policy: the maximum risk one is prepared to take of making this error. It has become conventional that unless otherwise stated, this risk is set at .05. Smaller and thus more stringent values may be used, for example, when several H0s are to be tested in order to minimize the risk of making any Type I errors in an investigation (the experimentwise risk). Larger values may be used in exploratory studies. Also, for tests whose ESs may be either positive or negative, α may be defined as two-sided or one-sided. The latter has more power than the former when the sample effect is in the direction posited, but has zero power when the effect is in the opposite direction because the one-sided test logically precludes a contrary finding.

Copyright © 1992 American Psychological Society

DETERMINING SAMPLE SIZE

In planning research, deciding the sample sizes is crucial. Because


research costs are at least approximately linear in the number of subjects, cost-effectiveness demands that this decision be appropriate. When asked in connection with a particular investigation what α and power are desired, a neophyte researcher might suggest α₂ = .01 and some very large value for power, say, .99. Power analysis quickly determines that these specifications necessitate a sample size that is likely beyond the available resources. For example, for a test of the difference between means, if a medium ES (d = .5) exists in the population, these specifications require 194 cases in each of the two samples. Similarly, they require that if population r = .30, a test of the significance of a sample r have 254 cases. For α₂ = .05 and .99 power, the N requirements are, respectively, 148 and 195.

To determine the necessary sample size, one needs to posit the ES, α, and desired power. I have proposed as a convention that in the absence of any other basis for setting the value for desired power, .80 be used.¹ In scientific research, it is typically more serious to make a false positive claim (Type I error) than a false negative one (Type II error). Because the implicit convention for significance is α = .05, the use of the .80 convention for desired power (hence, β = .20) makes the Type II error 4 times as likely as the Type I error, an arbitrary but reasonable reflection of their relative importance.⁴

A useful aid in determining the necessary sample size is a sample size planning table. To prepare such a table, the investigator selects values or ranges of values for α, ES, and power and then determines the N for each combination. This table provides the basis for a judicious choice of specifications or leads to the useful discovery that the research as conceived is not viable.³

Recall the investigator pursuing the question of a sex difference in the incidence of dyslexia. If in a population of dyslexic children half are boys, there is no sex difference, so H0 is P = .50. Departure from .50 would render H0 false. The ES index for this test is g = P - .50, the departure of the proportion from one half. If the investigator's resources are such that she could obtain an N of 90 to 100, and her expectation is a value of g in the range .10 to 1/6, she might compile the sample size planning table shown in Table 1 by looking up various combinations of α₂ and g that would result in Ns within the desired range and noting the resulting power. From this table, she could choose a set of specifications.

Table 1. A sample size planning table

α₂     g      Power    N
.01    1/6    .75      92
.02    .15    .75      98
.02    1/6    .85      98
.05    .10    .50      96
.05    .15    .85      97
.10    .10    .60      90
.10    .15    .90      91
.10    1/6    .95      92
.20    .15    .95      90

DETERMINING POWER

There is a useful role for power analysis in assessing completed research, particularly research in which nonsignificant results were obtained. Given the N employed and α, one needs only to posit the population ES to determine power. The sample ES found, or one or more ES values posited by the assessor, may serve this purpose. It is a common finding that power was poor for plausible ESs, usually because of small N.

In 1962, I reviewed the articles in the 1960 volume of the Journal of Abnormal and Social Psychology from the perspective of power.⁵ I determined power for each statistical test in each article using the N employed at α₂ = .01, .05, and .10 for conventional definitions of small, medium, and large ES. I found, for example, that the median power to detect a medium ES at α₂ = .05 was .46. The many power surveys done in the biosocial sciences since that time have had similar results. For example, a similar review by Sedlmeier and Gigerenzer of the 1984 Journal of Abnormal Psychology⁶ found the median power under the same conditions to be a little worse (.44), and it was lower still (.37) when an experimentwise α criterion was employed. Even worse was the finding that in 11% of the studies, the H0 was taken as the research hypothesis and nonsignificance taken as confirmation: The median power of these studies to detect a medium ES at α₂ = .05 was .25!

CONCLUSION

There has been no disagreement among research methodologists about the desirability of power analysis in research planning and assessment, yet progress in application of this method over the last quarter century has been slow. There have, however, been some rays of hope in the past few years. The popularity of meta-analysis has served to emphasize the size of effects and by thus raising the consciousness of behavioral scientists has promoted the cause of power analysis.³ More directly, both graduate and undergraduate statistics textbooks have begun to feature chapter-length treatments of power analysis.⁷ Finally, in addition to the reference works already noted,¹,⁴ there are available computer programs for power analysis and sample size determination.⁸
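
As a rough illustration of how such a program relates power, α, N, and ES, the following Python sketch handles the test that a population correlation is zero. It is not the procedure behind the published tables: it assumes the Fisher z normal approximation, and the function names `power_r` and `n_for_r` are hypothetical. Under that assumption its results should land close to the figures quoted in the article for r = .30 and r = .40, though small discrepancies in the second decimal place are to be expected.

```python
from math import atanh, ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def power_r(rho, n, alpha=0.05):
    """Approximate two-sided power of the test that a population r is zero.

    Assumption: Fisher's z transform of the sample r is treated as normal
    with mean atanh(rho) and standard error 1 / sqrt(n - 3).
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = atanh(rho) * sqrt(n - 3)
    # Probability of landing in either rejection region when rho is true.
    upper = 1 - _STD_NORMAL.cdf(z_crit - shift)
    lower = _STD_NORMAL.cdf(-z_crit - shift)
    return upper + lower


def n_for_r(rho, alpha=0.05, power=0.80):
    """Approximate N needed to reach the desired power (two-sided alpha).

    Assumption: the negligible rejection probability in the wrong tail
    is ignored, which is the usual simplification for planning.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(((z_crit + z_power) / atanh(rho)) ** 2 + 3)


if __name__ == "__main__":
    for rho, n in [(0.30, 40), (0.30, 80), (0.40, 40), (0.40, 80)]:
        print(f"r = {rho:.2f}, N = {n}: power ~ {power_r(rho, n):.2f}")
    print("N for r = .30, alpha = .01, power = .99:", n_for_r(0.30, 0.01, 0.99))
```

Fixing any three of the four quantities and solving for the fourth, as the two functions do for power and for N, mirrors the two forms of power analysis the article describes as most useful.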


Acknowledgments. I am, as always, grateful to Patricia Cohen for her helpful comments.

Notes

1. J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. (Erlbaum, Hillsdale, NJ, 1988). This is the source of the system of power analysis described here; the power values and sample sizes of the illustrations derive from this book's tables.
2. J. Neyman and E.S. Pearson, On the use and interpretation of certain test criteria for purposes of statistical inference, Biometrika, 20A, 175-240, 263-294 (1928); J. Neyman and E.S. Pearson, On the problem of the most efficient tests of statistical hypotheses, Transactions of the Royal Society of London Series A, 231, 289-337 (1933).
3. J. Cohen, Things I have learned (so far), American Psychologist, 45, 1304-1312 (1990).
4. For an article-length treatment of sample size determination using the .80 convention and α = .01, .05, and .10, see J. Cohen, A power primer, Psychological Bulletin (in press). A useful alternative treatment is offered in H.C. Kraemer and S. Thiemann, How Many Subjects? Statistical Power Analysis in Research (Sage, Newbury Park, CA, 1987).
5. J. Cohen, The statistical power of abnormal-social psychological research: A review, Journal of Abnormal and Social Psychology, 65, 145-153 (1962).
6. P. Sedlmeier and G. Gigerenzer, Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309-316 (1989).
7. R. Rosenthal and R.L. Rosnow, Essentials of Behavioral Research: Methods and Data Analysis, 2nd ed. (McGraw Hill, New York, 1991); J. Welkowitz, R.B. Ewen, and J. Cohen, Introductory Statistics, 4th ed. (Harcourt Brace Jovanovich, San Diego, 1991).
8. M. Borenstein and J. Cohen, Statistical Power Analysis: A Computer Program (Erlbaum, Hillsdale, NJ, 1988); J. Hintze, Power Analysis and Sample Size (NCSS, Kaysville, UT, 1991). Some 13 programs are reviewed in R. Goldstein, Power and sample size via MS/PC-DOS computers, American Statistician, 43, 253-260 (1989).

Why Can Methods for Comparing Means Have Relatively Low Power, and What Can You Do to Correct the Problem?

Rand R. Wilcox

Certainly, one of the most common goals in applied research is comparing two or more groups in terms of some measure of location, that is, a quantity intended to represent the "typical" subject or object under study. Of course, the measure of location routinely used is the population mean, μ. If there is no difference between the distributions associated with two or more groups, standard methods for comparing means appear to provide good control over the probability of a Type I error (i.e., concluding the means are different when in fact they are equal). However, if the groups differ in some way, and in fact you should reject the hypothesis that the groups are the same in terms of some measure of location, then using standard methods for comparing means is one of the worst things you could possibly do. In fact, very slight departures from normality can have serious consequences.

In this article, I review the problem that arises in using conventional statistical methods to compare group means and then discuss some solutions. Standard nonparametric methods do not correct the problem, nor do some of the better known improvements for comparing means. There are, however, new methods that can help applied researchers.

Rand R. Wilcox is Professor of Psychology at the University of Southern California. Address correspondence to Rand R. Wilcox, Department of Psychology, University of Southern California, Los Angeles, CA 90089-1061; e-mail: rwilcox@wilcox.usc.edu.

WHY IS THERE A PROBLEM?

For comparing means, Student's t test is the most commonly used method, although some researchers might use Welch's¹ method instead. If you whisper "nonnormality," some researchers might respond by using the Mann-Whitney U test, but others might argue that Student's t test is robust to nonnormality. For many years within the statistical literature, it has been known that all three of these methods have serious practical problems, particularly in terms of power. Improved methods have now emerged and are ready to be used in applied work.

When choosing a procedure for comparing groups, it helps to keep three common goals in mind:
1. Control the probability of a Type I error when the distributions are identical.
2. Compute accurate confidence intervals for the difference between two measures of location when the distributions differ.
3. Achieve reasonably high power when the two groups differ in terms of some measure of location.

Goal 1 has received the most attention, especially within the social sciences. In this regard, Student's t test, and its extension to more than two groups, appears to perform very well.² For Goal 2, Student's t test appears to perform reasonably when equal sample sizes are used, but for unequal sample sizes, serious problems arise. In particular, Cressie and Whitford³ described general circumstances under which, no matter how
