Power
After studying the text and working the problems in this chapter, you should be able to:
Power is a probability – the probability of rejecting the null hypothesis when it is false. This
chapter explains the concept of statistical power for three kinds of t tests and for a Pearson
correlation coefficient, r. A test is powerful if it is likely to reject a false null hypothesis.
The textbook, Tales of Distributions, has a heavy emphasis on null hypothesis statistical
testing (NHST), which was introduced in Chapter 8, Hypothesis Testing and Effect Size:
One-Sample Designs. You may recall that to test a null hypothesis, you tentatively assume
that it is true and then, on the basis of sample data, conclude that the null hypothesis is false
or that you don’t have good evidence that it is false. After the statistical analysis is finished,
you interpret the results. From Chapter 8 onward, you worked with several different
statistical tests. Each was developed for a particular design (such as two groups) or for a
particular kind of dependent variable (such as ranks). In every case, however, these tests are
based on the logic of null hypothesis statistical testing.
For every NHST technique, there is a null hypothesis, which is symbolized H0. The form of
the null hypothesis in the textbook is a hypothesis of equivalence. An example you’ve seen
several times is H0: µ0 = µ1. Of course, the truth is that this null hypothesis may be true or
that it may be false. It is the job of the statistical test to provide information that allows you to
move in the direction of one of these conclusions.
A REVIEW OF POTENTIAL MISTAKES
Any conclusion, of course, could be mistaken. So far, you have concentrated on minimizing
a particular kind of mistake, the mistake of rejecting the null hypothesis when it is true. If
you reject a null hypothesis that is true, you make a Type I error. Although you cannot totally
eliminate the possibility of a Type I error, hypothesis testing allows you to control it with the
α level you adopt. If α = .05, you are assured that the probability of a Type I error is no
greater than .05.
As you no doubt recall, there is a second way you can make a mistake in a statistical analysis.
You make this mistake if you retain a false null hypothesis. Retaining the null hypothesis
when it is false is a Type II error. Table 17.1 illustrates the circumstances in which Type I
and Type II errors are possible. Note that a Type I error is possible only when the null
hypothesis is true and a Type II error is possible only when the null hypothesis is false.
Table 17.1  Type I and Type II errors

            H0 true            H0 false
Retain H0   Correct decision   Type II error
Reject H0   Type I error       Correct decision
The probability of a Type II error is symbolized with the Greek beta, β. To review the factors
that influence the value of β, see pages 218-219 in the textbook. Learning the material in this
chapter will improve your understanding of the factors that influence β. This is a chapter that
attends to the situation when the null hypothesis is false, which is the right hand column of
Table 17.1.
The expression, 1 − β, is the power of a statistical test. A test with power = .90 means that
if the null hypothesis is false (to the degree you think it is), you have a .90 probability of
rejecting the null hypothesis with the test.
Consider a statistical test that has a power of .25. Such a test has a .75 probability of failing
to detect a difference that is actually there. You can imagine that if you knew this before you
began your experiment, you might well take steps to increase the power. If you couldn’t
increase the power, you might decide to abandon the project and spend your time and effort
on something with a greater than a .25 chance of success.
ADEQUATE POWER
How much power is enough? This is another of those statistical questions for which there is a
commonly accepted conventional answer. The conventional answer is that adequate power is
.80.¹ Sometimes students react to this “convention” by wondering why smart people would
adopt a standard that allows them to miss detecting a difference once in every five
experiments (if their assumptions are correct).
The explanation for this (if you are such a wondering student) is that resources are limited and
adding power requires additional resources. In most situations, researchers can increase power
only by adding observations. In many cases, however, it is expensive to obtain observations,
and as power is increased beyond .80, the sample size requirements go up exponentially.
Thus, for most problems that have an effect size index that is small or medium, you cannot
afford to absolutely ensure that you will find the suspected effect because the resources
needed for additional observations just aren’t available.
The fact that researchers do not need great precision in power estimates is also helpful for those who are just
learning about power analysis because it allows textbook writers to use the normal curve to
explain power, even though the probabilities that result are only approximately correct. As a
result, a power analysis does not change for different sized samples; the normal curve fits
all.
Power is a concept that applies to every NHST statistic. This chapter, however, explains
only how to calculate the power of three kinds of t tests and tests of the significance of r.
Corresponding methods of calculating power exist for other statistics.
¹ You have already worked with one convention using the number .80. A d value of 0.80,
however, is quite different from the idea that power = .80. An effect size index of 0.80 means
you have a large effect size. The index, d, is an expression of a difference between means, per
standard deviation unit. Power, however, is expressed as a probability, the probability of
rejecting a false null hypothesis.
TWO DIFFERENT SITUATIONS AS EXAMPLES
Figure 17.1 and Figure 17.2 show two different situations in which the null hypothesis is
false. Figure 17.1 shows situation A. In Figure 17.1, the population on the left is the null
hypothesis population, but the sample is drawn from the population on the right. The sample
is from a population with a mean, µ1, which differs from the null hypothesis population by
an amount that produces a d value of 0.20.
[Figure: Situation A, d = 0.20, showing two overlapping populations with means µ0 and µ1]
Figure 17.1 Situation A in which the null hypothesis is false. The degree of falseness is
indicated by a d value of 0.20
[Figure: Situation B, d = 0.80, showing Population 0 (mean µ0) and Population 2 (mean µ2)]
Figure 17.2 Situation B in which the null hypothesis is false. The degree of falseness is
indicated by a d value of 0.80
Figure 17.2 (situation B) likewise has the null hypothesis population on the left. The
sample, however, comes from the population on the right. The difference between the two
population means produces a d value of 0.80. The two figures illustrate a small effect size (d
= 0.20 in Figure 17.1) and a large effect size (d = 0.80 in Figure 17.2).
Suppose a one-sample t test was calculated on a sample from population 1 in situation A and a
separate one-sample t test was calculated on a sample from population 2
in situation B. That is, in situation A, a sample from population 1 is tested against H0: µ0 =
µ1 and likewise, in situation B, a sample from population 2 is tested against H0: µ0 = µ2.
Which of the two t tests is more likely to reject the null hypothesis? Choose situation A or
situation B before going on. Answer: __________
A one-sample t test is more likely to reject the null hypothesis in situation B than in situation
A. However, is rejecting the null hypothesis the right decision in both cases? Or is rejecting
the null hypothesis an error? Answer: ___________
Rejecting the null hypothesis is the correct thing to do in both cases. Failure to reject
would be a Type II error. But what is the probability of rejecting the null hypothesis in
situation A? In situation B? For probabilities, you need a power analysis.
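One quick way to make those probabilities concrete is simulation: draw many samples from a population in which H0 is false, test each one, and count the rejections. The sketch below is mine, not the textbook's; it substitutes a z test (σ known) for the t test so that only the Python standard library is needed, and the seed and repetition count are arbitrary choices.

```python
# Monte Carlo estimate of power: sample repeatedly from a population whose
# true mean is d sigmas away from mu0, test each sample, count rejections.
import random
import math

random.seed(42)

def simulated_power(d, n, reps=20000, z_crit=1.96):
    """Proportion of samples whose two-tailed z test rejects H0 at alpha = .05."""
    rejections = 0
    for _ in range(reps):
        sample = [random.gauss(d, 1.0) for _ in range(n)]   # true mean = d, sigma = 1
        z = (sum(sample) / n) / (1.0 / math.sqrt(n))        # (xbar - mu0) / (sigma/sqrt(N))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / reps

power_a = simulated_power(d=0.20, n=9)   # situation A: roughly .10
power_b = simulated_power(d=0.80, n=9)   # situation B: roughly .67
```

The simulated proportions land close to the eyeball estimates developed in this chapter for situations A and B with N = 9.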
A graph of a sampling distribution of the mean shows the means of all the samples of a
particular size from a population. Thus, in Figure 17.3, the sampling distribution of the
mean for population 0 is the small curve on the left. The sampling distribution of the mean
for population 1 is the small curve on the right.
[Figure: Situation A, d = 0.20]
Figure 17.3 Populations from Figure 17.1 and sampling distributions of the mean for
N = 9. Critical region for α = .05 is shaded.
In Figure 17.3, the expected values of the mean (mean of the sampling distributions) are
µ0 and µ1, the same as the population means. Focus on the sampling distribution of the
mean for the null hypothesis population. The rejection region for a two-tailed test is
shaded. Estimate what proportion of the means from population 1 has values that are
shaded in the sampling distribution on the left. Answer: ______
This is a hard question. Here is more information to help you either answer or confirm your
answer. The sampling distribution of the null hypothesis is on the left; the shaded portions
show the rejection region. All sample means from population 1 with values that are the same
as values in the shaded regions of population 0 lead to a rejected null hypothesis. What
proportion of the means from population 1 are in the rejection regions? Make an eyeball
estimate (attending to both shaded regions). Answer: ______
To my eye, about 20-25 percent of the right side and one percent of the left side of the
sampling distribution of population 1 are in the rejection region of the null hypothesis
sampling distribution. Thus, an eyeball estimate is that power is .21 to .26 in the case where
d = 0.20 and N = 9.
PROBLEMS (Answers at the end of the chapter)
[Figure: Situation B, d = 0.80]
Figure 17.4 Populations from Figure 17.2 and sampling distributions of the mean for
N = 9. Critical region for α = .05 is shaded.
THE EFFECT OF α
Here’s the rule about α and power: If α is changed to larger values (e.g., from .01 to .05 to
.10), power increases. This makes sense; as you make it easier to reject the null hypothesis
by changing α, the more likely you are to reject H0. Being more likely to reject the null
hypothesis overall means that you are more likely to reject H0 when it is false. Thus, power
is increased because you are more likely to reject a false null hypothesis.
[Figure: Situation B, d = 0.80]
Figure 17.5 Enlarged versions of sampling distributions in Figure 17.4. Critical regions
for α levels of .10, .05, and .01 are shaded light, medium, and dark.
Figure 17.5 provides a picture of the effects of changing α. It is an enlarged version of the
sampling distributions in situation B in Figure 17.4. The rejection regions for α levels of .10,
.05, and .01 are shaded as light, medium, and dark, respectively. Look at Figure 17.5 and
estimate the amount of the sampling distribution from population 2 that is to the right of each
of the three alpha levels.
for α = .10: _____    for α = .05: _____    for α = .01: _____
To my eye these proportions are: for α = .10, about .80; for α = .05, about .70; and for
α = .01, about .50.
In situation A, power was approximately .11 (text example). In situation B power was
approximately .70 (problem 3). In both cases, sample size was 9. In summary, the greater
the difference between the populations you are sampling from, the greater the power of your
statistical test. This issue is sometimes referred to as the degree of falseness of H0.
THE EFFECT OF N
The rule for the effect of sample size (N) on power is that as N increases, power increases.
One way to understand this relationship more thoroughly is to examine the algebra in a one-
sample t test. Mentally work your way through the formula that follows, starting on the right
side and working back to the left. Imagine that N increases and see what happens to t.
Convince yourself that as N increases, power increases.
t = (X̄ − µ0)/s_X̄ = (X̄ − µ0)/(ŝ/√N)
The effect of increasing N is to make s_X̄ smaller and, of course, a smaller s_X̄ increases the
numerical value of t. The larger t is, the more likely the rejection of the null hypothesis.
Thus, power increases even though the difference between means stays the same.
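The algebra above can be seen in numbers. In this sketch the mean difference (5) and ŝ (15) are made-up illustration values that stay fixed while N grows; quadrupling N doubles t.

```python
import math

def one_sample_t(mean_diff, s_hat, n):
    """t = (xbar - mu0) / (s_hat / sqrt(N)); only N varies below."""
    return mean_diff / (s_hat / math.sqrt(n))

print(one_sample_t(5, 15, 9))    # 1.0
print(one_sample_t(5, 15, 36))   # 2.0
```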
[Figure: Situation B, d = 0.80]
Figure 17.6 Sampling distribution of the mean with N = 9 (upper panel) and sampling
distribution of the mean with N = 36 (lower panel). Situation B.
A second explanation of how N affects power is with pictures rather than algebra. The upper
panel of Figure 17.6 shows sampling distributions from Situation B for N = 9, which is a
repeat of Figure 17.5. The bottom panel of Figure 17.6 shows sampling distributions from
Situation B when N is increased to 36. In both panels, rejection regions are shaded (.05 level,
two-tailed test). When N = 9 (top panel), my eyeball estimate of power remains about .70.
Look at the lower panel sampling distributions for N = 36. What is your estimate of the power
in this situation? Answer: ______
My eyeball estimate of power in the bottom panel of Figure 17.6 is about .99 to
1.00. Thus, as sample size increases, so does power.
What if the test had been a one-tailed test with α still at .05? Looking at Figure 17.5, you
can see that the rejection region for a two-tailed test with α = .10 consists of .05 on the right
and .05 on the left. Thus, the rejection region for a one-tailed test with α = .05 is the entire
shaded area on the right. My estimate of the power in this
case is .80. Thus, a one-tailed test is more powerful than a two-tailed test. (Remember,
however, that a one-tailed test is completely incapable of detecting a difference in which the
ordinal position of the two means is the opposite of what you expect.)
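The eyeball estimates for situation B can be checked against the normal approximation this chapter relies on: power is the area of the alternative sampling distribution that falls beyond the critical value(s). This sketch is mine; it uses δ = d√N = 0.80√9 = 2.40 and builds the normal CDF from math.erf.

```python
import math

def phi(x):
    """Standard normal cumulative distribution function, via math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

delta = 0.80 * math.sqrt(9)   # situation B: d = 0.80, N = 9

# Two-tailed alpha = .05: critical z = 1.96 on each side.
two_tailed = (1 - phi(1.96 - delta)) + phi(-1.96 - delta)
# One-tailed alpha = .05: the whole rejection region on one side, z = 1.645.
one_tailed = 1 - phi(1.645 - delta)

print(round(two_tailed, 2), round(one_tailed, 2))   # 0.67 0.77
```

The computed values (about .67 two-tailed, .77 one-tailed) agree reasonably with the eyeball estimates of .70 and .80, and confirm that the one-tailed test is the more powerful of the two.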
PROBLEMS
1 From memory, list four factors that affect power.
2 Write a sentence about each of the four factors, explaining how power changes as that
factor is reduced or is changed.
The first step in a power analysis is to choose a value for the effect size index, d. There are three ways to arrive at a value:
• Previous research. Sometimes you can estimate d by using statistics from published
studies on similar topics.
• Practical considerations. In some cases, a researcher can identify a minimum
difference that would be worth finding out about. Oftentimes, these situations are ones in
which a researcher is doing applied research.
• Conventional values of small, medium, and large. As you may recall from Chapter 4
in Tales, Jacob Cohen proposed that d values of .20, .50, and .80 be labeled small, medium,
and large. If neither of the first two solutions works, the researcher might choose one of
these conventional values for d.
ESTIMATING POWER
The actual process of estimating power involves a formula and a table. Look at Table M now
(last page in this chapter). The figures in the body of the table are power probabilities. To
enter the table, select an α value (which will be for a two-tailed test). In addition, you will
need a value for δ (delta), which is explained below. At the intersection of α and δ, you have
the approximate power available for your particular analysis.
δ = d[f(N)]
The value of d comes from the first step in your power analysis. The expression f(N) is a
general expression. Its specific form depends on which statistical test you are
finding power for. For a specific statistical test, there is a specific formula for f(N). Once you
have a value for δ, enter Table M to find an estimate of the power available for your test.
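Entries like those in Table M can be approximated with the normal curve: for a two-tailed test, power is roughly the area of a normal distribution centered at δ that falls beyond ±z critical. This stdlib-only sketch is my own stand-in for the table lookup.

```python
import math

def phi(x):
    """Standard normal cumulative distribution function, via math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def approx_power(delta, z_crit=1.96):
    """Two-tailed power; z_crit = 1.96 corresponds to alpha = .05."""
    return (1 - phi(z_crit - delta)) + phi(-z_crit - delta)

print(round(approx_power(2.80), 2))   # about .80, the conventional target
print(round(approx_power(3.60), 2))   # about .95
```

The two printed values match the δ-to-power pairings used later in the chapter (δ = 2.80 for power .80, δ = 3.60 for power .95 at α = .05).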
POWER
For a one-sample t test, f(N) = √N. Thus, δ = d√N. For the tortilla chip sample (d = 2.63,
N = 8), δ = 2.63√8 = 2.63(2.83) = 7.44.
To find power, enter Table M with the δ value of 7.44. The largest value is 5.00, which
indicates power of .99 for an α level of .01. Thus, given the very large effect size of 2.63,
even a small sample of 8 was more than adequate to detect the fact that the Frito-Lay
company puts more in their tortilla chip packages than the advertised amount.
Her calculation, using σ = 15, gave an effect size index of d = 0.33.
From Table M, she discovers that, with α= .05 (two-tailed test), power is between .36 and .40.
Using linear interpolation, she concludes that the power for her experiment is about .38.
Power of .38 is not very encouraging. Her most practical solution is to increase sample size.
How much of an increase is needed? The next section shows how to determine N for a
specified amount of power.
DETERMINING N
Let’s suppose that our clinical psychologist has decided to determine the sample size
required for power = .80. Again, she lets d = 0.33. From Table M, she finds that power of
.80 corresponds to a δ value of 2.80. She looks at the basic formula for delta,

δ = d√N

and sees that she has all the elements except N. So, she solves for N:

N = (δ/d)² = (2.80/0.33)² ≈ 72
Perhaps, you might reason like this. Because chips are very cheap to produce and public
outcry for selling light is a big problem, a manufacturer will make sure that the customer gets
more than the package claims. Using this reasoning, you decide that the effect size is big.
Begin by assuming an effect size index of 1.00. Next, choose a large value for power, say
.95. With these decisions in hand, you can find the N needed for your experiment. Start
with the formula for δ for a one-sample t test.
δ = d√N
Once again, by rearranging this formula and solving for N, you get

N = (δ/d)²

The next step is to enter Table M. Choose an α value of .05 and then go down until
you find a power value of .95. The δ value is 3.60. Now you have the elements you need:

N = (δ/d)² = (3.60/1.00)² = 12.96 ≈ 13
Thus, the plan for the experiment calls for the purchase of 13 bags of tortilla chips with the
expectation that if the effect size index is 1.00, you would have a .95 probability
of detecting that Frito-Lay puts more chips or less chips in their bags. (Being able to detect
more or less comes from the fact that Table M gives values for a two-tailed test.)
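The rearranged formula is easy to automate. This helper is mine, not the text's; it rounds up, since you cannot buy a fraction of a bag.

```python
import math

def n_one_sample(delta, d):
    """One-sample t: delta = d*sqrt(N), so N = (delta/d)**2, rounded up."""
    return math.ceil((delta / d) ** 2)

print(n_one_sample(3.60, 1.00))   # 13 bags, matching the text
print(n_one_sample(2.80, 0.50))   # 32: medium effect at the conventional power .80
```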
POWER
In the case of the independent-samples t test,

f(N) = √(N/2)

The value for N depends on whether or not the sample sizes are equal. When both
samples are the same size, then N = N1 = N2. When N1 ≠ N2, the value for N is given
by the formula

N = 2(N1)(N2) / (N1 + N2)

Thus,

δ = d√(N/2)
What are the effects on children of spending their days in a child-care center? Sandra Scarr
investigated this question; in addition, she wrote a review article for the February 1998
American Psychologist. The problem that follows is based on material in her article.
Her analysis showed that the behavioral adjustment scores of children who spent 24 or more
months in a day-care center were not significantly different from those who had been cared
for at home. At this point, there are two possible interpretations. One is that the behavior of
day-care children (all of them) is not different from the behavior of children cared for at home
(all of them). The other possibility is that the sample data produced a Type II error. That is,
there is a difference in the populations but the sample data failed to detect it. A power
analysis helps you decide between these two alternatives.
From the data in Scarr’s report, a d value of 0.15 was calculated. The N in this study was
1100, with 550 in each group. Working from the formula that appeared previously,
δ = d√(N/2) = 0.15√(1100/2) = 0.15(23.45) = 3.52
From Table M, you find that the power available in this study was .94 for a two-tailed test
with α = .05. Thus, our analysis shows that a great deal of power was devoted to detecting
any difference that exists between maternal and non-maternal child care, but that no
significant difference was found. With all this power, a Type II error seems unlikely. In any
case, if there is a difference, the power analysis suggests that the effect size index is quite
small.
DETERMINING N
A common question among all researchers, from beginners to experts, is, “How many
participants will I need for my study?” Let’s eavesdrop on a conversation with a local
research methods guru. Put yourself in the scenario that follows.
You: How many participants do I need in my two-group study?
Me: I’d just run as many as I could, without beating myself to death. Just do a reasonable number.
You: Isn’t there some scientific way to decide?
Me: Sure, let’s do a power analysis and see what it turns up for you.
You: Do I need an Ouija Board?
Me: Nope, but a calculator would be handy. Have one in your backpack?
You: Sure, always!
Me: O.K., how much power do you want?
You: Well, .80, I guess. Isn’t that the right answer?
Me: It’s a good answer; we’ll work with that. How big of an effect size index are you expecting?
You: Medium, I guess.
Me: With equal numbers in each group, N1 = N2 = N. We’ll start with the formula for δ
for an independent-samples t test,

δ = d√(N/2)

O.K., we’ll plug in your medium effect size for d and the δ value for .80 power:

N = 2(δ/d)² = 2(2.80/0.50)² = 2(31.36) = 62.72, or about 63 participants in each group
You: Uh-Oh.
Me: Yeah, too bad, but better to know than to delude yourself. For your independent project,
it’s no deadly sin to have low power. The main purpose of an independent project is to get
you started doing research on your own.
You: Yeah, but…. Anyway, I’m still thinking …..
This example shows you a fact about independent-samples t tests: they are not very
powerful. That is, you need more than 100 participants to have enough power to detect a
medium effect size in four experiments out of five.
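The dialogue's arithmetic generalizes to a one-line helper of my own. It assumes equal group sizes, and the N it returns is per group, so the total number of participants is twice the result.

```python
import math

def n_per_group(delta, d):
    """Independent t: delta = d*sqrt(N/2), so N per group = 2*(delta/d)**2."""
    return math.ceil(2 * (delta / d) ** 2)

n = n_per_group(2.80, 0.50)   # medium effect, conventional power .80
print(n, 2 * n)               # 63 per group, 126 in all
```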
POWER
For a paired-samples t test, f(N) = √N. Thus,

δ = d√N
DETERMINING N
The formula for N for a paired-samples t test is the same as that for the one-sample t
test you learned earlier.
N = (δ/d)²
I’ll work just one problem here and see if it leads to any insight. What sample size is needed
to detect a medium effect size with a power of .80? By now, perhaps you can fill in the
formula from memory rather than by consulting tables? Here’s the formula:
N = (δ/d)² = (2.80/0.50)² = (5.60)² = 31.36, or about 32 pairs
POWER
For a Pearson r, f(N) = √(N − 1), where N is the number of pairs in the data.
The effect size index for a Pearson correlation coefficient is some symbol which hasn’t
been agreed upon by statisticians. For now, we’ll use r. (That is what Cohen uses.) Using r
as the symbol for an effect size index has the advantage of telling us it is associated with a
Pearson correlation coefficient. The disadvantage of using r is that it already has another
meaning – it is the symbol for a Pearson correlation coefficient.
δ = d[f(N)] = r√(N − 1)
How much power is available to detect a moderate correlation coefficient of .30 using 30
pairs of scores? A sample of 30 is a commonly recommended sample size.
δ = .30√(30 − 1) = .30(5.39) = 1.62
Consulting Table M in the α = .05 column, you find that power = .36 for δ = 1.62. Thus, for a
correlation study the commonly recommended sample size of 30 pairs is not a good
recommendation. A sample size of 30 pairs doesn’t provide enough power to detect a
moderate sized r. So, how about 100 pairs? Now,
δ = .30√(100 − 1) = .30(9.95) = 2.98
For δ = 2.98, power is about .85. Thus, for detecting a modest .30 correlation coefficient, a
better recommendation is a sample size of 100 pairs.
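The two calculations for 30 and 100 pairs fit into one small function. This sketch is mine; the normal approximation (phi built from math.erf) stands in for the Table M lookup, so its answers are close to, not identical with, the table's.

```python
import math

def phi(x):
    """Standard normal cumulative distribution function, via math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_for_r(r, n_pairs, z_crit=1.96):
    """delta = r*sqrt(N-1); two-tailed power, alpha = .05 by default."""
    delta = r * math.sqrt(n_pairs - 1)
    return (1 - phi(z_crit - delta)) + phi(-z_crit - delta)

print(power_for_r(.30, 30))    # close to the text's .36
print(power_for_r(.30, 100))   # close to the text's .85
```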
DETERMINING N
Solving the equation δ = d√(N − 1) for N,

N = (δ/d)² + 1
I’ll use the formula for N to determine a sample size that has a .95 probability of detecting
an r = .40. For this problem, δ = 3.60.
N = (δ/d)² + 1 = (3.60/0.40)² + 1 = 81 + 1 = 82 pairs
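This determination of N can also be sketched in code (the helper name is mine), rounding up since pairs come in whole numbers:

```python
import math

def n_pairs_for_r(delta, r):
    """delta = r*sqrt(N-1), so N = (delta/r)**2 + 1, rounded up."""
    return math.ceil((delta / r) ** 2 + 1)

print(n_pairs_for_r(3.60, 0.40))   # 82 pairs for a .95 chance of detecting r = .40
```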
Scientific experimenters use NHST statistics to reach conclusions about natural phenomena.
As also happens with other approaches, statistical conclusions can be wrong at times. One of
the beauties of statistics is that a power analysis allows you to calculate the probability of
being right, if your preliminary conclusions about nature are correct.
This chapter provides an introduction to the statistical power of NHST tests. Jacob Cohen’s
4-page, 1992 primer on power in Current Directions in Psychological Science is another
good introduction. Of course, power can be calculated for tests other than t-tests and
correlation coefficients. Howell (2002) shows power calculations for ANOVA and factorial
ANOVA. Cohen’s article shows power formulas for chi square.
PROBLEMS
6. Some years ago, the mean ACT score was about 23 for freshmen at fairly selective
liberal arts schools. Now it is about 27. The standard deviation on this national test is
6. Suppose some group wanted to “scientize” its claim that scores are significantly
higher now by running a t test on a random sample of 100 students from each of the
two eras. How much power would there be in such a test?
7. Your statistics textbook describes an early study on smoking and lung cancer in
which the consumption rate and cancer rate for 11 countries were correlated. How much
power was available to detect this relationship, if the relationship is r = .50?
8. Suppose that the population correlation coefficient for the relationship between
smoking and lung cancer is .75 (which it is). How much power is available to detect that
there is a relationship if the study is based on 11 countries?
9. A pharmaceutical lab had tons of data on the latency of rats’ jumping response to a
light that preceded a shock. The shock occurred unless the rat responded by jumping a
barrier. The latency to the light was 4.0 seconds with a standard deviation of 1.5 seconds.
The researchers tested a drug that was expected to cause time distortion. They decided that
any drug effect that changed latency less than one-half second was trivial. (This decision was
based on changes caused by other drugs.) What size sample is needed for their one-sample t
test, if they want to have power of .80?
10. Create a table. The first column lists the four statistical tests in this chapter. The next
three columns are labeled small, medium, and large. The spanner over these three columns is
effect size index. Conventional values go in these cells. (Use .10, .30, and .50 as small,
medium, and large effect size indexes for r.) The fifth column is labeled “formula for f (N)”;
the cells contain the formulas for each test. Columns 6-8 are labeled, small, medium, and
large. The spanner is “Sample size for power = .80.” Compute sample sizes appropriate for
these 12 cells.
6. d = (23 − 27)/6 = −0.667

δ = 0.667√(100/2) = 0.667(7.071) = 4.72

Power ≥ .99
7. δ = .50√(11 − 1) = .50(3.162) = 1.58

Power = .35
Interpretation: There is not much power to detect a correlation coefficient of .50 using just
11 pairs of data.
8. δ = .75√(11 − 1) = .75(3.162) = 2.37

Power = .66
Interpretation: Even for a correlation coefficient of .75, there is not much power to
detect it using just 11 pairs of data.
9. The minimum change in mean latency that was worth detecting was 0.5 seconds.
Thus, the minimum effect size index is

d = 0.5/1.5 = 0.33
Interpretation: The researchers need a gross of rats (144) to detect that the drug
distorts time perception to a degree that they consider worthwhile to know.
Table M  Approximate Power as a Function of Delta and Significance Level

               Alpha Level for a Two-Tailed Test
delta (δ)      .10      .05      .02      .01
10.
                 Effect size index           Sample size for power = .80
Test             Small   Med    Large   f(N)       Small    Med
One-sample t     .20     .50    .80     √N         196      31.36
Independent t    .20     .50    .80     √(N/2)     784      125.44