Power and Sample Size

Sampling and Sample Size
Rachel Glennerster
MIT
Sampling and Sample Size
Rachel Glennerster
MIT
Which of these most likely to describe
estimates from 8 well implemented RCTs?
I. II.
70%
A. I
B. II
C. Neither 15% 15%
A. B. C.
J - PAL | SAMPLING AND SAMPLE SIZE 4
Which is the best description of II?
I. II.
89%
A. Imprecise estimate
B. Biased estimate
C. Imprecise but unbiased
11%
0%
A. B. C.
Bias and precision
Precision (Sample Size)
estimates
truth
Less bias (Randomization)

Outline
• Introduction
• Hypothesis testing
• What influences power?
• Power in clustered designs
• Calculating power in practice

Outline
• Introduction

We evaluate bringing tutors into schools

Post-test: control & treatment
Comparison mean Treatment mean
The mean of treatment is 6ppt higher than mean of control

Is this impact statistically significant?
Average Difference = 6 points
80%
15%
A. Yes
5%
B. No
C. We cant tell
A. B. C.

Difference between the sample means
Estimated effect

What if we ran a second experiment?
Estimated effect

Many experiments: a distribution of estimates
100
90
80
70
60
Frequency
50
40
30
20
10
0
-3 -2 -1 0 1 2 3 4 5 6 7 8 9 10
Difference

100
90
80
70
60
Frequency
50
40
30
20
10
0
-3 -2 -1 0 1 2 3 4 5 6 7 8 9 10
Difference

100
90
80
70
60
Frequency
50
40
30
20
10
0
-3 -2 -1 0 1 2 3 4 5 6 7 8 9 10
Difference

100
90
80
70
60
Frequency
50
40
30
20
10
0
-3 -2 -1 0 1 2 3 4 5 6 7 8 9 10
Difference

100
90
80
70
60
Frequency
50
40
30
20
10
0
-3 -2 -1 0 1 2 3 4 5 6 7 8 9 10
Difference

100
90
80
70
60
Frequency
50
40
30
20
10
0
-3 -2 -1 0 1 2 3 4 5 6 7 8 9 10
Difference

100
90
80
70
60
Frequency
50
40
30
20
10
0
-3 -2 -1 0 1 2 3 4 5 6 7 8 9 10
Difference

Outline
• Introduction

Distribution of estimates if true effect=β
0.5
0.45 β
0.4
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6
Estimated effects normally distributed around effect β

Distribution of estimates if true effect=0
0.5
0.45 0
0.4
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6
Estimated effects normally distributed around effect 0

Two distributions under two hypotheses
0.5
0.45
True effect=H0
0.4
True effect=Hβ
0.35
H0 Hβ
0.3
0.25
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6
Probability of getting given estimate under two hypotheses

If we run one study, see one estimate β�
β�
0.5
0.45
0.4
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6
How do we know if our estimated effect is significant?

Did our estimate come from 𝐻𝐻β or 𝐻𝐻0 ?
0.5
0.45
0.4
True effect=H0
0.35
Hβ
True effect=Hβ
H0
0.3
P1
0.25
0.2
0.15
0.1
0.05
P2
0
-4 -3 -2 -1 0 1 2 3 4 5 6
Which is more likely?

Can we rule out that the estimate comes from H0?
Impose significance level of 5%
Critical value 0.5 Critical value
0.45
0.4
H0 Hβ
0.35 True effect=H0
0.3 True effect=Hβ
0.25
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6
Anything between lines cannot be distinguished from 0

Critical value
• Definition: The estimated effect size that exactly

corresponds to the significance level.
• If testing whether:
– The effect is bigger than 0
– Significant at 95% level
Then, critical value is the level of the estimate where exactly
5% of area under the curve lies to the right

Is β� significantly different from 0, at 5%?
β�
0.5
Critical value Critical value
0.45
0.4
H0 0.35
0.3
Hβ True effect=H0
True effect=Hβ
0.25
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6
How do we know?
Hypothesis Testing
• In criminal law, most institutions follow the rule: “innocent

until proven guilty”
• The presumption is that the accused is innocent and the
burden is on the prosecutor to show guilt
– The jury or judge starts with the “null hypothesis” that the
accused person is innocent
– The prosecutor has a hypothesis that the accused person is
guilty

Hypothesis Testing
• Usually in program evaluation, instead of “presumption

of innocence,” the rule is: “presumption of zero”
• The “Null hypothesis” (H0) is then that there was no (zero)
impact of the program
• The burden of proof is on showing there was an impact
• Note, there are exceptions:
– e.g. the null for a cash plus training program might be that
it’s the same as the training on its own

Hypothesis Testing: Conclusions
• If it is very unlikely (less than a 5% probability) that the

difference is solely due to chance:
– We “reject our null hypothesis”
• We may now say:
“our program has a statistically significant impact”
• What is statistically significant and what is most likely are
different concepts
– there may be cases where it’s more likely that the program
worked than that it didn’t, but we still say there is no
statistically significant impact

Significance and Type I errors
• Traditionally significance
level is set at 5%
• This means allowing a 5%
chance of experiencing
Type I errors
• 5% of time we will say
program had impact
when in fact it didn’t
Source: Effect Size FAQs blog

If true effect was β
Critical value 0.5

Critical value
0.45
0.4
H0 0.35
0.3
Hβ True effect=H0
True effect=Hβ
0.25
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6
How often would we get an estimated effect we can

distinguish from zero?
How often would we reject null, if Hβ true?
0.5
Critical value
True effect=H0
0.45
True effect=Hβ
0.4
Power
0.35
H0 0.3
0.25
Hβ
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6
Shaded area shows power = % of time we would find Hβ

different from 0 if true effect was β
The power to avoid Type II errors
• Statistical power is the probability that, if the true effect is

of a given size, our proposed experiment will be able to
distinguish the estimated effect from zero
• Power is the probability of
avoiding Type II errors
• Traditionally, we aim for 80%
power (some aim for 90%)
• Low power means we may not
find a significant effect even
though an effect exists
J - PAL | SAMPLING AND SAMPLE SIZE Source: Effect Size FAQs blog 38
Four results from hypothesis testing
Underlying truth
Effect No effect
Power: when is effect No error
prob, find significance
Significant
Statistical test
No error
Not significant
J - PAL | SAMPLING AND SAMPLE SIZE Source: Effect Size FAQs blog 39
Four results from hypothesis testing
The underlying truth:
TREATMENT EFFECT NO TREATMENT EFFECT

Statistical Test:
(H0 false) (H0 true)
False Positive
SIGNIFICANT True Positive
Probability = α
(Reject H0) Probability = 1-κ
Type I error
False Zero
NOT SIGNIFICANT True Zero
Probability = κ
(Fail to reject H0) Probability = (1-α)
Type II error

Outline
• Introduction

What influences power?
0.5
Critical value
True effect=H0
0.45
True effect=Hβ
0.4
Power
0.35
H0 0.3
0.25
Hβ
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6
Move overlap of cures, less power, what influences overlap

of these curves?
Power: main ingredients
1. Effect Size

Effect Size: 1*SE
• Hypothesized effect
0.5
size determines distance between
1 Standard
means 0.45 Error
0.4
True effect=H0
H0 0.35
0.3
Hβ True effect=Hβ
0.25
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6

Effect Size = 1*SE
0.5
0.45 True effect=H0
True effect=Hβ
0.4
Significance
H0 Hβ
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6

Power: 26%
If the true impact was 1*SE…
0.5
True effect=H0
0.45
True effect=Hβ
0.4
H0 Hβ
Power
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6
The Null Hypothesis would be rejected only 26% of the time

Effect Size: 3*SE
0.5
0.45 3*SE
0.4
True effect=H0
H0 0.35
0.3
Hβ True effect=Hβ
0.25
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6
Bigger hypothesized effect size  distributions farther apart

Effect size 3*SE: Power= 91%
0.5
True effect=H0
0.45
True effect=Hβ
0.4 Power
H0 0.35
0.3
Hβ
0.25
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6
Bigger Effect size means more power

Effect size and take-up
• Let’s say we believe the impact on our participants is “3”

• What happens if take up is 1/3?
• Let’s show this graphically

Effect Size: 3*SE
0.5
0.45
3*SE
0.4
H0 Hβ
True effect=H0
0.35
True effect=Hβ
0.3
0.25
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6
Let’s say we believe the impact on our participants is “3”

Take up is 33%. Effect size is 1/3rd
0.5
1 Standard
0.45
Error
0.4
True effect=H0
H0 Hβ
0.35
True effect=Hβ
0.3
0.25
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6

Back to: Power = 26%
0.5
True effect=H0
0.45
True effect=Hβ
0.4
H0 Hβ
Power
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6
Take-up is reflected in the effect size

1. Effect Size
2. Sample Size

By increasing sample size
you …
Power 91% 89%
-4 -3 -2 -1 0 1 2 3 4 5 6
A. Reduce bias
B. Increase precision
C. Both
D. Neither 11%
E. Don’t know 0% 0% 0%
A. B. C. D. E.
Increasing sample size
will …
Power 91%
83%
-4 -3 -2 -1 0 1 2 3 4 5 6
A. Move curves further

apart
B. Move curves closer
together
C. Make curves fatter 6% 6% 6%
D. Make curves narrower 0%
E. Don’t know
A. B. C. D. E.
Power: Effect size = 1 SE
Sample size = 1,000
0.5
0.45
H0 Hβ
True effect=H0
0.4
True effect=Hβ
0.35
Significance
0.3
0.25
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6

Power: Effect size= 1SE
Sample size = 4,000
0.5
0.45
0.4
H0 Hβ
True effect=H0
0.35
True effect=Hβ
0.3
Significance
0.25
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6

Power: 64%
0.5
0.45
0.4
H0 Hβ
True effect=H0
0.35
True effect=Hβ
0.3
Power
0.25
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6

Power: Sample size = 9,000
0.5
0.45
0.4
H0 Hβ
True effect=H0
0.35
True effect=Hβ
0.3
Significance
0.25
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6

Power: 91%
0.5
0.45
0.4
H0 Hβ
True effect=H0
0.35
True effect=Hβ
0.3
Power
0.25
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6

1. Effect Size
2. Sample Size
3. Variance

What will increased variation in the underlying
population do to our estimates?
A. Increase risk of bias?
B. Reduce risk of bias
72%
C. Increase precision of
estimate
D. Reduce precision of
estimate
E. Will not change
estimates
11%
6% 6% 6%
A. B. C. D. E.
What does increased variation in population
do to our distribution of estimates curves?
A. Move them further

apart? 89%
B. Move them closer

C. Make them fatter
D. Make them thinner
E. Don’t know
6% 6%
0% 0%
A. B. C. D. E.
Low variance sample
0.5
0.45
0.4
H0 Hβ
True effect=H0
0.35
True effect=Hβ
0.3
Significance
0.25
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6
Estimates will be more tightly clustered

Low variance sample
0.5
0.45
0.4
H0 Hβ
True effect=H0
0.35
True effect=Hβ
0.3
Power
0.25
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6
Estimates will be more tightly clustered  Higher power

Higher variance sample 0.5
0.45
True effect=H0
0.4
True effect=Hβ
0.35
Significance
H0 0.3
0.25
Hβ
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6
Estimates will be more dispersed

Variance and power: intuition
• Our program seeks to increase child height
• There is a lot of variation in height in the population
• At endline children in treatment are taller than in control
• Is this because we happened to start with taller children?
or because the program worked?
• We need a big sample to sort this out
Population
Treatment Control

Variance and power: intuition
• If everyone in the underlying population was of similar

height at the start, it would be easy to sort this out
– we would need a smaller sample (or have more power
with a given sample)
– the variation we see at the end between treatment and
control must be due to the program
Population
Treatment Control

1. Effect Size
2. Sample Size
3. Variance
4. Proportion of sample in T vs. C

Sample split: 50% C, 50% T
0.5
0.45
H0 Hβ
0.4
0.35 True effect=H0
0.3 True effect=Hβ
Significance
0.25
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6
Equal split gives distributions that are the same “fatness”

Power: 91%
0.5
0.45
H0 0.4
0.35
Hβ
0.3 True effect=H0
True effect=Hβ
0.25
Power
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6

If it’s not 50-50 split?
• What happens to the relative fatness if the split is not 50-

50.
• Say 25-75?

Sample split: 25% C, 75% T
0.5
0.45
0.4
H0 Hβ
0.35
0.3
True effect=H0
0.25
True effect=Hβ
0.2
Significance
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6
Uneven distributions, not efficient, i.e. less power

Power: 83%
0.5
0.45
H0 Hβ
0.4
0.35
0.3 True effect=H0
True effect=Hβ
0.25
Power
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4 5 6

Allocation ratio
• Definition: the fraction of the total sample allocated to

the treatment group is the allocation ratio
• Usually, for a given sample size, power is maximized
when half sample allocated to treatment, half to control
• Diminishing marginal benefit to precision from adding
sample, so best to add equally

Allocation to T v C
σ2 σ2
sd ( X 1 − X 2 ) = +
n1 n2
1 1 2
sd ( X 1 − X 2 ) = + = =1
2 2 2
1 1 4
sd ( X 1 − X 2 ) = + = = 1.15
3 1 3
Power equation: MDE
Significance Variance
Level
Effect Size Power
σ
EffectSize = (t(1−κ ) + tα )*
2
1
*
P(1 − P ) N
Proportion in
Treatment Sample
Size
79
J - PAL | SAMPLING AND SAMPLE SIZE
Outline
• Introduction
– Effect size
– Sample size
– Variance
– Proportion of sample in treatment vs. control

Randomize individuals to T or C

Randomize individuals to T or C

Or randomize clusters: e.g. classes

Or randomize clusters: e.g. classes

Compared to an individual level randomized
design, to achieve the same power, a
clustered level RCT is likely to require..
A. A smaller sample size 80%
B. A bigger sample size

C. The same sample size
D. Don’t know
15%
5%
0%
A. B. C. D.

Clustered design: intuition
• You want to know how close the upcoming national

elections will be
• Method 1: Randomly select 50 people from entire US

population
• Method 2: Randomly select 10 families, and ask five

members of each family their opinion

Low intra-cluster correlation (ICC) or ρ (rho)
Control
Population
Treatment
HIGH intra-cluster correlation (ρ)
Control
Population
Treatment
Intra-cluster correlation definition
• Total variance can be divided into within cluster

variance (𝜏𝜏 2 ) and between cluster variance (σ2 )
• When variance within clusters is small the within cluster
correlation is high ie (ICC) is high (previous slide)
• Definition of ICC: the proportion of total variation
explained by within cluster level variance
– Note, when within cluster variance is high, within cluster
correlation is low and between cluster correlation is high
σ2
• 𝑖𝑖𝑖𝑖𝑖𝑖 = 𝜌𝜌 =
𝜎𝜎2 +𝜏𝜏2

How does ICC impact power?
• For a given N we have less power when we randomize

by cluster (unless ICC is zero)
• There are diminishing returns to surveying more people
per cluster
• Usually the number of clusters is the key determinant of
power, not the number of people per cluster

All uneducated people live in one village.
People with only primary education live in
another. College grads live in a third, etc.
ICC (ρ) on education will be..
A. High
83%
B. Low
C. No effect on rho
D. Don’t know
17%
0% 0%
A. B. C. D.

If ICC (ρ) is high, what is a more efficient
way of increasing power?
A. Include more clusters in
the sample 47%
B. Interview more people in
each cluster
C. Both
D. Don’t know 26%
16%
11%
A. B. C. D.
Power with clustering
Significance
Effect Size Variance
Level
Power
σ
= (t(1−κ ) + tα )*
2
EffectSize 1
*
1 + ρ (m − 1) P(1 − P ) N
Proportion in
Average Treatment Sample
ICC Size
J - PAL |
Cluster Size
SAMPLING AND SAMPLE SIZE
94

Power and Sample Size

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Power and Sample Size

Uploaded by

Copyright:

Available Formats

Sampling and Sample Size

Precision (Sample Size)

Less bias (Randomization)

J - PAL | SAMPLING AND SAMPLE SIZE 6

J - PAL | SAMPLING AND SAMPLE SIZE 7

J - PAL | SAMPLING AND SAMPLE SIZE 8

J - PAL | SAMPLING AND SAMPLE SIZE 10

Comparison mean Treatment mean

The mean of treatment is 6ppt higher than mean of control

J - PAL | SAMPLING AND SAMPLE SIZE 12

Comparison mean Treatment mean

J - PAL | SAMPLING AND SAMPLE SIZE 13

Comparison mean Treatment mean

J - PAL | SAMPLING AND SAMPLE SIZE 14

J - PAL | SAMPLING AND SAMPLE SIZE 15

J - PAL | SAMPLING AND SAMPLE SIZE 16

J - PAL | SAMPLING AND SAMPLE SIZE 17

J - PAL | SAMPLING AND SAMPLE SIZE 18

J - PAL | SAMPLING AND SAMPLE SIZE 19

J - PAL | SAMPLING AND SAMPLE SIZE 20

J - PAL | SAMPLING AND SAMPLE SIZE 21

J - PAL | SAMPLING AND SAMPLE SIZE 22

Estimated effects normally distributed around effect β

Estimated effects normally distributed around effect 0

Probability of getting given estimate under two hypotheses

How do we know if our estimated effect is significant?

Which is more likely?

0.3 True effect=Hβ

Anything between lines cannot be distinguished from 0

• Definition: The estimated effect size that exactly

J - PAL | SAMPLING AND SAMPLE SIZE 30

• In criminal law, most institutions follow the rule: “innocent

J - PAL | SAMPLING AND SAMPLE SIZE 32

• Usually in program evaluation, instead of “presumption

J - PAL | SAMPLING AND SAMPLE SIZE 33

• If it is very unlikely (less than a 5% probability) that the

J - PAL | SAMPLING AND SAMPLE SIZE 34

Source: Effect Size FAQs blog

J - PAL | SAMPLING AND SAMPLE SIZE 35

Critical value 0.5

How often would we get an estimated effect we can

Shaded area shows power = % of time we would find Hβ

• Statistical power is the probability that, if the true effect is

The underlying truth:

TREATMENT EFFECT NO TREATMENT EFFECT

J - PAL | SAMPLING AND SAMPLE SIZE 40

J - PAL | SAMPLING AND SAMPLE SIZE 41

Move overlap of cures, less power, what influences overlap

J - PAL | SAMPLING AND SAMPLE SIZE 44

J - PAL | SAMPLING AND SAMPLE SIZE 45

0.45 True effect=H0

J - PAL | SAMPLING AND SAMPLE SIZE 46

The Null Hypothesis would be rejected only 26% of the time

Bigger hypothesized effect size  distributions farther apart

Bigger Effect size means more power

• Let’s say we believe the impact on our participants is “3”

J - PAL | SAMPLING AND SAMPLE SIZE 50

Let’s say we believe the impact on our participants is “3”

J - PAL | SAMPLING AND SAMPLE SIZE 52

Take-up is reflected in the effect size

J - PAL | SAMPLING AND SAMPLE SIZE 54

Power 91% 89%

A. Move curves further

J - PAL | SAMPLING AND SAMPLE SIZE 57