
Why randomize?

Ben Olken
Harvard University and J-PAL

www.povertyactionlab.org
Agenda

I. The program evaluation problem
II. An example: iron supplements in Central Java
III. Randomized experiments
IV. Advantages and limitations of experiments
V. How wrong can you go: the “Vote 2002” campaign
VI. Conclusions
What is Program or Impact Evaluation?

• Program evaluation is a set of techniques used to determine if a treatment or intervention ‘works’.

• Examples:
– Does giving scholarships increase attendance at school?
– Does auditing road projects reduce corruption?
– Do bed-nets prevent malaria?
Basic setup of program evaluation

• How do we answer these questions?

• The key is establishing a counterfactual.


– A counterfactual is defined as “what would have happened in the
absence of the treatment”.

– The true counterfactual is not observable – we never know what would have happened to the treated group if it had not been treated, because it was treated.

– The key goal of all program/impact evaluation methods is to construct or “mimic” the counterfactual using some type of control group.
Basic setup of program evaluation

• Some definitions:
– Outcome (Y): What outcome treatment might affect
– Treated (T): Group affected by program
– Control (C): Group unaffected by program

• The key assumption:
– If the treatment did not take place, then the outcome would have been the same in the treatment group and the control group
– Or, put another way – the control group is the counterfactual

• Then the impact of the treatment is:
– IMPACT = OUTCOME(treated) – OUTCOME(control)
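The impact formula above can be sketched as a simple difference in means; the outcome values below are hypothetical, purely for illustration:

```python
# Hypothetical illustration of IMPACT = OUTCOME(treated) - OUTCOME(control).

def estimate_impact(treated_outcomes, control_outcomes):
    """Estimate program impact as the difference in mean outcomes."""
    mean_t = sum(treated_outcomes) / len(treated_outcomes)
    mean_c = sum(control_outcomes) / len(control_outcomes)
    return mean_t - mean_c

treated = [9.0, 9.5, 8.5]   # outcomes in the treated group (made up)
control = [7.0, 8.0, 7.5]   # outcomes in the control group (made up)
print(estimate_impact(treated, control))  # 1.5
```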
Selection bias

• Remember the key assumption:
– If the treatment did not occur, then the outcome would have been the same in the treatment group and the control group

• What happens when this assumption is violated?
– i.e., what if there are other unobservable factors that affect the treated units but not the control units?
– We call these unobservable factors “selection bias”. Selection bias is rampant in observational studies.
– If there is selection bias, then you get the wrong answer!
• Answer = Treatment Effect + Selection Bias
– If selection bias is positive, an observational study overstates the treatment effect; if negative, it understates the effect.
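The decomposition Answer = Treatment Effect + Selection Bias can be illustrated with a small simulation in which better-off units self-select into treatment; all numbers and the selection rule below are invented for the sketch:

```python
# Sketch: when treatment is NOT randomly assigned, the naive difference
# in means equals the true effect plus selection bias.
import random

random.seed(0)
TRUE_EFFECT = 2.0

# Each unit has a baseline outcome; units with higher baselines
# self-select into treatment -> positive selection bias.
units = [random.gauss(10, 2) for _ in range(100_000)]
treated = [y + TRUE_EFFECT for y in units if y > 10]   # self-selected
control = [y for y in units if y <= 10]                # everyone else

naive = sum(treated) / len(treated) - sum(control) / len(control)
bias = naive - TRUE_EFFECT
print(f"naive estimate = {naive:.2f}, selection bias = {bias:.2f}")
```

Because selection here is positive, the naive comparison overstates the true effect, exactly as the bullet above describes.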
Selection bias examples
• What does selection bias look like in the real world?

• Some selection bias examples:


– Subsidized food. People who get food subsidies (e.g., Raskin) are poorer than those who don’t. Did Raskin make them poor?

– Schooling. People who finish high school earn more than people
who drop out before finishing high school. Is this because of the
schooling? Or because people who are smarter get more
school? Or some combination?

– Roads. Villages that get roads from the government see improvements in farm incomes. Did the roads cause the income changes? Or did the government put roads in ‘strategic’ locations?
Two types of impact evaluations
1. Randomized Evaluations:
– Use a lottery – i.e., a coin flip – to determine who is in the treatment group and who is in the control group.
– Since the only difference between the two groups is the result of the coin flip, we know that the control group provides a good counterfactual and there is no selection bias.

Also known as:
• Random Assignment Studies
• Randomized Field Trials
• Social Experiments
• Randomized Controlled Experiments
Types of Impact Evaluation
Methods (Cont)
2. Non-Experimental or Quasi-Experimental Methods
– These methods use other approaches to construct a control
group with minimal selection bias
• Examples:
– Simple difference
• Compare outcomes of treatment and control groups, where the control group wasn’t exposed to the program for ‘exogenous’ reasons
– Differences-in-Differences
• Compare changes over time between the treatment group and the control group
– Statistical Matching
• Identify a control group based on observables
– Instrumental Variables
• Predict treatment as a function of variables which don’t directly affect the outcome of interest.
• This course focuses on randomized methods
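As a minimal sketch of one of the quasi-experimental designs above, differences-in-differences compares the change over time in the treatment group with the change in the control group; the outcome numbers below are purely illustrative:

```python
# Minimal differences-in-differences sketch; the numbers are hypothetical.

def diff_in_diff(t_before, t_after, c_before, c_after):
    """DiD impact = (treatment change over time) - (control change over time)."""
    return (t_after - t_before) - (c_after - c_before)

# Mean outcomes before/after the program for each group (made up).
impact = diff_in_diff(t_before=7.0, t_after=9.0, c_before=7.0, c_after=7.9)
print(round(impact, 2))  # (9.0 - 7.0) - (7.9 - 7.0) = 1.1
```

The design assumes the two groups would have followed parallel trends absent the program; if that fails, selection bias returns.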
II – EXAMPLE:
IRON SUPPLEMENTS IN CENTRAL JAVA

Example: Iron supplements in Java

• Background:
– Anemia (lack of iron) causes low energy and reduces people’s ability to work
– Problem can be severe in agricultural areas where people mostly
eat unfortified foods

• Program:
– Iron fortification pilot program in Purworejo, Central Java
– Health workers come to households and encourage them to take
iron pills once per week

• Questions: does this program improve health, increase people’s ability to work, and reduce poverty?
What is the program’s impact?
[Chart: monthly earnings (Rp 100,000) rise from 7 (observed, 2002) to 9 (observed, 2003), with the program in between. Is this the program’s impact? Not sure!]
We need to identify what would have happened in the absence of the program
[Chart: the observed rise to 9 in 2003, alongside a dashed line for what would have happened without the program.]
We need to identify what would have happened in the absence of the program
[Chart: Impact = 9 – X, where X is what would have happened without the program (unobserved).]
Idea: Use a control group to estimate X
[Chart: 9 observed among beneficiaries in 2003 vs. 7.9 observed in the control group, so Impact = 9 – 7.9 = 1.1.]
What makes a bad/good control group?

• Suppose there are differences between the group of participants and the non-participants
– e.g., suppose iron was only given to households who lived near a Puskesmas
– In that case, the treatment group lives near a Puskesmas and the control group lives far from one

⇒ This can bias the comparison…
– These households may be in richer areas, and income may have gone up for these households anyway, even without the program

⇒ This is an example of “selection bias”

Selection Bias
[Chart: treatment group earnings rise to 9; the true counterfactual X gives true impact = 9 – 7.9 = 1.1, but a differently-selected control group sits lower.]
Selection Bias
[Chart: using the ‘wrong’ control group (ending at 7), the estimated impact is 9 – 7 = 2 instead of the true 1.1.]
One solution… randomized evaluation

• In this case, they determined which households received the iron supplements, and which did not, by lottery

• This created a comparison group that is not systematically different from the participants
– i.e., one that is not subject to any selection bias

• So, the control group looks just like the treatment group would have looked, except that it did not get the treatment
Randomized experiments
[Chart: the randomized control group tracks the true counterfactual (true impact = 9 – 7.9 = 1.1), whereas a ‘wrong’ control group would give 9 – 7 = 2.]
Example: Results from Purworejo Study

Monthly earnings, Rp (100,000):

         Treatment   Control
Men      9.36        7.91
Women    6.6         5.5
III – WHAT IS A RANDOMIZED
EVALUATION EXACTLY?

The basics
Start with simple case:
• Take a sample of program applicants
• Randomly assign them to either:
– Treatment Group - receives treatment
– Control Group - not allowed to receive treatment (during the
evaluation period)
• Random means that who gets the treatment and who does not is chosen by lottery:
– Can be a computer lottery
– Can be a live public lottery
• Note: random assignment of treatment and control is not
the same as random sampling
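A computer lottery like the one described above can be sketched as a shuffle-and-split; the applicant names below are hypothetical:

```python
# Sketch of random assignment by "computer lottery": shuffle the list of
# applicants and split it in half.
import random

def randomize(applicants, seed=None):
    """Randomly split applicants into treatment and control groups."""
    rng = random.Random(seed)
    pool = list(applicants)
    rng.shuffle(pool)                 # the coin flip / lottery
    half = len(pool) // 2
    return pool[:half], pool[half:]   # (treatment, control)

treatment, control = randomize(["A", "B", "C", "D", "E", "F"], seed=42)
print(treatment, control)
```

Fixing the seed makes the lottery reproducible and auditable, which is useful when the assignment must be verifiable later.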
Why does random assignment work?
• Why does random assignment of treatment and control generate a good counterfactual?

• Because of the law of large numbers...


– Take 200 people and randomly split them into two groups of 100
– The average height and weight will be very similar across groups
– This works for people, pupils, firms, schools, districts…
– (Doesn’t work if you only have 10 units from which to randomize)

• So…
– If there was no treatment, the two groups would be the same
– So the only difference between treatment and control is the
effect of the treatment!
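The 200-people thought experiment above can be run directly; the heights below are simulated, not real data:

```python
# Sketch: split 200 simulated people into two groups of 100 and compare
# their average heights -- randomization balances the groups.
import random

random.seed(1)
heights = [random.gauss(165, 10) for _ in range(200)]  # cm, synthetic
random.shuffle(heights)                                # random assignment
group_a, group_b = heights[:100], heights[100:]

mean_a = sum(group_a) / 100
mean_b = sum(group_b) / 100
print(f"group A: {mean_a:.1f} cm, group B: {mean_b:.1f} cm")
# The two means come out close; with only ~5 people per group the
# gap would be much noisier.
```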
Basic Setup of a Randomized Evaluation

[Diagram: Target Population → Potential Participants → Evaluation Sample → Random Assignment → Treatment Group and Control Group; the Treatment Group further splits into Participants and No-Shows. Based on Orr (1999)]


Key Steps in conducting a
randomized experiment
1. Design the study carefully
– What is the problem? What is the key question to answer?
– What are the potential policy levers to address the problem?
2. Collect baseline data and randomly assign people to
treatment or control
3. Verify that assignment looks random
4. Monitor process so that integrity of experiment is not
compromised
5. Collect follow-up data for both the treatment and
control groups
6. Estimate program impacts by comparing mean
outcomes of treatment vs. control group
7. Assess whether program impacts are statistically
significant and practically significant
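Steps 6 and 7 above can be sketched as a difference in means plus a two-sample t statistic; the outcome data below are made up:

```python
# Sketch of steps 6-7: compare treatment and control means and gauge
# statistical significance with a two-sample (Welch-style) t statistic.
import math
import statistics

def impact_and_t(treat, ctrl):
    """Return (impact estimate, t statistic) for a difference in means."""
    impact = statistics.mean(treat) - statistics.mean(ctrl)
    se = math.sqrt(statistics.variance(treat) / len(treat)
                   + statistics.variance(ctrl) / len(ctrl))
    return impact, impact / se

treat = [9.4, 9.1, 9.8, 8.9, 9.6, 9.3]   # hypothetical follow-up outcomes
ctrl = [7.9, 8.2, 7.6, 8.0, 7.7, 8.1]
impact, t = impact_and_t(treat, ctrl)
print(f"impact = {impact:.2f}, t = {t:.1f}")  # |t| well above ~2 => significant
```

Practical significance (is the effect big enough to matter for policy?) still has to be judged separately from the t statistic.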
Some variations on the basics
• Assigning to multiple treatment groups

• Assigning to units other than individuals or households


– Health Centers
– Schools
– Local Governments
– Villages

• Important factors:
– What is the decision-making unit?
– At what level can data be collected?
IV - ADVANTAGES AND
LIMITATIONS OF RANDOMIZED
EVALUATIONS

Validity
• In assessing any study, there are two types of issues to
think about:
– Internal validity: relates to the ability to draw causal inferences, i.e. can we attribute our impact estimates to the program and not to something else?

– External validity: relates to the ability to generalize to other settings of interest, i.e. can we generalize our impact estimates from this program to other populations, time periods, countries, etc.?
Key Advantage of Randomization

• Much more robust in terms of internal validity:
– No selection bias
⇒ little doubt that the difference observed between treatment and control was caused by your program
Other advantages of experiments

• Relative to results from non-experimental studies, results from experiments are:

– Less subject to methodological debates
– Easier to convey
– More likely to be convincing to program funders and/or policymakers
Limitations of experiments
• Despite the great methodological advantage of experiments, they are also potentially subject to threats to their validity. For example:
– Internal validity
(e.g. Hawthorne effects, survey non-response, no-shows, crossovers, duration bias, etc.)

– External validity
(e.g. are the results generalizable to the population of interest?)

• It is important to realize that some of these threats also affect the validity of non-experimental studies
Other limitations of experiments

• Experiments measure the impact of the offer to participate in the program
– Depending on the design, it may be possible to understand the mechanism underlying the intervention

• Costs (though one also needs to think about the costs of getting the wrong answer)

• Partial equilibrium – results may not capture economy-wide (general equilibrium) effects
Other limitations of experiments
• Ethical issues
– Most programs are rationed due to lack of resources
– Randomization is a “fair” way to allocate resources
– May also make sense to remove discretion from allocation for
other reasons (e.g., prevent favoritism)
– Phase-ins or pilots lend themselves naturally to randomization
– Exploit pilots or phase-ins due to budgetary constraints
V – HOW WRONG CAN YOU GO:
THE “VOTE 2002” CAMPAIGN

Case 1 – “Vote 2002” campaign
• Intervention designed to increase voter turnout in 2002
U.S. election
• Phone calls to ~60,000 individuals
• Only ~35,000 individuals were reached
• Key Question: Did the campaign have a positive effect
(i.e. impact) on voter turnout?
– 5 methods were used to estimate impact
Methods 1-3

• Based on comparing reached vs. not-reached:

– Method 1: Simple difference in voter turnout,
(Voter turnout)reached – (Voter turnout)not reached

– Method 2: Multiple regression controlling for some differences between the two groups

– Method 3: Method 2, but also controlling for differences in past voting behavior between the two groups
Impact Estimates using Methods 1-3

Method     Estimated Impact
Method 1   10.8 pp *
Method 2   6.1 pp *
Method 3   4.5 pp *

pp = percentage points; *: statistically significant at the 5% level
Methods 1-3

Is any of these impact estimates likely to be the true impact of the “Vote 2002” campaign?
Reached vs. Not Reached

                Reached   Not Reached   Difference
Female          56.2%     53.8%         2.4 pp*
Newly Regist.   7.3%      9.6%          -2.3 pp*
From Iowa       54.7%     46.7%         8.0 pp*
Voted in 2000   71.7%     63.3%         8.3 pp*
Voted in 1998   46.6%     37.6%         9.0 pp*

pp = percentage points; *: statistically significant at the 5% level
Method 4: Matching

• Similar data were available on 2,000,000 individuals

• Select as a comparison group a subset of the 2,000,000 individuals that resembles the reached group as much as possible
• Statistical procedure: matching
• To estimate impact, compare voter turnout between the reached and comparison groups
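A minimal sketch of the matching idea, using nearest-neighbor distance on a couple of covariates; the records and variable names below are invented for illustration:

```python
# Sketch of statistical matching: for each reached individual, pick the
# most similar not-reached individual on observable covariates.

def nearest_match(unit, pool, covariates):
    """Return the pool member closest to `unit` on the given covariates."""
    def distance(other):
        return sum((unit[c] - other[c]) ** 2 for c in covariates)
    return min(pool, key=distance)

reached = [{"age": 45, "voted_2000": 1, "turnout": 1}]
not_reached = [
    {"age": 23, "voted_2000": 0, "turnout": 0},
    {"age": 47, "voted_2000": 1, "turnout": 1},
]

matches = [nearest_match(u, not_reached, ["age", "voted_2000"]) for u in reached]
impact = (sum(u["turnout"] for u in reached) / len(reached)
          - sum(m["turnout"] for m in matches) / len(matches))
print(impact)  # difference in turnout between reached and matched comparison
```

Matching can only balance the covariates it is given; any unobservable differences between the groups remain.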
An illustration of matching
[Figure omitted. Source: Arceneaux, Gerber, and Green (2004)]
Impact Estimates using Matching

Matching on…       Estimated Impact
4 covariates       3.7 pp *
6 covariates       3.0 pp *
All covariates     2.8 pp *

pp = percentage points; *: statistically significant at the 5% level
Method 4: Matching

• Is this impact estimate likely to be the true impact of the “Vote 2002” campaign?
• Key: These two groups should be equivalent in terms of the observable characteristics that were used to do the matching.

But what about unobservable characteristics?
Method 5: Randomized Experiment
• It turns out the 60,000 were randomly chosen from a population of 2,060,000 individuals
• Hence, treatment was randomly assigned to two groups:
– Treatment group (60,000 who got called)
– Control group (2,000,000 who did not get called)
• To estimate impact, compare voter turnout between the treatment and control groups
– Do a statistical adjustment to address the fact that not everyone in the treatment group was reached
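One standard form of this adjustment is a Wald / instrumental-variables rescaling: the effect on those actually reached equals the treatment-control (intent-to-treat) difference divided by the share reached. The numbers below are hypothetical, not the study's estimates:

```python
# Sketch of the adjustment for incomplete reach. All numbers hypothetical.

def effect_on_reached(itt_effect, share_reached):
    """Rescale the intent-to-treat effect by the share actually reached."""
    return itt_effect / share_reached

# e.g. a 0.2 pp treatment-control difference when roughly 58% of the
# treatment group was actually reached by a call:
print(round(effect_on_reached(0.2, 0.58), 2))  # 0.34
```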
Method 5: Randomized Experiment

• Impact estimate: 0.4 pp, not statistically significant


• Is this impact estimate likely to be the true impact of the
“Vote 2002” campaign?
• Key: Treatment and control groups should be equivalent
in terms of both their observable and unobservable
characteristics
• Hence, any difference in outcomes can be attributed to
the Vote 2002 campaign
Summary Table

Method                                   Estimated Impact
1 – Simple difference                    10.8 pp *
2 – Multiple regression                  6.1 pp *
3 – Multiple regression with panel data  4.5 pp *
4 – Matching                             2.8 pp *
5 – Randomized experiment                0.4 pp
VI – CONCLUSIONS
Conclusions
• Good public policies require knowledge of causal effects.
• Causal effects are estimated only if we have good
counterfactuals.
• In the absence of good counterfactuals, analysis is
contaminated with selection bias.
• Be suspicious of causal claims that come from
observational studies.
• Randomization offers a solution to generating good
counterfactuals.
Conclusions
• If properly designed and conducted, social experiments provide the most credible assessment of the impact of a program

• Results from social experiments are easy to understand and much less subject to methodological quibbles

• Credibility + ease => more likely to convince policymakers and funders of the effectiveness (or lack thereof) of a program
Conclusions (cont.)
• However, these advantages are present only if social experiments are well designed and conducted properly

• We must assess the validity of experiments in the same way we assess the validity of any other study

• We must be aware of the limitations of experiments
THE END
