
4. Impact Evaluation Methods


4.1. Designs and Steps for Impact Evaluations
There are two designs for evaluating program impact:
1. Experimental design
2. Observational or non-experimental design
Figure 4.1: Basic Framework for Evaluation of Program Impact
1. Experimental design

• Individuals are randomly assigned into a Treatment group and a Control group.
• If well implemented and sample is large enough, random
assignment makes the pre-program treatment and control
groups similar on observed and unobserved
characteristics.
• To estimate program impact:
Program Impact = Average (“Treatment”) – Average (“Control”)
• Experiments control for problem of incomplete information
and selection.
• A few conditions are needed:
• Assignment to program is random
• Program “intensity” is homogeneous in treatment group
• No spillover effect
• Individuals do not change behavior because of participation/non-participation in the experiment
• There is no selective participation
• Individuals remain for the duration of the experiment, or at least there is no selective attrition
• External factors influence both groups equally
• Randomized controlled trials (RCTs) are frequently used for clinical trials of vaccines and new drugs. There are some examples in social policy.
2. Observational or non-experimental design
• In the absence of good experiments, observational designs can be used.
• In observational designs, there is no random assignment of
individuals to treatment/control groups.
• Therefore, multiple factors influence participation of individuals in the program, and there is no guarantee that other relevant factors are similar between the “Participant” and “Non-participant” groups.
• Observational designs, often referred to as non-experimental designs, use econometric techniques, matching procedures, or discontinuity approaches to identify a comparison group and to estimate the counterfactual.
• To do this, one needs a conceptual framework, appropriate
data, a comparison group, a statistical model that corresponds
to the conceptual framework and an estimation strategy.
4.2. Causal Inference & the Problem of the Counterfactual

 Whether “X” (an intervention) causes “Y” (an outcome variable) is very difficult to determine.
 The main challenge is to determine what would have happened to the beneficiaries if the intervention had not existed.
Evaluation question: what is the effect of a project?
Effect = Outcome A (with programme) − Outcome B (without programme)
Problem: for any individual we only observe A (if they participate) or B (if they do not participate), but never both A and B for everyone!
Treatment and selection effects
Here, we subtract and add the non-treated outcome for the treated group.
Definition of treatment effects
Estimating the Counterfactual

• The key to estimating the counterfactual for program participants is to move from the individual or unit level to the group level.
• Although no perfect clone exists for a single unit, we can
rely on statistical properties to generate two groups of
units that, if their numbers are large enough, are
statistically indistinguishable from each other at the group
level.
• So in practice, the challenge of an impact evaluation is to
identify a treatment group and a comparison group that
are statistically identical, on average, in the absence of
the program.
Estimating the Counterfactual
 Specifically, the treatment group (TG) and comparison group (CG) must be the same in at least 3 ways:

1. TG & CG must be identical in the absence of the program. On average, the characteristics of the two groups should be the same; e.g., the average age in the TG should be the same as the average age in the CG.

2. The treatment should not affect the comparison group either directly or indirectly. For example, the treatment group should not transfer resources to the comparison group (direct effect).

3. The outcomes of units in the control group should change the same way as outcomes in the treatment group, if both groups were given the program (or not).
Two Counterfeit Estimates of the Counterfactual

I. Before-and-after comparisons (also known as pre-post or reflexive comparisons) compare the outcomes of the same group before and after participating in a program.
• This comparison assumes that if the program had never
existed, the outcome (Y) for program participants would
have been exactly the same as their situation before the
program.
II. Enrolled-and-non-enrolled (or self-selected) comparisons
compare the outcomes of a group that chooses to participate
in a program with those of a group that chooses not to
participate.
• Selection occurs when program participation is based on the
preferences, decisions, or unobserved characteristics of
potential participants.
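Both counterfeit comparisons fail in predictable directions, which a small simulation makes concrete. Everything below (the variable names, the data-generating process, the numbers) is an illustrative assumption of mine, not from the text: a secular trend contaminates the before-and-after estimate, and self-selection on an unobserved "ability" contaminates the enrolled-and-non-enrolled estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Latent "ability" drives both self-selection and the outcome level.
ability = rng.normal(0, 1, n)
y_before = 10 + ability + rng.normal(0, 1, n)   # outcome before the program
trend = 2.0                                     # secular trend: everyone improves
effect = 1.0                                    # true program effect

# Self-selected enrollment: higher-ability units are more likely to enroll.
enroll = (ability + rng.normal(0, 1, n)) > 0
y_after = 10 + ability + trend + effect * enroll + rng.normal(0, 1, n)

# Counterfeit I: before-and-after on participants (absorbs the trend).
before_after = y_after[enroll].mean() - y_before[enroll].mean()

# Counterfeit II: enrolled vs. non-enrolled (absorbs selection on ability).
enrolled_vs_not = y_after[enroll].mean() - y_after[~enroll].mean()

print(before_after, enrolled_vs_not)  # both land far above the true effect of 1.0
```

Here the before-and-after estimate converges to trend + effect (3.0), and the enrolled-vs-non-enrolled estimate to effect plus the ability gap between self-selected groups; neither recovers the true effect of 1.0.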
4.3. Basic Theory of Impact Evaluation: The Problem of Selection
Bias
• Equation 4.1 presents the basic evaluation problem comparing
outcomes Y across treated and nontreated individuals i:
Yi = αXi + βTi + εi . (4.1)
• The problem with estimating equation 4.1 is that treatment
assignment is not often random because of the following factors: (a)
purposive program placement and (b) self-selection into the program.
• That is, programs are placed according to the need of the
communities and individuals, who in turn self-select given program
design and placement.
• Self-selection could be based on observed characteristics, unobserved factors, or both. In the case of unobserved factors, the error term in the estimating equation will contain variables that are also correlated with the treatment dummy T; that is, cov(T, ε) ≠ 0, which biases the estimate.
• Given the previous notation, the average effect of the program might be represented as follows:
D = E(Yi(1) | Ti = 1) − E(Yi(0) | Ti = 0) (4.2)
• The problem is that the treated and nontreated groups
may not be the same prior to the intervention, so the
expected difference between those groups may not be
due entirely to program intervention.
• If, in equation 4.2, one then adds and subtracts the expected outcome of participants had they not participated in the program, E(Yi(0)|Ti=1) (another way to specify the counterfactual), one gets
D = E(Yi(1)|Ti=1) − E(Yi(0)|Ti=1) + [E(Yi(0)|Ti=1) − E(Yi(0)|Ti=0)]. (4.3)

⇒ D = ATE + [E(Yi(0) | Ti = 1) – E(Yi(0) | Ti = 0)]. (4.4)

⇒ D = ATE + B. (4.5)
•In these equations, ATE is the average treatment effect
[E(Yi(1) | Ti = 1) – E(Yi(0) | Ti = 1)], namely, the average gain
in outcomes of participants relative to nonparticipants, as if
nonparticipating households were also treated.
•The ATE corresponds to a situation in which a randomly
chosen household from the population is assigned to
participate in the program, so participating and
nonparticipating households have an equal probability of
receiving the treatment T.
• The term B, [E(Yi(0) | Ti = 1) − E(Yi(0) | Ti = 0)], is the extent of selection bias that crops up in using D as an estimate of the ATE. Because one does not know E(Yi(0)|Ti=1), one cannot calculate the magnitude of selection bias.
• The basic objective of a sound impact assessment is then
to find ways to get rid of selection bias (B = 0) or to find
ways to account for it.
• One approach is to randomly assign the program.
• It has also been argued that selection bias would
disappear if one could assume that whether or not
households or individuals receive treatment (conditional
on a set of covariates, X) were independent of the
outcomes that they have.
Different Evaluation Approaches to Ex Post Impact Evaluation
• A number of different methods can be used in impact evaluation theory to address the fundamental question of the missing counterfactual.
• Each of these methods carries its own assumptions about the nature
of potential selection bias in program targeting and participation, and
the assumptions are crucial to developing the appropriate model to
determine program impacts.
• These methods include:
1. Randomized evaluations
2. Propensity score matching (PSM)
3. Double-difference (DD) methods
4. Instrumental variable (IV) methods
5. Regression discontinuity (RD) design
4.4. Randomized evaluations
• Randomization can correct for the selection bias B by
randomly assigning individuals or groups to treatment
and control groups.
• Randomization: use randomization to obtain the counterfactual (“the gold standard” according to some):
 Eligible participants are randomly assigned to a
treatment group who will receive program benefits
while the control group consists of people who will
not receive program benefits,
 The treatment and control groups are identical at the
outset of the project, except for participation in the
project.
Randomization graphically
Figure 4.2: The Ideal Experiment with an Equivalent Control Group

• Participants are similar or “equivalent” in that both groups, prior to the project intervention, are observed to have the same level of outcome (in this case, income, Y0).
• After the treatment is carried out, the observed outcome of the treated group
is found to be Y2 while the outcome level of the control group is Y1.
• Therefore, the effect of program intervention can be described as (Y2 − Y1).
Different Methods of Randomization

• Oversubscription: If limited resources burden the program, implementation can be allocated randomly across a subset of eligible participants, and the remaining eligible subjects who do not receive the program can be considered controls.
• Randomized phase-in: This approach gradually phases
in the program across a set of eligible areas, so that
controls represent eligible areas still waiting to receive the
program.
• Within-group randomization: In a randomized phase-in
approach, however, if the lag between program genesis
and actual receipt of benefits is large, greater controversy
may arise about which area or areas should receive the
program first. In that case, an element of randomization
can still be introduced by providing the program to some
subgroups in each targeted area.
• Encouragement design: Instead of randomizing the
treatment, researchers randomly assign subjects an
announcement or incentive to partake in the program.
Experimental designs
 Why RCTs?
 To analyze whether an intervention had an impact, a counterfactual is needed, but it is hard to ask counterfactual questions (Ravallion, 2008).
 Randomization guarantees statistical independence of the intervention from preferences (observed and unobserved):
 it overcomes the selection bias of individuals receiving the intervention;
 internal validity is high;
 it involves less rigorous econometric approaches.
 Led by the MIT Poverty Action Lab and the World Bank.
 It is criticized by others (see Rodrik, 2008).
 Key assumptions: randomized assignment

 Randomized Assignment & Treatment Effects
Treatment Effect with Partial and Pure Randomization
• Randomization can be set up in two ways: pure randomization and partial randomization.
• In pure randomization, treatment follows a two-stage procedure: in the first stage, a sample of potential participants is selected randomly from the relevant population; in the second stage, individuals in this sample are randomly assigned to treatment and comparison groups, ensuring both internal and external validity. Treated and untreated households would then have the same expected outcome in the absence of the program.
• Then, E[Yi(0)|Ti = 1] is equal to E[Yi(0)|Ti = 0].
• Because treatment would be random, and not a function of
unobserved characteristics (such as personality or other tastes)
across individuals, outcomes would not be expected to have
varied for the two groups had the intervention not existed.
• Thus, selection bias becomes zero (B=0) under the case of
randomization.
• If treatment is random (so that T and ε are independent), equation 4.1 can be estimated by ordinary least squares (OLS), and the estimated treatment effect β̂OLS gives the difference in the outcomes of the treated and the control group.
• If a randomized evaluation is correctly designed and
implemented, an unbiased estimate of the impact of a program
can be found.
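The claim that the OLS coefficient on the treatment dummy recovers the treated-control difference in means can be checked numerically. A minimal sketch on simulated data (the data-generating process and numbers below are hypothetical, not from the course datasets):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

T = rng.integers(0, 2, n)                    # random assignment
y = 5.0 + 2.0 * T + rng.normal(0, 1, n)      # true treatment effect = 2.0

# OLS of y on [1, T]: the slope on T is the treatment-effect estimate.
X = np.column_stack([np.ones(n), T])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

diff_in_means = y[T == 1].mean() - y[T == 0].mean()
print(beta[1], diff_in_means)   # identical up to floating point
```

With only a constant and a treatment dummy as regressors, the two numbers coincide exactly; under random assignment both are unbiased for the true effect.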
• A pure randomization is, however, extremely rare to
undertake.
• Rather, partial randomization is used, where the treatment
and control samples are chosen randomly, conditional on
some observable characteristics X (for example,
landholding or income).
• If one can make an assumption called conditional exogeneity of program placement, one can find an unbiased estimate of program impact.
Example – use mothersmoke.dta and Exercise data.dta
 Impacts of job training on earnings – Exercise data.dta
 Impact of mother smoking on baby weight – mothersmoke.dta
Commands: run a simple two-sample t-test or a simple regression.
For the impact of job training on earnings:
 reg re78 treatment
 ttest re78, by(treatment)
For the impact of mother smoking on baby weight:
 ttest bweight, by(mbsmoke)
 reg bweight mbsmoke
For the impact of job training on earnings:

. reg re78 treatment

      Source |       SS       df       MS              Number of obs =    2675
-------------+------------------------------           F(  1,  2673) =  173.41
       Model |  3.9811e+10     1  3.9811e+10           Prob > F      =  0.0000
    Residual |  6.1365e+11  2673   229573194           R-squared     =  0.0609
-------------+------------------------------           Adj R-squared =  0.0606
       Total |  6.5346e+11  2674   244375671           Root MSE      =   15152

        re78 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   treatment |  -15204.78   1154.614   -13.17   0.000    -17468.81   -12940.75
       _cons |   21553.92   303.6414    70.98   0.000     20958.53    22149.32

. ttest re78, by(treatment)

Two-sample t test with equal variances

   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |    2490    21553.92     311.731    15555.35    20942.64      22165.2
       1 |     185    6349.144    578.4229    7867.402    5207.949     7490.338
---------+--------------------------------------------------------------------
combined |    2675    20502.38    302.2505    15632.52    19909.71     21095.04
---------+--------------------------------------------------------------------
    diff |            15204.78    1154.614                12940.75     17468.81

    diff = mean(0) - mean(1)                                      t =  13.1687
Ho: diff = 0                                     degrees of freedom =     2673

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000
For the impact of mother smoking on baby weight:

. ttest bweight, by(mbsmoke)

Two-sample t test with equal variances

   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
nonsmoke |    3778    3412.912    9.284683    570.6871    3394.708     3431.115
  smoker |     864     3137.66    19.08197    560.8931    3100.207     3175.112
---------+--------------------------------------------------------------------
combined |    4642     3361.68    8.495534    578.8196    3345.025     3378.335
---------+--------------------------------------------------------------------
    diff |            275.2519     21.4528                233.1942     317.3096

    diff = mean(nonsmoke) - mean(smoker)                          t =  12.8306
Ho: diff = 0                                     degrees of freedom =     4640

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

. reg bweight mbsmoke

      Source |       SS       df       MS              Number of obs =    4642
-------------+------------------------------           F(  1,  4640) =  164.62
       Model |  53275939.9     1  53275939.9           Prob > F      =  0.0000
    Residual |  1.5016e+09  4640   323622.478          R-squared     =  0.0343
-------------+------------------------------           Adj R-squared =  0.0341
       Total |  1.5549e+09  4641   335032.156          Root MSE      =   568.88

     bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     mbsmoke |  -275.2519    21.4528   -12.83   0.000    -317.3096   -233.1942
       _cons |   3412.912    9.255254   368.75   0.000     3394.767    3431.056
Quasi-experimental designs
Quasi-experimental designs use statistical/non-experimental research designs to construct the counterfactual.
Impact assessment techniques:
 Propensity score matching: Match program participants with non-participants, typically using individual observable characteristics.
 Difference-in-differences/double difference: Compare observed changes in outcome before and after for a sample of participants and non-participants.
 Panel fixed-effects model: A generalization of the above to multiple time periods.
 Regression discontinuity design: Individuals just on the other side of the cut-off point serve as the counterfactual.
 Instrumental variables: When program placement is correlated with participants' characteristics, correct the bias by replacing the variable characterizing program placement with another variable.
4.5. Propensity score matching

• PSM constructs a statistical comparison group by modeling the probability of participating in the program on the basis of observed characteristics unaffected by the program.
• Participants are then matched on the basis of this probability,
or propensity score, to nonparticipants, using different
methods.
• The average treatment effect of the program is then calculated
as the mean difference in outcomes across these two groups.
• On its own, PSM is useful when only observed characteristics
are believed to affect program participation.
• We can ensure only internal as opposed to external validity of
the sample, so only the treatment effect on the treated (TOT)
can be estimated.
• PSM constructs a statistical comparison group that is
based on a model of the probability of participating in the
treatment T conditional on observed characteristics X, or
the propensity score: P(X ) = Pr(T = 1|X ).
• Under certain assumptions, matching on P(X) is as good
as matching on X.
• The necessary assumptions for identification of the
program effect are (a) conditional independence and (b)
presence of a common support.
Assumption of Conditional Independence
• Conditional independence states that given a set of observable
covariates X that are not affected by treatment, potential
outcomes Y are independent of treatment assignment T.
• If YiT represents outcomes for participants and YiC outcomes for nonparticipants, conditional independence implies (YiT, YiC) ⊥ Ti | Xi.
• This assumption is also called unconfoundedness, and it implies that uptake of the program is based entirely on observed characteristics.
• To estimate the TOT as opposed to the ATE, a weaker assumption is needed: YiC ⊥ Ti | Xi.
• If unobserved characteristics determine program participation, conditional independence will be violated, and PSM is not an appropriate method.
Assumption of Common Support

• A second assumption is the common support or overlap condition: 0 < P(Ti = 1|Xi) < 1.
• This condition ensures that treatment observations have comparison observations “nearby” in the propensity score distribution.
• Specifically, the effectiveness of PSM also depends on
having a large and roughly equal number of participant
and nonparticipant observations so that a substantial
region of common support can be found.
• Treatment units will therefore have to be similar to
nontreatment units in terms of observed characteristics
unaffected by participation; thus, some nontreatment units
may have to be dropped to ensure comparability.
Figure 4.2: Example of Common Support. Figure 4.3: Example of Poor Balancing and Weak Common Support.
The TOT Using PSM
• If conditional independence holds, and if there is a sizable overlap in P(X) across participants and nonparticipants, the PSM estimator for the TOT can be specified as the mean difference in Y over the common support, weighting the comparison units by the propensity score distribution of participants.
• A typical cross-section estimator can be specified as follows:
TOT_PSM = E_{P(X)|T=1} { E[Y^T | T = 1, P(X)] − E[Y^C | T = 0, P(X)] }
• More explicitly, with cross-section data and within the common support, the treatment effect can be written as follows:
TOT_PSM = (1/N_T) Σ_{i∈T} [ Y_i^T − Σ_{j∈C} ω(i, j) Y_j^C ],
• where N_T is the number of participants i and ω(i, j) is the weight used to aggregate outcomes for the matched nonparticipants j.
Application of the PSM Method
Step 1: Estimating a Model of Program Participation
• First, the samples of participants and nonparticipants should be pooled, and then participation T should be estimated on all the observed covariates X in the data that are likely to determine participation.
• Choice of X:
• Quality of data is important
• Understanding selection process: qualitative work
• X should only contain baseline or otherwise exogenous data
• Use same survey instruments, match within same region/context, …

• When one is interested only in comparing outcomes for those participating (T = 1) with those not participating (T = 0), this estimate can be constructed from a probit or logit model of program participation.
• After the participation equation is estimated, the predicted
values of T from the participation equation can be derived.
• The predicted outcome represents the estimated
probability of participation or propensity score.
• Every sampled participant and non-participant will have
an estimated propensity score, Pˆ(X |T = 1) = Pˆ(X).
Step 2: Defining the Region of Common Support and Balancing Tests
• Next, the region of common support needs to be defined where
distributions of the propensity score for treatment and comparison
group overlap.
• As mentioned earlier, some of the nonparticipant observations may
have to be dropped because they fall outside the common support.
• Balancing tests can also be conducted to check whether, within each
quantile of the propensity score distribution, the average propensity
score and mean of X are the same.
• For PSM to work, the treatment and comparison groups must be
balanced in that similar propensity scores are based on similar
observed X.
• The distributions of the treated group and the comparator must be
similar, which is what balance implies.
• Formally, one needs to check if Pˆ(X |T = 1) =Pˆ(X |T = 0).
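The Step 2 diagnostics can be sketched with two small helpers. The function names, the simulated data, and the standardized-mean-difference statistic are my own conventions for illustration, not from the text:

```python
import numpy as np

def standardized_diff(x, T):
    """Standardized mean difference of a covariate between treated and
    controls; values near zero indicate balance."""
    m1, m0 = x[T].mean(), x[~T].mean()
    s = np.sqrt((x[T].var(ddof=1) + x[~T].var(ddof=1)) / 2)
    return (m1 - m0) / s

def common_support(ps, T):
    """Region where treated and control score distributions overlap:
    units outside [max of mins, min of maxes] would be dropped."""
    lo = max(ps[T].min(), ps[~T].min())
    hi = min(ps[T].max(), ps[~T].max())
    return (ps >= lo) & (ps <= hi)

rng = np.random.default_rng(5)
x = rng.normal(size=5000)
ps = 1 / (1 + np.exp(-1.5 * x))          # score strongly driven by x
T = rng.uniform(size=5000) < ps

smd = standardized_diff(x, T)            # raw imbalance before matching
keep = common_support(ps, T)
print(smd, keep.mean())
```

Before matching, the raw standardized difference is large; after a successful match, recomputing it on the matched sample should bring it close to zero.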
Step 3: Matching Participants to Nonparticipants

• Different matching criteria can be used to assign participants to non-participants on the basis of the propensity score.
1. Nearest-neighbor matching: One of the most frequently
used matching techniques is NN matching, where each
treatment unit is matched to the comparison unit with the
closest propensity score.
• One can also choose n nearest neighbors and do
matching (usually n = 5 is used). Matching can be done
with or without replacement.
• Matching with replacement, for example, means that the
same non-participant can be used as a match for different
participants.
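Steps 1 and 3 can be sketched end to end on simulated data. Everything below (the data-generating process, the Newton-Raphson logit fit, the single-nearest-neighbour match with replacement) is an illustrative reconstruction, not code from the text:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000

# One observed covariate drives both participation and the outcome.
x = rng.normal(0, 1, n)
T = rng.uniform(size=n) < 1 / (1 + np.exp(-(x - 0.5)))
y = 1.0 + 2.0 * x + 3.0 * T + rng.normal(0, 1, n)    # true effect = 3.0

# Step 1: propensity score from a logit fit by Newton-Raphson on [1, x].
X = np.column_stack([np.ones(n), x])
b = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-np.clip(X @ b, -30, 30)))
    b += np.linalg.solve(X.T @ ((p * (1 - p))[:, None] * X), X.T @ (T - p))
ps = 1 / (1 + np.exp(-np.clip(X @ b, -30, 30)))

# Step 3: 1-nearest-neighbour matching with replacement on the score.
tr, co = np.where(T)[0], np.where(~T)[0]
matches = co[np.abs(ps[co][None, :] - ps[tr][:, None]).argmin(axis=1)]

tot = (y[tr] - y[matches]).mean()        # TOT from matched pairs
naive = y[T].mean() - y[~T].mean()       # biased: ignores selection on x
print(naive, tot)
```

The naive treated-control gap absorbs the selection on x and overshoots, while the matched estimate lands near the true effect of 3.0.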
 We can match to more than one neighbour
 5 nearest neighbours? Or more?
 Radius matching: all neighbours within specific range
 Kernel matching: all neighbours, but close neighbours have larger
weight than far neighbours.
 Best approach?
 Look at sensitivity to choice of approach
 How many neighbours?
 Using more information reduces bias
 Using more control units than treated increases precision
 But using control units more than once decreases precision
2. Caliper or radius matching: One problem with NN
matching is that the difference in propensity scores for a
participant and its closest nonparticipant neighbor may still
be very high.
• This situation results in poor matches and can be avoided by imposing a threshold or “tolerance” on the maximum propensity score distance (caliper).
• This procedure therefore involves matching with
replacement, only among propensity scores within a
certain range.
• A higher number of dropped non-participants is likely,
however, potentially increasing the chance of sampling
bias.
3. Stratification or interval matching: This procedure
partitions the common support into different strata (or
intervals) and calculates the program’s impact within each
interval.
• Specifically, within each interval, the program effect is the
mean difference in outcomes between treated and control
observations.
• A weighted average of these interval impact estimates
yields the overall program impact, taking the share of
participants in each interval as the weights.
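A minimal sketch of interval matching, assuming the propensity score has already been estimated; the five-strata choice, the function name, and the simulated data are illustrative assumptions:

```python
import numpy as np

def stratified_impact(ps, T, y, n_strata=5):
    """Partition the support into propensity-score intervals, take the
    treated-vs-control mean outcome gap in each, and average the gaps
    weighted by the share of participants in each interval."""
    edges = np.quantile(ps, np.linspace(0, 1, n_strata + 1))
    impact, weight = 0.0, 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        s = (ps >= lo) & (ps <= hi)
        if T[s].any() and (~T[s]).any():         # need both groups present
            share = T[s].sum() / T.sum()         # share of participants here
            impact += share * (y[s & T].mean() - y[s & ~T].mean())
            weight += share
    return impact / weight

rng = np.random.default_rng(4)
n = 20_000
x = rng.normal(size=n)
ps = 1 / (1 + np.exp(-x))                # score assumed already estimated
T = rng.uniform(size=n) < ps
y = x + 2.0 * T + rng.normal(size=n)     # true effect = 2.0

est = stratified_impact(ps, T, y)
print(est)
```

Within each interval the score is roughly constant, so the within-stratum gap is close to the program effect; the weighted average recovers the overall impact with far less bias than the raw treated-control comparison.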
4. Kernel matching: One risk with the methods just described is
that only a small subset of nonparticipants will ultimately satisfy
the criteria to fall within the common support and thus construct
the counterfactual outcome.
• Nonparametric matching estimators such as kernel matching
use a weighted average of all nonparticipants to construct the
counterfactual match for each participant.
• If Pi is the propensity score for participant i and Pj is the propensity score for nonparticipant j, the weights for kernel matching are given by
ω(i, j) = K((Pj − Pi)/αn) / Σ_{k∈C} K((Pk − Pi)/αn),
• where K(•) is a kernel function and αn is a bandwidth parameter.
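The weight formula can be sketched directly, assuming a Gaussian kernel; the bandwidth of 0.05 and the toy scores and outcomes below are made-up numbers:

```python
import numpy as np

def kernel_weights(p_treated_i, p_controls, bandwidth=0.05):
    """Gaussian-kernel weights for matching one treated unit i against all
    non-participants j: w(i, j) = K((Pj - Pi)/a_n) / sum_k K((Pk - Pi)/a_n)."""
    u = (p_controls - p_treated_i) / bandwidth
    k = np.exp(-0.5 * u**2)      # Gaussian kernel (normalising const cancels)
    return k / k.sum()

# Counterfactual for unit i = weighted average of ALL control outcomes.
p_controls = np.array([0.10, 0.30, 0.32, 0.55, 0.90])
y_controls = np.array([2.0, 4.0, 4.2, 6.0, 9.0])
w = kernel_weights(0.31, p_controls)
counterfactual_i = w @ y_controls
print(w.round(3), counterfactual_i)
```

With a treated score of 0.31, nearly all the weight falls on the two controls at 0.30 and 0.32, so the constructed counterfactual sits near the average of their outcomes (about 4.1); distant controls contribute almost nothing but are never fully dropped.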
Estimating Standard Errors with PSM: Use of the Bootstrap

• Compared to traditional regression methods, the estimated variance of the treatment effect in PSM should include the variance attributable to the derivation of the propensity score, the determination of the common support, and (if matching is done without replacement) the order in which treated individuals are matched.
• Failing to account for this additional variation beyond the
normal sampling variation will cause the standard errors
to be estimated incorrectly.
• One solution is to use bootstrapping, where repeated
samples are drawn from the original sample, and
properties of the estimates (such as standard error and
bias) are reestimated with each sample.
• Each bootstrap sample estimate includes the first steps of
the estimation that derive the propensity score, common
support, and so on.
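A sketch of the idea, with the score-estimation step collapsed to matching on a single covariate for brevity (a simplifying assumption of mine, not the full pipeline); the point is that each bootstrap replicate re-runs the whole estimator, not just the final comparison:

```python
import numpy as np

rng = np.random.default_rng(3)

def psm_tot(x, T, y):
    """One full pass of the estimator: nearest-neighbour matching on x,
    standing in for re-deriving the propensity score on each sample."""
    tr, co = np.where(T)[0], np.where(~T)[0]
    m = co[np.abs(x[co][None, :] - x[tr][:, None]).argmin(axis=1)]
    return (y[tr] - y[m]).mean()

n = 1500
x = rng.normal(size=n)
T = rng.uniform(size=n) < 1 / (1 + np.exp(-x))
y = x + 2.0 * T + rng.normal(size=n)     # true effect = 2.0

# Bootstrap: resample units with replacement and redo ALL estimation
# steps inside each replicate, so the SE reflects the whole procedure.
reps = []
for _ in range(200):
    idx = rng.integers(0, n, n)
    reps.append(psm_tot(x[idx], T[idx], y[idx]))
se = np.std(reps, ddof=1)

point = psm_tot(x, T, y)
print(point, se)
```

Reporting the naive standard error of the final matched comparison would understate uncertainty, because it ignores the estimation error carried in from the earlier steps.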
Example 1 – PSM
 An NGO has built clinics in several villages.
 Villages were not selected randomly.
 We have data on village characteristics before the project was implemented.
What is the effect of the project on infant mortality?
What is the effect of the project on infant mortality?

T        imrate
treated    10
treated    15
treated    22
treated    19
control    25
control    19
control     4
control     8
control     6

The easiest and most straightforward answer is to compare average mortality rates in the two groups:
(10 + 15 + 22 + 19)/4 − (25 + 19 + 4 + 8 + 6)/5 = 4.1
What does this mean? Does it mean that clinics have increased infant mortality rates? NO!
The pre-project characteristics of the two groups are very important for answering this question.
T        imrate  povrate  pcdocs
treated    10      0.5      0.01
treated    15      0.6      0.02
treated    22      0.7      0.01
treated    19      0.6      0.02
control    25      0.6      0.01
control    19      0.5      0.02
control     4      0.1      0.04
control     8      0.3      0.05
control     6      0.2      0.04

How similar are the treated and control groups?
On average, the treated group has a higher poverty rate and fewer doctors per capita.
 The Basic Idea
1. Create a new control group
 For each observation in the treatment group, select the control observation that looks most like it based on the selection variables (a.k.a. background characteristics)
2. Compute the treatment effect
 Compare the average outcome in the treated group with the average outcome in the new control group
S. No    T        imrate  povrate  pcdocs   Match using povrate   Match using pcdocs
1        treated    10      0.5      0.01
2        treated    15      0.6      0.02
3        treated    22      0.7      0.01
4        treated    19      0.6      0.02
5        control    25      0.6      0.01
6        control    19      0.5      0.02
7        control     4      0.1      0.04
8        control     8      0.3      0.05
9        control     6      0.2      0.04

• Take povrate and pcdocs one at a time to match the treated group with the control one (filling in the two “Match using” columns)
• Then take the two at a time. What do you observe?
Predicting Selection
How do we actually match treatment observations to control observations?
In Stata, we use logistic or probit regression to predict:
Prob(T = 1 | X1, X2, …, Xk)
In our example, the X variables are povrate and pcdocs.
So, we run a logistic regression and save the predicted probability of treatment. We call this the propensity score.
The commands are:
 logistic T povrate pcdocs
 predict ps1 (or any name you want the propensity score to have)
Predicted probability of treatment (propensity score):

S. No    T        imrate  povrate  pcdocs   ps1
1        treated    10      0.5      0.01    0.4165713
2        treated    15      0.6      0.02    0.7358171
3        treated    22      0.7      0.01    0.9284516
4        treated    19      0.6      0.02    0.7358171
5        control    25      0.6      0.01    0.752714
6        control    19      0.5      0.02    0.395162
7        control     4      0.1      0.04    0.0016534
8        control     8      0.3      0.05    0.026803
9        control     6      0.2      0.04    0.0070107

Exercise: Use the propensity score to match the treated group with the control one.
Find the average treatment effect on the treated: ((10+15+22+19)/4) − ((19+25+25+25)/4) = −7
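The exercise can be checked end to end. The sketch below refits the logit by Newton-Raphson rather than calling Stata, on the assumption that this reproduces the ps1 column; the nearest-neighbour matching and ATT computation then follow the exercise:

```python
import numpy as np

# The 9 villages from the table above: T, imrate, povrate, pcdocs.
T = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0], dtype=float)
imrate = np.array([10, 15, 22, 19, 25, 19, 4, 8, 6], dtype=float)
povrate = np.array([0.5, 0.6, 0.7, 0.6, 0.6, 0.5, 0.1, 0.3, 0.2])
pcdocs = np.array([0.01, 0.02, 0.01, 0.02, 0.01, 0.02, 0.04, 0.05, 0.04])

# Logit maximum likelihood via Newton-Raphson (what `logistic` maximises).
X = np.column_stack([np.ones(9), povrate, pcdocs])
b = np.zeros(3)
for _ in range(100):
    p = 1 / (1 + np.exp(-np.clip(X @ b, -30, 30)))
    b += np.linalg.solve(X.T @ ((p * (1 - p))[:, None] * X), X.T @ (T - p))
ps = 1 / (1 + np.exp(-np.clip(X @ b, -30, 30)))
print(ps.round(4))      # should line up with the ps1 column above

# 1-NN matching with replacement on the score, then the ATT.
tr, co = np.where(T == 1)[0], np.where(T == 0)[0]
m = co[np.abs(ps[co][None, :] - ps[tr][:, None]).argmin(axis=1)]
att = imrate[tr].mean() - imrate[m].mean()
print(att)
```

Each treated village is matched to the control with the nearest score (villages 2, 3, and 4 all match control village 5), giving the −7 from the exercise.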
 How do we know how well matching worked?
1. Look at covariate balance between the treated and the new control groups. They should be similar.
2. Compare distributions of propensity scores in the treated and new control groups. They should be similar.
3. Compare distributions of the propensity scores in the treated and original control groups. If the two do not overlap very much, then matching might not work very well.
Example 2 - use PSMExample.dta

• Command 1: psmatch2
• psmatch2 dfmfd sexhead agehead educhead lnland vaccess pcirr rice
wheat oil egg, out(lexptot) common
• psgraph
• pstest
• psmatch2 dfmfd, out(lexptot) pscore(myscore) kernel k(normal) bw(0.01)
• psmatch2 dfmfd, out(lexptot) pscore(myscore) neighbor(2)
• psmatch2 dfmfd, out(lexptot) pscore(myscore) caliper(0.01)
• bs "psmatch2 dfmfd sexhead agehead educhead lnland vaccess pcirr rice
wheat oil egg, out(lexptot)" "r(att)"
• Command 2: pscore
• pscore dfmfd sexhead agehead educhead lnland vaccess pcirr rice wheat
oil egg, pscore(myscore) blockid(myblock) comsup
• psgraph, treated(dfmfd) pscore(myscore) bin(50)
• attnd lexptot dfmfd, pscore(myscore) comsup
• atts lexptot dfmfd, pscore(myscore) blockid(myblock) comsup
• attr lexptot dfmfd, pscore(myscore) radius(0.001) comsup
• attk lexptot dfmfd, pscore(myscore) comsup bootstrap reps(50)
Summarize: how to do PSM
Final comments on PSM and OLS
4.6. Difference-in-differences: Basic set-up

• 2 groups:
• Program group (“with program”)
• Comparison group (“without program”)
• 2 points in time:
• Baseline survey
• Follow-up survey
• Recommended: Follow-up survey is longitudinal at the
individual, household or locality level
Difference-in-Differences

[Figure: outcome over time, from baseline to follow-up. The program group moves from outcome A at baseline to B at follow-up; the comparison group moves from C to D.]

Impact = (B − A) − (D − C)

Key condition: the “parallel trends assumption.” The program group would have had the same change as the comparison group in the absence of the program.
Limitations:
- This is a strong assumption; the “true change” for the program group could have been different, in which case diff-in-diff gives an incorrect estimate (in the figure, it under-estimates program impact).
- It requires a “short” time interval, but a short interval also reduces the magnitude of the impact to be estimated.
Key issue: selection of the comparison group.
Question: What is the best way to select the program and comparison groups, so that the two groups will behave similarly and would have had the same change in the absence of the program?
Difference-in-Differences: testing the “parallel trends assumption”

One way requires pre-baseline data: track both groups from a pre-baseline survey to the baseline (program group E → A, comparison group F → C). In this example, the “parallel trends assumption” holds if (A − E) = (C − F).
Problems:
- Pre-baseline data are rarely available.
- Past behavior is only an indication of future behavior.
Difference-in-Differences: not good if the groups have different true trends. In that case diff-in-diff provides an incorrect estimate of program impact; in the figures, it underestimates the true impact.
Difference-in-Differences: Extensions (3 points in time)

[Figure: Outcome at Baseline, Follow-up 1, and Follow-up 2. The program group moves from A to B and then to G; the comparison group reaches D at Follow-up 1 and H at Follow-up 2. Impact 1 is the diff-in-diff estimate at Follow-up 1; Impact 2 is the estimate at Follow-up 2.]

Key condition: The “Parallel trends assumption” holds for each time period.
DID – More

• DID compares the change in outcomes for a group of individuals
before and after implementation of a program with the corresponding
change for non-participants, and takes the difference of these two
differences as the estimate of the treatment effect.
• It is widely used because, when the data are available, it is
effective in controlling for unobserved variables and trends that may
affect outcomes.
• DID addresses the selection bias in estimating the average impact
of an intervention by using the change observed for the comparison
group as an approximation of the counterfactual:

DD = E(Y_1^T - Y_0^T | T_1 = 1) - E(Y_1^C - Y_0^C | T_1 = 0)    (4.22)
Cont’d
• Randomized evaluation and PSM focus on single-difference
estimators that often require only an appropriate cross-sectional
survey.
• The double-difference estimation technique typically uses panel
data.
• Note, however, that DD can also be used on repeated cross-section
data, as long as the composition of the participant and control
groups is fairly stable over time.
• In a panel setting, DD estimation resolves the problem of missing
data by measuring outcomes and covariates for both participants and
nonparticipants in pre- and post-intervention periods.
• DD essentially compares treatment and comparison groups in terms
of outcome changes over time relative to the outcomes observed for a
pre-intervention baseline.
Cont’d
• Difference-in-Differences or Double-Difference (DD) methods,
compared with propensity score matching (PSM), allow for unobserved
heterogeneity in participation, provided that such factors are time
invariant.
• Because DD assumes this unobserved heterogeneity is time
invariant, the bias cancels out through differencing.
• Some variants of the DD approach have been introduced to account
for potential sources of selection bias. Combining PSM with DD
methods can help address this problem by matching units in the
region of common support.
Cont’d
• The DD estimate can also be calculated within a regression
framework; the regression can be weighted to account for potential
biases in DD.
• In particular, the estimating equation would be specified as
follows:

Y_it = α + β(T_i1 × t) + ρT_i1 + γt + ε_it    (4.23)

• In equation 4.23, the coefficient β on the interaction between the
post-program treatment variable (T_i1) and time (t = 1, …, T) gives
the average DD effect of the program.
• Thus, using the notation from equation 4.22, β = DD.
Cont’d
• In addition to this interaction term, the variables T_i1 and t are
included separately to pick up any separate mean effects of time as
well as the effect of being targeted versus not being targeted.
• Again, as long as data on four different groups are available to
compare, panel data are not necessary to implement the DD approach
(for example, the t subscript, normally associated with time, can be
reinterpreted as a particular geographic area, k = 1, …, K).
Cont’d
• To better understand the intuition behind equation 4.23, one can
write it out in expectations form (suppressing the subscript i for
the moment):

E(Y_1 - Y_0 | T_1 = 1) = β + γ    (4.24a)

E(Y_1 - Y_0 | T_1 = 0) = γ    (4.24b)

• Following equation 4.22, subtracting equation 4.24b from equation
4.24a gives DD = β.
• Note again that DD is unbiased only if the potential source of
selection bias is additive and time invariant.
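Equations 4.24a and 4.24b can be checked with a small simulation: generate outcomes from the model in equation 4.23 and difference the group-mean changes. The parameter values and sample size below are arbitrary, chosen only for illustration.

```python
import random

random.seed(0)
alpha, beta, gamma, rho = 2.0, 5.0, 1.0, 3.0  # hypothetical true parameters
n = 10_000                                    # draws per group

def outcome(T, t):
    # Y_it = alpha + beta*(T*t) + rho*T + gamma*t + noise
    return alpha + beta * T * t + rho * T + gamma * t + random.gauss(0, 1)

# Mean change Y_1 - Y_0 within each group
change_treated = sum(outcome(1, 1) - outcome(1, 0) for _ in range(n)) / n
change_control = sum(outcome(0, 1) - outcome(0, 0) for _ in range(n)) / n

dd = change_treated - change_control
# change_treated is close to beta + gamma = 6 (eq. 4.24a),
# change_control is close to gamma = 1 (eq. 4.24b),
# so dd recovers beta = 5, the DD effect.
```

Note that the unit effect ρT cancels in the within-group differences, which is exactly why time-invariant selection bias drops out of DD.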
Example – use Panel101.dta
Commands (Stata):
• gen time = (year >= 1994) & !missing(year)      // post-period dummy
• gen treated = (country > 4) & !missing(country) // treatment-group dummy
• gen did = time*treated                          // interaction term
• reg y time treated did, r  // coefficient on did is the DD estimate (robust SEs)
Other Evaluation Approaches
• Panel Fixed-Effects Model
• Instrumental variable (IV) methods
• Regression discontinuity (RD) design
Panel Fixed-Effects Model

• The preceding two-period model can be generalized to multiple time
periods; this is often called the panel fixed-effects model.
• This generalization is particularly important for a model that
controls not only for unobserved time-invariant heterogeneity but
also for heterogeneity in observed characteristics over a
multiple-period setting.
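The within transformation behind fixed-effects estimation can be sketched as follows; the data-generating process (unit effects correlated with the regressor, true slope of 2) is entirely made up for illustration:

```python
# Build a small panel: unit fixed effect a_i is correlated with x,
# so pooled OLS is biased, while the within (demeaned) estimator is not.
data = []  # rows of (unit, x, y)
for i in range(1, 51):
    a_i = float(i)                 # unit fixed effect
    for t in range(3):
        x = a_i + t                # regressor correlated with a_i
        y = a_i + 2.0 * x          # true slope is 2 (no noise, for clarity)
        data.append((i, x, y))

def ols_slope(pairs):
    """Bivariate OLS slope: cov(x, y) / var(x)."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx

pooled = ols_slope([(x, y) for _, x, y in data])  # biased (close to 3 here)

# Within transformation: demean x and y inside each unit, then run OLS.
demeaned = []
for i in range(1, 51):
    rows = [(x, y) for u, x, y in data if u == i]
    mx = sum(x for x, _ in rows) / len(rows)
    my = sum(y for _, y in rows) / len(rows)
    demeaned += [(x - mx, y - my) for x, y in rows]

fe = ols_slope(demeaned)  # recovers the true slope 2.0
```

Demeaning within each unit sweeps out a_i, just as differencing does in the two-period DD case.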
Instrumental Variables (IVs)

• Instrumental variable (IV) methods allow for endogeneity in
individual participation, program placement, or both. With panel
data, IV methods can allow for time-varying selection bias.
• Measurement error that results in attenuation bias can also be
resolved through this procedure.
• The IV approach involves finding a variable (an instrument) that
is highly correlated with program placement or participation but not
correlated with unobserved characteristics affecting outcomes.
• Instruments can be constructed from program design (for example,
if the program of interest was randomized or if exogenous rules were
used in determining eligibility for the program).
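A minimal IV sketch with simulated data (all values hypothetical): participation x is driven partly by an unobserved confounder u that also affects y, so OLS is biased; the instrument z shifts x but is independent of u, and the ratio cov(z, y)/cov(z, x) recovers the true effect.

```python
import random

random.seed(1)
n, true_effect = 20_000, 2.0

z = [random.gauss(0, 1) for _ in range(n)]           # instrument (exogenous)
u = [random.gauss(0, 1) for _ in range(n)]           # unobserved confounder
x = [zi + ui for zi, ui in zip(z, u)]                # participation (endogenous)
y = [true_effect * xi + ui for xi, ui in zip(x, u)]  # outcome

def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

ols = cov(x, y) / cov(x, x)  # biased upward: x and u are correlated
iv = cov(z, y) / cov(z, x)   # IV estimate, close to true_effect = 2.0
```

In this simulation OLS is pulled toward 2.5 by the confounder; with a binary instrument this ratio is the Wald estimator, and with covariates it generalizes to two-stage least squares.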
Regression Discontinuity (RD) Design

• In a nonexperimental setting, program eligibility rules can
sometimes be used as instruments for exogenously identifying
participants and nonparticipants.
• To establish comparability, one can use participants and
nonparticipants within a certain neighborhood of the eligibility
threshold as the relevant sample for estimating the treatment
impact.
• Known as regression discontinuity (RD), this method allows
observed as well as unobserved heterogeneity to be accounted for.
• Although the cutoff or eligibility threshold can be defined
nonparametrically, in practice it has traditionally been defined
through an instrument.
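A minimal RD sketch under made-up assumptions: eligibility is determined by a score cutoff, and the treatment effect is estimated by comparing mean outcomes within a small bandwidth on either side of the threshold. (A simple difference in means carries a small bias from the outcome's slope in the score; local linear regression, commonly used in practice, reduces it.)

```python
import random

random.seed(2)
cutoff, bandwidth, true_effect = 50.0, 2.0, 10.0  # hypothetical values

scores = [random.uniform(0, 100) for _ in range(50_000)]

def outcome(s):
    treated = s >= cutoff  # deterministic eligibility rule at the cutoff
    return 0.2 * s + true_effect * treated + random.gauss(0, 1)

ys = [outcome(s) for s in scores]

# Compare units just above vs. just below the threshold.
above = [y for s, y in zip(scores, ys) if cutoff <= s < cutoff + bandwidth]
below = [y for s, y in zip(scores, ys) if cutoff - bandwidth <= s < cutoff]

rd_estimate = sum(above) / len(above) - sum(below) / len(below)
# Close to true_effect = 10, plus a small bias of about slope*bandwidth = 0.4
```

Shrinking the bandwidth reduces this slope-driven bias but leaves fewer observations near the cutoff, the usual bias-variance trade-off in RD.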
