This tutorial is based on the data set matchingdata.dta. Before using the data set expand the
memory with the command
To solve the exercise you will need the Stata package psmatch2, which you can install
as follows:
Type net search psmatch2 in Stata
Click on: psmatch2 from http://fmwww.bc.edu/RePEc/bocode/p
Click on: ‘click here to install’
Restart Stata
For a detailed description of and help for psmatch2, type help psmatch2
In this exercise we again use the (experimental) data set for the evaluation of labor market
reforms in the US, in slightly modified form. The matching approaches discussed in the
course are applied to this already familiar data set, which is described again below.
As a reminder: Matching is about combining (“matching”) a group of treatment participants
with a group of non-participants who have the same characteristics. The control group is
then used to estimate the unobservable (counterfactual) outcome. There are different
approaches to finding good “matches”. These are presented and applied in this exercise.
Alexander Spermann, University of Freiburg, Summer Term 2009
Problems:
1a) In the following exercises you will estimate the “Average Treatment Effect on the
Treated” (ATT) by “Propensity Score Matching”. Because propensity score matching
implies a comparison with another sample, the individuals of the experimental sample
used so far (sample==1) are complemented by the individuals of two other
samples (sample==2 and sample==3).
Replace the non-participants of the experimental sample by all
individuals from the two other samples.
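One way to build the new sample is sketched below. It rests on two assumptions that are not confirmed by the text: that treated equals 1 for participants and 0 for non-participants in sample 1, and that nobody in samples 2 and 3 took part in the program. The few rows entered by hand are made up and only stand in for matchingdata.dta.

```stata
* Toy stand-in for matchingdata.dta (made-up rows, for illustration only)
clear
input sample treated
1 1
1 1
1 0
2 .
3 0
end

* Drop the non-participants of the experimental sample ...
drop if sample == 1 & treated == 0
* ... and use everyone from samples 2 and 3 as non-participants.
* (Assumption: they may not yet be coded treated == 0.)
replace treated = 0 if inlist(sample, 2, 3)

* Check the resulting composition of the overall sample
tabulate sample treated
```

With the real data, only the drop and replace lines would be needed; the replace line is harmless if treated is already 0 in samples 2 and 3.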
1b) Generate the following variables that will be used in the exercises. They are
polynomial terms, zero-earnings indicators and interaction terms. Their use and
meaning are the same as in the exercise on selection problems:
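* polynomial terms in age and education
gen age2=age*age
gen age3=age2*age
gen educ2=educ*educ
* squared pre-treatment earnings
gen re74_2=re74*re74
gen re75_2=re75*re75
* indicators for zero earnings in 1974 and 1975
gen zero_earn_74=(re74==0)
gen zero_earn_75=(re75==0)
* interaction terms
gen int_educ_re74=educ*re74
gen int_zero74_hisp=zero_earn_74*hisp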
The following variable is a difference that will be used later on to combine
“Propensity Score Matching” with the “Difference in Difference Estimator”:
gen d_earn=re78-re75
1c) First run a probit estimation to find out how the variables age, age2, age3, educ,
educ2, black, hisp, married, nodegree, re74, re75, re74_2, re75_2,
zero_earn_75, int_educ_re74 and int_zero74_hisp influence the participation
probability (treated=1 or treated=0) in the new overall sample.
Then predict the “propensity score”.
probit treated age age2 age3 educ educ2 black hisp married ///
    nodegree re74 re75 re74_2 re75_2 zero_earn_75 int_educ_re74 ///
    int_zero74_hisp
Iteration 0: log likelihood = -1037.6992
Iteration 1: log likelihood = -630.39842
Iteration 2: log likelihood = -522.77856
Iteration 3: log likelihood = -481.07983
Iteration 4: log likelihood = -465.67343
Iteration 5: log likelihood = -462.91251
Iteration 6: log likelihood = -462.75867
Iteration 7: log likelihood = -462.75712
Iteration 8: log likelihood = -462.75712
------------------------------------------------------------------------------
treated | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .9113437 .164721 5.53 0.000 .5884965 1.234191
age2 | -.0240836 .0052904 -4.55 0.000 -.0344525 -.0137147
age3 | .0001904 .0000538 3.54 0.000 .000085 .0002958
educ | .4455625 .1163902 3.83 0.000 .2174419 .6736832
educ2 | -.0264906 .0062148 -4.26 0.000 -.0386715 -.0143098
black | 1.655884 .1095013 15.12 0.000 1.441265 1.870503
hisp | .7390261 .208835 3.54 0.000 .3297171 1.148335
married | -.7758958 .110946 -6.99 0.000 -.9933459 -.5584456
nodegree | .3827289 .1477954 2.59 0.010 .0930552 .6724027
re74 | -.0001589 .000047 -3.38 0.001 -.000251 -.0000669
re75 | -.000075 .0000182 -4.11 0.000 -.0001107 -.0000392
re74_2 | 8.57e-10 3.71e-10 2.31 0.021 1.30e-10 1.58e-09
re75_2 | 1.63e-10 3.23e-10 0.50 0.615 -4.71e-10 7.96e-10
zero_earn_75 | .3240265 .1167357 2.78 0.006 .0952287 .5528244
int_educ_~74 | 9.05e-06 4.28e-06 2.12 0.034 6.66e-07 .0000174
int_zero74~p | -.0019342 .2992769 -0.01 0.995 -.5885061 .5846376
_cons | -14.23208 1.756289 -8.10 0.000 -17.67434 -10.78981
------------------------------------------------------------------------------
The estimated coefficients cannot be interpreted directly! They are not the marginal
effects of the explanatory variables on the participation probability. Those would have
to be calculated separately.
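As a sketch of how such marginal effects can be obtained, the following lines use Stata's shipped auto data rather than the tutorial's data set, so the variables foreign, mpg and weight are purely illustrative. The margins command requires Stata 11 or later; in earlier releases mfx, run after probit, served the same purpose.

```stata
* Illustration on Stata's example data, not on matchingdata.dta
sysuse auto, clear
probit foreign mpg weight
* Average marginal effects of each regressor on Pr(foreign = 1)
margins, dydx(*)
```

Note that with hand-made polynomial and interaction terms such as age2 or int_educ_re74, margins treats every term as a separate variable, so the reported effect of age would not include the effect running through age2 and age3.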
predict double ps
double sets the storage type: the predicted propensity score is stored in double precision.
1d) Now estimate the ATT using “Propensity Score Matching”. Look at the effect on the
outcome variables re74, re75 and re78.
First interpret the results with respect to the outcome variable re78, i.e. the real
income in 1978.
Use the command psmatch2.
Hints:
The outcome variables are listed in parentheses after the option outcome.
We use two-step matching: instead of naming the exogenous variables
after treated in the psmatch2 command, we pass the “propensity score”
estimated by the probit. This is especially useful for models with many
variables, because it keeps the program concise.
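The command itself is not reproduced here. By analogy with the kernel call shown in part 1h), a call along the following lines would run the matching. It is a sketch that assumes psmatch2 is installed and that the tutorial data, including the predicted score ps, are in memory.

```stata
* Assumes treated, ps, re74, re75 and re78 are in memory
* and that psmatch2 has been installed.
* No matching option is given, so psmatch2 uses its default:
* a single nearest neighbour, matched with replacement.
psmatch2 treated, outcome(re78 re74 re75) pscore(ps)
```

Passing pscore(ps) is what makes this the two-step variant described in the hints: psmatch2 does not re-estimate the participation model but matches on the score supplied.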
------------------------------------------------------------------
Variable Sample | Treated Controls Difference
----------------------------+-------------------------------------
re78 Unmatched | 6349.1435 15750.3 -9401.15645
ATT | 6349.1435 5074.05777 1275.08574
----------------------------+-------------------------------------
re74 Unmatched | 2095.57369 14745.9287 -12650.355
ATT | 2095.57369 1895.30997 200.263719
----------------------------+-------------------------------------
re75 Unmatched | 1532.05531 14380.0105 -12847.9552
ATT | 1532.05531 1100.9613 431.094014
----------------------------+--------------------------------------
- Only nearest-neighbour matching is used here.
- The estimated effect is given by the post-treatment variable re78. For the individuals
of the treatment group, the treatment has raised real income by $1,275 on
average.
1e) Interpret the results of the matching with respect to the real income in 1975.
The result for the pre-treatment variable re75 is a so-called pre-program test. It
checks whether the matching balances the original level of income before
the treatment. The difference of $431.09 that remains after matching results from
unobserved factors. But at least it is much smaller than the difference of -$12,847.96
before matching. The interpretation of re74 is similar.
1f) Using the command pstest, check the success of the matching for the exogenous
variables age, educ, black, married, hisp, nodegree, re74 and re75.
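The call is not reproduced here. A minimal sketch, assuming it is run directly after psmatch2 with the tutorial data in memory, could look like this:

```stata
* Run after psmatch2, so that the matching weights it created are available.
* pstest is installed together with psmatch2.
pstest age educ black married hisp nodegree re74 re75
```

In more recent releases of the package, the option both may be needed to display the comparison before matching alongside the one after matching.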
This is a t-test of the hypothesis that the mean value of each variable is the same in
the treatment group and the non-treatment group. It is carried out before and after
matching. If p>0.1, the null hypothesis cannot be rejected at the 10% significance
level.
Furthermore, a bias before and after matching is calculated for each variable, and the
change in this bias is reported. This “bias” is defined as the difference between the mean
values of the treatment group and the (unmatched / matched) non-treatment group,
divided by the square root of the average sample variance in the treatment group
and the unmatched non-treatment group.
In the table one can see the difference in the values of the exogenous variables
between the two groups before matching. E.g., 84.3% of the treatment group are
black, but only 9.7% of the control group. These factors have a significant influence
on the treatment probability (see the probit in part 1c)).
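The definition of the bias can be illustrated with the two shares of black individuals quoted above. The sketch makes two assumptions that are not stated in the text: the bias is expressed in percent (factor 100), and the sample variance of a 0/1 variable is approximated by p(1-p). The value it displays is therefore only indicative and is not the pstest output.

```stata
* Shares of black individuals before matching (taken from the text)
* p_t: treatment group, p_c: non-treatment group
scalar p_t = 0.843
scalar p_c = 0.097
* Assumption: variance of a 0/1 variable approximated by p*(1-p)
scalar v_t = p_t*(1 - p_t)
scalar v_c = p_c*(1 - p_c)
* Difference in means over the square root of the average variance,
* in percent (the factor 100 is an assumption)
scalar bias = 100*(p_t - p_c)/sqrt((v_t + v_c)/2)
display bias
```

A value far above 100 means the two group means lie more than one pooled standard deviation apart, which is why black shows up as such a strong predictor of participation.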
Matching reduces the differences between the treatment group and the non-treatment
group considerably. An exception is the dummy hisp. For this variable the
difference between the two groups is not eliminated. However, the “bias” was already
rather small before matching.
The null hypothesis that the mean values of the two groups do not differ after
matching cannot be rejected for any variable.
1g) Check graphically whether the assumption of “common support” holds in the example.
If the assumption holds, the “propensity scores” of the participants and
non-participants must overlap.
Use the command psgraph.
psgraph, bin(10)
[Figure: output of psgraph, propensity score distribution with legend entries Treated and Untreated]
Due to the scale, it is difficult to discern in this graph that each class of the “propensity
score” also contains a certain number of non-treated individuals. So we can assume that
common support is given.
Additional problems:
1h) * Estimate the ATT as in part 1d), but use the kernel matching approach this time.
Briefly interpret the results with respect to the outcome variables re78, re75 and
re74, comparing them to the results of parts 1d) and 1e).
psmatch2 treated, kernel outcome(re78 re74 re75) pscore(ps)
Matching Method = kernel Metric = pscore
------------------------------------------------------------------
Variable Sample | Treated Controls Difference
----------------------------+-------------------------------------
re78 Unmatched | 6349.1435 15750.3 -9401.15645
ATT | 6349.1435 6433.4888 -84.3452968
----------------------------+-------------------------------------
re74 Unmatched | 2095.57369 14745.9287 -12650.355
ATT | 2095.57369 3934.69386 -1839.12017
----------------------------+-------------------------------------
re75 Unmatched | 1532.05531 14380.0105 -12847.9552
ATT | 1532.05531 3160.4338 -1628.37849
----------------------------+-------------------------------------
1i) * Determine the “Average Treatment Effect” (ATE) of the training program on the basis
of the “Propensity Score Matching” in part 1d). What does the result tell you with
respect to the outcome variable re78?
Hint: The ATE is calculated analogously to the above matching procedure,
complemented by the option ate in the Stata command.
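Following the hint, a sketch of such a call adds the option ate to a psmatch2 call on the estimated propensity score. It assumes psmatch2 is installed and that treated, ps and re78 are in memory.

```stata
* Nearest-neighbour matching on the estimated propensity score ps.
* The option ate additionally reports the ATU and the ATE
* next to the ATT.
psmatch2 treated, outcome(re78) pscore(ps) ate
```

Restricting outcome() to re78 is a choice made here for brevity; the other outcome variables could be listed as well.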
  psmatch2: |  psmatch2:
  Treatment |    Common
 assignment |   support
            | On support |     Total
------------+------------+----------
  Untreated |     18,482 |    18,482
    Treated |        185 |       185
------------+------------+----------
      Total |     18,667 |    18,667
The ATE, i.e. the average effect of the treatment for an individual drawn at random from
the overall population, is -$13,806.22. So the real income of a randomly drawn
person would be $13,806.22 lower because of participation in the labor market
program. This result arises because a negative effect is estimated for the non-participants
(ATU), who are much more numerous than the participants (see second table). The
ATE therefore does not have a direct interpretation for the evaluation of the program.
The ATU is estimated by matching a similar participant to each non-participant.
Because of the small number of participants, one would have to check whether balancing
is also achieved for this control group. Otherwise the ATU might be biased.
ATE, ATT and ATU are linked as follows:
ATE = (N1/N)*ATT + (N0/N)*ATU
where N1 is the number of participants, N0 is the number of non-participants and
N = N1 + N0. In the example we have (rounded):
ATE = 185/18,667*1,275 + 18,482/18,667*(-13,957) = -13,806
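The decomposition can be checked with a few scalars. The values are the rounded figures from the text, so the result agrees with the reported ATE only up to rounding; the scalar names are chosen here for illustration.

```stata
* Group sizes from the common support table
scalar N1  = 185
scalar N0  = 18482
* Rounded effect estimates from the text
scalar ATT = 1275.09
scalar ATU = -13957
* Weighted average of ATT and ATU with the group shares as weights
scalar ATE = N1/(N1 + N0)*ATT + N0/(N1 + N0)*ATU
display ATE
```

Because fewer than 1% of the individuals are participants, the ATE is almost identical to the ATU.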
2a) Discuss whether the conditions for combining the two methods (propensity score
matching and difference in difference) are met in this example.
Conditions:
- Panel data: We do not have real panel data, but at least for the real income we have
time series information (before and after the program).
- Time-constant and additive selection bias in the outcome equation: We assume that
unobserved factors have a constant influence on the outcome.
------------------------------------------------------------------
Variable Sample | Treated Controls Difference
----------------------------+-------------------------------------
d_earn Unmatched | 4817.08818 1370.2894 3446.79878
ATT | 4817.08818 3973.09646 843.991721
----------------------------+-------------------------------------
re78 Unmatched | 6349.1435 15750.3 -9401.15645
ATT | 6349.1435 5074.05777 1275.08574
----------------------------+-------------------------------------
re74 Unmatched | 2095.57369 14745.9287 -12650.355
ATT | 2095.57369 1895.30997 200.263719
----------------------------+-------------------------------------
re75 Unmatched | 1532.05531 14380.0105 -12847.9552
ATT | 1532.05531 1100.9613 431.094014
----------------------------+--------------------------------------
With this approach we find an ATT of $843.99. So the program raises the real income
of participants by $843.99. The ATT is calculated as:
ATT = (after - before)treated - (after - before)control = 4,817 - 3,973 = 844
Through the combination of matching and DiD we have eliminated the time constant
unobserved effects. This can be seen if you calculate the ATT as the difference
between ATTre78 ($1,275.09) and ATTre75 ($431.09). ATTre78 would be the ATT if
you did not account for the selection bias through time-constant unobserved effects;
ATTre75 is the selection bias through time-constant unobserved effects.
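This identity can be verified with the ATT values from the matching table above. The scalar names are chosen here for illustration.

```stata
* ATT estimates from the matching output (rows re78 and re75)
scalar att_re78 = 1275.08574
scalar att_re75 = 431.094014
* Difference in difference: post-treatment ATT minus pre-treatment ATT
scalar att_did = att_re78 - att_re75
display att_did
```

The result coincides, up to rounding, with the ATT reported for d_earn, since d_earn is defined as re78-re75 and the same matches are used for every outcome.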