Heckman Selection Models

Vartanian: SW 683

We're examining the age at which people first marry. A number of people have missing values
because they have never been married. We want to see whether never marrying is a non-random process and,
if so, correct for that non-randomness in our regression results.
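For reference, this setup is usually written as a two-equation model. The notation below ($z_i$, $x_i$, $\gamma$, $\beta$, $u_i$, $\varepsilon_i$) is standard textbook notation and is an assumption here; it does not come from these notes.

```latex
\begin{align*}
s_i &= \mathbf{1}\left[ z_i'\gamma + u_i > 0 \right]
  && \text{(selection: has the person ever married?)} \\
y_i &= x_i'\beta + \varepsilon_i \quad \text{observed only if } s_i = 1
  && \text{(outcome: age at first marriage)} \\
\begin{pmatrix} u_i \\ \varepsilon_i \end{pmatrix}
  &\sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},
  \begin{pmatrix} 1 & \rho\sigma \\ \rho\sigma & \sigma^2 \end{pmatrix} \right) \\
E[\, y_i \mid x_i, z_i, s_i = 1 \,] &= x_i'\beta + \rho\sigma\,\lambda(z_i'\gamma),
  \qquad \lambda(c) = \frac{\phi(c)}{\Phi(c)}
\end{align*}
```

If $\rho = 0$, the last term drops out and a regression on the observed cases is not distorted by selection. This is the hypothesis that the test of rho = 0 in the Heckman output examines.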
First, we'll check how many people in our sample are missing age at first marriage. I do this in Stata with the
following commands.
. generate nomarr=agemarr~=.
. tab nomarr
     nomarr |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,172       29.43       29.43
          1 |      2,811       70.57      100.00
------------+-----------------------------------
      Total |      3,983      100.00
Here, we see that of our 3,983 observations, 2,811 have been married and 1,172 have never been married and
therefore have no valid value for agemarr. (Despite its name, nomarr equals 1 for people who have married:
the expression agemarr~=. is true, and so coded 1, whenever agemarr is not missing.)
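The coding of the indicator can be checked outside Stata with a small sketch. The ages below are made up for illustration, and None stands in for Stata's missing value; only the logic of agemarr~=. is taken from the notes.

```python
# Hypothetical ages at first marriage; None marks people who never married.
agemarr = [24, None, 31, 19, None, 27]

# Mirrors the Stata expression agemarr~=. : 1 when a value is present, 0 when missing.
nomarr = [0 if age is None else 1 for age in agemarr]

# Frequency table, similar in spirit to `tab nomarr`.
counts = {value: nomarr.count(value) for value in (0, 1)}
for value, freq in counts.items():
    print(value, freq, f"{100 * freq / len(nomarr):.2f}%")
```

The value 1 lines up with the non-missing ages, which is why the larger group in the table is the married group.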
Next, we can run a probit analysis (which is what is used in the Heckman models instead of a logit analysis) to
see if any factors are affecting the likelihood of being married, and thus having an age at first marriage. (I’ve
chosen these variables somewhat randomly.)
. probit nomarr income male norelig bigcity kds
note: norelig != 0 predicts success perfectly
      norelig dropped and 5 obs not used

Iteration 0:   log likelihood = -2403.0562
Iteration 1:   log likelihood = -2363.3512
Iteration 2:   log likelihood = -2363.2058
Iteration 3:   log likelihood = -2363.2058

Probit estimates                                  Number of obs   =       3966
                                                  LR chi2(4)      =      79.70
                                                  Prob > chi2     =     0.0000
Log likelihood = -2363.2058                       Pseudo R2       =     0.0166

------------------------------------------------------------------------------
      nomarr |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   4.52e-06   6.33e-07     7.15   0.000     3.28e-06    5.76e-06
        male |   .0104852   .0424389     0.25   0.805    -.0726934    .0936639
     bigcity |  -.2310436   .0436178    -5.30   0.000     -.316533   -.1455543
         kds |   .0123415    .012217     1.01   0.312    -.0116033    .0362862
       _cons |   .3618608   .0644555     5.61   0.000     .2355303    .4881912
------------------------------------------------------------------------------
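Probit coefficients are on the scale of a normal index rather than a probability, so a worked conversion may help. The sketch below plugs the coefficients from the output above into the standard normal CDF. The person described (income of 50,000 in whatever units the data use, male = 0, kds = 0) is hypothetical, and the function names are my own.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Standard normal CDF, Phi(z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


# Probit coefficients copied from the output above.
coef = {"income": 4.52e-06, "male": 0.0104852, "bigcity": -0.2310436,
        "kds": 0.0123415, "_cons": 0.3618608}


def prob_married(income, male, bigcity, kds):
    """Predicted probability that nomarr = 1 (the person has married)."""
    index = (coef["_cons"] + coef["income"] * income + coef["male"] * male
             + coef["bigcity"] * bigcity + coef["kds"] * kds)
    return normal_cdf(index)


# Hypothetical person, outside and then inside a big city.
p_small = prob_married(50_000, 0, 0, 0)
p_big = prob_married(50_000, 0, 1, 0)
print(round(p_small, 3), round(p_big, 3))
```

The two probabilities differ only through the bigcity coefficient, so the gap between them illustrates the size of that negative effect for this particular person.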

From this, we can see that several factors affect the likelihood of being married, including income and living in big cities. Next, we'll run an OLS model, which may suffer from selection bias, and then run a Heckman selection model.

OLS model

. regress agemarr income male norelig bigcity kds

      Source |       SS       df       MS              Number of obs =    2804
-------------+------------------------------           F(  5,  2798) =   48.64
       Model |  3961.61695     5  792.323389           Prob > F      =  0.0000
    Residual |   45582.367  2798  16.2910533           R-squared     =  0.0800
-------------+------------------------------           Adj R-squared =  0.0783
       Total |   49543.984  2803  17.6753421           Root MSE      =  4.0362

[The coefficient table for this regression is not recoverable from the source.]

Heckman Selection Model:

Heckman selection model                         Number of obs      =      3971
(regression model with sample selection)        Censored obs       =      1167
                                                Uncensored obs     =      2804

Log likelihood = -10257.34

[The Wald chi2(4) value and the coefficient table are not recoverable from the source.]

LR test of indep. eqns. (rho = 0):   chi2(1) =     0.02   Prob > chi2 = 0.8841
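The correction that the Heckman model applies can be illustrated with the inverse Mills ratio, lambda(c) = phi(c) / Phi(c). In the standard formulation, the expected outcome among selected cases is shifted by rho * sigma * lambda(c), where c is the selection index. The sketch below is a minimal illustration. The values of rho and sigma are hypothetical, not estimates from these notes, and the function names are my own.

```python
from math import erf, exp, pi, sqrt


def normal_pdf(c):
    """Standard normal density, phi(c)."""
    return exp(-0.5 * c * c) / sqrt(2.0 * pi)


def normal_cdf(c):
    """Standard normal CDF, Phi(c)."""
    return 0.5 * (1.0 + erf(c / sqrt(2.0)))


def inverse_mills(c):
    """lambda(c) = phi(c) / Phi(c): mean of the selection error among selected cases."""
    return normal_pdf(c) / normal_cdf(c)


# Hypothetical values chosen only for illustration.
rho, sigma = -0.5, 4.0

for index in (-1.0, 0.0, 1.0):
    shift = rho * sigma * inverse_mills(index)
    print(index, round(inverse_mills(index), 3), round(shift, 3))
```

Cases with a low selection index (those unlikely to be observed) receive the largest shift. When rho is 0 the shift is 0 for everyone, and OLS and Heckman estimates coincide in expectation.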

If we look at this closely, we see that the selection variables appear to affect the likelihood of being censored out of the sample, but we fail to reject the hypothesis that rho = 0. In this case, there is no evidence that the OLS model is biased by selection. If we then examine the coefficient estimates for the two models, we'll see that they are very similar.

Example 2.

We're examining the wages of wives. There are over 800 wives with 0 wages. We would like to determine whether this is a random process, where some wives work and some do not, or a non-random process. First, our OLS model gives us the following information.

. regress wageswf income kids youngest

      Source |       SS       df       MS              Number of obs =    1726
-------------+------------------------------           F(  3,  1722) =  188.40
       Model |  34064.8457     3  11354.9486           Prob > F      =  0.0000
    Residual |  103787.983  1722  60.2717669           R-squared     =  0.2471
-------------+------------------------------           Adj R-squared =  0.2458
       Total |  137852.828  1725  79.9146831           Root MSE      =  7.7635

[The coefficient table for this regression is not recoverable from the source.]

We next run a Heckman model with a limited number of selection variables.

. heckman wageswf income kids youngest, select(youngest white)

Iteration 0:   log likelihood = -8501.7992
Iteration 1:   log likelihood = -7903.3723
Iteration 2:   log likelihood = -7793.5076
Iteration 3:   log likelihood = -7612.9477
Iteration 4:   log likelihood = -7565.4426
Iteration 5:   log likelihood = -7564.2854
Iteration 6:   log likelihood = -7564.2536
Iteration 7:   log likelihood = -7564.2529

Heckman selection model                         Number of obs      =      2560
(regression model with sample selection)        Censored obs       =       834
                                                Uncensored obs     =      1726

                                                Wald chi2(3)       =    565.28
Log likelihood = -7564.253                      Prob > chi2        =    0.0000

[The coefficient table for this model is not recoverable from the source.]
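Stata's heckman output includes rows labeled /athrho and /lnsigma alongside rho, sigma, and lambda. The first two are the transformed parameters that are actually estimated, and the others are derived from them. The sketch below shows the back-transformations. The two input values are hypothetical, chosen only for illustration, and are not estimates from these notes.

```python
from math import exp, tanh

# Hypothetical ancillary estimates, chosen only for illustration.
athrho, lnsigma = -0.30, 2.05

rho = tanh(athrho)     # back-transform keeps rho strictly between -1 and 1
sigma = exp(lnsigma)   # back-transform keeps sigma positive
lam = rho * sigma      # the lambda row is the product of rho and sigma

print(round(rho, 4), round(sigma, 4), round(lam, 4))
```

Because lambda is the product of rho and sigma, a rho near zero implies a lambda near zero, which is the case in which the selection correction makes little difference.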

LR test of indep. eqns. (rho = 0):   chi2(1) =     0.00   Prob > chi2 = 0.9795

From this, we find that the OLS coefficients are not statistically different from the Heckman-corrected coefficients. Let's use a few more selection variables to see what happens.

. heckman wageswf income kids youngest, select(youngest white income kids)

Iteration 0:   log likelihood = -9533.1633
Iteration 1:   log likelihood = -8821.2528
Iteration 2:   log likelihood =  -8276.376
Iteration 3:   log likelihood =  -7667.963
Iteration 4:   log likelihood = -7528.6591
Iteration 5:   log likelihood = -7510.5426
Iteration 6:   log likelihood = -7509.8196
Iteration 7:   log likelihood = -7509.8163
Iteration 8:   log likelihood = -7509.8163

Heckman selection model                         Number of obs      =      2560
(regression model with sample selection)        Censored obs       =       834
                                                Uncensored obs     =      1726

Log likelihood = -7509.816

[Three iterations were flagged (not concave); the positions of those flags, the Wald chi2(3) value, and the coefficient table are not recoverable from the source.]

LR test of indep. eqns. (rho = 0):   chi2(1) =    10.43   Prob > chi2 = 0.0012

With these new selection variables, we reject the hypothesis that rho = 0 (chi2(1) = 10.43, p = 0.0012). We therefore find that our OLS models were biased, and we are right to correct these coefficient estimates with the Heckman model. We can see what happens to the effect of youngest, for example.
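The p-value attached to the LR test of rho = 0 can be checked by hand, since a chi-squared statistic with 1 degree of freedom is the square of a standard normal. The sketch below recomputes the p-value for the statistic reported for the last model, chi2(1) = 10.43. The function name is my own. Because the printed statistic is rounded to two decimals, the result matches the reported 0.0012 only approximately.

```python
from math import erfc, sqrt


def chi2_1df_pvalue(stat):
    """Upper-tail p-value for a chi-squared statistic with 1 degree of freedom."""
    return erfc(sqrt(stat / 2.0))


# Statistic reported for the model with the larger set of selection variables.
p_more_vars = chi2_1df_pvalue(10.43)
print(round(p_more_vars, 4))
```

The same function shows why the earlier model led to the opposite conclusion: a statistic close to 0 gives a p-value close to 1, so rho = 0 cannot be rejected there.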