# Research Method

Lecture 7 (Ch14) Pooled Cross Sections and Simple Panel Data Methods


An independently pooled cross section
This type of data is obtained by sampling randomly from a population at different points in time (usually in different years). You can pool the data from different years and run regressions. However, you usually include year dummies.

Panel data
This is cross-section data collected at different points in time. Unlike a pooled cross section, however, this data follows the same individuals over time. You can do a bit more with panel data than with a pooled cross section. You usually include year dummies as well.


Pooling independent cross sections across time.
- As long as the data are collected independently, pooling these data over time causes little problem.
- However, the distribution of the independent variables may change over time. For example, the distribution of education changes over time.
- To account for such changes, you usually need to include dummy variables for each year (year dummies), except for one year, which serves as the base year.
- Often the coefficients for the year dummies are of interest.
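The role of the year dummies can be sketched in Stata with a tiny made-up data set (the numbers below are hypothetical and are not taken from FERTIL1.dta). The factor-variable prefix `ib72.` is one way to create a dummy for every year except the base year 1972; with no other regressors, each year coefficient is simply that year's mean minus the base-year mean.

```stata
* Toy pooled cross section: hypothetical values, for illustration only
clear
input year kids
72 3
72 5
74 2
74 4
76 1
76 2
end

* ib72.year adds a dummy for each year except the base year (1972)
regress kids ib72.year

* With no other regressors, each year coefficient equals
* (mean of kids in that year) - (mean of kids in 1972)
display "1974 vs 1972: " _b[74.year]
display "1976 vs 1972: " _b[76.year]
```

In a real application you would add the control variables to the same command, so that the year coefficients measure changes over time after controlling for them.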

Example 1
Consider that you would like to see the changes in the fertility rate over time after controlling for various characteristics. (Data: FERTIL1.dta) The data is collected every other year. The base year for the year dummies is 1972. The next slide shows the OLS estimates of the determinants of fertility over time.

Dependent variable = # kids per woman

    . reg kids educ age agesq black east northcen west farm othrural town smcity y74 y76 y80 y82 y84

Number of obs = 1129, F(16, 1112) = 10.33, Prob > F = 0.0000, R-squared = 0.1294, Adj R-squared = 0.1169, Root MSE = 1.5542

| Source | SS | df | MS |
|---|---|---|---|
| Model | 399.265559 | 16 | 24.9540975 |
| Residual | 2686.24374 | 1112 | 2.41568682 |
| Total | 3085.5093 | 1128 | 2.73538059 |

| kids | Coef. | Std. Err. | t | P>\|t\| | [95% Conf. | Interval] |
|---|---|---|---|---|---|---|
| educ | -.1287556 | .0183209 | -7.03 | 0.000 | -.164703 | -.0928081 |
| age | .535383 | .1380659 | 3.88 | 0.000 | .264484 | .8062821 |
| agesq | -.0058384 | .001561 | -3.74 | 0.000 | -.0089013 | -.0027756 |
| black | 1.077747 | .1733806 | 6.22 | 0.000 | .7375571 | 1.417937 |
| east | .2180929 | .1327211 | 1.64 | 0.101 | -.042319 | .4785049 |
| northcen | .3616071 | .1207846 | 2.99 | 0.003 | .1246157 | .5985984 |
| west | .1989796 | .1668093 | 1.19 | 0.233 | -.1283168 | .5262761 |
| farm | -.0553556 | .146947 | -0.38 | 0.706 | -.3436803 | .2329692 |
| othrural | -.1662171 | .1751486 | -0.95 | 0.343 | -.5098761 | .177442 |
| town | .0825938 | .124396 | 0.66 | 0.507 | -.1614836 | .3266712 |
| smcity | .2092197 | .1600797 | 1.31 | 0.191 | -.1048727 | .5233121 |
| y74 | .301226 | .1488953 | 2.02 | 0.043 | .0090786 | .5933735 |
| y76 | -.0639849 | .1556646 | -0.41 | 0.681 | -.3694143 | .2414445 |
| y80 | -.037886 | .1598956 | -0.24 | 0.813 | -.3516171 | .2758452 |
| y82 | -.4892665 | .1482989 | -3.30 | 0.001 | -.7802437 | -.1982893 |
| y84 | -.5112715 | .1496524 | -3.42 | 0.001 | -.8049044 | -.2176385 |
| _cons | -7.844731 | 3.038574 | -2.58 | 0.010 | -13.80672 | -1.882745 |

The year dummies show significant drops in the fertility rate over time. The number of children per woman in 1982 is 0.49 lower than in the base year. A similar result is found for 1984.

Example 2
- CPS78_85.dta has wage data collected in 1978 and 1985.
- We estimate the earnings equation, which includes education, experience, experience squared, a union dummy, a female dummy and the year dummy for 1985.
- Suppose that you want to see if the gender gap has changed over time. You include the interaction between female and 1985; that is, you estimate the following.

$$\log(wage)=\beta_0+\beta_1(educ)+\beta_2(exper)+\beta_3(expersq)+\beta_4(union)+\beta_5(female)+\beta_6(year85)+\beta_7(year85)(female)$$

- You can check if the gender wage gap in 1985 is different from the base year (1978) by checking whether $\beta_7$ is equal to zero or not.
- The gender gap in each period is given by:
  - gender gap in the base year (1978) = $\beta_5$
  - gender gap in 1985 = $\beta_5+\beta_7$

    . reg lwage educ exper expersq union female y85 y85fem

Number of obs = 1084, F(7, 1076) = 113.20, Prob > F = 0.0000, R-squared = 0.4241, Adj R-squared = 0.4204, Root MSE = .41326

| Source | SS | df | MS |
|---|---|---|---|
| Model | 135.328704 | 7 | 19.332672 |
| Residual | 183.762464 | 1076 | .170782959 |
| Total | 319.091167 | 1083 | .29463635 |

| lwage | Coef. | Std. Err. | t | P>\|t\| | [95% Conf. | Interval] |
|---|---|---|---|---|---|---|
| educ | .0833217 | .0050646 | 16.45 | 0.000 | .0733841 | .0932594 |
| exper | .0294761 | .0035717 | 8.25 | 0.000 | .0224679 | .0364844 |
| expersq | -.0003975 | .0000776 | -5.12 | 0.000 | -.0005498 | -.0002451 |
| union | .205237 | .0302943 | 6.77 | 0.000 | .1457945 | .2646795 |
| female | -.3195333 | .0366427 | -8.72 | 0.000 | -.3914324 | -.2476341 |
| y85 | .3530916 | .0333324 | 10.59 | 0.000 | .2876877 | .4184954 |
| y85fem | .0884046 | .0513498 | 1.72 | 0.085 | -.0123524 | .1891616 |
| _cons | .3522088 | .0763137 | 4.62 | 0.000 | .2024683 | .5019493 |

The coefficient for the interaction term (y85)(female) is positive and significant at the 10% significance level. So the gender gap appears to have narrowed over time.

gender gap in 1978 = -0.319
gender gap in 1985 = -0.319 + 0.088 = -0.231
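The two gaps can also be read off directly in Stata. The sketch below uses a small made-up data set (hypothetical values, not CPS78_85.dta) with only the female dummy, the year dummy and their interaction. `test` checks whether the interaction coefficient is zero, and `lincom` reports the 1985 gap, the sum of the female and interaction coefficients, together with its standard error.

```stata
* Toy wage data: hypothetical values, for illustration only
clear
input lwage female y85
2.0 0 0
2.2 0 0
1.7 1 0
1.9 1 0
2.4 0 1
2.6 0 1
2.2 1 1
2.4 1 1
end

* Interaction between the year dummy and the female dummy
generate y85fem = y85*female

regress lwage female y85 y85fem

* Has the gap changed? Test H0: coefficient on y85fem = 0
test y85fem

* Gap in 1978 is _b[female]; gap in 1985 is _b[female] + _b[y85fem]
lincom female + y85fem
```

With the full CPS78_85.dta model, the same `lincom female + y85fem` line after the regression would give the 1985 gap without adding the coefficients by hand.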

Policy analysis with pooled cross sections: The difference-in-difference estimator
I explain a typical policy analysis with pooled cross section data, called the difference-in-difference estimation, using an example.

Example: Effects of a garbage incinerator on housing prices
This example is based on studies of housing prices in North Andover, Massachusetts. The rumor that a garbage incinerator would be built in North Andover began after 1978. The construction of the incinerator began in 1981. You want to examine whether the incinerator affected housing prices.

Our hypothesis is the following.
Hypothesis: The price of houses located near the incinerator would fall relative to the price of more distant houses.
For illustration, define a house to be near the incinerator if it is within 3 miles. So create the following dummy variable:
nearinc = 1 if the house is 'near' the incinerator
nearinc = 0 otherwise

The most naïve analysis would be to run the following regression using only the 1981 data.

$$price=\beta_0+\beta_1(nearinc)+u$$

where price is the real price (i.e., deflated using the CPI to express it in constant 1978 dollars). Using KIELMC.dta, the result is the following.

    . reg rprice nearinc if year==1981

Number of obs = 142, F(1, 140) = 27.73, Prob > F = 0.0000, R-squared = 0.1653, Adj R-squared = 0.1594, Root MSE = 31238

| Source | SS | df | MS |
|---|---|---|---|
| Model | 2.7059e+10 | 1 | 2.7059e+10 |
| Residual | 1.3661e+11 | 140 | 975815048 |
| Total | 1.6367e+11 | 141 | 1.1608e+09 |

| rprice | Coef. | Std. Err. | t | P>\|t\| | [95% Conf. | Interval] |
|---|---|---|---|---|---|---|
| nearinc | -30688.27 | 5827.709 | -5.27 | 0.000 | -42209.97 | -19166.58 |
| _cons | 101307.5 | 3093.027 | 32.75 | 0.000 | 95192.43 | 107422.6 |

But can we say from this estimation that the incinerator has negatively affected the housing price?

To see this, estimate the same equation using the 1978 data. Note that this is before the rumor of the incinerator began.

    . reg rprice nearinc if year==1978

Number of obs = 179, F(1, 177) = 15.74, Prob > F = 0.0001, R-squared = 0.0817, Adj R-squared = 0.0765, Root MSE = 29432

| Source | SS | df | MS |
|---|---|---|---|
| Model | 1.3636e+10 | 1 | 1.3636e+10 |
| Residual | 1.5332e+11 | 177 | 866239953 |
| Total | 1.6696e+11 | 178 | 937979126 |

| rprice | Coef. | Std. Err. | t | P>\|t\| | [95% Conf. | Interval] |
|---|---|---|---|---|---|---|
| nearinc | -18824.37 | 4744.594 | -3.97 | 0.000 | -28187.62 | -9461.117 |
| _cons | 82517.23 | 2653.79 | 31.09 | 0.000 | 77280.09 | 87754.37 |

Note that the price of houses near the place where the incinerator was to be built is lower than the price of houses farther from the location. So the negative coefficient simply means that the garbage incinerator was built in a location where housing prices were already low.

Now, compare the two regressions.

| Regression | Coef. on nearinc | Std. Err. | t | P>\|t\| | [95% Conf. | Interval] | _cons | Number of obs |
|---|---|---|---|---|---|---|---|---|
| Year 1978 | -18824.37 | 4744.594 | -3.97 | 0.000 | -28187.62 | -9461.117 | 82517.23 | 179 |
| Year 1981 | -30688.27 | 5827.709 | -5.27 | 0.000 | -42209.97 | -19166.58 | 101307.5 | 142 |

Compared to 1978, the price penalty for houses near the incinerator is greater in 1981. Perhaps the increase in the price penalty in 1981 is caused by the incinerator. This is the basic idea of the difference-in-difference estimator.

The difference-in-difference estimator in this example may be computed as follows.

$$\hat{\delta}_1=(\text{coefficient for nearinc in 1981})-(\text{coefficient for nearinc in 1978})=-30688.27-(-18824.37)=-11863.9$$

So the incinerator has decreased house prices on average by \$11,863.90. I will show you a more general case later on.

Note that, in this example, the coefficient for (nearinc) in 1978 is equal to

(Average price of houses near the incinerator) - (Average price of houses not near the incinerator)

This is because the regression includes only one dummy variable (just recall Ex.1 of homework 2). The same holds for the 1981 regression. Therefore, the difference-in-difference estimator $\hat{\delta}_1$ in this example is written as

$$\hat{\delta}_1=\left(\overline{Price}_{1981,near}-\overline{Price}_{1981,far}\right)-\left(\overline{Price}_{1978,near}-\overline{Price}_{1978,far}\right)$$

This is the reason why the estimator is called the difference-in-difference estimator.

Difference-in-difference estimator: More general case.
The difference-in-difference estimator can be obtained by running the following single equation using the pooled sample.

$$price=\beta_0+\beta_1(nearinc)+\beta_2(year81)+\delta_1(year81)(nearinc)$$

The coefficient $\delta_1$ on the interaction term is the difference-in-difference estimator.

    . reg rprice nearinc y81 y81nrinc

Number of obs = 321, F(3, 317) = 22.25, Prob > F = 0.0000, R-squared = 0.1739, Adj R-squared = 0.1661, Root MSE = 30243

| Source | SS | df | MS |
|---|---|---|---|
| Model | 6.1055e+10 | 3 | 2.0352e+10 |
| Residual | 2.8994e+11 | 317 | 914632739 |
| Total | 3.5099e+11 | 320 | 1.0969e+09 |

| rprice | Coef. | Std. Err. | t | P>\|t\| | [95% Conf. | Interval] |
|---|---|---|---|---|---|---|
| nearinc | -18824.37 | 4875.322 | -3.86 | 0.000 | -28416.45 | -9232.293 |
| y81 | 18790.29 | 4050.065 | 4.64 | 0.000 | 10821.88 | 26758.69 |
| y81nrinc | -11863.9 | 7456.646 | -1.59 | 0.113 | -26534.67 | 2806.867 |
| _cons | 82517.23 | 2726.91 | 30.26 | 0.000 | 77152.1 | 87882.36 |

The coefficient on y81nrinc is the difference-in-difference estimator.

This form is more general since, in addition to the policy dummy (nearinc), you can include more variables that affect the housing price, such as the number of bedrooms. When you include more variables, $\hat{\delta}_1$ cannot be expressed in a simple difference-in-difference format. However, the interpretation does not change, and therefore it is still called the difference-in-difference estimator.
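The equivalence between the interaction coefficient and the difference of the two separate nearinc coefficients can be checked on a toy data set. The sketch below uses made-up prices (hypothetical values, not KIELMC.dta); the scalar names `gap78` and `gap81` are introduced here only for illustration.

```stata
* Toy housing data: hypothetical prices, for illustration only
clear
input price nearinc y81
80 0 0
84 0 0
62 1 0
66 1 0
100 0 1
104 0 1
70 1 1
74 1 1
end

generate y81nrinc = y81*nearinc

* Separate regressions by year: each coefficient is (near mean) - (far mean)
regress price nearinc if y81==0
scalar gap78 = _b[nearinc]
regress price nearinc if y81==1
scalar gap81 = _b[nearinc]

* Pooled regression: the interaction coefficient is the
* difference-in-difference estimate
regress price nearinc y81 y81nrinc

display "Difference of the two coefficients: " gap81 - gap78
display "Coefficient on y81nrinc:            " _b[y81nrinc]
```

The pooled form has the practical advantage that the standard error and t statistic of the difference-in-difference estimate come directly from the regression output.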

Natural experiment (or quasi-experiment)
- The difference-in-difference estimator is frequently used to evaluate the effect of governmental policy.
- Often a governmental policy affects one group of people, while it does not affect another group of people. This type of policy change is called a natural experiment.
- For example, the change in the spousal tax deduction system in Japan, which took place in 1995, affected married couples but did not affect single people.

Suppose. Suppose that you want to know how the change in spousal tax deduction has affected the hours worked by women. Those who are not affected by the policy is called the control group.The group of people who are affected by the policy is called the treatment group. 22 . The next slide shows the typical procedure you follow to conduct the difference-in-difference analysis. you have the pooled data of workers in 1994 and 1995.

Step 1: Create the treatment dummy such that
Dtreat = 1 if the person is affected by the policy change
Dtreat = 0 otherwise.

Step 2: Run the following regression.

$$(Hours\ worked)=\beta_0+\beta_1 Dtreat+\beta_2(year95)+\delta_1(year95)(Dtreat)+u$$

The coefficient $\delta_1$ is the difference-in-difference estimator. It shows the effect of the policy change on women's hours worked.
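To see why the interaction coefficient picks up the policy effect, it may help to write out the expected hours for each of the four group-by-year cells. This is a sketch that writes the coefficient on the year dummy as $\beta_2$ and assumes the error term has mean zero within each cell.

```latex
\begin{align*}
E[\text{Hours}\mid Dtreat=0,\ year95=0] &= \beta_0 \\
E[\text{Hours}\mid Dtreat=1,\ year95=0] &= \beta_0+\beta_1 \\
E[\text{Hours}\mid Dtreat=0,\ year95=1] &= \beta_0+\beta_2 \\
E[\text{Hours}\mid Dtreat=1,\ year95=1] &= \beta_0+\beta_1+\beta_2+\delta_1 \\[4pt]
\underbrace{\big[(\beta_0+\beta_1+\beta_2+\delta_1)-(\beta_0+\beta_1)\big]}_{\text{change for the treatment group}}
-\underbrace{\big[(\beta_0+\beta_2)-\beta_0\big]}_{\text{change for the control group}}
&= \delta_1
\end{align*}
```

The common time trend $\beta_2$ and the permanent group difference $\beta_1$ both cancel, leaving only $\delta_1$.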

Two period panel data analysis
Motivation: Remember the effects of the employee training grant on the scrap rate. You estimated the following model for the 1988 data.

$$\log(scrap)=\beta_0+\beta_1(grant)+\beta_2\log(sales)+\beta_3\log(employment)+v$$

    . reg lscrap grant lsales lemploy if year==1988

Number of obs = 50, F(3, 46) = 1.18, Prob > F = 0.3270, R-squared = 0.0716, Adj R-squared = 0.0110, Root MSE = 1.3854

| Source | SS | df | MS |
|---|---|---|---|
| Model | 6.8054029 | 3 | 2.26846763 |
| Residual | 88.2852083 | 46 | 1.91924366 |
| Total | 95.0906112 | 49 | 1.94062472 |

| lscrap | Coef. | Std. Err. | t | P>\|t\| | [95% Conf. | Interval] |
|---|---|---|---|---|---|---|
| grant | -.0517781 | .4312869 | -0.12 | 0.905 | -.9199137 | .8163574 |
| lsales | -.4548425 | .3733152 | -1.22 | 0.229 | -1.206287 | .2966021 |
| lemploy | .6394289 | .3651366 | 1.75 | 0.087 | -.095553 | 1.374411 |
| _cons | 4.986779 | 4.655588 | 1.07 | 0.290 | -4.384433 | 14.35799 |

You did not find evidence that receiving the grant reduces the scrap rate.

The reason why we did not find a significant effect is probably the endogeneity problem.
- Companies with low-ability workers tend to apply for the grant, which creates a positive bias in the estimation.
- If you observed the average ability of the workers, you could eliminate the bias by including the ability variable. But since you cannot observe ability, you have the following situation.

$$\log(scrap)=\beta_0+\beta_1(grant)+\beta_2\log(sales)+\beta_3\log(employment)+\underbrace{(\beta_4\,ability+u)}_{v}$$

where ability is in the error term v. The term $v=(\beta_4\,ability+u)$ is called the composite error term.

$$\log(scrap)=\beta_0+\beta_1(grant)+\beta_2\log(sales)+\beta_3\log(employment)+\underbrace{(\beta_4\,ability+u)}_{v}$$

Because ability and grant are correlated (negatively), this causes a bias in the coefficient for (grant). We predicted the direction of the bias in the following way.

$$\tilde{\beta}_1=\hat{\beta}_1+\hat{\beta}_4\,\tilde{\delta}_1$$

- $\hat{\beta}_1$: true effect of grant (-)
- $\hat{\beta}_4$: effect of ability on the scrap rate (-)
- $\tilde{\delta}_1$: sign is determined by the correlation between ability and grant (-)
- $\hat{\beta}_4\,\tilde{\delta}_1$: bias term (+)

The true negative effect of grant is cancelled out by the bias term. Thus, the bias makes it difficult to find the effect.

Now you know that there is a bias. Is there anything we can do to correct for the bias? When you have panel data, you can eliminate the bias. I will explain the method using this example. I will generalize it later.

Eliminating bias using two period panel data
Now, go back to the equation.

$$\log(scrap)=\beta_0+\beta_1(grant)+\beta_2\log(sales)+\beta_3\log(employment)+\underbrace{(\beta_4\,ability+u)}_{v}$$

The grant is administered in 1988. Suppose that you have panel data of firms for two periods, 1987 and 1988. Further assume that the average ability of workers does not change over time. So (ability) is interpreted as the innate ability of workers, such as IQ.

When you have the two period panel data, the equation can be written as:

$$\log(scrap)_{it}=\beta_0+\beta_1(grant)_{it}+\beta_2\log(sales)_{it}+\beta_3\log(employment)_{it}+\beta_5(year88)_{it}+\underbrace{(\beta_4\,ability_i+u_{it})}_{v_{it}}$$

- i is the index for the ith firm, and t is the index for the period.
- Since (ability) is assumed constant over time, ability has only the i index.
- Now, I will use a shorthand notation for $\beta_4(ability)_i$: write $\beta_4(ability)_i=a_i$. Then the above equation can be written as:

$$\log(scrap)_{it}=\beta_0+\beta_1(grant)_{it}+\beta_2\log(sales)_{it}+\beta_3\log(employment)_{it}+\beta_5(year88)_{it}+\underbrace{(a_i+u_{it})}_{v_{it}}$$

- $a_i$ is called the fixed effect, or the unobserved effect. If you want to emphasize that it is the unobserved firm characteristic, you can call it the firm fixed effect as well.
- $u_{it}$ is called the idiosyncratic error.
- Now, the bias in OLS occurs because the fixed effect is correlated with (grant).
- So if we can get rid of the fixed effect, we can eliminate the bias. This is the basic idea.
- In the next slide, I will show the procedure of what is called the first-differenced estimation.

First, for each firm, take the first difference. That is, compute the following.

$$\Delta\log(scrap)_{it}=\log(scrap)_{it}-\log(scrap)_{it-1}$$

It follows that

$$\begin{aligned}
\Delta\log(scrap)_{it}&=\beta_0+\beta_1(grant)_{it}+\beta_2\log(sales)_{it}+\beta_3\log(employment)_{it}+\beta_5(year88)_{it}+(a_i+u_{it})\\
&\quad-\left[\beta_0+\beta_1(grant)_{it-1}+\beta_2\log(sales)_{it-1}+\beta_3\log(employment)_{it-1}+\beta_5(year88)_{it-1}+(a_i+u_{it-1})\right]\\
&=\beta_1\Delta(grant)_{it}+\beta_2\Delta\log(sales)_{it}+\beta_3\Delta\log(employment)_{it}+\beta_5\Delta(year88)_{it}+\Delta u_{it}
\end{aligned}$$

The last line is the first-differenced equation.

So, by taking the first difference, you can eliminate the fixed effect.

$$\Delta\log(scrap)_{it}=\beta_1\Delta(grant)_{it}+\beta_2\Delta\log(sales)_{it}+\beta_3\Delta\log(employment)_{it}+\beta_5\Delta(year88)_{it}+\Delta u_{it}$$

- Note that this model does not have a constant.
- If $\Delta u_{it}$ is not correlated with $\Delta(grant)_{it}$, estimating the first-differenced model using OLS will produce unbiased estimates. If we have controlled for enough time-varying variables, it is reasonable to assume that they are uncorrelated.
- Now, estimate this model using JTRAIN.dta.
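The way differencing removes the fixed effect can be seen on a toy panel. In the sketch below the data are constructed by hand (hypothetical values, not JTRAIN.dta) as y = a_i - 2x + 0.5(year 1988 dummy), where the firm effects a_i = 10, 20, 30 are larger for the firm with larger x. The `D.` time-series operator, available after `xtset`, is used here as an alternative to generating the differences manually.

```stata
* Toy two-period panel: hypothetical values, for illustration only
* Built as y = a_i - 2*x + 0.5*(year==1988), with a_i = 10, 20, 30
clear
input firm year x y
1 1987 0 10
1 1988 1 8.5
2 1987 0 20
2 1988 0 20.5
3 1987 1 28
3 1988 3 24.5
end

xtset firm year

* Pooled OLS on levels: a_i is left in the error term and is
* correlated with x, so the slope is badly biased
regress y x
scalar b_pooled = _b[x]

* First-differenced regression: a_i drops out of D.y
* With two periods, the constant plays the role of the differenced year dummy
regress D.y D.x

display "Pooled OLS slope:        " b_pooled
display "First-difference slope:  " _b[D.x]
```

In this constructed example the pooled slope even has the wrong sign, while the first-differenced regression recovers the slope of -2 used to build the data.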

    **************************
    * Declare panel          *
    **************************
    . tsset fcode year
           panel variable: fcode (strongly balanced)
            time variable: year, 1987 to 1989
                    delta: 1 unit

    ******************************
    * Generate first differenced *
    * variables                  *
    ******************************
    . gen difflscrap=lscrap-L.lscrap
    (363 missing values generated)
    . gen diffgrant=grant-L.grant
    (157 missing values generated)
    . gen difflsales=lsales-L.lsales
    (226 missing values generated)
    . gen difflemploy=lemploy-L.lemploy
    (181 missing values generated)
    . gen diffd88=d88-L.d88
    (157 missing values generated)

    **********************
    * Run the regression *
    **********************
    . reg difflscrap diffgrant difflsales difflemploy diffd88 if year<=1988, nocons

Number of obs = 47, F(4, 43) = 1.82, Prob > F = 0.1428, R-squared = 0.1447, Adj R-squared = 0.0651, Root MSE = .61142

| Source | SS | df | MS |
|---|---|---|---|
| Model | 2.71885438 | 4 | .679713595 |
| Residual | 16.0749657 | 43 | .373836411 |
| Total | 18.79382 | 47 | .399868511 |

| difflscrap | Coef. | Std. Err. | t | P>\|t\| | [95% Conf. | Interval] |
|---|---|---|---|---|---|---|
| diffgrant | -.3223172 | .1879101 | -1.72 | 0.093 | -.701274 | .0566396 |
| difflsales | -.1733036 | .365626 | -0.47 | 0.638 | -.9106586 | .5640514 |
| difflemploy | .0233784 | .5064015 | 0.05 | 0.963 | -.9978775 | 1.044634 |
| diffd88 | -.0272418 | .120639 | -0.23 | 0.822 | -.2705336 | .2160501 |

When you use the 'nocons' option, Stata omits the constant term.

Now, the coefficient for grant is negative and significant at the 10% level.

Note that, when you use this method in your research, it is a good idea to tell your audience what the potential fixed effect would be and whether it is correlated with the explanatory variables. In this example, unobserved ability is potentially an important source of the fixed effect.
- Of course, one can never tell exactly what the fixed effect is, since it is the aggregate effect of all the unobserved effects. However, if you tell what is contained in the fixed effect, your audience can understand the potential direction of the bias, and why you need to use the first-differenced method.

General case
- The first-differenced model in a more general situation can be written as follows.

$$Y_{it}=\beta_0+\beta_1x_{it1}+\beta_2x_{it2}+\dots+\beta_kx_{itk}+a_i+u_{it}$$

where $a_i$ is the fixed effect. If $a_i$ is correlated with any of the explanatory variables, the estimated coefficients will be biased. So take the first difference to eliminate $a_i$, then estimate the following model by OLS.

$$\Delta Y_{it}=\beta_1\Delta x_{it1}+\beta_2\Delta x_{it2}+\dots+\beta_k\Delta x_{itk}+\Delta u_{it}$$

Note that, when you take the first difference, the constant term will also be eliminated. So you should use the 'nocons' option in STATA when you estimate the model. When some variables are time invariant, these variables are also eliminated. If the treatment variable does not change over time, you cannot use this method.

First differencing for more than two periods.
You can use first differencing for more than two periods. You just have to difference two adjacent periods successively. For example, suppose that you have 3 periods. Then, for the dependent variable, you compute $\Delta y_{i2}=y_{i2}-y_{i1}$ and $\Delta y_{i3}=y_{i3}-y_{i2}$. Do the same for the x-variables. Then run the regression.
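Differencing adjacent periods can be sketched in Stata in two equivalent ways: by hand within each individual, or with the `D.` operator once the panel has been declared. The toy data below are hypothetical, and the variable names `dy_manual` and `dy` are introduced only for this illustration.

```stata
* Toy three-period panel: hypothetical values, for illustration only
clear
input id year y
1 1 5
1 2 7
1 3 10
2 1 4
2 2 3
2 3 8
end

xtset id year

* By hand: within each id, subtract the previous period's value
bysort id (year): generate dy_manual = y - y[_n-1]

* With the time-series difference operator
generate dy = D.y

* The first period of each id has no previous period, so it is missing
list id year y dy_manual dy, sepby(id)
```

Each individual contributes two differenced observations here, one for period 2 and one for period 3, and the first period is lost.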

Exercise
- The data ezunem.dta contains city-level unemployment claim statistics in the state of Indiana. This data also contains information about whether the city has an enterprise zone or not.
- An enterprise zone is an area which encourages businesses and investments through reduced taxes and restrictions. Enterprise zones are usually created in economically depressed areas with the purpose of increasing economic activity and reducing unemployment.

Using the data ezunem.dta, you are asked to estimate the effect of enterprise zones on city-level unemployment claims. Use the log of unemployment claims as the dependent variable.

Ex1. First estimate the following model using OLS.

$$\log(unemployment\ claims)_{it}=\beta_0+\beta_1(Enterprise\ zone)_{it}+\beta(year\ dummies)_{it}+v_{it}$$

Discuss whether the coefficient for enterprise zone is biased or not. If you think it is biased, what is the direction of the bias?

Ex2. Estimate the model using the first difference method. Did it change the result? Was your prediction of the bias correct?

OLS results

    . reg luclms ez d81 d82 d83 d84 d85 d86 d87 d88

Number of obs = 198, F(9, 188) = 11.44, Prob > F = 0.0000, R-squared = 0.3539, Adj R-squared = 0.3230, Root MSE = .58767

| Source | SS | df | MS |
|---|---|---|---|
| Model | 35.5700512 | 9 | 3.95222791 |
| Residual | 64.9262278 | 188 | .345352276 |
| Total | 100.496279 | 197 | .510133396 |

| luclms | Coef. | Std. Err. | t | P>\|t\| | [95% Conf. | Interval] |
|---|---|---|---|---|---|---|
| ez | -.0387084 | .1148501 | -0.34 | 0.736 | -.2652689 | .187852 |
| d81 | -.3216319 | .1771882 | -1.82 | 0.071 | -.6711645 | .0279007 |
| d82 | -.1354957 | .1771882 | -0.76 | 0.445 | -.4850283 | .2140369 |
| d83 | -.2192554 | .1771882 | -1.24 | 0.217 | -.568788 | .1302772 |
| d84 | -.5970717 | .1799355 | -3.32 | 0.001 | -.9520237 | -.2421197 |
| d85 | -.6216534 | .1847186 | -3.37 | 0.001 | -.986041 | -.2572658 |
| d86 | -.6511313 | .1847186 | -3.52 | 0.001 | -1.015519 | -.2867437 |
| d87 | -.9188151 | .1847186 | -4.97 | 0.000 | -1.283203 | -.5544275 |
| d88 | -1.2575 | .1847186 | -6.81 | 0.000 | -1.621887 | -.893112 |
| _cons | 11.69439 | .125291 | 93.34 | 0.000 | 11.44724 | 11.94155 |

First differencing

    . reg lagluclms lagez lagd81 lagd82 lagd83 lagd84 lagd85 lagd86 lagd87 lagd88, nocons

Number of obs = 176, F(9, 167) = 41.31, Prob > F = 0.0000, R-squared = 0.6900, Adj R-squared = 0.6733, Root MSE = .21606

| Source | SS | df | MS |
|---|---|---|---|
| Model | 17.3537634 | 9 | 1.92819594 |
| Residual | 7.79583815 | 167 | .046681666 |
| Total | 25.1496016 | 176 | .142895463 |

| lagluclms | Coef. | Std. Err. | t | P>\|t\| | [95% Conf. | Interval] |
|---|---|---|---|---|---|---|
| lagez | -.1818775 | .0781862 | -2.33 | 0.021 | -.3362382 | -.0275169 |
| lagd81 | -.3216319 | .046064 | -6.98 | 0.000 | -.4125748 | -.2306891 |
| lagd82 | -.1354957 | .0651444 | -2.08 | 0.039 | -.2641083 | -.0068831 |
| lagd83 | -.2192554 | .0797852 | -2.75 | 0.007 | -.3767731 | -.0617378 |
| lagd84 | -.5580256 | .0945636 | -5.90 | 0.000 | -.7447196 | -.3713315 |
| lagd85 | -.5565765 | .108961 | -5.11 | 0.000 | -.7716951 | -.3414579 |
| lagd86 | -.5860544 | .1182979 | -4.95 | 0.000 | -.8196066 | -.3525023 |
| lagd87 | -.8537383 | .1269499 | -6.72 | 0.000 | -1.104372 | -.6031047 |
| lagd88 | -1.192423 | .1350488 | -8.83 | 0.000 | -1.459046 | -.9257998 |

The do file used to generate the results.

    tsset city year
    reg luclms ez d81 d82 d83 d84 d85 d86 d87 d88
    gen lagluclms =luclms -L.luclms
    gen lagez =ez -L.ez
    gen lagd81 =d81 -L.d81
    gen lagd82 =d82 -L.d82
    gen lagd83 =d83 -L.d83
    gen lagd84 =d84 -L.d84
    gen lagd85 =d85 -L.d85
    gen lagd86 =d86 -L.d86
    gen lagd87 =d87 -L.d87
    gen lagd88 =d88 -L.d88
    reg lagluclms lagez lagd81 lagd82 lagd83 lagd84 lagd85 lagd86 lagd87 lagd88, nocons

The assumptions for the first difference method.
Assumption FD1: Linearity
For each i, the model is written as

$$y_{it}=\beta_0+\beta_1x_{it1}+\dots+\beta_kx_{itk}+a_i+u_{it}$$

Assumption FD2: We have a random sample from the cross section.
Assumption FD3: There is no perfect collinearity. In addition, each explanatory variable changes over time at least for some i in the sample.

Assumption FD4: Strict exogeneity

$$E(u_{it}\mid X_i,a_i)=0$$

for each i, where $X_i$ is the shorthand notation for 'all the explanatory variables for the ith individual for all the time periods'. This means that $u_{it}$ is uncorrelated with the current year's explanatory variables as well as with other years' explanatory variables.

The unbiasedness of the first difference method
Under FD1 through FD4, the estimated parameters for the first difference method are unbiased.

Assumption FD5: Homoskedasticity

$$Var(\Delta u_{it}\mid X_i)=\sigma^2$$

Assumption FD6: No serial correlation within the ith individual.

$$Cov(\Delta u_{it},\Delta u_{is})=0\quad\text{for } t\neq s$$

Note that FD2 assumes random sampling across different individuals, but does not assume randomness within each individual. So you need an additional assumption to rule out serial correlation.
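A short calculation may help show that FD6 is a genuine restriction on the differenced errors rather than something that follows automatically. As a sketch, assume purely for illustration that the original errors $u_{it}$ are serially uncorrelated with constant variance $\sigma_u^2$ (this assumption is not part of FD1 through FD6). Then adjacent differences share the term $u_{it-1}$:

```latex
\begin{align*}
Cov(\Delta u_{it},\Delta u_{it-1})
  &= Cov(u_{it}-u_{it-1},\; u_{it-1}-u_{it-2}) \\
  &= Cov(u_{it},u_{it-1}) - Cov(u_{it},u_{it-2})
     - Var(u_{it-1}) + Cov(u_{it-1},u_{it-2}) \\
  &= 0 - 0 - \sigma_u^2 + 0 \\
  &= -\sigma_u^2 \neq 0
\end{align*}
```

So under that illustrative assumption the differenced errors are negatively correlated across adjacent periods, and FD6 would fail when there are more than two periods. With only two periods there is a single differenced observation per individual, so the issue does not arise.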
