Semester 2, 2023/24
Data obtained by pooling cross sections are very useful for establishing trends
and conducting policy analysis.
A pooled cross section (PCS) is available whenever a survey is repeated over
time with new random samples obtained in each time period.
That is, the survey does not track the same individuals across survey waves.
A special setup often arises with independently pooled cross sections. It is
often used to study the effects of policy interventions.
Outcomes are observed for two groups over two time periods.
One of the groups is exposed to a “treatment” in the second period but not in
the first period.
The second group is not exposed to the treatment during either period.
The regression
Let A be the control group and B the treatment group. Let d2 be a time period
dummy equal to one for a unit in the second time period. Write
y = β0 + β1 dB + δ0 d2 + δ1 d2 · dB + u,
where y is the outcome of interest.
The mean value of u is zero (essentially by definition), so we can read off the
means of the response for different combinations.
y = β0 + β1 dB + δ0 d2 + δ1 d2 · dB + u.
                       Before (1)    After (2)              After − Before
Control (A)            β0            β0 + δ0                δ0
Treatment (B)          β0 + β1       β0 + β1 + δ0 + δ1      δ0 + δ1
Treatment − Control    β1            β1 + δ1                δ1
What restrictions did we impose?
For before and after, we toggle d2.
For control and treatment, we toggle dB.
y = β0 + β1 dB + δ0 d2 + δ1 d2 · dB + u
The coefficient of interest is δ1 = (δ1 + δ0 ) − δ0 , the difference in the average
changes over time for the treatment and control groups.
Conveniently, δ1 is the coefficient on the interaction d2 · dB, which is one if
and only if the unit is in the treatment group in period 2.
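That δ1 equals the difference in the groups' average changes can be checked numerically. A minimal Python sketch with simulated data (the parameter values and variable names are invented for illustration): in the saturated 2 × 2 regression, the OLS interaction coefficient reproduces the difference in differences of the four cell means exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400

# Group and period indicators: dB = 1 for treatment group, d2 = 1 for period 2.
dB = rng.integers(0, 2, n)
d2 = rng.integers(0, 2, n)

# Simulated outcome with known parameters beta0=1, beta1=0.5, delta0=0.2, delta1=0.3.
y = 1.0 + 0.5 * dB + 0.2 * d2 + 0.3 * d2 * dB + rng.normal(0, 0.1, n)

# OLS on (1, dB, d2, d2*dB); the last coefficient is delta1-hat.
X = np.column_stack([np.ones(n), dB, d2, d2 * dB])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
delta1_hat = coef[3]

# Difference in differences of the four cell means.
m = lambda g, t: y[(dB == g) & (d2 == t)].mean()
did = (m(1, 1) - m(1, 0)) - (m(0, 1) - m(0, 0))

print(delta1_hat, did)  # identical up to floating-point error
```

Because the regression is saturated (four parameters, four cells), it fits the four cell means exactly, which is why the two numbers agree.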
δ1 is sometimes called the average treatment effect, as it captures the effect
of the treatment on the average outcome of y.
If y is a logarithm then, as usual, δ1 is a proportionate effect (multiply by 100
to approximate the percentage effect).
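How good the approximation is can be checked directly: for a log outcome the exact percentage effect is 100 · (exp(δ1) − 1). A quick sketch (the δ1 values are illustrative only):

```python
import math

# Approximate (100*delta1) vs exact (100*(exp(delta1)-1)) percentage effects.
for delta1 in (0.05, 0.19, 0.50):
    approx = 100 * delta1
    exact = 100 * (math.exp(delta1) - 1)
    print(f"delta1={delta1:.2f}: approx {approx:5.1f}%  exact {exact:5.1f}%")
```

The approximation is close for small δ1 and deteriorates as δ1 grows.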
Difference-in-Differences
Intuition
Just using ȳB,2 − ȳB,1 , the difference over time in the means of the treatment
group, attributes all change to the intervention.
Just using ȳB,2 − ȳA,2 , the difference in treatment and control means
post-treatment, attributes any differences in the groups to the treatment.
Example: workers’ compensation
We study the relationship between the length of injury leave and changes to
injury compensation.
Suppose a policy increased the cap on weekly earnings that were covered by
workers’ compensation.
Low earners were not affected by this policy. They received the same
compensation before and after the intervention.
High earners may stay on workers’ compensation for longer, because it had
become less costly to be on injury leave.
We use injury.dta.
------------------------------------------------------------------------------
ldurat | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
afchnge | .0931227 .034202 2.72 0.006 .0260736 .1601717
_cons | 1.233253 .023641 52.17 0.000 1.186907 1.279598
------------------------------------------------------------------------------
The afchnge coefficient shows that average leave duration increased by about
9.3% after the policy.
------------------------------------------------------------------------------
ldurat | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
highearn | .3490087 .0342618 10.19 0.000 .2818423 .416175
_cons | 1.129233 .0223497 50.53 0.000 1.085419 1.173047
------------------------------------------------------------------------------
The highearn coefficient shows that the average leave duration for high
earners is about 35% longer than for low earners.
------------------------------------------------------------------------------
ldurat | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
afchnge | .0076573 .0447173 0.17 0.864 -.0800058 .0953204
highearn | .2564785 .0474464 5.41 0.000 .1634652 .3494918
afhigh | .1906012 .0685089 2.78 0.005 .0562973 .3249051
_cons | 1.125615 .0307368 36.62 0.000 1.065359 1.185871
------------------------------------------------------------------------------
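Reading the last regression through the earlier table: the intercept is the pre-policy mean of ldurat for low earners, and the afhigh coefficient (≈ 0.19) is the DiD estimate, suggesting leave durations of high earners rose about 19% (in log points) more than those of low earners after the policy. A sketch reconstructing the four cell means from the reported coefficients:

```python
# Coefficients from the reported regression of ldurat on afchnge, highearn, afhigh.
b0, d0, b1, d1 = 1.125615, 0.0076573, 0.2564785, 0.1906012

# Cell means implied by y = b0 + b1*highearn + d0*afchnge + d1*afchnge*highearn.
cells = {
    ("low",  "before"): b0,
    ("low",  "after"):  b0 + d0,
    ("high", "before"): b0 + b1,
    ("high", "after"):  b0 + b1 + d0 + d1,
}

# Difference in differences of the implied cell means.
did = (cells[("high", "after")] - cells[("high", "before")]) \
    - (cells[("low", "after")] - cells[("low", "before")])
print(did)  # recovers the afhigh coefficient
```

This is just the table of means worked through with the estimated numbers: the interaction coefficient and the difference in differences coincide by construction.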
A potential problem with using only two periods is that the control and
treatment groups may be trending at different rates having nothing to do with
the intervention.
In the workers’ compensation example, what if high earners are taking up
more compensation than low earners irrespective of the policy change?
We can find high earners from a different state in the U.S. and use them as a
second control group. (They resemble the treatment group, but they are not
subject to the treatment.)
The resulting estimator is a triple difference estimator.
With a panel data set, the same units are sampled in two or more time
periods. For each unit (individual, school, city, and so on) i we have multiple
years of data.
At a minimum, statistical methods must recognise that the outcomes for a unit
will be correlated over time.
By contrast, with pooled cross sections we have different units sampled in
each period. If there is some overlap, we ignore it (and usually would not
know there is overlap).
Balanced panel
Main benefit of panel data: with multiple years of data we can control for
unobserved characteristics that do not change (or change slowly) over time.
Very useful for policy analysis.
A balanced panel is one where we observe the same time periods for each
unit. Easier to achieve for larger units (such as schools and cities).
At a disaggregated level – such as individuals and families – following the
same units over time can be challenging. Attrition can be a serious problem.
What is attrition? Units dropping out of the sample.
Notation
We will start with the case of two periods, so t = 1, 2 and (hopefully) many
cross section observations, i = 1, 2, . . . , n.
Along with the observed data (xit1 , xit2 , . . . , xitk , yit ), each draw also
includes unobserved factors.
Put these into two categories. (1) A component that does not change over
time, ai . Called an unobserved effect or unobserved heterogeneity. It
varies by individual but not by time.
What is an example of ai at the individual level? “ability” – something innate
and subject to slow change.
There are also unobservables that change across time, uit . These are
sometimes called “shocks”; we will call them idiosyncratic errors.
They are specific to unit i but vary over time, and they affect the outcome, yit .
The best way to store panel data is to stack the time periods for each i on top
of each other.
In particular, the time periods for each unit should be adjacent, and stored in
chronological order (from earliest period to the most recent).
This is sometimes called the “long” storage format. It is by far the most
common.
Long format
The same data structure is convenient for more than two years.
While not absolutely necessary for some procedures, it is best to tell Stata
that you have a panel data set. In particular, what are i and t? In GPA3.DTA,
i = id and t = term.
. xtset id term
panel variable: id (strongly balanced)
time variable: term, 1 to 2
delta: 1 unit
. tab term
fall = 1, |
spring = 2 | Freq. Percent Cum.
------------+-----------------------------------
1 | 366 50.00 50.00
2 | 366 50.00 100.00
------------+-----------------------------------
Total | 732 100.00
Initially using xtset heads off most problems, but it is nice to have the data
appropriately sorted. To sort the data, use
sort distid year
You do not want to sort by year and then district ID. (That would make the data
set look more like independently pooled cross sections, and mask the panel
structure.)
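The recommended ordering can be illustrated in plain Python (the district/year records are hypothetical): sorting by (distid, year) keeps each unit's periods adjacent and chronological, while sorting by (year, distid) interleaves the units, which is exactly the pooled-cross-section look to avoid.

```python
# Hypothetical long-format records: (distid, year, outcome).
rows = [(2, 1994, 5.1), (1, 1993, 4.0), (2, 1993, 4.8), (1, 1994, 4.2)]

panel_order = sorted(rows)                            # like: sort distid year
pcs_order = sorted(rows, key=lambda r: (r[1], r[0]))  # like: sort year distid

print(panel_order)  # [(1, 1993, 4.0), (1, 1994, 4.2), (2, 1993, 4.8), (2, 1994, 5.1)]
print(pcs_order)    # [(1, 1993, 4.0), (2, 1993, 4.8), (1, 1994, 4.2), (2, 1994, 5.1)]
```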
Wide format
Sometimes panel data sets (especially with two years) will be stored as
having only n records (rather than 2n, as above), with the variables from the
different years given different suffixes (to distinguish the years).
Generally, this makes the data harder to work with, especially if there are
more than two years.
It is sometimes called the “wide” storage method.
Stata has a command, reshape, that allows one to go from wide to long, and
vice versa.
Wide format
+---------------------------------------------+
| state vote90 vote88 inexp90 inexp88 |
|---------------------------------------------|
1. | AL 51 94 596096 234923 |
2. | AK 52 62 564759 626377 |
3. | AZ 66 73 112373 99607 |
4. | AR 71 75 105354 159221 |
5. | CA 64 59 515020 696748 |
|---------------------------------------------|
6. | CO 64 70 521500 217503 |
7. | CT 60 64 464500 727919 |
8. | DE 66 68 521336 371747 |
9. | FL 52 67 158280 210940 |
10. | GA 71 67 399035 337048 |
+---------------------------------------------+
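What reshape does when going from wide to long can be sketched in pure Python, using the first two states from the table above (the year suffixes 90 and 88 become a year variable):

```python
# Wide records: one row per state, with year-suffixed variables.
wide = [
    {"state": "AL", "vote90": 51, "vote88": 94, "inexp90": 596096, "inexp88": 234923},
    {"state": "AK", "vote90": 52, "vote88": 62, "inexp90": 564759, "inexp88": 626377},
]

# Long records: one row per state-year, stacked in chronological order.
long = [
    {"state": r["state"], "year": yr,
     "vote": r[f"vote{yr % 100}"], "inexp": r[f"inexp{yr % 100}"]}
    for r in wide
    for yr in (1988, 1990)
]

for row in long:
    print(row)
```

The long version has 2n records instead of n, with the year suffix moved into its own variable.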
Assume a balanced panel for units i. The units can be aggregated (schools or
cities) or disaggregated (students or teachers).
We have time periods t = 1 and t = 2 for each unit i. These periods do not
have to be, say, adjacent years. They could be periods far apart in time. Or,
they could be close together.
First consider the case with a single explanatory variable, xit .
The model
The equation is
yit = β0 + δ0 d2t + β1 xit + ai + uit , t = 1, 2,
where d2t is a dummy variable equal to one when t = 2, ai is the unobserved
effect, and uit is the idiosyncratic error.
The intercept in the first (base) period is β0 , and that for the second period is
β0 + δ0 .
It can be very important to allow changing intercepts to get a good estimate of
a causal effect.
(For example, a policy, as measured by xit , might be implemented just as the
aggregate economy is turning up or down – as captured by δ0 d2t .)
For policy analysis, xit is often a dummy variable.
Estimation
vit = ai + uit ,t = 1, 2
and write
vi1 = ai + ui1
vi2 = ai + ui2
Correlation of vi1 and vi2 causes the usual OLS standard errors to be invalid.
And using heteroskedasticity-robust standard errors does not solve the
problem.
This is the familiar problem of serial correlation or cluster correlation.
(Each unit i is a cluster of two time periods.)
Obtaining “cluster-robust” standard errors and test statistics is very easy these
days.
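The computation behind cluster-robust standard errors can be sketched in Python with numpy (the CR0 "sandwich" form with no small-sample correction; the data-generating design and all names are invented for illustration):

```python
import numpy as np

def cluster_robust_se(X, y, groups):
    """OLS coefficients with CR0 cluster-robust standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    u = y - X @ beta                      # residuals
    XtX_inv = np.linalg.inv(X.T @ X)
    # "Meat": sum over clusters g of (X_g' u_g)(X_g' u_g)'.
    k = X.shape[1]
    meat = np.zeros((k, k))
    for g in np.unique(groups):
        Xg, ug = X[groups == g], u[groups == g]
        s = Xg.T @ ug                     # score sum within the cluster
        meat += np.outer(s, s)
    V = XtX_inv @ meat @ XtX_inv          # sandwich variance estimate
    return beta, np.sqrt(np.diag(V))

# Example: two periods per unit, with a common unit effect ai in the error,
# so the two observations in each cluster are correlated.
rng = np.random.default_rng(1)
n_units = 200
ids = np.repeat(np.arange(n_units), 2)
a = np.repeat(rng.normal(0, 1, n_units), 2)   # unobserved effect ai
x = rng.normal(0, 1, 2 * n_units)
y = 1.0 + 2.0 * x + a + rng.normal(0, 0.5, 2 * n_units)
X = np.column_stack([np.ones(2 * n_units), x])
beta, se = cluster_robust_se(X, y, ids)
print(beta, se)
```

Each unit i is one cluster of two periods; allowing arbitrary correlation within clusters is what the usual and heteroskedasticity-robust formulas fail to do.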
Consistency of POLS
A more serious issue is that consistency of pooled OLS (as n gets large, as
usual) requires that xit and vit are uncorrelated.
Because vit = ai + uit , we need
Cov (xit , ai ) = 0
Cov (xit , uit ) = 0
Suppose we are willing to assume the second of these. The first might be
violated if xit is determined based on systematic differences in units.
For example, suppose yit is a region’s employment rate and xit is its crime
rate. The crime rate might depend partly on historical economic conditions of
an area (e.g., age distribution, education level), captured by ai , but not on
contemporaneous shocks to employment (in uit ).
When Cov (xit , ai ) ̸= 0 it is often said that (pooled) OLS suffers from
heterogeneity bias.
If the explanatory variable changes over time – at least for some units in the
population – heterogeneity bias can be solved by differencing away ai .
To remove the source of bias in POLS, ai , write the time periods in reverse
order for any unit i:
yi2 = (β0 + δ0 ) + β1 xi2 + ai + ui2
yi1 = β0 + β1 xi1 + ai + ui1 .
Subtracting the second equation from the first eliminates ai . If we define
∆yi = yi2 − yi1 , where ∆ = change – and similarly for ∆xi and ∆ui – we can
write the cross-sectional equation as
∆yi = δ0 + β1 ∆xi + ∆ui .
First-difference estimator
OLS applied to this differenced equation is often called the first-difference
estimator. (With more than two time periods, other orders of differencing are
possible; hence the qualifier “first”.)
We will refer to it as the FD estimator.
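The payoff from differencing can be seen in a small simulation (the design and parameter values are invented for illustration): when xit is correlated with ai , POLS on the levels is biased, while FD recovers β1.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
beta1, delta0 = 1.0, 0.5

a = rng.normal(0, 1, n)            # unobserved effect ai, fixed over time
# xit correlated with ai -> heterogeneity bias in POLS.
x1 = 0.8 * a + rng.normal(0, 1, n)
x2 = 0.8 * a + rng.normal(0, 1, n)
y1 = 2.0 + beta1 * x1 + a + rng.normal(0, 1, n)
y2 = 2.0 + delta0 + beta1 * x2 + a + rng.normal(0, 1, n)

# Pooled OLS on the levels (intercept, period dummy, x), ignoring ai.
Xp = np.column_stack([np.ones(2 * n), np.r_[np.zeros(n), np.ones(n)], np.r_[x1, x2]])
bp, *_ = np.linalg.lstsq(Xp, np.r_[y1, y2], rcond=None)

# First differences: ai drops out of dy = delta0 + beta1*dx + du.
Xd = np.column_stack([np.ones(n), x2 - x1])
bd, *_ = np.linalg.lstsq(Xd, y2 - y1, rcond=None)

print("POLS slope:", bp[2])   # biased upward, well above 1
print("FD slope:  ", bd[1])   # close to the true beta1 = 1
```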
. use jtrain
. des fcode year scrap grant
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------
fcode float %9.0g firm code number
year int %9.0g 1987, 1988, or 1989
scrap float %9.0g scrap rate (per 100 items)
grant byte %9.0g = 1 if received grant
. keep if year <= 1988
(157 observations deleted)
. tab year if scrap != .
1987, 1988, |
or 1989 | Freq. Percent Cum.
------------+-----------------------------------
1987 | 54 50.00 50.00
1988 | 54 50.00 100.00
------------+-----------------------------------
Total | 108 100.00
------------------------------------------------------------------------------
D.scrap | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
cgrant | -.7394436 .6826276 -1.08 0.284 -2.109236 .6303488
_cons | -.5637143 .4049149 -1.39 0.170 -1.376235 .2488069
------------------------------------------------------------------------------
The job training grant reduced the scrap rate by 0.739 on average, though
the estimate is not statistically significant.
------------------------------------------------------------------------------
D.lscrap | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
cgrant | -.3170579 .1638751 -1.93 0.058 -.6458974 .0117816
_cons | -.0574357 .097206 -0.59 0.557 -.2524938 .1376224
------------------------------------------------------------------------------
The t statistic for ∆grant is only marginally significant (and even less
significant if we specify robust).
We conclude that there is likely no significant relationship between the
scrap rate and the job training grant.
First Differencing can be used with more than two periods of panel data, but
we must be careful to account for serial correlation (and, as usual, possibly
heteroskedasticity) in the FD equation.
This is because the FD equation is no longer just a single cross section.
Generally, we should also include a full set of time dummies for a convincing
analysis.
The model
With T time periods the equation is
yit = δ1 + δ2 d2t + · · · + δT dTt + β1 xit1 + · · · + βk xitk + ai + uit , t = 1, . . . , T,
where now δ1 denotes the intercept in the first year and δt , t ≥ 2, is the
difference between the intercept in period t and period 1.
The model still contains an unobserved effect, ai , and idiosyncratic error, uit .
Now we just use POLS on the changes for t = 2, . . . , T , and we can use the
usual and adjusted R-squareds as goodness-of-fit measures in the FD
equation.
Except for changing how we allow for different time intercepts, this is the same
model as before. Estimates of the βj are identical.
Some tricks
A simpler way is to apply the differencing operator, d., to the equation all at
once:
reg D.(y d2 d3 ... dT x1 x2 ... xk), cluster(id)
This gives different time intercepts because these time dummies are
differenced.
How many time dummies will be estimated? T − 2, because an intercept is
included.
Can force the intercept to zero to get estimates on all T − 1 dummies:
reg D.(y d2 d3 ... dT x1 x2 ... xk), cluster(id) nocons
Extensions
With the levels equation as the starting point, it is easy to choose explanatory
variables for policy evaluation that allow, say, lagged effects.
We can also add quadratics, interactions, and so on. These are constructed
before applying differencing.
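The order of operations matters here: squares and interactions must be formed in levels and only then differenced, because ∆(x²) ≠ (∆x)². A two-period sketch with made-up numbers:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0])   # period-1 values of x
x2 = np.array([2.0, 2.5, 4.0])   # period-2 values of x

diff_of_square = x2**2 - x1**2   # correct regressor: construct x^2, then difference
square_of_diff = (x2 - x1)**2    # wrong: difference first, then square

print(diff_of_square)  # [3.   2.25 7.  ]
print(square_of_diff)  # [1.   0.25 1.  ]
```

The two columns disagree for every unit, so constructing transformed variables after differencing would estimate a different (wrong) equation.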
We use jtrain.dta, this time for three years: 1987,1988, and 1989.
No grants were awarded in 1987; if a firm received a grant in 1988, it could
not get one in 1989.
Job training in a previous year could have an effect on this year’s
productivity. Hence, omitting the previous year’s grant indicator could be a
serious misspecification.
The differenced equation (where the prefix “c” stands for “change” – not to
be confused with the dummy-variable prefix “d”) is
clscrap = α0 + α1 d89 + β1 cgrant + β2 cgrant_1 + cu .
Estimate by pooled OLS (using two years). The firm effect ai has been
removed.
Can just use a dummy for 1989 and include a constant.
We can specify nocons to drop the intercept and get an estimate for d89.