What is panel data?
• A data set containing observations on multiple phenomena observed at a single point in time is called cross-sectional data.
• A data set containing observations on a single phenomenon observed over multiple time periods is called time-series data.
• Observations on multiple phenomena over multiple time periods are panel data.
• Cross-sectional and time-series data are one-dimensional; panel data are two-dimensional.
  – Panel data can be used to answer both longitudinal and cross-sectional questions!
Long versus Short Panel Data
• A short panel has many entities (large n) but few time periods (small T), while a long panel has many time periods (large T) but few entities (Cameron and Trivedi, 2009: 230).
• Accordingly, a short panel data set is wide in its cross-sectional dimension and short in its time-series dimension, whereas a long panel is narrow in width. Both too small an N (low power, risking Type II errors) and too large an N (even trivial effects become statistically significant, raising Type I error concerns) can cause problems. Researchers should be especially careful when examining either short or long panels.
Balanced versus Unbalanced Panel Data
• In a balanced panel, all entities have measurements in all time periods. In a contingency table (or cross-table) of the cross-sectional and time-series variables, each cell has exactly one observation. Therefore, the total number of observations is nT.
• When entities have different numbers of observations, the panel data are not balanced: some cells in the contingency table have zero frequency. Accordingly, the total number of observations is less than nT in an unbalanced panel. Unbalanced panel data entail some computation and estimation issues, although most software packages are able to handle both balanced and unbalanced data.
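To make the distinction concrete, here is a minimal Python sketch (using a made-up entity/year list) that checks whether a panel is balanced by comparing each entity's observation count to the number of time periods:

```python
# Hypothetical unbalanced panel: (entity, year) observation keys.
obs = [("A", 1), ("A", 2), ("A", 3),
       ("B", 1), ("B", 2), ("B", 3),
       ("C", 1), ("C", 3)]          # entity C is missing year 2

entities = sorted({e for e, t in obs})
periods = sorted({t for e, t in obs})
n, T = len(entities), len(periods)

# Balanced iff every entity appears in every period, i.e. total obs = n*T.
counts = {e: sum(1 for ent, t in obs if ent == e) for e in entities}
balanced = all(c == T for c in counts.values())
```

Here n × T = 9 but only 8 observations exist, so the panel is unbalanced.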

Fixed versus Rotating Panel Data
• If the same individuals (or entities) are observed for each period, the panel data set is called a fixed panel (Greene 2008: 184). If a set of individuals changes from one period to the next, the data set is a rotating panel.
Recap: Example of panel data

Organization             Time   Profit   Location   Tax
MUGER CEMENT
MUGER CEMENT
MUGER CEMENT
MESSEBO CEMENT FACTORY
MESSEBO CEMENT FACTORY
MESSEBO CEMENT FACTORY
DERBA MIDROC CEMENT
DERBA MIDROC CEMENT
DERBA MIDROC CEMENT

• $X_{it}$: varies across both organizations and time (e.g., Profit)
• $X_i$: varies across organizations only (e.g., Location)
• $X_t$: varies across time only (e.g., a Tax rate common to all organizations)

With panel data we can control for factors that:
• Vary across entities (states) but do not vary over time
• Could cause omitted variable bias if they are omitted
• Are unobserved or unmeasured, and therefore cannot be included in the regression using multiple regression
• Here’s the key idea: if an omitted variable does not change over time, then any changes in Y over time cannot be caused by the omitted variable.
Panel Data (cont.)
• Potential unobserved heterogeneity is a form of omitted variables bias.
• “Unobserved heterogeneity” refers to omitted variables that are fixed for an individual (at least over a long period of time).
• A person’s upbringing, family characteristics, innate ability, and demographics (except age) do not change.
Panel Data (cont.)
• With cross-sectional data, there is no particular reason to differentiate between omitted variables that are fixed over time and omitted variables that are changing.
• However, when an omitted variable is fixed over time, panel data offers another tool for eliminating the bias.
Panel Data (cont.)
• Panel Data is data in which we observe repeated cross-sections of the same individuals.
• Examples:
  – Annual unemployment rates of each country over several years
  – Quarterly sales of individual stores over several quarters
  – Wages for the same worker, working at several different jobs
Panel Data (cont.)
• By far the leading type of panel data is repeated cross-sections over time.
• The key feature of panel data is that we observe the same individual in more than one condition.
• Omitted variables that are fixed will take on the same values each time we observe the same individual.
Panel Data (cont.)
• Some of the most valuable data sets in economics and finance are panel data sets.
• Longitudinal surveys return year after year to the same individuals, tracking them over time.
Panel Data (cont.)
• For example, the National Longitudinal Survey of Youth (NLSY) tracks labor market outcomes for thousands of individuals, beginning in their teenage years.
• The Panel Study of Income Dynamics (PSID) provides high-quality income data.
• Panel data sets of stock prices provide high-quality asset return data.
Example: Cross-Industry Wage Disparities
• A great puzzle in labor markets is the presence of cross-industry wage disparities.
• Workers of seemingly equivalent ability, in seemingly equivalent occupations, receive different wages in different industries.
  – e.g., a person with an MSc in accounting working at ECA is paid 30,000 birr, while another person with an MSc in accounting working at Addis Ababa University is paid 5,000 birr.
• Do high-wage industries actually pay higher wages, or do they attract workers of unobservably higher quality?
Gibbons and Katz findings
• Gibbons and Katz (Review of Economic Studies, 1992) exploited panel data to explore these differentials.
• They observed workers in 1984 and 1986.
• They focused on workers who lost their 1984 jobs because of plant closings (on the grounds that plant closings are unlikely to be correlated with an individual worker’s abilities). They looked only at workers who were reemployed by 1986.
Gibbons and Katz findings
• Gibbons and Katz estimated wages as
$$\ln w_{it} = \beta_0 + \beta_1 X_{1it} + \cdots + \beta_k X_{kit} + \gamma_1 D_{1it} + \cdots + \gamma_m D_{mit} + \varepsilon_{it}$$
where the $X_{kit}$ are demographic variables, and the $D_{it}$ are a set of dummy variables for being employed in different industries.
Gibbons and Katz findings
Estimating with simple OLS, Gibbons and Katz obtained estimates of the industry coefficients (the $\gamma$’s) that are very similar to other estimates of cross-industry wage differentials.
Gibbons and Katz findings
• Gibbons and Katz speculated that any unmeasured ability is fixed over time and equally rewarded in all industries:
$$\ln w_{it} = \beta_0 + \beta_1 X_{1it} + \cdots + \beta_k X_{kit} + \gamma_1 D_{1it} + \cdots + \gamma_m D_{mit} + v_i + \varepsilon_{it}$$
• Differencing the 1986 and 1984 observations eliminates the $v_i$.
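A small simulation (not Gibbons and Katz’s data; all numbers are invented) illustrates why differencing works: pooled OLS is biased when the fixed effect $v_i$ is correlated with the regressor, while OLS on the differenced data recovers the true coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n, b1 = 200, 2.0   # n workers observed twice, true slope 2.0

# Individual fixed effect v_i, deliberately correlated with the regressor.
v = rng.normal(size=n)
x84 = v + rng.normal(size=n)
x86 = v + rng.normal(size=n)
y84 = b1 * x84 + v + rng.normal(scale=0.1, size=n)
y86 = b1 * x86 + v + rng.normal(scale=0.1, size=n)

# Pooled OLS (no intercept needed: everything is mean zero) is biased upward.
x, y = np.concatenate([x84, x86]), np.concatenate([y84, y86])
b_pooled = (x @ y) / (x @ x)

# Differencing 1986 - 1984 eliminates v_i; OLS on differences is consistent.
dx, dy = x86 - x84, y86 - y84
b_diff = (dx @ dy) / (dx @ dx)
```

With these parameters the pooled estimate lands near 2.5 (bias from the omitted $v_i$) while the differenced estimate lands near the true 2.0.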
Gibbons and Katz findings
• The estimated industry coefficients from the differenced equation are about 80% of the estimated industry coefficients from the levels equation.
• Unobserved worker ability appears to explain relatively little of the cross-industry wage differentials.
Panel Data DGP’s (cont.)
• Notice that when we have panel data, we index observations with both i and t.
• Pay close attention to the subscripts on variables.
• Some variables vary only across time or only across individuals.
Recap: Example of panel data

Organization             Time   Profit   Location   Tax
MUGER CEMENT
MUGER CEMENT
MUGER CEMENT
MESSEBO CEMENT FACTORY
MESSEBO CEMENT FACTORY
MESSEBO CEMENT FACTORY
DERBA MIDROC CEMENT
DERBA MIDROC CEMENT
DERBA MIDROC CEMENT

• $X_{it}$: varies across both organizations and time (e.g., Profit)
• $X_i$: varies across organizations only (e.g., Location)
• $X_t$: varies across time only (e.g., a Tax rate common to all organizations)

Panel Data DGP’s (cont.)
• In this DGP, the $b_{0i}$ are fixed across samples.
• The unmeasured heterogeneity is the same in every sample.
• This DGP is called the “Distinct Intercepts” DGP.
• It is suitable for panels of states or countries, where the same individuals would be selected in each sample.
Panel Data DGP’s (cont.)
• With longitudinal data on individual workers or consumers, we draw a different set of individuals from the population each time we collect a sample.
• Each individual has his/her own set of fixed omitted variables.
• We cannot fix each individual intercept.
Panel Data DGP’s
• In this DGP, we return to a model with a single intercept for all data points, $b_0$.
• However, we break the error term into two components:
$$u_{it} = v_i + \varepsilon_{it}$$
• When we draw an individual i, we draw one $v_i$ that is fixed for that individual in all time periods.
• $v_i$ includes all fixed omitted variables.
Panel Data DGP’s
• In the Distinct Intercepts DGP, the unobserved heterogeneity is absorbed into the individual-specific intercept $b_{0i}$.
• In the second DGP, the unobserved heterogeneity is absorbed into the individual fixed component of the error term, $v_i$.
• This DGP is an “Error Components Model.”
Panel Data DGP’s
• The Error Components DGP comes in two flavors, depending on $E(X_{jit} v_i)$.
• If $E(X_{jit} v_i) = 0$, then the unobserved heterogeneity is uncorrelated with the explanators.
• OLS is unbiased and consistent.
Panel Data DGP’s
• If $E(X_{jit} v_i) \neq 0$, then the unobserved heterogeneity IS correlated with the explanators.
• OLS is BIASED and INCONSISTENT.
Panel Data DGP’s
• Panel data is most useful in the second Error Components case.
• When $E(X_{jit} v_i) \neq 0$, OLS is inconsistent.
• Using panel data, we can create a consistent estimator: Fixed Effects.
Recall
• The OLS estimators of beta and alpha are BLUE (Best Linear Unbiased Estimator).
• Best: the variance of the OLS estimator is minimal, smaller than the variance of any other linear unbiased estimator.
• Linear: if the relationship is not linear, OLS is not applicable.
• Unbiased: the expected values of the estimated beta and alpha equal the true values describing the relationship between X and Y.
Recall :How to choose between
estimator: Bias vs consistency
• Efficiency is a relative measure between two estimators. An estimator is efficient when it has smaller variance than another estimator.
• Consistency: how far the estimator is likely to be from the true parameter as we let the sample size increase indefinitely. As N approaches infinity, the estimated beta converges to the true beta.
• Unlike unbiasedness, consistency requires that the distribution of the estimator collapses around the true value as N approaches infinity. Thus unbiased estimators are not necessarily consistent, but those whose variance shrinks to zero as the sample size grows are consistent.
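A quick simulation (illustrative numbers only) shows consistency in action: the sampling distribution of an estimator, here the sample mean, collapses around the true value as N grows.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean = 5.0

# Standard deviation of the sample mean across 500 replications, for growing N.
spreads = []
for N in (10, 100, 10000):
    estimates = rng.normal(loc=true_mean, scale=2.0, size=(500, N)).mean(axis=1)
    spreads.append(estimates.std())

# The spread shrinks roughly like 2/sqrt(N): the estimator piles up on 5.0.
```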
Fixed Effects
• The Fixed Effects Estimator is used with EITHER the distinct intercepts DGP OR the error components DGP with $E(X_{jit} v_i) \neq 0$.
• Basic idea: estimate a separate intercept for each individual.
Fixed Effects
• The simplest way to estimate separate intercepts for each individual is to use dummy variables.
• This method is called the least squares dummy variable (LSDV) estimator.
Fixed Effects
• We have already seen that we can use dummy variables to estimate separate intercepts for different groups.
• With panel data, we have multiple observations for each individual. We can group these observations.
Fixed Effects
• Least Squares Dummy Variable Estimator:
  1. Create a set of n dummy variables, $D^j$, such that $D^j = 1$ if $i = j$ (and 0 otherwise).
  2. Regress $Y_{it}$ against all the dummies, the $X_t$ variables, and the $X_{it}$ variables (you must omit the $X_i$ variables and the constant).
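A minimal numpy sketch of LSDV on simulated data (all names and numbers are invented): build one dummy column per individual and run least squares on the regressor plus the dummies, with no common constant.

```python
import numpy as np

rng = np.random.default_rng(2)
n, T, b1 = 50, 4, 1.5   # 50 individuals, 4 periods, true slope 1.5

v = rng.normal(size=n)                      # fixed effects, correlated with x
x = rng.normal(size=(n, T)) + v[:, None]
y = b1 * x + v[:, None] + 0.1 * rng.normal(size=(n, T))

# n dummy columns: D[r, j] = 1 if row r belongs to individual j.
D = np.kron(np.eye(n), np.ones((T, 1)))
X = np.hstack([x.reshape(-1, 1), D])        # slope column + dummies, no constant
coef, *_ = np.linalg.lstsq(X, y.reshape(-1), rcond=None)
b_lsdv, intercepts = coef[0], coef[1:]      # slope and n individual intercepts
```

Despite the correlation between x and the fixed effects, the slope estimate lands near the true 1.5 because the dummies absorb the heterogeneity.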
Fixed Effects
• The LSDV estimator is conceptually quite simple.
• In practice, the tricky parts are computational, as the following example shows.
Fixed Effects
• Suppose we have a longitudinal data set with 300 workers over 10 years (n = 300).
• We must create 300 dummy variables and then specify a regression with 300+ explanators.
• How do we do this in our software package?
Fixed Effects
• Our regression output includes 300 intercepts.
• Usually, we are not interested in the intercepts themselves.
• We include the dummy variables to control for heterogeneity.
Fixed Effects
• In reporting your regression output, it is preferable to note that you have included “individual fixed effects.”
• Then omit the dummy variable coefficients from your table of results.
Fixed Effects
• At some point, n becomes too large for the computer to handle easily.
• Modern computers can implement LSDV for ever larger data sets, but eventually LSDV becomes computationally intractable.
Fixed Effects
• A computationally convenient alternative is called the Fixed Effects Estimator.
• Technically, only this strategy is “Fixed Effects”; using dummy variables is LSDV.
• In practice, econometricians tend to refer to either method as Fixed Effects.
Fixed Effects
• The initial insight for the Fixed Effects estimator: if we DIFFERENCE observations for the same individual, the $v_i$ cancels out.
Fixed Effects
• When we difference, the heterogeneity term $v_i$ drops out.
• (In the distinct intercepts model, the $b_{0i}$ would drop out.)
• By assumption, the $\varepsilon_{it}$ are uncorrelated with the $X_{it}$.
• OLS on the differenced data would be a consistent estimator of $b_1$.
Fixed Effects
• If T = 2, then we have only 2 observations for each individual (as in the Gibbons and Katz example). Differencing the 2 observations is efficient.
• If T > 2, then differencing any 2 observations ignores valuable information in the other observations for each individual.
Fixed Effects
• We can use all the observations for each individual if we subtract the individual-specific mean from each observation.
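The within transformation is easy to verify numerically. This sketch (simulated data, invented numbers) demeans by individual and checks that the resulting slope matches LSDV essentially exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
n, T, b1 = 50, 4, 1.5
v = rng.normal(size=n)
x = rng.normal(size=(n, T)) + v[:, None]
y = b1 * x + v[:, None] + 0.1 * rng.normal(size=(n, T))

# Within (fixed effects) estimator: demean by individual, then OLS.
xw = x - x.mean(axis=1, keepdims=True)
yw = y - y.mean(axis=1, keepdims=True)
b_fe = (xw.ravel() @ yw.ravel()) / (xw.ravel() @ xw.ravel())

# LSDV for comparison: individual dummies instead of demeaning.
D = np.kron(np.eye(n), np.ones((T, 1)))
X = np.hstack([x.reshape(-1, 1), D])
b_lsdv = np.linalg.lstsq(X, y.ravel(), rcond=None)[0][0]
```

The two slope estimates agree to machine precision, which is the Frisch–Waugh logic behind the within estimator.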
Fixed Effects
• The Fixed Effects and LSDV estimators provide exactly identical estimates.
• The nT − k degrees-of-freedom term in the Fixed Effects estimated standard errors must be adjusted to account for the extra n degrees of freedom used by the individual means.
• The computer can make this adjustment.
Fixed Effects
• Demeaning each observation by the individual-specific mean eliminates the need to create n dummy variables.
• FE is computationally much simpler.
Fixed Effects
• Fixed Effects (however estimated) discards all variation between individuals.
• Fixed Effects uses only variation over time within an individual.
• FE is sometimes called the “within” estimator.
Fixed Effects
• Fixed Effects discards a great deal of variation in the explanators (all variation between individuals).
• Fixed Effects uses n degrees of freedom.
• Fixed Effects is not efficient if $E(X_{it} v_i) = 0$.
• Could we use OLS?
Checking Understanding (cont.)
• Because X is uncorrelated with either $v$ or $\varepsilon$, OLS is consistent in the uncorrelated version of the error components DGP.
$$\mathrm{Var}(u_{it}) = \mathrm{Var}(v_i + \varepsilon_{it}) = \sigma_v^2 + \sigma_\varepsilon^2$$
• The error terms are homoskedastic.
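These moments are easy to check by simulation ($\sigma_v = 1$ and $\sigma_\varepsilon = 0.5$ chosen arbitrarily): the composite disturbance has constant variance $\sigma_v^2 + \sigma_\varepsilon^2$, but a nonzero covariance $\sigma_v^2$ across periods for the same individual.

```python
import numpy as np

rng = np.random.default_rng(4)
n, T = 5000, 2
sigma_v, sigma_e = 1.0, 0.5

v = rng.normal(scale=sigma_v, size=n)
eps = rng.normal(scale=sigma_e, size=(n, T))
u = v[:, None] + eps                  # composite disturbance u_it = v_i + e_it

var_u = u.var()                       # near sigma_v^2 + sigma_e^2 = 1.25
cov_within = np.cov(u[:, 0], u[:, 1])[0, 1]   # near sigma_v^2 = 1.0
```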
Checking Understanding (cont.)
• However, the covariance between disturbances for a given individual is nonzero:
$$\mathrm{Cov}(u_{it}, u_{is}) = \sigma_v^2 \quad (t \neq s)$$
Checking Understanding (cont.)
• In the presence of serial correlation, OLS is inefficient.
Random Effects
• When unobserved heterogeneity is uncorrelated with explanators, panel data
techniques are not needed to produce a consistent estimator.
• However, we do need to correct for serial correlation between observations of the same
individual.
Random Effects
• When $E(X_{it} v_i) \neq 0$, panel data provides a valuable tool for eliminating omitted variables bias. We use Fixed Effects to gain the benefits of panel data.
• When $E(X_{it} v_i) = 0$, panel data does not offer special benefits. We use Random Effects to overcome the serial correlation of panel data.
Random Effects
• The key idea of random effects:
  – Estimate $\sigma_v^2$ and $\sigma_\varepsilon^2$
  – Use these estimates to construct efficient weights for the panel data observations
Random Effects
• Once we have estimates of $\sigma_v^2$ and $\sigma_\varepsilon^2$, we can reweight the observations optimally.
• These calculations are complicated, but most computer packages can implement them.
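Under the hood, the standard random effects reweighting amounts to quasi-demeaning: subtract a fraction θ of each individual's mean, where θ depends on the variance components. A sketch with assumed (not estimated) variances:

```python
import numpy as np

# Assumed variance components (in practice these are estimated from the data).
sigma_v2, sigma_e2, T = 1.0, 0.25, 8

# theta = 0 reduces RE to pooled OLS; theta = 1 reduces it to fixed effects.
theta = 1 - np.sqrt(sigma_e2 / (sigma_e2 + T * sigma_v2))

# Quasi-demean a toy series for one individual.
y_i = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
y_quasi = y_i - theta * y_i.mean()
```

With a large individual variance component relative to the noise, θ is close to 1, so RE behaves almost like FE.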
Example: Production Functions
• We have data from 625 French firms from 16 countries for 8 years.
• We wish to estimate a Cobb–Douglas production function:
$$Q_{it} = b_0 L_{it}^{b_1} K_{it}^{b_2} \varepsilon_{it}$$
• Taking logs:
$$\ln(Q_{it}) = \ln(b_0) + b_1 \ln(L_{it}) + b_2 \ln(K_{it}) + \ln(\varepsilon_{it})$$
• We estimate using random effects.
Random Effects Estimation of a Cobb–Douglas Production Function for a Sample of Manufacturing Firms
Example: Production Functions
• The estimated coefficients of 0.30 for capital and 0.69 for labor are similar to estimates using US data.
• We also get similar results using fixed effects estimation.
Fixed Effects Estimation of a Cobb–Douglas Production Function for a Sample of Manufacturing Firms
Example: Production Functions
• We arrive at similar estimates using either random effects or fixed effects.
• Because only fixed effects controls for unobserved heterogeneity that is correlated with the explanators, the similarity between the two estimates suggests that unobserved heterogeneity is not creating a large bias in this sample.
Example: Production Functions
• The fixed effects estimator discards all variation between firms, and must use 624 more degrees of freedom than random effects.
• Moving from RE to FE increases the e.s.e. on capital from 0.0116 to 0.0145.
• The e.s.e. on labor moves from 0.0118 to 0.0132.
Example: Production Functions
• The RE estimator provides more precise estimates.
• We would prefer to use RE instead of FE.
• However, RE might be inconsistent if $E(X_{it} v_i) \neq 0$.
• We need a test to help determine whether it is safe to use RE.
The Hausman Test
• Hausman’s specification test for error components DGPs provides guidance on whether $E(X_{it} v_i) = 0$.
• The key idea: if $E(X_{it} v_i) \neq 0$, then the inconsistent RE estimator and the consistent FE estimator converge to different estimates.
The Hausman Test
• If $E(X_{it} v_i) = 0$, then the unobserved heterogeneity is uncorrelated with X and does not create a bias.
• RE and FE are both consistent.
• For two consistent estimators to provide significantly different estimates would be surprising.
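In the one-coefficient case the Hausman statistic has a simple closed form. This sketch uses illustrative numbers (the e.s.e.'s echo the labor figures above, but the coefficient gap is invented):

```python
# Hausman statistic for a single coefficient:
# H = (b_FE - b_RE)^2 / (Var(b_FE) - Var(b_RE)), compared to chi-square(1).
b_fe, se_fe = 0.69, 0.0132      # fixed effects estimate and e.s.e. (illustrative)
b_re, se_re = 0.66, 0.0118      # random effects estimate and e.s.e. (illustrative)

H = (b_fe - b_re) ** 2 / (se_fe ** 2 - se_re ** 2)
reject_re = H > 3.84            # 5% critical value for chi-square with 1 df
```

A coefficient gap that looks small (0.03) can still reject RE decisively when FE is precise, because the denominator (the excess variance of FE) is tiny.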
The Hausman Test
• We know the FE estimator is consistent even when $E(X_{it} v_i) \neq 0$.
• The problem with FE is its inefficiency: FE is not as precise as RE.
• Although FE is imprecise, it may provide a good enough estimate to detect a large bias in RE.
The Hausman Test
• If FE is very imprecise, then the Hausman test has very weak power and cannot rule out even large biases.
• If FE is very precise, then the Hausman test has very good power, but we gain little benefit from switching to the more efficient RE.
The Hausman Test
• If FE is somewhat precise, then the Hausman test can warn us away from using RE in the presence of a large bias, while leaving room for substantial efficiency gains from switching to RE.
• With the French manufacturing firms, FE is precise enough to reject the null even though the two estimates are fairly close.
Hausman Specification Test of French Firms’ Error Components’ Correlation with Explanators
The Hausman Test
• Warning: fixed effects exacerbates measurement error bias.
• There is likely to be less variation in X within the experience of a single individual than across several individuals.
• Small measurement errors can become large relative to the within-variation in X.
The Hausman Test
• The Hausman Test warns us that RE and FE provide significantly different estimates.
• This difference could arise because of omitted variables bias in RE, caused by $E(X_{it} v_i) \neq 0$.
• This difference could ALSO arise because of measurement error biases in FE.
Review
• Potential unobserved heterogeneity is a form of omitted variables bias.
• “Unobserved heterogeneity” refers to omitted variables that are fixed for an individual (at least over a long period of time).
• A person’s upbringing, family characteristics, innate ability, and demographics (except age) do not change.
Review
• Panel Data is data in which we observe repeated cross-sections of the same individuals.
• By far the leading type of panel data is repeated cross-sections over time.
Review
• The key feature of panel data is that we observe the same individual in more than one condition.
• Omitted variables that are fixed will take on the same values each time we observe the same individual.
Review
• We learned 3 different DGPs for panel data.
• In the distinct intercepts DGP, across samples we would observe the same individuals with the same unobserved heterogeneity.
• Each i has its own intercept, $b_{0i}$, that is fixed across samples.
Review
• Error components DGPs are suitable when we would draw different individuals across samples.
• When each i is drawn, its unobserved heterogeneity is captured in a $v_i$ term.
• We learned two error components DGPs, depending on whether the $v_i$ is correlated with the $X_{kit}$’s.
Review
• If $E(X_{it} v_i) \neq 0$, OLS would be inconsistent.
• By estimating a separate intercept for each individual, we can control for the $v_i$.
• We learned two equivalent strategies: LSDV and FE.
Review
• The simplest way to estimate separate intercepts for each individual is to use dummy variables.
• This method is called the least squares dummy variable (LSDV) estimator.
Review
• Least Squares Dummy Variable Estimator:
  1. Create a set of n dummy variables, $D^j$, such that $D^j = 1$ if $i = j$ (and 0 otherwise).
  2. Regress $Y_{it}$ against all the dummies, the $X_t$ variables, and the $X_{it}$ variables (you must omit the $X_i$ variables and the constant).
Review
• Demeaning each observation by the individualspecific mean eliminates the need to create n dummy variables.
• FE is computationally much simpler.
Review
• Fixed Effects (however estimated) discards all variation between individuals.
• Fixed Effects uses only variation over time within an individual.
• FE is sometimes called the “within” estimator.
Review
• Because X is uncorrelated with either $v$ or $\varepsilon$, OLS is consistent in the uncorrelated version of the error components DGP.
$$\mathrm{Var}(u_{it}) = \mathrm{Var}(v_i + \varepsilon_{it}) = \sigma_v^2 + \sigma_\varepsilon^2$$
• OLS disturbances are homoskedastic.
Review
• However, the covariance between disturbances for a given individual is nonzero:
$$\mathrm{Cov}(u_{it}, u_{is}) = \sigma_v^2 \quad (t \neq s)$$
Review
• When unobserved heterogeneity is uncorrelated with explanators, panel data techniques are not needed to produce a consistent estimator.
• However, we do need to correct for serial correlation between observations of the same individual.
Review
• When $E(X_{it} v_i) \neq 0$, panel data provides a valuable tool for eliminating omitted variables bias. We use Fixed Effects to gain the benefits of panel data.
Review
• When $E(X_{it} v_i) = 0$, panel data is less convenient than an equal-sized cross-sectional data set. We use Random Effects to overcome the serial correlation of panel data.
Review
• The key idea of random effects:
  – Estimate $\sigma_v^2$ and $\sigma_\varepsilon^2$
  – Use these estimates to construct efficient weights for the panel data observations
Random vs. Fixed Effects
• Random Effects
  – Small number of parameters
  – Efficient estimation
  – If you believe that differences across entities have some influence on your dependent variable, then use the Random Effects model.
  – It also accommodates time-invariant variables.
$$\mathit{Profit}_{it} = b_1 + b_2\,\mathit{Sale}_{it} + b_3\,\mathit{advcost}_{it} + u_{it}$$
  where $\mathit{Profit}_{it}$ is the profit of firm i in time t.
• Fixed Effects
  – Robust: generally consistent
  – Large number of parameters
  – More reasonable assumption
  – Precludes time-invariant regressors
  – Use fixed effects whenever you are interested in analyzing the impact of variables that vary over time.
  – Assumes correlation between the entity’s error term and the predictor variables.
  – FE removes the effect of time-invariant characteristics.
  – Designed to study the causes of changes within a country.
$$\mathit{Profit}_{it} = b_1 + b_2\,\mathit{Sale}_{it} + b_3\,\mathit{advcost}_{it} + v_i + \varepsilon_{it}$$
  where $\mathit{Profit}_{it}$ is the profit of firm i in time t.
Using panel data in Stata
• Data on n cases over t time periods, giving a total of n × t observations
• One record per observation, i.e., long format
• Stata tools for analyzing panel data begin with the prefix xt
• First we need to tell Stata that we have panel data, using xtset
Commands
• xtset company time
• xtdes
• xtsum
• Interpretation for xtsum:
  – For a time-invariant variable (ed), the within variation is zero.
  – For an individual-invariant variable (t), the between variation is zero.
  – For lwage, the within variation < the between variation.
• xtdata
• xtline revenue
• xtline revenue, overlay
Commands: Tabulation
• Panel tabulation for a variable (here south is a dummy):
  xttab south
• Interpretation for xttab:
  – Between percent: 20.59% were ever in the south.
  – Within percent: 94.9% of those ever in the south were always in the south.
Commands: Transition and correlation
• Transition probabilities for a variable:
  xttrans south, freq
• Interpretation: of the 28.99% of the sample ever in the south, 99.23% remained in the south the next period.
• correlate lwage L.lwage L2.lwage L3.lwage L4.lwage L5.lwage L6.lwage
Regression

1. reg revenue numberofbranch advcost
2. Pooled OLS with cluster-robust standard errors:
   regress lwage exp exp2 wks ed, noheader vce(cluster id)
3. xi: regress revenue numberofbranch advcost i.company
4. Pooled GLS estimator: xtgee. Pooled FGLS estimator with AR(2) error and cluster-robust se’s:
   xtgee lwage exp exp2 wks ed, corr(ar 2) vce(robust)
   Between estimator: xtreg lwage exp exp2 wks ed, be
Random and fixed effects
7. xtreg revenue numberofbranch, fe     (fixed effects)
   Coefficients of the regressors indicate how much Y changes when X increases by one unit.
8. Random effects estimator with cluster-robust se’s:
   xtreg lwage exp exp2 wks ed, re vce(robust) theta
   xtreg revenue numberofbranch, re     (random effects)
   The RE coefficient represents the average effect of X on Y when X changes across time and between entities by one unit.
Hausman test
• Hausman test (if the p-value is < 0.05, i.e., significant, use fixed effects). To run it:
  xtreg revenue numberofbranch, fe
  estimates store fixed
  xtreg revenue numberofbranch, re
  estimates store random
  hausman fixed random
• Testing for random effects: Breusch–Pagan Lagrange multiplier (LM)
• The LM test helps you decide between a random effects regression and a simple OLS regression.
  xtreg revenue numberofbranch, re
  xttest0
• If the p-value is higher than 0.05, we fail to reject the null and conclude that random effects is not appropriate. That is, there is no evidence of significant differences across entities, so you can run a simple OLS regression.
Compare different estimation
Cross-sectional dependence
• Cross-sectional dependence is a problem in macro panels with long time series (over 20-30 years). It is not much of a problem in micro panels (few years and a large number of cases). The null hypothesis in the Breusch–Pagan LM test of independence is that residuals across entities are not correlated.
  xtreg revenue numberofbranch, fe
  xttest2
• If the p-value is higher than 0.05, there is no cross-sectional dependence.
• Alternatively, we can use the Pesaran CD (cross-sectional dependence) test, which tests whether the residuals are correlated across entities:
  xtreg revenue numberofbranch, fe
  xtcsd, pesaran abs
• If the p-value is higher than 0.05, there is no cross-sectional dependence.

A test for heteroskedasticity and serial correlation tests
1. A test for heteroskedasticity
• The null is homoskedasticity (or constant variance).
• This is a user-written program; to install it, type:
  ssc install xttest3
  xttest3
2. Serial correlation tests apply to panels with long time series (20-30 years).
• A Lagrange multiplier test for serial correlation is available using the command xtserial.
• This is a user-written program; to install it, type: ssc install xtserial
  xtserial revenue numberofbranch