
Econ 140

MLR: Hypothesis Testing

Stephen Bianchi

Department of Economics
UC Berkeley

March 30, 2021


General joint hypotheses
I Joint hypotheses can be more complicated than just excluding
some independent variables.
I However, it is still straightforward to use the F-statistic.
I Example: house prices
Model:

ln(pricei) = β0 + β1 ln(assessi) + β2 ln(lotsizei) + β3 ln(sqrfti) + β4 bdrmsi + ui
where
price : house price
assess : assessed value (before sale)
lotsize : size of lot in square feet
sqrft : size of house in square feet
bdrms : number of bedrooms
General joint hypotheses

I Example: house prices


I Suppose we would like to test whether the assessed value is
"rational," in which case a 1% increase in assessed value
should lead to a 1% increase in the price of the house, i.e.,
β1,0 = 1.
I In addition, lotsize, sqrft, and bdrms should not affect the
price of the house, after controlling for the assessed value, i.e.,
β2,0 = 0, β3,0 = 0, β4,0 = 0.
I Hypothesis test:

H0 : β1 = 1, β2 = 0, β3 = 0, β4 = 0
H1 : β1 ≠ 1 or β2 ≠ 0 or β3 ≠ 0 or β4 ≠ 0

I As in the MLB example, we estimate the unrestricted model


and impose the null hypothesis values to get a restricted
model.
General joint hypotheses

I Example: house prices


I Unrestricted:

Yi = β0 + β1 X1i + β2 X2i + β3 X3i + β4 X4i + ui

I Restricted:

Yi = β0 + X1i + vi

or

Yi − X1i = β0 + vi
I We then proceed as before.
I Note: in this case we cannot use the R² version of the
F-statistic. This is because the dependent variable is different
in the unrestricted and restricted models, i.e., TSSur ≠ TSSr
(and hence, the TSS terms won't cancel).
General joint hypotheses

I Example: house prices


I OLS results:

ln(price)^ = 0.264 + 1.043 · ln(assess) + 0.0074 · ln(lotsize)
             (0.57)  (0.151)              (0.0386)
           − 0.1032 · ln(sqrft) + 0.0338 · bdrms
             (0.1384)             (0.0221)

n = 88, SSRur = 1.822, R²ur = 0.7728

Restricted model (imposing the null):

(ln(price) − ln(assess))^ = 0.0848
                            (0.0156)

n = 88, SSRr = 1.88

F = [(1.88 − 1.822)/4] / (1.822/83) = 0.661
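
I As a concrete check, here is a minimal sketch of this F-computation in Python with statsmodels, assuming a dataframe df whose columns match the variables above (price, assess, lotsize, sqrft, bdrms).

```python
import numpy as np
import statsmodels.formula.api as smf

# Unrestricted model
ur = smf.ols("np.log(price) ~ np.log(assess) + np.log(lotsize)"
             " + np.log(sqrft) + bdrms", data=df).fit()

# Restricted model: impose beta1 = 1 and beta2 = beta3 = beta4 = 0,
# i.e. regress ln(price) - ln(assess) on a constant only
r = smf.ols("I(np.log(price) - np.log(assess)) ~ 1", data=df).fit()

q = 4                                                # number of restrictions
F = ((r.ssr - ur.ssr) / q) / (ur.ssr / ur.df_resid)  # SSR form; the R^2 form is invalid here
```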
General joint hypotheses

I Example: house prices


I Critical values:

  Significance level   F4,83 critical value
  10%                  2.01
  5%                   2.48
  1%                   3.55

I Conclusion: there is no statistical evidence that assessed
values are not rational, or that lotsize, sqrft, and bdrms have
any effect after controlling for assessed value (i.e., we fail to
reject the null hypothesis at any standard level of significance).
Linear hypotheses

I Hypothesis test:

H0 : Σ_{j=0}^{k} cj βj = r  vs  H1 : Σ_{j=0}^{k} cj βj ≠ r

I Example: consider a simple model to capture returns to


education at junior colleges and 4-year colleges (universities).
I The population for this example includes working people with
at least a high school diploma.
Model:

ln(wagei) = β0 + β1 jci + β2 univi + β3 experi + ui

jc : number of years in junior college
univ : number of years in 4-year college
exper : months in the workforce
Linear hypotheses

I Example: returns to education


Question: is another year of junior college worth the same as
another year at a 4-year college?
Hypothesis test:

H0 : β1 = β2 vs H1 : β1 < β2

Note that we can rewrite this as

H0 : β1 − β2 = 0 vs H1 : β1 − β2 < 0

Thus we have

H0 : c0 β0 + c1 β1 + c2 β2 + c3 β3 = r

where c0 = 0, c1 = 1, c2 = −1, c3 = 0, r = 0
Linear hypotheses

I Example: returns to education


Null hypothesis: another year at a junior college leads to the
same percentage increase in wage as another year at a 4-year
college.
t-statistic:

t = (β̂1 − β̂2) / SE(β̂1 − β̂2)

I Is the t-statistic sufficiently less than zero to reject the null?
I Choose a significance level and obtain a critical value c from a
standard normal distribution.
I Reject H0 if t < c (c = −1.645 at the 5% level of
significance).
I The only difficulty is in obtaining the standard error.
Linear hypotheses

I Example: returns to education


OLS results:

ln(wage)^ = 1.472 + 0.0667 · jc + 0.0769 · univ + 0.0049 · exper
            (0.022)  (0.0065)     (0.0024)        (0.0002)

n = 6,763, R² = 0.222, tjc = 10.25, tuniv = 32.56

I The coefficient estimates for jc and univ are both "highly"
statistically significant.
I But we want to test whether the difference is statistically
significant.

β̂1 − β̂2 = −0.0102

which implies that a year at a junior college increases earnings
by approximately 1 percentage point less than a year at a
4-year college.
Linear hypotheses
I Example: returns to education
I But the results don't contain enough information to get the
standard error.

Var(β̂1 − β̂2) = Var(β̂1) + Var(β̂2) − 2 Cov(β̂1, β̂2)

Estimator:

SE(β̂1 − β̂2) = (SE(β̂1)² + SE(β̂2)² − 2 S_{β̂1,β̂2})^{1/2}

where S_{β̂1,β̂2} denotes an estimate of Cov(β̂1, β̂2).

I Rather than trying to compute SE(β̂1 − β̂2) directly, we take
another approach: transform the model.
I Define a new parameter θ1 as the difference between β1 and
β2, i.e., θ1 = β1 − β2.
I Our hypothesis test becomes

H0 : θ1 = 0 vs H1 : θ1 < 0
Linear hypotheses

I Example: returns to education


I The t-statistic is then

t = θ̂1 / SE(θ̂1)

I Since θ1 = β1 − β2 we can also write β1 = θ1 + β2 and plug
this into the model

ln(wagei) = β0 + (θ1 + β2) jci + β2 univi + β3 experi + ui
          = β0 + θ1 jci + β2 (jci + univi) + β3 experi + ui

I Define a new variable totcolli = jci + univi, then we have

ln(wagei) = β0 + θ1 jci + β2 totcolli + β3 experi + ui

Linear hypotheses

I Example: returns to education


I OLS results:

ln(wage)^ = 1.472 − 0.0102 · jc + 0.0769 · totcoll + 0.0049 · exper
            (0.022)  (0.0066)     (0.0024)           (0.0002)

n = 6,763, R² = 0.222, tjc = −1.53

I There is some, but not strong, statistical evidence against H0.
I Note: all of the other coefficient estimates and standard errors
are the same as before. This is a way to check whether the
model has been transformed correctly.
Linear hypotheses

I This can be extended to any number of coefficients.


I Hypothesis test:

H0 : β1 = β2 = β3 vs H1 : β1 ≠ β2 or β2 ≠ β3 or β1 ≠ β3

I Let θ1 = β1 − β3 and θ2 = β2 − β3. Then we have
β1 = θ1 + β3 and β2 = θ2 + β3.

I Plug these into our model to get

ln(wagei) = β0 + (θ1 + β3) jci + (θ2 + β3) univi + β3 yrexpi + ui
          = β0 + θ1 jci + θ2 univi + β3 (jci + univi + yrexpi) + ui

where yrexpi = experi / 12.

I Define: totedexpi = jci + univi + yrexpi, then we have

ln(wagei) = β0 + θ1 jci + θ2 univi + β3 totedexpi + ui
Linear hypotheses

I This can be extended to any number of coefficients.


I Hypothesis test:

H0 : θ1 = 0, θ2 = 0 vs H1 : θ1 ≠ 0 or θ2 ≠ 0

I This is just a joint test with two exclusion restrictions!


Specification

I Model:

Yi = β0 + β1 X1i + β2 X2i + β3 X3i + ui

where Yi is the dependent variable, X1i is the variable of
interest, X2i and X3i are control variables, and ui is the error
term.

I Returns to education:

ln(wagei) = β0 + β1 jci + β2 univi + β3 experi + ui

I Suppose we are primarily interested in the coefficient on jc.


I β̂1 is an estimate of the value of a year at a junior college,
controlling for years at a 4-year college and work experience.
I We "control" for univ and exper because we think they have
an effect on wages and they are correlated with jc.
I In which case, leaving them out leads to omitted variable bias
(OVB).
Specification

I Returns to education:

ln(wagei) = β0 + β1 jci + β2 univi + β3 experi + ui

I We may not be interested in the coefficients on univ and
exper, but we want to get an unbiased estimate of β1.
I Recall the conditional mean zero errors assumption:

E[ui | X1i, X2i, X3i] = 0

I If the conditional mean zero errors assumption holds, then all
of the coefficient estimates will be unbiased and have a causal
interpretation.
I Under a weaker assumption, β̂1 will be an unbiased estimator
of β1 and have a causal interpretation, whether or not this is
also true for β̂2 and β̂3.
Specification
I Conditional mean independence:

E[ui | X1i, X2i, X3i] = E[ui | X2i, X3i]

i.e., the expectation is independent of X1i.

I If the conditional mean independence assumption holds, then
β̂1 will be an unbiased estimator of β1 and have a causal
interpretation.
I Consider a simpler model for illustration:

Yi = β0 + β1 X1i + β2 X2i + ui

where X2i is the control variable.

I WTS ("want to show"): if

E[ui | X1i, X2i] = E[ui | X2i]

then β̂1 will be an unbiased estimator of β1.


Specification

I "Illustration":
I Suppose E[ui |X2i ] is linear in X2i , i.e.,

E[ui |X2i ] = 0 + 2 X2i

where 0 and 2 are constants. 1


I Define vi = ui E[ui |X1i , X2i ], then

E[vi |X1i , X2i ] = E[ui |X1i , X2i ] E[ui |X1i , X2i ] = 0

I Using this, we can write

Yi = 0 + 1 X1i + 2 X2i + E[ui |X1i , X2i ] + vi


| {z }
ui

1
Statistical fact: if the random variables X and Y are jointly normally
distributed then E[Y |X ] = a + bX , for suitably chosen values of a and b.
Specification

I "Illustration":
I Then we have

Yi = 0 + 1 X1i + 2 X2i + E[ui |X1i , X2i ] + vi


| {z }
ui
= 0 + 1 X1i + 2 X2i + E[ui |X2i ] + vi
= 0+ 1 X1i + 2 X2i + ( 0 + 2 X2i ) + vi
= ( 0 + 0) + 1 X1i +( 2 + 2 )X2i + vi
= ↵0 + 1 X1i + ↵2 X2i + vi

where ↵0 = 0 + 0 and ↵2 = 2 + 2.

I Since E[vi |X1i , X2i ] = 0, ˆ1 will be an unbiased and have a


causal interpretation.

I Note: control variables are useful if conditional mean


independence holds, whether or not they have any effect on
Y.
Econ 140
MLR: Nonlinearities & Dummies

Stephen Bianchi

Department of Economics
UC Berkeley

April 1, 2021
Nonlinear regression
I Nonlinear regression falls into two categories:
(i) Nonlinear in the dependent variable and/or nonlinear in the
independent variables.
I But linear in the coefficients.
(ii) Nonlinear in the coefficients. We will leave this to a more
advanced course in econometrics.
I First category:

Yi = β0 + β1 X1i + β2 X2i² + ui (quadratic)
Yi = β0 + β1 X1i + β2 ln(X2i) + ui (linear-log)
Yi = β0 + β1 X1i + β2 X1i² + β3 X1i³ + ui (polynomial)

More generally:

Yi = β0 + β1 g1(X1i, X2i) + β2 g2(X1i, X2i) + ui

where g1 and g2 are nonlinear functions of (potentially all) the
independent variables.
Nonlinear regression

I For example: suppose

g1(X1i, X2i) = ln(X1i)
g2(X1i, X2i) = X1i² + X2i²

Then

Yi = β0 + β1 ln(X1i) + β2 (X1i² + X2i²) + ui
   = β0 + β1 X̃1i + β2 X̃2i + ui

where X̃1i = ln(X1i) and X̃2i = X1i² + X2i², i.e., our regression
equation is still linear in the coefficients, as well as in the
redefined independent variables.
Nonlinear regression
I Polynomials:

Yi = β0 + β1 X1i + β2 X1i² + · · · + βk X1i^k + ui

I All k regressors are functions of the same independent
variable!
I k = 2 (quadratic), k = 3 (cubic)
I The interpretation of the coefficients is different.
I Suppose

Yi = β0 + β1 X1i + β2 X1i² + ui

then

∂Y/∂X1 = β1 + 2β2 X1

I A common joint hypothesis test with polynomial regressions is
to test whether the population regression function is linear in
X1:

H0 : β2 = 0, . . . , βk = 0 vs
H1 : βj ≠ 0 for at least one j ∈ {2, . . . , k}
Nonlinear regression
I Quadratic example: suppose we think that average hourly
earnings depend on experience in a nonlinear way.

Model:

wagei = β0 + β1 experi + β2 experi² + ui

OLS results:

wage^ = 3.73 + 0.298 · exper − 0.0061 · exper²

n = 526, R² = 0.093, texper = 7.66, texper² = −7.57

Interpretation:

∂wage/∂exper = 0.298 − 2(0.0061) · exper
∂²wage/∂exper² = −0.0122
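
I A short sketch of this quadratic fit in Python (statsmodels), assuming a dataframe df with wage and exper; the turning point −β̂1/(2β̂2) reproduces the 24.4 years discussed on the next slide.

```python
import numpy as np
import statsmodels.formula.api as smf

# df is assumed to hold wage and exper
m = smf.ols("wage ~ exper + I(exper**2)", data=df).fit()

b1 = m.params["exper"]
b2 = m.params["I(exper ** 2)"]       # patsy's name for the squared term
turning_point = -b1 / (2 * b2)       # exper at which dwage/dexper = 0 (~24.4)
```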
Nonlinear regression

I Comments:
I Experience has a diminishing effect on wage with this
estimated equation.
The 1st year of experience is worth an average of $0.298.
The 2nd year of experience is worth an average of
$(0.298 − 0.0122) = $0.286.
I The change in wage will be positive provided exper < 24.4
years, i.e.,

∂wage/∂exper > 0, if 0.298 > 2(0.0061) · exper

I There is no change in wage if exper = 24.4 years, i.e.,

∂wage/∂exper = 0, if 0.298 = 2(0.0061) · exper
Nonlinear regression

I Comments:
I The change in wage will be negative if exper > 24.4 years,
i.e.,

∂wage/∂exper < 0, if 0.298 < 2(0.0061) · exper
I How do we interpret this?
I Some possibilities:
(i) It could be that few people in the sample have more than 24.4
years of experience. If this were true, this would not be much
of a concern. But it turns out that 28% of the people in this
sample have more than 24.4 years of experience.
(ii) Of course, it is possible that the return to experience becomes
negative at some point. However, 24.4 years seems a little
early!
(iii) It is more likely that this is either a biased estimate due to
OVB, or the quadratic model is misspecified.
Logarithmic Regression

I This is the most popular device for including nonlinearities in
a regression model.
I Three types:
(i) Log-linear

ln(Yi) = β0 + β1 X1i + β2 X2i + ui

(ii) Linear-log

Yi = β0 + β1 ln(X1i) + β2 X2i + ui

(iii) Log-log

ln(Yi) = β0 + β1 ln(X1i) + β2 X2i + ui

Logarithmic Regression
I Interpretation of coefficients:
(i) Log-linear

ΔX1 = 1, then ΔY/Y = β1
ΔX2 = 1, then ΔY/Y = β2

(ii) Linear-log

ΔX1/X1 = 0.01, then ΔY = 0.01 β1
ΔX2 = 1, then ΔY = β2

(iii) Log-log (constant elasticity model)

ΔX1/X1 = 0.01, then ΔY/Y = 0.01 β1
ΔX2 = 1, then ΔY/Y = β2
Logarithmic Regression
I Example: consider a model relating the median housing price
to various characteristics of a community.

Model:

ln(pricei) = β0 + β1 ln(noxi) + β2 ln(disti) + β3 roomsi + β4 stratioi + ui

nox : amount of nitrogen oxide in the air (ppm)
dist : weighted distance of community from 5
employment centers (miles)
rooms : average number of rooms in houses
stratio : average student-teacher ratio

I β1 is the price elasticity with respect to nitrogen oxide
I Hypothesis test: H0 : β1 = −1 vs H1 : β1 ≠ −1
I t-statistic:

t = (β̂1 − β1,0) / SE(β̂1) = (β̂1 + 1) / SE(β̂1)
Logarithmic Regression

I Example: consider a model relating the median housing price


to various characteristics of a community.
I OLS results:

ln(price)^ = 11.08 − 0.954 · ln(nox) − 0.134 · ln(dist)
           + 0.255 · rooms − 0.052 · stratio

n = 506, R² = 0.584, SE(β̂1) = 0.127

I Comments:
I All of the estimated coefficients have the expected signs and
all are (individually) statistically different from zero at the 5%
level.
I

tnox = (−0.954 + 1) / 0.127 = 0.3622

Hence, we cannot reject H0 that the elasticity is −1.
Logarithmic Regression

I Example: consider a model relating the median housing price


to various characteristics of a community.
I Comments:
I If nox increases by 1%, we expect the median house price to
decrease by 0.954%.
I If rooms increases by 1, we expect the median house price to
increase by (approximately) 25.5%.

I To summarize:

  Dep Var \ Regressor   X                  ln(X)
  Y                     ΔY = β1 ΔX         ΔY = β1 (ΔX/X)
  ln(Y)                 ΔY/Y = β1 ΔX       ΔY/Y = β1 (ΔX/X)
Single binary independent variable

I Example:

wagei = β0 + δ0 femalei + β1 educi + ui

where

femalei = 1 if female, 0 if male

I Interpretation of δ0: difference in hourly wage for men and
women, given the same amount of education.
I We can use this model to investigate the issue of wage
discrimination against women (i.e., if δ0 < 0, then there is
wage discrimination).
I This can be depicted graphically as a parallel shift of the
regression line.
Single binary independent variable

I Since we have used female in the regression, β0 is the
intercept for males (i.e., female = 0) and δ0 is the difference
in intercepts between males and females.
I Hypothesis test for wage discrimination:

H0 : δ0 = 0 vs H1 : δ0 ≠ 0, or
H0 : δ0 = 0 vs H1 : δ0 < 0

where the latter version tests whether the discrimination, if
any, is specifically against women.
Multiple binary independent variables

I This can be extended to multiple binary variables.


I Example:

wagei = β0 + β1 educi + β2 southi + β3 northi + β4 easti + ui

where

southi = 1 if obs from the south region, 0 otherwise
northi = 1 if obs from the north region, 0 otherwise
easti = 1 if obs from the east region, 0 otherwise

The omitted region is west.
Multiple binary independent variables

I Then β2, β3, β4 represent the change in wage versus the
west, holding education constant.
I Hypothesis test for regional wage differentials:

H0 : β2 = 0, β3 = 0, β4 = 0 vs
H1 : β2 ≠ 0 or β3 ≠ 0 or β4 ≠ 0
Econ 140
MLR: Interaction Terms, Internal Validity

Stephen Bianchi

Department of Economics
UC Berkeley

April 6, 2021
Interaction terms

I Sometimes the partial effect of an explanatory variable
depends on yet another explanatory variable.
I Example:

pricei = β0 + β1 sqrfti + β2 bdrmsi + β3 (sqrfti · bdrmsi) + ui

where (sqrfti · bdrmsi) is the interaction term. Then

∂price/∂bdrms = β2 + β3 sqrft

I If β3 < 0, this implies that an additional bedroom yields a
smaller increase in price for larger houses.
I In other words, there is an interaction effect between square
footage and number of bedrooms.
Interaction terms

I We will consider three types of interaction terms:


(i) Interactions involving two binary variables.
(ii) One binary variable and one quantitative variable.
(iii) Two quantitative variables (as in the example).

I Two binary variables:

Yi = β0 + β1 X1i + β2 D2i + β3 D3i + β4 (D2i · D3i) + ui

I This (also) represents parallel shifts of the regression line.
I Intercept:

  D2 \ D3   0           1
  0         β0          β0 + β3
  1         β0 + β2     β0 + β2 + β3 + β4
Interaction terms

I Example: suppose we want to look at wage differences


among 4 groups.
(i) unmarried men
(ii) married men
(iii) unmarried women
(iv) married women

I First approach: set up binary variables for each category (and
leave one out).

ln(wagei) = β0 + β1 educi + β2 experi + β3 ufemi +
            β4 mfemi + β5 mmalei + ui
Interaction terms

I OLS results:

ln(wage)^ = 0.356 + 0.087 · educ + 0.0073 · exper − 0.111 · ufem
          − 0.141 · mfem + 0.323 · mmale

I Benchmark (omitted category): unmarried males

  married males      +32.3%
  unmarried females  −11.1%
  married females    −14.1%

I Second approach: set up binary variables for female and
married, and interact.

ln(wagei) = β0 + β1 educi + β2 experi + β3 marriedi +
            β4 femalei + β5 (femalei · marriedi) + ui
Econ 140
Internal Validity & Panel Data

Stephen Bianchi

Department of Economics
UC Berkeley

April 8, 2021
Errors-in-variables (measurement error)
I Sometimes we can’t collect data on the variable that truly
affects economic behaviour.
I But we can get data that is an imprecise measurement of that
variable.
I Good example: reported versus actual annual income
I Consider the simple regression model

Yi = 0 + 1 Xi + ui

where E[ui |Xi⇤ ] = 0.


I However, Xi⇤ is not observed, instead we observe

Xi = Xi⇤ + ei

where ei represents measurement error and we assume


E[ei |Xi⇤ ] = 0 1 . We also assume E[ui |Xi ] = 0 and ui is
uncorrelated with ei .
1
This is called the classic errors-in-variables (CEV) assumption.
Errors-in-variables (measurement error)
I We would like to know what happens to our estimate of β1
when we replace Xi* with Xi and run the regression.
I Plugging Xi* = Xi − ei into our model gives

Yi = β0 + β1 (Xi − ei) + ui = β0 + β1 Xi + (ui − β1 ei)

Hence

β̂1 →p β1 + Cov(Xi, ui − β1 ei) / σX²

I Under our assumption we have

Cov(Xi, ei) = E[Xi ei] − E[Xi]E[ei]
            = E[Xi ei]
            = E[Xi* ei] + E[ei²]
            = σe²

In other words, under our assumption Xi and ei are correlated.
Errors-in-variables (measurement error)
I Now

Cov(Xi, ui − β1 ei) = Cov(Xi, ui) − β1 Cov(Xi, ei)
                    = −β1 Cov(Xi, ei)
                    = −β1 σe²

and

σX² = σX*² + σe²

hence

β̂1 →p β1 − β1 σe²/(σX*² + σe²) = β1 (1 − σe²/(σX*² + σe²))

This implies that |plim β̂1| < |β1| (i.e., β̂1 is always closer to
zero than β1, in the limit). This is called attenuation bias.

I Possible remedies for CEV:
1. Get more accurate measurements.
2. Use IV.
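
I The attenuation formula is easy to verify by simulation; a minimal sketch, assuming classical measurement error with Var(X*) = Var(e) = 1, so that plim β̂1 = β1/2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta0, beta1 = 100_000, 1.0, 2.0

x_star = rng.normal(size=n)       # true regressor, variance 1
e = rng.normal(size=n)            # measurement error, variance 1
u = rng.normal(size=n)

y = beta0 + beta1 * x_star + u
x = x_star + e                    # observed, mismeasured regressor

# OLS slope of y on the mismeasured x
b1_hat = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
# Attenuation: b1_hat ~= beta1 * var(x*)/(var(x*) + var(e)) = 2 * 0.5 = 1
```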
Simultaneous causality
I Causality runs from independent variable(s) to dependent
variable and from dependent variable to independent
variable(s).
I Mathematically:

Yi = β0 + β1 Xi + ui, and
Xi = γ0 + γ1 Yi + vi,

where we assume Cov(vi, ui) = 0.

I This leads to correlation between Xi and ui:

Cov(Xi, ui) = Cov(γ0 + γ1 Yi + vi, ui)
            = Cov(γ1 Yi, ui) + Cov(vi, ui)
            = γ1 Cov(Yi, ui)
            = γ1 Cov(β0 + β1 Xi + ui, ui)
            = γ1 β1 Cov(Xi, ui) + γ1 σu²
Simultaneous causality
I Solving for Cov(Xi, ui) gives

Cov(Xi, ui) = (γ1 / (1 − γ1 β1)) σu² ≠ 0,

unless γ1 = 0.

I Hence, we will get biased and inconsistent estimates of β1.
I Example: cities often want to determine whether additional
law enforcement will lower murder rates.

Model: murdpci = β0 + β1 polpci + ui

I But it is entirely plausible that a city's spending on law
enforcement is determined (at least in part) by its murder
rate, i.e.,

polpci = γ0 + γ1 murdpci + vi
I Possible remedy: use IV.
Panel Data
I Information recorded for an entity (individual, household,
firm, state, country, etc) over several periods of time (day,
month, quarter, year).

  Entity \ Time   1      2      · · ·   T
  1               Y11    Y12    · · ·   Y1T
  2               Y21    Y22    · · ·   Y2T
  ...
  N               YN1    YN2    · · ·   YNT

I A balanced panel has no missing observations.
I With N entities and T time periods we have N × T
observations.
I An unbalanced panel has missing observations.
I With N entities and T time periods we have less than N × T
observations.
Example: traffic fatalities

I We have state level panel data on traffic fatalities and various


driving laws and demographics.
I Insurance rates tend to be higher for younger drivers. We will
explore one reason this may be the case.
I Our dataset has 25 years (1980-2004) of data for 48 states,
for a total of 1200 observations.
I We specify the following model:

tfrit = β0 + β1 p1424it + uit

where tfrit is the total number of fatalities per 100,000 of
population and p1424it is the percentage of the population
between the ages of 14 and 24.
Example: traffic fatalities

I For our initial analysis, we separately run cross-sectional
regressions at two points in time, 1980 and 1992.
I The results are

1980 : tfr^ = 6.45 + 1.69 · p1424
                     (1.41)

1992 : tfr^ = 15.16 + 2.26 · p1424
                      (1.01)

giving a t-stat for the 1980 slope coefficient of 1.20, and a
t-stat for the 1992 slope coefficient of 2.24.

I At the 5% significance level, the slope coefficient is not
consistently statistically significant.
I This suggests that there might be a link between the
percentage of young people and traffic fatalities.
Example: traffic fatalities
I It seems likely that this simple regression suffers from omitted
variable bias.
I We can try to control for more factors.
I But it may be difficult to control for some factors.
I Specifically, those factors that are unobserved or for which we
have collected no data.
I One way to use panel data is to view the unobserved factors
as being one of three types:
I Those that are constant over time.
I Those that are constant across entities.
I Those that vary over time and across entities.
I Consider the following model:

tfrit = β0 + β1 p1424it + β2 Zi + uit

where Zi represents unobserved factors that vary across states,
but do not vary over time (hence, no t subscript).
Example: traffic fatalities
I Writing out the equations for 1980 and 1992 gives

tfri,1980 = β0 + β1 p1424i,1980 + β2 Zi + ui,1980
tfri,1992 = β0 + β1 p1424i,1992 + β2 Zi + ui,1992

I Notice that we can eliminate the effect of Zi by subtracting
the first equation from the second equation, giving

tfri,1992 − tfri,1980 = β1 (p1424i,1992 − p1424i,1980) + (ui,1992 − ui,1980)

I Thus, if we let

dtfri = tfri,1992 − tfri,1980
dprci = p1424i,1992 − p1424i,1980
vi = ui,1992 − ui,1980
Example: traffic fatalities
I We can specify our model as

dtfri = δ0 + β1 dprci + vi

I This is sometimes called the "first-differenced" equation.
I We have included an intercept to allow for the possibility that
the mean change in the fatality rate, in the absence of a
change in the percentage of young people, is nonzero.
I Our results from the estimation of this model are

dtfr^ = 4.26 + 2.76 · dprc
        (4.79) (1.07)

giving a t-stat for the intercept of 0.89 and a t-stat for the
slope coefficient of 2.58.

I This regression controls for factors that affect traffic fatality
rates that vary across states (e.g., speed limits) but do not
vary across time.
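
I A minimal sketch of the first-differenced regression in Python (statsmodels), assuming a dataframe wide with one row per state and hypothetical column names tfr_1980, tfr_1992, p1424_1980, p1424_1992.

```python
import statsmodels.formula.api as smf

# wide is assumed to hold one row per state
wide["dtfr"] = wide["tfr_1992"] - wide["tfr_1980"]
wide["dprc"] = wide["p1424_1992"] - wide["p1424_1980"]

# The intercept allows a nonzero mean change in the fatality rate
fd = smf.ols("dtfr ~ dprc", data=wide).fit()
```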
Fixed Effects Regression
I We can write the model from our example more generally as

Yit = β0 + β1 Xit + β2 Zi + uit

where Zi is time invariant.

I The goal is to estimate β1 while controlling for the
unobserved time-invariant state characteristics Zi.
I Let's rearrange our model to get

Yit = (β0 + β2 Zi) + β1 Xit + uit

and let αi = β0 + β2 Zi. This yields

Yit = αi + β1 Xit + uit   (*)

I We can think of αi as state-specific intercepts.
I These are more generally called entity fixed effects.
Fixed Effects Regression
I Note that β1 is the same for all states, i.e., we have a
regression line for each state with intercept αi and slope β1.
In other words, a family of parallel lines.
I The state-specific intercepts in this model can also be
expressed in terms of binary variables.
I Let

Dji = 1 if i = j, 0 otherwise

I Using these "entity dummies" we write our model as

Yit = β0 + β1 Xit + γ2 D2i + · · · + γN DNi + uit

I In our original model (*) the intercept for state i is given by
αi = β0 + β2 Zi.
I In this model the intercept for state i is given by β0 + γj. That
is, αi = β0 + γj (when i = j).
Fixed Effects Regression
I Thus we have two equivalent ways of writing the fixed effects
regression model.

I Graphically, we have
Fixed Effects Regression
I Using entity dummies, we write the regression model for our
traffic fatality example as

tfrit = β0 + β1 p1424it + γ2 D2i + · · · + γ48 D48i + uit

I Based on this specification, our estimation results (using the
data for 1980 and 1992) are

tfr^ = 1.85 · p1424 + state fixed effects
       (0.23)

which gives a t-stat for the estimate of the slope coefficient of
8.04.
I But now we can open it up to all 25 years of data. Our
estimation results (using the data for all years) are

tfr^ = 1.20 · p1424 + state fixed effects
       (0.12)

which gives a t-stat for the estimate of the slope coefficient of
10.
Fixed Effects Regression
I Another approach: demeaning
I Starting with our structural equation

Yit = β0 + β1 Xit + β2 Zi + uit

I Take the average of both sides across time

(1/T) Σt Yit = β0 + β1 (1/T) Σt Xit + β2 Zi + (1/T) Σt uit

which we write as

Ȳi = β0 + β1 X̄i + β2 Zi + ūi

I We then subtract this from the original equation to get

Yit − Ȳi = β1 (Xit − X̄i) + (uit − ūi)

Fixed Effects Regression
I Letting

Ỹit = Yit − Ȳi
X̃it = Xit − X̄i
ũit = uit − ūi

we can write the model more compactly as

Ỹit = β1 X̃it + ũit

I This is a simple bivariate regression (with demeaned
variables), hence

β̂1 = Σi Σt X̃it Ỹit / Σi Σt X̃it² = SX̃Ỹ / SX̃²
    = β1 + Σi Σt X̃it ũit / Σi Σt X̃it²

where the sums run over i = 1, . . . , N and t = 1, . . . , T.
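
I The demeaned estimator can be computed directly; a sketch assuming a long-format dataframe df with columns state, tfr, and p1424.

```python
# Entity-demean Y and X within each state
g = df.groupby("state")[["tfr", "p1424"]]
dm = df[["tfr", "p1424"]] - g.transform("mean")

# Bivariate regression through the origin on the demeaned variables
beta1_hat = (dm["p1424"] * dm["tfr"]).sum() / (dm["p1424"] ** 2).sum()
```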
Fixed Effects Regression*
I When T = 2, all three methods give the same estimate of β1:
(i) differencing (when you regress without a constant)
(ii) binary variables
(iii) entity demeaning

I To see the equivalence of (i) and (iii), note that when T = 2
we can write our differencing model (without a constant) as

(Yi2 − Yi1) = β1 (Xi2 − Xi1) + (ui2 − ui1)

I Then our estimator is given by

β̂1BA = Σi (Yi2 − Yi1)(Xi2 − Xi1) / Σi (Xi2 − Xi1)²

where BA refers to "before and after."

I Note that in a regression model without a constant,
β1 = E[XY]/E[X²].
Fixed Effects Regression*
I Using the entity demeaning approach we have

X̃i1 = Xi1 − (Xi1 + Xi2)/2
     = (1/2) Xi1 − (1/2) Xi2
     = −(1/2)(Xi2 − Xi1)

and

X̃i2 = Xi2 − (Xi1 + Xi2)/2
     = (1/2)(Xi2 − Xi1)

I Similarly, we have

Ỹi1 = −(1/2)(Yi2 − Yi1) and Ỹi2 = (1/2)(Yi2 − Yi1)
Fixed Effects Regression*

I Plugging these into our entity demeaned estimator of β1 gives

β̂1DM = Σi Σt X̃it Ỹit / Σi Σt X̃it²
      = Σi [(1/4)(Yi2 − Yi1)(Xi2 − Xi1) + (1/4)(Yi2 − Yi1)(Xi2 − Xi1)]
        / Σi [(1/4)(Xi2 − Xi1)² + (1/4)(Xi2 − Xi1)²]
      = Σi (Yi2 − Yi1)(Xi2 − Xi1) / Σi (Xi2 − Xi1)²

where DM refers to "demeaned."

I Thus we see that β̂1DM = β̂1BA.

I Terminology: β̂1DM is most commonly called the fixed
effects estimator.
Fixed Effects Standard Errors

I With panel data, the regression errors can be correlated over


time within an entity.
I This is called serial correlation or autocorrelation.

I This does not introduce bias into the estimate of β1, but it
does affect the standard error of β̂1.
I To correct for this, we use so-called clustered standard errors.
I So-called because they are "clustered" within an entity.

I Let’s take a look at the assumptions under which the fixed


effects estimator is consistent and asymptotically normal.
Interaction terms

I OLS results:

ln(wage)^ = 0.356 + 0.087 · educ + 0.0073 · exper
          − 0.111 · female + 0.323 · married
          − 0.353 · (female · married)

I Benchmark: unmarried males

  married males      +32.3%
  unmarried females  −11.1%
  married females    (0.323 − 0.111 − 0.353) = −14.1%

I Hypothesis test on interaction term coefficient (β5):

H0 : gender differential does not depend on marital status
Interaction terms
I One binary variable and one quantitative variable:
I This allows for different slopes in a regression.

ln(wagei) = β0 + β1 educi + β2 experi + β3 femalei +
            β4 (femalei · educi) + ui

men: intercept β0, slope (wrt educ) β1
women: intercept β0 + β3, slope (wrt educ) β1 + β4

I Hypothesis test: controlling for experience, is the return to
education the same for men and women?

H0 : β4 = 0 vs H1 : β4 ≠ 0

Since

∂ln(wage)/∂educ = β1 + β4 female

if β4 = 0, then additional years of education result in the same
percentage increase in wage for men and women.
Interaction terms

I One binary variable and one quantitative variable:
I OLS results:

ln(wage)^ = 0.461 + 0.093 · educ + 0.009 · exper − 0.296 · female
          − 0.004 · (female · educ)

n = 526, R² = 0.35, SE(β̂3) = 0.203, SE(β̂4) = 0.016

I Comments:
I There is no statistical evidence against H0.
I Further, the coefficient on female is not statistically different
from zero (even at the 10% level, since tfemale = −1.46 and
the critical value is −1.645).
I Note: dropping the interaction term and running the
regression gives β̂3 = −0.344 and SE(β̂3) = 0.0347, so that
tfemale = −9.91 (which is "highly" significant at any standard
level of significance).
I Hence, when we add the interaction term we lose the
statistical significance of β̂3. Why?
Interaction terms

I One binary variable and one quantitative variable:
I Comments:
I There is a high correlation between female and
(female · educ), which blows up the standard error.
I β3 measures the wage differential between men and women
when educ = 0.
I It might be more interesting to estimate the gender
differential at the average education level in the sample, which
is educ̄ = 12.56.
I To do this, we replace (female · educ) with
(female · (educ − educ̄)) in the regression.
I When we do this, we get β̂3 = −0.344 and SE(β̂3) = 0.0376.
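
I A sketch of this recentering in Python (statsmodels), assuming df holds wage, educ, exper, and a 0/1 female dummy; the coefficient on female then measures the gap at average education.

```python
import numpy as np
import statsmodels.formula.api as smf

# Center educ at its sample mean before interacting
df["educ_c"] = df["educ"] - df["educ"].mean()

m = smf.ols("np.log(wage) ~ educ + exper + female + female:educ_c", data=df).fit()
gap_at_mean_educ = m.params["female"]   # about -0.344 in this sample
```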
Interaction terms

I Note:

Corr(female, female · educ) = 0.96
Corr(female, female · (educ − educ̄)) = −0.07

Hence, the effect on SE(β̂3) is much smaller!
Interaction terms

I Note:

Cov(female, female · educ) = 3.08

Cov(female, female · (educ − educ̄)) = Cov(female, female · educ)
                                      − Cov(female, female · educ̄)
                                    = Cov(female, female · educ)
                                      − educ̄ · Var(female)
                                    = 3.08 − (12.56)(0.25) = −0.06
Validity of a statistical analysis
I There are two types of validity that characterize a statistical
analysis:
I External validity is the predictive value of the analysis’ findings
in a different context (i.e., can the findings be generalized
beyond the population that was studied).
I Internal validity is the question of whether the analysis
successfully uncovers causal effects for the population being
studied.
I Internal validity hinges on two things:
1. The estimator of the causal effect should be consistent
(unbiased is nice too, but not always feasible).
2. Hypothesis tests should have the desired significance level (i.e.,
you should be using the correct standard errors).
I In general, the OLS estimator of the causal effect will be
biased and inconsistent if the conditional mean zero error
assumption does not hold.
Validity of a statistical analysis
I In the context of simple (bivariate) regression (population
model: Yi = β0 + β1 X1i + vi), we have:

β̂1 = β1 + [Σi (X1i − X̄1) vi] / [Σi (X1i − X̄1)²]

I Since

E[β̂1 | X1] = β1 + [Σi (X1i − X̄1) E[vi | X1i]] / [Σi (X1i − X̄1)²]

I and

β̂1 = β1 + SX1v / S²X1 →p β1 + σX1v / σ²X1 = β1 + ρX1v (σv / σX1)
Threats to internal validity
I β̂1 will be biased and inconsistent if E[vi | X1i] ≠ 0 (which also
implies ρX1v ≠ 0).
I There are several reasons why this may occur, which are
sometimes called threats to internal validity.
I Here we concentrate on three threats to internal validity that
can (potentially) be addressed using panel data and/or
instrumental variables (IV) estimation methods:
1. Omitted variables.
2. Errors-in-variables (measurement error).
3. Simultaneous causality.
I Possible remedies for omitted variables (depending on
available data):
1. Include omitted variables (in which case, the conditional mean
zero error assumption is satisfied).
2. Include appropriate control variables (in which case, the
conditional mean independence assumption is satisfied).
3. Use panel data or IV.
Econ 140
Panel Data & Instrumental Variables

Stephen Bianchi

Department of Economics
UC Berkeley

April 13, 2021


Fixed Effects Standard Errors

I With panel data, the regression errors can be correlated over


time within an entity.
I This is called serial correlation or autocorrelation.

I This does not introduce bias into the estimate of β1, but it
does affect the standard error of β̂1.
I To correct for this, we use so-called clustered standard errors.
I So-called because they are "clustered" within an entity.

I Let’s take a look at the assumptions under which the fixed


effects estimator is consistent and asymptotically normal.
Fixed Effects Assumptions

I Given the structural model:

Yit = αi + β1 Xit + uit

(FE1) E[uit | Xi1, Xi2, . . . , XiT, αi] = 0 (conditioning on the
regressors from all time periods)
(FE2) (Xi1, . . . , XiT, Yi1, . . . , YiT) are iid draws from their joint
distribution
(FE3) Large outliers unlikely: Xit, Yit have nonzero finite fourth
moments
(FE4) There is no perfect multicollinearity.

I For multiple regressors, we replace Xit with the full list
X1,it, . . . , Xk,it.
Fixed Effects Standard Errors

I Recall:

β̂1 = β1 + Σi Σt X̃it ũit / Σi Σt X̃it²

where X̃it = Xit − X̄i and ũit = uit − ūi.

I Under the given assumptions, by the CLT we have

√(NT) (β̂1 − β1) →d N(0, Var(β̂1))

I But what is Var(β̂1)?

Var(β̂1) = Var( Σi Σt X̃it ũit / Σi Σt X̃it² )
Fixed Effects Standard Errors
I In large samples (think of letting N → ∞, with fixed T) we
have

Var(β̂1) ∝ Var( Σi Σt X̃it ũit )
         ∝ Σi Var( Σt X̃it ũit )

I Side note: for random variables X1, . . . , XN

Var( Σi Xi ) = Σi Var(Xi) + 2 Σi Σ_{j>i} Cov(Xi, Xj)
             = Σi Σj Cov(Xi, Xj)
Fixed Effects Standard Errors
I Using this we can write

Var(β̂1) ∝ Σi Σs Σt Cov(X̃is ũis, X̃it ũit)

I The OLS heteroskedasticity robust variance formula misses all
of the covariances (i.e., all of the cases where s ≠ t).
I The fixed effects standard errors are called clustered standard
errors, because the variance formula accounts for these
covariances.
I They are also known as heteroskedasticity and
autocorrelation consistent (HAC) standard errors.
I The formula for the estimator of Var(β̂1) is given by

Var^(β̂1) = (1/NT) · [ (1/(N−1)) Σi ( (1/√T) Σt X̃it ũit )² ]
            / [ (1/NT) Σi Σt X̃it² ]²
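
I In practice, clustered standard errors are built into standard software; a sketch with statsmodels, assuming the long-format traffic-fatality dataframe df with state, tfr, and p1424.

```python
import statsmodels.formula.api as smf

# State fixed effects via entity dummies; standard errors clustered by state
m = smf.ols("tfr ~ p1424 + C(state)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["state"]}
)
```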
Time Fixed Effects
I Consider the following model:

Yit = β0 + β1 Xit + β3 Wt + uit

where Wt is entity invariant.

I We can write this as

Yit = λt + β1 Xit + uit

where λt = β0 + β3 Wt represent the time fixed effects.

I Using binary variables, we have

Yit = β0 + β1 Xit + δ2 I2t + · · · + δT ITt + uit

where

Ist = 1 if s = t, 0 otherwise

I Hence, λt = β0 + δs (when s = t).
Time Fixed Effects

I Each period has a different intercept, but the slope of the
regression line in each period is β1.
Time Fixed Effects

I Demeaning proceeds in an analogous fashion as for entity
demeaning:

(1/N) Σi Yit = β0 + β1 (1/N) Σi Xit + β3 Wt + (1/N) Σi uit

or

Ȳt = β0 + β1 X̄t + β3 Wt + ūt

I We then subtract this from the original (time fixed effects)
equation to get

(Yit − Ȳt) = β1 (Xit − X̄t) + (uit − ūt)

I This is a bivariate regression with time demeaned variables.
Entity and Time Fixed Effects
I Consider the following model:

Yit = β0 + β1 Xit + β2 Zi + β3 Wt + uit

where Zi is time invariant and Wt is entity invariant.

I We can write this as

Yit = αi + λt + β1 Xit + uit

where αi = β0 + β2 Zi represent the entity fixed effects and
λt = β3 Wt represent the time fixed effects.

I Using binary variables, we have

Yit = β0 + β1 Xit + γ2 D2i + · · · + γN DNi + δ2 I2t + · · · + δT ITt + uit

where

Dji = 1 if i = j, 0 otherwise    Ist = 1 if s = t, 0 otherwise
Entity and Time Fixed Effects
I Of course, we can also demean using both entity and time
means to get

(Yit − Ȳi − Ȳt + Ȳ̄) = β1 (Xit − X̄i − X̄t + X̄̄) + (uit − ūi − ūt + ū̄)

where Ȳ̄ = (1/NT) Σi Σt Yit, and similarly for X̄̄ and ū̄.

I This is a bivariate regression with entity and time demeaned
variables.
I Going back to our example, using entity and time fixed effects
yields

tfr^ = 0.54 · p1424 + state fixed effects + time fixed effects
       (0.20)

which gives a t-stat for the estimate of the slope coefficient of
2.70 (p-value = 0.009).
I This is less than half of the previous estimate, but still
statistically significant at the 5% and 1% levels.
Instrumental Variables (IV)
I IV is a general way to obtain a consistent estimator of the
unknown coefficients in the population regression model

Yi = β0 + β1 Xi + ui

when Xi is correlated with ui.

I We call Xi endogenous when Cov(Xi, ui) ≠ 0.
I We call Xi exogenous when Cov(Xi, ui) = 0.
I What is the intuition for IV?
I Think of Xi as having two parts, one that is correlated with ui
and one that is not correlated with ui .
I IV isolates the part of Xi that is uncorrelated with ui (the
"good" part) allowing us to disregard the variations in Xi that
bias the OLS estimates (the "bad" part).
I How? By using other variables (instruments) that are
correlated with Yi (but only through Xi ) and are uncorrelated
with ui .
Instrumental Variables (IV)
I Example: consider the problem of unobserved ability in a
wage equation for working adults.

ln(wagei) = β0 + β1 educi + β2 abili + ui

I If we leave ability in the error term, we have

ln(wagei) = β0 + β1 educi + vi

I If we estimate this model via OLS, and educ and abil are
correlated, we will get a biased and inconsistent estimator for
β1.
I It turns out that we can still use this equation as a basis for
estimation, provided we find an instrumental variable for
educ.
I Consider the simple regression model

Yi = β0 + β1 Xi + ui

where we think Cov(Xi, ui) ≠ 0, i.e., we think Xi is
endogenous.
Instrumental Variables (IV)
I In order to obtain consistent estimates of β0 and β1, we need
an instrumental variable, call it Zi.
I A valid instrument Zi must satisfy two conditions:
(i) Instrument relevance: Cov(Zi, Xi) ≠ 0
(ii) Instrument exogeneity: Cov(Zi, ui) = 0
I These conditions say that Zi must be uncorrelated with the
omitted variables and Zi must be related, either positively or
negatively, to the endogenous variable Xi.
I There is a very important difference between these two
requirements.
I Since Cov(Zi, ui) involves the unobserved error ui, we cannot
generally hope to test this condition.
I We must appeal to economic theory or common sense.
I On the other hand, we can test the condition
Cov(Zi, Xi) ≠ 0.
I The easiest way to do this is to estimate a simple regression
between Xi and Zi:

Xi = π0 + π1 Zi + vi
Instrumental Variables (IV)

I On the other hand, we can test the condition
Cov(Zi, Xi) ≠ 0.
I We should be able to reject the null hypothesis H0 : π1 = 0 at
a sufficiently small significance level (i.e., 5% or 1%).
I In our example, an IV for education must be (1) uncorrelated
with ability (and any other unobserved factors affecting wage)
and (2) correlated with education.
I In wage equations, labour economists have used family
background variables as IVs for education.
I Parent’s education: positively (empirically) correlated with
child’s education, though might also be correlated with child’s
ability.
I Number of siblings: empirically, having more siblings is
associated with lower levels of education, so if this is
uncorrelated with ability it can act as an IV for education.
Instrumental Variables (IV)
I Example:

scorei = β0 + β1 skippedi + ui

where scorei is final exam score and skippedi is the total
number of lectures missed during the semester.
I We might be worried that skippedi is correlated with other
factors in ui , i.e., more interested, highly motivated students
might miss fewer classes.
I What might be a good IV for skippedi ? We need something
that is correlated with skippedi , but uncorrelated with interest
and motivation.
I One option could be to use the distance between housing and
campus.
I Living further from campus may increase the likelihood of
missing class. Hence, skippedi may be positively correlated
with distance.
I We can check by regressing skippedi on distance and doing a
t-test.
I Is distance uncorrelated with ui ? If income affects student
performance and low-income families live further from
campus, then maybe not. But if yes, then distance could be a
good IV for skippedi .
Instrumental Variables (IV)
I General comments*: arguments as to why a variable makes
a good instrument should include a discussion about the
nature of the relationship between Xi and Zi .
I Hence, it is important to take note of the sign of π̂1 in the
regression of Xi on Zi.
I Example: due to genetics and background influences, it makes
sense that child’s education Xi and mother’s or father’s
education Zi are positively correlated.
I If π̂1 < 0 in your sample, then using mother's or father's
education as an IV for child's education is likely to be
unconvincing – apart from any discussion about exogeneity.
I Example: one should find a positive and statistically
significant relationship between distance and skippedi in order
to justify using distance as an IV for skippedi .
I A negative relationship would be difficult to reconcile and
would suggest there are important omitted variables driving
the negative correlation – variables that should be included in
the model.
Instrumental Variables (IV)
I Let’s look at how a good IV can be used to get a consistent
estimate of the model parameters.
I In particular, we look at how Cov(Zi, ui) = 0 and
Cov(Zi, Xi) ≠ 0 serve to identify β1.
I Identification: means we can write β1 in terms of population
moments that can be estimated using a sample of data.
I Using Yi = β0 + β1 Xi + ui, we can write

Cov(Zi, Yi) = β1 Cov(Zi, Xi) + Cov(Zi, ui)

I Then using Cov(Zi, ui) = 0 and Cov(Zi, Xi) ≠ 0, we can solve
for β1:

β1 = Cov(Zi, Yi) / Cov(Zi, Xi)

I Given a sample of data, we can estimate β1 using the sample
analogs:

β̂1 = SZY / SZX = Σi (Zi − Z̄)(Yi − Ȳ) / Σi (Zi − Z̄)(Xi − X̄)
               = Σi (Zi − Z̄) Yi / Σi (Zi − Z̄) Xi
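
I The sample-analog formula translates directly into code; a minimal sketch assuming numpy arrays y, x, z of equal length.

```python
import numpy as np

def iv_slope(y, x, z):
    """IV estimator of beta1: S_ZY / S_ZX."""
    zd = z - z.mean()
    return (zd * y).sum() / (zd * x).sum()

def iv_intercept(y, x, z):
    """As with OLS: beta0_hat = ybar - beta1_hat * xbar."""
    return y.mean() - iv_slope(y, x, z) * x.mean()
```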
Instrumental Variables (IV)
I Let’s look at how a good IV can be used to get a consistent
estimate of the model parameters.
I This is called the instrumental variables (IV) estimator of
1.
I Then, as with OLS: ˆ0 = Ȳ ˆ1 X̄
I Note that when Zi = Xi , we obtain the OLS estimator of 1 .
I When Xi is exogenous, it can be used as its own IV.
I Then the IV estimator is identical to the OLS estimator.
I When discussing the application of IV it is important to be
careful with language.
I Like OLS, IV is an estimation method.
I It does not make sense to refer to an "instrumental variables
model," just as it makes no sense to use the phrase "OLS
model."
I The "model" is the specified equation, we can choose to
estimate the parameters in many different ways.
I An estimator is simply a rule for combining data.
I The estimators are well defined (mathematically) and exist
apart from any underlying model or assumptions.
Econ 140
Instrumental Variables

Stephen Bianchi

Department of Economics
UC Berkeley

April 15, 2021


Simultaneous causality

I The canonical example of simultaneous causality concerns


demand (or supply) estimation.
I Suppose we are interested in estimating the demand for a
product (say beer).
I The demand function relates the quantity demanded to price

QiD = β0 + β1 Pi + ui,

where the error term ui captures unobserved demand factors


such as income and tastes.
I These are sometimes also called unobserved demand shifters.
Simultaneous causality
I There is also a supply function that relates the quantity firms
are willing to supply to price

QiS = γ0 + γ1 Pi + γ2 Zi + vi,

where the error term vi captures unobserved supply shifters.


I The only thing distinguishing these two functions is Zi , which
might be a supply factor like the price of hops (an input to
making beer).
I Zi is sometimes also called an observed supply shifter.
I In the data, the observed Qi and Pi are determined by the
equilibrium condition
QiD = QiS ,
i.e., the intersection of the two functions.
I Hence, the problem with using OLS in conjunction with the
demand (or supply) equation is that Pi will be correlated with
ui (and vi ). Let’s see why.
Simultaneous causality
I The demand and supply functions

Demand : Qi = β0 + β1 Pi + ui
Supply : Qi = γ0 + γ1 Pi + γ2 Zi + vi

are called structural equations since each can be derived


from economic theory and has a causal interpretation.
I The parameters in these equations are called structural
parameters.
I The variables Qi and Pi are called endogenous variables
(since they are determined inside the system), while Zi is
called an exogenous variable (since it is determined outside
the system).
I Note that we can solve for the endogenous variables in terms
of the exogenous variable Zi .
I This will illustrate why we have an endogeneity problem and
suggest a path forward.
Simultaneous causality

I If we set the two equations equal, solve for Pi, and then plug
this Pi into the demand equation, we obtain the following
reduced form equations

Pi = (γ0 − β0)/(β1 − γ1) + [γ2/(β1 − γ1)] Zi + (vi − ui)/(β1 − γ1)

Qi = (β1 γ0 − β0 γ1)/(β1 − γ1) + [β1 γ2/(β1 − γ1)] Zi
     + (β1 vi − γ1 ui)/(β1 − γ1)

I This shows that Pi is correlated with the error terms in both


the structural supply and demand equations.
I This is direct evidence of the endogeneity problem arising
from simultaneous causality.
Simultaneous causality
I We can re-write the reduced form equations

Pi = (γ0 − β0)/(β1 − γ1) + [γ2/(β1 − γ1)] Zi + (vi − ui)/(β1 − γ1)
Qi = (β1 γ0 − β0 γ1)/(β1 − γ1) + [β1 γ2/(β1 − γ1)] Zi
     + (β1 vi − γ1 ui)/(β1 − γ1)

more compactly as

Pi = π0P + π1P Zi + εiP
Qi = π0Q + π1Q Zi + εiQ

I Since

β1 = [β1 γ2/(β1 − γ1)] · [(β1 − γ1)/γ2] = π1Q / π1P

we can get a consistent estimate of β1 as long as we can get
consistent estimates of π1P and π1Q.
Simultaneous causality

I Since Zi is exogenous (by assumption), we can simply use the
OLS estimates of π1P and π1Q

π̂1P = SZP / SZ², π̂1Q = SZQ / SZ²

I Then

β̂1 = π̂1Q / π̂1P = SZQ / SZP

I This estimation approach is called indirect least squares (ILS).
It is a special case of instrumental variables (IV) estimation.
I However, from the reduced form equations, it is clear that
there is no way to "indirectly" estimate γ1.
Two Stage Least Squares (TSLS)

I Given a good instrument for Xi , the most popular method for


estimating 1 is the IV estimation method called two stage
least squares (TSLS).
I Suppose that

Yi = β0 + β1 Xi + ui,

where Cov(Xi, ui) ≠ 0.
I As the name implies, TSLS estimation of β1 proceeds in two
stages.
I The first stage decomposes Xi into two components (both of
which are correlated with Yi):
1. A problematic component that may be correlated with ui.
2. A problem-free component that is uncorrelated with ui.
I The second stage uses the problem-free component to
estimate β1.
Two Stage Least Squares (TSLS)
I Specifically, the first stage assumes there is a population
regression model linking Xi and the instrument Zi:

Xi = π0 + π1 Zi + vi,

where we assume that
I Cov(Zi, ui) = 0 (instrument exogeneity)
I Cov(Zi, Xi) ≠ 0 (instrument relevance)
I E[vi | Zi] = 0 (OLS A1 for the first stage)
I The assumption that Cov(Zi, ui) = 0, combined with the
assumption that Zi does not enter the structural equation, is
known as an exclusion restriction.
I The idea is that Zi "matters" only through Xi.
I We can express this formally as Cov(Zi | Xi, Yi | Xi) = 0.
I In other words, if Zi meets this condition, then Zi may be
"excluded" from the structural equation.
I Note that this is a reduced form equation, since Xi is
expressed entirely in terms of exogenous variables.
Two Stage Least Squares (TSLS)
I The goal of the first stage is to find the fitted values for Xi,
i.e., the part of Xi that is correlated with Zi and hence
uncorrelated with ui:

X̂i = π̂0 + π̂1 Zi

where

π̂1 = SXZ / SZ², π̂0 = X̄ − π̂1 Z̄

I In the second stage, we replace Xi with X̂i in the structural
equation

Yi = β0 + β1 X̂i + ui

and use OLS to estimate β1, giving

β̂1TSLS = SYX̂ / S²X̂
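I A sketch of the two stages, assuming numpy arrays y, x, z; note that the second-stage OLS output does not give valid TSLS standard errors, since it treats X̂i as fixed.

```python
import statsmodels.api as sm

def tsls(y, x, z):
    # First stage: regress x on z and keep the fitted values
    x_hat = sm.OLS(x, sm.add_constant(z)).fit().fittedvalues
    # Second stage: regress y on the fitted values
    second = sm.OLS(y, sm.add_constant(x_hat)).fit()
    return second.params[1]   # beta1_hat (TSLS)
```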
Two Stage Least Squares (TSLS)

I Note that in the second stage "population" model we have

β1 = Cov(Yi, X̂i) / Var(X̂i)

where

Cov(Yi, X̂i) = Cov(Yi, π̂0 + π̂1 Zi) = π̂1 Cov(Yi, Zi)
Var(X̂i) = Var(π̂0 + π̂1 Zi) = π̂1² Var(Zi)

so that

β1 = π̂1 Cov(Yi, Zi) / (π̂1² Var(Zi)) = Cov(Yi, Zi) / (π̂1 Var(Zi))
Two Stage Least Squares (TSLS)

I Using sample analogs we have

β̂1TSLS = (SYZ / SZ²) / π̂1
        = (SYZ / SZ²) · (SZ² / SXZ)
        = SYZ / SXZ
        = Σi (Zi − Z̄)(Yi − Ȳ) / Σi (Zi − Z̄)(Xi − X̄)
        = Σi (Zi − Z̄) Yi / Σi (Zi − Z̄) Xi
Large sample distribution of β̂1TSLS

I Let’s look more closely at the numerator from the previous


expression:
n
X n
X
(Zi Z̄ )Yi = (Zi Z̄ )( 0 + 1 Xi + ui )
i=1 i=1
n
X n
X
= 0 (Zi Z̄ ) + 1 (Zi Z̄ )Xi
i=1 i=1
n
X
+ (Zi Z̄ )ui
i=1
n
X n
X
= 1 (Zi Z̄ )Xi + (Zi Z̄ )ui
i=1 i=1
Large sample distribution of β̂1TSLS
I Using this we can write

β̂1TSLS = Σi (Zi − Z̄) Yi / Σi (Zi − Z̄) Xi
        = β1 Σi (Zi − Z̄) Xi / Σi (Zi − Z̄) Xi
          + Σi (Zi − Z̄) ui / Σi (Zi − Z̄) Xi
        = β1 + Σi (Zi − Z̄) ui / Σi (Zi − Z̄) Xi

I Is β̂1TSLS an unbiased estimator of β1? No!
I Using arguments similar to those used to derive the
asymptotic distribution of β̂1OLS we can show that

β̂1TSLS ∼a N( β1, Var[(Zi − μZ) ui] / (n [Cov(Zi, Xi)]²) )

I Is β̂1TSLS a consistent estimator of β1? Yes!
Large sample distribution of β̂1TSLS

I Of course, we don’t know Var [(Zi µZ )ui ] and Cov (Zi , Xi ),


so we must estimate them using their sample analogs:
n
X
1
d [(Zi
Var µZ )ui ] = (Zi Z̄ )2 ûi2
n 2
i=1

n
X
d (Zi , Xi ) = 1
Cov (Zi Z̄ )(Xi X̄ )
n 1
i=1
Two Stage Least Squares (TSLS)
I Let’s consider another approach which gets us to the same
place.
I Recall that we can write
ˆ1TSLS = SYZ
SXZ
SYZ SZ2
= ·
SZ2 SXZ

I Notice that we can also write this as


ˆ
ˆ1TSLS = 1
ˆ1

where ˆ1 and ⇡
ˆ1 come from regressions of Yi and Xi on Zi
Yi = 0 + 1 Zi + ei
Xi = ⇡0 + ⇡1 Zi + vi
Two Stage Least Squares (TSLS)
I We can derive this equation for Yi from the structural
equation for Yi and the reduced form (first stage) equation for
Xi:

Yi = β0 + β1 Xi + ui
   = β0 + β1 (π0 + π1 Zi + vi) + ui
   = (β0 + β1 π0) + β1 π1 Zi + (β1 vi + ui)
   = λ0 + λ1 Zi + ei

where

λ0 = β0 + β1 π0
λ1 = β1 π1
ei = β1 vi + ui
Two Stage Least Squares (TSLS)

I Idea: the structural equation

Yi = β0 + β1 Xi + ui

and the first stage equation

Xi = π0 + π1 Zi + vi

imply a relationship between Yi and Zi:

Yi = (β0 + β1 π0) + β1 π1 Zi + (β1 vi + ui)
   = λ0 + λ1 Zi + ei

This is called the reduced form model.


Two Stage Least Squares (TSLS)
I Notice that the reduced form error

ei = β1 vi + ui

is "OK" since

Cov(Zi, ei) = β1 Cov(Zi, vi) + Cov(Zi, ui) = 0

I Hence, we can estimate the reduced form model using OLS to
get an estimate λ̂1 of λ1 = β1 π1.
I Then to estimate β1, we "unpack" β1 from the reduced form
estimate by dividing by π̂1:

β̂1TSLS = λ̂1 / π̂1

Two Stage Least Squares (TSLS)

I Let’s reprise our supply and demand example.


I Recall that we were interested in estimating the demand
equation
QiD = 0 + 1 Pi + ui
I But we had an endogeneity problem, since prices and
quantities are determined by the intersection of demand and
supply
QiS = 0 + 1 Pi + 2 Zi + vi
I However, the supply equation provides us with a natural
instrument for Pi in the demand equation.
I Namely, the supply shifter Zi .
Two Stage Least Squares (TSLS)
I How do we know that Zi is relevant? The reduced form
equation

Pi = (γ0 − β0)/(β1 − γ1) + [γ2/(β1 − γ1)] Zi + (vi − ui)/(β1 − γ1)

shows that Zi is related to Pi as long as γ2 ≠ 0.

I Since Zi is exogenous by assumption, we can estimate the
slope of the demand equation using TSLS

β̂1TSLS = SZQ / SZP

I Note that this is exactly the same estimator we found before
using indirect least squares (ILS).
I Since there is no exogenous demand shifter in the demand
equation, we cannot estimate the coefficient of price in the
supply equation: it is not identified.
Econ 140
Instrumental Variables

Stephen Bianchi

Department of Economics
UC Berkeley

April 20, 2021


TSLS with exogenous regressors

I In many applications, there is a single endogenous regressor
and several exogenous regressors

Yi = β0 + β1 Xi + β2 W1i + · · · + βr+1 Wri + ui

where Xi is endogenous and the Wsi are exogenous, i.e.,
Cov(Xi, ui) ≠ 0 and Cov(Wsi, ui) = 0, s = 1, . . . , r.


I Example: demand equation with price, income, and age as
regressors.
I In order to use TSLS, we need at least one instrument Zi for
the endogenous regressor Xi .
I With no instruments, the equation is underidentified.
I With one instrument, the equation is exactly identified.
I With more than one instrument, the equation is overidentified.
TSLS with exogenous regressors

I Given m ≥ 1 instruments, we proceed as follows.
I First stage:

Xi = π0 + π1 Z1i + · · · + πm Zmi
     + πm+1 W1i + · · · + πm+r Wri + vi

yielding (via OLS)

X̂i = π̂0 + π̂1 Z1i + · · · + π̂m Zmi
     + π̂m+1 W1i + · · · + π̂m+r Wri

I Second stage:

Yi = β0 + β1 X̂i + β2 W1i + · · · + βr+1 Wri + ui

yielding (via OLS) β̂1TSLS.
TSLS with exogenous regressors

I Only models that are exactly identified or overidentified can


be estimated using TSLS.

I In the supply and demand example, the demand equation was


exactly identified, but the supply equation was underidentified.

I If the demand equation included another exogenous variable


(like income), then the supply equation would be exactly
identified as well.

I What would it take for the demand equation to be


overidentified?
IV solutions to errors-in-variables*
I As mentioned previously, instrumental variables can also be
used to deal with the measurement error problem.
I As an illustration, consider the model

Yi = β0 + β1 Xi* + β2 Wi + ui

where Yi and Wi are observed, but Xi* is not.

I Instead we observe

Xi = Xi* + ei

where E[ei] = 0. We assume Cov(Xi*, ei) = 0 and
Cov(Wi, ei) = 0.
I We also assume ui is uncorrelated with ei, Xi*, and Wi.
I We can write the model in terms of observed variables as

Yi = β0 + β1 Xi + β2 Wi + vi

where vi = ui − β1 ei.
IV solutions to errors-in-variables*

I Recall that our assumptions imply that Xi and vi are
correlated, so we need an instrument for Xi.
I One possibility is to obtain a second measurement on Xi*, call
it Zi.
I Suppose

Zi = Xi* + ai

where E[ai] = 0 and Cov(Xi*, ai) = 0.
I Further, suppose ai is uncorrelated with ei.
I Hence both Xi and Zi mismeasure Xi*, but their measurement
errors are uncorrelated.
I Relevance? Certainly, Xi and Zi are correlated through their
dependence on Xi*.
I Exogenous? Since it is Xi* that affects Yi, it seems natural
to assume that Zi is uncorrelated with ui.
IV solutions to errors-in-variables*

I Where might we get two measurements on a variable?


I When employees are asked for their annual salary, their
employers can sometimes provide a second measure.
I For married couples, each spouse can independently report the
level of savings or family income.
I In wage equations estimated based on samples of twins, each
twin has been asked about his or her sibling’s years of
education.
I This gives a second measure that can be used as an IV for
self-reported education.
I Generally, however, having two measures of an explanatory
variable is rare.
IV solutions to errors-in-variables*

I An alternative is to use other exogenous variables as IVs for a


potentially mismeasured variable.
I In our wage equation

ln(wagei) = β0 + β1 educi + vi

we could use mother's (and/or father's) education for this
purpose.
I If we think that

educi = educi* + ei

then β̂1IV (using mother's education as an instrument) will not
suffer from measurement error bias, provided mother's
education is uncorrelated with the measurement error ei.
TSLS in multiple regression
I In general, there can be multiple endogenous regressors and
exogenous regressors

Yi = β0 + β1 X1i + · · · + βk Xki + βk+1 W1i + · · · + βk+r Wri + ui

where the Xji are endogenous and the Wsi are exogenous.

I First stage: for j = 1, . . . , k

Xji = πj0 + πj1 Z1i + · · · + πjm Zmi
      + πj(m+1) W1i + · · · + πj(m+r) Wri + vji

yielding (via OLS)

X̂ji = π̂j0 + π̂j1 Z1i + · · · + π̂jm Zmi
      + π̂j(m+1) W1i + · · · + π̂j(m+r) Wri

I Second stage:

Yi = β0 + β1 X̂1i + · · · + βk X̂ki + βk+1 W1i + · · · + βk+r Wri + ui

yielding (via OLS) β̂1TSLS, . . . , β̂kTSLS.
TSLS Assumptions
For any sample of size n < N:
(A1) Conditional mean zero errors:

E[ui | W1i, . . . , Wri] = 0, for i = 1, . . . , n

(A2) Random sampling:

(Yi, X1i, . . . , Xki, Z1i, . . . , Zmi, W1i, . . . , Wri) iid

(A3) Extreme outliers unlikely:
Yi, X1i, . . . , Xki, Z1i, . . . , Zmi, and W1i, . . . , Wri have finite
and nonzero fourth moments.
(A4) A set of instruments Z1i, . . . , Zmi must satisfy the following
two conditions to be valid:
1. (1, X̂1i, . . . , X̂ki, W1i, . . . , Wri) not perfectly collinear (where X̂ji
are the fitted values from the regression of Xji on Z1i, . . . , Zmi
and W1i, . . . , Wri).
2. Cov(Z1i, ui) = 0, . . . , Cov(Zmi, ui) = 0
TSLS in multiple regression

I As with a single endogenous regressor, TSLS requires that we


have at least as many instruments as endogenous regressors
I If m < k, the equation is underidentified.
I If m = k, the equation is exactly identified.
I If m > k, the equation is overidentified.

I In order to use TSLS, we must be either exactly identified or


overidentified.
Strength and exogeneity

I Whether IV regression is useful for a given application hinges


on whether the instruments are valid.
I How do we know if the instruments are valid?
I Recall that validity depends on two conditions:
1. Instrument relevance: Cov(Zi, Xi) ≠ 0
2. Instrument exogeneity: Cov(Zi, ui) = 0

I Let’s look at the consequences of violating each one and see


how to test for violations.
Instrument relevance (strength)

I With relevance, the issue is not just whether the instrument is


relevant, but how relevant.
I The more variation in the endogenous regressor that is
explained by the instrument, the more information is available
for use in IV regression.
I Having a more relevant instrument is like having a larger
sample size:
I It provides a more accurate estimator, and
I It justifies the use of the normal approximation for statistical
inference.
I Thus, just like having a small sample, weak instruments can
pose a problem for statistical inference.
I This is a pretty serious problem in practice because many
applied papers use exogenous but "weak" instruments.
Instrument relevance (strength)

I Recall that when we have one endogenous regressor and one


instrument (and no other exogenous regressors), we have

ˆ1TSLS = SZY ⇡ Cov (Zi , Yi )


SZX Cov (Zi , Xi )

I Hence, if SZX is close to zero then ˆ1TSLS is highly variable as


SZX and SZY vary with sampling.

I More generally, if ⇡1 , . . . ⇡m (the first stage coefficients on the


instruments) are equal or close to zero then
⇣ ⇣ ⌘⌘
ˆjTSLS 6 ! d
N j , Var ˆjTSLS
Instrument relevance (strength)

I Hence, to test the "strength" of the instruments, regress Xi


on Z1i , . . . , Zmi , W1i , . . . , Wri and test the null hypothesis

H0 : ⇡1 = · · · = ⇡m = 0

I Rule of thumb: reject the null hypothesis (that the


instruments are weak) if F 10. 1
I Note: if there is only one instrument you can use F = t 2 .

I So what should you do if you have weak instruments?


I Find better ones (or drop the weak ones, if you are
overidentified).
I Use a different technique.

1
The intuition for the cutoff of 10 is a bit complicated and involves the
asymptotic bias of the TSLS coefficients and how much bias you are willing to
tolerate (for more details see S&W (4th Edition), Appendix 12.5).
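
I A sketch of this first-stage strength check, assuming df holds an endogenous regressor x, instruments z1 and z2, and an exogenous regressor w1 (hypothetical names).

```python
import statsmodels.formula.api as smf

first = smf.ols("x ~ z1 + z2 + w1", data=df).fit()

# Joint test that all instrument coefficients are zero
res = first.f_test("z1 = 0, z2 = 0")
weak = float(res.fvalue) < 10   # rule of thumb: F >= 10 suggests adequate strength
```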
Instrument exogeneity

I If the instruments are not exogenous (Cov(Zi, ui) ≠ 0), then
X̂i will be correlated with ui and the TSLS estimator will be
inconsistent.
I Defeats the purpose of IV since we can't isolate the "good"
part of Xi.

I Since the error ui is unobservable, can we test the exogeneity


condition for the instruments?
I If you are exactly identified, then you can’t (have to appeal to
economic theory and intuition/judgement).
I If you are overidentified, you can test whether some of the
instruments are uncorrelated with the structural error.
I Let’s see how.
Instrument exogeneity

I To get some intuition, let's consider a structural equation with
one endogenous regressor and two exogenous regressors

Yi = β0 + β1 Xi + β2 W1i + β3 W2i + ui

and suppose we have two instruments Z1 and Z2.

I Again, the assumptions that Z1 and Z2 do not appear in the
structural equation and that they are uncorrelated with the
structural error are known as exclusion restrictions.
I Since there is only one endogenous regressor, if Z1 and Z2 are
both relevant, we could estimate this equation using only Z1
or only Z2.
Instrument exogeneity

I Let β̌1IV be the TSLS estimator of β1 using Z1 and β̃1IV be the
TSLS estimator of β1 using Z2.
I Again, if both Z1 and Z2 are exogenous and relevant then we
have

β̌1IV →p β1 and β̃1IV →p β1

I Hence, if our logic for choosing the instruments is sound, β̌1IV
and β̃1IV should differ only by sampling error.
I Hausman (1978) proposed a test of whether Z1 and Z2 are
exogenous based on the difference β̌1IV − β̃1IV.
I The procedure of comparing different IV estimators of the
same parameter is an example of testing overidentifying
restrictions.
Instrument exogeneity
I Idea: we have more instruments than we need to estimate the
parameters consistently.
I In the previous discussion, we had one more instrument than
we needed, resulting in one overidentifying restriction that can
be tested.
I In the general case, suppose we have q more instruments than
we need.
I For example, with one endogenous regressor and 3 proposed
instruments, we have

q = 3 − 1 = 2

overidentifying restrictions.
I When q ≥ 2, comparing several IV estimates is cumbersome.
I But we can easily compute a test statistic based on the TSLS
residuals.
I If all instruments are exogenous, the TSLS residuals should be
uncorrelated with the instruments (up to sampling error).
Basmann test of overidentifying restrictions
1. Run TSLS to obtain

ŶiTSLS = β̂0TSLS + β̂1TSLS X1i + · · · + β̂kTSLS Xki
         + β̂k+1TSLS W1i + · · · + β̂k+rTSLS Wri

2. Form the TSLS residuals

ûiTSLS = Yi − ŶiTSLS

3. Regress ûiTSLS on all exogenous regressors

ûiTSLS = δ0 + δ1 Z1i + · · · + δm Zmi + δm+1 W1i + · · · + δm+r Wri

4. Compute the F-statistic (homoskedasticity only) for

H0 : δ1 = · · · = δm = 0
Basmann test of overidentifying restrictions

5. Form the J-statistic

J = mF →d χ²q

where q = m − k.

6. Reject H0 if J > the χ²q critical value (i.e., the 95th
percentile, for a test at the 5% level).

I If H0 is rejected, conclude that one or more of the instruments
are not exogenous.
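
I A sketch of steps 3–6, assuming resid holds the TSLS residuals from step 2 and df holds instruments z1, z2, z3 and an exogenous regressor w1 (hypothetical names), with one endogenous regressor, so q = m − k = 2.

```python
import statsmodels.formula.api as smf
from scipy import stats

df["resid"] = resid                        # TSLS residuals from step 2
aux = smf.ols("resid ~ z1 + z2 + z3 + w1", data=df).fit()

F = float(aux.f_test("z1 = 0, z2 = 0, z3 = 0").fvalue)
J = 3 * F                                  # m = 3 instruments
p_value = stats.chi2.sf(J, df=2)           # q = m - k = 2; reject H0 if p < 0.05
```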
Efficiency of IV versus OLS estimators
I For this analysis, we assume the structural errors are
conditionally homoskedastic. That is,

Var(ui | exogenous variables) = σu²

I Consider the IV estimator when there is one suspected
endogenous independent variable that is actually exogenous
(and no other exogenous independent variables), and one
instrument:

β̂1IV ∼a N( β1, Var[(Zi − μZ) ui] / (n [Cov(Zi, Xi)]²) )

I For the OLS estimator, we have:

β̂1OLS ∼a N( β1, Var[(Xi − μX) ui] / (n (σX²)²) )

I Next time, we will show that

Var(β̂1IV) > Var(β̂1OLS)
