
Lecture notes on Econometrics (borrowed from Ming-Ching Luoh’s ppt)

1 Linear Regression with One Regressor


Introduction
One empirical problem is “class size and educational output.”
Linear Regression Model
Some notation and terminology:
𝑌𝑖 = 𝛽0 + 𝛽1 𝑋𝑖 + 𝑢𝑖 , 𝑖 = 1, … , 𝑛
𝑋: independent variable or regressor
𝑌: dependent variable
𝛽0: intercept
𝛽1: slope
𝑢𝑖 : the regression error (It consists of omitted factors, or possibly measurement error
in the measurement of Y. In general, these omitted factors are other factors that
influence Y, other than the variable X.)
The ordinary least squares estimator (OLS estimator):
It can be used to estimate 𝛽0 and 𝛽1 from data by solving
min over 𝛽̂0 , 𝛽̂1 of ∑ᵢ₌₁ⁿ [𝑌𝑖 − (𝛽̂0 + 𝛽̂1 𝑋𝑖 )]²
and this minimization problem has an explicit solution.


The OLS estimator has some desirable properties. Under certain assumptions, it’s
unbiased (i.e. 𝐸(𝛽̂1 ) = 𝛽1), and it has a tighter sampling distribution than some other
candidate estimators of 𝛽1.
The solutions to the above problem are:
𝛽̂1 = 𝑆𝑋𝑌 ⁄ 𝑆𝑋² , 𝛽̂0 = 𝑌̄ − 𝛽̂1 𝑋̄
And the predicted value and residual are:
𝑌̂𝑖 = 𝛽̂0 + 𝛽̂1 𝑋𝑖 , 𝑢̂𝑖 = 𝑌𝑖 − 𝑌̂𝑖
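As a quick numerical illustration of these formulas, here is a minimal sketch on hypothetical simulated data (assuming the numpy library): the OLS coefficients, fitted values, and residuals computed directly from the sample covariance and variance.

import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(10.0, 2.0, n)            # regressor X
u = rng.normal(0.0, 1.0, n)             # regression error u
y = 3.0 + 0.5 * x + u                   # data generated with beta0 = 3, beta1 = 0.5

# OLS slope and intercept from the closed-form solution
s_xy = np.cov(x, y, ddof=1)[0, 1]       # sample covariance S_XY
s_xx = np.var(x, ddof=1)                # sample variance S_X^2
beta1_hat = s_xy / s_xx
beta0_hat = y.mean() - beta1_hat * x.mean()

y_hat = beta0_hat + beta1_hat * x       # predicted values
u_hat = y - y_hat                       # OLS residuals (they sum to zero)
print(beta0_hat, beta1_hat, u_hat.sum())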
Measures of Fit
A natural question is how well the regression line fits or explains the data. There are
two regression statistics that provide complementary measures of the quality of fit.
(1) The regression R2 measures the fraction of the variance of Y that is explained by
X; it’s unitless and ranges from 0 (no fit) to 1 (perfect fit).
Note that if we define:
TSS (total sum of squares) ≡ ∑ᵢ₌₁ⁿ (𝑌𝑖 − 𝑌̄)²
SSR (sum of squared residuals) ≡ ∑ᵢ₌₁ⁿ (𝑌𝑖 − 𝑌̂𝑖 )²
ESS (explained sum of squares) ≡ ∑ᵢ₌₁ⁿ (𝑌̂𝑖 − 𝑌̄)²

We can show that: TSS=SSR+ESS.


And now we define:
𝑅² ≡ 𝐸𝑆𝑆 ⁄ 𝑇𝑆𝑆
whose meaning, as the fraction of the variation in Y explained by the regression, is immediate.
Note that for regression with a single X, 𝑅 2 = 𝜌𝑋𝑌 2 .
(2) The standard error of the regression (SER) measures the magnitude of a typical
regression residual, in the units of Y. Equivalently, it measures the spread of the
distribution of u, that is, the typical size of an OLS residual. The SER is (almost) the
sample standard deviation of the OLS residuals:
𝑆𝐸𝑅 ≡ √( (1⁄(𝑛−2)) ∑ᵢ₌₁ⁿ (𝑢̂𝑖 − 𝑢̂̄)² ) = √( (1⁄(𝑛−2)) ∑ᵢ₌₁ⁿ 𝑢̂𝑖² )
The second equality holds because ∑ᵢ₌₁ⁿ 𝑢̂𝑖 = 0.


The root mean squared error (RMSE) is closely related to the SER:
𝑅𝑀𝑆𝐸 = √( (1⁄𝑛) ∑ᵢ₌₁ⁿ 𝑢̂𝑖² )
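Continuing in the same spirit, a short sketch on hypothetical simulated data (numpy assumed) of how TSS, ESS, SSR, R², SER, and RMSE fit together:

import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(10.0, 2.0, n)
y = 3.0 + 0.5 * x + rng.normal(0.0, 1.0, n)

beta1_hat, beta0_hat = np.polyfit(x, y, 1)    # OLS fit: slope, intercept
y_hat = beta0_hat + beta1_hat * x
u_hat = y - y_hat

tss = np.sum((y - y.mean()) ** 2)             # total sum of squares
ess = np.sum((y_hat - y.mean()) ** 2)         # explained sum of squares
ssr = np.sum(u_hat ** 2)                      # sum of squared residuals
r2 = ess / tss                                # equals 1 - ssr/tss
ser = np.sqrt(ssr / (n - 2))                  # standard error of the regression
rmse = np.sqrt(ssr / n)                       # root mean squared error
print(r2, 1 - ssr / tss, ser, rmse)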

The Least Squares Assumptions


(1) The conditional distribution of u given X has mean zero, that is, 𝐸(𝑢|𝑋 = 𝑥) =
0. This implies that 𝛽̂1 is unbiased (can be proved). This also implies
𝐶𝑜𝑟𝑟(𝑋𝑖 , 𝑢𝑖 ) = 0.
(2) (𝑋𝑖 , 𝑌𝑖 ), 𝑖 = 1, … , 𝑛 are i.i.d. (This delivers the sampling distribution of 𝛽̂0 and
𝛽̂1.)
(3) Large outliers in X and/or Y are rare (to avoid large influence on 𝛽̂1).
(Technically, X and u have finite fourth moments: 𝐸(𝑋⁴) < ∞ and 𝐸(𝑢⁴) < ∞.)
Sampling Distribution of the OLS Estimators
The OLS estimator is computed from a sample of data; a different sample gives a
different value of 𝛽̂1. This is the source of the “sampling uncertainty” of 𝛽̂1. If we
know the sampling distribution of the OLS estimator, we can address several problems.
First, we can calculate that
𝑉𝑎𝑟(𝛽̂1 ) = 𝑉𝑎𝑟(𝑣) ⁄ [𝑛(𝜎𝑋²)²] , where 𝑣𝑖 = (𝑋𝑖 − 𝑋̄)𝑢𝑖
By the CLT, when n is large:
𝛽̂1 ~ 𝑁( 𝛽1 , 𝑉𝑎𝑟((𝑋𝑖 − 𝜇𝑋 )𝑢𝑖 ) ⁄ (𝑛𝜎𝑋⁴) )
Note that it can be shown that 𝛽̂1 − 𝛽1 = [ (1⁄𝑛) ∑ᵢ₌₁ⁿ 𝑣𝑖 ] ⁄ [ ((𝑛−1)⁄𝑛) 𝑆𝑋² ] and 𝑉𝑎𝑟(𝛽̂1 − 𝛽1 ) = 𝑉𝑎𝑟((𝑋𝑖 − 𝜇𝑋 )𝑢𝑖 ) ⁄ (𝑛𝜎𝑋⁴).

Therefore, the larger the 𝜎𝑥 , the smaller the 𝑉𝑎𝑟(𝛽̂1 − 𝛽1 ), since more spread in X
means more information about 𝛽̂1.
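A small Monte Carlo sketch (hypothetical design, numpy assumed) illustrates this sampling uncertainty: across repeated samples the OLS slope varies around 𝛽1, and its variance falls as n or the spread of X grows.

import numpy as np

rng = np.random.default_rng(0)
beta0, beta1 = 3.0, 0.5

def slope_draws(n, sigma_x, reps=5000):
    """Return OLS slope estimates from `reps` independent samples of size n."""
    slopes = np.empty(reps)
    for r in range(reps):
        x = rng.normal(0.0, sigma_x, n)
        y = beta0 + beta1 * x + rng.normal(0.0, 1.0, n)
        slopes[r] = np.polyfit(x, y, 1)[0]
    return slopes

for n, sigma_x in [(50, 1.0), (200, 1.0), (50, 2.0)]:
    s = slope_draws(n, sigma_x)
    print(n, sigma_x, s.mean(), s.var())   # mean near beta1; variance falls with n and sigma_x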
Testing Hypotheses
First, we state the null hypothesis and the two-sided alternative:
𝐻0 : 𝛽1 = 𝛽1,0 𝑣𝑠. 𝐻1 : 𝛽1 ≠ 𝛽1,0
In general,
estimator − hypothesized value
𝑡=
S. E. of the estimator
where the S.E. of the estimator is the square root of an estimator of the variance of the
estimator.
Applied to a hypothesis about 𝛽1:
𝑡 = (𝛽̂1 − 𝛽1,0 ) ⁄ 𝑆𝐸(𝛽̂1 )
where 𝛽1,0 is the hypothesized value of 𝛽1 . And 𝑆𝐸(𝛽̂1 ), the heteroskedasticity-robust standard error, is
𝑆𝐸(𝛽̂1 ) = √𝜎̂𝛽̂1² , where 𝜎̂𝛽̂1² = (1⁄𝑛) × [ (1⁄(𝑛−2)) ∑ᵢ₌₁ⁿ (𝑋𝑖 − 𝑋̄)² 𝑢̂𝑖² ] ⁄ [ (1⁄𝑛) ∑ᵢ₌₁ⁿ (𝑋𝑖 − 𝑋̄)² ]²
If we let 𝑉𝑎𝑟(𝑢𝑖 |𝑋𝑖 = 𝑥) = 𝜎𝑢² (homoskedasticity), then 𝑉𝑎𝑟(𝛽̂1 ) = 𝜎𝑢² ⁄ (𝑛𝜎𝑋²).

There is no need to memorize this formula, since the software will calculate it for us. The
testing procedure then proceeds as usual.
Confidence intervals
Confidence intervals are constructed as usual, e.g. a 95% confidence interval for 𝛽1 is 𝛽̂1 ± 1.96 × 𝑆𝐸(𝛽̂1 ); several examples are given in the original slides.
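A sketch of the full routine on hypothetical simulated data (numpy assumed): the heteroskedasticity-robust 𝑆𝐸(𝛽̂1 ) from the formula above, the t-statistic for 𝐻0: 𝛽1 = 0, and the 95% confidence interval.

import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(10.0, 2.0, n)
y = 3.0 + 0.5 * x + rng.normal(0.0, 1.0, n)

beta1_hat, beta0_hat = np.polyfit(x, y, 1)
u_hat = y - (beta0_hat + beta1_hat * x)

# heteroskedasticity-robust variance estimator of beta1_hat
dx = x - x.mean()
var_hat = (1.0 / n) * (np.sum(dx**2 * u_hat**2) / (n - 2)) / (np.mean(dx**2) ** 2)
se = np.sqrt(var_hat)

t = (beta1_hat - 0.0) / se                              # test H0: beta1 = 0
ci = (beta1_hat - 1.96 * se, beta1_hat + 1.96 * se)     # 95% confidence interval
print(t, ci)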
Regression when X is Binary
A binary variable is sometimes called a dummy variable or an indicator variable. And
how do we interpret a regression with a binary regressor? For example, if 𝑌𝑖 = 𝛽0 +
𝛽1 𝑋𝑖 + 𝑢𝑖 and 𝑋𝑖 = 0 or 1, then
𝛽1 = 𝐸(𝑌𝑖 |𝑋𝑖 = 1) − 𝐸(𝑌𝑖 |𝑋𝑖 = 0)
which is the population difference in group means.
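A quick check on hypothetical simulated data (numpy assumed) that the OLS slope on a binary regressor reproduces the difference in sample group means:

import numpy as np

rng = np.random.default_rng(0)
n = 400
d = rng.integers(0, 2, n)                 # binary regressor X in {0, 1}
y = 2.0 + 1.5 * d + rng.normal(0.0, 1.0, n)

beta1_hat, beta0_hat = np.polyfit(d, y, 1)
diff_in_means = y[d == 1].mean() - y[d == 0].mean()
print(beta1_hat, diff_in_means)           # identical up to rounding error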
Heteroskedasticity and Homoskedasticity
The meaning of these two terms:
(1) If 𝑉𝑎𝑟(𝑢|𝑋 = 𝑥) is constant, that is, the variance of the conditional distribution
of u given X does not depend on X, then u is said to be homoskedastic (constant variance).
(2) Otherwise, u is said to be heteroskedastic (non-constant variance).

So far we have allowed u to be heteroskedastic: recalling the three least squares
assumptions, none of them restricts the conditional variance, so heteroskedastic errors
are permitted.
What if the errors are in fact homoskedastic? We can prove that OLS has the lowest
variance among estimators that are linear in Y, a result called the Gauss-Markov
theorem.
Gauss-Markov conditions:
(i) 𝐸(𝑢𝑖 |𝑋1 , … , 𝑋𝑛 ) = 0
(ii) 𝑉𝑎𝑟(𝑢𝑖 |𝑋1 , … , 𝑋𝑛 ) = 𝜎𝑢 2 , 0 < 𝜎𝑢 2 < ∞, 𝑓𝑜𝑟 𝑖 = 1, … , 𝑛
(iii) 𝐸(𝑢𝑖 𝑢𝑗 |𝑋1 , … , 𝑋𝑛 ) = 0, 𝑖 = 1, … , 𝑛, 𝑖 ≠ 𝑗
Gauss-Markov Theorem
Under the Gauss-Markov conditions, the OLS estimator 𝛽̂1 is BLUE (Best Linear
Unbiased Estimator). That is, 𝑉𝑎𝑟(𝛽̂1|𝑋1 , … , 𝑋𝑛 ) ≤ 𝑉𝑎𝑟(𝛽̃1|𝑋1 , … , 𝑋𝑛 ) for all linear
conditionally unbiased estimators 𝛽̃1.
And the special case for the standard error of 𝛽̂1 under homoskedasticity is
𝜎̂𝛽̂1² = (1⁄𝑛) × [ (1⁄(𝑛−2)) ∑ᵢ₌₁ⁿ 𝑢̂𝑖² ] ⁄ [ (1⁄𝑛) ∑ᵢ₌₁ⁿ (𝑋𝑖 − 𝑋̄)² ]
However, we will use the heteroskedasticity-robust formula, since it is valid in both the
heteroskedastic and homoskedastic cases.
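The two variance formulas can be compared side by side; a sketch on hypothetical simulated data (numpy assumed) in which Var(u|X) depends on X, so the two standard errors differ:

import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(0.0, 1.0, n)
u = rng.normal(0.0, 1.0, n) * np.sqrt(0.5 + x**2)   # heteroskedastic error
y = 1.0 + 2.0 * x + u

beta1_hat, beta0_hat = np.polyfit(x, y, 1)
u_hat = y - (beta0_hat + beta1_hat * x)
dx = x - x.mean()

# heteroskedasticity-robust variance of beta1_hat
var_robust = (1.0 / n) * (np.sum(dx**2 * u_hat**2) / (n - 2)) / (np.mean(dx**2) ** 2)
# homoskedasticity-only variance of beta1_hat
var_homo = (1.0 / n) * (np.sum(u_hat**2) / (n - 2)) / np.mean(dx**2)

print(np.sqrt(var_robust), np.sqrt(var_homo))       # the robust SE is the safe choice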
Weighted Least Squares (WLS)
Since OLS under homoskedasticity is efficient, the traditional approach is to transform a
heteroskedastic model into a homoskedastic one.
Suppose the conditional variance of ui is known as a function of Xi, namely
𝑉𝑎𝑟(𝑢𝑖 |𝑋𝑖 ) = 𝜆ℎ(𝑋𝑖 )
Then we can divide both sides of the single-variable regression model by √ℎ(𝑋𝑖 ) to
obtain 𝑌̃𝑖 = 𝛽0 𝑋̃0𝑖 + 𝛽1 𝑋̃1𝑖 + 𝑢̃𝑖 , where
𝑌̃𝑖 = 𝑌𝑖 ⁄√ℎ(𝑋𝑖 ) , 𝑋̃0𝑖 = 1⁄√ℎ(𝑋𝑖 ) , 𝑋̃1𝑖 = 𝑋𝑖 ⁄√ℎ(𝑋𝑖 ) , 𝑢̃𝑖 = 𝑢𝑖 ⁄√ℎ(𝑋𝑖 )
𝑉𝑎𝑟(𝑢̃|𝑋𝑖 ) = 𝑉𝑎𝑟(𝑢𝑖 )⁄ℎ(𝑋𝑖 ) = 𝜆
The WLS estimator is the OLS estimator obtained by regressing 𝑌̃𝑖 on 𝑋̃0𝑖 and 𝑋̃1𝑖 .
However, ℎ(𝑋𝑖 ) is usually unknown, so we have to estimate it first and then replace
ℎ(𝑋𝑖 ) with the estimate ℎ̂(𝑋𝑖 ). This is called feasible WLS.

2 Linear Regression with Multiple Regressors


Omitted Variable Bias
The bias in the OLS estimator that occurs as a result of an omitted factor is called
omitted variable bias.
For omitted variable bias to occur, the omitted factor “Z” must be (1) a determinant of
Y, and (2) correlated with the regressor X. Both conditions must hold for the omission
of Z to result in omitted variable bias.
Let’s introduce a formula for omitted variable bias. First, recall the equation
𝛽̂1 = 𝛽1 + [ (1⁄𝑛) ∑ᵢ₌₁ⁿ 𝑣𝑖 ] ⁄ [ ((𝑛−1)⁄𝑛) 𝑆𝑋² ] = 𝛽1 + [ (1⁄𝑛) ∑ᵢ₌₁ⁿ (𝑋𝑖 − 𝑋̄)𝑢𝑖 ] ⁄ [ (1⁄𝑛) ∑ᵢ₌₁ⁿ (𝑋𝑖 − 𝑋̄)² ]
And (1⁄𝑛) ∑ᵢ₌₁ⁿ (𝑋𝑖 − 𝑋̄)² →ᵖ 𝜎𝑋² , (1⁄𝑛) ∑ᵢ₌₁ⁿ (𝑋𝑖 − 𝑋̄)𝑢𝑖 →ᵖ 𝐶𝑜𝑣(𝑋𝑖 , 𝑢𝑖 ) = 𝜌𝑋𝑢 𝜎𝑋 𝜎𝑢 , therefore,
𝛽̂1 →ᵖ 𝛽1 + 𝜌𝑋𝑢 (𝜎𝑢 ⁄ 𝜎𝑋 )
Now we can see that the omitted factor Z indeed induces a bias in 𝛽̂1 relative to 𝛽1,
and the direction of the bias in 𝛽̂1 depends on whether X and u are positively or
negatively correlated. To be more specific, suppose that the true model is
𝑌𝑖 = 𝛽0 + 𝛽1 𝑋𝑖 + 𝛽2 𝑍𝑖 + 𝑢𝑖 , 𝐶𝑜𝑣(𝑋𝑖 , 𝑢𝑖 ) = 0
The estimated model when Z is omitted is
𝑌𝑖 = 𝛽0 + 𝛽1 𝑋𝑖 + 𝜖𝑖 , 𝜖𝑖 ≡ 𝛽2 𝑍𝑖 + 𝑢𝑖
Then 𝐶𝑜𝑣(𝑋𝑖 , 𝜖𝑖 ) = 𝐶𝑜𝑣(𝑋𝑖 , 𝛽2 𝑍𝑖 + 𝑢𝑖 ) = 𝛽2 𝐶𝑜𝑣(𝑋𝑖 , 𝑍𝑖 ), therefore
𝛽̂1 →ᵖ 𝛽1 + 𝛽2 𝐶𝑜𝑣(𝑋𝑖 , 𝑍𝑖 ) ⁄ 𝑉𝑎𝑟(𝑋𝑖 )
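A simulation sketch (hypothetical data-generating process, numpy assumed) of this omitted variable bias formula: the slope from the regression that omits Z settles near 𝛽1 + 𝛽2 Cov(X, Z)/Var(X) rather than 𝛽1.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta0, beta1, beta2 = 1.0, 2.0, 3.0

z = rng.normal(0.0, 1.0, n)
x = 0.8 * z + rng.normal(0.0, 1.0, n)            # X correlated with the omitted Z
y = beta0 + beta1 * x + beta2 * z + rng.normal(0.0, 1.0, n)

beta1_short = np.polyfit(x, y, 1)[0]             # regression that omits Z
bias_formula = beta2 * np.cov(x, z, ddof=1)[0, 1] / np.var(x, ddof=1)
print(beta1_short, beta1 + bias_formula)         # close for large n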
Sometimes we run regressions in order to measure the causal effect of a certain event.
What is the definition of a causal effect here? A causal effect is defined to be the effect
measured in an ideal randomized controlled experiment.
The Multiple Regression Model
Consider the case of two regressors:
𝑌𝑖 = 𝛽0 + 𝛽1 𝑋1𝑖 + 𝛽2 𝑋2𝑖 + 𝑢𝑖 , 𝑖 = 1, … , 𝑛
𝑋1, 𝑋2 are the two independent variables (regressors).
(𝑌𝑖 , 𝑋1𝑖 , 𝑋2𝑖 ) denote the i-th observation on Y, X1, and X2.
𝛽0: unknown population intercept.
𝛽1: effect on Y of a change in X1, holding X2 constant.
𝛽2: effect on Y of a change in X2, holding X1 constant.
𝑢𝑖 : “error term” (omitted factors).

The OLS Estimator


With two regressors, the OLS estimator solves
min over 𝛽̂0 , 𝛽̂1 , 𝛽̂2 of ∑ᵢ₌₁ⁿ [𝑌𝑖 − (𝛽̂0 + 𝛽̂1 𝑋1𝑖 + 𝛽̂2 𝑋2𝑖 )]²

The OLS estimator minimizes the sum of squared difference between the actual
values of Yi and the prediction (predicted value) based on the estimated line. This
minimization problem yields the OLS estimators of 𝛽0, 𝛽1 and 𝛽2.
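With several regressors, the minimization is most conveniently solved in matrix form, 𝛽̂ = (X′X)⁻¹X′Y; a sketch on hypothetical simulated data (numpy assumed):

import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(0.0, 1.0, n)
x2 = 0.5 * x1 + rng.normal(0.0, 1.0, n)          # correlated regressors
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(0.0, 1.0, n)

X = np.column_stack([np.ones(n), x1, x2])        # design matrix with a constant
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # (X'X)^{-1} X'Y
print(beta_hat)                                  # estimates of (beta0, beta1, beta2)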
Measures of Fit
Actual= predicted+ residual: 𝑌𝑖 = 𝑌̂𝑖 + 𝑢̂𝑖
SER (standard error of the regression) = standard error of 𝑢̂𝑖 (with degrees-of-freedom correction) = √( (1⁄(𝑛−𝑘−1)) ∑ᵢ₌₁ⁿ 𝑢̂𝑖² )
RMSE = standard error of 𝑢̂𝑖 (without degrees-of-freedom correction) = √( (1⁄𝑛) ∑ᵢ₌₁ⁿ 𝑢̂𝑖² )

𝑅² = fraction of the sample variance of Y explained by the X’s = 𝐸𝑆𝑆 ⁄ 𝑇𝑆𝑆 = 1 − 𝑆𝑆𝑅 ⁄ 𝑇𝑆𝑆 , and 𝑅² always increases when we add another regressor (this is fixed by defining the adjusted 𝑅̄²).
𝑅̄² = “adjusted 𝑅²” = 1 − [ (𝑛−1) ⁄ (𝑛−𝑘−1) ] (𝑆𝑆𝑅 ⁄ 𝑇𝑆𝑆), and it can be negative.
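Continuing the same kind of hypothetical two-regressor example (numpy assumed), the fit measures with k = 2 regressors:

import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 2
x1 = rng.normal(0.0, 1.0, n)
x2 = 0.5 * x1 + rng.normal(0.0, 1.0, n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(0.0, 1.0, n)

X = np.column_stack([np.ones(n), x1, x2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat

ssr = np.sum(u_hat**2)
tss = np.sum((y - y.mean())**2)
ser = np.sqrt(ssr / (n - k - 1))                     # with degrees-of-freedom correction
r2 = 1.0 - ssr / tss
adj_r2 = 1.0 - (n - 1) / (n - k - 1) * (ssr / tss)   # adjusted R^2
print(ser, r2, adj_r2)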

Least Squares Assumptions


𝑌𝑖 = 𝛽0 + 𝛽1 𝑋1𝑖 + 𝛽2 𝑋2𝑖 + ⋯ + 𝛽𝑘 𝑋𝑘𝑖 + 𝑢𝑖 , 𝑖 = 1, … , 𝑛
(1) The conditional distribution of 𝑢𝑖 given the X’s has a mean of zero. (Failure of
this condition leads to omitted variable bias.)
(2) (𝑋1𝑖 , … , 𝑋𝑘𝑖 , 𝑌𝑖 ), 𝑖 = 1, … , 𝑛 are i.i.d.
(3) Large outliers are unlikely. (Each Xj and u have finite fourth moments, i.e.
𝐸(𝑋𝑗𝑖⁴) < ∞ for all j and 𝐸(𝑢𝑖⁴) < ∞.)
(4) There is no perfect multicollinearity. (The regressors are said to be perfectly
multicollinear if one of the regressors is an exact linear function of the other
regressors, and perfect multicollinearity usually reflects a mistake in the definition
of the regressors.)
Sampling Distribution of OLS Estimators
Under the four least squares assumptions:
(1) The exact (finite sample) distribution of 𝛽̂1 has mean 𝛽1, 𝑉𝑎𝑟(𝛽̂1 ) is inversely
proportional to n. So too for 𝛽̂2.
(2) Other than its mean and variance, the exact distribution of 𝛽̂1 is complicated.
(3) 𝛽̂1 is consistent: 𝛽̂1 →ᵖ 𝛽1. (WLLN)
(4) (𝛽̂1 − 𝐸(𝛽̂1 )) ⁄ √𝑉𝑎𝑟(𝛽̂1 ) is approximately distributed 𝑁(0, 1). (CLT)
(5) So too for 𝛽̂2 , … , 𝛽̂𝑘 .


Multicollinearity
As mentioned before, perfect multicollinearity usually reflects a mistake in the
definition of the regressors. The solution to perfect multicollinearity is to modify our
list of regressors so that we no longer have perfect multicollinearity.
There’s another situation, imperfect multicollinearity, which occurs when two or more
regressors are very highly correlated. Imperfect multicollinearity implies that one or
more of the regression coefficients will be imprecisely estimated, and also results in
large standard errors for one or more of the OLS coefficients.
Hypothesis Tests
Since (𝛽̂1 − 𝐸(𝛽̂1 )) ⁄ √𝑉𝑎𝑟(𝛽̂1 ) is approximately distributed 𝑁(0, 1) (CLT), hypotheses on 𝛽1 can
be tested using the usual t-statistic, and confidence intervals are constructed as
𝛽̂1 ± 1.96 × 𝑆𝐸(𝛽̂1 ), and similarly for 𝛽2 , … , 𝛽𝑘 . (Note that 𝛽̂1 and 𝛽̂2 are generally
not independently distributed, and neither are their t-statistics.)


Tests of Joint Hypotheses
A joint hypothesis specifies a value for two or more coefficients, that is, it imposes
restrictions on two or more coefficients.
A “common sense” test is to reject if either of the individual t-statistics exceeds 1.96
in absolute value. But this “common sense” approach doesn’t work: the resulting test
doesn’t have the right significance level, because it miscalculates the probability of
rejecting the null when the decision is based on the two individual t-statistics.
The size of a test is the actual rejection rate under the null hypothesis.
The size of the “common sense” test is not the 5% originally set, since its size depends on
the correlation between t1 and t2. There are two solutions: one is to use a different
critical value than 1.96 (the Bonferroni method, rarely used in practice), and the other
is to use the F-statistic.
The F-statistic tests all parts of a joint hypothesis at once. The formula for the special case
of the joint hypothesis 𝛽1 = 𝛽1,0 and 𝛽2 = 𝛽2,0 in a regression with two regressors is
𝐹 = (1⁄2) × ( 𝑡1² + 𝑡2² − 2𝜌̂𝑡1,𝑡2 𝑡1 𝑡2 ) ⁄ ( 1 − 𝜌̂𝑡1,𝑡2² )
where 𝑡𝑖 = (𝛽̂𝑖 − 𝛽𝑖,0 ) ⁄ 𝑆𝐸(𝛽̂𝑖 ) and 𝜌̂𝑡1,𝑡2 is the estimated correlation between 𝑡1 and 𝑡2 .

The test rejects when F is large.
Consider the special case in which 𝑡1 and 𝑡2 are independent, so 𝜌̂𝑡1,𝑡2 →ᵖ 0; then the formula becomes
𝐹 ≅ (1⁄2) (𝑡1² + 𝑡2²)
In large samples, the F-statistic is distributed as 𝜒𝑞² ⁄ 𝑞 .
The p-value here is the tail probability of the 𝜒𝑞² ⁄ 𝑞 distribution beyond the F-statistic
actually computed.
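A sketch of the two-restriction F-statistic computed from the two t-statistics and their estimated correlation; the inputs below are hypothetical numbers (in practice the statistic and its 𝜒𝑞² ⁄ 𝑞 p-value come straight from the regression software), and scipy is assumed for the tail probability.

import numpy as np
from scipy.stats import chi2

# hypothetical ingredients, as they would be reported by the regression output
t1, t2 = 2.4, -1.1          # individual t-statistics for beta1 = beta1_0, beta2 = beta2_0
rho = 0.3                   # estimated correlation between t1 and t2
q = 2                       # number of restrictions

F = 0.5 * (t1**2 + t2**2 - 2 * rho * t1 * t2) / (1 - rho**2)
p_value = chi2.sf(q * F, df=q)       # tail probability of the chi2_q / q distribution beyond F
print(F, p_value)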
Single Restriction Test
Consider the null and alternative hypothesis,
𝐻0 : 𝛽1 = 𝛽2 𝑣𝑠. 𝐻1 : 𝛽1 ≠ 𝛽2
This null imposes a single restriction (q=1) on multiple coefficients. Two methods for
testing single restrictions on multiple coefficients:
(1) Rearrange (“transform”) the regression. Rearrange the regressors so that the
restriction becomes a restriction on a single coefficient in an equivalent
regression.
(2) Perform the test directly.
Let’s see how method (1) works: let
𝑌𝑖 = 𝛽0 + 𝛾1 𝑋1𝑖 + 𝛽2 𝑊𝑖 + 𝑢𝑖
where 𝛾1 = 𝛽1 − 𝛽2 , 𝑊𝑖 = 𝑋1𝑖 + 𝑋2𝑖 . So now
𝐻0 : 𝛾1 = 0 𝑣𝑠. 𝐻1 : 𝛾1 ≠ 0
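A sketch of method (1) on hypothetical simulated data (numpy assumed): regress Y on X1 and W = X1 + X2, and read the restriction 𝛽1 = 𝛽2 off the single coefficient 𝛾1 on X1.

import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(0.0, 1.0, n)
x2 = rng.normal(0.0, 1.0, n)
y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(0.0, 1.0, n)   # true beta1 = beta2, so gamma1 = 0

w = x1 + x2
X = np.column_stack([np.ones(n), x1, w])                  # regress Y on X1 and W = X1 + X2
b = np.linalg.solve(X.T @ X, X.T @ y)
gamma1_hat = b[1]                                         # estimate of beta1 - beta2
print(gamma1_hat)    # test H0: gamma1 = 0 with the usual t-statistic from the software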
Model Specification
The job of determining which variables to include in multiple regression- that is, the
problem of choosing a regression specification- can be quite challenging, and no
single rule applies in all situations. The starting point for choosing a regression
specification is thinking through the possible sources of omitted variable bias. It’s
important to rely on our expert knowledge of the empirical problem and to focus on
obtaining an unbiased estimate of the causal effect of interest. Don’t rely solely on
purely statistical measures of fit.
A control variable isn’t the object of interest; rather it’s a regressor included to hold
constant factors that, if neglected, could lead the estimated causal effect of interest to
suffer from omitted variable bias. The mechanism by which adding control variable(s)
makes the coefficient of interest unbiased will be introduced later.
Three interchangeable statements about what makes an effective control variable:
(i) An effective control variable is one which, when included in the regression,
makes the error term uncorrelated with the variable of interest.
(ii) Holding constant the control variable(s), the variable of interest is “as if”
randomly assigned.
(iii) Among individuals (entities) with the same value of the control variable(s), the
variable of interest is uncorrelated with the omitted determinants of Y.
We need a mathematical statement of what makes an effective control variable. This
condition is conditional mean independence: given the control variable, the mean of ui
doesn’t depend on the variable of interest.
Conditional mean independence: let Xi denote the variable of interest and Wi denote
the control variable(s). Wi is an effective control variable if conditional mean
independence holds: 𝐸(𝑢𝑖 |𝑋𝑖 , 𝑊𝑖 ) = 𝐸(𝑢𝑖 |𝑊𝑖 ).
Consider the regression model,
𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝛽2 𝑊 + 𝑢
Where X is the variable of interest and W is an effective control variable. In addition,
suppose that LSA #2, #3, and #4 hold. Then:
(1) 𝛽1 has a causal interpretation.
(2) 𝛽̂1 is unbiased.
(3) The coefficient on the control variable, 𝛽̂2 , is in general biased. This bias stems
from the fact that the control variable is correlated with omitted variables in the
error term, so that it is subject to omitted variable bias.
6 Instrumental Variable Regression
One Regressor and One Instrument
Three important threats to internal validity are:
(1) Omitted variable bias from a variable that is correlated with X but is unobserved,
so cannot be included in the regression;
(2) Simultaneous causality bias (X causes Y, Y causes X);
(3) Errors-in-variables bias (X is measured with error).
Instrumental variables regression can eliminate bias when 𝐸(𝑢|𝑋) ≠ 0 – using an
instrumental variable, Z.
Instrumental variable (IV) regression is a general way to obtain a consistent estimator
of the unknown coefficients of the population regression function when the regressor,
X, is correlated with the error term, u.
The information about the movements in X that are uncorrelated with u is gleaned
from one or more additional variables, called instrumental variables or simply
instruments.
IV regression uses these additional variables as tools or “instruments” to isolate the
movements in X that are uncorrelated with u, which in turn permit consistent
estimation of the regression coefficients.
The IV Model and Assumptions
𝑌𝑖 = 𝛽0 + 𝛽1 𝑋𝑖 + 𝑢𝑖
If Xi and ui are correlated, the OLS estimator is inconsistent.
IV estimation uses an additional, “instrumental” variable Z to isolate that part of Xi
that is uncorrelated with ui.
Terminology: an “endogenous” variable is one that is correlated with u; an
“exogenous” variable is one that is uncorrelated with u.
Two conditions for a valid IV, Z:
(1) Instrument relevance: 𝐶𝑜𝑣(𝑍𝑖 , 𝑋𝑖 ) ≠ 0
(2) Instrument exogeneity: 𝐶𝑜𝑣(𝑍𝑖 , 𝑢𝑖 ) = 0
The Two Stage Least Squares (TSLS) Estimator
As it sounds, TSLS has two stages- two regressions:
(1) First isolates the part of X that is uncorrelated with u: regress X on Z using OLS.
𝑋𝑖 = 𝜋0 + 𝜋1 𝑍𝑖 + 𝑣𝑖
Because 𝑍𝑖 is uncorrelated with 𝑢𝑖 , 𝜋0 + 𝜋1 𝑍𝑖 is uncorrelated with 𝑢𝑖 . We
don’t know 𝜋0 or 𝜋1 , but we can estimate them.
Compute the predicted values of 𝑋𝑖 : 𝑋̂𝑖 = 𝜋̂0 + 𝜋̂1 𝑍𝑖 . (And thus 𝑋̂𝑖 is
uncorrelated with 𝑢𝑖 .)
(2) Replace Xi by 𝑋̂𝑖 in the regression of interest: regress Y on 𝑋̂𝑖 using OLS:
𝑌𝑖 = 𝛽0 + 𝛽1 𝑋̂𝑖 + 𝑢𝑖
Because 𝑋̂𝑖 is uncorrelated with 𝑢𝑖 in large samples, the first least squares
assumption holds.
Thus 𝛽1 can be estimated by OLS using regression (2).
This argument relies on large samples (so 𝜋0 and 𝜋1 are well estimated using
regression (1)).
The resulting estimator is called the TSLS estimator, 𝛽̂1^TSLS.
Theoretically, we can show that 𝛽1 = 𝐶𝑜𝑣(𝑌𝑖 , 𝑍𝑖 ) ⁄ 𝐶𝑜𝑣(𝑋𝑖 , 𝑍𝑖 ). Similarly, the IV estimator is the sample analogue:
𝛽̂1^TSLS = 𝑆𝑌𝑍 ⁄ 𝑆𝑋𝑍
We can also show the consistency of the TSLS estimator: 𝛽̂1^TSLS →ᵖ 𝛽1 .

And in large samples 𝛽̂1^TSLS is approximately distributed 𝑁(𝛽1 , 𝜎²(𝛽̂1^TSLS)), where 𝜎²(𝛽̂1^TSLS) = (1⁄𝑛) 𝑉𝑎𝑟[(𝑍 − 𝜇𝑍 )𝑢] ⁄ [𝐶𝑜𝑣(𝑋, Z)]² .
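A sketch of the two stages on a hypothetical simulated design (numpy assumed) with an endogenous X and a valid instrument Z; it also checks that the two-stage slope equals the ratio form 𝑆𝑌𝑍 ⁄ 𝑆𝑋𝑍. (As noted below, the SEs from such a manual second stage would be wrong; proper software computes them in one command.)

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(0.0, 1.0, n)                      # instrument: relevant and exogenous
v = rng.normal(0.0, 1.0, n)
u = 0.8 * v + rng.normal(0.0, 1.0, n)            # error correlated with X through v
x = 1.0 + 0.9 * z + v                            # endogenous regressor
y = 2.0 + 0.5 * x + u                            # true beta1 = 0.5

beta1_ols = np.polyfit(x, y, 1)[0]               # inconsistent because Cov(X, u) != 0

# Stage 1: regress X on Z, form the fitted values X_hat
pi1, pi0 = np.polyfit(z, x, 1)
x_hat = pi0 + pi1 * z
# Stage 2: regress Y on X_hat
beta1_tsls = np.polyfit(x_hat, y, 1)[0]

beta1_iv = np.cov(y, z, ddof=1)[0, 1] / np.cov(x, z, ddof=1)[0, 1]   # S_YZ / S_XZ
print(beta1_ols, beta1_tsls, beta1_iv)           # TSLS and IV agree, both near 0.5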

The General IV Regression Model


𝑌𝑖 = 𝛽0 + 𝛽1 𝑋1𝑖 + ⋯ + 𝛽𝑘 𝑋𝑘𝑖 + 𝛽𝑘+1 𝑊1𝑖 + ⋯ + 𝛽𝑘+𝑟 𝑊𝑟𝑖 + 𝑢𝑖
𝑌𝑖 : the dependent variable.
𝑋1𝑖 , … , 𝑋𝑘𝑖 : the endogenous regressors (potentially correlated with 𝑢𝑖 )
𝑊1𝑖 , … , 𝑊𝑟𝑖 : the included exogenous variables or included exogenous regressors
(uncorrelated with 𝑢𝑖 ).
𝛽0 , 𝛽1 , … , 𝛽𝑘+𝑟 : the unknown regression coefficients.
𝑍1𝑖 , … , 𝑍𝑚𝑖 : the m IVs.
Identification:
In general, a parameter is said to be identified if different values of the parameter
would produce different distributions of the data.
In IV regression, whether the coefficients are identified depends on the relation
between the number of instruments (m) and the number of endogenous regressors (k).
Intuitively, if there are fewer instruments than endogenous regressors, we can’t
estimate 𝛽1 , … , 𝛽𝑘 . And the coefficients 𝛽1 , … , 𝛽𝑘 are said to be exactly identified if
m= k; overidentified if m>k (if so, we can test whether the instruments are valid);
underidentified if m<k.
TSLS, 1 endogenous regressor
𝑌𝑖 = 𝛽0 + 𝛽1 𝑋1𝑖 + 𝛽2 𝑊1𝑖 + ⋯ + 𝛽1+𝑟 𝑊𝑟𝑖 + 𝑢𝑖
Instruments: 𝑍1𝑖 , … , 𝑍𝑚𝑖
First stage: regress X1 on all the exogenous variables 𝑊1 , … , 𝑊𝑟 and 𝑍1 , … , 𝑍𝑚
by OLS, and compute the predicted values 𝑋̂1𝑖 .
Second stage: regress Y on 𝑋̂1 , 𝑊1 , … , 𝑊𝑟 by OLS. The coefficients from this second-stage
regression are the TSLS estimators, but its SEs are wrong. To get correct SEs, run
TSLS in a single command.
Implications: Sampling distribution of TSLS
If the IV regression assumptions hold, then the TSLS estimator is normally distributed
in large samples.
Inference (hypothesis testing, confidence intervals) proceeds as usual.
Checking Instrument Validity
We’re here to check the two requirements for valid instruments: relevance and
exogeneity.
If the instruments are too weak, 𝑆𝑋𝑍 is close to zero and 𝛽̂1^TSLS = 𝑆𝑌𝑍 ⁄ 𝑆𝑋𝑍 can be far too large. To see the weak-IV problem, write
𝛽̂1^TSLS ≅ 𝛽1 + 𝑞 ⁄ 𝑟 = 𝛽1 + (𝜎𝑞 ⁄ 𝜎𝑟 ) × (𝑞 ⁄ 𝜎𝑞 ) ⁄ (𝑟 ⁄ 𝜎𝑟 )
where 𝑞 = (1⁄𝑛) ∑ᵢ₌₁ⁿ (𝑍𝑖 − 𝑍̄)𝑢𝑖 and 𝑟 = (1⁄𝑛) ∑ᵢ₌₁ⁿ (𝑍𝑖 − 𝑍̄)(𝑋𝑖 − 𝑋̄), with 𝜎𝑞² = 𝑉𝑎𝑟[(𝑍𝑖 − 𝜇𝑍 )𝑢𝑖 ] ⁄ 𝑛 and 𝜎𝑟² = 𝑉𝑎𝑟[(𝑍𝑖 − 𝜇𝑍 )(𝑋𝑖 − 𝜇𝑋 )] ⁄ 𝑛 .
To check for weak instruments with a single included endogenous regressor, check the
first-stage F-statistic on the instruments: if F > 10, the instruments are not weak, so use
TSLS; if F < 10, the instruments are weak, so take some remedial action.
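A sketch of this weak-instrument check via the (homoskedasticity-only) first-stage F-statistic, on a hypothetical design with one endogenous regressor, one instrument, and no included W's (numpy assumed):

import numpy as np

rng = np.random.default_rng(0)
n = 500
z = rng.normal(0.0, 1.0, n)
x = 1.0 + 0.1 * z + rng.normal(0.0, 1.0, n)      # try 0.1 (weak) vs 0.9 (strong)

# first stage: X on a constant and Z (unrestricted) vs a constant only (restricted)
pi1, pi0 = np.polyfit(z, x, 1)
ssr_u = np.sum((x - (pi0 + pi1 * z))**2)
ssr_r = np.sum((x - x.mean())**2)

q = 1                                            # number of instruments excluded under H0
F = ((ssr_r - ssr_u) / q) / (ssr_u / (n - 2))    # homoskedasticity-only first-stage F
print(F)                                         # rule of thumb: F > 10 means not weak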
