
Statistics II - Least Squares Regression

Marcelo Sant’Anna

FGV EPGE

July 17, 2019



Samples

We strengthen the assumption about our sample in order to derive finite-sample properties of the estimators we are studying.

Assumption
The observations (yi , xi ) for i = 1, . . . , n are independent and identically
distributed (i.i.d.).

The independence assumption makes our lives easier, but it is strong and might not be reasonable in a variety of contexts.

Fortunately, it can be relaxed in many applications, allowing us to make inference that is ‘robust’ to some forms of dependence.



Linear regression model assumptions

Assumption (Linear Regression Model)


Observations (yᵢ, xᵢ) satisfy the linear regression equation

yᵢ = xᵢ′β + eᵢ
E[eᵢ | xᵢ] = 0,

have finite moments, and E[xᵢxᵢ′] is non-singular.

There are two cases for the conditional variance of errors:

Homoskedastic: E[eᵢ² | xᵢ] = σ²(xᵢ) = σ²
Heteroskedastic: E[eᵢ² | xᵢ] = σ²(xᵢ) = σᵢ²



Unbiasedness of OLS

Theorem (Conditional mean of OLS estimator)


In the linear regression model with iid sampling:
E[β̂ | X] = β.

From the expression of the OLS estimator:

β̂ = (X′X)⁻¹(X′y)
   = (X′X)⁻¹(X′(Xβ + e))
   = β + (X′X)⁻¹(X′e)

Taking expectations conditional on X:

E[β̂ − β | X] = E[(X′X)⁻¹(X′e) | X]
             = (X′X)⁻¹ X′ E[e | X] = 0
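This conditional unbiasedness is easy to check by simulation. Below is a minimal numpy sketch (the design and parameter values are our own illustration, not from the lecture): we hold X fixed, redraw e with E[e | X] = 0 many times, and average the OLS draws.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
beta = np.array([1.0, -2.0, 0.5])          # illustrative true coefficients
X = rng.normal(size=(n, k))                # design held fixed: we condition on X

reps = 5000
draws = np.empty((reps, k))
for r in range(reps):
    e = rng.normal(size=n)                 # errors with E[e|X] = 0
    y = X @ beta + e
    draws[r] = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimate

print(draws.mean(axis=0))                  # close to beta: E[beta-hat|X] = beta
```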



Variance of OLS estimator

Vβ̂ = var(β̂ | X) = E[(β̂ − E[β̂|X])(β̂ − E[β̂|X])′ | X]
    = E[(β̂ − β)(β̂ − β)′ | X]
    = E[((X′X)⁻¹X′e)((X′X)⁻¹X′e)′ | X]
    = E[(X′X)⁻¹X′ee′X(X′X)⁻¹ | X]
    = (X′X)⁻¹ X′ E[ee′ | X] X (X′X)⁻¹, where we define D ≡ E[ee′ | X].

What is D? A diagonal n × n matrix: D = diag(σ₁², …, σₙ²)

Diagonal terms: E[eᵢ² | X] = E[eᵢ² | xᵢ] = σᵢ²
Off-diagonal terms: E[eᵢeⱼ | X] = E[eᵢ | xᵢ] E[eⱼ | xⱼ] = 0, by independence across observations
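The sandwich formula can also be verified numerically. A small sketch under an assumed heteroskedastic design (the variance function σᵢ = 0.5 + |xᵢ₁| is purely illustrative): the exact conditional variance (X′X)⁻¹X′DX(X′X)⁻¹ should match the Monte Carlo variance of β̂.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 2
X = rng.normal(size=(n, k))
sigma_i = 0.5 + np.abs(X[:, 0])            # assumed variance function, illustration only

XtX_inv = np.linalg.inv(X.T @ X)
D = np.diag(sigma_i ** 2)                  # D = E[ee'|X]: diagonal by independence
V_exact = XtX_inv @ X.T @ D @ X @ XtX_inv  # the sandwich formula

reps = 20000
draws = np.empty((reps, k))
for r in range(reps):
    e = sigma_i * rng.normal(size=n)       # E[e_i|x_i] = 0, var(e_i|x_i) = sigma_i^2
    draws[r] = XtX_inv @ (X.T @ e)         # with beta = 0, beta-hat = (X'X)^{-1} X'e
print(np.diag(V_exact))
print(np.var(draws, axis=0))               # should match up to simulation noise
```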
Gauss-Markov theorem

Theorem
In the homoskedastic linear regression model with iid sampling, if β̃ is a linear
unbiased estimator of β, then

var(β̃ | X) ≥ σ²(X′X)⁻¹.

What does the relation ≥ mean for matrices? A ≥ B ⇐⇒ A − B is positive semi-definite.

The OLS estimator under homoskedasticity has variance Vβ̂ = σ²(X′X)⁻¹, since in that case D = σ²I. This establishes that the OLS β̂ is the famous BLUE (best linear unbiased estimator).

Gauss-Markov theorem

What does it mean to be a “linear unbiased estimator of β”?

Linear function of y:

β̃ = A′y,

where A is an n × k matrix that is a function of X. OLS is the particular case A = X(X′X)⁻¹.

Unbiasedness:

E[β̃ | X] = E[A′y | X] = A′E[y | X] = A′Xβ,

so β̃ is unbiased ⇐⇒ A′X = I.

And the variance of β̃ (under homoskedasticity):

var(β̃ | X) = A′E[ee′ | X]A = σ²A′A



Gauss-Markov theorem

β̃ = A′y
var(β̃ | X) = σ²A′A

We now show the theorem. We need to show that, for any linear unbiased estimator (A′X = I),

A′A − (X′X)⁻¹ ≥ 0.

The trick is setting C = A − X(X′X)⁻¹. Note that C′X = 0. Rewriting:

A′A − (X′X)⁻¹ = (C + X(X′X)⁻¹)′(C + X(X′X)⁻¹) − (X′X)⁻¹
             = C′C + (X′X)⁻¹X′C + C′X(X′X)⁻¹ + (X′X)⁻¹ − (X′X)⁻¹
             = C′C ≥ 0.
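A numerical illustration of this argument (the design and names are ours): any matrix C with C′X = 0 can be generated using the annihilator M = I − X(X′X)⁻¹X′, and the resulting gap A′A − (X′X)⁻¹ should have only non-negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 50, 3
X = rng.normal(size=(n, k))
XtX_inv = np.linalg.inv(X.T @ X)

# Any linear unbiased estimator can be written A = X(X'X)^{-1} + C with C'X = 0.
M = np.eye(n) - X @ XtX_inv @ X.T          # annihilator matrix: MX = 0
C = M @ rng.normal(size=(n, k))            # arbitrary C satisfying C'X = 0
A = X @ XtX_inv + C                        # then A'X = I, so A'y is unbiased

gap = A.T @ A - XtX_inv                    # equals C'C by the argument above
print(np.linalg.eigvalsh(gap).min())       # non-negative, up to rounding error
```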



Estimation of error variance

The method of moments estimator of the error variance σ² = E[eᵢ²] is naturally:

σ̂² = (1/n) Σ_{i=1}^n êᵢ²

Under homoskedasticity, this estimator will be biased:

E[σ̂² | X] = ((n − k)/n) σ²

An obvious fix for the bias is to rescale the estimator:

s² = (1/(n − k)) Σ_{i=1}^n êᵢ²
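A short simulation sketch (parameter values are illustrative) that exhibits the bias factor (n − k)/n of σ̂² and the unbiasedness of s²:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, sigma2 = 30, 4, 2.0                  # small n, so the bias is visible
X = rng.normal(size=(n, k))
P = X @ np.linalg.inv(X.T @ X) @ X.T       # projection onto the column space of X

reps = 20000
mom, unb = 0.0, 0.0
for r in range(reps):
    y = X @ np.ones(k) + np.sqrt(sigma2) * rng.normal(size=n)
    ehat = y - P @ y                       # OLS residuals
    mom += ehat @ ehat / n                 # sigma-hat^2
    unb += ehat @ ehat / (n - k)           # s^2
print(mom / reps, sigma2 * (n - k) / n)    # ~ 1.73: biased downward
print(unb / reps, sigma2)                  # ~ 2.0: unbiased
```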



Covariance matrix estimation with homoskedasticity

Remember that

Vβ̂ = var(β̂ | X) = (X′X)⁻¹ X′ E[ee′ | X] X (X′X)⁻¹, with D ≡ E[ee′ | X].

Under homoskedasticity, D = σ²Iₙ, so:

Vβ̂ = σ²(X′X)⁻¹.

A natural unbiased estimator for Vβ̂ is then:

V̂β̂⁰ = s²(X′X)⁻¹.
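In code this estimator is one line. A minimal helper, assuming the OLS residuals ehat are already computed (the function name is ours):

```python
import numpy as np

def homoskedastic_cov(X, ehat):
    """V-hat^0 = s^2 (X'X)^{-1}, with s^2 = (1/(n-k)) * sum of squared residuals."""
    n, k = X.shape
    s2 = (ehat @ ehat) / (n - k)
    return s2 * np.linalg.inv(X.T @ X)
```

Standard errors for the individual coefficients are the square roots of the diagonal, e.g. np.sqrt(np.diag(homoskedastic_cov(X, ehat))).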



Covariance matrix estimation with heteroskedasticity

Vβ̂ = var(β̂ | X) = (X′X)⁻¹ X′ E[ee′ | X] X (X′X)⁻¹, with D ≡ E[ee′ | X].

If the true errors were observed, an ideal estimator would be:

V̂β̂ⁱᵈᵉᵃˡ = (X′X)⁻¹ (Σ_{i=1}^n xᵢxᵢ′eᵢ²) (X′X)⁻¹.

Why? Note that

E[V̂β̂ⁱᵈᵉᵃˡ | X] = (X′X)⁻¹ (Σ_{i=1}^n xᵢxᵢ′ E[eᵢ² | X]) (X′X)⁻¹
              = (X′X)⁻¹ (Σ_{i=1}^n xᵢxᵢ′σᵢ²) (X′X)⁻¹
              = (X′X)⁻¹ (X′DX) (X′X)⁻¹ = Vβ̂



Covariance matrix estimation with heteroskedasticity

A simple feasible estimator under heteroskedasticity was proposed by Eicker and White:

V̂β̂ᴴᶜ⁰ = (X′X)⁻¹ (Σ_{i=1}^n xᵢxᵢ′êᵢ²) (X′X)⁻¹.

However, we know êᵢ² is biased towards zero, so one may wish to inflate the estimator:

V̂β̂ᴴᶜ¹ = (n/(n − k)) (X′X)⁻¹ (Σ_{i=1}^n xᵢxᵢ′êᵢ²) (X′X)⁻¹
V̂β̂ᴴᶜ² = (X′X)⁻¹ (Σ_{i=1}^n (1 − hᵢᵢ)⁻¹ xᵢxᵢ′êᵢ²) (X′X)⁻¹
V̂β̂ᴴᶜ³ = (X′X)⁻¹ (Σ_{i=1}^n (1 − hᵢᵢ)⁻² xᵢxᵢ′êᵢ²) (X′X)⁻¹

where hᵢᵢ = xᵢ′(X′X)⁻¹xᵢ is the i-th leverage value. HC2 and HC3 are more conservative than HC0:

V̂β̂ᴴᶜ⁰ < V̂β̂ᴴᶜ² < V̂β̂ᴴᶜ³
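A sketch implementing all four estimators in numpy (the function name and interface are ours; mature libraries such as statsmodels ship these estimators under the same HC0–HC3 labels):

```python
import numpy as np

def hc_cov(X, ehat, kind="HC1"):
    """Heteroskedasticity-robust covariance estimators HC0-HC3."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # leverage values h_ii
    u2 = ehat ** 2
    scale = {"HC0": u2,
             "HC1": u2 * n / (n - k),
             "HC2": u2 / (1 - h),
             "HC3": u2 / (1 - h) ** 2}[kind]
    meat = (X * scale[:, None]).T @ X             # sum_i scale_i * x_i x_i'
    return XtX_inv @ meat @ XtX_inv
```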
Clustered Sampling

In many contexts, the random sampling assumption might be too strong. Observations in the same group often share unobserved factors that affect the outcomes we want to measure:

Students in the same class share teacher, location, and peer effects, which should make their achievement outcomes correlated;

Bidders in an auction may share common information about the object being sold, which will imply a correlation in their bidding behavior;

Can you think of more examples?

We will briefly discuss the implications of clustered data for standard covariance
estimation and how to make covariance estimates robust to this environment.



Basic notation

Now we need a new index to identify the clusters of data:

{(y_ig, x_ig) : g = 1, …, G; i = 1, …, n_g},

and we can partition the observations into the different clusters:

y_g = (y_1g, …, y_{n_g g})′ and X_g = (x_1g, …, x_{n_g g})′.

The regression equation for cluster g is

y_g = X_g β + e_g,

and stacking up all clusters we get back

y = Xβ + e.



Unbiasedness of OLS in the clustered world

Assumption (Independent clusters)

The clusters (y_g, X_g) are mutually independent across g.

Assumption (Conditional mean (in cluster))

Cluster errors are conditionally mean zero for all g = 1, …, G:

E[e_g | X_g] = 0.

Theorem
Given the assumptions above, the OLS estimator is still unbiased in the clustered world:

E[β̂ | X] = β.

Variance of the OLS estimator

Theorem
The variance of the OLS estimator, given the cluster assumptions above, is

Vβ̂ = var(β̂ | X) = (X′X)⁻¹ (Σ_{g=1}^G X_g′ E[e_g e_g′ | X_g] X_g) (X′X)⁻¹.

 
When E[e_ig² | X_g] = σ² and E[e_ig e_jg | X_g] = ρσ² for j ≠ i, regressors do not vary within clusters, and all clusters have N observations, the formula above simplifies to

Vβ̂ = (X′X)⁻¹ σ² (1 + ρ(N − 1)).

For example, with ρ = 0.2 and N = 30, the variance is inflated by a factor of 1 + 0.2 × 29 = 6.8 relative to the i.i.d. case, so standard errors that ignore clustering would be understated by a factor of √6.8 ≈ 2.6.
Arellano’s variance estimator robust to clustering is a direct extension of the Eicker-White estimator to the clustered world:

V̂β̂ = (X′X)⁻¹ (Σ_{g=1}^G X_g′ ê_g ê_g′ X_g) (X′X)⁻¹
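A sketch of this estimator in numpy (the function name is ours), looping over clusters to build the middle “meat” matrix:

```python
import numpy as np

def cluster_cov(X, ehat, cluster_ids):
    """Arellano-style cluster-robust covariance for the OLS estimator."""
    k = X.shape[1]
    XtX_inv = np.linalg.inv(X.T @ X)
    meat = np.zeros((k, k))
    for g in np.unique(cluster_ids):
        sel = cluster_ids == g
        s = X[sel].T @ ehat[sel]           # X_g' e-hat_g, a k-vector
        meat += np.outer(s, s)             # = X_g' e-hat_g e-hat_g' X_g
    return XtX_inv @ meat @ XtX_inv
```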
