You are on page 1of 98

Econometrics - lecture notes

Andrea Carriero
October 31, 2023

Contents

I The linear regression model 4


1 Introduction 4
1.1 Univariate model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Multivariate model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Ordinary Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Algebraic properties of the OLS estimator. 8


2.1 Orthogonality and its implications . . . . . . . . . . . . . . . . . . . . 8
2.2 Goodness of …t . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3 Small sample properties of the OLS estimator 13


3.1 Unbiasedness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3 Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.4 Gauss-Markov theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 16

4 Inference on individual coe¢ cients 18


4.1 Estimation of error variance . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Distribution of OLS estimator when error variance is estimated . . . 20

5 Partitioned Regressions and the Frish-Waugh Theorem 22

1
6 Inference on multiple coe¢ cients and restricted least squares 26
6.1 Specifying multiple linear restrictions . . . . . . . . . . . . . . . . . . 26
6.2 Wald statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.3 Restricted least squares . . . . . . . . . . . . . . . . . . . . . . . . . . 30

7 The generalized linear regression model 34


7.1 Properties of OLS estimator and HAC standard errors . . . . . . . . 35
7.2 E¢ cient estimation via GLS . . . . . . . . . . . . . . . . . . . . . . . 37
7.3 Feasible GLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
7.3.1 Heteroskedasticity and weighted least squares . . . . . . . . . 38
7.3.2 Autocorrelation and Cochrane-Orcutt procedure . . . . . . . 39
7.3.3 Conditional heteroskedasticity and models of time varying volatil-
ities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

II Random regressors and endogeneity 41


8 Independent regressors 41
8.1 The law of iterated expectations. . . . . . . . . . . . . . . . . . . . . 41
8.2 Bias and variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
8.3 Distribution of the OLS estimator and Gauss-Markov theorem . . . . 43

9 Asymptotic theory and large sample properties of the OLS estima-


tor 45
9.1 Asymptotic theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
9.2 Large sample properties of the OLS estimator . . . . . . . . . . . . . 47
9.2.1 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
9.2.2 Asymptotic normality . . . . . . . . . . . . . . . . . . . . . . 48
9.2.3 Summary of asymptotic results . . . . . . . . . . . . . . . . . 49
9.3 Asymptotic variance estimation . . . . . . . . . . . . . . . . . . . . . 50
9.4 A digression on time series models . . . . . . . . . . . . . . . . . . . . 52

10 Endogeneity and instrumental variables 53


10.1 Examples of endogeneity . . . . . . . . . . . . . . . . . . . . . . . . . 53
10.1.1 Omitted variables . . . . . . . . . . . . . . . . . . . . . . . . . 53
10.1.2 Measurement error . . . . . . . . . . . . . . . . . . . . . . . . 53
10.1.3 Selection bias . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
10.1.4 Simultaneous equation models . . . . . . . . . . . . . . . . . . 56
10.2 Instruments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

2
10.3 Two stage least squares . . . . . . . . . . . . . . . . . . . . . . . . . . 65
10.4 Hausman test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

III Likelihood based methods 68


11 Maximum Likelihood estimation 68
11.1 ML estimation of CLRM . . . . . . . . . . . . . . . . . . . . . . . . . 69
11.2 Logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
11.3 ML estimation of GLRM . . . . . . . . . . . . . . . . . . . . . . . . . 73
11.3.1 ARCH models . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
11.3.2 ARMA models . . . . . . . . . . . . . . . . . . . . . . . . . . 75
11.4 State space models (linear and Gaussian) . . . . . . . . . . . . . . . . 76

12 James - Stein Estimators 81


12.1 James-Stein result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
12.2 Application to CLRM . . . . . . . . . . . . . . . . . . . . . . . . . . 81
12.3 Bayesian interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . 82

13 Bayesian treatement of the linear regression model 85


13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
13.2 The CLRM with independent N-IG prior . . . . . . . . . . . . . . . . 92
13.3 Gibbs sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
13.4 Generalized linear regression model . . . . . . . . . . . . . . . . . . . 95

14 Appendix A: Moments of the error variance in the LRM 98

3
Part I
The linear regression model
1 Introduction
1.1 Univariate model
We will consider models in which we have 1 dependent variable y. We will try to
explain this variable using some independent variables x.
Consider the model:

yt = a + bxt ; t = 1; :::; T
this is a set of T equations:

y1 = a + bx1
y2 = a + bx2
..
.
yT = a + bxT

This system is overdetermined: there are T equations and 2 unknowns, a and b.


There is not a solution, there are no a and b so that all these equations are satis…ed
(unless some equations are simply the same as others or linear combinations). We
could drop T -k equations, then we would have a unique solution for a and b but that
would mean loosing information. Instead, consider adding an error to the model:

yt = a + bxt + "t ; t = 1; :::; T (1)


this is a set of T equations:

y1 = a + bx1 + "1
y2 = a + bx2 + "2
..
.
yT = a + bxT + "T

We have now 2+T unknowns: a,b, and "t ; t = 1; :::; T . The system now has more
unknowns than equations and there are many solutions. We choose the solution that

4
minimizes the sum of the squared residuals

RSS = "21 + "22 + :::: + "2T

^ and ^b (of a and


by minimizing the SSR we get the ordinary least squares estimates a
b). The y^t is the regression line:

^ + ^bxt
y^t = a

and the residual is:


yt y^t = ^"t :
This gives us:
yt = y^t + ^"t
data; actual mod el;f itted residuals

This is a key decomposition, which we will explore further later on.


The goal is now to extend the model to the case of more than one regressor. First,
realize that the intercept is a regressor

yt = a 1+b xt + "t ; t = 1; :::; T (2)

but a special one, as it is equal to 1 in all time periods:

y1 = a 1+b x1 + "1 ;
y2 = a 1+b x2 + "2 ;
..
.
yT = a 1+b xT + "T :

The linear regression model with k regressors:

yt = 1 x1;t + 2 x2;t + 3 x3;t + ::: + k xk;t + "t

special case: k = 2; x1;t = 1 for all t = 1; :::; T :

yt = 1 + 2 x2;t + "t ;

which is the same model as (1) (with a = 1; b= 2 x2t = xt ).

5
1.2 Multivariate model
This course is about the general model

yt = 1 x1;t + 2 x2;t + 3 x3;t + ::: + k xk;t + "t ; t = 1; :::; T

where there are k regressors. This a time series model.

yj = 1 x1;j + 2 x2;j + 3 x3;j + ::: + k xk;j + "j ; j = 1; :::; N

This is a cross-section model. Panel data include both the time series and the cross
section dimension.
De…ne the matrices:
2 3 2 3
y1 "1
6 y2 7 6 "2 7
6 7 6 7
y = 6 .. 7 ; " = 6 .. 7 ;
T 1 4 . 5 4 . 5
yT "T
2 3 2 3 2 3
x1;1 x2;1 xk;1
6 x1;2 7 6 x2;2 7 6 xk;2 7
6 7 6 7 6 7
x1 = 6 .. 7 ; x2 = 6 .. 7 ; :::; xk = 6 .. 7 ;
4 . 5 4 . 5 4 . 5
x1;T x2;T xk;T
2 3 2 3
x1;1 x2;1 xk;1 1
6 x1;2 x2;2 xk;2 7 6 7
6 7
X =6 .. .. .. 7; =6
4
2 7
5
T k 4 . . . 5
x1;T x2;T xk;T k

we have:
y = X + " (3)
T 1 T kk 1 T 1

This compact notation is equivalent to

yt = 1 x1;t + 2 x2;t + 3 x3;t + ::: + k xk;t + "t ; t = 1; :::; T (4)

6
1.3 Ordinary Least Squares
Derive the Ordinary Least Squares estimator for this model.
RSS = "21 + "22 ::: + "2T = "0 "
2 3
"1
h i6 "2 7
.. 6 7
= "1 "2 . "T 6 .. 7
4 . 5
"T
= (y X )0 (y X )
| {z }| {z }
"0 "

The OLS problem:


^ OLS = arg min RSS( )
the solution to this is:
^ = (X 0 X) 1 X 0 Y:
OLS
How to interpret this? Divide both X 0 X and X 0 Y by T :
1
^ OLS = X 0X X 0Y
T T
The quantities
X 0X X 0Y
! E[xt x0t ]; ! E[xt yt ]:
T T
where xt = [x1t ; x2t ; ::::; xkt ]. So the estimator is given by the ratio of the (uncentered)
sample covariance between X and Y and the variance of X. This is reminiscent of
the formula for the model with 1 regressor:
d
^ OLS = Cov(x; y) :
Vdar(x)
Once we have the estimator we can compute the …tted values
Y^ = X ^ OLS
and the decomposition:
Y = Y^ + ^"
this is the actual, …tted, residual decomposition. The decomposition Y = Y^ + ^" is
the basis to compute the R2 . Note that:
Y = X ^ OLS + ^" = X + ":
? ?

The ^" is such that ^"0^" is the smallest possible.

7
2 Algebraic properties of the OLS estimator.
2.1 Orthogonality and its implications
We derived the OLS estimator:
^ = (X 0 X) 1 X 0 Y
OLS

Recall the First Order Condition (FOC):

(X 0 X) ^ OLS + X 0 Y = 0

that is:
X 0 (Y X ^ OLS ) = 0
or:
X 0 ^" = 0
k TT 1

Under OLS the residuals ^" are orthogonal to the regressors X. This implies the
following properties:
1. If the model has an intercept, then the residuals have sample average exactly
equal to 0:
2 3
^"1
1 0 1 6 ^"2 7 1 1X
T
6 7
1 ^" = [1 1 1 1 ::::: 1] 6 .. 7 = (^"1 + ^"2 + ^"3 + :::: + ^"T ) = ^"T = 0
T T 1 T 4 . 5 T T t=1
^"T

2. The regression line passes trough the mean of the data (Y = Y^ ):

Y = Y^ + ^"
1 0 1 0^ 1
1Y = 1 Y + 10^"
T T T
1X X 1X
T T T
1 ^
Yt = Yt + ^"T
T t=1 T t=1 T t=1

1X^
T
= Yt + 0
T t=1

Y = Y^

8
3. OLS makes the Y^ such that it orthogonal to ^":
0 0
Y^ 0^" = (X ^ )0^" = ^ X 0^" = ^ 0 = 0;
Speci…cally OLS implies a partition of Y into two subspaces orthogonal to each other.
We have:
Y^ = X ^ = X(X 0 X) 1 X 0 Y = X(X 0 X) 1 X 0 Y = PX Y
| {z } | {z }
^ PX

Where PX = X(X 0 X) 1 X is the projection matrix for the space spanned by the X.
The matrix PX projects things on the space spanned by the X: The residuals are:
^" = Y Y^ = Y PX Y = (In PX )Y = MX Y
where MX = (In PX ) is the "residual maker" matrix" or "annichilator" matrix.
The matrix MX projects things on the space orthogonal to the X:
The matrices MX and PX have some important properties. They are symmetric:
PX0 = X(X 0 X) 1 X 0 = PX
MX0 = I 0 PX0 = I PX = MX
and idempotent:
PX PX = X(X 0 X) 1 X 0 X(X 0 X) 1 X 0
= X(X 0 X) 1 X 0 = PX

MX MX = (I PX )(I PX )
= I PX PX + PX P X
= I PX = MX
MX and PX are mutually orthogonal:
PX MX = PX (I PX ) = PX P X PX = 0
The projection of X on X is X itself:
PX X = X(X 0 X) 1 X 0 X = X
The projection of X on the space of MX is 0:
MX X = (I PX )X = X PX X = 0
Therefore we have:
Y = Y^ + ^" = PX Y + MX Y

9
2.2 Goodness of …t
Recall the actual-…tted-residual decomposition:

Y = Y^ + ^"
consider squaring both sides of the equation:

Y 0Y = (Y^ + ^")0 (Y^ + ^")


= Y^ 0 Y^ + Y^ 0^" + ^"0 Y^ + ^"0^"
= Y^ 0 Y^ + ^"0^"

note that:
X
T X
T X
T
Yt2 0
= Y Y = T SS; Y^t2 = Y^ 0 Y^ = ESS; ^"t = ^"0^" = RSS
t=1 t=1 t=1

we can write:
T SS = ESS + RSS
which gives:
ESS T SS RSS RSS
R2 == =1
T SS T SS T SS
2 2
this is the *uncentered* R . However, the uncentered R has a scaling problem. We
need the centered RC2 :
XT
(Y^t Y )2
CESS
RC2 = = t=1
CT SS XT
(Yt Y )2
t=1

where
X
T X
T
(Yt 2
Y ) = CT SS; (Y^t Y )2 = CESS:
t=1 t=1
2
Moreover there is the adjusted R which takes into account how many regressors are
in the model:
T 1
R2 = 1 (1 RC2 );
T k
when k increases, R2 reduces.

10
Figure 1: The regression plane of regression of y onto x1 ; x2 . The plane has
equation y^t = x1 ^ 1 + x2 ^ 2 .

11
Figure 2: The orthogonal projection of y onto span(x1 ; x2 ) = x1 1 + x2 2

12
3 Small sample properties of the OLS estimator
So far, the vector
^ = (X 0 X) 1 X 0 Y
k k

is simply a vector of estimates. Note that we used one assumption to derive ^ . We


assumed that (X 0 X) 1 exists. In order for this matrix to exist we need that X 0 X is
invertible. We require that rank(X 0 X) = k. If X has full rank k, then also X 0 X will
have full rank k.

(A1) : X is full rank = no multicollinearity

However, we have no theory to establish how good these estimates are. To do this,
we need to add a probabilistic treatment to our model. We add:

(A2) : X is deterministic, or …xed in repeated samples

We postulate that " is a random noise. We add the following assumptions:

(A3) : E["] = 0

this implies E["t ] = 0; for all t. We add the assumption that:


2 3
2
0 0
6 0 2 7
6 7
(A4) : E[""0 ] = 2 IT = 6 .. 7
4 . 5
2
0 0

Note that
2 3 2 3
"1 "21 "1 "2 "1 "T
6 "2 7 h i 6 " " " 2 7
6 7 0 . 0 6 2 1 2 7
" = 6 .. 7 ; " = "1 "2 .. "T ; "" = 6 ... 7;
4 . 5 4 5
"T "T "1 "2T
2 3
E["21 ] E["1 "2 ] E["1 "T ]
6 E["2 "1 ] E["2 ] 7
6 2 7
E[""0 ] = 6 .. 7
4 . 5
2
E["T "1 ] E["T ]

13
so this assumption implies that

E["t "s ] = 0; for all t 6= s. no auto-correlation across periods/units


E["2t ] = 2 ; for all t. homoschedasticty (in conjunction with (A3))1

Finally, we add:
2
(A5) : " N (0; IN ) : Gaussianity
To summarize, we have the following set of assumptions.

(A1) : X is full rank = no multicollinearity


(A2) : X is deterministic, or …xed in repeated samples
(A3) : E["] = 0
(A4) : E[""0 ] = 2 IT : sphericity (homoschedasticity, no autocorrelation)
(A5) : " N ( ; ) : Gaussianity

So now we can use these to compute the properties of the OLS estimator.

3.1 Unbiasedness
lets say you want to estimate a coe¢ cient , using an estimator ^. The estimator is
unbiased i¤:
E[^] =
First of all, recall the OLS estimator:
^ = (X 0 X) 1 X 0 Y

To study the properties, we need to put the estimator in the "sample error represen-
tation":
^ = (X 0 X) 1 X 0 (X + ")
= (X 0 X) 1 X 0 X + (X 0 X) 1 X 0 "
= + (X 0 X) 1 X 0 "
actual value | {z }
sampling error

This is useful to compute the properties of the estimator. So consider taking the
expectation:
E[ ^ ] = E[ + (X 0 X) 1 X 0 "]

14
now use the properties of the E[ ] operator.

E[ ^ ] = E[ ] + E[(X 0 X) 1 X 0 "]
| {z }
constant (A2)

= + (X 0 X) 1 X 0 E["]
|{z}
=0 (A3)
0 1 0
= + (X X) X 0
=

3.2 Variance
Now, let us consider the variance of this estimator.

V ar( ^ ) = V ar( + (X 0 X) 1 X 0 ")

Here we can use the properties of the V ar[ ] operator.

V ar( ^ ) = V ar( + (X 0 X) 1 X 0 ")


= V ar( ) + V ar((X 0 X) 1 X 0 ") + 2COV (:::)
= V ar((X 0 X) 1 X 0 ")
| {z }
constant, A2

= (X X) X V ar(")X(X 0 X)
0 1 0 1

= (X 0 X) 1 X 0 2 IN X(X 0 X) 1
= 2 (X 0 X) 1 X 0 X(X 0 X) 1
= 2 (X 0 X) 1 :

however, note that in our case is not a random variable. It follows that V ar( ) =
0; 2COV (:::) = 0:

V ar(Az) = AV ar(z)A0

3.3 Distribution
Finally, the distribution of the OLS estimator.
^= + (X 0 X) 1 X 0 "

15
we know that
" N (0; 2 IN )
0 1 0
(X X) X " N ((X 0 X) 1 X 0 0; (X 0 X) 1 X 0 2
IN X(X 0 X) 1 )
N (0; 2 (X 0 X) 1 )
+ (X 0 X) 1 X 0 " N (0 + ; 2 (X 0 X) 1 )
^ N ( ; 2 (X 0 X) 1 )
To sum up:
A1 used to derive the estimator.
1) ^ is unbiased. We used A1-A3
2) ^ has variance 2 (X 0 X) 1 . We used A1-A4
3) ^ has a normal distr. We used A1-A5
The expression
^ N( ; 2
(X 0 X) 1 )
summarizes these results.

3.4 Gauss-Markov theorem


Next step, an optimality result. The OLS is the Best Linear Unbiased Estimator
(BLUE). In the class of estimators that are 1) linear 2) unbiased, OLS is the one
with the smallest variance. Gauss-Markov theorem.
Proof. consider another linear estimator.
b= C Y
k T

if C = (X 0 X) 1 X 0 then b = ^ . Require unbiasedness:


E[b] =
E[b] = E[CY ] = E[C(X + ")] = E[CX + C"] = E[CX ] + E[C"]
= CX + CE["] = CX =
this happens only if CX = I. So the class of linear unbiased estimators b is de…ned
by the conditions:
any b = CY (linear)
such that CX = I (unbiased)

16
Compute the variance:

V ar(b) = V ar(CY )
= V ar(CX + C")
= V ar(C")
= CV ar(")C 0
= C 2 IC 0
2
= CC 0

Finally, de…ne the matrix


D=C (X 0 X) 1 X 0
it follows that C = D + (X 0 X) 1 X 0 . Use this to compute:

2
V ar(b) = CC 0
2 1 1
= fD + (X 0 X) X 0 gfD + (X 0 X) X 0 g0
2 1
= fD + (X 0 X) X 0 gfD0 + X(X 0 X) 1 g
2
= fDD0 + DX(X 0 X) 1 + (X 0 X) 1 X 0 D + (X 0 X) 1 X 0 X(X 0 X) 1 g
2
= fDD0 + 0 + 0 + (X 0 X) 1 g
2
= DD0 + 2 (X 0 X) 1 ;
| {z }
V ar(OLS)

where we used DX = 0, which is implied by the unbiasedness condition CX = I:


Therefore:
V ar(b) V ar( ^ ) = 2 DD0 ;
where DD0 is a positive de…nite matrix. this means that

V ar(b) > V ar( ^ )

therefore ^ has the smallest variance in the class of linear and unbiased estimators.

17
4 Inference on individual coe¢ cients
We have considered the model
Y = X + ":
?
We have derived the estimator
^ = (X 0 X) 1 X 0 Y:

We have seen that under A1-A5:


^ N( ; 2
(X 0 X) 1 ):
| {z }
k k

In this case this is a k-variate random variable with multivariate Gaussian distribu-
tion. We want to make on the individual coe¢ cient j ; j = 1; :::; k. The distribution
of the OLS estimator for the j-th element of the vector ^ is:
^ N ( j ; [ 2 (X 0 X) 1 ]jj ):
j

The pivotal quantity:


^
p j j
N (0; 1)
[ (X X) 1 ]jj
2 0

can be used for inference. Speci…cally, for hypothesis testing and con…dence intervals.
H
Say we want to test H0 : j =0 0 : The pivotal quantity under the null is:
^j 0 H0
z^ = p N (0; 1)
[ (X X) 1 ]jj
2 0

Alternatively, we can compute con…dence intervals:


q
^ z [ 2 (X 0 X) 1 ]jj :
j

However, there is a problem:


2
Y = X + "; " N (0; |{z} I)

The value of 2 in unknown. We will solve this issue in two steps.


1) We will propose an estimator for 2
2) We will derive the distribution of ^ j based on an unknown 2 . This distribution
will depend on the estimator of 2 and it will NOT be Gaussian.

18
4.1 Estimation of error variance
Step 1. We will propose an estimator. We want to estimate
2
= E["2t ]

from A4. We could use:


1X 2
T
"
T t=1 t
This is unfeasible. A feasible estimator is:

1X 2
T
2
~ = ^"
T t=1 t

This is our proposed estimator. We want to check that is unbiased.


1 0 1 1
~2 = ^" ^" = "0 M 0 M " = "0 M "
T T T
where we have used ^" = M Y = M ": The expectation is:
1
E[~ 2 ] = E[ "0 M "]
T
1
= E["0 M "]
T | {z }
1 1
1
= E[tracef"0 M "g]
T
1 0
=
T | {z""}g]
E[tracefM
1
= tracefE[M ""0 ]g
T
1
= tracefM E[""0 ]g
T | {z }
1
= tracefM 2 Ig
T
1 2
= trace(M )
T

19
recall the matrix M

M = IT X(X 0 X) 1 X 0
tracefM g = tracefIT g tracefX(X 0 X) 1 X 0 g
= T tracef(X 0 X) 1 X 0 Xg
= T tracefIk g
= T k

so we have:
1
E[~ 2 ] = 2
trace(M )
T
1 2
= (T k)
T
T k 2
=
T
this is biased. We can adjust it:

T T T T k
E ~2 = E[~ 2 ] = 2
= 2
T k T k T k T

So the unbiased estimator is:

1X 2 1 X 2
T T
T T
^2 = ~2 = ^"t = ^"
T k T k T t=1 T k t=1 t

4.2 Distribution of OLS estimator when error variance is


estimated
Now the second step is to use this estimator and derive the distribution of ^ . We
start from the result:
^"0^" "0 M " "0 " 2
2
= 2
= M T k

Now, note that " N (0; IT ): So this is a sum of squared normals, of which T -k are
linearly independent. This is the sum of T -k independent standard Gaussians.
Remember the result we obtained:
^ N( ; 2
(X 0 X) 1 )

20
Consider the following ratio:
^
p j
2 (X 0 X) 1 ]
j
N (0; 1)
[ jj
q q tT k
"0 ^
^ " 2
2 =(T k) T k =(T k)

Both these results are based on M being a …xed matrix, which is the case under A2.
Moreover, the degrees of freedom come precisely from the rank of M .
^ ^ ^ ^
p j j
2 (X 0 X) 1 ]
p 2j0 j p j
2 (X 0 X) 1 ]
j
p j j
^j
[ jj [ (X X) 1]
jj [ jj [(X 0 X) 1]
jj j
q = q 0 = q = p =q
"0 ^ ^2 2
^ "
=(T k) "^
^ "
2
^ ^ 2 [(X 0 X) 1 ]jj
2 (T k) 2

we have a new pivotal quantity:


^j j
t= q tT k
^ 2 [(X 0 X) 1 ]jj

as opposed to:
^
j j
z=p N (0; 1)
[ 2 (X 0 X) 1 ]jj
2
In practice we never know . For tests: H0 : j = 0

^ H0
j 0
q tT k
^ 2 [(X 0 X) 1 ]jj

For C.I: q
^j t ^ 2 [(X 0 X) 1 ]jj
Softwares provide
^j H0 : j =0
q tT k
2
^ [(X 0 X) 1 ] jj

If T -k is large this approaches a Gaussian.


All this based on the distribution:
^ tT k( ; [^ 2 (X 0 X) 1 ]jj )
j

21
5 Partitioned Regressions and the Frish-Waugh
Theorem
Recall the model:

Y = X + " :
T 1 T xkk 1 T 1

We are going to partition the regressors in 2 groups. The regressors are:


2 3 2 3 2 3
x1;1 x2;1 xk;1
6 x1;2 7 6 x2;2 7 6 xk;2 7
6 7 6 7 6 7
x1 = 6 .. 7 ; x2 = 6 .. 7 ; :::; xk = 6 .. 7 ;
4 . 5 4 . 5 4 . 5
x1;T x2;T xk;T

and we grouped them in the matrix:


2 3
x1;1 x2;1 xk;1
6 x1;2 x2;2 xk;2 7
6 7
X = 6 .. .. .. 7
T k 4 . . . 5
x1;T x2;T xk;T

Now we partition this matrix as follows:


2 2 3 2 3 3
x1;1 x2;1 xq;1 xq+1;1 xk;1
6 6 x x2;2 xq;2 7 6 xq+1;2 xk;2 7 7
6 6 1;2 7 6 7 7
6 6 .. .. .. 7 6 .. .. 7 7
X = X1 X2 = 6 4 . . . 5 4 . . 5 7
6 7
4 x1;T x2;T xq;T xq+T;2 xk;T 5
| {z } | {z }
X1 X2

And accordingly:
2 3 2 3
1 q+1
6 7 6 7
= 1
; 1 =6
4
2 7;
5 2 =6
4
q+2 7
5
2
q k

Then we can rewrite our model:

1
Y = X1 X2 +"
2

22
and if you perform the product:

Y = X1 1 + X2 2 +"

in X1 there are q regressors, and in X2 there are k-q regressors. We start from the
normal equations.
X 0X ^ = X 0Y
In the alternative notation:

X10 ^1 X10
X1 X2 ^2 = Y
X20 X20

Do the multiplication:

X10 X1 X10 X2 ^1 X10 Y


^2 =
X20 X1 X20 X2 X20 Y

then,

X10 X1 ^ 1 + X10 X2 ^ 2 = X10 Y


X20 X1 ^ 1 + X20 X2 ^ 2 = X20 Y

Solve the …rst equation for ^ 1

(X10 X1 ) 1 X10 X1 ^ 1 = (X10 X1 ) 1 (X10 Y X10 X2 ^ 2 )


^ 1 = (X10 X1 ) 1 X10 (Y X2 ^ 2 )

Now we can plug this expression in the second equation and solve for ^ 2

X20 X1 ^ 1 + X20 X2 ^ 2 = X20 Y


X20 X1 f(X10 X1 ) 1 X10 (Y X2 ^ 2 )g + X20 X2 ^ 2 = X20 Y
X20 X1 (X10 X1 ) 1 X10 Y X20 X1 f(X10 X1 ) 1 X10 X2 ^ 2 + X20 X2 ^ 2 = X20 Y
X20 P1 Y X20 P1 X2 ^ 2 + X20 X2 ^ 2 = X20 Y
X20 P1 X2 ^ 2 + X20 X2 ^ 2 = X20 Y X20 P1 Y
X20 [I P1 ]X2 ^ 2 = X20 [I P1 ]Y
X20 M1 X2 ^ 2 = X20 M1 Y
^2 = (X20 M1 X2 ) 1 X20 M1 Y

23
The same steps can be done to obtain a solution for ^ 1 . We have:
^ 1 = (X10 M2 X1 ) 1 X10 M2 Y
q 1 q TT TT q
^2 = ( X20 M1 X2 ) 1 X20 M1 Y
(k q) 1 (k q) T T T T (k q)

Consider the case in which X1 and X2 are orthogonal: X10 X2 = 0. This implies
X10 M2 = X10 [I X2 (X20 X2 )X20 ] = X1 0 = X1 : Also we have that X20 M1 = X20 [I
X1 (X10 X1 ) 1 X10 ] = X20 :
^ 1 = (X10 X1 ) 1 X10 Y
^ 2 = (X20 X2 ) 1 X20 Y

Remark 1 When the regressors are orthogonal, then the least square regression of
Y on the two separate sets of variables gives the same OLS estimator as regressing
Y on both groups at the same time.

What else can we say?


^ 2 = (X20 M1 X2 ) 1 X20 M1 Y
| {z }
What is M1 X2 ?
X2 = X1 +
The OLS estimator is:

^ = (X10 X1 ) 1 X10 X2
^ 2 = X1 ^ = X1 (X10 X1 ) 1 X10 X2 = P1 X2
X
^ = X2 X ^ 2 = X2 P1 X2 = M1 X2

So we note that:
^ 2 = (X20 M1 M1 X2 ) 1 X20 M1 Y
| {z }
^
0 1 0
= (^ ^) ^ Y;

i.e. the OLS estimator of the slope coe¢ cient in the regression:

Y = ^ + v:

This is the Frish-Waugh Theorem

24
Proposition 2 In the linear regression of Y on two sets of variables X1 and X2 the
subvector ^ 2 is the set of coe¢ cients obtained from regression of Y on the residuals
^, where ^ are the residuals of a regression of X2 onto X1 .

This extends to the case of individual regressors:

Y = x1 1 + x2 2 + ::: + xk k +"

Remark 3 The OLS estimator ^ j in a multiple regression measures the e¤ect of


the regressor xj on Y , after washing away the correlation of that regressor with the
remaining regressors x1 ; x2 ; :::xj 1; xj+1; ::::; xk .

25
6 Inference on multiple coe¢ cients and restricted
least squares
6.1 Specifying multiple linear restrictions
Consider the model:
Y =X +"
full model:
Y = x1 1 + x2 2 +"
restricted model:
Y = x1 1 +"
The restriction is 2 = 0.
1 1
=
2 0
Generalize.
Y = x1 1 + x2 2 + x3 3 + x4 4 +"
Imagine we want to impose the following restrictions:

a) 1 = 0, 1 =0
b) 2 = 1, 2 1=0
c) 2+ 3 = 5, 2+ 3 5=0

We want to use the general representation:

R = r
j kk 1 j 1

In our example, we have j=3 and k=3


2 3
2 3 1
2 3
1 0 0 0 6 7 0
4 0 1 0 0 56 2 7=4 1 5
4 3
5
0 1 1 0 5
4

Now, we write them as follows:

m=R r=0 (5)

26
2 3
2 3 1
2 3 2 3
1 0 0 0 6 7 0 0
4 0 1 0 0 56 2 7 4 1 5=4 0 5
4 3
5
0 1 1 0 5 0
4

The vector m is the vector measuing the distance from the restrictions.
Let us give a few more examples of restrictions and put them in the form (5).

H0 : i =0

R = [ 0 0 0 1 ::: 0 ]; r = 0
2 3
1
6 7
6 2 7
6 .. 7
6 7
m = 0 0 0 1 ::: 0 6 . 7 0=0= i 0=0
6 i 7
6 7
4 5
k

H0 : i = i+1

i i+1 = 0

R = 0 0 ::: 1 1 ::: ; r = 0
2 3
1
6 7
6 2 7
6 .. 7
6 . 7
6 7
m = 0 0 ::: 1 1 ::: 6 7 0=0
6 i 7
6 7
6 i+1 7
4 5
k

H0 : 2 + 3 + 4 =0

27
R = 0 1 1 1 ::: 0 ; r = 0
2 3
1
6 7
6 2 7
6 .. 7
6 . 7
6 7
m = 0 1 1 1 ::: 0 6 7 0=0
6 i 7
6 7
6 i+1 7
4 5
k

Let us consider multiple restrictions


H0 : 1 = 0; 2 = 0; 3 =0
2 3 2 3
1 0 0 0 ::: 0 0
R = 4 5 4
0 1 0 0 ::: 0 ; r = 0 5
0 0 1 0 ::: 0 0
2 3
1
6 7
2 366 ..
2 7
7 2 3 2 3
1 0 0 0 ::: 0 6 . 7 0 0
6 7
m = 4 0 1 0 0 ::: 0 5 6 7 4 0 5=4 0 5
6 i 7
0 0 1 0 ::: 0 6 7 0 0
6 i+1 7
4 5
k

Another one
H0 : 2 + 3 = 1; 4 +3 5 = 0; 5 + 6 =0
2 3 2 3
0 1 1 0 ::: ::: :: 0 1
R = 4 0 0 0 1 3 :: :: 0 5 ; r = 4 0 5
0 0 0 0 1 1 :: :: 0
2 3
1
6 7
2 36
6 ..
2 7
7 2 3 2 3
0 1 1 0 ::: ::: :: 0 6 . 7 1 0
6 7
m = 4 0 0 0 1 3 :: :: 0 5 6 7 4 0 5=4 0 5
6 i 7
0 0 0 0 1 1 :: :: 6 7 0 0
6 i+1 7
4 5
k

28
6.2 Wald statistic
The test starts from the discrepancy vector:
H
m=R r =0 0

Imagine that we estimate the model

Y =X +"

getting ^ . Then we can compute the estimate of the discrepancy vector

^ = R^
m r

The expected value of the estimated discrepancy is:

^ = E[R ^ r]
E[m]
= RE[ ^ ] r
H
= R r = m =0 0

Let us compute the variance:

V ar[m]
^ = V ar[R ^ r]
= V ar[R ^ ]
= RV ar[ ^ ]R0
= R 2 (X 0 X) 1 R0

How can we devise a test for H0 ? Imagine you have 3 restrictions

2 + 3 1 = 0;
4 + 3 5 = 0;

5+ 6 = 0

and after estimation, you …nd:


^ +^ 1 = 0:9
2 3
^ + 3 ^ = 0:0000001
4 5
^ 5 + ^ 6 = 0:008

29
how can we decide how "well" these restrictions are satis…ed? The answer is the
Wald criterion. Divide the discrepancy by its standard deviation, to obtain a rescaled
discrepancy:
p 1 p 1
V ar(m)
^ m
^ = V ar(m)
^ (R ^ r):
The rescaled discrepancy, is Gaussian and has variance Ij . The Wald criterion is:
^ 0 V ar(m)
W = m ^ 1m ^
^ [R (X X) 1 ] 1 R0 m
= m 0 2 0
^
= (R ^ r) [R (X X) 1 R0 ] 1 (R ^ r)
0 2 0

(R ^ r)0 [R(X 0 X) 1 R0 ] 1 (R ^ r)
= 2

(R ^ r)0 1 (R
^ r)
= [R(X 0 X) 1 R0 ] 2
j
| {z } j j | {z }
1 j j 1

However, is unknown and needs to be estimated. The estimator is

1 X
T
1
^= ^"2t = ^"0^":
T k t=1
T k
Moreover, recall that
^2 1 "0 "
2
(T k) = 2
^"0^" = M 2
T k

Therefore:
(R ^ r)0 1 (R ^ r)
[R(X 0 X) 1 R0 ] =j 2
j
^2
Fj;T k
2
2 (T k)=(T k) T k
Simplifying:
(R ^ r)0 [R^ 2 (X 0 X) 1 R0 ] 1 (R ^ r)
Fj;T k
j
This allows to test for the restrictions.

6.3 Restricted least squares


Imagine we want to impose these restrictions on the model.
Y = X +"
R r = 0

30
this model can also be estimated via Least squares. The solution is:
^ = ^ OLS (X 0 X) 1 R0 [R(X 0 X) 1 R0 ] 1 (R ^ OLS r):
RLS

derive the RLS

= arg min S( )
R
S( ) = (Y X )0 (Y X )
s:t: R = r

L( ; ) = S( ) + 2 0 (R r)
| {z }
m
lets solve this
@L( ; )
= 2X 0 (Y X ) + 2R0
@
= 2X 0 (Y X ^ R ) + 2R0 = 0
= 2X 0 (Y X ^ R ) + 2R0 = = 0

= X 0 (Y X ^ R ) + R0 = 0
= X 0 Y + X 0 X ^ R + R0 = 0
= X 0 X ^ R = X 0 Y R0
= ^ R = (X 0 X) 1 X 0 Y (X 0 X) 1 R0
= ^ R = ^ (X 0 X) 1 R0

@L( ; )
= 2(R r)
@
= 2(R ^ R r) = 0
= R^R = r
now lets substitute
R( ^ (X 0 X) 1 R0 ) = r
R ^ R(X 0 X) 1 R0 = r
R ^ r = R(X 0 X) 1 R0
= [R(X 0 X) 1 R0 ] 1
(R ^ r)
| {z }

31
now we put this equation into the …rst one
^ =^ (X 0 X) 1 R0 [R(X 0 X) 1 R0 ] 1 (R ^ r)
R
C | {z }
| {z } m^

and this is the RLS estimator. Check the SOC.

H0 : R r=0
3. Unbiasedness:

E[ ^ R ] = E[ ^ C(R ^ r)]
= E[ ^ ] E[C(R ^ r)]
= E[(X 0 X) 1 X 0 " + ] E[C(R ^ r)]
= + 0 + E[C(R ^ r)]
H
= C E[R ^ r] =0 C0 =
| {z }

instead, if H0 is not satis…ed

E[ ^ R ] = CE[R r] 6=

Variance of ^ R :
^ = ^ (X 0 X) 1 R0 [R(X 0 X) 1 R0 ] 1 (R ^ r)
R
C | {z }
| {z } m^

= ^ C(R ^ r)
= (I CR) ^ + Cr

since ^ = + (X 0 X) 1 X 0 " we have

V ar( ^ R ) = V ar[(I CR)( + (X 0 X) 1 X 0 ") + Cr]


= V ar[(I CR)((X 0 X) 1 X 0 ")]
= MR V ar((X 0 X) 1 X 0 ")MR0
= MR 2 (X 0 X) 1 MR0 :

where we de…ned MR = I CR. Finally, one can show that


2
MR (X 0 X) 1 MR0 = MR 2
(X 0 X) 1
| {z }

32
which gives V ar( ^ R ) = MR V ar( ^ ) and

V [ ^ RLS ] V ar( ^ OLS ) = (Mr I)V ar( ^ OLS ) = CV ar( ^ OLS )


= (X 0 X) 1 R0 [R(X 0 X) 1 R0 ] 1 R 2 (X 0 X) 1 ;

a negative de…nite matrix.


To summarize, under H0 :

E[ ^ RLS ] = E[ ^ OLS (X 0 X) 1 R0 [R(X 0 X) 1 R0 ] 1 (R ^ OLS r)]


= (X 0 X) 1 R0 [R(X 0 X) 1 R0 ] 1 E[R ^ OLS r]
H
since under H0 we have that E[R ^ OLS r] = 0, E[ ^ RLS ] =0 and the ^ RLS is
unbiased and more e¢ cient than ^ OLS . However, if the null H0 is not satis…ed, then
the ^ RLS has expectation:

E[ ^ RLS ] = (X 0 X) 1 R0 [R(X 0 X) 1 R0 ] 1 E[R ^ OLS r] 6= 0


since E[R ^ OLS r] 6= 0 under H0 :

so it is biased.

Remark 4 Under H0 the ^ RLS estimator is unbiased and more e¢ cient than ^ OLS .
However, if H0 is not satis…ed, ^ RLS is biased (while ^ OLS is unbiased).

33
7 The generalized linear regression model
recall the assumption CLRM:
A1 - X is a matrix of full rank — > OLS estimator exists
A2 - X is deterministic or …xed in repeated samples
A3 - E["]=0
A4 - E[""0 ]= 2 IT
A5 - " is Gaussian
Under these assumptions:
^ N( ; 2
(X 0 X) 1 )
OLS

and:
^ t( ; ^ 2 (X 0 X) 1 )
OLS
can be used for inference.
Now we start asking what happens as we remove some assumptions. Today’s
lecture focuses A4.
E[""0 ] = 2 IT
Sphericity.
V ar(") = E[""0 ] E["]E["0 ] = 2
IT
2 2 3
0 0 0 0
6 0 2 7
6 7
2
IT = 6
6 0
7
7
4 0 5
2

V ar("t ) = 2
Cov("t ; "s ) = 0
We are going to generalize this model.
A5bis: E[""0 ] = 2
=
2 3
V ar("1 ) Cov("1 ; "2 ) Cov("1 ; "3 ) ::: Cov("1 ; "T )
6 Cov("2 ; "1 ) V ar("2 ) 7
6 7
= 6
6 Cov("3 ; "1 ) 7
7
4 5
Cov("T ; "1 ) V ar("T )

34
More generally:
E[""0 ] = 2

The model with assumption E[""0 ] = 2 is called the "generalized" linear regression
model. Setting = I gets us back to the CLRM.
1. Ask how the OLS estimator performs under this alternative set of assumptions.
2. Propose a more e¢ cient estimator under these assumptions.

7.1 Properties of OLS estimator and HAC standard errors


First of all, lets consider bias.

^ OLS = (X 0 X) 1 X 0 Y = (X 0 X) 1 X 0 " +
E[ ^ OLS ] = E[(X 0 X) 1 X 0 " + ]
= (X 0 X) 1 XE["] +
|{z}
=0
= 0+

Then, consider the variance of the OLS estimator

V ar[ ^ OLS ] = V ar[(X 0 X) 1 X 0 " + ]


= V ar[(X 0 X) 1 X 0 "]
= (X 0 X) 1 X 0 V ar["]X(X 0 X) 1
| {z }
2

= 2 0
(X X) X1 0
X(X 0 X) 1
6= V ar[ ^ OLS jCLRM ] = 2
(X 0 X) 1

Therefore we have

V ar[ ^ OLS jGLRM ] 6= V ar[ ^ OLS jCLRM ]

Therefore, in presence of nonspherical disturbances the OLS standard errors are


*incorrect*. Therefore, in this case we use

V ar[ ^ OLS ] = |{z}


2
(X 0 X) 1 X 0 |{z} X(X 0 X) 1

= (X 0 X) 1 X 0 2 X(X 0 X) 1

35
Consider
2 3
V ar("1 ) Cov("1 ; "2 ) Cov("1 ; "3 ) ::: Cov("1 ; "T )
6 Cov("2 ; "1 ) V ar("2 ) 7
6 7
= 2
=6
6 Cov("3 ; "1 ) 7
7
4 5
Cov("T ; "1 ) V ar("T )

(T 2 T) T2 T + 2T T (T + 1)
+T = =
2 2 2
We try to estimate
X0 2
X
which has K(K+1)
2
.
White estimator (1980):
2 3
^"21 0 X
T
X 0d
2 X=X 4 0 5X = ^"2t xt x0t
0 ^"2T t=1

Newey-West (1982):
p
X X
T
jpj
X 0d
2 X = 1 ^"t^"t p xt x0t p
p= p t=p+1
1+p
p
X
T X X
T
p
= ^"2t xt x0t + 1 ^"t^"t p (xt x0t p + xt p x0t )
t=1 p=1 t=p+1
1+p

with p T 1=4 . These are called Heteroschedasticity and autocorrelation consistent


(HAC) robust estimators of the variance.

d
V ar[ ^ OLS ] = (X 0 X) 1 X 0 d
2 X(X 0 X) 1

then we can use


d
^ OLS ~t( ; V ar[ ^ OLS ])

for inference.

36
7.2 E¢ cient estimation via GLS
There is an alternative to the OLS estimator, which is actually more e¢ cient.

1 1=20
= P 0P = 1=2

The matrix P can be used to "rescale" the model.


Y =X +"
imagine pre-multiplying this by P:
PY = PX + P"
Y = X +"
Let’s compute the variance of the errors of the transformed model
V ar(" ) = V ar(P ")
= P V ar(")P 0
= P 2 P0
= 2P P 0
= 2 P (P 1 P 0 1 )P 0
= 2I
1 1
where we used =( ) = (P 0 P ) 1
=P 1
P 0 1 . It follows that the transformed
model
Y =X +"
satis…es the CLRM assumptions. And the OLS estimator of the transformed model:
^ = (X 0 X ) 1 X 0 Y
OLS

since this is the OLS estimator of a model that satis…es the CLRM assumptions, this
is BLUE for . We label this estimator the GLS estimator
^ = ^ OLS
GLS

In particular
^ GLS = (X 0 X ) 1 X 0 Y
= (X 0 P 0 P X) 1 (P X)0 P Y
= (X 0 P 0 P X) 1 X 0 P 0 P Y
= (X 0 1 X) 1 X 0 1 Y:

37
E[ ^ GLS ] =
V ar[ ^ GLS ] = 2
(X 0 1
X) 1

and it is BLUE for , so it is better than any other linear unbiased estimator,
including the OLS estimator:

V ar[ ^ GLS jGLRM ] = 2


(X 0 1
X) 1
< 2
(X 0 X) 1 X 0 X(X 0 X) 1
= V ar[ ^ OLS jGLRM ]

7.3 Feasible GLS


2 1
As it happen for the OLS estimator, we need to estimate and . We assume a
functional form for
= ( )
and estimate the subset of coe¢ cients
^= d
( )

generally, one assumes a functional form for the error terms and uses Maximum
Likelihood.
^ 0 d
1 ( )X) 1 X 0 d
1 ( )Y
F GLS = (X

where F stands for "feasible". The FGLS is not BLUE. It is actually both biased
and nonlinear. However
^ ^
F GLS ! GLS

So in a large sample, we converge to the unfeasible, optimal GLS estimator.

7.3.1 Heteroskedasticity and weighted least squares


For example let us assume that :
2
yt = + x t + "t ; "t N (0; t)

with
2
t = x2t
2 3 2 3
E["21 ] = 2
1 = x21 0 x21 0
6 0 E["22 ] = 2
= x22 7 6 0 x2 7
6 2 7 6 2 7
2
=6
6
7=
7 6 .. 7
4 5 4 . 5
0 0 E["2T ] = 2
= x2T 0 x2T
T

38
transform the model:
1 1 1 1
p yt = p +p x t + p "t
x2t x2t x2t x2t
yt xt "t

yt = xt + + "t ;
1 1
V ar("t ) = V ar( p "t ) = 2 x2t =
x2t xt
2 3
1=x1 0
6 0 1=x 7
This is FGLS with = f g and P = 6 4
2 7. This exampe with
5
0 0 1=xT
heteroskedasticity is a.k.a Weighted Least Squares.

7.3.2 Autocorrelation and Cochrane-Orcutt procedure


yt = + x t + "t
"t = %"t 1 + ut
2
u 2
V ar["t ] = =
1 %2
2 3
1 % %2 %T 1

6 % 1 7
2 6 7
2
= u 6 %2 % 7= 2
( )
1 %2 6
4
7
5
%T 1
0 1
2
This is FGLS with = f%; u g. Cochrane-Orcutt procedure:

yt = + xt + "t ! ^ ! ^"t
^"t = ^"t 1 + ut ! ^
and iterate:

yt = + x t + "t
^ yt 1 = ^ + ^ x t 1 + ^ " t 1

yt ^ yt 1 = (1 ^) + (xt ^ x t 1 ) + "t ^ " t 1


| {z }
ut

39
7.3.3 Conditional heteroskedasticity and models of time varying volatil-
ities
yt = + x t + "t
2
"t N (0; t)

model "2t as follows:

"2t = h2t vt2 ;


vt iidN (0; 1)
ht = 0 + 1 "2t 1 (ARCH 1)
2

this implies:
E["2t jIt 1 ] = E[h2t vt2 jIt 1 ] = h2t E[vt2 jIt 1 ] = h2t ;
which shows that h2t is the conditional heteroskedasticity. Note that the forecast
error is:

ut = "2t E["2t jIt 1 ]


= "2t h2t
=)
"2t = h2t + ut
2
= 0 + 1 "t 1 + ut

that is, the squared errors follow an AR(1) process. This model could be estimated
with ML or with a recursive approach similar to Cochrane-Orcutt.
Stochastic volatility:
h2t = 0 + 1 h2t 1 + t
note that di¤erently from GARCH h2t has its own disturbance. Estimated with ML,
or Bayesian methods.

40
Part II
Random regressors and
endogeneity
8 Independent regressors
So far we assumed:

Y =X + |{z}
"
|{z}
unkown rand om

However, in economics X is random. There is the need to specify the relation between
the two random objects in the model. The easiest case is one in which X and " are
independent.
A1 - X is a matrix of full rank — > OLS estimator exists
A2bis - X is random
A3bis - E["jX]=0
A4bis - E[""0 jX]= 2 IT
A5bis - "jX is Gaussian
This set of assumptions ensures that X is random (A2) but independent from
" (because of A3-A5). Under these assumptions, it turns out that the OLS
estimator is still BLUE.

8.1 The law of iterated expectations.


E[points vs team A]=3*2/10+1*3/10+0*5/10=6/10+3/10=9/10
E[points vs team B]=3*4/10+1*4/10+0*2/10=12/10+4/10+0=16/10
E[P ointsjX = A] = 0:9
E[pointsjX = B] = 1:6
E[points]=E[points vs team A]*prob(playing with A)+E[points vs team B]*prob(playing
with B)
E[points] = E[P ointsjA] p(A) + E[P ointsjA] p(B)+

41
We computed the expected values of the expected values.

E[points] = EX [E[P ointsjX]]


E[Z] = EX [E[ZjX]]

We are going to change some assumptions

8.2 Bias and variance


First, let us consider unbiasedness.
^ = (X 0 X) 1 X 0 Y = (X 0 X) 1 X 0 " +

First, we show that the estimator is *conditionally* unbiased.

E[ ^ jX] = E[(X 0 X) 1 X 0 " + jX]


= E[(X 0 X) 1 X 0 "jX] + E[ jX]
= (X 0 X) 1 X 0 E["jX] +
| {z }
=0
=

E["jX] = 0 =) E["] = 0
E["] = EX [E["jX]] = EX [0] = 0
| {z }

How about unconditionally?

E[ ] = EX [E[ ^ jX]] = EX [ ] =

therefore, it is unbiased.
Then we have also the variance.

V ar( ^ jX) = V ar((X 0 X) 1 X 0 " + jX)


= V ar((X 0 X) 1 X 0 "jX)
= (X 0 X) 1 X 0 V ar("jX) X(X 0 X) 1
| {z }
= (X X) X IT X(X 0 X) 1
0 1 0 2

= 2 (X 0 X) 1 X 0 X(X 0 X) 1 = 2 (X 0 X) 1

which is the usual. Since

42
The LIE is: E[Z] = EX [E[ZjX]]

The de…nition of variance is V ar(Z) = E[Z 2 ] E[Z]2


Imply V ar(Z) = EX V ar(ZjX) + V arx E[ZjX], we have that:

V ar( ^ ) = EX [V ar( ^ jX)] + V arX (E[ ^ jX])


| {z }
= EX [V ar( ^ jX)] = EX [ 2 (X 0 X) 1 ]

this is the variance.

8.3 Distribution of the OLS estimator and Gauss-Markov


theorem
Therefore, the OLS estimator is NOT a Gaussian.
^ jX N( ; 2
(X 0 X) 1 )

the average of the Gaussians above is NOT a Gaussian.


^ ?

Consider the residuals:


^"0^" "0 M " 2
2
jX = 2
jX T k

Moreover
^ 0
p 2 j 0 j 1 jX N (0; 1)
[(X X) ]jj
q tT k
"0 ^
^ " 2
2 jX =(T k) T k

and
(R ^ r)0 [R^ 2 (X 0 X) 1 R0 ] 1 (R ^ r)
jX Fj;T k
j
have densities that do not depend on X, hence they hold for every X which means
once we integrate the X out they remain the same.
This means that we can make inference as usual, using the pivotal quantities
above.

43
Finally, the same reasoning applies to the Gauss-Markov theorem.

V ar( ^ OLS jX) < V ar( ~ jX)

V ar( ~ jX) = 2
DD0 + 2
(X 0 X) 1 ;
| {z }
V ar( ^ OLS jX)

Now, apply the LIE:

V ar( ~ ) = EX [V ar( ~ jX)] = 2


EX [DD0 ]+EX [V ar( ^ OLS jX)] = 2
EX [DD0 ]+V ar( ^ OLS )

hence V ar( ~ ) V ar( ^ OLS ) = 2


EX [DD0 ] is positive de…nite.

44
9 Asymptotic theory and large sample properties
of the OLS estimator
So far X was random, but independent from "
A1 - X is a matrix of full rank — > OLS estimator exists
A2bis - X is random
A3bis - E["jX]=0
A4bis - E[""0 jX]= 2 IT
A5bis - "jX is Gaussian

2
"jX N (0; IT )
What if X is not independent from "? Small sample results break down. For
example:
E[ ^ ] = + E[(X 0 X) 1 X 0 "]
= + EX [E[(X 0 X) 1 X 0 "jX]]
= + EX [(X 0 X) 1 X 0 E["jX]]
We must resort to asymptotics. As a silver lining, the Gaussianity assumption
will no longer be necessary. Actually, one could invoke asymptotics even with
independence between X and ", just on the grounds of removing A5, which
can be viewed as strong. In that case, removing A5 while keeping A3bis and
A4bis means that unbiasedness remains valid, but normality happens only
asymptotically.
A1 - X is full rank
A2ter - X is iid + not contemporaneously correlated with "t + regularity con-
ditions (which ensure Laws of Large Numbers [LLN], Central Limit theorems
[CLT] apply).
A3 - E["]=0
A4 - E[""0 ]= 2 IT .
Note that A3 and A4 imply "t is iid.
A5 - is no longer necessary

45
9.1 Asymptotic theory
What happens to random variables (estimators) as the sample size increases. T !
1; N ! 1

^ !

De…nition 5 Convergence in probability. The random variable xn converges in


probability to a constant c if limn!1 Pr(jxn cj > ") = 0, for any positive ". We
p
write xn ! x or p lim xn = c. We say that c is probability limit (plim) of xn .

Theorem 6 Law of Large Numbers. If xi ; i = 1; :::; n is a random i.i.d. sample


from a distribution with …nite mean E[xi ] = , then p lim(xn ) = , where xn =
Xn
1
n
xi (sample mean).
i=1

De…nition 7 Consistency. If an estimator ^ is such that plim ^ = , then we call


that estimator "consistent".

An implication of the LLN and the de…nition of consistency is that the sample
mean is a consistent estimator of : (It is an estimator, which converges in probability
to ).
Moving the interest to distributions:

De…nition 8 Convergence in distribution. The random variable xn converges


in distribution to a random variable x with cdf F (x) if limn!1 jF (xn ) F (x)j = 0,
d
for all continuity points in F (x). This is written xn ! x and x is called the limiting
distribution.

Theorem 9 Central Limit Theorem. If xi ; i = 1; :::; n is a random i.i.d. sample


from a distribution with …nite mean E[xi ] = and …nite variance V ar(xi ) = 2 then
p d
n(x ) ! N (0; 2 ).

The LLN will be used to establish consistency of the OLS estimator


The CLT will be used to derive the asymptotic distribution of the OLS estimator.
Continuous functions are limit-preserving:
p
Theorem 10 Continuous mapping theorem. If xn ! c and g( ) is a contin-
p d
uous function, then g(xn ) ! g(c). If xn ! x and g( ) is a continuous function,
d
then g(xn ) ! g(x).

46
Theorem 11 The delta method. Let a random variable be such that
p d
n(^ ) !
and g( ) continuously di¤erentiable in a neighborood of . Then
p d
n(g(^ ) g( )) ! G
@g(u)
with G = @u
. For example if N (0; V ) then
d
g(^ ) ! N (g( ); GV G0 ):

9.2 Large sample properties of the OLS estimator


A1 - X is full rank
A2ter - X is i.i.d. with Cov(xt ; "t ) = 0, that is no *contemporaneous* correla-
tion between regressors and error term. We also require …nite second moment
E[xt x0t ] = QXX < 1 and positive de…nite. Furthermore, for asymptotic nor-
mality we will require …nite fourth moments: Qx" = V ar[xt "t ] < 1.
2
A3+A4 - "t is i.i.d. with E["t ] = 0; V ar("t ) = .
Under this set of assumptions, we can show consistency and asymptotic nor-
mality.

9.2.1 Consistency

^ = (X 0 X) 1 X 0 Y
= + (X 0 X) 1 X 0 "
1
X 0X X 0"
= +
T T
What are these quantities?

1X
T
X 0X p
= xt x0t ! E[xt x0t ] = Qxx ;
T T t=1
0
because xt x0t is iid and therefore a LLN applies. Compactly, plim XTX = Qxx . Also

1X
T
X 0" p
= xt "t ! E[xt "t ] = 0
T T t=1

47
because xt "t is iid and thereforea LLN applies, and because E[xt "t ] = 0 due to the
0
fact that E["t ] = 0 and Cov(xt "t ) = 0. Compactly, plim XT " = 0. Now using the
continuous mapping theorem:
1
X 0X X 0"
p lim ^ = + plim plim
T T
=

Hence the OLS is consistent.

9.2.2 Asymptotic normality


For the asymptotic
p distribution, we will need a CLT. Let us start focusing on the
distribution of T ( ^ ):
1
^ = X 0X X 0"
+
T T
p 1 p X 0"
X 0X
T (^ ) = T
T T

Let us focus on the term


p X 0" p 1 X T
T = T xt "t ;
T T t=1
p
which is T times the sample mean of xt "t . Since xt "t is i.i.d. with mean E[xt "t ] = 0
and variance

Qx" = V ar[xt "t ]


= E[xt "t (xt "t )0 ] E[xt "t ]E[(xt "t )0 ]
= E[xt "t "0t x0t ]
= E[xt x0t "2t ];

the CLT implies


!
1X
T
p d
T xt "t 0 ! N (0; Qx" ):
T t=1 E[xt "t ] V ar[xt "t ]

48
Then we have, by applying the continuous mapping theorem and the delta method:
p 1
X 0X X 0"
T (^ ) = p
T T
| {z } |{z}
p d
!Qxx !N (0;Qx" )
p d
T (^ ) ! Qxx1 N (0; Qx" )
p d
T (^ ) ! N (0; Qxx1 Qx" Qxx1 ); (6)
p
which is the asymptotic distribution of T ( ^ ). Rearranging gives the asymptotic
^
distribution of :
^ !d 1
N ( ; Qxx1 Qx" Qxx1 ) (7)
T
The variance T1 Qxx1 Qx" Qxx1 goes to 0 as T goes to in…nity, which is not surprising
p
since we know that ^ ! . The representation (7) is useful when constructing test
statistics and standard errors. However, for theoretical purposes the representation
(6) is more useful as it retains non-degenerate limits as the sample sizes diverge.

9.2.3 Summary of asymptotic results


So far we have seen that under these conditions:
0
X
plim XTX = plim T1 xt x0t = Qxx = E[xt x0t ] (LLN holds for xt x0t )
X
X0"
plim T
= plim T1 xt "t = E[xt "t ] (LLN holds for xt "t )

E[xt "t ] = 0 (no contemporaneous correlation)


p
We can show that OLS is consistent: ^ !
Also assuming:
p X0" d
T T
E[xt "t ] ! N (0; V ar[xt "t ]) (CLT holds for xt "t )

V ar[xt "t ] = Qx" p.d.


p d
We can show that OLS is asymptotically Gaussian: T (^ ) ! N (0; Qxx1 Qx" Qxx1 )
d
and ^ ! N ( ; T1 Qxx1 Qx" Qxx1 ):

49
9.3 Asymptotic variance estimation
Can we estimate the asymptotic variance?

AsyV ar( ^ ) = Qxx1 Qx" Qxx1

We are going to consider two cases, which depend on the assumption about the
0 1
conditional variance of the errors. Consistent estimator of Qxx1 is XTX . We need
to consistently estimate the matrix Qx" . In general:

Qx" = V ar[xt "t ] = E[xt x0t "2t ] E[xt "t ]E[xt "t ]0 = E[xt x0t "2t ]:

Under the assumption that


E["2t jxt ] = 2
;
that is, conditional homoskedasticity.2

Qx" = E[xt x0t "2t ] = E[E[xt x0t "2t jxt ]] = E[xt x0t E["2t jxt ]] = E[xt x0t 2
]= 2
E[xt x0t ] = 2
Qxx :

So we have that
AsyV ar( ^ ) = Qxx1 2
Qxx Qxx1 = 2
Qxx1
and p d
T (^ ) ! N (0; 2
Qxx1 ) (8)
and Qxx1 can be estimated consistently via:
1
X 0X p
^2 ! 2
Qxx1
T
That is
1
d X 0X
AsyV ar( ^ ) = ^ 2
T
Going back to (8) and plugging in this consistent estimator of the asymptotic variance
we have: !
p 0 1
d X X
T (^ ) ! N 0; ^ 2 ;
T
rearranging:
d
^ ! N ; ^ 2 (X 0 X) 1
;
2
Actually, the slighlty weaker assumption Cov(xt x0t ; "2t ) = 0 would su¢ ce: E[xt x0t "2t ] =
E[xt x0t ]E["2t ] = 2 Qxx .

50
which is the usual! Note that
1 d
V ar( ^ jX) = ^ 2 (X 0 X) 1
= AsyV ar( ^ )
T
and goes to 0 as T ! 1:
1 d
AsyV ar( ^ ):
T ! 2 Qxx1
!0
Let us now consider the more general case with conditional heteroskedasticity.
In this case "t is still independent across time, but is not identically distributed, so
E["2t jxt ] = 2 does not apply. We have
Qx" = E[xt x0t "2t ]
a consistent estimator is
1X
T
1
^
Qx" = xt x0t^"2t = X 0 d
2 X;
T t=1 T

that is, the White (1980) estimator. Indeed note that:

1X 1X 1X
T T T
0 2
xt xt^"t = xt x0t "2t xt x0t (^"2t "2t )
T t=1 T t=1 T t=1
PT p PT p
where 1
T t=1 xt x0t "2t ! Qx" and 1
T t=1 xt x0t (^"2t "2t ) ! 0. Therefore we have:
1 1
d X 0X 1 0d X 0X
AsyV ar( ^ ) = X 2 X; ! AsyV ar( ^ ) = Qxx1 Qx" Qxx1 :
T T T
Plugging this estimator into (6) gives:
!
p 0 1 0 1
d XX 1 0d XX
T (^ ) !N 0; X 2 X
T T T

and rearranging gives:

X0 d
d
^ ! 1 1
N ; (X 0 X) 2 X (X 0 X) ; (9)
which is the usual OLS expression with White heteroskedasticity consistent estima-
tor. Note that again.
1 d
X0 d
1 1
V ar( ^ jX) = (X 0 X) 2 X (X 0 X) = AsyV ar( ^ )
T
51
d
which of course goes to 0 as T ! 1 since AsyV ar( ^ ) ! 2
Qxx1 Qx" Qxx1
A similar reasoning could be applied in a case with "t autocorrelated. However,
this would require invoking di¤erent CLT and LLN to account for dependence. The
resulting asymptotic distribution would be as in (9) with X 0 d
2 X based on a HAC

estimator, e.g. the Newey-West estimator.

Remark 12 The asymptotic distributions derived in this Section are the same we
derived in the small sample using normality of the errors and …xed regressors, and
this also in the case of autocorrelated and heteroschedastic disturbances. In other
words, if we have a large sample we can resort to these asymptotic arguments to avoid
assuming that the errors are Gaussian, and also to allow for random regressors. The
assumptions we need to be able to resort to the asyptotic argument is that the random
regressors are iid and not contemporaneously correlated with the errors of the model,
plus some regularity conditions.

9.4 A digression on time series models


What we discussed used CLT and LLN to derive consistency and asymptotic dis-
tributions. In order for the LLN and CLT to hold we have made requirements on
the heterogeneity and dependence of the regressors X, in particular assuming they
were independent (so no dependence) and identically distributed (so no heterogene-
ity). In a time series model, such assumptions are not tenable, however they can be
"substituted" by stationarity (which limits the heterogeneity) and ergodicity (which
limits the dependence). Indeed it is possible to derive LLN and CLT that apply to
dependent and/or heterogeneous data, as long as the regressors are stationary and
ergodic. If that is the case, then the same steps can be used to derive the consistency
and asymptotic distrbution of a time series model.
This also means that in a dynamic model (a time series model in which the
regressor is the lag of the dependent variable) such as:
yt = yt 1 + "t
the coe¢ cient can be consistently estimated with OLS, and the OLS estimator has
an asymptotically Gaussian distribution. However, it is key that the errors are white
noise for this result to hold. This can be achieved by specifying a rich enough lag
speci…cation.
Finally, as we shall see later on, even in presence of nonstationarity the OLS
estimator is still consistent (actually, it is super-consistent) even though it converges
to a non-standard distribution.

52
10 Endogeneity and instrumental variables
We have seen that under the assumption Cov(xt ; "t ) = 0 "standard" inference is
still valid. There are many applications in which such assumption is untenable.
Sometimes it happens that:

Cov(xt ; "t ) 6= 0 ! Endogeneity;

and xt is called an endogenous variable.

10.1 Examples of endogeneity


10.1.1 Omitted variables
We have seen that, if we omit a variable we have a bias (omitted variable bias). We
also have inconsistency:

DGP : y = X1 1 + X2 2 + ; (10)

with X1 and X2 mutually correlated and uncorrelated with .

M ODEL : y = X1 1 +v

where we have
v = X2 2 + ;
which gives:
^1 = (X 0 X1 ) 1 X 0 y
1 1
= (X10 X1 ) 1 X10 fX1 1 + vg
1
1 0 1 0
= 1 + X X1 Xv
T 1 T 1
!E[x1t vt ]

and
E[x1t vt ] = E[x1t (x2t 2 + "t )] = E[x1t x2t ] 2 6= 0

10.1.2 Measurement error


Similar to omitted variable. Imagine you want to estimated earnings as a funcion of
education:
yi = 1 + 2 xi + "i

53
where xi = Education, but we only observe x~i = Schooling, which is an imperfect
proxy for education:
x~i = xi + ui :
So the model we can estimate is:

yi = 1+ 2 xi + "i
= 1+ 2 (~
xi u i ) + "i
= 1+ 2x~ i + wi

with wi = "i 2 ui . In this model E[~ xi wi ] = E[(xi + ui )("i 2 ui )] = E[xi ("i


2
2 ui )] + E[ui ("i u
2 i )] = 2 E[ui ]. It follows that
1X 2
x~i wi ! 2 E[ui ] 6= 0:
T
Note the negative value: attenuation bias (we are assuming 2 positive as it should
be expected).

10.1.3 Selection bias


This happens when we are trying to estimate a causal relationship and the regressors
X are not selected in a randomized way. Here the problem of endogeneity is a feature
of the model/experimental design, which does not match appropriately the causal
relation we want to represent. The textbook example is the study of the e¤ects of
a drug on some patients. Let yi denote a patient health status and let xi denote
wether he took the drug (xi = 1) or not (xi = 0). The linear regression model is:

yi = 0 + 1 xi + "i

which implies
E[yi jxi ] = 0 + 1 xi

And allows to answer questions such as: what is the expected e¤ect of this drug?
One answer is:

E[yi jxi = 1] E[yi jxi = 0] = 0 + 1 1 ( 0 + 1 0) = 1: (11)

Note however that this answer is not necessarily satisfactory. We cannot limit our-
selves to computing the di¤erence between the expected health of those who took
the drug E[yi jxi = 1] and those who did not E[yi jxi = 0], because there might be

54
other carachteristics of the individuals playing some other explanatory role. What
we really are after is
E[yi1 jxi = 1] E[yi0 jxi = 1] (12)
where
yi0 = health status of individual i when he does not take the drug
;
yi1 = health of individual i when he takes the drug

irrespectively of wether they actually took the drug or not. This is Rubin’s concept of
potential output. Expression (12) then denotes the di¤erence between the expected
health of those who took the drug E[yi jxi = 1] = E[yi1 jxi = 1] and the *same*
people (therefore people for whom xi = 1), under the hypothetical scenario that
they did not take it, E[yi0 jxi = 1]. Since we do not observe yi0 jxi = 1 (this is called
a conterfactual) this creates a complication. We have:

E[yi jxi = 1] E[yi jxi = 0]


= E[yi1 jxi = 1] E[yi0 jxi = 0] (observables)
= E[yi1 jxi = 1] + ( E[yi0 jxi = 1] + E[yi0 jxi = 1]) E[yi0 jxi = 0]
count erf actual
= E[yi1 jxi = 1] E[yi0 jxi = 1] + E[yi0 jxi = 1] E[yi0 jxi = 0]:
| {z }
Selection Bias

The steps above show that the desired quantity E[yi1 jxi = 1] E[yi0 jxi = 1] coincides
to the observable quantity E[yi jxi = 1] E[yi jxi = 0] ony if the selection bias term
is 0. This term represents the e¤ect of how the patients taking the drug are selected.
Note that, if they are selected randomly, that is, independently from the potential
outcome, we have that E[yi0 jxi = 1] = E[yi0 jxi = 0] = E[yi0 ] and the selection bias
term is 0. But if they are not selected randomly, for example the choice to take the
drug is on a voluntary basis, or is given only among people that are regularly visiting
their doctor, those patients with better health and better possible outcomes are less
likely to be taking the drug and E[yi0 jxi = 1] < E[yi0 jxi = 0].
Moving on to the regression, the observed outcome can be written as a function
of the potential outcomes:

yi = yi0 + (yi1 yi0 )xi


= E[yi0 ] + (yi1 yi0 )xi + (yi0 E[yi0 ])
= 0 + 1 xi + "i :

55
Evaluating the expectations under treatment and no treatment gives:

E[yi jxi = 1] = 0 + 1 + E("i jxi = 1)


E[yi jxi = 0] = 0 + E("i jxi = 0)
E[yi jxi = 1] E[yi jxi = 0] = 1 + E("i jxi = 1) E("i jxi = 0)

which shows that if there is any kind of dependence between the regressor xi and
the error term "i ; then 1 is not identi…ed from the data. If randomization is not
possible, then there are two solutions. Note that we can interpret this also this
example as an instance of omitted variable problem/measument error, and adding
a control variable to the conditional mean of the model, for example age or overall
health can remove the selection bias. The other solution is to use a zi instrument
correlated with xi but uncorrelated with the potential outcomes (i.e. uncorrelated
with the errors). Why? because then we can project xi on the space of zi and obtain
a regression

yi = 0+ 1 (Pz xi
+ Mz xi ) + "i :
0+ 1x
^i + ~"i

where x^i = Pz xi and ~"i = 1 Mz xi + "i are by construction orthogonal, hence the
selection bias of this regression is 0.

10.1.4 Simultaneous equation models


A basic demand-supply model Simultaneous equation models represent a set
of causal relationships. A basic demand - supply model is:

qtd = pt + "dt
qts = pt + "st

In equilibrium qtd = qts = qt which implies the model can be written as:

1 qt "dt
= :
1 pt "st

Solving the model gives:


" # " "dt "st
#
qt "dt
= 1 1 = "dt "st : (13)
pt "st

56
Clearly both qt and pt depend on both shocks. What happens if we try to estimate
the demand equation via OLS?3 We would have:

1
^ T
pt qt
= 1
T
p2t
1 "dt "st "dt "st
T
= 2
1 "dt "st
T
1
T
("dt "st )( "dt "st )
= 2
1
T
"dt "st
1
T
f ("dt )2 "dt "st "dt "st + ("st )2 g
= 1
T
f("dt )2 + ("st )2 2"dt "st g
! E[("dt )2 ] + E[("st )2 ]
=
! E[("dt )2 ] + E[("st )2 ]
2 2
d + s
! 2 2
: (14)
d + s

This is not the elasticity we are looking for (the demand elasticity ) but rather
a mix of demand and supply elasticities (the demand elasticity and the supply
elasticity ).4
2+ 2
d s
In this situation the coe¢ cient 2+ 2
s
is identi…ed but and are not. This
d
is not surprising, it is a direct consequence of the simultaneity of this model: it is
3
The same would happen if we tried to estimate the supply equation.
4
Only in the special case in which the shocks to one of the two curves are so small to be negligible
we could consistently estimate and . This is close to a situation in which only one of the two
curves is being shocked, and its movements after the shocks allow us to trace out the other curve.

57
possible to show that in model (13) we have⁵
$$
E[q_t \mid p_t] = \frac{\beta\sigma_d^2 + \alpha\sigma_s^2}{\sigma_d^2 + \sigma_s^2}\, p_t ,
\tag{15}
$$
which is indeed the probability limit to which the OLS estimator is converging (see (14)). This is an observable property of the joint distribution, and indeed the linear regression of $q_t$ on $p_t$ yields a consistent estimator of this quantity.

Clearly in this (and any other) simultaneous equation model the object $\alpha p_t$ is not the conditional mean, but a conceptual causal object capturing what would happen to $q_t$ if we moved $p_t$ only via shocks $\varepsilon_t^s$ while keeping $\varepsilon_t^d$ fixed. This amounts to a situation in which only one of the two curves is being shocked, and its movements after the shocks allow us to trace out the other curve. A notation for this object was given by Pearl (2010): $\alpha p_t = E[q_t \mid do(\varepsilon_t^d = \bar\varepsilon_t^d)]$. This latter expectation and the conditional expectation $E[q_t \mid p_t]$ in (15) will coincide only if $E[\varepsilon_t^d \mid p_t] = 0$, that is if there is independence between $\varepsilon_t^d$ and $p_t$, a condition that by construction does not hold in a simultaneous model.

Problem: it is important to emphasize that the causal object $\alpha p_t$ is not a characteristic of the joint distribution and therefore cannot be inferred from the data without additional assumptions.

Solution: find a variable that shifts one curve but not the other. This would introduce in the system a channel through which we can reproduce the ideal counterfactual situation of keeping one curve fixed while moving the other. If the instrumental variable only moves one of the two curves, then it is possible to trace out the curve that does not respond to the instrument. For example, define $w_t$ as the number of days with temperatures below 0 Celsius, which would cause a reduction in the yield of the agricultural sector. We can think of $w_t$ as a component of the supply shocks:
$$
\varepsilon_t^s = h w_t + u_t^s
$$

⁵ From (13) we have:
$$
E\begin{bmatrix} q_t \\ p_t \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \qquad
Var\begin{bmatrix} q_t \\ p_t \end{bmatrix}
= \left(\frac{1}{\beta-\alpha}\right)^2
\begin{bmatrix} \beta & -\alpha \\ 1 & -1 \end{bmatrix}
\begin{bmatrix} \sigma_d^2 & 0 \\ 0 & \sigma_s^2 \end{bmatrix}
\begin{bmatrix} \beta & 1 \\ -\alpha & -1 \end{bmatrix}
= \left(\frac{1}{\beta-\alpha}\right)^2
\begin{bmatrix} \beta^2\sigma_d^2 + \alpha^2\sigma_s^2 & \beta\sigma_d^2 + \alpha\sigma_s^2 \\ \beta\sigma_d^2 + \alpha\sigma_s^2 & \sigma_d^2 + \sigma_s^2 \end{bmatrix},
$$
so that
$$
E[q_t \mid p_t] = (\beta\sigma_d^2 + \alpha\sigma_s^2)(\sigma_d^2 + \sigma_s^2)^{-1} p_t ,
$$
where the last line follows from the properties of the multivariate Gaussian.
and we can write the model as
$$
q_t^d = \alpha p_t + \varepsilon_t^d, \qquad q_t^s = \beta p_t + h w_t + u_t^s .
$$
Note that $w_t$ only impacts the supply and is uncorrelated with both the other shocks. Putting the model in matrix notation and solving it gives:
$$
\begin{bmatrix} q_t \\ p_t \end{bmatrix}
= \begin{bmatrix} 1 & -\alpha \\ 1 & -\beta \end{bmatrix}^{-1}
  \begin{bmatrix} 0 \\ h w_t \end{bmatrix}
+ \begin{bmatrix} 1 & -\alpha \\ 1 & -\beta \end{bmatrix}^{-1}
  \begin{bmatrix} \varepsilon_t^d \\ u_t^s \end{bmatrix}
= \begin{bmatrix} \dfrac{\alpha h}{\alpha-\beta} \\[8pt] \dfrac{h}{\alpha-\beta} \end{bmatrix} w_t
+ \begin{bmatrix} \dfrac{\beta\varepsilon_t^d - \alpha u_t^s}{\beta-\alpha} \\[8pt] \dfrac{\varepsilon_t^d - u_t^s}{\beta-\alpha} \end{bmatrix}.
\tag{16}
$$
Use the second equation to find the conditional expectation $E[p_t \mid w_t]$:
$$
p_t = \underbrace{\frac{h}{\alpha-\beta} w_t}_{p_t^{*} \,=\, E[p_t \mid w_t]} + \frac{\varepsilon_t^d - u_t^s}{\beta-\alpha}.
\tag{17}
$$
Key step: using the conditional expectation $p_t^{*}$ we can re-write the model (16) as:
$$
\begin{bmatrix} q_t \\ p_t \end{bmatrix}
= \begin{bmatrix} \alpha \\ 1 \end{bmatrix} p_t^{*}
+ \begin{bmatrix} \dfrac{\beta\varepsilon_t^d - \alpha u_t^s}{\beta-\alpha} \\[8pt] \dfrac{\varepsilon_t^d - u_t^s}{\beta-\alpha} \end{bmatrix}.
\tag{18}
$$
It is clear that in this model $E[q_t \mid p_t^{*}] = \alpha p_t^{*}$, i.e. the conditional expectation $E[q_t \mid p_t^{*}]$ coincides with the causal object of interest $\alpha p_t^{*} = E[q_t \mid do(\varepsilon_t^d = \bar\varepsilon_t^d)]$. The OLS estimator in the regression of $q_t$ on $p_t^{*}$ would converge to this value:
$$
\hat\alpha
= \frac{\frac1T\sum p_t^{*} q_t}{\frac1T\sum (p_t^{*})^2}
= \frac{\frac1T\sum p_t^{*} (\alpha p_t^{*} + v_t)}{\frac1T\sum (p_t^{*})^2}
= \frac{\frac1T\sum \left(\alpha (p_t^{*})^2 + p_t^{*} v_t\right)}{\frac1T\sum (p_t^{*})^2}
= \alpha + \frac{\frac1T\sum p_t^{*} v_t}{\frac1T\sum (p_t^{*})^2} \;\to\; \alpha,
\tag{19}
$$
where $v_t = (\beta\varepsilon_t^d - \alpha u_t^s)/(\beta-\alpha)$ is the disturbance of the first line of (18) and the last step follows from $E[p_t^{*} v_t] = 0$.
The regression of $q_t$ on $p_t^{*}$ appearing in the first line of (18) can be interpreted as a version of the demand equation in which $p_t$ has been swapped with $p_t^{*}$ (with a remainder term ending up in the disturbances). The second equation of (18) instead is the "first stage" regression, projecting $p_t$ on a space orthogonal to the demand shocks, to obtain a variable $p_t^{*}$ that is uncorrelated with the demand shock. In practice $p_t^{*}$ needs to be estimated, which can be done using the linear regression
$$
p_t = \gamma w_t + \eta_t ,
$$
with $\hat p_t = \hat\gamma w_t$ a consistent estimate of $p_t^{*} = E[p_t \mid w_t] = \frac{h}{\alpha-\beta} w_t$. The operational version of (19) is therefore:
$$
\hat\alpha = \frac{\frac1T\sum \hat p_t (\alpha \hat p_t + v_t)}{\frac1T\sum \hat p_t^2}
= \alpha + \frac{\frac1T\sum \hat p_t v_t}{\frac1T\sum \hat p_t^2} \;\to\; \alpha,
$$
since $\hat p_t \to p_t^{*}$ and $E[p_t^{*} v_t] = 0$.⁶

Of course, it is key that only one of the two curves is moved by the instrument. Imagine this is not the case; the model becomes:
$$
\begin{bmatrix} q_t \\ p_t \end{bmatrix}
= \begin{bmatrix} 1 & -\alpha \\ 1 & -\beta \end{bmatrix}^{-1}
  \begin{bmatrix} k w_t \\ h w_t \end{bmatrix}
+ \begin{bmatrix} 1 & -\alpha \\ 1 & -\beta \end{bmatrix}^{-1}
  \begin{bmatrix} \varepsilon_t^d \\ u_t^s \end{bmatrix}
= \begin{bmatrix} \dfrac{\alpha h - \beta k}{\alpha-\beta} \\[8pt] \dfrac{h-k}{\alpha-\beta} \end{bmatrix} w_t
+ \begin{bmatrix} \dfrac{\beta\varepsilon_t^d - \alpha u_t^s}{\beta-\alpha} \\[8pt] \dfrac{\varepsilon_t^d - u_t^s}{\beta-\alpha} \end{bmatrix}.
$$
Using the second equation we can obtain the conditional expectation
$$
p_t^{*} = E[p_t \mid w_t] = \frac{h-k}{\alpha-\beta}\, w_t ,
$$
but this is no longer proportional to the expectation $E[q_t \mid p_t^{*}]$ appearing in the first equation and therefore cannot be used to pin down $\alpha$ by re-writing the system in the form (18). The form (18) would ensure that $E[q_t \mid p_t^{*}] = \alpha p_t^{*}$, but it requires the exclusion restriction $k = 0$.

⁶ Another way to see how the introduction of $w_t$ identifies $\alpha$ is to note that using representation (16) one can compute $E[q_t \mid w_t] = \frac{\alpha h}{\alpha-\beta} w_t$ and $E[p_t \mid w_t] = \frac{h}{\alpha-\beta} w_t$, and then note that $\frac{E[q_t \mid w_t]}{E[p_t \mid w_t]} = \alpha$. In the sample, one could regress $q_t$ on $w_t$ to obtain $(\sum w_t^2)^{-1}\sum w_t q_t$ and $p_t$ on $w_t$ to obtain $(\sum w_t^2)^{-1}\sum w_t p_t$, and then take the ratio $\frac{(\sum w_t^2)^{-1}\sum w_t q_t}{(\sum w_t^2)^{-1}\sum w_t p_t} = (\sum w_t p_t)^{-1}\sum w_t q_t$, which is the IV estimator. This approach is less efficient than the TSLS approach, because it uses $w_t$ instead of the projection of $p_t$ on $w_t$.
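To see (14) and (19) at work, here is a small simulation sketch. The parameter values ($\alpha=-1$, $\beta=2$, $h=0.5$), the Gaussian shocks and the sample size are illustrative assumptions of this sketch, not taken from the notes; it solves the system for $p_t$ and $q_t$ and compares the OLS estimate of the demand slope with the IV ratio estimator of footnote 6.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100_000
alpha, beta, h = -1.0, 2.0, 0.5       # demand slope, supply slope, instrument loading (assumed values)

w     = rng.normal(size=T)            # instrument: only enters the supply curve
eps_d = rng.normal(size=T)
u_s   = rng.normal(size=T)
eps_s = h * w + u_s

# equilibrium solution of the system, as in (13)/(16)
p = (eps_d - eps_s) / (beta - alpha)
q = (beta * eps_d - alpha * eps_s) / (beta - alpha)

ols = (p @ q) / (p @ p)   # converges to (beta*sigma_d^2 + alpha*sigma_s^2)/(sigma_d^2 + sigma_s^2), eq. (14)
iv  = (w @ q) / (w @ p)   # ratio of the two reduced-form regressions, footnote 6
print(f"true alpha = {alpha:.2f}, OLS = {ols:.3f}, IV = {iv:.3f}")
```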
Full information system estimation approach   While the system written as in (18) obviously lends itself to a TSLS estimation procedure, an alternative procedure involves estimating the system in (16) and then solving for the parameters of interest, that is, estimating the form
$$
\begin{bmatrix} q_t \\ p_t \end{bmatrix}
= \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} w_t
+ \begin{bmatrix} v_t^d \\ v_t^s \end{bmatrix}
$$
with
$$
\begin{bmatrix} a_1 \\ a_2 \end{bmatrix}
= \begin{bmatrix} \dfrac{\alpha h}{\alpha-\beta} \\[8pt] \dfrac{h}{\alpha-\beta} \end{bmatrix},
\tag{20}
$$
$$
Var\begin{bmatrix} v_t^d \\ v_t^s \end{bmatrix}
= \begin{bmatrix} \sigma_{v^d}^2 & \sigma_{v^d v^s} \\ \sigma_{v^d v^s} & \sigma_{v^s}^2 \end{bmatrix}
= \begin{bmatrix}
\dfrac{\beta^2\sigma_{\varepsilon^d}^2 + \alpha^2\sigma_{u^s}^2}{(\beta-\alpha)^2} &
\dfrac{\beta\sigma_{\varepsilon^d}^2 + \alpha\sigma_{u^s}^2}{(\beta-\alpha)^2} \\[10pt]
\dfrac{\beta\sigma_{\varepsilon^d}^2 + \alpha\sigma_{u^s}^2}{(\beta-\alpha)^2} &
\dfrac{\sigma_{\varepsilon^d}^2 + \sigma_{u^s}^2}{(\beta-\alpha)^2}
\end{bmatrix}.
\tag{21}
$$
There are five structural coefficients $h, \alpha, \beta, \sigma_{\varepsilon^d}^2, \sigma_{u^s}^2$ and five reduced form coefficients $a_1, a_2, \sigma_{v^d}^2, \sigma_{v^d v^s}, \sigma_{v^s}^2$. The latter can be consistently estimated with OLS and then one can solve for the structural parameters. In this example (20) identifies $\alpha$ through the ratio $a_1/a_2$, and then, once $\beta$ is identified (see below), one can find $h = a_2(\alpha-\beta)$.

To find $\beta$ one can use the ratios of the variance conditions in (21),
$$
\frac{\sigma_{v^d v^s}}{\sigma_{v^s}^2} = \frac{\beta\sigma_{\varepsilon^d}^2 + \alpha\sigma_{u^s}^2}{\sigma_{\varepsilon^d}^2 + \sigma_{u^s}^2},
\qquad
\frac{\sigma_{v^d}^2}{\sigma_{v^s}^2} = \frac{\beta^2\sigma_{\varepsilon^d}^2 + \alpha^2\sigma_{u^s}^2}{\sigma_{\varepsilon^d}^2 + \sigma_{u^s}^2},
$$
writing them as
$$
\frac{\sigma_{v^d v^s}}{\sigma_{v^s}^2} = \beta + \alpha\frac{\sigma_{u^s}^2}{\sigma_{\varepsilon^d}^2} - \frac{\sigma_{v^d v^s}}{\sigma_{v^s}^2}\frac{\sigma_{u^s}^2}{\sigma_{\varepsilon^d}^2},
\tag{22}
$$
$$
\frac{\sigma_{v^d}^2}{\sigma_{v^s}^2} = \beta^2 + \alpha^2\frac{\sigma_{u^s}^2}{\sigma_{\varepsilon^d}^2} - \frac{\sigma_{v^d}^2}{\sigma_{v^s}^2}\frac{\sigma_{u^s}^2}{\sigma_{\varepsilon^d}^2},
\tag{23}
$$
and solving for $\sigma_{u^s}^2/\sigma_{\varepsilon^d}^2$ and $\beta$ (recall that $\alpha$ is already identified from $a_1/a_2$). To do so, first equate the square of (22) with (23) and solve for $\sigma_{u^s}^2/\sigma_{\varepsilon^d}^2$; then use this ratio in (23) to solve for $\beta^2$. Taking the square root and rearranging yields $\beta$, hence $h = a_2(\alpha-\beta)$ is also identified. Finally, we can use $\sigma_{v^s}^2 = \frac{\sigma_{\varepsilon^d}^2 + \sigma_{u^s}^2}{(\beta-\alpha)^2}$ and $\sigma_{v^d v^s} = \frac{\beta\sigma_{\varepsilon^d}^2 + \alpha\sigma_{u^s}^2}{(\beta-\alpha)^2}$ in (21) to straightforwardly solve for $\sigma_{\varepsilon^d}^2$ and $\sigma_{u^s}^2$.

A 3-variable macroeconomic model   Here is a minimal example of a modern macro model:
$$
\begin{aligned}
IS &:\; \tilde y_t = a_1 (i_t - \tilde\pi_t) + a_2 \tilde y_{t-1} + \varepsilon_t^y \\
PC &:\; \tilde\pi_t = b_1 E_t[\tilde\pi_{t+1}] + b_2 (\tilde y_t - y_t^{*}) + \varepsilon_t^{\pi} \\
TR &:\; i_t = c_1 (\tilde y_t - y_t^{*}) + c_2 (\tilde\pi_t - \tilde\pi_t^{*}) + c_3 i_{t-1} + \varepsilon_t^i
\end{aligned}
$$
where $\tilde y_t = \ln Y_t - \ln Y_{t-1}$ is the growth rate of output, $\tilde\pi_t = \ln P_t - \ln P_{t-1}$ the growth rate of prices, and $i_t$ the (nominal) interest rate. The IS (Investment-Savings) curve relates output to interest rates: the higher interest rates are, the lower output is. The Phillips Curve makes inflation depend on inflation expectations $E_t[\tilde\pi_{t+1}]$ and the output gap. Note that $\tilde\pi_t$ is related to $\tilde y_t$ through the IS curve: higher inflation reduces the cost of borrowing, which increases investment and output. However, higher output pushes inflation up through the Phillips curve (more firms are looking for workers, so salaries go up, costs go up, and eventually prices go up). The PC effect counterbalances the IS effect until we reach a stable equilibrium. Finally, the Taylor Rule depicts the central bank's actions in pursuing its goals: deviations of inflation and output from $\tilde\pi_t^{*}$ and $y_t^{*}$ are adjusted via increases/decreases in the interest rate.

We can simplify this model by setting $E_t[\tilde\pi_{t+1}] = \tilde\pi_{t-1}$ (adaptive expectations) and $\tilde\pi_t^{*} = y_t^{*} = 0$ (this is just to simplify notation):
$$
\begin{aligned}
IS &:\; \tilde y_t = a_1 (i_t - \tilde\pi_t) + a_2 \tilde y_{t-1} + \varepsilon_t^y \\
PC &:\; \tilde\pi_t = b_1 \tilde\pi_{t-1} + b_2 \tilde y_t + \varepsilon_t^{\pi} \\
TR &:\; i_t = c_1 \tilde y_t + c_2 \tilde\pi_t + c_3 i_{t-1} + \varepsilon_t^i
\end{aligned}
$$
Since every variable interacts contemporaneously with all the others, one cannot estimate any of these three equations via OLS. Note you cannot even compute a forecast, as you would need the forecast from one equation to forecast another. The model above is a system of difference equations:
$$
\begin{bmatrix} 1 & a_1 & -a_1 \\ -b_2 & 1 & 0 \\ -c_1 & -c_2 & 1 \end{bmatrix}
\begin{bmatrix} \tilde y_t \\ \tilde\pi_t \\ i_t \end{bmatrix}
=
\begin{bmatrix} a_2 & 0 & 0 \\ 0 & b_1 & 0 \\ 0 & 0 & c_3 \end{bmatrix}
\begin{bmatrix} \tilde y_{t-1} \\ \tilde\pi_{t-1} \\ i_{t-1} \end{bmatrix}
+
\begin{bmatrix} \varepsilon_t^y \\ \varepsilon_t^{\pi} \\ \varepsilon_t^i \end{bmatrix}
$$
Consider solving it:
$$
\begin{bmatrix} \tilde y_t \\ \tilde\pi_t \\ i_t \end{bmatrix}
=
\begin{bmatrix} 1 & a_1 & -a_1 \\ -b_2 & 1 & 0 \\ -c_1 & -c_2 & 1 \end{bmatrix}^{-1}
\begin{bmatrix} a_2 & 0 & 0 \\ 0 & b_1 & 0 \\ 0 & 0 & c_3 \end{bmatrix}
\begin{bmatrix} \tilde y_{t-1} \\ \tilde\pi_{t-1} \\ i_{t-1} \end{bmatrix}
+
\begin{bmatrix} 1 & a_1 & -a_1 \\ -b_2 & 1 & 0 \\ -c_1 & -c_2 & 1 \end{bmatrix}^{-1}
\begin{bmatrix} \varepsilon_t^y \\ \varepsilon_t^{\pi} \\ \varepsilon_t^i \end{bmatrix}
$$
$$
= \frac{1}{D}
\begin{bmatrix}
a_2 & -b_1(a_1 - a_1 c_2) & c_3 a_1 \\
a_2 b_2 & -b_1(a_1 c_1 - 1) & c_3 a_1 b_2 \\
a_2(c_1 + b_2 c_2) & b_1(c_2 - a_1 c_1) & c_3(a_1 b_2 + 1)
\end{bmatrix}
\begin{bmatrix} \tilde y_{t-1} \\ \tilde\pi_{t-1} \\ i_{t-1} \end{bmatrix}
+ \frac{1}{D}
\begin{bmatrix}
1 & -(a_1 - a_1 c_2) & a_1 \\
b_2 & -(a_1 c_1 - 1) & a_1 b_2 \\
c_1 + b_2 c_2 & c_2 - a_1 c_1 & a_1 b_2 + 1
\end{bmatrix}
\begin{bmatrix} \varepsilon_t^y \\ \varepsilon_t^{\pi} \\ \varepsilon_t^i \end{bmatrix}
$$
with $D = a_1 b_2 - a_1 c_1 - a_1 b_2 c_2 + 1$. Note that the autoregressive matrix above can now be estimated with OLS! However, what we get is not the structural parameters we wanted: they are the parameters of the reduced form, the only form identified in the data. There is no guarantee that we can back out the structural parameters from the reduced-form parameters.
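As a quick numerical companion to the mapping from structural to reduced form, the sketch below (with made-up parameter values, purely for illustration) builds the contemporaneous and lagged structural matrices, computes the reduced-form VAR(1) matrix that OLS would estimate, and checks the determinant $D$ given above.

```python
import numpy as np

# assumed structural parameter values (purely illustrative)
a1, a2, b1, b2, c1, c2, c3 = -0.5, 0.6, 0.7, 0.2, 0.3, 1.5, 0.8

A = np.array([[1.0,  a1, -a1],       # contemporaneous (structural) matrix
              [-b2, 1.0, 0.0],
              [-c1, -c2, 1.0]])
B = np.diag([a2, b1, c3])            # lagged (structural) matrix

Phi  = np.linalg.solve(A, B)         # reduced-form VAR(1) matrix A^{-1}B, estimable by OLS
Ainv = np.linalg.inv(A)              # maps structural shocks into reduced-form errors
D    = a1*b2 - a1*c1 - a1*b2*c2 + 1
assert np.isclose(np.linalg.det(A), D)   # determinant matches the expression in the text
print(Phi)
```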

10.2 Instruments
The problem is solved through the use of some alternative variables, called instruments, $Z$. We need:

• $Z$ to be uncorrelated with the errors: $Cov(z_t, \varepsilon_t) = 0 \iff E[z_t\varepsilon_t] = 0$ (validity or exogeneity).

• $Z$ to be correlated (as much as possible) with the $X$ (relevance).

• $E[z_t z_t'] = Q_{zz} < \infty$, $E[z_t x_t'] = Q_{zx} < \infty$ (regularity conditions).

Start with as many instruments as variables. The starting point is the moment condition
$$
E[Z'\varepsilon] = 0,
$$
which suggests the moment estimator solving
$$
Z'(y - X\hat\beta) = 0,
$$
that is:
$$
\hat\beta_{IV} = (Z'X)^{-1} Z'Y,
$$
which of course turns out to be consistent (as we know that the moment condition holds):
$$
\hat\beta_{IV} = (Z'X)^{-1} Z'Y
= (Z'X)^{-1} Z'(X\beta + \varepsilon)
= \beta + (Z'X)^{-1} Z'\varepsilon
= \beta + \underbrace{\left(\frac{Z'X}{T}\right)^{-1}}_{\to\, Q_{zx}^{-1}} \underbrace{\frac{Z'\varepsilon}{T}}_{\to\, E[z_t\varepsilon_t] = 0}.
$$
Asymptotic distribution:
$$
\sqrt{T}(\hat\beta_{IV} - \beta) = \underbrace{\left(\frac{Z'X}{T}\right)^{-1}}_{\to\, Q_{zx}^{-1}} \sqrt{T}\,\frac{Z'\varepsilon}{T}
\qquad\text{and}\qquad
\sqrt{T}\,\frac{Z'\varepsilon}{T} \to N(0, Var[z_t\varepsilon_t]),
$$
with $Var[z_t\varepsilon_t] = E[z_t z_t'\varepsilon_t^2] = E_z[z_t z_t' E[\varepsilon_t^2 \mid z_t]] = Q_{zz}\sigma^2$. Hence
$$
\sqrt{T}(\hat\beta_{IV} - \beta) \to N\!\left(0,\; \sigma^2 Q_{zx}^{-1} Q_{zz} Q_{zx}^{-1\prime}\right).
$$
A consistent estimator of the asymptotic variance is
$$
\frac{\hat\varepsilon'\hat\varepsilon}{T}\left(\frac{Z'X}{T}\right)^{-1}\frac{Z'Z}{T}\left(\frac{X'Z}{T}\right)^{-1}
= \hat\sigma^2 \left(\frac{Z'X}{T}\right)^{-1}\frac{Z'Z}{T}\left(\frac{X'Z}{T}\right)^{-1}
$$
and we can write:
$$
\hat\beta_{IV} \stackrel{a}{\sim} N\!\left(\beta,\; \hat\sigma^2 (Z'X)^{-1} Z'Z (X'Z)^{-1}\right).
$$
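A minimal matrix-level sketch of the just-identified IV estimator and its estimated asymptotic covariance, following the formulas above. The data-generating choices in the usage lines are purely illustrative assumptions of this sketch.

```python
import numpy as np

def iv_estimator(y, X, Z):
    """Just-identified IV: beta_hat = (Z'X)^{-1} Z'y, with estimated asymptotic variance
    sigma2_hat * (Z'X)^{-1} Z'Z (X'Z)^{-1}."""
    ZX = Z.T @ X
    beta_hat = np.linalg.solve(ZX, Z.T @ y)
    resid = y - X @ beta_hat                     # residuals use the original regressors
    sigma2_hat = resid @ resid / len(y)
    ZXinv = np.linalg.inv(ZX)
    avar = sigma2_hat * ZXinv @ (Z.T @ Z) @ ZXinv.T
    return beta_hat, avar

# illustrative usage with simulated data (just the mechanics of the formulas)
rng = np.random.default_rng(1)
T = 500
Z = np.column_stack([np.ones(T), rng.normal(size=T)])
X = np.column_stack([np.ones(T), 0.8 * Z[:, 1] + rng.normal(size=T)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=T)
b, V = iv_estimator(y, X, Z)
print(b, np.sqrt(np.diag(V)))
```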
10.3 Two stage least squares
Suppose that you have found a valid and relevant set of instruments $Z$. They can possibly be more than the $X$s. Do the following. First run the regression
$$
X = Z\Gamma + u, \qquad \hat\Gamma = (Z'Z)^{-1} Z'X, \qquad \hat X = Z\hat\Gamma ;
$$
these are the fitted values,
$$
\hat X = Z(Z'Z)^{-1}Z'X = P_Z X.
$$
It follows that
$$
\hat\beta_{2SLS} = (\hat X'X)^{-1}\hat X'Y
= \left(X'Z(Z'Z)^{-1}Z'X\right)^{-1} X'Z(Z'Z)^{-1}Z'Y
= (X'P_Z X)^{-1} X'P_Z Y
= (X'P_Z P_Z X)^{-1} X'P_Z Y
= (\hat X'\hat X)^{-1}\hat X'Y,
$$
which is the two stage least squares estimator. The second stage is
$$
Y = \hat X\beta + \varepsilon .
$$
The first stage was
$$
X = Z\Gamma + u .
$$
It is possible to show that TSLS is the most efficient choice among all possible transformations of $Z$. The asymptotic distribution is
$$
\hat\beta_{2SLS} \stackrel{a}{\sim} N\!\left(\beta,\; \hat\sigma^2 (\hat X'X)^{-1}\hat X'\hat X(X'\hat X)^{-1}\right),
$$
where $(\hat X'X)^{-1}\hat X'\hat X(X'\hat X)^{-1} = (\hat X'\hat X)^{-1}$ and $\hat\sigma^2 = (Y - X\hat\beta)'(Y - X\hat\beta)/(T-k)$. Important: note that $\hat\sigma^2$ uses the endogenous regressors and not the instrumented ones ($(Y - \hat X\hat\beta)'(Y - \hat X\hat\beta)/(T-k)$).
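A corresponding sketch of the 2SLS formulas above (illustrative only). In the just-identified case, with as many columns in $Z$ as in $X$, it reproduces $\hat\beta_{IV} = (Z'X)^{-1}Z'Y$.

```python
import numpy as np

def tsls(y, X, Z):
    """Two-stage least squares via the projection P_Z = Z(Z'Z)^{-1}Z'."""
    Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)     # first-stage fitted values
    beta = np.linalg.solve(Xhat.T @ X, Xhat.T @ y)   # (Xhat'X)^{-1} Xhat'y
    resid = y - X @ beta                             # residuals with the *endogenous* regressors
    k = X.shape[1]
    sigma2 = resid @ resid / (len(y) - k)
    avar = sigma2 * np.linalg.inv(Xhat.T @ Xhat)     # sigma2_hat * (Xhat'Xhat)^{-1}
    return beta, avar
```

Note that the residuals entering $\hat\sigma^2$ are computed with $X$, not $\hat X$, as stressed in the text.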
10.4 Hausman test
Consider the null hypothesis of exogeneity. Under the null the OLS estimator $\hat\beta_{OLS}$ is consistent. Then take a 2SLS estimator $\hat\beta_{2SLS}$ derived from $Z$ instruments. Under the null hypothesis of exogeneity, this estimator is also consistent, so
$$
\operatorname{plim}\left(\hat\beta_{2SLS} - \hat\beta_{OLS}\right) = 0.
\tag{24}
$$
Under the alternative only $\hat\beta_{2SLS}$ is consistent. Hence we can devise a test for the null of exogeneity using (24) as the null. In order to re-scale the discrepancy vector, we need to standardize it by its variance:
$$
(\hat\beta_{2SLS} - \hat\beta_{OLS})'\,\big[asyVar(\hat\beta_{2SLS} - \hat\beta_{OLS})\big]^{-1}(\hat\beta_{2SLS} - \hat\beta_{OLS}),
$$
where
$$
asyVar(\hat\beta_{2SLS} - \hat\beta_{OLS}) = asyVar(\hat\beta_{2SLS}) + asyVar(\hat\beta_{OLS}) - 2\, asyCov(\hat\beta_{2SLS}, \hat\beta_{OLS}).
$$
Now note that under the null $\hat\beta_{OLS}$ is BLUE and $\hat\beta_{2SLS}$ is inefficient.⁷ This implies⁸ that $asyVar(\hat\beta_{OLS}) = asyCov(\hat\beta_{2SLS}, \hat\beta_{OLS})$, which simplifies things as we do not need to estimate the covariance:
$$
asyVar(\hat\beta_{2SLS} - \hat\beta_{OLS}) = asyVar(\hat\beta_{2SLS}) - asyVar(\hat\beta_{OLS}).
$$
Therefore the Wald statistic can be written as
$$
W = (\hat\beta_{2SLS} - \hat\beta_{OLS})'\,\big[asyVar(\hat\beta_{2SLS}) - asyVar(\hat\beta_{OLS})\big]^{-1}(\hat\beta_{2SLS} - \hat\beta_{OLS})
$$
and estimated using
$$
\hat W = (\hat\beta_{2SLS} - \hat\beta_{OLS})'\,\Big[\hat\sigma^2\big((\hat X'\hat X)^{-1} - (X'X)^{-1}\big)\Big]^{-1}(\hat\beta_{2SLS} - \hat\beta_{OLS}).
$$
Since there will be some $X$s that are exogenous (e.g. the intercept), this quadratic form will generally be of reduced rank $K^{*} = K - K_0$, where $K_0$ is the number of exogenous $X$s and $K^{*}$ the number of endogenous $X$s (call these $X^{*}$). Hence the inversion of the matrix in square brackets needs to be performed using a generalized inverse procedure. In practice, though, an equivalent variable addition test due to Wu (1973) can be performed easily by running the augmented regression
$$
y = X\beta + \hat X^{*}\gamma + \varepsilon,
$$
where $X$ are all the variables (both endogenous and exogenous) and $\hat X^{*}$ are the projections of the endogenous variables $X^{*}$ on the instruments. The null of exogeneity corresponds to $\gamma = 0$. If there is endogeneity, it will be picked up by $\gamma$. The Wald statistic for $\gamma = 0$ has distribution $F_{K^{*},\, n-K-K^{*}}$ and this procedure is equivalent to the Hausman test described above.

⁷ To see this more clearly, consider the difference in the estimated asymptotic variances of the two estimators under the null of exogeneity:
$$
\hat\sigma^2(\hat X'\hat X)^{-1} - \hat\sigma^2(X'X)^{-1},
$$
and consider the object
$$
(X'X) - (X'Z)(Z'Z)^{-1}(Z'X) = X'X - X'P_Z X = X'M_Z X,
$$
which is positive definite. Since $A - B > 0 \iff A^{-1} < B^{-1}$, we have that OLS always has the smaller variance under the null of exogeneity.

⁸ This follows from Hausman's result:
$$
Cov(\hat\beta_{efficient},\, \hat\beta_{efficient} - \hat\beta_{inefficient})
= Var(\hat\beta_{efficient}) - Cov(\hat\beta_{efficient}, \hat\beta_{inefficient}) = 0,
$$
which implies that $Var(\hat\beta_{efficient}) = Cov(\hat\beta_{efficient}, \hat\beta_{inefficient})$.
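A sketch of the Wu variable-addition version of the test. The $F$ statistic is computed by hand from restricted and unrestricted sums of squared residuals; the column indices of the endogenous regressors are supplied by the user, and all names here are illustrative.

```python
import numpy as np

def wu_test(y, X, endog_cols, Z):
    """Variable-addition (Wu) test: regress y on [X, Xhat*] and test gamma = 0,
    where Xhat* are the projections of the endogenous columns of X on the instruments."""
    PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)          # projection matrix onto the instrument space
    Xhat_star = PZ @ X[:, endog_cols]               # first-stage fitted values of the endogenous X's
    W = np.column_stack([X, Xhat_star])             # augmented regressor matrix

    def ssr(M):
        b = np.linalg.lstsq(M, y, rcond=None)[0]
        e = y - M @ b
        return e @ e

    ssr_r, ssr_u = ssr(X), ssr(W)                   # restricted (gamma = 0) vs unrestricted SSR
    q = Xhat_star.shape[1]                          # number of restrictions (endogenous regressors)
    dof = len(y) - W.shape[1]
    F = ((ssr_r - ssr_u) / q) / (ssr_u / dof)       # standard F statistic for gamma = 0
    return F, q, dof
```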
Part III
Likelihood based methods
11 Maximum Likelihood estimation
Fisher, 1925. Maximum Likelihood estimation is more widely applicable than least squares (not only linear regression models). For iid data:
$$
L(y; \theta) = \prod_{t=1}^{T} f(y_t; \theta),
$$
and more generally, for dependent (but identically distributed) variables:
$$
L(y; \theta) = \prod_{t=1}^{T} f(y_t \mid y_{t-1}, \ldots, y_1; \theta).
$$
Maximum Likelihood Estimator (MLE):
$$
\hat\theta_{ML} = \arg\max_{\theta} L(y; \theta).
$$
Usually it is better to use the log-likelihood function:
$$
l(y; \theta) = \ln L(y; \theta) = \sum_{t=1}^{T} \ln f(y_t \mid y_{t-1}, \ldots, y_1; \theta).
$$
First order condition: $s(y; \theta) = \frac{\partial l}{\partial\theta} = 0$. [$s(y;\theta)$ is called the score vector.] The MLE has nice asymptotic properties:

• Consistency: $\operatorname{plim}_{T\to\infty}\hat\theta_{ML} = \theta$.

• Asymptotic normality: $\sqrt{T}(\hat\theta_{ML} - \theta) \to N\!\big(0, [\frac{1}{T}I(\theta)]^{-1}\big)$ and $\hat\theta_{ML} \stackrel{a}{\sim} N\!\big(\theta, [I(\theta)]^{-1}\big)$.

Here $I(\theta)$ is the Fisher information matrix, defined as the variance of the score vector:
$$
I(\theta) = E\!\left[\frac{\partial l}{\partial\theta}\frac{\partial l}{\partial\theta'}\right].
$$
It also equals (minus) the expected value of the Hessian (information matrix equality):
$$
I(\theta) = E\!\left[\frac{\partial l}{\partial\theta}\frac{\partial l}{\partial\theta'}\right] = -E\!\left[\frac{\partial^2 l}{\partial\theta\,\partial\theta'}\right].
$$

• Asymptotic efficiency: the asymptotic variance $[I(\theta)]^{-1}$ attains the Cramer-Rao bound, so the MLE has variance at least as small as the best unbiased estimator of $\theta$. The MLE is generally not unbiased, but its bias is small, making the comparison with unbiased estimators and the Cramer-Rao bound appropriate.

• Estimators of the asymptotic variance matrix $[\frac{1}{T}I(\theta)]^{-1}$ (since $\sqrt{T}(\hat\theta_{ML}-\theta) \to N(0, [\frac{1}{T}I(\theta)]^{-1})$) are based on the information matrix computed at the ML estimates:
$$
\left[\frac{1}{T}\sum_{t=1}^{T}\left(\frac{\partial l_t(y;\hat\theta_{ML})}{\partial\theta}\right)\left(\frac{\partial l_t(y;\hat\theta_{ML})}{\partial\theta}\right)'\right]^{-1} \to \left[\tfrac{1}{T}I(\theta)\right]^{-1}
\quad\text{or}\quad
\left[-\frac{1}{T}\sum_{t=1}^{T}\frac{\partial^2 l_t(y;\hat\theta_{ML})}{\partial\theta\,\partial\theta'}\right]^{-1} \to \left[\tfrac{1}{T}I(\theta)\right]^{-1}.
$$
Since $\hat\theta_{ML} \stackrel{a}{\sim} N(\theta, [I(\theta)]^{-1})$, this gives
$$
\left[\sum_{t=1}^{T}\left(\frac{\partial l_t(y;\hat\theta_{ML})}{\partial\theta}\right)\left(\frac{\partial l_t(y;\hat\theta_{ML})}{\partial\theta}\right)'\right]^{-1} \to [I(\theta)]^{-1}
\quad\text{or}\quad
\left[-\sum_{t=1}^{T}\frac{\partial^2 l_t(y;\hat\theta_{ML})}{\partial\theta\,\partial\theta'}\right]^{-1} \to [I(\theta)]^{-1}.
$$

• Invariance: if $g(\theta)$ is a continuous function of $\theta$, the ML estimator of $g(\theta)$ is $g(\hat\theta_{ML})$.

11.1 ML estimation of CLRM


Linear Model:
$$
y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I).
$$
Gaussianity:
$$
f(\varepsilon) = (2\pi)^{-T/2}|\sigma^2 I|^{-1/2}\exp\{-0.5\,\varepsilon'(\sigma^2 I)^{-1}\varepsilon\}.
$$
Transformation:
$$
f_y(y \mid X) = f_\varepsilon(r^{-1}(y))\left|\frac{\partial r^{-1}(y)}{\partial y}\right|,
$$
where $\varepsilon = r^{-1}(y) = y - X\beta$ and $\frac{\partial r^{-1}(y)}{\partial y} = 1$. Hence
$$
f(y) = (2\pi)^{-T/2}|\sigma^2 I|^{-1/2}\exp\{-0.5\,(y - X\beta)'(\sigma^2 I)^{-1}(y - X\beta)\},
$$
and in logs:
$$
\ln L = -\frac{T}{2}\ln 2\pi - \frac{T}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta).
$$
Let $\theta = (\beta'\;\; \sigma^2)'$. The first order conditions $s(\theta; y) = \frac{\partial l}{\partial\theta} = 0$ give
$$
\frac{\partial l}{\partial\beta} = -\frac{1}{\sigma^2}(-X'y + X'X\beta) = 0, \qquad
\frac{\partial l}{\partial\sigma^2} = -\frac{T}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)'(y - X\beta) = 0,
$$
hence
$$
\hat\beta_{ML} = (X'X)^{-1}X'y = \hat\beta_{OLS}, \qquad
\hat\sigma^2_{ML} = \frac{1}{T}(y - X\hat\beta_{ML})'(y - X\hat\beta_{ML}) = \frac{1}{T}\hat\varepsilon'\hat\varepsilon .
$$
Second order conditions:
$$
\begin{bmatrix}
\frac{\partial^2 l}{\partial\beta\,\partial\beta'} & \frac{\partial^2 l}{\partial\beta\,\partial\sigma^2} \\[4pt]
\frac{\partial^2 l}{\partial\sigma^2\,\partial\beta'} & \frac{\partial^2 l}{\partial(\sigma^2)^2}
\end{bmatrix}
=
\begin{bmatrix}
-\dfrac{X'X}{\sigma^2} & -\dfrac{X'\varepsilon}{\sigma^4} \\[8pt]
-\dfrac{(X'\varepsilon)'}{\sigma^4} & \dfrac{T}{2\sigma^4} - \dfrac{1}{\sigma^6}\varepsilon'\varepsilon
\end{bmatrix}.
$$
Information matrix:
$$
[I(\theta)] = -E\!\left[\frac{\partial^2 l}{\partial\theta\,\partial\theta'}\right]
=
\begin{bmatrix}
E\!\left[\dfrac{X'X}{\sigma^2}\right] = \dfrac{X'X}{\sigma^2} & E\!\left[\dfrac{X'\varepsilon}{\sigma^4}\right] = 0 \\[8pt]
E\!\left[\dfrac{(X'\varepsilon)'}{\sigma^4}\right] = 0 & -\dfrac{T}{2\sigma^4} + E\!\left[\dfrac{\varepsilon'\varepsilon}{\sigma^6}\right] = \dfrac{T}{2\sigma^4}
\end{bmatrix},
$$
where we used $E\!\left[\frac{\varepsilon'\varepsilon}{\sigma^6}\right] = \frac{T\sigma^2}{\sigma^6} = \frac{T}{\sigma^4}$. We have:
$$
[I(\theta)]^{-1} =
\begin{bmatrix}
\sigma^2(X'X)^{-1} & 0 \\
0 & 2\sigma^4/T
\end{bmatrix}.
$$
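A quick numerical check of these first order conditions on simulated data. The numerical optimizer and the exp-parameterization of $\sigma^2$ are implementation choices of this sketch, not part of the notes.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
T, k = 200, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
beta_true, sigma_true = np.array([1.0, -0.5, 0.3]), 1.5
y = X @ beta_true + sigma_true * rng.normal(size=T)

def neg_loglik(theta):
    beta, log_s2 = theta[:k], theta[k]          # parameterize sigma^2 = exp(log_s2) > 0
    s2 = np.exp(log_s2)
    e = y - X @ beta
    return 0.5 * (T * np.log(2 * np.pi) + T * log_s2 + e @ e / s2)

res = minimize(neg_loglik, x0=np.zeros(k + 1), method="BFGS")
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(res.x[:k], beta_ols)                      # numerical MLE of beta coincides with OLS
print(np.exp(res.x[k]), ((y - X @ beta_ols) @ (y - X @ beta_ols)) / T)   # sigma2_ML = e'e/T
```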

11.2 Logistic regression


[Example from Efron, "Computer Age Statistical Inference"] An experimental new anti-cancer drug called Xilathon is under development. Before human testing can begin, animal studies are needed to determine safe dosages. To this end, a bioassay or dose-response experiment was carried out: $N = 11$ groups of $n = 10$ mice each were injected with increasing amounts of Xilathon. Let
$$
y_i = \text{number of mice dying in group } i .
$$
The counts are modelled with a binomial:
$$
y_i \sim Bi(n_i = 10, \pi_i), \qquad i = 1, \ldots, N = 11,
$$
where $\pi_i$ is the probability of death in group $i$, which is $prob(\text{death} \mid x_i)$. The observed sample proportion is
$$
p_i = y_i/n_i = y_i/10,
$$
which is the direct evidence for $\pi_i$ ($\pi_i = E[p_i \mid X]$). By plotting the $p_i$ we can see that a linear regression such as
$$
p_i = \beta_0 + \beta_1 x_i + \varepsilon_i
$$
is not ideal. Indeed we have that $E[p_i \mid x_i] = \beta_0 + \beta_1 x_i$, which is not necessarily a probability. Moreover, the plot of the $p_i$ suggests a nonlinear relationship. Instead, for the conditional mean use a non-linear function with desirable properties. In particular
$$
p_i = \Lambda(X_i'\beta) + \varepsilon_i,
$$
where
$$
\Lambda(X_i'\beta) = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}}
$$
is a good candidate, as it is between 0 and 1 and allows nonlinear behaviours. In this case we have
$$
\pi_i = E[p_i \mid X] = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}}
\qquad\text{and}\qquad
1 - \pi_i = 1 - E[p_i \mid X] = \frac{1}{1 + e^{\beta_0 + \beta_1 x_i}}.
$$
Taking logs of the ratio,
$$
\lambda_i = \ln\frac{\pi_i}{1 - \pi_i} = \ln e^{\beta_0 + \beta_1 x_i} = \beta_0 + \beta_1 x_i,
$$
shows that this is a model in which the log of the ratio between the probabilities $\pi_i$ and $1 - \pi_i$ is a linear function of $x_i$. The coefficient $\lambda_i$ is the logit parameter.

The likelihood is:
$$
\Pr(y_1, y_2, \ldots, y_N \mid X)
= \prod_{i=1}^{N}\binom{n_i}{y_i}(1 - \pi_i)^{n_i - y_i}\,\pi_i^{y_i},
$$
and if we recall that
$$
\pi_i = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} = \Lambda(X_i'\beta)
$$
we can write, in logs:
$$
\ln L \propto \sum_{i=1}^{N}(n_i - y_i)\ln\!\big(1 - \Lambda(X_i'\beta)\big) + y_i\ln\Lambda(X_i'\beta).
$$
The first order condition is
$$
\frac{\partial\ln L}{\partial\beta}
= \sum_{i=1}^{N}\left[-(n_i - y_i)\frac{\Lambda'(X_i'\beta)}{1 - \Lambda(X_i'\beta)} + y_i\frac{\Lambda'(X_i'\beta)}{\Lambda(X_i'\beta)}\right]X_i = 0.
$$
Note that
$$
\Lambda'(X_i'\beta) = \frac{e^{\beta_0 + \beta_1 x_i}}{(1 + e^{\beta_0 + \beta_1 x_i})^2} = \Lambda(X_i'\beta)\big(1 - \Lambda(X_i'\beta)\big),
$$
so we have
$$
\frac{\partial\ln L}{\partial\beta}
= \sum_{i=1}^{N}\left[-(n_i - y_i)\Lambda(X_i'\beta) + y_i\big(1 - \Lambda(X_i'\beta)\big)\right]X_i
= \sum_{i=1}^{N}\left[-n_i\Lambda(X_i'\beta) + y_i\Lambda(X_i'\beta) + y_i - y_i\Lambda(X_i'\beta)\right]X_i
= \sum_{i=1}^{N}\left[y_i - n_i\Lambda(X_i'\beta)\right]X_i = 0,
$$
which is a set of orthogonality conditions based on the logistic function $\Lambda(X_i'\beta)$ as opposed to a linear specification $X_i'\beta$. The second derivative is:
$$
H = \frac{\partial^2\ln L}{\partial\beta\,\partial\beta'}
= -\sum_{i=1}^{N} n_i\Lambda'(X_i'\beta)X_iX_i'
= -\sum_{i=1}^{N} n_i\Lambda(X_i'\beta)\big(1 - \Lambda(X_i'\beta)\big)X_iX_i',
$$
which is always negative definite. The simpler case in which $n_i = 1$ is the "discrete choice model".

The solution is also the one that minimizes the deviance:
$$
D(\hat\pi_i, p_i) = 2n_i\left[p_i\ln\frac{p_i}{\hat\pi_i} + (1 - p_i)\ln\frac{(1 - p_i)}{(1 - \hat\pi_i)}\right],
$$
where the term in the square brackets is the expectation, taken using the probabilities $p$, of the logarithmic difference between the probabilities $p$ and $\hat\pi$. Deviance is analogous to squared error in ordinary regression theory, and can be seen as its generalization. It is twice the "Kullback-Leibler divergence", the preferred name in the information-theory literature.⁹ When $p_i = \hat\pi_i$ the deviance is 0.

⁹ The KL divergence is the expected value of a log-likelihood ratio if the data are actually drawn from the distribution at the numerator of the ratio, i.e. under the null hypothesis for which that ratio is computed.
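The score and Hessian above translate directly into a Newton-Raphson algorithm. A minimal sketch follows; the dose levels and simulated death counts are made up for illustration and are not Efron's data.

```python
import numpy as np

def logistic_mle(y, n, X, iters=25):
    """Newton-Raphson for grouped logistic regression.
    Score:   sum_i (y_i - n_i * Lambda_i) x_i
    Hessian: -sum_i n_i * Lambda_i * (1 - Lambda_i) x_i x_i'"""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        lam = 1.0 / (1.0 + np.exp(-X @ beta))
        score = X.T @ (y - n * lam)
        hess = -(X * (n * lam * (1 - lam))[:, None]).T @ X
        beta = beta - np.linalg.solve(hess, score)
    return beta, np.linalg.inv(-hess)            # inverse information ~ asymptotic variance

# illustrative bioassay-style data (made up): 11 doses, 10 mice per group
x = np.linspace(-2.5, 2.5, 11)
X = np.column_stack([np.ones_like(x), x])
n = np.full(11, 10.0)
rng = np.random.default_rng(3)
true_pi = 1.0 / (1.0 + np.exp(-(0.2 + 1.1 * x)))
y = rng.binomial(10, true_pi).astype(float)
beta_hat, V = logistic_mle(y, n, X)
print(beta_hat, np.sqrt(np.diag(V)))
```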

11.3 ML estimation of GLRM


We no longer necessarily have iid data. Linear Model:
$$
y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2\Omega).
$$
$$
\ln L = -\frac{T}{2}\ln 2\pi - \frac{1}{2}\ln|\sigma^2\Omega| - \frac{1}{2\sigma^2}(y - X\beta)'\Omega^{-1}(y - X\beta)
= -\frac{T}{2}\ln 2\pi - \frac{T}{2}\ln\sigma^2 - \frac{1}{2}\ln|\Omega| - \frac{1}{2\sigma^2}(y - X\beta)'\Omega^{-1}(y - X\beta).
$$
Here we assume $\Omega$ is known, hence we can just focus on:
$$
\ln L \propto -\frac{T}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)'\Omega^{-1}(y - X\beta).
$$
The first order conditions $s(\theta; y) = \frac{\partial l}{\partial\theta} = 0$ give
$$
\frac{\partial l}{\partial\beta} = -\frac{1}{\sigma^2}(-X'\Omega^{-1}y + X'\Omega^{-1}X\beta) = 0, \qquad
\frac{\partial l}{\partial\sigma^2} = -\frac{T}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)'\Omega^{-1}(y - X\beta) = 0,
$$
hence
$$
\hat\beta_{ML} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y = \hat\beta_{GLS}, \qquad
\hat\sigma^2_{ML} = \frac{1}{T}(y - X\hat\beta_{ML})'\Omega^{-1}(y - X\hat\beta_{ML}) = \frac{1}{T}\hat\varepsilon'\Omega^{-1}\hat\varepsilon .
$$
This assumed $\Omega$ known. If $\Omega$ is unknown, we need to consider $\Omega(\theta)$ and maximize with respect to $\theta$ as well.

11.3.1 ARCH models


Consider a Gaussian ARCH(m) process:
$$
y_t = \beta' x_t + \epsilon_t, \qquad \epsilon_t = h_t\varepsilon_t, \qquad \varepsilon_t \sim N(0, 1),
$$
$$
h_t^2 = \alpha_0 + \alpha_1\epsilon_{t-1}^2 + \ldots + \alpha_m\epsilon_{t-m}^2 ;
$$
note $h_t^2$ is a deterministic function of $\epsilon_{t-1}^2, \ldots, \epsilon_{t-m}^2$. Conditions on $\alpha_0$ and $\alpha_1$ are needed ($\alpha_0 > 0$, $\alpha_1 \geq 0$ and $\alpha_1 < 1$ at the least; more for the existence of higher moments). The conditional expectation is:
$$
E[\epsilon_t \mid I_{t-1}] = E[h_t\varepsilon_t \mid I_{t-1}] = h_t E[\varepsilon_t \mid I_{t-1}] = 0.
$$
The conditional variance is:
$$
V[\epsilon_t \mid I_{t-1}] = E[\epsilon_t^2 \mid I_{t-1}] - E[\epsilon_t \mid I_{t-1}]^2 = E[h_t^2\varepsilon_t^2 \mid I_{t-1}] = h_t^2 E[\varepsilon_t^2 \mid I_{t-1}] = h_t^2 .
$$
It follows that:
$$
\epsilon_t \mid I_{t-1} \sim N(0, h_t^2).
$$
Note that the unconditional distribution of $\epsilon_t$ is not normal. It is a mixture of normal distributions with different (and random) variances. It can be shown that the resulting unconditional distribution has fatter tails than a normal distribution:
$$
\kappa(\epsilon_t) = \frac{E[\epsilon_t^4]}{E[\epsilon_t^2]^2} = \frac{E[h_t^4\varepsilon_t^4]}{E[h_t^2\varepsilon_t^2]^2} = \frac{E[h_t^4]E[\varepsilon_t^4]}{E[h_t^2]^2E[\varepsilon_t^2]^2} = 3\frac{E[h_t^4]}{E[h_t^2]^2} \geq 3,
$$
where the last inequality follows from Jensen's inequality $E[f(x)] \geq f(E[x])$ with $f(x) = x^2$. Therefore the ARCH model is consistent with the fact that returns are not normal. It is also consistent with the volatility clustering phenomenon. To see this, consider the forecast error:
$$
\epsilon_t^2 - E[\epsilon_t^2 \mid I_{t-1}] = \epsilon_t^2 - h_t^2 = \eta_t ,
$$
which implies (in the ARCH(1) case)
$$
\epsilon_t^2 = h_t^2 + \eta_t = \alpha_0 + \alpha_1\epsilon_{t-1}^2 + \eta_t ,
$$
which means $\epsilon_t^2$ follows an AR(1).

As we said, the conditional distribution of $\epsilon_t$ is $\epsilon_t \mid I_{t-1} \sim N(0, h_t^2)$, with pdf:
$$
f(\epsilon_t \mid I_{t-1}) \propto \frac{1}{\sqrt{h_t^2}}\exp\left(-\frac{\epsilon_t^2}{2h_t^2}\right).
$$
The conditional density of $y_t$ is therefore:
$$
f(y_t \mid I_{t-1}; \beta, \alpha) \propto
\frac{\exp\left(-\dfrac{(y_t - \beta'x_t)^2}{2\big(\alpha_0 + \alpha_1(y_{t-1} - \beta'x_{t-1})^2 + \ldots + \alpha_m(y_{t-m} - \beta'x_{t-m})^2\big)}\right)}
{\sqrt{\alpha_0 + \alpha_1(y_{t-1} - \beta'x_{t-1})^2 + \ldots + \alpha_m(y_{t-m} - \beta'x_{t-m})^2}}
$$
(the Jacobian is 1). Here $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_m)$ is the vector of ARCH coefficients. Now consider a sample of $T$ observations. The likelihood (conditional on the first $m$ observations) is:
$$
L \propto f(y_T, \ldots, y_{m+1} \mid \beta, \alpha; y_m, \ldots, y_1)
= f(y_T \mid I_{T-1})\,f(y_{T-1} \mid I_{T-2})\cdots f(y_{m+1} \mid I_m)
= \prod_{t=m+1}^{T}
\frac{\exp\left(-\dfrac{(y_t - \beta'x_t)^2}{2\big(\alpha_0 + \alpha_1(y_{t-1} - \beta'x_{t-1})^2 + \ldots + \alpha_m(y_{t-m} - \beta'x_{t-m})^2\big)}\right)}
{\sqrt{\alpha_0 + \alpha_1(y_{t-1} - \beta'x_{t-1})^2 + \ldots + \alpha_m(y_{t-m} - \beta'x_{t-m})^2}} .
$$
The function $-\log L$ can be minimized with respect to $\beta$ and $\alpha$. Remember there are restrictions on $\alpha$, so it is a constrained estimation. Engle (1982) shows that the information matrix is block diagonal, so that the ML estimators of $\beta$ and $\alpha$ are independent: each of $\beta$ and $\alpha$ can be estimated with full efficiency based only on a consistent estimate of the other.
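A sketch of the conditional (negative) log-likelihood for the ARCH(1) case with a regression mean, plus a small simulation. The optimizer, the starting values and the parameter values in the simulation are arbitrary choices of this sketch.

```python
import numpy as np
from scipy.optimize import minimize

def arch1_negloglik(params, y, x):
    """Negative conditional log-likelihood of y_t = beta'x_t + e_t, e_t|I_{t-1} ~ N(0, h_t^2),
    h_t^2 = a0 + a1*e_{t-1}^2, conditioning on the first observation."""
    k = x.shape[1]
    beta, a0, a1 = params[:k], params[k], params[k + 1]
    if a0 <= 0 or a1 < 0:                    # parameter restrictions mentioned in the text
        return np.inf
    e = y - x @ beta
    h2 = a0 + a1 * e[:-1] ** 2               # h_t^2 for t = 2, ..., T
    return 0.5 * np.sum(np.log(h2) + e[1:] ** 2 / h2)

# usage sketch: simulate an ARCH(1) with an intercept-only mean and estimate it
rng = np.random.default_rng(4)
T, a0_true, a1_true = 2000, 0.2, 0.5
e = np.zeros(T)
for t in range(1, T):
    e[t] = np.sqrt(a0_true + a1_true * e[t - 1] ** 2) * rng.normal()
y = 0.1 + e
x = np.ones((T, 1))

res = minimize(arch1_negloglik, x0=np.array([0.0, 0.1, 0.1]), args=(y, x), method="Nelder-Mead")
print(res.x)    # should be roughly [0.1, 0.2, 0.5]
```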

11.3.2 ARMA models


Consider the ARMA(2,1):
$$
Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + u_t + \vartheta u_{t-1}, \qquad u_t \sim \text{i.i.d. } N(0, \sigma_u^2).
$$
One could write down the likelihood for the entire sample and maximize, as we did in the ARCH example. There is a quicker way to evaluate the likelihood using:

1. the prediction error decomposition, and

2. a recursive algorithm known as the Kalman filter.

Prediction error decomposition   To understand what the PED is, consider the simpler model:
$$
Y_t = \phi_1 Y_{t-1} + u_t, \qquad u_t \sim \text{i.i.d. } N(0, \sigma_u^2),
$$
and the forecast:
$$
Y_{t|t-1} \equiv E[Y_t \mid I_{t-1}] = \phi_1 Y_{t-1}, \qquad
\Sigma_{t|t-1} \equiv Var[Y_t \mid I_{t-1}] = \sigma_u^2 .
$$
This forecast has forecast error:
$$
v_{t|t-1} \equiv Y_t - E[Y_t \mid I_{t-1}] = u_t ,
$$
which has variance $Var[v_{t|t-1}] = \sigma_u^2 = \Sigma_{t|t-1}$, i.e. the same as $Var[Y_t \mid I_{t-1}]$. The likelihood for observation $t$ can be written as:
$$
l(Y_t \mid Y_{1:t-1}; \theta) \propto -\ln|\Sigma_{t|t-1}(\theta)| - v_{t|t-1}(\theta)'\,\Sigma_{t|t-1}(\theta)^{-1}\,v_{t|t-1}(\theta).
$$
For more complex models, e.g. the ARMA model, computing the PED moments $\Sigma_{t|t-1}$ and $v_{t|t-1}$ can easily be done using a prediction-update algorithm known as the Kalman filter.

11.4 State space models (linear and Gaussian)


The linear Gaussian SSM consists of two equations:
$$
\text{Space (measurement, observation):}\quad Y_t = \Lambda s_t + \varepsilon_t,
$$
$$
\text{State (transition):}\quad s_t = F s_{t-1} + \eta_t,
$$
$$
\begin{bmatrix} \varepsilon_t \\ \eta_t \end{bmatrix} \sim \text{iid } N\!\left(0, \begin{bmatrix} \Sigma_\varepsilon & 0 \\ 0 & \Sigma_\eta \end{bmatrix}\right).
$$

• $Y_t$ is a vector of $N$ observed variables, while $s_t$ is a vector of $k$ unobserved states.

• $\Lambda$ and $F$ are $N \times k$ and $k \times k$ coefficient matrices.

• Intercepts or additional exogenous regressors in both equations are omitted but can be introduced easily.

• Similarly, time variation in the coefficient matrices $\Lambda$ and $F$ can be allowed.

The Kalman Filter - Learning about states from data

• An algorithm that produces and updates linear projections of the latent variable $s_t$ given observations of $Y_t$.

• Useful in its own right, and also employed in ML estimation.

• Define:
$$
s_{t|s} := E[s_t \mid Y_{1:s}], \qquad P_{t|s} := Var[s_t \mid Y_{1:s}].
$$

We then wish to do:

• Filtering: $s_{t|t}$ and $P_{t|t}$, $t = 1, \ldots, T$.

• Smoothing: $s_{t|T}$ and $P_{t|T}$, $t = 1, \ldots, T$.

Derivation of the Kalman Filter

• Assume you start by knowing $s_{t-1} \sim N(s_{t-1|t-1}, P_{t-1|t-1})$. The filter is a rule to update to $s_t \sim N(s_{t|t}, P_{t|t})$ once we observe data $Y_t$.

• The first step is to find the joint distribution of states and data in $t$, conditional on past observations $(1, \ldots, t-1)$:
$$
\begin{bmatrix} s_t \\ Y_t \end{bmatrix}\Big| I_{t-1}
\sim N\!\left(
\begin{bmatrix} s_{t|t-1} \\ Y_{t|t-1} \end{bmatrix},
\begin{bmatrix} P_{t|t-1} & C_{t|t-1}' \\ C_{t|t-1} & \Sigma_{t|t-1} \end{bmatrix}
\right).
\tag{25}
$$
The moments above can be calculated easily using the equations of the system, $Y_t = \Lambda s_t + \varepsilon_t$, $s_t = F s_{t-1} + \eta_t$:
$$
\begin{aligned}
s_{t|t-1} &= E[s_t \mid Y_{1:t-1}] = F\,E[s_{t-1} \mid Y_{1:t-1}] + E[\eta_t \mid Y_{1:t-1}] = F s_{t-1|t-1}, \\
Y_{t|t-1} &= E[Y_t \mid Y_{1:t-1}] = \Lambda\,E[s_t \mid Y_{1:t-1}] + E[\varepsilon_t \mid Y_{1:t-1}] = \Lambda s_{t|t-1}, \\
P_{t|t-1} &= Var[s_t \mid Y_{1:t-1}] = F P_{t-1|t-1}F' + \Sigma_\eta, \\
\Sigma_{t|t-1} &= Var[Y_t \mid Y_{1:t-1}] = \Lambda P_{t|t-1}\Lambda' + \Sigma_\varepsilon, \\
C_{t|t-1} &= Cov(Y_t, s_t \mid Y_{1:t-1}) = \Lambda P_{t|t-1}.
\end{aligned}
\tag{26}
$$

Conditional and joint normals

• We have now specified the distribution $(s_t, Y_t \mid Y_{1:t-1})$; we now look for the distribution $(s_t \mid Y_t, Y_{1:t-1}) = (s_t \mid Y_{1:t})$.

• This is easy to do using basic results regarding Normal distributions. Let $(a, b)$ be normally distributed,
$$
\begin{bmatrix} a \\ b \end{bmatrix} \sim N\!\left(
\begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix},
\begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix}
\right).
\tag{27}
$$
Then the conditional distribution of $a$ given $b$ is
$$
a \mid b \sim N(\mu_{a|b}, \Sigma_{a|b}),
\tag{28}
$$
where
$$
\mu_{a|b} = \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(b - \mu_b), \qquad
\Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}.
$$

Updating a linear projection

• Set (27) to be the joint distribution in (25):
$$
\begin{bmatrix} a = s_t \\ b = Y_t \end{bmatrix}\Big| Y_{1:t-1}
\sim N\!\left(
\begin{bmatrix} \mu_a = s_{t|t-1} \\ \mu_b = Y_{t|t-1} \end{bmatrix},
\begin{bmatrix} \Sigma_{aa} = P_{t|t-1} & \Sigma_{ab} = C_{t|t-1}' \\ \Sigma_{ba} = C_{t|t-1} & \Sigma_{bb} = \Sigma_{t|t-1} \end{bmatrix}
\right).
$$
Applying the result (28):
$$
s_t \mid Y_{1:t-1}, Y_t \sim N\!\left(\mu_{a|b} = s_{t|t},\; \Sigma_{a|b} = P_{t|t}\right)
\tag{29}
$$
with
$$
s_{t|t} = s_{t|t-1} + C_{t|t-1}'\,\Sigma_{t|t-1}^{-1}\,(Y_t - Y_{t|t-1}),
\tag{30}
$$
$$
P_{t|t} = P_{t|t-1} - C_{t|t-1}'\,\Sigma_{t|t-1}^{-1}\,C_{t|t-1}.
\tag{31}
$$
So we have moved from $(s_{t-1} \mid Y_{1:t-1})$ to $(s_t, Y_t \mid Y_{1:t-1})$ (prediction) and then from $(s_t, Y_t \mid Y_{1:t-1})$ to $(s_t \mid Y_{1:t})$ (update).

We can now repeat and use $(s_t \mid Y_{1:t})$ to move forward to $(s_{t+1} \mid Y_{1:t+1})$.

Using $C_{t|t-1} = \Lambda P_{t|t-1}$ (see (26)) the updating equations can be re-written as:
$$
s_{t|t} = s_{t|t-1} + K_{t|t-1}v_{t|t-1},
\tag{32}
$$
$$
P_{t|t} = P_{t|t-1} - K_{t|t-1}\Lambda P_{t|t-1},
\tag{33}
$$
with
$$
K_{t|t-1} = P_{t|t-1}\Lambda'\,\Sigma_{t|t-1}^{-1}
\tag{34}
$$
denoting the Kalman gain, and $v_{t|t-1} = Y_t - Y_{t|t-1}$ denoting the 1-step ahead prediction error.
The Kalman Filter recursions   The algorithm works as follows:

0) Start with an initial condition $(s_{t-1} \mid Y_{1:t-1}) \sim N(s_{t-1|t-1}, P_{t-1|t-1})$.

1) Use the prediction equations to find the moments of $(s_t, Y_t \mid Y_{1:t-1})$:
$$
\begin{aligned}
s_{t|t-1} &= F s_{t-1|t-1} && \text{(state prediction)} \\
P_{t|t-1} &= F P_{t-1|t-1}F' + \Sigma_\eta && \text{(variance of state prediction)} \\
Y_{t|t-1} &= \Lambda s_{t|t-1} && \text{(measurement prediction)} \\
\Sigma_{t|t-1} &= \Lambda P_{t|t-1}\Lambda' + \Sigma_\varepsilon && \text{(variance of measurement prediction)}
\end{aligned}
$$

1.5) Compute the Kalman gain:
$$
K_{t|t-1} = P_{t|t-1}\Lambda'\,\Sigma_{t|t-1}^{-1}.
$$

2) Observe the prediction error $v_{t|t-1} = Y_t - Y_{t|t-1}$ and update to find $(s_t \mid Y_{1:t}) \sim N(s_{t|t}, P_{t|t})$:
$$
s_{t|t} = s_{t|t-1} + K_{t|t-1}v_{t|t-1}, \qquad
P_{t|t} = P_{t|t-1} - K_{t|t-1}\Lambda P_{t|t-1}.
$$

Likelihood via prediction error decomposition

• As a by-product, the algorithm will provide the time series of $v_{t|t-1}$ and $\Sigma_{t|t-1}$ for $t = 1, \ldots, T$.

• So at each $t = 1, \ldots, T$ we can compute and store:
$$
l(Y_t \mid Y_{1:t-1}; \theta) \propto -\ln|\Sigma_{t|t-1}(\theta)| - v_{t|t-1}(\theta)'\,\Sigma_{t|t-1}(\theta)^{-1}\,v_{t|t-1}(\theta),
$$
where
$$
\theta = f(\Lambda, F, \Sigma_\varepsilon, \Sigma_\eta).
$$

• The sum of the likelihoods of the forecast errors, $\sum_t l_t(\theta)$, provides the likelihood of the whole system.

• Therefore the KF offers a fast way to evaluate the likelihood of a SS model.

The ARMA example
$$
Y_t = \begin{bmatrix} 1 & \vartheta \end{bmatrix}\begin{bmatrix} s_{1t} \\ s_{2t} \end{bmatrix} = s_{1t} + \vartheta s_{2t},
\qquad
\begin{bmatrix} s_{1t} \\ s_{2t} \end{bmatrix}
= \begin{bmatrix} \phi_1 & \phi_2 \\ 1 & 0 \end{bmatrix}
\begin{bmatrix} s_{1t-1} \\ s_{2t-1} \end{bmatrix}
+ \begin{bmatrix} u_t \\ 0 \end{bmatrix},
$$
with $\Sigma_\varepsilon = 0$, $\Sigma_\eta = \begin{bmatrix} \sigma_u^2 & 0 \\ 0 & 0 \end{bmatrix}$. Indeed, since $s_{2t} = s_{1t-1}$,
$$
\begin{aligned}
Y_t &= \phi_1 s_{1t-1} + \phi_2 s_{2t-1} + u_t + \vartheta(\phi_1 s_{1t-2} + \phi_2 s_{2t-2} + u_{t-1}) \\
&= \phi_1(s_{1t-1} + \vartheta s_{1t-2}) + \phi_2(s_{2t-1} + \vartheta s_{2t-2}) + u_t + \vartheta u_{t-1} \\
&= \phi_1(s_{1t-1} + \vartheta s_{2t-1}) + \phi_2(s_{1t-2} + \vartheta s_{2t-2}) + u_t + \vartheta u_{t-1} \\
&= \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + u_t + \vartheta u_{t-1}.
\end{aligned}
$$
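A minimal implementation of these recursions for the ARMA(2,1) state space above, returning the log-likelihood via the prediction error decomposition. The loose initialization of $P_{0|0}$ is an assumption of this sketch, not something prescribed in the notes.

```python
import numpy as np

def kalman_loglik(y, phi1, phi2, theta, sigma2_u):
    """Kalman filter for the ARMA(2,1) state space:
    Y_t = [1 theta] s_t,  s_t = F s_{t-1} + eta_t,  Var(eta_t) = diag(sigma2_u, 0)."""
    Lam = np.array([[1.0, theta]])
    F = np.array([[phi1, phi2], [1.0, 0.0]])
    Q = np.diag([sigma2_u, 0.0])
    s = np.zeros(2)                      # s_{0|0}
    P = np.eye(2) * 10.0                 # P_{0|0}: loose initialization (an assumption of this sketch)
    ll = 0.0
    for yt in y:
        # prediction step
        s_pred = F @ s
        P_pred = F @ P @ F.T + Q
        y_pred = Lam @ s_pred
        S = Lam @ P_pred @ Lam.T         # Sigma_{t|t-1} (1x1 here, since Sigma_eps = 0)
        v = yt - y_pred                  # 1-step ahead prediction error
        # log-likelihood contribution (prediction error decomposition)
        ll += -0.5 * (np.log(2 * np.pi) + np.log(S[0, 0]) + (v[0] ** 2) / S[0, 0])
        # update step with Kalman gain K = P_pred Lam' S^{-1}
        K = P_pred @ Lam.T / S[0, 0]
        s = s_pred + (K * v[0]).ravel()
        P = P_pred - K @ Lam @ P_pred
    return ll
```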
12 James-Stein Estimators
12.1 James-Stein result
Let $\{z_t\}_{t=1}^T$ be a sequence of iid $k$-variate Gaussians with mean $\mu$ and variance $I_k$:
$$
z_t \mid \mu \sim N(\mu, I_k).
$$
As an estimator of the population mean, the sample mean
$$
\hat\mu^{ML} = \bar z_T = \frac{\sum_{t=1}^{T} z_t}{T}
$$
is the ML estimator and as such it is Minimum Variance Unbiased.

However, Stein (1956) showed that this is inadmissible with respect to the MSE loss whenever $k \geq 3$. (Inadmissible is decision theory jargon to say that it is strictly dominated.) Moreover, James and Stein (1961) proposed an alternative estimator which always improves over $\hat\mu^{ML}$ in terms of MSE (when $k \geq 3$):
$$
\hat\mu^{JS} = \left(1 - \frac{(k-2)}{\|\bar z_T\|^2}\right)\bar z_T ,
$$
where $\|\cdot\|$ is the Euclidean norm, hence
$$
\|\bar z_T\|^2 = \left(\sqrt{\bar z_1^2 + \bar z_2^2 + \ldots + \bar z_k^2}\right)^2 = \bar z_T'\bar z_T
$$
and we can re-write
$$
\hat\mu^{JS} = \left(1 - \frac{(k-2)}{\bar z_T'\bar z_T}\right)\bar z_T .
$$
Since $\frac{(k-2)}{\bar z_T'\bar z_T}$ is a positive quantity, $\hat\mu^{JS}$ is closer to 0 than $\hat\mu^{ML}$: shrinkage.

12.2 Application to CLRM
This result applies to the CLRM as well. In that case we have that the $k$-variate Gaussian is:
$$
\hat\beta \mid \beta \sim N(\beta, \sigma^2(X'X)^{-1}).
$$
Standardizing gives
$$
\underbrace{\frac{1}{\sigma}(X'X)^{1/2}\hat\beta}_{z_t}\;\Big|\;\beta \sim N\Big(\underbrace{\frac{1}{\sigma}(X'X)^{1/2}\beta}_{\mu},\; I_k\Big).
$$
Here we only have one realization of $\hat\beta$, so $T = 1$ and:
$$
\hat\mu^{ML} = \bar z_T = \frac{\sum_{t=1}^{1} z_t}{1} = z_1 = \frac{1}{\sigma}(X'X)^{1/2}\hat\beta,
\qquad
\bar z_T'\bar z_T = \frac{1}{\sigma^2}\hat\beta'X'X\hat\beta,
$$
and the JS estimator of $\mu$ is:
$$
\hat\mu^{JS} = \left(1 - \frac{(k-2)\sigma^2}{\hat\beta'X'X\hat\beta}\right)\underbrace{\frac{1}{\sigma}(X'X)^{1/2}\hat\beta}_{\hat\mu^{ML}}.
$$
To go back to the representation in terms of $\beta$, just undo the standardization:
$$
\hat\beta^{JS} = \sigma(X'X)^{-1/2}\left(1 - \frac{(k-2)\sigma^2}{\hat\beta'X'X\hat\beta}\right)\frac{1}{\sigma}(X'X)^{1/2}\hat\beta,
$$
that is
$$
\hat\beta^{JS} = \left(1 - \frac{(k-2)\sigma^2}{\hat\beta'X'X\hat\beta}\right)\hat\beta .
$$
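A small Monte Carlo sketch of the James-Stein result; $k$, the true mean vector and the number of replications are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
k, reps = 10, 20_000
mu = np.full(k, 0.5)                         # illustrative true mean vector

z = rng.normal(loc=mu, size=(reps, k))       # one observation per replication (T = 1)
norm2 = np.sum(z ** 2, axis=1, keepdims=True)
js = (1 - (k - 2) / norm2) * z               # James-Stein shrinkage towards 0

mse_ml = np.mean(np.sum((z - mu) ** 2, axis=1))
mse_js = np.mean(np.sum((js - mu) ** 2, axis=1))
print(f"MSE of ML: {mse_ml:.3f}  MSE of JS: {mse_js:.3f}")   # JS is smaller for k >= 3
```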

12.3 Bayesian interpretation


This result was introduced in frequentist terms. However, the JS estimator has a straightforward Bayesian interpretation. Suppose we wish to estimate the coefficient vector
$$
\mu_1, \mu_2, \ldots, \mu_k ,
$$
on which we have the prior
$$
\mu_i \sim N(M_0, V_0),
$$
by using a single observation vector $x$:
$$
x_1, x_2, \ldots, x_k ,
$$
with likelihood
$$
x_i \mid \mu_i \sim N(\mu_i, 1).
$$
The MLE is
$$
\hat\mu_i^{ML} = x_i .
$$
The Bayesian posterior mean is
$$
\hat\mu_i^{Bayes}
= (V_0^{-1} + 1)^{-1}(V_0^{-1}M_0 + x_i)
= \left(\frac{1 + V_0}{V_0}\right)^{-1}(V_0^{-1}M_0 + x_i)
= (V_0 + 1)^{-1}(M_0 + V_0 x_i)
= M_0 + \frac{V_0}{V_0 + 1}(x_i - M_0).
\tag{35}
$$
It turns out that the JS estimator is the plug-in version of (35),
$$
\hat\mu_i^{JS} = \hat M_0 + \widehat{\frac{V_0}{V_0 + 1}}\,(x_i - \hat M_0),
$$
with
$$
\hat M_0 = \bar x, \qquad
\widehat{\frac{V_0}{V_0 + 1}} = 1 - \frac{k-3}{S}, \qquad
S = \sum_{i=1}^{k}(x_i - \bar x)^2 .
$$
Substituting gives:
$$
\hat\mu_i^{JS} = \bar x + \left(1 - \frac{k-3}{S}\right)(x_i - \bar x).
\tag{36}
$$
This is Lindley's (1962) version of the JS estimator, which shrinks towards $\bar x$ rather than towards 0. The standard case is immediately obtained by setting $M_0 = \hat M_0 = 0$. In this case, the MSE benefit comes from the fact that we are using the other observations to form a point towards which to shrink the estimates. Note that we are using the data to compute an estimate of the prior mean, $\hat M_0 = \bar x$! This is called "Empirical Bayes".
13 Bayesian treatment of the linear regression model
13.1 Introduction
Start with:
$$
y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_T)
$$
and get
$$
\hat\beta = (X'X)^{-1}X'Y .
$$
Add data:
$$
\begin{bmatrix} Y \\ Y_1 \end{bmatrix}
= \begin{bmatrix} X \\ X_1 \end{bmatrix}\beta
+ \begin{bmatrix} \varepsilon \\ \varepsilon_1 \end{bmatrix},
\qquad
\begin{bmatrix} \varepsilon \\ \varepsilon_1 \end{bmatrix} \sim N(0, \sigma^2 I_{T+T_1}),
$$
and get
$$
\hat\beta = \left(\begin{bmatrix} X \\ X_1 \end{bmatrix}'\begin{bmatrix} X \\ X_1 \end{bmatrix}\right)^{-1}
\begin{bmatrix} X \\ X_1 \end{bmatrix}'\begin{bmatrix} Y \\ Y_1 \end{bmatrix}
= (X'X + X_1'X_1)^{-1}(X'Y + X_1'Y_1).
$$

Bayesian approach

1. The researcher starts with a prior belief about the coefficient $\beta$. The prior belief is in the form of a distribution $p(\beta)$:
$$
\beta \sim N(\beta_0, \Sigma_0).
$$

2. Collect data and write down the likelihood function as before, $p(Y \mid \beta)$.

3. Update your prior belief on the basis of the information in the data. Combine the prior distribution $p(\beta)$ and the likelihood function $p(Y \mid \beta)$ to obtain the posterior distribution $p(\beta \mid Y)$.

Key identities

These three steps come from Bayes' theorem:
$$
p(\beta \mid Y) = \frac{p(Y \mid \beta)\,p(\beta)}{p(Y)} \propto p(Y \mid \beta)\,p(\beta).
$$
Useful identities:
$$
\underbrace{p(Y, \beta)}_{\text{joint data density}}
= p(Y)\,\underbrace{p(\beta \mid Y)}_{\text{posterior}}
= \underbrace{p(Y \mid \beta)}_{\text{likelihood}}\,\underbrace{p(\beta)}_{\text{prior}} .
$$
$p(Y)$ is the data density (also known as the marginal likelihood). It is the constant of integration of the posterior:
$$
\int p(\beta \mid Y)\,d\beta = \frac{1}{p(Y)}\int \underbrace{p(Y \mid \beta)\,p(\beta)}_{\text{posterior kernel}}\,d\beta = 1,
$$
therefore it is not needed if we are only interested in the posterior kernel. The posterior kernel is sufficient to compute e.g. the mean and variance of $p(\beta \mid Y)$:
$$
p(\beta \mid Y) \propto p(Y \mid \beta)\,p(\beta).
$$

Derivation of the posterior   We assume that $\sigma^2$ is known; recall that $k$ is the number of regressors. Set the prior distribution $\beta \sim N(\beta_0, \Sigma_0)$:
$$
p(\beta) = (2\pi)^{-\frac{k}{2}}|\Sigma_0|^{-\frac12}\exp\{-0.5(\beta - \beta_0)'\Sigma_0^{-1}(\beta - \beta_0)\}
\propto \exp\{-0.5(\beta - \beta_0)'\Sigma_0^{-1}(\beta - \beta_0)\}.
$$
Obtain data and form the likelihood function:
$$
p(Y \mid \beta) = (2\pi)^{-\frac{T}{2}}|\sigma^2 I_T|^{-\frac12}\exp[-0.5(Y - X\beta)'(\sigma^2 I_T)^{-1}(Y - X\beta)]
\propto \exp[-0.5(Y - X\beta)'(Y - X\beta)/\sigma^2].
$$
Obtain the posterior kernel:
$$
p(\beta \mid Y) \propto p(Y \mid \beta)\,p(\beta)
\propto \exp[-0.5(Y - X\beta)'(Y - X\beta)/\sigma^2]\exp[-0.5(\beta - \beta_0)'\Sigma_0^{-1}(\beta - \beta_0)]
$$
$$
= \exp\!\big[-0.5\{(Y - X\beta)'(Y - X\beta)/\sigma^2 + (\beta - \beta_0)'\Sigma_0^{-1}(\beta - \beta_0)\}\big].
\tag{37}
$$
One can show that
$$
p(\beta \mid Y) \propto \exp[-0.5(\beta - \beta_1)'\Sigma_1^{-1}(\beta - \beta_1)],
$$
which is the kernel of a normal distribution with moments
$$
\Sigma_1 = \left(\Sigma_0^{-1} + \frac{1}{\sigma^2}X'X\right)^{-1},
\tag{38}
$$
$$
\beta_1 = \Sigma_1\left(\Sigma_0^{-1}\beta_0 + \frac{1}{\sigma^2}X'Y\right);
\tag{39}
$$
therefore we conclude:
$$
\beta \mid Y \sim N(\beta_1, \Sigma_1).
$$

Details   Completing the square in (37): apart from the factor $-0.5$ and terms not involving $\beta$, the exponent equals
$$
\beta'\Sigma_0^{-1}\beta - \beta'\Sigma_0^{-1}\beta_0 - \beta_0'\Sigma_0^{-1}\beta + \beta_0'\Sigma_0^{-1}\beta_0
+ \frac{1}{\sigma^2}Y'Y - \frac{1}{\sigma^2}Y'X\beta - \frac{1}{\sigma^2}\beta'X'Y + \frac{1}{\sigma^2}\beta'X'X\beta .
$$
Regrouping gives
$$
\beta'\underbrace{\left(\Sigma_0^{-1} + \frac{1}{\sigma^2}X'X\right)}_{\Sigma_1^{-1}}\beta
- \beta'\underbrace{\left(\Sigma_0^{-1}\beta_0 + \frac{1}{\sigma^2}X'Y\right)}_{\Sigma_1^{-1}\beta_1}
- \underbrace{\left(\beta_0'\Sigma_0^{-1} + \frac{1}{\sigma^2}Y'X\right)}_{\beta_1'\Sigma_1^{-1}}\beta
+ \beta_0'\Sigma_0^{-1}\beta_0 + \frac{1}{\sigma^2}Y'Y,
\tag{40}
$$
where the elements in braces are those in definitions (38) and (39). The first three terms can be rewritten as
$$
\beta'\Sigma_1^{-1}\beta - \beta'\Sigma_1^{-1}\beta_1 - \beta_1'\Sigma_1^{-1}\beta
= (\beta - \beta_1)'\Sigma_1^{-1}(\beta - \beta_1) - \beta_1'\Sigma_1^{-1}\beta_1 ,
$$
since expanding $(\beta - \beta_1)'\Sigma_1^{-1}(\beta - \beta_1)$ produces exactly those three terms plus $\beta_1'\Sigma_1^{-1}\beta_1$. The term $\beta_1'\Sigma_1^{-1}\beta_1$ does not involve $\beta$ and, together with $\beta_0'\Sigma_0^{-1}\beta_0 + \frac{1}{\sigma^2}Y'Y$ in (40), is absorbed into the constant of proportionality. Therefore
$$
p(\beta \mid Y) \propto \exp[-0.5(\beta - \beta_1)'\Sigma_1^{-1}(\beta - \beta_1)].
$$
[Figure: prior, likelihood, and posterior densities of β.]

Comparison with OLS (or ML)

• Note that, as $X'X\hat\beta = X'Y$, we have:
$$
\beta_1 = \left(\Sigma_0^{-1} + \frac{1}{\sigma^2}X'X\right)^{-1}\left(\Sigma_0^{-1}\beta_0 + \frac{1}{\sigma^2}X'X\hat\beta\right).
$$

• $\Sigma_0^{-1}$ and $\Sigma_0^{-1}\beta_0$ are the prior moments and can be interpreted as dummy observations / pre-sample observations.

• Without the prior, these moments are simply the OLS estimates. Without the data, these moments are simply the prior moments.

• The posterior mean is a weighted average of the prior and OLS. The weights are inversely proportional to the variances of the prior and of the data information.

• Setting $\Sigma_0 = \frac{\sigma^2}{\lambda}I_k$ and $\beta_0 = 0$ gives the Ridge regression:
$$
\beta_1 = (\lambda I_k + X'X)^{-1}X'Y .
$$
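A sketch of the posterior moments (38)-(39) with known $\sigma^2$, illustrating both the "weighted average" reading and the ridge limit; all numbers are illustrative.

```python
import numpy as np

def posterior_known_sigma(y, X, beta0, Sigma0, sigma2):
    """Posterior moments (38)-(39) for beta with a Normal prior and known sigma^2."""
    Sigma0_inv = np.linalg.inv(Sigma0)
    Sigma1 = np.linalg.inv(Sigma0_inv + X.T @ X / sigma2)
    beta1 = Sigma1 @ (Sigma0_inv @ beta0 + X.T @ y / sigma2)
    return beta1, Sigma1

# usage sketch: a loose prior reproduces OLS, a tight prior at 0 shrinks towards 0 (ridge-like)
rng = np.random.default_rng(6)
T, k = 100, 3
X = rng.normal(size=(T, k))
y = X @ np.array([1.0, 0.0, -1.0]) + rng.normal(size=T)
b_loose, _ = posterior_known_sigma(y, X, np.zeros(k), 1e6 * np.eye(k), 1.0)
b_tight, _ = posterior_known_sigma(y, X, np.zeros(k), 1e-2 * np.eye(k), 1.0)
print(b_loose)   # close to OLS
print(b_tight)   # heavily shrunk towards the prior mean 0
```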

The Likelihood Principle

• Consider tossing a drawing pin (Lindley and Phillips, 1976). Luigi says he tossed it 12 times and obtained:
$$
\{U, U, U, D, U, D, U, U, U, U, U, D\}.
$$

• You, as a statistician, are asked to give a 5% rejection region for the null that $U$ and $D$ are equally likely.

• Obtaining 9 $U$'s out of 12 suggests that the chance of its falling uppermost ($U$) exceeds 50%. The results that would even more strongly support this conclusion are:
$$
(10, 2),\; (11, 1),\; \text{and } (12, 0),
$$
so that, under the null hypothesis $\theta = 1/2$, the chance of the observed result, or more extreme, is:
$$
\left[\binom{12}{3} + \binom{12}{2} + \binom{12}{1} + \binom{12}{0}\right]\left(\frac12\right)^{12} = 7.5\% > 5\% .
$$

• Hence, you do NOT reject the null that $U$ and $D$ are equally likely (50%).

• However, now Luigi tells you: "but I didn't set out to throw the pin 12 times. My plan was to throw the pin until 3 $D$s appeared."

• Does this change your inference? Yes it does.

• Under the new scenario, the more extreme events would be:
$$
(10, 3),\; (11, 3),\; (12, 3),\; \ldots,
$$
while the events $(10, 2)$, $(11, 1)$, and $(12, 0)$ actually can NOT take place under this design.

• So the chance of the observed result, or more extreme, under the null hypothesis becomes:
$$
\sum_{x \geq 9}\binom{x+2}{2}\left(\frac12\right)^{x+3} = 3.25\% < 5\% .
$$

• Why is this happening? Because the two setups imply a different stopping rule (stop at 12 draws, or stop at 3 $D$ draws). This, more generally, alters the sample space.

• Things are even more problematic. Think if Luigi says "I just kept drawing the pin until lunch was served". How would you tackle this?

• Confidence intervals similarly demand consideration of the sample space. Indeed, so does every statistical technique, with the exception of maximum likelihood.

Lindley and Phillips (1976): Many people's intuition says this specification is irrelevant. Their argument might more formally be expressed by saying that the evidence is of 12 honestly reported tosses, 9 of which were $U$, 3 $D$. Furthermore, these were in a particular order, that reported above. Of what relevance are things that might have happened [e.g. no lunch], but did not?

Indeed, this helps us understand the LIKELIHOOD PRINCIPLE:

1. All the information about $\theta$ obtainable from an experiment is contained in the likelihood function for $\theta$ given the data. There is no information beyond that contained in the likelihood and the data.

2. If two likelihood functions for $\theta$ (from the same or different experiments) are proportional to one another, then they contain the same information about $\theta$.

By using only the likelihood, and nothing else from the experiment, the answer to the problem is the same regardless of the stopping rule. Indeed, let $x_1 = \#U$ in experiment 1 and $x_2 = \#U$ in experiment 2. In experiment 1 (E1) we have a binomial density:
$$
f_\theta^1(x_1) = \binom{12}{x_1}\theta^{x_1}(1-\theta)^{12-x_1}
\;\Longrightarrow\;
\ell_1(9) = \binom{12}{9}\theta^9(1-\theta)^3 .
$$
In experiment 2 (E2) we have a negative binomial density:
$$
f_\theta^2(x_2) = \binom{x_2 + 3 - 1}{x_2}\theta^{x_2}(1-\theta)^3
\;\Longrightarrow\;
\ell_2(9) = \binom{11}{9}\theta^9(1-\theta)^3 .
$$
In this situation, the Likelihood Principle says that:

1. For experiment E1 alone the information about $\theta$ is contained solely in $\ell_1(9)$;

2. For experiment E2 alone the information about $\theta$ is contained solely in $\ell_2(9)$;

3. Since $\ell_1(9)$ and $\ell_2(9)$ are proportional as functions of $\theta$, the information about $\theta$ in the two experiments is identical.
Error variance

• The distribution of $\hat\varepsilon'\hat\varepsilon/\sigma^2$ is a chi-square:
$$
\hat\varepsilon'\hat\varepsilon/\sigma^2 = \frac{\varepsilon'(I - P_X)\varepsilon}{\sigma^2} \sim \chi^2_{T-k}.
$$

• Write $\hat\varepsilon'\hat\varepsilon$ in terms of the bias-adjusted estimator $\tilde\sigma^2 = \hat\varepsilon'\hat\varepsilon/(T-k)$:
$$
\hat\varepsilon'\hat\varepsilon/\sigma^2 = (T-k)\tilde\sigma^2/\sigma^2 \sim \chi^2_{T-k}.
$$

• For a chi-square we have $E[\chi^2_v] = v$ and $Var[\chi^2_v] = 2v$. From the moments of $\hat\varepsilon'\hat\varepsilon/\sigma^2$ we can infer the moments of $\tilde\sigma^2$:
$$
E[(T-k)\tilde\sigma^2/\sigma^2] = T-k \implies E[\tilde\sigma^2] = \sigma^2,
$$
$$
Var[(T-k)\tilde\sigma^2/\sigma^2] = 2(T-k) \implies Var[\tilde\sigma^2] = 2\sigma^4/(T-k).
$$

• Note that a rescaled chi-square is a Gamma:
$$
s^2\tilde\sigma^2 \sim \chi^2_v \iff \tilde\sigma^2 \sim \Gamma\!\left(\frac{v}{2}, \frac{s^2}{2}\right).
$$

• The distribution of $\tilde\sigma^2$ is Gamma with $v = (T-k)$ and $s^2 = (T-k)/\sigma^2$. Specifically:
$$
(T-k)\tilde\sigma^2/\sigma^2 \sim \chi^2_{T-k} \iff \tilde\sigma^2 \sim \Gamma\!\left(\frac{T-k}{2}, \frac{(T-k)/\sigma^2}{2}\right).
\tag{41}
$$
It has mean
$$
\frac{T-k}{2}\Big/\frac{(T-k)/\sigma^2}{2} = \sigma^2
$$
and variance
$$
\frac{T-k}{2}\Big/\left(\frac{(T-k)/\sigma^2}{2}\right)^2 = 2\sigma^4/(T-k).
$$

• Now imagine that $\sigma^2$ were a random variable (not a coefficient) and consider (41) as a function of $\sigma^2$. The distribution of $\frac{1}{\sigma^2}$ is a Gamma:
$$
(T-k)\tilde\sigma^2/\sigma^2 \sim \chi^2_{T-k} \iff \frac{1}{\sigma^2} \sim \Gamma\!\left(\frac{T-k}{2}, \frac{(T-k)\tilde\sigma^2}{2}\right),
$$
and that of $\sigma^2$ is an inverse Gamma:
$$
(T-k)\tilde\sigma^2/\sigma^2 \sim \chi^2_{T-k} \iff \sigma^2 \sim \Gamma^{-1}\!\left(\frac{T-k}{2}, \frac{(T-k)\tilde\sigma^2}{2}\right),
$$
where $v = (T-k)$ and $s^2 = (T-k)\tilde\sigma^2$.

• Note that $(T-k)\tilde\sigma^2 = \hat\varepsilon'\hat\varepsilon$, so we can write
$$
\sigma^2 \sim \Gamma^{-1}\!\left(\frac{T-k}{2}, \frac{\hat\varepsilon'\hat\varepsilon}{2}\right).
$$

• All this suggests using an inverse Gamma as a prior for $\sigma^2$.

Conditional posterior of the error variance

1. Set the prior distribution $\sigma^2 \sim \Gamma^{-1}(v_0/2,\; s_0^2/2)$:
$$
p(\sigma^2) = [\Gamma(v_0/2)]^{-1}(s_0^2/2)^{\frac{v_0}{2}}(\sigma^2)^{-\frac{v_0+2}{2}}\exp\{-s_0^2/2\sigma^2\}.
$$

2. Obtain data and form the likelihood function:
$$
p(Y \mid \beta, \sigma^2) = (2\pi)^{-\frac{T}{2}}|\sigma^2 I_T|^{-\frac12}\exp\left[-\frac12(Y - X\beta)'(\sigma^2 I_T)^{-1}(Y - X\beta)\right].
$$

3. Obtain the conditional posterior kernel:
$$
p(\sigma^2 \mid Y, \beta) \propto (\sigma^2)^{-\frac{T + v_0 + 2}{2}}\exp\left[-\{s_0^2 + (Y - X\beta)'(Y - X\beta)\}/2\sigma^2\right],
$$
which is the kernel of an inverse gamma
$$
\sigma^2 \mid Y, \beta \sim \Gamma^{-1}(v_1/2,\; s_1^2/2)
$$
with
$$
v_1 = T + v_0, \qquad s_1^2 = s_0^2 + (Y - X\beta)'(Y - X\beta).
$$

13.2 The CLRM with independent N-IG prior


In summary, when the prior for $\beta$ and $\sigma^2$ is
$$
\beta \sim N(\beta_0, \Sigma_0), \qquad \sigma^2 \sim \Gamma^{-1}\!\left(\frac{v_0}{2}, \frac{s_0^2}{2}\right),
$$
we derived the conditional posteriors
$$
\beta \mid \sigma^2, Y \sim N(\beta_1, \Sigma_1), \qquad
\sigma^2 \mid \beta, Y \sim \Gamma^{-1}\!\left(\frac{v_1}{2}, \frac{s_1^2}{2}\right),
$$
with moments:
$$
\Sigma_1 = \left(\Sigma_0^{-1} + \frac{1}{\sigma^2}X'X\right)^{-1}, \qquad
\beta_1 = \Sigma_1\left(\Sigma_0^{-1}\beta_0 + \frac{1}{\sigma^2}X'X\hat\beta\right),
$$
$$
v_1 = v_0 + T, \qquad s_1^2 = s_0^2 + (y - X\beta)'(y - X\beta).
$$

• Simple Monte Carlo simulation is not an option to obtain the joint posterior of $\beta, \sigma^2 \mid Y$. Other methods are needed: Gibbs sampling (MCMC).

• Integrating out $\sigma^2$ analytically to obtain $\beta \mid Y$ is not an option. Computing the marginal likelihood in closed form is not feasible.

• However, the model is more flexible than the one with a conjugate prior, for it does not require proportionality between the prior variance of $\beta$ and the error variance.
13.3 Gibbs sampling


• Set starting values $\theta_1^{j=0}, \ldots, \theta_k^{j=0}$.

1. Sample $\theta_i^{j=1}$ from
$$
f\!\left(\theta_i \mid \theta_1^{j=1}, \theta_2^{j=1}, \ldots, \theta_{i-1}^{j=1}, \theta_{i+1}^{j=0}, \ldots, \theta_k^{j=0}\right), \qquad i = 1, \ldots, k.
$$

2. Set $j = j + 1$ and repeat (1) until $j = J$, at each step sampling from
$$
f\!\left(\theta_i \mid \theta_1^{j}, \theta_2^{j}, \ldots, \theta_{i-1}^{j}, \theta_{i+1}^{j-1}, \ldots, \theta_k^{j-1}\right).
$$

• Gibbs sampling is a special case of more general MCMC sampling.

• As $J \to \infty$ the joint and marginal distributions of the simulated $\{\theta_1^j, \ldots, \theta_k^j\}_{j=1}^{J}$ converge at an exponential rate to the joint and marginal distributions of $\theta_1, \ldots, \theta_k$.

• For simple models (e.g. linear regressions, also multivariate), this happens *really fast*.
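A minimal sketch of this sampler for the regression model with the independent N-IG prior of Section 13.2. The hyperparameters, the starting value for $\sigma^2$ and the simulated data are illustrative choices of the sketch.

```python
import numpy as np

def gibbs_nig(y, X, beta0, Sigma0, v0, s20, J=5000, burn=1000, seed=0):
    """Gibbs sampler alternating between
    beta | sigma2, y ~ N(beta1, Sigma1)  and  sigma2 | beta, y ~ InvGamma(v1/2, s1^2/2)."""
    rng = np.random.default_rng(seed)
    T, k = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    S0inv = np.linalg.inv(Sigma0)
    sigma2 = 1.0                                   # starting value
    draws_b, draws_s2 = [], []
    for j in range(J):
        # beta | sigma2, y
        Sigma1 = np.linalg.inv(S0inv + XtX / sigma2)
        beta1 = Sigma1 @ (S0inv @ beta0 + Xty / sigma2)
        beta = rng.multivariate_normal(beta1, Sigma1)
        # sigma2 | beta, y  (inverse gamma via the reciprocal of a gamma draw)
        e = y - X @ beta
        v1, s21 = v0 + T, s20 + e @ e
        sigma2 = 1.0 / rng.gamma(shape=v1 / 2.0, scale=2.0 / s21)
        if j >= burn:
            draws_b.append(beta)
            draws_s2.append(sigma2)
    return np.array(draws_b), np.array(draws_s2)

# usage sketch with simulated data
rng = np.random.default_rng(7)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=200)
b_draws, s2_draws = gibbs_nig(y, X, np.zeros(3), 10.0 * np.eye(3), v0=4.0, s20=4.0)
print(b_draws.mean(axis=0), s2_draws.mean())
```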
• By construction the Gibbs sampler produces draws that are autocorrelated.

– Some burn-in is required.
– What is the efficiency/mixing?

• Convergence

– Time series plots
– Tests of equality across independent chains, e.g. Geweke's (1992) convergence diagnostic test for equal means
– Potential Scale Reduction Factors (PSRF), Gelman and Rubin (1992)

• Mixing

– Time series plots
– Autocorrelograms
– Inefficiency factors (IF) give an idea of how far we are from i.i.d. sampling

• Check out the R / MATLAB packages for COnvergence DiAgnostics (CODA).

Gelman and Rubin (1992) and Brooks, S.P. and Gelman, A. (1998)

• Let $\{\theta_j^m\}_{j=1}^{J}$ be the $m$-th simulated chain, $m = 1, \ldots, M$. Let $\hat\theta_m$ and $\hat\sigma_m^2$ be the sample posterior mean and variance of the $m$-th chain, and let the overall sample posterior mean be $\hat\theta = \sum_{m=1}^{M}\hat\theta_m/M$.

• There are two ways to estimate the variance of the stationary distribution $\sigma^2$:

– The mean of the empirical variances within each chain:
$$
W = \frac{1}{M}\sum_{m=1}^{M}\hat\sigma_m^2 .
$$
– The empirical variance from all chains combined:
$$
V = \frac{J-1}{J}W + \frac{M+1}{MJ}B,
$$
where $B = \frac{J}{M-1}\sum_{m=1}^{M}(\hat\theta_m - \hat\theta)^2$ is the empirical between-chain variance.
• If the chains have converged, then both $W$ and $V$ are unbiased. Otherwise the first method will underestimate the variance, since the individual chains have not had time to range all over the stationary distribution, and the second method will overestimate the variance, since the starting points were chosen to be overdispersed.

• The convergence diagnostic is:
$$
PSRF = \sqrt{\frac{V}{W}} .
$$

• Brooks and Gelman (1998) have suggested that if $PSRF < 1.2$ for all model parameters, one can be fairly confident that convergence has been reached.

• More reassuring (and common) is to apply the more stringent condition $PSRF < 1.1$.

• The inefficiency factor (IF) is $1 + 2\sum_{k=1}^{\infty}\rho_k$, where $\rho_k$ is the $k$-th order autocorrelation of the chain. This is the inverse of the relative numerical efficiency measure of Geweke (1992). It is usually estimated as the spectral density at frequency zero with a Newey-West kernel (with a 4% bandwidth).

• i.i.d. sampling (e.g. MC sampling) features IF = 1; here IFs below 20 are considered good.

• Note that mixing can *always* be improved artificially by a practice called thinning. Thinning (or skip sampling) is only advisable if you have space constraints, since it always implies a loss of information.
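A sketch of the two diagnostics just described. The inefficiency factor below truncates the autocorrelation sum at a fixed lag rather than using the Newey-West spectral estimator mentioned in the text; that simplification is an assumption of this sketch.

```python
import numpy as np

def psrf(chains):
    """Potential Scale Reduction Factor for an (M, J) array of draws of one parameter."""
    M, J = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()          # within-chain variance
    B = J * chain_means.var(ddof=1)                # between-chain variance
    V = (J - 1) / J * W + (M + 1) / (M * J) * B
    return np.sqrt(V / W)

def inefficiency_factor(draws, max_lag=100):
    """IF = 1 + 2 * sum of autocorrelations, truncated at max_lag (a crude estimator)."""
    x = draws - draws.mean()
    acov0 = x @ x / len(x)
    rhos = [(x[:-k] @ x[k:] / len(x)) / acov0 for k in range(1, max_lag + 1)]
    return 1 + 2 * np.sum(rhos)
```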

13.4 Generalized linear regression model


• Gibbs sampling is powerful. For example, we can easily extend the model we are considering:
$$
y_t = x_t'\beta + \varepsilon_t
\tag{42}
$$
$$
\varepsilon_t = \rho\varepsilon_{t-1} + u_t, \qquad u_t \sim \text{iid } N(0, \sigma_u^2).
\tag{43}
$$

• In this model there are 3 (groups of) parameters: $\beta$, $\rho$, $\sigma_u^2$. Also, we have $\sigma_\varepsilon^2 = Var(\varepsilon_t) = \sigma_u^2/(1-\rho^2)$.

• Consider the Cochrane-Orcutt transformation:
$$
P = \begin{bmatrix}
1 & & & \\
-\rho & 1 & & \\
& \ddots & \ddots & \\
& & -\rho & 1
\end{bmatrix},
\qquad
Py = PX\beta + P\varepsilon,
\tag{44}
$$
where $[P\varepsilon]_t = \varepsilon_t - \rho\varepsilon_{t-1}$.

• The model in (44) is a Generalized LRM: the error variance of the original model is $Var(\varepsilon) = \frac{\sigma_u^2}{1-\rho^2}R(\rho) \equiv \Omega(\rho, \sigma_u^2)$, where $R(\rho)$ denotes the AR(1) autocorrelation matrix, and the transformed model has error variance $Var(P\varepsilon) = P\,\Omega(\rho,\sigma_u^2)\,P'$. We have:
$$
P(\rho)y = P(\rho)X\beta + P(\rho)\varepsilon,
\tag{45}
$$
which has likelihood:
$$
p(Y \mid \beta, \sigma^2, \rho) = (2\pi)^{-\frac{T}{2}}|\Omega|^{-\frac12}\exp\left[-\frac12(Y - X\beta)'\Omega(\rho, \sigma_u^2)^{-1}(Y - X\beta)\right],
$$
where recall $\Omega = \Omega(\rho, \sigma_u^2)$.

• One can specify:
$$
\beta \sim N(\beta_0, \Sigma_0), \qquad
\rho \sim N(\rho_0, V_{\rho,0}), \qquad
\sigma_u^2 \sim \Gamma^{-1}\!\left(\frac{v_0}{2}, \frac{s_0^2}{2}\right).
$$

• Under knowledge of $\sigma_u^2$ and $\rho$, this gives the following posterior for $\beta$:
$$
\beta \mid \rho, \sigma_u^2, y \sim N(\beta_1, \Sigma_1),
$$
$$
\Sigma_1 = \left(\Sigma_0^{-1} + X'\Omega(\rho, \sigma_u^2)^{-1}X\right)^{-1}, \qquad
\beta_1 = \Sigma_1\left(\Sigma_0^{-1}\beta_0 + X'\Omega(\rho, \sigma_u^2)^{-1}Y\right),
$$
which is simply a weighted average of a GLS estimator and the prior.

• Then, under knowledge of $\beta$, it is easy to use the model in (42) to derive $\varepsilon \mid \beta, y$, and this can be used as an observable in (43). This gives
$$
\varepsilon_1 = \rho\varepsilon_0 + u, \qquad u \sim N(0, \sigma_u^2 I_{T-1}),
\tag{46}
$$
where $\varepsilon_1$ stacks the observations $\varepsilon_2, \ldots, \varepsilon_T$ and $\varepsilon_0$ the observations $\varepsilon_1, \ldots, \varepsilon_{T-1}$; this is a standard linear regression model with AR coefficient $\rho$ and error variance $\sigma_u^2$.

• Given the priors specified above, the posteriors will be:
$$
\rho \mid \beta, \sigma_u^2, y \sim N(\rho_1, V_{\rho,1}),
$$
$$
V_{\rho,1} = \left(V_{\rho,0}^{-1} + \frac{1}{\sigma_u^2}\varepsilon_0'\varepsilon_0\right)^{-1}, \qquad
\rho_1 = V_{\rho,1}\left(V_{\rho,0}^{-1}\rho_0 + \frac{1}{\sigma_u^2}\varepsilon_0'\varepsilon_1\right),
$$
$$
\sigma_u^2 \mid \beta, \rho, y \sim \Gamma^{-1}\!\left(\frac{v_0 + T}{2},\; \frac{s_0^2 + (\varepsilon_1 - \rho\varepsilon_0)'(\varepsilon_1 - \rho\varepsilon_0)}{2}\right).
$$
14 Appendix A: Moments of the error variance in
the LRM
Start with:
$$
\hat\varepsilon'\hat\varepsilon/\sigma^2 = (T-k)\hat\sigma^2/\sigma^2 \sim \chi^2_{T-k},
$$
where
$$
E[(T-k)\hat\sigma^2/\sigma^2] = (T-k), \qquad Var[(T-k)\hat\sigma^2/\sigma^2] = 2(T-k).
$$
Hence
$$
E[(T-k)\hat\sigma^2/\sigma^2] = (T-k) \implies (T-k)E[\hat\sigma^2]/\sigma^2 = (T-k) \implies E[\hat\sigma^2] = \sigma^2,
$$
$$
Var[(T-k)\hat\sigma^2/\sigma^2] = 2(T-k) \implies (T-k)^2 Var[\hat\sigma^2]/\sigma^4 = 2(T-k) \implies Var[\hat\sigma^2] = 2\sigma^4/(T-k).
$$
Alternative derivation, which also shows the distribution:
$$
\hat\varepsilon'\hat\varepsilon/\sigma^2 = (T-k)\hat\sigma^2/\sigma^2 = s_0^2\hat\sigma^2 \sim \chi^2_{T-k},
$$
where $s_0^2 = (T-k)/\sigma^2$. Note that if
$$
s_0^2\hat\sigma^2 \sim \chi^2_{T-k}
$$
then $\hat\sigma^2$ is a gamma:
$$
\hat\sigma^2 \sim \Gamma\!\left(\frac{T-k}{2}, \frac{s_0^2}{2}\right),
$$
with mean
$$
\frac{T-k}{2}\Big/\frac{s_0^2}{2} = (T-k)/s_0^2 = (T-k)/((T-k)/\sigma^2) = \sigma^2
$$
and variance
$$
\frac{T-k}{2}\Big/\left(\frac{s_0^2}{2}\right)^2 = 2(T-k)/s_0^4 = 2(T-k)/((T-k)^2/\sigma^4) = 2\sigma^4/(T-k).
$$

Towards a conjugate prior   Finally, note that
$$
\hat\varepsilon'\hat\varepsilon/\sigma^2 = (T-k)\hat\sigma^2/\sigma^2 = s_0^2\hat\sigma^2 \sim \chi^2_{T-k}
$$
implies that
$$
\sigma^2 \sim \hat\varepsilon'\hat\varepsilon\left(\chi^2_{T-k}\right)^{-1},
$$
that is, $\sigma^2$ is the inverse of a rescaled chi-square distribution, i.e. it is an inverse gamma. This offers a choice for a conjugate prior for $\sigma^2$.