Andrea Carriero
October 31, 2023
Contents

6 Inference on multiple coefficients and restricted least squares
  6.1 Specifying multiple linear restrictions
  6.2 Wald statistic
  6.3 Restricted least squares
10.3 Two stage least squares
10.4 Hausman test
Part I
The linear regression model
1 Introduction
1.1 Univariate model
We consider models with one dependent variable $y$, which we try to explain using some independent variables $x$.

Consider the model:
$$y_t = a + b x_t, \qquad t = 1, \dots, T.$$
This is a set of $T$ equations:
$$y_1 = a + b x_1, \quad y_2 = a + b x_2, \quad \dots, \quad y_T = a + b x_T.$$
In general such a system has no exact solution, so we add an error term to each equation:
$$y_1 = a + b x_1 + \varepsilon_1, \quad y_2 = a + b x_2 + \varepsilon_2, \quad \dots, \quad y_T = a + b x_T + \varepsilon_T.$$
We now have $2+T$ unknowns: $a$, $b$, and $\varepsilon_t$, $t = 1, \dots, T$. The system now has more unknowns than equations, so there are many solutions. We choose the solution that
minimizes the sum of the squared residuals, giving fitted values
$$\hat y_t = \hat a + \hat b x_t.$$
It is convenient to view the intercept as the coefficient on a regressor identically equal to 1:
$$y_1 = a \cdot 1 + b\, x_1 + \varepsilon_1, \quad y_2 = a \cdot 1 + b\, x_2 + \varepsilon_2, \quad \dots, \quad y_T = a \cdot 1 + b\, x_T + \varepsilon_T,$$
or, relabelling the coefficients,
$$y_t = \beta_1 + \beta_2 x_{2,t} + \varepsilon_t.$$
1.2 Multivariate model

This course is about the general multivariate model defined below. Depending on whether observations are indexed by time or by cross-sectional unit, the model is a time-series or a cross-section model. Panel data include both the time series and the cross section dimension.
Define the matrices:
$$\underset{T\times 1}{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{bmatrix}, \qquad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_T \end{bmatrix},$$
$$x_1 = \begin{bmatrix} x_{1,1} \\ x_{1,2} \\ \vdots \\ x_{1,T} \end{bmatrix}, \quad x_2 = \begin{bmatrix} x_{2,1} \\ x_{2,2} \\ \vdots \\ x_{2,T} \end{bmatrix}, \quad \dots, \quad x_k = \begin{bmatrix} x_{k,1} \\ x_{k,2} \\ \vdots \\ x_{k,T} \end{bmatrix},$$
$$\underset{T\times k}{X} = \begin{bmatrix} x_{1,1} & x_{2,1} & \cdots & x_{k,1} \\ x_{1,2} & x_{2,2} & \cdots & x_{k,2} \\ \vdots & \vdots & & \vdots \\ x_{1,T} & x_{2,T} & \cdots & x_{k,T} \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix}.$$
We have:
$$\underset{T\times 1}{y} = \underset{T\times k}{X}\;\underset{k\times 1}{\beta} + \underset{T\times 1}{\varepsilon} \tag{3}$$
1.3 Ordinary Least Squares

Derive the Ordinary Least Squares estimator for this model. The residual sum of squares is:
$$RSS = \varepsilon_1^2 + \varepsilon_2^2 + \dots + \varepsilon_T^2 = \varepsilon'\varepsilon = \begin{bmatrix} \varepsilon_1 & \varepsilon_2 & \cdots & \varepsilon_T \end{bmatrix}\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_T \end{bmatrix} = \underbrace{(y - X\beta)'}_{\varepsilon'}\underbrace{(y - X\beta)}_{\varepsilon}.$$
2 Algebraic properties of the OLS estimator

2.1 Orthogonality and its implications

We derived the OLS estimator:
$$\hat\beta_{OLS} = (X'X)^{-1}X'Y.$$
The first-order condition it solves is
$$(X'X)\hat\beta_{OLS} - X'Y = 0,$$
that is:
$$X'(Y - X\hat\beta_{OLS}) = 0,$$
or:
$$\underset{k\times T}{X'}\;\underset{T\times 1}{\hat\varepsilon} = 0.$$
Under OLS the residuals $\hat\varepsilon$ are orthogonal to the regressors $X$. This implies the following properties:
1. If the model has an intercept, then the residuals have sample average exactly equal to 0:
$$\frac{1}{T}\underset{1\times T}{\mathbf{1}'}\hat\varepsilon = \frac{1}{T}\begin{bmatrix}1 & 1 & \cdots & 1\end{bmatrix}\begin{bmatrix}\hat\varepsilon_1 \\ \hat\varepsilon_2 \\ \vdots \\ \hat\varepsilon_T\end{bmatrix} = \frac{1}{T}(\hat\varepsilon_1 + \hat\varepsilon_2 + \dots + \hat\varepsilon_T) = \frac{1}{T}\sum_{t=1}^T\hat\varepsilon_t = 0.$$

2. The sample average of the fitted values equals the sample average of the data. Starting from $Y = \hat Y + \hat\varepsilon$ and premultiplying by $\frac{1}{T}\mathbf{1}'$:
$$\frac{1}{T}\mathbf{1}'Y = \frac{1}{T}\mathbf{1}'\hat Y + \frac{1}{T}\mathbf{1}'\hat\varepsilon,$$
$$\frac{1}{T}\sum_{t=1}^T Y_t = \frac{1}{T}\sum_{t=1}^T\hat Y_t + \frac{1}{T}\sum_{t=1}^T\hat\varepsilon_t = \frac{1}{T}\sum_{t=1}^T\hat Y_t + 0,$$
so that $\bar Y = \bar{\hat Y}$.
3. OLS makes $\hat Y$ such that it is orthogonal to $\hat\varepsilon$:
$$\hat Y'\hat\varepsilon = (X\hat\beta)'\hat\varepsilon = \hat\beta'X'\hat\varepsilon = \hat\beta'\,0 = 0.$$
Specifically, OLS implies a partition of $Y$ into two subspaces orthogonal to each other. We have:
$$\hat Y = X\hat\beta = X\underbrace{(X'X)^{-1}X'Y}_{\hat\beta} = \underbrace{X(X'X)^{-1}X'}_{P_X}Y = P_X Y,$$
where $P_X = X(X'X)^{-1}X'$ is the projection matrix for the space spanned by the columns of $X$: it projects things onto that space. The residuals are:
$$\hat\varepsilon = Y - \hat Y = Y - P_X Y = (I_T - P_X)Y = M_X Y,$$
where $M_X = I_T - P_X$ is the "residual maker" or "annihilator" matrix. The matrix $M_X$ projects things onto the space orthogonal to the columns of $X$.

The matrices $M_X$ and $P_X$ have some important properties. They are symmetric:
$$P_X' = X(X'X)^{-1}X' = P_X, \qquad M_X' = I_T' - P_X' = I_T - P_X = M_X,$$
and idempotent:
$$P_X P_X = X(X'X)^{-1}X'X(X'X)^{-1}X' = X(X'X)^{-1}X' = P_X,$$
$$M_X M_X = (I_T - P_X)(I_T - P_X) = I_T - P_X - P_X + P_X P_X = I_T - P_X = M_X.$$
$M_X$ and $P_X$ are mutually orthogonal:
$$P_X M_X = P_X(I_T - P_X) = P_X - P_X P_X = 0.$$
The projection of $X$ on $X$ is $X$ itself:
$$P_X X = X(X'X)^{-1}X'X = X.$$
The projection of $X$ on the space of $M_X$ is 0:
$$M_X X = (I_T - P_X)X = X - P_X X = 0.$$
Therefore we have:
$$Y = \hat Y + \hat\varepsilon = P_X Y + M_X Y.$$
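These algebraic properties can be verified numerically. The following sketch (mine, on an arbitrary random design matrix) builds $P_X$ and $M_X$ and is checked against symmetry, idempotency, and mutual orthogonality:

```python
import numpy as np

rng = np.random.default_rng(1)
T, k = 50, 4
X = rng.normal(size=(T, k))

# Projection matrix P_X and annihilator ("residual maker") M_X
P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(T) - P
```

Any vector $Y$ then splits as $PY + MY$ with the two pieces orthogonal by construction.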
2.2 Goodness of fit

Recall the actual-fitted-residual decomposition:
$$Y = \hat Y + \hat\varepsilon,$$
and consider squaring both sides of the equation:
$$Y'Y = (\hat Y + \hat\varepsilon)'(\hat Y + \hat\varepsilon) = \hat Y'\hat Y + \hat\varepsilon'\hat\varepsilon + 2\hat Y'\hat\varepsilon = \hat Y'\hat Y + \hat\varepsilon'\hat\varepsilon,$$
since $\hat Y'\hat\varepsilon = 0$. Note that:
$$\sum_{t=1}^T Y_t^2 = Y'Y = TSS, \qquad \sum_{t=1}^T\hat Y_t^2 = \hat Y'\hat Y = ESS, \qquad \sum_{t=1}^T\hat\varepsilon_t^2 = \hat\varepsilon'\hat\varepsilon = RSS,$$
so we can write:
$$TSS = ESS + RSS,$$
which gives:
$$R^2 = \frac{ESS}{TSS} = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}.$$
This is the *uncentered* $R^2$. However, the uncentered $R^2$ has a scaling problem. We need the centered $R_C^2$:
$$R_C^2 = \frac{CESS}{CTSS} = \frac{\sum_{t=1}^T(\hat Y_t - \bar Y)^2}{\sum_{t=1}^T(Y_t - \bar Y)^2},$$
where
$$\sum_{t=1}^T(Y_t - \bar Y)^2 = CTSS, \qquad \sum_{t=1}^T(\hat Y_t - \bar Y)^2 = CESS.$$
Moreover, there is the adjusted $\bar R^2$, which takes into account how many regressors are in the model:
$$\bar R^2 = 1 - \frac{T-1}{T-k}(1 - R_C^2);$$
when $k$ increases, $\bar R^2$ falls unless the fit improves enough to compensate.
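The centered and adjusted measures can be computed directly from the decomposition above. A minimal sketch on simulated data (all numbers illustrative and my own):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 200
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
k = X.shape[1]
y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
e_hat = y - y_hat

# Centered sums of squares and the two R^2 measures
ctss = np.sum((y - y.mean()) ** 2)
cess = np.sum((y_hat - y.mean()) ** 2)
r2_centered = cess / ctss
r2_adjusted = 1 - (T - 1) / (T - k) * (1 - r2_centered)
```

Because the model contains an intercept, CTSS = CESS + RSS holds exactly, and the adjusted measure is never above the centered one.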
Figure 1: The regression plane of the regression of $y$ onto $x_1, x_2$. The plane has equation $\hat y_t = x_1\hat\beta_1 + x_2\hat\beta_2$.

Figure 2: The orthogonal projection of $y$ onto $\mathrm{span}(x_1, x_2) = \{x_1\beta_1 + x_2\beta_2\}$.
3 Small sample properties of the OLS estimator
So far, the vector
$$\underset{k\times 1}{\hat\beta} = (X'X)^{-1}X'Y$$
has been treated as a purely algebraic object. However, we have no theory to establish how good these estimates are. To do this, we need to add a probabilistic treatment to our model. We add:
$$(A3): \quad E[\varepsilon] = 0.$$
Note that
$$\varepsilon = \begin{bmatrix}\varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_T\end{bmatrix}, \quad \varepsilon' = \begin{bmatrix}\varepsilon_1 & \varepsilon_2 & \cdots & \varepsilon_T\end{bmatrix}, \quad \varepsilon\varepsilon' = \begin{bmatrix}\varepsilon_1^2 & \varepsilon_1\varepsilon_2 & \cdots & \varepsilon_1\varepsilon_T \\ \varepsilon_2\varepsilon_1 & \varepsilon_2^2 & & \\ \vdots & & \ddots & \\ \varepsilon_T\varepsilon_1 & & & \varepsilon_T^2\end{bmatrix},$$
$$E[\varepsilon\varepsilon'] = \begin{bmatrix}E[\varepsilon_1^2] & E[\varepsilon_1\varepsilon_2] & \cdots & E[\varepsilon_1\varepsilon_T] \\ E[\varepsilon_2\varepsilon_1] & E[\varepsilon_2^2] & & \\ \vdots & & \ddots & \\ E[\varepsilon_T\varepsilon_1] & & & E[\varepsilon_T^2]\end{bmatrix},$$
so the assumption $(A4): E[\varepsilon\varepsilon'] = \sigma^2 I_T$ implies that $\mathrm{Var}(\varepsilon_t) = \sigma^2$ for all $t$ and $\mathrm{Cov}(\varepsilon_t, \varepsilon_s) = 0$ for $t \neq s$. Finally, we add:
$$(A5): \quad \varepsilon \sim N(0, \sigma^2 I_T): \text{ Gaussianity.}$$
To summarize, we have the set of assumptions A1 ($X$ full rank), A2 ($X$ deterministic), A3, A4, A5. We can now use these to compute the properties of the OLS estimator.
3.1 Unbiasedness
Let's say you want to estimate a coefficient $\beta$ using an estimator $\hat\beta$. The estimator is unbiased iff:
$$E[\hat\beta] = \beta.$$
First of all, recall the OLS estimator:
$$\hat\beta = (X'X)^{-1}X'Y.$$
To study its properties, we need to put the estimator in the "sampling error representation":
$$\hat\beta = (X'X)^{-1}X'(X\beta + \varepsilon) = (X'X)^{-1}X'X\beta + (X'X)^{-1}X'\varepsilon = \underbrace{\beta}_{\text{actual value}} + \underbrace{(X'X)^{-1}X'\varepsilon}_{\text{sampling error}}.$$
This is useful to compute the properties of the estimator. So consider taking the expectation:
$$E[\hat\beta] = E[\beta + (X'X)^{-1}X'\varepsilon],$$
and use the properties of the $E[\cdot]$ operator:
$$E[\hat\beta] = E[\beta] + \underbrace{E[(X'X)^{-1}X'\varepsilon]}_{X\text{ constant (A2)}} = \beta + (X'X)^{-1}X'\underbrace{E[\varepsilon]}_{=0\ (A3)} = \beta + (X'X)^{-1}X'\,0 = \beta.$$
3.2 Variance
Now, let us consider the variance of this estimator. Note that in our case $\beta$ is not a random variable, so $\mathrm{Var}(\beta) = 0$ and the covariance terms vanish. Using the rule $\mathrm{Var}(Az) = A\,\mathrm{Var}(z)A'$ with $A = (X'X)^{-1}X'$:
$$\mathrm{Var}(\hat\beta) = (X'X)^{-1}X'\,\mathrm{Var}(\varepsilon)\,X(X'X)^{-1} = (X'X)^{-1}X'\sigma^2 I_T X(X'X)^{-1} = \sigma^2(X'X)^{-1}X'X(X'X)^{-1} = \sigma^2(X'X)^{-1}.$$
3.3 Distribution
Finally, consider the distribution of the OLS estimator. From
$$\hat\beta = \beta + (X'X)^{-1}X'\varepsilon$$
we know that
$$\varepsilon \sim N(0, \sigma^2 I_T),$$
and a linear transformation of a Gaussian is Gaussian:
$$(X'X)^{-1}X'\varepsilon \sim N\big((X'X)^{-1}X'\,0,\; (X'X)^{-1}X'\sigma^2 I_T X(X'X)^{-1}\big) = N\big(0, \sigma^2(X'X)^{-1}\big),$$
$$\beta + (X'X)^{-1}X'\varepsilon \sim N\big(0 + \beta, \sigma^2(X'X)^{-1}\big),$$
$$\hat\beta \sim N\big(\beta, \sigma^2(X'X)^{-1}\big).$$
To sum up:
A1 was used to derive the estimator.
1) $\hat\beta$ is unbiased: we used A1-A3.
2) $\hat\beta$ has variance $\sigma^2(X'X)^{-1}$: we used A1-A4.
3) $\hat\beta$ has a normal distribution: we used A1-A5.
The expression
$$\hat\beta \sim N\big(\beta, \sigma^2(X'X)^{-1}\big)$$
summarizes these results.
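A small Monte Carlo exercise (my own illustration, with simulated data and a fixed design matrix as in A2) makes unbiasedness and the variance formula concrete: across many replications the average of $\hat\beta$ matches $\beta$ and the empirical covariance matches $\sigma^2(X'X)^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(3)
T, k, sigma = 60, 2, 1.0
X = np.column_stack([np.ones(T), rng.normal(size=T)])  # fixed across replications (A2)
beta = np.array([1.0, 2.0])

reps = 2000
estimates = np.empty((reps, k))
for r in range(reps):
    y = X @ beta + sigma * rng.normal(size=T)          # new Gaussian errors each draw
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)

mean_est = estimates.mean(axis=0)                      # should be close to beta
theoretical_var = sigma**2 * np.linalg.inv(X.T @ X)    # sigma^2 (X'X)^{-1}
empirical_var = np.cov(estimates.T)
```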
Gauss-Markov theorem. Consider any other linear unbiased estimator $b = CY$, where $C$ can be written as $C = D + (X'X)^{-1}X'$; unbiasedness of $b$ requires $CX = I$, which implies $DX = 0$. Compute the variance:
$$\mathrm{Var}(b) = \mathrm{Var}(CY) = \mathrm{Var}(CX\beta + C\varepsilon) = \mathrm{Var}(C\varepsilon) = C\,\mathrm{Var}(\varepsilon)\,C' = C\sigma^2 I C' = \sigma^2 CC'.$$
Expanding:
$$\mathrm{Var}(b) = \sigma^2 CC' = \sigma^2\{D + (X'X)^{-1}X'\}\{D + (X'X)^{-1}X'\}' = \sigma^2\{D + (X'X)^{-1}X'\}\{D' + X(X'X)^{-1}\}$$
$$= \sigma^2\{DD' + DX(X'X)^{-1} + (X'X)^{-1}X'D' + (X'X)^{-1}X'X(X'X)^{-1}\} = \sigma^2\{DD' + 0 + 0 + (X'X)^{-1}\}$$
$$= \sigma^2 DD' + \underbrace{\sigma^2(X'X)^{-1}}_{\mathrm{Var}(\hat\beta_{OLS})},$$
and since $\sigma^2 DD'$ is positive semidefinite, $\hat\beta$ has the smallest variance in the class of linear and unbiased estimators.
4 Inference on individual coefficients
We have considered the model
$$Y = X\beta + \varepsilon$$
and derived the estimator
$$\hat\beta = (X'X)^{-1}X'Y.$$
This is a $k$-variate random variable with a multivariate Gaussian distribution. We want to make inference on the individual coefficients $\beta_j$, $j = 1, \dots, k$. The distribution of the OLS estimator for the $j$-th element of the vector $\hat\beta$ is:
$$\hat\beta_j \sim N\big(\beta_j, [\sigma^2(X'X)^{-1}]_{jj}\big),$$
which can be used for inference, specifically for hypothesis testing and confidence intervals. Say we want to test $H_0: \beta_j = \beta_j^0$. The pivotal quantity under the null is:
$$\hat z = \frac{\hat\beta_j - \beta_j^0}{\sqrt{[\sigma^2(X'X)^{-1}]_{jj}}} \overset{H_0}{\sim} N(0, 1).$$
4.1 Estimation of error variance

Step 1. We propose an estimator. We want to estimate
$$\sigma^2 = E[\varepsilon_t^2].$$
A natural candidate replaces the population expectation with the sample average of the squared residuals:
$$\tilde\sigma^2 = \frac{1}{T}\sum_{t=1}^T\hat\varepsilon_t^2.$$
Recall the matrix $M$:
$$M = I_T - X(X'X)^{-1}X'.$$
Its trace is:
$$\mathrm{trace}\{M\} = \mathrm{trace}\{I_T\} - \mathrm{trace}\{X(X'X)^{-1}X'\} = T - \mathrm{trace}\{(X'X)^{-1}X'X\} = T - \mathrm{trace}\{I_k\} = T - k,$$
so we have:
$$E[\tilde\sigma^2] = \frac{1}{T}\sigma^2\,\mathrm{trace}(M) = \frac{1}{T}\sigma^2(T - k) = \frac{T-k}{T}\sigma^2.$$
This is biased. We can adjust it:
$$E\left[\frac{T}{T-k}\tilde\sigma^2\right] = \frac{T}{T-k}E[\tilde\sigma^2] = \frac{T}{T-k}\cdot\frac{T-k}{T}\sigma^2 = \sigma^2,$$
$$\hat\sigma^2 = \frac{T}{T-k}\tilde\sigma^2 = \frac{T}{T-k}\cdot\frac{1}{T}\sum_{t=1}^T\hat\varepsilon_t^2 = \frac{1}{T-k}\sum_{t=1}^T\hat\varepsilon_t^2.$$
Now, note that $\varepsilon \sim N(0, \sigma^2 I_T)$, so $\hat\varepsilon'\hat\varepsilon/\sigma^2$ is a sum of squared standard normals, of which $T-k$ are linearly independent: it is distributed as the sum of $T-k$ independent squared standard Gaussians, i.e. $\chi^2_{T-k}$.
Remember the result we obtained:
$$\hat\beta \sim N\big(\beta, \sigma^2(X'X)^{-1}\big).$$
Consider the following ratio:
$$\frac{\dfrac{\hat\beta_j - \beta_j}{\sqrt{[\sigma^2(X'X)^{-1}]_{jj}}}}{\sqrt{\dfrac{\hat\varepsilon'\hat\varepsilon/\sigma^2}{T-k}}} = \frac{N(0,1)}{\sqrt{\chi^2_{T-k}/(T-k)}} \sim t_{T-k}.$$
Both of these distributional results are based on $M$ being a fixed matrix, which is the case under A2. Moreover, the degrees of freedom come precisely from the rank of $M$. Simplifying the ratio, the unknown $\sigma^2$ cancels:
$$\frac{\dfrac{\hat\beta_j - \beta_j}{\sqrt{\sigma^2[(X'X)^{-1}]_{jj}}}}{\sqrt{\dfrac{\hat\varepsilon'\hat\varepsilon}{\sigma^2(T-k)}}} = \frac{\hat\beta_j - \beta_j}{\sqrt{\hat\sigma^2[(X'X)^{-1}]_{jj}}} \sim t_{T-k},$$
as opposed to:
$$z = \frac{\hat\beta_j - \beta_j}{\sqrt{[\sigma^2(X'X)^{-1}]_{jj}}} \sim N(0, 1).$$
In practice we never know $\sigma^2$. For tests of $H_0: \beta_j = \beta_j^0$:
$$\frac{\hat\beta_j - \beta_j^0}{\sqrt{\hat\sigma^2[(X'X)^{-1}]_{jj}}} \overset{H_0}{\sim} t_{T-k}.$$
For confidence intervals:
$$\hat\beta_j \pm t^*\sqrt{\hat\sigma^2[(X'X)^{-1}]_{jj}}.$$
Software packages provide
$$\frac{\hat\beta_j}{\sqrt{\hat\sigma^2[(X'X)^{-1}]_{jj}}} \overset{H_0:\beta_j=0}{\sim} t_{T-k}.$$
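The feasible t statistic can be assembled in a few lines. A sketch on simulated data (all names and numbers are my own, not from the notes); the true intercept is 0 and the true slope is 1.5, so the slope's t statistic should be large:

```python
import numpy as np

rng = np.random.default_rng(4)
T = 100
X = np.column_stack([np.ones(T), rng.normal(size=T)])
k = X.shape[1]
y = X @ np.array([0.0, 1.5]) + rng.normal(size=T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e_hat = y - X @ beta_hat
sigma2_hat = e_hat @ e_hat / (T - k)            # unbiased estimator of sigma^2
var_beta = sigma2_hat * np.linalg.inv(X.T @ X)  # estimated Var(beta_hat)
se = np.sqrt(np.diag(var_beta))
t_stats = beta_hat / se                         # t statistics for H0: beta_j = 0
```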
5 Partitioned Regressions and the Frisch-Waugh Theorem
Recall the model:
$$\underset{T\times 1}{Y} = \underset{T\times k}{X}\;\underset{k\times 1}{\beta} + \underset{T\times 1}{\varepsilon}.$$
Partition $X$ into two blocks, $X = [X_1\;\; X_2]$, with $X_1$ containing the first $q$ columns and $X_2$ the remaining $k-q$, and accordingly:
$$\beta = \begin{bmatrix}\beta_1 \\ \beta_2\end{bmatrix}, \qquad \underset{q\times 1}{\beta_1} = \begin{bmatrix}\beta_1 \\ \vdots \\ \beta_q\end{bmatrix}, \qquad \underset{(k-q)\times 1}{\beta_2} = \begin{bmatrix}\beta_{q+1} \\ \beta_{q+2} \\ \vdots \\ \beta_k\end{bmatrix},$$
so that
$$Y = \begin{bmatrix}X_1 & X_2\end{bmatrix}\begin{bmatrix}\beta_1 \\ \beta_2\end{bmatrix} + \varepsilon,$$
and, performing the product:
$$Y = X_1\beta_1 + X_2\beta_2 + \varepsilon;$$
in $X_1$ there are $q$ regressors and in $X_2$ there are $k-q$ regressors. We start from the normal equations:
$$X'X\hat\beta = X'Y.$$
In the partitioned notation:
$$\begin{bmatrix}X_1' \\ X_2'\end{bmatrix}\begin{bmatrix}X_1 & X_2\end{bmatrix}\begin{bmatrix}\hat\beta_1 \\ \hat\beta_2\end{bmatrix} = \begin{bmatrix}X_1' \\ X_2'\end{bmatrix}Y.$$
Doing the multiplication:
$$X_1'X_1\hat\beta_1 + X_1'X_2\hat\beta_2 = X_1'Y,$$
$$X_2'X_1\hat\beta_1 + X_2'X_2\hat\beta_2 = X_2'Y.$$
From the first equation, $\hat\beta_1 = (X_1'X_1)^{-1}X_1'(Y - X_2\hat\beta_2)$; we can now plug this expression into the second equation and solve for $\hat\beta_2$.
The same steps can be done to obtain a solution for $\hat\beta_1$. We have:
$$\underset{q\times 1}{\hat\beta_1} = (X_1'M_2X_1)^{-1}X_1'M_2Y, \qquad \underset{(k-q)\times 1}{\hat\beta_2} = (X_2'M_1X_2)^{-1}X_2'M_1Y.$$
Consider the case in which $X_1$ and $X_2$ are orthogonal: $X_1'X_2 = 0$. This implies $X_1'M_2 = X_1'[I - X_2(X_2'X_2)^{-1}X_2'] = X_1' - 0 = X_1'$. Also we have that $X_2'M_1 = X_2'[I - X_1(X_1'X_1)^{-1}X_1'] = X_2'$. Hence:
$$\hat\beta_1 = (X_1'X_1)^{-1}X_1'Y, \qquad \hat\beta_2 = (X_2'X_2)^{-1}X_2'Y.$$

Remark 1. When the regressors are orthogonal, the least squares regression of $Y$ on the two separate sets of variables gives the same OLS estimators as regressing $Y$ on both groups at the same time.

In the general case, consider regressing $X_2$ on $X_1$:
$$\hat\Gamma = (X_1'X_1)^{-1}X_1'X_2, \qquad \hat X_2 = X_1\hat\Gamma = X_1(X_1'X_1)^{-1}X_1'X_2 = P_1X_2,$$
with residuals
$$\hat u = X_2 - \hat X_2 = X_2 - P_1X_2 = M_1X_2.$$
So we note that (using symmetry and idempotency of $M_1$):
$$\hat\beta_2 = (X_2'M_1M_1X_2)^{-1}X_2'M_1Y = (\hat u'\hat u)^{-1}\hat u'Y,$$
i.e. the OLS estimator of the slope coefficient in the regression:
$$Y = \hat u\,\beta_2 + v.$$
Proposition 2. In the linear regression of $Y$ on two sets of variables $X_1$ and $X_2$, the subvector $\hat\beta_2$ is the set of coefficients obtained from a regression of $Y$ on the residuals $\hat u$, where $\hat u$ are the residuals of a regression of $X_2$ onto $X_1$. The same logic applies coefficient by coefficient in the general model
$$Y = x_1\beta_1 + x_2\beta_2 + \dots + x_k\beta_k + \varepsilon.$$
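The Frisch-Waugh result can be verified numerically: the coefficients on $X_2$ from the full regression coincide with those from a regression of $Y$ on $M_1X_2$. A sketch with simulated, deliberately correlated regressors (all names mine):

```python
import numpy as np

rng = np.random.default_rng(5)
T, q = 80, 2
X1 = np.column_stack([np.ones(T), rng.normal(size=T)])
X2 = rng.normal(size=(T, 2)) + 0.5 * X1[:, [1]]   # correlated with X1 on purpose
X = np.hstack([X1, X2])
y = X @ np.array([1.0, -1.0, 0.5, 2.0]) + rng.normal(size=T)

# Full regression: coefficients on X2 are the last k-q entries
beta_full = np.linalg.solve(X.T @ X, X.T @ y)
beta2_full = beta_full[q:]

# FWL route: partial X1 out of X2, then regress y on the residuals
M1 = np.eye(T) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
U = M1 @ X2
beta2_fwl = np.linalg.solve(U.T @ U, U.T @ y)
```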
6 Inference on multiple coefficients and restricted least squares
6.1 Specifying multiple linear restrictions
Consider the model:
$$Y = X\beta + \varepsilon.$$
Full model:
$$Y = x_1\beta_1 + x_2\beta_2 + \varepsilon.$$
Restricted model:
$$Y = x_1\beta_1 + \varepsilon.$$
The restriction is $\beta_2 = 0$:
$$\begin{bmatrix}\beta_1 \\ \beta_2\end{bmatrix} = \begin{bmatrix}\beta_1 \\ 0\end{bmatrix}.$$
Let us generalize. Take
$$Y = x_1\beta_1 + x_2\beta_2 + x_3\beta_3 + x_4\beta_4 + \varepsilon,$$
and imagine we want to impose the following restrictions:
a) $\beta_1 = 0$, i.e. $\beta_1 - 0 = 0$;
b) $\beta_2 = 1$, i.e. $\beta_2 - 1 = 0$;
c) $\beta_2 + \beta_3 = 5$, i.e. $\beta_2 + \beta_3 - 5 = 0$.
Any set of $j$ linear restrictions can be written in the form
$$\underset{j\times k}{R}\;\underset{k\times 1}{\beta} = \underset{j\times 1}{r}.$$
For the three restrictions above ($j = 3$, $k = 4$):
$$\begin{bmatrix}1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0\end{bmatrix}\begin{bmatrix}\beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4\end{bmatrix} - \begin{bmatrix}0 \\ 1 \\ 5\end{bmatrix} = \begin{bmatrix}0 \\ 0 \\ 0\end{bmatrix}.$$
The vector $m = R\beta - r$ measures the distance from the restrictions.
Let us give a few more examples of restrictions and put them in the form (5).

$H_0: \beta_i = 0$:
$$R = \begin{bmatrix}0 & 0 & 0 & 1 & \cdots & 0\end{bmatrix}, \qquad r = 0,$$
$$m = \begin{bmatrix}0 & 0 & 0 & 1 & \cdots & 0\end{bmatrix}\begin{bmatrix}\beta_1 \\ \beta_2 \\ \vdots \\ \beta_i \\ \vdots \\ \beta_k\end{bmatrix} - 0 = \beta_i - 0 = 0.$$

$H_0: \beta_i = \beta_{i+1}$, i.e. $\beta_i - \beta_{i+1} = 0$:
$$R = \begin{bmatrix}0 & 0 & \cdots & 1 & -1 & \cdots\end{bmatrix}, \qquad r = 0,$$
$$m = \begin{bmatrix}0 & 0 & \cdots & 1 & -1 & \cdots\end{bmatrix}\begin{bmatrix}\beta_1 \\ \beta_2 \\ \vdots \\ \beta_i \\ \beta_{i+1} \\ \vdots \\ \beta_k\end{bmatrix} - 0 = \beta_i - \beta_{i+1} = 0.$$
$H_0: \beta_2 + \beta_3 + \beta_4 = 0$:
$$R = \begin{bmatrix}0 & 1 & 1 & 1 & \cdots & 0\end{bmatrix}, \qquad r = 0,$$
$$m = \begin{bmatrix}0 & 1 & 1 & 1 & \cdots & 0\end{bmatrix}\beta - 0 = \beta_2 + \beta_3 + \beta_4 - 0 = 0.$$

Another one: $H_0: \beta_2 + \beta_3 = 1,\; \beta_4 + 3\beta_5 = 0,\; \beta_5 + \beta_6 = 0$:
$$R = \begin{bmatrix}0 & 1 & 1 & 0 & 0 & 0 & \cdots & 0 \\ 0 & 0 & 0 & 1 & 3 & 0 & \cdots & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & \cdots & 0\end{bmatrix}, \qquad r = \begin{bmatrix}1 \\ 0 \\ 0\end{bmatrix},$$
$$m = R\beta - r = \begin{bmatrix}\beta_2 + \beta_3 - 1 \\ \beta_4 + 3\beta_5 \\ \beta_5 + \beta_6\end{bmatrix} = \begin{bmatrix}0 \\ 0 \\ 0\end{bmatrix}.$$
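Writing restrictions as $(R, r)$ pairs is mechanical. A sketch of the last example with $k = 6$ (the specific numbers in `beta_ok` are my own, chosen only to satisfy all three restrictions):

```python
import numpy as np

k = 6
# H0: b2 + b3 = 1,  b4 + 3*b5 = 0,  b5 + b6 = 0   (1-based coefficient indexing)
R = np.array([
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 3, 0],
    [0, 0, 0, 0, 1, 1],
], dtype=float)
r = np.array([1.0, 0.0, 0.0])

# A beta vector constructed to satisfy all three restrictions
beta_ok = np.array([0.3, 0.4, 0.6, -3.0, 1.0, -1.0])
m = R @ beta_ok - r   # discrepancy vector: zero when the restrictions hold
```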
6.2 Wald statistic

The test starts from the discrepancy vector:
$$m = R\beta - r \overset{H_0}{=} 0.$$
Given the model $Y = X\beta + \varepsilon$, the sample counterpart is:
$$\hat m = R\hat\beta - r.$$
Its mean:
$$E[\hat m] = E[R\hat\beta - r] = RE[\hat\beta] - r = R\beta - r = m \overset{H_0}{=} 0.$$
Its variance:
$$\mathrm{Var}[\hat m] = \mathrm{Var}[R\hat\beta - r] = \mathrm{Var}[R\hat\beta] = R\,\mathrm{Var}[\hat\beta]\,R' = R\sigma^2(X'X)^{-1}R'.$$
For the last example above, under the null:
$$\beta_2 + \beta_3 - 1 = 0, \qquad \beta_4 + 3\beta_5 = 0, \qquad \beta_5 + \beta_6 = 0.$$
How can we decide how "well" these restrictions are satisfied? The answer is the Wald criterion. Divide the discrepancy by its standard deviation to obtain a rescaled discrepancy:
$$\mathrm{Var}(\hat m)^{-1/2}\hat m = \mathrm{Var}(\hat m)^{-1/2}(R\hat\beta - r).$$
The rescaled discrepancy is Gaussian with variance $I_j$. The Wald criterion is:
$$W = \hat m'\,\mathrm{Var}(\hat m)^{-1}\hat m = \hat m'[R\sigma^2(X'X)^{-1}R']^{-1}\hat m = (R\hat\beta - r)'[R\sigma^2(X'X)^{-1}R']^{-1}(R\hat\beta - r)$$
$$= \frac{(R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r)}{\sigma^2} \sim \chi^2_j,$$
being a quadratic form in $j$ standardized Gaussians. In practice $\sigma^2$ is unknown; replace it with
$$\hat\sigma^2 = \frac{1}{T-k}\sum_{t=1}^T\hat\varepsilon_t^2 = \frac{1}{T-k}\hat\varepsilon'\hat\varepsilon.$$
Moreover, recall that
$$(T-k)\frac{\hat\sigma^2}{\sigma^2} = \frac{\hat\varepsilon'\hat\varepsilon}{\sigma^2} = \frac{\varepsilon'M\varepsilon}{\sigma^2} \sim \chi^2_{T-k}.$$
Therefore:
$$\frac{(R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r)\,/\,(j\sigma^2)}{\hat\sigma^2/\sigma^2} = \frac{\chi^2_j/j}{\chi^2_{T-k}/(T-k)} \sim F_{j,T-k}.$$
Simplifying:
$$\frac{(R\hat\beta - r)'[R\hat\sigma^2(X'X)^{-1}R']^{-1}(R\hat\beta - r)}{j} \sim F_{j,T-k}.$$
This allows us to test the restrictions.
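The feasible F statistic in the last display can be computed directly. A sketch on simulated data in which the tested restrictions are true in the DGP (all names and numbers are mine), so the statistic should be a moderate draw from $F_{2, T-k}$:

```python
import numpy as np

rng = np.random.default_rng(6)
T, k = 120, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
beta_true = np.array([1.0, 0.0, 0.0])       # H0: b2 = b3 = 0 holds in the DGP
y = X @ beta_true + rng.normal(size=T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e_hat = y - X @ beta_hat
sigma2_hat = e_hat @ e_hat / (T - k)

R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
r = np.zeros(2)
j = R.shape[0]

m_hat = R @ beta_hat - r
V = R @ np.linalg.inv(X.T @ X) @ R.T        # R (X'X)^{-1} R'
F = (m_hat @ np.linalg.solve(sigma2_hat * V, m_hat)) / j
```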
6.3 Restricted least squares

The restricted model can also be estimated via least squares. The solution is:
$$\hat\beta_{RLS} = \hat\beta_{OLS} - (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(R\hat\beta_{OLS} - r).$$
It solves the constrained problem
$$\hat\beta_{RLS} = \arg\min_\beta S(\beta), \qquad S(\beta) = (Y - X\beta)'(Y - X\beta), \quad \text{s.t. } R\beta = r,$$
with Lagrangian
$$L(\beta, \lambda) = S(\beta) + 2\lambda'\underbrace{(R\beta - r)}_{m}.$$
Let's solve this. The first-order condition with respect to $\beta$:
$$\frac{\partial L(\beta, \lambda)}{\partial\beta} = -2X'(Y - X\hat\beta_R) + 2R'\lambda = 0$$
$$\Rightarrow -X'Y + X'X\hat\beta_R + R'\lambda = 0$$
$$\Rightarrow X'X\hat\beta_R = X'Y - R'\lambda$$
$$\Rightarrow \hat\beta_R = (X'X)^{-1}X'Y - (X'X)^{-1}R'\lambda = \hat\beta - (X'X)^{-1}R'\lambda.$$
The first-order condition with respect to $\lambda$:
$$\frac{\partial L(\beta, \lambda)}{\partial\lambda} = 2(R\hat\beta_R - r) = 0 \;\Rightarrow\; R\hat\beta_R = r.$$
Now substitute:
$$R\big(\hat\beta - (X'X)^{-1}R'\lambda\big) = r \;\Rightarrow\; R\hat\beta - R(X'X)^{-1}R'\lambda = r \;\Rightarrow\; R\hat\beta - r = R(X'X)^{-1}R'\lambda$$
$$\Rightarrow \lambda = [R(X'X)^{-1}R']^{-1}\underbrace{(R\hat\beta - r)}_{\hat m}.$$
Putting this back into the first equation:
$$\hat\beta_R = \hat\beta - \underbrace{(X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}}_{C}\underbrace{(R\hat\beta - r)}_{\hat m}.$$
Consider the properties of $\hat\beta_R$ under $H_0: R\beta - r = 0$.

Unbiasedness:
$$E[\hat\beta_R] = E[\hat\beta - C(R\hat\beta - r)] = E[\hat\beta] - E[C(R\hat\beta - r)]$$
$$= E[(X'X)^{-1}X'\varepsilon + \beta] - C\underbrace{E[R\hat\beta - r]}_{\overset{H_0}{=}0} = \beta + 0 - C\,0 = \beta.$$
If instead $H_0$ does not hold,
$$E[\hat\beta_R] = \beta - C\,E[R\hat\beta - r] = \beta - C(R\beta - r) \neq \beta,$$
so it is biased.

Variance of $\hat\beta_R$: writing
$$\hat\beta_R = \hat\beta - \underbrace{C}_{(X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}}\underbrace{(R\hat\beta - r)}_{\hat m} = (I - CR)\hat\beta + Cr,$$
which gives $\mathrm{Var}(\hat\beta_R) = (I - CR)\,\mathrm{Var}(\hat\beta)\,(I - CR)'$, a smaller variance than that of $\hat\beta_{OLS}$.

Remark 4. Under $H_0$ the $\hat\beta_{RLS}$ estimator is unbiased and more efficient than $\hat\beta_{OLS}$. However, if $H_0$ is not satisfied, $\hat\beta_{RLS}$ is biased (while $\hat\beta_{OLS}$ is unbiased).
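The RLS formula can be applied in a few lines; by construction the resulting estimate satisfies the restriction exactly. A sketch with one restriction that is true in the simulated DGP (names and numbers are my own):

```python
import numpy as np

rng = np.random.default_rng(7)
T, k = 100, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=T)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Impose the restriction b2 + b3 = 5 (which holds in the DGP)
R = np.array([[0.0, 1.0, 1.0]])
r = np.array([5.0])

XtX_inv = np.linalg.inv(X.T @ X)
C = XtX_inv @ R.T @ np.linalg.inv(R @ XtX_inv @ R.T)
beta_rls = beta_ols - C @ (R @ beta_ols - r)   # RLS correction of the OLS estimate
```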
7 The generalized linear regression model

Recall the CLRM assumptions:

A1 - X is a matrix of full rank -> OLS estimator exists
A2 - X is deterministic or fixed in repeated samples
A3 - E[ε] = 0
A4 - E[εε'] = σ²I_T
A5 - ε is Gaussian

Under these assumptions:
$$\hat\beta_{OLS} \sim N\big(\beta, \sigma^2(X'X)^{-1}\big)$$
and:
$$\hat\beta_{OLS} \sim t\big(\beta, \hat\sigma^2(X'X)^{-1}\big)$$
can be used for inference.

Now we start asking what happens as we remove some assumptions. Today's lecture focuses on A4, sphericity:
$$E[\varepsilon\varepsilon'] = \sigma^2 I_T,$$
which under A3 gives
$$\mathrm{Var}(\varepsilon) = E[\varepsilon\varepsilon'] - E[\varepsilon]E[\varepsilon'] = \sigma^2 I_T.$$
$$\sigma^2 I_T = \begin{bmatrix}\sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & & \\ \vdots & & \ddots & \\ 0 & & & \sigma^2\end{bmatrix}, \qquad \mathrm{Var}(\varepsilon_t) = \sigma^2, \qquad \mathrm{Cov}(\varepsilon_t, \varepsilon_s) = 0.$$
We are going to generalize this assumption to:
$$E[\varepsilon\varepsilon'] = \sigma^2\Omega = \begin{bmatrix}\mathrm{Var}(\varepsilon_1) & \mathrm{Cov}(\varepsilon_1, \varepsilon_2) & \mathrm{Cov}(\varepsilon_1, \varepsilon_3) & \cdots & \mathrm{Cov}(\varepsilon_1, \varepsilon_T) \\ \mathrm{Cov}(\varepsilon_2, \varepsilon_1) & \mathrm{Var}(\varepsilon_2) & & & \\ \mathrm{Cov}(\varepsilon_3, \varepsilon_1) & & \ddots & & \\ \vdots & & & & \\ \mathrm{Cov}(\varepsilon_T, \varepsilon_1) & & & & \mathrm{Var}(\varepsilon_T)\end{bmatrix}.$$
More generally:
$$E[\varepsilon\varepsilon'] = \sigma^2\Omega.$$
The model with assumption $E[\varepsilon\varepsilon'] = \sigma^2\Omega$ is called the "generalized" linear regression model. Setting $\Omega = I$ gets us back to the CLRM. We will:
1. Ask how the OLS estimator performs under this alternative set of assumptions.
2. Propose a more efficient estimator under these assumptions.

The OLS estimator is still unbiased:
$$\hat\beta_{OLS} = (X'X)^{-1}X'Y = (X'X)^{-1}X'\varepsilon + \beta,$$
$$E[\hat\beta_{OLS}] = E[(X'X)^{-1}X'\varepsilon + \beta] = (X'X)^{-1}X'\underbrace{E[\varepsilon]}_{=0} + \beta = 0 + \beta = \beta.$$
Its variance, however, changes. Therefore we have:
$$\mathrm{Var}[\hat\beta_{OLS}] = (X'X)^{-1}X'\sigma^2\Omega X(X'X)^{-1} \neq \mathrm{Var}[\hat\beta_{OLS}\,|\,CLRM] = \sigma^2(X'X)^{-1}.$$
Consider
$$\sigma^2\Omega = \begin{bmatrix}\mathrm{Var}(\varepsilon_1) & \mathrm{Cov}(\varepsilon_1, \varepsilon_2) & \mathrm{Cov}(\varepsilon_1, \varepsilon_3) & \cdots & \mathrm{Cov}(\varepsilon_1, \varepsilon_T) \\ \mathrm{Cov}(\varepsilon_2, \varepsilon_1) & \mathrm{Var}(\varepsilon_2) & & & \\ \mathrm{Cov}(\varepsilon_3, \varepsilon_1) & & \ddots & & \\ \vdots & & & & \\ \mathrm{Cov}(\varepsilon_T, \varepsilon_1) & & & & \mathrm{Var}(\varepsilon_T)\end{bmatrix}.$$
This matrix has
$$\frac{T^2 - T}{2} + T = \frac{T^2 + T}{2} = \frac{T(T+1)}{2}$$
distinct elements, too many to estimate with $T$ observations. Instead, we try to estimate the $k\times k$ matrix
$$X'\sigma^2\Omega X,$$
which has $\frac{k(k+1)}{2}$ distinct elements.

White estimator (1980), valid under heteroskedasticity:
$$\widehat{X'\sigma^2\Omega X} = X'\begin{bmatrix}\hat\varepsilon_1^2 & & 0 \\ & \ddots & \\ 0 & & \hat\varepsilon_T^2\end{bmatrix}X = \sum_{t=1}^T\hat\varepsilon_t^2x_tx_t'.$$
Newey-West estimator (1987), valid also under autocorrelation up to lag $\bar p$:
$$\widehat{X'\sigma^2\Omega X} = \sum_{t=1}^T\hat\varepsilon_t^2x_tx_t' + \sum_{p=1}^{\bar p}\left(1 - \frac{p}{1+\bar p}\right)\sum_{t=p+1}^T\hat\varepsilon_t\hat\varepsilon_{t-p}\big(x_tx_{t-p}' + x_{t-p}x_t'\big).$$
These give the estimated variance
$$\widehat{\mathrm{Var}}[\hat\beta_{OLS}] = (X'X)^{-1}\,\widehat{X'\sigma^2\Omega X}\,(X'X)^{-1}$$
for inference.
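The White sandwich is straightforward to assemble. A sketch on simulated heteroskedastic data (my own example; the DGP makes the error variance grow with $x_t^2$, so the robust standard error for the slope should exceed the conventional one):

```python
import numpy as np

rng = np.random.default_rng(8)
T = 300
x = rng.normal(size=T)
X = np.column_stack([np.ones(T), x])
e = rng.normal(size=T) * np.sqrt(0.5 + x**2)   # heteroskedastic errors
y = X @ np.array([1.0, 2.0]) + e

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e_hat = y - X @ beta_hat

# White (1980): (X'X)^{-1} (sum_t e_t^2 x_t x_t') (X'X)^{-1}
meat = (X * e_hat[:, None] ** 2).T @ X
XtX_inv = np.linalg.inv(X.T @ X)
var_white = XtX_inv @ meat @ XtX_inv
se_white = np.sqrt(np.diag(var_white))

# Conventional (homoskedastic) standard errors, for comparison
sigma2_hat = e_hat @ e_hat / (T - X.shape[1])
se_conv = np.sqrt(np.diag(sigma2_hat * XtX_inv))
```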
7.2 Efficient estimation via GLS

There is an alternative to the OLS estimator which is actually more efficient. Since $\Omega$ is positive definite, we can factor
$$\Omega^{-1} = P'P, \qquad P = \Omega^{-1/2},$$
and transform the model: $Y^* = PY$, $X^* = PX$, $\varepsilon^* = P\varepsilon$, so that $\mathrm{Var}(\varepsilon^*) = \sigma^2P\Omega P' = \sigma^2I$. Since this is the OLS estimator of a model that satisfies the CLRM assumptions, it is BLUE for $\beta$. We label this estimator the GLS estimator:
$$\hat\beta_{GLS} = \hat\beta^*_{OLS}.$$
In particular:
$$\hat\beta_{GLS} = (X^{*\prime}X^*)^{-1}X^{*\prime}Y^* = (X'P'PX)^{-1}(PX)'PY = (X'P'PX)^{-1}X'P'PY = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}Y.$$
The GLS estimator has
$$E[\hat\beta_{GLS}] = \beta, \qquad \mathrm{Var}[\hat\beta_{GLS}] = \sigma^2(X'\Omega^{-1}X)^{-1},$$
and it is BLUE for $\beta$, so it is better than any other linear unbiased estimator, including the OLS estimator.

In practice $\Omega$ is unknown. Generally, one assumes a functional form $\Omega(\theta)$ for the error covariance and estimates $\theta$, for example by Maximum Likelihood, then plugs it in:
$$\hat\beta_{FGLS} = \big(X'\widehat{\Omega^{-1}(\theta)}X\big)^{-1}X'\widehat{\Omega^{-1}(\theta)}Y,$$
where F stands for "feasible". The FGLS is not BLUE; it is actually both biased and nonlinear. However,
$$\hat\beta_{FGLS} \to \hat\beta_{GLS}$$
asymptotically.
A first example is heteroskedasticity with
$$\sigma_t^2 = \sigma^2x_t^2,$$
so that
$$\sigma^2\Omega = \begin{bmatrix}E[\varepsilon_1^2] = \sigma^2x_1^2 & & 0 \\ & \ddots & \\ 0 & & E[\varepsilon_T^2] = \sigma^2x_T^2\end{bmatrix} = \sigma^2\begin{bmatrix}x_1^2 & & 0 \\ & \ddots & \\ 0 & & x_T^2\end{bmatrix}.$$
Transform the model:
$$\frac{1}{\sqrt{x_t^2}}y_t = \alpha\frac{1}{\sqrt{x_t^2}} + \beta\frac{x_t}{\sqrt{x_t^2}} + \frac{1}{\sqrt{x_t^2}}\varepsilon_t,$$
$$y_t^* = \alpha x_t^* + \beta + \varepsilon_t^*,$$
with $y_t^* = y_t/x_t$, $x_t^* = 1/x_t$, and
$$\mathrm{Var}(\varepsilon_t^*) = \mathrm{Var}\Big(\frac{1}{\sqrt{x_t^2}}\varepsilon_t\Big) = \frac{1}{x_t^2}\sigma^2x_t^2 = \sigma^2.$$
This is FGLS with $\theta = \{\sigma^2\}$ and
$$P = \begin{bmatrix}1/x_1 & & & 0 \\ & 1/x_2 & & \\ & & \ddots & \\ 0 & & & 1/x_T\end{bmatrix}.$$
This example with heteroskedasticity is a.k.a. Weighted Least Squares.
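The transformation above amounts to dividing every equation through by $x_t$. A sketch on simulated data (my own; $x_t$ is kept away from zero so the weights are well defined):

```python
import numpy as np

rng = np.random.default_rng(9)
T = 400
x = np.abs(rng.normal(size=T)) + 0.5       # keep x_t away from zero
X = np.column_stack([np.ones(T), x])
sigma = 0.8
e = sigma * x * rng.normal(size=T)         # Var(e_t) = sigma^2 * x_t^2
y = X @ np.array([1.0, 2.0]) + e

# WLS = OLS on the model divided through by x_t
Xs = X / x[:, None]                        # columns become [1/x_t, 1]
ys = y / x
beta_wls = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)   # recovers [alpha, beta]
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)       # unweighted, for comparison
```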
A second example is autocorrelated errors. If $\varepsilon_t = \rho\varepsilon_{t-1} + u_t$ with $u_t$ white noise, then
$$\sigma^2\Omega = \frac{\sigma_u^2}{1-\rho^2}\begin{bmatrix}1 & \rho & \cdots & \rho^{T-1} \\ \rho & 1 & & \vdots \\ \vdots & & \ddots & \\ \rho^{T-1} & \cdots & & 1\end{bmatrix} = \sigma^2\Omega(\theta).$$
This is FGLS with $\theta = \{\rho, \sigma_u^2\}$. Cochrane-Orcutt procedure:
$$y_t = \alpha + \beta x_t + \varepsilon_t \;\to\; \hat\beta \;\to\; \hat\varepsilon_t,$$
$$\hat\varepsilon_t = \rho\hat\varepsilon_{t-1} + u_t \;\to\; \hat\rho,$$
and iterate on the quasi-differenced equation, obtained by subtracting $\hat\rho$ times the lagged model:
$$y_t - \hat\rho y_{t-1} = \alpha(1-\hat\rho) + \beta(x_t - \hat\rho x_{t-1}) + (\varepsilon_t - \hat\rho\varepsilon_{t-1}).$$
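The iteration just described can be sketched as follows (my own illustration on simulated AR(1) errors; the loop alternates between estimating the slope coefficients and the autocorrelation of the residuals):

```python
import numpy as np

rng = np.random.default_rng(10)
T, rho_true = 500, 0.6
x = rng.normal(size=T)
u = rng.normal(size=T)
e = np.zeros(T)
for t in range(1, T):
    e[t] = rho_true * e[t - 1] + u[t]       # AR(1) errors
y = 1.0 + 2.0 * x + e

def ols(Xm, ym):
    return np.linalg.solve(Xm.T @ Xm, Xm.T @ ym)

X = np.column_stack([np.ones(T), x])
b = ols(X, y)                                # initial OLS fit
rho_hat = 0.0
for _ in range(10):                          # Cochrane-Orcutt iterations
    resid = y - X @ b
    rho_hat = (resid[1:] @ resid[:-1]) / (resid[:-1] @ resid[:-1])
    ys = y[1:] - rho_hat * y[:-1]            # quasi-differenced data
    Xs = np.column_stack([(1 - rho_hat) * np.ones(T - 1),
                          x[1:] - rho_hat * x[:-1]])
    b = ols(Xs, ys)                          # b = [alpha, beta] directly
```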
7.3.3 Conditional heteroskedasticity and models of time-varying volatilities

Consider
$$y_t = \alpha + \beta x_t + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma_t^2),$$
with $\varepsilon_t = h_tv_t$ and $v_t$ iid standard Gaussian. This implies:
$$E[\varepsilon_t^2\,|\,I_{t-1}] = E[h_t^2v_t^2\,|\,I_{t-1}] = h_t^2E[v_t^2\,|\,I_{t-1}] = h_t^2,$$
which shows that $h_t^2$ is the conditional variance. In the ARCH/GARCH case the squared errors follow an autoregressive (e.g. AR(1)) process; this model can be estimated with ML or with a recursive approach similar to Cochrane-Orcutt.

Stochastic volatility:
$$h_t^2 = \gamma_0 + \gamma_1h_{t-1}^2 + \eta_t$$
(coefficient names $\gamma_0, \gamma_1, \eta_t$ reconstructed here, as the original symbols were lost). Note that, differently from GARCH, $h_t^2$ has its own disturbance. It is estimated with ML or Bayesian methods.
Part II
Random regressors and
endogeneity
8 Independent regressors
So far we assumed:
$$Y = \underbrace{X}_{\text{fixed}}\underbrace{\beta}_{\text{unknown}} + \underbrace{\varepsilon}_{\text{random}}.$$
However, in economics $X$ is random. There is a need to specify the relation between the two random objects in the model. The easiest case is one in which $X$ and $\varepsilon$ are independent:

A1 - X is a matrix of full rank -> OLS estimator exists
A2bis - X is random
A3bis - E[ε|X] = 0
A4bis - E[εε'|X] = σ²I_T
A5bis - ε|X is Gaussian

This set of assumptions allows $X$ to be random (A2bis) while keeping it independent from $\varepsilon$ (because of A3bis-A5bis). Under these assumptions, it turns out that the OLS estimator is still BLUE.
We compute the unconditional moments from the conditional ones. For the mean:
$$E[\varepsilon\,|\,X] = 0 \implies E[\varepsilon] = E_X[E[\varepsilon\,|\,X]] = E_X[0] = 0,$$
$$E[\hat\beta] = E_X[E[\hat\beta\,|\,X]] = E_X[\beta] = \beta;$$
therefore it is unbiased. Then we also have the variance:
$$\mathrm{Var}(\hat\beta\,|\,X) = \sigma^2(X'X)^{-1}X'X(X'X)^{-1} = \sigma^2(X'X)^{-1}.$$
The LIE (Law of Iterated Expectations) is: $E[Z] = E_X[E[Z\,|\,X]]$. Moreover, the pivotal quantities
$$\frac{\hat\beta_j - \beta_j}{\sqrt{\sigma^2[(X'X)^{-1}]_{jj}}}\,\Big|\,X \sim N(0, 1), \qquad \frac{\dfrac{\hat\beta_j - \beta_j}{\sqrt{\sigma^2[(X'X)^{-1}]_{jj}}}}{\sqrt{\dfrac{\hat\varepsilon'\hat\varepsilon/\sigma^2}{T-k}}}\,\Big|\,X \sim t_{T-k},$$
and
$$\frac{(R\hat\beta - r)'[R\hat\sigma^2(X'X)^{-1}R']^{-1}(R\hat\beta - r)}{j}\,\Big|\,X \sim F_{j,T-k},$$
have densities that do not depend on $X$; hence they hold for every $X$, which means that once we integrate the $X$ out they remain the same. This means that we can make inference as usual, using the pivotal quantities above.
Finally, the same reasoning applies to the Gauss-Markov theorem: conditionally on $X$, for any other linear unbiased estimator $\tilde\beta$,
$$\mathrm{Var}(\tilde\beta\,|\,X) = \sigma^2DD' + \underbrace{\sigma^2(X'X)^{-1}}_{\mathrm{Var}(\hat\beta_{OLS}|X)}.$$
9 Asymptotic theory and large sample properties
of the OLS estimator
So far $X$ was random, but independent from $\varepsilon$:

A1 - X is a matrix of full rank -> OLS estimator exists
A2bis - X is random
A3bis - E[ε|X] = 0
A4bis - E[εε'|X] = σ²I_T
A5bis - ε|X is Gaussian, i.e. ε|X ~ N(0, σ²I_T)

What if $X$ is not independent from $\varepsilon$? The small sample results break down. For example:
$$E[\hat\beta] = \beta + E[(X'X)^{-1}X'\varepsilon] = \beta + E_X[E[(X'X)^{-1}X'\varepsilon\,|\,X]] = \beta + E_X[(X'X)^{-1}X'E[\varepsilon\,|\,X]],$$
and the last term is no longer zero in general. We must resort to asymptotics. As a silver lining, the Gaussianity assumption will no longer be necessary. Actually, one could invoke asymptotics even with independence between $X$ and $\varepsilon$, just on the grounds of removing A5, which can be viewed as strong. In that case, removing A5 while keeping A3bis and A4bis means that unbiasedness remains valid, but normality holds only asymptotically.

The new assumption set is:

A1 - X is full rank
A2ter - x_t is iid, not contemporaneously correlated with ε_t, plus regularity conditions (which ensure Laws of Large Numbers [LLN] and Central Limit Theorems [CLT] apply)
A3 - E[ε] = 0
A4 - E[εε'] = σ²I_T. Note that A3 and A4 make ε_t zero-mean, homoskedastic, and serially uncorrelated.
A5 - is no longer necessary
9.1 Asymptotic theory

Asymptotic theory studies what happens to random variables (estimators) as the sample size increases: $T \to \infty$ (or $N \to \infty$). Consistency means $\hat\beta \overset{p}{\to} \beta$. An implication of the LLN and the definition of consistency is that the sample mean is a consistent estimator of the population mean $\mu$: it is an estimator which converges in probability to $\mu$. We then move the interest to distributions.
Theorem 11 (the delta method). Let a random variable $\hat\theta$ be such that
$$\sqrt{n}(\hat\theta - \theta) \overset{d}{\to} \psi,$$
and let $g(\cdot)$ be continuously differentiable in a neighborhood of $\theta$. Then
$$\sqrt{n}\big(g(\hat\theta) - g(\theta)\big) \overset{d}{\to} G\psi,$$
with $G = \frac{\partial g(u)}{\partial u}\big|_{u=\theta}$. For example, if $\psi \sim N(0, V)$ then
$$\sqrt{n}\big(g(\hat\theta) - g(\theta)\big) \overset{d}{\to} N(0, GVG').$$
9.2.1 Consistency
$$\hat\beta = (X'X)^{-1}X'Y = \beta + (X'X)^{-1}X'\varepsilon = \beta + \left(\frac{X'X}{T}\right)^{-1}\frac{X'\varepsilon}{T}.$$
What are these quantities?
$$\frac{X'X}{T} = \frac{1}{T}\sum_{t=1}^Tx_tx_t' \overset{p}{\to} E[x_tx_t'] = Q_{xx},$$
because $x_tx_t'$ is iid and therefore a LLN applies. Compactly, $\mathrm{plim}\,\frac{X'X}{T} = Q_{xx}$. Also
$$\frac{X'\varepsilon}{T} = \frac{1}{T}\sum_{t=1}^Tx_t\varepsilon_t \overset{p}{\to} E[x_t\varepsilon_t] = 0,$$
because $x_t\varepsilon_t$ is iid and therefore a LLN applies, and because $E[x_t\varepsilon_t] = 0$ due to the fact that $E[\varepsilon_t] = 0$ and $\mathrm{Cov}(x_t, \varepsilon_t) = 0$. Compactly, $\mathrm{plim}\,\frac{X'\varepsilon}{T} = 0$. Now, using the continuous mapping theorem:
$$\mathrm{plim}\,\hat\beta = \beta + \left(\mathrm{plim}\,\frac{X'X}{T}\right)^{-1}\mathrm{plim}\,\frac{X'\varepsilon}{T} = \beta + Q_{xx}^{-1}\cdot 0 = \beta.$$
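Consistency is easy to visualize by letting $T$ grow. A sketch (my own simulation; errors are deliberately non-Gaussian, which is allowed since Gaussianity is not needed here):

```python
import numpy as np

rng = np.random.default_rng(11)
beta = np.array([1.0, -2.0])

def ols_error(T):
    """Max absolute estimation error of OLS at sample size T."""
    X = np.column_stack([np.ones(T), rng.normal(size=T)])
    y = X @ beta + rng.standard_t(df=5, size=T)   # non-Gaussian errors
    b = np.linalg.solve(X.T @ X, X.T @ y)
    return np.max(np.abs(b - beta))

errors = {T: ols_error(T) for T in (50, 5000, 500000)}
```

As the sample size grows, the estimation error shrinks toward zero, in line with $\mathrm{plim}\,\hat\beta = \beta$.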
Then we have, by applying a CLT together with the continuous mapping theorem and Slutsky's theorem:
$$\sqrt{T}(\hat\beta - \beta) = \underbrace{\left(\frac{X'X}{T}\right)^{-1}}_{\overset{p}{\to}\,Q_{xx}^{-1}}\underbrace{\frac{X'\varepsilon}{\sqrt{T}}}_{\overset{d}{\to}\,N(0, Q_{x\varepsilon})},$$
$$\sqrt{T}(\hat\beta - \beta) \overset{d}{\to} Q_{xx}^{-1}N(0, Q_{x\varepsilon}),$$
$$\sqrt{T}(\hat\beta - \beta) \overset{d}{\to} N\big(0, Q_{xx}^{-1}Q_{x\varepsilon}Q_{xx}^{-1}\big), \tag{6}$$
which is the asymptotic distribution of $\sqrt{T}(\hat\beta - \beta)$. Rearranging gives the asymptotic distribution of $\hat\beta$:
$$\hat\beta \overset{d}{\to} N\left(\beta, \frac{1}{T}Q_{xx}^{-1}Q_{x\varepsilon}Q_{xx}^{-1}\right). \tag{7}$$
The variance $\frac{1}{T}Q_{xx}^{-1}Q_{x\varepsilon}Q_{xx}^{-1}$ goes to 0 as $T$ goes to infinity, which is not surprising since we know that $\hat\beta \overset{p}{\to} \beta$. The representation (7) is useful when constructing test statistics and standard errors. However, for theoretical purposes the representation (6) is more useful, as it retains non-degenerate limits as the sample size diverges.
9.3 Asymptotic variance estimation

Can we estimate the asymptotic variance? We are going to consider two cases, which depend on the assumption about the conditional variance of the errors. A consistent estimator of $Q_{xx}^{-1}$ is $\left(\frac{X'X}{T}\right)^{-1}$. We also need to consistently estimate the matrix $Q_{x\varepsilon}$. In general:
$$Q_{x\varepsilon} = \mathrm{Var}[x_t\varepsilon_t] = E[x_tx_t'\varepsilon_t^2] - E[x_t\varepsilon_t]E[x_t\varepsilon_t]' = E[x_tx_t'\varepsilon_t^2].$$
Under conditional homoskedasticity, $E[\varepsilon_t^2\,|\,x_t] = \sigma^2$, so²
$$Q_{x\varepsilon} = E[x_tx_t'\varepsilon_t^2] = E[E[x_tx_t'\varepsilon_t^2\,|\,x_t]] = E[x_tx_t'E[\varepsilon_t^2\,|\,x_t]] = E[x_tx_t'\sigma^2] = \sigma^2E[x_tx_t'] = \sigma^2Q_{xx}.$$
So we have that
$$\mathrm{AsyVar}(\hat\beta) = Q_{xx}^{-1}\sigma^2Q_{xx}Q_{xx}^{-1} = \sigma^2Q_{xx}^{-1}$$
and
$$\sqrt{T}(\hat\beta - \beta) \overset{d}{\to} N\big(0, \sigma^2Q_{xx}^{-1}\big), \tag{8}$$
and $\sigma^2Q_{xx}^{-1}$ can be estimated consistently via:
$$\hat\sigma^2\left(\frac{X'X}{T}\right)^{-1} \overset{p}{\to} \sigma^2Q_{xx}^{-1}.$$
That is,
$$\widehat{\mathrm{AsyVar}}(\hat\beta) = \hat\sigma^2\left(\frac{X'X}{T}\right)^{-1}.$$
Going back to (8) and plugging in this consistent estimator of the asymptotic variance we have:
$$\sqrt{T}(\hat\beta - \beta) \overset{d}{\to} N\left(0, \hat\sigma^2\left(\frac{X'X}{T}\right)^{-1}\right);$$
rearranging:
$$\hat\beta \overset{d}{\to} N\big(\beta, \hat\sigma^2(X'X)^{-1}\big),$$

Footnote 2: Actually, the slightly weaker assumption $\mathrm{Cov}(x_tx_t', \varepsilon_t^2) = 0$ would suffice: $E[x_tx_t'\varepsilon_t^2] = E[x_tx_t']E[\varepsilon_t^2] = \sigma^2Q_{xx}$.
which is the usual formula! Note that
$$\widehat{\mathrm{Var}}(\hat\beta\,|\,X) = \hat\sigma^2(X'X)^{-1} = \frac{1}{T}\widehat{\mathrm{AsyVar}}(\hat\beta)$$
goes to 0 as $T \to \infty$:
$$\underbrace{\frac{1}{T}}_{\to 0}\underbrace{\widehat{\mathrm{AsyVar}}(\hat\beta)}_{\to\,\sigma^2Q_{xx}^{-1}}.$$
Let us now consider the more general case with conditional heteroskedasticity. In this case $\varepsilon_t$ is still independent across time, but it is not identically distributed, so $E[\varepsilon_t^2\,|\,x_t] = \sigma^2$ does not apply. We have
$$Q_{x\varepsilon} = E[x_tx_t'\varepsilon_t^2];$$
a consistent estimator is
$$\hat Q_{x\varepsilon} = \frac{1}{T}\sum_{t=1}^Tx_tx_t'\hat\varepsilon_t^2 = \frac{1}{T}\widehat{X'\sigma^2\Omega X},$$
since
$$\frac{1}{T}\sum_{t=1}^Tx_tx_t'\hat\varepsilon_t^2 = \frac{1}{T}\sum_{t=1}^Tx_tx_t'\varepsilon_t^2 + \frac{1}{T}\sum_{t=1}^Tx_tx_t'(\hat\varepsilon_t^2 - \varepsilon_t^2),$$
where $\frac{1}{T}\sum_{t=1}^Tx_tx_t'\varepsilon_t^2 \overset{p}{\to} Q_{x\varepsilon}$ and $\frac{1}{T}\sum_{t=1}^Tx_tx_t'(\hat\varepsilon_t^2 - \varepsilon_t^2) \overset{p}{\to} 0$. Therefore we have:
$$\widehat{\mathrm{AsyVar}}(\hat\beta) = \left(\frac{X'X}{T}\right)^{-1}\frac{1}{T}\widehat{X'\sigma^2\Omega X}\left(\frac{X'X}{T}\right)^{-1} \overset{p}{\to} \mathrm{AsyVar}(\hat\beta) = Q_{xx}^{-1}Q_{x\varepsilon}Q_{xx}^{-1}.$$
Plugging this estimator into (6) gives:
$$\sqrt{T}(\hat\beta - \beta) \overset{d}{\to} N\left(0, \left(\frac{X'X}{T}\right)^{-1}\frac{1}{T}\widehat{X'\sigma^2\Omega X}\left(\frac{X'X}{T}\right)^{-1}\right),$$
$$\hat\beta \overset{d}{\to} N\left(\beta, (X'X)^{-1}\,\widehat{X'\sigma^2\Omega X}\,(X'X)^{-1}\right), \tag{9}$$
which is the usual OLS expression with the White heteroskedasticity-consistent estimator. Note again that
$$\widehat{\mathrm{Var}}(\hat\beta\,|\,X) = (X'X)^{-1}\,\widehat{X'\sigma^2\Omega X}\,(X'X)^{-1} = \frac{1}{T}\widehat{\mathrm{AsyVar}}(\hat\beta),$$
which of course goes to 0 as $T \to \infty$, since $\widehat{\mathrm{AsyVar}}(\hat\beta) \overset{p}{\to} Q_{xx}^{-1}Q_{x\varepsilon}Q_{xx}^{-1}$.

A similar reasoning could be applied in a case with $\varepsilon_t$ autocorrelated. However, this would require invoking different CLTs and LLNs to account for dependence. The resulting asymptotic distribution would be as in (9), with $\widehat{X'\sigma^2\Omega X}$ based on a HAC (heteroskedasticity and autocorrelation consistent) estimator such as Newey-West.

Remark 12. The asymptotic distributions derived in this section are the same we derived in the small sample using normality of the errors and fixed regressors, and this holds also in the case of autocorrelated and heteroskedastic disturbances. In other words, if we have a large sample we can resort to these asymptotic arguments to avoid assuming that the errors are Gaussian, and also to allow for random regressors. The assumptions we need in order to resort to the asymptotic argument are that the random regressors are iid and not contemporaneously correlated with the errors of the model, plus some regularity conditions.
10 Endogeneity and instrumental variables

We have seen that under the assumption $\mathrm{Cov}(x_t, \varepsilon_t) = 0$ "standard" inference is still valid. There are many applications in which such an assumption is untenable. A first example is omitted variables. Sometimes it happens that:
$$DGP: \quad y = X_1\beta_1 + X_2\beta_2 + \varepsilon, \tag{10}$$
$$MODEL: \quad y = X_1\beta_1 + v,$$
where we have
$$v = X_2\beta_2 + \varepsilon,$$
which gives:
$$\hat\beta_1 = (X_1'X_1)^{-1}X_1'y = (X_1'X_1)^{-1}X_1'\{X_1\beta_1 + v\} = \beta_1 + \left(\frac{1}{T}X_1'X_1\right)^{-1}\underbrace{\frac{1}{T}X_1'v}_{\to E[x_{1t}v_t]},$$
and
$$E[x_{1t}v_t] = E[x_{1t}(x_{2t}\beta_2 + \varepsilon_t)] = E[x_{1t}x_{2t}]\beta_2 \neq 0.$$
A second example is measurement error. Suppose the relevant regressor is $x_i$ = Education, but we only observe $\tilde x_i$ = Schooling, which is an imperfect proxy for education:
$$\tilde x_i = x_i + u_i.$$
So the model we can estimate is:
$$y_i = \beta_1 + \beta_2x_i + \varepsilon_i = \beta_1 + \beta_2(\tilde x_i - u_i) + \varepsilon_i = \beta_1 + \beta_2\tilde x_i + w_i,$$
where the composite error $w_i = \varepsilon_i - \beta_2u_i$ is correlated with $\tilde x_i$ through $u_i$.

A third example is treatment evaluation. Consider
$$y_i = \beta_0 + \beta_1x_i + \varepsilon_i,$$
which implies
$$E[y_i\,|\,x_i] = \beta_0 + \beta_1x_i.$$
This allows us to answer questions such as: what is the expected effect of this drug? One answer is the observable difference $E[y_i\,|\,x_i = 1] - E[y_i\,|\,x_i = 0]$. Note however that this answer is not necessarily satisfactory. We cannot limit ourselves to computing the difference between the expected health of those who took the drug, $E[y_i\,|\,x_i = 1]$, and those who did not, $E[y_i\,|\,x_i = 0]$, because there might be other characteristics of the individuals playing some other explanatory role. What we really are after is
$$E[y_i^1\,|\,x_i = 1] - E[y_i^0\,|\,x_i = 1], \tag{12}$$
where
$$y_i^0 = \text{health status of individual } i \text{ when he does not take the drug},$$
$$y_i^1 = \text{health status of individual } i \text{ when he takes the drug},$$
irrespective of whether they actually took the drug or not. This is Rubin's concept of potential outcomes. Expression (12) then denotes the difference between the expected health of those who took the drug, $E[y_i\,|\,x_i = 1] = E[y_i^1\,|\,x_i = 1]$, and the *same* people (therefore people for whom $x_i = 1$) under the hypothetical scenario that they did not take it, $E[y_i^0\,|\,x_i = 1]$. Since we do not observe $y_i^0\,|\,x_i = 1$ (this is called a counterfactual), this creates a complication. We have:
$$E[y_i\,|\,x_i = 1] - E[y_i\,|\,x_i = 0] = \underbrace{E[y_i^1\,|\,x_i = 1] - E[y_i^0\,|\,x_i = 1]}_{\text{desired quantity}} + \underbrace{E[y_i^0\,|\,x_i = 1] - E[y_i^0\,|\,x_i = 0]}_{\text{selection bias}}.$$
The steps above show that the desired quantity $E[y_i^1\,|\,x_i = 1] - E[y_i^0\,|\,x_i = 1]$ coincides with the observable quantity $E[y_i\,|\,x_i = 1] - E[y_i\,|\,x_i = 0]$ only if the selection bias term is 0. This term represents the effect of how the patients taking the drug are selected. Note that if they are selected randomly, that is, independently from the potential outcomes, we have that $E[y_i^0\,|\,x_i = 1] = E[y_i^0\,|\,x_i = 0] = E[y_i^0]$ and the selection bias term is 0. But if they are not selected randomly (for example, the choice to take the drug is on a voluntary basis, or the drug is offered only among people that are regularly visiting their doctor), then patients with better health and better potential outcomes may differ systematically between the two groups, so that $E[y_i^0\,|\,x_i = 1] < E[y_i^0\,|\,x_i = 0]$.
Moving on to the regression, the observed outcome can be written as a function of the potential outcomes:
$$y_i = y_i^0 + (y_i^1 - y_i^0)\,x_i.$$
Evaluating the expectations under treatment ($x_i = 1$) and no treatment ($x_i = 0$) reproduces the decomposition above, which shows that if there is any kind of dependence between the regressor $x_i$ and the error term $\varepsilon_i$, then $\beta_1$ is not identified from the data. If randomization is not possible, then there are two solutions. Note that we can also interpret this example as an instance of the omitted variable problem / measurement error, so one solution is adding a control variable to the conditional mean of the model, for example age or overall health, which can remove the selection bias. The other solution is to use an instrument $z_i$, correlated with $x_i$ but uncorrelated with the potential outcomes (i.e. uncorrelated with the errors). Why? Because then we can project $x_i$ on the space of $z_i$ and obtain a regression
$$y_i = \beta_0 + \beta_1(P_zx_i + M_zx_i) + \varepsilon_i = \beta_0 + \beta_1\hat x_i + \tilde\varepsilon_i,$$
where $\hat x_i = P_zx_i$ and $\tilde\varepsilon_i = \beta_1M_zx_i + \varepsilon_i$ are by construction orthogonal; hence the selection bias of this regression is 0.
A final example is simultaneity. Consider a demand and supply system (slope symbols $\alpha$ for demand and $\gamma$ for supply are reconstructed here, as the original symbols were lost):
$$q_t^d = \alpha p_t + \varepsilon_t^d, \qquad q_t^s = \gamma p_t + \varepsilon_t^s.$$
In equilibrium $q_t^d = q_t^s = q_t$, which implies the model can be written as:
$$\begin{bmatrix}1 & -\alpha \\ 1 & -\gamma\end{bmatrix}\begin{bmatrix}q_t \\ p_t\end{bmatrix} = \begin{bmatrix}\varepsilon_t^d \\ \varepsilon_t^s\end{bmatrix}. \tag{13}$$
Clearly both $q_t$ and $p_t$ depend on both shocks. What happens if we try to estimate the demand equation via OLS?³ Solving the system gives $p_t = (\varepsilon_t^d - \varepsilon_t^s)/(\gamma - \alpha)$ and $q_t = (\gamma\varepsilon_t^d - \alpha\varepsilon_t^s)/(\gamma - \alpha)$, so we would have (the common factor $1/(\gamma-\alpha)^2$ cancels between numerator and denominator):
$$\hat\alpha = \frac{\frac{1}{T}\sum_tp_tq_t}{\frac{1}{T}\sum_tp_t^2} = \frac{\frac{1}{T}\sum_t(\varepsilon_t^d - \varepsilon_t^s)(\gamma\varepsilon_t^d - \alpha\varepsilon_t^s)}{\frac{1}{T}\sum_t(\varepsilon_t^d - \varepsilon_t^s)^2} = \frac{\frac{1}{T}\sum_t\{\gamma(\varepsilon_t^d)^2 - \alpha\varepsilon_t^d\varepsilon_t^s - \gamma\varepsilon_t^d\varepsilon_t^s + \alpha(\varepsilon_t^s)^2\}}{\frac{1}{T}\sum_t\{(\varepsilon_t^d)^2 + (\varepsilon_t^s)^2 - 2\varepsilon_t^d\varepsilon_t^s\}}$$
$$\overset{p}{\to} \frac{\gamma E[(\varepsilon_t^d)^2] + \alpha E[(\varepsilon_t^s)^2]}{E[(\varepsilon_t^d)^2] + E[(\varepsilon_t^s)^2]} = \frac{\gamma\sigma_d^2 + \alpha\sigma_s^2}{\sigma_d^2 + \sigma_s^2}. \tag{14}$$
This is not the elasticity we are looking for (the demand elasticity $\alpha$) but rather a mix of demand and supply elasticities ($\alpha$ and $\gamma$).⁴ In this situation the coefficient $\frac{\gamma\sigma_d^2 + \alpha\sigma_s^2}{\sigma_d^2 + \sigma_s^2}$ is identified but $\alpha$ and $\gamma$ are not. This is not surprising, as it is a direct consequence of the simultaneity of this model: it is

Footnote 3: The same would happen if we tried to estimate the supply equation.
Footnote 4: Only in the special case in which the shocks to one of the two curves are so small as to be negligible could we consistently estimate $\alpha$ and $\gamma$. This is close to a situation in which only one of the two curves is being shocked, and its movements after the shocks allow us to trace out the other curve.
57
possible to show that in model (13) we have5
2 2
d + s
E[qt jpt ] = 2 2
pt ; (15)
d + s
which is indeed the probability limit to which the OLS estimator is converging (see
(14)). This is an observable property of the joint distribution, and indeed the linear
regression of qt on pt yields a consistent estimator of this quantity.
Clearly in this (and any other) simultaneous equation model the object pt is not
the conditional mean, but a conceptual causal object capturing what would happen
to qt if we moved pt only via shocks "st while keeping …xed "dt . This amounts to a
situation in which only one of the two curves is being shocked, and its movements
after the shocks allow us to trace out the other curve. A notation for this object
was given by Pearl (2010): pt = E[qt jdo("dt = "dt )]. This latter expectation and the
conditional expectation E[qt jpt ] in (15) will coincide only if E["dt jpt ] = 0, that is if
there is independence between "dt and pt , a condition that by construction does not
hold in a simultaneous model.
Problem: it is important to emphasize that the causal object pt is not a
characteristic of the joint distribution and therefore cannot be inferred from the
data without additional assumptions.
Solution: find a variable that shifts one curve but not the other. This introduces into the system a channel through which we can reproduce the ideal counterfactual situation of keeping one curve fixed while moving the other. If the instrumental variable only moves one of the two curves, then it is possible to trace out the curve that does not respond to the instrument. For example, define w_t as the number of days with temperatures below 0 Celsius, which would cause a reduction in the yield of the agricultural sector. We can think of w_t as a component of the supply shocks,
⁵ The moments of the reduced form are

E[(q_t, p_t)'] = (0, 0)'

Var[(q_t, p_t)'] = (β − α)⁻² [ β²σ²_d + α²σ²_s    βσ²_d + ασ²_s ]
                             [ βσ²_d + ασ²_s      σ²_d + σ²_s   ]

E[q_t | p_t] = (βσ²_d + ασ²_s)(σ²_d + σ²_s)⁻¹ p_t,

where the last line follows from the properties of the multivariate Gaussian.
and we can write the model as

q^d_t = αp_t + ε^d_t
q^s_t = βp_t + hw_t + u^s_t.

Note that w_t only impacts the supply and is uncorrelated with both the other shocks. Putting the model in matrix notation and solving gives:

[q_t]   [ −αh/(β−α) ]       [ (βε^d_t − αu^s_t)/(β−α) ]
[p_t] = [ −h/(β−α)  ] w_t + [ (ε^d_t − u^s_t)/(β−α)   ]    (16)

Use the second equation to find the conditional expectation E[p_t | w_t]:

p_t = −[h/(β−α)] w_t + (ε^d_t − u^s_t)/(β−α),    p̄_t ≡ E[p_t | w_t] = −[h/(β−α)] w_t.    (17)
Key step: using the conditional expectation p̄_t we can rewrite the model (16) as:

[q_t]   [α]        [ (βε^d_t − αu^s_t)/(β−α) ]
[p_t] = [1] p̄_t + [ (ε^d_t − u^s_t)/(β−α)   ]    (18)

It is clear that in this model E[q_t | p̄_t] = αp̄_t, i.e. the conditional expectation E[q_t | p̄_t] coincides with the causal object of interest αp̄_t = E[q_t | do(ε^d_t = ε̄^d_t)]. The OLS estimator in the regression of q_t on p̄_t would converge to this value. Writing v_t = (βε^d_t − αu^s_t)/(β−α) for the disturbance of the first equation of (18):

α̂ = [T⁻¹ Σ p̄_t q_t] / [T⁻¹ Σ p̄²_t]    (19)
  = [T⁻¹ Σ p̄_t(αp̄_t + v_t)] / [T⁻¹ Σ p̄²_t]
  = [T⁻¹(α Σ p̄²_t + Σ p̄_t v_t)] / [T⁻¹ Σ p̄²_t]
  = α + [T⁻¹ Σ p̄_t v_t] / [T⁻¹ Σ p̄²_t] → α.
The regression of q_t on p̄_t appearing in the first line of (18) can be interpreted as a version of the demand equation in which p_t has been swapped with p̄_t (with a remainder term ending up in the disturbances). The second equation of (18) instead is the "first stage" regression, projecting p_t on a space orthogonal to the demand shocks, to obtain a variable p̄_t that is uncorrelated with the demand shock. In practice p̄_t needs to be estimated, which can be done using the linear regression:

p_t = γw_t + η_t.

Note that if w_t also entered the demand equation, say with a coefficient k, then αp̄_t would no longer be proportional to the expectation E[q_t | p̄_t] appearing in the first equation and therefore could not be used to pin down α by rewriting the system in the form (18). The form (18) ensures that E[q_t | p̄_t] = αp̄_t, but it requires the exclusion restriction k = 0.
⁶ Another way to see how the introduction of w_t identifies α is to note that using representation (16) one can compute E[q_t | w_t] = −[αh/(β−α)] w_t and E[p_t | w_t] = −[h/(β−α)] w_t, and then note that E[q_t | w_t]/E[p_t | w_t] = α. In the sample, one could regress q_t on w_t to obtain (Σw²_t)⁻¹Σw_t q_t and p_t on w_t to obtain (Σw²_t)⁻¹Σw_t p_t, and then take the ratio of the two, (Σw_t p_t)⁻¹Σw_t q_t, which is the IV estimator. This approach is less efficient than the TSLS approach, because it uses w_t instead of the projection of p_t on w_t.
Full information system estimation approach  While the system written as in (18) obviously lends itself to a TSLS estimation procedure, an alternative procedure involves estimating the system in (16) and then solving for the parameters of interest, that is, estimating the form:

[q_t]   [a_1]       [v^d_t]
[p_t] = [a_2] w_t + [v^s_t]

with

[a_1]   [ −αh/(β−α) ]
[a_2] = [ −h/(β−α)  ],    (20)

Var[(v^d_t, v^s_t)'] = [ σ²_vd     σ_vd,vs ] = (β−α)⁻² [ β²σ²_εd + α²σ²_us   βσ²_εd + ασ²_us ]
                       [ σ_vd,vs   σ²_vs   ]           [ βσ²_εd + ασ²_us    σ²_εd + σ²_us   ].    (21)

There are five structural coefficients h, α, β, σ²_εd, σ²_us and five reduced form coefficients a_1, a_2, σ²_vd, σ_vd,vs, σ²_vs. The latter can be consistently estimated with OLS, and then one can solve for the structural parameters. In this example (20) identifies α through the ratio a_1/a_2, and then, assuming knowledge of β, one can find h = −a_2(β − α).
To find β one can use the ratios of the variance conditions in (21):

σ_vd,vs/σ²_vs = (βσ²_εd + ασ²_us)/(σ²_εd + σ²_us),    (22)
σ²_vd/σ²_vs = (β²σ²_εd + α²σ²_us)/(σ²_εd + σ²_us).    (23)

Since β²σ²_εd + α²σ²_us = (α + β)(βσ²_εd + ασ²_us) − αβ(σ²_εd + σ²_us), substituting (22) into (23) gives

σ²_vd/σ²_vs = (α + β)(σ_vd,vs/σ²_vs) − αβ,

which, with α already identified from a_1/a_2, is linear in β and yields

β = (σ²_vd − ασ_vd,vs)/(σ_vd,vs − ασ²_vs),

hence β is also identified. Finally we can use (σ²_εd + σ²_us)/(β − α)² = σ²_vs and (βσ²_εd + ασ²_us)/(β − α)² = σ_vd,vs in (21) to straightforwardly solve for σ²_εd and σ²_us.
Since every variable interacts contemporaneously with all the others, one cannot estimate any of these three equations via OLS. Note you cannot even compute a forecast, as you would need the forecast from one equation to forecast another. The model above is a simultaneous system of difference equations:

[ 1    a_1  a_2 ] [ỹ_t]   [ a_3  0    0   ] [ỹ_{t-1}]   [ε^y_t]
[ b_2  1    0   ] [π̃_t] = [ 0    b_1  0   ] [π̃_{t-1}] + [ε^π_t]
[ c_1  c_2  1   ] [i_t]   [ 0    0    c_3 ] [i_{t-1}]   [ε^i_t]
10.2 Instruments
The problem is solved through the use of some alternative variables, called instruments, Z, that are correlated with the regressors but uncorrelated with the disturbances, and that satisfy the regularity conditions E[z_t z'_t] = Q_zz < ∞ and E[z_t x'_t] = Q_zx < ∞.
Start with as many instruments as variables. The starting point is the moment condition:

E[Z'ε] = 0
This suggests the moment estimator solving

Z'(y − Xβ̂) = 0,

that is:

β̂_IV = (Z'X)⁻¹Z'y,

which of course turns out to be consistent (as we know that the moment condition holds):

β̂_IV = (Z'X)⁻¹Z'y
     = (Z'X)⁻¹Z'(Xβ + ε)
     = β + (Z'X)⁻¹Z'ε
     = β + (Z'X/T)⁻¹(Z'ε/T),

where Z'X/T → Q_zx and Z'ε/T → E[z_tε_t] = 0.
Asymptotic distribution:

√T(β̂_IV − β) = (Z'X/T)⁻¹ √T(Z'ε/T),    with Z'X/T → Q_zx,

and

√T(Z'ε/T) → N(0, Var[z_tε_t]),

with Var[z_tε_t] = E[z_tz'_tε²_t] = E_Z[z_tz'_t E[ε²_t | z_t]] = Q_zz σ². Hence

√T(β̂_IV − β) → N(0, σ² Q⁻¹_zx Q_zz Q⁻¹'_zx).
10.3 Two stage least squares
Suppose that you found a valid and relevant set of instruments Z. They can possibly be more than the Xs. Do the following. First stage: regress X on Z and compute the fitted values,

X = ZΓ + u,    Γ̂ = (Z'Z)⁻¹Z'X,    X̂ = ZΓ̂.

Second stage: regress Y on X̂,

β̂_2SLS = (X̂'X̂)⁻¹X̂'Y,

with estimated variance σ̂²(X̂'X̂)⁻¹, where (X̂'X)⁻¹X̂'X̂(X̂'X)⁻¹ = (X̂'X̂)⁻¹ (since X̂'X = X̂'X̂), and σ̂² = (Y − Xβ̂)'(Y − Xβ̂)/(T − k). Important: note σ̂² uses the endogenous regressors and not the instrumented ones, i.e. it is not (Y − X̂β̂)'(Y − X̂β̂)/(T − k).
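The two stages can be sketched numerically. The DGP below (two instruments, one endogenous regressor, invented parameter values) is only illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 50_000
z = rng.normal(size=(T, 2))                # two instruments
v = rng.normal(size=T)                     # first-stage disturbance
eps = 0.8 * v + rng.normal(size=T)         # structural error, correlated with x
x = z @ np.array([1.0, 0.5]) + v           # endogenous regressor
X = np.column_stack([np.ones(T), x])
Z = np.column_stack([np.ones(T), z])
y = X @ np.array([0.0, 2.0]) + eps

# First stage: Gamma_hat = (Z'Z)^{-1} Z'X, fitted values X_hat = Z Gamma_hat
Gamma = np.linalg.solve(Z.T @ Z, Z.T @ X)
X_hat = Z @ Gamma
# Second stage: b = (X_hat'X_hat)^{-1} X_hat'y
b_2sls = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)
# sigma^2 uses the ORIGINAL regressors X, not the instrumented ones
resid = y - X @ b_2sls
s2 = resid @ resid / (T - X.shape[1])
print(b_2sls, s2)
```

Here the slope estimate is close to its true value 2, and s2 estimates the variance of the structural error rather than the (larger) second-stage residual variance.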
10.4 Hausman test
Consider the null hypothesis of exogeneity. The OLS estimator β̂_OLS will be consistent. Then take a 2SLS estimator β̂_2SLS derived from Z instruments. Under the null hypothesis of exogeneity, this estimator is also consistent.
Under the alternative, only β̂_2SLS is consistent. Hence we can devise a test for the null of exogeneity using the discrepancy vector in (24). In order to rescale the discrepancy vector, we need to standardize it by its variance:

(β̂_2SLS − β̂_OLS)' [asyVar(β̂_2SLS − β̂_OLS)]⁻¹ (β̂_2SLS − β̂_OLS),

where

asyVar(β̂_2SLS − β̂_OLS) = asyVar(β̂_2SLS) + asyVar(β̂_OLS) − 2 asyCov(β̂_2SLS, β̂_OLS).

Now note that under the null β̂_OLS will be BLUE and β̂_2SLS will be inefficient.⁷ This implies⁸ that asyVar(β̂_OLS) = asyCov(β̂_2SLS, β̂_OLS), which simplifies things as we do not need to estimate the covariance:

asyVar(β̂_2SLS − β̂_OLS) = asyVar(β̂_2SLS) − asyVar(β̂_OLS).
⁷ To see this more clearly, consider the difference in the estimated asymptotic variances of the two estimators under the null of exogeneity, σ̂²(X̂'X̂)⁻¹ − σ̂²(X'X)⁻¹. It is positive semi-definite because

X'X − X'Z(Z'Z)⁻¹Z'X = X'X − X'P_Z X = X'M_Z X

is positive semi-definite.
Therefore the Wald statistic can be written as

W = (β̂_2SLS − β̂_OLS)' [asyVar(β̂_2SLS) − asyVar(β̂_OLS)]⁻¹ (β̂_2SLS − β̂_OLS).

Since there will be some Xs that are exogenous (e.g. the intercept), this quadratic form will generally be of reduced rank K* = K − K_0, where K_0 is the number of exogenous Xs and K* the number of endogenous Xs (call these X*). Hence the inversion of the matrix in square brackets needs to be performed using a generalized inverse procedure. In practice, though, an equivalent variable addition test due to Wu (1973) can be performed easily by running the augmented regression:

y = Xβ + X̂*γ + ε,

where X are all the variables (both endogenous and exogenous) and X̂* are the projections of the endogenous variables X* on the instruments. The null of exogeneity corresponds to γ = 0. If there is endogeneity, it will be picked up by γ. The Wald statistic for γ = 0 has distribution F_{K*, n−K−K*} and this procedure is equivalent to the Hausman test described above.
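A sketch of the Wu variable-addition test with one endogenous regressor (the DGP and all its parameter values are invented; under endogeneity the t-statistic on the added projection is large):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 5_000
z = rng.normal(size=T)                 # instrument
v = rng.normal(size=T)
eps = 0.6 * v + rng.normal(size=T)     # error correlated with x -> endogeneity
x = z + v                              # endogenous regressor
X = np.column_stack([np.ones(T), x])
y = X @ np.array([1.0, 1.0]) + eps

Z = np.column_stack([np.ones(T), z])
x_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ x)   # projection of x on instruments

# Augmented regression y = X beta + x_hat gamma + error; test gamma = 0
W = np.column_stack([X, x_hat])
b = np.linalg.solve(W.T @ W, W.T @ y)
resid = y - W @ b
s2 = resid @ resid / (T - W.shape[1])
V = s2 * np.linalg.inv(W.T @ W)
t_gamma = b[2] / np.sqrt(V[2, 2])
print(t_gamma)   # far beyond conventional critical values in this design
```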
Part III
Likelihood based methods
11 Maximum Likelihood estimation
Fisher, 1925. Maximum Likelihood estimation is more widely applicable (not only to linear regression models). For iid data:

L(y; θ) = ∏_{t=1}^T f(y_t; θ).

More generally, for dependent data, factorizing the joint density sequentially:

L(y; θ) = ∏_{t=1}^T f(y_t | y_{t−1}, ..., y_1; θ)

l(y; θ) = ln L(y; θ) = Σ_{t=1}^T ln f(y_t | y_{t−1}, ..., y_1; θ).
First order condition: s(y; θ) = ∂l/∂θ = 0. [s(y; θ) is called the score vector]. Nice asymptotic properties:

Consistency: plim_{T→∞} θ̂_ML = θ.

Asymptotic normality: √T(θ̂_ML − θ) → N(0, [T⁻¹I(θ)]⁻¹), i.e. approximately θ̂_ML ~ N(θ, [I(θ)]⁻¹). Here I(θ) is the Fisher information matrix, defined as the variance of the score vector:

I(θ) = E[(∂l/∂θ)(∂l/∂θ)'].

It also equals (minus) the expected value of the Hessian (information matrix equality):

I(θ) = E[(∂l/∂θ)(∂l/∂θ)'] = −E[∂²l/∂θ∂θ'].

Asymptotic efficiency: [I(θ)]⁻¹ is the Cramér–Rao bound, so the MLE has variance at least as small as the best unbiased estimator of θ. The MLE is generally not unbiased, but its bias is small, making the comparison with unbiased estimates and the Cramér–Rao bound appropriate.

Estimators of the asymptotic variance matrix [T⁻¹I(θ)]⁻¹ (since √T(θ̂_ML − θ) → N(0, [T⁻¹I(θ)]⁻¹)) are based on the information matrix computed at the ML estimates:

[T⁻¹ Σ_{t=1}^T (∂l_t(y; θ̂_ML)/∂θ)(∂l_t(y; θ̂_ML)/∂θ)']⁻¹ → [T⁻¹I(θ)]⁻¹    or    [−T⁻¹ Σ_{t=1}^T ∂²l_t(y; θ̂_ML)/∂θ∂θ']⁻¹ → [T⁻¹I(θ)]⁻¹.
Example: the linear regression model y = Xβ + ε, ε ~ N(0, σ²I). The density is

f(y) = (2π)^{−T/2} |σ²I|^{−1/2} exp{−0.5 (y − Xβ)'(σ²I)⁻¹(y − Xβ)},

in logs:

ln L = −(T/2) ln 2π − (T/2) ln σ² − [1/(2σ²)](y − Xβ)'(y − Xβ).
Let θ = (β', σ²)'. First order conditions s(θ; y) = ∂l/∂θ = 0 give

∂l/∂β = −(1/σ²)(−X'y + X'Xβ) = 0
∂l/∂σ² = −T/(2σ²) + [1/(2σ⁴)](y − Xβ)'(y − Xβ) = 0

with solutions

β̂_ML = (X'X)⁻¹X'y = β̂_OLS
σ̂²_ML = (1/T)(y − Xβ̂_ML)'(y − Xβ̂_ML) = (1/T) ε̂'ε̂.

Second order conditions:

[ ∂²l/∂β∂β'    ∂²l/∂β∂σ²  ]   [ −X'X/σ²      −X'ε/σ⁴          ]
[ ∂²l/∂σ²∂β'   ∂²l/∂(σ²)² ] = [ −(X'ε)'/σ⁴   T/(2σ⁴) − ε'ε/σ⁶ ].

Information matrix:

I(θ) = −E[∂²l/∂θ∂θ'] = [ E[X'X/σ²]          E[X'ε/σ⁴] = 0        ]   [ X'X/σ²   0       ]
                       [ E[(X'ε)'/σ⁴] = 0   E[ε'ε/σ⁶ − T/(2σ⁴)]  ] = [ 0        T/(2σ⁴) ],

where we used E[ε'ε/σ⁶] = Tσ²/σ⁶ = T/σ⁴. We have:

[I(θ)]⁻¹ = [ σ²(X'X)⁻¹   0     ]
           [ 0           2σ⁴/T ].
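These closed forms can be checked numerically: at the ML estimates the score is zero, and [I(θ)]⁻¹ gives the asymptotic variance 2σ⁴/T for σ̂². A sketch with simulated data (DGP values invented):

```python
import numpy as np

rng = np.random.default_rng(3)
T = 500
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([1.0, -2.0]) + rng.normal(0.0, np.sqrt(0.5), T)

# Closed-form ML estimates: OLS for beta, RSS/T for sigma^2
b_ml = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b_ml
s2_ml = e @ e / T

# Score at the ML estimates: both blocks vanish
score_beta = (X.T @ y - X.T @ X @ b_ml) / s2_ml
score_s2 = -T / (2 * s2_ml) + (e @ e) / (2 * s2_ml**2)
print(score_beta, score_s2)       # numerically zero

# Asymptotic variance of s2_ml implied by the information matrix
print(2 * s2_ml**2 / T)
```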
The counts are modelled with a binomial:

y_i ~ iid Bi(n_i = 10, π_i),    i = 1, ..., N = 11,

and the observed proportions p_i are modelled as

p_i = Λ(X'_iβ) + ε_i,

where

Λ(X'_iβ) = e^{β₀+β₁x_i}/(1 + e^{β₀+β₁x_i})

is a good candidate as it is between 0 and 1 and allows nonlinear behaviours. In this case we have

π_i = E[p_i | X] = e^{β₀+β₁x_i}/(1 + e^{β₀+β₁x_i})

and

1 − π_i = 1 − E[p_i | X] = 1/(1 + e^{β₀+β₁x_i}).

Taking the ratio of the last two expressions shows that this is a model in which the log of the ratio between the probabilities π_i and 1 − π_i is a linear function of x_i:

ln[π_i/(1 − π_i)] = β₀ + β₁x_i.

The coefficient β₁ is the logit parameter.
The likelihood is:

Prob(y_1, y_2, ..., y_N | X) = ∏_{i=1}^N C(n_i, y_i) (1 − π_i)^{n_i−y_i} π_i^{y_i},

ln L ∝ Σ_{i=1}^N [(n_i − y_i) ln(1 − Λ(X'_iβ)) + y_i ln Λ(X'_iβ)]

∂lnL/∂β = Σ_{i=1}^N [−(n_i − y_i) Λ'(X'_iβ)/(1 − Λ(X'_iβ)) + y_i Λ'(X'_iβ)/Λ(X'_iβ)] X_i = 0.

Note that

Λ'(X'_iβ) = e^{β₀+β₁x_i}/(1 + e^{β₀+β₁x_i})² = Λ(X'_iβ)(1 − Λ(X'_iβ)),

so we have

∂lnL/∂β = Σ_{i=1}^N [−(n_i − y_i)Λ(X'_iβ) + y_i(1 − Λ(X'_iβ))] X_i
        = Σ_{i=1}^N [−n_iΛ(X'_iβ) + y_iΛ(X'_iβ) + y_i − y_iΛ(X'_iβ)] X_i
        = Σ_{i=1}^N [y_i − n_iΛ(X'_iβ)] X_i = 0,
which is a set of orthogonality conditions based on the logistic function Λ(X'_iβ) as opposed to a linear specification X'_iβ. The second derivative is:

H = ∂²lnL/∂β∂β' = −Σ_{i=1}^N n_i Λ'(X'_iβ) X_iX'_i = −Σ_{i=1}^N n_i Λ(X'_iβ)(1 − Λ(X'_iβ)) X_iX'_i,

which is always negative definite. The simpler case in which n_i = 1 is the "discrete choice model".
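Because the Hessian is globally negative definite, Newton-Raphson on this log-likelihood converges reliably. A minimal sketch on simulated grouped data (the DGP values β₀ = 0.5, β₁ = −1 are invented):

```python
import numpy as np

rng = np.random.default_rng(4)
N, n = 200, 10
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
beta_true = np.array([0.5, -1.0])
pi = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = rng.binomial(n, pi)                      # counts out of n trials

beta = np.zeros(2)
for _ in range(25):                          # Newton-Raphson
    lam = 1.0 / (1.0 + np.exp(-X @ beta))
    score = X.T @ (y - n * lam)              # sum_i [y_i - n_i Lambda(X_i'b)] X_i
    H = -(X * (n * lam * (1 - lam))[:, None]).T @ X   # negative definite Hessian
    beta = beta - np.linalg.solve(H, score)
print(beta)   # close to beta_true
```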
The solution is also the one that minimizes the deviance:

D(π_i, p_i) = 2n_i [p_i ln(p_i/π_i) + (1 − p_i) ln((1 − p_i)/(1 − π_i))],

where the term in the square brackets is the expectation, taken using the probabilities p, of the logarithmic difference between the probabilities p and π. Deviance is analogous to squared error in ordinary regression theory, and can be seen as its generalization. It is twice the "Kullback–Leibler divergence", the preferred name in the information-theory literature.⁹ When p_i = π_i the deviance is 0.
Example: the generalized regression model y = Xβ + ε, ε ~ N(0, σ²Ω). In logs:

ln L = −(T/2) ln 2π − (1/2) ln|σ²Ω| − [1/(2σ²)](y − Xβ)'Ω⁻¹(y − Xβ)
     = −(T/2) ln 2π − (T/2) ln σ² − (1/2) ln|Ω| − [1/(2σ²)](y − Xβ)'Ω⁻¹(y − Xβ);

here we assume Ω is known, hence we can just focus on:

ln L ∝ −(T/2) ln σ² − [1/(2σ²)](y − Xβ)'Ω⁻¹(y − Xβ).

First order conditions s(θ; y) = ∂l/∂θ = 0 give

∂l/∂β = −(1/σ²)(−X'Ω⁻¹y + X'Ω⁻¹Xβ) = 0
∂l/∂σ² = −T/(2σ²) + [1/(2σ⁴)](y − Xβ)'Ω⁻¹(y − Xβ) = 0.

⁹ The KL divergence is the expected value of a log-likelihood ratio if the data are actually drawn from the distribution at the numerator of the ratio, i.e. under the null hypothesis for which that ratio is computed.
The solutions are

β̂_ML = (X'Ω⁻¹X)⁻¹X'Ω⁻¹y = β̂_GLS
σ̂²_ML = (1/T)(y − Xβ̂_ML)'Ω⁻¹(y − Xβ̂_ML) = (1/T) ε̂'Ω⁻¹ε̂.

This assumed Ω known. If Ω is unknown, we need to consider Ω(θ) and maximize w.r.t. θ as well.
Example (ARCH). Consider the model

y_t = θ'x_t + ε_t,
ε_t = h_tη_t,    η_t ~ N(0, 1),
h²_t = α₀ + α₁ε²_{t−1} + ... + α_mε²_{t−m},

so that

V[ε_t | I_{t−1}] = E[ε²_t | I_{t−1}] − E[ε_t | I_{t−1}]² = E[h²_tη²_t | I_{t−1}] = h²_t E[η²_t | I_{t−1}] = h²_t.

It follows that:

ε_t | I_{t−1} ~ N(0, h²_t).

Note that the unconditional distribution of ε_t is not normal. It is a mixture of normal distributions with different (and random) variances. It can be shown that the resulting unconditional distribution has fatter tails than a normal distribution:

κ(ε_t) = E[ε⁴_t]/E[ε²_t]² = E[h⁴_tη⁴_t]/E[h²_tη²_t]² = E[h⁴_t]E[η⁴_t]/(E[h²_t]²E[η²_t]²) = 3E[h⁴_t]/E[h²_t]² ≥ 3,

where the last inequality follows from Jensen's inequality E[f(x)] ≥ f(E[x]) for convex f. Therefore, the ARCH model is consistent with the fact that returns are not normal. It is also consistent with the volatility clustering phenomenon. To see this, consider the forecast error:

ν_t = ε²_t − E[ε²_t | I_{t−1}] = ε²_t − h²_t,

which implies (for m = 1)

ε²_t = h²_t + ν_t = α₀ + α₁ε²_{t−1} + ν_t,

i.e. the squared errors follow an autoregressive process.
(The Jacobian of the transformation is 1.) Here α = (α₀, α₁, ..., α_m)' is the vector of coefficients. Now consider a sample of T observations. The likelihood (conditional on the first m observations) is the product of the conditional densities N(0, h²_t), t = m+1, ..., T.
The function −ln L can be minimized w.r.t. θ and α. Remember there are restrictions on α, so it is a constrained estimation. Engle (1982) shows that the information matrix is block-diagonal, so that the ML estimators of θ and α are independent: each set of parameters can be estimated with full efficiency based only on a consistent estimate of the other.
One could write down the likelihood for the entire sample and maximize, as we did in the GARCH example. There is a quicker way to evaluate the likelihood using:
1. The prediction error decomposition, and
2. A recursive algorithm known as the Kalman filter.
Prediction error decomposition  To understand what the PED is, consider the simpler model:

Y_t = φ₁Y_{t−1} + u_t,    u_t ~ i.i.d. N(0, σ²_u),

and the forecast Y_{t|t−1} = E[Y_t | I_{t−1}] = φ₁Y_{t−1}, with prediction error v_{t|t−1} = Y_t − Y_{t|t−1}, which has variance Var[v_{t|t−1}] = σ²_u = Σ_{t|t−1}, i.e. the same as Var[Y_t | I_{t−1}]. The likelihood for observation t can be written as:

l(Y_t | Y_{1:t−1}; θ) ∝ −ln|Σ_{t|t−1}(θ)| − v'_{t|t−1}(θ) Σ⁻¹_{t|t−1}(θ) v_{t|t−1}(θ).

For more complex models, e.g. the ARMA model, computing the PED moments Σ_{t|t−1} and v_{t|t−1} can be easily done using a prediction-update algorithm known as the Kalman Filter. Consider the state space model

Y_t = Λs_t + ε_t,    s_t = Fs_{t−1} + η_t,

[ε_t]            [ Σ_ε  0   ]
[η_t] ~ iid N(0, [ 0    Σ_η ]).

Similarly, time variation in the coefficient matrices Λ and F can be allowed.
The Kalman Filter - Learning about states from data
An algorithm that produces and updates linear projections of the latent variable s_t given observations of Y_t.
Define:

s_{t|s} := E[s_t | Y_{1:s}],    P_{t|s} := Var[s_t | Y_{1:s}].

Assume you start by knowing s_{t−1} ~ N(s_{t−1|t−1}, P_{t−1|t−1}). The filter is a rule to update to s_t ~ N(s_{t|t}, P_{t|t}) once we observe data Y_t.
The first step is to find the joint distribution of states and data in t, conditional on past observations (1, ..., t−1):

[s_t]               [s_{t|t−1}]   [ P_{t|t−1}   C'_{t|t−1} ]
[Y_t] | I_{t−1} ~ N([Y_{t|t−1}],  [ C_{t|t−1}   Σ_{t|t−1}  ])    (25)

The moments above can be calculated easily using the equations of the system, Y_t = Λs_t + ε_t, s_t = Fs_{t−1} + η_t.
We have now specified the distribution (s_t, Y_t | Y_{1:t−1}); we now look for the distribution (s_t | Y_t, Y_{1:t−1}) = (s_t | Y_{1:t}).
This is easy to do using basic results regarding Normal distributions: let (a, b) be normally distributed,

[a]     [μ_a]  [ Σ_aa  Σ_ab ]
[b] ~ N([μ_b], [ Σ_ba  Σ_bb ]),    (27)

then a | b ~ N(μ_a + Σ_abΣ⁻¹_bb(b − μ_b), Σ_aa − Σ_abΣ⁻¹_bbΣ_ba).
So we have moved from (s_{t−1} | Y_{1:t−1}) to (s_t, Y_t | Y_{1:t−1}) (prediction) and then from (s_t, Y_t | Y_{1:t−1}) to (s_t | Y_{1:t}) (update). We can now repeat and use (s_t | Y_{1:t}) to move forward to (s_{t+1} | Y_{1:t+1}).
Using C_{t|t−1} = ΛP_{t|t−1} (see (26)) the updating equations can be re-written as:

s_{t|t} = s_{t|t−1} + K_{t|t−1}v_{t|t−1}    (32)
P_{t|t} = P_{t|t−1} − K_{t|t−1}ΛP_{t|t−1}    (33)

with

K_{t|t−1} = P_{t|t−1}Λ'Σ⁻¹_{t|t−1}    (34)

denoting the Kalman Gain and v_{t|t−1} = Y_t − Y_{t|t−1} denoting the 1-step-ahead prediction error.
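The prediction and update steps can be sketched for a scalar state space model. Λ, F and the variances below are invented illustration values, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(6)
T = 300
F, Lam = 0.9, 1.0                      # transition and loading coefficients
Sig_eta, Sig_eps = 0.5, 1.0            # state and measurement noise variances

s = np.zeros(T)                        # simulate states and data
Y = np.zeros(T)
for t in range(1, T):
    s[t] = F * s[t - 1] + rng.normal(0.0, np.sqrt(Sig_eta))
    Y[t] = Lam * s[t] + rng.normal(0.0, np.sqrt(Sig_eps))

s_f, P_f = 0.0, Sig_eta / (1 - F**2)   # start from the unconditional moments
s_filt = np.zeros(T)
loglik = 0.0
for t in range(T):
    # prediction: moments of (s_t, Y_t | Y_{1:t-1})
    s_p = F * s_f
    P_p = F * P_f * F + Sig_eta
    v = Y[t] - Lam * s_p               # 1-step-ahead prediction error
    Sig_p = Lam * P_p * Lam + Sig_eps  # its variance
    # update with the Kalman gain K = P Lam' Sig^{-1}
    K = P_p * Lam / Sig_p
    s_f = s_p + K * v
    P_f = P_p - K * Lam * P_p
    s_filt[t] = s_f
    loglik += -0.5 * (np.log(2 * np.pi * Sig_p) + v**2 / Sig_p)
print(loglik)
```

The accumulated loglik is exactly the prediction error decomposition of the sample likelihood.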
The Kalman Filter recursions  The algorithm works as follows:
1. Start from (s_{t−1} | Y_{1:t−1}) ~ N(s_{t−1|t−1}, P_{t−1|t−1}) and predict, obtaining the joint distribution (s_t, Y_t | Y_{1:t−1}) in (25).
2. Observe the prediction error v_{t|t−1} = Y_t − Y_{t|t−1} and update to find (s_t | Y_{1:t}) ~ N(s_{t|t}, P_{t|t}).
As a by-product, the algorithm will provide the time series of v_{t|t−1} and Σ_{t|t−1} for t = 1, ..., T, where

(v_{t|t−1}, Σ_{t|t−1}) = f(Λ, F, Σ_ε, Σ_η).

The sum of the likelihoods of the forecast errors l_t(θ) provides the likelihood of the whole system.
The ARMA example

Y_t = [1  θ] [s_{1t}]
             [s_{2t}]  = s_{1t} + θs_{2t}

[s_{1t}]   [ φ₁  φ₂ ] [s_{1t−1}]   [u_t]
[s_{2t}] = [ 1   0  ] [s_{2t−1}] + [ 0 ],

with Σ_ε = 0 and

      [ σ²_u  0 ]
Σ_η = [ 0     0 ].

Indeed, substituting the transition equation into the measurement equation shows that Y_t follows an ARMA process.
12 James - Stein Estimators
12.1 James-Stein result
Let {z_t}_{t=1}^T be a sequence of iid k-variate Gaussians with mean μ and variance I_k:

z_t | μ ~ N(μ, I_k).

The ML estimator of μ is the sample mean z̄_T; the James-Stein estimator shrinks it towards zero:

μ̂_JS = (1 − (k − 2)/||z̄_T||²) z̄_T,

where ||·|| is the Euclidean norm, hence

||z̄_T||² = (√(z̄²₁ + z̄²₂ + ... + z̄²_k))² = z̄'_T z̄_T.
Here we only have 1 realization of β̂, so T = 1 and:

z̄_T = (1/T) Σ_{t=1}^T z_t = z_1 = (1/σ)(X'X)^{1/2}β̂,

z̄'_T z̄_T = (1/σ²) β̂'X'Xβ̂,

that is

β̂_JS = (1 − (k − 2)σ²/(β̂'X'Xβ̂)) β̂.
Consider now the estimation of k means μ₁, ..., μ_k from the observations

x₁, x₂, ..., x_k,

with likelihood

x_i | μ_i ~ N(μ_i, 1).

The MLE is

μ̂^ML_i = x_i.
The Bayesian posterior mean under the prior μ_i ~ N(M₀, V₀) is

μ̂^Bayes_i = (V₀⁻¹ + 1)⁻¹(V₀⁻¹M₀ + x_i)
          = [(1 + V₀)/V₀]⁻¹(V₀⁻¹M₀ + x_i)
          = (V₀ + 1)⁻¹(M₀ + V₀x_i)
          = M₀ + [V₀/(V₀ + 1)](x_i − M₀).    (35)

It turns out that the JS estimator is the plug-in version of (35),

μ̂^JS_i = M̂₀ + ŵ(x_i − M̂₀),

with

M̂₀ = x̄,    ŵ = 1 − (k − 3)/S  (the plug-in estimate of V₀/(V₀ + 1)),    S = Σ_{i=1}^k (x_i − x̄)².

Substituting gives:

μ̂^JS_i = x̄ + (1 − (k − 3)/S)(x_i − x̄).    (36)

This is Lindley's (1962) version of the JS estimator, which shrinks towards x̄ rather than towards 0. The standard case is immediately obtained by setting M₀ = M̂₀ = 0. In this case, the MSE benefit comes from the fact that we are using the other observations to form a point towards which to shrink the estimates. Note that we are using the data to compute an estimate of the prior mean, M̂₀ = x̄! This is called "Empirical Bayes".
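The risk improvement of (36) over the MLE is easy to see by simulation (the DGP for the true means is invented):

```python
import numpy as np

rng = np.random.default_rng(7)
k, reps = 20, 2000
mu = rng.normal(0.0, 1.0, k)            # fixed true means

sse_ml = sse_js = 0.0
for _ in range(reps):
    x = mu + rng.normal(0.0, 1.0, k)    # x_i | mu_i ~ N(mu_i, 1)
    xbar = x.mean()
    S = np.sum((x - xbar) ** 2)
    mu_js = xbar + (1 - (k - 3) / S) * (x - xbar)   # Lindley's form (36)
    sse_ml += np.sum((x - mu) ** 2)
    sse_js += np.sum((mu_js - mu) ** 2)
print(sse_js / sse_ml)   # below 1: the JS estimator dominates the MLE
```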
13 Bayesian treatment of the linear regression model
13.1 Introduction
Start with:

y = Xβ + ε,    ε ~ N(0, σ²I_T)

and get

β̂ = (X'X)⁻¹X'Y.

Add data:

[Y ]   [X ]     [ε ]
[Y₁] = [X₁] β + [ε₁],    ε ~ N(0, σ²I_{T+T₁})

and get

β̂ = ([X' X'₁][X; X₁])⁻¹ [X' X'₁][Y; Y₁]
  = (X'X + X'₁X₁)⁻¹ (X'Y + X'₁Y₁).
Bayesian approach
1. The researcher starts with a prior belief about the coefficient β. The prior belief is in the form of a distribution p(β), e.g.

β ~ N(β₀, Ω₀).

2. Collect data and write down the likelihood function as before, p(Y | β).
3. Update your prior belief on the basis of the information in the data. Combine the prior distribution p(β) and the likelihood function p(Y | β) to obtain the posterior distribution p(β | Y).
Key identities

p(β | Y) = p(Y | β) p(β) / p(Y) ∝ p(Y | β) p(β).
Useful identities:
p(Y) is the data density (also known as marginal likelihood). It is the constant of integration of the posterior:

∫ p(β | Y) dβ = [1/p(Y)] ∫ p(Y | β) p(β) dβ = 1,

where the integrand p(Y | β) p(β) is the posterior kernel; therefore p(Y) is not needed if we are only interested in the posterior kernel. The posterior kernel is sufficient to compute e.g. the mean and variance of p(β | Y):

p(β | Y) ∝ p(Y | β) p(β).
Derivation of posterior  We assume that σ² is known; recall that k is the number of regressors. Set the prior distribution β ~ N(β₀, Ω₀):

p(β) = (2π)^{−k/2} |Ω₀|^{−1/2} exp{−0.5(β − β₀)'Ω₀⁻¹(β − β₀)}
     ∝ exp{−0.5(β − β₀)'Ω₀⁻¹(β − β₀)}.

Then

p(β | Y) ∝ p(Y | β) p(β)
        ∝ exp[−0.5(Y − Xβ)'(Y − Xβ)/σ²] exp[−0.5(β − β₀)'Ω₀⁻¹(β − β₀)]
        = exp[−0.5{(Y − Xβ)'(Y − Xβ)/σ² + (β − β₀)'Ω₀⁻¹(β − β₀)}],    (37)
which is the kernel of a normal distribution with moments

Ω₁ = (Ω₀⁻¹ + X'X/σ²)⁻¹    (38)
β₁ = Ω₁(Ω₀⁻¹β₀ + X'Y/σ²);    (39)

therefore we conclude:

β | Y ~ N(β₁, Ω₁).

To verify this, expand the exponent in (37) and regroup the terms in β:

ln p(β | Y) ∝ −0.5[ β'(Ω₀⁻¹ + X'X/σ²)β − β'(Ω₀⁻¹β₀ + X'Y/σ²) − (β₀'Ω₀⁻¹ + Y'X/σ²)β + β₀'Ω₀⁻¹β₀ + Y'Y/σ² ],    (40)

where the elements in the first two parentheses are Ω₁⁻¹ and Ω₁⁻¹β₁, as in definitions (38) and (39). Completing the square in the first three terms:

β'Ω₁⁻¹β − β'Ω₁⁻¹β₁ − β₁'Ω₁⁻¹β = (β − β₁)'Ω₁⁻¹(β − β₁) − β₁'Ω₁⁻¹β₁.

Therefore, we have:

ln p(β | Y) ∝ −0.5[(β − β₁)'Ω₁⁻¹(β − β₁) − β₁'Ω₁⁻¹β₁ + β₀'Ω₀⁻¹β₀ + Y'Y/σ²]
           ∝ −0.5(β − β₁)'Ω₁⁻¹(β − β₁),

since the terms dropped in the last line do not involve β.
[Figure: prior, likelihood, and posterior densities for β.]

Ω₀⁻¹β₀ and Ω₀⁻¹ are the prior moments; they can be interpreted as dummy observations/pre-sample observations.
Without the priors, these moments are simply the OLS estimates. Without the data, these moments are simply the priors.
The posterior mean is a weighted average of the prior and OLS. The weights are inversely proportional to the variance of prior and data information.
Setting Ω₀ = (σ²/λ)I_k and β₀ = 0 gives the Ridge regression:

β₁ = (λI_k + X'X)⁻¹X'Y.
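The correspondence between the posterior moments (38)-(39) and the ridge formula can be verified directly (data simulated with invented values):

```python
import numpy as np

rng = np.random.default_rng(8)
T, k = 200, 3
X = rng.normal(size=(T, k))
sigma2 = 1.0
y = X @ np.array([1.0, 0.0, -1.0]) + rng.normal(0.0, 1.0, T)

lam = 5.0
Omega0 = (sigma2 / lam) * np.eye(k)     # prior variance; beta0 = 0
Omega0_inv = np.linalg.inv(Omega0)

# posterior moments (38)-(39)
Omega1 = np.linalg.inv(Omega0_inv + X.T @ X / sigma2)
beta1 = Omega1 @ (X.T @ y / sigma2)     # Omega0^{-1} beta0 = 0

# ridge closed form
beta_ridge = np.linalg.solve(lam * np.eye(k) + X.T @ X, X.T @ y)
print(np.max(np.abs(beta1 - beta_ridge)))   # numerically zero
```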
You, as a statistician, are asked to give a 5% rejection region for the null that U and D are equally likely.
Obtaining 9 U's out of 12 suggests that the chance of its falling uppermost (U) exceeds 50%. The results that would even more strongly support this conclusion are:

(10, 2), (11, 1), and (12, 0),

so that, under the null hypothesis θ = 1/2, the chance of the observed result, or a more extreme one, is:

[C(12,9) + C(12,10) + C(12,11) + C(12,12)] (1/2)¹² = 299/4096 ≈ 7.3% > 5%.

Hence, you do NOT reject the null that U and D are equally likely (50%).
However, now Luigi tells you: "but I didn't set out to throw the pin 12 times. My plan was to throw the pin until 3 D's appeared."
Does this change your inference? Yes it does.
Under the new scenario, the more extreme events would be:

(10, 3), (11, 3), (12, 3), ...,

while the events (10, 2), (11, 1), and (12, 0) actually can NOT take place under this design.
So the chance of the observed result, or a more extreme one, under the null hypothesis becomes:

Σ_{x≥9} C(x+2, x) (1/2)^{x+3} = C(11,9)(1/2)¹² + C(12,10)(1/2)¹³ + ... ≈ 3.3% < 5%.

Why is this happening? Because the two setups imply a different stopping rule (stop at 12 draws, or stop at 3 D draws). This, more generally, alters the sample space.
Things are even more problematic. Think if Luigi says "I just kept throwing the pin until lunch was served". How would you tackle this?
Confidence intervals similarly demand consideration of the sample space. Indeed, so does every statistical technique, with the exception of maximum likelihood.
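The two tail probabilities can be checked with a few lines of arithmetic:

```python
from math import comb

# Binomial design: 12 throws, observed 9 U's; tail probability of 9 or more U's
p_binom = sum(comb(12, u) for u in range(9, 13)) / 2**12

# Negative binomial design: throw until the 3rd D; x = #U's before it,
# P(X = x) = C(x+2, x) (1/2)^(x+3); tail for x >= 9 via the complement
p_negbin = 1 - sum(comb(x + 2, x) * 0.5 ** (x + 3) for x in range(9))

print(round(p_binom, 4), round(p_negbin, 4))   # 0.073 and 0.0327
```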
Lindley and Phillips (1976): Many people's intuition says this specification is irrelevant. Their argument might more formally be expressed by saying that the evidence is of 12 honestly reported tosses, 9 of which were U, 3 D. Furthermore, these were in a particular order, that reported above. Of what relevance are things that might have happened [e.g. no lunch], but did not?
2. If two likelihood functions for θ (from the same or different experiments) are proportional to one another, then they contain the same information about θ.
By using only the likelihood, and nothing else from the experiment, the answer to the problem is the same regardless of the stopping rule. Indeed, let x₁ = #U in experiment 1 and x₂ = #U in experiment 2. In experiment 1 (E1) we have a binomial density:

f¹_θ(x₁) = C(12, x₁) θ^{x₁} (1 − θ)^{12−x₁}  ⟹  ℓ¹(9) = C(12, 9) θ⁹ (1 − θ)³.

In experiment 2 (E2) we have a negative binomial density:

f²_θ(x₂) = C(x₂ + 3 − 1, x₂) θ^{x₂} (1 − θ)³  ⟹  ℓ²(9) = C(11, 9) θ⁹ (1 − θ)³.

The two likelihoods are proportional as functions of θ, and hence carry the same information about θ.
Error variance
The distribution of ε̂'ε̂/σ² is a chi-square:

ε̂'ε̂/σ² = ε'(I − P_X)ε/σ² ~ χ²_{T−k},

or, defining σ̃² = ε̂'ε̂/(T − k),

ε̂'ε̂/σ² = (T − k)σ̃²/σ² ~ χ²_{T−k}.    (41)

It has mean

E[(T − k)σ̃²/σ²] = T − k

and variance

Var[(T − k)σ̃²/σ²] = 2(T − k).

Now imagine that σ² was a random variable (not a coefficient) and consider (41) as a function of σ². The distribution of 1/σ² is a Gamma:

(T − k)σ̃²/σ² ~ χ²_{T−k}  ⟺  1/σ² ~ G((T − k)/2, (T − k)σ̃²/2),

and that of σ² is an inverse Gamma:

(T − k)σ̃²/σ² ~ χ²_{T−k}  ⟺  σ² ~ G⁻¹((T − k)/2, (T − k)σ̃²/2) = G⁻¹((T − k)/2, ε̂'ε̂/2).

All this suggests to use an inverse Gamma as a prior for σ².
Conditional Posterior of error variance  1. Set the prior distribution σ² ~ G⁻¹(v₀/2, s²₀/2):

p(σ²) = [Γ(v₀/2)]⁻¹ (s²₀/2)^{v₀/2} (σ²)^{−(v₀+2)/2} exp{−s²₀/(2σ²)}.

2. Combining the prior with the likelihood gives the conditional posterior σ² | β, Y ~ G⁻¹(v₁/2, s²₁/2), with

v₁ = T + v₀,    s²₁ = s²₀ + (Y − Xβ)'(Y − Xβ).

Under the full prior specification

β ~ N(β₀, Ω₀),    σ² ~ G⁻¹(v₀/2, s²₀/2),
we derived the conditional posteriors:

β | σ², Y ~ N(β₁, Ω₁),    σ² | β, Y ~ G⁻¹(v₁/2, s²₁/2),

with moments:

Ω₁ = (Ω₀⁻¹ + X'X/σ²)⁻¹
β₁ = Ω₁(Ω₀⁻¹β₀ + X'Xβ̂/σ²)
v₁ = v₀ + T
s²₁ = s²₀ + (y − Xβ)'(y − Xβ).
Simple Monte Carlo simulation is not an option to obtain the joint posterior of (β, σ²) | Y. Other methods are needed → Gibbs Sampling (MCMC).
However, the model is more flexible than the one with a conjugate prior, for it does not require proportionality between the prior on β and the error variance.
The Gibbs sampler works as follows. At each iteration j = 1, 2, ...:
1. Sample θ_i^{(j)} from

f(θ_i | θ₁^{(j)}, ..., θ_{i−1}^{(j)}, θ_{i+1}^{(j−1)}, ..., θ_k^{(j−1)}),    i = 1, ..., k.

For simple models (e.g. linear regressions, also multivariate), this happens *really fast*.
By construction the Gibbs sampler produces draws that are autocorrelated, so one needs to assess convergence (has the chain reached the stationary distribution?) and mixing (how quickly does it explore it?).
Convergence: Gelman and Rubin (1992) and Brooks, S.P. and Gelman, A. (1998) run several chains from overdispersed starting points. There are two ways to estimate the variance of the stationary distribution: the within-chain variance W and the total variance V combining between- and within-chain variation. If the chains have converged, then both W and V are unbiased. Otherwise the first method will underestimate the variance, since the individual chains have not had time to range all over the stationary distribution, and the second method will overestimate the variance, since the starting points were chosen to be overdispersed.
Brooks and Gelman (1997) have suggested that, if PSRF < 1.2 for all model parameters, one can be fairly confident that convergence has been reached. More reassuring (and common) is to apply the more stringent condition PSRF < 1.1.
Mixing: the inefficiency factor (IF) is 1 + 2Σ_{k=1}^∞ ρ_k, where ρ_k is the k-th order autocorrelation of the draws. This is the inverse of the relative numerical efficiency measure of Geweke (1992). It is usually estimated as the spectral density at frequency zero with a Newey-West kernel (with a 4% bandwidth). i.i.d. sampling (e.g. MC sampling) features IF = 1; here values of IF < 20 are considered good.
Example: regression with AR(1) errors:

y_t = x'_tβ + ε_t    (42)
ε_t = ρε_{t−1} + u_t,    u_t ~ iid N(0, σ²_u).    (43)

In this model there are 3 (groups of) parameters: β, ρ, σ²_u. Also, we have

σ²_ε = Var(ε_t) = σ²_u/(1 − ρ²).
Consider the Cochrane-Orcutt transformation:

    [ 1    0   ...  0 ]
    [ −ρ   1   ...  0 ]
P = [ ...  ... ... ...]
    [ 0   ...  −ρ   1 ]

Py = PXβ + Pε    (44)

where [Pε]_t = ε_t − ρε_{t−1}. The model in (44) is a Generalized LRM, with error variance Var(Pε) = P Var(ε)P' ≡ Φ(ρ, σ²_u). We have:

P(ρ)y = P(ρ)Xβ + P(ρ)ε.    (45)
Set the priors:

β ~ N(β₀, Ω₀),    ρ ~ N(ρ₀, V₀),    σ²_u ~ G⁻¹(v₀/2, s²₀/2).

Under knowledge of σ²_u and ρ, this gives the following posterior for β:

β | ρ, σ²_u, y ~ N(β₁, Ω₁)
Ω₁ = (Ω₀⁻¹ + X'Φ(ρ, σ²_u)⁻¹X)⁻¹,    β₁ = Ω₁(Ω₀⁻¹β₀ + X'Φ(ρ, σ²_u)⁻¹y).

Then, under knowledge of β, it is easy to use the model in (42) to derive ε | β, y, and this can be used as an observable in (43). Stacking ε₍₁₎ = (ε₂, ..., ε_T)' and ε₍₀₎ = (ε₁, ..., ε_{T−1})', this gives

ε₍₁₎ = ρε₍₀₎ + u,    u ~ N(0, σ²_u I_{T−1}),    (46)

which is a standard linear regression model with AR coefficient ρ and error variance σ²_u.
Given the priors specified above, the posterior for ρ will be:

ρ | β, σ²_u, y ~ N(ρ₁, V₁),
V₁ = (V₀⁻¹ + ε₍₀₎'ε₍₀₎/σ²_u)⁻¹,    ρ₁ = V₁(V₀⁻¹ρ₀ + ε₍₀₎'ε₍₁₎/σ²_u).
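Putting the three conditional posteriors together gives a Gibbs sampler. The sketch below quasi-differences the data in the β step instead of building the full Φ(ρ, σ²_u) matrix, and the prior hyperparameters and DGP values are invented:

```python
import numpy as np

rng = np.random.default_rng(9)
T = 300
x = rng.normal(size=T)
eps = np.zeros(T)
for t in range(1, T):                       # DGP: beta=1, rho=0.6, s2u=0.5
    eps[t] = 0.6 * eps[t - 1] + rng.normal(0.0, np.sqrt(0.5))
y = 1.0 * x + eps

B0, V0, v0, s0 = 10.0, 1.0, 4.0, 2.0        # prior variances and IG(v0/2, s0/2)
beta, rho, s2u = 0.0, 0.0, 1.0
draws = []
for j in range(3000):
    # 1) beta | rho, s2u: regression on quasi-differenced data
    ys, xs = y[1:] - rho * y[:-1], x[1:] - rho * x[:-1]
    Vb = 1.0 / (1.0 / B0 + xs @ xs / s2u)
    beta = Vb * (xs @ ys / s2u) + np.sqrt(Vb) * rng.normal()
    # 2) rho | beta, s2u: regression of eps_t on eps_{t-1} as in (46)
    e = y - beta * x
    Vr = 1.0 / (1.0 / V0 + e[:-1] @ e[:-1] / s2u)
    rho = Vr * (e[:-1] @ e[1:] / s2u) + np.sqrt(Vr) * rng.normal()
    # 3) s2u | beta, rho: inverse gamma via a scaled chi-square draw
    u = e[1:] - rho * e[:-1]
    s2u = (s0 + u @ u) / rng.chisquare(v0 + u.size)
    if j >= 500:                            # discard burn-in
        draws.append((beta, rho, s2u))
post = np.array(draws).mean(axis=0)
print(post)   # posterior means near (1.0, 0.6, 0.5)
```

Prior means are set to zero here, so the β and ρ steps shrink slightly towards zero; with T = 300 the data dominate.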
14 Appendix A: Moments of the error variance in the LRM
Start with:

ε̂'ε̂/σ² = (T − k)σ̂²/σ² ~ χ²_{T−k},

where

E[(T − k)σ̂²/σ²] = T − k
Var[(T − k)σ̂²/σ²] = 2(T − k).

Then σ̂² is a gamma, σ̂² ~ G((T − k)/2, s²₀/2) with s²₀ = (T − k)/σ², with mean

[(T − k)/2] / [s²₀/2] = (T − k)/s²₀ = (T − k)/[(T − k)/σ²] = σ²

and variance

[(T − k)/2] / [s²₀/2]² = 2(T − k)/s⁴₀ = 2(T − k)/[(T − k)²/σ⁴] = 2σ⁴/(T − k).

Viewing instead σ² as the random quantity, the same relation implies that

1/σ² ~ (1/ε̂'ε̂) χ²_{T−k},

that is, σ² is the inverse of a rescaled chi-square distribution, i.e. it is an inverse gamma. This offers a choice for a conjugate prior for σ².