
3. Review: Generalized least squares


3.1. Covariance matrices
(Readings: Greene, Appendix.)
We maintain the assumption of the linear structure L0 with C1 and C3,
$$ y = X\beta + u, $$
but replace C2 by

Assumption G2: $\mathrm{Cov}(u) = E(uu') = \Omega$ with $\Omega = \Omega'$, $\Omega > 0$, i.e. $a'\Omega a > 0$ for all $a \in \mathbb{R}^n \setminus \{0\}$.

By construction, $\Omega$ is symmetric and positive semidefinite. Hence, the actual assumption is positive definiteness. Choosing $a$ as the $i$th column of the identity matrix $I_n$, this assumption implies, quite reasonably,
$$ \omega_{ii} = E(u_i^2) = \mathrm{Var}(u_i) > 0. $$
We'll now review some useful results.
Lemma 3.1 (Symmetric matrices) With $A = A'$ it holds:
a) All eigenvalues are real.
b) The eigenvectors can be chosen such that $v_i'v_i = 1$ and $v_i'v_j = 0$ for $i \neq j$ (even if $\lambda_i = \lambda_j$).
c) Let $V = (v_1, \ldots, v_n)$ contain orthonormal eigenvectors from b), such that $V$ is orthogonal [i.e. $V'V = VV' = I_n$]. Further denote
$$ \Lambda^{0.5} := \mathrm{diag}\big(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_n}\big) \quad\text{and}\quad A^{0.5} := V\Lambda^{0.5}V'. $$
It holds that $A = A^{0.5}A^{0.5}$ and $A = V\Lambda V'$ (spectral decomposition).
d) Let the $m \times n$ matrix $F$ with $n \leq m$ be of full rank $n$, and $A = F'F$. Then $A$ is positive definite ($A > 0$).
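As a small numerical illustration, a NumPy sketch of the spectral decomposition and the symmetric square root from part c); the positive definite matrix A below is an arbitrary example:

    import numpy as np

    # an arbitrary symmetric positive definite example matrix
    A = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

    # spectral decomposition A = V diag(lambda) V'
    lam, V = np.linalg.eigh(A)               # eigh is tailored to symmetric matrices
    print(np.allclose(A, V @ np.diag(lam) @ V.T))          # True

    # symmetric square root A^{0.5} = V diag(sqrt(lambda)) V'
    A_half = V @ np.diag(np.sqrt(lam)) @ V.T
    print(np.allclose(A, A_half @ A_half))                 # True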
Lemma 3.2 (Cholesky decomposition) Let $\Omega = \Omega'$ and $\Omega > 0$. It then holds:
1. $\Omega$ is invertible.
2. There exists a triangular decomposition,
$$ \Omega = CC' \quad\text{with}\quad C = \begin{pmatrix} c_{11} & 0 & 0 & \cdots & 0 \\ c_{21} & c_{22} & 0 & \cdots & 0 \\ \vdots & & \ddots & & \vdots \\ c_{n1} & c_{n2} & c_{n3} & \cdots & c_{nn} \end{pmatrix}, $$
where all entries on the main diagonal are non-zero.
3. The inverse $\Omega^{-1}$ is positive definite, too.
This decomposition is not unique!
Corollary 3.1 Let $\Omega$ be positive definite. As a consequence of the Cholesky factor $C$ being invertible it holds that
$$ \Omega^{-1} = C'^{-1}C^{-1}, \qquad C'\,\Omega^{-1}C = I_n, \qquad C^{-1}\Omega\,C'^{-1} = I_n\,. $$
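These relations are easy to verify numerically; a minimal NumPy sketch, where Omega is an arbitrary positive definite matrix chosen for the example:

    import numpy as np

    # arbitrary positive definite matrix for illustration
    Omega = np.array([[4.0, 1.0, 0.5],
                      [1.0, 3.0, 0.2],
                      [0.5, 0.2, 2.0]])

    C = np.linalg.cholesky(Omega)            # lower triangular, Omega = C C'
    C_inv = np.linalg.inv(C)

    print(np.allclose(Omega, C @ C.T))                          # Omega = C C'
    print(np.allclose(np.linalg.inv(Omega), C_inv.T @ C_inv))   # Omega^{-1} = C'^{-1} C^{-1}
    print(np.allclose(C_inv @ Omega @ C_inv.T, np.eye(3)))      # C^{-1} Omega C'^{-1} = I_n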
3.2. Model transformation
(Readings: Greene, Sections 8.4.1-8.4.3 and 8.3.)
Under C1, G2 and C3 it holds that
$$ E\big[\hat\beta\big] = \beta, \qquad \mathrm{Cov}\big[\hat\beta\big] = E\big[(\hat\beta - \beta)(\hat\beta - \beta)'\big] = (X'X)^{-1}X'\Omega X(X'X)^{-1}. $$
Issues with OLS:
1. $\hat\beta$ is no longer efficient (BLUE) under G2, cf. the Gauss-Markov theorem;
2. statistical inference building on inappropriate covariance estimation in Propositions 2.6 or 2.7 is not valid. (Not always, though: it could happen that $(X'X)^{-1}X'\Omega X(X'X)^{-1} = \bar\sigma^2(X'X)^{-1}$ for a suitable average variance $\bar\sigma^2$; but do not bet on it happening.)
To overcome those problems we transform the equation with $C^{-1}$ from Lemma 3.2,
$$ C^{-1}y = C^{-1}X\beta + C^{-1}u, $$
or
$$ y^* = X^*\beta + u^*, \qquad (3.1) $$
with $y^* = C^{-1}y$, $X^* = C^{-1}X$ and $u^* = C^{-1}u$. Notice that (3.1) satisfies C1 through C3, and in particular C2! Hence, OLS of (3.1) is efficient. It yields the generalized least squares [GLS] estimator of (1.2), denoted (here) by $\hat\beta_{GLS}$:
$$ \hat\beta_{GLS} = \big(X'\Omega^{-1}X\big)^{-1}X'\Omega^{-1}y\,. \qquad (3.2) $$
In practice $\Omega$ or $\Omega^{-1}$ has to be estimated; but its Cholesky decomposition is only used in justifying the estimator, and not in the actual estimation.
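A minimal NumPy sketch of (3.2) on simulated data with a known, purely illustrative $\Omega$; it also checks that OLS on the transformed model $y^* = C^{-1}y$, $X^* = C^{-1}X$ reproduces the GLS estimate:

    import numpy as np

    rng = np.random.default_rng(0)
    n, K = 50, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
    beta = np.array([1.0, 2.0, -0.5])

    # illustrative heteroskedastic covariance matrix (any Omega > 0 would do)
    Omega = np.diag(rng.uniform(0.5, 3.0, size=n))
    y = X @ beta + rng.multivariate_normal(np.zeros(n), Omega)

    # GLS via formula (3.2)
    Omega_inv = np.linalg.inv(Omega)
    beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

    # equivalently: OLS on the transformed model
    C = np.linalg.cholesky(Omega)
    y_star = np.linalg.solve(C, y)           # y* = C^{-1} y
    X_star = np.linalg.solve(C, X)           # X* = C^{-1} X
    beta_gls2, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)

    print(np.allclose(beta_gls, beta_gls2))  # True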
The GLS residuals are the OLS residuals of the transformed equation,
$$ \hat u^* = y^* - X^*\hat\beta_{GLS}\,; $$
do not confuse them with the residuals of the original equation, which are still $\hat u = y - \hat y = y - X\hat\beta_{GLS}$.
Properties:
$$ E\big[\hat\beta_{GLS}\big] = \beta, \qquad \mathrm{Cov}\big[\hat\beta_{GLS}\big] = E\big[(\hat\beta_{GLS} - \beta)(\hat\beta_{GLS} - \beta)'\big] = \big(X'\Omega^{-1}X\big)^{-1}, $$
and $\hat\beta_{GLS}$ is BLUE (Gauss-Markov).
Equivalence with OLS
Under certain conditions, the GLS and the OLS estimators coincide.
Proposition 3.1 (GLS = OLS) It holds that $\hat\beta = \hat\beta_{GLS}$ if and only if
$$ X'\Omega^{-1}(I_n - P_x) = 0\,, $$
where $P_x = X(X'X)^{-1}X'$. Furthermore, this condition is equivalent to
$$ \Omega P_x = P_x\Omega\,. $$
Under C2, $\Omega = \sigma^2 I_n$, and it follows that $\hat\beta = \hat\beta_{GLS}$. Further applications of Proposition 3.1 will be given in Section 3.4.
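A small simulation check of Proposition 3.1 (all data and covariance matrices below are illustrative assumptions): for $\Omega = \sigma^2 I_n$ the commutation condition holds and OLS equals GLS, while for a generic heteroskedastic $\Omega$ it typically fails:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 30
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)
    P_x = X @ np.linalg.inv(X.T @ X) @ X.T

    def gls(Omega):
        Oi = np.linalg.inv(Omega)
        return np.linalg.solve(X.T @ Oi @ X, X.T @ Oi @ y)

    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

    # Omega = sigma^2 I_n: Omega commutes with P_x and GLS = OLS
    Omega1 = 2.0 * np.eye(n)
    print(np.allclose(Omega1 @ P_x, P_x @ Omega1), np.allclose(gls(Omega1), beta_ols))

    # generic heteroskedastic Omega: the condition typically fails and the estimators differ
    Omega2 = np.diag(rng.uniform(0.5, 5.0, size=n))
    print(np.allclose(Omega2 @ P_x, P_x @ Omega2), np.allclose(gls(Omega2), beta_ols))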
3.3. Asymptotic properties and feasible GLS
(Readings: Greene, Sections 8.3 and 8.7.)
Proposition 3.2 Assume in addition to C1, G2, C3 that
$$ \frac{X'\Omega^{-1}X}{n} \to S_x^{\Omega}\,, $$
with $S_x^{\Omega}$ finite and invertible as $n \to \infty$. If $n^{-0.5}X'\Omega^{-1}u$ has a limiting multivariate normal distribution, then
$$ \sqrt{n}\,\big(\hat\beta_{GLS} - \beta\big) \xrightarrow{d} N_K\big(0,\, (S_x^{\Omega})^{-1}\big)\,, $$
and, under the null $H_0: R\beta = c$,
$$ X_r^2 := (R\hat\beta_{GLS} - c)'\,\big[R(X'\Omega^{-1}X)^{-1}R'\big]^{-1}(R\hat\beta_{GLS} - c) \xrightarrow{d} \chi^2(r)\,. $$
The first additional assumption of the proposition requires the covariances of the $n$ noise terms to behave in a regular manner, just like the regressors themselves are required to (by C4). The second asks for a CLT to hold; it must be imposed, since the noise terms $u$ are not iid under G2, so a CLT cannot be derived on the basis of the existence of the limit $S_x^{\Omega}$ alone.
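For illustration, a NumPy/SciPy sketch of the statistic $X_r^2$ for a single restriction, treating $\Omega$ as known and simulating data under the null (all choices below are assumptions made for the example):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n = 200
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    Omega = np.diag(rng.uniform(0.5, 2.0, size=n))        # treated as known here
    y = X @ np.array([1.0, 0.0]) + rng.multivariate_normal(np.zeros(n), Omega)

    Oi = np.linalg.inv(Omega)
    XOX = X.T @ Oi @ X
    beta_gls = np.linalg.solve(XOX, X.T @ Oi @ y)

    # H0: R beta = c with R = (0, 1), c = 0, i.e. a zero slope (true in this simulation)
    R = np.array([[0.0, 1.0]])
    c = np.array([0.0])
    diff = R @ beta_gls - c
    W = diff @ np.linalg.solve(R @ np.linalg.inv(XOX) @ R.T, diff)

    print(W, stats.chi2.sf(W, df=1))                      # statistic and asymptotic p-value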
Estimation of $\Omega$
GLS depends on $\Omega$; replacing it with an appropriate estimator $\hat\Omega$, we talk about feasible or estimated GLS. But estimation of $\Omega$ is not feasible in general (why?), so it usually relies on simplifying assumptions, which are in turn devised for specific applications; see e.g. Section 3.4.
Feasible GLS will be denoted (here) by tildes,
$$ \tilde\beta_{GLS} := \big(X'\hat\Omega^{-1}X\big)^{-1}X'\hat\Omega^{-1}y\,, \qquad \tilde X_r^2 := (R\tilde\beta_{GLS} - c)'\,\big[R(X'\hat\Omega^{-1}X)^{-1}R'\big]^{-1}(R\tilde\beta_{GLS} - c)\,. $$
Determining C: Classical examples
Example 3.1 (Heteroskedasticity) Assume
$$ y_i = \beta_1 + \beta_2 x_{i,2} + \cdots + \beta_K x_{i,K} + u_i\,, \qquad i = 1, \ldots, n, $$
with
$$ \mathrm{Cov}(u) = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \sigma_n^2 \end{pmatrix} = \mathrm{diag}\big(\sigma_1^2, \ldots, \sigma_n^2\big)\,. $$
With $\sigma_i > 0$:
$$ \Omega = \mathrm{diag}\big(\sigma_1^2, \ldots, \sigma_n^2\big)\,, \qquad \Omega^{-1} = \mathrm{diag}\big(1/\sigma_1^2, \ldots, 1/\sigma_n^2\big)\,, $$
$$ C = \mathrm{diag}(\sigma_1, \ldots, \sigma_n) = C'\,, \qquad C^{-1} = \mathrm{diag}\big(1/\sigma_1, \ldots, 1/\sigma_n\big)\,. $$
The decomposition of $\Omega$ is not unique. Just as well:
$$ C = \mathrm{diag}(-\sigma_1, \ldots, -\sigma_n) = C'\,. $$
GLS now leads to so-called weighted LS estimation.
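A minimal sketch of weighted LS in NumPy, assuming the standard deviations $\sigma_i$ were known (here they are simply simulated):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 100
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    sigma = rng.uniform(0.5, 3.0, size=n)          # known standard deviations (illustrative)
    y = X @ np.array([1.0, 2.0]) + sigma * rng.normal(size=n)

    # weighted LS: divide each observation (row) by sigma_i, then run OLS
    y_star = y / sigma
    X_star = X / sigma[:, None]
    beta_wls, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
    print(beta_wls)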
Example 3.2 (Autocorrelation of order 1) Assume for $u_i$ a so-called AR(1) process,
$$ u_i = a\,u_{i-1} + \varepsilon_i\,, \qquad |a| < 1\,, \quad i \in \mathbb{Z}, $$
where $\varepsilon$ satisfies C2. One can show (see Chapter 10) that
$$ \mathrm{Cov}(u_1, u_{1+h}) = E(u_1 u_{1+h}) = \frac{a^{h}\sigma_\varepsilon^2}{1 - a^2}\,, $$
where $\sigma_\varepsilon^2 = E(\varepsilon_i^2)$. Hence,
$$ \Omega = \frac{\sigma_\varepsilon^2}{1 - a^2}\begin{pmatrix} 1 & a & a^2 & \cdots & a^{n-1} \\ a & 1 & a & \cdots & a^{n-2} \\ \vdots & & \ddots & & \vdots \\ a^{n-1} & a^{n-2} & a^{n-3} & \cdots & 1 \end{pmatrix}. $$
By multiplying out we can check that $\Omega^{-1}$ is a band matrix:
$$ \Omega^{-1} = \frac{1}{\sigma_\varepsilon^2}\begin{pmatrix} 1 & -a & 0 & \cdots & 0 & 0 \\ -a & 1 + a^2 & -a & \cdots & 0 & 0 \\ 0 & -a & 1 + a^2 & \cdots & 0 & 0 \\ \vdots & & & \ddots & & \vdots \\ 0 & 0 & 0 & \cdots & 1 + a^2 & -a \\ 0 & 0 & 0 & \cdots & -a & 1 \end{pmatrix}. $$
To obtain $\Omega^{-1} = C'^{-1}C^{-1}$ we use
$$ C^{-1} = \frac{1}{\sigma_\varepsilon}\begin{pmatrix} \sqrt{1 - a^2} & 0 & 0 & \cdots & 0 & 0 \\ -a & 1 & 0 & \cdots & 0 & 0 \\ 0 & -a & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & \ddots & & \vdots \\ 0 & 0 & 0 & \cdots & -a & 1 \end{pmatrix}. $$
This leads to the so-called Cochrane-Orcutt procedure: compute quasi-differences,
$$ y^* = C^{-1}y\,, \qquad X^* = C^{-1}X\,, $$
yielding componentwise
$$ y_1^* = \sqrt{1 - a^2}\,y_1\,, \quad y_2^* = y_2 - a\,y_1\,, \;\ldots,\; y_n^* = y_n - a\,y_{n-1}\,, $$
and similarly for $x_{i,k}^*$. Feasible estimation requires estimation of $C$, i.e. of the autoregressive parameter $a$. Compute it with the OLS residuals, $\hat u = y - X\hat\beta$:
$$ \hat a = \frac{\sum_{i=1}^{n-1}\hat u_i\hat u_{i+1}}{\sum_{i=1}^{n}\hat u_i^2}\,. $$
Substituting $a$ by $\hat a$ yields $\hat C^{-1}$. Regression of the quasi-differences $\hat C^{-1}y$ on $\hat C^{-1}X$ delivers $\tilde\beta_{GLS}$.
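A compact NumPy sketch of these Cochrane-Orcutt steps on simulated data (the constant factor $1/\sigma_\varepsilon$ is dropped, as it does not affect the estimator):

    import numpy as np

    rng = np.random.default_rng(4)
    n, a_true = 200, 0.6
    X = np.column_stack([np.ones(n), rng.normal(size=n)])

    # simulate AR(1) errors u_i = a u_{i-1} + eps_i
    eps = rng.normal(size=n)
    u = np.zeros(n)
    u[0] = eps[0] / np.sqrt(1 - a_true**2)
    for i in range(1, n):
        u[i] = a_true * u[i - 1] + eps[i]
    y = X @ np.array([1.0, 2.0]) + u

    # step 1: OLS and residual-based estimate of a
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    res = y - X @ beta_ols
    a_hat = (res[:-1] @ res[1:]) / (res @ res)

    # step 2: quasi-differences and OLS on the transformed data
    y_star = np.r_[np.sqrt(1 - a_hat**2) * y[0], y[1:] - a_hat * y[:-1]]
    X_star = np.vstack([np.sqrt(1 - a_hat**2) * X[0], X[1:] - a_hat * X[:-1]])
    beta_fgls, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
    print(a_hat, beta_fgls)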
The following section discusses a situation often encountered in applied work where
feasible GLS can be successfully applied.
3.4. Seemingly unrelated regressions (SUR)
(Readings: Greene, Section 10.2.)
Model
There are N units, countries, individuals or simply variables with T observations over time,
$$ y_{t,j} \in \mathbb{R}\,, \qquad j = 1, \ldots, N\,, \quad t = 1, \ldots, T\,, $$
to be explained each by linear equations with $K_j$ variables (such models, and in particular their efficient estimation, were first discussed by Zellner, 1962):
$$ y_{t,j} = \sum_{k=1}^{K_j}\beta_k^{(j)}x_{t,k}^{(j)} + u_{t,j} = x_t^{(j)\prime}\beta^{(j)} + u_{t,j}\,, \qquad t = 1, \ldots, T\,, \quad j = 1, \ldots, N\,, \qquad (3.3) $$
with $K_j$-dimensional vectors
$$ x_t^{(j)} = \big(x_{t,1}^{(j)}, \ldots, x_{t,K_j}^{(j)}\big)'\,, \qquad \beta^{(j)} = \big(\beta_1^{(j)}, \ldots, \beta_{K_j}^{(j)}\big)'. $$
Each of the N equations can be written more compactly as
$$ y^{(j)} = X^{(j)}\beta^{(j)} + u^{(j)}\,, \qquad j = 1, \ldots, N\,, \qquad (3.4) $$
where
$$ y^{(j)} = \big(y_{1,j}, \ldots, y_{T,j}\big)'\,, \qquad X^{(j)} = \begin{pmatrix} x_1^{(j)\prime} \\ \vdots \\ x_T^{(j)\prime} \end{pmatrix}, \qquad u^{(j)} = \big(u_{1,j}, \ldots, u_{T,j}\big)'. $$
Parameter values and even explanatory variables may differ across units, but the N equations could still in general be related through correlation of the error terms. For that reason they are only seemingly unrelated (SUR). In fact, when the variables and the parameters are the same, such SUR data would be a panel data set. (Another difference is that N is much larger than T in classical panels; for SUR, the implicit assumption is that T is much larger.)
Stacking N regressions in one equation
Now, we compactify the N equations,
$$ y = X\beta + u $$
with vectors of length $n = NT$,
$$ y = \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{pmatrix}, \qquad u = \begin{pmatrix} u^{(1)} \\ \vdots \\ u^{(N)} \end{pmatrix}, $$
and a (huge!) regressor matrix of dimension $NT \times \sum_{j=1}^{N}K_j$,
$$ X = \begin{pmatrix} X^{(1)} & 0 & \cdots & 0 \\ 0 & X^{(2)} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & X^{(N)} \end{pmatrix}, $$
which is block diagonal and fits the stacked parameter vector of dimension $K = \sum_{j=1}^{N}K_j$:
$$ \beta = \begin{pmatrix} \beta^{(1)} \\ \vdots \\ \beta^{(N)} \end{pmatrix}. $$
Hence by stacking equations, a multiple equation model can be cast into the single
equation framework.
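A short NumPy/SciPy sketch of the stacking step, for illustrative dimensions N = 2, T = 5 and hypothetical per-unit regressor matrices:

    import numpy as np
    from scipy.linalg import block_diag

    rng = np.random.default_rng(5)
    T = 5                          # N = 2 units, T = 5 observations (illustrative)
    # per-unit regressor matrices with K_1 = 2 and K_2 = 3 columns
    X1 = np.column_stack([np.ones(T), rng.normal(size=T)])
    X2 = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
    y1, y2 = rng.normal(size=T), rng.normal(size=T)

    X = block_diag(X1, X2)         # NT x (K_1 + K_2) block-diagonal regressor matrix
    y = np.concatenate([y1, y2])   # stacked dependent variable of length n = NT
    print(X.shape, y.shape)        # (10, 5) (10,)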
Error assumptions
Let $u$ be the stacked error vector. We continue to assume
$$ E(u) = 0. $$
But the errors are allowed to correlate across the units. This amounts to a generalized linear model with
$$ \mathrm{Cov}(u) = \Omega > 0. $$
Recall that $\Omega$ consists of $n(n+1)/2$ free parameters in general and therefore cannot be estimated without restrictions.
The assumptions of the SUR model are as follows:
1. contemporaneous correlation across units is allowed, but
2. there is no correlation at different points in time.
They are summarized as
$$ E\big(u^{(j)}u^{(l)\prime}\big) = \sigma_{jl}I_T = E\big(u^{(l)}u^{(j)\prime}\big)\,, \qquad j, l = 1, \ldots, N. $$
In other words: due to the diagonal shape there is no serial correlation, whether within or between units; within each unit ($j = l$) there is homoskedasticity, while the variances may vary from unit to unit; finally, the units may be contemporaneously correlated, and this correlation remains constant over time. You could see $u_t = (u_{t,1}, \ldots, u_{t,N})'$ as (iid) draws from an N-dimensional joint distribution. (Depending on the data set, the classical SUR assumptions may not always be appropriate; to ensure the lack of serial correlation within each unit, one may have to resort to time series models; see Chapter 10.)
The $\sigma$'s are gathered in an $N \times N$ matrix $\Sigma$, which is assumed to be positive definite:
$$ \Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1N} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2N} \\ \vdots & & \ddots & \vdots \\ \sigma_{N1} & \sigma_{N2} & \cdots & \sigma_{NN} \end{pmatrix} > 0\,. $$
The SUR assumptions can now be expressed in condensed form using the Kronecker product $\otimes$:
$$ \Omega = E(uu') = E\begin{pmatrix} u^{(1)}u^{(1)\prime} & u^{(1)}u^{(2)\prime} & \cdots & u^{(1)}u^{(N)\prime} \\ u^{(2)}u^{(1)\prime} & u^{(2)}u^{(2)\prime} & \cdots & u^{(2)}u^{(N)\prime} \\ \vdots & & \ddots & \vdots \\ u^{(N)}u^{(1)\prime} & u^{(N)}u^{(2)\prime} & \cdots & u^{(N)}u^{(N)\prime} \end{pmatrix} $$
or
$$ \Omega = \begin{pmatrix} \sigma_{11}I_T & \sigma_{12}I_T & \cdots & \sigma_{1N}I_T \\ \sigma_{21}I_T & \sigma_{22}I_T & \cdots & \sigma_{2N}I_T \\ \vdots & & \ddots & \vdots \\ \sigma_{N1}I_T & \sigma_{N2}I_T & \cdots & \sigma_{NN}I_T \end{pmatrix} = \Sigma \otimes I_T\,. $$
It will be convenient to become more familiar with properties of the Kronecker prod-
uct.
Kronecker product
The following lemma contains three useful results.
(Have you noticed how everything in the lecture is either interesting or useful?)
Lemma 3.3 (Kronecker product) Let A be an $m \times n$ matrix and B of dimension $k \times l$. Then:
1. For $n = k$ it holds with conformable matrices C and D that
$$ (A \otimes C)(B \otimes D) = (AB) \otimes (CD)\,. $$
2. If A and B are square and of full rank, then
$$ (A \otimes B)^{-1} = A^{-1} \otimes B^{-1}\,. $$
3. Without restrictions on the dimensions,
$$ (A \otimes B)' = A' \otimes B'\,. $$
GLS estimation of SUR
To compute GLS we require $\Omega^{-1}$, which is simple to determine because of Lemma 3.3:
$$ \Omega^{-1} = \Sigma^{-1} \otimes I_T\,. $$
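A quick numerical check of this identity with NumPy (the 2 x 2 matrix Sigma below is an arbitrary positive definite choice):

    import numpy as np

    T = 4
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])       # arbitrary positive definite N x N matrix

    Omega = np.kron(Sigma, np.eye(T))    # Omega = Sigma (x) I_T

    # Lemma 3.3.2: (Sigma (x) I_T)^{-1} = Sigma^{-1} (x) I_T
    lhs = np.linalg.inv(Omega)
    rhs = np.kron(np.linalg.inv(Sigma), np.eye(T))
    print(np.allclose(lhs, rhs))         # True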
There are at least two empirically relevant situations where GLS reduces to OLS in the SUR model: first, if the different units are not contemporaneously correlated but may have different variances; and second, if the set of regressors is numerically identical in every equation. This is simply a corollary to Proposition 3.1.
Corollary 3.2 (GLS = OLS) Consider the SUR model (3.4). It holds that $\hat\beta = \hat\beta_{GLS}$ if
1. $\Sigma = \mathrm{diag}(\sigma_{11}, \ldots, \sigma_{NN})$, or
2. $X^{(1)} = \cdots = X^{(N)}$.
Note that we require numerically identical regressors in 2., and not only variables measuring the same entities. Yet, an important example of this case is the unrestricted Vector AutoRegressive [VAR] model, which is a multivariate time series SUR model. See Chapter 11.
Feasible GLS
Recall that $\Sigma$ is not known in practice but has to be estimated. Replacing $\Sigma$ by some estimator $\hat\Sigma$ we speak of feasible (or sometimes: estimated) GLS. In the case of SUR, an estimator is readily available:
$$ \hat\Omega = \hat\Sigma \otimes I_T \qquad\text{or}\qquad \hat\Omega^{-1} = \hat\Sigma^{-1} \otimes I_T\,. $$
Let $\hat u^{(j)}$ and $\hat u^{(l)}$ denote the OLS residuals from regressing the jth and lth equations. An obvious estimator of the contemporaneous cross-covariance is
$$ \hat\sigma_{jl} = \frac{\hat u^{(j)\prime}\hat u^{(l)}}{T} = \frac{\sum_{t=1}^{T}\hat u_{t,j}\hat u_{t,l}}{T}\,, \qquad j, l = 1, \ldots, N. $$
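A sketch of feasible GLS for a two-equation SUR system on simulated data (Sigma_true and the coefficients below are illustrative assumptions):

    import numpy as np
    from scipy.linalg import block_diag

    rng = np.random.default_rng(6)
    N, T = 2, 200
    Sigma_true = np.array([[1.0, 0.6],
                           [0.6, 2.0]])    # contemporaneous covariance (illustrative)

    X1 = np.column_stack([np.ones(T), rng.normal(size=T)])
    X2 = np.column_stack([np.ones(T), rng.normal(size=T)])
    U = rng.multivariate_normal(np.zeros(N), Sigma_true, size=T)   # rows are u_t
    y1 = X1 @ np.array([1.0, 0.5]) + U[:, 0]
    y2 = X2 @ np.array([-1.0, 2.0]) + U[:, 1]

    # equation-by-equation OLS residuals
    r1 = y1 - X1 @ np.linalg.lstsq(X1, y1, rcond=None)[0]
    r2 = y2 - X2 @ np.linalg.lstsq(X2, y2, rcond=None)[0]
    R = np.column_stack([r1, r2])
    Sigma_hat = R.T @ R / T                # sigma_hat_{jl} = sum_t u_{t,j} u_{t,l} / T

    # stacked system and feasible GLS with Omega_hat^{-1} = Sigma_hat^{-1} (x) I_T
    X = block_diag(X1, X2)
    y = np.concatenate([y1, y2])
    Omega_inv_hat = np.kron(np.linalg.inv(Sigma_hat), np.eye(T))
    beta_fgls = np.linalg.solve(X.T @ Omega_inv_hat @ X, X.T @ Omega_inv_hat @ y)
    print(Sigma_hat, beta_fgls)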
Testing for no cross-correlation
In view of Corollary 3.2 it is interesting to test the null hypothesis of no contemporaneous correlation:
$$ H_0: \sigma_{jl} = 0 \ \text{ for } j \neq l\,, \quad j, l \in \{1, \ldots, N\}\,. $$
Given the symmetry of $\Sigma$, the null hypothesis consists of
$$ 1 + 2 + \ldots + (N - 2) + (N - 1) = \frac{N(N-1)}{2} $$
zero restrictions. The squared correlations are estimated as
$$ \hat\rho_{j,l}^2 := \frac{\hat\sigma_{jl}^2}{\hat\sigma_{jj}\hat\sigma_{ll}}\,, \qquad j \neq l\,. $$
A test statistic due to Breusch and Pagan (1980) is
$$ BP := T\sum_{j=2}^{N}\sum_{l=1}^{j-1}\hat\rho_{j,l}^2\,. $$
Under $H_0$ the test relies, for $T \to \infty$, on
$$ BP \xrightarrow{d} \chi^2\Big(\frac{N(N-1)}{2}\Big)\,. $$
(Strictly speaking, the asymptotic result is derived under cross-unit independence.
See Exercise 3.3.)
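A minimal NumPy/SciPy sketch of the BP statistic; here the residual matrix is simulated under the null, whereas in practice its columns would be the unit-wise OLS residuals:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    N, T = 3, 250
    # T x N residual matrix, one column per unit (simulated, so H0 holds by construction)
    R = rng.normal(size=(T, N))

    Sigma_hat = R.T @ R / T
    rho2 = Sigma_hat**2 / np.outer(np.diag(Sigma_hat), np.diag(Sigma_hat))

    # sum of the squared correlations over the strict lower triangle (j > l)
    BP = T * np.sum(np.tril(rho2, k=-1))
    df = N * (N - 1) // 2
    print(BP, stats.chi2.sf(BP, df=df))    # statistic and asymptotic p-value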
3.5. Exercises
Exercise 3.1 Verify that the square root of the matrix
$$ A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} $$
is given by
$$ A^{0.5} = \frac{1}{2}\begin{pmatrix} \sqrt{3} + 1 & \sqrt{3} - 1 \\ \sqrt{3} - 1 & \sqrt{3} + 1 \end{pmatrix}. $$
Exercise 3.2 Obtain an explicit Cholesky decomposition for the matrix
$$ A = \begin{pmatrix} a & b \\ b & c \end{pmatrix}, $$
which is assumed to be positive definite.
Exercise 3.3 Assume a SUR model obeying the classical assumptions (as a whole, i.e. not just unit-wise). Derive the limiting distribution of the Breusch-Pagan test statistic under the null of cross-unit independence. Does anything change when the disturbances are contemporaneously uncorrelated, but otherwise dependent?