Econometrics
Lecture Notes (LSE)
Oliver Linton
October 14, 2002
Contents

1 Linear Regression
  1.1 The Model
  1.2 The OLS Procedure
    1.2.1 Some Alternative Estimation Paradigms
  1.3 Geometry of OLS and Partitioned Regression
  1.4 Goodness of Fit
  1.5 Functional Form

2 Statistical Properties of the OLS Estimator
  2.1 Optimality

3 Hypothesis Testing
  3.1 General Notations
  3.2 Examples
  3.3 The t Test
  3.4 The F Test
  3.5 Restricted Least Squares
  3.6 Tests of Structural Change
  3.7 Likelihood-Based Tests: Wald, LM, and LR

4 Further Topics in Estimation
  4.1 Omitted Variables
  4.2 Inclusion of Irrelevant Variables
  4.3 Model Selection
  4.4 Multicollinearity
  4.5 Influential Observations
  4.6 Missing Observations

5 Asymptotics
  5.1 Types of Asymptotic Convergence
  5.2 Laws of Large Numbers and Central Limit Theorems
  5.3 Additional Results
  5.4 Applications to OLS
  5.5 Asymptotic Distribution of OLS
  5.6 Order Notation
  5.7 Standard Errors and Test Statistics in Linear Regression
  5.8 The delta method

6 Errors in Variables
  6.1 Solutions to EIV
  6.2 Other Types of Measurement Error
  6.3 Durbin-Wu-Hausman Test

7 Heteroskedasticity
  7.1 Effects of Heteroskedasticity
  7.2 Plan A: Eicker-White
  7.3 Plan B: Model Heteroskedasticity
  7.4 Properties of the Procedure
  7.5 Testing for Heteroskedasticity

8 Nonlinear Regression Models
  8.1 Computation
  8.2 Consistency of NLLS
  8.3 Asymptotic Distribution of NLLS
  8.4 Maximum Likelihood

9 Generalized Method of Moments
  9.1 Asymptotic Properties
  9.2 Test Statistics
  9.3 Examples
  9.4 Conditional Moment Restrictions
  9.5 Asymptotics
  9.6 Example

10 Time Series
  10.1 Some Fundamental Properties
  10.2 Estimation
  10.3 Forecasting
  10.4 Autocorrelation and Regression
  10.5 Testing for Autocorrelation
  10.6 Dynamic Regression Models
  10.7 Adaptive expectations
  10.8 Partial adjustment
  10.9 Error Correction
  10.10 Estimation of ADL Models
  10.11 Nonstationary Time Series Models
  10.12 Estimation
  10.13 Testing for Unit Roots
  10.14 Cointegration
  10.15 Martingales
  10.16 GARCH Models
  10.17 Estimation
Chapter 1

Linear Regression

1.1 The Model

The first part of the course will be concerned with estimating and testing in the linear model. We will suggest procedures and derive their properties under certain assumptions. The linear model is the basis for most of econometrics, and a firm grounding in this theory is essential for future work.
We observe the following data:
$$
y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}; \qquad
X = \begin{pmatrix} x_1' \\ \vdots \\ x_n' \end{pmatrix}
  = \begin{pmatrix} x_{11} & \cdots & x_{K1} \\ \vdots & & \vdots \\ x_{1n} & \cdots & x_{Kn} \end{pmatrix},
$$
where $\operatorname{rank}(X) = K$. Note that this is an assumption, but it is immediately verifiable from the data, in contrast to some other assumptions we will make.
It is desirable for statistical analysis to specify a model of how these data were generated. We suppose that there is a random mechanism which is behind everything - the data we have is one realisation of this process.
We stick with the fixed design for most of the linear regression section. Fixed design is perhaps unconvincing for most economic data sets, because of the asymmetry between $y$ and $x$. That is, in economic datasets we have no reason to think that some data were randomly generated while others were fixed. This is especially so in time series, when one regressor might be a lagged value of the dependent variable.
A slightly different specification is the Random Design Linear Model:

(A1r) $X$ is random with respect to repeated samples;
(A2r) there exists $\beta$ such that $E(y|X) = X\beta$;
(A3r) $\operatorname{Var}(y|X) = \sigma^2 I_{n \times n}$.
Finally, we write the regression model in the more familiar form. Define $\varepsilon = y - X\beta = (\varepsilon_1, \ldots, \varepsilon_n)'$; then
$$ y = X\beta + \varepsilon, $$
where [in the fixed design]
$$ E(\varepsilon) = 0, \qquad E(\varepsilon\varepsilon') = \sigma^2 I_n. $$
The linear regression model is more commonly stated like this, with statistical assumptions made about the unobservable $\varepsilon$ rather than directly on the observable $y$. The assumptions about the vector $\varepsilon$ are quite weak in some respects - the observations need not be independent and identically distributed, since only the first two moments of the vector are specified - but strong in regard to the second moments themselves.
It is worth discussing here some alternative assumptions made about the error terms. For this purpose we shall assume a random design, and moreover suppose that $(x_i, \varepsilon_i)$ are i.i.d. In this case, we can assume either
$$ E(\varepsilon_i x_i) = 0 $$
or the stronger condition
$$ E(\varepsilon_i | x_i) = 0, \quad \text{denoted } \varepsilon_i \perp x_i. $$
1.2 The OLS Procedure

The ordinary least squares (OLS) estimator $\hat\beta$ minimizes the criterion
$$ S(b) = \sum_{i=1}^n (y_i - x_i'b)^2 = (y - Xb)'(y - Xb). $$
The first order condition is
$$ \frac{\partial S}{\partial b}(\hat\beta) = -2X'(y - X\hat\beta) = 0, \quad \text{i.e.,} \quad X'y = X'X\hat\beta. $$
Now we use the assumption that $X$ is of full rank. This ensures that $X'X$ is invertible, and
$$ \hat\beta = (X'X)^{-1}X'y. $$
To verify that $\hat\beta$ is indeed the minimizer, write
$$ u(b) = y - Xb = y - X\hat\beta + X\hat\beta - Xb, $$
so that
$$
S(b) = (y - X\hat\beta + X\hat\beta - Xb)'(y - X\hat\beta + X\hat\beta - Xb)
$$
$$
= (y - X\hat\beta)'(y - X\hat\beta) + (\hat\beta - b)'X'X(\hat\beta - b) + (y - X\hat\beta)'X(\hat\beta - b) + (\hat\beta - b)'X'(y - X\hat\beta)
$$
$$
= (y - X\hat\beta)'(y - X\hat\beta) + (\hat\beta - b)'X'X(\hat\beta - b),
$$
because
$$ X'(y - X\hat\beta) = X'y - X'y = 0. $$
But
$$ (\hat\beta - b)'X'X(\hat\beta - b) \geq 0, $$
and equality holds only when $b = \hat\beta$.
Note that
$$ \frac{\partial u_i}{\partial b_j} = -x_{ji}, $$
so the first order conditions can be written as the $K$ normal equations
$$ \sum_{i=1}^n x_{1i}\hat u_i = 0, \quad \sum_{i=1}^n x_{2i}\hat u_i = 0, \quad \ldots, \quad \sum_{i=1}^n x_{Ki}\hat u_i = 0, $$
where $\hat u_i = y_i - x_i'\hat\beta$ are the OLS residuals.
1.2.1 Some Alternative Estimation Paradigms

Maximum Likelihood. Suppose in addition that $y|X \sim N(X\beta, \sigma^2 I)$. Then the density function of the data is
$$ f_{y|X}(y) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left( -\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta) \right). $$
The density function depends on the unknown parameters $\beta, \sigma^2$, which we want to estimate. We therefore switch the emphasis and call the following quantity the log likelihood function for the observed data:
$$ \ell(b, \sigma^2 | y, X) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y - Xb)'(y - Xb). $$
Maximizing over $b$ gives $\hat\beta_{mle} = (X'X)^{-1}X'y$, while
$$ \hat\sigma^2_{mle} = \frac{1}{n}(y - X\hat\beta_{mle})'(y - X\hat\beta_{mle}). $$
Basically, the criterion function is the least squares criterion apart from an affine transformation involving only $\sigma$.
Method of Moments. The moment condition $E(\varepsilon_i x_i) = 0$ has sample analogue $X'(y - Xb)/n = 0$, whose solution is
$$ \hat\beta = (X'X)^{-1}X'y, $$
i.e., the MOM estimator is equal to OLS in this case. Thus, for the moment conditions above we are led to the least squares estimator.
1.3 Geometry of OLS and Partitioned Regression

Let $C(X)$ denote the column space of $X$. OLS finds the point in $C(X)$ closest to $y$. For example, suppose that
$$ X = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix}. $$
Then $C(X)$ is the set of all vectors in $\mathbb{R}^3$ with third component zero.
What is the closest point for the example above with $y = (1, 1, 1)'$?
This is $(1, 1, 0)' = X\hat\beta$, with $\hat u = (0, 0, 1)'$. In fact $\hat u$ is orthogonal to $C(X)$, i.e., $\hat u \in C^\perp(X) = \{(0, 0, c)'\}$.
The matrix
$$ P_X = X(X'X)^{-1}X' $$
projects onto $C(X)$.
Consider also the matrix
$$ A = \begin{pmatrix} 1 & 0 & 3 \\ 1 & 2 & 2 \\ 0 & 1 & 5 \end{pmatrix}, $$
which is of full rank. Therefore, the two parameterizations yield the same $\hat y$, $\hat u$.
Emphasizing $C(X)$ rather than $X$ itself is called the coordinate-free approach. Some aspects of the model/estimate are properties of $C(X)$ only; the choice of coordinates is irrelevant.
When $X$ is not of full rank, the space $C(X)$ is still well defined, as is the projection from $y$ onto $C(X)$.
Partitioned regression. Write $X = (X_1, X_2)$ and
$$ y = X_1\hat\beta_1 + X_2\hat\beta_2 + \hat u. $$
If $X_1'X_2 = 0$, then $\hat\beta_1 = (X_1'X_1)^{-1}X_1'y$ and $\hat\beta_2 = (X_2'X_2)^{-1}X_2'y$. This just says that if $X_1$ and $X_2$ were orthogonal, then we could get $\hat\beta_1$ by regressing $y$ on $X_1$ only, and $\hat\beta_2$ by regressing $y$ on $X_2$ only.
Very rarely are $X_1$ and $X_2$ orthogonal, but we can construct equivalent regressors that are orthogonal. Suppose we have general $X_1$ and $X_2$, whose dimensions satisfy $K_1 + K_2 = K$. Let $P_2 = X_2(X_2'X_2)^{-1}X_2'$ and $M_2 = I - P_2$. We make the following observations:
$(X_1, X_2)$ and $(M_2X_1, X_2)$ span the same space. This follows because
$$ X_1 = M_2X_1 + P_2X_1, $$
where $P_2X_1$ lies in the space spanned by $X_2$. Hence the fitted value can be written
$$
\hat y = X_1\hat\beta_1 + X_2\hat\beta_2
= M_2X_1\hat\beta_1 + X_2\left[ \hat\beta_2 + (X_2'X_2)^{-1}X_2'X_1\hat\beta_1 \right]
= M_2X_1\hat\beta_1 + X_2\hat C,
$$
where
$$ \hat C = \hat\beta_2 + (X_2'X_2)^{-1}X_2'X_1\hat\beta_1. $$
Since $M_2X_1$ and $X_2$ are orthogonal, $\hat\beta_1$ can be obtained by regressing $y$ on $M_2X_1$ alone.
An important special case is where $X_2 = i = (1, \ldots, 1)'$, the intercept. Then
$$ M_2 = I - \frac{ii'}{n} $$
and
$$ M_2x_1 = \begin{pmatrix} x_{11} - \bar x_1 \\ \vdots \\ x_{1n} - \bar x_1 \end{pmatrix}, \qquad \bar x_1 = \frac{1}{n}\sum_{i=1}^n x_{1i}, $$
i.e., $M_2$ transforms a regressor into deviations from its mean.
1.4 Goodness of Fit

How well does the model explain the data? One possibility is to measure the fit by the residual sum of squares
$$ RSS = \sum_{i=1}^n (y_i - \hat y_i)^2. $$
In general, the smaller the RSS the better. However, the numerical value of RSS depends on the units used to measure $y$, so that one cannot compare across models.
The generally used measure of goodness of fit is the $R^2$. In actuality, there are three alternative definitions in general.
One minus the ratio of the residual sum of squares to total sum of squares,
$$ R_1^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{\sum_{i=1}^n (y_i - \bar y)^2}. $$
The sample correlation squared between $y$ and $\hat y$,
$$ R_2^2 = \frac{\left[ \sum_{i=1}^n (y_i - \bar y)(\hat y_i - \bar{\hat y}) \right]^2}{\sum_{i=1}^n (y_i - \bar y)^2\,\sum_{i=1}^n (\hat y_i - \bar{\hat y})^2}. $$
When the regression contains an intercept, $\sum_{i=1}^n \hat u_i = 0$, so that $\bar{\hat y} = \bar y$ and
$$ \sum_{i=1}^n (y_i - \bar y)(\hat y_i - \bar y) = \sum_{i=1}^n (\hat y_i - \bar y)^2, $$
because
$$ \sum_{i=1}^n \hat u_i(\hat y_i - \bar y) = 0. $$
It follows that the total sum of squares decomposes as
$$ \sum_{i=1}^n (y_i - \bar y)^2 = \sum_{i=1}^n (y_i - \hat y_i)^2 + \sum_{i=1}^n (\hat y_i - \bar y)^2, $$
so the definitions of $R^2$ coincide in this case.
The $R^2$ never decreases when a regressor is added, so it cannot be used to compare models of different size. The adjusted $R^2$ penalizes the number of parameters:
$$ \bar R^2 = 1 - \frac{n-1}{n-K}(1 - R^2). $$
It follows that
$$ 1 - \bar R^2 = \frac{n-1}{n-K}(1 - R^2). $$
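A minimal sketch of these measures in Python/NumPy (the function names are illustrative, not from any particular library):

```python
import numpy as np

def r_squared(y, y_hat, K):
    """R^2 and adjusted R^2 for a fitted regression with K parameters."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    r2 = 1.0 - rss / tss
    r2_bar = 1.0 - (n - 1) / (n - K) * (1.0 - r2)   # penalizes extra parameters
    return r2, r2_bar
```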
1.5 Functional Form

Linearity can often be restrictive. We shall now consider how to generalize slightly the use of the linear model, so as to allow certain types of nonlinearity, but without fundamentally altering the applicability of the analytical results we have built up. Some examples:
$$ Wages = \alpha + \beta\,ed + \gamma\,UNION + u $$
$$ Wages = \alpha + \beta\,ed + \gamma\,ab + \delta\,ed\cdot ab + u $$
$$ Wages = \alpha + \beta\,ed + \gamma\,ex + \delta\,ex^2 + \rho\,ex^3 + u $$
$$ \log Wages = \alpha + \beta\,ed + \gamma\,UNION + u $$
$$ \log\frac{fs}{1 - fs} = \alpha + \beta\,inc + u. $$
All of these are linear in the parameters, and so can be estimated by OLS after redefining the regressors.
Piecewise linearity. Suppose that
$$
y = \begin{cases}
\alpha_1 + \beta_1 x + u & \text{if } x \leq t_1 \\
\alpha_2 + \beta_2 x + u & \text{if } t_1 \leq x \leq t_2 \\
\alpha_3 + \beta_3 x + u & \text{if } x \geq t_2.
\end{cases}
$$
This can be written as a single equation
$$ y = \alpha_1 + \beta_1 x + \gamma_1 D_1 + \delta_1 D_1 x + \gamma_2 D_2 + \delta_2 D_2 x + u, $$
where
$$ D_1 = \begin{cases} 1 & \text{if } x \geq t_1 \\ 0 & \text{else} \end{cases}, \qquad D_2 = \begin{cases} 1 & \text{if } x \geq t_2 \\ 0 & \text{else.} \end{cases} $$
Box-Cox transformation. Consider
$$ y = \alpha + \beta\,\frac{x^\lambda - 1}{\lambda} + u, $$
where
$$ \frac{x^\lambda - 1}{\lambda} \to \ln(x) \text{ as } \lambda \to 0, \qquad \frac{x^\lambda - 1}{\lambda} = x - 1 \text{ when } \lambda = 1. $$
For given $\lambda$ this is a linear regression. Transformed regressors such as $1/x^2$ can be handled the same way, e.g.,
$$ y = \alpha + \frac{\beta_2}{x^2} + u. $$
Some models, however, are intrinsically nonlinear in the parameters, for example the CES production function
$$ Q = \beta_1\left[ \beta_2 K^{-\beta_3} + (1 - \beta_2)L^{-\beta_3} \right]^{-\beta_4/\beta_3} + u. $$
Chapter 2

Statistical Properties of the OLS Estimator

We now investigate the statistical properties of the OLS estimator in both the fixed and random designs. Specifically, we calculate its exact mean and variance. We shall examine later what happens when the sample size increases.
The first thing to note in connection with $\hat\beta$ is that it is linear in $y$, i.e., there exists a matrix $C$ not depending on $y$ such that
$$ \hat\beta = (X'X)^{-1}X'y = Cy. $$
In the fixed design,
$$ E(\hat\beta) = \beta + (X'X)^{-1}X'E(\varepsilon) = \beta, $$
where this equality holds for all $\beta$. We say that $\hat\beta$ is unbiased.
Moreover,
$$ \operatorname{var}(\hat\beta) = \operatorname{var}((X'X)^{-1}X'y) = (X'X)^{-1}X'\operatorname{var}(y)X(X'X)^{-1} = E\{(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}\} $$
$$ = (X'X)^{-1}X'\,\sigma^2 I\,X(X'X)^{-1} = \sigma^2(X'X)^{-1}. $$
Random Design. For this result we need $E[\varepsilon_i|x_i] = 0$. We first condition on the matrix $X$; this results in a fixed design and the above results hold. Thus, if we are after conditional results, we can stop here. If we want to calculate the unconditional mean and variance we must now average over all possible $X$ designs. Thus
$$ E(\hat\beta) = E\{E(\hat\beta|X)\} = E(\beta) = \beta. $$
On average we get the true parameter $\beta$. Note that this calculation uses the important property called the Law of Iterated Expectation. The most general version of this says that
$$ E(Y|I_1) = E[E(Y|I_2)|I_1], $$
whenever $I_1 \subseteq I_2$ for two information sets $I_1, I_2$.
Note that if only $E[x_i\varepsilon_i] = 0$, then the above calculation may not be valid. For example, suppose that $Y_i = X_i^3$, where $X_i$ is i.i.d. standard normal. Then $\beta = 3$ minimizes $E[(Y_i - bX_i)^2]$. Now consider the least squares estimator
$$ \hat\beta = \frac{\sum_{i=1}^n X_iY_i}{\sum_{i=1}^n X_i^2} = \frac{\sum_{i=1}^n X_i^4}{\sum_{i=1}^n X_i^2}. $$
You can't show that this is unbiased, and indeed it isn't.
The unconditional variance follows from the conditional variance formula, which is established by repeated application of the law of iterated expectation. We now obtain
$$ \operatorname{var}(\hat\beta) = E\operatorname{var}(\hat\beta|X) = \sigma^2 E\{(X'X)^{-1}\}. $$
This is not quite the same answer as in the fixed design case, and the interpretation is of course different.
The properties of an individual coefficient can be obtained from the partitioned regression formula
$$ \hat\beta_1 = (X_1'M_2X_1)^{-1}X_1'M_2y, $$
whence
$$ \operatorname{var}[\hat\beta_1] = (X_1'M_2X_1)^{-1}X_1'M_2E[\varepsilon\varepsilon']M_2X_1(X_1'M_2X_1)^{-1} = \sigma^2(X_1'M_2X_1)^{-1}. $$
In the single regressor plus intercept regression this gives
$$ \operatorname{var}(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}, $$
which is the well known variance of the least squares estimator in that case.
We now turn to the distribution of $\hat\beta$. This will be important when we want to conduct hypothesis tests and construct confidence intervals. In order to get the exact distribution we will need to make an additional assumption:

(A4) $y \sim N(X\beta, \sigma^2 I)$, or equivalently $\varepsilon \sim N(0, \sigma^2 I)$.

Under A4,
$$ \hat\beta \sim N(\beta, \sigma^2(X'X)^{-1}), $$
since
$$ \hat\beta = (X'X)^{-1}X'y = \sum_{i=1}^n c_iy_i $$
is a linear combination of normals. In the random design, the marginal distribution of $\hat\beta$ is a scale mixture of normals; with a single regressor,
$$ f_{\hat\beta}(z) = \int \frac{1}{v}\phi\left( \frac{z - \beta}{v} \right)g(v)\,dv, $$
where $g$ is the density of $\sigma(\sum_{i=1}^n x_i^2)^{-1/2}$ and $\phi$ is the standard normal density function.
2.1 Optimality

There are other estimation criteria; for example, least absolute deviations minimizes
$$ \sum_{i=1}^n |y_i - x_i'b|. $$
In fact, $\hat\beta$, $\tilde\beta$, and $\bar\beta$ are all linear unbiased. How do we choose between estimators? Computational convenience is an important issue, but the above estimators are all similar in their computational requirements. We now investigate statistical optimality.
Definition: The mean squared error (hereafter MSE) matrix of a generic estimator $\hat\theta$ of a parameter $\theta \in \mathbb{R}^p$ is
$$
E[(\hat\theta - \theta)(\hat\theta - \theta)']
= E[(\hat\theta - E(\hat\theta) + E(\hat\theta) - \theta)(\hat\theta - E(\hat\theta) + E(\hat\theta) - \theta)']
$$
$$
= \underbrace{E[(\hat\theta - E(\hat\theta))(\hat\theta - E(\hat\theta))']}_{\text{variance}}
+ \underbrace{[E(\hat\theta) - \theta][E(\hat\theta) - \theta]'}_{\text{squared bias}}.
$$
For example, suppose that two estimators have MSE matrices $I_2$ and
$$ B = \begin{pmatrix} 2 & 0 \\ 0 & 1/4 \end{pmatrix}. $$
In this case, we cannot rank the estimators. The problem is due to the multivariate nature of the optimality criterion.
One solution is to take a scalar function of MSE such as the trace or determinant, which will result in a complete ordering. However, different functions will rank estimators differently [see the example above].
Also note that no estimator can dominate uniformly across $\theta$ according to MSE, because it would have to beat all constant estimators, which have zero MSE at a single point. This is impossible unless there is no randomness.
One solution is to change the criterion function. For example, we might take $\max_\theta \operatorname{tr}(MSE)$, which takes the most pessimistic view. In this case, we might try and find the estimator that minimizes this criterion - this would be called a minimax estimator. The theory for this class of estimators is very complicated, and in any case it is not such a desirable criterion because it is so pessimistic about nature trying to do the worst to us.
Gauss-Markov Theorem. Let $\tilde\beta = Ay$ be any linear estimator with $E(\tilde\beta) = \beta$ for all $\beta$ (which requires $AX = I$). Then
$$ \operatorname{var}(\tilde\beta) \geq \operatorname{var}(\hat\beta). $$
Proof. $\operatorname{var}(\hat\beta) = \sigma^2(X'X)^{-1}$ and $\operatorname{var}(\tilde\beta) = \sigma^2 AA'$, so
$$
\operatorname{var}(\tilde\beta) - \operatorname{var}(\hat\beta)
= \sigma^2[AA' - (X'X)^{-1}]
= \sigma^2[AA' - AX(X'X)^{-1}X'A']
$$
$$
= \sigma^2 A[I - X(X'X)^{-1}X']A'
= \sigma^2 AMA'
= \sigma^2(AM)(M'A')
\geq 0,
$$
where $M = I - X(X'X)^{-1}X'$ is symmetric and idempotent.
By making the stronger assumption A4, we get a much stronger conclusion. This allows us to compare, say, LAD estimation with OLS. Asymptotically, a very large class of estimators are both unbiased and indeed linear, so that the Gauss-Markov and Cramér-Rao Theorems apply to a very broad class of estimators when the words "for large n" are inserted.
Chapter 3

Hypothesis Testing

In addition to point estimation we often want to know how good our estimator is and whether it is compatible with certain preconceived hypotheses about the data.
Suppose that we observe certain data $(y, X)$, and there is a true data distribution denoted by $f$, which is known to lie in a family of models $\mathcal{F}$. We now suppose there is a further reduction called a hypothesis $H_0 \subseteq \mathcal{F}$. For example, $H_0$ could be the:

- Prediction of a scientific theory. For example, the interest elasticity of demand for money is zero; the gravitational constant is 9.8.
- Absence of some structure, e.g., independence of the error term over time, homoskedasticity, etc.
- Pretesting (used as part of a model building process).

We distinguish between a Simple hypothesis (under $H_0$, the data distribution is completely specified) and a Composite hypothesis (in which case, $H_0$ does not completely determine the distribution, i.e., there are nuisance parameters not specified by $H_0$).
We also distinguish between Single and Multiple hypotheses (one or more restrictions on the parameters of $f$).
We shall also introduce the alternative hypothesis $H_A$, which will be the complement of $H_0$ in $\mathcal{F}$, i.e., $\mathcal{F} = H_0 \cup H_A$. That is, the choice of $\mathcal{F}$ determines the alternative. For example, structural change at a known date:
$$ \beta = \begin{cases} \beta_0, & t < 1974 \\ \beta_1, & t \geq 1974. \end{cases} $$
3.1 General Notations

3.2 Examples

We consider linear restrictions of the form $R\beta = r$, where $R$ is a $q \times K$ matrix with $q \leq K$.
3.3 The t Test

Recall that
$$ \hat\sigma^2_{mle} = \frac{\hat\varepsilon'\hat\varepsilon}{n}, \qquad s^2 = \frac{\hat\varepsilon'\hat\varepsilon}{n-K} = \frac{n}{n-K}\,\hat\sigma^2_{mle}, $$
and that
$$ \frac{(n-K)s^2}{\sigma^2} \sim \chi^2_{n-K}. $$
For the single restriction $H_0: c'\beta = \gamma$, the test statistic is
$$ T = \frac{c'\hat\beta - \gamma}{s\,(c'(X'X)^{-1}c)^{1/2}} \sim t(n-K). $$
To see where the chi-squared result comes from, note that if the $\varepsilon_i$ were observed,
$$ \sum_{i=1}^n \frac{\varepsilon_i^2}{\sigma^2} \sim \chi^2_n. $$
But $\hat\varepsilon$ are residuals that use $K$ parameter estimates. Furthermore,
$$ \hat\varepsilon = M_X\varepsilon \quad \text{and} \quad \hat\varepsilon'\hat\varepsilon = \varepsilon'M_X\varepsilon, $$
where $M_X = I - X(X'X)^{-1}X'$. Then
$$
E[\varepsilon'M_X\varepsilon] = E[\operatorname{tr} M_X\varepsilon\varepsilon']
= \operatorname{tr} M_XE(\varepsilon\varepsilon')
= \sigma^2\operatorname{tr} M_X
= \sigma^2(n - \operatorname{tr} P_X)
= \sigma^2(n - K),
$$
because
$$ \operatorname{tr}(X(X'X)^{-1}X') = \operatorname{tr}\left( X'X(X'X)^{-1} \right) = \operatorname{tr} I_K = K. $$
These calculations show that
$$ E\hat\varepsilon'\hat\varepsilon = \sigma^2(n - K). $$
Moreover, $M_X\varepsilon$ has exactly the same normal distribution as a vector built from $n-K$ i.i.d. normals [it has the same mean and variance, which is sufficient to determine the normal distribution]. Therefore,
$$ \frac{\hat\varepsilon'\hat\varepsilon}{\sigma^2} = \sum_{i=1}^{n-K} z_i^2 $$
for some i.i.d. standard normal random variables $z_i$. Therefore, $\hat\varepsilon'\hat\varepsilon/\sigma^2$ is $\chi^2_{n-K}$ by the definition of a chi-squared random variable.
Furthermore, under $H_0$,
$$ c'\hat\beta - \gamma = c'(X'X)^{-1}X'\varepsilon \quad \text{and} \quad \hat\varepsilon = M_X\varepsilon $$
are jointly normal and uncorrelated, hence independent. This justifies
$$ T = \frac{c'\hat\beta - \gamma}{s\,(c'(X'X)^{-1}c)^{1/2}}, $$
using the $t_{n-K}$ distribution for an exact test under normality. One can test either one-sided or two-sided alternatives, i.e., reject if
$$ |T| \geq t_{n-K}(\alpha/2) $$
[two-sided alternative] or if
$$ T \geq t_{n-K}(\alpha) $$
[one-sided alternative].
The above is a general rule, and would require some additional computations in addition to $\hat\beta$. Sometimes one can avoid this: the computer automatically prints out results of the hypothesis $\beta_i = 0$, and one can redesign the null regression suitably. For example, suppose that
$$ H_0: \beta_2 + \beta_3 = 1. $$
Substituting $\beta_3 = 1 - \beta_2$ into the regression turns the restriction into a single zero restriction on a reparameterized regression.
3.4 The F Test

For the multiple hypothesis $H_0: R\beta = r$, with $R$ a $q \times K$ matrix of rank $q$, the test statistic is
$$
F = (R\hat\beta - r)'\left[ s^2R(X'X)^{-1}R' \right]^{-1}(R\hat\beta - r)/q
= \frac{(R\hat\beta - r)'\left[ \sigma^2R(X'X)^{-1}R' \right]^{-1}(R\hat\beta - r)/q}{\left[ (n-K)s^2/\sigma^2 \right]/(n-K)},
$$
which is distributed as
$$ \frac{\chi^2_q/q}{\chi^2_{n-K}/(n-K)} \sim F(q, n-K), $$
the numerator and denominator being independent.
Example: all coefficients except the intercept are zero, so that
$$ R = \begin{pmatrix} 0 & \vdots & I_{K-1} \end{pmatrix}, \qquad q = K - 1. $$
Example: structural change. The null regression is
$$ y = X\beta + u. $$
The alternative is
$$ y_1 = X_1\beta_1 + u_1, \quad i \leq n_1, \qquad y_2 = X_2\beta_2 + u_2, \quad i \leq n_2, $$
where $n = n_1 + n_2$. Let
$$ y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}, \quad X = \begin{pmatrix} X_1 & 0 \\ 0 & X_2 \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}_{2K \times 1}, \quad u = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}. $$
Then, we can write the alternative regression as
$$ y = X\beta + u. $$
Consider the null hypothesis $H_0: \beta_1 = \beta_2$. Let
$$ R_{K \times 2K} = [I_K\ \vdots\ -I_K]. $$
Compare the statistic with $F(K, n - 2K)$.
A confidence interval is just a critical region centred not at $H_0$, but at a function of the parameter estimates. For example,
$$ c'\hat\beta \pm t_{\alpha/2}(n-K)\,s\,\{c'(X'X)^{-1}c\}^{1/2}. $$
3.5 Restricted Least Squares

Unrestricted regression: $\hat\beta$, $\hat u = y - X\hat\beta$, $Q = \hat u'\hat u$.
Restricted regression: $\beta^*$, $u^* = y - X\beta^*$, $Q^* = u^{*\prime}u^*$.
The idea is that under $H_0$, $Q^* \approx Q$, but under the alternative the two quantities differ. The following theorem makes this more precise.
Theorem. Under $H_0$,
$$ \frac{Q^* - Q}{Q}\cdot\frac{n-K}{q} = F \sim F(q, n-K). $$
The restricted estimator solves the Lagrangean first order conditions
$$ X'(y - X\beta^*) = R'\lambda \quad \text{and} \quad R\beta^* = r. $$
Therefore,
$$ \beta^* = \hat\beta - (X'X)^{-1}R'\lambda \quad \text{and} \quad R\hat\beta - R\beta^* = R\hat\beta - r = R(X'X)^{-1}R'\lambda, $$
so that
$$ \lambda = \left[ R(X'X)^{-1}R' \right]^{-1}(R\hat\beta - r) $$
and
$$ \beta^* = \hat\beta - (X'X)^{-1}R'\left[ R(X'X)^{-1}R' \right]^{-1}(R\hat\beta - r). $$
This gives the restricted least squares estimator in terms of the restrictions and the unrestricted least squares estimator. From this relation we can derive the statistical properties of the estimator $\beta^*$.
Furthermore,
$$
(y - X\beta^*)'(y - X\beta^*)
= [y - X\hat\beta - X(\beta^* - \hat\beta)]'[y - X\hat\beta - X(\beta^* - \hat\beta)]
$$
$$
= (y - X\hat\beta)'(y - X\hat\beta) + (\hat\beta - \beta^*)'X'X(\hat\beta - \beta^*) - 2(y - X\hat\beta)'X(\beta^* - \hat\beta)
$$
$$
= \hat u'\hat u + (\hat\beta - \beta^*)'X'X(\hat\beta - \beta^*),
$$
using the orthogonality property of the unrestricted least squares estimator. Therefore,
$$ Q^* - Q = (\hat\beta - \beta^*)'X'X(\hat\beta - \beta^*). $$
An intermediate representation is
$$ Q^* - Q = \lambda'R(X'X)^{-1}R'\lambda. $$
This brings out the role of the Lagrange multipliers in defining the test statistic, and led to the use of this name.
Importance of the result: this version was easier to apply in the old days, before fast computers, because one can just do two separate regressions and use the sums of squared residuals (a small numerical sketch follows at the end of this section). Special cases:
$$ y_1 = X_1\beta_1 + u_1, \quad y_2 = X_2\beta_2 + u_2 \qquad \text{or} \qquad y = \begin{pmatrix} X_1 & 0 \\ 0 & X_2 \end{pmatrix}\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} + u. $$
The null is of no structural change, i.e.,
$$ H_0: \beta_1 = \beta_2, $$
with
$$ R = (I\ \vdots\ -I). $$
More generally, partition $R = (R_1, R_2)$, where $R_1$ is $q \times (K-q)$ and $R_2$ is $q \times q$ and nonsingular, and write $X\beta = X_1\beta_1 + X_2\beta_2$. Then the restriction
$$ R_1\beta_1 + R_2\beta_2 = r $$
can be solved as
$$ \beta_2 = R_2^{-1}(r - R_1\beta_1), $$
so that $\beta_2$ can be substituted out and the restricted regression run on $\beta_1$ alone. We then define
$$ u^* = y - X_1\beta_1^* - X_2\beta_2^* \quad \text{and} \quad Q^* $$
accordingly.
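To make the two-regression recipe concrete, here is a small Python/NumPy sketch on simulated data; the restriction imposed is the zero restriction $\beta_2 = 0$, a choice made purely for illustration:

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from an OLS fit of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    u = y - X @ beta
    return u @ u

rng = np.random.default_rng(1)
n, q = 200, 2
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])  # restricted regressors
X2 = rng.normal(size=(n, q))                            # regressors set to zero under H0
y = X1 @ np.array([1.0, 0.5]) + rng.normal(size=n)      # H0 is true here

Q_star = rss(X1, y)                  # restricted RSS
Q = rss(np.hstack([X1, X2]), y)      # unrestricted RSS
K = X1.shape[1] + q
F = (Q_star - Q) / Q * (n - K) / q   # F ~ F(q, n - K) under H0
print(F)
```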
3.6 Tests of Structural Change

Consider the two-period model
$$ \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} i_1 & 0 & x_1 & 0 \\ 0 & i_2 & 0 & x_2 \end{pmatrix}\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \beta_1 \\ \beta_2 \end{pmatrix} + \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}, $$
and let $\theta = (\alpha_1, \alpha_2, \beta_1, \beta_2)$. Different slopes and intercepts are allowed.
The first null hypothesis is that the slopes are the same, i.e.,
$$ H_0: \beta_1 = \beta_2 = \beta. $$
The restricted regression is
$$ \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} i_1 & 0 & x_1 \\ 0 & i_2 & x_2 \end{pmatrix}\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \beta \end{pmatrix} + \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}. $$
Then
$$ F = \frac{(u^{*\prime}u^* - \hat u'\hat u)/\dim(\beta_1)}{\hat u'\hat u/(n - \dim(\theta))} $$
has an
$$ F(\dim(\beta_1),\ n - \dim(\theta)) $$
distribution.
The second null hypothesis is that the intercepts are the same, i.e.,
$$ H_0: \alpha_1 = \alpha_2 = \alpha. $$
The restricted regression (on parameters $(\alpha, \beta_1, \beta_2)$) is
$$ \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} i_1 & x_1 & 0 \\ i_2 & 0 & x_2 \end{pmatrix}\begin{pmatrix} \alpha \\ \beta_1 \\ \beta_2 \end{pmatrix} + \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}. $$
Note that the unrestricted model can be rewritten using dummy variables:
$$ y_i = \alpha + \beta x_i + \gamma D_i + \delta x_iD_i + u_i, $$
where
$$ D_i = \begin{cases} 1 & \text{in period 2} \\ 0 & \text{else.} \end{cases} $$
Then, in period 1,
$$ y_i = \alpha + \beta x_i + u_i, $$
while in period 2,
$$ y_i = \alpha + \gamma + (\beta + \delta)x_i + u_i. $$
In this parameterization the second null hypothesis is that $\gamma = 0$.
3.7 Likelihood-Based Tests: Wald, LM, and LR

Let $L(\theta)$ denote the likelihood, let $H(\theta) = \partial^2\log L/\partial\theta\partial\theta'$ be the Hessian, let $\hat\theta$ be the unrestricted and $\theta^*$ the restricted maximum likelihood estimator. The classical statistics for $H_0: R\theta = r$ take the forms
$$ \text{Wald}: \quad (R\hat\theta - r)'\left\{ R\left[ -H(\hat\theta) \right]^{-1}R' \right\}^{-1}(R\hat\theta - r), $$
$$ \text{LM}: \quad \frac{\partial\log L}{\partial\theta}(\theta^*)'\left[ -H(\theta^*) \right]^{-1}\frac{\partial\log L}{\partial\theta}(\theta^*). $$
In the normal linear regression model, with $u(\beta) = y - X\beta$,
$$ \log L(\beta, \sigma^2) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}u(\beta)'u(\beta), $$
and
$$ \frac{\partial\log L}{\partial\beta} = \frac{1}{\sigma^2}X'u(\beta) $$
$$ \frac{\partial\log L}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}u(\beta)'u(\beta) $$
$$ \frac{\partial^2\log L}{\partial\beta\partial\beta'} = -\frac{1}{\sigma^2}X'X $$
$$ \frac{\partial^2\log L}{\partial(\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}u(\beta)'u(\beta) $$
$$ \frac{\partial^2\log L}{\partial\beta\partial\sigma^2} = -\frac{1}{\sigma^4}X'u(\beta). $$
In the regression context, the Wald statistic for $H_0: R\beta = r$ becomes
$$ W = (R\hat\beta - r)'\left[ \hat\sigma^2R(X'X)^{-1}R' \right]^{-1}(R\hat\beta - r) = n\,\frac{Q^* - Q}{Q}, $$
where
$$ \hat\sigma^2 = Q/n $$
is the MLE of $\sigma^2$.
The Wald test statistic is the same as the $F$ test apart from the use of $\hat\sigma^2$ instead of $s^2$ and a multiplicative factor $q$. In fact,
$$ W = qF\,\frac{n}{n-k}. $$
The LM (score) statistic is
$$ LM = \frac{u^{*\prime}X}{\sigma^*}\left( X'X \right)^{-1}\frac{X'u^*}{\sigma^*} = \frac{u^{*\prime}X(X'X)^{-1}X'u^*}{\sigma^{*2}}, $$
where
$$ \sigma^{*2} = Q^*/n. $$
Recall that
$$ X'u^* = R'\lambda. $$
Therefore,
$$ LM = \frac{\lambda'R(X'X)^{-1}R'\lambda}{\sigma^{*2}}, $$
where $\lambda$ is the vector of Lagrange multipliers evaluated at the optimum.
Furthermore, we can write the score test as
$$ LM = n\,\frac{Q^* - Q}{Q^*} = n\left( 1 - \frac{Q}{Q^*} \right). $$
When the restrictions are the standard zero ones, the test statistic is $n$ times the $R^2$ from the unrestricted regression.
For the LR statistic, note that
$$ \log L(\hat\beta, \hat\sigma^2) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\hat\sigma^2 - \frac{n}{2\hat\sigma^2}\,\frac{\hat u'\hat u}{n} = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\hat\sigma^2 - \frac{n}{2}, $$
and
$$ \log L(\beta^*, \sigma^{*2}) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^{*2} - \frac{n}{2}, $$
where
$$ \hat\sigma^2 = \hat u'\hat u/n \quad \text{and} \quad \sigma^{*2} = u^{*\prime}u^*/n. $$
Therefore,
$$ LR = 2\log\frac{L(\hat\beta, \hat\sigma^2)}{L(\beta^*, \sigma^{*2})} = n\log\frac{Q^*}{n} - n\log\frac{Q}{n} = n[\log Q^* - \log Q]. $$
Note that $W$, $LM$, and $LR$ are all monotonic functions of $F$; in fact,
$$ W = qF\,\frac{n}{n-k}, \qquad LM = \frac{W}{1 + W/n}, \qquad LR = n\log\left( 1 + \frac{W}{n} \right). $$
It follows that $LM \leq LR \leq W$, so the three tests can give conflicting answers at a common critical value.
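A quick numerical check of the monotonic relations above, as a Python sketch (the inputs are arbitrary illustrative values):

```python
import numpy as np

def trinity(Q_star, Q, n, q, k):
    """Wald, LM, and LR statistics from restricted/unrestricted RSS."""
    F = (Q_star - Q) / Q * (n - k) / q
    W = q * F * n / (n - k)          # = n (Q* - Q) / Q
    LM = W / (1 + W / n)             # = n (Q* - Q) / Q*
    LR = n * np.log(1 + W / n)       # = n log(Q* / Q)
    return W, LM, LR

# With Q* = 110, Q = 100, n = 100, q = 2, k = 5: observe LM <= LR <= W
print(trinity(110.0, 100.0, 100, 2, 5))
```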
Chapter 4

Further Topics in Estimation

4.1 Omitted Variables

Suppose that the true model is
$$ y = X_1\beta_1 + X_2\beta_2 + u, $$
but we regress $y$ on $X_1$ only. Then
$$
\hat\beta_1 = (X_1'X_1)^{-1}X_1'y
= (X_1'X_1)^{-1}X_1'(X_1\beta_1 + X_2\beta_2 + u)
= \beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2 + (X_1'X_1)^{-1}X_1'u,
$$
so that
$$ E(\hat\beta_1) = \beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2 = \beta_1 + \Delta_{12}, $$
where
$$ \Delta_{12} = (X_1'X_1)^{-1}X_1'X_2\beta_2. $$
Example. Regressing wages on education gets a positive effect, but we are omitting ability. If ability has a positive effect on wages and is positively correlated with education, this would explain some of the positive effect. Similarly for regressions of wages on race/gender (discrimination) that omit experience/education.
The estimated variance is based on
$$ s^2(X_1'X_1)^{-1}, $$
where
$$ s^2 = \frac{y'M_1y}{n - K_1} = \frac{(X_2\beta_2 + u)'M_1(X_2\beta_2 + u)}{n - K_1}, $$
whose expectation is at least $\sigma^2$, since $M_1$ is a positive semi-definite matrix.
Therefore, the estimated variance of $\hat\beta_1$ is upwardly biased.
4.2 Inclusion of Irrelevant Variables

Suppose now that the true model is
$$ y = X_1\beta_1 + u, $$
but we regress $y$ on $(X_1, X_2)$. By partitioned regression,
$$ \hat\beta_1 = (X_1'M_2X_1)^{-1}X_1'M_2y = \beta_1 + (X_1'M_2X_1)^{-1}X_1'M_2u. $$
Therefore,
$$ E(\hat\beta_1) = \beta_1 \text{ for all } \beta_1, \qquad \operatorname{var}(\hat\beta_1) = \sigma^2(X_1'M_2X_1)^{-1}. $$
There is no bias, but generally a loss of precision relative to the correct regression.
4.3 Model Selection

Suppose we compare models indexed by $j$, with $K_j$ parameters and residual vector $\hat u_j$. Common criteria include:
$$ \bar R_j^2 = 1 - \frac{n-1}{n-K_j}(1 - R_j^2) = 1 - \frac{n-1}{n-K_j}\,\frac{\hat u_j'\hat u_j}{\sum_i (y_i - \bar y)^2}, $$
$$ PC_j = \frac{\hat u_j'\hat u_j}{n - K_j}\left( 1 + \frac{K_j}{n} \right), $$
$$ AIC_j = \ln\frac{\hat u_j'\hat u_j}{n} + \frac{2K_j}{n}, $$
$$ BIC_j = \ln\frac{\hat u_j'\hat u_j}{n} + \frac{K_j\log n}{n}. $$
The first criterion should be maximized, while the others should be minimized. Note that maximizing $\bar R_j^2$ is equivalent to minimizing the unbiased variance estimate $\hat u_j'\hat u_j/(n - K_j)$.
It has been shown that all these methods have the property that the selected model is larger than or equal to the true model with probability tending to one; only $BIC_j$ correctly selects the true model with probability tending to one.
With $M$ candidate regressors there are of the order of $2^M - 1$ possible subset regressions, so when $M$ is large, computing them all is infeasible.
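A minimal sketch of the information criteria in Python/NumPy, as they would be computed for any one candidate model (the function name is illustrative):

```python
import numpy as np

def ic(X, y):
    """AIC and BIC for an OLS fit, as defined above."""
    n, K = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    u = y - X @ beta
    s2 = (u @ u) / n                     # u'u / n
    aic = np.log(s2) + 2 * K / n
    bic = np.log(s2) + K * np.log(n) / n
    return aic, bic

# Comparing a small and a large nested model: choose the one with smaller criterion
rng = np.random.default_rng(7)
n = 300
X_small = np.column_stack([np.ones(n), rng.normal(size=n)])
X_big = np.column_stack([X_small, rng.normal(size=(n, 3))])  # 3 irrelevant regressors
y = X_small @ np.array([1.0, 2.0]) + rng.normal(size=n)
print(ic(X_small, y), ic(X_big, y))
```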
4.4 Multicollinearity

Exact multicollinearity arises when $\operatorname{rank}(X) < K$, so that $X'X$ is singular and $\hat\beta$ is not uniquely defined.
Solution: Find a minimal (not unique) basis $X^*$ for $C(X)$ and do least squares on that.
Example: the dummy variable trap. Define the seasonal dummies
$$ Q_{jt} = \begin{cases} 1 & \text{if Quarter } j \\ 0 & \text{else,} \end{cases} \qquad j = 1, 2, 3, 4. $$
If we include an intercept and all four dummies,
$$ X = \begin{pmatrix}
1 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 1 \\
1 & 1 & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots
\end{pmatrix}, $$
the first column is the sum of the last four, so $X$ is not of full rank. Dropping either the intercept or one of the dummies restores full rank.
4.5 Influential Observations

At times one can suspect that some observations are having a large impact on the regression results. This could be a real influence, i.e., just part of the way the data were generated, or it could be because some observations have been misrecorded, say with an extra zero added on by a careless clerk.
How do we detect influential observations? Delete one observation at a time and see what changes. Define the leave-one-out estimator and residual
$$ \hat\beta(i) = [X(i)'X(i)]^{-1}X(i)'y(i), \qquad \hat u_j(i) = y_j - x_j'\hat\beta(i), $$
where $y(i)$ denotes the $y$ vector with the $i$th observation deleted, and similarly for $X(i)$. We shall say that observation $(X_i, y_i)$ is influential if $\hat u_i(i)$ is large.
Note that
$$ \hat u_i(i) = u_i - x_i'(\hat\beta(i) - \beta), $$
so that
$$ E[\hat u_i(i)] = 0, \qquad \operatorname{var}[\hat u_i(i)] = \sigma^2\left[ 1 + x_i'(X(i)'X(i))^{-1}x_i \right]. $$
Then examine the standardized residuals
$$ T_i = \frac{\hat u_i(i)}{s\,(1 + x_i'(X(i)'X(i))^{-1}x_i)^{1/2}}. $$
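A direct (brute-force) Python/NumPy sketch of these standardized leave-one-out residuals; in practice one would use the hat-matrix shortcut to avoid refitting $n$ times, but the loop below matches the definitions exactly:

```python
import numpy as np

def loo_standardized_residuals(X, y):
    """Standardized leave-one-out residuals T_i, by direct deletion."""
    n, K = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    s2 = np.sum((y - X @ beta) ** 2) / (n - K)      # s^2 from the full fit
    T = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        Xi, yi = X[mask], y[mask]
        beta_i = np.linalg.lstsq(Xi, yi, rcond=None)[0]
        u_ii = y[i] - X[i] @ beta_i                  # leave-one-out residual
        v = 1.0 + X[i] @ np.linalg.solve(Xi.T @ Xi, X[i])
        T[i] = u_ii / np.sqrt(s2 * v)
    return T
```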
4.6 Missing Observations

In surveys, responders are not a representative sample of the full population. For example, we may have no information on people with $y > \$250{,}000$ or $y \leq \$5{,}000$. In this case, $n^{-1}\sum_{i=1}^n y_i$ is biased and inconsistent as an estimate of the population mean.
In regression, parameter estimates are biased if selection is:

- On the dependent variable (or on the error term);
- Non-random. For example, there is no bias [although precision is affected] if, in a regression of income on education, we have missing data when $edu \leq 5$ years.

We look at the ignorable case, where the process of missingness is unrelated to the effect of interest.
Missing $y$. Suppose we observe
$$ \begin{matrix} y_A & X_A & (n_A \text{ observations}) \\ ? & X_B & (n_B \text{ observations}). \end{matrix} $$
What do we do? One solution is to impute values of the missing variable. In this case, we might let
$$ \hat y_B = X_B\hat\beta_A, \quad \text{where} \quad \hat\beta_A = (X_A'X_A)^{-1}X_A'y_A. $$
We can then recompute the least squares estimate of $\beta$ using all the data:
$$ \hat\beta = (X'X)^{-1}X'\begin{pmatrix} y_A \\ \hat y_B \end{pmatrix}, \qquad X = \begin{pmatrix} X_A \\ X_B \end{pmatrix}. $$
However, some simple algebra reveals that there is no new information in $\hat y_B$, and in fact
$$ \hat\beta = \hat\beta_A. $$
To see this, note that
$$ X'\begin{pmatrix} y_A \\ X_B\hat\beta_A \end{pmatrix} = X_A'y_A + X_B'X_B\hat\beta_A, \qquad X'X = X_A'X_A + X_B'X_B. $$
We have
$$ X'X(X_A'X_A)^{-1}X_A'y_A = X_A'y_A + (X_B'X_B)(X_A'X_A)^{-1}X_A'y_A, $$
which is exactly the right-hand side above, so $\hat\beta = (X_A'X_A)^{-1}X_A'y_A$. Therefore, this imputation method has not really added anything. It is not possible to improve estimation in this case.
Now suppose that we have some missing $X$. For example, $X = (x, z)$, and $x_B$ is missing, i.e., we observe $(x_A, z_A, y_A)$ and $(z_B, y_B)$. The model for the complete data set is
$$ y = \beta x + \gamma z + u $$
with $\operatorname{var}(u) = \sigma^2_u$, and suppose also that
$$ x = \delta z + \eta, $$
with $\eta$ being an i.i.d. mean zero error term with $\operatorname{var}(\eta) = \sigma^2_\eta$.
There are a number of ways of trying to use the information in sample B. First, predict $x_B$ by regressing $x_A$ on $z_A$:
$$ \hat x_B = z_B\hat\delta, \qquad \hat\delta = (z_A'z_A)^{-1}z_A'x_A. $$
Then regress $y$ on
$$ \hat X = \begin{pmatrix} x_A & z_A \\ \hat x_B & z_B \end{pmatrix}. $$
In the B sample the implied equation has an error term $e$ that includes $u_B + \beta\eta_B$ plus the estimation error in $\hat\delta$. This regression can be jointly estimated with the A sample equation
$$ y_A - \hat\beta_Ax_A = \gamma z_A + e_A. $$
This sometimes improves matters, but sometimes does not! The answer depends on the relationship between $x$ and $z$. In any case, the effect of $x$ is not better estimated; the effect of $z$ may be improved. Griliches (1986) shows that the (asymptotic) variance of this approach relative to the "just use A" least squares estimator is
$$ (1 - \lambda)\left( 1 + \lambda\,\frac{\beta^2\sigma^2_\eta}{\sigma^2_u} \right), $$
where $\lambda$ is the fraction of the sample that is missing. Efficiency will be improved by this method when
$$ \frac{\beta^2\sigma^2_\eta}{\sigma^2_u} < \frac{1}{1 - \lambda}, $$
i.e., the unpredictable part of $x$ from $z$ is not too large relative to the overall noise in the $y$ equation.
Another approach. Let $\pi = \gamma + \beta\delta$. Then clearly, we can estimate $\pi$ from the B data by OLS (regress $y_B$ on $z_B$), say; call this $\hat\pi_B$. Then let
$$ \hat\beta_B = \frac{\hat\pi_B - \hat\gamma_A}{\hat\delta_A}. $$
Now consider the class of estimators
$$ \hat\beta(\lambda) = \lambda\hat\beta_A + (1 - \lambda)\hat\beta_B. $$
The variance-minimizing choice is
$$ \lambda^* = \frac{\sigma^2_B - \sigma_{AB}}{\sigma^2_A + \sigma^2_B - 2\sigma_{AB}}, $$
where in our case $\sigma^2_A, \sigma^2_B$ are the asymptotic variances of the two estimators and $\sigma_{AB}$ is their asymptotic covariance. Intuitively, unless either $\sigma^2_A = \sigma_{AB}$ or $\sigma^2_B = \sigma_{AB}$, we should be able to improve matters.
What about the likelihood approach? Suppose for convenience that $z$ is a fixed variable; then the log likelihood function of the observed data is
$$ \sum_A \log f(y_A, x_A|z_A) + \sum_B \log f(y_B|z_B). $$
Under normality,
$$ \begin{pmatrix} y_A \\ x_A \end{pmatrix} \sim N\left( \begin{pmatrix} (\gamma + \beta\delta)z_A \\ \delta z_A \end{pmatrix}, \begin{pmatrix} \sigma^2_u + \beta^2\sigma^2_\eta & \beta\sigma^2_\eta \\ \beta\sigma^2_\eta & \sigma^2_\eta \end{pmatrix} \right), \qquad
y_B \sim N\left( (\gamma + \beta\delta)z_B,\ \sigma^2_u + \beta^2\sigma^2_\eta \right). $$
The MLE is going to be quite complicated here because the error variances depend on the mean parameters, but it is going to be more efficient than the simple least squares that only uses the A data.
Chapter 5

Asymptotics

5.1 Types of Asymptotic Convergence

We say that a sequence of random variables $\{X_n\}_{n=1}^\infty$ converges in probability to a random variable $X$, denoted
$$ X_n \xrightarrow{P} X \quad \text{or} \quad \operatorname{plim}_{n\to\infty}X_n = X, $$
if for all $\epsilon > 0$, $\Pr[|X_n - X| > \epsilon] \to 0$ as $n \to \infty$.
We say that a sequence of random variables $\{X_n\}_{n=1}^\infty$ converges almost surely or with probability one to a random variable $X$, denoted
$$ X_n \xrightarrow{a.s.} X, $$
if
$$ \Pr[\lim_{n\to\infty}X_n = X] = 1. $$
We say that $X_n$ converges in distribution to $X$, denoted
$$ X_n \xrightarrow{D} X, $$
if for all (continuity points) $x$,
$$ \lim_{n\to\infty}\Pr[X_n \leq x] = \Pr[X \leq x]. $$
A typical example is $n^{1/2}(\hat\beta - \beta) \xrightarrow{D} N(0, \sigma^2)$.
We say that $X_n$ converges in mean square to $X$, denoted
$$ X_n \xrightarrow{m.s.} X, $$
if
$$ \lim_{n\to\infty}E[|X_n - X|^2] = 0. $$
When $X$ is a constant, convergence in distribution is equivalent to convergence in probability.
5.2 Laws of Large Numbers and Central Limit Theorems

(Weak Law of Large Numbers) Suppose that $X_1, \ldots, X_n$ are i.i.d.; a sufficient condition for
$$ \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{P} E(X_i) $$
is that $E(|X_i|) < \infty$.
(Lindeberg-Levy Central Limit Theorem) Suppose that $X_1, \ldots, X_n$ are i.i.d. with $E(X_i) = \mu$ and $\operatorname{var}(X_i) = \sigma^2 < \infty$. Then
$$ \frac{1}{n^{1/2}}\sum_{i=1}^n \frac{X_i - \mu}{\sigma} \xrightarrow{D} N(0, 1). $$
These results are important because many estimators and test statistics can be reduced to sample averages or functions thereof. There are now many generalizations of these results for data that are not i.i.d., e.g., heterogeneous, dependent weighted sums. We give one example.
(Lindeberg-Feller) Let $X_1, \ldots, X_n$ be independent random variables with $E(X_i) = 0$ and $\operatorname{var}(X_i) = \sigma_i^2$. Suppose also that Lindeberg's condition holds: for all $\epsilon > 0$,
$$ \frac{1}{\sum_{i=1}^n \sigma_i^2}\sum_{i=1}^n E\left[ X_i^2\,1\!\left( X_i^2 > \epsilon^2\sum_{j=1}^n \sigma_j^2 \right) \right] \to 0. $$
Then
$$ \frac{1}{\left( \sum_{i=1}^n \sigma_i^2 \right)^{1/2}}\sum_{i=1}^n X_i \xrightarrow{D} N(0, 1). $$
5.3 Additional Results

Mann-Wald (Continuous Mapping) Theorem. If $X_n \xrightarrow{D} X$ and $g$ is continuous, then
$$ g(X_n) \xrightarrow{D} g(X). $$
If $X_n \xrightarrow{P} \alpha$, then
$$ g(X_n) \xrightarrow{P} g(\alpha). $$
Slutsky's Theorem. If $X_n \xrightarrow{D} X$ and $y_n \xrightarrow{P} c$ for a constant $c$, then
$$ X_n + y_n \xrightarrow{D} X + c \qquad \text{and} \qquad X_ny_n \xrightarrow{D} cX. $$
For random vectors,
$$ \|X_n - X\| \xrightarrow{P} 0, $$
where $\|x\| = (x'x)^{1/2}$ is the Euclidean norm, if and only if each component converges in probability.
5.4 Applications to OLS

We are now able to establish some results about the large sample properties of the least squares estimator. We start with the i.i.d. random design case because the result is very simple.
If we assume that:

- $x_i, \varepsilon_i$ are i.i.d. with $E(x_i\varepsilon_i) = 0$;
- $0 < E[x_ix_i'] < \infty$ and $E[\|x_i\varepsilon_i\|] < \infty$;

then
$$ \hat\beta \xrightarrow{P} \beta. $$
The proof comes from applying laws of large numbers to the numerator and denominator of
$$ \hat\beta - \beta = \left[ \frac{1}{n}\sum_{i=1}^n x_ix_i' \right]^{-1}\frac{1}{n}\sum_{i=1}^n x_i\varepsilon_i. $$
These regularity conditions are often regarded as unnecessary and perhaps strong, and they are unsuited to the fixed design.
We next consider the bare minimum condition that works in the fixed design case and is perhaps more general, since it allows for example trending variables.
Theorem. Suppose that A0-A2 hold and that
$$ \lambda_{\min}(X'X) \to \infty \text{ as } n \to \infty. $$
Then,
$$ \hat\beta \xrightarrow{P} \beta. $$
Proof. First,
$$ E(\hat\beta) = \beta, \qquad \operatorname{var}(\hat\beta) = \sigma^2(X'X)^{-1} $$
for all $n$. Then,
$$ \left\| (X'X)^{-1} \right\| = \lambda_{\max}((X'X)^{-1}) = \frac{1}{\lambda_{\min}(X'X)} \to 0, $$
so $\operatorname{var}(\hat\beta) \to 0$, which establishes the result.
Example. Suppose that $x_i = i^\alpha$ for a single regressor. Then
$$ \operatorname{var}(\hat\beta) = \frac{\sigma^2}{\sum_{j=1}^n j^{2\alpha}} = \begin{cases} O(n^{-(2\alpha+1)}) & \text{if } \alpha \neq -1/2 \\ O(1/\log n) & \text{if } \alpha = -1/2. \end{cases} $$
Therefore, consistency holds if and only if $\alpha \geq -1/2$.
5.5 Asymptotic Distribution of OLS

We first state the result for the simplest random design case.
Suppose that

- $x_i, \varepsilon_i$ are i.i.d. with $\varepsilon_i$ independent of $x_i$;
- $E(\varepsilon_i^2) = \sigma^2 < \infty$;
- $0 < E[x_ix_i'] < \infty$.

Then,
$$ n^{1/2}(\hat\beta - \beta) \xrightarrow{D} N\left( 0,\ \sigma^2\{E[x_ix_i']\}^{-1} \right). $$
We next consider the fixed design case [where the errors are i.i.d. still]. In this case, it suffices to have a vector central limit theorem for the weighted i.i.d. sequence
$$ \hat\beta - \beta = \left( \sum_{i=1}^n x_ix_i' \right)^{-1}\sum_{i=1}^n x_i\varepsilon_i = \sum_{i=1}^n w_i\varepsilon_i, $$
for some weights $w_i$ depending only on the $X$ data. That is, the source of the heterogeneity is the fixed regressors.
A sufficient condition for the scalar standardized random variable
$$ T_n = \frac{\sum_{i=1}^n w_i\varepsilon_i}{\left( \sum_{i=1}^n w_i^2\sigma^2 \right)^{1/2}} $$
to be asymptotically standard normal, so that
$$ \left( \sum_{i=1}^n x_ix_i'/\sigma^2 \right)^{1/2}(\hat\beta - \beta) \xrightarrow{D} N(0, I), $$
is the negligibility condition
$$ \frac{\max_{1 \leq i \leq n}w_i^2}{\sum_{j=1}^n w_j^2} \to 0. $$
For example, if $x_i = i$,
$$ \frac{\max_{1 \leq i \leq n}i^2}{\sum_{j=1}^n j^2} = \frac{n^2}{O(n^3)} \to 0. $$
In this case, even though the largest element is increasing with sample size, many other elements are increasing just as fast.
By contrast, suppose that
$$ x_i = \begin{cases} 1 & \text{if } i < n \\ n & \text{if } i = n. \end{cases} $$
In this case, the negligibility condition fails, and the distribution of the least squares estimator would be largely determined by the last observation.
5.6 Order Notation

We write $X_n = o_p(a_n)$ if
$$ \frac{X_n}{a_n} \xrightarrow{P} 0, $$
and
$$ X_n = O_p(a_n) \text{ if } X_n/a_n \text{ is stochastically bounded}, $$
i.e., if for all $\epsilon > 0$ there exists $K < \infty$ such that
$$ \Pr\left[ \left| \frac{X_n}{a_n} \right| > K \right] < \epsilon \text{ for all } n. $$
5.7 Standard Errors and Test Statistics in Linear Regression

Consider first
$$ s^2 = \frac{\hat u'\hat u}{n-k} = \frac{1}{n-k}\left\{ u'u - u'X(X'X)^{-1}X'u \right\}
= \frac{n}{n-k}\,\frac{u'u}{n} - \frac{1}{n-k}\,\frac{u'X}{n^{1/2}}\left( \frac{X'X}{n} \right)^{-1}\frac{X'u}{n^{1/2}}. $$
Theorem. Suppose that $u_i$ are i.i.d. with finite fourth moment, and that the regressors are from a fixed design and satisfy $(X'X/n) \to M$, where $M$ is a positive definite matrix. Then $s^2 \xrightarrow{P} \sigma^2$.
Indeed, since
$$ \operatorname{var}\left( \frac{u'X}{n^{1/2}} \right) = \sigma^2\,\frac{X'X}{n}, $$
the second term above is $\frac{1}{n-k}O_p(1)$, so that
$$ s^2 = \sigma^2 + o_p(1). $$
What about the asymptotic distribution of $s^2$? We have
$$ n^{1/2}(s^2 - \sigma^2) = [1 + o_p(1)]\,\frac{1}{n^{1/2}}\sum_{i=1}^n (u_i^2 - \sigma^2) - \frac{n^{1/2}}{n-k}O_p(1)
= \frac{1}{n^{1/2}}\sum_{i=1}^n (u_i^2 - \sigma^2) + o_p(1)
\xrightarrow{D} N\left( 0, E(u_i^4) - \sigma^4 \right). $$
The t statistic satisfies
$$ t = \frac{n^{1/2}c'(\hat\beta - \beta)}{s\,(c'(X'X/n)^{-1}c)^{1/2}} = \frac{n^{1/2}c'(\hat\beta - \beta)}{\sigma\,(c'M^{-1}c)^{1/2}} + o_p(1) \xrightarrow{D} \frac{N(0, \sigma^2c'M^{-1}c)}{\sigma\,(c'M^{-1}c)^{1/2}} = N(0, 1) \text{ under } H_0. $$
Consider now the Wald statistic
$$ W = n(R\hat\beta - r)'\left[ s^2R\left( \frac{X'X}{n} \right)^{-1}R' \right]^{-1}(R\hat\beta - r). $$
Theorem. Suppose that $R$ is of full rank $q$, that $u_i$ are i.i.d. with finite fourth moment, and that the regressors are from a fixed design and satisfy
$$ (X'X/n) \to M, $$
where $M$ is a positive definite matrix. Then, under $H_0: R\beta = r$,
$$ W \xrightarrow{D} N(0, \sigma^2RM^{-1}R')'\,[\sigma^2RM^{-1}R']^{-1}\,N(0, \sigma^2RM^{-1}R') = \chi^2_q. $$
5.8 The delta method

Suppose that $n^{1/2}(\hat\theta - \theta) \xrightarrow{D} N(0, V)$ and $f$ is continuously differentiable at $\theta$. Then
$$ n^{1/2}\left( f(\hat\theta) - f(\theta) \right) \xrightarrow{D} N\left( 0,\ f'(\theta)^2V \right). $$
The proof is based on the mean value expansion
$$ f(\hat\theta) = f(\theta) + (\hat\theta - \theta)f'(\theta^*), $$
so that
$$ n^{1/2}(f(\hat\theta) - f(\theta)) = f'(\theta^*)\,n^{1/2}(\hat\theta - \theta). $$
Furthermore,
$$ \theta^* \xrightarrow{P} \theta, \qquad f'(\theta^*) \xrightarrow{P} f'(\theta), $$
where we require $0 \neq |f'(\theta)| < \infty$.
Therefore,
$$ n^{1/2}(f(\hat\theta) - f(\theta)) = [f'(\theta) + o_p(1)]\,n^{1/2}(\hat\theta - \theta). $$
The multivariate version is similar. For example, suppose that
$$ n^{1/2}(\hat\beta - \beta) \xrightarrow{D} N(0, \sigma^2M^{-1}) $$
and consider the ratio $f(\beta) = \beta_2/\beta_3$. Then
$$ n^{1/2}\left( \frac{\hat\beta_2}{\hat\beta_3} - \frac{\beta_2}{\beta_3} \right) \xrightarrow{D} N\left( 0,\ \sigma^2\,\frac{\partial f}{\partial\beta}'M^{-1}\frac{\partial f}{\partial\beta} \right), $$
where
$$ \frac{\partial f}{\partial\beta} = \left( 0, \ldots, \frac{1}{\beta_3}, -\frac{\beta_2}{\beta_3^2}, \ldots, 0 \right)', $$
so only the block
$$ \begin{pmatrix} M^{22} & M^{23} \\ M^{32} & M^{33} \end{pmatrix} $$
of $M^{-1}$ enters the limiting variance.
Chapter 6

Errors in Variables

Measurement error is a widespread problem in practice, since much economic data is poorly measured. This is an important problem that has been investigated a lot over the years.
One interpretation of the linear model is that:

- there is some unobservable $y^*$ satisfying
$$ y^* = X\beta; $$
- we observe $y^*$ subject to error,
$$ y = y^* + \varepsilon, $$
where $\varepsilon$ is a mean zero stochastic error term satisfying $\varepsilon \perp y^*$ [or, more fundamentally, $\varepsilon \perp X$].

Combining these two equations,
$$ y = X\beta + \varepsilon, $$
where $\varepsilon$ has the properties of the usual linear regression error term. It is clear that we treat $X$ and $y$ asymmetrically; $X$ is assumed to have been measured perfectly.
Suppose instead that the regressors are measured with error: we observe $X = X^* + U$, where $y = X^*\beta + \varepsilon$. Substituting, $y = X\beta + \varepsilon - U\beta$, and
$$ \hat\beta = \beta + (X'X)^{-1}X'\varepsilon - (X'X)^{-1}X'U\beta. $$
Because $X$ and $U$ are correlated, $\hat\beta$ is biased and inconsistent. Note that
$$ \frac{X'X}{n} = \frac{X^{*\prime}X^*}{n} + 2\frac{X^{*\prime}U}{n} + \frac{U'U}{n}. $$
We shall suppose that
$$ \frac{X^{*\prime}X^*}{n} \xrightarrow{P} Q, \quad
\frac{X^{*\prime}U}{n} \xrightarrow{P} 0, \quad
\frac{U'\varepsilon}{n} \xrightarrow{P} 0, \quad
\frac{X^{*\prime}\varepsilon}{n} \xrightarrow{P} 0, \quad
\frac{U'U}{n} \xrightarrow{P} \Sigma_{UU}, $$
so that
$$ \frac{X'\varepsilon}{n} \xrightarrow{P} 0, \qquad \frac{X'U}{n} = \frac{X^{*\prime}U}{n} + \frac{U'U}{n} \xrightarrow{P} \Sigma_{UU} $$
by similar reasoning.
Therefore,
$$ \hat\beta \xrightarrow{P} \beta - [Q + \Sigma_{UU}]^{-1}\Sigma_{UU}\beta = [Q + \Sigma_{UU}]^{-1}Q\beta \equiv C\beta. $$
In the scalar case,
$$ C = \frac{q}{q + \sigma^2_u} = \frac{1}{1 + \sigma^2_u/q} = \frac{1}{1 + \text{noise}/\text{signal}} \leq 1, $$
so the estimator is attenuated towards zero; only when the noise/signal ratio $\sigma^2_u/q$ is zero is $\hat\beta$ consistent.
The attenuation grows with the noise/signal ratio. In the vector case,
$$ \|\operatorname{plim}_{n\to\infty}\hat\beta\| \leq \|\beta\|, $$
but it is not necessarily the case that each element is shrunk towards zero.
Suppose that $K > 1$, but only one regressor is measured with error, i.e.,
$$ \Sigma_{UU} = \begin{pmatrix} \sigma^2_u & 0 \\ 0 & 0 \end{pmatrix}. $$
In this case, all elements of $\hat\beta$ are biased in general; that particular coefficient estimate is shrunk towards zero.
Now suppose the true regressor is trending, $X^*_t = t$, so that
$$ \sum_{t=1}^T X^{*2}_t = \sum_{t=1}^T t^2 = O(T^3), \qquad \sum_{t=1}^T U_t^2 = O_p(T). $$
Therefore,
$$ \frac{X'U}{T^{3/2}} = \frac{X^{*\prime}U}{T^{3/2}} + \frac{U'U}{T^{3/2}} \xrightarrow{P} 0, \qquad \frac{X'X}{T^3} \xrightarrow{P} \lim_{T\to\infty}\frac{X^{*\prime}X^*}{T^3}, $$
and therefore
$$ \hat\beta \xrightarrow{P} \beta. $$
This is because the signal here is very strong and swamps the noise.
6.1 Solutions to EIV

One solution is instrumental variables. Suppose we have a variable $Z$, correlated with $X$, satisfying
$$ \frac{Z'\varepsilon}{n} \xrightarrow{P} 0, \qquad \frac{Z'U}{n} \xrightarrow{P} 0. $$
Then define the instrumental variables estimator (IVE)
$$ \hat\beta_{IV} = (Z'X)^{-1}Z'y. $$
We have
$$ \hat\beta_{IV} = \beta + \left( \frac{Z'X}{n} \right)^{-1}\frac{Z'(\varepsilon - U\beta)}{n} \xrightarrow{P} \beta, $$
and
$$ n^{1/2}(\hat\beta_{IV} - \beta) = \left( \frac{Z'X}{n} \right)^{-1}\frac{Z'(\varepsilon - U\beta)}{n^{1/2}} \xrightarrow{D} N\left( 0,\ \sigma^2Q_{ZX}^{-1}Q_{ZZ}Q_{ZX}^{-1\prime} \right). $$
A simple example is the Wald estimator, based on the binary instrument
$$ z_i = \begin{cases} 1 & \text{if } x_i > \text{median } x_i \\ 0 & \text{if } x_i < \text{median } x_i. \end{cases} $$
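A Python/NumPy sketch of the IV estimator on a simulated errors-in-variables design (the instrument construction here is an arbitrary illustrative choice, not the Wald estimator above):

```python
import numpy as np

def iv(Z, X, y):
    """IV estimator beta_IV = (Z'X)^{-1} Z'y (Z and X of the same dimension)."""
    return np.linalg.solve(Z.T @ X, Z.T @ y)

rng = np.random.default_rng(2)
n = 5000
x_star = rng.normal(size=n)
z = x_star + rng.normal(size=n)      # correlated with x*, independent of U and eps
x = x_star + rng.normal(size=n)      # mismeasured regressor
y = 2.0 * x_star + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])
print(iv(Z, X, y))                               # close to (0, 2)
print(np.linalg.lstsq(X, y, rcond=None)[0])      # OLS slope is attenuated
```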
6.2 Other Types of Measurement Error

Misclassification. Suppose that $X_i$ is binary and is recorded correctly with some probability $\pi$, and recorded as $1 - X_i$ with probability $1 - \pi$.
We can write this as $X^*_i = X_i + U_i$, but $U_i$ is not independent of $X_i$.
Magic numbers. Suppose that there is rounding of numbers, so that $X_i$ is continuous, while $X^*_i$ is the closest integer to $X_i$.
6.3 Durbin-Wu-Hausman Test

To test for the presence of measurement error (or endogeneity more generally), compute
$$ H = \left( \hat\beta_{OLS} - \hat\beta_{IV} \right)'\hat V^{-1}\left( \hat\beta_{OLS} - \hat\beta_{IV} \right), $$
and reject the null hypothesis for large values of this statistic.

- The idea is that $\hat\beta_{OLS}$ and $\hat\beta_{IV}$ are both consistent under $H_0$, but under $H_A$, $\hat\beta_{OLS}$ is inconsistent. Therefore, there should be a discrepancy that can be picked up under the alternative.

Note that under $H_0$
$$ \hat\beta_{IV} - \hat\beta_{OLS} = \left[ (Z'X)^{-1}Z' - (X'X)^{-1}X' \right]\varepsilon = A\varepsilon, $$
and $\hat V$ estimates the variance of $A\varepsilon$. Here
$$ s^2 = \frac{1}{n-k}\sum_{i=1}^n \hat\varepsilon_i^2, \qquad \hat\varepsilon_i = y_i - x_i'\hat\beta_{IV}, $$
and
$$ \hat V^{-1} = s^{-2}\,X'Z(Z'M_XZ)^{-1}Z'X. $$
Under $H_0$,
$$ H \xrightarrow{D} \chi^2_K. $$
Chapter 7

Heteroskedasticity

We made the assumption that
$$ \operatorname{Var}(y) = \sigma^2I $$
in the context of the linear regression model. This contains two material parts:

- off-diagonals are zero (independence), and
- diagonals are the same.

Here we extend to the case where
$$ \operatorname{Var}(y) = \begin{pmatrix} \sigma_1^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_n^2 \end{pmatrix} = \Omega, $$
i.e., the data are heterogeneous.
7.1 Effects of Heteroskedasticity

Consider the OLS estimator
$$ \hat\beta = (X'X)^{-1}X'y. $$
In the new circumstances, this is unbiased, because
$$ E(\hat\beta) = \beta. $$
However,
$$ \operatorname{var}(\hat\beta) = (X'X)^{-1}X'\Omega X(X'X)^{-1} = \left( \sum_{i=1}^n x_ix_i' \right)^{-1}\left( \sum_{i=1}^n x_ix_i'\sigma_i^2 \right)\left( \sum_{i=1}^n x_ix_i' \right)^{-1} \neq \sigma^2(X'X)^{-1}. $$
The usual variance estimate is misleading:
$$ s^2 \xrightarrow{P} \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n \sigma_i^2, $$
but
$$ \frac{1}{n}\sum_{i=1}^n x_ix_i'\sigma_i^2 - \left( \frac{1}{n}\sum_{i=1}^n \sigma_i^2 \right)\frac{1}{n}\sum_{i=1}^n x_ix_i' \nrightarrow 0 $$
in general, so the usual standard errors are wrong.
Moreover, OLS is inefficient: the GLS estimator
$$ \hat\beta_{GLS} = \left( X'\Omega^{-1}X \right)^{-1}X'\Omega^{-1}y $$
is efficient by Gauss-Markov (applied to the transformed regression), and OLS differs from it unless
$$ \Omega = \sigma^2I. $$
In some special cases OLS = GLS, but in general they are different.
What to do?
7.2 Plan A: Eicker-White

Use OLS but correct the standard errors. Accept inefficiency but have correct tests, etc.
How do we do this? We can't estimate $\sigma_i^2$, $i = 1, \ldots, n$, because there are $n$ of them. However, this is not necessary - instead we must estimate the sample average $n^{-1}\sum_{i=1}^n x_ix_i'\sigma_i^2$. We estimate the asymptotic variance of $n^{1/2}(\hat\beta - \beta)$,
$$ V = \left( \frac{1}{n}\sum_{i=1}^n x_ix_i' \right)^{-1}\left( \frac{1}{n}\sum_{i=1}^n x_ix_i'\sigma_i^2 \right)\left( \frac{1}{n}\sum_{i=1}^n x_ix_i' \right)^{-1}, $$
by
$$ \hat V = \left( \frac{1}{n}\sum_{i=1}^n x_ix_i' \right)^{-1}\left( \frac{1}{n}\sum_{i=1}^n x_ix_i'\hat u_i^2 \right)\left( \frac{1}{n}\sum_{i=1}^n x_ix_i' \right)^{-1}, $$
where $\hat u_i$ are the OLS residuals. It can be shown that
$$ \hat V - V \xrightarrow{P} 0. $$
Typically one finds that White's standard errors [obtained from the diagonal elements of $\hat V$] are larger than the OLS standard errors.
Finally, one can construct test statistics which are robust to heteroskedasticity; thus
$$ n(R\hat\beta - r)'\,[R\hat VR']^{-1}\,(R\hat\beta - r) \xrightarrow{D} \chi^2_J. $$
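A minimal Python/NumPy sketch of the Eicker-White sandwich standard errors just described:

```python
import numpy as np

def white_se(X, y):
    """OLS with Eicker-White heteroskedasticity-robust standard errors."""
    n, K = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    u = y - X @ beta
    A_inv = np.linalg.inv(X.T @ X / n)
    B = (X * u[:, None] ** 2).T @ X / n   # (1/n) sum_i x_i x_i' u_i^2
    V = A_inv @ B @ A_inv                 # avar of sqrt(n)(beta_hat - beta)
    se = np.sqrt(np.diag(V) / n)
    return beta, se
```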
7.3 Plan B: Model Heteroskedasticity

Alternatively, model the variances. For example, if the regression error is a household average of $n_i$ i.i.d. member-level errors $u_{ij}$, then
$$ \operatorname{Var}(u_i) = \frac{1}{n_i}\sigma^2. $$
Here, the variance is inversely proportional to household size. This is an easy case, since apart from the single constant $\sigma^2$, $\sigma_i^2$ is known.
More generally, suppose that $\sigma_i^2 = \sigma_i^2(\theta)$ for a finite dimensional parameter $\theta$. Under normality, the score equations are
$$ \frac{\partial L}{\partial\beta} = \sum_{i=1}^n x_i\,\frac{y_i - x_i'\beta}{\sigma_i^2(\theta)} = 0, $$
$$ \frac{\partial L}{\partial\theta} = \frac{1}{2}\sum_{i=1}^n \frac{\partial\ln\sigma_i^2}{\partial\theta}(\theta)\left[ \frac{(y_i - x_i'\beta)^2}{\sigma_i^2(\theta)} - 1 \right] = 0. $$
89
bMLE , b
The estimators (
MLE ) solve this pair of equations, which are
nonlinear in general.
i i
i=1
i=1
2i =
1
(x0i xi )
2 i=1 2 i=1
n
where
7.4
b
=
Pn
1
b 0 xi
b2i ()x
i
i=1 u
bMLE .
u
bi = yi x0i
D
1 MLE
N(0, )
b
n 2 MLE
for some .
90
CHAPTER 7. HETEROSKEDASTICITY
If y is normal, then = I 1 , the information matrix,
limn n1 X 0 1 X o
.
I=
0
?
b is asymptotically equivalent to
In this case,
bGLS = X 0 1 X 1 X 0 1 y.
1
bF GLS = X 0 1 (b
AH )X
X 0 1 (b
AH )y.
7.5 Testing for Heteroskedasticity

Provided $\hat\theta$ is consistent and some additional conditions hold, this procedure is also asymptotically equivalent to $\hat\beta_{GLS}$.
We may want to test the null hypothesis of homoskedasticity. The score with respect to the variance parameters, evaluated under the null, is
$$ \frac{\partial L}{\partial\theta} = \frac{1}{2}\sum_{i=1}^n x_i\left( \frac{u_i^2}{\sigma^2} - 1 \right). $$
Under normality,
$$ \operatorname{Var}\left( \frac{u_i^2}{\sigma^2} \right) = 2. $$
Therefore, the LM statistic is
$$ LM = \frac{1}{2}\left[ \sum_{i=1}^n \left( \frac{\hat u_i^2}{\hat\sigma^2} - 1 \right)x_i' \right]\left[ \sum_{i=1}^n x_ix_i' \right]^{-1}\left[ \sum_{i=1}^n x_i\left( \frac{\hat u_i^2}{\hat\sigma^2} - 1 \right) \right], $$
where
$$ \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n \hat u_i^2 $$
and $\hat u_i^2$ are the OLS residuals from the restricted regression.
Under $H_0$,
$$ LM \xrightarrow{D} \chi^2_1. $$
Chapter 8

Nonlinear Regression Models

Suppose that
$$ y_i = g(x_i, \theta) + \varepsilon_i, \qquad i = 1, 2, \ldots, n, $$
where $g$ is a known function and $\theta$ an unknown parameter vector. The nonlinear least squares (NLLS) estimator minimizes
$$ S_n(\theta) = \frac{1}{n}\sum_{i=1}^n \left( y_i - g(x_i, \theta) \right)^2. $$
8.1 Computation

In some cases the problem can be reduced to a one-dimensional search. For example, in the Box-Cox regression $y_i = \beta(x_i^\lambda - 1)/\lambda + \varepsilon_i$, define
$$ X(\lambda) = \begin{pmatrix} \dfrac{x_1^\lambda - 1}{\lambda} \\ \vdots \\ \dfrac{x_n^\lambda - 1}{\lambda} \end{pmatrix}. $$
For given $\lambda$, the best $\beta$ is the OLS coefficient of $y$ on $X(\lambda)$, say $\hat\beta(\lambda)$. Then write
$$ S_n(\lambda, \hat\beta(\lambda)) = \frac{1}{n}\sum_{i=1}^n \left( y_i - \hat\beta(\lambda)\frac{x_i^\lambda - 1}{\lambda} \right)^2, $$
which is the concentrated criterion function. Now find $\hat\lambda$ to minimize this, e.g., by a line search on $[0, 1]$.
More generally, the NLLS estimator satisfies the first order condition
$$ \frac{\partial S_n}{\partial\theta}(\hat\theta_{NLLS}) = 0, $$
which is typically solved iteratively. A second order Taylor expansion of the criterion about a current value $\theta_1$,
$$ S_n(\theta) \approx S_n(\theta_1) + \frac{\partial S_n}{\partial\theta}(\theta_1)(\theta - \theta_1) + \frac{1}{2}\frac{\partial^2S_n}{\partial\theta^2}(\theta_1)(\theta - \theta_1)^2, $$
is minimized by the Newton-Raphson update
$$ \theta_2 = \theta_1 - \left[ \frac{\partial^2S_n}{\partial\theta^2}(\theta_1) \right]^{-1}\frac{\partial S_n}{\partial\theta}(\theta_1). $$
In $k$ dimensions,
$$ \theta_2 = \theta_1 - \left[ \frac{\partial^2S_n}{\partial\theta\partial\theta'}(\theta_1) \right]^{-1}\frac{\partial S_n}{\partial\theta}(\theta_1). $$
A variable step length $\lambda$ can be introduced,
$$ \theta_2(\lambda) = \theta_1 - \lambda\left[ \frac{\partial^2S_n}{\partial\theta\partial\theta'}(\theta_1) \right]^{-1}\frac{\partial S_n}{\partial\theta}(\theta_1), $$
with $\lambda$ chosen to guarantee a decrease in the criterion; one iterates until convergence.
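A Python/NumPy sketch of an iterative NLLS solver. Note this uses the Gauss-Newton variant, which replaces the Hessian in the Newton-Raphson update by $G'G$ (the outer product of the Jacobian), a standard simplification for least squares problems; the exponential model below is a hypothetical example:

```python
import numpy as np

def gauss_newton(g, grad_g, theta0, x, y, steps=50, tol=1e-10):
    """Gauss-Newton for NLLS: theta <- theta + (G'G)^{-1} G'u at each step,
    where G is the Jacobian of g and u the current residual vector."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        u = y - g(x, theta)
        G = grad_g(x, theta)                      # n x p Jacobian
        step = np.linalg.solve(G.T @ G, G.T @ u)
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    return theta

# Example: y = exp(theta * x) + eps, true theta = 0.7
rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=400)
y = np.exp(0.7 * x) + 0.05 * rng.normal(size=400)
g = lambda x, th: np.exp(th[0] * x)
grad_g = lambda x, th: (x * np.exp(th[0] * x))[:, None]
print(gauss_newton(g, grad_g, [0.0], x, y))       # close to 0.7
```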
8.2 Consistency of NLLS

Under regularity conditions (compactness of the parameter space, continuity, identification, and uniform convergence of $S_n$),
$$ \hat\theta - \theta_0 \xrightarrow{P} 0. $$
The proof is in Amemiya (1985, Theorem 4.1.1). We just show why the key conditions are plausible. Substituting for $y_i$, we have
$$ S_n(\theta) = \frac{1}{n}\sum_{i=1}^n \left[ \varepsilon_i + g(x_i, \theta_0) - g(x_i, \theta) \right]^2 $$
$$ = \frac{1}{n}\sum_{i=1}^n \varepsilon_i^2 + \frac{1}{n}\sum_{i=1}^n \left[ g(x_i, \theta) - g(x_i, \theta_0) \right]^2 - \frac{2}{n}\sum_{i=1}^n \varepsilon_i\left[ g(x_i, \theta) - g(x_i, \theta_0) \right]. $$
By laws of large numbers,
$$ \frac{1}{n}\sum_{i=1}^n \varepsilon_i\left[ g(x_i, \theta) - g(x_i, \theta_0) \right] \xrightarrow{P} 0 \text{ for all } \theta, $$
$$ \frac{1}{n}\sum_{i=1}^n \left[ g(x_i, \theta) - g(x_i, \theta_0) \right]^2 \xrightarrow{P} E\left[ g(x_i, \theta) - g(x_i, \theta_0) \right]^2. $$
Therefore,
$$ S_n(\theta) \xrightarrow{P} \sigma^2 + E\left[ g(x_i, \theta) - g(x_i, \theta_0) \right]^2 \equiv S(\theta), $$
which is minimized at $\theta = \theta_0$.
8.3 Asymptotic Distribution of NLLS

The estimator $\hat\theta$ satisfies
$$ \frac{\partial S_n}{\partial\theta}(\hat\theta) = 0 \quad \text{and} \quad \hat\theta - \theta_0 \xrightarrow{P} 0. $$
Suppose also that
$$ n^{1/2}\frac{\partial S_n}{\partial\theta}(\theta_0) \xrightarrow{D} N(0, B). $$
Then,
$$ n^{1/2}(\hat\theta - \theta_0) \xrightarrow{D} N(0, V), \qquad V = A^{-1}BA^{-1}, $$
where $A$ is the probability limit of the Hessian. The argument is based on the mean value expansion
$$ 0 = n^{1/2}\frac{\partial S_n}{\partial\theta}(\hat\theta) = n^{1/2}\frac{\partial S_n}{\partial\theta}(\theta_0) + \frac{\partial^2S_n}{\partial\theta\partial\theta'}(\theta^*)\,n^{1/2}(\hat\theta - \theta_0). $$
Here
$$ \frac{\partial S_n}{\partial\theta}(\theta_0) = -\frac{2}{n}\sum_{i=1}^n \varepsilon_i\frac{\partial g}{\partial\theta}(x_i, \theta_0), $$
so provided
$$ E\left\| \varepsilon_i\frac{\partial g}{\partial\theta}(x_i, \theta_0) \right\|^2 < \infty, $$
a central limit theorem gives
$$ n^{1/2}\frac{\partial S_n}{\partial\theta}(\theta_0) \xrightarrow{D} N\left( 0,\ 4E\left[ \varepsilon_i^2\frac{\partial g}{\partial\theta}(x_i, \theta_0)\frac{\partial g}{\partial\theta'}(x_i, \theta_0) \right] \right). $$
For the Hessian, differentiating twice,
$$ \frac{\partial^2S_n}{\partial\theta\partial\theta'}(\theta) = \frac{2}{n}\sum_{i=1}^n \frac{\partial g}{\partial\theta}(x_i, \theta)\frac{\partial g}{\partial\theta'}(x_i, \theta) - \frac{2}{n}\sum_{i=1}^n \left[ y_i - g(x_i, \theta) \right]\frac{\partial^2g}{\partial\theta\partial\theta'}(x_i, \theta). $$
Provided
$$ E\left\| \frac{\partial^2g}{\partial\theta\partial\theta'}(x_i, \theta_0) \right\| < \infty \quad \text{and} \quad E\left\| \frac{\partial g}{\partial\theta}(x_i, \theta_0)\frac{\partial g}{\partial\theta'}(x_i, \theta_0) \right\| < \infty, $$
laws of large numbers give
$$ \frac{1}{n}\sum_{i=1}^n \varepsilon_i\frac{\partial^2g}{\partial\theta\partial\theta'}(x_i, \theta_0) \xrightarrow{P} 0 $$
and
$$ \frac{1}{n}\sum_{i=1}^n \frac{\partial g}{\partial\theta}\frac{\partial g}{\partial\theta'}(x_i, \theta_0) \xrightarrow{P} E\left[ \frac{\partial g}{\partial\theta}\frac{\partial g}{\partial\theta'}(x_i, \theta_0) \right]. $$
To handle the intermediate point $\theta^*$, expand
$$ \frac{\partial^2g}{\partial\theta\partial\theta'}(x_i, \theta^*) = \frac{\partial^2g}{\partial\theta\partial\theta'}(x_i, \theta_0) + (\theta^* - \theta_0)'\frac{\partial^3g}{\partial\theta\partial\theta'\partial\theta}(x_i, \bar\theta) $$
for some intermediate point $\bar\theta$. Then, provided the third derivative is dominated, i.e.,
$$ \left\| \frac{\partial^3g}{\partial\theta\partial\theta'\partial\theta}(x, \theta) \right\| \leq D(x) \text{ with } E\,D(X) < \infty, $$
the required uniform convergence condition will be satisfied.
Similar results can be shown in the fixed design case, but we need to use the CLT and LLN for weighted sums of i.i.d. random variables.
Note that when $\varepsilon_i$ are i.i.d. and independent of $x_i$, we have
$$ B = 4\sigma^2E\left[ \frac{\partial g}{\partial\theta}\frac{\partial g}{\partial\theta'}(x_i, \theta_0) \right], \qquad A = 2E\left[ \frac{\partial g}{\partial\theta}\frac{\partial g}{\partial\theta'}(x_i, \theta_0) \right], $$
so that
$$ n^{1/2}(\hat\theta - \theta_0) \xrightarrow{D} N\left( 0,\ \sigma^2\left\{ E\left[ \frac{\partial g}{\partial\theta}\frac{\partial g}{\partial\theta'}(x_i, \theta_0) \right] \right\}^{-1} \right). $$
The variance can be estimated by $\hat V = \hat A^{-1}\hat B\hat A^{-1}$, where
$$ \hat A = \frac{1}{n}\sum_{i=1}^n \frac{\partial g}{\partial\theta}\frac{\partial g}{\partial\theta'}(x_i, \hat\theta), \qquad
\hat B = \frac{1}{n}\sum_{i=1}^n \frac{\partial g}{\partial\theta}\frac{\partial g}{\partial\theta'}(x_i, \hat\theta)\,\hat\varepsilon_i^2, $$
and $\hat\varepsilon_i = y_i - g(x_i, \hat\theta)$. Then
$$ \hat V \xrightarrow{P} V. $$
8.4 Maximum Likelihood

Similar arguments apply to the maximum likelihood estimator. Let
$$ \ell(\theta) = \sum_{i=1}^n \ell_i(\theta) $$
denote the log likelihood, and let $\hat\theta$ maximize $\ell(\text{data}, \theta)$.
Under regularity conditions,
$$ \hat\theta - \theta_0 \xrightarrow{P} 0 \quad \text{and} \quad n^{1/2}(\hat\theta - \theta_0) \xrightarrow{D} N\left( 0,\ I^{-1}(\theta_0) \right), $$
where
$$ I(\theta_0) = E\left[ \frac{\partial\ell_i}{\partial\theta}\frac{\partial\ell_i}{\partial\theta'}(\theta_0) \right] = -E\left[ \frac{\partial^2\ell_i}{\partial\theta\partial\theta'}(\theta_0) \right]. $$
This last equation is called the information matrix equality.
Asymptotic Cramér-Rao Theorem. The MLE is asymptotically efficient amongst the class of all asymptotically normal estimators (stronger than Gauss-Markov).
Chapter 9

Generalized Method of Moments

We suppose that there is i.i.d. data $\{Z_i\}_{i=1}^n$ from some population.
It is known that there exists a unique $\theta_0$ such that
$$ E[g(\theta_0, Z_i)] = 0 $$
for some $q \times 1$ vector of known functions $g(\theta_0, \cdot)$.
For example, $g$ could be the first order condition from OLS or more generally maximum likelihood, e.g., $g(\theta, Z_i) = x_i(y_i - x_i'\theta)$.
Define the sample moment
$$ G_n(\theta) = \frac{1}{n}\sum_{i=1}^n g(\theta, Z_i) $$
and the GMM criterion
$$ Q_n(\theta) = G_n(\theta)'W_nG_n(\theta), $$
where $W_n$ is a positive definite weighting matrix; the GMM estimator minimizes $Q_n(\theta)$.
This defines a large class of estimators, one for each weighting matrix $W_n$.
It is generally a nonlinear optimization problem, like nonlinear least squares; various techniques are available for finding the minimizer.
GMM is a general estimation method that includes both OLS and more generally MLE as special cases!!
In the MLE case the criterion is $\sum_{i=1}^n \ell(Z_i, \theta)$, the first order condition is
$$ \sum_{i=1}^n \frac{\partial\ell}{\partial\theta}(Z_i, \hat\theta) = 0, $$
and the implied moment function is $g(\theta, Z_i) = \frac{\partial\ell}{\partial\theta}(Z_i, \theta)$.
9.1 Asymptotic Properties

Under regularity conditions,
$$ n^{1/2}\left( \hat\theta_{GMM} - \theta_0 \right) \xrightarrow{D} N\left( 0,\ (\Gamma'W\Gamma)^{-1}\Gamma'W\Omega W\Gamma(\Gamma'W\Gamma)^{-1} \right), $$
where:
$$ \Omega(\theta_0) = \operatorname{Var}\left( n^{1/2}G_n(\theta_0) \right), \qquad
\Gamma = \operatorname{plim}_{n\to\infty}\frac{\partial G_n(\theta_0)}{\partial\theta}, \qquad
0 < W = \operatorname{plim}_{n\to\infty}W_n. $$
The optimal choice of weighting matrix is $W = \Omega^{-1}$, in which case
$$ n^{1/2}(\hat\theta - \theta_0) \xrightarrow{D} N\left( 0,\ (\Gamma'\Omega^{-1}\Gamma)^{-1} \right). $$
In practice this is implemented in two steps. First compute
$$ \hat\Omega = \hat\Omega(\tilde\theta) = \frac{1}{n}\sum_{i=1}^n g(\tilde\theta, Z_i)g(\tilde\theta, Z_i)', $$
where $\tilde\theta$ is a preliminary estimate of $\theta_0$ obtained using some arbitrary weighting matrix, e.g., $I_q$. Then let

- $\hat\theta = \hat\theta^{opt}_{GMM} = \arg\min_\theta\,G_n(\theta)'\hat\Omega^{-1}G_n(\theta)$.

The asymptotic distribution is now normal with mean zero and variance $(\Gamma'\Omega^{-1}\Gamma)^{-1}$, which can be estimated using $\hat\Omega$ and
$$ \hat\Gamma = \frac{\partial G_n(\hat\theta)}{\partial\theta}. $$
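A sketch of the two-step procedure in Python, assuming SciPy is available for the numerical minimization (the linear IV moment function at the end is an illustrative assumption, not part of the general method):

```python
import numpy as np
from scipy.optimize import minimize

def gmm_two_step(g, theta0, Z):
    """Two-step GMM: identity weight first, then Omega_hat^{-1}.
    g(theta, Z) must return an n x q matrix of moment contributions."""
    def Q(theta, W):
        Gn = g(theta, Z).mean(axis=0)
        return Gn @ W @ Gn

    q = g(np.asarray(theta0, dtype=float), Z).shape[1]
    step1 = minimize(Q, theta0, args=(np.eye(q),), method="Nelder-Mead")
    m = g(step1.x, Z)
    Omega_hat = m.T @ m / len(m)          # (1/n) sum g g'
    step2 = minimize(Q, step1.x, args=(np.linalg.inv(Omega_hat),),
                     method="Nelder-Mead")
    return step2.x

# Example: over-identified linear IV moments g = z_i (y_i - x_i theta)
rng = np.random.default_rng(4)
n = 2000
z = rng.normal(size=(n, 2))
x = z @ np.array([1.0, 0.5]) + rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)
g = lambda th, Z: Z["z"] * (Z["y"] - Z["x"] * th[0])[:, None]
print(gmm_two_step(g, [0.0], {"z": z, "x": x, "y": y}))   # close to 2.0
```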
9.2 Test Statistics

Suppose that
$$ n^{1/2}(\hat\theta - \theta) \xrightarrow{D} N(0, V(\theta)) $$
and we wish to test $m$ (possibly nonlinear) restrictions $f(\theta) = 0$. Let
$$ F(\theta) = \frac{\partial f(\theta)}{\partial\theta'}. $$
Therefore, by the delta method,
$$ n^{1/2}\left( f(\hat\theta) - f(\theta) \right) \xrightarrow{D} N\left( 0,\ F(\theta)V(\theta)F(\theta)' \right). $$
Letting
$$ \hat f = f(\hat\theta) \quad \text{and} \quad \hat F = F(\hat\theta), $$
the Wald statistic satisfies, under the null hypothesis,
$$ n\hat f'\left[ \hat F\hat V\hat F' \right]^{-1}\hat f \xrightarrow{D} \chi^2_m. $$

9.3 Examples
Linear regression:
$$ y = X\beta + u. $$
OLS corresponds to the moment condition
$$ \frac{1}{n}X'(y - X\hat\beta) = 0. $$
Suppose now that
$$ E[x_iu_i(\beta_0)] \neq 0, $$
i.e., the errors are correlated with the regressors. This could be because of:

- omitted variables: there are variables in $u$ that should be in $X$;
- measurement error: the included $X$ variables have been measured with error.

Suppose there exist instruments $z_i$ with
$$ E[z_iu_i(\beta_0)] = 0. \tag{9.1} $$
Now take
$$ G_n(\beta) = \frac{1}{n}Z'(y - X\beta) = \frac{1}{n}\sum_{i=1}^n z_i(y_i - x_i'\beta). $$
110
y = A 2 y and X = A 2 X
we see that
with solution
Qn () = (y X )0 (y X )
bGMM = (X 0 X )1 X 0 y
1
= (X 0 AX) X 0 Ay
1
= (X 0 ZWn Z 0 X) X 0 ZWn Z 0 y.
The question is, what is the best choice of Wn ? Suppose also that u has
variance matrix 2 I independent of Z, and that Z is a xed variable.
Then
h 1
i
Z 0Z
1
var n 2 Gn ( 0 ) = var 1 Z 0 u = 2
.
n
n2
Therefore, the optimal weighting is to take
Wn (Z 0 Z)1
in which case
where
bGMM = (X 0 PZ X)1 X 0 PZ y,
PZ = Z(Z 0 Z)1 Z 0
9.4 Conditional Moment Restrictions

In time series applications, the model often delivers a conditional moment restriction of the form $E[\rho(\theta_0, Y_t)|\mathcal{F}_t] = 0$ for an information set $\mathcal{F}_t$.
Estimation now proceeds by forming some UNCONDITIONAL MOMENT RESTRICTIONS using valid instruments, i.e., variables from $\mathcal{F}_t^* \subseteq \mathcal{F}_t$. Thus, let
$$ g(\theta, Z_t) = \rho(\theta_0, Y_t) \otimes X_t, $$
where $X_t \in \mathcal{F}_t$ and $Z_t = (Y_t, X_t)$. We suppose that $g$ is of dimension $q$ with $q \geq p$. Then
$$ E[g(\theta, Z_t)] = 0 \iff \theta = \theta_0. $$
We then form the sample moment condition
$$ G_T(\theta) = \frac{1}{T}\sum_{t=1}^T g(\theta, Z_t) $$
and minimize
$$ Q_T(\theta) = G_T(\theta)'W_TG_T(\theta) $$
over $\theta \in \Theta \subseteq \mathbb{R}^p$.
9.5 Asymptotics

As before we have
$$ T^{1/2}\left( \hat\theta_{GMM} - \theta_0 \right) \xrightarrow{D} N\left( 0,\ (\Gamma'W\Gamma)^{-1}\Gamma'W\Omega W\Gamma(\Gamma'W\Gamma)^{-1} \right), $$
where:
$$ \Omega(\theta_0) = \operatorname{Var}\left( T^{1/2}G_T(\theta_0) \right), \qquad
\Gamma = \operatorname{plim}_{T\to\infty}\frac{\partial G_T(\theta_0)}{\partial\theta}, \qquad
0 < W = \operatorname{plim}_{T\to\infty}W_T. $$
Now, however,
$$ \Omega(\theta_0) = \lim_{T\to\infty}\operatorname{var}\left[ \frac{1}{T^{1/2}}\sum_{t=1}^T g(\theta_0, Z_t) \right]
= \lim_{T\to\infty}E\left[ \frac{1}{T}\sum_{t=1}^T\sum_{s=1}^T g(\theta_0, Z_t)g(\theta_0, Z_s)' \right]
= \Lambda_0 + \sum_{k=1}^\infty \left( \Lambda_k + \Lambda_k' \right), $$
where
$$ \Lambda_k = E\left[ g(\theta_0, Z_t)g(\theta_0, Z_{t-k})' \right], \qquad \Lambda_k' = E\left[ g(\theta_0, Z_{t-k})g(\theta_0, Z_t)' \right] $$
is the covariance function of $U_t = g(\theta_0, Z_t)$.
For standard errors and optimal estimation we need an estimator of $\Omega$. The Newey-West estimator is
$$ \hat\Omega_T = \frac{1}{T}\sum_{t,s\,:\,|t-s| \leq n(T)} w(|t-s|)\,g(\tilde\theta, Z_t)g(\tilde\theta, Z_s)', $$
where
$$ w(j) = 1 - \frac{j}{n+1}, $$
and where $\tilde\theta$ is a preliminary estimate of $\theta_0$ obtained using some arbitrary weighting matrix, e.g., $I_q$. This choice of weights ensures a positive definite covariance matrix estimate. Provided $n = n(T) \to \infty$ but $n(T)/T \to 0$ at some rate,
$$ \hat\Omega_T \xrightarrow{P} \Omega. $$
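A compact Python/NumPy sketch of the Newey-West estimator, written in terms of the autocovariance representation $\hat\Omega = \hat\Lambda_0 + \sum_k w(k)(\hat\Lambda_k + \hat\Lambda_k')$ derived above:

```python
import numpy as np

def newey_west(m, n_lags):
    """Newey-West (Bartlett) long-run variance estimate from a T x q array m
    of moment contributions, assumed (approximately) mean zero."""
    T, q = m.shape
    omega = m.T @ m / T                        # Lambda_0
    for k in range(1, n_lags + 1):
        w = 1.0 - k / (n_lags + 1.0)           # Bartlett weight
        lam_k = m[k:].T @ m[:-k] / T           # Lambda_k
        omega += w * (lam_k + lam_k.T)
    return omega
```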
9.6 Example

Hansen and Singleton, Econometrica (1982). One of the most influential econometric papers of the 1980s. Intertemporal consumption/investment decision:

- $c_t$ consumption;
- $u(\cdot)$ utility, $u_c > 0$, $u_{cc} < 0$;
- $1 + r_{i,t+1}$, $i = 1, \ldots, m$, is the gross return on asset $i$ at time $t+1$.

The agent solves
$$ \max_{\{c_t, w_t\}_{t=0}^\infty} E\left[ \sum_{\tau=0}^\infty \beta^\tau u(c_{t+\tau})\,\Big|\,I_t \right], $$
where

- $w_t$ is a vector of portfolio weights;
- $\beta$ is the discount rate with $0 < \beta < 1$;
- $I_t$ is the information available to the agent at time $t$.

We assume that there is a unique interior solution; this is characterized by the following condition:
$$ u'(c_t) = E\left[ \beta(1 + r_{i,t+1})u'(c_{t+1})\,|\,I_t \right] $$
for $i = 1, \ldots, m$.
Now suppose that
$$ u(c_t) = \begin{cases} \dfrac{c_t^{1-\gamma}}{1-\gamma} & \text{if } \gamma > 0, \gamma \neq 1 \\ \log c_t & \text{if } \gamma = 1. \end{cases} $$
Then
$$ c_t^{-\gamma} = E\left[ \beta(1 + r_{i,t+1})c_{t+1}^{-\gamma}\,|\,I_t \right] $$
for $i = 1, \ldots, m$, i.e.,
$$ E\left[ 1 - \beta\left( \frac{c_{t+1}}{c_t} \right)^{-\gamma}(1 + r_{i,t+1})\,\Big|\,I_t^* \right] = 0 $$
for $i = 1, \ldots, m$, where $I_t^* \subseteq I_t$ is the econometrician's information set. The moment function is
$$ g(\theta, x_t) = \begin{pmatrix} \vdots \\ \left[ 1 - \beta(1 + r_{i,t+1})\left( \dfrac{c_{t+1}}{c_t} \right)^{-\gamma} \right]z_{jt} \\ \vdots \end{pmatrix}, $$
where
$$ z_t \in I_t^* \cap \mathbb{R}^J $$
are instruments, $q = mJ$, and
$$ x_t = (z_t, c_t, c_{t+1}, r_{1,t+1}, \ldots, r_{m,t+1})'. $$
Typically, $z_t$ is chosen to be lagged variables, and these are numerous, so that $q \geq p$.
The model assumption is that
$$ E[g(\theta_0, x_t)] = 0 $$
for some unique $\theta_0 = (\beta_0, \gamma_0)$.
This is a nonlinear function of $\theta$.
Exercise. Show how to consistently estimate $\beta$ and $\gamma$ in this case.
Chapter 10

Time Series

10.1 Some Fundamental Properties

We start with a univariate time series $\{y_t\}_{t=1}^T$. There are two main features:

- stationarity/nonstationarity;
- dependence.

We first define stationarity.
Strong Stationarity. The stochastic process $y$ is said to be strongly stationary if the vectors
$$ (y_t, \ldots, y_{t+r}) \quad \text{and} \quad (y_{t+s}, \ldots, y_{t+s+r}) $$
have the same distribution for all $t, s, r$.
Weak Stationarity. The stochastic process $y$ is said to be weakly stationary if the vectors
$$ (y_t, \ldots, y_{t+r}) \quad \text{and} \quad (y_{t+s}, \ldots, y_{t+s+r}) $$
have the same mean and variance for all $t, s, r$.
For a weakly stationary process, define the autocovariance and autocorrelation functions
$$ \gamma_s = \operatorname{cov}(y_t, y_{t-s}), \qquad \rho_s = \frac{\gamma_s}{\gamma_0}. $$
Note that stationarity was used here in order to assert that these moments only depend on the gap $s$ and not on calendar time $t$ as well.
For i.i.d. series,
$$ \gamma_s = 0 \text{ for all } s \neq 0. $$
Linear process. Suppose that
$$ y_t = \sum_{j=0}^\infty \theta_j\varepsilon_{t-j}, $$
where $\varepsilon_t$ is i.i.d. with mean zero and variance $\sigma^2_\varepsilon$.
AR(1). Suppose that $y_t = \rho y_{t-1} + \varepsilon_t$ with $|\rho| < 1$. Then
$$ \operatorname{var}(y_t) = \rho^2\operatorname{var}(y_{t-1}) + \sigma^2_\varepsilon, $$
and by stationarity $\gamma_0 = \operatorname{var}(y_t) = \operatorname{var}(y_{t-1})$, so
$$ \gamma_0 = \frac{\sigma^2_\varepsilon}{1 - \rho^2}. $$
This last calculation of course requires that $|\rho| < 1$, which we are assuming for stationarity.
Finally,
$$ \operatorname{cov}(y_t, y_{t-1}) = E(y_ty_{t-1}) = \rho E(y_{t-1}^2) + 0 = \frac{\rho\sigma^2_\varepsilon}{1 - \rho^2}, $$
and more generally
$$ \gamma_s = \frac{\rho^s\sigma^2_\varepsilon}{1 - \rho^2}, \qquad \rho_s = \rho^s. $$
The correlation function decays geometrically towards zero.
MA(1). Suppose that $y_t = \varepsilon_t - \theta\varepsilon_{t-1}$. Then
$$ \operatorname{cov}(y_t, y_{t-1}) = E\{(\varepsilon_t - \theta\varepsilon_{t-1})(\varepsilon_{t-1} - \theta\varepsilon_{t-2})\} = -\theta E(\varepsilon_{t-1}^2) = -\theta\sigma^2_\varepsilon. $$
Therefore,
$$ \rho_1 = \frac{-\theta}{1 + \theta^2}, \qquad \rho_j = 0, \quad j = 2, \ldots $$
When $|\theta| < 1$, the process is invertible, i.e.,
$$ \sum_{j=0}^\infty \theta^jy_{t-j} = \varepsilon_t. $$
10.2 Estimation

The sample autocovariance and autocorrelation are
$$ \hat\gamma_s = \frac{1}{T-s}\sum_{t=s+1}^T (y_t - \bar y)(y_{t-s} - \bar y), \qquad \hat\rho_s = \frac{\hat\gamma_s}{\hat\gamma_0}. $$
These sample quantities are often used to describe the actual series properties. They are consistent and asymptotically normal.
Box-Jenkins analysis: identification of the process by looking at the correlogram. In practice, it is hard to identify any but the simplest processes, but the covariance function still has many uses.
Estimation of ARMA parameters $\theta$. One can invert the autocovariance/autocorrelation function to compute an estimate of $\theta$. For example, in the AR(1) case, the parameter $\rho$ is precisely the first order autocorrelation. In the MA(1) case, one can show that the parameter $\theta$ satisfies a quadratic equation in which the coefficients are the autocorrelation function at the first two lags. A popular estimation method is the likelihood under normality. Suppose that
$$ \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_T \end{pmatrix} \sim N(0, \sigma^2I); $$
then
$$ \begin{pmatrix} y_1 \\ \vdots \\ y_T \end{pmatrix} \sim N(0, \Sigma). $$
For example, in the MA(1) case $\Sigma$ is the banded matrix
$$ \Sigma = \sigma^2(1 + \theta^2)\begin{pmatrix} 1 & \rho_1 & & 0 \\ \rho_1 & 1 & \ddots & \\ & \ddots & \ddots & \rho_1 \\ 0 & & \rho_1 & 1 \end{pmatrix}, $$
and the log likelihood is
$$ -\frac{T}{2}\log 2\pi - \frac{1}{2}\log|\Sigma| - \frac{1}{2}y'\Sigma^{-1}y. $$
For the AR(1) process, it is convenient to factorize the joint density as the product of conditionals,
$$ \ell(y_1) + \sum_{t=2}^T \ell(y_t|y_{t-1}, \ldots, y_1), $$
where, for $t \geq 2$,
$$ \ell_{t|t-1} = -\frac{1}{2}\log 2\pi\sigma^2 - \frac{1}{2\sigma^2}(y_t - \rho y_{t-1})^2, $$
and the initial observation satisfies $y_1 \sim N(0, \sigma^2/(1 - \rho^2))$, so that
$$ \ell(y_1) = -\frac{1}{2}\log\frac{2\pi\sigma^2}{1 - \rho^2} - \frac{1 - \rho^2}{2\sigma^2}y_1^2. $$
Dropping the initial term gives the approximate (conditional) log likelihood
$$ -\frac{T-1}{2}\log 2\pi\sigma^2 - \frac{1}{2\sigma^2}\sum_{t=2}^T (y_t - \rho y_{t-1})^2. $$
This criterion is equivalent to the least squares criterion, and has the unique maximizer
$$ \hat\rho = \frac{\sum_{t=2}^T y_ty_{t-1}}{\sum_{t=2}^T y_{t-1}^2}. $$
This estimator is just the OLS of $y_t$ on $y_{t-1}$ [but using the reduced sample]. One can also interpret this as a GMM estimator with moment condition
$$ E[y_{t-1}(y_t - \rho y_{t-1})] = 0. $$
The full MLE will be slightly different from the approximate MLE. In terms of asymptotic properties, the difference is negligible.

- However, in finite samples there can be significant differences.
- Also, the MLE imposes that $\hat\rho$ be less than one - as $\rho \to 1$, $\ell \to -\infty$. The OLS estimate however can be on either side of the unit circle.
10.3 Forecasting

For the AR(1) process with $|\rho| < 1$,
$$ y_{T+r} = \rho^ry_T + \rho^{r-1}\varepsilon_{T+1} + \ldots + \varepsilon_{T+r}. $$
Therefore, let
$$ \hat y_{T+r|T} = \rho^ry_T $$
be our forecast.
The forecast error $\hat y_{T+r|T} - y_{T+r}$ has mean zero and variance
$$ \sigma^2\left( 1 + \rho^2 + \ldots + \rho^{2r-2} \right). $$
In practice $\rho$ is estimated from sample data. If $\rho$ is estimated well, then this will not make much difference.
A forecast interval is
$$ \hat y_{T+r|T} \pm 2\,SD, \qquad SD = \sigma\left( 1 + \rho^2 + \ldots + \rho^{2r-2} \right)^{1/2}. $$
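A small Python/NumPy sketch combining the OLS/conditional-MLE estimate of $\rho$ with the $r$-step forecast and its interval, on simulated data (the plug-in of estimated $\rho$ and $\sigma$ follows the remark above):

```python
import numpy as np

def ar1_fit_forecast(y, r):
    """Estimate an AR(1) by OLS of y_t on y_{t-1}; return rho_hat, the
    r-step forecast, and a rough 2-SD forecast interval."""
    rho = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)
    sigma = np.std(y[1:] - rho * y[:-1], ddof=1)
    forecast = rho ** r * y[-1]
    sd = sigma * np.sqrt(np.sum(rho ** (2 * np.arange(r))))  # 1 + rho^2 + ... + rho^{2r-2}
    return rho, forecast, (forecast - 2 * sd, forecast + 2 * sd)

# Simulated AR(1) with rho = 0.8
rng = np.random.default_rng(5)
y = np.zeros(1000)
for t in range(1, 1000):
    y[t] = 0.8 * y[t - 1] + rng.normal()
print(ar1_fit_forecast(y, r=5))
```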
10.4 Autocorrelation and Regression

Consider the regression $y = X\beta + u$ where the errors are autocorrelated:
$$ E(uu') = \Omega = \sigma^2\begin{pmatrix} \rho_0 & \rho_1 & \cdots & \rho_{T-1} \\ \rho_1 & \rho_0 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \rho_1 \\ \rho_{T-1} & \cdots & \rho_1 & \rho_0 \end{pmatrix}. $$
The consequences for estimation and testing of $\beta$ are the same as with heteroskedasticity: OLS is consistent and unbiased, but inefficient, while the SEs are wrong.
Specifically,
$$ \operatorname{var}(\hat\beta_{OLS}) = (X'X)^{-1}X'\Omega X(X'X)^{-1}, $$
where
$$ X'\Omega X = \sum_{t=1}^T\sum_{s=1}^T x_tx_s'\gamma_{|t-s|}. $$
128
u
b21 u
b1 u
b2 u
b1 u
bT
2
u
b
1
bT = X 0
...
u
b2T
T
T X
xt x0t u
bt u
bs
X =
t=1 s=1
XX
t,s:|ts|n(T )
j
,
n+1
This also ensures a positive denite covariance matrix estimate. Provides consistent standard errors.
An alternative strategy is to parameterize ut by, say, an ARMA process
and do maximum likelihood
1
1
` = ln |()| (y X)0 ()1 (y X) .
2
2
Ecient estimate of (under Gaussianity) is a sort of GLS
1
bML = X 0 (b
)1 X
X 0 (b
)1 y,
where b
is the MLE of . This will be asymptotically ecient when the
chosen parametric model is correct.
10.5 Testing for Autocorrelation

The LM test of the null of no (first order) autocorrelation is based on
$$ LM = T\left( \frac{\sum_t \hat u_t\hat u_{t-1}}{\sum_t \hat u_{t-1}^2} \right)^2 = Tr_1^2 \xrightarrow{D} \chi^2_1, $$
where $\hat u_t$ are the OLS residuals. Therefore, we reject the null hypothesis when LM is large relative to the critical value from $\chi^2_1$. Equivalently,
$$ T^{1/2}r_1 \xrightarrow{D} N(0, 1) $$
under the null hypothesis.
The Durbin-Watson statistic is
$$ d = \frac{\sum_{t=2}^T (\hat u_t - \hat u_{t-1})^2}{\sum_{t=1}^T \hat u_t^2}. $$
Since
$$ d \simeq 2(1 - r_1), $$
we have [under the null hypothesis]
$$ T^{1/2}\left( 1 - \frac{d}{2} \right) \xrightarrow{D} N(0, 1). $$
For higher order autocorrelation, the portmanteau statistic satisfies
$$ T\sum_{j=1}^P r_j^2 \xrightarrow{D} \chi^2_P. $$

10.6 Dynamic Regression Models
We have looked at pure time series models with dynamic response and at static regression models. In practice, we may want to consider models that have both features.
Distributed lag:
$$ y_t = \alpha + \sum_{j=0}^q \beta_jX_{t-j} + u_t, \qquad u_t \sim (0, \sigma^2). $$
Interpretation: if $X$ increases permanently by one unit at time $t$, then
$$ y_t \to y_t + \beta_0, \qquad y_{t+1} \to y_{t+1} + (\beta_0 + \beta_1), \text{ etc.} $$

- The impact effect is $\beta_0$.
- The long run effect is $\sum_{s=0}^\infty \beta_s$.

When $q$ is large (infinite) there are too many free parameters $\beta_j$, which makes estimation difficult and imprecise. To reduce the dimensionality, it is appropriate to make restrictions on $\beta_j$.

- The polynomial lag:
$$ \beta_j = \begin{cases} a_0 + a_1j + \ldots + a_pj^p & \text{if } j \leq q \\ 0 & \text{else.} \end{cases} $$
- The geometric (Koyck) lag: $\beta_j = \beta\lambda^j$, $j = 0, 1, \ldots$, so that
$$ y_t = \alpha + \sum_{j=0}^\infty \beta_jx_{t-j} + u_t = \alpha + \beta\left[ \sum_{j=0}^\infty (\lambda L)^j \right]x_t + u_t = \alpha + \frac{\beta}{1 - \lambda L}x_t + u_t. $$

Therefore,
$$ (1 - \lambda L)y_t = (1 - \lambda)\alpha + \beta x_t + (1 - \lambda L)u_t, $$
which is the same as
$$ y_t = (1 - \lambda)\alpha + \lambda y_{t-1} + \beta x_t + u_t - \lambda u_{t-1}. $$
The last equation is called the lagged dependent variable representation.
More generally, an autoregressive distributed lag (ADL) model takes the form
$$ A(L)y_t = B(L)x_t + u_t, \qquad u_t \text{ i.i.d. } (0, \sigma^2). $$
10.7 Adaptive expectations

Suppose that demand satisfies
$$ y_t = \alpha + \beta\underbrace{x^*_{t+1}}_{\text{expected price}} + \varepsilon_t, $$
but the expected price is formed at time $t$ and is unobserved by the econometrician. We observe $x_t$, and expectations are revised adaptively:
$$ \underbrace{x^*_{t+1} - x^*_t}_{\text{revised expectations}} = (1 - \lambda)\underbrace{(x_t - x^*_t)}_{\text{forecast error}}, $$
i.e.,
$$ x^*_{t+1} = \lambda\underbrace{x^*_t}_{\text{old forecast}} + (1 - \lambda)\underbrace{x_t}_{\text{news}}. $$
Write
$$ (1 - \lambda L)x^*_{t+1} = (1 - \lambda)x_t, $$
so that
$$ x^*_{t+1} = \frac{1 - \lambda}{1 - \lambda L}x_t = (1 - \lambda)\left[ x_t + \lambda x_{t-1} + \lambda^2x_{t-2} + \ldots \right]. $$
Therefore,
$$ y_t = \alpha + \beta\,\frac{1 - \lambda}{1 - \lambda L}x_t + \varepsilon_t, $$
which implies that
$$ y_t = \lambda y_{t-1} + (1 - \lambda)\alpha + \beta(1 - \lambda)x_t + \varepsilon_t - \lambda\varepsilon_{t-1}. $$
10.8 Partial adjustment

Suppose that the target level is
$$ y^*_t = \alpha + \beta x_t, $$
and the actual change is a fraction of the desired change:
$$ y_t - y_{t-1} = (1 - \lambda)(y^*_t - y_{t-1}) + \varepsilon_t. $$
Substituting, we get
$$ y_t = (1 - \lambda)y^*_t + \lambda y_{t-1} + \varepsilon_t = (1 - \lambda)\alpha + \lambda y_{t-1} + (1 - \lambda)\beta x_t + \varepsilon_t. $$
This is an ADL with an i.i.d. error term - assuming that the original error term was i.i.d.

10.9 Error Correction
10.10 Estimation of ADL Models

Suppose that
$$ y_t = \pi_1 + \pi_2y_{t-1} + \pi_3x_t + \varepsilon_t, $$
where, say from the partial adjustment model,
$$ \pi_1 = \alpha(1 - \lambda), \qquad \pi_2 = \lambda, \qquad \pi_3 = \beta(1 - \lambda). $$
In this case, we would estimate the original parameters by indirect least squares:
$$ \hat\lambda = \hat\pi_2, \qquad \hat\alpha = \frac{\hat\pi_1}{1 - \hat\pi_2}, \qquad \hat\beta = \frac{\hat\pi_3}{1 - \hat\pi_2}. $$
10.11 Nonstationary Time Series Models

There are many different ways in which a time series $y_t$ can be nonstationary. For example, there may be fixed seasonal effects such that
$$ y_t = \sum_{j=1}^m D_{jt}\gamma_j + u_t, $$
where $D_{jt}$ are seasonal dummy variables, i.e., one if we are in season $j$ and zero otherwise. If $u_t$ is an i.i.d. mean zero error term,
$$ Ey_t = \sum_{j=1}^m D_{jt}\gamma_j, $$
and so varies with time. In this case there is a sort of periodic movement in the time series but no trend.
We next discuss two alternative models of trending data: trend stationary and difference stationary.
Trend stationary. Consider the following process
$$ y_t = \alpha + \beta t + u_t, $$
where $\{u_t\}$ is a stationary mean zero process, e.g.,
$$ A(L)u_t = B(L)\varepsilon_t, $$
with the polynomials $A, B$ satisfying the usual conditions required for stationarity and invertibility. This is the trend+stationary decomposition.
We have
$$ Ey_t = \alpha + \beta t; \qquad \operatorname{var}(y_t) = \sigma^2_u $$
for all $t$. The lack of stationarity comes only through the mean.
Difference stationary. Consider the random walk with drift
$$ y_t = \mu + y_{t-1} + u_t, $$
where
$$ y_0 = \begin{cases} \text{fixed, or} \\ \text{a random variable, e.g., } N(0, v) \end{cases} $$
for some variance $v$.
Any shocks have permanent effects, since
$$ y_t = y_0 + \mu t + \sum_{s=1}^t u_s, $$
while
$$ \Delta y_t = y_t - y_{t-1} = \mu + u_t. $$
Both the mean and the variance of this process are generally explosive:
$$ Ey_t = y_0 + \mu t; \qquad \operatorname{var}(y_t) = \sigma^2t. $$
If $\mu = 0$, the mean does not increase over time but the variance does.
Note that differencing in the trend stationary case gives
$$ \Delta y_t = \beta + u_t - u_{t-1}, $$
which is a unit root MA. So although differencing apparently eliminates nonstationarity, it induces non-invertibility. Likewise, detrending the difference stationary case is not perfect.
10.12 Estimation

Suppose that
$$ y_t = \rho y_{t-1} + u_t, \qquad u_t \sim (0, \sigma^2). $$
Then,
$$ \hat\rho_{OLS} = \frac{\sum_{t=2}^T y_ty_{t-1}}{\sum_{t=2}^T y_{t-1}^2} \xrightarrow{P} \rho. $$

- If $|\rho| < 1$, standard asymptotics apply:
$$ T^{1/2}(\hat\rho - \rho) \xrightarrow{D} N(0, 1 - \rho^2). $$
- If $\rho = 1$, the rate of convergence is faster but the limit is nonstandard:
$$ T(\hat\rho - \rho) \xrightarrow{D} X, $$
where $X$ is a non-normal distribution (a functional of Brownian motion).
10.13 Testing for Unit Roots

Suppose that
$$ y_t = \mu + \rho y_{t-1} + u_t, $$
and consider testing $H_0: \rho = 1$. The Dickey-Fuller (DF) test compares the t statistic for this hypothesis with the appropriate nonstandard critical values.
The DF test is only valid if the error term $u_t$ is i.i.d. One has to adjust for serial correlation in the error terms to get a valid test. The Augmented Dickey-Fuller (ADF) test allows the error term to be correlated over time up to a certain order. The test is based on estimating the regression
$$ \Delta y_t = \mu + \pi y_{t-1} + \sum_{j=1}^{p-1}\psi_j\Delta y_{t-j} + \varepsilon_t. $$
One can also add trend terms to the regression. The Phillips-Perron (PP) test is an alternative way of correcting for serial correlation in $u_t$.
10.14 Cointegration

Two or more unit root series are cointegrated if some linear combination of them is stationary. The Johansen test is a test for the presence of cointegration and for the number of cointegrating relations. If we have a $k$-vector unit root series $y_t$, there can be no cointegrating relations, one, ..., or $k-1$ cointegrating relations. Johansen tests these restrictions sequentially to find the right number of cointegrating relations in the data.

10.15 Martingales

10.16 GARCH Models
Consider the ARCH(1) model for returns $r_t = \sigma_t\varepsilon_t$, where $\varepsilon_t$ is i.i.d. standard normal and
$$ \sigma_t^2 = \omega + \alpha r_{t-1}^2. $$
The unconditional variance satisfies
$$ \sigma^2 = E(r_{t-1}^2) = E\left[ E(\varepsilon_{t-1}^2|I_{t-1})\sigma_{t-1}^2 \right] = E(\sigma_{t-1}^2) = \omega + \alpha\sigma^2, $$
so that $\sigma^2 = \omega/(1 - \alpha)$ when $\alpha < 1$.
For the fourth moment,
$$ \mu_4 = E(r_t^4) = E(\sigma_t^4\varepsilon_t^4) = 3E(\sigma_t^4), $$
where
$$ E(\sigma_t^4) = E\left( \omega^2 + \alpha^2r_{t-1}^4 + 2\omega\alpha r_{t-1}^2 \right) = \omega^2 + \alpha^2\mu_4 + 2\omega\alpha\sigma^2. $$
Therefore,
$$ \mu_4 = 3\left( \omega^2 + \alpha^2\mu_4 + 2\omega\alpha\sigma^2 \right) \implies \mu_4 = \frac{3(\omega^2 + 2\omega\alpha\sigma^2)}{1 - 3\alpha^2}, $$
and one can show that
$$ \mu_4 - 3\sigma^4 = \frac{6\omega^2\alpha^2}{(1 - \alpha)^2(1 - 3\alpha^2)} \geq 0, $$
so the process has fat tails relative to the normal.
Persistence: note that
$$ \sigma_t^2 - \sigma^2 = \alpha(r_{t-1}^2 - \sigma^2). $$
Suppose that $\sigma_{t-1}^2 = \sigma^2$. When we get a large shock, i.e., $\varepsilon_{t-1}^2 > 1$, we get $\sigma_t^2 > \sigma^2$, but the process decays rapidly to $\sigma^2$ unless we get a sequence of large shocks $\varepsilon_{t-1+s}^2 > 1$, $s = 0, 1, 2, \ldots$. In fact, for a normal distribution the probability of having $\varepsilon^2 > 1$ is only about 0.32, so we generally see little persistence.
The ARCH($p$) model is
$$ \sigma_t^2 = \omega + \sum_{j=1}^p \alpha_jr_{t-j}^2. $$
The GARCH(1,1) model,
$$ \sigma_t^2 = \omega + \beta\sigma_{t-1}^2 + \alpha r_{t-1}^2, $$
can be written as an ARCH($\infty$) with geometrically declining weights,
$$ \sigma_t^2 = \frac{\omega}{1 - \beta} + \alpha\sum_{j=1}^\infty \beta^{j-1}r_{t-j}^2, $$
and has unconditional variance
$$ \sigma^2 = \frac{\omega}{1 - (\alpha + \beta)}. $$
The general model is
$$ B(L)\sigma_t^2 = \omega + C(L)r_{t-1}^2, $$
where $B$ and $C$ are lag polynomials. Usually one assumes that the parameters in $\omega, B, C > 0$, to ensure that the variance is positive.
Other models. For example, one can write the model for the log of the variance, i.e.,
$$ \log\sigma_t^2 = \omega + \beta\log\sigma_{t-1}^2 + \alpha r_{t-1}^2. $$
This automatically imposes the restriction that $\sigma_t^2 \geq 0$, so there is no need to impose restrictions on the parameters.
Nelson's EGARCH model is of this type.
10.17 Estimation

A GARCH regression model takes the form
$$ y_t = b'x_t + \sigma_t\varepsilon_t, $$
$$ B(L)\sigma_t^2 = \omega + C(L)\left( y_{t-1} - b'x_{t-1} \right)^2. $$
Estimation is typically by maximum likelihood, e.g., under normality of $\varepsilon_t$, using the prediction error decomposition of the likelihood.
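As an illustration of the likelihood approach, here is a sketch of Gaussian (quasi-)maximum likelihood for a pure GARCH(1,1) on simulated returns, assuming SciPy is available for the optimization; the initialization of the variance recursion at the sample variance is a common but arbitrary convention:

```python
import numpy as np
from scipy.optimize import minimize

def garch11_negloglik(params, r):
    """Negative Gaussian log likelihood of GARCH(1,1):
    sigma2_t = w + a r_{t-1}^2 + b sigma2_{t-1}."""
    w, a, b = params
    if w <= 0 or a < 0 or b < 0 or a + b >= 1:
        return np.inf                      # crude positivity/stationarity constraints
    T = len(r)
    sigma2 = np.empty(T)
    sigma2[0] = np.var(r)                  # initialize at the sample variance
    for t in range(1, T):
        sigma2[t] = w + a * r[t - 1] ** 2 + b * sigma2[t - 1]
    return 0.5 * np.sum(np.log(2 * np.pi * sigma2) + r ** 2 / sigma2)

# Simulate with true (w, a, b) = (0.1, 0.1, 0.8), then estimate
rng = np.random.default_rng(6)
T, w, a, b = 4000, 0.1, 0.1, 0.8
r = np.empty(T)
s2 = w / (1 - a - b)                       # start at the unconditional variance
for t in range(T):
    r[t] = np.sqrt(s2) * rng.normal()
    s2 = w + a * r[t] ** 2 + b * s2
res = minimize(garch11_negloglik, x0=[0.05, 0.05, 0.5], args=(r,),
               method="Nelder-Mead")
print(res.x)                               # roughly (0.1, 0.1, 0.8)
```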