Econometrics
Version 0.92, Jan 2008
Dept. of Economics and Economic History, Universitat Autònoma de Barcelona, michael.creel@uab.es,
http://pareto.uab.es/mcreel
Chapter 1
About this document

This document integrates lecture notes for a one year graduate level course with computer programs that illustrate and apply the methods that are studied. The immediate availability of executable (and modifiable) example programs when using the PDF version of the document is one of the advantages of the system that has been used. On the other hand, when viewed in printed form, the document is a somewhat terse approximation to a textbook. These notes are not intended to be a perfect substitute for a printed textbook. If you are a student of mine, please note that last sentence carefully. There are many good textbooks available. A few of my favorites are listed in the bibliography.

With respect to contents, the emphasis is on estimation and inference within the world of stationary data, with a bias toward microeconometrics. The second half is somewhat more polished than the first half, since I have taught that course more often. If you take a moment to read the licensing information in the next section, you'll see that you are free to copy and modify the document. If anyone would like to contribute material that expands the contents, it would be very welcome. Error corrections and other additions are also welcome.
1.1 Licenses

All materials are copyrighted by Michael Creel with the date that appears above. They are provided under the terms of the GNU General Public License, ver. 2, which forms Section 24.1 of the notes, or, at your option, under the Creative Commons Attribution-Share Alike 2.5 license, which forms Section 24.2 of the notes. The main thing you need to know is that you are free to modify and distribute these materials in any way you like, as long as you share your contributions in the same way the materials are made available to you. In particular, you must make available the source files, in editable form, for your modified version of the materials.
The main document was prepared using LyX (www.lyx.org) and GNU Octave. LyX runs on Linux, Windows, and MacOS systems. Figure 1.1 shows LyX editing this document.
GNU Octave has been used for the example programs, which are scattered through the document. Octave was chosen for several reasons. First, the Octave environment is well suited to applied econometrics: the fundamental tools exist and are implemented in a way that makes extending them fairly easy. The example programs included here may convince you of this point. Secondly, Octave's licensing philosophy fits in with the goals of this project. Thirdly, it runs on Linux, Windows and MacOS. Figure 1.2 shows an Octave program being edited by NEdit, and the result of running the program in a shell window.

[1] Free is used in the sense of freedom, but LyX is also free of charge.
These notes, as a PDF, together with all of the examples and the software needed to run them, are available on PelicanHPC. The reason why these notes are integrated into a Linux distribution for parallel computing will be apparent if you get to Chapter 20. If you don't get that far or you're not interested in parallel computing, please just ignore the stuff on the CD that's not related to econometrics. If you happen to be interested in parallel computing but not econometrics, just skip ahead to Chapter 20.
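Chapter 2
Introduction: Economic and econometric models

Economic theory tells us that an individual's demand function for a good is something like:
\[ x = x(p, m, z) \]
• $x$ is the quantity demanded
• $p$ is a $G \times 1$ vector of prices
• $m$ is income
• $z$ is a vector of other variables that affect demand

Suppose we have a sample consisting of one observation on $n$ individuals' demands at time period $t$ (this is a cross section). The individual demand functions are
\[ x_i = x_i(p_i, m_i, z_i) \]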
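The model is not estimable as it stands, since:
• The form of the demand function may be different for each $i$.
• Some components of $z_i$ may not be observable. For example, people don't eat the same lunch every day, and you can't tell what they will order just by looking at them. Suppose we can break $z_i$ into the observable components $w_i$ and an unobservable component $\epsilon_i$.

A step toward an estimable econometric model is to suppose that the model may be written as
\[ x_i = \beta_1 + p_i'\beta_p + m_i\beta_m + w_i'\beta_w + \epsilon_i \]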
We have imposed a number of restrictions on the theoretical model:
• The functions $x_i(\cdot)$ have been restricted to a common parametric form for all $i$.
• Of all parametric families of functions, we have restricted the model to the class of linear in the variables functions.
But in order for the coefficients to exist in a sense that has economic meaning, and in order to be able to use sample data to make reliable inferences about their values, we need to make additional assumptions. These additional assumptions have no theoretical basis; they are assumptions on top of those needed to prove the existence of a demand function. The validity of any results we obtain using this model will be contingent on these additional restrictions being at least approximately correct. For this reason, specification testing will be needed, to check that the model seems to be reasonable. Only when we are convinced that the model is at least approximately correct should we use it for economic analysis.
When testing a hypothesis using an econometric model, at least three factors can cause a statistical test to reject the null hypothesis:
1. the hypothesis is false
2. a type I error has occurred
3. the econometric model is not correctly specified, so the test does not have the assumed distribution

To be able to make scientific progress, we would like to ensure that the third reason is not contributing in a major way to rejections, so that rejection will be most likely due to either the first or second reasons.

There are many possible sources of misspecification of econometric models. In the next few sections we will obtain results supposing that the econometric model is entirely correctly specified. Later we will examine the consequences of misspecification and see some methods for determining if a model is correctly specified. Later on, econometric methods that seek to minimize maintained assumptions are introduced.
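Chapter 3
Ordinary Least Squares

3.1 The Linear Model

Consider approximating a variable $y$ using the variables $x_1, x_2, ..., x_k$. We can consider a model that is a linear approximation:

Linearity: the model is a linear function of the parameter vector $\beta^0$:
\[ y = \beta_1^0 x_1 + \beta_2^0 x_2 + ... + \beta_k^0 x_k + \epsilon \]
or, using vector notation:
\[ y = x'\beta^0 + \epsilon \]
The dependent variable $y$ is a scalar random variable, $x = (x_1 \; x_2 \; \cdots \; x_k)'$ is a $k$-vector of explanatory variables, and $\beta^0 = (\beta_1^0 \; \beta_2^0 \; \cdots \; \beta_k^0)'$. The superscript 0 in $\beta^0$ means this is the true value of the unknown parameter. It will be defined more precisely later, and usually suppressed when it's not necessary for clarity.

Suppose that we want to use data to try to determine the best linear approximation to $y$ using the variables $x$. The data $\{(y_t, x_t)\}, t = 1, 2, ..., n$ are obtained by some form of sampling [1]. An individual observation is
\[ y_t = x_t'\beta + \epsilon_t \]
The $n$ observations can be written in matrix form as
\[ y = X\beta + \epsilon, \tag{3.1} \]
where $y = (y_1 \; y_2 \; \cdots \; y_n)'$ is $n \times 1$ and $X = (x_1 \; x_2 \; \cdots \; x_n)'$.

Linear models are more general than they might first appear, since one can employ nonlinear transformations of the variables:
\[ \varphi_0(z) = [\varphi_1(w) \; \varphi_2(w) \; \cdots \; \varphi_k(w)]\beta + \epsilon \]
where the $\varphi_i(\cdot)$ are known functions. Defining $y = \varphi_0(z)$, $x_1 = \varphi_1(w)$, etc. leads to a model in the linear form above.

[1] For example, cross-sectional data may be obtained by random sampling. Time series data accumulate historically.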
[Figure 3.1: plot of the data against X]
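For example, the model $z = A w_2^{\beta_2} w_3^{\beta_3} e^{\epsilon}$ can be transformed logarithmically to obtain
\[ \ln z = \ln A + \beta_2 \ln w_2 + \beta_3 \ln w_3 + \epsilon. \]
If we define $y = \ln z$, $\beta_1 = \ln A$, etc., the model has the linear form above. The approximation is linear in the parameters, but not necessarily linear in the variables.

Consider data that follow the linear model
\[ y_t = \beta_1 + \beta_2 x_{t2} + \epsilon_t. \]
The data points are $(x_{t2}, y_t)$, and the green line in the figure is $\beta_1 + \beta_2 x_{t2}$. Exactly how the green line is defined will become clear later. In practice, we only have the data, and we don't know where the green line lies. We need to gain information about the straight line that best fits the data points.

The ordinary least squares (OLS) estimator is the value $\hat\beta$ that minimizes the sum of squared errors:
\[ s(\beta) = \sum_{t=1}^n (y_t - x_t'\beta)^2 \]
\[ = (y - X\beta)'(y - X\beta) \]
\[ = y'y - 2y'X\beta + \beta'X'X\beta \]
\[ = \| y - X\beta \|^2 \]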
This last expression makes it clear how the OLS estimator is defined: it minimizes the Euclidean distance between $y$ and $X\beta$. The fitted OLS coefficients are those that give the best linear approximation to $y$ using $x$ as basis functions, where best means minimum Euclidean distance. One could think of other estimators based upon other metrics. For example, the minimum absolute distance (MAD) estimator minimizes $\sum_{t=1}^n |y_t - x_t'\beta|$. Later, we will see that which estimator is best in terms of statistical properties, rather than in terms of the metrics that define them, depends upon the properties of $\epsilon$, about which we have as yet made no assumptions.

To minimize the criterion $s(\beta)$, find the derivative with respect to $\beta$:
\[ D_\beta s(\beta) = -2X'y + 2X'X\beta \]
Then setting it to zeros gives
\[ D_\beta s(\hat\beta) = -2X'y + 2X'X\hat\beta \equiv 0 \]
so
\[ \hat\beta = (X'X)^{-1}X'y. \]
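To verify that this is a minimum, check the second order sufficient condition:
\[ D_\beta^2 s(\hat\beta) = 2X'X \]
Since $\rho(X) = K$, this matrix is positive definite, since it's a quadratic form in a p.d. matrix (the identity matrix of order $n$), so $\hat\beta$ is in fact a minimizer.

The fitted values are the vector $\hat y = X\hat\beta$. The residuals are the vector $\hat\epsilon = y - X\hat\beta$. Note that
\[ y = X\beta + \epsilon \]
\[ = X\hat\beta + \hat\epsilon \]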
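The formula $\hat\beta = (X'X)^{-1}X'y$ can be checked by hand in the simplest case. The sketch below is in plain Python rather than the Octave used by the notes, and covers only a constant plus one regressor; the function name `ols_simple` and the data are made up for illustration.

```python
def ols_simple(x, y):
    """OLS of y on a constant and x, solving the 2x2 normal equations X'X b = X'y."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    det = n * sxx - sx * sx            # determinant of X'X; nonzero when x varies
    b1 = (sxx * sy - sx * sxy) / det   # intercept
    b2 = (n * sxy - sx * sy) / det     # slope
    return b1, b2

# Hypothetical data, constructed as y = 1 + 2x plus small errors
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]
b1, b2 = ols_simple(x, y)
resid = [yi - b1 - b2 * xi for xi, yi in zip(x, y)]

# First order conditions X'e = 0: residuals sum to zero and are orthogonal to x
foc_const = sum(resid)
foc_x = sum(xi * ei for xi, ei in zip(x, resid))
print(b1, b2, foc_const, foc_x)
```

Both first order conditions evaluate to zero up to rounding error, which is the orthogonality of the residuals to the columns of $X$.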
[Figure 3.2: plot in X, Y space]
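The first order conditions can be written as
\[ X'y - X'X\hat\beta = 0 \]
\[ X'(y - X\hat\beta) = 0 \]
\[ X'\hat\epsilon = 0 \]
which is to say, the OLS residuals are orthogonal to $X$. Let's look at this more carefully.

We can decompose $y$ into two components: the projection onto the $K$-dimensional space spanned by $X$, and the projection onto the $n - K$ dimensional space that is orthogonal to the span of $X$.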
[Figure 3.3: observation space, with axes Observation 1 and Observation 2, showing x, S(x), x*beta = P_xY and e = M_xY]
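Since $\hat\beta$ is chosen to make $\hat\epsilon$ as short as possible, $\hat\epsilon$ will be orthogonal to the space spanned by $X$. Since $X$ is in this space, $X'\hat\epsilon = 0$.

$X\hat\beta$ is the projection of $y$ onto the span of $X$, or
\[ X\hat\beta = X(X'X)^{-1}X'y \]
Therefore, the matrix that projects $y$ onto the span of $X$ is
\[ P_X = X(X'X)^{-1}X' \]
since $X\hat\beta = P_X y$.

$\hat\epsilon$ is the projection of $y$ onto the $n - K$ dimensional space that is orthogonal to the span of $X$. We have that
\[ \hat\epsilon = y - X\hat\beta \]
\[ = y - X(X'X)^{-1}X'y \]
\[ = [I_n - X(X'X)^{-1}X']y. \]
So the matrix that projects $y$ onto the space orthogonal to the span of $X$ is
\[ M_X = I_n - X(X'X)^{-1}X' \]
\[ = I_n - P_X. \]
We have $\hat\epsilon = M_X y$. Therefore
\[ y = P_X y + M_X y \]
\[ = X\hat\beta + \hat\epsilon. \]
These two projection matrices decompose the $n$ dimensional vector $y$ into two orthogonal components - the portion that lies in the $K$ dimensional space defined by $X$, and the portion that lies in the orthogonal $n - K$ dimensional space.

Note that both $P_X$ and $M_X$ are symmetric and idempotent.
• A symmetric matrix $A$ is one such that $A = A'$.
• An idempotent matrix $A$ is one such that $A = AA$.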
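The properties of $P_X$ and $M_X$ are easy to confirm numerically. The following is a small sketch in plain Python (not one of the Octave programs that accompany the notes); the helper functions and the 4 x 2 regressor matrix are hypothetical choices made for illustration.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def inv2(A):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

# Hypothetical data: a constant and one regressor, n = 4, K = 2
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 4.0]]
y = [[1.0], [3.0], [2.0], [6.0]]
n = len(X)

Xt = transpose(X)
P = matmul(matmul(X, inv2(matmul(Xt, X))), Xt)   # P_X = X (X'X)^{-1} X'
M = [[(1.0 if i == j else 0.0) - P[i][j] for j in range(n)] for i in range(n)]  # M_X = I_n - P_X

resid = matmul(M, y)                             # residuals M_X y
idem_err = max_abs_diff(matmul(P, P), P)         # zero if P_X is idempotent
symm_err = max_abs_diff(P, transpose(P))         # zero if P_X is symmetric
orth_err = max_abs_diff(matmul(Xt, resid), [[0.0], [0.0]])  # zero if X'e = 0
trace_P = sum(P[t][t] for t in range(n))         # equals K
print(idem_err, symm_err, orth_err, trace_P)
```

The trace of $P_X$ comes out equal to $K = 2$, a fact used when leverage is discussed.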
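The OLS estimator of the $i$th element of the vector $\beta^0$ is simply
\[ \hat\beta_i = [(X'X)^{-1}X']_{i\cdot}\, y \]
\[ = c_i'y \]
This is how we define a linear estimator - it's a linear function of the dependent variable. Since it's a linear combination of the observations on the dependent variable, where the weights are determined by the observations on the regressors, some observations may have more influence than others.

To investigate this, let $e_t$ be an $n$-vector of zeros with a 1 in the $t$th position, i.e., it's the $t$th column of the matrix $I_n$. Define
\[ h_t = (P_X)_{tt} \]
\[ = e_t'P_Xe_t \]
so $h_t$ is the $t$th element on the main diagonal of $P_X$. Note that
\[ h_t = \| P_X e_t \|^2 \]
so
\[ h_t \le \| e_t \|^2 = 1 \]
So $0 < h_t < 1$. Also,
\[ Tr P_X = K \Rightarrow \bar h = K/n. \]
So the average of the $h_t$ is $K/n$. The value $h_t$ is referred to as the leverage of the observation. If the leverage is much higher than average, the observation has the potential to affect the OLS fit importantly. However, an observation may also be influential due to the value of $y_t$, rather than the weight it is multiplied by, which only depends on the $x_t$'s.

To account for this, consider estimation of $\beta$ without using the $t$th observation (designate this estimator as $\hat\beta^{(t)}$). One can show that
\[ \hat\beta^{(t)} = \hat\beta - \left( \frac{1}{1 - h_t} \right) (X'X)^{-1}X_t'\hat\epsilon_t \]
so the change in the $t$th observation's fitted value is
\[ x_t'\hat\beta - x_t'\hat\beta^{(t)} = \left( \frac{h_t}{1 - h_t} \right) \hat\epsilon_t \]
While an observation may be influential if it doesn't affect its own fitted value, it certainly is influential if it does. The quantity $\left( \frac{h_t}{1 - h_t} \right) \hat\epsilon_t$ therefore gives a fast means of identifying influential observations.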
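As an illustration of leverage and of the quantity $\frac{h_t}{1-h_t}\hat\epsilon_t$, here is a short plain-Python sketch for a regression on a constant and one variable. It is not the InfluentialObservation.m program; the function name and the data (with an outlying last value of x) are invented for the example.

```python
def leverage_and_influence(x, y):
    """Return h_t and h_t/(1-h_t)*e_t for OLS of y on a constant and x."""
    n = len(x)
    sx, sxx = sum(x), sum(v * v for v in x)
    sy, sxy = sum(y), sum(a * b for a, b in zip(x, y))
    det = n * sxx - sx * sx
    # Elements of (X'X)^{-1} for X = [1, x]
    i11, i12, i22 = sxx / det, -sx / det, n / det
    b1 = i11 * sy + i12 * sxy
    b2 = i12 * sy + i22 * sxy
    h = [i11 + 2 * i12 * v + i22 * v * v for v in x]   # x_t'(X'X)^{-1}x_t
    e = [b - b1 - b2 * a for a, b in zip(x, y)]        # OLS residuals
    infl = [ht / (1 - ht) * et for ht, et in zip(h, e)]
    return h, infl

# Hypothetical data: the last observation has an outlying x value
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.2, 1.9, 3.1, 4.2, 7.0]
h, infl = leverage_and_influence(x, y)
avg_h = sum(h) / len(h)   # should equal K/n = 2/5
print(h, avg_h, infl)
```

The last observation has by far the highest leverage, while the average leverage equals $K/n$.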
Figure 3.4 gives an example plot of data, fit, leverage and influence. The Octave program is InfluentialObservation.m. If you re-run the program you will see that the leverage of the last observation (an outlying value of x) is always high, and the influence is sometimes high.

After influential observations are detected, one needs to determine why they are influential. Possible causes include:
• data entry error, which can easily be corrected once detected. Data entry errors are very common.

There exist robust estimation methods that are less sensitive to outlying observations.
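3.5 Goodness of fit

The fitted model is
\[ y = X\hat\beta + \hat\epsilon \]
Take the inner product:
\[ y'y = \hat\beta'X'X\hat\beta + 2\hat\beta'X'\hat\epsilon + \hat\epsilon'\hat\epsilon \]
But the middle term of the RHS is zero since $X'\hat\epsilon = 0$, so
\[ y'y = \hat\beta'X'X\hat\beta + \hat\epsilon'\hat\epsilon \tag{3.2} \]
The uncentered $R_u^2$ is defined as
\[ R_u^2 = 1 - \frac{\hat\epsilon'\hat\epsilon}{y'y} = \frac{\hat\beta'X'X\hat\beta}{y'y} = \frac{\| P_X y \|^2}{\| y \|^2} = \cos^2(\phi), \]
where $\phi$ is the angle between $y$ and the span of $X$.

The uncentered $R^2$ changes if we add a constant to $y$, since this changes $\phi$ (a constant vector lies on the 45 degree line in observation space). Another, more common definition measures the contribution of the variables, other than the constant term, to explaining the variation in $y$. Thus it measures the ability of the model to explain the variation of $y$ about its unconditional sample mean.

Let $\iota = (1, 1, ..., 1)'$, an $n$-vector. So
\[ M_\iota = I_n - \iota(\iota'\iota)^{-1}\iota' \]
\[ = I_n - \iota\iota'/n \]
$M_\iota y$ just returns the vector of deviations from the mean. In terms of deviations from the mean, equation 3.2 becomes
\[ y'M_\iota y = \hat\beta'X'M_\iota X\hat\beta + \hat\epsilon'M_\iota\hat\epsilon \]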
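The centered $R_c^2$ is defined as
\[ R_c^2 = 1 - \frac{\hat\epsilon'\hat\epsilon}{y'M_\iota y} = 1 - \frac{ESS}{TSS} \]
where $ESS = \hat\epsilon'\hat\epsilon$ and $TSS = y'M_\iota y = \sum_{t=1}^n (y_t - \bar y)^2$.

Supposing that $X$ contains a column of ones (i.e., there is a constant term),
\[ X'\hat\epsilon = 0 \Rightarrow \sum_t \hat\epsilon_t = 0 \]
so $M_\iota\hat\epsilon = \hat\epsilon$. In this case
\[ y'M_\iota y = \hat\beta'X'M_\iota X\hat\beta + \hat\epsilon'\hat\epsilon \]
So
\[ R_c^2 = \frac{RSS}{TSS} \]
where $RSS = \hat\beta'X'M_\iota X\hat\beta$.

Supposing that a column of ones is in the space spanned by $X$ ($P_X\iota = \iota$), then one can show that $0 \le R_c^2 \le 1$.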
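To see the difference between $R_u^2$ and $R_c^2$ in practice, the sketch below computes both for a regression on a constant and one variable, then repeats the calculation after adding a constant to $y$. It is written in plain Python; the function name `r_squared` and the data are hypothetical.

```python
def r_squared(x, y):
    """Uncentered and centered R^2 for OLS of y on a constant and x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b2 = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
          / sum((a - xbar) ** 2 for a in x))
    b1 = ybar - b2 * xbar
    ess = sum((b - b1 - b2 * a) ** 2 for a, b in zip(x, y))   # e'e
    r2_u = 1 - ess / sum(b * b for b in y)                    # 1 - e'e / y'y
    r2_c = 1 - ess / sum((b - ybar) ** 2 for b in y)          # 1 - ESS / TSS
    return r2_u, r2_c

# Hypothetical data
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]
r2_u, r2_c = r_squared(x, y)

# Adding a constant to y changes the uncentered R^2 but not the centered one
r2_u_shift, r2_c_shift = r_squared(x, [b + 100.0 for b in y])
print(r2_u, r2_c, r2_u_shift, r2_c_shift)
```

Only the uncentered measure moves when the constant is added, which is why the centered version is the one commonly reported.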
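Up to this point there is no economic content to the model, and the regression parameters have no economic interpretation. For example, what is the partial derivative of $y$ with respect to $x_j$? The linear approximation is
\[ y = \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k + \epsilon \]
The partial derivative is
\[ \frac{\partial y}{\partial x_j} = \beta_j + \frac{\partial \epsilon}{\partial x_j} \]
Up to now, there's no guarantee that $\frac{\partial \epsilon}{\partial x_j} = 0$. For the $\beta$ to have an economic meaning, we need to make additional assumptions. The assumptions that are appropriate to make depend on the data under consideration. We'll start with the classical linear regression model, which incorporates some assumptions that are clearly not realistic for economic data. This is to be able to explain some concepts with a minimum of notational clutter. Later we'll adapt the results to what we can get with more realistic assumptions.

Linearity: the model is a linear function of the parameter vector $\beta^0$:
\[ y = \beta_1^0 x_1 + \beta_2^0 x_2 + ... + \beta_k^0 x_k + \epsilon \tag{3.3} \]
or, using vector notation:
\[ y = x'\beta^0 + \epsilon \]

Nonstochastic, linearly independent regressors: $X$ is a fixed matrix of constants, it has rank $K$, and
\[ \lim \frac{1}{n} X'X = Q_X \tag{3.4} \]
where $Q_X$ is a finite positive definite matrix.

Errors with zero mean and common variance:
\[ E(\epsilon_t) = 0 \tag{3.5} \]
\[ V(\epsilon_t) = \sigma_0^2 \tag{3.6} \]

Nonautocorrelated errors:
\[ E(\epsilon_t \epsilon_s) = 0, \forall t \ne s \tag{3.7} \]

Optionally, we will sometimes assume that the errors are normally distributed.

Normally distributed errors:
\[ \epsilon \sim N(0, \sigma_0^2 I_n) \tag{3.8} \]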
[Figure 3.6: histogram of the OLS estimates from the Monte Carlo experiment generated by Unbiased.m]
3.7.1 Unbiasedness

We have $\hat\beta = (X'X)^{-1}X'y$. By linearity,
\[ \hat\beta = (X'X)^{-1}X'(X\beta + \epsilon) \]
\[ = \beta + (X'X)^{-1}X'\epsilon \]
By 3.4 and 3.5
\[ E[(X'X)^{-1}X'\epsilon] = (X'X)^{-1}X'E(\epsilon) = 0 \]
so the OLS estimator is unbiased under the assumptions of the classical model.

Figure 3.6 shows the results of a small Monte Carlo experiment with $\sigma^2 = 9$, and the coefficient appears to be estimated without bias. The program that generates the plot is Unbiased.m, if you would like to experiment with this.

With time series data, the OLS estimator will often be biased. Figure 3.7 shows the results of a small Monte Carlo experiment where the OLS estimator was calculated for 1000 samples from the AR(1) model with $y_t = 0 + 0.9y_{t-1} + \epsilon_t$, where $n = 20$ and $\sigma^2 = 1$. In this case, assumption 3.4 does not hold: the regressors are stochastic. We can see that the bias in the estimation of the coefficient on $y_{t-1}$ is about -0.2.

[Figure 3.7: histogram of the OLS estimates from the AR(1) Monte Carlo experiment generated by Biased.m]

The program that generates the plot is Biased.m, if you would like to experiment with this.
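The two Monte Carlo experiments can be imitated in a few lines. The sketch below is plain Python, not the Unbiased.m or Biased.m programs, and several details are assumptions made for illustration: the classical design uses $y = 1 + 2x + \epsilon$ with $x$ drawn once and held fixed, the AR(1) series starts from a standard normal draw, and 2000 replications are used.

```python
import random

def ols_slope(x, y):
    """Slope from OLS of y on a constant and x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
            / sum((a - xbar) ** 2 for a in x))

def mc_classical(reps=2000, n=20, seed=1):
    """Mean OLS slope when x is fixed across samples and errors are N(0, 9)."""
    rng = random.Random(seed)
    x = [rng.uniform(0, 10) for _ in range(n)]      # regressors held fixed
    est = []
    for _ in range(reps):
        y = [1 + 2 * xi + rng.gauss(0, 3) for xi in x]
        est.append(ols_slope(x, y))
    return sum(est) / reps

def mc_ar1(reps=2000, n=20, rho=0.9, seed=1):
    """Mean OLS estimate of rho in y_t = rho*y_{t-1} + e_t, e_t ~ N(0, 1)."""
    rng = random.Random(seed)
    est = []
    for _ in range(reps):
        y = [rng.gauss(0, 1)]                       # assumed starting value
        for _ in range(n):
            y.append(rho * y[-1] + rng.gauss(0, 1))
        est.append(ols_slope(y[:-1], y[1:]))        # regress y_t on a constant and y_{t-1}
    return sum(est) / reps

mean_classical = mc_classical()
mean_ar1 = mc_ar1()
print(mean_classical)   # close to the true slope, 2
print(mean_ar1)         # noticeably below the true value, 0.9
```

With fixed regressors the average estimate sits close to the true slope, while the average estimate of the autoregressive coefficient falls well short of 0.9.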
3.7.2 Normality

With the linearity assumption, we have
\[ \hat\beta = \beta + (X'X)^{-1}X'\epsilon. \]
Adding the assumption of normality (3.8, which implies strong exogeneity), then
\[ \hat\beta \sim N\left( \beta, (X'X)^{-1}\sigma_0^2 \right) \]
since a linear function of a normal random vector is also normally distributed. In Figure 3.6 you can see that the estimator appears to be normally distributed. It in fact is normally distributed, since the DGP (see the Octave program) has normal errors. Even when the data may be taken to be IID, the assumption of normality is often questionable or simply untenable. For example, if the dependent variable is the number of automobile trips per week, it is a count variable with a discrete distribution, and is thus not normally distributed. Many variables in economics can take on only nonnegative values, which, strictly speaking, rules out normality [2].

[2] Normality may be a good model nonetheless, as long as the probability of a negative value occurring is negligible under the model. This depends upon the mean being large enough in relation to the variance.
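3.7.3 The variance of the OLS estimator and the Gauss-Markov theorem

Now let's make all the classical assumptions except the assumption of normality. We have $\hat\beta = \beta + (X'X)^{-1}X'\epsilon$ and we know that $E(\hat\beta) = \beta$. So
\[ Var(\hat\beta) = E\left\{ (\hat\beta - \beta)(\hat\beta - \beta)' \right\} \]
\[ = E\left\{ (X'X)^{-1}X'\epsilon\epsilon'X(X'X)^{-1} \right\} \]
\[ = (X'X)^{-1}\sigma_0^2 \]
The OLS estimator is a linear estimator, which means that it is a linear function of the dependent variable, $y$:
\[ \hat\beta = \left[ (X'X)^{-1}X' \right] y \]
\[ = Cy \]
where $C$ is a function of the explanatory variables only, not the dependent variable. It is also unbiased under the present assumptions, as we proved above. One could consider other weights $W$ that are a function of $X$ that define some other linear estimator, $\tilde\beta = Wy$. If the estimator is unbiased, then we must have $WX = I_K$:
\[ E(Wy) = E(WX\beta_0 + W\epsilon) \]
\[ = WX\beta_0 \]
\[ = \beta_0 \]
\[ \Rightarrow WX = I_K \]
The variance of $\tilde\beta$ is
\[ V(\tilde\beta) = WW'\sigma_0^2. \]
Define
\[ D = W - (X'X)^{-1}X' \]
so
\[ W = D + (X'X)^{-1}X' \]
Since $WX = I_K$, $DX = 0$, so
\[ V(\tilde\beta) = \left( D + (X'X)^{-1}X' \right) \left( D + (X'X)^{-1}X' \right)' \sigma_0^2 \]
\[ = \left( DD' + (X'X)^{-1} \right) \sigma_0^2 \]
So
\[ V(\tilde\beta) \ge V(\hat\beta) \]
The inequality is a shorthand means of expressing, more formally, that $V(\tilde\beta) - V(\hat\beta)$ is a positive semi-definite matrix. This is the Gauss-Markov theorem: under these assumptions, no linear unbiased estimator has a smaller variance than the OLS estimator.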
[Figure 3.8: histogram of the OLS estimator]

It is worth emphasizing again that we have not used the normality assumption in any way to prove the Gauss-Markov theorem, so it is valid if the errors are not normally distributed, as long as the other assumptions hold.

To illustrate the Gauss-Markov result, consider the estimator that results from splitting the sample into equally-sized parts, estimating using each part of the data separately by OLS, then averaging the resulting estimators. This estimator is unbiased, but inefficient with respect to the OLS estimator. The program Efficiency.m illustrates this using a small Monte Carlo experiment, which compares the OLS estimator and a 3-way split sample estimator. The data generating process follows the classical model, with $n = 21$. The true parameter value is $\beta = 2$. In Figures 3.8 and 3.9 we can see that the OLS estimator is more efficient, since the tails of its histogram are more narrow.
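A comparison in the spirit of that experiment can be sketched as follows. This is plain Python, not Efficiency.m, and the design is an assumption made for illustration: the regressor takes the values 1 to 21, the model is $y = 1 + 2x + \epsilon$ with standard normal errors, and the split sample estimator averages the slopes from three contiguous blocks of 7 observations.

```python
import random

def ols_slope(x, y):
    """Slope from OLS of y on a constant and x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
            / sum((a - xbar) ** 2 for a in x))

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

def mc_efficiency(reps=2000, seed=1):
    """Monte Carlo means and variances of the OLS and 3-way split sample slopes."""
    rng = random.Random(seed)
    x = [float(t) for t in range(1, 22)]     # n = 21, fixed regressor (assumed design)
    ols, split = [], []
    for _ in range(reps):
        y = [1 + 2 * xi + rng.gauss(0, 1) for xi in x]
        ols.append(ols_slope(x, y))
        parts = [ols_slope(x[i:i + 7], y[i:i + 7]) for i in (0, 7, 14)]
        split.append(sum(parts) / 3)         # average of the three sub-sample estimates
    return mean(ols), var(ols), mean(split), var(split)

m_ols, v_ols, m_split, v_split = mc_efficiency()
print(m_ols, v_ols)
print(m_split, v_split)
```

Both estimators average close to 2, but the split sample estimator has a much larger variance, as the Gauss-Markov theorem leads one to expect.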
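We have that $E(\hat\beta) = \beta$ and $Var(\hat\beta) = (X'X)^{-1}\sigma_0^2$, but we still need to estimate the variance of $\epsilon$, $\sigma_0^2$, in order to have an idea of the precision of the estimates of $\beta$. A commonly used estimator of $\sigma_0^2$ is
\[ \hat\sigma_0^2 = \frac{1}{n - K}\hat\epsilon'\hat\epsilon \]

[Figure 3.9: histogram of the split sample estimator]

This estimator is unbiased:
\[ \hat\sigma_0^2 = \frac{1}{n - K}\hat\epsilon'\hat\epsilon \]
\[ = \frac{1}{n - K}\epsilon'M\epsilon \]
\[ E(\hat\sigma_0^2) = \frac{1}{n - K}E(Tr\, \epsilon'M\epsilon) \]
\[ = \frac{1}{n - K}E(Tr\, M\epsilon\epsilon') \]
\[ = \frac{1}{n - K}Tr\, E(M\epsilon\epsilon') \]
\[ = \frac{1}{n - K}\sigma_0^2\, Tr\, M \]
\[ = \frac{1}{n - K}\sigma_0^2 (n - K) \]
\[ = \sigma_0^2 \]
where we use the fact that $Tr(AB) = Tr(BA)$ when both products are conformable. Thus, this estimator is also unbiased under these assumptions.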
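3.8.1 Theoretical background

For a firm that takes input prices $w$ and the output level $q$ as given, the cost minimization problem is to choose the quantities of inputs $x$ to solve
\[ \min_x w'x \]
subject to the restriction
\[ f(x) = q. \]
The solution is the vector of factor demands $x(w, q)$. The cost function is obtained by substituting the factor demands into the criterion function:
\[ C(w, q) = w'x(w, q). \]
• Monotonicity. Increasing factor prices cannot decrease cost, so
\[ \frac{\partial C(w, q)}{\partial w} \ge 0 \]
Remember that these derivatives give the conditional factor demands (Shephard's Lemma).
• Homogeneity. The cost function is homogeneous of degree 1 in input prices: $C(tw, q) = tC(w, q)$ where $t$ is a scalar constant. This is because the factor demands are homogeneous of degree zero in factor prices - they only depend upon relative prices.
• Returns to scale. The returns to scale parameter $\gamma$ is defined as the inverse of the elasticity of cost with respect to output:
\[ \gamma = \left( \frac{\partial C(w, q)}{\partial q} \frac{q}{C(w, q)} \right)^{-1} \]
Constant returns to scale is the case $\gamma = 1$.

The Cobb-Douglas cost function with $g$ factors has the form
\[ C = A w_1^{\beta_1} \cdots w_g^{\beta_g} q^{\beta_q} e^{\epsilon} \]
What is the elasticity of $C$ with respect to $w_j$?
\[ e^C_{w_j} = \left( \frac{\partial C}{\partial w_j} \right) \left( \frac{w_j}{C} \right) \]
\[ = \beta_j A w_1^{\beta_1} \cdots w_j^{\beta_j - 1} \cdots w_g^{\beta_g} q^{\beta_q} e^{\epsilon} \frac{w_j}{A w_1^{\beta_1} \cdots w_g^{\beta_g} q^{\beta_q} e^{\epsilon}} \]
\[ = \beta_j \]
This is one of the reasons the Cobb-Douglas form is popular - the coefficients are easy to interpret, since they are the elasticities of the dependent variable with respect to the explanatory variable. Note that in this case,
\[ e^C_{w_j} = \left( \frac{\partial C}{\partial w_j} \right) \left( \frac{w_j}{C} \right) = x_j(w, q) \frac{w_j}{C} \equiv s_j(w, q) \]
the cost share of the $j$th input. So with a Cobb-Douglas cost function, $\beta_j = s_j(w, q)$. The cost shares are constant.

After a logarithmic transformation we obtain
\[ \ln C = \alpha + \beta_1 \ln w_1 + ... + \beta_g \ln w_g + \beta_q \ln q + \epsilon \]
where $\alpha = \ln A$. So we see that the transformed model is linear in the logs of the data.

Homogeneity of degree 1 implies that
\[ \sum_{i=1}^g \beta_i = 1 \]
Constant returns to scale implies that
\[ \gamma = \frac{1}{\beta_q} = 1 \]
so $\beta_q = 1$. Likewise, monotonicity implies that the coefficients $\beta_i \ge 0, i = 1, ..., g$.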
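The elasticity and homogeneity properties of the Cobb-Douglas cost function can be checked numerically. The following plain-Python sketch uses made-up parameter values (three factors, coefficients summing to 1) and sets the error term to zero; the function names are hypothetical.

```python
import math

def cobb_douglas_cost(w, q, A, beta, beta_q):
    """C = A * prod(w_j^beta_j) * q^beta_q, with the error term set to zero."""
    return A * math.prod(wj ** bj for wj, bj in zip(w, beta)) * q ** beta_q

def elasticity(j, w, q, A, beta, beta_q, h=1e-6):
    """Numerical elasticity (dC/dw_j) * (w_j / C), using a central difference."""
    up, dn = list(w), list(w)
    up[j] += h
    dn[j] -= h
    dC = (cobb_douglas_cost(up, q, A, beta, beta_q)
          - cobb_douglas_cost(dn, q, A, beta, beta_q)) / (2 * h)
    return dC * w[j] / cobb_douglas_cost(w, q, A, beta, beta_q)

# Hypothetical parameter values satisfying homogeneity (the beta_j sum to 1)
A, beta, beta_q = 2.0, [0.5, 0.3, 0.2], 0.8
w, q = [1.5, 2.0, 0.7], 10.0

elas = [elasticity(j, w, q, A, beta, beta_q) for j in range(3)]
t = 3.0
ratio = (cobb_douglas_cost([t * wj for wj in w], q, A, beta, beta_q)
         / cobb_douglas_cost(w, q, A, beta, beta_q))
print(elas)    # approximately equal to beta
print(ratio)   # approximately t, since C(tw, q) = t C(w, q)
```

The numerical elasticities reproduce the coefficients $\beta_j$, and scaling all input prices by $t$ scales cost by $t$.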
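The data are for the U.S., and were collected by M. Nerlove. The variables include cost ($C$), output ($Q$), and the prices of labor ($P_L$), fuel ($P_F$) and capital ($P_K$). Note that the data are sorted by output level (the third column).

We will estimate the Cobb-Douglas model
\[ \ln C = \beta_1 + \beta_2 \ln Q + \beta_3 \ln P_L + \beta_4 \ln P_F + \beta_5 \ln P_K + \epsilon \tag{3.9} \]
using OLS. To do this yourself, you need the data file mentioned above, as well as Nerlove.m (the estimation program), and the library of Octave functions mentioned in the introduction to Octave [3].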
*********************************************************
OLS estimation results
Observations 145
R-squared 0.925955
Sigma-squared 0.153943

Results (Ordinary var-cov estimator)

            estimate    st.err.    t-stat.    p-value
constant      -3.527      1.774     -1.987      0.049
output         0.720      0.017     41.244      0.000
labor          0.436      0.291      1.499      0.136
fuel           0.427      0.100      4.249      0.000
capital       -0.220      0.339     -0.648      0.518
*********************************************************

[3] If you are running the bootable CD, you have all of this installed and ready to run.
While we will use Octave programs as examples in this document, since following the programming statements is a useful way of learning how theory is put into practice, you may be interested in a more user-friendly environment for doing econometrics. I heartily recommend Gretl, the Gnu Regression, Econometrics, and Time-Series Library. This is an easy to use program, available in English, French, and Spanish, and it comes with a lot of data ready to use. It even has an option to save output as LaTeX fragments, so that I can just include the results into this document, no muss, no fuss. Here are the results of the Nerlove model from GRETL:
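Variable     Coefficient    Std. Error    t-statistic    p-value
const          -3.5265       1.77437        -1.9875       0.0488
l_output        0.720394     0.0174664      41.2445       0.0000
l_labor         0.436341     0.291048        1.4992       0.1361
l_fuel          0.426517     0.100369        4.2495       0.0000
l_capita       -0.219888     0.339429       -0.6478       0.5182

Mean of dependent variable       1.72466
S.D. of dependent variable       1.42172
Sum of squared residuals        21.5520
Standard error of residuals      0.392356
Unadjusted R^2                   0.925955
Adjusted R^2                     0.923840
F(4, 140)                      437.686
Akaike information criterion   145.084
Schwarz Bayesian criterion     159.967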
Fortunately, Gretl and my OLS program agree upon the results. Gretl is included in the bootable CD mentioned in the introduction.

3.9 Exercises

• For $X \sim N(\mu_x, \Sigma)$, give the distribution of $AX + b$, where $A$ and $b$ are conformable constants.
• Verify that $Tr(AB) = Tr(BA)$ for conformable matrices $A$ and $B$. Note: Octave provides the function trace.
• For the model with a constant and a single regressor, $y_t = \beta_1 + \beta_2 x_t + \epsilon_t$, which satisfies the classical assumptions, prove that the variance of the OLS estimator declines to zero as the sample size increases.
Chapter 4
Maximum likelihood estimation

The maximum likelihood estimator is important since it is asymptotically efficient, as is shown below. For the classical linear model with normal errors, the ML and OLS estimators of $\beta$ are the same, so the following theory is presented without examples. In the second half of the course, nonlinear models with nonnormal errors are introduced, and examples may be found there.
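Suppose we have a sample of size $n$ of the random vectors $y$ and $z$. Suppose the joint density of $Y = (y_1 \; \ldots \; y_n)$ and $Z = (z_1 \; \ldots \; z_n)$ is characterized by a parameter vector $\psi_0$:
\[ f_{YZ}(Y, Z, \psi_0). \]
This is the joint density of the sample. This density can be factored as
\[ f_{YZ}(Y, Z, \psi_0) = f_{Y|Z}(Y|Z, \theta_0) f_Z(Z, \rho_0) \]
The likelihood function is just this density evaluated at other values $\psi$:
\[ L(Y, Z, \psi) = f(Y, Z, \psi), \; \psi \in \Psi, \]
where $\Psi$ is a parameter space. The maximum likelihood estimator of $\psi_0$ is the value of $\psi$ that maximizes the likelihood function.

Note that if $\theta_0$ and $\rho_0$ share no elements, then the maximizer of the conditional likelihood function $f_{Y|Z}(Y|Z, \theta)$ with respect to $\theta$ is the same as the maximizer of the overall likelihood function with respect to the elements of $\psi$ that correspond to $\theta$. In this case, the variables $Z$ are said to be exogenous for estimation of $\theta$, and we may more conveniently work with the conditional likelihood function $f_{Y|Z}(Y|Z, \theta)$ for the purposes of estimating $\theta_0$.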
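If the $n$ observations are independent, the likelihood function can be written as
\[ L(Y|Z, \theta) = \prod_{t=1}^n f(y_t|z_t, \theta) \]
where the $f_t$ are possibly of different form.

If this is not possible, we can always factor the likelihood into contributions of observations, by using the fact that a joint density can be factored into the product of a marginal and conditional (doing this iteratively):
\[ L(Y, \theta) = f(y_1|z_1, \theta) f(y_2|y_1, z_2, \theta) f(y_3|y_1, y_2, z_3, \theta) \cdots f(y_n|y_1, y_2, \ldots, y_{n-1}, z_n, \theta) \]
To simplify notation, define
\[ x_t = \{y_1, y_2, ..., y_{t-1}, z_t\} \]
so $x_1 = z_1$, $x_2 = \{y_1, z_2\}$, etc. Now the likelihood function can be written as
\[ L(Y, \theta) = \prod_{t=1}^n f(y_t|x_t, \theta) \]
The criterion function can be defined as the average log-likelihood function:
\[ s_n(\theta) = \frac{1}{n} \ln L(Y, \theta) = \frac{1}{n} \sum_{t=1}^n \ln f(y_t|x_t, \theta) \]
The maximum likelihood estimator may thus be defined equivalently as
\[ \hat\theta = \arg\max s_n(\theta), \]
where the set maximized over is defined below. Since $\ln(\cdot)$ is a monotonic increasing function, $\ln L$ and $L$ maximize at the same value of $\theta$. Dividing by $n$ has no effect on $\hat\theta$.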
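Consider tossing a coin that may be biased. Let $y = 1(heads)$ be a binary variable that indicates whether or not a heads is observed. The outcome of a toss is a Bernoulli random variable, with density
\[ f_Y(y, p) = p^y (1 - p)^{1-y} \]
and
\[ \ln f_Y(y, p) = y \ln p + (1 - y) \ln (1 - p) \]
The derivative of this is
\[ \frac{\partial \ln f_Y(y, p)}{\partial p} = \frac{y}{p} - \frac{(1 - y)}{(1 - p)} \]
\[ = \frac{y - p}{p(1 - p)} \]
Averaging this over a sample of size $n$ gives
\[ \frac{\partial s_n(p)}{\partial p} = \frac{1}{n} \sum_{i=1}^n \frac{y_i - p}{p(1 - p)} \]
Setting to zero and solving gives
\[ \hat p = \bar y \tag{4.1} \]
So it's easy to calculate the MLE of $p_0$ in this case.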
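The result $\hat p = \bar y$ can be confirmed directly. The sketch below, in plain Python with a made-up sample of ten tosses, evaluates the average score at the sample mean and also locates the maximizer of the average log-likelihood on a grid; the function names are hypothetical.

```python
import math

def avg_loglik(p, y):
    """Average log-likelihood s_n(p) for Bernoulli observations."""
    return sum(yi * math.log(p) + (1 - yi) * math.log(1 - p) for yi in y) / len(y)

def score(p, y):
    """Derivative of s_n(p): the average of (y_i - p) / (p (1 - p))."""
    return sum((yi - p) / (p * (1 - p)) for yi in y) / len(y)

y = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]     # hypothetical tosses, 7 heads out of 10
p_hat = sum(y) / len(y)                # the MLE is the sample mean
score_at_mle = score(p_hat, y)

# Cross-check: maximize the average log-likelihood over a grid of p values
grid = [k / 100 for k in range(1, 100)]
p_grid = max(grid, key=lambda p: avg_loglik(p, y))
print(p_hat, score_at_mle, p_grid)
```

The score vanishes at the sample mean (up to rounding), and the grid search lands on the same value.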
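Now imagine that we had a bag full of bent coins, each bent around a sphere of a different radius (with the head pointing to the outside of the sphere). We might suspect that the probability of a heads could depend upon the radius. Suppose that
\[ p_i \equiv p(x_i, \beta) = (1 + \exp(-x_i'\beta))^{-1} \]
where $x_i = (1 \;\; r_i)'$, so that $\beta$ is a $2 \times 1$ vector. Now
\[ \frac{\partial p_i(\beta)}{\partial \beta} = p_i (1 - p_i) x_i \]
so
\[ \frac{\partial \ln f_Y(y, \beta)}{\partial \beta} = \frac{y - p_i}{p_i (1 - p_i)} p_i (1 - p_i) x_i \]
\[ = (y_i - p(x_i, \beta)) x_i \]
So the derivative of the average log-likelihood function is now
\[ \frac{\partial s_n(\beta)}{\partial \beta} = \frac{\sum_{i=1}^n (y_i - p(x_i, \beta)) x_i}{n} \]
This is a set of 2 nonlinear equations in the two unknown elements in $\beta$. There is no explicit solution for the two elements that set the equations to zero. This is commonly the case with ML estimators: they are often nonlinear, and finding the value of the estimate often requires use of numeric methods to find solutions to the first order conditions. This possibility is explored further in the second half of these notes (see section 14.5).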
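One numeric method that can solve these first order conditions is Newton's method. The sketch below is only an illustration in plain Python, not the approach of section 14.5; the data (radii and outcomes), the starting value of zero and the fixed number of iterations are all assumptions made for the example.

```python
import math

def p_logit(x, beta):
    """p(x, beta) = 1 / (1 + exp(-x'beta)) for a 2-element x."""
    return 1.0 / (1.0 + math.exp(-(beta[0] * x[0] + beta[1] * x[1])))

def score(beta, X, y):
    """Average score: (1/n) * sum of (y_i - p_i) x_i."""
    n = len(y)
    g = [0.0, 0.0]
    for xi, yi in zip(X, y):
        r = yi - p_logit(xi, beta)
        g[0] += r * xi[0] / n
        g[1] += r * xi[1] / n
    return g

def newton_logit(X, y, iters=25):
    """Solve the first order conditions by Newton's method, starting from zero."""
    beta = [0.0, 0.0]
    n = len(y)
    for _ in range(iters):
        g = score(beta, X, y)
        # Hessian of the average log-likelihood: -(1/n) sum p_i (1 - p_i) x_i x_i'
        h11 = h12 = h22 = 0.0
        for xi in X:
            p = p_logit(xi, beta)
            w = p * (1 - p) / n
            h11 -= w * xi[0] * xi[0]
            h12 -= w * xi[0] * xi[1]
            h22 -= w * xi[1] * xi[1]
        det = h11 * h22 - h12 * h12
        # Newton step: beta <- beta - H^{-1} g, with the 2x2 inverse written out
        beta = [beta[0] - (h22 * g[0] - h12 * g[1]) / det,
                beta[1] - (-h12 * g[0] + h11 * g[1]) / det]
    return beta

# Hypothetical data: x_i = (1, r_i), with heads more frequent at larger radii
r = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]
X = [(1.0, ri) for ri in r]
beta_hat = newton_logit(X, y)
g_hat = score(beta_hat, X, y)
print(beta_hat, g_hat)
```

At the returned value both elements of the average score are numerically zero, so the first order conditions are satisfied.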
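4.2 Consistency of MLE

To show consistency of the MLE, we need to make some assumptions explicit.

Compact parameter space: $\theta_0 \in \Theta$, and maximization is over a set which is compact, the parameter space $\Theta$.

Uniform convergence:
\[ s_n(\theta) \overset{u.a.s.}{\to} \lim_{n \to \infty} E_{\theta_0} s_n(\theta) \equiv s_\infty(\theta, \theta_0), \; \forall \theta \in \Theta. \]
We have suppressed $Y$ here for simplicity. This requires that almost sure convergence holds for all possible parameter values. For a given parameter value, an ordinary Law of Large Numbers will usually imply almost sure convergence to the limit of the expectation. Convergence for a single element of the parameter space, combined with the assumption of a compact parameter space, ensures uniform convergence.

Identification: $s_\infty(\theta, \theta_0)$ has a unique maximum in its first argument.

We will use these assumptions to show that $\hat\theta_n \overset{a.s.}{\to} \theta_0$.

First, $\hat\theta_n$ certainly exists, since a continuous function has a maximum on a compact set.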
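Second, for any $\theta \neq \theta_0$,
\[ E\left(\ln \frac{L(\theta)}{L(\theta_0)}\right) \leq \ln E\left(\frac{L(\theta)}{L(\theta_0)}\right) \]
by Jensen's inequality ($\ln(\cdot)$ is a concave function). The expectation on the right-hand side is
\[ E\left(\frac{L(\theta)}{L(\theta_0)}\right) = \int \frac{L(\theta)}{L(\theta_0)} L(\theta_0)\,dy = 1, \]
since $L(\theta_0)$ is the density function of the observations, and since the integral of any density is 1. Therefore, since $\ln(1) = 0$,
\[ E\left(\ln \frac{L(\theta)}{L(\theta_0)}\right) \leq 0, \]
or, dividing by $n$ and taking limits,
\[ s_\infty(\theta,\theta_0) - s_\infty(\theta_0,\theta_0) \leq 0 \]
except on a set of zero probability.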
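By the identification assumption there is a unique maximizer, so the inequality is strict if $\theta \neq \theta_0$:
\[ s_\infty(\theta,\theta_0) - s_\infty(\theta_0,\theta_0) < 0, \quad \forall \theta \neq \theta_0, \text{ a.s.} \]
Suppose that $\theta^*$ is a limit point of $\hat{\theta}_n$. Since $\hat{\theta}_n$ is a maximizer, independent of $n$, we must have
\[ s_\infty(\theta^*,\theta_0) - s_\infty(\theta_0,\theta_0) \geq 0. \]
These last two inequalities imply that
\[ \theta^* = \theta_0, \text{ a.s.} \]
Thus there is only one limit point, and it is equal to the true parameter value, with probability one. In other words,
\[ \lim_{n\to\infty} \hat{\theta} = \theta_0, \text{ a.s.} \]
This completes the proof of strong consistency of the MLE. One can use weaker assumptions to prove weak consistency (convergence in probability to $\theta_0$) of the MLE.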
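Assume that $s_n(\theta)$ is twice continuously differentiable in a neighborhood $N(\theta_0)$ of $\theta_0$, at least when $n$ is large enough. To maximize the log-likelihood function, take derivatives:
\[ g_n(Y,\theta) = D_\theta s_n(\theta) = \frac{1}{n}\sum_{t=1}^{n} D_\theta \ln f(y_t|x_t,\theta) \equiv \frac{1}{n}\sum_{t=1}^{n} g_t(\theta). \]
This is the score vector (with dimension $K \times 1$). The score function has $Y$ as an argument, so it is a random function. $Y$ (and any exogenous variables) will often be suppressed for clarity, but one should not forget that they are still there.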
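The ML estimator $\hat{\theta}$ sets the derivatives to zero:
\[ g_n(\hat{\theta}) = \frac{1}{n}\sum_{t=1}^{n} g_t(\hat{\theta}) \equiv 0. \]
We will show that $E_\theta[g_t(\theta)] = 0, \ \forall t$. This is the expectation taken with respect to the density $f(\theta)$, not necessarily $f(\theta_0)$.
\[ E_\theta[g_t(\theta)] = \int [D_\theta \ln f(y_t|x_t,\theta)]\, f(y_t|x_t,\theta)\,dy_t = \int \frac{1}{f(y_t|x_t,\theta)} [D_\theta f(y_t|x_t,\theta)]\, f(y_t|x_t,\theta)\,dy_t = \int D_\theta f(y_t|x_t,\theta)\,dy_t. \]
Given some regularity conditions on boundedness of $D_\theta f$, we can switch the order of integration and differentiation. This gives
\[ E_\theta[g_t(\theta)] = D_\theta \int f(y_t|x_t,\theta)\,dy_t = D_\theta 1 = 0. \]
So $E_\theta(g_t(\theta)) = 0$: the expectation of the score vector is zero. This holds for all $t$, so it implies that $E_\theta\, g_n(Y,\theta) = 0$.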
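Take a first order Taylor's series expansion of $g(\hat{\theta})$ about the true value $\theta_0$:
\[ 0 \equiv g(\hat{\theta}) = g(\theta_0) + \left(D_{\theta'}\, g(\theta^*)\right)(\hat{\theta} - \theta_0) \]
or with appropriate definitions
\[ H(\theta^*)(\hat{\theta} - \theta_0) = -g(\theta_0), \]
where $\theta^* = \lambda\hat{\theta} + (1-\lambda)\theta_0$, $0 < \lambda < 1$. Assume $H(\theta^*)$ is invertible (we'll justify this in a minute). So
\[ \sqrt{n}(\hat{\theta} - \theta_0) = -H(\theta^*)^{-1}\sqrt{n}\,g(\theta_0). \]
Now consider $H(\theta^*)$. This is
\[ H(\theta^*) = D_{\theta'}\, g(\theta^*) = D^2_\theta s_n(\theta^*) = \frac{1}{n}\sum_{t=1}^{n} D^2_\theta \ln f_t(\theta^*) \]
where the notation is
\[ D^2_\theta s_n(\theta) \equiv \frac{\partial^2 s_n(\theta)}{\partial\theta\,\partial\theta'}. \]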
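Given that this is an average of terms, it should usually be the case that this satisfies a strong law of large numbers (SLLN). Regularity conditions are a set of assumptions that guarantee that this will happen. There are different sets of assumptions that can be used to justify appeal to different SLLN's. For example, the $D^2_\theta \ln f_t(\theta^*)$ must not be too strongly dependent over time, and their variances must not become infinite. We don't assume any particular set here, since the appropriate assumptions will depend upon the particularities of a given model. However, we assume that a SLLN applies.
Also, since we know that $\hat{\theta}$ is consistent, and since $\theta^* = \lambda\hat{\theta} + (1-\lambda)\theta_0$, we have that $\theta^* \overset{a.s.}{\to} \theta_0$. By the differentiability assumption, $H(\theta)$ is continuous in $\theta$. Given this, $H(\theta^*)$ converges to the limit of its expectation:
\[ H(\theta^*) \overset{a.s.}{\to} \lim_{n\to\infty} E\left(D^2_\theta s_n(\theta_0)\right) = H_\infty(\theta_0) < \infty. \]
Re-arranging orders of limits and differentiation, which is legitimate given regularity conditions, we get
\[ H_\infty(\theta_0) = D^2_\theta \lim_{n\to\infty} E\left(s_n(\theta_0)\right) = D^2_\theta s_\infty(\theta_0,\theta_0). \]
We've already seen that
\[ s_\infty(\theta,\theta_0) < s_\infty(\theta_0,\theta_0), \]
i.e., $\theta_0$ maximizes the limiting objective function. Since there is a unique maximizer, and by the assumption that $s_n(\theta)$ is twice continuously differentiable, $H_\infty(\theta_0)$ must be negative definite, and therefore of full rank. Therefore the previous inversion is justified, asymptotically, and we have
\[ \sqrt{n}(\hat{\theta} - \theta_0) \overset{a.s.}{\to} -H_\infty(\theta_0)^{-1}\sqrt{n}\,g(\theta_0). \qquad (4.2) \]
Now consider $\sqrt{n}\,g(\theta_0)$.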
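This is
\[ \sqrt{n}\,g_n(\theta_0) = \sqrt{n}\,D_\theta s_n(\theta_0) = \frac{\sqrt{n}}{n}\sum_{t=1}^{n} D_\theta \ln f_t(y_t|x_t,\theta_0) = \frac{1}{\sqrt{n}}\sum_{t=1}^{n} g_t(\theta_0). \]
We've already seen that $E_\theta[g_t(\theta)] = 0$. As such, it is reasonable to assume that a SLLN applies to the average, so that $g_n(\theta_0) \overset{a.s.}{\to} 0$. Scaling by $\sqrt{n}$ avoids this collapse to a degenerate random variable as $n$ grows.
A generic CLT states that, for $X_n$ a random vector that satisfies certain conditions, $X_n - E(X_n)$ converges in distribution to a normal random vector with covariance $\lim V(X_n)$. Usually, $X_n$ will be of the form of an average, scaled by $\sqrt{n}$:
\[ X_n = \sqrt{n}\,\frac{\sum_{t=1}^{n} X_t}{n}. \]
This is the case for $\sqrt{n}\,g(\theta_0)$, for example. The properties of $X_n$ then depend on the properties of the $X_t$. For example, if the $X_t$ have finite variances and are not too strongly dependent, then a CLT for dependent processes will apply. Supposing that a CLT applies,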
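and noting that $E(\sqrt{n}\,g_n(\theta_0)) = 0$, we get
\[ I_\infty(\theta_0)^{-1/2}\sqrt{n}\,g_n(\theta_0) \overset{d}{\to} N[0, I_K] \]
where
\[ I_\infty(\theta_0) = \lim_{n\to\infty} E_{\theta_0}\left(n\,[g_n(\theta_0)][g_n(\theta_0)]'\right) = \lim_{n\to\infty} V_{\theta_0}\left(\sqrt{n}\,g_n(\theta_0)\right). \]
This can also be written as
\[ \sqrt{n}\,g_n(\theta_0) \overset{d}{\to} N[0, I_\infty(\theta_0)] \qquad (4.3) \]
$I_\infty(\theta_0)$ is known as the information matrix. Combining (4.2) and (4.3), and noting that $H_\infty(\theta_0)$ is symmetric, we get
\[ \sqrt{n}(\hat{\theta} - \theta_0) \overset{a}{\sim} N\left[0,\ H_\infty(\theta_0)^{-1} I_\infty(\theta_0) H_\infty(\theta_0)^{-1}\right]. \]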
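\[ \sqrt{n}(\hat{\theta} - \theta_0) \overset{d}{\to} N(0, V_\infty) \qquad (4.4) \]
There do exist, in special cases, estimators that are consistent such that $\sqrt{n}(\hat{\theta} - \theta_0) \overset{p}{\to} 0$. These are known as superconsistent estimators.
An estimator $\hat{\theta}$ is asymptotically unbiased if
\[ \lim_{n\to\infty} E_\theta(\hat{\theta}) = \theta. \qquad (4.5) \]
Estimators that are CAN are asymptotically unbiased, though not all consistent estimators are asymptotically unbiased. Such cases are unusual, though. An example is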
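In the coin example, the MLE is the sample mean, $\hat{p} = \bar{y}$ (equation 4.1). Consider the limiting variance of $\sqrt{n}(\hat{p} - p)$:
\[
\begin{aligned}
\lim Var\,\sqrt{n}(\hat{p} - p) &= \lim n\,Var(\hat{p} - p) \\
&= \lim n\,Var(\hat{p}) \\
&= \lim n\,Var(\bar{y}) \\
&= \lim n\,Var\left(\frac{\sum y_t}{n}\right) \\
&= \lim \frac{1}{n}\sum Var(y_t) \quad \text{(by independence of obs.)} \\
&= \lim \frac{1}{n}\, n\,Var(y) \quad \text{(by identically distributed obs.)} \\
&= p(1-p)
\end{aligned}
\]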
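The limiting variance just derived can be checked by simulation. The following is a small Monte Carlo sketch, not part of the notes: the function name, the value $p = 0.3$, the sample size and the number of replications are all assumptions chosen for illustration.

```python
import random
import statistics


def scaled_variance_of_mean(p, n, reps, seed=0):
    """Monte Carlo variance of sqrt(n) * (ybar - p) for Bernoulli(p) data."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        ybar = sum(1 if rng.random() < p else 0 for _ in range(n)) / n
        draws.append((n ** 0.5) * (ybar - p))
    return statistics.pvariance(draws)


p = 0.3                                   # hypothetical true probability of heads
v = scaled_variance_of_mean(p, n=200, reps=5000)
print("simulated variance:", v)
print("p(1 - p):          ", p * (1 - p))
```

The two printed numbers should be close, since $n\,Var(\bar{y}) = p(1-p)$ holds exactly for i.i.d. Bernoulli data at any $n$.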
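We will show that $H_\infty(\theta) = -I_\infty(\theta)$. Let $f_t(\theta)$ be short for $f(y_t|x_t,\theta)$. Then
\[ 1 = \int f_t(\theta)\,dy, \]
so
\[ 0 = \int D_\theta f_t(\theta)\,dy = \int \left(D_\theta \ln f_t(\theta)\right) f_t(\theta)\,dy. \]
Now differentiate again:
\[
\begin{aligned}
0 &= \int \left[D^2_\theta \ln f_t(\theta)\right] f_t(\theta)\,dy + \int \left[D_\theta \ln f_t(\theta)\right] D_{\theta'} f_t(\theta)\,dy \\
&= E_\theta\left[D^2_\theta \ln f_t(\theta)\right] + \int \left[D_\theta \ln f_t(\theta)\right]\left[D_{\theta'} \ln f_t(\theta)\right] f_t(\theta)\,dy \\
&= E_\theta\left[H_t(\theta)\right] + E_\theta\left[g_t(\theta)\right]\left[g_t(\theta)\right]' \qquad (4.6)
\end{aligned}
\]
Now sum over $n$ and multiply by $\frac{1}{n}$:
\[ E_\theta\, \frac{1}{n}\sum_{t=1}^{n} \left[H_t(\theta)\right] = -E_\theta\left[\frac{1}{n}\sum_{t=1}^{n} \left[g_t(\theta)\right]\left[g_t(\theta)\right]'\right] \]
The scores $g_t$ and $g_s$ are uncorrelated for $t \neq s$, since for $t > s$ the density $f_t$ has conditioned on prior information, so what was random in $s$ is fixed in $t$. (This forms the basis for a specification test proposed by White: if the scores appear to be correlated one may question the specification of the model.) This allows us to write
\[ E_\theta\left[H_n(\theta)\right] = -E_\theta\left(n\left[g_n(\theta)\right]\left[g_n(\theta)\right]'\right) \]
since all cross products between different periods expect to zero. Finally, taking limits, we get
\[ H_\infty(\theta) = -I_\infty(\theta). \qquad (4.7) \]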
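This holds for all $\theta$, in particular for $\theta_0$. Using this,
\[ \sqrt{n}(\hat{\theta} - \theta_0) \overset{a}{\sim} N\left[0,\ H_\infty(\theta_0)^{-1} I_\infty(\theta_0) H_\infty(\theta_0)^{-1}\right] \]
simplifies to
\[ \sqrt{n}(\hat{\theta} - \theta_0) \overset{a}{\sim} N\left[0,\ I_\infty(\theta_0)^{-1}\right] \qquad (4.8) \]
To estimate the asymptotic variance, we need estimators of $H_\infty(\theta_0)$ and $I_\infty(\theta_0)$. We can use
\[ \widehat{I_\infty(\theta_0)} = \frac{1}{n}\sum_{t=1}^{n} g_t(\hat{\theta})\,g_t(\hat{\theta})' \]
\[ \widehat{H_\infty(\theta_0)} = H(\hat{\theta}). \]
Note, one can't use
\[ \widehat{I_\infty(\theta_0)} = n\left[g_n(\hat{\theta})\right]\left[g_n(\hat{\theta})\right]' \]
to estimate the information matrix, because $g_n(\hat{\theta}) \equiv 0$ by the first order conditions. There are thus alternative ways to estimate $V_\infty(\theta_0)$ that are all valid. These include
\[ \widehat{V_\infty(\theta_0)} = -\widehat{H_\infty(\theta_0)}^{-1} \]
\[ \widehat{V_\infty(\theta_0)} = \widehat{I_\infty(\theta_0)}^{-1} \]
\[ \widehat{V_\infty(\theta_0)} = \widehat{H_\infty(\theta_0)}^{-1}\,\widehat{I_\infty(\theta_0)}\,\widehat{H_\infty(\theta_0)}^{-1} \]
These are known as the inverse Hessian, outer product of the gradient (OPG) and sandwich estimators, respectively. The sandwich form is the most robust, since it coincides with the covariance estimator of the quasi-ML estimator.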
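Let $\tilde{\theta}$ be a CAN estimator of $\theta$. Since it is CAN, it is asymptotically unbiased, so
\[ \lim_{n\to\infty} E_\theta(\tilde{\theta} - \theta) = 0. \]
Differentiate with respect to $\theta'$:
\[ D_{\theta'} \lim_{n\to\infty} E_\theta(\tilde{\theta} - \theta) = \lim_{n\to\infty} \int D_{\theta'}\left[f(Y,\theta)(\tilde{\theta} - \theta)\right] dy = 0 \]
(this is a $K \times K$ matrix of zeros). Noting that $D_{\theta'} f(Y,\theta) = f(\theta)\,D_{\theta'} \ln f(\theta)$, we can write
\[ \lim_{n\to\infty} \int (\tilde{\theta} - \theta)\, f(\theta)\, D_{\theta'} \ln f(\theta)\,dy + \lim_{n\to\infty} \int f(Y,\theta)\, D_{\theta'}(\tilde{\theta} - \theta)\,dy = 0. \]
Now note that $D_{\theta'}(\tilde{\theta} - \theta) = -I_K$, and $\int f(Y,\theta)(-I_K)\,dy = -I_K$. With this we have
\[ \lim_{n\to\infty} \int (\tilde{\theta} - \theta)\, f(\theta)\, D_{\theta'} \ln f(\theta)\,dy = I_K. \]
Playing with powers of $n$ we get
\[ \lim_{n\to\infty} \int \sqrt{n}(\tilde{\theta} - \theta)\, \sqrt{n}\, \underbrace{\frac{1}{n}\left[D_{\theta'} \ln f(\theta)\right]}\, f(\theta)\,dy = I_K. \]
Note that the bracketed part is just the transpose of the score vector, $g(\theta)$, so we can write
\[ \lim_{n\to\infty} E_\theta\left[\sqrt{n}(\tilde{\theta} - \theta)\, \sqrt{n}\,g(\theta)'\right] = I_K. \]
This means that the covariance of the score function with $\sqrt{n}(\tilde{\theta} - \theta)$, for $\tilde{\theta}$ any CAN estimator, is an identity matrix. Using this, suppose the variance of $\sqrt{n}(\tilde{\theta} - \theta)$ tends to $V_\infty(\tilde{\theta})$. Therefore,
\[ V_\infty \begin{bmatrix} \sqrt{n}(\tilde{\theta} - \theta) \\ \sqrt{n}\,g(\theta) \end{bmatrix} = \begin{bmatrix} V_\infty(\tilde{\theta}) & I_K \\ I_K & I_\infty(\theta) \end{bmatrix}. \qquad (4.9) \]
Since this is a covariance matrix, it is positive semidefinite. Therefore, for any $K$-vector $\alpha$,
\[ \begin{bmatrix} \alpha' & -\alpha' I_\infty^{-1}(\theta) \end{bmatrix} \begin{bmatrix} V_\infty(\tilde{\theta}) & I_K \\ I_K & I_\infty(\theta) \end{bmatrix} \begin{bmatrix} \alpha \\ -I_\infty(\theta)^{-1}\alpha \end{bmatrix} \geq 0. \]
This simplifies to
\[ \alpha'\left[V_\infty(\tilde{\theta}) - I_\infty^{-1}(\theta)\right]\alpha \geq 0. \]
Since $\alpha$ is arbitrary, $V_\infty(\tilde{\theta}) - I_\infty^{-1}(\theta)$ is positive semidefinite. This means that $I_\infty^{-1}(\theta)$ is a lower bound for the asymptotic variance of a CAN estimator.
Given two CAN estimators of a parameter $\theta_0$, say $\tilde{\theta}$ and $\hat{\theta}$, $\hat{\theta}$ is asymptotically efficient with respect to $\tilde{\theta}$ if $V_\infty(\tilde{\theta}) - V_\infty(\hat{\theta})$ is a positive semidefinite matrix.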
Summary of MLE
Consistent
This is for general MLE: we haven't specified the distribution or the linearity/nonlinearity of the estimator.
y = 1(heads)
is
pb0 = y.
n.
a
n (
y p0 ) N 0, H (p0 )1 I (p0 )H (p0 )1
a) nd the analyti
expression for gt () and show that E [gt ()] = 0
b) nd the analyti
al expressions for H (p0 ) and I (p0 ) for this problem
d) Write an O tave program that does a Monte Carlo study that shows that n (y p0 ) is
yt =
n (
y p0 )
xt
+ t
n.
\[ f(\epsilon_t) = \frac{1}{\pi\left(1 + \epsilon_t^2\right)}, \quad -\infty < \epsilon_t < \infty \]
The Cauchy density has a shape similar to a normal density, but with much thicker tails. Thus, extremely small and large errors occur much more frequently with this density than would happen if the errors were normally distributed. Find the score function
gn () where
gn ()
where
Compare the first order conditions that define the ML estimators of problems 2 and 3 and interpret the differences.
Why
Chapter 5
Asymptotic properties of the least squares estimator
The OLS estimator under the classical assumptions is BLUE, for all sample sizes. Now let's see what happens when the sample size tends to infinity.
5.1 Consistency
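\[
\begin{aligned}
\hat{\beta} &= (X'X)^{-1}X'y \\
&= (X'X)^{-1}X'(X\beta_0 + \epsilon) \\
&= \beta_0 + (X'X)^{-1}X'\epsilon \\
&= \beta_0 + \left(\frac{X'X}{n}\right)^{-1}\frac{X'\epsilon}{n}
\end{aligned}
\]
Consider the last two terms. By assumption $\lim_{n\to\infty}\left(\frac{X'X}{n}\right) = Q_X$, which implies $\lim_{n\to\infty}\left(\frac{X'X}{n}\right)^{-1} = Q_X^{-1}$, since the inverse of a nonsingular matrix is a continuous function of the elements of the matrix. Considering $\frac{X'\epsilon}{n}$,
\[ \frac{X'\epsilon}{n} = \frac{1}{n}\sum_{t=1}^{n} x_t\epsilon_t. \]
Each $x_t\epsilon_t$ has expectation zero, so
\[ E\left(\frac{X'\epsilon}{n}\right) = 0. \]
The variance of each term is
\[ V(x_t\epsilon_t) = x_t x_t'\sigma^2. \]
As long as these are finite, and given a technical condition, the Kolmogorov SLLN applies, so
\[ \frac{1}{n}\sum_{t=1}^{n} x_t\epsilon_t \overset{a.s.}{\to} 0. \]
This implies that
\[ \hat{\beta} \overset{a.s.}{\to} \beta_0. \]
This is the property of strong consistency: the estimator converges almost surely to the true value.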
errors.
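\[
\begin{aligned}
\hat{\beta} &= \beta_0 + (X'X)^{-1}X'\epsilon \\
\hat{\beta} - \beta_0 &= (X'X)^{-1}X'\epsilon \\
\sqrt{n}\left(\hat{\beta} - \beta_0\right) &= \left(\frac{X'X}{n}\right)^{-1}\frac{X'\epsilon}{\sqrt{n}}
\end{aligned}
\]
Now as before, $\left(\frac{X'X}{n}\right)^{-1} \to Q_X^{-1}$. Considering $\frac{X'\epsilon}{\sqrt{n}}$, the limit of the variance is
\[ \lim_{n\to\infty} V\left(\frac{X'\epsilon}{\sqrt{n}}\right) = \lim_{n\to\infty} E\left(\frac{X'\epsilon\epsilon'X}{n}\right) = \sigma_0^2 Q_X. \]
The mean is of course zero. To get asymptotic normality, we need to apply a CLT. We assume one (for instance, the Lindeberg-Feller CLT) holds, so
\[ \frac{X'\epsilon}{\sqrt{n}} \overset{d}{\to} N\left(0, \sigma_0^2 Q_X\right). \]
Therefore,
\[ \sqrt{n}\left(\hat{\beta} - \beta_0\right) \overset{d}{\to} N\left(0, \sigma_0^2 Q_X^{-1}\right). \]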
For application of LLN's and CLT's, of which there are very many to choose from, I'm going to avoid the technicalities. Basically, as long as terms that make up an average have finite variances and are not too strongly dependent, one will be able to find a LLN or CLT to apply. Which one it is doesn't matter, we only need the result.
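In summary, the OLS estimator is normally distributed in small and large samples if $\epsilon$ is normally distributed. If $\epsilon$ is not normally distributed, $\hat{\beta}$ is asymptotically normally distributed when a CLT can be applied.
5.3 Asymptotic efficiency
The least squares objective function is
\[ s(\beta) = \sum_{t=1}^{n}\left(y_t - x_t'\beta\right)^2. \]
Supposing that $\epsilon$ is normally distributed, the model is
\[ y = X\beta_0 + \epsilon, \quad \epsilon \sim N(0, \sigma_0^2 I_n), \]
so
\[ f(\epsilon) = \prod_{t=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{\epsilon_t^2}{2\sigma^2}\right). \]
The joint density for $y$ can be constructed using a change of variables. We have $\epsilon = y - X\beta$, so $\frac{\partial\epsilon}{\partial y'} = I_n$ and $\left|\frac{\partial\epsilon}{\partial y'}\right| = 1$, so
\[ f(y) = \prod_{t=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_t - x_t'\beta)^2}{2\sigma^2}\right). \]
Taking logs,
\[ \ln L(\beta,\sigma) = -n\ln\sqrt{2\pi} - n\ln\sigma - \sum_{t=1}^{n}\frac{(y_t - x_t'\beta)^2}{2\sigma^2}. \]
The first order conditions for the MLE of $\beta_0$ are therefore the same as those for OLS (up to multiplication by a constant), so the estimators are the same under the present assumptions, and their properties are the same. In particular, under the classical assumptions with normality, the OLS estimator $\hat{\beta}$ is asymptotically efficient.
As we'll see later, it will be possible to use (iterated) linear estimation methods and still achieve asymptotic efficiency even if the assumption that $Var(\epsilon) = \sigma^2 I_n$ fails, as long as $\epsilon$ is still normally distributed. This is not the case if $\epsilon$ is nonnormal. In general, with nonnormal errors it will be necessary to use nonlinear estimation methods to achieve asymptotically efficient estimation. That possibility is addressed in the second half of the notes.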
n j j ,
rameters.
where
is one of the
slope pa-
data should follow the lassi al assumptions, ex ept that the errors should not be
ment.
et ).
Com-
Chapter 6
Restrictions and hypothesis tests
6.1 Exact linear restrictions
In many cases, economic theory suggests restrictions on the parameters of a model. For example, a demand function is supposed to be homogeneous of degree zero in prices and income. If we have a Cobb-Douglas (log-linear) model,
\[ \ln q = \beta_0 + \beta_1\ln p_1 + \beta_2\ln p_2 + \beta_3\ln m + \epsilon, \]
then we need that
\[ k^0\ln q = \beta_0 + \beta_1\ln kp_1 + \beta_2\ln kp_2 + \beta_3\ln km + \epsilon, \]
so
\[
\begin{aligned}
\beta_1\ln p_1 + \beta_2\ln p_2 + \beta_3\ln m &= \beta_1\ln kp_1 + \beta_2\ln kp_2 + \beta_3\ln km \\
&= (\ln k)(\beta_1 + \beta_2 + \beta_3) + \beta_1\ln p_1 + \beta_2\ln p_2 + \beta_3\ln m.
\end{aligned}
\]
The only way to guarantee this for arbitrary $k$ is to set
\[ \beta_1 + \beta_2 + \beta_3 = 0, \]
which is a parameter restriction.
6.1.1 Imposition
The general formulation of linear equality restrictions is the model
\[ y = X\beta + \epsilon \]
\[ R\beta = r \]
where $R$ is a $Q \times K$ matrix, $Q < K$, and $r$ is a $Q \times 1$ vector of constants. We assume $R$ is of rank $Q$, so that there are no redundant restrictions.
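Let's consider how to estimate $\beta$ subject to the restrictions $R\beta = r$. The most obvious approach is to set up the Lagrangean
\[ \min_{\beta,\lambda} s(\beta,\lambda) = \frac{1}{n}(y - X\beta)'(y - X\beta) + 2\lambda'(R\beta - r). \]
The Lagrange multipliers are scaled by 2, which makes things less messy. The first order necessary conditions (fonc), after multiplying through by $n$ (which only rescales $\lambda$), are
\[ D_\beta s(\hat{\beta},\hat{\lambda}) = -2X'y + 2X'X\hat{\beta}_R + 2R'\hat{\lambda} \equiv 0 \]
\[ D_\lambda s(\hat{\beta},\hat{\lambda}) = R\hat{\beta}_R - r \equiv 0, \]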
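which can be written as
\[ \begin{bmatrix} X'X & R' \\ R & 0 \end{bmatrix} \begin{bmatrix} \hat{\beta}_R \\ \hat{\lambda} \end{bmatrix} = \begin{bmatrix} X'y \\ r \end{bmatrix}. \]
We get
\[ \begin{bmatrix} \hat{\beta}_R \\ \hat{\lambda} \end{bmatrix} = \begin{bmatrix} X'X & R' \\ R & 0 \end{bmatrix}^{-1} \begin{bmatrix} X'y \\ r \end{bmatrix}. \]
To invert the partitioned matrix stepwise, note that
\[ \begin{bmatrix} (X'X)^{-1} & 0 \\ -R(X'X)^{-1} & I_Q \end{bmatrix} \begin{bmatrix} X'X & R' \\ R & 0 \end{bmatrix} \equiv AB = \begin{bmatrix} I_K & (X'X)^{-1}R' \\ 0 & -R(X'X)^{-1}R' \end{bmatrix} \equiv \begin{bmatrix} I_K & (X'X)^{-1}R' \\ 0 & -P \end{bmatrix} \equiv C, \]
and
\[ \begin{bmatrix} I_K & (X'X)^{-1}R'P^{-1} \\ 0 & -P^{-1} \end{bmatrix} \begin{bmatrix} I_K & (X'X)^{-1}R' \\ 0 & -P \end{bmatrix} \equiv DC = I_{K+Q}, \]
so
\[ DAB = I_{K+Q}, \qquad DA = B^{-1}, \]
\[
\begin{aligned}
B^{-1} &= \begin{bmatrix} I_K & (X'X)^{-1}R'P^{-1} \\ 0 & -P^{-1} \end{bmatrix} \begin{bmatrix} (X'X)^{-1} & 0 \\ -R(X'X)^{-1} & I_Q \end{bmatrix} \\
&= \begin{bmatrix} (X'X)^{-1} - (X'X)^{-1}R'P^{-1}R(X'X)^{-1} & (X'X)^{-1}R'P^{-1} \\ P^{-1}R(X'X)^{-1} & -P^{-1} \end{bmatrix},
\end{aligned}
\]
so (everyone should start paying attention again, and please note that we have made the definition $P = R(X'X)^{-1}R'$)
\[
\begin{aligned}
\begin{bmatrix} \hat{\beta}_R \\ \hat{\lambda} \end{bmatrix} &= B^{-1}\begin{bmatrix} X'y \\ r \end{bmatrix} \\
&= \begin{bmatrix} \hat{\beta} - (X'X)^{-1}R'P^{-1}\left(R\hat{\beta} - r\right) \\ P^{-1}\left(R\hat{\beta} - r\right) \end{bmatrix} \\
&= \begin{bmatrix} I_K - (X'X)^{-1}R'P^{-1}R \\ P^{-1}R \end{bmatrix}\hat{\beta} + \begin{bmatrix} (X'X)^{-1}R'P^{-1}r \\ -P^{-1}r \end{bmatrix}.
\end{aligned}
\]
The fact that $\hat{\beta}_R$ and $\hat{\lambda}$ are linear functions of $\hat{\beta}$ makes it easy to determine their distributions, since the distribution of $\hat{\beta}$ is already known. Recall that for $x$ a random vector, and for $A$ and $b$ a matrix and vector of constants, respectively, $Var(Ax + b) = A\,Var(x)\,A'$.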
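As a numerical illustration of the restricted estimator $\hat{\beta}_R = \hat{\beta} - (X'X)^{-1}R'P^{-1}(R\hat{\beta} - r)$ with $P = R(X'X)^{-1}R'$, here is a minimal sketch for two regressors and a single restriction. The function names, the simulated data and the restriction $\beta_1 + \beta_2 = 1$ are assumptions made for the example; they do not come from the notes.

```python
import random


def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def restricted_ls(x1, x2, y, R, r):
    """OLS and restricted LS for y = b1*x1 + b2*x2 + e subject to R*b = r (Q = 1)."""
    s12 = sum(a * b for a, b in zip(x1, x2))
    xtx = [[sum(a * a for a in x1), s12],
           [s12, sum(b * b for b in x2)]]
    xty = [sum(a * c for a, c in zip(x1, y)),
           sum(b * c for b, c in zip(x2, y))]
    xtx_inv = inv2(xtx)
    beta = [xtx_inv[i][0] * xty[0] + xtx_inv[i][1] * xty[1] for i in range(2)]
    # (X'X)^{-1} R' is a 2x1 vector when there is one restriction
    v = [xtx_inv[i][0] * R[0] + xtx_inv[i][1] * R[1] for i in range(2)]
    P = R[0] * v[0] + R[1] * v[1]                 # scalar R (X'X)^{-1} R'
    lam = (R[0] * beta[0] + R[1] * beta[1] - r) / P   # lambda_hat = P^{-1}(R b - r)
    beta_r = [beta[i] - v[i] * lam for i in range(2)]
    return beta, beta_r, lam


# Hypothetical data in which the restriction b1 + b2 = 1 is true.
random.seed(3)
n = 200
x1 = [random.uniform(0, 1) for _ in range(n)]
x2 = [random.uniform(0, 1) for _ in range(n)]
y = [0.7 * a + 0.3 * b + random.gauss(0, 0.5) for a, b in zip(x1, x2)]

beta, beta_r, lam = restricted_ls(x1, x2, y, R=(1.0, 1.0), r=1.0)
print("OLS:          ", beta)
print("restricted LS:", beta_r, "sum =", beta_r[0] + beta_r[1])
print("multiplier:   ", lam)
```

The restricted coefficients sum to one up to rounding error, whatever the unrestricted estimates happen to be.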
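Though this is the obvious way to go about finding the restricted estimator, an easier way, if the number of restrictions is small, is to impose them by substitution. Write
\[ y = X_1\beta_1 + X_2\beta_2 + \epsilon \]
\[ \begin{bmatrix} R_1 & R_2 \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} = r \]
where $R_1$ is $Q \times Q$ nonsingular. Supposing the $Q$ restrictions are linearly independent, one can always make $R_1$ nonsingular by reorganizing the columns of $X$. Then
\[ \beta_1 = R_1^{-1}r - R_1^{-1}R_2\beta_2. \]
Substitute this into the model:
\[
\begin{aligned}
y &= X_1R_1^{-1}r - X_1R_1^{-1}R_2\beta_2 + X_2\beta_2 + \epsilon \\
y - X_1R_1^{-1}r &= \left[X_2 - X_1R_1^{-1}R_2\right]\beta_2 + \epsilon
\end{aligned}
\]
or, with the appropriate definitions,
\[ y_R = X_R\beta_2 + \epsilon. \]
This model satisfies the classical assumptions, supposing the restriction is true. One can estimate by OLS. The variance of $\hat{\beta}_2$ is as before
\[ V(\hat{\beta}_2) = \left(X_R'X_R\right)^{-1}\sigma_0^2 \]
and the estimator is
\[ \hat{V}(\hat{\beta}_2) = \left(X_R'X_R\right)^{-1}\hat{\sigma}^2 \]
where one estimates $\sigma_0^2$ in the normal way, using the restricted model, i.e.,
\[ \widehat{\sigma_0^2} = \frac{\left(y_R - X_R\hat{\beta}_2\right)'\left(y_R - X_R\hat{\beta}_2\right)}{n - (K - Q)}. \]
To recover $\hat{\beta}_1$, use the restriction. To find the variance of $\hat{\beta}_1$, use the fact that it is a linear function of $\hat{\beta}_2$, so
\[
\begin{aligned}
V(\hat{\beta}_1) &= R_1^{-1}R_2\,V(\hat{\beta}_2)\,R_2'\left(R_1^{-1}\right)' \\
&= R_1^{-1}R_2\left(X_R'X_R\right)^{-1}R_2'\left(R_1^{-1}\right)'\sigma_0^2.
\end{aligned}
\]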
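We have that
\[
\begin{aligned}
\hat{\beta}_R &= \hat{\beta} - (X'X)^{-1}R'P^{-1}\left(R\hat{\beta} - r\right) \\
\hat{\beta}_R - \beta &= (X'X)^{-1}X'\epsilon + (X'X)^{-1}R'P^{-1}\left[r - R\beta\right] - (X'X)^{-1}R'P^{-1}R(X'X)^{-1}X'\epsilon.
\end{aligned}
\]
The mean squared error is $MSE(\hat{\beta}_R) = E(\hat{\beta}_R - \beta)(\hat{\beta}_R - \beta)'$. The crosses between the second term and the other terms expect to zero, and the cross of the first and third terms partly cancels with the square of the third, which gives
\[
\begin{aligned}
MSE(\hat{\beta}_R) = {} & (X'X)^{-1}\sigma^2 \\
& + (X'X)^{-1}R'P^{-1}\left[r - R\beta\right]\left[r - R\beta\right]'P^{-1}R(X'X)^{-1} \\
& - (X'X)^{-1}R'P^{-1}R(X'X)^{-1}\sigma^2.
\end{aligned}
\]
So, the first term is the OLS covariance. The second term is PSD, and the third term is NSD.
If the restriction is true, the second term is 0, so we are better off.
If the restriction is false, we may be better or worse off, in terms of MSE, depending on the magnitudes of $r - R\beta$ and $\sigma^2$.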
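6.2 Testing
In many cases, one wishes to test economic theories. If theory suggests parameter restrictions, as in the above homogeneity example, one can test theory by testing parameter restrictions. A number of tests are available.
6.2.1 t-test
Suppose one has the model
\[ y = X\beta + \epsilon \]
and one wishes to test the single restriction $H_0: R\beta = r$ vs. $H_A: R\beta \neq r$. Under $H_0$, with normality of the errors,
\[ R\hat{\beta} - r \sim N\left(0, R(X'X)^{-1}R'\sigma_0^2\right) \]
so
\[ \frac{R\hat{\beta} - r}{\sqrt{R(X'X)^{-1}R'\sigma_0^2}} = \frac{R\hat{\beta} - r}{\sigma_0\sqrt{R(X'X)^{-1}R'}} \sim N(0,1). \]
The problem is that $\sigma_0^2$ is unknown. One could use the consistent estimator $\widehat{\sigma_0^2}$ in place of $\sigma_0^2$, but the test would only be valid asymptotically in this case.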
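Proposition 4
\[ \frac{N(0,1)}{\sqrt{\frac{\chi^2(q)}{q}}} \sim t(q) \qquad (6.1) \]
as long as the $N(0,1)$ and the $\chi^2(q)$ are independent.
If $x \sim N(\mu, I_n)$ is a vector of $n$ independent r.v.'s, then
\[ x'x \sim \chi^2(n,\lambda) \qquad (6.2) \]
where $\lambda = \sum_i \mu_i^2 = \mu'\mu$ is the noncentrality parameter. When a $\chi^2$ r.v. has the noncentrality parameter equal to zero, it is referred to as a central $\chi^2$ r.v., and its distribution is written as $\chi^2(n)$, suppressing the noncentrality parameter.
If the $n$-dimensional random vector $x \sim N(0,V)$, then $x'V^{-1}x \sim \chi^2(n)$. To prove this, factor $V^{-1}$ as $P'P$ (this is the Cholesky factorization, where $P$ is defined to be upper triangular). Then consider $y = Px$. We have
\[ y \sim N(0, PVP') \]
but
\[ VP'P = I_n \]
\[ PVP'P = P \]
so $PVP' = I_n$ and thus $y \sim N(0, I_n)$. Thus $y'y \sim \chi^2(n)$ but
\[ y'y = x'P'Px = x'V^{-1}x \]
and we get the result we wanted.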
(6.3)
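Proposition 8 If the random vector (of dimension $n$) $x \sim N(0,I)$, and $B$ is idempotent with rank $r$, then
\[ x'Bx \sim \chi^2(r). \qquad (6.4) \]
An immediate consequence is
\[ \frac{\hat{\epsilon}'\hat{\epsilon}}{\sigma_0^2} = \frac{\epsilon'M_X\epsilon}{\sigma_0^2} = \left(\frac{\epsilon}{\sigma_0}\right)'M_X\left(\frac{\epsilon}{\sigma_0}\right) \sim \chi^2(n-K). \]
Proposition 9 If the random vector (of dimension $n$) $x \sim N(0,I)$, then $Ax$ and $x'Bx$ are independent if $AB = 0$.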
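Now consider (remember that we have only one restriction in this case)
\[ \frac{\dfrac{R\hat{\beta} - r}{\sigma_0\sqrt{R(X'X)^{-1}R'}}}{\sqrt{\dfrac{\hat{\epsilon}'\hat{\epsilon}}{(n-K)\sigma_0^2}}} = \frac{R\hat{\beta} - r}{\widehat{\sigma_0}\sqrt{R(X'X)^{-1}R'}}. \]
This will have the $t(n-K)$ distribution if $\hat{\beta}$ and $\hat{\epsilon}'\hat{\epsilon}$ are independent. But $\hat{\beta} = \beta + (X'X)^{-1}X'\epsilon$ and
\[ (X'X)^{-1}X'M_X = 0, \]
so
\[ \frac{R\hat{\beta} - r}{\widehat{\sigma_0}\sqrt{R(X'X)^{-1}R'}} = \frac{R\hat{\beta} - r}{\hat{\sigma}_{R\hat{\beta}}} \sim t(n-K). \]
In particular, for the commonly encountered test of significance of an individual coefficient, for which $H_0: \beta_i = 0$ vs. $H_A: \beta_i \neq 0$, the test statistic is
\[ \frac{\hat{\beta}_i}{\hat{\sigma}_{\hat{\beta}_i}} \sim t(n-K). \]
Note: the $t$ test is strictly valid only if the errors are actually normally distributed. If one has nonnormal errors, one could use the above asymptotic result to justify taking critical values from the $N(0,1)$ distribution, since $t(n-K) \overset{d}{\to} N(0,1)$ as $n \to \infty$. In practice, a conservative procedure is to take critical values from the $t$ distribution if nonnormality is suspected. This will reject $H_0$ less often, since the $t$ distribution is fatter-tailed than the normal.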
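6.2.2 F test
If $x \sim \chi^2(r)$ and $y \sim \chi^2(s)$, then
\[ \frac{x/r}{y/s} \sim F(r,s) \qquad (6.5) \]
provided that $x$ and $y$ are independent.
If the random vector (of dimension $n$) $x \sim N(0,I)$, then $x'Ax$ and $x'Bx$ are independent if $AB = 0$.
Using these results, and previous results on the $\chi^2$ distribution, it is simple to show that the following statistic has the $F$ distribution:
\[ F = \frac{\left(R\hat{\beta} - r\right)'\left(R(X'X)^{-1}R'\right)^{-1}\left(R\hat{\beta} - r\right)}{q\,\hat{\sigma}^2} \sim F(q, n-K). \]
A numerically equivalent expression is
\[ \frac{(ESS_R - ESS_U)/q}{ESS_U/(n-K)} \sim F(q, n-K). \]
Note: The $F$ test is strictly valid only if the errors are truly normally distributed. The following tests will be appropriate when one cannot assume normally distributed errors.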
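Given that the least squares estimator is asymptotically normally distributed,
\[ \sqrt{n}\left(\hat{\beta} - \beta_0\right) \overset{d}{\to} N\left(0, \sigma_0^2 Q_X^{-1}\right), \]
then under $H_0: R\beta_0 = r$, we have
\[ \sqrt{n}\left(R\hat{\beta} - r\right) \overset{d}{\to} N\left(0, \sigma_0^2 RQ_X^{-1}R'\right) \]
so by Proposition [6]
\[ n\left(R\hat{\beta} - r\right)'\left(\sigma_0^2 RQ_X^{-1}R'\right)^{-1}\left(R\hat{\beta} - r\right) \overset{d}{\to} \chi^2(q). \]
Note that $Q_X^{-1}$ and $\sigma_0^2$ are not observable. The test statistic we use substitutes the consistent estimators. Use $(X'X/n)^{-1}$ as the consistent estimator of $Q_X^{-1}$. With this, there is a cancellation of $n$'s, and the statistic to use is
\[ \left(R\hat{\beta} - r\right)'\left(\widehat{\sigma_0^2}\,R(X'X)^{-1}R'\right)^{-1}\left(R\hat{\beta} - r\right) \overset{d}{\to} \chi^2(q). \]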
The Wald test is a simple way to test restrictions without having to estimate the restricted model.
Note that this formula is similar to one of the formulae provided for the $F$ test.
In some cases, an unrestricted model may be nonlinear in the parameters, but linear in the parameters under the null hypothesis. For example, the model
\[ y = (X\beta)^{\gamma} + \epsilon \]
is nonlinear in $\beta$ and $\gamma$, but is linear in $\beta$ under $H_0: \gamma = 1$. Estimation of nonlinear models is a bit more complicated, so one might prefer to have a test based upon the restricted, linear model. The score test is useful in this situation.
Score-type tests are based upon the general principle that the gradient vector of the unrestricted model, evaluated at the restricted estimate, should be asymptotically normally distributed with mean zero, if the restrictions are true. The original development was for ML estimation, but the principle is valid for a wide variety of estimation methods.
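We have seen that
\[ \hat{\lambda} = \left(R(X'X)^{-1}R'\right)^{-1}\left(R\hat{\beta} - r\right) = P^{-1}\left(R\hat{\beta} - r\right) \]
so
\[ \sqrt{n}\,P\hat{\lambda} = \sqrt{n}\left(R\hat{\beta} - r\right). \]
Given that
\[ \sqrt{n}\left(R\hat{\beta} - r\right) \overset{d}{\to} N\left(0, \sigma_0^2 RQ_X^{-1}R'\right) \]
under the null hypothesis, we obtain
\[ \sqrt{n}\,P\hat{\lambda} \overset{d}{\to} N\left(0, \sigma_0^2 RQ_X^{-1}R'\right). \]
So
\[ \left(\sqrt{n}\,P\hat{\lambda}\right)'\left(\sigma_0^2 RQ_X^{-1}R'\right)^{-1}\left(\sqrt{n}\,P\hat{\lambda}\right) \overset{d}{\to} \chi^2(q). \]
Noting that $\lim nP = RQ_X^{-1}R'$, we obtain
\[ \hat{\lambda}'\left(\frac{R(X'X)^{-1}R'}{\sigma_0^2}\right)\hat{\lambda} \overset{d}{\to} \chi^2(q), \]
since the powers of $n$ cancel. To get a usable test statistic, substitute a consistent estimator of $\sigma_0^2$.
This makes it clear why the test is sometimes referred to as a Lagrange multiplier test. It may seem that one needs the actual Lagrange multipliers to calculate this.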
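If we impose the restrictions by substitution, these are not available. Note that the test can be written as
\[ \frac{\left(R'\hat{\lambda}\right)'(X'X)^{-1}R'\hat{\lambda}}{\sigma_0^2} \overset{d}{\to} \chi^2(q). \]
However, we can use the fonc for the restricted estimator,
\[ -X'y + X'X\hat{\beta}_R + R'\hat{\lambda} = 0, \]
to get that
\[ R'\hat{\lambda} = X'\left(y - X\hat{\beta}_R\right) = X'\hat{\epsilon}_R. \]
Substituting this into the above, we get
\[ \frac{\hat{\epsilon}_R'X(X'X)^{-1}X'\hat{\epsilon}_R}{\sigma_0^2} \overset{d}{\to} \chi^2(q) \]
but this is simply
\[ \hat{\epsilon}_R'\frac{P_X}{\sigma_0^2}\hat{\epsilon}_R \overset{d}{\to} \chi^2(q). \]
To see why the test is also known as a score test, note that the fonc for restricted least squares
\[ -X'y + X'X\hat{\beta}_R + R'\hat{\lambda} = 0 \]
give us
\[ R'\hat{\lambda} = X'y - X'X\hat{\beta}_R \]
and the rhs is simply the gradient (score) of the unrestricted model, evaluated at the restricted estimator. The scores evaluated at the unrestricted estimate are identically zero. The logic behind the score test is that the scores evaluated at the restricted estimate should be approximately zero, if the restriction is true. The test is also known as a Rao test, since C. R. Rao first proposed it in 1948.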
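The score test can be calculated using only the restricted model. The likelihood ratio test, on the other hand, uses both the restricted and the unrestricted estimators. The test statistic is
\[ LR = 2\left(\ln L(\hat{\theta}) - \ln L(\tilde{\theta})\right) \]
where $\hat{\theta}$ is the unrestricted estimate and $\tilde{\theta}$ is the restricted estimate. To show that it is asymptotically $\chi^2$, take a second order Taylor's series expansion of $\ln L(\tilde{\theta})$ about $\hat{\theta}$:
\[ \ln L(\tilde{\theta}) \simeq \ln L(\hat{\theta}) + \frac{n}{2}\left(\tilde{\theta} - \hat{\theta}\right)'H(\hat{\theta})\left(\tilde{\theta} - \hat{\theta}\right) \]
(the first order term drops out since $D_\theta \ln L(\hat{\theta}) \equiv 0$ by the fonc, and we need to multiply the second order term by $n$ since $H(\theta)$ is defined in terms of $\frac{1}{n}\ln L(\theta)$), so
\[ LR \simeq -n\left(\tilde{\theta} - \hat{\theta}\right)'H(\hat{\theta})\left(\tilde{\theta} - \hat{\theta}\right). \]
As $n \to \infty$, $H(\hat{\theta}) \to H_\infty(\theta_0) = -I_\infty(\theta_0)$, by the information matrix equality. So
\[ LR \overset{a}{=} n\left(\tilde{\theta} - \hat{\theta}\right)'I_\infty(\theta_0)\left(\tilde{\theta} - \hat{\theta}\right). \]
We also have, from [??], that
\[ \sqrt{n}\left(\hat{\theta} - \theta_0\right) \overset{a}{=} I_\infty(\theta_0)^{-1}n^{1/2}g(\theta_0). \]
An analogous result for the restricted estimator is (this is unproven here; to prove this, set up the Lagrangean for MLE subject to $R\beta = r$, and manipulate the first order conditions):
\[ \sqrt{n}\left(\tilde{\theta} - \theta_0\right) \overset{a}{=} I_\infty(\theta_0)^{-1}\left(I_n - R'\left(RI_\infty(\theta_0)^{-1}R'\right)^{-1}RI_\infty(\theta_0)^{-1}\right)n^{1/2}g(\theta_0). \]
Combining the last two equations,
\[ \sqrt{n}\left(\tilde{\theta} - \hat{\theta}\right) \overset{a}{=} -n^{1/2}I_\infty(\theta_0)^{-1}R'\left(RI_\infty(\theta_0)^{-1}R'\right)^{-1}RI_\infty(\theta_0)^{-1}g(\theta_0) \]
so, substituting into [??],
\[ LR \overset{a}{=} \left[n^{1/2}g(\theta_0)'I_\infty(\theta_0)^{-1}R'\right]\left[RI_\infty(\theta_0)^{-1}R'\right]^{-1}\left[RI_\infty(\theta_0)^{-1}n^{1/2}g(\theta_0)\right]. \]
But since $n^{1/2}g(\theta_0) \overset{d}{\to} N\left(0, I_\infty(\theta_0)\right)$, the linear function $RI_\infty(\theta_0)^{-1}n^{1/2}g(\theta_0) \overset{d}{\to} N\left(0, RI_\infty(\theta_0)^{-1}R'\right)$. $LR$ is a quadratic form in this random vector, with the inverse of its variance in the middle, so
\[ LR \overset{d}{\to} \chi^2(q). \]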
same
2 rv, under the null hypothesis. We'll show that the Wald and LR
a
d
1
r
R
W = n R r
R
2 (q)
02 RQ1
X
6.3.
Using
0 = (X X)1 X
and
R r = R( 0 )
we get
nR( 0 ) =
nR(X X)1 X
1
XX
= R
n1/2 X
n
?? to get
=
a
=
where
PR
1
RQ1
X X
1
a
= X(X X)1 R 02 R(X X)1 R
R(X X)1 X
2
1
= n1 XQ1
X R 0 RQX R
A(A A)1 A
02
PR
02
X(X X)1 R .
q.
1
1 (y X) (y X)
.
ln L(, ) = n ln 2 n ln
2
2
Using this,
1
ln L(, )
n
X (y X0 )
n 2
X
n 2
g(0 ) D
=
=
68
CHAPTER 6.
I(0 ) = H (0 )
= lim D g(0 )
X (y X0 )
= lim D
n 2
X X
= lim
n 2
QX
=
2
so
I(0 )1 = 2 Q1
X
??, we get
1
R(X X)1 X
= W
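This completes the proof that the Wald and LR tests are asymptotically equivalent. Similarly, one can show that, under the null hypothesis,
\[ qF \overset{a}{=} W \overset{a}{=} LM \overset{a}{=} LR. \]
The $LR$ statistic is based upon distributional assumptions, since one can't write the likelihood function without them. However, due to the close relationship between the statistics $qF$ and $LR$, supposing normality, the $qF$ statistic can be thought of as a pseudo-LR statistic, in that it's like a LR statistic in that it uses the value of the objective functions of the restricted and unrestricted models, but it doesn't require distributional assumptions.
The presentation of the score and Wald tests has been done in the context of the linear model. This is readily generalizable to nonlinear models and/or other estimation methods.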
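The Wald, LM and LR tests are asymptotically equivalent, but they are numerically different in small samples. The numeric values of the tests also depend upon how $\sigma^2$ is estimated, and we've already seen that there are several ways to do this. For example, all of the following are consistent for $\sigma^2$ under $H_0$:
\[ \frac{\hat{\epsilon}'\hat{\epsilon}}{n-k}, \qquad \frac{\hat{\epsilon}'\hat{\epsilon}}{n}, \qquad \frac{\hat{\epsilon}_R'\hat{\epsilon}_R}{n-k+q}, \qquad \frac{\hat{\epsilon}_R'\hat{\epsilon}_R}{n} \]
and in general the denominator can be replaced with any quantity $a$ such that $\lim a/n = 1$.
It can be shown, for linear regression models subject to linear restrictions, that if $\frac{\hat{\epsilon}'\hat{\epsilon}}{n}$ is used to calculate the Wald test and $\frac{\hat{\epsilon}_R'\hat{\epsilon}_R}{n}$ is used for the score test, then
\[ W \geq LR \geq LM. \]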
test is to
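Given the $t$ statistic
\[ t(\beta) = \frac{\hat{\beta} - \beta}{\hat{\sigma}_{\hat{\beta}}}, \]
a $100(1-\alpha)\%$ confidence interval for $\beta_0$ is defined by the bounds of the set of $\beta$ such that $t(\beta)$ does not reject $H_0: \beta_0 = \beta$, using a $\alpha$ significance level:
\[ C(\alpha) = \left\{\beta : -c_{\alpha/2} < \frac{\hat{\beta} - \beta}{\hat{\sigma}_{\hat{\beta}}} < c_{\alpha/2}\right\}. \]
The set of such $\beta$ is the interval
\[ \hat{\beta} \pm \hat{\sigma}_{\hat{\beta}}\,c_{\alpha/2}. \]
A confidence ellipse for two coefficients jointly would be, analogously, the set of $\{\beta_1, \beta_2\}$ such that the $F$ (or some other test statistic) doesn't reject at the specified critical value.
The region is an ellipse, since the CI for an individual coefficient defines a (infinitely long) rectangle with total prob. mass $1-\alpha$, since the other coefficient is marginalized (e.g., can take on any value). Since the ellipse is bounded in both dimensions but also contains mass $1-\alpha$, it must extend beyond the bounds of the individual CI.
Rejection of hypotheses individually does not imply that the joint test will reject.
Joint rejection does not imply individual tests will reject.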
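6.6 Bootstrapping
When we rely on asymptotic theory to use the normal distribution-based tests and confidence intervals, we're often at serious risk of making important errors. If the sample size is small and errors are highly nonnormal, the small sample distribution of $\sqrt{n}(\hat{\beta} - \beta_0)$ may be very different than its large sample distribution. Also, the distributions of test statistics may not resemble their limiting distributions at all. A means of trying to gain information on the small sample distribution of test statistics and estimators is the bootstrap. We'll consider a simple example, just to get the main idea.
Suppose that
\[ y = X\beta_0 + \epsilon \]
\[ \epsilon \sim IID(0, \sigma_0^2) \]
and $X$ is nonstochastic. Given that the distribution of $\epsilon$ is unknown, the distribution of $\hat{\beta}$ will be unknown in small samples. However, since we have random sampling, we could generate artificial data. The steps are:
1. Draw $n$ observations from $\hat{\epsilon}$ with replacement. Call this vector $\tilde{\epsilon}^j$ (it's a $n \times 1$).
2. Then generate the data by $\tilde{y}^j = X\hat{\beta} + \tilde{\epsilon}^j$.
3. Now take this and estimate $\tilde{\beta}^j = (X'X)^{-1}X'\tilde{y}^j$.
4. Save $\tilde{\beta}^j$.
5. Repeat steps 1-4, until we have a large number, $J$, of $\tilde{\beta}^j$.
With this, we can use the replications to calculate the empirical distribution of $\tilde{\beta}^j$. One way to form a $100(1-\alpha)\%$ confidence interval for $\beta_0$ would be to order the $\tilde{\beta}^j$ from smallest to largest, drop the first and last $J\alpha/2$ of the replications, and use the remaining endpoints as the limits of the CI. Note that this will not give the shortest CI if the empirical distribution is skewed.
Suppose one was interested in the distribution of some function of $\hat{\beta}$, for example a test statistic. Simply calculate the transformation for each $j$, and work with the empirical distribution of the transformation.
If the assumption of iid errors is too strong (for example if there is heteroscedasticity or autocorrelation, see below) one can work with a bootstrap defined by sampling from $(y, x)$ with replacement.
How to choose $J$: $J$ should be large enough that the results don't change with repetition of the entire bootstrap. This is easy to check.
The bootstrap is based fundamentally on the idea that the empirical distribution of the sample data converges to the actual sampling distribution as $n$ becomes large, so statistics based on sampling from the empirical distribution should converge in distribution to statistics based on sampling from the actual sampling distribution.
In finite samples, this doesn't hold. At a minimum, the bootstrap is a good way to check if asymptotic theory results offer a decent approximation to the small sample distribution.
Bootstrapping can be used to test hypotheses. Basically, use the bootstrap to get an approximation to the empirical distribution of the test statistic under the alternative hypothesis, and use this to get critical values. Compare the test statistic calculated using the real data, under the null, to the bootstrap critical values. There are many variations on this theme, which we won't go into here.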
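The resampling steps described in this section can be sketched as follows for a regression with a constant and a single regressor. This is an illustrative sketch only: the function names, the simulated data with skewed (centered exponential) errors, and the choices $J = 2000$ and $\alpha = 0.10$ are assumptions, not part of the notes.

```python
import random


def ols(x, y):
    """Intercept and slope of a simple regression of y on a constant and x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sxx
    return my - b1 * mx, b1


def residual_bootstrap(x, y, J=2000, alpha=0.10, seed=1):
    """Percentile confidence interval for the slope from resampled residuals."""
    rng = random.Random(seed)
    b0, b1 = ols(x, y)
    resid = [c - b0 - b1 * a for a, c in zip(x, y)]
    slopes = []
    for _ in range(J):
        e_star = [rng.choice(resid) for _ in x]                 # draw n residuals with replacement
        y_star = [b0 + b1 * a + e for a, e in zip(x, e_star)]   # generate artificial data
        slopes.append(ols(x, y_star)[1])                        # re-estimate and save
    slopes.sort()
    drop = int(J * alpha / 2)          # replications dropped from each tail
    return b1, slopes[drop], slopes[J - drop - 1]


# Hypothetical small sample with nonnormal errors.
data_rng = random.Random(42)
x = [data_rng.uniform(0, 10) for _ in range(30)]
y = [1.0 + 2.0 * a + (data_rng.expovariate(1.0) - 1.0) for a in x]

b1, lo, hi = residual_bootstrap(x, y)
print("slope estimate:", b1)
print("90% bootstrap interval:", (lo, hi))
```

Re-running with a different seed or a larger $J$ is a simple way to check whether $J$ is large enough for the interval endpoints to be stable.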
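Consider the $q$ nonlinear restrictions
\[ r(\beta_0) = 0, \]
where $r(\cdot)$ is a $q$-vector valued function. Write the derivative of the restriction evaluated at $\beta$ as
\[ D_{\beta'}\,r(\beta) = R(\beta). \]
We suppose that the restrictions are not redundant in a neighborhood of $\beta_0$, so that
\[ \rho\left(R(\beta)\right) = q \]
in a neighborhood of $\beta_0$. Take a first order Taylor's series expansion of $r(\hat{\beta})$ about $\beta_0$:
\[ r(\hat{\beta}) = r(\beta_0) + R(\beta^*)\left(\hat{\beta} - \beta_0\right) \]
where $\beta^*$ is a convex combination of $\hat{\beta}$ and $\beta_0$. Under the null hypothesis we have
\[ r(\hat{\beta}) = R(\beta^*)\left(\hat{\beta} - \beta_0\right). \]
Due to consistency of $\hat{\beta}$ we can replace $\beta^*$ by $\beta_0$, asymptotically, so
\[ \sqrt{n}\,r(\hat{\beta}) \overset{a}{=} \sqrt{n}\,R(\beta_0)\left(\hat{\beta} - \beta_0\right). \]
We've already seen the distribution of $\sqrt{n}(\hat{\beta} - \beta_0)$. Using this we get
\[ \sqrt{n}\,r(\hat{\beta}) \overset{d}{\to} N\left(0, R(\beta_0)Q_X^{-1}R(\beta_0)'\sigma_0^2\right). \]
Considering the quadratic form,
\[ \frac{n\,r(\hat{\beta})'\left(R(\beta_0)Q_X^{-1}R(\beta_0)'\right)^{-1}r(\hat{\beta})}{\sigma_0^2} \overset{d}{\to} \chi^2(q) \]
under the null hypothesis. Substituting consistent estimators for $\beta_0$, $Q_X$ and $\sigma_0^2$, the resulting statistic is
\[ \frac{r(\hat{\beta})'\left(R(\hat{\beta})(X'X)^{-1}R(\hat{\beta})'\right)^{-1}r(\hat{\beta})}{\widehat{\sigma^2}} \overset{d}{\to} \chi^2(q) \]
under the null hypothesis.
Since this is a Wald test, it will tend to over-reject in finite samples. The score and LR tests are also possibilities, but they require estimation methods for nonlinear models, which aren't in the scope of this course.
Note that this also gives a convenient way to estimate nonlinear functions and associated asymptotic confidence intervals. If the nonlinear function $r(\beta_0)$ is not hypothesized to be zero, we just have
\[ \sqrt{n}\left(r(\hat{\beta}) - r(\beta_0)\right) \overset{d}{\to} N\left(0, R(\beta_0)Q_X^{-1}R(\beta_0)'\sigma_0^2\right). \]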
6.7.
(x) =
where
f (x)
73
is
x
f (x)
x
f (x)
y = x + .
The elasti
ities of
w.r.t.
are
(x) =
x
x
(note that this is the entire ve tor of elasti ities). The estimated elasti ities are
b(x) =
x
x
To al ulate the estimated standard errors of all ve elasti ites, use
R() =
(x)
x1 0 0
.
0 x
.
.
2
.
.
..
..
0
0 0 xk
1 x21
2 x22
.
.
.
(x )2
..
0
.
.
.
k x2k
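In many cases, nonlinear restrictions can also involve the data, not just the parameters. For example, consider a model of expenditure shares. Let $x(p,m)$ be a demand function, where $p$ is prices and $m$ is income. An expenditure share system for $G$ goods is
\[ s_i(p,m) = \frac{p_i x_i(p,m)}{m}, \quad i = 1, 2, \ldots, G. \]
Now demand must be positive, and we assume that expenditures sum to income, so we have the restrictions
\[ 0 \leq s_i(p,m) \leq 1, \ \forall i \]
\[ \sum_{i=1}^{G} s_i(p,m) = 1. \]
Suppose we postulate a linear model for the expenditure shares:
\[ s_i(p,m) = \beta_1^i + p'\beta_p^i + m\beta_m^i + \epsilon^i \]
It is fairly easy to write restrictions such that the shares sum to one, but the restriction that the shares lie in the $[0,1]$ interval depends on both the parameters and the values of $p$ and $m$. It is impossible to impose the restriction that $0 \leq s_i(p,m) \leq 1$ for all possible $p$ and $m$. In such cases, one might consider whether or not a linear model is a reasonable specification.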
*********************************************************
OLS estimation results
Observations 145
R-squared 0.925955
Sigma-squared 0.153943

Results (Ordinary var-cov estimator)

            estimate   st.err.   t-stat.   p-value
constant      -3.527     1.774    -1.987     0.049
output         0.720     0.017    41.244     0.000
labor          0.436     0.291     1.499     0.136
fuel           0.427     0.100     4.249     0.000
capital       -0.220     0.339    -0.648     0.518
*********************************************************
Note that $s_K = \beta_K < 0$, and that $\beta_L + \beta_F + \beta_K \neq 1$.
Remember that if we have constant returns to scale, then $\beta_Q = 1$, and if there is homogeneity of degree 1 then $\beta_L + \beta_F + \beta_K = 1$.

            estimate   st.err.   t-stat.   p-value
constant      -4.691     0.891    -5.263     0.000
output         0.721     0.018    41.040     0.000
labor          0.593     0.206     2.878     0.005
fuel           0.414     0.100     4.159     0.000
capital       -0.007     0.192    -0.038     0.969
*******************************************************
            Value     p-value
F           0.574       0.450
Wald        0.594       0.441
LR          0.593       0.441
Score       0.592       0.442
Imposing and testing CRTS
*******************************************************
Restricted LS estimation results
Observations 145
R-squared 0.790420
Sigma-squared 0.438861

            estimate   st.err.   t-stat.   p-value
constant      -7.530     2.966    -2.539     0.012
output         1.000     0.000       Inf     0.000
labor          0.020     0.489     0.040     0.968
fuel           0.715     0.167     4.289     0.000
capital        0.076     0.572     0.132     0.895
*******************************************************
            Value     p-value
F         256.262       0.000
Wald      265.414       0.000
LR        150.863       0.000
Score      93.771       0.000
Notice that the input price coefficients in fact sum to 1 when HOD1 is imposed. HOD1 is not rejected at usual significance levels (e.g., $\alpha = 0.10$). Also, $R^2$ does not drop much when the restriction is imposed, compared to the unrestricted results. For CRTS, you should note that $\beta_Q = 1$, so the restriction is satisfied. Also note that the hypothesis that $\beta_Q = 1$ is rejected by the test statistics at all reasonable significance levels. Note that $R^2$ drops quite a bit when imposing CRTS. If you look at the unrestricted estimation results, you can see that a t-test for $\beta_Q = 1$ also rejects, and that a confidence interval for $\beta_Q$ does not overlap 1.
From the point of view of neoclassical economic theory, these results are not anomalous: HOD1 is an implication of the theory, but CRTS is not.
Exercise 12 Modify the NerloveRestrictions.m program to impose and test the restrictions jointly.
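Since CRTS is rejected, let's examine the possibilities more carefully. Recall that the data is sorted by output (the third column). Define 5 subsamples of firms, with the first group being the 29 firms with the lowest output levels, then the next 29 firms, etc. The five subsamples can be indexed by $j = 1, 2, \ldots, 5$, where $j = 1$ for $t = 1, 2, \ldots, 29$, $j = 2$ for $t = 30, 31, \ldots, 58$, etc. Define the model
\[ \ln C_t = \beta_1^j + \beta_2^j\ln Q_t + \beta_3^j\ln P_{Lt} + \beta_4^j\ln P_{Ft} + \beta_5^j\ln P_{Kt} + \epsilon_t \qquad (6.6) \]
where $j$ is a superscript (not a power) that indicates that the coefficients may be different according to the subsample in which the observation falls. That is, the coefficients depend upon $j$, which in turn depends upon $t$. The new model may be written as
\[ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_5 \end{bmatrix} = \begin{bmatrix} X_1 & 0 & \cdots & 0 \\ 0 & X_2 & & \\ \vdots & & X_3 & \\ & & & X_4 & 0 \\ 0 & & & 0 & X_5 \end{bmatrix} \begin{bmatrix} \beta^1 \\ \beta^2 \\ \vdots \\ \beta^5 \end{bmatrix} + \begin{bmatrix} \epsilon^1 \\ \epsilon^2 \\ \vdots \\ \epsilon^5 \end{bmatrix} \qquad (6.7) \]
where $y_1$ is $29 \times 1$, $X_1$ is $29 \times 5$, $\beta^j$ is the $5 \times 1$ vector of coefficients for the $j$th subsample, and $\epsilon^j$ is the $29 \times 1$ vector of errors for the $j$th subsample.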
tave program Restri
tions/ChowTest.m estimates the above model. It also tests
the hypothesis that the ve subsamples share the same parameter ve
tor, or in other words,
that there is
oe
ient stability a
ross the ve subsamples. The null to test is that the
parameter ve
tors for the separate groups are all the same, that is,
1 = 2 = 3 = 4 = 5
This type of test, that parameters are
onstant a
ross dierent sets of data, is sometimes
referred to as a
Chow test.
There are 20 restri tions. If that's not lear to you, look at the O tave program.
The restri tions are reje ted at all onventional signi an e levels.
Sin
e the restri
tions are reje
ted, we should probably use the unrestri
ted model for
analysis. What is the pattern of RTS as a fun
tion of the output group (small to large)?
Figure 6.2 plots RTS. We
an see that there is in
reasing RTS for small rms, but that
RTS is approximately
onstant for large rms.
[Figure 6.2: RTS plotted against output group]
model is
\[ \ln C_i = \beta_1^j + \beta_2^j\ln Q_i + \beta_3\ln P_{Li} + \beta_4\ln P_{Fi} + \beta_5\ln P_{Ki} + \epsilon_i \qquad (6.8) \]
(a) estimate this model by OLS, giving t-statistics for tests of significance, and the associated p-values. Interpret the results in detail.
(b) Test the restrictions implied by this model (relative to the model that lets all coefficients vary across groups) using the F, qF, Wald, score and likelihood ratio tests. Comment on the results.
(c) Estimate this model but imposing the HOD1 restriction, using an OLS estimation program.
[
RT
S =
1
cq . Apply the
delta method to
al
ulate the estimated standard error for estimated RTS. Dire
tly
test
H0 : RT S = 1
HA : Q 6= 1.
versus
HA : RT S 6= 1
H0 : Q = 1
versus
3. Perform a Monte Carlo study that generates data from the model
\[ y = 2 + 1x_2 + 1x_3 + \epsilon \]
where the sample size is 30, $x_2$ and $x_3$ are independently uniformly distributed on $[0,1]$ and $\epsilon \sim IIN(0,1)$.
(a) Compare the means and standard errors of the estimated coefficients using OLS and restricted OLS, imposing the restriction that $\beta_2 + \beta_3 = 2$.
(b) Compare the means and standard errors of the estimated coefficients using OLS and restricted OLS, imposing the restriction that $\beta_2 + \beta_3 = 1$.
(c) Discuss the results.
Chapter 7
Generalized least squares
One of the assumptions we've made up to now is that
\[ \epsilon_t \sim IID(0, \sigma^2), \]
or occasionally
\[ \epsilon_t \sim IIN(0, \sigma^2). \]
Now we'll investigate the consequences of nonidentically and/or dependently distributed errors. We'll assume fixed regressors for now, relaxing this admittedly unrealistic assumption later. The model is
\[ y = X\beta + \epsilon \]
\[ E(\epsilon) = 0 \]
\[ V(\epsilon) = \Sigma \]
where $\Sigma$ is a general symmetric positive definite matrix (we'll write $\beta$ in place of $\beta_0$ to simplify the typing of these notes).
The case where $\Sigma$ is a diagonal matrix gives uncorrelated, nonidentically distributed errors. This is known as heteroscedasticity.
The case where $\Sigma$ has the same number on the main diagonal but nonzero elements off the main diagonal gives identically (assuming higher moments are also the same) dependently distributed errors. This is known as autocorrelation.
The general case combines heteroscedasticity and autocorrelation. This is known as nonspherical disturbances, though why this term is used, I have no idea. Perhaps it's because under the classical assumptions, a joint confidence region for $\epsilon$ would be an $n$-dimensional hypersphere.
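The least squares estimator is
\[
\begin{aligned}
\hat{\beta} &= (X'X)^{-1}X'y \\
&= \beta + (X'X)^{-1}X'\epsilon
\end{aligned}
\]
We have unbiasedness, as before.
The variance of $\hat{\beta}$ is
\[
\begin{aligned}
E\left[(\hat{\beta} - \beta)(\hat{\beta} - \beta)'\right] &= E\left[(X'X)^{-1}X'\epsilon\epsilon'X(X'X)^{-1}\right] \\
&= (X'X)^{-1}X'\Sigma X(X'X)^{-1} \qquad (7.1)
\end{aligned}
\]
Due to this, any test statistic that is based upon an estimator of $\sigma^2$ is invalid, since there isn't any $\sigma^2$; it doesn't exist as a feature of the true d.g.p. In particular, the formulas for the $t$, $F$, $\chi^2$ based tests given above do not lead to statistics with these distributions.
$\hat{\beta}$ is still consistent, following exactly the same argument given before.
If $\epsilon$ is normally distributed, then
\[ \hat{\beta} \sim N\left(\beta, (X'X)^{-1}X'\Sigma X(X'X)^{-1}\right). \]
The problem is that $\Sigma$ is unknown in general, so this distribution won't be useful for testing hypotheses.
Without normality, we still have
\[
\begin{aligned}
\sqrt{n}\left(\hat{\beta} - \beta\right) &= \sqrt{n}(X'X)^{-1}X'\epsilon \\
&= \left(\frac{X'X}{n}\right)^{-1}n^{-1/2}X'\epsilon
\end{aligned}
\]
Define the limiting variance of $n^{-1/2}X'\epsilon$ as
\[ \lim_{n\to\infty} E\left(\frac{X'\epsilon\epsilon'X}{n}\right) = \Omega \]
so we obtain
\[ \sqrt{n}\left(\hat{\beta} - \beta\right) \overset{d}{\to} N\left(0, Q_X^{-1}\Omega Q_X^{-1}\right). \]
Summary: OLS with heteroscedasticity and/or autocorrelation is:
unbiased in the same circumstances in which the estimator is unbiased with iid errors
has a different variance than before, so the previous test statistics aren't valid
is consistent
is asymptotically normally distributed, but with a different limiting covariance matrix. Previous test statistics aren't valid in this case for this reason.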
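Suppose $\Sigma$ were known. Then one could form the Cholesky decomposition
\[ P'P = \Sigma^{-1}. \]
Here, $P$ is an upper triangular matrix. We have
\[ P'P\Sigma = I_n \]
so
\[ P'P\Sigma P' = P', \]
which implies that
\[ P\Sigma P' = I_n. \]
Consider the model
\[ Py = PX\beta + P\epsilon, \]
or, making the obvious definitions,
\[ y^* = X^*\beta + \epsilon^*. \]
The variance of $\epsilon^* = P\epsilon$ is
\[ E(P\epsilon\epsilon'P') = P\Sigma P' = I_n. \]
Therefore, the model
\[ y^* = X^*\beta + \epsilon^* \]
\[ E(\epsilon^*) = 0 \]
\[ V(\epsilon^*) = I_n \]
satisfies the classical assumptions. The GLS estimator is simply OLS applied to the transformed model:
\[
\begin{aligned}
\hat{\beta}_{GLS} &= (X^{*\prime}X^*)^{-1}X^{*\prime}y^* \\
&= (X'P'PX)^{-1}X'P'Py \\
&= (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y
\end{aligned}
\]
The GLS estimator is unbiased in the same circumstances under which the OLS estimator is unbiased. For example, assuming $X$ is nonstochastic,
\[
\begin{aligned}
E(\hat{\beta}_{GLS}) &= E\left\{(X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y\right\} \\
&= E\left\{(X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}(X\beta + \epsilon)\right\} \\
&= \beta.
\end{aligned}
\]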
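The equivalence between OLS on the transformed model and the closed form $(X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y$ can be checked numerically. The sketch below assumes, for simplicity, a diagonal $\Sigma$ with known variances, in which case $P = \mathrm{diag}(1/\sigma_t)$ satisfies $P'P = \Sigma^{-1}$. The function names and the simulated data are illustrative assumptions, not part of the notes.

```python
import random


def solve2(a, b):
    """Solve the 2x2 linear system a z = b."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(a[1][1] * b[0] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def ols(X, y):
    """OLS for a two-column regressor matrix given as a list of rows."""
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(2)] for i in range(2)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(2)]
    return solve2(xtx, xty)


def gls_by_transformation(X, y, sigma2):
    """OLS on P y = P X b + P e, with P = diag(1 / sigma_t)."""
    Xs = [[v / s2 ** 0.5 for v in r] for r, s2 in zip(X, sigma2)]
    ys = [yi / s2 ** 0.5 for yi, s2 in zip(y, sigma2)]
    return ols(Xs, ys)


def gls_closed_form(X, y, sigma2):
    """(X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y for diagonal Sigma."""
    a = [[sum(r[i] * r[j] / s2 for r, s2 in zip(X, sigma2)) for j in range(2)]
         for i in range(2)]
    b = [sum(r[i] * yi / s2 for r, yi, s2 in zip(X, y, sigma2)) for i in range(2)]
    return solve2(a, b)


# Hypothetical heteroscedastic data: the error variance grows with the regressor.
rng = random.Random(5)
n = 100
x = [rng.uniform(1, 5) for _ in range(n)]
sigma2 = [a ** 2 for a in x]
X = [[1.0, a] for a in x]
y = [1.0 + 0.5 * a + rng.gauss(0, s2 ** 0.5) for a, s2 in zip(x, sigma2)]

b_transformed = gls_by_transformation(X, y, sigma2)
b_closed = gls_closed_form(X, y, sigma2)
print(b_transformed)
print(b_closed)
```

The two printed coefficient vectors agree up to floating point error, as the algebra in this section implies.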
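The variance of the estimator, conditional on $X$, can be calculated using
\[
\begin{aligned}
\hat{\beta}_{GLS} &= (X^{*\prime}X^*)^{-1}X^{*\prime}y^* \\
&= (X^{*\prime}X^*)^{-1}X^{*\prime}(X^*\beta + \epsilon^*) \\
&= \beta + (X^{*\prime}X^*)^{-1}X^{*\prime}\epsilon^*
\end{aligned}
\]
so
\[
\begin{aligned}
E\left[(\hat{\beta}_{GLS} - \beta)(\hat{\beta}_{GLS} - \beta)'\right] &= E\left[(X^{*\prime}X^*)^{-1}X^{*\prime}\epsilon^*\epsilon^{*\prime}X^*(X^{*\prime}X^*)^{-1}\right] \\
&= (X^{*\prime}X^*)^{-1}X^{*\prime}X^*(X^{*\prime}X^*)^{-1} \\
&= (X^{*\prime}X^*)^{-1} \\
&= (X'\Sigma^{-1}X)^{-1}
\end{aligned}
\]
Either of these last formulas can be used.
All the previous results regarding the desirable properties of the least squares estimator hold, when dealing with the transformed model, since the transformed model satisfies the classical assumptions.
Tests are valid, using the previous formulas, as long as we substitute $X^*$ in place of $X$. Furthermore, any test that involves $\sigma^2$ can set it to 1. This is preferable to re-deriving the appropriate formulas.
The GLS estimator is more efficient than the OLS estimator. This is a consequence of the Gauss-Markov theorem, since the GLS estimator is based on a model that satisfies the classical assumptions but the OLS estimator is not. To see this directly, note that
\[ Var(\hat{\beta}) - Var(\hat{\beta}_{GLS}) = (X'X)^{-1}X'\Sigma X(X'X)^{-1} - (X'\Sigma^{-1}X)^{-1} = A\Sigma A' \]
where
\[ A = \left[(X'X)^{-1}X' - (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}\right]. \]
Since $A\Sigma A'$ is a quadratic form in a positive definite matrix, we conclude that $A\Sigma A'$ is positive semidefinite, and that GLS is efficient relative to OLS.
As one can verify by calculating the fonc, the GLS estimator is the solution to the minimization problem
\[ \hat{\beta}_{GLS} = \arg\min\,(y - X\beta)'\Sigma^{-1}(y - X\beta) \]
so the metric $\Sigma^{-1}$ is used to weight the residuals.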
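The problem is that $\Sigma$ isn't known usually, so this estimator isn't available. Consider the dimension of $\Sigma$: it's an $n \times n$ matrix with $(n^2 - n)/2 + n = (n^2 + n)/2$ unique elements. The number of parameters to estimate is larger than $n$ and increases faster than $n$. There's no way to devise an estimator that satisfies a LLN without adding restrictions.
The feasible GLS estimator is based upon making sufficient assumptions regarding the form of $\Sigma$ so that a consistent estimator can be devised. Suppose that we parameterize $\Sigma$ as a function of $X$ and $\theta$, where $\theta$ may include $\beta$ as well as other parameters, so that
\[ \Sigma = \Sigma(X,\theta) \]
where $\theta$ is of fixed dimension. If we can consistently estimate $\theta$, we can consistently estimate $\Sigma$, as long as $\Sigma(X,\theta)$ is a continuous function of $\theta$. In this case,
\[ \hat{\Sigma} = \Sigma(X,\hat{\theta}) \overset{p}{\to} \Sigma(X,\theta). \]
If we replace $\Sigma$ in the formulas for the GLS estimator with $\hat{\Sigma}$, we obtain the FGLS estimator. The FGLS estimator shares the same asymptotic properties as GLS. These are
1. Consistency
2. Asymptotic normality
3. Asymptotic efficiency if the errors are normally distributed
In practice, the usual way to proceed is
1. Define a consistent estimator of $\theta$.
2. Form $\hat{\Sigma} = \Sigma(X,\hat{\theta})$.
3. Calculate the Cholesky factorization $\hat{P} = Chol(\hat{\Sigma}^{-1})$.
4. Transform the model using $\hat{P}y = \hat{P}X\beta + \hat{P}\epsilon$.
5. Estimate using OLS on the transformed model.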
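Heteroscedasticity is the case where
\[ E(\epsilon\epsilon') = \Sigma \]
is a diagonal matrix, so that the errors are uncorrelated, but have different variances. Heteroscedasticity is usually thought of as associated with cross sectional data, though there is absolutely no reason why time series data cannot also be heteroscedastic. Actually, the popular ARCH (autoregressive conditionally heteroscedastic) models explicitly assume that a time series is heteroscedastic.
Consider a supply function
\[ q_i = \beta_1 + \beta_p P_i + \beta_s S_i + \epsilon_i \]
where $P_i$ is price and $S_i$ is some measure of size of the $i$th firm. One might suppose that unobservable factors (e.g., talent of managers, degree of coordination between production units, etc.) account for the error term $\epsilon_i$. If there is more variability in these factors for large firms than for small firms, then $\epsilon_i$ may have a higher variance when $S_i$ is high than when it is low.
Another example, individual demand.
\[ q_i = \beta_1 + \beta_p P_i + \beta_m M_i + \epsilon_i \]
where $P$ is price and $M$ is income. In this case, $\epsilon_i$ can reflect variations in preferences. There are more possibilities for expression of preferences when one is rich, so it is possible that the variance of $\epsilon_i$ could be higher when $M$ is high.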
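The limiting distribution of the OLS estimator is
$$\sqrt{n}\left(\hat\beta - \beta\right) \overset{d}{\to} N\left(0, Q_X^{-1}\Omega Q_X^{-1}\right)$$
where
$$\Omega = \lim_{n\to\infty} E\left(\frac{X'\varepsilon\varepsilon'X}{n}\right).$$
This matrix has dimension $K \times K$, and we can estimate it consistently. The consistent estimator, under heteroscedasticity but no autocorrelation, is
$$\hat\Omega = \frac{1}{n}\sum_{t=1}^{n} x_t x_t' \hat\varepsilon_t^2$$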
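One can then modify the previous test statistics to obtain tests that are valid when there is heteroscedasticity of unknown form. For example, the Wald test for $H_0: R\beta - r = 0$ would be
$$n\left(R\hat\beta - r\right)'\left[ R\left(\frac{X'X}{n}\right)^{-1}\hat\Omega\left(\frac{X'X}{n}\right)^{-1}R' \right]^{-1}\left(R\hat\beta - r\right) \overset{a}{\sim} \chi^2(q)$$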
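Goldfeld-Quandt
The sample is divided into three parts, with $n_1$, $n_2$ and $n_3$ observations, where $n_1 + n_2 + n_3 = n$. The model is estimated separately using the first and third parts of the sample, so that the two sets of residuals are independent. Then we have
$$\frac{\hat\varepsilon_1'\hat\varepsilon_1}{\sigma^2} = \frac{\varepsilon_1' M_1 \varepsilon_1}{\sigma^2} \overset{d}{\to} \chi^2(n_1 - K)$$
and
$$\frac{\hat\varepsilon_3'\hat\varepsilon_3}{\sigma^2} = \frac{\varepsilon_3' M_3 \varepsilon_3}{\sigma^2} \overset{d}{\to} \chi^2(n_3 - K)$$
so
$$\frac{\hat\varepsilon_1'\hat\varepsilon_1/(n_1 - K)}{\hat\varepsilon_3'\hat\varepsilon_3/(n_3 - K)} \overset{d}{\to} F(n_1 - K, n_3 - K).$$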
The distributional result is exact if the errors are normally distributed. This test is a two-tailed test. Alternatively, and probably more conventionally, if one has prior ideas about the possible magnitudes of the variances of the observations, one could order the observations accordingly, from largest to smallest. In this case, one would use a conventional one-tailed F-test.
Draw pi ture.
• Ordering the observations is an important step if the test is to have any power.
• The motive for dropping the middle observations is to increase the difference between the average variance in the subsamples, supposing that there exists heteroscedasticity. This can increase the power of the test. On the other hand, dropping too many observations will increase the variance of the statistics $\hat\varepsilon_1'\hat\varepsilon_1$ and $\hat\varepsilon_3'\hat\varepsilon_3$. A rule of thumb, based on Monte Carlo experiments, is to drop around 25% of the observations.
• If one doesn't have any ideas about the form of the heteroscedasticity, the test will probably have low power, since a sensible data ordering isn't available.
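The Goldfeld-Quandt ratio described above can be sketched in a few lines. This is a minimal illustration, not the code of GLS/HetTests.m: it assumes a model with a constant and a single regressor (so $K = 2$), data that are already ordered by suspected variance, and made-up numbers. The function names are hypothetical.

```python
def ssr_simple_ols(x, y):
    """Sum of squared residuals from regressing y on a constant and x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))


def goldfeld_quandt(x, y, n1, n3, k=2):
    """GQ ratio from the first n1 and last n3 observations (middle ones dropped)."""
    ssr1 = ssr_simple_ols(x[:n1], y[:n1])
    ssr3 = ssr_simple_ols(x[-n3:], y[-n3:])
    return (ssr1 / (n1 - k)) / (ssr3 / (n3 - k))


# Illustrative data: errors are larger in the first third than in the last third.
x = [float(i) for i in range(1, 13)]
errors = [2, -2, -2, 2, 0, 0, 0, 0, 1, -1, -1, 1]
y = [xi + ei for xi, ei in zip(x, errors)]
gq = goldfeld_quandt(x, y, n1=4, n3=4)
```

Under the null the ratio would be compared with an $F(n_1 - K, n_3 - K)$ critical value; computing that critical value is left out of the sketch.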
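White's test
When one has little idea if there exists heteroscedasticity, and no idea of its potential form, the White test is a possibility. The idea is that if there is homoscedasticity, then
$$E(\varepsilon_t^2 | x_t) = \sigma^2, \forall t$$
so that $x_t$ or functions of $x_t$ shouldn't help to explain $E(\varepsilon_t^2)$. The test works as follows:
1. Since $\varepsilon_t$ isn't available, use the consistent estimator $\hat\varepsilon_t$ instead.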
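2. Regress
$$\hat\varepsilon_t^2 = \sigma^2 + z_t'\gamma + v_t$$
where $z_t$ is a $P$-vector. $z_t$ may include some or all of the variables in $x_t$, as well as other variables, such as squares and cross products of the variables in $x_t$.
3. Test the hypothesis that $\gamma = 0$. The $qF$ statistic in this case is
$$qF = \frac{P\left(ESS_R - ESS_U\right)/P}{ESS_U/(n - P - 1)}$$
Note that $ESS_R = TSS_U$, so dividing both numerator and denominator by this we get
$$qF = (n - P - 1)\frac{R^2}{1 - R^2}$$
Note that this is the $R^2$ of the artificial regression used to test for heteroscedasticity, not the $R^2$ of the original model.
An asymptotically equivalent statistic, under the null of no heteroscedasticity (so that $R^2$ should tend to zero), is
$$nR^2 \overset{a}{\sim} \chi^2(P).$$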
This doesn't require normality of the errors, though it does assume that the fourth moment of $\varepsilon_t$ is constant, under the null. Question: why is this necessary?
• The White test has the disadvantage that it may not be very powerful unless the $z_t$ vector is chosen well, and this is hard to do without knowledge of the form of the heteroscedasticity.
• It also has the problem that specification errors other than heteroscedasticity may lead to rejection.
• Note: the null hypothesis of this test may be interpreted as $\theta = 0$ for the variance model $V(\varepsilon_t^2) = h(\alpha + z_t'\theta)$, where $h(\cdot)$ is an arbitrary function of unknown form. The test is more general than it may appear from the regression that is used.
if the observations are ordered a ording to the suspe ted form of the heteros edasti ity.
() be supplied, and
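Suppose the model is
$$y_t = x_t'\beta + \varepsilon_t$$
$$\sigma_t^2 = E(\varepsilon_t^2) = \left(z_t'\gamma\right)^{\delta}$$
but the other classical assumptions hold. In this case
$$\varepsilon_t^2 = \left(z_t'\gamma\right)^{\delta} + v_t$$
and $v_t$ has mean zero. Nonlinear least squares could be used to estimate $\gamma$ and $\delta$ consistently, were $\varepsilon_t$ observable. The solution is to use the squared OLS residuals $\hat\varepsilon_t^2$ in place of $\varepsilon_t^2$, since it is consistent by the Slutsky theorem. Once we have $\hat\gamma$ and $\hat\delta$, we can estimate $\sigma_t^2$ consistently using
$$\hat\sigma_t^2 = \left(z_t'\hat\gamma\right)^{\hat\delta} \overset{p}{\to} \sigma_t^2.$$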
In the second step, we transform the model by dividing by the standard deviation:
$$\frac{y_t}{\hat\sigma_t} = \frac{x_t'\beta}{\hat\sigma_t} + \frac{\varepsilon_t}{\hat\sigma_t}$$
or
$$y_t^* = x_t^{*\prime}\beta + \varepsilon_t^*.$$
Asymptotically, this model satisfies the classical assumptions.
This model is a bit complex in that NLS is required to estimate the model of the variance. A simpler version would be
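$$y_t = x_t'\beta + \varepsilon_t$$
$$\sigma_t^2 = E(\varepsilon_t^2) = \sigma^2 z_t^{\delta}$$
where $z_t$ is a single variable. There are still two parameters to be estimated, and the model of the variance is still nonlinear in the parameters. However, the search method can be used in this case to reduce the estimation problem to repeated applications of OLS.
• First, define an interval of reasonable values for $\delta$, e.g., $\delta \in [0, 3]$.
• Partition this interval into equally spaced values $\delta_m$.
• For each of these values, calculate the variable $z_t^{\delta_m}$.
• The regression
$$\hat\varepsilon_t^2 = \sigma^2 z_t^{\delta_m} + v_t$$
is linear in the parameters, conditional on $\delta_m$, so one can estimate $\sigma^2$ by OLS.
• Save the pairs $(\sigma_m^2, \delta_m)$ and the corresponding $ESS_m$. Choose the pair with the minimum $ESS_m$ as the estimate.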
• Draw picture.
• Can refine.
• Works well when the parameter to be searched over is low dimensional, as in this case.
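The search method for the simpler variance model $\sigma_t^2 = \sigma^2 z_t^{\delta}$ can be sketched as follows. This is only an illustration under stated assumptions: the squared residuals and the variable $z_t$ are made-up numbers, the grid step of 0.1 is an arbitrary choice, and the function name is hypothetical. For each grid value, the regression of $\hat\varepsilon_t^2$ on $z_t^{\delta_m}$ (no constant) has a closed-form OLS solution.

```python
def search_delta(resid_sq, z, grid):
    """Grid search over delta; returns (ESS, sigma2, delta) with minimum ESS."""
    best = None
    for delta in grid:
        w = [zt ** delta for zt in z]  # regressor z_t^delta for this grid point
        # OLS without a constant: sigma2 = sum(e2 * w) / sum(w^2)
        sigma2 = sum(e * wi for e, wi in zip(resid_sq, w)) / sum(wi * wi for wi in w)
        ess = sum((e - sigma2 * wi) ** 2 for e, wi in zip(resid_sq, w))
        if best is None or ess < best[0]:
            best = (ess, sigma2, delta)
    return best


# Illustrative squared residuals generated exactly from sigma2 = 0.5, delta = 2.
z = [1.0, 2.0, 3.0, 4.0]
resid_sq = [0.5 * zt ** 2 for zt in z]
grid = [i / 10 for i in range(31)]  # 0.0, 0.1, ..., 3.0
ess_min, sigma2_hat, delta_hat = search_delta(resid_sq, z, grid)
```

With real residuals the minimum ESS will not be zero, and the grid can be refined around the selected value.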
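Consider data on a cross-section of units observed over time (a pooled cross-section time-series model). It may be reasonable to presume that the variance is constant over time within the cross-sectional units, but that it differs across them (e.g., firms or countries of different sizes...). The model is
$$y_{it} = x_{it}'\beta + \varepsilon_{it}$$
$$E(\varepsilon_{it}^2) = \sigma_i^2, \forall t$$
where $i = 1, 2, ..., G$ indexes the cross-sectional units and $t = 1, 2, ..., n$ indexes the observations on each unit.
• The variance $\sigma_i^2$ is specific to each unit, but constant over the $n$ observations for that unit.
• We assume that $E(\varepsilon_{it}\varepsilon_{is}) = 0$. This is a strong assumption that we'll relax later.
To correct for heteroscedasticity, estimate each $\sigma_i^2$ using
$$\hat\sigma_i^2 = \frac{1}{n}\sum_{t=1}^{n}\hat\varepsilon_{it}^2$$
We use $1/n$ here since it is possible that there are more than $n$ regressors, so $n - K$ could be negative. With each of these, transform the model:
$$\frac{y_{it}}{\hat\sigma_i} = \frac{x_{it}'\beta}{\hat\sigma_i} + \frac{\varepsilon_{it}}{\hat\sigma_i}$$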
Do this for each cross-sectional group. This transformed model satisfies the classical assumptions, asymptotically.
In what follows, we use the model with the constant and output coefficient varying across 5 groups, but with the input price coefficients fixed (see Equation 6.8 for the rationale behind this). Figure 7.1, which is generated by the Octave program GLS/NerloveResiduals.m, plots the residuals. We can see pretty clearly that the error variance is larger for small firms than for larger firms.
[Figure 7.1: plot of the residuals, generated by GLS/NerloveResiduals.m]
Now let's try out some tests to formally check for heteroscedasticity. The Octave program GLS/HetTests.m performs the White and Goldfeld-Quandt tests, using the above model. The results are

                 Value    p-value
White's test     61.903   0.000
GQ test          10.886   0.000
All in all, it is very clear that the data are heteroscedastic. That means that OLS estimation is not efficient, and tests of restrictions that ignore heteroscedasticity are not valid. The previous tests (CRTS, HOD1 and the Chow test) were calculated assuming homoscedasticity. The Octave program GLS/NerloveRestrictions-Het.m uses the Wald test to check[1] for CRTS and HOD1, but using a heteroscedastic-consistent covariance estimator. The results are

Testing HOD1
                 Value    p-value
Wald test        6.161    0.013

Testing CRTS
                 Value    p-value
Wald test        20.169   0.001
[1] By the way, notice that GLS/NerloveResiduals.m and GLS/HetTests.m use the restricted LS estimator directly to restrict the fully general model with all coefficients varying to the model with only the constant and the output coefficient varying. But GLS/NerloveRestrictions-Het.m estimates the model by substituting the restrictions into the model. The methods are equivalent, but the second is more convenient and easier to understand.
We see that the previous conclusions are altered - both CRTS and HOD1 are rejected at the 5% level. Maybe the rejection of HOD1 is due to the Wald test's tendency to over-reject?
From the previous plot, it seems that the variance of $\varepsilon$ is a decreasing function of output. Suppose that the 5 size groups have different error variances (heteroscedasticity by groups):
$$Var(\varepsilon_i) = \sigma_j^2,$$
where $j = 1$ if $i = 1, 2, ..., 29$, etc. The model is then estimated using GLS (through a transformation of the model so that OLS can be applied). The estimation results are
*********************************************************
OLS estimation results
Observations 145
R-squared 0.958822
Sigma-squared 0.090800

Results (Het. consistent var-cov estimator)

             estimate   st.err.   t-stat.   p-value
constant1      -1.046     1.276    -0.820     0.414
constant2      -1.977     1.364    -1.450     0.149
constant3      -3.616     1.656    -2.184     0.031
constant4      -4.052     1.462    -2.771     0.006
constant5      -5.308     1.586    -3.346     0.001
output1         0.391     0.090     4.363     0.000
output2         0.649     0.090     7.184     0.000
output3         0.897     0.134     6.688     0.000
output4         0.962     0.112     8.612     0.000
output5         1.101     0.090    12.237     0.000
labor           0.007     0.208     0.032     0.975
fuel            0.498     0.081     6.149     0.000
capital        -0.460     0.253    -1.818     0.071
*********************************************************
*********************************************************
OLS estimation results
Observations 145
R-squared 0.987429
Sigma-squared 1.092393

Results (Het. consistent var-cov estimator)

             estimate   st.err.   t-stat.   p-value
constant1      -1.580     0.917    -1.723     0.087
constant2      -2.497     0.988    -2.528     0.013
constant3      -4.108     1.327    -3.097     0.002
constant4      -4.494     1.180    -3.808     0.000
constant5      -5.765     1.274    -4.525     0.000
output1         0.392     0.090     4.346     0.000
output2         0.648     0.094     6.917     0.000
output3         0.892     0.138     6.474     0.000
output4         0.951     0.109     8.755     0.000
output5         1.093     0.086    12.684     0.000
labor           0.103     0.141     0.733     0.465
fuel            0.492     0.044    11.294     0.000
capital        -0.366     0.165    -2.217     0.028
*********************************************************
Testing HOD1
                 Value    p-value
Wald test        9.312    0.002
The first panel of output contains the OLS estimation results, which are used to consistently estimate the $\sigma_j^2$. The second panel contains the GLS estimation results. Some comments:
• The $R^2$ measures are not comparable - the dependent variables are not the same. The measure for the GLS results uses the transformed dependent variable. One could calculate a comparable $R^2$ measure.
• The differences in estimated standard errors can be interpreted as evidence of improved efficiency of GLS, since the OLS standard errors are calculated using the Huber-White estimator. They would not be comparable if the ordinary (inconsistent) estimator had been used.
• Note that the previously noted pattern in the output coefficients persists. The nonconstant CRTS result is robust.
• The coefficient on capital is now negative and significant at the 3% level. That seems to indicate some kind of problem with the model or the data, or economic theory.
• Note that HOD1 is now rejected. Problem of Wald test over-rejecting? Specification error in model?
For example, a shock to oil prices will simultaneously affect all countries, so one could expect contemporaneous correlation of macroeconomic variables across countries.
7.5.1 Causes
Autocorrelation is the existence of correlation across the error term:
$$E(\varepsilon_t\varepsilon_s) \neq 0, \; t \neq s.$$
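Why might this occur? Plausible explanations include
• Lags in adjustment to shocks. In a model such as
$$y_t = x_t'\beta + \varepsilon_t,$$
one could interpret $x_t'\beta$ as the equilibrium value. Suppose $x_t$ is constant over a number of observations. One can interpret $\varepsilon_t$ as a shock that moves the system away from equilibrium. If the time needed to return to equilibrium is long with respect to the observation frequency, one could expect $\varepsilon_{t+1}$ to be positive, conditional on $\varepsilon_t$ positive, which induces a correlation.
• Misspecification of the model. Suppose that the DGP is
$$y_t = \beta_0 + \beta_1 x_t + \beta_2 x_t^2 + \varepsilon_t$$
but we estimate
$$y_t = \beta_0 + \beta_1 x_t + \varepsilon_t$$
The effects are illustrated in Figure 7.2.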
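7.5.3 AR(1)
There are many types of autocorrelation. We'll consider two examples. The first is the most commonly encountered case: autoregressive order 1 (AR(1)) errors. The model is
$$y_t = x_t'\beta + \varepsilon_t$$
$$\varepsilon_t = \rho\varepsilon_{t-1} + u_t$$
$$u_t \sim iid(0, \sigma_u^2)$$
$$E(\varepsilon_t u_s) = 0, \; t < s$$
We assume that the model satisfies the other classical assumptions.
• We need a stationarity assumption: $|\rho| < 1$. Otherwise the variance of $\varepsilon_t$ explodes as $t$ increases.
• By recursive substitution we obtain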
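$$\varepsilon_t = \rho\varepsilon_{t-1} + u_t$$
$$= \rho\left(\rho\varepsilon_{t-2} + u_{t-1}\right) + u_t$$
$$= \rho^2\varepsilon_{t-2} + \rho u_{t-1} + u_t$$
$$= \rho^2\left(\rho\varepsilon_{t-3} + u_{t-2}\right) + \rho u_{t-1} + u_t$$
In the limit the lagged $\varepsilon$ drops out, since $\rho^m \to 0$ as $m \to \infty$, so we obtain
$$\varepsilon_t = \sum_{m=0}^{\infty}\rho^m u_{t-m}$$
With this, the variance of $\varepsilon_t$ is found as
$$E(\varepsilon_t^2) = \sigma_u^2\sum_{m=0}^{\infty}\rho^{2m} = \frac{\sigma_u^2}{1 - \rho^2}$$
so
$$V(\varepsilon_t) = \frac{\sigma_u^2}{1 - \rho^2}.$$
The variance is the $0$th order autocovariance: $\gamma_0 = V(\varepsilon_t)$.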
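• Likewise, the first order autocovariance $\gamma_1$ is
$$Cov(\varepsilon_t, \varepsilon_{t-1}) = \gamma_1 = E\left((\rho\varepsilon_{t-1} + u_t)\varepsilon_{t-1}\right) = \rho V(\varepsilon_t) = \frac{\rho\sigma_u^2}{1 - \rho^2}$$
• Using the same method, we find that for $s < t$
$$Cov(\varepsilon_t, \varepsilon_{t-s}) = \gamma_s = \frac{\rho^s\sigma_u^2}{1 - \rho^2}$$
• The autocovariances don't depend on $t$: the process $\{\varepsilon_t\}$ is covariance stationary.
The correlation (in general, for random variables $x$ and $y$) is defined as
$$corr(x, y) = \frac{cov(x, y)}{se(x)\,se(y)}$$
but in this case, the two standard errors are the same, so the $s$-order autocorrelation $\rho_s$ is
$$\rho_s = \rho^s$$
All this means that the overall matrix $\Sigma$ has the form
$$\Sigma = \underbrace{\frac{\sigma_u^2}{1 - \rho^2}}_{\text{variance}} \underbrace{\begin{bmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{n-1} \\ \rho & 1 & \rho & \cdots & \rho^{n-2} \\ \vdots & & \ddots & & \vdots \\ \rho^{n-1} & \rho^{n-2} & \cdots & \rho & 1 \end{bmatrix}}_{\text{correlation matrix}}$$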
So we have homoscedasticity, but elements off the main diagonal are not zero. All of this depends only on two parameters, $\rho$ and $\sigma_u^2$. If we can estimate these consistently, we can apply FGLS.
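The AR(1) variance and autocovariance formulas can be checked numerically. The sketch below is illustrative only: the parameter values are arbitrary, the function names are hypothetical, and the infinite sum is truncated at 200 terms.

```python
def ar1_autocovariance(rho, sigma_u2, s):
    """gamma_s = rho^s * sigma_u^2 / (1 - rho^2), valid for |rho| < 1."""
    return rho ** s * sigma_u2 / (1.0 - rho ** 2)


def ar1_variance_truncated(rho, sigma_u2, terms=200):
    """Truncated version of sigma_u^2 * sum_{m>=0} rho^(2m)."""
    return sigma_u2 * sum(rho ** (2 * m) for m in range(terms))


def ar1_cov_matrix(rho, sigma_u2, n):
    """n x n covariance matrix with (i, j) element gamma_{|i-j|}."""
    return [[ar1_autocovariance(rho, sigma_u2, abs(i - j)) for j in range(n)]
            for i in range(n)]


rho, sigma_u2 = 0.5, 1.0
gamma0 = ar1_autocovariance(rho, sigma_u2, 0)
gamma0_sum = ar1_variance_truncated(rho, sigma_u2)
Sigma = ar1_cov_matrix(rho, sigma_u2, 3)
```

The diagonal of `Sigma` is constant (homoscedasticity), and the off-diagonal elements decay geometrically at rate `rho`.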
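It turns out that it's easy to estimate these consistently. The steps are
1. Estimate the model $y_t = x_t'\beta + \varepsilon_t$ by OLS.
2. Take the residuals, and estimate the model
$$\hat\varepsilon_t = \rho\hat\varepsilon_{t-1} + u_t^*$$
Since $\hat\varepsilon_t \overset{p}{\to} \varepsilon_t$, this regression is asymptotically equivalent to the regression
$$\varepsilon_t = \rho\varepsilon_{t-1} + u_t$$
which satisfies the classical assumptions. Therefore, $\hat\rho$ obtained by applying OLS to $\hat\varepsilon_t = \rho\hat\varepsilon_{t-1} + u_t^*$ is consistent. Also, since $u_t^* \overset{p}{\to} u_t$, the estimator
$$\hat\sigma_u^2 = \frac{1}{n}\sum_{t=2}^{n}\left(\hat u_t^*\right)^2 \overset{p}{\to} \sigma_u^2$$
3. With the consistent estimators $\hat\sigma_u^2$ and $\hat\rho$, form $\hat\Sigma = \Sigma(\hat\sigma_u^2, \hat\rho)$ using the previous structure of $\Sigma$, and estimate by FGLS. Actually, one can omit the factor $\hat\sigma_u^2/(1 - \hat\rho^2)$, since it cancels out in the formula
$$\hat\beta_{FGLS} = \left(X'\hat\Sigma^{-1}X\right)^{-1}\left(X'\hat\Sigma^{-1}y\right).$$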
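• One can iterate the process, by taking the first FGLS estimator of $\beta$, re-estimating $\rho$ and $\sigma_u^2$, etc. If one iterates to convergence it's equivalent to MLE (supposing normal errors).
• An asymptotically equivalent approach is to simply estimate the transformed model
$$y_t - \hat\rho y_{t-1} = (x_t - \hat\rho x_{t-1})'\beta + u_t^*$$
using $n - 1$ observations (since $y_0$ and $x_0$ aren't available). This is the method of Cochrane and Orcutt. Dropping the first observation is asymptotically irrelevant, but it can matter in small samples. One can recover the first observation by putting
$$y_1^* = y_1\sqrt{1 - \hat\rho^2}$$
$$x_1^* = x_1\sqrt{1 - \hat\rho^2}$$
This somewhat odd-looking result is related to the Cholesky factorization of $\Sigma^{-1}$. See Davidson and MacKinnon, pg. 348-49 for more discussion. Note that the variance of $y_1^*$ is $\sigma_u^2$, asymptotically, so the transformed model will be homoscedastic (and nonautocorrelated, since the $u$'s are uncorrelated with the $y$'s in different time periods).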
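The two computational pieces of the AR(1) correction, estimating $\rho$ from the OLS residuals and quasi-differencing the data, can be sketched as follows. This is a hedged illustration, not the code used for the examples in this chapter: the residual series is made up, the function names are hypothetical, and the regression of $\hat\varepsilon_t$ on $\hat\varepsilon_{t-1}$ is run without a constant.

```python
import math


def estimate_rho(resid):
    """OLS slope (no constant) from regressing e_t on e_{t-1}."""
    num = sum(resid[t] * resid[t - 1] for t in range(1, len(resid)))
    den = sum(resid[t - 1] ** 2 for t in range(1, len(resid)))
    return num / den


def quasi_difference(series, rho, keep_first=True):
    """Return z_t - rho * z_{t-1}; optionally keep a rescaled first observation."""
    out = [series[t] - rho * series[t - 1] for t in range(1, len(series))]
    if keep_first:
        # First observation recovered by scaling with sqrt(1 - rho^2).
        out.insert(0, series[0] * math.sqrt(1.0 - rho ** 2))
    return out


resid = [1.0, 0.5, 0.25, 0.125]          # illustrative residuals
rho_hat = estimate_rho(resid)
dropped = quasi_difference(resid, rho_hat, keep_first=False)  # n - 1 values
kept = quasi_difference(resid, rho_hat, keep_first=True)      # n values
```

Applying `quasi_difference` to $y$ and to each column of $X$, and then running OLS on the transformed data, gives the estimator described in the text.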
7.5.4 MA(1)
The linear regression model with moving average order 1 errors is
$$y_t = x_t'\beta + \varepsilon_t$$
$$\varepsilon_t = u_t + \phi u_{t-1}$$
$$u_t \sim iid(0, \sigma_u^2)$$
$$E(\varepsilon_t u_s) = 0, \; t < s$$
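In this case,
$$V(\varepsilon_t) = \gamma_0 = E\left[(u_t + \phi u_{t-1})^2\right] = \sigma_u^2 + \phi^2\sigma_u^2 = \sigma_u^2(1 + \phi^2)$$
Similarly
$$\gamma_1 = E\left[(u_t + \phi u_{t-1})(u_{t-1} + \phi u_{t-2})\right] = \phi\sigma_u^2$$
and
$$\gamma_2 = E\left[(u_t + \phi u_{t-1})(u_{t-2} + \phi u_{t-3})\right] = 0$$
so in this case
$$\Sigma = \sigma_u^2\begin{bmatrix} 1 + \phi^2 & \phi & 0 & \cdots & 0 \\ \phi & 1 + \phi^2 & \phi & & \vdots \\ 0 & \phi & \ddots & \ddots & 0 \\ \vdots & & \ddots & \ddots & \phi \\ 0 & \cdots & 0 & \phi & 1 + \phi^2 \end{bmatrix}$$
Note that the first order autocorrelation is
$$\rho_1 = \frac{\phi\sigma_u^2}{\sigma_u^2(1 + \phi^2)} = \frac{\gamma_1}{\gamma_0} = \frac{\phi}{1 + \phi^2}$$
• This achieves a maximum at $\phi = 1$ and a minimum at $\phi = -1$, and the maximal and minimal autocorrelations are 1/2 and -1/2. Therefore, series that are more strongly autocorrelated can't be MA(1) processes.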
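Again the covariance matrix has a simple structure that depends on only two parameters. The problem in this case is that one can't estimate $\phi$ using OLS on
$$\hat\varepsilon_t = u_t + \phi u_{t-1}$$
because the $u_t$ are unobservable. However, the parameters can still be estimated.
• Since the model is homoscedastic, we can estimate
$$V(\varepsilon_t) = \sigma_\varepsilon^2 = \sigma_u^2(1 + \phi^2)$$
using the typical estimator: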
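$$\hat\sigma_\varepsilon^2 = \widehat{\sigma_u^2(1 + \phi^2)} = \frac{1}{n}\sum_{t=1}^{n}\hat\varepsilon_t^2$$
• By the Slutsky theorem, we can interpret this as defining an (unidentified) estimator of both $\sigma_u^2$ and $\phi$, e.g., use this as
$$\hat\sigma_u^2\left(1 + \hat\phi^2\right) = \frac{1}{n}\sum_{t=1}^{n}\hat\varepsilon_t^2$$
However, this isn't sufficient to define consistent estimators of the parameters, since it's unidentified.
• To solve this problem, estimate the covariance of $\varepsilon_t$ and $\varepsilon_{t-1}$ using
$$\widehat{Cov}(\varepsilon_t, \varepsilon_{t-1}) = \widehat{\phi\sigma_u^2} = \frac{1}{n}\sum_{t=2}^{n}\hat\varepsilon_t\hat\varepsilon_{t-1}$$
This is a consistent estimator, following a LLN (and given that the epsilon hats are consistent for the epsilons). As above, this can be interpreted as defining an unidentified estimator:
$$\hat\phi\hat\sigma_u^2 = \frac{1}{n}\sum_{t=2}^{n}\hat\varepsilon_t\hat\varepsilon_{t-1}$$
• Now solve these two equations to obtain identified (and therefore consistent) estimators of both $\phi$ and $\sigma_u^2$. Define the consistent estimator
$$\hat\Sigma = \Sigma(\hat\phi, \hat\sigma_u^2)$$
following the form we've seen above, and transform the model using the Cholesky decomposition. The transformed model satisfies the classical assumptions asymptotically.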
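Solving the two moment equations for the MA(1) parameters can be sketched as below. The two equations imply $\rho_1 = \phi/(1 + \phi^2)$, which is a quadratic in $\phi$ with two roots that are reciprocals of each other. The sketch selects the root with $|\phi| \le 1$; this choice of root is an assumption made here for illustration and is not stated in the text. The function name and the numbers are hypothetical.

```python
import math


def ma1_from_moments(gamma0_hat, gamma1_hat):
    """Solve sigma_u^2 (1 + phi^2) = gamma0 and phi sigma_u^2 = gamma1."""
    rho1 = gamma1_hat / gamma0_hat
    if abs(rho1) > 0.5:
        raise ValueError("first-order autocorrelation outside [-1/2, 1/2]: not MA(1)")
    if rho1 == 0.0:
        return 0.0, gamma0_hat
    # Root of rho1 * phi^2 - phi + rho1 = 0 with |phi| <= 1 (assumed choice).
    phi = (1.0 - math.sqrt(1.0 - 4.0 * rho1 ** 2)) / (2.0 * rho1)
    sigma_u2 = gamma1_hat / phi
    return phi, sigma_u2


# Moments implied by phi = 0.5, sigma_u^2 = 2: gamma0 = 2.5, gamma1 = 1.0
phi_hat, sigma_u2_hat = ma1_from_moments(2.5, 1.0)
```

The `ValueError` branch reflects the bound noted earlier: sample autocorrelations beyond 1/2 in absolute value are incompatible with an MA(1) process.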
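When the form of autocorrelation is unknown, the OLS estimator has the limiting distribution
$$\sqrt{n}\left(\hat\beta - \beta\right) \overset{d}{\to} N\left(0, Q_X^{-1}\Omega Q_X^{-1}\right)$$
where, as before, $\Omega$ is
$$\Omega = \lim_{n\to\infty} E\left(\frac{X'\varepsilon\varepsilon'X}{n}\right)$$
Define $m_t = x_t\varepsilon_t$ (recall that $x_t$ is defined as a $K \times 1$ vector). Note that
$$X'\varepsilon = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} = \sum_{t=1}^{n} x_t\varepsilon_t = \sum_{t=1}^{n} m_t$$
so that
$$\Omega = \lim_{n\to\infty}\frac{1}{n}E\left[\left(\sum_{t=1}^{n} m_t\right)\left(\sum_{t=1}^{n} m_t'\right)\right]$$
We assume that $m_t$ is covariance stationary (so that the covariance between $m_t$ and $m_{t-s}$ does not depend on $t$). Define the $v$-th autocovariance of $m_t$ as
$$\Gamma_v = E(m_t m_{t-v}').$$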
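Note that $E(m_t m_{t+v}') = \Gamma_v'$. In general, we expect that:
• $m_t$ will be autocorrelated, since $\varepsilon_t$ is potentially autocorrelated:
$$\Gamma_v = E(m_t m_{t-v}') \neq 0$$
Note that this autocovariance does not depend on $t$, due to covariance stationarity.
• the elements of $m_t$ will be contemporaneously correlated ($E(m_{it}m_{jt}) \neq 0$), since the regressors in $x_t$ will in general be correlated.
• $m_t$ will be heteroscedastic ($E(m_{it}^2) = \sigma_i^2$, which depends upon $i$), since the regressors will have different variances.
While one could estimate $\Omega$ parametrically, we in general have little information upon which to base a parametric specification. Recent research has focused on consistent nonparametric estimators of $\Omega$.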
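Now define
$$\Omega_n = E\left\{\frac{1}{n}\left[\left(\sum_{t=1}^{n} m_t\right)\left(\sum_{t=1}^{n} m_t'\right)\right]\right\}$$
We have (show that the following is true, by expanding sum and shifting rows to left)
$$\Omega_n = \Gamma_0 + \frac{n-1}{n}\left(\Gamma_1 + \Gamma_1'\right) + \frac{n-2}{n}\left(\Gamma_2 + \Gamma_2'\right) + \cdots + \frac{1}{n}\left(\Gamma_{n-1} + \Gamma_{n-1}'\right)$$
The natural, consistent estimator of $\Gamma_v$ is
$$\hat\Gamma_v = \frac{1}{n}\sum_{t=v+1}^{n}\hat m_t\hat m_{t-v}'.$$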
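where
$$\hat m_t = x_t\hat\varepsilon_t$$
(note: one could put $1/(n - v)$ instead of $1/n$ here). So, a natural, but inconsistent, estimator of $\Omega_n$ would be
$$\hat\Omega_n = \hat\Gamma_0 + \frac{n-1}{n}\left(\hat\Gamma_1 + \hat\Gamma_1'\right) + \frac{n-2}{n}\left(\hat\Gamma_2 + \hat\Gamma_2'\right) + \cdots + \frac{1}{n}\left(\hat\Gamma_{n-1} + \hat\Gamma_{n-1}'\right) = \hat\Gamma_0 + \sum_{v=1}^{n-1}\frac{n-v}{n}\left(\hat\Gamma_v + \hat\Gamma_v'\right).$$
This estimator is inconsistent in general, since the number of parameters to estimate is more than the number of observations, and increases more rapidly than $n$, so information does not build up as $n \to \infty$.
On the other hand, supposing that $\Gamma_v$ tends to zero sufficiently rapidly as $v$ tends to $\infty$, a modified estimator
$$\hat\Omega_n = \hat\Gamma_0 + \sum_{v=1}^{q(n)}\left(\hat\Gamma_v + \hat\Gamma_v'\right),$$
where $q(n) \to \infty$ as $n \to \infty$, will be consistent, provided $q(n)$ grows sufficiently slowly.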
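• The assumption that autocorrelations die off is reasonable in many cases. For example, the AR(1) model with $|\rho| < 1$ has autocorrelations that die off.
• The term $\frac{n-v}{n}$ can be dropped because it tends to one for $v < q(n)$, given that $q(n)$ increases slowly relative to $n$.
• A disadvantage of this estimator is that it may not be positive definite. This could lead to a negative value of a test statistic that should be positive.
• The estimator proposed by Newey and West (Econometrica, 1987) solves the problem of possible nonpositive definiteness. It is
$$\hat\Omega_n = \hat\Gamma_0 + \sum_{v=1}^{q(n)}\left[1 - \frac{v}{q+1}\right]\left(\hat\Gamma_v + \hat\Gamma_v'\right).$$
This estimator is p.d. by construction. The condition for consistency is that $n^{-1/4}q(n) \to 0$. Note that this is a very slow rate of growth for $q$. It is an example of a kernel estimator.
Finally, since $\Omega_n$ has $\Omega$ as its limit, $\hat\Omega_n \overset{p}{\to} \Omega$. We can now use $\hat\Omega_n$ and $\hat Q_X = \frac{1}{n}X'X$ to consistently estimate the limiting distribution of the OLS estimator under heteroscedasticity and autocorrelation of unknown form. With this, asymptotically valid tests are constructed in the usual way.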
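A weighted-autocovariance estimator of this kind can be sketched for the special case where $m_t$ is a scalar (one regressor, no constant), so that $\hat\Gamma_v' = \hat\Gamma_v$. The scalar restriction, the function name and the data are assumptions made only to keep the illustration short.

```python
def newey_west_scalar(m, q):
    """Gamma_0 + sum_{v=1}^{q} (1 - v/(q+1)) * 2 * Gamma_v for scalar m_t."""
    n = len(m)

    def gamma(v):
        # Sample autocovariance using the 1/n normalization, as in the text.
        return sum(m[t] * m[t - v] for t in range(v, n)) / n

    omega = gamma(0)
    for v in range(1, q + 1):
        weight = 1.0 - v / (q + 1)
        omega += weight * 2.0 * gamma(v)
    return omega


m_hat = [1.0, -1.0, 1.0, -1.0]   # illustrative values of x_t * residual_t
omega_q0 = newey_west_scalar(m_hat, q=0)   # reduces to Gamma_0
omega_q1 = newey_west_scalar(m_hat, q=1)
```

With negatively autocorrelated $m_t$, as in this toy series, the correction lowers the estimate relative to $\hat\Gamma_0$, but the declining weights keep it nonnegative.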
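The Durbin-Watson statistic is
$$DW = \frac{\sum_{t=2}^{n}\left(\hat\varepsilon_t - \hat\varepsilon_{t-1}\right)^2}{\sum_{t=1}^{n}\hat\varepsilon_t^2} = \frac{\sum_{t=2}^{n}\left(\hat\varepsilon_t^2 - 2\hat\varepsilon_t\hat\varepsilon_{t-1} + \hat\varepsilon_{t-1}^2\right)}{\sum_{t=1}^{n}\hat\varepsilon_t^2}$$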
• The null hypothesis is that the first order autocorrelation of the errors is zero: $H_0: \rho_1 = 0$. The alternative is of course $H_A: \rho_1 \neq 0$. Note that the alternative is not that the errors are AR(1), since many general patterns of autocorrelation will have the first order autocorrelation different than zero. For this reason the test is useful for detecting autocorrelation in general. For the same reason, one shouldn't just assume that an AR(1) model is appropriate when the DW test rejects the null.
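• Under the null, the middle term tends to zero, and the other two tend to one, so $DW \overset{p}{\to} 2$.
• Supposing that we had an AR(1) error process with $\rho = 1$, the middle term tends to $-2$, so $DW \overset{p}{\to} 0$.
• Supposing that we had an AR(1) error process with $\rho = -1$, the middle term tends to $2$, so $DW \overset{p}{\to} 4$.
• The distribution of $DW$ depends on the matrix of regressors, $X$, so tables can't give exact critical values. They give upper and lower bounds, which correspond to the extremes that are possible. See Figure ??. There are means of determining exact critical values conditional on $X$.
• The DW test is based upon the assumption that the matrix $X$ is fixed in repeated samples. This is often unreasonable in the context of economic time series, which is precisely the context where the test would have application. It is possible to relate the DW test to other test statistics which are valid without strict exogeneity.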
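The statistic itself is simple to compute from a residual series. The sketch below uses made-up residuals to show the two extreme patterns: perfectly alternating signs push the statistic toward 4, and a constant series gives 0. The function name is hypothetical.

```python
def durbin_watson(resid):
    """Sum of squared first differences divided by the sum of squares."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den


dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])  # strong negative autocorrelation
dw_constant = durbin_watson([1.0, 1.0, 1.0, 1.0])       # strong positive autocorrelation
```

In a sample of 4 the alternating series gives 3 rather than 4, because the numerator has only $n - 1$ terms; the limit of 4 is approached as $n$ grows.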
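The Breusch-Godfrey test uses the auxiliary regression
$$\hat\varepsilon_t = x_t'\delta + \gamma_1\hat\varepsilon_{t-1} + \gamma_2\hat\varepsilon_{t-2} + \cdots + \gamma_P\hat\varepsilon_{t-P} + v_t$$
and the test statistic is the $nR^2$ statistic. There are $P$ restrictions, so the test statistic is asymptotically distributed as a $\chi^2(P)$.
• The intuition is that the lagged errors shouldn't contribute to explaining the current error if there is no autocorrelation.
• $x_t$ is included as a regressor to account for the fact that the $\hat\varepsilon_t$ are not independent even if the $\varepsilon_t$ are.
• This test is valid even if the regressors are stochastic and contain lagged dependent variables, so it is considerably more useful than the DW test for typical time series data.
• The alternative is not that the model is an AR(P), following the argument above. The alternative is simply that some or all of the first $P$ autocorrelations are different from zero. This is compatible with many specific forms of autocorrelation.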
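We've seen that the OLS estimator is consistent under autocorrelation, as long as $\mathrm{plim}\,\frac{X'\varepsilon}{n} = 0$. This will be the case when $E(X'\varepsilon) = 0$, following a LLN. An important exception is the case where $X$ contains lagged $y$'s and the errors are autocorrelated. A simple example is the case of a single lag of the dependent variable with AR(1) errors. The model is
$$y_t = x_t'\beta + y_{t-1}\gamma + \varepsilon_t$$
$$\varepsilon_t = \rho\varepsilon_{t-1} + u_t$$
Now we can write
$$E(y_{t-1}\varepsilon_t) = E\left\{\left(x_{t-1}'\beta + y_{t-2}\gamma + \varepsilon_{t-1}\right)\left(\rho\varepsilon_{t-1} + u_t\right)\right\} \neq 0$$
since one of the terms is $E(\rho\varepsilon_{t-1}^2)$, which is clearly nonzero. In this case $E(X'\varepsilon) \neq 0$, and therefore $\mathrm{plim}\,\frac{X'\varepsilon}{n} \neq 0$. Since
$$\mathrm{plim}\,\hat\beta = \beta + \mathrm{plim}\left(\frac{X'X}{n}\right)^{-1}\frac{X'\varepsilon}{n}$$
the OLS estimator is inconsistent in this case.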
7.5.8 Examples
Nerlove model, yet again
The Nerlove model uses cross-sectional data, so one may not think of performing tests for autocorrelation. However, specification error can induce autocorrelated errors. Consider the simple Nerlove model
$$\ln C = \beta_1 + \beta_2\ln Q + \beta_3\ln P_L + \beta_4\ln P_F + \beta_5\ln P_K + \varepsilon$$
and the extended Nerlove model
$$\ln C = \beta_1^j + \beta_2^j\ln Q + \beta_3\ln P_L + \beta_4\ln P_F + \beta_5\ln P_K + \varepsilon.$$
We have seen evidence that the extended model is preferred. So if it is in fact the proper model, the simple model is misspecified. Let's check if this misspecification might induce autocorrelated errors.
The Octave program GLS/NerloveAR.m estimates the simple Nerlove model, and plots the residuals as a function of $\ln Q$.
[Figure: residuals of the simple Nerlove model plotted against ln Q]

Test statistic:
                 Value    p-value
                 34.930   0.000
??)
to
Klein model
One of the equations in the model explains consumption ($C$) as a function of profits ($P$), both current and lagged, as well as the sum of wages in the private sector ($W^p$) and wages in the government sector ($W^g$). Have a look at the README file for this data set. This gives the variable names and other information.
Consider the model
*********************************************************
OLS estimation results
Observations 21
R-squared 0.981008
Sigma-squared 1.051732

                 estimate   st.err.   t-stat.   p-value
Constant           16.237     1.303    12.464     0.000
Profits             0.193     0.091     2.115     0.049
Lagged Profits      0.090     0.091     0.992     0.335
Wages               0.796     0.040    19.933     0.000
*********************************************************
                        Value    p-value
Breusch-Godfrey test    1.539    0.215

[Figure 7.4: residual plot for the Klein consumption equation]
and the residual plot is in Figure 7.4. The test does not reject the null of nonautocorrelated errors, but we should remember that we have only 21 observations, so power is likely to be fairly low. The residual plot leads me to suspect that there may be autocorrelation - there are some significant runs below and above the x-axis. Your opinion may differ.
Since it seems that there may be autocorrelation, the model is re-estimated with an AR(1) correction. The results are
*********************************************************
OLS estimation results
Observations 21
R-squared 0.967090
Sigma-squared 0.983171

Results (Ordinary var-cov estimator)

                 estimate   st.err.   t-stat.   p-value
Constant           16.992     1.492    11.388     0.000
Profits             0.215     0.096     2.232     0.039
Lagged Profits      0.076     0.094     0.806     0.431
Wages               0.774     0.048    16.234     0.000
*********************************************************
                        Value    p-value
Breusch-Godfrey test    2.129    0.345
• The test is farther away from the rejection region than before, and the residual plot is a bit more favorable for the hypothesis of nonautocorrelated residuals, IMHO. For this reason, it seems that the AR(1) correction might have improved the estimation.
• Nevertheless, there has not been much of an effect on the estimated coefficients nor on their estimated standard errors. This is probably because the estimated AR(1) coefficient is not very large (around 0.2).
• The existence or not of autocorrelation in this model will be important later, in the section on simultaneous equations.
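7.6 Exercises
1. Comparing the variances of the OLS and GLS estimators, it was claimed that
$$Var(\hat\beta) - Var(\hat\beta_{GLS}) = A\Sigma A'$$
Verify that this is true.
2. Show that the GLS estimator can be defined as
$$\hat\beta_{GLS} = \arg\min_{\beta} \, (y - X\beta)'\Sigma^{-1}(y - X\beta)$$
3. The limiting distribution of the OLS estimator with heteroscedasticity of unknown form is
$$\sqrt{n}\left(\hat\beta - \beta\right) \overset{d}{\to} N\left(0, Q_X^{-1}\Omega Q_X^{-1}\right),$$
where
$$\lim_{n\to\infty} E\left(\frac{X'\varepsilon\varepsilon'X}{n}\right) = \Omega$$
Explain why
$$\hat\Omega = \frac{1}{n}\sum_{t=1}^{n} x_t x_t'\hat\varepsilon_t^2$$
is a consistent estimator of this matrix.
4. Define the $v$-th autocovariance of a covariance stationary process $m_t$ as
$$\Gamma_v = E(m_t m_{t-v}').$$
Show that $E(m_t m_{t+v}') = \Gamma_v'$.
5. For the Nerlove model
$$\ln C = \beta_1^j + \beta_2^j\ln Q + \beta_3\ln P_L + \beta_4\ln P_F + \beta_5\ln P_K + \varepsilon$$
assume that $V(\varepsilon_t | x_t) = \sigma_j^2$, $j = 1, 2, ..., 5$.
a) Apply White's test using the OLS residuals, to test for homoscedasticity.
b) Calculate the FGLS estimator and interpret the estimation results.
c) Test the transformed model to check whether it appears to satisfy homoscedasticity.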
Chapter 8
Stochastic regressors
Up to now we have treated the regressors as fixed, which is clearly unrealistic. Now we will assume they are random. There are several ways to think of the problem. First, if we are interested in an analysis conditional on the explanatory variables, then it is irrelevant if they are stochastic or not, since conditional on the values the regressors take on, they are nonstochastic, which is the case already considered.
• In dynamic models, where $y_t$ may depend on $y_{t-1}$, a conditional analysis is not sufficiently general, since we may want to predict into the future many periods out, so we need to consider the behavior of $\hat\beta$ and the relevant test statistics unconditional on $X$.
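The model we'll deal with will involve a combination of the following assumptions.
Linearity: the model is a linear function of the parameter vector $\beta_0$:
$$y_t = x_t'\beta_0 + \varepsilon_t,$$
or in matrix form,
$$y = X\beta_0 + \varepsilon,$$
where $y$ is $n \times 1$, $X = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}'$, where $x_t$ is $K \times 1$, and $\beta_0$ and $\varepsilon$ are conformable.
Stochastic, linearly independent regressors: $X$ has rank $K$ with probability 1; $X$ is stochastic; and
$$\lim_{n\to\infty}\Pr\left(\frac{1}{n}X'X = Q_X\right) = 1,$$
where $Q_X$ is a finite positive definite matrix.
Central limit theorem:
$$n^{-1/2}X'\varepsilon \overset{d}{\to} N(0, Q_X\sigma_0^2)$$
Strongly exogenous regressors:
$$E(\varepsilon_t | X) = 0, \forall t \quad (8.1)$$
Weakly exogenous regressors:
$$E(\varepsilon_t | x_t) = 0, \forall t \quad (8.2)$$
In both cases, $x_t'\beta$ is the conditional mean of $y_t$ given $x_t$: $E(y_t | x_t) = x_t'\beta$.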
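8.1 Case 1
Normality of $\varepsilon$, strongly exogenous regressors
In this case,
$$\hat\beta = \beta_0 + (X'X)^{-1}X'\varepsilon$$
$$E(\hat\beta | X) = \beta_0 + (X'X)^{-1}X'E(\varepsilon | X) = \beta_0$$
and since this holds for all $X$, $E(\hat\beta) = \beta_0$, unconditional on $X$. Likewise,
$$\hat\beta | X \sim N\left(\beta_0, (X'X)^{-1}\sigma_0^2\right)$$
• The marginal density of $\hat\beta$ is obtained by multiplying the conditional density by the density of $X$ and integrating over $X$. Doing this leads to a nonnormal density for $\hat\beta$, in small samples.
• However, conditional on $X$, the usual test statistics have the $t$, $F$ and $\chi^2$ distributions. These distributions don't depend on $X$, so nothing changes when marginalizing to obtain the unconditional distribution.
• Summary: When $X$ is stochastic but strongly exogenous and $\varepsilon$ is normally distributed:
1. $\hat\beta$ is unbiased
2. $\hat\beta$ is nonnormally distributed
3. The usual test statistics have the same distribution as with nonstochastic $X$.
4. The Gauss-Markov theorem still holds, since it holds conditionally on $X$, and this is true for all $X$.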
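8.2 Case 2
$\varepsilon$ nonnormally distributed, strongly exogenous regressors
Still, we have
$$\hat\beta = \beta_0 + (X'X)^{-1}X'\varepsilon = \beta_0 + \left(\frac{X'X}{n}\right)^{-1}\frac{X'\varepsilon}{n}$$
Now
$$\left(\frac{X'X}{n}\right)^{-1} \overset{p}{\to} Q_X^{-1}$$
by assumption, and
$$\frac{X'\varepsilon}{n} = \frac{n^{-1/2}X'\varepsilon}{\sqrt{n}} \overset{p}{\to} 0$$
since the numerator converges to a $N(0, Q_X\sigma^2)$ random vector and the denominator goes to infinity. So the estimator is consistent:
$$\hat\beta \overset{p}{\to} \beta_0.$$
Considering the asymptotic distribution
$$\sqrt{n}\left(\hat\beta - \beta_0\right) = \sqrt{n}\left(\frac{X'X}{n}\right)^{-1}\frac{X'\varepsilon}{n} = \left(\frac{X'X}{n}\right)^{-1}n^{-1/2}X'\varepsilon$$
so
$$\sqrt{n}\left(\hat\beta - \beta_0\right) \overset{d}{\to} N(0, Q_X^{-1}\sigma_0^2)$$
Since the asymptotic results on all test statistics only require this, all the previous asymptotic results on test statistics are also valid in this case.
• Summary: Under strongly exogenous regressors, with $\varepsilon$ normal or nonnormal, $\hat\beta$ has the properties:
1. Unbiasedness
2. Consistency
3. Gauss-Markov theorem holds, since it holds in the previous case and doesn't depend on normality.
4. Asymptotic normality
5. Tests are asymptotically valid
6. Tests are not valid in small samples if the error is not normally distributed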
8.3 Case 3
Weakly exogenous regressors
An important class of models are dynamic models, where lagged dependent variables have an impact on the current value. A simple version of these models that captures the important points is
$$y_t = z_t'\alpha + \sum_{s=1}^{p}\gamma_s y_{t-s} + \varepsilon_t = x_t'\beta + \varepsilon_t$$
where now $x_t$ contains lagged dependent variables. Clearly, even with $E(\varepsilon_t | x_t) = 0$, $X$ and $\varepsilon$ are not uncorrelated. For example,
$$E(\varepsilon_{t-1}x_t) \neq 0$$
since $x_t$ contains $y_{t-1}$ (which is a function of $\varepsilon_{t-1}$) as an element.
• This fact implies that all of the small sample properties such as unbiasedness, Gauss-Markov theorem, and small sample validity of test statistics do not hold in this case. Recall Figure 3.7. This is a case of weakly exogenous regressors, and we see that the OLS estimator is biased in this case.
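The two assumptions that need to be checked are
1. $\lim_{n\to\infty}\Pr\left(\frac{1}{n}X'X = Q_X\right) = 1$, with $Q_X$ a finite positive definite matrix.
2. $n^{-1/2}X'\varepsilon \overset{d}{\to} N(0, Q_X\sigma_0^2)$
The most complicated case is that of dynamic models, since the other cases can be treated as nested in this case. There exist a number of central limit theorems for dependent processes, many of which are fairly technical. We won't enter into details (see Hamilton, Chapter 7 if you're interested). A main requirement for use of standard asymptotics for a dependent sequence
$$\{s_t\} = \left\{\frac{1}{n}\sum_{t=1}^{n} z_t\right\}$$
to converge in probability to a finite limit is that $z_t$ be stationary, in some sense.
• Strong stationarity requires that the joint distribution of the set $\{z_t, z_{t+s}, z_{t-q}, ...\}$ not depend on $t$.
• Covariance (weak) stationarity requires that the first and second moments of this set not depend on $t$.
• An example of a sequence that doesn't satisfy this is an AR(1) process with a unit root (a random walk):
$$x_t = x_{t-1} + \varepsilon_t$$
$$\varepsilon_t \sim IIN(0, \sigma^2)$$
The variance of $x_t$ depends upon $t$ in this case, so it's not weakly stationary.
• The series $\sin t + \varepsilon_t$ has a first moment that depends upon $t$, so it's not weakly stationary either.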
Stationarity prevents the process from trending off to plus or minus infinity, and prevents cyclical behavior which would allow correlations between far removed $z_t$ and $z_s$ to be high.
• In summary, the assumptions are reasonable when the stochastic conditioning variables have variances that are finite, and are not too strongly dependent. The AR(1) model with unit root is an example of a case where the dependence is too strong for standard asymptotics to apply.
• The econometrics of nonstationary processes has been an active area of research in the last two decades. The standard asymptotics don't apply in this case. This isn't in the scope of this course.
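The statement that the random walk's variance depends on $t$ can be checked by recursive substitution. The short derivation below assumes, purely for illustration, a fixed (nonrandom) starting value $x_0$; the initial condition is not specified in the text.

```latex
\begin{align*}
x_t &= x_{t-1} + \varepsilon_t = x_0 + \sum_{s=1}^{t} \varepsilon_s, \\
V(x_t) &= \sum_{s=1}^{t} V(\varepsilon_s) = t\,\sigma^2 .
\end{align*}
```

The second line uses the independence of the $\varepsilon_s$. Since $V(x_t)$ grows linearly in $t$, the second moment is not constant, which violates covariance stationarity.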
8.5 Exercises
Show that for two random variables $A$ and $B$, if $E(A|B) = 0$, then $E(A f(B)) = 0$.
How
Chapter 9
Data problems
In this section we'll consider problems associated with the regressor matrix: collinearity, missing observations and measurement error.
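9.1 Collinearity
Collinearity is the existence of linear relationships amongst the regressors. We can always write
$$\lambda_1 x_1 + \lambda_2 x_2 + \cdots + \lambda_K x_K + v = 0$$
where $x_i$ is the $i$th column of the regressor matrix $X$, and $v$ is an $n \times 1$ vector. In the case that there exists collinearity, the variation in $v$ is relatively small, so that there is an approximately exact linear relation between the regressors.
• "relative" and "approximate" are imprecise, so it's difficult to define when collinearity exists.
In the extreme, if there are exact linear relationships (every element of $v$ equal) then $\rho(X) < K$, so $\rho(X'X) < K$, so $X'X$ is not invertible and the OLS estimator is not uniquely defined. For example, if the model is
$$y_t = \beta_1 + \beta_2 x_{2t} + \beta_3 x_{3t} + \varepsilon_t$$
$$x_{2t} = \alpha_1 + \alpha_2 x_{3t}$$
then we can write
$$y_t = \beta_1 + \beta_2(\alpha_1 + \alpha_2 x_{3t}) + \beta_3 x_{3t} + \varepsilon_t$$
$$= \beta_1 + \beta_2\alpha_1 + \beta_2\alpha_2 x_{3t} + \beta_3 x_{3t} + \varepsilon_t$$
$$= (\beta_1 + \beta_2\alpha_1) + (\beta_2\alpha_2 + \beta_3)x_{3t} + \varepsilon_t$$
$$= \gamma_1 + \gamma_2 x_{3t} + \varepsilon_t$$
• The $\gamma$'s can be consistently estimated, but since the $\gamma$'s define two equations in three $\beta$'s, the $\beta$'s can't be consistently estimated (there are multiple values of $\beta$ that solve the first-order conditions). The $\beta$'s are unidentified in the case of perfect collinearity.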
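• Perfect collinearity is unusual, except in the case of an error in construction of the regressor matrix, such as including the same regressor twice.
Another case where perfect collinearity may be encountered is with models with dummy variables, if one is not careful. Consider a model of rental price ($y_i$) of an apartment. This could depend on characteristics collected in $x_i$, as well as on the location of the apartment. Let $B_i$, $G_i$, $T_i$ and $L_i$ be dummy variables for four mutually exclusive locations, so that, for example, $B_i = 1$ if the $i$th apartment is in the first location and $B_i = 0$ otherwise. One could use a model such as
$$y_i = \beta_1 + \beta_2 B_i + \beta_3 G_i + \beta_4 T_i + \beta_5 L_i + x_i'\gamma + \varepsilon_i$$
In this model, $B_i + G_i + T_i + L_i = 1$, $\forall i$, so there is an exact relationship between these variables and the column of ones corresponding to the constant. One must either drop the constant, or one of the qualitative variables.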
9.1.2 Back to collinearity
The more common case, if one doesn't make mistakes such as these, is the existence of inexact linear relationships, i.e., correlations between the regressors that are less than one in absolute value, but not zero. The basic problem is that when two (or more) variables move together, it is difficult to determine their separate influences. This is reflected in imprecise estimates, i.e., estimates with high variances. With economic data, collinearity is commonly encountered, and is often a severe problem.
When there is collinearity, the minimizing point of the objective function that defines the OLS estimator ($s(\beta)$, the sum of squared errors) is relatively poorly defined. This is seen in Figures 9.1 and 9.2.
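To see the effect of collinearity on variances, partition the regressor matrix as
$$X = \begin{bmatrix} x & W \end{bmatrix}$$
where $x$ is the first column of $X$ (note: we can interchange the columns of $X$ if we like, so there's no loss of generality in considering the first column). Now, the variance of $\hat\beta$, under the classical assumptions, is
$$V(\hat\beta) = (X'X)^{-1}\sigma^2$$
Using the partition,
$$X'X = \begin{bmatrix} x'x & x'W \\ W'x & W'W \end{bmatrix}$$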
115
COLLINEARITY
Figure 9.1:
s()
60
55
50
45
40
35
30
25
20
15
6
4
2
0
-2
-4
-6
-6
-4
-2
Figure 9.2:
s()
6
4
2
0
-2
-4
-6
-6
-4
-2
100
90
80
70
60
50
40
30
20
116
CHAPTER 9.
DATA PROBLEMS
X X
where by
ESSx|W
1
x x x W (W W )1 W x
1
=
x In W (W W ) 1 W x
1
= ESSx|W
1
1,1
x = W + v.
Sin
e
R2 = 1 ESS/T SS,
we have
ESS = T SS(1 R2 )
so the varian
e of the
oe
ient
orresponding to
V (x ) =
is
2
2
T SSx (1 Rx|W
)
We see three fa tors inuen e the varian e of this oe ient. It will be high if
1.
is large
x.
2
Rx|W
2
Rx|W
will be lose to 1. As
1, V (x ) .
See the
gures no ollin.ps (no orrelation) and ollin.ps ( orrelation), available on the web site.
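The variance formula $V(\hat\beta_x) = \sigma^2 / \left(TSS_x(1 - R^2_{x|W})\right)$ can be illustrated numerically. The sketch below assumes that $W$ consists of a constant and one other regressor $w$, so that $R^2_{x|W}$ is the squared sample correlation between $x$ and $w$ and $TSS_x$ is measured around the mean. The data, $\sigma^2 = 1$, and the function name are made up for illustration.

```python
def slope_variance(x, w, sigma2):
    """Return (V(beta_x), R^2 of x on [1, w]) for the model y = a + b*x + c*w + e."""
    n = len(x)
    xbar = sum(x) / n
    wbar = sum(w) / n
    tss_x = sum((xi - xbar) ** 2 for xi in x)
    tss_w = sum((wi - wbar) ** 2 for wi in w)
    sxw = sum((xi - xbar) * (wi - wbar) for xi, wi in zip(x, w))
    r2 = sxw ** 2 / (tss_x * tss_w)   # squared correlation between x and w
    return sigma2 / (tss_x * (1.0 - r2)), r2


x = [1.0, 2.0, 3.0, 4.0]
w_orthogonal = [1.0, -1.0, -1.0, 1.0]   # uncorrelated with x
w_collinear = [1.1, 1.9, 3.2, 3.8]      # moves closely with x

var_orth, r2_orth = slope_variance(x, w_orthogonal, sigma2=1.0)
var_coll, r2_coll = slope_variance(x, w_collinear, sigma2=1.0)
```

With the same $\sigma^2$ and the same variation in $x$, the only thing that changes between the two cases is $R^2_{x|W}$, and the variance of the coefficient rises sharply as it approaches 1.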
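The best way to detect collinearity is to regress each explanatory variable in turn on the remaining regressors. If any of these auxiliary regressions has a high $R^2$, there is a problem of collinearity. Furthermore, this procedure identifies which parameters are affected.
An alternative is to examine the matrix of correlations between the regressors. High correlations are sufficient but not necessary for severe collinearity.
Also indicative of collinearity is that the model fits well (high $R^2$), but none of the variables is significantly different from zero (e.g., their separate influences aren't well determined).
In summary, the artificial regressions are the best approach if one wants to be careful.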
ollinearity.
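A stochastic linear restriction has the form
$$R\beta = r + v$$
where $R$ and $r$ are as in the case of exact linear restrictions, but $v$ is a random vector. For example, the model could be
$$y = X\beta + \varepsilon$$
$$R\beta = r + v$$
$$\begin{pmatrix} \varepsilon \\ v \end{pmatrix} \sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_\varepsilon^2 I_n & 0_{n \times q} \\ 0_{q \times n} & \sigma_v^2 I_q \end{pmatrix}\right)$$
This sort of model isn't in line with the classical interpretation of parameters as constants: according to this interpretation the left hand side of $R\beta = r + v$ is constant but the right hand side is random. This model does fit the Bayesian perspective: we combine information coming from the model and the data, summarized in
$$y = X\beta + \varepsilon, \quad \varepsilon \sim N(0, \sigma_\varepsilon^2 I_n)$$
with prior beliefs regarding the distribution of the parameter, summarized in
$$R\beta \sim N(r, \sigma_v^2 I_q)$$
Since the sample is random it is reasonable to suppose that $E(\varepsilon v') = 0$, which is the last piece of information in the specification. To estimate, treat the restrictions as artificial data and write
$$\begin{bmatrix} y \\ r \end{bmatrix} = \begin{bmatrix} X \\ R \end{bmatrix}\beta + \begin{bmatrix} \varepsilon \\ v \end{bmatrix}$$
This model is heteroscedastic, since $\sigma_\varepsilon^2 \neq \sigma_v^2$. Define the prior precision $k = \sigma_\varepsilon/\sigma_v$. This expresses the degree of belief in the restriction relative to the variability of the data. Supposing that we specify $k$, then the model
$$\begin{bmatrix} y \\ kr \end{bmatrix} = \begin{bmatrix} X \\ kR \end{bmatrix}\beta + \begin{bmatrix} \varepsilon \\ kv \end{bmatrix}$$
is homoscedastic and can be estimated by OLS. Note that this estimator is biased. It is consistent, however, given that $k$ is a fixed constant, even if the restriction is false. As $n \to \infty$, the artificial observations have no weight in the objective function, so the estimator has the same limiting objective function as the OLS estimator, and is therefore consistent.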
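To motivate the use of stochastic restrictions, consider the expectation of the squared length of $\hat\beta$:
$$E(\hat\beta'\hat\beta) = E\left\{\left(\beta + (X'X)^{-1}X'\varepsilon\right)'\left(\beta + (X'X)^{-1}X'\varepsilon\right)\right\}$$
$$= \beta'\beta + E\left(\varepsilon'X(X'X)^{-1}(X'X)^{-1}X'\varepsilon\right)$$
$$= \beta'\beta + Tr\left(X'X\right)^{-1}\sigma^2$$
$$= \beta'\beta + \sigma^2\sum_{i=1}^{K}\lambda_i \quad \text{(the trace is the sum of the eigenvalues)}$$
$$> \beta'\beta + \lambda_{\max\left((X'X)^{-1}\right)}\sigma^2 \quad \text{(the eigenvalues are all positive, since } X'X \text{ is p.d.)}$$
so
$$E(\hat\beta'\hat\beta) > \beta'\beta + \frac{\sigma^2}{\lambda_{\min(X'X)}}$$
where $\lambda_{\min(X'X)}$ is the minimum eigenvalue of $X'X$ (which is the inverse of the maximum eigenvalue of $(X'X)^{-1}$). As collinearity becomes worse and worse, $X'X$ becomes more nearly singular, so $\lambda_{\min(X'X)}$ tends to zero (recall that the determinant is the product of the eigenvalues) and $E(\hat\beta'\hat\beta)$ tends to infinity. On the other hand, $\beta'\beta$ is finite.
Now consider the restriction $I_K\beta = 0 + v$. With this restriction the model becomes
$$\begin{bmatrix} y \\ 0 \end{bmatrix} = \begin{bmatrix} X \\ kI_K \end{bmatrix}\beta + \begin{bmatrix} \varepsilon \\ kv \end{bmatrix}$$
and the estimator is
$$\hat\beta_{ridge} = \left(\begin{bmatrix} X' & kI_K \end{bmatrix}\begin{bmatrix} X \\ kI_K \end{bmatrix}\right)^{-1}\begin{bmatrix} X' & kI_K \end{bmatrix}\begin{bmatrix} y \\ 0 \end{bmatrix} = \left(X'X + k^2 I_K\right)^{-1}X'y$$
This is the ordinary ridge regression estimator. The ridge regression estimator can be seen to add $k^2 I_K$, which is nonsingular, to $X'X$, which is more and more nearly singular as collinearity becomes worse and worse. As $k \to \infty$, the restrictions tend to $\beta = 0$, that is, the coefficients are shrunken toward zero. Also, the estimator tends to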
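$$\hat\beta_{ridge} = \left(X'X + k^2 I_K\right)^{-1}X'y \to \left(k^2 I_K\right)^{-1}X'y = \frac{X'y}{k^2} \to 0$$
so $\hat\beta_{ridge}'\hat\beta_{ridge} \to 0$. This is clearly a false restriction in the limit, if our original model is at all sensible.
There should be some amount of shrinkage that is in fact a true restriction. The problem is to determine the $k$ such that the restriction is correct. The interest in ridge regression centers on the fact that it can be shown that there exists a $k$ such that $MSE(\hat\beta_{ridge}) < MSE(\hat\beta_{OLS})$. The problem is that this $k$ depends on $\beta$ and $\sigma^2$, which are unknown.
The ridge trace method plots $\hat\beta_{ridge}'\hat\beta_{ridge}$ as a function of $k$, and chooses the value of $k$ that artistically seems appropriate (e.g., where the effect of increasing $k$ dies off). Draw picture here.
In summary, the ridge estimator offers some hope, but it is impossible to guarantee that it will outperform the OLS estimator. Collinearity is a fact of life in econometrics, and there is no clear solution to the problem.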
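The shrinkage behaviour of the ridge estimator is easy to see in the scalar case. The sketch below assumes a single regressor and no constant, so that $(X'X + k^2 I_K)^{-1}X'y$ reduces to $\sum x_t y_t / (\sum x_t^2 + k^2)$. The data and the function name are illustrative only.

```python
def ridge_slope(x, y, k):
    """Scalar ridge estimator: sum(x*y) / (sum(x^2) + k^2)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + k ** 2)


x = [1.0, 2.0]
y = [2.0, 4.0]            # exact relation y = 2x, so OLS gives 2
ridge_path = [ridge_slope(x, y, k) for k in (0.0, 1.0, 10.0, 1000.0)]
```

At $k = 0$ the estimator equals OLS, and as $k$ grows the estimate moves monotonically toward zero. A plot of such a path against $k$ is the scalar analogue of the ridge trace.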
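9.2 Measurement error
Suppose first that the dependent variable is measured with error:
$$y^* = X\beta + \varepsilon$$
$$y^* = y + v$$
$$v_t \sim iid(0, \sigma_v^2)$$
where $y^*$ is the unobservable true dependent variable, and $y$ is what is observed. We assume that $\varepsilon$ and $v$ are independent and that $y^* = X\beta + \varepsilon$ satisfies the classical assumptions. Given this, we have
$$y + v = X\beta + \varepsilon$$
so
$$y = X\beta + \varepsilon - v = X\beta + \omega$$
$$\omega_t \sim iid(0, \sigma_\varepsilon^2 + \sigma_v^2)$$
• As long as $v$ is uncorrelated with $X$, this model satisfies the classical assumptions and can be estimated by OLS. This type of measurement error isn't a problem, then.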
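Now suppose the regressors are measured with error:
$$y_t = x_t^{*\prime}\beta + \varepsilon_t$$
$$x_t = x_t^* + v_t$$
$$v_t \sim iid(0, \Sigma_v)$$
where $\Sigma_v$ is a $K \times K$ matrix. Now $X^*$ contains the true, unobserved regressors, and $X$ is what is observed. We have
$$y_t = (x_t - v_t)'\beta + \varepsilon_t$$
$$= x_t'\beta - v_t'\beta + \varepsilon_t$$
$$= x_t'\beta + \omega_t$$
The problem is that now there is a correlation between $x_t$ and $\omega_t$, since
$$E(x_t\omega_t) = E\left((x_t^* + v_t)(-v_t'\beta + \varepsilon_t)\right) = -\Sigma_v\beta$$
where $\Sigma_v = E(v_t v_t')$.
Because of this correlation, the OLS estimator is biased and inconsistent, just as in the case of autocorrelated errors with lagged dependent variables. In matrix notation, write the estimated model as
$$y = X\beta + \omega$$
We have that
$$\hat\beta = \left(\frac{X'X}{n}\right)^{-1}\left(\frac{X'y}{n}\right)$$
and
$$\mathrm{plim}\left(\frac{X'X}{n}\right)^{-1} = \mathrm{plim}\left(\frac{(X^{*\prime} + V')(X^* + V)}{n}\right)^{-1} = \left(Q_{X^*} + \Sigma_v\right)^{-1}$$
since $X^*$ and $V$ are independent, and
$$\mathrm{plim}\,\frac{V'V}{n} = \lim E\,\frac{1}{n}\sum_{t=1}^{n} v_t v_t' = \Sigma_v$$
Likewise,
$$\mathrm{plim}\left(\frac{X'y}{n}\right) = \mathrm{plim}\,\frac{(X^{*\prime} + V')(X^*\beta + \varepsilon)}{n} = Q_{X^*}\beta$$
so
$$\mathrm{plim}\,\hat\beta = \left(Q_{X^*} + \Sigma_v\right)^{-1}Q_{X^*}\beta$$
So we see that the least squares estimator is inconsistent when the regressors are measured with error.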
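For intuition, the probability limit can be specialized to a single regressor. The scalar case $K = 1$, with $q = Q_{X^*} > 0$ and $\sigma_v^2 = \Sigma_v$, is an assumption made here only for illustration:

```latex
\[
\operatorname{plim}\hat\beta \;=\; \frac{q}{q + \sigma_v^2}\,\beta ,
\qquad
0 < \frac{q}{q + \sigma_v^2} < 1 \quad \text{whenever } \sigma_v^2 > 0 .
\]
```

In this special case the OLS estimate is pulled toward zero, and the distortion grows with the variance of the measurement error relative to the variation in the true regressor. For example, with $q = \sigma_v^2$ the estimator converges to $\beta/2$.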
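9.3 Missing observations
Suppose that some observations on the dependent variable are missing. We have
$$y = X\beta + \varepsilon$$
or
$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}\beta + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix}$$
where $y_2$ is not observed. Otherwise, we assume the classical assumptions hold.
• A clear alternative is to simply estimate using the complete observations
$$y_1 = X_1\beta + \varepsilon_1$$
Since these observations satisfy the classical assumptions, one could estimate by OLS.
• The question remains whether or not one could somehow replace the unobserved $y_2$ by a predictor, and improve over OLS in some sense. Let $\hat y_2$ be the predictor of $y_2$. Now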
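\[
\begin{aligned}
\hat\beta &= \left\{ \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}' \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \right\}^{-1} \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}' \begin{bmatrix} y_1 \\ \hat y_2 \end{bmatrix} \\
&= \left[ X_1'X_1 + X_2'X_2 \right]^{-1} \left[ X_1'y_1 + X_2'\hat y_2 \right]
\end{aligned}
\]
Recall that the OLS first-order conditions are
\[
X'X\hat\beta = X'y
\]
so if we regressed using only the first (complete) observations, we would have
\[
X_1'X_1\hat\beta_1 = X_1'y_1 .
\]
Likewise, an OLS regression using only the second (filled in) observations would give
\[
X_2'X_2\hat\beta_2 = X_2'\hat y_2 .
\]
Substituting these into the equation for the overall combined estimator gives
\[
\begin{aligned}
\hat\beta &= \left[ X_1'X_1 + X_2'X_2 \right]^{-1} \left[ X_1'X_1\hat\beta_1 + X_2'X_2\hat\beta_2 \right] \\
&= \left[ X_1'X_1 + X_2'X_2 \right]^{-1} X_1'X_1\hat\beta_1 + \left[ X_1'X_1 + X_2'X_2 \right]^{-1} X_2'X_2\hat\beta_2 \\
&\equiv A\hat\beta_1 + (I_K - A)\hat\beta_2
\end{aligned}
\]
where
\[
A \equiv \left[ X_1'X_1 + X_2'X_2 \right]^{-1} X_1'X_1
\]
and we use
\[
\begin{aligned}
\left[ X_1'X_1 + X_2'X_2 \right]^{-1} X_2'X_2 &= \left[ X_1'X_1 + X_2'X_2 \right]^{-1} \left[ (X_1'X_1 + X_2'X_2) - X_1'X_1 \right] \\
&= I_K - \left[ X_1'X_1 + X_2'X_2 \right]^{-1} X_1'X_1 \\
&= I_K - A .
\end{aligned}
\]
Now,
\[
E(\hat\beta) = A\beta + (I_K - A)E(\hat\beta_2)
\]
and this will be unbiased only if $E(\hat\beta_2) = \beta$.

The conclusion is that the filled in observations alone would need to define an unbiased estimator. This will be the case only if
\[
\hat y_2 = X_2\beta + \hat\varepsilon_2
\]
where $\hat\varepsilon_2$ has mean zero. Clearly, it is difficult to satisfy this condition without knowledge of $\beta$. Note that putting $\hat y_2 = \bar y_1$ does not satisfy the condition and therefore leads to a biased estimator.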
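One possibility that has been suggested (see Greene, page 275) is to estimate $\beta$ using a first round estimation that uses only the complete observations,
\[
\hat\beta_1 = (X_1'X_1)^{-1}X_1'y_1
\]
then use this estimate, $\hat\beta_1$, to predict $y_2$:
\[
\begin{aligned}
\hat y_2 &= X_2\hat\beta_1 \\
&= X_2(X_1'X_1)^{-1}X_1'y_1
\end{aligned}
\]
Now, the overall estimate is a weighted average of $\hat\beta_1$ and $\hat\beta_2$, just as above, but we have
\[
\begin{aligned}
\hat\beta_2 &= (X_2'X_2)^{-1}X_2'\hat y_2 \\
&= (X_2'X_2)^{-1}X_2'X_2\hat\beta_1 \\
&= \hat\beta_1
\end{aligned}
\]
This shows that this suggestion is completely empty of content: the final estimator is the same as the OLS estimator using only the complete observations.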
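The equivalence can be verified numerically. The sketch below uses a single regressor without a constant so that the algebra stays in plain Python; the data values and the helper name `slope_no_intercept` are made up for illustration.

```python
def slope_no_intercept(x, y):
    """OLS coefficient for y = x*beta + e (one regressor, no constant)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# Complete observations (y observed)
x1 = [1.0, 2.0, 3.0, 4.0]
y1 = [1.1, 1.9, 3.2, 3.9]
# Observations with the dependent variable missing
x2 = [5.0, 6.0]

b1 = slope_no_intercept(x1, y1)          # first-round estimate
y2_hat = [b1 * x for x in x2]            # fill in y2 with first-round predictions
b_all = slope_no_intercept(x1 + x2, y1 + y2_hat)  # estimate on the filled-in sample

print(b1, b_all)
```

The two printed numbers coincide up to rounding error, which is the numerical counterpart of $\hat\beta_2 = \hat\beta_1$ in the derivation above.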
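Consider the model
\[
y_t^* = x_t'\beta + \varepsilon_t
\]
which is assumed to satisfy the classical assumptions. However, $y_t^*$ is not always observed. What is observed is $y_t$, defined as
\[
y_t = y_t^* \quad \text{if } y_t^* \geq 0 .
\]
Or, in other words, $y_t^*$ is missing when it is less than zero.

The difference in this case is that the missing values are not random: they are correlated with the $x_t$.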
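Consider the case
\[
y^* = x + \varepsilon
\]
with $V(\varepsilon) = 25$, where only the observations for which $y^* > 0$ are used to estimate.

Now suppose that the model is again
\[
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}\beta + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix}
\]
but that some elements of $X_2$ are unobserved. One could just estimate using the complete observations, but it may seem frustrating to have to drop observations because of a missing regressor value. If the unobserved $X_2$ is replaced by some prediction, $X_2^*$, then we are in the case of errors of observation. As before, this means that the OLS estimator is biased when $X_2^*$ is used instead of $X_2$.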
Consistency is salvaged, however, as long as the number of missing observations doesn't increase with $n$.

Filling in missing values with ad hoc values can be a source of bias. It is difficult to determine whether MSE increases or decreases. Monte Carlo studies suggest that it is dangerous to simply substitute the mean, for example.

In the case that there is only one regressor other than the constant, substitution of $\bar x$ for the missing $x_t$ does not lead to bias. This is a special case that doesn't hold for $K > 2$.

There is potential for reducing MSE by filling in missing elements with intelligent guesses, but this could also increase MSE.
\[
\ln C = \beta_1^j + \beta_2^j \ln Q + \beta_3 \ln P_L + \beta_4 \ln P_F + \beta_5 \ln P_K + \epsilon
\]
When this model is estimated by OLS, some coefficients are not significant. This may be due to collinearity.
Chapter 10
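Functional form and nonnested tests

Though theory often suggests which conditioning variables should be included, and suggests the signs of certain derivatives, it is usually silent regarding the functional form of the relationship between the dependent variable and the regressors. For example, considering a cost function, one could have a Cobb-Douglas model
\[
c = A w_1^{\beta_1} w_2^{\beta_2} q^{\beta_q} e^{\varepsilon}
\]
This model, after taking logarithms, gives
\[
\ln c = \beta_0 + \beta_1 \ln w_1 + \beta_2 \ln w_2 + \beta_q \ln q + \varepsilon
\]
where $\beta_0 = \ln A$. This model isn't compatible with a fixed cost of production, since $c = 0$ when $q = 0$. Homogeneity of degree one in input prices suggests that $\beta_1 + \beta_2 = 1$, while constant returns to scale implies $\beta_q = 1$. While this model may be reasonable in some cases, an alternative such as
\[
\sqrt{c} = \beta_0 + \beta_1 \sqrt{w_1} + \beta_2 \sqrt{w_2} + \beta_q \sqrt{q} + \varepsilon
\]
may be just as plausible. Note that $\sqrt{x}$ and $\ln(x)$ look quite alike, for certain values of the regressors, and up to a linear transformation, so it may be difficult to choose between these models.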
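The basic point is that many functional forms are compatible with the linear-in-parameters model, since this model can incorporate a wide variety of nonlinear transformations of the dependent variable and the regressors. For example, suppose that $g(\cdot)$ is a real valued function and that $x(\cdot)$ is a $K$-vector-valued function. The following model is linear in the parameters but nonlinear in the variables:
\[
\begin{aligned}
x_t &= x(z_t) \\
y_t &= x_t'\beta + \varepsilon_t
\end{aligned}
\]
There may be $P$ fundamental conditioning variables $z_t$, but there may be $K$ regressors, where $K$ may be smaller than, equal to, or larger than $P$. For example, $x_t$ could include squares and cross products of the conditioning variables in $z_t$.

For a functional form to be flexible in the Diewert sense (the function, its vector of first derivatives, and its matrix of second derivatives can each take on an arbitrary value at a single data point), there must be at least
\[
K = 1 + P + \left(P^2 - P\right)/2 + P
\]
free parameters: one for each independent effect that we wish to model.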
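Suppose that the model is
\[
y = g(x) + \varepsilon
\]
A second-order Taylor's series expansion (with remainder term) of the function $g(x)$ about the point $x = 0$ is
\[
g(x) = g(0) + x'D_x g(0) + \frac{x'D_x^2 g(0)x}{2} + R
\]
Use the approximation, which simply drops the remainder term, as an approximation to $g(x)$:
\[
g(x) \simeq g_K(x) = g(0) + x'D_x g(0) + \frac{x'D_x^2 g(0)x}{2}
\]
As $x \rightarrow 0$, the approximation becomes more and more exact, in the sense that $g_K(x) \rightarrow g(x)$, $D_x g_K(x) \rightarrow D_x g(x)$ and $D_x^2 g_K(x) \rightarrow D_x^2 g(x)$. For $x = 0$, the approximation is exact, up to the second order. The idea behind many flexible functional forms is to note that $g(0)$, $D_x g(0)$ and $D_x^2 g(0)$ are all constants. If we treat them as parameters, the approximation will have exactly enough free parameters to approximate the function $g(x)$, which is of unknown form, exactly, up to second order, at the point $x = 0$. The model is
\[
g_K(x) = \alpha + x'\beta + 1/2\, x'\Gamma x
\]
so the regression model to fit is
\[
y = \alpha + x'\beta + 1/2\, x'\Gamma x + \varepsilon
\]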
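While the regression model has enough free parameters to be Diewert-flexible, the question remains: is $\text{plim}\,\hat\alpha = g(0)$? Is $\text{plim}\,\hat\beta = D_x g(0)$? Is $\text{plim}\,\hat\Gamma = D_x^2 g(0)$?

The answer is no, in general. The reason is that if we treat the true values of the parameters as these derivatives, then $\varepsilon$ is forced to play the part of the remainder term, which is a function of $x$, so that $x$ and $\varepsilon$ are correlated in this case. As before, the estimator is biased in this case.

A simpler example would be to consider a first-order Taylor's series approximation to a quadratic function. Draw picture.

The conclusion is that flexible functional forms aren't really flexible in a useful statistical sense, in that neither the function itself nor its derivatives are consistently estimated, unless the function belongs to the parametric family of the specified functional form. In order to lead to consistent inferences, the regression model must be correctly specified.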
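Consider the translog form, in which the variables are subjected to a logarithmic transformation and measured relative to an expansion point $\bar z$. The model is defined by
\[
\begin{aligned}
y &= \ln(c) \\
x &= \ln\left(\frac{z}{\bar z}\right) = \ln(z) - \ln(\bar z) \\
y &= \alpha + x'\beta + 1/2\, x'\Gamma x + \varepsilon
\end{aligned}
\]
Note that
\[
\begin{aligned}
\frac{\partial y}{\partial x} &= \beta + \Gamma x \\
&= \frac{\partial \ln(c)}{\partial \ln(z)} \quad \text{(the other part of $x$ is constant)} \\
&= \frac{\partial c}{\partial z}\frac{z}{c}
\end{aligned}
\]
which is the elasticity of $c$ with respect to $z$. Note that at the expansion point, $z = \bar z$, we have $x = 0$, so
\[
\left.\frac{\partial y}{\partial x}\right|_{z = \bar z} = \beta
\]
so the $\beta$ are the first-order elasticities at $\bar z$.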
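To illustrate, consider that $y$ is cost of production:
\[
y = c(w, q)
\]
where $w$ is a vector of input prices and $q$ is output. We could add other variables by extending $q$ in the obvious manner. By Shephard's lemma, the conditional factor demands are
\[
x = \frac{\partial c(w, q)}{\partial w}
\]
and the cost shares of the factors are therefore
\[
s = \frac{wx}{c} = \frac{\partial c(w, q)}{\partial w}\frac{w}{c}
\]
which is simply the vector of elasticities of cost with respect to input prices. If the cost function is modeled using a translog function, we have
\[
\begin{aligned}
\ln(c) &= \alpha + x'\beta + z'\delta + 1/2 \begin{bmatrix} x' & z \end{bmatrix} \begin{bmatrix} \Gamma_{11} & \Gamma_{12} \\ \Gamma_{12}' & \Gamma_{22} \end{bmatrix} \begin{bmatrix} x \\ z \end{bmatrix} \\
&= \alpha + x'\beta + z'\delta + 1/2\, x'\Gamma_{11}x + x'\Gamma_{12}z + 1/2\, z^2\Gamma_{22}
\end{aligned}
\]
where $x = \ln(w/\bar w)$ (element-by-element division) and $z = \ln(q/\bar q)$, and
\[
\Gamma_{11} = \begin{bmatrix} \gamma_{11} & \gamma_{12} \\ \gamma_{12} & \gamma_{22} \end{bmatrix}, \qquad
\Gamma_{12} = \begin{bmatrix} \gamma_{13} \\ \gamma_{23} \end{bmatrix}, \qquad
\Gamma_{22} = \gamma_{33} .
\]
Note that symmetry of the second derivatives has been imposed. The share equations are then
\[
s = \beta + \begin{bmatrix} \Gamma_{11} & \Gamma_{12} \end{bmatrix} \begin{bmatrix} x \\ z \end{bmatrix}
\]
Therefore, the share equations and the cost equation have parameters in common. By pooling the equations together and imposing the (true) restriction that the parameters of the equations be the same, we can gain efficiency.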
To illustrate in more detail, consider the case of two inputs, so
\[
x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} .
\]
In this case the translog model of the logarithmic cost function is
\[
\ln c = \alpha + \beta_1 x_1 + \beta_2 x_2 + \delta z + \frac{\gamma_{11}}{2}x_1^2 + \frac{\gamma_{22}}{2}x_2^2 + \frac{\gamma_{33}}{2}z^2 + \gamma_{12}x_1x_2 + \gamma_{13}x_1z + \gamma_{23}x_2z
\]
The two cost shares of the inputs are the derivatives of $\ln c$ with respect to $x_1$ and $x_2$:
\[
\begin{aligned}
s_1 &= \beta_1 + \gamma_{11}x_1 + \gamma_{12}x_2 + \gamma_{13}z \\
s_2 &= \beta_2 + \gamma_{12}x_1 + \gamma_{22}x_2 + \gamma_{23}z
\end{aligned}
\]
Note that the share equations and the cost equation have parameters in common. One can do a pooled estimation of the three equations at once, imposing that the parameters are the same. In this way we're using more observations and therefore more information, which will lead to improved efficiency. Note that this does assume that the cost equation is correctly specified (i.e., not an approximation), since otherwise the derivatives would not be the true derivatives of the log cost function, and would then be misspecified for the shares. To pool the equations, write the model in matrix form (adding in error terms)
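\[
\begin{bmatrix} \ln c \\ s_1 \\ s_2 \end{bmatrix} =
\begin{bmatrix}
1 & x_1 & x_2 & z & \frac{x_1^2}{2} & \frac{x_2^2}{2} & \frac{z^2}{2} & x_1x_2 & x_1z & x_2z \\
0 & 1 & 0 & 0 & x_1 & 0 & 0 & x_2 & z & 0 \\
0 & 0 & 1 & 0 & 0 & x_2 & 0 & x_1 & 0 & z
\end{bmatrix}
\begin{bmatrix} \alpha \\ \beta_1 \\ \beta_2 \\ \delta \\ \gamma_{11} \\ \gamma_{22} \\ \gamma_{33} \\ \gamma_{12} \\ \gamma_{13} \\ \gamma_{23} \end{bmatrix}
+ \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \end{bmatrix}
\]
This is one observation on the three equations. With the appropriate notation, a single observation can be written as
\[
y_t = X_t\theta + \varepsilon_t
\]
The overall model would stack $n$ observations on the three equations for a total of $3n$ observations:
\[
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} =
\begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix}\theta +
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}
\]
Next we need to consider the errors. For observation $t$ the errors can be placed in a vector
\[
\varepsilon_t = \begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \\ \varepsilon_{3t} \end{bmatrix}
\]
First consider the covariance matrix of this vector: the shares are certainly correlated since they must sum to one. (In fact, with 2 shares the variances are equal and the covariance is -1 times the variance. General notation is used to allow easy extension to the case of more than 2 inputs). Also, it's likely that the shares and the cost equation have different variances. Supposing that the model is covariance stationary, the variance of $\varepsilon_t$ won't depend upon $t$:
\[
Var\,\varepsilon_t = \Sigma_0 = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \cdot & \sigma_{22} & \sigma_{23} \\ \cdot & \cdot & \sigma_{33} \end{bmatrix}
\]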
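If the errors are uncorrelated across observations $t$, the covariance matrix of the stacked error vector is
\[
Var \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} = \Sigma =
\begin{bmatrix}
\Sigma_0 & 0 & \cdots & 0 \\
0 & \Sigma_0 & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & \Sigma_0
\end{bmatrix}
= I_n \otimes \Sigma_0
\]
where the symbol $\otimes$ indicates the Kronecker product. The Kronecker product of two matrices $A$ ($p \times q$) and $B$ is
\[
A \otimes B = \begin{bmatrix}
a_{11}B & a_{12}B & \cdots & a_{1q}B \\
a_{21}B & \ddots & & \vdots \\
\vdots & & & \\
a_{p1}B & \cdots & & a_{pq}B
\end{bmatrix}
\]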
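To make the block structure concrete, here is a small plain-Python sketch of the Kronecker product applied to $I_2 \otimes \Sigma_0$. The function name `kron` and the numerical values in `sigma0` are illustrative assumptions only.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    p, q = len(A), len(A[0])
    r, s = len(B), len(B[0])
    # Element (i, j) of the result is a_{i//r, j//s} * b_{i%r, j%s}
    return [[A[i // r][j // s] * B[i % r][j % s] for j in range(q * s)]
            for i in range(p * r)]

identity2 = [[1, 0],
             [0, 1]]
sigma0 = [[2.0, 0.5],      # hypothetical per-observation error covariance
          [0.5, 1.0]]

sigma = kron(identity2, sigma0)   # block-diagonal: sigma0 repeated on the diagonal
for row in sigma:
    print(row)
```

The result has copies of `sigma0` along the diagonal and zeros elsewhere, which is the structure of the stacked error covariance when observations are uncorrelated with one another.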
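So, this model has heteroscedasticity and autocorrelation, so OLS won't be efficient. The next question is: how do we estimate efficiently using FGLS? FGLS is based upon inverting the estimated error covariance $\hat\Sigma$. So we need to estimate $\Sigma$. A procedure is:

1. Estimate each equation by OLS.

2. Estimate $\Sigma_0$ using
\[
\hat\Sigma_0 = \frac{1}{n}\sum_{t=1}^{n} \hat\varepsilon_t \hat\varepsilon_t'
\]

3. Next we need to account for the singularity of $\Sigma_0$. It can be shown that $\hat\Sigma_0$ will be singular when the shares sum to one, so FGLS won't work. The solution is to drop one of the share equations, for example the second. The model becomes
\[
\begin{bmatrix} \ln c \\ s_1 \end{bmatrix} =
\begin{bmatrix}
1 & x_1 & x_2 & z & \frac{x_1^2}{2} & \frac{x_2^2}{2} & \frac{z^2}{2} & x_1x_2 & x_1z & x_2z \\
0 & 1 & 0 & 0 & x_1 & 0 & 0 & x_2 & z & 0
\end{bmatrix}
\begin{bmatrix} \alpha \\ \beta_1 \\ \beta_2 \\ \delta \\ \gamma_{11} \\ \gamma_{22} \\ \gamma_{33} \\ \gamma_{12} \\ \gamma_{13} \\ \gamma_{23} \end{bmatrix}
+ \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix}
\]
or in matrix notation for the observation:
\[
y_t^* = X_t^*\theta + \varepsilon_t^*
\]
and in stacked notation for all observations we have the $2n$ observations:
\[
\begin{bmatrix} y_1^* \\ y_2^* \\ \vdots \\ y_n^* \end{bmatrix} =
\begin{bmatrix} X_1^* \\ X_2^* \\ \vdots \\ X_n^* \end{bmatrix}\theta +
\begin{bmatrix} \varepsilon_1^* \\ \varepsilon_2^* \\ \vdots \\ \varepsilon_n^* \end{bmatrix}
\]
or, finally in matrix notation for all observations:
\[
y^* = X^*\theta + \varepsilon^*
\]
Considering the error covariance, we can define
\[
\begin{aligned}
\Sigma_0^* &= Var \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix} \\
\Sigma^* &= I_n \otimes \Sigma_0^*
\end{aligned}
\]
Define $\hat\Sigma_0^*$ as the leading $2 \times 2$ block of $\hat\Sigma_0$, and form
\[
\hat\Sigma^* = I_n \otimes \hat\Sigma_0^* .
\]
This is a consistent estimator, following the consistency of OLS and applying a LLN.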
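4. Next compute the Cholesky factorization
\[
\hat P_0 = Chol\left[(\hat\Sigma_0^*)^{-1}\right]
\]
(I am assuming this is defined as an upper triangular matrix, which is consistent with the way Octave does it) and the Cholesky factorization of the inverse of the overall covariance matrix of the 2 equation model, which can be calculated as
\[
\hat P = Chol\left[(\hat\Sigma^*)^{-1}\right] = I_n \otimes \hat P_0
\]

5. Finally the FGLS estimator can be calculated by applying OLS to the transformed model
\[
\hat P y^* = \hat P X^*\theta + \hat P \varepsilon^*
\]
or by directly using the GLS formula
\[
\hat\theta_{FGLS} = \left( X^{*\prime}(\hat\Sigma^*)^{-1}X^* \right)^{-1} X^{*\prime}(\hat\Sigma^*)^{-1}y^*
\]
It is equivalent to transform each observation individually:
\[
\hat P_0 y_t^* = \hat P_0 X_t^*\theta + \hat P_0 \varepsilon_t^*
\]
and then apply OLS.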
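A few further comments:

Also, we have only imposed symmetry of the second derivatives. Another restriction that the model should satisfy is that the estimated shares should sum to 1. This can be accomplished by imposing
\[
\begin{aligned}
\beta_1 + \beta_2 &= 1 \\
\sum_{i=1}^{3}\gamma_{ij} &= 0, \quad j = 1, 2, 3.
\end{aligned}
\]
These are linear parameter restrictions, so they are easy to impose and will improve efficiency if they are true.

The estimation procedure outlined above can be iterated. That is, estimate $\hat\theta_{FGLS}$ as above, then re-estimate $\Sigma_0^*$ using errors calculated as
\[
\hat\varepsilon = y - X\hat\theta_{FGLS}
\]
These might be expected to lead to a better estimate than the estimator based on $\hat\theta_{OLS}$, since FGLS is asymptotically more efficient. Then re-estimate $\theta$ using the new estimated error covariance. It can be shown that if this is repeated until the estimates don't change (i.e., iterated to convergence) then the resulting estimator is the MLE. At any rate, the asymptotic properties of the iterated and uniterated estimators are the same, since both are based upon a consistent estimator of the error covariance.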
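When one form is a parametric restriction of another, the previously studied tests such as Wald, LR, score or $qF$ are all possibilities. For example, the Cobb-Douglas model is a parametric restriction of the translog: The translog is
\[
y_t = \alpha + x_t'\beta + 1/2\, x_t'\Gamma x_t + \varepsilon
\]
where the variables are in logarithms, while the Cobb-Douglas is
\[
y_t = \alpha + x_t'\beta + \varepsilon
\]
so a test of the Cobb-Douglas versus the translog is simply a test that $\Gamma = 0$.

The situation is more complicated when we want to test non-nested hypotheses. If the two functional forms are linear in the parameters, and use the same transformation of the dependent variable, then they may be written as
\[
\begin{aligned}
M_1 &: y = X\beta + \varepsilon, \qquad \varepsilon_t \sim iid(0, \sigma_\varepsilon^2) \\
M_2 &: y = Z\gamma + \eta, \qquad \eta_t \sim iid(0, \sigma_\eta^2)
\end{aligned}
\]
We wish to test hypotheses of the form: $H_0 : M_i$ is correctly specified versus $H_A : M_i$ is misspecified, for $i = 1, 2$.

One could account for non-iid errors, but we'll suppress this for simplicity.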
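There are a number of ways to proceed. We'll consider the $J$ test, proposed by Davidson and MacKinnon, Econometrica (1981). The idea is to artificially nest the two models, e.g.,
\[
y = (1 - \alpha)X\beta + \alpha(Z\gamma) + \omega
\]
If the first model is correctly specified, then the true value of $\alpha$ is zero. On the other hand, if the second model is correctly specified then $\alpha = 1$.

The problem is that this model is not identified in general. For example, if the models share some regressors, as in
\[
\begin{aligned}
M_1 &: y_t = \beta_1 + \beta_2 x_{2t} + \beta_3 x_{3t} + \varepsilon_t \\
M_2 &: y_t = \gamma_1 + \gamma_2 x_{2t} + \gamma_3 x_{4t} + \eta_t
\end{aligned}
\]
then the composite model is
\[
y_t = (1 - \alpha)\beta_1 + (1 - \alpha)\beta_2 x_{2t} + (1 - \alpha)\beta_3 x_{3t} + \alpha\gamma_1 + \alpha\gamma_2 x_{2t} + \alpha\gamma_3 x_{4t} + \omega_t
\]
Combining terms we get
\[
\begin{aligned}
y_t &= \left((1 - \alpha)\beta_1 + \alpha\gamma_1\right) + \left((1 - \alpha)\beta_2 + \alpha\gamma_2\right)x_{2t} + (1 - \alpha)\beta_3 x_{3t} + \alpha\gamma_3 x_{4t} + \omega_t \\
&= \delta_1 + \delta_2 x_{2t} + \delta_3 x_{3t} + \delta_4 x_{4t} + \omega_t
\end{aligned}
\]
The four $\delta$'s are consistently estimable, but $\alpha$ is not, since we have four equations in 7 unknowns, so one can't test the hypothesis that $\alpha = 0$.

The idea of the $J$ test is to substitute $\hat\gamma$ in place of $\gamma$. This is a consistent estimator supposing that the second model is correctly specified. It will tend to a finite probability limit even if the second model is misspecified. Then estimate the model
\[
\begin{aligned}
y &= (1 - \alpha)X\beta + \alpha(Z\hat\gamma) + \omega \\
&= X\theta + \alpha\hat y + \omega
\end{aligned}
\]
where $\hat y = Z(Z'Z)^{-1}Z'y = P_Z y$. In this model, $\alpha$ is consistently estimable, and one can show that, under the hypothesis that the first model is correct, $\hat\alpha \overset{p}{\rightarrow} 0$ and that the ordinary $t$-statistic for $\alpha = 0$ is asymptotically normal:
\[
t = \frac{\hat\alpha}{\hat\sigma_{\hat\alpha}} \overset{a}{\sim} N(0, 1)
\]
If the second model is correctly specified, then $t \overset{p}{\rightarrow} \infty$, since $\hat\alpha$ tends in probability to 1, while its estimated standard error tends to zero. Thus the test will always reject the false null model, asymptotically, since the statistic will eventually exceed any critical value with probability one.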
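We can reverse the roles of the models, testing the second against the first.

It may be the case that neither model is correctly specified. In this case, the test will still reject the null hypothesis, asymptotically, if we use critical values from the $N(0, 1)$ distribution, since as long as $\hat\alpha$ tends to something different from zero, $|t| \overset{p}{\rightarrow} \infty$. Of course, when we switch the roles of the models the other will also be rejected asymptotically.

In summary, there are 4 possible outcomes when we test two models, each against the other. Both may be rejected, neither may be rejected, or one of the two may be rejected.

There are other tests available for non-nested models. The $J$ test is simple to apply when both models are linear in the parameters. The $P$-test is similar, but easily extends to the case where $M_1$ is nonlinear.

The above presentation assumes that the same transformation of the dependent variable is used by both models. MacKinnon, White and Davidson, Journal of Econometrics, (1983) shows how to deal with the case of different transformations.

Monte-Carlo evidence shows that these tests often over-reject a correctly specified model. One can use bootstrap critical values to get better-performing tests.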
Chapter 11
Exogeneity and simultaneity
Several times we've encountered cases where correlation between regressors and the error term leads to biasedness and inconsistency of the OLS estimator. Cases include autocorrelation with lagged dependent variables and measurement error in the regressors. Another important case is that of simultaneous equations. The cause is different, but the effect is the same.

Up until now our model has been
\[
y = X\beta + \varepsilon
\]
where, for purposes of estimation, we can treat $X$ as fixed. This means that when estimating $\beta$ we condition on $X$. When analyzing dynamic models, we're not interested in conditioning on $X$, as we saw in the section on stochastic regressors. Nevertheless, the OLS estimator obtained by treating $X$ as fixed continues to have desirable asymptotic properties even in that case.
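Simultaneous equations is a different prospect. An example of a simultaneous equation system is a simple supply-demand system:
\[
\begin{aligned}
\text{Demand:} \quad q_t &= \alpha_1 + \alpha_2 p_t + \alpha_3 y_t + \varepsilon_{1t} \\
\text{Supply:} \quad q_t &= \beta_1 + \beta_2 p_t + \varepsilon_{2t} \\
E\left( \begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix} \begin{bmatrix} \varepsilon_{1t} & \varepsilon_{2t} \end{bmatrix} \right) &= \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \cdot & \sigma_{22} \end{bmatrix} \equiv \Sigma, \quad \forall t
\end{aligned}
\]
The presumption is that $q_t$ and $p_t$ are jointly determined at the same time by the intersection of these equations. We'll assume that $y_t$ is determined by some unrelated process. It's easy to see that we have correlation between regressors and errors. Solving for $p_t$:
\[
\begin{aligned}
\alpha_1 + \alpha_2 p_t + \alpha_3 y_t + \varepsilon_{1t} &= \beta_1 + \beta_2 p_t + \varepsilon_{2t} \\
\beta_2 p_t - \alpha_2 p_t &= \alpha_1 - \beta_1 + \alpha_3 y_t + \varepsilon_{1t} - \varepsilon_{2t} \\
p_t &= \frac{\alpha_1 - \beta_1}{\beta_2 - \alpha_2} + \frac{\alpha_3 y_t}{\beta_2 - \alpha_2} + \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2}
\end{aligned}
\]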
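Now consider whether $p_t$ is uncorrelated with $\varepsilon_{1t}$:
\[
\begin{aligned}
E(p_t \varepsilon_{1t}) &= E\left\{ \left( \frac{\alpha_1 - \beta_1}{\beta_2 - \alpha_2} + \frac{\alpha_3 y_t}{\beta_2 - \alpha_2} + \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2} \right) \varepsilon_{1t} \right\} \\
&= \frac{\sigma_{11} - \sigma_{12}}{\beta_2 - \alpha_2}
\end{aligned}
\]
Because of this correlation, OLS estimation of the demand equation will be biased and inconsistent. The same applies to the supply equation, for the same reason.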
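A short simulation can illustrate the inconsistency. The sketch below works with the supply equation, because it has a single regressor and so needs only simple-regression formulas; by the same argument as above, $E(p_t\varepsilon_{2t}) = (\sigma_{12} - \sigma_{22})/(\beta_2 - \alpha_2)$. All parameter values, the seed, and the helper names are illustrative assumptions.

```python
import random

def cov(a, b):
    """Sample covariance (divisor n) of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / n

def simulate_market(n=50000, seed=1):
    """Draw equilibrium (p, q) pairs from the supply-demand system."""
    a1, a2, a3 = 10.0, -1.0, 1.0      # demand: q = a1 + a2*p + a3*y + e1
    b1, b2 = 2.0, 1.0                 # supply: q = b1 + b2*p + e2
    rng = random.Random(seed)
    p, q = [], []
    for _ in range(n):
        y = rng.gauss(0.0, 2.0)       # exogenous income, variance 4
        e1 = rng.gauss(0.0, 1.0)      # demand shock, variance 1
        e2 = rng.gauss(0.0, 1.0)      # supply shock, variance 1, independent of e1
        price = (a1 - b1 + a3 * y + e1 - e2) / (b2 - a2)   # solved-out price
        p.append(price)
        q.append(b1 + b2 * price + e2)
    return p, q, b2

p, q, beta2 = simulate_market()
ols_supply_slope = cov(p, q) / cov(p, p)

# Analytic limit: beta2 + E(p*e2)/V(p), with E(p*e2) = -1/2 and V(p) = (4+1+1)/4
var_p = (4.0 + 1.0 + 1.0) / 4.0
ols_plim = beta2 + (-1.0 / 2.0) / var_p
print(ols_supply_slope, ols_plim, beta2)
```

Under these assumed values the OLS slope settles near 2/3 rather than the true supply slope of 1, no matter how large the sample is.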
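In this model, $q_t$ and $p_t$ are the endogenous variables (endogs), that are determined within the system. $y_t$ is an exogenous variable (exog). These concepts are a bit tricky, and we'll return to them in a minute. First, some notation. Suppose we group together current endogs in the vector $Y_t$. If there are $G$ endogs, $Y_t$ is $G \times 1$. Group current and lagged exogs, as well as lagged endogs, in the vector $X_t$, which is $K \times 1$. Stack the errors of the $G$ equations into the error vector $E_t$. The model, with additional assumptions, can be written as
\[
\begin{aligned}
Y_t'\Gamma &= X_t'B + E_t' \\
E_t &\sim N(0, \Sigma), \quad \forall t \\
E(E_t E_s') &= 0, \quad t \neq s
\end{aligned}
\]
We can stack all $n$ observations and write the model as
\[
\begin{aligned}
Y\Gamma &= XB + E \\
E(X'E) &= 0_{(K \times G)} \\
vec(E) &\sim N(0, \Psi)
\end{aligned}
\]
where
\[
Y = \begin{bmatrix} Y_1' \\ Y_2' \\ \vdots \\ Y_n' \end{bmatrix}, \quad
X = \begin{bmatrix} X_1' \\ X_2' \\ \vdots \\ X_n' \end{bmatrix}, \quad
E = \begin{bmatrix} E_1' \\ E_2' \\ \vdots \\ E_n' \end{bmatrix}
\]
$Y$ is $n \times G$, $X$ is $n \times K$, and $E$ is $n \times G$.

This system is complete, in that there are as many equations as endogs.

There is a normality assumption. This isn't necessary, but allows us to consider the relationship between least squares and ML estimators.

Since there is no autocorrelation of the $E_t$'s, and since the columns of $E$ are individually homoscedastic, then
\[
\Psi = \begin{bmatrix}
\sigma_{11}I_n & \sigma_{12}I_n & \cdots & \sigma_{1G}I_n \\
 & \sigma_{22}I_n & & \vdots \\
 & & \ddots & \vdots \\
\cdot & & & \sigma_{GG}I_n
\end{bmatrix} = \Sigma \otimes I_n
\]
$X$ may contain lagged endogenous and exogenous variables. These variables are predetermined.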
11.2 Exogeneity
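The model defines a data generating process. The model involves two sets of variables, $Y_t$ and $X_t$, as well as a parameter vector
\[
\theta = \begin{bmatrix} vec(\Gamma)' & vec(B)' & vec^*(\Sigma)' \end{bmatrix}'
\]
In general, without additional restrictions, $\theta$ is a $G^2 + GK + \left(G^2 - G\right)/2 + G$ dimensional vector. This is the parameter vector that we're interested in estimating.

In principle, there exists a joint density function for $Y_t$ and $X_t$, which depends on a parameter vector $\phi$. Write this density as
\[
f_t(Y_t, X_t | \phi, \mathcal{I}_t)
\]
where $\mathcal{I}_t$ is the information set in period $t$. This includes lagged $Y_t$'s and lagged $X_t$'s, of course. This can be factored into the density of $Y_t$ conditional on $X_t$ times the marginal density of $X_t$:
\[
f_t(Y_t, X_t | \phi, \mathcal{I}_t) = f_t(Y_t | X_t, \phi, \mathcal{I}_t) f_t(X_t | \phi, \mathcal{I}_t)
\]
This is a general factorization, but it may very well be the case that not all parameters in $\phi$ affect both factors. So use $\phi_1$ to indicate elements of $\phi$ that enter into the conditional density and write $\phi_2$ for parameters that enter into the marginal. In general, $\phi_1$ and $\phi_2$ may share elements, of course. We have
\[
f_t(Y_t, X_t | \phi, \mathcal{I}_t) = f_t(Y_t | X_t, \phi_1, \mathcal{I}_t) f_t(X_t | \phi_2, \mathcal{I}_t)
\]
Recall that the model is
\[
\begin{aligned}
Y_t'\Gamma &= X_t'B + E_t' \\
E_t &\sim N(0, \Sigma), \quad \forall t \\
E(E_t E_s') &= 0, \quad t \neq s
\end{aligned}
\]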
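Normality and lack of correlation over time imply that the observations are independent of one another, so we can write the log-likelihood function as the sum of likelihood contributions of each observation:
\[
\begin{aligned}
\ln L(Y | \theta, \mathcal{I}_t) &= \sum_{t=1}^{n} \ln f_t(Y_t, X_t | \phi, \mathcal{I}_t) \\
&= \sum_{t=1}^{n} \ln \left( f_t(Y_t | X_t, \phi_1, \mathcal{I}_t) f_t(X_t | \phi_2, \mathcal{I}_t) \right) \\
&= \sum_{t=1}^{n} \ln f_t(Y_t | X_t, \phi_1, \mathcal{I}_t) + \sum_{t=1}^{n} \ln f_t(X_t | \phi_2, \mathcal{I}_t)
\end{aligned}
\]
Definition 15 (Weak Exogeneity) $X_t$ is weakly exogenous for $\theta$ (the original parameter vector) if there is a mapping from $\phi$ to $\theta$ that is invariant to $\phi_2$. More formally, for an arbitrary $(\phi_1, \phi_2)$, $\theta(\phi) = \theta(\phi_1)$.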
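This implies that $\phi_1$ and $\phi_2$ cannot share elements if $X_t$ is weakly exogenous, since $\phi_1$ would change as $\phi_2$ changes, which prevents consideration of arbitrary combinations of $(\phi_1, \phi_2)$.

Supposing that $X_t$ is weakly exogenous, then the MLE of $\phi_1$ using the joint density is the same as the MLE using only the conditional density
\[
\ln L(Y | X, \theta, \mathcal{I}_t) = \sum_{t=1}^{n} \ln f_t(Y_t | X_t, \phi_1, \mathcal{I}_t)
\]
since the conditional likelihood doesn't depend on $\phi_2$. In other words, the joint and conditional log-likelihoods maximize at the same value of $\phi_1$.

With weak exogeneity, knowledge of the DGP of $X_t$ is irrelevant for inference on $\phi_1$, and knowledge of $\phi_1$ is sufficient to recover the parameter of interest, $\theta$. Since the DGP of $X_t$ is irrelevant, we can treat $X_t$ as fixed in inference.

By the invariance property of MLE, the MLE of $\theta$ is $\theta(\hat\phi_1)$, and this mapping is assumed to exist in the definition of weak exogeneity.

Of course, we'll need to figure out just what this mapping is to recover $\hat\theta$ from $\hat\phi_1$.

With lack of weak exogeneity, the joint and conditional likelihood functions maximize in different places. For this reason, we can't treat $X_t$ as fixed in inference.

Lagged $Y_t$ satisfy the definition, since they are in the conditioning information set, e.g., $Y_{t-1} \in \mathcal{I}_t$. Lagged $Y_t$ aren't exogenous in the normal usage of the word, since their values are determined within the model, just earlier on.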
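11.3 Reduced form
Recall that the model is
\[
\begin{aligned}
Y_t'\Gamma &= X_t'B + E_t' \\
V(E_t) &= \Sigma
\end{aligned}
\]
This is the model in structural form.

Definition 16 (Structural form) An equation is in structural form when more than one current period endogenous variable is included.

The solution for the current period endogs is easy to find. It is
\[
\begin{aligned}
Y_t' &= X_t'B\Gamma^{-1} + E_t'\Gamma^{-1} \\
&= X_t'\Pi + V_t'
\end{aligned}
\]
Now only one current period endog appears in each equation. This is the reduced form.

Definition 17 (Reduced form) An equation is in reduced form if only one current period endog is included.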
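An example is our supply/demand system. The reduced form for quantity is obtained by solving the supply equation for price and substituting into demand:
\[
\begin{aligned}
q_t &= \alpha_1 + \alpha_2\left( \frac{q_t - \beta_1 - \varepsilon_{2t}}{\beta_2} \right) + \alpha_3 y_t + \varepsilon_{1t} \\
\beta_2 q_t - \alpha_2 q_t &= \beta_2\alpha_1 - \alpha_2(\beta_1 + \varepsilon_{2t}) + \beta_2\alpha_3 y_t + \beta_2\varepsilon_{1t} \\
q_t &= \frac{\beta_2\alpha_1 - \alpha_2\beta_1}{\beta_2 - \alpha_2} + \frac{\beta_2\alpha_3 y_t}{\beta_2 - \alpha_2} + \frac{\beta_2\varepsilon_{1t} - \alpha_2\varepsilon_{2t}}{\beta_2 - \alpha_2} \\
&= \pi_{11} + \pi_{21}y_t + V_{1t}
\end{aligned}
\]
Similarly, the rf for price is
\[
\begin{aligned}
\beta_1 + \beta_2 p_t + \varepsilon_{2t} &= \alpha_1 + \alpha_2 p_t + \alpha_3 y_t + \varepsilon_{1t} \\
\beta_2 p_t - \alpha_2 p_t &= \alpha_1 - \beta_1 + \alpha_3 y_t + \varepsilon_{1t} - \varepsilon_{2t} \\
p_t &= \frac{\alpha_1 - \beta_1}{\beta_2 - \alpha_2} + \frac{\alpha_3 y_t}{\beta_2 - \alpha_2} + \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2} \\
&= \pi_{12} + \pi_{22}y_t + V_{2t}
\end{aligned}
\]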
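The reduced form expressions can be sanity-checked numerically: for any values of $y_t$ and the structural errors, the reduced-form $(q_t, p_t)$ must satisfy both structural equations. The sketch below does this for one arbitrary draw; the parameter values and the function name `reduced_form` are illustrative assumptions.

```python
def reduced_form(a1, a2, a3, b1, b2, y, e1, e2):
    """Reduced-form quantity and price for the supply-demand system.

    Demand: q = a1 + a2*p + a3*y + e1.  Supply: q = b1 + b2*p + e2.
    """
    d = b2 - a2
    q = (b2 * a1 - a2 * b1) / d + (b2 * a3 / d) * y + (b2 * e1 - a2 * e2) / d
    p = (a1 - b1) / d + (a3 / d) * y + (e1 - e2) / d
    return q, p

# Hypothetical structural parameters and one arbitrary draw of (y, e1, e2)
a1, a2, a3, b1, b2 = 10.0, -1.0, 1.0, 2.0, 1.0
y, e1, e2 = 3.0, 0.4, -0.2

q, p = reduced_form(a1, a2, a3, b1, b2, y, e1, e2)
demand_gap = q - (a1 + a2 * p + a3 * y + e1)   # zero if demand holds
supply_gap = q - (b1 + b2 * p + e2)            # zero if supply holds
print(q, p, demand_gap, supply_gap)
```

Both gaps are zero up to rounding, confirming that the reduced form is just the structural system solved for the current endogs.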
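The interesting thing about the rf is that the equations individually satisfy the classical assumptions, since $y_t$ is uncorrelated with $\varepsilon_{1t}$ and $\varepsilon_{2t}$ by assumption, and therefore $E(y_t V_{it}) = 0$, $i = 1, 2$, $\forall t$. The errors of the rf are
\[
\begin{bmatrix} V_{1t} \\ V_{2t} \end{bmatrix} = \begin{bmatrix} \dfrac{\beta_2\varepsilon_{1t} - \alpha_2\varepsilon_{2t}}{\beta_2 - \alpha_2} \\[2ex] \dfrac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2} \end{bmatrix}
\]
The variance of $V_{1t}$ is
\[
\begin{aligned}
V(V_{1t}) &= E\left[ \left( \frac{\beta_2\varepsilon_{1t} - \alpha_2\varepsilon_{2t}}{\beta_2 - \alpha_2} \right) \left( \frac{\beta_2\varepsilon_{1t} - \alpha_2\varepsilon_{2t}}{\beta_2 - \alpha_2} \right) \right] \\
&= \frac{\beta_2^2\sigma_{11} - 2\beta_2\alpha_2\sigma_{12} + \alpha_2^2\sigma_{22}}{(\beta_2 - \alpha_2)^2}
\end{aligned}
\]
This is constant over time, so the first rf equation is homoscedastic. Likewise, since the $\varepsilon_t$ are independent over time, so are the $V_t$.

The variance of the second rf error is
\[
\begin{aligned}
V(V_{2t}) &= E\left[ \left( \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2} \right) \left( \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2} \right) \right] \\
&= \frac{\sigma_{11} - 2\sigma_{12} + \sigma_{22}}{(\beta_2 - \alpha_2)^2}
\end{aligned}
\]
and the contemporaneous covariance of the errors across equations is
\[
\begin{aligned}
E(V_{1t}V_{2t}) &= E\left[ \left( \frac{\beta_2\varepsilon_{1t} - \alpha_2\varepsilon_{2t}}{\beta_2 - \alpha_2} \right) \left( \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2} \right) \right] \\
&= \frac{\beta_2\sigma_{11} - (\beta_2 + \alpha_2)\sigma_{12} + \alpha_2\sigma_{22}}{(\beta_2 - \alpha_2)^2}
\end{aligned}
\]
In summary the rf equations individually satisfy the classical assumptions, under the assumptions we've made, but they are contemporaneously correlated.

The general form of the rf is
\[
\begin{aligned}
Y_t' &= X_t'B\Gamma^{-1} + E_t'\Gamma^{-1} \\
&= X_t'\Pi + V_t'
\end{aligned}
\]
so we have that
\[
V_t = \left(\Gamma^{-1}\right)'E_t \sim N\left( 0, \left(\Gamma^{-1}\right)'\Sigma\Gamma^{-1} \right), \quad \forall t
\]
and that the $V_t$ are timewise independent (note that this wouldn't be the case if the $E_t$ were autocorrelated).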
11.4 IV estimation
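The IV estimator may appear a bit unusual at first, but it will grow on you over time.

The simultaneous equations model is
\[
Y\Gamma = XB + E
\]
Considering the first equation (this is without loss of generality, since we can always reorder the equations) we can partition the $Y$ matrix as
\[
Y = \begin{bmatrix} y & Y_1 & Y_2 \end{bmatrix}
\]
Here $y$ is the first column, $Y_1$ are the other endogenous variables that enter the first equation, and $Y_2$ are endogs that are excluded from this equation.

Similarly, partition $X$ as
\[
X = \begin{bmatrix} X_1 & X_2 \end{bmatrix}
\]
where $X_1$ are the included exogs and $X_2$ are the excluded exogs.

Finally, partition the error matrix as
\[
E = \begin{bmatrix} \varepsilon & E_{12} \end{bmatrix}
\]
Assume that $\Gamma$ has ones on the main diagonal. These are normalization restrictions that simply scale the remaining coefficients on each equation, and which scale the variances of the error terms.

Given this scaling and our partitioning, the coefficient matrices can be written as
\[
\Gamma = \begin{bmatrix} 1 & \Gamma_{12} \\ -\gamma_1 & \Gamma_{22} \\ 0 & \Gamma_{32} \end{bmatrix}, \qquad
B = \begin{bmatrix} \beta_1 & B_{12} \\ 0 & B_{22} \end{bmatrix}
\]
With this, the first equation can be written as
\[
\begin{aligned}
y &= Y_1\gamma_1 + X_1\beta_1 + \varepsilon \\
&= Z\delta + \varepsilon
\end{aligned}
\]
The problem, as we've seen, is that $Z$ is correlated with $\varepsilon$, since $Y_1$ is formed of endogs.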
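Now, let's consider the general problem of a linear regression model with correlation between regressors and the error term:
\[
\begin{aligned}
y &= X\beta + \varepsilon \\
\varepsilon &\sim iid(0, I_n\sigma^2) \\
E(X'\varepsilon) &\neq 0.
\end{aligned}
\]
The present case of a structural equation from a system of equations fits into this notation, but so do other problems, such as measurement error or lagged dependent variables with autocorrelated errors. Consider some matrix $W$ which is formed of variables uncorrelated with $\varepsilon$. This matrix defines a projection matrix
\[
P_W = W(W'W)^{-1}W'
\]
so that anything that is projected onto the space spanned by $W$ will be uncorrelated with $\varepsilon$, by the definition of $W$. Transforming the model with this projection matrix we get
\[
P_W y = P_W X\beta + P_W\varepsilon
\]
or
\[
y^* = X^*\beta + \varepsilon^*
\]
Now we have that $\varepsilon^*$ and $X^*$ are uncorrelated, since this is simply
\[
\begin{aligned}
E(X^{*\prime}\varepsilon^*) &= E(X'P_W'P_W\varepsilon) \\
&= E(X'P_W\varepsilon)
\end{aligned}
\]
and
\[
P_W X = W(W'W)^{-1}W'X
\]
is the fitted value from a regression of $X$ on $W$. This is a linear combination of the columns of $W$, so it must be uncorrelated with $\varepsilon$. This implies that applying OLS to the model
\[
y^* = X^*\beta + \varepsilon^*
\]
will lead to a consistent estimator, given a few more assumptions. This is the generalized instrumental variables estimator. $W$ is known as the matrix of instruments. The estimator is
\[
\hat\beta_{IV} = (X'P_W X)^{-1}X'P_W y
\]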
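from which we obtain
\[
\begin{aligned}
\hat\beta_{IV} &= (X'P_W X)^{-1}X'P_W(X\beta + \varepsilon) \\
&= \beta + (X'P_W X)^{-1}X'P_W\varepsilon
\end{aligned}
\]
so
\[
\begin{aligned}
\hat\beta_{IV} - \beta &= (X'P_W X)^{-1}X'P_W\varepsilon \\
&= \left( X'W(W'W)^{-1}W'X \right)^{-1} X'W(W'W)^{-1}W'\varepsilon
\end{aligned}
\]
Now we can introduce factors of $n$ to get
\[
\hat\beta_{IV} - \beta = \left( \left(\frac{X'W}{n}\right) \left(\frac{W'W}{n}\right)^{-1} \left(\frac{W'X}{n}\right) \right)^{-1} \left(\frac{X'W}{n}\right) \left(\frac{W'W}{n}\right)^{-1} \left(\frac{W'\varepsilon}{n}\right)
\]
Assuming that each of the terms with an $n$ in the denominator satisfies a LLN, so that

$\frac{W'W}{n} \overset{p}{\rightarrow} Q_{WW}$, a finite pd matrix

$\frac{X'W}{n} \overset{p}{\rightarrow} Q_{XW}$, a finite matrix with rank $K$ ($= cols(X)$)

$\frac{W'\varepsilon}{n} \overset{p}{\rightarrow} 0$

then the plim of the rhs is zero. This last term has plim 0 since we assume that $W$ and $\varepsilon$ are uncorrelated, e.g.,
\[
E(W_t'\varepsilon_t) = 0 .
\]
Given these assumptions the IV estimator is consistent:
\[
\hat\beta_{IV} \overset{p}{\rightarrow} \beta .
\]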
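Furthermore, scaling by $\sqrt{n}$, we have
\[
\sqrt{n}\left( \hat\beta_{IV} - \beta \right) = \left( \left(\frac{X'W}{n}\right) \left(\frac{W'W}{n}\right)^{-1} \left(\frac{W'X}{n}\right) \right)^{-1} \left(\frac{X'W}{n}\right) \left(\frac{W'W}{n}\right)^{-1} \left(\frac{W'\varepsilon}{\sqrt{n}}\right)
\]
Assuming that the far right term satisfies a CLT, so that
\[
\frac{W'\varepsilon}{\sqrt{n}} \overset{d}{\rightarrow} N(0, Q_{WW}\sigma^2)
\]
then we get
\[
\sqrt{n}\left( \hat\beta_{IV} - \beta \right) \overset{d}{\rightarrow} N\left( 0, (Q_{XW}Q_{WW}^{-1}Q_{XW}')^{-1}\sigma^2 \right)
\]
The estimators for $Q_{XW}$ and $Q_{WW}$ are the obvious ones. An estimator for $\sigma^2$ is
\[
\hat\sigma_{IV}^2 = \frac{1}{n}\left( y - X\hat\beta_{IV} \right)'\left( y - X\hat\beta_{IV} \right) .
\]
This estimator is consistent following the proof of consistency of the OLS estimator of $\sigma^2$, when the classical assumptions hold.

The formula used to estimate the variance of $\hat\beta_{IV}$ is
\[
\hat V(\hat\beta_{IV}) = \left( (X'W)(W'W)^{-1}(W'X) \right)^{-1}\hat\sigma_{IV}^2
\]
The IV estimator is

1. Consistent

2. Asymptotically normally distributed

3. Biased in general, since even though $E(X'P_W\varepsilon) = 0$, $E\left[ (X'P_W X)^{-1}X'P_W\varepsilon \right]$ may not be zero, since $(X'P_W X)^{-1}$ and $X'P_W\varepsilon$ are not independent.

An important point is that the asymptotic distribution of $\hat\beta_{IV}$ depends upon $Q_{XW}$ and $Q_{WW}$, and these depend upon the choice of $W$. The choice of instruments influences the efficiency of the estimator.

When we have two sets of instruments, $W_1$ and $W_2$, such that $W_1 \subset W_2$, then the IV estimator using $W_2$ is at least as efficient asymptotically as the estimator that uses $W_1$. More instruments leads to more asymptotically efficient estimation, in general.

There are special cases where there is no gain (simultaneous equations is an example of this, as we'll see).

The penalty for indiscriminate use of instruments is that the small sample bias of the IV estimator rises as the number of instruments increases. The reason for this is that $P_W X$ becomes closer and closer to $X$ itself as the number of instruments increases.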
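In the context of IV estimation, identification requires that the limiting covariance of the IV estimator be positive definite and that $\text{plim}\,\frac{1}{n}W'\varepsilon = 0$. This matrix is
\[
V_\infty(\hat\beta_{IV}) = (Q_{XW}Q_{WW}^{-1}Q_{XW}')^{-1}\sigma^2
\]
The necessary and sufficient condition for identification is simply that this matrix be positive definite, and that the instruments be (asymptotically) uncorrelated with $\varepsilon$.

For this matrix to be positive definite, we need that the conditions noted above hold: $Q_{WW}$ must be positive definite and $Q_{XW}$ must be of full rank ($K$).

These identification conditions are not that intuitive nor is it very obvious how to check them.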
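If we use IV estimation for a single equation of the system, the equation can be written as
\[
y = Z\delta + \varepsilon
\]
where
\[
Z = \begin{bmatrix} Y_1 & X_1 \end{bmatrix}
\]
Notation:

Let $K$ be the total number of weakly exogenous variables.

Let $K^* = cols(X_1)$ be the number of included exogs, and let $K^{**} = K - K^*$ be the number of excluded exogs (in this equation).

Let $G^* = cols(Y_1) + 1$ be the total number of included endogs, and let $G^{**} = G - G^*$ be the number of excluded endogs.

Using this notation, consider the selection of instruments.

Now the $X_1$ are weakly exogenous and can serve as their own instruments.

It turns out that $X$ exhausts the set of possible instruments, in that if the variables in $X$ don't lead to an identified model then no other instruments will identify the model either. Assuming this is true (we'll prove it in a moment), then a necessary condition for identification is that $cols(X_2) \geq cols(Y_1)$, since if not then at least one instrument must be used twice, so $W$ will not have full column rank.

This is the order condition for identification in a set of simultaneous equations. When the only identifying information is exclusion restrictions on the variables that enter an equation, then the number of excluded exogs must be greater than or equal to the number of included endogs, minus 1 (the normalized lhs endog), e.g.,
\[
K^{**} \geq G^* - 1
\]
To show that this is in fact a necessary condition consider some arbitrary set of instruments $W$. A necessary condition for identification is that
\[
\rho\left( \text{plim}\,\frac{1}{n}W'Z \right) = K^* + G^* - 1
\]
where
\[
Z = \begin{bmatrix} Y_1 & X_1 \end{bmatrix}
\]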
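Recall that we've partitioned the model
\[
Y\Gamma = XB + E
\]
as
\[
Y = \begin{bmatrix} y & Y_1 & Y_2 \end{bmatrix}, \qquad X = \begin{bmatrix} X_1 & X_2 \end{bmatrix}
\]
Given the reduced form
\[
Y = X\Pi + V
\]
we can write the reduced form using the same partition
\[
\begin{bmatrix} y & Y_1 & Y_2 \end{bmatrix} = \begin{bmatrix} X_1 & X_2 \end{bmatrix} \begin{bmatrix} \pi_{11} & \Pi_{12} & \Pi_{13} \\ \pi_{21} & \Pi_{22} & \Pi_{23} \end{bmatrix} + \begin{bmatrix} v & V_1 & V_2 \end{bmatrix}
\]
so we have
\[
Y_1 = X_1\Pi_{12} + X_2\Pi_{22} + V_1
\]
so
\[
\frac{1}{n}W'Z = \frac{1}{n}W'\begin{bmatrix} X_1\Pi_{12} + X_2\Pi_{22} + V_1 & X_1 \end{bmatrix}
\]
Because the $W$'s are uncorrelated with the $V_1$'s, by assumption, the cross product between $W$ and $V_1$ converges in probability to zero, so
\[
\text{plim}\,\frac{1}{n}W'Z = \text{plim}\,\frac{1}{n}W'\begin{bmatrix} X_1\Pi_{12} + X_2\Pi_{22} & X_1 \end{bmatrix}
\]
Since the far rhs term is formed only of linear combinations of columns of $X$, the rank of this matrix can never be greater than $K$, regardless of the choice of instruments. If $Z$ has more than $K$ columns, then it is not of full column rank. When $Z$ has more than $K$ columns we have
\[
G^* - 1 + K^* > K
\]
or, noting that $K^{**} = K - K^*$,
\[
G^* - 1 > K^{**}
\]
In this case, the limiting matrix is not of full column rank, and the identification condition fails.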
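The counting rule is easy to mechanize. The following sketch classifies an equation by the order condition $K^{**} \geq G^* - 1$; the function name and the wording of the returned messages are illustrative assumptions. It only checks the necessary condition, so a passing result still depends on the rank (sufficient) condition.

```python
def order_condition(n_excluded_exogs, n_included_endogs):
    """Classify an equation using the order condition K** >= G* - 1.

    n_excluded_exogs  : K**, exogs in the system that do not enter the equation
    n_included_endogs : G*, endogs in the equation (including the lhs variable)
    """
    surplus = n_excluded_exogs - (n_included_endogs - 1)
    if surplus < 0:
        return "not identified (order condition fails)"
    if surplus == 0:
        return "exactly identified, if the rank condition holds"
    return "overidentified by %d restriction(s), if the rank condition holds" % surplus

# Supply equation of the supply-demand example: y_t is excluded, q_t and p_t appear
supply_status = order_condition(n_excluded_exogs=1, n_included_endogs=2)
# Demand equation: no exog is excluded, q_t and p_t appear
demand_status = order_condition(n_excluded_exogs=0, n_included_endogs=2)
# An equation with 5 excluded exogs and 3 included endogs
big_status = order_condition(n_excluded_exogs=5, n_included_endogs=3)

print(supply_status)
print(demand_status)
print(big_status)
```

In the supply-demand example, the supply equation passes the order condition with no restrictions to spare, while the demand equation fails it because no exogenous variable is excluded.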
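Turning to sufficient conditions, the model is
\[
\begin{aligned}
Y_t'\Gamma &= X_t'B + E_t' \\
V(E_t) &= \Sigma
\end{aligned}
\]
This leads to the reduced form
\[
\begin{aligned}
Y_t' &= X_t'B\Gamma^{-1} + E_t'\Gamma^{-1} \\
&= X_t'\Pi + V_t' \\
V(V_t) &= \left(\Gamma^{-1}\right)'\Sigma\Gamma^{-1} \\
&= \Omega
\end{aligned}
\]
The reduced form parameters are consistently estimable, but none of them are known a priori, and there are no restrictions on their values. The problem is that more than one structural form has the same reduced form, so knowledge of the reduced form parameters alone isn't enough to determine the structural parameters. To see this, consider the model
\[
\begin{aligned}
Y_t'\Gamma F &= X_t'BF + E_t'F \\
V(F'E_t) &= F'\Sigma F
\end{aligned}
\]
where $F$ is some arbitrary nonsingular $G \times G$ matrix. The rf of this new model is
\[
\begin{aligned}
Y_t' &= X_t'BF(\Gamma F)^{-1} + E_t'F(\Gamma F)^{-1} \\
&= X_t'BFF^{-1}\Gamma^{-1} + E_t'FF^{-1}\Gamma^{-1} \\
&= X_t'B\Gamma^{-1} + E_t'\Gamma^{-1} \\
&= X_t'\Pi + V_t'
\end{aligned}
\]
Likewise, the covariance of the rf of the transformed model is
\[
V\left( E_t'F(\Gamma F)^{-1} \right) = V\left( E_t'\Gamma^{-1} \right) = \Omega
\]
Since the two structural forms lead to the same rf, and the rf is all that is directly estimable, the models are said to be observationally equivalent. What we need for identification are restrictions on $\Gamma$ and $B$ such that the only admissible $F$ is an identity matrix (if all of the equations are to be identified). Take the coefficient matrices as partitioned before:
\[
\begin{bmatrix} \Gamma \\ B \end{bmatrix} = \begin{bmatrix} 1 & \Gamma_{12} \\ -\gamma_1 & \Gamma_{22} \\ 0 & \Gamma_{32} \\ \beta_1 & B_{12} \\ 0 & B_{22} \end{bmatrix}
\]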
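The coefficients of the first equation of the transformed model are simply these coefficients multiplied by the first column of $F$. This gives
\[
\begin{bmatrix} \Gamma \\ B \end{bmatrix} \begin{bmatrix} f_{11} \\ F_2 \end{bmatrix} = \begin{bmatrix} 1 & \Gamma_{12} \\ -\gamma_1 & \Gamma_{22} \\ 0 & \Gamma_{32} \\ \beta_1 & B_{12} \\ 0 & B_{22} \end{bmatrix} \begin{bmatrix} f_{11} \\ F_2 \end{bmatrix}
\]
For identification of the first equation we need that there be enough restrictions so that the only admissible
\[
\begin{bmatrix} f_{11} \\ F_2 \end{bmatrix}
\]
be the leading column of an identity matrix, so that
\[
\begin{bmatrix} 1 & \Gamma_{12} \\ -\gamma_1 & \Gamma_{22} \\ 0 & \Gamma_{32} \\ \beta_1 & B_{12} \\ 0 & B_{22} \end{bmatrix} \begin{bmatrix} f_{11} \\ F_2 \end{bmatrix} = \begin{bmatrix} 1 \\ -\gamma_1 \\ 0 \\ \beta_1 \\ 0 \end{bmatrix}
\]
Note that the third and fifth rows are
\[
\begin{bmatrix} \Gamma_{32} \\ B_{22} \end{bmatrix} F_2 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
\]
Supposing that the leading matrix is of full column rank, e.g.,
\[
\rho\left( \begin{bmatrix} \Gamma_{32} \\ B_{22} \end{bmatrix} \right) = cols\left( \begin{bmatrix} \Gamma_{32} \\ B_{22} \end{bmatrix} \right) = G - 1
\]
then the only way this can hold, without additional restrictions on the model's parameters, is if $F_2$ is a vector of zeros. Given that $F_2$ is a vector of zeros, then the first equation
\[
\begin{bmatrix} 1 & \Gamma_{12} \end{bmatrix} \begin{bmatrix} f_{11} \\ F_2 \end{bmatrix} = 1 \Rightarrow f_{11} = 1
\]
Therefore, as long as
\[
\rho\left( \begin{bmatrix} \Gamma_{32} \\ B_{22} \end{bmatrix} \right) = G - 1
\]
then
\[
\begin{bmatrix} f_{11} \\ F_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 0_{G-1} \end{bmatrix}
\]
The first equation is identified in this case, so the condition is sufficient for identification. It is also necessary, since the condition implies that this submatrix must have at least $G - 1$ rows. Since this matrix has
\[
G^{**} + K^{**} = G - G^* + K^{**}
\]
rows, we obtain
\[
G - G^* + K^{**} \geq G - 1
\]
or
\[
K^{**} \geq G^* - 1
\]
which is the previously derived necessary condition.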
The above result is fairly intuitive (draw picture here). The necessary condition ensures that there are enough variables not in the equation of interest to potentially move the other equations, so as to trace out the equation of interest. The sufficient condition ensures that those other equations in fact do move around as the variables change their values. Some points:

When an equation has $K^{**} = G^* - 1$, it is exactly identified, in that omission of an identifying restriction is not possible without losing consistency.

When $K^{**} > G^* - 1$, the equation is overidentified, since one could drop a restriction and still retain consistency. Overidentifying restrictions are therefore testable. When an equation is overidentified we have more instruments than are strictly necessary for consistent estimation. Since estimation by IV with more instruments is more efficient asymptotically, one should employ overidentifying restrictions if one is confident that they're true.

We can repeat this partition for each equation in the system, to see which equations are identified and which aren't.

These results are valid assuming that the only identifying information comes from knowing which variables appear in which equations, e.g., by exclusion restrictions, and through the use of a normalization. There are other sorts of identifying information that can be used. These include

1. Cross equation restrictions

2. Additional restrictions on parameters within equations (as in the Klein model discussed below)

3. Restrictions on the covariance matrix of the errors

4. Nonlinearities in variables

When these sorts of information are available, the above conditions aren't necessary for identification, though they are of course still sufficient.
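To give an example of how other information can be used, consider the model
\[
Y\Gamma = XB + E
\]
where $\Gamma$ is an upper triangular matrix with 1's on the main diagonal. This is a triangular system of equations. In this case, the first equation is
\[
y_1 = XB_{\cdot 1} + E_{\cdot 1}
\]
Since only exogs appear on the rhs, this equation is identified.

The second equation is
\[
y_2 = -\gamma_{21}y_1 + XB_{\cdot 2} + E_{\cdot 2}
\]
This equation has $K^{**} = 0$ excluded exogs, and $G^* = 2$ included endogs, so it fails the order (necessary) condition for identification.

However, suppose that we have the restriction $\Sigma_{21} = 0$, so that the first and second structural errors are uncorrelated. In this case
\[
E(y_{1t}\varepsilon_{2t}) = E\left\{ (X_t'B_{\cdot 1} + \varepsilon_{1t})\varepsilon_{2t} \right\} = 0
\]
so there's no problem of simultaneity. If the entire $\Sigma$ matrix is diagonal, then following the same logic, all of the equations are identified. This is known as a fully recursive model.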
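To give an example of determining identification status, consider the following macro model (Klein's Model 1):
\[
\begin{aligned}
\text{Consumption:} \quad C_t &= \alpha_0 + \alpha_1 P_t + \alpha_2 P_{t-1} + \alpha_3(W_t^p + W_t^g) + \varepsilon_{1t} \\
\text{Investment:} \quad I_t &= \beta_0 + \beta_1 P_t + \beta_2 P_{t-1} + \beta_3 K_{t-1} + \varepsilon_{2t} \\
\text{Private Wages:} \quad W_t^p &= \gamma_0 + \gamma_1 X_t + \gamma_2 X_{t-1} + \gamma_3 A_t + \varepsilon_{3t} \\
\text{Output:} \quad X_t &= C_t + I_t + G_t \\
\text{Profits:} \quad P_t &= X_t - T_t - W_t^p \\
\text{Capital Stock:} \quad K_t &= K_{t-1} + I_t
\end{aligned}
\]
\[
\begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \\ \varepsilon_{3t} \end{bmatrix} \sim IID\left( \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ & \sigma_{22} & \sigma_{23} \\ & & \sigma_{33} \end{bmatrix} \right)
\]
The other variables are the government wage bill, $W_t^g$, taxes, $T_t$, government nonwage spending, $G_t$, and a time trend, $A_t$. The endogenous variables are the lhs variables,
\[
Y_t' = \begin{bmatrix} C_t & I_t & W_t^p & X_t & P_t & K_t \end{bmatrix}
\]
and the predetermined variables are all others:
\[
X_t' = \begin{bmatrix} 1 & W_t^g & G_t & T_t & A_t & P_{t-1} & K_{t-1} & X_{t-1} \end{bmatrix} .
\]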
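The model assumes that the errors of the equations are contemporaneously correlated, but nonautocorrelated. The model written as
\[
Y\Gamma = XB + E
\]
gives
\[
\Gamma = \begin{bmatrix}
1 & 0 & 0 & -1 & 0 & 0 \\
0 & 1 & 0 & -1 & 0 & -1 \\
-\alpha_3 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & -\gamma_1 & 1 & -1 & 0 \\
-\alpha_1 & -\beta_1 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
\]
\[
B = \begin{bmatrix}
\alpha_0 & \beta_0 & \gamma_0 & 0 & 0 & 0 \\
\alpha_3 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 \\
0 & 0 & \gamma_3 & 0 & 0 & 0 \\
\alpha_2 & \beta_2 & 0 & 0 & 0 & 0 \\
0 & \beta_3 & 0 & 0 & 0 & 1 \\
0 & 0 & \gamma_2 & 0 & 0 & 0
\end{bmatrix}
\]
To check the identification of the consumption equation, we need to extract $\Gamma_{32}$ and $B_{22}$, the submatrices of coefficients of endogs and exogs that don't appear in this equation. These are the rows that have zeros in the first column, and we need to drop the first column. We get
\[
\begin{bmatrix} \Gamma_{32} \\ B_{22} \end{bmatrix} = \begin{bmatrix}
1 & 0 & -1 & 0 & -1 \\
0 & -\gamma_1 & 1 & -1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & -1 & 0 \\
0 & \gamma_3 & 0 & 0 & 0 \\
\beta_3 & 0 & 0 & 0 & 1 \\
0 & \gamma_2 & 0 & 0 & 0
\end{bmatrix}
\]
We need to find a set of 5 rows of this matrix that gives a full-rank $5 \times 5$ matrix. For example, selecting rows 3, 4, 5, 6, and 7 we obtain the matrix
\[
A = \begin{bmatrix}
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & -1 & 0 \\
0 & \gamma_3 & 0 & 0 & 0 \\
\beta_3 & 0 & 0 & 0 & 1
\end{bmatrix}
\]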
This matrix is of full rank, so the sufficient condition for identification is met. Counting included endogs, $G^* = 3$, and counting excluded exogs, $K^{**} = 5$, so
\[
\begin{aligned}
K^{**} - L &= G^* - 1 \\
5 - L &= 3 - 1 \\
L &= 3
\end{aligned}
\]
The equation is over-identified by three restrictions, according to the counting rules, which are correct when the only identifying information is the exclusion restrictions. However, there is additional information in this case. Both $W_t^p$ and $W_t^g$ enter the consumption equation, and their coefficients are restricted to be the same. For this reason the consumption equation is in fact overidentified by four restrictions.
11.6 2SLS
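When we have no information regarding cross-equation restrictions or the structure of the error covariance matrix, one can estimate the parameters of a single equation of the system without regard to the other equations.

This isn't always efficient, as we'll see, but it has the advantage that misspecifications in other equations will not affect the consistency of the estimator of the parameters of the equation of interest.

Also, estimation of the equation won't be affected by identification problems in other equations.

The 2SLS estimator is very simple: in the first stage, each column of $Y_1$ is regressed on all the weakly exogenous variables in the system, e.g., the entire $X$ matrix. The fitted values are
\[
\begin{aligned}
\hat Y_1 &= X(X'X)^{-1}X'Y_1 \\
&= P_X Y_1 \\
&= X\hat\Pi_1
\end{aligned}
\]
Since these fitted values are the projection of $Y_1$ on the space spanned by $X$, and since any vector in this space is uncorrelated with $\varepsilon$ by assumption, $\hat Y_1$ is uncorrelated with $\varepsilon$. Since $\hat Y_1$ is simply the reduced-form prediction, it is correlated with $Y_1$. The only other requirement is that the instruments be linearly independent. This should be the case when the order condition is satisfied, since there are at least as many columns in $X_2$ as in $Y_1$ in this case.

The second stage substitutes $\hat Y_1$ in place of $Y_1$, and estimates by OLS. The original model is
\[
\begin{aligned}
y &= Y_1\gamma_1 + X_1\beta_1 + \varepsilon \\
&= Z\delta + \varepsilon
\end{aligned}
\]
and the second stage model is
\[
y = \hat Y_1\gamma_1 + X_1\beta_1 + \varepsilon .
\]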
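Since $X_1$ is in the space spanned by $X$, $P_X X_1 = X_1$, so we can write the second stage model as
\[
\begin{aligned}
y &= P_X Y_1\gamma_1 + P_X X_1\beta_1 + \varepsilon \\
&\equiv P_X Z\delta + \varepsilon
\end{aligned}
\]
The OLS estimator applied to this model is
\[
\hat\delta = (Z'P_X Z)^{-1}Z'P_X y
\]
which is exactly what we get if we estimate using IV, with the reduced form predictions of the endogs used as instruments. Note that if we define
\[
\begin{aligned}
\hat Z &= P_X Z \\
&= \begin{bmatrix} \hat Y_1 & X_1 \end{bmatrix}
\end{aligned}
\]
so that $\hat Z$ are the instruments for $Z$, then we can write
\[
\hat\delta = (\hat Z'Z)^{-1}\hat Z'y
\]
Important note: OLS on the transformed model can be used to calculate the 2SLS estimate of $\delta$, since we see that it's equivalent to IV using a particular set of instruments. However, the OLS covariance formula is not valid. We need to apply the IV covariance formula already seen above.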
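Actually, there is also a simplification of the general IV variance formula. Define
\[
\begin{aligned}
\hat Z &= P_X Z \\
&= \begin{bmatrix} \hat Y_1 & X_1 \end{bmatrix}
\end{aligned}
\]
The IV covariance estimator would ordinarily be
\[
\hat V(\hat\delta) = \left( Z'\hat Z \right)^{-1} \left( \hat Z'\hat Z \right) \left( \hat Z'Z \right)^{-1} \hat\sigma_{IV}^2
\]
However, looking at the last term in brackets
\[
\hat Z'Z = \begin{bmatrix} \hat Y_1 & X_1 \end{bmatrix}' \begin{bmatrix} Y_1 & X_1 \end{bmatrix} = \begin{bmatrix} Y_1'P_X Y_1 & Y_1'P_X X_1 \\ X_1'Y_1 & X_1'X_1 \end{bmatrix}
\]
but since $P_X$ is idempotent and since $P_X X = X$, we can write
\[
\begin{aligned}
\begin{bmatrix} \hat Y_1 & X_1 \end{bmatrix}' \begin{bmatrix} Y_1 & X_1 \end{bmatrix} &= \begin{bmatrix} Y_1'P_X P_X Y_1 & Y_1'P_X X_1 \\ X_1'P_X Y_1 & X_1'X_1 \end{bmatrix} \\
&= \begin{bmatrix} \hat Y_1 & X_1 \end{bmatrix}' \begin{bmatrix} \hat Y_1 & X_1 \end{bmatrix} \\
&= \hat Z'\hat Z
\end{aligned}
\]
Therefore, the second and last term in the variance formula cancel, so the 2SLS varcov estimator simplifies to
\[
\hat V(\hat\delta) = \left( Z'\hat Z \right)^{-1}\hat\sigma_{IV}^2
\]
which, following some algebra similar to the above, can also be written as
\[
\hat V(\hat\delta) = \left( \hat Z'\hat Z \right)^{-1}\hat\sigma_{IV}^2
\]
Finally, recall that though this is presented in terms of the first equation, it is general since any equation can be placed first.

Properties of 2SLS:

1. Consistent

2. Asymptotically normal

3. Biased when the mean exists (the existence of moments is a technical issue we won't go into here).

4. Asymptotically inefficient, except in special circumstances (more on this later).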
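The two-stage calculation can be sketched for the supply equation of the earlier supply-demand example, where price is the only endogenous regressor and income $y_t$ is the excluded exog. With one regressor per stage, simple-regression formulas suffice. The parameter values, seed and helper names below are illustrative assumptions; the equation is exactly identified, so 2SLS coincides with the simple IV estimator that uses $y_t$ as the instrument.

```python
import random

def cov(a, b):
    """Sample covariance (divisor n) of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / n

def simple_ols(x, z):
    """Intercept and slope from regressing z on a constant and x."""
    slope = cov(x, z) / cov(x, x)
    intercept = sum(z) / len(z) - slope * sum(x) / len(x)
    return intercept, slope

# Simulate the system: demand q = 10 - p + y + e1, supply q = 2 + p + e2
rng = random.Random(2)
income, price, quantity = [], [], []
for _ in range(20000):
    y = rng.gauss(0.0, 2.0)
    e1 = rng.gauss(0.0, 1.0)
    e2 = rng.gauss(0.0, 1.0)
    p = (10.0 - 2.0 + y + e1 - e2) / (1.0 - (-1.0))   # equilibrium price
    income.append(y)
    price.append(p)
    quantity.append(2.0 + 1.0 * p + e2)               # supply equation

# First stage: regress the endogenous regressor on all exogs (constant, income)
a_hat, b_hat = simple_ols(income, price)
price_hat = [a_hat + b_hat * y for y in income]

# Second stage: regress quantity on a constant and the fitted price
_, two_sls_slope = simple_ols(price_hat, quantity)

# Simple IV estimator with income as the instrument, and OLS for comparison
iv_slope = cov(income, quantity) / cov(income, price)
_, ols_slope = simple_ols(price, quantity)
print(two_sls_slope, iv_slope, ols_slope)
```

Under these assumptions the 2SLS and IV slopes agree to rounding error and are close to the true supply slope of 1, while the OLS slope is pulled toward 2/3. Note that standard errors taken from the second-stage OLS regression would not be valid; the IV covariance formula is needed.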
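The selection of which variables are endogs and which are exogs is part of the specification of the model. As such, there is room for error here: one might erroneously classify a variable as exog when it is in fact correlated with the error term. A general test for the specification of the model can be formulated as follows:

The IV estimator can be calculated by applying OLS to the transformed model, so the IV objective function at the minimized value is
\[
s(\hat\beta_{IV}) = \left( y - X\hat\beta_{IV} \right)'P_W\left( y - X\hat\beta_{IV} \right),
\]
but
\[
\begin{aligned}
\hat\varepsilon_{IV} &= y - X\hat\beta_{IV} \\
&= y - X(X'P_W X)^{-1}X'P_W y \\
&= \left( I - X(X'P_W X)^{-1}X'P_W \right)y \\
&= \left( I - X(X'P_W X)^{-1}X'P_W \right)(X\beta + \varepsilon) \\
&= A(X\beta + \varepsilon)
\end{aligned}
\]
where
\[
A \equiv I - X(X'P_W X)^{-1}X'P_W
\]
so
\[
s(\hat\beta_{IV}) = \left( \varepsilon' + \beta'X' \right)A'P_W A(X\beta + \varepsilon)
\]
Moreover, $A'P_W A$ is idempotent, as can be verified by multiplication:
\[
\begin{aligned}
A'P_W A &= \left( I - P_W X(X'P_W X)^{-1}X' \right)P_W\left( I - X(X'P_W X)^{-1}X'P_W \right) \\
&= \left( P_W - P_W X(X'P_W X)^{-1}X'P_W \right)\left( P_W - P_W X(X'P_W X)^{-1}X'P_W \right) \\
&= \left( I - P_W X(X'P_W X)^{-1}X' \right)P_W .
\end{aligned}
\]
Furthermore, $A$ is orthogonal to $X$:
\[
\begin{aligned}
AX &= \left( I - X(X'P_W X)^{-1}X'P_W \right)X \\
&= X - X \\
&= 0
\end{aligned}
\]
so
\[
s(\hat\beta_{IV}) = \varepsilon'A'P_W A\varepsilon
\]
Supposing the $\varepsilon$ are normally distributed, with variance $\sigma^2$, then the random variable
\[
\frac{s(\hat\beta_{IV})}{\sigma^2} = \frac{\varepsilon'A'P_W A\varepsilon}{\sigma^2}
\]
is a quadratic form of a $N(0, 1)$ random variable with an idempotent matrix in the middle, so
\[
\frac{s(\hat\beta_{IV})}{\sigma^2} \sim \chi^2\left( \rho(A'P_W A) \right)
\]
This isn't available, since we need to estimate $\sigma^2$. Substituting a consistent estimator,
\[
\frac{s(\hat\beta_{IV})}{\hat\sigma^2} \overset{a}{\sim} \chi^2\left( \rho(A'P_W A) \right)
\]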
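Even if the $\varepsilon$ aren't normally distributed, the asymptotic result still holds. The last thing we need to determine is the rank of the idempotent matrix. We have
\[
A'P_W A = P_W - P_W X(X'P_W X)^{-1}X'P_W
\]
so
\[
\begin{aligned}
\rho(A'P_W A) &= Tr\left( P_W - P_W X(X'P_W X)^{-1}X'P_W \right) \\
&= Tr\,W(W'W)^{-1}W' - K_X \\
&= Tr\,W'W(W'W)^{-1} - K_X \\
&= K_W - K_X
\end{aligned}
\]
where $K_W$ is the number of columns of $W$ and $K_X$ is the number of columns of $X$. The degrees of freedom of the test is simply the number of overidentifying restrictions: the number of instruments we have beyond the number that is strictly necessary for consistent estimation.

This test is an overall specification test: the joint null hypothesis is that the model is correctly specified and that the $W$ form valid instruments (e.g., that the variables classified as exogs really are uncorrelated with $\varepsilon$). Rejection can mean that either the model $y = Z\delta + \varepsilon$ is misspecified, or that there is correlation between $X$ and $\varepsilon$.

This is a particular case of the GMM criterion test, which is covered in the second half of the course.

Note that since
\[
\hat\varepsilon_{IV} = A\varepsilon
\]
and
\[
s(\hat\beta_{IV}) = \varepsilon'A'P_W A\varepsilon
\]
we can write
\[
\begin{aligned}
\frac{s(\hat\beta_{IV})}{\hat\sigma^2} &= \frac{\left( \hat\varepsilon'W(W'W)^{-1}W' \right)\left( W(W'W)^{-1}W'\hat\varepsilon \right)}{\hat\varepsilon'\hat\varepsilon/n} \\
&= n\left( RSS_{\hat\varepsilon_{IV}|W}/TSS_{\hat\varepsilon_{IV}} \right) \\
&= nR_u^2
\end{aligned}
\]
where $R_u^2$ is the uncentered $R^2$ from a regression of the IV residuals on all of the instruments $W$. This is a convenient way to calculate the test statistic.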
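On an aside, consider IV estimation of a just-identified model, using the standard notation
\[
y = X\beta + \varepsilon
\]
and $W$ is the matrix of instruments. If we have exact identification then $cols(W) = cols(X)$, so $W'X$ is a square matrix. The transformed model is
\[
P_W y = P_W X\beta + P_W\varepsilon
\]
and the first-order conditions are
\[
X'P_W(y - X\hat\beta_{IV}) = 0
\]
The IV estimator is
\[
\hat\beta_{IV} = \left( X'P_W X \right)^{-1}X'P_W y
\]
Considering the inverse here
\[
\begin{aligned}
\left( X'P_W X \right)^{-1} &= \left( X'W(W'W)^{-1}W'X \right)^{-1} \\
&= (W'X)^{-1}\left( X'W(W'W)^{-1} \right)^{-1} \\
&= (W'X)^{-1}(W'W)\left( X'W \right)^{-1}
\end{aligned}
\]
Now multiplying this by $X'P_W y$, we obtain
\[
\begin{aligned}
\hat\beta_{IV} &= (W'X)^{-1}(W'W)\left( X'W \right)^{-1}X'P_W y \\
&= (W'X)^{-1}(W'W)\left( X'W \right)^{-1}X'W(W'W)^{-1}W'y \\
&= (W'X)^{-1}W'y
\end{aligned}
\]
The objective function for the generalized IV estimator is
\[
\begin{aligned}
s(\hat\beta_{IV}) &= \left( y - X\hat\beta_{IV} \right)'P_W\left( y - X\hat\beta_{IV} \right) \\
&= y'P_W\left( y - X\hat\beta_{IV} \right) - \hat\beta_{IV}'X'P_W\left( y - X\hat\beta_{IV} \right) \\
&= y'P_W\left( y - X\hat\beta_{IV} \right) - \hat\beta_{IV}'X'P_W y + \hat\beta_{IV}'X'P_W X\hat\beta_{IV} \\
&= y'P_W\left( y - X\hat\beta_{IV} \right) - \hat\beta_{IV}'\left( X'P_W y - X'P_W X\hat\beta_{IV} \right) \\
&= y'P_W\left( y - X\hat\beta_{IV} \right)
\end{aligned}
\]
by the first-order conditions for generalized IV. However, when we're in the just identified case, this is
\[
\begin{aligned}
s(\hat\beta_{IV}) &= y'P_W\left( y - X(W'X)^{-1}W'y \right) \\
&= y'P_W\left( I - X(W'X)^{-1}W' \right)y \\
&= y'\left( W(W'W)^{-1}W' - W(W'W)^{-1}W'X(W'X)^{-1}W' \right)y \\
&= 0
\end{aligned}
\]
The value of the objective function of the IV estimator is zero in the just identified case. This makes sense, since we've already shown that the objective function after dividing by $\sigma^2$ is asymptotically $\chi^2$ with degrees of freedom equal to the number of overidentifying restrictions. In the present case, there are no overidentifying restrictions, so we have a $\chi^2(0)$ rv, which has mean 0 and variance 0, e.g., it's simply 0. This means we're not able to test the identifying restrictions in the case of exact identification.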
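Recall that the assumption is that
\[
\begin{aligned}
Y\Gamma &= XB + E \\
E(X'E) &= 0_{(K \times G)} \\
vec(E) &\sim N(0, \Psi)
\end{aligned}
\]
Since there is no autocorrelation of the $E_t$'s, and since the columns of $E$ are individually homoscedastic, then
\[
\Psi = \begin{bmatrix}
\sigma_{11}I_n & \sigma_{12}I_n & \cdots & \sigma_{1G}I_n \\
 & \sigma_{22}I_n & & \vdots \\
 & & \ddots & \vdots \\
\cdot & & & \sigma_{GG}I_n
\end{bmatrix} = \Sigma \otimes I_n
\]
This means that the structural equations are heteroscedastic and correlated with one another.

In general, ignoring this will lead to inefficient estimation, following the section on GLS. When equations are correlated with one another, estimation should account for the correlation in order to obtain efficiency.

Also, since the equations are correlated, information about one equation is implicitly information about all equations. Therefore, overidentification restrictions in any equation improve efficiency for all equations, even the just identified equations.

Single equation methods can't use these types of information, and are therefore inefficient (in general).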
11.8.1 3SLS
Note: It is easier and more practical to treat the 3SLS estimator as a generalized method of moments estimator (see Chapter 15). I no longer teach the following section, but it is retained for its possible historical interest. Another alternative is to use FIML (Subsection 11.8.2), if you are willing to make distributional assumptions on the errors. This is computationally feasible with modern computers.
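$$y_i = Y_i\gamma_1 + X_i\beta_1 + \varepsilon_i = Z_i\delta_i + \varepsilon_i$$
Grouping the $G$ equations together we get
$$\begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_G\end{bmatrix} =
\begin{bmatrix} Z_1 & 0 & \cdots & 0\\ 0 & Z_2 & & \vdots\\ \vdots & & \ddots & 0\\ 0 & \cdots & 0 & Z_G\end{bmatrix}
\begin{bmatrix}\delta_1\\ \delta_2\\ \vdots\\ \delta_G\end{bmatrix} +
\begin{bmatrix}\varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_G\end{bmatrix}$$
or
$$y = Z\delta + \varepsilon$$
where we already have that
$$E(\varepsilon\varepsilon') = \Psi = \Sigma\otimes I_n$$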
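The 3SLS estimator is just 2SLS combined with a GLS correction that takes advantage of the structure of $\Psi$. Define $\hat Z$ as
$$\hat Z = \begin{bmatrix}
X(X'X)^{-1}X'Z_1 & 0 & \cdots & 0\\
0 & X(X'X)^{-1}X'Z_2 & & \vdots\\
\vdots & & \ddots & 0\\
0 & \cdots & 0 & X(X'X)^{-1}X'Z_G
\end{bmatrix}
= \begin{bmatrix}
\hat Y_1\; X_1 & 0 & \cdots & 0\\
0 & \hat Y_2\; X_2 & & \vdots\\
\vdots & & \ddots & 0\\
0 & \cdots & 0 & \hat Y_G\; X_G
\end{bmatrix}$$
These instruments are simply the unrestricted rf predictions of the endogs, combined with the exogs. The distinction is that if the model is overidentified, then
$$\Pi = B\Gamma^{-1}$$
may be subject to some zero restrictions, depending on the restrictions on $\Gamma$ and $B$, and $\hat\Pi$ does not impose these restrictions. The 2SLS estimator would be
$$\hat\delta = (\hat Z'Z)^{-1}\hat Z'y$$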
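as can be verified by simple multiplication, and noting that the inverse of a block-diagonal matrix is just the matrix with the inverses of the blocks on the main diagonal. This IV estimator still ignores the covariance information. The natural extension is to add the GLS transformation, putting the inverse of the error covariance into the formula, which gives the 3SLS estimator
$$\hat\delta_{3SLS} = \left(\hat Z'(\Sigma\otimes I_n)^{-1}Z\right)^{-1}\hat Z'(\Sigma\otimes I_n)^{-1}y
= \left(\hat Z'(\Sigma^{-1}\otimes I_n)Z\right)^{-1}\hat Z'(\Sigma^{-1}\otimes I_n)y$$
This estimator requires knowledge of $\Sigma$. The obvious solution is to use an estimator based on the 2SLS residuals:
$$\hat\varepsilon_i = y_i - Z_i\hat\delta_{i,2SLS}$$
(note that this is calculated using $Z_i$, not $\hat Z_i$). Then the element $i,j$ of $\Sigma$ is estimated by
$$\hat\sigma_{ij} = \frac{\hat\varepsilon_i'\hat\varepsilon_j}{n}$$
Substitute $\hat\Sigma$ into the formula above to get the feasible 3SLS estimator.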
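Analogously to what we did in the case of 2SLS, the asymptotic distribution of the 3SLS estimator can be shown to be
$$\sqrt n\left(\hat\delta_{3SLS} - \delta\right) \overset{a}{\sim} N\left(0,\ \lim_{n\to\infty} E\left\{\left(\frac{\hat Z'(\Sigma\otimes I_n)^{-1}\hat Z}{n}\right)^{-1}\right\}\right)$$
A formula for estimating the variance of the 3SLS estimator in finite samples (cancelling out the powers of $n$) is
$$\hat V\left(\hat\delta_{3SLS}\right) = \left(\hat Z'\left(\hat\Sigma^{-1}\otimes I_n\right)\hat Z\right)^{-1}$$
This is analogous to the 2SLS formula in equation (??), combined with the GLS correction.
In the case that all equations are just identified, 3SLS is numerically equivalent to 2SLS. Proving this is easiest if we use a GMM interpretation of 2SLS and 3SLS. GMM is presented in the next econometrics course. For now, take it on faith.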
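The rf parameter estimator $\hat\Pi$ is calculated equation by equation using OLS:
$$\hat\Pi = (X'X)^{-1}X'Y$$
which is simply
$$\hat\Pi = (X'X)^{-1}X'\begin{bmatrix} y_1 & y_2 & \cdots & y_G\end{bmatrix}$$
that is, OLS equation by equation, using all the exogs in the estimation of each column of $\Pi$.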
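It may seem odd that we use OLS on the reduced form, since the rf equations are correlated:
$$Y_t' = X_t'B\Gamma^{-1} + E_t'\Gamma^{-1} = X_t'\Pi + V_t'$$
and
$$V_t = \left(\Gamma^{-1}\right)'E_t \sim N\left(0, \left(\Gamma^{-1}\right)'\Sigma\Gamma^{-1}\right),\ \forall t$$
Let this var-cov matrix be indicated by
$$\Xi = \left(\Gamma^{-1}\right)'\Sigma\Gamma^{-1}$$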
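OLS equation by equation on the reduced form amounts to estimating the stacked system
$$\begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_G\end{bmatrix} =
\begin{bmatrix} X & 0 & \cdots & 0\\ 0 & X & & \vdots\\ \vdots & & \ddots & 0\\ 0 & \cdots & 0 & X\end{bmatrix}
\begin{bmatrix}\pi_1\\ \pi_2\\ \vdots\\ \pi_G\end{bmatrix} +
\begin{bmatrix} v_1\\ v_2\\ \vdots\\ v_G\end{bmatrix}$$
where $y_i$ is the $n\times 1$ vector of observations of the $i$th endog, $X$ is the entire $n\times K$ matrix of exogs, $\pi_i$ is the $i$th column of $\Pi$, and $v_i$ is the $i$th column of $V$. Use the notation
$$y = \mathbf{X}\pi + v$$
to indicate the pooled model. Following this notation, the error covariance matrix is
$$V(v) = \Xi\otimes I_n$$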
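This is a special case of a type of model known as a set of seemingly unrelated equations (SUR), since the parameter vector of each equation is different. The equations are contemporaneously correlated, however. The general case would have a different $X_i$ for each equation.
Note that each equation of the system individually satisfies the classical assumptions. However, pooled estimation using the GLS correction is more efficient, since equation-by-equation estimation is equivalent to pooled estimation (the stacked regressor matrix is block diagonal) that ignores the covariance information.
In the special case that all the $X_i$ are the same, which is true in the present case of estimation of the rf parameters, SUR $\equiv$ OLS. To show this, note that in this case the stacked regressor matrix is $\mathbf{X} = I_G\otimes X$. Using the rules
1. $(A\otimes B)^{-1} = (A^{-1}\otimes B^{-1})$
2. $(A\otimes B)' = (A'\otimes B')$ and
3. $(A\otimes B)(C\otimes D) = (AC\otimes BD)$,
we get
$$\begin{aligned}
\hat\pi_{SUR} &= \left((I_G\otimes X)'(\Xi\otimes I_n)^{-1}(I_G\otimes X)\right)^{-1}(I_G\otimes X)'(\Xi\otimes I_n)^{-1}y\\
&= \left(\left(\Xi^{-1}\otimes X'\right)(I_G\otimes X)\right)^{-1}\left(\Xi^{-1}\otimes X'\right)y\\
&= \left(\Xi\otimes(X'X)^{-1}\right)\left(\Xi^{-1}\otimes X'\right)y\\
&= \left[I_G\otimes(X'X)^{-1}X'\right]y\\
&= \begin{bmatrix}\hat\pi_1\\ \hat\pi_2\\ \vdots\\ \hat\pi_G\end{bmatrix}
\end{aligned}$$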
See
11.8.2 FIML
Full information maximum likelihood is an alternative estimation method. FIML will be asymptotically efficient, since ML estimators based on a given information set are asymptotically efficient w.r.t. all other estimators that use the same information set, and in the case of the full-information ML estimator we use the entire information set. The 2SLS and 3SLS estimators don't require distributional assumptions, while FIML of course does. Our model is, recall
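$$Y_t'\Gamma = X_t'B + E_t'$$
$$E_t \sim N(0,\Sigma),\ \forall t$$
$$E(E_tE_s') = 0,\ t\neq s$$
The joint normality of $E_t$ means that the density for $E_t$ is the multivariate normal, which is
$$(2\pi)^{-G/2}\left(\det\Sigma^{-1}\right)^{1/2}\exp\left(-\frac12 E_t'\Sigma^{-1}E_t\right)$$
The transformation from $E_t$ to $Y_t$ requires the Jacobian
$$\left|\det\frac{dE_t}{dY_t'}\right| = |\det\Gamma|$$
so the density for $Y_t$ is
$$(2\pi)^{-G/2}\,|\det\Gamma|\left(\det\Sigma^{-1}\right)^{1/2}\exp\left(-\frac12\left(Y_t'\Gamma - X_t'B\right)\Sigma^{-1}\left(Y_t'\Gamma - X_t'B\right)'\right)$$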
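Given the assumption of independence over time, the joint log-likelihood function is
$$\ln L(B,\Gamma,\Sigma) = -\frac{nG}{2}\ln(2\pi) + n\ln(|\det\Gamma|) + \frac n2\ln\det\Sigma^{-1} - \frac12\sum_{t=1}^n\left(Y_t'\Gamma - X_t'B\right)\Sigma^{-1}\left(Y_t'\Gamma - X_t'B\right)'$$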
This is a nonlinear in the parameters objective function. Maximization of this can be done using iterative numeric methods. We'll see how to do this in the next section.
It turns out that the asymptotic distributions of 3SLS and FIML are the same, assuming normality of the errors.
One can calculate the FIML estimator by iterating the 3SLS estimator, thus avoiding the use of a nonlinear optimizer. The steps are
1. Calculate $\hat\Gamma_{3SLS}$ and $\hat B_{3SLS}$ as normal.
2. Calculate $\hat\Pi = \hat B_{3SLS}\hat\Gamma_{3SLS}^{-1}$. This is new, we didn't estimate $\Pi$ in this way before. This estimator may have some zeros in it. When Greene says iterated 3SLS doesn't lead to FIML, he means this for a procedure that doesn't update $\hat\Pi$, but only updates $\hat\Sigma$, $\hat B$ and $\hat\Gamma$. If you update $\hat\Pi$ you do converge to FIML.
3. Calculate the instruments $\hat Y = X\hat\Pi$ and calculate $\hat\Sigma$ using $\hat\Gamma$ and $\hat B$ to get the estimated errors.
4. Apply 3SLS using these new instruments and the estimate of $\Sigma$.
5. Repeat steps 2-4 until there is no change in the parameters.
FIML is fully efficient, since it's an ML estimator that uses all information. This implies that 3SLS is fully efficient when the errors are normally distributed. Also, if each equation is just identified and the errors are normal, then 2SLS will be fully efficient, since in this case 2SLS $\equiv$ 3SLS.
When the errors aren't normally distributed, the likelihood function is of course different than what's written above.
CONSUMPTION EQUATION
*******************************************************
2SLS estimation results
Observations 21
R-squared 0.976711
Sigma-squared 1.044059
                  estimate    st.err.    t-stat.    p-value
Constant            16.555      1.321     12.534      0.000
Profits              0.017      0.118      0.147      0.885
Lagged Profits       0.216      0.107      2.016      0.060
Wages                0.810      0.040     20.129      0.000
*******************************************************
INVESTMENT EQUATION
*******************************************************
2SLS estimation results
Observations 21
R-squared 0.884884
Sigma-squared 1.383184
                  estimate    st.err.    t-stat.    p-value
Constant            20.278      7.543      2.688      0.016
Profits              0.150      0.173      0.867      0.398
Lagged Profits       0.616      0.163      3.784      0.001
Lagged Capital      -0.158      0.036     -4.368      0.000
*******************************************************
WAGES EQUATION
*******************************************************
2SLS estimation results
Observations 21
R-squared 0.987414
Sigma-squared 0.476427
                  estimate    st.err.    t-stat.    p-value
Constant             1.500      1.148      1.307      0.209
Output               0.439      0.036     12.316      0.000
Lagged Output        0.147      0.039      3.777      0.002
Trend                0.130      0.029      4.475      0.000
*******************************************************
The above results are not valid (specifically, they are inconsistent) if the errors are autocorrelated, since lagged endogenous variables will not be valid instruments in that case. You might consider eliminating the lagged endogenous variables as instruments, and re-estimating by 2SLS, to obtain consistent parameter estimates in this more complex case. Standard errors will still be estimated inconsistently, unless one uses a Newey-West type covariance estimator. Food for thought...
Chapter 12
Introduction to the second half
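We'll begin with study of extremum estimators in general. Let $Z_n$ be the available data, based on a sample of size $n$.
Definition [Extremum estimator]: An extremum estimator $\hat\theta$ is the optimizing element of an objective function $s_n(Z_n,\theta)$ over a set $\Theta$.
We'll usually write the objective function suppressing the dependence on $Z_n$.
Example: Least squares, linear model. Let the d.g.p. be $y_t = x_t'\theta^0 + \varepsilon_t$, $t = 1, 2, \ldots, n$. Stacking observations vertically, $y_n = X_n\theta^0 + \varepsilon_n$, where $X_n = \begin{pmatrix} x_1 & x_2 & \cdots & x_n\end{pmatrix}'$. The least squares estimator is defined as
$$\hat\theta \equiv \arg\min_\Theta s_n(\theta) = (1/n)\left[y_n - X_n\theta\right]'\left[y_n - X_n\theta\right]$$
We readily find that $\hat\theta = (X'X)^{-1}X'y$.
Example: Maximum likelihood. Suppose that the continuous random variable $y_t \sim IIN(\theta^0, 1)$. The maximum likelihood estimator is defined as
$$\hat\theta \equiv \arg\max_\Theta L_n(\theta) = \prod_{t=1}^n (2\pi)^{-1/2}\exp\left(-\frac{(y_t-\theta)^2}{2}\right)$$
Because the logarithmic function is strictly increasing on $(0,\infty)$, maximization of the average logarithmic likelihood function is achieved at the same $\hat\theta$ as for the likelihood function:
$$\hat\theta \equiv \arg\max_\Theta s_n(\theta) = (1/n)\ln L_n(\theta) = -\frac12\ln 2\pi - (1/n)\sum_{t=1}^n\frac{(y_t-\theta)^2}{2}$$
Solution of the f.o.c. leads to the familiar result that $\hat\theta = \bar y$.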
MLE estimators are asymptotically efficient (Cramér-Rao lower bound, Theorem 3), supposing the strong distributional assumptions upon which they are based are true.
One can investigate the properties of an ML estimator supposing that the distributional assumptions are incorrect. This gives a quasi-ML estimator, which we'll study later.
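Example: Method of moments. Suppose we draw a random sample of $y_t$ from the $\chi^2(\theta^0)$ distribution. Here, $\theta^0$ is the parameter of interest. The first moment (expectation), $\mu_1$, of a random variable will in general be a function of the parameters of the distribution, i.e., $\mu_1(\theta^0)$.
$\mu_1 = \mu_1(\theta^0)$ is a moment-parameter equation.
In this example, the relationship is the identity function $\mu_1(\theta^0) = \theta^0$, though in general the relationship may be more complicated. The sample first moment is
$$\hat\mu_1 = \sum_{t=1}^n y_t/n.$$
Define
$$m_1(\theta) = \mu_1(\theta) - \hat\mu_1$$
The method of moments principle is to choose the estimator of the parameter to set the estimate of the population moment equal to the sample moment, i.e., $m_1(\hat\theta) \equiv 0$. Then the moment-parameter equation is inverted to solve for the parameter estimate.
In this case,
$$m_1(\hat\theta) = \hat\theta - \sum_{t=1}^n y_t/n = 0.$$
Since $\sum_{t=1}^n y_t/n \overset{p}{\to} \theta^0$ by the LLN, the estimator is consistent.
More on the method of moments: continuing with the above example, the variance of a $\chi^2(\theta^0)$ r.v. is
$$V(y_t) = E\left(y_t - \theta^0\right)^2 = 2\theta^0.$$
Define
$$m_2(\theta) = 2\theta - \frac{\sum_{t=1}^n(y_t - \bar y)^2}{n}$$
The MM estimator would set
$$m_2(\hat\theta) = 2\hat\theta - \frac{\sum_{t=1}^n(y_t - \bar y)^2}{n} \equiv 0.$$
Again, by the LLN, the sample variance is consistent for the true variance, that is,
$$\frac{\sum_{t=1}^n(y_t - \bar y)^2}{n} \overset{p}{\to} 2\theta^0.$$
So,
$$\hat\theta = \frac{\sum_{t=1}^n(y_t - \bar y)^2}{2n},$$
which is obtained by inverting the moment-parameter equation, is consistent.
Example: Generalized method of moments (GMM). The previous two examples give two estimators of $\theta^0$ which are both consistent. With a given sample, the estimators will be different. This is a case of overidentification, which means that we have more information than is strictly necessary for consistent estimation of the parameter.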
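The GMM combines information from the two moment-parameter equations to form a new estimator which will be more efficient, in general.
From the first example, define $m_{1t}(\theta) = \theta - y_t$. We already have that $m_1(\theta)$ is the sample average of $m_{1t}(\theta)$, i.e.,
$$m_1(\theta) = 1/n\sum_{t=1}^n m_{1t}(\theta) = \theta - \sum_{t=1}^n y_t/n.$$
Clearly, when evaluated at the true parameter value $\theta^0$, both $E\left[m_{1t}(\theta^0)\right] = 0$ and $E\left[m_1(\theta^0)\right] = 0$.
From the second example we define additional moment conditions
$$m_{2t}(\theta) = 2\theta - (y_t - \bar y)^2$$
and
$$m_2(\theta) = 2\theta - \frac{\sum_{t=1}^n(y_t - \bar y)^2}{n}.$$
Again, it is clear from the LLN that $m_2(\theta^0) \overset{a.s.}{\to} 0$.
The MM estimator would choose $\hat\theta$ to set either $m_1(\hat\theta) = 0$ or $m_2(\hat\theta) = 0$. In general, no single value of $\theta$ will solve the two equations simultaneously.
The GMM estimator is based on defining a measure of distance $d(m(\theta))$, where $m(\theta) = (m_1(\theta), m_2(\theta))'$, and choosing
$$\hat\theta = \arg\min_\Theta s_n(\theta) = d(m(\theta)).$$
An example would be to choose $d(m) = m'Am$, where $A$ is a positive definite matrix.
While it's clear that the MM gives consistent estimates if there is a one-to-one relationship between parameters and moments, it's not immediately obvious that the GMM estimator is consistent. (We'll see later that it is.)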
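The two MM estimators and a GMM estimator for the $\chi^2(\theta^0)$ example can be compared on simulated data. This is a minimal sketch: the true value $\theta^0 = 3$, the sample size, the seed and the choice $A = I$ are all assumptions made for the illustration, not recommendations.

```python
import random

random.seed(1)
theta0 = 3          # true degrees of freedom (assumed for the illustration)
n = 20000

# A chi^2(theta0) draw is a sum of theta0 squared standard normals.
y = [sum(random.gauss(0, 1) ** 2 for _ in range(theta0)) for _ in range(n)]

ybar = sum(y) / n
s2 = sum((yt - ybar) ** 2 for yt in y) / n

theta_mm1 = ybar        # solves m1(theta) = theta - ybar = 0
theta_mm2 = s2 / 2.0    # solves m2(theta) = 2*theta - s2 = 0

def d(theta):
    """GMM distance m'Am with A = I (an arbitrary choice for this sketch)."""
    m1 = theta - ybar
    m2 = 2.0 * theta - s2
    return m1 ** 2 + m2 ** 2

# With A = I, d is quadratic in theta and the f.o.c. 2*m1 + 4*m2 = 0 is linear.
theta_gmm = (ybar + 2.0 * s2) / 5.0
```

The three estimates differ in any given sample, and `theta_gmm` is a weighted combination of the two moment-based estimates that makes both moment conditions small at once without setting either exactly to zero.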
These examples show that these widely used estimators may all be interpreted as the solution of an optimization problem. For this reason, the study of extremum estimators is useful for its generality. We will see that the general results extend smoothly to the more specialized results available for specific estimators. After studying extremum estimators in general, we will study the GMM estimator, then QML and NLS. The reason we study GMM first is that LS, IV, NLS, MLE, QML and other well-known parametric estimators may all be interpreted as special cases of the GMM estimator, so the general results on GMM can simplify and unify the treatment of these other estimators. Nevertheless, there are some special results on QML and NLS, and both are important in empirical research, which makes focus on them useful.
0 (yt ) =
For example,
0 + t
In spite of this generality, situations often arise which simply cannot be convincingly represented by linear in the parameters models. Also, theory that applies to nonlinear models also applies to linear models, so one may as well start off with the general case.
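Example: Expenditure shares. Roy's Identity states that the quantity demanded of the $i$th of $G$ goods is
$$x_i = \frac{-\partial v(p,y)/\partial p_i}{\partial v(p,y)/\partial y}.$$
An expenditure share is
$$s_i \equiv p_ix_i/y,$$
so necessarily $s_i\in[0,1]$, and $\sum_{i=1}^G s_i = 1$. No linear in the parameters model for $x_i$ or $s_i$ with a parameter space that is defined independent of the data can guarantee that either of these conditions holds. These constraints will often be violated by estimated linear models, which calls into question their appropriateness in cases of this sort.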
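Example: Binary limited dependent variable. The referendum contingent valuation (CV) method of inferring the social value of a project provides a simple example. Individuals are asked if they would pay an amount $A$ for provision of a project. Indirect utility in the base case (no project) is $v^0(m,z) + \varepsilon^0$, where $m$ is income and $z$ is a vector of other variables such as prices, personal characteristics, etc. After provision, utility is $v^1(m,z) + \varepsilon^1$. The random terms $\varepsilon^i$, $i = 0, 1$, reflect variations of preferences in the population. With this, an individual agrees to pay $A$ if
$$\underbrace{\varepsilon^0 - \varepsilon^1}_{\varepsilon} < \underbrace{v^1(m - A, z) - v^0(m,z)}_{\Delta v(w,A)}$$
(We assume here that responses are truthful, that is, there is no strategic behavior, and that individuals are able to order their preferences in this hypothetical situation.)
Define $\varepsilon = \varepsilon^0 - \varepsilon^1$, let $w$ collect $m$ and $z$, and let $\Delta v(w,A) = v^1(m-A,z) - v^0(m,z)$. Define $y = 1$ if the consumer agrees to pay $A$ for the change, $y = 0$ otherwise. The probability of agreement is
$$\Pr(y = 1) = F_\varepsilon\left[\Delta v(w,A)\right]. \qquad (12.1)$$
To simplify notation, define $p(w,A) \equiv F_\varepsilon\left[\Delta v(w,A)\right]$. To make the example specific, suppose that
$$v^1(m,z) = \alpha - \beta m$$
$$v^0(m,z) = -\beta m$$
and $\varepsilon^0$ and $\varepsilon^1$ are i.i.d. extreme value random variables. That is, utility depends only on income, preferences in both states are homothetic, and a specific distributional assumption is made on the distribution of preferences in the population. With these assumptions (the details are unimportant here, see articles by D. McFadden if you're interested) it can be shown that
$$p(A,\theta) = \Lambda(\alpha + \beta A),$$
where $\Lambda(z)$ is the logistic distribution function
$$\Lambda(z) = (1 + \exp(-z))^{-1}.$$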
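This is the simple logit model: the choice probability is the logit function of a linear in parameters function.
Now, $y$ is either 0 or 1, and the expected value of $y$ is $\Lambda(\alpha + \beta A)$. Thus, we can write
$$y = \Lambda(\alpha + \beta A) + \eta$$
$$E(\eta) = 0.$$
One could estimate this by (nonlinear) least squares
$$\left(\hat\alpha, \hat\beta\right) = \arg\min\frac1n\sum_t\left(y - \Lambda(\alpha + \beta A)\right)^2$$
The main point is that it is impossible that $\Lambda(\alpha + \beta A)$ can be written as a linear in the parameters model, in the sense that, for arbitrary $A$, there are no $\theta$, $\varphi(A)$ such that
$$\Lambda(\alpha + \beta A) = \varphi(A)'\theta,\ \forall A$$
where $\varphi(A)$ is a $p$-vector valued function of $A$ and $\theta$ is a $p$-dimensional parameter. This is because for any $\theta$, we can always find an $A$ such that $\varphi(A)'\theta$ will be negative or greater than 1, which is illogical, since it is the expectation of a 0/1 binary random variable. Since this sort of problem occurs often in empirical work, it is useful to study NLS and other nonlinear models.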
After discussing these estimation methods for parametric models we'll briefly introduce nonparametric estimation methods. These methods allow one, for example, to estimate $f(x_t)$ consistently when we are not willing to assume that a model of the form
$$y_t = f(x_t) + \varepsilon_t$$
can be restricted to a parametric form
$$y_t = f(x_t,\theta) + \varepsilon_t$$
$$\Pr(\varepsilon_t < z) = F_\varepsilon(z|\phi, x_t)$$
$$\theta\in\Theta,\ \phi\in\Phi$$
where $f(\cdot)$ and perhaps $F_\varepsilon(z|\phi,x_t)$ are of known functional form. This is important since economic theory gives us general information about functions and the signs of their derivatives, but not about their specific form.
Then we'll look at simulation-based methods in econometrics. These methods allow us to substitute computer power for mental power. Since computer power is becoming relatively cheap compared to mental effort, any econometrician who lives by the principles of economic theory should be interested in these techniques.
Finally, we'll look at how econometric computations can be done in parallel on a cluster of computers.
Chapter 13
Numeric optimization methods
Readings:
This section gives a very brief introduction to what is a large literature on numeric optimization methods. We'll consider a few well-known techniques, and one fairly new technique that may allow one to solve difficult problems. The main objective is to become familiar with the issues, and to learn how to use the BFGS algorithm at the practical level.
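The general problem we consider is how to find the maximizing element $\hat\theta$ (a $K$-vector) of a function $s(\theta)$. This function may not be continuous, and it may not be differentiable. Even if it is twice continuously differentiable, it may not be globally concave, so local maxima, minima and saddlepoints may all exist. Supposing $s(\theta)$ were a quadratic function of $\theta$, e.g.,
$$s(\theta) = a + b'\theta + \frac12\theta'C\theta,$$
the first order conditions would be linear:
$$D_\theta s(\theta) = b + C\theta$$
so the maximizing (minimizing) element would be $\hat\theta = -C^{-1}b$. This is the sort of problem we have with linear models estimated by OLS. It's also the case for feasible GLS, since conditional on the estimate of the varcov matrix, we have a quadratic objective function in the remaining parameters.
More general problems will not have linear f.o.c., and we will not be able to solve for the maximizer analytically. This is when we need a numeric optimization method.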
13.1 Search
The idea is to create a grid over the parameter space and evaluate the function at each point on the grid. Select the best point. Then refine the grid in the neighborhood of the best point, and continue until the accuracy is good enough. See Figure ??. One has to be careful that the grid is fine enough in relationship to the irregularity of the function to ensure that sharp peaks are not missed entirely.
To check $q$ values in each dimension of a $K$ dimensional parameter space, we need to check $q^K$ points. For example, if $q = 100$ and $K = 10$, there would be $100^{10}$ points to check. If 1000 points can be checked in a second, it would take $3.171\times10^9$ years to perform the calculations, which is approximately the age of the earth. The search method is a very reasonable choice if $K$ is small, but it quickly becomes infeasible if $K$ is moderate or large.
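The search idea, and the arithmetic behind the $q^K$ cost, can be sketched as follows. The function names, the test function and the refinement rule (shrink the box to one grid cell on each side of the best point) are assumptions chosen for the illustration.

```python
import itertools

def grid_search(f, lower, upper, q=11, rounds=8):
    """Maximize f over a box by repeated grid refinement.
    Each round costs q**K function evaluations, K = number of parameters."""
    lower, upper = list(lower), list(upper)
    best = None
    for _ in range(rounds):
        axes = [[lo + (hi - lo) * i / (q - 1) for i in range(q)]
                for lo, hi in zip(lower, upper)]
        best = max(itertools.product(*axes), key=f)
        # Refine: new box extends one grid spacing on each side of the best point.
        widths = [(hi - lo) / (q - 1) for lo, hi in zip(lower, upper)]
        lower = [b - wd for b, wd in zip(best, widths)]
        upper = [b + wd for b, wd in zip(best, widths)]
    return best

def s(theta):
    """A smooth, single-peaked test function with maximum at (1, -0.5)."""
    a, b = theta
    return -(a - 1.0) ** 2 - 2.0 * (b + 0.5) ** 2

theta_hat = grid_search(s, [-5.0, -5.0], [5.0, 5.0])

# Cost of one pass with q = 100 and K = 10 at 1000 evaluations per second.
points = 100 ** 10
years = points / 1000 / (365 * 24 * 3600)
```

With $K = 2$ each round needs only 121 evaluations, which is why the method is attractive in low dimensions; `years` reproduces the figure of roughly $3.171\times10^9$ quoted above.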
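13.2 Derivative-based methods
Derivative-based methods are defined by
1. the method for choosing the initial value, $\theta^1$
2. the iteration method for choosing $\theta^{k+1}$ given $\theta^k$ (based upon derivatives)
3. the stopping criterion.
The iteration method can be broken into two problems: choosing the stepsize $a_k$ (a scalar) and choosing the direction of movement, $d^k$, which is of the same dimension as $\theta$, so that
$$\theta^{(k+1)} = \theta^{(k)} + a_kd^k.$$
A locally increasing direction of search $d$ is a direction such that
$$\frac{\partial s(\theta + ad)}{\partial a} > 0$$
for $a$ positive but small. That is, if we go in direction $d$, we will improve on the objective function, at least if we don't go too far in that direction.
As long as the gradient at $\theta$ is not zero there exist increasing directions, and they can all be represented as $Q^kg(\theta^k)$ where $Q^k$ is a symmetric p.d. matrix and $g(\theta) = D_\theta s(\theta)$ is the gradient at $\theta$. To see this, take a T.S. expansion around $a^0 = 0$:
$$s(\theta + ad) = s(\theta + 0d) + (a - 0)\,g(\theta + 0d)'d + o(1) = s(\theta) + a\,g(\theta)'d + o(1)$$
For small enough $a$ the $o(1)$ term can be ignored. If $d$ is to be an increasing direction, we need $g(\theta)'d > 0$. Defining $d = Qg(\theta)$, where $Q$ is positive definite, we guarantee that
$$g(\theta)'d = g(\theta)'Qg(\theta) > 0$$
unless $g(\theta) = 0$. Every increasing direction can be represented in this way (p.d. matrices are those such that the angle between $g$ and $Qg(\theta)$ is less than 90 degrees).
With this, the iteration rule becomes
$$\theta^{(k+1)} = \theta^{(k)} + a_kQ_kg(\theta^k)$$
and we keep going until the gradient becomes zero, so that there is no increasing direction. The problem is how to choose $a$ and $Q$.
Conditional on $Q$, choosing $a$ is fairly straightforward. A simple line search is an attractive possibility, since $a$ is a scalar.
The remaining problem is how to choose $Q$.
Steepest descent (ascent if we're maximizing) just sets $Q$ to an identity matrix, since the gradient provides the direction of maximum rate of change of the objective function.
Advantages: it is fast, in that it doesn't require anything more than first derivatives.
Disadvantages: This doesn't always work too well however (draw picture of banana function).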
13.2.3 Newton-Raphson
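The Newton-Raphson method uses information about the slope and curvature of the objective function to determine which direction and how far to move from an initial point. Supposing we're trying to maximize $s_n(\theta)$, take a second order Taylor's series approximation of $s_n(\theta)$ about $\theta^k$ (an initial guess):
$$s_n(\theta) \approx s_n(\theta^k) + g(\theta^k)'\left(\theta - \theta^k\right) + 1/2\left(\theta - \theta^k\right)'H(\theta^k)\left(\theta - \theta^k\right)$$
To attempt to maximize $s_n(\theta)$, we can maximize the portion of the right-hand side that depends on $\theta$, i.e., we can maximize
$$\tilde s(\theta) = g(\theta^k)'\theta + 1/2\left(\theta - \theta^k\right)'H(\theta^k)\left(\theta - \theta^k\right)$$
with respect to $\theta$. This is a much easier problem, since it is a quadratic function in $\theta$, so it has linear first order conditions. These are
$$D_\theta\tilde s(\theta) = g(\theta^k) + H(\theta^k)\left(\theta - \theta^k\right)$$
So the solution for the next round estimate is
$$\theta^{k+1} = \theta^k - H(\theta^k)^{-1}g(\theta^k)$$
This is illustrated in Figure ??.
However, it's good to include a stepsize, since the approximation to $s_n(\theta)$ may be bad far away from the maximizer $\hat\theta$, so the actual iteration formula is
$$\theta^{k+1} = \theta^k - a_kH(\theta^k)^{-1}g(\theta^k)$$
A potential problem is that the Hessian may not be negative definite when we're far from the maximizing point. So $-H(\theta^k)^{-1}$ may not be positive definite, and $-H(\theta^k)^{-1}g(\theta^k)$ may not define an increasing direction of search. This can happen when the objective function has flat regions, in which case the Hessian matrix is very ill-conditioned (e.g., is nearly singular), or when we're in the vicinity of a local minimum, where $H(\theta^k)$ is positive definite, and our direction is a decreasing direction of search. Matrix inverses by computers are subject to large errors when the matrix is ill-conditioned. Also, we certainly don't want to go in the direction of a minimum when we're maximizing. To solve this problem, Quasi-Newton methods simply add a positive definite component to $-H(\theta)$ to ensure that the resulting matrix is positive definite, e.g., $Q = -H(\theta) + bI$, where $b$ is chosen large enough so that $Q$ is well-conditioned and positive definite.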
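The Newton-Raphson iteration with a stepsize can be sketched for a scalar parameter as follows. The test function $s(\theta) = \ln\theta - \theta$, which is concave with its maximum at $\theta = 1$, and all names are assumptions made for the illustration; in the vector case the division by the Hessian becomes multiplication by its inverse.

```python
def newton_raphson(g, H, theta, step=1.0, tol=1e-10, max_iter=100):
    """Scalar Newton-Raphson: theta_{k+1} = theta_k - step * g(theta_k) / H(theta_k).
    Stops when the gradient is negligibly different from zero."""
    for k in range(max_iter):
        grad = g(theta)
        if abs(grad) < tol:
            return theta, k
        theta = theta - step * grad / H(theta)
    raise RuntimeError("no convergence within max_iter iterations")

# s(theta) = ln(theta) - theta, so g = 1/theta - 1 and H = -1/theta**2 < 0.
def gradient(theta):
    return 1.0 / theta - 1.0

def hessian(theta):
    return -1.0 / theta ** 2

theta_hat, iterations = newton_raphson(gradient, hessian, theta=0.5)
```

Because the Hessian is negative everywhere for this function, each step is an increasing direction; for objective functions with convex regions the safeguards discussed above would be needed.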
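Stopping criteria
The last thing we need is to decide when to stop. A digital computer is subject to limited machine precision and round-off errors. For these reasons, it is unreasonable to hope that a program can exactly find the point that maximizes a function. We need to define acceptable tolerances. Some stopping criteria are:
Negligible change in parameters:
$$\left|\theta_j^k - \theta_j^{k-1}\right| < \varepsilon_1,\ \forall j$$
Negligible relative change:
$$\left|\frac{\theta_j^k - \theta_j^{k-1}}{\theta_j^{k-1}}\right| < \varepsilon_2,\ \forall j$$
Negligible change of function:
$$\left|s(\theta^k) - s(\theta^{k-1})\right| < \varepsilon_3$$
Gradient negligibly different from zero:
$$\left|g_j(\theta^k)\right| < \varepsilon_4,\ \forall j$$
Also, if we're maximizing, it's good to check that the last round (real, not approximate) Hessian is negative definite.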
Starting values
The Newton-Raphson and related algorithms work well if the objective function is concave (when maximizing), but not so well if there are convex regions and local minima or multiple local maxima. The algorithm may converge to a local minimum or to a local maximum that is not optimal. The algorithm may also have difficulties converging at all.
The usual way to ensure that a global maximum has been found is to use many different starting values, and choose the solution that returns the highest objective function value.
Calculating derivatives
The Newton-Raphson algorithm requires first and second derivatives. It is often difficult to calculate derivatives (especially the Hessian) analytically if the function $s_n(\cdot)$ is complicated. Possible solutions are to calculate derivatives numerically, or to use programs such as MuPAD or Mathematica to calculate analytic derivatives. For example, Figure 13.2 shows MuPAD (see footnote 1) calculating a derivative that I didn't know off the top of my head, and one that I did know.
Numeric derivatives are less accurate than analytic derivatives, and are usually more costly to evaluate.
One advantage of numeric derivatives is that you don't have to worry about having made an error in calculating the analytic derivative. When programming analytic derivatives it's a good idea to check that they are correct by using numeric derivatives. This is a lesson I learned the hard way when writing my thesis.
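Checking an analytic derivative against a numeric one takes only a few lines. The sketch below uses a central difference on the logistic function, whose derivative is known to be $\Lambda(z)(1 - \Lambda(z))$; the step size and evaluation points are assumptions chosen for the illustration.

```python
import math

def central_diff(f, x, h=1e-5):
    """Two-sided numeric derivative of a scalar function."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_deriv(z):
    """Analytic derivative to be checked."""
    p = logistic(z)
    return p * (1.0 - p)

# Largest disagreement between numeric and analytic derivatives over a few points.
max_gap = max(abs(central_diff(logistic, z) - logistic_deriv(z))
              for z in [-3.0, -1.0, 0.0, 0.5, 2.0])
```

A `max_gap` that is not tiny signals a coding error in the analytic derivative (or a badly chosen step size).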
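Numeric second derivatives are much more accurate if the data are scaled so that the elements of the gradient are of the same order of magnitude. Example: if the model is $y_t = h(\alpha x_t + \beta z_t) + \varepsilon_t$, and estimation is by NLS, suppose that $D_\alpha s_n(\cdot) = 1000$ and $D_\beta s_n(\cdot) = 0.001$. One could define $\alpha^* = 1000\alpha$; $x_t^* = x_t/1000$; $\beta^* = \beta/1000$; $z_t^* = 1000z_t$, which leaves $\alpha x_t + \beta z_t$ unchanged. By the chain rule, $D_{\alpha^*}s_n(\cdot) = D_\alpha s_n(\cdot)/1000$ and $D_{\beta^*}s_n(\cdot) = 1000\,D_\beta s_n(\cdot)$, so the gradients $D_{\alpha^*}s_n(\cdot)$ and $D_{\beta^*}s_n(\cdot)$ will both be 1.
In general, estimation programs always work better if data is scaled in this way, since roundoff errors are less likely to become important.
Footnote 1: MuPAD is not a freely distributable program, so it's not on the CD. You can download it from http://www.mupad.de/download.shtml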
There are algorithms (such as BFGS and DFP) that use the sequential gradient evaluations to build up an approximation to the Hessian. The iterations are faster for this reason since the actual Hessian isn't calculated, but more iterations usually are required for convergence.
Simulated annealing is an algorithm that can find an optimum in the presence of nondifferentiabilities, discontinuities and multiple local optima. Basically, the algorithm randomly selects evaluation points, accepts all points that yield an increase in the objective function, but also accepts some points that decrease the objective function. This allows the algorithm to escape from local optima. As more and more points are tried, periodically the algorithm focuses on the best point so far, and reduces the range over which random points are generated. Also, the probability that a negative move is accepted reduces. The algorithm relies on many evaluations, as in the search method, but focuses in on promising areas, which reduces function evaluations with respect to the search method. It does not require derivatives to be evaluated. I have a program to do this if you're interested.
13.4 Examples
This section gives a few examples of how some nonlinear models may be estimated using maximum likelihood.
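We saw an example of a binary choice model in equation 12.1. A more general representation is
$$y^* = g(x) - \varepsilon$$
$$y = 1(y^* > 0)$$
$$\Pr(y = 1) = F_\varepsilon[g(x)] \equiv p(x,\theta)$$
The log-likelihood function is
$$s_n(\theta) = \frac1n\sum_{i=1}^n\left(y_i\ln p(x_i,\theta) + (1 - y_i)\ln\left[1 - p(x_i,\theta)\right]\right)$$
For the logit model (see the contingent valuation example above), the probability has the specific form
$$p(x,\theta) = \frac{1}{1 + \exp(-x'\theta)}$$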
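The average log-likelihood above is easy to code directly. The following is a minimal sketch, not a translation of the course's Octave programs: the function names are hypothetical, the data are simulated with $\theta = (0, 1)'$, and plain fixed-step gradient ascent is used in place of BFGS only to keep the example short.

```python
import math
import random

def logit_prob(x, theta):
    """p(x, theta) = 1 / (1 + exp(-x'theta))."""
    z = sum(xj * tj for xj, tj in zip(x, theta))
    return 1.0 / (1.0 + math.exp(-z))

def avg_loglik(theta, X, y):
    """Average logit log-likelihood s_n(theta)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = logit_prob(xi, theta)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(y)

def avg_score(theta, X, y):
    """Gradient of s_n: (1/n) * sum_i (y_i - p_i) x_i."""
    g = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        e = yi - logit_prob(xi, theta)
        for j, xj in enumerate(xi):
            g[j] += e * xj
    return [gj / len(y) for gj in g]

# Simulated data: a constant and one N(0,1) regressor, true theta = (0, 1).
random.seed(2)
n = 1000
theta_true = [0.0, 1.0]
X = [[1.0, random.gauss(0, 1)] for _ in range(n)]
y = [1 if random.random() < logit_prob(xi, theta_true) else 0 for xi in X]

# Fixed-step gradient ascent (a simple stand-in for BFGS; s_n is concave here).
theta_hat = [0.0, 0.0]
for _ in range(200):
    theta_hat = [t + 1.0 * gj for t, gj in zip(theta_hat, avg_score(theta_hat, X, y))]
```

At `theta_hat` the score is essentially zero, which is the gradient-based stopping criterion discussed earlier.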
You should download and examine LogitDGP.m, which generates data according to the logit model, logit.m, which calculates the loglikelihood, and EstimateLogit.m, which sets things up and calls the estimation routine, which uses the BFGS algorithm.
Here are some estimation results with $n = 100$, and the true $\theta = (0, 1)'$.
***********************************************
Trial of MLE estimation of Logit model

MLE Estimation Results
BFGS convergence: Normal convergence

Average Log-L: 0.607063
Observations: 100

            estimate    st. err     t-stat    p-value
constant      0.5400     0.2229     2.4224     0.0154
slope         0.7566     0.2374     3.1863     0.0014

Information Criteria
CAIC : 132.6230
BIC : 130.6230
AIC : 125.4127
***********************************************
The estimation program calls mle_results(), which in turn calls a number of other routines. These functions are part of the octave-forge repository.
Under the home production framework, individuals decide when to make health care visits to maintain their health stock, or to deal with negative shocks to the stock in the form of accidents or illnesses. As such, individual demand will be a function of the parameters of the individuals' utility functions.
The MEPS health data file, meps1996.data, contains 4564 observations on six measures of health care usage. The data is from the 1996 Medical Expenditure Panel Survey (MEPS). You can get more information at http://www.meps.ahrq.gov/. The six measures of use are office-based visits (OBDV), outpatient visits (OPV), inpatient visits (IPV), emergency room visits (ERV), dental visits (VDV), and number of prescription drugs taken (PRESCR). These form columns 1 - 6 of meps1996.data. The conditioning variables are public insurance (PUBLIC), private insurance (PRIV), sex (SEX), age (AGE), years of education (EDUC), and income (INCOME). These form columns 7 - 12 of the file, in the order given here. PRIV and PUBLIC are 0/1 binary variables, where a 1 indicates that the person has access to public or private insurance coverage. SEX is also 0/1, where 1 indicates that the person is female. This data will be used in examples fairly extensively in what follows.
The program ExploreMEPS.m shows how the data may be read in, and gives some descriptive information about variables, which follows:
All of the measures of use are count data, which means that they take on the values $0, 1, 2, \ldots$.
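It might be reasonable to try to use this information by specifying the density as a count data density. One of the simplest count data densities is the Poisson density, which is
$$f_Y(y) = \frac{\exp(-\lambda)\lambda^y}{y!}.$$
The Poisson average log-likelihood function is
$$s_n(\theta) = \frac1n\sum_{i=1}^n\left(-\lambda_i + y_i\ln\lambda_i - \ln y_i!\right)$$
We will parameterize the model as
$$\lambda_i = \exp(x_i'\beta)$$
$$x_i = \begin{bmatrix}1 & PUBLIC & PRIV & SEX & AGE & EDUC & INC\end{bmatrix}'.$$
This ensures that the mean is positive, as is required for the Poisson model. Note that for this parameterization
$$\beta_j = \frac{\partial\lambda/\partial x_j}{\lambda}$$
so
$$\beta_jx_j = \eta^\lambda_{x_j},$$
the elasticity of the conditional mean of $y$ with respect to the $j$th conditioning variable.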
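A minimal sketch of the Poisson average log-likelihood under the exponential parameterization follows, together with a numeric check of the relation $\beta_j = (\partial\lambda/\partial x_j)/\lambda$. The function names, the tiny data set and the parameter values are assumptions made for the illustration; they are not the MEPS data.

```python
import math

def cond_mean(x, beta):
    """lambda = exp(x'beta)."""
    return math.exp(sum(xj * bj for xj, bj in zip(x, beta)))

def poisson_avg_loglik(beta, X, y):
    """(1/n) * sum_i ( -lambda_i + y_i ln(lambda_i) - ln(y_i!) )."""
    total = 0.0
    for xi, yi in zip(X, y):
        lam = cond_mean(xi, beta)
        total += -lam + yi * math.log(lam) - math.lgamma(yi + 1)  # ln y! = lgamma(y + 1)
    return total / len(y)

# Two observations, beta = 0 so that lambda_i = 1 for both.
X_small = [[1.0, 0.0], [1.0, 1.0]]
y_small = [0, 2]
value = poisson_avg_loglik([0.0, 0.0], X_small, y_small)

# Numeric check that beta_j equals the proportional change in lambda per unit of x_j.
beta = [-0.8, 0.5]
h = 1e-6
dlam = (cond_mean([1.0, 2.0 + h], beta) - cond_mean([1.0, 2.0 - h], beta)) / (2.0 * h)
semi_elasticity = dlam / cond_mean([1.0, 2.0], beta)
```

Multiplying `semi_elasticity` by the value of the regressor gives the elasticity of the conditional mean, as in the text.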
The program EstimatePoisson.m estimates a Poisson model using the full data set. The results of the estimation, using OBDV as the dependent variable, are here:
******************************************************
Poisson model, MEPS 1996 full data set

MLE Estimation Results
BFGS convergence: Normal convergence

Average Log-L: -3.671090
Observations: 4564

              estimate    st. err     t-stat    p-value
constant        -0.791      0.149     -5.290      0.000
pub. ins.        0.848      0.076     11.093      0.000
priv. ins.       0.294      0.071      4.137      0.000
sex              0.487      0.055      8.797      0.000
age              0.024      0.002     11.471      0.000
edu              0.029      0.010      3.061      0.002
inc             -0.000      0.000     -0.978      0.328

Information Criteria
CAIC : 33575.6881    Avg. CAIC: 7.3566
BIC : 33568.6881     Avg. BIC: 7.3551
AIC : 33523.7064     Avg. AIC: 7.3452
******************************************************
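In some cases the dependent variable may be the time that passes between the occurrence of two events. A spell is the period of time between the occurrence of the initial event and the concluding event. For example, the initial event could be the loss of a job, and the final event is the finding of a new job. The spell is the period of unemployment.
Let $t_0$ be the time the initial event occurs, and $t_1$ be the time the concluding event occurs. For simplicity, assume that time is measured in years. The random variable $D$ is the duration of the spell, $D = t_1 - t_0$. Define the density function of $D$, $f_D(t)$, with distribution function $F_D(t) = \Pr(D < t)$.
Several questions may be of interest. For example, one might wish to know the expected time one has to wait to find a job given that one has already waited $s$ years. The probability that a spell lasts more than $s$ years is
$$\Pr(D > s) = 1 - \Pr(D\le s) = 1 - F_D(s).$$
The density of $D$ conditional on the spell already having lasted $s$ years is
$$f_D(t|D > s) = \frac{f_D(t)}{1 - F_D(s)}.$$
The expected additional time required for the spell to end given that it has already lasted $s$ years is the expectation of $D$ with respect to this density, minus $s$:
$$E = E(D|D > s) - s = \left(\int_s^\infty z\frac{f_D(z)}{1 - F_D(s)}dz\right) - s$$
To estimate this function, one needs to specify the density $f_D(t)$ as a parametric density, then estimate by maximum likelihood. There are a number of possibilities including the exponential density, the lognormal, etc. A reasonably flexible model that is a generalization of the exponential density is the Weibull density
$$f_D(t|\theta) = e^{-(\lambda t)^\gamma}\lambda\gamma(\lambda t)^{\gamma - 1}.$$
According to this model, $E(D) = \lambda^{-1}\Gamma(1 + 1/\gamma)$, where $\Gamma(\cdot)$ is the gamma function.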
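The expected additional duration can be evaluated numerically for a Weibull model. The sketch below uses the equivalent form $E(D|D>s) - s = \int_s^\infty S(z)dz/S(s)$, where $S(z) = 1 - F_D(z) = e^{-(\lambda z)^\gamma}$ is the survivor function (this follows from integration by parts), and a trapezoid rule with a truncated upper limit. The function names, the truncation horizon and the number of grid points are assumptions made for the illustration.

```python
import math

def weibull_survivor(t, lam, gam):
    """S(t) = 1 - F_D(t) = exp(-(lam*t)**gam)."""
    return math.exp(-((lam * t) ** gam))

def expected_additional_time(s, lam, gam, horizon=200.0, n=200000):
    """E(D | D > s) - s, computed as integral_s^{s+horizon} S(z) dz / S(s)
    with the trapezoid rule; the tail beyond s + horizon is ignored."""
    h = horizon / n
    total = 0.5 * (weibull_survivor(s, lam, gam)
                   + weibull_survivor(s + horizon, lam, gam))
    for i in range(1, n):
        total += weibull_survivor(s + i * h, lam, gam)
    return h * total / weibull_survivor(s, lam, gam)

# Exponential special case (gam = 1): memoryless, so the answer is 1/lam for any s.
exp_at_0 = expected_additional_time(0.0, 0.5, 1.0)
exp_at_3 = expected_additional_time(3.0, 0.5, 1.0)

# With gam < 1 the hazard is decreasing, so expected additional time rises with s.
life_at_0 = expected_additional_time(0.0, 0.559, 0.867)
life_at_2 = expected_additional_time(2.0, 0.559, 0.867)
```

At $s = 0$ the result agrees with the closed-form mean $\lambda^{-1}\Gamma(1 + 1/\gamma)$, which gives a simple check on the numeric integration.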
To illustrate application of this model, 402 observations on the lifespan of mongooses in Serengeti National Park (Tanzania) were used to fit a Weibull model. The spell in this case is the lifetime of an individual mongoose. The parameter estimates and standard errors are $\hat\lambda = 0.559\ (0.034)$ and $\hat\gamma = 0.867\ (0.033)$. Figure 13.3 presents fitted life expectancy (expected additional years of life) as a function of age, with 95% confidence bands. The plot is accompanied by a nonparametric Kaplan-Meier estimate of life expectancy. This nonparametric estimator simply averages all spell lengths greater than age, and then subtracts age. This is consistent by the LLN.
In the figure one can see that the model doesn't fit the data well, in that it predicts life expectancy quite differently than does the nonparametric model. Over part of the age range, the nonparametric estimate is outside the confidence interval that results from the parametric model, which casts doubt upon the parametric model. Mongooses that are between 2-6 years old seem to have a lower life expectancy than is predicted by the Weibull model, whereas young mongooses that survive beyond infancy have a higher life expectancy, up to a bit beyond 2 years. Due to the dramatic change in the death rate as a function of $t$, one might specify $f_D(t)$ as a mixture of two Weibull densities,
$$f_D(t|\theta) = \delta\left(e^{-(\lambda_1t)^{\gamma_1}}\lambda_1\gamma_1(\lambda_1t)^{\gamma_1 - 1}\right) + (1 - \delta)\left(e^{-(\lambda_2t)^{\gamma_2}}\lambda_2\gamma_2(\lambda_2t)^{\gamma_2 - 1}\right).$$
The parameters $\gamma_i$ and $\lambda_i$, $i = 1, 2$, are the parameters of the two Weibull densities, and $\delta$ is the parameter that mixes the two.
Note that a standard likelihood ratio test cannot be used to choose between the two models, since under the null that $\delta = 1$ (single density), the two parameters $\lambda_2$ and $\gamma_2$ are not identified. It is possible to take this into account, but this topic is out of the scope of this course. Nevertheless, the improvement in the likelihood function is considerable. The parameter estimates are
Parameter      Estimate    St. Error
$\lambda_1$       0.233        0.016
$\gamma_1$        1.722        0.166
$\lambda_2$       1.731        0.101
$\gamma_2$        1.522        0.096
$\delta$          0.428        0.035
Note that the mixture parameter is highly significant. This model leads to the fit in Figure 13.4. Note that the parametric and nonparametric fits are quite close to one another, up to around 6 years. The disagreement after this point is not too important, since less than 5% of mongooses live more than 6 years, which implies that the Kaplan-Meier nonparametric estimate has a high variance (since it's an average of a small number of observations).
Mixture models are often an effective way to model complex responses, though they can suffer from overparameterization. Alternatives will be discussed later.
With unscaled data, the elements of the score vector have very different magnitudes at the initial value of $\theta$ (all zeros). To see this run CheckScore.m. With unscaled data, one element of the gradient is very large, and the maximum and minimum elements are 5 orders of magnitude apart. This causes convergence problems due to serious numerical inaccuracy when doing inversions to calculate the BFGS direction of search. With scaled data, none of the elements of the gradient are very large, and the maximum difference in orders of magnitude is 3. Convergence is quick.
Multiple optima (one global, others local) can complicate life, since we have limited means of determining if there is a higher maximum than the one we're at. Think of climbing a mountain in an unknown range, in a very foggy place (Figure 13.5). You can go up until there's nowhere else to go up, but since you're in the fog you don't know if the true summit is across the gap that's at your feet. Do you claim victory and go home, or do you trudge down the gap and explore the other side?
The best way to avoid stopping at a local maximum is to use many starting values, for example on a grid, or randomly generated. Or perhaps one might have priors about possible values for the parameters (e.g., from previous studies of similar data).
Let's try to find the true minimizer of minus 1 times the foggy mountain function (since the algorithms are set up to minimize). From the picture, you can see it's close to $(0, 0)$, but let's pretend there is fog, and that we don't know that. The program FoggyMountain.m shows that poor start values can lead to problems. It uses SA, which finds the true global minimum, and it shows that BFGS using a battery of random start values can also find the global minimum. The output of one run is here:
================================================
SAMIN final results
NORMAL CONVERGENCE

Func. tol. 1.000000e-10 Param. tol. 1.000000e-03
Obj. fn. value -0.100023

   parameter    search width
    0.037419        0.000018
   -0.000000        0.000051
================================================
Now try a battery of random start values and
a short BFGS on each, then iterate to convergence
The result using 20 randoms start values
ans =

   3.7417e-02   2.7628e-07
In that run, the single BFGS run with bad start values converged to a point far from the true minimizer, while simulated annealing and BFGS using a battery of random start values both found the true minimizer. The moral of the story is: be cautious and don't publish your results too quickly.
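The effect of start values can be reproduced with a one-dimensional stand-in for the foggy mountain. In this sketch the objective function, the fixed-step gradient descent (used here instead of BFGS to keep the code short) and the grid of start values are all assumptions made for the illustration.

```python
import math

def f(x):
    """Objective to minimize: global minimum at x = 0, with local minima elsewhere."""
    return x * x / 10.0 - math.cos(3.0 * x)

def fprime(x):
    return x / 5.0 + 3.0 * math.sin(3.0 * x)

def local_descent(x, step=0.05, tol=1e-12, max_iter=10000):
    """Fixed-step gradient descent: it only finds the local minimum
    of the basin in which it starts."""
    for _ in range(max_iter):
        g = fprime(x)
        if abs(g) < tol:
            break
        x -= step * g
    return x

# A single run from a poor start value stops at a local minimum.
single = local_descent(4.0)

# A battery of start values on a grid, keeping the best local solution.
starts = [-6.0 + 12.0 * i / 19 for i in range(20)]
best = min((local_descent(x0) for x0 in starts), key=f)
```

Comparing `f(single)` with `f(best)` shows how far the single run is from the global minimum, even though it satisfies the gradient stopping criterion.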
the
the
Using logit.m and EstimateLogit.m as templates, write a function to calculate the probit loglikelihood, and a script to estimate a probit model. Run it using data that actually follows a logit model (you can generate it in the same way that is done in the logit example).
Study mle_results.m to see what it does. Examine the functions that mle_results.m calls, and in turn the functions that those functions call. Write a complete description of how the whole chain works.
Look at the Poisson estimation results for the OBDV measure of health care use and give an economic interpretation. Estimate Poisson models for the other 5 measures of health care usage.
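Chapter 14
Asymptotic properties of extremum estimators
Readings: Gourieroux and Monfort, Vol. 2, Ch. 24; Amemiya, Ch. 4, section 4.1; Davidson and MacKinnon, pp. 591-96; Gallant, Ch. 3; Newey and McFadden (1994), "Large Sample Estimation and Hypothesis Testing," in Handbook of Econometrics, Vol. 4, Ch. 36.
An extremum estimator $\hat\theta$ is the optimizing element of an objective function $s_n(\theta)$ over a set $\Theta$. Let the objective function $s_n(Z_n,\theta)$ depend upon an $n\times p$ random matrix $Z_n = \begin{bmatrix} z_1 & z_2 & \cdots & z_n\end{bmatrix}'$ where the $z_t$ are $p$-vectors and $p$ is finite.
Example: Given the model $y_i = x_i'\theta + \varepsilon_i$, with $n$ observations, define $z_i = (y_i, x_i')'$. The OLS estimator minimizes
$$s_n(Z_n,\theta) = 1/n\sum_{i=1}^n\left(y_i - x_i'\theta\right)^2 = 1/n\,\|Y - X\theta\|^2$$
where $Y$ and $X$ are defined similarly to $Z$.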
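14.2 Consistency
The following theorem is patterned on a proof in Gallant (1987) (the article, ref. later), which we'll see in its original form later in the course. It is interesting to compare the following proof with Amemiya's Theorem 4.1.1, which is done in terms of convergence in probability.
Theorem 19 [Consistency of e.e.] Suppose that $\hat\theta_n$ is obtained by maximizing $s_n(\theta)$ over $\bar\Theta$. Assume
1. Compactness: The parameter space $\Theta$ is an open bounded subset of Euclidean space $\Re^K$. So the closure of $\Theta$, $\bar\Theta$, is compact.
2. Uniform Convergence: There is a nonstochastic function $s_\infty(\theta)$ that is continuous in $\theta$ on $\bar\Theta$ such that
$$\lim_{n\to\infty}\sup_{\theta\in\bar\Theta}\left|s_n(\theta) - s_\infty(\theta)\right| = 0,\ \text{a.s.}$$
3. Identification: $s_\infty(\cdot)$ has a unique global maximum at $\theta^0\in\Theta$, i.e., $s_\infty(\theta^0) > s_\infty(\theta)$, $\forall\theta\neq\theta^0$, $\theta\in\bar\Theta$.
Then $\hat\theta_n\overset{a.s.}{\to}\theta^0$.
Proof: Select a $\omega\in\Omega$ and hold it fixed. Then $\{s_n(\omega,\theta)\}$ is a fixed sequence of functions. Suppose that $\omega$ is such that $s_n(\theta)$ converges uniformly to $s_\infty(\theta)$. This happens with probability one by assumption (2). The sequence $\{\hat\theta_n\}$ lies in the compact set $\bar\Theta$, by assumption (1) and the fact that maximization is over $\bar\Theta$. Since every sequence from a compact set has at least one limit point (Davidson, Thm. 2.12), say that $\hat\theta$ is a limit point of $\{\hat\theta_n\}$. There is a subsequence $\{\hat\theta_{n_m}\}$ ($\{n_m\}$ is simply a sequence of increasing integers) with $\lim_{m\to\infty}\hat\theta_{n_m} = \hat\theta$. By uniform convergence and continuity
$$\lim_{m\to\infty}s_{n_m}(\hat\theta_{n_m}) = s_\infty(\hat\theta).$$
To see this, first of all, select an element $\hat\theta_t$ from the sequence $\{\hat\theta_{n_m}\}$. Then uniform convergence implies
$$\lim_{m\to\infty}s_{n_m}(\hat\theta_t) = s_\infty(\hat\theta_t).$$
Continuity of $s_\infty(\cdot)$ implies that
$$\lim_{t\to\infty}s_\infty(\hat\theta_t) = s_\infty(\hat\theta)$$
since the limit as $t\to\infty$ of $\{\hat\theta_t\}$ is $\hat\theta$. So the above claim is true.
Next, by maximization
$$s_{n_m}(\hat\theta_{n_m})\ge s_{n_m}(\theta^0)$$
which holds in the limit, so
$$\lim_{m\to\infty}s_{n_m}(\hat\theta_{n_m})\ge\lim_{m\to\infty}s_{n_m}(\theta^0).$$
However,
$$\lim_{m\to\infty}s_{n_m}(\hat\theta_{n_m}) = s_\infty(\hat\theta),$$
as seen above, and
$$\lim_{m\to\infty}s_{n_m}(\theta^0) = s_\infty(\theta^0)$$
by uniform convergence, so
$$s_\infty(\hat\theta)\ge s_\infty(\theta^0).$$
But by assumption (3), there is a unique global maximum of $s_\infty(\theta)$ at $\theta^0$, so we must have $s_\infty(\hat\theta) = s_\infty(\theta^0)$, and $\hat\theta = \theta^0$. Finally, all of the above limits hold almost surely, since so far we have held $\omega$ fixed, but now we need to consider all $\omega\in\Omega$. Therefore $\{\hat\theta_n\}$ has only one limit point, $\theta^0$, except on a set $C\subset\Omega$ with $P(C) = 0$.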
Discussion of the proof:
This proof relies on the identification assumption of a unique global maximum at $\theta^0$. An equivalent way to state this is
Identification: Any point $\theta$ in $\bar\Theta$ with $s_\infty(\theta)\ge s_\infty(\theta^0)$ must have $\|\theta - \theta^0\| = 0$,
which matches the way we will write the assumption in the section on nonparametric inference.
We assume that $\hat\theta_n$ is in fact a global maximum of $s_n(\theta)$. It is not required to be unique for $n$ finite, though the identification assumption requires that the limiting objective function have a unique maximizing argument. The previous section on numeric optimization methods showed that actually finding the global maximum of $s_n(\theta)$ may be a non-trivial problem.
See Amemiya's Example 4.1.4 for a case where discontinuity leads to breakdown of consistency.
The assumption that $\theta^0$ is in the interior of $\Theta$ (part of the identification assumption) has not been used to prove consistency, so we could directly assume that $\theta^0$ is simply an element of a compact set $\bar\Theta$. The reason that we assume it's in the interior here is that this is necessary for subsequent proof of asymptotic normality, and I'd like to maintain a minimal set of simple assumptions, for clarity. Parameters on the boundary of the parameter set cause theoretical difficulties that we will not deal with in this course. Just note that conventional hypothesis testing methods do not apply in this case.
Note that $s_n(\theta)$ is not required to be continuous, though $s_\infty(\theta)$ is.
The following figures illustrate why uniform convergence is important. In the second figure, if the function is not converging around the lower of the two maxima, there is no guarantee that the maximizer will be in the neighborhood of the global maximizer.
We need a uniform strong law of large numbers in order to verify assumption (2) of
Theorem 19. The following theorem is from Davidson, pg. 337.
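Theorem 20 [Uniform Strong LLN] Let $\{G_n(\theta)\}$ be a sequence of stochastic real-valued functions on a totally-bounded metric space $(\Theta, \rho)$. Then
$$\sup_{\theta\in\Theta} |G_n(\theta)| \overset{a.s.}{\to} 0$$
if and only if
(a) $G_n(\theta) \overset{a.s.}{\to} 0$ for each $\theta \in \Theta_0$, where $\Theta_0$ is a dense subset of $\Theta$, and
(b) $\{G_n(\theta)\}$ is strongly stochastically equicontinuous.

The metric space we are interested in now is simply $\Theta \subset \Re^K$, using the Euclidean norm.

The pointwise almost sure convergence needed for assumption (a) comes from one of the usual SLLN's.

Stronger assumptions that imply those of the theorem are: the parameter space is compact (this has already been assumed); the objective function is continuous and bounded with probability one on the entire parameter space; and a standard SLLN can be shown to apply to some point in the parameter space.

These are reasonable conditions in many cases, and henceforth when dealing with specific estimators we'll simply assume that pointwise almost sure convergence can be extended to uniform almost sure convergence in this way.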
The more general theorem is useful in the case that the limiting objective function can be continuous in $\theta$ even if $s_n(\theta)$ is discontinuous. This can happen because discontinuities may be smoothed out as we take expectations over the data. In the section on simulation-based estimation we will see a case of a discontinuous objective function.
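14.3 Example: Consistency of Least Squares

We suppose that data is generated by random sampling of $(y, w)$, where $y_t = \alpha^0 + \beta^0 w_t + \varepsilon_t$. $(w_t, \varepsilon_t)$ has the common distribution function $\mu_w \mu_\varepsilon$ ($w$ and $\varepsilon$ are independent) with support $\mathcal{W} \times \mathcal{E}$. Suppose that the variances $\sigma_w^2$ and $\sigma_\varepsilon^2$ are finite. Let $\theta^0 = (\alpha^0, \beta^0)' \in \Theta$, for which $\Theta$ is compact. Let $x_t = (1, w_t)'$, so we can write $y_t = x_t'\theta^0 + \varepsilon_t$. The sample objective function for a sample size $n$ is
$$\begin{aligned}
s_n(\theta) &= \frac{1}{n}\sum_{t=1}^n \left(y_t - x_t'\theta\right)^2 = \frac{1}{n}\sum_{t=1}^n \left(x_t'\theta^0 + \varepsilon_t - x_t'\theta\right)^2 \\
&= \frac{1}{n}\sum_{t=1}^n \left(x_t'(\theta^0 - \theta)\right)^2 + \frac{2}{n}\sum_{t=1}^n x_t'(\theta^0 - \theta)\,\varepsilon_t + \frac{1}{n}\sum_{t=1}^n \varepsilon_t^2 .
\end{aligned}$$

Considering the last term, by the SLLN,
$$\frac{1}{n}\sum_{t=1}^n \varepsilon_t^2 \overset{a.s.}{\to} \int_{\mathcal{W}}\int_{\mathcal{E}} \varepsilon^2 \, d\mu_W \, d\mu_E = \sigma_\varepsilon^2 .$$

Considering the second term, since $E(\varepsilon) = 0$ and $w$ and $\varepsilon$ are independent, the SLLN implies that it converges to zero.

Finally, for the first term, for a given $\theta$, we assume that a SLLN applies so that
$$\begin{aligned}
\frac{1}{n}\sum_{t=1}^n \left(x_t'(\theta^0 - \theta)\right)^2 &\overset{a.s.}{\to} \int_{\mathcal{W}} \left(x'(\theta^0 - \theta)\right)^2 d\mu_W \qquad (14.1)\\
&= (\alpha^0 - \alpha)^2 + 2(\alpha^0 - \alpha)(\beta^0 - \beta)\int_{\mathcal{W}} w \, d\mu_W + (\beta^0 - \beta)^2 \int_{\mathcal{W}} w^2 \, d\mu_W \\
&= (\alpha^0 - \alpha)^2 + 2(\alpha^0 - \alpha)(\beta^0 - \beta)E(w) + (\beta^0 - \beta)^2 E(w^2).
\end{aligned}$$

Finally, the objective function is clearly continuous, and the parameter space is assumed to be compact, so the convergence is also uniform. Thus,
$$s_\infty(\theta) = (\alpha^0 - \alpha)^2 + 2(\alpha^0 - \alpha)(\beta^0 - \beta)E(w) + (\beta^0 - \beta)^2 E(w^2) + \sigma_\varepsilon^2 .$$
A minimizer of this is clearly $\alpha = \alpha^0$, $\beta = \beta^0$.

Exercise 21 Show that in order for the above solution to be unique it is necessary that $E(w^2) \ne 0$. Discuss the relationship between this condition and the problem of colinearity of regressors.

This example shows that Theorem 19 can be used to prove strong consistency of the OLS estimator. There are easier ways to show this, of course - this is only an example of application of the theorem.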
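As an informal numerical check of the least squares example, the following Python sketch simulates the model and reports the OLS estimates for growing sample sizes. It is not part of the original notes: the parameter values $\alpha^0 = 1$, $\beta^0 = 2$, the uniform distribution of $w$, the standard normal errors and the helper name `ols_line` are all illustrative assumptions.

```python
import random

def ols_line(n, alpha0=1.0, beta0=2.0, seed=0):
    """Simulate y_t = alpha0 + beta0*w_t + eps_t and return the OLS (alpha, beta)."""
    rng = random.Random(seed)
    w = [rng.uniform(0.0, 1.0) for _ in range(n)]            # assumed regressor distribution
    y = [alpha0 + beta0 * wt + rng.gauss(0.0, 1.0) for wt in w]
    w_bar = sum(w) / n
    y_bar = sum(y) / n
    s_wy = sum((wt - w_bar) * (yt - y_bar) for wt, yt in zip(w, y))
    s_ww = sum((wt - w_bar) ** 2 for wt in w)
    beta_hat = s_wy / s_ww                                   # slope that minimizes s_n
    alpha_hat = y_bar - beta_hat * w_bar                     # intercept that minimizes s_n
    return alpha_hat, beta_hat

# The estimates should settle near (1, 2) as n grows.
for n in (100, 10_000, 200_000):
    a, b = ols_line(n)
    print(n, round(a, 3), round(b, 3))
```

With a larger `n` the printed pair should move closer to the true values, which is what strong consistency predicts; a single simulated path is only suggestive, of course.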
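14.4 Asymptotic Normality

Theorem 22 [Asymptotic normality of e.e.] In addition to the assumptions of Theorem 19, assume
(a) $J_n(\theta) \equiv D_\theta^2 s_n(\theta)$ exists and is continuous in an open, convex neighborhood of $\theta^0$.
(b) $\{J_n(\theta_n)\} \overset{a.s.}{\to} J_\infty(\theta^0)$, a finite negative definite matrix, for any sequence $\{\theta_n\}$ that converges almost surely to $\theta^0$.
(c) $\sqrt{n}\, D_\theta s_n(\theta^0) \overset{d}{\to} N\left[0, I_\infty(\theta^0)\right]$, where $I_\infty(\theta^0) = \lim_{n\to\infty} Var\, \sqrt{n}\, D_\theta s_n(\theta^0)$.
Then
$$\sqrt{n}\left(\hat{\theta} - \theta^0\right) \overset{d}{\to} N\left[0, J_\infty(\theta^0)^{-1} I_\infty(\theta^0) J_\infty(\theta^0)^{-1}\right].$$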
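Proof: By Taylor expansion:
$$D_\theta s_n(\hat{\theta}_n) = D_\theta s_n(\theta^0) + D_\theta^2 s_n(\theta^*)\left(\hat{\theta} - \theta^0\right)$$
where $\theta^* = \lambda\hat{\theta} + (1 - \lambda)\theta^0$, $0 \le \lambda \le 1$.

Note that $\hat{\theta}$ will be in the neighborhood where $D_\theta^2 s_n(\theta)$ exists with probability one as $n$ becomes large, by consistency.

Now the l.h.s. of this equation is zero, at least asymptotically, since $\hat{\theta}_n$ is a maximizer and the f.o.c. must hold exactly since the limiting objective function is strictly concave in a neighborhood of $\theta^0$.

Also, since $\theta^*$ is between $\hat{\theta}_n$ and $\theta^0$, and since $\hat{\theta}_n \overset{a.s.}{\to} \theta^0$, assumption (b) gives
$$D_\theta^2 s_n(\theta^*) \overset{a.s.}{\to} J_\infty(\theta^0).$$
So
$$0 = D_\theta s_n(\theta^0) + \left[J_\infty(\theta^0) + o_p(1)\right]\left(\hat{\theta} - \theta^0\right)$$
And
$$0 = \sqrt{n}\, D_\theta s_n(\theta^0) + \left[J_\infty(\theta^0) + o_p(1)\right]\sqrt{n}\left(\hat{\theta} - \theta^0\right).$$
Now $J_\infty(\theta^0)$ is a finite negative definite matrix, so the $o_p(1)$ term is asymptotically irrelevant next to $J_\infty(\theta^0)$, so we can write
$$0 \overset{a}{=} \sqrt{n}\, D_\theta s_n(\theta^0) + J_\infty(\theta^0)\sqrt{n}\left(\hat{\theta} - \theta^0\right)$$
$$\sqrt{n}\left(\hat{\theta} - \theta^0\right) \overset{a}{=} -J_\infty(\theta^0)^{-1}\sqrt{n}\, D_\theta s_n(\theta^0).$$
Because of assumption (c), and the formula for the variance of a linear combination of r.v.'s,
$$\sqrt{n}\left(\hat{\theta} - \theta^0\right) \overset{d}{\to} N\left[0, J_\infty(\theta^0)^{-1} I_\infty(\theta^0) J_\infty(\theta^0)^{-1}\right].$$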
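Assumption (b) is not implied by the Slutsky theorem. The Slutsky theorem says that $g(x_n) \overset{a.s.}{\to} g(x)$ if $x_n \to x$ and $g(\cdot)$ is continuous at $x$. However, the function $g(\cdot)$ can't depend on $n$ to use this theorem. In our case $J_n(\theta_n)$ is a function of $n$. A theorem which applies (Amemiya, Ch. 4) is

Theorem 23 If $g_n(\theta)$ converges almost surely to a nonstochastic function $g_\infty(\theta)$ uniformly on an open neighborhood of $\theta^0$, then $g_n(\hat{\theta}) \overset{a.s.}{\to} g_\infty(\theta^0)$ if $g_\infty(\theta)$ is continuous at $\theta^0$ and $\hat{\theta} \overset{a.s.}{\to} \theta^0$.

To apply this to the second derivatives, sufficient conditions would be that the second derivatives be strongly stochastically equicontinuous on a neighborhood of $\theta^0$, and that an ordinary LLN applies to the derivatives when evaluated at $\theta \in N(\theta^0)$.

Stronger conditions that imply this are as above: continuous and bounded second derivatives in a neighborhood of $\theta^0$.

A note on the order of these matrices: Supposing that $s_n(\theta)$ is representable as an average of $n$ terms, which is the case for all estimators we consider, $D_\theta^2 s_n(\theta)$ is also an average of $n$ matrices, the elements of which are not centered (they do not have zero expectation). Supposing a SLLN applies, the almost sure limit of $D_\theta^2 s_n(\theta^0)$ is $J_\infty(\theta^0) = O(1)$, as we saw in Example 51. On the other hand, assumption (c): $\sqrt{n}\, D_\theta s_n(\theta^0) \overset{d}{\to} N\left[0, I_\infty(\theta^0)\right]$ means that
$$\sqrt{n}\, D_\theta s_n(\theta^0) = O_p(1).$$
If we were to omit the $\sqrt{n}$, we'd have
$$D_\theta s_n(\theta^0) = n^{-\frac{1}{2}} O_p(1) = O_p\left(n^{-\frac{1}{2}}\right).$$
The sequence $D_\theta s_n(\theta^0)$ is centered, so we need to scale by $\sqrt{n}$ to avoid convergence to zero.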
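14.5 Examples
14.5.1 Coin flipping, yet again
Remember that in section 4.4.1 we saw that the asymptotic variance of the MLE of the parameter of a Bernoulli trial, using i.i.d. data, was $\lim Var \sqrt{n}\left(\hat{p} - p\right) = p(1 - p)$. Let's verify this using the methods of this Chapter. The log-likelihood function is
$$s_n(p) = \frac{1}{n}\sum_{t=1}^n \left\{y_t \ln p + (1 - y_t)\ln(1 - p)\right\}$$
so
$$E s_n(p) = p^0 \ln p + \left(1 - p^0\right)\ln(1 - p)$$
by the fact that the observations are i.i.d. Thus, $s_\infty(p) = p^0 \ln p + \left(1 - p^0\right)\ln(1 - p)$. A bit of calculation shows that the expected second derivative, evaluated at $p = p^0$, is
$$E\, D_\theta^2 s_n(p)\big|_{p=p^0} \equiv J_n(\theta) = \frac{-1}{p^0\left(1 - p^0\right)},$$
which doesn't depend upon $n$. By results we've seen on MLE, $\lim Var \sqrt{n}\left(\hat{p} - p^0\right) = -J_\infty^{-1}(p^0)$. And in this case, $-J_\infty^{-1}(p^0) = p^0\left(1 - p^0\right)$. It's comforting to see that this is the same result we got in section 4.4.1.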
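The coin flipping result can also be checked by simulation. The sketch below is an illustration that is not part of the original notes; the values $p^0 = 0.3$, $n = 500$, the number of replications and the function name `scaled_mle_variance` are assumptions chosen only for the example. It draws many samples, computes the MLE $\hat{p}$ (the sample mean) in each, and compares the sample variance of $\sqrt{n}(\hat{p} - p^0)$ with $p^0(1 - p^0)$.

```python
import random

def scaled_mle_variance(p0=0.3, n=500, reps=2000, seed=1):
    """Monte Carlo variance of sqrt(n)*(p_hat - p0) for Bernoulli(p0) samples."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        successes = sum(1 for _ in range(n) if rng.random() < p0)
        p_hat = successes / n                      # the MLE is the sample mean
        draws.append(n ** 0.5 * (p_hat - p0))
    mean = sum(draws) / reps
    return sum((d - mean) ** 2 for d in draws) / (reps - 1)

simulated = scaled_mle_variance()
theoretical = 0.3 * (1 - 0.3)                      # -1/J_infinity at p0 = 0.3
print(simulated, theoretical)
```

The two printed numbers should be close, up to Monte Carlo error.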
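14.5.2 Binary response models

Extending the Bernoulli trial model to binary response models with conditioning variables, a simple example is a probit threshold-crossing model. Assume that
$$y^* = x'\beta - \varepsilon, \qquad y = 1(y^* > 0), \qquad \varepsilon \sim N(0, 1).$$
Here, $y^*$ is an unobserved (latent) continuous variable, and $y$ is a binary variable that indicates whether $y^*$ is negative or positive. Then $\Pr(y = 1) = \Pr(\varepsilon < x'\beta) = \Phi(x'\beta)$, where
$$\Phi(x'\beta) = \int_{-\infty}^{x'\beta} (2\pi)^{-1/2}\exp\left(-\frac{\varepsilon^2}{2}\right) d\varepsilon$$
is the standard normal distribution function.

In general, a binary response model will require that the choice probability be parameterized in some form. For a vector of explanatory variables $x$, the response probability will be parameterized in some manner
$$\Pr(y = 1|x) = p(x, \theta).$$
If $p(x, \theta) = \Lambda(x'\theta)$, where $\Lambda(z) = (1 + \exp(-z))^{-1}$, we have a logit model. If $p(x, \theta) = \Phi(x'\theta)$, where $\Phi(\cdot)$ is the standard normal distribution function, then we have a probit model.

Regardless of the parameterization, we are dealing with a Bernoulli density,
$$f_{Y_i}(y_i|x_i) = p(x_i, \theta)^{y_i}\left(1 - p(x_i, \theta)\right)^{1 - y_i}$$
so as long as the observations are independent, the maximum likelihood (ML) estimator, $\hat{\theta}$, is the maximizer of
$$s_n(\theta) = \frac{1}{n}\sum_{i=1}^n \left(y_i \ln p(x_i, \theta) + (1 - y_i)\ln\left[1 - p(x_i, \theta)\right]\right) \equiv \frac{1}{n}\sum_{i=1}^n s(y_i, x_i, \theta). \qquad (14.2)$$
Following the above theoretical results, $\hat{\theta}$ tends in probability to the $\theta^0$ that maximizes the uniform almost sure limit of $s_n(\theta)$. Noting that $E y_i = p(x_i, \theta^0)$, and following a SLLN for i.i.d. processes, $s_n(\theta)$ converges almost surely to the expectation of a representative term $s(y, x, \theta)$. First one can take the expectation conditional on $x$ to get
$$E_{y|x}\left\{y \ln p(x, \theta) + (1 - y)\ln\left[1 - p(x, \theta)\right]\right\} = p(x, \theta^0)\ln p(x, \theta) + \left[1 - p(x, \theta^0)\right]\ln\left[1 - p(x, \theta)\right].$$
Next taking expectation over $x$ we get the limiting objective function
$$s_\infty(\theta) = \int_{\mathcal{X}} \left\{p(x, \theta^0)\ln p(x, \theta) + \left[1 - p(x, \theta^0)\right]\ln\left[1 - p(x, \theta)\right]\right\}\mu(x)\,dx, \qquad (14.3)$$
where $\mu(x)$ is the (joint - the integral is understood to be multiple, and $\mathcal{X}$ is the support of $x$) density function of the explanatory variables $x$. This is clearly continuous in $\theta$, as long as $p(x, \theta)$ is continuous, and if the parameter space is compact we therefore have uniform almost sure convergence. Note that $p(x, \theta)$ is continuous for the logit and probit models, for example. The maximizing element of $s_\infty(\theta)$, $\theta^*$, solves the first order conditions
$$\int_{\mathcal{X}} \left\{\frac{p(x, \theta^0)}{p(x, \theta^*)}\frac{\partial}{\partial\theta}p(x, \theta^*) - \frac{1 - p(x, \theta^0)}{1 - p(x, \theta^*)}\frac{\partial}{\partial\theta}p(x, \theta^*)\right\}\mu(x)\,dx = 0.$$
This is clearly solved by $\theta^* = \theta^0$. Provided the solution is unique, $\hat{\theta}$ is consistent. Question: what's needed to ensure that the solution is unique?

The asymptotic normality theorem tells us that
$$\sqrt{n}\left(\hat{\theta} - \theta^0\right) \overset{d}{\to} N\left[0, J_\infty(\theta^0)^{-1} I_\infty(\theta^0) J_\infty(\theta^0)^{-1}\right].$$
In the case of i.i.d. observations $I_\infty(\theta^0) = \lim_{n\to\infty} Var \sqrt{n}\, D_\theta s_n(\theta^0)$ is simply the expectation of a typical element of the outer product of the gradient.

There's no need to subtract the mean, since it's zero, following the f.o.c. in the consistency proof above and the fact that observations are i.i.d.

The terms in $n$ also drop out by the same argument:
$$\begin{aligned}
\lim_{n\to\infty} Var \sqrt{n}\, D_\theta s_n(\theta^0) &= \lim_{n\to\infty} Var \sqrt{n}\, D_\theta \frac{1}{n}\sum_t s(\theta^0) \\
&= \lim_{n\to\infty} Var \frac{1}{\sqrt{n}} D_\theta \sum_t s(\theta^0) \\
&= \lim_{n\to\infty} \frac{1}{n} Var \sum_t D_\theta s(\theta^0) \\
&= \lim_{n\to\infty} Var\, D_\theta s(\theta^0) \\
&= Var\, D_\theta s(\theta^0).
\end{aligned}$$
So we get
$$I_\infty(\theta^0) = E\left\{\frac{\partial}{\partial\theta}s(y, x, \theta^0)\frac{\partial}{\partial\theta'}s(y, x, \theta^0)\right\}.$$
Likewise,
$$J_\infty(\theta^0) = E\frac{\partial^2}{\partial\theta\,\partial\theta'}s(y, x, \theta^0).$$
Expectations are jointly over $y$ and $x$, or equivalently, first over $y$ conditional on $x$, then over $x$. From above, a typical element of the objective function is
$$s(y, x, \theta^0) = y \ln p(x, \theta^0) + (1 - y)\ln\left[1 - p(x, \theta^0)\right].$$
Now suppose that we are dealing with a correctly specified logit model:
$$p(x, \theta) = \left(1 + \exp(-x'\theta)\right)^{-1}.$$
We can simplify the above results in this case. We have that
$$\begin{aligned}
\frac{\partial}{\partial\theta}p(x, \theta) &= \left(1 + \exp(-x'\theta)\right)^{-2}\exp(-x'\theta)\,x \\
&= \left(1 + \exp(-x'\theta)\right)^{-1}\frac{\exp(-x'\theta)}{1 + \exp(-x'\theta)}\,x \\
&= p(x, \theta)\left(1 - p(x, \theta)\right)x \\
&= \left(p(x, \theta) - p(x, \theta)^2\right)x.
\end{aligned}$$
So
$$\frac{\partial}{\partial\theta}s(y, x, \theta^0) = \left[y - p(x, \theta^0)\right]x \qquad (14.4)$$
$$\frac{\partial^2}{\partial\theta\,\partial\theta'}s(\theta^0) = -\left[p(x, \theta^0) - p(x, \theta^0)^2\right]xx'.$$
Taking expectations over $y$ then $x$ gives
$$\begin{aligned}
I_\infty(\theta^0) &= \int E_Y\left[y^2 - 2y\,p(x, \theta^0) + p(x, \theta^0)^2\right]xx'\mu(x)\,dx \qquad (14.5)\\
&= \int \left[p(x, \theta^0) - p(x, \theta^0)^2\right]xx'\mu(x)\,dx, \qquad (14.6)
\end{aligned}$$
where we use the fact that $E_Y(y) = E_Y(y^2) = p(x, \theta^0)$. Likewise,
$$J_\infty(\theta^0) = -\int \left[p(x, \theta^0) - p(x, \theta^0)^2\right]xx'\mu(x)\,dx. \qquad (14.7)$$
Note that we arrive at the expected result: the information matrix equality holds (that is, $J_\infty(\theta^0) = -I_\infty(\theta^0)$). With this,
$$\sqrt{n}\left(\hat{\theta} - \theta^0\right) \overset{d}{\to} N\left[0, J_\infty(\theta^0)^{-1} I_\infty(\theta^0) J_\infty(\theta^0)^{-1}\right]$$
simplifies to
$$\sqrt{n}\left(\hat{\theta} - \theta^0\right) \overset{d}{\to} N\left[0, -J_\infty(\theta^0)^{-1}\right]$$
which can also be expressed as
$$\sqrt{n}\left(\hat{\theta} - \theta^0\right) \overset{d}{\to} N\left[0, I_\infty(\theta^0)^{-1}\right].$$
On a final note, the logit and standard normal CDF's are very similar - the logit distribution is a bit more fat-tailed. While coefficients will vary slightly between the two models, functions of interest such as estimated probabilities $p(x, \hat{\theta})$ will be virtually identical for the two models.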
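The information matrix equality for the logit model can be illustrated numerically. The following sketch is not from the original notes; it assumes a scalar regressor $x \sim N(0,1)$ and a true parameter $\theta^0 = 1$, and the function name `logit_information` is hypothetical. It averages the squared score $\left[(y - p)x\right]^2$ and the negative Hessian $p(1 - p)x^2$ over a large simulated sample, which estimate $I_\infty(\theta^0)$ and $-J_\infty(\theta^0)$ respectively.

```python
import math
import random

def logit_information(theta0=1.0, n=200_000, seed=2):
    """Return (average squared score, average negative Hessian) at theta0."""
    rng = random.Random(seed)
    opg = 0.0        # estimates I(theta0): outer product of the gradient
    neg_hess = 0.0   # estimates -J(theta0): minus the second derivative
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-x * theta0))    # logit response probability
        y = 1.0 if rng.random() < p else 0.0
        score = (y - p) * x                        # score of one observation (eq. 14.4)
        opg += score * score
        neg_hess += p * (1.0 - p) * x * x
    return opg / n, neg_hess / n

i_hat, neg_j_hat = logit_information()
print(i_hat, neg_j_hat)
```

The two averages should agree up to simulation noise when the model is correctly specified, as it is here by construction.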
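14.5.3 Example: Linearization of a nonlinear model

Ref. Gourieroux and Monfort, section 8.3.4. White, Intn'l Econ. Rev. 1980 is an earlier reference.

Suppose we have a nonlinear model
$$y_i = h(x_i, \theta^0) + \varepsilon_i$$
where
$$\varepsilon_i \sim iid(0, \sigma^2).$$
The nonlinear least squares estimator solves
$$\hat{\theta}_n = \arg\min \frac{1}{n}\sum_{i=1}^n \left(y_i - h(x_i, \theta)\right)^2.$$
We'll study this more later, but for now it is clear that the foc for minimization will require solving a set of nonlinear equations. A common approach to the problem seeks to avoid this difficulty by linearizing the model. A first order Taylor's series expansion about the point $x_0$ with remainder gives
$$y_i = h(x_0, \theta^0) + (x_i - x_0)'\frac{\partial h(x_0, \theta^0)}{\partial x} + \nu_i$$
where $\nu_i$ encompasses both $\varepsilon_i$ and the Taylor's series remainder. Note that $\nu_i$ is no longer a classical error - its mean is not zero. We should expect problems.

Define
$$\alpha^* = h(x_0, \theta^0) - x_0'\frac{\partial h(x_0, \theta^0)}{\partial x}$$
$$\beta^* = \frac{\partial h(x_0, \theta^0)}{\partial x}.$$
Given this, one might try to estimate $\alpha^*$ and $\beta^*$ by applying OLS to
$$y_i = \alpha + \beta x_i + \nu_i.$$
Question, will $\hat{\alpha}$ and $\hat{\beta}$ be consistent for $\alpha^*$ and $\beta^*$?

The answer is no, as one can see by interpreting $\hat{\alpha}$ and $\hat{\beta}$ as extremum estimators. Let $\gamma = (\alpha, \beta')'$.
$$\hat{\gamma} = \arg\min s_n(\gamma) = \frac{1}{n}\sum_{i=1}^n \left(y_i - \alpha - \beta x_i\right)^2.$$
The objective function converges to its expectation
$$s_n(\gamma) \overset{u.a.s.}{\to} s_\infty(\gamma) = E_X E_{Y|X}\left(y - \alpha - \beta x\right)^2$$
and $\hat{\gamma}$ converges a.s. to the $\gamma^0$ that minimizes $s_\infty(\gamma)$:
$$\gamma^0 = \arg\min E_X E_{Y|X}\left(y - \alpha - \beta x\right)^2.$$
Noting that
$$\begin{aligned}
E_X E_{Y|X}\left(y - \alpha - \beta x\right)^2 &= E_X E_{Y|X}\left(h(x, \theta^0) + \varepsilon - \alpha - \beta x\right)^2 \\
&= \sigma^2 + E_X\left(h(x, \theta^0) - \alpha - \beta x\right)^2
\end{aligned}$$
since cross products involving $\varepsilon$ drop out. $\alpha^0$ and $\beta^0$ correspond to the hyperplane that is closest to the true regression function $h(x, \theta^0)$ according to the mean squared error criterion. This depends on both the shape of $h(\cdot)$ and the density function of the conditioning variables.

[Figure: the true regression function plotted against $x$, with the tangent line at $x_0$ and the fitted line.]

It is clear that the tangent line does not minimize MSE, since, for example, if $h(x, \theta^0)$ is concave, all errors between the tangent line and the true function are negative.

Note that the true underlying parameter $\theta^0$ is not estimated consistently, either (it may be of a different dimension than the dimension of the parameter of the approximating model, which is 2 in this example).

Second order and higher-order approximations suffer from exactly the same problem, though to a less severe degree, of course. For this reason, translog, Generalized Leontiev and other "flexible functional forms" based upon second-order approximations in general suffer from bias and inconsistency. The bias may not be too important for analysis of conditional means, but it can be very important for analyzing first and second derivatives. In production and consumer analysis, first and second derivatives (e.g., elasticities of substitution) are often of interest, so in this case, one should be cautious of unthinking application of models that impose strong restrictions on second derivatives.

This sort of linearization about a long run equilibrium is a common practice in dynamic macroeconomic models. It is justified for the purposes of theoretical analysis of a model given the model's parameters, but it is not justifiable for the estimation of the parameters of the model using data. The section on simulation-based methods offers a means of obtaining consistent estimators of the parameters of dynamic macro models that are too complex for standard methods of analysis.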
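To make the inconsistency concrete, here is a small Python sketch under assumptions that are mine rather than the notes': the true regression function is taken to be $h(x) = \sqrt{x}$, $x$ is uniform on $(0,1)$, the expansion point is $x_0 = 0.5$, and the error standard deviation is 0.1. For this choice the MSE-minimizing line works out analytically to intercept $4/15$ and slope $0.8$, while the tangent line at $x_0$ has slope $1/(2\sqrt{0.5}) \approx 0.707$. The helper name `fit_line` is hypothetical.

```python
import math
import random

def fit_line(n=200_000, sigma=0.1, seed=3):
    """OLS of y = sqrt(x) + eps on a constant and x, with x ~ U(0,1)."""
    rng = random.Random(seed)
    x = [rng.uniform(0.0, 1.0) for _ in range(n)]
    y = [math.sqrt(xi) + rng.gauss(0.0, sigma) for xi in x]
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    beta = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
            / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - beta * x_bar, beta

x0 = 0.5
tangent_slope = 0.5 / math.sqrt(x0)                      # dh/dx evaluated at x0
tangent_intercept = math.sqrt(x0) - tangent_slope * x0   # alpha* of the text
alpha_hat, beta_hat = fit_line()
print("tangent line:", tangent_intercept, tangent_slope)
print("OLS fit:     ", alpha_hat, beta_hat)
```

The OLS coefficients settle near the best linear approximation, not near the tangent-line coefficients $\alpha^*$ and $\beta^*$, which is the point of the example.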
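14.6 Exercises

1. Suppose that $x_i \sim$ uniform(0,1), and $y_i = 1 - x_i^2 + \varepsilon_i$, where $\varepsilon_i$ is iid(0, $\sigma^2$). Suppose we estimate the misspecified model $y_i = \alpha + \beta x_i + \eta_i$ by OLS. Find the numeric values of $\alpha^0$ and $\beta^0$ that are the probability limits of $\hat{\alpha}$ and $\hat{\beta}$.
2. Verify your results using Octave by generating data that follows the above model, and calculating the OLS estimator. When the sample size is very large the estimator should be very close to the analytical results you obtained in question 1.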
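3. Use the asymptotic normality theorem to find the asymptotic distribution of the ML estimator of $\beta^0$ for the model $y = x\beta^0 + \varepsilon$, where $\varepsilon \sim N(0, 1)$ and is independent of $x$. The expressions may involve the unspecified density of $x$.
4. Assume a d.g.p. follows the logit model: $\Pr(y = 1|x) = \left(1 + \exp(-\beta^0 x)\right)^{-1}$. Assume that $x \sim$ uniform(-a,a). Find the asymptotic distribution of the ML estimator of $\beta^0$.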
Chapter 15
Generalized method of moments
(GMM)
Readings: Hamilton Ch. 14; Davidson and MacKinnon, Ch. 17 (see pg. 587 for refs. to applications); Newey and McFadden (1994), "Large Sample Estimation and Hypothesis Testing", in Handbook of Econometrics, Vol. 4, Ch. 36.
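15.1 Definition
We've already seen one example of GMM in the introduction, based upon the $\chi^2$ distribution. Consider the following example based upon the t-distribution. The density function of a t-distributed r.v. $Y_t$ is
$$f_{Y_t}(y_t, \theta^0) = \frac{\Gamma\left[(\theta^0 + 1)/2\right]}{(\pi\theta^0)^{1/2}\,\Gamma(\theta^0/2)}\left[1 + \left(y_t^2/\theta^0\right)\right]^{-(\theta^0 + 1)/2}.$$
Given an iid sample of size $n$, one could estimate $\theta^0$ by maximizing the log-likelihood function
$$\hat{\theta} \equiv \arg\max_\Theta \ln L_n(\theta) = \sum_{t=1}^n \ln f_{Y_t}(y_t, \theta).$$
This approach is attractive since ML estimators are asymptotically efficient. This is because the ML estimator uses all of the available information (e.g., the distribution is fully specified up to a parameter). Recalling that a distribution is completely characterized by its moments, the ML estimator is interpretable as a GMM estimator that uses all of the moments. The method of moments estimator uses only $K$ moments to estimate a $K$-dimensional parameter. Since information is discarded, in general, by the MM estimator, efficiency is lost relative to the ML estimator.

Continuing with the example, a t-distributed r.v. with density $f_{Y_t}(y_t, \theta^0)$ has mean zero and variance $V(y_t) = \theta^0/\left(\theta^0 - 2\right)$ (for $\theta^0 > 2$).

Using the notation introduced previously, define a moment condition $m_{1t}(\theta) = \theta/(\theta - 2) - y_t^2$ and $m_1(\theta) = 1/n\sum_{t=1}^n m_{1t}(\theta) = \theta/(\theta - 2) - 1/n\sum_{t=1}^n y_t^2$. As before, when evaluated at the true parameter value $\theta^0$, both $E_{\theta^0}\left[m_{1t}(\theta^0)\right] = 0$ and $E_{\theta^0}\left[m_1(\theta^0)\right] = 0$.

Choosing $\hat{\theta}$ to set $m_1(\hat{\theta}) \equiv 0$ yields a MM estimator:
$$\hat{\theta} = \frac{2}{1 - \frac{n}{\sum_i y_i^2}} \qquad (15.1)$$
This estimator is based on only one moment of the distribution - it uses less information than the ML estimator, so it is intuitively clear that the MM estimator will be inefficient relative to the ML estimator.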
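The MM estimator of equation 15.1 is simple enough to try out directly. The sketch below is an illustration rather than part of the original notes: it assumes a true value $\theta^0 = 8$, generates t-distributed draws as a standard normal divided by the square root of an independent chi-square over its degrees of freedom, and uses hypothetical helper names `draw_t` and `mm_estimate`.

```python
import math
import random

def draw_t(nu, rng):
    """One Student-t draw with nu degrees of freedom: Z / sqrt(chi2/nu)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(nu / 2.0, 2.0)   # chi-square(nu) = Gamma(shape nu/2, scale 2)
    return z / math.sqrt(chi2 / nu)

def mm_estimate(y):
    """Solve theta/(theta-2) = mean(y^2): theta = 2 / (1 - n / sum(y^2))."""
    n = len(y)
    return 2.0 / (1.0 - n / sum(v * v for v in y))

rng = random.Random(4)
theta0 = 8.0                                 # assumed true degrees of freedom
sample = [draw_t(theta0, rng) for _ in range(200_000)]
theta_mm = mm_estimate(sample)
print(theta_mm)
```

Note that the formula only gives a sensible answer when the sample second moment exceeds one; in small samples from a t-distribution with many degrees of freedom this can fail.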
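An alternative MM estimator could be based upon the fourth moment of the t-distribution. The fourth moment of a t-distributed r.v. is
$$\mu_4 \equiv E(y_t^4) = \frac{3\left(\theta^0\right)^2}{\left(\theta^0 - 2\right)\left(\theta^0 - 4\right)},$$
provided $\theta^0 > 4$. We can define a second moment condition
$$m_2(\theta) = \frac{3(\theta)^2}{(\theta - 2)(\theta - 4)} - \frac{1}{n}\sum_{t=1}^n y_t^4.$$
A second, different MM estimator chooses $\hat{\theta}$ to set $m_2(\hat{\theta}) \equiv 0$. If you solve this you'll see that the estimate is different from that in equation 15.1.

This estimator isn't efficient either, since it uses only one moment. A GMM estimator would use the two moment conditions together to estimate the single parameter. The GMM estimator is overidentified, which leads to an estimator which is efficient relative to the just identified MM estimators (more on efficiency later).

As before, set $m_n(\theta) = \left(m_1(\theta), m_2(\theta)\right)'$. The $n$ subscript is used to indicate the sample size. Note that $m(\theta^0) = O_p(n^{-1/2})$, since it is an average of centered random variables, whereas $m(\theta) = O_p(1)$, $\theta \ne \theta^0$, where expectations are taken using the true distribution with parameter $\theta^0$. This is the fundamental reason that GMM is consistent.

A GMM estimator requires defining a measure of distance, $d\left(m(\theta)\right)$. A popular choice (for reasons noted below) is to set $d\left(m(\theta)\right) = m'W_n m$, and we minimize $s_n(\theta) = m(\theta)'W_n m(\theta)$. We assume $W_n$ converges to a finite positive definite matrix.

In general, assume we have $g$ moment conditions, so $m(\theta)$ is a $g$-vector and $W$ is a $g \times g$ matrix.

For the purposes of this course, the following definition of the GMM estimator is sufficiently general:

Definition 24 The GMM estimator of the $K$-dimensional parameter vector $\theta^0$ is $\hat{\theta} \equiv \arg\min_\Theta s_n(\theta) \equiv m_n(\theta)'W_n m_n(\theta)$, where $m_n(\theta) = \frac{1}{n}\sum_{t=1}^n m_t(\theta)$ is a $g$-vector, $g \ge K$, with $E_\theta m(\theta) = 0$, and $W_n$ converges almost surely to a finite $g \times g$ symmetric positive definite matrix $W_\infty$.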
What's the reason for using GMM if MLE is asymptotically efficient?

Robustness: GMM is based upon a limited set of moment conditions. For consistency, only these moment conditions need to be correctly specified, whereas MLE in effect requires correct specification of every conceivable moment condition. GMM is robust with respect to distributional misspecification. The price for robustness is loss of efficiency with respect to the MLE estimator. Keep in mind that the true distribution is not known so if we erroneously specify a distribution and estimate by MLE, the estimator will be inconsistent in general (not always).

Feasibility: in some cases the MLE estimator is not available, because we are not able to deduce the likelihood function. More on this in the section on simulation-based estimation. The GMM estimator may still be feasible even though MLE is not possible.
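15.2 Consistency
We simply assume that the assumptions of Theorem 19 hold, so the GMM estimator is strongly consistent. The only assumption that warrants additional comments is that of identification. In Theorem 19, the third assumption reads: Identification: $s_\infty(\cdot)$ has a unique global maximum at $\theta^0$, i.e., $s_\infty(\theta^0) > s_\infty(\theta)$, $\forall \theta \ne \theta^0$. Since the GMM objective is minimized, the condition applies with the sign of the objective reversed. Taking the case of a quadratic objective function $s_n(\theta) = m_n(\theta)'W_n m_n(\theta)$, first consider $m_n(\theta)$.

Applying a uniform law of large numbers, we get $m_n(\theta) \overset{a.s.}{\to} m_\infty(\theta)$.

Since $E_{\theta^0} m_n(\theta^0) = 0$ by assumption, $m_\infty(\theta^0) = 0$.

Since $s_\infty(\theta^0) = m_\infty(\theta^0)'W_\infty m_\infty(\theta^0) = 0$, in order for asymptotic identification, we need that $m_\infty(\theta) \ne 0$ for $\theta \ne \theta^0$, for at least some element of the vector. This and the assumption that $W_n \overset{a.s.}{\to} W_\infty$, a finite positive definite $g \times g$ matrix, guarantee that $\theta^0$ is asymptotically identified.

Note that asymptotic identification does not rule out the possibility of lack of identification for a given data set - there may be multiple minimizing solutions in finite samples.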
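15.3 Asymptotic normality

We also simply assume that the conditions of Theorem 22 hold, so we will have asymptotic normality. However, we do need to find the structure of the asymptotic variance-covariance matrix of the estimator. From Theorem 22, we have
$$\sqrt{n}\left(\hat{\theta} - \theta^0\right) \overset{d}{\to} N\left[0, J_\infty(\theta^0)^{-1} I_\infty(\theta^0) J_\infty(\theta^0)^{-1}\right]$$
where $J_\infty(\theta^0)$ is the almost sure limit of $\frac{\partial^2}{\partial\theta\,\partial\theta'}s_n(\theta)$ and $I_\infty(\theta^0) = \lim_{n\to\infty} Var \sqrt{n}\,\frac{\partial}{\partial\theta}s_n(\theta^0)$. We need to determine the form of these matrices given the objective function $s_n(\theta) = m_n(\theta)'W_n m_n(\theta)$.

Now using the product rule from the introduction,
$$\frac{\partial}{\partial\theta}s_n(\theta) = 2\left[\frac{\partial}{\partial\theta}m_n'(\theta)\right]W_n m_n(\theta).$$
Define the $K \times g$ matrix
$$D_n(\theta) \equiv \frac{\partial}{\partial\theta}m_n'(\theta),$$
so:
$$\frac{\partial}{\partial\theta}s(\theta) = 2D(\theta)W m(\theta). \qquad (15.2)$$
(Note that $s_n(\theta)$, $D_n(\theta)$, $W_n$ and $m_n(\theta)$ all depend on the sample size $n$, but it is omitted to unclutter the notation).

To take second derivatives, let $D_i$ be the $i$th row of $D(\theta)$. Using the product rule,
$$\frac{\partial^2}{\partial\theta'\,\partial\theta_i}s(\theta) = \frac{\partial}{\partial\theta'}2D_i(\theta)W_n m(\theta) = 2D_i W D' + 2m'W\left[\frac{\partial}{\partial\theta'}D_i'\right].$$
When evaluating the term
$$2m(\theta)'W\left[\frac{\partial}{\partial\theta'}D(\theta)_i'\right]$$
at $\theta^0$, assume that $\frac{\partial}{\partial\theta'}D(\theta)_i'$ satisfies a LLN, so that it converges almost surely to a finite limit. In this case, we have
$$2m(\theta^0)'W\left[\frac{\partial}{\partial\theta'}D(\theta^0)_i'\right] \overset{a.s.}{\to} 0,$$
since $m(\theta^0) = o_p(1)$, $W \overset{a.s.}{\to} W_\infty$.

Stacking these results over the $K$ rows of $D$, we get
$$\lim \frac{\partial^2}{\partial\theta\,\partial\theta'}s_n(\theta^0) = J_\infty(\theta^0) = 2D_\infty W_\infty D_\infty', \quad a.s.,$$
where we define $\lim D = D_\infty$, a.s., and $\lim W = W_\infty$, a.s. (we assume a LLN holds).

With regard to $I_\infty(\theta^0)$, following equation 15.2, and noting that the scores have mean zero at $\theta^0$ (since $E m(\theta^0) = 0$ by assumption), we have
$$\begin{aligned}
I_\infty(\theta^0) &= \lim_{n\to\infty} Var \sqrt{n}\,\frac{\partial}{\partial\theta}s_n(\theta^0) \\
&= \lim_{n\to\infty} E\, 4n\, D W m(\theta^0) m(\theta^0)' W D' \\
&= \lim_{n\to\infty} E\, 4 D W \left\{\sqrt{n}\, m(\theta^0)\right\}\left\{\sqrt{n}\, m(\theta^0)'\right\} W D'.
\end{aligned}$$
Now, given that $m(\theta^0)$ is an average of centered (mean-zero) quantities, it is reasonable to expect a CLT to apply, after multiplication by $\sqrt{n}$. Assuming this,
$$\sqrt{n}\, m(\theta^0) \overset{d}{\to} N(0, \Omega_\infty),$$
where
$$\Omega_\infty = \lim_{n\to\infty} E\left[n\, m(\theta^0) m(\theta^0)'\right].$$
Using this, and the last equation, we get
$$I_\infty(\theta^0) = 4D_\infty W_\infty \Omega_\infty W_\infty D_\infty'.$$
Using these results, the asymptotic normality theorem gives us
$$\sqrt{n}\left(\hat{\theta} - \theta^0\right) \overset{d}{\to} N\left[0, \left(D_\infty W_\infty D_\infty'\right)^{-1} D_\infty W_\infty \Omega_\infty W_\infty D_\infty' \left(D_\infty W_\infty D_\infty'\right)^{-1}\right],$$
the asymptotic distribution of the GMM estimator for arbitrary weighting matrix $W_n$. Note that for $J_\infty$ to be positive definite, $D_\infty$ must have full row rank, $\rho(D_\infty) = K$.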
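15.4 Choosing the weighting matrix

$W$ is a weighting matrix, which determines the relative importance of violations of the individual moment conditions. For example, if we are much more sure of the first moment condition, which is based upon the variance, than of the second, which is based upon the fourth moment, we could set
$$W = \begin{bmatrix} a & 0 \\ 0 & b \end{bmatrix}$$
with $a$ much larger than $b$. In this case, errors in the second moment condition have less weight in the objective function.

Since moments are not independent, in general, we should expect that there be a correlation between the moment conditions, so it may not be desirable to set the off-diagonal elements to 0. $W$ may be a random, data dependent matrix.

We have already seen that the choice of $W$ will influence the asymptotic distribution of the GMM estimator. Since the GMM estimator is already inefficient w.r.t. MLE, we might like to choose the $W$ matrix to make the GMM estimator efficient within the class of GMM estimators defined by $m_n(\theta)$.

To provide a little intuition, consider the linear model $y = x'\beta + \varepsilon$, where $\varepsilon \sim N(0, \Omega)$. That is, we have heteroscedasticity and autocorrelation.

Let $P$ be the Cholesky factorization of $\Omega^{-1}$, e.g, $P'P = \Omega^{-1}$.

Then the model $Py = PX\beta + P\varepsilon$ satisfies the classical assumptions of homoscedasticity and nonautocorrelation, since $V(P\varepsilon) = PV(\varepsilon)P' = P\Omega P' = P(P'P)^{-1}P' = PP^{-1}(P')^{-1}P' = I_n$. (Note: we use $(AB)^{-1} = B^{-1}A^{-1}$ for $A$, $B$ both nonsingular). This means that OLS applied to the transformed model is efficient.

The OLS estimator of the model $Py = PX\beta + P\varepsilon$ minimizes the objective function $(y - X\beta)'\Omega^{-1}(y - X\beta)$. Interpreting $(y - X\beta) = \varepsilon(\beta)$ as moment conditions (note that they do have zero expectation when evaluated at $\beta^0$), the optimal weighting matrix is seen to be the inverse of the covariance matrix of the moment conditions. This result carries over to GMM estimation. (Note: this presentation of GLS is not a GMM estimator, because the number of moment conditions here is equal to the sample size, $n$. Later we'll see that GLS can be put into the GMM framework defined above).

Theorem 25 If $\hat{\theta}$ is a GMM estimator that minimizes $m_n(\theta)'W_n m_n(\theta)$, the asymptotic variance of $\hat{\theta}$ will be minimized by choosing $W_n$ so that $W_n \overset{a.s.}{\to} W_\infty = \Omega_\infty^{-1}$, where $\Omega_\infty = \lim_{n\to\infty} E\left[n\, m(\theta^0) m(\theta^0)'\right]$.

Proof: For $W_\infty = \Omega_\infty^{-1}$, the asymptotic variance
$$\left(D_\infty W_\infty D_\infty'\right)^{-1} D_\infty W_\infty \Omega_\infty W_\infty D_\infty' \left(D_\infty W_\infty D_\infty'\right)^{-1}$$
simplifies to $\left(D_\infty \Omega_\infty^{-1} D_\infty'\right)^{-1}$. Now, for any choice such that $W_\infty \ne \Omega_\infty^{-1}$, consider the difference of the inverses of the variances when $W = \Omega^{-1}$ versus when $W$ is some arbitrary positive definite matrix:
$$\begin{aligned}
& D_\infty \Omega_\infty^{-1} D_\infty' - \left(D_\infty W_\infty D_\infty'\right)\left[D_\infty W_\infty \Omega_\infty W_\infty D_\infty'\right]^{-1}\left(D_\infty W_\infty D_\infty'\right) \\
&\quad = D_\infty \Omega_\infty^{-1/2}\left[I - \Omega_\infty^{1/2} W_\infty D_\infty'\left[D_\infty W_\infty \Omega_\infty W_\infty D_\infty'\right]^{-1} D_\infty W_\infty \Omega_\infty^{1/2}\right]\Omega_\infty^{-1/2} D_\infty'
\end{aligned}$$
as can be verified by multiplication. The term in brackets is idempotent, which is also easy to check by multiplication, and is therefore positive semidefinite. A quadratic form in a positive semidefinite matrix is also positive semidefinite. The difference of the inverses of the variances is positive semidefinite, which implies that the difference of the variances is negative semidefinite, which proves the theorem.

The result
$$\sqrt{n}\left(\hat{\theta} - \theta^0\right) \overset{d}{\to} N\left[0, \left(D_\infty \Omega_\infty^{-1} D_\infty'\right)^{-1}\right] \qquad (15.3)$$
allows us to treat
$$\hat{\theta} \approx N\left(\theta^0, \frac{\left(D_\infty \Omega_\infty^{-1} D_\infty'\right)^{-1}}{n}\right),$$
where the $\approx$ means "approximately distributed as." To operationalize this we need estimators of $D_\infty$ and $\Omega_\infty$.

The obvious estimator of $D_\infty$ is simply $\frac{\partial}{\partial\theta}m_n'(\hat{\theta})$, which is consistent by the consistency of $\hat{\theta}$, assuming that $\frac{\partial}{\partial\theta}m_n'$ is continuous in $\theta$.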
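15.5 Estimation of the variance-covariance matrix

In the case that we wish to use the optimal weighting matrix, we need an estimate of $\Omega_\infty$, the limiting variance-covariance matrix of $\sqrt{n}\, m_n(\theta^0)$. While one could estimate $\Omega_\infty$ parametrically, we in general have little information upon which to base a parametric specification. In general, we expect that:

$m_t$ will be autocorrelated ($\Gamma_{ts} = E(m_t m_{t-s}') \ne 0$). Note that this autocovariance will not depend on $t$ if the moment conditions are covariance stationary.

The moment conditions will be contemporaneously correlated, since the individual moment conditions will not in general be independent of one another ($E(m_{it} m_{jt}) \ne 0$).

They will have different variances ($E(m_{it}^2) = \sigma_{it}^2$).

Since we need to estimate so many components if we are to take the parametric approach, it is unlikely that we would arrive at a correct parametric specification. For this reason, research has focused on consistent nonparametric estimators of $\Omega_\infty$.

Henceforth we assume that $m_t$ is covariance stationary (the covariance between $m_t$ and $m_{t-s}$ does not depend on $t$). Define the $v$th autocovariance of the moment conditions $\Gamma_v = E(m_t m_{t-v}')$. Note that $E(m_t m_{t+v}') = \Gamma_v'$. Recall that $m_t$ and $m_n$ are functions of $\theta$, so for now assume that we have some consistent estimator of $\theta^0$, so that $\hat{m}_t = m_t(\hat{\theta})$. Now
$$\begin{aligned}
\Omega_n = E\left[n\, m(\theta^0) m(\theta^0)'\right] &= E\left[n\left(\frac{1}{n}\sum_{t=1}^n m_t\right)\left(\frac{1}{n}\sum_{t=1}^n m_t'\right)\right] \\
&= E\left[\frac{1}{n}\left(\sum_{t=1}^n m_t\right)\left(\sum_{t=1}^n m_t'\right)\right] \\
&= \Gamma_0 + \frac{n-1}{n}\left(\Gamma_1 + \Gamma_1'\right) + \frac{n-2}{n}\left(\Gamma_2 + \Gamma_2'\right) + \cdots + \frac{1}{n}\left(\Gamma_{n-1} + \Gamma_{n-1}'\right).
\end{aligned}$$
A natural, consistent estimator of $\Gamma_v$ is
$$\widehat{\Gamma}_v = \frac{1}{n}\sum_{t=v+1}^n \hat{m}_t \hat{m}_{t-v}'.$$
(you might use $n - v$ in the denominator instead). So, a natural, but inconsistent, estimator of $\Omega_\infty$ would be
$$\begin{aligned}
\hat{\Omega} &= \widehat{\Gamma}_0 + \frac{n-1}{n}\left(\widehat{\Gamma}_1 + \widehat{\Gamma}_1'\right) + \frac{n-2}{n}\left(\widehat{\Gamma}_2 + \widehat{\Gamma}_2'\right) + \cdots + \frac{1}{n}\left(\widehat{\Gamma}_{n-1} + \widehat{\Gamma}_{n-1}'\right) \\
&= \widehat{\Gamma}_0 + \sum_{v=1}^{n-1}\frac{n-v}{n}\left(\widehat{\Gamma}_v + \widehat{\Gamma}_v'\right).
\end{aligned}$$
This estimator is inconsistent in general, since the number of parameters to estimate is more than the number of observations, and increases more rapidly than $n$, so information does not build up as $n \to \infty$.

On the other hand, supposing that $\Gamma_v$ tends to zero sufficiently rapidly as $v$ tends to $\infty$, a modified estimator
$$\hat{\Omega} = \widehat{\Gamma}_0 + \sum_{v=1}^{q(n)}\left(\widehat{\Gamma}_v + \widehat{\Gamma}_v'\right),$$
where $q(n) \to \infty$ as $n \to \infty$, will be consistent, provided $q(n)$ grows sufficiently slowly. The term $\frac{n-v}{n}$ can be dropped because $q(n)$ must be $o_p(n)$. This allows information to accumulate at a rate that satisfies a LLN. A disadvantage of this estimator is that it may not be positive definite. This could cause one to calculate a negative $\chi^2$ statistic, for example!

Note: the formula for $\hat{\Omega}$ requires an estimate of $m(\theta^0)$, which in turn requires an estimate of $\theta$, which is based upon an estimate of $\Omega$! The solution to this circularity is to set the weighting matrix $W$ arbitrarily (for example to an identity matrix), obtain a first consistent but inefficient estimate of $\theta^0$, then use this estimate to form $\hat{\Omega}$, then re-estimate $\theta^0$. The process can be iterated until neither $\hat{\Omega}$ nor $\hat{\theta}$ change appreciably between iterations.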
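15.5.1 Newey-West covariance estimator

The Newey-West estimator (Econometrica, 1987) solves the problem of possible nonpositive definiteness of the above estimator. Their estimator is
$$\hat{\Omega} = \widehat{\Gamma}_0 + \sum_{v=1}^{q(n)}\left[1 - \frac{v}{q+1}\right]\left(\widehat{\Gamma}_v + \widehat{\Gamma}_v'\right).$$
This estimator is p.d. by construction. The condition for consistency is that $n^{-1/4}q \to 0$. Note that this is a very slow rate of growth for $q$. This estimator is nonparametric - we've placed no parametric restrictions on the form of $\Omega$. It is an example of a kernel estimator.

In a more recent paper, Newey and West (Review of Economic Studies, 1994) use pre-whitening before applying the kernel estimator. The idea is to fit a VAR model to the moment conditions. It is expected that the residuals of the VAR model will be more nearly white noise, so that the Newey-West covariance estimator might perform better with short lag lengths.

The VAR model is
$$\hat{m}_t = \Theta_1 \hat{m}_{t-1} + \cdots + \Theta_p \hat{m}_{t-p} + u_t.$$
This is estimated, giving the residuals $\hat{u}_t$. Then the Newey-West covariance estimator is applied to these pre-whitened residuals, and the covariance $\Omega$ is estimated combining the fitted VAR
$$\hat{m}_t = \widehat{\Theta}_1 \hat{m}_{t-1} + \cdots + \widehat{\Theta}_p \hat{m}_{t-p}$$
with the kernel estimate of the covariance of the $u_t$.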
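For a single (scalar) moment condition the Newey-West formula reduces to a weighted sum of sample autocovariances, which the following Python sketch spells out. It is an illustration only: the function names `autocov` and `newey_west` and the toy series are assumptions, and the pre-whitening step is not included. In the scalar case $\widehat{\Gamma}_v + \widehat{\Gamma}_v' = 2\widehat{\Gamma}_v$.

```python
def autocov(m, v):
    """Gamma_hat_v = (1/n) * sum_{t=v+1}^{n} m_t * m_{t-v} for a scalar series."""
    n = len(m)
    return sum(m[t] * m[t - v] for t in range(v, n)) / n

def newey_west(m, q):
    """Gamma_0 + sum_{v=1}^{q} (1 - v/(q+1)) * 2 * Gamma_v (Bartlett weights)."""
    omega = autocov(m, 0)
    for v in range(1, q + 1):
        weight = 1.0 - v / (q + 1.0)
        omega += weight * 2.0 * autocov(m, v)   # scalar case: Gamma_v + Gamma_v' = 2*Gamma_v
    return omega

m_hat = [1.0, -1.0, 1.0, -1.0]                  # toy, strongly negatively autocorrelated series
print(newey_west(m_hat, 0))                     # 1.0
print(newey_west(m_hat, 1))                     # 0.25
```

With `q = 0` the estimate is just the sample second moment; adding one lag pulls it down here because the toy series alternates in sign. The declining weights are what keep the estimate nonnegative.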
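15.6 Estimation using conditional moments

So far, the moment conditions have been presented as unconditional expectations. One common way of defining unconditional moment conditions is based upon conditional moment conditions.

Suppose that a random variable $Y$ has zero expectation conditional on the random variable $X$
$$E_{Y|X}Y = \int Y f(Y|X)\,dY = 0.$$
Then the unconditional expectation of the product of $Y$ and a function $g(X)$ of $X$ is also zero. The unconditional expectation is
$$E\,Y g(X) = \int_{\mathcal{X}}\left(\int_{\mathcal{Y}} Y g(X) f(Y, X)\,dY\right)dX.$$
This can be factored into a conditional expectation and an expectation w.r.t. the marginal density of $X$:
$$E\,Y g(X) = \int_{\mathcal{X}}\left(\int_{\mathcal{Y}} Y g(X) f(Y|X)\,dY\right)f(X)\,dX.$$
Since $g(X)$ doesn't depend on $Y$ it can be pulled out of the integral
$$E\,Y g(X) = \int_{\mathcal{X}}\left(\int_{\mathcal{Y}} Y f(Y|X)\,dY\right)g(X) f(X)\,dX.$$
But the term in parentheses on the rhs is zero by assumption, so
$$E\,Y g(X) = 0$$
as claimed.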
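This is important econometrically, since models often imply restrictions on conditional moments. Suppose a model tells us that the function $K(y_t, x_t)$ has expectation, conditional on the information set $I_t$, equal to $k(x_t, \theta)$,
$$E_\theta K(y_t, x_t)|I_t = k(x_t, \theta).$$
For example, in the context of the classical linear model $y_t = x_t'\beta + \varepsilon_t$, we can set $K(y_t, x_t) = y_t$ so that $k(x_t, \theta) = x_t'\beta$.

With this, the function
$$h_t(\theta) = K(y_t, x_t) - k(x_t, \theta)$$
has conditional expectation equal to zero
$$E_\theta h_t(\theta)|I_t = 0.$$
This is a scalar moment condition, which isn't sufficient to identify a $K$-dimensional parameter $\theta$ ($K > 1$). However, the above result allows us to form various unconditional expectations
$$m_t(\theta) = Z(w_t)h_t(\theta)$$
where $Z(w_t)$ is a $g \times 1$-vector valued function of $w_t$ and $w_t$ is a set of variables drawn from the information set $I_t$. The $Z(w_t)$ are instrumental variables. We now have $g$ moment conditions, so as long as $g > K$ the necessary condition for identification holds.

One can form the $n \times g$ matrix
$$Z_n = \begin{bmatrix}
Z_1(w_1) & Z_2(w_1) & \cdots & Z_g(w_1) \\
Z_1(w_2) & Z_2(w_2) & \cdots & Z_g(w_2) \\
\vdots & & & \vdots \\
Z_1(w_n) & Z_2(w_n) & \cdots & Z_g(w_n)
\end{bmatrix} = \begin{bmatrix} Z_1' \\ Z_2' \\ \vdots \\ Z_n' \end{bmatrix}.$$
With this we can form the $g$ moment conditions
$$m_n(\theta) = \frac{1}{n}Z_n'\begin{bmatrix} h_1(\theta) \\ h_2(\theta) \\ \vdots \\ h_n(\theta) \end{bmatrix} = \frac{1}{n}Z_n'h_n(\theta) = \frac{1}{n}\sum_{t=1}^n Z_t h_t(\theta) = \frac{1}{n}\sum_{t=1}^n m_t(\theta)$$
where $Z_{(t,\cdot)}$ is the $t$th row of $Z_n$. This fits the previous treatment. An interesting question that arises is how one should choose the instrumental variables $Z(w_t)$ to achieve maximum efficiency.

Note that with this choice of moment conditions, we have that $D_n \equiv \frac{\partial}{\partial\theta}m'(\theta)$ (a $K \times g$ matrix) is
$$D_n(\theta) = \frac{\partial}{\partial\theta}\frac{1}{n}\left(Z_n'h_n(\theta)\right)' = \frac{1}{n}\left(\frac{\partial}{\partial\theta}h_n'(\theta)\right)Z_n$$
which we can define to be
$$D_n(\theta) = \frac{1}{n}H_n Z_n,$$
where $H_n$ is a $K \times n$ matrix that has the derivatives of the individual moment conditions as its columns. Likewise, define the var-cov. of the moment conditions
$$\begin{aligned}
\Omega_n &= E\left[n\, m_n(\theta^0) m_n(\theta^0)'\right] \\
&= E\left[\frac{1}{n}Z_n'h_n(\theta^0)h_n(\theta^0)'Z_n\right] \\
&= Z_n'E\left(\frac{1}{n}h_n(\theta^0)h_n(\theta^0)'\right)Z_n \\
&\equiv Z_n'\frac{\Phi_n}{n}Z_n
\end{aligned}$$
where we have defined $\Phi_n = Var\, h_n(\theta^0)$. Note that the dimension of this matrix is growing with the sample size, so it is not consistently estimable without additional assumptions.

The asymptotic normality theorem above says that the GMM estimator using the optimal weighting matrix is distributed as
$$\sqrt{n}\left(\hat{\theta} - \theta^0\right) \overset{d}{\to} N(0, V_\infty)$$
where
$$V_\infty = \lim_{n\to\infty}\left(\left(\frac{H_n Z_n}{n}\right)\left(\frac{Z_n'\Phi_n Z_n}{n}\right)^{-1}\left(\frac{Z_n'H_n'}{n}\right)\right)^{-1}. \qquad (15.4)$$
Using an argument similar to that used to prove that $\Omega_\infty^{-1}$ is the efficient weighting matrix, we can show that putting
$$Z_n = \Phi_n^{-1}H_n'$$
causes the above var-cov matrix to simplify to
$$V_\infty = \lim_{n\to\infty}\left(\frac{H_n\Phi_n^{-1}H_n'}{n}\right)^{-1}, \qquad (15.5)$$
and furthermore, this matrix is smaller than the limiting var-cov for any other choice of instrumental variables. (To prove this, examine the difference of the inverses of the var-cov matrices with the optimal instruments and with non-optimal instruments. As above, you can show that the difference is positive semi-definite).

Note that both $H_n$, which we should write more properly as $H_n(\theta^0)$, since it depends on $\theta^0$, and $\Phi$ must be consistently estimated to apply this.

Usually, estimation of $H_n$ is straightforward - one just uses
$$\widehat{H} = \frac{\partial}{\partial\theta}h_n'(\tilde{\theta}),$$
where $\tilde{\theta}$ is some initial consistent estimator based on non-optimal instruments.

Estimation of $\Phi_n$ may not be possible. It is an $n \times n$ matrix, so it has more unique elements than $n$, the sample size, so without restrictions on the parameters it can't be estimated consistently. Basically, you need to provide a parametric specification of the covariances of the $h_t(\theta)$ in order to be able to use optimal instruments. A solution is to approximate this matrix parametrically to define the instruments. Note that the simplified var-cov matrix in equation 15.5 will not apply if approximately optimal instruments are used - it will be necessary to use an estimator based upon equation 15.4, where the term $\frac{Z_n'\Phi_n Z_n}{n}$ must be estimated consistently apart, for example by the Newey-West procedure.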
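15.8 A specification test

The first order conditions for minimization, using an estimate of the optimal weighting matrix, are
$$\frac{\partial}{\partial\theta}s(\hat{\theta}) = 2\left[\frac{\partial}{\partial\theta}m_n'(\hat{\theta})\right]\hat{\Omega}^{-1}m_n(\hat{\theta}) \equiv 0$$
or
$$D(\hat{\theta})\hat{\Omega}^{-1}m_n(\hat{\theta}) \equiv 0.$$
Consider a Taylor expansion of $m(\hat{\theta})$:
$$m(\hat{\theta}) = m_n(\theta^0) + D_n'(\theta^0)\left(\hat{\theta} - \theta^0\right) + o_p(1). \qquad (15.6)$$
Multiplying by $D(\hat{\theta})\hat{\Omega}^{-1}$ we obtain
$$D(\hat{\theta})\hat{\Omega}^{-1}m(\hat{\theta}) = D(\hat{\theta})\hat{\Omega}^{-1}m_n(\theta^0) + D(\hat{\theta})\hat{\Omega}^{-1}D(\theta^0)'\left(\hat{\theta} - \theta^0\right) + o_p(1).$$
The lhs is zero, and since $\hat{\theta}$ tends to $\theta^0$ and $\hat{\Omega}$ tends to $\Omega_\infty$, we can write
$$D_\infty\Omega_\infty^{-1}m_n(\theta^0) \overset{a}{=} -D_\infty\Omega_\infty^{-1}D_\infty'\left(\hat{\theta} - \theta^0\right)$$
or
$$\sqrt{n}\left(\hat{\theta} - \theta^0\right) \overset{a}{=} -\sqrt{n}\left(D_\infty\Omega_\infty^{-1}D_\infty'\right)^{-1}D_\infty\Omega_\infty^{-1}m_n(\theta^0).$$
With this, and taking into account the original expansion (equation 15.6), we get
$$\sqrt{n}\,m(\hat{\theta}) \overset{a}{=} \sqrt{n}\,m_n(\theta^0) - \sqrt{n}\,D_\infty'\left(D_\infty\Omega_\infty^{-1}D_\infty'\right)^{-1}D_\infty\Omega_\infty^{-1}m_n(\theta^0).$$
This last can be written as
$$\sqrt{n}\,m(\hat{\theta}) \overset{a}{=} \sqrt{n}\left(\Omega_\infty^{1/2} - D_\infty'\left(D_\infty\Omega_\infty^{-1}D_\infty'\right)^{-1}D_\infty\Omega_\infty^{-1/2}\right)\Omega_\infty^{-1/2}m_n(\theta^0).$$
Or
$$\sqrt{n}\,\Omega_\infty^{-1/2}m(\hat{\theta}) \overset{a}{=} \sqrt{n}\left(I_g - \Omega_\infty^{-1/2}D_\infty'\left(D_\infty\Omega_\infty^{-1}D_\infty'\right)^{-1}D_\infty\Omega_\infty^{-1/2}\right)\Omega_\infty^{-1/2}m_n(\theta^0).$$
Now
$$\sqrt{n}\,\Omega_\infty^{-1/2}m_n(\theta^0) \overset{d}{\to} N(0, I_g)$$
and one can easily verify that
$$P = I_g - \Omega_\infty^{-1/2}D_\infty'\left(D_\infty\Omega_\infty^{-1}D_\infty'\right)^{-1}D_\infty\Omega_\infty^{-1/2}$$
is idempotent of rank $g - K$ (recall that the rank of an idempotent matrix is equal to its trace) so
$$\left(\sqrt{n}\,\Omega_\infty^{-1/2}m(\hat{\theta})\right)'\left(\sqrt{n}\,\Omega_\infty^{-1/2}m(\hat{\theta})\right) = n\,m(\hat{\theta})'\Omega_\infty^{-1}m(\hat{\theta}) \overset{d}{\to} \chi^2(g - K).$$
Since $\hat{\Omega}$ converges to $\Omega_\infty$, we also have
$$n\,m(\hat{\theta})'\hat{\Omega}^{-1}m(\hat{\theta}) \overset{d}{\to} \chi^2(g - K)$$
or
$$n \cdot s_n(\hat{\theta}) \overset{d}{\to} \chi^2(g - K)$$
supposing the model is correctly specified. This is a convenient test since we just multiply the optimized value of the objective function by $n$, and compare with a $\chi^2(g - K)$ critical value. The test is a general test of whether or not the moments used to estimate are correctly specified.

This won't work when the estimator is just identified. The f.o.c. are
$$D_\theta s_n(\theta) = D\hat{\Omega}^{-1}m(\hat{\theta}) \equiv 0.$$
But with exact identification, both $D$ and $\hat{\Omega}$ are square and invertible (at least asymptotically, assuming that asymptotic normality holds), so
$$m(\hat{\theta}) \equiv 0.$$
So the moment conditions are zero regardless of the weighting matrix used. As such, we might as well use an identity matrix and save trouble. Also $s_n(\hat{\theta}) = 0$, so the test breaks down.

A note: this sort of test often over-rejects in finite samples. One should be cautious in rejecting a model when this test rejects.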
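The test statistic $n \cdot s_n(\hat{\theta})$ is easy to compute once a two-step GMM estimate is in hand. The sketch below is a hedged illustration, not the notes' own code: it assumes a linear model $y_t = \beta z_t + \varepsilon_t$ with one endogenous regressor and two valid instruments (so $g = 2$, $K = 1$), a simulated data set with $\beta^0 = 2$, and no autocorrelation, so that $\hat{\Omega}$ is just the average of $x_t x_t' \hat{\varepsilon}_t^2$. The function name `gmm_j_test` is hypothetical. Because the moment conditions $m_n(\beta) = a - \beta b$ are linear in $\beta$, the minimizer has the closed form $b'Wa / b'Wb$.

```python
import random

def gmm_j_test(n=5000, beta0=2.0, seed=5):
    """Two-step linear GMM with two instruments; returns (beta_hat, n*s_n(beta_hat))."""
    rng = random.Random(seed)
    xs, zs, ys = [], [], []
    for _ in range(n):
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)   # instruments
        v = rng.gauss(0.0, 1.0)
        eps = 0.5 * v + rng.gauss(0.0, 1.0)                 # error correlated with z through v
        z = x1 + x2 + v                                     # endogenous regressor
        xs.append((x1, x2)); zs.append(z); ys.append(beta0 * z + eps)

    # a = mean of x_t*y_t and b = mean of x_t*z_t, so m_n(beta) = a - beta*b
    a = [sum(x[i] * y for x, y in zip(xs, ys)) / n for i in range(2)]
    b = [sum(x[i] * z for x, z in zip(xs, zs)) / n for i in range(2)]

    def quad(u, w, v):   # u' W v for 2-vectors u, v and a 2x2 matrix W
        return sum(u[i] * w[i][j] * v[j] for i in range(2) for j in range(2))

    def estimate(w):     # minimizer of (a - beta*b)' W (a - beta*b)
        return quad(b, w, a) / quad(b, w, b)

    beta1 = estimate([[1.0, 0.0], [0.0, 1.0]])              # first step: identity weight
    e = [y - beta1 * z for y, z in zip(ys, zs)]
    om = [[sum(x[i] * x[j] * et * et for x, et in zip(xs, e)) / n
           for j in range(2)] for i in range(2)]            # Omega_hat = Gamma_hat_0
    det = om[0][0] * om[1][1] - om[0][1] * om[1][0]
    w_opt = [[om[1][1] / det, -om[0][1] / det],
             [-om[1][0] / det, om[0][0] / det]]             # inverse of the 2x2 Omega_hat
    beta2 = estimate(w_opt)                                 # second step: efficient weight
    m = [a[i] - beta2 * b[i] for i in range(2)]
    return beta2, n * quad(m, w_opt, m)

beta_hat, j_stat = gmm_j_test()
print(beta_hat, j_stat)
```

Since the simulated instruments are valid, `j_stat` should behave like a draw from a $\chi^2(1)$ distribution; it would be compared with the appropriate critical value (3.841 at the 5% level for one degree of freedom).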
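15.9 Other estimators interpreted as GMM estimators

15.9.1 OLS with heteroscedasticity of unknown form

Example 26 White's heteroscedastic consistent varcov estimator for OLS.

Suppose $y = X\beta^0 + \varepsilon$, where $\varepsilon \sim N(0, \Sigma)$, $\Sigma$ a diagonal matrix.

The typical approach is to parameterize $\Sigma = \Sigma(\sigma)$, where $\sigma$ is a finite dimensional parameter vector, and to estimate $\beta$ and $\sigma$ jointly (feasible GLS). This will work well if the parameterization of $\Sigma$ is correct.

If we're not confident about parameterizing $\Sigma$, we can still estimate $\beta$ consistently by OLS. However, the typical covariance estimator $V(\hat{\beta}) = (X'X)^{-1}\hat{\sigma}^2$ will be biased and inconsistent, and will lead to invalid inferences.

By exogeneity of the regressors $x_t$ (a $K \times 1$ column vector) we have $E(x_t\varepsilon_t) = 0$, which suggests the moment condition
$$m_t(\beta) = x_t\left(y_t - x_t'\beta\right).$$
In this case, we have exact identification ($K$ parameters and $K$ moment conditions). We have
$$m(\beta) = \frac{1}{n}\sum_t m_t = \frac{1}{n}\sum_t x_t y_t - \frac{1}{n}\sum_t x_t x_t'\beta.$$
For any choice of $W$, $m(\beta)$ will be identically zero at the minimum, due to exact identification. That is, since the number of moment conditions is identical to the number of parameters, the foc imply that $m(\hat{\beta}) \equiv 0$ regardless of $W$. There is no need to use the "optimal" weighting matrix in this case, an identity matrix works just as well for the purpose of estimation. Therefore
$$\hat{\beta} = \left(\sum_t x_t x_t'\right)^{-1}\sum_t x_t y_t = (X'X)^{-1}X'y,$$
which is the usual OLS estimator.

The GMM estimator of the asymptotic varcov matrix is $\left(\widehat{D}_\infty\hat{\Omega}^{-1}\widehat{D}_\infty'\right)^{-1}$. Recall that $\widehat{D}_\infty$ is simply $\frac{\partial}{\partial\theta}m'(\hat{\theta})$. In this case
$$\widehat{D}_\infty = -\frac{1}{n}\sum_t x_t x_t' = -X'X/n.$$
Recall that a possible estimator of $\Omega$ is
$$\hat{\Omega} = \widehat{\Gamma}_0 + \sum_{v=1}^{n-1}\left(\widehat{\Gamma}_v + \widehat{\Gamma}_v'\right).$$
This is in general inconsistent, but in the present case of nonautocorrelation, it simplifies to
$$\hat{\Omega} = \widehat{\Gamma}_0$$
which has a constant number of elements to estimate, so information will accumulate, and consistency obtains. In the present case
$$\begin{aligned}
\hat{\Omega} = \widehat{\Gamma}_0 &= \frac{1}{n}\sum_{t=1}^n \hat{m}_t\hat{m}_t' \\
&= \frac{1}{n}\sum_{t=1}^n x_t x_t'\left(y_t - x_t'\hat{\beta}\right)^2 \\
&= \frac{1}{n}\sum_{t=1}^n x_t x_t'\hat{\varepsilon}_t^2 \\
&= \frac{X'\hat{E}X}{n}
\end{aligned}$$
where $\hat{E}$ is an $n \times n$ diagonal matrix with $\hat{\varepsilon}_t^2$ in the position $t, t$.

Therefore, the GMM varcov. estimator, which is consistent, is
$$\begin{aligned}
\hat{V}\left(\sqrt{n}\left(\hat{\beta} - \beta\right)\right) &= \left\{\left(-\frac{X'X}{n}\right)\left(\frac{X'\hat{E}X}{n}\right)^{-1}\left(-\frac{X'X}{n}\right)\right\}^{-1} \\
&= \left(\frac{X'X}{n}\right)^{-1}\left(\frac{X'\hat{E}X}{n}\right)\left(\frac{X'X}{n}\right)^{-1}.
\end{aligned}$$
This is the varcov estimator that White (1980) arrived at in an influential article. This estimator is consistent under heteroscedasticity of an unknown form. If there is autocorrelation, the Newey-West estimator can be used to estimate $\Omega$ - the rest is the same.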
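For a single regressor and no intercept the sandwich formula collapses to scalars, which makes it easy to see the pieces. The following sketch is an illustration under that simplifying assumption (one regressor, regression through the origin); the function name `ols_white` and the three-point data set are hypothetical.

```python
def ols_white(x, y):
    """OLS through the origin with White's heteroscedasticity-consistent variance."""
    n = len(x)
    sxx = sum(xi * xi for xi in x)
    beta = sum(xi * yi for xi, yi in zip(x, y)) / sxx                 # (X'X)^{-1} X'y
    resid = [yi - beta * xi for xi, yi in zip(x, y)]
    meat = sum(xi * xi * ei * ei for xi, ei in zip(x, resid)) / n     # X'EX/n
    bread = sxx / n                                                   # X'X/n
    avar = meat / (bread * bread)   # estimated variance of sqrt(n)*(beta_hat - beta)
    return beta, avar, avar / n     # last entry: estimated variance of beta_hat itself

beta_hat, avar, var_beta = ols_white([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(beta_hat, avar, var_beta)
```

Dividing the asymptotic variance by $n$ gives the familiar form $(X'X)^{-1}X'\hat{E}X(X'X)^{-1}$ for the variance of $\hat{\beta}$ itself.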
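15.9.2 Weighted Least Squares

Consider the previous example of a linear model with heteroscedasticity of unknown form:
$$y = X\beta^0 + \varepsilon, \qquad \varepsilon \sim N(0, \Sigma)$$
where $\Sigma$ is a diagonal matrix.

Now, suppose that the form of $\Sigma$ is known, so that $\Sigma(\theta^0)$ is a correct parametric specification (which may also depend upon $X$). In this case, the GLS estimator is
$$\tilde{\beta} = \left(X'\Sigma^{-1}X\right)^{-1}X'\Sigma^{-1}y.$$
This estimator can be interpreted as the solution to the $K$ moment conditions
$$m(\tilde{\beta}) = \frac{1}{n}\sum_t \frac{x_t y_t}{\sigma_t(\theta^0)} - \frac{1}{n}\sum_t \frac{x_t x_t'}{\sigma_t(\theta^0)}\tilde{\beta} \equiv 0,$$
where $\sigma_t(\theta^0)$ denotes the $t$th diagonal element of $\Sigma(\theta^0)$.
That is, the GLS estimator in this case has an obvious representation as a GMM estimator. With autocorrelation, the representation exists but it is a little more complicated. Nevertheless, the idea is the same. There are a few points:

The (feasible) GLS estimator is known to be asymptotically efficient in the class of linear asymptotically unbiased estimators (Gauss-Markov).

This means that it is more efficient than the above example of OLS with White's heteroscedastic consistent covariance, which is an alternative GMM estimator.

This means that the choice of the moment conditions is important to achieve efficiency.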
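15.9.3 2SLS
Consider the linear model
$$y_t = z_t'\beta + \varepsilon_t,$$
or
$$y = Z\beta + \varepsilon$$
using the usual construction, where $\beta$ is $K \times 1$ and $\varepsilon_t$ is i.i.d. Suppose that this equation is one of a system of simultaneous equations, so that $z_t$ contains both endogenous and exogenous variables. Suppose that $x_t$ is the vector of all exogenous and predetermined variables that are uncorrelated with $\varepsilon_t$ (suppose that $x_t$ is $r \times 1$).

Define $\hat{Z}$ as the vector of predictions of $Z$ when regressed upon $X$, e.g.,
$$\hat{Z} = X(X'X)^{-1}X'Z.$$
Since $\hat{Z}$ is a linear combination of the exogenous variables $x$, $\hat{z}_t$ must be uncorrelated with $\varepsilon$. This suggests the $K$-dimensional moment condition $m_t(\beta) = \hat{z}_t\left(y_t - z_t'\beta\right)$ and so
$$m(\beta) = \frac{1}{n}\sum_t \hat{z}_t\left(y_t - z_t'\beta\right).$$
Since we have $K$ parameters and $K$ moment conditions, the GMM estimator will set $m$ identically equal to zero, regardless of $W$, so we have
$$\hat{\beta} = \left(\sum_t \hat{z}_t z_t'\right)^{-1}\sum_t\left(\hat{z}_t y_t\right) = \left(\hat{Z}'Z\right)^{-1}\hat{Z}'y.$$
This is the standard formula for 2SLS. We use the exogenous variables and the reduced form predictions of the endogenous variables as instruments, and apply IV estimation. See Hamilton pp. 420-21 for the varcov formula (which is the standard formula for 2SLS), and for how to deal with $\varepsilon_t$ heterogeneous and dependent (basically, just use the Newey-West or some other consistent estimator of $\Omega$, and apply the usual formula). Note that $\varepsilon_t$ dependent causes lagged endogenous variables to lose their status as legitimate instruments.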
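15.9.4 Nonlinear simultaneous equations

GMM provides a convenient way to estimate nonlinear systems of simultaneous equations. We have a system of equations of the form
$$\begin{aligned}
y_{1t} &= f_1(z_t, \theta_1^0) + \varepsilon_{1t} \\
y_{2t} &= f_2(z_t, \theta_2^0) + \varepsilon_{2t} \\
&\;\;\vdots \\
y_{Gt} &= f_G(z_t, \theta_G^0) + \varepsilon_{Gt},
\end{aligned}$$
or in compact notation
$$y_t = f(z_t, \theta^0) + \varepsilon_t,$$
where $f(\cdot)$ is a $G$-vector valued function, and $\theta^0 = \left(\theta_1^{0\prime}, \theta_2^{0\prime}, \cdots, \theta_G^{0\prime}\right)'$.

We need to find an $A_i \times 1$ vector of instruments $x_{it}$, for each equation, that are uncorrelated with $\varepsilon_{it}$. Typical instruments would be low order monomials in the exogenous variables in $z_t$, with their lagged values. Then we can define the $\left(\sum_{i=1}^G A_i\right) \times 1$ orthogonality conditions
$$m_t(\theta) = \begin{bmatrix}
\left(y_{1t} - f_1(z_t, \theta_1)\right)x_{1t} \\
\left(y_{2t} - f_2(z_t, \theta_2)\right)x_{2t} \\
\vdots \\
\left(y_{Gt} - f_G(z_t, \theta_G)\right)x_{Gt}
\end{bmatrix}.$$
A note on identification: selection of instruments that ensure identification is a non-trivial problem.

A note on efficiency: the selected set of instruments has important effects on the efficiency of estimation.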
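15.9.5 Maximum likelihood

In the introduction we argued that ML will in general be more efficient than GMM since ML implicitly uses all of the moments of the distribution while GMM uses a limited number of moments. Actually, a distribution with $P$ parameters can be uniquely characterized by $P$ moment conditions. However, some sets of $P$ moment conditions may contain more information than others, since the moment conditions could be highly correlated. A GMM estimator that chose an optimal set of $P$ moment conditions would be fully efficient. Here we'll see that the optimal moment conditions are simply the scores of the ML estimator.

Let $y_t$ be a $G$-vector of variables, and let $Y_t = \left(y_1', y_2', ..., y_t'\right)'$. Then at time $t$, $Y_{t-1}$ has been observed (refer to it as the information set, since we assume the conditioning variables have been selected to take advantage of all useful information). The likelihood function is the joint density of the sample:
$$L(\theta) = f(y_1, y_2, ..., y_n, \theta)$$
which can be factored as
$$L(\theta) = f(y_n|Y_{n-1}, \theta) \cdot f(Y_{n-1}, \theta)$$
and we can repeat this to get
$$L(\theta) = f(y_n|Y_{n-1}, \theta) \cdot f(y_{n-1}|Y_{n-2}, \theta) \cdot ... \cdot f(y_1).$$
The log-likelihood function is therefore
$$\ln L(\theta) = \sum_{t=1}^n \ln f(y_t|Y_{t-1}, \theta).$$
Define
$$m_t(Y_t, \theta) \equiv D_\theta \ln f(y_t|Y_{t-1}, \theta)$$
as the score of the $t$th observation. It can be shown that, under the regularity conditions, the scores have conditional mean zero when evaluated at $\theta^0$:
$$E\left\{m_t(Y_t, \theta^0)|Y_{t-1}\right\} = 0$$
so one could interpret these as moment conditions to use to define a just-identified GMM estimator (if there are $K$ parameters there are $K$ score equations). The GMM estimator sets
$$\frac{1}{n}\sum_{t=1}^n m_t(Y_t, \hat{\theta}) = \frac{1}{n}\sum_{t=1}^n D_\theta \ln f(y_t|Y_{t-1}, \hat{\theta}) = 0,$$
which are precisely the first order conditions of MLE. Therefore, MLE can be interpreted as a GMM estimator. The GMM varcov formula is $V_\infty = \left(D_\infty\Omega^{-1}D_\infty'\right)^{-1}$.

Consistent estimates of variance components are as follows.

$D_\infty$:
$$\widehat{D}_\infty = \frac{\partial}{\partial\theta'}m(Y_t, \hat{\theta}) = \frac{1}{n}\sum_{t=1}^n D_\theta^2 \ln f(y_t|Y_{t-1}, \hat{\theta}).$$
$\Omega$: It is important to note that $m_t$ and $m_{t-s}$, $s > 0$ are both conditionally and unconditionally uncorrelated. Conditional uncorrelation follows from the fact that $m_{t-s}$ is a function of $Y_{t-s}$, which is in the information set at time $t$. Unconditional uncorrelation follows from the fact that conditional uncorrelation holds regardless of the realization of $Y_{t-1}$, so marginalizing with respect to $Y_{t-1}$ preserves uncorrelation (see the section on ML estimation, above). The fact that the scores are serially uncorrelated implies that $\Omega$ can be estimated by the estimator of the 0th autocovariance of the moment conditions:
$$\hat{\Omega} = \frac{1}{n}\sum_{t=1}^n m_t(Y_t, \hat{\theta})m_t(Y_t, \hat{\theta})' = \frac{1}{n}\sum_{t=1}^n \left[D_\theta \ln f(y_t|Y_{t-1}, \hat{\theta})\right]\left[D_\theta \ln f(y_t|Y_{t-1}, \hat{\theta})\right]'.$$
Recall from study of ML estimation that the information matrix equality (equation ??) states that
$$E\left\{\left[D_\theta \ln f(y_t|Y_{t-1}, \theta^0)\right]\left[D_\theta \ln f(y_t|Y_{t-1}, \theta^0)\right]'\right\} = -E\left\{D_\theta^2 \ln f(y_t|Y_{t-1}, \theta^0)\right\}.$$
This result implies the well known (and already seen) result that we can estimate $V_\infty$ in any of three ways:

The sandwich version:
$$\widehat{V}_\infty = n\left\{\left[\sum_{t=1}^n D_\theta^2 \ln f(y_t|Y_{t-1}, \hat{\theta})\right]\left[\sum_{t=1}^n \left[D_\theta \ln f(y_t|Y_{t-1}, \hat{\theta})\right]\left[D_\theta \ln f(y_t|Y_{t-1}, \hat{\theta})\right]'\right]^{-1}\left[\sum_{t=1}^n D_\theta^2 \ln f(y_t|Y_{t-1}, \hat{\theta})\right]\right\}^{-1}$$
or the inverse of the negative of the Hessian (since the middle and last term cancel, except for a minus sign):
$$\widehat{V}_\infty = \left[-\frac{1}{n}\sum_{t=1}^n D_\theta^2 \ln f(y_t|Y_{t-1}, \hat{\theta})\right]^{-1},$$
or the inverse of the outer product of the gradient (since the middle and last cancel except for a minus sign, and the first term converges to minus the inverse of the middle term, which is still inside the overall inverse)
$$\widehat{V}_\infty = \left\{\frac{1}{n}\sum_{t=1}^n \left[D_\theta \ln f(y_t|Y_{t-1}, \hat{\theta})\right]\left[D_\theta \ln f(y_t|Y_{t-1}, \hat{\theta})\right]'\right\}^{-1}.$$
This simplification is a special result for the MLE estimator - it doesn't apply to GMM estimators in general.

Asymptotically, if the model is correctly specified, all of these forms converge to the same limit. In small samples they will differ. In particular, there is evidence that the outer product of the gradient formula does not perform very well in small samples (see Davidson and MacKinnon, pg. 477). White's Information matrix test (Econometrica, 1982) is based upon comparing the two ways to estimate the information matrix: outer product of gradient or negative of the Hessian. If they differ by too much, this is evidence of misspecification of the model.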
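15.10 Example: The Hausman Test

This section discusses the Hausman test, which was originally presented in Hausman, J.A. (1978), Specification tests in econometrics, Econometrica, 46, 1251-71.

Consider the simple linear regression model $y_t = x_t'\beta + \varepsilon_t$. We assume that the functional form and the choice of regressors is correct, but that some of the regressors may be correlated with the error term, which as you know will produce inconsistency of $\hat{\beta}$. For example, this will be a problem if some regressors are endogenous, if some regressors are measured with error, or if lagged values of the dependent variable are used as regressors and $\varepsilon_t$ is autocorrelated.

To illustrate, the Octave program biased.m performs a Monte Carlo experiment where errors are correlated with regressors, and estimation is by OLS and IV. The true value of the slope coefficient used to generate the data is $\beta = 2$. Figure 15.1 shows that the OLS estimator is quite biased, while Figure 15.2 shows that the IV estimator is on average much closer to the true value. If you play with the program, increasing the sample size, you can see evidence that the OLS estimator is asymptotically biased, while the IV estimator is consistent.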
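For readers without Octave at hand, the same kind of experiment can be sketched in Python. This is not a port of biased.m, whose contents are not reproduced in these notes; the data generating process below (one regressor `z` that shares a shock with the error, one instrument `x`, true slope 2) and the function name `ols_and_iv` are assumptions made for illustration. Under this design the OLS slope converges to $2 + 0.5/2 = 2.25$, while the IV slope converges to 2.

```python
import random

def ols_and_iv(n, beta0=2.0, seed=6):
    """Return (OLS slope, IV slope) for y = beta0*z + eps with z correlated with eps."""
    rng = random.Random(seed)
    sxz = sxy = szz = szy = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)               # instrument, independent of the error
        v = rng.gauss(0.0, 1.0)
        eps = 0.5 * v + rng.gauss(0.0, 1.0)   # error shares the shock v with the regressor
        z = x + v                             # regressor, correlated with eps
        y = beta0 * z + eps
        sxz += x * z
        sxy += x * y
        szz += z * z
        szy += z * y
    return szy / szz, sxy / sxz               # OLS: z'y/z'z, IV: x'y/x'z

for n in (100, 10_000, 200_000):
    print(n, ols_and_iv(n))
```

As the sample size grows, the first number in each printed pair should settle away from 2 while the second approaches it.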
We have seen that inconsistent and consistent estimators converge to different probability limits. This is the idea behind the Hausman test - a pair of consistent estimators converge to the same probability limit, while if one is consistent and the other is not they converge to different limits. If we accept that one is consistent (e.g., the IV estimator), but we are doubting if the other is consistent (e.g., the OLS estimator), we might try to check if the difference between the estimators is significantly different from zero.

[Figure 15.1: histogram of the OLS estimates; horizontal axis from about 2.30 to 2.37.]

[Figure 15.2: IV. Histogram of the IV estimates; horizontal axis from about 1.90 to 2.08.]

If we're doubting about the consistency of OLS (or QML, etc.), why should we be interested in testing - why not just use the IV estimator? Because the OLS estimator is more efficient when the regressors are exogenous and the other classical assumptions (including normality of the errors) hold. When we have a more efficient estimator that relies on stronger assumptions (such as exogeneity) than the IV estimator, we might prefer to use it, unless we have evidence that the assumptions are false.
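So, let's consider the covariance between the MLE estimator $\hat{\theta}$ (or any other fully efficient estimator) and some other CAN estimator, say $\tilde{\theta}$. Now, let's recall some results from MLE. Equation 4.2 is:
$$\sqrt{n}\left(\hat{\theta} - \theta_0\right) \overset{a.s.}{\to} -H_\infty(\theta_0)^{-1}\sqrt{n}\,g(\theta_0).$$
Equation 4.7 is
$$H_\infty(\theta) = -I_\infty(\theta).$$
Combining these two equations, we get
$$\sqrt{n}\left(\hat{\theta} - \theta_0\right) \overset{a.s.}{\to} I_\infty(\theta_0)^{-1}\sqrt{n}\,g(\theta_0).$$
Also, equation 4.9 tells us that the asymptotic covariance between any CAN estimator and the MLE score vector is
$$V_\infty\begin{bmatrix} \sqrt{n}\left(\tilde{\theta} - \theta\right) \\ \sqrt{n}\,g(\theta) \end{bmatrix} = \begin{bmatrix} V_\infty(\tilde{\theta}) & I_K \\ I_K & I_\infty(\theta) \end{bmatrix}.$$
Now, consider
$$\begin{bmatrix} I_K & 0_K \\ 0_K & I_\infty(\theta)^{-1} \end{bmatrix}\begin{bmatrix} \sqrt{n}\left(\tilde{\theta} - \theta\right) \\ \sqrt{n}\,g(\theta) \end{bmatrix} \overset{a.s.}{\to} \begin{bmatrix} \sqrt{n}\left(\tilde{\theta} - \theta\right) \\ \sqrt{n}\left(\hat{\theta} - \theta\right) \end{bmatrix}.$$
The asymptotic covariance of this is
$$\begin{aligned}
V_\infty\begin{bmatrix} \sqrt{n}\left(\tilde{\theta} - \theta\right) \\ \sqrt{n}\left(\hat{\theta} - \theta\right) \end{bmatrix} &= \begin{bmatrix} I_K & 0_K \\ 0_K & I_\infty(\theta)^{-1} \end{bmatrix}\begin{bmatrix} V_\infty(\tilde{\theta}) & I_K \\ I_K & I_\infty(\theta) \end{bmatrix}\begin{bmatrix} I_K & 0_K \\ 0_K & I_\infty(\theta)^{-1} \end{bmatrix} \\
&= \begin{bmatrix} V_\infty(\tilde{\theta}) & I_\infty(\theta)^{-1} \\ I_\infty(\theta)^{-1} & I_\infty(\theta)^{-1} \end{bmatrix},
\end{aligned}$$
which, for clarity in what follows, we might write as
$$V_\infty\begin{bmatrix} \sqrt{n}\left(\tilde{\theta} - \theta\right) \\ \sqrt{n}\left(\hat{\theta} - \theta\right) \end{bmatrix} = \begin{bmatrix} V_\infty(\tilde{\theta}) & I_\infty(\theta)^{-1} \\ I_\infty(\theta)^{-1} & V_\infty(\hat{\theta}) \end{bmatrix}.$$
So, the asymptotic covariance between the MLE and any other CAN estimator is equal to the MLE asymptotic variance (the inverse of the information matrix).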
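Now, suppose we wish to test whether the two estimators are in fact both converging to $\theta_0$, versus the alternative hypothesis that the "MLE" estimator is not in fact consistent (the consistency of $\tilde{\theta}$ is a maintained hypothesis). Under the null hypothesis that they are, we have
$$\begin{bmatrix} I_K & -I_K \end{bmatrix}\begin{bmatrix} \sqrt{n}\left(\tilde{\theta} - \theta_0\right) \\ \sqrt{n}\left(\hat{\theta} - \theta_0\right) \end{bmatrix} = \sqrt{n}\left(\tilde{\theta} - \hat{\theta}\right),$$
which will be asymptotically normally distributed as
$$\sqrt{n}\left(\tilde{\theta} - \hat{\theta}\right) \overset{d}{\to} N\left(0, V_\infty(\tilde{\theta}) - V_\infty(\hat{\theta})\right).$$
So,
$$n\left(\tilde{\theta} - \hat{\theta}\right)'\left(V_\infty(\tilde{\theta}) - V_\infty(\hat{\theta})\right)^{-1}\left(\tilde{\theta} - \hat{\theta}\right) \overset{d}{\to} \chi^2(\rho),$$
where $\rho$ is the rank of the difference of the asymptotic variances. A statistic that has the same asymptotic distribution is
$$\left(\tilde{\theta} - \hat{\theta}\right)'\left(\hat{V}(\tilde{\theta}) - \hat{V}(\hat{\theta})\right)^{-1}\left(\tilde{\theta} - \hat{\theta}\right) \overset{d}{\to} \chi^2(\rho).$$
This is the Hausman test statistic, in its original form. The reason that this test has power under the alternative hypothesis is that in that case the "MLE" estimator will not be consistent, so the difference $\tilde{\theta} - \hat{\theta}$ does not converge to zero and the statistic grows with the sample size.

Note: if the test is based on a sub-vector of the entire parameter vector of the MLE, it is possible that the inconsistency of the MLE will not show up in the portion of the vector that has been used. If this is the case, the test may not have power to detect the inconsistency. This may occur, for example, when the consistent but inefficient estimator is not identified for all the parameters of the model.

Some things to note:

The rank, $\rho$, of the difference of the asymptotic variances is often less than the dimension of the matrices, and it may be difficult to determine what the true rank is. If the true rank is lower than what is taken to be true, the test will be biased against rejection of the null hypothesis. The contrary holds if we underestimate the rank.

A solution to this problem is to use a rank 1 test, by comparing only a single coefficient. For example, if a variable is suspected of possibly being endogenous, that variable's coefficients may be compared.

This simple formula only holds when the estimator that is being tested for consistency is fully efficient under the null hypothesis. This means that it must be a ML estimator or a fully efficient estimator that has the same asymptotic distribution as the ML estimator. This is quite restrictive since modern estimators such as GMM and QML are not in general fully efficient.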
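The rank 1 version of the test, comparing a single coefficient, is just a ratio of scalars. A minimal sketch follows; the function name `hausman_scalar` and the numbers fed to it are made up for illustration, and the variances passed in are the estimated variances of the estimators themselves (not of the $\sqrt{n}$-scaled quantities).

```python
def hausman_scalar(theta_tilde, theta_hat, var_tilde, var_hat):
    """Rank-1 Hausman statistic: (theta_tilde - theta_hat)^2 / (var_tilde - var_hat).

    theta_tilde: consistent but inefficient estimate, with estimated variance var_tilde.
    theta_hat:   estimate that is efficient under the null, with variance var_hat.
    """
    diff_var = var_tilde - var_hat
    if diff_var <= 0.0:
        # In finite samples the estimated difference need not be positive.
        raise ValueError("variance difference must be positive for this formula")
    return (theta_tilde - theta_hat) ** 2 / diff_var

stat = hausman_scalar(theta_tilde=2.0, theta_hat=2.3, var_tilde=0.02, var_hat=0.005)
reject_5pct = stat > 3.841   # 5% critical value of a chi-square with 1 degree of freedom
print(stat, reject_5pct)
```

The explicit check on the variance difference is a design choice for the sketch: it makes the finite-sample failure mode visible rather than returning a negative "chi-square" value.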
Following up on this last point, let's think of two not ne
essarily e
ient estimators,
and
2 ,
where one is assumed to be onsistent, but the other may not be.
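We assume that both can be expressed as GMM estimators, defined by
\[ \hat\theta_i = \arg\min_{\theta_i} m_i(\theta_i)' W_i m_i(\theta_i), \]
where $m_i(\theta_i)$ is a $g_i \times 1$ vector of moment conditions and $W_i$ is a $g_i \times g_i$ positive definite weighting matrix, $i = 1, 2$. Consider the omnibus GMM estimator
\[ \left(\hat\theta_1, \hat\theta_2\right) = \arg\min \begin{bmatrix} m_1(\theta_1)' & m_2(\theta_2)' \end{bmatrix} \begin{bmatrix} W_1 & 0_{(g_1 \times g_2)} \\ 0_{(g_2 \times g_1)} & W_2 \end{bmatrix} \begin{bmatrix} m_1(\theta_1) \\ m_2(\theta_2) \end{bmatrix}. \tag{15.7} \]
Suppose that the asymptotic covariance of the omnibus moment vector is
\[ \Sigma = \lim_{n \to \infty} Var\left\{ \sqrt{n} \begin{bmatrix} m_1(\theta_1) \\ m_2(\theta_2) \end{bmatrix} \right\} = \begin{pmatrix} \Sigma_1 & \Sigma_{12} \\ \cdot & \Sigma_2 \end{pmatrix}. \tag{15.8} \]
The Hausman test is the standard Wald test of the hypothesis that $\theta_1$ and $\theta_2$ (or subvectors of the two) are equal, applied to the omnibus GMM estimator, but with the covariance of the moment conditions estimated as
\[ \hat\Sigma = \begin{pmatrix} \hat\Sigma_1 & 0_{(g_1 \times g_2)} \\ 0_{(g_2 \times g_1)} & \hat\Sigma_2 \end{pmatrix}. \]
The omitted $\Sigma_{12}$ term cancels out of the test statistic when one of the estimators is asymptotically efficient, as we have seen above, and thus it need not be estimated.
The general solution when neither of the estimators is efficient is clear: the entire $\Sigma$ matrix must be estimated consistently, since the $\Sigma_{12}$ term no longer cancels. Methods for consistently estimating the covariance of a vector of moment conditions are well-known. The Hausman test using a proper estimator of the overall covariance matrix will now have an asymptotic $\chi^2$ distribution when neither estimator is efficient. A more powerful test can be based on the alternative omnibus GMM estimator
\[ \left(\hat\theta_1, \hat\theta_2\right) = \arg\min \begin{bmatrix} m_1(\theta_1)' & m_2(\theta_2)' \end{bmatrix} \tilde\Sigma^{-1} \begin{bmatrix} m_1(\theta_1) \\ m_2(\theta_2) \end{bmatrix}, \tag{15.9} \]
where $\tilde\Sigma$ is a consistent estimator of the overall covariance matrix $\Sigma$ of equation 15.8. By standard arguments, this is a more efficient estimator than that defined by equation 15.7,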
so the Wald test using this alternative is more powerful. See my article in Applied Economics for details; the accompanying program calculates the Wald test corresponding to the efficient joint GMM estimator (the H2 test).
CHAPTER 15.
Though GMM estimation has many applications, application to rational expectations models is elegant, since theory directly suggests the moment conditions. Hansen and Singleton's 1982 paper is also a classic worth studying in itself. Though I strongly recommend reading the paper, I'll use a simplified model with similar notation to Hamilton's.
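We assume a representative consumer maximizes expected discounted utility over an infinite horizon. Utility is temporally additive, and the expected utility hypothesis holds. The future consumption stream is the stochastic sequence $\{c_t\}_{t=0}^{\infty}$. The objective function at time $t$ is the discounted expected utility
\[ \sum_{s=0}^{\infty} \beta^s E\left(u(c_{t+s}) | I_t\right). \tag{15.10} \]
The parameter $\beta$ is between 0 and 1, and reflects discounting. $I_t$ is the information set at time $t$, and includes all realizations of random variables indexed $t$ and earlier. The choice variable is current consumption $c_t$, which is constrained to be less than or equal to current wealth $w_t$.
Suppose the consumer can invest in a risky asset. A dollar invested in the asset yields a gross return
\[ (1 + r_{t+1}) = \frac{p_{t+1} + d_{t+1}}{p_t}, \]
where $p_t$ is the price and $d_t$ is the dividend in period $t$. The price of $c_t$ is normalized to 1. Current wealth is $w_t = (1 + r_t) i_{t-1}$, where $i_{t-1}$ is investment in period $t - 1$. So the problem is to allocate current wealth between current consumption and investment to finance future consumption: $w_t = c_t + i_t$. Future net rates of return $r_{t+s}$, $s > 0$, are not known in period $t$: the asset is risky.
A partial set of necessary conditions for utility maximization has the form:
\[ u'(c_t) = \beta E\left\{ (1 + r_{t+1}) u'(c_{t+1}) | I_t \right\}. \tag{15.11} \]
To see that the condition is necessary, suppose that the lhs < rhs. Then reducing current consumption marginally would cause equation 15.10 to drop by $u'(c_t)$, since there is no discounting of the current period. At the same time, the marginal reduction in consumption finances investment, which has gross return $(1 + r_{t+1})$, which could finance consumption in period $t + 1$. This increase in consumption would cause the objective function to increase by $\beta E\left\{ (1 + r_{t+1}) u'(c_{t+1}) | I_t \right\}$. Therefore, unless the condition holds, the expected discounted utility function is not maximized.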
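To use this we need to choose the functional form of utility. A constant relative risk aversion form is
\[ u(c_t) = \frac{c_t^{1-\gamma} - 1}{1 - \gamma}, \]
where $\gamma$ is the coefficient of relative risk aversion. With this form,
\[ u'(c_t) = c_t^{-\gamma}, \]
so the foc are
\[ c_t^{-\gamma} = \beta E\left\{ (1 + r_{t+1}) c_{t+1}^{-\gamma} | I_t \right\}. \]
While it is true that
\[ E\left\{ c_t^{-\gamma} - \beta (1 + r_{t+1}) c_{t+1}^{-\gamma} | I_t \right\} = 0, \]
so that we could use this to define moment conditions, it is unlikely that $c_t$ is stationary, even though it is in real terms, and our theory requires stationarity. To solve this, divide through by $c_t^{-\gamma}$:
\[ E\left\{ 1 - \beta (1 + r_{t+1}) \left( \frac{c_{t+1}}{c_t} \right)^{-\gamma} | I_t \right\} = 0 \]
(note that $c_t$ can be passed through the conditional expectation since $c_t$ is chosen based only upon information available at time $t$).
Now
\[ 1 - \beta (1 + r_{t+1}) \left( \frac{c_{t+1}}{c_t} \right)^{-\gamma} \]
is analogous to $h_t(\theta)$ defined above: it's a scalar moment condition. To get a vector of moment conditions we need some instruments. Suppose that $z_t$ is a vector of variables drawn from the information set $I_t$. We can use the necessary conditions to form the expressions
\[ \left[ 1 - \beta (1 + r_{t+1}) \left( \frac{c_{t+1}}{c_t} \right)^{-\gamma} \right] z_t \equiv m_t(\theta), \]
where $\theta$ represents $\beta$ and $\gamma$. This may be interpreted as a moment condition which can be used for GMM estimation of the parameters $\theta^0$.
Note that at time $t$, $m_{t-s}$ has been observed, and is therefore an element of the information set. By rational expectations, the autocovariances of the moment conditions other than $\Gamma_0$ should be zero. The optimal weighting matrix is therefore the inverse of the variance of the moment conditions:
\[ \Omega_\infty = \lim E\left[ n\, m(\theta^0) m(\theta^0)' \right], \]
which can be consistently estimated by
\[ \hat\Omega = \frac{1}{n} \sum_{t=1}^{n} m_t(\hat\theta) m_t(\hat\theta)'. \]
As before, this estimate depends on an initial consistent estimate of $\theta$, which can be obtained by setting the weighting matrix arbitrarily (to an identity matrix, for example). After obtaining $\hat\theta$, we then minimize
\[ s(\theta) = m(\theta)' \hat\Omega^{-1} m(\theta). \]
This process can be iterated, e.g., use the new estimate to re-estimate $\Omega$, use this to estimate $\theta^0$, and repeat until the estimates don't change.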
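The moment conditions built from the Euler equation can be sketched as follows. This is a minimal illustration, not the program used for the results reported in these notes: the function names `euler_moments` and `gmm_objective`, the list-based data layout, and the timing convention (`r[t]` is the net return realized in period t, `z[t]` is known at time t) are all assumptions made for the example.

```python
def euler_moments(beta, gamma, c, r, z):
    """Average moment vector m(theta) = (1/n) * sum_t e_t * z_t, where
    e_t = 1 - beta * (1 + r_{t+1}) * (c_{t+1} / c_t) ** (-gamma).

    c: consumption levels c_0, ..., c_n
    r: net returns, r[t] realized in period t (r[0] is unused)
    z: list of instrument vectors, z[t] drawn from the time-t information set
    """
    n = len(c) - 1
    g = len(z[0])
    m = [0.0] * g
    for t in range(n):
        e_t = 1.0 - beta * (1.0 + r[t + 1]) * (c[t + 1] / c[t]) ** (-gamma)
        for j in range(g):
            m[j] += e_t * z[t][j] / n
    return m

def gmm_objective(m, W):
    """Quadratic form s(theta) = m' W m for a weighting matrix W (list of rows)."""
    g = len(m)
    return sum(m[i] * W[i][j] * m[j] for i in range(g) for j in range(g))
```

A two-step estimator would minimize `gmm_objective` with an identity `W`, use the resulting estimate to form the inverse of the estimated variance of the moments, and minimize again with that matrix as `W`.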
In principle, we could use a very large number of moment conditions in estimation, since any current or lagged variable could be used in $x_t$. Since use of more moment conditions will lead to a more (asymptotically) efficient estimator, one might be tempted to use many instrumental variables. We will do a computer lab that will show that this may not be a good idea with finite samples. This issue has been studied using Monte Carlos (Tauchen, JBES, 1986).
Empirical papers that use this approach often have serious problems in obtaining precise estimates of the parameters. Note that we are basing everything on a single partial first order condition. Probably this f.o.c. is simply not informative enough. Simulation-based estimation methods (discussed below) are one means of trying to use more informative moment conditions to estimate this sort of model.
r,
JBES,
c, p,
and
******************************************************
Example of GMM estimation of rational expectations model

GMM Estimation Results
BFGS convergence: Normal convergence

Objective function value: 0.000014
Observations: 94

             Value       df  p-value
X^2 test     0.001    1.000    0.971

          estimate  st. err   t-stat  p-value
beta         0.915    0.009   97.271    0.000
gamma        0.569    0.319    1.783    0.075
******************************************************
******************************************************
Example of GMM estimation of rational expectations model

GMM Estimation Results
BFGS convergence: Normal convergence

Objective function value: 0.037882
Observations: 93

             Value       df  p-value
X^2 test     3.523    3.000    0.318

          estimate  st. err   t-stat  p-value
beta         0.857    0.024   35.636    0.000
gamma       -2.351    0.315   -7.462    0.000
******************************************************
Pretty clearly, the results are sensitive to the choice of instruments. Maybe there is some problem here: poor instruments, or possibly a conditional moment that is not very informative. Moment conditions formed from Euler conditions sometimes do not identify the parameter of a model. See Hansen, Heaton and Yaron (1996), JBES. Is that a problem here? (I haven't checked it carefully.)
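Exercises
1. Show how to cast the generalized IV estimator presented in section 11.4 as a GMM estimator. Identify what are the moment conditions, $m_t(\theta)$, what is the form of the matrix $D_n$, what is the efficient weight matrix, and show that the covariance matrix formula given previously corresponds to the GMM covariance matrix formula.
2. Using Octave, generate data from the logit dgp. Recall that $E(y_t|x_t) = p(x_t, \theta) = [1 + \exp(-x_t'\theta)]^{-1}$. Consider the moment conditions $m_t(\theta) = [y_t - p(x_t, \theta)] x_t$. Estimate by GMM using these moments, estimate by MLE, and show that the two estimators coincide.
3. Verify the missing steps needed to show that $n \cdot m(\hat\theta)' \hat\Omega^{-1} m(\hat\theta)$ has a $\chi^2(g - K)$ distribution. That is, show that the monster matrix is idempotent and has trace equal to $g - K$.
4. For the portfolio example, experiment with the program using lags of 3 and 4 periods to define instruments.
(a) Iterate the estimation of $\theta = (\beta, \gamma)$ and $\Omega$ to convergence.
(b) Comment on the results. Are the results sensitive to the set of instruments used? (Look at $\hat\Omega$ as well as $\hat\theta$.)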
Chapter 16
Quasi-ML
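Quasi-ML is the estimator one obtains when a misspecified probability model is used to calculate an "ML" estimator.
Given a sample of size $n$ of a random vector $y$ and a vector of conditioning variables $x$, suppose the joint density of $Y = (y_1 \dots y_n)$ conditional on $X = (x_1 \dots x_n)$ is a member of the parametric family $p_Y(Y|X, \rho)$, $\rho \in \Xi$. The true joint density is associated with the vector $\rho^0$:
\[ p_Y(Y|X, \rho^0). \]
As long as the marginal density of $X$ doesn't depend on $\rho^0$, this conditional density fully characterizes the random characteristics of samples: i.e., it fully describes the probabilistically important features of the d.g.p. The likelihood function is just this density evaluated at other values $\rho$:
\[ L(Y|X, \rho) = p_Y(Y|X, \rho), \quad \rho \in \Xi. \]
Let $Y_{t-1} = (y_1 \dots y_{t-1})$, $Y_0 = 0$, and let $X_t = (x_1 \dots x_t)$. The likelihood function, taking into account possible dependence of observations, can be written as
\[ L(Y|X, \rho) = \prod_{t=1}^{n} p_t(y_t|Y_{t-1}, X_t, \rho) \equiv \prod_{t=1}^{n} p_t(\rho). \]
The average log-likelihood function is:
\[ s_n(\rho) = \frac{1}{n} \ln L(Y|X, \rho) = \frac{1}{n} \sum_{t=1}^{n} \ln p_t(\rho). \]
Suppose that we do not have knowledge of the family of densities $p_t(\rho)$. Mistakenly, we may assume that the conditional density of $y_t$ is a member of the family $f_t(y_t|Y_{t-1}, X_t, \theta)$, $\theta \in \Theta$, where there is no $\theta^0$ such that $f_t(y_t|Y_{t-1}, X_t, \theta^0) = p_t(y_t|Y_{t-1}, X_t, \rho^0)$, $\forall t$ (this is what we mean by "misspecified").
This setup allows for heterogeneous time series data, with dynamic misspecification.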
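The QML estimator maximizes the misspecified average log-likelihood, which is
\[ s_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} \ln f_t(y_t|Y_{t-1}, X_t, \theta) \equiv \frac{1}{n} \sum_{t=1}^{n} \ln f_t(\theta), \]
and the QML estimator is
\[ \hat\theta_n = \arg\max_{\Theta} s_n(\theta). \]
A SLLN for dependent sequences applies (we assume), so that
\[ s_n(\theta) \overset{a.s.}{\to} \lim_{n \to \infty} E \frac{1}{n} \sum_{t=1}^{n} \ln f_t(\theta) \equiv s_\infty(\theta). \]
We assume that this can be strengthened to uniform convergence, a.s., following the previous arguments. The "pseudo-true" value of $\theta$ is the value that maximizes $s_\infty(\theta)$:
\[ \theta^0 = \arg\max_{\Theta} s_\infty(\theta). \]
Given assumptions so that the consistency theorem is applicable, we obtain
\[ \lim_{n \to \infty} \hat\theta_n = \theta^0, \text{ a.s.} \]
Applying the asymptotic normality theorem,
\[ \sqrt{n}\left(\hat\theta_n - \theta^0\right) \overset{d}{\to} N\left[0, J_\infty(\theta^0)^{-1} I_\infty(\theta^0) J_\infty(\theta^0)^{-1}\right], \]
where
\[ J_\infty(\theta^0) = \lim_{n \to \infty} E D_\theta^2 s_n(\theta^0) \]
and
\[ I_\infty(\theta^0) = \lim_{n \to \infty} Var \sqrt{n} D_\theta s_n(\theta^0). \]
Note that asymptotic normality only requires that the additional assumptions regarding $J$ and $I$ hold in a neighborhood of $\theta^0$ for $J$ and at $\theta^0$ for $I$, not throughout $\Theta$.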
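Consistent estimation of $J_\infty(\theta^0)$ is straightforward:
\[ J_n(\hat\theta_n) = \frac{1}{n} \sum_{t=1}^{n} D_\theta^2 \ln f_t(\hat\theta_n) \overset{a.s.}{\to} \lim_{n \to \infty} E \frac{1}{n} \sum_{t=1}^{n} D_\theta^2 \ln f_t(\theta^0) = J_\infty(\theta^0). \]
That is, just calculate the Hessian using the estimate $\hat\theta_n$ in place of $\theta^0$.
Consistent estimation of $I_\infty(\theta^0)$ is more difficult, and may be impossible.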
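Notation: Let $g_t \equiv D_\theta \ln f_t(\theta^0)$.
We need to estimate
\[ I_\infty(\theta^0) = \lim_{n \to \infty} Var \sqrt{n} D_\theta s_n(\theta^0) = \lim_{n \to \infty} Var \sqrt{n} \frac{1}{n} \sum_{t=1}^{n} D_\theta \ln f_t(\theta^0) = \lim_{n \to \infty} \frac{1}{n} Var \sum_{t=1}^{n} g_t \]
\[ = \lim_{n \to \infty} \frac{1}{n} E\left\{ \left( \sum_{t=1}^{n} (g_t - E g_t) \right) \left( \sum_{t=1}^{n} (g_t - E g_t) \right)' \right\}. \]
This is going to contain a term
\[ \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} (E g_t)(E g_t)' \]
which will not tend to zero, in general. This term is not consistently estimable in general, since it requires calculating an expectation using the true density under the d.g.p., which is unknown.
There are important cases where $I_\infty(\theta^0)$ is consistently estimable. For example, suppose that the data come from a random sample (i.e., they are iid). This would be the case with cross sectional data, for example. (Note: under i.i.d. sampling, the joint distribution of $(y_t, x_t)$ is identical. This does not imply that the conditional density $f(y_t|x_t)$ is identical.)
With random sampling, the limiting objective function is simply
\[ s_\infty(\theta^0) = E_X E_0 \ln f(y|x, \theta^0), \]
where $E_0$ means expectation of $y|x$ and $E_X$ means expectation with respect to the marginal density of $x$.
By the requirement that the limiting objective function be maximized at $\theta^0$ we have
\[ D_\theta E_X E_0 \ln f(y|x, \theta^0) = D_\theta s_\infty(\theta^0) = 0. \]
The dominated convergence theorem allows switching the order of expectation and differentiation, so
\[ D_\theta E_X E_0 \ln f(y|x, \theta^0) = E_X E_0 D_\theta \ln f(y|x, \theta^0) = 0. \]
The CLT implies that
\[ \frac{1}{\sqrt{n}} \sum_{t=1}^{n} D_\theta \ln f(y_t|x_t, \theta^0) \overset{d}{\to} N\left(0, I_\infty(\theta^0)\right). \]
That is, it's not necessary to subtract the individual means, since they are zero. Given this, and due to independent observations, a consistent estimator is
\[ \hat I = \frac{1}{n} \sum_{t=1}^{n} D_\theta \ln f_t(\hat\theta) D_{\theta'} \ln f_t(\hat\theta). \]
This is an important case where consistent estimation of the covariance matrix is possible. Other cases exist, even for dynamically misspecified time series models.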
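As a small illustration of the sandwich form with i.i.d. data, consider the simplest possible case, which is an assumption made here for the example and not a model estimated in these notes: a Poisson quasi-likelihood with only an intercept, $\lambda = \exp(\theta)$. Then the score of observation $t$ is $y_t - \exp(\theta)$, the second derivative is $-\exp(\theta)$, and the QML estimate is $\hat\theta = \ln \bar y$. The function name `poisson_qml_intercept` is hypothetical.

```python
import math

def poisson_qml_intercept(y):
    """Intercept-only Poisson QML with sandwich and naive variance estimates.

    Returns (theta_hat, sandwich, naive), where both variances refer to the
    asymptotic distribution of sqrt(n) * (theta_hat - theta0).
    """
    n = len(y)
    ybar = sum(y) / n
    theta_hat = math.log(ybar)
    # J_n: average second derivative of ln f_t at theta_hat, i.e. -exp(theta_hat).
    J = -ybar
    # I_hat: average outer product of the scores (y_t - exp(theta_hat)).
    I = sum((yt - ybar) ** 2 for yt in y) / n
    sandwich = I / (J * J)   # J^{-1} I J^{-1}
    naive = -1.0 / J         # inverse Hessian form, valid only under correct specification
    return theta_hat, sandwich, naive
```

When the data are overdispersed relative to the Poisson, the sandwich variance exceeds the naive inverse Hessian variance, which is the reason the naive form is unreliable for QML inference.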
\[ \widehat{V}(y) = \frac{\sum_{t=1}^{n} \hat\lambda_t}{n}. \]
For OBDV and ERV, we get

             OBDV      ERV
Sample      38.09    0.151
Estimated    3.28    0.086

We see that even after conditioning, the overdispersion is not captured in either case. There is a huge problem with OBDV, and a significant problem with ERV. In both cases the Poisson model does not appear to be plausible. You can check this for the other use measures if you like.
random parameters
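To capture unobserved heterogeneity, a latent variable $\eta$ is introduced:
\[ f_Y(y|x, \eta) = \frac{\exp(-\theta)\theta^y}{y!}, \]
\[ \theta = \exp(x'\beta + \eta) = \exp(x'\beta)\exp(\eta) = \lambda\nu, \]
where $\lambda = \exp(x'\beta)$ and $\nu = \exp(\eta)$. Now $\nu$ captures the randomness in the constant. The problem is that we don't observe $\nu$, so we will need to marginalize it to get a usable density
\[ f_Y(y|x) = \int \frac{\exp[-\theta]\theta^y}{y!} f_v(z) dz. \]
This density can be used directly, perhaps using numerical integration to evaluate the likelihood function. In some cases, though, the integral will have an analytic solution. For example, if $\nu$ follows a certain one parameter gamma density, then
\[ f_Y(y|x, \phi) = \frac{\Gamma(y + \psi)}{\Gamma(y + 1)\Gamma(\psi)} \left( \frac{\psi}{\psi + \lambda} \right)^{\psi} \left( \frac{\lambda}{\lambda + \psi} \right)^{y}, \tag{16.1} \]
where $\phi = (\lambda, \psi)$. For this density, $E(y|x) = \lambda$, which we have parameterized as $\lambda = \exp(x'\beta)$.
If $\psi = \lambda/\alpha$, where $\alpha > 0$, then $V(y|x) = \lambda + \alpha\lambda$. Note that $\lambda$ is a function of $x$, so that the variance is too. This is referred to as the NB-I model.
If $\psi = 1/\alpha$, where $\alpha > 0$, then $V(y|x) = \lambda + \alpha\lambda^2$. This is referred to as the NB-II model.
So both forms of the NB model allow for overdispersion, with the NB-II model allowing for a more radical form.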
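The two variance functions can be checked numerically from the density in equation 16.1. The sketch below is only illustrative: the function names `nb_logpmf` and `nb_moments`, the `nb_type` switch, and the truncation point `ymax` are choices made for this example, not part of the estimation programs referred to in the text.

```python
import math

def nb_logpmf(y, lam, alpha, nb_type=2):
    """Log of the negative binomial density of equation 16.1.

    nb_type=1 uses psi = lam / alpha (NB-I); nb_type=2 uses psi = 1 / alpha (NB-II).
    """
    psi = lam / alpha if nb_type == 1 else 1.0 / alpha
    return (math.lgamma(y + psi) - math.lgamma(y + 1) - math.lgamma(psi)
            + psi * math.log(psi / (psi + lam))
            + y * math.log(lam / (lam + psi)))

def nb_moments(lam, alpha, nb_type=2, ymax=500):
    """Total probability, mean and variance by direct summation over 0..ymax-1."""
    probs = [math.exp(nb_logpmf(y, lam, alpha, nb_type)) for y in range(ymax)]
    total = sum(probs)
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return total, mean, var
```

Summing the density directly is crude but makes the point: the mean is $\lambda$ in both cases, while the variance is $\lambda + \alpha\lambda$ for NB-I and $\lambda + \alpha\lambda^2$ for NB-II.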
Testing reduction of a NB model to a Poisson model cannot be done by testing $\alpha = 0$ using standard Wald or LR procedures. The critical values need to be adjusted to account for the fact that $\alpha = 0$ is on the boundary of the parameter space. Without getting into details, suppose that the data were in fact Poisson, so there is equidispersion and the true $\alpha = 0$. Then about half the time the sample data will be underdispersed, and about half overdispersed. When the data are underdispersed, the MLE of $\alpha$ will be $\hat\alpha = 0$. Thus, under the null, there will be a probability spike in the asymptotic distribution of $\sqrt{n}(\hat\alpha - \alpha) = \sqrt{n}\hat\alpha$ at 0, so standard testing methods will not be valid.
This program will do estimation using the NB model. Note how modelargs is used to select a NB-I or NB-II density. Here are NB-I estimation results for OBDV:
Final BFGS iteration:
params: 0.2551 0.2024 0.2289 0.1969 0.0769 0.0000 1.7146
gradient and change: all entries 0.0000 to four decimals
******************************************************
Negative Binomial model, MEPS 1996 full data set

MLE Estimation Results
BFGS convergence: Normal convergence

Average Log-L: -2.185730
Observations: 4564

             estimate  st. err   t-stat  p-value
constant       -0.523    0.104   -5.005    0.000
pub. ins.       0.765    0.054   14.198    0.000
priv. ins.      0.451    0.049    9.196    0.000
sex             0.458    0.034   13.512    0.000
age             0.016    0.001   11.869    0.000
edu             0.027    0.007    3.979    0.000
inc             0.000    0.000    0.000    1.000
alpha           5.555    0.296   18.752    0.000

Information Criteria
CAIC : 20026.7513      Avg. CAIC: 4.3880
 BIC : 20018.7513      Avg. BIC:  4.3862
 AIC : 19967.3437      Avg. AIC:  4.3750
******************************************************
Note that the parameter values of the last BFGS iteration are different than those reported in the final results. This reflects two things. First, the data were scaled before doing the BFGS minimization, but mle_results reports the results using the original scaling. But also, the parameterization $\alpha = \exp(\alpha^*)$ is used to enforce the restriction that $\alpha > 0$. The unrestricted parameter $\alpha^* = \log \alpha$ is used to define the log-likelihood function, since the BFGS minimization algorithm does not do constrained minimization. To get the standard error and t-statistic of the estimate of $\alpha$, the delta method is needed. This is done inside mle_results.
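For the reparameterization $\alpha = \exp(\alpha^*)$, the delta method gives $se(\hat\alpha) \approx \exp(\hat\alpha^*) \cdot se(\hat\alpha^*)$, since the derivative of $\exp(\cdot)$ is itself. The following is a minimal sketch of that calculation; the function name `delta_method_exp` is hypothetical and this is not a description of how mle_results is implemented.

```python
import math

def delta_method_exp(alpha_star, se_alpha_star):
    """Delta-method standard error for alpha = exp(alpha_star).

    Returns (alpha, se_alpha, t_stat) where se_alpha = |d alpha / d alpha_star| * se.
    """
    alpha = math.exp(alpha_star)
    se_alpha = alpha * se_alpha_star
    t_stat = alpha / se_alpha
    return alpha, se_alpha, t_stat
```

Note that under this approximation the t-statistic for $\alpha$ equals $1 / se(\hat\alpha^*)$, so it does not depend on the value of $\hat\alpha^*$ itself.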
Final BFGS iteration: all gradient and change entries are 0.0000 to four decimals.
******************************************************
Negative Binomial model, MEPS 1996 full data set

MLE Estimation Results
BFGS convergence: Normal convergence

Average Log-L: -2.184962
Observations: 4564

             estimate  st. err   t-stat  p-value
constant       -1.068    0.161   -6.622    0.000
pub. ins.       1.101    0.095   11.611    0.000
priv. ins.      0.476    0.081    5.880    0.000
sex             0.564    0.050   11.166    0.000
age             0.025    0.002   12.240    0.000
edu             0.029    0.009    3.106    0.002
inc            -0.000    0.000   -0.176    0.861
alpha           1.613    0.055   29.099    0.000

Information Criteria
CAIC : 20019.7439      Avg. CAIC: 4.3864
 BIC : 20011.7439      Avg. BIC:  4.3847
 AIC : 19960.3362      Avg. AIC:  4.3734
******************************************************
For the OBDV usage measure, the NB-II model does a slightly better job than the NB-I model, in terms of the average log-likelihood and the information criteria (more on this last in a moment).
Note that both versions of the NB model fit much better than does the Poisson model (see 13.4.2).
The estimated $\alpha$ is highly significant.
To check the plausibility of the NB-II model, we can compare the sample unconditional variance with the estimated unconditional variance according to the NB-II model:
\[ \widehat{V}(y) = \frac{\sum_{t=1}^{n} \left( \hat\lambda_t + \hat\alpha \hat\lambda_t^2 \right)}{n}. \]
For OBDV and ERV (estimation results not reported), we get

             OBDV      ERV
Sample      38.09    0.151
Estimated   30.58    0.182

For OBDV, the overdispersion problem is significantly better than in the Poisson case, but there is still some that is not captured. For ERV, the negative binomial model seems to capture the overdispersion adequately.
more subgroups. Many studies have incorporated objective and/or subjective indicators of health status in an effort to capture this heterogeneity. The available objective measures, such as limitations on activity, are not necessarily very informative about a person's overall health status. Subjective, self-reported measures may suffer from the same problem, and may also not be exogenous.
Finite mixture models are conceptually simple. The density is
\[ f_Y(y, \phi_1, ..., \phi_p, \pi_1, ..., \pi_{p-1}) = \sum_{i=1}^{p-1} \pi_i f_Y^{(i)}(y, \phi_i) + \pi_p f_Y^{(p)}(y, \phi_p), \]
where $\pi_i > 0$, $i = 1, 2, ..., p$, $\pi_p = 1 - \sum_{i=1}^{p-1} \pi_i$, and $\sum_{i=1}^{p} \pi_i = 1$. Identification requires that the $\pi_i$ are ordered in some way, for example, $\pi_1 \geq \pi_2 \geq \cdots \geq \pi_p$ and $\phi_i \neq \phi_j$, $i \neq j$.
The properties of the mixture density follow in a straightforward way from those of the components. In particular, the moment generating function is the same mixture of the moment generating functions of the component densities, so, for example, $E(Y|x) = \sum_{i=1}^{p} \pi_i \mu_i(x)$, where $\mu_i(x)$ is the mean of the $i$th component density.
Mixture densities may suffer from overparameterization, since the total number of parameters grows rapidly with the number of component densities. It is possible to constrain parameters across the mixtures.
Testing for the number of component densities is a tricky issue. For example, testing for $p = 1$ against $p = 2$ (a mixture of two components) involves the restriction $\pi_1 = 1$, which is on the boundary of the parameter space. Usual methods such as the likelihood ratio test are not applicable when parameters are on the boundary under the null hypothesis. Choosing the model by means of information criteria (see below) remains valid.
The following results are for a mixture of 2 NB-II models, for the OBDV data, which you can replicate using this program.
******************************************************
Mixed Negative Binomial model, MEPS 1996 full data set

MLE Estimation Results
BFGS convergence: Normal convergence

Average Log-L: -2.164783
Observations: 4564

             estimate  st. err   t-stat  p-value
constant        0.127    0.512    0.247    0.805
pub. ins.       0.861    0.174    4.962    0.000
priv. ins.      0.146    0.193    0.755    0.450
sex             0.346    0.115    3.017    0.003
age             0.024    0.004    6.117    0.000
edu             0.025    0.016    1.590    0.112
inc            -0.000    0.000   -0.214    0.831
alpha           1.351    0.168    8.061    0.000
constant        0.525    0.196    2.678    0.007
pub. ins.       0.422    0.048    8.752    0.000
priv. ins.      0.377    0.087    4.349    0.000
sex             0.400    0.059    6.773    0.000
age             0.296    0.036    8.178    0.000
edu             0.111    0.042    2.634    0.008
inc             0.014    0.051    0.274    0.784
alpha           1.034    0.187    5.518    0.000
Mix             0.257    0.162    1.582    0.114

Information Criteria
CAIC : 19920.3807      Avg. CAIC: 4.3647
 BIC : 19903.3807      Avg. BIC:  4.3610
 AIC : 19794.1395      Avg. AIC:  4.3370
******************************************************
It is worth noting that the mixture parameter is not significantly different from zero, but also note that the coefficients of public insurance and age, for example, differ quite a bit between the two latent classes.
\[ CAIC = -2 \ln L(\hat\theta) + k(\ln n + 1) \]
\[ BIC = -2 \ln L(\hat\theta) + k \ln n \]
\[ AIC = -2 \ln L(\hat\theta) + 2k \]
It can be shown that the CAIC and BIC will select the correctly specified model from a group of models, asymptotically. This doesn't mean, of course, that the correct model is necessarily in the group. The AIC is not consistent, and will asymptotically favor an over-parameterized model over the correctly specified model. Here are information criteria values for the models we've seen, for OBDV.

            AIC     BIC    CAIC
Poisson   7.345   7.355   7.357
NB-I      4.375   4.386   4.388
NB-II     4.373   4.385   4.386
MNB-II    4.337   4.361   4.365

Pretty clearly, the NB models are better than the Poisson. The one additional parameter gives a very significant improvement in the likelihood function value. Between the NB-I and NB-II models, the NB-II is slightly favored. But one should remember that information criteria values are statistics, with variances. With another sample, it may well be that the NB-I model would be favored, since the differences are so small. The MNB-II model is favored over the others, by all 3 information criteria.
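The three criteria are simple functions of the average log-likelihood, the sample size and the number of parameters. The sketch below uses a hypothetical function name, `information_criteria`; the inputs in the usage note are the NB-I values reported above (average log-likelihood -2.185730, 4564 observations, 8 parameters).

```python
import math

def information_criteria(avg_loglik, n, k):
    """AIC, BIC and CAIC from the average log-likelihood.

    avg_loglik: (1/n) * ln L at the estimate; n: observations; k: parameters.
    """
    minus_two_lnl = -2.0 * avg_loglik * n
    return {
        "AIC": minus_two_lnl + 2 * k,
        "BIC": minus_two_lnl + k * math.log(n),
        "CAIC": minus_two_lnl + k * (math.log(n) + 1),
    }
```

Calling `information_criteria(-2.185730, 4564, 8)` reproduces the NB-I values in the output above up to the rounding of the reported average log-likelihood; dividing each criterion by `n` gives the "Avg." figures used in the comparison table.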
Why is all of this in the chapter on QML? Let's suppose that the correct model for OBDV is in fact the NB-II model. It turns out in this case that the Poisson model will give consistent estimates of the slope parameters (if a model is a member of the linear-exponential family and the conditional mean is correctly specified, then the parameters of the conditional mean will be consistently estimated). So the Poisson estimator would be a QML estimator that is consistent for some parameters of the true model. The ordinary OPG or inverse Hessian "ML" covariance estimators are however biased and inconsistent, since the information matrix equality does not hold for QML estimators. But for i.i.d. data (which is the case for the MEPS data) the QML asymptotic covariance can be consistently estimated, as discussed above, using the sandwich form for the ML estimator. mle_results in fact reports sandwich results, so the Poisson estimation results would be reliable for inference even if the true model is the NB-I or NB-II. Note that they are in fact similar to the results for the NB models.
However, if we assume that the correct model is the MNB-II model, as is favored by the information criteria, then both the Poisson and NB-x models will have misspecified mean functions, so the parameters that influence the means would be estimated with bias and inconsistently.
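Exercises
Considering the MEPS data (the description is in Section 13.4.2), for the OBDV ($y$) measure, let $\eta$ be a latent index of health status that has expectation equal to unity. We suspect that $\eta$ and $PRIV$ may be correlated, but we assume that $\eta$ is uncorrelated with the other regressors. We suppose that
\[ y \sim Poisson(\lambda), \qquad \lambda = \exp(\beta_1 + \beta_2 PUB + \beta_3 PRIV + \dots)\,\eta. \tag{16.2} \]
Since much previous evidence indicates that health care services usage is overdispersed, this is almost certainly not an ML estimator, and thus is not efficient. However, when $\eta$ and $PRIV$ are uncorrelated, this estimator is consistent for the $\beta_i$ parameters, since the conditional mean is correctly specified in that case. When $\eta$ and $PRIV$ are correlated, Mullahy's (1997) NLIV estimator that uses the residual function
\[ \varepsilon = \frac{y}{\lambda} - 1, \]
where $\lambda$ is defined in equation 16.2, with appropriate instruments, is consistent. As instruments we use all the exogenous regressors, as well as the cross products of $PUB$ with the variables in $Z$. That is, the full set of instruments is
\[ W = \{1\ \ PUB\ \ Z\ \ PUB \times Z\}. \]
1. Calculate the Poisson QML estimates.
(a) Calculate the generalized IV estimates (do it using a GMM formulation - see the portfolio example for hints how to do this).
(b) Calculate the Hausman test statistic to test the exogeneity of PRIV.
(c) Comment on the results.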
Chapter 17
Nonlinear least squares (NLS)
Readings:
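Nonlinear least squares (NLS) is a means of estimating the parameter of the model
\[ y_t = f(x_t, \theta^0) + \varepsilon_t. \]
In general, $\varepsilon_t$ will be heteroscedastic and autocorrelated, and possibly nonnormally distributed. However, dealing with this is exactly as in the case of linear models, so we'll just treat the iid case here,
\[ \varepsilon_t \sim iid(0, \sigma^2). \]
If we stack the observations vertically, defining
\[ y = (y_1, y_2, ..., y_n)', \]
\[ f = (f(x_1, \theta), f(x_2, \theta), ..., f(x_n, \theta))' \]
and
\[ \varepsilon = (\varepsilon_1, \varepsilon_2, ..., \varepsilon_n)', \]
we can write the $n$ observations as
\[ y = f(\theta) + \varepsilon. \]
Using this notation, the NLS estimator can be defined as
\[ \hat\theta \equiv \arg\min_{\Theta} s_n(\theta) = \frac{1}{n} [y - f(\theta)]' [y - f(\theta)] = \frac{1}{n} \| y - f(\theta) \|^2. \]
The estimator minimizes the weighted sum of squared errors, which is the same as minimizing the Euclidean distance between $y$ and $f(\theta)$.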
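The objective function can be written as
\[ s_n(\theta) = \frac{1}{n} \left[ y'y - 2y'f(\theta) + f(\theta)'f(\theta) \right], \]
which gives the first order conditions
\[ -\left[ \frac{\partial}{\partial\theta} f(\hat\theta)' \right] y + \left[ \frac{\partial}{\partial\theta} f(\hat\theta)' \right] f(\hat\theta) \equiv 0. \]
Define the $n \times K$ matrix
\[ F(\hat\theta) \equiv D_{\theta'} f(\hat\theta). \tag{17.1} \]
In shorthand, use $\hat F$ in place of $F(\hat\theta)$. Using this, the first order conditions can be written as
\[ -\hat F' y + \hat F' f(\hat\theta) \equiv 0, \]
or
\[ \hat F' \left[ y - f(\hat\theta) \right] \equiv 0. \tag{17.2} \]
This bears a good deal of similarity to the f.o.c. for the linear model - the derivative of the prediction is orthogonal to the prediction error. If $f(\theta) = X\theta$, then $\hat F$ is simply $X$, so the f.o.c. simplify to
\[ X'y - X'X\beta = 0, \]
the usual OLS f.o.c.
Note that the nonlinearity of the manifold leads to potential multiple local maxima, minima and saddlepoints: the objective function $s_n(\theta)$ is not necessarily well-behaved and may be difficult to minimize.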
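The condition for asymptotic identification is that $s_n(\theta)$ tend to a limiting function $s_\infty(\theta)$ such that $s_\infty(\theta^0) < s_\infty(\theta)$, $\forall \theta \neq \theta^0$. This will be the case if $s_\infty(\theta)$ is strictly convex at $\theta^0$, which requires that $D_\theta^2 s_\infty(\theta^0)$ be positive definite. Consider the objective function:
\[ s_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} \left[ y_t - f(x_t, \theta) \right]^2 = \frac{1}{n} \sum_{t=1}^{n} \left[ f(x_t, \theta^0) + \varepsilon_t - f(x_t, \theta) \right]^2 \]
\[ = \frac{1}{n} \sum_{t=1}^{n} \left[ f_t(\theta^0) - f_t(\theta) \right]^2 + \frac{1}{n} \sum_{t=1}^{n} \varepsilon_t^2 + \frac{2}{n} \sum_{t=1}^{n} \left[ f_t(\theta^0) - f_t(\theta) \right] \varepsilon_t. \]
As in the example that illustrated the consistency of extremum estimators using OLS, we conclude that the second term will converge to a constant which does not depend upon $\theta$.
A LLN can be applied to the third term to conclude that it converges pointwise to 0, as long as $f(\theta)$ and $\varepsilon$ are uncorrelated.
Next, pointwise convergence needs to be strengthened to uniform almost sure convergence. There are a number of possible assumptions one could use. Here, we'll just assume it holds.
Turning to the first term, we'll assume a pointwise law of large numbers applies, so
\[ \frac{1}{n} \sum_{t=1}^{n} \left[ f_t(\theta^0) - f_t(\theta) \right]^2 \overset{a.s.}{\to} \int \left[ f(z, \theta^0) - f(z, \theta) \right]^2 d\mu(z), \tag{17.3} \]
where $\mu(x)$ is the distribution function of $x$. In many cases, $f(x, \theta)$ will be bounded and continuous, for all $\theta \in \Theta$, so strengthening to uniform almost sure convergence is immediate. For example, if $f(x, \theta) = [1 + \exp(-x\theta)]^{-1}$, $f: \Re^K \to (0, 1)$, a bounded range, and the function is continuous in $\theta$.
Given these results, it is clear that a minimizer is $\theta^0$. When considering identification (asymptotic), the question is whether or not there may be some other minimizer. A local condition for identification is that
\[ \frac{\partial^2}{\partial\theta\partial\theta'} s_\infty(\theta) = \frac{\partial^2}{\partial\theta\partial\theta'} \int \left[ f(x, \theta^0) - f(x, \theta) \right]^2 d\mu(x) \]
be positive definite at $\theta^0$. Evaluating this derivative, we obtain
\[ \left. \frac{\partial^2}{\partial\theta\partial\theta'} \int \left[ f(x, \theta^0) - f(x, \theta) \right]^2 d\mu(x) \right|_{\theta^0} = 2 \int \left[ D_\theta f(z, \theta^0) \right] \left[ D_{\theta'} f(z, \theta^0) \right] d\mu(z), \]
the expectation of the outer product of the gradient of the regression function evaluated at $\theta^0$. (Note: the uniform boundedness we have already assumed allows passing the derivative through the integral, by the dominated convergence theorem.) This matrix will be positive definite (wp1) as long as the gradient vector is of full rank (wp1). The tangent space to the regression manifold must span a $K$-dimensional space if we are to consistently estimate a $K$-dimensional parameter vector. This is analogous to the requirement that there be no perfect colinearity in a linear model. This is a necessary condition for identification. Note that the LLN implies that the above expectation is equal to
\[ J_\infty(\theta^0) = 2 \lim E \frac{F'F}{n}. \]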
17.3 Consistency
We simply assume that the conditions of Theorem 19 hold, so the estimator is consistent. Given that the strong stochastic equicontinuity conditions hold, as discussed above, and given the above identification conditions and a compact estimation space (the closure of the parameter space $\Theta$), the assumptions of the consistency theorem are satisfied.
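As in the case of GMM, we also simply assume that the conditions for asymptotic normality as in Theorem 22 hold. The only remaining problem is to determine the form of the asymptotic variance-covariance matrix. Recall that the result of the asymptotic normality theorem is
\[ \sqrt{n}\left( \hat\theta - \theta^0 \right) \overset{d}{\to} N\left[ 0, J_\infty(\theta^0)^{-1} I_\infty(\theta^0) J_\infty(\theta^0)^{-1} \right], \]
where $J_\infty(\theta^0)$ is the almost sure limit of $\frac{\partial^2}{\partial\theta\partial\theta'} s_n(\theta)$ evaluated at $\theta^0$, and
\[ I_\infty(\theta^0) = \lim Var \sqrt{n} D_\theta s_n(\theta^0). \]
The objective function is
\[ s_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} \left[ y_t - f(x_t, \theta) \right]^2. \]
So
\[ D_\theta s_n(\theta) = -\frac{2}{n} \sum_{t=1}^{n} \left[ y_t - f(x_t, \theta) \right] D_\theta f(x_t, \theta). \]
Evaluating at $\theta^0$,
\[ D_\theta s_n(\theta^0) = -\frac{2}{n} \sum_{t=1}^{n} \varepsilon_t D_\theta f(x_t, \theta^0). \]
Note that the expectation of this is zero, since $\varepsilon_t$ and $x_t$ are assumed to be uncorrelated. So to calculate the variance, we can simply calculate the second moment about zero. Also note that
\[ \sum_{t=1}^{n} \varepsilon_t D_\theta f(x_t, \theta^0) = \frac{\partial}{\partial\theta} \left[ f(\theta^0) \right]' \varepsilon = F'\varepsilon. \]
With this we obtain
\[ I_\infty(\theta^0) = \lim Var \sqrt{n} D_\theta s_n(\theta^0) = \lim n E \frac{4}{n^2} F'\varepsilon\varepsilon'F = 4\sigma^2 \lim E \frac{F'F}{n}. \]
We've already seen that
\[ J_\infty(\theta^0) = 2 \lim E \frac{F'F}{n}. \]
Combining these expressions for $J_\infty(\theta^0)$ and $I_\infty(\theta^0)$, and the result of the asymptotic normality theorem, we get
\[ \sqrt{n}\left( \hat\theta - \theta^0 \right) \overset{d}{\to} N\left( 0, \left( \lim E \frac{F'F}{n} \right)^{-1} \sigma^2 \right). \]
We can consistently estimate the variance covariance matrix using
\[ \left( \frac{\hat F'\hat F}{n} \right)^{-1} \hat\sigma^2, \tag{17.4} \]
where $\hat F$ is defined as in equation 17.1 and
\[ \hat\sigma^2 = \frac{\left[ y - f(\hat\theta) \right]' \left[ y - f(\hat\theta) \right]}{n}, \]
the obvious estimator. Note the close correspondence to the results for the linear model.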
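Suppose that $y_t$ conditional on $x_t$ is independently distributed Poisson. A Poisson random variable is a count data variable, which means it can take the values {0, 1, 2, ...}. This sort of model has been used to study visits to doctors per year, number of patents registered by businesses per year, etc.
The Poisson density is
\[ f(y_t) = \frac{\exp(-\lambda_t) \lambda_t^{y_t}}{y_t!}, \quad y_t \in \{0, 1, 2, ...\}. \]
The mean of $y_t$ is $\lambda_t$, as is the variance. Note that $\lambda_t$ must be positive. Suppose that the true mean is
\[ \lambda_t^0 = \exp(x_t'\beta^0), \]
which enforces the positivity of $\lambda_t$. Suppose we estimate $\beta^0$ by nonlinear least squares:
\[ \hat\beta = \arg\min s_n(\beta) = \frac{1}{n} \sum_{t=1}^{n} \left( y_t - \exp(x_t'\beta) \right)^2. \]
We can write
\[ s_n(\beta) = \frac{1}{n} \sum_{t=1}^{n} \left( \exp(x_t'\beta^0) + \varepsilon_t - \exp(x_t'\beta) \right)^2 \]
\[ = \frac{1}{n} \sum_{t=1}^{n} \left( \exp(x_t'\beta^0) - \exp(x_t'\beta) \right)^2 + \frac{1}{n} \sum_{t=1}^{n} \varepsilon_t^2 + \frac{2}{n} \sum_{t=1}^{n} \varepsilon_t \left( \exp(x_t'\beta^0) - \exp(x_t'\beta) \right). \]
The last term has expectation zero since the assumption that $E(y_t|x_t) = \exp(x_t'\beta^0)$ implies that $E(\varepsilon_t|x_t) = 0$, which in turn implies that functions of $x_t$ are uncorrelated with $\varepsilon_t$. Applying a strong LLN, and noting that the objective function is continuous on a compact parameter space, we get
\[ s_\infty(\beta) = E_x \left( \exp(x'\beta^0) - \exp(x'\beta) \right)^2 + E_x \exp(x'\beta^0), \]
where the last term comes from the fact that the conditional variance of $\varepsilon$ is the same as the variance of $y$. This function is clearly minimized at $\beta = \beta^0$, so the NLS estimator is consistent as long as identification holds.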
n 0 .
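The Gauss-Newton optimization technique is specifically designed for nonlinear least squares. The idea is to linearize the nonlinear model, rather than the objective function. The model is
\[ y = f(\theta^0) + \varepsilon. \]
At some $\theta$ in the parameter space, not equal to $\theta^0$, we have
\[ y = f(\theta) + \nu, \]
where $\nu$ is a combination of the fundamental error term $\varepsilon$ and the error due to evaluating the regression function at $\theta$ rather than the true value $\theta^0$. Take a first order Taylor's series approximation around a point $\theta^1$:
\[ y = f(\theta^1) + \left[ D_{\theta'} f(\theta^1) \right] \left( \theta - \theta^1 \right) + \nu + \text{approximation error}. \]
Define $z \equiv y - f(\theta^1)$ and $b \equiv (\theta - \theta^1)$. Then the last equation can be written as
\[ z = F(\theta^1) b + \omega, \]
where, as above, $F(\theta^1) \equiv D_{\theta'} f(\theta^1)$ is the $n \times K$ matrix of derivatives of the regression function, evaluated at $\theta^1$, and $\omega$ is $\nu$ plus approximation error from the truncated Taylor's series.
Note that $F$ is known, given $\theta^1$.
Note that one could estimate $b$ simply by performing OLS on the above equation.
Given $\hat b$, we calculate a new round estimate of $\theta^0$ as $\theta^2 = \hat b + \theta^1$. With this, take a new Taylor's series expansion around $\theta^2$ and repeat the process. Stop when $\hat b = 0$ (to within a specified tolerance).
To see why this might work, consider the above approximation, but evaluated at the NLS estimator:
\[ y = f(\hat\theta) + F(\hat\theta) \left( \theta - \hat\theta \right) + \omega. \]
The OLS estimate of $b \equiv \theta - \hat\theta$ is
\[ \hat b = \left( \hat F'\hat F \right)^{-1} \hat F' \left[ y - f(\hat\theta) \right]. \]
This must be zero, since
\[ \hat F' \left[ y - f(\hat\theta) \right] \equiv 0 \]
by definition of the NLS estimator (these are the normal equations as in equation 17.2). Since $\hat b \equiv 0$ when we evaluate at $\hat\theta$, updating would stop.
The Gauss-Newton method doesn't require second derivatives, as does the Newton-Raphson method, so it's faster.
The varcov estimator, as in equation 17.4, is simple to calculate, since we have $\hat F$ as a by-product of the estimation process (i.e., it's just the matrix of regressors in the last iteration). In fact, a normal OLS program will give the NLS varcov estimator directly, since it's just the OLS varcov estimator from the last iteration.
The method can suffer from convergence problems since $F(\theta)'F(\theta)$ may be very nearly singular, even with an asymptotically identified model, especially if $\theta$ is very far from $\hat\theta$. Consider the example
\[ y = \beta_1 + \beta_2 x_t^{\beta_3} + \varepsilon_t. \]
When evaluated at $\beta_2 \approx 0$, $\beta_3$ has virtually no effect on the NLS objective function, so $F$ will have rank that is "essentially" 2, rather than 3. In this case, $F'F$ will be nearly singular.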
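The iteration can be sketched for a one-parameter model, where the OLS step reduces to a ratio of sums. The regression function $f(x, \theta) = \exp(\theta x)$, the function name `gauss_newton_exp` and the stopping rule are assumptions chosen for this illustration; a general implementation would solve a $K$-dimensional least squares problem at each step.

```python
import math

def gauss_newton_exp(y, x, theta_start, tol=1e-10, max_iter=100):
    """Gauss-Newton for y_t = exp(theta * x_t) + error, with scalar theta.

    Each step regresses z = y - f(theta) on F = d f / d theta by OLS and
    updates theta by the estimated coefficient b. Returns (theta, avar),
    where avar = sigma2_hat / (F'F / n), as in equation 17.4.
    """
    theta = theta_start
    n = len(y)
    for _ in range(max_iter):
        f = [math.exp(theta * xt) for xt in x]
        F = [xt * ft for xt, ft in zip(x, f)]      # derivative of f wrt theta
        z = [yt - ft for yt, ft in zip(y, f)]      # current residuals
        b = sum(Ft * zt for Ft, zt in zip(F, z)) / sum(Ft * Ft for Ft in F)
        theta += b
        if abs(b) < tol:
            break
    # Recompute residuals and derivatives at the final estimate.
    f = [math.exp(theta * xt) for xt in x]
    F = [xt * ft for xt, ft in zip(x, f)]
    sigma2 = sum((yt - ft) ** 2 for yt, ft in zip(y, f)) / n
    avar = sigma2 / (sum(Ft * Ft for Ft in F) / n)
    return theta, avar
```

With a poor starting value the step `b` can overshoot, which is the scalar analogue of the near-singularity problem described above; practical implementations usually add a step-length search.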
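Readings: Heckman, "Sample Selection Bias as a Specification Error", Econometrica, 1979 (this is a classic article, not required for reading, and which is a bit out-dated; nevertheless it's a good place to start if you encounter sample selection problems in your research).
Sample selection is a common problem in applied research. The problem occurs when observations used in estimation are sampled non-randomly, according to some selection scheme.
Labor supply of a person is a positive number of hours per unit time supposing the offer wage is higher than the reservation wage. The model (very simple, with $t$ subscripts suppressed):
Latent labor supply: $s^* = x'\beta + \omega$
Offer wage: $w^o = z'\gamma + \nu$
Reservation wage: $w^r = q'\delta + \eta$
Write the wage differential as
\[ w^* = \left( z'\gamma + \nu \right) - \left( q'\delta + \eta \right) \equiv r'\theta + \varepsilon. \]
We have the set of equations
\[ s^* = x'\beta + \omega \]
\[ w^* = r'\theta + \varepsilon. \]
Assume that
\[ \begin{bmatrix} \omega \\ \varepsilon \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma^2 & \rho\sigma \\ \rho\sigma & 1 \end{bmatrix} \right). \]
We assume that the offer wage and the reservation wage, as well as the latent variable $s^*$, are unobservable. What is observed is
\[ w = 1[w^* > 0] \]
\[ s = w s^*. \]
In other words, we observe whether or not a person is working. If the person is working, we observe labor supply, which is equal to latent labor supply, $s^*$. Otherwise, $s = 0 \neq s^*$. Note that we are using a simplifying assumption that individuals can freely choose their weekly hours of work.
Suppose we estimated the model
\[ s^* = x'\beta + \text{residual} \]
using only observations for which $s > 0$. The problem is that these observations are those for which $w^* > 0$, or equivalently, $-\varepsilon < r'\theta$, and
\[ E\left[ \omega | -\varepsilon < r'\theta \right] \neq 0, \]
since $\varepsilon$ and $\omega$ are dependent. Furthermore, this expectation will in general depend on $x$ since elements of $x$ can enter in $r$. Because of these two facts, least squares estimation is biased and inconsistent.
Consider more carefully $E\left[ \omega | -\varepsilon < r'\theta \right]$. Given the joint normality of $\omega$ and $\varepsilon$, we can write
\[ \omega = \rho\sigma\varepsilon + \eta, \]
where $\eta$ has mean zero and is independent of $\varepsilon$. With this we can write
\[ s^* = x'\beta + \rho\sigma\varepsilon + \eta. \]
If we condition this equation on $-\varepsilon < r'\theta$ we get
\[ s = x'\beta + \rho\sigma E(\varepsilon | -\varepsilon < r'\theta) + \eta, \]
which may be written as
\[ s = x'\beta + \rho\sigma E(\varepsilon | \varepsilon > -r'\theta) + \eta. \]
A useful result is that for $z \sim N(0, 1)$,
\[ E(z | z > z^*) = \frac{\phi(z^*)}{\Phi(-z^*)}, \]
where $\phi(\cdot)$ and $\Phi(\cdot)$ are the standard normal density and distribution function, respectively. The quantity on the RHS above is known as the inverse Mill's ratio:
\[ IMR(z^*) = \frac{\phi(z^*)}{\Phi(-z^*)}. \]
With this we can write (making use of the fact that the standard normal density is symmetric about zero, so that $\phi(-a) = \phi(a)$):
\[ s = x'\beta + \rho\sigma \frac{\phi(r'\theta)}{\Phi(r'\theta)} + \eta \tag{17.5} \]
\[ \equiv \begin{bmatrix} x' & \frac{\phi(r'\theta)}{\Phi(r'\theta)} \end{bmatrix} \begin{bmatrix} \beta \\ \zeta \end{bmatrix} + \eta, \tag{17.6} \]
where $\zeta = \rho\sigma$. The error term $\eta$ has conditional mean zero, and is uncorrelated with the regressors $x$ and $\frac{\phi(r'\theta)}{\Phi(r'\theta)}$.
Heckman showed how one can estimate this in a two step procedure where first $\theta$ is estimated, then equation 17.6 is estimated by least squares using the estimated value of $\theta$ to form the regressors. This is inefficient and estimation of the covariance is a tricky issue.
The model presented above depends strongly on joint normality. There exist many alternative models which weaken the maintained assumptions. It is possible to estimate consistently without distributional assumptions. See Ahn and Powell, Journal of Econometrics, 1994.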
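The inverse Mill's ratio and the extra regressor of equation 17.6 need only the standard normal density and distribution function, both available through the standard library. This is a minimal sketch; the function names are hypothetical, and the index passed to `selection_regressor` stands for an estimate of $r'\theta$ from a first-step estimation, which is assumed to be available.

```python
import math

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal distribution function, via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def inverse_mills(z_star):
    """E(z | z > z_star) for z ~ N(0, 1): phi(z_star) / Phi(-z_star)."""
    return norm_pdf(z_star) / norm_cdf(-z_star)

def selection_regressor(index):
    """Regressor phi(index) / Phi(index) appearing in equation 17.6."""
    return norm_pdf(index) / norm_cdf(index)
```

By the symmetry of the normal density, `selection_regressor(a)` equals `inverse_mills(-a)`, which is exactly the step used to pass from the conditional expectation to equation 17.5.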
Chapter 18
Nonparametric inference
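18.1 Possible pitfalls of parametric inference: estimation
Readings: H. White (1980) "Using Least Squares to Approximate Unknown Regression Functions," International Economic Review, pp. 149-70.
In this section we consider a simple example, which illustrates why nonparametric methods may in some cases be preferred to parametric methods.
We suppose that data is generated by random sampling of $(y, x)$, where $y = f(x) + \varepsilon$, $x$ is uniformly distributed on $(0, 2\pi)$, and $\varepsilon$ is a classical error. Suppose that
\[ f(x) = 1 + \frac{3x}{2\pi} - \left( \frac{x}{2\pi} \right)^2. \]
The problem of interest is to estimate the elasticity of $f(x)$ with respect to $x$, throughout the range of $x$.
In general, the functional form of $f(x)$ is unknown. One idea is to take a Taylor's series approximation to $f(x)$ about some point $x_0$. We'll work with a first order approximation, for simplicity. Approximating about $x_0$:
\[ h(x) = f(x_0) + D_x f(x_0)(x - x_0). \]
If the approximation point is $x_0 = 0$, we can write
\[ h(x) = a + bx. \]
The coefficient $a$ is the value of the function at $x = 0$, and the slope is the value of the derivative at $x = 0$. These are of course not known. One might try estimation by ordinary least squares. The objective function is
\[ s(a, b) = \frac{1}{n} \sum_{t=1}^{n} \left( y_t - h(x_t) \right)^2. \]
The limiting objective function, following the argument we used to get equations 14.1 and 17.3, is
\[ s_\infty(a, b) = \int_0^{2\pi} \left( f(x) - h(x) \right)^2 d\mu(x). \]
The theorem regarding the consistency of extremum estimators (Theorem 19) tells us that $\hat a$ and $\hat b$ will converge almost surely to the values that minimize the limiting objective function. Solving the first order conditions reveals that $s_\infty(a, b)$ obtains its minimum at $\left\{ a^0 = \frac{7}{6}, b^0 = \frac{1}{\pi} \right\}$. The estimated approximating function $\hat h(x)$ therefore tends almost surely to
\[ h_\infty(x) = 7/6 + x/\pi. \]
[Figure 18.1: the true function and the limiting linear approximation (legend: approx, true)]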
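The limiting coefficients can be checked numerically by computing the least squares projection of $f(x)$ on a constant and $x$ under the uniform distribution on $(0, 2\pi)$. The sketch below approximates the population moments with a midpoint rule; the function names and the number of grid points `m` are choices made for this illustration.

```python
import math

def f(x):
    """True regression function of the example."""
    u = x / (2.0 * math.pi)
    return 1.0 + 3.0 * u - u * u

def limiting_linear_fit(m=20000):
    """Population least squares coefficients (a, b) of f(x) on (1, x),
    with x uniform on (0, 2*pi), using a midpoint rule with m points."""
    width = 2.0 * math.pi
    xs = [(i + 0.5) * width / m for i in range(m)]
    ex = sum(xs) / m
    ey = sum(f(x) for x in xs) / m
    exx = sum(x * x for x in xs) / m
    exy = sum(x * f(x) for x in xs) / m
    b = (exy - ex * ey) / (exx - ex * ex)   # Cov(x, f(x)) / Var(x)
    a = ey - b * ex
    return a, b
```

The returned values agree with $a^0 = 7/6$ and $b^0 = 1/\pi$, whereas a first order Taylor expansion of $f$ about $x_0 = 0$ has intercept $f(0) = 1$ and slope $f'(0) = 3/(2\pi)$, which is why the least squares limit is inconsistent even at the approximation point.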
In Figure 18.1 we see the true function and the limit of the approximation to see the asymptotic bias as a function of $x$. (The approximating model is the straight line, the true model has curvature.) Note that the approximating model is in general inconsistent, even at the approximation point. This shows that "flexible functional forms" based upon Taylor's series approximations do not in general lead to consistent estimation of functions.
The approximating model seems to fit the true model fairly well, asymptotically. However, we are interested in the elasticity of the function. Recall that an elasticity is the marginal function divided by the average function:
\[ \varepsilon(x) = x\phi'(x)/\phi(x). \]
Good approximation of the elasticity over the range of $x$ will require a good approximation of both $f(x)$ and $f'(x)$ over the range of $x$. The approximating elasticity is
\[ \eta(x) = x h'(x)/h(x). \]
[Figure 18.2: the true elasticity and the elasticity from the limiting linear approximation (legend: approx, true)]
In Figure 18.2 we see the true elasticity and the elasticity obtained from the limiting approximating model. The true elasticity is the line that has negative slope for large $x$. Visually we see that the elasticity is not approximated so well. Root mean squared error in the approximation of the elasticity is
\[ \left( \int_0^{2\pi} \left( \varepsilon(x) - \eta(x) \right)^2 dx \right)^{1/2} = .31546. \]
Now suppose we use the leading terms of a trigonometric series as the approximating model. The reason for using a trigonometric series as an approximating model is motivated by the asymptotic properties of the Fourier flexible functional form (Gallant, 1981, 1982), which we will study in more detail below. Normally with this type of model the number of basis functions is an increasing function of the sample size. Here we hold the set of basis functions fixed. We will consider the asymptotic behavior of a fixed model, which we interpret as an approximation to the estimator's behavior in finite samples. Consider the set of basis functions:
\[ Z(x) = \begin{bmatrix} 1 & x & \cos(x) & \sin(x) & \cos(2x) & \sin(2x) \end{bmatrix}. \]
The approximating model is
\[ g_K(x) = Z(x)\alpha. \]
Maintaining these basis functions as the sample size increases, we find that the limiting objective function is minimized at
\[ \left\{ a_1 = \frac{7}{6}, a_2 = \frac{1}{\pi}, a_3 = -\frac{1}{\pi^2}, a_4 = 0, a_5 = -\frac{1}{4\pi^2}, a_6 = 0 \right\}. \]
Substituting these values into $g_K(x)$ we obtain the almost sure limit of the approximation
\[ g_\infty(x) = 7/6 + x/\pi + (\cos x)\left( -\frac{1}{\pi^2} \right) + (\sin x)\, 0 + (\cos 2x)\left( -\frac{1}{4\pi^2} \right) + (\sin 2x)\, 0. \tag{18.1} \]
[Figure 18.3: the true function and the limiting trigonometric approximation (legend: approx, true)]
[Figure 18.4: the true elasticity and the elasticity from the limiting trigonometric approximation (legend: approx, true)]
In Figure 18.3 we have the approximation and the true function: clearly the truncated trigonometric series model offers a better approximation, asymptotically, than does the linear model. In Figure 18.4 we have the more flexible approximation's elasticity and that of the true function: on average, the fit is better, though there is some implausible waviness in the estimate. Root mean squared error in the approximation of the elasticity is
\[ \left( \int_0^{2\pi} \left( \varepsilon(x) - \frac{g_\infty'(x) x}{g_\infty(x)} \right)^2 dx \right)^{1/2} = .16213, \]
about half that of the RMSE when the first order approximation is used. If the trigonometric series contained infinite terms, this error measure would be driven to zero, as we shall see.
Consider means of testing for the hypothesis that consumers maximize utility. A consequence of utility maximization is that the Slutsky matrix $D_p h(p, U)$, where $h(p, U)$ are a set of compensated demand functions, must be negative semi-definite. One approach to testing for utility maximization would estimate a set of normal demand functions $x(p, m)$.
Estimation of these functions by normal parametric methods requires specification of the functional form of demand, for example
\[ x(p, m) = x(p, m, \theta^0) + \varepsilon, \quad \theta^0 \in \Theta^0, \]
where $x(p, m, \theta^0)$ is a function of known form and $\theta^0$ is a finite dimensional parameter.
The problem with this is that the reason for rejection of the theoretical proposition may be that our choice of functional form is incorrect. In the introductory section we saw that functional form misspecification leads to inconsistent estimation of the function and its derivatives.
Testing using parametric models always means we are testing a compound hypothesis. The hypothesis that is tested is 1) the economic proposition we wish to test, and 2) the model is correctly specified. Failure of either 1) or 2) can lead to rejection. This is known as the "model-induced augmenting hypothesis."
Varian's WARP allows one to test for utility maximization without specifying the form of the demand functions. The only assumptions used in the test are those directly implied by theory, so rejection of the hypothesis calls into question the theory.
Cambridge.
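Suppose we have a multivariate model
$$y = f(x) + \varepsilon,$$
where $f(x)$ is of unknown form and $x$ is a $P$-dimensional vector. For simplicity, assume that $\varepsilon$ is a classical error. Consider estimation of the vector of elasticities with typical element
$$\xi_{x_i} = \frac{x_i}{f(x)} \frac{\partial f(x)}{\partial x_i},$$
at an arbitrary point $x_i$.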
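The Fourier form, following Gallant (1982), but with a somewhat different parameterization, may be written as
$$g_K(x \mid \theta_K) = \alpha + x'\beta + \tfrac{1}{2} x'Cx + \sum_{\alpha=1}^{A} \sum_{j=1}^{J} \left( u_{j\alpha} \cos(j k_\alpha' x) - v_{j\alpha} \sin(j k_\alpha' x) \right), \qquad (18.2)$$
where the $K$-dimensional parameter vector
$$\theta_K = \{\alpha, \beta', \mathrm{vec}^*(C)', u_{11}, v_{11}, \ldots, u_{JA}, v_{JA}\}'. \qquad (18.3)$$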
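We assume that the conditioning variables $x$ have each been transformed to lie in an interval that is shorter than $2\pi$. This is required to avoid periodic behavior of the approximation, which is desirable since economic functions aren't periodic. For example, subtract sample means, divide by the maxima of the conditioning variables, and multiply by $2\pi - eps$, where $eps$ is some positive number less than $2\pi$ in value.

The $k_\alpha$ are "elementary multi-indices", which are simply $P$-vectors formed of integers (negative, positive and zero). The $k_\alpha$, $\alpha = 1, 2, ..., A$ are required to be linearly independent, and we follow the convention that the first non-zero element be positive. For example
$$\begin{bmatrix} 0 & 1 & -1 & 0 & 1 \end{bmatrix}'$$
is a potential multi-index to be used, but
$$\begin{bmatrix} 0 & -1 & -1 & 0 & 1 \end{bmatrix}'$$
is not, since its first non-zero element is negative. Nor is
$$\begin{bmatrix} 0 & 2 & -2 & 0 & 2 \end{bmatrix}'$$
a multi-index we would use, since it is a scalar multiple of the original multi-index.

We parameterize the matrix $C$ differently than does Gallant because it simplifies things in practice. The cost of this is that we are no longer able to test a quadratic specification using nested testing.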
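The vector of first partial derivatives is
$$D_x g_K(x \mid \theta_K) = \beta + Cx + \sum_{\alpha=1}^{A} \sum_{j=1}^{J} \left[ \left( -u_{j\alpha} \sin(j k_\alpha' x) - v_{j\alpha} \cos(j k_\alpha' x) \right) j k_\alpha \right] \qquad (18.4)$$
and the matrix of second partial derivatives is
$$D_x^2 g_K(x \mid \theta_K) = C + \sum_{\alpha=1}^{A} \sum_{j=1}^{J} \left[ \left( -u_{j\alpha} \cos(j k_\alpha' x) + v_{j\alpha} \sin(j k_\alpha' x) \right) j^2 k_\alpha k_\alpha' \right] \qquad (18.5)$$
To define a compact notation for partial derivatives, let $\lambda$ be an $N$-dimensional multi-index with no negative elements. Define $|\lambda|^*$ as the sum of the elements of $\lambda$. If we have $N$ arguments $x$ of the (arbitrary) function $h(x)$, use $D^\lambda h(x)$ to indicate a certain partial derivative:
$$D^\lambda h(x) \equiv \frac{\partial^{|\lambda|^*}}{\partial x_1^{\lambda_1} \partial x_2^{\lambda_2} \cdots \partial x_N^{\lambda_N}} h(x)$$
When $\lambda$ is the zero vector, $D^\lambda h(x) \equiv h(x)$. Taking this definition and the last few equations into account, we see that it is possible to define a $(1 \times K)$ vector $z^\lambda(x)'$ so that
$$D^\lambda g_K(x \mid \theta_K) = z^\lambda(x)' \theta_K. \qquad (18.6)$$
Both the approximating model and the derivatives of the approximating model are linear in the parameters.
For the approximating model to the function (not derivatives), write $g_K(x \mid \theta_K) = z'\theta_K$ for simplicity.
The following theorem can be used to prove the consistency of the Fourier form.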
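Suppose that $\hat h_n$ is obtained by maximizing a sample objective function $s_n(h)$ over $\mathcal{H}_{K_n}$, where $\mathcal{H}_K$ is a subset of some function space $\mathcal{H}$ on which is defined a norm $\| h \|$. Consider the following conditions:
(a) Compactness: The closure of $\mathcal{H}$ with respect to $\| h \|$ is compact in the relative topology defined by $\| h \|$.
(b) Denseness: $\cup_K \mathcal{H}_K$, $K = 1, 2, 3, ...$ is a dense subset of the closure of $\mathcal{H}$ with respect to $\| h \|$ and $\mathcal{H}_K \subset \mathcal{H}_{K+1}$.
(c) Uniform convergence: There is a point $h^*$ in $\mathcal{H}$ and there is a function $s_\infty(h, h^*)$ that is continuous in $h$ with respect to $\| h \|$ such that
$$\lim_{n \to \infty} \sup_{\mathcal{H}} \left| s_n(h) - s_\infty(h, h^*) \right| = 0$$
almost surely.
(d) Identification: Any point $h$ in the closure of $\mathcal{H}$ with $s_\infty(h, h^*) \geq s_\infty(h^*, h^*)$ must have $\| h - h^* \| = 0$.
Under these conditions $\lim_{n \to \infty} \| h^* - \hat h_n \| = 0$ almost surely, provided that $\lim_{n \to \infty} K_n = \infty$ almost surely.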
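The modification of the original statement of the theorem that has been made is to set the parameter space $\Theta$ in Gallant's (1987) statement to a single point and to state the theorem in terms of maximization rather than minimization. This theorem is very similar in form to Theorem 19. The main differences are:

1. A generic norm $\| h \|$ is used in place of the Euclidean norm. This norm may be stronger than the Euclidean norm, so that convergence with respect to $\| h \|$ implies convergence w.r.t. the Euclidean norm. Typically we will want to make sure that the norm is strong enough to imply convergence of all functions of interest.

2. The estimation space $\mathcal{H}$ is a function space. There is no restriction to a parametric family, only a restriction to a space of functions that satisfy certain conditions. This formulation is much less restrictive than the restriction to a parametric family.

3. There is a denseness assumption that was not present in the other theorem.

We will not prove this theorem (the proof is quite similar to the proof of Theorem 19, see Gallant, 1987) but we will discuss its assumptions, in relation to the Fourier form as the approximating model.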
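We need a norm $\| h \|$ that guarantees that the errors in approximation of the functions we are interested in are accounted for. Since we are interested in first-order elasticities in the present case, we need close approximation of both the function $f(x)$ and its first derivative, throughout the range of $x$. Let $X$ be an open set that contains all values of $x$ that we're interested in. The Sobolev norm is appropriate in this case. It is
$$\| h \|_{m,X} = \max_{|\lambda^*| \leq m} \sup_{X} \left| D^\lambda h(x) \right|$$
To see whether or not the function $f(x)$ is well approximated by an approximating model $g_K(x \mid \theta_K)$, we would evaluate
$$\| f(x) - g_K(x \mid \theta_K) \|_{m,X}.$$
We see that this norm takes into account errors in approximating the function and partial derivatives up to order $m$. If we want to estimate first order elasticities, as is the case in this example, the relevant $m$ would be $m = 1$. Furthermore, since we examine the sup over $X$, convergence w.r.t. the Sobolev norm means uniform convergence, so that we obtain consistent estimates for all values of $x$.
Econometrica,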
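If we need consistency w.r.t. $\| h \|_{m,X}$, then the functions of interest must belong to a Sobolev space which takes into account derivatives of order $m + 1$. A Sobolev space is the set of functions
$$W_{m,X}(D) = \{ h(x) : \| h(x) \|_{m,X} < D \},$$
where $D$ is a finite constant.

Definition 29 [Estimation space] The estimation space $\mathcal{H} = W_{2,X}(D)$.

So we are assuming that the function to be estimated has bounded second derivatives throughout $X$.

With seminonparametric estimators, we don't actually optimize over the estimation space. Rather, we optimize over a subspace, $\mathcal{H}_{K_n}$, the set of Fourier forms $g_K(x \mid \theta_K)$ that belong to the estimation space.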
18.3.4 Denseness
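The important point here is that $\mathcal{H}_K$ is a space of functions that is indexed by a finite dimensional parameter ($\theta_K$ has $K$ elements). With $n$ observations, $n > K$, this parameter is estimable. Note that the true function $h^*$ is not necessarily an element of $\mathcal{H}_K$, so optimization over $\mathcal{H}_K$ may not lead to a consistent estimator. In order for optimization over $\mathcal{H}_K$ to be equivalent to optimization over $\mathcal{H}$, at least asymptotically, we need that:

1. The dimension of the parameter vector, $\dim \theta_{K_n} \to \infty$ as $n \to \infty$. This is achieved by making $A$ and $J$ in equation 18.2 increasing functions of $n$, the sample size.

2. The $\mathcal{H}_K$ be dense subsets of $\mathcal{H}$. A set of subsets $A_a$ of a set $A$ is "dense" if the closure of the countable union of the subsets is equal to the closure of $A$:
$$\overline{\cup_{a=1}^{\infty} A_a} = \overline{A}$$

Use a picture here. The rest of the discussion of denseness is provided just for completeness: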
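there's no need to study it in detail. To show that $\mathcal{H}_K$ is a dense subset of $\mathcal{H}$ with respect to $\| h \|_{1,X}$, it is useful to apply a theorem of Edmunds and Moscatelli (1977), as presented by Gallant (1982):

Theorem 31 [Edmunds and Moscatelli, 1977] Let the real-valued function $h^*(x)$ be continuously differentiable up to order $m$ on an open set containing the closure of $X$. Then it is possible to choose a triangular array of coefficients $\theta_1, \theta_2, \ldots, \theta_K, \ldots$ such that for every $q$ with $0 \leq q < m$, and every $\varepsilon > 0$, $\| h^*(x) - h_K(x \mid \theta_K) \|_{q,X} = o(K^{-m+q+\varepsilon})$ as $K \to \infty$.

In the present application, $q = 1$, and $m = 2$. By definition of the estimation space, the elements of $\mathcal{H}$ satisfy the differentiability requirement on $X$, so the theorem is applicable. Closely following Gallant and Nychka (1987), $\cup_\infty \mathcal{H}_K$ is the countable union of the $\mathcal{H}_K$. The implication of Theorem 31 is that there is a sequence of $\{h_K\}$ from $\cup_\infty \mathcal{H}_K$ such that
$$\lim_{K \to \infty} \| h^* - h_K \|_{1,X} = 0,$$
for all $h^* \in \mathcal{H}$. Therefore,
$$\mathcal{H} \subset \overline{\cup_\infty \mathcal{H}_K}.$$
However,
$$\cup_\infty \mathcal{H}_K \subset \mathcal{H},$$
so
$$\overline{\cup_\infty \mathcal{H}_K} \subset \overline{\mathcal{H}}.$$
Therefore
$$\overline{\mathcal{H}} = \overline{\cup_\infty \mathcal{H}_K},$$
so $\cup_\infty \mathcal{H}_K$ is a dense subset of $\mathcal{H}$, with respect to the norm $\| h \|_{1,X}$.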
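The sample objective function, stated in terms of maximization, is
$$s_n(\theta_K) = -\frac{1}{n} \sum_{t=1}^{n} \left( y_t - g_K(x_t \mid \theta_K) \right)^2$$
With random sampling, as in the case of Equations 14.1 and 17.3, the limiting objective function is
$$s_\infty(g, f) = -\int_X \left( f(x) - g(x) \right)^2 d\mu x - \sigma_\varepsilon^2, \qquad (18.7)$$
where the true function $f(x)$ takes the place of the generic function $h^*$ in the presentation of the theorem. Both $g(x)$ and $f(x)$ are elements of $\overline{\cup_\infty \mathcal{H}_K}$.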
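The pointwise convergence of the objective function needs to be strengthened to uniform convergence. We will simply assume that this holds, since the way to verify this depends upon the specific application. We also have continuity of the objective function in $g$, with respect to the norm $\| h \|_{1,X}$, since
$$\lim_{\| g^1 - g^0 \|_{1,X} \to 0} \left\{ s_\infty(g^1, f) - s_\infty(g^0, f) \right\} = -\lim_{\| g^1 - g^0 \|_{1,X} \to 0} \int_X \left[ \left( g^1(x) - f(x) \right)^2 - \left( g^0(x) - f(x) \right)^2 \right] d\mu x.$$
By the dominated convergence theorem (which applies since the finite bound $D$ used to define $W_{2,Z}(D)$ is dominated by an integrable function), the limit and the integral can be interchanged, so by inspection, the limit is zero.

The identification condition requires that for any point $(g, f)$ in $\mathcal{H} \times \mathcal{H}$, $s_\infty(g, f) \geq s_\infty(f, f)$ implies $\| g - f \|_{1,X} = 0$. This condition is clearly satisfied given that $g$ and $f$ are once continuously differentiable (by the assumption that defines the estimation space).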
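Estimation space $\mathcal{H} = W_{2,X}(D)$: the function space in the closure of which the true function must lie.
Consistency norm $\| h \|_{1,X}$. The closure of $\mathcal{H}$ is compact with respect to this norm.
Estimation subspace $\mathcal{H}_K$. The estimation subspace is the subset of $\mathcal{H}$ that is representable by a Fourier form with parameter $\theta_K$. These are dense subsets of $\mathcal{H}$.
Sample objective function $s_n(\theta_K)$, the negative of the sum of squares. This converges uniformly to the
limiting objective function $s_\infty(g, f)$, which is continuous in $g$ and has a global maximum in its first argument, over the closure of the infinite union of the estimation subspaces, at $g = f$.
As a result of this, first order elasticities
$$\frac{x_i}{f(x)} \frac{\partial f(x)}{\partial x_i}$$
are consistently estimated for all $x \in X$.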
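Souza, 1991) are very strict. Supposing we stick to these rates, our approximating model is:
$$g_K(x \mid \theta_K) = z'\theta_K.$$
Define $Z_K$ as the $n \times K$ matrix of regressors obtained by stacking observations. The LS estimator is
$$\hat\theta_K = \left( Z_K' Z_K \right)^{+} Z_K' y,$$
where $(\cdot)^{+}$ is the Moore-Penrose generalized inverse. This is used since $Z_K' Z_K$ may be singular when $K(n)$ is large.

The prediction, $z'\hat\theta_K$, of the unknown function $f(x)$ is asymptotically normally distributed:
$$\sqrt{n} \left( z'\hat\theta_K - f(x) \right) \xrightarrow{d} N(0, AV),$$
where
$$AV = \lim_{n \to \infty} E \left[ z' \left( \frac{Z_K' Z_K}{n} \right)^{+} z \, \hat\sigma^2 \right].$$
Formally, this is exactly the same as if we were dealing with a parametric linear model. I emphasize, though, that this is only valid if $K$ grows very slowly as $n$ grows. If we can't stick to acceptable rates, we should probably use some other method of approximating the small sample distribution. Bootstrapping is a possibility. We'll discuss this in the section on simulation.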
Advances in
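nearest neighbor, etc.). We'll consider the Nadaraya-Watson kernel regression estimator in a simple case.

Suppose we have an iid sample from the joint density $f(x, y)$, where $x$ is $k$-dimensional. The model is
$$y_t = g(x_t) + \varepsilon_t,$$
where
$$E(\varepsilon_t \mid x_t) = 0.$$
The conditional expectation of $y$ given $x$ is $g(x)$. By definition of the conditional expectation, we have
$$g(x) = \int y \frac{f(x, y)}{h(x)} dy = \frac{1}{h(x)} \int y f(x, y) dy,$$
where $h(x)$ is the marginal density of $x$:
$$h(x) = \int f(x, y) dy.$$
This suggests that we could estimate $g(x)$ by estimating $h(x)$ and $\int y f(x, y) dy$.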
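A kernel estimator for $h(x)$ has the form
$$\hat h(x) = \frac{1}{n} \sum_{t=1}^{n} \frac{K\left[ (x - x_t)/\gamma_n \right]}{\gamma_n^k},$$
where $n$ is the sample size and $k$ is the dimension of $x$.

The function $K(\cdot)$ (the kernel) is absolutely integrable:
$$\int |K(x)| dx < \infty,$$
and $K(\cdot)$ integrates to 1:
$$\int K(x) dx = 1.$$
In this respect, $K(\cdot)$ is like a density function, but we do not necessarily restrict $K(\cdot)$ to be nonnegative.

The window width parameter, $\gamma_n$, is a sequence of positive numbers that satisfies
$$\lim_{n \to \infty} \gamma_n = 0$$
$$\lim_{n \to \infty} n \gamma_n^k = \infty$$
So, the window width must tend to zero, but not too quickly.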
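To show pointwise consistency of $\hat h(x)$ for $h(x)$, first consider the expectation of the estimator (since the estimator is an average of iid terms we only need to consider the expectation of a representative term):
$$E\left[ \hat h(x) \right] = \int \gamma_n^{-k} K\left[ (x - z)/\gamma_n \right] h(z) dz.$$
Change variables as $z^* = (x - z)/\gamma_n$, so $z = x - \gamma_n z^*$ and $\left| \frac{dz}{dz^{*\prime}} \right| = \gamma_n^k$, we obtain
$$E\left[ \hat h(x) \right] = \int \gamma_n^{-k} K(z^*) h(x - \gamma_n z^*) \gamma_n^k dz^* = \int K(z^*) h(x - \gamma_n z^*) dz^*.$$
Now, asymptotically,
$$\lim_{n \to \infty} E\left[ \hat h(x) \right] = \lim_{n \to \infty} \int K(z^*) h(x - \gamma_n z^*) dz^* = \int \lim_{n \to \infty} K(z^*) h(x - \gamma_n z^*) dz^* = \int K(z^*) h(x) dz^* = h(x) \int K(z^*) dz^* = h(x),$$
since $\gamma_n \to 0$ and $\int K(z^*) dz^* = 1$ by assumption. (Note: that we can pass the limit through the integral is a result of the dominated convergence theorem. For this to hold we need that $h(\cdot)$ be dominated by an absolutely integrable function.)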
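Next, considering the variance of $\hat h(x)$, we have, due to the iid assumption,
$$n \gamma_n^k V\left[ \hat h(x) \right] = n \gamma_n^k \frac{1}{n^2} \sum_{t=1}^{n} V\left\{ \frac{K\left[ (x - x_t)/\gamma_n \right]}{\gamma_n^k} \right\} = \gamma_n^{-k} \frac{1}{n} \sum_{t=1}^{n} V\left\{ K\left[ (x - x_t)/\gamma_n \right] \right\}$$
By the representative term argument, this is
$$n \gamma_n^k V\left[ \hat h(x) \right] = \gamma_n^{-k} V\left\{ K\left[ (x - z)/\gamma_n \right] \right\}$$
Also, since $V(x) = E(x^2) - E(x)^2$ we have
$$n \gamma_n^k V\left[ \hat h(x) \right] = \gamma_n^{-k} E\left\{ \left( K\left[ (x - z)/\gamma_n \right] \right)^2 \right\} - \gamma_n^{-k} \left\{ E\left( K\left[ (x - z)/\gamma_n \right] \right) \right\}^2$$
$$= \int \gamma_n^{-k} K\left[ (x - z)/\gamma_n \right]^2 h(z) dz - \gamma_n^k \left\{ \int \gamma_n^{-k} K\left[ (x - z)/\gamma_n \right] h(z) dz \right\}^2$$
$$= \int \gamma_n^{-k} K\left[ (x - z)/\gamma_n \right]^2 h(z) dz - \gamma_n^k E\left[ \hat h(x) \right]^2$$
The second term converges to zero:
$$\gamma_n^k E\left[ \hat h(x) \right]^2 \to 0,$$
by the previous result regarding the expectation and the fact that $\gamma_n \to 0$. Therefore,
$$\lim_{n \to \infty} n \gamma_n^k V\left[ \hat h(x) \right] = \lim_{n \to \infty} \int \gamma_n^{-k} K\left[ (x - z)/\gamma_n \right]^2 h(z) dz.$$
Using exactly the same change of variables as before, this can be shown to be
$$\lim_{n \to \infty} n \gamma_n^k V\left[ \hat h(x) \right] = h(x) \int \left[ K(z^*) \right]^2 dz^*.$$
Since both $\int \left[ K(z^*) \right]^2 dz^*$ and $h(x)$ are bounded, this is bounded, and since $n \gamma_n^k \to \infty$ by assumption, we have that
$$V\left[ \hat h(x) \right] \to 0.$$
Since the bias and the variance both go to zero, we have pointwise consistency (convergence in quadratic mean implies convergence in probability).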
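To estimate $\int y f(x, y) dy$, we need an estimator of $f(x, y)$. The estimator has the same form as the estimator for $h(x)$, only with one dimension more:
$$\hat f(x, y) = \frac{1}{n} \sum_{t=1}^{n} \frac{K^*\left[ (y - y_t)/\gamma_n, (x - x_t)/\gamma_n \right]}{\gamma_n^{k+1}}$$
The kernel $K^*(\cdot)$ is required to have mean zero:
$$\int y K^*(y, x) dy = 0$$
and to marginalize to the previous kernel for $h(x)$:
$$\int K^*(y, x) dy = K(x).$$
With this kernel, we have
$$\int y \hat f(y, x) dy = \frac{1}{n} \sum_{t=1}^{n} y_t \frac{K\left[ (x - x_t)/\gamma_n \right]}{\gamma_n^k}$$
by marginalization of the kernel, so we obtain
$$\hat g(x) = \frac{1}{\hat h(x)} \int y \hat f(y, x) dy = \frac{\frac{1}{n} \sum_{t=1}^{n} y_t \frac{K\left[ (x - x_t)/\gamma_n \right]}{\gamma_n^k}}{\frac{1}{n} \sum_{t=1}^{n} \frac{K\left[ (x - x_t)/\gamma_n \right]}{\gamma_n^k}} = \frac{\sum_{t=1}^{n} y_t K\left[ (x - x_t)/\gamma_n \right]}{\sum_{t=1}^{n} K\left[ (x - x_t)/\gamma_n \right]}.$$
This is the Nadaraya-Watson kernel regression estimator.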
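The ratio-of-sums form of the estimator translates directly into code. The sketch below is illustrative only: it assumes a scalar regressor (k = 1) and a standard normal kernel, and the function names are hypothetical rather than taken from the programs that accompany these notes.

```python
import math

def gaussian_kernel(u):
    # Standard normal density, used here as the kernel K(.).
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def nadaraya_watson(x0, xs, ys, width, kernel=gaussian_kernel):
    """Kernel regression fit at the point x0: a weighted average of the ys."""
    weights = [kernel((x0 - xt) / width) for xt in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all kernel weights are zero; try a larger window width")
    return sum(w * yt for w, yt in zip(weights, ys)) / total

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 2.0, 5.0]
print(nadaraya_watson(1.5, xs, ys, width=0.5))
```

With a very large window width every weight is nearly the same, so the fit approaches the sample mean of the ys; with a very small one the fit at a data point is essentially the y value observed there.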
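The kernel regression estimator for $g(x_t)$ is a weighted average of the $y_j$, $j = 1, 2, ..., n$, where higher weights are associated with points that are closer to $x_t$. The weights sum to 1.

The window width parameter $\gamma_n$ imposes smoothness. The estimator is increasingly flat as $\gamma_n \to \infty$, since in this case each weight tends to $1/n$.

A large window width reduces the variance (strong imposition of flatness), but increases the bias.

A small window width reduces the bias, but makes very little use of information except points that are in a small neighborhood of $x_t$.

The standard normal density is a popular choice for $K(.)$ and $K^*(y, x)$.

One popular method for choosing the window width is cross validation. The sample is split into an "in sample" part, used for estimation, and an "out of sample" part, used to evaluate the fit. The steps are:

1. Split the data. The out of sample data is $y^{out}$ and $x^{out}$.
2. Choose a window width $\gamma$.
3. With the in sample data, fit $\hat y_t^{out}$ corresponding to each $x_t^{out}$. This fitted value is a function of the in sample data, as well as the evaluation point $x_t^{out}$, but it does not involve $y_t^{out}$.
4. Repeat for all out of sample points.
5. Calculate RMSE($\gamma$).
6. Go to step 2, or to the next step if enough window widths have been tried.
7. Select the $\gamma$ that minimizes RMSE($\gamma$) (Verify that a minimum has been found, for example by plotting RMSE as a function of $\gamma$).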
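The cross-validation steps can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for the results in these notes: the data are simulated (a sine curve plus noise), the regressor is scalar, the kernel is Gaussian, the split is 50%-50%, and the grid of window widths is arbitrary. The fallback to the in-sample mean when all weights underflow is a pragmatic choice of this sketch.

```python
import math
import random

def nw_fit(x0, xs, ys, width):
    """Nadaraya-Watson fit at x0 with a Gaussian kernel (scalar regressor)."""
    weights = [math.exp(-0.5 * ((x0 - xt) / width) ** 2) for xt in xs]
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed to zero: fall back to the in-sample mean.
        return sum(ys) / len(ys)
    return sum(w * yt for w, yt in zip(weights, ys)) / total

def out_of_sample_rmse(width, x_in, y_in, x_out, y_out):
    # Steps 3 to 5: fit each out of sample point using in sample data only.
    sq_errors = [(y0 - nw_fit(x0, x_in, y_in, width)) ** 2
                 for x0, y0 in zip(x_out, y_out)]
    return math.sqrt(sum(sq_errors) / len(sq_errors))

def select_width(widths, x_in, y_in, x_out, y_out):
    # Steps 2, 6 and 7: try each candidate width and keep the RMSE minimizer.
    scores = {w: out_of_sample_rmse(w, x_in, y_in, x_out, y_out) for w in widths}
    best = min(scores, key=scores.get)
    return best, scores

rng = random.Random(1)
x = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(200)]
y = [math.sin(xt) + rng.gauss(0.0, 0.3) for xt in x]
# Step 1: split the data into in sample and out of sample halves.
x_in, y_in, x_out, y_out = x[:100], y[:100], x[100:], y[100:]
widths = [0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2]
best_width, scores = select_width(widths, x_in, y_in, x_out, y_out)
print(best_width, scores[best_width])
```

Printing the whole `scores` dictionary is a simple way to verify that an interior minimum has been found rather than a value at the edge of the grid.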
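If we're interested in a conditional density, for example of $y$ conditional on $x$, then the kernel estimate of the conditional density is simply
$$\hat f_{y|x} = \frac{\hat f(x, y)}{\hat h(x)} = \frac{\frac{1}{n} \sum_{t=1}^{n} \frac{K^*\left[ (y - y_t)/\gamma_n, (x - x_t)/\gamma_n \right]}{\gamma_n^{k+1}}}{\frac{1}{n} \sum_{t=1}^{n} \frac{K\left[ (x - x_t)/\gamma_n \right]}{\gamma_n^k}} = \frac{1}{\gamma_n} \frac{\sum_{t=1}^{n} K^*\left[ (y - y_t)/\gamma_n, (x - x_t)/\gamma_n \right]}{\sum_{t=1}^{n} K\left[ (x - x_t)/\gamma_n \right]}$$
where we obtain the expressions for the joint and marginal densities from the section on kernel regression.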
Econometrica,
V. 12,
1997.
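MLE is the estimation method of choice when we are confident about specifying the density. Is it possible to obtain the benefits of MLE when we're not so confident about the specification? In part, yes.

Suppose we're interested in the density of $y$ conditional on $x$. Suppose that the density $f(y|x, \phi)$ is a reasonable starting approximation to the true density. This density can be reshaped by multiplying it by a squared polynomial. The new density is
$$g_p(y|x, \phi, \gamma) = \frac{h_p^2(y|\gamma) f(y|x, \phi)}{\eta_p(x, \phi, \gamma)}$$
where
$$h_p(y|\gamma) = \sum_{k=0}^{p} \gamma_k y^k$$
and $\eta_p(x, \phi, \gamma)$ is a normalizing factor to make the density integrate (sum) to one. Because $h_p^2(y|\gamma)/\eta_p(x, \phi, \gamma)$ is a homogeneous function of $\gamma$ it is necessary to impose a normalization: $\gamma_0$ is set to 1.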
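The normalization factor $\eta_p(\phi, \gamma)$ is calculated (following Cameron and Johansson) using
$$E(Y^r) = \sum_{y=0}^{\infty} y^r f_Y(y|\phi, \gamma)$$
$$= \sum_{y=0}^{\infty} y^r \frac{\left[ h_p(y|\gamma) \right]^2}{\eta_p(\phi, \gamma)} f_Y(y|\phi)$$
$$= \sum_{y=0}^{\infty} \sum_{k=0}^{p} \sum_{l=0}^{p} y^r f_Y(y|\phi) \gamma_k \gamma_l y^k y^l / \eta_p(\phi, \gamma)$$
$$= \sum_{k=0}^{p} \sum_{l=0}^{p} \gamma_k \gamma_l \left\{ \sum_{y=0}^{\infty} y^{r+k+l} f_Y(y|\phi) \right\} / \eta_p(\phi, \gamma)$$
$$= \sum_{k=0}^{p} \sum_{l=0}^{p} \gamma_k \gamma_l m_{k+l+r} / \eta_p(\phi, \gamma).$$
By setting $r = 0$ we get that the normalizing factor is
$$\eta_p(\phi, \gamma) = \sum_{k=0}^{p} \sum_{l=0}^{p} \gamma_k \gamma_l m_{k+l} \qquad (18.8)$$
Recall that $\gamma_0$ is set to 1. The $m_r$ in equation 18.8 are the raw moments of the baseline density. Gallant and Nychka (1987) give conditions under which such a density may be treated as correctly specified, asymptotically. Basically, the order of the polynomial must increase as the sample size increases. However, there are technicalities.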
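To make equation 18.8 concrete, here is a minimal sketch that reshapes a count density with a squared polynomial and checks that the result sums to one. For simplicity it assumes a Poisson baseline density (an assumption of this sketch; a negative binomial baseline is another possibility) and computes the raw moments by truncated summation. All function names are hypothetical.

```python
import math

def poisson_pmf(y, lam):
    # Poisson probability, computed in logs for numerical stability.
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def raw_moment(r, lam, y_max=200):
    # m_r = E(Y^r) under the baseline density, by truncated summation.
    return sum((y ** r) * poisson_pmf(y, lam) for y in range(y_max + 1))

def eta_p(gamma, lam):
    # Normalizing factor of equation 18.8: sum_k sum_l gamma_k gamma_l m_{k+l}.
    p = len(gamma) - 1
    return sum(gamma[k] * gamma[l] * raw_moment(k + l, lam)
               for k in range(p + 1) for l in range(p + 1))

def snp_pmf(y, gamma, lam, eta=None):
    # Reshaped density: squared polynomial times baseline, divided by eta_p.
    if eta is None:
        eta = eta_p(gamma, lam)
    h = sum(g * y ** k for k, g in enumerate(gamma))
    return h * h * poisson_pmf(y, lam) / eta

gamma = [1.0, 0.5, -0.1]  # gamma_0 is normalized to 1; second order polynomial
lam = 2.0
eta = eta_p(gamma, lam)
total = sum(snp_pmf(y, gamma, lam, eta) for y in range(201))
print(eta, total)
```

For a second order polynomial the double sum uses the raw moments m_0 through m_4, which is why the first four raw moments of the baseline density are what one needs when p = 2.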
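Similarly to Cameron and Johansson (1997), we may develop a negative binomial polynomial (NBP) density for count data. The negative binomial baseline density may be written as
$$f_Y(y|\phi) = \frac{\Gamma(y + \psi)}{\Gamma(y + 1)\Gamma(\psi)} \left( \frac{\psi}{\psi + \lambda} \right)^{\psi} \left( \frac{\lambda}{\psi + \lambda} \right)^{y}$$
where $\phi = \{\lambda, \psi\}$, $\lambda > 0$ and $\psi > 0$. The usual means of incorporating conditioning variables $x$ is the parameterization $\lambda = e^{x'\beta}$. When $\psi = \lambda/\alpha$ we have the negative binomial-I model (NB-I). When $\psi = 1/\alpha$ we have the negative binomial-II (NB-II) model. For the NB-I density, $V(Y) = \lambda + \alpha\lambda$. In the case of the NB-II model, we have $V(Y) = \lambda + \alpha\lambda^2$. For both forms, $E(Y) = \lambda$.

The reshaped density, with normalization to sum to one, is
$$f_Y(y|\phi, \gamma) = \frac{\left[ h_p(y|\gamma) \right]^2}{\eta_p(\phi, \gamma)} \frac{\Gamma(y + \psi)}{\Gamma(y + 1)\Gamma(\psi)} \left( \frac{\psi}{\psi + \lambda} \right)^{\psi} \left( \frac{\lambda}{\psi + \lambda} \right)^{y} \qquad (18.9)$$
To get the normalization factor, we need the moment generating function:
$$M_Y(t) = \psi^{\psi} \left( \lambda - e^t \lambda + \psi \right)^{-\psi}. \qquad (18.10)$$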
To illustrate, Figure 18.5 shows calculation of the first four raw moments of the NB density, calculated using MuPAD, which is a Computer Algebra System that is (or used to be?) free for personal use. These are the moments you would need to use a second order polynomial ($p = 2$). MuPAD will output these results in the form of C code, which is relatively easy to edit to write the likelihood function for the model. This has been done in NegBinSNP.cc, which is a C++ version of this model that can be compiled to use with octave using the mkoctfile command. Note the impressive length of the expressions when the degree of the expansion is 4 or 5! This is an example of a model that would be difficult to formulate without the help of a program like MuPAD.
parameters to depend
This sort of density can approximate a wide variety of densities arbitrarily well as the degree of the polynomial increases with the sample size. This approach is not without its drawbacks: the sample objective function can have an extremely large number of local maxima. If someone could figure out how to do this in a way such that the sample objective function was nice and smooth, they would probably get the paper published in a good journal. Any ideas?
Here's a plot of true and the limiting SNP approximations (with the order of the polynomial fixed) to four different count data densities, which variously exhibit over and underdispersion, as well as excess zeros. The baseline model is a negative binomial density.
[Figure: true densities and limiting SNP approximations; panels: Case 1, Case 2, Case 3, Case 4]
18.7 Examples
We'll use the MEPS OBDV data to illustrate kernel regression and semi-nonparametric maximum likelihood.
[Figure: horizontal axis: Age, from 20 to 65]
We'll reshape a negative binomial density, as discussed above. The program EstimateNBSNP.m loads the MEPS OBDV data and estimates the model, using a NB-I baseline density and a 2nd order polynomial expansion. The output is:
OBDV

======================================================
BFGSMIN final results

Used numeric gradient

------------------------------------------------------
STRONG CONVERGENCE
Function conv 1  Param conv 1  Gradient conv 1
------------------------------------------------------
Objective function value 2.17061
Stepsize 0.0065
24 iterations
------------------------------------------------------

   param  gradient    change
  1.3826    0.0000   -0.0000
  0.2317   -0.0000    0.0000
  0.1839    0.0000    0.0000
  0.2214    0.0000   -0.0000
  0.1898    0.0000   -0.0000
  0.0722    0.0000   -0.0000
 -0.0002    0.0000   -0.0000
  1.7853   -0.0000   -0.0000
 -0.4358    0.0000   -0.0000
  0.1129    0.0000    0.0000

******************************************************
NegBin SNP model, MEPS full data set

MLE Estimation Results
BFGS convergence: Normal convergence

Average Log-L: -2.170614
Observations: 4564

              estimate   st. err    t-stat   p-value
constant        -0.147     0.126    -1.173     0.241
pub. ins.        0.695     0.050    13.936     0.000
priv. ins.       0.409     0.046     8.833     0.000
sex              0.443     0.034    13.148     0.000
age              0.016     0.001    11.880     0.000
edu              0.025     0.006     3.903     0.000
inc             -0.000     0.000    -0.011     0.991
gam1             1.785     0.141    12.629     0.000
gam2            -0.436     0.029   -14.786     0.000
lnalpha          0.113     0.027     4.166     0.000

Information Criteria
CAIC : 19907.6244     Avg. CAIC:  4.3619
 BIC : 19897.6244     Avg. BIC:   4.3597
 AIC : 19833.3649     Avg. AIC:   4.3456
******************************************************
Note that the CAIC and BIC are lower for this model than for the models presented in Table 16.3. This model fits well, while still being parsimonious. You can play around trying other use measures, using a NB-II baseline density, and using other orders of expansions. Density functions formed in this way may have MANY local maxima, so you need to be careful before accepting the results of a casual run. To guard against having converged to a local maximum, one can try using multiple starting values, or one could try simulated annealing as an optimization method. If you uncomment the relevant lines in the program, you can use SA to do the minimization. This will take a lot of time.
Chapter 19
Simulation-based estimation
Readings: In addition to the book mentioned previously, articles include Gallant and Tauchen (1996), "Which Moments to Match?", ECONOMETRIC THEORY, Vol. 12, 1996, pages 657-681; Gourieroux, Monfort and Renault (1993), "Indirect Inference," J. Apl. Econometrics; Pakes and Pollard (1989) Econometrica; McFadden (1989) Econometrica.
19.1 Motivation
Simulation methods are of interest when the DGP is fully characterized by a parameter vector, but the likelihood function is not calculable. If it were available, we would simply estimate by MLE, which is asymptotically fully efficient.
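Let $y_i^*$ be a latent random vector of dimension $m$. Suppose that
$$y_i^* = X_i\beta + \varepsilon_i$$
where $X_i$ is $m \times K$. Suppose that
$$\varepsilon_i \sim N(0, \Omega) \qquad (19.1)$$
Henceforth drop the $i$ subscript when it is not needed for clarity.

$y^*$ is not observed. Rather, we observe a many-to-one mapping
$$y = \tau(y^*)$$
This mapping is such that each element of $y$ is either zero or one (in some cases only one element will be one).

Define
$$A_i = A(y_i) = \{ y^* \mid y_i = \tau(y^*) \}$$
Suppose random sampling of $(y_i, X_i)$. In this case the elements of $y_i$ may not be independent of one another (and clearly are not if $\Omega$ is not diagonal). However, $y_i$ is independent of $y_j$, $i \neq j$.

Let $\theta = (\beta', (vec^*\Omega)')'$ be the vector of parameters of the model. The contribution of the $i$th observation to the likelihood function is
$$p_i(\theta) = \int_{A_i} n(y_i^* - X_i\beta, \Omega) dy_i^*$$
where
$$n(\varepsilon, \Omega) = (2\pi)^{-M/2} |\Omega|^{-1/2} \exp\left[ \frac{-\varepsilon'\Omega^{-1}\varepsilon}{2} \right]$$
is the multivariate normal density of an $M$-dimensional random vector. The log-likelihood function is
$$\ln L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ln p_i(\theta)$$
and the MLE $\hat\theta$ solves the score equations
$$\frac{1}{n} \sum_{i=1}^{n} g_i(\hat\theta) = \frac{1}{n} \sum_{i=1}^{n} \frac{D_\theta p_i(\hat\theta)}{p_i(\hat\theta)} \equiv 0.$$
The problem is that evaluation of $L_i(\theta)$ and its derivative w.r.t. $\theta$ by standard methods of numerical integration such as quadrature is computationally infeasible when $m$ (the dimension of $y$) is large (as long as there are no restrictions on $\Omega$).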
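The mapping $\tau(y^*)$ has not been made specific so far. This setup is quite general: for different choices of $\tau(y^*)$ it nests the case of dynamic binary discrete choice models as well as the case of multinomial discrete choice (the choice of one out of a finite set of alternatives).

Multinomial discrete choice is illustrated by a (very simple) job search model. We have cross sectional data on individuals' matching to a set of $m$ jobs that are available. The utility of alternative $j$ is
$$u_j = X_j\beta + \varepsilon_j$$
Utilities of jobs, stacked in the vector $u_i$, are not observed. Rather, we observe the vector formed of elements
$$y_j = 1\left[ u_j > u_k, \forall k \in m, k \neq j \right]$$
Only one of these elements is different than zero.

Dynamic discrete choice is illustrated by repeated choices over time between two alternatives. Let alternative $j$ have utility
$$u_{jt} = W_{jt}\beta + \varepsilon_{jt}, \quad j \in \{1, 2\}, \quad t \in \{1, 2, ..., m\}$$
Then
$$y^* = u_2 - u_1 = (W_2 - W_1)\beta + \varepsilon_2 - \varepsilon_1 \equiv X\beta + \varepsilon$$
Now the mapping is (element-by-element)
$$y = 1\left[ y^* > 0 \right],$$
that is $y_{it} = 1$ if individual $i$ chooses the second alternative in period $t$, zero otherwise.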
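For example, count data (that takes values 0, 1, 2, 3, ...) is often modeled using the Poisson distribution
$$\Pr(y = i) = \frac{\exp(-\lambda)\lambda^i}{i!}$$
The mean and variance of the Poisson distribution are both equal to $\lambda$:
$$E(y) = V(y) = \lambda.$$
Often, one parameterizes the conditional mean as
$$\lambda_i = \exp(X_i\beta).$$
This ensures that the mean is positive (as it must be). Estimation by ML is straightforward.

Often, count data exhibits "overdispersion" which simply means that
$$V(y) > E(y).$$
One way to accommodate this is to introduce a latent variable that reflects heterogeneity into the specification:
$$\lambda_i = \exp(X_i\beta + \eta_i)$$
where $\eta_i$ has some specified density with support $S$ (this density may depend on additional parameters). Let $d\mu(\eta_i)$ be the density of $\eta_i$. In some cases, the marginal density of $y$
$$\Pr(y = y_i) = \int_S \frac{\exp\left[ -\exp(X_i\beta + \eta_i) \right] \left[ \exp(X_i\beta + \eta_i) \right]^{y_i}}{y_i!} d\mu(\eta_i)$$
will have a closed-form solution (one can derive the negative binomial distribution in this way if $\eta$ has an exponential distribution), but often this will not be possible. In this case, simulation is a means of calculating $\Pr(y = i)$, which is then used to do ML estimation.

In this case, since there is only one latent variable, quadrature is probably a better choice. However, a more flexible model with heterogeneity would allow all parameters (not just the constant) to vary. For example
$$\Pr(y = y_i) = \int_S \frac{\exp\left[ -\exp(X_i\beta_i) \right] \left[ \exp(X_i\beta_i) \right]^{y_i}}{y_i!} d\mu(\beta_i)$$
entails a $K = \dim \beta_i$-dimensional integral, which will not be evaluable by quadrature when $K$ gets large.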
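Consider the process
$$dy_t = g(\theta, y_t)dt + h(\theta, y_t)dW_t.$$
$\{W_t\}$ is a standard Brownian motion (Wiener process), such that
$$W(T) = \int_0^T dW_t \sim N(0, T)$$
Brownian motion is a continuous-time stochastic process such that
$W(0) = 0$
$[W(s) - W(t)] \sim N(0, s - t)$
$[W(s) - W(t)]$ and $[W(j) - W(k)]$ are independent for $s > t > j > k$. That is, non-overlapping segments are independent.

The function $g(\theta, y_t)$ is the deterministic part, and $h(\theta, y_t)$ determines the variance of the shocks.

To estimate a model of this sort, we typically have data that are assumed to be observations of $y_t$ at discrete points in time. Direct ML or GMM estimation is not usually feasible, because one cannot, in general, deduce the transition density of the observations. This density is necessary to evaluate the likelihood function or to evaluate moment conditions (which are based upon expectations with respect to this density).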
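A typical solution is to "discretize" the model, by which we mean to find a discrete time approximation to the model. The discretized version of the model is
$$y_t - y_{t-1} = g(\phi, y_{t-1}) + h(\phi, y_{t-1})\varepsilon_t, \quad \varepsilon_t \sim N(0, 1)$$
The discretization induces a new parameter, $\phi$ (that is, the $\phi^0$ which defines the best approximation of the discretization to the actual (unknown) discrete time version of the model is not equal to $\theta^0$, which is the true parameter value). This is an approximation, and as such "ML" estimation of $\phi$ (which is actually quasi-maximum likelihood, QML) based upon this equation is in general biased and inconsistent for the original parameter, $\theta$.

The important point about these three examples is that computational difficulties prevent direct application of ML, GMM, etc. Nevertheless the model is fully specified in probabilistic terms up to a parameter vector.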
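An ML estimator solves
$$\hat\theta_{ML} = \arg\max s_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} \ln p(y_t|X_t, \theta)$$
where $p(y_t|X_t, \theta)$ is the density function of the $t$th observation. When $p(y_t|X_t, \theta)$ does not have a known closed form, $\hat\theta_{ML}$ is an infeasible estimator. However, it may be possible to define a random function such that
$$E_\nu f(\nu, y_t, X_t, \theta) = p(y_t|X_t, \theta)$$
where the density of $\nu$ is known. If this is the case, the simulator
$$\tilde p(y_t, X_t, \theta) = \frac{1}{H} \sum_{s=1}^{H} f(\nu_{ts}, y_t, X_t, \theta)$$
is unbiased for $p(y_t|X_t, \theta)$.

The SML estimator simply substitutes $\tilde p(y_t, X_t, \theta)$ in place of $p(y_t|X_t, \theta)$ in the log-likelihood function, that is
$$\hat\theta_{SML} = \arg\max s_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} \ln \tilde p(y_t, X_t, \theta)$$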
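As an illustration of the simulator and of the SML objective function, consider a Poisson model with a single latent heterogeneity term. The sketch below rests on assumptions that are not part of the text: a scalar regressor, a normally distributed latent term with standard deviation sigma, made-up data, and hypothetical function names. The draws are generated once and then held fixed.

```python
import math
import random

def poisson_pmf(y, lam):
    """Poisson probability Pr(Y = y) for mean lam > 0, computed in logs."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def simulated_prob(y, xb, sigma, nus):
    """Unbiased simulator of Pr(y | x): average the Poisson pmf over fixed draws."""
    return sum(poisson_pmf(y, math.exp(xb + sigma * nu)) for nu in nus) / len(nus)

def sml_objective(beta, sigma, ys, xs, draws):
    """Average simulated log-likelihood; draws[t] holds the H draws for observation t."""
    total = 0.0
    for y, x, nus in zip(ys, xs, draws):
        total += math.log(simulated_prob(y, beta * x, sigma, nus))
    return total / len(ys)

rng = random.Random(0)
H = 100
xs = [0.1 * t for t in range(10)]
ys = [0, 1, 0, 2, 1, 3, 1, 0, 2, 4]  # made-up counts, for illustration only
# The draws are generated once and reused at every evaluation of the objective.
draws = [[rng.gauss(0.0, 1.0) for _ in range(H)] for _ in ys]
print(sml_objective(0.5, 0.3, ys, xs, draws))
```

An estimator would maximize `sml_objective` over `beta` and `sigma` with the same `draws` at every iteration, so that the objective is a deterministic function of the parameters.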
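Recall that the utility of alternative $j$ is
$$u_j = X_j\beta + \varepsilon_j$$
and the vector $y$ is formed of elements
$$y_j = 1\left[ u_j > u_k, \forall k \in m, k \neq j \right]$$
The problem is that $\Pr(y_j = 1|\theta)$ can't be calculated when $m$ is large. However, it is easy to simulate this probability.

Draw $\tilde\varepsilon_i$ from the distribution $N(0, \Omega)$
Calculate $\tilde u_i = X_i\beta + \tilde\varepsilon_i$ (where $X_i$ is the matrix formed by stacking the $X_{ij}$)
Define $\tilde y_{ij} = 1\left[ \tilde u_{ij} > \tilde u_{ik}, \forall k \in m, k \neq j \right]$
Repeat this $H$ times and define
$$\tilde\pi_{ij} = \frac{\sum_{h=1}^{H} \tilde y_{ijh}}{H}$$
Define $\tilde\pi_i$ as the $m$-vector formed of the $\tilde\pi_{ij}$. Each element of $\tilde\pi_i$ is between 0 and 1, and the elements sum to one.
Now $\tilde p(y_i, X_i, \theta) = y_i'\tilde\pi_i$
The SML multinomial probit log-likelihood function is
$$\ln L(\beta, \Omega) = \frac{1}{n} \sum_{i=1}^{n} \ln \tilde p(y_i, X_i, \theta)$$
This is to be maximized w.r.t. $\beta$ and $\Omega$.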
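The simulation steps above can be sketched as a frequency simulator for one observation. The sketch makes several assumptions for illustration: the covariance matrix enters through a lower triangular Cholesky factor supplied by the caller, the systematic utilities are passed directly as a list, and the standard normal draws are generated once outside the function. The names are hypothetical.

```python
import random

def frequency_simulator(xb, chol, draws):
    """Simulated choice probabilities for one observation.

    xb:    list of m systematic utilities (the elements of X_i * beta)
    chol:  m x m lower triangular factor L, with Omega = L L'
    draws: list of H lists, each holding m standard normal draws
    """
    m = len(xb)
    counts = [0] * m
    for z in draws:
        # Correlated errors: eps = L z, so that eps has covariance Omega.
        eps = [sum(chol[j][k] * z[k] for k in range(j + 1)) for j in range(m)]
        utilities = [xb[j] + eps[j] for j in range(m)]
        chosen = max(range(m), key=lambda j: utilities[j])
        counts[chosen] += 1
    return [c / len(draws) for c in counts]

def simulated_prob(y, pi):
    # y is an indicator vector with a single 1, so this picks out one element of pi.
    return sum(yj * pj for yj, pj in zip(y, pi))

rng = random.Random(42)
m, H = 3, 500
identity = [[1.0 if j == k else 0.0 for k in range(m)] for j in range(m)]
draws = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(H)]
pi = frequency_simulator([0.5, 0.0, -0.5], identity, draws)
print(pi, simulated_prob([1, 0, 0], pi))
```

Because the simulated probabilities are frequencies, they jump discretely as the parameters change and can be exactly zero when H is small.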
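Notes:

The $H$ draws of $\tilde\varepsilon_i$ are drawn only once and are used repeatedly during the iterations used to find $\hat\beta$ and $\hat\Omega$. The draws are different for each $i$. If the $\tilde\varepsilon_i$ are re-drawn at every iteration the estimator will not converge.

The log-likelihood function with this simulator is a discontinuous function of $\beta$ and $\Omega$. This does not cause problems from a theoretical point of view since it can be shown that $\ln L(\beta, \Omega)$ is stochastically equicontinuous. However, it does cause problems if one attempts to use a gradient-based optimization method.

It may be the case, particularly if few simulations, $H$, are used, that some elements of $\tilde\pi_i$ are zero. If the corresponding element of $y_i$ is equal to 1, there will be a $\log(0)$ problem.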
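2) Smooth the simulated probabilities so that they are continuous functions of the parameters. For example, apply a kernel transformation such as
$$\tilde y_{ij} = \Phi\left( A \times \left[ u_{ij} - \max_{k=1}^{m} u_{ik} \right] \right) + .5 \times 1\left[ u_{ij} = \max_{k=1}^{m} u_{ik} \right]$$
where $A$ is a large positive number. This approximates a step function such that $\tilde y_{ij}$ is very close to zero if $u_{ij}$ is not the maximum, and $\tilde y_{ij} = 1$ if it is the maximum. This makes $\tilde y_{ij}$ a continuous function of $\beta$ and $\Omega$, so that $\tilde p_{ij}$ and therefore $\ln L(\beta, \Omega)$ will be continuous and differentiable. Consistency requires that $A(n) \xrightarrow{p} \infty$, so that the approximation to a step function becomes arbitrarily close as the sample size increases. There are alternative methods (e.g., Gibbs sampling) that may work better, but this is too technical to discuss here.

To solve the log(0) problem, one possibility is to search the web for the slog function. Also, increase $H$ if this is a serious problem.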
19.2.2 Properties
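The properties of the SML estimator depend on how $H$ is set. The following is taken from Lee (1995) "Asymptotic Bias in Simulated Maximum Likelihood Estimation of Discrete Choice Models," 437-83.

If $\lim_{n \to \infty} n^{1/2}/H = \lambda$, where $\lambda$ is a finite constant, then
$$\sqrt{n} \left( \hat\theta_{SML} - \theta^0 \right) \xrightarrow{d} N\left( B, I^{-1}(\theta^0) \right)$$
where $B$ is a finite vector of constants.

This means that the SML estimator is asymptotically biased if $H$ doesn't grow faster than $n^{1/2}$.

The varcov is the typical inverse of the information matrix, so that as long as $H$ grows fast enough the estimator is consistent and fully asymptotically efficient.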
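Suppose we have a DGP$(y|x, \theta)$ which is simulable given $\theta$, but is such that the density of $y$ is not calculable.

One could, in principle, base a GMM estimator upon the moment conditions
$$m_t(\theta) = \left[ K(y_t, x_t) - k(x_t, \theta) \right] z_t$$
where
$$k(x_t, \theta) = \int K(y, x_t) p(y|x_t, \theta) dy,$$
$z_t$ is a vector of instruments in the information set and $p(y|x_t, \theta)$ is the density of $y$ conditional on $x_t$. The problem is that this density is not available.

However $k(x_t, \theta)$ is readily simulated using
$$\tilde k(x_t, \theta) = \frac{1}{H} \sum_{h=1}^{H} K(\tilde y_t^h, x_t)$$
By the law of large numbers, $\tilde k(x_t, \theta) \xrightarrow{a.s.} k(x_t, \theta)$, as $H \to \infty$, which provides a clear intuitive basis for the estimator, though in fact we obtain consistency even for $H$ finite, since a law of large numbers is also operating across the $n$ observations of real data, so errors introduced by simulation cancel themselves out.

This allows us to form the moment conditions
$$\tilde m_t(\theta) = \left[ K(y_t, x_t) - \tilde k(x_t, \theta) \right] z_t \qquad (19.2)$$
where $z_t$ is drawn from the information set. As before, form
$$\tilde m(\theta) = \frac{1}{n} \sum_{t=1}^{n} \tilde m_t(\theta) = \frac{1}{n} \sum_{t=1}^{n} \left[ K(y_t, x_t) - \frac{1}{H} \sum_{h=1}^{H} K(\tilde y_t^h, x_t) \right] z_t \qquad (19.3)$$
with which we form the GMM criterion and estimate as usual. Note that the unbiased simulator $K(\tilde y_t^h, x_t)$ appears linearly within the sums.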
19.3.1 Properties
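Suppose that the optimal weighting matrix is used. McFadden (ref. above) and Pakes and Pollard (refs. above) show that the asymptotic distribution of the MSM estimator is very similar to that of the infeasible GMM estimator. In particular, assuming that the optimal weighting matrix is used, and for $H$ finite,
$$\sqrt{n} \left( \hat\theta_{MSM} - \theta^0 \right) \xrightarrow{d} N\left[ 0, \left( 1 + \frac{1}{H} \right) \left( D_\infty \Omega^{-1} D_\infty' \right)^{-1} \right] \qquad (19.4)$$
where $\left( D_\infty \Omega^{-1} D_\infty' \right)^{-1}$ is the asymptotic variance of the infeasible GMM estimator.

That is, the asymptotic variance is inflated by a factor $1 + 1/H$. For this reason the MSM estimator is not fully asymptotically efficient relative to the infeasible GMM estimator, for $H$ finite, but the efficiency loss is small and controllable, by setting $H$ reasonably large.

The estimator is asymptotically unbiased even for $H = 1$. This is an advantage relative to SML.

If one doesn't use the optimal weighting matrix, the asymptotic varcov is just the ordinary GMM varcov, inflated by $1 + 1/H$.

The above presentation is in terms of a specific moment condition based upon the conditional mean. Simulated GMM can be applied to moment conditions of any form.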
19.3.2 Comments
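Why is SML inconsistent if $H$ is finite, while MSM is consistent? The reason is that SML is based upon an average of logarithms of an unbiased simulator (the densities of the observations). To use the multinomial probit model as an example, the log-likelihood function is
$$\ln L(\beta, \Omega) = \frac{1}{n} \sum_{i=1}^{n} y_i' \ln p_i(\beta, \Omega)$$
The SML version is
$$\ln L(\beta, \Omega) = \frac{1}{n} \sum_{i=1}^{n} y_i' \ln \tilde p_i(\beta, \Omega)$$
The problem is that
$$E \ln(\tilde p_i(\beta, \Omega)) \neq \ln(E \tilde p_i(\beta, \Omega))$$
in spite of the fact that
$$E \tilde p_i(\beta, \Omega) = p_i(\beta, \Omega)$$
due to the fact that $\ln(\cdot)$ is a nonlinear transformation. The only way for the two to be equal (in the limit) is if $H$ tends to infinity so that $\tilde p(\cdot)$ tends to $p(\cdot)$.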
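The effect of taking logarithms of an unbiased simulator can be seen in a small Monte Carlo experiment. The simulator used below is hypothetical and chosen only because its mean is known exactly: each draw equals p times a lognormal factor with mean one. It is not the multinomial probit frequency simulator.

```python
import math
import random

def simulated_p(p, sigma, H, rng):
    """Unbiased simulator of p: average of H draws of p * exp(sigma*nu - sigma^2/2)."""
    return sum(p * math.exp(sigma * rng.gauss(0.0, 1.0) - 0.5 * sigma ** 2)
               for _ in range(H)) / H

def mean_log_simulated_p(p, sigma, H, reps, rng):
    # Monte Carlo estimate of E[ln(p_tilde)] for a given number of draws H.
    return sum(math.log(simulated_p(p, sigma, H, rng)) for _ in range(reps)) / reps

rng = random.Random(123)
p, sigma, reps = 0.3, 1.0, 5000
for H in (1, 10, 50):
    print(H, mean_log_simulated_p(p, sigma, H, reps, rng), math.log(p))
```

For H = 1 the expected log of this simulator is ln(p) minus sigma squared over two, so the gap is easy to see; it shrinks as H grows, which is the sense in which the bias of SML is controlled by the number of simulations.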
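The reason that MSM does not suffer from this problem is that in this case the unbiased simulator appears linearly within every sum of terms, and it appears within a sum over $n$ (see equation 19.3). Therefore the SLLN applies to cancel out simulation errors, from which we get consistency. That is, using simple notation for the random sampling case, the moment conditions
$$\tilde m(\theta) = \frac{1}{n} \sum_{t=1}^{n} \left[ K(y_t, x_t) - \frac{1}{H} \sum_{h=1}^{H} K(\tilde y_t^h, x_t) \right] z_t \qquad (19.5)$$
$$= \frac{1}{n} \sum_{t=1}^{n} \left[ k(x_t, \theta^0) + \varepsilon_t - \frac{1}{H} \sum_{h=1}^{H} \left[ k(x_t, \theta) + \tilde\varepsilon_{ht} \right] \right] z_t \qquad (19.6)$$
converge almost surely to
$$\tilde m_\infty(\theta) = \int \left[ k(x, \theta^0) - k(x, \theta) \right] z(x) d\mu(x).$$
(note: $z_t$ is assumed to be made up of functions of $x_t$). The objective function converges to
$$s_\infty(\theta) = \tilde m_\infty(\theta)' \Omega_\infty^{-1} \tilde m_\infty(\theta)$$
which obviously has a minimum at $\theta^0$, hence consistency.

If you look at equation 19.6 a bit, you will see why the variance inflation factor is $(1 + \frac{1}{H})$.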
A poor choice of moment conditions may lead to very inefficient estimators, and can even cause identification problems (as we've seen with the GMM problem set).

The drawback of the above approach (MSM) is that the moment conditions used in estimation are selected arbitrarily. The asymptotic efficiency of the estimator may be low.

The asymptotically optimal choice of moments would be the score vector of the likelihood function,
$$m_t(\theta) = D_\theta \ln p_t(\theta \mid I_t)$$
As before, this choice is unavailable.

The efficient method of moments (EMM) (see Gallant and Tauchen (1996), "Which Moments to Match?", ECONOMETRIC THEORY, Vol. 12, 1996, pages 657-681) seeks to provide moment conditions that closely mimic the score vector. If the approximation is very good, the resulting estimator will be very nearly fully efficient.

The DGP is characterized by random sampling from the density
$$p(y_t|x_t, \theta^0) \equiv p_t(\theta^0)$$
We can define an auxiliary model, called the "score generator", which simply provides a (misspecified) parametric density
$$f(y|x_t, \lambda) \equiv f_t(\lambda)$$
This density is known up to a parameter $\lambda$, and quasi-ML estimation is possible. Specifically,
$$\hat\lambda = \arg\max_\Lambda s_n(\lambda) = \frac{1}{n} \sum_{t=1}^{n} \ln f_t(\lambda).$$
After determining $\hat\lambda$ we can calculate the score functions $D_\lambda \ln f(y_t|x_t, \hat\lambda)$.
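The important point is that even if the density is misspecified, there is a pseudo-true $\lambda^0$ for which the true expectation, taken with respect to the true but unknown density of $y$, $p(y|x_t, \theta^0)$, and then marginalized over $x$, is zero:
$$\exists \lambda^0 : E_X E_{Y|X} \left[ D_\lambda \ln f(y|x, \lambda^0) \right] = \int_X \int_{Y|X} D_\lambda \ln f(y|x, \lambda^0) p(y|x, \theta^0) dy \, d\mu(x) = 0$$
We have seen in the section on QML that $\hat\lambda \xrightarrow{p} \lambda^0$; this suggests using the moment conditions
$$m_n(\theta, \hat\lambda) = \frac{1}{n} \sum_{t=1}^{n} \int D_\lambda \ln f_t(\hat\lambda) p_t(\theta) dy \qquad (19.7)$$
These moment conditions are not calculable, since $p_t(\theta)$ is not available, but they are simulable using
$$\tilde m_n(\theta, \hat\lambda) = \frac{1}{n} \sum_{t=1}^{n} \frac{1}{H} \sum_{h=1}^{H} D_\lambda \ln f(\tilde y_t^h|x_t, \hat\lambda)$$
where $\tilde y_t^h$ is a draw from $DGP(\theta)$, holding $x_t$ fixed. By the LLN and the fact that $\hat\lambda$ converges to $\lambda^0$,
$$\tilde m_\infty(\theta^0, \lambda^0) = 0.$$
This is not the case for other values of $\theta$, assuming that $\lambda^0$ is identified.

The advantage of this procedure is that if $f(y_t|x_t, \lambda)$ closely approximates $p(y|x_t, \theta)$, then $\tilde m_n(\theta, \hat\lambda)$ will closely approximate the optimal moment conditions which characterize maximum likelihood estimation, which is fully efficient.

If one has prior information that a certain density approximates the data well, it would be a good choice for $f(\cdot)$.

If one has no density in mind, there exist good ways of approximating unknown distributions parametrically: Phillips' ERA's (Econometrica, 1983) and Gallant and Nychka's (Econometrica, 1987) SNP density estimator which we saw before. Since the SNP density is consistent, the efficiency of the indirect estimator is the same as the infeasible ML estimator.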
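Gallant and Tauchen give the theory for the case of $H$ so large that it may be treated as infinite (the difference being irrelevant given the numerical precision of a computer). The theory for the case of $H$ infinite follows directly from the results presented here.

The moment condition $\tilde m(\theta, \hat\lambda)$ depends on the pseudo-ML estimate $\hat\lambda$. We can apply Theorem 22 to conclude that
$$\sqrt{n} \left( \hat\lambda - \lambda^0 \right) \xrightarrow{d} N\left[ 0, J(\lambda^0)^{-1} I(\lambda^0) J(\lambda^0)^{-1} \right] \qquad (19.8)$$
If the density $f(y_t|x_t, \hat\lambda)$ were in fact the true density $p(y|x_t, \theta)$, then $\hat\lambda$ would be the maximum likelihood estimator, and $J(\lambda^0)^{-1} I(\lambda^0)$ would be an identity matrix, due to the information matrix equality. However, in the present case we assume that $f(y_t|x_t, \hat\lambda)$ is only an approximation to $p(y|x_t, \theta)$, so there is no cancellation.

Recall that $J(\lambda^0) \equiv p\lim \left( \frac{\partial^2}{\partial\lambda\partial\lambda'} s_n(\lambda^0) \right)$. Comparing the definition of $s_n(\lambda)$ with the definition of the moment condition in Equation 19.7, we see that
$$J(\lambda^0) = D_{\lambda'} m(\theta^0, \lambda^0).$$
As in Theorem 22,
$$I(\lambda^0) = \lim_{n \to \infty} E\left[ n \left. \frac{\partial s_n(\lambda)}{\partial\lambda} \right|_{\lambda^0} \left. \frac{\partial s_n(\lambda)}{\partial\lambda'} \right|_{\lambda^0} \right].$$
In this case, this is simply the asymptotic variance covariance matrix of the moment conditions. Now take a first order Taylor's series approximation to $\sqrt{n} \tilde m_n(\theta^0, \hat\lambda)$ about $\lambda^0$:
$$\sqrt{n} \tilde m_n(\theta^0, \hat\lambda) = \sqrt{n} \tilde m_n(\theta^0, \lambda^0) + \sqrt{n} D_{\lambda'} \tilde m(\theta^0, \lambda^0) \left( \hat\lambda - \lambda^0 \right) + o_p(1)$$
First consider $\sqrt{n} \tilde m_n(\theta^0, \lambda^0)$. It is straightforward but somewhat tedious to show that the asymptotic variance of this term is $\frac{1}{H} I_\infty(\lambda^0)$.

Next consider the second term $\sqrt{n} D_{\lambda'} \tilde m(\theta^0, \lambda^0) \left( \hat\lambda - \lambda^0 \right)$. Note that $D_{\lambda'} \tilde m_n(\theta^0, \lambda^0) \xrightarrow{a.s.} J(\lambda^0)$, so we have
$$\sqrt{n} D_{\lambda'} \tilde m(\theta^0, \lambda^0) \left( \hat\lambda - \lambda^0 \right) = \sqrt{n} J(\lambda^0) \left( \hat\lambda - \lambda^0 \right), \text{ a.s.}$$
But noting equation 19.8
$$\sqrt{n} J(\lambda^0) \left( \hat\lambda - \lambda^0 \right) \overset{a}{\sim} N\left[ 0, I(\lambda^0) \right]$$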
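Now, combining the results for the first and second terms,
$$\sqrt{n} \tilde m_n(\theta^0, \hat\lambda) \overset{a}{\sim} N\left[ 0, \left( 1 + \frac{1}{H} \right) I(\lambda^0) \right]$$
Suppose that $\widehat{I(\lambda^0)}$ is a consistent estimator of the asymptotic variance-covariance matrix of the moment conditions. This may be complicated if the score generator is a poor approximator, since the individual score contributions may not have mean zero in this case (see the section on QML). Even if this is the case, the individual means can be calculated by simulation, so it is always possible to consistently estimate $I(\lambda^0)$ when the model is simulable. On the other hand, if the score generator is taken to be correctly specified, the ordinary estimator of the information matrix is consistent. Combining this with the result on the efficient GMM weighting matrix in Theorem 25, we see that defining $\hat\theta$ as
$$\hat\theta = \arg\min_\Theta m_n(\theta, \hat\lambda)' \left[ \left( 1 + \frac{1}{H} \right) \widehat{I(\lambda^0)} \right]^{-1} m_n(\theta, \hat\lambda)$$
is the GMM estimator with the efficient choice of weighting matrix.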
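If one has used the Gallant-Nychka ML estimator as the auxiliary model, the appropriate weighting matrix is simply the information matrix of the auxiliary model, since the scores are uncorrelated (e.g., it really is ML estimation asymptotically, since the score generator can approximate the unknown density arbitrarily well).

Since we use the optimal weighting matrix, the asymptotic distribution of the estimator is
$$\sqrt{n} \left( \hat\theta - \theta^0 \right) \xrightarrow{d} N\left[ 0, \left( D_\infty \left[ \left( 1 + \frac{1}{H} \right) I(\lambda^0) \right]^{-1} D_\infty' \right)^{-1} \right],$$
where
$$D_\infty = \lim_{n \to \infty} E\left[ D_\theta m_n'(\theta^0, \lambda^0) \right].$$
This can be consistently estimated using
$$\hat D = D_\theta m_n'(\hat\theta, \hat\lambda)$$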
a
0
nmn ( , ) N 0, 1 +
I( )
H
0
implies that
)
nmn (,
where
is
dim() dim(),
1
1+
H
1
a 2
)
mn (,
(q)
I()
sin e without
dim()
identied, so testing is impossible. One test of the model is simply based on this statisti
: if
it ex
eeds the
2 (q) riti al point, something may be wrong (the small sample performan e
diag
1
1+
H
1/2 !1
nmn (,
I()
an be used to test whi
h moments are not well modeled. Sin
e these moments are
related to parameters of the s
ore generator, whi
h are usually related to
ertain
features of the model, this information
an be used to revise the model. These aren't
and nmn (,
)
have dierent
N (0, 1), sin
e nmn ( 0 , )
)
is somewhat more
ompli
ated). It
an be shown
nmn (,
a
tually distributed as
distributions (that of
that the pseudo-t statisti
s are biased toward nonreje
tion. See Gourieroux
or Gallant and Long, 1995, for more details.
et. al.
19.5 Examples
19.5.1 Estimation of stochastic differential equations
It is often convenient to formulate theoretical models in terms of differential equations, and when the observation frequency is high (e.g., weekly, daily, hourly or real-time) it may be more natural to adopt this framework for econometric models of time series.

The most common approach to estimation of stochastic differential equations is to "discretize" the model, as above, and estimate using the discretized version. However, since the discretization is only an approximation to the true discrete-time version of the model (which is not calculable), the resulting estimator is in general biased and inconsistent.

An alternative is to use indirect inference: The discretized model is used as the score generator. That is, one estimates by QML to obtain the scores of the discretized approximation. Indicate these scores by $m_n(\theta, \hat\phi)$. Then the system of stochastic differential equations is simulated over $\theta$, and the scores are calculated and averaged over the simulations
$$\tilde m_n(\theta, \hat\phi) = \frac{1}{N} \sum_{i=1}^{N} m_{in}(\theta, \hat\phi)$$
$\hat\theta$ is chosen to set the simulated scores to zero
$$\tilde m_n(\hat\theta, \hat\phi) \equiv 0$$
(since $\theta$ and $\phi$ are of the same dimension).
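This method requires simulating the stochastic differential equation. There are many ways of doing this. Basically, they involve doing very fine discretizations:
$$y_{t+\tau} = y_t + g(\theta, y_t)\tau + h(\theta, y_t)\eta_t$$
$$\eta_t \sim N(0, \tau)$$
By setting $\tau$ very small, the sequence of $\eta_t$ approximates a Brownian motion fairly well.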
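A fine discretization of this kind (the Euler scheme) can be sketched as follows. The example process, an Ornstein-Uhlenbeck model with drift kappa*(mu - y) and constant diffusion sigma, is an assumption made for illustration, as are the function names and the choice of 100 sub-steps per observation interval.

```python
import math
import random

def simulate_sde(g, h, theta, y0, n_obs, steps_per_obs, rng):
    """Simulate n_obs observations, taking steps_per_obs small steps between each."""
    tau = 1.0 / steps_per_obs
    y = y0
    path = [y0]
    for _ in range(n_obs):
        for _ in range(steps_per_obs):
            eta = rng.gauss(0.0, math.sqrt(tau))  # eta ~ N(0, tau)
            y = y + g(theta, y) * tau + h(theta, y) * eta
        path.append(y)  # only the value at each observation time is recorded
    return path

def ou_drift(theta, y):
    kappa, mu, sigma = theta
    return kappa * (mu - y)

def ou_diffusion(theta, y):
    kappa, mu, sigma = theta
    return sigma

rng = random.Random(0)
path = simulate_sde(ou_drift, ou_diffusion, (1.0, 2.0, 0.5), 0.0, 20, 100, rng)
print(path[:5])
```

Only every `steps_per_obs`-th value is kept, which mimics a continuous process that is observed in discrete time. In an estimation routine the random numbers would be held fixed across parameter values.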
This is only one method of using indirect inference for estimation of differential equations. There are others (see Gallant and Long, 1995 and Gourieroux et. al.). Use of a series approximation to the transitional density as in Gallant and Long is an interesting possibility since the score generator may have a higher dimensional parameter than the model, which allows for diagnostic testing. In the method described above the score generator's parameter $\phi$ is of the same dimension as is $\theta$, so diagnostic testing is not possible.
The file probitdgp.m generates data that follows the probit model. The file emm_moments.m defines EMM moment conditions, where the DGP and score generator can be passed as arguments. Thus, it is a general purpose moment condition for EMM estimation. This file is interesting enough to warrant some discussion. A listing appears in Listing 19.1. Line 3 defines the DGP, and the arguments needed to evaluate it are defined in line 4. The score generator is defined in line 5, and its arguments are defined in line 6. The QML estimate of the parameter of the score generator is read in line 7. Note that the random draws needed to simulate data are passed with the data, and are thus fixed during estimation, to avoid "chattering". The simulated data is generated in line 16, and the derivative of the score generator using the simulated data is calculated in line 18. In line 20 we average the scores of the score generator, which are the moment conditions that the function returns.
[Listing 19.1: emm_moments.m]
 gradient    change
   0.0000    0.0000
  -0.0000    0.0000
  -0.0000    0.0000
  -0.0000    0.0000
  -0.0000    0.0000

======================================================
Model results:
******************************************************
EMM example

GMM Estimation Results
BFGS convergence: Normal convergence

Objective function value: 0.000000
Observations: 1000

Exactly identified, no spec. test

      estimate   st. err    t-stat   p-value
p1       1.069     0.022    47.618     0.000
p2       0.935     0.022    42.240     0.000
p3       1.085     0.022    49.630     0.000
p4       1.080     0.022    49.047     0.000
p5       0.978     0.023    41.643     0.000
******************************************************
It might be interesting to compare the standard errors with those obtained from ML estimation, to check efficiency of the EMM estimator. One could even do a Monte Carlo study.
Exercises
1. Do SML estimation of the probit model.
2. Do a little Monte Carlo study to compare ML, SML and EMM estimation of the probit model. Investigate how the number of simulations affects the two simulation-based estimators.
Chapter 20
Parallel programming for econometrics
The following borrows heavily from Creel (2005).
Parallel computing can offer an important reduction in the time to complete computations. This is well-known, but it bears emphasis since it is the main reason that parallel computing may be attractive to users. To illustrate, the Intel Pentium IV (Willamette) processor, running at 1.5GHz, was introduced in November of 2000.
The Pentium IV
languages that allow the incorporation of parallelism into programs written in these languages. A third is the spread of dual and quad-core CPUs, so that an ordinary desktop or laptop computer can be made into a mini-cluster. Those cores won't work together on a single problem unless they are told how to.
Following are examples of parallel implementations of several mainstream problems in econometrics.
nearly identical to the interface of equivalent serial versions, end users will find it easy to take advantage of parallel computing's performance. We continue to use Octave, taking
1 By high-level matrix programming language I mean languages such as MATLAB (TM the Mathworks, Inc.), Ox (TM OxMetrics Technologies, Ltd.), and GNU Octave (www.octave.org), for example.
et al. (2004). There are also parallel packages for Ox, R, and Python which may be of interest to econometricians, but as of this writing, the following examples are the most accessible introduction to parallel programming for econometricians.
et al.
et. al.
a function that calculates the trace test statistic for the lack of cointegration of integrated time series. This function is illustrative of the format that we adopt for Monte Carlo simulation of a function: it receives a single argument of cell type, and it returns a row vector that holds the results of one random simulation. The single argument in this case is a cell array that holds the length of the series in its first position, and the number of series in the second position. It generates a random result through a process that is internal to the function, and it reports some output in a row vector (in this case the result is a scalar).
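The calling format just described can be sketched as follows. This is not tracetest.m. The function mc_trial is a hypothetical stand-in that follows the same convention (one cell-array argument in, one row vector out), and the serial loop only mimics what a Monte Carlo driver would do.

```octave
1;  % script marker so that functions can be defined in this file

% Hypothetical simulation function in the format described above:
% a single cell-array argument in, a row vector of results out.
function result = mc_trial(args)
  n = args{1};                       % length of each series
  k = args{2};                       % number of series
  x = cumsum(randn(n, k));           % k independent random walks
  result = max(abs(x(end, :)));      % scalar summary of this replication
end

args = {50, 2};                      % series length and number of series
reps = 200;
results = zeros(reps, 1);
for r = 1:reps
  results(r, :) = mc_trial(args);    % one row per replication
end
```

Keeping all randomness inside the function is what lets a driver call it repeatedly, on one machine or on several, without knowing anything about the statistic being simulated.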
mc_example1.m is an Octave script that executes a Monte Carlo study of the trace test by repeatedly evaluating the tracetest.m function through montecarlo.m. In line 10, there is a fourth argument. When called with four arguments, the last argument is the number of slave hosts to use. We see that running the Monte Carlo study on one or more processors is transparent to the user - he or she must only indicate the number of slave computers to be used.
20.1.2 ML
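For a sample $\{(y_t, x_t)\}_n$ of $n$ observations, the maximum likelihood estimator of the parameter $\theta$ can be defined as
$$\hat{\theta} = \arg\max s_n(\theta)$$
where
$$s_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} \ln f(y_t | x_t, \theta)$$
Here, $y_t$ may be a vector of random variables, and the model may be dynamic since $x_t$ may contain lags of $y_t$. As Swann (2002) points out, this can be broken into sums over blocks of observations, for example two blocks:
$$s_n(\theta) = \frac{1}{n} \left\{ \left( \sum_{t=1}^{n_1} \ln f(y_t | x_t, \theta) \right) + \left( \sum_{t=n_1+1}^{n} \ln f(y_t | x_t, \theta) \right) \right\}$$
Analogously, we can define up to $n$ blocks.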
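The block decomposition can be checked numerically. The sketch below assumes a normal density purely for illustration; the split point n1, the parameter values, and the variable names are hypothetical. In a parallel setting each block sum would be computed on a different machine, while here both are computed serially.

```octave
% Average log-likelihood computed in full and as a sum over two blocks.
randn("seed", 7);
n = 1000; n1 = 400;                      % n1 is a hypothetical split point
y = 2 + 1.5*randn(n, 1);
theta = [2; 1.5];                        % [mean; standard deviation]
% log density of N(theta(1), theta(2)^2), evaluated observation by observation
lnf = @(y, theta) -0.5*log(2*pi) - log(theta(2)) ...
                  - 0.5*((y - theta(1))/theta(2)).^2;
s_full = sum(lnf(y, theta))/n;           % ordinary serial calculation
block1 = sum(lnf(y(1:n1), theta));       % first block
block2 = sum(lnf(y(n1+1:n), theta));     % second block
s_blocks = (block1 + block2)/n;          % recombined objective
```

The two values agree up to floating-point rounding, which is why the estimates from serial and blockwise evaluation should coincide.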
data is read, the name of the density function is provided in the variable model, and the initial value of the parameter vector is set. In line 5, the function mle_estimate performs ordinary serial calculation of the ML estimator, while in line 7 the same function is called with 6 arguments. The fourth and fifth arguments are empty placeholders where options to mle_estimate may be set, while the sixth argument is the number of slave computers to use for parallel execution, 1 in this case. A person who runs the program sees no parallel programming code - the parallelization is transparent to the end user, beyond having to select the number of slave computers. When executed, this script prints out the estimates theta_s and theta_p.
It is worth noting that a different likelihood function may be used by making the model variable point to a different function. The likelihood function itself is an ordinary Octave function that is not parallelized. The mle_estimate function can call any likelihood function that has the appropriate input/output syntax for evaluation either serially or in parallel. Users need only learn how to write the likelihood function.
20.1.3 GMM
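For a sample as above, the GMM estimator of the parameter $\theta$ can be defined as
$$\hat{\theta} \equiv \arg\min s_n(\theta)$$
where
$$s_n(\theta) = m_n(\theta)' W_n m_n(\theta)$$
and
$$m_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} m_t(y_t | x_t, \theta)$$
Since $m_n(\theta)$ is an average, it can be computed in blocks:
$$m_n(\theta) = \frac{1}{n} \left\{ \left( \sum_{t=1}^{n_1} m_t(y_t | x_t, \theta) \right) + \left( \sum_{t=n_1+1}^{n} m_t(y_t | x_t, \theta) \right) \right\} \tag{20.1}$$
Each block could be computed on a different machine.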
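A small numerical check of equation 20.1 follows. It assumes moment conditions of the form $x_t (y_t - x_t'\theta)$ for a linear model, an identity weight matrix, and an arbitrary block boundary; these choices and all variable names are hypothetical and serve only to show that the blockwise average equals the full-sample average.

```octave
% Blockwise moment conditions and the GMM objective for a linear model.
randn("seed", 11);
n = 500; n1 = 250;                        % n1: hypothetical block boundary
x = [ones(n, 1), randn(n, 1)];
y = x*[1; 2] + randn(n, 1);
theta = [0.9; 2.1];                       % trial parameter value
mt = @(y, x, theta) x.*(y - x*theta);     % n x 2 matrix of moment contributions
m_full = mean(mt(y, x, theta), 1)';       % full-sample average
sum1 = sum(mt(y(1:n1), x(1:n1, :), theta), 1);        % first block
sum2 = sum(mt(y(n1+1:n), x(n1+1:n, :), theta), 1);    % second block
m_blocks = (sum1 + sum2)'/n;              % recombined average
W = eye(2);                               % weight matrix (identity, assumed)
s_n = m_blocks'*W*m_blocks;               % GMM objective at theta
```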
gmm_example1.m is a script that illustrates how GMM estimation may be done serially or in parallel. When this is run, the serial and parallel estimates agree up to the tolerance for convergence of the minimization routine. The point to notice here is that an end user can perform the estimation in parallel in virtually the same way as it is done serially. Again, gmm_estimate, used in lines 8 and 10, is a generic function that will estimate any model specified by a user-written function that uses no parallel programming, so users can write their models using the simple and intuitive HLMP syntax of Octave. Whether estimation is done in parallel or serially depends only on the seventh argument to gmm_estimate. When it is omitted, estimation is by default done serially with one processor. When it is positive, it specifies the number of slave nodes to use.
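The kernel regression estimator of $g(x)$ at a point $x$ is
$$\hat{g}(x) = \frac{\sum_{t=1}^{n} y_t K\left[(x - x_t)/\gamma_n\right]}{\sum_{t=1}^{n} K\left[(x - x_t)/\gamma_n\right]} \equiv \sum_{t=1}^{n} w_t y_t$$
We see that the weight depends upon every data point in the sample. To calculate the fit at every point in a sample of size $n$, on the order of $n^2 k$ calculations must be done, where $k$ is the dimension of $x$.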
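The weighted-average form of the estimator can be sketched as below. The Gaussian kernel, the bandwidth value, and the function name kernel_fit are assumptions made for the example. The loop over evaluation points is the part that could be split across nodes, since each fitted value is computed independently of the others.

```octave
1;  % script marker so that functions can be defined in this file

function ghat = kernel_fit(xeval, x, y, bw)
  % Kernel regression fit at each row of xeval, using a Gaussian kernel
  neval = rows(xeval);
  ghat = zeros(neval, 1);
  for i = 1:neval
    u = (xeval(i, :) - x)/bw;            % n x k scaled differences
    K = exp(-0.5*sum(u.^2, 2));          % kernel weights, up to a constant
    ghat(i) = sum(K.*y)/sum(K);          % weighted average of the y values
  end
end

rand("seed", 3); randn("seed", 3);
n = 200;
x = 2*pi*rand(n, 1);
y = sin(x) + 0.1*randn(n, 1);
ghat = kernel_fit(x, x, y, 0.3);         % fit at every sample point
```

Each pass through the loop touches all n observations, so fitting at all n points costs on the order of n squared kernel evaluations.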
et .
finish the problem on a single node divided by the time to finish the problem on multiple nodes. Note that you can get 10X speedups, as claimed in the introduction. It's pretty obvious that much greater speedups could be obtained using a larger cluster, for the embarrassingly parallel problems.
[Figure: speedups plotted against the number of nodes, with series MONTECARLO, BOOTSTRAP, MLE, GMM, and KERNEL]
Bibliography
[1] Bruche, M. (2003) A note on embarassingly parallel computation using OpenMosix and Ox, working paper, Financial Markets Group, London School of Economics.
[2] Creel, M. (2005) User-friendly parallel computations with econometric examples,
at .ugr.es/javier-bin/mpitb.
[5] Racine, Jeff (2002) Parallel distributed kernel estimation,
Chapter 21
Final project: econometric estimation of a RBC model
THIS IS NOT FINISHED - IGNORE IT FOR NOW
In this last chapter we'll go through a worked example that combines a number of the topics we've seen.
21.1 Data
We'll develop a model for private consumption and real gross private investment. The data are obtained from the US Bureau of Economic Analysis (BEA) National Income and Product Accounts (NIPA), Table 11.1.5, Lines 2 and 6 (you can download quarterly data from 1947-I to the present). The data we use are in the file rbc_data.m. This data is real (constant dollars).
The program plots.m will make a few plots, including Figures 21.1 through 21.3. First looking at the plot for levels, we can see that real consumption and investment are clearly nonstationary (surprise, surprise). There appears to be somewhat of a structural change in the mid-1970's.
Looking at growth rates, the series for consumption has an extended period of high growth in the 1970's, becoming more moderate in the 90's. The volatility of growth of consumption has declined somewhat, over time. Looking at investment, there are some notable periods of high volatility in the mid-1970's and early 1980's, for example. Since 1990 or so, volatility seems to have declined.
Economic models for growth often imply that there is no long term growth (!) - the
[Figure 21.1: Examples/RBC/levels.eps]
[Figure 21.2: Examples/RBC/growth.eps]
[Figure 21.3: Examples/RBC/filtered.eps]
generate needs to be passed through the inverse of a filter. We'll follow this, and generate stationary business cycle data by applying the bandpass filter of Christiano and Fitzgerald (1999). The filtered data is in Figure 21.3. We'll try to specify an economic model that can generate similar data. To get data that look like the levels for consumption and investment, we'd need to apply the inverse of the bandpass filter.
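$$\max_{\{c_t, k_t\}_{t=0}^{\infty}} E_0 \sum_{t=0}^{\infty} \beta^t U(c_t)$$
subject to
$$c_t + k_t = (1 - \delta) k_{t-1} + \phi_t k_{t-1}^{\alpha}$$
$$\log \phi_t = \rho \log \phi_{t-1} + \epsilon_t$$
$$\epsilon_t \sim IIN(0, \sigma_{\epsilon}^2)$$
The utility function is
$$U(c_t) = \frac{c_t^{1-\gamma} - 1}{1 - \gamma}$$
The technology shock $\phi_t$ is observed in period $t$. When $\gamma = 1$, the utility function is logarithmic. Gross investment, $i_t$, is
$$i_t = k_t - (1 - \delta) k_{t-1}$$
The initial condition $(k_0, \phi_0)$ is given.
We would like to estimate the parameters $\theta = (\beta, \gamma, \delta, \alpha, \rho, \sigma_{\epsilon}^2)'$ using the data that we have on consumption and investment. This problem is very similar to the GMM estimation of the portfolio model discussed in Sections 15.11 and 15.12. One can derive the Euler condition in the same way we did there, and use it to define a GMM estimator. That approach was not very successful, recall.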
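The logarithmic limit of the utility function can be checked numerically. The short Octave sketch below assumes the constant relative risk aversion form with a $-1$ in the numerator, so that utility tends to the log of consumption as the curvature parameter tends to one. The function name crra_utility and the test values are hypothetical.

```octave
1;  % script marker so that functions can be defined in this file

function u = crra_utility(c, gam)
  % (c^(1-gam) - 1)/(1 - gam), with the log case handled explicitly
  if abs(gam - 1) < 1e-12
    u = log(c);
  else
    u = (c.^(1 - gam) - 1)/(1 - gam);
  end
end

c = [0.5; 1; 2];                          % hypothetical consumption levels
u_near = crra_utility(c, 1 + 1e-6);       % curvature very close to one
u_log = crra_utility(c, 1);               % logarithmic case
gap = max(abs(u_near - u_log));           % small when the limit holds
```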
Let $y_t$ be a $G$-vector of variables. A VAR($p$) model is
$$y_t = c + A_1 y_{t-1} + A_2 y_{t-2} + \dots + A_p y_{t-p} + v_t$$
where $c$ is a $G$-vector of parameters, and $A_j$, $j=1,2,...,p$, are $G \times G$ matrices of parameters. The variance of the error $v_t$ is $R_t R_t'$. You can think of a VAR model as the reduced form of a dynamic linear simultaneous equations model where all of the variables are treated as endogenous. Clearly, if all of the variables are endogenous, one would need some form of additional information to identify a structural model. But we already have a structural model, and we're only going to use the VAR to help us estimate the parameters. A well-fitting reduced form model will be adequate for the purpose.
We've seen that our data seems to have episodes where the variance of growth rates and filtered data is non-constant. This brings us to the general area of stochastic volatility. Without going into details, we'll just consider the exponential GARCH model of Nelson (1991) as presented in Hamilton (1994, pg. 668-669).
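Define $h_t = vec^*(R_t)$, the vector of elements in the upper triangle of $R_t$; in our case this is a $3 \times 1$ vector. We assume that the elements follow
$$\log h_{jt} = \kappa_j + P_{(j,.)} \left\{ |v_{t-1}| - \sqrt{2/\pi} + \aleph_{(j,.)} v_{t-1} \right\} + G_{(j,.)} \log h_{t-1}$$
The variance of the VAR error depends upon its own past, as well as upon the past realizations of the shocks. The specification can be generalized to include longer lags ($r$ for lags of $v$, $m$ for lags of $h$).
The advantage of the EGARCH formulation is that the variance is assuredly positive without parameter restrictions.
The matrix $P$ has dimension $3 \times 2$.
The matrix $G$ has dimension $3 \times 3$.
The matrix $\aleph$ has dimension $2 \times 2$. It allows for leverage, so that positive and negative shocks can have asymmetric effects on volatility.
We will probably want to restrict these parameter matrices in some way.
With this specification,
$$\eta_t \sim IIN(0, I_2)$$
$$\eta_t = R_t^{-1} v_t$$
and $R_t$ and $v_t$ can be calculated given the data and the parameters.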
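The first order condition for the structural model is
$$c_t^{-\gamma} = \beta E_t \left[ c_{t+1}^{-\gamma} \left( 1 - \delta + \alpha \phi_{t+1} k_t^{\alpha - 1} \right) \right]$$
or
$$c_t = \left\{ \beta E_t \left[ c_{t+1}^{-\gamma} \left( 1 - \delta + \alpha \phi_{t+1} k_t^{\alpha - 1} \right) \right] \right\}^{-\frac{1}{\gamma}}$$
We cannot solve this directly for $c_t$, since the expectation is unknown. It is replaced by a parametric function of variables that have been realized in period $t$,
$$E_t \left[ c_{t+1}^{-\gamma} \left( 1 - \delta + \alpha \phi_{t+1} k_t^{\alpha - 1} \right) \right] \simeq \exp \left( \rho_0 + \rho_1 \log \phi_t + \rho_2 \log k_{t-1} \right)$$
For given values of the parameters of this approximating function, we can solve for $c_t$, and then for $k_t$ using the restriction
$$c_t + k_t = (1 - \delta) k_{t-1} + \phi_t k_{t-1}^{\alpha}$$
This allows us to generate a series $\{(c_t, k_t)\}$. Then the approximation is updated by fitting
$$c_{t+1}^{-\gamma} \left( 1 - \delta + \alpha \phi_{t+1} k_t^{\alpha - 1} \right) = \exp \left( \rho_0 + \rho_1 \log \phi_t + \rho_2 \log k_{t-1} \right) + u_t$$
The two step procedure of generating data and updating the parameters of the approximation to expectations is iterated until the parameters no longer change. When this is the case, the expectations function is the best fit to the generated data. As long as it is a rich enough parametric model to encompass the true expectations function, it can be made to be equal to the true expectations function by using a long enough simulation.
Thus, given the parameters of the structural model, $\theta = (\beta, \gamma, \delta, \alpha, \rho, \sigma_{\epsilon}^2)'$, we can generate data $\{(c_t, k_t)\}$ using the parameterized expectations algorithm. From this we can get the series $\{(c_t, i_t)\}$ using $i_t = k_t - (1 - \delta) k_{t-1}$. This can be used to do EMM estimation using the scores of the reduced form model to define moments, using the simulated data from the structural model.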
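The two-step iteration can be illustrated in a special case. The Octave sketch below assumes full depreciation and logarithmic utility (depreciation rate and curvature both equal to one), for which the closed-form solution is known and the log-linear approximating function is exact. It performs one simulation step and one updating step starting from the exact coefficients, so the updated coefficients should reproduce the starting ones. The updating step uses least squares on logs in place of nonlinear least squares, which is a simplification, and all parameter values and variable names are hypothetical.

```octave
% One step of the parameterized expectations iteration, special case
% delta = 1 and gam = 1, where consumption is (1 - alpha*beta)*phi*k^alpha.
alpha = 0.33; beta = 0.95; delta = 1; gam = 1; rho = 0.9; sigma = 0.01;
T = 500;
randn("seed", 1);                         % hold the shocks fixed
eps_t = sigma*randn(T, 1);
logphi = zeros(T, 1);
logphi(1) = eps_t(1);
for t = 2:T
  logphi(t) = rho*logphi(t-1) + eps_t(t); % AR(1) in the log technology shock
end
phi = exp(logphi);

% Coefficients of exp(a(1) + a(2)*log(phi_t) + a(3)*log(k_{t-1})):
% the exact values for this special case.
a = [-log((1 - alpha*beta)*beta); -1; -alpha];

% Step 1: generate {c_t, k_t} given the approximating function.
k = zeros(T, 1); c = zeros(T, 1);
X = zeros(T, 3);
klag = 0.2;                               % hypothetical initial capital
for t = 1:T
  X(t, :) = [1, logphi(t), log(klag)];
  c(t) = (beta*exp(X(t, :)*a))^(-1/gam);  % consumption from the Euler equation
  k(t) = (1 - delta)*klag + phi(t)*klag^alpha - c(t);   % resource constraint
  klag = k(t);
end

% Step 2: update the coefficients by regressing the realized term at t+1
% on the period-t regressors (in logs).
e = c(2:T).^(-gam).*(1 - delta + alpha*phi(2:T).*k(1:T-1).^(alpha - 1));
a_new = X(1:T-1, :) \ log(e);
```

In general the starting coefficients would be a guess, the update would usually be damped, and the two steps would be repeated until `a_new` and `a` agree.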
Bibliography
[1] Creel, M. (2005) A Note on Parallelizing the Parameterized Expectations Algorithm.
[2] den Haan, W. and Marcet, A. (1990) Solving the stochastic growth model by parameterized expectations,
[3] Hamilton, J. (1994)
[4] Maliar, L. and Maliar, S. (2003) Matlab code for Solving a Neoclassical Growth Model with a Parametriz
[5] Nelson, D. (1991) Conditional heteroscedasticity in asset returns: a new approach,
http://ideas.repec.org/p/fip/fedfap/2002-13.html
Chapter 22
Introduction to Octave
Why is Octave being used here, since it's not that well-known by econometricians? Well, because it is a high quality environment that is easily extensible, uses well-tested and high performance numerical libraries, it is licensed under the GNU GPL, so you can get it for free and modify it if you like, and it runs on GNU/Linux, Mac OSX and Windows systems. It's also quite easy to learn.
This will give you this same PDF file, but with all of the
running the CD (or sitting in the computer room across the hall from my office), or that
you have configured your computer to be able to run the *.m
After this, you can look at the example programs scattered throughout the document (and edit them, and run them) to learn more about how Octave can be used to do econometrics. Students of mine: your problem sets will include exercises that can be done by modifying the example programs in relatively minor ways. So study the examples!
Octave can be used interactively, or it can be used to run programs that are written using a text editor. We'll use this second method, preparing programs with NEdit, and calling Octave from within the editor. The program first.m gets us started. To run this, open it up with NEdit (by finding the correct file inside the /home/knoppix/Desktop/Econometrics folder and clicking on the icon) and then type CTRL-ALT-o, or use the Octave item in the Shell menu (see Figure 22.1).
That's because printf() does not automatically start a new line. Edit first.m so that the 8th line reads printf("hello world\n");
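The effect of the newline escape can be seen in a couple of lines. This is a stand-alone sketch and not the contents of first.m; the variable s is introduced only so that the result can be inspected.

```octave
printf("hello world");            % no newline is added automatically
printf("hello world\n");          % "\n" inside double quotes ends the line
s = sprintf("hello world\n");     % same formatting, returned as a string
last_is_newline = (s(end) == "\n");
```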
We need to know how to load and save data. The program second.m shows how. Once you have run this, you will find the file x in the directory
You might have a look at it with NEdit to see Octave's default format for saving data. Basically, if you have data in an ASCII text file, named for example myfile.data, formed of numbers separated by spaces, just use the command load myfile.data. After having done so, the matrix myfile (without extension) will contain the data.
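A minimal round trip through an ASCII file might look like the following. This is not second.m. The file name is generated with tempname so that nothing in the working directory is overwritten, and the function form of load with an output argument is used in place of the command form described above.

```octave
x = [1 2 3; 4 5 6];
datafile = [tempname() ".data"];      % hypothetical temporary file name
fid = fopen(datafile, "w");
fprintf(fid, "%g %g %g\n", x');       % one row of x per line, space separated
fclose(fid);
y = load(datafile);                   % numbers separated by spaces become a matrix
delete(datafile);                     % clean up the temporary file
```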
Please have a look at CommonOperations.m for examples of how to do some basic things in Octave. Now that we're done with the basics, have a look at the Octave programs that are included as examples. The example programs are available here and the support files needed to run these are available here. Those pages will allow you to examine individual files, out of context. To actually use these files (edit and run them), you should go to the home page of this document, since you will probably want to download the pdf version together with all the support files and examples. Or get the bootable CD.
There are some other resources for doing econometrics with Octave. You might like to check the article Econometrics with Octave.
Get the collection of support programs and the examples, from the document home page.
Put them somewhere, and tell Octave how to find them, e.g., by putting a link to
Make sure nedit is installed and configured to run Octave and use syntax highlighting. Copy the file NeditConfiguration and save it in your $HOME directory with the name .nedit. Not to put too fine a point on it, please note that there is a period in that name.
Associate *.m files with NEdit so that they open up in the editor when you click on them.
Chapter 23
Notation and Review
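All vectors will be column vectors, unless they have a transpose symbol (or I forget to apply this rule - your help catching typos and er0rors is much appreciated). For example, if $x_t$ is a $p \times 1$ vector, $x_t'$ is a $1 \times p$ vector.
Let $s(\cdot): \Re^p \rightarrow \Re$ be a real valued function of the $p$-vector $\theta$. Then $\frac{\partial s(\theta)}{\partial \theta}$ is organized as a $p$-vector,
$$\frac{\partial s(\theta)}{\partial \theta} = \begin{bmatrix} \frac{\partial s(\theta)}{\partial \theta_1} \\ \frac{\partial s(\theta)}{\partial \theta_2} \\ \vdots \\ \frac{\partial s(\theta)}{\partial \theta_p} \end{bmatrix}$$
Following this convention, $\frac{\partial s(\theta)}{\partial \theta'}$ is a $1 \times p$ vector, and $\frac{\partial^2 s(\theta)}{\partial \theta \partial \theta'}$ is a $p \times p$ matrix. Also,
$$\frac{\partial^2 s(\theta)}{\partial \theta \partial \theta'} = \frac{\partial}{\partial \theta} \left( \frac{\partial s(\theta)}{\partial \theta'} \right) = \frac{\partial}{\partial \theta'} \left( \frac{\partial s(\theta)}{\partial \theta} \right)$$
For $a$ and $x$ both $p$-vectors, $\frac{\partial a'x}{\partial x} = a$.
Let $f(\theta): \Re^p \rightarrow \Re^n$ be an $n$-vector valued function of the $p$-vector $\theta$. Let $f(\theta)'$ be the $1 \times n$ valued transpose of $f$. Then $\frac{\partial}{\partial \theta} f(\theta)'$ is a $p \times n$ matrix.
Product rule: Let $f(\theta): \Re^p \rightarrow \Re^n$ and $h(\theta): \Re^p \rightarrow \Re^n$ be $n$-vector valued functions of the $p$-vector $\theta$. Then
$$\frac{\partial}{\partial \theta'} h(\theta)' f(\theta) = h' \left( \frac{\partial}{\partial \theta'} f \right) + f' \left( \frac{\partial}{\partial \theta'} h \right)$$
has dimension $1 \times p$. Applying the transposition rule we get
$$\frac{\partial}{\partial \theta} h(\theta)' f(\theta) = \left( \frac{\partial}{\partial \theta} f' \right) h + \left( \frac{\partial}{\partial \theta} h' \right) f$$
which has dimension $p \times 1$.
For $A$ a $p \times p$ matrix and $x$ a $p$-vector, $\frac{\partial^2 x'Ax}{\partial x \partial x'} = A + A'$.
Chain rule: Let $f(\cdot): \Re^p \rightarrow \Re^n$ be an $n$-vector valued function of a $p$-vector argument, and let $g(\cdot): \Re^r \rightarrow \Re^p$ be a $p$-vector valued function of an $r$-vector valued argument $\rho$. Then
$$\frac{\partial}{\partial \rho'} f[g(\rho)] = \left. \frac{\partial}{\partial \theta'} f(\theta) \right|_{\theta = g(\rho)} \frac{\partial}{\partial \rho'} g(\rho)$$
has dimension $n \times r$.
For $x$ and $\beta$ both $p$-vectors, $\frac{\partial \exp(x'\beta)}{\partial \beta} = \exp(x'\beta) x$.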
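Rules of this kind can be checked numerically. The sketch below uses hand-written central finite differences so that it does not depend on any add-on package. The helper names fd_gradient and fd_hessian, the step sizes, and the test matrices are hypothetical. It checks that the gradient of $a'x$ is $a$ and that the matrix of second derivatives of $x'Ax$ is $A + A'$.

```octave
1;  % script marker so that functions can be defined in this file

function g = fd_gradient(f, x)
  % central finite-difference gradient of a scalar function f at column vector x
  p = rows(x); g = zeros(p, 1); h = 1e-6;
  for i = 1:p
    e = zeros(p, 1); e(i) = h;
    g(i) = (f(x + e) - f(x - e))/(2*h);
  end
end

function H = fd_hessian(f, x)
  % central finite-difference matrix of second derivatives of f at x
  p = rows(x); H = zeros(p, p); h = 1e-4;
  for i = 1:p
    for j = 1:p
      ei = zeros(p, 1); ei(i) = h;
      ej = zeros(p, 1); ej(j) = h;
      H(i, j) = (f(x + ei + ej) - f(x + ei - ej) ...
                 - f(x - ei + ej) + f(x - ei - ej))/(4*h^2);
    end
  end
end

a = [1; -2; 3];
A = [2 1 0; 0 3 1; 4 0 1];            % deliberately not symmetric
x = [0.5; 1; -1];
g = fd_gradient(@(z) a'*z, x);        % should be close to a
H = fd_hessian(@(z) z'*A*z, x);       % should be close to A + A'
```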
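23.2 Convergence modes
The stochastic modes are those which will be used later in the course.
Definition 36 A sequence is a mapping from the natural numbers $\{1, 2, ...\} = \{n\}_{n=1}^{\infty} = \{n\}$ to some other set, so that the set is ordered according to the natural numbers associated with its elements.
Definition 37 [Convergence] A real-valued sequence of vectors $\{a_n\}$ converges to the vector $a$ if for any $\varepsilon > 0$ there exists an integer $N_{\varepsilon}$ such that for all $n > N_{\varepsilon}$, $\| a_n - a \| < \varepsilon$.
Consider a sequence of functions $\{f_n(\omega)\}$ where $f_n: \Omega \rightarrow T \subseteq \Re$.
Definition 38 [Pointwise convergence] A sequence of functions $\{f_n(\omega)\}$ converges pointwise on $\Omega$ to the function $f(\omega)$ if for all $\varepsilon > 0$ and $\omega \in \Omega$ there exists an integer $N_{\varepsilon\omega}$ such that
$$|f_n(\omega) - f(\omega)| < \varepsilon, \forall n > N_{\varepsilon\omega}$$
It's important to note that $N_{\varepsilon\omega}$ depends upon $\omega$, so that convergence may be much more rapid for certain $\omega$ than for others. Uniform convergence requires a similar rate of convergence throughout $\Omega$.
Definition 39 [Uniform convergence] A sequence of functions $\{f_n(\omega)\}$ converges uniformly on $\Omega$ to the function $f(\omega)$ if for any $\varepsilon > 0$ there exists an integer $N$ such that
$$\sup_{\omega \in \Omega} |f_n(\omega) - f(\omega)| < \varepsilon, \forall n > N$$
(insert a diagram here showing the envelope around $f(\omega)$ in which $f_n(\omega)$ must lie)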
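The difference between the two modes can be seen numerically. The sketch below compares two sequences on the interval from 0 up to (but not including) 1: powers of $\omega$, which converge to zero at each point but not uniformly, and $\omega/n$, which converges to zero uniformly. The grid, the values of $n$, and the variable names are hypothetical choices, and a maximum over a finite grid only approximates the supremum.

```octave
% sup over a grid of |f_n - f| for two sequences of functions on [0, 1)
omega = linspace(0, 1 - 1e-9, 2001)';     % grid approximating [0, 1)
nvals = [1; 10; 100; 1000];
sup_power = zeros(rows(nvals), 1);
sup_scaled = zeros(rows(nvals), 1);
for i = 1:rows(nvals)
  n = nvals(i);
  sup_power(i) = max(abs(omega.^n));      % f_n = omega^n, pointwise limit 0
  sup_scaled(i) = max(abs(omega/n));      % f_n = omega/n, uniform limit 0
end
```

The entries of `sup_power` stay near one however large n is, while those of `sup_scaled` shrink like 1/n.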
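Given a probability space $(\Omega, F, P)$, recall that a random variable maps the sample space to the real line, i.e., $X(\omega): \Omega \rightarrow \Re$. A sequence of random variables $\{X_n(\omega)\}$ is a collection of such mappings, i.e., each $X_n(\omega)$ is a random variable with respect to the probability space $(\Omega, F, P)$. For example, given the model $Y = X\beta^0 + \varepsilon$, the OLS estimator $\hat{\beta}_n = (X'X)^{-1} X'Y$, where $n$ is the sample size, can be used to form a sequence of random vectors $\{\hat{\beta}_n\}$. A number of modes of convergence are in use when dealing with sequences of random variables. Several such modes of convergence should already be familiar: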
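Definition 40 [Convergence in probability] Let $X_n(\omega)$ be a sequence of random variables, and let $X(\omega)$ be a random variable. Let $A_n = \{\omega : |X_n(\omega) - X(\omega)| > \varepsilon\}$. Then $\{X_n(\omega)\}$ converges in probability to $X(\omega)$ if
$$\lim_{n \rightarrow \infty} P(A_n) = 0, \forall \varepsilon > 0$$
Convergence in probability is written as $X_n \stackrel{p}{\rightarrow} X$, or plim $X_n = X$.
Definition 41 [Almost sure convergence] Let $A = \{\omega : \lim_{n \rightarrow \infty} X_n(\omega) = X(\omega)\}$. Then $X_n(\omega)$ converges almost surely to $X(\omega)$ if $P(A) = 1$.
In other words, $X_n(\omega) \rightarrow X(\omega)$ (ordinary convergence of the two functions) except on a set $C = \Omega - A$ such that $P(C) = 0$. Almost sure convergence is written as $X_n \stackrel{a.s.}{\rightarrow} X$, or $X_n \rightarrow X$, a.s. One can show that
$$X_n \stackrel{a.s.}{\rightarrow} X \Rightarrow X_n \stackrel{p}{\rightarrow} X$$
Definition 42 [Convergence in distribution] Let the r.v. $X_n$ have distribution function $F_n$ and the r.v. $X$ have distribution function $F$. If $F_n \rightarrow F$ at every continuity point of $F$, then $X_n$ converges in distribution to $X$. Convergence in distribution is written as $X_n \stackrel{d}{\rightarrow} X$.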
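Convergence in probability can be illustrated by simulation. The sketch below estimates, for several sample sizes, the probability that the mean of uniform draws on the unit interval lies more than a fixed distance from its expectation of one half. The distribution, the tolerance, the number of replications, and the variable names are hypothetical choices.

```octave
% Monte Carlo estimate of P(|xbar_n - mu| > epsilon) as n grows
rand("seed", 42);
mu = 0.5; epsilon = 0.05; reps = 500;
nvals = [10; 100; 1000; 10000];
p_far = zeros(rows(nvals), 1);
for i = 1:rows(nvals)
  xbar = mean(rand(nvals(i), reps), 1);         % reps sample means of size n
  p_far(i) = mean(abs(xbar - mu) > epsilon);    % fraction outside the band
end
```

The estimated probabilities in `p_far` fall toward zero as the sample size increases, which is what the definition requires for every choice of the tolerance.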
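Simple laws of large numbers allow us to conclude directly that $\hat{\beta}_n \stackrel{a.s.}{\rightarrow} \beta^0$ in the OLS example, since
$$\hat{\beta}_n = \beta^0 + \left( \frac{X'X}{n} \right)^{-1} \frac{X'\varepsilon}{n}$$
and $\frac{X'\varepsilon}{n} \stackrel{a.s.}{\rightarrow} 0$ by a SLLN. Note that this term is not a function of the parameter $\beta$. This easy proof is a result of the linearity of the model, which allows us to express the estimator in a way that separates parameters from random functions. In general, this is not possible. We often deal with the more complicated situation where the stochastic sequence depends on parameters in a manner that is not reducible to a simple sequence of random variables. In this case, we have a sequence of random functions that depend on $\theta$: $\{X_n(\omega, \theta)\}$, where each $X_n(\omega, \theta)$ is a random variable with respect to a probability space $(\Omega, F, P)$ and the parameter $\theta$ belongs to a parameter space $\Theta$.
Definition 43 [Uniform almost sure convergence] $\{X_n(\omega, \theta)\}$ converges uniformly almost surely in $\Theta$ to $X(\omega, \theta)$ if
$$\lim_{n \rightarrow \infty} \sup_{\theta \in \Theta} |X_n(\omega, \theta) - X(\omega, \theta)| = 0 \text{ (a.s.)}$$
Implicit is the assumption that all $X_n(\omega, \theta)$ and $X(\omega, \theta)$ are random variables with respect to $(\Omega, F, P)$ for all $\theta \in \Theta$. We'll indicate uniform almost sure convergence by $\stackrel{u.a.s.}{\rightarrow}$ and uniform convergence in probability by $\stackrel{u.p.}{\rightarrow}$.
An equivalent definition, based on the fact that "almost sure" means "with probability one", is
$$\Pr \left( \lim_{n \rightarrow \infty} \sup_{\theta \in \Theta} |X_n(\omega, \theta) - X(\omega, \theta)| = 0 \right) = 1$$
This has a form similar to that of the definition of a.s. convergence - the essential difference is the addition of the $\sup$.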
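Definition 44 [Little-o] Let $f(n)$ and $g(n)$ be two real-valued functions. The notation $f(n) = o(g(n))$ means $\lim_{n \rightarrow \infty} \frac{f(n)}{g(n)} = 0$.
Definition 45 [Big-O] Let $f(n)$ and $g(n)$ be two real-valued functions. The notation $f(n) = O(g(n))$ means there exists some $N$ such that for $n > N$, $\left| \frac{f(n)}{g(n)} \right| < K$, where $K$ is a finite constant. This definition doesn't require that $\frac{f(n)}{g(n)}$ have a limit (it may fluctuate boundedly).
If $\{f_n\}$ and $\{g_n\}$ are sequences of random variables, analogous definitions are:
Definition 46 The notation $f(n) = o_p(g(n))$ means $\frac{f(n)}{g(n)} \stackrel{p}{\rightarrow} 0$.
Example 47 The least squares estimator $\hat{\beta} = (X'X)^{-1} X'Y = (X'X)^{-1} X' \left( X\beta^0 + \varepsilon \right) = \beta^0 + (X'X)^{-1} X'\varepsilon$. Since plim $(X'X)^{-1} X'\varepsilon = 0$, we can write $(X'X)^{-1} X'\varepsilon = o_p(1)$ and $\hat{\beta} = \beta^0 + o_p(1)$.
Definition 48 The notation $f(n) = O_p(g(n))$ means there exists some $N_{\varepsilon}$ such that for $\varepsilon > 0$ and all $n > N_{\varepsilon}$,
$$P \left( \left| \frac{f(n)}{g(n)} \right| < K_{\varepsilon} \right) > 1 - \varepsilon$$
where $K_{\varepsilon}$ is a finite constant.
Example 50 Consider a random sample of iid r.v.'s with mean 0 and variance $\sigma^2$. The estimator of the mean $\hat{\theta} = 1/n \sum_{i=1}^{n} x_i$ is asymptotically normally distributed, e.g., $n^{1/2} \hat{\theta} \stackrel{A}{\sim} N(0, \sigma^2)$. So $n^{1/2} \hat{\theta} = O_p(1)$, so $\hat{\theta} = O_p(n^{-1/2})$. Before we had $\hat{\theta} = o_p(1)$, now we have the stronger result that relates the rate of convergence to the sample size.
Example 51 Now consider a random sample of iid r.v.'s with mean $\mu$ and variance $\sigma^2$. The estimator of the mean $\hat{\theta} = 1/n \sum_{i=1}^{n} x_i$ is asymptotically normally distributed, e.g., $n^{1/2} (\hat{\theta} - \mu) \stackrel{A}{\sim} N(0, \sigma^2)$. So $n^{1/2} (\hat{\theta} - \mu) = O_p(1)$, so $\hat{\theta} - \mu = O_p(n^{-1/2})$, so $\hat{\theta} = O_p(1)$.
These two examples show that averages of centered (mean zero) quantities typically have plim 0, while averages of uncentered quantities have finite nonzero plims. Note that the definition of $O_p$ does not mean that $f(n)$ and $g(n)$ are of the same order. Asymptotic equality ensures that this is the case.
Definition 52 Two sequences of random variables $\{f_n\}$ and $\{g_n\}$ are asymptotically equal (written $f_n \stackrel{a}{=} g_n$) if
$$\text{plim} \left( \frac{f(n)}{g(n)} \right) = 1$$
Finally, analogous almost sure versions of $o_p$ and $O_p$ are defined in the obvious way.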
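The rates in Example 50 can be illustrated by simulation. The sketch below draws mean-zero normal samples of two sizes and compares the dispersion of the sample mean with that of the sample mean scaled by the square root of the sample size. The normal distribution, the sample sizes, the number of replications, and the variable names are hypothetical choices.

```octave
% The sample mean shrinks with n, while sqrt(n) times the mean stays bounded
randn("seed", 5);
sigma = 1; reps = 2000;
nvals = [100; 2500];
sd_mean = zeros(2, 1);
sd_scaled = zeros(2, 1);
for i = 1:2
  n = nvals(i);
  xbar = mean(sigma*randn(n, reps), 1);   % reps sample means of size n
  sd_mean(i) = std(xbar);                 % falls roughly like n^(-1/2)
  sd_scaled(i) = std(sqrt(n)*xbar);       % stays near sigma
end
```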
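1. For $a$ and $x$ both $p \times 1$ vectors, show that $D_x a'x = a$.
2. For $A$ a $p \times p$ matrix and $x$ a $p \times 1$ vector, show that $D_x^2 x'Ax = A + A'$.
3. For $x$ and $\beta$ both $p \times 1$ vectors, show that $D_{\beta} \exp(x'\beta) = \exp(x'\beta) x$.
4. For $x$ and $\beta$ both $p \times 1$ vectors, find the analytic expression for $D_{\beta}^2 \exp(x'\beta)$.
5. Write an Octave program that verifies each of the previous results by taking numeric derivatives. For a hint, type help numgradient and help numhessian inside octave.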
Chapter 24
Licenses
This document and the associated examples and materials are copyright Michael Creel, under the terms of the GNU General Public License, ver. 2., or at your option, under the Creative Commons Attribution-Share Alike License, Version 2.5. The licenses follow.
24.1. THE GPL
4. You may not copy, modify, sublicense, or distribute the Program
except as expressly provided under this License. Any attempt
otherwise to copy, modify, sublicense or distribute the Program is
void, and will automatically terminate your rights under this License.
However, parties who have received copies, or rights, from you under
this License will not have their licenses terminated so long as such
parties remain in full compliance.
5. You are not required to accept this License, since you have not
signed it. However, nothing else grants you permission to modify or
distribute the Program or its derivative works. These actions are
prohibited by law if you do not accept this License. Therefore, by
modifying or distributing the Program (or any work based on the
Program), you indicate your acceptance of this License to do so, and
all its terms and conditions for copying, distributing or modifying
the Program or works based on it.
6. Each time you redistribute the Program (or any work based on the
Program), the recipient automatically receives a license from the
original licensor to copy, distribute or modify the Program subject to
these terms and conditions. You may not impose any further
restrictions on the recipients' exercise of the rights granted herein.
You are not responsible for enforcing compliance by third parties to
this License.
7. If, as a consequence of a court judgment or allegation of patent
infringement or for any other reason (not limited to patent issues),
conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot
distribute so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you
may not distribute the Program at all. For example, if a patent
license would not permit royalty-free redistribution of the Program by
all those who receive copies directly or indirectly through you, then
the only way you could satisfy both it and this License would be to
refrain entirely from distribution of the Program.
If any portion of this section is held invalid or unenforceable under
any particular circumstance, the balance of the section is intended to
apply and the section as a whole is intended to apply in other
circumstances.
It is not the purpose of this section to induce you to infringe any
patents or other property right claims or to contest validity of any
such claims; this section has the sole purpose of protecting the
integrity of the free software distribution system, which is
implemented by public license practices. Many people have made
generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that
system; it is up to the author/donor to decide if he or she is willing
to distribute software through any other system and a licensee cannot
impose that choice.
This section is intended to make thoroughly clear what is believed to
be a consequence of the rest of this License.
YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.
END OF TERMS AND CONDITIONS
Also add information on how to contact you by electronic and paper mail.
If the program is interactive, make it output a short notice like this
when it starts in an interactive mode:
24.2. CREATIVE COMMONS
composition or sound recording, the synchronization of the Work in timed-relation with a moving image ("synching") will be considered a Derivative Work for the purpose of this License.
3. "Licensor" means the individual or entity that offers the Work under the terms of this License.
4. "Original Author" means the individual or entity who created the Work.
5. "Work" means the copyrightable work of authorship offered under the terms of this License.
6. "You" means an individual or entity exercising rights under this License who has not previously violated the terms of this License with respect to the Work, or who has received express permission from the Licensor to exercise rights under this License despite a previous violation.
7. "License Elements" means the following high-level license attributes as selected by Licensor and indicated in the title of this License: Attribution, ShareAlike.
2. Fair Use Rights. Nothing in this license is intended to reduce, limit, or restrict any rights arising from fair use, first sale or other limitations on the exclusive rights of the copyright owner under copyright law or other applicable laws.
3. License Grant. Subject to the terms and conditions of this License, Licensor hereby grants You a worldwide, royalty-free, non-exclusive, perpetual (for the duration of the applicable copyright) license to exercise the rights in the Work as stated below:
1. to reproduce the Work, to incorporate the Work into one or more Collective Works, and to reproduce the Work as incorporated in the Collective Works;
2. to create and reproduce Derivative Works;
3. to distribute copies or phonorecords of, display publicly, perform publicly, and perform publicly by means of a digital audio transmission the Work including as incorporated in Collective Works;
4. to distribute copies or phonorecords of, display publicly, perform publicly, and
1. Performance Royalties Under Blanket Licenses. Licensor waives the exclusive right to collect, whether individually or via a performance rights society (e.g. ASCAP, BMI, SESAC), royalties for the public performance or public digital performance (e.g. webcast) of the Work.
2. Mechanical Rights and Statutory Royalties. Licensor waives the exclusive right to collect, whether individually or via a music rights society or designated agent (e.g. Harry Fox Agency), royalties for any phonorecord You create from the Work ("cover version") and distribute, subject to the compulsory license created by 17 USC Section 115 of the US Copyright Act (or the equivalent in other jurisdictions).
6. Webcasting Rights and Statutory Royalties. For the avoidance of doubt, where the Work is a sound recording, Licensor waives the exclusive right to collect, whether individually or via a performance-rights society (e.g. SoundExchange), royalties for the public digital performance (e.g. webcast) of the Work, subject to the compulsory license created by 17 USC Section 114 of the US Copyright Act (or the equivalent in other jurisdictions).
The above rights may be exercised in all media and formats whether now known or hereafter devised. The above rights include the right to make such modifications as are technically necessary to exercise the rights in other media and formats.
Work. You must keep intact all notices that refer to this License and to the disclaimer of warranties. You may not distribute, publicly display, publicly perform, or publicly digitally perform the Work with any technological measures that control access or use of the Work in a manner inconsistent with the terms of this License Agreement. The above applies to the Work as incorporated in a Collective Work, but this does not require the Collective Work apart from the Work itself to be made subject to the terms of this License. If You create a Collective Work, upon notice from any Licensor You must, to the extent practicable, remove from the Collective Work any credit as required by clause 4(c), as requested. If You create a Derivative Work, upon notice from any Licensor You must, to the extent practicable, remove from the Derivative Work any credit as required by clause 4(c), as requested.
2. You may distribute, publicly display, publicly perform, or publicly digitally perform a Derivative Work only under the terms of this License, a later version of this License with the same License Elements as this License, or a Creative Commons iCommons license that contains the same License Elements as this License (e.g. Attribution-ShareAlike 2.5 Japan). You must include a copy of, or the Uniform Resource Identifier for, this License or other license specified in the previous sentence with every copy or phonorecord of each Derivative Work You distribute, publicly display, publicly perform, or publicly digitally perform. You may not offer or impose any terms on the Derivative Works that alter or restrict the terms of this License or the recipients' exercise of the rights granted hereunder, and You must keep intact all notices that refer to this License and to the disclaimer of warranties. You may not distribute, publicly display, publicly perform, or publicly digitally perform the Derivative Work with any technological measures that control access or use of the Work in a manner inconsistent with the terms of this License Agreement. The above applies to the Derivative Work as incorporated in a Collective Work, but this does not require the Collective Work apart from the Derivative Work itself to be made subject to the terms of this License.
3. If you distribute, publicly display, publicly perform, or publicly digitally perform the Work or any Derivative Works or Collective Works, You must keep intact all copyright notices for the Work and provide, reasonable to the medium or means You are utilizing: (i) the name of the Original Author (or pseudonym, if applicable) if supplied, and/or (ii) if the Original Author and/or Licensor designate another party or parties (e.g. a sponsor institute, publishing entity, journal) for attribution in Licensor's copyright notice, terms of service or by other reasonable means, the name of such party or parties; the title of the Work if supplied; to the extent reasonably practicable, the Uniform Resource Identifier, if any, that Licensor specifies to be associated with the Work, unless such URI does not refer to the copyright notice or licensing information for the Work; and in the case of a Derivative Work, a credit identifying the use of the Work in the Derivative Work (e.g., "French translation of the Work by Original Author," or "Screenplay based on original Work by Original Author"). Such credit may be implemented in any reasonable manner; provided, however, that in the case of a Derivative Work or Collective Work, at a minimum such credit will appear where any other comparable authorship credit appears and in a manner at least as prominent as such other comparable authorship credit.
5. Representations, Warranties and Disclaimer
UNLESS OTHERWISE AGREED TO BY THE PARTIES IN WRITING, LICENSOR OFFERS THE WORK AS-IS AND MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE MATERIALS, EXPRESS, IMPLIED,
STATUTORY OR OTHERWISE, INCLUDING, WITHOUT LIMITATION, WARRANTIES
OF TITLE, MERCHANTIBILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, ACCURACY, OR THE PRESENCE OF ABSENCE OF ERRORS, WHETHER OR NOT DISCOVERABLE. SOME JURISDICTIONS DO NOT ALLOW THE EXCLUSION OF IMPLIED WARRANTIES, SO SUCH EXCLUSION MAY NOT APPLY TO YOU.
6. Limitation on Liability. EXCEPT TO THE EXTENT REQUIRED BY APPLICABLE LAW, IN NO EVENT WILL LICENSOR BE LIABLE TO YOU ON ANY LEGAL
THEORY FOR ANY SPECIAL, INCIDENTAL, CONSEQUENTIAL, PUNITIVE OR
EXEMPLARY DAMAGES ARISING OUT OF THIS LICENSE OR THE USE OF THE
WORK, EVEN IF LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH
DAMAGES.
7. Termination
1. This License and the rights granted hereunder will terminate automatically upon any breach by You of the terms of this License. Individuals or entities who have received Derivative Works or Collective Works from You under this License, however, will not have their licenses terminated provided such individuals or entities remain in full compliance with those licenses. Sections 1, 2, 5, 6, 7, and 8 will survive any termination of this License.
2. Subject to the above terms and conditions, the license granted here is perpetual (for the duration of the applicable copyright in the Work). Notwithstanding the above, Licensor reserves the right to release the Work under different license terms or to stop distributing the Work at any time; provided, however that any such election will not serve to withdraw this License (or any other license that has been, or is required to be, granted under the terms of this License), and this License will continue in full force and effect unless terminated as stated above.
8. Miscellaneous
1. Each time You distribute or publicly digitally perform the Work or a Collective Work, the Licensor offers to the recipient a license to the Work on the same terms and conditions as the license granted to You under this License.
2. Each time You distribute or publicly digitally perform a Derivative Work, Licensor offers to the recipient a license to the original Work on the same terms and conditions as the license granted to You under this License.
3. If any provision of this License is invalid or unenforceable under applicable law, it shall not affect the validity or enforceability of the remainder of the terms of this License, and without further action by the parties to this agreement, such provision shall be reformed to the minimum extent necessary to make such provision valid and enforceable.
4. No term or provision of this License shall be deemed waived and no breach consented to unless such waiver or consent shall be in writing and signed by the party to be charged with such waiver or consent.
5. This License constitutes the entire agreement between the parties with respect to the Work licensed here. There are no understandings, agreements or representations with respect to the Work not specified here. Licensor shall not be bound by any additional provisions that may appear in any communication from You. This License may not be modified without the mutual written agreement of the Licensor and You.
Creative Commons is not a party to this License, and makes no warranty whatsoever in connection with the Work. Creative Commons will not be liable to You or any party on any legal theory for any damages whatsoever, including without limitation any general, special, incidental or consequential damages arising in connection to this license. Notwithstanding the foregoing two (2) sentences, if Creative Commons has expressly identified itself as the Licensor hereunder, it shall have all rights and obligations of Licensor.
Except for the limited purpose of indicating to the public that the Work is licensed under the CCPL, neither party will use the trademark "Creative Commons" or any related trademark or logo of Creative Commons without the prior written consent of Creative Commons. Any permitted use will be in compliance with Creative Commons' then-current trademark usage guidelines, as may be published on its website or otherwise made available upon request from time to time.
Creative Commons may be contacted at http://creativecommons.org/.
Chapter 25
The attic
This holds material that is not really ready to be incorporated into the main body, but that I don't want to lose. Basically, ignore it, unless you'd like to help get it ready for inclusion.
\[ \bar{f}(y = j) = \sum_{i=1}^{n} 1(y_i = j)/n \]
\[ \hat{f}(y = j) = \sum_{i=1}^{n} f_Y(j \mid x_i, \hat{\theta})/n \]

Table 25.1: Actual and Poisson fitted frequencies

         OBDV                ERV
Count    Actual    Fitted    Actual    Fitted
0        0.32      0.06      0.86      0.83
1        0.18      0.15      0.10      0.14
2        0.11      0.19      0.02      0.02
3        0.10      0.18      0.004     0.002
4        0.052     0.15      0.002     0.0002
5        0.032     0.10                2.4e-5

We see that for the OBDV measure, there are many more actual zeros than fitted.
For ERV, there are somewhat more actual zeros than fitted, but the
visits are needed. This is a principal/agent type situation, where the total number of visits depends upon the decision of both the patient and the doctor. Since different parameters may govern the two decision-makers' choices, we might expect that different parameters govern the probability of zeros versus the other counts. Let $\lambda_p$ be the parameter of the patient's demand for visits, and let $\lambda_d$ be the parameter of the doctor's demand for visits. The patient will initiate visits according to a discrete choice model, for example, a logit model:
\[ \Pr(Y > 0) = \frac{1}{1 + \exp(-\lambda_p)}, \qquad \Pr(Y = 0) = 1 - \Pr(Y > 0). \]
The above probabilities are used to estimate the binary 0/1 hurdle process. Then, for the observations where visits are positive, a truncated Poisson density is estimated. This density is
\[ f_Y(y, \lambda_d \mid y > 0) = \frac{f_Y(y, \lambda_d)}{\Pr(y > 0)} = \frac{f_Y(y, \lambda_d)}{1 - \exp(-\lambda_d)} \]
since, under the Poisson density,
\[ \Pr(y = 0) = \frac{\exp(-\lambda_d)\lambda_d^{0}}{0!}. \]
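As an illustration, the two pieces of the hurdle Poisson density can be sketched in a few lines of Python. This is only a minimal sketch: the function names are hypothetical, and `lam_p` and `lam_d` stand for the patient's logit index and the doctor's Poisson parameter, treated here as plain numbers rather than functions of regressors.

```python
import math

def prob_positive(lam_p):
    """Logit hurdle: Pr(Y > 0) = 1 / (1 + exp(-lam_p))."""
    return 1.0 / (1.0 + math.exp(-lam_p))

def truncated_poisson_pmf(y, lam_d):
    """Poisson density conditional on y > 0."""
    if y < 1:
        return 0.0
    poisson = math.exp(-lam_d) * lam_d ** y / math.factorial(y)
    return poisson / (1.0 - math.exp(-lam_d))

def hurdle_pmf(y, lam_p, lam_d):
    """Overall density: the hurdle decides zero vs. positive,
    the truncated Poisson spreads mass over the positive counts."""
    p_pos = prob_positive(lam_p)
    if y == 0:
        return 1.0 - p_pos
    return p_pos * truncated_poisson_pmf(y, lam_d)

def hurdle_mean(lam_p, lam_d):
    """E(Y) = Pr(Y > 0) * E(Y | Y > 0)."""
    return prob_positive(lam_p) * lam_d / (1.0 - math.exp(-lam_d))
```

Because the zero/positive split depends only on `lam_p` and the positive counts only on `lam_d`, the log-likelihood separates into two sums that can be maximized independently.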
Since the hurdle and truncated components of the overall density for $Y$ share no parameters, they may be estimated separately, which is computationally more efficient than estimating the overall model. (Recall that the BFGS algorithm, for example, will have to invert the approximated Hessian. The computational overhead is of order $K^2$, where $K$ is the number of parameters to be estimated.) The expectation of $Y$ is
\[ E(Y) = \Pr(Y > 0)\, E(Y \mid Y > 0) = \frac{1}{1 + \exp(-\lambda_p)} \cdot \frac{\lambda_d}{1 - \exp(-\lambda_d)} \]
25.1.
HURDLE MODELS
Here are hurdle Poisson estimation results for OBDV, obtained from this estimation program
**************************************************************************
MEPS data, OBDV
logit results
Strong convergence
Observations = 500
Function value      -0.58939
t-Stats
              params      t(OPG)      t(Sand.)    t(Hess)
constant      -1.5502     -2.5709     -2.5269     -2.5560
pub_ins       1.0519      3.0520      3.0027      3.0384
priv_ins      0.45867     1.7289      1.6924      1.7166
sex           0.63570     3.0873      3.1677      3.1366
age           0.018614    2.1547      2.1969      2.1807
edu           0.039606    1.0467      0.98710     1.0222
inc           0.077446    1.7655      2.1672      1.9601
Information Criteria
Consistent Akaike   639.89
Schwartz            632.89
Hannan-Quinn        614.96
Akaike              603.39
**************************************************************************
**************************************************************************
MEPS data, OBDV
tpoisson results
Strong convergence
Observations = 500
Function value      -2.7042
t-Stats
              params       t(OPG)      t(Sand.)    t(Hess)
constant      0.54254      7.4291      1.1747      3.2323
pub_ins       0.31001      6.5708      1.7573      3.7183
priv_ins      0.014382     0.29433     0.10438     0.18112
sex           0.19075      10.293      1.1890      3.6942
age           0.016683     16.148      3.5262      7.9814
edu           0.016286     4.2144      0.56547     1.6353
inc           -0.0079016   -2.3186     -0.35309    -0.96078
Information Criteria
Consistent Akaike   2754.7
Schwartz            2747.7
Hannan-Quinn        2729.8
Akaike              2718.2
**************************************************************************
Fitted and actual probabilities (NB-II fits are provided as well) are:

         OBDV                                 ERV
Count    Actual   Fitted HP   Fitted NB-II    Actual   Fitted HP   Fitted NB-II
0        0.32     0.32        0.34            0.86     0.86        0.86
1        0.18     0.035       0.16            0.10     0.10        0.10
2        0.11     0.071       0.11            0.02     0.02        0.02
3        0.10     0.10        0.08            0.004    0.006       0.006
4        0.052    0.11        0.06            0.002    0.002       0.002
5        0.032    0.10        0.05                     0.0005      0.001

For the OBDV hurdle Poisson fit, zeros are exact, but 1's and 2's are underestimated, and higher counts are overestimated. For the NB-II fits, performance is at least as good as the hurdle Poisson model, and one should recall that many fewer parameters are used. Hurdle versions of the negative binomial model are also widely used.
**************************************************************************
MEPS data, OBDV
mixnegbin results
Strong convergence
Observations = 500
Function value      -2.2312
t-Stats
                params       t(OPG)      t(Sand.)    t(Hess)
constant        0.64852      1.3851      1.3226      1.4358
pub_ins         -0.062139    -0.23188    -0.13802    -0.18729
priv_ins        0.093396     0.46948     0.33046     0.40854
sex             0.39785      2.6121      2.2148      2.4882
age             0.015969     2.5173      2.5475      2.7151
edu             -0.049175    -1.8013     -1.7061     -1.8036
inc             0.015880     0.58386     0.76782     0.73281
ln_alpha        0.69961      2.3456      2.0396      2.4029
constant        -3.6130      -1.6126     -1.7365     -1.8411
pub_ins         2.3456       1.7527      3.7677      2.6519
priv_ins        0.77431      0.73854     1.1366      0.97338
sex             0.34886      0.80035     0.74016     0.81892
age             0.021425     1.1354      1.3032      1.3387
edu             0.22461      2.0922      1.7826      2.1470
inc             0.019227     0.20453     0.40854     0.36313
ln_alpha        2.8419       6.2497      6.8702      7.6182
logit_inv_mix   0.85186      1.7096      1.4827      1.7883
Information Criteria
Consistent Akaike   2353.8
Schwartz            2336.8
Hannan-Quinn        2293.3
Akaike              2265.2
**************************************************************************
Delta method for mix parameter st. err.
mix         se_mix
0.70096     0.12043
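The delta method step reported above can be sketched as follows. This is a reconstruction under stated assumptions, not the author's program: it assumes that `mix` is the logistic transform of the estimated `logit_inv_mix` parameter and that the standard error of `logit_inv_mix` is backed out from the sandwich t-statistic. Under those assumptions the reported `mix` and `se_mix` are reproduced to the displayed precision.

```python
import math

def logistic(z):
    """Map an unrestricted parameter into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def delta_method_mix(logit_inv_mix, t_stat):
    """Return (mix, se_mix) for mix = logistic(logit_inv_mix).

    The standard error of the untransformed parameter is recovered as
    estimate / t-statistic (assumption: t-stat tests a zero null).
    The delta method multiplies it by |d mix / d z| = mix * (1 - mix).
    """
    se_z = abs(logit_inv_mix / t_stat)
    mix = logistic(logit_inv_mix)
    se_mix = se_z * mix * (1.0 - mix)
    return mix, se_mix

# Values taken from the unconstrained mixture output: estimate and t(Sand.)
mix, se_mix = delta_method_mix(0.85186, 1.4827)
ci_95 = (mix - 1.96 * se_mix, mix + 1.96 * se_mix)
```

The interval `ci_95` is the usual normal-approximation interval; since `mix` is bounded by 1, an interval built on the logit scale and then transformed would respect that bound automatically.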
The 95% confidence interval for the mix parameter is perilously close to 1, which suggests that there may really be only one component density, rather than a mixture.
Again, this is not
relatively few visits, education seems to have a positive effect on visits. For the unhealthy group, education has a negative effect on visits. The other results are more mixed. A larger sample could help clarify things.
The following are results for a 2 component constrained mixture negative binomial model where all the slope parameters in $\lambda_j = e^{x\beta_j}$ are constrained to be the same across the two components.
**************************************************************************
MEPS data, OBDV
mixnegbin results
Strong convergence
Observations = 500
Function value      -2.2441
t-Stats
                params       t(OPG)      t(Sand.)    t(Hess)
constant        -0.34153     -0.94203    -0.91456    -0.97943
pub_ins         0.45320      2.6206      2.5088      2.7067
priv_ins        0.20663      1.4258      1.3105      1.3895
sex             0.37714      3.1948      3.4929      3.5319
age             0.015822     3.1212      3.7806      3.7042
edu             0.011784     0.65887     0.50362     0.58331
inc             0.014088     0.69088     0.96831     0.83408
ln_alpha        1.1798       4.6140      7.2462      6.4293
const_2         1.2621       0.47525     2.5219      1.5060
lnalpha_2       2.7769       1.5539      6.4918      4.2243
logit_inv_mix   2.4888       0.60073     3.7224      1.9693
Information Criteria
Consistent Akaike   2323.5
Schwartz            2312.5
Hannan-Quinn        2284.3
Akaike              2266.1
**************************************************************************
Delta method for mix parameter st. err.
mix         se_mix
0.92335     0.047318
The slope parameter estimates are pretty close to what we got with the NB-I model.
This is very
25.2.
Up to now we've considered the behavior of the dependent variable $y_t$ as a function of other variables $x_t$. Pure time series methods consider the behavior of $y_t$ as a function only of its own lagged values, unconditional on other observable variables. One can think of this as modeling the behavior of $y_t$ after marginalizing out all other variables. While it's not immediately clear why a model that has other explanatory variables should marginalize to a linear in the parameters time series model, most time series work is done with linear models, though nonlinear time series is also a large and growing field. We'll stick with linear time series models.
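Definition 53 (Stochastic process) A stochastic process is a sequence of random variables, indexed by time:
\[ \{Y_t\}_{t=-\infty}^{\infty} \]  (25.1)
Definition 54 (Time series) A time series is one observation of a stochastic process, over a specific interval:
\[ \{y_t\}_{t=1}^{n} \]  (25.2)
So a time series is a sample of size $n$ from a stochastic process. Keep in mind that conceptually, one could draw another sample, and that the values would be different.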
Definition 55 (Autocovariance) The $j$th autocovariance of a stochastic process is
\[ \gamma_{jt} = E(y_t - \mu_t)(y_{t-j} - \mu_{t-j}) \]  (25.3)
where $\mu_t = E(y_t)$.
Definition 56 (Covariance (weak) stationarity) A stochastic process is covariance stationary if it has time constant mean and autocovariances of all orders:
\[ \mu_t = \mu, \ \forall t \]
\[ \gamma_{jt} = \gamma_j, \ \forall t \]
As we've seen, this implies that $\gamma_j = \gamma_{-j}$.
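Definition 57 (Strong stationarity) A stochastic process is strongly stationary if the joint distribution of an arbitrary collection of the $\{Y_t\}$ doesn't depend on $t$. Since the moments are determined by the distribution, strong stationarity implies weak stationarity (provided the second moments exist).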
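What is the mean of $Y_t$? The time series is one sample from the stochastic process. One could think of $M$ repeated samples from the stochastic process, e.g., $\{y_{tm}\}$. By a LLN, we would expect that
\[ \lim_{M \to \infty} \frac{1}{M} \sum_{m=1}^{M} y_{tm} \overset{p}{\to} E(Y_t) \]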
The problem is, we have only one sample to work with, since we can't go back in time and collect another. How can $E(Y_t)$ be estimated then? It turns out that ergodicity is the needed property.
Definition 58 (Ergodicity) A stationary stochastic process is ergodic (for the mean) if the time average converges to the mean:
\[ \frac{1}{n} \sum_{t=1}^{n} y_t \overset{p}{\to} \mu \]  (25.4)
A sufficient condition for ergodicity is that the autocovariances be absolutely summable:
\[ \sum_{j=0}^{\infty} |\gamma_j| < \infty \]
This implies that the autocovariances die off, so that the $y_t$ are not so strongly dependent that a LLN fails to apply.
Definition 59 (Autocorrelation) The $j$th autocorrelation, $\rho_j$, is just the $j$th autocovariance divided by the variance:
\[ \rho_j = \frac{\gamma_j}{\gamma_0} \]  (25.5)
Definition 60 (White noise) White noise is just the time series literature term for a classical error. $\epsilon_t$ is white noise if i) $E(\epsilon_t) = 0, \forall t$, ii) $V(\epsilon_t) = \sigma^2, \forall t$, and iii) $\epsilon_t$ and $\epsilon_s$ are independent, $t \neq s$. Gaussian white noise just adds a normality assumption.
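A $q$th order moving average (MA) process is
\[ y_t = \mu + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \cdots + \theta_q \epsilon_{t-q} \]
where $\epsilon_t$ is white noise. The variance is
\[ \gamma_0 = E(y_t - \mu)^2 = E(\epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \cdots + \theta_q \epsilon_{t-q})^2 = \sigma^2 \left(1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2\right) \]
Similarly, the autocovariances are
\[ \gamma_j = \sigma^2 \left(\theta_j + \theta_{j+1}\theta_1 + \theta_{j+2}\theta_2 + \cdots + \theta_q \theta_{q-j}\right), \quad 1 \le j \le q \]
\[ \gamma_j = 0, \quad j > q \]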
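The MA($q$) moment formulas are easy to check numerically. The following is a minimal sketch with a hypothetical function name; it treats the coefficient on $\epsilon_t$ as $\theta_0 = 1$, so that every autocovariance is $\sigma^2$ times a sum of products of coefficients that are $j$ lags apart.

```python
def ma_autocovariances(theta, sigma2, max_lag):
    """Autocovariances gamma_0..gamma_max_lag of an MA(q) process
    y_t = mu + eps_t + theta_1 eps_{t-1} + ... + theta_q eps_{t-q},
    where eps_t is white noise with variance sigma2."""
    coef = [1.0] + list(theta)          # theta_0 = 1 multiplies eps_t
    q = len(theta)
    gammas = []
    for j in range(max_lag + 1):
        if j > q:
            gammas.append(0.0)          # no shared shocks beyond lag q
        else:
            overlap = sum(coef[k + j] * coef[k] for k in range(q - j + 1))
            gammas.append(sigma2 * overlap)
    return gammas
```

For an MA(1) with $\theta_1 = 0.5$ and $\sigma^2 = 2$, the function returns the variance $2(1 + 0.25)$, a first autocovariance of $2 \times 0.5$, and zeros afterwards, which is the cut-off at lag $q$ stated in the text.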
Therefore an MA($q$) process is necessarily covariance stationary and ergodic, as long as $\sigma^2$ and all of the $\theta_j$ are finite.
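An AR($p$) process $y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \epsilon_t$ can be studied by writing this $p$th order difference equation as a vector first order difference equation:
\[
\begin{bmatrix} y_t \\ y_{t-1} \\ \vdots \\ y_{t-p+1} \end{bmatrix}
=
\begin{bmatrix} c \\ 0 \\ \vdots \\ 0 \end{bmatrix}
+
\begin{bmatrix}
\phi_1 & \phi_2 & \cdots & \phi_{p-1} & \phi_p \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \cdots & 1 & 0
\end{bmatrix}
\begin{bmatrix} y_{t-1} \\ y_{t-2} \\ \vdots \\ y_{t-p} \end{bmatrix}
+
\begin{bmatrix} \epsilon_t \\ 0 \\ \vdots \\ 0 \end{bmatrix}
\]
or
\[ Y_t = C + F Y_{t-1} + E_t \]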
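With this, we can recursively work forward in time:
\[
\begin{aligned}
Y_{t+1} &= C + F Y_t + E_{t+1} \\
&= C + F(C + F Y_{t-1} + E_t) + E_{t+1} \\
&= C + FC + F^2 Y_{t-1} + F E_t + E_{t+1}
\end{aligned}
\]
and
\[
\begin{aligned}
Y_{t+2} &= C + F Y_{t+1} + E_{t+2} \\
&= C + F(C + FC + F^2 Y_{t-1} + F E_t + E_{t+1}) + E_{t+2} \\
&= C + FC + F^2 C + F^3 Y_{t-1} + F^2 E_t + F E_{t+1} + E_{t+2}
\end{aligned}
\]
or in general
\[ Y_{t+j} = C + FC + \cdots + F^j C + F^{j+1} Y_{t-1} + F^j E_t + F^{j-1} E_{t+1} + \cdots + F E_{t+j-1} + E_{t+j} \]
Consider the impact of a shock in period $t$ on $y_{t+j}$. This is simply
\[ \left. \frac{\partial Y_{t+j}}{\partial E_t'} \right|_{(1,1)} = F^j_{(1,1)} \]
If the system is to be stationary, then as we move forward in time this impact must die off. Otherwise a shock causes a permanent change in the mean of $y_t$. Therefore, stationarity requires that
\[ \lim_{j \to \infty} F^j_{(1,1)} = 0 \]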
Consider the eigenvalues of the matrix $F$. These are the $\lambda$ such that
\[ |F - \lambda I_P| = 0 \]
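The determinant here can be expressed as a polynomial. For example, for $p = 1$, the matrix $F$ is simply
\[ F = \phi_1 \]
so
\[ |\phi_1 - \lambda| = 0 \]
can be written as
\[ \phi_1 - \lambda = 0 \]
When $p = 2$, the matrix $F$ is
\[ F = \begin{bmatrix} \phi_1 & \phi_2 \\ 1 & 0 \end{bmatrix} \]
so
\[ F - \lambda I_P = \begin{bmatrix} \phi_1 - \lambda & \phi_2 \\ 1 & -\lambda \end{bmatrix} \]
and
\[ |F - \lambda I_P| = \lambda^2 - \lambda \phi_1 - \phi_2 \]
So the eigenvalues are the roots of the polynomial
\[ \lambda^2 - \lambda \phi_1 - \phi_2 \]
which can be found using the quadratic equation. This generalizes. For a $p$th order AR process, the eigenvalues are the roots of
\[ \lambda^p - \lambda^{p-1} \phi_1 - \lambda^{p-2} \phi_2 - \cdots - \lambda \phi_{p-1} - \phi_p = 0 \]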
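Supposing that all of the roots of this polynomial are distinct, then the matrix $F$ can be factored as
\[ F = T \Lambda T^{-1} \]
where $T$ is the matrix which has as its columns the eigenvectors of $F$, and $\Lambda$ is a diagonal matrix with the eigenvalues on the main diagonal. Using this decomposition, we can write
\[ F^j = \left(T \Lambda T^{-1}\right)\left(T \Lambda T^{-1}\right) \cdots \left(T \Lambda T^{-1}\right) \]
where $T \Lambda T^{-1}$ is repeated $j$ times. This gives
\[ F^j = T \Lambda^j T^{-1} \]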
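and
\[
\Lambda^j = \begin{bmatrix}
\lambda_1^j & 0 & \cdots & 0 \\
0 & \lambda_2^j & & \vdots \\
\vdots & & \ddots & \\
0 & \cdots & & \lambda_p^j
\end{bmatrix}
\]
Supposing that the $\lambda_i$, $i = 1, 2, ..., p$ are all real-valued, it is clear that
\[ \lim_{j \to \infty} F^j_{(1,1)} = 0 \]
requires that
\[ |\lambda_i| < 1, \ i = 1, 2, ..., p \]
i.e., the eigenvalues must be less than one in absolute value.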
It may be the case that some eigenvalues are complex-valued. The previous result generalizes to the requirement that the eigenvalues be less than one in modulus, where the modulus of a complex number $a + bi$ is
\[ \mathrm{mod}(a + bi) = \sqrt{a^2 + b^2} \]
This leads to the famous statement that stationarity requires the roots of the determinantal polynomial to lie inside the complex unit circle.
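For the AR(2) case the eigenvalue condition can be checked directly with the quadratic formula applied to $\lambda^2 - \lambda\phi_1 - \phi_2 = 0$. The sketch below uses hypothetical function names and Python's `cmath` so that complex roots are handled the same way as real ones; `abs()` of a complex number is its modulus.

```python
import cmath

def ar2_eigenvalues(phi1, phi2):
    """Roots of lambda^2 - phi1*lambda - phi2 = 0, i.e. the eigenvalues
    of the companion matrix F = [[phi1, phi2], [1, 0]]."""
    disc = cmath.sqrt(phi1 ** 2 + 4.0 * phi2)
    return ((phi1 + disc) / 2.0, (phi1 - disc) / 2.0)

def is_stationary_ar2(phi1, phi2):
    """Stationarity: both eigenvalues strictly inside the unit circle."""
    return all(abs(lam) < 1.0 for lam in ar2_eigenvalues(phi1, phi2))
```

For example, `is_stationary_ar2(1.0, -0.5)` involves the complex conjugate pair $0.5 \pm 0.5i$, whose modulus is below one, while `is_stationary_ar2(0.5, 0.6)` has a real root above one and therefore fails the condition.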
Dynamic multipliers: $\partial y_{t+j} / \partial \epsilon_t = F^j_{(1,1)}$ is a dynamic multiplier or an impulse-response function. Complex eigenvalues lead to oscillatory behavior. Of course, when there are multiple eigenvalues the overall effect can be a mixture.
(pictures)
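The lag operator $L$ is defined by
\[ L y_t = y_{t-1} \]
The lag operator is defined to behave just as an algebraic quantity, e.g.,
\[ L^2 y_t = L(L y_t) = L y_{t-1} = y_{t-2} \]
A mean-zero AR($p$) process $y_t - \phi_1 y_{t-1} - \phi_2 y_{t-2} - \cdots - \phi_p y_{t-p} = \epsilon_t$ can be written as
\[ y_t (1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p) = \epsilon_t \]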
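Factor this polynomial as
\[ 1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p = (1 - \lambda_1 L)(1 - \lambda_2 L) \cdots (1 - \lambda_p L) \]
For the moment, just assume that the $\lambda_i$ are coefficients to be determined. Since $L$ is defined to operate as an algebraic quantity, determination of the $\lambda_i$ is the same as determination of the $\lambda_i$ such that the following two expressions are the same for all $z$:
\[ 1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p = (1 - \lambda_1 z)(1 - \lambda_2 z) \cdots (1 - \lambda_p z) \]
Multiply both sides by $z^{-p}$:
\[ z^{-p} - \phi_1 z^{1-p} - \phi_2 z^{2-p} - \cdots - \phi_{p-1} z^{-1} - \phi_p = (z^{-1} - \lambda_1)(z^{-1} - \lambda_2) \cdots (z^{-1} - \lambda_p) \]
and now define $\lambda = z^{-1}$ so we get
\[ \lambda^p - \phi_1 \lambda^{p-1} - \phi_2 \lambda^{p-2} - \cdots - \phi_{p-1} \lambda - \phi_p = (\lambda - \lambda_1)(\lambda - \lambda_2) \cdots (\lambda - \lambda_p) \]
The LHS is precisely the determinantal polynomial that gives the eigenvalues of $F$. Therefore, the $\lambda_i$ that are the coefficients of the factorization are simply the eigenvalues of the matrix $F$.

Now consider a different stationary process
\[ (1 - \phi L) y_t = \epsilon_t \]
Stationarity, as above, implies that $|\phi| < 1$. Multiply both sides by $1 + \phi L + \phi^2 L^2 + ... + \phi^j L^j$ to get
\[ \left(1 + \phi L + \phi^2 L^2 + ... + \phi^j L^j\right)(1 - \phi L) y_t = \left(1 + \phi L + \phi^2 L^2 + ... + \phi^j L^j\right) \epsilon_t \]
or, multiplying the polynomials on the LHS, we get
\[ \left(1 + \phi L + \phi^2 L^2 + ... + \phi^j L^j - \phi L - \phi^2 L^2 - ... - \phi^j L^j - \phi^{j+1} L^{j+1}\right) y_t = \left(1 + \phi L + \phi^2 L^2 + ... + \phi^j L^j\right) \epsilon_t \]
and with cancellation we have
\[ \left(1 - \phi^{j+1} L^{j+1}\right) y_t = \left(1 + \phi L + \phi^2 L^2 + ... + \phi^j L^j\right) \epsilon_t \]
so
\[ y_t = \phi^{j+1} L^{j+1} y_t + \left(1 + \phi L + \phi^2 L^2 + ... + \phi^j L^j\right) \epsilon_t \]
Now as $j \to \infty$, $\phi^{j+1} L^{j+1} y_t \to 0$, since $|\phi| < 1$, so
\[ y_t \cong \left(1 + \phi L + \phi^2 L^2 + ... + \phi^j L^j\right) \epsilon_t \]
and the approximation becomes better and better as $j$ increases. However, we started with
\[ (1 - \phi L) y_t = \epsilon_t \]
Substituting this into the above equation we have
\[ y_t \cong \left(1 + \phi L + \phi^2 L^2 + ... + \phi^j L^j\right)(1 - \phi L) y_t \]
so
\[ \left(1 + \phi L + \phi^2 L^2 + ... + \phi^j L^j\right)(1 - \phi L) \cong 1 \]
and the approximation becomes arbitrarily good as $j$ increases. Therefore, for $|\phi| < 1$, define
\[ (1 - \phi L)^{-1} = \sum_{j=0}^{\infty} \phi^j L^j \]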
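Recall that our mean-zero AR($p$) process
\[ y_t (1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p) = \epsilon_t \]
can be written using the factorization
\[ y_t (1 - \lambda_1 L)(1 - \lambda_2 L) \cdots (1 - \lambda_p L) = \epsilon_t \]
where the $\lambda_i$ are the eigenvalues of $F$, and given stationarity, all the $|\lambda_i| < 1$. Therefore, we can invert each first order polynomial on the LHS to get
\[ y_t = \left(\sum_{j=0}^{\infty} \lambda_1^j L^j\right) \left(\sum_{j=0}^{\infty} \lambda_2^j L^j\right) \cdots \left(\sum_{j=0}^{\infty} \lambda_p^j L^j\right) \epsilon_t \]
The RHS is a product of infinite-order polynomials in $L$, which can be represented as
\[ y_t = (1 + \psi_1 L + \psi_2 L^2 + \cdots) \epsilon_t \]
where the $\psi_i$ are real-valued and absolutely summable.
The $\psi_i$ are formed of products of powers of the $\lambda_i$, which are in turn functions of the $\phi_i$.
The $\psi_i$ are real-valued because any complex-valued $\lambda_i$ always occur in conjugate pairs. This means that if $a + bi$ is an eigenvalue of $F$, then so is $a - bi$. In multiplication
\[ (a + bi)(a - bi) = a^2 - abi + abi - b^2 i^2 = a^2 + b^2 \]
which is real-valued.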
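This shows that an AR($p$) process is representable as an infinite-order MA process.
Recall that, by recursive substitution, a mean-zero AR($p$) process (so that every term involving $C$ drops out) can be written as $Y_{t+j} = F^{j+1} Y_{t-1} + F^j E_t + F^{j-1} E_{t+1} + \cdots + F E_{t+j-1} + E_{t+j}$. Take this and lag it by $j$ periods to get
\[ Y_t = F^{j+1} Y_{t-j-1} + F^j E_{t-j} + F^{j-1} E_{t-j+1} + \cdots + F E_{t-1} + E_t \]
As $j \to \infty$, the lagged $Y$ on the RHS drops out. The $E_{t-s}$ are vectors of zeros except for their first element, so we see that the first equation here, in the limit, is just
\[ y_t = \sum_{j=0}^{\infty} \left(F^j\right)_{1,1} \epsilon_{t-j} \]
which makes explicit the relationship between the $\psi_i$ and the $\phi_i$ (and the $\lambda_i$ as well, recalling the previous factorization of $F^j$).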
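The moving-average coefficients $(F^j)_{1,1}$ can be computed by repeatedly multiplying the companion matrix. The following is a minimal pure-Python sketch with hypothetical helper names; a numerical library would normally be used for the matrix products, but plain lists keep the example self-contained.

```python
def companion(phi):
    """Companion matrix F of an AR(p) with coefficients phi_1..phi_p."""
    p = len(phi)
    F = [[0.0] * p for _ in range(p)]
    F[0] = [float(c) for c in phi]       # first row holds the AR coefficients
    for i in range(1, p):
        F[i][i - 1] = 1.0                # ones on the subdiagonal
    return F

def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def dynamic_multipliers(phi, horizon):
    """Return [(F^0)_{1,1}, (F^1)_{1,1}, ..., (F^horizon)_{1,1}]."""
    p = len(phi)
    F = companion(phi)
    power = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    out = []
    for _ in range(horizon + 1):
        out.append(power[0][0])          # (1,1) element of the current power
        power = matmul(power, F)
    return out
```

For an AR(1) with coefficient 0.5 the multipliers are the powers of 0.5, matching the geometric series in the inversion of $(1 - \phi L)$. For a stationary process the sequence decays toward zero as the horizon grows.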
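Assuming stationarity, $E(y_t) = \mu, \forall t$, so taking expectations of $y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \epsilon_t$ gives
\[ \mu = c + \phi_1 \mu + \phi_2 \mu + ... + \phi_p \mu \]
so
\[ \mu = \frac{c}{1 - \phi_1 - \phi_2 - ... - \phi_p} \]
and
\[ c = \mu - \phi_1 \mu - ... - \phi_p \mu \]
so
\[ y_t - \mu = \phi_1 (y_{t-1} - \mu) + \phi_2 (y_{t-2} - \mu) + ... + \phi_p (y_{t-p} - \mu) + \epsilon_t \]
With this, the second moments are easy to find. The variance is
\[ \gamma_0 = \phi_1 \gamma_1 + \phi_2 \gamma_2 + ... + \phi_p \gamma_p + \sigma^2 \]
as well, the autocovariances of orders $j \ge 1$ follow the rule
\[ \gamma_j = \phi_1 \gamma_{j-1} + \phi_2 \gamma_{j-2} + ... + \phi_p \gamma_{j-p} \]
Using the fact that $\gamma_{-j} = \gamma_j$, one can take the $p+1$ equations for $j = 0, 1, ..., p$, which, given $\sigma^2$ and the $\phi_i$, have $p+1$ unknowns ($\gamma_0, \gamma_1, ..., \gamma_p$), and solve for the unknowns. With these, the $\gamma_j$ for $j > p$ can be solved for recursively.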
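For $p = 2$ the system of autocovariance equations can be solved by hand, which gives a compact check of the recursion. The sketch below (hypothetical function name, stationarity assumed rather than checked) solves the $j = 1$ and $j = 2$ equations for the autocorrelations, uses the $j = 0$ equation for the variance, and then applies the recursion for higher lags.

```python
def ar2_autocovariances(phi1, phi2, sigma2, max_lag):
    """Autocovariances gamma_0..gamma_max_lag of a stationary AR(2)."""
    # j = 1: gamma_1 = phi1*gamma_0 + phi2*gamma_1  =>  rho_1 = phi1 / (1 - phi2)
    rho1 = phi1 / (1.0 - phi2)
    # j = 2: gamma_2 = phi1*gamma_1 + phi2*gamma_0  =>  rho_2 = phi1*rho_1 + phi2
    rho2 = phi1 * rho1 + phi2
    # j = 0: gamma_0 = phi1*gamma_1 + phi2*gamma_2 + sigma2
    gamma0 = sigma2 / (1.0 - phi1 * rho1 - phi2 * rho2)
    gammas = [gamma0, rho1 * gamma0, rho2 * gamma0]
    # Higher orders follow the same difference equation as the process itself
    for j in range(3, max_lag + 1):
        gammas.append(phi1 * gammas[j - 1] + phi2 * gammas[j - 2])
    return gammas[: max_lag + 1]
```

Setting `phi2 = 0` collapses this to the AR(1) case, where the variance is $\sigma^2 / (1 - \phi_1^2)$ and each autocovariance is $\phi_1$ times the previous one.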
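An MA($q$) process can be written as
\[ y_t - \mu = (1 + \theta_1 L + ... + \theta_q L^q) \epsilon_t \]
As before, the polynomial on the RHS can be factored as
\[ (1 + \theta_1 L + ... + \theta_q L^q) = (1 - \eta_1 L)(1 - \eta_2 L) \cdots (1 - \eta_q L) \]
and each of the $(1 - \eta_i L)$ can be inverted as long as $|\eta_i| < 1$. If this is the case, then we can write
\[ (1 + \theta_1 L + ... + \theta_q L^q)^{-1} (y_t - \mu) = \epsilon_t \]
where $(1 + \theta_1 L + ... + \theta_q L^q)^{-1}$ will be an infinite-order polynomial in $L$, so we get
\[ \sum_{j=0}^{\infty} \delta_j L^j (y_t - \mu) = \epsilon_t \]
with $\delta_0 = 1$, or
\[ y_t = c - \delta_1 y_{t-1} - \delta_2 y_{t-2} - ... + \epsilon_t \]
where
\[ c = \mu + \delta_1 \mu + \delta_2 \mu + ... \]
So we see that an MA($q$) has an infinite AR representation, as long as the $|\eta_i| < 1$, $i = 1, 2, ..., q$.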
It turns out that one can always manipulate the parameters of an MA($q$) process to find an invertible representation. For example, the two MA(1) processes
\[ y_t - \mu = (1 - \theta L) \epsilon_t \]
and
\[ y_t^* - \mu = (1 - \theta^{-1} L) \epsilon_t^* \]
have exactly the same moments if
\[ \sigma_{\epsilon^*}^2 = \sigma_{\epsilon}^2 \theta^2 \]
For example, we've seen that
\[ \gamma_0 = \sigma^2 (1 + \theta^2). \]
Given the above relationships amongst the parameters,
\[ \gamma_0^* = \sigma_{\epsilon}^2 \theta^2 (1 + \theta^{-2}) = \sigma^2 (1 + \theta^2) \]
so the variances are the same. It turns out that all the autocovariances will be the same, as is easily checked. This means that the two MA processes are observationally equivalent.
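The equivalence of the two MA(1) representations is quick to verify numerically. The sketch below uses hypothetical function names and follows the $(1 - \theta L)$ sign convention of the text, so the first autocovariance is $-\theta\sigma^2$.

```python
def ma1_moments(theta, sigma2):
    """(gamma_0, gamma_1) of y_t - mu = (1 - theta L) eps_t."""
    gamma0 = sigma2 * (1.0 + theta ** 2)
    gamma1 = -theta * sigma2
    return gamma0, gamma1

def twin_representation(theta, sigma2):
    """Parameters of the observationally equivalent MA(1):
    coefficient 1/theta and shock variance sigma2 * theta^2."""
    return 1.0 / theta, sigma2 * theta ** 2

# A non-invertible process (|theta| > 1) and its invertible twin
theta, sigma2 = 2.0, 1.0
theta_star, sigma2_star = twin_representation(theta, sigma2)
original = ma1_moments(theta, sigma2)
twin = ma1_moments(theta_star, sigma2_star)
```

Here `original` and `twin` hold the same variance and first autocovariance, and all higher autocovariances are zero for both processes, so second moments alone cannot tell the two parameterizations apart.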
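For a given MA($q$) process, it's always possible to manipulate the parameters to find an invertible representation (which is unique).
It's important to find an invertible representation, since it's the only representation that allows one to represent $\epsilon_t$ as a function of past $y$'s. The other representations express $\epsilon_t$ as a function of future $y$'s.
Why is invertibility important? It provides a justification for the use of parsimonious models. Since an AR(1) process has an MA($\infty$) representation, one can reverse the argument and note that at least some MA($\infty$) processes have an AR(1) representation. At the time of estimation, it's a lot easier to estimate the single AR(1) coefficient rather than the infinite number of coefficients associated with the MA representation.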
This is the reason that ARMA models are popular. Combining low-order AR and MA models can usually offer a satisfactory representation of univariate time series data with a reasonable number of parameters.
Exercise 61 Calculate the autocovariances of an ARMA(1,1) model: $(1 + \phi L) y_t = c + (1 + \theta L) \epsilon_t$
Index
asymptotic equality, 317
Chain rule, 314
Cobb-Douglas model, 22
convergence, almost sure, 315
convergence, in distribution, 315
convergence, in probability, 315
Convergence, ordinary, 314
convergence, pointwise, 314
convergence, uniform, 314
convergence, uniform almost sure, 316
cross section, 19
estimator, linear, 26, 33
estimator, OLS, 23
extremum estimator, 167
fitted values, 23
leverage, 27
likelihood function, 41
matrix, idempotent, 26
matrix, projection, 25
matrix, symmetric, 26
observations, influential, 26
outliers, 26
own influence, 27
parameter space, 41
Product rule, 313
R-squared, uncentered, 28
R-squared, centered, 29
residuals, 23